[net,v3] net: sched: fix packet stuck problem for lockless qdisc

Message ID 1616641991-14847-1-git-send-email-linyunsheng@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net,v3] net: sched: fix packet stuck problem for lockless qdisc

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success
netdev/cc_maintainers success CCed 15 of 15 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 4509 this patch: 4509
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 85 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 4749 this patch: 4749
netdev/header_inline success

Commit Message

Yunsheng Lin March 25, 2021, 3:13 a.m. UTC
The lockless qdisc has the following concurrency problem:
    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueues an skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(); and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
enqueue the skb after cpu0 has called dequeue_skb() and before
cpu0 calls qdisc_run_end().

The lockless qdisc has another concurrency problem when
tx_action is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .

This patch fixes the above data races by:
1. Reading the flag before doing spin_trylock().
2. If the first spin_trylock() returns false and the flag was not
   set before that spin_trylock(), setting the flag and retrying
   spin_trylock() once, in case the other CPU releases the lock
   without seeing the new flag.
3. Rescheduling if the flag is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, the flag is also set when cpu1 is at the
end of qdisc_run_end(), so tx_action will be rescheduled
again to dequeue the skb enqueued by cpu2.

The flag is only cleared, followed by a dequeue retry, when a
dequeue returns NULL, in order to reduce the overhead of the
double spin_trylock() and the __netif_schedule() call described
above.
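
A rough sketch of how steps 1-3 above map onto qdisc_run_begin()/
qdisc_run_end() for the TCQ_F_NOLOCK case (illustrative and simplified,
e.g. the qdisc->running seqcount handling is abbreviated; the actual
hunks are in the diff and may differ in detail):

static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
	if (qdisc->flags & TCQ_F_NOLOCK) {
		/* step 1: sample the flag before the first trylock */
		bool dont_retry = test_bit(__QDISC_STATE_NEED_RESCHEDULE,
					   &qdisc->state);

		if (spin_trylock(&qdisc->seqlock))
			goto nolock_empty;

		/* step 2: if the flag was already set before the failed
		 * trylock, the lock holder is guaranteed to see it, so
		 * give up; otherwise set the flag and retry once, in
		 * case the holder released the lock before the new flag
		 * became visible to it.
		 */
		if (dont_retry)
			return false;

		set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);
		smp_mb__after_atomic();

		if (!spin_trylock(&qdisc->seqlock))
			return false;

nolock_empty:
		WRITE_ONCE(qdisc->empty, false);
	} else if (qdisc_is_running(qdisc)) {
		return false;
	}

	write_seqcount_begin(&qdisc->running);
	return true;
}

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	write_seqcount_end(&qdisc->running);

	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		/* step 3: someone failed qdisc_run_begin() while we held
		 * the lock, so reschedule to make sure their skb is not
		 * left stuck; the flag itself is only cleared in
		 * pfifo_fast_dequeue() when the queue turns out empty.
		 */
		if (test_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state))
			__netif_schedule(qdisc);
	}
}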

The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:

 threads  without+this_patch   with+this_patch      delta
    1        2.61Mpps            2.60Mpps           -0.3%
    2        3.97Mpps            3.82Mpps           -3.7%
    4        5.62Mpps            5.59Mpps           -0.5%
    8        2.78Mpps            2.77Mpps           -0.3%
   16        2.22Mpps            2.22Mpps           -0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V3: fix a compile error and a few comment typos, remove the
    __QDISC_STATE_DEACTIVATED check, and update the
    performance data.
V2: Avoid the overhead of fixing the data race as much as
    possible.
---
 include/net/sch_generic.h | 38 +++++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 12 ++++++++++++
 2 files changed, 49 insertions(+), 1 deletion(-)

Comments

Jürgen Groß April 9, 2021, 5:31 a.m. UTC | #1
On 25.03.21 04:13, Yunsheng Lin wrote:
> Lockless qdisc has below concurrent problem:
>      cpu0                 cpu1
>       .                     .
> q->enqueue                 .
>       .                     .
> qdisc_run_begin()          .
>       .                     .
> dequeue_skb()              .
>       .                     .
> sch_direct_xmit()          .
>       .                     .
>       .                q->enqueue
>       .             qdisc_run_begin()
>       .            return and do nothing
>       .                     .
> qdisc_run_end()            .
> 
> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() return false
> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
> enqueue the skb after cpu0 calling dequeue_skb() and before
> cpu0 calling qdisc_run_end().
> 
> Lockless qdisc has below another concurrent problem when
> tx_action is involved:
> 
> cpu0(serving tx_action)     cpu1             cpu2
>            .                   .                .
>            .              q->enqueue            .
>            .            qdisc_run_begin()       .
>            .              dequeue_skb()         .
>            .                   .            q->enqueue
>            .                   .                .
>            .             sch_direct_xmit()      .
>            .                   .         qdisc_run_begin()
>            .                   .       return and do nothing
>            .                   .                .
>   clear __QDISC_STATE_SCHED    .                .
>   qdisc_run_begin()            .                .
>   return and do nothing        .                .
>            .                   .                .
>            .            qdisc_run_end()         .
> 
> This patch fixes the above data race by:
> 1. Get the flag before doing spin_trylock().
> 2. If the first spin_trylock() return false and the flag is not
>     set before the first spin_trylock(), Set the flag and retry
>     another spin_trylock() in case other CPU may not see the new
>     flag after it releases the lock.
> 3. reschedule if the flags is set after the lock is released
>     at the end of qdisc_run_end().
> 
> For tx_action case, the flags is also set when cpu1 is at the
> end if qdisc_run_end(), so tx_action will be rescheduled
> again to dequeue the skb enqueued by cpu2.
> 
> Only clear the flag before retrying a dequeuing when dequeuing
> returns NULL in order to reduce the overhead of the above double
> spin_trylock() and __netif_schedule() calling.
> 
> The performance impact of this patch, tested using pktgen and
> dummy netdev with pfifo_fast qdisc attached:
> 
>   threads  without+this_patch   with+this_patch      delta
>      1        2.61Mpps            2.60Mpps           -0.3%
>      2        3.97Mpps            3.82Mpps           -3.7%
>      4        5.62Mpps            5.59Mpps           -0.5%
>      8        2.78Mpps            2.77Mpps           -0.3%
>     16        2.22Mpps            2.22Mpps           -0.0%
> 
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>

I have a setup which is able to reproduce the issue quite reliably:

In a Xen guest I'm mounting 8 NFS shares and run sysbench fileio on
each of them. The average latency reported by sysbench is well below
1 msec, but at least once per hour I get latencies in the minute
range.

With this patch I don't see these high latencies any longer (test
is running for more than 20 hours now).

So you can add my:

Tested-by: Juergen Gross <jgross@suse.com>


Juergen
Yunsheng Lin April 12, 2021, 1:04 a.m. UTC | #2
On 2021/4/9 13:31, Juergen Gross wrote:
> On 25.03.21 04:13, Yunsheng Lin wrote:
>> Lockless qdisc has below concurrent problem:
>>      cpu0                 cpu1
>>       .                     .
>> q->enqueue                 .
>>       .                     .
>> qdisc_run_begin()          .
>>       .                     .
>> dequeue_skb()              .
>>       .                     .
>> sch_direct_xmit()          .
>>       .                     .
>>       .                q->enqueue
>>       .             qdisc_run_begin()
>>       .            return and do nothing
>>       .                     .
>> qdisc_run_end()            .
>>
>> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
>> has not released the lock yet and spin_trylock() return false
>> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
>> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
>> enqueue the skb after cpu0 calling dequeue_skb() and before
>> cpu0 calling qdisc_run_end().
>>
>> Lockless qdisc has below another concurrent problem when
>> tx_action is involved:
>>
>> cpu0(serving tx_action)     cpu1             cpu2
>>            .                   .                .
>>            .              q->enqueue            .
>>            .            qdisc_run_begin()       .
>>            .              dequeue_skb()         .
>>            .                   .            q->enqueue
>>            .                   .                .
>>            .             sch_direct_xmit()      .
>>            .                   .         qdisc_run_begin()
>>            .                   .       return and do nothing
>>            .                   .                .
>>   clear __QDISC_STATE_SCHED    .                .
>>   qdisc_run_begin()            .                .
>>   return and do nothing        .                .
>>            .                   .                .
>>            .            qdisc_run_end()         .
>>
>> This patch fixes the above data race by:
>> 1. Get the flag before doing spin_trylock().
>> 2. If the first spin_trylock() return false and the flag is not
>>     set before the first spin_trylock(), Set the flag and retry
>>     another spin_trylock() in case other CPU may not see the new
>>     flag after it releases the lock.
>> 3. reschedule if the flags is set after the lock is released
>>     at the end of qdisc_run_end().
>>
>> For tx_action case, the flags is also set when cpu1 is at the
>> end if qdisc_run_end(), so tx_action will be rescheduled
>> again to dequeue the skb enqueued by cpu2.
>>
>> Only clear the flag before retrying a dequeuing when dequeuing
>> returns NULL in order to reduce the overhead of the above double
>> spin_trylock() and __netif_schedule() calling.
>>
>> The performance impact of this patch, tested using pktgen and
>> dummy netdev with pfifo_fast qdisc attached:
>>
>>   threads  without+this_patch   with+this_patch      delta
>>      1        2.61Mpps            2.60Mpps           -0.3%
>>      2        3.97Mpps            3.82Mpps           -3.7%
>>      4        5.62Mpps            5.59Mpps           -0.5%
>>      8        2.78Mpps            2.77Mpps           -0.3%
>>     16        2.22Mpps            2.22Mpps           -0.0%
>>
>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> 
> I have a setup which is able to reproduce the issue quite reliably:
> 
> In a Xen guest I'm mounting 8 NFS shares and run sysbench fileio on
> each of them. The average latency reported by sysbench is well below
> 1 msec, but at least once per hour I get latencies in the minute
> range.
> 
> With this patch I don't see these high latencies any longer (test
> is running for more than 20 hours now).
> 
> So you can add my:
> 
> Tested-by: Juergen Gross <jgross@suse.com>

Hi, Juergen

Thanks for the testing.

With the simulated test case suggested by Michal, I still have some
potential issues to debug; hopefully I will send out a new version
this week.

Also, is it possible to run your testcase for longer? I think "72 hours"
would be enough to verify that it fixes the problem completely :)



> 
> 
> Juergen
Yunsheng Lin April 12, 2021, 1:24 a.m. UTC | #3
On 2021/4/9 17:09, Hillf Danton wrote:
> On Fri, 9 Apr 2021 07:31:03  Juergen Gross wrote:
>> On 25.03.21 04:13, Yunsheng Lin wrote:
>> I have a setup which is able to reproduce the issue quite reliably:
>>
>> In a Xen guest I'm mounting 8 NFS shares and run sysbench fileio on
>> each of them. The average latency reported by sysbench is well below
>> 1 msec, but at least once per hour I get latencies in the minute
>> range.
>>
>> With this patch I don't see these high latencies any longer (test
>> is running for more than 20 hours now).
>>
>> So you can add my:
>>
>> Tested-by: Juergen Gross <jgross@suse.com>
>>
> 
> If retry is allowed in the dequeue method then a simple seqcount can do the
> work of serializing enqueuer and dequeuer. IIUC it was not attempted last year.

At first glance, I do not think the patch below fixes the data race
described in the commit log, as it does not handle the time window
between dequeuing and releasing q->seqlock, as below:

cpu1 may not see qdisc->pad changed after pfifo_fast_dequeue(), and
cpu2 is not able to take q->seqlock yet because cpu1 has not released
q->seqlock.

> 
> --- x/net/sched/sch_generic.c
> +++ y/net/sched/sch_generic.c
> @@ -632,6 +632,9 @@ static int pfifo_fast_enqueue(struct sk_
>  			return qdisc_drop(skb, qdisc, to_free);
>  	}
>  
> +	qdisc->pad++;
> +	smp_wmb();
> +
>  	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>  	return NET_XMIT_SUCCESS;
>  }
> @@ -641,6 +644,11 @@ static struct sk_buff *pfifo_fast_dequeu
>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>  	struct sk_buff *skb = NULL;
>  	int band;
> +	int seq;
> +
> +again:
> +	seq = READ_ONCE(qdisc->pad);
> +	smp_rmb();
>  
>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>  		struct skb_array *q = band2list(priv, band);
> @@ -652,10 +660,15 @@ static struct sk_buff *pfifo_fast_dequeu
>  	}
>  	if (likely(skb)) {
>  		qdisc_update_stats_at_dequeue(qdisc, skb);
> -	} else {
> -		WRITE_ONCE(qdisc->empty, true);
> +		return skb;
>  	}
>  
> +	smp_rmb();
> +	if (seq != READ_ONCE(qdisc->pad))
> +		goto again;
> +
> +	WRITE_ONCE(qdisc->empty, true);
> +
>  	return skb;
>  }
>  
> 
> .
>
Yunsheng Lin April 12, 2021, 3:37 a.m. UTC | #4
On 2021/4/12 11:21, Hillf Danton wrote:
> On Mon, 12 Apr 2021 09:24:30  Yunsheng Lin wrote:
>> On 2021/4/9 17:09, Hillf Danton wrote:
>>> On Fri, 9 Apr 2021 07:31:03  Juergen Gross wrote:
>>>> On 25.03.21 04:13, Yunsheng Lin wrote:
>>>> I have a setup which is able to reproduce the issue quite reliably:
>>>>
>>>> In a Xen guest I'm mounting 8 NFS shares and run sysbench fileio on
>>>> each of them. The average latency reported by sysbench is well below
>>>> 1 msec, but at least once per hour I get latencies in the minute
>>>> range.
>>>>
>>>> With this patch I don't see these high latencies any longer (test
>>>> is running for more than 20 hours now).
>>>>
>>>> So you can add my:
>>>>
>>>> Tested-by: Juergen Gross <jgross@suse.com>
>>>>
>>>
>>> If retry is allowed in the dequeue method then a simple seqcount can do the
>>> work of serializing enqueuer and dequeuer. IIUC it was not attempted last year.
>>
>> At the first glance, I do not think the below patch fix the data race
> 
> Thanks for taking a look.
> 
>> described in the commit log, as it does not handle the time window
>> between dequeuing and q->seqlock releasing, as below:
>>
> Yes the time window does exist.
> 
>> The cpu1 may not see the qdisc->pad changed after pfifo_fast_dequeue(),
>> and cpu2 is not able to take the q->seqlock yet because cpu1 do not
>> release the q->seqlock.
>>
> It's now covered by extending the seqcount aperture a bit.
> 
> --- x/net/sched/sch_generic.c
> +++ y/net/sched/sch_generic.c
> @@ -380,14 +380,23 @@ void __qdisc_run(struct Qdisc *q)
>  {
>  	int quota = dev_tx_weight;
>  	int packets;
> +	int seq;
> +
> +again:
> +	seq = READ_ONCE(q->pad);
> +	smp_rmb();
>  
>  	while (qdisc_restart(q, &packets)) {
>  		quota -= packets;
>  		if (quota <= 0) {
>  			__netif_schedule(q);
> -			break;
> +			return;
>  		}
>  	}
> +
> +	smp_rmb();
> +	if (seq != READ_ONCE(q->pad))
> +		goto again;

As I understand it, there is still a time window between the q->pad
check above and the release of q->seqlock in qdisc_run_end().

>  }
>  
>  unsigned long dev_trans_start(struct net_device *dev)
> @@ -632,6 +641,9 @@ static int pfifo_fast_enqueue(struct sk_
>  			return qdisc_drop(skb, qdisc, to_free);
>  	}
>  
> +	qdisc->pad++;
> +	smp_wmb();
> +
>  	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>  	return NET_XMIT_SUCCESS;
>  }
> 
> .
>
Jürgen Groß April 12, 2021, 8:21 a.m. UTC | #5
On 12.04.21 03:04, Yunsheng Lin wrote:
> On 2021/4/9 13:31, Juergen Gross wrote:
>> On 25.03.21 04:13, Yunsheng Lin wrote:
>>> Lockless qdisc has below concurrent problem:
>>>       cpu0                 cpu1
>>>        .                     .
>>> q->enqueue                 .
>>>        .                     .
>>> qdisc_run_begin()          .
>>>        .                     .
>>> dequeue_skb()              .
>>>        .                     .
>>> sch_direct_xmit()          .
>>>        .                     .
>>>        .                q->enqueue
>>>        .             qdisc_run_begin()
>>>        .            return and do nothing
>>>        .                     .
>>> qdisc_run_end()            .
>>>
>>> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
>>> has not released the lock yet and spin_trylock() return false
>>> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
>>> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
>>> enqueue the skb after cpu0 calling dequeue_skb() and before
>>> cpu0 calling qdisc_run_end().
>>>
>>> Lockless qdisc has below another concurrent problem when
>>> tx_action is involved:
>>>
>>> cpu0(serving tx_action)     cpu1             cpu2
>>>             .                   .                .
>>>             .              q->enqueue            .
>>>             .            qdisc_run_begin()       .
>>>             .              dequeue_skb()         .
>>>             .                   .            q->enqueue
>>>             .                   .                .
>>>             .             sch_direct_xmit()      .
>>>             .                   .         qdisc_run_begin()
>>>             .                   .       return and do nothing
>>>             .                   .                .
>>>    clear __QDISC_STATE_SCHED    .                .
>>>    qdisc_run_begin()            .                .
>>>    return and do nothing        .                .
>>>             .                   .                .
>>>             .            qdisc_run_end()         .
>>>
>>> This patch fixes the above data race by:
>>> 1. Get the flag before doing spin_trylock().
>>> 2. If the first spin_trylock() return false and the flag is not
>>>      set before the first spin_trylock(), Set the flag and retry
>>>      another spin_trylock() in case other CPU may not see the new
>>>      flag after it releases the lock.
>>> 3. reschedule if the flags is set after the lock is released
>>>      at the end of qdisc_run_end().
>>>
>>> For tx_action case, the flags is also set when cpu1 is at the
>>> end if qdisc_run_end(), so tx_action will be rescheduled
>>> again to dequeue the skb enqueued by cpu2.
>>>
>>> Only clear the flag before retrying a dequeuing when dequeuing
>>> returns NULL in order to reduce the overhead of the above double
>>> spin_trylock() and __netif_schedule() calling.
>>>
>>> The performance impact of this patch, tested using pktgen and
>>> dummy netdev with pfifo_fast qdisc attached:
>>>
>>>    threads  without+this_patch   with+this_patch      delta
>>>       1        2.61Mpps            2.60Mpps           -0.3%
>>>       2        3.97Mpps            3.82Mpps           -3.7%
>>>       4        5.62Mpps            5.59Mpps           -0.5%
>>>       8        2.78Mpps            2.77Mpps           -0.3%
>>>      16        2.22Mpps            2.22Mpps           -0.0%
>>>
>>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>
>> I have a setup which is able to reproduce the issue quite reliably:
>>
>> In a Xen guest I'm mounting 8 NFS shares and run sysbench fileio on
>> each of them. The average latency reported by sysbench is well below
>> 1 msec, but at least once per hour I get latencies in the minute
>> range.
>>
>> With this patch I don't see these high latencies any longer (test
>> is running for more than 20 hours now).
>>
>> So you can add my:
>>
>> Tested-by: Juergen Gross <jgross@suse.com>
> 
> Hi, Juergen
> 
> Thanks for the testing.
> 
> With the simulated test case suggested by Michal, I still has some
> potential issue to debug, hopefully will send out new version in
> this week.
> 
> Also, is it possible to run your testcase any longer? I think "72 hours"
> would be enough to testify that it fixes the problem completely:)

This should be possible, yes.

I'm using the setup to catch another bug which is showing up every few
days. I don't see a reason why I shouldn't be able to add your patch
and verify it in parallel.


Juergen
Yunsheng Lin April 12, 2021, noon UTC | #6
On 2021/4/12 15:28, Hillf Danton wrote:
> On Mon, 12 Apr 2021 11:37:24 Yunsheng Lin wrote:
>> On 2021/4/12 11:21, Hillf Danton wrote:
>>> On Mon, 12 Apr 2021 09:24:30  Yunsheng Lin wrote:
>>>> On 2021/4/9 17:09, Hillf Danton wrote:
>>>>> On Fri, 9 Apr 2021 07:31:03  Juergen Gross wrote:
>>>>>> On 25.03.21 04:13, Yunsheng Lin wrote:
>>>>>> I have a setup which is able to reproduce the issue quite reliably:
>>>>>>
>>>>>> In a Xen guest I'm mounting 8 NFS shares and run sysbench fileio on
>>>>>> each of them. The average latency reported by sysbench is well below
>>>>>> 1 msec, but at least once per hour I get latencies in the minute
>>>>>> range.
>>>>>>
>>>>>> With this patch I don't see these high latencies any longer (test
>>>>>> is running for more than 20 hours now).
>>>>>>
>>>>>> So you can add my:
>>>>>>
>>>>>> Tested-by: Juergen Gross <jgross@suse.com>
>>>>>>
>>>>>
>>>>> If retry is allowed in the dequeue method then a simple seqcount can do the
>>>>> work of serializing enqueuer and dequeuer. IIUC it was not attempted last year.
>>>>
>>>> At the first glance, I do not think the below patch fix the data race
>>>
>>> Thanks for taking a look.
>>>
>>>> described in the commit log, as it does not handle the time window
>>>> between dequeuing and q->seqlock releasing, as below:
>>>>
>>> Yes the time window does exist.
>>>
>>>> The cpu1 may not see the qdisc->pad changed after pfifo_fast_dequeue(),
>>>> and cpu2 is not able to take the q->seqlock yet because cpu1 do not
>>>> release the q->seqlock.
>>>>
>>> It's now covered by extending the seqcount aperture a bit.
>>>
>>> --- x/net/sched/sch_generic.c
>>> +++ y/net/sched/sch_generic.c
>>> @@ -380,14 +380,23 @@ void __qdisc_run(struct Qdisc *q)
>>>  {
>>>  	int quota = dev_tx_weight;
>>>  	int packets;
>>> +	int seq;
>>> +
>>> +again:
>>> +	seq = READ_ONCE(q->pad);
>>> +	smp_rmb();
>>>  
>>>  	while (qdisc_restart(q, &packets)) {
>>>  		quota -= packets;
>>>  		if (quota <= 0) {
>>>  			__netif_schedule(q);
>>> -			break;
>>> +			return;
>>>  		}
>>>  	}
>>> +
>>> +	smp_rmb();
>>> +	if (seq != READ_ONCE(q->pad))
>>> +		goto again;
>>
>> As my understanding, there is still time window between q->pad checking
>> above and q->seqlock releasing in qdisc_run_end().
>>
> Then extend the cover across q->seqlock on top of the flag you added.

Yes, the patch below seems to fix the data race described in
the commit log.
Then what is the difference between my patch and your patch below? :)

> 
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>  enum qdisc_state_t {
>  	__QDISC_STATE_SCHED,
>  	__QDISC_STATE_DEACTIVATED,
> +	__QDISC_STATE_NEED_RESCHEDULE,
>  };
>  
>  struct qdisc_size_table {
> @@ -176,8 +177,13 @@ static inline bool qdisc_run_begin(struc
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>  	write_seqcount_end(&qdisc->running);
> -	if (qdisc->flags & TCQ_F_NOLOCK)
> +	if (qdisc->flags & TCQ_F_NOLOCK) {
>  		spin_unlock(&qdisc->seqlock);
> +
> +		if (test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +							&qdisc->state))
> +			__netif_schedule(qdisc);
> +	}
>  }
>  
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -381,13 +381,21 @@ void __qdisc_run(struct Qdisc *q)
>  	int quota = dev_tx_weight;
>  	int packets;
>  
> +	if (q->flags & TCQ_F_NOLOCK)
> +		clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
> +again:
>  	while (qdisc_restart(q, &packets)) {
>  		quota -= packets;
>  		if (quota <= 0) {
>  			__netif_schedule(q);
> -			break;
> +			return;
>  		}
>  	}
> +
> +	if (q->flags & TCQ_F_NOLOCK)
> +		if (test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +					&q->state))
> +			goto again;
>  }
>  
>  unsigned long dev_trans_start(struct net_device *dev)
> @@ -632,6 +640,9 @@ static int pfifo_fast_enqueue(struct sk_
>  			return qdisc_drop(skb, qdisc, to_free);
>  	}
>  
> +	if (qdisc->flags & TCQ_F_NOLOCK)
> +		set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

Doing set_bit() in pfifo_fast_enqueue() unconditionally does not
seem to be performance friendly, because it requires exclusive access
to the cache line holding qdisc->state.
Perhaps run some performance tests?


> +
>  	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>  	return NET_XMIT_SUCCESS;
>  }
> 
> .
>
Yunsheng Lin April 13, 2021, 2:56 a.m. UTC | #7
On 2021/4/13 10:21, Hillf Danton wrote:
> On Mon, 12 Apr 2021 20:00:43  Yunsheng Lin wrote:
>>
>> Yes, the below patch seems to fix the data race described in
>> the commit log.
>> Then what is the difference between my patch and your patch below:)
> 
> Hehe, this is one of the tough questions over a bounch of weeks.
> 
> If a seqcount can detect the race between skb enqueue and dequeue then we
> cant see any excuse for not rolling back to the point without NOLOCK.

I am not sure I understand what you meant above.

As I understand it, the patch below is essentially the same as
your previous patch; the only difference I see is that it uses
qdisc->pad instead of __QDISC_STATE_NEED_RESCHEDULE.

So instead of proposing another patch, it would be better if you
commented on my patch and made improvements on top of it.

> 
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -632,6 +632,7 @@ static int pfifo_fast_enqueue(struct sk_
>  			return qdisc_drop(skb, qdisc, to_free);
>  	}
>  
> +	qdisc->pad++;

As has been mentioned:
doing an unconditional update in pfifo_fast_enqueue() does not
seem to be performance friendly, which is something my patch
tries to avoid as much as possible.

>  	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>  	return NET_XMIT_SUCCESS;
>  }
> @@ -642,6 +643,7 @@ static struct sk_buff *pfifo_fast_dequeu
>  	struct sk_buff *skb = NULL;
>  	int band;
>  
> +	qdisc->pad = 0;
>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>  		struct skb_array *q = band2list(priv, band);
>  
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -176,8 +176,12 @@ static inline bool qdisc_run_begin(struc
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>  	write_seqcount_end(&qdisc->running);
> -	if (qdisc->flags & TCQ_F_NOLOCK)
> +	if (qdisc->flags & TCQ_F_NOLOCK) {
>  		spin_unlock(&qdisc->seqlock);
> +
> +		if (qdisc->pad != 0)
> +			__netif_schedule(qdisc);
> +	}
>  }
>  
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> 
> .
>
Yunsheng Lin April 13, 2021, 3:34 a.m. UTC | #8
On 2021/4/13 11:26, Hillf Danton wrote:
> On Tue, 13 Apr 2021 10:56:42 Yunsheng Lin wrote:
>> On 2021/4/13 10:21, Hillf Danton wrote:
>>> On Mon, 12 Apr 2021 20:00:43  Yunsheng Lin wrote:
>>>>
>>>> Yes, the below patch seems to fix the data race described in
>>>> the commit log.
>>>> Then what is the difference between my patch and your patch below:)
>>>
>>> Hehe, this is one of the tough questions over a bounch of weeks.
>>>
>>> If a seqcount can detect the race between skb enqueue and dequeue then we
>>> cant see any excuse for not rolling back to the point without NOLOCK.
>>
>> I am not sure I understood what you meant above.
>>
>> As my understanding, the below patch is essentially the same as
>> your previous patch, the only difference I see is it uses qdisc->pad
>> instead of __QDISC_STATE_NEED_RESCHEDULE.
>>
>> So instead of proposing another patch, it would be better if you
>> comment on my patch, and make improvement upon that.
>>
> Happy to do that after you show how it helps revert NOLOCK.

Actually I am not going to revert NOLOCK, but to add optimizations
to it if this patch fixes the packet stuck problem.

Is there any reason why you want to revert it?

> 
> .
>
Yunsheng Lin April 13, 2021, 7:57 a.m. UTC | #9
On 2021/4/13 15:12, Hillf Danton wrote:
> On Tue, 13 Apr 2021 11:34:27 Yunsheng Lin wrote:
>> On 2021/4/13 11:26, Hillf Danton wrote:
>>> On Tue, 13 Apr 2021 10:56:42 Yunsheng Lin wrote:
>>>> On 2021/4/13 10:21, Hillf Danton wrote:
>>>>> On Mon, 12 Apr 2021 20:00:43  Yunsheng Lin wrote:
>>>>>>
>>>>>> Yes, the below patch seems to fix the data race described in
>>>>>> the commit log.
>>>>>> Then what is the difference between my patch and your patch below:)
>>>>>
>>>>> Hehe, this is one of the tough questions over a bounch of weeks.
>>>>>
>>>>> If a seqcount can detect the race between skb enqueue and dequeue then we
>>>>> cant see any excuse for not rolling back to the point without NOLOCK.
>>>>
>>>> I am not sure I understood what you meant above.
>>>>
>>>> As my understanding, the below patch is essentially the same as
>>>> your previous patch, the only difference I see is it uses qdisc->pad
>>>> instead of __QDISC_STATE_NEED_RESCHEDULE.
>>>>
>>>> So instead of proposing another patch, it would be better if you
>>>> comment on my patch, and make improvement upon that.
>>>>
>>> Happy to do that after you show how it helps revert NOLOCK.
>>
>> Actually I am not going to revert NOLOCK, but add optimization
>> to it if the patch fixes the packet stuck problem.
>>
> Fix is not optimization, right?

For this patch, it is a fix.
In case you missed it, I do have a couple of ideas to optimize the
lockless qdisc:

1. An RFC patch to add a lockless qdisc bypass optimization:

https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/

2. Implement lockless enqueuing for the lockless qdisc using the idea
   from Jason and Toke. It shows a noticeable performance increase with
   1-4 threads running, using the prototype below based on ptr_ring.

static inline int __ptr_ring_multi_produce(struct ptr_ring *r, void *ptr)
{

        int producer, next_producer;


        do {
                producer = READ_ONCE(r->producer);
                if (unlikely(!r->size) || r->queue[producer])
                        return -ENOSPC;
                next_producer = producer + 1;
                if (unlikely(next_producer >= r->size))
                        next_producer = 0;
        } while(cmpxchg_relaxed(&r->producer, producer, next_producer) != producer);

        /* Make sure the pointer we are storing points to a valid data. */
        /* Pairs with the dependency ordering in __ptr_ring_consume. */
        smp_wmb();

        WRITE_ONCE(r->queue[producer], ptr);
        return 0;
}

3. Maybe it is possible to remove the netif_tx_lock for the lockless
   qdisc too, because dev_hard_start_xmit is also under the protection
   of qdisc_run_begin()/qdisc_run_end() (if there is only one qdisc
   using a netdev queue, which is true for pfifo_fast, I believe).

4. Remove the qdisc->running seqcount operation for the lockless qdisc,
   which is mainly used to do heuristic locking on q->busylock for the
   locked qdisc.

> 
>> Is there any reason why you want to revert it?
>>
> I think you know Jiri's plan and it would be nice to wait a couple of
> months for it to complete.

I am not aware of Jiri's plan.
Is there a link describing it?

> 
> .
>
Yunsheng Lin April 13, 2021, 9:03 a.m. UTC | #10
On 2021/4/13 16:33, Hillf Danton wrote:
> On Tue, 13 Apr 2021 15:57:29  Yunsheng Lin wrote:
>> On 2021/4/13 15:12, Hillf Danton wrote:
>>> On Tue, 13 Apr 2021 11:34:27 Yunsheng Lin wrote:
>>>> On 2021/4/13 11:26, Hillf Danton wrote:
>>>>> On Tue, 13 Apr 2021 10:56:42 Yunsheng Lin wrote:
>>>>>> On 2021/4/13 10:21, Hillf Danton wrote:
>>>>>>> On Mon, 12 Apr 2021 20:00:43  Yunsheng Lin wrote:
>>>>>>>>
>>>>>>>> Yes, the below patch seems to fix the data race described in
>>>>>>>> the commit log.
>>>>>>>> Then what is the difference between my patch and your patch below:)
>>>>>>>
>>>>>>> Hehe, this is one of the tough questions over a bounch of weeks.
>>>>>>>
>>>>>>> If a seqcount can detect the race between skb enqueue and dequeue then we
>>>>>>> cant see any excuse for not rolling back to the point without NOLOCK.
>>>>>>
>>>>>> I am not sure I understood what you meant above.
>>>>>>
>>>>>> As my understanding, the below patch is essentially the same as
>>>>>> your previous patch, the only difference I see is it uses qdisc->pad
>>>>>> instead of __QDISC_STATE_NEED_RESCHEDULE.
>>>>>>
>>>>>> So instead of proposing another patch, it would be better if you
>>>>>> comment on my patch, and make improvement upon that.
>>>>>>
>>>>> Happy to do that after you show how it helps revert NOLOCK.
>>>>
>>>> Actually I am not going to revert NOLOCK, but add optimization
>>>> to it if the patch fixes the packet stuck problem.
>>>>
>>> Fix is not optimization, right?
>>
>> For this patch, it is a fix.
>> In case you missed it, I do have a couple of idea to optimize the
>> lockless qdisc:
>>
>> 1. RFC patch to add lockless qdisc bypass optimization:
>>
>> https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/
>>
>> 2. implement lockless enqueuing for lockless qdisc using the idea
>>   from Jason and Toke. And it has a noticable proformance increase with
>>   1-4 threads running using the below prototype based on ptr_ring.
>>
>> static inline int __ptr_ring_multi_produce(struct ptr_ring *r, void *ptr)
>> {
>>
>>        int producer, next_producer;
>>
>>
>>        do {
>>                producer = READ_ONCE(r->producer);
>>                if (unlikely(!r->size) || r->queue[producer])
>>                        return -ENOSPC;
>>                next_producer = producer + 1;
>>                if (unlikely(next_producer >= r->size))
>>                        next_producer = 0;
>>        } while(cmpxchg_relaxed(&r->producer, producer, next_producer) != producer);
>>
>>        /* Make sure the pointer we are storing points to a valid data. */
>>        /* Pairs with the dependency ordering in __ptr_ring_consume. */
>>        smp_wmb();
>>
>>        WRITE_ONCE(r->queue[producer], ptr);
>>        return 0;
>> }
>>
>> 3. Maybe it is possible to remove the netif_tx_lock for lockless qdisc
>>   too, because dev_hard_start_xmit is also in the protection of
>>   qdisc_run_begin()/qdisc_run_end()(if there is only one qdisc using
>>   a netdev queue, which is true for pfifo_fast, I believe).
>>
>> 4. Remove the qdisc->running seqcount operation for lockless qdisc, which
>>   is mainly used to do heuristic locking on q->busylock for locked qdisc.
>>
> 
> Sounds good. They can stand two months, cant they?
>>>
>>>> Is there any reason why you want to revert it?
>>>>
>>> I think you know Jiri's plan and it would be nice to wait a couple of
>>> months for it to complete.
>>
>> I am not sure I am aware of Jiri's plan.
>> Is there any link referring to the plan?
>>
> https://lore.kernel.org/lkml/eaff25bc-9b64-037e-b9bc-c06fc4a5a9fb@huawei.com/

I think there is some misunderstanding here.

As I understand it, Jiri and Juergen are on the same team (both using
the suse.com mail server).
What Jiri meant by "I am still planning to have Yunsheng Lin's
(CCing) fix [1] tested in the coming days." is that Juergen has now
done the test and provided a "Tested-by" tag.

So if this patch fixes the packet stuck problem, Jiri is OK with the
NOLOCK qdisc too.

Or do I misunderstand it again? Perhaps Jiri and Juergen can help
clarify this?


> 
> .
>
Jürgen Groß April 13, 2021, 9:15 a.m. UTC | #11
On 13.04.21 11:03, Yunsheng Lin wrote:
> On 2021/4/13 16:33, Hillf Danton wrote:
>> On Tue, 13 Apr 2021 15:57:29  Yunsheng Lin wrote:
>>> On 2021/4/13 15:12, Hillf Danton wrote:
>>>> On Tue, 13 Apr 2021 11:34:27 Yunsheng Lin wrote:
>>>>> On 2021/4/13 11:26, Hillf Danton wrote:
>>>>>> On Tue, 13 Apr 2021 10:56:42 Yunsheng Lin wrote:
>>>>>>> On 2021/4/13 10:21, Hillf Danton wrote:
>>>>>>>> On Mon, 12 Apr 2021 20:00:43  Yunsheng Lin wrote:
>>>>>>>>>
>>>>>>>>> Yes, the below patch seems to fix the data race described in
>>>>>>>>> the commit log.
>>>>>>>>> Then what is the difference between my patch and your patch below:)
>>>>>>>>
>>>>>>>> Hehe, this is one of the tough questions over a bounch of weeks.
>>>>>>>>
>>>>>>>> If a seqcount can detect the race between skb enqueue and dequeue then we
>>>>>>>> cant see any excuse for not rolling back to the point without NOLOCK.
>>>>>>>
>>>>>>> I am not sure I understood what you meant above.
>>>>>>>
>>>>>>> As my understanding, the below patch is essentially the same as
>>>>>>> your previous patch, the only difference I see is it uses qdisc->pad
>>>>>>> instead of __QDISC_STATE_NEED_RESCHEDULE.
>>>>>>>
>>>>>>> So instead of proposing another patch, it would be better if you
>>>>>>> comment on my patch, and make improvement upon that.
>>>>>>>
>>>>>> Happy to do that after you show how it helps revert NOLOCK.
>>>>>
>>>>> Actually I am not going to revert NOLOCK, but add optimization
>>>>> to it if the patch fixes the packet stuck problem.
>>>>>
>>>> Fix is not optimization, right?
>>>
>>> For this patch, it is a fix.
>>> In case you missed it, I do have a couple of idea to optimize the
>>> lockless qdisc:
>>>
>>> 1. RFC patch to add lockless qdisc bypass optimization:
>>>
>>> https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/
>>>
>>> 2. implement lockless enqueuing for lockless qdisc using the idea
>>>    from Jason and Toke. And it has a noticable proformance increase with
>>>    1-4 threads running using the below prototype based on ptr_ring.
>>>
>>> static inline int __ptr_ring_multi_produce(struct ptr_ring *r, void *ptr)
>>> {
>>>
>>>         int producer, next_producer;
>>>
>>>
>>>         do {
>>>                 producer = READ_ONCE(r->producer);
>>>                 if (unlikely(!r->size) || r->queue[producer])
>>>                         return -ENOSPC;
>>>                 next_producer = producer + 1;
>>>                 if (unlikely(next_producer >= r->size))
>>>                         next_producer = 0;
>>>         } while(cmpxchg_relaxed(&r->producer, producer, next_producer) != producer);
>>>
>>>         /* Make sure the pointer we are storing points to a valid data. */
>>>         /* Pairs with the dependency ordering in __ptr_ring_consume. */
>>>         smp_wmb();
>>>
>>>         WRITE_ONCE(r->queue[producer], ptr);
>>>         return 0;
>>> }
>>>
>>> 3. Maybe it is possible to remove the netif_tx_lock for lockless qdisc
>>>    too, because dev_hard_start_xmit is also in the protection of
>>>    qdisc_run_begin()/qdisc_run_end()(if there is only one qdisc using
>>>    a netdev queue, which is true for pfifo_fast, I believe).
>>>
>>> 4. Remove the qdisc->running seqcount operation for lockless qdisc, which
>>>    is mainly used to do heuristic locking on q->busylock for locked qdisc.
>>>
>>
>> Sounds good. They can stand two months, cant they?
>>>>
>>>>> Is there any reason why you want to revert it?
>>>>>
>>>> I think you know Jiri's plan and it would be nice to wait a couple of
>>>> months for it to complete.
>>>
>>> I am not sure I am aware of Jiri's plan.
>>> Is there any link referring to the plan?
>>>
>> https://lore.kernel.org/lkml/eaff25bc-9b64-037e-b9bc-c06fc4a5a9fb@huawei.com/
> 
> I think there is some misunderstanding here.
> 
> As my understanding, Jiri and Juergen are from the same team(using
> the suse.com mail server).

Yes, we are.

> what Jiri said about "I am still planning to have Yunsheng Lin's
> (CCing) fix [1] tested in the coming days." is that Juergen has
> done the test and provide a "Tested-by" tag.

Correct. And I did this after Jiri asked me to do so.

> So if this patch fixes the packet stuck problem, Jiri is ok with
> NOLOCK qdisc too.

I think so, yes. Otherwise I don't see why he asked me to test the
patch. :-)

> Or do I misunderstand it again? Perhaps Jiri and Juergen can help to
> clarify this?

I hope I did. :-)


Juergen
Jiri Kosina April 16, 2021, 11:46 a.m. UTC | #12
On Tue, 13 Apr 2021, Juergen Gross wrote:

> > what Jiri said about "I am still planning to have Yunsheng Lin's
> > (CCing) fix [1] tested in the coming days." is that Juergen has
> > done the test and provide a "Tested-by" tag.
> 
> Correct. And I did this after Jiri asking me to do so.

Exactly, Juergen's setup is the one where this issue is being reproduced 
reliably for us, and Juergen verified that your patch fixes that issue.

Seeing that you now also addressed the STATE_DEACTIVATED issue (which we
don't have a reproducer for, though), I think your series should be good
to go if the STATE_DEACTIVATED fix has been verified independently.

Thanks!
Michal Kubecek April 18, 2021, 10:59 p.m. UTC | #13
On Thu, Mar 25, 2021 at 11:13:11AM +0800, Yunsheng Lin wrote:
> Lockless qdisc has below concurrent problem:
>     cpu0                 cpu1
>      .                     .
> q->enqueue                 .
>      .                     .
> qdisc_run_begin()          .
>      .                     .
> dequeue_skb()              .
>      .                     .
> sch_direct_xmit()          .
>      .                     .
>      .                q->enqueue
>      .             qdisc_run_begin()
>      .            return and do nothing
>      .                     .
> qdisc_run_end()            .
> 
> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() return false
> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
> enqueue the skb after cpu0 calling dequeue_skb() and before
> cpu0 calling qdisc_run_end().
> 
> Lockless qdisc has below another concurrent problem when
> tx_action is involved:
> 
> cpu0(serving tx_action)     cpu1             cpu2
>           .                   .                .
>           .              q->enqueue            .
>           .            qdisc_run_begin()       .
>           .              dequeue_skb()         .
>           .                   .            q->enqueue
>           .                   .                .
>           .             sch_direct_xmit()      .
>           .                   .         qdisc_run_begin()
>           .                   .       return and do nothing
>           .                   .                .
>  clear __QDISC_STATE_SCHED    .                .
>  qdisc_run_begin()            .                .
>  return and do nothing        .                .
>           .                   .                .
>           .            qdisc_run_end()         .
> 
> This patch fixes the above data race by:
> 1. Get the flag before doing spin_trylock().
> 2. If the first spin_trylock() return false and the flag is not
>    set before the first spin_trylock(), Set the flag and retry
>    another spin_trylock() in case other CPU may not see the new
>    flag after it releases the lock.
> 3. reschedule if the flags is set after the lock is released
>    at the end of qdisc_run_end().
> 
> For tx_action case, the flags is also set when cpu1 is at the
> end if qdisc_run_end(), so tx_action will be rescheduled
> again to dequeue the skb enqueued by cpu2.
> 
> Only clear the flag before retrying a dequeuing when dequeuing
> returns NULL in order to reduce the overhead of the above double
> spin_trylock() and __netif_schedule() calling.
> 
> The performance impact of this patch, tested using pktgen and
> dummy netdev with pfifo_fast qdisc attached:
> 
>  threads  without+this_patch   with+this_patch      delta
>     1        2.61Mpps            2.60Mpps           -0.3%
>     2        3.97Mpps            3.82Mpps           -3.7%
>     4        5.62Mpps            5.59Mpps           -0.5%
>     8        2.78Mpps            2.77Mpps           -0.3%
>    16        2.22Mpps            2.22Mpps           -0.0%
> 
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> V3: fix a compile error and a few comment typo, remove the
>     __QDISC_STATE_DEACTIVATED checking, and update the
>     performance data.
> V2: Avoid the overhead of fixing the data race as much as
>     possible.

I tried this patch on top of 5.12-rc7 with real devices. I used two
machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
a saturated ethernet, the CPU consumption grows quite a lot:

    threads     unpatched 5.12-rc7    5.12-rc7 + v3   
      1               25.6%               30.6%
      8               73.1%              241.4%
    128              132.2%             1012.0%

(The values are normalized to one core, i.e. 100% corresponds to one
fully used logical CPU.) I didn't perform a full statistical evaluation
but the growth is way beyond any statistical fluctuation with one
exception: 8-thread test of patched kernel showed values from 155.5% to
311.4%. A closer look shows that most of the CPU time was spent in
softirq, and running top in parallel with the test confirms that there
are multiple ksoftirqd threads running at 100% CPU. I had a similar
problem with earlier versions of my patch (work in progress, I still
need to check some corner cases and write a commit message explaining
the logic), and tracing confirmed that a similar problem (non-empty
queue, no other thread going to clean it up, but device queue stopped)
was happening repeatedly most of the time.

The biggest problem IMHO is that the loop in __qdisc_run() may finish
without rescheduling not only when the qdisc queue is empty but also
when the corresponding device Tx queue is stopped which devices tend to
do whenever they cannot send any more packets out. Thus whenever
__QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or
frozen, we keep rescheduling the queue cleanup without any chance to
progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all
we need is another thread to fail the first spin_trylock() while the device
queue is stopped and qdisc queue not empty.

Another part of the problem may be that to avoid the race, the logic is
too pessimistic: consider e.g. (dotted lines show "barriers" where
ordering is important):

    CPU A                            CPU B
    spin_trylock() succeeds
                                     pfifo_fast_enqueue()
    ..................................................................
    skb_array empty, exit loop
                                     first spin_trylock() fails
                                     set __QDISC_STATE_NEED_RESCHEDULE
                                     second spin_trylock() fails
    ..................................................................
    spin_unlock()
    call __netif_schedule()

When we switch the order of spin_unlock() on CPU A and the second
spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
before CPU A tests it), we end up scheduling a queue cleanup even if
there is already one running. And either of these is quite realistic.

> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f7a6e14..e3f46eb 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>  enum qdisc_state_t {
>  	__QDISC_STATE_SCHED,
>  	__QDISC_STATE_DEACTIVATED,
> +	__QDISC_STATE_NEED_RESCHEDULE,
>  };

I'm not sure if putting the flag here is a good idea. If you look at the
history of struct Qdisc reshufflings, this part (cacheline) should be
used for members which don't change too often. However, this new flag is
going to be touched about as often and in similar places as q->seqlock
or q->empty so that it should probably be in the last cacheline with
them.

[...]
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..4953430 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>  	struct sk_buff *skb = NULL;
> +	bool need_retry = true;
>  	int band;
>  
> +retry:
>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>  		struct skb_array *q = band2list(priv, band);
>  
> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  	}
>  	if (likely(skb)) {
>  		qdisc_update_stats_at_dequeue(qdisc, skb);
> +	} else if (need_retry &&
> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +				      &qdisc->state)) {
> +		/* do another enqueuing after clearing the flag to
> +		 * avoid calling __netif_schedule().
> +		 */
> +		smp_mb__after_atomic();
> +		need_retry = false;
> +
> +		goto retry;
>  	} else {
>  		WRITE_ONCE(qdisc->empty, true);
>  	}

Does the retry really provide significant improvement? IIUC it only
helps if all of enqueue, first spin_trylock() failure and setting the
__QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
skb_array empty and checking the flag. If enqueue happens before we
check the array (and flag is set before the check), the retry is
useless. If the flag is set after we check it, we don't catch it (no
matter if the enqueue happened before or after we found skb_array
empty).

Michal
Yunsheng Lin April 19, 2021, 2:04 a.m. UTC | #14
On 2021/4/19 6:59, Michal Kubecek wrote:
> On Thu, Mar 25, 2021 at 11:13:11AM +0800, Yunsheng Lin wrote:
>> Lockless qdisc has below concurrent problem:
>>     cpu0                 cpu1
>>      .                     .
>> q->enqueue                 .
>>      .                     .
>> qdisc_run_begin()          .
>>      .                     .
>> dequeue_skb()              .
>>      .                     .
>> sch_direct_xmit()          .
>>      .                     .
>>      .                q->enqueue
>>      .             qdisc_run_begin()
>>      .            return and do nothing
>>      .                     .
>> qdisc_run_end()            .
>>
>> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
>> has not released the lock yet and spin_trylock() return false
>> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
>> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
>> enqueue the skb after cpu0 calling dequeue_skb() and before
>> cpu0 calling qdisc_run_end().
>>
>> Lockless qdisc has below another concurrent problem when
>> tx_action is involved:
>>
>> cpu0(serving tx_action)     cpu1             cpu2
>>           .                   .                .
>>           .              q->enqueue            .
>>           .            qdisc_run_begin()       .
>>           .              dequeue_skb()         .
>>           .                   .            q->enqueue
>>           .                   .                .
>>           .             sch_direct_xmit()      .
>>           .                   .         qdisc_run_begin()
>>           .                   .       return and do nothing
>>           .                   .                .
>>  clear __QDISC_STATE_SCHED    .                .
>>  qdisc_run_begin()            .                .
>>  return and do nothing        .                .
>>           .                   .                .
>>           .            qdisc_run_end()         .
>>
>> This patch fixes the above data race by:
>> 1. Get the flag before doing spin_trylock().
>> 2. If the first spin_trylock() return false and the flag is not
>>    set before the first spin_trylock(), Set the flag and retry
>>    another spin_trylock() in case other CPU may not see the new
>>    flag after it releases the lock.
>> 3. reschedule if the flags is set after the lock is released
>>    at the end of qdisc_run_end().
>>
>> For tx_action case, the flags is also set when cpu1 is at the
>> end if qdisc_run_end(), so tx_action will be rescheduled
>> again to dequeue the skb enqueued by cpu2.
>>
>> Only clear the flag before retrying a dequeuing when dequeuing
>> returns NULL in order to reduce the overhead of the above double
>> spin_trylock() and __netif_schedule() calling.
>>
>> The performance impact of this patch, tested using pktgen and
>> dummy netdev with pfifo_fast qdisc attached:
>>
>>  threads  without+this_patch   with+this_patch      delta
>>     1        2.61Mpps            2.60Mpps           -0.3%
>>     2        3.97Mpps            3.82Mpps           -3.7%
>>     4        5.62Mpps            5.59Mpps           -0.5%
>>     8        2.78Mpps            2.77Mpps           -0.3%
>>    16        2.22Mpps            2.22Mpps           -0.0%
>>
>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> V3: fix a compile error and a few comment typo, remove the
>>     __QDISC_STATE_DEACTIVATED checking, and update the
>>     performance data.
>> V2: Avoid the overhead of fixing the data race as much as
>>     possible.
> 
> I tried this patch o top of 5.12-rc7 with real devices. I used two
> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
> a saturated ethernet, the CPU consumption grows quite a lot:
> 
>     threads     unpatched 5.12-rc7    5.12-rc7 + v3   
>       1               25.6%               30.6%
>       8               73.1%              241.4%
>     128              132.2%             1012.0%
> 
> (The values are normalized to one core, i.e. 100% corresponds to one
> fully used logical CPU.) I didn't perform a full statistical evaluation
> but the growth is way beyond any statistical fluctuation with one
> exception: 8-thread test of patched kernel showed values from 155.5% to
> 311.4%. Closer look shows that most of the CPU time was spent in softirq
> and running top in parallel with the test confirms that there are
> multiple ksofirqd threads running at 100% CPU. I had similar problem
> with earlier versions of my patch (work in progress, I still need to
> check some corner cases and write commit message explaining the logic)

Great. If there is a better idea, maybe share the core idea first so
that we can both work on it?

> and tracing confirmed that similar problem (non-empty queue, no other
> thread going to clean it up but device queue stopped) was happening
> repeatedly most of the time.

Makes sense. Maybe that is why the dummy netdevice cannot catch this
kind of performance degradation: it always consumes the skb without
stopping the tx queue, while a real netdevice with a limited queue
depth may stop the queue when multiple skbs are queued concurrently.

I think for the above to happen there may be a lot of tx doorbell
batching going on; it would be better to see how many packets are
dequeued at a time when the trace_qdisc_dequeue() tracepoint is enabled?

> 
> The biggest problem IMHO is that the loop in __qdisc_run() may finish
> without rescheduling not only when the qdisc queue is empty but also
> when the corresponding device Tx queue is stopped which devices tend to
> do whenever they cannot send any more packets out. Thus whenever
> __QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or
> frozen, we keep rescheduling the queue cleanup without any chance to
> progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all
> we need is another thread to fail the first spin_trylock() while the device
> queue is stopped and qdisc queue not empty.

Yes, we could just return false before doing the second spin_trylock() if
the netdev queue corresponding to the qdisc is stopped; when the netdev queue
is restarted, __netif_schedule() is called again, see netif_tx_wake_queue().

Maybe add a sch_qdisc_stopped() function and do the testing in qdisc_run_begin():

if (dont_retry || sch_qdisc_stopped())
	return false;

bool sch_qdisc_stopped(struct Qdisc *q)
{
	const struct netdev_queue *txq = q->dev_queue;

	if (netif_xmit_frozen_or_stopped(txq))
		return true;

	return false;
}

At least for qdiscs with the TCQ_F_ONETXQUEUE flag set this should be doable?
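
Roughly, the second-trylock path could then bail out like below (just a
sketch; the TCQ_F_ONETXQUEUE guard and placing the check before the
set_bit() are my assumptions, not tested code):

	if (dont_retry)
		return false;

	/* Device queue stopped/frozen: give up instead of setting
	 * __QDISC_STATE_NEED_RESCHEDULE, netif_tx_wake_queue() will
	 * __netif_schedule() the qdisc when the driver restarts the queue.
	 */
	if ((qdisc->flags & TCQ_F_ONETXQUEUE) && sch_qdisc_stopped(qdisc))
		return false;

	set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

	if (!spin_trylock(&qdisc->seqlock))
		return false;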

> 
> Another part of the problem may be that to avoid the race, the logic is
> too pessimistic: consider e.g. (dotted lines show "barriers" where
> ordering is important):
> 
>     CPU A                            CPU B
>     spin_trylock() succeeds
>                                      pfifo_fast_enqueue()
>     ..................................................................
>     skb_array empty, exit loop
>                                      first spin_trylock() fails
>                                      set __QDISC_STATE_NEED_RESCHEDULE
>                                      second spin_trylock() fails
>     ..................................................................
>     spin_unlock()
>     call __netif_schedule()
> 
> When we switch the order of spin_lock() on CPU A and second
> spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
> before CPU A tests it), we end up scheduling a queue cleanup even if
> there is already one running. And either of these is quite realistic.

But I am not sure whether that is a good thing or a bad thing when the above
happens, because it may enable tx batching?

> 
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index f7a6e14..e3f46eb 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>>  enum qdisc_state_t {
>>  	__QDISC_STATE_SCHED,
>>  	__QDISC_STATE_DEACTIVATED,
>> +	__QDISC_STATE_NEED_RESCHEDULE,
>>  };
> 
> I'm not sure if putting the flag here is a good idea. If you look at the
> history of struct Qdisc reshufflings, this part (cacheline) should be
> used for members which don't change too often. However, this new flag is
> going to be touched about as often and in similar places as q->seqlock
> or q->empty so that it should probably be in the last cacheline with
> them.

I am not sure about that either, but we can always adjust its location
if that proves to be the case :)

> 
> [...]
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 44991ea..4953430 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>  {
>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>  	struct sk_buff *skb = NULL;
>> +	bool need_retry = true;
>>  	int band;
>>  
>> +retry:
>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>  		struct skb_array *q = band2list(priv, band);
>>  
>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>  	}
>>  	if (likely(skb)) {
>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>> +	} else if (need_retry &&
>> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
>> +				      &qdisc->state)) {
>> +		/* do another enqueuing after clearing the flag to
>> +		 * avoid calling __netif_schedule().
>> +		 */
>> +		smp_mb__after_atomic();
>> +		need_retry = false;
>> +
>> +		goto retry;
>>  	} else {
>>  		WRITE_ONCE(qdisc->empty, true);
>>  	}
> 
> Does the retry really provide significant improvement? IIUC it only
> helps if all of enqueue, first spin_trylock() failure and setting the
> __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
> skb_array empty and checking the flag. If enqueue happens before we
> check the array (and flag is set before the check), the retry is
> useless. If the flag is set after we check it, we don't catch it (no
> matter if the enqueue happened before or after we found skb_array
> empty).

In order to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"
as much as possible, __QDISC_STATE_NEED_RESCHEDULE needs to stay set as
much as possible, so only clear __QDISC_STATE_NEED_RESCHEDULE when the
queue is empty.
And it gives about a 5% performance improvement.

> 
> Michal
> 
> .
>
Yunsheng Lin April 19, 2021, 12:21 p.m. UTC | #15
On 2021/4/19 10:04, Yunsheng Lin wrote:
> On 2021/4/19 6:59, Michal Kubecek wrote:
>> On Thu, Mar 25, 2021 at 11:13:11AM +0800, Yunsheng Lin wrote:
>>> Lockless qdisc has below concurrent problem:
>>>     cpu0                 cpu1
>>>      .                     .
>>> q->enqueue                 .
>>>      .                     .
>>> qdisc_run_begin()          .
>>>      .                     .
>>> dequeue_skb()              .
>>>      .                     .
>>> sch_direct_xmit()          .
>>>      .                     .
>>>      .                q->enqueue
>>>      .             qdisc_run_begin()
>>>      .            return and do nothing
>>>      .                     .
>>> qdisc_run_end()            .
>>>
>>> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
>>> has not released the lock yet and spin_trylock() return false
>>> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
>>> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
>>> enqueue the skb after cpu0 calling dequeue_skb() and before
>>> cpu0 calling qdisc_run_end().
>>>
>>> Lockless qdisc has below another concurrent problem when
>>> tx_action is involved:
>>>
>>> cpu0(serving tx_action)     cpu1             cpu2
>>>           .                   .                .
>>>           .              q->enqueue            .
>>>           .            qdisc_run_begin()       .
>>>           .              dequeue_skb()         .
>>>           .                   .            q->enqueue
>>>           .                   .                .
>>>           .             sch_direct_xmit()      .
>>>           .                   .         qdisc_run_begin()
>>>           .                   .       return and do nothing
>>>           .                   .                .
>>>  clear __QDISC_STATE_SCHED    .                .
>>>  qdisc_run_begin()            .                .
>>>  return and do nothing        .                .
>>>           .                   .                .
>>>           .            qdisc_run_end()         .
>>>
>>> This patch fixes the above data race by:
>>> 1. Get the flag before doing spin_trylock().
>>> 2. If the first spin_trylock() return false and the flag is not
>>>    set before the first spin_trylock(), Set the flag and retry
>>>    another spin_trylock() in case other CPU may not see the new
>>>    flag after it releases the lock.
>>> 3. reschedule if the flags is set after the lock is released
>>>    at the end of qdisc_run_end().
>>>
>>> For tx_action case, the flags is also set when cpu1 is at the
>>> end if qdisc_run_end(), so tx_action will be rescheduled
>>> again to dequeue the skb enqueued by cpu2.
>>>
>>> Only clear the flag before retrying a dequeuing when dequeuing
>>> returns NULL in order to reduce the overhead of the above double
>>> spin_trylock() and __netif_schedule() calling.
>>>
>>> The performance impact of this patch, tested using pktgen and
>>> dummy netdev with pfifo_fast qdisc attached:
>>>
>>>  threads  without+this_patch   with+this_patch      delta
>>>     1        2.61Mpps            2.60Mpps           -0.3%
>>>     2        3.97Mpps            3.82Mpps           -3.7%
>>>     4        5.62Mpps            5.59Mpps           -0.5%
>>>     8        2.78Mpps            2.77Mpps           -0.3%
>>>    16        2.22Mpps            2.22Mpps           -0.0%
>>>
>>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>> ---
>>> V3: fix a compile error and a few comment typo, remove the
>>>     __QDISC_STATE_DEACTIVATED checking, and update the
>>>     performance data.
>>> V2: Avoid the overhead of fixing the data race as much as
>>>     possible.
>>
>> I tried this patch o top of 5.12-rc7 with real devices. I used two
>> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
>> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
>> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
>> a saturated ethernet, the CPU consumption grows quite a lot:
>>
>>     threads     unpatched 5.12-rc7    5.12-rc7 + v3   
>>       1               25.6%               30.6%
>>       8               73.1%              241.4%
>>     128              132.2%             1012.0%

I cannot really interpret the above numbers, but I understand that v3 has a
cpu usage impact when applied on top of 5.12-rc7, so I ran a test too on an
arm64 system with a hns3 100G netdev, which is in node 0, and node 0 has 32
cpus.
root@(none)$ cat /sys/class/net/eth4/device/numa_node
0
root@(none)$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 128646 MB
node 0 free: 127876 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 129019 MB
node 1 free: 128796 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 2 size: 129019 MB
node 2 free: 128755 MB
node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 127989 MB
node 3 free: 127772 MB
node distances:
node   0   1   2   3
  0:  10  16  32  33
  1:  16  10  25  32
  2:  32  25  10  16
  3:  33  32  16  10

and I use the below cmds to use only 16 tx/rx queues with a tx queue depth of
72 in order to trigger the tx queue stopping handling:
ethtool -L eth4 combined 16
ethtool -G eth4 tx 72

threads       unpatched          patched_v4    patched_v4+queue_stopped_opt
   16      11% (si: 3.8%)     20% (si:13.5%)       11% (si:3.8%)
   32      12% (si: 4.4%)     30% (si:22%)         13% (si:4.4%)
   64      13% (si: 4.9%)     35% (si:26%)         13% (si:4.7%)

"11% (si: 3.8%)": 11% means the total cpu useage in node 0, and 3.8%
means the softirq cpu useage .
And thread number is as below iperf cmd:
taskset -c 0-31 iperf -c 192.168.100.2 -t 1000000 -i 1 -P *thread*

It seems that after applying the queue_stopped_opt patch, the cpu usage
is close to the unpatched one, at least with my testcase; maybe you
can try your testcase with the queue_stopped_opt patch to see if
it makes any difference?

patched_v4:
https://patchwork.kernel.org/project/netdevbpf/patch/1618535809-11952-2-git-send-email-linyunsheng@huawei.com/

And queue_stopped_opt:
diff --git a/net/core/dev.c b/net/core/dev.c
index 5d6f2e2..62008d5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3762,7 +3762,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,

        if (q->flags & TCQ_F_NOLOCK) {
                rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-               qdisc_run(q);
+               if (likely(rc == NET_XMIT_SUCCESS &&
+                          !netif_xmit_frozen_or_stopped(txq)))
+                       qdisc_run(q);

                if (unlikely(to_free))
                        kfree_skb_list(to_free);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6d7f954..079a825 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -74,6 +74,7 @@ static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q)
                        }
                } else {
                        skb = SKB_XOFF_MAGIC;
+                       clear_bit(__QDISC_STATE_MISSED, &q->state);
                }
        }

@@ -242,6 +243,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
                        }
                } else {
                        skb = NULL;
+                       clear_bit(__QDISC_STATE_MISSED, &q->state);
                }
                if (lock)
                        spin_unlock(lock);
@@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
        *validate = true;

        if ((q->flags & TCQ_F_ONETXQUEUE) &&
-           netif_xmit_frozen_or_stopped(txq))
+           netif_xmit_frozen_or_stopped(txq)) {
+               clear_bit(__QDISC_STATE_MISSED, &q->state);
                return skb;
+       }

        skb = qdisc_dequeue_skb_bad_txq(q);
        if (unlikely(skb)) {
@@ -311,6 +315,8 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
                HARD_TX_LOCK(dev, txq, smp_processor_id());
                if (!netif_xmit_frozen_or_stopped(txq))
                        skb = dev_hard_start_xmit(skb, dev, txq, &ret);
+               else
+                       clear_bit(__QDISC_STATE_MISSED, &q->state);

                HARD_TX_UNLOCK(dev, txq);
        } else {


>>
>> (The values are normalized to one core, i.e. 100% corresponds to one
>> fully used logical CPU.) I didn't perform a full statistical evaluation
>> but the growth is way beyond any statistical fluctuation with one
>> exception: 8-thread test of patched kernel showed values from 155.5% to
>> 311.4%. Closer look shows that most of the CPU time was spent in softirq
>> and running top in parallel with the test confirms that there are
>> multiple ksofirqd threads running at 100% CPU. I had similar problem
>> with earlier versions of my patch (work in progress, I still need to
>> check some corner cases and write commit message explaining the logic)
> 
> Great, if there is a better idea, maybe share the core idea first so
> that we both can work on the that?
> 
>> and tracing confirmed that similar problem (non-empty queue, no other
>> thread going to clean it up but device queue stopped) was happening
>> repeatedly most of the time.
> 
> Make sense, maybe that is why the dummy netdevice can not catch this kind
> of performance degradation because it always consume the skb without stopping
> the tx queue, and a real netdevice with limited queue depth may stop the
> queue when there are multil skbs queuing concurrently.
> 
> I think for the above to happen, there may be a lot of tx doorbell batching
> happening, it would be better to see how many packets is enqueued at a time
> when trace_qdisc_dequeue() tracepoint is enabled?
> 
>>
>> The biggest problem IMHO is that the loop in __qdisc_run() may finish
>> without rescheduling not only when the qdisc queue is empty but also
>> when the corresponding device Tx queue is stopped which devices tend to
>> do whenever they cannot send any more packets out. Thus whenever
>> __QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or
>> frozen, we keep rescheduling the queue cleanup without any chance to
>> progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all
>> we need is another thready to fail the first spin_trylock() while device
>> queue is stopped and qdisc queue not empty.
> 
> Yes, We could just return false before doing the second spin_trylock() if
> the netdev queue corresponding qdisc is stopped, and when the netdev queue
> is restarted, __netif_schedule() is called again, see netif_tx_wake_queue().
> 
> Maybe add a sch_qdisc_stopped() function and do the testting in qdisc_run_begin:
> 
> if (dont_retry || sch_qdisc_stopped())
> 	return false;
> 
> bool sch_qdisc_stopped(struct Qdisc *q)
> {
> 	const struct netdev_queue *txq = q->dev_queue;
> 
> 	if (!netif_xmit_frozen_or_stopped(txq))
> 		return true;
> 
> 	reture false;
> }
> 
> At least for qdisc with TCQ_F_ONETXQUEUE flags set is doable?
> 
>>
>> Another part of the problem may be that to avoid the race, the logic is
>> too pessimistic: consider e.g. (dotted lines show "barriers" where
>> ordering is important):
>>
>>     CPU A                            CPU B
>>     spin_trylock() succeeds
>>                                      pfifo_fast_enqueue()
>>     ..................................................................
>>     skb_array empty, exit loop
>>                                      first spin_trylock() fails
>>                                      set __QDISC_STATE_NEED_RESCHEDULE
>>                                      second spin_trylock() fails
>>     ..................................................................
>>     spin_unlock()
>>     call __netif_schedule()
>>
>> When we switch the order of spin_lock() on CPU A and second
>> spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
>> before CPU A tests it), we end up scheduling a queue cleanup even if
>> there is already one running. And either of these is quite realistic.
> 
> But I am not sure it is a good thing or bad thing when the above happen,
> because it may be able to enable the tx batching?
> 
>>
>>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>>> index f7a6e14..e3f46eb 100644
>>> --- a/include/net/sch_generic.h
>>> +++ b/include/net/sch_generic.h
>>> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>>>  enum qdisc_state_t {
>>>  	__QDISC_STATE_SCHED,
>>>  	__QDISC_STATE_DEACTIVATED,
>>> +	__QDISC_STATE_NEED_RESCHEDULE,
>>>  };
>>
>> I'm not sure if putting the flag here is a good idea. If you look at the
>> history of struct Qdisc reshufflings, this part (cacheline) should be
>> used for members which don't change too often. However, this new flag is
>> going to be touched about as often and in similar places as q->seqlock
>> or q->empty so that it should probably be in the last cacheline with
>> them.
> 
> I am not sure about that too, but we can always adjust it's location
> if it provdes to be the case:)
> 
>>
>> [...]
>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>> index 44991ea..4953430 100644
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>  {
>>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>  	struct sk_buff *skb = NULL;
>>> +	bool need_retry = true;
>>>  	int band;
>>>  
>>> +retry:
>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>  		struct skb_array *q = band2list(priv, band);
>>>  
>>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>  	}
>>>  	if (likely(skb)) {
>>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>> +	} else if (need_retry &&
>>> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
>>> +				      &qdisc->state)) {
>>> +		/* do another enqueuing after clearing the flag to
>>> +		 * avoid calling __netif_schedule().
>>> +		 */
>>> +		smp_mb__after_atomic();
>>> +		need_retry = false;
>>> +
>>> +		goto retry;
>>>  	} else {
>>>  		WRITE_ONCE(qdisc->empty, true);
>>>  	}i
>>
>> Does the retry really provide significant improvement? IIUC it only
>> helps if all of enqueue, first spin_trylock() failure and setting the
>> __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
>> skb_array empty and checking the flag. If enqueue happens before we
>> check the array (and flag is set before the check), the retry is
>> useless. If the flag is set after we check it, we don't catch it (no
>> matter if the enqueue happened before or after we found skb_array
>> empty).
> 
> In odrder to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"

__QDISC_STATE_MISSED is the V4 name, so the above should be:
set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state)

The above data race is not really the main concern here; the main concern is
when to clear __QDISC_STATE_NEED_RESCHEDULE in order to avoid multiple
cpus doing the set_bit().

> as much as possible, the __QDISC_STATE_NEED_RESCHEDULE need to be set as
> as much as possible, so only clear __QDISC_STATE_NEED_RESCHEDULE when the
> queue is empty.
> And it has about 5% performance improvement.
> 
>>
>> Michal
>>
>> .
>>
> _______________________________________________
> Linuxarm mailing list -- linuxarm@openeuler.org
> To unsubscribe send an email to linuxarm-leave@openeuler.org
>
Michal Kubecek April 19, 2021, 2:57 p.m. UTC | #16
On Mon, Apr 19, 2021 at 10:04:27AM +0800, Yunsheng Lin wrote:
> > 
> > I tried this patch o top of 5.12-rc7 with real devices. I used two
> > machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
> > with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
> > CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
> > a saturated ethernet, the CPU consumption grows quite a lot:
> > 
> >     threads     unpatched 5.12-rc7    5.12-rc7 + v3   
> >       1               25.6%               30.6%
> >       8               73.1%              241.4%
> >     128              132.2%             1012.0%
> > 
> > (The values are normalized to one core, i.e. 100% corresponds to one
> > fully used logical CPU.) I didn't perform a full statistical evaluation
> > but the growth is way beyond any statistical fluctuation with one
> > exception: 8-thread test of patched kernel showed values from 155.5% to
> > 311.4%. Closer look shows that most of the CPU time was spent in softirq
> > and running top in parallel with the test confirms that there are
> > multiple ksofirqd threads running at 100% CPU. I had similar problem
> > with earlier versions of my patch (work in progress, I still need to
> > check some corner cases and write commit message explaining the logic)
> 
> Great, if there is a better idea, maybe share the core idea first so
> that we both can work on the that?

I'm not sure if it's really better, but to minimize the false positives
and unnecessary calls to __netif_schedule(), I replaced q->seqlock with
an atomic combination of a "running" flag (which corresponds to the current
seqlock being locked) and a "drainers" count (number of other threads
going to clean up the qdisc queue). This way we could keep track of them
and get reliable information on whether another thread is going to run a
cleanup after we leave the qdisc_run() critical section (so that there is
no need to schedule).
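
Very roughly, and just to illustrate the direction (the field and helper
names below are made up, the actual work-in-progress patch differs), the
combined state could even be folded into a single counter where the 0->1
transition elects the runner:

static void qdisc_drain(struct Qdisc *q)
{
	/* q->drainers: number of threads that want the queue drained
	 * (hypothetical atomic_long_t field). Whoever takes it from 0 to 1
	 * becomes the runner; everyone else just leaves a mark behind and
	 * relies on the runner doing another pass.
	 */
	if (atomic_long_fetch_inc(&q->drainers) != 0)
		return;

	do {
		__qdisc_run(q);
		/* Non-zero after our decrement means other threads asked
		 * for a drain while we were running, so loop again instead
		 * of calling __netif_schedule().
		 */
	} while (atomic_long_dec_return(&q->drainers) != 0);
}

The real version of course still has to deal with qdisc->running,
qdisc->empty and the stopped-queue case; this is only meant to show how the
exiting runner can know reliably whether anyone else still expects a cleanup.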

> > The biggest problem IMHO is that the loop in __qdisc_run() may finish
> > without rescheduling not only when the qdisc queue is empty but also
> > when the corresponding device Tx queue is stopped which devices tend to
> > do whenever they cannot send any more packets out. Thus whenever
> > __QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or
> > frozen, we keep rescheduling the queue cleanup without any chance to
> > progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all
> > we need is another thready to fail the first spin_trylock() while device
> > queue is stopped and qdisc queue not empty.
> 
> Yes, We could just return false before doing the second spin_trylock() if
> the netdev queue corresponding qdisc is stopped, and when the netdev queue
> is restarted, __netif_schedule() is called again, see netif_tx_wake_queue().
> 
> Maybe add a sch_qdisc_stopped() function and do the testting in qdisc_run_begin:
> 
> if (dont_retry || sch_qdisc_stopped())
> 	return false;
> 
> bool sch_qdisc_stopped(struct Qdisc *q)
> {
> 	const struct netdev_queue *txq = q->dev_queue;
> 
> 	if (!netif_xmit_frozen_or_stopped(txq))
> 		return true;
> 
> 	reture false;
> }
> 
> At least for qdisc with TCQ_F_ONETXQUEUE flags set is doable?

Either this or you can do the check in qdisc_run_end() - when the device
queue is stopped or frozen, there is no need to schedule as we know it's
going to be done when the flag is cleared again (and we cannot do
anything until then anyway).
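
Something like this on top of the v3 qdisc_run_end() hunk (only a sketch,
and the TCQ_F_ONETXQUEUE guard is an assumption on my side):

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	write_seqcount_end(&qdisc->running);
	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		/* No point in rescheduling while the device queue is
		 * stopped/frozen; netif_tx_wake_queue() will call
		 * __netif_schedule() once the driver restarts the queue.
		 */
		if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
				      &qdisc->state)) &&
		    !((qdisc->flags & TCQ_F_ONETXQUEUE) &&
		      netif_xmit_frozen_or_stopped(qdisc->dev_queue)))
			__netif_schedule(qdisc);
	}
}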

> > Another part of the problem may be that to avoid the race, the logic is
> > too pessimistic: consider e.g. (dotted lines show "barriers" where
> > ordering is important):
> > 
> >     CPU A                            CPU B
> >     spin_trylock() succeeds
> >                                      pfifo_fast_enqueue()
> >     ..................................................................
> >     skb_array empty, exit loop
> >                                      first spin_trylock() fails
> >                                      set __QDISC_STATE_NEED_RESCHEDULE
> >                                      second spin_trylock() fails
> >     ..................................................................
> >     spin_unlock()
> >     call __netif_schedule()
> > 
> > When we switch the order of spin_lock() on CPU A and second
> > spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
> > before CPU A tests it), we end up scheduling a queue cleanup even if
> > there is already one running. And either of these is quite realistic.
> 
> But I am not sure it is a good thing or bad thing when the above happen,
> because it may be able to enable the tx batching?

In either of the two scenarios, we call __netif_schedule() to schedule
a cleanup which does not do anything useful. In the first, the qdisc queue
is empty so that either the scheduled cleanup finds it empty or there
will be some new packets which would have their own cleanup. In the second,
scheduling a cleanup will not prevent the other thread from finishing the
cleanup it already started.

> >> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >> index 44991ea..4953430 100644
> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> >>  {
> >>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> >>  	struct sk_buff *skb = NULL;
> >> +	bool need_retry = true;
> >>  	int band;
> >>  
> >> +retry:
> >>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> >>  		struct skb_array *q = band2list(priv, band);
> >>  
> >> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> >>  	}
> >>  	if (likely(skb)) {
> >>  		qdisc_update_stats_at_dequeue(qdisc, skb);
> >> +	} else if (need_retry &&
> >> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
> >> +				      &qdisc->state)) {
> >> +		/* do another enqueuing after clearing the flag to
> >> +		 * avoid calling __netif_schedule().
> >> +		 */
> >> +		smp_mb__after_atomic();
> >> +		need_retry = false;
> >> +
> >> +		goto retry;
> >>  	} else {
> >>  		WRITE_ONCE(qdisc->empty, true);
> >>  	}i
> > 
> > Does the retry really provide significant improvement? IIUC it only
> > helps if all of enqueue, first spin_trylock() failure and setting the
> > __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
> > skb_array empty and checking the flag. If enqueue happens before we
> > check the array (and flag is set before the check), the retry is
> > useless. If the flag is set after we check it, we don't catch it (no
> > matter if the enqueue happened before or after we found skb_array
> > empty).
> 
> In odrder to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"
> as much as possible, the __QDISC_STATE_NEED_RESCHEDULE need to be set as
> as much as possible, so only clear __QDISC_STATE_NEED_RESCHEDULE when the
> queue is empty.

This is what I'm worried about. We are trying to address a race
condition which is extremely rare so we should avoid adding too much
overhead to the normal use.

> And it has about 5% performance improvement.

OK then. I thought that it would do an unnecessary dequeue retry much
more often than prevent an unnecessary __netif_schedule(), but I did not
run any special benchmark to check.

Michal
Michal Kubecek April 19, 2021, 3:25 p.m. UTC | #17
On Mon, Apr 19, 2021 at 08:21:38PM +0800, Yunsheng Lin wrote:
> On 2021/4/19 10:04, Yunsheng Lin wrote:
> > On 2021/4/19 6:59, Michal Kubecek wrote:
> >> I tried this patch o top of 5.12-rc7 with real devices. I used two
> >> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
> >> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
> >> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
> >> a saturated ethernet, the CPU consumption grows quite a lot:
> >>
> >>     threads     unpatched 5.12-rc7    5.12-rc7 + v3   
> >>       1               25.6%               30.6%
> >>       8               73.1%              241.4%
> >>     128              132.2%             1012.0%
> 
> I do not really read the above number, but I understand that v3 has a cpu
> usage impact when it is patched to 5.12-rc7, so I do a test too on a arm64
> system with a hns3 100G netdev, which is in node 0, and node 0 has 32 cpus.
> 
> root@(none)$ cat /sys/class/net/eth4/device/numa_node
> 0
> root@(none)$ numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 0 size: 128646 MB
> node 0 free: 127876 MB
> node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 1 size: 129019 MB
> node 1 free: 128796 MB
> node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> node 2 size: 129019 MB
> node 2 free: 128755 MB
> node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
> node 3 size: 127989 MB
> node 3 free: 127772 MB
> node distances:
> node   0   1   2   3
>   0:  10  16  32  33
>   1:  16  10  25  32
>   2:  32  25  10  16
>   3:  33  32  16  10
> 
> and I use below cmd to only use 16 tx/rx queue with 72 tx queue depth in
> order to trigger the tx queue stopping handling:
> ethtool -L eth4 combined 16
> ethtool -G eth4 tx 72
> 
> threads       unpatched          patched_v4    patched_v4+queue_stopped_opt
>    16      11% (si: 3.8%)     20% (si:13.5%)       11% (si:3.8%)
>    32      12% (si: 4.4%)     30% (si:22%)         13% (si:4.4%)
>    64      13% (si: 4.9%)     35% (si:26%)         13% (si:4.7%)
> 
> "11% (si: 3.8%)": 11% means the total cpu useage in node 0, and 3.8%
> means the softirq cpu useage .
> And thread number is as below iperf cmd:
> taskset -c 0-31 iperf -c 192.168.100.2 -t 1000000 -i 1 -P *thread*

The problem I see with this is that iperf's -P option only allows running
the test over multiple connections, but they do not actually run in multiple
threads. Therefore this may not result in as much concurrency as it seems.

Also, 100Gb/s ethernet is not so easy to saturate, trying 10Gb/s or
1Gb/s might put more pressure on qdisc code concurrency.

Michal

> It seems after applying the queue_stopped_opt patch, the cpu usage
> is closed to the unpatch one, at least with my testcase, maybe you
> can try your testcase with the queue_stopped_opt patch to see if
> it make any difference?

I will, I was not aware of the v4 submission. I'll write a short note to it
so that it does not accidentally get applied before we know for sure
what the CPU usage impact is.

Michal
Yunsheng Lin April 20, 2021, 1:45 a.m. UTC | #18
On 2021/4/19 22:57, Michal Kubecek wrote:
> On Mon, Apr 19, 2021 at 10:04:27AM +0800, Yunsheng Lin wrote:
>>>
>>> I tried this patch o top of 5.12-rc7 with real devices. I used two
>>> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
>>> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
>>> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
>>> a saturated ethernet, the CPU consumption grows quite a lot:
>>>
>>>     threads     unpatched 5.12-rc7    5.12-rc7 + v3   
>>>       1               25.6%               30.6%
>>>       8               73.1%              241.4%
>>>     128              132.2%             1012.0%
>>>
>>> (The values are normalized to one core, i.e. 100% corresponds to one
>>> fully used logical CPU.) I didn't perform a full statistical evaluation
>>> but the growth is way beyond any statistical fluctuation with one
>>> exception: 8-thread test of patched kernel showed values from 155.5% to
>>> 311.4%. Closer look shows that most of the CPU time was spent in softirq
>>> and running top in parallel with the test confirms that there are
>>> multiple ksofirqd threads running at 100% CPU. I had similar problem
>>> with earlier versions of my patch (work in progress, I still need to
>>> check some corner cases and write commit message explaining the logic)
>>
>> Great, if there is a better idea, maybe share the core idea first so
>> that we both can work on the that?
> 
> I'm not sure if it's really better but to minimize the false positives
> and unnecessary calls to __netif_schedule(), I replaced q->seqlock with
> an atomic combination of a "running" flag (which corresponds to current
> seqlock being locked) and a "drainers" count (number of other threads
> going to clean up the qdisc queue). This way we could keep track of them
> and get reliable information if another thread is going to run a cleanup
> after we leave the qdisc_run() critical section (so that there is no
> need to schedule).

It seems you are trying to match the skb enqueuing with the calling of
__qdisc_run() here, which is not reliable once dequeue batching is taken
into account, see try_bulk_dequeue_skb() or try_bulk_dequeue_skb_slow() in
dequeue_skb().

> 
>>> The biggest problem IMHO is that the loop in __qdisc_run() may finish
>>> without rescheduling not only when the qdisc queue is empty but also
>>> when the corresponding device Tx queue is stopped which devices tend to
>>> do whenever they cannot send any more packets out. Thus whenever
>>> __QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or
>>> frozen, we keep rescheduling the queue cleanup without any chance to
>>> progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all
>>> we need is another thready to fail the first spin_trylock() while device
>>> queue is stopped and qdisc queue not empty.
>>
>> Yes, We could just return false before doing the second spin_trylock() if
>> the netdev queue corresponding qdisc is stopped, and when the netdev queue
>> is restarted, __netif_schedule() is called again, see netif_tx_wake_queue().
>>
>> Maybe add a sch_qdisc_stopped() function and do the testting in qdisc_run_begin:
>>
>> if (dont_retry || sch_qdisc_stopped())
>> 	return false;
>>
>> bool sch_qdisc_stopped(struct Qdisc *q)
>> {
>> 	const struct netdev_queue *txq = q->dev_queue;
>>
>> 	if (!netif_xmit_frozen_or_stopped(txq))
>> 		return true;
>>
>> 	reture false;
>> }
>>
>> At least for qdisc with TCQ_F_ONETXQUEUE flags set is doable?
> 
> Either this or you can do the check in qdisc_run_end() - when the device
> queue is stopped or frozen, there is no need to schedule as we know it's
> going to be done when the flag is cleared again (and we cannot do
> anything until then anyway).
> 
>>> Another part of the problem may be that to avoid the race, the logic is
>>> too pessimistic: consider e.g. (dotted lines show "barriers" where
>>> ordering is important):
>>>
>>>     CPU A                            CPU B
>>>     spin_trylock() succeeds
>>>                                      pfifo_fast_enqueue()
>>>     ..................................................................
>>>     skb_array empty, exit loop
>>>                                      first spin_trylock() fails
>>>                                      set __QDISC_STATE_NEED_RESCHEDULE
>>>                                      second spin_trylock() fails
>>>     ..................................................................
>>>     spin_unlock()
>>>     call __netif_schedule()
>>>
>>> When we switch the order of spin_lock() on CPU A and second
>>> spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
>>> before CPU A tests it), we end up scheduling a queue cleanup even if
>>> there is already one running. And either of these is quite realistic.
>>
>> But I am not sure it is a good thing or bad thing when the above happen,
>> because it may be able to enable the tx batching?
> 
> In either of the two scenarios, we call __netif_schedule() to schedule
> a cleanup which does not do anything useful. In first, the qdics queue
> is empty so that either the scheduled cleanup finds it empty or there
> will be some new packets which would have their own cleanup. In second,
> scheduling a cleanup will not prevent the other thread from doing its
> own cleanup it already started.

The main idea is that the thread holding q->seqlock does as much dequeuing
as possible while other threads are enqueuing skbs, which seems to make it
possible to avoid the above case?

> 
>>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>>> index 44991ea..4953430 100644
>>>> --- a/net/sched/sch_generic.c
>>>> +++ b/net/sched/sch_generic.c
>>>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  {
>>>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>>  	struct sk_buff *skb = NULL;
>>>> +	bool need_retry = true;
>>>>  	int band;
>>>>  
>>>> +retry:
>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>>  		struct skb_array *q = band2list(priv, band);
>>>>  
>>>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  	}
>>>>  	if (likely(skb)) {
>>>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>>> +	} else if (need_retry &&
>>>> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
>>>> +				      &qdisc->state)) {
>>>> +		/* do another enqueuing after clearing the flag to
>>>> +		 * avoid calling __netif_schedule().
>>>> +		 */
>>>> +		smp_mb__after_atomic();
>>>> +		need_retry = false;
>>>> +
>>>> +		goto retry;
>>>>  	} else {
>>>>  		WRITE_ONCE(qdisc->empty, true);
>>>>  	}i
>>>
>>> Does the retry really provide significant improvement? IIUC it only
>>> helps if all of enqueue, first spin_trylock() failure and setting the
>>> __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
>>> skb_array empty and checking the flag. If enqueue happens before we
>>> check the array (and flag is set before the check), the retry is
>>> useless. If the flag is set after we check it, we don't catch it (no
>>> matter if the enqueue happened before or after we found skb_array
>>> empty).
>>
>> In odrder to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"
>> as much as possible, the __QDISC_STATE_NEED_RESCHEDULE need to be set as
>> as much as possible, so only clear __QDISC_STATE_NEED_RESCHEDULE when the
>> queue is empty.
> 
> This is what I'm worried about. We are trying to address a race
> condition which is extremely rare so we should avoid adding too much
> overhead to the normal use.
> 
>> And it has about 5% performance improvement.
> 
> OK then. It thought that it would do an unnecessary dequeue retry much
> more often than prevent an unnecessary __netif_schedule() but I did not
> run any special benchmark to check.

When the netdev tx queue is not stopped:
1. if __QDISC_STATE_NEED_RESCHEDULE is set and there are a lot of skbs to be
   dequeued, it is likely that __netif_schedule() has already been called in
   __qdisc_run(), so the __netif_schedule() called in qdisc_run_end() is a
   no-op.

2. if __QDISC_STATE_NEED_RESCHEDULE is set during the data race this patch is
   trying to fix, then we do need to call __netif_schedule().

3. otherwise __QDISC_STATE_NEED_RESCHEDULE is cleared by the
   test_and_clear_bit() added in pfifo_fast_dequeue().

The main point here is to delay clearing the __QDISC_STATE_NEED_RESCHEDULE bit
as much as possible so that the set_bit() and second spin_trylock() are
avoided.

> 
> Michal
> 
> .
>

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..e3f46eb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@  struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_NEED_RESCHEDULE,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,38 @@  static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		bool dont_retry = test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+					   &qdisc->state);
+
+		if (spin_trylock(&qdisc->seqlock))
+			goto nolock_empty;
+
+		/* If the flag is set before doing the spin_trylock() and
+		 * the above spin_trylock() returns false, it means another
+		 * cpu holding the lock will do the dequeuing for us, or it
+		 * will see the flag set after releasing the lock and
+		 * reschedule net_tx_action() to do the dequeuing.
+		 */
+		if (dont_retry)
+			return false;
+
+		/* We could do set_bit() before the first spin_trylock()
+		 * and avoid doing the second spin_trylock() completely, but
+		 * then we could have multiple cpus doing the set_bit(). Here
+		 * use dont_retry to avoid doing the set_bit() and the second
+		 * spin_trylock(), which gives a 5% performance improvement
+		 * over doing the set_bit() before the first spin_trylock().
+		 */
+		set_bit(__QDISC_STATE_NEED_RESCHEDULE,
+			&qdisc->state);
+
+		/* Retry again in case other CPU may not see the new flag
+		 * after it releases the lock at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
+
+nolock_empty:
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +207,13 @@  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+				      &qdisc->state)))
+			__netif_schedule(qdisc);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..4953430 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,16 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
+				      &qdisc->state)) {
+		/* do another dequeuing after clearing the flag to
+		 * avoid calling __netif_schedule().
+		 */
+		smp_mb__after_atomic();
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}