
[net,v2] net: sched: fix packet stuck problem for lockless qdisc

Message ID 1616552677-39016-1-git-send-email-linyunsheng@huawei.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net,v2] net: sched: fix packet stuck problem for lockless qdisc

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 15 of 15 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit fail Errors and warnings before: 4510 this patch: 4061
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 99 lines checked
netdev/build_allmodconfig_warn fail Errors and warnings before: 4750 this patch: 4313
netdev/header_inline success Link

Commit Message

Yunsheng Lin March 24, 2021, 2:24 a.m. UTC
Lockless qdisc has the below concurrency problem:
    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueues a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calls dequeue_skb() and before
cpu0 calls qdisc_run_end().

Lockless qdisc also has the below concurrency problem when
tx_action is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .

This patch fixes the above data race by:
1. Getting the flag before doing spin_trylock().
2. If the first spin_trylock() returns false and the flag was not
   set before the first spin_trylock(), setting the flag and retrying
   another spin_trylock(), in case the other CPU may not see the new
   flag after it releases the lock.
3. Rescheduling if the flag is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, the flag is also set when cpu1 is at the
end of qdisc_run_end(), so tx_action will be rescheduled again
to dequeue the skb enqueued by cpu2.

Only clear the flag before retrying a dequeue when the dequeue
returns NULL, in order to reduce the overhead of the above double
spin_trylock() and the __netif_schedule() call.
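
For reference, the flow described in the above three steps can be
sketched roughly as below. This is only a simplified illustration of
the patch further down: the nolock_qdisc_run_begin()/nolock_qdisc_run_end()
names are made up here and stand in for the TCQ_F_NOLOCK branches of
the real qdisc_run_begin()/qdisc_run_end(), with the seqcount,
qdisc->empty and __QDISC_STATE_DEACTIVATED handling omitted:

static inline bool nolock_qdisc_run_begin(struct Qdisc *qdisc)
{
	/* step 1: sample the flag before trying the lock */
	bool dont_retry = test_bit(__QDISC_STATE_NEED_RESCHEDULE,
				   &qdisc->state);

	if (spin_trylock(&qdisc->seqlock))
		return true;

	/* the lock holder either saw the flag already or will see it
	 * in qdisc_run_end() and reschedule, so it dequeues for us
	 */
	if (dont_retry)
		return false;

	/* step 2: set the flag and retry once, in case the lock holder
	 * released the lock before the new flag was visible to it
	 */
	set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

	return spin_trylock(&qdisc->seqlock);
}

static inline void nolock_qdisc_run_end(struct Qdisc *qdisc)
{
	spin_unlock(&qdisc->seqlock);

	/* step 3: if another cpu set the flag while we held the lock,
	 * reschedule so its skb gets dequeued
	 */
	if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
			      &qdisc->state)))
		__netif_schedule(qdisc);
}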

The performance impact of this patch, tested using pktgen and a
dummy netdev with the pfifo_fast qdisc attached:

 threads  without this patch  with this patch    delta
    1          2.6 Mpps          2.6 Mpps        +0.0%
    2          3.9 Mpps          3.8 Mpps        -2.5%
    4          5.6 Mpps          5.6 Mpps        -0.0%
    8          2.7 Mpps          2.8 Mpps        +3.7%
   16          2.2 Mpps          2.2 Mpps        +0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V2: Avoid the overhead of fixing the data race as much as
    possible.
---
 include/net/sch_generic.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 12 ++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)

Comments

Cong Wang March 24, 2021, 7:20 p.m. UTC | #1
On Tue, Mar 23, 2021 at 7:24 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> @@ -176,8 +207,23 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>         write_seqcount_end(&qdisc->running);
> -       if (qdisc->flags & TCQ_F_NOLOCK)
> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>                 spin_unlock(&qdisc->seqlock);
> +
> +               /* qdisc_run_end() is protected by RCU lock, and
> +                * qdisc reset will do a synchronize_net() after
> +                * setting __QDISC_STATE_DEACTIVATED, so testing
> +                * the below two bits separately should be fine.

Hmm, why is synchronize_net() after setting this bit fine? It could
still be flipped right after you test the RESCHEDULE bit.


> +                * For qdisc_run() in net_tx_action() case, we
> +                * really should provide rcu protection explicitly
> +                * for document purposes or PREEMPT_RCU.
> +                */
> +               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                                     &qdisc->state) &&
> +                            !test_bit(__QDISC_STATE_DEACTIVATED,
> +                                      &qdisc->state)))

Why do you want to test the __QDISC_STATE_DEACTIVATED bit at all?
dev_deactivate_many() will wait for those scheduled but being
deactivated, so what's the problem with scheduling it even with this bit set?

Thanks.
Yunsheng Lin March 25, 2021, 2:08 a.m. UTC | #2
On 2021/3/25 3:20, Cong Wang wrote:
> On Tue, Mar 23, 2021 at 7:24 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>> @@ -176,8 +207,23 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>>  {
>>         write_seqcount_end(&qdisc->running);
>> -       if (qdisc->flags & TCQ_F_NOLOCK)
>> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>>                 spin_unlock(&qdisc->seqlock);
>> +
>> +               /* qdisc_run_end() is protected by RCU lock, and
>> +                * qdisc reset will do a synchronize_net() after
>> +                * setting __QDISC_STATE_DEACTIVATED, so testing
>> +                * the below two bits separately should be fine.
> 
> Hmm, why is synchronize_net() after setting this bit fine? It could
> still be flipped right after you test the RESCHEDULE bit.

That depends on when it will be flipped again.

As I see it:
1. __QDISC_STATE_DEACTIVATED is set during the dev_deactivate() process,
   which should also wait for all processing related to "test_bit(
   __QDISC_STATE_NEED_RESCHEDULE, &q->state)" to finish, by calling
   synchronize_net() and checking some_qdisc_is_busy().

2. It is cleared during the dev_activate() process.

And dev_deactivate() and dev_activate() are protected by the RTNL lock,
or serialized by linkwatch.

> 
> 
>> +                * For qdisc_run() in net_tx_action() case, we
>> +                * really should provide rcu protection explicitly
>> +                * for document purposes or PREEMPT_RCU.
>> +                */
>> +               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
>> +                                     &qdisc->state) &&
>> +                            !test_bit(__QDISC_STATE_DEACTIVATED,
>> +                                      &qdisc->state)))
> 
> Why do you want to test the __QDISC_STATE_DEACTIVATED bit at all?
> dev_deactivate_many() will wait for those scheduled but being
> deactivated, so what's the problem with scheduling it even with this bit set?

The problem I tried to fix is:

  CPU0(calling dev_deactivate)   CPU1(calling qdisc_run_end)   CPU2(calling tx_action)
             .                       __netif_schedule()                   .
             .                     set __QDISC_STATE_SCHED                .
             .                                .                           .
set __QDISC_STATE_DEACTIVATED                 .                           .
     synchronize_net()                        .                           .
             .                                .                           .
             .                                .              clear __QDISC_STATE_SCHED
             .                                .                           .
 some_qdisc_is_busy() return false            .                           .
             .                                .                           .
             .                                .                      qdisc_run()

some_qdisc_is_busy() checks whether the qdisc is busy by testing
__QDISC_STATE_SCHED and spin_is_locked(&qdisc->seqlock) for a lockless
qdisc. It returns false for CPU0 because CPU2 has already cleared
__QDISC_STATE_SCHED but has not taken qdisc->seqlock yet, even though
the qdisc is clearly still busy when qdisc_run() is run by CPU2 later.
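
To illustrate, the lockless part of that check boils down to roughly
the below (a simplified sketch only, not the exact some_qdisc_is_busy()
code, which also handles the qdisc_lock()-protected case):

static bool lockless_qdisc_is_busy(struct Qdisc *q)
{
	/* busy if a net_tx_action() run is still scheduled for the
	 * qdisc, or if someone currently holds its seqlock
	 */
	return test_bit(__QDISC_STATE_SCHED, &q->state) ||
	       spin_is_locked(&q->seqlock);
}

The race window is exactly the gap after CPU2 has cleared
__QDISC_STATE_SCHED in net_tx_action() and before it takes
qdisc->seqlock in qdisc_run_begin(): in that window both conditions
above are false although a qdisc_run() is still pending.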

So you are right, testing __QDISC_STATE_DEACTIVATED does not completely
solve the above data race, and there are also __netif_schedule() calls
from dev_requeue_skb() and __qdisc_run() which need the same fix.

So I will remove the __QDISC_STATE_DEACTIVATED test from this patch
first, and deal with it later.

> 
> Thanks.

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..09a755d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@  struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_NEED_RESCHEDULE,
 };
 
 struct qdisc_size_table {
@@ -159,12 +160,42 @@  static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		bool dont_retry = test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+					   &qdisc->state);
+
+		if (spin_trylock(&qdisc->seqlock))
+			goto out;
+
+		/* If the flag is set before doing the spin_trylock() and
+		 * the above spin_trylock() returns false, it means the other
+		 * cpu holding the lock will do the dequeuing for us, or it
+		 * will see the flag set after releasing the lock and
+		 * reschedule the net_tx_action() to do the dequeuing.
+		 */
+		if (dont_retry)
+			return false;
+
+		/* We could do the set_bit() before the first spin_trylock()
+		 * and avoid the second spin_trylock() completely, but then
+		 * multiple cpus could be doing the test_bit(). Here use
+		 * dont_retry to avoid the test_bit() and the second
+		 * spin_trylock(), which gives a 5% performance improvement
+		 * over doing the set_bit() before the first spin_trylock().
+		 */
+		set_bit(__QDISC_STATE_NEED_RESCHEDULE,
+			&qdisc->state);
+
+		/* Retry again in case other CPU may not see the new flag
+		 * after it releases the lock at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
 	}
+
+out:
 	/* Variant of write_seqcount_begin() telling lockdep a trylock
 	 * was attempted.
 	 */
@@ -176,8 +207,23 @@  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		/* qdisc_run_end() is protected by RCU lock, and
+		 * qdisc reset will do a synchronize_net() after
+		 * setting __QDISC_STATE_DEACTIVATED, so testing
+		 * the below two bits separately should be fine.
+		 * For qdisc_run() in net_tx_action() case, we
+		 * really should provide rcu protection explicitly
+		 * for document purposes or PREEMPT_RCU.
+		 */
+		if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+				      &qdisc->state) &&
+			     !test_bit(__QDISC_STATE_DEACTIVATED,
+				       &qdisc->state)))
+			__netif_schedule(qdisc);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..7e3426b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,16 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
+				      &qdisc->state)) {
+		/* do another round of dequeuing after clearing the flag
+		 * to avoid calling __netif_schedule().
+		 */
+		smp_mb__after_atomic();
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}