[net,v8,1/3] net: sched: fix packet stuck problem for lockless qdisc

Message ID: 1620959218-17250-2-git-send-email-linyunsheng@huawei.com
State: Superseded
Delegated to: Netdev Maintainers
Series: fix packet stuck problem for lockless qdisc

Checks

Context                          Check    Description
netdev/cover_letter              success
netdev/fixes_present             success
netdev/patch_count               success
netdev/tree_selection            success  Clearly marked for net
netdev/subject_prefix            success
netdev/cc_maintainers            success  CCed 15 of 15 maintainers
netdev/source_inline             success  Was 0 now: 0
netdev/verify_signedoff          success
netdev/module_param              success  Was 0 now: 0
netdev/build_32bit               success  Errors and warnings before: 3518 this patch: 3518
netdev/kdoc                      success  Errors and warnings before: 0 this patch: 0
netdev/verify_fixes              success
netdev/checkpatch                success  total: 0 errors, 0 warnings, 0 checks, 89 lines checked
netdev/build_allmodconfig_warn   success  Errors and warnings before: 3619 this patch: 3619
netdev/header_inline             success

Commit Message

Yunsheng Lin May 14, 2021, 2:26 a.m. UTC
Lockless qdisc has the following concurrency problem:
    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueues an skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
enqueue the skb after cpu0 has called dequeue_skb() and before
cpu0 calls qdisc_run_end().

Lockless qdisc has another concurrency problem when
tx_action is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .
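
For reference, what cpu0 does above corresponds roughly to the loop
below (a simplified sketch of net_tx_action(); the splicing of the
per-cpu output_queue into "head" and the root_lock taken for
lock-based qdiscs are omitted). The SCHED bit is cleared before the
qdisc is run, so once the trylock in qdisc_run_begin() fails,
nothing brings the softirq back for the skb enqueued by cpu2:

	while (head) {
		struct Qdisc *q = head;

		head = head->next_sched;

		smp_mb__before_atomic();
		clear_bit(__QDISC_STATE_SCHED, &q->state);

		/* qdisc_run() -> qdisc_run_begin() may fail to take
		 * qdisc->seqlock here and return without dequeuing.
		 */
		qdisc_run(q);
	}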

This patch fixes the above data race by:
1. If the first spin_trylock() returns false and STATE_MISSED is
   not set, set STATE_MISSED and retry another spin_trylock(), in
   case the CPU holding the lock releases it without seeing
   STATE_MISSED.
2. Reschedule if STATE_MISSED is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, STATE_MISSED is also set when cpu1 is at
the end of qdisc_run_end(), so tx_action will be rescheduled
again to dequeue the skb enqueued by cpu2.

Clear STATE_MISSED before retrying the dequeue when the dequeue
returns NULL, in order to reduce the overhead of the second
spin_trylock() and of the __netif_schedule() call.

Also clear STATE_MISSED before calling __netif_schedule()
at the end of qdisc_run_end() to avoid doing another round of
dequeuing in pfifo_fast_dequeue().

The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:

 threads   without this patch   with this patch    delta
    1            2.61Mpps           2.60Mpps       -0.3%
    2            3.97Mpps           3.82Mpps       -3.7%
    4            5.62Mpps           5.59Mpps       -0.5%
    8            2.78Mpps           2.77Mpps       -0.3%
   16            2.22Mpps           2.22Mpps       -0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V7: Clear STATE_MISSED before calling __netif_schedule()
    as suggested by Jakub.
V6: Check MISSED after the first trylock, and remove the
    atomic test-and-set for performance's sake, as suggested
    by Jakub.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED, mirroring
    NAPI's NAPIF_STATE_MISSED, and add Juergen's "Tested-by"
    tag since there is only renaming and typo fixing between
    V3 and V4.
V3: Fix a compile error and a few comment typos, remove the
    __QDISC_STATE_DEACTIVATED check, and update the
    performance data.
V2: Reduce the overhead added by the data race fix as much as
    possible.
---
 include/net/sch_generic.h | 35 ++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 19 +++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletion(-)

Comments

Cong Wang May 14, 2021, 11:36 p.m. UTC | #1
On Thu, May 13, 2021 at 7:27 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>  struct qdisc_size_table {
> @@ -159,8 +160,33 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>         if (qdisc->flags & TCQ_F_NOLOCK) {
> +               if (spin_trylock(&qdisc->seqlock))
> +                       goto nolock_empty;
> +
> +               /* If the MISSED flag is set, it means another thread has
> +                * set the MISSED flag before the second spin_trylock(), so
> +                * we can return false here to avoid multiple cpus doing
> +                * the set_bit() and the second spin_trylock() concurrently.
> +                */
> +               if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> +                       return false;
> +
> +               /* Set the MISSED flag before the second spin_trylock():
> +                * if the second spin_trylock() returns false, it means
> +                * the cpu holding the lock will either do the dequeuing
> +                * for us, or it will see the MISSED flag set after
> +                * releasing the lock and reschedule net_tx_action() to
> +                * do the dequeuing.
> +                */
> +               set_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +
> +               /* Retry in case the other CPU released the lock without
> +                * seeing the new flag at the end of qdisc_run_end().
> +                */
>                 if (!spin_trylock(&qdisc->seqlock))
>                         return false;
> +
> +nolock_empty:
>                 WRITE_ONCE(qdisc->empty, false);
>         } else if (qdisc_is_running(qdisc)) {
>                 return false;
> @@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>         write_seqcount_end(&qdisc->running);
> -       if (qdisc->flags & TCQ_F_NOLOCK)
> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>                 spin_unlock(&qdisc->seqlock);
> +
> +               if (unlikely(test_bit(__QDISC_STATE_MISSED,
> +                                     &qdisc->state))) {
> +                       clear_bit(__QDISC_STATE_MISSED, &qdisc->state);


We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
is not.


> +                       __netif_schedule(qdisc);
> +               }
> +       }
>  }
>
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..795d986 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>         struct sk_buff *skb = NULL;
> +       bool need_retry = true;
>         int band;
>
> +retry:
>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>                 struct skb_array *q = band2list(priv, band);
>
> @@ -652,6 +654,23 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>         }
>         if (likely(skb)) {
>                 qdisc_update_stats_at_dequeue(qdisc, skb);
> +       } else if (need_retry &&
> +                  test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> +               /* Delay clearing STATE_MISSED until here to reduce
> +                * the overhead of the second spin_trylock() in
> +                * qdisc_run_begin() and of the __netif_schedule()
> +                * call in qdisc_run_end().
> +                */
> +               clear_bit(__QDISC_STATE_MISSED, &qdisc->state);

Ditto.

> +
> +               /* Make sure dequeuing happens after clearing
> +                * STATE_MISSED.
> +                */
> +               smp_mb__after_atomic();
> +
> +               need_retry = false;
> +
> +               goto retry;

Two concurrent pfifo_fast_dequeue() calls could both retry when
they test __QDISC_STATE_MISSED at the same time and both get
true. Is this a problem?

Also, any reason why you want pfifo_fast to handle a generic
Qdisc flag? IOW, why not handle this logic in, for example,
qdisc_restart()?

Thanks.
Jakub Kicinski May 14, 2021, 11:39 p.m. UTC | #2
On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:
> > @@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >  static inline void qdisc_run_end(struct Qdisc *qdisc)
> >  {
> >         write_seqcount_end(&qdisc->running);
> > -       if (qdisc->flags & TCQ_F_NOLOCK)
> > +       if (qdisc->flags & TCQ_F_NOLOCK) {
> >                 spin_unlock(&qdisc->seqlock);
> > +
> > +               if (unlikely(test_bit(__QDISC_STATE_MISSED,
> > +                                     &qdisc->state))) {
> > +                       clear_bit(__QDISC_STATE_MISSED, &qdisc->state);  
> 
> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> is not.

It doesn't have to be atomic, right? I asked to split the test because
test_and_clear is a locked op on x86, test by itself is not.
Cong Wang May 14, 2021, 11:57 p.m. UTC | #3
On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:
> > > @@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> > >  static inline void qdisc_run_end(struct Qdisc *qdisc)
> > >  {
> > >         write_seqcount_end(&qdisc->running);
> > > -       if (qdisc->flags & TCQ_F_NOLOCK)
> > > +       if (qdisc->flags & TCQ_F_NOLOCK) {
> > >                 spin_unlock(&qdisc->seqlock);
> > > +
> > > +               if (unlikely(test_bit(__QDISC_STATE_MISSED,
> > > +                                     &qdisc->state))) {
> > > +                       clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> >
> > We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> > is not.
>
> It doesn't have to be atomic, right? I asked to split the test because
> test_and_clear is a locked op on x86, test by itself is not.

It depends on whether you expect the code under the true condition
to run once or multiple times, something like:

if (test_bit()) {
  clear_bit();
  // this code may run multiple times
}

With the atomic test_and_clear_bit(), it only runs once:

if (test_and_clear_bit()) {
  // this code runs once
}

This is why __netif_schedule() uses test_and_set_bit() instead of
test_bit()+set_bit().

Thanks.
Jakub Kicinski May 15, 2021, 12:17 a.m. UTC | #4
On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:
> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:  
>  [...]  
> > >
> > > We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> > > is not.  
> >
> > It doesn't have to be atomic, right? I asked to split the test because
> > test_and_clear is a locked op on x86, test by itself is not.  
> 
> It depends on whether you expect the code under the true condition
> to run once or multiple times, something like:
> 
> if (test_bit()) {
>   clear_bit();
>   // this code may run multiple times
> }
> 
> With the atomic test_and_clear_bit(), it only runs once:
> 
> if (test_and_clear_bit()) {
>   // this code runs once
> }
> 
> This is why __netif_schedule() uses test_and_set_bit() instead of
> test_bit()+set_bit().

Thanks, makes sense, so hopefully the MISSED-was-set case is not common
and we can depend on __netif_schedule() to DTRT, avoiding the atomic op
in the common case.
Yunsheng Lin May 15, 2021, 2:25 a.m. UTC | #5
On 2021/5/15 8:17, Jakub Kicinski wrote:
> On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:
>> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>
>>> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:  
>>  [...]  
>>>>
>>>> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
>>>> is not.  
>>>
>>> It doesn't have to be atomic, right? I asked to split the test because
>>> test_and_clear is a locked op on x86, test by itself is not.  
>>
>> It depends on whether you expect the code under the true condition
>> to run once or multiple times, something like:
>>
>> if (test_bit()) {
>>   clear_bit();
>>   // this code may run multiple times
>> }
>>
>> With the atomic test_and_clear_bit(), it only runs once:
>>
>> if (test_and_clear_bit()) {
>>   // this code runs once
>> }

I am not sure the above really matters when the test and clear
does not need to be atomic.

In order for the above to happen, MISSED has to be set between
the test and the clear, right?

>>
>> This is why __netif_schedule() uses test_and_set_bit() instead of
>> test_bit()+set_bit().

I think test_and_set_bit() is needed in __netif_schedule() mainly
because STATE_SCHED is also used to indicate if the qdisc is in
sd->output_queue, so it has to be atomic.

> 
> Thanks, makes sense, so hopefully the MISSED-was-set case is not common
> and we can depend on __netif_schedule() to DTRT, avoiding the atomic op
> in the common case.
Cong Wang May 18, 2021, 12:49 a.m. UTC | #6
On Fri, May 14, 2021 at 7:25 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/5/15 8:17, Jakub Kicinski wrote:
> > On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:
> >> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >>>
> >>> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:
> >>  [...]
> >>>>
> >>>> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> >>>> is not.
> >>>
> >>> It doesn't have to be atomic, right? I asked to split the test because
> >>> test_and_clear is a locked op on x86, test by itself is not.
> >>
> >> It depends on whether you expect the code under the true condition
> >> to run once or multiple times, something like:
> >>
> >> if (test_bit()) {
> >>   clear_bit();
> >>   // this code may run multiple times
> >> }
> >>
> >> With the atomic test_and_clear_bit(), it only runs once:
> >>
> >> if (test_and_clear_bit()) {
> >>   // this code runs once
> >> }
>
> I am not sure the above really matters when the test and clear
> does not need to be atomic.
>
> In order for the above to happen, MISSED has to be set between
> the test and the clear, right?

Nope, see the following:

// MISSED bit is already set
CPU0                            CPU1
if (test_bit(MISSED)) { // true
                                if (test_bit(MISSED)) { // also true
        clear_bit(MISSED);
        do_something();
                                        clear_bit(MISSED);
                                        do_something();
                                }
}

Now do_something() is called twice instead of once. This may or may
not be a problem, hence I asked this question.

>
> >>
> >> This is why __netif_schedule() uses test_and_set_bit() instead of
> >> test_bit()+set_bit().
>
> I think test_and_set_bit() is needed in __netif_schedule() mainly
> because STATE_SCHED is also used to indicate if the qdisc is in
> sd->output_queue, so it has to be atomic.

If you replace the "do_something()" above with __netif_reschedule(),
this is exactly my point. An entry cannot be inserted twice into a
list, hence it should never be called twice like above.

Thanks.
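
For context on the last point: __QDISC_STATE_SCHED doubles as "this
qdisc is already linked on the per-cpu sd->output_queue", so the
test-and-set really does have to be atomic there, otherwise two cpus
could link the same qdisc into the list twice. A rough sketch of
that path (simplified from net/core/dev.c; __netif_reschedule()
links the qdisc into sd->output_queue and raises NET_TX_SOFTIRQ):

	void __netif_schedule(struct Qdisc *q)
	{
		/* SCHED guards list membership: only the cpu that
		 * wins the atomic test-and-set links q into the
		 * per-cpu output_queue and raises the tx softirq.
		 */
		if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
			__netif_reschedule(q);
	}

STATE_MISSED, by contrast, only requests another dequeue pass; a
duplicated clear-and-retry costs a little extra work, and the call
into __netif_schedule() is itself protected by the test_and_set_bit()
above, which appears to be why the unlocked test_bit()+clear_bit()
pair is considered acceptable here.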

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..1e62551 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@  struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_MISSED,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,33 @@  static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		if (spin_trylock(&qdisc->seqlock))
+			goto nolock_empty;
+
+		/* If the MISSED flag is set, it means another thread has
+		 * set the MISSED flag before the second spin_trylock(), so
+		 * we can return false here to avoid multiple cpus doing
+		 * the set_bit() and the second spin_trylock() concurrently.
+		 */
+		if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+			return false;
+
+		/* Set the MISSED flag before the second spin_trylock():
+		 * if the second spin_trylock() returns false, it means
+		 * the cpu holding the lock will either do the dequeuing
+		 * for us, or it will see the MISSED flag set after
+		 * releasing the lock and reschedule net_tx_action() to
+		 * do the dequeuing.
+		 */
+		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Retry in case the other CPU released the lock without
+		 * seeing the new flag at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
+
+nolock_empty:
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +202,15 @@  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state))) {
+			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+			__netif_schedule(qdisc);
+		}
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..795d986 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,23 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
+		/* Delay clearing STATE_MISSED until here to reduce
+		 * the overhead of the second spin_trylock() in
+		 * qdisc_run_begin() and of the __netif_schedule()
+		 * call in qdisc_run_end().
+		 */
+		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Make sure dequeuing happens after clearing
+		 * STATE_MISSED.
+		 */
+		smp_mb__after_atomic();
+
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}