diff mbox series

[net-next] net: dev: Always serialize on Qdisc::busylock in __dev_xmit_skb() on PREEMPT_RT.

Message ID YbN1OL0I1ja4Fwkb@linutronix.de (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next] net: dev: Always serialize on Qdisc::busylock in __dev_xmit_skb() on PREEMPT_RT. | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 7 this patch: 7
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/build_clang success Errors and warnings before: 20 this patch: 20
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 9 this patch: 9
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 17 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Sebastian Andrzej Siewior Dec. 10, 2021, 3:41 p.m. UTC
The root-lock is dropped before dev_hard_start_xmit() is invoked and after
setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
by another sender/task with a higher priority then this new sender won't
be able to submit packets to the NIC directly instead they will be
enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
is scheduled again and finishes the job.

By serializing every task on the ->busylock then the task will be
preempted by a sender only after the Qdisc has no owner.

Always serialize on the busylock on PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/dev.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Paolo Abeni Dec. 10, 2021, 4:35 p.m. UTC | #1
On Fri, 2021-12-10 at 16:41 +0100, Sebastian Andrzej Siewior wrote:
> The root-lock is dropped before dev_hard_start_xmit() is invoked and after
> setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
> by another sender/task with a higher priority then this new sender won't
> be able to submit packets to the NIC directly instead they will be
> enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
> is scheduled again and finishes the job.
> 
> By serializing every task on the ->busylock then the task will be
> preempted by a sender only after the Qdisc has no owner.
> 
> Always serialize on the busylock on PREEMPT_RT.

Not sure how much is relevant in the RT context, but this should impact
the xmit tput in a relevant, negative way.

If I read correctly, you use the busylock to trigger priority ceiling
on each sender. I'm wondering if there are other alternative ways (no
contended lock, just some RT specific annotation) to mark a whole
section of code for priority ceiling ?!?

Thanks!

Paolo
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/dev.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2a352e668d103..4a701cf2e2c10 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3836,8 +3836,16 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	 * separate lock before trying to get qdisc main lock.
>  	 * This permits qdisc->running owner to get the lock more
>  	 * often and dequeue packets faster.
> +	 * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> +	 * and then other tasks will only enqueue packets. The packets will be
> +	 * sent after the qdisc owner is scheduled again. To prevent this
> +	 * scenario the task always serialize on the lock.
>  	 */
> -	contended = qdisc_is_running(q);
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		contended = qdisc_is_running(q);
> +	else
> +		contended = true;
> +
>  	if (unlikely(contended))
>  		spin_lock(&q->busylock);
>
Sebastian Andrzej Siewior Dec. 10, 2021, 4:46 p.m. UTC | #2
On 2021-12-10 17:35:21 [+0100], Paolo Abeni wrote:
> On Fri, 2021-12-10 at 16:41 +0100, Sebastian Andrzej Siewior wrote:
> > The root-lock is dropped before dev_hard_start_xmit() is invoked and after
> > setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
> > by another sender/task with a higher priority then this new sender won't
> > be able to submit packets to the NIC directly instead they will be
> > enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
> > is scheduled again and finishes the job.
> > 
> > By serializing every task on the ->busylock then the task will be
> > preempted by a sender only after the Qdisc has no owner.
> > 
> > Always serialize on the busylock on PREEMPT_RT.
> 
> Not sure how much is relevant in the RT context, but this should impact
> the xmit tput in a relevant, negative way.

Negative because everyone blocks on lock and transmits packets directly
instead of adding it to the queue and leaving for more?

> If I read correctly, you use the busylock to trigger priority ceiling
> on each sender. I'm wondering if there are other alternative ways (no
> contended lock, just some RT specific annotation) to mark a whole
> section of code for priority ceiling ?!?

priority ceiling as you call it, happens always with the help of a lock.
The root_lock is dropped in sch_direct_xmit().
qdisc_run_begin() sets only a bit with no owner association.
If I know why the busy-lock bad than I could add another one. The
important part is force the sende out of the section so the task with
the higher priority can send packets instead of queueing them only.

> Thanks!
> 
> Paolo

Sebastian
Eric Dumazet Dec. 10, 2021, 4:52 p.m. UTC | #3
On Fri, Dec 10, 2021 at 8:46 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2021-12-10 17:35:21 [+0100], Paolo Abeni wrote:
> > On Fri, 2021-12-10 at 16:41 +0100, Sebastian Andrzej Siewior wrote:
> > > The root-lock is dropped before dev_hard_start_xmit() is invoked and after
> > > setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
> > > by another sender/task with a higher priority then this new sender won't
> > > be able to submit packets to the NIC directly instead they will be
> > > enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
> > > is scheduled again and finishes the job.
> > >
> > > By serializing every task on the ->busylock then the task will be
> > > preempted by a sender only after the Qdisc has no owner.
> > >
> > > Always serialize on the busylock on PREEMPT_RT.
> >
> > Not sure how much is relevant in the RT context, but this should impact
> > the xmit tput in a relevant, negative way.
>
> Negative because everyone blocks on lock and transmits packets directly
> instead of adding it to the queue and leaving for more?
>
> > If I read correctly, you use the busylock to trigger priority ceiling
> > on each sender. I'm wondering if there are other alternative ways (no
> > contended lock, just some RT specific annotation) to mark a whole
> > section of code for priority ceiling ?!?
>
> priority ceiling as you call it, happens always with the help of a lock.
> The root_lock is dropped in sch_direct_xmit().
> qdisc_run_begin() sets only a bit with no owner association.
> If I know why the busy-lock bad than I could add another one. The
> important part is force the sende out of the section so the task with
> the higher priority can send packets instead of queueing them only.

Problem is that if you have a qdisc, qdisc_dequeue() will not know that the
packet queued by high prio thread needs to shortcut all prior packets and
be sent right away.

Because of that, it seems just better that a high prio thread returns
immediately and let the dirty work (dequeue packets and send them to devices)
be done by other threads ?

>
> > Thanks!
> >
> > Paolo
>
> Sebastian
Sebastian Andrzej Siewior Dec. 10, 2021, 5:43 p.m. UTC | #4
On 2021-12-10 08:52:55 [-0800], Eric Dumazet wrote:
> 
> Problem is that if you have a qdisc, qdisc_dequeue() will not know that the
> packet queued by high prio thread needs to shortcut all prior packets and
> be sent right away.

That is correct.

> Because of that, it seems just better that a high prio thread returns
> immediately and let the dirty work (dequeue packets and send them to devices)
> be done by other threads ?

Ah okay. Lets say you have a task sending a packet every 100ms. And you
want that packet to leave the NIC asap. This is your task with
SCHED_FIFO 70 priority. IMPORTANT otherwise.
Lets say you have a ssh session on the same NIC running at SCHED_OTHER.
Nothing important obviously.

The NIC is idle, ssh sends a keep-alive, 80 bytes, one skb. 
    SCHED_OTHER, SSH, CPU0                                    FIFO 70, IMPORTANT, CPU1
  __dev_xmit_skb()
    -> spin_lock(root_lock);
    -> qdisc_run_begin()
      -> __test_and_set_bit(__QDISC_STATE2_RUNNING);
    -> sch_direct_xmit()
        *PREEMPT* (by something else)
                                                            __dev_xmit_skb()
                                                            -> spin_lock(root_lock);
                                                           <--- PI-boost
      -> spin_unlock(root_lock);   
         *PREEMPT* again     ---> PI boost ends after unlock
                                                            -> qdisc_run_begin()
                                                              -> __test_and_set_bit(__QDISC_STATE2_RUNNING) (nope);
                                                             -> dev_qdisc_enqueue()
                                                             -> qdisc_run_begin() returns false
                                                            -> spin_unlock(root_lock);

at this point, we don't return to the SSH thread, instead other RT tasks
are running. That means that the SSH thread, which owns the qdisc and
the NIC, remains preempted and the NIC idle.

300ms pass by, the IMPORTANT thread queued two additional packets.
Now, a few usecs later, all SCHED_FIFO tasks go idle on CPU0, and the
SSH thread can continue with dev_hard_start_xmit() and send its packet, 
    -> dev_hard_start_xmit()
    __qdisc_run() (yes, and the three additional packets)
    …

By always acquiring the busy-lock, in the previous scenario, the task
with the high-priority allows the low-priority task to run until
qdisc_run_end() at which point it releases the qdisc so the
high-priority task can actually send packets.

One side note: This scenario requires two CPUs (one for low priority
task that gets preempted and owns the qdisc, and one one for high
priority task that wants to sends packets).
With one CPU only the local_bh_disable() invocation would ensure that
low-priority thread left the bh-disable section entirely.

> > > Thanks!
> > >
> > > Paolo
> >

Sebastian
Jakub Kicinski Dec. 11, 2021, 4:32 a.m. UTC | #5
On Fri, 10 Dec 2021 16:41:44 +0100 Sebastian Andrzej Siewior wrote:
> -	contended = qdisc_is_running(q);
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		contended = qdisc_is_running(q);
> +	else
> +		contended = true;

Why not:

	contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
Paolo Abeni Dec. 13, 2021, 10:27 a.m. UTC | #6
On Fri, 2021-12-10 at 17:46 +0100, Sebastian Andrzej Siewior wrote:
> On 2021-12-10 17:35:21 [+0100], Paolo Abeni wrote:
> > On Fri, 2021-12-10 at 16:41 +0100, Sebastian Andrzej Siewior wrote:
> > > The root-lock is dropped before dev_hard_start_xmit() is invoked and after
> > > setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
> > > by another sender/task with a higher priority then this new sender won't
> > > be able to submit packets to the NIC directly instead they will be
> > > enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
> > > is scheduled again and finishes the job.
> > > 
> > > By serializing every task on the ->busylock then the task will be
> > > preempted by a sender only after the Qdisc has no owner.
> > > 
> > > Always serialize on the busylock on PREEMPT_RT.
> > 
> > Not sure how much is relevant in the RT context, but this should impact
> > the xmit tput in a relevant, negative way.
> 
> Negative because everyone blocks on lock and transmits packets directly
> instead of adding it to the queue and leaving for more?

In the uncontended scenario this adds an atomic operation and increases
the data set size. Thinking again about it, in RT build this is much
less noticeable (I forgot spinlock are actually mutex in RT...)

> > If I read correctly, you use the busylock to trigger priority
> > ceiling
> > on each sender. I'm wondering if there are other alternative ways
> > (no
> > contended lock, just some RT specific annotation) to mark a whole
> > section of code for priority ceiling ?!?
> 
> priority ceiling as you call it, happens always with the help of a
> lock.
> The root_lock is dropped in sch_direct_xmit().
> qdisc_run_begin() sets only a bit with no owner association.
> If I know why the busy-lock bad than I could add another one. The
> important part is force the sende out of the section so the task with
> the higher priority can send packets instead of queueing them only.

I thought about adding a local_lock() instead, but that will not solve.

I don't see a better solution than this patch, so I'm ok with that.
Please follow Jakub's suggestion to clean the code a bit.

Thanks!

Paolo
Sebastian Andrzej Siewior Dec. 13, 2021, 10:45 a.m. UTC | #7
On 2021-12-10 20:32:56 [-0800], Jakub Kicinski wrote:
> On Fri, 10 Dec 2021 16:41:44 +0100 Sebastian Andrzej Siewior wrote:
> > -	contended = qdisc_is_running(q);
> > +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		contended = qdisc_is_running(q);
> > +	else
> > +		contended = true;
> 
> Why not:
> 
> 	contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);

I could do that. But I would swap the two arguments so that the
IS_ENABLED macro comes first and if true qdisc_is_running() is optimized
away.

Sebastian
Jakub Kicinski Dec. 13, 2021, 4:15 p.m. UTC | #8
On Mon, 13 Dec 2021 11:45:59 +0100 Sebastian Andrzej Siewior wrote:
> On 2021-12-10 20:32:56 [-0800], Jakub Kicinski wrote:
> > On Fri, 10 Dec 2021 16:41:44 +0100 Sebastian Andrzej Siewior wrote:  
> > > -	contended = qdisc_is_running(q);
> > > +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > +		contended = qdisc_is_running(q);
> > > +	else
> > > +		contended = true;  
> > 
> > Why not:
> > 
> > 	contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);  
> 
> I could do that. But I would swap the two arguments so that the
> IS_ENABLED macro comes first and if true qdisc_is_running() is optimized
> away.

FWIW I disagree. My version was more readable. The sprinkling of the
PREEMPT_RT tosh is getting a little annoying. Trying to regress the 
clarity of the code should be top of mind here.
Sebastian Andrzej Siewior Dec. 13, 2021, 4:18 p.m. UTC | #9
On 2021-12-13 08:15:56 [-0800], Jakub Kicinski wrote:
> FWIW I disagree. My version was more readable. The sprinkling of the
> PREEMPT_RT tosh is getting a little annoying. Trying to regress the 
> clarity of the code should be top of mind here.

No worries. Let me spin a new version with a swap.

Sebastian
Jakub Kicinski Dec. 13, 2021, 4:20 p.m. UTC | #10
On Mon, 13 Dec 2021 17:18:05 +0100 Sebastian Andrzej Siewior wrote:
> On 2021-12-13 08:15:56 [-0800], Jakub Kicinski wrote:
> > FWIW I disagree. My version was more readable. The sprinkling of the
> > PREEMPT_RT tosh is getting a little annoying. Trying to regress the 
> > clarity of the code should be top of mind here.  
> 
> No worries. Let me spin a new version with a swap.

Dave applied your previous version, it's not a big deal, we'll live.
Sebastian Andrzej Siewior Dec. 13, 2021, 4:32 p.m. UTC | #11
On 2021-12-13 08:20:46 [-0800], Jakub Kicinski wrote:
> On Mon, 13 Dec 2021 17:18:05 +0100 Sebastian Andrzej Siewior wrote:
> > On 2021-12-13 08:15:56 [-0800], Jakub Kicinski wrote:
> > > FWIW I disagree. My version was more readable. The sprinkling of the
> > > PREEMPT_RT tosh is getting a little annoying. Trying to regress the 
> > > clarity of the code should be top of mind here.  
> > 
> > No worries. Let me spin a new version with a swap.
> 
> Dave applied your previous version, it's not a big deal, we'll live.

but happy. Not just live. I've sent a follow-up. I need the network
department in a happy mood :)

Sebastian
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 2a352e668d103..4a701cf2e2c10 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3836,8 +3836,16 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	 * separate lock before trying to get qdisc main lock.
 	 * This permits qdisc->running owner to get the lock more
 	 * often and dequeue packets faster.
+	 * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
+	 * and then other tasks will only enqueue packets. The packets will be
+	 * sent after the qdisc owner is scheduled again. To prevent this
+	 * scenario the task always serialize on the lock.
 	 */
-	contended = qdisc_is_running(q);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		contended = qdisc_is_running(q);
+	else
+		contended = true;
+
 	if (unlikely(contended))
 		spin_lock(&q->busylock);