diff mbox series

packet stuck in qdisc : patch proposal

Message ID 1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series packet stuck in qdisc : patch proposal | expand

Checks

Context Check Description
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1478 this patch: 1478
netdev/cc_maintainers warning 2 maintainers not CCed: pabeni@redhat.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 174 this patch: 174
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff fail author Signed-off-by missing
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1491 this patch: 1491
netdev/checkpatch warning WARNING: 'enqueing' may be misspelled - perhaps 'enqueuing'? WARNING: Unknown commit id '917f7ff2b0f59d721d11f983af1f46c1cd74130a', maybe rebased or not pulled? WARNING: memory barrier without comment
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Vincent Ray May 23, 2022, 1:54 p.m. UTC
Hi Yunsheng, all,

I finally spotted the bug that caused (nvme-)tcp packets to remain stuck in the qdisc once in a while.
It's in qdisc_run_begin within sch_generic.h : 

smp_mb__before_atomic();
 
// [comments]

if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
	return false;

should be 

smp_mb();

// [comments]

if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
	return false;

I have written a more detailed explanation in the attached patch, including a race example, but in short that's because test_bit() is not an atomic operation.
Therefore it does not give you any ordering guarantee on any architecture.
And neither does spin_trylock() called at the beginning of qdisc_run_begin() when it does not grab the lock...
So test_bit() may be reordered whith a preceding enqueue(), leading to a possible race in the dialog with pfifo_fast_dequeue().
We may then end up with a skbuff pushed "silently" to the qdisc (MISSED cleared, nobody aware that there is something in the backlog).
Then the cores pushing new skbuffs to the qdisc may all bypass it for an arbitrary amount of time, leaving the enqueued skbuff stuck in the backlog.

I believe the reason for which you could not reproduce the issue on ARM64 is that, on that architecture, smp_mb__before_atomic() will translate to a memory barrier.
It does not on x86 (turned into a NOP) because you're supposed to use this function just before an atomic operation, and atomic operations themselves provide full ordering effects on x86.

I think the code has been flawed for some time but the introduction of a (true) bypass policy in 5.14 made it more visible, because without this, the "victim" skbuff does not stay very long in the backlog : it is bound to pe popped by the next core executing __qdic_run().

In my setup, with our use case (16 (virtual) cpus in a VM shooting 4KB buffers with fio through a -i4 nvme-tcp connection to a target), I did not notice any performance degradation using smp_mb() in place of smp_mb__before_atomic(), but of course that does not mean it cannot happen in other configs.

I think Guoju's patch is also correct and necessary so that both patches, his and mine, should be applied "asap" to the kernel.
A difference between Guoju's race and "mine" is that, in his case, the MISSED bit will be set : though no one will take care of the skbuff immediately, the next cpu pushing to the qdisc (if ever ...) will notice and dequeue it (so Guoju's race probably happens in my use case too but is not noticeable).

Finally, given the necessity of these two new full barriers in the code, I wonder if the whole lockless (+ bypass) thing should be reconsidered.
At least, I think general performance tests should be run to check that lockless qdics still outperform locked qdiscs, in both bypassable and not-bypassable modes.
    
More generally, I found this piece of code quite tricky and error-prone, as evidenced by the numerous fixes it went through in the recent history. 
I believe most of this complexity comes from the lockless qdisc handling in itself, but of course the addition of the bypass support does not really help ;-)
I'm a linux kernel beginner however, so I'll let more experienced programmers decide about that :-)

I've made sure that, with this patch, no stuck packets happened any more on both v5.15 and v5.18-rc2 (whereas without the patch, numerous occurrences of stuck packets are visible).
I'm quite confident it will apply to any concerned version, that is from 5.14 (or before) to mainline.

Can you please tell me :

1) if you agree with this ?

2) how to proceed to push this patch (and Guoju's) for quick integration into the mainline ?

NB : an alternative fix (which I've tested OK too) would be to simply remove the

if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
	return false;

code path, but I have no clue if this would be better or worse than the present patch in terms of performance.
    
Thank you, best regards,

V

Comments

Eric Dumazet May 24, 2022, 2:55 a.m. UTC | #1
On 5/23/22 06:54, Vincent Ray wrote:
> Hi Yunsheng, all,
>
> I finally spotted the bug that caused (nvme-)tcp packets to remain stuck in the qdisc once in a while.
> It's in qdisc_run_begin within sch_generic.h :
>
> smp_mb__before_atomic();
>   
> // [comments]
>
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
>
> should be
>
> smp_mb();
>
> // [comments]
>
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
>
> I have written a more detailed explanation in the attached patch, including a race example, but in short that's because test_bit() is not an atomic operation.
> Therefore it does not give you any ordering guarantee on any architecture.
> And neither does spin_trylock() called at the beginning of qdisc_run_begin() when it does not grab the lock...
> So test_bit() may be reordered whith a preceding enqueue(), leading to a possible race in the dialog with pfifo_fast_dequeue().
> We may then end up with a skbuff pushed "silently" to the qdisc (MISSED cleared, nobody aware that there is something in the backlog).
> Then the cores pushing new skbuffs to the qdisc may all bypass it for an arbitrary amount of time, leaving the enqueued skbuff stuck in the backlog.
>
> I believe the reason for which you could not reproduce the issue on ARM64 is that, on that architecture, smp_mb__before_atomic() will translate to a memory barrier.
> It does not on x86 (turned into a NOP) because you're supposed to use this function just before an atomic operation, and atomic operations themselves provide full ordering effects on x86.
>
> I think the code has been flawed for some time but the introduction of a (true) bypass policy in 5.14 made it more visible, because without this, the "victim" skbuff does not stay very long in the backlog : it is bound to pe popped by the next core executing __qdic_run().
>
> In my setup, with our use case (16 (virtual) cpus in a VM shooting 4KB buffers with fio through a -i4 nvme-tcp connection to a target), I did not notice any performance degradation using smp_mb() in place of smp_mb__before_atomic(), but of course that does not mean it cannot happen in other configs.
>
> I think Guoju's patch is also correct and necessary so that both patches, his and mine, should be applied "asap" to the kernel.
> A difference between Guoju's race and "mine" is that, in his case, the MISSED bit will be set : though no one will take care of the skbuff immediately, the next cpu pushing to the qdisc (if ever ...) will notice and dequeue it (so Guoju's race probably happens in my use case too but is not noticeable).
>
> Finally, given the necessity of these two new full barriers in the code, I wonder if the whole lockless (+ bypass) thing should be reconsidered.
> At least, I think general performance tests should be run to check that lockless qdics still outperform locked qdiscs, in both bypassable and not-bypassable modes.
>      
> More generally, I found this piece of code quite tricky and error-prone, as evidenced by the numerous fixes it went through in the recent history.
> I believe most of this complexity comes from the lockless qdisc handling in itself, but of course the addition of the bypass support does not really help ;-)
> I'm a linux kernel beginner however, so I'll let more experienced programmers decide about that :-)
>
> I've made sure that, with this patch, no stuck packets happened any more on both v5.15 and v5.18-rc2 (whereas without the patch, numerous occurrences of stuck packets are visible).
> I'm quite confident it will apply to any concerned version, that is from 5.14 (or before) to mainline.
>
> Can you please tell me :
>
> 1) if you agree with this ?
>
> 2) how to proceed to push this patch (and Guoju's) for quick integration into the mainline ?
>
> NB : an alternative fix (which I've tested OK too) would be to simply remove the
>
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
>
> code path, but I have no clue if this would be better or worse than the present patch in terms of performance.
>      
> Thank you, best regards,
>
> V


We keep adding code and comments, this is quite silly.

test_and_set_bit() is exactly what we need.


diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 
9bab396c1f3ba3d143de4d63f4142cff3c9b9f3e..9d1b448c0dfc3925967635f3390b884a4ef7c55a 
100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -187,35 +187,9 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
                 if (spin_trylock(&qdisc->seqlock))
                         return true;

-               /* Paired with smp_mb__after_atomic() to make sure
-                * STATE_MISSED checking is synchronized with clearing
-                * in pfifo_fast_dequeue().
-                */
-               smp_mb__before_atomic();
-
-               /* If the MISSED flag is set, it means other thread has
-                * set the MISSED flag before second spin_trylock(), so
-                * we can return false here to avoid multi cpus doing
-                * the set_bit() and second spin_trylock() concurrently.
-                */
-               if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+               if (test_and_set_bit(__QDISC_STATE_MISSED, &qdisc->state))
                         return false;

-               /* Set the MISSED flag before the second spin_trylock(),
-                * if the second spin_trylock() return false, it means
-                * other cpu holding the lock will do dequeuing for us
-                * or it will see the MISSED flag set after releasing
-                * lock and reschedule the net_tx_action() to do the
-                * dequeuing.
-                */
-               set_bit(__QDISC_STATE_MISSED, &qdisc->state);
-
-               /* spin_trylock() only has load-acquire semantic, so use
-                * smp_mb__after_atomic() to ensure STATE_MISSED is set
-                * before doing the second spin_trylock().
-                */
-               smp_mb__after_atomic();
-
                 /* Retry again in case other CPU may not see the new flag
                  * after it releases the lock at the end of 
qdisc_run_end().
                  */
Yunsheng Lin May 24, 2022, 6:43 a.m. UTC | #2
On 2022/5/23 21:54, Vincent Ray wrote:
> Hi Yunsheng, all,

+cc Will & Eric

> 
> I finally spotted the bug that caused (nvme-)tcp packets to remain stuck in the qdisc once in a while.
> It's in qdisc_run_begin within sch_generic.h : 
> 
> smp_mb__before_atomic();
>  
> // [comments]
> 
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
> 
> should be 
> 
> smp_mb();
> 
> // [comments]
> 
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
> 
> I have written a more detailed explanation in the attached patch, including a race example, but in short that's because test_bit() is not an atomic operation.
> Therefore it does not give you any ordering guarantee on any architecture.
> And neither does spin_trylock() called at the beginning of qdisc_run_begin() when it does not grab the lock...
> So test_bit() may be reordered whith a preceding enqueue(), leading to a possible race in the dialog with pfifo_fast_dequeue().
> We may then end up with a skbuff pushed "silently" to the qdisc (MISSED cleared, nobody aware that there is something in the backlog).
> Then the cores pushing new skbuffs to the qdisc may all bypass it for an arbitrary amount of time, leaving the enqueued skbuff stuck in the backlog.
> 
> I believe the reason for which you could not reproduce the issue on ARM64 is that, on that architecture, smp_mb__before_atomic() will translate to a memory barrier.
> It does not on x86 (turned into a NOP) because you're supposed to use this function just before an atomic operation, and atomic operations themselves provide full ordering effects on x86.
> 
> I think the code has been flawed for some time but the introduction of a (true) bypass policy in 5.14 made it more visible, because without this, the "victim" skbuff does not stay very long in the backlog : it is bound to pe popped by the next core executing __qdic_run().
> 
> In my setup, with our use case (16 (virtual) cpus in a VM shooting 4KB buffers with fio through a -i4 nvme-tcp connection to a target), I did not notice any performance degradation using smp_mb() in place of smp_mb__before_atomic(), but of course that does not mean it cannot happen in other configs.
> 
> I think Guoju's patch is also correct and necessary so that both patches, his and mine, should be applied "asap" to the kernel.
> A difference between Guoju's race and "mine" is that, in his case, the MISSED bit will be set : though no one will take care of the skbuff immediately, the next cpu pushing to the qdisc (if ever ...) will notice and dequeue it (so Guoju's race probably happens in my use case too but is not noticeable).
> 
> Finally, given the necessity of these two new full barriers in the code, I wonder if the whole lockless (+ bypass) thing should be reconsidered.
> At least, I think general performance tests should be run to check that lockless qdics still outperform locked qdiscs, in both bypassable and not-bypassable modes.
>     
> More generally, I found this piece of code quite tricky and error-prone, as evidenced by the numerous fixes it went through in the recent history. 
> I believe most of this complexity comes from the lockless qdisc handling in itself, but of course the addition of the bypass support does not really help ;-)
> I'm a linux kernel beginner however, so I'll let more experienced programmers decide about that :-)
> 
> I've made sure that, with this patch, no stuck packets happened any more on both v5.15 and v5.18-rc2 (whereas without the patch, numerous occurrences of stuck packets are visible).
> I'm quite confident it will apply to any concerned version, that is from 5.14 (or before) to mainline.
> 
> Can you please tell me :
> 
> 1) if you agree with this ?
> 
> 2) how to proceed to push this patch (and Guoju's) for quick integration into the mainline ?
> 
> NB : an alternative fix (which I've tested OK too) would be to simply remove the
> 
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
> 
> code path, but I have no clue if this would be better or worse than the present patch in terms of performance.
>     
> Thank you, best regards

Hi, thanks for the testing and debugging. So the main problem is that
test_bit() is not an atomic operation, so using smp_mb __*_atomic() is
not really helping, right?

In that case we might only need to change smp_mb__before_atomic() to
smp_rmb() in qdisc_run_begin(), as we only need to order test_bit()
after set_bit() and clear_bit(), which is a read/write ordering?

By the way,  Guoju need to ensure ordering between spin_unlock() and
test_bit() in qdisc_run_end(), which is a write/read ordering, so
smp_mb() is used.


> 
> V
> 
>
Vincent Ray May 24, 2022, 8:13 a.m. UTC | #3
> Hi, thanks for the testing and debugging. So the main problem is that
> test_bit() is not an atomic operation, so using smp_mb __*_atomic() is
> not really helping, right?

Yes

> In that case we might only need to change smp_mb__before_atomic() to
> smp_rmb() in qdisc_run_begin(), as we only need to order test_bit()
> after set_bit() and clear_bit(), which is a read/write ordering?

I don't think so : we need to order test_bit() with respect to an earlier enqueue().
So we need a store/load barrier => smp_mb()

> By the way,  Guoju need to ensure ordering between spin_unlock() and
> test_bit() in qdisc_run_end(), which is a write/read ordering, so
> smp_mb() is used.

Agreed

But, for qdisc_run_begin(), I think Eric is right anyway, his fix, using test_and_set_bit() seems much better.
I'm going to try it and will tell you what it gives.

Thanks,

V
Vincent Ray May 24, 2022, 5 p.m. UTC | #4
All,

I confirm Eric's patch works well too, and it's better and clearer than mine.
So I think we should go for it, and the one from Guoju in addition.

@Eric : I see you are one of the networking maintainers, so I have a few questions for you :

a) are you going to take care of these patches directly yourself, or is there something Guoju or I should do to promote them ?

b) Can we expect to see them land in the mainline soon ?

c) Will they be backported to previous versions of the kernel ? Which ones ?

Thanks a lot, best,

V
Eric Dumazet May 24, 2022, 8:17 p.m. UTC | #5
On 5/24/22 10:00, Vincent Ray wrote:
> All,
>
> I confirm Eric's patch works well too, and it's better and clearer than mine.
> So I think we should go for it, and the one from Guoju in addition.
>
> @Eric : I see you are one of the networking maintainers, so I have a few questions for you :
>
> a) are you going to take care of these patches directly yourself, or is there something Guoju or I should do to promote them ?

I think this is totally fine you take ownership of the patch, please 
send a formal V2.

Please double check what patchwork had to say about your V1 :


https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/


And make sure to address the relevant points

The most important one is the lack of 'Signed-off-by:' tag, of course.


> b) Can we expect to see them land in the mainline soon ?

If your v2 submission is correct, it can be merged this week ;)


>
> c) Will they be backported to previous versions of the kernel ? Which ones ?

You simply can include a proper Fixes: tag, so that stable teams can 
backport

the patch to all affected kernel versions.



>
> Thanks a lot, best,

Thanks a lot for working on this long standing issue.
Vincent Ray May 25, 2022, 9:44 a.m. UTC | #6
----- On May 24, 2022, at 10:17 PM, Eric Dumazet eric.dumazet@gmail.com wrote:

> On 5/24/22 10:00, Vincent Ray wrote:
>> All,
>>
>> I confirm Eric's patch works well too, and it's better and clearer than mine.
>> So I think we should go for it, and the one from Guoju in addition.
>>
>> @Eric : I see you are one of the networking maintainers, so I have a few
>> questions for you :
>>
>> a) are you going to take care of these patches directly yourself, or is there
>> something Guoju or I should do to promote them ?
> 
> I think this is totally fine you take ownership of the patch, please
> send a formal V2.
> 
> Please double check what patchwork had to say about your V1 :
> 
> 
> https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/
> 
> 
> And make sure to address the relevant points

OK I will.
If you agree, I will take your version of the fix (test_and_set_bit()), keeping the commit message
similar to my original one.

What about Guoju's patch ? 
(adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end()). 
I think it is also necessary though potentially less critical.
Do we embed it in the same patch ? or patch series ?

@Guoju : have you submitted it for integration ?


> The most important one is the lack of 'Signed-off-by:' tag, of course.
> 
> 
>> b) Can we expect to see them land in the mainline soon ?
> 
> If your v2 submission is correct, it can be merged this week ;)
> 
> 
>>
>> c) Will they be backported to previous versions of the kernel ? Which ones ?
> 
> You simply can include a proper Fixes: tag, so that stable teams can
> backport
> 
> the patch to all affected kernel versions.
> 

Here things get a little complicated in my head ;-)
As explained, I think this mechanism has been bugged, in a way or an other, for some time, perhaps since the introduction
of lockless qdiscs (4.16) or somewhere between 4.16 and 5.14.
It's hard to tell at a glance since the code looks quite different back then.
Because of these changes, a unique patch will also only apply up to a certain point in the past.

However, I think the bug became really critical only with the introduction of "true bypass" behavior 
in lockless qdiscs by YunSheng in 5.14, though there may be scenarios where it is a big deal 
even in no-bypass mode.

=> I suggest we only tag it for backward fix up to the 5.14, where it should apply smoothly,
 and we live with the bug for versions before that.
This would mean that 5.15 LT can be patched but no earlier LT
 
What do you think ?

BTW : forgive my ignorance, but are there any kind of "Errata Sheet" or similar for known bugs that 
won't be fixed in a given kernel ?

> 
> 
>>
>> Thanks a lot, best,
> 
> Thanks a lot for working on this long standing issue.
> 
> 
> 
> 
> To declare a filtering error, please use the following link :
> https://www.security-mail.net/reporter.php?mid=7009.628d3d4c.37c04.0&r=vray%40kalrayinc.com&s=eric.dumazet%40gmail.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=0ca08e7b7e420d1ab014cda67db48db71df41f5f
Yunsheng Lin May 25, 2022, 10:45 a.m. UTC | #7
On 2022/5/25 17:44, Vincent Ray wrote:
> ----- On May 24, 2022, at 10:17 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
> 
>> On 5/24/22 10:00, Vincent Ray wrote:
>>> All,
>>>
>>> I confirm Eric's patch works well too, and it's better and clearer than mine.
>>> So I think we should go for it, and the one from Guoju in addition.
>>>
>>> @Eric : I see you are one of the networking maintainers, so I have a few
>>> questions for you :
>>>
>>> a) are you going to take care of these patches directly yourself, or is there
>>> something Guoju or I should do to promote them ?
>>
>> I think this is totally fine you take ownership of the patch, please
>> send a formal V2.
>>
>> Please double check what patchwork had to say about your V1 :
>>
>>
>> https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/
>>
>>
>> And make sure to address the relevant points
> 
> OK I will.
> If you agree, I will take your version of the fix (test_and_set_bit()), keeping the commit message
> similar to my original one.
> 
> What about Guoju's patch ? 

@Guoju, please speak up if you want to handle the patch yourself.

> (adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end()). 
> I think it is also necessary though potentially less critical.
> Do we embed it in the same patch ? or patch series ?

Guoju's patch fixes the commit a90c57f2cedd, so "patch series"
seems better if Guoju is not speaking up to handle the patch himself.


> 
> @Guoju : have you submitted it for integration ?
> 
> 
>> The most important one is the lack of 'Signed-off-by:' tag, of course.
>>
>>
>>> b) Can we expect to see them land in the mainline soon ?
>>
>> If your v2 submission is correct, it can be merged this week ;)
>>
>>
>>>
>>> c) Will they be backported to previous versions of the kernel ? Which ones ?
>>
>> You simply can include a proper Fixes: tag, so that stable teams can
>> backport
>>
>> the patch to all affected kernel versions.
>>
> 
> Here things get a little complicated in my head ;-)
> As explained, I think this mechanism has been bugged, in a way or an other, for some time, perhaps since the introduction
> of lockless qdiscs (4.16) or somewhere between 4.16 and 5.14.
> It's hard to tell at a glance since the code looks quite different back then.
> Because of these changes, a unique patch will also only apply up to a certain point in the past.
> 
> However, I think the bug became really critical only with the introduction of "true bypass" behavior 
> in lockless qdiscs by YunSheng in 5.14, though there may be scenarios where it is a big deal 
> even in no-bypass mode.


commit 89837eb4b246 tried to fix that, but it did not fix it completely, and that commit should has
been back-ported to the affected kernel versions as much as possible, so I think the Fixes tag
should be:

Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for lockless qdisc")

> 
> => I suggest we only tag it for backward fix up to the 5.14, where it should apply smoothly,
>  and we live with the bug for versions before that.
> This would mean that 5.15 LT can be patched but no earlier LT
>  
> What do you think ?
> 
> BTW : forgive my ignorance, but are there any kind of "Errata Sheet" or similar for known bugs that 
> won't be fixed in a given kernel ?
> 
>>
>>
>>>
>>> Thanks a lot, best,
>>
>> Thanks a lot for working on this long standing issue.
>>
>>
>>
>>
>> To declare a filtering error, please use the following link :
>> https://www.security-mail.net/reporter.php?mid=7009.628d3d4c.37c04.0&r=vray%40kalrayinc.com&s=eric.dumazet%40gmail.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=0ca08e7b7e420d1ab014cda67db48db71df41f5f
> 
> 
> 
> 
> .
>
Guoju Fang May 25, 2022, 12:40 p.m. UTC | #8
On 2022/5/25 18:45, Yunsheng Lin wrote:
> On 2022/5/25 17:44, Vincent Ray wrote:
>> ----- On May 24, 2022, at 10:17 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
>>
>>> On 5/24/22 10:00, Vincent Ray wrote:
>>>> All,
>>>>
>>>> I confirm Eric's patch works well too, and it's better and clearer than mine.
>>>> So I think we should go for it, and the one from Guoju in addition.
>>>>
>>>> @Eric : I see you are one of the networking maintainers, so I have a few
>>>> questions for you :
>>>>
>>>> a) are you going to take care of these patches directly yourself, or is there
>>>> something Guoju or I should do to promote them ?
>>>
>>> I think this is totally fine you take ownership of the patch, please
>>> send a formal V2.
>>>
>>> Please double check what patchwork had to say about your V1 :
>>>
>>>
>>> https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/
>>>
>>>
>>> And make sure to address the relevant points
>>
>> OK I will.
>> If you agree, I will take your version of the fix (test_and_set_bit()), keeping the commit message
>> similar to my original one.
>>
>> What about Guoju's patch ?
> 
> @Guoju, please speak up if you want to handle the patch yourself.

Hi Yunsheng, all,

I rewrite the comments of my patch and it looks a little clearer. :)

Thank you.

Best regards,

> 
>> (adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end()).
>> I think it is also necessary though potentially less critical.
>> Do we embed it in the same patch ? or patch series ?
> 
> Guoju's patch fixes the commit a90c57f2cedd, so "patch series"
> seems better if Guoju is not speaking up to handle the patch himself.
> 
> 
>>
>> @Guoju : have you submitted it for integration ?
>>
>>
>>> The most important one is the lack of 'Signed-off-by:' tag, of course.
>>>
>>>
>>>> b) Can we expect to see them land in the mainline soon ?
>>>
>>> If your v2 submission is correct, it can be merged this week ;)
>>>
>>>
>>>>
>>>> c) Will they be backported to previous versions of the kernel ? Which ones ?
>>>
>>> You simply can include a proper Fixes: tag, so that stable teams can
>>> backport
>>>
>>> the patch to all affected kernel versions.
>>>
>>
>> Here things get a little complicated in my head ;-)
>> As explained, I think this mechanism has been bugged, in a way or an other, for some time, perhaps since the introduction
>> of lockless qdiscs (4.16) or somewhere between 4.16 and 5.14.
>> It's hard to tell at a glance since the code looks quite different back then.
>> Because of these changes, a unique patch will also only apply up to a certain point in the past.
>>
>> However, I think the bug became really critical only with the introduction of "true bypass" behavior
>> in lockless qdiscs by YunSheng in 5.14, though there may be scenarios where it is a big deal
>> even in no-bypass mode.
> 
> 
> commit 89837eb4b246 tried to fix that, but it did not fix it completely, and that commit should has
> been back-ported to the affected kernel versions as much as possible, so I think the Fixes tag
> should be:
> 
> Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for lockless qdisc")
> 
>>
>> => I suggest we only tag it for backward fix up to the 5.14, where it should apply smoothly,
>>   and we live with the bug for versions before that.
>> This would mean that 5.15 LT can be patched but no earlier LT
>>   
>> What do you think ?
>>
>> BTW : forgive my ignorance, but are there any kind of "Errata Sheet" or similar for known bugs that
>> won't be fixed in a given kernel ?
>>
>>>
>>>
>>>>
>>>> Thanks a lot, best,
>>>
>>> Thanks a lot for working on this long standing issue.
>>>
>>>
>>>
>>>
>>> To declare a filtering error, please use the following link :
>>> https://www.security-mail.net/reporter.php?mid=7009.628d3d4c.37c04.0&r=vray%40kalrayinc.com&s=eric.dumazet%40gmail.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=0ca08e7b7e420d1ab014cda67db48db71df41f5f
>>
>>
>>
>>
>> .
>>
From 28cc81b33972cb1bcdfd529ceae024c9800a54a1 Mon Sep 17 00:00:00 2001
From: Guoju Fang <gjfang@linux.alibaba.com>
Date: Wed, 25 May 2022 19:12:00 +0800
Subject: [PATCH] net: sched: add barrier to fix packet stuck problem for
 lockless qdisc

In qdisc_run_end(), the spin_unlock() only has store-release semantic,
which guarantees all earlier memory access are visible before it. But
the subsequent test_bit() may be reordered ahead of the spin_unlock(),
and may cause a packet stuck problem.

The concurrent operations can be described as below,
         CPU 0                      |          CPU 1
   qdisc_run_end()                  |     qdisc_run_begin()
          .                         |           .
 ----> /* may be reorderd here */   |           .
|         .                         |           .
|     spin_unlock()                 |         set_bit()
|         .                         |         smp_mb__after_atomic()
 ---- test_bit()                    |         spin_trylock()
          .                         |          .

Consider the following sequence of events:
    CPU 0 reorder test_bit() ahead and see MISSED = 0
    CPU 1 calls set_bit()
    CPU 1 calls spin_trylock() and return fail
    CPU 0 executes spin_unlock()

At the end of the sequence, CPU 0 calls spin_unlock() and does nothing
because it see MISSED = 0. The skb on CPU 1 has beed enqueued but no one
take it, until the next cpu pushing to the qdisc (if ever ...) will
notice and dequeue it.

So one explicit barrier is needed between spin_unlock() and
test_bit() to ensure the correct order.

Signed-off-by: Guoju Fang <gjfang@linux.alibaba.com>
---
 include/net/sch_generic.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3b..8a8738642ca0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -229,6 +229,9 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
 	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
 
+		/* ensure ordering between spin_unlock() and test_bit() */
+		smp_mb();
+
 		if (unlikely(test_bit(__QDISC_STATE_MISSED,
 				      &qdisc->state)))
 			__netif_schedule(qdisc);
Vincent Ray May 25, 2022, 5:43 p.m. UTC | #9
----- On May 25, 2022, at 2:40 PM, Guoju Fang gjfang@linux.alibaba.com wrote:

> On 2022/5/25 18:45, Yunsheng Lin wrote:
>> On 2022/5/25 17:44, Vincent Ray wrote:
>>> ----- On May 24, 2022, at 10:17 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
>>>
>>>> On 5/24/22 10:00, Vincent Ray wrote:
>>>>> All,
>>>>>
>>>>> I confirm Eric's patch works well too, and it's better and clearer than mine.
>>>>> So I think we should go for it, and the one from Guoju in addition.
>>>>>
>>>>> @Eric : I see you are one of the networking maintainers, so I have a few
>>>>> questions for you :
>>>>>
>>>>> a) are you going to take care of these patches directly yourself, or is there
>>>>> something Guoju or I should do to promote them ?
>>>>
>>>> I think this is totally fine you take ownership of the patch, please
>>>> send a formal V2.
>>>>
>>>> Please double check what patchwork had to say about your V1 :
>>>>
>>>>
>>>> https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/
>>>>
>>>>
>>>> And make sure to address the relevant points
>>>
>>> OK I will.
>>> If you agree, I will take your version of the fix (test_and_set_bit()), keeping
>>> the commit message
>>> similar to my original one.
>>>
>>> What about Guoju's patch ?
>> 
>> @Guoju, please speak up if you want to handle the patch yourself.
> 
> Hi Yunsheng, all,
> 
> I rewrite the comments of my patch and it looks a little clearer. :)
> 
> Thank you.
> 
> Best regards,

Guoju : shouldn't you also include the same Fixes tag suggested by YunSheng ?

Here's mine, attached. Hope it's well formatted this time. Tell me.
I don't feel quite confident with the submission process to produce the series myself, so I'll let Eric handle it if it's ok.

> 
>> 
>>> (adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end()).
>>> I think it is also necessary though potentially less critical.
>>> Do we embed it in the same patch ? or patch series ?
>> 
>> Guoju's patch fixes the commit a90c57f2cedd, so "patch series"
>> seems better if Guoju is not speaking up to handle the patch himself.
>> 
>> 
>>>
>>> @Guoju : have you submitted it for integration ?
>>>
>>>
>>>> The most important one is the lack of 'Signed-off-by:' tag, of course.
>>>>
>>>>
>>>>> b) Can we expect to see them land in the mainline soon ?
>>>>
>>>> If your v2 submission is correct, it can be merged this week ;)
>>>>
>>>>
>>>>>
>>>>> c) Will they be backported to previous versions of the kernel ? Which ones ?
>>>>
>>>> You simply can include a proper Fixes: tag, so that stable teams can
>>>> backport
>>>>
>>>> the patch to all affected kernel versions.
>>>>
>>>
>>> Here things get a little complicated in my head ;-)
>>> As explained, I think this mechanism has been bugged, in a way or an other, for
>>> some time, perhaps since the introduction
>>> of lockless qdiscs (4.16) or somewhere between 4.16 and 5.14.
>>> It's hard to tell at a glance since the code looks quite different back then.
>>> Because of these changes, a unique patch will also only apply up to a certain
>>> point in the past.
>>>
>>> However, I think the bug became really critical only with the introduction of
>>> "true bypass" behavior
>>> in lockless qdiscs by YunSheng in 5.14, though there may be scenarios where it
>>> is a big deal
>>> even in no-bypass mode.
>> 
>> 
>> commit 89837eb4b246 tried to fix that, but it did not fix it completely, and
>> that commit should has
>> been back-ported to the affected kernel versions as much as possible, so I think
>> the Fixes tag
>> should be:
>> 
>> Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for
>> lockless qdisc")
>> 
>>>
>>> => I suggest we only tag it for backward fix up to the 5.14, where it should
>>> apply smoothly,
>>>   and we live with the bug for versions before that.
>>> This would mean that 5.15 LT can be patched but no earlier LT
>>>   
>>> What do you think ?
>>>
>>> BTW : forgive my ignorance, but are there any kind of "Errata Sheet" or similar
>>> for known bugs that
>>> won't be fixed in a given kernel ?
>>>
>>>>
>>>>
>>>>>
>>>>> Thanks a lot, best,
>>>>
>>>> Thanks a lot for working on this long standing issue.
>>>>
>>>>
>>>>
>>>>
>>>> To declare a filtering error, please use the following link :
>>>> https://www.security-mail.net/reporter.php?mid=7009.628d3d4c.37c04.0&r=vray%40kalrayinc.com&s=eric.dumazet%40gmail.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=0ca08e7b7e420d1ab014cda67db48db71df41f5f
>>>
>>>
>>>
>>>
>>> .
>>>
> 
> To declare a filtering error, please use the following link :
> https://www.security-mail.net/reporter.php?mid=2c69.628e23bf.45908.0&r=vray%40kalrayinc.com&s=gjfang%40linux.alibaba.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=6106070134039ab6725b6d3de67bd24d624c8b51
Vincent Ray May 25, 2022, 5:48 p.m. UTC | #10
----- On May 25, 2022, at 7:43 PM, Vincent Ray vray@kalrayinc.com wrote:

> ----- On May 25, 2022, at 2:40 PM, Guoju Fang gjfang@linux.alibaba.com wrote:
> 
>> On 2022/5/25 18:45, Yunsheng Lin wrote:
>>> On 2022/5/25 17:44, Vincent Ray wrote:
>>>> ----- On May 24, 2022, at 10:17 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
>>>>
>>>>> On 5/24/22 10:00, Vincent Ray wrote:
>>>>>> All,
>>>>>>
>>>>>> I confirm Eric's patch works well too, and it's better and clearer than mine.
>>>>>> So I think we should go for it, and the one from Guoju in addition.
>>>>>>
>>>>>> @Eric : I see you are one of the networking maintainers, so I have a few
>>>>>> questions for you :
>>>>>>
>>>>>> a) are you going to take care of these patches directly yourself, or is there
>>>>>> something Guoju or I should do to promote them ?
>>>>>
>>>>> I think this is totally fine you take ownership of the patch, please
>>>>> send a formal V2.
>>>>>
>>>>> Please double check what patchwork had to say about your V1 :
>>>>>
>>>>>
>>>>> https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/
>>>>>
>>>>>
>>>>> And make sure to address the relevant points
>>>>
>>>> OK I will.
>>>> If you agree, I will take your version of the fix (test_and_set_bit()), keeping
>>>> the commit message
>>>> similar to my original one.
>>>>
>>>> What about Guoju's patch ?
>>> 
>>> @Guoju, please speak up if you want to handle the patch yourself.
>> 
>> Hi Yunsheng, all,
>> 
>> I rewrite the comments of my patch and it looks a little clearer. :)
>> 
>> Thank you.
>> 
>> Best regards,
> 
> Guoju : shouldn't you also include the same Fixes tag suggested by YunSheng ?
> 
> Here's mine, attached. Hope it's well formatted this time. Tell me.
> I don't feel quite confident with the submission process to produce the series
> myself, so I'll let Eric handle it if it's ok.

NB : from what I've understood reading some doc, as this is a fix, it's supposed to go 
in the "net" tree, so I tagged it accordingly in the Subject. Hope it's Ok

> 
>> 
>>> 
>>>> (adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end()).
>>>> I think it is also necessary though potentially less critical.
>>>> Do we embed it in the same patch ? or patch series ?
>>> 
>>> Guoju's patch fixes the commit a90c57f2cedd, so "patch series"
>>> seems better if Guoju is not speaking up to handle the patch himself.
>>> 
>>> 
>>>>
>>>> @Guoju : have you submitted it for integration ?
>>>>
>>>>
>>>>> The most important one is the lack of 'Signed-off-by:' tag, of course.
>>>>>
>>>>>
>>>>>> b) Can we expect to see them land in the mainline soon ?
>>>>>
>>>>> If your v2 submission is correct, it can be merged this week ;)
>>>>>
>>>>>
>>>>>>
>>>>>> c) Will they be backported to previous versions of the kernel ? Which ones ?
>>>>>
>>>>> You simply can include a proper Fixes: tag, so that stable teams can
>>>>> backport
>>>>>
>>>>> the patch to all affected kernel versions.
>>>>>
>>>>
>>>> Here things get a little complicated in my head ;-)
>>>> As explained, I think this mechanism has been bugged, in a way or an other, for
>>>> some time, perhaps since the introduction
>>>> of lockless qdiscs (4.16) or somewhere between 4.16 and 5.14.
>>>> It's hard to tell at a glance since the code looks quite different back then.
>>>> Because of these changes, a unique patch will also only apply up to a certain
>>>> point in the past.
>>>>
>>>> However, I think the bug became really critical only with the introduction of
>>>> "true bypass" behavior
>>>> in lockless qdiscs by YunSheng in 5.14, though there may be scenarios where it
>>>> is a big deal
>>>> even in no-bypass mode.
>>> 
>>> 
>>> commit 89837eb4b246 tried to fix that, but it did not fix it completely, and
>>> that commit should has
>>> been back-ported to the affected kernel versions as much as possible, so I think
>>> the Fixes tag
>>> should be:
>>> 
>>> Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for
>>> lockless qdisc")
>>> 
>>>>
>>>> => I suggest we only tag it for backward fix up to the 5.14, where it should
>>>> apply smoothly,
>>>>   and we live with the bug for versions before that.
>>>> This would mean that 5.15 LT can be patched but no earlier LT
>>>>   
>>>> What do you think ?
>>>>
>>>> BTW : forgive my ignorance, but are there any kind of "Errata Sheet" or similar
>>>> for known bugs that
>>>> won't be fixed in a given kernel ?
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks a lot, best,
>>>>>
>>>>> Thanks a lot for working on this long standing issue.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> To declare a filtering error, please use the following link :
>>>>> https://www.security-mail.net/reporter.php?mid=7009.628d3d4c.37c04.0&r=vray%40kalrayinc.com&s=eric.dumazet%40gmail.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=0ca08e7b7e420d1ab014cda67db48db71df41f5f
>>>>
>>>>
>>>>
>>>>
>>>> .
>>>>
>> 
>> To declare a filtering error, please use the following link :
>> https://www.security-mail.net/reporter.php?mid=2c69.628e23bf.45908.0&r=vray%40kalrayinc.com&s=gjfang%40linux.alibaba.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=6106070134039ab6725b6d3de67bd24d624c8b51
Eric Dumazet May 26, 2022, 12:17 a.m. UTC | #11
On Wed, May 25, 2022 at 10:48 AM Vincent Ray <vray@kalrayinc.com> wrote:
>

> >
> > Guoju : shouldn't you also include the same Fixes tag suggested by YunSheng ?
> >
> > Here's mine, attached. Hope it's well formatted this time. Tell me.
> > I don't feel quite confident with the submission process to produce the series
> > myself, so I'll let Eric handle it if it's ok.
>
> NB : from what I've understood reading some doc, as this is a fix, it's supposed to go
> in the "net" tree, so I tagged it accordingly in the Subject. Hope it's Ok

Patch was not delivered to patchwork this time.

I will resend it.

Next time, please take the time to read
Documentation/process/submitting-patches.rst, thanks.

(line 276: No MIME, no links, no compression, no attachments.  Just plain text)

Thanks.
Vincent Ray May 30, 2022, 9:36 a.m. UTC | #12
----- On May 26, 2022, at 2:17 AM, Eric Dumazet edumazet@google.com wrote:

> On Wed, May 25, 2022 at 10:48 AM Vincent Ray <vray@kalrayinc.com> wrote:
>>
> 
>> >
>> > Guoju : shouldn't you also include the same Fixes tag suggested by YunSheng ?
>> >
>> > Here's mine, attached. Hope it's well formatted this time. Tell me.
>> > I don't feel quite confident with the submission process to produce the series
>> > myself, so I'll let Eric handle it if it's ok.
>>
>> NB : from what I've understood reading some doc, as this is a fix, it's supposed
>> to go
>> in the "net" tree, so I tagged it accordingly in the Subject. Hope it's Ok
> 
> Patch was not delivered to patchwork this time.
> 
> I will resend it.
> 
> Next time, please take the time to read
> Documentation/process/submitting-patches.rst, thanks.
> 
> (line 276: No MIME, no links, no compression, no attachments.  Just plain text)
> 
> Thanks.

Yes, sorry about that Eric.
I did read the howto but somehow convinced myself that plain-text attachment was ok, and 
was confused about how / if the patch would be correctly merged into the mail thread if inlined.
Will do better next time !
Thanks,

V 

> 
> 
> To declare a filtering error, please use the following link :
> https://www.security-mail.net/reporter.php?mid=f3cf.628ec731.cdb4a.0&r=vray%40kalrayinc.com&s=edumazet%40google.com&o=Re%3A+packet+stuck+in+qdisc+%3A+patch+proposal&verdict=C&c=08b5406354e783e551d3ed8910b4278bf54f89d9
diff mbox series

Patch

commit 917f7ff2b0f59d721d11f983af1f46c1cd74130a
Author: Vincent Ray <vray@kalray.eu>
Date:   Mon May 23 15:24:12 2022 +0200

    net: sched: fixed barrier to prevent skbuff sticking in qdisc backlog
    
    In qdisc_run_begin(), smp_mb__before_atomic() used before test_bit()
    does not provide any ordering guarantee as test_bit() is not an atomic
    operation. This, added to the fact that the spin_trylock() call at
    the beginning of qdisc_run_begin() does not guarantee acquire
    semantics if it does not grab the lock, makes it possible for the
    following statement :
    
    if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
    
    to be executed before an enqueue operation called before
    qdisc_run_begin().
    
    As a result the following race can happen :
    
               CPU 1                             CPU 2
    
          qdisc_run_begin()               qdisc_run_begin() /* true */
            set(MISSED)                            .
          /* returns false */                      .
              .                            /* sees MISSED = 1 */
              .                            /* so qdisc not empty */
              .                            __qdisc_run()
              .                                    .
              .                              pfifo_fast_dequeue()
     ----> /* may be done here */                  .
    |         .                                clear(MISSED)
    |         .                                    .
    |         .                                smp_mb __after_atomic();
    |         .                                    .
    |         .                                /* recheck the queue */
    |         .                                /* nothing => exit   */
    |   enqueue(skb1)
    |         .
    |   qdisc_run_begin()
    |         .
    |     spin_trylock() /* fail */
    |         .
    |     smp_mb__before_atomic() /* not enough */
    |         .
     ---- if (test_bit(MISSED))
            return false;   /* exit */
    
    In the above scenario, CPU 1 and CPU 2 both try to grab the
    qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the
    bypass code path, where it emits its skb then calls __qdisc_run().
    
    CPU1 fails, sets MISSED and goes down the traditionnal enqueue() +
    dequeue() code path. But when executing qdisc_run_begin() for the
    second time, after enqueing its skbuff, it sees the MISSED bit still
    set (by itself) and consequently chooses to exit early without setting
    it again nor trying to grab the spinlock again.
    
    Meanwhile CPU2 has seen MISSED = 1, cleared it, checked the queue
    and found it empty, so it returned.
    
    At the end of the sequence, we end up with skb1 enqueued in the
    backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set,
    and no __netif_schedule() called made. skb1 will now linger in the
    qdisc until somebody later performs a full __qdisc_run(). Associated
    to the bypass capacity of the qdisc, and the ability of the TCP layer
    to avoid resending packets which it knows are still in the qdisc, this
    can lead to serious traffic "holes" in a TCP connexion.
    
    We fix this by turning smp_mb__before_atomic() into smp_mb() which
    guarantees the correct ordering of enqueue() vs test_bit() and
    consequently prevents the race.

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3b..0c6016e10a6f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -191,7 +191,7 @@  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 		 * STATE_MISSED checking is synchronized with clearing
 		 * in pfifo_fast_dequeue().
 		 */
-		smp_mb__before_atomic();
+		smp_mb();
 
 		/* If the MISSED flag is set, it means other thread has
 		 * set the MISSED flag before second spin_trylock(), so