Message ID: 1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu (mailing list archive)
State: Changes Requested
Delegated to: Netdev Maintainers
Series: packet stuck in qdisc : patch proposal
On 5/23/22 06:54, Vincent Ray wrote:
> Hi Yunsheng, all,
>
> I finally spotted the bug that caused (nvme-)tcp packets to remain stuck in the qdisc once in a while.
> It's in qdisc_run_begin() within sch_generic.h:
>
>     smp_mb__before_atomic();
>
>     // [comments]
>
>     if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
>         return false;
>
> should be
>
>     smp_mb();
>
>     // [comments]
>
>     if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
>         return false;
>
> I have written a more detailed explanation in the attached patch, including a race example, but in short it's because test_bit() is not an atomic operation.
> Therefore it does not give you any ordering guarantee on any architecture.
> And neither does spin_trylock(), called at the beginning of qdisc_run_begin(), when it does not grab the lock...
> So test_bit() may be reordered with a preceding enqueue(), leading to a possible race in the dialog with pfifo_fast_dequeue().
> We may then end up with an skbuff pushed "silently" to the qdisc (MISSED cleared, nobody aware that there is something in the backlog).
> Then the cores pushing new skbuffs to the qdisc may all bypass it for an arbitrary amount of time, leaving the enqueued skbuff stuck in the backlog.
>
> I believe the reason you could not reproduce the issue on ARM64 is that, on that architecture, smp_mb__before_atomic() translates to a memory barrier.
> It does not on x86 (it is turned into a NOP) because you're supposed to use this function just before an atomic operation, and atomic operations themselves provide full ordering effects on x86.
>
> I think the code has been flawed for some time, but the introduction of a (true) bypass policy in 5.14 made it more visible, because without it the "victim" skbuff does not stay very long in the backlog: it is bound to be popped by the next core executing __qdisc_run().
> In my setup, with our use case (16 (virtual) cpus in a VM shooting 4KB buffers with fio through a -i4 nvme-tcp connection to a target), I did not notice any performance degradation using smp_mb() in place of smp_mb__before_atomic(), but of course that does not mean it cannot happen in other configs.
>
> I think Guoju's patch is also correct and necessary, so both patches, his and mine, should be applied "asap" to the kernel.
> A difference between Guoju's race and "mine" is that, in his case, the MISSED bit will be set: though no one will take care of the skbuff immediately, the next cpu pushing to the qdisc (if ever...) will notice and dequeue it (so Guoju's race probably happens in my use case too but is not noticeable).
>
> Finally, given the necessity of these two new full barriers in the code, I wonder if the whole lockless (+ bypass) thing should be reconsidered.
> At least, I think general performance tests should be run to check that lockless qdiscs still outperform locked qdiscs, in both bypassable and not-bypassable modes.
>
> More generally, I found this piece of code quite tricky and error-prone, as evidenced by the numerous fixes it went through in recent history.
> I believe most of this complexity comes from the lockless qdisc handling in itself, but of course the addition of the bypass support does not really help ;-)
> I'm a linux kernel beginner however, so I'll let more experienced programmers decide about that :-)
>
> I've made sure that, with this patch, no stuck packets happened any more on both v5.15 and v5.18-rc2 (whereas without the patch, numerous occurrences of stuck packets are visible).
> I'm quite confident it will apply to any concerned version, that is from 5.14 (or before) to mainline.
>
> Can you please tell me:
>
> 1) if you agree with this?
>
> 2) how to proceed to push this patch (and Guoju's) for quick integration into the mainline?
> NB: an alternative fix (which I've tested OK too) would be to simply remove the
>
>     if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
>         return false;
>
> code path, but I have no clue whether this would be better or worse than the present patch in terms of performance.
>
> Thank you, best regards,
>
> V

We keep adding code and comments, this is quite silly.

test_and_set_bit() is exactly what we need.

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3ba3d143de4d63f4142cff3c9b9f3e..9d1b448c0dfc3925967635f3390b884a4ef7c55a 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -187,35 +187,9 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 	if (spin_trylock(&qdisc->seqlock))
 		return true;
 
-	/* Paired with smp_mb__after_atomic() to make sure
-	 * STATE_MISSED checking is synchronized with clearing
-	 * in pfifo_fast_dequeue().
-	 */
-	smp_mb__before_atomic();
-
-	/* If the MISSED flag is set, it means other thread has
-	 * set the MISSED flag before second spin_trylock(), so
-	 * we can return false here to avoid multi cpus doing
-	 * the set_bit() and second spin_trylock() concurrently.
-	 */
-	if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+	if (test_and_set_bit(__QDISC_STATE_MISSED, &qdisc->state))
 		return false;
 
-	/* Set the MISSED flag before the second spin_trylock(),
-	 * if the second spin_trylock() return false, it means
-	 * other cpu holding the lock will do dequeuing for us
-	 * or it will see the MISSED flag set after releasing
-	 * lock and reschedule the net_tx_action() to do the
-	 * dequeuing.
-	 */
-	set_bit(__QDISC_STATE_MISSED, &qdisc->state);
-
-	/* spin_trylock() only has load-acquire semantic, so use
-	 * smp_mb__after_atomic() to ensure STATE_MISSED is set
-	 * before doing the second spin_trylock().
-	 */
-	smp_mb__after_atomic();
-
 	/* Retry again in case other CPU may not see the new flag
 	 * after it releases the lock at the end of qdisc_run_end().
 	 */
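Eric's point is that a single atomic read-modify-write replaces the test_bit()/set_bit() pair together with both explicit barriers. A rough userspace analogue using C11 atomics can sketch this; note that `struct fake_qdisc`, `fake_trylock()`, and the constants below are illustrative stand-ins, not the kernel's definitions:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace sketch of the fixed qdisc_run_begin() fast
 * path. QDISC_STATE_MISSED and struct fake_qdisc are stand-ins. */

#define QDISC_STATE_MISSED (1UL << 0)

struct fake_qdisc {
	atomic_ulong state;
	atomic_flag seqlock;	/* models qdisc->seqlock */
};

static bool fake_trylock(struct fake_qdisc *q)
{
	/* atomic_flag_test_and_set() is an RMW with seq_cst ordering,
	 * playing the role of a successful spin_trylock(). */
	return !atomic_flag_test_and_set(&q->seqlock);
}

static bool fake_qdisc_run_begin(struct fake_qdisc *q)
{
	if (fake_trylock(q))
		return true;

	/* Like kernel test_and_set_bit(): one fully ordered RMW replaces
	 * the racy test_bit() + set_bit() + explicit barrier sequence.
	 * If MISSED was already set, another cpu owns the retry. */
	if (atomic_fetch_or(&q->state, QDISC_STATE_MISSED) &
	    QDISC_STATE_MISSED)
		return false;

	/* Retry in case the lock holder released the lock before it
	 * could see the MISSED bit we just set. */
	return fake_trylock(q);
}
```

The ordering argument carries over: because the RMW is fully ordered, the MISSED check can no longer be hoisted above a preceding enqueue, which is exactly what the separate test_bit() allowed.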
On 2022/5/23 21:54, Vincent Ray wrote:
> Hi Yunsheng, all,

+cc Will & Eric

> [full quote of Vincent's analysis and questions trimmed]
>
> Thank you, best regards

Hi, thanks for the testing and debugging. So the main problem is that
test_bit() is not an atomic operation, so using smp_mb__*_atomic() is
not really helping, right?

In that case we might only need to change smp_mb__before_atomic() to
smp_rmb() in qdisc_run_begin(), as we only need to order test_bit()
after set_bit() and clear_bit(), which is a read/write ordering?

By the way, Guoju needs to ensure ordering between spin_unlock() and
test_bit() in qdisc_run_end(), which is a write/read ordering, so
smp_mb() is used.

> V
> Hi, thanks for the testing and debugging. So the main problem is that
> test_bit() is not an atomic operation, so using smp_mb__*_atomic() is
> not really helping, right?

Yes.

> In that case we might only need to change smp_mb__before_atomic() to
> smp_rmb() in qdisc_run_begin(), as we only need to order test_bit()
> after set_bit() and clear_bit(), which is a read/write ordering?

I don't think so: we need to order test_bit() with respect to an earlier enqueue(), so we need a store/load barrier => smp_mb().

> By the way, Guoju needs to ensure ordering between spin_unlock() and
> test_bit() in qdisc_run_end(), which is a write/read ordering, so
> smp_mb() is used.

Agreed.

But, for qdisc_run_begin(), I think Eric is right anyway: his fix, using test_and_set_bit(), seems much better.
I'm going to try it and will tell you what it gives.

Thanks,

V
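Vincent's store/load argument is the classic store-buffering pattern. A hypothetical userspace sketch with C11 atomics (the flag names below are illustrative, not kernel state) shows why a full barrier is needed: each side stores its own flag, then must load the other's. With a full (seq_cst) fence on both sides, at least one thread is guaranteed to observe the other's store; a read-only barrier like smp_rmb() (or x86's no-op smp_mb__before_atomic()) would not forbid both loads returning 0, which is precisely the "silently enqueued skb" outcome.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

/* Store-buffering sketch: "enqueued" stands in for the skb enqueue,
 * "missed_cleared" for clear_bit(__QDISC_STATE_MISSED) on the dequeue
 * side. Neither is real kernel state. */

static atomic_int enqueued, missed_cleared;
static int r_enqueue_side, r_dequeue_side;

static void *enqueue_side(void *arg)	/* enqueue(); smp_mb(); test_bit() */
{
	(void)arg;
	atomic_store_explicit(&enqueued, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the needed smp_mb() */
	r_enqueue_side =
		atomic_load_explicit(&missed_cleared, memory_order_relaxed);
	return NULL;
}

static void *dequeue_side(void *arg)	/* clear_bit(); smp_mb(); recheck */
{
	(void)arg;
	atomic_store_explicit(&missed_cleared, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	r_dequeue_side =
		atomic_load_explicit(&enqueued, memory_order_relaxed);
	return NULL;
}

static void run_once(void)
{
	pthread_t t1, t2;

	atomic_store(&enqueued, 0);
	atomic_store(&missed_cleared, 0);
	pthread_create(&t1, NULL, enqueue_side, NULL);
	pthread_create(&t2, NULL, dequeue_side, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
}
```

Weakening the fences to acquire/release (or removing them) allows the r_enqueue_side == 0 && r_dequeue_side == 0 outcome on real hardware, which is the stuck-packet window.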
All,

I confirm Eric's patch works well too, and it's better and clearer than mine.
So I think we should go for it, and the one from Guoju in addition.

@Eric: I see you are one of the networking maintainers, so I have a few questions for you:

a) are you going to take care of these patches directly yourself, or is there something Guoju or I should do to promote them?

b) can we expect to see them land in the mainline soon?

c) will they be backported to previous versions of the kernel? Which ones?

Thanks a lot, best,

V
On 5/24/22 10:00, Vincent Ray wrote:
> All,
>
> I confirm Eric's patch works well too, and it's better and clearer than mine.
> So I think we should go for it, and the one from Guoju in addition.
>
> @Eric: I see you are one of the networking maintainers, so I have a few questions for you:
>
> a) are you going to take care of these patches directly yourself, or is there something Guoju or I should do to promote them?

I think it is totally fine if you take ownership of the patch; please send a formal V2.

Please double check what patchwork had to say about your V1:

https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/

And make sure to address the relevant points.

The most important one is the lack of a 'Signed-off-by:' tag, of course.

> b) can we expect to see them land in the mainline soon?

If your v2 submission is correct, it can be merged this week ;)

> c) will they be backported to previous versions of the kernel? Which ones?

You can simply include a proper Fixes: tag, so that stable teams can backport the patch to all affected kernel versions.

> Thanks a lot, best,

Thanks a lot for working on this long standing issue.
----- On May 24, 2022, at 10:17 PM, Eric Dumazet eric.dumazet@gmail.com wrote:

> On 5/24/22 10:00, Vincent Ray wrote:
>> [earlier questions trimmed]
>
> I think it is totally fine if you take ownership of the patch; please
> send a formal V2.
>
> Please double check what patchwork had to say about your V1:
>
> https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/
>
> And make sure to address the relevant points.

OK I will.
If you agree, I will take your version of the fix (test_and_set_bit()), keeping the commit message similar to my original one.

What about Guoju's patch (adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end())?
I think it is also necessary, though potentially less critical.
Do we embed it in the same patch? Or a patch series?

@Guoju: have you submitted it for integration?

> The most important one is the lack of a 'Signed-off-by:' tag, of course.
>
>> b) can we expect to see them land in the mainline soon?
>
> If your v2 submission is correct, it can be merged this week ;)
>
>> c) will they be backported to previous versions of the kernel? Which ones?
>
> You can simply include a proper Fixes: tag, so that stable teams can
> backport the patch to all affected kernel versions.

Here things get a little complicated in my head ;-)
As explained, I think this mechanism has been buggy, in one way or another, for some time, perhaps since the introduction of lockless qdiscs (4.16) or somewhere between 4.16 and 5.14.
It's hard to tell at a glance since the code looked quite different back then.
Because of these changes, a unique patch will also only apply up to a certain point in the past.

However, I think the bug became really critical only with the introduction of the "true bypass" behavior in lockless qdiscs by Yunsheng in 5.14, though there may be scenarios where it is a big deal even in no-bypass mode.

=> I suggest we only tag it for backward fix up to 5.14, where it should apply smoothly, and we live with the bug for versions before that.
This would mean that 5.15 LT can be patched but no earlier LT.

What do you think?

BTW: forgive my ignorance, but is there any kind of "errata sheet" or similar for known bugs that won't be fixed in a given kernel?

>> Thanks a lot, best,
>
> Thanks a lot for working on this long standing issue.
On 2022/5/25 17:44, Vincent Ray wrote:
> [earlier quoted discussion trimmed]
>
> What about Guoju's patch?

@Guoju, please speak up if you want to handle the patch yourself.

> (adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end())
> I think it is also necessary, though potentially less critical.
> Do we embed it in the same patch? Or a patch series?

Guoju's patch fixes commit a90c57f2cedd, so a "patch series" seems better if Guoju is not speaking up to handle the patch himself.

> @Guoju: have you submitted it for integration?
>
> [quoted discussion trimmed]
>
> However, I think the bug became really critical only with the introduction of the "true bypass" behavior
> in lockless qdiscs by Yunsheng in 5.14, though there may be scenarios where it is a big deal
> even in no-bypass mode.

Commit 89837eb4b246 tried to fix that, but it did not fix it completely, and that commit should have been backported to the affected kernel versions as much as possible, so I think the Fixes tag should be:

Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for lockless qdisc")

> => I suggest we only tag it for backward fix up to 5.14, where it should apply smoothly,
> and we live with the bug for versions before that.
> This would mean that 5.15 LT can be patched but no earlier LT.
>
> What do you think?
>
> [remainder trimmed]
On 2022/5/25 18:45, Yunsheng Lin wrote:
> On 2022/5/25 17:44, Vincent Ray wrote:
>> [earlier quoted discussion trimmed]
>
> @Guoju, please speak up if you want to handle the patch yourself.

Hi Yunsheng, all,

I rewrote the comments of my patch and it looks a little clearer. :)

Thank you.

Best regards,

> Guoju's patch fixes commit a90c57f2cedd, so a "patch series"
> seems better if Guoju is not speaking up to handle the patch himself.
>
> [remainder of quoted discussion trimmed]
From 28cc81b33972cb1bcdfd529ceae024c9800a54a1 Mon Sep 17 00:00:00 2001
From: Guoju Fang <gjfang@linux.alibaba.com>
Date: Wed, 25 May 2022 19:12:00 +0800
Subject: [PATCH] net: sched: add barrier to fix packet stuck problem for
 lockless qdisc

In qdisc_run_end(), the spin_unlock() only has store-release semantics,
which guarantees all earlier memory accesses are visible before it. But
the subsequent test_bit() may be reordered ahead of the spin_unlock(),
and may cause a packet stuck problem.

The concurrent operations can be described as below,

         CPU 0                     |          CPU 1
    qdisc_run_end()                |     qdisc_run_begin()
           .                       |            .
 ----> /* may be reordered here */ |            .
|          .                       |            .
|          .                       |            .
|      spin_unlock()               |        set_bit()
|          .                       |  smp_mb__after_atomic()
 ----  test_bit()                  |     spin_trylock()
           .                       |            .

Consider the following sequence of events:
    CPU 0 reorders test_bit() ahead and sees MISSED = 0
    CPU 1 calls set_bit()
    CPU 1 calls spin_trylock() and fails
    CPU 0 executes spin_unlock()

At the end of the sequence, CPU 0 calls spin_unlock() and does nothing
because it saw MISSED = 0. The skb on CPU 1 has been enqueued but no one
takes it, until the next cpu pushing to the qdisc (if ever...) notices
and dequeues it.

So one explicit barrier is needed between spin_unlock() and test_bit()
to ensure the correct order.
Signed-off-by: Guoju Fang <gjfang@linux.alibaba.com>
---
 include/net/sch_generic.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3b..8a8738642ca0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -229,6 +229,9 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
 	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
 
+		/* ensure ordering between spin_unlock() and test_bit() */
+		smp_mb();
+
 		if (unlikely(test_bit(__QDISC_STATE_MISSED,
 				      &qdisc->state)))
 			__netif_schedule(qdisc);
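The guarantee Guoju's patch restores can also be sketched as a hypothetical userspace analogue with C11 atomics (all names below are illustrative stand-ins, not the kernel's implementation): the unlock only has release semantics, so a full fence must sit between it and the MISSED re-check, and a waiter that set MISSED after a failed trylock is then guaranteed to be rescheduled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch of the fixed qdisc_run_end(); struct fake_qdisc,
 * fake_netif_schedule() and the constant are stand-ins. */

#define QDISC_STATE_MISSED (1UL << 0)

struct fake_qdisc {
	atomic_ulong state;
	atomic_flag seqlock;
};

static int reschedules;		/* counts __netif_schedule() calls */

static void fake_netif_schedule(struct fake_qdisc *q)
{
	(void)q;
	reschedules++;
}

static void fake_qdisc_run_end(struct fake_qdisc *q)
{
	/* Models spin_unlock(): store-release only, so without more, the
	 * load below could be hoisted above it. */
	atomic_flag_clear_explicit(&q->seqlock, memory_order_release);

	/* The fix: a full barrier keeps the MISSED test after the
	 * unlock, pairing with the waiter's set_bit() + trylock. */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load(&q->state) & QDISC_STATE_MISSED)
		fake_netif_schedule(q);
}
```

With the fence, the window Guoju's diagram describes (test the bit early, then unlock, and miss the waiter) cannot occur.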
----- On May 25, 2022, at 2:40 PM, Guoju Fang gjfang@linux.alibaba.com wrote:

> On 2022/5/25 18:45, Yunsheng Lin wrote:
>> [earlier quoted discussion trimmed]
>>
>> @Guoju, please speak up if you want to handle the patch yourself.
>
> Hi Yunsheng, all,
>
> I rewrote the comments of my patch and it looks a little clearer. :)
>
> Thank you.
>
> Best regards,

Guoju: shouldn't you also include the same Fixes tag suggested by Yunsheng?

Here's mine, attached. Hope it's well formatted this time. Tell me.
I don't feel quite confident with the submission process to produce the series myself, so I'll let Eric handle it if that's ok.

> [remainder of quoted discussion trimmed]
----- On May 25, 2022, at 7:43 PM, Vincent Ray vray@kalrayinc.com wrote: > ----- On May 25, 2022, at 2:40 PM, Guoju Fang gjfang@linux.alibaba.com wrote: > >> On 2022/5/25 18:45, Yunsheng Lin wrote: >>> On 2022/5/25 17:44, Vincent Ray wrote: >>>> ----- On May 24, 2022, at 10:17 PM, Eric Dumazet eric.dumazet@gmail.com wrote: >>>> >>>>> On 5/24/22 10:00, Vincent Ray wrote: >>>>>> All, >>>>>> >>>>>> I confirm Eric's patch works well too, and it's better and clearer than mine. >>>>>> So I think we should go for it, and the one from Guoju in addition. >>>>>> >>>>>> @Eric : I see you are one of the networking maintainers, so I have a few >>>>>> questions for you : >>>>>> >>>>>> a) are you going to take care of these patches directly yourself, or is there >>>>>> something Guoju or I should do to promote them ? >>>>> >>>>> I think this is totally fine you take ownership of the patch, please >>>>> send a formal V2. >>>>> >>>>> Please double check what patchwork had to say about your V1 : >>>>> >>>>> >>>>> https://patchwork.kernel.org/project/netdevbpf/patch/1684598287.15044793.1653314052575.JavaMail.zimbra@kalray.eu/ >>>>> >>>>> >>>>> And make sure to address the relevant points >>>> >>>> OK I will. >>>> If you agree, I will take your version of the fix (test_and_set_bit()), keeping >>>> the commit message >>>> similar to my original one. >>>> >>>> What about Guoju's patch ? >>> >>> @Guoju, please speak up if you want to handle the patch yourself. >> >> Hi Yunsheng, all, >> >> I rewrite the comments of my patch and it looks a little clearer. :) >> >> Thank you. >> >> Best regards, > > Guoju : shouldn't you also include the same Fixes tag suggested by YunSheng ? > > Here's mine, attached. Hope it's well formatted this time. Tell me. > I don't feel quite confident with the submission process to produce the series > myself, so I'll let Eric handle it if it's ok. 
NB : from what I've understood reading the docs, as this is a fix, it's supposed
to go in the "net" tree, so I tagged it accordingly in the Subject. Hope that's OK.

>
>>
>>>
>>>> (adding a smp_mb() between the spin_unlock() and test_bit() in qdisc_run_end()).
>>>> I think it is also necessary, though potentially less critical.
>>>> Do we embed it in the same patch ? or a patch series ?
>>>
>>> Guoju's patch fixes commit a90c57f2cedd, so a patch series
>>> seems better if Guoju is not speaking up to handle the patch himself.
>>>
>>>> @Guoju : have you submitted it for integration ?
>>>>
>>>>> The most important one is the lack of a 'Signed-off-by:' tag, of course.
>>>>>
>>>>>> b) Can we expect to see them land in the mainline soon ?
>>>>>
>>>>> If your v2 submission is correct, it can be merged this week ;)
>>>>>
>>>>>> c) Will they be backported to previous versions of the kernel ? Which ones ?
>>>>>
>>>>> You can simply include a proper Fixes: tag, so that stable teams can backport
>>>>> the patch to all affected kernel versions.
>>>>
>>>> Here things get a little complicated in my head ;-)
>>>> As explained, I think this mechanism has been bugged, in one way or another,
>>>> for some time, perhaps since the introduction of lockless qdiscs (4.16) or
>>>> somewhere between 4.16 and 5.14.
>>>> It's hard to tell at a glance since the code looked quite different back then.
>>>> Because of these changes, a single patch will also only apply up to a certain
>>>> point in the past.
>>>>
>>>> However, I think the bug became really critical only with the introduction of
>>>> "true bypass" behavior in lockless qdiscs by Yunsheng in 5.14, though there
>>>> may be scenarios where it is a big deal even in no-bypass mode.
>>> commit 89837eb4b246 tried to fix that, but it did not fix it completely, and
>>> that commit should have been back-ported to the affected kernel versions as
>>> much as possible, so I think the Fixes tag should be:
>>>
>>> Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for
>>> lockless qdisc")
>>>
>>>> => I suggest we only tag it for backward fix up to 5.14, where it should
>>>> apply smoothly, and we live with the bug for versions before that.
>>>> This would mean that 5.15 LTS can be patched but no earlier LTS.
>>>>
>>>> What do you think ?
>>>>
>>>> BTW : forgive my ignorance, but is there any kind of "Errata Sheet" or
>>>> similar for known bugs that won't be fixed in a given kernel ?
>>>>
>>>>>> Thanks a lot, best,
>>>>>
>>>>> Thanks a lot for working on this long-standing issue.
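As an aside for readers following the thread: the ordering problem Guoju's qdisc_run_end() patch addresses can be modeled in userspace. The sketch below is a hedged approximation in C11 atomics, not kernel code: `seqlock`, `state`, and the `rescheduled` flag are simplified stand-ins for `qdisc->seqlock`, `qdisc->state`, and a `__netif_schedule()` call, and only the barrier placement between the unlock and the MISSED recheck is the point.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QDISC_STATE_MISSED (1u << 0)

/* Stand-ins for qdisc->seqlock, qdisc->state, and __netif_schedule(). */
static atomic_flag seqlock = ATOMIC_FLAG_INIT;
static atomic_uint state;
static bool rescheduled;

static void qdisc_run_end(void)
{
    /* spin_unlock() has release semantics only: without a full barrier,
     * the MISSED load below could be reordered before the unlock,
     * missing a waiter that sets MISSED right after its failed
     * spin_trylock(). */
    atomic_flag_clear_explicit(&seqlock, memory_order_release);

    /* The full barrier (Guoju's smp_mb()) keeps unlock and load in order. */
    atomic_thread_fence(memory_order_seq_cst);

    if (atomic_load(&state) & QDISC_STATE_MISSED)
        rescheduled = true;   /* kernel: __netif_schedule(qdisc) */
}
```

A single-threaded exercise of the path: pretend we hold the seqlock, have a waiter set MISSED, then run the unlock path and observe that the reschedule fires and the lock is free.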
On Wed, May 25, 2022 at 10:48 AM Vincent Ray <vray@kalrayinc.com> wrote:
>
> > Guoju : shouldn't you also include the same Fixes tag suggested by Yunsheng ?
> >
> > Here's mine, attached. Hope it's well formatted this time. Tell me.
> > I don't feel quite confident with the submission process to produce the series
> > myself, so I'll let Eric handle it if it's ok.
>
> NB : from what I've understood reading the docs, as this is a fix, it's supposed
> to go in the "net" tree, so I tagged it accordingly in the Subject. Hope that's OK.

The patch was not delivered to patchwork this time.

I will resend it.

Next time, please take the time to read
Documentation/process/submitting-patches.rst, thanks.

(line 276: No MIME, no links, no compression, no attachments. Just plain text)

Thanks.
----- On May 26, 2022, at 2:17 AM, Eric Dumazet edumazet@google.com wrote:

> On Wed, May 25, 2022 at 10:48 AM Vincent Ray <vray@kalrayinc.com> wrote:
>> >
>> > Guoju : shouldn't you also include the same Fixes tag suggested by Yunsheng ?
>> >
>> > Here's mine, attached. Hope it's well formatted this time. Tell me.
>> > I don't feel quite confident with the submission process to produce the series
>> > myself, so I'll let Eric handle it if it's ok.
>>
>> NB : from what I've understood reading the docs, as this is a fix, it's supposed
>> to go in the "net" tree, so I tagged it accordingly in the Subject. Hope that's OK.
>
> The patch was not delivered to patchwork this time.
>
> I will resend it.
>
> Next time, please take the time to read
> Documentation/process/submitting-patches.rst, thanks.
>
> (line 276: No MIME, no links, no compression, no attachments. Just plain text)
>
> Thanks.

Yes, sorry about that, Eric. I did read the howto but somehow convinced myself
that a plain-text attachment was OK, and was confused about how / if the patch
would be correctly merged into the mail thread if inlined.
Will do better next time !

Thanks,
V
commit 917f7ff2b0f59d721d11f983af1f46c1cd74130a
Author: Vincent Ray <vray@kalray.eu>
Date:   Mon May 23 15:24:12 2022 +0200

    net: sched: fixed barrier to prevent skbuff sticking in qdisc backlog

    In qdisc_run_begin(), smp_mb__before_atomic() used before test_bit()
    does not provide any ordering guarantee as test_bit() is not an atomic
    operation. This, added to the fact that the spin_trylock() call at the
    beginning of qdisc_run_begin() does not guarantee acquire semantics if
    it does not grab the lock, makes it possible for the following statement :

        if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))

    to be executed before an enqueue operation called before
    qdisc_run_begin().

    As a result the following race can happen :

               CPU 1                             CPU 2

          qdisc_run_begin()               qdisc_run_begin() /* true */
          set(MISSED)                             .
          /* returns false */                     .
                .                         /* sees MISSED = 1 */
                .                         /* so qdisc not empty */
                .                         __qdisc_run()
                .                                 .
                .                         pfifo_fast_dequeue()
     ----> /* may be done here */                 .
    |           .                         clear(MISSED)
    |           .                                 .
    |           .                         smp_mb__after_atomic();
    |           .                                 .
    |           .                         /* recheck the queue */
    |           .                         /* nothing => exit   */
    |     enqueue(skb1)
    |           .
    |     qdisc_run_begin()
    |           .
    |     spin_trylock() /* fail */
    |           .
    |     smp_mb__before_atomic() /* not enough */
    |           .
     ---- if (test_bit(MISSED))
          return false; /* exit */

    In the above scenario, CPU 1 and CPU 2 both try to grab the
    qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the
    bypass code path, where it emits its skb then calls __qdisc_run().

    CPU 1 fails, sets MISSED and goes down the traditional enqueue() +
    dequeue() code path. But when executing qdisc_run_begin() for the
    second time, after enqueuing its skbuff, it sees the MISSED bit still
    set (by itself) and consequently chooses to exit early without setting
    it again nor trying to grab the spinlock again.

    Meanwhile CPU 2 has seen MISSED = 1, cleared it, checked the queue
    and found it empty, so it returned.
    At the end of the sequence, we end up with skb1 enqueued in the
    backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set,
    and no __netif_schedule() call made. skb1 will now linger in the
    qdisc until somebody later performs a full __qdisc_run().

    Associated with the bypass capacity of the qdisc, and the ability of
    the TCP layer to avoid resending packets which it knows are still in
    the qdisc, this can lead to serious traffic "holes" in a TCP
    connection.

    We fix this by turning smp_mb__before_atomic() into smp_mb(), which
    guarantees the correct ordering of enqueue() vs test_bit() and
    consequently prevents the race.

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3b..0c6016e10a6f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -191,7 +191,7 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 	 * STATE_MISSED checking is synchronized with clearing
 	 * in pfifo_fast_dequeue().
 	 */
-	smp_mb__before_atomic();
+	smp_mb();
 
 	/* If the MISSED flag is set, it means other thread has
 	 * set the MISSED flag before second spin_trylock(), so
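For readers who want to experiment with the logic of the fixed slow path, it can be sketched as a userspace C11 model. This is an illustrative approximation, not the kernel implementation: `atomic_flag` stands in for `qdisc->seqlock`, the empty-queue checks of the real `qdisc_run_begin()` are omitted, and `atomic_thread_fence(memory_order_seq_cst)` plays the role of the `smp_mb()` the patch introduces.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QDISC_STATE_MISSED (1u << 0)

/* Simplified stand-in for struct Qdisc: just the seqlock and state bits. */
struct qdisc {
    atomic_uint state;
    atomic_flag seqlock;
};

static bool qdisc_run_begin(struct qdisc *q)
{
    /* First trylock: the uncontended fast path. */
    if (!atomic_flag_test_and_set_explicit(&q->seqlock,
                                           memory_order_acquire))
        return true;

    /* A failed trylock gives no ordering guarantee, so a full fence is
     * needed to order the caller's enqueue before the MISSED check
     * below.  This models the smp_mb__before_atomic() -> smp_mb()
     * change in the patch. */
    atomic_thread_fence(memory_order_seq_cst);

    /* If MISSED is already set, the lock owner will reschedule us. */
    if (atomic_load(&q->state) & QDISC_STATE_MISSED)
        return false;

    atomic_fetch_or(&q->state, QDISC_STATE_MISSED);

    /* Second trylock, in case the owner released the lock meanwhile. */
    return !atomic_flag_test_and_set_explicit(&q->seqlock,
                                              memory_order_acquire);
}
```

Exercised single-threaded: a first call takes the lock; a second call while the lock is held fails, sets MISSED, and returns false; a third call sees MISSED already set and exits early.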