Message ID | 20240413-tcp-ao-selftests-fixes-v1-0-f9c41c96949d@gmail.com (mailing list archive) |
---|---|
Series | selftests/net/tcp_ao: A bunch of fixes for TCP-AO selftests |
Hello:

This series was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Sat, 13 Apr 2024 02:42:51 +0100 you wrote:
> Started as addressing the flakiness issues in rst_ipv*, that affect
> netdev dashboard.
>
> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
> ---
> Dmitry Safonov (4):
>       selftests/tcp_ao: Make RST tests less flaky
>       selftests/tcp_ao: Zero-init tcp_ao_info_opt
>       selftests/tcp_ao: Fix fscanf() call for format-security
>       selftests/tcp_ao: Printing fixes to confirm with format-security
>
> [...]

Here is the summary with links:
  - [net,1/4] selftests/tcp_ao: Make RST tests less flaky
    https://git.kernel.org/netdev/net/c/4225dfa4535f
  - [net,2/4] selftests/tcp_ao: Zero-init tcp_ao_info_opt
    https://git.kernel.org/netdev/net/c/b089b3bead53
  - [net,3/4] selftests/tcp_ao: Fix fscanf() call for format-security
    https://git.kernel.org/netdev/net/c/beb78cd1329d
  - [net,4/4] selftests/tcp_ao: Printing fixes to confirm with format-security
    https://git.kernel.org/netdev/net/c/b476c93654d7

You are awesome, thank you!
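
Patches 3/4 and 4/4 listed above concern the compiler's -Wformat-security
warning. For readers unfamiliar with it, here is a minimal sketch of the
pattern that warning flags and the usual fix. The helpers read_token() and
format_msg() are hypothetical names made up for illustration and are not
taken from the series; the real changes are in the linked commits.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the series): read the first
 * whitespace-delimited token of a stream into buf.
 *
 * Passing a non-literal string as the format, e.g. fscanf(f, some_str),
 * is the kind of call -Wformat-security complains about. A literal
 * format with an explicit field width avoids the warning and also
 * bounds the write into buf.
 */
static int read_token(FILE *f, char *buf, size_t len)
{
	if (len < 32)
		return -1;
	/* literal format, at most 31 characters plus the terminating NUL */
	return fscanf(f, "%31s", buf) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper (not from the series): copy a message that may
 * contain '%' characters. snprintf(out, len, msg) would treat them as
 * conversions; going through "%s" copies the text verbatim.
 */
static int format_msg(char *out, size_t len, const char *msg)
{
	return snprintf(out, len, "%s", msg);
}
```

With these helpers, a string such as "100%s done" passes through
format_msg() unchanged, whereas using it directly as a format would make
snprintf() look for an argument that was never supplied.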
On Sat, 13 Apr 2024 02:42:51 +0100 Dmitry Safonov via B4 Relay wrote:
> Started as addressing the flakiness issues in rst_ipv*, that affect
> netdev dashboard.

Thank you! :)
On Tue, 16 Apr 2024 at 15:28, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 13 Apr 2024 02:42:51 +0100 Dmitry Safonov via B4 Relay wrote:
> > Started as addressing the flakiness issues in rst_ipv*, that affect
> > netdev dashboard.
>
> Thank you! :)

Jakub, you are very welcome :)

I'll keep an eye on the dashboard, but I very much encourage you to
ping me in case of any other issues with tcp_ao selftests.
I currently have v2 for tcp-ao tracepoints, but delaying it as working
on a reproducer/selftest for an issue I think I have a patch for.

BTW, do you know if those were addressed or anyone is looking into them?
(from other tcp-ao hits, that seem not anyhow related to tcp-ao itself):

1.
[ 240.001391][ T833]  Possible interrupt unsafe locking scenario:
[ 240.001391][ T833]
[ 240.001635][ T833]        CPU0                    CPU1
[ 240.001797][ T833]        ----                    ----
[ 240.001958][ T833]   lock(&p->alloc_lock);
[ 240.002083][ T833]                                local_irq_disable();
[ 240.002284][ T833]                                lock(&ndev->lock);
[ 240.002490][ T833]                                lock(&p->alloc_lock);
[ 240.002709][ T833]   <Interrupt>
[ 240.002819][ T833]     lock(&ndev->lock);
[ 240.002937][ T833]
[ 240.002937][ T833]  *** DEADLOCK ***

https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537021/14-self-connect-ipv6/stderr

2.
[ 251.411647][ T71] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[ 251.411986][ T71] 6.9.0-rc1-virtme #1 Not tainted
[ 251.412214][ T71] -----------------------------------------------------
[ 251.412533][ T71] kworker/u16:1/71 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
[ 251.412837][ T71] ffff888005182c28 (&p->alloc_lock){+.+.}-{2:2}, at: __get_task_comm+0x27/0x70
[ 251.413214][ T71]
[ 251.413214][ T71] and this task is already holding:
[ 251.413527][ T71] ffff88802f83efd8 (&ul->lock){+.-.}-{2:2}, at: rt6_uncached_list_flush_dev+0x138/0x840
[ 251.413887][ T71] which would create a new lock dependency:
[ 251.414153][ T71]  (&ul->lock){+.-.}-{2:2} -> (&p->alloc_lock){+.+.}-{2:2}
[ 251.414464][ T71]
[ 251.414464][ T71] but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 251.414808][ T71]  (&ul->lock){+.-.}-{2:2}

https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537201/17-icmps-discard-ipv4/stderr

3.
[ 264.280734][ C3]  Possible unsafe locking scenario:
[ 264.280734][ C3]
[ 264.280968][ C3]        CPU0                    CPU1
[ 264.281117][ C3]        ----                    ----
[ 264.281263][ C3]   lock((&tw->tw_timer));
[ 264.281427][ C3]                                lock(&hashinfo->ehash_locks[i]);
[ 264.281647][ C3]                                lock((&tw->tw_timer));
[ 264.281834][ C3]   lock(&hashinfo->ehash_locks[i]);

https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/547461/19-self-connect-ipv4/stderr

I can spend some time on them after I verify that my fix for -stable
is actually fixing an issue I think it fixes.
Seems like your automation + my selftests are giving some fruits, hehe.

Thanks,
             Dmitry
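
The third report above is a plain lock-order inversion: one path takes
tw->tw_timer while holding an ehash lock, another takes them in the
opposite order. As a rough illustration of the idea behind such a report,
here is a toy lock-dependency graph that refuses an edge if it would close
a cycle. This is only a sketch and not how lockdep is implemented (lockdep
also tracks lock classes and IRQ contexts); struct lock_graph,
lock_graph_add() and MAX_LOCKS are names invented for this example.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8

/* dep[a][b] is true once lock b has been taken while lock a was held */
struct lock_graph {
	bool dep[MAX_LOCKS][MAX_LOCKS];
};

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static bool reachable(const struct lock_graph *g, int from, int to,
		      bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_LOCKS; next++) {
		if (g->dep[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Record "acquiring 'next' while holding 'held'".
 * Returns false, and records nothing, if the new edge would close a
 * cycle, i.e. an earlier path already orders 'next' before 'held'.
 */
static bool lock_graph_add(struct lock_graph *g, int held, int next)
{
	bool seen[MAX_LOCKS] = { false };

	if (reachable(g, next, held, seen))
		return false;
	g->dep[held][next] = true;
	return true;
}
```

Starting from a zeroed graph, recording "ehash lock held, timer taken"
succeeds, and a later "timer held, ehash lock taken" is rejected, which
mirrors the CPU0/CPU1 columns in the report.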
On Wed, 17 Apr 2024 19:47:18 +0100 Dmitry Safonov wrote:
> 1. [ 240.001391][ T833] Possible interrupt unsafe locking scenario:
> [ 240.001391][ T833]
> [ 240.001635][ T833] CPU0 CPU1
> [ 240.001797][ T833] ---- ----
> [ 240.001958][ T833] lock(&p->alloc_lock);
> [ 240.002083][ T833] local_irq_disable();
> [ 240.002284][ T833] lock(&ndev->lock);
> [ 240.002490][ T833] lock(&p->alloc_lock);
> [ 240.002709][ T833] <Interrupt>
> [ 240.002819][ T833] lock(&ndev->lock);
> [ 240.002937][ T833]
> [ 240.002937][ T833] *** DEADLOCK ***
>
> https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537021/14-self-connect-ipv6/stderr
>
> 2. [ 251.411647][ T71] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock
> order detected
> [ 251.411986][ T71] 6.9.0-rc1-virtme #1 Not tainted
> [ 251.412214][ T71] -----------------------------------------------------
> [ 251.412533][ T71] kworker/u16:1/71 [HC0[0]:SC0[2]:HE1:SE0] is
> trying to acquire:
> [ 251.412837][ T71] ffff888005182c28 (&p->alloc_lock){+.+.}-{2:2},
> at: __get_task_comm+0x27/0x70
> [ 251.413214][ T71]
> [ 251.413214][ T71] and this task is already holding:
> [ 251.413527][ T71] ffff88802f83efd8 (&ul->lock){+.-.}-{2:2}, at:
> rt6_uncached_list_flush_dev+0x138/0x840
> [ 251.413887][ T71] which would create a new lock dependency:
> [ 251.414153][ T71] (&ul->lock){+.-.}-{2:2} -> (&p->alloc_lock){+.+.}-{2:2}
> [ 251.414464][ T71]
> [ 251.414464][ T71] but this new dependency connects a SOFTIRQ-irq-safe lock:
> [ 251.414808][ T71] (&ul->lock){+.-.}-{2:2}
>
> https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537201/17-icmps-discard-ipv4/stderr
>
> 3. [ 264.280734][ C3] Possible unsafe locking scenario:
> [ 264.280734][ C3]
> [ 264.280968][ C3] CPU0 CPU1
> [ 264.281117][ C3] ---- ----
> [ 264.281263][ C3] lock((&tw->tw_timer));
> [ 264.281427][ C3]
> lock(&hashinfo->ehash_locks[i]);
> [ 264.281647][ C3] lock((&tw->tw_timer));
> [ 264.281834][ C3] lock(&hashinfo->ehash_locks[i]);
>
> https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/547461/19-self-connect-ipv4/stderr
>
> I can spend some time on them after I verify that my fix for -stable
> is actually fixing an issue I think it fixes.
> Seems like your automation + my selftests are giving some fruits, hehe.

Oh, very interesting, I don't recall these coming up before.

We try to extract crashes but apparently we're missing lockdep splats.
I'll try to improve the extraction logic...
On Wed, 17 Apr 2024 13:46:36 -0700 Jakub Kicinski wrote:
> > I can spend some time on them after I verify that my fix for -stable
> > is actually fixing an issue I think it fixes.
> > Seems like your automation + my selftests are giving some fruits, hehe.
>
> Oh, very interesting, I don't recall these coming up before.

Correction, these are old, and if I plug the branch names here:
https://netdev.bots.linux.dev/contest.html
there is a whole bunch of tests failing that day.

Keep in mind these run pre-commit so not all failures are flakes.
On Wed, 17 Apr 2024 at 22:28, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 17 Apr 2024 13:46:36 -0700 Jakub Kicinski wrote:
> > > I can spend some time on them after I verify that my fix for -stable
> > > is actually fixing an issue I think it fixes.
> > > Seems like your automation + my selftests are giving some fruits, hehe.
> >
> > Oh, very interesting, I don't recall these coming up before.
>
> Correction, these are old, and if I plug the branch names here:
> https://netdev.bots.linux.dev/contest.html
> there is a whole bunch of tests failing that day.

Hmm, yeah, I was looking at the history of selftests to see if there
is anything else interesting:

2024-04-11--15-00 - lockdep for hashinfo->ehash_locks vs tw->tw_timer
It seems that you actually reported that already here:
https://lore.kernel.org/all/20240411100536.224fa1e7@kernel.org/

2024-04-04--12-00 - lockdep for p->alloc_lock vs ul->lock
(rt6_uncached_list_flush_dev)
2024-04-04--09-00 - lockdep for p->alloc_lock vs ndev->lock
(addrconf_permanent_addr)
2024-04-04--03-00 - lockdep for p->alloc_lock vs ul->lock
Was reported as well:
https://lore.kernel.org/all/8576a80ac958812ac75b01299c2de3a6485f84a1.camel@redhat.com/

2024-03-06--00-00 - kernel BUG at net/core/skbuff.c:2813
Can't really track this down to any report/fix. Probably as it's month
old and hasn't happened since on these tests - something was borken on
that particular day.

> Keep in mind these run pre-commit so not all failures are flakes.

Thanks,
             Dmitry
Started as addressing the flakiness issues in rst_ipv*, that affect
netdev dashboard.

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
Dmitry Safonov (4):
      selftests/tcp_ao: Make RST tests less flaky
      selftests/tcp_ao: Zero-init tcp_ao_info_opt
      selftests/tcp_ao: Fix fscanf() call for format-security
      selftests/tcp_ao: Printing fixes to confirm with format-security

 tools/testing/selftests/net/tcp_ao/lib/proc.c      |  2 +-
 tools/testing/selftests/net/tcp_ao/lib/setup.c     | 12 +++++------
 tools/testing/selftests/net/tcp_ao/rst.c           | 23 ++++++++++++----------
 .../selftests/net/tcp_ao/setsockopt-closed.c       |  2 +-
 4 files changed, 21 insertions(+), 18 deletions(-)
---
base-commit: 8f2c057754b25075aa3da132cd4fd4478cdab854
change-id: 20240413-tcp-ao-selftests-fixes-adacd65cb8ba

Best regards,
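
The second patch in the list above is titled "Zero-init tcp_ao_info_opt".
The cover letter does not spell out the reason, so the following is only a
sketch of why zero-initialising an option struct on the stack commonly
matters: interfaces often reject requests whose reserved fields are not
zero, and an uninitialised stack variable may carry garbage there. The
struct demo_opt and both helpers are hypothetical stand-ins, not the real
tcp_ao_info_opt or any kernel code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an option struct that has reserved fields */
struct demo_opt {
	uint32_t flags;
	uint16_t reserved;	/* must be zero */
	uint16_t reserved2;	/* must be zero */
	uint64_t counter;
};

/* Mimics a receiver-side sanity check: non-zero reserved fields fail */
static int demo_validate(const struct demo_opt *opt)
{
	if (opt->reserved != 0 || opt->reserved2 != 0)
		return -1;
	return 0;
}

/* Build the option safely: clear everything, then set what is needed */
static struct demo_opt demo_make(uint32_t flags)
{
	struct demo_opt opt;

	memset(&opt, 0, sizeof(opt));	/* clears reserved fields and padding */
	opt.flags = flags;
	return opt;
}
```

An option built with demo_make() always passes demo_validate(), while one
whose reserved field was left holding a stray value does not, which is the
kind of intermittent failure that zero-initialisation removes.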