[net,0/4] selftests/net/tcp_ao: A bunch of fixes for TCP-AO selftests

Message ID 20240413-tcp-ao-selftests-fixes-v1-0-f9c41c96949d@gmail.com (mailing list archive)

Message

Dmitry Safonov via B4 Relay April 13, 2024, 1:42 a.m. UTC
Started as addressing the flakiness issues in rst_ipv*, that affect
netdev dashboard.

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
Dmitry Safonov (4):
      selftests/tcp_ao: Make RST tests less flaky
      selftests/tcp_ao: Zero-init tcp_ao_info_opt
      selftests/tcp_ao: Fix fscanf() call for format-security
      selftests/tcp_ao: Printing fixes to confirm with format-security

 tools/testing/selftests/net/tcp_ao/lib/proc.c      |  2 +-
 tools/testing/selftests/net/tcp_ao/lib/setup.c     | 12 +++++------
 tools/testing/selftests/net/tcp_ao/rst.c           | 23 ++++++++++++----------
 .../selftests/net/tcp_ao/setsockopt-closed.c       |  2 +-
 4 files changed, 21 insertions(+), 18 deletions(-)
---
base-commit: 8f2c057754b25075aa3da132cd4fd4478cdab854
change-id: 20240413-tcp-ao-selftests-fixes-adacd65cb8ba

Best regards,
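
Two of the patch titles above name general bug classes: an option struct that is not zero-initialized, and a non-literal format string that trips -Wformat-security. The sketch below illustrates both in isolation. It is not code from this series: `struct demo_opt`, `make_opt()` and `format_msg()` are hypothetical names invented for the example, and the struct is only a stand-in, not the real `tcp_ao_info_opt` layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an option struct handed to the kernel.
 * This is NOT the real struct tcp_ao_info_opt. */
struct demo_opt {
	unsigned int flags;
	unsigned char keyid;
	unsigned long long counter;
};

/* Zero-initialize first, so every member the caller does not set
 * explicitly is 0 instead of leftover stack contents. */
static struct demo_opt make_opt(unsigned char keyid)
{
	struct demo_opt opt = { 0 };

	opt.keyid = keyid;
	return opt;
}

/* Copy a caller-supplied message into 'out'.
 *
 * Writing snprintf(out, outlen, msg) would pass a non-literal format
 * string with no arguments, which -Wformat-security warns about and
 * which would interpret any '%' inside msg. Using a literal "%s"
 * format treats msg as plain data. */
static int format_msg(char *out, size_t outlen, const char *msg)
{
	assert(out != NULL && msg != NULL);
	return snprintf(out, outlen, "%s", msg);
}
```

With this shape, a message such as "100%d" is copied verbatim rather than being parsed as a conversion specification, and `make_opt()` returns a struct whose unset members are zero.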

Comments

patchwork-bot+netdevbpf@kernel.org April 16, 2024, 11:40 a.m. UTC | #1
Hello:

This series was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Sat, 13 Apr 2024 02:42:51 +0100 you wrote:
> Started as addressing the flakiness issues in rst_ipv*, that affect
> netdev dashboard.
> 
> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
> ---
> Dmitry Safonov (4):
>       selftests/tcp_ao: Make RST tests less flaky
>       selftests/tcp_ao: Zero-init tcp_ao_info_opt
>       selftests/tcp_ao: Fix fscanf() call for format-security
>       selftests/tcp_ao: Printing fixes to confirm with format-security
> 
> [...]

Here is the summary with links:
  - [net,1/4] selftests/tcp_ao: Make RST tests less flaky
    https://git.kernel.org/netdev/net/c/4225dfa4535f
  - [net,2/4] selftests/tcp_ao: Zero-init tcp_ao_info_opt
    https://git.kernel.org/netdev/net/c/b089b3bead53
  - [net,3/4] selftests/tcp_ao: Fix fscanf() call for format-security
    https://git.kernel.org/netdev/net/c/beb78cd1329d
  - [net,4/4] selftests/tcp_ao: Printing fixes to confirm with format-security
    https://git.kernel.org/netdev/net/c/b476c93654d7

You are awesome, thank you!

Jakub Kicinski April 16, 2024, 2:28 p.m. UTC | #2
On Sat, 13 Apr 2024 02:42:51 +0100 Dmitry Safonov via B4 Relay wrote:
> Started as addressing the flakiness issues in rst_ipv*, that affect
> netdev dashboard.

Thank you! :)

Dmitry Safonov April 17, 2024, 6:47 p.m. UTC | #3
On Tue, 16 Apr 2024 at 15:28, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 13 Apr 2024 02:42:51 +0100 Dmitry Safonov via B4 Relay wrote:
> > Started as addressing the flakiness issues in rst_ipv*, that affect
> > netdev dashboard.
>
> Thank you! :)

Jakub, you are very welcome :)
I'll keep an eye on the dashboard, but I very much encourage you to
ping me in case of any other issues with tcp_ao selftests.

I currently have a v2 of the tcp-ao tracepoints, but I'm delaying it
while I work on a reproducer/selftest for an issue I think I have a
patch for.

BTW, do you know whether these have been addressed, or whether anyone
is looking into them? (They come from other tcp-ao hits and do not seem
to be related to tcp-ao itself):

1. [  240.001391][  T833] Possible interrupt unsafe locking scenario:
[  240.001391][  T833]
[  240.001635][  T833]        CPU0                    CPU1
[  240.001797][  T833]        ----                    ----
[  240.001958][  T833]   lock(&p->alloc_lock);
[  240.002083][  T833]                                local_irq_disable();
[  240.002284][  T833]                                lock(&ndev->lock);
[  240.002490][  T833]                                lock(&p->alloc_lock);
[  240.002709][  T833]   <Interrupt>
[  240.002819][  T833]     lock(&ndev->lock);
[  240.002937][  T833]
[  240.002937][  T833]  *** DEADLOCK ***

https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537021/14-self-connect-ipv6/stderr

2. [  251.411647][   T71] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  251.411986][   T71] 6.9.0-rc1-virtme #1 Not tainted
[  251.412214][   T71] -----------------------------------------------------
[  251.412533][   T71] kworker/u16:1/71 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
[  251.412837][   T71] ffff888005182c28 (&p->alloc_lock){+.+.}-{2:2}, at: __get_task_comm+0x27/0x70
[  251.413214][   T71]
[  251.413214][   T71] and this task is already holding:
[  251.413527][   T71] ffff88802f83efd8 (&ul->lock){+.-.}-{2:2}, at: rt6_uncached_list_flush_dev+0x138/0x840
[  251.413887][   T71] which would create a new lock dependency:
[  251.414153][   T71]  (&ul->lock){+.-.}-{2:2} -> (&p->alloc_lock){+.+.}-{2:2}
[  251.414464][   T71]
[  251.414464][   T71] but this new dependency connects a SOFTIRQ-irq-safe lock:
[  251.414808][   T71]  (&ul->lock){+.-.}-{2:2}

https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537201/17-icmps-discard-ipv4/stderr

3. [  264.280734][    C3] Possible unsafe locking scenario:
[  264.280734][    C3]
[  264.280968][    C3]        CPU0                    CPU1
[  264.281117][    C3]        ----                    ----
[  264.281263][    C3]   lock((&tw->tw_timer));
[  264.281427][    C3]                                lock(&hashinfo->ehash_locks[i]);
[  264.281647][    C3]                                lock((&tw->tw_timer));
[  264.281834][    C3]   lock(&hashinfo->ehash_locks[i]);

https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/547461/19-self-connect-ipv4/stderr

I can spend some time on them after I verify that my fix for -stable
is actually fixing an issue I think it fixes.
Seems like your automation + my selftests are giving some fruits, hehe.

Thanks,
             Dmitry

Jakub Kicinski April 17, 2024, 8:46 p.m. UTC | #4
On Wed, 17 Apr 2024 19:47:18 +0100 Dmitry Safonov wrote:
> 1. [  240.001391][  T833] Possible interrupt unsafe locking scenario:
> [  240.001391][  T833]
> [  240.001635][  T833]        CPU0                    CPU1
> [  240.001797][  T833]        ----                    ----
> [  240.001958][  T833]   lock(&p->alloc_lock);
> [  240.002083][  T833]                                local_irq_disable();
> [  240.002284][  T833]                                lock(&ndev->lock);
> [  240.002490][  T833]                                lock(&p->alloc_lock);
> [  240.002709][  T833]   <Interrupt>
> [  240.002819][  T833]     lock(&ndev->lock);
> [  240.002937][  T833]
> [  240.002937][  T833]  *** DEADLOCK ***
> 
> https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537021/14-self-connect-ipv6/stderr
> 
> 2. [  251.411647][   T71] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> [  251.411986][   T71] 6.9.0-rc1-virtme #1 Not tainted
> [  251.412214][   T71] -----------------------------------------------------
> [  251.412533][   T71] kworker/u16:1/71 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
> [  251.412837][   T71] ffff888005182c28 (&p->alloc_lock){+.+.}-{2:2}, at: __get_task_comm+0x27/0x70
> [  251.413214][   T71]
> [  251.413214][   T71] and this task is already holding:
> [  251.413527][   T71] ffff88802f83efd8 (&ul->lock){+.-.}-{2:2}, at: rt6_uncached_list_flush_dev+0x138/0x840
> [  251.413887][   T71] which would create a new lock dependency:
> [  251.414153][   T71]  (&ul->lock){+.-.}-{2:2} -> (&p->alloc_lock){+.+.}-{2:2}
> [  251.414464][   T71]
> [  251.414464][   T71] but this new dependency connects a SOFTIRQ-irq-safe lock:
> [  251.414808][   T71]  (&ul->lock){+.-.}-{2:2}
> 
> https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/537201/17-icmps-discard-ipv4/stderr
> 
> 3. [  264.280734][    C3] Possible unsafe locking scenario:
> [  264.280734][    C3]
> [  264.280968][    C3]        CPU0                    CPU1
> [  264.281117][    C3]        ----                    ----
> [  264.281263][    C3]   lock((&tw->tw_timer));
> [  264.281427][    C3]                                lock(&hashinfo->ehash_locks[i]);
> [  264.281647][    C3]                                lock((&tw->tw_timer));
> [  264.281834][    C3]   lock(&hashinfo->ehash_locks[i]);
> 
> https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/547461/19-self-connect-ipv4/stderr
> 
> I can spend some time on them after I verify that my fix for -stable
> is actually fixing an issue I think it fixes.
> Seems like your automation + my selftests are giving some fruits, hehe.

Oh, very interesting, I don't recall these coming up before.

We try to extract crashes but apparently we're missing lockdep splats.
I'll try to improve the extraction logic...
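
As a rough illustration of what such extraction could key on, the sketch below scans a console log for marker strings taken from the three splats quoted in this thread. It is only a sketch under stated assumptions: it is not the actual netdev CI extraction code, the marker list is not exhaustive, and `lockdep_markers`, `line_has_lockdep_marker()` and `count_lockdep_marker_lines()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Marker strings copied from the splats quoted in this thread.
 * Illustrative only; lockdep prints other report types too. */
static const char *const lockdep_markers[] = {
	"Possible interrupt unsafe locking scenario",
	"Possible unsafe locking scenario",
	"SOFTIRQ-safe -> SOFTIRQ-unsafe lock",
	"*** DEADLOCK ***",
};

/* Return true if one console log line contains a known marker. */
static bool line_has_lockdep_marker(const char *line)
{
	size_t i;

	assert(line != NULL);
	for (i = 0; i < sizeof(lockdep_markers) / sizeof(lockdep_markers[0]); i++) {
		if (strstr(line, lockdep_markers[i]))
			return true;
	}
	return false;
}

/* Count the marker lines in a NUL-terminated, newline-separated log. */
static size_t count_lockdep_marker_lines(const char *log)
{
	size_t hits = 0;

	assert(log != NULL);
	while (*log) {
		const char *eol = strchr(log, '\n');
		size_t len = eol ? (size_t)(eol - log) : strlen(log);
		char line[512];

		if (len >= sizeof(line))
			len = sizeof(line) - 1;	/* truncate overly long lines */
		memcpy(line, log, len);
		line[len] = '\0';

		if (line_has_lockdep_marker(line))
			hits++;
		if (!eol)
			break;
		log = eol + 1;
	}
	return hits;
}
```

Matching on fixed substrings keeps the check independent of the timestamp and task prefixes such as `[  240.002937][  T833]`, at the cost of missing report types whose wording is not in the list.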

Jakub Kicinski April 17, 2024, 9:28 p.m. UTC | #5
On Wed, 17 Apr 2024 13:46:36 -0700 Jakub Kicinski wrote:
> > I can spend some time on them after I verify that my fix for -stable
> > is actually fixing an issue I think it fixes.
> > Seems like your automation + my selftests are giving some fruits, hehe.  
> 
> Oh, very interesting, I don't recall these coming up before.

Correction, these are old, and if I plug the branch names here:
https://netdev.bots.linux.dev/contest.html
there is a whole bunch of tests failing that day.

Keep in mind these run pre-commit so not all failures are flakes.

Dmitry Safonov April 17, 2024, 10:30 p.m. UTC | #6
On Wed, 17 Apr 2024 at 22:28, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 17 Apr 2024 13:46:36 -0700 Jakub Kicinski wrote:
> > > I can spend some time on them after I verify that my fix for -stable
> > > is actually fixing an issue I think it fixes.
> > > Seems like your automation + my selftests are giving some fruits, hehe.
> >
> > Oh, very interesting, I don't recall these coming up before.
>
> Correction, these are old, and if I plug the branch names here:
> https://netdev.bots.linux.dev/contest.html
> there is a whole bunch of tests failing that day.

Hmm, yeah, I was looking at the history of selftests to see if there
is anything else interesting:

2024-04-11--15-00 - lockdep for hashinfo->ehash_locks vs tw->tw_timer

It seems that you actually reported that already here:
https://lore.kernel.org/all/20240411100536.224fa1e7@kernel.org/

2024-04-04--12-00 - lockdep for p->alloc_lock vs ul->lock (rt6_uncached_list_flush_dev)
2024-04-04--09-00 - lockdep for p->alloc_lock vs ndev->lock (addrconf_permanent_addr)
2024-04-04--03-00 - lockdep for p->alloc_lock vs ul->lock

Was reported as well:
https://lore.kernel.org/all/8576a80ac958812ac75b01299c2de3a6485f84a1.camel@redhat.com/

2024-03-06--00-00 - kernel BUG at net/core/skbuff.c:2813

I can't really track this down to any report or fix, probably because
it's a month old and hasn't happened since on these tests. Something
was likely broken on that particular day.

> Keep in mind these run pre-commit so not all failures are flakes.

Thanks,
             Dmitry