diff mbox series

[net-next,1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()

Message ID 20240417165756.2531620-2-edumazet@google.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series tcp: tcp_v4_err() changes | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 945 this patch: 945
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 938 this patch: 938
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 956 this patch: 956
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 9 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-04-18--12-00 (tests: 962)

Commit Message

Eric Dumazet April 17, 2024, 4:57 p.m. UTC
Blamed commit claimed in its changelog that the new functionality
was guarded by IP_RECVERR/IPV6_RECVERR :

    Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
    enable this feature, and that the error message is only queued
    while in SYN_SNT state.

This was true only for IPv6, because ipv6_icmp_error() has
the following check:

if (!inet6_test_bit(RECVERR6, sk))
    return;

Other callers check IP_RECVERR by themselves, it is unclear
if we could factorize these checks in ip_icmp_error()

For stable backports, I chose to add the missing check in tcp_v4_err()

We think this missing check was the root cause for commit
0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
some ICMP") breakage, leading to a revert.

Many thanks to Dragos Tatulea for conducting the investigations.

As Jakub said :

    The suspicion is that SSH sees the ICMP report on the socket error queue
    and tries to connect() again, but due to the patch the socket isn't
    disconnected, so it gets EALREADY, and throws its hands up...

    The error bubbles up to Vagrant which also becomes unhappy.

    Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?

Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Cc: Dragos Tatulea <dtatulea@nvidia.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Shachar Kagan <skagan@nvidia.com>
---
 net/ipv4/tcp_ipv4.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Maciej Żenczykowski April 17, 2024, 5:08 p.m. UTC | #1
On Wed, Apr 17, 2024 at 9:58 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Blamed commit claimed in its changelog that the new functionality
> was guarded by IP_RECVERR/IPV6_RECVERR :
>
>     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
>     enable this feature, and that the error message is only queued
>     while in SYN_SNT state.
>
> This was true only for IPv6, because ipv6_icmp_error() has
> the following check:
>
> if (!inet6_test_bit(RECVERR6, sk))
>     return;
>
> Other callers check IP_RECVERR by themselves, it is unclear
> if we could factorize these checks in ip_icmp_error()
>
> For stable backports, I chose to add the missing check in tcp_v4_err()
>
> We think this missing check was the root cause for commit
> 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> some ICMP") breakage, leading to a revert.
>
> Many thanks to Dragos Tatulea for conducting the investigations.
>
> As Jakub said :
>
>     The suspicion is that SSH sees the ICMP report on the socket error queue
>     and tries to connect() again, but due to the patch the socket isn't
>     disconnected, so it gets EALREADY, and throws its hands up...
>
>     The error bubbles up to Vagrant which also becomes unhappy.
>
>     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
>
> Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Shachar Kagan <skagan@nvidia.com>
> ---
>  net/ipv4/tcp_ipv4.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
>                 if (fastopen && !fastopen->sk)
>                         break;
>
> -               ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> +               if (inet_test_bit(RECVERR, sk))
> +                       ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
>
>                 if (!sock_owned_by_user(sk)) {
>                         WRITE_ONCE(sk->sk_err, err);
> --
> 2.44.0.683.g7961c838ac-goog
>

Reviewed-by: Maciej Żenczykowski <maze@google.com>

Makes sense to me.
Jason Xing April 18, 2024, 3:22 a.m. UTC | #2
On Thu, Apr 18, 2024 at 12:59 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Blamed commit claimed in its changelog that the new functionality
> was guarded by IP_RECVERR/IPV6_RECVERR :
>
>     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
>     enable this feature, and that the error message is only queued
>     while in SYN_SNT state.
>
> This was true only for IPv6, because ipv6_icmp_error() has
> the following check:
>
> if (!inet6_test_bit(RECVERR6, sk))
>     return;
>
> Other callers check IP_RECVERR by themselves, it is unclear
> if we could factorize these checks in ip_icmp_error()
>
> For stable backports, I chose to add the missing check in tcp_v4_err()
>
> We think this missing check was the root cause for commit
> 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> some ICMP") breakage, leading to a revert.
>
> Many thanks to Dragos Tatulea for conducting the investigations.
>
> As Jakub said :
>
>     The suspicion is that SSH sees the ICMP report on the socket error queue
>     and tries to connect() again, but due to the patch the socket isn't
>     disconnected, so it gets EALREADY, and throws its hands up...
>
>     The error bubbles up to Vagrant which also becomes unhappy.
>
>     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
>
> Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Shachar Kagan <skagan@nvidia.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

I wonder if we're supposed to move this check into ip_icmp_error()
like ipv6_icmp_error() does, because I notice one caller
rxrpc_encap_err_rcv() without checking RECVERR  bit reuses the ICMP
error logic which is introduced in commit b6c66c4324e7 ("rxrpc: Use
the core ICMP/ICMP6 parsers'')?

Or should it be a follow-up patch (moving it inside of
ip_icmp_error()) to handle the rxrpc case and also prevent future
misuse for other people?

Thanks,
Jason
Eric Dumazet April 18, 2024, 6:45 a.m. UTC | #3
On Thu, Apr 18, 2024 at 5:23 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Apr 18, 2024 at 12:59 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Blamed commit claimed in its changelog that the new functionality
> > was guarded by IP_RECVERR/IPV6_RECVERR :
> >
> >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> >     enable this feature, and that the error message is only queued
> >     while in SYN_SNT state.
> >
> > This was true only for IPv6, because ipv6_icmp_error() has
> > the following check:
> >
> > if (!inet6_test_bit(RECVERR6, sk))
> >     return;
> >
> > Other callers check IP_RECVERR by themselves, it is unclear
> > if we could factorize these checks in ip_icmp_error()
> >
> > For stable backports, I chose to add the missing check in tcp_v4_err()
> >
> > We think this missing check was the root cause for commit
> > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > some ICMP") breakage, leading to a revert.
> >
> > Many thanks to Dragos Tatulea for conducting the investigations.
> >
> > As Jakub said :
> >
> >     The suspicion is that SSH sees the ICMP report on the socket error queue
> >     and tries to connect() again, but due to the patch the socket isn't
> >     disconnected, so it gets EALREADY, and throws its hands up...
> >
> >     The error bubbles up to Vagrant which also becomes unhappy.
> >
> >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> >
> > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Maciej Żenczykowski <maze@google.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Shachar Kagan <skagan@nvidia.com>
>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>
> I wonder if we're supposed to move this check into ip_icmp_error()
> like ipv6_icmp_error() does, because I notice one caller
> rxrpc_encap_err_rcv() without checking RECVERR  bit reuses the ICMP
> error logic which is introduced in commit b6c66c4324e7 ("rxrpc: Use
> the core ICMP/ICMP6 parsers'')?

I tried to focus on the TCP issues, and to have a stable candidate for patch #1.

The refactoring can wait.

>
> Or should it be a follow-up patch (moving it inside of
> ip_icmp_error()) to handle the rxrpc case and also prevent future
> misuse for other people?
Jason Xing April 18, 2024, 6:53 a.m. UTC | #4
On Thu, Apr 18, 2024 at 2:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 5:23 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Apr 18, 2024 at 12:59 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > Blamed commit claimed in its changelog that the new functionality
> > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > >
> > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > >     enable this feature, and that the error message is only queued
> > >     while in SYN_SNT state.
> > >
> > > This was true only for IPv6, because ipv6_icmp_error() has
> > > the following check:
> > >
> > > if (!inet6_test_bit(RECVERR6, sk))
> > >     return;
> > >
> > > Other callers check IP_RECVERR by themselves, it is unclear
> > > if we could factorize these checks in ip_icmp_error()
> > >
> > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > >
> > > We think this missing check was the root cause for commit
> > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > some ICMP") breakage, leading to a revert.
> > >
> > > Many thanks to Dragos Tatulea for conducting the investigations.
> > >
> > > As Jakub said :
> > >
> > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > >     and tries to connect() again, but due to the patch the socket isn't
> > >     disconnected, so it gets EALREADY, and throws its hands up...
> > >
> > >     The error bubbles up to Vagrant which also becomes unhappy.
> > >
> > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > >
> > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Maciej Żenczykowski <maze@google.com>
> > > Cc: Willem de Bruijn <willemb@google.com>
> > > Cc: Neal Cardwell <ncardwell@google.com>
> > > Cc: Shachar Kagan <skagan@nvidia.com>
> >
> > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > I wonder if we're supposed to move this check into ip_icmp_error()
> > like ipv6_icmp_error() does, because I notice one caller
> > rxrpc_encap_err_rcv() without checking RECVERR  bit reuses the ICMP
> > error logic which is introduced in commit b6c66c4324e7 ("rxrpc: Use
> > the core ICMP/ICMP6 parsers'')?
>
> I tried to focus on the TCP issues, and to have a stable candidate for patch #1.
>
> The refactoring can wait.

Got it. It's clear.

After this patch is applied, I can adjust a little bit (only by moving
it into ip_icmp_error()).

Thanks,
Jason

>
> >
> > Or should it be a follow-up patch (moving it inside of
> > ip_icmp_error()) to handle the rxrpc case and also prevent future
> > misuse for other people?
Paolo Abeni April 18, 2024, 8:02 a.m. UTC | #5
Hi,

On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> Blamed commit claimed in its changelog that the new functionality
> was guarded by IP_RECVERR/IPV6_RECVERR :
> 
>     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
>     enable this feature, and that the error message is only queued
>     while in SYN_SNT state.
> 
> This was true only for IPv6, because ipv6_icmp_error() has
> the following check:
> 
> if (!inet6_test_bit(RECVERR6, sk))
>     return;
> 
> Other callers check IP_RECVERR by themselves, it is unclear
> if we could factorize these checks in ip_icmp_error()
> 
> For stable backports, I chose to add the missing check in tcp_v4_err()
> 
> We think this missing check was the root cause for commit
> 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> some ICMP") breakage, leading to a revert.
> 
> Many thanks to Dragos Tatulea for conducting the investigations.
> 
> As Jakub said :
> 
>     The suspicion is that SSH sees the ICMP report on the socket error queue
>     and tries to connect() again, but due to the patch the socket isn't
>     disconnected, so it gets EALREADY, and throws its hands up...
> 
>     The error bubbles up to Vagrant which also becomes unhappy.
> 
>     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> 
> Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Shachar Kagan <skagan@nvidia.com>
> ---
>  net/ipv4/tcp_ipv4.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
>  		if (fastopen && !fastopen->sk)
>  			break;
>  
> -		ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> +		if (inet_test_bit(RECVERR, sk))
> +			ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
>  
>  		if (!sock_owned_by_user(sk)) {
>  			WRITE_ONCE(sk->sk_err, err);

We have a fcnal-test.sh self-test failure:

https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh

that I suspect are related to this patch (or the following one): the
test case creates a TCP connection on loopback and this is the only
patchseries touching the related code, included in the relevant patch
burst.

Could you please have a look?

Thanks!

Paolo
Eric Dumazet April 18, 2024, 8:03 a.m. UTC | #6
On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > Blamed commit claimed in its changelog that the new functionality
> > was guarded by IP_RECVERR/IPV6_RECVERR :
> >
> >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> >     enable this feature, and that the error message is only queued
> >     while in SYN_SNT state.
> >
> > This was true only for IPv6, because ipv6_icmp_error() has
> > the following check:
> >
> > if (!inet6_test_bit(RECVERR6, sk))
> >     return;
> >
> > Other callers check IP_RECVERR by themselves, it is unclear
> > if we could factorize these checks in ip_icmp_error()
> >
> > For stable backports, I chose to add the missing check in tcp_v4_err()
> >
> > We think this missing check was the root cause for commit
> > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > some ICMP") breakage, leading to a revert.
> >
> > Many thanks to Dragos Tatulea for conducting the investigations.
> >
> > As Jakub said :
> >
> >     The suspicion is that SSH sees the ICMP report on the socket error queue
> >     and tries to connect() again, but due to the patch the socket isn't
> >     disconnected, so it gets EALREADY, and throws its hands up...
> >
> >     The error bubbles up to Vagrant which also becomes unhappy.
> >
> >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> >
> > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Maciej Żenczykowski <maze@google.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Shachar Kagan <skagan@nvidia.com>
> > ---
> >  net/ipv4/tcp_ipv4.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> >               if (fastopen && !fastopen->sk)
> >                       break;
> >
> > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > +             if (inet_test_bit(RECVERR, sk))
> > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> >
> >               if (!sock_owned_by_user(sk)) {
> >                       WRITE_ONCE(sk->sk_err, err);
>
> We have a fcnal-test.sh self-test failure:
>
> https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
>
> that I suspect are related to this patch (or the following one): the
> test case creates a TCP connection on loopback and this is the only
> patchseries touching the related code, included in the relevant patch
> burst.
>
> Could you please have a look?

Sure, thanks Paolo !
Eric Dumazet April 18, 2024, 9:26 a.m. UTC | #7
On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > Hi,
> >
> > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > Blamed commit claimed in its changelog that the new functionality
> > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > >
> > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > >     enable this feature, and that the error message is only queued
> > >     while in SYN_SNT state.
> > >
> > > This was true only for IPv6, because ipv6_icmp_error() has
> > > the following check:
> > >
> > > if (!inet6_test_bit(RECVERR6, sk))
> > >     return;
> > >
> > > Other callers check IP_RECVERR by themselves, it is unclear
> > > if we could factorize these checks in ip_icmp_error()
> > >
> > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > >
> > > We think this missing check was the root cause for commit
> > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > some ICMP") breakage, leading to a revert.
> > >
> > > Many thanks to Dragos Tatulea for conducting the investigations.
> > >
> > > As Jakub said :
> > >
> > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > >     and tries to connect() again, but due to the patch the socket isn't
> > >     disconnected, so it gets EALREADY, and throws its hands up...
> > >
> > >     The error bubbles up to Vagrant which also becomes unhappy.
> > >
> > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > >
> > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Maciej Żenczykowski <maze@google.com>
> > > Cc: Willem de Bruijn <willemb@google.com>
> > > Cc: Neal Cardwell <ncardwell@google.com>
> > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > ---
> > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > >               if (fastopen && !fastopen->sk)
> > >                       break;
> > >
> > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > +             if (inet_test_bit(RECVERR, sk))
> > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > >
> > >               if (!sock_owned_by_user(sk)) {
> > >                       WRITE_ONCE(sk->sk_err, err);
> >
> > We have a fcnal-test.sh self-test failure:
> >
> > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> >
> > that I suspect are related to this patch (or the following one): the
> > test case creates a TCP connection on loopback and this is the only
> > patchseries touching the related code, included in the relevant patch
> > burst.
> >
> > Could you please have a look?
>
> Sure, thanks Paolo !

First patch is fine, I see no failure from fcnal-test.sh (as I would expect)

For the second one, I am not familiar enough with this very slow test
suite (all these "sleep 1" ... oh well)

I guess "failing tests" depended on TCP connect() to immediately abort
on one ICMP message,
depending on old kernel behavior.

I do not know how to launch a subset of the tests, and trace these.

"./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
VM running a non debug kernel :/

David, do you have an idea how to proceed ?

Thanks.

[1]
Tests passed: 134
Tests failed:   0

real 9m33.085s
user 0m40.159s
sys 0m30.098s
Paolo Abeni April 18, 2024, 9:58 a.m. UTC | #8
On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > 
> > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > 
> > > Hi,
> > > 
> > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > Blamed commit claimed in its changelog that the new functionality
> > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > 
> > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > >     enable this feature, and that the error message is only queued
> > > >     while in SYN_SNT state.
> > > > 
> > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > the following check:
> > > > 
> > > > if (!inet6_test_bit(RECVERR6, sk))
> > > >     return;
> > > > 
> > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > if we could factorize these checks in ip_icmp_error()
> > > > 
> > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > 
> > > > We think this missing check was the root cause for commit
> > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > some ICMP") breakage, leading to a revert.
> > > > 
> > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > 
> > > > As Jakub said :
> > > > 
> > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > >     and tries to connect() again, but due to the patch the socket isn't
> > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > 
> > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > 
> > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > 
> > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > ---
> > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > >               if (fastopen && !fastopen->sk)
> > > >                       break;
> > > > 
> > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > +             if (inet_test_bit(RECVERR, sk))
> > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > 
> > > >               if (!sock_owned_by_user(sk)) {
> > > >                       WRITE_ONCE(sk->sk_err, err);
> > > 
> > > We have a fcnal-test.sh self-test failure:
> > > 
> > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > 
> > > that I suspect are related to this patch (or the following one): the
> > > test case creates a TCP connection on loopback and this is the only
> > > patchseries touching the related code, included in the relevant patch
> > > burst.
> > > 
> > > Could you please have a look?
> > 
> > Sure, thanks Paolo !
> 
> First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> 
> For the second one, I am not familiar enough with this very slow test
> suite (all these "sleep 1" ... oh well)

@David, some of them could be replaced with loopy_wait calls

> I guess "failing tests" depended on TCP connect() to immediately abort
> on one ICMP message,
> depending on old kernel behavior.
> 
> I do not know how to launch a subset of the tests, and trace these.
> 
> "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> VM running a non debug kernel :/
> 
> David, do you have an idea how to proceed ?

One very dumb thing I do in that cases is commenting out the other
tests, something alike (completely untested!):
---
diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
index 386ebd829df5..494932aa99b2 100755
--- a/tools/testing/selftests/net/fcnal-test.sh
+++ b/tools/testing/selftests/net/fcnal-test.sh
@@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
 {
 	local a
 
+if false; then
 	#
 	# server tests
 	#
@@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
 		log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
 	done
 
+fi
 	a=${NSA_IP}
 	log_start
 	run_cmd nettest -s &
@@ -1487,12 +1489,14 @@ ipv4_tcp()
 	set_sysctl net.ipv4.tcp_l3mdev_accept=0
 	ipv4_tcp_novrf
 	log_subsection "tcp_l3mdev_accept enabled"
+if false; then
 	set_sysctl net.ipv4.tcp_l3mdev_accept=1
 	ipv4_tcp_novrf
 
 	log_subsection "With VRF"
 	setup "yes"
 	ipv4_tcp_vrf
+fi
 }
 
 ################################################################################
Eric Dumazet April 18, 2024, 10:15 a.m. UTC | #9
On Thu, Apr 18, 2024 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> > On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > > Blamed commit claimed in its changelog that the new functionality
> > > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > >
> > > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > > >     enable this feature, and that the error message is only queued
> > > > >     while in SYN_SNT state.
> > > > >
> > > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > > the following check:
> > > > >
> > > > > if (!inet6_test_bit(RECVERR6, sk))
> > > > >     return;
> > > > >
> > > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > > if we could factorize these checks in ip_icmp_error()
> > > > >
> > > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > >
> > > > > We think this missing check was the root cause for commit
> > > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > > some ICMP") breakage, leading to a revert.
> > > > >
> > > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > >
> > > > > As Jakub said :
> > > > >
> > > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > > >     and tries to connect() again, but due to the patch the socket isn't
> > > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > >
> > > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > >
> > > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > >
> > > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > > ---
> > > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > > >               if (fastopen && !fastopen->sk)
> > > > >                       break;
> > > > >
> > > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > +             if (inet_test_bit(RECVERR, sk))
> > > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > >
> > > > >               if (!sock_owned_by_user(sk)) {
> > > > >                       WRITE_ONCE(sk->sk_err, err);
> > > >
> > > > We have a fcnal-test.sh self-test failure:
> > > >
> > > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > >
> > > > that I suspect are related to this patch (or the following one): the
> > > > test case creates a TCP connection on loopback and this is the only
> > > > patchseries touching the related code, included in the relevant patch
> > > > burst.
> > > >
> > > > Could you please have a look?
> > >
> > > Sure, thanks Paolo !
> >
> > First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> >
> > For the second one, I am not familiar enough with this very slow test
> > suite (all these "sleep 1" ... oh well)
>
> @David, some of them could be replaced with loopy_wait calls
>
> > I guess "failing tests" depended on TCP connect() to immediately abort
> > on one ICMP message,
> > depending on old kernel behavior.
> >
> > I do not know how to launch a subset of the tests, and trace these.
> >
> > "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> > VM running a non debug kernel :/
> >
> > David, do you have an idea how to proceed ?
>
> One very dumb thing I do in that cases is commenting out the other
> tests, something alike (completely untested!):
> ---
> diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
> index 386ebd829df5..494932aa99b2 100755
> --- a/tools/testing/selftests/net/fcnal-test.sh
> +++ b/tools/testing/selftests/net/fcnal-test.sh
> @@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
>  {
>         local a
>
> +if false; then
>         #
>         # server tests
>         #
> @@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
>                 log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
>         done
>
> +fi
>         a=${NSA_IP}
>         log_start
>         run_cmd nettest -s &
> @@ -1487,12 +1489,14 @@ ipv4_tcp()
>         set_sysctl net.ipv4.tcp_l3mdev_accept=0
>         ipv4_tcp_novrf
>         log_subsection "tcp_l3mdev_accept enabled"
> +if false; then
>         set_sysctl net.ipv4.tcp_l3mdev_accept=1
>         ipv4_tcp_novrf
>
>         log_subsection "With VRF"
>         setup "yes"
>         ipv4_tcp_vrf
> +fi
>  }

Thanks Paolo

I found that the following patch is fixing the issue for me.

diff --git a/tools/testing/selftests/net/nettest.c
b/tools/testing/selftests/net/nettest.c
index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
100644
--- a/tools/testing/selftests/net/nettest.c
+++ b/tools/testing/selftests/net/nettest.c
@@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
alen, struct sock_args *args)
        if (args->bind_test_only)
                goto out;

+       set_recv_attr(sd, args->version);
        if (connect(sd, addr, alen) < 0) {
                if (errno != EINPROGRESS) {
                        log_err_errno("Failed to connect to remote host");
Eric Dumazet April 18, 2024, 10:22 a.m. UTC | #10
On Thu, Apr 18, 2024 at 12:15 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> > > On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > > > Blamed commit claimed in its changelog that the new functionality
> > > > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > > >
> > > > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > > > >     enable this feature, and that the error message is only queued
> > > > > >     while in SYN_SNT state.
> > > > > >
> > > > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > > > the following check:
> > > > > >
> > > > > > if (!inet6_test_bit(RECVERR6, sk))
> > > > > >     return;
> > > > > >
> > > > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > > > if we could factorize these checks in ip_icmp_error()
> > > > > >
> > > > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > > >
> > > > > > We think this missing check was the root cause for commit
> > > > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > > > some ICMP") breakage, leading to a revert.
> > > > > >
> > > > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > > >
> > > > > > As Jakub said :
> > > > > >
> > > > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > > > >     and tries to connect() again, but due to the patch the socket isn't
> > > > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > > >
> > > > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > > >
> > > > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > > >
> > > > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > > > >               if (fastopen && !fastopen->sk)
> > > > > >                       break;
> > > > > >
> > > > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > > +             if (inet_test_bit(RECVERR, sk))
> > > > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > >
> > > > > >               if (!sock_owned_by_user(sk)) {
> > > > > >                       WRITE_ONCE(sk->sk_err, err);
> > > > >
> > > > > We have a fcnal-test.sh self-test failure:
> > > > >
> > > > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > > >
> > > > > that I suspect are related to this patch (or the following one): the
> > > > > test case creates a TCP connection on loopback and this is the only
> > > > > patchseries touching the related code, included in the relevant patch
> > > > > burst.
> > > > >
> > > > > Could you please have a look?
> > > >
> > > > Sure, thanks Paolo !
> > >
> > > First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> > >
> > > For the second one, I am not familiar enough with this very slow test
> > > suite (all these "sleep 1" ... oh well)
> >
> > @David, some of them could be replaced with loopy_wait calls
> >
> > > I guess "failing tests" depended on TCP connect() to immediately abort
> > > on one ICMP message,
> > > depending on old kernel behavior.
> > >
> > > I do not know how to launch a subset of the tests, and trace these.
> > >
> > > "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> > > VM running a non debug kernel :/
> > >
> > > David, do you have an idea how to proceed ?
> >
> > One very dumb thing I do in that cases is commenting out the other
> > tests, something alike (completely untested!):
> > ---
> > diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
> > index 386ebd829df5..494932aa99b2 100755
> > --- a/tools/testing/selftests/net/fcnal-test.sh
> > +++ b/tools/testing/selftests/net/fcnal-test.sh
> > @@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
> >  {
> >         local a
> >
> > +if false; then
> >         #
> >         # server tests
> >         #
> > @@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
> >                 log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
> >         done
> >
> > +fi
> >         a=${NSA_IP}
> >         log_start
> >         run_cmd nettest -s &
> > @@ -1487,12 +1489,14 @@ ipv4_tcp()
> >         set_sysctl net.ipv4.tcp_l3mdev_accept=0
> >         ipv4_tcp_novrf
> >         log_subsection "tcp_l3mdev_accept enabled"
> > +if false; then
> >         set_sysctl net.ipv4.tcp_l3mdev_accept=1
> >         ipv4_tcp_novrf
> >
> >         log_subsection "With VRF"
> >         setup "yes"
> >         ipv4_tcp_vrf
> > +fi
> >  }
>
> Thanks Paolo
>
> I found that the following patch is fixing the issue for me.
>
> diff --git a/tools/testing/selftests/net/nettest.c
> b/tools/testing/selftests/net/nettest.c
> index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> 100644
> --- a/tools/testing/selftests/net/nettest.c
> +++ b/tools/testing/selftests/net/nettest.c
> @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> alen, struct sock_args *args)
>         if (args->bind_test_only)
>                 goto out;
>
> +       set_recv_attr(sd, args->version);
>         if (connect(sd, addr, alen) < 0) {
>                 if (errno != EINPROGRESS) {
>                         log_err_errno("Failed to connect to remote host");

When tracing nettest we now have EHOSTUNREACH

3343  setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0 <0.000210>
3343  setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "eth1\0", 5) = 0 <0.000170>
3343  setsockopt(3, SOL_IP, IP_PKTINFO, [1], 4) = 0 <0.000161>
3343  setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0 <0.000181>
3343  connect(3, {sa_family=AF_INET, sin_port=htons(12345),
sin_addr=inet_addr("172.16.2.1")}, 16) = -1 EINPROGRESS (Operation now
in progress) <0.000874>
3343  pselect6(1024, NULL, [3], NULL, {tv_sec=5, tv_nsec=0}, NULL) = 1
(out [3], left {tv_sec=1, tv_nsec=930762080}) <3.069673>
3343  getsockopt(3, SOL_SOCKET, SO_ERROR, [EHOSTUNREACH], [4]) = 0 <0.000270>

As mentioned in net/ipv4/icmp.c :
 RFC 1122: 3.2.2.1 States that NET_UNREACH, HOST_UNREACH and SR_FAILED
MUST be considered 'transient errs'.

Maybe another way to fix nettest would be to change wait_for_connect()
to pass a non NULL fdset in 4th argument of select()

select(FD_SETSIZE, NULL, &wfd, NULL /* here */, tv);
Eric Dumazet April 18, 2024, 10:36 a.m. UTC | #11
On Thu, Apr 18, 2024 at 12:22 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 12:15 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Apr 18, 2024 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> > > > On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > > > > Blamed commit claimed in its changelog that the new functionality
> > > > > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > > > >
> > > > > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > > > > >     enable this feature, and that the error message is only queued
> > > > > > >     while in SYN_SNT state.
> > > > > > >
> > > > > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > > > > the following check:
> > > > > > >
> > > > > > > if (!inet6_test_bit(RECVERR6, sk))
> > > > > > >     return;
> > > > > > >
> > > > > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > > > > if we could factorize these checks in ip_icmp_error()
> > > > > > >
> > > > > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > > > >
> > > > > > > We think this missing check was the root cause for commit
> > > > > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > > > > some ICMP") breakage, leading to a revert.
> > > > > > >
> > > > > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > > > >
> > > > > > > As Jakub said :
> > > > > > >
> > > > > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > > > > >     and tries to connect() again, but due to the patch the socket isn't
> > > > > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > > > >
> > > > > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > > > >
> > > > > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > > > >
> > > > > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > > > > ---
> > > > > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > > > > >               if (fastopen && !fastopen->sk)
> > > > > > >                       break;
> > > > > > >
> > > > > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > > > +             if (inet_test_bit(RECVERR, sk))
> > > > > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > > >
> > > > > > >               if (!sock_owned_by_user(sk)) {
> > > > > > >                       WRITE_ONCE(sk->sk_err, err);
> > > > > >
> > > > > > We have a fcnal-test.sh self-test failure:
> > > > > >
> > > > > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > > > >
> > > > > > that I suspect are related to this patch (or the following one): the
> > > > > > test case creates a TCP connection on loopback and this is the only
> > > > > > patchseries touching the related code, included in the relevant patch
> > > > > > burst.
> > > > > >
> > > > > > Could you please have a look?
> > > > >
> > > > > Sure, thanks Paolo !
> > > >
> > > > First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> > > >
> > > > For the second one, I am not familiar enough with this very slow test
> > > > suite (all these "sleep 1" ... oh well)
> > >
> > > @David, some of them could be replaced with loopy_wait calls
> > >
> > > > I guess "failing tests" depended on TCP connect() to immediately abort
> > > > on one ICMP message,
> > > > depending on old kernel behavior.
> > > >
> > > > I do not know how to launch a subset of the tests, and trace these.
> > > >
> > > > "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> > > > VM running a non debug kernel :/
> > > >
> > > > David, do you have an idea how to proceed ?
> > >
> > > One very dumb thing I do in that cases is commenting out the other
> > > tests, something alike (completely untested!):
> > > ---
> > > diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
> > > index 386ebd829df5..494932aa99b2 100755
> > > --- a/tools/testing/selftests/net/fcnal-test.sh
> > > +++ b/tools/testing/selftests/net/fcnal-test.sh
> > > @@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
> > >  {
> > >         local a
> > >
> > > +if false; then
> > >         #
> > >         # server tests
> > >         #
> > > @@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
> > >                 log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
> > >         done
> > >
> > > +fi
> > >         a=${NSA_IP}
> > >         log_start
> > >         run_cmd nettest -s &
> > > @@ -1487,12 +1489,14 @@ ipv4_tcp()
> > >         set_sysctl net.ipv4.tcp_l3mdev_accept=0
> > >         ipv4_tcp_novrf
> > >         log_subsection "tcp_l3mdev_accept enabled"
> > > +if false; then
> > >         set_sysctl net.ipv4.tcp_l3mdev_accept=1
> > >         ipv4_tcp_novrf
> > >
> > >         log_subsection "With VRF"
> > >         setup "yes"
> > >         ipv4_tcp_vrf
> > > +fi
> > >  }
> >
> > Thanks Paolo
> >
> > I found that the following patch is fixing the issue for me.
> >
> > diff --git a/tools/testing/selftests/net/nettest.c
> > b/tools/testing/selftests/net/nettest.c
> > index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> > 100644
> > --- a/tools/testing/selftests/net/nettest.c
> > +++ b/tools/testing/selftests/net/nettest.c
> > @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> > alen, struct sock_args *args)
> >         if (args->bind_test_only)
> >                 goto out;
> >
> > +       set_recv_attr(sd, args->version);
> >         if (connect(sd, addr, alen) < 0) {
> >                 if (errno != EINPROGRESS) {
> >                         log_err_errno("Failed to connect to remote host");
>
> When tracing nettest we now have EHOSTUNREACH
>
> 3343  setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0 <0.000210>
> 3343  setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "eth1\0", 5) = 0 <0.000170>
> 3343  setsockopt(3, SOL_IP, IP_PKTINFO, [1], 4) = 0 <0.000161>
> 3343  setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0 <0.000181>
> 3343  connect(3, {sa_family=AF_INET, sin_port=htons(12345),
> sin_addr=inet_addr("172.16.2.1")}, 16) = -1 EINPROGRESS (Operation now
> in progress) <0.000874>
> 3343  pselect6(1024, NULL, [3], NULL, {tv_sec=5, tv_nsec=0}, NULL) = 1
> (out [3], left {tv_sec=1, tv_nsec=930762080}) <3.069673>
> 3343  getsockopt(3, SOL_SOCKET, SO_ERROR, [EHOSTUNREACH], [4]) = 0 <0.000270>
>
> As mentioned in net/ipv4/icmp.c :
>  RFC 1122: 3.2.2.1 States that NET_UNREACH, HOST_UNREACH and SR_FAILED
> MUST be considered 'transient errs'.
>
> Maybe another way to fix nettest would be to change wait_for_connect()
> to pass a non NULL fdset in 4th argument of select()
>
> select(FD_SETSIZE, NULL, &wfd, NULL /* here */, tv);

This change in wait_for_connect() does not help.

I am guessing set_recv_attr(sd, args->version); is what we need.

I am running all ./fcnal-test.sh tests to make sure everything is green.
David Ahern April 18, 2024, 5:46 p.m. UTC | #12
On 4/18/24 4:15 AM, Eric Dumazet wrote:
> 
> Thanks Paolo
> 
> I found that the following patch is fixing the issue for me.
> 
> diff --git a/tools/testing/selftests/net/nettest.c
> b/tools/testing/selftests/net/nettest.c
> index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> 100644
> --- a/tools/testing/selftests/net/nettest.c
> +++ b/tools/testing/selftests/net/nettest.c
> @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> alen, struct sock_args *args)
>         if (args->bind_test_only)
>                 goto out;
> 
> +       set_recv_attr(sd, args->version);
>         if (connect(sd, addr, alen) < 0) {
>                 if (errno != EINPROGRESS) {
>                         log_err_errno("Failed to connect to remote host");

You have a kernel patch that makes a test fail, and your solution is
changing userspace? The tests are examples of userspace applications and
how they can use APIs, so if the patch breaks a test it is by definition
breaking userspace which is not allowed.
Eric Dumazet April 18, 2024, 5:47 p.m. UTC | #13
On Thu, Apr 18, 2024 at 7:46 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 4/18/24 4:15 AM, Eric Dumazet wrote:
> >
> > Thanks Paolo
> >
> > I found that the following patch is fixing the issue for me.
> >
> > diff --git a/tools/testing/selftests/net/nettest.c
> > b/tools/testing/selftests/net/nettest.c
> > index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> > 100644
> > --- a/tools/testing/selftests/net/nettest.c
> > +++ b/tools/testing/selftests/net/nettest.c
> > @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> > alen, struct sock_args *args)
> >         if (args->bind_test_only)
> >                 goto out;
> >
> > +       set_recv_attr(sd, args->version);
> >         if (connect(sd, addr, alen) < 0) {
> >                 if (errno != EINPROGRESS) {
> >                         log_err_errno("Failed to connect to remote host");
>
> You have a kernel patch that makes a test fail, and your solution is
> changing userspace? The tests are examples of userspace applications and
> how they can use APIs, so if the patch breaks a test it is by definition
> breaking userspace which is not allowed.

I think the userspace program relied on a bug added in linux in 2020

Jakub, I will stop trying to push the patches, this is a lost battle.
David Ahern April 18, 2024, 5:56 p.m. UTC | #14
On 4/18/24 3:26 AM, Eric Dumazet wrote:
> For the second one, I am not familiar enough with this very slow test
> suite (all these "sleep 1" ... oh well)
> 
> I guess "failing tests" depended on TCP connect() to immediately abort
> on one ICMP message,
> depending on old kernel behavior.
> 
> I do not know how to launch a subset of the tests, and trace these.
> 
> "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> VM running a non debug kernel :/
> 
> David, do you have an idea how to proceed ?
> 
> Thanks.
> 
> [1]
> Tests passed: 134
> Tests failed:   0
> 
> real 9m33.085s
> user 0m40.159s
> sys 0m30.098s

The test suite was started in 2013 and has evolved to cover the many
permutations of APIs -- setsockopts, cmsg, VRF, routing, ip rules,
netfilter, etc. There are a lot of combinations. They verify not just
control path or socket bind succeeds, but data path works as expected
which means do a connect and packet transfer.

Some years back nettest gained support to change namespaces and run both
client and server in a single instance. I started converting the tests
to use that capability and remove the sleeps, but it did not speed up
the tests in any significant way.

The tests are in blocks and the blocks can be split out to separate
scripts or kept in one to retain the common setup code and launched in
parallel. Splitting out to any lower level does not make sense.

If someone wants to pick up the task of splitting the tests or running
them in parallel, please do. I have zero time right now.  That the suite
detects changes in kernel behavior shows it is doing what it is designed
to do and proves its value.
David Ahern April 18, 2024, 6:02 p.m. UTC | #15
On 4/18/24 11:47 AM, Eric Dumazet wrote:
> I think the userspace program relied on a bug added in linux in 2020

which test - when was it added? nettest.c and the fcnal script were
merged in 2019.
Jakub Kicinski April 18, 2024, 6:09 p.m. UTC | #16
On Thu, 18 Apr 2024 19:47:51 +0200 Eric Dumazet wrote:
> > You have a kernel patch that makes a test fail, and your solution is
> > changing userspace? The tests are examples of userspace applications and
> > how they can use APIs, so if the patch breaks a test it is by definition
> > breaking userspace which is not allowed.  

Tests are often overly sensitive to kernel behavior, while this is
obviously a red flag it's not an automatic nack. The more tests we
have the more often we'll catch tiny changes. A lot of tests started
flaking with 6.9 because of the optimizations in the timer subsystem.
You know where I'm going with this..

> I think the userspace program relied on a bug added in linux in 2020
> 
> Jakub, I will stop trying to push the patches, this is a lost battle.

If you have the patches ready - please post them.
I'm happy to take the blame if they actually regress something in 
the wild :(

We're pursuing this because real application suffer real problems
when routing changes cause transient ICMP errors. Users read the RFC
and then come shouting at us that Linux is buggy.
Eric Dumazet April 18, 2024, 6:09 p.m. UTC | #17
On Thu, Apr 18, 2024 at 8:02 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 4/18/24 11:47 AM, Eric Dumazet wrote:
> > I think the userspace program relied on a bug added in linux in 2020
>
> which test - when was it added? nettest.c and the fcnal script were
> merged in 2019.

The test relies on connect() being aborted by EHOSTUNREACH

This is in complete violation with RFC

Even our own net.ipv4/icmp.c states this clearly.

 RFC 1122: 3.2.2.1 States that NET_UNREACH, HOST_UNREACH and SR_FAILED
MUST be considered 'transient errs'.

Your program wants to use RECVERR instead of depending on a bug that
we are trying very hard to solve.
Eric Dumazet April 18, 2024, 6:14 p.m. UTC | #18
On Thu, Apr 18, 2024 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 18 Apr 2024 19:47:51 +0200 Eric Dumazet wrote:
> > > You have a kernel patch that makes a test fail, and your solution is
> > > changing userspace? The tests are examples of userspace applications and
> > > how they can use APIs, so if the patch breaks a test it is by definition
> > > breaking userspace which is not allowed.
>
> Tests are often overly sensitive to kernel behavior, while this is
> obviously a red flag it's not an automatic nack. The more tests we
> have the more often we'll catch tiny changes. A lot of tests started
> flaking with 6.9 because of the optimizations in the timer subsystem.
> You know where I'm going with this..
>
> > I think the userspace program relied on a bug added in linux in 2020
> >
> > Jakub, I will stop trying to push the patches, this is a lost battle.
>
> If you have the patches ready - please post them.
> I'm happy to take the blame if they actually regress something in
> the wild :(

The series with the 2 patches has been posted already.

The remaining part is a nettest.c fix, that David is not happy with.

diff --git a/tools/testing/selftests/net/nettest.c
b/tools/testing/selftests/net/nettest.c
index cd8a580974480212b45d86f35293b77f3d033473..23d56797900f19890923028af0b7ba162d9d5794
100644
--- a/tools/testing/selftests/net/nettest.c
+++ b/tools/testing/selftests/net/nettest.c
@@ -1744,6 +1744,8 @@ static int connectsock(void *addr, socklen_t
alen, struct sock_args *args)
        if (args->bind_test_only)
                goto out;

+       set_recv_attr(sd, args->version);
+
        if (connect(sd, addr, alen) < 0) {
                if (errno != EINPROGRESS) {
                        log_err_errno("Failed to connect to remote host");

>
> We're pursuing this because real application suffer real problems
> when routing changes cause transient ICMP errors. Users read the RFC
> and then come shouting at us that Linux is buggy.
David Ahern April 18, 2024, 8:20 p.m. UTC | #19
On 4/18/24 12:09 PM, Jakub Kicinski wrote:
> On Thu, 18 Apr 2024 19:47:51 +0200 Eric Dumazet wrote:
>>> You have a kernel patch that makes a test fail, and your solution is
>>> changing userspace? The tests are examples of userspace applications and
>>> how they can use APIs, so if the patch breaks a test it is by definition
>>> breaking userspace which is not allowed.  
> 
> Tests are often overly sensitive to kernel behavior, while this is

That test script is a fairly comprehensive sweep of uAPIs and kernel
behavior. Its sole job is to detect user visible changes and breakage
from patches. It has done its job here. Do not shoot or criticize the
messenger because you do not like the message.


> obviously a red flag it's not an automatic nack. The more tests we
> have the more often we'll catch tiny changes. A lot of tests started
> flaking with 6.9 because of the optimizations in the timer subsystem.
> You know where I'm going with this..
> 
>> I think the userspace program relied on a bug added in linux in 2020
>>
>> Jakub, I will stop trying to push the patches, this is a lost battle.
> 
> If you have the patches ready - please post them.
> I'm happy to take the blame if they actually regress something in 
> the wild :(

And because of this test suite you are making a conscious decision to
merge a patch that is making a user visible change. That is part of the
tough decisions a maintainer has to make; I have no problem with that.
diff mbox series

Patch

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -602,7 +602,8 @@  int tcp_v4_err(struct sk_buff *skb, u32 info)
 		if (fastopen && !fastopen->sk)
 			break;
 
-		ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
+		if (inet_test_bit(RECVERR, sk))
+			ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
 
 		if (!sock_owned_by_user(sk)) {
 			WRITE_ONCE(sk->sk_err, err);