diff mbox series

[v2] udp: Avoid call to compute_score on multiple sites

Message ID 20240410215047.21462-1-krisman@suse.de (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [v2] udp: Avoid call to compute_score on multiple sites | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 950 this patch: 950
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: kuniyu@amazon.com; 5 maintainers not CCed: pabeni@redhat.com kuba@kernel.org kuniyu@amazon.com dsahern@kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 954 this patch: 954
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 961 this patch: 961
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 67 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-11--21-00 (tests: 959)

Commit Message

Gabriel Krisman Bertazi April 10, 2024, 9:50 p.m. UTC
We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
sockets are present").  The failing tests were those that would spawn
UDP sockets per-cpu on systems that have a high number of cpus.

Unsurprisingly, it is not caused by the extra re-scoring of the reused
socket, but due to the compiler no longer inlining compute_score, once
it has the extra call site in udp4_lib_lookup2.  This is augmented by
the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.

We could just explicitly inline it, but compute_score() is quite a large
function, around 300b.  Inlining in two sites would almost double
udp4_lib_lookup2, which is a silly thing to do just to workaround a
mitigation.  Instead, this patch shuffles the code a bit to avoid the
multiple calls to compute_score.  Since it is a static function used in
one spot, the compiler can safely fold it in, as it did before, without
increasing the text size.

With this patch applied I ran my original iperf3 testcases.  The failing
cases all looked like this (ipv4):
	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2

where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
tree. harmean == harmonic mean; CV == coefficient of variation.

ipv4:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)

ipv6:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)

This restores the performance we had before the change above with this
benchmark.  We obviously don't expect any real impact when mitigations
are disabled, but just to be sure it also doesn't regresses:

mitigations=off ipv4:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)

Cc: Lorenz Bauer <lmb@isovalent.com>
Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>

---
Changes since v1:
(me)
  - recollected performance data after changes below only for the
  mitigations enabled case.
(suggested by Willem de Bruijn)
  - Drop __always_inline in compute_score
  - Simplify logic by replacing third struct sock pointer with bool
  - Fix typo in commit message
  - Don't explicitly break out of loop after rescore
---
 net/ipv4/udp.c | 18 +++++++++++++-----
 net/ipv6/udp.c | 17 +++++++++++++----
 2 files changed, 26 insertions(+), 9 deletions(-)

Comments

Kuniyuki Iwashima April 10, 2024, 10:13 p.m. UTC | #1
From: Gabriel Krisman Bertazi <krisman@suse.de>
Date: Wed, 10 Apr 2024 17:50:47 -0400
> We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> sockets are present").  The failing tests were those that would spawn
> UDP sockets per-cpu on systems that have a high number of cpus.
> 
> Unsurprisingly, it is not caused by the extra re-scoring of the reused
> socket, but due to the compiler no longer inlining compute_score, once
> it has the extra call site in udp4_lib_lookup2.  This is augmented by
> the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> 
> We could just explicitly inline it, but compute_score() is quite a large
> function, around 300b.  Inlining in two sites would almost double
> udp4_lib_lookup2, which is a silly thing to do just to workaround a
> mitigation.  Instead, this patch shuffles the code a bit to avoid the
> multiple calls to compute_score.  Since it is a static function used in
> one spot, the compiler can safely fold it in, as it did before, without
> increasing the text size.
> 
> With this patch applied I ran my original iperf3 testcases.  The failing
> cases all looked like this (ipv4):
> 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> 
> where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
> baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> tree. harmean == harmonic mean; CV == coefficient of variation.
> 
> ipv4:
>                  1G                10G                  MAX
> 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> 
> ipv6:
>                  1G                10G                  MAX
> 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> 
> This restores the performance we had before the change above with this
> benchmark.  We obviously don't expect any real impact when mitigations
> are disabled, but just to be sure it also doesn't regresses:
> 
> mitigations=off ipv4:
>                  1G                10G                  MAX
> 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> 
> Cc: Lorenz Bauer <lmb@isovalent.com>
> Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> 
> ---
> Changes since v1:
> (me)
>   - recollected performance data after changes below only for the
>   mitigations enabled case.
> (suggested by Willem de Bruijn)
>   - Drop __always_inline in compute_score
>   - Simplify logic by replacing third struct sock pointer with bool
>   - Fix typo in commit message
>   - Don't explicitly break out of loop after rescore
> ---
>  net/ipv4/udp.c | 18 +++++++++++++-----
>  net/ipv6/udp.c | 17 +++++++++++++----
>  2 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 661d0e0d273f..a13ef8e06093 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>  {
>  	struct sock *sk, *result;
>  	int score, badness;
> +	bool rescore = false;

nit: Keep reverse xmax tree order.
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs

>  
>  	result = NULL;
>  	badness = 0;
>  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> -		score = compute_score(sk, net, saddr, sport,
> -				      daddr, hnum, dif, sdif);
> +rescore:
> +		score = compute_score((rescore ? result : sk), net, saddr,

I guess () is not needed around rescore ?

Both same for IPv6.

Otherwise, looks good to me.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>


> +				      sport, daddr, hnum, dif, sdif);
> +		rescore = false;
>  		if (score > badness) {
>  			badness = score;
>  
> @@ -456,9 +459,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>  			if (IS_ERR(result))
>  				continue;
>  
> -			badness = compute_score(result, net, saddr, sport,
> -						daddr, hnum, dif, sdif);
> -
> +			/* compute_score is too long of a function to be
> +			 * inlined, and calling it again here yields
> +			 * measureable overhead for some
> +			 * workloads. Work around it by jumping
> +			 * backwards to rescore 'result'.
> +			 */
> +			rescore = true;
> +			goto rescore;
>  		}
>  	}
>  	return result;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 7c1e6469d091..7a55c050de2b 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -168,12 +168,15 @@ static struct sock *udp6_lib_lookup2(struct net *net,
>  {
>  	struct sock *sk, *result;
>  	int score, badness;
> +	bool rescore = false;
>  
>  	result = NULL;
>  	badness = -1;
>  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> -		score = compute_score(sk, net, saddr, sport,
> -				      daddr, hnum, dif, sdif);
> +rescore:
> +		score = compute_score((rescore ? result : sk), net, saddr,
> +				      sport, daddr, hnum, dif, sdif);
> +		rescore = false;
>  		if (score > badness) {
>  			badness = score;
>  
> @@ -197,8 +200,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
>  			if (IS_ERR(result))
>  				continue;
>  
> -			badness = compute_score(sk, net, saddr, sport,
> -						daddr, hnum, dif, sdif);
> +			/* compute_score is too long of a function to be
> +			 * inlined, and calling it again here yields
> +			 * measureable overhead for some
> +			 * workloads. Work around it by jumping
> +			 * backwards to rescore 'result'.
> +			 */
> +			rescore = true;
> +			goto rescore;
>  		}
>  	}
>  	return result;
> -- 
> 2.44.0
Willem de Bruijn April 10, 2024, 10:51 p.m. UTC | #2
Kuniyuki Iwashima wrote:
> From: Gabriel Krisman Bertazi <krisman@suse.de>
> Date: Wed, 10 Apr 2024 17:50:47 -0400
> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> > sockets are present").  The failing tests were those that would spawn
> > UDP sockets per-cpu on systems that have a high number of cpus.
> > 
> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> > socket, but due to the compiler no longer inlining compute_score, once
> > it has the extra call site in udp4_lib_lookup2.  This is augmented by
> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> > 
> > We could just explicitly inline it, but compute_score() is quite a large
> > function, around 300b.  Inlining in two sites would almost double
> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> > mitigation.  Instead, this patch shuffles the code a bit to avoid the
> > multiple calls to compute_score.  Since it is a static function used in
> > one spot, the compiler can safely fold it in, as it did before, without
> > increasing the text size.
> > 
> > With this patch applied I ran my original iperf3 testcases.  The failing
> > cases all looked like this (ipv4):
> > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> > 
> > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> > tree. harmean == harmonic mean; CV == coefficient of variation.
> > 
> > ipv4:
> >                  1G                10G                  MAX
> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> > 
> > ipv6:
> >                  1G                10G                  MAX
> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> > 
> > This restores the performance we had before the change above with this
> > benchmark.  We obviously don't expect any real impact when mitigations
> > are disabled, but just to be sure it also doesn't regresses:
> > 
> > mitigations=off ipv4:
> >                  1G                10G                  MAX
> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> > 
> > Cc: Lorenz Bauer <lmb@isovalent.com>
> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> > 
> > ---
> > Changes since v1:
> > (me)
> >   - recollected performance data after changes below only for the
> >   mitigations enabled case.
> > (suggested by Willem de Bruijn)
> >   - Drop __always_inline in compute_score
> >   - Simplify logic by replacing third struct sock pointer with bool
> >   - Fix typo in commit message
> >   - Don't explicitly break out of loop after rescore
> > ---
> >  net/ipv4/udp.c | 18 +++++++++++++-----
> >  net/ipv6/udp.c | 17 +++++++++++++----
> >  2 files changed, 26 insertions(+), 9 deletions(-)
> > 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 661d0e0d273f..a13ef8e06093 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> >  {
> >  	struct sock *sk, *result;
> >  	int score, badness;
> > +	bool rescore = false;
> 
> nit: Keep reverse xmax tree order.
> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> 
> >  
> >  	result = NULL;
> >  	badness = 0;
> >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> > -		score = compute_score(sk, net, saddr, sport,
> > -				      daddr, hnum, dif, sdif);
> > +rescore:
> > +		score = compute_score((rescore ? result : sk), net, saddr,
> 
> I guess () is not needed around rescore ?
> 
> Both same for IPv6.
> 
> Otherwise, looks good to me.
> 
> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Can we avoid using the same name for the label and boolean?

And since if looping result will have state TCP_ESTABLISHED, can it
just be

    sk = result;
    goto rescore;


> 
> > +				      sport, daddr, hnum, dif, sdif);
> > +		rescore = false;
> >  		if (score > badness) {
> >  			badness = score;
> >  
> > @@ -456,9 +459,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> >  			if (IS_ERR(result))
> >  				continue;
> >  
> > -			badness = compute_score(result, net, saddr, sport,
> > -						daddr, hnum, dif, sdif);
> > -
> > +			/* compute_score is too long of a function to be
> > +			 * inlined, and calling it again here yields
> > +			 * measureable overhead for some
> > +			 * workloads. Work around it by jumping
> > +			 * backwards to rescore 'result'.
> > +			 */
> > +			rescore = true;
> > +			goto rescore;
> >  		}
> >  	}
Kuniyuki Iwashima April 10, 2024, 11:07 p.m. UTC | #3
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 10 Apr 2024 18:51:33 -0400
> Kuniyuki Iwashima wrote:
> > From: Gabriel Krisman Bertazi <krisman@suse.de>
> > Date: Wed, 10 Apr 2024 17:50:47 -0400
> > > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> > > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> > > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> > > sockets are present").  The failing tests were those that would spawn
> > > UDP sockets per-cpu on systems that have a high number of cpus.
> > > 
> > > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> > > socket, but due to the compiler no longer inlining compute_score, once
> > > it has the extra call site in udp4_lib_lookup2.  This is augmented by
> > > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> > > 
> > > We could just explicitly inline it, but compute_score() is quite a large
> > > function, around 300b.  Inlining in two sites would almost double
> > > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> > > mitigation.  Instead, this patch shuffles the code a bit to avoid the
> > > multiple calls to compute_score.  Since it is a static function used in
> > > one spot, the compiler can safely fold it in, as it did before, without
> > > increasing the text size.
> > > 
> > > With this patch applied I ran my original iperf3 testcases.  The failing
> > > cases all looked like this (ipv4):
> > > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> > > 
> > > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
> > > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> > > tree. harmean == harmonic mean; CV == coefficient of variation.
> > > 
> > > ipv4:
> > >                  1G                10G                  MAX
> > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> > > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> > > 
> > > ipv6:
> > >                  1G                10G                  MAX
> > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> > > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> > > 
> > > This restores the performance we had before the change above with this
> > > benchmark.  We obviously don't expect any real impact when mitigations
> > > are disabled, but just to be sure it also doesn't regresses:
> > > 
> > > mitigations=off ipv4:
> > >                  1G                10G                  MAX
> > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> > > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> > > 
> > > Cc: Lorenz Bauer <lmb@isovalent.com>
> > > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> > > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> > > 
> > > ---
> > > Changes since v1:
> > > (me)
> > >   - recollected performance data after changes below only for the
> > >   mitigations enabled case.
> > > (suggested by Willem de Bruijn)
> > >   - Drop __always_inline in compute_score
> > >   - Simplify logic by replacing third struct sock pointer with bool
> > >   - Fix typo in commit message
> > >   - Don't explicitly break out of loop after rescore
> > > ---
> > >  net/ipv4/udp.c | 18 +++++++++++++-----
> > >  net/ipv6/udp.c | 17 +++++++++++++----
> > >  2 files changed, 26 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index 661d0e0d273f..a13ef8e06093 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > >  {
> > >  	struct sock *sk, *result;
> > >  	int score, badness;
> > > +	bool rescore = false;
> > 
> > nit: Keep reverse xmax tree order.
> > https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> > 
> > >  
> > >  	result = NULL;
> > >  	badness = 0;
> > >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> > > -		score = compute_score(sk, net, saddr, sport,
> > > -				      daddr, hnum, dif, sdif);
> > > +rescore:
> > > +		score = compute_score((rescore ? result : sk), net, saddr,
> > 
> > I guess () is not needed around rescore ?
> > 
> > Both same for IPv6.
> > 
> > Otherwise, looks good to me.
> > 
> > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> 
> Can we avoid using the same name for the label and boolean?
> 
> And since if looping result will have state TCP_ESTABLISHED, can it
> just be
> 
>     sk = result;
>     goto rescore;

TCP_ESTABLISHED never reaches the rescore jump as it's checked
before calling inet_lookup_reuseport() and inet_lookup_reuseport()
also does not select TCP_ESTABLISHED.


> 
> 
> > 
> > > +				      sport, daddr, hnum, dif, sdif);
> > > +		rescore = false;
> > >  		if (score > badness) {
> > >  			badness = score;
> > >  
> > > @@ -456,9 +459,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > >  			if (IS_ERR(result))
> > >  				continue;
> > >  
> > > -			badness = compute_score(result, net, saddr, sport,
> > > -						daddr, hnum, dif, sdif);
> > > -
> > > +			/* compute_score is too long of a function to be
> > > +			 * inlined, and calling it again here yields
> > > +			 * measureable overhead for some
> > > +			 * workloads. Work around it by jumping
> > > +			 * backwards to rescore 'result'.
> > > +			 */
> > > +			rescore = true;
> > > +			goto rescore;
> > >  		}
> > >  	}
Willem de Bruijn April 10, 2024, 11:18 p.m. UTC | #4
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Wed, 10 Apr 2024 18:51:33 -0400
> > Kuniyuki Iwashima wrote:
> > > From: Gabriel Krisman Bertazi <krisman@suse.de>
> > > Date: Wed, 10 Apr 2024 17:50:47 -0400
> > > > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> > > > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> > > > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> > > > sockets are present").  The failing tests were those that would spawn
> > > > UDP sockets per-cpu on systems that have a high number of cpus.
> > > > 
> > > > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> > > > socket, but due to the compiler no longer inlining compute_score, once
> > > > it has the extra call site in udp4_lib_lookup2.  This is augmented by
> > > > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> > > > 
> > > > We could just explicitly inline it, but compute_score() is quite a large
> > > > function, around 300b.  Inlining in two sites would almost double
> > > > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> > > > mitigation.  Instead, this patch shuffles the code a bit to avoid the
> > > > multiple calls to compute_score.  Since it is a static function used in
> > > > one spot, the compiler can safely fold it in, as it did before, without
> > > > increasing the text size.
> > > > 
> > > > With this patch applied I ran my original iperf3 testcases.  The failing
> > > > cases all looked like this (ipv4):
> > > > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> > > > 
> > > > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
> > > > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> > > > tree. harmean == harmonic mean; CV == coefficient of variation.
> > > > 
> > > > ipv4:
> > > >                  1G                10G                  MAX
> > > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> > > > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> > > > 
> > > > ipv6:
> > > >                  1G                10G                  MAX
> > > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> > > > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> > > > 
> > > > This restores the performance we had before the change above with this
> > > > benchmark.  We obviously don't expect any real impact when mitigations
> > > > are disabled, but just to be sure it also doesn't regresses:
> > > > 
> > > > mitigations=off ipv4:
> > > >                  1G                10G                  MAX
> > > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> > > > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> > > > 
> > > > Cc: Lorenz Bauer <lmb@isovalent.com>
> > > > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> > > > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> > > > 
> > > > ---
> > > > Changes since v1:
> > > > (me)
> > > >   - recollected performance data after changes below only for the
> > > >   mitigations enabled case.
> > > > (suggested by Willem de Bruijn)
> > > >   - Drop __always_inline in compute_score
> > > >   - Simplify logic by replacing third struct sock pointer with bool
> > > >   - Fix typo in commit message
> > > >   - Don't explicitly break out of loop after rescore
> > > > ---
> > > >  net/ipv4/udp.c | 18 +++++++++++++-----
> > > >  net/ipv6/udp.c | 17 +++++++++++++----
> > > >  2 files changed, 26 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > index 661d0e0d273f..a13ef8e06093 100644
> > > > --- a/net/ipv4/udp.c
> > > > +++ b/net/ipv4/udp.c
> > > > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > > >  {
> > > >  	struct sock *sk, *result;
> > > >  	int score, badness;
> > > > +	bool rescore = false;
> > > 
> > > nit: Keep reverse xmax tree order.
> > > https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> > > 
> > > >  
> > > >  	result = NULL;
> > > >  	badness = 0;
> > > >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> > > > -		score = compute_score(sk, net, saddr, sport,
> > > > -				      daddr, hnum, dif, sdif);
> > > > +rescore:
> > > > +		score = compute_score((rescore ? result : sk), net, saddr,
> > > 
> > > I guess () is not needed around rescore ?
> > > 
> > > Both same for IPv6.
> > > 
> > > Otherwise, looks good to me.
> > > 
> > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > 
> > Can we avoid using the same name for the label and boolean?
> > 
> > And since if looping result will have state TCP_ESTABLISHED, can it
> > just be
> > 
> >     sk = result;
> >     goto rescore;
> 
> TCP_ESTABLISHED never reaches the rescore jump as it's checked
> before calling inet_lookup_reuseport() and inet_lookup_reuseport()
> also does not select TCP_ESTABLISHED.

I was thinking of the second part, inet_lookup_reuseport returning
a connection from the group. I suppose this is assured not to be
the case due to

           /* Falleback to scoring if grnult;p has connections */
           if (!reuseport_has_conns(sk))
                   return result;


Is that what you mean?

There are a lot of hidden assumptions then to make sure this
control flow is correct.

Should we instead just have

            badness = score;

+            if (rescore)
+                    continue;

Also, can the rescore = false in the datapath be avoided. The purpose
is to only jump once. It would be good if it is obvious that a
repeated (or infinite) loop is not possible, regardless of what
the callees return.
Kuniyuki Iwashima April 11, 2024, 12:08 a.m. UTC | #5
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 10 Apr 2024 19:18:14 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Wed, 10 Apr 2024 18:51:33 -0400
> > > Kuniyuki Iwashima wrote:
> > > > From: Gabriel Krisman Bertazi <krisman@suse.de>
> > > > Date: Wed, 10 Apr 2024 17:50:47 -0400
> > > > > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> > > > > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> > > > > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> > > > > sockets are present").  The failing tests were those that would spawn
> > > > > UDP sockets per-cpu on systems that have a high number of cpus.
> > > > > 
> > > > > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> > > > > socket, but due to the compiler no longer inlining compute_score, once
> > > > > it has the extra call site in udp4_lib_lookup2.  This is augmented by
> > > > > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> > > > > 
> > > > > We could just explicitly inline it, but compute_score() is quite a large
> > > > > function, around 300b.  Inlining in two sites would almost double
> > > > > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> > > > > mitigation.  Instead, this patch shuffles the code a bit to avoid the
> > > > > multiple calls to compute_score.  Since it is a static function used in
> > > > > one spot, the compiler can safely fold it in, as it did before, without
> > > > > increasing the text size.
> > > > > 
> > > > > With this patch applied I ran my original iperf3 testcases.  The failing
> > > > > cases all looked like this (ipv4):
> > > > > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> > > > > 
> > > > > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
> > > > > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> > > > > tree. harmean == harmonic mean; CV == coefficient of variation.
> > > > > 
> > > > > ipv4:
> > > > >                  1G                10G                  MAX
> > > > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > > > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> > > > > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> > > > > 
> > > > > ipv6:
> > > > >                  1G                10G                  MAX
> > > > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > > > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> > > > > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> > > > > 
> > > > > This restores the performance we had before the change above with this
> > > > > benchmark.  We obviously don't expect any real impact when mitigations
> > > > > are disabled, but just to be sure it also doesn't regresses:
> > > > > 
> > > > > mitigations=off ipv4:
> > > > >                  1G                10G                  MAX
> > > > > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > > > > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> > > > > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> > > > > 
> > > > > Cc: Lorenz Bauer <lmb@isovalent.com>
> > > > > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> > > > > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> > > > > 
> > > > > ---
> > > > > Changes since v1:
> > > > > (me)
> > > > >   - recollected performance data after changes below only for the
> > > > >   mitigations enabled case.
> > > > > (suggested by Willem de Bruijn)
> > > > >   - Drop __always_inline in compute_score
> > > > >   - Simplify logic by replacing third struct sock pointer with bool
> > > > >   - Fix typo in commit message
> > > > >   - Don't explicitly break out of loop after rescore
> > > > > ---
> > > > >  net/ipv4/udp.c | 18 +++++++++++++-----
> > > > >  net/ipv6/udp.c | 17 +++++++++++++----
> > > > >  2 files changed, 26 insertions(+), 9 deletions(-)
> > > > > 
> > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > index 661d0e0d273f..a13ef8e06093 100644
> > > > > --- a/net/ipv4/udp.c
> > > > > +++ b/net/ipv4/udp.c
> > > > > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > > > >  {
> > > > >  	struct sock *sk, *result;
> > > > >  	int score, badness;
> > > > > +	bool rescore = false;
> > > > 
> > > > nit: Keep reverse xmax tree order.
> > > > https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> > > > 
> > > > >  
> > > > >  	result = NULL;
> > > > >  	badness = 0;
> > > > >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> > > > > -		score = compute_score(sk, net, saddr, sport,
> > > > > -				      daddr, hnum, dif, sdif);
> > > > > +rescore:
> > > > > +		score = compute_score((rescore ? result : sk), net, saddr,
> > > > 
> > > > I guess () is not needed around rescore ?
> > > > 
> > > > Both same for IPv6.
> > > > 
> > > > Otherwise, looks good to me.
> > > > 
> > > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > 
> > > Can we avoid using the same name for the label and boolean?
> > > 
> > > And since if looping result will have state TCP_ESTABLISHED, can it
> > > just be
> > > 
> > >     sk = result;
> > >     goto rescore;
> > 
> > TCP_ESTABLISHED never reaches the rescore jump as it's checked

Sorry I forgot about BPF.  I think BPF can select established one,
so this part is not true.


> > before calling inet_lookup_reuseport() and inet_lookup_reuseport()
> > also does not select TCP_ESTABLISHED.
> 
> I was thinking of the second part, inet_lookup_reuseport returning
> a connection from the group. I suppose this is assured not to be
> the case due to
> 
>            /* Falleback to scoring if grnult;p has connections */
>            if (!reuseport_has_conns(sk))
>                    return result;
> 
> 
> Is that what you mean?

reuseport_has_conns() is for reuseport group mixed with TCP_CLOSE and
TCP_ESTABLISHED, and here sk is usually (I mean without BPF) TCP_CLOSE
so that we don't decide too early and can check TCP_ESTABLISHED socket
placed later in the same hash chain.

Also, reuseport_select_sock_by_hash() returns NULL If the reuseport group
has only TCP_ESTABLISHED sockets when selecting, and then we continue with
result = sk;


> 
> There are a lot of hidden assumptions then to make sure this
> control flow is correct.
> 
> Should we instead just have
> 
>             badness = score;
> 
> +            if (rescore)
> +                    continue;

The 2nd compute_score() is added recently for a situation where
inet_lookup_reuseport() returns sk with higher score to avoid
calling inet_lookup_reuseport() again for the result.

  f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")


> 
> Also, can the rescore = false in the datapath be avoided. The purpose
> is to only jump once. It would be good if it is obvious that a
> repeated (or infinite) loop is not possible, regardless of what
> the callees return.
>
Gabriel Krisman Bertazi April 11, 2024, 1:54 a.m. UTC | #6
Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:

> Kuniyuki Iwashima wrote:
>> From: Gabriel Krisman Bertazi <krisman@suse.de>
>> Date: Wed, 10 Apr 2024 17:50:47 -0400
>> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
>> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
>> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
>> > sockets are present").  The failing tests were those that would spawn
>> > UDP sockets per-cpu on systems that have a high number of cpus.
>> > 
>> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
>> > socket, but due to the compiler no longer inlining compute_score, once
>> > it has the extra call site in udp4_lib_lookup2.  This is augmented by
>> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
>> > 
>> > We could just explicitly inline it, but compute_score() is quite a large
>> > function, around 300b.  Inlining in two sites would almost double
>> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
>> > mitigation.  Instead, this patch shuffles the code a bit to avoid the
>> > multiple calls to compute_score.  Since it is a static function used in
>> > one spot, the compiler can safely fold it in, as it did before, without
>> > increasing the text size.
>> > 
>> > With this patch applied I ran my original iperf3 testcases.  The failing
>> > cases all looked like this (ipv4):
>> > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
>> > 
>> > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
>> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
>> > tree. harmean == harmonic mean; CV == coefficient of variation.
>> > 
>> > ipv4:
>> >                  1G                10G                  MAX
>> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
>> > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
>> > 
>> > ipv6:
>> >                  1G                10G                  MAX
>> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
>> > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
>> > 
>> > This restores the performance we had before the change above with this
>> > benchmark.  We obviously don't expect any real impact when mitigations
>> > are disabled, but just to be sure it also doesn't regresses:
>> > 
>> > mitigations=off ipv4:
>> >                  1G                10G                  MAX
>> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
>> > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
>> > 
>> > Cc: Lorenz Bauer <lmb@isovalent.com>
>> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
>> > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
>> > 
>> > ---
>> > Changes since v1:
>> > (me)
>> >   - recollected performance data after changes below only for the
>> >   mitigations enabled case.
>> > (suggested by Willem de Bruijn)
>> >   - Drop __always_inline in compute_score
>> >   - Simplify logic by replacing third struct sock pointer with bool
>> >   - Fix typo in commit message
>> >   - Don't explicitly break out of loop after rescore
>> > ---
>> >  net/ipv4/udp.c | 18 +++++++++++++-----
>> >  net/ipv6/udp.c | 17 +++++++++++++----
>> >  2 files changed, 26 insertions(+), 9 deletions(-)
>> > 
>> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> > index 661d0e0d273f..a13ef8e06093 100644
>> > --- a/net/ipv4/udp.c
>> > +++ b/net/ipv4/udp.c
>> > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>> >  {
>> >  	struct sock *sk, *result;
>> >  	int score, badness;
>> > +	bool rescore = false;
>> 
>> nit: Keep reverse xmax tree order.
>> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
>> 
>> >  
>> >  	result = NULL;
>> >  	badness = 0;
>> >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>> > -		score = compute_score(sk, net, saddr, sport,
>> > -				      daddr, hnum, dif, sdif);
>> > +rescore:
>> > +		score = compute_score((rescore ? result : sk), net, saddr,
>> 
>> I guess () is not needed around rescore ?
>> 
>> Both same for IPv6.
>> 
>> Otherwise, looks good to me.
>> 
>> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>
> Can we avoid using the same name for the label and boolean?
>
> And since if looping result will have state TCP_ESTABLISHED, can it
> just be
>
>     sk = result;
>     goto rescore;

This would be much simpler, sure.  I actually didn't want to do it
because sk is the iteration cursor, and I couldn't prove to myself it is
safe to skip through part of the list (assuming result isn't the
immediate next socket in the list).  I'll take a better look, as I'm not
familiar with this code, but if you considered this already, I will
follow up with the change you suggested.

>
>> 
>> > +				      sport, daddr, hnum, dif, sdif);
>> > +		rescore = false;
>> >  		if (score > badness) {
>> >  			badness = score;
>> >  
>> > @@ -456,9 +459,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>> >  			if (IS_ERR(result))
>> >  				continue;
>> >  
>> > -			badness = compute_score(result, net, saddr, sport,
>> > -						daddr, hnum, dif, sdif);
>> > -
>> > +			/* compute_score is too long of a function to be
>> > +			 * inlined, and calling it again here yields
>> > +			 * measureable overhead for some
>> > +			 * workloads. Work around it by jumping
>> > +			 * backwards to rescore 'result'.
>> > +			 */
>> > +			rescore = true;
>> > +			goto rescore;
>> >  		}
>> >  	}
Kuniyuki Iwashima April 11, 2024, 2:21 a.m. UTC | #7
From: Gabriel Krisman Bertazi <krisman@suse.de>
Date: Wed, 10 Apr 2024 21:54:30 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:
> 
> > Kuniyuki Iwashima wrote:
> >> From: Gabriel Krisman Bertazi <krisman@suse.de>
> >> Date: Wed, 10 Apr 2024 17:50:47 -0400
> >> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> >> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> >> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> >> > sockets are present").  The failing tests were those that would spawn
> >> > UDP sockets per-cpu on systems that have a high number of cpus.
> >> > 
> >> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> >> > socket, but due to the compiler no longer inlining compute_score, once
> >> > it has the extra call site in udp4_lib_lookup2.  This is augmented by
> >> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> >> > 
> >> > We could just explicitly inline it, but compute_score() is quite a large
> >> > function, around 300b.  Inlining in two sites would almost double
> >> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> >> > mitigation.  Instead, this patch shuffles the code a bit to avoid the
> >> > multiple calls to compute_score.  Since it is a static function used in
> >> > one spot, the compiler can safely fold it in, as it did before, without
> >> > increasing the text size.
> >> > 
> >> > With this patch applied I ran my original iperf3 testcases.  The failing
> >> > cases all looked like this (ipv4):
> >> > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> >> > 
> >> > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
> >> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> >> > tree. harmean == harmonic mean; CV == coefficient of variation.
> >> > 
> >> > ipv4:
> >> >                  1G                10G                  MAX
> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> >> > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> >> > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> >> > 
> >> > ipv6:
> >> >                  1G                10G                  MAX
> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> >> > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> >> > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> >> > 
> >> > This restores the performance we had before the change above with this
> >> > benchmark.  We obviously don't expect any real impact when mitigations
> >> > are disabled, but just to be sure it also doesn't regresses:
> >> > 
> >> > mitigations=off ipv4:
> >> >                  1G                10G                  MAX
> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> >> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> >> > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> >> > 
> >> > Cc: Lorenz Bauer <lmb@isovalent.com>
> >> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> >> > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> >> > 
> >> > ---
> >> > Changes since v1:
> >> > (me)
> >> >   - recollected performance data after changes below only for the
> >> >   mitigations enabled case.
> >> > (suggested by Willem de Bruijn)
> >> >   - Drop __always_inline in compute_score
> >> >   - Simplify logic by replacing third struct sock pointer with bool
> >> >   - Fix typo in commit message
> >> >   - Don't explicitly break out of loop after rescore
> >> > ---
> >> >  net/ipv4/udp.c | 18 +++++++++++++-----
> >> >  net/ipv6/udp.c | 17 +++++++++++++----
> >> >  2 files changed, 26 insertions(+), 9 deletions(-)
> >> > 
> >> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> >> > index 661d0e0d273f..a13ef8e06093 100644
> >> > --- a/net/ipv4/udp.c
> >> > +++ b/net/ipv4/udp.c
> >> > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> >> >  {
> >> >  	struct sock *sk, *result;
> >> >  	int score, badness;
> >> > +	bool rescore = false;
> >> 
> >> nit: Keep reverse xmax tree order.
> >> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> >> 
> >> >  
> >> >  	result = NULL;
> >> >  	badness = 0;
> >> >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> >> > -		score = compute_score(sk, net, saddr, sport,
> >> > -				      daddr, hnum, dif, sdif);
> >> > +rescore:
> >> > +		score = compute_score((rescore ? result : sk), net, saddr,
> >> 
> >> I guess () is not needed around rescore ?
> >> 
> >> Both same for IPv6.
> >> 
> >> Otherwise, looks good to me.
> >> 
> >> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >
> > Can we avoid using the same name for the label and boolean?
> >
> > And since if looping result will have state TCP_ESTABLISHED, can it
> > just be
> >
> >     sk = result;
> >     goto rescore;
> 
> This would be much simpler, sure.  I actually didn't want to do it
> because sk is the iteration cursor, and I couldn't prove to myself it is
> safe to skip through part of the list (assuming result isn't the
> immediate next socket in the list).

Good point, this is not safe actually.

Let's say sockets on the same port are placed in these order in the list:

  1. TCP_CLOSE sk w/ SO_INCOMING_CPU _not_ matching the current CPU
  2. TCP_ESTABLISHED sk matching 4-tuple
  3. TCP_CLOSE sk w/ SO_INCOMING_CPU matching the current CPU

When we check the first socket, we'll get the 3rd socket as it matches
the current CPU ID and TCP_ESTABLISHED cannot be selected without BPF,
and `sk = result;` skips the 2nd socket, which should have been selected.
Gabriel Krisman Bertazi April 11, 2024, 7:53 p.m. UTC | #8
Kuniyuki Iwashima <kuniyu@amazon.com> writes:

> From: Gabriel Krisman Bertazi <krisman@suse.de>
> Date: Wed, 10 Apr 2024 21:54:30 -0400
>> Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:
>> 
>> > Kuniyuki Iwashima wrote:
>> >> From: Gabriel Krisman Bertazi <krisman@suse.de>
>> >> Date: Wed, 10 Apr 2024 17:50:47 -0400
>> >> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
>> >> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
>> >> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
>> >> > sockets are present").  The failing tests were those that would spawn
>> >> > UDP sockets per-cpu on systems that have a high number of cpus.
>> >> > 
>> >> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
>> >> > socket, but due to the compiler no longer inlining compute_score, once
>> >> > it has the extra call site in udp4_lib_lookup2.  This is augmented by
>> >> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
>> >> > 
>> >> > We could just explicitly inline it, but compute_score() is quite a large
>> >> > function, around 300b.  Inlining in two sites would almost double
>> >> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
>> >> > mitigation.  Instead, this patch shuffles the code a bit to avoid the
>> >> > multiple calls to compute_score.  Since it is a static function used in
>> >> > one spot, the compiler can safely fold it in, as it did before, without
>> >> > increasing the text size.
>> >> > 
>> >> > With this patch applied I ran my original iperf3 testcases.  The failing
>> >> > cases all looked like this (ipv4):
>> >> > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
>> >> > 
>> >> > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
>> >> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
>> >> > tree. harmean == harmonic mean; CV == coefficient of variation.
>> >> > 
>> >> > ipv4:
>> >> >                  1G                10G                  MAX
>> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> >> > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
>> >> > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
>> >> > 
>> >> > ipv6:
>> >> >                  1G                10G                  MAX
>> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> >> > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
>> >> > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
>> >> > 
>> >> > This restores the performance we had before the change above with this
>> >> > benchmark.  We obviously don't expect any real impact when mitigations
>> >> > are disabled, but just to be sure it also doesn't regresses:
>> >> > 
>> >> > mitigations=off ipv4:
>> >> >                  1G                10G                  MAX
>> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> >> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
>> >> > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
>> >> > 
>> >> > Cc: Lorenz Bauer <lmb@isovalent.com>
>> >> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
>> >> > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
>> >> > 
>> >> > ---
>> >> > Changes since v1:
>> >> > (me)
>> >> >   - recollected performance data after changes below only for the
>> >> >   mitigations enabled case.
>> >> > (suggested by Willem de Bruijn)
>> >> >   - Drop __always_inline in compute_score
>> >> >   - Simplify logic by replacing third struct sock pointer with bool
>> >> >   - Fix typo in commit message
>> >> >   - Don't explicitly break out of loop after rescore
>> >> > ---
>> >> >  net/ipv4/udp.c | 18 +++++++++++++-----
>> >> >  net/ipv6/udp.c | 17 +++++++++++++----
>> >> >  2 files changed, 26 insertions(+), 9 deletions(-)
>> >> > 
>> >> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> >> > index 661d0e0d273f..a13ef8e06093 100644
>> >> > --- a/net/ipv4/udp.c
>> >> > +++ b/net/ipv4/udp.c
>> >> > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>> >> >  {
>> >> >  	struct sock *sk, *result;
>> >> >  	int score, badness;
>> >> > +	bool rescore = false;
>> >> 
>> >> nit: Keep reverse xmax tree order.
>> >> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
>> >> 
>> >> >  
>> >> >  	result = NULL;
>> >> >  	badness = 0;
>> >> >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>> >> > -		score = compute_score(sk, net, saddr, sport,
>> >> > -				      daddr, hnum, dif, sdif);
>> >> > +rescore:
>> >> > +		score = compute_score((rescore ? result : sk), net, saddr,
>> >> 
>> >> I guess () is not needed around rescore ?
>> >> 
>> >> Both same for IPv6.
>> >> 
>> >> Otherwise, looks good to me.
>> >> 
>> >> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>> >
>> > Can we avoid using the same name for the label and boolean?
>> >
>> > And since if looping result will have state TCP_ESTABLISHED, can it
>> > just be
>> >
>> >     sk = result;
>> >     goto rescore;
>> 
>> This would be much simpler, sure.  I actually didn't want to do it
>> because sk is the iteration cursor, and I couldn't prove to myself it is
>> safe to skip through part of the list (assuming result isn't the
>> immediate next socket in the list).
>
> Good point, this is not safe actually.
>
> Let's say sockets on the same port are placed in these order in the list:
>
>   1. TCP_CLOSE sk w/ SO_INCOMING_CPU _not_ matching the current CPU
>   2. TCP_ESTABLISHED sk matching 4-tuple
>   3. TCP_CLOSE sk w/ SO_INCOMING_CPU matching the current CPU
>
> When we check the first socket, we'll get the 3rd socket as it matches
> the current CPU ID and TCP_ESTABLISHED cannot be selected without BPF,
> and `sk = result;` skips the 2nd socket, which should have been
> selected.

k. Considering this, what I get from the discussion is:

since we need to preserve the sk pointer, we are keeping the
condition (rescore ?  result:sk) in the compute_score call and
maintaining the reset of rescore in the hotpath to prevent scoring the
wrong thing on further loops. It is ugly, but it is a single instruction
over a in-register value, so it hardly matters.

If so, I'll do the style changes (parenthesis, sort of stack variables)
and add the early continue to resume the loop right after the rescore,
similar to what I had in V1, that Willem suggested:

            badness = score;

+            if (rescore)
+                    continue;

Please, let me know if I misunderstood, so I don't send a bogus v3.  I
will take some hours to run the tests, so I should send a v3 by
tomorrow.
Kuniyuki Iwashima April 11, 2024, 8:13 p.m. UTC | #9
From: Gabriel Krisman Bertazi <krisman@suse.de>
Date: Thu, 11 Apr 2024 15:53:31 -0400
> Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> 
> > From: Gabriel Krisman Bertazi <krisman@suse.de>
> > Date: Wed, 10 Apr 2024 21:54:30 -0400
> >> Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:
> >> 
> >> > Kuniyuki Iwashima wrote:
> >> >> From: Gabriel Krisman Bertazi <krisman@suse.de>
> >> >> Date: Wed, 10 Apr 2024 17:50:47 -0400
> >> >> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> >> >> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> >> >> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> >> >> > sockets are present").  The failing tests were those that would spawn
> >> >> > UDP sockets per-cpu on systems that have a high number of cpus.
> >> >> > 
> >> >> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> >> >> > socket, but due to the compiler no longer inlining compute_score, once
> >> >> > it has the extra call site in udp4_lib_lookup2.  This is augmented by
> >> >> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> >> >> > 
> >> >> > We could just explicitly inline it, but compute_score() is quite a large
> >> >> > function, around 300b.  Inlining in two sites would almost double
> >> >> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> >> >> > mitigation.  Instead, this patch shuffles the code a bit to avoid the
> >> >> > multiple calls to compute_score.  Since it is a static function used in
> >> >> > one spot, the compiler can safely fold it in, as it did before, without
> >> >> > increasing the text size.
> >> >> > 
> >> >> > With this patch applied I ran my original iperf3 testcases.  The failing
> >> >> > cases all looked like this (ipv4):
> >> >> > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> >> >> > 
> >> >> > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
> >> >> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> >> >> > tree. harmean == harmonic mean; CV == coefficient of variation.
> >> >> > 
> >> >> > ipv4:
> >> >> >                  1G                10G                  MAX
> >> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> >> >> > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> >> >> > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> >> >> > 
> >> >> > ipv6:
> >> >> >                  1G                10G                  MAX
> >> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> >> >> > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> >> >> > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> >> >> > 
> >> >> > This restores the performance we had before the change above with this
> >> >> > benchmark.  We obviously don't expect any real impact when mitigations
> >> >> > are disabled, but just to be sure it also doesn't regresses:
> >> >> > 
> >> >> > mitigations=off ipv4:
> >> >> >                  1G                10G                  MAX
> >> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> >> >> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> >> >> > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> >> >> > 
> >> >> > Cc: Lorenz Bauer <lmb@isovalent.com>
> >> >> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> >> >> > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> >> >> > 
> >> >> > ---
> >> >> > Changes since v1:
> >> >> > (me)
> >> >> >   - recollected performance data after changes below only for the
> >> >> >   mitigations enabled case.
> >> >> > (suggested by Willem de Bruijn)
> >> >> >   - Drop __always_inline in compute_score
> >> >> >   - Simplify logic by replacing third struct sock pointer with bool
> >> >> >   - Fix typo in commit message
> >> >> >   - Don't explicitly break out of loop after rescore
> >> >> > ---
> >> >> >  net/ipv4/udp.c | 18 +++++++++++++-----
> >> >> >  net/ipv6/udp.c | 17 +++++++++++++----
> >> >> >  2 files changed, 26 insertions(+), 9 deletions(-)
> >> >> > 
> >> >> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> >> >> > index 661d0e0d273f..a13ef8e06093 100644
> >> >> > --- a/net/ipv4/udp.c
> >> >> > +++ b/net/ipv4/udp.c
> >> >> > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> >> >> >  {
> >> >> >  	struct sock *sk, *result;
> >> >> >  	int score, badness;
> >> >> > +	bool rescore = false;
> >> >> 
> >> >> nit: Keep reverse xmax tree order.
> >> >> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> >> >> 
> >> >> >  
> >> >> >  	result = NULL;
> >> >> >  	badness = 0;
> >> >> >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> >> >> > -		score = compute_score(sk, net, saddr, sport,
> >> >> > -				      daddr, hnum, dif, sdif);
> >> >> > +rescore:
> >> >> > +		score = compute_score((rescore ? result : sk), net, saddr,
> >> >> 
> >> >> I guess () is not needed around rescore ?
> >> >> 
> >> >> Both same for IPv6.
> >> >> 
> >> >> Otherwise, looks good to me.
> >> >> 
> >> >> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >> >
> >> > Can we avoid using the same name for the label and boolean?
> >> >
> >> > And since if looping result will have state TCP_ESTABLISHED, can it
> >> > just be
> >> >
> >> >     sk = result;
> >> >     goto rescore;
> >> 
> >> This would be much simpler, sure.  I actually didn't want to do it
> >> because sk is the iteration cursor, and I couldn't prove to myself it is
> >> safe to skip through part of the list (assuming result isn't the
> >> immediate next socket in the list).
> >
> > Good point, this is not safe actually.
> >
> > Let's say sockets on the same port are placed in these order in the list:
> >
> >   1. TCP_CLOSE sk w/ SO_INCOMING_CPU _not_ matching the current CPU
> >   2. TCP_ESTABLISHED sk matching 4-tuple
> >   3. TCP_CLOSE sk w/ SO_INCOMING_CPU matching the current CPU
> >
> > When we check the first socket, we'll get the 3rd socket as it matches
> > the current CPU ID and TCP_ESTABLISHED cannot be selected without BPF,
> > and `sk = result;` skips the 2nd socket, which should have been
> > selected.
> 
> k. Considering this, what I get from the discussion is:
> 
> since we need to preserve the sk pointer, we are keeping the
> condition (rescore ?  result:sk) in the compute_score call and
> maintaining the reset of rescore in the hotpath to prevent scoring the
> wrong thing on further loops. It is ugly, but it is a single instruction
> over a in-register value, so it hardly matters.
> 
> If so, I'll do the style changes (parenthesis, sort of stack variables)
> and add the early continue to resume the loop right after the rescore,
> similar to what I had in V1, that Willem suggested:
> 
>             badness = score;
> 
> +            if (rescore)
> +                    continue;

I'm wondering if rescore happens only once, but at least here we
need to update result.  Otherwise, in the example above, the 3rd
socket is returned as result.


> 
> Please, let me know if I misunderstood, so I don't send a bogus v3.  I
> will take some hours to run the tests, so I should send a v3 by
> tomorrow.
diff mbox series

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 661d0e0d273f..a13ef8e06093 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -427,12 +427,15 @@  static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
+	bool rescore = false;
 
 	result = NULL;
 	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		score = compute_score(sk, net, saddr, sport,
-				      daddr, hnum, dif, sdif);
+rescore:
+		score = compute_score((rescore ? result : sk), net, saddr,
+				      sport, daddr, hnum, dif, sdif);
+		rescore = false;
 		if (score > badness) {
 			badness = score;
 
@@ -456,9 +459,14 @@  static struct sock *udp4_lib_lookup2(struct net *net,
 			if (IS_ERR(result))
 				continue;
 
-			badness = compute_score(result, net, saddr, sport,
-						daddr, hnum, dif, sdif);
-
+			/* compute_score is too long of a function to be
+			 * inlined, and calling it again here yields
+			 * measureable overhead for some
+			 * workloads. Work around it by jumping
+			 * backwards to rescore 'result'.
+			 */
+			rescore = true;
+			goto rescore;
 		}
 	}
 	return result;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7c1e6469d091..7a55c050de2b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -168,12 +168,15 @@  static struct sock *udp6_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
+	bool rescore = false;
 
 	result = NULL;
 	badness = -1;
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		score = compute_score(sk, net, saddr, sport,
-				      daddr, hnum, dif, sdif);
+rescore:
+		score = compute_score((rescore ? result : sk), net, saddr,
+				      sport, daddr, hnum, dif, sdif);
+		rescore = false;
 		if (score > badness) {
 			badness = score;
 
@@ -197,8 +200,14 @@  static struct sock *udp6_lib_lookup2(struct net *net,
 			if (IS_ERR(result))
 				continue;
 
-			badness = compute_score(sk, net, saddr, sport,
-						daddr, hnum, dif, sdif);
+			/* compute_score is too long of a function to be
+			 * inlined, and calling it again here yields
+			 * measureable overhead for some
+			 * workloads. Work around it by jumping
+			 * backwards to rescore 'result'.
+			 */
+			rescore = true;
+			goto rescore;
 		}
 	}
 	return result;