
[net-next] tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()

Message ID 20250305130550.1865988-1-edumazet@google.com (mailing list archive)
State Accepted
Commit f8ece40786c9342249aa0a1b55e148ee23b2a746
Delegated to: Netdev Maintainers
Series [net-next] tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 48 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-03-06--03-00 (tests: 894)

Commit Message

Eric Dumazet March 5, 2025, 1:05 p.m. UTC
We have platforms with 6 NUMA nodes and 480 cpus.

inet_ehash_locks_alloc() currently allocates a single 64KB page
to hold all ehash spinlocks. This adds more pressure on a single node.

Change inet_ehash_locks_alloc() to use vmalloc() to spread
the spinlocks on all online nodes, driven by NUMA policies.

At boot time, NUMA policy is interleave=all, meaning that
tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.

Tested:

lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

Comments

Jason Xing March 6, 2025, 3:35 a.m. UTC | #1
On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
>
> We have platforms with 6 NUMA nodes and 480 cpus.
>
> inet_ehash_locks_alloc() currently allocates a single 64KB page
> to hold all ehash spinlocks. This adds more pressure on a single node.
>
> Change inet_ehash_locks_alloc() to use vmalloc() to spread
> the spinlocks on all online nodes, driven by NUMA policies.
>
> At boot time, NUMA policy is interleave=all, meaning that
> tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
>
> Tested:
>
> lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
>
> lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
>
> lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
>
> lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Tested-by: Jason Xing <kerneljasonxing@gmail.com>

> ---
>  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
>  1 file changed, 26 insertions(+), 11 deletions(-)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
>  {
>         unsigned int locksz = sizeof(spinlock_t);
>         unsigned int i, nblocks = 1;
> +       spinlock_t *ptr = NULL;
>
> -       if (locksz != 0) {
> -               /* allocate 2 cache lines or at least one spinlock per cpu */
> -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> +       if (locksz == 0)
> +               goto set_mask;
>
> -               /* no more locks than number of hash buckets */
> -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
>
> -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> -               if (!hashinfo->ehash_locks)
> -                       return -ENOMEM;
> +       /* At least one page per NUMA node. */
> +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> +
> +       nblocks = roundup_pow_of_two(nblocks);
> +
> +       /* No more locks than number of hash buckets. */
> +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
>
> -               for (i = 0; i < nblocks; i++)
> -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> +       if (num_online_nodes() > 1) {
> +               /* Use vmalloc() to allow NUMA policy to spread pages
> +                * on all available nodes if desired.
> +                */
> +               ptr = vmalloc_array(nblocks, locksz);

I wonder if at this point the memory shortage occurs, is it necessary
to fall back to kvmalloc() later even when non-contiguous allocation
fails? Could we return with -ENOMEM directly here? If so, I can cook a
follow-up patch so that you don't need to revise this version:)

Thanks,
Jason

> +       }
> +       if (!ptr) {
> +               ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> +               if (!ptr)
> +                       return -ENOMEM;
>         }
> +       for (i = 0; i < nblocks; i++)
> +               spin_lock_init(&ptr[i]);
> +       hashinfo->ehash_locks = ptr;
> +set_mask:
>         hashinfo->ehash_locks_mask = nblocks - 1;
>         return 0;
>  }
> --
> 2.48.1.711.g2feabab25a-goog
>
>
Kuniyuki Iwashima March 6, 2025, 4:12 a.m. UTC | #2
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Thu, 6 Mar 2025 11:35:27 +0800
> On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > We have platforms with 6 NUMA nodes and 480 cpus.
> >
> > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > to hold all ehash spinlocks. This adds more pressure on a single node.
> >
> > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > the spinlocks on all online nodes, driven by NUMA policies.
> >
> > At boot time, NUMA policy is interleave=all, meaning that
> > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> >
> > Tested:
> >
> > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> >
> > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> >
> > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> >
> > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> Tested-by: Jason Xing <kerneljasonxing@gmail.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>


> 
> > ---
> >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> >  1 file changed, 26 insertions(+), 11 deletions(-)
> >
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> >  {
> >         unsigned int locksz = sizeof(spinlock_t);
> >         unsigned int i, nblocks = 1;
> > +       spinlock_t *ptr = NULL;
> >
> > -       if (locksz != 0) {
> > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > +       if (locksz == 0)
> > +               goto set_mask;
> >
> > -               /* no more locks than number of hash buckets */
> > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> >
> > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > -               if (!hashinfo->ehash_locks)
> > -                       return -ENOMEM;
> > +       /* At least one page per NUMA node. */
> > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > +
> > +       nblocks = roundup_pow_of_two(nblocks);
> > +
> > +       /* No more locks than number of hash buckets. */
> > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> >
> > -               for (i = 0; i < nblocks; i++)
> > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > +       if (num_online_nodes() > 1) {
> > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > +                * on all available nodes if desired.
> > +                */
> > +               ptr = vmalloc_array(nblocks, locksz);
> 
> I wonder if at this point the memory shortage occurs, is it necessary
> to fall back to kvmalloc() later

If ptr is NULL here, kvmalloc_array() is called below.


> even when non-contiguous allocation
> fails? Could we return with -ENOMEM directly here? If so, I can cook a
> follow-up patch so that you don't need to revise this version:)
> 
> Thanks,
> Jason
> 
> > +       }
> > +       if (!ptr) {
> > +               ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > +               if (!ptr)
> > +                       return -ENOMEM;
> >         }
> > +       for (i = 0; i < nblocks; i++)
> > +               spin_lock_init(&ptr[i]);
> > +       hashinfo->ehash_locks = ptr;
> > +set_mask:
> >         hashinfo->ehash_locks_mask = nblocks - 1;
> >         return 0;
> >  }
> > --
> > 2.48.1.711.g2feabab25a-goog
Jason Xing March 6, 2025, 4:59 a.m. UTC | #3
On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Thu, 6 Mar 2025 11:35:27 +0800
> > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > We have platforms with 6 NUMA nodes and 480 cpus.
> > >
> > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > >
> > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > the spinlocks on all online nodes, driven by NUMA policies.
> > >
> > > At boot time, NUMA policy is interleave=all, meaning that
> > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > >
> > > Tested:
> > >
> > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > >
> > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > >
> > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > >
> > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>
>
> >
> > > ---
> > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > >  {
> > >         unsigned int locksz = sizeof(spinlock_t);
> > >         unsigned int i, nblocks = 1;
> > > +       spinlock_t *ptr = NULL;
> > >
> > > -       if (locksz != 0) {
> > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > +       if (locksz == 0)
> > > +               goto set_mask;
> > >
> > > -               /* no more locks than number of hash buckets */
> > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > >
> > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > -               if (!hashinfo->ehash_locks)
> > > -                       return -ENOMEM;
> > > +       /* At least one page per NUMA node. */
> > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > +
> > > +       nblocks = roundup_pow_of_two(nblocks);
> > > +
> > > +       /* No more locks than number of hash buckets. */
> > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > >
> > > -               for (i = 0; i < nblocks; i++)
> > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > +       if (num_online_nodes() > 1) {
> > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > +                * on all available nodes if desired.
> > > +                */
> > > +               ptr = vmalloc_array(nblocks, locksz);
> >
> > I wonder if at this point the memory shortage occurs, is it necessary
> > to fall back to kvmalloc() later
>
> If ptr is NULL here, kvmalloc_array() is called below.

My point is why not return with -ENOMEM directly? Or else it looks meaningless.

Thanks,
Jason

>
>
> > even when non-contiguous allocation
> > fails? Could we return with -ENOMEM directly here? If so, I can cook a
> > follow-up patch so that you don't need to revise this version:)
> >
> > Thanks,
> > Jason
> >
> > > +       }
> > > +       if (!ptr) {
> > > +               ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > +               if (!ptr)
> > > +                       return -ENOMEM;
> > >         }
> > > +       for (i = 0; i < nblocks; i++)
> > > +               spin_lock_init(&ptr[i]);
> > > +       hashinfo->ehash_locks = ptr;
> > > +set_mask:
> > >         hashinfo->ehash_locks_mask = nblocks - 1;
> > >         return 0;
> > >  }
> > > --
> > > 2.48.1.711.g2feabab25a-goog
Kuniyuki Iwashima March 6, 2025, 6:21 a.m. UTC | #4
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Thu, 6 Mar 2025 12:59:03 +0800
> On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Jason Xing <kerneljasonxing@gmail.com>
> > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > >
> > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > >
> > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > >
> > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > >
> > > > Tested:
> > > >
> > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > >
> > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > >
> > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > >
> > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > >
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >
> > > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >
> >
> > >
> > > > ---
> > > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > --- a/net/ipv4/inet_hashtables.c
> > > > +++ b/net/ipv4/inet_hashtables.c
> > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > >  {
> > > >         unsigned int locksz = sizeof(spinlock_t);
> > > >         unsigned int i, nblocks = 1;
> > > > +       spinlock_t *ptr = NULL;
> > > >
> > > > -       if (locksz != 0) {
> > > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > +       if (locksz == 0)
> > > > +               goto set_mask;
> > > >
> > > > -               /* no more locks than number of hash buckets */
> > > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > >
> > > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > -               if (!hashinfo->ehash_locks)
> > > > -                       return -ENOMEM;
> > > > +       /* At least one page per NUMA node. */
> > > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > +
> > > > +       nblocks = roundup_pow_of_two(nblocks);
> > > > +
> > > > +       /* No more locks than number of hash buckets. */
> > > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > >
> > > > -               for (i = 0; i < nblocks; i++)
> > > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > +       if (num_online_nodes() > 1) {
> > > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > > +                * on all available nodes if desired.
> > > > +                */
> > > > +               ptr = vmalloc_array(nblocks, locksz);
> > >
> > > I wonder if at this point the memory shortage occurs, is it necessary
> > > to fall back to kvmalloc() later
> >
> > If ptr is NULL here, kvmalloc_array() is called below.
> 
> My point is why not return with -ENOMEM directly? Or else it looks meaningless.
>

Ah, I misread.  I'm not sure how likely such a case happens, but I
think vmalloc() and kmalloc() failures do not always correlate; the
former uses node_alloc() and the latter uses the page allocator.


> Thanks,
> Jason
> 
> >
> >
> > > even when non-contiguous allocation
> > > fails? Could we return with -ENOMEM directly here? If so, I can cook a
> > > follow-up patch so that you don't need to revise this version:)
> > >
> > > Thanks,
> > > Jason
> > >
> > > > +       }
> > > > +       if (!ptr) {
> > > > +               ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > +               if (!ptr)
> > > > +                       return -ENOMEM;
> > > >         }
> > > > +       for (i = 0; i < nblocks; i++)
> > > > +               spin_lock_init(&ptr[i]);
> > > > +       hashinfo->ehash_locks = ptr;
> > > > +set_mask:
> > > >         hashinfo->ehash_locks_mask = nblocks - 1;
> > > >         return 0;
> > > >  }
> > > > --
> > > > 2.48.1.711.g2feabab25a-goog
Jason Xing March 6, 2025, 6:35 a.m. UTC | #5
On Thu, Mar 6, 2025 at 2:22 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Thu, 6 Mar 2025 12:59:03 +0800
> > On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > > >
> > > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > > >
> > > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > > >
> > > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > > >
> > > > > Tested:
> > > > >
> > > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > >
> > > > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > >
> > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > >
> > >
> > > >
> > > > > ---
> > > > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > > >  {
> > > > >         unsigned int locksz = sizeof(spinlock_t);
> > > > >         unsigned int i, nblocks = 1;
> > > > > +       spinlock_t *ptr = NULL;
> > > > >
> > > > > -       if (locksz != 0) {
> > > > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > > +       if (locksz == 0)
> > > > > +               goto set_mask;
> > > > >
> > > > > -               /* no more locks than number of hash buckets */
> > > > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > > >
> > > > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > > -               if (!hashinfo->ehash_locks)
> > > > > -                       return -ENOMEM;
> > > > > +       /* At least one page per NUMA node. */
> > > > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > > +
> > > > > +       nblocks = roundup_pow_of_two(nblocks);
> > > > > +
> > > > > +       /* No more locks than number of hash buckets. */
> > > > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > >
> > > > > -               for (i = 0; i < nblocks; i++)
> > > > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > > +       if (num_online_nodes() > 1) {
> > > > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > > > +                * on all available nodes if desired.
> > > > > +                */
> > > > > +               ptr = vmalloc_array(nblocks, locksz);
> > > >
> > > > I wonder if at this point the memory shortage occurs, is it necessary
> > > > to fall back to kvmalloc() later
> > >
> > > If ptr is NULL here, kvmalloc_array() is called below.
> >
> > My point is why not return with -ENOMEM directly? Or else it looks meaningless.
> >
>
> Ah, I misread.  I'm not sure how likely such a case happens, but I
> think vmalloc() and kmalloc() failures do not always correlate; the
> former uses node_alloc() and the latter uses the page allocator.

Sure, it is unlikely to happen.

As to memory allocation, we usually try kmalloc() for allocations
smaller than a page and vmalloc() for larger ones. The same logic
can be seen in kvmalloc(): try kmalloc() first, then fall back to
vmalloc(). Since we already failed to allocate non-contiguous memory,
there is no need to try kvmalloc() (which will call kmalloc() and
vmalloc() one more time).
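
For reference, a minimal sketch of the ordering described above. This is
an illustration only, not the actual mm/util.c kvmalloc() code (which
also adjusts gfp flags and checks whether the size can be served by the
slab allocator at all), and kvmalloc_sketch() is a made-up name:

#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Illustrative sketch: kvmalloc() first tries the slab/page allocator
 * for physically contiguous memory, then falls back to vmalloc().
 */
static void *kvmalloc_sketch(size_t size, gfp_t flags)
{
	void *p;

	/* 1) try physically contiguous memory first */
	p = kmalloc(size, flags | __GFP_NOWARN | __GFP_NORETRY);
	if (p)
		return p;

	/* 2) fall back to virtually contiguous memory */
	return vmalloc(size);
}

So once vmalloc_array() has already failed in inet_ehash_locks_alloc(),
the kvmalloc_array() fallback repeats step 1) and, for a large table,
effectively step 2) as well, which is the extra round mentioned above.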

Thanks,
Jason
Eric Dumazet March 6, 2025, 7:25 a.m. UTC | #6
On Thu, Mar 6, 2025 at 7:35 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Mar 6, 2025 at 2:22 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Jason Xing <kerneljasonxing@gmail.com>
> > Date: Thu, 6 Mar 2025 12:59:03 +0800
> > > On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > > > >
> > > > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > > > >
> > > > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > > > >
> > > > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > > > >
> > > > > > Tested:
> > > > > >
> > > > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > >
> > > > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > >
> > > > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > >
> > > > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > >
> > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > > >
> > > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > >
> > > >
> > > > >
> > > > > > ---
> > > > > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > > > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > > > >  {
> > > > > >         unsigned int locksz = sizeof(spinlock_t);
> > > > > >         unsigned int i, nblocks = 1;
> > > > > > +       spinlock_t *ptr = NULL;
> > > > > >
> > > > > > -       if (locksz != 0) {
> > > > > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > > > +       if (locksz == 0)
> > > > > > +               goto set_mask;
> > > > > >
> > > > > > -               /* no more locks than number of hash buckets */
> > > > > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > > > >
> > > > > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > > > -               if (!hashinfo->ehash_locks)
> > > > > > -                       return -ENOMEM;
> > > > > > +       /* At least one page per NUMA node. */
> > > > > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > > > +
> > > > > > +       nblocks = roundup_pow_of_two(nblocks);
> > > > > > +
> > > > > > +       /* No more locks than number of hash buckets. */
> > > > > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > >
> > > > > > -               for (i = 0; i < nblocks; i++)
> > > > > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > > > +       if (num_online_nodes() > 1) {
> > > > > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > > > > +                * on all available nodes if desired.
> > > > > > +                */
> > > > > > +               ptr = vmalloc_array(nblocks, locksz);
> > > > >
> > > > > I wonder if at this point the memory shortage occurs, is it necessary
> > > > > to fall back to kvmalloc() later
> > > >
> > > > If ptr is NULL here, kvmalloc_array() is called below.
> > >
> > > My point is why not return with -ENOMEM directly? Or else it looks meaningless.
> > >
> >
> > Ah, I misread.  I'm not sure how likely such a case happens, but I
> > think vmalloc() and kmalloc() failures do not always correlate; the
> > former uses node_alloc() and the latter uses the page allocator.
>
> Sure, it is unlikely to happen.
>
> As to memory allocation, we usually try kmalloc() for allocations
> smaller than a page and vmalloc() for larger ones. The same logic
> can be seen in kvmalloc(): try kmalloc() first, then fall back to
> vmalloc(). Since we already failed to allocate non-contiguous memory,
> there is no need to try kvmalloc() (which will call kmalloc() and
> vmalloc() one more time).

I chose to not add code, because:

       if (num_online_nodes() > 1) {
               /* Use vmalloc() to allow NUMA policy to spread pages
                * on all available nodes if desired.
                */
               ptr = vmalloc_array(nblocks, locksz);

<< adding a test here is pointless, we already have correct code if
ptr == NULL >>

       }
       if (!ptr) {
               ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
               if (!ptr)
                       return -ENOMEM;
        }


Sure, this could be written in a different way, but ultimately it is a
matter of taste.
Jason Xing March 6, 2025, 8:03 a.m. UTC | #7
On Thu, Mar 6, 2025 at 3:26 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 7:35 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Mar 6, 2025 at 2:22 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > Date: Thu, 6 Mar 2025 12:59:03 +0800
> > > > On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > > > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > > > > >
> > > > > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > > > > >
> > > > > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > > > > >
> > > > > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > > > > >
> > > > > > > Tested:
> > > > > > >
> > > > > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > >
> > > > > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > >
> > > > > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > >
> > > > > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > >
> > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > >
> > > > > > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > > > >
> > > > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > >
> > > > >
> > > > > >
> > > > > > > ---
> > > > > > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > > > > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > > > > >  {
> > > > > > >         unsigned int locksz = sizeof(spinlock_t);
> > > > > > >         unsigned int i, nblocks = 1;
> > > > > > > +       spinlock_t *ptr = NULL;
> > > > > > >
> > > > > > > -       if (locksz != 0) {
> > > > > > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > > > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > > > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > > > > +       if (locksz == 0)
> > > > > > > +               goto set_mask;
> > > > > > >
> > > > > > > -               /* no more locks than number of hash buckets */
> > > > > > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > > > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > > > > >
> > > > > > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > > > > -               if (!hashinfo->ehash_locks)
> > > > > > > -                       return -ENOMEM;
> > > > > > > +       /* At least one page per NUMA node. */
> > > > > > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > > > > +
> > > > > > > +       nblocks = roundup_pow_of_two(nblocks);
> > > > > > > +
> > > > > > > +       /* No more locks than number of hash buckets. */
> > > > > > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > >
> > > > > > > -               for (i = 0; i < nblocks; i++)
> > > > > > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > > > > +       if (num_online_nodes() > 1) {
> > > > > > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > > > > > +                * on all available nodes if desired.
> > > > > > > +                */
> > > > > > > +               ptr = vmalloc_array(nblocks, locksz);
> > > > > >
> > > > > > I wonder if at this point the memory shortage occurs, is it necessary
> > > > > > to fall back to kvmalloc() later
> > > > >
> > > > > If ptr is NULL here, kvmalloc_array() is called below.
> > > >
> > > > My point is why not return with -ENOMEM directly? Or else it looks meaningless.
> > > >
> > >
> > > Ah, I misread.  I'm not sure how likely such a case happens, but I
> > > think vmalloc() and kmalloc() failures do not always correlate; the
> > > former uses node_alloc() and the latter uses the page allocator.
> >
> > Sure, it is unlikely to happen.
> >
> > As to memory allocation, we usually try kmalloc() for allocations
> > smaller than a page and vmalloc() for larger ones. The same logic
> > can be seen in kvmalloc(): try kmalloc() first, then fall back to
> > vmalloc(). Since we already failed to allocate non-contiguous memory,
> > there is no need to try kvmalloc() (which will call kmalloc() and
> > vmalloc() one more time).
>
> I chose to not add code, because:
>
>        if (num_online_nodes() > 1) {
>                /* Use vmalloc() to allow NUMA policy to spread pages
>                 * on all available nodes if desired.
>                 */
>                ptr = vmalloc_array(nblocks, locksz);
>
> << adding a test here is pointless, we already have correct code if
> ptr == NULL >>
>
>        }
>        if (!ptr) {
>                ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
>                if (!ptr)
>                        return -ENOMEM;
>         }
>
>
> Sure, this could be written in a different way, but ultimately it is a
> matter of taste.

Sorry that I didn't make myself clear enough. I mean that if
vmalloc_array() fails, it will fall back to kvmalloc_array(),
which will call kmalloc() or even vmalloc() to allocate memory again.
My intention is to return with an error code when the first
allocation attempt fails.

Code like this on top of your patch:
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 3edbe2dad8ba..d026918319d2 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1282,12 +1282,12 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
                 * on all available nodes if desired.
                 */
                ptr = vmalloc_array(nblocks, locksz);
-       }
-       if (!ptr) {
+       } else {
                ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
-               if (!ptr)
-                       return -ENOMEM;
        }
+       if (!ptr)
+               return -ENOMEM;
+
        for (i = 0; i < nblocks; i++)
                spin_lock_init(&ptr[i]);
        hashinfo->ehash_locks = ptr;

Sure, it's not a big deal at all. Just try more rounds to allocate memory...

Thanks,
jason
Eric Dumazet March 6, 2025, 8:17 a.m. UTC | #8
On Thu, Mar 6, 2025 at 9:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Mar 6, 2025 at 3:26 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Mar 6, 2025 at 7:35 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, Mar 6, 2025 at 2:22 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > Date: Thu, 6 Mar 2025 12:59:03 +0800
> > > > > On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > > > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > > > > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > >
> > > > > > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > > > > > >
> > > > > > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > > > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > > > > > >
> > > > > > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > > > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > > > > > >
> > > > > > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > > > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > > > > > >
> > > > > > > > Tested:
> > > > > > > >
> > > > > > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > >
> > > > > > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > >
> > > > > > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > >
> > > > > > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > >
> > > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > >
> > > > > > > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > > > > >
> > > > > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > > ---
> > > > > > > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > > > > > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > > > > > >  {
> > > > > > > >         unsigned int locksz = sizeof(spinlock_t);
> > > > > > > >         unsigned int i, nblocks = 1;
> > > > > > > > +       spinlock_t *ptr = NULL;
> > > > > > > >
> > > > > > > > -       if (locksz != 0) {
> > > > > > > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > > > > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > > > > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > > > > > +       if (locksz == 0)
> > > > > > > > +               goto set_mask;
> > > > > > > >
> > > > > > > > -               /* no more locks than number of hash buckets */
> > > > > > > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > > > > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > > > > > >
> > > > > > > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > > > > > -               if (!hashinfo->ehash_locks)
> > > > > > > > -                       return -ENOMEM;
> > > > > > > > +       /* At least one page per NUMA node. */
> > > > > > > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > > > > > +
> > > > > > > > +       nblocks = roundup_pow_of_two(nblocks);
> > > > > > > > +
> > > > > > > > +       /* No more locks than number of hash buckets. */
> > > > > > > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > > >
> > > > > > > > -               for (i = 0; i < nblocks; i++)
> > > > > > > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > > > > > +       if (num_online_nodes() > 1) {
> > > > > > > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > > > > > > +                * on all available nodes if desired.
> > > > > > > > +                */
> > > > > > > > +               ptr = vmalloc_array(nblocks, locksz);
> > > > > > >
> > > > > > > I wonder if at this point the memory shortage occurs, is it necessary
> > > > > > > to fall back to kvmalloc() later
> > > > > >
> > > > > > If ptr is NULL here, kvmalloc_array() is called below.
> > > > >
> > > > > My point is why not return with -ENOMEM directly? Or else it looks meaningless.
> > > > >
> > > >
> > > > Ah, I misread.  I'm not sure how likely such a case happens, but I
> > > > think vmalloc() and kmalloc() failures do not always correlate; the
> > > > former uses node_alloc() and the latter uses the page allocator.
> > >
> > > Sure, it is unlikely to happen.
> > >
> > > As to memory allocation, we usually try kmalloc() for allocations
> > > smaller than a page and vmalloc() for larger ones. The same logic
> > > can be seen in kvmalloc(): try kmalloc() first, then fall back to
> > > vmalloc(). Since we already failed to allocate non-contiguous memory,
> > > there is no need to try kvmalloc() (which will call kmalloc() and
> > > vmalloc() one more time).
> >
> > I chose to not add code, because:
> >
> >        if (num_online_nodes() > 1) {
> >                /* Use vmalloc() to allow NUMA policy to spread pages
> >                 * on all available nodes if desired.
> >                 */
> >                ptr = vmalloc_array(nblocks, locksz);
> >
> > << adding a test here is pointless, we already have correct code if
> > ptr == NULL >>
> >
> >        }
> >        if (!ptr) {
> >                ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> >                if (!ptr)
> >                        return -ENOMEM;
> >         }
> >
> >
> > Sure, this could be written in a different way, but ultimately it is a
> > matter of taste.
>
> Sorry that I didn't make myself clear enough. I mean that if
> vmalloc_array() fails, it will fall back to kvmalloc_array(),
> which will call kmalloc() or even vmalloc() to allocate memory again.
> My intention is to return with an error code when the first
> allocation attempt fails.
>
> Code like this on top of your patch:
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 3edbe2dad8ba..d026918319d2 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1282,12 +1282,12 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
>                  * on all available nodes if desired.
>                  */
>                 ptr = vmalloc_array(nblocks, locksz);
> -       }
> -       if (!ptr) {
> +       } else {
>                 ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> -               if (!ptr)
> -                       return -ENOMEM;
>         }
> +       if (!ptr)
> +               return -ENOMEM;
> +

I think I understood pretty well, and said it was a matter of taste.

I wish we could move on to more interesting stuff, like the main ehash
table, which currently uses 2 huge pages; they can all land on one
physical socket.
Jason Xing March 6, 2025, 8:50 a.m. UTC | #9
On Thu, Mar 6, 2025 at 4:18 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 9:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Mar 6, 2025 at 3:26 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Mar 6, 2025 at 7:35 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 6, 2025 at 2:22 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > > Date: Thu, 6 Mar 2025 12:59:03 +0800
> > > > > > On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > >
> > > > > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > > > > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > > > > > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > > >
> > > > > > > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > > > > > > >
> > > > > > > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > > > > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > > > > > > >
> > > > > > > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > > > > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > > > > > > >
> > > > > > > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > > > > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > > > > > > >
> > > > > > > > > Tested:
> > > > > > > > >
> > > > > > > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > > >
> > > > > > > > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > > > > > >
> > > > > > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > ---
> > > > > > > > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > > > > > > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > > > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > > > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > > > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > > > > > > >  {
> > > > > > > > >         unsigned int locksz = sizeof(spinlock_t);
> > > > > > > > >         unsigned int i, nblocks = 1;
> > > > > > > > > +       spinlock_t *ptr = NULL;
> > > > > > > > >
> > > > > > > > > -       if (locksz != 0) {
> > > > > > > > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > > > > > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > > > > > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > > > > > > +       if (locksz == 0)
> > > > > > > > > +               goto set_mask;
> > > > > > > > >
> > > > > > > > > -               /* no more locks than number of hash buckets */
> > > > > > > > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > > > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > > > > > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > > > > > > >
> > > > > > > > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > > > > > > -               if (!hashinfo->ehash_locks)
> > > > > > > > > -                       return -ENOMEM;
> > > > > > > > > +       /* At least one page per NUMA node. */
> > > > > > > > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > > > > > > +
> > > > > > > > > +       nblocks = roundup_pow_of_two(nblocks);
> > > > > > > > > +
> > > > > > > > > +       /* No more locks than number of hash buckets. */
> > > > > > > > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > > > >
> > > > > > > > > -               for (i = 0; i < nblocks; i++)
> > > > > > > > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > > > > > > +       if (num_online_nodes() > 1) {
> > > > > > > > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > > > > > > > +                * on all available nodes if desired.
> > > > > > > > > +                */
> > > > > > > > > +               ptr = vmalloc_array(nblocks, locksz);
> > > > > > > >
> > > > > > > > I wonder, if a memory shortage occurs at this point, whether it is
> > > > > > > > necessary to fall back to kvmalloc() later
> > > > > > >
> > > > > > > If ptr is NULL here, kvmalloc_array() is called below.
> > > > > >
> > > > > > My point is: why not return -ENOMEM directly? Otherwise it looks meaningless.
> > > > > >
> > > > >
> > > > > Ah, I misread.  I'm not sure how likely such a case is, but I
> > > > > think vmalloc() and kmalloc() failures do not always correlate;
> > > > > the former uses node_alloc() and the latter uses the page allocator.
> > > >
> > > > Sure, it is unlikely to happen.
> > > >
> > > > As to memory allocation, we usually try kmalloc() for allocations smaller
> > > > than a page and vmalloc() for larger ones. The same logic can be seen in
> > > > kvmalloc(): try kmalloc() first, then fall back to vmalloc(). Since we
> > > > already failed to allocate non-contiguous memory, there is no need to try
> > > > kvmalloc() (which would call kmalloc() and vmalloc() one more time).
> > >
> > > I chose to not add code, because:
> > >
> > >        if (num_online_nodes() > 1) {
> > >                /* Use vmalloc() to allow NUMA policy to spread pages
> > >                 * on all available nodes if desired.
> > >                 */
> > >                ptr = vmalloc_array(nblocks, locksz);
> > >
> > > << adding a test here is pointless, we already have correct code if
> > > ptr == NULL >>
> > >
> > >        }
> > >        if (!ptr) {
> > >                ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > >                if (!ptr)
> > >                        return -ENOMEM;
> > >         }
> > >
> > >
> > > Sure, this could be written in a different way, but ultimately it is a
> > > matter of taste.
> >
> > Sorry that I didn't make myself clear enough. I mean that if
> > vmalloc_array() fails, we then fall back to kvmalloc_array(), which
> > will call kmalloc() or even vmalloc() to allocate memory again.
> > My intention is to return an error code when the first allocation
> > fails.
> >
> > Code like this on top of your patch:
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 3edbe2dad8ba..d026918319d2 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -1282,12 +1282,12 @@ int inet_ehash_locks_alloc(struct
> > inet_hashinfo *hashinfo)
> >                  * on all available nodes if desired.
> >                  */
> >                 ptr = vmalloc_array(nblocks, locksz);
> > -       }
> > -       if (!ptr) {
> > +       } else {
> >                 ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > -               if (!ptr)
> > -                       return -ENOMEM;
> >         }
> > +       if (!ptr)
> > +               return -ENOMEM;
> > +
>
> I think I understood you pretty well, and said it was a matter of taste.
> 
> I wish we could move on to more interesting stuff, like the main ehash
> table, which currently uses 2 huge pages that can all land on one
> physical socket.

Interesting. I will dig into it after next week, because the netdev
conference takes place next week :)

Before this, I tried to accelerate skb transmission on a machine with
four NUMA nodes, for example by allocating skbs on the local NUMA node,
but it didn't show a throughput improvement. Glad to know you're working
on such challenging experiments.

Thanks,
Jason
patchwork-bot+netdevbpf@kernel.org March 6, 2025, 11:40 p.m. UTC | #10
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed,  5 Mar 2025 13:05:50 +0000 you wrote:
> We have platforms with 6 NUMA nodes and 480 cpus.
> 
> inet_ehash_locks_alloc() currently allocates a single 64KB page
> to hold all ehash spinlocks. This adds more pressure on a single node.
> 
> Change inet_ehash_locks_alloc() to use vmalloc() to spread
> the spinlocks on all online nodes, driven by NUMA policies.
> 
> [...]

Here is the summary with links:
  - [net-next] tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()
    https://git.kernel.org/netdev/net-next/c/f8ece40786c9

You are awesome, thank you!
diff mbox series

Patch

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1230,22 +1230,37 @@  int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 {
 	unsigned int locksz = sizeof(spinlock_t);
 	unsigned int i, nblocks = 1;
+	spinlock_t *ptr = NULL;
 
-	if (locksz != 0) {
-		/* allocate 2 cache lines or at least one spinlock per cpu */
-		nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
-		nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
+	if (locksz == 0)
+		goto set_mask;
 
-		/* no more locks than number of hash buckets */
-		nblocks = min(nblocks, hashinfo->ehash_mask + 1);
+	/* Allocate 2 cache lines or at least one spinlock per cpu. */
+	nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
 
-		hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
-		if (!hashinfo->ehash_locks)
-			return -ENOMEM;
+	/* At least one page per NUMA node. */
+	nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
+
+	nblocks = roundup_pow_of_two(nblocks);
+
+	/* No more locks than number of hash buckets. */
+	nblocks = min(nblocks, hashinfo->ehash_mask + 1);
 
-		for (i = 0; i < nblocks; i++)
-			spin_lock_init(&hashinfo->ehash_locks[i]);
+	if (num_online_nodes() > 1) {
+		/* Use vmalloc() to allow NUMA policy to spread pages
+		 * on all available nodes if desired.
+		 */
+		ptr = vmalloc_array(nblocks, locksz);
+	}
+	if (!ptr) {
+		ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
+		if (!ptr)
+			return -ENOMEM;
 	}
+	for (i = 0; i < nblocks; i++)
+		spin_lock_init(&ptr[i]);
+	hashinfo->ehash_locks = ptr;
+set_mask:
 	hashinfo->ehash_locks_mask = nblocks - 1;
 	return 0;
 }
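
As a quick sanity check of the sizing logic above (not part of the patch),
the small userspace sketch below replays the nblocks computation for the
test machine in the commit message. The 6 nodes, 480 CPUs and the two
tcp_child_ehash_entries values come from the thread; sizeof(spinlock_t) == 4
(no lockdep), L1_CACHE_BYTES == 64, PAGE_SIZE == 4096 and a boot-time ehash
of 524288 buckets are assumptions chosen so the output can be compared with
the vmallocinfo lines shown in the tests.

/*
 * Standalone sketch, not kernel code.  Assumed constants: 4-byte spinlock_t,
 * 64-byte cache lines, 4 KB pages, 480 possible CPUs, 6 online NUMA nodes.
 */
#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

static unsigned int nblocks_for(unsigned int ehash_buckets)
{
	const unsigned int locksz = 4, l1_cache = 64, page_size = 4096;
	const unsigned int cpus = 480, nodes = 6;
	unsigned int nblocks;

	/* Two cache lines, or at least one spinlock, per possible CPU. */
	nblocks = (2 * l1_cache / locksz ? 2 * l1_cache / locksz : 1) * cpus;

	/* At least one page worth of locks per online NUMA node. */
	if (nblocks < nodes * page_size / locksz)
		nblocks = nodes * page_size / locksz;

	nblocks = roundup_pow_of_two(nblocks);

	/* Never more locks than hash buckets. */
	if (nblocks > ehash_buckets)
		nblocks = ehash_buckets;
	return nblocks;
}

int main(void)
{
	/* Assumed boot-time ehash, then the two sysctl cases from the tests. */
	unsigned int sizes[] = { 524288, 8192, 1024 };
	unsigned int i;

	for (i = 0; i < 3; i++) {
		unsigned int n = nblocks_for(sizes[i]);

		printf("%6u buckets -> %5u locks, %6u bytes\n",
		       sizes[i], n, n * 4);
	}
	/*
	 * Prints 16384 locks / 65536 bytes (16 pages), 8192 / 32768 (8 pages)
	 * and 1024 / 4096 (1 page); adding the one-page vmalloc guard gives
	 * the 69632, 36864 and 8192 byte regions seen in /proc/vmallocinfo.
	 */
	return 0;
}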