Message ID | 20240226022452.20558-1-adamli@os.amperecomputing.com |
---|---|
State | Accepted |
Commit | 12a686c2e761f1f1f6e6e2117a9ab9c6de2ac8a7 |
Delegated to | Netdev Maintainers |
Series | net: make SK_MEMORY_PCPU_RESERV tunable |
Looks good to me. A follow-up patch could automatically raise the
tunable on machines with high core counts.
Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
On Mon, Feb 26, 2024 at 3:25 AM Adam Li <adamli@os.amperecomputing.com> wrote:
>
> This patch adds the /proc/sys/net/core/mem_pcpu_rsv sysctl file
> to make SK_MEMORY_PCPU_RESERV tunable.
>
> Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
> memory_allocated") introduced per-cpu forward alloc cache:
>
> "Implement a per-cpu cache of +1/-1 MB, to reduce number
> of changes to sk->sk_prot->memory_allocated, which
> would otherwise be cause of false sharing."
>
> sk_prot->memory_allocated points to global atomic variable:
> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
>
> Increasing the per-cpu cache size from 1MB to e.g. 16MB further
> reduces changes to sk->sk_prot->memory_allocated.
> Performance may be improved on systems with many cores.

This looks good, do you have any performance numbers to share ?

On a host with 384 threads, 384*16 -> 6 GB of memory.

With this kind of use, we might need a shrinker...
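To make the batching under discussion concrete, here is a minimal
userspace model of the scheme. This is a sketch only, with made-up
names and a made-up threshold; the real code operates on
sk->sk_prot->per_cpu_fw_alloc with preemption disabled, as the diff
further down shows:

	#include <stdatomic.h>
	#include <stdio.h>

	#define PCPU_RESERVE 256	/* flush threshold in page units; 256 4K pages = 1MB */

	static atomic_long memory_allocated;		/* shared, contended counter */
	static _Thread_local long local_reserve;	/* stands in for the per-cpu slot */

	static void memory_allocated_add(long amt)
	{
		local_reserve += amt;
		/* Touch the shared cache line only when the local batch
		 * overflows the reserve in either direction. */
		if (local_reserve >= PCPU_RESERVE || local_reserve <= -PCPU_RESERVE) {
			atomic_fetch_add(&memory_allocated, local_reserve);
			local_reserve = 0;
		}
	}

	int main(void)
	{
		for (int i = 0; i < 1000; i++)
			memory_allocated_add(1);	/* 1000 updates, only 3 atomic RMWs */
		printf("flushed: %ld, still per-thread: %ld\n",
		       (long)atomic_load(&memory_allocated), local_reserve);
		return 0;
	}

Raising the threshold makes flushes to the shared atomic rarer, which
is exactly the trade discussed in this thread: fewer contended updates
versus more memory parked in per-cpu reserves.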
On Tue, 27 Feb 2024, Eric Dumazet wrote:

>> sk_prot->memory_allocated points to global atomic variable:
>> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
>>
>> Increasing the per-cpu cache size from 1MB to e.g. 16MB further
>> reduces changes to sk->sk_prot->memory_allocated.
>> Performance may be improved on systems with many cores.
>
> This looks good, do you have any performance numbers to share ?
>
> On a host with 384 threads, 384*16 -> 6 GB of memory.

Those things also come with corresponding memories of a couple of TB...

> With this kind of use, we might need a shrinker...

Yes. No point in keeping the buffers around if the core stops doing
networking. But to be done at times when there is no contention, please.
Isn't there something like a timeout for skbs in the network stack
already?
On Tue, 27 Feb 2024 15:08:18 -0800 (PST) Lameter, Christopher wrote:
> > This looks good, do you have any performance numbers to share ?
> >
> > On a host with 384 threads, 384*16 -> 6 GB of memory.
>
> Those things also come with corresponding memories of a couple of TB...

We have a lot of machines at Meta with more cores than gigabytes of
memory. Keying on the amount of memory would make sense. Something like
max(1MB, sk_mem / cores / 8) comes to mind? In fact it may be a better
idea to have the sysctl control the divisor (the 8 in my example above).
I had issues in the past with people "micro-optimizing" the absolute
size, forgetting about it, moving the workload to a larger machine and
then complaining TCP is choking :(
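A sketch of what that auto-sizing could look like. The helper and its
parameters are hypothetical, not kernel API; it assumes 4K pages and a
tcp_mem-style limit expressed in page units:

	/* Hypothetical helper sketching the idea above: derive the per-cpu
	 * reserve from the protocol memory limit and the core count, and
	 * make the divisor the real tunable instead of the absolute size.
	 */
	static long mem_pcpu_rsv_pages(long sk_mem_pages, int nr_cpus, int divisor)
	{
		long one_mb = 1 << (20 - 12);		/* 1MB in 4K-page units = 256 */
		long scaled = sk_mem_pages / nr_cpus / divisor;

		return scaled > one_mb ? scaled : one_mb;	/* max(1MB, sk_mem / cores / 8) */
	}

Tuning the divisor rather than the size would let the same setting
scale sanely when a workload moves to a machine with more memory or
more cores.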
Hello:

This patch was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Mon, 26 Feb 2024 02:24:52 +0000 you wrote:
> This patch adds the /proc/sys/net/core/mem_pcpu_rsv sysctl file
> to make SK_MEMORY_PCPU_RESERV tunable.
>
> Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
> memory_allocated") introduced per-cpu forward alloc cache:
>
> "Implement a per-cpu cache of +1/-1 MB, to reduce number
> of changes to sk->sk_prot->memory_allocated, which
> would otherwise be cause of false sharing."
>
> [...]

Here is the summary with links:
  - net: make SK_MEMORY_PCPU_RESERV tunable
    https://git.kernel.org/netdev/net-next/c/12a686c2e761

You are awesome, thank you!
On Wed, Feb 28, 2024 at 12:08 AM Lameter, Christopher
<cl@os.amperecomputing.com> wrote:
>
> On Tue, 27 Feb 2024, Eric Dumazet wrote:
>
> >> sk_prot->memory_allocated points to global atomic variable:
> >> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
> >>
> >> Increasing the per-cpu cache size from 1MB to e.g. 16MB further
> >> reduces changes to sk->sk_prot->memory_allocated.
> >> Performance may be improved on systems with many cores.
> >
> > This looks good, do you have any performance numbers to share ?
> >
> > On a host with 384 threads, 384*16 -> 6 GB of memory.
>
> Those things also come with corresponding memories of a couple of TB...
>
> > With this kind of use, we might need a shrinker...
>
> Yes. No point in keeping the buffers around if the core stops doing
> networking. But to be done at times when there is no contention, please.

I have yet to see the 'contention'? I usually see one on the zone
spinlock or memcg ones when allocating/freeing pages, not on the
tcp_memory_allocated atomic.

We can add caches for sure; we had a giant one before my patch, and
this was a disaster, really, for workloads with millions of TCP sockets.
On 2/28/2024 4:38 AM, Eric Dumazet wrote:

>> sk_prot->memory_allocated points to global atomic variable:
>> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
>>
>> Increasing the per-cpu cache size from 1MB to e.g. 16MB further
>> reduces changes to sk->sk_prot->memory_allocated.
>> Performance may be improved on systems with many cores.
>
> This looks good, do you have any performance numbers to share ?

I ran a localhost memcached test on a system with 320 CPU threads;
perf shows 4% of cycles spent in __sk_mem_raise_allocated() -->
sk_memory_allocated(). With SK_MEMORY_PCPU_RESERV increased to 16MB,
the share of perf cycles spent in __sk_mem_raise_allocated() drops
to 0.4%.

Thanks,
-adam
On Wed, Feb 28, 2024 at 2:21 PM Adam Li <adamli@os.amperecomputing.com> wrote:
>
> On 2/28/2024 4:38 AM, Eric Dumazet wrote:
>
> >> sk_prot->memory_allocated points to global atomic variable:
> >> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
> >>
> >> Increasing the per-cpu cache size from 1MB to e.g. 16MB further
> >> reduces changes to sk->sk_prot->memory_allocated.
> >> Performance may be improved on systems with many cores.
> >
> > This looks good, do you have any performance numbers to share ?
>
> I ran a localhost memcached test on a system with 320 CPU threads;
> perf shows 4% of cycles spent in __sk_mem_raise_allocated() -->
> sk_memory_allocated(). With SK_MEMORY_PCPU_RESERV increased to 16MB,
> the share of perf cycles spent in __sk_mem_raise_allocated() drops
> to 0.4%.

I suspect some kind of flow/cpu steering issues then.
Also maybe SO_RESERVE_MEM would be better for this workload.
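For readers unfamiliar with the option: SO_RESERVE_MEM pre-charges
forward-alloc memory to a single socket so its hot path can skip the
global accounting up to that amount. A hedged userspace sketch, on the
assumption that the socket is under a memory cgroup with socket
accounting enabled (the option has no effect otherwise):

	#include <sys/socket.h>
	#include <stdio.h>

	#ifndef SO_RESERVE_MEM
	#define SO_RESERVE_MEM 73	/* value from include/uapi/asm-generic/socket.h */
	#endif

	/* Reserve ~1MB of forward-alloc memory on one socket. Sketch only:
	 * the kernel skips the reservation unless memcg socket accounting
	 * applies to this socket.
	 */
	static int reserve_sock_mem(int fd)
	{
		int bytes = 1 << 20;

		if (setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
			       &bytes, sizeof(bytes)) < 0) {
			perror("setsockopt(SO_RESERVE_MEM)");
			return -1;
		}
		return 0;
	}

Unlike the sysctl, this targets only the sockets that need it instead
of raising the reserve for every CPU.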
On Wed, 28 Feb 2024, Eric Dumazet wrote:

>> __sk_mem_raise_allocated() drops to 0.4%.
>
> I suspect some kind of flow/cpu steering issues then.
> Also maybe SO_RESERVE_MEM would be better for this workload.

This is via loopback. So there is a flow steering issue in the IP
stack?
On Wed, Mar 6, 2024 at 6:01 PM Lameter, Christopher
<cl@os.amperecomputing.com> wrote:
>
> On Wed, 28 Feb 2024, Eric Dumazet wrote:
>
> >> __sk_mem_raise_allocated() drops to 0.4%.
> >
> > I suspect some kind of flow/cpu steering issues then.
> > Also maybe SO_RESERVE_MEM would be better for this workload.
>
> This is via loopback. So there is a flow steering issue in the IP
> stack?

Asymmetric allocations / freeing, things that will usually have a high
cost for payload copy anyway.

Maybe hierarchical tracking would avoid false sharing if some arches
pay a high price for it:

- One per-cpu reserve (X MB).
- One per-memory-domain reserve (number_of_cpu_in_this_domain * X MB).
- A global reserve, with an uncertainty of number_of_cpus * X MB.

Basically reworking lib/percpu_counter.c for better NUMA awareness; a
rough sketch follows.
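One possible shape of that hierarchy. The struct, names, and spill
logic here are illustrative only, not an actual lib/percpu_counter.c
interface; callers would need the same preempt_disable() protection as
in the patch below:

	#include <linux/percpu.h>
	#include <linux/atomic.h>

	/* Illustrative three-level counter: per-cpu batches spill into a
	 * per-memory-domain atomic, which in turn drains into the global
	 * atomic, keeping most updates within one NUMA node.
	 */
	struct hier_counter {
		long __percpu	*pcpu;		/* level 1: uncontended per-cpu slot */
		atomic_long_t	*per_node;	/* level 2: one per memory domain */
		atomic_long_t	global;		/* level 3: number_of_cpus * X MB slack */
	};

	static void hier_counter_add(struct hier_counter *c, int node, long amt,
				     long cpu_batch, long node_batch)
	{
		long v, n;

		v = __this_cpu_add_return(*c->pcpu, amt);
		if (abs(v) < cpu_batch)
			return;

		/* Spill this CPU's batch into its memory domain... */
		__this_cpu_sub(*c->pcpu, v);
		n = atomic_long_add_return(v, &c->per_node[node]);
		if (abs(n) >= node_batch) {
			/* ...and atomically drain the domain into the global total. */
			n = atomic_long_xchg(&c->per_node[node], 0);
			atomic_long_add(n, &c->global);
		}
	}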
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 396091651955..7250c0542828 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -206,6 +206,11 @@ Will increase power usage.
 
 Default: 0 (off)
 
+mem_pcpu_rsv
+------------
+
+Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
+
 rmem_default
 ------------
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 796a902cf4c1..09a0cde8bf52 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1443,6 +1443,7 @@ sk_memory_allocated(const struct sock *sk)
 
 /* 1 MB per cpu, in page units */
 #define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))
+extern int sysctl_mem_pcpu_rsv;
 
 static inline void
 sk_memory_allocated_add(struct sock *sk, int amt)
@@ -1451,7 +1452,7 @@ sk_memory_allocated_add(struct sock *sk, int amt)
 
 	preempt_disable();
 	local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
-	if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
+	if (local_reserve >= READ_ONCE(sysctl_mem_pcpu_rsv)) {
 		__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
 		atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
 	}
@@ -1465,7 +1466,7 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
 
 	preempt_disable();
 	local_reserve = __this_cpu_sub_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
-	if (local_reserve <= -SK_MEMORY_PCPU_RESERVE) {
+	if (local_reserve <= -READ_ONCE(sysctl_mem_pcpu_rsv)) {
 		__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
 		atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 8d86886e39fa..df2ac54a8f74 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -283,6 +283,7 @@ __u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
 EXPORT_SYMBOL(sysctl_rmem_max);
 __u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
 __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
+int sysctl_mem_pcpu_rsv __read_mostly = SK_MEMORY_PCPU_RESERVE;
 
 int sysctl_tstamp_allow_data __read_mostly = 1;
 
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 0f0cb1465e08..986f15e5d6c4 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -30,6 +30,7 @@ static int int_3600 = 3600;
 static int min_sndbuf = SOCK_MIN_SNDBUF;
 static int min_rcvbuf = SOCK_MIN_RCVBUF;
 static int max_skb_frags = MAX_SKB_FRAGS;
+static int min_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE;
 
 static int net_msg_warn;	/* Unused, but still a sysctl */
 
@@ -407,6 +408,14 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &min_rcvbuf,
 	},
+	{
+		.procname	= "mem_pcpu_rsv",
+		.data		= &sysctl_mem_pcpu_rsv,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_mem_pcpu_rsv,
+	},
 	{
 		.procname	= "dev_weight",
 		.data		= &weight_p,
This patch adds the /proc/sys/net/core/mem_pcpu_rsv sysctl file
to make SK_MEMORY_PCPU_RESERV tunable.

Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
memory_allocated") introduced per-cpu forward alloc cache:

"Implement a per-cpu cache of +1/-1 MB, to reduce number
of changes to sk->sk_prot->memory_allocated, which
would otherwise be cause of false sharing."

sk_prot->memory_allocated points to global atomic variable:
atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;

Increasing the per-cpu cache size from 1MB to e.g. 16MB further
reduces changes to sk->sk_prot->memory_allocated.
Performance may be improved on systems with many cores.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
---
 Documentation/admin-guide/sysctl/net.rst | 5 +++++
 include/net/sock.h                       | 5 +++--
 net/core/sock.c                          | 1 +
 net/core/sysctl_net_core.c               | 9 +++++++++
 4 files changed, 18 insertions(+), 2 deletions(-)
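A tuning note for anyone picking this up: mem_pcpu_rsv is expressed in
page units, so assuming 4K pages, the 16MB per-CPU cache discussed in
the thread corresponds to 16 * 256 = 4096 pages, i.e.
"echo 4096 > /proc/sys/net/core/mem_pcpu_rsv". The min_mem_pcpu_rsv
bound in the patch makes proc_dointvec_minmax() reject values below
the old 1MB default.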