Message ID | 20250331223516.7810-2-sweettea-kernel@dorminy.me (mailing list archive)
State      | New
Series     | [RFC,v2] mm: use per-numa-node atomics instead of percpu_counters
On Tue, Apr 1, 2025 at 6:36 AM Sweet Tea Dorminy
<sweettea-kernel@dorminy.me> wrote:
>
> [Resend as requested as RFC and minus prereq-patch-id junk]
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget. Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.
[snip]

Hi, thanks for the idea.
I'd like to mention my previous work on this:
https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/

Basically, it uses one global percpu counter instead of a per-task one, and
flushes each CPU's sub-counter on context_switch (only when next->active_mm
!= current->active_mm; no flush for IRQ or kthread switches). It is more
like a percpu stash.

Benchmark results look great and the fast path is super fast (just a
this_cpu_add). context_switch is also fine because the scheduler tries to
keep a task on the same CPU to make better use of the cache, and the scheme
can leverage a cpu bitmap, as TLB shootdown does, to optimize the whole
thing.

The error and total memory consumption are both lower than with the current
design too.
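For illustration only, here is a rough sketch of the percpu-stash scheme described above. All names (rss_stash, stash_add_mm_counter, flush_rss_stash, rss_stat_total) are hypothetical; this is not the code from the linked patchset.

#include <linux/atomic.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>

/*
 * Illustrative sketch: each CPU keeps a small array of pending RSS deltas
 * for the mm it is currently running.  The deltas are folded into the mm's
 * shared totals only when the CPU switches to a different mm, so the fast
 * path never touches a shared cacheline.
 */
struct rss_stash {
	long delta[NR_MM_COUNTERS];	/* pending, not yet in the mm totals */
};

static DEFINE_PER_CPU(struct rss_stash, rss_stash);

/* Fast path: a plain this_cpu_add(), no atomics. */
static inline void stash_add_mm_counter(int member, long value)
{
	this_cpu_add(rss_stash.delta[member], value);
}

/*
 * Slow path, called from the context-switch path when next->active_mm !=
 * prev->active_mm: fold this CPU's pending deltas into the outgoing mm.
 * rss_stat_total[] is an assumed per-mm array of atomic totals.
 */
static void flush_rss_stash(struct mm_struct *prev_mm)
{
	struct rss_stash *stash = this_cpu_ptr(&rss_stash);
	int i;

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		if (stash->delta[i]) {
			atomic_long_add(stash->delta[i],
					&prev_mm->rss_stat_total[i]);
			stash->delta[i] = 0;
		}
	}
}

A precise reader would then add the stashes of the CPUs currently running the mm (tracked, as noted above, with a cpumask similar to the one used for TLB shootdown) to the atomic totals.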
On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
> [Resend as requested as RFC and minus prereq-patch-id junk]
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget.

Any reason these applications are not using memcg stats/usage instead of
RSS? RSS is not the only memory consumption for these applications.

> Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.

Are these 5 threads jumping all over the 250 cpus?

> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.

Applications with tens of thousands of threads are very normal at Google,
so the inaccuracy should be comparable for such applications.

[snip]

> This RFC takes a different direction by replacing percpu_counters with a
> more efficient set of per-NUMA-node atomics. The approach:
>
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
>   update latency.
> - Reduces percpu memory usage and thus thread startup time.

That one atomic per node will easily become a bottleneck for applications
with a lot of threads, particularly on systems with a lot of cpus per NUMA
node.

> Most importantly, this bounds the total error to 32 times the number of NUMA
> nodes, significantly smaller than previous error bounds.
>
> On a 112-core machine, lmbench showed comparable results before and after this
> patch. However, on a 224 core machine, performance improvements were

How many cpus per node for each of these machines?

> significant over percpu_counter:
> - Pagefault latency improved by 8.91%

The following fork ones are understandable as percpu counter allocation is
involved, but the above page fault latency needs some explanation.
> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
>
> will-it-scale also showed significant improvements on these machines.

Are these the process ones or the threads ones?

[snip]

> Some interesting ideas came out of that discussion: Mathieu Desnoyers
> has a design doc for an improved percpu counter, multi-level, with
> constant drift, at
> https://lore.kernel.org/lkml/a89cb4d9-088e-4ed6-afde-f1b097de8db9@efficios.com/
> and would like performance comparisons against just reducing the batch
> size in the existing code;

You can do the experiments with different batch sizes in the existing code
without waiting for Mathieu's multi-level percpu counter.

> and Mateusz Guzik would also like a more general solution and is also
> working to fix the performance issues by caching mm state. Finally,
> Lorenzo Stoakes nacks, as it's too speculative and needs more
> discussion.
>
> I think the important part is that this improves accuracy; the current
> scheme is difficult to use on many-cored machines. It improves
> performance, but there are tradeoffs; and it tightly bounds the
> inaccuracy so that decisions can actually be reasonably made with the
> resulting numbers.
>
> This patch assumes that intra-NUMA node atomic updates are very cheap

The above statement/assumption needs experimental data.

> and that
> assigning CPUs to an atomic counter by numa_node_id() % 16 is suitably
> balanced.
[snip]

I am still not buying the 'good performance' point. To me we might need to
go with the reduced batch size of the existing approach or the multi-level
approach from Mathieu (I still have to see Mateusz's and Kairui's
proposals).
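As a hedged illustration of the batch-size experiment Shakeel suggests above, the existing rss helpers could pass an explicit, smaller batch through percpu_counter_add_batch() instead of relying on the global percpu_counter_batch default. MM_RSS_BATCH below is an arbitrary placeholder value, not a recommendation:

/*
 * Experimental tweak to the existing add_mm_counter() in include/linux/mm.h.
 * MM_RSS_BATCH is a made-up tunable: a smaller batch means smaller drift in
 * the approximate read, at the cost of more frequent updates of the shared
 * counter.
 */
#define MM_RSS_BATCH	8

static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	percpu_counter_add_batch(&mm->rss_stat[member], value, MM_RSS_BATCH);
	mm_trace_rss_stat(mm, member);
}

The worst-case drift seen by percpu_counter_read_positive() then scales with the batch size times the number of CPUs, so this trades some update performance for a tighter error bound without any structural change.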
On Tue, Apr 1, 2025 at 5:27 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 1, 2025 at 6:36 AM Sweet Tea Dorminy
> <sweettea-kernel@dorminy.me> wrote:
> >
> > [Resend as requested as RFC and minus prereq-patch-id junk]
> >
[snip]
>
> Hi, thanks for the idea.
> I'd like to mention my previous work on this:
> https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/
>
> Basically using one global percpu counter instead of a per-task one, and
> flush each CPU's sub-counter on context_switch (if next->active_mm !=
> current->active_mm, no switch for IRQ or kthread).
> More like a percpu stash.
[snip]
> The error and total memory consumption are both lower than current design too.

Note there are 2 unrelated components in that patchset:
- one per-cpu instance of rss counters which is rolled up on context
  switches, avoiding the costly counter alloc/free on mm creation/teardown
- cpu iteration in get_mm_counter

The allocation problem is fixable without abandoning the counters, see my
other e-mail (tl;dr let mm's hanging out in slab caches *keep* the
counters). This aspect has to be solved anyway due to mm_alloc_cid().
Providing a way to sort it out covers *both* the rss counters and the cid
thing.

In your patchset the accuracy increase comes at the expense of walking all
CPUs every time, while a big part of the point of using percpu counters is
to have a good enough approximation somewhere so that this is not
necessary. Indeed the stock kernel fails to achieve that at the moment,
and as you can see there is discussion on how to tackle it. It is a
general percpu counter problem.

I verified get_mm_counter is issued in particular on mmap and munmap. On
high-core-count boxes (hundreds of cores) the mandatory all-CPU walk has
to be a problem, especially if a given process is also highly
multi-threaded and mmap/munmap heavy.

Thus I think your patchset would also benefit from some form of
distribution of the counter other than just per-cpu plus the one
centralized value. At the same time, if RSS accuracy is your only concern
and you don't care about walking the CPUs, then you could modify the
current code to also do it.

Or to put it differently: while it may be that changing the scheme to have
a local copy makes sense, the patchset is definitely not committable in
the proposed form -- it really wants better quality caching of the state.

--
Mateusz Guzik <mjguzik gmail.com>
On Tue, Apr 1, 2025 at 12:36 AM Sweet Tea Dorminy
<sweettea-kernel@dorminy.me> wrote:
>
> [Resend as requested as RFC and minus prereq-patch-id junk]
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget. Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.
>
[snip]
> I think the important part is that this improves accuracy; the current
> scheme is difficult to use on many-cored machines. It improves
> performance, but there are tradeoffs; but it tightly bounds the
> inaccuracy so that decisions can actually be reasonably made with the
> resulting numbers.

Even disregarding this specific report, a prior patch submission points at
a result which is so far off that it already constitutes a bug:
https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/

So something definitely needs to be done to improve the accuracy. But that
will always be a tradeoff vs update performance.

This brings me to a question: how often does the watchdog thing look at
the stats?

I wonder if it would make sense to add another file to proc, similar to
"status", but returning *exact* values. In particular, with percpu
counters it would walk all CPUs to generate the answer. Then interested
parties would still get an accurate count and not get in the way, provided
they don't relentlessly do it.
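For what it's worth, here is a minimal sketch of what such an exact read could look like on top of today's percpu_counter-based rss_stat. get_mm_counter_exact() and exact_rss_pages() are hypothetical names; percpu_counter_sum() is the existing precise (all-CPU-walking) summation:

/*
 * Hypothetical helper for an "exact" proc interface: unlike
 * get_mm_counter(), which only reads the approximate central value, this
 * walks every CPU's sub-counter via percpu_counter_sum().  Meant for
 * infrequent readers such as a watchdog.
 */
static inline unsigned long get_mm_counter_exact(struct mm_struct *mm, int member)
{
	s64 val = percpu_counter_sum(&mm->rss_stat[member]);

	return val < 0 ? 0 : val;
}

/* Example: an exact RSS, in pages, as a new proc file could report it. */
static unsigned long exact_rss_pages(struct mm_struct *mm)
{
	return get_mm_counter_exact(mm, MM_ANONPAGES) +
	       get_mm_counter_exact(mm, MM_FILEPAGES) +
	       get_mm_counter_exact(mm, MM_SHMEMPAGES);
}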
On 2025-04-02 20:00, Shakeel Butt wrote:
> On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
[...]
> I am still not buying the 'good performance' point. To me we might need
> to go with the reduced batch size of the existing approach or the
> multi-level approach from Mathieu (I still have to see Mateusz's and
> Kairui's proposals).

Here is an initial userspace prototype of my hierarchical split counters:

https://github.com/compudj/librseq/blob/percpu-counter/include/rseq/percpu-counter.h
https://github.com/compudj/librseq/blob/percpu-counter/src/percpu-counter.c

How to try it out:

* Install liburcu
* Clone & build https://github.com/compudj/librseq branch: percpu-counter
  (note: it's currently a POC, very lightly tested.)

Run tests/percpu_counter_test.tap:

ok 1 - Registered current thread with rseq
Counter init: approx: 0 precise: 0 inaccuracy: ±2048
Counter after sum: approx: 2998016 precise: 2996800 inaccuracy: ±2048
Counter after set=0: approx: 1216 precise: 0 inaccuracy: ±2048
ok 2 - Unregistered current thread with rseq
1..2

It implements the following operations:

Fast paths:
- counter_add
- counter_approx_sum

Function call APIs for:
- counter_add_slowpath (approximation propagation to levels > 0),
- counter_precise_sum (iterate on all per-cpu counters),
- counter_set: set a bias to bring the precise sum to a given target value,
- counter_inaccuracy: returns the maximum inaccuracy of the approximation
  for this counter configuration,
- counter_compare: compare a counter against a value; use the approximation
  if the value is further away than the inaccuracy limits, else use the
  precise sum.

Porting it to the Linux kernel and replacing lib/percpu_counter.c should be
straightforward. AFAIU, the only thing I have not implemented is a
replacement for percpu_counter_limited_add, and I'm not so sure how useful
it is.

The most relevant piece of the algorithm is within counter_add, as follows:

static inline void counter_add(struct percpu_counter *counter, long inc)
{
	unsigned long bit_mask = counter->level0_bit_mask;
	intptr_t orig, res;
	int ret, cpu;

	if (!inc)
		return;

	// This is basically a percpu_add_return() in userspace with rseq.
	do {
		cpu = rseq_cpu_start();
		orig = *rseq_percpu_ptr(counter->level0, cpu);
		ret = rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID,
						rseq_percpu_ptr(counter->level0, cpu),
						orig, orig + inc, cpu);
	} while (ret);
	res = orig + inc;
	counter_dbg_printf("counter_add: inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
			   inc, bit_mask, orig, res);
	if (inc < 0) {
		inc = -(-inc & ~((bit_mask << 1) - 1));
		/* xor bit_mask, same sign: underflow */
		if (((orig & bit_mask) ^ (res & bit_mask)) &&
		    __counter_same_sign(orig, res))
			inc -= bit_mask;
	} else {
		inc &= ~((bit_mask << 1) - 1);
		/* xor bit_mask, same sign: overflow */
		if (((orig & bit_mask) ^ (res & bit_mask)) &&
		    __counter_same_sign(orig, res))
			inc += bit_mask;
	}
	if (inc)
		counter_add_slowpath(counter, inc);
}

void counter_add_slowpath(struct percpu_counter *counter, long inc)
{
	struct percpu_counter_level_item *item = counter->items;
	unsigned int level_items = counter->nr_cpus >> 1;
	unsigned int level, nr_levels = counter->nr_levels;
	long bit_mask = counter->level0_bit_mask;
	int cpu = rseq_current_cpu_raw();

	for (level = 1; level < nr_levels; level++) {
		long orig, res;
		long *count = &item[cpu & (level_items - 1)].count;

		bit_mask <<= 1;
		res = uatomic_add_return(count, inc, CMM_RELAXED);
		orig = res - inc;
		counter_dbg_printf("counter_add_slowpath: level %d, inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
				   level, inc, bit_mask, orig, res);
		if (inc < 0) {
			inc = -(-inc & ~((bit_mask << 1) - 1));
			/* xor bit_mask, same sign: underflow */
			if (((orig & bit_mask) ^ (res & bit_mask)) &&
			    __counter_same_sign(orig, res))
				inc -= bit_mask;
		} else {
			inc &= ~((bit_mask << 1) - 1);
			/* xor bit_mask, same sign: overflow */
			if (((orig & bit_mask) ^ (res & bit_mask)) &&
			    __counter_same_sign(orig, res))
				inc += bit_mask;
		}
		item += level_items;
		level_items >>= 1;
		if (!inc)
			return;
	}
	counter_dbg_printf("counter_add_slowpath: last level add %ld\n", inc);
	uatomic_add(&item->count, inc, CMM_RELAXED);
}

Feedback is welcome!

Thanks,

Mathieu
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fdd7cb8609a5..b27a79a36a57 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2694,30 +2694,58 @@ static inline bool get_user_page_fast_only(unsigned long addr,
 /*
  * per-process(per-mm_struct) statistics.
  */
+static inline long get_mm_counter_slow(struct mm_struct *mm,
+				       int member)
+{
+	long ret = atomic64_read(&mm->rss_stat[member].global);
+
+	for (int i = 0; i < _MM_NUMA_COUNTERS; i++)
+		ret += atomic64_read(&mm->rss_stat[member].numa_counters[i]);
+	return ret;
+}
+
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	s64 val = atomic64_read(&mm->rss_stat[member].global);
+
+	if (val < 0)
+		return 0;
+	return val;
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
+static inline void __mm_update_numa_counter(struct mm_struct *mm, int member,
+					    long value)
+{
+	size_t index = numa_node_id() % _MM_NUMA_COUNTERS;
+	struct mm_pernuma_counter *rss_stat = &mm->rss_stat[member];
+	atomic64_t *numa_counter = &rss_stat->numa_counters[index];
+	s64 val = atomic64_add_return(value, numa_counter);
+
+	if (abs(val) >= _MM_NUMA_COUNTERS_BATCH) {
+		atomic64_add(val, &rss_stat->global);
+		atomic64_add(-val, numa_counter);
+	}
+}
+
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	__mm_update_numa_counter(mm, member, value);
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
+	__mm_update_numa_counter(mm, member, 1);
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	__mm_update_numa_counter(mm, member, -1);
 	mm_trace_rss_stat(mm, member);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index aac7c87b04e1..a305fbebc8f6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -889,6 +889,13 @@ struct mm_cid {
 };
 #endif
 
+#define _MM_NUMA_COUNTERS MIN(MAX_NUMNODES, 32)
+#define _MM_NUMA_COUNTERS_BATCH max(32, (num_possible_cpus() >> NODES_SHIFT) * 2)
+struct mm_pernuma_counter {
+	atomic64_t global;
+	atomic64_t numa_counters[_MM_NUMA_COUNTERS];
+};
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -1055,7 +1062,7 @@ struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct mm_pernuma_counter rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index f74925a6cf69..72b8278fdb40 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -477,7 +477,7 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+		__entry->size = (atomic64_read(&mm->rss_stat[member].global)
 				<< PAGE_SHIFT);
 	),
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 83cb82643817..3de71d335022 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -838,7 +838,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = get_mm_counter_slow(mm, i);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -940,7 +940,6 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 	free_mm(mm);
 }
 
@@ -1327,16 +1326,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
-
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
-	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
 fail_nocontext:
[Resend as requested as RFC and minus prereq-patch-id junk]

Recently, several internal services had an RSS usage regression as part of a
kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
read RSS statistics in a backup watchdog process to monitor and decide if
they'd overrun their memory budget. Now, however, a representative service
with five threads, expected to use about a hundred MB of memory, on a 250-cpu
machine had memory usage tens of megabytes different from the expected amount
-- this constituted a significant percentage of inaccuracy, causing the
watchdog to act.

This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter") [1]. Previously, the memory error was bounded by
64*nr_threads pages, a very livable megabyte. Now, however, as a result of
scheduler decisions moving the threads around the CPUs, the memory error could
be as large as a gigabyte.

This is a really tremendous inaccuracy for any few-threaded program on a
large machine and impedes monitoring significantly. These stat counters are
also used to make OOM killing decisions, so this additional inaccuracy could
make a big difference in OOM situations -- either resulting in the wrong
process being killed, or in less memory being returned from an OOM-kill than
expected.

Finally, while the change to percpu_counter does significantly improve the
accuracy over the previous per-thread error for many-threaded services, it does
also have performance implications - up to 12% slower for short-lived processes
and 9% increased system time in make test workloads [2].

A previous attempt to address this regression by Peng Zhang [3] used a hybrid
approach with delayed allocation of percpu memory for rss_stats, showing
promising improvements of 2-4% for process operations and 6.7% for page
faults.

This RFC takes a different direction by replacing percpu_counters with a
more efficient set of per-NUMA-node atomics. The approach:

- Uses one atomic per node up to a bound to reduce cross-node updates.
- Keeps a similar batching mechanism, with a smaller batch size.
- Eliminates the use of a spin lock during batch updates, bounding stat
  update latency.
- Reduces percpu memory usage and thus thread startup time.

Most importantly, this bounds the total error to 32 times the number of NUMA
nodes, significantly smaller than previous error bounds.

On a 112-core machine, lmbench showed comparable results before and after this
patch. However, on a 224-core machine, performance improvements were
significant over percpu_counter:

- Pagefault latency improved by 8.91%
- Process fork latency improved by 6.27%
- Process fork/execve latency improved by 6.06%
- Process fork/exit latency improved by 6.58%

will-it-scale also showed significant improvements on these machines.
[1] https://lore.kernel.org/all/20221024052841.3291983-1-shakeelb@google.com/
[2] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3/
[3] https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

---

This is mostly a resend of an earlier patch, where I made an utter hash
of specifying a base commit (and forgot to update my commit text to not
call it an RFC, and forgot to update my email to the one I use for
upstream work...). This is based on akpm/mm-unstable as of today.

v1 can be found at
https://lore.kernel.org/lkml/20250325221550.396212-1-sweettea-kernel@dorminy.me/

Some interesting ideas came out of that discussion: Mathieu Desnoyers
has a design doc for an improved percpu counter, multi-level, with
constant drift, at
https://lore.kernel.org/lkml/a89cb4d9-088e-4ed6-afde-f1b097de8db9@efficios.com/
and would like performance comparisons against just reducing the batch
size in the existing code; and Mateusz Guzik would also like a more
general solution and is also working to fix the performance issues by
caching mm state. Finally, Lorenzo Stoakes nacks, as it's too speculative
and needs more discussion.

I think the important part is that this improves accuracy; the current
scheme is difficult to use on many-cored machines. It improves
performance, but there are tradeoffs; and it tightly bounds the
inaccuracy so that decisions can actually be reasonably made with the
resulting numbers.

This patch assumes that intra-NUMA-node atomic updates are very cheap and
that assigning CPUs to an atomic counter by numa_node_id() % 16 is suitably
balanced. However, if each atomic were shared by only, say, eight CPUs from
the same NUMA node, this would further reduce atomic contention at the cost
of more memory and more complicated assignment of CPU to atomic index. I
don't think that additional complication is worth it given that this scheme
seems to get good performance, but it might be. I do need to actually test
the impact on a many-cores-one-NUMA-node machine, and I look forward to
testing out Mathieu's hierarchical percpu counter with bounded error.
---
 include/linux/mm.h          | 36 ++++++++++++++++++++++++++++++++----
 include/linux/mm_types.h    |  9 ++++++++-
 include/trace/events/kmem.h |  2 +-
 kernel/fork.c               |  9 +--------
 4 files changed, 42 insertions(+), 14 deletions(-)

base-commit: e026356e4192ff5a52c1d535e6b9e3fa50def2c4