Message ID | 20250325221550.396212-1-sweettea-kernel@dorminy.me
---|---
State | New |
Series | mm: use per-numa-node atomics instead of percpu_counters
On 2025-03-25 18:15, Sweet Tea Dorminy wrote:
> From: Sweet Tea Dorminy <sweettea@google.com>
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget. Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.
>

I suspect the culprit sits here:

int percpu_counter_batch __read_mostly = 32;
EXPORT_SYMBOL(percpu_counter_batch);

static int compute_batch_value(unsigned int cpu)
{
	int nr = num_online_cpus();

	percpu_counter_batch = max(32, nr*2);
	return 0;
}

So correct me if I'm wrong, but in this case the worst-case inaccuracy
for a 256-CPU machine would be "+/- percpu_counter_batch" within each
percpu counter, thus globally:

+/- (256 * 2) * 256, or 131072 pages, meaning an inaccuracy of +/- 512MB
with 4kB pages. This is quite significant.

So I understand that the batch size is scaled up as the number of CPUs
increases to minimize contention on the percpu_counter lock.
Unfortunately, as the number of CPUs increases, the inaccuracy increases
with the square of the number of CPUs.

Have you tried decreasing this percpu_counter_batch value on larger
machines to see if it helps?

Thanks,

Mathieu

[...]
>
> base-commit: b0cb56cbbdb4754918c28d6d7c294d56e28a3dd5
> prerequisite-patch-id: b7b47d1b9aa3fd887dd718ab276581f120e516e6
[...]

[ Cutting the gazillion prerequisite-patch-id lines; please review your
  use of tooling when preparing/sending patches. ]
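[ For illustration only: a small standalone C program that reproduces the
  worst-case drift arithmetic above for a few CPU counts. The batch formula
  is copied from the quoted compute_batch_value(); the 4 kB page size is an
  assumption. ]

#include <stdio.h>

/*
 * Each per-CPU counter can accumulate up to roughly one batch worth of
 * pages before it is folded into the shared count, so the global value
 * can drift by about batch * nr_cpus pages.
 */
int main(void)
{
	for (int nr_cpus = 32; nr_cpus <= 256; nr_cpus *= 2) {
		int batch = nr_cpus * 2 > 32 ? nr_cpus * 2 : 32; /* max(32, nr*2) */
		long pages = (long)batch * nr_cpus;

		printf("%3d CPUs: batch %3d, worst-case drift ~%6ld pages (~%3ld MiB)\n",
		       nr_cpus, batch, pages, pages * 4096 / (1024 * 1024));
	}
	return 0;
}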
On Wed, Mar 26, 2025 at 03:56:15PM -0400, Mathieu Desnoyers wrote:
> On 2025-03-25 18:15, Sweet Tea Dorminy wrote:
> > From: Sweet Tea Dorminy <sweettea@google.com>
[...]
> So I understand that the batch size is scaled up as the number of CPUs
> increases to minimize contention on the percpu_counter lock.
> Unfortunately, as the number of CPUs increases, the inaccuracy increases
> with the square of the number of CPUs.
>
> Have you tried decreasing this percpu_counter_batch value on larger
> machines to see if it helps?
>

Per-CPU rss counters replaced a per-thread variant, which for sufficiently
threaded processes had a significantly bigger error. See f1a7941243c102a4
("mm: convert mm's rss stats into percpu_counter").

The use in rss aside, the current implementation of per-cpu counters is met
with two seemingly conflicting requirements: on one hand, synchronisation
with other CPUs needs to be rare to maintain scalability; on the other, the
more CPUs there are to worry about, the bigger the error vs the central
value and the more often you should synchronize it.

So I think something needs to be done about the mechanism in general. While
I don't have a thought-out idea, off hand I suspect turning these into a
hierarchical state should help solve it. Instead of *one* central value
everyone writes to in order to offload their batch, there could be a level
or two of intermediary values -- think of a tree you go up as needed. Then,
for example, the per-cpu batch could be much smaller, as the penalty for
rolling it up one level higher would be significantly lower than going
after the main counter.

I have no time to work on something like this though. Maybe someone has a
better idea.
On Tue, Mar 25, 2025 at 06:15:49PM -0400, Sweet Tea Dorminy wrote:
> From: Sweet Tea Dorminy <sweettea@google.com>
>
> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.
>
> This is a really tremendous inaccuracy for any few-threaded program on a
> large machine and impedes monitoring significantly. These stat counters are
> also used to make OOM killing decisions, so this additional inaccuracy could
> make a big difference in OOM situations -- either resulting in the wrong
> process being killed, or in less memory being returned from an OOM-kill than
> expected.
>
> Finally, while the change to percpu_counter does significantly improve the
> accuracy over the previous per-thread error for many-threaded services, it does
> also have performance implications - up to 12% slower for short-lived processes
> and 9% increased system time in make test workloads [2].
>
> A previous attempt to address this regression by Peng Zhang [3] used a hybrid
> approach with delayed allocation of percpu memory for rss_stats, showing
> promising improvements of 2-4% for process operations and 6.7% for page
> faults.
>
> This RFC takes a different direction by replacing percpu_counters with a
> more efficient set of per-NUMA-node atomics. The approach:
>
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
>   update latency.
> - Reduces percpu memory usage and thus thread startup time.
>
> Most importantly, this bounds the total error to 32 times the number of NUMA
> nodes, significantly smaller than previous error bounds.
>
> On a 112-core machine, lmbench showed comparable results before and after this
> patch. However, on a 224 core machine, performance improvements were
> significant over percpu_counter:
> - Pagefault latency improved by 8.91%
> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
>
> will-it-scale also showed significant improvements on these machines.
>

The problem on fork/exec/exit stems from back-to-back trips to the per-cpu
allocator every time an mm is allocated/freed (which happens for each of
these syscalls) -- they end up serializing on the same global spinlock. On
the alloc side this is mm_alloc_cid() followed by percpu_counter_init_many().
Even if you eliminate the counters for rss, you are still paying for CID.
While this scales better than the stock kernel, it still leaves perf on the
table.

Per our discussion on IRC there is WIP to eliminate both cases by caching
the state in the mm. This depends on adding a dtor for SLUB to undo the
work done in the ctor. Harry did the work on that front; it is not
submitted to -next yet, though.

There is also a highly inefficient sanity-check loop in check_mm(). Instead
of walking the entire list 4 times with interrupt toggling in between, it
can do the walk once.

So that's for the fork/execve/exit triplet.

As for the page fault latency, your patch adds atomics to the fast path.
Even absent any competition for cachelines with other CPUs, this will be
slower to execute than the current primitive.
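[ A rough sketch of the check_mm() point above -- not a tested patch. It
  assumes the stock percpu_counter-based rss_stat layout and reaches into
  percpu_counter's count/counters fields directly; the unlocked reads rely
  on the mm having no remaining users by the time check_mm() runs. ]

/*
 * Sum all NR_MM_COUNTERS in a single walk over the possible CPUs instead
 * of calling percpu_counter_sum() once per counter (which takes the
 * counter lock and disables interrupts for every walk).
 */
static void mm_sum_rss_counters(struct mm_struct *mm, s64 *sums)
{
	int cpu, i;

	for (i = 0; i < NR_MM_COUNTERS; i++)
		sums[i] = mm->rss_stat[i].count;

	for_each_possible_cpu(cpu) {
		for (i = 0; i < NR_MM_COUNTERS; i++)
			sums[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
	}
}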
I suspect you are observing a speedup with your change because you end up
landing in the slowpath a lot, and that sucker is globally serialized on a
spinlock -- this has to hurt.

Per my other message in the thread, and a later IRC discussion, this is
fixable by adding intermediate counters, in the same spirit you did here.
I'll note though that NUMA nodes can be incredibly core-y and that
granularity may be way too coarse.

That aside, there are globally-locked lists that mms are cycling in and out
of, which can also get the "stay there while cached" treatment.

All in all I claim that:

1. fork/execve/exit tests will do better than they are doing with your
   patch if going to the percpu allocator gets eliminated altogether (in
   your patch it is not, because of mm_alloc_cid() and the freeing
   counterpart), along with unscrewing the loop in check_mm().

2. fault handling will be faster than it is with your patch *if* something
   like per-NUMA state gets added for the slowpath -- the stock fast path
   is faster than yours, the stock slowpath is way slower. You can get the
   best of both worlds on this one. Hell, it may be your patch as is can be
   easily repurposed to decentralize the main percpu counter? I mean
   perhaps there is no need for any fancy hierarchical structure.

I can commit to providing a viable patch for sorting out the
fork/execve/exit side, but it is going to take about a week. You do have a
PoC in the meantime (too ugly to share publicly :>).

So that's my take on it. Note I'm not a maintainer of any of this, but I
did some work on the thing in the past.
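[ A sketch of the "best of both worlds" direction from point 2 above: keep
  a per-CPU fast path, but have the slow path fold batches into
  per-NUMA-node atomics instead of a single spinlock-protected central
  count. Type and function names are hypothetical; this is not code from
  any posted patch. ]

struct node_split_counter {
	atomic64_t node_count[MAX_NUMNODES];	/* slow-path targets */
	s32 __percpu *cpu_count;		/* fast-path deltas */
};

static void nsc_add(struct node_split_counter *c, s32 amount, s32 batch)
{
	s32 *pcount;

	preempt_disable();
	pcount = this_cpu_ptr(c->cpu_count);
	*pcount += amount;
	if (abs(*pcount) >= batch) {
		/* Contend only with CPUs on the same NUMA node. */
		atomic64_add(*pcount, &c->node_count[numa_node_id()]);
		*pcount = 0;
	}
	preempt_enable();
}

static s64 nsc_read(struct node_split_counter *c)
{
	s64 sum = 0;
	int node;

	/* Approximate read: per-CPU deltas below the batch are not included. */
	for_each_node(node)
		sum += atomic64_read(&c->node_count[node]);
	return sum;
}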
On 2025-03-26 19:36, Mateusz Guzik wrote:
[...]
> Hell, it may be your patch as is can be easily repurposed to
> decentralize the main percpu counter? I mean perhaps there is no need
> for any fancy hierarchical structure.

Here is an initial attempt at a design document for the hierarchical
percpu counters. Feedback welcome!

Split Counters With Binary Tree Approximation Propagation
=========================================================

Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 27, 2025

* Propagation diagram when reaching batch size thresholds (± batch size):

Example diagram for 8 CPUs: log2(8) = 3 levels

At each level, each pair propagates its values to the next level when
reaching the batch size thresholds.

Counters at levels 0, 1, 2 can be kept on a single byte (±128 range).
Counter at level 3 can be kept on a 32/64-bit counter.

Level 0:  0    1    2    3    4    5    6    7
          |   /     |   /     |   /     |   /
          |  /      |  /      |  /      |  /
          | /       | /       | /       | /
Level 1:  0         1         2         3
          |        /          |        /
          |      /            |      /
          |    /              |    /
Level 2:  0                   1
          |              /
          |        /
          |  /
Level 3:  0

* Inaccuracy:

BATCH(level N): level N batch size.

Example for BATCH(level 0) = 4:

BATCH(level 0) = 4
BATCH(level 1) = 8
BATCH(level 2) = 16
BATCH(level N) = BATCH(level 0) * 2^N

             per-counter     global
             inaccuracy      inaccuracy
Level 0:     ±  3            ± 24  (8 * 3)
Level 1:     ±  7            ± 28  (4 * 7)
Level 2:     ± 15            ± 30  (2 * 15)
Total:       ------          ± 82

(log2(nr_cpus) * BATCH(level 0) * nr_cpus
 - Sum[0 .. log2(nr_cpus) - 1](nr_cpus / 2^n))

Thanks,

Mathieu
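[ A minimal sketch of the propagation rule described above, assuming a
  power-of-two nr_cpus and plain atomic_t counters at every level (the
  byte-sized per-level counters from the document are simplified away).
  Names are hypothetical; this is not Mathieu's implementation. ]

#define BATCH_LEVEL0	4

struct split_counter {
	unsigned int nr_levels;		/* log2(nr_cpus) */
	atomic_t **level;		/* level[l] has (nr_cpus >> l) entries */
	atomic64_t top;			/* level log2(nr_cpus) */
};

static void split_counter_add(struct split_counter *c, unsigned int cpu, int val)
{
	unsigned int l, idx = cpu;

	for (l = 0; l < c->nr_levels; l++) {
		int batch = BATCH_LEVEL0 << l;	/* BATCH(level N) = BATCH(0) * 2^N */
		int cur = atomic_add_return(val, &c->level[l][idx]);

		if (abs(cur) < batch)
			return;
		/* Batch reached: hand the accumulated value to the next level up. */
		atomic_sub(cur, &c->level[l][idx]);
		val = cur;
		idx >>= 1;			/* each pair shares one parent */
	}
	atomic64_add(val, &c->top);
}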
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b7f13f087954..b0f93afb9e86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2686,30 +2686,58 @@ static inline bool get_user_page_fast_only(unsigned long addr,
 /*
  * per-process(per-mm_struct) statistics.
  */
+static inline long get_mm_counter_slow(struct mm_struct *mm,
+				       int member)
+{
+	long ret = atomic64_read(&mm->rss_stat[member].global);
+
+	for (int i = 0; i < _MM_NUMA_COUNTERS; i++)
+		ret += atomic64_read(&mm->rss_stat[member].numa_counters[i]);
+	return ret;
+}
+
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	s64 val = atomic64_read(&mm->rss_stat[member].global);
+
+	if (val < 0)
+		return 0;
+	return val;
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
+static inline void __mm_update_numa_counter(struct mm_struct *mm, int member,
+					    long value)
+{
+	size_t index = numa_node_id() % _MM_NUMA_COUNTERS;
+	struct mm_pernuma_counter *rss_stat = &mm->rss_stat[member];
+	atomic64_t *numa_counter = &rss_stat->numa_counters[index];
+	s64 val = atomic64_add_return(value, numa_counter);
+
+	if (abs(val) >= _MM_NUMA_COUNTERS_BATCH) {
+		atomic64_add(val, &rss_stat->global);
+		atomic64_add(-val, numa_counter);
+	}
+}
+
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	__mm_update_numa_counter(mm, member, value);
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
+	__mm_update_numa_counter(mm, member, 1);
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	__mm_update_numa_counter(mm, member, -1);
 	mm_trace_rss_stat(mm, member);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..000e12382e2e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -893,6 +893,13 @@ struct mm_cid {
 };
 #endif
 
+#define _MM_NUMA_COUNTERS MIN(MAX_NUMNODES, 32)
+#define _MM_NUMA_COUNTERS_BATCH max(32, (num_possible_cpus() >> NODES_SHIFT) * 2)
+struct mm_pernuma_counter {
+	atomic64_t global;
+	atomic64_t numa_counters[_MM_NUMA_COUNTERS];
+};
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -1059,7 +1066,7 @@ struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct mm_pernuma_counter rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index f74925a6cf69..72b8278fdb40 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -477,7 +477,7 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+		__entry->size = (atomic64_read(&mm->rss_stat[member].global)
 				<< PAGE_SHIFT);
 	),
diff --git a/kernel/fork.c b/kernel/fork.c
index 92b39a78d7fb..8390beac5da5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -838,7 +838,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = get_mm_counter_slow(mm, i);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -940,7 +940,6 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 
 	free_mm(mm);
 }
@@ -1327,16 +1326,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
-
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
-	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
 fail_nocontext: