Message ID | 20240418142008.2775308-3-zhangpeng362@huawei.com
State      | New
Series     | mm: convert mm's rss stats to use atomic mode
On 2024/4/18 22:20, Peng Zhang wrote:
> From: ZhangPeng <zhangpeng362@huawei.com>
>
> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter"), the rss_stats have been converted into percpu_counter,
> which changes the error margin from (nr_threads * 64) to approximately
> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> performance regression on fork/exec/shell. Even after commit 14ef95be6f55
> ("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
> the performance of fork/exec/shell is still poor compared to previous
> kernel versions.
>
> To mitigate the performance regression, we delay the allocation of percpu
> memory for rss_stats. Therefore, we convert mm's rss stats to use
> percpu_counter atomic mode. For single-thread processes, rss_stat is in
> atomic mode, which reduces the memory consumption and performance
> regression caused by using percpu. For multiple-thread processes,
> rss_stat is switched to percpu mode to reduce the error margin.
> We convert rss_stats from atomic mode to percpu mode only when the
> second thread is created.

Hi, Zhang Peng,

We also hit this regression in lmbench recently. I have not tested your
patch, but it looks like it will resolve much of it.

I see this patch does not fix the regression for multi-threaded
processes; is that because rss_stat is switched to percpu mode there?
(If I'm wrong, please correct me.) It also seems percpu_counter has a
bad effect in exit_mmap().

If so, I'm wondering whether we can further improve the exit_mmap() path
in the multi-threaded scenario, e.g. by determining which CPUs the
process has actually run on (mm_cpumask()? I'm not sure).

> After the lmbench tests, we get a 2% ~ 4% performance improvement for
> lmbench fork_proc/exec_proc/shell_proc and a 6.7% performance
> improvement for lmbench page_fault (before batch mode[1]).
>
> The test results are as follows:
>
>               base       base+revert       base+this patch
>
>   fork_proc   416.3ms    400.0ms  (3.9%)   398.6ms  (4.2%)
>   exec_proc   2095.9ms   2061.1ms (1.7%)   2047.7ms (2.3%)
>   shell_proc  3028.2ms   2954.7ms (2.4%)   2961.2ms (2.2%)
>   page_fault  0.3603ms   0.3358ms (6.8%)   0.3361ms (6.7%)

I think the regression becomes more obvious with more cores. How many
cores does your test machine have?
Thanks,
-wrw

> [1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
>  include/trace/events/kmem.h |  4 +--
>  kernel/fork.c               | 18 +++++++------
>  3 files changed, 56 insertions(+), 16 deletions(-)
>
> [full patch trimmed; see the diff at the end of this page]
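A minimal sketch of the mm_cpumask() idea floated above. This is not an
existing kernel API; percpu_counter_sum_mask() is a hypothetical helper,
and it assumes that CPUs outside mm_cpumask(mm) hold no pending per-CPU
deltas for this mm, which is exactly the part Rongwei is unsure about
and which would need auditing per architecture:

static s64 percpu_counter_sum_mask(struct percpu_counter *fbc,
                                   const struct cpumask *mask)
{
        unsigned long flags;
        s64 ret;
        int cpu;

        raw_spin_lock_irqsave(&fbc->lock, flags);
        ret = fbc->count;
        /* Only visit CPUs the mm is known to have run on. */
        for_each_cpu(cpu, mask)
                ret += *per_cpu_ptr(fbc->counters, cpu);
        raw_spin_unlock_irqrestore(&fbc->lock, flags);

        return ret;
}

With such a helper, the exit path could pass mm_cpumask(mm) instead of
walking every online CPU for each of the four rss counters.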
On 2024/4/19 10:30, Rongwei Wang wrote:
> On 2024/4/18 22:20, Peng Zhang wrote:
>> From: ZhangPeng <zhangpeng362@huawei.com>
>>
>> [...]
>>
>> To mitigate the performance regression, we delay the allocation of percpu
>> memory for rss_stats. Therefore, we convert mm's rss stats to use
>> percpu_counter atomic mode. For single-thread processes, rss_stat is in
>> atomic mode, which reduces the memory consumption and performance
>> regression caused by using percpu. For multiple-thread processes,
>> rss_stat is switched to percpu mode to reduce the error margin.
>> We convert rss_stats from atomic mode to percpu mode only when the
>> second thread is created.
>
> Hi, Zhang Peng,
>
> We also hit this regression in lmbench recently. I have not tested your
> patch, but it looks like it will resolve much of it.
>
> I see this patch does not fix the regression for multi-threaded
> processes; is that because rss_stat is switched to percpu mode there?
> (If I'm wrong, please correct me.) It also seems percpu_counter has a
> bad effect in exit_mmap().
>
> If so, I'm wondering whether we can further improve the exit_mmap() path
> in the multi-threaded scenario, e.g. by determining which CPUs the
> process has actually run on (mm_cpumask()? I'm not sure).

Hi, Rongwei,

Yes, this patch only fixes the regression for single-threaded processes.
How much does percpu_counter hurt in exit_mmap()? IMHO, the mm counter
additions there are already batched; maybe I am missing something?

>> The test results are as follows:
>>
>>               base       base+revert       base+this patch
>>
>>   fork_proc   416.3ms    400.0ms  (3.9%)   398.6ms  (4.2%)
>>   exec_proc   2095.9ms   2061.1ms (1.7%)   2047.7ms (2.3%)
>>   shell_proc  3028.2ms   2954.7ms (2.4%)   2961.2ms (2.2%)
>>   page_fault  0.3603ms   0.3358ms (6.8%)   0.3361ms (6.7%)
>
> I think the regression becomes more obvious with more cores. How many
> cores does your test machine have?

Maybe the core count is not a factor in the lmbench results here.
Both of my test machines have 96 cores.

> [...]
On 2024/4/19 11:32, zhangpeng (AS) wrote:
> On 2024/4/19 10:30, Rongwei Wang wrote:
>> [...]
>>
>> If so, I'm wondering whether we can further improve the exit_mmap() path
>> in the multi-threaded scenario, e.g. by determining which CPUs the
>> process has actually run on (mm_cpumask()? I'm not sure).
>
> Hi, Rongwei,
>
> Yes, this patch only fixes the regression for single-threaded processes.
> How much does percpu_counter hurt in exit_mmap()? IMHO, the mm counter
> additions there are already batched; maybe I am missing something?

Actually, I'm not sure; I just saw a small percpu free hotspot in the
exit_mmap() path when comparing 4 cores vs 32 cores.

I can run more tests next.

> Maybe the core count is not a factor in the lmbench results here.
> Both of my test machines have 96 cores.
>
> [...]
On 2024/4/20 11:13, Rongwei Wang wrote:
> On 2024/4/19 11:32, zhangpeng (AS) wrote:
>> [...]
>>
>> Yes, this patch only fixes the regression for single-threaded processes.
>> How much does percpu_counter hurt in exit_mmap()? IMHO, the mm counter
>> additions there are already batched; maybe I am missing something?
>
> Actually, I'm not sure; I just saw a small percpu free hotspot in the
> exit_mmap() path when comparing 4 cores vs 32 cores.
>
> I can run more tests next.

Thanks, it would be better if there were test data.

> [...]
On Fri, Apr 19, 2024 at 11:32 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
> On 2024/4/19 10:30, Rongwei Wang wrote:
> > On 2024/4/18 22:20, Peng Zhang wrote:
> >> [...]
> >>
> >> To mitigate the performance regression, we delay the allocation of percpu
> >> memory for rss_stats. Therefore, we convert mm's rss stats to use
> >> percpu_counter atomic mode. For single-thread processes, rss_stat is in
> >> atomic mode, which reduces the memory consumption and performance
> >> regression caused by using percpu. For multiple-thread processes,
> >> rss_stat is switched to percpu mode to reduce the error margin.
> >> We convert rss_stats from atomic mode to percpu mode only when the
> >> second thread is created.
> >
> > [...]
>
> Hi, Rongwei,
>
> Yes, this patch only fixes the regression for single-threaded processes.
> How much does percpu_counter hurt in exit_mmap()? IMHO, the mm counter
> additions there are already batched; maybe I am missing something?
>

Hi, Peng Zhang, Rongwei, and all:

I have a patch series that predates commit f1a7941243c1 ("mm: convert
mm's rss stats into percpu_counter"):

https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/

Instead of a per-mm, per-CPU cache, it uses only one global per-CPU
cache and flushes it on schedule. Or, if the arch supports it, it
flushes and fetches using the mm bitmap as an optimization (like TLB
shootdown).

Unfortunately it didn't get much attention and I moved on to other
things. I also noticed the fork regression, so I did a local rebase of
my previous series on top of a revert of f1a7941243c1. The result looks
good: on my 32-core VM I see a similar improvement to the one you
posted (the alloc/free on fork/exit is gone), a minor improvement in
database tests, and slightly lower memory usage too (no more per-mm
cache), and I think the error margin in my patch should be close to
zero.

I hope I can get some attention here for my idea...
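A rough sketch of the scheme described above, reconstructed from the
description rather than taken from the linked series. It assumes the
pre-f1a7941243c1 layout where mm->rss_stat.count[] is an array of
atomic_long_t, and it glosses over mm lifetime handling and the
mm-bitmap fast path:

struct pcp_rss_cache {
        struct mm_struct *mm;           /* mm the cached deltas belong to */
        long count[NR_MM_COUNTERS];     /* pending, unflushed deltas */
};
static DEFINE_PER_CPU(struct pcp_rss_cache, pcp_rss_cache);

/* Fold this CPU's pending deltas into the owning mm. */
static void pcp_rss_flush(struct pcp_rss_cache *rc)
{
        int i;

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                if (rc->count[i])
                        atomic_long_add(rc->count[i],
                                        &rc->mm->rss_stat.count[i]);
                rc->count[i] = 0;
        }
        rc->mm = NULL;
}

/* Update path, called with preemption disabled (e.g. page fault). */
static void pcp_rss_add(struct mm_struct *mm, int member, long value)
{
        struct pcp_rss_cache *rc = this_cpu_ptr(&pcp_rss_cache);

        if (rc->mm != mm) {
                if (rc->mm)
                        pcp_rss_flush(rc);      /* also done on mm switch */
                rc->mm = mm;
        }
        rc->count[member] += value;
}

A reader would sum mm->rss_stat.count[] plus whatever each CPU's cache
still holds for that mm, which is what keeps the error margin close to
zero.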
On Thu, May 16, 2024 at 07:50:52PM +0800, Kairui Song wrote:
> > > On 2024/4/18 22:20, Peng Zhang wrote:
> > >> [...]
>
> I have a patch series that predates commit f1a7941243c1 ("mm: convert
> mm's rss stats into percpu_counter"):
>
> https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/
>
> Instead of a per-mm, per-CPU cache, it uses only one global per-CPU
> cache and flushes it on schedule. Or, if the arch supports it, it
> flushes and fetches using the mm bitmap as an optimization (like TLB
> shootdown).
>

I just spotted this thread.

I have a rather long rant to write about the entire ordeal, but don't
have the time at the moment. I do have time to make some remarks though.

Rolling with a centralized counter and only distributing it per-CPU upon
creation of a thread is something which was discussed last time and
which I was considering doing. Then life got in the way, and in the
meantime I concluded it is a questionable idea anyway.

The state prior to the counters moving to per-CPU was not that great to
begin with, with quite a few serialization points. As far as allocating
stuff goes, one example is mm_alloc_cid(), with the following:
        mm->pcpu_cid = alloc_percpu(struct mm_cid);

Converting the code to avoid per-CPU rss counters in the common case, or
the above patchset, only damage-controls the state back to what it was;
neither does anything to push things further.

Another note is that, unfortunately, userspace is increasingly
multithreaded for no good reason; see the Rust ecosystem as an example.

All that to say: the multithreaded case is what has to get faster, as a
side effect possibly obsoleting both approaches proposed above. I
concede that if nobody is willing to commit to doing that work in the
foreseeable future, then indeed a damage-controlling solution should
land.

On that note, in check_mm() there is this loop:
        for (i = 0; i < NR_MM_COUNTERS; i++) {
                long x = percpu_counter_sum(&mm->rss_stat[i]);

This avoidably walks all CPUs 4 times, with a preemption and lock trip
for each round. Instead, one can observe that all modifications are
supposed to have already stopped and that the counters are allocated in
a batch. A routine, say percpu_counter_sum_many_unsafe, could do one
iteration without any locking or interrupt games and return an array.
This should be markedly faster, and I perhaps will hack it up.
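A sketch of what such a routine could look like, under the stated
assumption that all updaters have already stopped (hence "unsafe": no
fbc->lock, no IRQ disabling); how this interacts with CPU hotplug is
left aside here:

static void percpu_counter_sum_many_unsafe(struct percpu_counter *fbc,
                                           s64 *results, u32 nr_counters)
{
        int cpu;
        u32 i;

        for (i = 0; i < nr_counters; i++)
                results[i] = fbc[i].count;

        /* One pass over the CPUs instead of one pass per counter. */
        for_each_possible_cpu(cpu) {
                for (i = 0; i < nr_counters; i++)
                        results[i] += *per_cpu_ptr(fbc[i].counters, cpu);
        }
}

check_mm() would call this once and inspect the four results, instead of
taking four locked walks over all CPUs.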
Part of The Real Solution(tm) would be to make counter allocations scale
(including mcid, not just rss) or to dodge them altogether (while
maintaining the per-CPU distribution; see below for one idea), but that
boils down to balancing scalability against total memory usage. It is
trivial to slap together a per-CPU cache of these allocations and have
the problem go away for benchmarking purposes, while probably being too
memory-hungry for actual usage.

I was pondering an allocator with caches per some number of cores (say 4
or 8). Microbenchmarks aside, I suspect real workloads would not suffer
from contention at this kind of granularity, and it would trivially
reduce memory usage compared to per-CPU caching. I suspect things like
mm_struct, task_struct, task stacks and similar would be fine with it.

Suppose mm_struct is allocated from an allocator that is more coarse
grained than per-CPU. The total number of cached objects would be lower
than it is now. That would also mean these allocated but currently
unused mms could hold on to other stuff, for example the per-CPU rss and
mcid counters. Then, should someone fork or exit, alloc/free_percpu
would be avoided in most cases. This would scale better and be faster
single-threaded than the current state.

(Believe it or not, this is not the actual long rant I have in mind.)

I can't commit to working on the Real Solution though.

In the meantime I can submit percpu_counter_sum_many_unsafe as described
above, if Dennis likes the idea.
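A toy illustration of the "cache per group of cores" idea above, purely
hypothetical (no such allocator exists in the kernel): a freed mm parked
in a small per-group pool keeps its per-CPU rss/mcid allocations alive,
so a later fork on a nearby core can reuse them without alloc_percpu():

#define MM_CACHE_CORES_PER_GROUP        8
#define MM_CACHE_DEPTH                  4

struct mm_group_cache {
        spinlock_t lock;
        struct mm_struct *mm[MM_CACHE_DEPTH];   /* keep pcpu counters alive */
        int nr;
} ____cacheline_aligned_in_smp;

static struct mm_group_cache *mm_group_caches;  /* one per core group */

static struct mm_struct *mm_cache_get(void)
{
        int grp = raw_smp_processor_id() / MM_CACHE_CORES_PER_GROUP;
        struct mm_group_cache *gc = &mm_group_caches[grp];
        struct mm_struct *mm = NULL;

        spin_lock(&gc->lock);
        if (gc->nr)
                mm = gc->mm[--gc->nr];
        spin_unlock(&gc->lock);

        /* Caller falls back to a fresh allocation when NULL. */
        return mm;
}

The point is only the granularity: contention stays within a group of
cores while far fewer objects are cached than a per-CPU pool would hold.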
On Thu, May 16, 2024 at 11:14 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 07:50:52PM +0800, Kairui Song wrote:
> > [...]
> >
> > Instead of a per-mm, per-CPU cache, it uses only one global per-CPU
> > cache and flushes it on schedule. Or, if the arch supports it, it
> > flushes and fetches using the mm bitmap as an optimization (like TLB
> > shootdown).
> >
>
> I just spotted this thread.
>
> I have a rather long rant to write about the entire ordeal, but don't
> have the time at the moment. I do have time to make some remarks though.
>
> Rolling with a centralized counter and only distributing it per-CPU upon
> creation of a thread is something which was discussed last time and
> which I was considering doing. Then life got in the way, and in the
> meantime I concluded it is a questionable idea anyway.
>
> The state prior to the counters moving to per-CPU was not that great to
> begin with, with quite a few serialization points. As far as allocating
> stuff goes, one example is mm_alloc_cid(), with the following:
>         mm->pcpu_cid = alloc_percpu(struct mm_cid);
>
> Converting the code to avoid per-CPU rss counters in the common case, or
> the above patchset, only damage-controls the state back to what it was;
> neither does anything to push things further.
>
> Another note is that, unfortunately, userspace is increasingly
> multithreaded for no good reason; see the Rust ecosystem as an example.
>
> All that to say: the multithreaded case is what has to get faster, as a
> side effect possibly obsoleting both approaches proposed above. I
> concede that if nobody is willing to commit to doing that work in the
> foreseeable future, then indeed a damage-controlling solution should
> land.

Hi, Mateusz,

Which patch are you referencing? My series doesn't need any allocations
on thread creation or destruction. The RSS update is also extremely
lightweight (pretty much just a GS read and a few ADD/INC instructions),
and performance is better than all of them, even in microbenchmarks.
The RSS read only collects data from CPUs that may hold real updates.

I understand you may not have time to go through my series... but I
think I should add some details here.

> On that note, in check_mm() there is this loop:
>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>                 long x = percpu_counter_sum(&mm->rss_stat[i]);
>
> This avoidably walks all CPUs 4 times, with a preemption and lock trip
> for each round. Instead, one can observe that all modifications are
> supposed to have already stopped and that the counters are allocated in
> a batch. A routine, say percpu_counter_sum_many_unsafe, could do one
> iteration without any locking or interrupt games and return an array.
> This should be markedly faster, and I perhaps will hack it up.

That is similar to the RSS read in my earlier series. It is based on the
assumption that updates have most likely stopped, so it just reads the
counters "unsafely" with a double (and fast) check to ensure there is no
race. Even better, when coupled with mm shootdown
(CONFIG_ARCH_PCP_RSS_USE_CPUMASK), it doesn't need to collect RSS info
on thread exit at all.

> Part of The Real Solution(tm) would be to make counter allocations scale
> (including mcid, not just rss) or to dodge them altogether (while
> maintaining the per-CPU distribution; see below for one idea), but that
> boils down to balancing scalability against total memory usage. It is
> trivial to slap together a per-CPU cache of these allocations and have
> the problem go away for benchmarking purposes, while probably being too
> memory-hungry for actual usage.
>
> I was pondering an allocator with caches per some number of cores (say 4
> or 8). Microbenchmarks aside, I suspect real workloads would not suffer
> from contention at this kind of granularity, and it would trivially
> reduce memory usage compared to per-CPU caching. I suspect things like
> mm_struct, task_struct, task stacks and similar would be fine with it.
>
> Suppose mm_struct is allocated from an allocator that is more coarse
> grained than per-CPU. The total number of cached objects would be lower
> than it is now. That would also mean these allocated but currently
> unused mms could hold on to other stuff, for example the per-CPU rss and
> mcid counters. Then, should someone fork or exit, alloc/free_percpu
> would be avoided in most cases. This would scale better and be faster
> single-threaded than the current state.

And what is the issue with using only one CPU cache, flushed on mm
switch? There are no more allocations after boot, and the total (and
fixed) memory usage is just a few unsigned longs per CPU, which should
be even lower than the old RSS cache solution (4 unsigned longs per
task). It also scaled very well with every kind of microbenchmark or
workload I've tested.

Unless the workload keeps doing something like "allocate one page, then
switch to another mm", I think performance will already be horrible due
to cache invalidations and the many switch_*() calls; RSS isn't really a
concern there.

> (Believe it or not, this is not the actual long rant I have in mind.)
>
> I can't commit to working on the Real Solution though.
>
> In the meantime I can submit percpu_counter_sum_many_unsafe as described
> above, if Dennis likes the idea.
On Fri, May 17, 2024 at 11:29:57AM +0800, Kairui Song wrote:
> On Thu, May 16, 2024 at 11:14 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > [...]
> >
> > Suppose mm_struct is allocated from an allocator that is more coarse
> > grained than per-CPU. The total number of cached objects would be lower
> > than it is now. That would also mean these allocated but currently
> > unused mms could hold on to other stuff, for example the per-CPU rss and
> > mcid counters. Then, should someone fork or exit, alloc/free_percpu
> > would be avoided in most cases. This would scale better and be faster
> > single-threaded than the current state.
>
> And what is the issue with using only one CPU cache, flushed on mm
> switch? There are no more allocations after boot, and the total (and
> fixed) memory usage is just a few unsigned longs per CPU, which should
> be even lower than the old RSS cache solution (4 unsigned longs per
> task). It also scaled very well with every kind of microbenchmark or
> workload I've tested.
>
> Unless the workload keeps doing something like "allocate one page, then
> switch to another mm", I think performance will already be horrible due
> to cache invalidations and the many switch_*() calls; RSS isn't really a
> concern there.
>

I only skimmed through your patchset. I do think it is a legitimate
approach, but personally I would not do it like that because of the
extra work on context switches. However, I have zero say in this, so you
will need to prod the mm overlords to get it moving forward.

Maybe I was not clear enough in my opening e-mail, so I'm going to
reiterate some bits: there are scalability problems in execve even with
your patchset or the one which uses atomics. One of them concerns
another spot which allocates per-CPU memory (the mcid thing). I note
that sorting that out would possibly also take care of the rss problem;
I outlined an example approach above.
Hi Mateusz and Kairui,

On Thu, May 16, 2024 at 05:14:06PM +0200, Mateusz Guzik wrote:
> On Thu, May 16, 2024 at 07:50:52PM +0800, Kairui Song wrote:
> > [...]
> >
> > I have a patch series that predates commit f1a7941243c1 ("mm: convert
> > mm's rss stats into percpu_counter"):
> >
> > https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/
> >

I hadn't seen this series, as my inbox filters on percpu but not
per-cpu. I can take a closer look this week.

> > Instead of a per-mm, per-CPU cache, it uses only one global per-CPU
> > cache and flushes it on schedule. Or, if the arch supports it, it
> > flushes and fetches using the mm bitmap as an optimization (like TLB
> > shootdown).
> >
>
> I just spotted this thread.
>
> I have a rather long rant to write about the entire ordeal, but don't
> have the time at the moment. I do have time to make some remarks though.
>
> Rolling with a centralized counter and only distributing it per-CPU upon
> creation of a thread is something which was discussed last time and
> which I was considering doing. Then life got in the way, and in the
> meantime I concluded it is a questionable idea anyway.
>

To clarify my stance, I'm not against the API of switching a
percpu_counter to percpu mode; we do it for percpu_refcount. I think the
implementation here was fragile. Secondly, Kent did implement
lazy_percpu_counters. We likely should see how that can be leveraged and
how we can reconcile the two APIs.

> [...]
>
> On that note, in check_mm() there is this loop:
>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>                 long x = percpu_counter_sum(&mm->rss_stat[i]);
>
> This avoidably walks all CPUs 4 times, with a preemption and lock trip
> for each round. Instead, one can observe that all modifications are
> supposed to have already stopped and that the counters are allocated in
> a batch. A routine, say percpu_counter_sum_many_unsafe, could do one
> iteration without any locking or interrupt games and return an array.
> This should be markedly faster, and I perhaps will hack it up.

I'm a little worried about correctness in the cpu_hotplug case; I don't
know if that worry is warranted.

> [...]
>
> In the meantime I can submit percpu_counter_sum_many_unsafe as described
> above, if Dennis likes the idea.

To be honest, I'm a little out of my depth here. I haven't spent a lot
of time in the mm code paths, so I don't really know when and how we
account RSS and other stats. Given that, I think we should align on the
approach we want to take, because it sounds like percpu RSS might not be
the final evolution for this and other stats.

I think let's hold off on percpu_counter_sum_many_unsafe() initially,
and if that's the way we have to go, so be it. I'm going to respin a
cpu_hotplug-related series that might change some of the correctness
assumptions here and have that ready for the v6.11 merge window.

Thanks,
Dennis
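For reference, the percpu_refcount precedent mentioned above has an
existing API pair: a ref can be initialized in atomic mode and switched
to percpu mode later. Note that percpu_ref_init() still allocates the
per-CPU storage up front, so it defers only the update mode, not the
allocation, which is the extra step the rss_stat series adds. A minimal
usage example (example_release/example_init are made-up names):

#include <linux/percpu-refcount.h>

static void example_release(struct percpu_ref *ref)
{
        /* called once the last reference is dropped */
}

static int example_init(struct percpu_ref *ref)
{
        int err;

        /* Start in atomic mode: gets/puts hit a shared atomic counter. */
        err = percpu_ref_init(ref, example_release,
                              PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
        if (err)
                return err;

        /* Later, once contention is expected, switch to per-CPU mode. */
        percpu_ref_switch_to_percpu(ref);
        return 0;
}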
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d261e45bb29b..8f1bfbd54697 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_read_positive(fbc);
+
+	return percpu_counter_atomic_read(fbc);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		percpu_counter_add(fbc, value);
+	else
+		percpu_counter_atomic_add(fbc, value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
-
-	mm_trace_rss_stat(mm, member);
+	add_mm_counter(mm, member, 1);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	add_mm_counter(mm, member, -1);
+}
 
-	mm_trace_rss_stat(mm, member);
+static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
+{
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_sum(fbc);
+
+	return percpu_counter_atomic_read(fbc);
+}
+
+static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
+{
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_sum_positive(fbc);
+
+	return percpu_counter_atomic_read(fbc);
+}
+
+static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
+{
+	return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
+}
+
+static inline void mm_counter_destroy_many(struct mm_struct *mm)
+{
+	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 }
 
 /* Optimized variant when folio is already known not to be anon */
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6e62cc64cd92..a4e40ae6a8c8 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
-							    << PAGE_SHIFT);
+		__entry->size = (mm_counter_sum_positive(mm, member)
+							    << PAGE_SHIFT);
 	),
 
 	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..0214273798c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = mm_counter_sum(mm, i);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
-
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
-	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
 fail_nocontext:
@@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldmm)
 		return 0;
 
+	/*
+	 * For single-thread processes, rss_stat is in atomic mode, which
+	 * reduces the memory consumption and performance regression caused by
+	 * using percpu. For multiple-thread processes, rss_stat is switched to
+	 * the percpu mode to reduce the error margin.
+	 */
+	if (clone_flags & CLONE_THREAD)
+		if (mm_counter_switch_to_pcpu_many(oldmm))
+			return -ENOMEM;
+
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
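The helpers used above, percpu_counter_atomic_read(),
percpu_counter_atomic_add() and percpu_counter_switch_to_pcpu_many(),
come from patch 1/2 of this series, which is not shown on this page.
A minimal sketch of what they might look like, assuming CONFIG_SMP and
that atomic mode simply reuses fbc->count under fbc->lock until the
per-CPU storage is allocated; the real patch also has to handle the
percpu_counter hotplug list and concurrent updaters, which is glossed
over here:

static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
{
        return fbc->count;
}

static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
                                             s64 amount)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&fbc->lock, flags);
        fbc->count += amount;
        raw_spin_unlock_irqrestore(&fbc->lock, flags);
}

static inline int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
                                                     u32 nr_counters)
{
        u32 i;

        for (i = 0; i < nr_counters; i++) {
                s32 __percpu *counters;

                if (percpu_counter_initialized(&fbc[i]))
                        continue;       /* already in percpu mode */

                counters = alloc_percpu_gfp(s32, GFP_KERNEL_ACCOUNT);
                if (!counters)
                        return -ENOMEM;

                /* fbc->count keeps the total accumulated in atomic mode. */
                fbc[i].counters = counters;
        }
        return 0;
}

In the diff above the switch happens in copy_mm() before the second
thread starts running, so at that point only the current task can be
updating the counters.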