Message ID: 20220510122242.1380536-1-xu.xin16@zte.com.cn (mailing list archive)
State: New, archived
Series: [v6] mm/ksm: introduce ksm_force for each process
Hello.

On Tuesday, May 10, 2022 at 14:22:42 CEST cgel.zte@gmail.com wrote:
> From: xu xin <xu.xin16@zte.com.cn>
>
> To use KSM, we have to explicitly call madvise() in application code,
> which means installed apps on OS needs to be uninstall and source code
> needs to be modified. It is inconvenient.
>
> In order to change this situation, We add a new proc file ksm_force
> under /proc/<pid>/ to support turning on/off KSM scanning of a
> process's mm dynamically.
>
> If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> of this mm to be involved in KSM scanning without explicitly calling
> madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> the klob of /sys/kernel/mm/ksm/run is set as 1.
>
> If ksm_force is set to 0, cancel the feature of ksm_force of this
> process (fallback to the default state) and unmerge those merged pages
> belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> but still leave MADV_MERGEABLE areas merged.

To my best knowledge, last time a forcible KSM was discussed (see threads
[1], [2], [3] and probably others) it was concluded that a) procfs was a
horrible interface for things like this one; and b) process_madvise()
syscall was among the best suggested places to implement this (which would
require a more tricky handling from userspace, but still).

So, what changed since that discussion?

P.S. For now I do it via dedicated syscall, but I'm not trying to upstream
this approach.

[1] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
[2] https://lore.kernel.org/all/20190515145151.GG16651@dhcp22.suse.cz/T/#u
[3] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/
[4] https://gitlab.com/post-factum/pf-kernel/-/commits/ksm-5.17/

> Signed-off-by: xu xin <xu.xin16@zte.com.cn>
> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
> Reviewed-by: wangyong <wang.yong12@zte.com.cn>
> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> ---
> v6:
>  - modify the way of "return"
>  - remove unnecessary words in Documentation/admin-guide/mm/ksm.rst
>  - add additional notes to "set 0 to ksm_force" in Documentation/../ksm.rst
>    and Documentation/../proc.rst
> v5:
>  - fix typos in Documentation/filesystem/proc.rst
> v4:
>  - fix typos in commit log
>  - add interface descriptions under Documentation/
> v3:
>  - fix compile error of mm/ksm.c
> v2:
>  - fix a spelling error in commit log.
>  - remove a redundant condition check in ksm_force_write().
> ---
>  Documentation/admin-guide/mm/ksm.rst | 19 +++++-
>  Documentation/filesystems/proc.rst   | 17 +++++
>  fs/proc/base.c                       | 93 ++++++++++++++++++++++++++++
>  include/linux/mm_types.h             |  9 +++
>  mm/ksm.c                             | 32 +++++++++-
>  5 files changed, 167 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
> index b244f0202a03..8cabc2504005 100644
> --- a/Documentation/admin-guide/mm/ksm.rst
> +++ b/Documentation/admin-guide/mm/ksm.rst
> @@ -32,7 +32,7 @@ are swapped back in: ksmd must rediscover their identity and merge again).
>  Controlling KSM with madvise
>  ============================
>
> -KSM only operates on those areas of address space which an application
> +KSM can operates on those areas of address space which an application
>  has advised to be likely candidates for merging, by using the madvise(2)
>  system call::
>
> @@ -70,6 +70,23 @@ Applications should be considerate in their use of MADV_MERGEABLE,
>  restricting its use to areas likely to benefit. KSM's scans may use a lot
>  of processing power: some installations will disable KSM for that reason.
>
> +Controlling KSM with procfs
> +===========================
> +
> +KSM can also operate on anonymous areas of address space of those processes's
> +knob ``/proc/<pid>/ksm_force`` is on, even if app codes doesn't call madvise()
> +explicitly to advise specific areas as MADV_MERGEABLE.
> +
> +You can set ksm_force to 1 to force all anonymous and qualified VMAs of
> +this process to be involved in KSM scanning.
> + e.g. ``echo 1 > /proc/<pid>/ksm_force``
> +
> +You can also set ksm_force to 0 to cancel that force feature of this process
> +and unmerge those merged pages which belongs to those VMAs not marked as
> +MADV_MERGEABLE of this process. But that still leave those pages belonging to
> +VMAs marked as MADV_MERGEABLE merged (fallback to the default state).
> + e.g. ``echo 0 > /proc/<pid>/ksm_force``
> +
>  .. _ksm_sysfs:
>
>  KSM daemon sysfs interface
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 061744c436d9..8890b8b457a4 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
>   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>   3.11  /proc/<pid>/patch_state - Livepatch patch operation state
>   3.12  /proc/<pid>/arch_status - Task architecture specific information
> +  3.13  /proc/<pid>/ksm_force - Setting of mandatory involvement in KSM
>
>   4   Configuring procfs
>   4.1  Mount options
> @@ -2176,6 +2177,22 @@ AVX512_elapsed_ms
>   the task is unlikely an AVX512 user, but depends on the workload and the
>   scheduling scenario, it also could be a false negative mentioned above.
>
> +3.13 /proc/<pid>/ksm_force - Setting of mandatory involvement in KSM
> +-----------------------------------------------------------------------
> +When CONFIG_KSM is enabled, this file can be used to specify if this
> +process's anonymous memory can be involved in KSM scanning without app codes
> +explicitly calling madvise to mark memory address as MADV_MERGEABLE.
> +
> +If writing 1 to this file, the kernel will force all anonymous and qualified
> +memory to be involved in KSM scanning without explicitly calling madvise to
> +mark memory address as MADV_MERGEABLE. But that is effective only when the
> +klob of '/sys/kernel/mm/ksm/run' is set as 1.
> +
> +If writing 0 to this file, the mandatory KSM feature of this process's will
> +be cancelled and unmerge those merged pages which belongs to those areas not
> +marked as MADV_MERGEABLE of this process, but leave those pages belonging to
> +areas marked as MADV_MERGEABLE merged (fallback to the default state).
> +
>  Chapter 4: Configuring procfs
>  =============================
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 8dfa36a99c74..d60f7342f79e 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -96,6 +96,7 @@
>  #include <linux/time_namespace.h>
>  #include <linux/resctrl.h>
>  #include <linux/cn_proc.h>
> +#include <linux/ksm.h>
>  #include <trace/events/oom.h>
>  #include "internal.h"
>  #include "fd.h"
> @@ -3168,6 +3169,96 @@ static int proc_pid_ksm_merging_pages(struct seq_file *m, struct pid_namespace *
>
>  	return 0;
>  }
> +
> +static ssize_t ksm_force_read(struct file *file, char __user *buf, size_t count,
> +			      loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	ssize_t len;
> +	int ret;
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +
> +	mm = get_task_mm(task);
> +	ret = 0;
> +	if (mm) {
> +		len = snprintf(buffer, sizeof(buffer), "%d\n", mm->ksm_force);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +		mmput(mm);
> +	}
> +
> +	return ret;
> +}
> +
> +static ssize_t ksm_force_write(struct file *file, const char __user *buf,
> +			       size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	int force;
> +	int err = 0;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +
> +	err = kstrtoint(strstrip(buffer), 0, &force);
> +	if (err)
> +		return err;
> +
> +	if (force != 0 && force != 1)
> +		return -EINVAL;
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_put_task;
> +
> +	if (mm->ksm_force != force) {
> +		if (mmap_write_lock_killable(mm)) {
> +			err = -EINTR;
> +			goto out_mmput;
> +		}
> +
> +		if (force == 0)
> +			mm->ksm_force = force;
> +		else {
> +			/*
> +			 * Force anonymous pages of this mm to be involved in KSM merging
> +			 * without explicitly calling madvise.
> +			 */
> +			if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> +				err = __ksm_enter(mm);
> +			if (!err)
> +				mm->ksm_force = force;
> +		}
> +
> +		mmap_write_unlock(mm);
> +	}
> +
> +out_mmput:
> +	mmput(mm);
> +out_put_task:
> +	put_task_struct(task);
> +
> +	return err < 0 ? err : count;
> +}
> +
> +static const struct file_operations proc_pid_ksm_force_operations = {
> +	.read		= ksm_force_read,
> +	.write		= ksm_force_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif /* CONFIG_KSM */
>
>  #ifdef CONFIG_STACKLEAK_METRICS
> @@ -3303,6 +3394,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_KSM
>  	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
> +	REG("ksm_force", S_IRUSR|S_IWUSR, proc_pid_ksm_force_operations),
>  #endif
>  };
>
> @@ -3639,6 +3731,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_KSM
>  	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
> +	REG("ksm_force", S_IRUSR|S_IWUSR, proc_pid_ksm_force_operations),
>  #endif
>  };
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b34ff2cdbc4f..1b1592c2f5cf 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -661,6 +661,15 @@ struct mm_struct {
>  		 * merging.
>  		 */
>  		unsigned long ksm_merging_pages;
> +		/*
> +		 * If true, force anonymous pages of this mm to be involved in KSM
> +		 * merging without explicitly calling madvise. It is effctive only
> +		 * when the klob of '/sys/kernel/mm/ksm/run' is set as 1. If false,
> +		 * cancel the feature of ksm_force of this process and unmerge
> +		 * those merged pages which is not madvised as MERGEABLE of this
> +		 * process, but leave MERGEABLE areas merged.
> +		 */
> +		bool ksm_force;
>  #endif
>  	} __randomize_layout;
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 38360285497a..c9f672dcc72e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -334,6 +334,34 @@ static void __init ksm_slab_free(void)
>  	mm_slot_cache = NULL;
>  }
>
> +/* Check if vma is qualified for ksmd scanning */
> +static bool ksm_vma_check(struct vm_area_struct *vma)
> +{
> +	unsigned long vm_flags = vma->vm_flags;
> +
> +	if (!(vma->vm_flags & VM_MERGEABLE) && !(vma->vm_mm->ksm_force))
> +		return false;
> +
> +	if (vm_flags & (VM_SHARED | VM_MAYSHARE |
> +			VM_PFNMAP | VM_IO | VM_DONTEXPAND |
> +			VM_HUGETLB | VM_MIXEDMAP))
> +		return false;		/* just ignore this vma*/
> +
> +	if (vma_is_dax(vma))
> +		return false;
> +
> +#ifdef VM_SAO
> +	if (vm_flags & VM_SAO)
> +		return false;
> +#endif
> +#ifdef VM_SPARC_ADI
> +	if (vm_flags & VM_SPARC_ADI)
> +		return false;
> +#endif
> +
> +	return true;
> +}
> +
>  static __always_inline bool is_stable_node_chain(struct stable_node *chain)
>  {
>  	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
> @@ -523,7 +551,7 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
>  	if (ksm_test_exit(mm))
>  		return NULL;
>  	vma = vma_lookup(mm, addr);
> -	if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> +	if (!vma || !ksm_vma_check(vma) || !vma->anon_vma)
>  		return NULL;
>  	return vma;
>  }
> @@ -2297,7 +2325,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  		vma = find_vma(mm, ksm_scan.address);
>
>  	for (; vma; vma = vma->vm_next) {
> -		if (!(vma->vm_flags & VM_MERGEABLE))
> +		if (!ksm_vma_check(vma))
>  			continue;
>  		if (ksm_scan.address < vma->vm_start)
>  			ksm_scan.address = vma->vm_start;
On Tue, May 10, 2022 at 03:30:36PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> On Tuesday, May 10, 2022 at 14:22:42 CEST cgel.zte@gmail.com wrote:
> > From: xu xin <xu.xin16@zte.com.cn>
> >
> > To use KSM, we have to explicitly call madvise() in application code,
> > which means installed apps on OS needs to be uninstall and source code
> > needs to be modified. It is inconvenient.
> >
> > In order to change this situation, We add a new proc file ksm_force
> > under /proc/<pid>/ to support turning on/off KSM scanning of a
> > process's mm dynamically.
> >
> > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > of this mm to be involved in KSM scanning without explicitly calling
> > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > the klob of /sys/kernel/mm/ksm/run is set as 1.
> >
> > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > process (fallback to the default state) and unmerge those merged pages
> > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > but still leave MADV_MERGEABLE areas merged.
>
> To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
>
> So, what changed since that discussion?
>

Thanks a lot for your information. However, the patch here is slightly
different from your previous discussion: your patch focuses on using
procfs to change the madvise behaviour of another process, but this patch
will not modify the flags of all VMAs of the process. It introduces a new
bool ksm_force to represent this forcible feature of KSM on a per-process
basis, which is independent of madvise. In the same way, process_madvise()
is a kind of madvise in essence.

> P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
>
> [1] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
> [2] https://lore.kernel.org/all/20190515145151.GG16651@dhcp22.suse.cz/T/#u
> [3] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/
> [4] https://gitlab.com/post-factum/pf-kernel/-/commits/ksm-5.17/
>
> Oleksandr Natalenko (post-factum)
>
On Tue, May 10, 2022 at 12:22:42PM +0000, cgel.zte@gmail.com wrote:
> +++ b/Documentation/admin-guide/mm/ksm.rst
> @@ -32,7 +32,7 @@ are swapped back in: ksmd must rediscover their identity and merge again).
>  Controlling KSM with madvise
>  ============================
>
> -KSM only operates on those areas of address space which an application
> +KSM can operates on those areas of address space which an application

"can operate on"

> +static ssize_t ksm_force_write(struct file *file, const char __user *buf,
> +			       size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	int force;
> +	int err = 0;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +
> +	err = kstrtoint(strstrip(buffer), 0, &force);
> +	if (err)
> +		return err;
> +
> +	if (force != 0 && force != 1)
> +		return -EINVAL;
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_put_task;
> +
> +	if (mm->ksm_force != force) {
> +		if (mmap_write_lock_killable(mm)) {
> +			err = -EINTR;
> +			goto out_mmput;
> +		}
> +
> +		if (force == 0)
> +			mm->ksm_force = force;
> +		else {
> +			/*
> +			 * Force anonymous pages of this mm to be involved in KSM merging
> +			 * without explicitly calling madvise.
> +			 */
> +			if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> +				err = __ksm_enter(mm);
> +			if (!err)
> +				mm->ksm_force = force;
> +		}
> +
> +		mmap_write_unlock(mm);
> +	}

There's a much simpler patch hiding inside this complicated one.

	if (force) {
		set_bit(MMF_VM_MERGEABLE, &mm->flags));
		for each VMA
			set VM_MERGEABLE;
		err = __ksm_enter(mm);
	} else {
		clear_bit(MMF_VM_MERGEABLE, &mm->flags));
		for each VMA
			clear VM_MERGEABLE;
	}

... and all the extra complications you added go away.
On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:

> > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > of this mm to be involved in KSM scanning without explicitly calling
> > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > the klob of /sys/kernel/mm/ksm/run is set as 1.
> >
> > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > process (fallback to the default state) and unmerge those merged pages
> > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > but still leave MADV_MERGEABLE areas merged.
>
> To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
>
> So, what changed since that discussion?
>
> P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.

Why are you patching the kernel with a new syscall rather than using
process_madvise()?
On Thu, May 12, 2022 at 10:07:09PM +0100, Matthew Wilcox wrote:
> On Tue, May 10, 2022 at 12:22:42PM +0000, cgel.zte@gmail.com wrote:
> > +++ b/Documentation/admin-guide/mm/ksm.rst
> > @@ -32,7 +32,7 @@ are swapped back in: ksmd must rediscover their identity and merge again).
> >  Controlling KSM with madvise
> >  ============================
> >
> > -KSM only operates on those areas of address space which an application
> > +KSM can operates on those areas of address space which an application
>
> "can operate on"
>

Thanks.

> > +			/*
> > +			 * Force anonymous pages of this mm to be involved in KSM merging
> > +			 * without explicitly calling madvise.
> > +			 */
> > +			if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > +				err = __ksm_enter(mm);
> > +			if (!err)
> > +				mm->ksm_force = force;
> > +		}
> > +
> > +		mmap_write_unlock(mm);
> > +	}
>
> There's a much simpler patch hiding inside this complicated one.
>
> 	if (force) {
> 		set_bit(MMF_VM_MERGEABLE, &mm->flags));
> 		for each VMA
> 			set VM_MERGEABLE;
> 		err = __ksm_enter(mm);
> 	} else {
> 		clear_bit(MMF_VM_MERGEABLE, &mm->flags));
> 		for each VMA
> 			clear VM_MERGEABLE;
> 	}
>
> ... and all the extra complications you added go away.

Sorry, but I don't think that is a better way of implementation, although
it is simpler. It overrides the intention of the code itself, which is
unrecoverable. For example, if a program madvises just a part of its VMAs
(not all) as MERGEABLE, then its ksm_force is turned on, and subsequently
its ksm_force is turned off again, the madvised MERGEABLE areas cannot be
recovered.

===========================================================================
I have an idea: can we refer to the interface of THP? Substitute ksm_force
with ksm_enabled and give it three values:

 1) always: force all anonymous VMAs of this process to be scanned.
 2) madvise: the default state; unless user code calls madvise, don't scan
    this process.
 3) never: never be involved in KSM.
===========================================================================

How about this?
Hello.

On Friday, May 13, 2022 at 0:37:53 CEST Andrew Morton wrote:
> On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
>
> > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > > of this mm to be involved in KSM scanning without explicitly calling
> > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > > the klob of /sys/kernel/mm/ksm/run is set as 1.
> > >
> > > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > > process (fallback to the default state) and unmerge those merged pages
> > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > > but still leave MADV_MERGEABLE areas merged.
> >
> > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
> >
> > So, what changed since that discussion?
> >
> > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
>
> Why are you patching the kernel with a new syscall rather than using
> process_madvise()?

Because I'm not sure how to use `process_madvise()` to achieve $subj properly.

The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging.

For that, all the eligible VMAs have to be traversed.

Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and then the target task can continue.

This is:

a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and
b) may induce more latency than needed because the target task has to be stopped to avoid races.

OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`, I cannot just call it like this:

```
iovec.iov_base = 0;
iovec.iov_len = ~0ULL;
process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0);
```

to cover the whole address space because iovec expects total size to be under ssize_t.

Or maybe there's no need to cover the whole address space, only the lower half of it?

Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?..

I'm more than happy to read your comments on this.

Thanks.
On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:

> Hello.
>
> On Friday, May 13, 2022 at 0:37:53 CEST Andrew Morton wrote:
> > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> >
> > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > > > of this mm to be involved in KSM scanning without explicitly calling
> > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > > > the klob of /sys/kernel/mm/ksm/run is set as 1.
> > > >
> > > > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > > > process (fallback to the default state) and unmerge those merged pages
> > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > > > but still leave MADV_MERGEABLE areas merged.
> > >
> > > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
> > >
> > > So, what changed since that discussion?
> > >
> > > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
> >
> > Why are you patching the kernel with a new syscall rather than using
> > process_madvise()?
>
> Because I'm not sure how to use `process_madvise()` to achieve $subj properly.
>
> The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging.
>
> For that, all the eligible VMAs have to be traversed.
>
> Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and the the target task can continue. This is:
>
> a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and
> b) may induce more latency than needed because the target task has to be stopped to avoid races.

OK. And what happens to new vmas that the target process creates after
the process_madvise()?

> OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`,

Is it not?

> I cannot just call it like this:
>
> ```
> iovec.iov_base = 0;
> iovec.iov_len = ~0ULL;
> process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0);
> ```
>
> to cover the whole address space because iovec expects total size to be under ssize_t.
>
> Or maybe there's no need to cover the whole address space, only the lower half of it?

Call process_madvise() twice, once for each half?

> Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?..
>
> I'm more than happy to read your comments on this.
>

I see the problem. I do like the simplicity of the ksm_force concept.
Are there alternative ideas?
Hello.

On Friday, May 13, 2022 at 22:32:10 CEST Andrew Morton wrote:
> On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> > On Friday, May 13, 2022 at 0:37:53 CEST Andrew Morton wrote:
> > > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> > >
> > > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > > > > of this mm to be involved in KSM scanning without explicitly calling
> > > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > > > > the klob of /sys/kernel/mm/ksm/run is set as 1.
> > > > >
> > > > > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > > > > process (fallback to the default state) and unmerge those merged pages
> > > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > > > > but still leave MADV_MERGEABLE areas merged.
> > > >
> > > > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
> > > >
> > > > So, what changed since that discussion?
> > > >
> > > > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
> > >
> > > Why are you patching the kernel with a new syscall rather than using
> > > process_madvise()?
> >
> > Because I'm not sure how to use `process_madvise()` to achieve $subj properly.
> >
> > The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging.
> >
> > For that, all the eligible VMAs have to be traversed.
> >
> > Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and the the target task can continue. This is:
> >
> > a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and
> > b) may induce more latency than needed because the target task has to be stopped to avoid races.
>
> OK. And what happens to new vmas that the target process creates after
> the process_madvise()?

Call `process_madvise()` on them again. And do that again. Regularly, with some intervals. Use some daemon for that (like [1]).

[1] https://gitlab.com/post-factum/uksmd

> > OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`,
>
> Is it not?

It is not:

```
1158 static bool
1159 process_madvise_behavior_valid(int behavior)
1160 {
1161 	switch (behavior) {
1162 	case MADV_COLD:
1163 	case MADV_PAGEOUT:
1164 	case MADV_WILLNEED:
1165 		return true;
1166 	default:
1167 		return false;
1168 	}
1169 }
```

Initially, when `process_madvise()` stuff was being prepared for merging, I tried to enable it but failed [2], and it was decided [3] to move forward without it.

[2] https://lore.kernel.org/linux-api/34f812b8-df54-eaad-5cf0-335f07da55c6@suse.cz/
[3] https://lore.kernel.org/lkml/20200623085944.cvob63vrv54fo7cs@butterfly.localdomain/

> > I cannot just call it like this:
> >
> > ```
> > iovec.iov_base = 0;
> > iovec.iov_len = ~0ULL;
> > process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0);
> > ```
> >
> > to cover the whole address space because iovec expects total size to be under ssize_t.
> >
> > Or maybe there's no need to cover the whole address space, only the lower half of it?
>
> Call process_madvise() twice, once for each half?

And still get `-ENOMEM`?

```
1191 	/*
1192 	 * If the interval [start,end) covers some unmapped address
1193 	 * ranges, just ignore them, but return -ENOMEM at the end.
1194 	 *  - different from the way of handling in mlock etc.
1195 	 */
```

I mean, it probably will work, and probably having the error returned is fine, but generally speaking an error value should hint that something is not being done right.

> > Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?..
> >
> > I'm more than happy to read your comments on this.
>
> I see the problem. I do like the simplicity of the ksm_force concept.
> Are there alternative ideas?

I do like it too. I wonder what to do with older concerns [4] [5] regarding presenting such an API.

[4] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/
[5] https://lore.kernel.org/lkml/20190515145151.GG16651@dhcp22.suse.cz/
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index b244f0202a03..8cabc2504005 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -32,7 +32,7 @@ are swapped back in: ksmd must rediscover their identity and merge again). Controlling KSM with madvise ============================ -KSM only operates on those areas of address space which an application +KSM can operates on those areas of address space which an application has advised to be likely candidates for merging, by using the madvise(2) system call:: @@ -70,6 +70,23 @@ Applications should be considerate in their use of MADV_MERGEABLE, restricting its use to areas likely to benefit. KSM's scans may use a lot of processing power: some installations will disable KSM for that reason. +Controlling KSM with procfs +=========================== + +KSM can also operate on anonymous areas of address space of those processes's +knob ``/proc/<pid>/ksm_force`` is on, even if app codes doesn't call madvise() +explicitly to advise specific areas as MADV_MERGEABLE. + +You can set ksm_force to 1 to force all anonymous and qualified VMAs of +this process to be involved in KSM scanning. + e.g. ``echo 1 > /proc/<pid>/ksm_force`` + +You can also set ksm_force to 0 to cancel that force feature of this process +and unmerge those merged pages which belongs to those VMAs not marked as +MADV_MERGEABLE of this process. But that still leave those pages belonging to +VMAs marked as MADV_MERGEABLE merged (fallback to the default state). + e.g. ``echo 0 > /proc/<pid>/ksm_force`` + .. 
_ksm_sysfs: KSM daemon sysfs interface diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 061744c436d9..8890b8b457a4 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 3.10 /proc/<pid>/timerslack_ns - Task timerslack value 3.11 /proc/<pid>/patch_state - Livepatch patch operation state 3.12 /proc/<pid>/arch_status - Task architecture specific information + 3.13 /proc/<pid>/ksm_force - Setting of mandatory involvement in KSM 4 Configuring procfs 4.1 Mount options @@ -2176,6 +2177,22 @@ AVX512_elapsed_ms the task is unlikely an AVX512 user, but depends on the workload and the scheduling scenario, it also could be a false negative mentioned above. +3.13 /proc/<pid>/ksm_force - Setting of mandatory involvement in KSM +----------------------------------------------------------------------- +When CONFIG_KSM is enabled, this file can be used to specify if this +process's anonymous memory can be involved in KSM scanning without app codes +explicitly calling madvise to mark memory address as MADV_MERGEABLE. + +If writing 1 to this file, the kernel will force all anonymous and qualified +memory to be involved in KSM scanning without explicitly calling madvise to +mark memory address as MADV_MERGEABLE. But that is effective only when the +klob of '/sys/kernel/mm/ksm/run' is set as 1. + +If writing 0 to this file, the mandatory KSM feature of this process's will +be cancelled and unmerge those merged pages which belongs to those areas not +marked as MADV_MERGEABLE of this process, but leave those pages belonging to +areas marked as MADV_MERGEABLE merged (fallback to the default state). 
+
 Chapter 4: Configuring procfs
 =============================
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8dfa36a99c74..d60f7342f79e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -96,6 +96,7 @@
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
 #include <linux/cn_proc.h>
+#include <linux/ksm.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -3168,6 +3169,96 @@ static int proc_pid_ksm_merging_pages(struct seq_file *m, struct pid_namespace *
 	return 0;
 }
 
+static ssize_t ksm_force_read(struct file *file, char __user *buf, size_t count,
+			      loff_t *ppos)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	char buffer[PROC_NUMBUF];
+	ssize_t len;
+	int ret;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	mm = get_task_mm(task);
+	ret = 0;
+	if (mm) {
+		len = snprintf(buffer, sizeof(buffer), "%d\n", mm->ksm_force);
+		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
+		mmput(mm);
+	}
+	/* Drop the reference taken by get_proc_task() */
+	put_task_struct(task);
+
+	return ret;
+}
+
+static ssize_t ksm_force_write(struct file *file, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	char buffer[PROC_NUMBUF];
+	int force;
+	int err = 0;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	err = kstrtoint(strstrip(buffer), 0, &force);
+	if (err)
+		return err;
+
+	if (force != 0 && force != 1)
+		return -EINVAL;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_put_task;
+
+	if (mm->ksm_force != force) {
+		if (mmap_write_lock_killable(mm)) {
+			err = -EINTR;
+			goto out_mmput;
+		}
+
+		if (force == 0) {
+			mm->ksm_force = force;
+		} else {
+			/*
+			 * Force anonymous pages of this mm to be involved in
+			 * KSM merging without explicitly calling madvise.
+			 */
+			if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
+				err = __ksm_enter(mm);
+			if (!err)
+				mm->ksm_force = force;
+		}
+
+		mmap_write_unlock(mm);
+	}
+
+out_mmput:
+	mmput(mm);
+out_put_task:
+	put_task_struct(task);
+
+	return err < 0 ? err : count;
+}
+
+static const struct file_operations proc_pid_ksm_force_operations = {
+	.read		= ksm_force_read,
+	.write		= ksm_force_write,
+	.llseek		= generic_file_llseek,
+};
 #endif /* CONFIG_KSM */
 
 #ifdef CONFIG_STACKLEAK_METRICS
@@ -3303,6 +3394,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_KSM
 	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
+	REG("ksm_force", S_IRUSR|S_IWUSR, proc_pid_ksm_force_operations),
 #endif
 };
 
@@ -3639,6 +3731,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_KSM
 	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
+	REG("ksm_force", S_IRUSR|S_IWUSR, proc_pid_ksm_force_operations),
 #endif
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b34ff2cdbc4f..1b1592c2f5cf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -661,6 +661,15 @@ struct mm_struct {
 		 * merging.
 		 */
 		unsigned long ksm_merging_pages;
+		/*
+		 * If true, force anonymous pages of this mm to be involved
+		 * in KSM merging without explicitly calling madvise. It is
+		 * effective only when the knob '/sys/kernel/mm/ksm/run' is
+		 * set to 1. If false, cancel the ksm_force feature of this
+		 * process and unmerge merged pages that are not madvised as
+		 * MERGEABLE in this process, but leave MERGEABLE areas merged.
+		 */
+		bool ksm_force;
 #endif
 } __randomize_layout;
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 38360285497a..c9f672dcc72e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -334,6 +334,34 @@ static void __init ksm_slab_free(void)
 	mm_slot_cache = NULL;
 }
 
+/* Check if a vma is qualified for ksmd scanning */
+static bool ksm_vma_check(struct vm_area_struct *vma)
+{
+	unsigned long vm_flags = vma->vm_flags;
+
+	if (!(vm_flags & VM_MERGEABLE) && !vma->vm_mm->ksm_force)
+		return false;
+
+	if (vm_flags & (VM_SHARED | VM_MAYSHARE |
+			VM_PFNMAP | VM_IO | VM_DONTEXPAND |
+			VM_HUGETLB | VM_MIXEDMAP))
+		return false;	/* just ignore this vma */
+
+	if (vma_is_dax(vma))
+		return false;
+
+#ifdef VM_SAO
+	if (vm_flags & VM_SAO)
+		return false;
+#endif
+#ifdef VM_SPARC_ADI
+	if (vm_flags & VM_SPARC_ADI)
+		return false;
+#endif
+
+	return true;
+}
+
 static __always_inline bool is_stable_node_chain(struct stable_node *chain)
 {
 	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
@@ -523,7 +551,7 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 	if (ksm_test_exit(mm))
 		return NULL;
 	vma = vma_lookup(mm, addr);
-	if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+	if (!vma || !ksm_vma_check(vma) || !vma->anon_vma)
 		return NULL;
 	return vma;
 }
@@ -2297,7 +2325,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 
 	vma = find_vma(mm, ksm_scan.address);
 	for (; vma; vma = vma->vm_next) {
-		if (!(vma->vm_flags & VM_MERGEABLE))
+		if (!ksm_vma_check(vma))
 			continue;
 		if (ksm_scan.address < vma->vm_start)
 			ksm_scan.address = vma->vm_start;
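For reviewers who want to exercise the proposed interface, here is a minimal
userspace sketch of how an administrator might drive the new knob. This is
not part of the patch: the helper names are mine, and it assumes a kernel
with this patch applied and CONFIG_KSM=y.

```shell
#!/bin/sh
# Sketch of an admin helper for the proposed /proc/<pid>/ksm_force knob.

# Build the per-process knob path for a given pid.
ksm_force_path() {
	printf '/proc/%s/ksm_force' "$1"
}

# Turn forced KSM scanning on (1) or off (0) for a running process.
ksm_force() {
	pid="$1" value="$2"
	case "$value" in
	0|1) ;;
	*)	# The kernel side returns -EINVAL for anything but 0 or 1.
		echo "ksm_force: value must be 0 or 1" >&2
		return 1 ;;
	esac
	# The per-process knob only takes effect while ksmd is running,
	# i.e. while the global /sys/kernel/mm/ksm/run knob is 1.
	echo 1 > /sys/kernel/mm/ksm/run
	echo "$value" > "$(ksm_force_path "$pid")"
}
```

Usage would look like `ksm_force "$(pidof some-daemon)" 1` to opt a process
in, and `ksm_force <pid> 0` to fall back to plain MADV_MERGEABLE behaviour,
mirroring the echo examples in the ksm.rst hunk above.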