diff mbox series

mm/ksm: introduce ksm_enabled for each process

Message ID 20220517092701.1662641-1-xu.xin16@zte.com.cn (mailing list archive)
State New
Headers show
Series mm/ksm: introduce ksm_enabled for each process | expand

Commit Message

CGEL May 17, 2022, 9:27 a.m. UTC
From: xu xin <xu.xin16@zte.com.cn>

For now, if we want to use KSM to merge pages of some apps, we have to
explicitly call madvise() in application code, which means installed
apps on OS needs to be uninstall and source code needs to be modified.
It is very inconvenient because sometimes users or app developers are not
willing to modify their app source codes for any reasons.

So to use KSM more flexibly, we provide a new proc file "ksm_enabled" under
/proc/<pid>/. We can pass parameter into this file with one of three values
as follows:

	always:
		force all anonymous and eligible VMAs of this process to be
		scanned by ksmd.
	madvise:
		the default state, unless user code call madvise, ksmd
		doesn't scan this process.
	never:
		this process will never be scanned by ksmd and no merged
		pages occurred in this process.

With this patch, we can control KSM with ``/proc/<pid>/ksm_enabled``
based on every process. KSM for each process can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_MERGEABLE
regions (to avoid the risk of consuming more cpu resources to scan for
ksmd) or enabled entirely for a process.

Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: wangyong <wang.yong12@zte.com.cn>
Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Signed-off-by: CGEL <cgel.zte@gmail.com>
---
 Documentation/admin-guide/mm/ksm.rst |  24 ++++++-
 Documentation/filesystems/proc.rst   |  14 ++++
 fs/proc/base.c                       | 102 ++++++++++++++++++++++++++-
 include/linux/ksm.h                  |   5 ++
 include/linux/mm_types.h             |  10 +++
 mm/ksm.c                             |  35 ++++++++-
 6 files changed, 185 insertions(+), 5 deletions(-)

Comments

Michal Hocko May 17, 2022, 2:04 p.m. UTC | #1
[CCing Hugh and linux-api]

On Tue 17-05-22 09:27:01, cgel.zte@gmail.com wrote:
> From: xu xin <xu.xin16@zte.com.cn>
> 
> For now, if we want to use KSM to merge pages of some apps, we have to
> explicitly call madvise() in application code, which means installed
> apps on OS needs to be uninstall and source code needs to be modified.
> It is very inconvenient because sometimes users or app developers are not
> willing to modify their app source codes for any reasons.

Would it help to allow external control by process_madvise?
 
> So to use KSM more flexibly, we provide a new proc file "ksm_enabled" under
> /proc/<pid>/. We can pass parameter into this file with one of three values
> as follows:
> 
> 	always:
> 		force all anonymous and eligible VMAs of this process to be
> 		scanned by ksmd.
> 	madvise:
> 		the default state, unless user code call madvise, ksmd
> 		doesn't scan this process.
> 	never:
> 		this process will never be scanned by ksmd and no merged
> 		pages occurred in this process.
> 
> With this patch, we can control KSM with ``/proc/<pid>/ksm_enabled``
> based on every process. KSM for each process can be entirely disabled
> (mostly for debugging purposes) or only enabled inside MADV_MERGEABLE
> regions (to avoid the risk of consuming more cpu resources to scan for
> ksmd) or enabled entirely for a process.

I am not really familiar with KSM much but I am wondering whether the
proc based interface is really the best fit. We have a very similar
concern with THP where processes would like to override the global setup
and that has been done with prctl interface. Is there any reason why
this should be any different?

Another question I have is about the interaction of the per-process
tunable with any explicit madvise calls. AFAICS you have made this knob
per mm but the actual implementation currently relies on the per-vma
flags. That means that one can explicitly disallow merging by madvise
for a range. Is it wise to override that by a per-process knob? I mean
there might be a very good reason why a particular memory ranges should
never be merged but a per-process knob could easily ignore that hint
from the application. Or am I just confused?

[I am keeping the rest of the email for reference]

> Signed-off-by: xu xin <xu.xin16@zte.com.cn>
> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
> Reviewed-by: wangyong <wang.yong12@zte.com.cn>
> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
> Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
> Signed-off-by: CGEL <cgel.zte@gmail.com>
> ---
>  Documentation/admin-guide/mm/ksm.rst |  24 ++++++-
>  Documentation/filesystems/proc.rst   |  14 ++++
>  fs/proc/base.c                       | 102 ++++++++++++++++++++++++++-
>  include/linux/ksm.h                  |   5 ++
>  include/linux/mm_types.h             |  10 +++
>  mm/ksm.c                             |  35 ++++++++-
>  6 files changed, 185 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
> index b244f0202a03..91326198e37f 100644
> --- a/Documentation/admin-guide/mm/ksm.rst
> +++ b/Documentation/admin-guide/mm/ksm.rst
> @@ -32,7 +32,7 @@ are swapped back in: ksmd must rediscover their identity and merge again).
>  Controlling KSM with madvise
>  ============================
>  
> -KSM only operates on those areas of address space which an application
> +KSM can operate on those areas of address space which an application
>  has advised to be likely candidates for merging, by using the madvise(2)
>  system call::
>  
> @@ -70,6 +70,28 @@ Applications should be considerate in their use of MADV_MERGEABLE,
>  restricting its use to areas likely to benefit.  KSM's scans may use a lot
>  of processing power: some installations will disable KSM for that reason.
>  
> +Controlling KSM with procfs
> +===========================
> +We can also control KSM with ``/proc/<pid>/ksm_enabled`` based on every
> +process. KSM for each process can be entirely disabled (mostly for
> +debugging purposes) or only enabled inside MADV_MERGEABLE regions (to avoid
> +the risk of consuming more cpu resources to scan for ksmd) or enabled entirely
> +for a process. This can be achieved with one of::
> +
> +    echo always > /proc/<pid>/ksm_enabled
> +    echo madvise > /proc/<pid>/ksm_enabled
> +    echo never > /proc/<pid>/ksm_enabled
> +
> +always:
> +        force all anonymous and eligible VMAs of this process to be scanned
> +        by ksmd.
> +madvise:
> +        the default state, unless user code call madvise, ksmd doesn't scan
> +        this process.
> +never:
> +        this process will never be scanned by ksmd and no merged pages
> +	occurred in this process.
> +
>  .. _ksm_sysfs:
>  
>  KSM daemon sysfs interface
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 1bc91fb8c321..ea7e08a1c143 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -47,6 +47,7 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> +  3.13  /proc/<pid>/ksm_enabled - Controlling KSM based on process
>  
>    4	Configuring procfs
>    4.1	Mount options
> @@ -2140,6 +2141,19 @@ AVX512_elapsed_ms
>    the task is unlikely an AVX512 user, but depends on the workload and the
>    scheduling scenario, it also could be a false negative mentioned above.
>  
> +3.13 /proc/<pid>/ksm_enabled - Controlling KSM based on process
> +---------------------------------------------------------------
> +When CONFIG_KSM is enabled, this file can be used to specify how this
> +process's anonymous memory gets involved in KSM scanning.
> +
> +If writing "always" to this file, it will force all anonymous and eligible
> +VMAs of this process to be scanned by ksmd.
> +
> +If writing "madvise" to this file, turn to the default state, unless user
> +code call madvise, ksmd doesn't scan this process.
> +
> +If writing "never" to this file, this process will never be scanned by ksmd.
> +
>  Chapter 4: Configuring procfs
>  =============================
>  
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 617816168748..760ceeab4aa1 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -96,6 +96,7 @@
>  #include <linux/time_namespace.h>
>  #include <linux/resctrl.h>
>  #include <linux/cn_proc.h>
> +#include <linux/ksm.h>
>  #include <trace/events/oom.h>
>  #include "internal.h"
>  #include "fd.h"
> @@ -3171,7 +3172,104 @@ static int proc_pid_ksm_merging_pages(struct seq_file *m, struct pid_namespace *
>  
>  	return 0;
>  }
> -#endif /* CONFIG_KSM */
> +
> +static int ksm_enabled_show(struct seq_file *m, void *v)
> +{
> +	struct inode *inode = m->private;
> +	struct mm_struct *mm;
> +	struct task_struct *task = get_proc_task(inode);
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		if (mm->ksm_enabled == KSM_PROC_ALWAYS)
> +			seq_puts(m, "[always] madvise never\n");
> +		else if (mm->ksm_enabled == KSM_PROC_MADVISE)
> +			seq_puts(m, "always [madvise] never\n");
> +		else
> +			seq_puts(m, "always madvise [never]\n");
> +		mmput(mm);
> +	}
> +
> +	put_task_struct(task);
> +	return 0;
> +}
> +
> +static ssize_t ksm_enabled_write(struct file *file, const char __user *buf,
> +								size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	int value;
> +	int err = 0;
> +	long str_len;
> +
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	str_len = strncpy_from_user(buffer, buf, count);
> +	if (str_len < 0)
> +		return -EFAULT;
> +	buffer[str_len - 1] = '\0';
> +
> +	if (!strcmp("always", buffer))
> +		value = KSM_PROC_ALWAYS;
> +	else if (!strcmp("madvise", buffer))
> +		value = KSM_PROC_MADVISE;
> +	else if (!strcmp("never", buffer))
> +		value = KSM_PROC_NEVER;
> +	else
> +		return -EINVAL;
> +
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		return -ESRCH;
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_put_task;
> +
> +	if (mm->ksm_enabled != value) {
> +		if (mmap_write_lock_killable(mm)) {
> +			err = -EINTR;
> +			goto out_mmput;
> +		}
> +		if (value == KSM_PROC_NEVER)
> +			mm->ksm_enabled = value;
> +		else {
> +			/*
> +			 * No matter whether it's KSM_PROC_ALWAYS or KSM_PROC_MADVISE, we need
> +			 * to recheck mm->flags to guarantee that this mm is in ksm_scan.
> +			 */
> +			if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> +				err = __ksm_enter(mm);
> +			if (!err)
> +				mm->ksm_enabled = value;
> +		}
> +		mmap_write_unlock(mm);
> +	}
> +
> +out_mmput:
> +	mmput(mm);
> +out_put_task:
> +	put_task_struct(task);
> +	return err < 0 ? err : count;
> +}
> +
> +static int ksm_enabled_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, ksm_enabled_show, inode);
> +}
> +
> +static const struct file_operations proc_pid_ksm_enabled_operations = {
> +	.open       = ksm_enabled_open,
> +	.read		= seq_read,
> +	.write		= ksm_enabled_write,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +#endif /*CONFIG_KSM */
>  
>  #ifdef CONFIG_STACKLEAK_METRICS
>  static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> @@ -3306,6 +3404,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_KSM
>  	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
> +	REG("ksm_enabled", S_IRUGO|S_IWUSR, proc_pid_ksm_enabled_operations),
>  #endif
>  };
>  
> @@ -3642,6 +3741,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_KSM
>  	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
> +	REG("ksm_enabled", S_IRUGO|S_IWUSR, proc_pid_ksm_enabled_operations),
>  #endif
>  };
>  
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 0b4f17418f64..29d23d208b54 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -19,6 +19,11 @@ struct stable_node;
>  struct mem_cgroup;
>  
>  #ifdef CONFIG_KSM
> +
> +#define KSM_PROC_MADVISE	0
> +#define KSM_PROC_ALWAYS		1
> +#define KSM_PROC_NEVER		2
> +
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
>  int __ksm_enter(struct mm_struct *mm);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 417ef1519475..29fd4c84d08c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -649,6 +649,16 @@ struct mm_struct {
>  		 * merging.
>  		 */
>  		unsigned long ksm_merging_pages;
> +
> +		/*
> +		 * Represent the state of this mm involing in KSM, with 3 states:
> +		 * 1) KSM_PROC_ALWAYS:  force all anonymous VMAs of this process to
> +		 *						be scanned.
> +		 * 2) KSM_PROC_MADVISE: the default state, unless user code call
> +		 *						madvise, don't scan this process.
> +		 * 3) KSM_PROC_NEVER: never be involed in KSM.
> +		 */
> +		int ksm_enabled;
>  #endif
>  	} __randomize_layout;
>  
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 26da7f813f23..90cc8eda8bca 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -334,6 +334,35 @@ static void __init ksm_slab_free(void)
>  	mm_slot_cache = NULL;
>  }
>  
> +static bool vma_scannable(struct vm_area_struct *vma)
> +{
> +	unsigned long vm_flags = vma->vm_flags;
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if (mm->ksm_enabled == KSM_PROC_NEVER ||
> +			(!(vma->vm_flags & VM_MERGEABLE) &&
> +			mm->ksm_enabled != KSM_PROC_ALWAYS))
> +		return false;
> +
> +	if (vm_flags & (VM_SHARED	| VM_MAYSHARE	|
> +			VM_PFNMAP	| VM_IO | VM_DONTEXPAND |
> +			VM_HUGETLB	| VM_MIXEDMAP))
> +		return false;       /* just ignore this vma*/
> +
> +	if (vma_is_dax(vma))
> +		return false;
> +#ifdef VM_SAO
> +	if (vm_flags & VM_SAO)
> +		return false;
> +#endif
> +#ifdef VM_SPARC_ADI
> +	if (vm_flags & VM_SPARC_ADI)
> +		return false;
> +#endif
> +
> +	return true;
> +}
> +
>  static __always_inline bool is_stable_node_chain(struct stable_node *chain)
>  {
>  	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
> @@ -523,7 +552,7 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
>  	if (ksm_test_exit(mm))
>  		return NULL;
>  	vma = vma_lookup(mm, addr);
> -	if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> +	if (!vma || !vma_scannable(vma) || !vma->anon_vma)
>  		return NULL;
>  	return vma;
>  }
> @@ -990,7 +1019,7 @@ static int unmerge_and_remove_all_rmap_items(void)
>  		for_each_vma(vmi, vma) {
>  			if (ksm_test_exit(mm))
>  				break;
> -			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> +			if (!vma_scannable(vma) || !vma->anon_vma)
>  				continue;
>  			err = unmerge_ksm_pages(vma,
>  						vma->vm_start, vma->vm_end);
> @@ -2300,7 +2329,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  		goto no_vmas;
>  
>  	for_each_vma(vmi, vma) {
> -		if (!(vma->vm_flags & VM_MERGEABLE))
> +		if (!vma_scannable(vma))
>  			continue;
>  		if (ksm_scan.address < vma->vm_start)
>  			ksm_scan.address = vma->vm_start;
> -- 
> 2.25.1
CGEL May 18, 2022, 2:47 a.m. UTC | #2
On Tue, May 17, 2022 at 04:04:38PM +0200, Michal Hocko wrote:
> [CCing Hugh and linux-api]
> 
> On Tue 17-05-22 09:27:01, cgel.zte@gmail.com wrote:
> > From: xu xin <xu.xin16@zte.com.cn>
> > 
> > For now, if we want to use KSM to merge pages of some apps, we have to
> > explicitly call madvise() in application code, which means installed
> > apps on OS needs to be uninstall and source code needs to be modified.
> > It is very inconvenient because sometimes users or app developers are not
> > willing to modify their app source codes for any reasons.

Hello, Michal.
> 
> Would it help to allow external control by process_madvise?
>

Maybe, but it will be much more complicated to achieve this by
process_madvise(). process_madvise works on a serires of VMAs in
essential, So all the eligible VMAs have to be traversed. The problem
about this has been discussed in [1],[2].
[1]https://lore.kernel.org/lkml/1835064.A2aMcgg3dW@natalenko.name/
[2]https://lore.kernel.org/lkml/20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org/
> > So to use KSM more flexibly, we provide a new proc file "ksm_enabled" under
> > /proc/<pid>/. We can pass parameter into this file with one of three values
> > as follows:
> > 
> > 	always:
> > 		force all anonymous and eligible VMAs of this process to be
> > 		scanned by ksmd.
> > 	madvise:
> > 		the default state, unless user code call madvise, ksmd
> > 		doesn't scan this process.
> > 	never:
> > 		this process will never be scanned by ksmd and no merged
> > 		pages occurred in this process.
> > 
> > With this patch, we can control KSM with ``/proc/<pid>/ksm_enabled``
> > based on every process. KSM for each process can be entirely disabled
> > (mostly for debugging purposes) or only enabled inside MADV_MERGEABLE
> > regions (to avoid the risk of consuming more cpu resources to scan for
> > ksmd) or enabled entirely for a process.
> 
> I am not really familiar with KSM much but I am wondering whether the
> proc based interface is really the best fit. We have a very similar
> concern with THP where processes would like to override the global setup
> and that has been done with prctl interface. Is there any reason why
> this should be any different?
> 
At least for now, I can't find a simpler implementation than proc file,
unless we add a new syscall used for changing another process mm's flag
in user space.

Speaking to THP, the interactive UI of KSM is relatively simpler because
KSM dosen't have global knob like THP. OTOH, THP trades space for time
(consume memory) while KSM trades time for space (save memory), so THP
tends to be enabled system wide while KSM not.

> Another question I have is about the interaction of the per-process
> tunable with any explicit madvise calls. AFAICS you have made this knob
> per mm but the actual implementation currently relies on the per-vma
> flags. That means that one can explicitly disallow merging by madvise
> for a range. Is it wise to override that by a per-process knob? I mean
> there might be a very good reason why a particular memory ranges should
> never be merged but a per-process knob could easily ignore that hint
> from the application. Or am I just confuse?
For now, there is no any hints for letting KSM never merge some memory
ranges.

> 
> [I am keeping the rest of the email for reference]
> 

Sync with you.

happy to see your reply.
> > Signed-off-by: xu xin <xu.xin16@zte.com.cn>
> > Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
> > Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
> > Reviewed-by: wangyong <wang.yong12@zte.com.cn>
> > Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
> > Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
> > Signed-off-by: CGEL <cgel.zte@gmail.com>
> > ---
> >  Documentation/admin-guide/mm/ksm.rst |  24 ++++++-
> >  Documentation/filesystems/proc.rst   |  14 ++++
> >  fs/proc/base.c                       | 102 ++++++++++++++++++++++++++-
> >  include/linux/ksm.h                  |   5 ++
> >  include/linux/mm_types.h             |  10 +++
> >  mm/ksm.c                             |  35 ++++++++-
> >  6 files changed, 185 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
> > index b244f0202a03..91326198e37f 100644
> > --- a/Documentation/admin-guide/mm/ksm.rst
> > +++ b/Documentation/admin-guide/mm/ksm.rst
> > @@ -32,7 +32,7 @@ are swapped back in: ksmd must rediscover their identity and merge again).
> >  Controlling KSM with madvise
> >  ============================
> >  
> > -KSM only operates on those areas of address space which an application
> > +KSM can operate on those areas of address space which an application
> >  has advised to be likely candidates for merging, by using the madvise(2)
> >  system call::
> >  
> > @@ -70,6 +70,28 @@ Applications should be considerate in their use of MADV_MERGEABLE,
> >  restricting its use to areas likely to benefit.  KSM's scans may use a lot
> >  of processing power: some installations will disable KSM for that reason.
> >  
> > +Controlling KSM with procfs
> > +===========================
> > +We can also control KSM with ``/proc/<pid>/ksm_enabled`` based on every
> > +process. KSM for each process can be entirely disabled (mostly for
> > +debugging purposes) or only enabled inside MADV_MERGEABLE regions (to avoid
> > +the risk of consuming more cpu resources to scan for ksmd) or enabled entirely
> > +for a process. This can be achieved with one of::
> > +
> > +    echo always > /proc/<pid>/ksm_enabled
> > +    echo madvise > /proc/<pid>/ksm_enabled
> > +    echo never > /proc/<pid>/ksm_enabled
> > +
> > +always:
> > +        force all anonymous and eligible VMAs of this process to be scanned
> > +        by ksmd.
> > +madvise:
> > +        the default state, unless user code call madvise, ksmd doesn't scan
> > +        this process.
> > +never:
> > +        this process will never be scanned by ksmd and no merged pages
> > +	occurred in this process.
> > +
> >  .. _ksm_sysfs:
> >  
> >  KSM daemon sysfs interface
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 1bc91fb8c321..ea7e08a1c143 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -47,6 +47,7 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
> >    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> >    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> >    3.12	/proc/<pid>/arch_status - Task architecture specific information
> > +  3.13  /proc/<pid>/ksm_enabled - Controlling KSM based on process
> >  
> >    4	Configuring procfs
> >    4.1	Mount options
> > @@ -2140,6 +2141,19 @@ AVX512_elapsed_ms
> >    the task is unlikely an AVX512 user, but depends on the workload and the
> >    scheduling scenario, it also could be a false negative mentioned above.
> >  
> > +3.13 /proc/<pid>/ksm_enabled - Controlling KSM based on process
> > +---------------------------------------------------------------
> > +When CONFIG_KSM is enabled, this file can be used to specify how this
> > +process's anonymous memory gets involved in KSM scanning.
> > +
> > +If writing "always" to this file, it will force all anonymous and eligible
> > +VMAs of this process to be scanned by ksmd.
> > +
> > +If writing "madvise" to this file, turn to the default state, unless user
> > +code call madvise, ksmd doesn't scan this process.
> > +
> > +If writing "never" to this file, this process will never be scanned by ksmd.
> > +
> >  Chapter 4: Configuring procfs
> >  =============================
> >  
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 617816168748..760ceeab4aa1 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -96,6 +96,7 @@
> >  #include <linux/time_namespace.h>
> >  #include <linux/resctrl.h>
> >  #include <linux/cn_proc.h>
> > +#include <linux/ksm.h>
> >  #include <trace/events/oom.h>
> >  #include "internal.h"
> >  #include "fd.h"
> > @@ -3171,7 +3172,104 @@ static int proc_pid_ksm_merging_pages(struct seq_file *m, struct pid_namespace *
> >  
> >  	return 0;
> >  }
> > -#endif /* CONFIG_KSM */
> > +
> > +static int ksm_enabled_show(struct seq_file *m, void *v)
> > +{
> > +	struct inode *inode = m->private;
> > +	struct mm_struct *mm;
> > +	struct task_struct *task = get_proc_task(inode);
> > +
> > +	if (!task)
> > +		return -ESRCH;
> > +
> > +	mm = get_task_mm(task);
> > +	if (mm) {
> > +		if (mm->ksm_enabled == KSM_PROC_ALWAYS)
> > +			seq_puts(m, "[always] madvise never\n");
> > +		else if (mm->ksm_enabled == KSM_PROC_MADVISE)
> > +			seq_puts(m, "always [madvise] never\n");
> > +		else
> > +			seq_puts(m, "always madvise [never]\n");
> > +		mmput(mm);
> > +	}
> > +
> > +	put_task_struct(task);
> > +	return 0;
> > +}
> > +
> > +static ssize_t ksm_enabled_write(struct file *file, const char __user *buf,
> > +								size_t count, loff_t *ppos)
> > +{
> > +	struct task_struct *task;
> > +	struct mm_struct *mm;
> > +	char buffer[PROC_NUMBUF];
> > +	int value;
> > +	int err = 0;
> > +	long str_len;
> > +
> > +	if (count > sizeof(buffer) - 1)
> > +		count = sizeof(buffer) - 1;
> > +	str_len = strncpy_from_user(buffer, buf, count);
> > +	if (str_len < 0)
> > +		return -EFAULT;
> > +	buffer[str_len - 1] = '\0';
> > +
> > +	if (!strcmp("always", buffer))
> > +		value = KSM_PROC_ALWAYS;
> > +	else if (!strcmp("madvise", buffer))
> > +		value = KSM_PROC_MADVISE;
> > +	else if (!strcmp("never", buffer))
> > +		value = KSM_PROC_NEVER;
> > +	else
> > +		return -EINVAL;
> > +
> > +	task = get_proc_task(file_inode(file));
> > +	if (!task)
> > +		return -ESRCH;
> > +	mm = get_task_mm(task);
> > +	if (!mm)
> > +		goto out_put_task;
> > +
> > +	if (mm->ksm_enabled != value) {
> > +		if (mmap_write_lock_killable(mm)) {
> > +			err = -EINTR;
> > +			goto out_mmput;
> > +		}
> > +		if (value == KSM_PROC_NEVER)
> > +			mm->ksm_enabled = value;
> > +		else {
> > +			/*
> > +			 * No matter whether it's KSM_PROC_ALWAYS or KSM_PROC_MADVISE, we need
> > +			 * to recheck mm->flags to guarantee that this mm is in ksm_scan.
> > +			 */
> > +			if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > +				err = __ksm_enter(mm);
> > +			if (!err)
> > +				mm->ksm_enabled = value;
> > +		}
> > +		mmap_write_unlock(mm);
> > +	}
> > +
> > +out_mmput:
> > +	mmput(mm);
> > +out_put_task:
> > +	put_task_struct(task);
> > +	return err < 0 ? err : count;
> > +}
> > +
> > +static int ksm_enabled_open(struct inode *inode, struct file *filp)
> > +{
> > +	return single_open(filp, ksm_enabled_show, inode);
> > +}
> > +
> > +static const struct file_operations proc_pid_ksm_enabled_operations = {
> > +	.open       = ksm_enabled_open,
> > +	.read		= seq_read,
> > +	.write		= ksm_enabled_write,
> > +	.llseek		= seq_lseek,
> > +	.release	= single_release,
> > +};
> > +#endif /*CONFIG_KSM */
> >  
> >  #ifdef CONFIG_STACKLEAK_METRICS
> >  static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> > @@ -3306,6 +3404,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  #endif
> >  #ifdef CONFIG_KSM
> >  	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
> > +	REG("ksm_enabled", S_IRUGO|S_IWUSR, proc_pid_ksm_enabled_operations),
> >  #endif
> >  };
> >  
> > @@ -3642,6 +3741,7 @@ static const struct pid_entry tid_base_stuff[] = {
> >  #endif
> >  #ifdef CONFIG_KSM
> >  	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
> > +	REG("ksm_enabled", S_IRUGO|S_IWUSR, proc_pid_ksm_enabled_operations),
> >  #endif
> >  };
> >  
> > diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> > index 0b4f17418f64..29d23d208b54 100644
> > --- a/include/linux/ksm.h
> > +++ b/include/linux/ksm.h
> > @@ -19,6 +19,11 @@ struct stable_node;
> >  struct mem_cgroup;
> >  
> >  #ifdef CONFIG_KSM
> > +
> > +#define KSM_PROC_MADVISE	0
> > +#define KSM_PROC_ALWAYS		1
> > +#define KSM_PROC_NEVER		2
> > +
> >  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> >  		unsigned long end, int advice, unsigned long *vm_flags);
> >  int __ksm_enter(struct mm_struct *mm);
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 417ef1519475..29fd4c84d08c 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -649,6 +649,16 @@ struct mm_struct {
> >  		 * merging.
> >  		 */
> >  		unsigned long ksm_merging_pages;
> > +
> > +		/*
> > +		 * Represent the state of this mm involing in KSM, with 3 states:
> > +		 * 1) KSM_PROC_ALWAYS:  force all anonymous VMAs of this process to
> > +		 *						be scanned.
> > +		 * 2) KSM_PROC_MADVISE: the default state, unless user code call
> > +		 *						madvise, don't scan this process.
> > +		 * 3) KSM_PROC_NEVER: never be involed in KSM.
> > +		 */
> > +		int ksm_enabled;
> >  #endif
> >  	} __randomize_layout;
> >  
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 26da7f813f23..90cc8eda8bca 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -334,6 +334,35 @@ static void __init ksm_slab_free(void)
> >  	mm_slot_cache = NULL;
> >  }
> >  
> > +static bool vma_scannable(struct vm_area_struct *vma)
> > +{
> > +	unsigned long vm_flags = vma->vm_flags;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +
> > +	if (mm->ksm_enabled == KSM_PROC_NEVER ||
> > +			(!(vma->vm_flags & VM_MERGEABLE) &&
> > +			mm->ksm_enabled != KSM_PROC_ALWAYS))
> > +		return false;
> > +
> > +	if (vm_flags & (VM_SHARED	| VM_MAYSHARE	|
> > +			VM_PFNMAP	| VM_IO | VM_DONTEXPAND |
> > +			VM_HUGETLB	| VM_MIXEDMAP))
> > +		return false;       /* just ignore this vma*/
> > +
> > +	if (vma_is_dax(vma))
> > +		return false;
> > +#ifdef VM_SAO
> > +	if (vm_flags & VM_SAO)
> > +		return false;
> > +#endif
> > +#ifdef VM_SPARC_ADI
> > +	if (vm_flags & VM_SPARC_ADI)
> > +		return false;
> > +#endif
> > +
> > +	return true;
> > +}
> > +
> >  static __always_inline bool is_stable_node_chain(struct stable_node *chain)
> >  {
> >  	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
> > @@ -523,7 +552,7 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
> >  	if (ksm_test_exit(mm))
> >  		return NULL;
> >  	vma = vma_lookup(mm, addr);
> > -	if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> > +	if (!vma || !vma_scannable(vma) || !vma->anon_vma)
> >  		return NULL;
> >  	return vma;
> >  }
> > @@ -990,7 +1019,7 @@ static int unmerge_and_remove_all_rmap_items(void)
> >  		for_each_vma(vmi, vma) {
> >  			if (ksm_test_exit(mm))
> >  				break;
> > -			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> > +			if (!vma_scannable(vma) || !vma->anon_vma)
> >  				continue;
> >  			err = unmerge_ksm_pages(vma,
> >  						vma->vm_start, vma->vm_end);
> > @@ -2300,7 +2329,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> >  		goto no_vmas;
> >  
> >  	for_each_vma(vmi, vma) {
> > -		if (!(vma->vm_flags & VM_MERGEABLE))
> > +		if (!vma_scannable(vma))
> >  			continue;
> >  		if (ksm_scan.address < vma->vm_start)
> >  			ksm_scan.address = vma->vm_start;
> > -- 
> > 2.25.1
> 
> -- 
> Michal Hocko
> SUSE Labs
Balbir Singh May 18, 2022, 6:58 a.m. UTC | #3
On Tue, May 17, 2022 at 09:27:01AM +0000, cgel.zte@gmail.com wrote:
> From: xu xin <xu.xin16@zte.com.cn>
> 
> For now, if we want to use KSM to merge pages of some apps, we have to
> explicitly call madvise() in application code, which means installed
> apps on OS needs to be uninstall and source code needs to be modified.
> It is very inconvenient because sometimes users or app developers are not
> willing to modify their app source codes for any reasons.
>

I thought we got process_madvise in 5.13, doesn't that help solve the
problem with extensions to advice and an appropriate wrapper? Granted
it might be a little racy with a running process potentially
participating in KSM already. If we are concerned about KSM, I would
assume that we care about long running processes, is that true?

Balbir Singh
CGEL May 18, 2022, 7:40 a.m. UTC | #4
On Wed, May 18, 2022 at 04:58:27PM +1000, Balbir Singh wrote:
> On Tue, May 17, 2022 at 09:27:01AM +0000, cgel.zte@gmail.com wrote:
> > From: xu xin <xu.xin16@zte.com.cn>
> > 
> > For now, if we want to use KSM to merge pages of some apps, we have to
> > explicitly call madvise() in application code, which means installed
> > apps on OS needs to be uninstall and source code needs to be modified.
> > It is very inconvenient because sometimes users or app developers are not
> > willing to modify their app source codes for any reasons.
> >
> 
> I thought we got process_madvise in 5.13, doesn't that help solve the
> problem with extensions to advice and an appropriate wrapper? Granted
> it might be a little racy with a running process potentially
> participating in KSM already. 

The reasons why I don't use process_madvise to achieve this include:

1. see https://lore.kernel.org/lkml/1835064.A2aMcgg3dW@natalenko.name/

2. process_madvise is still a kind of madvise. processs_madvise from
another process overrides the intention of origin app code ifself that
also calls madvise, which is unrecoverable. For example, if a process "A"
which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
already, meanwhile, if another process which doesn't know the information
of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
"A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
is erasured permanently.

> If we are concerned about KSM, I would assume that we care about long running processes, is that true?
> 

Yes, it is.

> Balbir Singh
Michal Hocko May 18, 2022, 12:12 p.m. UTC | #5
On Wed 18-05-22 02:47:06, CGEL wrote:
> On Tue, May 17, 2022 at 04:04:38PM +0200, Michal Hocko wrote:
> > [CCing Hugh and linux-api]
> > 
> > On Tue 17-05-22 09:27:01, cgel.zte@gmail.com wrote:
> > > From: xu xin <xu.xin16@zte.com.cn>
> > > 
> > > For now, if we want to use KSM to merge pages of some apps, we have to
> > > explicitly call madvise() in application code, which means installed
> > > apps on OS needs to be uninstall and source code needs to be modified.
> > > It is very inconvenient because sometimes users or app developers are not
> > > willing to modify their app source codes for any reasons.
> 
> Hello, Michal.
> > 
> > Would it help to allow external control by process_madvise?
> >
> 
> Maybe, but it will be much more complicated to achieve this by
> process_madvise(). process_madvise works on a serires of VMAs in
> essential, So all the eligible VMAs have to be traversed. The problem
> about this has been discussed in [1],[2].
> [1]https://lore.kernel.org/lkml/1835064.A2aMcgg3dW@natalenko.name/
> [2]https://lore.kernel.org/lkml/20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org/

I can see that this is not a trivial interface to use but I do not
think this would be too hard to be usable. There is certainly some
coordination required between the external and the target tasks. But
that is to be expected IMHO. You do not really want to configure merging
without actually understanding what the application does and whether
that is really OK. Right?

Besides that, as far as I remember, the process_madvise interface
doesn't really require exact vmas to be provided and a single range can
span multiple VMAs.

> > > So to use KSM more flexibly, we provide a new proc file "ksm_enabled" under
> > > /proc/<pid>/. We can pass parameter into this file with one of three values
> > > as follows:
> > > 
> > > 	always:
> > > 		force all anonymous and eligible VMAs of this process to be
> > > 		scanned by ksmd.
> > > 	madvise:
> > > 		the default state, unless user code call madvise, ksmd
> > > 		doesn't scan this process.
> > > 	never:
> > > 		this process will never be scanned by ksmd and no merged
> > > 		pages occurred in this process.
> > > 
> > > With this patch, we can control KSM with ``/proc/<pid>/ksm_enabled``
> > > based on every process. KSM for each process can be entirely disabled
> > > (mostly for debugging purposes) or only enabled inside MADV_MERGEABLE
> > > regions (to avoid the risk of consuming more cpu resources to scan for
> > > ksmd) or enabled entirely for a process.
> > 
> > I am not really familiar with KSM much but I am wondering whether the
> > proc based interface is really the best fit. We have a very similar
> > concern with THP where processes would like to override the global setup
> > and that has been done with prctl interface. Is there any reason why
> > this should be any different?
> > 
> At least for now, I can't find a simpler implementation than proc file,
> unless we add a new syscall used for changing another process mm's flag
> in user space.

What is the problem with the prctl extension?

> Speaking to THP, the interactive UI of KSM is relatively simpler because
> KSM dosen't have global knob like THP. OTOH, THP trades space for time
> (consume memory) while KSM trades time for space (save memory), so THP
> tends to be enabled system wide while KSM not.
> 
> > Another question I have is about the interaction of the per-process
> > tunable with any explicit madvise calls. AFAICS you have made this knob
> > per mm but the actual implementation currently relies on the per-vma
> > flags. That means that one can explicitly disallow merging by madvise
> > for a range. Is it wise to override that by a per-process knob? I mean
> > there might be a very good reason why a particular memory ranges should
> > never be merged but a per-process knob could easily ignore that hint
> > from the application. Or am I just confuse?
> For now, there is no any hints for letting KSM never merge some memory
> ranges.

I am not sure I understand. Could you be more specific?
Michal Hocko May 18, 2022, 12:14 p.m. UTC | #6
On Wed 18-05-22 07:40:30, CGEL wrote:
[...]
> 2. process_madvise is still a kind of madvise. processs_madvise from
> another process overrides the intention of origin app code ifself that
> also calls madvise, which is unrecoverable. For example, if a process "A"
> which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
> already, meanwhile, if another process which doesn't know the information
> of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
> "A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
> is erasured permanently.

I do not really follow. How is this any different from an external
process modifying the process wide policy via the proc or any other
interface?
Jann Horn May 18, 2022, 2:31 p.m. UTC | #7
On Tue, May 17, 2022 at 11:27 AM <cgel.zte@gmail.com> wrote:
> For now, if we want to use KSM to merge pages of some apps, we have to
> explicitly call madvise() in application code, which means installed
> apps on OS needs to be uninstall and source code needs to be modified.
> It is very inconvenient because sometimes users or app developers are not
> willing to modify their app source codes for any reasons.

As a sidenote: If you're going to enable KSM on your devices, I hope
you're aware that KSM significantly reduces security -
when cloud providers were using KSM, there were a bunch of papers that
abused it for attacks. In particular, KSM inherently creates
significant information leaks, because an attacker can determine
whether a memory page with specific content exists in other apps
through timing side channels. In the worst case, this could lead to an
attacker being able to steal things like authentication tokens out of
other apps.

If you see significant memory savings from enabling KSM, it might be a
good idea to look into where exactly those savings are coming from,
and look into whether there is a better way to reduce memory
utilization that doesn't rely on comparing entire pages against each
other.

See https://arxiv.org/pdf/2111.08553.pdf for a recent research paper
that shows that memory deduplication can even make it possible to
remotely (!) leak memory contents out of a machine, over the internet.

(On top of that, KSM can also make it easier to pull off Rowhammer
attacks in some contexts -
see https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
.)
CGEL May 19, 2022, 3:39 a.m. UTC | #8
On Wed, May 18, 2022 at 04:31:26PM +0200, Jann Horn wrote:
> On Tue, May 17, 2022 at 11:27 AM <cgel.zte@gmail.com> wrote:
> > For now, if we want to use KSM to merge pages of some apps, we have to
> > explicitly call madvise() in application code, which means installed
> > apps on OS needs to be uninstall and source code needs to be modified.
> > It is very inconvenient because sometimes users or app developers are not
> > willing to modify their app source codes for any reasons.
> 
> As a sidenote: If you're going to enable KSM on your devices, I hope
> you're aware that KSM significantly reduces security -
> when cloud providers were using KSM, there were a bunch of papers that
> abused it for attacks. In particular, KSM inherently creates
> significant information leaks, because an attacker can determine
> whether a memory page with specific content exists in other apps
> through timing side channels. In the worst case, this could lead to an
> attacker being able to steal things like authentication tokens out of
> other apps.
> 
> If you see significant memory savings from enabling KSM, it might be a
> good idea to look into where exactly those savings are coming from,
> and look into whether there is a better way to reduce memory
> utilization that doesn't rely on comparing entire pages against each
> other.
> 
> See https://arxiv.org/pdf/2111.08553.pdf for a recent research paper
> that shows that memory deduplication can even make it possible to
> remotely (!) leak memory contents out of a machine, over the internet.
> 
> (On top of that, KSM can also make it easier to pull off Rowhammer
> attacks in some contexts -
> see https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
> .)

Thank you for your reply. The information you provided is very
meaningful. However, the administrator should have the right to decide
whether to use KSM. The kernel should provide a flexible mechanism to
use KSM. How to use KSM safely should be decided by the user's security
policy.
CGEL May 19, 2022, 6:23 a.m. UTC | #9
On Wed, May 18, 2022 at 02:12:26PM +0200, Michal Hocko wrote:
> On Wed 18-05-22 02:47:06, CGEL wrote:
> > On Tue, May 17, 2022 at 04:04:38PM +0200, Michal Hocko wrote:
> > > [CCing Hugh and linux-api]
> > > 
> > > On Tue 17-05-22 09:27:01, cgel.zte@gmail.com wrote:
> > > > From: xu xin <xu.xin16@zte.com.cn>
> > > > 
> > > > For now, if we want to use KSM to merge pages of some apps, we have to
> > > > explicitly call madvise() in application code, which means installed
> > > > apps on OS needs to be uninstall and source code needs to be modified.
> > > > It is very inconvenient because sometimes users or app developers are not
> > > > willing to modify their app source codes for any reasons.
> > 
> > Hello, Michal.
> > > 
> > > Would it help to allow external control by process_madvise?
> > >
> > 
> > Maybe, but it will be much more complicated to achieve this by
> > process_madvise(). process_madvise works on a serires of VMAs in
> > essential, So all the eligible VMAs have to be traversed. The problem
> > about this has been discussed in [1],[2].
> > [1]https://lore.kernel.org/lkml/1835064.A2aMcgg3dW@natalenko.name/
> > [2]https://lore.kernel.org/lkml/20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org/
> 
> I can see that this is not a trivial interface to use but I do not
> think this would be too hard to be usable. There is certainly some
> coordination required between the external and the target tasks. But
> that is to be expected IMHO. You do not really want to configure merging
> without actually understanding what the application does and whether
> that is really OK. Right?
> 
> Besides that, as far as I remember, the process_madvise interface
> doesn't really require exact vmas to be provided and a single range can
> span multiple VMAs.

Yes, but all the eligible VMAs still have to be traversed after all. It may
induce more latency than needed because the target task has to be
stopped to avoid races.


> 
> > > > So to use KSM more flexibly, we provide a new proc file "ksm_enabled" under
> > > > /proc/<pid>/. We can pass parameter into this file with one of three values
> > > > as follows:
> > > > 
> > > > 	always:
> > > > 		force all anonymous and eligible VMAs of this process to be
> > > > 		scanned by ksmd.
> > > > 	madvise:
> > > > 		the default state, unless user code call madvise, ksmd
> > > > 		doesn't scan this process.
> > > > 	never:
> > > > 		this process will never be scanned by ksmd and no merged
> > > > 		pages occurred in this process.
> > > > 
> > > > With this patch, we can control KSM with ``/proc/<pid>/ksm_enabled``
> > > > based on every process. KSM for each process can be entirely disabled
> > > > (mostly for debugging purposes) or only enabled inside MADV_MERGEABLE
> > > > regions (to avoid the risk of consuming more cpu resources to scan for
> > > > ksmd) or enabled entirely for a process.
> > > 
> > > I am not really familiar with KSM much but I am wondering whether the
> > > proc based interface is really the best fit. We have a very similar
> > > concern with THP where processes would like to override the global setup
> > > and that has been done with prctl interface. Is there any reason why
> > > this should be any different?
> > > 
> > At least for now, I can't find a simpler implementation than proc file,
> > unless we add a new syscall used for changing another process mm's flag
> > in user space.
> 
> What is the problem with the prctl extension?
>

What's the meaning of the prctl extension?
I don't quite get your point. Can prctl control external process?

> > Speaking to THP, the interactive UI of KSM is relatively simpler because
> > KSM dosen't have global knob like THP. OTOH, THP trades space for time
> > (consume memory) while KSM trades time for space (save memory), so THP
> > tends to be enabled system wide while KSM not.
> > 
> > > Another question I have is about the interaction of the per-process
> > > tunable with any explicit madvise calls. AFAICS you have made this knob
> > > per mm but the actual implementation currently relies on the per-vma
> > > flags. That means that one can explicitly disallow merging by madvise
> > > for a range. Is it wise to override that by a per-process knob? I mean
> > > there might be a very good reason why a particular memory ranges should
> > > never be merged but a per-process knob could easily ignore that hint
> > > from the application. Or am I just confuse?
> > For now, there is no any hints for letting KSM never merge some memory
> > ranges.
> 
> I am not sure I understand. Could you be more specific?

Not like THP, KSM doesn't have anything like VM_NOHUGEPAGE, so apps
cann't explicitly disallow merging by madvise. If it is really necessary for
a particular meory ranges of a process to be never merged, we have to submit
one more patch to achieve that.

> -- 
> Michal Hocko
> SUSE Labs
CGEL May 19, 2022, 6:35 a.m. UTC | #10
On Wed, May 18, 2022 at 02:14:28PM +0200, Michal Hocko wrote:
> On Wed 18-05-22 07:40:30, CGEL wrote:
> [...]
> > 2. process_madvise is still a kind of madvise. processs_madvise from
> > another process overrides the intention of origin app code ifself that
> > also calls madvise, which is unrecoverable. For example, if a process "A"
> > which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
> > already, meanwhile, if another process which doesn't know the information
> > of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
> > "A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
> > is erasured permanently.
> 
> I do not really follow. How is this any different from an external
> process modifying the process wide policy via the proc or any other
> interface?

In this patch, you can see that we didn't modify the flag of any VMA of
the target process, which is different from process_madvise. So it is
easy to keep the original MERGEABLE information of the target process
when we turn back to the default state from the state "always".

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko May 19, 2022, 7:35 a.m. UTC | #11
On Thu 19-05-22 06:23:30, CGEL wrote:
> On Wed, May 18, 2022 at 02:12:26PM +0200, Michal Hocko wrote:
> > On Wed 18-05-22 02:47:06, CGEL wrote:
> > > On Tue, May 17, 2022 at 04:04:38PM +0200, Michal Hocko wrote:
> > > > [CCing Hugh and linux-api]
> > > > 
> > > > On Tue 17-05-22 09:27:01, cgel.zte@gmail.com wrote:
> > > > > From: xu xin <xu.xin16@zte.com.cn>
> > > > > 
> > > > > For now, if we want to use KSM to merge pages of some apps, we have to
> > > > > explicitly call madvise() in application code, which means installed
> > > > > apps on OS needs to be uninstall and source code needs to be modified.
> > > > > It is very inconvenient because sometimes users or app developers are not
> > > > > willing to modify their app source codes for any reasons.
> > > 
> > > Hello, Michal.
> > > > 
> > > > Would it help to allow external control by process_madvise?
> > > >
> > > 
> > > Maybe, but it will be much more complicated to achieve this by
> > > process_madvise(). process_madvise works on a serires of VMAs in
> > > essential, So all the eligible VMAs have to be traversed. The problem
> > > about this has been discussed in [1],[2].
> > > [1]https://lore.kernel.org/lkml/1835064.A2aMcgg3dW@natalenko.name/
> > > [2]https://lore.kernel.org/lkml/20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org/
> > 
> > I can see that this is not a trivial interface to use but I do not
> > think this would be too hard to be usable. There is certainly some
> > coordination required between the external and the target tasks. But
> > that is to be expected IMHO. You do not really want to configure merging
> > without actually understanding what the application does and whether
> > that is really OK. Right?
> > 
> > Besides that, as far as I remember, the process_madvise interface
> > doesn't really require exact vmas to be provided and a single range can
> > span multiple VMAs.
> 
> Yes, but all the eligible VMAs still have to be traversed after all. It may
> induce more latency than needed because the target task has to be
> stopped to avoid races.

Does it really? I mean, you are right that some sort of synchronization
(e.g. freezing) is required to prevent from address space modifications.
My question is whether this is really required for something that is
mostly an optimization. Missing some VMAs or failing because of racing
modifications is should be toleratable. Or am I wrong on that?
 
> > > > > So to use KSM more flexibly, we provide a new proc file "ksm_enabled" under
> > > > > /proc/<pid>/. We can pass parameter into this file with one of three values
> > > > > as follows:
> > > > > 
> > > > > 	always:
> > > > > 		force all anonymous and eligible VMAs of this process to be
> > > > > 		scanned by ksmd.
> > > > > 	madvise:
> > > > > 		the default state, unless user code call madvise, ksmd
> > > > > 		doesn't scan this process.
> > > > > 	never:
> > > > > 		this process will never be scanned by ksmd and no merged
> > > > > 		pages occurred in this process.
> > > > > 
> > > > > With this patch, we can control KSM with ``/proc/<pid>/ksm_enabled``
> > > > > based on every process. KSM for each process can be entirely disabled
> > > > > (mostly for debugging purposes) or only enabled inside MADV_MERGEABLE
> > > > > regions (to avoid the risk of consuming more cpu resources to scan for
> > > > > ksmd) or enabled entirely for a process.
> > > > 
> > > > I am not really familiar with KSM much but I am wondering whether the
> > > > proc based interface is really the best fit. We have a very similar
> > > > concern with THP where processes would like to override the global setup
> > > > and that has been done with prctl interface. Is there any reason why
> > > > this should be any different?
> > > > 
> > > At least for now, I can't find a simpler implementation than proc file,
> > > unless we add a new syscall used for changing another process mm's flag
> > > in user space.
> > 
> > What is the problem with the prctl extension?
> >
> 
> What's the meaning of the prctl extension?

Add a new prctl "command"

> I don't quite get your point. Can prctl control external process?

Not directly but the properties set by prctl are usually inherited so it
is trivial to create a spawn process to set up the property and exec
into the target application. The target application is not really aware
it has been executed like that so it doesn't really need any
modifications.

But please note that I am not really suggesting to go with prctl. I am
just saying it is a better interface than proc one. I am still not
convinced this really required and I would prefer process_madvise more.
 
> > > Speaking to THP, the interactive UI of KSM is relatively simpler because
> > > KSM dosen't have global knob like THP. OTOH, THP trades space for time
> > > (consume memory) while KSM trades time for space (save memory), so THP
> > > tends to be enabled system wide while KSM not.
> > > 
> > > > Another question I have is about the interaction of the per-process
> > > > tunable with any explicit madvise calls. AFAICS you have made this knob
> > > > per mm but the actual implementation currently relies on the per-vma
> > > > flags. That means that one can explicitly disallow merging by madvise
> > > > for a range. Is it wise to override that by a per-process knob? I mean
> > > > there might be a very good reason why a particular memory ranges should
> > > > never be merged but a per-process knob could easily ignore that hint
> > > > from the application. Or am I just confuse?
> > > For now, there is no any hints for letting KSM never merge some memory
> > > ranges.
> > 
> > I am not sure I understand. Could you be more specific?
> 
> Not like THP, KSM doesn't have anything like VM_NOHUGEPAGE, so apps
> cann't explicitly disallow merging by madvise. If it is really necessary for
> a particular meory ranges of a process to be never merged, we have to submit
> one more patch to achieve that.

What about MADV_UNMERGEABLE?
Michal Hocko May 19, 2022, 7:39 a.m. UTC | #12
On Thu 19-05-22 06:35:03, CGEL wrote:
> On Wed, May 18, 2022 at 02:14:28PM +0200, Michal Hocko wrote:
> > On Wed 18-05-22 07:40:30, CGEL wrote:
> > [...]
> > > 2. process_madvise is still a kind of madvise. processs_madvise from
> > > another process overrides the intention of origin app code ifself that
> > > also calls madvise, which is unrecoverable. For example, if a process "A"
> > > which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
> > > already, meanwhile, if another process which doesn't know the information
> > > of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
> > > "A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
> > > is erasured permanently.
> > 
> > I do not really follow. How is this any different from an external
> > process modifying the process wide policy via the proc or any other
> > interface?
> 
> In this patch, you can see that we didn't modify the flag of any VMA of
> the target process, which is different from process_madvise. So it is
> easy to keep the original MERGEABLE information of the target process
> when we turn back to the default state from the state "always".

This means that /proc/<pid>/smaps doesn't show the real state, right?
CGEL May 19, 2022, 8:02 a.m. UTC | #13
On Thu, May 19, 2022 at 09:35:30AM +0200, Michal Hocko wrote:
> On Thu 19-05-22 06:23:30, CGEL wrote:
> > On Wed, May 18, 2022 at 02:12:26PM +0200, Michal Hocko wrote:
> > > On Wed 18-05-22 02:47:06, CGEL wrote:
> > > > On Tue, May 17, 2022 at 04:04:38PM +0200, Michal Hocko wrote:
> > > > > [CCing Hugh and linux-api]
> > > > > 
> > > > > On Tue 17-05-22 09:27:01, cgel.zte@gmail.com wrote:
> > > > > per mm but the actual implementation currently relies on the per-vma
> > > > > flags. That means that one can explicitly disallow merging by madvise
> > > > > for a range. Is it wise to override that by a per-process knob? I mean
> > > > > there might be a very good reason why a particular memory ranges should
> > > > > never be merged but a per-process knob could easily ignore that hint
> > > > > from the application. Or am I just confuse?
> > > > For now, there is no any hints for letting KSM never merge some memory
> > > > ranges.
> > > 
> > > I am not sure I understand. Could you be more specific?
> > 
> > Not like THP, KSM doesn't have anything like VM_NOHUGEPAGE, so apps
> > cann't explicitly disallow merging by madvise. If it is really necessary for
> > a particular meory ranges of a process to be never merged, we have to submit
> > one more patch to achieve that.
> 
> What about MADV_UNMERGEABLE?

MADV_UNMERGEABLE and MADV_MERGEABLE usually appear in pairs, MADV_UNMERGEABLE cannot
appear alone. I mean MADV_UNMERGEABLE is used to unmerges whatever it merged in the
specifed range, not to disallow merging.

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko May 19, 2022, 8:24 a.m. UTC | #14
On Thu 19-05-22 08:02:10, CGEL wrote:
> On Thu, May 19, 2022 at 09:35:30AM +0200, Michal Hocko wrote:
> > On Thu 19-05-22 06:23:30, CGEL wrote:
> > > On Wed, May 18, 2022 at 02:12:26PM +0200, Michal Hocko wrote:
> > > > On Wed 18-05-22 02:47:06, CGEL wrote:
> > > > > On Tue, May 17, 2022 at 04:04:38PM +0200, Michal Hocko wrote:
> > > > > > [CCing Hugh and linux-api]
> > > > > > 
> > > > > > On Tue 17-05-22 09:27:01, cgel.zte@gmail.com wrote:
> > > > > > per mm but the actual implementation currently relies on the per-vma
> > > > > > flags. That means that one can explicitly disallow merging by madvise
> > > > > > for a range. Is it wise to override that by a per-process knob? I mean
> > > > > > there might be a very good reason why a particular memory ranges should
> > > > > > never be merged but a per-process knob could easily ignore that hint
> > > > > > from the application. Or am I just confuse?
> > > > > For now, there is no any hints for letting KSM never merge some memory
> > > > > ranges.
> > > > 
> > > > I am not sure I understand. Could you be more specific?
> > > 
> > > Not like THP, KSM doesn't have anything like VM_NOHUGEPAGE, so apps
> > > cann't explicitly disallow merging by madvise. If it is really necessary for
> > > a particular meory ranges of a process to be never merged, we have to submit
> > > one more patch to achieve that.
> > 
> > What about MADV_UNMERGEABLE?
> 
> MADV_UNMERGEABLE and MADV_MERGEABLE usually appear in pairs, MADV_UNMERGEABLE cannot
> appear alone.

That might be the case currently because KSM is an opt-in feature that
has to be explicitly enabled. The existing interface only allows to
enable it by MADV_MERGEABLE but now you are proposing an extension when
there would be other way to achieve the same (with a wider scope but
that is not really all that important). MADV_UNMERGEABLE has a well
defined behavior even on VMAs which are not marked for merging.

Let's say that somebody would like to use a process wide setup except
for few special mappings because merging is not really desirable for
whatever reason. How do you achieve that?

> I mean MADV_UNMERGEABLE is used to unmerges whatever it merged in the
> specifed range, not to disallow merging.

I disagree. It clearly drops the mergeable flag so it effectivelly disallow
merging.
CGEL May 24, 2022, 8:52 a.m. UTC | #15
On Thu, May 19, 2022 at 09:39:57AM +0200, Michal Hocko wrote:
> On Thu 19-05-22 06:35:03, CGEL wrote:
> > On Wed, May 18, 2022 at 02:14:28PM +0200, Michal Hocko wrote:
> > > On Wed 18-05-22 07:40:30, CGEL wrote:
> > > [...]
> > > > 2. process_madvise is still a kind of madvise. processs_madvise from
> > > > another process overrides the intention of origin app code ifself that
> > > > also calls madvise, which is unrecoverable. For example, if a process "A"
> > > > which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
> > > > already, meanwhile, if another process which doesn't know the information
> > > > of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
> > > > "A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
> > > > is erasured permanently.
> > > 
> > > I do not really follow. How is this any different from an external
> > > process modifying the process wide policy via the proc or any other
> > > interface?
> > 
> > In this patch, you can see that we didn't modify the flag of any VMA of
> > the target process, which is different from process_madvise. So it is
> > easy to keep the original MERGEABLE information of the target process
> > when we turn back to the default state from the state "always".
> 
> This means that /proc/<pid>/smaps doesn't show the real state, right?

Maybe we can add extra information of KSM forcible state in /proc/<pid>/smaps
like THPeligible. 

Really, Michal, I think it again, 'process_ madvise' is really not good. In
addition to some shortcomings I said before, If new vmas of the target process
are created after the external process calls process_madvise(), then we have to
call `process_madvise()` on them again, over and over again, regularly, just like
Oleksandr said [1].

[1]: https://lore.kernel.org/lkml/1817008.tdWV9SEqCh@natalenko.name/


> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko May 24, 2022, 9:04 a.m. UTC | #16
On Tue 24-05-22 08:52:02, CGEL wrote:
> On Thu, May 19, 2022 at 09:39:57AM +0200, Michal Hocko wrote:
> > On Thu 19-05-22 06:35:03, CGEL wrote:
> > > On Wed, May 18, 2022 at 02:14:28PM +0200, Michal Hocko wrote:
> > > > On Wed 18-05-22 07:40:30, CGEL wrote:
> > > > [...]
> > > > > 2. process_madvise is still a kind of madvise. processs_madvise from
> > > > > another process overrides the intention of origin app code ifself that
> > > > > also calls madvise, which is unrecoverable. For example, if a process "A"
> > > > > which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
> > > > > already, meanwhile, if another process which doesn't know the information
> > > > > of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
> > > > > "A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
> > > > > is erasured permanently.
> > > > 
> > > > I do not really follow. How is this any different from an external
> > > > process modifying the process wide policy via the proc or any other
> > > > interface?
> > > 
> > > In this patch, you can see that we didn't modify the flag of any VMA of
> > > the target process, which is different from process_madvise. So it is
> > > easy to keep the original MERGEABLE information of the target process
> > > when we turn back to the default state from the state "always".
> > 
> > This means that /proc/<pid>/smaps doesn't show the real state, right?
> 
> Maybe we can add extra information of KSM forcible state in /proc/<pid>/smaps
> like THPeligible. 

That information is already printed and I do not think that adding
another flag or whatever would make the situation much more clear.

> Really, Michal, I think it again, 'process_ madvise' is really not good. In
> addition to some shortcomings I said before, If new vmas of the target process
> are created after the external process calls process_madvise(), then we have to
> call `process_madvise()` on them again, over and over again, regularly, just like
> Oleksandr said [1].

I can see that this is not the most convenient way but so far I haven't
really heard any arguments that this would be impossible.

Look, I am not claiming that process_madvise is the only way to achieve
the goal. I really do not like the proc based interface because it is
rather adhoc and limited. We have other means to set a process wide
property and I do not see any strong arguments agaist prctl.

But more importantly I haven't really seen any serious analysis whether
per-process (resp. per MM) property is even a desirable interface.
Especially in the current form when opting out for certain VMAs is not
possible.
 
> [1]: https://lore.kernel.org/lkml/1817008.tdWV9SEqCh@natalenko.name/
CGEL May 25, 2022, 6:56 a.m. UTC | #17
On Tue, May 24, 2022 at 11:04:40AM +0200, Michal Hocko wrote:
> On Tue 24-05-22 08:52:02, CGEL wrote:
> > On Thu, May 19, 2022 at 09:39:57AM +0200, Michal Hocko wrote:
> > > On Thu 19-05-22 06:35:03, CGEL wrote:
> > > > On Wed, May 18, 2022 at 02:14:28PM +0200, Michal Hocko wrote:
> > > > > On Wed 18-05-22 07:40:30, CGEL wrote:
> > > > > [...]
> > > > > > 2. process_madvise is still a kind of madvise. processs_madvise from
> > > > > > another process overrides the intention of origin app code ifself that
> > > > > > also calls madvise, which is unrecoverable. For example, if a process "A"
> > > > > > which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
> > > > > > already, meanwhile, if another process which doesn't know the information
> > > > > > of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
> > > > > > "A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
> > > > > > is erasured permanently.
> > > > > 
> > > > > I do not really follow. How is this any different from an external
> > > > > process modifying the process wide policy via the proc or any other
> > > > > interface?
> > > > 
> > > > In this patch, you can see that we didn't modify the flag of any VMA of
> > > > the target process, which is different from process_madvise. So it is
> > > > easy to keep the original MERGEABLE information of the target process
> > > > when we turn back to the default state from the state "always".
> > > 
> > > This means that /proc/<pid>/smaps doesn't show the real state, right?
> > 
> > Maybe we can add extra information of KSM forcible state in /proc/<pid>/smaps
> > like THPeligible. 
> 
> That information is already printed and I do not think that adding
> another flag or whatever would make the situation much more clear.
> 
> > Really, Michal, I think it again, 'process_ madvise' is really not good. In
> > addition to some shortcomings I said before, If new vmas of the target process
> > are created after the external process calls process_madvise(), then we have to
> > call `process_madvise()` on them again, over and over again, regularly, just like
> > Oleksandr said [1].
> 
> I can see that this is not the most convenient way but so far I haven't
> really heard any arguments that this would be impossible.
> 
> Look, I am not claiming that process_madvise is the only way to achieve
> the goal. I really do not like the proc based interface because it is
> rather adhoc and limited. We have other means to set a process wide
> property and I do not see any strong arguments agaist prctl.
> 

I can agree with you that proc is adhoc and limit. Use prctl extension
is probably better, but the problem is that it can't control external
process directly.

> But more importantly I haven't really seen any serious analysis whether
> per-process (resp. per MM) property is even a desirable interface.
> Especially in the current form when opting out for certain VMAs is not
> possible.

I think the reasons of using per-process (resp. per MM) property are as
follows:

The KSM mandatory attribute is set for the entire mm space rather than
some VMAs. Its system is to allow all eligible VMAs of the entire mm to
participate in KSM. Although marking all VMAs as mergeble can achieve the
same purpose, the concept is different:

From another perspective, for example, the rule of a company is to hold
a morning meeting at 9:30 a.m., but one day, the local law stipulates
that it is illegal to go to work before 10 o'clock, then this rule of
the company have to be covered and invalid. Here, 'mm->ksm_ enabled' is
analogous to local laws, while the company's rule is analogous to VMA
-> flag. One day, after the local law is repealed, the company's rule
can still be restored.

>  
> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko May 25, 2022, 7:38 a.m. UTC | #18
On Wed 25-05-22 06:56:05, CGEL wrote:
> On Tue, May 24, 2022 at 11:04:40AM +0200, Michal Hocko wrote:
> > On Tue 24-05-22 08:52:02, CGEL wrote:
> > > On Thu, May 19, 2022 at 09:39:57AM +0200, Michal Hocko wrote:
> > > > On Thu 19-05-22 06:35:03, CGEL wrote:
> > > > > On Wed, May 18, 2022 at 02:14:28PM +0200, Michal Hocko wrote:
> > > > > > On Wed 18-05-22 07:40:30, CGEL wrote:
> > > > > > [...]
> > > > > > > 2. process_madvise is still a kind of madvise. processs_madvise from
> > > > > > > another process overrides the intention of origin app code ifself that
> > > > > > > also calls madvise, which is unrecoverable. For example, if a process "A"
> > > > > > > which madvises just one part of VMAs (not all) as MERGEABLE run on the OS
> > > > > > > already, meanwhile, if another process which doesn't know the information
> > > > > > > of "A" 's MERGEABLE areas, then call process_madvise to advise all VMAs of
> > > > > > > "A" as MERGEABLE, the original MERGEABLE information of "A" calling madivse
> > > > > > > is erasured permanently.
> > > > > > 
> > > > > > I do not really follow. How is this any different from an external
> > > > > > process modifying the process wide policy via the proc or any other
> > > > > > interface?
> > > > > 
> > > > > In this patch, you can see that we didn't modify the flag of any VMA of
> > > > > the target process, which is different from process_madvise. So it is
> > > > > easy to keep the original MERGEABLE information of the target process
> > > > > when we turn back to the default state from the state "always".
> > > > 
> > > > This means that /proc/<pid>/smaps doesn't show the real state, right?
> > > 
> > > Maybe we can add extra information of KSM forcible state in /proc/<pid>/smaps
> > > like THPeligible. 
> > 
> > That information is already printed and I do not think that adding
> > another flag or whatever would make the situation much more clear.
> > 
> > > Really, Michal, I think it again, 'process_ madvise' is really not good. In
> > > addition to some shortcomings I said before, If new vmas of the target process
> > > are created after the external process calls process_madvise(), then we have to
> > > call `process_madvise()` on them again, over and over again, regularly, just like
> > > Oleksandr said [1].
> > 
> > I can see that this is not the most convenient way but so far I haven't
> > really heard any arguments that this would be impossible.
> > 
> > Look, I am not claiming that process_madvise is the only way to achieve
> > the goal. I really do not like the proc based interface because it is
> > rather adhoc and limited. We have other means to set a process wide
> > property and I do not see any strong arguments agaist prctl.
> > 
> 
> I can agree with you that proc is adhoc and limit. Use prctl extension
> is probably better, but the problem is that it can't control external
> process directly.

Why is that a problem? I have seen this claim in this thread already but
never really backed by a usecase where this would be a real problem.
Either your tasks want to have their memory KSMed or not. Why do you
need to change that during the process runtime?

> > But more importantly I haven't really seen any serious analysis whether
> > per-process (resp. per MM) property is even a desirable interface.
> > Especially in the current form when opting out for certain VMAs is not
> > possible.
> 
> I think the reasons of using per-process (resp. per MM) property are as
> follows:
> 
> The KSM mandatory attribute is set for the entire mm space rather than
> some VMAs. Its system is to allow all eligible VMAs of the entire mm to
> participate in KSM. Although marking all VMAs as mergeble can achieve the
> same purpose, the concept is different:

You are not really answering my question but now that you have brought
that up I think that implemenation which would mark all eligible VMAs
as VM_MERGE and implicitly set the same for any new VMA would be more
reasonable. It wouldn't have the above mentioned problem with
MADV_UNMERGEABLE.

The thing though is whether there are any usecases which benefit from
implicit sharing for all the anonymous memory (including stacks, brk and
any random private mmaping backing the heap allocators).

> From another perspective, for example, the rule of a company is to hold
> a morning meeting at 9:30 a.m., but one day, the local law stipulates
> that it is illegal to go to work before 10 o'clock, then this rule of
> the company have to be covered and invalid. Here, 'mm->ksm_ enabled' is
> analogous to local laws, while the company's rule is analogous to VMA
> -> flag. One day, after the local law is repealed, the company's rule
> can still be restored.

I am sorry but I got lost in your example. Could you talk about specific
usecases?
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index b244f0202a03..91326198e37f 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -32,7 +32,7 @@  are swapped back in: ksmd must rediscover their identity and merge again).
 Controlling KSM with madvise
 ============================
 
-KSM only operates on those areas of address space which an application
+KSM can operate on those areas of address space which an application
 has advised to be likely candidates for merging, by using the madvise(2)
 system call::
 
@@ -70,6 +70,28 @@  Applications should be considerate in their use of MADV_MERGEABLE,
 restricting its use to areas likely to benefit.  KSM's scans may use a lot
 of processing power: some installations will disable KSM for that reason.
 
+Controlling KSM with procfs
+===========================
+We can also control KSM with ``/proc/<pid>/ksm_enabled`` based on every
+process. KSM for each process can be entirely disabled (mostly for
+debugging purposes) or only enabled inside MADV_MERGEABLE regions (to avoid
+the risk of consuming more cpu resources to scan for ksmd) or enabled entirely
+for a process. This can be achieved with one of::
+
+    echo always > /proc/<pid>/ksm_enabled
+    echo madvise > /proc/<pid>/ksm_enabled
+    echo never > /proc/<pid>/ksm_enabled
+
+always:
+        force all anonymous and eligible VMAs of this process to be scanned
+        by ksmd.
+madvise:
+        the default state, unless user code call madvise, ksmd doesn't scan
+        this process.
+never:
+        this process will never be scanned by ksmd and no merged pages
+	occurred in this process.
+
 .. _ksm_sysfs:
 
 KSM daemon sysfs interface
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 1bc91fb8c321..ea7e08a1c143 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -47,6 +47,7 @@  fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
   3.12	/proc/<pid>/arch_status - Task architecture specific information
+  3.13  /proc/<pid>/ksm_enabled - Controlling KSM based on process
 
   4	Configuring procfs
   4.1	Mount options
@@ -2140,6 +2141,19 @@  AVX512_elapsed_ms
   the task is unlikely an AVX512 user, but depends on the workload and the
   scheduling scenario, it also could be a false negative mentioned above.
 
+3.13 /proc/<pid>/ksm_enabled - Controlling KSM based on process
+---------------------------------------------------------------
+When CONFIG_KSM is enabled, this file can be used to specify how this
+process's anonymous memory gets involved in KSM scanning.
+
+If writing "always" to this file, it will force all anonymous and eligible
+VMAs of this process to be scanned by ksmd.
+
+If writing "madvise" to this file, turn to the default state, unless user
+code call madvise, ksmd doesn't scan this process.
+
+If writing "never" to this file, this process will never be scanned by ksmd.
+
 Chapter 4: Configuring procfs
 =============================
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617816168748..760ceeab4aa1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -96,6 +96,7 @@ 
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
 #include <linux/cn_proc.h>
+#include <linux/ksm.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -3171,7 +3172,104 @@  static int proc_pid_ksm_merging_pages(struct seq_file *m, struct pid_namespace *
 
 	return 0;
 }
-#endif /* CONFIG_KSM */
+
+static int ksm_enabled_show(struct seq_file *m, void *v)
+{
+	struct inode *inode = m->private;
+	struct mm_struct *mm;
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ESRCH;
+
+	mm = get_task_mm(task);
+	if (mm) {
+		if (mm->ksm_enabled == KSM_PROC_ALWAYS)
+			seq_puts(m, "[always] madvise never\n");
+		else if (mm->ksm_enabled == KSM_PROC_MADVISE)
+			seq_puts(m, "always [madvise] never\n");
+		else
+			seq_puts(m, "always madvise [never]\n");
+		mmput(mm);
+	}
+
+	put_task_struct(task);
+	return 0;
+}
+
+static ssize_t ksm_enabled_write(struct file *file, const char __user *buf,
+								size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	char buffer[PROC_NUMBUF];
+	int value;
+	int err = 0;
+	long str_len;
+
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	str_len = strncpy_from_user(buffer, buf, count);
+	if (str_len < 0)
+		return -EFAULT;
+	buffer[str_len - 1] = '\0';
+
+	if (!strcmp("always", buffer))
+		value = KSM_PROC_ALWAYS;
+	else if (!strcmp("madvise", buffer))
+		value = KSM_PROC_MADVISE;
+	else if (!strcmp("never", buffer))
+		value = KSM_PROC_NEVER;
+	else
+		return -EINVAL;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_put_task;
+
+	if (mm->ksm_enabled != value) {
+		if (mmap_write_lock_killable(mm)) {
+			err = -EINTR;
+			goto out_mmput;
+		}
+		if (value == KSM_PROC_NEVER)
+			mm->ksm_enabled = value;
+		else {
+			/*
+			 * No matter whether it's KSM_PROC_ALWAYS or KSM_PROC_MADVISE, we need
+			 * to recheck mm->flags to guarantee that this mm is in ksm_scan.
+			 */
+			if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
+				err = __ksm_enter(mm);
+			if (!err)
+				mm->ksm_enabled = value;
+		}
+		mmap_write_unlock(mm);
+	}
+
+out_mmput:
+	mmput(mm);
+out_put_task:
+	put_task_struct(task);
+	return err < 0 ? err : count;
+}
+
+static int ksm_enabled_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, ksm_enabled_show, inode);
+}
+
+static const struct file_operations proc_pid_ksm_enabled_operations = {
+	.open       = ksm_enabled_open,
+	.read		= seq_read,
+	.write		= ksm_enabled_write,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif /*CONFIG_KSM */
 
 #ifdef CONFIG_STACKLEAK_METRICS
 static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
@@ -3306,6 +3404,7 @@  static const struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_KSM
 	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
+	REG("ksm_enabled", S_IRUGO|S_IWUSR, proc_pid_ksm_enabled_operations),
 #endif
 };
 
@@ -3642,6 +3741,7 @@  static const struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_KSM
 	ONE("ksm_merging_pages",  S_IRUSR, proc_pid_ksm_merging_pages),
+	REG("ksm_enabled", S_IRUGO|S_IWUSR, proc_pid_ksm_enabled_operations),
 #endif
 };
 
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 0b4f17418f64..29d23d208b54 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -19,6 +19,11 @@  struct stable_node;
 struct mem_cgroup;
 
 #ifdef CONFIG_KSM
+
+#define KSM_PROC_MADVISE	0
+#define KSM_PROC_ALWAYS		1
+#define KSM_PROC_NEVER		2
+
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 417ef1519475..29fd4c84d08c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -649,6 +649,16 @@  struct mm_struct {
 		 * merging.
 		 */
 		unsigned long ksm_merging_pages;
+
+		/*
+		 * Represent the state of this mm involing in KSM, with 3 states:
+		 * 1) KSM_PROC_ALWAYS:  force all anonymous VMAs of this process to
+		 *						be scanned.
+		 * 2) KSM_PROC_MADVISE: the default state, unless user code call
+		 *						madvise, don't scan this process.
+		 * 3) KSM_PROC_NEVER: never be involed in KSM.
+		 */
+		int ksm_enabled;
 #endif
 	} __randomize_layout;
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 26da7f813f23..90cc8eda8bca 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -334,6 +334,35 @@  static void __init ksm_slab_free(void)
 	mm_slot_cache = NULL;
 }
 
+static bool vma_scannable(struct vm_area_struct *vma)
+{
+	unsigned long vm_flags = vma->vm_flags;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (mm->ksm_enabled == KSM_PROC_NEVER ||
+			(!(vma->vm_flags & VM_MERGEABLE) &&
+			mm->ksm_enabled != KSM_PROC_ALWAYS))
+		return false;
+
+	if (vm_flags & (VM_SHARED	| VM_MAYSHARE	|
+			VM_PFNMAP	| VM_IO | VM_DONTEXPAND |
+			VM_HUGETLB	| VM_MIXEDMAP))
+		return false;       /* just ignore this vma*/
+
+	if (vma_is_dax(vma))
+		return false;
+#ifdef VM_SAO
+	if (vm_flags & VM_SAO)
+		return false;
+#endif
+#ifdef VM_SPARC_ADI
+	if (vm_flags & VM_SPARC_ADI)
+		return false;
+#endif
+
+	return true;
+}
+
 static __always_inline bool is_stable_node_chain(struct stable_node *chain)
 {
 	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
@@ -523,7 +552,7 @@  static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 	if (ksm_test_exit(mm))
 		return NULL;
 	vma = vma_lookup(mm, addr);
-	if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+	if (!vma || !vma_scannable(vma) || !vma->anon_vma)
 		return NULL;
 	return vma;
 }
@@ -990,7 +1019,7 @@  static int unmerge_and_remove_all_rmap_items(void)
 		for_each_vma(vmi, vma) {
 			if (ksm_test_exit(mm))
 				break;
-			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+			if (!vma_scannable(vma) || !vma->anon_vma)
 				continue;
 			err = unmerge_ksm_pages(vma,
 						vma->vm_start, vma->vm_end);
@@ -2300,7 +2329,7 @@  static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 		goto no_vmas;
 
 	for_each_vma(vmi, vma) {
-		if (!(vma->vm_flags & VM_MERGEABLE))
+		if (!vma_scannable(vma))
 			continue;
 		if (ksm_scan.address < vma->vm_start)
 			ksm_scan.address = vma->vm_start;