Message ID | 20181112231344.7161-1-timofey.titovets@synesis.ru (mailing list archive)
---|---
State | New, archived
Series | [V3] KSM: allow dedup all tasks memory
On Tue, Nov 13, 2018 at 02:13:44AM +0300, Timofey Titovets wrote:
> Some numbers from different non-madvised workloads.
> Formulas:
>  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
>  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
>  Memory used = free -h
>
> * Name: My working laptop
>   Description: Many different chrome/electron apps + KDE
>   Ratio: 5%
>   Saved: ~100 MiB
>   Used: ~2000 MiB

Your _laptop_ saves 100MB of RAM? That's extraordinary. Essentially
that's like getting an extra 100MB of page cache for free. Is there any
observable slowdown? I could even see there being a speedup (due to
your working set being allowed to be 5% larger).

I am now a big fan of this patch and shall try to give it the review
that it deserves.
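The formulas above read straight off KSM's standard sysfs counters
(/sys/kernel/mm/ksm/pages_shared, pages_sharing, pages_unshared). A
minimal userspace sketch that computes the same two figures, assuming
4 KiB pages as the formulas do:

/* Sketch: compute the thread's "ratio" and "saved" figures from the
 * standard KSM sysfs counters. Assumes 4 KiB pages, as the formulas do. */
#include <stdio.h>

static long read_counter(const char *name)
{
	char path[128];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	long shared   = read_counter("pages_shared");
	long sharing  = read_counter("pages_sharing");
	long unshared = read_counter("pages_unshared");

	if (shared < 0 || sharing < 0 || unshared <= 0)
		return 1;

	/* Percentage ratio = (pages_sharing - pages_shared)/pages_unshared */
	printf("ratio: %.1f%%\n", 100.0 * (sharing - shared) / unshared);
	/* Memory saved = (pages_sharing - pages_shared)*4/1024 MiB */
	printf("saved: %.1f MiB\n", (sharing - shared) * 4.0 / 1024);
	return 0;
}

pages_sharing - pages_shared is the number of duplicate pages that
collapsed onto existing KSM pages, which is why it approximates the
saving.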
On 18-11-13 02:13:44, Timofey Titovets wrote:
> From: Timofey Titovets <nefelim4ag@gmail.com>
>
> KSM by default works only on memory that was added by madvise().
>
> The only ways to get it working on other applications are:
>  * use LD_PRELOAD and libraries
>  * patch the kernel
>
> Let's use the kernel task list and add logic to import VMAs from tasks.
>
> That behaviour is controlled by new attributes:
>  * mode:
>    mimics the hugepages attribute, so mode has two states:
>     * madvise - old default behaviour
>     * always [new] - allow KSM to take tasks' VMAs and try
>       working on them.
>  * seeker_sleep_millisecs:
>    pause between importing tasks' VMAs.
>
> For rate-limiting purposes, and to bound tasklist locking time, the
> KSM seeker thread imports VMAs from only one task per loop.
>
> Some numbers from different non-madvised workloads.
> Formulas:
>  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
>  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
>  Memory used = free -h
>
> * Name: My working laptop
>   Description: Many different chrome/electron apps + KDE
>   Ratio: 5%
>   Saved: ~100 MiB
>   Used: ~2000 MiB
>
> * Name: K8s test VM
>   Description: Some small random running docker images
>   Ratio: 40%
>   Saved: ~160 MiB
>   Used: ~920 MiB
>
> * Name: Ceph test VM
>   Description: Ceph Mon/OSD, some containers
>   Ratio: 20%
>   Saved: ~60 MiB
>   Used: ~600 MiB
>
> * Name: BareMetal K8s backend server
>   Description: Different server apps in containers: C, Java, Go, etc.
>   Ratio: 72%
>   Saved: ~5800 MiB
>   Used: ~35.7 GiB
>
> * Name: BareMetal K8s processing server
>   Description: Many instances of one CPU-intensive application
>   Ratio: 55%
>   Saved: ~2600 MiB
>   Used: ~28.0 GiB
>
> * Name: BareMetal Ceph node
>   Description: Only OSD storage daemons running
>   Ratio: 2%
>   Saved: ~190 MiB
>   Used: ~11.7 GiB
>
> Changes:
>   v1 -> v2:
>     * Rebase on v4.19.1 (must also apply on 4.20-rc2+)
>   v2 -> v3:
>     * Reformat patch description
>     * Rename mode "normal" to "madvise"
>     * Add some memory numbers
>     * Fix checkpatch.pl warnings
>     * Separate the KSM VMA seeker into another kthread
>     * Fix "BUG: scheduling while atomic: ksmd" by moving the seeker
>       to another thread
>
> Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> CC: Matthew Wilcox <willy@infradead.org>
> CC: linux-mm@kvack.org
> CC: linux-doc@vger.kernel.org
> ---
>  Documentation/admin-guide/mm/ksm.rst |  15 ++
>  mm/ksm.c                             | 215 +++++++++++++++++++++++----
>  2 files changed, 198 insertions(+), 32 deletions(-)
>
> [full patch body trimmed; the complete diff appears at the end of
> this page]
>
> +static int ksm_seeker_thread(void *nothing)

Is it really necessary to have an extra thread in ksm just to add VMAs
for scanning? Can we do it right from the scanner thread? Also, maybe
it is better to add VMAs at their creation time, when KSM_MODE_ALWAYS
is enabled?

Thank you,
Pasha
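For context, the existing opt-in path that the patch generalizes is the
MADV_MERGEABLE hint. A minimal sketch of how an application registers
its memory with KSM today (the mapping size and fill pattern are
illustrative, not from the patch):

/* Minimal sketch: opting a mapping into KSM the existing way.
 * Requires a kernel built with CONFIG_KSM=y. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 1024 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 0x5a, len);	/* identical pages: good merge candidates */

	/* Without this hint (or the patch's "always" mode), ksmd
	 * never scans these pages. */
	if (madvise(buf, len, MADV_MERGEABLE))
		perror("madvise(MADV_MERGEABLE)");

	getchar();	/* keep the mapping alive while ksmd scans */
	return 0;
}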
On Tue, Nov 13, 2018 at 04:49, Matthew Wilcox <willy@infradead.org> wrote:
> Your _laptop_ saves 100MB of RAM? That's extraordinary. Essentially
> that's like getting an extra 100MB of page cache for free. Is there
> any observable slowdown? I could even see there being a speedup (due
> to your working set being allowed to be 5% larger).
>
> I am now a big fan of this patch and shall try to give it the review
> that it deserves.

I'm not sure if this is sarcasm; anyway, I'll do my best to get this
working.

On any x86 desktop with a mixed load (browser, docs, games etc.) you
will always see something like 40-200 MiB of deduplicated pages,
depending on the type of load, of course.

I just don't want to use those numbers as the argument for general KSM
deduplication in the kernel, because in the current generation of
machines with several gigabytes of memory, a few saved MiB don't look
serious to most people.

Thanks!
On Tue, Nov 13, 2018 at 05:25, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> [full patch quote trimmed]
>
> Is it really necessary to have an extra thread in ksm just to add VMAs
> for scanning? Can we do it right from the scanner thread? Also, maybe
> it is better to add VMAs at their creation time, when KSM_MODE_ALWAYS
> is enabled?
>
> Thank you,
> Pasha

Oh, that's a long story, and my English is too bad to describe it all;
it's even hard to find the linux-mm conversation about this from
several years ago. Anyway:

In v2 I used the scanner thread to add VMAs, but I think the scanner
did it at too high a rate: walking the task list and grabbing a new
task every 20 ms, waiting on the write semaphore just to get at the
VMAs. That rate is overkill for a task-list scanner.

About adding VMAs at creation time: UKSM adds ksm_enter() hooks to the
mm subsystem, and I ported that approach to KSM. But some mm people
said they don't like adding KSM hooks to other subsystems and want KSM
to handle this internally somehow.

Frankly speaking, I didn't have enough knowledge and skill at the time
to do it any other way. They also suggested I look at THP for that
logic, but I couldn't find how THP does it without hooks, or where THP
actually scans memory.

So, after all of that, I implemented it this way: in the first
iteration as part of the KSM scan thread, and in the second via a
separate thread, because that allows VMAs to be added in a fully
independent way.

Thanks!
On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
<timofey.titovets@synesis.ru> wrote:
> KSM by default works only on memory that was added by madvise().
> [...]
> That behaviour is controlled by new attributes:
>  * mode:
>    mimics the hugepages attribute, so mode has two states:
>     * madvise - old default behaviour
>     * always [new] - allow KSM to take tasks' VMAs and try
>       working on them.

Please don't. And if you really have to for some reason, put some big
warnings on this, advising people that it's a security risk.

KSM is one of the favorite punching bags of side-channel and hardware
security researchers:

As a gigantic, problematic side channel:
http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
https://access.redhat.com/blogs/766093/posts/1976303
https://gruss.cc/files/dedup.pdf

In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes it
possible to use malicious JavaScript to determine whether a given page
of memory exists elsewhere on your system.

And also as a way to target rowhammer-based faults:
https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/
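For readers unfamiliar with this attack class: the papers above build
on the observation that the first write to a KSM-merged page triggers a
copy-on-write fault and is therefore measurably slower than a write to
a private page. A deliberately stripped-down sketch of that timing
primitive follows; it is illustrative only (real attacks need many
samples, noise filtering, and careful page placement, and it assumes
the victim's identical page is also in KSM's scanned set):

/* Illustrative sketch of the dedup timing primitive from the papers
 * cited above. A write to a merged page takes a CoW fault and is
 * slower than a write to an unmerged page. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

static uint64_t time_first_write(volatile char *page)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	page[0] ^= 1;		/* first write: CoW fault if merged */
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) * 1000000000ull +
	       (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Fill the page with guessed victim content, hint it to KSM,
	 * wait for ksmd to (maybe) merge it, then time the write.
	 * High latency suggests the same page exists elsewhere. */
	memset(p, 0x41, 4096);
	madvise(p, 4096, MADV_MERGEABLE);
	sleep(30);	/* give ksmd time to scan and merge */
	printf("first write took %llu ns\n",
	       (unsigned long long)time_first_write(p));
	return 0;
}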
On Tue, Nov 13, 2018 at 14:57, Jann Horn <jannh@google.com> wrote:
> Please don't. And if you really have to for some reason, put some big
> warnings on this, advising people that it's a security risk.
>
> KSM is one of the favorite punching bags of side-channel and hardware
> security researchers:
> [links trimmed; see above]

I'm very sorry, I'm not a security specialist. But if I understood
correctly, KSM has these security issues _without_ my patch set.

Even more, it is not only KSM that has this type of issue: any memory
deduplication has these problems.

Anyone who cares about security must decide for themselves which tools
to use and how to defend against others; they have to learn the tools
they use and make their own decisions, right?

So, if you really care about this problem, in general or only on the
KSM side, it's your initiative and your duty to warn people about it.
KSM has existed for 10+ years, and you know about the security
implications of memory deduplication; it's your duty to send patches
for the documentation and add appropriate warnings.

Sorry for being passive-aggressive; I'm not trying to hurt or humiliate
anyone. That's just my opinion, and my English is too limited to put it
more gently.

Thanks!
On Tue, Nov 13, 2018 at 1:59 PM Timofey Titovets
<timofey.titovets@synesis.ru> wrote:
> I'm very sorry, I'm not a security specialist. But if I understood
> correctly, KSM has these security issues _without_ my patch set.

Yep. However, so far, it requires an application to explicitly opt in
to this behavior, so it's not all that bad. Your patch would remove the
requirement for application opt-in, which, in my opinion, makes this
way worse and reduces the number of applications for which this is
acceptable.

> Even more, it is not only KSM that has this type of issue: any memory
> deduplication has these problems.

Yup.

> Anyone who cares about security must decide for themselves which tools
> to use and how to defend against others; they have to learn the tools
> they use and make their own decisions, right?
>
> So, if you really care about this problem, in general or only on the
> KSM side, it's your initiative and your duty to warn people about it.
> KSM has existed for 10+ years, and you know about the security
> implications of memory deduplication; it's your duty to send patches
> for the documentation and add appropriate warnings.

As far as I know, basically nobody is using KSM at this point. There
are blog posts from several cloud providers about these security risks
that explicitly state that they're not using memory deduplication.
On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> Hi.
>
> > Yep. However, so far, it requires an application to explicitly opt
> > in to this behavior, so it's not all that bad. Your patch would
> > remove the requirement for application opt-in, which, in my opinion,
> > makes this way worse and reduces the number of applications for
> > which this is acceptable.
>
> The default is to maintain the old behaviour, so unless the explicit
> decision is made by the administrator, no extra risk is imposed.

The new interface would be more tolerable if it honored MADV_UNMERGEABLE:

KSM default on: merge everything except when MADV_UNMERGEABLE is
explicitly set.

KSM default off: merge only when MADV_MERGEABLE is set.

The proposed change won't honor MADV_UNMERGEABLE, meaning that
application programmers won't have a way to prevent sensitive data from
ever being merged. So, I think, we should keep allowing an explicit
opt-out option for applications.

> > As far as I know, basically nobody is using KSM at this point. There
> > are blog posts from several cloud providers about these security
> > risks that explicitly state that they're not using memory
> > deduplication.
>
> I tend to disagree here. Based on both what my company does and what
> UKSM users do, memory dedup is a desired option (note the word
> "option" here, not the default choice).

Lightweight containers are a use case for KSM: many VMs sharing the
same small kernel. KSM is used in production by large cloud vendors.

Thank you,
Pasha
On Tue, Nov 13, 2018 at 20:59, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
>
> KSM default on: merge everything except when MADV_UNMERGEABLE is
> explicitly set.
>
> KSM default off: merge only when MADV_MERGEABLE is set.
>
> The proposed change won't honor MADV_UNMERGEABLE, meaning that
> application programmers won't have a way to prevent sensitive data
> from ever being merged. So, I think, we should keep allowing an
> explicit opt-out option for applications.

We just don't have a VM/madvise flag for that currently. Same as THP:
all the logic is written with the assumption that we have exactly two
states, allow / disallow (more precisely, "not allow").

If we try to add one, it would have to be something like MADV_FORBID_*
to disallow something completely. And the same goes for THP (currently
some apps simply refuse to start if THP is enabled, because there is no
way to forbid THP completely).

Thanks.
On 18-11-13 21:17:42, Timofey Titovets wrote:
> We just don't have a VM/madvise flag for that currently. Same as THP:
> all the logic is written with the assumption that we have exactly two
> states, allow / disallow (more precisely, "not allow").
>
> If we try to add one, it would have to be something like MADV_FORBID_*
> to disallow something completely.

No need to add a new user flag MADV_FORBID; we should keep
MADV_MERGEABLE and MADV_UNMERGEABLE, but make them work so that when
MADV_UNMERGEABLE is set, the memory indeed becomes always unmergeable
regardless of the KSM mode of operation.

To do the above in ksm_madvise(), a new state should be added; for
example, instead of:

	case MADV_UNMERGEABLE:
		*vm_flags &= ~VM_MERGEABLE;

a new flag should be used:

		*vm_flags |= VM_UNMERGEABLE;

I think that without honoring MADV_UNMERGEABLE correctly, this patch
won't be accepted.

Pasha
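A sketch of what Pavel is proposing; VM_UNMERGEABLE is a hypothetical
flag that does not exist at this point (the vm_flags bit-space problem
discussed below is exactly about where such a bit would come from):

/* Sketch only: VM_UNMERGEABLE is a proposed vm_flags bit, not an
 * existing one. Elided parts are marked with "..." comments. */

static int ksm_enter(struct mm_struct *mm, unsigned long *vm_flags)
{
	/* A sticky opt-out would beat KSM_MODE_ALWAYS: */
	if (*vm_flags & VM_UNMERGEABLE)
		return 0;
	/* ... existing checks and __ksm_enter() as in the patch ... */
	return 0;
}

int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
		unsigned long end, int advice, unsigned long *vm_flags)
{
	switch (advice) {
	case MADV_MERGEABLE:
		*vm_flags &= ~VM_UNMERGEABLE;	/* explicit opt back in */
		/* ... ksm_enter() as in the patch ... */
		break;
	case MADV_UNMERGEABLE:
		/* Opt out permanently, even under KSM_MODE_ALWAYS. */
		*vm_flags &= ~VM_MERGEABLE;
		*vm_flags |= VM_UNMERGEABLE;
		/* ... existing unmerge_ksm_pages() path ... */
		break;
	}
	return 0;
}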
> > Is it really necessary to have an extra thread in ksm just to add
> > VMAs for scanning? Can we do it right from the scanner thread?
> > Also, maybe it is better to add VMAs at their creation time, when
> > KSM_MODE_ALWAYS is enabled?
>
> Oh, that's a long story [...] In v2 I used the scanner thread to add
> VMAs, but I think the scanner did it at too high a rate. [...]
> So, after all of that, I implemented it this way: in the first
> iteration as part of the KSM scan thread, and in the second via a
> separate thread, because that allows VMAs to be added in a fully
> independent way.

It still feels like the wrong direction: a new thread that adds random
VMAs to scan, with no way to optimize, for example, queue fairness. It
should really be done at creation time: when a VMA is created, it
should be added to the KSM scanning queue, or the KSM main scanner
thread should go through the VMA list in a coherent order.

The design of having a separate thread is bad. I plan in the future to
add per-node thread support to KSM, and this one odd thread would break
things: to which queue should this thread add VMAs if there are
multiple queues?

Thank you,
Pasha
On Tue, Nov 13, 2018 at 21:35, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> No need to add a new user flag MADV_FORBID; we should keep
> MADV_MERGEABLE and MADV_UNMERGEABLE, but make them work so that when
> MADV_UNMERGEABLE is set, the memory indeed becomes always unmergeable
> regardless of the KSM mode of operation.
> [...]
> I think that without honoring MADV_UNMERGEABLE correctly, this patch
> won't be accepted.

That would work, but we are out of bit space in vm_flags [1], i.e. the
first 32 bits are already defined, and the rest are accessible only on
64-bit machines.

1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219
On 18-11-13 21:54:13, Timofey Titovets wrote:
> That would work, but we are out of bit space in vm_flags [1], i.e. the
> first 32 bits are already defined, and the rest are accessible only on
> 64-bit machines.

So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

> 1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219
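For reference, include/linux/mm.h already has a pattern for 64-bit-only
flag bits: the VM_HIGH_ARCH_* defines, available under
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. A hypothetical VM_UNMERGEABLE could
follow the same pattern; the bit number below is assumed for
illustration, not taken from any real patch:

/* Sketch following the existing VM_HIGH_ARCH_* pattern in
 * include/linux/mm.h; the bit number chosen here is hypothetical. */
#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_5		BIT(VM_HIGH_ARCH_BIT_5)
#define VM_UNMERGEABLE		VM_HIGH_ARCH_5	/* sticky KSM opt-out */
#else
#define VM_UNMERGEABLE		0	/* feature compiled out on 32-bit */
#endif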
+cc Daniel Gruss

On Tue, Nov 13, 2018 at 6:59 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
> Lightweight containers are a use case for KSM: many VMs sharing the
> same small kernel. KSM is used in production by large cloud vendors.

Wait, what? Can you name specific ones? Nowadays, enabling KSM for
untrusted VMs seems like a terrible idea to me, security-wise.

Google says at
<https://cloud.google.com/blog/products/gcp/7-ways-we-harden-our-kvm-hypervisor-at-google-cloud-security-in-plaintext>:
"Compute Engine and Container Engine are not vulnerable to this kind of
attack, since they do not use KSM."

An AWS employee says at
<https://forums.aws.amazon.com/thread.jspa?threadID=238519&tstart=0&messageID=739485#739485>:
"memory de-duplication is not enabled by Amazon EC2's hypervisor"

In my opinion, KSM is fundamentally insecure for systems hosting
multiple VMs that don't trust each other. I don't think anyone writes
cryptographic software under the assumption that an attacker will be
given the ability to query whether a given page of data exists anywhere
else on the system.
> Wait, what? Can you name specific ones? Nowadays, enabling KSM for
> untrusted VMs seems like a terrible idea to me, security-wise.

Of course it is not used to share data among different
customers/tenants. As far as I know, it is used by Oracle Cloud to
merge identical pages in clear containers:

https://medium.com/cri-o/intel-clear-containers-and-cri-o-70824fb51811

"One performance enhancing feature is the use of KSM, a recent KVM
optimized for memory sharing and boot speed. Another is the use of an
optimized Clear Containers mini-OS."

Pasha
On Tue, Nov 13, 2018 at 22:17, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> > That would work, but we are out of bit space in vm_flags [1], i.e.
> > the first 32 bits are already defined, and the rest are accessible
> > only on 64-bit machines.
>
> So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

With all due respect, for this kind of thing we need an mm maintainer's
opinion. I just don't want to end up in a situation where, after
touching other subsystems, the maintainer refuses the work for some
reason.

I.e., writing patches for upstream is (from my point of view) more the
art of communication and of making the resulting code acceptable to the
community, because any code that is correct from an engineering point
of view can easily be refused just because someone doesn't find it
useful.

Thanks.
> > > That would work, but we are out of bit space in vm_flags [1], i.e.
> > > the first 32 bits are already defined, and the rest are accessible
> > > only on 64-bit machines.
> >
> > So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.
>
> With all due respect, for this kind of thing we need an mm
> maintainer's opinion.

As far as I understood, you already got directions from the maintainers
to do this similarly to the way THP is implemented, and THP uses two
flags: VM_HUGEPAGE and VM_NOHUGEPAGE, the same as I think KSM should do
if we honor MADV_UNMERGEABLE.

When VM_NOHUGEPAGE is set, khugepaged ignores those VMAs.

There may be a way to add VM_UNMERGEABLE without extending the size of
vm_flags, but that would be a good starting point for looking at how to
add a new flag.

Again, you could simply enable this feature on 64-bit only.

Pasha
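Schematically, the THP precedent Pavel refers to looks like this. This
is a simplification of the real transparent_hugepage_enabled() and
khugepaged checks, not a verbatim copy, and thp_mode_always() is a
stand-in name rather than an actual kernel function:

/* Schematic of THP's two-flag pattern: madvise(MADV_HUGEPAGE) sets
 * VM_HUGEPAGE, madvise(MADV_NOHUGEPAGE) sets VM_NOHUGEPAGE, and the
 * explicit opt-out is honored in every mode of operation. */
static bool thp_allowed_for_vma(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_NOHUGEPAGE)	/* explicit opt-out wins */
		return false;
	if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
		return false;
	if (thp_mode_always())		/* "always": everything else */
		return true;
	return vma->vm_flags & VM_HUGEPAGE;	/* "madvise": opt-in only */
}

A VM_UNMERGEABLE bit would give KSM the same shape: a sticky opt-out
that survives even when the global mode says "always".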
On Tue, Nov 13, 2018 at 21:43, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> It still feels like the wrong direction: a new thread that adds random
> VMAs to scan, with no way to optimize, for example, queue fairness. It
> should really be done at creation time: when a VMA is created, it
> should be added to the KSM scanning queue, or the KSM main scanner
> thread should go through the VMA list in a coherent order.

How do you see queue fairness working in that case? I.e., if you're
talking about moving from old VMAs to new VMAs, IIRC there is no single
kernel-wide list of VMAs.

I.e., I do understand what exactly you don't like, but for that we
would need to add hooks, as I mentioned above (and I already tried to
get that into the kernel [1]). So, as I wrote elsewhere in the thread,
I need a maintainer's opinion on which implementation the person
responsible for mm sees as "right".

> The design of having a separate thread is bad. I plan in the future to
> add per-node thread support to KSM, and this one odd thread would
> break things: to which queue should this thread add VMAs if there are
> multiple queues?

That will be interesting to see :)

But IMHO you will need to add some code to ksm_enter() for that,
because madvise() internally calls ksm_enter(), so ksm_enter() would
decide which thread must process the VMA; it does not depend on the
caller.

Thanks.

1. https://lkml.org/lkml/2014/11/8/206
On Wed, Nov 14, 2018 at 01:53, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> There may be a way to add VM_UNMERGEABLE without extending the size of
> vm_flags, but that would be a good starting point for looking at how
> to add a new flag.
>
> Again, you could simply enable this feature on 64-bit only.

Deal! I will try it on 64-bit machines only.
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..7cffd47f9b38 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -116,6 +116,21 @@ run
         Default: 0 (must be changed to 1 to activate KSM, except if
         CONFIG_SYSFS is disabled)
 
+mode
+        * set always to allow ksm to deduplicate memory of every process
+        * set madvise to use only madvised memory
+
+        Default: madvise (deduplicate only madvised memory as in
+        earlier releases)
+
+seeker_sleep_millisecs
+        how many milliseconds the ksmd task seeker should sleep before
+        trying another task.
+        e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
+
+        Default: 1000 (chosen for rate-limiting purposes)
+
+
 use_zero_pages
         specifies whether empty pages (i.e. allocated pages that only
         contain zeroes) should be treated specially. When set to 1,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5b0894b45ee5..1a03b28b6288 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -273,6 +273,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Milliseconds ksmd seeker should sleep between runs */
+static unsigned int ksm_thread_seeker_sleep_millisecs = 1000;
+
 /* Checksum of an empty (zeroed) page */
 static unsigned int zero_checksum __read_mostly;
 
@@ -295,7 +298,12 @@ static int ksm_nr_node_ids = 1;
 static unsigned long ksm_run = KSM_RUN_STOP;
 static void wait_while_offlining(void);
 
+#define KSM_MODE_MADVISE 0
+#define KSM_MODE_ALWAYS 1
+static unsigned long ksm_mode = KSM_MODE_MADVISE;
+
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
+static DECLARE_WAIT_QUEUE_HEAD(ksm_seeker_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
 static DEFINE_SPINLOCK(ksm_mmlist_lock);
 
@@ -303,6 +311,11 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
 		sizeof(struct __struct), __alignof__(struct __struct),\
 		(__flags), NULL)
 
+static inline int ksm_mode_always(void)
+{
+	return (ksm_mode == KSM_MODE_ALWAYS);
+}
+
 static int __init ksm_slab_init(void)
 {
 	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -2389,6 +2402,106 @@ static int ksmd_should_run(void)
 	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
 }
 
+
+static int ksm_enter(struct mm_struct *mm, unsigned long *vm_flags)
+{
+	int err;
+
+	if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE |
+			 VM_PFNMAP | VM_IO | VM_DONTEXPAND |
+			 VM_HUGETLB | VM_MIXEDMAP))
+		return 0;
+
+#ifdef VM_SAO
+	if (*vm_flags & VM_SAO)
+		return 0;
+#endif
+#ifdef VM_SPARC_ADI
+	if (*vm_flags & VM_SPARC_ADI)
+		return 0;
+#endif
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		err = __ksm_enter(mm);
+		if (err)
+			return err;
+	}
+
+	*vm_flags |= VM_MERGEABLE;
+
+	return 0;
+}
+
+/*
+ * Register all vmas for all processes in the system with KSM.
+ * Note that every call to ksm_enter(), for a given vma, after the
+ * first does nothing but set flags.
+ */
+void ksm_import_task_vma(struct task_struct *task)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int error;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		return;
+	down_write(&mm->mmap_sem);
+	vma = mm->mmap;
+	while (vma) {
+		error = ksm_enter(vma->vm_mm, &vma->vm_flags);
+		vma = vma->vm_next;
+	}
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+}
+
+static int ksm_seeker_thread(void *nothing)
+{
+	pid_t last_pid = 1;
+	pid_t curr_pid;
+	struct task_struct *task;
+
+	set_freezable();
+	set_user_nice(current, 5);
+
+	while (!kthread_should_stop()) {
+		wait_while_offlining();
+
+		try_to_freeze();
+
+		if (!ksm_mode_always()) {
+			wait_event_freezable(ksm_seeker_thread_wait,
+				ksm_mode_always() || kthread_should_stop());
+			continue;
+		}
+
+		/*
+		 * import one task's vma per run
+		 */
+		read_lock(&tasklist_lock);
+
+		/* Always try to get the next task */
+		for_each_process(task) {
+			curr_pid = task_pid_nr(task);
+			if (curr_pid == last_pid) {
+				task = next_task(task);
+				break;
+			}
+
+			if (curr_pid > last_pid)
+				break;
+		}
+
+		last_pid = task_pid_nr(task);
+		ksm_import_task_vma(task);
+		read_unlock(&tasklist_lock);
+
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
+	}
+	return 0;
+}
+
 static int ksm_scan_thread(void *nothing)
 {
 	set_freezable();
@@ -2422,33 +2535,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 
 	switch (advice) {
 	case MADV_MERGEABLE:
-		/*
-		 * Be somewhat over-protective for now!
-		 */
-		if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE |
-				 VM_PFNMAP | VM_IO | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_MIXEDMAP))
-			return 0;	/* just ignore the advice */
-
-		if (vma_is_dax(vma))
-			return 0;
-
-#ifdef VM_SAO
-		if (*vm_flags & VM_SAO)
-			return 0;
-#endif
-#ifdef VM_SPARC_ADI
-		if (*vm_flags & VM_SPARC_ADI)
-			return 0;
-#endif
-
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
-			err = __ksm_enter(mm);
-			if (err)
-				return err;
-		}
-
-		*vm_flags |= VM_MERGEABLE;
+		err = ksm_enter(mm, vm_flags);
+		if (err)
+			return err;
 		break;
 
 	case MADV_UNMERGEABLE:
@@ -2829,6 +2918,29 @@ static ssize_t sleep_millisecs_store(struct kobject *kobj,
 }
 KSM_ATTR(sleep_millisecs);
 
+static ssize_t seeker_sleep_millisecs_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_thread_seeker_sleep_millisecs);
+}
+
+static ssize_t seeker_sleep_millisecs_store(struct kobject *kobj,
+				struct kobj_attribute *attr,
+				const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = kstrtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	ksm_thread_seeker_sleep_millisecs = msecs;
+
+	return count;
+}
+KSM_ATTR(seeker_sleep_millisecs);
+
 static ssize_t pages_to_scan_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
@@ -2852,6 +2964,34 @@ static ssize_t pages_to_scan_store(struct kobject *kobj,
 }
 KSM_ATTR(pages_to_scan);
 
+static ssize_t mode_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	switch (ksm_mode) {
+	case KSM_MODE_ALWAYS:
+		return sprintf(buf, "[always] madvise\n");
+	case KSM_MODE_MADVISE:
+		return sprintf(buf, "always [madvise]\n");
+	}
+
+	return sprintf(buf, "always [madvise]\n");
+}
+
+static ssize_t mode_store(struct kobject *kobj, struct kobj_attribute *attr,
+			  const char *buf, size_t count)
+{
+	if (!memcmp("always", buf, min(sizeof("always")-1, count))) {
+		ksm_mode = KSM_MODE_ALWAYS;
+		wake_up_interruptible(&ksm_seeker_thread_wait);
+	} else if (!memcmp("madvise", buf, min(sizeof("madvise")-1, count))) {
+		ksm_mode = KSM_MODE_MADVISE;
+	} else
+		return -EINVAL;
+
+	return count;
+}
+KSM_ATTR(mode);
+
 static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
 			char *buf)
 {
@@ -3108,7 +3248,9 @@ KSM_ATTR_RO(full_scans);
 
 static struct attribute *ksm_attrs[] = {
 	&sleep_millisecs_attr.attr,
+	&seeker_sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
+	&mode_attr.attr,
 	&run_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
@@ -3134,7 +3276,7 @@ static const struct attribute_group ksm_attr_group = {
 
 static int __init ksm_init(void)
 {
-	struct task_struct *ksm_thread;
+	struct task_struct *ksm_thread[2];
 	int err;
 
 	/* The correct value depends on page size and endianness */
@@ -3146,10 +3288,18 @@ static int __init ksm_init(void)
 	if (err)
 		goto out;
 
-	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
-	if (IS_ERR(ksm_thread)) {
+	ksm_thread[0] = kthread_run(ksm_scan_thread, NULL, "ksmd");
+	if (IS_ERR(ksm_thread[0])) {
 		pr_err("ksm: creating kthread failed\n");
-		err = PTR_ERR(ksm_thread);
+		err = PTR_ERR(ksm_thread[0]);
+		goto out_free;
+	}
+
+	ksm_thread[1] = kthread_run(ksm_seeker_thread, NULL, "ksmd_seeker");
+	if (IS_ERR(ksm_thread[1])) {
+		pr_err("ksm: creating seeker kthread failed\n");
+		err = PTR_ERR(ksm_thread[1]);
+		kthread_stop(ksm_thread[0]);
 		goto out_free;
 	}
 
@@ -3157,7 +3307,8 @@
 	err = sysfs_create_group(mm_kobj, &ksm_attr_group);
 	if (err) {
 		pr_err("ksm: register sysfs failed\n");
-		kthread_stop(ksm_thread);
+		kthread_stop(ksm_thread[0]);
+		kthread_stop(ksm_thread[1]);
 		goto out_free;
 	}
 #else
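If the patch were applied, an administrator would drive it entirely
through the sysfs files the diff adds. A minimal sketch; the paths are
the ones defined by the patch above, and the 5000 ms seeker interval is
an arbitrary example value:

/* Minimal sketch: start ksmd, enable the patch's "always" mode, and
 * slow the seeker down. Paths are those added by the diff above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0) {
		perror(path);
		return -1;
	}
	n = write(fd, val, strlen(val));
	close(fd);
	if (n < 0) {
		perror(path);
		return -1;
	}
	return 0;
}

int main(void)
{
	write_str("/sys/kernel/mm/ksm/run", "1");	/* start ksmd */
	write_str("/sys/kernel/mm/ksm/mode", "always");
	write_str("/sys/kernel/mm/ksm/seeker_sleep_millisecs", "5000");
	return 0;
}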