Message ID | 20181112231344.7161-1-timofey.titovets@synesis.ru (mailing list archive)
---|---
State | New, archived
Series | [V3] KSM: allow dedup all tasks memory
On Tue, Nov 13, 2018 at 02:13:44AM +0300, Timofey Titovets wrote:
> Some numbers from different non-madvised workloads.
> Formulas:
>  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
>  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
>  Memory used = free -h
>
> * Name: My working laptop
>   Description: Many different chrome/electron apps + KDE
>   Ratio: 5%
>   Saved: ~100 MiB
>   Used: ~2000 MiB

Your _laptop_ saves 100MB of RAM? That's extraordinary. Essentially
that's like getting an extra 100MB of page cache for free. Is there any
observable slowdown? I could even see there being a speedup (due to
your working set being allowed to be 5% larger).

I am now a big fan of this patch and shall try to give it the review
that it deserves.
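The formulas above read straight off KSM's standard sysfs counters
(/sys/kernel/mm/ksm/pages_shared, pages_sharing, pages_unshared). A
minimal userspace sketch that computes the same two figures, assuming
4 KiB pages as the formulas do:

/* Sketch: compute the thread's "ratio" and "saved" figures from the
 * standard KSM sysfs counters. Assumes 4 KiB pages, as the formulas do. */
#include <stdio.h>

static long read_counter(const char *name)
{
	char path[128];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	long shared   = read_counter("pages_shared");
	long sharing  = read_counter("pages_sharing");
	long unshared = read_counter("pages_unshared");

	if (shared < 0 || sharing < 0 || unshared <= 0)
		return 1;

	/* Percentage ratio = (pages_sharing - pages_shared)/pages_unshared */
	printf("ratio: %.1f%%\n", 100.0 * (sharing - shared) / unshared);
	/* Memory saved = (pages_sharing - pages_shared)*4/1024 MiB */
	printf("saved: %.1f MiB\n", (sharing - shared) * 4.0 / 1024);
	return 0;
}

pages_sharing - pages_shared is the number of duplicate pages that
collapsed onto existing KSM pages, which is why it approximates the
saving.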
On 18-11-13 02:13:44, Timofey Titovets wrote:
> From: Timofey Titovets <nefelim4ag@gmail.com>
>
> KSM by default works only on memory that was added by madvise().
>
> The only ways to get it working on other applications are:
>  * use LD_PRELOAD and libraries
>  * patch the kernel
>
> Let's use the kernel task list and add logic to import VMAs from tasks.
>
> That behaviour is controlled by new attributes:
>  * mode:
>    mimics the hugepages attribute, so mode has two states:
>     * madvise - old default behaviour
>     * always [new] - allow KSM to take tasks' VMAs and try
>       working on them.
>  * seeker_sleep_millisecs:
>    pause between importing tasks' VMAs.
>
> For rate-limiting purposes, and to bound tasklist locking time, the
> KSM seeker thread imports VMAs from only one task per loop.
>
> Some numbers from different non-madvised workloads.
> Formulas:
>  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
>  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
>  Memory used = free -h
>
> * Name: My working laptop
>   Description: Many different chrome/electron apps + KDE
>   Ratio: 5%
>   Saved: ~100 MiB
>   Used: ~2000 MiB
>
> * Name: K8s test VM
>   Description: Some small random running docker images
>   Ratio: 40%
>   Saved: ~160 MiB
>   Used: ~920 MiB
>
> * Name: Ceph test VM
>   Description: Ceph Mon/OSD, some containers
>   Ratio: 20%
>   Saved: ~60 MiB
>   Used: ~600 MiB
>
> * Name: BareMetal K8s backend server
>   Description: Different server apps in containers: C, Java, Go, etc.
>   Ratio: 72%
>   Saved: ~5800 MiB
>   Used: ~35.7 GiB
>
> * Name: BareMetal K8s processing server
>   Description: Many instances of one CPU-intensive application
>   Ratio: 55%
>   Saved: ~2600 MiB
>   Used: ~28.0 GiB
>
> * Name: BareMetal Ceph node
>   Description: Only OSD storage daemons running
>   Ratio: 2%
>   Saved: ~190 MiB
>   Used: ~11.7 GiB
>
> Changes:
>   v1 -> v2:
>     * Rebase on v4.19.1 (must also apply on 4.20-rc2+)
>   v2 -> v3:
>     * Reformat patch description
>     * Rename mode "normal" to "madvise"
>     * Add some memory numbers
>     * Fix checkpatch.pl warnings
>     * Separate the KSM VMA seeker into another kthread
>     * Fix "BUG: scheduling while atomic: ksmd" by moving the seeker
>       to another thread
>
> Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> CC: Matthew Wilcox <willy@infradead.org>
> CC: linux-mm@kvack.org
> CC: linux-doc@vger.kernel.org
> ---
>  Documentation/admin-guide/mm/ksm.rst |  15 ++
>  mm/ksm.c                             | 215 +++++++++++++++++++++++----
>  2 files changed, 198 insertions(+), 32 deletions(-)
>
> [full patch body trimmed; the complete diff appears at the end of
> this page]
>
> +static int ksm_seeker_thread(void *nothing)

Is it really necessary to have an extra thread in ksm just to add VMAs
for scanning? Can we do it right from the scanner thread? Also, maybe
it is better to add VMAs at their creation time, when KSM_MODE_ALWAYS
is enabled?

Thank you,
Pasha
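For context, the existing opt-in path that the patch generalizes is the
MADV_MERGEABLE hint. A minimal sketch of how an application registers
its memory with KSM today (the mapping size and fill pattern are
illustrative, not from the patch):

/* Minimal sketch: opting a mapping into KSM the existing way.
 * Requires a kernel built with CONFIG_KSM=y. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 1024 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 0x5a, len);	/* identical pages: good merge candidates */

	/* Without this hint (or the patch's "always" mode), ksmd
	 * never scans these pages. */
	if (madvise(buf, len, MADV_MERGEABLE))
		perror("madvise(MADV_MERGEABLE)");

	getchar();	/* keep the mapping alive while ksmd scans */
	return 0;
}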
On Tue, Nov 13, 2018 at 04:49, Matthew Wilcox <willy@infradead.org> wrote:
> Your _laptop_ saves 100MB of RAM? That's extraordinary. Essentially
> that's like getting an extra 100MB of page cache for free. Is there
> any observable slowdown? I could even see there being a speedup (due
> to your working set being allowed to be 5% larger).
>
> I am now a big fan of this patch and shall try to give it the review
> that it deserves.

I'm not sure if this is sarcasm; anyway, I'll do my best to get this
working.

On any x86 desktop with a mixed load (browser, docs, games etc.) you
will always see something like 40-200 MiB of deduplicated pages,
depending on the type of load, of course.

I just don't want to use those numbers as the argument for general KSM
deduplication in the kernel, because in the current generation of
machines with several gigabytes of memory, a few saved MiB don't look
serious to most people.

Thanks!
On Tue, Nov 13, 2018 at 05:25, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> [full patch quote trimmed]
>
> Is it really necessary to have an extra thread in ksm just to add VMAs
> for scanning? Can we do it right from the scanner thread? Also, maybe
> it is better to add VMAs at their creation time, when KSM_MODE_ALWAYS
> is enabled?
>
> Thank you,
> Pasha

Oh, that's a long story, and my English is too bad to describe it all;
it's even hard to find the linux-mm conversation about this from
several years ago. Anyway:

In v2 I used the scanner thread to add VMAs, but I think the scanner
did it at too high a rate: walking the task list and grabbing a new
task every 20 ms, waiting on the write semaphore just to get at the
VMAs. That rate is overkill for a task-list scanner.

About adding VMAs at creation time: UKSM adds ksm_enter() hooks to the
mm subsystem, and I ported that approach to KSM. But some mm people
said they don't like adding KSM hooks to other subsystems and want KSM
to handle this internally somehow.

Frankly speaking, I didn't have enough knowledge and skill at the time
to do it any other way. They also suggested I look at THP for that
logic, but I couldn't find how THP does it without hooks, or where THP
actually scans memory.

So, after all of that, I implemented it this way: in the first
iteration as part of the KSM scan thread, and in the second via a
separate thread, because that allows VMAs to be added in a fully
independent way.

Thanks!
On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
<timofey.titovets@synesis.ru> wrote:
> KSM by default works only on memory that was added by madvise().
> [...]
> That behaviour is controlled by new attributes:
>  * mode:
>    mimics the hugepages attribute, so mode has two states:
>     * madvise - old default behaviour
>     * always [new] - allow KSM to take tasks' VMAs and try
>       working on them.

Please don't. And if you really have to for some reason, put some big
warnings on this, advising people that it's a security risk.

KSM is one of the favorite punching bags of side-channel and hardware
security researchers:

As a gigantic, problematic side channel:
http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
https://access.redhat.com/blogs/766093/posts/1976303
https://gruss.cc/files/dedup.pdf

In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes it
possible to use malicious JavaScript to determine whether a given page
of memory exists elsewhere on your system.

And also as a way to target rowhammer-based faults:
https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/
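For readers unfamiliar with this attack class: the papers above build
on the observation that the first write to a KSM-merged page triggers a
copy-on-write fault and is therefore measurably slower than a write to
a private page. A deliberately stripped-down sketch of that timing
primitive follows; it is illustrative only (real attacks need many
samples, noise filtering, and careful page placement, and it assumes
the victim's identical page is also in KSM's scanned set):

/* Illustrative sketch of the dedup timing primitive from the papers
 * cited above. A write to a merged page takes a CoW fault and is
 * slower than a write to an unmerged page. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

static uint64_t time_first_write(volatile char *page)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	page[0] ^= 1;		/* first write: CoW fault if merged */
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) * 1000000000ull +
	       (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Fill the page with guessed victim content, hint it to KSM,
	 * wait for ksmd to (maybe) merge it, then time the write.
	 * High latency suggests the same page exists elsewhere. */
	memset(p, 0x41, 4096);
	madvise(p, 4096, MADV_MERGEABLE);
	sleep(30);	/* give ksmd time to scan and merge */
	printf("first write took %llu ns\n",
	       (unsigned long long)time_first_write(p));
	return 0;
}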
On Tue, Nov 13, 2018 at 14:57, Jann Horn <jannh@google.com> wrote:
> Please don't. And if you really have to for some reason, put some big
> warnings on this, advising people that it's a security risk.
>
> KSM is one of the favorite punching bags of side-channel and hardware
> security researchers:
> [links trimmed; see above]

I'm very sorry, I'm not a security specialist. But if I understood
correctly, KSM has these security issues _without_ my patch set.

Even more, it is not only KSM that has this type of issue: any memory
deduplication has these problems.

Anyone who cares about security must decide for themselves which tools
to use and how to defend against others; they have to learn the tools
they use and make their own decisions, right?

So, if you really care about this problem, in general or only on the
KSM side, it's your initiative and your duty to warn people about it.
KSM has existed for 10+ years, and you know about the security
implications of memory deduplication; it's your duty to send patches
for the documentation and add appropriate warnings.

Sorry for being passive-aggressive; I'm not trying to hurt or humiliate
anyone. That's just my opinion, and my English is too limited to put it
more gently.

Thanks!
On Tue, Nov 13, 2018 at 1:59 PM Timofey Titovets
<timofey.titovets@synesis.ru> wrote:
> I'm very sorry, I'm not a security specialist. But if I understood
> correctly, KSM has these security issues _without_ my patch set.

Yep. However, so far, it requires an application to explicitly opt in
to this behavior, so it's not all that bad. Your patch would remove the
requirement for application opt-in, which, in my opinion, makes this
way worse and reduces the number of applications for which this is
acceptable.

> Even more, it is not only KSM that has this type of issue: any memory
> deduplication has these problems.

Yup.

> Anyone who cares about security must decide for themselves which tools
> to use and how to defend against others; they have to learn the tools
> they use and make their own decisions, right?
>
> So, if you really care about this problem, in general or only on the
> KSM side, it's your initiative and your duty to warn people about it.
> KSM has existed for 10+ years, and you know about the security
> implications of memory deduplication; it's your duty to send patches
> for the documentation and add appropriate warnings.

As far as I know, basically nobody is using KSM at this point. There
are blog posts from several cloud providers about these security risks
that explicitly state that they're not using memory deduplication.
On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> Hi.
>
> > Yep. However, so far, it requires an application to explicitly opt
> > in to this behavior, so it's not all that bad. Your patch would
> > remove the requirement for application opt-in, which, in my opinion,
> > makes this way worse and reduces the number of applications for
> > which this is acceptable.
>
> The default is to maintain the old behaviour, so unless the explicit
> decision is made by the administrator, no extra risk is imposed.

The new interface would be more tolerable if it honored MADV_UNMERGEABLE:

KSM default on: merge everything except when MADV_UNMERGEABLE is
explicitly set.

KSM default off: merge only when MADV_MERGEABLE is set.

The proposed change won't honor MADV_UNMERGEABLE, meaning that
application programmers won't have a way to prevent sensitive data from
ever being merged. So, I think, we should keep allowing an explicit
opt-out option for applications.

> > As far as I know, basically nobody is using KSM at this point. There
> > are blog posts from several cloud providers about these security
> > risks that explicitly state that they're not using memory
> > deduplication.
>
> I tend to disagree here. Based on both what my company does and what
> UKSM users do, memory dedup is a desired option (note the word
> "option" here, not the default choice).

Lightweight containers are a use case for KSM: many VMs sharing the
same small kernel. KSM is used in production by large cloud vendors.

Thank you,
Pasha
On Tue, Nov 13, 2018 at 20:59, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
>
> KSM default on: merge everything except when MADV_UNMERGEABLE is
> explicitly set.
>
> KSM default off: merge only when MADV_MERGEABLE is set.
>
> The proposed change won't honor MADV_UNMERGEABLE, meaning that
> application programmers won't have a way to prevent sensitive data
> from ever being merged. So, I think, we should keep allowing an
> explicit opt-out option for applications.

We just don't have a VM/madvise flag for that currently. Same as THP:
all the logic is written with the assumption that we have exactly two
states, allow / disallow (more precisely, "not allow").

If we try to add one, it would have to be something like MADV_FORBID_*
to disallow something completely. And the same goes for THP (currently
some apps simply refuse to start if THP is enabled, because there is no
way to forbid THP completely).

Thanks.
On 18-11-13 21:17:42, Timofey Titovets wrote:
> We just don't have a VM/madvise flag for that currently. Same as THP:
> all the logic is written with the assumption that we have exactly two
> states, allow / disallow (more precisely, "not allow").
>
> If we try to add one, it would have to be something like MADV_FORBID_*
> to disallow something completely.

No need to add a new user flag MADV_FORBID; we should keep
MADV_MERGEABLE and MADV_UNMERGEABLE, but make them work so that when
MADV_UNMERGEABLE is set, the memory indeed becomes always unmergeable
regardless of the KSM mode of operation.

To do the above in ksm_madvise(), a new state should be added; for
example, instead of:

	case MADV_UNMERGEABLE:
		*vm_flags &= ~VM_MERGEABLE;

a new flag should be used:

		*vm_flags |= VM_UNMERGEABLE;

I think that without honoring MADV_UNMERGEABLE correctly, this patch
won't be accepted.

Pasha
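A sketch of what Pavel is proposing; VM_UNMERGEABLE is a hypothetical
flag that does not exist at this point (the vm_flags bit-space problem
discussed below is exactly about where such a bit would come from):

/* Sketch only: VM_UNMERGEABLE is a proposed vm_flags bit, not an
 * existing one. Elided parts are marked with "..." comments. */

static int ksm_enter(struct mm_struct *mm, unsigned long *vm_flags)
{
	/* A sticky opt-out would beat KSM_MODE_ALWAYS: */
	if (*vm_flags & VM_UNMERGEABLE)
		return 0;
	/* ... existing checks and __ksm_enter() as in the patch ... */
	return 0;
}

int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
		unsigned long end, int advice, unsigned long *vm_flags)
{
	switch (advice) {
	case MADV_MERGEABLE:
		*vm_flags &= ~VM_UNMERGEABLE;	/* explicit opt back in */
		/* ... ksm_enter() as in the patch ... */
		break;
	case MADV_UNMERGEABLE:
		/* Opt out permanently, even under KSM_MODE_ALWAYS. */
		*vm_flags &= ~VM_MERGEABLE;
		*vm_flags |= VM_UNMERGEABLE;
		/* ... existing unmerge_ksm_pages() path ... */
		break;
	}
	return 0;
}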
> > Is it really necessary to have an extra thread in ksm just to add
> > VMAs for scanning? Can we do it right from the scanner thread?
> > Also, maybe it is better to add VMAs at their creation time, when
> > KSM_MODE_ALWAYS is enabled?
>
> Oh, that's a long story [...] In v2 I used the scanner thread to add
> VMAs, but I think the scanner did it at too high a rate. [...]
> So, after all of that, I implemented it this way: in the first
> iteration as part of the KSM scan thread, and in the second via a
> separate thread, because that allows VMAs to be added in a fully
> independent way.

It still feels like the wrong direction: a new thread that adds random
VMAs to scan, with no way to optimize, for example, queue fairness. It
should really be done at creation time: when a VMA is created, it
should be added to the KSM scanning queue, or the KSM main scanner
thread should go through the VMA list in a coherent order.

The design of having a separate thread is bad. I plan in the future to
add per-node thread support to KSM, and this one odd thread would break
things: to which queue should this thread add VMAs if there are
multiple queues?

Thank you,
Pasha
On Tue, Nov 13, 2018 at 21:35, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> No need to add a new user flag MADV_FORBID; we should keep
> MADV_MERGEABLE and MADV_UNMERGEABLE, but make them work so that when
> MADV_UNMERGEABLE is set, the memory indeed becomes always unmergeable
> regardless of the KSM mode of operation.
> [...]
> I think that without honoring MADV_UNMERGEABLE correctly, this patch
> won't be accepted.

That would work, but we are out of bit space in vm_flags [1], i.e. the
first 32 bits are already defined, and the rest are accessible only on
64-bit machines.

1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219
On 18-11-13 21:54:13, Timofey Titovets wrote:
> That would work, but we are out of bit space in vm_flags [1], i.e. the
> first 32 bits are already defined, and the rest are accessible only on
> 64-bit machines.

So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

> 1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219
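For reference, include/linux/mm.h already has a pattern for 64-bit-only
flag bits: the VM_HIGH_ARCH_* defines, available under
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. A hypothetical VM_UNMERGEABLE could
follow the same pattern; the bit number below is assumed for
illustration, not taken from any real patch:

/* Sketch following the existing VM_HIGH_ARCH_* pattern in
 * include/linux/mm.h; the bit number chosen here is hypothetical. */
#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_5		BIT(VM_HIGH_ARCH_BIT_5)
#define VM_UNMERGEABLE		VM_HIGH_ARCH_5	/* sticky KSM opt-out */
#else
#define VM_UNMERGEABLE		0	/* feature compiled out on 32-bit */
#endif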
+cc Daniel Gruss

On Tue, Nov 13, 2018 at 6:59 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
> Lightweight containers are a use case for KSM: many VMs sharing the
> same small kernel. KSM is used in production by large cloud vendors.

Wait, what? Can you name specific ones? Nowadays, enabling KSM for
untrusted VMs seems like a terrible idea to me, security-wise.

Google says at
<https://cloud.google.com/blog/products/gcp/7-ways-we-harden-our-kvm-hypervisor-at-google-cloud-security-in-plaintext>:
"Compute Engine and Container Engine are not vulnerable to this kind of
attack, since they do not use KSM."

An AWS employee says at
<https://forums.aws.amazon.com/thread.jspa?threadID=238519&tstart=0&messageID=739485#739485>:
"memory de-duplication is not enabled by Amazon EC2's hypervisor"

In my opinion, KSM is fundamentally insecure for systems hosting
multiple VMs that don't trust each other. I don't think anyone writes
cryptographic software under the assumption that an attacker will be
given the ability to query whether a given page of data exists anywhere
else on the system.
> Wait, what? Can you name specific ones? Nowadays, enabling KSM for
> untrusted VMs seems like a terrible idea to me, security-wise.

Of course it is not used to share data among different
customers/tenants. As far as I know, it is used by Oracle Cloud to
merge identical pages in clear containers:

https://medium.com/cri-o/intel-clear-containers-and-cri-o-70824fb51811

"One performance enhancing feature is the use of KSM, a recent KVM
optimized for memory sharing and boot speed. Another is the use of an
optimized Clear Containers mini-OS."

Pasha
On Tue, Nov 13, 2018 at 22:17, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> > That would work, but we are out of bit space in vm_flags [1], i.e.
> > the first 32 bits are already defined, and the rest are accessible
> > only on 64-bit machines.
>
> So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

With all due respect, for this kind of thing we need an mm maintainer's
opinion. I just don't want to end up in a situation where, after
touching other subsystems, the maintainer refuses the work for some
reason.

I.e., writing patches for upstream is (from my point of view) more the
art of communication and of making the resulting code acceptable to the
community, because any code that is correct from an engineering point
of view can easily be refused just because someone doesn't find it
useful.

Thanks.
> > > That would work, but we are out of bit space in vm_flags [1], i.e.
> > > the first 32 bits are already defined, and the rest are accessible
> > > only on 64-bit machines.
> >
> > So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.
>
> With all due respect, for this kind of thing we need an mm
> maintainer's opinion.

As far as I understood, you already got directions from the maintainers
to do this similarly to the way THP is implemented, and THP uses two
flags: VM_HUGEPAGE and VM_NOHUGEPAGE, the same as I think KSM should do
if we honor MADV_UNMERGEABLE.

When VM_NOHUGEPAGE is set, khugepaged ignores those VMAs.

There may be a way to add VM_UNMERGEABLE without extending the size of
vm_flags, but that would be a good starting point for looking at how to
add a new flag.

Again, you could simply enable this feature on 64-bit only.

Pasha
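Schematically, the THP precedent Pavel refers to looks like this. This
is a simplification of the real transparent_hugepage_enabled() and
khugepaged checks, not a verbatim copy, and thp_mode_always() is a
stand-in name rather than an actual kernel function:

/* Schematic of THP's two-flag pattern: madvise(MADV_HUGEPAGE) sets
 * VM_HUGEPAGE, madvise(MADV_NOHUGEPAGE) sets VM_NOHUGEPAGE, and the
 * explicit opt-out is honored in every mode of operation. */
static bool thp_allowed_for_vma(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_NOHUGEPAGE)	/* explicit opt-out wins */
		return false;
	if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
		return false;
	if (thp_mode_always())		/* "always": everything else */
		return true;
	return vma->vm_flags & VM_HUGEPAGE;	/* "madvise": opt-in only */
}

A VM_UNMERGEABLE bit would give KSM the same shape: a sticky opt-out
that survives even when the global mode says "always".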
On Tue, Nov 13, 2018 at 21:43, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> It still feels like the wrong direction: a new thread that adds random
> VMAs to scan, with no way to optimize, for example, queue fairness. It
> should really be done at creation time: when a VMA is created, it
> should be added to the KSM scanning queue, or the KSM main scanner
> thread should go through the VMA list in a coherent order.

How do you see queue fairness working in that case? I.e., if you're
talking about moving from old VMAs to new VMAs, IIRC there is no single
kernel-wide list of VMAs.

I.e., I do understand what exactly you don't like, but for that we
would need to add hooks, as I mentioned above (and I already tried to
get that into the kernel [1]). So, as I wrote elsewhere in the thread,
I need a maintainer's opinion on which implementation the person
responsible for mm sees as "right".

> The design of having a separate thread is bad. I plan in the future to
> add per-node thread support to KSM, and this one odd thread would
> break things: to which queue should this thread add VMAs if there are
> multiple queues?

That will be interesting to see :)

But IMHO you will need to add some code to ksm_enter() for that,
because madvise() internally calls ksm_enter(), so ksm_enter() would
decide which thread must process the VMA; it does not depend on the
caller.

Thanks.

1. https://lkml.org/lkml/2014/11/8/206
On Wed, Nov 14, 2018 at 01:53, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> There may be a way to add VM_UNMERGEABLE without extending the size of
> vm_flags, but that would be a good starting point for looking at how
> to add a new flag.
>
> Again, you could simply enable this feature on 64-bit only.

Deal! I will try it on 64-bit machines only.
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..7cffd47f9b38 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -116,6 +116,21 @@ run
         Default: 0 (must be changed to 1 to activate KSM, except if
         CONFIG_SYSFS is disabled)
 
+mode
+        * set always to allow ksm to deduplicate memory of every process
+        * set madvise to use only madvised memory
+
+        Default: madvise (deduplicate only madvised memory as in
+        earlier releases)
+
+seeker_sleep_millisecs
+        how many milliseconds the ksmd task seeker should sleep before
+        trying another task.
+        e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
+
+        Default: 1000 (chosen for rate-limiting purposes)
+
+
 use_zero_pages
         specifies whether empty pages (i.e. allocated pages that only
         contain zeroes) should be treated specially. When set to 1,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5b0894b45ee5..1a03b28b6288 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -273,6 +273,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Milliseconds ksmd seeker should sleep between runs */
+static unsigned int ksm_thread_seeker_sleep_millisecs = 1000;
+
 /* Checksum of an empty (zeroed) page */
 static unsigned int zero_checksum __read_mostly;
 
@@ -295,7 +298,12 @@ static int ksm_nr_node_ids = 1;
 static unsigned long ksm_run = KSM_RUN_STOP;
 static void wait_while_offlining(void);
 
+#define KSM_MODE_MADVISE 0
+#define KSM_MODE_ALWAYS 1
+static unsigned long ksm_mode = KSM_MODE_MADVISE;
+
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
+static DECLARE_WAIT_QUEUE_HEAD(ksm_seeker_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
 static DEFINE_SPINLOCK(ksm_mmlist_lock);
 
@@ -303,6 +311,11 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
 		sizeof(struct __struct), __alignof__(struct __struct),\
 		(__flags), NULL)
 
+static inline int ksm_mode_always(void)
+{
+	return (ksm_mode == KSM_MODE_ALWAYS);
+}
+
 static int __init ksm_slab_init(void)
 {
 	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -2389,6 +2402,106 @@ static int ksmd_should_run(void)
 	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
 }
 
+
+static int ksm_enter(struct mm_struct *mm, unsigned long *vm_flags)
+{
+	int err;
+
+	if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE |
+			 VM_PFNMAP | VM_IO | VM_DONTEXPAND |
+			 VM_HUGETLB | VM_MIXEDMAP))
+		return 0;
+
+#ifdef VM_SAO
+	if (*vm_flags & VM_SAO)
+		return 0;
+#endif
+#ifdef VM_SPARC_ADI
+	if (*vm_flags & VM_SPARC_ADI)
+		return 0;
+#endif
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		err = __ksm_enter(mm);
+		if (err)
+			return err;
+	}
+
+	*vm_flags |= VM_MERGEABLE;
+
+	return 0;
+}
+
+/*
+ * Register all vmas for all processes in the system with KSM.
+ * Note that every call to ksm_enter(), for a given vma, after the
+ * first does nothing but set flags.
+ */
+void ksm_import_task_vma(struct task_struct *task)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int error;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		return;
+	down_write(&mm->mmap_sem);
+	vma = mm->mmap;
+	while (vma) {
+		error = ksm_enter(vma->vm_mm, &vma->vm_flags);
+		vma = vma->vm_next;
+	}
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+}
+
+static int ksm_seeker_thread(void *nothing)
+{
+	pid_t last_pid = 1;
+	pid_t curr_pid;
+	struct task_struct *task;
+
+	set_freezable();
+	set_user_nice(current, 5);
+
+	while (!kthread_should_stop()) {
+		wait_while_offlining();
+
+		try_to_freeze();
+
+		if (!ksm_mode_always()) {
+			wait_event_freezable(ksm_seeker_thread_wait,
+				ksm_mode_always() || kthread_should_stop());
+			continue;
+		}
+
+		/*
+		 * import one task's vma per run
+		 */
+		read_lock(&tasklist_lock);
+
+		/* Always try to get the next task */
+		for_each_process(task) {
+			curr_pid = task_pid_nr(task);
+			if (curr_pid == last_pid) {
+				task = next_task(task);
+				break;
+			}
+
+			if (curr_pid > last_pid)
+				break;
+		}
+
+		last_pid = task_pid_nr(task);
+		ksm_import_task_vma(task);
+		read_unlock(&tasklist_lock);
+
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
+	}
+	return 0;
+}
+
 static int ksm_scan_thread(void *nothing)
 {
 	set_freezable();
@@ -2422,33 +2535,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 
 	switch (advice) {
 	case MADV_MERGEABLE:
-		/*
-		 * Be somewhat over-protective for now!
-		 */
-		if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE |
-				 VM_PFNMAP | VM_IO | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_MIXEDMAP))
-			return 0;	/* just ignore the advice */
-
-		if (vma_is_dax(vma))
-			return 0;
-
-#ifdef VM_SAO
-		if (*vm_flags & VM_SAO)
-			return 0;
-#endif
-#ifdef VM_SPARC_ADI
-		if (*vm_flags & VM_SPARC_ADI)
-			return 0;
-#endif
-
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
-			err = __ksm_enter(mm);
-			if (err)
-				return err;
-		}
-
-		*vm_flags |= VM_MERGEABLE;
+		err = ksm_enter(mm, vm_flags);
+		if (err)
+			return err;
 		break;
 
 	case MADV_UNMERGEABLE:
@@ -2829,6 +2918,29 @@ static ssize_t sleep_millisecs_store(struct kobject *kobj,
 }
 KSM_ATTR(sleep_millisecs);
 
+static ssize_t seeker_sleep_millisecs_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_thread_seeker_sleep_millisecs);
+}
+
+static ssize_t seeker_sleep_millisecs_store(struct kobject *kobj,
+				struct kobj_attribute *attr,
+				const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = kstrtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	ksm_thread_seeker_sleep_millisecs = msecs;
+
+	return count;
+}
+KSM_ATTR(seeker_sleep_millisecs);
+
 static ssize_t pages_to_scan_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
@@ -2852,6 +2964,34 @@ static ssize_t pages_to_scan_store(struct kobject *kobj,
 }
 KSM_ATTR(pages_to_scan);
 
+static ssize_t mode_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	switch (ksm_mode) {
+	case KSM_MODE_ALWAYS:
+		return sprintf(buf, "[always] madvise\n");
+	case KSM_MODE_MADVISE:
+		return sprintf(buf, "always [madvise]\n");
+	}
+
+	return sprintf(buf, "always [madvise]\n");
+}
+
+static ssize_t mode_store(struct kobject *kobj, struct kobj_attribute *attr,
+			  const char *buf, size_t count)
+{
+	if (!memcmp("always", buf, min(sizeof("always")-1, count))) {
+		ksm_mode = KSM_MODE_ALWAYS;
+		wake_up_interruptible(&ksm_seeker_thread_wait);
+	} else if (!memcmp("madvise", buf, min(sizeof("madvise")-1, count))) {
+		ksm_mode = KSM_MODE_MADVISE;
+	} else
+		return -EINVAL;
+
+	return count;
+}
+KSM_ATTR(mode);
+
 static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
 			char *buf)
 {
@@ -3108,7 +3248,9 @@ KSM_ATTR_RO(full_scans);
 
 static struct attribute *ksm_attrs[] = {
 	&sleep_millisecs_attr.attr,
+	&seeker_sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
+	&mode_attr.attr,
 	&run_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
@@ -3134,7 +3276,7 @@ static const struct attribute_group ksm_attr_group = {
 
 static int __init ksm_init(void)
 {
-	struct task_struct *ksm_thread;
+	struct task_struct *ksm_thread[2];
 	int err;
 
 	/* The correct value depends on page size and endianness */
@@ -3146,10 +3288,18 @@ static int __init ksm_init(void)
 	if (err)
 		goto out;
 
-	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
-	if (IS_ERR(ksm_thread)) {
+	ksm_thread[0] = kthread_run(ksm_scan_thread, NULL, "ksmd");
+	if (IS_ERR(ksm_thread[0])) {
 		pr_err("ksm: creating kthread failed\n");
-		err = PTR_ERR(ksm_thread);
+		err = PTR_ERR(ksm_thread[0]);
+		goto out_free;
+	}
+
+	ksm_thread[1] = kthread_run(ksm_seeker_thread, NULL, "ksmd_seeker");
+	if (IS_ERR(ksm_thread[1])) {
+		pr_err("ksm: creating seeker kthread failed\n");
+		err = PTR_ERR(ksm_thread[1]);
+		kthread_stop(ksm_thread[0]);
 		goto out_free;
 	}
 
@@ -3157,7 +3307,8 @@
 	err = sysfs_create_group(mm_kobj, &ksm_attr_group);
 	if (err) {
 		pr_err("ksm: register sysfs failed\n");
-		kthread_stop(ksm_thread);
+		kthread_stop(ksm_thread[0]);
+		kthread_stop(ksm_thread[1]);
 		goto out_free;
 	}
 #else
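If the patch were applied, an administrator would drive it entirely
through the sysfs files the diff adds. A minimal sketch; the paths are
the ones defined by the patch above, and the 5000 ms seeker interval is
an arbitrary example value:

/* Minimal sketch: start ksmd, enable the patch's "always" mode, and
 * slow the seeker down. Paths are those added by the diff above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0) {
		perror(path);
		return -1;
	}
	n = write(fd, val, strlen(val));
	close(fd);
	if (n < 0) {
		perror(path);
		return -1;
	}
	return 0;
}

int main(void)
{
	write_str("/sys/kernel/mm/ksm/run", "1");	/* start ksmd */
	write_str("/sys/kernel/mm/ksm/mode", "always");
	write_str("/sys/kernel/mm/ksm/seeker_sleep_millisecs", "5000");
	return 0;
}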