[RFC,0/2] SKSM: Synchronous Kernel Samepage Merging

Message ID: 20250228023043.83726-1-mathieu.desnoyers@efficios.com (mailing list archive)
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, Linus Torvalds <torvalds@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Olivier Dion <odion@efficios.com>, linux-mm@kvack.org
Subject: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
Date: Thu, 27 Feb 2025 21:30:41 -0500
Series: SKSM: Synchronous Kernel Samepage Merging
  [RFC,0/2] SKSM: Synchronous Kernel Samepage Merging
  [RFC,1/2] mm: Introduce SKSM: Synchronous Kernel Samepage Merging
  [RFC,2/2] selftests/kskm: Introduce SKSM basic test

Mathieu Desnoyers Feb. 28, 2025, 2:30 a.m. UTC

This series introduces SKSM, a new page deduplication ABI,
aiming to fix the limitations inherent to the KSM ABI.

The implementation is simple enough: SKSM is implemented in about 100
LOC compared to 2.5k LOC for KSM (on top of the common KSM helpers).

This is sent as a proof of concept. It applies on top of v6.13.

Feedback is welcome!

Mathieu

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: linux-mm@kvack.org

Mathieu Desnoyers (2):
  mm: Introduce SKSM: Synchronous Kernel Samepage Merging
  selftests/kskm: Introduce SKSM basic test

 include/linux/ksm.h                       |   4 +
 include/linux/mm_types.h                  |   7 +
 include/linux/page-flags.h                |  42 ++++
 include/linux/sksm.h                      |  27 +++
 include/uapi/asm-generic/mman-common.h    |   2 +
 mm/Kconfig                                |   5 +
 mm/Makefile                               |   1 +
 mm/ksm-common.h                           | 228 ++++++++++++++++++++++
 mm/ksm.c                                  | 219 +--------------------
 mm/madvise.c                              |   6 +
 mm/memory.c                               |   2 +
 mm/page_alloc.c                           |   3 +
 mm/sksm.c                                 | 190 ++++++++++++++++++
 tools/testing/selftests/sksm/.gitignore   |   2 +
 tools/testing/selftests/sksm/Makefile     |  14 ++
 tools/testing/selftests/sksm/basic_test.c | 217 ++++++++++++++++++++
 16 files changed, 751 insertions(+), 218 deletions(-)
 create mode 100644 include/linux/sksm.h
 create mode 100644 mm/ksm-common.h
 create mode 100644 mm/sksm.c
 create mode 100644 tools/testing/selftests/sksm/.gitignore
 create mode 100644 tools/testing/selftests/sksm/Makefile
 create mode 100644 tools/testing/selftests/sksm/basic_test.c

Linus Torvalds Feb. 28, 2025, 2:51 a.m. UTC | #1

On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This series introduces SKSM, a new page deduplication ABI,
> aiming to fix the limitations inherent to the KSM ABI.

So I'm not interested in seeing *another* KSM version.

Because I absolutely do *NOT* want a new chapter in the saga of SLUB
vs SLAB vs SLOB.

However, if the feeling is that this can *replace* the current horror
that is KSM, I'm a lot more interested. I suspect our current KSM
model has largely been a failure, and this might be "good enough".

             Linus

Mathieu Desnoyers Feb. 28, 2025, 3:03 a.m. UTC | #2

On 2025-02-27 21:51, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This series introduces SKSM, a new page deduplication ABI,
>> aiming to fix the limitations inherent to the KSM ABI.
> 
> So I'm not interested in seeing *another* KSM version.
> 
> Because I absolutely do *NOT* want a new chapter in the saga of SLUB
> vs SLAB vs SLOB.
> 
> However, if the feeling is that this can *replace* the current horror
> that is KSM, I'm a lot more interested. I suspect our current KSM
> model has largely been a failure, and this might be "good enough".

I'd be fine with SKSM replacing KSM entirely. However, I don't
think we should try to re-implement the existing KSM userspace ABIs
over SKSM. I suspect that many of the problems KSM has today are
caused by the semantics of the ABI it exposes, which was targeted
solely at the use-case of a host deduplicating guest VM memory.

KSM tracks memory meant to be mergeable on an ongoing
basis with a worker thread:

   madvise(2) MADV_{UN,}MERGEABLE
   prctl(2) PR_{SET,GET}_MEMORY_MERGE (security concern)
   ~2.5k LOC excluding ksm-common code
   requires parameter fine-tuning by the sysadmin

SKSM gets the hint from userspace that memory is a good
candidate for merging in its current state and is expected
to stay invariant:

   madvise(2) MADV_MERGE
   ~100 LOC excluding ksm-common code

The main reason why SKSM could be implemented without all the
scanning complexity is because of this simpler ABI.
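
As a rough illustration of the difference between the two ABIs, the sketch below hints a range for merging from user space. It rests on assumptions: `MADV_MERGE` is the behavior proposed by this series and is not in mainline headers, so the code uses it only when the header defines it and otherwise falls back to the existing `MADV_MERGEABLE`. The helper names `map_filled()` and `hint_merge()` are made up for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map len bytes of private anonymous memory filled with one byte value. */
static unsigned char *map_filled(size_t len, unsigned char fill)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	memset(p, fill, len);
	return p;
}

/*
 * Hypothetical helper: ask the kernel to deduplicate [addr, addr + len).
 * Returns 0 on success or a negative errno value.
 */
static int hint_merge(void *addr, size_t len)
{
	int ret;

#if defined(MADV_MERGE)
	/* Proposed SKSM ABI: merge now, synchronously (RFC only). */
	ret = madvise(addr, len, MADV_MERGE);
#elif defined(MADV_MERGEABLE)
	/* Existing KSM ABI: flag the range and let ksmd scan it later. */
	ret = madvise(addr, len, MADV_MERGEABLE);
#else
	(void)addr;
	(void)len;
	errno = ENOSYS;
	ret = -1;
#endif
	return ret ? -errno : 0;
}
```

With the existing ABI the call only flags the range, and merging happens whenever the scanner gets to it. With the proposed ABI the caller is stating that the pages are expected to stay invariant from this point on.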

Thanks for the feedback!

Mathieu

Linus Torvalds Feb. 28, 2025, 5:17 a.m. UTC | #3

On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> I'd be fine with SKSM replacing KSM entirely. However, I don't
> think we should try to re-implement the existing KSM userspace ABIs
> over SKSM.

No, absolutely. The only point (for me) for your new synchronous one
would be if it replaced the kernel thread async scanning, which would
make the old user space interface basically pointless.

But I don't actually know who uses KSM right now. My reaction really
comes from a "it's not nice code in the kernel", not from any actual
knowledge of the users.

Maybe it works really well in some cloud VM environment, and we're
stuck with it forever.

In which case I don't want to see some second different interface that
just makes it all worse.

                 Linus

David Hildenbrand Feb. 28, 2025, 1:59 p.m. UTC | #4

On 28.02.25 06:17, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>> think we should try to re-implement the existing KSM userspace ABIs
>> over SKSM.
> 
> No, absolutely. The only point (for me) for your new synchronous one
> would be if it replaced the kernel thread async scanning, which would
> make the old user space interface basically pointless.
> 
> But I don't actually know who uses KSM right now. My reaction really
> comes from a "it's not nice code in the kernel", not from any actual
> knowledge of the users.
> 
> Maybe it works really well in some cloud VM environment, and we're
> stuck with it forever.

Exactly that; and besides the VM use-case, lately people started using it 
in the context of interpreters (IIRC inside Meta) quite successfully as 
well.

Mathieu Desnoyers Feb. 28, 2025, 2:59 p.m. UTC | #5

On 2025-02-28 00:17, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>> think we should try to re-implement the existing KSM userspace ABIs
>> over SKSM.
> 
> No, absolutely. The only point (for me) for your new synchronous one
> would be if it replaced the kernel thread async scanning, which would
> make the old user space interface basically pointless.
> 
> But I don't actually know who uses KSM right now. My reaction really
> comes from a "it's not nice code in the kernel", not from any actual
> knowledge of the users.
> 
> Maybe it works really well in some cloud VM environment, and we're
> stuck with it forever.
> 

For the VM use-case, I wonder if we could just add a userfaultfd
"COW" event that would notify userspace when a COW happens ?

This would allow userspace to replace ksmd by tracking the age of
those anonymous pages, and issue madvise MADV_MERGE on them to
write-protect+merge them when it is deemed useful.

With both a new userfaultfd COW event and madvise MADV_MERGE,
is there anything else that is fundamentally missing to move
all the scanning complexity of KSM to userspace for the VM
deduplication use-case ?

Thanks,

Mathieu

Sean Christopherson Feb. 28, 2025, 2:59 p.m. UTC | #6

On Fri, Feb 28, 2025, David Hildenbrand wrote:
> On 28.02.25 06:17, Linus Torvalds wrote:
> > On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > > 
> > > I'd be fine with SKSM replacing KSM entirely. However, I don't
> > > think we should try to re-implement the existing KSM userspace ABIs
> > > over SKSM.
> > 
> > No, absolutely. The only point (for me) for your new synchronous one
> > would be if it replaced the kernel thread async scanning, which would
> > make the old user space interface basically pointless.
> > 
> > But I don't actually know who uses KSM right now. My reaction really
> > comes from a "it's not nice code in the kernel", not from any actual
> > knowledge of the users.
> > 
> > Maybe it works really well in some cloud VM environment, and we're
> > stuck with it forever.
> 
> Exactly that; and besides the VM use-case, lately people stated using it in
> the context of interpreters (IIRC inside Meta) quite successfully as well.

Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
in cloud environments?

The security implications of scanning guest memory and having co-tenant VMs share
mappings (should) make it a complete non-starter for any scenario where VMs and/or
their workloads are owned by third parties.

I can imagine there might be first-party use cases, but I would expect many/most
of those to be able to explicitly share mappings, which would provide far, far
better power and performance characteristics.

Mathieu Desnoyers Feb. 28, 2025, 3:01 p.m. UTC | #7

On 2025-02-28 08:59, David Hildenbrand wrote:
> On 28.02.25 06:17, Linus Torvalds wrote:
>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>> think we should try to re-implement the existing KSM userspace ABIs
>>> over SKSM.
>>
>> No, absolutely. The only point (for me) for your new synchronous one
>> would be if it replaced the kernel thread async scanning, which would
>> make the old user space interface basically pointless.
>>
>> But I don't actually know who uses KSM right now. My reaction really
>> comes from a "it's not nice code in the kernel", not from any actual
>> knowledge of the users.
>>
>> Maybe it works really well in some cloud VM environment, and we're
>> stuck with it forever.
> 
> Exactly that; and besides the VM use-case, lately people stated using it 
> in the context of interpreters (IIRC inside Meta) quite successfully as 
> well.
> 

I suspect that SKSM is a better fit for JIT and code patching than KSM,
because user-space knows better when a set of pages is going to become
invariant for a long time and thus benefit from merging. This removes
the background scanning from the picture.

Does the interpreter use-case require background scanning, or does
it know when a set of pages are meant to become invariant for a long
time ?

Thanks,

Mathieu

David Hildenbrand Feb. 28, 2025, 3:10 p.m. UTC | #8

On 28.02.25 15:59, Sean Christopherson wrote:
> On Fri, Feb 28, 2025, David Hildenbrand wrote:
>> On 28.02.25 06:17, Linus Torvalds wrote:
>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>> over SKSM.
>>>
>>> No, absolutely. The only point (for me) for your new synchronous one
>>> would be if it replaced the kernel thread async scanning, which would
>>> make the old user space interface basically pointless.
>>>
>>> But I don't actually know who uses KSM right now. My reaction really
>>> comes from a "it's not nice code in the kernel", not from any actual
>>> knowledge of the users.
>>>
>>> Maybe it works really well in some cloud VM environment, and we're
>>> stuck with it forever.
>>
>> Exactly that; and besides the VM use-case, lately people stated using it in
>> the context of interpreters (IIRC inside Meta) quite successfully as well.
> 
> Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
> in cloud environments?

Private clouds yes, that's where it is most commonly used for. I would 
assume that nobody for

For example, there is some older documentation here:

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/virtualization_administration_guide/chap-ksm#chap-KSM

which touches on the security aspects:

"The page deduplication technology (used also by the KSM implementation) 
may introduce side channels that could potentially be used to leak 
information across multiple guests. In case this is a concern, KSM can 
be disabled on a per-guest basis."

> 
> The security implications of scanning guest memory and having co-tenant VMs share
> mappings (should) make it a complete non-starter for any scenario where VMs and/or
> their workloads are owned by third parties.

Jep.

> 
> I can imagine there might be first-party use cases, but I would expect many/most
> of those to be able to explicitly share mappings, which would provide far, far
> better power and performance characteristics.

Note that KSM can be very efficient when you have multiple VMs running 
the same kernel, executables, libraries, etc. If my memory serves, 
that's precisely what it was originally invented for, and how it is 
being used today in the context of VMs.

For example, QEMU will mark all guest memory as mergeable using 
madvise(MADV_MERGEABLE), to limit the deduplication to guest RAM only.

David Hildenbrand Feb. 28, 2025, 3:18 p.m. UTC | #9

On 28.02.25 16:01, Mathieu Desnoyers wrote:
> On 2025-02-28 08:59, David Hildenbrand wrote:
>> On 28.02.25 06:17, Linus Torvalds wrote:
>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>> over SKSM.
>>>
>>> No, absolutely. The only point (for me) for your new synchronous one
>>> would be if it replaced the kernel thread async scanning, which would
>>> make the old user space interface basically pointless.
>>>
>>> But I don't actually know who uses KSM right now. My reaction really
>>> comes from a "it's not nice code in the kernel", not from any actual
>>> knowledge of the users.
>>>
>>> Maybe it works really well in some cloud VM environment, and we're
>>> stuck with it forever.
>>
>> Exactly that; and besides the VM use-case, lately people stated using it
>> in the context of interpreters (IIRC inside Meta) quite successfully as
>> well.
>>
> 
> I suspect that SKSM is a better fit for JIT and code patching than KSM,
> because user-space knows better when a set of pages is going to become
> invariant for a long time and thus benefit from merging. This removes
> the background scanning from the picture.
>
> Does the interpreter use-case require background scanning, or does
> it know when a set of pages are meant to become invariant for a long
> time ?

To make the JIT/interpreter use case happy, people wanted ways to 
*force* KSM on for *the whole process*, not just individual VMAs like 
the traditional VM use case would have done.

I recall one of the reasons being that you don't really want to modify 
your JIT/interpreter to just make KSM work.
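
For reference, the process-wide opt-in mentioned here is the prctl(2) PR_SET_MEMORY_MERGE interface listed earlier in the thread. The following is a minimal sketch, not code from any of the projects discussed: the helper names are invented, and the fallback constants are my understanding of the values in include/uapi/linux/prctl.h (Linux 6.4 and later), used only if the libc headers do not provide them.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_MEMORY_MERGE
/* Assumed values from include/uapi/linux/prctl.h (Linux 6.4+). */
#define PR_SET_MEMORY_MERGE 67
#define PR_GET_MEMORY_MERGE 68
#endif

/* Opt the whole process into KSM. Returns 0 or a negative errno value. */
static int enable_process_ksm(void)
{
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0) != 0)
		return -errno;
	return 0;
}

/* Returns 1 if process-wide KSM is on, 0 if off, negative errno on error. */
static int process_ksm_enabled(void)
{
	int ret = prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0);

	return ret < 0 ? -errno : ret;
}
```

A launcher can call `enable_process_ksm()` before exec'ing an unmodified interpreter, which is the point made above: the JIT itself does not need to know about KSM.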

See [1] "KSM at Meta" for some details, and in general, optimization 
work to adapt KSM to new use cases.

Regarding some concerns you raised, Stefan did a lot of optimization 
work like "smart scanning" (slide "Optimization - Smart Scan (6.7)") to 
reduce the scanning overhead and make it much more efficient.

So people started optimizing for that already and got pretty good results.

[1] 
https://lpc.events/event/17/contributions/1625/attachments/1320/2649/KSM.pdf

David Hildenbrand Feb. 28, 2025, 3:19 p.m. UTC | #10

On 28.02.25 16:10, David Hildenbrand wrote:
> On 28.02.25 15:59, Sean Christopherson wrote:
>> On Fri, Feb 28, 2025, David Hildenbrand wrote:
>>> On 28.02.25 06:17, Linus Torvalds wrote:
>>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>
>>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>>> over SKSM.
>>>>
>>>> No, absolutely. The only point (for me) for your new synchronous one
>>>> would be if it replaced the kernel thread async scanning, which would
>>>> make the old user space interface basically pointless.
>>>>
>>>> But I don't actually know who uses KSM right now. My reaction really
>>>> comes from a "it's not nice code in the kernel", not from any actual
>>>> knowledge of the users.
>>>>
>>>> Maybe it works really well in some cloud VM environment, and we're
>>>> stuck with it forever.
>>>
>>> Exactly that; and besides the VM use-case, lately people stated using it in
>>> the context of interpreters (IIRC inside Meta) quite successfully as well.
>>
>> Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
>> in cloud environments?
> 
> Private clouds yes, that's where it is most commonly used for. I would
> assume that nobody for

forgot to complete that sentence: "... nobody really should be using 
that in public clouds."

David Hildenbrand Feb. 28, 2025, 3:34 p.m. UTC | #11

On 28.02.25 03:51, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This series introduces SKSM, a new page deduplication ABI,
>> aiming to fix the limitations inherent to the KSM ABI.
> 
> So I'm not interested in seeing *another* KSM version.
> 
> Because I absolutely do *NOT* want a new chapter in the saga of SLUB
> vs SLAB vs SLOB.
> 
> However, if the feeling is that this can *replace* the current horror
> that is KSM, I'm a lot more interested. I suspect our current KSM
> model has largely been a failure, and this might be "good enough".

Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE?

Many/most use cases just leave THP scanning+collapsing to khugepaged; 
selected ones might "know better" what to do, so they effectively 
disable khugepaged, and manually collapse THPs using MADV_COLLAPSE.

If it would be similar to that, it would not be a completely different KSM 
version, just a different way to trigger merging: background scanning 
vs. user-space triggered ("synchronous").

I could see use cases for such a synchronous interface, but I doubt it 
could replace the background scanning that is actively getting used for 
existing use cases; I have similar thoughts about khugepaged vs. 
MADV_COLLAPSE.

Matthew Wilcox Feb. 28, 2025, 3:38 p.m. UTC | #12

On Fri, Feb 28, 2025 at 04:34:50PM +0100, David Hildenbrand wrote:
> Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE?

I think it is comparable ... because many people find khugepaged
unacceptable and there are proposals to move that to userspace.

Peter Xu Feb. 28, 2025, 4:32 p.m. UTC | #13

On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> For the VM use-case, I wonder if we could just add a userfaultfd
> "COW" event that would notify userspace when a COW happens ?

I don't know what's best for KSM and how well this will work, but we
have had such an event for years. See UFFDIO_REGISTER_MODE_WP:

https://man7.org/linux/man-pages/man2/userfaultfd.2.html

> 
> This would allow userspace to replace ksmd by tracking the age of
> those anonymous pages, and issue madvise MADV_MERGE on them to
> write-protect+merge them when it is deemed useful.
> 
> With both a new userfaultfd COW event and madvise MADV_MERGE,
> is there anything else that is fundamentally missing to move
> all the scanning complexity of KSM to userspace for the VM
> deduplication use-case ?

Thanks,

Mathieu Desnoyers Feb. 28, 2025, 5:53 p.m. UTC | #14

On 2025-02-28 11:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>> For the VM use-case, I wonder if we could just add a userfaultfd
>> "COW" event that would notify userspace when a COW happens ?
> 
> I don't know what's the best for KSM and how well this will work, but we
> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
> 
> https://man7.org/linux/man-pages/man2/userfaultfd.2.html

userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
resulting from a mmap mapping, but returns EINVAL if I pass a
page-aligned address which sits within a private file mapping
(e.g. executable data).

Also, I notice that do_wp_page() only calls handle_userfault() with
VM_UFFD_WP when the vm_fault flags do not have FAULT_FLAG_UNSHARE
set.

AFAIU, as it stands now userfaultfd would not help tracking COW faults
caused by stores to private file mappings. Am I missing something ?

Thanks,

Mathieu

> 
>>
>> This would allow userspace to replace ksmd by tracking the age of
>> those anonymous pages, and issue madvise MADV_MERGE on them to
>> write-protect+merge them when it is deemed useful.
>>
>> With both a new userfaultfd COW event and madvise MADV_MERGE,
>> is there anything else that is fundamentally missing to move
>> all the scanning complexity of KSM to userspace for the VM
>> deduplication use-case ?
> 
> Thanks,
>

Mathieu Desnoyers Feb. 28, 2025, 9:38 p.m. UTC | #15

On 2025-02-28 10:10, David Hildenbrand wrote:
[...]
> For example, QEMU will mark all guest memory is mergeable using MADV, to 
> limit the deduplicaton to guest RAM only.
> 

On a related note, I think the madvise(2) documentation is inaccurate.

It states:

        MADV_MERGEABLE (since Linux 2.6.32)
               Enable  Kernel Samepage Merging (KSM) for the pages in the range
               specified by addr and length. [...]

AFAIU, based on code review of ksm_madvise(), this is not strictly true.

The KSM implementation enables KSM for pages in the entire vma containing the range.
So if it so happens that two mmap areas with identical protection flags are merged,
both will be considered mergeable by KSM as soon as at least one page from any of
those areas is made mergeable.

This does not appear to be an issue in qemu because guard pages with different
protection are placed between distinct mappings, which should prevent combining
the vmas.

Thanks,

Mathieu

David Hildenbrand Feb. 28, 2025, 9:45 p.m. UTC | #16

On 28.02.25 22:38, Mathieu Desnoyers wrote:
> On 2025-02-28 10:10, David Hildenbrand wrote:
> [...]
>> For example, QEMU will mark all guest memory is mergeable using MADV, to
>> limit the deduplicaton to guest RAM only.
>>
> 
> On a related note, I think the madvise(2) documentation is inaccurate.
> 
> It states:
> 
>          MADV_MERGEABLE (since Linux 2.6.32)
>                 Enable  Kernel Samepage Merging (KSM) for the pages in the range
>                 specified by addr and length. [...]
> 
> AFAIU, based on code review of ksm_madvise(), this is not strictly true.
> 
> The KSM implementation enables KSM for pages in the entire vma containing the range.
> So if it so happens that two mmap areas with identical protection flags are merged,
> both will be considered mergeable by KSM as soon as at least one page from any of
> those areas is made mergeable.

I *think* it does what is documented. In madvise_vma_behavior(), 
ksm_madvise() will update "new_flags".

Then we call madvise_update_vma() to split the VMA if required and set 
new_flags only on the split VMA. The handling is similar to other MADV 
operations that end up modifying vm_flags.

If I am missing something and this is indeed broken, we should 
definitely write a selftest for it and fix it.
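
One way such a check could look from user space is sketched below. It is not the selftest from this series, only an illustration under assumptions: it relies on /proc/self/smaps reporting the "mg" (mergeable) flag in VmFlags, and the helper names are invented. It maps three pages, applies MADV_MERGEABLE to the middle one, and verifies that only the middle VMA carries the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Return 1 if the VMA containing addr has the "mg" flag in
 * /proc/self/smaps, 0 if it does not, -1 if no such VMA was found.
 */
static int vma_is_mergeable(const void *addr)
{
	unsigned long target = (unsigned long)addr, start, end;
	char line[512];
	int inside = 0, result = -1;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx ", &start, &end) == 2) {
			/* VMA header line: "start-end perms offset ..." */
			inside = target >= start && target < end;
		} else if (inside && !strncmp(line, "VmFlags:", 8)) {
			result = strstr(line, " mg") != NULL;
			break;
		}
	}
	fclose(f);
	return result;
}

/*
 * Returns 1 if only the advised page became mergeable, 0 if the flag
 * leaked to its neighbours (or was not set), negative errno if the
 * check could not be run (e.g. kernel built without KSM).
 */
static int check_partial_mergeable(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int lo, mid, hi, ret;

	if (p == MAP_FAILED)
		return -errno;
	if (madvise(p + page, page, MADV_MERGEABLE)) {
		ret = -errno;
	} else {
		lo = vma_is_mergeable(p);
		mid = vma_is_mergeable(p + page);
		hi = vma_is_mergeable(p + 2 * page);
		if (lo < 0 || mid < 0 || hi < 0)
			ret = -ENOENT;
		else
			ret = (mid == 1 && lo == 0 && hi == 0);
	}
	munmap(p, 3 * page);
	return ret;
}
```

If the VMA is split as described above, `check_partial_mergeable()` is expected to return 1 on a kernel with KSM enabled, provided the process has not opted in globally through PR_SET_MEMORY_MERGE.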

Mathieu Desnoyers Feb. 28, 2025, 9:49 p.m. UTC | #17

On 2025-02-28 16:45, David Hildenbrand wrote:
> On 28.02.25 22:38, Mathieu Desnoyers wrote:
>> On 2025-02-28 10:10, David Hildenbrand wrote:
>> [...]
>>> For example, QEMU will mark all guest memory is mergeable using MADV, to
>>> limit the deduplicaton to guest RAM only.
>>>
>>
>> On a related note, I think the madvise(2) documentation is inaccurate.
>>
>> It states:
>>
>>          MADV_MERGEABLE (since Linux 2.6.32)
>>                 Enable  Kernel Samepage Merging (KSM) for the pages in 
>> the range
>>                 specified by addr and length. [...]
>>
>> AFAIU, based on code review of ksm_madvise(), this is not strictly true.
>>
>> The KSM implementation enables KSM for pages in the entire vma 
>> containing the range.
>> So if it so happens that two mmap areas with identical protection 
>> flags are merged,
>> both will be considered mergeable by KSM as soon as at least one page 
>> from any of
>> those areas is made mergeable.
> 
> I *think* it does what is documented. In madvise_vma_behavior(), 
> ksm_madvise() will update "new_flags".
> 
> Then we call madvise_update_vma() to split the VMA if required and set 
> new_flags only on the split VMA. The handling is similar to other MADV 
> operations that end up modifying vm_flags.
> 
> If I am missing something and this is indeed broken, we should 
> definitely write a selftest for it and fix it.
> 

You are correct, I missed that part. Thanks for the clarification!

Mathieu

Peter Xu Feb. 28, 2025, 10:32 p.m. UTC | #18

On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
> On 2025-02-28 11:32, Peter Xu wrote:
> > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> > > For the VM use-case, I wonder if we could just add a userfaultfd
> > > "COW" event that would notify userspace when a COW happens ?
> > 
> > I don't know what's the best for KSM and how well this will work, but we
> > have such event for years..  See UFFDIO_REGISTER_MODE_WP:
> > 
> > https://man7.org/linux/man-pages/man2/userfaultfd.2.html
> 
> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
> resulting from a mmap mapping, but returns EINVAL if I pass a
> page-aligned address which sits within a private file mapping
> (e.g. executable data).

Yes, so far sync traps only support RAM-based file systems and anonymous
memory. Generic private file mappings (which store executables and
libraries) are not yet supported.

> 
> Also, I notice that do_wp_page() only calls handle_userfault
> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> set.

AFAICT that's expected, unshare should only be set on reads, never writes.
So uffd-wp shouldn't trap any of those.

> 
> AFAIU, as it stands now userfaultfd would not help tracking COW faults
> caused by stores to private file mappings. Am I missing something ?

I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
most mappings.  That one is async, though, so more like soft-dirty.  It
might be doable to try making it sync too without a lot of changes based on
how async tracking works.

Thanks,

Mathieu Desnoyers March 1, 2025, 3:44 p.m. UTC | #19

On 2025-02-28 17:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>> On 2025-02-28 11:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>> "COW" event that would notify userspace when a COW happens ?
>>>
>>> I don't know what's the best for KSM and how well this will work, but we
>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>
>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>
>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>> resulting from a mmap mapping, but returns EINVAL if I pass a
>> page-aligned address which sits within a private file mapping
>> (e.g. executable data).
> 
> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> Generic private file mappings (that stores executables and libraries) are
> not yet supported.

OK, this confirms my observations.

> 
>>
>> Also, I notice that do_wp_page() only calls handle_userfault
>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>> set.
> 
> AFAICT that's expected, unshare should only be set on reads, never writes.
> So uffd-wp shouldn't trap any of those.

I'm confused by your comment. I thought unshare only applies to
*write* faults. What am I missing ?

> 
>>
>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>> caused by stores to private file mappings. Am I missing something ?
> 
> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
> most mappings.  That one is async, though, so more like soft-dirty.  It
> might be doable to try making it sync too without a lot of changes based on
> how async tracking works.

I'll try this out. It may not matter that it's async given the
use-case of tracking the age since the WP fault on the COW pages. We
don't need to react to the event in-place to alter its behavior, just
a notification should be fine AFAIU.

Thanks,

Mathieu

Peter Xu March 3, 2025, 3:01 p.m. UTC | #20

On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote:
> > > Also, I notice that do_wp_page() only calls handle_userfault
> > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> > > set.
> > 
> > AFAICT that's expected, unshare should only be set on reads, never writes.
> > So uffd-wp shouldn't trap any of those.
> 
> I'm confused by your comment. I thought unshare only applies to
> *write* faults. What am I missing ?

The major path so far to set unshare is here in GUP (ignoring two corner
cases used in either s390 or ksm):

	if (unshare) {
		fault_flags |= FAULT_FLAG_UNSHARE;
		/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
		VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
	}

See the VM_BUG_ON() - if it's write it'll crash already.

"unshare", in its earliest form of patch, used to be called COR
(Copy-On-Read), which might be more straightforward in this case. So it's
the counterpart of COW but for read cases where a copy is required. The
patchset that introduced it has more information (e.g. a7f2266041).

Thanks,

David Hildenbrand March 3, 2025, 4:36 p.m. UTC | #21

On 03.03.25 16:01, Peter Xu wrote:
> On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote:
>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>> set.
>>>
>>> AFAICT that's expected, unshare should only be set on reads, never writes.
>>> So uffd-wp shouldn't trap any of those.
>>
>> I'm confused by your comment. I thought unshare only applies to
>> *write* faults. What am I missing ?
> 
> The major path so far to set unshare is here in GUP (ignoring two corner
> cases used in either s390 and ksm):

"unshare" fault, in contrast to a write fault, will not turn the PTE 
writable.

That's why it does not trigger userfaultfd-wp: there is no write access, 
write-protection is left unchanged.

Mathieu Desnoyers March 3, 2025, 8:01 p.m. UTC | #22

On 2025-02-28 17:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>> On 2025-02-28 11:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>> "COW" event that would notify userspace when a COW happens ?
>>>
>>> I don't know what's the best for KSM and how well this will work, but we
>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>
>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>
>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>> resulting from a mmap mapping, but returns EINVAL if I pass a
>> page-aligned address which sits within a private file mapping
>> (e.g. executable data).
> 
> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> Generic private file mappings (that stores executables and libraries) are
> not yet supported.
> 
>>
>> Also, I notice that do_wp_page() only calls handle_userfault
>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>> set.
> 
> AFAICT that's expected, unshare should only be set on reads, never writes.
> So uffd-wp shouldn't trap any of those.
> 
>>
>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>> caused by stores to private file mappings. Am I missing something ?
> 
> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
> most mappings.  That one is async, though, so more like soft-dirty.  It
> might be doable to try making it sync too without a lot of changes based on
> how async tracking works.

I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
be a good fit. Here is what I have in mind to replace the ksmd scanning
thread for the VM use-case by a purely user-space driven scanning:

Within qemu or similar user-space process:

1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
    UFFDIO_REGISTER_MODE_WP mode.

2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
    to detect memory which stays invariant for a long time.

3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
    Keep track of memory which is frequently modified, so it can be left alone and
    neither write-protected nor merged anymore.

4) Whenever pages stay invariant for a given lapse of time, merge them with the new
    madvise(2) MADV_MERGE behavior.
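
The per-page bookkeeping behind steps 3 and 4 could look roughly like the
sketch below. It is only the user-space policy part: it assumes the caller
has already obtained, for each scan, whether a page was reported as
PAGE_IS_WRITTEN, and it returns what to do next. The thresholds, type names
and sketch_scan_update() are hypothetical and would need tuning against a
real workload.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds, chosen only for illustration. */
#define SKETCH_MERGE_AFTER_QUIET 3 /* quiet scans before asking for a merge */
#define SKETCH_HOT_AFTER_WRITES  4 /* written scans before giving up on a page */

enum sketch_action { SKETCH_WAIT, SKETCH_MERGE, SKETCH_LEAVE_ALONE };

struct sketch_page_state {
	uint8_t quiet_scans; /* consecutive scans without a write */
	uint8_t write_count; /* number of scans in which the page was written */
	bool merged;         /* a merge was requested and no write seen since */
};

/*
 * Feed one scan result for one page. 'written' is what the scan
 * reported for that page since the previous scan.
 */
static enum sketch_action
sketch_scan_update(struct sketch_page_state *st, bool written)
{
	/* Frequently modified pages are no longer tracked or merged. */
	if (st->write_count >= SKETCH_HOT_AFTER_WRITES)
		return SKETCH_LEAVE_ALONE;

	if (written) {
		st->quiet_scans = 0;
		st->merged = false; /* a write to a merged page went through COW */
		if (++st->write_count >= SKETCH_HOT_AFTER_WRITES)
			return SKETCH_LEAVE_ALONE;
		return SKETCH_WAIT;
	}

	if (st->merged)
		return SKETCH_WAIT; /* already merged, nothing to do */

	if (++st->quiet_scans >= SKETCH_MERGE_AFTER_QUIET) {
		st->merged = true;
		return SKETCH_MERGE; /* caller would issue the merge request */
	}
	return SKETCH_WAIT;
}
```

On SKETCH_MERGE the caller would issue the merge request for that page;
on SKETCH_LEAVE_ALONE it would stop write-protecting and scanning it.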

Let me know if that makes sense.

Thanks,

Mathieu

Peter Xu March 3, 2025, 8:45 p.m. UTC | #23

On Mon, Mar 03, 2025 at 03:01:38PM -0500, Mathieu Desnoyers wrote:
> On 2025-02-28 17:32, Peter Xu wrote:
> > On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
> > > On 2025-02-28 11:32, Peter Xu wrote:
> > > > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> > > > > For the VM use-case, I wonder if we could just add a userfaultfd
> > > > > "COW" event that would notify userspace when a COW happens ?
> > > > 
> > > > I don't know what's the best for KSM and how well this will work, but we
> > > > have such event for years..  See UFFDIO_REGISTER_MODE_WP:
> > > > 
> > > > https://man7.org/linux/man-pages/man2/userfaultfd.2.html
> > > 
> > > userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
> > > resulting from a mmap mapping, but returns EINVAL if I pass a
> > > page-aligned address which sits within a private file mapping
> > > (e.g. executable data).
> > 
> > Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> > Generic private file mappings (that stores executables and libraries) are
> > not yet supported.
> > 
> > > 
> > > Also, I notice that do_wp_page() only calls handle_userfault
> > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> > > set.
> > 
> > AFAICT that's expected, unshare should only be set on reads, never writes.
> > So uffd-wp shouldn't trap any of those.
> > 
> > > 
> > > AFAIU, as it stands now userfaultfd would not help tracking COW faults
> > > caused by stores to private file mappings. Am I missing something ?
> > 
> > I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
> > most mappings.  That one is async, though, so more like soft-dirty.  It
> > might be doable to try making it sync too without a lot of changes based on
> > how async tracking works.
> 
> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
> be a good fit. Here is what I have in mind to replace the ksmd scanning
> thread for the VM use-case by a purely user-space driven scanning:
> 
> Within qemu or similar user-space process:
> 
> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
>    UFFDIO_REGISTER_MODE_WP mode.
> 
> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
>    to detect memory which stays invariant for a long time.
> 
> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
>    Keep track of memory which is frequently modified, so it can be left alone and
>    not write-protected nor merged anymore.
> 
> 4) Whenever pages stay invariant for a given lapse of time, merge them with the new
>    madvise(2) KSM_MERGE behavior.
> 
> Let me know if that makes sense.

I can't speak to how KSM should proceed from there, but from a userfaultfd
tracking point of view, that makes sense to me.

Thanks,

David Hildenbrand March 3, 2025, 8:49 p.m. UTC | #24

On 03.03.25 21:01, Mathieu Desnoyers wrote:
> On 2025-02-28 17:32, Peter Xu wrote:
>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>
>>>> I don't know what's the best for KSM and how well this will work, but we
>>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>>
>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>
>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>> page-aligned address which sits within a private file mapping
>>> (e.g. executable data).
>>
>> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
>> Generic private file mappings (that stores executables and libraries) are
>> not yet supported.
>>
>>>
>>> Also, I notice that do_wp_page() only calls handle_userfault
>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>> set.
>>
>> AFAICT that's expected, unshare should only be set on reads, never writes.
>> So uffd-wp shouldn't trap any of those.
>>
>>>
>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>> caused by stores to private file mappings. Am I missing something ?
>>
>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
>> most mappings.  That one is async, though, so more like soft-dirty.  It
>> might be doable to try making it sync too without a lot of changes based on
>> how async tracking works.
> 
> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
> be a good fit. Here is what I have in mind to replace the ksmd scanning
> thread for the VM use-case by a purely user-space driven scanning:
> 
> Within qemu or similar user-space process:
> 
> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
>      UFFDIO_REGISTER_MODE_WP mode.
> 
> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
>      to detect memory which stays invariant for a long time.
> 
> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
>      Keep track of memory which is frequently modified, so it can be left alone and
>      not write-protected nor merged anymore.
> 
> 4) Whenever pages stay invariant for a given lapse of time, merge them with the new
>      madvise(2) KSM_MERGE behavior.
> 
> Let me know if that makes sense.

Note that one of the strengths of ksm in the kernel right now is that we 
write-protect + try-deduplicate only when we are fairly sure that we can 
deduplicate (unstable tree), and that the interaction with THPs / large 
folios is fairly well thought-through.

Also note that, just because data hasn't been written in some time 
interval, doesn't mean that it should be deduplicated and result in CoW 
on next write access.

One probably would have to mimic what the KSM implementation in the 
kernel does, and build something like the unstable tree, to find 
candidates where we can actually deduplicate. Then, have a way to 
not-deduplicate if the content changed.

Mathieu Desnoyers March 5, 2025, 2:06 p.m. UTC | #25

On 2025-03-03 15:49, David Hildenbrand wrote:
> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>> On 2025-02-28 17:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>
>>>>> I don't know what's the best for KSM and how well this will work, 
>>>>> but we
>>>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>>>
>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>
>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>> page-aligned address which sits within a private file mapping
>>>> (e.g. executable data).
>>>
>>> Yes, so far sync traps only supports RAM-based file systems, or 
>>> anonymous.
>>> Generic private file mappings (that stores executables and libraries) 
>>> are
>>> not yet supported.
>>>
>>>>
>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>> set.
>>>
>>> AFAICT that's expected, unshare should only be set on reads, never 
>>> writes.
>>> So uffd-wp shouldn't trap any of those.
>>>
>>>>
>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>> caused by stores to private file mappings. Am I missing something ?
>>>
>>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should 
>>> work on
>>> most mappings.  That one is async, though, so more like soft-dirty.  It
>>> might be doable to try making it sync too without a lot of changes 
>>> based on
>>> how async tracking works.
>>
>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>> thread for the VM use-case by a purely user-space driven scanning:
>>
>> Within qemu or similar user-space process:
>>
>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC 
>> feature and
>>      UFFDIO_REGISTER_MODE_WP mode.
>>
>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl 
>> PM_SCAN_WP_MATCHING flag
>>      to detect memory which stays invariant for a long time.
>>
>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which 
>> pages are written to.
>>      Keep track of memory which is frequently modified, so it can be 
>> left alone and
>>      not write-protected nor merged anymore.
>>
>> 4) Whenever pages stay invariant for a given lapse of time, merge them 
>> with the new
>>      madvise(2) KSM_MERGE behavior.
>>
>> Let me know if that makes sense.
> 
> Note that one of the strengths of ksm in the kernel right now is that we 
> write-protect + try-deduplicate only when we are fairly sure that we can 
> deduplicate (unstable tree), and that the interaction with THPs / large 
> folios is fairly well thought-through.
> 
> Also note that, just because data hasn't been written in some time 
> interval, doesn't mean that it should be deduplicated and result in CoW 
> on next write access.

Right. This tracking of address range access pattern would have to be
implemented in user-space.

> One probably would have to mimic what the KSM implementation in the 
> kernel does, and built something like the unstable tree, to find 
> candidates where we can actually deduplciate. Then, have a way to not- 
> deduplicate if the content changed.

With madvise MADV_MERGE, there is no need to "unmerge". The merge
write-protects the page and merges its content at the time of the
MADV_MERGE with exact duplicates, and keeps that write protected page in
a global hash table indexed by checksum.
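
To make the "hash table indexed by checksum" idea concrete, here is a
minimal user-space sketch, not the code from the patch. It assumes
4096-byte pages and uses FNV-1a as a stand-in checksum; the table layout
and the sketch_* names are hypothetical. The checksum only selects
candidates, and a full memcmp() confirms an exact duplicate before two
pages are treated as the same.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_BUCKETS   1024 /* must be a power of two */

struct sketch_node {
	struct sketch_node *next;
	uint32_t checksum;
	const unsigned char *page; /* canonical copy for this content */
};

struct sketch_table {
	struct sketch_node *buckets[SKETCH_BUCKETS];
};

/* FNV-1a over the whole page; a stand-in for whatever checksum is used. */
static uint32_t sketch_checksum(const unsigned char *page)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++) {
		h ^= page[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Look up 'page' by checksum. If a page with identical content is
 * already registered, return it (the caller would map that one).
 * Otherwise register 'page' using the caller-provided 'node' and
 * return 'page' itself.
 */
static const unsigned char *
sketch_merge(struct sketch_table *t, struct sketch_node *node,
	     const unsigned char *page)
{
	uint32_t sum = sketch_checksum(page);
	struct sketch_node **head = &t->buckets[sum & (SKETCH_BUCKETS - 1)];

	for (struct sketch_node *n = *head; n; n = n->next) {
		/* Checksums can collide, so compare the full content. */
		if (n->checksum == sum &&
		    !memcmp(n->page, page, SKETCH_PAGE_SIZE))
			return n->page;
	}

	node->checksum = sum;
	node->page = page;
	node->next = *head;
	*head = node;
	return page;
}
```

The table must start zero-initialized. A real implementation would also
need locking, reference counting and removal of stale entries, all of
which are left out here.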

However, unlike KSM, it won't track that range on an ongoing basis.

"Unmerging" the page is done naturally by writing to the merged address
range. Because it is write-protected, this will trigger COW, and will 
therefore provide a new anonymous page to the process, thus "unmerging"
that page.

It's really just up to userspace to track COW faults and figure out
that it really should not try to merge that range anymore, based on
the access pattern monitored through write-protection faults.

Thanks,

Mathieu

David Hildenbrand March 5, 2025, 7:22 p.m. UTC | #26

On 05.03.25 15:06, Mathieu Desnoyers wrote:
> On 2025-03-03 15:49, David Hildenbrand wrote:
>> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>>> On 2025-02-28 17:32, Peter Xu wrote:
>>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>>
>>>>>> I don't know what's the best for KSM and how well this will work,
>>>>>> but we
>>>>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>>>>
>>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>>
>>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>>> page-aligned address which sits within a private file mapping
>>>>> (e.g. executable data).
>>>>
>>>> Yes, so far sync traps only supports RAM-based file systems, or
>>>> anonymous.
>>>> Generic private file mappings (that stores executables and libraries)
>>>> are
>>>> not yet supported.
>>>>
>>>>>
>>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>>> set.
>>>>
>>>> AFAICT that's expected, unshare should only be set on reads, never
>>>> writes.
>>>> So uffd-wp shouldn't trap any of those.
>>>>
>>>>>
>>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>>> caused by stores to private file mappings. Am I missing something ?
>>>>
>>>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should
>>>> work on
>>>> most mappings.  That one is async, though, so more like soft-dirty.  It
>>>> might be doable to try making it sync too without a lot of changes
>>>> based on
>>>> how async tracking works.
>>>
>>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>>> thread for the VM use-case by a purely user-space driven scanning:
>>>
>>> Within qemu or similar user-space process:
>>>
>>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC
>>> feature and
>>>       UFFDIO_REGISTER_MODE_WP mode.
>>>
>>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl
>>> PM_SCAN_WP_MATCHING flag
>>>       to detect memory which stays invariant for a long time.
>>>
>>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which
>>> pages are written to.
>>>       Keep track of memory which is frequently modified, so it can be
>>> left alone and
>>>       not write-protected nor merged anymore.
>>>
>>> 4) Whenever pages stay invariant for a given lapse of time, merge them
>>> with the new
>>>       madvise(2) KSM_MERGE behavior.
>>>
>>> Let me know if that makes sense.
>>
>> Note that one of the strengths of ksm in the kernel right now is that we
>> write-protect + try-deduplicate only when we are fairly sure that we can
>> deduplicate (unstable tree), and that the interaction with THPs / large
>> folios is fairly well thought-through.
>>
>> Also note that, just because data hasn't been written in some time
>> interval, doesn't mean that it should be deduplicated and result in CoW
>> on next write access.
> 
> Right. This tracking of address range access pattern would have to be
> implemented in user-space.
> 
>> One probably would have to mimic what the KSM implementation in the
>> kernel does, and built something like the unstable tree, to find
>> candidates where we can actually deduplciate. Then, have a way to not-
>> deduplicate if the content changed.
> 
> With madvise MADV_MERGE, there is no need to "unmerge". The merge
> write-protects the page and merges its content at the time of the
> MADV_MERGE with exact duplicates, and keeps that write protected page in
> a global hash table indexed by checksum.

Right, and that's a real problem.

> 
> However, unlike KSM, it won't track that range on an ongoing basis.
> 
> "Unmerging" the page is done naturally by writing to the merged address
> range. Because it is write-protected, this will trigger COW, and will
> therefore provide a new anonymous page to the process, thus "unmerging"
> that page.
> 
> It's really just up to userspace to track COW faults and figure out
> that it really should not try to merge that range anymore, based on the
> the access pattern monitored through write-protection faults.
> 

Just to be clear, what you described here is very likely not a feasible 
replacement, performance-wise, for the in-tree KSM in the VM use case 
(again, KSM was primarily invented for VMs).
