Message ID | 20250228023043.83726-1-mathieu.desnoyers@efficios.com (mailing list archive) |
---|---|
Series | SKSM: Synchronous Kernel Samepage Merging |
On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > This series introduces SKSM, a new page deduplication ABI, > aiming to fix the limitations inherent to the KSM ABI. So I'm not interested in seeing *another* KSM version. Because I absolutely do *NOT* want a new chapter in the saga of SLUB vs SLAB vs SLOB. However, if the feeling is that this can *replace* the current horror that is KSM, I'm a lot more interested. I suspect our current KSM model has largely been a failure, and this might be "good enough". Linus
On 2025-02-27 21:51, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This series introduces SKSM, a new page deduplication ABI,
>> aiming to fix the limitations inherent to the KSM ABI.
>
> So I'm not interested in seeing *another* KSM version.
>
> Because I absolutely do *NOT* want a new chapter in the saga of SLUB
> vs SLAB vs SLOB.
>
> However, if the feeling is that this can *replace* the current horror
> that is KSM, I'm a lot more interested. I suspect our current KSM
> model has largely been a failure, and this might be "good enough".

I'd be fine with SKSM replacing KSM entirely. However, I don't think we should try to re-implement the existing KSM userspace ABIs over SKSM.

I suspect that many of the problems KSM has today are caused by the semantics of the ABI it exposes, which was targeted solely at the use-case of a host deduplicating guest VM memory.

KSM tracks memory meant to be mergeable on an ongoing basis with a worker thread:

  - madvise(2) MADV_{UN,}MERGEABLE
  - prctl(2) PR_{SET,GET}_MEMORY_MERGE (security concern)
  - ~2.5k LOC excluding ksm-common code
  - requires parameter fine-tuning from sysadmin

SKSM gets the hint from userspace that memory is a good candidate for merging in its current state and is expected to stay invariant:

  - madvise(2) MADV_MERGE
  - ~100 LOC excluding ksm-common code

The main reason why SKSM could be implemented without all the scanning complexity is this simpler ABI.

Thanks for the feedback!

Mathieu
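For readers unfamiliar with the existing KSM opt-in interfaces listed above, here is a minimal userspace sketch. The helper names `ksm_mark_range()` and `ksm_enable_process()` are made up for illustration; only `madvise(2)` with `MADV_MERGEABLE` and `prctl(2)` with `PR_SET_MEMORY_MERGE` are real interfaces. The proposed `MADV_MERGE` is deliberately not shown, because it only exists in this patch series and has no value in mainline headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Fallback for older userspace headers; 67 is the value in linux/prctl.h. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif

/*
 * Per-range opt-in: tells KSM that pages in [addr, addr + len) may be
 * scanned and merged later by the ksmd worker thread.
 * Returns 0 on success, or a negative errno value (e.g. -EINVAL when the
 * kernel was built without CONFIG_KSM).
 */
static int ksm_mark_range(void *addr, size_t len)
{
	return madvise(addr, len, MADV_MERGEABLE) == 0 ? 0 : -errno;
}

/*
 * Per-process opt-in (Linux 6.4 and later): makes the compatible mappings
 * of the calling process candidates for KSM without touching each mapping.
 * Returns 0 on success, or a negative errno value.
 */
static int ksm_enable_process(void)
{
	return prctl(PR_SET_MEMORY_MERGE, 1UL, 0UL, 0UL, 0UL) == 0 ? 0 : -errno;
}
```

In both cases the call only registers memory with KSM. Whether and when pages are actually merged depends on ksmd and its sysfs tunables, which is the asynchronous behavior the SKSM proposal avoids.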
On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > I'd be fine with SKSM replacing KSM entirely. However, I don't > think we should try to re-implement the existing KSM userspace ABIs > over SKSM. No, absolutely. The only point (for me) for your new synchronous one would be if it replaced the kernel thread async scanning, which would make the old user space interface basically pointless. But I don't actually know who uses KSM right now. My reaction really comes from a "it's not nice code in the kernel", not from any actual knowledge of the users. Maybe it works really well in some cloud VM environment, and we're stuck with it forever. In which case I don't want to see some second different interface that just makes it all worse. Linus
On 28.02.25 06:17, Linus Torvalds wrote: > On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> I'd be fine with SKSM replacing KSM entirely. However, I don't >> think we should try to re-implement the existing KSM userspace ABIs >> over SKSM. > > No, absolutely. The only point (for me) for your new synchronous one > would be if it replaced the kernel thread async scanning, which would > make the old user space interface basically pointless. > > But I don't actually know who uses KSM right now. My reaction really > comes from a "it's not nice code in the kernel", not from any actual > knowledge of the users. > > Maybe it works really well in some cloud VM environment, and we're > stuck with it forever. Exactly that; and besides the VM use-case, lately people started using it in the context of interpreters (IIRC inside Meta) quite successfully as well.
On 2025-02-28 00:17, Linus Torvalds wrote: > On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> I'd be fine with SKSM replacing KSM entirely. However, I don't >> think we should try to re-implement the existing KSM userspace ABIs >> over SKSM. > > No, absolutely. The only point (for me) for your new synchronous one > would be if it replaced the kernel thread async scanning, which would > make the old user space interface basically pointless. > > But I don't actually know who uses KSM right now. My reaction really > comes from a "it's not nice code in the kernel", not from any actual > knowledge of the users. > > Maybe it works really well in some cloud VM environment, and we're > stuck with it forever. > For the VM use-case, I wonder if we could just add a userfaultfd "COW" event that would notify userspace when a COW happens ? This would allow userspace to replace ksmd by tracking the age of those anonymous pages, and issue madvise MADV_MERGE on them to write-protect+merge them when it is deemed useful. With both a new userfaultfd COW event and madvise MADV_MERGE, is there anything else that is fundamentally missing to move all the scanning complexity of KSM to userspace for the VM deduplication use-case ? Thanks, Mathieu
On Fri, Feb 28, 2025, David Hildenbrand wrote: > On 28.02.25 06:17, Linus Torvalds wrote: > > On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > I'd be fine with SKSM replacing KSM entirely. However, I don't > > > think we should try to re-implement the existing KSM userspace ABIs > > > over SKSM. > > > > No, absolutely. The only point (for me) for your new synchronous one > > would be if it replaced the kernel thread async scanning, which would > > make the old user space interface basically pointless. > > > > But I don't actually know who uses KSM right now. My reaction really > > comes from a "it's not nice code in the kernel", not from any actual > > knowledge of the users. > > > > Maybe it works really well in some cloud VM environment, and we're > > stuck with it forever. > > Exactly that; and besides the VM use-case, lately people stated using it in > the context of interpreters (IIRC inside Meta) quite successfully as well. Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs in cloud environments? The security implications of scanning guest memory and having co-tenant VMs share mappings (should) make it a complete non-starter for any scenario where VMs and/or their workloads are owned by third parties. I can imagine there might be first-party use cases, but I would expect many/most of those to be able to explicitly share mappings, which would provide far, far better power and performance characteristics.
On 2025-02-28 08:59, David Hildenbrand wrote: > On 28.02.25 06:17, Linus Torvalds wrote: >> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers >> <mathieu.desnoyers@efficios.com> wrote: >>> >>> I'd be fine with SKSM replacing KSM entirely. However, I don't >>> think we should try to re-implement the existing KSM userspace ABIs >>> over SKSM. >> >> No, absolutely. The only point (for me) for your new synchronous one >> would be if it replaced the kernel thread async scanning, which would >> make the old user space interface basically pointless. >> >> But I don't actually know who uses KSM right now. My reaction really >> comes from a "it's not nice code in the kernel", not from any actual >> knowledge of the users. >> >> Maybe it works really well in some cloud VM environment, and we're >> stuck with it forever. > > Exactly that; and besides the VM use-case, lately people stated using it > in the context of interpreters (IIRC inside Meta) quite successfully as > well. > I suspect that SKSM is a better fit for JIT and code patching than KSM, because user-space knows better when a set of pages is going to become invariant for a long time and thus benefit from merging. This removes the background scanning from the picture. Does the interpreter use-case require background scanning, or does it know when a set of pages are meant to become invariant for a long time ? Thanks, Mathieu
On 28.02.25 15:59, Sean Christopherson wrote: > On Fri, Feb 28, 2025, David Hildenbrand wrote: >> On 28.02.25 06:17, Linus Torvalds wrote: >>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers >>> <mathieu.desnoyers@efficios.com> wrote: >>>> >>>> I'd be fine with SKSM replacing KSM entirely. However, I don't >>>> think we should try to re-implement the existing KSM userspace ABIs >>>> over SKSM. >>> >>> No, absolutely. The only point (for me) for your new synchronous one >>> would be if it replaced the kernel thread async scanning, which would >>> make the old user space interface basically pointless. >>> >>> But I don't actually know who uses KSM right now. My reaction really >>> comes from a "it's not nice code in the kernel", not from any actual >>> knowledge of the users. >>> >>> Maybe it works really well in some cloud VM environment, and we're >>> stuck with it forever. >> >> Exactly that; and besides the VM use-case, lately people stated using it in >> the context of interpreters (IIRC inside Meta) quite successfully as well. > > Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs > in cloud environments? Private clouds yes, that's where it is most commonly used for. I would assume that nobody for For example, there is some older documentation here: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/virtualization_administration_guide/chap-ksm#chap-KSM which touches on the security aspects: "The page deduplication technology (used also by the KSM implementation) may introduce side channels that could potentially be used to leak information across multiple guests. In case this is a concern, KSM can be disabled on a per-guest basis." > > The security implications of scanning guest memory and having co-tenant VMs share > mappings (should) make it a complete non-starter for any scenario where VMs and/or > their workloads are owned by third parties. Jep. 
> > I can imagine there might be first-party use cases, but I would expect many/most > of those to be able to explicitly share mappings, which would provide far, far > better power and performance characteristics. Note that KSM can be very efficient when you have multiple VMs running the same kernel, executables, libraries, etc. If my memory doesn't trick me, that's precisely what it was originally invented for, and how it is getting used today in the context of VMs. For example, QEMU will mark all guest memory as mergeable using MADV, to limit the deduplication to guest RAM only.
On 28.02.25 16:01, Mathieu Desnoyers wrote: > On 2025-02-28 08:59, David Hildenbrand wrote: >> On 28.02.25 06:17, Linus Torvalds wrote: >>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers >>> <mathieu.desnoyers@efficios.com> wrote: >>>> >>>> I'd be fine with SKSM replacing KSM entirely. However, I don't >>>> think we should try to re-implement the existing KSM userspace ABIs >>>> over SKSM. >>> >>> No, absolutely. The only point (for me) for your new synchronous one >>> would be if it replaced the kernel thread async scanning, which would >>> make the old user space interface basically pointless. >>> >>> But I don't actually know who uses KSM right now. My reaction really >>> comes from a "it's not nice code in the kernel", not from any actual >>> knowledge of the users. >>> >>> Maybe it works really well in some cloud VM environment, and we're >>> stuck with it forever. >> >> Exactly that; and besides the VM use-case, lately people stated using it >> in the context of interpreters (IIRC inside Meta) quite successfully as >> well. >> > > I suspect that SKSM is a better fit for JIT and code patching than KSM, > because user-space knows better when a set of pages is going to become > invariant for a long time and thus benefit from merging. This removes > the background scanning from the picture. > > Does the interpreter use-case require background scanning, or does > it know when a set of pages are meant to become invariant for a long > time ? To make the JIT/interpreter use case happy, people wanted ways to *force* KSM on for *the whole process*, not just individual VMAs like the traditional VM use case would have done. I recall one of the reasons being that you don't really want to modify your JIT/interpreter to just make KSM work. See [1] "KSM at Meta" for some details, and in general, optimization work to adapt KSM to new use cases. 
Regarding some concerns you raised, Stefan did a lot of optimization work like "smart scanning" (slide "Optimization - Smart Scan (6.7)") to reduce the scanning overhead and make it much more efficient. So people started optimizing for that already and got pretty good results. [1] https://lpc.events/event/17/contributions/1625/attachments/1320/2649/KSM.pdf
On 28.02.25 16:10, David Hildenbrand wrote: > On 28.02.25 15:59, Sean Christopherson wrote: >> On Fri, Feb 28, 2025, David Hildenbrand wrote: >>> On 28.02.25 06:17, Linus Torvalds wrote: >>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers >>>> <mathieu.desnoyers@efficios.com> wrote: >>>>> >>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't >>>>> think we should try to re-implement the existing KSM userspace ABIs >>>>> over SKSM. >>>> >>>> No, absolutely. The only point (for me) for your new synchronous one >>>> would be if it replaced the kernel thread async scanning, which would >>>> make the old user space interface basically pointless. >>>> >>>> But I don't actually know who uses KSM right now. My reaction really >>>> comes from a "it's not nice code in the kernel", not from any actual >>>> knowledge of the users. >>>> >>>> Maybe it works really well in some cloud VM environment, and we're >>>> stuck with it forever. >>> >>> Exactly that; and besides the VM use-case, lately people stated using it in >>> the context of interpreters (IIRC inside Meta) quite successfully as well. >> >> Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs >> in cloud environments? > > Private clouds yes, that's where it is most commonly used for. I would > assume that nobody for forgot to complete that sentence: "... nobody really should be using that in public clouds."
On 28.02.25 03:51, Linus Torvalds wrote: > On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> This series introduces SKSM, a new page deduplication ABI, >> aiming to fix the limitations inherent to the KSM ABI. > > So I'm not interested in seeing *another* KSM version. > > Because I absolutely do *NOT* want a new chapter in the saga of SLUB > vs SLAB vs SLOB. > > However, if the feeling is that this can *replace* the current horror > that is KSM, I'm a lot more interested. I suspect our current KSM > model has largely been a failure, and this might be "good enough". Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE? Many/most use cases just leave THP scanning+collapsing to khugepaged; selected ones might "know better" what to do, so they effectively disable khugepaged, and manually collapse THPs using MADV_COLLAPSE. If it would be similar to that, it would not be a completely different KSM version, just a different way to trigger merging: background scanning vs. user-space triggered ("synchronous"). I could see use cases for such a synchronous interface, but I doubt it could replace the background scanning that is actively getting used for existing use cases; I have similar thoughts about khugepaged vs. MADV_COLLAPSE.
On Fri, Feb 28, 2025 at 04:34:50PM +0100, David Hildenbrand wrote:
> Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE?
I think it is comparable ... because many people find khugepaged
unacceptable and there are proposals to move that to userspace.
On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: > For the VM use-case, I wonder if we could just add a userfaultfd > "COW" event that would notify userspace when a COW happens ? I don't know what's the best for KSM and how well this will work, but we have such event for years.. See UFFDIO_REGISTER_MODE_WP: https://man7.org/linux/man-pages/man2/userfaultfd.2.html > > This would allow userspace to replace ksmd by tracking the age of > those anonymous pages, and issue madvise MADV_MERGE on them to > write-protect+merge them when it is deemed useful. > > With both a new userfaultfd COW event and madvise MADV_MERGE, > is there anything else that is fundamentally missing to move > all the scanning complexity of KSM to userspace for the VM > deduplication use-case ? Thanks,
On 2025-02-28 11:32, Peter Xu wrote: > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: >> For the VM use-case, I wonder if we could just add a userfaultfd >> "COW" event that would notify userspace when a COW happens ? > > I don't know what's the best for KSM and how well this will work, but we > have such event for years.. See UFFDIO_REGISTER_MODE_WP: > > https://man7.org/linux/man-pages/man2/userfaultfd.2.html userfaultfd UFFDIO_REGISTER only seems to work if I pass an address resulting from a mmap mapping, but returns EINVAL if I pass a page-aligned address which sits within a private file mapping (e.g. executable data). Also, I notice that do_wp_page() only calls handle_userfault VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE set. AFAIU, as it stands now userfaultfd would not help tracking COW faults caused by stores to private file mappings. Am I missing something ? Thanks, Mathieu > >> >> This would allow userspace to replace ksmd by tracking the age of >> those anonymous pages, and issue madvise MADV_MERGE on them to >> write-protect+merge them when it is deemed useful. >> >> With both a new userfaultfd COW event and madvise MADV_MERGE, >> is there anything else that is fundamentally missing to move >> all the scanning complexity of KSM to userspace for the VM >> deduplication use-case ? > > Thanks, >
On 2025-02-28 10:10, David Hildenbrand wrote: [...] > For example, QEMU will mark all guest memory is mergeable using MADV, to > limit the deduplicaton to guest RAM only. > On a related note, I think the madvise(2) documentation is inaccurate. It states: MADV_MERGEABLE (since Linux 2.6.32) Enable Kernel Samepage Merging (KSM) for the pages in the range specified by addr and length. [...] AFAIU, based on code review of ksm_madvise(), this is not strictly true. The KSM implementation enables KSM for pages in the entire vma containing the range. So if it so happens that two mmap areas with identical protection flags are merged, both will be considered mergeable by KSM as soon as at least one page from any of those areas is made mergeable. This does not appear to be an issue in qemu because guard pages with different protection are placed between distinct mappings, which should prevent combining the vmas. Thanks, Mathieu
On 28.02.25 22:38, Mathieu Desnoyers wrote: > On 2025-02-28 10:10, David Hildenbrand wrote: > [...] >> For example, QEMU will mark all guest memory is mergeable using MADV, to >> limit the deduplicaton to guest RAM only. >> > > On a related note, I think the madvise(2) documentation is inaccurate. > > It states: > > MADV_MERGEABLE (since Linux 2.6.32) > Enable Kernel Samepage Merging (KSM) for the pages in the range > specified by addr and length. [...] > > AFAIU, based on code review of ksm_madvise(), this is not strictly true. > > The KSM implementation enables KSM for pages in the entire vma containing the range. > So if it so happens that two mmap areas with identical protection flags are merged, > both will be considered mergeable by KSM as soon as at least one page from any of > those areas is made mergeable. I *think* it does what is documented. In madvise_vma_behavior(), ksm_madvise() will update "new_flags". Then we call madvise_update_vma() to split the VMA if required and set new_flags only on the split VMA. The handling is similar to other MADV operations that end up modifying vm_flags. If I am missing something and this is indeed broken, we should definitely write a selftest for it and fix it.
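The VMA split described above can be observed from userspace. The following is a rough sketch, not a kernel selftest: `count_vmas()` and `ksm_split_demo()` are hypothetical helpers that map three anonymous pages, mark only the middle one with `MADV_MERGEABLE`, and count how many `/proc/self/maps` entries overlap the mapping before and after. It assumes `/proc` is mounted, and the split is only visible when the kernel has KSM support.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the /proc/self/maps entries (VMAs) that overlap
 * [start, start + len). Returns -1 if the file cannot be opened.
 */
static int count_vmas(const void *start, size_t len)
{
	unsigned long lo = (unsigned long)start, hi = lo + len;
	unsigned long vma_start, vma_end;
	char *line = NULL;
	size_t cap = 0;
	int count = 0;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return -1;
	/* getline() copes with arbitrarily long path names in the file. */
	while (getline(&line, &cap, f) != -1) {
		if (sscanf(line, "%lx-%lx", &vma_start, &vma_end) == 2 &&
		    vma_start < hi && vma_end > lo)
			count++;
	}
	free(line);
	fclose(f);
	return count;
}

/*
 * Map three pages, mark only the middle one mergeable, and report the
 * number of overlapping VMAs before and after. Returns the madvise()
 * result, or -1 if the mapping could not be created.
 */
static int ksm_split_demo(int *before, int *after)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret;

	if (p == MAP_FAILED)
		return -1;
	*before = count_vmas(p, 3 * page);
	ret = madvise(p + page, page, MADV_MERGEABLE);
	*after = count_vmas(p, 3 * page);
	munmap(p, 3 * page);
	return ret;
}
```

If `madvise()` succeeds, the count is expected to go from one VMA to three, since only the middle page carries the mergeable flag. It stays at one if KSM is unavailable, or if the whole process was already made mergeable through `prctl(2)`.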
On 2025-02-28 16:45, David Hildenbrand wrote: > On 28.02.25 22:38, Mathieu Desnoyers wrote: >> On 2025-02-28 10:10, David Hildenbrand wrote: >> [...] >>> For example, QEMU will mark all guest memory is mergeable using MADV, to >>> limit the deduplicaton to guest RAM only. >>> >> >> On a related note, I think the madvise(2) documentation is inaccurate. >> >> It states: >> >> MADV_MERGEABLE (since Linux 2.6.32) >> Enable Kernel Samepage Merging (KSM) for the pages in >> the range >> specified by addr and length. [...] >> >> AFAIU, based on code review of ksm_madvise(), this is not strictly true. >> >> The KSM implementation enables KSM for pages in the entire vma >> containing the range. >> So if it so happens that two mmap areas with identical protection >> flags are merged, >> both will be considered mergeable by KSM as soon as at least one page >> from any of >> those areas is made mergeable. > > I *think* it does what is documented. In madvise_vma_behavior(), > ksm_madvise() will update "new_flags". > > Then we call madvise_update_vma() to split the VMA if required and set > new_flags only on the split VMA. The handling is similar to other MADV > operations that end up modifying vm_flags. > > If I am missing something and this is indeed broken, we should > definitely write a selftest for it and fix it. > You are correct, I missed that part. Thanks for the clarification! Mathieu
On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote: > On 2025-02-28 11:32, Peter Xu wrote: > > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: > > > For the VM use-case, I wonder if we could just add a userfaultfd > > > "COW" event that would notify userspace when a COW happens ? > > > > I don't know what's the best for KSM and how well this will work, but we > > have such event for years.. See UFFDIO_REGISTER_MODE_WP: > > > > https://man7.org/linux/man-pages/man2/userfaultfd.2.html > > userfaultfd UFFDIO_REGISTER only seems to work if I pass an address > resulting from a mmap mapping, but returns EINVAL if I pass a > page-aligned address which sits within a private file mapping > (e.g. executable data). Yes, so far sync traps only supports RAM-based file systems, or anonymous. Generic private file mappings (that stores executables and libraries) are not yet supported. > > Also, I notice that do_wp_page() only calls handle_userfault > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE > set. AFAICT that's expected, unshare should only be set on reads, never writes. So uffd-wp shouldn't trap any of those. > > AFAIU, as it stands now userfaultfd would not help tracking COW faults > caused by stores to private file mappings. Am I missing something ? I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on most mappings. That one is async, though, so more like soft-dirty. It might be doable to try making it sync too without a lot of changes based on how async tracking works. Thanks,
On 2025-02-28 17:32, Peter Xu wrote: > On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote: >> On 2025-02-28 11:32, Peter Xu wrote: >>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: >>>> For the VM use-case, I wonder if we could just add a userfaultfd >>>> "COW" event that would notify userspace when a COW happens ? >>> >>> I don't know what's the best for KSM and how well this will work, but we >>> have such event for years.. See UFFDIO_REGISTER_MODE_WP: >>> >>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html >> >> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address >> resulting from a mmap mapping, but returns EINVAL if I pass a >> page-aligned address which sits within a private file mapping >> (e.g. executable data). > > Yes, so far sync traps only supports RAM-based file systems, or anonymous. > Generic private file mappings (that stores executables and libraries) are > not yet supported. OK, this confirms my observations. > >> >> Also, I notice that do_wp_page() only calls handle_userfault >> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE >> set. > > AFAICT that's expected, unshare should only be set on reads, never writes. > So uffd-wp shouldn't trap any of those. I'm confused by your comment. I thought unshare only applies to *write* faults. What am I missing ? > >> >> AFAIU, as it stands now userfaultfd would not help tracking COW faults >> caused by stores to private file mappings. Am I missing something ? > > I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on > most mappings. That one is async, though, so more like soft-dirty. It > might be doable to try making it sync too without a lot of changes based on > how async tracking works. I'll try this out. It may not matter that it's async given the use-case of tracking the age since the WP fault on the COW pages.
We don't need to react to the event in-place to alter its behavior, just a notification should be fine AFAIU. Thanks, Mathieu
On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote:
> > > Also, I notice that do_wp_page() only calls handle_userfault
> > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> > > set.
> >
> > AFAICT that's expected, unshare should only be set on reads, never writes.
> > So uffd-wp shouldn't trap any of those.
>
> I'm confused by your comment. I thought unshare only applies to
> *write* faults. What am I missing ?

The major path so far to set unshare is here in GUP (ignoring two corner cases used in either s390 and ksm):

        if (unshare) {
                fault_flags |= FAULT_FLAG_UNSHARE;
                /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
                VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
        }

See the VM_BUG_ON() - if it's write it'll crash already.

"unshare", in its earliest form of patch, used to be called COR (Copy-On-Read), which might be more straightforward in this case.. so it's the counterpart of COW but for read cases where a copy is required. The patchset that introduced it has more information (e.g. a7f2266041).

Thanks,
On 03.03.25 16:01, Peter Xu wrote: > On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote: >>>> Also, I notice that do_wp_page() only calls handle_userfault >>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE >>>> set. >>> >>> AFAICT that's expected, unshare should only be set on reads, never writes. >>> So uffd-wp shouldn't trap any of those. >> >> I'm confused by your comment. I thought unshare only applies to >> *write* faults. What am I missing ? > > The major path so far to set unshare is here in GUP (ignoring two corner > cases used in either s390 and ksm): An "unshare" fault, in contrast to a write fault, will not turn the PTE writable. That's why it does not trigger userfaultfd-wp: there is no write access, and write-protection is left unchanged.
On 2025-02-28 17:32, Peter Xu wrote: > On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote: >> On 2025-02-28 11:32, Peter Xu wrote: >>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: >>>> For the VM use-case, I wonder if we could just add a userfaultfd >>>> "COW" event that would notify userspace when a COW happens ? >>> >>> I don't know what's the best for KSM and how well this will work, but we >>> have such event for years.. See UFFDIO_REGISTER_MODE_WP: >>> >>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html >> >> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address >> resulting from a mmap mapping, but returns EINVAL if I pass a >> page-aligned address which sits within a private file mapping >> (e.g. executable data). > > Yes, so far sync traps only supports RAM-based file systems, or anonymous. > Generic private file mappings (that stores executables and libraries) are > not yet supported. > >> >> Also, I notice that do_wp_page() only calls handle_userfault >> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE >> set. > > AFAICT that's expected, unshare should only be set on reads, never writes. > So uffd-wp shouldn't trap any of those. > >> >> AFAIU, as it stands now userfaultfd would not help tracking COW faults >> caused by stores to private file mappings. Am I missing something ? > > I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on > most mappings. That one is async, though, so more like soft-dirty. It > might be doable to try making it sync too without a lot of changes based on > how async tracking works. I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to be a good fit. 
Here is what I have in mind to replace the ksmd scanning thread for the VM use-case by a purely user-space driven scanning:

Within qemu or similar user-space process:

1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
   UFFDIO_REGISTER_MODE_WP mode.

2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
   to detect memory which stays invariant for a long time.

3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
   Keep track of memory which is frequently modified, so it can be left alone and
   not write-protected nor merged anymore.

4) Whenever pages stay invariant for a given lapse of time, merge them with the new
   madvise(2) KSM_MERGE behavior.

Let me know if that makes sense.

Thanks,

Mathieu
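The bookkeeping implied by steps 3 and 4 could look roughly like the sketch below. It covers only the per-page policy. The `written` argument stands in for the PAGE_IS_WRITTEN result of one PAGEMAP_SCAN pass, and the PAGE_MERGE result stands in for issuing the proposed madvise(2) merge call, which is not in mainline and is therefore not invoked here. All names and both thresholds are assumptions chosen for illustration, not values from the patch series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative thresholds; real values would need tuning. */
#define MERGE_AFTER_IDLE_SCANS 3 /* idle scans before a page is merged */
#define HOT_WRITE_LIMIT        4 /* written scans before a page is ignored */

enum page_action {
	PAGE_WAIT,  /* keep tracking, nothing to do yet */
	PAGE_MERGE, /* page stayed invariant long enough: request a merge */
	PAGE_SKIP,  /* page is written too often: leave it alone */
};

struct page_state {
	uint32_t idle_scans;   /* consecutive scans without a write */
	uint32_t write_events; /* number of scans that saw a write */
	bool merged;           /* a merge was already requested */
};

/* Update one page's state after a scan pass and decide what to do. */
static enum page_action update_page(struct page_state *st, bool written)
{
	if (written) {
		st->idle_scans = 0;
		st->merged = false; /* a write breaks the merge through COW */
		if (st->write_events < UINT32_MAX)
			st->write_events++;
	} else if (st->idle_scans < UINT32_MAX) {
		st->idle_scans++;
	}

	if (st->write_events >= HOT_WRITE_LIMIT)
		return PAGE_SKIP;
	if (!st->merged && st->idle_scans >= MERGE_AFTER_IDLE_SCANS) {
		st->merged = true;
		return PAGE_MERGE;
	}
	return PAGE_WAIT;
}
```

A scanner would call `update_page()` once per tracked page after each scan pass, batch the pages that return PAGE_MERGE into ranges, and stop write-protecting pages that return PAGE_SKIP.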
On Mon, Mar 03, 2025 at 03:01:38PM -0500, Mathieu Desnoyers wrote: > On 2025-02-28 17:32, Peter Xu wrote: > > On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote: > > > On 2025-02-28 11:32, Peter Xu wrote: > > > > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: > > > > > For the VM use-case, I wonder if we could just add a userfaultfd > > > > > "COW" event that would notify userspace when a COW happens ? > > > > > > > > I don't know what's the best for KSM and how well this will work, but we > > > > have such event for years.. See UFFDIO_REGISTER_MODE_WP: > > > > > > > > https://man7.org/linux/man-pages/man2/userfaultfd.2.html > > > > > > userfaultfd UFFDIO_REGISTER only seems to work if I pass an address > > > resulting from a mmap mapping, but returns EINVAL if I pass a > > > page-aligned address which sits within a private file mapping > > > (e.g. executable data). > > > > Yes, so far sync traps only supports RAM-based file systems, or anonymous. > > Generic private file mappings (that stores executables and libraries) are > > not yet supported. > > > > > > > > Also, I notice that do_wp_page() only calls handle_userfault > > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE > > > set. > > > > AFAICT that's expected, unshare should only be set on reads, never writes. > > So uffd-wp shouldn't trap any of those. > > > > > > > > AFAIU, as it stands now userfaultfd would not help tracking COW faults > > > caused by stores to private file mappings. Am I missing something ? > > > > I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on > > most mappings. That one is async, though, so more like soft-dirty. It > > might be doable to try making it sync too without a lot of changes based on > > how async tracking works. > > I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to > be a good fit. 
Here is what I have in mind to replace the ksmd scanning > thread for the VM use-case by a purely user-space driven scanning: > > Within qemu or similar user-space process: > > 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and > UFFDIO_REGISTER_MODE_WP mode. > > 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag > to detect memory which stays invariant for a long time. > > 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to. > Keep track of memory which is frequently modified, so it can be left alone and > not write-protected nor merged anymore. > > 4) Whenever pages stay invariant for a given lapse of time, merge them with the new > madvise(2) KSM_MERGE behavior. > > Let me know if that makes sense. I can't speak of how KSM should go from there, but from userfault tracking POV, that makes sense to me. Thanks,
On 03.03.25 21:01, Mathieu Desnoyers wrote: > On 2025-02-28 17:32, Peter Xu wrote: >> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote: >>> On 2025-02-28 11:32, Peter Xu wrote: >>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: >>>>> For the VM use-case, I wonder if we could just add a userfaultfd >>>>> "COW" event that would notify userspace when a COW happens ? >>>> >>>> I don't know what's the best for KSM and how well this will work, but we >>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP: >>>> >>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html >>> >>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address >>> resulting from a mmap mapping, but returns EINVAL if I pass a >>> page-aligned address which sits within a private file mapping >>> (e.g. executable data). >> >> Yes, so far sync traps only supports RAM-based file systems, or anonymous. >> Generic private file mappings (that stores executables and libraries) are >> not yet supported. >> >>> >>> Also, I notice that do_wp_page() only calls handle_userfault >>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE >>> set. >> >> AFAICT that's expected, unshare should only be set on reads, never writes. >> So uffd-wp shouldn't trap any of those. >> >>> >>> AFAIU, as it stands now userfaultfd would not help tracking COW faults >>> caused by stores to private file mappings. Am I missing something ? >> >> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on >> most mappings. That one is async, though, so more like soft-dirty. It >> might be doable to try making it sync too without a lot of changes based on >> how async tracking works. > > I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to > be a good fit. 
Here is what I have in mind to replace the ksmd scanning > thread for the VM use-case by a purely user-space driven scanning: > > Within qemu or similar user-space process: > > 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and > UFFDIO_REGISTER_MODE_WP mode. > > 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag > to detect memory which stays invariant for a long time. > > 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to. > Keep track of memory which is frequently modified, so it can be left alone and > not write-protected nor merged anymore. > > 4) Whenever pages stay invariant for a given lapse of time, merge them with the new > madvise(2) KSM_MERGE behavior. > > Let me know if that makes sense. Note that one of the strengths of ksm in the kernel right now is that we write-protect + try-deduplicate only when we are fairly sure that we can deduplicate (unstable tree), and that the interaction with THPs / large folios is fairly well thought-through. Also note that, just because data hasn't been written in some time interval, doesn't mean that it should be deduplicated and result in CoW on next write access. One probably would have to mimic what the KSM implementation in the kernel does, and build something like the unstable tree, to find candidates where we can actually deduplicate. Then, have a way to not-deduplicate if the content changed.
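The point about deduplicating only when a match is likely can be illustrated with a small sketch. The Python below is a simplification and does not reproduce the kernel's unstable tree: it groups pages by a content digest and reports only groups with at least two identical members, so a page without a duplicate is never proposed for merging. The function name `find_candidates` and the use of SHA-256 are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def find_candidates(pages):
    """Return groups of page ids whose contents are byte-for-byte identical.

    `pages` maps a page id to its bytes. A dictionary keyed by digest is a
    stand-in for the kernel's unstable tree, which this does not reproduce.
    """
    by_digest = defaultdict(list)
    for page_id, data in pages.items():
        by_digest[hashlib.sha256(data).digest()].append(page_id)

    groups = []
    for ids in by_digest.values():
        if len(ids) < 2:
            continue  # no duplicate found: merging would only add a later CoW
        # Confirm equality on the actual bytes rather than trusting the digest.
        first = pages[ids[0]]
        same = [i for i in ids if pages[i] == first]
        if len(same) >= 2:
            groups.append(sorted(same))
    return sorted(groups)


pages = {
    "a": b"\x00" * 4096,
    "b": b"\x00" * 4096,
    "c": b"\x01" * 4096,
}
candidates = find_candidates(pages)
print(candidates)
```

Page "c" has no twin and is therefore left out of the result, which is the behaviour the reply above asks a user-space scanner to approximate.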
On 2025-03-03 15:49, David Hildenbrand wrote: > On 03.03.25 21:01, Mathieu Desnoyers wrote: >> On 2025-02-28 17:32, Peter Xu wrote: >>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote: >>>> On 2025-02-28 11:32, Peter Xu wrote: >>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: >>>>>> For the VM use-case, I wonder if we could just add a userfaultfd >>>>>> "COW" event that would notify userspace when a COW happens ? >>>>> >>>>> I don't know what's the best for KSM and how well this will work, >>>>> but we >>>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP: >>>>> >>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html >>>> >>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address >>>> resulting from a mmap mapping, but returns EINVAL if I pass a >>>> page-aligned address which sits within a private file mapping >>>> (e.g. executable data). >>> >>> Yes, so far sync traps only supports RAM-based file systems, or >>> anonymous. >>> Generic private file mappings (that stores executables and libraries) >>> are >>> not yet supported. >>> >>>> >>>> Also, I notice that do_wp_page() only calls handle_userfault >>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE >>>> set. >>> >>> AFAICT that's expected, unshare should only be set on reads, never >>> writes. >>> So uffd-wp shouldn't trap any of those. >>> >>>> >>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults >>>> caused by stores to private file mappings. Am I missing something ? >>> >>> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should >>> work on >>> most mappings. That one is async, though, so more like soft-dirty. It >>> might be doable to try making it sync too without a lot of changes >>> based on >>> how async tracking works. >> >> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to >> be a good fit. 
Here is what I have in mind to replace the ksmd scanning >> thread for the VM use-case by a purely user-space driven scanning: >> >> Within qemu or similar user-space process: >> >> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC >> feature and >> UFFDIO_REGISTER_MODE_WP mode. >> >> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl >> PM_SCAN_WP_MATCHING flag >> to detect memory which stays invariant for a long time. >> >> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which >> pages are written to. >> Keep track of memory which is frequently modified, so it can be >> left alone and >> not write-protected nor merged anymore. >> >> 4) Whenever pages stay invariant for a given lapse of time, merge them >> with the new >> madvise(2) KSM_MERGE behavior. >> >> Let me know if that makes sense. > > Note that one of the strengths of ksm in the kernel right now is that we > write-protect + try-deduplicate only when we are fairly sure that we can > deduplicate (unstable tree), and that the interaction with THPs / large > folios is fairly well thought-through. > > Also note that, just because data hasn't been written in some time > interval, doesn't mean that it should be deduplicated and result in CoW > on next write access. Right. This tracking of address-range access patterns would have to be implemented in user-space. > One probably would have to mimic what the KSM implementation in the > kernel does, and build something like the unstable tree, to find > candidates where we can actually deduplicate. Then, have a way to > not-deduplicate if the content changed. With madvise MADV_MERGE, there is no need to "unmerge". The merge write-protects the page and merges its content at the time of the MADV_MERGE with exact duplicates, and keeps that write-protected page in a global hash table indexed by checksum. However, unlike KSM, it won't track that range on an ongoing basis.
"Unmerging" the page is done naturally by writing to the merged address range. Because it is write-protected, this will trigger COW, and will therefore provide a new anonymous page to the process, thus "unmerging" that page. It's really just up to userspace to track COW faults and figure out that it really should not try to merge that range anymore, based on the access pattern monitored through write-protection faults. Thanks, Mathieu
On 05.03.25 15:06, Mathieu Desnoyers wrote: > On 2025-03-03 15:49, David Hildenbrand wrote: >> On 03.03.25 21:01, Mathieu Desnoyers wrote: >>> On 2025-02-28 17:32, Peter Xu wrote: >>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote: >>>>> On 2025-02-28 11:32, Peter Xu wrote: >>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote: >>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd >>>>>>> "COW" event that would notify userspace when a COW happens ? >>>>>> >>>>>> I don't know what's the best for KSM and how well this will work, >>>>>> but we >>>>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP: >>>>>> >>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html >>>>> >>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address >>>>> resulting from a mmap mapping, but returns EINVAL if I pass a >>>>> page-aligned address which sits within a private file mapping >>>>> (e.g. executable data). >>>> >>>> Yes, so far sync traps only supports RAM-based file systems, or >>>> anonymous. >>>> Generic private file mappings (that stores executables and libraries) >>>> are >>>> not yet supported. >>>> >>>>> >>>>> Also, I notice that do_wp_page() only calls handle_userfault >>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE >>>>> set. >>>> >>>> AFAICT that's expected, unshare should only be set on reads, never >>>> writes. >>>> So uffd-wp shouldn't trap any of those. >>>> >>>>> >>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults >>>>> caused by stores to private file mappings. Am I missing something ? >>>> >>>> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should >>>> work on >>>> most mappings. That one is async, though, so more like soft-dirty. It >>>> might be doable to try making it sync too without a lot of changes >>>> based on >>>> how async tracking works. 
>>> >>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to >>> be a good fit. Here is what I have in mind to replace the ksmd scanning >>> thread for the VM use-case by a purely user-space driven scanning: >>> >>> Within qemu or similar user-space process: >>> >>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC >>> feature and >>> UFFDIO_REGISTER_MODE_WP mode. >>> >>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl >>> PM_SCAN_WP_MATCHING flag >>> to detect memory which stays invariant for a long time. >>> >>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which >>> pages are written to. >>> Keep track of memory which is frequently modified, so it can be >>> left alone and >>> not write-protected nor merged anymore. >>> >>> 4) Whenever pages stay invariant for a given lapse of time, merge them >>> with the new >>> madvise(2) KSM_MERGE behavior. >>> >>> Let me know if that makes sense. >> >> Note that one of the strengths of ksm in the kernel right now is that we >> write-protect + try-deduplicate only when we are fairly sure that we can >> deduplicate (unstable tree), and that the interaction with THPs / large >> folios is fairly well thought-through. >> >> Also note that, just because data hasn't been written in some time >> interval, doesn't mean that it should be deduplicated and result in CoW >> on next write access. > > Right. This tracking of address-range access patterns would have to be > implemented in user-space. > >> One probably would have to mimic what the KSM implementation in the >> kernel does, and build something like the unstable tree, to find >> candidates where we can actually deduplicate. Then, have a way to >> not-deduplicate if the content changed. > > With madvise MADV_MERGE, there is no need to "unmerge". 
The merge > write-protects the page and merges its content at the time of the > MADV_MERGE with exact duplicates, and keeps that write-protected page in > a global hash table indexed by checksum. Right, and that's a real problem. > > However, unlike KSM, it won't track that range on an ongoing basis. > > "Unmerging" the page is done naturally by writing to the merged address > range. Because it is write-protected, this will trigger COW, and will > therefore provide a new anonymous page to the process, thus "unmerging" > that page. > > It's really just up to userspace to track COW faults and figure out > that it really should not try to merge that range anymore, based on the > access pattern monitored through write-protection faults. > Just to be clear, what you described here is very likely not a feasible replacement, performance-wise, for the in-tree ksm in the VM use case (again, the use case ksm was primarily invented for).