[v7,00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults

Message ID 20240215235405.368539-1-amoorthy@google.com (mailing list archive)

Message

Anish Moorthy Feb. 15, 2024, 11:53 p.m. UTC
This series adds an option that causes stage-2 fault handlers to exit
to userspace with KVM_EXIT_MEMORY_FAULT when they would otherwise have
to fault in the userspace mappings. Userspace can then receive stage-2
faults directly from KVM_RUN instead of through userfaultfd, which
suffers from serious contention issues as the number of vCPUs scales.
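
A minimal sketch, not taken from the series: a KVM_EXIT_MEMORY_FAULT exit
describes the fault by guest physical address, while userfaultfd ioctls
such as UFFDIO_COPY and UFFDIO_CONTINUE operate on host virtual
addresses, so userspace presumably needs a gpa-to-hva step in between.
struct slot_info and gpa_to_hva() are hypothetical names used only for
illustration; a real VMM would rely on its own memslot bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace record of one memslot (not a KVM type). */
struct slot_info {
	uint64_t guest_phys_addr; /* first guest physical address of the slot */
	uint64_t memory_size;     /* slot length in bytes */
	uint64_t userspace_addr;  /* host virtual address backing the slot */
};

/*
 * Translate the gpa reported by a memory fault exit into the host virtual
 * address that a userfaultfd ioctl would operate on.
 * Returns 0 if no slot covers the gpa.
 */
static uint64_t gpa_to_hva(const struct slot_info *slots, size_t nr_slots,
			   uint64_t gpa)
{
	assert(slots || nr_slots == 0);

	for (size_t i = 0; i < nr_slots; i++) {
		const struct slot_info *s = &slots[i];

		if (gpa >= s->guest_phys_addr &&
		    gpa - s->guest_phys_addr < s->memory_size)
			return s->userspace_addr + (gpa - s->guest_phys_addr);
	}
	return 0;
}
```

The resulting address is what a vCPU thread would hand to the
userfaultfd ioctl before re-entering KVM_RUN.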

Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
demand_paging_test, which demonstrates the scalability improvements:
the following data was collected using [2] on an x86 machine with 256
cores.

vCPUs   Avg. paging rate (w/o new caps)   Avg. paging rate (w/ new caps)
1       150                               340
2       191                               477
4       210                               809
8       155                               1239
16      130                               1595
32      108                               2299
64      86                                3482
128     62                                4134
256     36                                4012
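
To put the table in relative terms, the sketch below computes the ratio
of the two columns per row. The struct and function names are
illustrative only; the numbers are copied from the table above. The
ratio comes out at roughly 2.3x for 1 vCPU and roughly 111x for 256
vCPUs.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table above; names are illustrative only. */
struct paging_sample {
	unsigned int vcpus;
	unsigned int rate_without_caps;
	unsigned int rate_with_caps;
};

static const struct paging_sample samples[] = {
	{   1, 150,  340 },
	{   2, 191,  477 },
	{   4, 210,  809 },
	{   8, 155, 1239 },
	{  16, 130, 1595 },
	{  32, 108, 2299 },
	{  64,  86, 3482 },
	{ 128,  62, 4134 },
	{ 256,  36, 4012 },
};

#define NR_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Ratio of the paging rate with the new caps to the rate without them. */
static double speedup(const struct paging_sample *s)
{
	assert(s->rate_without_caps != 0);
	return (double)s->rate_with_caps / s->rate_without_caps;
}
```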

The diff since the last version is small enough that I've attached a
range-diff in the cover letter; hopefully it's useful for review.

Links
~~~~~
[1] Original RFC from James Houghton:
    https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/

[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits)
        -a registers all of guest memory to a single uffd.
        -r specifies the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying the entire series.

---

v7
  - Add comment for the upgrade-to-atomic in __gfn_to_pfn_memslot()
    [James]
  - Expand description for KVM_MEM_GUEST_MEMFD in kvm/api.rst [James]
    and split it off into its own commit [Anish]
  - Update documentation to indicate that KVM_CAP_MEMORY_FAULT_INFO is
    available on arm [James]
  - Expand commit message for the "enable KVM_CAP_MEMORY_FAULT_INFO on
    arm64" commit [Anish]
  - Drop buggy "fast GUP on read faults" patch [Thanks James!]
  - Make KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING mutually exclusive
    [Sean, Oliver]
  - Drop incorrect "Documentation:" from some shortlogs [Sean]
  - Add description for the KVM_EXIT_MEMORY_FAULT RWX patch [Sean]
  - Style issues [Sean]
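
As an illustration of the new mutual-exclusion rule, here is a
standalone userspace sketch of the flag check. The flag values are
copied from the api.rst hunk in the range-diff below, and
check_memslot_flags() is a hypothetical stand-in for the kernel's
check_memory_region_flags(). It treats all four flags as
unconditionally valid, which is a simplifying assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values as listed in the api.rst hunk of this series. */
#define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
#define KVM_MEM_READONLY         (1UL << 1)
#define KVM_MEM_GUEST_MEMFD      (1UL << 2)
#define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)

/* The combined test below relies on the two flags being distinct bits. */
static_assert((KVM_MEM_READONLY & KVM_MEM_EXIT_ON_MISSING) == 0,
	      "flags must not overlap");

/*
 * Hypothetical userspace mirror of the kernel-side check: reject unknown
 * flags, and reject READONLY combined with EXIT_ON_MISSING.
 */
static int check_memslot_flags(uint32_t flags)
{
	const uint32_t valid_flags = KVM_MEM_LOG_DIRTY_PAGES |
				     KVM_MEM_READONLY |
				     KVM_MEM_GUEST_MEMFD |
				     KVM_MEM_EXIT_ON_MISSING;

	if (flags & ~valid_flags)
		return -EINVAL;
	if ((flags & KVM_MEM_READONLY) && (flags & KVM_MEM_EXIT_ON_MISSING))
		return -EINVAL;

	return 0;
}
```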

v6: https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
  - Rebase onto guest_memfd series [Anish/Sean]
  - Set write fault flag properly in user_mem_abort() [Oliver]
  - Reformat unnecessarily multi-line comments [Sean]
  - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
  - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
  - Remove unnecessary rounding in user_mem_abort() annotation [David]
  - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
    them with the stage-2 fault annotation patches [Sean]
  - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
    add another boolean parameter instead [Sean]
  - Better shortlog for the hva_to_pfn_fast() change [Anish]

v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
  - Rename APIs (again) [Sean]
  - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
  - Reword hva_to_pfn_fast() change commit message [Sean]
  - Correct style on terminal if statements [Sean]
  - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
  - Add read fault flag for annotated faults [Sean]
  - read/write_guest_page() changes
      - Move the annotations into vcpu wrapper fns [Sean]
      - Reorder parameters [Robert]
  - Rename kvm_populate_efault_info() to
    kvm_handle_guest_uaccess_fault() [Sean]
  - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
  - Correct description of the faults which hva_to_pfn_fast() can now
    resolve [Sean]
  - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
  - Magnanimously accept Sean's rewrite of the handle_error_pfn()
    annotation [Anish]
  - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]

v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
  - Fix excessive indentation [Robert, Oliver]
  - Calculate final stats when uffd handler fn returns an error [Robert]
  - Remove redundant info from uffd_desc [Robert]
  - Fix various commit message typos [Robert]
  - Add comment about suppressed EEXISTs in selftest [Robert]
  - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
  - Fix some include/logic issues in self test [Robert]
  - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
  - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
  - Drop most of the annotations from v3: see
    https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
  - Remove WARN on bare efaults [Sean, Oliver]
  - Eliminate unnecessary UFFDIO_WAKE call from self test [James]

v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series on the mailing
    list but was just targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyways, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_X86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (14):
  KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
  KVM: Add function comments for __kvm_read/write_guest_page()
  KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
  KVM: Simplify error handling in __gfn_to_pfn_memslot()
  KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
    userspace
  KVM: Add memslot flag to let userspace force an exit on missing hva
    mappings
  KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
    stage-2 fault handler
  KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
    stage-2 fault handler
  KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
  KVM: selftests: Report per-vcpu demand paging rate from demand paging
    test
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  39 ++-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |   7 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |   8 +-
 include/linux/kvm_host.h                      |  21 +-
 include/uapi/linux/kvm.h                      |   5 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  46 ++-
 22 files changed, 453 insertions(+), 172 deletions(-)

Range-diff against v6:
 1:  2089d8955538 !  1:  063d5d109f34 KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
    @@ Metadata
     Author: Anish Moorthy <amoorthy@google.com>
     
      ## Commit message ##
    -    KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
    +    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
     
    -    The current docstring can be read as "atomic -> allowed to sleep," when
    -    in fact the intended statement is "atomic -> NOT allowed to sleep." Make
    -    that clearer in the docstring.
    +    The current description can be read as "atomic -> allowed to sleep,"
    +    when in fact the intended statement is "atomic -> NOT allowed to sleep."
    +    Make that clearer in the docstring.
     
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
 2:  36963c6eee29 !  2:  e038fe64f44a KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
    @@ Metadata
     Author: Anish Moorthy <amoorthy@google.com>
     
      ## Commit message ##
    -    KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
    +    KVM: Add function comments for __kvm_read/write_guest_page()
     
         The (gfn, data, offset, len) order of parameters is a little strange
    -    since "offset" applies to "gfn" rather than to "data". Add docstrings to
    -    make things perfectly clear.
    +    since "offset" applies to "gfn" rather than to "data". Add function
    +    comments to make things perfectly clear.
     
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
 -:  ------------ >  3:  812a2208da95 KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
 3:  4994835c51f5 =  4:  44cec9bf6166 KVM: Simplify error handling in __gfn_to_pfn_memslot()
 4:  3d51224854b1 !  5:  df09c7482fbf KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
    @@ Metadata
      ## Commit message ##
         KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
     
    +    kvm_prepare_memory_fault_exit() already takes parameters describing the
    +    RWX-ness of the relevant access but doesn't actually do anything with
    +    them. Define and use the flags necessary to pass this information on to
    +    userspace.
    +
         Suggested-by: Sean Christopherson <seanjc@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
 5:  6bab46398020 <  -:  ------------ KVM: Try using fast GUP to resolve read faults
 6:  556e7079c419 !  6:  6a6993bda462 KVM: Add memslot flag to let userspace force an exit on missing hva mappings
    @@ Commit message
     
         Suggested-by: James Houghton <jthoughton@google.com>
         Suggested-by: Sean Christopherson <seanjc@google.com>
    -    Reviewed-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
      ## Documentation/virt/kvm/api.rst ##
     @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
    -   /* for kvm_userspace_memory_region::flags */
        #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
        #define KVM_MEM_READONLY	(1UL << 1)
    -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
    +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
     +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
      
      This ioctl allows the user to create, modify or delete a guest physical
    @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
      be identical.  This allows large pages in the guest to be backed by large
      pages in the host.
      
    --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
    --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
    +-The flags field supports three flags
     +The flags field supports four flags
    -+
    -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
    + 
    + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
      writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
    --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
    -+use it.
    -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
    - to make a new slot read-only.  In this case, writes to this memory will be
    +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
      posted to userspace as KVM_EXIT_MMIO exits.
    -+3.  KVM_MEM_GUEST_MEMFD
    + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
    + incompatible with KVM_SET_USER_MEMORY_REGION.
     +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
      
      When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
      the memory region are automatically reflected into the guest.  For example, an
    +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
    + was a load or a store, instruction abort if it was an instruction
    + fetch) is injected in the guest.
    + 
    ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
    ++exclusive.
    ++
    + 4.36 KVM_SET_TSS_ADDR
    + ---------------------
    + 
     @@ Documentation/virt/kvm/api.rst: error/annotated fault.
      
      See KVM_EXIT_MEMORY_FAULT for more information.
    @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
      
      /* for KVM_IRQ_LINE */
      struct kvm_irq_level {
    -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
    +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
      #define KVM_CAP_MEMORY_ATTRIBUTES 233
      #define KVM_CAP_GUEST_MEMFD 234
      #define KVM_CAP_VM_TYPES 235
     +#define KVM_CAP_EXIT_ON_MISSING 236
      
    - #ifdef KVM_CAP_IRQ_ROUTING
    - 
    + struct kvm_irq_routing_irqchip {
    + 	__u32 irqchip;
     
      ## virt/kvm/Kconfig ##
     @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
    @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
     +
      	if (mem->flags & ~valid_flags)
      		return -EINVAL;
    ++	else if ((mem->flags & KVM_MEM_READONLY) &&
    ++		 (mem->flags & KVM_MEM_EXIT_ON_MISSING))
    ++		return -EINVAL;
      
    + 	return 0;
    + }
     @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
      
      kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
    @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
      		writable = NULL;
      	}
      
    -+	if (!atomic && can_exit_on_missing
    -+	    && kvm_is_slot_exit_on_missing(slot)) {
    ++	/* When the slot is exit-on-missing (and when we should respect that)
    ++	 * set atomic=true to prevent GUP from faulting in the userspace
    ++	 * mappings.
    ++	 */
    ++	if (!atomic && can_exit_on_missing &&
    ++	    kvm_is_slot_exit_on_missing(slot)) {
     +		atomic = true;
     +		if (async) {
     +			*async = false;
 7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
    @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
     
      ## arch/x86/kvm/Kconfig ##
     @@ arch/x86/kvm/Kconfig: config KVM
    - 	select INTERVAL_TREE
    + 	select KVM_VFIO
      	select HAVE_KVM_PM_NOTIFIER if PM
      	select KVM_GENERIC_HARDWARE_ENABLING
     +        select HAVE_KVM_EXIT_ON_MISSING
 8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
 -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
 9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
    @@ Metadata
     Author: Anish Moorthy <amoorthy@google.com>
     
      ## Commit message ##
    -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
    +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
     
         Prevent the stage-2 fault handler from faulting in pages when
         KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
    -    calls to check the memslot flag.
    -
    -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
    -    when the stage-2 handler cannot resolve the pfn for a fault. With
    -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
    -    faults as vCPU exits, which userspace can attempt to resolve without
    -    terminating the guest.
    +    call to check the memslot flag. This effects the delivery of stage-2
    +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
    +    can attempt to resolve without terminating the guest.
     
         Delivering stage-2 faults to userspace in this way sidesteps the
         significant scalabiliy issues associated with using userfaultfd for the
    @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
     
      ## arch/arm64/kvm/Kconfig ##
     @@ arch/arm64/kvm/Kconfig: menuconfig KVM
    + 	select SCHED_INFO
      	select GUEST_PERF_EVENTS if PERF_EVENTS
    - 	select INTERVAL_TREE
      	select XARRAY_MULTI
     +        select HAVE_KVM_EXIT_ON_MISSING
      	help
    @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
      	if (pfn == KVM_PFN_ERR_HWPOISON) {
      		kvm_send_hwpoison_signal(hva, vma_shift);
      		return 0;
    - 	}
    --	if (is_error_noslot_pfn(pfn))
    -+	if (is_error_noslot_pfn(pfn)) {
    -+		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
    -+					      write_fault, exec_fault, false);
    - 		return -EFAULT;
    -+	}
    - 
    - 	if (kvm_is_device_pfn(pfn)) {
    - 		/*
10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
    @@ Commit message
         configuring the number of reader threads per UFFD as well: add the "-r"
         flag to do so.
     
    -    Acked-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
    +    Acked-by: James Houghton <jthoughton@google.com>
     
      ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
     @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
    @@ Commit message
         [1] Single-vCPU performance does suffer somewhat.
         [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
     
    -    Acked-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
    +    Acked-by: James Houghton <jthoughton@google.com>
     
      ## tools/testing/selftests/kvm/demand_paging_test.c ##
     @@
13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
    @@ Commit message
     
         Demonstrate a (very basic) scheme for supporting memory fault exits.
     
    -    >From the vCPU threads:
    +    From the vCPU threads:
         1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
            with the purpose of establishing the absent mappings. Do so with
            wake_waiters=false to avoid serializing on the userfaultfd wait queue
    @@ Commit message
         [A] In reality it is much likelier that the vCPU thread simply lost a
             race to establish the mapping for the page.
     
    -    Acked-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
    +    Acked-by: James Houghton <jthoughton@google.com>
     
      ## tools/testing/selftests/kvm/demand_paging_test.c ##
     @@

base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3

Comments

Gupta, Pankaj Feb. 16, 2024, 7:36 a.m. UTC | #1
On 2/16/2024 12:53 AM, Anish Moorthy wrote:
> This series adds an option that causes stage-2 fault handlers to exit
> to userspace with KVM_EXIT_MEMORY_FAULT when they would otherwise have
> to fault in the userspace mappings. Userspace can then receive stage-2
> faults directly from KVM_RUN instead of through userfaultfd, which
> suffers from serious contention issues as the number of vCPUs scales.

Thanks for your work!

So, this is an alternative approach for userspace VMMs like QEMU to do
post-copy live migration using KVM_EXIT_MEMORY_FAULT instead of
userfaultfd, which seems to get slower with more vCPUs.

Maybe I am missing something here. I'm just curious: with this
approach, how would a userspace VMM such as QEMU copy the memory once
the page is available from the remote host? Earlier this was done with
UFFDIO_COPY.

Just trying to understand how this will work for the existing interfaces.

Best regards,
Pankaj

>       
>           Suggested-by: James Houghton <jthoughton@google.com>
>           Suggested-by: Sean Christopherson <seanjc@google.com>
>      -    Reviewed-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>       
>        ## Documentation/virt/kvm/api.rst ##
>       @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
>      -   /* for kvm_userspace_memory_region::flags */
>          #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>          #define KVM_MEM_READONLY	(1UL << 1)
>      -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
>      +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
>       +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
>        
>        This ioctl allows the user to create, modify or delete a guest physical
>      @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
>        be identical.  This allows large pages in the guest to be backed by large
>        pages in the host.
>        
>      --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
>      --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
>      +-The flags field supports three flags
>       +The flags field supports four flags
>      -+
>      -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>      +
>      + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>        writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
>      --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
>      -+use it.
>      -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
>      - to make a new slot read-only.  In this case, writes to this memory will be
>      +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
>        posted to userspace as KVM_EXIT_MMIO exits.
>      -+3.  KVM_MEM_GUEST_MEMFD
>      + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
>      + incompatible with KVM_SET_USER_MEMORY_REGION.
>       +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
>        
>        When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>        the memory region are automatically reflected into the guest.  For example, an
>      +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
>      + was a load or a store, instruction abort if it was an instruction
>      + fetch) is injected in the guest.
>      +
>      ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
>      ++exclusive.
>      ++
>      + 4.36 KVM_SET_TSS_ADDR
>      + ---------------------
>      +
>       @@ Documentation/virt/kvm/api.rst: error/annotated fault.
>        
>        See KVM_EXIT_MEMORY_FAULT for more information.
>      @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
>        
>        /* for KVM_IRQ_LINE */
>        struct kvm_irq_level {
>      -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
>      +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
>        #define KVM_CAP_MEMORY_ATTRIBUTES 233
>        #define KVM_CAP_GUEST_MEMFD 234
>        #define KVM_CAP_VM_TYPES 235
>       +#define KVM_CAP_EXIT_ON_MISSING 236
>        
>      - #ifdef KVM_CAP_IRQ_ROUTING
>      -
>      + struct kvm_irq_routing_irqchip {
>      + 	__u32 irqchip;
>       
>        ## virt/kvm/Kconfig ##
>       @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
>      @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
>       +
>        	if (mem->flags & ~valid_flags)
>        		return -EINVAL;
>      ++	else if ((mem->flags & KVM_MEM_READONLY) &&
>      ++		 (mem->flags & KVM_MEM_EXIT_ON_MISSING))
>      ++		return -EINVAL;
>        
>      + 	return 0;
>      + }
>       @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>        
>        kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>      @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
>        		writable = NULL;
>        	}
>        
>      -+	if (!atomic && can_exit_on_missing
>      -+	    && kvm_is_slot_exit_on_missing(slot)) {
>      ++	/* When the slot is exit-on-missing (and when we should respect that)
>      ++	 * set atomic=true to prevent GUP from faulting in the userspace
>      ++	 * mappings.
>      ++	 */
>      ++	if (!atomic && can_exit_on_missing &&
>      ++	    kvm_is_slot_exit_on_missing(slot)) {
>       +		atomic = true;
>       +		if (async) {
>       +			*async = false;
>   7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
>      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
>       
>        ## arch/x86/kvm/Kconfig ##
>       @@ arch/x86/kvm/Kconfig: config KVM
>      - 	select INTERVAL_TREE
>      + 	select KVM_VFIO
>        	select HAVE_KVM_PM_NOTIFIER if PM
>        	select KVM_GENERIC_HARDWARE_ENABLING
>       +        select HAVE_KVM_EXIT_ON_MISSING
>   8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
>   -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
>   9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
>      @@ Metadata
>       Author: Anish Moorthy <amoorthy@google.com>
>       
>        ## Commit message ##
>      -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
>      +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
>       
>           Prevent the stage-2 fault handler from faulting in pages when
>           KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
>      -    calls to check the memslot flag.
>      -
>      -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
>      -    when the stage-2 handler cannot resolve the pfn for a fault. With
>      -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
>      -    faults as vCPU exits, which userspace can attempt to resolve without
>      -    terminating the guest.
>      +    call to check the memslot flag. This effects the delivery of stage-2
>      +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
>      +    can attempt to resolve without terminating the guest.
>       
>           Delivering stage-2 faults to userspace in this way sidesteps the
>           significant scalabiliy issues associated with using userfaultfd for the
>      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
>       
>        ## arch/arm64/kvm/Kconfig ##
>       @@ arch/arm64/kvm/Kconfig: menuconfig KVM
>      + 	select SCHED_INFO
>        	select GUEST_PERF_EVENTS if PERF_EVENTS
>      - 	select INTERVAL_TREE
>        	select XARRAY_MULTI
>       +        select HAVE_KVM_EXIT_ON_MISSING
>        	help
>      @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
>        	if (pfn == KVM_PFN_ERR_HWPOISON) {
>        		kvm_send_hwpoison_signal(hva, vma_shift);
>        		return 0;
>      - 	}
>      --	if (is_error_noslot_pfn(pfn))
>      -+	if (is_error_noslot_pfn(pfn)) {
>      -+		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
>      -+					      write_fault, exec_fault, false);
>      - 		return -EFAULT;
>      -+	}
>      -
>      - 	if (kvm_is_device_pfn(pfn)) {
>      - 		/*
> 10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
> 11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
>      @@ Commit message
>           configuring the number of reader threads per UFFD as well: add the "-r"
>           flag to do so.
>       
>      -    Acked-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>      +    Acked-by: James Houghton <jthoughton@google.com>
>       
>        ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
>       @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
> 12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
>      @@ Commit message
>           [1] Single-vCPU performance does suffer somewhat.
>           [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
>       
>      -    Acked-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>      +    Acked-by: James Houghton <jthoughton@google.com>
>       
>        ## tools/testing/selftests/kvm/demand_paging_test.c ##
>       @@
> 13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> 14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
>      @@ Commit message
>       
>           Demonstrate a (very basic) scheme for supporting memory fault exits.
>       
>      -    >From the vCPU threads:
>      +    From the vCPU threads:
>           1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
>              with the purpose of establishing the absent mappings. Do so with
>              wake_waiters=false to avoid serializing on the userfaultfd wait queue
>      @@ Commit message
>           [A] In reality it is much likelier that the vCPU thread simply lost a
>               race to establish the mapping for the page.
>       
>      -    Acked-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>      +    Acked-by: James Houghton <jthoughton@google.com>
>       
>        ## tools/testing/selftests/kvm/demand_paging_test.c ##
>       @@
> 
> base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3
Anish Moorthy Feb. 16, 2024, 8 p.m. UTC | #2
On Thu, Feb 15, 2024 at 11:36 PM Gupta, Pankaj <pankaj.gupta@amd.com> wrote:
>
> On 2/16/2024 12:53 AM, Anish Moorthy wrote:
> > This series adds an option to cause stage-2 fault handlers to
> > KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
> > the userspace mappings. Doing so allows userspace to receive stage-2
> > faults directly from KVM_RUN instead of through userfaultfd, which
> > suffers from serious contention issues as the number of vCPUs scales.
>
> Thanks for your work!

:D

>
> So, this is an alternative approach for userspace like QEMU to do
> post-copy live migration using KVM_MEMORY_FAULT_EXIT instead of
> userfaultfd, which seems slower with more vCPUs.
>
> Maybe I am missing some things here, just curious how a userspace VMM,
> e.g. QEMU, would do the memory copy with this approach once the page is
> available from the remote host, which was done with UFFDIO_COPY earlier?

This new capability is meant to be used *alongside* userfaultfd during
post-copy: it's not a replacement. KVM_RUN can generate page faults
from outside the stage-2 fault handlers (IIUC instruction emulation is
one source), and these paths are unchanged, so it's important that
userspace still UFFDIO_REGISTERs KVM's mapping and reads from the UFFD
to catch these guest accesses. But with the new
KVM_MEM_EXIT_ON_MISSING memslot flag set, the stage-2 handlers will
report needing to fault in memory via a KVM_EXIT_MEMORY_FAULT exit
instead of queuing onto the UFFD.

In the workloads I've tested, the vast majority of guest-generated
page faults (99%+) come from the stage-2 handlers. So this series
"solves" the issue of contention on the UFFD file descriptor by
(mostly) sidestepping it.

As for how userspace actually uses the new functionality: when a vCPU
thread receives a KVM_EXIT_MEMORY_FAULT for an unfetched page during
post-copy, it might

(a) Fetch the page
(b) Install the page into KVM's mapping via UFFDIO_COPY (it doesn't
necessarily need to UFFDIO_WAKE!)
(c) Call KVM_RUN to re-enter the guest and retry the access. The
stage-2 fault handler will fire again but almost certainly won't
exit with KVM_EXIT_MEMORY_FAULT now (since the UFFDIO_COPY will have
mapped the page), so the guest can continue.

and userspace can continue using some thread(s) to

(a) Read page faults from the UFFD.
(b) Install the page using UFFDIO_COPY + UFFDIO_WAKE
(c) goto (a)

to make sure it catches everything. The combination of these two things
adds up to more performant "uffd-based" postcopy.
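
To make step (b) of the vCPU-thread list a bit more concrete, here is a
minimal sketch of turning the faulting guest physical address into the
argument for a UFFDIO_COPY without a wake-up. Only struct uffdio_copy
and UFFDIO_COPY_MODE_DONTWAKE come from <linux/userfaultfd.h>; struct
demo_memslot and demo_prepare_copy() are hypothetical names made up for
illustration, and the sketch assumes the slot's base addresses are
page-aligned. It is not what the selftest or any VMM actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/* Hypothetical: one memslot as a VMM might track it. */
struct demo_memslot {
	uint64_t guest_phys_addr;	/* first gpa covered by the slot */
	uint64_t memory_size;		/* slot length in bytes */
	uint64_t userspace_addr;	/* hva backing guest_phys_addr */
};

/*
 * Hypothetical helper: fill in the UFFDIO_COPY argument for the page
 * containing fault_gpa (the gpa reported by the memory fault exit).
 * Returns false if the gpa does not fall inside the slot.
 */
bool demo_prepare_copy(const struct demo_memslot *slot, uint64_t fault_gpa,
		       uint64_t page_size, const void *fetched_page,
		       struct uffdio_copy *copy)
{
	uint64_t page_gpa;

	/* page_size must be a non-zero power of two. */
	assert(page_size && !(page_size & (page_size - 1)));

	if (fault_gpa < slot->guest_phys_addr ||
	    fault_gpa - slot->guest_phys_addr >= slot->memory_size)
		return false;

	/* Round down to the start of the faulting page. */
	page_gpa = fault_gpa & ~(page_size - 1);

	copy->dst = slot->userspace_addr + (page_gpa - slot->guest_phys_addr);
	copy->src = (uint64_t)(uintptr_t)fetched_page;
	copy->len = page_size;
	/* The vCPU thread retries via KVM_RUN itself, so skip the wake. */
	copy->mode = UFFDIO_COPY_MODE_DONTWAKE;
	copy->copy = 0;
	return true;
}
```

The caller would then pass the filled-in struct to ioctl(uffd,
UFFDIO_COPY, &copy) and call KVM_RUN again. The reader threads, by
contrast, leave the mode at 0 so that any waiters get woken.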

I'm of course glossing over some details (e.g. when two vCPU threads
race to fetch a page, one of them should probably MADV_POPULATE_WRITE
somehow), but I hope this is helpful. My patch to the KVM demand
paging selftest might also clarify things a bit [1].
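
As a rough illustration of the race mentioned above: one way for a vCPU
thread to tell "I installed the page" apart from "someone else got there
first" is to look at errno after the UFFDIO_COPY. The sketch below
assumes the losing thread sees EEXIST; the enum and both function names
are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of one UFFDIO_COPY attempt from a vCPU thread. */
enum demo_install_result {
	DEMO_INSTALLED,	/* this thread established the mapping */
	DEMO_LOST_RACE,	/* another thread mapped the page first */
	DEMO_FAILED,	/* a real error: do not re-enter the guest */
};

/*
 * Classify the return value and errno of ioctl(uffd, UFFDIO_COPY, ...).
 * Assumption: EEXIST means the page was already mapped by someone else.
 */
enum demo_install_result demo_classify_install(int ioctl_ret, int saved_errno)
{
	if (ioctl_ret == 0)
		return DEMO_INSTALLED;

	/* ioctl() reports failure as -1 with errno set. */
	assert(ioctl_ret == -1);

	if (saved_errno == EEXIST)
		return DEMO_LOST_RACE;	/* MADV_POPULATE_WRITE would go here */

	return DEMO_FAILED;
}

/* Both non-error outcomes allow calling KVM_RUN again. */
bool demo_can_reenter_guest(enum demo_install_result result)
{
	return result != DEMO_FAILED;
}
```

Treating the lost race as a non-error keeps the vCPU thread from
tearing down the guest just because a sibling thread won.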

Please let me know if you have more questions!

[1] https://lore.kernel.org/kvm/1f67639d-c6a2-1f36-b086-eb65fa2ab275@amd.com/T/#m28055e5d708103d126985e38e18b591d535e1e84




> Just trying to understand how this will work for the existing interfaces.
> Best regards,
> Pankaj
>
> >
> > Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
> > demand_paging_test, which demonstrates the scalability improvements:
> > the following data was collected using [2] on an x86 machine with 256
> > cores.
> >
> > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > 1       150     340
> > 2       191     477
> > 4       210     809
> > 8       155     1239
> > 16      130     1595
> > 32      108     2299
> > 64      86      3482
> > 128     62      4134
> > 256     36      4012
> >
> > The diff since the last version is small enough that I've attached a
> > range-diff in the cover letter- hopefully it's useful for review.
> >
> > Links
> > ~~~~~
> > [1] Original RFC from James Houghton:
> >      https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> >
> > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> >      A quick rundown of the new flags (also detailed in later commits)
> >          -a registers all of guest memory to a single uffd.
> >          -r species the number of reader threads for polling the uffd.
> >          -w is what actually enables the new capabilities.
> >      All data was collected after applying the entire series
> >
> > ---
> >
> > v7
> >    - Add comment for the upgrade-to-atomic in __gfn_to_pfn_memslot()
> >      [James]
> >    - Expand description for KVM_MEM_GUEST_MEMFD in kvm/api.rst [James]
> >      and split it off into its own commit [Anish]
> >    - Update documentation to indicate that KVM_CAP_MEMORY_FAULT_INFO is
> >      available on arm [James]
> >    - Expand commit message for the "enable KVM_CAP_MEMORY_FAULT_INFO on
> >      arm64" commit [Anish]
> >    - Drop buggy "fast GUP on read faults" patch [Thanks James!]
> >    - Make KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING mutually exclusive
> >      [Sean, Oliver]
> >    - Drop incorrect "Documentation:" from some shortlogs [Sean]
> >    - Add description for the KVM_EXIT_MEMORY_FAULT RWX patch [Sean]
> >    - Style issues [Sean]
> >
> > v6: https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
> >    - Rebase onto guest_memfd series [Anish/Sean]
> >    - Set write fault flag properly in user_mem_abort() [Oliver]
> >    - Reformat unnecessarily multi-line comments [Sean]
> >    - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
> >    - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
> >    - Remove unnecessary rounding in user_mem_abort() annotation [David]
> >    - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
> >      them with the stage-2 fault annotation patches [Sean]
> >    - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
> >      add another boolean parameter instead [Sean]
> >    - Better shortlog for the hva_to_pfn_fast() change [Anish]
> >
> > v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
> >    - Rename APIs (again) [Sean]
> >    - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
> >    - Reword hva_to_pfn_fast() change commit message [Sean]
> >    - Correct style on terminal if statements [Sean]
> >    - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
> >    - Add read fault flag for annotated faults [Sean]
> >    - read/write_guest_page() changes
> >        - Move the annotations into vcpu wrapper fns [Sean]
> >        - Reorder parameters [Robert]
> >    - Rename kvm_populate_efault_info() to
> >      kvm_handle_guest_uaccess_fault() [Sean]
> >    - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
> >    - Correct description of the faults which hva_to_pfn_fast() can now
> >      resolve [Sean]
> >    - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
> >    - Magnanimously accept Sean's rewrite of the handle_error_pfn()
> >      annotation [Anish]
> >    - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]
> >
> > v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
> >    - Fix excessive indentation [Robert, Oliver]
> >    - Calculate final stats when uffd handler fn returns an error [Robert]
> >    - Remove redundant info from uffd_desc [Robert]
> >    - Fix various commit message typos [Robert]
> >    - Add comment about suppressed EEXISTs in selftest [Robert]
> >    - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
> >    - Fix some include/logic issues in self test [Robert]
> >    - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
> >    - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
> >    - Drop most of the annotations from v3: see
> >      https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
> >    - Remove WARN on bare efaults [Sean, Oliver]
> >    - Eliminate unnecessary UFFDIO_WAKE call from self test [James]
> >
> > v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
> >    - Rework the implementation to be based on two orthogonal
> >      capabilities (KVM_CAP_MEMORY_FAULT_INFO and
> >      KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
> >    - Change return code of kvm_populate_efault_info [Isaku]
> >    - Use kvm_populate_efault_info from arm code [Oliver]
> >
> > v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/
> >
> >      This was a bit of a misfire, as I sent my WIP series on the mailing
> >      list but was just targeting Sean for some feedback. Oliver Upton and
> >      Isaku Yamahata ended up discovering the series and giving me some
> >      feedback anyways, so thanks to them :) In the end, there was enough
> >      discussion to justify retroactively labeling it as v2, even with the
> >      limited cc list.
> >
> >    - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
> >    - API changes:
> >          - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
> >            KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
> >            requirement).
> >          - Switched to memslot flag
> >    - Take Oliver's simplification to the "allow fast gup for readable
> >      faults" logic.
> >    - Slightly redefine the return code of user_mem_abort.
> >    - Fix documentation errors brought up by Marc
> >    - Reword commit messages in imperative mood
> >
> > v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/
> >
> > Anish Moorthy (14):
> >    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >    KVM: Add function comments for __kvm_read/write_guest_page()
> >    KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> >    KVM: Simplify error handling in __gfn_to_pfn_memslot()
> >    KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
> >      userspace
> >    KVM: Add memslot flag to let userspace force an exit on missing hva
> >      mappings
> >    KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
> >      stage-2 fault handler
> >    KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
> >      stage-2 fault handler
> >    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> >    KVM: selftests: Report per-vcpu demand paging rate from demand paging
> >      test
> >    KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
> >      paging test
> >    KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
> >      signal errors via TEST_ASSERT
> >    KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> >    KVM: selftests: Handle memory fault exits in demand_paging_test
> >
> >   Documentation/virt/kvm/api.rst                |  39 ++-
> >   arch/arm64/kvm/Kconfig                        |   1 +
> >   arch/arm64/kvm/arm.c                          |   1 +
> >   arch/arm64/kvm/mmu.c                          |   7 +-
> >   arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
> >   arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
> >   arch/x86/kvm/Kconfig                          |   1 +
> >   arch/x86/kvm/mmu/mmu.c                        |   8 +-
> >   include/linux/kvm_host.h                      |  21 +-
> >   include/uapi/linux/kvm.h                      |   5 +
> >   .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
> >   .../selftests/kvm/access_tracking_perf_test.c |   2 +-
> >   .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
> >   .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
> >   .../testing/selftests/kvm/include/memstress.h |   2 +-
> >   .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
> >   tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
> >   .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
> >   .../kvm/memslot_modification_stress_test.c    |   2 +-
> >   .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
> >   virt/kvm/Kconfig                              |   3 +
> >   virt/kvm/kvm_main.c                           |  46 ++-
> >   22 files changed, 453 insertions(+), 172 deletions(-)
> >
> > Range-diff against v6:
> >   1:  2089d8955538 !  1:  063d5d109f34 KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >      @@ Metadata
> >       Author: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Commit message ##
> >      -    KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >      +    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >
> >      -    The current docstring can be read as "atomic -> allowed to sleep," when
> >      -    in fact the intended statement is "atomic -> NOT allowed to sleep." Make
> >      -    that clearer in the docstring.
> >      +    The current description can be read as "atomic -> allowed to sleep,"
> >      +    when in fact the intended statement is "atomic -> NOT allowed to sleep."
> >      +    Make that clearer in the docstring.
> >
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >   2:  36963c6eee29 !  2:  e038fe64f44a KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> >      @@ Metadata
> >       Author: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Commit message ##
> >      -    KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> >      +    KVM: Add function comments for __kvm_read/write_guest_page()
> >
> >           The (gfn, data, offset, len) order of parameters is a little strange
> >      -    since "offset" applies to "gfn" rather than to "data". Add docstrings to
> >      -    make things perfectly clear.
> >      +    since "offset" applies to "gfn" rather than to "data". Add function
> >      +    comments to make things perfectly clear.
> >
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >   -:  ------------ >  3:  812a2208da95 KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> >   3:  4994835c51f5 =  4:  44cec9bf6166 KVM: Simplify error handling in __gfn_to_pfn_memslot()
> >   4:  3d51224854b1 !  5:  df09c7482fbf KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> >      @@ Metadata
> >        ## Commit message ##
> >           KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> >
> >      +    kvm_prepare_memory_fault_exit() already takes parameters describing the
> >      +    RWX-ness of the relevant access but doesn't actually do anything with
> >      +    them. Define and use the flags necessary to pass this information on to
> >      +    userspace.
> >      +
> >           Suggested-by: Sean Christopherson <seanjc@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >   5:  6bab46398020 <  -:  ------------ KVM: Try using fast GUP to resolve read faults
> >   6:  556e7079c419 !  6:  6a6993bda462 KVM: Add memslot flag to let userspace force an exit on missing hva mappings
> >      @@ Commit message
> >
> >           Suggested-by: James Houghton <jthoughton@google.com>
> >           Suggested-by: Sean Christopherson <seanjc@google.com>
> >      -    Reviewed-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Documentation/virt/kvm/api.rst ##
> >       @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
> >      -   /* for kvm_userspace_memory_region::flags */
> >          #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> >          #define KVM_MEM_READONLY     (1UL << 1)
> >      -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> >      +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> >       +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> >
> >        This ioctl allows the user to create, modify or delete a guest physical
> >      @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
> >        be identical.  This allows large pages in the guest to be backed by large
> >        pages in the host.
> >
> >      --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> >      --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> >      +-The flags field supports three flags
> >       +The flags field supports four flags
> >      -+
> >      -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> >      +
> >      + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> >        writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> >      --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> >      -+use it.
> >      -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
> >      - to make a new slot read-only.  In this case, writes to this memory will be
> >      +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
> >        posted to userspace as KVM_EXIT_MMIO exits.
> >      -+3.  KVM_MEM_GUEST_MEMFD
> >      + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
> >      + incompatible with KVM_SET_USER_MEMORY_REGION.
> >       +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
> >
> >        When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >        the memory region are automatically reflected into the guest.  For example, an
> >      +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
> >      + was a load or a store, instruction abort if it was an instruction
> >      + fetch) is injected in the guest.
> >      +
> >      ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
> >      ++exclusive.
> >      ++
> >      + 4.36 KVM_SET_TSS_ADDR
> >      + ---------------------
> >      +
> >       @@ Documentation/virt/kvm/api.rst: error/annotated fault.
> >
> >        See KVM_EXIT_MEMORY_FAULT for more information.
> >      @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
> >
> >        /* for KVM_IRQ_LINE */
> >        struct kvm_irq_level {
> >      -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
> >      +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
> >        #define KVM_CAP_MEMORY_ATTRIBUTES 233
> >        #define KVM_CAP_GUEST_MEMFD 234
> >        #define KVM_CAP_VM_TYPES 235
> >       +#define KVM_CAP_EXIT_ON_MISSING 236
> >
> >      - #ifdef KVM_CAP_IRQ_ROUTING
> >      -
> >      + struct kvm_irq_routing_irqchip {
> >      +        __u32 irqchip;
> >
> >        ## virt/kvm/Kconfig ##
> >       @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
> >      @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
> >       +
> >               if (mem->flags & ~valid_flags)
> >                       return -EINVAL;
> >      ++       else if ((mem->flags & KVM_MEM_READONLY) &&
> >      ++                (mem->flags & KVM_MEM_EXIT_ON_MISSING))
> >      ++               return -EINVAL;
> >
> >      +        return 0;
> >      + }
> >       @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> >
> >        kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> >      @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
> >                       writable = NULL;
> >               }
> >
> >      -+       if (!atomic && can_exit_on_missing
> >      -+           && kvm_is_slot_exit_on_missing(slot)) {
> >      ++       /* When the slot is exit-on-missing (and when we should respect that)
> >      ++        * set atomic=true to prevent GUP from faulting in the userspace
> >      ++        * mappings.
> >      ++        */
> >      ++       if (!atomic && can_exit_on_missing &&
> >      ++           kvm_is_slot_exit_on_missing(slot)) {
> >       +               atomic = true;
> >       +               if (async) {
> >       +                       *async = false;
> >   7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
> >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> >
> >        ## arch/x86/kvm/Kconfig ##
> >       @@ arch/x86/kvm/Kconfig: config KVM
> >      -        select INTERVAL_TREE
> >      +        select KVM_VFIO
> >               select HAVE_KVM_PM_NOTIFIER if PM
> >               select KVM_GENERIC_HARDWARE_ENABLING
> >       +        select HAVE_KVM_EXIT_ON_MISSING
> >   8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
> >   -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
> >   9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> >      @@ Metadata
> >       Author: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Commit message ##
> >      -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> >      +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> >
> >           Prevent the stage-2 fault handler from faulting in pages when
> >           KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
> >      -    calls to check the memslot flag.
> >      -
> >      -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
> >      -    when the stage-2 handler cannot resolve the pfn for a fault. With
> >      -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
> >      -    faults as vCPU exits, which userspace can attempt to resolve without
> >      -    terminating the guest.
> >      +    call to check the memslot flag. This effects the delivery of stage-2
> >      +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
> >      +    can attempt to resolve without terminating the guest.
> >
> >           Delivering stage-2 faults to userspace in this way sidesteps the
> >           significant scalability issues associated with using userfaultfd for the
> >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> >
> >        ## arch/arm64/kvm/Kconfig ##
> >       @@ arch/arm64/kvm/Kconfig: menuconfig KVM
> >      +        select SCHED_INFO
> >               select GUEST_PERF_EVENTS if PERF_EVENTS
> >      -        select INTERVAL_TREE
> >               select XARRAY_MULTI
> >       +        select HAVE_KVM_EXIT_ON_MISSING
> >               help
> >      @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
> >               if (pfn == KVM_PFN_ERR_HWPOISON) {
> >                       kvm_send_hwpoison_signal(hva, vma_shift);
> >                       return 0;
> >      -        }
> >      --       if (is_error_noslot_pfn(pfn))
> >      -+       if (is_error_noslot_pfn(pfn)) {
> >      -+               kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> >      -+                                             write_fault, exec_fault, false);
> >      -                return -EFAULT;
> >      -+       }
> >      -
> >      -        if (kvm_is_device_pfn(pfn)) {
> >      -                /*
> > 10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
> > 11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
> >      @@ Commit message
> >           configuring the number of reader threads per UFFD as well: add the "-r"
> >           flag to do so.
> >
> >      -    Acked-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >      +    Acked-by: James Houghton <jthoughton@google.com>
> >
> >        ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
> >       @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
> > 12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
> >      @@ Commit message
> >           [1] Single-vCPU performance does suffer somewhat.
> >           [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
> >
> >      -    Acked-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >      +    Acked-by: James Houghton <jthoughton@google.com>
> >
> >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> >       @@
> > 13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> > 14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
> >      @@ Commit message
> >
> >           Demonstrate a (very basic) scheme for supporting memory fault exits.
> >
> >      -    >From the vCPU threads:
> >      +    From the vCPU threads:
> >           1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
> >              with the purpose of establishing the absent mappings. Do so with
> >              wake_waiters=false to avoid serializing on the userfaultfd wait queue
> >      @@ Commit message
> >           [A] In reality it is much likelier that the vCPU thread simply lost a
> >               race to establish the mapping for the page.
> >
> >      -    Acked-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >      +    Acked-by: James Houghton <jthoughton@google.com>
> >
> >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> >       @@
> >
> > base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3
>
Axel Rasmussen Feb. 16, 2024, 11:40 p.m. UTC | #3
On Fri, Feb 16, 2024 at 12:00 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Thu, Feb 15, 2024 at 11:36 PM Gupta, Pankaj <pankaj.gupta@amd.com> wrote:
> >
> > On 2/16/2024 12:53 AM, Anish Moorthy wrote:
> > > This series adds an option to cause stage-2 fault handlers to
> > > > KVM_EXIT_MEMORY_FAULT when they would otherwise be required to fault in
> > > the userspace mappings. Doing so allows userspace to receive stage-2
> > > faults directly from KVM_RUN instead of through userfaultfd, which
> > > suffers from serious contention issues as the number of vCPUs scales.
> >
> > Thanks for your work!
>
> :D
>
> >
> > > So, this is an alternative approach for userspace like QEMU to do
> > > post-copy live migration using KVM_EXIT_MEMORY_FAULT instead of
> > > userfaultfd, which seems to get slower with more vCPUs.
> > >
> > > Maybe I am missing something here; I'm just curious how a userspace VMM
> > > (e.g. QEMU) would do the memory copy with this approach once the page is
> > > available from the remote host, which was done with UFFDIO_COPY earlier?
>
> This new capability is meant to be used *alongside* userfaultfd during
> post-copy: it's not a replacement. KVM_RUN can generate page faults
> from outside the stage-2 fault handlers (IIUC instruction emulation is
> one source), and these paths are unchanged, so it's important that
> userspace still UFFDIO_REGISTERs KVM's mapping and reads from the UFFD
> to catch these guest accesses. But with the new
> KVM_MEM_EXIT_ON_MISSING memslot flag set, the stage-2 handlers will
> report needing to fault in memory via KVM_EXIT_MEMORY_FAULT instead of
> queuing onto the UFFD.
>
> In the workloads I've tested, the vast majority of guest-generated
> page faults (99%+) come from the stage-2 handlers. So this series
> "solves" the issue of contention on the UFFD file descriptor by
> (mostly) sidestepping it.
>
> As for how userspace actually uses the new functionality: when a vCPU
> thread receives a KVM_EXIT_MEMORY_FAULT for an unfetched page during
> post-copy, it might
>
> (a) Fetch the page.
> (b) Install the page into KVM's mapping via UFFDIO_COPY (a UFFDIO_WAKE
> isn't necessarily needed!).
> (c) Call KVM_RUN to re-enter the guest and retry the access. The
> stage-2 fault handler will fire again but almost certainly won't
> KVM_EXIT_MEMORY_FAULT now (since the UFFDIO_COPY will have mapped the
> page), so the guest can continue.
>
> and userspace can continue using some thread(s) to
>
> (a) Read page faults from the UFFD.
> (b) Install the page using UFFDIO_COPY + UFFDIO_WAKE.
> (c) Go to (a).
>
> to make sure it catches everything. The combination of these two things
> adds up to more performant "uffd-based" postcopy.
>
> I'm of course glossing over some details (e.g., when two vCPU threads
> race to fetch a page, one of them should probably MADV_POPULATE_WRITE
> somehow), but I hope this is helpful. My patch to the KVM demand
> paging selftest might also clarify things a bit [1].

One other small detail is, you can equally use UFFDIO_CONTINUE,
depending on how the rest of the live migration implementation works.
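
To make the COPY/CONTINUE choice concrete, here is a minimal sketch of
how a vCPU thread might install a page without waking userfaultfd
waiters. The helper names (make_copy_req, make_continue_req,
install_page) are hypothetical and not taken from the series; treating
EEXIST as "lost a race, page already mapped" is an assumption modeled on
the selftest's note about suppressed EEXISTs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: no-wake UFFDIO_COPY request for the page containing dst_hva. */
static struct uffdio_copy make_copy_req(uint64_t dst_hva, const void *src,
					uint64_t page_size)
{
	struct uffdio_copy req;

	memset(&req, 0, sizeof(req));
	req.dst = dst_hva & ~(page_size - 1);	/* destination must be page aligned */
	req.src = (uint64_t)(uintptr_t)src;	/* buffer holding the fetched page */
	req.len = page_size;
	req.mode = UFFDIO_COPY_MODE_DONTWAKE;	/* vCPU thread retries via KVM_RUN */
	return req;
}

/* Hypothetical helper: no-wake UFFDIO_CONTINUE request (MINOR-mode registrations). */
static struct uffdio_continue make_continue_req(uint64_t dst_hva,
						uint64_t page_size)
{
	struct uffdio_continue req;

	memset(&req, 0, sizeof(req));
	req.range.start = dst_hva & ~(page_size - 1);
	req.range.len = page_size;
	req.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
	return req;
}

/*
 * Install one page through the uffd. Returns 0 on success or if the page
 * was already mapped (EEXIST, assumed to mean another thread won the race),
 * and a negative errno otherwise.
 */
static int install_page(int uffd, bool use_continue, uint64_t dst_hva,
			const void *src, uint64_t page_size)
{
	int ret;

	if (use_continue) {
		struct uffdio_continue req = make_continue_req(dst_hva, page_size);

		ret = ioctl(uffd, UFFDIO_CONTINUE, &req);
	} else {
		struct uffdio_copy req = make_copy_req(dst_hva, src, page_size);

		ret = ioctl(uffd, UFFDIO_COPY, &req);
	}

	if (ret == 0 || errno == EEXIST)
		return 0;
	return -errno;
}
```

A reader thread servicing the uffd itself would presumably drop the
DONTWAKE mode (or follow up with UFFDIO_WAKE), since a faulting thread
is then actually asleep on the wait queue.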

Really briefly, this series should be viewed as an alternate (and more
scalable) mechanism to find out that a fault occurred. The way
userspace then *resolves* the fault (whether via UFFDIO_COPY or
UFFDIO_CONTINUE) can remain the same as before.
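
As a rough illustration of "two ways to learn about a fault, one way to
resolve it", the sketch below normalizes both notification sources into
a single event that a shared resolver could consume. Everything here is
an assumption for illustration: struct vmm_memslot, struct fault_event
and the helper functions are hypothetical VMM-side names, and the GPA is
passed in as a plain integer (in a real VMM it would come from the
memory-fault exit information in the kvm_run structure).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/userfaultfd.h>

/* Hypothetical VMM-side record of one memslot's GPA -> HVA mapping. */
struct vmm_memslot {
	uint64_t gpa_base;
	uint64_t size;
	uint64_t hva_base;
};

/* Hypothetical normalized fault: what the shared resolver needs to know. */
struct fault_event {
	uint64_t hva;	/* page-aligned host virtual address to populate */
	bool wake;	/* true if a thread is asleep on the uffd wait queue */
};

/* Translate a guest physical address using the VMM's own memslot table. */
static bool gpa_to_hva(const struct vmm_memslot *slots, size_t nr_slots,
		       uint64_t gpa, uint64_t *hva)
{
	for (size_t i = 0; i < nr_slots; i++) {
		if (gpa >= slots[i].gpa_base &&
		    gpa - slots[i].gpa_base < slots[i].size) {
			*hva = slots[i].hva_base + (gpa - slots[i].gpa_base);
			return true;
		}
	}
	return false;
}

/* Source 1: a memory fault exit from KVM_RUN on a vCPU thread. */
static bool event_from_kvm_exit(const struct vmm_memslot *slots,
				size_t nr_slots, uint64_t gpa,
				uint64_t page_size, struct fault_event *ev)
{
	uint64_t hva;

	if (!gpa_to_hva(slots, nr_slots, gpa, &hva))
		return false;
	ev->hva = hva & ~(page_size - 1);
	ev->wake = false;	/* nobody is queued: the vCPU just re-enters KVM_RUN */
	return true;
}

/* Source 2: a page fault message read from the userfaultfd. */
static bool event_from_uffd_msg(const struct uffd_msg *msg,
				uint64_t page_size, struct fault_event *ev)
{
	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return false;
	ev->hva = msg->arg.pagefault.address & ~(page_size - 1);
	ev->wake = true;	/* the faulting thread is blocked and must be woken */
	return true;
}
```

Either event can then be handed to the same fetch-and-install routine;
only the wake policy differs between the two paths.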

>
>
> Please let me know if you have more questions!
>
> [1] https://lore.kernel.org/kvm/1f67639d-c6a2-1f36-b086-eb65fa2ab275@amd.com/T/#m28055e5d708103d126985e38e18b591d535e1e84
>
> > Just trying to understand how this will work for the existing interfaces.
> > Best regards,
> > Pankaj
> >
> > >
> > > Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
> > > demand_paging_test, which demonstrates the scalability improvements:
> > > the following data was collected using [2] on an x86 machine with 256
> > > cores.
> > >
> > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > > 1       150     340
> > > 2       191     477
> > > 4       210     809
> > > 8       155     1239
> > > 16      130     1595
> > > 32      108     2299
> > > 64      86      3482
> > > 128     62      4134
> > > 256     36      4012
> > >
> > > The diff since the last version is small enough that I've attached a
> > > range-diff in the cover letter- hopefully it's useful for review.
> > >
> > > Links
> > > ~~~~~
> > > [1] Original RFC from James Houghton:
> > >      https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > >
> > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> > >      A quick rundown of the new flags (also detailed in later commits)
> > >          -a registers all of guest memory to a single uffd.
> > >          -r species the number of reader threads for polling the uffd.
> > >          -w is what actually enables the new capabilities.
> > >      All data was collected after applying the entire series
> > >
> > > ---
> > >
> > > v7
> > >    - Add comment for the upgrade-to-atomic in __gfn_to_pfn_memslot()
> > >      [James]
> > >    - Expand description for KVM_MEM_GUEST_MEMFD in kvm/api.rst [James]
> > >      and split it off into its own commit [Anish]
> > >    - Update documentation to indicate that KVM_CAP_MEMORY_FAULT_INFO is
> > >      available on arm [James]
> > >    - Expand commit message for the "enable KVM_CAP_MEMORY_FAULT_INFO on
> > >      arm64" commit [Anish]
> > >    - Drop buggy "fast GUP on read faults" patch [Thanks James!]
> > >    - Make KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING mutually exclusive
> > >      [Sean, Oliver]
> > >    - Drop incorrect "Documentation:" from some shortlogs [Sean]
> > >    - Add description for the KVM_EXIT_MEMORY_FAULT RWX patch [Sean]
> > >    - Style issues [Sean]
> > >
> > > v6: https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
> > >    - Rebase onto guest_memfd series [Anish/Sean]
> > >    - Set write fault flag properly in user_mem_abort() [Oliver]
> > >    - Reformat unnecessarily multi-line comments [Sean]
> > >    - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
> > >    - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
> > >    - Remove unnecessary rounding in user_mem_abort() annotation [David]
> > >    - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
> > >      them with the stage-2 fault annotation patches [Sean]
> > >    - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
> > >      add another boolean parameter instead [Sean]
> > >    - Better shortlog for the hva_to_pfn_fast() change [Anish]
> > >
> > > v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
> > >    - Rename APIs (again) [Sean]
> > >    - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
> > >    - Reword hva_to_pfn_fast() change commit message [Sean]
> > >    - Correct style on terminal if statements [Sean]
> > >    - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
> > >    - Add read fault flag for annotated faults [Sean]
> > >    - read/write_guest_page() changes
> > >        - Move the annotations into vcpu wrapper fns [Sean]
> > >        - Reorder parameters [Robert]
> > >    - Rename kvm_populate_efault_info() to
> > >      kvm_handle_guest_uaccess_fault() [Sean]
> > >    - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
> > >    - Correct description of the faults which hva_to_pfn_fast() can now
> > >      resolve [Sean]
> > >    - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
> > >    - Magnanimously accept Sean's rewrite of the handle_error_pfn()
> > >      annotation [Anish]
> > >    - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]
> > >
> > > v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
> > >    - Fix excessive indentation [Robert, Oliver]
> > >    - Calculate final stats when uffd handler fn returns an error [Robert]
> > >    - Remove redundant info from uffd_desc [Robert]
> > >    - Fix various commit message typos [Robert]
> > >    - Add comment about suppressed EEXISTs in selftest [Robert]
> > >    - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
> > >    - Fix some include/logic issues in self test [Robert]
> > >    - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
> > >    - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
> > >    - Drop most of the annotations from v3: see
> > >      https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
> > >    - Remove WARN on bare efaults [Sean, Oliver]
> > >    - Eliminate unnecessary UFFDIO_WAKE call from self test [James]
> > >
> > > v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
> > >    - Rework the implementation to be based on two orthogonal
> > >      capabilities (KVM_CAP_MEMORY_FAULT_INFO and
> > >      KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
> > >    - Change return code of kvm_populate_efault_info [Isaku]
> > >    - Use kvm_populate_efault_info from arm code [Oliver]
> > >
> > > v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/
> > >
> > >      This was a bit of a misfire, as I sent my WIP series on the mailing
> > >      list but was just targeting Sean for some feedback. Oliver Upton and
> > >      Isaku Yamahata ended up discovering the series and giving me some
> > >      feedback anyways, so thanks to them :) In the end, there was enough
> > >      discussion to justify retroactively labeling it as v2, even with the
> > >      limited cc list.
> > >
> > >    - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
> > >    - API changes:
> > >          - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
> > >            KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
> > >            requirement).
> > >          - Switched to memslot flag
> > >    - Take Oliver's simplification to the "allow fast gup for readable
> > >      faults" logic.
> > >    - Slightly redefine the return code of user_mem_abort.
> > >    - Fix documentation errors brought up by Marc
> > >    - Reword commit messages in imperative mood
> > >
> > > v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/
> > >
> > > Anish Moorthy (14):
> > >    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >    KVM: Add function comments for __kvm_read/write_guest_page()
> > >    KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> > >    KVM: Simplify error handling in __gfn_to_pfn_memslot()
> > >    KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
> > >      userspace
> > >    KVM: Add memslot flag to let userspace force an exit on missing hva
> > >      mappings
> > >    KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
> > >      stage-2 fault handler
> > >    KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
> > >      stage-2 fault handler
> > >    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> > >    KVM: selftests: Report per-vcpu demand paging rate from demand paging
> > >      test
> > >    KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
> > >      paging test
> > >    KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
> > >      signal errors via TEST_ASSERT
> > >    KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> > >    KVM: selftests: Handle memory fault exits in demand_paging_test
> > >
> > >   Documentation/virt/kvm/api.rst                |  39 ++-
> > >   arch/arm64/kvm/Kconfig                        |   1 +
> > >   arch/arm64/kvm/arm.c                          |   1 +
> > >   arch/arm64/kvm/mmu.c                          |   7 +-
> > >   arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
> > >   arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
> > >   arch/x86/kvm/Kconfig                          |   1 +
> > >   arch/x86/kvm/mmu/mmu.c                        |   8 +-
> > >   include/linux/kvm_host.h                      |  21 +-
> > >   include/uapi/linux/kvm.h                      |   5 +
> > >   .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
> > >   .../selftests/kvm/access_tracking_perf_test.c |   2 +-
> > >   .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
> > >   .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
> > >   .../testing/selftests/kvm/include/memstress.h |   2 +-
> > >   .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
> > >   tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
> > >   .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
> > >   .../kvm/memslot_modification_stress_test.c    |   2 +-
> > >   .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
> > >   virt/kvm/Kconfig                              |   3 +
> > >   virt/kvm/kvm_main.c                           |  46 ++-
> > >   22 files changed, 453 insertions(+), 172 deletions(-)
> > >
> > > Range-diff against v6:
> > >   1:  2089d8955538 !  1:  063d5d109f34 KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >      @@ Metadata
> > >       Author: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Commit message ##
> > >      -    KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >      +    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >
> > >      -    The current docstring can be read as "atomic -> allowed to sleep," when
> > >      -    in fact the intended statement is "atomic -> NOT allowed to sleep." Make
> > >      -    that clearer in the docstring.
> > >      +    The current description can be read as "atomic -> allowed to sleep,"
> > >      +    when in fact the intended statement is "atomic -> NOT allowed to sleep."
> > >      +    Make that clearer in the docstring.
> > >
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >   2:  36963c6eee29 !  2:  e038fe64f44a KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> > >      @@ Metadata
> > >       Author: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Commit message ##
> > >      -    KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> > >      +    KVM: Add function comments for __kvm_read/write_guest_page()
> > >
> > >           The (gfn, data, offset, len) order of parameters is a little strange
> > >      -    since "offset" applies to "gfn" rather than to "data". Add docstrings to
> > >      -    make things perfectly clear.
> > >      +    since "offset" applies to "gfn" rather than to "data". Add function
> > >      +    comments to make things perfectly clear.
> > >
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >   -:  ------------ >  3:  812a2208da95 KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> > >   3:  4994835c51f5 =  4:  44cec9bf6166 KVM: Simplify error handling in __gfn_to_pfn_memslot()
> > >   4:  3d51224854b1 !  5:  df09c7482fbf KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> > >      @@ Metadata
> > >        ## Commit message ##
> > >           KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> > >
> > >      +    kvm_prepare_memory_fault_exit() already takes parameters describing the
> > >      +    RWX-ness of the relevant access but doesn't actually do anything with
> > >      +    them. Define and use the flags necessary to pass this information on to
> > >      +    userspace.
> > >      +
> > >           Suggested-by: Sean Christopherson <seanjc@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >   5:  6bab46398020 <  -:  ------------ KVM: Try using fast GUP to resolve read faults
> > >   6:  556e7079c419 !  6:  6a6993bda462 KVM: Add memslot flag to let userspace force an exit on missing hva mappings
> > >      @@ Commit message
> > >
> > >           Suggested-by: James Houghton <jthoughton@google.com>
> > >           Suggested-by: Sean Christopherson <seanjc@google.com>
> > >      -    Reviewed-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Documentation/virt/kvm/api.rst ##
> > >       @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
> > >      -   /* for kvm_userspace_memory_region::flags */
> > >          #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> > >          #define KVM_MEM_READONLY     (1UL << 1)
> > >      -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > >      +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > >       +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > >
> > >        This ioctl allows the user to create, modify or delete a guest physical
> > >      @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
> > >        be identical.  This allows large pages in the guest to be backed by large
> > >        pages in the host.
> > >
> > >      --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > >      --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > >      +-The flags field supports three flags
> > >       +The flags field supports four flags
> > >      -+
> > >      -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> > >      +
> > >      + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> > >        writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > >      --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > >      -+use it.
> > >      -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > >      - to make a new slot read-only.  In this case, writes to this memory will be
> > >      +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
> > >        posted to userspace as KVM_EXIT_MMIO exits.
> > >      -+3.  KVM_MEM_GUEST_MEMFD
> > >      + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
> > >      + incompatible with KVM_SET_USER_MEMORY_REGION.
> > >       +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
> > >
> > >        When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > >        the memory region are automatically reflected into the guest.  For example, an
> > >      +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
> > >      + was a load or a store, instruction abort if it was an instruction
> > >      + fetch) is injected in the guest.
> > >      +
> > >      ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
> > >      ++exclusive.
> > >      ++
> > >      + 4.36 KVM_SET_TSS_ADDR
> > >      + ---------------------
> > >      +
> > >       @@ Documentation/virt/kvm/api.rst: error/annotated fault.
> > >
> > >        See KVM_EXIT_MEMORY_FAULT for more information.
> > >      @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
> > >
> > >        /* for KVM_IRQ_LINE */
> > >        struct kvm_irq_level {
> > >      -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
> > >      +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
> > >        #define KVM_CAP_MEMORY_ATTRIBUTES 233
> > >        #define KVM_CAP_GUEST_MEMFD 234
> > >        #define KVM_CAP_VM_TYPES 235
> > >       +#define KVM_CAP_EXIT_ON_MISSING 236
> > >
> > >      - #ifdef KVM_CAP_IRQ_ROUTING
> > >      -
> > >      + struct kvm_irq_routing_irqchip {
> > >      +        __u32 irqchip;
> > >
> > >        ## virt/kvm/Kconfig ##
> > >       @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
> > >      @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
> > >       +
> > >               if (mem->flags & ~valid_flags)
> > >                       return -EINVAL;
> > >      ++       else if ((mem->flags & KVM_MEM_READONLY) &&
> > >      ++                (mem->flags & KVM_MEM_EXIT_ON_MISSING))
> > >      ++               return -EINVAL;
> > >
> > >      +        return 0;
> > >      + }
> > >       @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > >
> > >        kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > >      @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
> > >                       writable = NULL;
> > >               }
> > >
> > >      -+       if (!atomic && can_exit_on_missing
> > >      -+           && kvm_is_slot_exit_on_missing(slot)) {
> > >      ++       /* When the slot is exit-on-missing (and when we should respect that)
> > >      ++        * set atomic=true to prevent GUP from faulting in the userspace
> > >      ++        * mappings.
> > >      ++        */
> > >      ++       if (!atomic && can_exit_on_missing &&
> > >      ++           kvm_is_slot_exit_on_missing(slot)) {
> > >       +               atomic = true;
> > >       +               if (async) {
> > >       +                       *async = false;
> > >   7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
> > >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> > >
> > >        ## arch/x86/kvm/Kconfig ##
> > >       @@ arch/x86/kvm/Kconfig: config KVM
> > >      -        select INTERVAL_TREE
> > >      +        select KVM_VFIO
> > >               select HAVE_KVM_PM_NOTIFIER if PM
> > >               select KVM_GENERIC_HARDWARE_ENABLING
> > >       +        select HAVE_KVM_EXIT_ON_MISSING
> > >   8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
> > >   -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
> > >   9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> > >      @@ Metadata
> > >       Author: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Commit message ##
> > >      -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> > >      +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> > >
> > >           Prevent the stage-2 fault handler from faulting in pages when
> > >           KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
> > >      -    calls to check the memslot flag.
> > >      -
> > >      -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
> > >      -    when the stage-2 handler cannot resolve the pfn for a fault. With
> > >      -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
> > >      -    faults as vCPU exits, which userspace can attempt to resolve without
> > >      -    terminating the guest.
> > >      +    call to check the memslot flag. This effects the delivery of stage-2
> > >      +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
> > >      +    can attempt to resolve without terminating the guest.
> > >
> > >           Delivering stage-2 faults to userspace in this way sidesteps the
> > >           significant scalabiliy issues associated with using userfaultfd for the
> > >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> > >
> > >        ## arch/arm64/kvm/Kconfig ##
> > >       @@ arch/arm64/kvm/Kconfig: menuconfig KVM
> > >      +        select SCHED_INFO
> > >               select GUEST_PERF_EVENTS if PERF_EVENTS
> > >      -        select INTERVAL_TREE
> > >               select XARRAY_MULTI
> > >       +        select HAVE_KVM_EXIT_ON_MISSING
> > >               help
> > >      @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
> > >               if (pfn == KVM_PFN_ERR_HWPOISON) {
> > >                       kvm_send_hwpoison_signal(hva, vma_shift);
> > >                       return 0;
> > >      -        }
> > >      --       if (is_error_noslot_pfn(pfn))
> > >      -+       if (is_error_noslot_pfn(pfn)) {
> > >      -+               kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> > >      -+                                             write_fault, exec_fault, false);
> > >      -                return -EFAULT;
> > >      -+       }
> > >      -
> > >      -        if (kvm_is_device_pfn(pfn)) {
> > >      -                /*
> > > 10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
> > > 11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
> > >      @@ Commit message
> > >           configuring the number of reader threads per UFFD as well: add the "-r"
> > >           flag to do so.
> > >
> > >      -    Acked-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >      +    Acked-by: James Houghton <jthoughton@google.com>
> > >
> > >        ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
> > >       @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
> > > 12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
> > >      @@ Commit message
> > >           [1] Single-vCPU performance does suffer somewhat.
> > >           [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
> > >
> > >      -    Acked-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >      +    Acked-by: James Houghton <jthoughton@google.com>
> > >
> > >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> > >       @@
> > > 13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> > > 14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
> > >      @@ Commit message
> > >
> > >           Demonstrate a (very basic) scheme for supporting memory fault exits.
> > >
> > >      -    >From the vCPU threads:
> > >      +    From the vCPU threads:
> > >           1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
> > >              with the purpose of establishing the absent mappings. Do so with
> > >              wake_waiters=false to avoid serializing on the userfaultfd wait queue
> > >      @@ Commit message
> > >           [A] In reality it is much likelier that the vCPU thread simply lost a
> > >               race to establish the mapping for the page.
> > >
> > >      -    Acked-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >      +    Acked-by: James Houghton <jthoughton@google.com>
> > >
> > >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> > >       @@
> > >
> > > base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3
> >
Gupta, Pankaj Feb. 21, 2024, 7:35 a.m. UTC | #4
>>> On 2/16/2024 12:53 AM, Anish Moorthy wrote:
>>>> This series adds an option to cause stage-2 fault handlers to
>>>> KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
>>>> the userspace mappings. Doing so allows userspace to receive stage-2
>>>> faults directly from KVM_RUN instead of through userfaultfd, which
>>>> suffers from serious contention issues as the number of vCPUs scales.
>>>
>>> Thanks for your work!
>>
>> :D
>>
>>>
>>> So, this is an alternative approach for userspace like QEMU to do post-copy
>>> live migration using KVM_MEMORY_FAULT_EXIT instead of userfaultfd, which
>>> seems slower with more vCPUs.
>>>
>>> Maybe I am missing something here; I'm just curious how a userspace VMM,
>>> e.g. QEMU, would do the memory copy with this approach once the page is
>>> available from the remote host, which was done with UFFDIO_COPY earlier?
>>
>> This new capability is meant to be used *alongside* userfaultfd during
>> post-copy: it's not a replacement. KVM_RUN can generate page faults
>> from outside the stage-2 fault handlers (IIUC instruction emulation is
>> one source), and these paths are unchanged: so it's important that
>> userspace still UFFDIO_REGISTERs KVM's mapping and reads from the UFFD
>> to catch these guest accesses. But with the new
>> KVM_MEM_EXIT_ON_MISSING memslot flag set, the stage-2 handlers will
>> report needing to fault in memory via KVM_MEMORY_FAULT_EXIT instead of
>> queuing onto the UFFD.
>>
>> In the workloads I've tested, the vast majority of guest-generated
>> page faults (99%+) come from the stage-2 handlers. So this series
>> "solves" the issue of contention on the UFFD file descriptor by
>> (mostly) sidestepping it.
>>
>> As for how userspace actually uses the new functionality: when a vCPU
>> thread receives a KVM_MEMORY_FAULT_EXIT for an unfetched page during
>> post-copy it might
>>
>> (a) Fetch the page
>> (b) Install the page into KVM's mapping via UFFDIO_COPY (don't
>> necessarily need to UFFDIO_WAKE!)
>> (c) Call KVM_RUN to re-enter the guest and retry the access. The
>> stage-2 fault handler will fire again but almost certainly won't
>> KVM_MEMORY_FAULT_EXIT now (since the UFFDIO_COPY will have mapped the
>> page), so the guest can continue.
>>
>> and userspace can continue using some thread(s) to
>>
>> (a) Read page faults from the UFFD.
>> (b) Install the page using UFFDIO_COPY + UFFDIO_WAKE
>> (c) goto (a)
>>
>> to make sure it catches everything. The combination of these two things
>> adds up to more performant "uffd-based" postcopy.
>>
>> I'm of course skimming over some details (e.g.: when two vCPU threads
>> race to fetch a page one of them should probably MADV_POPULATE_WRITE
>> somehow), but I hope this is helpful. My patch to the KVM demand
>> paging self test might also clarify things a bit [1].
> 
> One other small detail is, you can equally use UFFDIO_CONTINUE,
> depending on how the rest of the live migration implementation works.
> 
> Really briefly, this series should be viewed as an alternate (and more
> scalable) mechanism to find out that a fault occurred. The way
> userspace then *resolves* the fault (whether via UFFDIO_COPY or
> UFFDIO_CONTINUE) can remain the same as before.
> 

That clarifies. Thank you!

Best regards,
Pankaj
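
The two loops Anish describes in the quoted reply (vCPU threads resolving
memory fault exits themselves, plus reader threads still draining the UFFD)
can be sketched as a small simulation. This is only an illustration of the
control flow, not code from the series: it makes no KVM or userfaultfd calls,
and every name in it (sim_vm, sim_run, sim_install_page, sim_vcpu_access,
sim_reader_resolve) is hypothetical. The one behaviour borrowed from the real
interface is that installing an already-present page fails with EEXIST, which
the sketch treats as a lost race rather than an error.

```c
/* Simulation only: no real KVM_RUN or UFFDIO_* ioctls are issued. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_NPAGES 8

/* Hypothetical stand-ins for the outcome of one KVM_RUN call. */
enum sim_exit { SIM_EXIT_MEMORY_FAULT, SIM_EXIT_DONE };

struct sim_vm {
	bool mapped[SIM_NPAGES]; /* page present in the simulated mapping */
	int fetches;             /* pages fetched from the migration source */
};

/*
 * Stand-in for UFFDIO_COPY/UFFDIO_CONTINUE: install the page, or report
 * -EEXIST if some other thread already installed it.
 */
int sim_install_page(struct sim_vm *vm, size_t page)
{
	if (vm->mapped[page])
		return -EEXIST;
	vm->mapped[page] = true;
	return 0;
}

/*
 * Stand-in for KVM_RUN with the exit-on-missing memslot flag set: the guest
 * touches `page`, and the call exits to userspace if the page is absent.
 */
enum sim_exit sim_run(const struct sim_vm *vm, size_t page)
{
	return vm->mapped[page] ? SIM_EXIT_DONE : SIM_EXIT_MEMORY_FAULT;
}

/*
 * vCPU thread: on each memory fault exit, (a) fetch the page, (b) install
 * it, (c) re-enter the guest. Returns the number of exits handled, or a
 * negative errno on an unexpected install failure.
 */
int sim_vcpu_access(struct sim_vm *vm, size_t page)
{
	int exits = 0;

	while (sim_run(vm, page) == SIM_EXIT_MEMORY_FAULT) {
		int ret;

		exits++;
		vm->fetches++;                    /* (a) fetch the page */
		ret = sim_install_page(vm, page); /* (b) install, no wake needed */
		if (ret && ret != -EEXIST)
			return ret;
		/* (c) loop around: re-enter the guest and retry the access */
	}
	return exits;
}

/*
 * UFFD reader thread: resolve one fault read from the UFFD. A real reader
 * would also wake the waiters; losing the race to a vCPU thread is benign.
 */
int sim_reader_resolve(struct sim_vm *vm, size_t page)
{
	int ret;

	vm->fetches++;
	ret = sim_install_page(vm, page);
	return ret == -EEXIST ? 0 : ret;
}
```

In this model a first access to an absent page costs exactly one exit, and a
second access to the same page costs none, whichever kind of thread installed
it.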
Sean Christopherson April 10, 2024, 12:19 a.m. UTC | #5
On Thu, 15 Feb 2024 23:53:51 +0000, Anish Moorthy wrote:
> This series adds an option to cause stage-2 fault handlers to
> KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
> the userspace mappings. Doing so allows userspace to receive stage-2
> faults directly from KVM_RUN instead of through userfaultfd, which
> suffers from serious contention issues as the number of vCPUs scales.
> 
> Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
> demand_paging_test, which demonstrates the scalability improvements:
> the following data was collected using [2] on an x86 machine with 256
> cores.
> 
> [...]

Applied 1, 2, and 4 to kvm-x86 generic, and 10-12 to kvm-x86 selftests.

I skipped all KVM_CAP_EXIT_ON_MISSING as per our decision to hold off until we
see the KVM userfault stuff.  I skipped the docs patch because it would require
more massaging than I wanted to do when applying.  And lastly, I skipped the
"Add memslot_flags parameter to memstress_create_vm()" patch because it would be
dead code without the exit-on-missing usage.

Please take a look at the selftests commits in particular, as I did a decent
amount of massaging when applying.

Thanks!

[01/14] KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
        https://github.com/kvm-x86/linux/commit/ed2f049fc144
[02/14] KVM: Add function comments for __kvm_read/write_guest_page()
        https://github.com/kvm-x86/linux/commit/a3bd2f7ead6d
...

[04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot()
        https://github.com/kvm-x86/linux/commit/f588557ac4ac

...

[10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test
        https://github.com/kvm-x86/linux/commit/2ca76c12c48b
[11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
        https://github.com/kvm-x86/linux/commit/df4ec5aada9d
[12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads
        https://github.com/kvm-x86/linux/commit/0cba6442e9e2

--
https://github.com/kvm-x86/linux/tree/next
Anish Moorthy May 7, 2024, 5:38 p.m. UTC | #6
On Tue, Apr 9, 2024 at 5:21 PM Sean Christopherson <seanjc@google.com> wrote:
>
> I skipped all KVM_CAP_EXIT_ON_MISSING as per our decision to hold off until we
> see the KVM userfault stuff.  I skipped the docs patch because it would require
> more massaging than I wanted to do when applying.  And lastly, I skipped the
> "Add memslot_flags parameter to memstress_create_vm()" patch because it would be
> dead code without the exit-on-missing usage.
>
> Please take a look at the selftests commits in particular, as I did a decent
> amount of massaging when applying.

Thanks for cleaning up the commits, and for all the help along the way. I
just got around to checking the selftest commits, and they look good
to me.