[v6,00/14] Improve KVM + userfaultfd performance via KVM_MEMORY_FAULT_EXITs on stage-2 faults

Message ID 20231109210325.3806151-1-amoorthy@google.com (mailing list archive)

Message

Anish Moorthy Nov. 9, 2023, 9:03 p.m. UTC
This series adds an option that makes stage-2 fault handlers exit to
userspace with KVM_EXIT_MEMORY_FAULT when they would otherwise have to
fault in the userspace mappings. Userspace can then receive stage-2
faults directly from KVM_RUN instead of through userfaultfd, which
suffers from serious contention as the number of vCPUs scales.

Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
demand_paging_test, which demonstrates the scalability improvements:
the following data was collected using [2] on an x86 machine with 256
cores.

vCPUs   Avg. paging rate (w/o new caps)   Avg. paging rate (w/ new caps)
1       150                               340
2       191                               477
4       210                               809
8       155                               1239
16      130                               1595
32      108                               2299
64      86                                3482
128     62                                4134
256     36                                4012
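
To make the scaling trend easier to read, the rows above can be reduced
to a per-row ratio of the two columns. The following is only a minimal
sketch: the numbers are copied from the table, while the names RATES and
speedups() are illustrative and not part of the series.

```python
# Paging rates copied from the table above:
# (vCPUs, rate without the new caps, rate with the new caps).
RATES = [
    (1, 150, 340),
    (2, 191, 477),
    (4, 210, 809),
    (8, 155, 1239),
    (16, 130, 1595),
    (32, 108, 2299),
    (64, 86, 3482),
    (128, 62, 4134),
    (256, 36, 4012),
]


def speedups(rates):
    """Return {vcpus: rate_with_caps / rate_without_caps} for each row."""
    return {vcpus: with_caps / without for vcpus, without, with_caps in rates}


# Print one line per row, e.g. the ratio at 1 vCPU and at 256 vCPUs.
for vcpus, ratio in speedups(RATES).items():
    print(f"{vcpus:>4} vCPUs: {ratio:6.1f}x")
```

The ratio grows with the vCPU count because the rate without the new
caps falls as vCPUs are added while the rate with them keeps rising
until 128 vCPUs.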

TODO
~~~~
No known issues/things to resolve. However, documentation/commit logs
merit a close look given how much feedback I've received on those :/

Base Commit
~~~~~~~~~~~
This series is based off of kvm/next (45b890f7689e) with v14 of the
guest_memfd series applied, with some fixes on top [3].

Links
~~~~~
[1] Original RFC from James Houghton:
    https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/

[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits):
        -a registers all of guest memory to a single uffd.
        -r specifies the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying the entire series.

[3] https://lore.kernel.org/kvm/20231105163040.14904-1-pbonzini@redhat.com/T/#m56361120ee1dd5265a5710e6a814906cda8e1020
    The following fixes are required to get the KVM selftests to compile
    on arm64:
    - https://lore.kernel.org/kvm/20231108233723.3380042-1-amoorthy@google.com/
    - https://lore.kernel.org/kvm/affca7a8-116e-4b0f-9edf-6cdc05ba65ca@redhat.com/
    - Unguarding the definitions of MEM_REGION_GPA/SLOT in set_memory_region_test
      (not sure if this is the "right" fix for that test, but it compiles)

---

v6
  - Rebase onto guest_memfd series [Anish/Sean]
  - Set write fault flag properly in user_mem_abort() [Oliver]
  - Reformat unnecessarily multi-line comments [Sean]
  - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
  - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
  - Remove unnecessary rounding in user_mem_abort() annotation [David]
  - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
    them with the stage-2 fault annotation patches [Sean]
  - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
    add another boolean parameter instead [Sean]
  - Better shortlog for the hva_to_pfn_fast() change [Anish]

v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
  - Rename APIs (again) [Sean]
  - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
  - Reword hva_to_pfn_fast() change commit message [Sean]
  - Correct style on terminal if statements [Sean]
  - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
  - Add read fault flag for annotated faults [Sean]
  - read/write_guest_page() changes
      - Move the annotations into vcpu wrapper fns [Sean]
      - Reorder parameters [Robert]
  - Rename kvm_populate_efault_info() to
    kvm_handle_guest_uaccess_fault() [Sean]
  - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
  - Correct description of the faults which hva_to_pfn_fast() can now
    resolve [Sean]
  - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
  - Magnanimously accept Sean's rewrite of the handle_error_pfn()
    annotation [Anish]
  - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]

v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
  - Fix excessive indentation [Robert, Oliver]
  - Calculate final stats when uffd handler fn returns an error [Robert]
  - Remove redundant info from uffd_desc [Robert]
  - Fix various commit message typos [Robert]
  - Add comment about suppressed EEXISTs in selftest [Robert]
  - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
  - Fix some include/logic issues in self test [Robert]
  - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
  - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
  - Drop most of the annotations from v3: see
    https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
  - Remove WARN on bare efaults [Sean, Oliver]
  - Eliminate unnecessary UFFDIO_WAKE call from self test [James]

v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series to the mailing
    list but was only targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyway, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_X86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (14):
  KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic'
    parameter
  KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
  KVM: Simplify error handling in __gfn_to_pfn_memslot()
  KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
    userspace
  KVM: Try using fast GUP to resolve read faults
  KVM: Add memslot flag to let userspace force an exit on missing hva
    mappings
  KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
    stage-2 fault handler
  KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from
    stage-2 fault-handler
  KVM: selftests: Report per-vcpu demand paging rate from demand paging
    test
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  33 +-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |   7 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |   8 +-
 include/linux/kvm_host.h                      |  21 +-
 include/uapi/linux/kvm.h                      |   5 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  46 ++-
 22 files changed, 444 insertions(+), 175 deletions(-)

Comments

Sean Christopherson Feb. 7, 2024, 3:46 p.m. UTC | #1
On Thu, Nov 09, 2023, Anish Moorthy wrote:
> Base Commit
> ~~~~~~~~~~~
> This series is based off of kvm/next (45b890f7689e) with v14 of the
> guest_memfd series applied, with some fixes on top [3].

Please use `--base`.  I have gotten spoiled by git appending the object ID at the
bottom, and get annoyed every time I have to go spelunking for the base :-)

Also, in the future, when posting a series that has multiple dependencies, it is
*very* helpful to reviewers and maintainers to provide a full branch somewhere,
e.g. on github, gitlab, etc.  That way someone that wants to actually test things
doesn't need to hunt down and splice together a bunch of different assets.

From Documentation/process/maintainer-kvm-x86.rst:

Git Base
~~~~~~~~
If you are using git version 2.9.0 or later (Googlers, this is all of you!),
use ``git format-patch`` with the ``--base`` flag to automatically include the
base tree information in the generated patches.

Note, ``--base=auto`` works as expected if and only if a branch's upstream is
set to the base topic branch, e.g. it will do the wrong thing if your upstream
is set to your personal repository for backup purposes.  An alternative "auto"
solution is to derive the names of your development branches based on their
KVM x86 topic, and feed that into ``--base``.  E.g. ``x86/pmu/my_branch_name``,
and then write a small wrapper to extract ``pmu`` from the current branch name
to yield ``--base=x/pmu``, where ``x`` is whatever name your repository uses to
track the KVM x86 remote.
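
The "small wrapper" described above might look like the following
minimal Python sketch. The helper names (base_from_branch,
current_branch, format_patch_command) and the default remote name "x"
are assumptions made for illustration; they are not part of git or of
any KVM tooling. Only ``git rev-parse --abbrev-ref HEAD`` and
``git format-patch --base=<commit>`` are real git invocations.

```python
import subprocess


def base_from_branch(branch, remote="x"):
    """Map a branch such as 'x86/pmu/my_branch_name' to 'x/pmu'.

    'remote' is whatever name the local repository uses to track the
    KVM x86 remote (hypothetical default: 'x').
    """
    parts = branch.split("/")
    if len(parts) < 3 or parts[0] != "x86" or not parts[1]:
        raise ValueError(f"branch {branch!r} is not of the form x86/<topic>/<name>")
    return f"{remote}/{parts[1]}"


def current_branch():
    """Return the name of the currently checked-out branch."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def format_patch_command(branch, remote="x", extra_args=()):
    """Build the git format-patch argument list with a derived --base."""
    base = base_from_branch(branch, remote)
    return ["git", "format-patch", f"--base={base}", *extra_args]
```

Inside a repository, the command could then be run with something like
``subprocess.run(format_patch_command(current_branch()), check=True)``,
passing the revision range through ``extra_args``.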

> Anish Moorthy (14):
>   KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic'
>     parameter
>   KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
>   KVM: Simplify error handling in __gfn_to_pfn_memslot()
>   KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
>     userspace
>   KVM: Try using fast GUP to resolve read faults
>   KVM: Add memslot flag to let userspace force an exit on missing hva
>     mappings
>   KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
>     stage-2 fault handler
>   KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
>   KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from
>     stage-2 fault-handler
>   KVM: selftests: Report per-vcpu demand paging rate from demand paging
>     test
>   KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
>     paging test
>   KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
>     signal errors via TEST_ASSERT
>   KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
>   KVM: selftests: Handle memory fault exits in demand_paging_test
> 
>  Documentation/virt/kvm/api.rst                |  33 +-
>  arch/arm64/kvm/Kconfig                        |   1 +
>  arch/arm64/kvm/arm.c                          |   1 +
>  arch/arm64/kvm/mmu.c                          |   7 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
>  arch/x86/kvm/Kconfig                          |   1 +
>  arch/x86/kvm/mmu/mmu.c                        |   8 +-
>  include/linux/kvm_host.h                      |  21 +-
>  include/uapi/linux/kvm.h                      |   5 +
>  .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
>  .../selftests/kvm/access_tracking_perf_test.c |   2 +-
>  .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
>  .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
>  .../testing/selftests/kvm/include/memstress.h |   2 +-
>  .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
>  tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
>  .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
>  .../kvm/memslot_modification_stress_test.c    |   2 +-
>  .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
>  virt/kvm/Kconfig                              |   3 +
>  virt/kvm/kvm_main.c                           |  46 ++-
>  22 files changed, 444 insertions(+), 175 deletions(-)

A few nits throughout, but this is looking good for 6.9.

Oliver / Marc,

Any objection to taking this through kvm-x86? (when you feel it's ready, obviously)
My plan is to put it in a dedicated topic branch, with a massaged cover letter as
the tag used for the pull request so that we can capture the motivation/benefits.

Anish Moorthy Feb. 9, 2024, 4 p.m. UTC | #2
On Wed, Feb 7, 2024 at 7:46 AM Sean Christopherson <seanjc@google.com> wrote:
>
> A few nits throughout, but this is looking good for 6.9.
>
> Oliver / Marc,
>
> Any objection to taking this through kvm-x86? (when you feel it's ready, obviously)
> My plan is to put it in a dedicated topic branch, with a massaged cover letter as
> the tag used for the pull request so that we can capture the motivation/benefits.

Oliver and Marc,

I have a v7 ready based on the feedback I've received so far. Please
let me know if I should send it or wait for you to take a look at this
version first.

On the one hand I obviously want to incorporate any feedback you have
for the next version, but on the other I suspect that if/when you look
at this you'll want to see a version with as few (known) flaws as
possible.