[v3,00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.

Message ID: 20230412213510.1220557-1-amoorthy@google.com

Message

Anish Moorthy April 12, 2023, 9:34 p.m. UTC
Due to serialization on internal wait queues, userfaultfd can be quite
slow at delivering faults to userspace when many vCPUs fault on the same
VMA/uffd: this problem only worsens as the number of vCPUs increases.
This series allows page faults encountered in KVM_RUN to bypass
userfaultfd (KVM_CAP_ABSENT_MAPPING_FAULT) and be delivered directly via
VM exit to the faulting vCPU (KVM_CAP_MEMORY_FAULT_INFO), allowing much
higher page-in rates during uffd-based postcopy.

As a first step, KVM_CAP_MEMORY_FAULT_INFO is introduced. This
capability is meant to deliver useful information to userspace (i.e. the
problematic range of guest physical memory) when a vCPU fails a guest
memory access. KVM_RUN currently just returns -1 and sets errno=EFAULT
in response to these failed accesses: the new capability will cause it
to also fill the kvm_run struct with an exit reason of
KVM_EXIT_MEMORY_FAULT and populate the memory_fault field with the
faulting range.

Upon receiving an annotated EFAULT, userspace may take appropriate
action to resolve the failed access. For instance, this might involve a
UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
migration postcopy.
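
For illustration, here is a rough sketch of how a VMM's vCPU loop might
consume the new exit during uffd-based postcopy. The memory_fault field
names (gpa/len) and the single-memslot address translation below are
assumptions made for the sake of the example; the api.rst patch in this
series defines the actual layout.

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/userfaultfd.h>

/* Run the vCPU once; on an annotated EFAULT, resolve the fault from this
 * thread with UFFDIO_CONTINUE (MINOR mode) and let the caller retry. */
static int run_vcpu_once(int vcpu_fd, int uffd, struct kvm_run *run,
			 uint64_t hva_base, uint64_t gpa_base)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret < 0 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* Illustrative field names: the real struct is defined by
		 * the KVM_CAP_MEMORY_FAULT_INFO patch. Assumes one linear
		 * memslot mapping gpa_base to hva_base. */
		struct uffdio_continue cont = {
			.range.start = hva_base +
				       (run->memory_fault.gpa - gpa_base),
			.range.len   = run->memory_fault.len,
		};

		/* Ensure the page contents are in place (e.g. fetched from
		 * the migration source) before CONTINUEing. */
		if (ioctl(uffd, UFFDIO_CONTINUE, &cont) && errno != EEXIST)
			return -1;
		return 0;	/* fault resolved, retry KVM_RUN */
	}
	return ret;
}

Un-annotated EFAULTs, or runs without the new capabilities, are seen by the
same loop as the bare -1/EFAULT that KVM_RUN returns today.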

KVM_CAP_MEMORY_FAULT_INFO comes with two important caveats, one public
and another internal.

1. Its implementation is incomplete: userspace may still receive
   un-annotated EFAULTs (exit reason != KVM_EXIT_MEMORY_FAULT) and must
   be able to handle these, although such cases will be annotated as the
   maintainers learn of/identify them.

2. The implementation strategy given in this series, which is basically
   to fill the kvm_run.memory_fault field whenever a vCPU fails a guest
   memory access (even if the resulting EFAULT might not be returned to
   userspace from KVM_RUN), is not without risk: some safety measures
   are taken, but they do not ensure total correctness.

   For example, if there is an existing path in KVM_RUN which causes a
   vCPU to (1) populate the kvm_run struct in preparation for an exit to
   userspace, then (2) try and fail to access guest memory for some
   reason but ignore the result of the access, and then (3) complete the
   exit to userspace, then the contents of the kvm_run struct written in
   (1) could be lost.

   Another example: if KVM_RUN fails a guest memory access for which the
   EFAULT is annotated but does not return that EFAULT to userspace, and
   then later encounters another *un*annotated EFAULT which *is* returned
   to userspace, the kvm_run.memory_fault field read by userspace will
   correspond to the first EFAULT, not the second.

   The discussion of this topic, and of the alternative approach (filling
   the EFAULT info only in those cases where KVM_RUN immediately returns
   to userspace), can be found in [3].

KVM_CAP_ABSENT_MAPPING_FAULT is introduced next (and is, I should note,
an idea proposed by James Houghton in [1] :). This capability causes
KVM_RUN to error with errno=EFAULT when it encounters a page fault for
which the userspace page tables do not contain present mappings. When
combined with KVM_CAP_MEMORY_FAULT_INFO, this capability allows KVM to
deliver information on page faults directly to the involved vCPU thread,
thus bypassing the userfaultfd wait queue and its related contention.

As a side note, KVM_CAP_ABSENT_MAPPING_FAULT prevents KVM from
generating async page faults. For this reason, hypervisors using it to
improve postcopy performance will likely want to disable it at the end
of postcopy.
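
As a sketch of intended usage (assuming, purely for illustration, that both
capabilities are enabled per-VM via KVM_ENABLE_CAP; the individual patches
define the actual enablement mechanism and any arguments):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int try_enable_cap(int vm_fd, unsigned int cap)
{
	struct kvm_enable_cap enable = { .cap = cap };

	/* Probe first so older kernels can be handled gracefully. */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, cap) <= 0)
		return -1;
	return ioctl(vm_fd, KVM_ENABLE_CAP, &enable);
}

/* At the start of postcopy:
 *     try_enable_cap(vm_fd, KVM_CAP_MEMORY_FAULT_INFO);
 *     try_enable_cap(vm_fd, KVM_CAP_ABSENT_MAPPING_FAULT);
 * and stop relying on KVM_CAP_ABSENT_MAPPING_FAULT once postcopy ends,
 * so that async page faults are generated again. */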

KVM's demand paging self test is extended to demonstrate the performance
benefits of using the two new capabilities to bypass the userfaultfd
wait queue. The performance samples below (rates in thousands of
pages/s, n = 5), were generated using [2] on an x86 machine with 256
cores.

vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
1       150     340
2       191     477
4       210     809
8       155     1239
16      130     1595
32      108     2299
64      86      3482
128     62      4134
256     36      4012

[1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits)
        -a registers all of guest memory to a single uffd.
        -r specifies the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying both the entire series and
    the following bugfix:
    https://lore.kernel.org/kvm/20230223001805.2971237-1-amoorthy@google.com/#r
[3] https://lore.kernel.org/kvm/ZBTgnjXJvR8jtc4i@google.com/

---

v3
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_ABSENT_MAPPING_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series on the mailing
    list but was just targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyways, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_X86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (22):
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: Allow hva_pfn_fast() to resolve read-only faults.
  KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN  at the start of
    KVM_RUN
  KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  KVM: Add docstrings to __kvm_write_guest_page() and
    __kvm_read_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_map()
  KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault()
  KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch()
  KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault()
  KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page()
  KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing()
  KVM: x86: Annotate -EFAULTs from direct_map()
  KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT
  KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort()
  KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  66 ++++-
 arch/arm64/kvm/arm.c                          |   2 +
 arch/arm64/kvm/mmu.c                          |  18 +-
 arch/x86/kvm/hyperv.c                         |  14 +-
 arch/x86/kvm/mmu/mmu.c                        |  39 ++-
 arch/x86/kvm/svm/sev.c                        |   1 +
 arch/x86/kvm/x86.c                            |   7 +-
 include/linux/kvm_host.h                      |  19 ++
 include/uapi/linux/kvm.h                      |  18 ++
 tools/include/uapi/linux/kvm.h                |  12 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 242 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  18 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 158 +++++++-----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 virt/kvm/kvm_main.c                           |  75 +++++-
 20 files changed, 551 insertions(+), 154 deletions(-)


base-commit: d8708b80fa0e6e21bc0c9e7276ad0bccef73b6e7

Comments

Peter Xu April 19, 2023, 7:55 p.m. UTC | #1
Hi, Anish,

On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> KVM's demand paging self test is extended to demonstrate the performance
> benefits of using the two new capabilities to bypass the userfaultfd
> wait queue. The performance samples below (rates in thousands of
> pages/s, n = 5), were generated using [2] on an x86 machine with 256
> cores.
> 
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> 1       150     340
> 2       191     477
> 4       210     809
> 8       155     1239
> 16      130     1595
> 32      108     2299
> 64      86      3482
> 128     62      4134
> 256     36      4012

The number looks very promising.  Though..

> 
> [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
>     A quick rundown of the new flags (also detailed in later commits)
>         -a registers all of guest memory to a single uffd.

... this is the worst case scenario.  I'd say it's slightly unfair to
compare by first introducing a bottleneck then compare with it. :)

Jokes aside: I'd think it'll make more sense if such a performance solution
will be measured on real systems showing real benefits, because so far it's
still not convincing enough if it's only with the test especially with only
one uffd.

I don't remember whether I used to discuss this with James before, but..

I know that having multiple uffds in productions also means scattered guest
memory and scattered VMAs all over the place.  However split the guest
large mem into at least a few (or even tens of) VMAs may still be something
worth trying?  Do you think that'll already solve some of the contentions
on userfaultfd, either on the queue or else?

With a bunch of VMAs and userfaultfds (paired with uffd fault handler
threads, totally separate uffd queues), I'd expect to some extent other
things can pop up already, e.g., the network bandwidth, without teaching
each vcpu thread to report uffd faults themselves.

These are my pure imaginations though, I think that's also why it'll be
great if such a solution can be tested more or less on a real migration
scenario to show its real benefits.

Thanks,
Axel Rasmussen April 19, 2023, 8:15 p.m. UTC | #2
On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Anish,
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > KVM's demand paging self test is extended to demonstrate the performance
> > benefits of using the two new capabilities to bypass the userfaultfd
> > wait queue. The performance samples below (rates in thousands of
> > pages/s, n = 5), were generated using [2] on an x86 machine with 256
> > cores.
> >
> > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > 1       150     340
> > 2       191     477
> > 4       210     809
> > 8       155     1239
> > 16      130     1595
> > 32      108     2299
> > 64      86      3482
> > 128     62      4134
> > 256     36      4012
>
> The number looks very promising.  Though..
>
> >
> > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> >     A quick rundown of the new flags (also detailed in later commits)
> >         -a registers all of guest memory to a single uffd.
>
> ... this is the worst case scenario.  I'd say it's slightly unfair to
> compare by first introducing a bottleneck then compare with it. :)
>
> Jokes aside: I'd think it'll make more sense if such a performance solution
> will be measured on real systems showing real benefits, because so far it's
> still not convincing enough if it's only with the test especially with only
> one uffd.
>
> I don't remember whether I used to discuss this with James before, but..
>
> I know that having multiple uffds in productions also means scattered guest
> memory and scattered VMAs all over the place.  However split the guest
> large mem into at least a few (or even tens of) VMAs may still be something
> worth trying?  Do you think that'll already solve some of the contentions
> on userfaultfd, either on the queue or else?

We considered sharding into several UFFDs. I do think it helps, but
also I think there are two main problems with it:

- One is, I think there's a limit to how much you'd want to do that.
E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
but 1/100 or 1/1000 might become ridiculous in terms of the
"scattering" of VMAs and so on like you mentioned. Especially for very
large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm
not sure splitting just "slightly" is enough to get good performance.

- Another is, sharding UFFDs sort of assumes accesses are randomly
distributed across the guest physical address space. I'm not sure this
is guaranteed for all possible VMs / customer workloads. In other
words, even if we shard across several UFFDs, we may end up with a
small number of them being "hot".

A benefit to Anish's series is that it solves the problem more
fundamentally, and allows demand paging with no "global" locking. So,
it will scale better regardless of VM size, or access pattern.

[1]: https://cloud.google.com/compute/docs/memory-optimized-machines

>
> With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> threads, totally separate uffd queues), I'd expect to some extent other
> things can pop up already, e.g., the network bandwidth, without teaching
> each vcpu thread to report uffd faults themselves.
>
> These are my pure imaginations though, I think that's also why it'll be
> great if such a solution can be tested more or less on a real migration
> scenario to show its real benefits.

I wonder, is there an existing open source QEMU/KVM based live
migration stress test?

I think we could share numbers from some of our internal benchmarks,
or at the very least give relative numbers (e.g. +50% increase), but
since a lot of the software stack is proprietary (e.g. we don't use
QEMU), it may not be that useful or reproducible for folks.

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu April 19, 2023, 9:05 p.m. UTC | #3
On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Anish,
> >
> > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > > KVM's demand paging self test is extended to demonstrate the performance
> > > benefits of using the two new capabilities to bypass the userfaultfd
> > > wait queue. The performance samples below (rates in thousands of
> > > pages/s, n = 5), were generated using [2] on an x86 machine with 256
> > > cores.
> > >
> > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > > 1       150     340
> > > 2       191     477
> > > 4       210     809
> > > 8       155     1239
> > > 16      130     1595
> > > 32      108     2299
> > > 64      86      3482
> > > 128     62      4134
> > > 256     36      4012
> >
> > The number looks very promising.  Though..
> >
> > >
> > > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> > >     A quick rundown of the new flags (also detailed in later commits)
> > >         -a registers all of guest memory to a single uffd.
> >
> > ... this is the worst case scenario.  I'd say it's slightly unfair to
> > compare by first introducing a bottleneck then compare with it. :)
> >
> > Jokes aside: I'd think it'll make more sense if such a performance solution
> > will be measured on real systems showing real benefits, because so far it's
> > still not convincing enough if it's only with the test especially with only
> > one uffd.
> >
> > I don't remember whether I used to discuss this with James before, but..
> >
> > I know that having multiple uffds in productions also means scattered guest
> > memory and scattered VMAs all over the place.  However split the guest
> > large mem into at least a few (or even tens of) VMAs may still be something
> > worth trying?  Do you think that'll already solve some of the contentions
> > on userfaultfd, either on the queue or else?
> 
> We considered sharding into several UFFDs. I do think it helps, but
> also I think there are two main problems with it:
> 
> - One is, I think there's a limit to how much you'd want to do that.
> E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
> but 1/100 or 1/1000 might become ridiculous in terms of the
> "scattering" of VMAs and so on like you mentioned. Especially for very
> large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm
> not sure splitting just "slightly" is enough to get good performance.
> 
> - Another is, sharding UFFDs sort of assumes accesses are randomly
> distributed across the guest physical address space. I'm not sure this
> is guaranteed for all possible VMs / customer workloads. In other
> words, even if we shard across several UFFDs, we may end up with a
> small number of them being "hot".

I never tried to monitor this, but I had a feeling that it's actually
harder to maintain physical contiguity of pages being used and accessed, at
least on Linux.

The more likely case, to me, is that system pages become very scattered a
few hours after boot unless special care is taken, e.g., using hugetlb
pages or reservations for a specific purpose.

I also think that's normally optimal for the system, e.g., numa balancing
will help nodes / cpus use local memory, which helps spread the memory
consumption, hence each core can access different pages that are local to
it.

But I agree I can never justify that it'll always work.  If you or Anish
could provide some data points to further support this issue that would be
very interesting and helpful, IMHO, not required though.

> 
> A benefit to Anish's series is that it solves the problem more
> fundamentally, and allows demand paging with no "global" locking. So,
> it will scale better regardless of VM size, or access pattern.
> 
> [1]: https://cloud.google.com/compute/docs/memory-optimized-machines
> 
> >
> > With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> > threads, totally separate uffd queues), I'd expect to some extent other
> > things can pop up already, e.g., the network bandwidth, without teaching
> > each vcpu thread to report uffd faults themselves.
> >
> > These are my pure imaginations though, I think that's also why it'll be
> > great if such a solution can be tested more or less on a real migration
> > scenario to show its real benefits.
> 
> I wonder, is there an existing open source QEMU/KVM based live
> migration stress test?

I am not aware of any.

> 
> I think we could share numbers from some of our internal benchmarks,
> or at the very least give relative numbers (e.g. +50% increase), but
> since a lot of the software stack is proprietary (e.g. we don't use
> QEMU), it may not be that useful or reproducible for folks.

Those numbers can still be helpful.  I was not asking for reproducibility,
but some test to better justify this feature.

IMHO the demand paging test (at least the current one) may or may not be a
good test to show the value of this specific feature.  With a single uffd, it
obviously bottlenecks on that one uffd, so it doesn't explain whether
scaling num of uffds could help.

But it's not friendly to multi-uffd either, because it'll be the other
extreme case where all mem accesses are spread across the cores, so probably
the feature won't show a result proving it's worthwhile.

From another aspect, if a kernel feature is proposed it's always nice
(and sometimes mandatory) to have at least one user of it (besides the unit
tests).  I think that should also include proprietary software.  It
doesn't need to be used already in production, but some POC would
definitely be very helpful to move a feature forward towards community
acceptance.

Thanks,
Peter Xu April 20, 2023, 9:29 p.m. UTC | #4
Hi, Anish,

[Copied Nadav Amit for the last few paragraphs on userfaultfd, because
 Nadav worked on a few userfaultfd performance problems; so maybe he'll
 also have some ideas around]

On Wed, Apr 19, 2023 at 02:53:46PM -0700, Anish Moorthy wrote:
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > > We considered sharding into several UFFDs. I do think it helps, but
> > > also I think there are two main problems with it...
> >
> > But I agree I can never justify that it'll always work.  If you or Anish
> > could provide some data points to further support this issue that would be
> > very interesting and helpful, IMHO, not required though.
> 
> Axel covered the reasons for not pursuing the sharding approach nicely
> (thanks!). It's not something we ever prototyped, so I don't have any
> further numbers there.
> 
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> >
> > > I think we could share numbers from some of our internal benchmarks,
> > > or at the very least give relative numbers (e.g. +50% increase), but
> > > since a lot of the software stack is proprietary (e.g. we don't use
> > > QEMU), it may not be that useful or reproducible for folks.
> >
> > Those numbers can still be helpful.  I was not asking for reproducibility,
> > but some test to better justify this feature.
> 
> I do have some internal benchmarking numbers on this front, although
> it's been a while since I've collected them so the details might be a
> little sparse.

Thanks for sharing these data points.  I don't understand most of them yet,
but I think it's better than the unit test numbers provided.

> 
> I've confirmed performance gains with "scalable userfaultfd" using two
> workloads besides the self-test:
> 
> The first, cycler, spins up a VM and launches a binary which (a) maps
> a large amount of memory and then (b) loops over it issuing writes as
> fast as possible. It's not a very realistic guest but it at least
> involves an actual migrating VM, and we often use it to
> stress/performance test migration changes. The write rate which cycler
> achieves during userfaultfd-based postcopy (without scalable uffd
> enabled) is about 25% of what it achieves under KVM Demand Paging (the
> internal KVM feature GCE currently uses for postcopy). With
> userfaultfd-based postcopy and scalable uffd enabled that rate jumps
> nearly 3x, so about 75% of what KVM Demand Paging achieves. The
> attached "Cycler.png" illustrates this effect (though due to some
> other details, faster demand paging actually makes the migrations
> worse: the point is that scalable uffd performs more similarly to kvm
> demand paging :)

Yes I don't understand why vanilla uffd is so different, nor am I sure
what the graph means, though. :)

Is the first drop caused by starting migration/precopy?

Is the 2nd (huge) drop (mostly to zero) caused by frequently accessing new
pages during postcopy?

Is the workload doing busy writes with a single thread, or NCPU threads?

Can the 25%-75% comparison you mentioned be seen on the graph?
Or maybe that's part of the period where all three are very close to 0?

> 
> The second is the redis memtier benchmark [1], a more realistic
> workflow where we migrate a VM running the redis server. With scalable
> userfaultfd, the client VM observes significantly higher transaction
> rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> exact numbers if needed, but just from eyeballing the graph you can
> see that the improvement is something like 5-10x (at least) for
> several seconds. There's still a noticeable gap with KVM demand paging
> based-postcopy, but the improvement is definitely significant.
> 
> [1] https://github.com/RedisLabs/memtier_benchmark

Does the "5-10x" difference lie in the "15s valley" you pointed out in the
graph?

Is it reproducible that the blue line always has a totally different
"valley" comparing to yellow/red?

Personally I still really want to know what happens if we just split the
vma and see how it goes with a standard workload, but maybe I'm asking too
much so don't yet worry.  The solution here proposed still makes sense to
me and I agree if this can be done well it can resolve the bottleneck over
1-userfaultfd.

But after I read some of the patches I'm not sure whether it's possible it
can be implemented in a complete way.  You mentioned here and there that
things can be missing probably due to random places accessing guest pages
all over kvm.  Relying solely on -EFAULT so far doesn't look very reliable
to me, but it could be because I didn't yet really understand how it works.

Is above a concern to the current solution?

Have any of you tried to investigate the other approach to scale
userfaultfd?

It seems userfaultfd does one thing great which is to have the trapping at
a unified place (when the page fault happens), hence it doesn't need to
worry on random codes splat over KVM module read/write a guest page.  The
question is whether it'll be easy to do so.

Split vma definitely is still a way to scale userfaultfd, but probably not
in a good enough way because it's scaling in memory axis, not cores.  If
tens of cores accessing a small region that falls into the same VMA, then
it stops working.

However maybe it can be scaled in other form?  So far my understanding is
"read" upon uffd for messages is still not a problem - the read can be done
in chunks, and each message will be converted into a request to be sent
later.

If the real problem relies in a bunch of threads queuing, is it possible
that we can provide just more queues for the events?  The readers will just
need to go over all the queues.

Way to decide "which thread uses which queue" can be another problem, what
comes up quickly to me is a "hash(tid) % n_queues" but maybe it can be
better.  Each vcpu thread will have different tids, then they can hopefully
scale on the queues.
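
A purely illustrative kernel-side sketch of the queue-selection part (the
context type and fields here are hypothetical, just to make the idea
concrete):

#include <linux/jhash.h>
#include <linux/sched.h>
#include <linux/wait.h>

/* Hypothetical: a userfaultfd context with several pending-fault queues
 * instead of the single fault_pending_wqh/fault_wqh pair. */
struct uffd_mq_ctx {
	unsigned int n_queues;
	wait_queue_head_t *queues;
};

/* Faulting threads hash their tid to pick a queue; readers would scan
 * all queues when servicing the uffd. */
static wait_queue_head_t *uffd_mq_pick(struct uffd_mq_ctx *ctx)
{
	u32 idx = jhash_1word(current->pid, 0) % ctx->n_queues;

	return &ctx->queues[idx];
}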

There's at least one issue that I know with such an idea, that after we
have >1 uffd queues it means the message order will be uncertain.  It may
matter for some uffd users (e.g. cooperative userfaultfd, see
UFFD_FEATURE_FORK|REMOVE|etc.)  because I believe order of messages matter
for them (mostly CRIU).  But I think that's not a blocker either because we
can forbid those features with multi queues.

That's a wild idea that I'm just thinking about, which I have totally no
idea whether it'll work or not.  It's more or less of a generic question on
"whether there's chance to scale on uffd side just in case it might be a
cleaner approach", when above concern is a real concern.

Thanks,
Anish Moorthy April 20, 2023, 11:42 p.m. UTC | #5
My reply to Peter earlier bounced from the mailing list due to the
attached images (sorry!). I've copied it below to get a record
on-list.

Just for completeness, the message ID of the bounced mail was
<CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com>

On Wed, Apr 19, 2023 at 2:53 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > > We considered sharding into several UFFDs. I do think it helps, but
> > > also I think there are two main problems with it...
> >
> > But I agree I can never justify that it'll always work.  If you or Anish
> > could provide some data points to further support this issue that would be
> > very interesting and helpful, IMHO, not required though.
>
> Axel covered the reasons for not pursuing the sharding approach nicely
> (thanks!). It's not something we ever prototyped, so I don't have any
> further numbers there.
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> >
> > > I think we could share numbers from some of our internal benchmarks,
> > > or at the very least give relative numbers (e.g. +50% increase), but
> > > since a lot of the software stack is proprietary (e.g. we don't use
> > > QEMU), it may not be that useful or reproducible for folks.
> >
> > Those numbers can still be helpful.  I was not asking for reproducibility,
> > but some test to better justify this feature.
>
> I do have some internal benchmarking numbers on this front, although
> it's been a while since I've collected them so the details might be a
> little sparse.
>
> I've confirmed performance gains with "scalable userfaultfd" using two
> workloads besides the self-test:
>
> The first, cycler, spins up a VM and launches a binary which (a) maps
> a large amount of memory and then (b) loops over it issuing writes as
> fast as possible. It's not a very realistic guest but it at least
> involves an actual migrating VM, and we often use it to
> stress/performance test migration changes. The write rate which cycler
> achieves during userfaultfd-based postcopy (without scalable uffd
> enabled) is about 25% of what it achieves under KVM Demand Paging (the
> internal KVM feature GCE currently uses for postcopy). With
> userfaultfd-based postcopy and scalable uffd enabled that rate jumps
> nearly 3x, so about 75% of what KVM Demand Paging achieves. The
> attached "Cycler.png" illustrates this effect (though due to some
> other details, faster demand paging actually makes the migrations
> worse: the point is that scalable uffd performs more similarly to kvm
> demand paging :)
>
> The second is the redis memtier benchmark [1], a more realistic
> workflow where we migrate a VM running the redis server. With scalable
> userfaultfd, the client VM observes significantly higher transaction
> rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> exact numbers if needed, but just from eyeballing the graph you can
> see that the improvement is something like 5-10x (at least) for
> several seconds. There's still a noticeable gap with KVM demand paging
> based-postcopy, but the improvement is definitely significant.
>
> [1] https://github.com/RedisLabs/memtier_benchmark
Anish Moorthy April 21, 2023, 4:58 p.m. UTC | #6
On Thu, Apr 20, 2023 at 2:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> Yes I don't understand why vanilla uffd is so different, nor am I sure
> what the graph means, though. :)
>
> Is the first drop caused by starting migration/precopy?
>
> Is the 2nd (huge) drop (mostly to zero) caused by frequently accessing new
> pages during postcopy?

Right on both counts. By the way, for anyone who notices that the
userfaultfd (red/yellow) lines never recover to the initial level of
performance, whereas the blue line does: that's a separate issue,
please ignore :)

> Is the workload doing busy writes with a single thread, or NCPU threads?

One thread per vCPU.

> Can the 25%-75% comparison you mentioned be seen on the graph?
> Or maybe that's part of the period where all three are very close to 0?

Yes, unfortunately the absolute size of the improvement is still
pretty small (we go from ~50 writes/s to ~150), so it all looks like zero
at this scale.

> > The second is the redis memtier benchmark [1], a more realistic
> > workflow where we migrate a VM running the redis server. With scalable
> > userfaultfd, the client VM observes significantly higher transaction
> > rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> > exact numbers if needed, but just from eyeballing the graph you can
> > see that the improvement is something like 5-10x (at least) for
> > several seconds. There's still a noticeable gap with KVM demand paging
> > based-postcopy, but the improvement is definitely significant.
> >
> > [1] https://github.com/RedisLabs/memtier_benchmark
>
> Does the "5-10x" difference lie in the "15s valley" you pointed out in the
> graph?

Not quite sure what you mean: I meant to point out that the ~15s
valley is where we observe improvements due to scalable userfaultfd.
For most of that valley, the speedup of scalable uffd is 5-10x (or
something, I admit to eyeballing those numbers :)

> Is it reproducible that the blue line always has a totally different
> "valley" comparing to yellow/red?

Yes, but the offset of that valley is just precopy taking longer for
some reason on that configuration. Honestly it's probably just better
to ignore the blue line, since that's a google-specific stack.

> Personally I still really want to know what happens if we just split the
> vma and see how it goes with a standard workload, but maybe I'm asking too
> much so don't yet worry.  The solution here proposed still makes sense to
> me and I agree if this can be done well it can resolve the bottleneck over
> 1-userfaultfd.
>
> But after I read some of the patches I'm not sure whether it's possible it
> can be implemented in a complete way.  You mentioned here and there that
> things can be missing probably due to random places accessing guest pages
> all over kvm.  Relying solely on -EFAULT so far doesn't look very reliable
> to me, but it could be because I didn't yet really understand how it works.
>
> Is above a concern to the current solution?

Based on your comment in [1], I think your impression of this series
is that it tries to (a) catch all of the cases where userfaultfd would
be triggered and (b) bypass userfaultfd by surfacing the page faults
via vCPU exit. That's only happening in two places (the
KVM_ABSENT_MAPPING_FAULT changes) corresponding to the EPT violation
handler on x86 and the arm64 equivalent. Bypassing the queuing of
faults onto a uffd in those two cases, and instead delivering those
faults via vCPU exit, is what provides the performance gains I'm
demonstrating.

However, all of the other changes (KVM_MEMORY_FAULT_INFO, the bulk of
this series) are totally unrelated to if/how faults are queued onto
userfaultfd. Page faults from copy_to_user/copy_from_user, etc will
continue to be delivered via uffd (if one is registered, obviously),
and changing that is *not* a goal. All that KVM_MEMORY_FAULT_INFO does
is deliver some extra information to userspace in cases where KVM_RUN
currently just returns -EFAULT.

Hopefully this, and my response to [1], clears things up. If not, let
me know and I'll be glad to discuss further.

[1] https://lore.kernel.org/kvm/ZEGuogfbtxPNUq7t@x1n/T/#m76f940846ecc94ea85efa80ffbe42366c2352636

> Have any of you tried to investigate the other approach to scale
> userfaultfd?

As Axel mentioned we considered sharding VMAs but didn't pursue it for
a few different reasons.

> It seems userfaultfd does one thing great which is to have the trapping at
> a unified place (when the page fault happens), hence it doesn't need to
> worry on random codes splat over KVM module read/write a guest page.  The
> question is whether it'll be easy to do so.

See a couple of notes above.

> Split vma definitely is still a way to scale userfaultfd, but probably not
> in a good enough way because it's scaling in memory axis, not cores.  If
> tens of cores accessing a small region that falls into the same VMA, then
> it stops working.
>
> However maybe it can be scaled in other form?  So far my understanding is
> "read" upon uffd for messages is still not a problem - the read can be done
> in chunks, and each message will be converted into a request to be sent
> later.
>
> If the real problem relies in a bunch of threads queuing, is it possible
> that we can provide just more queues for the events?  The readers will just
> need to go over all the queues.
>
> Way to decide "which thread uses which queue" can be another problem, what
> comes up quickly to me is a "hash(tid) % n_queues" but maybe it can be
> better.  Each vcpu thread will have different tids, then they can hopefully
> scale on the queues.
>
> There's at least one issue that I know with such an idea, that after we
> have >1 uffd queues it means the message order will be uncertain.  It may
> matter for some uffd users (e.g. cooperative userfaultfd, see
> UFFD_FEATURE_FORK|REMOVE|etc.)  because I believe order of messages matter
> for them (mostly CRIU).  But I think that's not a blocker either because we
> can forbid those features with multi queues.
>
> That's a wild idea that I'm just thinking about, which I have totally no
> idea whether it'll work or not.  It's more or less of a generic question on
> "whether there's chance to scale on uffd side just in case it might be a
> cleaner approach", when above concern is a real concern.

You bring up a good point, which is that this series only deals with
uffd's performance in the context of KVM. I had another idea in this
vein, which was to allow dedicating queues to certain threads: I even
threw together a prototype, though there was some bug in it which
stopped me from ever getting a real signal :(

I think there's still potential to make uffd itself faster but, as you
point out, that might get messy from an API perspective (I know my
prototype did :) and is going to require more investigation and
prototyping. The advantage of this approach is that it's simple, makes
a lot of conceptual sense IMO (in that the previously-stalled vCPU
threads can now participate in the work of demand fetching), and
solves a very important (probably *the* most important) bottleneck
when it comes to KVM + uffd-based postcopy.
Nadav Amit April 21, 2023, 5:39 p.m. UTC | #7
> On Apr 20, 2023, at 2:29 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> Hi, Anish,
> 
> [Copied Nadav Amit for the last few paragraphs on userfaultfd, because
> Nadav worked on a few userfaultfd performance problems; so maybe he'll
> also have some ideas around]

Sorry for not following this thread and previous ones.

I skimmed through and I hope my answers would be helpful and relevant…

Anyhow, I also encountered to some extent the cost of locking (not the
contention). In my case, I only did a small change to reduce the overhead of
acquiring the locks by "combining" the locks of fault_pending_wqh and
fault_wqh, which are (almost?) always acquired together. I only acquired
fault_pending_wqh and then introduced "fake_spin_lock()", which only got
lockdep to understand that fault_wqh is already protected.

But as I said, this solution was only intended to reduce the cost of locking,
and it does not solve the contention.

If I understand the problem correctly, it sounds as if the proper solution
should be some kind of a range-locks. If it is too heavy or the interface can
be changed/extended to wake a single address (instead of a range),
simpler hashed-locks can be used.
Anish Moorthy April 24, 2023, 5:54 p.m. UTC | #8
On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> If I understand the problem correctly, it sounds as if the proper solution
> should be some kind of a range-locks. If it is too heavy or the interface can
> be changed/extended to wake a single address (instead of a range),
> simpler hashed-locks can be used.

Some sort of range-based locking system does seem relevant, although I
don't see how that would necessarily speed up the delivery of faults
to UFFD readers: I'll have to think about it more.

On the KVM side though, I think there's value in merging
KVM_CAP_ABSENT_MAPPING_FAULT and allowing performance improvements to
UFFD itself proceed separately. It's likely that returning faults
directly via the vCPU threads will be faster than even an improved
UFFD, since the former approach basically removes one level of
indirection. That seems important, given the criticality of the
EPT-violation path during postcopy. Furthermore, if future performance
improvements to UFFD involve changes/restrictions to its API, then
KVM_CAP_ABSENT_MAPPING_FAULT could well be useful anyways.

Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
so I'm guessing (some version of) that will (eventually :) be merged
in any case.
Nadav Amit April 24, 2023, 7:44 p.m. UTC | #9
> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> 
> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> If I understand the problem correctly, it sounds as if the proper solution
>> should be some kind of a range-locks. If it is too heavy or the interface can
>> be changed/extended to wake a single address (instead of a range),
>> simpler hashed-locks can be used.
> 
> Some sort of range-based locking system does seem relevant, although I
> don't see how that would necessarily speed up the delivery of faults
> to UFFD readers: I'll have to think about it more.

Perhaps I misread your issue. Based on the scalability issues you raised,
I assumed that the problem you encountered is related to lock contention.
I do not know whether you profiled it, but some information would be
useful.

Anyhow, if the issue you encounter is mostly about the general overhead
of delivering userfaultfd events, I encountered this one too. The solution
I tried (and you can find some old patches) is to deliver and resolve
userfaultfd faults using io_uring. The main advantage is that this solution
is generic and
performance is pretty good. The disadvantage is that you do need to allocate
a polling thread/core to handle the userfaultfd.

If you want, I can send you privately the last iteration of my patches for
you to give it a spin.

> 
> On the KVM side though, I think there's value in merging
> KVM_CAP_ABSENT_MAPPING_FAULT and allowing performance improvements to
> UFFD itself proceed separately. It's likely that returning faults
> directly via the vCPU threads will be faster than even an improved
> UFFD, since the former approach basically removes one level of
> indirection. That seems important, given the criticality of the
> EPT-violation path during postcopy. Furthermore, if future performance
> improvements to UFFD involve changes/restrictions to its API, then
> KVM_CAP_ABSENT_MAPPING_FAULT could well be useful anyways.
> 
> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
> so I'm guessing (some version of) that will (eventually :) be merged
> in any case.

It certainly not my call. But if you ask me, introducing a solution for
a concrete use-case that requires API changes/enhancements is not
guaranteed to be the best solution. It may be better first to fully
understand the existing overheads and agree that there is no alternative
cleaner and more general solution with similar performance.

Considering the mess that KVM async-PF introduced, I
would be very careful before introducing such API changes. I did not look
too much on the details, but some things anyhow look slightly strange
(which might be since I am out-of-touch with KVM). For instance, returning
-EFAULT from KVM_RUN? I would have assumed -EAGAIN would be more
appropriate since the invocation did succeed.
Sean Christopherson April 24, 2023, 8:35 p.m. UTC | #10
On Mon, Apr 24, 2023, Nadav Amit wrote:
> 
> > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> > Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
> > so I'm guessing (some version of) that will (eventually :) be merged
> > in any case.
> 
> It certainly not my call. But if you ask me, introducing a solution for
> a concrete use-case that requires API changes/enhancements is not
> guaranteed to be the best solution. It may be better first to fully
> understand the existing overheads and agree that there is no alternative
> cleaner and more general solution with similar performance.

KVM already returns -EFAULT for these situations, the change I really want to land
is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
happy to carry the code in KVM even if userspace never does anything beyond dumping
the extra information on failures.

> Considering the mess that KVM async-PF introduced, I would be very careful
> before introducing such API changes. I did not look too much on the details,
> but some things anyhow look slightly strange (which might be since I am
> out-of-touch with KVM). For instance, returning -EFAULT from KVM_RUN? I
> would have assumed -EAGAIN would be more appropriate since the invocation did
> succeed.

Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing
behavior that's been around for many years.
Nadav Amit April 24, 2023, 11:47 p.m. UTC | #11
Feel free to tell me to shut up, as it is none of my business, and I might be
missing a lot of context. Yet, it never stopped me before. :)

> On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Apr 24, 2023, Nadav Amit wrote:
>> 
>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
>>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
>>> so I'm guessing (some version of) that will (eventually :) be merged
>>> in any case.
>> 
>> It certainly not my call. But if you ask me, introducing a solution for
>> a concrete use-case that requires API changes/enhancements is not
>> guaranteed to be the best solution. It may be better first to fully
>> understand the existing overheads and agree that there is no alternative
>> cleaner and more general solution with similar performance.
> 
> KVM already returns -EFAULT for these situations, the change I really want to land
> is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
> happy to carry the code in KVM even if userspace never does anything beyond dumping
> the extra information on failures.

I thought that the change is to inform on page-faults through a new interface
instead of the existing uffd-file-based one. There is already another interface
(signals) and I thought (but did not upstream) io-uring. You now suggest yet
another one.

I am not sure it is very clean. IIUC, it means that you would still need to
monitor uffd in userspace, as qemu (or whatever-kvm-userspace-counterpart-you-
use) might also trigger a page-fault. So userspace becomes more complicated,
and the ordering of different events/page-faults reporting becomes even more
broken.

At the same time, you also break various assumptions of userfaultfd. I don’t
immediately find some functionality that would break, but I am not very
confident about it either.

> 
>> Considering the mess that KVM async-PF introduced, I would be very careful
>> before introducing such API changes. I did not look too much on the details,
>> but some things anyhow look slightly strange (which might be since I am
> >> out-of-touch with KVM). For instance, returning -EFAULT from KVM_RUN? I
>> would have assumed -EAGAIN would be more appropriate since the invocation did
>> succeed.
> 
> Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing
> behavior that's been around for many years.

Again, it is none of my business, but don’t you want to gradually try to fix
the interface so maybe one day you would be able to deprecate it?

IOW, if you already introduce a new interface which is enabled with a new
flag, which would require userspace change, then you can return the more
appropriate error-code.
Anish Moorthy April 25, 2023, 12:15 a.m. UTC | #12
On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
>
>
> > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>
> >> If I understand the problem correctly, it sounds as if the proper solution
> >> should be some kind of a range-locks. If it is too heavy or the interface can
> >> be changed/extended to wake a single address (instead of a range),
> >> simpler hashed-locks can be used.
> >
> > Some sort of range-based locking system does seem relevant, although I
> > don't see how that would necessarily speed up the delivery of faults
> > to UFFD readers: I'll have to think about it more.
>
> Perhaps I misread your issue. Based on the scalability issues you raised,
> I assumed that the problem you encountered is related to lock contention.
> I do not know whether your profiled it, but some information would be
> useful.

No, you had it right: the issue at hand is contention on the uffd wait
queues. I'm just not sure what the range-based locking would really be
doing. Events would still have to be delivered to userspace in an
ordered manner, so it seems to me that each uffd would still need to
maintain a queue (and the associated contention).

With respect to the "sharding" idea, I collected some more runs of the
self test (full command in [1]). This time I omitted the "-a" flag, so
that every vCPU accesses a different range of guest memory with its
own UFFD, and set the number of reader threads per UFFD to 1.

vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
1      180     307
2       85      220
4       80      206
8       39     163
16     18     104
32      8      73
64      4      57
128    1      37
256    1      16

I'm reporting paging rate on a per-vcpu rather than total basis, which
is why the numbers look so different than the ones in the cover
letter. I'm actually not sure why the demand paging rate falls off
with the number of vCPUs (maybe a prioritization issue on my side?),
but even when UFFDs aren't being contended for it's clear that demand
paging via memory fault exits is significantly faster.

I'll try to get some perf traces as well: that will take a little bit
of time though, as doing it for cycler will involve patching our VMM
first.

[1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w]

> It certainly not my call. But if you ask me, introducing a solution for
> a concrete use-case that requires API changes/enhancements is not
> guaranteed to be the best solution. It may be better first to fully
> understand the existing overheads and agree that there is no alternative
> cleaner and more general solution with similar performance.
>
> Considering the mess that KVM async-PF introduced, I
> would be very careful before introducing such API changes. I did not look
> too much on the details, but some things anyhow look slightly strange
> (which might be since I am out-of-touch with KVM). For instance, returning
> -EFAULT from KVM_RUN? I would have assumed -EAGAIN would be more
> appropriate since the invocation did succeed.

I'm not quite sure whether you're focusing on
KVM_CAP_MEMORY_FAULT_INFO or KVM_CAP_ABSENT_MAPPING_FAULT here. But to
my knowledge, none of the KVM folks have objections to either:
hopefully it stays that way, but we'll have to see :)
Sean Christopherson April 25, 2023, 12:26 a.m. UTC | #13
On Mon, Apr 24, 2023, Nadav Amit wrote:
> Feel free to tell me to shut up, as it is none of my business, and I might be
> missing a lot of context. Yet, it never stopped me before. :)
> 
> > On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Mon, Apr 24, 2023, Nadav Amit wrote:
> >> 
> >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
> >>> so I'm guessing (some version of) that will (eventually :) be merged
> >>> in any case.
> >> 
> >> It certainly not my call. But if you ask me, introducing a solution for
> >> a concrete use-case that requires API changes/enhancements is not
> >> guaranteed to be the best solution. It may be better first to fully
> >> understand the existing overheads and agree that there is no alternative
> >> cleaner and more general solution with similar performance.
> > 
> > KVM already returns -EFAULT for these situations, the change I really want to land
> > is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
> > happy to carry the code in KVM even if userspace never does anything beyond dumping
> > the extra information on failures.
> 
> I thought that the change is to inform on page-faults through a new interface
> instead of the existing uffd-file-based one. There is already another interface
> (signals) and I thought (but did not upstream) io-uring. You now suggest yet
> another one.

There are two capabilities proposed in this series: one to provide more information
when KVM encounters a fault it can't resolve, and another to tell KVM to kick out
to userspace instead of attempting to resolve a fault when a given address couldn't
be resolved with fast gup.  I'm talking purely about the first one: providing more
information when KVM exits.

As for the second, my plan is to try and stay out of the way and let people that
actually deal with the userspace side of things settle on an approach.  From the
KVM side, supporting the "don't wait to resolve faults" flag is quite simple so
long as KVM punts the heavy lifting to userspace, e.g. identifying _why_ the address
isn't mapped with the appropriate permissions.

> I am not sure it is very clean. IIUC, it means that you would still need to
> monitor uffd in userspace, as qemu (or whatever-kvm-userspace-counterpart-you-
> use) might also trigger a page-fault. So userspace becomes more complicated,
> and the ordering of different events/page-faults reporting becomes even more
> broken.
> 
> At the same time, you also break various assumptions of userfaultfd. I don’t
> immediately find some functionality that would break, but I am not very
> confident about it either.
> 
> > 
> >> Considering the mess that KVM async-PF introduced, I would be very careful
> >> before introducing such API changes. I did not look too much on the details,
> >> but some things anyhow look slightly strange (which might be since I am
> >> out-of-touch with KVM). For instance, returning -EFAULT from KVM_RUN? I
> >> would have assumed -EAGAIN would be more appropriate since the invocation did
> >> succeed.
> > 
> > Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing
> > behavior that's been around for many years.
> 
> Again, it is none of my business, but don’t you want to gradually try to fix
> the interface so maybe one day you would be able to deprecate it?
>
> IOW, if you already introduce a new interface which is enabled with a new
> flag, which would require userspace change, then you can return the more
> appropriate error-code.

In a perfect world, yes.  But unfortunately, the relevant plumbing in KVM is quite
brittle (understatement) with respect to returning "0", and AFAICT, returning
-EFAULT instead of 0 is nothing more than an oddity.  E.g. at worst, it could be
surprising to users writing a new VMM from scratch.

But I hadn't thought about returning a _different_ error code.  -EAGAIN isn't
obviously better though, e.g. my understanding is that -EAGAIN typically means that
retrying will succeed, but in pretty much every case where KVM returns -EFAULT,
simply trying again will never succeed.  It's not even guaranteed to be appropriate
in the "don't wait to resolve faults" case, because KVM won't actually know why
an address isn't accessible, e.g. it could be because the page needs to be faulted
in, but it could also be due to a guest bug that resulted in a permission
violation.

All in all, returning -EFAULT doesn't seem egregious.  I can't recall a single complaint
about returning -EFAULT instead of -XYZ, just complaints about KVM not providing any
useful information.  So absent a concrete need for returning something other than
-EFAULT, I'm definitely inclined to maintain the status quo, even though it's imperfect.
Nadav Amit April 25, 2023, 12:37 a.m. UTC | #14
> On Apr 24, 2023, at 5:26 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Apr 24, 2023, Nadav Amit wrote:
>> Feel free to tell me to shut up, as it is none of my business, and I might be
>> missing a lot of context. Yet, it never stopped me before. :)
>> 
>>> On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote:
>>> 
>>> On Mon, Apr 24, 2023, Nadav Amit wrote:
>>>> 
>>>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
>>>>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
>>>>> so I'm guessing (some version of) that will (eventually :) be merged
>>>>> in any case.
>>>> 
>>>> It certainly not my call. But if you ask me, introducing a solution for
>>>> a concrete use-case that requires API changes/enhancements is not
>>>> guaranteed to be the best solution. It may be better first to fully
>>>> understand the existing overheads and agree that there is no alternative
>>>> cleaner and more general solution with similar performance.
>>> 
>>> KVM already returns -EFAULT for these situations, the change I really want to land
>>> is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
>>> happy to carry the code in KVM even if userspace never does anything beyond dumping
>>> the extra information on failures.
>> 
>> I thought that the change is to inform on page-faults through a new interface
>> instead of the existing uffd-file-based one. There is already another interface
>> (signals) and I thought (but did not upstream) io-uring. You now suggest yet
>> another one.
> 
> There are two capabilities proposed in this series: one to provide more information
> when KVM encounters a fault it can't resolve, and another to tell KVM to kick out
> to userspace instead of attempting to resolve a fault when a given address couldn't
> be resolved with fast gup.  I'm talking purely about the first one: providing more
> information when KVM exits.
> 
> As for the second, my plan is to try and stay out of the way and let people that
> actually deal with the userspace side of things settle on an approach.  From the
> KVM side, supporting the "don't wait to resolve faults" flag is quite simple so
> long as KVM punts the heavy lifting to userspace, e.g. identifying _why_ the address
> isn't mapped with the appropriate permissions.

Thanks for your kind and detailed reply. I removed the parts that I understand.

As you might guess, I focus on the second part, which you leave for others. The
interactions between the two page-fault reporting mechanisms are not trivial, and it
is already not fully correct.

I understand the approach that Anish prefers is to do something that is tailored
for KVM, but I am not sure it is the best thing.
Nadav Amit April 25, 2023, 12:54 a.m. UTC | #15
> On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@google.com> wrote:
> 
> On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> 
>> 
>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
>>> 
>>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>> 
>>>> If I understand the problem correctly, it sounds as if the proper solution
>>>> should be some kind of a range-locks. If it is too heavy or the interface can
>>>> be changed/extended to wake a single address (instead of a range),
>>>> simpler hashed-locks can be used.
>>> 
>>> Some sort of range-based locking system does seem relevant, although I
>>> don't see how that would necessarily speed up the delivery of faults
>>> to UFFD readers: I'll have to think about it more.
>> 
>> Perhaps I misread your issue. Based on the scalability issues you raised,
>> I assumed that the problem you encountered is related to lock contention.
>> I do not know whether your profiled it, but some information would be
>> useful.
> 
> No, you had it right: the issue at hand is contention on the uffd wait
> queues. I'm just not sure what the range-based locking would really be
> doing. Events would still have to be delivered to userspace in an
> ordered manner, so it seems to me that each uffd would still need to
> maintain a queue (and the associated contention).

There are 2 queues. One for the pending faults that were still not reported
to userspace, and one for the faults that we might need to wake up. The second
one can have range locks.

Perhaps some hybrid approach would be best: do not block on page-faults that
KVM runs into, which would remove the need to enqueue on fault_wqh.

But I do not know whether the reporting through KVM instead of the
userfaultfd-based mechanism is very clean. I think that an IO-uring based
solution, such as the one I proposed before, would be more generic. Actually,
now that I better understand your use-case, you would not need a core to poll
and you would just be able to read the page-fault information from the IO-uring.

Then, you can report whether the page-fault blocked or not in a flag.

> 
> With respect to the "sharding" idea, I collected some more runs of the
> self test (full command in [1]). This time I omitted the "-a" flag, so
> that every vCPU accesses a different range of guest memory with its
> own UFFD, and set the number of reader threads per UFFD to 1.

Just wondering, did you run the benchmark with DONTWAKE? Sounds as if the
wake is not needed.
James Houghton April 27, 2023, 4:38 p.m. UTC | #16
On Mon, Apr 24, 2023 at 5:54 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
>
>
> > On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>
> >>
> >>
> >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >>>
> >>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>>>
> >>>> If I understand the problem correctly, it sounds as if the proper solution
> >>>> should be some kind of a range-locks. If it is too heavy or the interface can
> >>>> be changed/extended to wake a single address (instead of a range),
> >>>> simpler hashed-locks can be used.
> >>>
> >>> Some sort of range-based locking system does seem relevant, although I
> >>> don't see how that would necessarily speed up the delivery of faults
> >>> to UFFD readers: I'll have to think about it more.
> >>
> >> Perhaps I misread your issue. Based on the scalability issues you raised,
> >> I assumed that the problem you encountered is related to lock contention.
> >> I do not know whether your profiled it, but some information would be
> >> useful.
> >
> > No, you had it right: the issue at hand is contention on the uffd wait
> > queues. I'm just not sure what the range-based locking would really be
> > doing. Events would still have to be delivered to userspace in an
> > ordered manner, so it seems to me that each uffd would still need to
> > maintain a queue (and the associated contention).
>
> There are 2 queues. One for the pending faults that were still not reported
> to userspace, and one for the faults that we might need to wake up. The second
> one can have range locks.
>
> Perhaps some hybrid approach would be best: do not block on page-faults that
> KVM runs into, which would prevent you from the need to enqueue on fault_wqh.

Hi Nadav,

If we don't block on the page faults that KVM runs into, what are you
suggesting that these threads do?

1. If you're saying that we should kick the threads out to userspace
and then read the page fault event, then I would say that it's just
unnecessary complexity. (Seems like this is what you mean from what
you said below.)
2. If you're saying they should busy-wait, then unfortunately we can't
afford that.
3. If it's neither of those, could you clarify?

>
> But I do not know whether the reporting through KVM instead of
> userfaultfd-based mechanism is very clean. I think that an IO-uring based
> solution, such as the one I proposed before, would be more generic. Actually,
> now that I understand better your use-case, you do not need a core to poll
> and you would just be able to read the page-fault information from the IO-uring.
>
> Then, you can report whether the page-fault blocked or not in a flag.

This is a fine idea, but I don't think the required complexity is
worth it. The memory fault info reporting piece of this series is
relatively uncontentious, so let's assume we have it at our disposal.

Now, the complexity to make KVM only attempt fast GUP (and EFAULT if
it fails) is really minimal. We automatically know that we don't need
to WAKE and which address to make ready.  Userspace is also able to
resolve the fault: UFFDIO_CONTINUE if we haven't already, then
MADV_POPULATE_WRITE if we have (forces userspace page tables to be
populated if they haven't been, potentially going through userfaultfd
to do so, i.e., if UFFDIO_CONTINUE wasn't already done).
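
As a concrete (if simplified) sketch of that resolution path, assuming the
VMM has already translated the annotated gpa into the faulting hva and knows
the relevant uffd and length, and with error handling trimmed
(MADV_POPULATE_WRITE needs 5.14+ headers):

  #include <errno.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/userfaultfd.h>

  /* Try UFFDIO_CONTINUE first; if the mapping already exists (-EEXIST),
   * force the userspace page tables present with MADV_POPULATE_WRITE. */
  static int make_range_present(int uffd, uint64_t hva, uint64_t len)
  {
      struct uffdio_continue cont = {
          .range = { .start = hva, .len = len },
          .mode = 0,  /* or UFFDIO_CONTINUE_MODE_DONTWAKE if nobody sleeps */
      };

      if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == 0)
          return 0;
      if (errno != EEXIST)
          return -errno;

      /* Already CONTINUEd: populate the userspace page tables instead. */
      return madvise((void *)(uintptr_t)hva, len, MADV_POPULATE_WRITE);
  }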

It sounds like what you're suggesting is something like:
1. KVM attempts fast GUP then slow GUP.
2. In slow GUP, queue a "non-blocking" userfault, but don't go to
sleep (return with VM_FAULT_SIGBUS or something).
3. The vCPU thread gets kicked out to userspace with EFAULT (+ fault
info if we've enabled it).
4. Read a fault from the userfaultfd or io_uring.
5. Make the page ready, and if it were non-blocking, then don't WAKE.

I have some questions/thoughts with this approach:
1. Is io_uring the only way to make reading from a userfaultfd scale?
Maybe it's possible to avoid using a wait_queue for "non-blocking"
faults, but then we'd need a special read() API specifically to
*avoid* the standard fault_pending_wqh queue. Either approach will be
quite complex.
2. We'll still need to annotate KVM in the same-ish place to tell
userfaultfd that the fault should be non-blocking, but we'll probably
*also* need like GUP_USERFAULT_NONBLOCK and/or
FAULT_FLAG_USERFAULT_NOBLOCK or something. (UFFD_FEATURE_SIGBUS does
not exactly solve this problem either.)
3. If the vCPU thread is getting kicked out to userspace, it seems
like there is no way for it to find/read the #pf it generated. This
seems problematic.

>
> >
> > With respect to the "sharding" idea, I collected some more runs of the
> > self test (full command in [1]). This time I omitted the "-a" flag, so
> > that every vCPU accesses a different range of guest memory with its
> > own UFFD, and set the number of reader threads per UFFD to 1.
>
> Just wondering, did you run the benchmark with DONTWAKE? Sounds as if the
> wake is not needed.
>

Anish's selftest only WAKEs when it's necessary[1]. IOW, we only WAKE
when we actually read the #pf from the userfaultfd. If we were to WAKE
for each fault, we wouldn't get much of a scalability improvement at
all (we would still be contending on the wait_queue locks, just not
quite as much as before).

[1]: https://lore.kernel.org/kvm/20230412213510.1220557-23-amoorthy@google.com/

Thanks for your insights/suggestions, Nadav.

- James
Peter Xu April 27, 2023, 8:26 p.m. UTC | #17
Hi, Anish,

On Mon, Apr 24, 2023 at 05:15:49PM -0700, Anish Moorthy wrote:
> On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >
> >
> >
> > > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> > >
> > > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> > >>
> > >> If I understand the problem correctly, it sounds as if the proper solution
> > >> should be some kind of a range-locks. If it is too heavy or the interface can
> > >> be changed/extended to wake a single address (instead of a range),
> > >> simpler hashed-locks can be used.
> > >
> > > Some sort of range-based locking system does seem relevant, although I
> > > don't see how that would necessarily speed up the delivery of faults
> > > to UFFD readers: I'll have to think about it more.
> >
> > Perhaps I misread your issue. Based on the scalability issues you raised,
> > I assumed that the problem you encountered is related to lock contention.
> > I do not know whether your profiled it, but some information would be
> > useful.
> 
> No, you had it right: the issue at hand is contention on the uffd wait
> queues. I'm just not sure what the range-based locking would really be
> doing. Events would still have to be delivered to userspace in an
> ordered manner, so it seems to me that each uffd would still need to
> maintain a queue (and the associated contention).
> 
> With respect to the "sharding" idea, I collected some more runs of the
> self test (full command in [1]). This time I omitted the "-a" flag, so
> that every vCPU accesses a different range of guest memory with its
> own UFFD, and set the number of reader threads per UFFD to 1.
> 
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> 1      180     307
> 2       85      220
> 4       80      206
> 8       39     163
> 16     18     104
> 32      8      73
> 64      4      57
> 128    1      37
> 256    1      16
> 
> I'm reporting paging rate on a per-vcpu rather than total basis, which
> is why the numbers look so different than the ones in the cover
> letter. I'm actually not sure why the demand paging rate falls off
> with the number of vCPUs (maybe a prioritization issue on my side?),
> but even when UFFDs aren't being contended for it's clear that demand
> paging via memory fault exits is significantly faster.
> 
> I'll try to get some perf traces as well: that will take a little bit
> of time though, as to do it for cycler will involve patching our VMM
> first.
> 
> [1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w]

Thanks (for doing this test, and also to Nadav for all his inputs), and
sorry for a late response.

These numbers caught my eye, and I'm very curious why even 2 vcpus can
scale that badly.

I gave it a shot on a test machine and I got something slightly different:

  Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
  $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
  |-------+----------+--------|
  | n_thr | per-vcpu | total  |
  |-------+----------+--------|
  |     1 | 39.5K    | 39.5K  |
  |     2 | 33.8K    | 67.6K  |
  |     4 | 31.8K    | 127.2K |
  |     8 | 30.8K    | 246.1K |
  |    16 | 21.9K    | 351.0K |
  |-------+----------+--------|

I used larger RAM due to fewer cores.  I didn't try 32+ vcpus, to make sure I
don't have two threads contending on a core/thread already since I only have 40
hardware threads there, but we can still compare with your lower half.

When I was testing I noticed bad numbers and another bug where NSEC_PER_SEC
wasn't used properly, so I applied this before the test:

https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/

I think it means it still doesn't scale that well, though not so bad
either - no obvious 1/2 drop when using 2 vcpus.  There are still a bunch of
paths triggered in the test so I also don't expect it to scale fully
linearly.  From my numbers I just didn't see a drop as drastic as yours. I'm not
sure whether it's simply a broken test number, parameter differences
(e.g. you used only 64M per-vcpu), or hardware differences.
Anish Moorthy May 3, 2023, 7:45 p.m. UTC | #18
On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> Thanks (for doing this test, and also to Nadav for all his inputs), and
> sorry for a late response.

No need to apologize: anyways, I've got you comfortably beat on being
late at this point :)

> These numbers caught my eye, and I'm very curious why even 2 vcpus can
> scale that bad.
>
> I gave it a shot on a test machine and I got something slightly different:
>
>   Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
>   $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
>   |-------+----------+--------|
>   | n_thr | per-vcpu | total  |
>   |-------+----------+--------|
>   |     1 | 39.5K    | 39.5K  |
>   |     2 | 33.8K    | 67.6K  |
>   |     4 | 31.8K    | 127.2K |
>   |     8 | 30.8K    | 246.1K |
>   |    16 | 21.9K    | 351.0K |
>   |-------+----------+--------|
>
> I used larger ram due to less cores.  I didn't try 32+ vcpus to make sure I
> don't have two threads content on a core/thread already since I only got 40
> hardware threads there, but still we can compare with your lower half.
>
> When I was testing I noticed bad numbers and another bug on not using
> NSEC_PER_SEC properly, so I did this before the test:
>
> https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/
>
> I think it means it still doesn't scale that good, however not so bad
> either - no obvious 1/2 drop on using 2vcpus.  There're still a bunch of
> paths triggered in the test so I also don't expect it to fully scale
> linearly.  From my numbers I just didn't see as drastic as yours. I'm not
> sure whether it's simply broken test number, parameter differences
> (e.g. you used 64M only per-vcpu), or hardware differences.

Hmm, I suspect we're dealing with hardware differences here. I
rebased my changes onto those two patches you sent, taking care not
to clobber them, but even with the repro command you provided my
results look very different from yours (at least on 1-4 vcpus) on the
machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).

(n=20)
n_thr    per_vcpu    total
    1        154K     154K
    2         92K     184K
    4         71K     285K
    8         36K     291K
   16         19K     310K

Out of interest, I tested on another machine (Intel(R) Xeon(R)
Platinum 8273CL CPU @ 2.20GHz) as well, and the results are a bit
different again:

(n=20)
n_thr    per_vcpu    total
    1        115K     115K
    2        103K     206K
    4         65K     262K
    8         39K     319K
   16         19K     398K

It is interesting how all three sets of numbers start off different
but seem to converge around 16 vCPUs. I did check to make sure the
memory fault exits sped things up in all cases, and that at least
stays true.

By the way, I've got a little helper script that I've been using to
run/average the selftest results (which can vary quite a bit). I've
attached it below - hopefully it doesn't bounce from the mailing list.
Just for reference, the invocation to test the command you provided is

> python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M
Sean Christopherson May 3, 2023, 8:09 p.m. UTC | #19
On Wed, May 03, 2023, Anish Moorthy wrote:
> On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Thanks (for doing this test, and also to Nadav for all his inputs), and
> > sorry for a late response.
> 
> No need to apologize: anyways, I've got you comfortably beat on being
> late at this point :)

LOL, hold my beer and let me show you the true meaning of "late response". :-)
Peter Xu May 3, 2023, 9:27 p.m. UTC | #20
Oops, bounced back from the list..

Forward with no attachment this time - I assume the information is still
enough in the paragraphs even without the flamegraphs.  Sorry for the
noise.

On Wed, May 03, 2023 at 05:18:13PM -0400, Peter Xu wrote:
> On Wed, May 03, 2023 at 12:45:07PM -0700, Anish Moorthy wrote:
> > On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Thanks (for doing this test, and also to Nadav for all his inputs), and
> > > sorry for a late response.
> > 
> > No need to apologize: anyways, I've got you comfortably beat on being
> > late at this point :)
> > 
> > > These numbers caught my eye, and I'm very curious why even 2 vcpus can
> > > scale that bad.
> > >
> > > I gave it a shot on a test machine and I got something slightly different:
> > >
> > >   Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
> > >   $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
> > >   |-------+----------+--------|
> > >   | n_thr | per-vcpu | total  |
> > >   |-------+----------+--------|
> > >   |     1 | 39.5K    | 39.5K  |
> > >   |     2 | 33.8K    | 67.6K  |
> > >   |     4 | 31.8K    | 127.2K |
> > >   |     8 | 30.8K    | 246.1K |
> > >   |    16 | 21.9K    | 351.0K |
> > >   |-------+----------+--------|
> > >
> > > I used larger ram due to less cores.  I didn't try 32+ vcpus to make sure I
> > > don't have two threads content on a core/thread already since I only got 40
> > > hardware threads there, but still we can compare with your lower half.
> > >
> > > When I was testing I noticed bad numbers and another bug on not using
> > > NSEC_PER_SEC properly, so I did this before the test:
> > >
> > > https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/
> > >
> > > I think it means it still doesn't scale that good, however not so bad
> > > either - no obvious 1/2 drop on using 2vcpus.  There're still a bunch of
> > > paths triggered in the test so I also don't expect it to fully scale
> > > linearly.  From my numbers I just didn't see as drastic as yours. I'm not
> > > sure whether it's simply broken test number, parameter differences
> > > (e.g. you used 64M only per-vcpu), or hardware differences.
> > 
> > Hmm, I suspect we're dealing with  hardware differences here. I
> > rebased my changes onto those two patches you sent up, taking care not
> > to clobber them, but even with the repro command you provided my
> > results look very different than yours (at least on 1-4 vcpus) on the
> > machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).
> > 
> > (n=20)
> > n_thr      per_vcpu       total
> > 1            154K              154K
> > 2             92k                184K
> > 4             71K                285K
> > 8             36K                291K
> > 16           19K                310K
> > 
> > Out of interested I tested on another machine (Intel(R) Xeon(R)
> > Platinum 8273CL CPU @ 2.20GHz) as well, and results are a bit
> > different again
> > 
> > (n=20)
> > n_thr      per_vcpu       total
> > 1            115K              115K
> > 2             103k              206K
> > 4             65K                262K
> > 8             39K                319K
> > 16           19K                398K
> 
> Interesting.
> 
> > 
> > It is interesting how all three sets of numbers start off different
> > but seem to converge around 16 vCPUs. I did check to make sure the
> > memory fault exits sped things up in all cases, and that at least
> > stays true.
> > 
> > By the way, I've got a little helper script that I've been using to
> > run/average the selftest results (which can vary quite a bit). I've
> > attached it below- hopefully it doesn't bounce from the mailing list.
> > Just for reference, the invocation to test the command you provided is
> > 
> > > python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M
> 
> I found that indeed I shouldn't have stopped at 16 vcpus since that's
> exactly where it starts to bottleneck. :)
> 
> So out of my curiosity I tried to profile 32 vcpus case on my system with
> this test case, meanwhile I tried it both with:
> 
>   - 1 uffd + 8 readers
>   - 32 uffds (so 32 readers)
> 
> I've got the flamegraphs attached for both.
> 
> It seems that when using >1 uffds the bottleneck is not the spinlock
> anymore but something else.
> 
> From what I got there, vmx_vcpu_load() gets more highlights than the
> spinlocks. I think that's the tlb flush broadcast.
> 
> While OTOH indeed when using 1 uffd we can see obviously the overhead of
> spinlock contention on either the fault() path or read()/poll() as you and
> James rightfully pointed out.
> 
> I'm not sure whether my number is caused by special setup, though. After
> all I only had 40 threads and I started 32 vcpus + 8 readers and there'll
> be contention already between the workloads.
> 
> IMHO this means that there's still chance to provide a more generic
> userfaultfd scaling solution as long as we can remove the single spinlock
> contention on the fault/fault_pending queues.  I'll see whether I can still
> explore a bit on the possibility of this and keep you guys updated.  The
> general idea here to me is still to make multi-queue out of 1 uffd.
> 
> I _think_ this might also be a positive result to your work, because if the
> bottleneck is not userfaultfd (as we scale it with creating multiple;
> ignoring the split vma effect), then it cannot be resolved by scaling
> userfaultfd alone anyway, anymore.  So a general solution, even if existed,
> may not work here for kvm, because we'll get stuck somewhere else already.
> 
> -- 
> Peter Xu
Sean Christopherson May 3, 2023, 9:42 p.m. UTC | #21
On Wed, May 03, 2023, Peter Xu wrote:
> Oops, bounced back from the list..
> 
> Forward with no attachment this time - I assume the information is still
> enough in the paragraphs even without the flamegraphs.

The flamegraphs are definitely useful beyond what is captured here.  Not sure
how to get them accepted on the list though.

> > From what I got there, vmx_vcpu_load() gets more highlights than the
> > spinlocks. I think that's the tlb flush broadcast.

No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
VMCS caches not being coherent).

Task migration can also trigger IBPB (if mitigations are enabled), and also does
an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.

Can you try 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely representative of a real
world configuration.
Peter Xu May 3, 2023, 11:45 p.m. UTC | #22
On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
> On Wed, May 03, 2023, Peter Xu wrote:
> > Oops, bounced back from the list..
> > 
> > Forward with no attachment this time - I assume the information is still
> > enough in the paragraphs even without the flamegraphs.
> 
> The flamegraphs are definitely useful beyond what is captured here.  Not sure
> how to get them accepted on the list though.

Trying again with google drive:

single uffd:
https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view

32 uffds:
https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view

> 
> > > From what I got there, vmx_vcpu_load() gets more highlights than the
> > > spinlocks. I think that's the tlb flush broadcast.
> 
> No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
> smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
> triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
> VMCS caches not being coherent).
> 
> Task migration can also trigger IBPB (if mitigations are enabled), and also does
> an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
> 
> Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
> vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
> world configuration.

Yes, it went away:

https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view

With only the vcpu threads pinned (again, on 40 hardware cores/threads):

./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32

It seems to me that for some reason the scheduler ate more than I expected.
Maybe tomorrow I can try two more things:

  - Do cpu isolations, and
  - pin reader threads too (or just leave the readers on housekeeping cores)
Peter Xu May 4, 2023, 7:09 p.m. UTC | #23
On Wed, May 03, 2023 at 07:45:28PM -0400, Peter Xu wrote:
> On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
> > On Wed, May 03, 2023, Peter Xu wrote:
> > > Oops, bounced back from the list..
> > > 
> > > Forward with no attachment this time - I assume the information is still
> > > enough in the paragraphs even without the flamegraphs.
> > 
> > The flamegraphs are definitely useful beyond what is captured here.  Not sure
> > how to get them accepted on the list though.
> 
> Trying again with google drive:
> 
> single uffd:
> https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view
> 
> 32 uffds:
> https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view
> 
> > 
> > > > From what I got there, vmx_vcpu_load() gets more highlights than the
> > > > spinlocks. I think that's the tlb flush broadcast.
> > 
> > No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
> > smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
> > triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
> > VMCS caches not being coherent).
> > 
> > Task migration can also trigger IBPB (if mitigations are enabled), and also does
> > an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
> > 
> > Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
> > vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
> > world configuration.
> 
> Yes it does went away:
> 
> https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view
> 
> With pinning vcpu threads only (again, over 40 hard cores/threads):
> 
> ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> 
> It seems to me for some reason the scheduler ate more than I expected..
> Maybe tomorrow I can try two more things:
> 
>   - Do cpu isolations, and
>   - pin reader threads too (or just leave the readers on housekeeping cores)

I gave it a shot by isolating 32 cores and splitting them into two groups, 16 for
uffd threads and 16 for vcpu threads.  I got similar results and didn't
see much change.

I think it's possible it's just reaching the limit of my host since it only
has 40 cores anyway.  Throughput never goes over 350K faults/sec overall.

I assume this might not be the case for Anish if he has a much larger host,
so a similar test can be carried out there to see how that goes.  I think the
idea is to make sure the vcpu load overhead during sched-in is ruled out, then
see whether it can keep scaling with more cores.
Anish Moorthy May 5, 2023, 6:32 p.m. UTC | #24
On Thu, May 4, 2023 at 12:09 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 03, 2023 at 07:45:28PM -0400, Peter Xu wrote:
> > On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
> > > On Wed, May 03, 2023, Peter Xu wrote:
> > > > Oops, bounced back from the list..
> > > >
> > > > Forward with no attachment this time - I assume the information is still
> > > > enough in the paragraphs even without the flamegraphs.
> > >
> > > The flamegraphs are definitely useful beyond what is captured here.  Not sure
> > > how to get them accepted on the list though.
> >
> > Trying again with google drive:
> >
> > single uffd:
> > https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view
> >
> > 32 uffds:
> > https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view
> >
> > >
> > > > > From what I got there, vmx_vcpu_load() gets more highlights than the
> > > > > spinlocks. I think that's the tlb flush broadcast.
> > >
> > > No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
> > > smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
> > > triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
> > > VMCS caches not being coherent).
> > >
> > > Task migration can also trigger IBPB (if mitigations are enabled), and also does
> > > an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
> > >
> > > Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
> > > vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
> > > world configuration.
> >
> > Yes it does went away:
> >
> > https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view
> >
> > With pinning vcpu threads only (again, over 40 hard cores/threads):
> >
> > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> >
> > It seems to me for some reason the scheduler ate more than I expected..
> > Maybe tomorrow I can try two more things:

I pulled in your patch adding the -c flag, and confirmed that it
doesn't seem to make a huge difference to the self test's
numbers/scalability. The per-vcpu paging rate actually seems a bit
lower, going 117-103-77-55-18-9K for 1-32 vcpus.

> >   - Do cpu isolations, and
> >   - pin reader threads too (or just leave the readers on housekeeping cores)
>
> I gave it a shot by isolating 32 cores and split into two groups, 16 for
> uffd threads and 16 for vcpu threads.  I got similiar results and I don't
> see much changed.
>
> I think it's possible it's just reaching the limit of my host since it only
> got 40 cores anyway.  Throughput never hits over 350K faults/sec overall.
>
> I assume this might not be the case for Anish if he has a much larger host.
> So we can have similar test carried out and see how that goes.  I think the
> idea is making sure vcpu load overhead during sched-in is ruled out, then
> see whether it can keep scaling with more cores.

Peter, I'm afraid that isolating cores and splitting them into groups
is new to me. Do you mind explaining exactly what you did here?

Also, I finally got some of my own perf traces for the self test: [1]
shows what happens with 32 vCPUs faulting on a single uffd with 32
reader threads, with the contention clearly being a huge issue, and
[2] shows the effect of demand paging through memory faults on that
configuration. Unfortunately the export-to-svg functionality on our
internal tool seems broken, so I could only grab pngs :(

[1] https://drive.google.com/file/d/1YWiZTjb2FPmqj0tkbk4cuH0Oq8l65nsU/view?usp=drivesdk
[2] https://drive.google.com/file/d/1P76_6SSAHpLxNgDAErSwRmXBLkuDeFoA/view?usp=drivesdk
Nadav Amit May 5, 2023, 8:05 p.m. UTC | #25
> On May 3, 2023, at 4:45 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
>> On Wed, May 03, 2023, Peter Xu wrote:
>>> Oops, bounced back from the list..
>>> 
>>> Forward with no attachment this time - I assume the information is still
>>> enough in the paragraphs even without the flamegraphs.
>> 
>> The flamegraphs are definitely useful beyond what is captured here.  Not sure
>> how to get them accepted on the list though.
> 
> Trying again with google drive:
> 
> single uffd:
> https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view
> 
> 32 uffds:
> https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view
> 
>> 
>>>> From what I got there, vmx_vcpu_load() gets more highlights than the
>>>> spinlocks. I think that's the tlb flush broadcast.
>> 
>> No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
>> smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
>> triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
>> VMCS caches not being coherent).
>> 
>> Task migration can also trigger IBPB (if mitigations are enabled), and also does
>> an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
>> 
>> Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
>> vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
>> world configuration.
> 
> Yes it does went away:
> 
> https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view
> 
> With pinning vcpu threads only (again, over 40 hard cores/threads):
> 
> ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> 
> It seems to me for some reason the scheduler ate more than I expected..
> Maybe tomorrow I can try two more things:
> 
>  - Do cpu isolations, and
>  - pin reader threads too (or just leave the readers on housekeeping cores)

For the record (and I hope I do not repeat myself): these scheduler overheads
are something that I have encountered before.

The two main solutions I tried were:

1. Optional polling on the faulting thread to avoid context switch on the
   faulting thread.

(something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@vmware.com/ )

and 

2. IO-uring to avoid context switch on the handler thread.

In addition, as I mentioned before, the queue locks are something that can be
simplified.
Peter Xu May 8, 2023, 1:12 a.m. UTC | #26
Hi, Nadav,

On Fri, May 05, 2023 at 01:05:02PM -0700, Nadav Amit wrote:
> > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> > 
> > It seems to me for some reason the scheduler ate more than I expected..
> > Maybe tomorrow I can try two more things:
> > 
> >  - Do cpu isolations, and
> >  - pin reader threads too (or just leave the readers on housekeeping cores)
> 
> For the record (and I hope I do not repeat myself): these scheduler overheads
> is something that I have encountered before.
> 
> The two main solutions I tried were:
> 
> 1. Optional polling on the faulting thread to avoid context switch on the
>    faulting thread.
> 
> (something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@vmware.com/ )
> 
> and 
> 
> 2. IO-uring to avoid context switch on the handler thread.
> 
> In addition, as I mentioned before, the queue locks is something that can be
> simplified.

Right, thanks for double checking on that.  Though do you think these are
two separate issues to be looked into?

One is reducing context switch overhead with a static configuration,
which I think can be resolved by what you mentioned above and by the
io_uring series.

The other is the possibility of scaling userfaultfd by splitting guest
memory into a few chunks (literally the demand paging test with no -a).
Logically I think it should scale, given pcpu pinning of the vcpu threads to
avoid the kvm bottlenecks.

Side note: IIUC none of the above will resolve the problem right now if we
assume we can only have 1 uffd registered to the guest mem.

However I'm curious about testing multi-uffd because I wanted to make sure
nothing else stops the whole system from scaling with threads; hence I'd
expect to get higher faults/sec overall if we increase the cores
we use in the test.

If it already cannot scale for whatever reason, then a generic
solution may not be possible, at least for the kvm use case.  While OTOH if
multi-uffd can scale well, then there's a chance of a general solution as
long as we can remove the single-queue contention over the whole guest mem.

PS: Nadav, I think you have twice mentioned avoiding taking two locks for the
fault queue, which sounds reasonable.  Do you have a plan to post a patch?

Thanks,
Peter Xu May 8, 2023, 1:23 a.m. UTC | #27
On Fri, May 05, 2023 at 11:32:11AM -0700, Anish Moorthy wrote:
> Peter, I'm afraid that isolating cores and splitting them into groups
> is new to me. Do you mind explaining exactly what you did here?

So far I think the most important pinning is the vcpu thread pinning; we
should always test with that in this case so that the vcpu load overhead
doesn't prevent scaling with cores/vcpus.

What I did was (1) isolate cores (using isolcpus=xxx), then (2) manually
pin the userfault threads to some other isolated cores.  But maybe this
is not needed.

> 
> Also, I finally got some of my own perf traces for the self test: [1]
> shows what happens with 32 vCPUs faulting on a single uffd with 32
> reader threads, with the contention clearly being a huge issue, and
> [2] shows the effect of demand paging through memory faults on that
> configuration. Unfortunately the export-to-svg functionality on our
> internal tool seems broken, so I could only grab pngs :(
> 
> [1] https://drive.google.com/file/d/1YWiZTjb2FPmqj0tkbk4cuH0Oq8l65nsU/view?usp=drivesdk
> [2] https://drive.google.com/file/d/1P76_6SSAHpLxNgDAErSwRmXBLkuDeFoA/view?usp=drivesdk

Understood.  What I tested was without -a so it's using >1 uffds.

I explained why I think it could be useful to test this in my reply to
Nadav - does it make sense to you?  E.g. compare (1) 32 vcpus + 32
uffd threads and (2) 64 vcpus + 64 uffd threads; again we need to make sure
the vcpu threads are pinned using -c this time.  It'd be nice to pin the uffd
threads too, but I'm not sure whether it'll make a huge difference.

Thanks,
Anish Moorthy May 9, 2023, 8:52 p.m. UTC | #28
On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> I explained why I think it could be useful to test this in my reply to
> Nadav, do you think it makes sense to you?

Ah, I actually missed your reply to Nadav: didn't realize you had sent
*two* emails.

> While OTOH if multi-uffd can scale well, then there's a chance of
> general solution as long as we can remove the single-queue
> contention over the whole guest mem.

I don't quite understand your statement here: if we pursue multi-uffd,
then it seems to me that by definition we've removed the single
queue(s) for all of guest memory, and thus the associated contention.
And we'd still have the issue of multiple vCPUs contending for a
single UFFD.

But I do share some of your curiosity about multi-uffd performance,
especially since some of my earlier numbers indicated that multi-uffd
doesn't scale linearly, even when each vCPU corresponds to a single
UFFD.

So, I grabbed some more profiles for 32 and 64 vcpus using the following command
./demand_paging_test -b 512M -u MINOR -s shmem -v <n> -r 1 -c <1,...,n>

The 32-vcpu config achieves a per-vcpu paging rate of 8.8k. That rate
goes down to 3.9k (!) with 64 vCPUs. I don't immediately see the issue
from the traces, but safe to say it's definitely not scaling. Since I
applied your fixes from earlier, the prefaulting isn't being counted
against the demand paging rate either.

32-vcpu profile:
https://drive.google.com/file/d/19ZZDxZArhSsbW_5u5VcmLT48osHlO9TG/view?usp=drivesdk
64-vcpu profile:
https://drive.google.com/file/d/1dyLOLVHRNdkUoFFr7gxqtoSZGn1_GqmS/view?usp=drivesdk

Do let me know if you need svg files instead and I'll try and figure that out.
David Matlack May 9, 2023, 10:19 p.m. UTC | #29
On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> Upon receiving an annotated EFAULT, userspace may take appropriate
> action to resolve the failed access. For instance, this might involve a
> UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> migration postcopy.

As implemented, I think it will be prohibitively expensive if not
impossible for userspace to determine why KVM is returning EFAULT when
KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
decide the correct action to take (try to resolve or bail).

Consider the direct_map() case in PATCH 15. The only way to hit
that condition is a logic bug in KVM or data corruption. There isn't
really anything userspace can do to handle this situation, and it has no
way to distinguish that from faults due to absent mappings.

We could end up hitting cases where userspace loops forever doing
KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...

Maybe we should just change direct_map() to use KVM_BUG() and return
something other than EFAULT. But the general problem still exists and
even if we have confidence in all the current EFAULT sites, we don't have
much protection against someone adding an EFAULT in the future that
userspace can't handle.
Anish Moorthy May 10, 2023, 4:35 p.m. UTC | #30
On Tue, May 9, 2023 at 3:19 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > Upon receiving an annotated EFAULT, userspace may take appropriate
> > action to resolve the failed access. For instance, this might involve a
> > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> > migration postcopy.
>
> As implemented, I think it will be prohibitively expensive if not
> impossible for userspace to determine why KVM is returning EFAULT when
> KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
> decide the correct action to take (try to resolve or bail).
>
> Consider the direct_map() case in patch in PATCH 15. The only way to hit
> that condition is a logic bug in KVM or data corruption. There isn't
> really anything userspace can do to handle this situation, and it has no
> way to distinguish that from faults to due absent mappings.
>
> We could end up hitting cases where userspace loops forever doing
> KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
>
> Maybe we should just change direct_map() to use KVM_BUG() and return
> something other than EFAULT. But the general problem still exists and
> even if we have confidence in all the current EFAULT sites, we don't have
> much protection against someone adding an EFAULT in the future that
> userspace can't handle.

Hmm, I had been operating under the assumption that userspace would
always be able to make the memory access succeed somehow - I
(naively) didn't count on some guest memory access errors being
unrecoverable.

If that's the case, then we're back to needing some way to distinguish
the new faults/exits emitted by user_mem_abort/kvm_faultin_pfn with
the ABSENT_MAPPING_FAULT cap enabled :/ Let me paste in a bit of what
Sean said to refute the idea of a special page-fault-failure flag set in
those spots.

(from https://lore.kernel.org/kvm/ZBoIzo8FGxSyUJ2I@google.com/)
On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Setting a flag that essentially says "failure when handling a guest page fault"
> is problematic on multiple fronts.  Tying the ABI to KVM's internal implementation
> is not an option, i.e. the ABI would need to be defined as "on page faults from
> the guest".  And then the resulting behavior would be non-deterministic, e.g.
> userspace would see different behavior if KVM accessed a "bad" gfn via emulation
> instead of in response to a guest page fault.  And because of hardware TLBs, it
> would even be possible for the behavior to be non-deterministic on the same
> platform running the same guest code (though this would be exteremly unliklely
> in practice).
>
> And even if userspace is ok with only handling guest page faults_today_, I highly
> doubt that will hold forever.  I.e. at some point there will be a use case that
> wants to react to uaccess failures on fast-only memslots.
>
> Ignoring all of those issues, simplify flagging "this -EFAULT occurred when
> handling a guest page fault" isn't precise enough for userspace to blindly resolve
> the failure.  Even if KVM went through the trouble of setting information if and
> only if get_user_page_fast_only() failed while handling a guest page fault,
> userspace would still need/want a way to verify that the failure was expected and
> can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
> or mprotecting a page.

I wonder, how much of this problem comes down to my description/name
(I suggested MEMFAULT_REASON_PAGE_FAULT_FAILURE) for the flag? I see
Sean's concerns about the behavior issues when fast-only pages are
accessed via guest mode or via emulation/uaccess. What if the
description of the fast-only fault cap were tightened to something like
"generates vcpu faults/exits in response to *EPT/SLAT violations*
which cannot be mapped by present userspace page table entries"? I
think that would eliminate the emulation/uaccess issues (though I may
be wrong, so please let me know).

Of course, by the time we get to kvm_faultin_pfn we don't know that
we're faulting pages in response to an EPT violation... but if the
idea makes sense then that might justify some plumbing code.
Peter Xu May 10, 2023, 9:50 p.m. UTC | #31
Hi, Anish,

On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > I explained why I think it could be useful to test this in my reply to
> > Nadav, do you think it makes sense to you?
> 
> Ah, I actually missed your reply to Nadav: didn't realize you had sent
> *two* emails.
> 
> > While OTOH if multi-uffd can scale well, then there's a chance of
> > general solution as long as we can remove the single-queue
> > contention over the whole guest mem.
> 
> I don't quite understand your statement here: if we pursue multi-uffd,
> then it seems to me that by definition we've removed the single
> queue(s) for all of guest memory, and thus the associated contention.
> And we'd still have the issue of multiple vCPUs contending for a
> single UFFD.

Yes, as I mentioned it's purely something I was curious about, and it also shows the
best result we could have if we went with a more generic solution; it doesn't really
solve the issue immediately.

> 
> But I do share some of your curiosity about multi-uffd performance,
> especially since some of my earlier numbers indicated that multi-uffd
> doesn't scale linearly, even when each vCPU corresponds to a single
> UFFD.
> 
> So, I grabbed some more profiles for 32 and 64 vcpus using the following command
> ./demand_paging_test -b 512M -u MINOR -s shmem -v <n> -r 1 -c <1,...,n>
> 
> The 32-vcpu config achieves a per-vcpu paging rate of 8.8k. That rate
> goes down to 3.9k (!) with 64 vCPUs. I don't immediately see the issue
> from the traces, but safe to say it's definitely not scaling. Since I
> applied your fixes from earlier, the prefaulting isn't being counted
> against the demand paging rate either.
> 
> 32-vcpu profile:
> https://drive.google.com/file/d/19ZZDxZArhSsbW_5u5VcmLT48osHlO9TG/view?usp=drivesdk
> 64-vcpu profile:
> https://drive.google.com/file/d/1dyLOLVHRNdkUoFFr7gxqtoSZGn1_GqmS/view?usp=drivesdk
> 
> Do let me know if you need svg files instead and I'll try and figure that out.

Thanks for trying all these out, and sorry if I caused confusion in my
reply.

What I wanted to do is understand whether there's still a chance to
provide a generic solution.  I don't know why you had a bunch of pmu
stacks showing in the graph - perhaps you forgot to disable some of the perf
events when doing the test?  Let me know if you figure out why it happened
like that (so far I haven't seen it here), but I feel guilty about continuing
to overload you with such questions.

The major problem I had with this series is that it's definitely not a clean
approach.  Say, even if you rely entirely on the userspace app, you'll still need to
rely on userfaultfd for kernel traps in corner cases or it just won't work.
IIUC that's also Nadav's concern.

But I also agree it seems to resolve every bottleneck in the kernel, no
matter whether it's in the scheduler or in vcpu loading. After all, you throw
everything into userspace.. :)

Considering that most of the changes are for -EFAULT traps, and the 2nd part
of the change is very self-contained and maintainable, no objection here to having
it.  I'll leave that to the maintainers to decide.

Thanks,
Sean Christopherson May 10, 2023, 10:35 p.m. UTC | #32
On Tue, May 09, 2023, David Matlack wrote:
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > Upon receiving an annotated EFAULT, userspace may take appropriate
> > action to resolve the failed access. For instance, this might involve a
> > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> > migration postcopy.
> 
> As implemented, I think it will be prohibitively expensive if not
> impossible for userspace to determine why KVM is returning EFAULT when
> KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
> decide the correct action to take (try to resolve or bail).
> 
> Consider the direct_map() case in patch in PATCH 15. The only way to hit
> that condition is a logic bug in KVM or data corruption. There isn't
> really anything userspace can do to handle this situation, and it has no
> way to distinguish that from faults to due absent mappings.
> 
> We could end up hitting cases where userspace loops forever doing
> KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
> 
> Maybe we should just change direct_map() to use KVM_BUG() and return
> something other than EFAULT. But the general problem still exists and
> even if we have confidence in all the current EFAULT sites, we don't have
> much protection against someone adding an EFAULT in the future that
> userspace can't handle.

Yeah, when I speed read the series, several of the conversions stood out as being
"wrong".  My (potentially unstated) idea was that KVM would only signal
KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
i.e. when the fault _might_ be resolvable by userspace.

If we want to populate KVM_EXIT_MEMORY_FAULT even on kernel bugs, and anything
else that userspace can't possibly resolve, then the easiest thing would be to
add a flag to signal that the fault is fatal, i.e. that userspace shouldn't retry.
Adding a flag may be more robust in the long term as it will force developers to
think about whether or not a fault is fatal, versus relying on documentation to
say "don't signal KVM_EXIT_MEMORY_FAULT for fatal EFAULT conditions".

Side topic, KVM x86 really should have a version of KVM_SYNC_X86_REGS that stores
registers for userspace, but doesn't load registers.  That would allow userspace
to detect many infinite loops with minimal overhead, e.g. (1) set KVM_STORE_X86_REGS
during demand paging, (2) check RIP on every exit to see if the vCPU is making
forward progress, (3) escalate to checking all registers if RIP hasn't changed for
N exits, and finally (4) take action if the guest is well and truly stuck after
N more exits.  KVM could even store RIP on every exit if userspace wanted to avoid
the overhead of storing registers until userspace actually wants all registers.
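
A rough sketch of that RIP check, using today's KVM_CAP_SYNC_REGS (which syncs
regs/sregs/events on every exit, so it costs more than a store-only variant
would; the threshold and escalation policy below are made up for illustration):

  #include <stdbool.h>
  #include <stdint.h>
  #include <linux/kvm.h>

  #define STUCK_EXIT_THRESHOLD 1000  /* illustrative only */

  struct stall_detector {
      uint64_t last_rip;
      uint64_t same_rip_exits;
  };

  /* Arm once before the run loop: store regs on exit, never load them back. */
  static void stall_detector_arm(struct kvm_run *run)
  {
      run->kvm_valid_regs = KVM_SYNC_X86_REGS;
      run->kvm_dirty_regs = 0;
  }

  /* Call after each KVM_RUN exit; true means the guest may be stuck and
   * userspace should escalate (e.g. compare all registers, then bail). */
  static bool vcpu_looks_stuck(struct stall_detector *d, struct kvm_run *run)
  {
      uint64_t rip = run->s.regs.regs.rip;

      if (rip != d->last_rip) {
          d->last_rip = rip;
          d->same_rip_exits = 0;
          return false;
      }
      return ++d->same_rip_exits > STUCK_EXIT_THRESHOLD;
  }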
Anish Moorthy May 10, 2023, 11:44 p.m. UTC | #33
On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Yeah, when I speed read the series, several of the conversions stood out as being
> "wrong".  My (potentially unstated) idea was that KVM would only signal
> KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> i.e. when the fault _might_ be resolvable by userspace.

Well, you definitely tried to get the idea across somehow - even in my
cover letter here, I state

> As a first step, KVM_CAP_MEMORY_FAULT_INFO is introduced. This
> capability is meant to deliver useful information to userspace (i.e. the
> problematic range of guest physical memory) when a vCPU fails a guest
> memory access.

So the fact that I'm doing something more here is unintentional and
stems from unfamiliarity with all of the ways in which KVM does (or
does not) perform user accesses.

Sean, besides direct_map, which other patches did you notice as needing
to be dropped/marked as unrecoverable errors?
David Matlack May 11, 2023, 5:17 p.m. UTC | #34
On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> What I wanted to do is to understand whether there's still chance to
> provide a generic solution.  I don't know why you have had a bunch of pmu
> stack showing in the graph, perhaps you forgot to disable some of the perf
> events when doing the test?  Let me know if you figure out why it happened
> like that (so far I didn't see), but I feel guilty to keep overloading you
> with such questions.
>
> The major problem I had with this series is it's definitely not a clean
> approach.  Say, even if you'll all rely on userapp you'll still need to
> rely on userfaultfd for kernel traps on corner cases or it just won't work.
> IIUC that's also the concern from Nadav.

This is a long thread, so apologies if the following has already been discussed.

Would per-tid userfaultfd support be a generic solution? i.e. Allow
userspace to create a userfaultfd that is tied to a specific task. Any
userfaults encountered by that task use that fd, rather than the
process-wide fd. I'm making the assumption here that each of these fds
would have independent signaling mechanisms/queues and so this would
solve the scaling problem.

A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
vCPU for handling userfault requests. This seems like it'd have
roughly the same scalability characteristics as the KVM -EFAULT
approach.
Axel Rasmussen May 11, 2023, 5:33 p.m. UTC | #35
On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > What I wanted to do is to understand whether there's still chance to
> > provide a generic solution.  I don't know why you have had a bunch of pmu
> > stack showing in the graph, perhaps you forgot to disable some of the perf
> > events when doing the test?  Let me know if you figure out why it happened
> > like that (so far I didn't see), but I feel guilty to keep overloading you
> > with such questions.
> >
> > The major problem I had with this series is it's definitely not a clean
> > approach.  Say, even if you'll all rely on userapp you'll still need to
> > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > IIUC that's also the concern from Nadav.
>
> This is a long thread, so apologies if the following has already been discussed.
>
> Would per-tid userfaultfd support be a generic solution? i.e. Allow
> userspace to create a userfaultfd that is tied to a specific task. Any
> userfaults encountered by that task use that fd, rather than the
> process-wide fd. I'm making the assumption here that each of these fds
> would have independent signaling mechanisms/queues and so this would
> solve the scaling problem.
>
> A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> vCPU for handling userfault requests. This seems like it'd have
> roughly the same scalability characteristics as the KVM -EFAULT
> approach.

I think this would work in principle, but it's significantly different
from what exists today.

The splitting of userfaultfds Peter is describing is splitting up the
HVA address space, not splitting per-thread.

I think for this design, we'd need to change UFFD registration so
multiple UFFDs can register the same VMA, but can be filtered so they
only receive fault events caused by some particular tid(s).

This might also incur some (small?) overhead, because in the fault
path we now need to maintain some data structure so we can lookup
which UFFD to notify based on a combination of the address and our
tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.

I think it's worth keeping in mind that a selling point of Anish's
approach is that it's a very small change. It's plausible we can come
up with some alternative way to scale, but it seems to me everything
suggested so far is likely to require a lot more code, complexity, and
effort vs. Anish's approach.
David Matlack May 11, 2023, 7:05 p.m. UTC | #36
On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > What I wanted to do is to understand whether there's still chance to
> > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > events when doing the test?  Let me know if you figure out why it happened
> > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > with such questions.
> > >
> > > The major problem I had with this series is it's definitely not a clean
> > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > IIUC that's also the concern from Nadav.
> >
> > This is a long thread, so apologies if the following has already been discussed.
> >
> > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > userspace to create a userfaultfd that is tied to a specific task. Any
> > userfaults encountered by that task use that fd, rather than the
> > process-wide fd. I'm making the assumption here that each of these fds
> > would have independent signaling mechanisms/queues and so this would
> > solve the scaling problem.
> >
> > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > vCPU for handling userfault requests. This seems like it'd have
> > roughly the same scalability characteristics as the KVM -EFAULT
> > approach.
> 
> I think this would work in principle, but it's significantly different
> from what exists today.
> 
> The splitting of userfaultfds Peter is describing is splitting up the
> HVA address space, not splitting per-thread.
> 
> I think for this design, we'd need to change UFFD registration so
> multiple UFFDs can register the same VMA, but can be filtered so they
> only receive fault events caused by some particular tid(s).
> 
> This might also incur some (small?) overhead, because in the fault
> path we now need to maintain some data structure so we can lookup
> which UFFD to notify based on a combination of the address and our
> tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.

I was (perhaps naively) assuming the lookup would be as simple as:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..e9856e2ba9ef 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
         */
        mmap_assert_locked(mm);

-       ctx = vma->vm_userfaultfd_ctx.ctx;
+       if (current->userfaultfd_ctx)
+               ctx = current->userfaultfd_ctx;
+       else
+               ctx = vma->vm_userfaultfd_ctx.ctx;
        if (!ctx)
                goto out;

> 
> I think it's worth keeping in mind that a selling point of Anish's
> approach is that it's a very small change. It's plausible we can come
> up with some alternative way to scale, but it seems to me everything
> suggested so far is likely to require a lot more code, complexity, and
> effort vs. Anish's approach.

Agreed.

Mostly I think the per-thread UFFD approach would add complexity on the
userspace side of things. With Anish's approach userspace is able to
trivially re-use the vCPU thread (and its associated pCPU if pinned) to
handle the request. That gets more complicated when juggling the extra
paired threads.

The per-thread approach would require a new userfault UAPI change, which
I think is a higher bar than the KVM UAPI change proposed here.

The per-thread approach would require KVM to call into slow GUP and take
the mmap_lock before contacting userspace. I'm not 100% convinced that's
a bad thing long term (e.g. it avoids the false-positive -EFAULT exits
in Anish's proposal), but could have performance implications.

Lastly, inter-thread communication is likely slower than returning to
userspace from KVM_RUN. So the per-thread approach might increase the
end-to-end latency of demand fetches.
Axel Rasmussen May 11, 2023, 7:45 p.m. UTC | #37
On Thu, May 11, 2023 at 12:05 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > What I wanted to do is to understand whether there's still chance to
> > > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > > events when doing the test?  Let me know if you figure out why it happened
> > > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > > with such questions.
> > > >
> > > > The major problem I had with this series is it's definitely not a clean
> > > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > > IIUC that's also the concern from Nadav.
> > >
> > > This is a long thread, so apologies if the following has already been discussed.
> > >
> > > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > > userspace to create a userfaultfd that is tied to a specific task. Any
> > > userfaults encountered by that task use that fd, rather than the
> > > process-wide fd. I'm making the assumption here that each of these fds
> > > would have independent signaling mechanisms/queues and so this would
> > > solve the scaling problem.
> > >
> > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > > vCPU for handling userfault requests. This seems like it'd have
> > > roughly the same scalability characteristics as the KVM -EFAULT
> > > approach.
> >
> > I think this would work in principle, but it's significantly different
> > from what exists today.
> >
> > The splitting of userfaultfds Peter is describing is splitting up the
> > HVA address space, not splitting per-thread.
> >
> > I think for this design, we'd need to change UFFD registration so
> > multiple UFFDs can register the same VMA, but can be filtered so they
> > only receive fault events caused by some particular tid(s).
> >
> > This might also incur some (small?) overhead, because in the fault
> > path we now need to maintain some data structure so we can lookup
> > which UFFD to notify based on a combination of the address and our
> > tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.
>
> I was (perhaps naively) assuming the lookup would be as simple as:
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 44d1ee429eb0..e9856e2ba9ef 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>          */
>         mmap_assert_locked(mm);
>
> -       ctx = vma->vm_userfaultfd_ctx.ctx;
> +       if (current->userfaultfd_ctx)
> +               ctx = current->userfaultfd_ctx;
> +       else
> +               ctx = vma->vm_userfaultfd_ctx.ctx;
>         if (!ctx)
>                 goto out;

Hmm, perhaps. It might have to be more complicated if we want to allow
a single task to have both per-TID UFFDs for some addresses, and
"global" UFFDs for others.



Actually, while thinking about this, another wrinkle:

Imagine we have per-thread UFFDs. Thread X faults on some address, and
goes to sleep waiting for its paired resolver thread to resolve the
fault.

In the meantime, thread Y also faults on the same address, before the
resolution happens.

In the existing model, there is a single UFFD context per VMA, and
therefore a single wait queue for all threads to wait on. In the
per-TID-UFFD design, now each thread has its own context, and
ostensibly its own wait queue (since the wait queue locks are where
Anish saw the contention, I think this is exactly what we want to
split up). When we have this "multiple threads waiting on the same
address" situation, how do we ensure the fault is resolved exactly
once? And how do we wake up all of the sleeping threads when it is
resolved?

I'm sure it's solvable, but especially doing it without any locks /
contention seems like it could be a bit complicated.

>
> >
> > I think it's worth keeping in mind that a selling point of Anish's
> > approach is that it's a very small change. It's plausible we can come
> > up with some alternative way to scale, but it seems to me everything
> > suggested so far is likely to require a lot more code, complexity, and
> > effort vs. Anish's approach.
>
> Agreed.
>
> Mostly I think the per-thread UFFD approach would add complexity on the
> userspace side of things. With Anish's approach userspace is able to
> trivially re-use the vCPU thread (and its associated pCPU if pinned) to
> handle the request. That gets more complicated when juggling the extra
> paired threads.
>
> The per-thread approach would require a new userfault UAPI change, which
> I think is a higher bar than the KVM UAPI change proposed here.
>
> The per-thread approach would require KVM to call into slow GUP and take
> the mmap_lock before contacting userspace. I'm not 100% convinced that's
> a bad thing long term (e.g. it avoids the false-positive -EFAULT exits
> in Anish's proposal), but could have performance implications.
>
> Lastly, inter-thread communication is likely slower than returning to
> userspace from KVM_RUN. So the per-thread approach might increase the
> end-to-end latency of demand fetches.
Peter Xu May 15, 2023, 3:05 p.m. UTC | #38
On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > What I wanted to do is to understand whether there's still chance to
> > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > events when doing the test?  Let me know if you figure out why it happened
> > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > with such questions.
> > >
> > > The major problem I had with this series is it's definitely not a clean
> > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > IIUC that's also the concern from Nadav.
> >
> > This is a long thread, so apologies if the following has already been discussed.
> >
> > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > userspace to create a userfaultfd that is tied to a specific task. Any
> > userfaults encountered by that task use that fd, rather than the
> > process-wide fd. I'm making the assumption here that each of these fds
> > would have independent signaling mechanisms/queues and so this would
> > solve the scaling problem.
> >
> > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > vCPU for handling userfault requests. This seems like it'd have
> > roughly the same scalability characteristics as the KVM -EFAULT
> > approach.
> 
> I think this would work in principle, but it's significantly different
> from what exists today.
> 
> The splitting of userfaultfds Peter is describing is splitting up the
> HVA address space, not splitting per-thread.

[sorry mostly travel last week]

No, my idea was actually to split per thread, but since there's currently no
way to split per thread, I was thinking we should start testing with a split
per VMA so it "emulates" the best we can have out of a per-thread split.

> 
> I think for this design, we'd need to change UFFD registration so
> multiple UFFDs can register the same VMA, but can be filtered so they
> only receive fault events caused by some particular tid(s).

Having multiple real uffds per VMA is challenging: as you mentioned, enqueuing
may be more of an effort, and it's hard to know which uffd's attributes apply
to the VMA, because each uffd has its own feature list.

What we may need here is only the "logical queue" of the uffd.  So I was
considering supporting multi-queue for a _single_ userfaultfd.

I actually mentioned some of it in the very initial reply to Anish:

https://lore.kernel.org/all/ZEGuogfbtxPNUq7t@x1n/

        If the real problem lies in a bunch of threads queuing, is it
        possible that we can provide just more queues for the events?  The
        readers will just need to go over all the queues.

        Way to decide "which thread uses which queue" can be another
        problem, what comes up quickly to me is a "hash(tid) % n_queues"
        but maybe it can be better.  Each vcpu thread will have different
        tids, then they can hopefully scale on the queues.

The queues may also need to be created as sub-uffds, each supporting only part
of the uffd interface (read/poll, COPY/CONTINUE/ZEROPAGE) but not all of it
(e.g. UFFDIO_API shouldn't be supported there).

> This might also incur some (small?) overhead, because in the fault path
> we now need to maintain some data structure so we can lookup which UFFD
> to notify based on a combination of the address and our tid. Today, since
> VMAs and UFFDs are 1:1 this lookup is trivial.  I think it's worth
> keeping in mind that a selling point of Anish's approach is that it's a
> very small change. It's plausible we can come up with some alternative
> way to scale, but it seems to me everything suggested so far is likely to
> require a lot more code, complexity, and effort vs. Anish's approach.

Yes, I think that's also the reason why I thought I was overloading this work
too much.  If Anish eagerly wants this and can make it useful, then I'm
totally fine with it, because maintaining the 2nd cap seems trivial assuming
the maintainer would already accept the 1st cap.

I just hope it'll be thoroughly tested, even with Google's private userspace
hypervisor, so that the kernel interface is solid (even if not straightforward
to a new user seeing it) and serves its goal for the problem Anish is
tackling.

Thanks,
Peter Xu May 15, 2023, 3:16 p.m. UTC | #39
On Thu, May 11, 2023 at 12:45:32PM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 12:05 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> > > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > What I wanted to do is to understand whether there's still chance to
> > > > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > > > events when doing the test?  Let me know if you figure out why it happened
> > > > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > > > with such questions.
> > > > >
> > > > > The major problem I had with this series is it's definitely not a clean
> > > > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > > > IIUC that's also the concern from Nadav.
> > > >
> > > > This is a long thread, so apologies if the following has already been discussed.
> > > >
> > > > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > > > userspace to create a userfaultfd that is tied to a specific task. Any
> > > > userfaults encountered by that task use that fd, rather than the
> > > > process-wide fd. I'm making the assumption here that each of these fds
> > > > would have independent signaling mechanisms/queues and so this would
> > > > solve the scaling problem.
> > > >
> > > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > > > vCPU for handling userfault requests. This seems like it'd have
> > > > roughly the same scalability characteristics as the KVM -EFAULT
> > > > approach.
> > >
> > > I think this would work in principle, but it's significantly different
> > > from what exists today.
> > >
> > > The splitting of userfaultfds Peter is describing is splitting up the
> > > HVA address space, not splitting per-thread.
> > >
> > > I think for this design, we'd need to change UFFD registration so
> > > multiple UFFDs can register the same VMA, but can be filtered so they
> > > only receive fault events caused by some particular tid(s).
> > >
> > > This might also incur some (small?) overhead, because in the fault
> > > path we now need to maintain some data structure so we can lookup
> > > which UFFD to notify based on a combination of the address and our
> > > tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.
> >
> > I was (perhaps naively) assuming the lookup would be as simple as:
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 44d1ee429eb0..e9856e2ba9ef 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >          */
> >         mmap_assert_locked(mm);
> >
> > -       ctx = vma->vm_userfaultfd_ctx.ctx;
> > +       if (current->userfaultfd_ctx)
> > +               ctx = current->userfaultfd_ctx;
> > +       else
> > +               ctx = vma->vm_userfaultfd_ctx.ctx;
> >         if (!ctx)
> >                 goto out;

This is an interesting idea, but I'll just double check before I grow
task_struct and see whether that's the only solution. :)

I'd start with hash(tid) or even hash(pcpu) to choose the queue.  In a pinned
use case hash(pcpu) should probably reach a similar goal here (and I'd guess
hash(tid) would too if vcpus are mostly created in one shot, though that's
slightly trickier).
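
i.e., something along these lines (sketch only: the multi-queue uffd itself is
hypothetical, so struct uffd_fault_queue and the n_queues/queues fields are
made up; jhash is just one possible hash):

#include <linux/jhash.h>
#include <linux/sched.h>

/* Sketch: pick one of the (hypothetical) sub-queues of a single uffd for the
 * faulting task.  current->pid is the thread id; a pinned setup could hash
 * smp_processor_id() instead. */
static struct uffd_fault_queue *uffd_pick_queue(struct userfaultfd_ctx *ctx)
{
	u32 idx = jhash_1word(current->pid, 0) % ctx->n_queues; /* made-up fields */

	return &ctx->queues[idx];
}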

> 
> Hmm, perhaps. It might have to be more complicated if we want to allow
> a single task to have both per-TID UFFDs for some addresses, and
> "global" UFFDs for others.
> 
> 
> 
> Actually, while thinking about this, another wrinkle:
> 
> Imagine we have per-thread UFFDs. Thread X faults on some address, and
> goes to sleep waiting for its paired resolver thread to resolve the
> fault.
> 
> In the meantime, thread Y also faults on the same address, before the
> resolution happens.
> 
> In the existing model, there is a single UFFD context per VMA, and
> therefore a single wait queue for all threads to wait on. In the
> per-TID-UFFD design, now each thread has its own context, and
> ostensibly its own wait queue (since the wait queue locks are where
> Anish saw the contention, I think this is exactly what we want to
> split up). When we have this "multiple threads waiting on the same
> address" situation, how do we ensure the fault is resolved exactly
> once? And how do we wake up all of the sleeping threads when it is
> resolved?

We probably need to wake them one by one in that case.  The 2nd-Nth
UFFDIO_COPY/CONTINUE will fail with -EEXIST anyway, and then the userapp will
need a UFFDIO_WAKE, I assume.
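
For example, a resolver in that scheme might do roughly the following (the
per-thread queues themselves are hypothetical, but the UFFDIO_CONTINUE
-EEXIST / UFFDIO_WAKE dance below is how the existing ioctls behave):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Resolve a minor fault at 'addr'; if another resolver already installed the
 * PTE, the CONTINUE fails with EEXIST and we only need to wake the thread
 * sleeping on *our* queue. */
static void resolve_or_wake(int uffd, unsigned long addr, unsigned long psize)
{
	struct uffdio_continue cont = {
		.range = { .start = addr & ~(psize - 1), .len = psize },
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == 0)
		return;			/* installed the PTE and woke the waiter */

	if (errno == EEXIST) {
		struct uffdio_range wake = cont.range;

		ioctl(uffd, UFFDIO_WAKE, &wake);
	}
}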

> 
> I'm sure it's solvable, but especially doing it without any locks /
> contention seems like it could be a bit complicated.

IMHO no complicated locking is needed.  Here the "complicated locking" is
done with the pgtable lock, and it should be reflected back to the userapp as -EEXISTs.

> 
> >
> > >
> > > I think it's worth keeping in mind that a selling point of Anish's
> > > approach is that it's a very small change. It's plausible we can come
> > > up with some alternative way to scale, but it seems to me everything
> > > suggested so far is likely to require a lot more code, complexity, and
> > > effort vs. Anish's approach.
> >
> > Agreed.
> >
> > Mostly I think the per-thread UFFD approach would add complexity on the
> > userspace side of things. With Anish's approach userspace is able to
> > trivially re-use the vCPU thread (and its associated pCPU if pinned) to
> > handle the request. That gets more complicated when juggling the extra
> > paired threads.
> >
> > The per-thread approach would require a new userfault UAPI change, which
> > I think is a higher bar than the KVM UAPI change proposed here.
> >
> > The per-thread approach would require KVM to call into slow GUP and take
> > the mmap_lock before contacting userspace. I'm not 100% convinced that's
> > a bad thing long term (e.g. it avoids the false-positive -EFAULT exits
> > in Anish's proposal), but could have performance implications.
> >
> > Lastly, inter-thread communication is likely slower than returning to
> > userspace from KVM_RUN. So the per-thread approach might increase the
> > end-to-end latency of demand fetches.

Right.  The overhead here is (IMHO):

  - KVM solution: vcpu exit -> enter again; whatever happens in the
    procedure of exit/enter will count.

  - mm solution: at least the schedule overhead; meanwhile let's hope we can
    scale first elsewhere (and I'm not sure if there'll be other issues,
    e.g., even if we can split the uffd queues, hopefully there's nowhere I
    overlooked that still needs the shared uffd context)

I'm not sure which one will be higher; maybe it depends (e.g., some
specific cases where vcpu KVM_RUN can have higher overhead when loading?).

Thanks,
Anish Moorthy May 15, 2023, 5:16 p.m. UTC | #40
On Wed, May 10, 2023 at 2:51 PM Peter Xu <peterx@redhat.com> wrote:
>
> What I wanted to do is to understand whether there's still chance to
> provide a generic solution.  I don't know why you have had a bunch of pmu
> stack showing in the graph, perhaps you forgot to disable some of the perf
> events when doing the test?  Let me know if you figure out why it happened
> like that (so far I didn't see), but I feel guilty to keep overloading you
> with such questions.

Not at all, I'm happy to help try and answer as many questions as I
can. It helps me learn as well.

I'll see about revisiting these traces, but I'll be busy for the next
few days with other things, so I doubt they'll come soon. I'll jump
back into the mailing list sometime on Thursday/Friday.
Anish Moorthy May 23, 2023, 5:49 p.m. UTC | #41
On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Yeah, when I speed read the series, several of the conversions stood out as being
> > "wrong".  My (potentially unstated) idea was that KVM would only signal
> > KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> > i.e. when the fault _might_ be resolvable by userspace.
>
> Sean, besides direct_map, which other patches did you notice as needing
> to be dropped/marked as unrecoverable errors?

I tried going through on my own to identify the incorrect
annotations: here's my read.

Correct (or can easily be corrected)
-----------------------------------------------
- user_mem_abort
  Incorrect as is: the annotations in patch 19 are incorrect, as they
cover an error-on-no-slot case and one more I don't fully understand;
the one in patch 20 should be good though.

- kvm_vcpu_read/write_guest_page:
  Incorrect as-is, but can be fixed: the current annotations cover
gpa_to_hva_memslot(_prot) failures, which can happen when "gpa" is not
covered by a memslot. However, we can leave these as bare efaults and
just annotate the copy_to/from_user failures, which userspace should
be able to resolve by checking/changing the slot permissions.

- kvm_handle_error_pfn
  Correct: at the annotation point, the fault must be either (a) a
read/write to a writable memslot or (b) a read from a readable one.
hva_to_pfn must have returned KVM_PFN_ERR_FAULT, which userspace can
attempt to resolve using an MADV_* call (a sketch of that userspace
loop is at the end of this message).

Flatly Incorrect (will drop in next version)
-----------------------------------------------
- kvm_handle_page_fault
  efault corresponds to a kernel bug not resolvable by userspace

- direct_map
  Same as above

- kvm_mmu_page_fault
  Not a "leaf" return of efault, Also, the
check-for-efault-and-annotate here catches efaults which userspace can
do nothing about: such as the one from direct_map [1]

Unsure (Switch kvm_read/write_guest to kvm_vcpu_read/write_guest?)
-----------------------------------------------

- setup_vmgexit_scratch and kvm_pv_clock_pairing
  These efault on errors from kvm_read/write_guest, and theoretically
it does seem to make sense to annotate them. However, the annotations
are incorrect as-is for the same reason that the
kvm_vcpu_read/write_guest_page annotations need to be corrected.

In fact, the kvm_read/write_guest calls are of the form
"kvm_read_guest(vcpu->kvm, ...)": if we switched these calls to
kvm_vcpu_read/write_guest instead, then it seems like we'd get correct
annotations for free. Would it be correct to make this switch? If not,
then perhaps an optional kvm_vcpu* parameter for the "non-vcpu"
read/write functions strictly for annotation purposes? That seems
rather ugly though...
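
Just to illustrate how small the mechanical part of that switch would be
(whether it is semantically safe is exactly the question above; this is an
illustrative fragment, not the literal callsites):

	/* today: no vcpu in scope for the annotation machinery */
	ret = kvm_read_guest(vcpu->kvm, gpa, buf, len);

	/* proposed: the same access, but with the vcpu available for
	 * filling kvm_run.memory_fault on failure */
	ret = kvm_vcpu_read_guest(vcpu, gpa, buf, len);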

Unsure (Similar-ish to above)
-----------------------------------------------

- kvm_hv_get_assist_page
  Incorrect as-is. The existing annotation would cover some efaults
which userspace doesn't seem likely to be able to resolve [2]. Right
after those, though, there's a copy_from_user which it could make sense
to annotate.

The efault here comes from failures of
kvm_read_guest_cached/kvm_read_guest_offset_cached, for which all of
the calls are again of the form "f(vcpu->kvm, ...)". Again, we'll need
either an (optional) vcpu parameter or to refactor these to just take
a "kvm_vcpu" instead if we want to annotate just the failing
uaccesses.

PS: I plan to add a couple of flags to the memory fault exit to
identify whether the failed access was a read/write/exec.


[1] https://github.com/torvalds/linux/blob/v6.3/arch/x86/kvm/mmu/mmu.c#L3196
[2] https://github.com/torvalds/linux/blob/v6.3/virt/kvm/kvm_main.c#L3261-L3270
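
And for reference, the userspace loop all of this is in service of. Sketch
only: gpa_to_hva() is the VMM's own memslot lookup, KVM_EXIT_MEMORY_FAULT is
what this series adds, and the memory_fault field names here are assumptions
that may not match the final UAPI.

#include <errno.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

unsigned long gpa_to_hva(unsigned long gpa);   /* VMM's own memslot lookup */

/* Called after KVM_RUN returned -1 with errno == EFAULT. */
static int handle_vcpu_efault(struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
		return -1;	/* un-annotated EFAULT: still possible, give up */

	unsigned long hva = gpa_to_hva(run->memory_fault.gpa); /* field names assumed */
	unsigned long len = run->memory_fault.len;

	/* Fetch the contents from the source, then fault the pages in, e.g.: */
	if (madvise((void *)hva, len, MADV_POPULATE_WRITE))
		return -1;
	/* (or UFFDIO_CONTINUE on the shared mapping, for minor-fault postcopy) */

	return 0;	/* resolved; the caller re-enters KVM_RUN */
}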
Oliver Upton June 1, 2023, 10:43 p.m. UTC | #42
On Tue, May 23, 2023 at 10:49:04AM -0700, Anish Moorthy wrote:
> On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Yeah, when I speed read the series, several of the conversions stood out as being
> > > "wrong".  My (potentially unstated) idea was that KVM would only signal
> > > KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> > > i.e. when the fault _might_ be resolvable by userspace.
> >
> > Sean, besides direct_map, which other patches did you notice as needing
> > to be dropped/marked as unrecoverable errors?
> 
> I tried going through on my own to try and identify the incorrect
> annotations: here's my read.
> 
> Correct (or can easily be corrected)
> -----------------------------------------------
> - user_mem_abort
>   Incorrect as is: the annotations in patch 19 are incorrect, as they
> cover an error-on-no-slot case and one more I don't fully understand:

That other case is a wart we endearingly refer to as MTE (Memory Tagging
Extension). You theoretically _could_ pop out an annotated exit here, as
userspace likely messed up the mapping (like PROT_MTE missing).

But I'm perfectly happy letting someone complain about it before we go
out of our way to annotate that one. So feel free to drop.