[WIP,v2,00/14] Avoiding slow get-user-pages via memory fault exit

Message ID	20230315021738.1151386-1-amoorthy@google.com (mailing list archive)
Headers	show Return-Path: <kvm-owner@vger.kernel.org> Date: Wed, 15 Mar 2023 02:17:24 +0000 Mime-Version: 1.0 Message-ID: <20230315021738.1151386-1-amoorthy@google.com> Subject: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit From: Anish Moorthy <amoorthy@google.com> To: seanjc@google.com Cc: jthoughton@google.com, kvm@vger.kernel.org, Anish Moorthy <amoorthy@google.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	Avoiding slow get-user-pages via memory fault exit \| expand [WIP,v2,00/14] Avoiding slow get-user-pages via memory fault exit [WIP,v2,01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test [WIP,v2,02/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via T… [WIP,v2,03/14] KVM: Allow hva_pfn_fast to resolve read-only faults. [WIP,v2,04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field [WIP,v2,05/14] KVM: x86: Implement memory fault exit for direct_map [WIP,v2,06/14] KVM: x86: Implement memory fault exit for kvm_handle_page_fault [WIP,v2,07/14] KVM: x86: Implement memory fault exit for setup_vmgexit_scratch [WIP,v2,08/14] KVM: x86: Implement memory fault exit for FNAME(fetch) [WIP,v2,09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation [WIP,v2,10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT [WIP,v2,11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit [WIP,v2,12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT [WIP,v2,13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm [WIP,v2,14/14] KVM: selftests: Handle memory fault exits in demand_paging_test

Anish Moorthy March 15, 2023, 2:17 a.m. UTC

Hi Sean, here's what I'm planing to send up as v2 of the scalable
userfaultfd series.

Don't worry, I'm not asking you to review this all :) I just have a few
remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important
enough to mention before I ask for more attention from others, and they'll be
clearer with the patches in hand. Anything else I'm happy to find out about when
I send the actual v2.

I want your opinion on

1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT
   (described in the api.rst file)
2. Whether the UNKNOWN exit reason cases (everywhere but
   handle_error_pfn atm) would need to be given "real" reasons
   before this could be merged.
3. If you think I've missed sites that currently -EFAULT to userspace

About (3): after we agreed to only tackle cases where -EFAULT currently makes it
to userspace, I went though our list and tried to trace which EFAULTS actually
bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to
sanity-check my findings with you. Lmk if you see obvious errors in my list
below.

--- EFAULTs under KVM_RUN ---

Confident that needs conversion (already converted)
---------------------------------------------------
* direct_map
* handle_error_pfn
* setup_vmgexit_scratch
* kvm_handle_page_fault
* FNAME(fetch)

EFAULT does not propagate to userspace (do not convert)
-------------------------------------------------------
* record_steal_time (arch/x86/kvm/x86.c:3463)
* hva_to_pfn_retry
* kvm_vcpu_map
* FNAME(update_accessed_dirty_bits)
* __kvm_gfn_to_hva_cache_init
  Might actually make it to userspace, but only through
  kvm_read|write_guest_offset_cached- would be covered by those conversions
* kvm_gfn_to_hva_cache_init
* __kvm_read_guest_page
* hva_to_pfn_remapped
  handle_error_pfn will handle this for the scalable uffd case. Don't think
  other callers -EFAULT to userspace.

Still unsure if needs conversion
--------------------------------
* __kvm_read_guest_atomic
  The EFAULT might be propagated though FNAME(sync_page)?
* kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
* __kvm_write_guest_page
  Called from kvm_write_guest_offset_cached: if that needs change, this does too
* kvm_write_guest_page
  Two interesting paths:
      - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here
        (arch/x86/kvm/x86.c:9578)
      - kvm_write_guest_offset_cached returns this directly (so if that needs
        change, this does too)
* kvm_read_guest_offset_cached
  I actually do see a path to userspace, but it's through hyper-v, which we've
  said is out of scope for round 1.

--- Actual Cover Letter ---

Omitted: hasn't changed much since v1 anyways

--- Changelog ---

WIP v2
  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (14):
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: Allow hva_pfn_fast to resolve read-only faults.
  KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run
    field
  KVM: x86: Implement memory fault exit for direct_map
  KVM: x86: Implement memory fault exit for kvm_handle_page_fault
  KVM: x86: Implement memory fault exit for setup_vmgexit_scratch
  KVM: x86: Implement memory fault exit for FNAME(fetch)
  KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit
  KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  74 ++++-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |  29 +-
 arch/x86/kvm/mmu/mmu.c                        |  42 ++-
 arch/x86/kvm/mmu/paging_tmpl.h                |   4 +-
 arch/x86/kvm/svm/sev.c                        |   4 +-
 arch/x86/kvm/x86.c                            |   2 +
 include/linux/kvm_host.h                      |  22 ++
 include/uapi/linux/kvm.h                      |  19 ++
 tools/include/uapi/linux/kvm.h                |  17 ++
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 253 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  18 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 160 ++++++-----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 virt/kvm/kvm_main.c                           |  41 ++-
 20 files changed, 544 insertions(+), 158 deletions(-)

Oliver Upton March 17, 2023, 5:43 p.m. UTC | #1

Anish,

Generally the 'RFC PATCH' prefix is used for patches that are for feedback
only (i.e. not to be considered for inclusion).

On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> Hi Sean, here's what I'm planing to send up as v2 of the scalable
> userfaultfd series.

I don't see a ton of value in sending a targeted posting of a series to the
list. IOW, just CC all of the appropriate reviewers+maintainers. I promise,
we won't bite.

> Don't worry, I'm not asking you to review this all :) I just have a few
> remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important
> enough to mention before I ask for more attention from others, and they'll be
> clearer with the patches in hand. Anything else I'm happy to find out about when
> I send the actual v2.
> 
> I want your opinion on
> 
> 1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT
>    (described in the api.rst file)
> 2. Whether the UNKNOWN exit reason cases (everywhere but
>    handle_error_pfn atm) would need to be given "real" reasons
>    before this could be merged.
> 3. If you think I've missed sites that currently -EFAULT to userspace
> 
> About (3): after we agreed to only tackle cases where -EFAULT currently makes it
> to userspace, I went though our list and tried to trace which EFAULTS actually
> bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to
> sanity-check my findings with you. Lmk if you see obvious errors in my list
> below.
> 
> --- EFAULTs under KVM_RUN ---
> 
> Confident that needs conversion (already converted)
> ---------------------------------------------------
> * direct_map
> * handle_error_pfn
> * setup_vmgexit_scratch
> * kvm_handle_page_fault
> * FNAME(fetch)
> 
> EFAULT does not propagate to userspace (do not convert)
> -------------------------------------------------------
> * record_steal_time (arch/x86/kvm/x86.c:3463)
> * hva_to_pfn_retry
> * kvm_vcpu_map
> * FNAME(update_accessed_dirty_bits)
> * __kvm_gfn_to_hva_cache_init
>   Might actually make it to userspace, but only through
>   kvm_read|write_guest_offset_cached- would be covered by those conversions
> * kvm_gfn_to_hva_cache_init
> * __kvm_read_guest_page
> * hva_to_pfn_remapped
>   handle_error_pfn will handle this for the scalable uffd case. Don't think
>   other callers -EFAULT to userspace.
> 
> Still unsure if needs conversion
> --------------------------------
> * __kvm_read_guest_atomic
>   The EFAULT might be propagated though FNAME(sync_page)?
> * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> * __kvm_write_guest_page
>   Called from kvm_write_guest_offset_cached: if that needs change, this does too

The low-level accessors are common across architectures and can be called from
other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
and convert that into an exit?

> * kvm_write_guest_page
>   Two interesting paths:
>       - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here
>         (arch/x86/kvm/x86.c:9578)

This is a hypercall handler, so the return code is ABI with the guest. So it
shouldn't be converted to an exit to userspace.

Sean Christopherson March 17, 2023, 6:13 p.m. UTC | #2

On Fri, Mar 17, 2023, Oliver Upton wrote:
> On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > userfaultfd series.
> 
> I don't see a ton of value in sending a targeted posting of a series to the
> list. IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> we won't bite.

+1.  And though I discourage off-list review, if something is really truly not
ready for public review, e.g. will do more harm than good by causing confusing,
then just send the patches off-list.  Half measures like this will just make folks
grumpy.

> > Don't worry, I'm not asking you to review this all :) I just have a few
> > remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important
> > enough to mention before I ask for more attention from others, and they'll be
> > clearer with the patches in hand. Anything else I'm happy to find out about when
> > I send the actual v2.
> > 
> > I want your opinion on
> > 
> > 1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT
> >    (described in the api.rst file)
> > 2. Whether the UNKNOWN exit reason cases (everywhere but
> >    handle_error_pfn atm) would need to be given "real" reasons
> >    before this could be merged.
> > 3. If you think I've missed sites that currently -EFAULT to userspace
> > 
> > About (3): after we agreed to only tackle cases where -EFAULT currently makes it
> > to userspace, I went though our list and tried to trace which EFAULTS actually
> > bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to
> > sanity-check my findings with you. Lmk if you see obvious errors in my list
> > below.
> > 
> > --- EFAULTs under KVM_RUN ---
> > 
> > Confident that needs conversion (already converted)
> > ---------------------------------------------------
> > * direct_map
> > * handle_error_pfn
> > * setup_vmgexit_scratch
> > * kvm_handle_page_fault
> > * FNAME(fetch)
> > 
> > EFAULT does not propagate to userspace (do not convert)
> > -------------------------------------------------------
> > * record_steal_time (arch/x86/kvm/x86.c:3463)
> > * hva_to_pfn_retry
> > * kvm_vcpu_map
> > * FNAME(update_accessed_dirty_bits)
> > * __kvm_gfn_to_hva_cache_init
> >   Might actually make it to userspace, but only through
> >   kvm_read|write_guest_offset_cached- would be covered by those conversions
> > * kvm_gfn_to_hva_cache_init
> > * __kvm_read_guest_page
> > * hva_to_pfn_remapped
> >   handle_error_pfn will handle this for the scalable uffd case. Don't think
> >   other callers -EFAULT to userspace.
> >
> > Still unsure if needs conversion
> > --------------------------------
> > * __kvm_read_guest_atomic
> >   The EFAULT might be propagated though FNAME(sync_page)?
> > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> > * __kvm_write_guest_page
> >   Called from kvm_write_guest_offset_cached: if that needs change, this does too
> 
> The low-level accessors are common across architectures and can be called from
> other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> and convert that into an exit?

Ya, as things stand today, the conversions _must_ be performed at the caller, as
there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
all of x86's paravirt code just suppresses user memory faults :-(

Anish, when we discussed this off-list, what I meant by limiting the intial support
to existing -EFAULT cases was limiting support to existing cases where KVM directly
returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.

David Matlack March 17, 2023, 6:46 p.m. UTC | #3

On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 17, 2023, Oliver Upton wrote:
> > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > userfaultfd series.
> >
> > I don't see a ton of value in sending a targeted posting of a series to the
> > list.

But isn't it already generating value as you were able to weigh in and
provide feedback on technical aspects that you would not have been
otherwise able to if Anish had just messaged Sean?

> > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > we won't bite.

I disagree. While I think it's fine to reach out to someone off-list
to discuss a specific question, if you're going to message all
reviewers and maintainers, you should also CC the mailing list. That
allows more people to follow along and weigh in if necessary.

>
> +1.  And though I discourage off-list review, if something is really truly not
> ready for public review, e.g. will do more harm than good by causing confusing,
> then just send the patches off-list.  Half measures like this will just make folks
> grumpy.

In this specific case, Anish very clearly laid out the reason for
sending the patches and asked very specific directed questions in the
cover letter and called it out as WIP. Yes "WIP" should have been
"RFC" but other than that should anything have been different?

Oliver Upton March 17, 2023, 6:54 p.m. UTC | #4

David,

On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote:
> On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > > userfaultfd series.
> > >
> > > I don't see a ton of value in sending a targeted posting of a series to the
> > > list.
> 
> But isn't it already generating value as you were able to weigh in and
> provide feedback on technical aspects that you would not have been
> otherwise able to if Anish had just messaged Sean?

No, I only happened upon this series looking at lore. My problem is that
none of the affected maintainers or reviewers were cc'ed on the series.

> > > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > > we won't bite.
> 
> I disagree. While I think it's fine to reach out to someone off-list
> to discuss a specific question, if you're going to message all
> reviewers and maintainers, you should also CC the mailing list. That
> allows more people to follow along and weigh in if necessary.

I think there may be a slight disconnect here :) I'm in no way encouraging
off-list discussion and instead asking that mail on the list arrives in
the right folks' inboxes.

Posting an RFC on the list was absolutely the right thing to do.

> >
> > +1.  And though I discourage off-list review, if something is really truly not
> > ready for public review, e.g. will do more harm than good by causing confusing,
> > then just send the patches off-list.  Half measures like this will just make folks
> > grumpy.
> 
> In this specific case, Anish very clearly laid out the reason for
> sending the patches and asked very specific directed questions in the
> cover letter and called it out as WIP. Yes "WIP" should have been
> "RFC" but other than that should anything have been different?

See above

David Matlack March 17, 2023, 6:59 p.m. UTC | #5

On Fri, Mar 17, 2023 at 11:54 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> David,
>
> On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote:
> > On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > > > userfaultfd series.
> > > >
> > > > I don't see a ton of value in sending a targeted posting of a series to the
> > > > list.
> >
> > But isn't it already generating value as you were able to weigh in and
> > provide feedback on technical aspects that you would not have been
> > otherwise able to if Anish had just messaged Sean?
>
> No, I only happened upon this series looking at lore. My problem is that
> none of the affected maintainers or reviewers were cc'ed on the series.
>
> > > > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > > > we won't bite.
> >
> > I disagree. While I think it's fine to reach out to someone off-list
> > to discuss a specific question, if you're going to message all
> > reviewers and maintainers, you should also CC the mailing list. That
> > allows more people to follow along and weigh in if necessary.
>
> I think there may be a slight disconnect here :) I'm in no way encouraging
> off-list discussion and instead asking that mail on the list arrives in
> the right folks' inboxes.
>
> Posting an RFC on the list was absolutely the right thing to do.

Doh. I misunderstood what you meant. We are in violent agreement!

Anish Moorthy March 17, 2023, 7:53 p.m. UTC | #6

On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Mar 17, 2023 at 11:54 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > David,
> >
> > On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote:
> > > On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > > > > userfaultfd series.
> > > > >
> > > > > I don't see a ton of value in sending a targeted posting of a series to the
> > > > > list.
> > >
> > > But isn't it already generating value as you were able to weigh in and
> > > provide feedback on technical aspects that you would not have been
> > > otherwise able to if Anish had just messaged Sean?
> >
> > No, I only happened upon this series looking at lore. My problem is that
> > none of the affected maintainers or reviewers were cc'ed on the series.
> >
> > > > > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > > > > we won't bite.
> > >
> > > I disagree. While I think it's fine to reach out to someone off-list
> > > to discuss a specific question, if you're going to message all
> > > reviewers and maintainers, you should also CC the mailing list. That
> > > allows more people to follow along and weigh in if necessary.
> >
> > I think there may be a slight disconnect here :) I'm in no way encouraging
> > off-list discussion and instead asking that mail on the list arrives in
> > the right folks' inboxes.
> >
> > Posting an RFC on the list was absolutely the right thing to do.
>
> Doh. I misunderstood what you meant. We are in violent agreement!

Noted. Also, thanks Oliver and Isaku for paying attention to the
series despite it being obscure.

On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > Still unsure if needs conversion
> > > --------------------------------
> > > * __kvm_read_guest_atomic
> > >   The EFAULT might be propagated though FNAME(sync_page)?
> > > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> > > * __kvm_write_guest_page
> > >   Called from kvm_write_guest_offset_cached: if that needs change, this does too
> >
> > The low-level accessors are common across architectures and can be called from
> > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> > and convert that into an exit?
>
> Ya, as things stand today, the conversions _must_ be performed at the caller, as
> there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
> all of x86's paravirt code just suppresses user memory faults :-(
>
> Anish, when we discussed this off-list, what I meant by limiting the intial support
> to existing -EFAULT cases was limiting support to existing cases where KVM directly
> returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
> returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.

Don't worry, we eventually got there off-list :)

This brings us back to my original set of questions. As has already
been pointed out, I'll have to revisit my "Confident that needs
conversion" changes and tweak them so that the vCPU exit is populated
only for the call sites where the -EFAULT makes it to userspace. I
still want feedback on if I've mis-identified any of the functions in
my "EFAULT does not propagate to userspace" list and whether there are
functions/callers in the "Still unsure if needs conversion" which do
have return paths to KVM_RUN.

Sean Christopherson March 17, 2023, 8:35 p.m. UTC | #7

On Wed, Mar 15, 2023, Anish Moorthy wrote:
> Still unsure if needs conversion
> --------------------------------
> * __kvm_read_guest_atomic
>   The EFAULT might be propagated though FNAME(sync_page)?
> * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> * __kvm_write_guest_page
>   Called from kvm_write_guest_offset_cached: if that needs change, this does too
> * kvm_write_guest_page
>   Two interesting paths:
>       - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here
>         (arch/x86/kvm/x86.c:9578)
>       - kvm_write_guest_offset_cached returns this directly (so if that needs
>         change, this does too)
> * kvm_read_guest_offset_cached
>   I actually do see a path to userspace, but it's through hyper-v, which we've
>   said is out of scope for round 1.

To clarify: I didn't intend to make Hyper-V explicitly out-of-scope, rather Hyper-V
happened to be out-of-scope because the existing code suppresses -EFAULT.  I don't
think we should make any particular feature/area out-of-scope, as that will lead
to even more arbitrary behavior than we already have.

What I intended, and what I still think we should do, is limit the scope of the
capability to existing paths that return -EFAULT to userspace.  Trying to fix all
of the paths that suppress -EFAULT is going to be ridiculously difficult as so
much of the behavior is arguaby ABI, and there's no authoritative documentation
on what's supposed to happen.  I definitely would love to fix those paths in the
long term, but for the initial implementation/conversion, I think it makes sense
to punt on them, otherwise it'll take months/years to merge this code.

Back to the Hyper-V case, assuming you're referring to the use of kvm_hv_verify_vp_assist()
in nested_svm_vmrun(), that code is a mess.  KVM shouldn't inject a #GP and then
exit to userspace, e.g. the guest might see a spurious #GP if userspace fixes the
fault and resume the instruction.  And just a few lines below, KVM skips the
instruction if kvm_vcpu_map() returns -EFAULT.

As above, ideally that code would be converted to gracefully report the error,
but it's such a snafu that the easiest thing might be to change the "return ret;"
to "return 1;" until we fix all such KVM-on-HyperV code.

Sean Christopherson March 17, 2023, 10:03 p.m. UTC | #8

On Fri, Mar 17, 2023, Anish Moorthy wrote:
> On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote:
> > > The low-level accessors are common across architectures and can be called from
> > > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> > > and convert that into an exit?
> >
> > Ya, as things stand today, the conversions _must_ be performed at the caller, as
> > there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
> > all of x86's paravirt code just suppresses user memory faults :-(
> >
> > Anish, when we discussed this off-list, what I meant by limiting the intial support
> > to existing -EFAULT cases was limiting support to existing cases where KVM directly
> > returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
> > returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.
> 
> Don't worry, we eventually got there off-list :)
> 
> This brings us back to my original set of questions. As has already
> been pointed out, I'll have to revisit my "Confident that needs
> conversion" changes and tweak them so that the vCPU exit is populated
> only for the call sites where the -EFAULT makes it to userspace. I
> still want feedback on if I've mis-identified any of the functions in
> my "EFAULT does not propagate to userspace" list and whether there are
> functions/callers in the "Still unsure if needs conversion" which do
> have return paths to KVM_RUN.

As you've probably gathered from the type of feedback you're receiving, identifying
the conversion touchpoints isn't going to be the long pole of this series.  Correctly
identifying all of the touchpoints may not be easy, but fixing any cases we get wrong
will likely be straightforward.  And realistically, no matter how many eyeballs look
at the code, odds are good we'll miss at least one case.  In other words, don't worry
too much about getting all the touchpoints correct on the first version.  Getting the
uAPI right is much more important.

And rather than rely on code review to get things right, we should be able to
detect issues programmatically.  E.g. use fault injection to make gup() and/or
uaccess fail (might even be wired up already?), and hack in a WARN in the KVM_RUN
path to assert that KVM_EXIT_MEMORY_FAULT is filled if the return code is -EFAULT
(assuming we go don't try to get KVM to return 0 everywhere), e.g. something like
the below would at least flag the "misses", although debug could still prove to be
annoying.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67b890e54cf1..cccae0ad1436 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4100,6 +4100,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
                }
                r = kvm_arch_vcpu_ioctl_run(vcpu);
                trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
+               WARN_ON(r == -EFAULT &&
+                       vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT);
                break;
        }
        case KVM_GET_REGS: {

Sean Christopherson March 20, 2023, 3:56 p.m. UTC | #9

On Fri, Mar 17, 2023, Sean Christopherson wrote:
> On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote:
> > > > The low-level accessors are common across architectures and can be called from
> > > > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> > > > and convert that into an exit?
> > >
> > > Ya, as things stand today, the conversions _must_ be performed at the caller, as
> > > there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
> > > all of x86's paravirt code just suppresses user memory faults :-(
> > >
> > > Anish, when we discussed this off-list, what I meant by limiting the intial support
> > > to existing -EFAULT cases was limiting support to existing cases where KVM directly
> > > returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
> > > returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.
> > 
> > Don't worry, we eventually got there off-list :)
> > 
> > This brings us back to my original set of questions. As has already
> > been pointed out, I'll have to revisit my "Confident that needs
> > conversion" changes and tweak them so that the vCPU exit is populated
> > only for the call sites where the -EFAULT makes it to userspace. I
> > still want feedback on if I've mis-identified any of the functions in
> > my "EFAULT does not propagate to userspace" list and whether there are
> > functions/callers in the "Still unsure if needs conversion" which do
> > have return paths to KVM_RUN.
> 
> As you've probably gathered from the type of feedback you're receiving, identifying
> the conversion touchpoints isn't going to be the long pole of this series.  Correctly
> identifying all of the touchpoints may not be easy, but fixing any cases we get wrong
> will likely be straightforward.  And realistically, no matter how many eyeballs look
> at the code, odds are good we'll miss at least one case.  In other words, don't worry
> too much about getting all the touchpoints correct on the first version.  Getting the
> uAPI right is much more important.
> 
> And rather than rely on code review to get things right, we should be able to
> detect issues programmatically.  E.g. use fault injection to make gup() and/or
> uaccess fail (might even be wired up already?), and hack in a WARN in the KVM_RUN
> path to assert that KVM_EXIT_MEMORY_FAULT is filled if the return code is -EFAULT
> (assuming we go don't try to get KVM to return 0 everywhere), e.g. something like
> the below would at least flag the "misses", although debug could still prove to be
> annoying.
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 67b890e54cf1..cccae0ad1436 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4100,6 +4100,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
>                 }
>                 r = kvm_arch_vcpu_ioctl_run(vcpu);
>                 trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
> +               WARN_ON(r == -EFAULT &&
> +                       vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT);

Gah, I inverted the second check, this should be 

		WARN_ON(r == -EFAULT &&
			vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
		
>                 break;
>         }
>         case KVM_GET_REGS: {
>

[WIP,v2,00/14] Avoiding slow get-user-pages via memory fault exit

Message

Comments