Message ID | 20230315021738.1151386-1-amoorthy@google.com (mailing list archive) |
---|---|
Headers | show |
Series | Avoiding slow get-user-pages via memory fault exit | expand |
Anish, Generally the 'RFC PATCH' prefix is used for patches that are for feedback only (i.e. not to be considered for inclusion). On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote: > Hi Sean, here's what I'm planing to send up as v2 of the scalable > userfaultfd series. I don't see a ton of value in sending a targeted posting of a series to the list. IOW, just CC all of the appropriate reviewers+maintainers. I promise, we won't bite. > Don't worry, I'm not asking you to review this all :) I just have a few > remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important > enough to mention before I ask for more attention from others, and they'll be > clearer with the patches in hand. Anything else I'm happy to find out about when > I send the actual v2. > > I want your opinion on > > 1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT > (described in the api.rst file) > 2. Whether the UNKNOWN exit reason cases (everywhere but > handle_error_pfn atm) would need to be given "real" reasons > before this could be merged. > 3. If you think I've missed sites that currently -EFAULT to userspace > > About (3): after we agreed to only tackle cases where -EFAULT currently makes it > to userspace, I went though our list and tried to trace which EFAULTS actually > bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to > sanity-check my findings with you. Lmk if you see obvious errors in my list > below. > > --- EFAULTs under KVM_RUN --- > > Confident that needs conversion (already converted) > --------------------------------------------------- > * direct_map > * handle_error_pfn > * setup_vmgexit_scratch > * kvm_handle_page_fault > * FNAME(fetch) > > EFAULT does not propagate to userspace (do not convert) > ------------------------------------------------------- > * record_steal_time (arch/x86/kvm/x86.c:3463) > * hva_to_pfn_retry > * kvm_vcpu_map > * FNAME(update_accessed_dirty_bits) > * __kvm_gfn_to_hva_cache_init > Might actually make it to userspace, but only through > kvm_read|write_guest_offset_cached- would be covered by those conversions > * kvm_gfn_to_hva_cache_init > * __kvm_read_guest_page > * hva_to_pfn_remapped > handle_error_pfn will handle this for the scalable uffd case. Don't think > other callers -EFAULT to userspace. > > Still unsure if needs conversion > -------------------------------- > * __kvm_read_guest_atomic > The EFAULT might be propagated though FNAME(sync_page)? > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226) > * __kvm_write_guest_page > Called from kvm_write_guest_offset_cached: if that needs change, this does too The low-level accessors are common across architectures and can be called from other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT and convert that into an exit? > * kvm_write_guest_page > Two interesting paths: > - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here > (arch/x86/kvm/x86.c:9578) This is a hypercall handler, so the return code is ABI with the guest. So it shouldn't be converted to an exit to userspace.
On Fri, Mar 17, 2023, Oliver Upton wrote: > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote: > > Hi Sean, here's what I'm planing to send up as v2 of the scalable > > userfaultfd series. > > I don't see a ton of value in sending a targeted posting of a series to the > list. IOW, just CC all of the appropriate reviewers+maintainers. I promise, > we won't bite. +1. And though I discourage off-list review, if something is really truly not ready for public review, e.g. will do more harm than good by causing confusing, then just send the patches off-list. Half measures like this will just make folks grumpy. > > Don't worry, I'm not asking you to review this all :) I just have a few > > remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important > > enough to mention before I ask for more attention from others, and they'll be > > clearer with the patches in hand. Anything else I'm happy to find out about when > > I send the actual v2. > > > > I want your opinion on > > > > 1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT > > (described in the api.rst file) > > 2. Whether the UNKNOWN exit reason cases (everywhere but > > handle_error_pfn atm) would need to be given "real" reasons > > before this could be merged. > > 3. If you think I've missed sites that currently -EFAULT to userspace > > > > About (3): after we agreed to only tackle cases where -EFAULT currently makes it > > to userspace, I went though our list and tried to trace which EFAULTS actually > > bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to > > sanity-check my findings with you. Lmk if you see obvious errors in my list > > below. > > > > --- EFAULTs under KVM_RUN --- > > > > Confident that needs conversion (already converted) > > --------------------------------------------------- > > * direct_map > > * handle_error_pfn > > * setup_vmgexit_scratch > > * kvm_handle_page_fault > > * FNAME(fetch) > > > > EFAULT does not propagate to userspace (do not convert) > > ------------------------------------------------------- > > * record_steal_time (arch/x86/kvm/x86.c:3463) > > * hva_to_pfn_retry > > * kvm_vcpu_map > > * FNAME(update_accessed_dirty_bits) > > * __kvm_gfn_to_hva_cache_init > > Might actually make it to userspace, but only through > > kvm_read|write_guest_offset_cached- would be covered by those conversions > > * kvm_gfn_to_hva_cache_init > > * __kvm_read_guest_page > > * hva_to_pfn_remapped > > handle_error_pfn will handle this for the scalable uffd case. Don't think > > other callers -EFAULT to userspace. > > > > Still unsure if needs conversion > > -------------------------------- > > * __kvm_read_guest_atomic > > The EFAULT might be propagated though FNAME(sync_page)? > > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226) > > * __kvm_write_guest_page > > Called from kvm_write_guest_offset_cached: if that needs change, this does too > > The low-level accessors are common across architectures and can be called from > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT > and convert that into an exit? Ya, as things stand today, the conversions _must_ be performed at the caller, as there are (sadly) far too many flows where KVM squashes the error. E.g. almost all of x86's paravirt code just suppresses user memory faults :-( Anish, when we discussed this off-list, what I meant by limiting the intial support to existing -EFAULT cases was limiting support to existing cases where KVM directly returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever returned _within KVM_ while handling KVM_RUN. My apologies if I didn't make that clear.
On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote: > > On Fri, Mar 17, 2023, Oliver Upton wrote: > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote: > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable > > > userfaultfd series. > > > > I don't see a ton of value in sending a targeted posting of a series to the > > list. But isn't it already generating value as you were able to weigh in and provide feedback on technical aspects that you would not have been otherwise able to if Anish had just messaged Sean? > > IOW, just CC all of the appropriate reviewers+maintainers. I promise, > > we won't bite. I disagree. While I think it's fine to reach out to someone off-list to discuss a specific question, if you're going to message all reviewers and maintainers, you should also CC the mailing list. That allows more people to follow along and weigh in if necessary. > > +1. And though I discourage off-list review, if something is really truly not > ready for public review, e.g. will do more harm than good by causing confusing, > then just send the patches off-list. Half measures like this will just make folks > grumpy. In this specific case, Anish very clearly laid out the reason for sending the patches and asked very specific directed questions in the cover letter and called it out as WIP. Yes "WIP" should have been "RFC" but other than that should anything have been different?
David, On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote: > On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote: > > > > On Fri, Mar 17, 2023, Oliver Upton wrote: > > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote: > > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable > > > > userfaultfd series. > > > > > > I don't see a ton of value in sending a targeted posting of a series to the > > > list. > > But isn't it already generating value as you were able to weigh in and > provide feedback on technical aspects that you would not have been > otherwise able to if Anish had just messaged Sean? No, I only happened upon this series looking at lore. My problem is that none of the affected maintainers or reviewers were cc'ed on the series. > > > IOW, just CC all of the appropriate reviewers+maintainers. I promise, > > > we won't bite. > > I disagree. While I think it's fine to reach out to someone off-list > to discuss a specific question, if you're going to message all > reviewers and maintainers, you should also CC the mailing list. That > allows more people to follow along and weigh in if necessary. I think there may be a slight disconnect here :) I'm in no way encouraging off-list discussion and instead asking that mail on the list arrives in the right folks' inboxes. Posting an RFC on the list was absolutely the right thing to do. > > > > +1. And though I discourage off-list review, if something is really truly not > > ready for public review, e.g. will do more harm than good by causing confusing, > > then just send the patches off-list. Half measures like this will just make folks > > grumpy. > > In this specific case, Anish very clearly laid out the reason for > sending the patches and asked very specific directed questions in the > cover letter and called it out as WIP. Yes "WIP" should have been > "RFC" but other than that should anything have been different? See above
On Fri, Mar 17, 2023 at 11:54 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > David, > > On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote: > > On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Fri, Mar 17, 2023, Oliver Upton wrote: > > > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote: > > > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable > > > > > userfaultfd series. > > > > > > > > I don't see a ton of value in sending a targeted posting of a series to the > > > > list. > > > > But isn't it already generating value as you were able to weigh in and > > provide feedback on technical aspects that you would not have been > > otherwise able to if Anish had just messaged Sean? > > No, I only happened upon this series looking at lore. My problem is that > none of the affected maintainers or reviewers were cc'ed on the series. > > > > > IOW, just CC all of the appropriate reviewers+maintainers. I promise, > > > > we won't bite. > > > > I disagree. While I think it's fine to reach out to someone off-list > > to discuss a specific question, if you're going to message all > > reviewers and maintainers, you should also CC the mailing list. That > > allows more people to follow along and weigh in if necessary. > > I think there may be a slight disconnect here :) I'm in no way encouraging > off-list discussion and instead asking that mail on the list arrives in > the right folks' inboxes. > > Posting an RFC on the list was absolutely the right thing to do. Doh. I misunderstood what you meant. We are in violent agreement!
On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote: > > On Fri, Mar 17, 2023 at 11:54 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > David, > > > > On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote: > > > On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Fri, Mar 17, 2023, Oliver Upton wrote: > > > > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote: > > > > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable > > > > > > userfaultfd series. > > > > > > > > > > I don't see a ton of value in sending a targeted posting of a series to the > > > > > list. > > > > > > But isn't it already generating value as you were able to weigh in and > > > provide feedback on technical aspects that you would not have been > > > otherwise able to if Anish had just messaged Sean? > > > > No, I only happened upon this series looking at lore. My problem is that > > none of the affected maintainers or reviewers were cc'ed on the series. > > > > > > > IOW, just CC all of the appropriate reviewers+maintainers. I promise, > > > > > we won't bite. > > > > > > I disagree. While I think it's fine to reach out to someone off-list > > > to discuss a specific question, if you're going to message all > > > reviewers and maintainers, you should also CC the mailing list. That > > > allows more people to follow along and weigh in if necessary. > > > > I think there may be a slight disconnect here :) I'm in no way encouraging > > off-list discussion and instead asking that mail on the list arrives in > > the right folks' inboxes. > > > > Posting an RFC on the list was absolutely the right thing to do. > > Doh. I misunderstood what you meant. We are in violent agreement! Noted. Also, thanks Oliver and Isaku for paying attention to the series despite it being obscure. On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote: > > On Fri, Mar 17, 2023, Oliver Upton wrote: > > > Still unsure if needs conversion > > > -------------------------------- > > > * __kvm_read_guest_atomic > > > The EFAULT might be propagated though FNAME(sync_page)? > > > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226) > > > * __kvm_write_guest_page > > > Called from kvm_write_guest_offset_cached: if that needs change, this does too > > > > The low-level accessors are common across architectures and can be called from > > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT > > and convert that into an exit? > > Ya, as things stand today, the conversions _must_ be performed at the caller, as > there are (sadly) far too many flows where KVM squashes the error. E.g. almost > all of x86's paravirt code just suppresses user memory faults :-( > > Anish, when we discussed this off-list, what I meant by limiting the intial support > to existing -EFAULT cases was limiting support to existing cases where KVM directly > returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever > returned _within KVM_ while handling KVM_RUN. My apologies if I didn't make that clear. Don't worry, we eventually got there off-list :) This brings us back to my original set of questions. As has already been pointed out, I'll have to revisit my "Confident that needs conversion" changes and tweak them so that the vCPU exit is populated only for the call sites where the -EFAULT makes it to userspace. I still want feedback on if I've mis-identified any of the functions in my "EFAULT does not propagate to userspace" list and whether there are functions/callers in the "Still unsure if needs conversion" which do have return paths to KVM_RUN.
On Wed, Mar 15, 2023, Anish Moorthy wrote: > Still unsure if needs conversion > -------------------------------- > * __kvm_read_guest_atomic > The EFAULT might be propagated though FNAME(sync_page)? > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226) > * __kvm_write_guest_page > Called from kvm_write_guest_offset_cached: if that needs change, this does too > * kvm_write_guest_page > Two interesting paths: > - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here > (arch/x86/kvm/x86.c:9578) > - kvm_write_guest_offset_cached returns this directly (so if that needs > change, this does too) > * kvm_read_guest_offset_cached > I actually do see a path to userspace, but it's through hyper-v, which we've > said is out of scope for round 1. To clarify: I didn't intend to make Hyper-V explicitly out-of-scope, rather Hyper-V happened to be out-of-scope because the existing code suppresses -EFAULT. I don't think we should make any particular feature/area out-of-scope, as that will lead to even more arbitrary behavior than we already have. What I intended, and what I still think we should do, is limit the scope of the capability to existing paths that return -EFAULT to userspace. Trying to fix all of the paths that suppress -EFAULT is going to be ridiculously difficult as so much of the behavior is arguaby ABI, and there's no authoritative documentation on what's supposed to happen. I definitely would love to fix those paths in the long term, but for the initial implementation/conversion, I think it makes sense to punt on them, otherwise it'll take months/years to merge this code. Back to the Hyper-V case, assuming you're referring to the use of kvm_hv_verify_vp_assist() in nested_svm_vmrun(), that code is a mess. KVM shouldn't inject a #GP and then exit to userspace, e.g. the guest might see a spurious #GP if userspace fixes the fault and resume the instruction. And just a few lines below, KVM skips the instruction if kvm_vcpu_map() returns -EFAULT. As above, ideally that code would be converted to gracefully report the error, but it's such a snafu that the easiest thing might be to change the "return ret;" to "return 1;" until we fix all such KVM-on-HyperV code.
On Fri, Mar 17, 2023, Anish Moorthy wrote: > On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote: > > > The low-level accessors are common across architectures and can be called from > > > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT > > > and convert that into an exit? > > > > Ya, as things stand today, the conversions _must_ be performed at the caller, as > > there are (sadly) far too many flows where KVM squashes the error. E.g. almost > > all of x86's paravirt code just suppresses user memory faults :-( > > > > Anish, when we discussed this off-list, what I meant by limiting the intial support > > to existing -EFAULT cases was limiting support to existing cases where KVM directly > > returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever > > returned _within KVM_ while handling KVM_RUN. My apologies if I didn't make that clear. > > Don't worry, we eventually got there off-list :) > > This brings us back to my original set of questions. As has already > been pointed out, I'll have to revisit my "Confident that needs > conversion" changes and tweak them so that the vCPU exit is populated > only for the call sites where the -EFAULT makes it to userspace. I > still want feedback on if I've mis-identified any of the functions in > my "EFAULT does not propagate to userspace" list and whether there are > functions/callers in the "Still unsure if needs conversion" which do > have return paths to KVM_RUN. As you've probably gathered from the type of feedback you're receiving, identifying the conversion touchpoints isn't going to be the long pole of this series. Correctly identifying all of the touchpoints may not be easy, but fixing any cases we get wrong will likely be straightforward. And realistically, no matter how many eyeballs look at the code, odds are good we'll miss at least one case. In other words, don't worry too much about getting all the touchpoints correct on the first version. Getting the uAPI right is much more important. And rather than rely on code review to get things right, we should be able to detect issues programmatically. E.g. use fault injection to make gup() and/or uaccess fail (might even be wired up already?), and hack in a WARN in the KVM_RUN path to assert that KVM_EXIT_MEMORY_FAULT is filled if the return code is -EFAULT (assuming we go don't try to get KVM to return 0 everywhere), e.g. something like the below would at least flag the "misses", although debug could still prove to be annoying. diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 67b890e54cf1..cccae0ad1436 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4100,6 +4100,8 @@ static long kvm_vcpu_ioctl(struct file *filp, } r = kvm_arch_vcpu_ioctl_run(vcpu); trace_kvm_userspace_exit(vcpu->run->exit_reason, r); + WARN_ON(r == -EFAULT && + vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT); break; } case KVM_GET_REGS: {
On Fri, Mar 17, 2023, Sean Christopherson wrote: > On Fri, Mar 17, 2023, Anish Moorthy wrote: > > On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote: > > > > The low-level accessors are common across architectures and can be called from > > > > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT > > > > and convert that into an exit? > > > > > > Ya, as things stand today, the conversions _must_ be performed at the caller, as > > > there are (sadly) far too many flows where KVM squashes the error. E.g. almost > > > all of x86's paravirt code just suppresses user memory faults :-( > > > > > > Anish, when we discussed this off-list, what I meant by limiting the intial support > > > to existing -EFAULT cases was limiting support to existing cases where KVM directly > > > returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever > > > returned _within KVM_ while handling KVM_RUN. My apologies if I didn't make that clear. > > > > Don't worry, we eventually got there off-list :) > > > > This brings us back to my original set of questions. As has already > > been pointed out, I'll have to revisit my "Confident that needs > > conversion" changes and tweak them so that the vCPU exit is populated > > only for the call sites where the -EFAULT makes it to userspace. I > > still want feedback on if I've mis-identified any of the functions in > > my "EFAULT does not propagate to userspace" list and whether there are > > functions/callers in the "Still unsure if needs conversion" which do > > have return paths to KVM_RUN. > > As you've probably gathered from the type of feedback you're receiving, identifying > the conversion touchpoints isn't going to be the long pole of this series. Correctly > identifying all of the touchpoints may not be easy, but fixing any cases we get wrong > will likely be straightforward. And realistically, no matter how many eyeballs look > at the code, odds are good we'll miss at least one case. In other words, don't worry > too much about getting all the touchpoints correct on the first version. Getting the > uAPI right is much more important. > > And rather than rely on code review to get things right, we should be able to > detect issues programmatically. E.g. use fault injection to make gup() and/or > uaccess fail (might even be wired up already?), and hack in a WARN in the KVM_RUN > path to assert that KVM_EXIT_MEMORY_FAULT is filled if the return code is -EFAULT > (assuming we go don't try to get KVM to return 0 everywhere), e.g. something like > the below would at least flag the "misses", although debug could still prove to be > annoying. > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 67b890e54cf1..cccae0ad1436 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -4100,6 +4100,8 @@ static long kvm_vcpu_ioctl(struct file *filp, > } > r = kvm_arch_vcpu_ioctl_run(vcpu); > trace_kvm_userspace_exit(vcpu->run->exit_reason, r); > + WARN_ON(r == -EFAULT && > + vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT); Gah, I inverted the second check, this should be WARN_ON(r == -EFAULT && vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT); > break; > } > case KVM_GET_REGS: { >