
KVM: x86: Use gfn_to_pfn_cache for steal_time

Message ID 20240802114402.96669-1-stollmc@amazon.com (mailing list archive)

Commit Message

Carsten Stollmaier Aug. 2, 2024, 11:44 a.m. UTC
On vcpu_run, before entering the guest, the update of the steal time
information causes a page fault if the page is not present. In our
scenario, this gets handled by do_user_addr_fault and subsequently by
handle_userfault, since we have the region registered with userfaultfd.

handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
signals. do_user_addr_fault then busy-retries the fault as long as the
pending signal is non-fatal. This leads to contention on the mmap_lock.

This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache,
as gfn_to_pfn_cache ensures the page is present before the memory
access, avoiding the mmap_lock contention.

Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>
CC: David Woodhouse <dwmw2@infradead.org>
CC: Sean Christopherson <seanjc@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Peter Xu <peterx@redhat.com>
CC: Sebastian Biemueller <sbiemue@amazon.de>
---
 arch/x86/include/asm/kvm_host.h |   2 +-
 arch/x86/kvm/x86.c              | 115 +++++++++++++++-----------------
 2 files changed, 54 insertions(+), 63 deletions(-)
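
For orientation, the gfn_to_pfn_cache lifecycle the patch adopts,
distilled from its hunks into one place (a condensed sketch of the code
in the Patch section below, not a new API):

/* vCPU creation (kvm_arch_vcpu_create): tie the cache to the VM. */
kvm_gpc_init(&vcpu->arch.st.cache, vcpu->kvm);

/* MSR_KVM_STEAL_TIME write: (de)activate the cache for the guest GPA. */
if (data & KVM_MSR_ENABLED)
	kvm_gpc_activate(&vcpu->arch.st.cache, data & ~KVM_MSR_ENABLED,
			 sizeof(struct kvm_steal_time));
else
	kvm_gpc_deactivate(&vcpu->arch.st.cache);

/* Access path (record_steal_time, gpc == &vcpu->arch.st.cache):
 * check under the lock, refresh outside it on failure. */
read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, sizeof(*st))) {
	read_unlock(&gpc->lock);
	if (kvm_gpc_refresh(gpc, sizeof(*st)))	/* faults the page in */
		return;
	read_lock(&gpc->lock);
}
st = (struct kvm_steal_time *)gpc->khva;	/* kernel mapping, no uaccess */
/* ... read/write *st directly ... */
kvm_gpc_mark_dirty_in_slot(gpc);
read_unlock(&gpc->lock);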

Comments

David Woodhouse Aug. 2, 2024, 12:03 p.m. UTC | #1
On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> On vcpu_run, before entering the guest, the update of the steal time
> information causes a page-fault if the page is not present. In our
> scenario, this gets handled by do_user_addr_fault and successively
> handle_userfault since we have the region registered to that.
> 
> handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> signals. do_user_addr_fault then busy-retries it if the pending signal
> is non-fatal. This leads to contention of the mmap_lock.

The busy-loop causes so much contention on mmap_lock that post-copy
live migration fails to make progress, and is leading to failures. Yes?

> This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache,
> as gfn_to_pfn_cache ensures page presence for the memory access,
> preventing the contention of the mmap_lock.
> 
> Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

I think this makes sense on its own, as it addresses the specific case
where KVM is *likely* to be touching a userfaulted (guest) page. And it
allows us to ditch yet another explicit asm exception handler.

We should note, though, that in terms of the original problem described
above, it's a bit of a workaround. It just means that by using
kvm_gpc_refresh() to obtain the user page, we end up in
handle_userfault() without the FAULT_FLAG_INTERRUPTIBLE flag.

(Note to self: should kvm_gpc_refresh() take fault flags, to allow
interruptible and killable modes to be selected by its caller?)


An alternative workaround (which perhaps we should *also* consider)
looked like this (plus some suitable code comment, of course):

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
         */
        if (user_mode(regs))
                flags |= FAULT_FLAG_USER;
+       else
+               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
 
 #ifdef CONFIG_X86_64
        /*


That would *also* handle arbitrary copy_to_user/copy_from_user() to
userfault pages, which could theoretically hit the same busy loop.

I'm actually tempted to make user access *interruptible* though, and
either add copy_{from,to}_user_interruptible() or change the semantics
of the existing ones (which I believe are already killable).

That would require each architecture implementing interruptible
exceptions, by doing an extable lookup before the retry. Not overly
complex, but needs to be done for all architectures (although not at
once; we could live with not-yet-done architectures just remaining
killable).

Thoughts?
Matthew Wilcox (Oracle) Aug. 2, 2024, 12:38 p.m. UTC | #2
On Fri, Aug 02, 2024 at 01:03:16PM +0100, David Woodhouse wrote:
> On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> > handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> > signals. do_user_addr_fault then busy-retries it if the pending signal
> > is non-fatal. This leads to contention of the mmap_lock.

Why does handle_userfault use TASK_INTERRUPTIBLE?  We really don't
want to stop handling a page fault just because somebody resized a
window or a timer went off.  TASK_KILLABLE, sure.

This goes all the way back to Andrea's terse "add new syscall"
patch, so there's no justification for it in the commit logs.

> The busy-loop causes so much contention on mmap_lock that post-copy
> live migration fails to make progress, and is leading to failures. Yes?
> 
> > This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache,
> > as gfn_to_pfn_cache ensures page presence for the memory access,
> > preventing the contention of the mmap_lock.
> > 
> > Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>
> 
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> I think this makes sense on its own, as it addresses the specific case
> where KVM is *likely* to be touching a userfaulted (guest) page. And it
> allows us to ditch yet another explicit asm exception handler.
> 
> We should note, though, that in terms of the original problem described
> above, it's a bit of a workaround. It just means that by using
> kvm_gpc_refresh() to obtain the user page, we end up in
> handle_userfault() without the FAULT_FLAG_INTERRUPTIBLE flag.
> 
> (Note to self: should kvm_gpc_refresh() take fault flags, to allow
> interruptible and killable modes to be selected by its caller?)
> 
> 
> An alternative workaround (which perhaps we should *also* consider)
> looked like this (plus some suitable code comment, of course):
> 
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>          */
>         if (user_mode(regs))
>                 flags |= FAULT_FLAG_USER;
> +       else
> +               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
>  
>  #ifdef CONFIG_X86_64
>         /*
> 
> 
> That would *also* handle arbitrary copy_to_user/copy_from_user() to
> userfault pages, which could theoretically hit the same busy loop.
> 
> I'm actually tempted to make user access *interruptible* though, and
> either add copy_{from,to}_user_interruptible() or change the semantics
> of the existing ones (which I believe are already killable).
> 
> That would require each architecture implementing interruptible
> exceptions, by doing an extable lookup before the retry. Not overly
> complex, but needs to be done for all architectures (although not at
> once; we could live with not-yet-done architectures just remaining
> killable).
> 
> Thoughts?
>
David Woodhouse Aug. 2, 2024, 12:53 p.m. UTC | #3
On Fri, 2024-08-02 at 13:38 +0100, Matthew Wilcox wrote:
> On Fri, Aug 02, 2024 at 01:03:16PM +0100, David Woodhouse wrote:
> > On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> > > handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> > > signals. do_user_addr_fault then busy-retries it if the pending signal
> > > is non-fatal. This leads to contention of the mmap_lock.
> 
> Why does handle_userfault use TASK_INTERRUPTIBLE?  We really don't
> want to stop handling a page fault just because somebody resized a
> window or a timer went off.  TASK_KILLABLE, sure.

Well, the literal answer there in this case is "because we ask it to".

The handle_userfault() function will literally do what it's told by the
fault flags: 

static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
{
	if (flags & FAULT_FLAG_INTERRUPTIBLE)
		return TASK_INTERRUPTIBLE;

	if (flags & FAULT_FLAG_KILLABLE)
		return TASK_KILLABLE;

	return TASK_UNINTERRUPTIBLE;
}


Hence the other potential workaround I mentioned, for
do_user_addr_fault() *not* to ask it to, for faults from the kernel:

> > 
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> >          */
> >         if (user_mode(regs))
> >                 flags |= FAULT_FLAG_USER;
> > +       else
> > +               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
> >  
> >  #ifdef CONFIG_X86_64
> >         /*
> > 


But I don't know that I agree with your statement above, that we "don't
want to stop handling a page fault just because somebody resized a
window or a timer went off". 

In fact, I don't think we *do* even stop handling the page fault in
those cases; we just stop *waiting* for it to be handled. 

In fact, couldn't you contrive a test case where a thread is handling
its own uffd faults via SIGIO, where it's the opposite of what you say?
In that case the *only* way the fault actually gets handled is if we
let the signal happen instead of just waiting.

That doesn't seem like *such* a contrived case either — that seems
perfectly reasonable for a vCPU thread, to then handle its own missing
pages?
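
A hypothetical userspace sketch of that contrived case (not code from
this thread): it uses a periodic SIGALRM rather than SIGIO as the
wake-up, since signal-driven I/O on the uffd itself isn't assumed here,
and the handler simply populates the registered range rather than
reading fault events. The point is the same: the fault only completes
because the non-fatal signal interrupts the TASK_INTERRUPTIBLE wait and
lets the handler run.

/* Hypothetical sketch: a single thread resolving its own uffd faults
 * from a signal handler.  This only works because handle_userfault()
 * sleeps TASK_INTERRUPTIBLE for user-mode faults. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

#define PAGE_SZ 4096UL

static int uffd;
static char *area;		/* single registered page */

static void resolve_own_fault(int sig)
{
	/* Contrived resolution: just populate the whole registered range
	 * with zeroes; -EEXIST (already present) is silently ignored. */
	struct uffdio_zeropage zp = {
		.range = { .start = (unsigned long)area, .len = PAGE_SZ },
	};

	(void)sig;
	ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_MISSING };
	struct itimerval tick = {
		.it_interval = { .tv_usec = 100000 },
		.it_value    = { .tv_usec = 100000 },
	};

	area = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* UFFD_USER_MODE_ONLY: we only care about user-mode faults anyway. */
	uffd = syscall(__NR_userfaultfd,
		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	ioctl(uffd, UFFDIO_API, &api);

	reg.range.start = (unsigned long)area;
	reg.range.len = PAGE_SZ;
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	signal(SIGALRM, resolve_own_fault);
	setitimer(ITIMER_REAL, &tick, NULL);

	/*
	 * Blocks in handle_userfault() until SIGALRM interrupts the
	 * TASK_INTERRUPTIBLE wait; the handler then fills the page and the
	 * access is retried.  Without FAULT_FLAG_INTERRUPTIBLE the signal
	 * would never wake this wait and the thread would hang here.
	 */
	return area[0];
}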
David Woodhouse Aug. 2, 2024, 12:56 p.m. UTC | #4
On Fri, 2024-08-02 at 13:53 +0100, David Woodhouse wrote:
> On Fri, 2024-08-02 at 13:38 +0100, Matthew Wilcox wrote:
> > On Fri, Aug 02, 2024 at 01:03:16PM +0100, David Woodhouse wrote:
> > > On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> > > > handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> > > > signals. do_user_addr_fault then busy-retries it if the pending signal
> > > > is non-fatal. This leads to contention of the mmap_lock.
> > 
> > Why does handle_userfault use TASK_INTERRUPTIBLE?  We really don't
> > want to stop handling a page fault just because somebody resized a
> > window or a timer went off.  TASK_KILLABLE, sure.
> 
> Well, the literal answer there in this case is "because we ask it to".
> 
> The handle_userfault() function will literally do what it's told by the
> fault flags: 
> 
> static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> {
>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
>                 return TASK_INTERRUPTIBLE;
> 
>         if (flags & FAULT_FLAG_KILLABLE)
>                 return TASK_KILLABLE;
> 
>         return TASK_UNINTERRUPTIBLE;
> }
> 
> 
> Hence the other potential workaround I mentioned, for
> do_user_addr_fault() *not* to ask it to, for faults from the kernel:
> 
> > > 
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >          */
> > >         if (user_mode(regs))
> > >                 flags |= FAULT_FLAG_USER;
> > > +       else
> > > +               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
> > >  
> > >  #ifdef CONFIG_X86_64
> > >         /*
> > > 
> 
> 
> But I don't know that I agree with your statement above, that we "don't
> want to stop handling a page fault just because somebody resized a
> window or a timer went off". 

See also "we don't want to stop waiting for a page fault, just because
somebody hit Ctrl-C, but SIGINT has a trivial handler to do some minor
cleanup before exiting so it isn't considered a *fatal* signal".

I'm very sure I'd disagree with that one :)
David Woodhouse Aug. 2, 2024, 4:06 p.m. UTC | #5
On Fri, 2024-08-02 at 13:03 +0100, David Woodhouse wrote:
> 
> I'm actually tempted to make user access *interruptible* though, and
> either add copy_{from,to}_user_interruptible() or change the semantics
> of the existing ones (which I believe are already killable).
> 
> That would require each architecture implementing interruptible
> exceptions, by doing an extable lookup before the retry. Not overly
> complex, but needs to be done for all architectures (although not at
> once; we could live with not-yet-done architectures just remaining
> killable).
> 
> Thoughts?

Utterly untested, hasn't even built yet and would need some cleanup
(and better thoughts about how to indicate -EFAULT vs. -EINTR instead
of having the caller check signal_pending() for itself). But should
demonstrate what I was thinking, at least...

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 2bec0c89a95c..854ccd5f2342 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -229,6 +229,9 @@ register unsigned long current_stack_pointer asm(_ASM_SP);
 #define _ASM_EXTABLE_UA(from, to)				\
 	_ASM_EXTABLE_TYPE(from, to, EX_TYPE_UACCESS)
 
+#define _ASM_EXTABLE_UA_INTR(from, to)				\
+	_ASM_EXTABLE_TYPE(from, to, EX_TYPE_UACCESS_INTERRUPTIBLE)
+
 #define _ASM_EXTABLE_FAULT(from, to)				\
 	_ASM_EXTABLE_TYPE(from, to, EX_TYPE_FAULT)
 
diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 906b0d5541e8..651d42f39e9b 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -36,7 +36,7 @@
 #define	EX_TYPE_DEFAULT			 1
 #define	EX_TYPE_FAULT			 2
 #define	EX_TYPE_UACCESS			 3
-/* unused, was: #define EX_TYPE_COPY	 4 */
+#define	EX_TYPE_UACCESS_INTERRUPTIBLE	 4
 #define	EX_TYPE_CLEAR_FS		 5
 #define	EX_TYPE_FPU_RESTORE		 6
 #define	EX_TYPE_BPF			 7
diff --git a/arch/x86/include/asm/trapnr.h b/arch/x86/include/asm/trapnr.h
index 8d1154cdf787..9f6397bad398 100644
--- a/arch/x86/include/asm/trapnr.h
+++ b/arch/x86/include/asm/trapnr.h
@@ -41,4 +41,5 @@
 #define X86_TRAP_VC		29	/* VMM Communication Exception */
 #define X86_TRAP_IRET		32	/* IRET Exception */
 
+#define X86_TRAP_INTERRUPTIBLE 0x40000000	/* Internal, for interruptible exceptions */
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cef729a25655..ab00150d360b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3819,12 +3819,15 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		asm volatile("1: xchgb %0, %2\n"
 			     "xor %1, %1\n"
 			     "2:\n"
-			     _ASM_EXTABLE_UA(1b, 2b)
+			     _ASM_EXTABLE_UA_INTR(1b, 2b)
 			     : "+q" (st_preempted),
 			       "+&r" (err),
 			       "+m" (st->preempted));
-		if (err)
+		if (err) {
+			if (signal_pending(current))
+				err = -EINTR;
 			goto out;
+		}
 
 		user_access_end();
 
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 51986e8a9d35..d2cef84042a5 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -325,6 +325,12 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 	reg  = FIELD_GET(EX_DATA_REG_MASK,  e->data);
 	imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);
 
+	if (trapnr & X86_TRAP_INTERRUPTIBLE) {
+		trapnr &= ~X86_TRAP_INTERRUPTIBLE;
+		if (type != EX_TYPE_UACCESS_INTERRUPTIBLE)
+			return 0;
+	}
+
 	switch (type) {
 	case EX_TYPE_DEFAULT:
 	case EX_TYPE_DEFAULT_MCE_SAFE:
@@ -333,6 +339,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 	case EX_TYPE_FAULT_MCE_SAFE:
 		return ex_handler_fault(e, regs, trapnr);
 	case EX_TYPE_UACCESS:
+	case EX_TYPE_UACCESS_INTERRUPTIBLE:
 		return ex_handler_uaccess(e, regs, trapnr, fault_addr);
 	case EX_TYPE_CLEAR_FS:
 		return ex_handler_clear_fs(e, regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e6c469b323cc..4b32348dbb23 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1388,18 +1388,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 	fault = handle_mm_fault(vma, address, flags, regs);
 
-	if (fault_signal_pending(fault, regs)) {
-		/*
-		 * Quick path to respond to signals.  The core mm code
-		 * has unlocked the mm for us if we get here.
-		 */
-		if (!user_mode(regs))
-			kernelmode_fixup_or_oops(regs, error_code, address,
-						 SIGBUS, BUS_ADRERR,
-						 ARCH_DEFAULT_PKEY);
-		return;
-	}
-
 	/* The fault is fully completed (including releasing mmap lock) */
 	if (fault & VM_FAULT_COMPLETED)
 		return;
@@ -1410,6 +1398,28 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * that we made any progress. Handle this case first.
 	 */
 	if (unlikely(fault & VM_FAULT_RETRY)) {
+		if (signal_pending(current)) {
+			if (user_mode(regs))
+				return;
+
+			if (fatal_signal_pending(current)) {
+				kernelmode_fixup_or_oops(regs, error_code, address,
+							 SIGBUS, BUS_ADRERR,
+							 ARCH_DEFAULT_PKEY);
+				return;
+			}
+
+			/*
+			 * First time round, if woken by a signal, see if there
+			 * is an interruptible exception handler. If so, do it.
+			 * Else, switch off FAULT_FLAG_INTERRUPTIBLE.
+			 */
+			if (fixup_exception(regs, X86_TRAP_INTERRUPTIBLE | X86_TRAP_PF,
+					    error_code, address))
+				return;
+			flags &= ~FAULT_FLAG_INTERRUPTIBLE;
+		}
+
 		flags |= FAULT_FLAG_TRIED;
 		goto retry;
 	}
Peter Xu Aug. 2, 2024, 10:40 p.m. UTC | #6
On Fri, Aug 02, 2024 at 01:03:16PM +0100, David Woodhouse wrote:
> On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> > On vcpu_run, before entering the guest, the update of the steal time
> > information causes a page-fault if the page is not present. In our
> > scenario, this gets handled by do_user_addr_fault and successively
> > handle_userfault since we have the region registered to that.
> > 
> > handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> > signals. do_user_addr_fault then busy-retries it if the pending signal
> > is non-fatal. This leads to contention of the mmap_lock.
> 
> The busy-loop causes so much contention on mmap_lock that post-copy
> live migration fails to make progress, and is leading to failures. Yes?
> 
> > This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache,
> > as gfn_to_pfn_cache ensures page presence for the memory access,
> > preventing the contention of the mmap_lock.
> > 
> > Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>
> 
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> I think this makes sense on its own, as it addresses the specific case
> where KVM is *likely* to be touching a userfaulted (guest) page. And it
> allows us to ditch yet another explicit asm exception handler.
> 
> We should note, though, that in terms of the original problem described
> above, it's a bit of a workaround. It just means that by using
> kvm_gpc_refresh() to obtain the user page, we end up in
> handle_userfault() without the FAULT_FLAG_INTERRUPTIBLE flag.
> 
> (Note to self: should kvm_gpc_refresh() take fault flags, to allow
> interruptible and killable modes to be selected by its caller?)
> 
> 
> An alternative workaround (which perhaps we should *also* consider)
> looked like this (plus some suitable code comment, of course):
> 
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>          */
>         if (user_mode(regs))
>                 flags |= FAULT_FLAG_USER;
> +       else
> +               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
>  
>  #ifdef CONFIG_X86_64
>         /*
> 
> 
> That would *also* handle arbitrary copy_to_user/copy_from_user() to
> userfault pages, which could theoretically hit the same busy loop.
> 
> I'm actually tempted to make user access *interruptible* though, and
> either add copy_{from,to}_user_interruptible() or change the semantics
> of the existing ones (which I believe are already killable).
> 
> That would require each architecture implementing interruptible
> exceptions, by doing an extable lookup before the retry. Not overly
> complex, but needs to be done for all architectures (although not at
> once; we could live with not-yet-done architectures just remaining
> killable).
> 
> Thoughts?

Instead of "interruptible exception" or the original patch (which might
still be worthwhile, though?  I didn't follow much on kvm and the new gpc
cache, but looks still nicer than get/put user from initial glance), above
looks like the easier and complete solution to me.  For "completeness", I
mean I am not sure how many other copy_to/from_user() code in kvm can hit
this, so looks like still possible to hit outside steal time page?

I thought only the slow fault path was involved in INTERRUPTIBLE thing and
that was the plan, but I guess I overlooked how the default value could
affect copy to/from user invoked from KVM as well..

With above patch to drop FAULT_FLAG_INTERRUPTIBLE for !user, KVM can still
opt-in INTERRUPTIBLE anywhere by leveraging hva_to_pfn[_slow]() API, which
is "INTERRUPTIBLE"-ready with a boolean the caller can set. But the caller
will need to be able to process KVM_PFN_ERR_SIGPENDING.
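
To make that concrete, a purely hypothetical caller-side sketch (not
code from this thread; the exact __gfn_to_pfn_memslot() / hva_to_pfn()
parameter lists vary between kernel versions, so the argument order
below is an assumption):

/*
 * Hypothetical sketch only: resolve a guest pfn with interruptible
 * faults and bail out to the caller on a pending non-fatal signal.
 * The __gfn_to_pfn_memslot() argument order is an assumption and
 * differs across kernel versions.
 */
static int read_guest_page_interruptible(struct kvm_memory_slot *slot,
					 gfn_t gfn, void *data,
					 unsigned int offset, int len)
{
	kvm_pfn_t pfn;
	void *kaddr;

	pfn = __gfn_to_pfn_memslot(slot, gfn, /*atomic=*/false,
				   /*interruptible=*/true, /*async=*/NULL,
				   /*write_fault=*/false, /*writable=*/NULL,
				   /*hva=*/NULL);
	if (pfn == KVM_PFN_ERR_SIGPENDING)
		return -EINTR;	/* exit to userspace; retry after the signal */
	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	kaddr = kmap_local_pfn(pfn);
	memcpy(data, kaddr + offset, len);
	kunmap_local(kaddr);
	kvm_release_pfn_clean(pfn);
	return 0;
}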

Thanks,
David Woodhouse Aug. 3, 2024, 8:35 a.m. UTC | #7
On Fri, 2024-08-02 at 18:40 -0400, Peter Xu wrote:
> On Fri, Aug 02, 2024 at 01:03:16PM +0100, David Woodhouse wrote:
> > An alternative workaround (which perhaps we should *also* consider)
> > looked like this (plus some suitable code comment, of course):
> > 
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> >          */
> >         if (user_mode(regs))
> >                 flags |= FAULT_FLAG_USER;
> > +       else
> > +               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
> >  
...
> Instead of "interruptible exception" or the original patch (which might
> still be worthwhile, though?  I didn't follow much on kvm and the new gpc
> cache, but looks still nicer than get/put user from initial glance), above

Yes, I definitely think we want the GPC conversion anyway. That's why I
suggested it to Carsten, to resolve our *immediate* problem while we
continue to ponder the general case.

> looks like the easier and complete solution to me.  For "completeness", I
> mean I am not sure how many other copy_to/from_user() code in kvm can hit
> this, so looks like still possible to hit outside steal time page?

Right. It theoretically applies to *any* user access. It's just that
anything other than *guest* pages is slightly less likely to be backed
by userfaultfd.

> I thought only the slow fault path was involved in INTERRUPTIBLE thing and
> that was the plan, but I guess I overlooked how the default value could
> affect copy to/from user invoked from KVM as well..
> 
> With above patch to drop FAULT_FLAG_INTERRUPTIBLE for !user, KVM can still
> opt-in INTERRUPTIBLE anywhere by leveraging hva_to_pfn[_slow]() API, which
> is "INTERRUPTIBLE"-ready with a boolean the caller can set. But the caller
> will need to be able to process KVM_PFN_ERR_SIGPENDING.

Right. I think converting kvm_{read,write}_guest() and friends to do
that and be interruptible might make sense?

The patch snippet above obviously only fixes it for x86 and would need
to be done across the board. Unless we do this one instead, abusing the
knowledge that uffd is the only thing which honours
FAULT_FLAG_INTERRUPTIBLE?

--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -351,7 +351,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 
 static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
 {
-       if (flags & FAULT_FLAG_INTERRUPTIBLE)
+       if ((flags & FAULT_FLAG_INTERRUPTIBLE) && (flags & FAULT_FLAG_USER))
                return TASK_INTERRUPTIBLE;
 
        if (flags & FAULT_FLAG_KILLABLE)

I still quite like the idea of *optional* interruptible exceptions, as
seen in my proof of concept. Perhaps we wouldn't want the read(2) and
write(2) system calls to use them, but there are plenty of other system
calls which could be interruptible instead of blocking.

Right now, even the simple case of a trivial SIGINT handler which does
some minor cleanup before exiting makes it a non-fatal signal, so the
kernel blocks and waits forever.
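
As a tiny hypothetical userspace illustration of that last point (not
code from this thread): once any handler is installed, SIGINT is no
longer group-fatal, so only a TASK_INTERRUPTIBLE wait will be woken by
it; a TASK_KILLABLE wait keeps sleeping.

/* Hypothetical illustration: a "trivial cleanup" SIGINT handler. */
#include <signal.h>
#include <unistd.h>

static void on_sigint(int sig)
{
	/* minor cleanup, then exit "gracefully" */
	(void)sig;
	write(STDOUT_FILENO, "cleaning up\n", 12);
	_exit(1);
}

int main(void)
{
	/*
	 * With SIG_DFL, delivery of SIGINT makes the whole thread group
	 * exit, so fatal_signal_pending() becomes true and a TASK_KILLABLE
	 * wait (e.g. handle_userfault() without FAULT_FLAG_INTERRUPTIBLE)
	 * is woken.  With this handler installed, SIGINT is merely
	 * signal_pending(), which only a TASK_INTERRUPTIBLE wait honours.
	 */
	signal(SIGINT, on_sigint);

	pause();	/* stand-in for a thread stuck in an uninterruptible wait */
	return 0;
}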
Peter Xu Aug. 4, 2024, 1:31 p.m. UTC | #8
On Sat, Aug 03, 2024 at 09:35:56AM +0100, David Woodhouse wrote:
> On Fri, 2024-08-02 at 18:40 -0400, Peter Xu wrote:
> > On Fri, Aug 02, 2024 at 01:03:16PM +0100, David Woodhouse wrote:
> > > An alternative workaround (which perhaps we should *also* consider)
> > > looked like this (plus some suitable code comment, of course):
> > > 
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >          */
> > >         if (user_mode(regs))
> > >                 flags |= FAULT_FLAG_USER;
> > > +       else
> > > +               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
> > >  
> ...
> > Instead of "interruptible exception" or the original patch (which might
> > still be worthwhile, though?  I didn't follow much on kvm and the new gpc
> > cache, but looks still nicer than get/put user from initial glance), above
> 
> Yes, I definitely think we want the GPC conversion anyway. That's why I
> suggested it to Carsten, to resolve our *immediate* problem while we
> continue to ponder the general case.
> 
> > looks like the easier and complete solution to me.  For "completeness", I
> > mean I am not sure how many other copy_to/from_user() code in kvm can hit
> > this, so looks like still possible to hit outside steal time page?
> 
> Right. It theoretically applies to *any* user access. It's just that
> anything other than *guest* pages is slightly less likely to be backed
> by userfaultfd.
> 
> > I thought only the slow fault path was involved in INTERRUPTIBLE thing and
> > that was the plan, but I guess I overlooked how the default value could
> > affect copy to/from user invoked from KVM as well..
> > 
> > With above patch to drop FAULT_FLAG_INTERRUPTIBLE for !user, KVM can still
> > opt-in INTERRUPTIBLE anywhere by leveraging hva_to_pfn[_slow]() API, which
> > is "INTERRUPTIBLE"-ready with a boolean the caller can set. But the caller
> > will need to be able to process KVM_PFN_ERR_SIGPENDING.
> 
> Right. I think converting kvm_{read,write}_guest() and friends to do
> that and be interruptible might make sense?

Makes sense to me.  It's just that there seem to be a lot of contexts
using these helpers, so I'm not sure how much work is needed to
integrate the new KVM_PFN_ERR_SIGPENDING, and whether it'll be
worthwhile.  Also, I'm not sure whether some contexts are too involved
to handle anything other than SIGKILL/SIGQUIT.

And this can also, logically, already trigger via kvm_{read,write}_guest()
or similar paths, right?  I wonder why so far it only triggers with
steal time; I've probably missed something.

> 
> The patch snippet above obviously only fixes it for x86 and would need
> to be done across the board. Unless we do this one instead, abusing the
> knowledge that uffd is the only thing which honours
> FAULT_FLAG_INTERRUPTIBLE?
> 
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -351,7 +351,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>  
>  static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
>  {
> -       if (flags & FAULT_FLAG_INTERRUPTIBLE)
> +       if ((flags & FAULT_FLAG_INTERRUPTIBLE) && (flags & FAULT_FLAG_USER))
>                 return TASK_INTERRUPTIBLE;
>  
>         if (flags & FAULT_FLAG_KILLABLE)

This smells hacky to me, as FAULT_FLAG_INTERRUPTIBLE is the fault API
by which the fault handler says "let's respond to non-fatal signals".
Here userfaultfd would be violating that ABI.

And, IIUC, this concept of "handling non-fatal signals" can apply
outside userfaultfd too.  The one I have in mind is __folio_lock_or_retry().

The previous change looks more reasonable, as I think it's a bug that
do_user_addr_fault() (taking x86 as an example) specifies INTERRUPTIBLE
but then, after handle_mm_fault(), ignores it in fault_signal_pending()
for !user.

So it makes sense to me to have FAULT_FLAG_DEFAULT match what
fault_signal_pending() does.  From that POV, it would perhaps be even
cleaner if FAULT_FLAG_DEFAULT could take "user" as input (instead of
clearing the flag later), something like the sketch below.
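
A sketch of that suggestion (the first define is today's
FAULT_FLAG_DEFAULT; the parameterised form is hypothetical):

/* Today: */
#define FAULT_FLAG_DEFAULT	(FAULT_FLAG_ALLOW_RETRY | \
				 FAULT_FLAG_KILLABLE | \
				 FAULT_FLAG_INTERRUPTIBLE)

/* Hypothetical variant taking "user" as input: */
#define FAULT_FLAG_DEFAULT_FOR(user)				\
	(FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |		\
	 ((user) ? FAULT_FLAG_INTERRUPTIBLE : 0))

/* so that arch fault handlers would start from, e.g. on x86: */
/*	unsigned int flags = FAULT_FLAG_DEFAULT_FOR(user_mode(regs));	*/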

> 
> I still quite like the idea of *optional* interruptible exceptions, as
> seen in my proof of concept. Perhaps we wouldn't want the read(2) and
> write(2) system calls to use them, but there are plenty of other system
> calls which could be interruptible instead of blocking.

I don't have much direct experience there, but it sounds reasonable
to me.

> 
> Right now, even the simple case of a trivial SIGINT handler which does
> some minor cleanup before exiting, makes it a non-fatal signal so the
> kernel blocks and waits for ever.
> 

Thanks,
Sean Christopherson Aug. 17, 2024, 12:22 a.m. UTC | #9
On Fri, Aug 02, 2024, David Woodhouse wrote:
> On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> > On vcpu_run, before entering the guest, the update of the steal time
> > information causes a page-fault if the page is not present. In our
> > scenario, this gets handled by do_user_addr_fault and successively
> > handle_userfault since we have the region registered to that.
> > 
> > handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> > signals. do_user_addr_fault then busy-retries it if the pending signal
> > is non-fatal. This leads to contention of the mmap_lock.
> 
> The busy-loop causes so much contention on mmap_lock that post-copy
> live migration fails to make progress, and is leading to failures. Yes?
> 
> > This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache,
> > as gfn_to_pfn_cache ensures page presence for the memory access,
> > preventing the contention of the mmap_lock.
> > 
> > Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>
> 
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> I think this makes sense on its own, as it addresses the specific case
> where KVM is *likely* to be touching a userfaulted (guest) page. And it
> allows us to ditch yet another explicit asm exception handler.

At the cost of using a gpc, which has its own complexities.

But I don't understand why steal_time is special.  If the issue is essentially
with handle_userfault(), can't this happen on any KVM uaccess?
David Woodhouse Aug. 20, 2024, 10:11 a.m. UTC | #10
On Fri, 2024-08-16 at 17:22 -0700, Sean Christopherson wrote:
> On Fri, Aug 02, 2024, David Woodhouse wrote:
> > On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> > > On vcpu_run, before entering the guest, the update of the steal time
> > > information causes a page-fault if the page is not present. In our
> > > scenario, this gets handled by do_user_addr_fault and successively
> > > handle_userfault since we have the region registered to that.
> > > 
> > > handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> > > signals. do_user_addr_fault then busy-retries it if the pending signal
> > > is non-fatal. This leads to contention of the mmap_lock.
> > 
> > The busy-loop causes so much contention on mmap_lock that post-copy
> > live migration fails to make progress, and is leading to failures. Yes?
> > 
> > > This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache,
> > > as gfn_to_pfn_cache ensures page presence for the memory access,
> > > preventing the contention of the mmap_lock.
> > > 
> > > Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>
> > 
> > Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > I think this makes sense on its own, as it addresses the specific case
> > where KVM is *likely* to be touching a userfaulted (guest) page. And it
> > allows us to ditch yet another explicit asm exception handler.
> 
> At the cost of using a gpc, which has its own complexities.
> 
> But I don't understand why steal_time is special.  If the issue is essentially
> with handle_userfault(), can't this happen on any KVM uaccess?

Theoretically, yes. The steal time is only special in that it happens
so *often*, every time the vCPU is scheduled in.

We should *also* address the general case, perhaps by making user
access functions interruptible as discussed. But this solves the
immediate issue which is being observed, *and* lets us ditch the last
explicit asm exception handling in kvm/x86.c, which is why I think it's
worth doing anyway, even if there's an upcoming fix for the general
case.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..63d0c0cd7a8e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -898,7 +898,7 @@  struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
-		struct gfn_to_hva_cache cache;
+		struct gfn_to_pfn_cache cache;
 	} st;
 
 	u64 l1_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index af6c8cf6a37a..2b8adbadfc50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3652,10 +3652,8 @@  EXPORT_SYMBOL_GPL(kvm_service_local_tlb_flush_requests);
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
-	struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
-	struct kvm_steal_time __user *st;
-	struct kvm_memslots *slots;
-	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+	struct gfn_to_pfn_cache *gpc = &vcpu->arch.st.cache;
+	struct kvm_steal_time *st;
 	u64 steal;
 	u32 version;
 
@@ -3670,42 +3668,26 @@  static void record_steal_time(struct kvm_vcpu *vcpu)
 	if (WARN_ON_ONCE(current->mm != vcpu->kvm->mm))
 		return;
 
-	slots = kvm_memslots(vcpu->kvm);
-
-	if (unlikely(slots->generation != ghc->generation ||
-		     gpa != ghc->gpa ||
-		     kvm_is_error_hva(ghc->hva) || !ghc->memslot)) {
+	read_lock(&gpc->lock);
+	while (!kvm_gpc_check(gpc, sizeof(*st))) {
 		/* We rely on the fact that it fits in a single page. */
 		BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
 
-		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gpa, sizeof(*st)) ||
-		    kvm_is_error_hva(ghc->hva) || !ghc->memslot)
+		read_unlock(&gpc->lock);
+
+		if (kvm_gpc_refresh(gpc, sizeof(*st)))
 			return;
+
+		read_lock(&gpc->lock);
 	}
 
-	st = (struct kvm_steal_time __user *)ghc->hva;
+	st = (struct kvm_steal_time *)gpc->khva;
 	/*
 	 * Doing a TLB flush here, on the guest's behalf, can avoid
 	 * expensive IPIs.
 	 */
 	if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
-		u8 st_preempted = 0;
-		int err = -EFAULT;
-
-		if (!user_access_begin(st, sizeof(*st)))
-			return;
-
-		asm volatile("1: xchgb %0, %2\n"
-			     "xor %1, %1\n"
-			     "2:\n"
-			     _ASM_EXTABLE_UA(1b, 2b)
-			     : "+q" (st_preempted),
-			       "+&r" (err),
-			       "+m" (st->preempted));
-		if (err)
-			goto out;
-
-		user_access_end();
+		u8 st_preempted = xchg(&st->preempted, 0);
 
 		vcpu->arch.st.preempted = 0;
 
@@ -3713,39 +3695,32 @@  static void record_steal_time(struct kvm_vcpu *vcpu)
 				       st_preempted & KVM_VCPU_FLUSH_TLB);
 		if (st_preempted & KVM_VCPU_FLUSH_TLB)
 			kvm_vcpu_flush_tlb_guest(vcpu);
-
-		if (!user_access_begin(st, sizeof(*st)))
-			goto dirty;
 	} else {
-		if (!user_access_begin(st, sizeof(*st)))
-			return;
-
-		unsafe_put_user(0, &st->preempted, out);
+		st->preempted = 0;
 		vcpu->arch.st.preempted = 0;
 	}
 
-	unsafe_get_user(version, &st->version, out);
+	version = st->version;
 	if (version & 1)
 		version += 1;  /* first time write, random junk */
 
 	version += 1;
-	unsafe_put_user(version, &st->version, out);
+	st->version = version;
 
 	smp_wmb();
 
-	unsafe_get_user(steal, &st->steal, out);
+	steal = st->steal;
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
-	unsafe_put_user(steal, &st->steal, out);
+	st->steal = steal;
 
 	version += 1;
-	unsafe_put_user(version, &st->version, out);
+	st->version = version;
+
+	kvm_gpc_mark_dirty_in_slot(gpc);
 
- out:
-	user_access_end();
- dirty:
-	mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
+	read_unlock(&gpc->lock);
 }
 
 static bool kvm_is_msr_to_save(u32 msr_index)
@@ -4020,8 +3995,12 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 		vcpu->arch.st.msr_val = data;
 
-		if (!(data & KVM_MSR_ENABLED))
-			break;
+		if (data & KVM_MSR_ENABLED) {
+			kvm_gpc_activate(&vcpu->arch.st.cache, data & ~KVM_MSR_ENABLED,
+					sizeof(struct kvm_steal_time));
+		} else {
+			kvm_gpc_deactivate(&vcpu->arch.st.cache);
+		}
 
 		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
@@ -5051,11 +5030,10 @@  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 {
-	struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
-	struct kvm_steal_time __user *st;
-	struct kvm_memslots *slots;
+	struct gfn_to_pfn_cache *gpc = &vcpu->arch.st.cache;
+	struct kvm_steal_time *st;
 	static const u8 preempted = KVM_VCPU_PREEMPTED;
-	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+	unsigned long flags;
 
 	/*
 	 * The vCPU can be marked preempted if and only if the VM-Exit was on
@@ -5080,20 +5058,28 @@  static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	if (unlikely(current->mm != vcpu->kvm->mm))
 		return;
 
-	slots = kvm_memslots(vcpu->kvm);
-
-	if (unlikely(slots->generation != ghc->generation ||
-		     gpa != ghc->gpa ||
-		     kvm_is_error_hva(ghc->hva) || !ghc->memslot))
-		return;
+	read_lock_irqsave(&gpc->lock, flags);
+	if (!kvm_gpc_check(gpc, sizeof(*st)))
+		goto out_unlock_gpc;
 
-	st = (struct kvm_steal_time __user *)ghc->hva;
+	st = (struct kvm_steal_time *)gpc->khva;
 	BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
 
-	if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
-		vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+	st->preempted = preempted;
+	vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
 
-	mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
+	kvm_gpc_mark_dirty_in_slot(gpc);
+
+out_unlock_gpc:
+	read_unlock_irqrestore(&gpc->lock, flags);
+}
+
+static void kvm_steal_time_reset(struct kvm_vcpu *vcpu)
+{
+	kvm_gpc_deactivate(&vcpu->arch.st.cache);
+	vcpu->arch.st.preempted = 0;
+	vcpu->arch.st.msr_val = 0;
+	vcpu->arch.st.last_steal = 0;
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -12219,6 +12205,8 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm);
 
+	kvm_gpc_init(&vcpu->arch.st.cache, vcpu->kvm);
+
 	if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
 		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 	else
@@ -12331,6 +12319,8 @@  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
 	int idx;
 
+	kvm_steal_time_reset(vcpu);
+
 	kvmclock_reset(vcpu);
 
 	kvm_x86_call(vcpu_free)(vcpu);
@@ -12401,7 +12391,8 @@  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	vcpu->arch.apf.msr_en_val = 0;
 	vcpu->arch.apf.msr_int_val = 0;
-	vcpu->arch.st.msr_val = 0;
+
+	kvm_steal_time_reset(vcpu);
 
 	kvmclock_reset(vcpu);