[RFCv2,7/9] kvm/arm64: Support async page fault

Message ID 20200508032919.52147-8-gshan@redhat.com (mailing list archive)
State New, archived
Series kvm/arm64: Support Async Page Fault

Commit Message

Gavin Shan May 8, 2020, 3:29 a.m. UTC
There are two stages of page faults: stage 1 page faults are handled by
the guest itself, while the guest traps to the host when the fault is
caused by the stage 2 page table, for example a missing mapping. The
guest is then suspended until the requested page is populated. Populating
the requested page can involve IO if the page was previously swapped out.
In that case, the guest has to suspend for at least a few milliseconds,
regardless of the overall system load. From the guest's point of view,
no useful work is done during the suspended period.

This adds asynchronous page fault support to improve the situation. A
signal (PAGE_NOT_PRESENT) is sent to the guest if the requested page needs
some time to be populated. The guest might then reschedule to another
runnable process if possible. Otherwise, the vCPU is put into power-saving
mode, which from the host's point of view effectively causes the vCPU to
be rescheduled. A follow-up signal (PAGE_READY) is sent to the guest once
the requested page has been populated. The suspended task is woken up or
scheduled when the guest receives the signal. With this mechanism, the
vCPU won't be stuck while the requested page is being populated by the
host.

More details are highlighted below. Note the implementation is similar
to what x86 has, to some extent:

   * A dedicated SMCCC ID is reserved to enable, disable or configure
     the functionality. The only parameter is 64 bits wide and is
     conveyed by two registers (w2/w1). Bits[63:56] are a bitmap
     selecting the operated function (enable/disable/configure).
     Bits[55:6] are the physical address of the control block, or of
     the region in which external data abort injection is disallowed.
     Bits[5:0] are used to pass control flags (see the sketch after
     this list).

   * A signal (PAGE_NOT_PRESENT) is sent to the guest if the requested
     page isn't ready. In the meantime, work is started to populate the
     page asynchronously in the background. The stage 2 page table entry
     is updated accordingly and another signal (PAGE_READY) is fired
     after the requested page has been populated. The signals are
     delivered by injecting data abort faults.

   * The signals are fired and consumed in a sequential fashion: no
     further signal is fired while one is pending, awaiting consumption
     by the guest, because the injected data abort faults have to be
     delivered sequentially.
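
For illustration, below is a minimal guest-side sketch of how the 64-bit
parameter described in the first bullet could be packed and split across
w1/w2. The helper name and flag usage are hypothetical and not part of
this series; only the bit layout comes from the description above:

	static void kvm_apf_hvc_control(phys_addr_t gpa, u64 flags)
	{
		struct arm_smccc_res res;
		u64 data = BIT_ULL(63) |                 /* function bits[63:56] */
			   (gpa & GENMASK_ULL(55, 6)) |  /* 64-byte aligned address */
			   (flags & GENMASK_ULL(5, 0));  /* e.g. KVM_ASYNC_PF_ENABLED */

		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
				     lower_32_bits(data), /* w1: low half */
				     upper_32_bits(data), /* w2: high half */
				     &res);
	}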

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/kvm_host.h      |  43 ++++
 arch/arm64/include/asm/kvm_para.h      |  27 ++
 arch/arm64/include/uapi/asm/Kbuild     |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   2 +
 include/linux/arm-smccc.h              |   6 +
 virt/kvm/arm/arm.c                     |  36 ++-
 virt/kvm/arm/async_pf.c                | 335 +++++++++++++++++++++++++
 virt/kvm/arm/hypercalls.c              |   8 +
 virt/kvm/arm/mmu.c                     |  29 ++-
 11 files changed, 506 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 virt/kvm/arm/async_pf.c

Comments

Mark Rutland May 26, 2020, 12:34 p.m. UTC | #1
On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:
> There are two stages of fault pages and the stage one page fault is
> handled by guest itself. The guest is trapped to host when the page
> fault is caused by stage 2 page table, for example missing. The guest
> is suspended until the requested page is populated. To populate the
> requested page can be related to IO activities if the page was swapped
> out previously. In this case, the guest has to suspend for a few of
> milliseconds at least, regardless of the overall system load. There
> is no useful work done during the suspended period from guest's view.

This is a bit difficult to read. How about:

| When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
| is not mapped at Stage-2), the vCPU is suspended until the host has
| handled the fault. It can take the host milliseconds or longer to
| handle the fault as this may require IO, and when the system load is
| low neither the host nor guest perform useful work during such
| periods.

> 
> This adds asychornous page fault to improve the situation. A signal

Nit: typo for `asynchronous` here, and there are a few other typos in
the patch itself. It would be nice if you could run a spellcheck over
that.

> (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
> to be populated. Guest might reschedule to another running process if
> possible. Otherwise, the vCPU is put into power-saving mode, which is
> actually to cause vCPU reschedule from host's view. A followup signal
> (PAGE_READY) is sent to guest once the requested page is populated.
> The suspended task is waken up or scheduled when guest receives the
> signal. With this mechanism, the vCPU won't be stuck when the requested
> page is being populated by host.

It would probably be best to say 'notification' rather than 'signal'
here, and say 'the guest is notified', etc. As above, it seems that this
is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to
make it clear which context this applies to.

> 
> There are more details highlighted as below. Note the implementation is
> similar to what x86 has to some extent:
> 
>    * A dedicated SMCCC ID is reserved to enable, disable or configure
>      the functionality. The only 64-bits parameter is conveyed by two
>      registers (w2/w1). Bits[63:56] is the bitmap used to specify the
>      operated functionality like enabling/disabling/configuration. The
>      bits[55:6] is the physical address of control block or external
>      data abort injection disallowed region. Bit[5:0] are used to pass
>      control flags.
> 
>    * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
>      isn't ready. In the mean while, a work is started to populate the
>      page asynchronously in background. The stage 2 page table entry is
>      updated accordingly and another signal (PAGE_READY) is fired after
>      the request page is populted. The signals is notified by injected
>      data abort fault.
> 
>    * The signals are fired and consumed in sequential fashion. It means
>      no more signals will be fired if there is pending one, awaiting the
>      guest to consume. It's because the injected data abort faults have
>      to be done in sequential fashion.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/include/asm/kvm_host.h      |  43 ++++
>  arch/arm64/include/asm/kvm_para.h      |  27 ++
>  arch/arm64/include/uapi/asm/Kbuild     |   2 -
>  arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
>  arch/arm64/kvm/Kconfig                 |   1 +
>  arch/arm64/kvm/Makefile                |   2 +
>  include/linux/arm-smccc.h              |   6 +
>  virt/kvm/arm/arm.c                     |  36 ++-
>  virt/kvm/arm/async_pf.c                | 335 +++++++++++++++++++++++++
>  virt/kvm/arm/hypercalls.c              |   8 +
>  virt/kvm/arm/mmu.c                     |  29 ++-
>  11 files changed, 506 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kvm_para.h
>  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>  create mode 100644 virt/kvm/arm/async_pf.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index f77c706777ec..a207728d6f3f 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -250,6 +250,23 @@ struct vcpu_reset_state {
>  	bool		reset;
>  };
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +
> +/* Should be a power of two number */
> +#define ASYNC_PF_PER_VCPU	64

What exactly is this number?

> +
> +/*
> + * The association of gfn and token. The token will be sent to guest as
> + * page fault address. Also, the guest could be in aarch32 mode. So its
> + * length should be 32-bits.
> + */

The length of what should be 32-bit? The token?

The guest sees the token as the fault address? How exactly is that
exposed to the guest, is that via a synthetic S1 fault?

> +struct kvm_arch_async_pf {
> +	u32     token;
> +	gfn_t   gfn;
> +	u32	esr;
> +};
> +#endif /* CONFIG_KVM_ASYNC_PF */
> +
>  struct kvm_vcpu_arch {
>  	struct kvm_cpu_context ctxt;
>  	void *sve_state;
> @@ -351,6 +368,17 @@ struct kvm_vcpu_arch {
>  		u64 last_steal;
>  		gpa_t base;
>  	} steal;
> +
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	struct {
> +		struct gfn_to_hva_cache	cache;
> +		gfn_t			gfns[ASYNC_PF_PER_VCPU];
> +		u64			control_block;
> +		u16			id;
> +		bool			send_user_only;
> +		u64			no_fault_inst_range;

What are all of these fields? This implies functionality not covered
in the commit message, and it's not at all clear what these are. 

For example, what exactly is `no_fault_inst_range`? If it's a range,
surely that needs a start/end or base/size pair rather than a single
value?

> +	} apf;
> +#endif
>  };
>  
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> @@ -604,6 +632,21 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  static inline void __cpu_init_stage2(void) {}
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
> +			    gpa_t gpa, gfn_t gfn);
> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work);
> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work);
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> +			       struct kvm_async_pf *work);
> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu);
> +#endif /* CONFIG_KVM_ASYNC_PF */
> +
>  /* Guest/host FPSIMD coordination helpers */
>  int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
> new file mode 100644
> index 000000000000..0ea481dd1c7a
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_para.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM_KVM_PARA_H
> +#define _ASM_ARM_KVM_PARA_H
> +
> +#include <uapi/asm/kvm_para.h>
> +
> +static inline bool kvm_check_and_clear_guest_paused(void)
> +{
> +	return false;
> +}
> +
> +static inline unsigned int kvm_arch_para_features(void)
> +{
> +	return 0;
> +}
> +
> +static inline unsigned int kvm_arch_para_hints(void)
> +{
> +	return 0;
> +}
> +
> +static inline bool kvm_para_available(void)
> +{
> +	return false;
> +}
> +
> +#endif /* _ASM_ARM_KVM_PARA_H */
> diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
> index 602d137932dc..f66554cd5c45 100644
> --- a/arch/arm64/include/uapi/asm/Kbuild
> +++ b/arch/arm64/include/uapi/asm/Kbuild
> @@ -1,3 +1 @@
>  # SPDX-License-Identifier: GPL-2.0
> -
> -generic-y += kvm_para.h
> diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
> new file mode 100644
> index 000000000000..e0bd0e579b9a
> --- /dev/null
> +++ b/arch/arm64/include/uapi/asm/kvm_para.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_ARM_KVM_PARA_H
> +#define _UAPI_ASM_ARM_KVM_PARA_H
> +
> +#include <linux/types.h>
> +
> +#define KVM_FEATURE_ASYNC_PF	0
> +
> +/* Async PF */
> +#define KVM_ASYNC_PF_ENABLED		(1 << 0)
> +#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
> +
> +#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
> +#define KVM_PV_REASON_PAGE_READY	2
> +
> +struct kvm_vcpu_pv_apf_data {
> +	__u32	reason;
> +	__u8	pad[60];
> +	__u32	enabled;
> +};

What's all the padding for?

> +
> +#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 449386d76441..1053e16b1739 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -34,6 +34,7 @@ config KVM
>  	select KVM_VFIO
>  	select HAVE_KVM_EVENTFD
>  	select HAVE_KVM_IRQFD
> +	select KVM_ASYNC_PF
>  	select KVM_ARM_PMU if HW_PERF_EVENTS
>  	select HAVE_KVM_MSI
>  	select HAVE_KVM_IRQCHIP
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 5ffbdc39e780..3be24c1e401f 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -37,3 +37,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-debug.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>  kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/arm/async_pf.o
> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index bdc0124a064a..22007dd3b9f0 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -94,6 +94,7 @@
>  
>  /* KVM "vendor specific" services */
>  #define ARM_SMCCC_KVM_FUNC_FEATURES		0
> +#define ARM_SMCCC_KVM_FUNC_APF			1
>  #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
>  #define ARM_SMCCC_KVM_NUM_FUNCS			128
>  
> @@ -102,6 +103,11 @@
>  			   ARM_SMCCC_SMC_32,				\
>  			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
>  			   ARM_SMCCC_KVM_FUNC_FEATURES)
> +#define ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID				\
> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
> +			   ARM_SMCCC_SMC_32,				\
> +			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
> +			   ARM_SMCCC_KVM_FUNC_APF)
>  
>  #ifndef __ASSEMBLY__
>  
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 2cbb57485760..3f62899cef13 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -222,6 +222,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		 */
>  		r = 1;
>  		break;
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	case KVM_CAP_ASYNC_PF:
> +		r = 1;
> +		break;
> +#endif
>  	default:
>  		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
>  		break;
> @@ -269,6 +274,10 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	/* Force users to call KVM_ARM_VCPU_INIT */
>  	vcpu->arch.target = -1;
>  	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	vcpu->arch.apf.control_block = 0UL;
> +	vcpu->arch.apf.no_fault_inst_range = 0x800;

Where has this magic number come from?

> +#endif
>  
>  	/* Set up the timer */
>  	kvm_timer_vcpu_init(vcpu);
> @@ -426,8 +435,27 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>  {
>  	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
> -	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
> -		&& !v->arch.power_off && !v->arch.pause);
> +
> +	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
> +	    !v->arch.power_off && !v->arch.pause)
> +		return true;
> +
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
> +		u32 val;
> +		int ret;
> +
> +		if (!list_empty_careful(&v->async_pf.done))
> +			return true;
> +
> +		ret = kvm_read_guest_cached(v->kvm, &v->arch.apf.cache,
> +					    &val, sizeof(val));
> +		if (ret || val)
> +			return true;
> +	}
> +#endif
> +
> +	return false;
>  }
>  
>  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
> @@ -683,6 +711,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  
>  		check_vcpu_requests(vcpu);
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +		kvm_check_async_pf_completion(vcpu);
> +#endif

Rather than adding ifdeffery like this, please add an empty stub for
when CONFIG_KVM_ASYNC_PF isn't selected, so that this can be used
unconditionally.
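
For reference, one possible shape of the stub being suggested (a sketch
only, not necessarily where Gavin ends up putting it):

	#ifdef CONFIG_KVM_ASYNC_PF
	void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
	#else
	static inline void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu) { }
	#endif

With that in place, the call site in kvm_arch_vcpu_ioctl_run() can simply
call kvm_check_async_pf_completion(vcpu) with no #ifdef around it.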

> +
>  		/*
>  		 * Preparing the interrupts to be injected also
>  		 * involves poking the GIC, which must be done in a
> diff --git a/virt/kvm/arm/async_pf.c b/virt/kvm/arm/async_pf.c
> new file mode 100644
> index 000000000000..5be49d684de3
> --- /dev/null
> +++ b/virt/kvm/arm/async_pf.c
> @@ -0,0 +1,335 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Asynchronous Page Fault Support
> + *
> + * Copyright (C) 2020 Red Hat, Inc., Gavin Shan
> + *
> + * Based on arch/x86/kernel/kvm.c
> + */
> +
> +#include <linux/arm-smccc.h>
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_emulate.h>
> +#include <kvm/arm_hypercalls.h>
> +
> +static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> +{
> +	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
> +}
> +
> +static inline u32 kvm_async_pf_hash_next(u32 key)
> +{
> +	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
> +}
> +
> +static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +	int i;
> +
> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
> +		vcpu->arch.apf.gfns[i] = ~0;
> +}
> +
> +/*
> + * Add gfn to the hash table. It's ensured there is a free entry
> + * when this function is called.
> + */
> +static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +
> +	while (vcpu->arch.apf.gfns[key] != ~0)
> +		key = kvm_async_pf_hash_next(key);
> +
> +	vcpu->arch.apf.gfns[key] = gfn;
> +}
> +
> +static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +	int i;
> +
> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
> +		if (vcpu->arch.apf.gfns[key] == gfn ||
> +		    vcpu->arch.apf.gfns[key] == ~0)
> +			break;
> +
> +		key = kvm_async_pf_hash_next(key);
> +	}
> +
> +	return key;
> +}
> +
> +static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 i, j, k;
> +
> +	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
> +	while (true) {
> +		vcpu->arch.apf.gfns[i] = ~0;
> +
> +		do {
> +			j = kvm_async_pf_hash_next(j);
> +			if (vcpu->arch.apf.gfns[j] == ~0)
> +				return;
> +
> +			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
> +			/*
> +			 * k lies cyclically in ]i,j]
> +			 * |    i.k.j |
> +			 * |....j i.k.| or  |.k..j i...|
> +			 */
> +		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
> +
> +		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
> +		i = j;
> +	}
> +}

This looks like a copy-paste of code under arch/x86.

This looks like something that should be factored into common code
rather than duplicated. Do we not have an existing common hash table
implementation that we can use rather than building one specific to KVM
async page faults?

> +
> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
> +
> +	return vcpu->arch.apf.gfns[key] == gfn;
> +}
> +
> +static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu, u32 *val)
> +{
> +	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
> +				     val, sizeof(*val));
> +}
> +
> +static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu, u32 val)
> +{
> +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
> +				      &val, sizeof(val));
> +}
> +
> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
> +{
> +	u64 vbar, pc;
> +	u32 val;
> +	int ret;
> +
> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
> +		return false;
> +
> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
> +		return false;
> +
> +	/* Pending page fault, which ins't acknowledged by guest */
> +	ret = kvm_async_pf_read_cache(vcpu, &val);
> +	if (ret || val)
> +		return false;
> +
> +	/*
> +	 * Events can't be injected through data abort because it's
> +	 * going to clobber ELR_EL1, which might not consued (or saved)
> +	 * by guest yet.
> +	 */
> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
> +	pc = *vcpu_pc(vcpu);
> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
> +		return false;

Ah, so that's when this `no_fault_inst_range` is for.

As-is this is not sufficient, and we'll need to be extremely careful
here.

The vectors themselves typically only have a small amount of stub code,
and the bulk of the non-reentrant exception entry work happens
elsewhere, in a mixture of assembly and C code that isn't even virtually
contiguous with either the vectors or itself.

In theory there could be code in modules (or perhaps in eBPF JIT'd code)
that isn't safe to take a fault from, so even having a contiguous range
controlled by the kernel isn't ideal.

How does this work on x86?

> +
> +	return true;
> +}
> +
> +/*
> + * We need deliver the page present signal as quick as possible because
> + * it's performance critical. So the signal is delivered no matter which
> + * privilege level the guest has. It's possible the signal can't be handled
> + * by the guest immediately. However, host doesn't contribute the delay
> + * anyway.
> + */
> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> +{
> +	u64 vbar, pc;
> +	u32 val;
> +	int ret;
> +
> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
> +		return true;
> +
> +	/* Pending page fault, which ins't acknowledged by guest */
> +	ret = kvm_async_pf_read_cache(vcpu, &val);
> +	if (ret || val)
> +		return false;
> +
> +	/*
> +	 * Events can't be injected through data abort because it's
> +	 * going to clobber ELR_EL1, which might not consued (or saved)
> +	 * by guest yet.
> +	 */
> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
> +	pc = *vcpu_pc(vcpu);
> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
> +		return false;
> +
> +	return true;
> +}

Much of this is identical to the not_present case, so the same comments
apply. The common bits should probably be factored out into a helper.
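
As a sketch, the checks that are identical in the two predicates above
(the "pending, unacknowledged notification" test and the vector-window
test) could be pulled into something like the following; the helper name
is hypothetical and the body is lifted from the code quoted above:

	static bool kvm_async_pf_can_notify(struct kvm_vcpu *vcpu)
	{
		u64 vbar, pc;
		u32 val;

		/* A previous notification hasn't been acknowledged yet */
		if (kvm_async_pf_read_cache(vcpu, &val) || val)
			return false;

		/* Don't clobber ELR_EL1 while the vCPU is inside its vectors */
		vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
		pc = *vcpu_pc(vcpu);
		if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
			return false;

		return true;
	}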

> +
> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
> +			    gpa_t gpa, gfn_t gfn)
> +{
> +	struct kvm_arch_async_pf arch;
> +	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
> +
> +	arch.token = (vcpu->arch.apf.id++ << 16) | vcpu->vcpu_id;
> +	arch.gfn = gfn;
> +	arch.esr = esr;
> +
> +	return kvm_setup_async_pf(vcpu, gpa, hva, &arch);
> +}
> +
> +/*
> + * It's garanteed that no pending asynchronous page fault when this is
> + * called. It means all previous issued asynchronous page faults have
> + * been acknoledged.
> + */
> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work)
> +{
> +	int ret;
> +
> +	kvm_async_pf_hash_add(vcpu, work->arch.gfn);
> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT);
> +	if (ret) {
> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
> +		return;
> +	}
> +
> +	kvm_inject_dabt(vcpu, work->arch.token);
> +}
> +
> +/*
> + * It's garanteed that no pending asynchronous page fault when this is
> + * called. It means all previous issued asynchronous page faults have
> + * been acknoledged.
> + */
> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> +				 struct kvm_async_pf *work)
> +{
> +	int ret;
> +
> +	/* Broadcast wakeup */
> +	if (work->wakeup_all)
> +		work->arch.token = ~0;
> +	else
> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
> +
> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_READY);
> +	if (ret) {
> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
> +		return;
> +	}
> +
> +	kvm_inject_dabt(vcpu, work->arch.token);

So the guest sees a fake S1 abort with a fake address?

How is the guest expected to distinguish this from a real S1 fault?

> +}
> +
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> +			       struct kvm_async_pf *work)
> +{
> +	struct kvm_memory_slot *memslot;
> +	unsigned int esr = work->arch.esr;
> +	phys_addr_t gpa = work->cr2_or_gpa;
> +	gfn_t gfn = gpa >> PAGE_SHIFT;

Perhaps:

	gfn_t gfn = gpa_to_gfn(gpa);

> +	unsigned long hva;
> +	bool write_fault, writable;
> +	int idx;
> +
> +	/*
> +	 * We shouldn't issue prefault for special work to wake up
> +	 * all pending tasks because the associated token (address)
> +	 * is invalid.
> +	 */

I'm not sure what this comment is trying to say.

> +	if (work->wakeup_all)
> +		return;
> +
> +	/*
> +	 * The gpa was validated before the work is started. However, the
> +	 * memory slots might be changed since then. So we need to redo the
> +	 * validatation here.
> +	 */
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> +	write_fault = kvm_is_write_fault(esr);
> +	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> +	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> +	if (kvm_is_error_hva(hva) || (write_fault && !writable))
> +		goto out;
> +
> +	kvm_handle_user_mem_abort(vcpu, esr, memslot, gpa, hva, true);
> +
> +out:
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +}
> +
> +static long kvm_async_pf_update_enable_reg(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	bool enabled, enable;
> +	gpa_t gpa = (data & ~0x3F);

What exactly is going on here? Why are the low 7 bits of data not valid?

This will also truncate the value to 32 bits; did you mean to do that?

> +	int ret;
> +
> +	enabled = !!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED);
> +	enable = !!(data & KVM_ASYNC_PF_ENABLED);
> +	if (enable == enabled) {
> +		kvm_debug("%s: Async PF has been %s (0x%llx -> 0x%llx)\n",
> +			  __func__, enabled ? "enabled" : "disabled",
> +			  vcpu->arch.apf.control_block, data);
> +		return SMCCC_RET_NOT_REQUIRED;
> +	}
> +
> +	if (enable) {
> +		ret = kvm_gfn_to_hva_cache_init(
> +			vcpu->kvm, &vcpu->arch.apf.cache,
> +			gpa + offsetof(struct kvm_vcpu_pv_apf_data, reason),
> +			sizeof(u32));
> +		if (ret) {
> +			kvm_err("%s: Error %d initializing cache on 0x%llx\n",
> +				__func__, ret, data);
> +			return SMCCC_RET_NOT_SUPPORTED;
> +		}
> +
> +		kvm_async_pf_hash_reset(vcpu);
> +		vcpu->arch.apf.send_user_only =
> +			!(data & KVM_ASYNC_PF_SEND_ALWAYS);
> +		kvm_async_pf_wakeup_all(vcpu);
> +		vcpu->arch.apf.control_block = data;
> +	} else {
> +		kvm_clear_async_pf_completion_queue(vcpu);
> +		vcpu->arch.apf.control_block = data;
> +	}
> +
> +	return SMCCC_RET_SUCCESS;
> +}
> +
> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu)
> +{
> +	u64 data, func, val, range;
> +	long ret = SMCCC_RET_SUCCESS;
> +
> +	data = (smccc_get_arg2(vcpu) << 32) | smccc_get_arg1(vcpu);

What prevents the high bits being set in arg1?

> +	func = data & (0xfful << 56);
> +	val = data & ~(0xfful << 56);
> +	switch (func) {
> +	case BIT(63):
> +		ret = kvm_async_pf_update_enable_reg(vcpu, val);

Please give BIT(63) a mnemonic.

> +		break;
> +	case BIT(62):

Likewise.
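
For example (the names below are only suggestions):

	/* Function bits carried in bits[63:56] of the 64-bit parameter */
	#define ARM_SMCCC_KVM_APF_CONTROL		BIT(63)	/* enable/disable/configure */
	#define ARM_SMCCC_KVM_APF_NOFAULT_RANGE		BIT(62)	/* set the no-fault window */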

> +		if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
> +			ret = SMCCC_RET_NOT_SUPPORTED;
> +			break;
> +		}
> +
> +		range = vcpu->arch.apf.no_fault_inst_range;
> +		vcpu->arch.apf.no_fault_inst_range = max(range, val);

Huh? How is the `no_fault_inst_range` set by the guest?

Thanks,
Mark.

> +		break;
> +	default:
> +		kvm_err("%s: Unrecognized function 0x%llx\n", __func__, func);
> +		ret = SMCCC_RET_NOT_SUPPORTED;
> +	}
> +
> +	return ret;
> +}
> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
> index db6dce3d0e23..a7e0fe17e2f1 100644
> --- a/virt/kvm/arm/hypercalls.c
> +++ b/virt/kvm/arm/hypercalls.c
> @@ -70,7 +70,15 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  		break;
>  	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
>  		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_APF);
> +#endif
>  		break;
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	case ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID:
> +		val[0] = kvm_async_pf_hypercall(vcpu);
> +		break;
> +#endif
>  	default:
>  		return kvm_psci_call(vcpu);
>  	}
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 95aaabb2b1fc..a303815845a2 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1656,6 +1656,30 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>  	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
>  }
>  
> +static bool try_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
> +			 gfn_t gfn, kvm_pfn_t *pfn, bool write,
> +			 bool *writable, bool prefault)
> +{
> +	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	bool async = false;
> +
> +	/* Bail if *pfn has correct page */
> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
> +	if (!async)
> +		return false;
> +
> +	if (!prefault && kvm_arch_can_inject_async_page_not_present(vcpu)) {
> +		if (kvm_async_pf_hash_find(vcpu, gfn) ||
> +		    kvm_arch_setup_async_pf(vcpu, esr, gpa, gfn))
> +			return true;
> +	}
> +#endif
> +
> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
> +	return false;
> +}
> +
>  int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>  			      struct kvm_memory_slot *memslot,
>  			      phys_addr_t fault_ipa, unsigned long hva,
> @@ -1737,7 +1761,10 @@ int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>  	 */
>  	smp_rmb();
>  
> -	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (try_async_pf(vcpu, esr, fault_ipa, gfn, &pfn,
> +			 write_fault, &writable, prefault))
> +		return 1;
> +
>  	if (pfn == KVM_PFN_ERR_HWPOISON) {
>  		kvm_send_hwpoison_signal(hva, vma_shift);
>  		return 0;
> -- 
> 2.23.0
>
Gavin Shan May 27, 2020, 4:05 a.m. UTC | #2
Hi Mark,

On 5/26/20 10:34 PM, Mark Rutland wrote:
> On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:
>> There are two stages of fault pages and the stage one page fault is
>> handled by guest itself. The guest is trapped to host when the page
>> fault is caused by stage 2 page table, for example missing. The guest
>> is suspended until the requested page is populated. To populate the
>> requested page can be related to IO activities if the page was swapped
>> out previously. In this case, the guest has to suspend for a few of
>> milliseconds at least, regardless of the overall system load. There
>> is no useful work done during the suspended period from guest's view.
> 
> This is a bit difficult to read. How about:
> 
> | When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
> | is not mapped at Stage-2), the vCPU is suspended until the host has
> | handled the fault. It can take the host milliseconds or longer to
> | handle the fault as this may require IO, and when the system load is
> | low neither the host nor guest perform useful work during such
> | periods.
> 

Yes, it's much better.

>>
>> This adds asychornous page fault to improve the situation. A signal
> 
> Nit: typo for `asynchronous` here, and there are a few other typos in
> the patch itself. It would be nice if you could run a spellcheck over
> that.
> 

Sure.

>> (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
>> to be populated. Guest might reschedule to another running process if
>> possible. Otherwise, the vCPU is put into power-saving mode, which is
>> actually to cause vCPU reschedule from host's view. A followup signal
>> (PAGE_READY) is sent to guest once the requested page is populated.
>> The suspended task is waken up or scheduled when guest receives the
>> signal. With this mechanism, the vCPU won't be stuck when the requested
>> page is being populated by host.
> 
> It would probably be best to say 'notification' rather than 'signal'
> here, and say 'the guest is notified', etc. As above, it seems that this
> is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to
> make it clear which context this applies to.
> 

Ok.

>>
>> There are more details highlighted as below. Note the implementation is
>> similar to what x86 has to some extent:
>>
>>     * A dedicated SMCCC ID is reserved to enable, disable or configure
>>       the functionality. The only 64-bits parameter is conveyed by two
>>       registers (w2/w1). Bits[63:56] is the bitmap used to specify the
>>       operated functionality like enabling/disabling/configuration. The
>>       bits[55:6] is the physical address of control block or external
>>       data abort injection disallowed region. Bit[5:0] are used to pass
>>       control flags.
>>
>>     * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
>>       isn't ready. In the mean while, a work is started to populate the
>>       page asynchronously in background. The stage 2 page table entry is
>>       updated accordingly and another signal (PAGE_READY) is fired after
>>       the request page is populted. The signals is notified by injected
>>       data abort fault.
>>
>>     * The signals are fired and consumed in sequential fashion. It means
>>       no more signals will be fired if there is pending one, awaiting the
>>       guest to consume. It's because the injected data abort faults have
>>       to be done in sequential fashion.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   arch/arm64/include/asm/kvm_host.h      |  43 ++++
>>   arch/arm64/include/asm/kvm_para.h      |  27 ++
>>   arch/arm64/include/uapi/asm/Kbuild     |   2 -
>>   arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
>>   arch/arm64/kvm/Kconfig                 |   1 +
>>   arch/arm64/kvm/Makefile                |   2 +
>>   include/linux/arm-smccc.h              |   6 +
>>   virt/kvm/arm/arm.c                     |  36 ++-
>>   virt/kvm/arm/async_pf.c                | 335 +++++++++++++++++++++++++
>>   virt/kvm/arm/hypercalls.c              |   8 +
>>   virt/kvm/arm/mmu.c                     |  29 ++-
>>   11 files changed, 506 insertions(+), 5 deletions(-)
>>   create mode 100644 arch/arm64/include/asm/kvm_para.h
>>   create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>>   create mode 100644 virt/kvm/arm/async_pf.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index f77c706777ec..a207728d6f3f 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -250,6 +250,23 @@ struct vcpu_reset_state {
>>   	bool		reset;
>>   };
>>   
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +
>> +/* Should be a power of two number */
>> +#define ASYNC_PF_PER_VCPU	64
> 
> What exactly is this number?
> 

It's the maximal number of async page faults that can be pending on a
specific vCPU.

>> +
>> +/*
>> + * The association of gfn and token. The token will be sent to guest as
>> + * page fault address. Also, the guest could be in aarch32 mode. So its
>> + * length should be 32-bits.
>> + */
> 
> The length of what should be 32-bit? The token?
> 
> The guest sees the token as the fault address? How exactly is that
> exposed to the guest, is that via a synthetic S1 fault?
> 

Yes, the token is 32 bits in length and it's exposed to the guest as the
fault address. I'll improve the comments in the next revision.
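
Concretely, the packing used by kvm_arch_setup_async_pf() in this patch
keeps the token within 32 bits (assuming vcpu_id fits in 16 bits):

	/* bits[31:16]: per-vCPU sequence id, bits[15:0]: vcpu_id */
	arch.token = (vcpu->arch.apf.id++ << 16) | vcpu->vcpu_id;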

>> +struct kvm_arch_async_pf {
>> +	u32     token;
>> +	gfn_t   gfn;
>> +	u32	esr;
>> +};
>> +#endif /* CONFIG_KVM_ASYNC_PF */
>> +
>>   struct kvm_vcpu_arch {
>>   	struct kvm_cpu_context ctxt;
>>   	void *sve_state;
>> @@ -351,6 +368,17 @@ struct kvm_vcpu_arch {
>>   		u64 last_steal;
>>   		gpa_t base;
>>   	} steal;
>> +
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	struct {
>> +		struct gfn_to_hva_cache	cache;
>> +		gfn_t			gfns[ASYNC_PF_PER_VCPU];
>> +		u64			control_block;
>> +		u16			id;
>> +		bool			send_user_only;
>> +		u64			no_fault_inst_range;
> 
> What are all of these fields? This implies functionality not covered
> in the commit message, and it's not at all clear what these are.
> 
> For example, what exactly is `no_fault_inst_range`? If it's a range,
> surely that needs a start/end or base/size pair rather than a single
> value?
> 

Ok. I will add more words about how the data struct is designed and
how it's used, etc., in the next revision.

>> +	} apf;
>> +#endif
>>   };
>>   
>>   /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>> @@ -604,6 +632,21 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>   
>>   static inline void __cpu_init_stage2(void) {}
>>   
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
>> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
>> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
>> +			    gpa_t gpa, gfn_t gfn);
>> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>> +				     struct kvm_async_pf *work);
>> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>> +				     struct kvm_async_pf *work);
>> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
>> +			       struct kvm_async_pf *work);
>> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu);
>> +#endif /* CONFIG_KVM_ASYNC_PF */
>> +
>>   /* Guest/host FPSIMD coordination helpers */
>>   int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
>>   void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
>> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
>> new file mode 100644
>> index 000000000000..0ea481dd1c7a
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/kvm_para.h
>> @@ -0,0 +1,27 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ASM_ARM_KVM_PARA_H
>> +#define _ASM_ARM_KVM_PARA_H
>> +
>> +#include <uapi/asm/kvm_para.h>
>> +
>> +static inline bool kvm_check_and_clear_guest_paused(void)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline unsigned int kvm_arch_para_features(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline unsigned int kvm_arch_para_hints(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline bool kvm_para_available(void)
>> +{
>> +	return false;
>> +}
>> +
>> +#endif /* _ASM_ARM_KVM_PARA_H */
>> diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
>> index 602d137932dc..f66554cd5c45 100644
>> --- a/arch/arm64/include/uapi/asm/Kbuild
>> +++ b/arch/arm64/include/uapi/asm/Kbuild
>> @@ -1,3 +1 @@
>>   # SPDX-License-Identifier: GPL-2.0
>> -
>> -generic-y += kvm_para.h
>> diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
>> new file mode 100644
>> index 000000000000..e0bd0e579b9a
>> --- /dev/null
>> +++ b/arch/arm64/include/uapi/asm/kvm_para.h
>> @@ -0,0 +1,22 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_ASM_ARM_KVM_PARA_H
>> +#define _UAPI_ASM_ARM_KVM_PARA_H
>> +
>> +#include <linux/types.h>
>> +
>> +#define KVM_FEATURE_ASYNC_PF	0
>> +
>> +/* Async PF */
>> +#define KVM_ASYNC_PF_ENABLED		(1 << 0)
>> +#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
>> +
>> +#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
>> +#define KVM_PV_REASON_PAGE_READY	2
>> +
>> +struct kvm_vcpu_pv_apf_data {
>> +	__u32	reason;
>> +	__u8	pad[60];
>> +	__u32	enabled;
>> +};
> 
> What's all the padding for?
> 

The padding is to ensure that @reason and @enabled are in different cache
lines. @reason is shared by host and guest, while @enabled is almost
exclusively owned by the guest.
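
If that's the intent, it could be made explicit with a compile-time check,
e.g. (sketch only; the offsets follow from the struct definition in this
patch):

	/* 'reason' is at offset 0 and 'enabled' at offset 64, so they only
	 * end up in the same cache line if lines are larger than 64 bytes.
	 */
	BUILD_BUG_ON(offsetof(struct kvm_vcpu_pv_apf_data, enabled) -
		     offsetof(struct kvm_vcpu_pv_apf_data, reason) < 64);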

>> +
>> +#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 449386d76441..1053e16b1739 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -34,6 +34,7 @@ config KVM
>>   	select KVM_VFIO
>>   	select HAVE_KVM_EVENTFD
>>   	select HAVE_KVM_IRQFD
>> +	select KVM_ASYNC_PF
>>   	select KVM_ARM_PMU if HW_PERF_EVENTS
>>   	select HAVE_KVM_MSI
>>   	select HAVE_KVM_IRQCHIP
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 5ffbdc39e780..3be24c1e401f 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -37,3 +37,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-debug.o
>>   kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
>>   kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>>   kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
>> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/arm/async_pf.o
>> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
>> index bdc0124a064a..22007dd3b9f0 100644
>> --- a/include/linux/arm-smccc.h
>> +++ b/include/linux/arm-smccc.h
>> @@ -94,6 +94,7 @@
>>   
>>   /* KVM "vendor specific" services */
>>   #define ARM_SMCCC_KVM_FUNC_FEATURES		0
>> +#define ARM_SMCCC_KVM_FUNC_APF			1
>>   #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
>>   #define ARM_SMCCC_KVM_NUM_FUNCS			128
>>   
>> @@ -102,6 +103,11 @@
>>   			   ARM_SMCCC_SMC_32,				\
>>   			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
>>   			   ARM_SMCCC_KVM_FUNC_FEATURES)
>> +#define ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID				\
>> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
>> +			   ARM_SMCCC_SMC_32,				\
>> +			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
>> +			   ARM_SMCCC_KVM_FUNC_APF)
>>   
>>   #ifndef __ASSEMBLY__
>>   
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index 2cbb57485760..3f62899cef13 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -222,6 +222,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   		 */
>>   		r = 1;
>>   		break;
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	case KVM_CAP_ASYNC_PF:
>> +		r = 1;
>> +		break;
>> +#endif
>>   	default:
>>   		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
>>   		break;
>> @@ -269,6 +274,10 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>>   	/* Force users to call KVM_ARM_VCPU_INIT */
>>   	vcpu->arch.target = -1;
>>   	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	vcpu->arch.apf.control_block = 0UL;
>> +	vcpu->arch.apf.no_fault_inst_range = 0x800;
> 
> Where has this magic number come from?
> 

It's the total length (size) of the exception vector table. When the
vCPU's PC is in that range, the DABT can't be injected.
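
For reference, the constant falls out of the vector table layout: 16
vectors of 0x80 bytes each. Deriving it from a named constant would make
the intent clearer (the name below is hypothetical):

	/* 4 exception types x 4 source modes, 0x80 bytes per vector */
	#define KVM_APF_VECTOR_TABLE_SIZE	(16 * 0x80)	/* == 0x800 */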

>> +#endif
>>   
>>   	/* Set up the timer */
>>   	kvm_timer_vcpu_init(vcpu);
>> @@ -426,8 +435,27 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>>   int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>   {
>>   	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
>> -	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
>> -		&& !v->arch.power_off && !v->arch.pause);
>> +
>> +	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
>> +	    !v->arch.power_off && !v->arch.pause)
>> +		return true;
>> +
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
>> +		u32 val;
>> +		int ret;
>> +
>> +		if (!list_empty_careful(&v->async_pf.done))
>> +			return true;
>> +
>> +		ret = kvm_read_guest_cached(v->kvm, &v->arch.apf.cache,
>> +					    &val, sizeof(val));
>> +		if (ret || val)
>> +			return true;
>> +	}
>> +#endif
>> +
>> +	return false;
>>   }
>>   
>>   bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
>> @@ -683,6 +711,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   
>>   		check_vcpu_requests(vcpu);
>>   
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +		kvm_check_async_pf_completion(vcpu);
>> +#endif
> 
> Rather than adding ifdeffery like this, please add an empty stub for
> when CONFIG_KVM_ASYNC_PF isn't selected, so that this can be used
> unconditionally.
> 

Ok.

>> +
>>   		/*
>>   		 * Preparing the interrupts to be injected also
>>   		 * involves poking the GIC, which must be done in a
>> diff --git a/virt/kvm/arm/async_pf.c b/virt/kvm/arm/async_pf.c
>> new file mode 100644
>> index 000000000000..5be49d684de3
>> --- /dev/null
>> +++ b/virt/kvm/arm/async_pf.c
>> @@ -0,0 +1,335 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Asynchronous Page Fault Support
>> + *
>> + * Copyright (C) 2020 Red Hat, Inc., Gavin Shan
>> + *
>> + * Based on arch/x86/kernel/kvm.c
>> + */
>> +
>> +#include <linux/arm-smccc.h>
>> +#include <linux/kvm_host.h>
>> +#include <asm/kvm_emulate.h>
>> +#include <kvm/arm_hypercalls.h>
>> +
>> +static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
>> +{
>> +	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
>> +}
>> +
>> +static inline u32 kvm_async_pf_hash_next(u32 key)
>> +{
>> +	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
>> +}
>> +
>> +static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
>> +		vcpu->arch.apf.gfns[i] = ~0;
>> +}
>> +
>> +/*
>> + * Add gfn to the hash table. It's ensured there is a free entry
>> + * when this function is called.
>> + */
>> +static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_fn(gfn);
>> +
>> +	while (vcpu->arch.apf.gfns[key] != ~0)
>> +		key = kvm_async_pf_hash_next(key);
>> +
>> +	vcpu->arch.apf.gfns[key] = gfn;
>> +}
>> +
>> +static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_fn(gfn);
>> +	int i;
>> +
>> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
>> +		if (vcpu->arch.apf.gfns[key] == gfn ||
>> +		    vcpu->arch.apf.gfns[key] == ~0)
>> +			break;
>> +
>> +		key = kvm_async_pf_hash_next(key);
>> +	}
>> +
>> +	return key;
>> +}
>> +
>> +static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 i, j, k;
>> +
>> +	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
>> +	while (true) {
>> +		vcpu->arch.apf.gfns[i] = ~0;
>> +
>> +		do {
>> +			j = kvm_async_pf_hash_next(j);
>> +			if (vcpu->arch.apf.gfns[j] == ~0)
>> +				return;
>> +
>> +			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
>> +			/*
>> +			 * k lies cyclically in ]i,j]
>> +			 * |    i.k.j |
>> +			 * |....j i.k.| or  |.k..j i...|
>> +			 */
>> +		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
>> +
>> +		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
>> +		i = j;
>> +	}
>> +}
> 
> This looks like a copy-paste of code under arch/x86.
> 
> This looks like something that should be factored into common code
> rather than duplicated. Do we not have an existing common hash table
> implementation that we can use rather than building one specific to KVM
> async page faults?
> 

Yeah, it's copied from arch/x86. I will explore the possibility of making
the code sharable among multiple architectures.

>> +
>> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
>> +
>> +	return vcpu->arch.apf.gfns[key] == gfn;
>> +}
>> +
>> +static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu, u32 *val)
>> +{
>> +	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
>> +				     val, sizeof(*val));
>> +}
>> +
>> +static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu, u32 val)
>> +{
>> +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
>> +				      &val, sizeof(val));
>> +}
>> +
>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 vbar, pc;
>> +	u32 val;
>> +	int ret;
>> +
>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>> +		return false;
>> +
>> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>> +		return false;
>> +
>> +	/* Pending page fault, which ins't acknowledged by guest */
>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>> +	if (ret || val)
>> +		return false;
>> +
>> +	/*
>> +	 * Events can't be injected through data abort because it's
>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>> +	 * by guest yet.
>> +	 */
>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>> +	pc = *vcpu_pc(vcpu);
>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>> +		return false;
> 
> Ah, so that's when this `no_fault_inst_range` is for.
> 
> As-is this is not sufficient, and we'll need t be extremely careful
> here.
> 
> The vectors themselves typically only have a small amount of stub code,
> and the bulk of the non-reentrant exception entry work happens
> elsewhere, in a mixture of assembly and C code that isn't even virtually
> contiguous with either the vectors or itself.
> 
> It's possible in theory that code in modules (or perhaps in eBPF JIT'd
> code) that isn't safe to take a fault from, so even having a contiguous
> range controlled by the kernel isn't ideal.
> 
> How does this work on x86?
> 

Yeah, here we just provide a mechanism to forbid injecting the data abort.
The range is fed by the guest through an HVC call, so I think it's a
guest-related issue. You had more comments about this in PATCH[9]; I will
explain a bit more there.

x86 basically relies on the EFLAGS.IF flag: the async page fault can be
injected if it's set, and is forbidden otherwise. That works because, if
I'm correct, exceptions on x86 are a special kind of interrupt.

            return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
                   !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
                         (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));


>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * We need deliver the page present signal as quick as possible because
>> + * it's performance critical. So the signal is delivered no matter which
>> + * privilege level the guest has. It's possible the signal can't be handled
>> + * by the guest immediately. However, host doesn't contribute the delay
>> + * anyway.
>> + */
>> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 vbar, pc;
>> +	u32 val;
>> +	int ret;
>> +
>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>> +		return true;
>> +
>> +	/* Pending page fault, which ins't acknowledged by guest */
>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>> +	if (ret || val)
>> +		return false;
>> +
>> +	/*
>> +	 * Events can't be injected through data abort because it's
>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>> +	 * by guest yet.
>> +	 */
>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>> +	pc = *vcpu_pc(vcpu);
>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>> +		return false;
>> +
>> +	return true;
>> +}
> 
> Much of this is identical to the not_present case, so the same comments
> apply. The common bits should probably be factored out into a helper.
> 

Yep, I will take a look at introducing a helper function if needed.

>> +
>> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
>> +			    gpa_t gpa, gfn_t gfn)
>> +{
>> +	struct kvm_arch_async_pf arch;
>> +	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
>> +
>> +	arch.token = (vcpu->arch.apf.id++ << 16) | vcpu->vcpu_id;
>> +	arch.gfn = gfn;
>> +	arch.esr = esr;
>> +
>> +	return kvm_setup_async_pf(vcpu, gpa, hva, &arch);
>> +}
>> +
>> +/*
>> + * It's garanteed that no pending asynchronous page fault when this is
>> + * called. It means all previous issued asynchronous page faults have
>> + * been acknoledged.
>> + */
>> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>> +				     struct kvm_async_pf *work)
>> +{
>> +	int ret;
>> +
>> +	kvm_async_pf_hash_add(vcpu, work->arch.gfn);
>> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT);
>> +	if (ret) {
>> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
>> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
>> +		return;
>> +	}
>> +
>> +	kvm_inject_dabt(vcpu, work->arch.token);
>> +}
>> +
>> +/*
>> + * It's garanteed that no pending asynchronous page fault when this is
>> + * called. It means all previous issued asynchronous page faults have
>> + * been acknoledged.
>> + */
>> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>> +				 struct kvm_async_pf *work)
>> +{
>> +	int ret;
>> +
>> +	/* Broadcast wakeup */
>> +	if (work->wakeup_all)
>> +		work->arch.token = ~0;
>> +	else
>> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
>> +
>> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_READY);
>> +	if (ret) {
>> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
>> +		return;
>> +	}
>> +
>> +	kvm_inject_dabt(vcpu, work->arch.token);
> 
> So the guest sees a fake S1 abort with a fake address?
> 
> How is the guest expected to distinguish this from a real S1 fault?
> 

KVM_PV_REASON_PAGE_READY is stored in the per-vCPU status block, which
is represented by the following data struct. When @reason is zero, it's
a normal fault. Otherwise, it's an async page fault. So an async page
fault and a normal one shouldn't cross over.

    struct kvm_vcpu_pv_apf_data {
         __u32   reason;
         __u8    pad[60];
         __u32   enabled;
    };
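
So, on the guest side, the data abort handler is expected to look at
@reason before treating the abort as a real stage 1 fault, roughly along
the lines of the sketch below. The helper and variable names here are
hypothetical; the actual guest support is in a later patch of this series:

	static bool handle_async_pf(unsigned long addr)	/* addr is the token */
	{
		u32 reason = READ_ONCE(apf_data.reason);

		if (!reason)
			return false;			/* genuine stage 1 data abort */

		WRITE_ONCE(apf_data.reason, 0);		/* acknowledge the notification */
		if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT)
			apf_task_wait(addr);		/* block/yield until PAGE_READY */
		else if (reason == KVM_PV_REASON_PAGE_READY)
			apf_task_wake(addr);		/* wake the task waiting on this token */

		return true;
	}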

>> +}
>> +
>> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
>> +			       struct kvm_async_pf *work)
>> +{
>> +	struct kvm_memory_slot *memslot;
>> +	unsigned int esr = work->arch.esr;
>> +	phys_addr_t gpa = work->cr2_or_gpa;
>> +	gfn_t gfn = gpa >> PAGE_SHIFT;
> 
> Perhaps:
> 
> 	gfn_t gfn = gpa_to_gfn(gpa);
> 

Sure.

>> +	unsigned long hva;
>> +	bool write_fault, writable;
>> +	int idx;
>> +
>> +	/*
>> +	 * We shouldn't issue prefault for special work to wake up
>> +	 * all pending tasks because the associated token (address)
>> +	 * is invalid.
>> +	 */
> 
> I'm not sure what this comment is trying to say.
> 

There are two types of work. One of them is to wake up all sleepers in
the guest, e.g. in preparation for shutting down the machine. For this
type of wakeup, the token (faulting address) is set to ~0, which is
invalid, so no prefault is issued for it. I'll improve the comments in
the next revision :)

>> +	if (work->wakeup_all)
>> +		return;
>> +
>> +	/*
>> +	 * The gpa was validated before the work is started. However, the
>> +	 * memory slots might be changed since then. So we need to redo the
>> +	 * validatation here.
>> +	 */
>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +
>> +	write_fault = kvm_is_write_fault(esr);
>> +	memslot = gfn_to_memslot(vcpu->kvm, gfn);
>> +	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>> +	if (kvm_is_error_hva(hva) || (write_fault && !writable))
>> +		goto out;
>> +
>> +	kvm_handle_user_mem_abort(vcpu, esr, memslot, gpa, hva, true);
>> +
>> +out:
>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +}
>> +
>> +static long kvm_async_pf_update_enable_reg(struct kvm_vcpu *vcpu, u64 data)
>> +{
>> +	bool enabled, enable;
>> +	gpa_t gpa = (data & ~0x3F);
> 
> What exactly is going on here? Why are the low 7 bits of data not valid?
> 
> This will also truncate the value to 32 bits; did you mean to do that?
> 

Yes, because the control block (@data) is 64-byte aligned.
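
In other words, the 64-byte alignment guarantees that bits[5:0] of the
address are zero, so they can carry the control flags and simply be masked
off. A sketch of that split, using an explicitly unsigned 64-bit constant
to keep the masking unambiguous:

	gpa_t gpa = data & ~0x3FULL;	/* control block GPA, bits[5:0] clear */
	u64 flags = data & 0x3FULL;	/* KVM_ASYNC_PF_ENABLED, ..._SEND_ALWAYS */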

>> +	int ret;
>> +
>> +	enabled = !!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED);
>> +	enable = !!(data & KVM_ASYNC_PF_ENABLED);
>> +	if (enable == enabled) {
>> +		kvm_debug("%s: Async PF has been %s (0x%llx -> 0x%llx)\n",
>> +			  __func__, enabled ? "enabled" : "disabled",
>> +			  vcpu->arch.apf.control_block, data);
>> +		return SMCCC_RET_NOT_REQUIRED;
>> +	}
>> +
>> +	if (enable) {
>> +		ret = kvm_gfn_to_hva_cache_init(
>> +			vcpu->kvm, &vcpu->arch.apf.cache,
>> +			gpa + offsetof(struct kvm_vcpu_pv_apf_data, reason),
>> +			sizeof(u32));
>> +		if (ret) {
>> +			kvm_err("%s: Error %d initializing cache on 0x%llx\n",
>> +				__func__, ret, data);
>> +			return SMCCC_RET_NOT_SUPPORTED;
>> +		}
>> +
>> +		kvm_async_pf_hash_reset(vcpu);
>> +		vcpu->arch.apf.send_user_only =
>> +			!(data & KVM_ASYNC_PF_SEND_ALWAYS);
>> +		kvm_async_pf_wakeup_all(vcpu);
>> +		vcpu->arch.apf.control_block = data;
>> +	} else {
>> +		kvm_clear_async_pf_completion_queue(vcpu);
>> +		vcpu->arch.apf.control_block = data;
>> +	}
>> +
>> +	return SMCCC_RET_SUCCESS;
>> +}
>> +
>> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 data, func, val, range;
>> +	long ret = SMCCC_RET_SUCCESS;
>> +
>> +	data = (smccc_get_arg2(vcpu) << 32) | smccc_get_arg1(vcpu);
> 
> What prevents the high bits being set in arg1?
> 

The function should work for both 32-bit and 64-bit guests.

>> +	func = data & (0xfful << 56);
>> +	val = data & ~(0xfful << 56);
>> +	switch (func) {
>> +	case BIT(63):
>> +		ret = kvm_async_pf_update_enable_reg(vcpu, val);
> 
> Please give BIT(63) a mnemonic.
> 
>> +		break;
>> +	case BIT(62):
> 
> Likewise.
> 

Yes.

>> +		if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
>> +			ret = SMCCC_RET_NOT_SUPPORTED;
>> +			break;
>> +		}
>> +
>> +		range = vcpu->arch.apf.no_fault_inst_range;
>> +		vcpu->arch.apf.no_fault_inst_range = max(range, val);
> 
> Huh? How is the `no_fault_inst_range` set by he guest?
> 

I'll explain when replying to your comments on PATCH[9] :)

> Thanks,
> Mark.
> 

Thanks,
Gavin

>> +		break;
>> +	default:
>> +		kvm_err("%s: Unrecognized function 0x%llx\n", __func__, func);
>> +		ret = SMCCC_RET_NOT_SUPPORTED;
>> +	}
>> +
>> +	return ret;
>> +}
>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
>> index db6dce3d0e23..a7e0fe17e2f1 100644
>> --- a/virt/kvm/arm/hypercalls.c
>> +++ b/virt/kvm/arm/hypercalls.c
>> @@ -70,7 +70,15 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>>   		break;
>>   	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
>>   		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_APF);
>> +#endif
>>   		break;
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	case ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID:
>> +		val[0] = kvm_async_pf_hypercall(vcpu);
>> +		break;
>> +#endif
>>   	default:
>>   		return kvm_psci_call(vcpu);
>>   	}
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 95aaabb2b1fc..a303815845a2 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1656,6 +1656,30 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>>   	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
>>   }
>>   
>> +static bool try_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
>> +			 gfn_t gfn, kvm_pfn_t *pfn, bool write,
>> +			 bool *writable, bool prefault)
>> +{
>> +	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	bool async = false;
>> +
>> +	/* Bail if *pfn has correct page */
>> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
>> +	if (!async)
>> +		return false;
>> +
>> +	if (!prefault && kvm_arch_can_inject_async_page_not_present(vcpu)) {
>> +		if (kvm_async_pf_hash_find(vcpu, gfn) ||
>> +		    kvm_arch_setup_async_pf(vcpu, esr, gpa, gfn))
>> +			return true;
>> +	}
>> +#endif
>> +
>> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
>> +	return false;
>> +}
>> +
>>   int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>>   			      struct kvm_memory_slot *memslot,
>>   			      phys_addr_t fault_ipa, unsigned long hva,
>> @@ -1737,7 +1761,10 @@ int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>>   	 */
>>   	smp_rmb();
>>   
>> -	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> +	if (try_async_pf(vcpu, esr, fault_ipa, gfn, &pfn,
>> +			 write_fault, &writable, prefault))
>> +		return 1;
>> +
>>   	if (pfn == KVM_PFN_ERR_HWPOISON) {
>>   		kvm_send_hwpoison_signal(hva, vma_shift);
>>   		return 0;
>> -- 
>> 2.23.0
>>
>
Marc Zyngier May 27, 2020, 7:37 a.m. UTC | #3
On 2020-05-27 05:05, Gavin Shan wrote:
> Hi Mark,
> 

[...]

>>> +struct kvm_vcpu_pv_apf_data {
>>> +	__u32	reason;
>>> +	__u8	pad[60];
>>> +	__u32	enabled;
>>> +};
>> 
>> What's all the padding for?
>> 
> 
> The padding is to ensure that @reason and @enabled sit in different
> cache lines. @reason is shared by host/guest, while @enabled is almost
> exclusively owned by the guest.

So you are assuming that a cache line is at most 64 bytes.
It is actually implementation defined, and you can probe for it
by looking at the CTR_EL0 register. There are implementations
ranging from 32 to 256 bytes in the wild, and let's not mention
broken big-little implementations here.

[...]

>>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu 
>>> *vcpu)
>>> +{
>>> +	u64 vbar, pc;
>>> +	u32 val;
>>> +	int ret;
>>> +
>>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>>> +		return false;
>>> +
>>> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>>> +		return false;
>>> +
>>> +	/* Pending page fault, which ins't acknowledged by guest */
>>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>>> +	if (ret || val)
>>> +		return false;
>>> +
>>> +	/*
>>> +	 * Events can't be injected through data abort because it's
>>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>>> +	 * by guest yet.
>>> +	 */
>>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>>> +	pc = *vcpu_pc(vcpu);
>>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>>> +		return false;
>> 
>> Ah, so that's when this `no_fault_inst_range` is for.
>> 
>> As-is this is not sufficient, and we'll need t be extremely careful
>> here.
>> 
>> The vectors themselves typically only have a small amount of stub 
>> code,
>> and the bulk of the non-reentrant exception entry work happens
>> elsewhere, in a mixture of assembly and C code that isn't even 
>> virtually
>> contiguous with either the vectors or itself.
>> 
>> It's possible in theory that code in modules (or perhaps in eBPF JIT'd
>> code) that isn't safe to take a fault from, so even having a 
>> contiguous
>> range controlled by the kernel isn't ideal.
>> 
>> How does this work on x86?
>> 
> 
> Yeah, here we just provide a mechanism to forbid injecting data aborts.
> The range is fed by the guest through an HVC call, so I think it's a
> guest-related issue. You had more comments about this in PATCH[9]; I
> will explain a bit more there.
> 
> x86 basically relies on the EFLAGS[IF] flag. The async page fault can be
> injected if it's on; otherwise, it's forbidden. It's workable because an
> exception is a special kind of interrupt on x86, if I'm correct.
> 
>            return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>                   !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>                     (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));

I really wish this was relying on an architected exception delivery
mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
Trying to guess based on the PC won't fly. But these signals are
pretty hard to multiplex with anything else. Like any form of
non-architected exception injection, I don't see a good path forward
unless we start considering something like SDEI.

         M.
Gavin Shan May 28, 2020, 6:32 a.m. UTC | #4
Hi Marc,

On 5/27/20 5:37 PM, Marc Zyngier wrote:
> On 2020-05-27 05:05, Gavin Shan wrote:

[...]
  
>>>> +struct kvm_vcpu_pv_apf_data {
>>>> +    __u32    reason;
>>>> +    __u8    pad[60];
>>>> +    __u32    enabled;
>>>> +};
>>>
>>> What's all the padding for?
>>>
>>
>> The padding is to ensure that @reason and @enabled sit in different
>> cache lines. @reason is shared by host/guest, while @enabled is almost
>> exclusively owned by the guest.
> 
> So you are assuming that a cache line is at most 64 bytes.
> It is actually implementation defined, and you can probe for it
> by looking at the CTR_EL0 register. There are implementations
> ranging from 32 to 256 bytes in the wild, and let's not mention
> broken big-little implementations here.
> 
> [...]
> 

Ok, Thanks for your comments and hints.
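
For reference, something along these lines could be used to probe the
granule at runtime instead of assuming 64 bytes (only a sketch, not part
of this series):

/*
 * Sketch only: CTR_EL0.CWG (bits [27:24]) encodes the maximum cache
 * writeback granule as 4 << CWG bytes; 0 means "not provided", so fall
 * back to a conservative value. Padding shared fields to this granule
 * keeps @reason and @enabled on separate cache lines.
 */
static inline unsigned int cache_writeback_granule(void)
{
	unsigned long ctr;
	unsigned int cwg;

	asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
	cwg = (ctr >> 24) & 0xf;

	return cwg ? 4U << cwg : 128;
}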

>>>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +    u64 vbar, pc;
>>>> +    u32 val;
>>>> +    int ret;
>>>> +
>>>> +    if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>>>> +        return false;
>>>> +
>>>> +    if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>>>> +        return false;
>>>> +
>>>> +    /* Pending page fault, which ins't acknowledged by guest */
>>>> +    ret = kvm_async_pf_read_cache(vcpu, &val);
>>>> +    if (ret || val)
>>>> +        return false;
>>>> +
>>>> +    /*
>>>> +     * Events can't be injected through data abort because it's
>>>> +     * going to clobber ELR_EL1, which might not consued (or saved)
>>>> +     * by guest yet.
>>>> +     */
>>>> +    vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>>>> +    pc = *vcpu_pc(vcpu);
>>>> +    if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>>>> +        return false;
>>>
>>> Ah, so that's when this `no_fault_inst_range` is for.
>>>
>>> As-is this is not sufficient, and we'll need t be extremely careful
>>> here.
>>>
>>> The vectors themselves typically only have a small amount of stub code,
>>> and the bulk of the non-reentrant exception entry work happens
>>> elsewhere, in a mixture of assembly and C code that isn't even virtually
>>> contiguous with either the vectors or itself.
>>>
>>> It's possible in theory that code in modules (or perhaps in eBPF JIT'd
>>> code) that isn't safe to take a fault from, so even having a contiguous
>>> range controlled by the kernel isn't ideal.
>>>
>>> How does this work on x86?
>>>
>>
>> Yeah, here we just provide a mechanism to forbid injecting data aborts.
>> The range is fed by the guest through an HVC call, so I think it's a
>> guest-related issue. You had more comments about this in PATCH[9]; I
>> will explain a bit more there.
>>
>> x86 basically relies on the EFLAGS[IF] flag. The async page fault can be
>> injected if it's on; otherwise, it's forbidden. It's workable because an
>> exception is a special kind of interrupt on x86, if I'm correct.
>>
>>            return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>>                   !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>>                         (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
> 
> I really wish this was relying on an architected exception delivery
> mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
> Trying to guess based on the PC won't fly. But these signals are
> pretty hard to multiplex with anything else. Like any form of
> non-architected exception injection, I don't see a good path forward
> unless we start considering something like SDEI.
> 
>          M.

As Paolo mentioned in another reply, there are two types of notifications
sent from host to guest: page_not_present and page_ready. The page_not_present
notification should be delivered synchronously, while page_ready can be
delivered asynchronously. He also suggested reserving an ESR (or DFSC) subclass
for page_not_present. page_ready, on the other hand, can be delivered by an
interrupt, such as a PPI. x86 is changing its code to deliver page_ready by
interrupt, where it was previously delivered by exception.

When we use an interrupt instead of an exception for page_ready, we won't
need to play the game of guessing the PC.
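
If we do move to an architected delivery mechanism, the host-side gate could
simply honour the guest's exception mask bits rather than comparing the PC
against VBAR_EL1. A hypothetical sketch (assuming the existing *vcpu_cpsr()
accessor and the PSR_*_BIT masks):

/* Hypothetical: don't inject while the guest masks IRQs or SErrors. */
static bool guest_accepts_async_pf(struct kvm_vcpu *vcpu)
{
	unsigned long pstate = *vcpu_cpsr(vcpu);

	return !(pstate & (PSR_I_BIT | PSR_A_BIT));
}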

I assume you prefer to use SDEI for page_not_present, correct? In that case,
what's the current status of SDEI? I mean, has it been fully or partially
supported, or do we need to develop it from scratch? :)

Thanks,
Gavin

Patch

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f77c706777ec..a207728d6f3f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -250,6 +250,23 @@  struct vcpu_reset_state {
 	bool		reset;
 };
 
+#ifdef CONFIG_KVM_ASYNC_PF
+
+/* Should be a power of two number */
+#define ASYNC_PF_PER_VCPU	64
+
+/*
+ * The association of gfn and token. The token will be sent to the
+ * guest as the page fault address. Also, the guest could be in
+ * AArch32 mode, so its length should be 32 bits.
+ */
+struct kvm_arch_async_pf {
+	u32     token;
+	gfn_t   gfn;
+	u32	esr;
+};
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 struct kvm_vcpu_arch {
 	struct kvm_cpu_context ctxt;
 	void *sve_state;
@@ -351,6 +368,17 @@  struct kvm_vcpu_arch {
 		u64 last_steal;
 		gpa_t base;
 	} steal;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+	struct {
+		struct gfn_to_hva_cache	cache;
+		gfn_t			gfns[ASYNC_PF_PER_VCPU];
+		u64			control_block;
+		u16			id;
+		bool			send_user_only;
+		u64			no_fault_inst_range;
+	} apf;
+#endif
 };
 
 /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
@@ -604,6 +632,21 @@  int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
 static inline void __cpu_init_stage2(void) {}
 
+#ifdef CONFIG_KVM_ASYNC_PF
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
+			    gpa_t gpa, gfn_t gfn);
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work);
+long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu);
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
new file mode 100644
index 000000000000..0ea481dd1c7a
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_para.h
@@ -0,0 +1,27 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM_KVM_PARA_H
+#define _ASM_ARM_KVM_PARA_H
+
+#include <uapi/asm/kvm_para.h>
+
+static inline bool kvm_check_and_clear_guest_paused(void)
+{
+	return false;
+}
+
+static inline unsigned int kvm_arch_para_features(void)
+{
+	return 0;
+}
+
+static inline unsigned int kvm_arch_para_hints(void)
+{
+	return 0;
+}
+
+static inline bool kvm_para_available(void)
+{
+	return false;
+}
+
+#endif /* _ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
index 602d137932dc..f66554cd5c45 100644
--- a/arch/arm64/include/uapi/asm/Kbuild
+++ b/arch/arm64/include/uapi/asm/Kbuild
@@ -1,3 +1 @@ 
 # SPDX-License-Identifier: GPL-2.0
-
-generic-y += kvm_para.h
diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
new file mode 100644
index 000000000000..e0bd0e579b9a
--- /dev/null
+++ b/arch/arm64/include/uapi/asm/kvm_para.h
@@ -0,0 +1,22 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_ARM_KVM_PARA_H
+#define _UAPI_ASM_ARM_KVM_PARA_H
+
+#include <linux/types.h>
+
+#define KVM_FEATURE_ASYNC_PF	0
+
+/* Async PF */
+#define KVM_ASYNC_PF_ENABLED		(1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
+
+#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
+#define KVM_PV_REASON_PAGE_READY	2
+
+struct kvm_vcpu_pv_apf_data {
+	__u32	reason;
+	__u8	pad[60];
+	__u32	enabled;
+};
+
+#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 449386d76441..1053e16b1739 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -34,6 +34,7 @@  config KVM
 	select KVM_VFIO
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_IRQFD
+	select KVM_ASYNC_PF
 	select KVM_ARM_PMU if HW_PERF_EVENTS
 	select HAVE_KVM_MSI
 	select HAVE_KVM_IRQCHIP
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 5ffbdc39e780..3be24c1e401f 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -37,3 +37,5 @@  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-debug.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
 kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
+kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
+kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/arm/async_pf.o
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index bdc0124a064a..22007dd3b9f0 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -94,6 +94,7 @@ 
 
 /* KVM "vendor specific" services */
 #define ARM_SMCCC_KVM_FUNC_FEATURES		0
+#define ARM_SMCCC_KVM_FUNC_APF			1
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
 #define ARM_SMCCC_KVM_NUM_FUNCS			128
 
@@ -102,6 +103,11 @@ 
 			   ARM_SMCCC_SMC_32,				\
 			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
 			   ARM_SMCCC_KVM_FUNC_FEATURES)
+#define ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID				\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
+			   ARM_SMCCC_SMC_32,				\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
+			   ARM_SMCCC_KVM_FUNC_APF)
 
 #ifndef __ASSEMBLY__
 
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 2cbb57485760..3f62899cef13 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -222,6 +222,11 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		 */
 		r = 1;
 		break;
+#ifdef CONFIG_KVM_ASYNC_PF
+	case KVM_CAP_ASYNC_PF:
+		r = 1;
+		break;
+#endif
 	default:
 		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
 		break;
@@ -269,6 +274,10 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	/* Force users to call KVM_ARM_VCPU_INIT */
 	vcpu->arch.target = -1;
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
+#ifdef CONFIG_KVM_ASYNC_PF
+	vcpu->arch.apf.control_block = 0UL;
+	vcpu->arch.apf.no_fault_inst_range = 0x800;
+#endif
 
 	/* Set up the timer */
 	kvm_timer_vcpu_init(vcpu);
@@ -426,8 +435,27 @@  int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 {
 	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
-	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
-		&& !v->arch.power_off && !v->arch.pause);
+
+	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
+	    !v->arch.power_off && !v->arch.pause)
+		return true;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
+		u32 val;
+		int ret;
+
+		if (!list_empty_careful(&v->async_pf.done))
+			return true;
+
+		ret = kvm_read_guest_cached(v->kvm, &v->arch.apf.cache,
+					    &val, sizeof(val));
+		if (ret || val)
+			return true;
+	}
+#endif
+
+	return false;
 }
 
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
@@ -683,6 +711,10 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 		check_vcpu_requests(vcpu);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+		kvm_check_async_pf_completion(vcpu);
+#endif
+
 		/*
 		 * Preparing the interrupts to be injected also
 		 * involves poking the GIC, which must be done in a
diff --git a/virt/kvm/arm/async_pf.c b/virt/kvm/arm/async_pf.c
new file mode 100644
index 000000000000..5be49d684de3
--- /dev/null
+++ b/virt/kvm/arm/async_pf.c
@@ -0,0 +1,335 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Asynchronous Page Fault Support
+ *
+ * Copyright (C) 2020 Red Hat, Inc., Gavin Shan
+ *
+ * Based on arch/x86/kernel/kvm.c
+ */
+
+#include <linux/arm-smccc.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_emulate.h>
+#include <kvm/arm_hypercalls.h>
+
+static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
+{
+	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
+}
+
+static inline u32 kvm_async_pf_hash_next(u32 key)
+{
+	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
+}
+
+static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
+{
+	int i;
+
+	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
+		vcpu->arch.apf.gfns[i] = ~0;
+}
+
+/*
+ * Add gfn to the hash table. It's guaranteed that there is a free
+ * entry when this function is called.
+ */
+static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	while (vcpu->arch.apf.gfns[key] != ~0)
+		key = kvm_async_pf_hash_next(key);
+
+	vcpu->arch.apf.gfns[key] = gfn;
+}
+
+static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+	int i;
+
+	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
+		if (vcpu->arch.apf.gfns[key] == gfn ||
+		    vcpu->arch.apf.gfns[key] == ~0)
+			break;
+
+		key = kvm_async_pf_hash_next(key);
+	}
+
+	return key;
+}
+
+static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 i, j, k;
+
+	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
+	while (true) {
+		vcpu->arch.apf.gfns[i] = ~0;
+
+		do {
+			j = kvm_async_pf_hash_next(j);
+			if (vcpu->arch.apf.gfns[j] == ~0)
+				return;
+
+			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
+			/*
+			 * k lies cyclically in ]i,j]
+			 * |    i.k.j |
+			 * |....j i.k.| or  |.k..j i...|
+			 */
+		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
+
+		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
+		i = j;
+	}
+}
+
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
+
+	return vcpu->arch.apf.gfns[key] == gfn;
+}
+
+static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu, u32 *val)
+{
+	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
+				     val, sizeof(*val));
+}
+
+static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu, u32 val)
+{
+	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
+				      &val, sizeof(val));
+}
+
+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
+{
+	u64 vbar, pc;
+	u32 val;
+	int ret;
+
+	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+		return false;
+
+	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
+		return false;
+
+	/* Pending page fault, which isn't acknowledged by the guest */
+	ret = kvm_async_pf_read_cache(vcpu, &val);
+	if (ret || val)
+		return false;
+
+	/*
+	 * Events can't be injected through a data abort because it would
+	 * clobber ELR_EL1, which might not have been consumed (or saved)
+	 * by the guest yet.
+	 */
+	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+	pc = *vcpu_pc(vcpu);
+	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+		return false;
+
+	return true;
+}
+
+/*
+ * We need to deliver the page-present signal as quickly as possible
+ * because it's performance critical. So the signal is delivered
+ * regardless of the privilege level the guest is running at. It's
+ * possible the signal can't be handled by the guest immediately, but
+ * the host doesn't contribute to the delay anyway.
+ */
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
+{
+	u64 vbar, pc;
+	u32 val;
+	int ret;
+
+	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+		return true;
+
+	/* Pending page fault, which isn't acknowledged by the guest */
+	ret = kvm_async_pf_read_cache(vcpu, &val);
+	if (ret || val)
+		return false;
+
+	/*
+	 * Events can't be injected through a data abort because it would
+	 * clobber ELR_EL1, which might not have been consumed (or saved)
+	 * by the guest yet.
+	 */
+	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+	pc = *vcpu_pc(vcpu);
+	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+		return false;
+
+	return true;
+}
+
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
+			    gpa_t gpa, gfn_t gfn)
+{
+	struct kvm_arch_async_pf arch;
+	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
+
+	arch.token = (vcpu->arch.apf.id++ << 16) | vcpu->vcpu_id;
+	arch.gfn = gfn;
+	arch.esr = esr;
+
+	return kvm_setup_async_pf(vcpu, gpa, hva, &arch);
+}
+
+/*
+ * It's guaranteed that there is no pending asynchronous page fault
+ * when this is called. It means all previously issued asynchronous
+ * page faults have been acknowledged.
+ */
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work)
+{
+	int ret;
+
+	kvm_async_pf_hash_add(vcpu, work->arch.gfn);
+	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT);
+	if (ret) {
+		kvm_err("%s: Error %d writing cache\n", __func__, ret);
+		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
+		return;
+	}
+
+	kvm_inject_dabt(vcpu, work->arch.token);
+}
+
+/*
+ * It's guaranteed that there is no pending asynchronous page fault
+ * when this is called. It means all previously issued asynchronous
+ * page faults have been acknowledged.
+ */
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work)
+{
+	int ret;
+
+	/* Broadcast wakeup */
+	if (work->wakeup_all)
+		work->arch.token = ~0;
+	else
+		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
+
+	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_READY);
+	if (ret) {
+		kvm_err("%s: Error %d writing cache\n", __func__, ret);
+		return;
+	}
+
+	kvm_inject_dabt(vcpu, work->arch.token);
+}
+
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned int esr = work->arch.esr;
+	phys_addr_t gpa = work->cr2_or_gpa;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	unsigned long hva;
+	bool write_fault, writable;
+	int idx;
+
+	/*
+	 * We shouldn't issue a prefault for the special work that wakes
+	 * up all pending tasks because the associated token (address)
+	 * is invalid.
+	 */
+	if (work->wakeup_all)
+		return;
+
+	/*
+	 * The gpa was validated before the work was started. However, the
+	 * memory slots might have changed since then, so we need to redo
+	 * the validation here.
+	 */
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	write_fault = kvm_is_write_fault(esr);
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
+	if (kvm_is_error_hva(hva) || (write_fault && !writable))
+		goto out;
+
+	kvm_handle_user_mem_abort(vcpu, esr, memslot, gpa, hva, true);
+
+out:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+}
+
+static long kvm_async_pf_update_enable_reg(struct kvm_vcpu *vcpu, u64 data)
+{
+	bool enabled, enable;
+	gpa_t gpa = (data & ~0x3F);
+	int ret;
+
+	enabled = !!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED);
+	enable = !!(data & KVM_ASYNC_PF_ENABLED);
+	if (enable == enabled) {
+		kvm_debug("%s: Async PF has been %s (0x%llx -> 0x%llx)\n",
+			  __func__, enabled ? "enabled" : "disabled",
+			  vcpu->arch.apf.control_block, data);
+		return SMCCC_RET_NOT_REQUIRED;
+	}
+
+	if (enable) {
+		ret = kvm_gfn_to_hva_cache_init(
+			vcpu->kvm, &vcpu->arch.apf.cache,
+			gpa + offsetof(struct kvm_vcpu_pv_apf_data, reason),
+			sizeof(u32));
+		if (ret) {
+			kvm_err("%s: Error %d initializing cache on 0x%llx\n",
+				__func__, ret, data);
+			return SMCCC_RET_NOT_SUPPORTED;
+		}
+
+		kvm_async_pf_hash_reset(vcpu);
+		vcpu->arch.apf.send_user_only =
+			!(data & KVM_ASYNC_PF_SEND_ALWAYS);
+		kvm_async_pf_wakeup_all(vcpu);
+		vcpu->arch.apf.control_block = data;
+	} else {
+		kvm_clear_async_pf_completion_queue(vcpu);
+		vcpu->arch.apf.control_block = data;
+	}
+
+	return SMCCC_RET_SUCCESS;
+}
+
+long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu)
+{
+	u64 data, func, val, range;
+	long ret = SMCCC_RET_SUCCESS;
+
+	data = (smccc_get_arg2(vcpu) << 32) | smccc_get_arg1(vcpu);
+	func = data & (0xfful << 56);
+	val = data & ~(0xfful << 56);
+	switch (func) {
+	case BIT(63):
+		ret = kvm_async_pf_update_enable_reg(vcpu, val);
+		break;
+	case BIT(62):
+		if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
+			ret = SMCCC_RET_NOT_SUPPORTED;
+			break;
+		}
+
+		range = vcpu->arch.apf.no_fault_inst_range;
+		vcpu->arch.apf.no_fault_inst_range = max(range, val);
+		break;
+	default:
+		kvm_err("%s: Unrecognized function 0x%llx\n", __func__, func);
+		ret = SMCCC_RET_NOT_SUPPORTED;
+	}
+
+	return ret;
+}
diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
index db6dce3d0e23..a7e0fe17e2f1 100644
--- a/virt/kvm/arm/hypercalls.c
+++ b/virt/kvm/arm/hypercalls.c
@@ -70,7 +70,15 @@  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 		break;
 	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
 		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
+#ifdef CONFIG_KVM_ASYNC_PF
+		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_APF);
+#endif
 		break;
+#ifdef CONFIG_KVM_ASYNC_PF
+	case ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID:
+		val[0] = kvm_async_pf_hypercall(vcpu);
+		break;
+#endif
 	default:
 		return kvm_psci_call(vcpu);
 	}
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 95aaabb2b1fc..a303815845a2 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1656,6 +1656,30 @@  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
 	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
 }
 
+static bool try_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
+			 gfn_t gfn, kvm_pfn_t *pfn, bool write,
+			 bool *writable, bool prefault)
+{
+	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+#ifdef CONFIG_KVM_ASYNC_PF
+	bool async = false;
+
+	/* Bail if *pfn has correct page */
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
+	if (!async)
+		return false;
+
+	if (!prefault && kvm_arch_can_inject_async_page_not_present(vcpu)) {
+		if (kvm_async_pf_hash_find(vcpu, gfn) ||
+		    kvm_arch_setup_async_pf(vcpu, esr, gpa, gfn))
+			return true;
+	}
+#endif
+
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+	return false;
+}
+
 int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
 			      struct kvm_memory_slot *memslot,
 			      phys_addr_t fault_ipa, unsigned long hva,
@@ -1737,7 +1761,10 @@  int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
 	 */
 	smp_rmb();
 
-	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
+	if (try_async_pf(vcpu, esr, fault_ipa, gfn, &pfn,
+			 write_fault, &writable, prefault))
+		return 1;
+
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;