
[v5,33/36] KVM: arm64: Wrap the host with a stage 2

Message ID 20210315143536.214621-34-qperret@google.com (mailing list archive)
State New, archived
Series KVM: arm64: A stage 2 for the host

Commit Message

Quentin Perret March 15, 2021, 2:35 p.m. UTC
When KVM runs in protected nVHE mode, make use of a stage 2 page-table
to give the hypervisor some control over the host memory accesses. The
host stage 2 is created lazily using large block mappings if possible,
and will default to page mappings in the absence of a better solution.

From this point on, memory accesses from the host to protected memory
regions (i.e. those not 'owned' by the host) are fatal and lead to hyp_panic().
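
In outline, the fault path added below behaves as follows (a condensed
sketch of the code in this patch; locking, prot/pool selection and error
handling are elided):

	void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
	{
		struct kvm_vcpu_fault_info fault;
		struct kvm_mem_range range;
		u64 addr;

		__get_fault_info(read_sysreg_el2(SYS_ESR), &fault);
		addr = (fault.hpfar_el2 & HPFAR_MASK) << 8;	/* faulting IPA */

		/* Find the largest range around addr that can be mapped in one go */
		kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);

		/* Lazily id-map it, using block mappings whenever possible */
		kvm_pgtable_stage2_map(&host_kvm.pgt, range.start,
				       range.end - range.start, range.start,
				       prot, pool);
	}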

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
---
 arch/arm64/include/asm/kvm_asm.h              |   1 +
 arch/arm64/kernel/image-vars.h                |   3 +
 arch/arm64/kvm/arm.c                          |  10 +
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  34 +++
 arch/arm64/kvm/hyp/nvhe/Makefile              |   2 +-
 arch/arm64/kvm/hyp/nvhe/hyp-init.S            |   1 +
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  11 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 246 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/setup.c               |   5 +
 arch/arm64/kvm/hyp/nvhe/switch.c              |   7 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
 11 files changed, 317 insertions(+), 7 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c

Comments

Mate Toth-Pal March 16, 2021, 12:28 p.m. UTC | #1
On 2021-03-15 15:35, Quentin Perret wrote:
> +static int host_stage2_idmap(u64 addr)
> +{
> +	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> +	struct kvm_mem_range range;
> +	bool is_memory = find_mem_range(addr, &range);
> +	struct hyp_pool *pool = is_memory ? &host_s2_mem : &host_s2_dev;
> +	int ret;
> +
> +	if (is_memory)
> +		prot |= KVM_PGTABLE_PROT_X;
> +

Hi Quentin,

Testing the latest version of the patchset, we seem to have found 
another thing related to FEAT_S2FWB.

This function always sets Normal memory in the stage 2 table, even if 
the address in stage 1 was mapped as device memory. However, with the 
current setting for normal memory (i.e. MT_S2_FWB_NORMAL being defined 
to 6), the architecture (see Arm ARM, 'D5.5.5 Stage 2 memory region 
type and cacheability attributes when FEAT_S2FWB is implemented') says 
the resulting attributes will be 'Normal Write-Back' even if the stage 1 
mapping sets device memory. Accessing device memory mapped like this 
causes an SError on some platforms that implement FEAT_S2FWB.

Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, 
and the resulting memory type would be device.
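
For reference, this is how we read the stage 2 MemAttr encodings involved
here when FEAT_S2FWB is enabled (a summary for illustration, not a code
change; only the values relevant to this series are listed):

/*
 * Stage 2 MemAttr with HCR_EL2.FWB == 1:
 *   0b0001 (MT_S2_FWB_DEVICE_nGnRE)  -> Device-nGnRE
 *   0b0110 (MT_S2_FWB_NORMAL today)  -> Normal Write-Back, regardless of
 *                                       the stage 1 attributes
 *   0b0111 (the proposed value)      -> the stage 1 attributes are honoured
 */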

Another solution would be to add an else branch to the last 'if' above 
like this:

diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index fffa432ce3eb..54e5d3b0b2e1 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -214,6 +214,8 @@ static int host_stage2_idmap(u64 addr)

         if (is_memory)
                 prot |= KVM_PGTABLE_PROT_X;
+       else
+               prot |= KVM_PGTABLE_PROT_DEVICE;

         hyp_spin_lock(&host_kvm.lock);
         ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);

Best regards,
Mate Toth-Pal
Quentin Perret March 16, 2021, 12:53 p.m. UTC | #2
On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> Testing the latest version of the patchset, we seem to have found another
> thing related to FEAT_S2FWB.

Argh! I wish I could put my hands on hardware with FWB. Thanks again for
the report.

> This function always sets Normal memory in the stage 2 table, even if the
> address in stage 1 was mapped as device memory. However, with the current
> setting for normal memory (i.e. MT_S2_FWB_NORMAL being defined to 6), the
> architecture (see Arm ARM, 'D5.5.5 Stage 2 memory region type and
> cacheability attributes when FEAT_S2FWB is implemented') says the
> resulting attributes will be 'Normal Write-Back' even if the stage 1
> mapping sets device memory. Accessing device memory mapped like this
> causes an SError on some platforms that implement FEAT_S2FWB.

Right.

> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> the resulting memory type would be device.

Sounds like the correct fix here -- see below.

> Another solution would be to add an else branch to the last 'if' above like
> this:
> 
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index fffa432ce3eb..54e5d3b0b2e1 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -214,6 +214,8 @@ static int host_stage2_idmap(u64 addr)
> 
>         if (is_memory)
>                 prot |= KVM_PGTABLE_PROT_X;
> +       else
> +               prot |= KVM_PGTABLE_PROT_DEVICE;
> 
>         hyp_spin_lock(&host_kvm.lock);
>         ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);

While this would work in this particular case, I don't think we should
force all non-RAM accesses to be device, as the host may have reasons not
to want this (e.g. when accessing SRAM). Your first suggestion allows us
to do just that, so I think it is preferable.

Thanks,
Quentin
Quentin Perret March 16, 2021, 2:29 p.m. UTC | #3
On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> > Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> > the resulting memory type would be device.
> 
> Sounds like the correct fix here -- see below.

Just to clarify this, I meant this should be the configuration for the
host stage-2. We'll want to keep the existing behaviour for guests I
believe.

Thanks,
Quentin
Mate Toth-Pal March 16, 2021, 3:16 p.m. UTC | #4
On 2021-03-16 15:29, Quentin Perret wrote:
> On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
>> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
>>> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
>>> the resulting memory type would be device.
>>
>> Sounds like the correct fix here -- see below.
> 
> Just to clarify this, I meant this should be the configuration for the
> host stage-2. We'll want to keep the existing behaviour for guests I
> believe.

I agree.

Thanks,
Mate
Quentin Perret March 16, 2021, 5:46 p.m. UTC | #5
On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
> On 2021-03-16 15:29, Quentin Perret wrote:
> > On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
> > > On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> > > > Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> > > > the resulting memory type would be device.
> > > 
> > > Sounds like the correct fix here -- see below.
> > 
> > Just to clarify this, I meant this should be the configuration for the
> > host stage-2. We'll want to keep the existing behaviour for guests I
> > believe.
> 
> I agree.

OK, so the below seems to boot on my non-FWB-capable hardware and should
fix the issue. Could you by any chance give it a spin?

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index b93a2a3526ab..b2066bd03ca2 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -76,10 +76,11 @@ struct kvm_pgtable {

 /**
  * enum kvm_pgtable_prot - Page-table permissions and attributes.
- * @KVM_PGTABLE_PROT_X:                Execute permission.
- * @KVM_PGTABLE_PROT_W:                Write permission.
- * @KVM_PGTABLE_PROT_R:                Read permission.
- * @KVM_PGTABLE_PROT_DEVICE:   Device attributes.
+ * @KVM_PGTABLE_PROT_X:                        Execute permission.
+ * @KVM_PGTABLE_PROT_W:                        Write permission.
+ * @KVM_PGTABLE_PROT_R:                        Read permission.
+ * @KVM_PGTABLE_PROT_DEVICE:           Device attributes.
+ * @KVM_PGTABLE_PROT_S2_NOFWB:         Don't enforce Normal-WB with FWB.
  */
 enum kvm_pgtable_prot {
        KVM_PGTABLE_PROT_X                      = BIT(0),
@@ -87,6 +88,8 @@ enum kvm_pgtable_prot {
        KVM_PGTABLE_PROT_R                      = BIT(2),

        KVM_PGTABLE_PROT_DEVICE                 = BIT(3),
+
+       KVM_PGTABLE_PROT_S2_NOFWB               = BIT(4),
 };

 #define PAGE_HYP               (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c759faf7a1ff..e695d2e1839d 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -144,13 +144,16 @@
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL           0xf
+#define MT_S2_WEAK             MT_S2_NORMAL
 #define MT_S2_DEVICE_nGnRE     0x1

 /*
  * Memory types for Stage-2 translation when ID_AA64MMFR2_EL1.FWB is 0001
- * Stage-2 enforces Normal-WB and Device-nGnRE
+ * Stage-2 enforces Normal-WB and Device-nGnRE by default. The 'weak' mode
+ * honors Stage-1 attributes.
  */
 #define MT_S2_FWB_NORMAL       6
+#define MT_S2_FWB_WEAK         7
 #define MT_S2_FWB_DEVICE_nGnRE 1

 #ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index dd03252b9574..1ff72babe565 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -214,6 +214,8 @@ static int host_stage2_idmap(u64 addr)

        if (is_memory)
                prot |= KVM_PGTABLE_PROT_X;
+       else
+               prot |= KVM_PGTABLE_PROT_S2_NOFWB;

        hyp_spin_lock(&host_kvm.lock);
        ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3a971df278bd..bd1b8464a537 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -343,6 +343,9 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
        if (!(prot & KVM_PGTABLE_PROT_R))
                return -EINVAL;

+       if (prot & KVM_PGTABLE_PROT_S2_NOFWB)
+               return -EINVAL;
+
        if (prot & KVM_PGTABLE_PROT_X) {
                if (prot & KVM_PGTABLE_PROT_W)
                        return -EINVAL;
@@ -510,9 +513,18 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 static int stage2_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
 {
        bool device = prot & KVM_PGTABLE_PROT_DEVICE;
-       kvm_pte_t attr = device ? PAGE_S2_MEMATTR(DEVICE_nGnRE) :
-                           PAGE_S2_MEMATTR(NORMAL);
+       bool nofwb = prot & KVM_PGTABLE_PROT_S2_NOFWB;
        u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
+       kvm_pte_t attr;
+
+       WARN_ON(nofwb && device);
+
+       if (device)
+               attr = PAGE_S2_MEMATTR(DEVICE_nGnRE);
+       else if (nofwb)
+               attr = PAGE_S2_MEMATTR(WEAK);
+       else
+               attr = PAGE_S2_MEMATTR(NORMAL);

        if (!(prot & KVM_PGTABLE_PROT_X))
                attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
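
For completeness, my reading of the attribute selection with this patch
applied, assuming FWB is enabled (a summary, not part of the diff):

/*
 * prot flag set by the caller      stage-2 attr           resulting type
 * KVM_PGTABLE_PROT_DEVICE      ->  DEVICE_nGnRE (0b0001)  Device-nGnRE
 * KVM_PGTABLE_PROT_S2_NOFWB    ->  WEAK (0b0111)          stage 1 attributes
 * (neither)                    ->  NORMAL (0b0110)        Normal-WB, forced
 */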
Mate Toth-Pal March 17, 2021, 8:41 a.m. UTC | #6
On 2021-03-16 18:46, Quentin Perret wrote:
> On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
>> On 2021-03-16 15:29, Quentin Perret wrote:
>>> On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
>>>> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
>>>>> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
>>>>> the resulting memory type would be device.
>>>>
>>>> Sounds like the correct fix here -- see below.
>>>
>>> Just to clarify this, I meant this should be the configuration for the
>>> host stage-2. We'll want to keep the existing behaviour for guests I
>>> believe.
>>
>> I agree.
> 
> OK, so the below seems to boot on my non-FWB-capable hardware and should
> fix the issue. Could you by any chance give it a spin?
> 

Sure, I can give it a go. I was trying to apply the patch on top of 
https://android-kvm.googlesource.com/linux/+/refs/heads/qperret/host-stage2-v5 
but it seems that your base is significantly different. Can you give 
some hints on what I should use as the base?
Quentin Perret March 17, 2021, 9:02 a.m. UTC | #7
On Wednesday 17 Mar 2021 at 09:41:09 (+0100), Mate Toth-Pal wrote:
> On 2021-03-16 18:46, Quentin Perret wrote:
> > On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
> > > On 2021-03-16 15:29, Quentin Perret wrote:
> > > > On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
> > > > > On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> > > > > > Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> > > > > > the resulting memory type would be device.
> > > > > 
> > > > > Sounds like the correct fix here -- see below.
> > > > 
> > > > Just to clarify this, I meant this should be the configuration for the
> > > > host stage-2. We'll want to keep the existing behaviour for guests I
> > > > believe.
> > > 
> > > I agree.
> > 
> > OK, so the below seems to boot on my non-FWB-capable hardware and should
> > fix the issue. Could you by any chance give it a spin?
> > 
> 
> Sure, I can give it a go. I was trying to apply the patch on top of https://android-kvm.googlesource.com/linux/+/refs/heads/qperret/host-stage2-v5
> but it seems that your base is significantly different. Can you give some
> hints on what I should use as the base?

Oh interesting, it _should_ apply on v5. I just pushed a branch with
everything applied if that helps:

  https://android-kvm.googlesource.com/linux qperret/wip/fix-fwb-host-stage2

Thanks again!
Quentin
Mate Toth-Pal March 17, 2021, 2:57 p.m. UTC | #8
On 2021-03-17 10:02, Quentin Perret wrote:
> On Wednesday 17 Mar 2021 at 09:41:09 (+0100), Mate Toth-Pal wrote:
>> On 2021-03-16 18:46, Quentin Perret wrote:
>>> On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
>>>> On 2021-03-16 15:29, Quentin Perret wrote:
>>>>> On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
>>>>>> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
>>>>>>> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
>>>>>>> the resulting memory type would be device.
>>>>>>
>>>>>> Sounds like the correct fix here -- see below.
>>>>>
>>>>> Just to clarify this, I meant this should be the configuration for the
>>>>> host stage-2. We'll want to keep the existing behaviour for guests I
>>>>> believe.
>>>>
>>>> I agree.
>>>
>>> OK, so the below seems to boot on my non-FWB-capable hardware and should
>>> fix the issue. Could you by any chance give it a spin?
>>>
>>
>> Sure, I can give it a go. I was trying to apply the patch on top of https://android-kvm.googlesource.com/linux/+/refs/heads/qperret/host-stage2-v5
>> but it seems that your base is significantly different. Can you give some
>> hints on what I should use as the base?
> 
> Oh interesting, it _should_ apply on v5. I just pushed a branch with
> everything applied if that helps:
> 
>    https://android-kvm.googlesource.com/linux qperret/wip/fix-fwb-host-stage2
> 
> Thanks again!
> Quentin
> 

Sorry, it was my mistake. I must have messed up my source, because the 
patch really should have applied to the v5 patchset.

I was able to boot the kernel from the branch 
qperret/wip/fix-fwb-host-stage2 on the machine I was testing on, so the 
fix looks to be OK.

Thanks,
Mate

Patch

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 6dce860f8bca..b127af02bd45 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -61,6 +61,7 @@ 
 #define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings		16
 #define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping	17
 #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector		18
+#define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize		19
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 940c378fa837..d5dc2b792651 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -131,6 +131,9 @@  KVM_NVHE_ALIAS(__hyp_bss_end);
 KVM_NVHE_ALIAS(__hyp_rodata_start);
 KVM_NVHE_ALIAS(__hyp_rodata_end);
 
+/* pKVM static key */
+KVM_NVHE_ALIAS(kvm_protected_mode_initialized);
+
 #endif /* CONFIG_KVM */
 
 #endif /* __ARM64_KERNEL_IMAGE_VARS_H */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d474eec606a3..7e6a81079652 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1889,12 +1889,22 @@  static int init_hyp_mode(void)
 	return err;
 }
 
+void _kvm_host_prot_finalize(void *discard)
+{
+	WARN_ON(kvm_call_hyp_nvhe(__pkvm_prot_finalize));
+}
+
 static int finalize_hyp_mode(void)
 {
 	if (!is_protected_kvm_enabled())
 		return 0;
 
+	/*
+	 * Flip the static key upfront as that may no longer be possible
+	 * once the host stage 2 is installed.
+	 */
 	static_branch_enable(&kvm_protected_mode_initialized);
+	on_each_cpu(_kvm_host_prot_finalize, NULL, 1);
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
new file mode 100644
index 000000000000..d293cb328cc4
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -0,0 +1,34 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <qperret@google.com>
+ */
+
+#ifndef __KVM_NVHE_MEM_PROTECT__
+#define __KVM_NVHE_MEM_PROTECT__
+#include <linux/kvm_host.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/virt.h>
+#include <nvhe/spinlock.h>
+
+struct host_kvm {
+	struct kvm_arch arch;
+	struct kvm_pgtable pgt;
+	struct kvm_pgtable_mm_ops mm_ops;
+	hyp_spinlock_t lock;
+};
+extern struct host_kvm host_kvm;
+
+int __pkvm_prot_finalize(void);
+int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool);
+void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt);
+
+static __always_inline void __load_host_stage2(void)
+{
+	if (static_branch_likely(&kvm_protected_mode_initialized))
+		__load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr);
+	else
+		write_sysreg(0, vttbr_el2);
+}
+#endif /* __KVM_NVHE_MEM_PROTECT__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index b334354b8dd0..f55201a7ff33 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -14,7 +14,7 @@  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
 
 obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
 	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
-	 cache.o setup.o mm.o
+	 cache.o setup.o mm.o mem_protect.o
 obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
 obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
index a50ad9e9fc05..c164045af238 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
@@ -119,6 +119,7 @@  alternative_else_nop_endif
 
 	/* Invalidate the stale TLBs from Bootloader */
 	tlbi	alle2
+	tlbi	vmalls12e1
 	dsb	sy
 
 	/*
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index ae6503c9be15..f47028d3fd0a 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -13,6 +13,7 @@ 
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 
+#include <nvhe/mem_protect.h>
 #include <nvhe/mm.h>
 #include <nvhe/trap_handler.h>
 
@@ -151,6 +152,10 @@  static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ct
 	cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
 }
 
+static void handle___pkvm_prot_finalize(struct kvm_cpu_context *host_ctxt)
+{
+	cpu_reg(host_ctxt, 1) = __pkvm_prot_finalize();
+}
 typedef void (*hcall_t)(struct kvm_cpu_context *);
 
 #define HANDLE_FUNC(x)	[__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -174,6 +179,7 @@  static const hcall_t host_hcall[] = {
 	HANDLE_FUNC(__pkvm_cpu_set_vector),
 	HANDLE_FUNC(__pkvm_create_mappings),
 	HANDLE_FUNC(__pkvm_create_private_mapping),
+	HANDLE_FUNC(__pkvm_prot_finalize),
 };
 
 static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
@@ -226,6 +232,11 @@  void handle_trap(struct kvm_cpu_context *host_ctxt)
 	case ESR_ELx_EC_SMC64:
 		handle_host_smc(host_ctxt);
 		break;
+	case ESR_ELx_EC_IABT_LOW:
+		fallthrough;
+	case ESR_ELx_EC_DABT_LOW:
+		handle_host_mem_abort(host_ctxt);
+		break;
 	default:
 		hyp_panic();
 	}
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
new file mode 100644
index 000000000000..5c88a325e6fc
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -0,0 +1,246 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <qperret@google.com>
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_cpufeature.h>
+#include <asm/kvm_emulate.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/stage2_pgtable.h>
+
+#include <hyp/switch.h>
+
+#include <nvhe/gfp.h>
+#include <nvhe/memory.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/mm.h>
+
+extern unsigned long hyp_nr_cpus;
+struct host_kvm host_kvm;
+
+struct hyp_pool host_s2_mem;
+struct hyp_pool host_s2_dev;
+
+static void *host_s2_zalloc_pages_exact(size_t size)
+{
+	return hyp_alloc_pages(&host_s2_mem, get_order(size));
+}
+
+static void *host_s2_zalloc_page(void *pool)
+{
+	return hyp_alloc_pages(pool, 0);
+}
+
+static int prepare_s2_pools(void *mem_pgt_pool, void *dev_pgt_pool)
+{
+	unsigned long nr_pages, pfn;
+	int ret;
+
+	pfn = hyp_virt_to_pfn(mem_pgt_pool);
+	nr_pages = host_s2_mem_pgtable_pages();
+	ret = hyp_pool_init(&host_s2_mem, pfn, nr_pages, 0);
+	if (ret)
+		return ret;
+
+	pfn = hyp_virt_to_pfn(dev_pgt_pool);
+	nr_pages = host_s2_dev_pgtable_pages();
+	ret = hyp_pool_init(&host_s2_dev, pfn, nr_pages, 0);
+	if (ret)
+		return ret;
+
+	host_kvm.mm_ops = (struct kvm_pgtable_mm_ops) {
+		.zalloc_pages_exact = host_s2_zalloc_pages_exact,
+		.zalloc_page = host_s2_zalloc_page,
+		.phys_to_virt = hyp_phys_to_virt,
+		.virt_to_phys = hyp_virt_to_phys,
+		.page_count = hyp_page_count,
+		.get_page = hyp_get_page,
+		.put_page = hyp_put_page,
+	};
+
+	return 0;
+}
+
+static void prepare_host_vtcr(void)
+{
+	u32 parange, phys_shift;
+	u64 mmfr0, mmfr1;
+
+	mmfr0 = arm64_ftr_reg_id_aa64mmfr0_el1.sys_val;
+	mmfr1 = arm64_ftr_reg_id_aa64mmfr1_el1.sys_val;
+
+	/* The host stage 2 is id-mapped, so use parange for T0SZ */
+	parange = kvm_get_parange(mmfr0);
+	phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
+
+	host_kvm.arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
+}
+
+int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool)
+{
+	struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
+	int ret;
+
+	prepare_host_vtcr();
+	hyp_spin_lock_init(&host_kvm.lock);
+
+	ret = prepare_s2_pools(mem_pgt_pool, dev_pgt_pool);
+	if (ret)
+		return ret;
+
+	ret = kvm_pgtable_stage2_init(&host_kvm.pgt, &host_kvm.arch,
+				      &host_kvm.mm_ops);
+	if (ret)
+		return ret;
+
+	mmu->pgd_phys = __hyp_pa(host_kvm.pgt.pgd);
+	mmu->arch = &host_kvm.arch;
+	mmu->pgt = &host_kvm.pgt;
+	mmu->vmid.vmid_gen = 0;
+	mmu->vmid.vmid = 0;
+
+	return 0;
+}
+
+int __pkvm_prot_finalize(void)
+{
+	struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
+	struct kvm_nvhe_init_params *params = this_cpu_ptr(&kvm_init_params);
+
+	params->vttbr = kvm_get_vttbr(mmu);
+	params->vtcr = host_kvm.arch.vtcr;
+	params->hcr_el2 |= HCR_VM;
+	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+		params->hcr_el2 |= HCR_FWB;
+	kvm_flush_dcache_to_poc(params, sizeof(*params));
+
+	write_sysreg(params->hcr_el2, hcr_el2);
+	__load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr);
+
+	/*
+	 * Make sure to have an ISB before the TLB maintenance below but only
+	 * when __load_stage2() doesn't include one already.
+	 */
+	asm(ALTERNATIVE("isb", "nop", ARM64_WORKAROUND_SPECULATIVE_AT));
+
+	/* Invalidate stale HCR bits that may be cached in TLBs */
+	__tlbi(vmalls12e1);
+	dsb(nsh);
+	isb();
+
+	return 0;
+}
+
+static int host_stage2_unmap_dev_all(void)
+{
+	struct kvm_pgtable *pgt = &host_kvm.pgt;
+	struct memblock_region *reg;
+	u64 addr = 0;
+	int i, ret;
+
+	/* Unmap all non-memory regions to recycle the pages */
+	for (i = 0; i < hyp_memblock_nr; i++, addr = reg->base + reg->size) {
+		reg = &hyp_memory[i];
+		ret = kvm_pgtable_stage2_unmap(pgt, addr, reg->base - addr);
+		if (ret)
+			return ret;
+	}
+	return kvm_pgtable_stage2_unmap(pgt, addr, BIT(pgt->ia_bits) - addr);
+}
+
+static bool find_mem_range(phys_addr_t addr, struct kvm_mem_range *range)
+{
+	int cur, left = 0, right = hyp_memblock_nr;
+	struct memblock_region *reg;
+	phys_addr_t end;
+
+	range->start = 0;
+	range->end = ULONG_MAX;
+
+	/* The list of memblock regions is sorted, binary search it */
+	while (left < right) {
+		cur = (left + right) >> 1;
+		reg = &hyp_memory[cur];
+		end = reg->base + reg->size;
+		if (addr < reg->base) {
+			right = cur;
+			range->end = reg->base;
+		} else if (addr >= end) {
+			left = cur + 1;
+			range->start = end;
+		} else {
+			range->start = reg->base;
+			range->end = end;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static inline int __host_stage2_idmap(u64 start, u64 end,
+				      enum kvm_pgtable_prot prot,
+				      struct hyp_pool *pool)
+{
+	return kvm_pgtable_stage2_map(&host_kvm.pgt, start, end - start, start,
+				      prot, pool);
+}
+
+static int host_stage2_idmap(u64 addr)
+{
+	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+	struct kvm_mem_range range;
+	bool is_memory = find_mem_range(addr, &range);
+	struct hyp_pool *pool = is_memory ? &host_s2_mem : &host_s2_dev;
+	int ret;
+
+	if (is_memory)
+		prot |= KVM_PGTABLE_PROT_X;
+
+	hyp_spin_lock(&host_kvm.lock);
+	ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);
+	if (ret)
+		goto unlock;
+
+	ret = __host_stage2_idmap(range.start, range.end, prot, pool);
+	if (is_memory || ret != -ENOMEM)
+		goto unlock;
+
+	/*
+	 * host_s2_mem has been provided with enough pages to cover all of
+	 * memory with page granularity, so we should never hit the ENOMEM case.
+	 * However, it is difficult to know how much of the MMIO range we will
+	 * need to cover upfront, so we may need to 'recycle' the pages if we
+	 * run out.
+	 */
+	ret = host_stage2_unmap_dev_all();
+	if (ret)
+		goto unlock;
+
+	ret = __host_stage2_idmap(range.start, range.end, prot, pool);
+
+unlock:
+	hyp_spin_unlock(&host_kvm.lock);
+
+	return ret;
+}
+
+void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
+{
+	struct kvm_vcpu_fault_info fault;
+	u64 esr, addr;
+	int ret = 0;
+
+	esr = read_sysreg_el2(SYS_ESR);
+	if (!__get_fault_info(esr, &fault))
+		hyp_panic();
+
+	addr = (fault.hpfar_el2 & HPFAR_MASK) << 8;
+	ret = host_stage2_idmap(addr);
+	if (ret && ret != -EAGAIN)
+		hyp_panic();
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index c1a3e7e0ebbc..7488f53b0aa2 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -12,6 +12,7 @@ 
 #include <nvhe/early_alloc.h>
 #include <nvhe/gfp.h>
 #include <nvhe/memory.h>
+#include <nvhe/mem_protect.h>
 #include <nvhe/mm.h>
 #include <nvhe/trap_handler.h>
 
@@ -157,6 +158,10 @@  void __noreturn __pkvm_init_finalise(void)
 	if (ret)
 		goto out;
 
+	ret = kvm_host_prepare_stage2(host_s2_mem_pgt_base, host_s2_dev_pgt_base);
+	if (ret)
+		goto out;
+
 	pkvm_pgtable_mm_ops = (struct kvm_pgtable_mm_ops) {
 		.zalloc_page = hyp_zalloc_hyp_page,
 		.phys_to_virt = hyp_phys_to_virt,
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 979a76cdf9fb..31bc1a843bf8 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -28,6 +28,8 @@ 
 #include <asm/processor.h>
 #include <asm/thread_info.h>
 
+#include <nvhe/mem_protect.h>
+
 /* Non-VHE specific context */
 DEFINE_PER_CPU(struct kvm_host_data, kvm_host_data);
 DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
@@ -102,11 +104,6 @@  static void __deactivate_traps(struct kvm_vcpu *vcpu)
 	write_sysreg(__kvm_hyp_host_vector, vbar_el2);
 }
 
-static void __load_host_stage2(void)
-{
-	write_sysreg(0, vttbr_el2);
-}
-
 /* Save VGICv3 state on non-VHE systems */
 static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index fbde89a2c6e8..255a23a1b2db 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -8,6 +8,8 @@ 
 #include <asm/kvm_mmu.h>
 #include <asm/tlbflush.h>
 
+#include <nvhe/mem_protect.h>
+
 struct tlb_inv_context {
 	u64		tcr;
 };
@@ -43,7 +45,7 @@  static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,
 
 static void __tlb_switch_to_host(struct tlb_inv_context *cxt)
 {
-	write_sysreg(0, vttbr_el2);
+	__load_host_stage2();
 
 	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
 		/* Ensure write of the host VMID */