
[v7,4/4] KVM: arm64: Move guest CMOs to the fault handlers

Message ID 20210617105824.31752-5-wangyanan55@huawei.com (mailing list archive)
State New, archived
Series KVM: arm64: Improve efficiency of stage2 page table

Commit Message

Yanan Wang June 17, 2021, 10:58 a.m. UTC
We currently uniformly perform CMOs of D-cache and I-cache in function
user_mem_abort before calling the fault handlers. If we get concurrent
guest faults (e.g. translation faults, permission faults) or some really
unnecessary guest faults caused by BBM (break-before-make), the CMOs
issued for the first vCPU are necessary, while those issued later are not.

By moving the CMOs into the fault handlers, we can easily identify the
conditions where they are really needed and avoid the unnecessary ones.
Since performing CMOs is time consuming, especially when flushing a block
range, this change reduces the load on KVM and improves the efficiency
of the stage-2 page table code.

Two specific scenarios stand to benefit:
1) During normal VM startup, this change improves the efficiency of
handling the guest page faults taken by vCPUs while the stage-2 page
tables are initially being populated.
2) After live migration, the heavy workload is resumed on the destination
VM while all of its stage-2 page tables still have to be rebuilt, so this
change eases the performance drop during the resume stage.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 38 +++++++++++++++++++++++++++++-------
 arch/arm64/kvm/mmu.c         | 37 ++++++++++++++---------------------
 2 files changed, 46 insertions(+), 29 deletions(-)
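
For context, the pattern the series adopts can be summarised with a small,
self-contained sketch (all names, types and bit definitions below are invented
for illustration; the real code is in the diff further down): the cache
maintenance callbacks live in an ops structure and are only invoked from the
leaf installer, right before a new cacheable or executable mapping is written,
so a vCPU that loses the race to install the PTE never issues the CMOs.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical PTE attribute bits, loosely modelled on the stage-2 format. */
#define PTE_CACHEABLE	(1u << 0)
#define PTE_EXEC	(1u << 1)

typedef uint64_t pte_t;

/* Optional CMO hooks: a NULL pointer simply means "no maintenance needed". */
struct mm_ops {
	void (*clean_invalidate_dcache)(void *va, size_t size);
	void (*invalidate_icache)(void *va, size_t size);
};

static void demo_dcache_cmo(void *va, size_t size)
{
	printf("D-cache clean+invalidate at %p, %zu bytes\n", va, size);
}

static void demo_icache_cmo(void *va, size_t size)
{
	printf("I-cache invalidate at %p, %zu bytes\n", va, size);
}

/*
 * Install a leaf mapping.  The CMOs are issued here, right before the PTE
 * is written, and only when the new mapping is cacheable/executable --
 * rather than unconditionally in the fault path before calling the installer.
 */
static void install_leaf(pte_t *ptep, pte_t new, void *va, size_t granule,
			 const struct mm_ops *ops)
{
	if (*ptep == new)	/* another vCPU won the race: nothing to do */
		return;

	if (ops->clean_invalidate_dcache && (new & PTE_CACHEABLE))
		ops->clean_invalidate_dcache(va, granule);

	if (ops->invalidate_icache && (new & PTE_EXEC))
		ops->invalidate_icache(va, granule);

	*ptep = new;		/* the real code uses smp_store_release() */
}

int main(void)
{
	struct mm_ops ops = {
		.clean_invalidate_dcache	= demo_dcache_cmo,
		.invalidate_icache		= demo_icache_cmo,
	};
	pte_t pte = 0;
	char page[64];

	install_leaf(&pte, PTE_CACHEABLE | PTE_EXEC, page, sizeof(page), &ops);
	install_leaf(&pte, PTE_CACHEABLE | PTE_EXEC, page, sizeof(page), &ops);
	return 0;
}

Built as plain C, the second install_leaf() call prints nothing because the
PTE is already up to date, so the redundant CMOs are skipped.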

Comments

Will Deacon June 17, 2021, 12:45 p.m. UTC | #1
On Thu, Jun 17, 2021 at 06:58:24PM +0800, Yanan Wang wrote:
> We currently uniformly permorm CMOs of D-cache and I-cache in function
> user_mem_abort before calling the fault handlers. If we get concurrent
> guest faults(e.g. translation faults, permission faults) or some really
> unnecessary guest faults caused by BBM, CMOs for the first vcpu are
> necessary while the others later are not.
> 
> By moving CMOs to the fault handlers, we can easily identify conditions
> where they are really needed and avoid the unnecessary ones. As it's a
> time consuming process to perform CMOs especially when flushing a block
> range, so this solution reduces much load of kvm and improve efficiency
> of the stage-2 page table code.
> 
> We can imagine two specific scenarios which will gain much benefit:
> 1) In a normal VM startup, this solution will improve the efficiency of
> handling guest page faults incurred by vCPUs, when initially populating
> stage-2 page tables.
> 2) After live migration, the heavy workload will be resumed on the
> destination VM, however all the stage-2 page tables need to be rebuilt
> at the moment. So this solution will ease the performance drop during
> resuming stage.
> 
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 38 +++++++++++++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c         | 37 ++++++++++++++---------------------
>  2 files changed, 46 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index d99789432b05..760c551f61da 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -577,12 +577,24 @@ static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
>  	mm_ops->put_page(ptep);
>  }
>  
> +static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
> +{
> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> +	return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
> +}
> +
> +static bool stage2_pte_executable(kvm_pte_t pte)
> +{
> +	return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
> +}
> +
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  				      kvm_pte_t *ptep,
>  				      struct stage2_map_data *data)
>  {
>  	kvm_pte_t new, old = *ptep;
>  	u64 granule = kvm_granule_size(level), phys = data->phys;
> +	struct kvm_pgtable *pgt = data->mmu->pgt;
>  	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>  
>  	if (!kvm_block_mapping_supported(addr, end, phys, level))
> @@ -606,6 +618,14 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
>  	}
>  
> +	/* Perform CMOs before installation of the guest stage-2 PTE */
> +	if (mm_ops->clean_invalidate_dcache && stage2_pte_cacheable(pgt, new))
> +		mm_ops->clean_invalidate_dcache(kvm_pte_follow(new, mm_ops),
> +						granule);
> +
> +	if (mm_ops->invalidate_icache && stage2_pte_executable(new))
> +		mm_ops->invalidate_icache(kvm_pte_follow(new, mm_ops), granule);

One thing I'm missing here is why we need the indirection via mm_ops. Are
there cases where we would want to pass a different function pointer for
invalidating the icache? If not, why not just call the function directly?

Same for the D side.

Will
Marc Zyngier June 17, 2021, 12:59 p.m. UTC | #2
On Thu, 17 Jun 2021 13:45:57 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> On Thu, Jun 17, 2021 at 06:58:24PM +0800, Yanan Wang wrote:
> > We currently uniformly permorm CMOs of D-cache and I-cache in function
> > user_mem_abort before calling the fault handlers. If we get concurrent
> > guest faults(e.g. translation faults, permission faults) or some really
> > unnecessary guest faults caused by BBM, CMOs for the first vcpu are
> > necessary while the others later are not.
> > 
> > By moving CMOs to the fault handlers, we can easily identify conditions
> > where they are really needed and avoid the unnecessary ones. As it's a
> > time consuming process to perform CMOs especially when flushing a block
> > range, so this solution reduces much load of kvm and improve efficiency
> > of the stage-2 page table code.
> > 
> > We can imagine two specific scenarios which will gain much benefit:
> > 1) In a normal VM startup, this solution will improve the efficiency of
> > handling guest page faults incurred by vCPUs, when initially populating
> > stage-2 page tables.
> > 2) After live migration, the heavy workload will be resumed on the
> > destination VM, however all the stage-2 page tables need to be rebuilt
> > at the moment. So this solution will ease the performance drop during
> > resuming stage.
> > 
> > Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> > ---
> >  arch/arm64/kvm/hyp/pgtable.c | 38 +++++++++++++++++++++++++++++-------
> >  arch/arm64/kvm/mmu.c         | 37 ++++++++++++++---------------------
> >  2 files changed, 46 insertions(+), 29 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index d99789432b05..760c551f61da 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -577,12 +577,24 @@ static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
> >  	mm_ops->put_page(ptep);
> >  }
> >  
> > +static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
> > +{
> > +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> > +	return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
> > +}
> > +
> > +static bool stage2_pte_executable(kvm_pte_t pte)
> > +{
> > +	return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
> > +}
> > +
> >  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >  				      kvm_pte_t *ptep,
> >  				      struct stage2_map_data *data)
> >  {
> >  	kvm_pte_t new, old = *ptep;
> >  	u64 granule = kvm_granule_size(level), phys = data->phys;
> > +	struct kvm_pgtable *pgt = data->mmu->pgt;
> >  	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> >  
> >  	if (!kvm_block_mapping_supported(addr, end, phys, level))
> > @@ -606,6 +618,14 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >  		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
> >  	}
> >  
> > +	/* Perform CMOs before installation of the guest stage-2 PTE */
> > +	if (mm_ops->clean_invalidate_dcache && stage2_pte_cacheable(pgt, new))
> > +		mm_ops->clean_invalidate_dcache(kvm_pte_follow(new, mm_ops),
> > +						granule);
> > +
> > +	if (mm_ops->invalidate_icache && stage2_pte_executable(new))
> > +		mm_ops->invalidate_icache(kvm_pte_follow(new, mm_ops), granule);
> 
> One thing I'm missing here is why we need the indirection via mm_ops. Are
> there cases where we would want to pass a different function pointer for
> invalidating the icache? If not, why not just call the function directly?
> 
> Same for the D side.

If we didn't do that, we'd end-up having to track whether the guest
context requires CMOs with additional flags, which is pretty ugly (see
v5 of this series for reference [1]).

It also means that we would have to drag the CM functions into the EL2
object, something that we don't need with this approach.

	M.

[1] https://lore.kernel.org/r/20210415115032.35760-1-wangyanan55@huawei.com
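
To make the trade-off concrete, here is a minimal sketch of the two shapes
being discussed (invented names, not the actual pgtable.c code): a per-context
flag plus a direct call, versus an optional callback in the ops structure.
With the callback form, a build that never wants CMOs simply leaves the
pointer NULL and never references the maintenance helper, which is the point
about not having to drag the CM functions into the EL2 object.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustration only -- invented names, not the actual pgtable.c code. */

static void clean_dcache_helper(void *va, size_t size)
{
	printf("D-cache CMO at %p, %zu bytes\n", va, size);
}

/*
 * Option A: direct call, gated by an extra per-context flag that the walker
 * data has to carry around.
 */
struct map_data_flag {
	bool guest_needs_cmo;
};

static void install_with_flag(struct map_data_flag *d, void *va, size_t granule)
{
	if (d->guest_needs_cmo)
		clean_dcache_helper(va, granule);	/* helper referenced by every build */
}

/*
 * Option B (this patch): an optional callback in the ops structure.  A build
 * that never needs CMOs leaves the pointer NULL and never references the
 * helper at all.
 */
struct mm_ops_cb {
	void (*clean_invalidate_dcache)(void *va, size_t size);
};

static void install_with_cb(const struct mm_ops_cb *ops, void *va, size_t granule)
{
	if (ops->clean_invalidate_dcache)
		ops->clean_invalidate_dcache(va, granule);
}

int main(void)
{
	char page[16];
	struct map_data_flag d = { .guest_needs_cmo = true };
	struct mm_ops_cb host_ops = { .clean_invalidate_dcache = clean_dcache_helper };
	struct mm_ops_cb hyp_ops  = { 0 };	/* no CMOs wanted in this context */

	install_with_flag(&d, page, sizeof(page));
	install_with_cb(&host_ops, page, sizeof(page));
	install_with_cb(&hyp_ops, page, sizeof(page));	/* silently skipped */
	return 0;
}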
Will Deacon June 17, 2021, 1:21 p.m. UTC | #3
On Thu, Jun 17, 2021 at 01:59:37PM +0100, Marc Zyngier wrote:
> On Thu, 17 Jun 2021 13:45:57 +0100,
> Will Deacon <will@kernel.org> wrote:
> > On Thu, Jun 17, 2021 at 06:58:24PM +0800, Yanan Wang wrote:
> > > @@ -606,6 +618,14 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> > >  		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
> > >  	}
> > >  
> > > +	/* Perform CMOs before installation of the guest stage-2 PTE */
> > > +	if (mm_ops->clean_invalidate_dcache && stage2_pte_cacheable(pgt, new))
> > > +		mm_ops->clean_invalidate_dcache(kvm_pte_follow(new, mm_ops),
> > > +						granule);
> > > +
> > > +	if (mm_ops->invalidate_icache && stage2_pte_executable(new))
> > > +		mm_ops->invalidate_icache(kvm_pte_follow(new, mm_ops), granule);
> > 
> > One thing I'm missing here is why we need the indirection via mm_ops. Are
> > there cases where we would want to pass a different function pointer for
> > invalidating the icache? If not, why not just call the function directly?
> > 
> > Same for the D side.
> 
> If we didn't do that, we'd end-up having to track whether the guest
> context requires CMOs with additional flags, which is pretty ugly (see
> v5 of this series for reference [1]).

Fair enough, although the function pointers here _are_ being used as flags,
as they only ever have one of two possible values (NULL or the CMO function),
so it's a shame to bring in the indirect branch as well.

> It also means that we would have to drag the CM functions into the EL2
> object, something that we don't need with this approach.

I think it won't be long before we end up with CMO functions at EL2 and
you'd hope we'd be able to use the same code as EL1 for something like
that. But I also wouldn't want to put money on it...

Anyway, no strong opinion on this, it just jumped out when I skimmed the
patches.

Will
Marc Zyngier June 17, 2021, 1:37 p.m. UTC | #4
On Thu, 17 Jun 2021 14:21:16 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> On Thu, Jun 17, 2021 at 01:59:37PM +0100, Marc Zyngier wrote:
> > On Thu, 17 Jun 2021 13:45:57 +0100,
> > Will Deacon <will@kernel.org> wrote:
> > > On Thu, Jun 17, 2021 at 06:58:24PM +0800, Yanan Wang wrote:
> > > > @@ -606,6 +618,14 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> > > >  		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
> > > >  	}
> > > >  
> > > > +	/* Perform CMOs before installation of the guest stage-2 PTE */
> > > > +	if (mm_ops->clean_invalidate_dcache && stage2_pte_cacheable(pgt, new))
> > > > +		mm_ops->clean_invalidate_dcache(kvm_pte_follow(new, mm_ops),
> > > > +						granule);
> > > > +
> > > > +	if (mm_ops->invalidate_icache && stage2_pte_executable(new))
> > > > +		mm_ops->invalidate_icache(kvm_pte_follow(new, mm_ops), granule);
> > > 
> > > One thing I'm missing here is why we need the indirection via mm_ops. Are
> > > there cases where we would want to pass a different function pointer for
> > > invalidating the icache? If not, why not just call the function directly?
> > > 
> > > Same for the D side.
> > 
> > If we didn't do that, we'd end-up having to track whether the guest
> > context requires CMOs with additional flags, which is pretty ugly (see
> > v5 of this series for reference [1]).
> 
> Fair enough, although the function pointers here _are_ being used as
> flags, as they only ever have one of two possible values (NULL or
> the CMO function), so it's a shame to bring in the indirect branch
> as well.

What I hope eventually is to get rid of some of the FWB tracking we
have for the host in the protected case, and use the same abstraction.

> 
> > It also means that we would have to drag the CM functions into the EL2
> > object, something that we don't need with this approach.
> 
> I think it won't be long before we end up with CMO functions at EL2 and
> you'd hope we'd be able to use the same code as EL1 for something like
> that. But I also wouldn't want to put money on it...

If we reach that stage, I'll be happy to try and move these functions
into some shared location.

> Anyway, no strong opinion on this, it just jumped out when I skimmed the
> patches.

Thanks,

	M.
Fuad Tabba June 18, 2021, 9:30 a.m. UTC | #5
Hi Yanan,

On Thu, Jun 17, 2021 at 11:58 AM Yanan Wang <wangyanan55@huawei.com> wrote:
>
> We currently uniformly permorm CMOs of D-cache and I-cache in function

Nit: permorm -> perform

> user_mem_abort before calling the fault handlers. If we get concurrent
> guest faults(e.g. translation faults, permission faults) or some really
> unnecessary guest faults caused by BBM, CMOs for the first vcpu are
> necessary while the others later are not.
>
> By moving CMOs to the fault handlers, we can easily identify conditions
> where they are really needed and avoid the unnecessary ones. As it's a
> time consuming process to perform CMOs especially when flushing a block
> range, so this solution reduces much load of kvm and improve efficiency
> of the stage-2 page table code.
>
> We can imagine two specific scenarios which will gain much benefit:
> 1) In a normal VM startup, this solution will improve the efficiency of
> handling guest page faults incurred by vCPUs, when initially populating
> stage-2 page tables.
> 2) After live migration, the heavy workload will be resumed on the
> destination VM, however all the stage-2 page tables need to be rebuilt
> at the moment. So this solution will ease the performance drop during
> resuming stage.
>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 38 +++++++++++++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c         | 37 ++++++++++++++---------------------
>  2 files changed, 46 insertions(+), 29 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index d99789432b05..760c551f61da 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -577,12 +577,24 @@ static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
>         mm_ops->put_page(ptep);
>  }
>
> +static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
> +{
> +       u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> +       return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
> +}
> +
> +static bool stage2_pte_executable(kvm_pte_t pte)
> +{
> +       return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
> +}
> +
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>                                       kvm_pte_t *ptep,
>                                       struct stage2_map_data *data)
>  {
>         kvm_pte_t new, old = *ptep;
>         u64 granule = kvm_granule_size(level), phys = data->phys;
> +       struct kvm_pgtable *pgt = data->mmu->pgt;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
>         if (!kvm_block_mapping_supported(addr, end, phys, level))
> @@ -606,6 +618,14 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>                 stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
>         }
>
> +       /* Perform CMOs before installation of the guest stage-2 PTE */
> +       if (mm_ops->clean_invalidate_dcache && stage2_pte_cacheable(pgt, new))
> +               mm_ops->clean_invalidate_dcache(kvm_pte_follow(new, mm_ops),
> +                                               granule);
> +
> +       if (mm_ops->invalidate_icache && stage2_pte_executable(new))
> +               mm_ops->invalidate_icache(kvm_pte_follow(new, mm_ops), granule);
> +
>         smp_store_release(ptep, new);
>         if (stage2_pte_is_counted(new))
>                 mm_ops->get_page(ptep);
> @@ -798,12 +818,6 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
>         return ret;
>  }
>
> -static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
> -{
> -       u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> -       return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
> -}
> -
>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>                                enum kvm_pgtable_walk_flags flag,
>                                void * const arg)
> @@ -874,6 +888,7 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  {
>         kvm_pte_t pte = *ptep;
>         struct stage2_attr_data *data = arg;
> +       struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
>         if (!kvm_pte_valid(pte))
>                 return 0;
> @@ -888,8 +903,17 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>          * but worst-case the access flag update gets lost and will be
>          * set on the next access instead.
>          */
> -       if (data->pte != pte)
> +       if (data->pte != pte) {
> +               /*
> +                * Invalidate instruction cache before updating the guest
> +                * stage-2 PTE if we are going to add executable permission.
> +                */
> +               if (mm_ops->invalidate_icache &&
> +                   stage2_pte_executable(pte) && !stage2_pte_executable(*ptep))
> +                       mm_ops->invalidate_icache(kvm_pte_follow(pte, mm_ops),
> +                                                 kvm_granule_size(level));
>                 WRITE_ONCE(*ptep, pte);
> +       }
>
>         return 0;
>  }
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index b980f8a47cbb..c9f002d74ab4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -434,14 +434,16 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
>  }
>
>  static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> -       .zalloc_page            = stage2_memcache_zalloc_page,
> -       .zalloc_pages_exact     = kvm_host_zalloc_pages_exact,
> -       .free_pages_exact       = free_pages_exact,
> -       .get_page               = kvm_host_get_page,
> -       .put_page               = kvm_host_put_page,
> -       .page_count             = kvm_host_page_count,
> -       .phys_to_virt           = kvm_host_va,
> -       .virt_to_phys           = kvm_host_pa,
> +       .zalloc_page                    = stage2_memcache_zalloc_page,
> +       .zalloc_pages_exact             = kvm_host_zalloc_pages_exact,
> +       .free_pages_exact               = free_pages_exact,
> +       .get_page                       = kvm_host_get_page,
> +       .put_page                       = kvm_host_put_page,
> +       .page_count                     = kvm_host_page_count,
> +       .phys_to_virt                   = kvm_host_va,
> +       .virt_to_phys                   = kvm_host_pa,
> +       .clean_invalidate_dcache        = clean_dcache_guest_page,
> +       .invalidate_icache              = invalidate_icache_guest_page,
>  };
>
>  /**
> @@ -1012,15 +1014,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         if (writable)
>                 prot |= KVM_PGTABLE_PROT_W;
>
> -       if (fault_status != FSC_PERM && !device)
> -               clean_dcache_guest_page(page_address(pfn_to_page(pfn)),
> -                                       vma_pagesize);
> -
> -       if (exec_fault) {
> +       if (exec_fault)
>                 prot |= KVM_PGTABLE_PROT_X;
> -               invalidate_icache_guest_page(page_address(pfn_to_page(pfn)),
> -                                            vma_pagesize);
> -       }
>
>         if (device)
>                 prot |= KVM_PGTABLE_PROT_DEVICE;
> @@ -1218,12 +1213,10 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>         WARN_ON(range->end - range->start != 1);
>
>         /*
> -        * We've moved a page around, probably through CoW, so let's treat it
> -        * just like a translation fault and clean the cache to the PoC.
> -        */
> -       clean_dcache_guest_page(page_address(pfn_to_page(pfn), PAGE_SIZE);
> -
> -       /*
> +        * We've moved a page around, probably through CoW, so let's treat
> +        * it just like a translation fault and the map handler will clean
> +        * the cache to the PoC.
> +        *
>          * The MMU notifiers will have unmapped a huge PMD before calling
>          * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
>          * therefore we never need to clear out a huge PMD through this

Reviewed-by: Fuad Tabba <tabba@google.com>

Thanks,
/fuad

> --
> 2.23.0
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
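
As a side note on the stage2_attr_walker() hunk quoted above, here is a
hedged, self-contained sketch (invented names and bit layout) of the condition
it adds: the I-cache only needs invalidating when the updated PTE newly gains
execute permission, not on every attribute change.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_EXEC	(1u << 1)	/* hypothetical "executable" bit */

typedef uint64_t pte_t;

static bool pte_executable(pte_t pte)
{
	return pte & PTE_EXEC;
}

/*
 * Relax attributes on an existing PTE.  The I-cache invalidation is only
 * needed when execute permission is being newly granted, not on every
 * attribute update.
 */
static void relax_attrs(pte_t *ptep, pte_t new)
{
	pte_t old = *ptep;

	if (old == new)
		return;

	if (pte_executable(new) && !pte_executable(old))
		printf("I-cache invalidate before granting exec\n");

	*ptep = new;
}

int main(void)
{
	pte_t pte = 0;

	relax_attrs(&pte, PTE_EXEC);	/* exec newly granted -> CMO */
	relax_attrs(&pte, PTE_EXEC);	/* no change -> no CMO */
	return 0;
}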

Patch

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index d99789432b05..760c551f61da 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -577,12 +577,24 @@  static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
 	mm_ops->put_page(ptep);
 }
 
+static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
+{
+	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
+	return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
+}
+
+static bool stage2_pte_executable(kvm_pte_t pte)
+{
+	return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
+}
+
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep,
 				      struct stage2_map_data *data)
 {
 	kvm_pte_t new, old = *ptep;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
+	struct kvm_pgtable *pgt = data->mmu->pgt;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
 	if (!kvm_block_mapping_supported(addr, end, phys, level))
@@ -606,6 +618,14 @@  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
 	}
 
+	/* Perform CMOs before installation of the guest stage-2 PTE */
+	if (mm_ops->clean_invalidate_dcache && stage2_pte_cacheable(pgt, new))
+		mm_ops->clean_invalidate_dcache(kvm_pte_follow(new, mm_ops),
+						granule);
+
+	if (mm_ops->invalidate_icache && stage2_pte_executable(new))
+		mm_ops->invalidate_icache(kvm_pte_follow(new, mm_ops), granule);
+
 	smp_store_release(ptep, new);
 	if (stage2_pte_is_counted(new))
 		mm_ops->get_page(ptep);
@@ -798,12 +818,6 @@  int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	return ret;
 }
 
-static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
-{
-	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
-	return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
-}
-
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
@@ -874,6 +888,7 @@  static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 {
 	kvm_pte_t pte = *ptep;
 	struct stage2_attr_data *data = arg;
+	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
 	if (!kvm_pte_valid(pte))
 		return 0;
@@ -888,8 +903,17 @@  static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	 * but worst-case the access flag update gets lost and will be
 	 * set on the next access instead.
 	 */
-	if (data->pte != pte)
+	if (data->pte != pte) {
+		/*
+		 * Invalidate instruction cache before updating the guest
+		 * stage-2 PTE if we are going to add executable permission.
+		 */
+		if (mm_ops->invalidate_icache &&
+		    stage2_pte_executable(pte) && !stage2_pte_executable(*ptep))
+			mm_ops->invalidate_icache(kvm_pte_follow(pte, mm_ops),
+						  kvm_granule_size(level));
 		WRITE_ONCE(*ptep, pte);
+	}
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b980f8a47cbb..c9f002d74ab4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -434,14 +434,16 @@  int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
 }
 
 static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
-	.zalloc_page		= stage2_memcache_zalloc_page,
-	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
-	.free_pages_exact	= free_pages_exact,
-	.get_page		= kvm_host_get_page,
-	.put_page		= kvm_host_put_page,
-	.page_count		= kvm_host_page_count,
-	.phys_to_virt		= kvm_host_va,
-	.virt_to_phys		= kvm_host_pa,
+	.zalloc_page			= stage2_memcache_zalloc_page,
+	.zalloc_pages_exact		= kvm_host_zalloc_pages_exact,
+	.free_pages_exact		= free_pages_exact,
+	.get_page			= kvm_host_get_page,
+	.put_page			= kvm_host_put_page,
+	.page_count			= kvm_host_page_count,
+	.phys_to_virt			= kvm_host_va,
+	.virt_to_phys			= kvm_host_pa,
+	.clean_invalidate_dcache	= clean_dcache_guest_page,
+	.invalidate_icache		= invalidate_icache_guest_page,
 };
 
 /**
@@ -1012,15 +1014,8 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
-	if (fault_status != FSC_PERM && !device)
-		clean_dcache_guest_page(page_address(pfn_to_page(pfn)),
-					vma_pagesize);
-
-	if (exec_fault) {
+	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
-		invalidate_icache_guest_page(page_address(pfn_to_page(pfn)),
-					     vma_pagesize);
-	}
 
 	if (device)
 		prot |= KVM_PGTABLE_PROT_DEVICE;
@@ -1218,12 +1213,10 @@  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	WARN_ON(range->end - range->start != 1);
 
 	/*
-	 * We've moved a page around, probably through CoW, so let's treat it
-	 * just like a translation fault and clean the cache to the PoC.
-	 */
-	clean_dcache_guest_page(page_address(pfn_to_page(pfn), PAGE_SIZE);
-
-	/*
+	 * We've moved a page around, probably through CoW, so let's treat
+	 * it just like a translation fault and the map handler will clean
+	 * the cache to the PoC.
+	 *
 	 * The MMU notifiers will have unmapped a huge PMD before calling
 	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
 	 * therefore we never need to clear out a huge PMD through this