| Message ID | 20220311002528.2230172-17-dmatlack@google.com (mailing list archive) |
|---|---|
| State | Handled Elsewhere |
| Series | Extend Eager Page Splitting to the shadow MMU |
On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> In order to split a huge page we need to know what access bits to assign
> to the role of the new child page table. This can't be easily derived
> from the huge page SPTE itself since KVM applies its own access policies
> on top, such as for HugePage NX.
>
> We could walk the guest page tables to determine the correct access
> bits, but that is difficult to plumb outside of a vCPU fault context.
> Instead, we can store the original access bits for each leaf SPTE
> alongside the GFN in the gfns array. The access bits only take up 3
> bits, which leaves 61 bits left over for gfns, which is more than
> enough. So this change does not require any additional memory.

I have a pure question on why eager page split needs to worry on hugepage
nx..

IIUC that was about forbidden huge page being mapped as executable. So
afaiu the only missing bit that could happen if we copy over the huge page
ptes is the executable bit.

But then? I think we could get a page fault on fault->exec==true on the
split small page (because when we copy over it's cleared, even though the
page can actually be executable), but it should be well resolved right
after that small page fault.

The thing is IIUC this is a very rare case, IOW, it should mostly not
happen in 99% of the use case? And there's a slight penalty when it
happens, but only perf-wise.

As I'm not really fluent with the code base, perhaps I missed something?

>
> In order to keep the access bit cache in sync with the guest, we have to
> extend FNAME(sync_page) to also update the access bits.

Besides sync_page(), I also see that in mmu_set_spte() there's a path that
we will skip the rmap_add() if existed:

	if (!was_rmapped) {
		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
		kvm_update_page_stats(vcpu->kvm, level, 1);
		rmap_add(vcpu, slot, sptep, gfn);
	}

I didn't check, but it's not obvious whether the sync_page() change here
will cover all of the cases, hence raise this up too.

>
> Now that the gfns array caches more information than just GFNs, rename
> it to shadowed_translation.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
>  4 files changed, 38 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f72e80178ffc..0f5a36772bdc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -694,7 +694,7 @@ struct kvm_vcpu_arch {
> >
> >  	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> >  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> +	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;

I'd called it with a shorter name.. :) maybe mmu_shadowed_info_cache? No
strong opinion.

>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
>
>  	/*

[...]

> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index b6e22ba9c654..c5b8ee625df7 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -32,6 +32,11 @@ extern bool dbg;
>
>  typedef u64 __rcu *tdp_ptep_t;
>
> +struct shadowed_translation_entry {
> +	u64 access:3;
> +	u64 gfn:56;

Why 56?

> +};

Thanks,
On Wed, Mar 16, 2022 at 1:32 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> > In order to split a huge page we need to know what access bits to assign
> > to the role of the new child page table. This can't be easily derived
> > from the huge page SPTE itself since KVM applies its own access policies
> > on top, such as for HugePage NX.
> >
> > We could walk the guest page tables to determine the correct access
> > bits, but that is difficult to plumb outside of a vCPU fault context.
> > Instead, we can store the original access bits for each leaf SPTE
> > alongside the GFN in the gfns array. The access bits only take up 3
> > bits, which leaves 61 bits left over for gfns, which is more than
> > enough. So this change does not require any additional memory.
>
> I have a pure question on why eager page split needs to worry on hugepage
> nx..
>
> IIUC that was about forbidden huge page being mapped as executable. So
> afaiu the only missing bit that could happen if we copy over the huge page
> ptes is the executable bit.
>
> But then? I think we could get a page fault on fault->exec==true on the
> split small page (because when we copy over it's cleared, even though the
> page can actually be executable), but it should be well resolved right
> after that small page fault.
>
> The thing is IIUC this is a very rare case, IOW, it should mostly not
> happen in 99% of the use case? And there's a slight penalty when it
> happens, but only perf-wise.
>
> As I'm not really fluent with the code base, perhaps I missed something?

You're right that we could get away with not knowing the shadowed
access permissions to assign to the child SPTEs when splitting a huge
SPTE. We could just copy the huge SPTE access permissions and then let
the execute bit be repaired on fault (although those faults would be a
performance cost).

But the access permissions are also needed to lookup an existing
shadow page (or create a new shadow page) to use to split the huge
page. For example, let's say we are going to split a huge page that
does not have execute permissions enabled. That could be because NX
HugePages are enabled or because we are shadowing a guest translation
that does not allow execution (or both). We wouldn't want to propagate
the no-execute permission into the child SP role.access if the
shadowed translation really does allow execution (and vice versa).

> >
> > In order to keep the access bit cache in sync with the guest, we have to
> > extend FNAME(sync_page) to also update the access bits.
>
> Besides sync_page(), I also see that in mmu_set_spte() there's a path that
> we will skip the rmap_add() if existed:
>
> 	if (!was_rmapped) {
> 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
> 		kvm_update_page_stats(vcpu->kvm, level, 1);
> 		rmap_add(vcpu, slot, sptep, gfn);
> 	}
>
> I didn't check, but it's not obvious whether the sync_page() change here
> will cover all of the cases, hence raise this up too.

Good catch. I will need to dig into this more to confirm but I think
you might be right.

> >
> > Now that the gfns array caches more information than just GFNs, rename
> > it to shadowed_translation.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 +-
> >  arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
> >  arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
> >  4 files changed, 38 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f72e80178ffc..0f5a36772bdc 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -694,7 +694,7 @@ struct kvm_vcpu_arch {
> >
> >  	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> >  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> > +	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
>
> I'd called it with a shorter name.. :) maybe mmu_shadowed_info_cache? No
> strong opinion.
>
> >  	struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> >  	/*
>
> [...]
>
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index b6e22ba9c654..c5b8ee625df7 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -32,6 +32,11 @@ extern bool dbg;
> >
> >  typedef u64 __rcu *tdp_ptep_t;
> >
> > +struct shadowed_translation_entry {
> > +	u64 access:3;
> > +	u64 gfn:56;
>
> Why 56?

I was going for the theoretical maximum number of bits for a GFN. But
that would be 64 - 12 = 52... so I'm not sure what I was thinking
here. I'll switch it to 52 and add a comment.

>
> > +};
>
> Thanks,
>
> --
> Peter Xu
>
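To make the 64 - 12 = 52 arithmetic above concrete, a sketch of the revised entry might look like the following. This is only an illustration of what David describes ("switch it to 52 and add a comment"), not code from a later revision of the series:

	/*
	 * Sketch only: with 4KiB pages a GFN needs at most 64 - PAGE_SHIFT = 52
	 * bits, so 3 bits of ACC_* access flags and the GFN still pack into a
	 * single u64, keeping the cache the same size as the old gfns array.
	 */
	struct shadowed_translation_entry {
		u64 access:3;
		u64 gfn:52;
	};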
On Tue, Mar 22, 2022 at 03:51:54PM -0700, David Matlack wrote:
> On Wed, Mar 16, 2022 at 1:32 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> > > In order to split a huge page we need to know what access bits to assign
> > > to the role of the new child page table. This can't be easily derived
> > > from the huge page SPTE itself since KVM applies its own access policies
> > > on top, such as for HugePage NX.
> > >
> > > We could walk the guest page tables to determine the correct access
> > > bits, but that is difficult to plumb outside of a vCPU fault context.
> > > Instead, we can store the original access bits for each leaf SPTE
> > > alongside the GFN in the gfns array. The access bits only take up 3
> > > bits, which leaves 61 bits left over for gfns, which is more than
> > > enough. So this change does not require any additional memory.
> >
> > I have a pure question on why eager page split needs to worry on hugepage
> > nx..
> >
> > IIUC that was about forbidden huge page being mapped as executable. So
> > afaiu the only missing bit that could happen if we copy over the huge page
> > ptes is the executable bit.
> >
> > But then? I think we could get a page fault on fault->exec==true on the
> > split small page (because when we copy over it's cleared, even though the
> > page can actually be executable), but it should be well resolved right
> > after that small page fault.
> >
> > The thing is IIUC this is a very rare case, IOW, it should mostly not
> > happen in 99% of the use case? And there's a slight penalty when it
> > happens, but only perf-wise.
> >
> > As I'm not really fluent with the code base, perhaps I missed something?
>
> You're right that we could get away with not knowing the shadowed
> access permissions to assign to the child SPTEs when splitting a huge
> SPTE. We could just copy the huge SPTE access permissions and then let
> the execute bit be repaired on fault (although those faults would be a
> performance cost).
>
> But the access permissions are also needed to lookup an existing
> shadow page (or create a new shadow page) to use to split the huge
> page. For example, let's say we are going to split a huge page that
> does not have execute permissions enabled. That could be because NX
> HugePages are enabled or because we are shadowing a guest translation
> that does not allow execution (or both). We wouldn't want to propagate
> the no-execute permission into the child SP role.access if the
> shadowed translation really does allow execution (and vice versa).

Then the follow up (pure) question is what will happen if we simply
propagate the no-exec permission into the child SP?

I think that only happens with direct sptes where guest used huge pages
because that's where the shadow page will be huge, so IIUC that's checked
here when the small page fault triggers:

static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
				 unsigned direct_access)
{
	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
		struct kvm_mmu_page *child;

		/*
		 * For the direct sp, if the guest pte's dirty bit
		 * changed form clean to dirty, it will corrupt the
		 * sp's access: allow writable in the read-only sp,
		 * so we should update the spte at this point to get
		 * a new sp with the correct access.
		 */
		child = to_shadow_page(*sptep & PT64_BASE_ADDR_MASK);
		if (child->role.access == direct_access)
			return;

		drop_parent_pte(child, sptep);
		kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
	}
}

Due to missing EXEC the role.access check will not match with direct
access, which is the guest pgtable value which has EXEC set. Then IIUC
we'll simply drop the no-exec SP and replace it with a new one with exec
perm. The question is, would that ultimately work too?

Even if that works, I agree this sounds tricky because we're potentially
caching fake sp.role conditionally and it seems we never do that before.
It's just that the other option that you proposed here seems to add other
way of complexity on caching spte permission information while kvm doesn't
do either before. IMHO we need to see which is the best trade off.

I could have missed something else, though.

Thanks,
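For readers following along, here is an annotated walk-through of the mismatch Peter describes, using KVM's ACC_* flag names. The concrete values are assumptions chosen for illustration, not output from a real trace:

	/* Hypothetical split that inherits the huge SPTE's effective permissions: */
	child_role.access = ACC_USER_MASK | ACC_WRITE_MASK;		/* exec missing */

	/* Later, the shadow fault path computes direct_access from the guest PTEs,
	 * which do allow execution: */
	direct_access = ACC_USER_MASK | ACC_WRITE_MASK | ACC_EXEC_MASK;

	/*
	 * validate_direct_spte() then sees child->role.access != direct_access,
	 * drops the no-exec child SP, and a replacement SP with exec set is
	 * faulted in -- correct behavior, at the cost of the extra fault.
	 */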
On Wed, Mar 30, 2022 at 11:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Mar 22, 2022 at 03:51:54PM -0700, David Matlack wrote:
> > On Wed, Mar 16, 2022 at 1:32 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> > > > In order to split a huge page we need to know what access bits to assign
> > > > to the role of the new child page table. This can't be easily derived
> > > > from the huge page SPTE itself since KVM applies its own access policies
> > > > on top, such as for HugePage NX.
> > > >
> > > > We could walk the guest page tables to determine the correct access
> > > > bits, but that is difficult to plumb outside of a vCPU fault context.
> > > > Instead, we can store the original access bits for each leaf SPTE
> > > > alongside the GFN in the gfns array. The access bits only take up 3
> > > > bits, which leaves 61 bits left over for gfns, which is more than
> > > > enough. So this change does not require any additional memory.
> > >
> > > I have a pure question on why eager page split needs to worry on hugepage
> > > nx..
> > >
> > > IIUC that was about forbidden huge page being mapped as executable. So
> > > afaiu the only missing bit that could happen if we copy over the huge page
> > > ptes is the executable bit.
> > >
> > > But then? I think we could get a page fault on fault->exec==true on the
> > > split small page (because when we copy over it's cleared, even though the
> > > page can actually be executable), but it should be well resolved right
> > > after that small page fault.
> > >
> > > The thing is IIUC this is a very rare case, IOW, it should mostly not
> > > happen in 99% of the use case? And there's a slight penalty when it
> > > happens, but only perf-wise.
> > >
> > > As I'm not really fluent with the code base, perhaps I missed something?
> >
> > You're right that we could get away with not knowing the shadowed
> > access permissions to assign to the child SPTEs when splitting a huge
> > SPTE. We could just copy the huge SPTE access permissions and then let
> > the execute bit be repaired on fault (although those faults would be a
> > performance cost).
> >
> > But the access permissions are also needed to lookup an existing
> > shadow page (or create a new shadow page) to use to split the huge
> > page. For example, let's say we are going to split a huge page that
> > does not have execute permissions enabled. That could be because NX
> > HugePages are enabled or because we are shadowing a guest translation
> > that does not allow execution (or both). We wouldn't want to propagate
> > the no-execute permission into the child SP role.access if the
> > shadowed translation really does allow execution (and vice versa).
>
> Then the follow up (pure) question is what will happen if we simply
> propagate the no-exec permission into the child SP?
>
> I think that only happens with direct sptes where guest used huge pages
> because that's where the shadow page will be huge, so IIUC that's checked
> here when the small page fault triggers:
>
> static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> 				 unsigned direct_access)
> {
> 	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
> 		struct kvm_mmu_page *child;
>
> 		/*
> 		 * For the direct sp, if the guest pte's dirty bit
> 		 * changed form clean to dirty, it will corrupt the
> 		 * sp's access: allow writable in the read-only sp,
> 		 * so we should update the spte at this point to get
> 		 * a new sp with the correct access.
> 		 */
> 		child = to_shadow_page(*sptep & PT64_BASE_ADDR_MASK);
> 		if (child->role.access == direct_access)
> 			return;
>
> 		drop_parent_pte(child, sptep);
> 		kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
> 	}
> }
>
> Due to missing EXEC the role.access check will not match with direct
> access, which is the guest pgtable value which has EXEC set. Then IIUC
> we'll simply drop the no-exec SP and replace it with a new one with exec
> perm. The question is, would that ultimately work too?
>
> Even if that works, I agree this sounds tricky because we're potentially
> caching fake sp.role conditionally and it seems we never do that before.
> It's just that the other option that you proposed here seems to add other
> way of complexity on caching spte permission information while kvm doesn't
> do either before. IMHO we need to see which is the best trade off.

Clever! I think you're right that it would work correctly. This
approach avoids the need for caching access bits, but comes with
downsides:

 - Performance impact from the extra faults needed to drop the SP and
   repair the execute permission bit.
 - Some amount of memory overhead from KVM allocating new SPs when it
   could re-use existing SPs.

Given the relative simplicity of access caching (and the fact that it
requires no additional memory), I'm inclined to stick with it rather
than taking the access permissions from the huge page.

>
> I could have missed something else, though.
>
> Thanks,
>
> --
> Peter Xu
>
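To illustrate David's point about shadow page lookup, here is a rough sketch of how a split path could build the child role from the cached translation rather than from the huge SPTE. The helper name and exact plumbing are assumptions for illustration, not code from this series:

	/*
	 * Rough sketch, not from the series: derive the child SP role for an
	 * eager split from the access bits cached in shadowed_translation, so
	 * that an existing child SP with matching role.access can be reused.
	 */
	static union kvm_mmu_page_role sketch_child_role_for_split(struct kvm_mmu_page *huge_sp,
								   int spte_index)
	{
		union kvm_mmu_page_role role = huge_sp->role;

		role.level--;
		/* Splitting always maps the lower level directly (see the patch). */
		role.direct = 1;
		/*
		 * Use the guest-visible permissions cached by __rmap_add(), not
		 * the huge SPTE's effective (e.g. NX-hugepage-restricted) bits.
		 */
		role.access = huge_sp->shadowed_translation[spte_index].access;

		return role;
	}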
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f72e80178ffc..0f5a36772bdc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -694,7 +694,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 73a7077f9991..89a7a8d7a632 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -708,7 +708,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -721,7 +721,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache);
	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -738,15 +738,17 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index].gfn;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
+					gfn_t gfn, u32 access)
 {
 	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+		sp->shadowed_translation[index].gfn = gfn;
+		sp->shadowed_translation[index].access = access;
 		return;
 	}
 
@@ -1599,14 +1601,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_gfn_access(sp, spte - sp->spt, gfn, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1620,9 +1622,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1683,7 +1685,7 @@ void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1720,8 +1722,12 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+	BUILD_BUG_ON(sizeof(sp->shadowed_translation[0]) != sizeof(u64));
+
 	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->shadowed_translation =
+			kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadowed_translation_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -1733,7 +1739,7 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
  *
  * Huge page splitting always uses direct shadow pages since the huge page is
  * being mapped directly with a lower level page table. Thus there's no need to
- * allocate the gfns array.
+ * allocate the shadowed_translation array.
  */
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
 {
@@ -2849,7 +2855,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b6e22ba9c654..c5b8ee625df7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -32,6 +32,11 @@ extern bool dbg;
 
 typedef u64 __rcu *tdp_ptep_t;
 
+struct shadowed_translation_entry {
+	u64 access:3;
+	u64 gfn:56;
+};
+
 struct kvm_mmu_page {
 	/*
 	 * Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -53,8 +58,14 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+
+	/*
+	 * For indirect shadow pages, caches the result of the intermediate
+	 * guest translation being shadowed by each SPTE.
+	 *
+	 * NULL for direct shadow pages.
+	 */
+	struct shadowed_translation_entry *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 55cac59b9c9b..128eccadf1de 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1014,7 +1014,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1088,12 +1088,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != sp->shadowed_translation[i].gfn) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		if (pte_access != sp->shadowed_translation[i].access)
+			sp->shadowed_translation[i].access = pte_access;
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
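One open question from the review above is the mmu_set_spte() path that skips rmap_add() (and with it the access-bit update) when the SPTE was already rmapped. A minimal sketch of how the cache could be refreshed on that path, using only the helpers this patch introduces and offered as an assumption rather than part of the posted diff, might look like:

	if (!was_rmapped) {
		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
		rmap_add(vcpu, slot, sptep, gfn, pte_access);
	} else {
		struct kvm_mmu_page *sp = sptep_to_sp(sptep);

		/*
		 * rmap_add() (and hence kvm_mmu_page_set_gfn_access()) is
		 * skipped when the SPTE is already rmapped, so refresh the
		 * cached access bits here to keep them in sync with the guest.
		 */
		kvm_mmu_page_set_gfn_access(sp, sptep - sp->spt, gfn, pte_access);
	}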
In order to split a huge page we need to know what access bits to assign
to the role of the new child page table. This can't be easily derived
from the huge page SPTE itself since KVM applies its own access policies
on top, such as for HugePage NX.

We could walk the guest page tables to determine the correct access
bits, but that is difficult to plumb outside of a vCPU fault context.
Instead, we can store the original access bits for each leaf SPTE
alongside the GFN in the gfns array. The access bits only take up 3
bits, which leaves 61 bits left over for gfns, which is more than
enough. So this change does not require any additional memory.

In order to keep the access bit cache in sync with the guest, we have to
extend FNAME(sync_page) to also update the access bits.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
 arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
 4 files changed, 38 insertions(+), 18 deletions(-)
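As a companion to the kvm_mmu_page_get_gfn() change in the diff above, later consumers of this cache (for example the eager-split code the series builds toward) would presumably need a read accessor along these lines. This is a sketch under that assumption, not part of this patch:

	/*
	 * Sketch of a read-side helper mirroring kvm_mmu_page_get_gfn().
	 * Direct shadow pages have no shadowed guest translation, so the
	 * SP's own role.access is the closest available answer there.
	 */
	static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
	{
		if (!sp->role.direct)
			return sp->shadowed_translation[index].access;

		return sp->role.access;
	}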