Message ID | 20220331213844.2894006-1-oupton@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: arm64: Don't split hugepages outside of MMU write lock |
Hi Oliver,

On Thu, Mar 31, 2022 at 2:38 PM Oliver Upton <oupton@google.com> wrote:
>
> It is possible to take a stage-2 permission fault on a page larger than
> PAGE_SIZE. For example, when running a guest backed by 2M HugeTLB, KVM
> eagerly maps at the largest possible block size. When dirty logging is
> enabled on a memslot, KVM does *not* eagerly split these 2M stage-2
> mappings and instead clears the write bit on the pte.
>
> Since dirty logging is always performed at PAGE_SIZE granularity, KVM
> lazily splits these 2M block mappings down to PAGE_SIZE in the stage-2
> fault handler. This operation must be done under the write lock. Since
> commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission
> relaxation during dirty logging"), the stage-2 fault handler
> conditionally takes the read lock on permission faults with dirty
> logging enabled. To that end, it is possible to split a 2M block mapping
> while only holding the read lock.
>
> The problem is demonstrated by running kvm_page_table_test with 2M
> anonymous HugeTLB, which splats like so:
>
>  WARNING: CPU: 5 PID: 15276 at arch/arm64/kvm/hyp/pgtable.c:153 stage2_map_walk_leaf+0x124/0x158
>
>  [...]
>
>  Call trace:
>   stage2_map_walk_leaf+0x124/0x158
>   stage2_map_walker+0x5c/0xf0
>   __kvm_pgtable_walk+0x100/0x1d4
>   __kvm_pgtable_walk+0x140/0x1d4
>   __kvm_pgtable_walk+0x140/0x1d4
>   kvm_pgtable_walk+0xa0/0xf8
>   kvm_pgtable_stage2_map+0x15c/0x198
>   user_mem_abort+0x56c/0x838
>   kvm_handle_guest_abort+0x1fc/0x2a4
>   handle_exit+0xa4/0x120
>   kvm_arch_vcpu_ioctl_run+0x200/0x448
>   kvm_vcpu_ioctl+0x588/0x664
>   __arm64_sys_ioctl+0x9c/0xd4
>   invoke_syscall+0x4c/0x144
>   el0_svc_common+0xc4/0x190
>   do_el0_svc+0x30/0x8c
>   el0_svc+0x28/0xcc
>   el0t_64_sync_handler+0x84/0xe4
>   el0t_64_sync+0x1a4/0x1a8
>
> Fix the issue by only acquiring the read lock if the guest faulted on a
> PAGE_SIZE granule w/ dirty logging enabled. Since it is possible for the
> faulting IPA to get collapsed into a larger block mapping until the read
> lock is acquired, retry the faulting instruction any time that the fault
> cannot be fixed by relaxing permissions. In so doing, the fault handler
> will acquire the write lock for the subsequent fault on a mapping larger
> than PAGE_SIZE and split the block safely behind the write lock.
>
> Fixes: f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging")
> Cc: Jing Zhang <jingzhangos@google.com>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>
> Applies cleanly to kvmarm/fixes at the following commit:
>
> 8872d9b3e35a ("KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler")
>
> Tested the patch by running KVM selftests. Additionally, I did 10
> iterations of the kvm_page_table_test with 2M anon HugeTLB memory.
>
> It is expected that this patch will cause fault serialization in the
> pathological case where all vCPUs are faulting on the same granule of
> memory, as every vCPU will attempt to acquire the write lock. The only
> safe way to cure this contention is to dissolve pages eagerly outside of
> the stage-2 fault handler (like x86).
>
>  arch/arm64/kvm/mmu.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0d19259454d8..9384325bf3df 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1079,7 +1079,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn_t gfn;
>  	kvm_pfn_t pfn;
>  	bool logging_active = memslot_is_logging(memslot);
> -	bool logging_perm_fault = false;
> +	bool use_read_lock = false;
>  	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
>  	unsigned long vma_pagesize, fault_granule;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> @@ -1114,7 +1114,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (logging_active) {
>  		force_pte = true;
>  		vma_shift = PAGE_SHIFT;
> -		logging_perm_fault = (fault_status == FSC_PERM && write_fault);
> +		use_read_lock = (fault_status == FSC_PERM && write_fault &&
> +				 fault_granule == PAGE_SIZE);
>  	} else {
>  		vma_shift = get_vma_page_shift(vma, hva);
>  	}
> @@ -1218,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 * logging dirty logging, only acquire read lock for permission
>  	 * relaxation.
>  	 */
> -	if (logging_perm_fault)
> +	if (use_read_lock)
>  		read_lock(&kvm->mmu_lock);
>  	else
>  		write_lock(&kvm->mmu_lock);
> @@ -1267,10 +1268,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 */
>  	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
>  		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);

When use_read_lock is set to true, it appears the above condition will
become always true (since fault_granule is PAGE_SIZE and force_pte
is true in this case). So, I don't think the following "else" changes
really make any difference. (Or am I overlooking something?)

Looking at the code, I doubt that even the original (before the regression)
code detects the case that is described in the comment below in the
first place.

Thanks,
Reiji

> -	} else {
> +	} else if (!use_read_lock) {
>  		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
>  					     __pfn_to_phys(pfn), prot,
>  					     memcache);
> +
> +	/*
> +	 * The read lock is taken if the FSC indicates that the guest faulted on
> +	 * a PAGE_SIZE granule. It is possible that the stage-2 fault raced with
> +	 * a map operation that collapsed the faulted address into a larger
> +	 * block mapping.
> +	 *
> +	 * Since KVM splits mappings down to PAGE_SIZE when dirty logging is
> +	 * enabled, it is necessary to hold the write lock for faults where
> +	 * fault_granule > PAGE_SIZE. Retry the faulting instruction and acquire
> +	 * the write lock on the next exit.
> +	 */
> +	} else {
> +		ret = -EAGAIN;
>  	}
>
>  	/* Mark the page dirty only if the fault is handled successfully */
> @@ -1280,7 +1295,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	}
>
>  out_unlock:
> -	if (logging_perm_fault)
> +	if (use_read_lock)
>  		read_unlock(&kvm->mmu_lock);
>  	else
>  		write_unlock(&kvm->mmu_lock);
> --
> 2.35.1.1094.g7c7d902a7c-goog
>
Hi Reiji,

Thanks for the review!

On Thu, Mar 31, 2022 at 11:07:23PM -0700, Reiji Watanabe wrote:
> > @@ -1267,10 +1268,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	 */
> >  	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
> >  		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
>
> When use_read_lock is set to true, it appears the above condition will
> become always true (since fault_granule is PAGE_SIZE and force_pte
> is true in this case). So, I don't think the following "else" changes
> really make any difference. (Or am I overlooking something?)
>
> Looking at the code, I doubt that even the original (before the regression)
> code detects the case that is described in the comment below in the
> first place.

Yes, you are right. I liked the explicit check against !use_read_lock
(even if it is dead) to make it abundantly clear what is guarded by the
write lock.

Personally, I'm not a big fan of the conditional locking because it is
hard to tell exactly what is protected by the read or write lock. Maybe
instead a WARN() could suffice. That way we bark at anyone who changes
MMU locking again and breaks it. I found the bug with the splat above,
but warning about replacing an already present PTE is rather far removed
from the smoking gun in this situation.

Outside of the regression, I believe there are some subtle races where
stage-2 is modified before taking the MMU lock. I'm going to think
further about the implications of these, but perhaps we should pass
through a page level argument to kvm_pgtable_stage2_relax_perms() and
only do the operation if the paging structures match what is reported
in the FSC. If the IPA is not in fact mapped at the specified page
level, retry the faulting instruction.

I'll get a patch up that addresses your comment and crams a WARN() in
to assert the write lock is held.

--
Thanks,
Oliver
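As a rough sketch of the two ideas above (an illustration, not the patch that was ultimately applied), the map path could assert that the write lock is held, and a hypothetical level argument to kvm_pgtable_stage2_relax_perms() could bail out when the mapping no longer matches the reported fault granule:

	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
		/*
		 * Hypothetical: only relax permissions if the IPA is still
		 * mapped at the level reported in the FSC, otherwise return
		 * -EAGAIN to retry the faulting instruction. The upstream
		 * function takes only (pgt, addr, prot) as of this thread.
		 */
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot /*, level */);
	} else {
		/* Splitting or installing a mapping requires the write lock. */
		WARN_ON_ONCE(use_read_lock);
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot,
					     memcache);
	}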
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0d19259454d8..9384325bf3df 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1079,7 +1079,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
-	bool logging_perm_fault = false;
+	bool use_read_lock = false;
 	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
 	unsigned long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
@@ -1114,7 +1114,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (logging_active) {
 		force_pte = true;
 		vma_shift = PAGE_SHIFT;
-		logging_perm_fault = (fault_status == FSC_PERM && write_fault);
+		use_read_lock = (fault_status == FSC_PERM && write_fault &&
+				 fault_granule == PAGE_SIZE);
 	} else {
 		vma_shift = get_vma_page_shift(vma, hva);
 	}
@@ -1218,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * logging dirty logging, only acquire read lock for permission
 	 * relaxation.
 	 */
-	if (logging_perm_fault)
+	if (use_read_lock)
 		read_lock(&kvm->mmu_lock);
 	else
 		write_lock(&kvm->mmu_lock);
@@ -1267,10 +1268,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
-	} else {
+	} else if (!use_read_lock) {
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
 					     memcache);
+
+	/*
+	 * The read lock is taken if the FSC indicates that the guest faulted on
+	 * a PAGE_SIZE granule. It is possible that the stage-2 fault raced with
+	 * a map operation that collapsed the faulted address into a larger
+	 * block mapping.
+	 *
+	 * Since KVM splits mappings down to PAGE_SIZE when dirty logging is
+	 * enabled, it is necessary to hold the write lock for faults where
+	 * fault_granule > PAGE_SIZE. Retry the faulting instruction and acquire
+	 * the write lock on the next exit.
+	 */
+	} else {
+		ret = -EAGAIN;
 	}
 
 	/* Mark the page dirty only if the fault is handled successfully */
@@ -1280,7 +1295,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 out_unlock:
-	if (logging_perm_fault)
+	if (use_read_lock)
 		read_unlock(&kvm->mmu_lock);
 	else
 		write_unlock(&kvm->mmu_lock);
It is possible to take a stage-2 permission fault on a page larger than
PAGE_SIZE. For example, when running a guest backed by 2M HugeTLB, KVM
eagerly maps at the largest possible block size. When dirty logging is
enabled on a memslot, KVM does *not* eagerly split these 2M stage-2
mappings and instead clears the write bit on the pte.

Since dirty logging is always performed at PAGE_SIZE granularity, KVM
lazily splits these 2M block mappings down to PAGE_SIZE in the stage-2
fault handler. This operation must be done under the write lock. Since
commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission
relaxation during dirty logging"), the stage-2 fault handler
conditionally takes the read lock on permission faults with dirty
logging enabled. To that end, it is possible to split a 2M block mapping
while only holding the read lock.

The problem is demonstrated by running kvm_page_table_test with 2M
anonymous HugeTLB, which splats like so:

 WARNING: CPU: 5 PID: 15276 at arch/arm64/kvm/hyp/pgtable.c:153 stage2_map_walk_leaf+0x124/0x158

 [...]

 Call trace:
  stage2_map_walk_leaf+0x124/0x158
  stage2_map_walker+0x5c/0xf0
  __kvm_pgtable_walk+0x100/0x1d4
  __kvm_pgtable_walk+0x140/0x1d4
  __kvm_pgtable_walk+0x140/0x1d4
  kvm_pgtable_walk+0xa0/0xf8
  kvm_pgtable_stage2_map+0x15c/0x198
  user_mem_abort+0x56c/0x838
  kvm_handle_guest_abort+0x1fc/0x2a4
  handle_exit+0xa4/0x120
  kvm_arch_vcpu_ioctl_run+0x200/0x448
  kvm_vcpu_ioctl+0x588/0x664
  __arm64_sys_ioctl+0x9c/0xd4
  invoke_syscall+0x4c/0x144
  el0_svc_common+0xc4/0x190
  do_el0_svc+0x30/0x8c
  el0_svc+0x28/0xcc
  el0t_64_sync_handler+0x84/0xe4
  el0t_64_sync+0x1a4/0x1a8

Fix the issue by only acquiring the read lock if the guest faulted on a
PAGE_SIZE granule w/ dirty logging enabled. Since it is possible for the
faulting IPA to get collapsed into a larger block mapping until the read
lock is acquired, retry the faulting instruction any time that the fault
cannot be fixed by relaxing permissions. In so doing, the fault handler
will acquire the write lock for the subsequent fault on a mapping larger
than PAGE_SIZE and split the block safely behind the write lock.

Fixes: f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging")
Cc: Jing Zhang <jingzhangos@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---

Applies cleanly to kvmarm/fixes at the following commit:

8872d9b3e35a ("KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler")

Tested the patch by running KVM selftests. Additionally, I did 10
iterations of the kvm_page_table_test with 2M anon HugeTLB memory.

It is expected that this patch will cause fault serialization in the
pathological case where all vCPUs are faulting on the same granule of
memory, as every vCPU will attempt to acquire the write lock. The only
safe way to cure this contention is to dissolve pages eagerly outside of
the stage-2 fault handler (like x86).

 arch/arm64/kvm/mmu.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)
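Condensed, the locking decision the patch introduces in user_mem_abort() looks roughly like the following (identifiers taken from the diff above; an illustration rather than a verbatim excerpt):

	if (logging_active) {
		force_pte = true;
		vma_shift = PAGE_SHIFT;
		/* The read lock is only safe for PAGE_SIZE permission relaxation. */
		use_read_lock = (fault_status == FSC_PERM && write_fault &&
				 fault_granule == PAGE_SIZE);
	}

	if (use_read_lock)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);

	if (fault_status == FSC_PERM && vma_pagesize == fault_granule)
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
	else if (!use_read_lock)
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot, memcache);
	else
		/* Raced with a collapse into a block mapping; retry under the write lock. */
		ret = -EAGAIN;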