Message ID | 20220327205803.739336-1-mizhang@google.com (mailing list archive)
---|---
State | New, archived
Series | KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
On Sun, Mar 27, 2022, Mingwei Zhang wrote: > Add a lockdep check before invoking lookup_address_in_mm(). > lookup_address_in_mm() walks all levels of host page table without > accquiring any lock. This is usually unsafe unless we are walking the > kernel addresses (check other usage cases of lookup_address_in_mm and > lookup_address_in_pgd). > > Walking host page table (especially guest addresses) usually requires > holding two types of locks: 1) mmu_lock in mm or the lock that protects > the reverse maps of host memory in range; 2) lock for the leaf paging > structures. > > One exception case is when we take the mmu_lock of the secondary mmu. > Holding mmu_lock of KVM MMU in either read mode or write mode prevents host > level entities from modifying the host page table concurrently. This is > because all of them will have to invoke KVM mmu_notifier first before doing > the actual work. Since KVM mmu_notifier invalidation operations always take > the mmu write lock, we are safe if we hold the mmu lock here. > > Note: this means that KVM cannot allow concurrent multiple mmu_notifier > invalidation callbacks by using KVM mmu read lock. Since, otherwise, any > host level entity can cause race conditions with this one. Walking host > page table here may get us stale information or may trigger NULL ptr > dereference that is hard to reproduce. > > Having a lockdep check here will prevent or at least warn future > development that directly walks host page table simply in a KVM ioctl > function. In addition, it provides a record for any future development on > KVM mmu_notifier. > > Cc: Sean Christopherson <seanjc@google.com> > Cc: Ben Gardon <bgardon@google.com> > Cc: David Matlack <dmatlack@google.com> > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++++ > 1 file changed, 18 insertions(+) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 1361eb4599b4..066bb5435156 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -2820,6 +2820,24 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > */ > hva = __gfn_to_hva_memslot(slot, gfn); > > + /* > + * lookup_address_in_mm() walks all levels of host page table without > + * accquiring any lock. This is not safe when KVM does not take the > + * mmu_lock. Holding mmu_lock in either read mode or write mode prevents > + * host level entities from modifying the host page table. This is > + * because all of them will have to invoke KVM mmu_notifier first before > + * doing the actual work. Since KVM mmu_notifier invalidation operations > + * always take the mmu write lock, we are safe if we hold the mmu lock > + * here. > + * > + * Note: this means that KVM cannot allow concurrent multiple > + * mmu_notifier invalidation callbacks by using KVM mmu read lock. > + * Otherwise, any host level entity can cause race conditions with this > + * one. Walking host page table here may get us stale information or may > + * trigger NULL ptr dereference that is hard to reproduce. > + */ > + lockdep_assert_held(&kvm->mmu_lock); Holding mmu_lock isn't strictly required. It would also be safe to use this helper if mmu_notifier_retry_hva() were checked after grabbing the mapping level, before consuming it. E.g. we could theoretically move this to kvm_faultin_pfn(). And simply holding the lock isn't sufficient, i.e. the lockdep gives a false sense of security. E.g. 
calling this while holding mmu_lock but without first checking mmu_notifier_count would let it run concurrently with host PTE modifications. I'm definitely in favor of adding a comment to document the mmu_notifier interactions, but I don't like adding a lockdep.
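To make the alternative concrete, here is a rough sketch (a hypothetical helper, not code from the thread) of the mmu_notifier_seq/retry pattern Sean describes: snapshot the sequence count before the lockless walk, then recheck it under mmu_lock before the mapping level is consumed, and retry the fault if an invalidation ran in between. The function name and error handling are invented for illustration; host_pfn_mapping_level(), mmu_notifier_retry_hva() and the mmu_lock primitives are the KVM interfaces of this era.

static int mapping_level_with_retry(struct kvm *kvm,
				    const struct kvm_memory_slot *slot,
				    gfn_t gfn, kvm_pfn_t pfn, unsigned long hva)
{
	unsigned long mmu_seq = kvm->mmu_notifier_seq;
	int level;

	/* Pairs with the barriers in the mmu_notifier invalidation path. */
	smp_rmb();

	/* Lockless walk of the host page tables. */
	level = host_pfn_mapping_level(kvm, gfn, pfn, slot);

	write_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry_hva(kvm, mmu_seq, hva)) {
		write_unlock(&kvm->mmu_lock);
		return -EAGAIN;	/* an invalidation raced with the walk; redo the fault */
	}
	/* ... consume "level" (e.g. install the SPTE) while mmu_lock is held ... */
	write_unlock(&kvm->mmu_lock);

	return level;
}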
Hi Sean, On Mon, Mar 28, 2022 at 8:16 AM Sean Christopherson <seanjc@google.com> wrote: > > On Sun, Mar 27, 2022, Mingwei Zhang wrote: > > Add a lockdep check before invoking lookup_address_in_mm(). > > lookup_address_in_mm() walks all levels of host page table without > > accquiring any lock. This is usually unsafe unless we are walking the > > kernel addresses (check other usage cases of lookup_address_in_mm and > > lookup_address_in_pgd). > > > > Walking host page table (especially guest addresses) usually requires > > holding two types of locks: 1) mmu_lock in mm or the lock that protects > > the reverse maps of host memory in range; 2) lock for the leaf paging > > structures. > > > > One exception case is when we take the mmu_lock of the secondary mmu. > > Holding mmu_lock of KVM MMU in either read mode or write mode prevents host > > level entities from modifying the host page table concurrently. This is > > because all of them will have to invoke KVM mmu_notifier first before doing > > the actual work. Since KVM mmu_notifier invalidation operations always take > > the mmu write lock, we are safe if we hold the mmu lock here. > > > > Note: this means that KVM cannot allow concurrent multiple mmu_notifier > > invalidation callbacks by using KVM mmu read lock. Since, otherwise, any > > host level entity can cause race conditions with this one. Walking host > > page table here may get us stale information or may trigger NULL ptr > > dereference that is hard to reproduce. > > > > Having a lockdep check here will prevent or at least warn future > > development that directly walks host page table simply in a KVM ioctl > > function. In addition, it provides a record for any future development on > > KVM mmu_notifier. > > > > Cc: Sean Christopherson <seanjc@google.com> > > Cc: Ben Gardon <bgardon@google.com> > > Cc: David Matlack <dmatlack@google.com> > > > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > > --- > > arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++++ > > 1 file changed, 18 insertions(+) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 1361eb4599b4..066bb5435156 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -2820,6 +2820,24 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > */ > > hva = __gfn_to_hva_memslot(slot, gfn); > > > > + /* > > + * lookup_address_in_mm() walks all levels of host page table without > > + * accquiring any lock. This is not safe when KVM does not take the > > + * mmu_lock. Holding mmu_lock in either read mode or write mode prevents > > + * host level entities from modifying the host page table. This is > > + * because all of them will have to invoke KVM mmu_notifier first before > > + * doing the actual work. Since KVM mmu_notifier invalidation operations > > + * always take the mmu write lock, we are safe if we hold the mmu lock > > + * here. > > + * > > + * Note: this means that KVM cannot allow concurrent multiple > > + * mmu_notifier invalidation callbacks by using KVM mmu read lock. > > + * Otherwise, any host level entity can cause race conditions with this > > + * one. Walking host page table here may get us stale information or may > > + * trigger NULL ptr dereference that is hard to reproduce. > > + */ > > + lockdep_assert_held(&kvm->mmu_lock); > > Holding mmu_lock isn't strictly required. It would also be safe to use this helper > if mmu_notifier_retry_hva() were checked after grabbing the mapping level, before > consuming it. E.g. 
we could theoretically move this to kvm_faultin_pfn(). > > And simply holding the lock isn't sufficient, i.e. the lockdep gives a false sense > of security. E.g. calling this while holding mmu_lock but without first checking > mmu_notifier_count would let it run concurrently with host PTE modifications.

Right, even holding the kvm->mmu_lock is not safe, since we may have several concurrent invalidations ongoing that are done zapping entries in the EPT (so they could just release the kvm->mmu_lock) and start working on the host page table. If we want to make it safe, we also have to check mmu_notifier_count (and potentially mmu_seq as well). With that, I start to feel this is a bug. The issue is just so rare that it has never triggered a problem.

lookup_address_in_mm() walks the host page table as if it is a sequence of _static_ memory chunks. This is clearly dangerous. If we look at hva_to_pfn(), which is the right way to walk the host page table:

hva_to_pfn()
  => hva_to_pfn_fast()
     => get_user_page_fast_only()
        => internal_get_user_pages_fast()
           => lockless_pages_from_mm()
              => local_irq_save(flags); /* Disable interrupts here. */
                 gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
...
...
  hva_to_pfn_slow()
  => get_user_pages_unlocked()
     => mmap_read_lock(mm); /* taking the mm lock here. */

The above code has two branches to walk the host page table: 1) the fast one and 2) the slow one. The slow one takes the mm lock, while the fast one simply disables interrupts. I think we might have to mimic what lockless_pages_from_mm() does, i.e. wrap local_irq_{save,restore}(flags) around lookup_address_in_mm().

Alternatively, we have to specify that lookup_address_in_mm() as well as its callers, host_pfn_mapping_level() and kvm_mmu_max_mapping_level(), CANNOT be called in generic places in KVM, but only in the fault path and AFTER the check of "is_page_fault_stale()". But right now, kvm_mmu_max_mapping_level() is used in other places as well, e.g. kvm_mmu_zap_collapsible_spte(), which does not satisfy the strict requirement for walking the host page table.

> > I'm definitely in favor of adding a comment to document the mmu_notifier > interactions, but I don't like adding a lockdep.
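For illustration only, here is a minimal sketch of the wrapping proposed above, assuming — as the rest of the thread goes on to question — that disabling IRQs is by itself enough to keep the walked page-table pages from being freed (the property lockless_pages_from_mm() relies on). The helper name is hypothetical, and the missing READ_ONCE()s inside lookup_address_in_pgd(), discussed below, are not addressed by such a wrapper.

static int host_level_lockless(struct kvm *kvm, unsigned long hva)
{
	unsigned int level = PG_LEVEL_4K;
	unsigned long flags;
	pte_t *pte;

	/*
	 * Disabling IRQs blocks the IPI (or extends the grace period) that
	 * the page-table freeing path waits for, which is what makes the
	 * GUP-fast walk in lockless_pages_from_mm() safe.
	 */
	local_irq_save(flags);
	pte = lookup_address_in_mm(kvm->mm, hva, &level);
	if (!pte)
		level = PG_LEVEL_4K;
	local_irq_restore(flags);

	return level;
}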
On Mon, Mar 28, 2022, Mingwei Zhang wrote: > With that, I start to feel this is a bug. The issue is just so rare > that it has never triggered a problem. > > lookup_address_in_mm() walks the host page table as if it is a > sequence of _static_ memory chunks. This is clearly dangerous.

Yeah, it's broken. The proper fix is to do something like what perf uses, or maybe just genericize and reuse the code from commit 8af26be06272 ("perf/core: Fix arch_perf_get_page_size()").

> But right now, kvm_mmu_max_mapping_level() are used in other places > as well: kvm_mmu_zap_collapsible_spte(), which does not satisfy the > strict requirement of walking the host page table.

The host pfn size is used only as a heuristic, so false positives/negatives are ok; the only race that needs to be avoided is dereferencing freed page table memory. lookup_address_in_pgd() is really broken because it doesn't even ensure a given PxE is read with READ_ONCE(). I suppose one could argue the caller is broken, but I doubt KVM is the only user that doesn't provide the necessary protections.
On Mon, Mar 28, 2022 at 11:15 AM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Mar 28, 2022, Mingwei Zhang wrote: > > With that, I start to feel this is a bug. The issue is just so rare > > that it has never triggered a problem. > > > > lookup_address_in_mm() walks the host page table as if it is a > > sequence of _static_ memory chunks. This is clearly dangerous. > > Yeah, it's broken. The proper fix is do something like what perf uses, or maybe > just genericize and reuse the code from commit 8af26be06272 > ("perf/core: Fix arch_perf_get_page_size()).

Hmm, I am thinking about this. We clearly need an adaptor layer if we choose to use this function, e.g., the size -> level conversion and whether or not to disable IRQs. Alternatively, I am wondering if we can just modify lookup_address_in_mm() to make the code compatible with a "lockless" walk? On top of that, kvm_mmu_max_mapping_level() is used in two places: 1) EPT violation and 2) disabling dirty logging. The former does not require disabling/enabling IRQs since it is safe. So maybe add a parameter to this function and plumb it through to host_pfn_mapping_level()?

> > > But right now, kvm_mmu_max_mapping_level() are used in other places > > as well: kvm_mmu_zap_collapsible_spte(), which does not satisfy the > > strict requirement of walking the host page table. > > The host pfn size is used only as a hueristic, so false postives/negatives are > ok, the only race that needs to be avoided is dereferencing freed page table > memory. lookup_address_in_pgd() is really broken because it doesn't even ensure > a given PxE is READ_ONCE(). I suppose one could argue the caller is broken, but > I doubt KVM is the only user that doesn't provide the necessary protections.

Right, since lookup_address_in_pgd() is so broken, I am thinking about just fixing it in place instead of switching to a different function.
On 3/28/22 20:15, Sean Christopherson wrote: >> lookup_address_in_mm() walks the host page table as if it is a >> sequence of _static_ memory chunks. This is clearly dangerous. > Yeah, it's broken. The proper fix is do something like what perf uses, or maybe > just genericize and reuse the code from commit 8af26be06272 > ("perf/core: Fix arch_perf_get_page_size()).

Indeed, KVM could use perf_get_pgtable_size(). The conversion from the result of *_leaf_size() to level is basically (ctz(size) - 12) / 9.

Alternatively, there are three differences between perf_get_page_size() and lookup_address_in_pgd():

* the *_offset_lockless() macros, which are unnecessary on x86
* READ_ONCE, which is important but in practice unlikely to make a difference
* local_irq_{save,restore} around the walk

The last is the important one and it should be added to lookup_address_in_pgd().

Paolo
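To make the conversion concrete, here is a hypothetical helper (not in the thread or in the kernel) mapping a page size in bytes, as a perf_get_page_size()-style walker returns it, onto KVM's PG_LEVEL_* values. The formula above counts levels from zero, while the PG_LEVEL_* enum starts at PG_LEVEL_4K == 1, hence the +1.

static int page_size_to_level(u64 size)
{
	if (!size)
		return PG_LEVEL_NONE;	/* the walker failed or found no mapping */

	/* 4K -> PG_LEVEL_4K, 2M -> PG_LEVEL_2M, 1G -> PG_LEVEL_1G */
	return (__ffs(size) - PAGE_SHIFT) / 9 + 1;
}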
On Tue, Apr 26, 2022, Paolo Bonzini wrote: > On 3/28/22 20:15, Sean Christopherson wrote: > > > lookup_address_in_mm() walks the host page table as if it is a > > > sequence of _static_ memory chunks. This is clearly dangerous. > > Yeah, it's broken. The proper fix is do something like what perf uses, or maybe > > just genericize and reuse the code from commit 8af26be06272 > > ("perf/core: Fix arch_perf_get_page_size()). > > > > Indeed, KVM could use perf_get_pgtable_size(). The conversion from the > result of *_leaf_size() to level is basically (ctz(size) - 12) / 9. > > Alternatively, there are the three difference between perf_get_page_size() > and lookup_address_in_pgd(): > > * the *_offset_lockless() macros, which are unnecessary on x86 > > * READ_ONCE, which is important but in practice unlikely to make a > difference

It can make a difference for this specific case. I can't find the bug/patch, but a year or two back there was a bug in a similar mm/ path where lack of READ_ONCE() led to dereferencing garbage due to re-reading an upper level entry. IIRC, it was a page promotion (to huge page) case, where the p*d_large() check came back false (saw the old value) and then p*d_offset() walked into the weeds because it used the new value (huge page offset).

> * local_irq_{save,restore} around the walk > > > The last is the important one and it should be added to > lookup_address_in_pgd().

I don't think so. The issue is that, similar to adding a lockdep here, simply disabling IRQs is not sufficient to ensure the resolved pfn is valid. And again, like this case, disabling IRQs is not actually required when sufficient protections are in place, e.g. in KVM's page fault case, the mmu_notifier invalidate_start event must occur before the primary MMU modifies its PTEs. In other words, disabling IRQs is both unnecessary and gives a false sense of security.

I completely agree that lookup_address() and friends are unnecessarily fragile, but I think that attempting to harden them to fix this KVM bug will open a can of worms and end up delaying getting KVM fixed.
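The torn walk recalled above is exactly what a single READ_ONCE() snapshot per level prevents. Below is a deliberately simplified sketch of one level of such a lockless walk (in the spirit of the GUP-fast and perf walkers, not the proposed KVM fix): the PMD is read once, and both the huge-page check and the next-level offset use that one snapshot, so a concurrent promotion to a huge page cannot send the offset calculation "into the weeds".

static int walk_pmd_lockless(pmd_t *pmdp, unsigned long addr)
{
	pmd_t pmd = READ_ONCE(*pmdp);	/* one snapshot, used for every check below */
	pte_t *ptep;
	pte_t pte;

	if (pmd_none(pmd) || !pmd_present(pmd))
		return PG_LEVEL_NONE;

	if (pmd_large(pmd))		/* 2M mapping, the walk stops here */
		return PG_LEVEL_2M;

	/* Derive the PTE pointer from the snapshot, never by re-reading *pmdp. */
	ptep = pte_offset_kernel(&pmd, addr);
	pte = READ_ONCE(*ptep);

	return pte_present(pte) ? PG_LEVEL_4K : PG_LEVEL_NONE;
}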
> I completely agree that lookup_address() and friends are unnecessarily fragile, > but I think that attempting to harden them to fix this KVM bug will open a can > of worms and end up delaying getting KVM fixed.

So basically, we need to:

- choose perf_get_page_size() instead of using any of the lookup_address*() in mm.
- add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size -> level translation.

Agree?
On Tue, Apr 26, 2022, Mingwei Zhang wrote: > > I completely agree that lookup_address() and friends are unnecessarily fragile, > > but I think that attempting to harden them to fix this KVM bug will open a can > > of worms and end up delaying getting KVM fixed. > > So basically, we need to: > - choose perf_get_page_size() instead of using any of the > lookup_address*() in mm. > - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size > -> level translation. > > Agree?

Drat, I didn't see that it returns the page size, not the level. That's a bit unfortunate. It definitely makes me less averse to fixing lookup_address_in_pgd().

Hrm. I guess since we know there's at least one broken user, and in theory fixing lookup_address_in_pgd() should do no harm to users that don't need protection, it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers push back.
On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Apr 26, 2022, Mingwei Zhang wrote: > > > I completely agree that lookup_address() and friends are unnecessarily fragile, > > > but I think that attempting to harden them to fix this KVM bug will open a can > > > of worms and end up delaying getting KVM fixed. > > > > So basically, we need to: > > - choose perf_get_page_size() instead of using any of the > > lookup_address*() in mm. > > - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size > > -> level translation. > > > > Agree? > > Drat, I didn't see that it returns the page size, not the level. That's a bit > unfortunate. It definitely makes me less averse to fixing lookup_address_in_pgd() > > Hrm. I guess since we know there's at least one broken user, and in theory > fixing lookup_address_in_pgd() should do no harm to users that don't need protection, > it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers > push back.

Yeah, fixing lookup_address_in_pgd() should be cleaner, since the page fault usage case does not need irq save/restore, but the other one needs it. So, we can easily fix the function itself with READ_ONCE and the lockless stuff, and do the irq save/restore wrapping on the KVM side.

OK, I think I can proceed and upload one version for that. Thanks.

-Mingwei
On Tue, Apr 26, 2022, Mingwei Zhang wrote: > On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Tue, Apr 26, 2022, Mingwei Zhang wrote: > > > > I completely agree that lookup_address() and friends are unnecessarily fragile, > > > > but I think that attempting to harden them to fix this KVM bug will open a can > > > > of worms and end up delaying getting KVM fixed. > > > > > > So basically, we need to: > > > - choose perf_get_page_size() instead of using any of the > > > lookup_address*() in mm. > > > - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size > > > -> level translation. > > > > > > Agree? > > > > Drat, I didn't see that it returns the page size, not the level. That's a bit > > unfortunate. It definitely makes me less averse to fixing lookup_address_in_pgd() > > > > Hrm. I guess since we know there's at least one broken user, and in theory > > fixing lookup_address_in_pgd() should do no harm to users that don't need protection, > > it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers > > push back. > > Yeah, fixing lookup_address_in_pgd() should be cleaner(), since the > page fault usage case does not need irq save/restore. But the other > one needs it. So, we can easily fix the function with READ_ONCE and > lockless staff. But wrapping the function with irq save/restore from > the KVM side.

I think it makes sense to do the save/restore in lookup_address_in_pgd(). Those helpers are exported, so odds are good there are broken users that will benefit from fixing all paths.
On Tue, Apr 26, 2022 at 6:30 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Apr 26, 2022, Mingwei Zhang wrote: > > On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Tue, Apr 26, 2022, Mingwei Zhang wrote: > > > > > I completely agree that lookup_address() and friends are unnecessarily fragile, > > > > > but I think that attempting to harden them to fix this KVM bug will open a can > > > > > of worms and end up delaying getting KVM fixed. > > > > > > > > So basically, we need to: > > > > - choose perf_get_page_size() instead of using any of the > > > > lookup_address*() in mm. > > > > - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size > > > > -> level translation. > > > > > > > > Agree? > > > > > > Drat, I didn't see that it returns the page size, not the level. That's a bit > > > unfortunate. It definitely makes me less averse to fixing lookup_address_in_pgd() > > > > > > Hrm. I guess since we know there's at least one broken user, and in theory > > > fixing lookup_address_in_pgd() should do no harm to users that don't need protection, > > > it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers > > > push back. > > > > Yeah, fixing lookup_address_in_pgd() should be cleaner(), since the > > page fault usage case does not need irq save/restore. But the other > > one needs it. So, we can easily fix the function with READ_ONCE and > > lockless staff. But wrapping the function with irq save/restore from > > the KVM side. > > I think it makes sense to do the save/restore in lookup_address_in_pgd(). The > Those helpers are exported, so odds are good there are broken users that will > benefit from fixing all paths.

No, lookup_address_in_pgd() is probably just broken for KVM. Some of its other call sites may already disable IRQs, so doing that again inside lookup_address_in_pgd() would be bad. I am looking here: https://elixir.bootlin.com/linux/latest/source/arch/arm/kernel/traps.c#L304 So, the save/restore is done in oops_begin() and oops_end(), which wrap show_fault_oops(), which calls lookup_address_in_pgd(). So, I think we need to ensure the READ_ONCE.

Hmm, regarding the lockless macros, Paolo is right that for x86 they make no difference. s390 seems to have a different implementation, but kvm_mmu_max_mapping_level() as well as host_pfn_mapping_level() are both x86 MMU functions.

Thanks.

-Mingwei
On Tue, Apr 26, 2022, Mingwei Zhang wrote: > On Tue, Apr 26, 2022 at 6:30 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Tue, Apr 26, 2022, Mingwei Zhang wrote: > > > On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Tue, Apr 26, 2022, Mingwei Zhang wrote: > > > > > > I completely agree that lookup_address() and friends are unnecessarily fragile, > > > > > > but I think that attempting to harden them to fix this KVM bug will open a can > > > > > > of worms and end up delaying getting KVM fixed. > > > > > > > > > > So basically, we need to: > > > > > - choose perf_get_page_size() instead of using any of the > > > > > lookup_address*() in mm. > > > > > - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size > > > > > -> level translation. > > > > > > > > > > Agree? > > > > > > > > Drat, I didn't see that it returns the page size, not the level. That's a bit > > > > unfortunate. It definitely makes me less averse to fixing lookup_address_in_pgd() > > > > > > > > Hrm. I guess since we know there's at least one broken user, and in theory > > > > fixing lookup_address_in_pgd() should do no harm to users that don't need protection, > > > > it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers > > > > push back. > > > > > > Yeah, fixing lookup_address_in_pgd() should be cleaner(), since the > > > page fault usage case does not need irq save/restore. But the other > > > one needs it. So, we can easily fix the function with READ_ONCE and > > > lockless staff. But wrapping the function with irq save/restore from > > > the KVM side. > > > > I think it makes sense to do the save/restore in lookup_address_in_pgd(). The > > Those helpers are exported, so odds are good there are broken users that will > > benefit from fixing all paths. > > no, lookup_address_in_pgd() is probably just broken for KVM. In other > call sites, some may already disable IRQ, so doing that again inside > lookup_address_in_pgd() will be bad.

No, it's not bad. local_irq_save/restore() are intended precisely for cases where IRQs need to be disabled but IRQs may or may not have already been disabled by the caller. PUSHF+POPF is not expensive, relatively speaking.

> I am looking at here: > https://elixir.bootlin.com/linux/latest/source/arch/arm/kernel/traps.c#L304

That's arm code; lookup_address_in_pgd() is x86 specific. :-) That said, I'm sure there exists at least one caller that runs with IRQs disabled. But as above, it's not a problem.

> so, the save/restore are done in oops_begin() and oops_end(), which is > wrapping show_fault_oops() that calls lookup_address_in_pgd(). > > So, I think we need to ensure the READ_ONCE. > > hmm, regarding the lockless macros, Paolo is right, for x86 it makes > no difference. s390 seems to have a different implementation, but > kvm_mmu_max_mapping_level() as well as host_pfn_mapping_level are both > functions in x86 mmu.

Yep, all of this is x86 specific.
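As a small illustration of why nesting is not a problem, here is a hypothetical wrapper (not from the thread) that does the save/restore around the existing x86 helper; it deliberately ignores the separate question, raised earlier in the thread, of whether the returned PTE pointer can be safely consumed once IRQs are re-enabled.

static pte_t *lookup_address_irqsafe(struct mm_struct *mm, unsigned long addr,
				     unsigned int *level)
{
	unsigned long flags;
	pte_t *pte;

	local_irq_save(flags);		/* records whether IRQs were already off */
	pte = lookup_address_in_mm(mm, addr, level);
	local_irq_restore(flags);	/* restores the recorded state, not "IRQs on" */

	return pte;
}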
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1361eb4599b4..066bb5435156 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2820,6 +2820,24 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	 */
 	hva = __gfn_to_hva_memslot(slot, gfn);
 
+	/*
+	 * lookup_address_in_mm() walks all levels of host page table without
+	 * acquiring any lock. This is not safe when KVM does not take the
+	 * mmu_lock. Holding mmu_lock in either read mode or write mode prevents
+	 * host level entities from modifying the host page table. This is
+	 * because all of them will have to invoke KVM mmu_notifier first before
+	 * doing the actual work. Since KVM mmu_notifier invalidation operations
+	 * always take the mmu write lock, we are safe if we hold the mmu lock
+	 * here.
+	 *
+	 * Note: this means that KVM cannot allow concurrent multiple
+	 * mmu_notifier invalidation callbacks by using KVM mmu read lock.
+	 * Otherwise, any host level entity can cause race conditions with this
+	 * one. Walking host page table here may get us stale information or may
+	 * trigger NULL ptr dereference that is hard to reproduce.
+	 */
+	lockdep_assert_held(&kvm->mmu_lock);
+
 	pte = lookup_address_in_mm(kvm->mm, hva, &level);
 	if (unlikely(!pte))
 		return PG_LEVEL_4K;
Add a lockdep check before invoking lookup_address_in_mm(). lookup_address_in_mm() walks all levels of host page table without acquiring any lock. This is usually unsafe unless we are walking the kernel addresses (check other usage cases of lookup_address_in_mm and lookup_address_in_pgd).

Walking host page table (especially guest addresses) usually requires holding two types of locks: 1) mmu_lock in mm or the lock that protects the reverse maps of host memory in range; 2) lock for the leaf paging structures.

One exception case is when we take the mmu_lock of the secondary mmu. Holding mmu_lock of KVM MMU in either read mode or write mode prevents host level entities from modifying the host page table concurrently. This is because all of them will have to invoke KVM mmu_notifier first before doing the actual work. Since KVM mmu_notifier invalidation operations always take the mmu write lock, we are safe if we hold the mmu lock here.

Note: this means that KVM cannot allow concurrent multiple mmu_notifier invalidation callbacks by using KVM mmu read lock. Otherwise, any host level entity can cause race conditions with this one. Walking host page table here may get us stale information or may trigger NULL ptr dereference that is hard to reproduce.

Having a lockdep check here will prevent or at least warn future development that directly walks host page table simply in a KVM ioctl function. In addition, it provides a record for any future development on KVM mmu_notifier.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)