[1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled

Message ID 1461758686-27157-1-git-send-email-aarcange@redhat.com (mailing list archive)
State New, archived

Commit Message

Andrea Arcangeli April 27, 2016, 12:04 p.m. UTC
After the THP refcounting change, obtaining a compound page from
get_user_pages() no longer allows us to assume the entire compound
page is immediately mappable from a secondary MMU.

A secondary MMU doesn't want to call get_user_pages() more than once
for each compound page, in order to know if it can map the whole
compound page. So a secondary MMU needs to know from a single
get_user_pages() invocation when it can map immediately the entire
compound page to avoid a flood of unnecessary secondary MMU faults and
spurious atomic_inc()/atomic_dec() (pages don't have to be pinned by
MMU notifier users).

Ideally, instead of the page->_mapcount < 0 check, get_user_pages()
should return the granularity of the "page" mapping in the "mm" passed
to get_user_pages(). However, it's a non-trivial change to pass the
"pmd" status belonging to the "mm" walked by get_user_pages() up the
stack (up to the caller of get_user_pages()). So the fix just checks
that there is not a single pte mapping on the page returned by
get_user_pages(), in which case the caller can assume that the whole
compound page is mapped in the current "mm" (in a pmd_trans_huge()).
In that case the entire compound page is safe to map into the
secondary MMU without additional get_user_pages() calls on the
surrounding tail/head pages. In addition to being faster, not having
to run other get_user_pages() calls also reduces the memory footprint
of the secondary MMU fault in case the pmd split happened as a result
of memory pressure.
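
As a rough illustration of the check described above, here is a
userspace sketch in C. It is only a model under stated assumptions:
"struct model_page" and every model_* name are hypothetical stand-ins
for the kernel's struct page and helpers, and the pmd split is reduced
to "each subpage gains one pte mapping".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct model_page {
	bool compound;       /* models PageTransCompound() being true */
	atomic_int mapcount; /* models page->_mapcount, -1 == no pte map */
};

static inline void model_page_init(struct model_page *page, bool compound)
{
	page->compound = compound;
	atomic_init(&page->mapcount, -1);
}

/*
 * Models the helper added by the patch: true only if the page belongs
 * to a compound page and has no pte mapping of its own.
 */
static inline bool model_trans_compound_map(struct model_page *page)
{
	return page->compound && atomic_load(&page->mapcount) < 0;
}

/* Simplified model of a pmd split: every subpage gets a pte mapping. */
static inline void model_split_pmd(struct model_page *pages, int nr)
{
	for (int i = 0; i < nr; i++)
		atomic_fetch_add(&pages[i].mapcount, 1);
}
```

In this model the helper answers "yes" before model_split_pmd() runs
and "no" afterwards, which is the distinction a secondary MMU needs
before mapping at hugepage granularity.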

Without this fix, after a MADV_DONTNEED (as invoked by QEMU during
postcopy live migration or ballooning) or after generic swapping (with
a failure in split_huge_page() that would only result in pmd splitting
and not a physical page split), KVM would map the whole compound page
into the shadow pagetables, even though regular faults or userfaults
(like UFFDIO_COPY) may map regular pages into the primary MMU as a
result of the pte faults. This leads to guest mode and userland mode
going out of sync and not working on the same memory at all times.
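
The primary MMU side of the MADV_DONTNEED case can be sketched from
userspace as follows. This is a hedged illustration only: it involves
no KVM and no secondary MMU, the 2MB huge page size is an assumption
(x86-64), dontneed_demo() is a hypothetical name, and whether the
range is really backed by a THP depends on the system configuration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20) /* assumed 2MB huge page size */

/*
 * Fill a 2MB-aligned anonymous range, zap its first base page with
 * MADV_DONTNEED, then read back one byte from the zapped page and one
 * from the page after it. Returns 0 on success, -1 on failure.
 */
static int dontneed_demo(unsigned char *zapped, unsigned char *neighbour)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = 2 * HPAGE_SIZE; /* room for an aligned 2MB window */
	unsigned char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	unsigned char *huge = (unsigned char *)
		(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

#ifdef MADV_HUGEPAGE
	/* Best-effort hint; it may fail if THP is not configured. */
	(void)madvise(huge, HPAGE_SIZE, MADV_HUGEPAGE);
#endif
	memset(huge, 0xaa, HPAGE_SIZE);

	/* If the range was pmd-mapped, this forces a pmd split. */
	if (madvise(huge, page, MADV_DONTNEED) != 0) {
		munmap(raw, len);
		return -1;
	}

	*zapped = huge[0];       /* pte fault maps a fresh zero page */
	*neighbour = huge[page]; /* the rest keeps its old contents */
	munmap(raw, len);
	return 0;
}
```

After the call the zapped page reads back as zero while its neighbour
still holds 0xaa, so the 2MB range is no longer backed by one
uniformly mapped compound page from the point of view of this "mm".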

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/arm/kvm/mmu.c         |  2 +-
 arch/x86/kvm/mmu.c         |  4 ++--
 include/linux/page-flags.h | 22 ++++++++++++++++++++++
 3 files changed, 25 insertions(+), 3 deletions(-)

Comments

Kirill A. Shutemov April 27, 2016, 1:50 p.m. UTC | #1
On Wed, Apr 27, 2016 at 02:04:46PM +0200, Andrea Arcangeli wrote:
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 58dbd5c..d6d4191 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1004,7 +1004,7 @@ static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
>  	kvm_pfn_t pfn = *pfnp;
>  	gfn_t gfn = *ipap >> PAGE_SHIFT;
>  
> -	if (PageTransCompound(pfn_to_page(pfn))) {
> +	if (PageTransCompoundMap(pfn_to_page(pfn))) {
>  		unsigned long mask;
>  		/*
>  		 * The address we faulted on is backed by a transparent huge
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 1ff4dbb..b6f50e8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2823,7 +2823,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  	 */
>  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>  	    level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompound(pfn_to_page(pfn)) &&
> +	    PageTransCompoundMap(pfn_to_page(pfn)) &&
>  	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>  		unsigned long mask;
>  		/*
> @@ -4785,7 +4785,7 @@ restart:
>  		 */
>  		if (sp->role.direct &&
>  			!kvm_is_reserved_pfn(pfn) &&
> -			PageTransCompound(pfn_to_page(pfn))) {
> +			PageTransCompoundMap(pfn_to_page(pfn))) {
>  			drop_spte(kvm, sptep);
>  			need_tlb_flush = 1;
>  			goto restart;
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f4ed4f1b..6b052aa 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -517,6 +517,27 @@ static inline int PageTransCompound(struct page *page)
>  }
>  
>  /*
> + * PageTransCompoundMap is the same as PageTransCompound, but it also
> + * guarantees the primary MMU has the entire compound page mapped
> + * through pmd_trans_huge, which in turn guarantees the secondary MMUs
> + * can also map the entire compound page. This allows the secondary
> + * MMUs to call get_user_pages() only once for each compound page and
> + * to immediately map the entire compound page with a single secondary
> + * MMU fault. If there will be a pmd split later, the secondary MMUs
> + * will get an update through the MMU notifier invalidation through
> + * split_huge_pmd().
> + *
> + * Unlike PageTransCompound, this is safe to be called only while
> + * split_huge_pmd() cannot run from under us, like if protected by the
> + * MMU notifier, otherwise it may result in page->_mapcount < 0 false
> + * positives.
> + */

I know nothing about kvm. How do you protect against pmd splitting between
get_user_pages() and the check?

And the helper looks highly kvm-specific, doesn't it?

> +static inline int PageTransCompoundMap(struct page *page)
> +{
> +	return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0;
> +}
> +
> +/*
>   * PageTransTail returns true for both transparent huge pages
>   * and hugetlbfs pages, so it should only be called when it's known
>   * that hugetlbfs pages aren't involved.
> @@ -559,6 +580,7 @@ static inline int TestClearPageDoubleMap(struct page *page)
>  #else
>  TESTPAGEFLAG_FALSE(TransHuge)
>  TESTPAGEFLAG_FALSE(TransCompound)
> +TESTPAGEFLAG_FALSE(TransCompoundMap)
>  TESTPAGEFLAG_FALSE(TransTail)
>  TESTPAGEFLAG_FALSE(DoubleMap)
>  	TESTSETFLAG_FALSE(DoubleMap)
Andrea Arcangeli April 27, 2016, 2:59 p.m. UTC | #2
On Wed, Apr 27, 2016 at 04:50:30PM +0300, Kirill A. Shutemov wrote:
> I know nothing about kvm. How do you protect against pmd splitting between
> get_user_pages() and the check?

get_user_pages_fast() runs fully lockless and unpins the page right
away (we need a get_user_pages_fast without the FOLL_GET in fact to
avoid a totally useless atomic_inc/dec!).

Then we take a lock that is also taken by
mmu_notifier_invalidate_range_start. This way __split_huge_pmd will
block in mmu_notifier_invalidate_range_start if it tries to run again
(every other mmu notifier like mmu_notifier_invalidate_page will also
block).

Then, once we are serialized against __split_huge_pmd through the KVM
internal MMU notifier locking, we can tell whether any mmu_notifier
invalidate happened in the region between the point just before
get_user_pages_fast() was invoked and the point where we call
PageTransCompoundMap and actually map the shadow pagetable onto the
compound page with hugepage granularity (to allow real 2MB TLBs if
the guest also uses trans_huge_pmd in the guest pagetables).

After the shadow pagetable is mapped, we drop the internal MMU
notifier lock, and the mmu_notifier_invalidate_range_start called by
__split_huge_pmd can continue and drop the shadow pagetable that we
mapped just before releasing the lock.

To be able to tell if any invalidate happened between the start of
get_user_pages_fast and the point where we grab the lock again and
start mapping the shadow pagetable, we use:

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();

	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
	    ^^^^^^^^^^^^ this is get_user_pages and does put_page on the page
	    		 and just returns the &pfn
	    		 this is why we need a get_user_pages_fast that won't
			 attempt to touch the page->_count at all! we can avoid
			 2 atomic ops for each secondary MMU fault that way
		return 0;

	spin_lock(&vcpu->kvm->mmu_lock);
	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
		goto out_unlock;
	... here we check PageTransCompoundMap(pfn_to_page(pfn)) and
	map a 4k or 2MB shadow pagetable on "pfn" ...


Note mmu_notifier_retry does the other side of the smp_rmb():

	smp_rmb();
	if (kvm->mmu_notifier_seq != mmu_seq)
		return 1;
	return 0;
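
The sequence check above can be modelled in userspace roughly like
this. It is a sketch under stated assumptions, not the KVM code:
"struct model_kvm" and the model_* and lookup_* names are
hypothetical, a C11 atomic_flag spinlock stands in for mmu_lock, and
only the sequence counter shown in the snippet is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_kvm {
	atomic_flag mmu_lock;      /* models kvm->mmu_lock (spinlock) */
	atomic_ulong notifier_seq; /* models kvm->mmu_notifier_seq */
};

static void model_kvm_init(struct model_kvm *kvm)
{
	atomic_flag_clear(&kvm->mmu_lock);
	atomic_init(&kvm->notifier_seq, 0);
}

static void model_lock(struct model_kvm *kvm)
{
	while (atomic_flag_test_and_set_explicit(&kvm->mmu_lock,
						 memory_order_acquire))
		; /* spin until the holder releases the lock */
}

static void model_unlock(struct model_kvm *kvm)
{
	atomic_flag_clear_explicit(&kvm->mmu_lock, memory_order_release);
}

/* Invalidate side: bump the sequence while holding the lock. */
static void model_invalidate(struct model_kvm *kvm)
{
	model_lock(kvm);
	atomic_fetch_add(&kvm->notifier_seq, 1);
	model_unlock(kvm);
}

/* Models the lockless lookup done before taking the lock. */
typedef void (*lookup_fn)(struct model_kvm *kvm);

/* Fault side: returns true if mapping is safe, false to retry. */
static bool model_fault(struct model_kvm *kvm, lookup_fn lookup)
{
	unsigned long seq = atomic_load_explicit(&kvm->notifier_seq,
						 memory_order_acquire);
	bool mapped = false;

	lookup(kvm); /* lock not held: an invalidate may race here */

	model_lock(kvm);
	if (atomic_load_explicit(&kvm->notifier_seq,
				 memory_order_acquire) == seq)
		mapped = true; /* page state can be trusted here */
	model_unlock(kvm);
	return mapped;
}

static void lookup_quiet(struct model_kvm *kvm) { (void)kvm; }
static void lookup_raced(struct model_kvm *kvm) { model_invalidate(kvm); }
```

With lookup_quiet the fault proceeds to map; with lookup_raced, which
simulates an invalidate arriving during the lockless lookup, the
sequence mismatch makes the fault bail out so it can be retried.
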
Kirill A. Shutemov April 27, 2016, 3:18 p.m. UTC | #3
On Wed, Apr 27, 2016 at 04:59:57PM +0200, Andrea Arcangeli wrote:

Okay, I see.

But do we really want to make PageTransCompoundMap() visible beyond KVM
code? It looks too KVM-specific.
Andrea Arcangeli April 27, 2016, 3:57 p.m. UTC | #4
On Wed, Apr 27, 2016 at 06:18:34PM +0300, Kirill A. Shutemov wrote:
> Okay, I see.
> 
> But do we really want to make PageTransCompoundMap() visible beyond KVM
> code? It looks too KVM-specific.

Any other secondary MMU notifier manager (KVM is just one of the many
MMU notifier users) will need the same information if it doesn't want
to run a flood of get_user_pages_fast and it can support multiple
granularity in the secondary MMU mappings, so I think it is justified
to be exposed not just to KVM.

The other option would be to move transparent_hugepage_adjust to
mm/huge_memory.c but that currently has all kinds of KVM data
structures in it, so it's definitely not cut-and-paste work, so I
couldn't do a fix as cleaner as this one for 4.6.
Andrea Arcangeli April 27, 2016, 4:03 p.m. UTC | #5
On Wed, Apr 27, 2016 at 05:57:30PM +0200, Andrea Arcangeli wrote:
> couldn't do a fix as cleaner as this one for 4.6.

ehm "cleaner than"

If you have suggestions for a better name than PageTransCompoundMap I
can respin a new patch though, I considered "CanMap" but I opted for
the short version.

Also I'm not really sure moving transparent_hugepage_adjust will make
much sense. I mentioned it because Andres in another thread said it
was suggested, but the only real common code knowledge is about
PageTransCompoundMap; all the !mmu_gfn_lpage_is_disallowed checks
for dirty logging at 4k shadow granularity are KVM internal.

Patch

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 58dbd5c..d6d4191 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1004,7 +1004,7 @@  static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
 	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *ipap >> PAGE_SHIFT;
 
-	if (PageTransCompound(pfn_to_page(pfn))) {
+	if (PageTransCompoundMap(pfn_to_page(pfn))) {
 		unsigned long mask;
 		/*
 		 * The address we faulted on is backed by a transparent huge
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1ff4dbb..b6f50e8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2823,7 +2823,7 @@  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
 	    level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompound(pfn_to_page(pfn)) &&
+	    PageTransCompoundMap(pfn_to_page(pfn)) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
 		/*
@@ -4785,7 +4785,7 @@  restart:
 		 */
 		if (sp->role.direct &&
 			!kvm_is_reserved_pfn(pfn) &&
-			PageTransCompound(pfn_to_page(pfn))) {
+			PageTransCompoundMap(pfn_to_page(pfn))) {
 			drop_spte(kvm, sptep);
 			need_tlb_flush = 1;
 			goto restart;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b..6b052aa 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -517,6 +517,27 @@  static inline int PageTransCompound(struct page *page)
 }
 
 /*
+ * PageTransCompoundMap is the same as PageTransCompound, but it also
+ * guarantees the primary MMU has the entire compound page mapped
+ * through pmd_trans_huge, which in turn guarantees the secondary MMUs
+ * can also map the entire compound page. This allows the secondary
+ * MMUs to call get_user_pages() only once for each compound page and
+ * to immediately map the entire compound page with a single secondary
+ * MMU fault. If there will be a pmd split later, the secondary MMUs
+ * will get an update through the MMU notifier invalidation through
+ * split_huge_pmd().
+ *
+ * Unlike PageTransCompound, this is safe to be called only while
+ * split_huge_pmd() cannot run from under us, like if protected by the
+ * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * positives.
+ */
+static inline int PageTransCompoundMap(struct page *page)
+{
+	return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0;
+}
+
+/*
  * PageTransTail returns true for both transparent huge pages
  * and hugetlbfs pages, so it should only be called when it's known
  * that hugetlbfs pages aren't involved.
@@ -559,6 +580,7 @@  static inline int TestClearPageDoubleMap(struct page *page)
 #else
 TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
+TESTPAGEFLAG_FALSE(TransCompoundMap)
 TESTPAGEFLAG_FALSE(TransTail)
 TESTPAGEFLAG_FALSE(DoubleMap)
 	TESTSETFLAG_FALSE(DoubleMap)