
hugetlbfs: Take read_lock on i_mmap for PMD sharing

Message ID 20191107190628.22667-1-longman@redhat.com (mailing list archive)
State New, archived
Series hugetlbfs: Take read_lock on i_mmap for PMD sharing

Commit Message

Waiman Long Nov. 7, 2019, 7:06 p.m. UTC
A customer running large SMP systems (up to 16 sockets) with an application
that uses a large number of static hugepages (~500-1500GB) is experiencing
random multi-second delays. These delays are caused by the long time it
takes to scan the VMA interval tree with mmap_sem held.

Huge PMD sharing does not require any changes to i_mmap at all. As a
result, we can just take the read lock and let other threads search for
the right VMA to share in parallel. Once the right VMA is found,
either the PMD lock (for 2M huge pages on x86-64) or mm->page_table_lock
will be acquired to perform the actual PMD sharing.

Lock contention, if present, will happen on the spinlock. That is much
better than contention on the rwsem, where the time needed to scan the
interval tree is indeterminate.

With this patch applied, the customer is seeing significant improvements
over the unpatched kernel.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/hugetlb.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
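
For context, i_mmap_lock_write()/i_mmap_lock_read() are thin wrappers around
the mapping's i_mmap_rwsem, so the conversion in this patch trades an
exclusive rwsem acquisition for a shared one while the interval tree is
scanned. A rough sketch of the wrappers (approximating include/linux/fs.h
for kernels of this era; exact definitions may differ):

	static inline void i_mmap_lock_write(struct address_space *mapping)
	{
		/* exclusive: all other i_mmap walkers are blocked */
		down_write(&mapping->i_mmap_rwsem);
	}

	static inline void i_mmap_lock_read(struct address_space *mapping)
	{
		/* shared: concurrent interval-tree scans may proceed */
		down_read(&mapping->i_mmap_rwsem);
	}

	static inline void i_mmap_unlock_read(struct address_space *mapping)
	{
		up_read(&mapping->i_mmap_rwsem);
	}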

Comments

Waiman Long Nov. 7, 2019, 7:13 p.m. UTC | #1
On 11/7/19 2:06 PM, Waiman Long wrote:
> A customer running large SMP systems (up to 16 sockets) with an application
> that uses a large number of static hugepages (~500-1500GB) is experiencing
> random multi-second delays. These delays are caused by the long time it
> takes to scan the VMA interval tree with mmap_sem held.
>
> Huge PMD sharing does not require any changes to i_mmap at all. As a
> result, we can just take the read lock and let other threads search for
> the right VMA to share in parallel. Once the right VMA is found,
> either the PMD lock (for 2M huge pages on x86-64) or mm->page_table_lock
> will be acquired to perform the actual PMD sharing.
>
> Lock contention, if present, will happen on the spinlock. That is much
> better than contention on the rwsem, where the time needed to scan the
> interval tree is indeterminate.
>
> With this patch applied, the customer is seeing significant improvements
> over the unpatched kernel.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/hugetlb.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b45a95363a84..087e7ff00137 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4842,7 +4842,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (!vma_shareable(vma, addr))
>  		return (pte_t *)pmd_alloc(mm, pud, addr);
>  
> -	i_mmap_lock_write(mapping);
> +	/*
> +	 * PMD sharing does not require changes to i_mmap. So a read lock
> +	 * is enuogh.
> +	 */
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -4872,7 +4876,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	spin_unlock(ptl);
>  out:
>  	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  	return pte;
>  }
>  

This is a follow-up to my previous patch to disable PMD sharing on
large systems.

https://lore.kernel.org/lkml/20190911150537.19527-1-longman@redhat.com/

This patch is simpler and has lower risk while still solving the customer's
problem. It supersedes the previous patch.

Cheers,
Longman
Waiman Long Nov. 7, 2019, 7:15 p.m. UTC | #2
On 11/7/19 2:06 PM, Waiman Long wrote:
> A customer running large SMP systems (up to 16 sockets) with an application
> that uses a large number of static hugepages (~500-1500GB) is experiencing
> random multi-second delays. These delays are caused by the long time it
> takes to scan the VMA interval tree with mmap_sem held.
>
> Huge PMD sharing does not require any changes to i_mmap at all. As a
> result, we can just take the read lock and let other threads search for
> the right VMA to share in parallel. Once the right VMA is found,
> either the PMD lock (for 2M huge pages on x86-64) or mm->page_table_lock
> will be acquired to perform the actual PMD sharing.
>
> Lock contention, if present, will happen on the spinlock. That is much
> better than contention on the rwsem, where the time needed to scan the
> interval tree is indeterminate.
>
> With this patch applied, the customer is seeing significant improvements
> over the unpatched kernel.

Oh, I forgot to add

Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/hugetlb.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b45a95363a84..087e7ff00137 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4842,7 +4842,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (!vma_shareable(vma, addr))
>  		return (pte_t *)pmd_alloc(mm, pud, addr);
>  
> -	i_mmap_lock_write(mapping);
> +	/*
> +	 * PMD sharing does not require changes to i_mmap. So a read lock
> +	 * is enuogh.
> +	 */
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -4872,7 +4876,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	spin_unlock(ptl);
>  out:
>  	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  	return pte;
>  }
>  

Cheers,
Longman
Mike Kravetz Nov. 7, 2019, 7:31 p.m. UTC | #3
On 11/7/19 11:06 AM, Waiman Long wrote:
> A customer running large SMP systems (up to 16 sockets) with an application
> that uses a large number of static hugepages (~500-1500GB) is experiencing
> random multi-second delays. These delays are caused by the long time it
> takes to scan the VMA interval tree with mmap_sem held.
>
> Huge PMD sharing does not require any changes to i_mmap at all. As a
> result, we can just take the read lock and let other threads search for
> the right VMA to share in parallel. Once the right VMA is found,
> either the PMD lock (for 2M huge pages on x86-64) or mm->page_table_lock
> will be acquired to perform the actual PMD sharing.
>
> Lock contention, if present, will happen on the spinlock. That is much
> better than contention on the rwsem, where the time needed to scan the
> interval tree is indeterminate.
> 
> With this patch applied, the customer is seeing significant improvements
> over the unpatched kernel.

Thanks for getting this tested in the customer's environment!

> Signed-off-by: Waiman Long <longman@redhat.com>

Just a small typo in the comment (noted below); otherwise,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

> ---
>  mm/hugetlb.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b45a95363a84..087e7ff00137 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4842,7 +4842,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (!vma_shareable(vma, addr))
>  		return (pte_t *)pmd_alloc(mm, pud, addr);
>  
> -	i_mmap_lock_write(mapping);
> +	/*
> +	 * PMD sharing does not require changes to i_mmap. So a read lock
> +	 * is enuogh.

s/enuogh/enough/

> +	 */
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -4872,7 +4876,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	spin_unlock(ptl);
>  out:
>  	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  	return pte;
>  }
>  
>
Matthew Wilcox Nov. 7, 2019, 7:42 p.m. UTC | #4
On Thu, Nov 07, 2019 at 02:06:28PM -0500, Waiman Long wrote:
> -	i_mmap_lock_write(mapping);
> +	/*
> +	 * PMD sharing does not require changes to i_mmap. So a read lock
> +	 * is enuogh.
> +	 */
> +	i_mmap_lock_read(mapping);

We don't have comments before any of the other calls to i_mmap_lock_read()
justifying why we don't need the write lock.  I don't understand why this
situation is different.  Just delete the comment and make this a two-line
patch.
Matthew Wilcox Nov. 7, 2019, 7:54 p.m. UTC | #5
On Thu, Nov 07, 2019 at 02:06:28PM -0500, Waiman Long wrote:
> A customer running large SMP systems (up to 16 sockets) with an application
> that uses a large number of static hugepages (~500-1500GB) is experiencing
> random multi-second delays. These delays are caused by the long time it
> takes to scan the VMA interval tree with mmap_sem held.
>
> Huge PMD sharing does not require any changes to i_mmap at all. As a
> result, we can just take the read lock and let other threads search for
> the right VMA to share in parallel. Once the right VMA is found,
> either the PMD lock (for 2M huge pages on x86-64) or mm->page_table_lock
> will be acquired to perform the actual PMD sharing.
>
> Lock contention, if present, will happen on the spinlock. That is much
> better than contention on the rwsem, where the time needed to scan the
> interval tree is indeterminate.

I don't think this description really explains the contention argument
well.  There are _more_ PMD locks than there are i_mmap_sem locks, so
processes accessing different parts of the same file can work in parallel.

Are there other current users of the write lock that could use a read lock?
At first blush, it would seem that unmap_ref_private() also only needs
a read lock on the i_mmap tree.  I don't think hugetlb_change_protection()
needs the write lock either.  Nor retract_page_tables().
Waiman Long Nov. 7, 2019, 9:06 p.m. UTC | #6
On 11/7/19 2:42 PM, Matthew Wilcox wrote:
> On Thu, Nov 07, 2019 at 02:06:28PM -0500, Waiman Long wrote:
>> -	i_mmap_lock_write(mapping);
>> +	/*
>> +	 * PMD sharing does not require changes to i_mmap. So a read lock
>> +	 * is enuogh.
>> +	 */
>> +	i_mmap_lock_read(mapping);
> We don't have comments before any of the other calls to i_mmap_lock_read()
> justifying why we don't need the write lock.  I don't understand why this
> situation is different.  Just delete the comment and make this a two-line
> patch.
>
I am fine with that.

I will send a v2 patch.

Cheers,
Longman
Waiman Long Nov. 7, 2019, 9:27 p.m. UTC | #7
On 11/7/19 2:54 PM, Matthew Wilcox wrote:
> On Thu, Nov 07, 2019 at 02:06:28PM -0500, Waiman Long wrote:
>> A customer running large SMP systems (up to 16 sockets) with an application
>> that uses a large number of static hugepages (~500-1500GB) is experiencing
>> random multi-second delays. These delays are caused by the long time it
>> takes to scan the VMA interval tree with mmap_sem held.
>>
>> Huge PMD sharing does not require any changes to i_mmap at all. As a
>> result, we can just take the read lock and let other threads search for
>> the right VMA to share in parallel. Once the right VMA is found,
>> either the PMD lock (for 2M huge pages on x86-64) or mm->page_table_lock
>> will be acquired to perform the actual PMD sharing.
>>
>> Lock contention, if present, will happen on the spinlock. That is much
>> better than contention on the rwsem, where the time needed to scan the
>> interval tree is indeterminate.
> I don't think this description really explains the contention argument
> well.  There are _more_ PMD locks than there are i_mmap_sem locks, so
> processes accessing different parts of the same file can work in parallel.

I am sorry for not being clear enough. PMD lock contention here means two
or more tasks happening to touch the same PMD. Because of the PMD lock,
modifications of the same PMD cannot happen in parallel; if the tasks touch
different PMDs, they can proceed in parallel. Previously, they were all
contending for the same rwsem write lock and hence had to proceed
serially.
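
For reference, the per-PMD lock being referred to is roughly selected as
follows (a sketch approximating include/linux/hugetlb.h for this kernel;
2M mappings get a split, per-PMD-page spinlock, while larger huge pages
fall back to mm->page_table_lock):

	static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
						   struct mm_struct *mm, pte_t *pte)
	{
		if (huge_page_size(h) == PMD_SIZE)
			return pmd_lockptr(mm, (pmd_t *) pte);	/* per page-table page */
		return &mm->page_table_lock;			/* e.g. 1GB huge pages */
	}

	static inline spinlock_t *huge_pte_lock(struct hstate *h,
						struct mm_struct *mm, pte_t *pte)
	{
		spinlock_t *ptl = huge_pte_lockptr(h, mm, pte);

		spin_lock(ptl);
		return ptl;
	}

Faults touching different PMDs therefore serialize on different spinlocks,
whereas the old code serialized all of them on the single i_mmap_rwsem.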

> Are there other current users of the write lock that could use a read lock?
> At first blush, it would seem that unmap_ref_private() also only needs
> a read lock on the i_mmap tree.  I don't think hugetlb_change_protection()
> needs the write lock either.  Nor retract_page_tables().

It is possible that other locking sites can be converted to use the read
lock, but that is outside the scope of this patch.

Cheers,
Longman
Mike Kravetz Nov. 7, 2019, 9:49 p.m. UTC | #8
On 11/7/19 11:54 AM, Matthew Wilcox wrote:
> Are there other current users of the write lock that could use a read lock?
> At first blush, it would seem that unmap_ref_private() also only needs
> a read lock on the i_mmap tree.  I don't think hugetlb_change_protection()
> needs the write lock either.  Nor retract_page_tables().

I believe that the semaphore still needs to be held in write mode while
calling huge_pmd_unshare (as is done in the call sites above).  Why?
There is this check for sharing in huge_pmd_unshare,

	if (page_count(virt_to_page(ptep)) == 1)
		return 0;	// implies no sharing

Note that huge_pmd_share now increments the page count with the semaphore
held just in read mode.  It is OK to do increments in parallel without
synchronization.  However, we don't want anyone else changing the count
while that check in huge_pmd_unshare is happening.  Hence, the need for
taking the semaphore in write mode.
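
For illustration, the two code paths being described look roughly like this
(trimmed excerpts from mm/hugetlb.c; the full hunks are quoted later in the
thread):

	/* huge_pmd_share(), now scanning with i_mmap_rwsem held only for read: */
	spte = huge_pte_offset(svma->vm_mm, saddr, vma_mmu_pagesize(svma));
	if (spte) {
		get_page(virt_to_page(spte));	/* bump the pmd page's refcount */
		break;
	}

	/* huge_pmd_unshare(), called with the page table lock held: */
	BUG_ON(page_count(virt_to_page(ptep)) == 0);
	if (page_count(virt_to_page(ptep)) == 1)
		return 0;	/* reports "not shared"; can race with the get_page() above */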
Mike Kravetz Nov. 7, 2019, 9:56 p.m. UTC | #9
On 11/7/19 1:49 PM, Mike Kravetz wrote:
> On 11/7/19 11:54 AM, Matthew Wilcox wrote:
>> Are there other current users of the write lock that could use a read lock?
>> At first blush, it would seem that unmap_ref_private() also only needs
>> a read lock on the i_mmap tree.  I don't think hugetlb_change_protection()
>> needs the write lock either.  Nor retract_page_tables().

Sorry, I missed retract_page_tables, which is not part of the hugetlb code.
The comments below do not apply to retract_page_tables.  Someone would
need to take a closer look to see if that really needs write mode.
Davidlohr Bueso Nov. 8, 2019, 2:04 a.m. UTC | #10
On Thu, 07 Nov 2019, Mike Kravetz wrote:

>Note that huge_pmd_share now increments the page count with the semaphore
>held just in read mode.  It is OK to do increments in parallel without
>synchronization.  However, we don't want anyone else changing the count
>while that check in huge_pmd_unshare is happening.  Hence, the need for
>taking the semaphore in write mode.

This would be a nice addition to the changelog methinks.

Thanks,
Davidlohr
Andrew Morton Nov. 8, 2019, 3:22 a.m. UTC | #11
On Thu, 7 Nov 2019 18:04:56 -0800 Davidlohr Bueso <dave@stgolabs.net> wrote:

> On Thu, 07 Nov 2019, Mike Kravetz wrote:
> 
> >Note that huge_pmd_share now increments the page count with the semaphore
> >held just in read mode.  It is OK to do increments in parallel without
> >synchronization.  However, we don't want anyone else changing the count
> >while that check in huge_pmd_unshare is happening.  Hence, the need for
> >taking the semaphore in write mode.
> 
> This would be a nice addition to the changelog methinks.

Done.
Mike Kravetz Nov. 8, 2019, 7:10 p.m. UTC | #12
On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
> On Thu, 07 Nov 2019, Mike Kravetz wrote:
> 
>> Note that huge_pmd_share now increments the page count with the semaphore
>> held just in read mode.  It is OK to do increments in parallel without
>> synchronization.  However, we don't want anyone else changing the count
>> while that check in huge_pmd_unshare is happening.  Hence, the need for
>> taking the semaphore in write mode.
> 
> This would be a nice addition to the changelog methinks.

Last night I remembered there is one place where we currently take
i_mmap_rwsem in read mode and potentially call huge_pmd_unshare.  That
is in try_to_unmap_one.  Yes, there is a potential race here today.
But that race is somewhat contained as you need two threads doing some
combination of page migration and page poisoning to race.  This change
now allows migration or poisoning to race with page fault.  I would
really prefer if we do not open up the race window in this manner.

Getting this right in the try_to_unmap_one case is a bit tricky.  I had
code to do this in the past that was part of a bigger hugetlb synchronization
change.  All those changes got reverted (commit ddeaab32a89f), but I
believe it is possible to change try_to_unmap_one calling sequences
without introducing other issues.

Bottom line is that more changes are needed in this patch.  I'll work
on those changes unless someone else volunteers.  It will likely take me
one or two days to come up with and test proposed changes.
Mike Kravetz Nov. 9, 2019, 1:47 a.m. UTC | #13
On 11/8/19 11:10 AM, Mike Kravetz wrote:
> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>
>>> Note that huge_pmd_share now increments the page count with the semaphore
>>> held just in read mode.  It is OK to do increments in parallel without
>>> synchronization.  However, we don't want anyone else changing the count
>>> while that check in huge_pmd_unshare is happening.  Hence, the need for
>>> taking the semaphore in write mode.
>>
>> This would be a nice addition to the changelog methinks.
> 
> Last night I remembered there is one place where we currently take
> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare.  That
> is in try_to_unmap_one.  Yes, there is a potential race here today.

Actually there is no race there today.  Callers to huge_pmd_unshare
hold the page table lock.  So, this synchronizes those unshare calls
from  page migration and page poisoning.

> But that race is somewhat contained as you need two threads doing some
> combination of page migration and page poisoning to race.  This change
> now allows migration or poisoning to race with page fault.  I would
> really prefer if we do not open up the race window in this manner.

But, we do open a race window by changing huge_pmd_share to take the
i_mmap_rwsem in read mode as in the original patch.  

Here is the additional code needed to take the semaphore in write mode
for the huge_pmd_unshare calls via try_to_unmap_one.  We would need to
combine this with Longman's patch.  Please take a look and provide feedback.
Some of the changes are subtle, especially the exception for MAP_PRIVATE
mappings, but I tried to add sufficient comments.

From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Fri, 8 Nov 2019 17:25:37 -0800
Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
 call to huge_pmd_unshare in try_to_unmap_one.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c        |  9 ++++++++-
 mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
 mm/migrate.c        | 27 +++++++++++++++++++++++++--
 3 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f78891f92765..73d9136549a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with page table lock held.
+ * Must be called while holding page table lock.
+ * In general, the caller should also hold the i_mmap_rwsem in write mode.
+ * This is to prevent races with page faults calling huge_pmd_share which
+ * will not be holding the page table lock, but will be holding i_mmap_rwsem
+ * in read mode.  It is possible to call without holding i_mmap_rwsem in
+ * write mode if the caller KNOWS the page table is associated with a private
+ * mapping.  This is because private mappings can not share PMDs and can
+ * not race with huge_pmd_share calls during page faults.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3151c87dff73..8f52b22cf71b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	if (kill)
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_success = try_to_unmap(hpage, ttu);
+	if (!PageHuge(hpage)) {
+		unmap_success = try_to_unmap(hpage, ttu);
+	} else {
+		mapping = page_mapping(hpage);
+		if (mapping) {
+			/*
+			 * For hugetlb pages, try_to_unmap could potentially
+			 * call huge_pmd_unshare.  Because of this, take
+			 * semaphore in write mode here and set TTU_RMAP_LOCKED
+			 * to indicate we have taken the lock at this higher
+			 * level.
+			 */
+			i_mmap_lock_write(mapping);
+			unmap_success = try_to_unmap(hpage,
+							ttu|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			/*
+			 * !mapping implies a MAP_PRIVATE huge page mapping.
+			 * Since PMDs will never be shared in a private
+			 * mapping, it is safe to let huge_pmd_unshare be
+			 * called with the semaphore in read mode.
+			 */
+			unmap_success = try_to_unmap(hpage, ttu);
+		}
+	}
+
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 4fe45d1428c8..9cae5a4f1e48 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
-		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		struct address_space *mapping = page_mapping(hpage);
+
+		if (mapping) {
+			/*
+			 * try_to_unmap could potentially call huge_pmd_unshare.
+			 * Because of this, take semaphore in write mode here
+			 * and set TTU_RMAP_LOCKED to indicate we have taken
+			 * the lock at this higher level.
+			 */
+			i_mmap_lock_write(mapping);
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			/*
+			 * !mapping implies a MAP_PRIVATE huge page mapping.
+			 * Since PMDs will never be shared in a private
+			 * mapping, it is safe to let huge_pmd_unshare be
+			 * called with the semaphore in read mode.
+			 */
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS);
+		}
 		page_was_mapped = 1;
 	}
Waiman Long Nov. 12, 2019, 5:27 p.m. UTC | #14
On 11/8/19 8:47 PM, Mike Kravetz wrote:
> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>
>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>> held just in read mode.  It is OK to do increments in parallel without
>>>> synchronization.  However, we don't want anyone else changing the count
>>>> while that check in huge_pmd_unshare is happening.  Hence, the need for
>>>> taking the semaphore in write mode.
>>> This would be a nice addition to the changelog methinks.
>> Last night I remembered there is one place where we currently take
>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare.  That
>> is in try_to_unmap_one.  Yes, there is a potential race here today.
> Actually there is no race there today.  Callers to huge_pmd_unshare
> hold the page table lock.  So, this synchronizes those unshare calls
> from  page migration and page poisoning.
>
>> But that race is somewhat contained as you need two threads doing some
>> combination of page migration and page poisoning to race.  This change
>> now allows migration or poisoning to race with page fault.  I would
>> really prefer if we do not open up the race window in this manner.
> But, we do open a race window by changing huge_pmd_share to take the
> i_mmap_rwsem in read mode as in the original patch.  
>
> Here is the additional code needed to take the semaphore in write mode
> for the huge_pmd_unshare calls via try_to_unmap_one.  We would need to
> combine this with Longman's patch.  Please take a look and provide feedback.
> Some of the changes are subtle, especially the exception for MAP_PRIVATE
> mappings, but I tried to add sufficient comments.
>
> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Fri, 8 Nov 2019 17:25:37 -0800
> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>  call to huge_pmd_unshare in try_to_unmap_one.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c        |  9 ++++++++-
>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>  3 files changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f78891f92765..73d9136549a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>   * decrementing the ref count. If count == 1, the pte page is not shared.
>   *
> - * called with page table lock held.
> + * Must be called while holding page table lock.
> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
> + * This is to prevent races with page faults calling huge_pmd_share which
> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
> + * in read mode.  It is possible to call without holding i_mmap_rwsem in
> + * write mode if the caller KNOWS the page table is associated with a private
> + * mapping.  This is because private mappings can not share PMDs and can
> + * not race with huge_pmd_share calls during page faults.

So the page table lock here is the huge_pte_lock(). Right? In
huge_pmd_share(), the pte lock has to be taken before one can share it.
So would you mind explaining where exactly is the race?

Thanks,
Longman

>   *
>   * returns: 1 successfully unmapped a shared pte page
>   *	    0 the underlying pte page is not shared, or it is the last user
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3151c87dff73..8f52b22cf71b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	if (kill)
>  		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
>  
> -	unmap_success = try_to_unmap(hpage, ttu);
> +	if (!PageHuge(hpage)) {
> +		unmap_success = try_to_unmap(hpage, ttu);
> +	} else {
> +		mapping = page_mapping(hpage);
> +		if (mapping) {
> +			/*
> +			 * For hugetlb pages, try_to_unmap could potentially
> +			 * call huge_pmd_unshare.  Because of this, take
> +			 * semaphore in write mode here and set TTU_RMAP_LOCKED
> +			 * to indicate we have taken the lock at this higher
> +			 * level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			unmap_success = try_to_unmap(hpage,
> +							ttu|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			unmap_success = try_to_unmap(hpage, ttu);
> +		}
> +	}
> +
>  	if (!unmap_success)
>  		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
>  		       pfn, page_mapcount(hpage));
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..9cae5a4f1e48 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  		goto put_anon;
>  
>  	if (page_mapped(hpage)) {
> -		try_to_unmap(hpage,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +		struct address_space *mapping = page_mapping(hpage);
> +
> +		if (mapping) {
> +			/*
> +			 * try_to_unmap could potentially call huge_pmd_unshare.
> +			 * Because of this, take semaphore in write mode here
> +			 * and set TTU_RMAP_LOCKED to indicate we have taken
> +			 * the lock at this higher level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS);
> +		}
>  		page_was_mapped = 1;
>  	}
>
Mike Kravetz Nov. 12, 2019, 11:11 p.m. UTC | #15
On 11/12/19 9:27 AM, Waiman Long wrote:
> On 11/8/19 8:47 PM, Mike Kravetz wrote:
>> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>>
>>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>>> held just in read mode.  It is OK to do increments in parallel without
>>>>> synchronization.  However, we don't want anyone else changing the count
>>>>> while that check in huge_pmd_unshare is happening.  Hence, the need for
>>>>> taking the semaphore in write mode.
>>>> This would be a nice addition to the changelog methinks.
>>> Last night I remembered there is one place where we currently take
>>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare.  That
>>> is in try_to_unmap_one.  Yes, there is a potential race here today.
>> Actually there is no race there today.  Callers to huge_pmd_unshare
>> hold the page table lock.  So, this synchronizes those unshare calls
>> from  page migration and page poisoning.
>>
>>> But that race is somewhat contained as you need two threads doing some
>>> combination of page migration and page poisoning to race.  This change
>>> now allows migration or poisoning to race with page fault.  I would
>>> really prefer if we do not open up the race window in this manner.
>> But, we do open a race window by changing huge_pmd_share to take the
>> i_mmap_rwsem in read mode as in the original patch.  
>>
>> Here is the additional code needed to take the semaphore in write mode
>> for the huge_pmd_unshare calls via try_to_unmap_one.  We would need to
>> combine this with Longman's patch.  Please take a look and provide feedback.
>> Some of the changes are subtle, especially the exception for MAP_PRIVATE
>> mappings, but I tried to add sufficient comments.
>>
>> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>> Date: Fri, 8 Nov 2019 17:25:37 -0800
>> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>>  call to huge_pmd_unshare in try_to_unmap_one.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/hugetlb.c        |  9 ++++++++-
>>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>>  3 files changed, 60 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f78891f92765..73d9136549a5 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>>   * decrementing the ref count. If count == 1, the pte page is not shared.
>>   *
>> - * called with page table lock held.
>> + * Must be called while holding page table lock.
>> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
>> + * This is to prevent races with page faults calling huge_pmd_share which
>> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
>> + * in read mode.  It is possible to call without holding i_mmap_rwsem in
>> + * write mode if the caller KNOWS the page table is associated with a private
>> + * mapping.  This is because private mappings can not share PMDs and can
>> + * not race with huge_pmd_share calls during page faults.
> 
> So the page table lock here is the huge_pte_lock(). Right? In
> huge_pmd_share(), the pte lock has to be taken before one can share it.
> So would you mind explaining where exactly is the race?

My concern was that this code in huge_pmd_share:

		saddr = page_table_shareable(svma, vma, addr, idx);
		if (saddr) {
			spte = huge_pte_offset(svma->vm_mm, saddr,
					       vma_mmu_pagesize(svma));
			if (spte) {
				get_page(virt_to_page(spte));
				break;
			}
		}

and this code in huge_pmd_unshare:

        BUG_ON(page_count(virt_to_page(ptep)) == 0);
        if (page_count(virt_to_page(ptep)) == 1)
                return 0;

could run concurrently.  Note that return code for huge_pmd_unshare is
specified as,

 * returns: 1 successfully unmapped a shared pte page
 *          0 the underlying pte page is not shared, or it is the last user

And, callers take different action depending on the return value.

Now, since the code blocks above can run concurrently it is possible that:
- huge_pmd_unshare sees page_count == 1
- huge_pmd_share increments page_count in anticipation of sharing
- huge_pmd_unshare returns 0 indicating there is no pmd sharing
- huge_pmd_share waits for page table lock to insert pmd page before it
  starts sharing

My concern was with what might happen if a huge_pmd_unshare caller received
a 0 return code and thought there were no other users of the pmd.  In the
specific case mentioned here (try_to_unmap_one) I now do not believe there
are any issues.  The caller is simply going to clear one entry on the pmd
page.  I can not think of a way this could impact the other thread waiting
to share the pmd.  It will simply start sharing the pmd after it gets the
page table lock and the entry has been removed.
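
Laid out side by side, the interleaving described above is roughly the
following (a sketch of the scenario, not actual code):

	/*
	 * Thread A: fault, huge_pmd_share()     Thread B: try_to_unmap_one()
	 * ------------------------------------  ------------------------------
	 * i_mmap_lock_read(mapping)
	 * finds svma and a usable spte          huge_pte_lock()  (page table lock)
	 *                                       huge_pmd_unshare():
	 *                                         page_count(pmd page) == 1
	 *                                         returns 0 ("not shared")
	 * get_page(pmd page)                    clears one entry, unlocks
	 * waits for the page table lock
	 * installs the shared pud entry  <- sharing begins only after B is done
	 */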

As a result of your question, I went back and took a really close look
at the code in question.  While that huge_pmd_share/huge_pmd_unshare code
running concurrently does make me a bit nervous, I cannot find any specific
problem.  In addition, I ran a test that causes this race for several hours
without issue.

Sorry for the churn/second thoughts.  This code is subtle, and sometimes hard
to understand.  IMO, Longman's patch (V2) can go forward, but please delete
the following blob in the commit message from me:

"Note that huge_pmd_share now increments the page count with the
 semaphore held just in read mode.  It is OK to do increments in
 parallel without synchronization.  However, we don't want anyone else
 changing the count while that check in huge_pmd_unshare is happening. 
 Hence, the need for taking the semaphore in write mode."
Waiman Long Nov. 13, 2019, 2:55 a.m. UTC | #16
On 11/12/19 6:11 PM, Mike Kravetz wrote:
> On 11/12/19 9:27 AM, Waiman Long wrote:
>> On 11/8/19 8:47 PM, Mike Kravetz wrote:
>>> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>>>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>>>
>>>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>>>> held just in read mode.  It is OK to do increments in parallel without
>>>>>> synchronization.  However, we don't want anyone else changing the count
>>>>>> while that check in huge_pmd_unshare is happening.  Hence, the need for
>>>>>> taking the semaphore in write mode.
>>>>> This would be a nice addition to the changelog methinks.
>>>> Last night I remembered there is one place where we currently take
>>>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare.  That
>>>> is in try_to_unmap_one.  Yes, there is a potential race here today.
>>> Actually there is no race there today.  Callers to huge_pmd_unshare
>>> hold the page table lock.  So, this synchronizes those unshare calls
>>> from  page migration and page poisoning.
>>>
>>>> But that race is somewhat contained as you need two threads doing some
>>>> combination of page migration and page poisoning to race.  This change
>>>> now allows migration or poisoning to race with page fault.  I would
>>>> really prefer if we do not open up the race window in this manner.
>>> But, we do open a race window by changing huge_pmd_share to take the
>>> i_mmap_rwsem in read mode as in the original patch.  
>>>
>>> Here is the additional code needed to take the semaphore in write mode
>>> for the huge_pmd_unshare calls via try_to_unmap_one.  We would need to
>>> combine this with Longman's patch.  Please take a look and provide feedback.
>>> Some of the changes are subtle, especially the exception for MAP_PRIVATE
>>> mappings, but I tried to add sufficient comments.
>>>
>>> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>> Date: Fri, 8 Nov 2019 17:25:37 -0800
>>> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>>>  call to huge_pmd_unshare in try_to_unmap_one.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>  mm/hugetlb.c        |  9 ++++++++-
>>>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>>>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>>>  3 files changed, 60 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index f78891f92765..73d9136549a5 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>>>   * decrementing the ref count. If count == 1, the pte page is not shared.
>>>   *
>>> - * called with page table lock held.
>>> + * Must be called while holding page table lock.
>>> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
>>> + * This is to prevent races with page faults calling huge_pmd_share which
>>> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
>>> + * in read mode.  It is possible to call without holding i_mmap_rwsem in
>>> + * write mode if the caller KNOWS the page table is associated with a private
>>> + * mapping.  This is because private mappings can not share PMDs and can
>>> + * not race with huge_pmd_share calls during page faults.
>> So the page table lock here is the huge_pte_lock(). Right? In
>> huge_pmd_share(), the pte lock has to be taken before one can share it.
>> So would you mind explaining where exactly is the race?
> My concern was that this code in huge_pmd_share:
>
> 		saddr = page_table_shareable(svma, vma, addr, idx);
> 		if (saddr) {
> 			spte = huge_pte_offset(svma->vm_mm, saddr,
> 					       vma_mmu_pagesize(svma));
> 			if (spte) {
> 				get_page(virt_to_page(spte));
> 				break;
> 			}
> 		}
>
> and this code in huge_pmd_unshare:
>
>         BUG_ON(page_count(virt_to_page(ptep)) == 0);
>         if (page_count(virt_to_page(ptep)) == 1)
>                 return 0;
>
> could run concurrently.  Note that return code for huge_pmd_unshare is
> specified as,
>
>  * returns: 1 successfully unmapped a shared pte page
>  *          0 the underlying pte page is not shared, or it is the last user
>
> And, callers take different action depending on the return value.
>
> Now, since the code blocks above can run concurrently it is possible that:
> - huge_pmd_unshare sees page_count == 1
> - huge_pmd_share increments page_count in anticipation of sharing
> - huge_pmd_unshare returns 0 indicating there is no pmd sharing
> - huge_pmd_share waits for page table lock to insert pmd page before it
>   starts sharing
>
> My concern was with what might happen if a huge_pmd_unshare caller received
> a 0 return code and thought there were no other users of the pmd.  In the
> specific case mentioned here (try_to_unmap_one) I now do not believe there
> are any issues.  The caller is simply going to clear one entry on the pmd
> page.  I can not think of a way this could impact the other thread waiting
> to share the pmd.  It will simply start sharing the pmd after it gets the
> page table lock and the entry has been removed.
>
> As a result of your question, I went back and took a really close look
> at the code in question.  While that huge_pmd_share/huge_pmd_unshare code
> running concurrently does make me a bit nervous, I cannot find any specific
> problem.  In addition, I ran a test that causes this race for several hours
> without issue.
>
> Sorry for the churn/second thoughts.  This code is subtle, and sometimes hard
> to understand.  IMO, Longman's patch (V2) can go forward, but please delete
> the following blob in the commit message from me:
>
> "Note that huge_pmd_share now increments the page count with the
>  semaphore held just in read mode.  It is OK to do increments in
>  parallel without synchronization.  However, we don't want anyone else
>  changing the count while that check in huge_pmd_unshare is happening. 
>  Hence, the need for taking the semaphore in write mode."
>
Thanks for the explanation. I understand that it is oftentimes hard to
figure out whether a race exists or not. I have gotten confused myself many
times. Anyway, as long as any destructive action is only taken while
holding the page table lock, it should be safe.

Cheers,
Longman

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b45a95363a84..087e7ff00137 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4842,7 +4842,11 @@  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	i_mmap_lock_write(mapping);
+	/*
+	 * PMD sharing does not require changes to i_mmap. So a read lock
+	 * is enuogh.
+	 */
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -4872,7 +4876,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	i_mmap_unlock_write(mapping);
+	i_mmap_unlock_read(mapping);
 	return pte;
 }