diff mbox series

[v2,2/2] hugetlb: use same fault hash key for shared and private mappings

Message ID 20190328234704.27083-3-mike.kravetz@oracle.com (mailing list archive)
State New, archived
Series A couple hugetlbfs fixes

Commit Message

Mike Kravetz March 28, 2019, 11:47 p.m. UTC
hugetlb uses a fault mutex hash table to prevent page faults of the
same pages concurrently.  The key for shared and private mappings is
different.  Shared keys off address_space and file index.  Private
keys off mm and virtual address.  Consider a private mapping of a
populated hugetlbfs file.  A write fault will first map the page from
the file and then do a COW to map a writable page.

Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
pages.  It uses the address_space file index key.  However, private
mappings will use a different key and could temporarily map the file
page before COW.  This causes problems (BUG) for the hole punch code
as it expects the mutex to prevent additional uses/mappings of the page.

There seems to be another potential COW issue/race with this approach
of different private and shared keys as noted in commit 8382d914ebf7
("mm, hugetlb: improve page-fault scalability").

Since every hugetlb mapping (even anon and private) is actually a file
mapping, just use the address_space index key for all mappings.  This
results in potentially more hash collisions.  However, this should not
be the common case.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    |  7 ++-----
 include/linux/hugetlb.h |  4 +---
 mm/hugetlb.c            | 22 ++++++----------------
 mm/userfaultfd.c        |  3 +--
 4 files changed, 10 insertions(+), 26 deletions(-)
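
To make the key mismatch concrete, here is a minimal userspace sketch of the
two key schemes.  It is not kernel code: struct fault_key and the helpers
old_key(), new_key() and same_key() are hypothetical names used only for
illustration, and plain integers stand in for the mapping and mm pointers.
Only the choice of key words follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two-word key that gets hashed. */
struct fault_key {
	uint64_t k0;
	uint64_t k1;
};

/*
 * Old scheme: shared mappings key off (mapping, file index), private
 * mappings key off (mm, virtual address >> huge page shift).
 */
static struct fault_key old_key(bool shared, uint64_t mapping, uint64_t idx,
				uint64_t mm, uint64_t address,
				unsigned int huge_shift)
{
	struct fault_key k;

	assert(huge_shift < 64);
	if (shared) {
		k.k0 = mapping;
		k.k1 = idx;
	} else {
		k.k0 = mm;
		k.k1 = address >> huge_shift;
	}
	return k;
}

/* New scheme: every mapping keys off (mapping, file index). */
static struct fault_key new_key(uint64_t mapping, uint64_t idx)
{
	struct fault_key k = { .k0 = mapping, .k1 = idx };

	return k;
}

/* Two paths can only be guaranteed the same mutex if their keys match. */
static bool same_key(struct fault_key a, struct fault_key b)
{
	return a.k0 == b.k0 && a.k1 == b.k1;
}
```

With old_key(), a hole punch (which keys off mapping and index) and a private
fault on the same file page produce different keys, so the two paths may take
different mutexes.  With new_key(), both paths derive the same key for the
same file page.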

Comments

Mike Kravetz April 11, 2019, 6:32 p.m. UTC | #1
On 3/28/19 4:47 PM, Mike Kravetz wrote:
> hugetlb uses a fault mutex hash table to prevent page faults of the
> same pages concurrently.  The key for shared and private mappings is
> different.  Shared keys off address_space and file index.  Private
> keys off mm and virtual address.  Consider a private mapping of a
> populated hugetlbfs file.  A write fault will first map the page from
> the file and then do a COW to map a writable page.

Davidlohr suggested adding the stack trace to the commit log.  When I
originally 'discovered' this issue I was debugging something else.  The
routine remove_inode_hugepages() contains the following:

			 * ...
			 * This race can only happen in the hole punch case.
			 * Getting here in a truncate operation is a bug.
			 */
			if (unlikely(page_mapped(page))) {
				BUG_ON(truncate_op);

				i_mmap_lock_write(mapping);
				hugetlb_vmdelete_list(&mapping->i_mmap,
					index * pages_per_huge_page(h),
					(index + 1) * pages_per_huge_page(h));
				i_mmap_unlock_write(mapping);
			}

			lock_page(page);
			/*
			 * We must free the huge page and remove from page
			 * ...
			 */
			VM_BUG_ON(PagePrivate(page));
			remove_huge_page(page);
			freed++;

I observed that the page could be mapped (again) before the call to lock_page
if we raced with a private write fault.  However, for COW faults the faulting
code is holding the page lock until it unmaps the file page.  Hence, we will
not call remove_huge_page() with the page mapped.  That is good.  However, for
simple read faults the page remains mapped after releasing the page lock and
we can call remove_huge_page with a mapped page and BUG.

Sorry, the original commit message was not completely accurate in describing
the issue.  I was basing the change on behavior experienced during debug of
another issue.  Actually, it is MUCH easier to BUG by making private read
faults race with hole punch.  As a result, I now think this should go to
stable.

Andrew, below is an updated commit message.  No changes to code.  Would you
like me to send an updated patch?  Also, need to add stable.

hugetlb uses a fault mutex hash table to prevent page faults of the
same pages concurrently.  The key for shared and private mappings is
different.  Shared keys off address_space and file index.  Private
keys off mm and virtual address.  Consider a private mapping of a
populated hugetlbfs file.  A fault will map the page from the file
and if needed do a COW to map a writable page.

Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
pages.  It uses the address_space file index key.  However, private
mappings will use a different key and could race with this code to map
the file page.  This causes problems (BUG) for the page cache remove
code as it expects the page to be unmapped.  A sample stack is:

page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
kernel BUG at mm/filemap.c:169!
...
RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
...
Call Trace:
__delete_from_page_cache+0x39/0x220
delete_from_page_cache+0x45/0x70
remove_inode_hugepages+0x13c/0x380
? __add_to_page_cache_locked+0x162/0x380
hugetlbfs_fallocate+0x403/0x540
? _cond_resched+0x15/0x30
? __inode_security_revalidate+0x5d/0x70
? selinux_file_permission+0x100/0x130
vfs_fallocate+0x13f/0x270
ksys_fallocate+0x3c/0x80
__x64_sys_fallocate+0x1a/0x20
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

There seems to be another potential COW issue/race with this approach
of different private and shared keys as noted in commit 8382d914ebf7
("mm, hugetlb: improve page-fault scalability").

Since every hugetlb mapping (even anon and private) is actually a file
mapping, just use the address_space index key for all mappings.  This
results in potentially more hash collisions.  However, this should not
be the common case.

Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
Cc: <stable@vger.kernel.org>

Davidlohr Bueso April 12, 2019, 4:52 p.m. UTC | #2
On Thu, 11 Apr 2019, Mike Kravetz wrote:

>On 3/28/19 4:47 PM, Mike Kravetz wrote:
>> hugetlb uses a fault mutex hash table to prevent page faults of the
>> same pages concurrently.  The key for shared and private mappings is
>> different.  Shared keys off address_space and file index.  Private
>> keys off mm and virtual address.  Consider a private mapping of a
>> populated hugetlbfs file.  A write fault will first map the page from
>> the file and then do a COW to map a writable page.
>
>Davidlohr suggested adding the stack trace to the commit log.  When I
>originally 'discovered' this issue I was debugging something else.  The
>routine remove_inode_hugepages() contains the following:
>
>			 * ...
>			 * This race can only happen in the hole punch case.
>			 * Getting here in a truncate operation is a bug.
>			 */
>			if (unlikely(page_mapped(page))) {
>				BUG_ON(truncate_op);
>
>				i_mmap_lock_write(mapping);
>				hugetlb_vmdelete_list(&mapping->i_mmap,
>					index * pages_per_huge_page(h),
>					(index + 1) * pages_per_huge_page(h));
>				i_mmap_unlock_write(mapping);
>			}
>
>			lock_page(page);
>			/*
>			 * We must free the huge page and remove from page
>			 * ...
>			 */
>			VM_BUG_ON(PagePrivate(page));
>			remove_huge_page(page);
>			freed++;
>
>I observed that the page could be mapped (again) before the call to lock_page
>if we raced with a private write fault.  However, for COW faults the faulting
>code is holding the page lock until it unmaps the file page.  Hence, we will
>not call remove_huge_page() with the page mapped.  That is good.  However, for
>simple read faults the page remains mapped after releasing the page lock and
>we can call remove_huge_page with a mapped page and BUG.
>
>Sorry, the original commit message was not completely accurate in describing
>the issue.  I was basing the change on behavior experienced during debug of
>another issue.  Actually, it is MUCH easier to BUG by making private read
>faults race with hole punch.  As a result, I now think this should go to
>stable.
>
>Andrew, below is an updated commit message.  No changes to code.  Would you
>like me to send an updated patch?  Also, need to add stable.
>
>hugetlb uses a fault mutex hash table to prevent page faults of the
>same pages concurrently.  The key for shared and private mappings is
>different.  Shared keys off address_space and file index.  Private
>keys off mm and virtual address.  Consider a private mapping of a
>populated hugetlbfs file.  A fault will map the page from the file
>and if needed do a COW to map a writable page.
>
>Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
>pages.  It uses the address_space file index key.  However, private
>mappings will use a different key and could race with this code to map
>the file page.  This causes problems (BUG) for the page cache remove
>code as it expects the page to be unmapped.  A sample stack is:
>
>page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
>kernel BUG at mm/filemap.c:169!
>...
>RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
>...
>Call Trace:
>__delete_from_page_cache+0x39/0x220
>delete_from_page_cache+0x45/0x70
>remove_inode_hugepages+0x13c/0x380
>? __add_to_page_cache_locked+0x162/0x380
>hugetlbfs_fallocate+0x403/0x540
>? _cond_resched+0x15/0x30
>? __inode_security_revalidate+0x5d/0x70
>? selinux_file_permission+0x100/0x130
>vfs_fallocate+0x13f/0x270
>ksys_fallocate+0x3c/0x80
>__x64_sys_fallocate+0x1a/0x20
>do_syscall_64+0x5b/0x180
>entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>There seems to be another potential COW issue/race with this approach
>of different private and shared keys as noted in commit 8382d914ebf7
>("mm, hugetlb: improve page-fault scalability").
>
>Since every hugetlb mapping (even anon and private) is actually a file
>mapping, just use the address_space index key for all mappings.  This
>results in potentially more hash collisions.  However, this should not
>be the common case.

This is fair enough as most mappings will be shared anyway (it would be
lovely to have some machinery to measure collisions in kernel hash tables,
in general).
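
As a rough illustration of what such a measurement could look like, the
userspace sketch below counts how many (mapping, index) keys land in a bucket
that an earlier key already occupies.  Everything in it is an assumption made
for illustration: NBUCKETS, mix_key(), bucket_of() and count_collisions() are
hypothetical names, the table size is assumed to be a power of two, and
mix_key() is a generic 64-bit mixer standing in for jhash2(), not the kernel's
hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table size; assumed to be a power of two. */
#define NBUCKETS 64u

/* Generic 64-bit mixer used as a stand-in for the real hash function. */
static uint32_t mix_key(uint64_t mapping, uint64_t idx)
{
	uint64_t x = mapping ^ (idx * 0x9e3779b97f4a7c15ULL);

	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return (uint32_t)x;
}

/* Reduce the hash to a bucket index by masking. */
static uint32_t bucket_of(uint64_t mapping, uint64_t idx)
{
	assert((NBUCKETS & (NBUCKETS - 1)) == 0);
	return mix_key(mapping, idx) & (NBUCKETS - 1);
}

/* Count keys that land in a bucket already used by an earlier key. */
static size_t count_collisions(const uint64_t *mappings,
			       const uint64_t *idxs, size_t n)
{
	unsigned int used[NBUCKETS] = { 0 };
	size_t collisions = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t b = bucket_of(mappings[i], idxs[i]);

		if (used[b]++)
			collisions++;
	}
	return collisions;
}
```

Feeding it the keys a workload would generate gives a first estimate of how
often unrelated faults would serialize on the same mutex.  It says nothing
about actual contention, which also depends on timing.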

>Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")

Ok the issue was introduced after we had the mutex table.

>Cc: <stable@vger.kernel.org>

Thanks for the details, I'm definitely seeing the idx mismatch issue now.

Reviewed-by: Davidlohr Bueso <dbueso@suse.de>

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ec32fece5e1e..6189ba80b57b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -440,9 +440,7 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash;
 
 			index = page->index;
-			hash = hugetlb_fault_mutex_hash(h, current->mm,
-							&pseudo_vma,
-							mapping, index, 0);
+			hash = hugetlb_fault_mutex_hash(h, mapping, index, 0);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
@@ -639,8 +637,7 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		addr = index * hpage_size;
 
 		/* mutex taken here, fault path and hole punch */
-		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
-						index, addr);
+		hash = hugetlb_fault_mutex_hash(h, mapping, index, addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		/* See if already present in mapping to avoid alloc/free */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ea35263eb76b..3bc0d02649fe 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -123,9 +123,7 @@  void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-				struct vm_area_struct *vma,
-				struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
 				pgoff_t idx, unsigned long address);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8651d6a602f9..4409a87434f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3837,8 +3837,7 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * handling userfault.  Reacquire after handling
 			 * fault to make calling code simpler.
 			 */
-			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
-							idx, haddr);
+			hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -3946,21 +3945,14 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_SMP
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-			    struct vm_area_struct *vma,
-			    struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
 {
 	unsigned long key[2];
 	u32 hash;
 
-	if (vma->vm_flags & VM_SHARED) {
-		key[0] = (unsigned long) mapping;
-		key[1] = idx;
-	} else {
-		key[0] = (unsigned long) mm;
-		key[1] = address >> huge_page_shift(h);
-	}
+	key[0] = (unsigned long) mapping;
+	key[1] = idx;
 
 	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
 
@@ -3971,9 +3963,7 @@  u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
  * For uniprocesor systems we always use a single mutex, so just
  * return 0 and avoid the hashing overhead.
  */
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-			    struct vm_area_struct *vma,
-			    struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
 {
 	return 0;
@@ -4018,7 +4008,7 @@  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
+	hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	entry = huge_ptep_get(ptep);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a73dfb3..9932d5755e4c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -271,8 +271,7 @@  static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		 */
 		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
-		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
-								idx, dst_addr);
+		hash = hugetlb_fault_mutex_hash(h, mapping, idx, dst_addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		err = -ENOMEM;