[v2] hugetlbfs: Take read_lock on i_mmap for PMD sharing

Message ID

20191107211809.9539-1-longman@redhat.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 41C4220869
From: Waiman Long <longman@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org,
	linux-mm@kvack.org,
	Davidlohr Bueso <dave@stgolabs.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Will Deacon <will.deacon@arm.com>,
	Matthew Wilcox <willy@infradead.org>,
	Waiman Long <longman@redhat.com>
Subject: [PATCH v2] hugetlbfs: Take read_lock on i_mmap for PMD sharing
Date: Thu,  7 Nov 2019 16:18:09 -0500
Message-Id: <20191107211809.9539-1-longman@redhat.com>
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Waiman Long Nov. 7, 2019, 9:18 p.m. UTC

A customer with large SMP systems (up to 16 sockets) running an application
that uses a large amount of static hugepages (~500-1500GB) is experiencing
random multisecond delays. These delays were caused by the long time it
took to scan the VMA interval tree with mmap_sem held.

The sharing of huge PMDs does not require changes to the i_mmap at all.
Therefore, we can just take the read lock and let other threads search
for the right VMA to share in parallel. Once the right VMA is found,
either the PMD lock (2M huge page for x86-64) or the mm->page_table_lock
will be acquired to perform the actual PMD sharing.

Lock contention, if present, will happen in the spinlock. That is much
better than contention in the rwsem, where the time needed to scan the
interval tree is indeterminate.

With this patch applied, the customer is seeing significant performance
improvement over the unpatched kernel.

Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/hugetlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Davidlohr Bueso Nov. 8, 2019, 2:03 a.m. UTC | #1

On Thu, 07 Nov 2019, Waiman Long wrote:
>With this patch applied, the customer is seeing significant performance
>improvement over the unpatched kernel.

Could you give more details here?

Thanks,
Davidlohr

Waiman Long Nov. 8, 2019, 6:44 p.m. UTC | #2

On 11/7/19 9:03 PM, Davidlohr Bueso wrote:
> On Thu, 07 Nov 2019, Waiman Long wrote:
>> With this patch applied, the customer is seeing significant performance
>> improvement over the unpatched kernel.
>
> Could you give more details here? 

Red Hat has a customer that is running a transactional database
workload. In this particular case, about 500-1500GB of static hugepages
are allocated.  The database then allocates a single large shared memory
segment in those hugepages to use primarily as a database buffer for 8kB
blocks from disk (there are also other database structures in that
shared memory, but it's mostly for buffer).  Then thousands of separate
processes reference and load data into that buffer. They were seeing
multi-second pauses when starting up the database.

I first gave them a patched kernel that disabled PMD sharing. That fixed
their problem. After that, I gave them another test kernel that
contained this patch. They said there was a significant improvement compared
with the unpatched kernel. There is still some degradation compared to
the kernel with huge shared pmd disabled entirely, but they're pretty
close in performance.

Cheers,
Longman

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b45a95363a84..f78891f92765 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4842,7 +4842,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -4872,7 +4872,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	i_mmap_unlock_write(mapping);
+	i_mmap_unlock_read(mapping);
 	return pte;
 }
