
[6/6] mm: Run the fault-around code under the VMA lock

Message ID: 20230404135850.3673404-7-willy@infradead.org
State: New
Series: Avoid the mmap lock for fault-around

Commit Message

Matthew Wilcox April 4, 2023, 1:58 p.m. UTC
The map_pages fs method should be safe to run under the VMA lock instead
of the mmap lock.  This should have a measurable reduction in contention
on the mmap lock.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox April 4, 2023, 3:23 p.m. UTC | #1
On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> The map_pages fs method should be safe to run under the VMA lock instead
> of the mmap lock.  This should have a measurable reduction in contention
> on the mmap lock.

https://github.com/antonblanchard/will-it-scale/pull/37/files should
be a good microbenchmark to report numbers from.  Obviously real-world
benchmarks will be more compelling.
Suren Baghdasaryan April 7, 2023, 6:20 p.m. UTC | #2
On Tue, Apr 4, 2023 at 8:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> > The map_pages fs method should be safe to run under the VMA lock instead
> > of the mmap lock.  This should have a measurable reduction in contention
> > on the mmap lock.
>
> https://github.com/antonblanchard/will-it-scale/pull/37/files should
> be a good microbenchmark to report numbers from.  Obviously real-world
> benchmarks will be more compelling.

The series looks sane to me. I'll run some tests on a NUMA machine to
see if anything breaks. Thanks!
Yin Fengwei April 10, 2023, 4:53 a.m. UTC | #3
On Tue, 2023-04-04 at 16:23 +0100, Matthew Wilcox wrote:
> On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> > The map_pages fs method should be safe to run under the VMA lock instead
> > of the mmap lock.  This should have a measurable reduction in contention
> > on the mmap lock.
> 
> https://github.com/antonblanchard/will-it-scale/pull/37/files should
> be a good microbenchmark to report numbers from.  Obviously real-world
> benchmarks will be more compelling.
> 

Test result on my side with page_fault4 of will-it-scale in thread
mode is:
  15274196 (without the patch) -> 17291444 (with the patch)

13.2% improvement on an Ice Lake with 48C/96T + 192G RAM + ext4
filesystem.


The perf profile showed that mmap_lock contention was reduced a lot
(the grandchild functions of do_user_addr_fault() are removed for brevity):

latest linux-next with the patch:
    51.78%--do_user_addr_fault
            |          
            |--49.09%--handle_mm_fault
            |--1.19%--lock_vma_under_rcu
            --1.09%--down_read

latest linux-next without the patch:
    73.65%--do_user_addr_fault
            |          
            |--28.65%--handle_mm_fault
            |--17.22%--down_read_trylock
            |--10.92%--down_read
            |--9.20%--up_read
            --7.30%--find_vma

My understanding is that down_read_trylock, down_read and up_read are all
related to the mmap_lock, so the reduction in mmap_lock contention is quite
obvious.
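
For reference, a heavily simplified sketch of the two fault paths in
do_user_addr_fault() with per-VMA locks (error handling, accounting and the
exact control flow are omitted and may differ in current linux-next):

	/* Fast path: per-VMA lock; no traffic on the mmap_lock rwsem. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		fault = handle_mm_fault(vma, address,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		vma_end_read(vma);
		if (!(fault & VM_FAULT_RETRY))
			return;		/* mmap_lock was never touched */
	}

	/* Slow path: the mmap_lock read side, as in the "without" profile. */
	mmap_read_lock(mm);			/* down_read() */
	vma = find_vma(mm, address);
	fault = handle_mm_fault(vma, address, flags, regs);
	mmap_read_unlock(mm);			/* up_read() */

Before per-VMA locks this slow path was the only path, entered via
mmap_read_trylock(), which is the down_read_trylock() in the profile above.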

For functional testing, our 0day bot has already cherry-picked the patchset
and the testing is ongoing. If any problem is hit during testing, 0day will
report it. Thanks.


Regards
Yin, Fengwei
Matthew Wilcox April 10, 2023, 2:11 p.m. UTC | #4
On Mon, Apr 10, 2023 at 04:53:19AM +0000, Yin, Fengwei wrote:
> On Tue, 2023-04-04 at 16:23 +0100, Matthew Wilcox wrote:
> > On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> > > The map_pages fs method should be safe to run under the VMA lock instead
> > > of the mmap lock.  This should have a measurable reduction in contention
> > > on the mmap lock.
> > 
> > https://github.com/antonblanchard/will-it-scale/pull/37/files should
> > be a good microbenchmark to report numbers from.  Obviously real-world
> > benchmarks will be more compelling.
> > 
> 
> Test result on my side with page_fault4 of will-it-scale in thread
> mode is:
>   15274196 (without the patch) -> 17291444 (with the patch)
> 
> 13.2% improvement on an Ice Lake with 48C/96T + 192G RAM + ext4
> filesystem.

Thanks!  That is really good news.

> The perf profile showed that mmap_lock contention was reduced a lot
> (the grandchild functions of do_user_addr_fault() are removed for brevity):
> 
> latest linux-next with the patch:
>     51.78%--do_user_addr_fault
>             |          
>             |--49.09%--handle_mm_fault
>             |--1.19%--lock_vma_under_rcu
>             --1.09%--down_read
> 
> latest linux-next without the patch:
>     73.65%--do_user_addr_fault
>             |          
>             |--28.65%--handle_mm_fault
>             |--17.22%--down_read_trylock
>             |--10.92%--down_read
>             |--9.20%--up_read
>             --7.30%--find_vma
> 
> My understanding is that down_read_trylock, down_read and up_read are all
> related to the mmap_lock, so the reduction in mmap_lock contention is quite
> obvious.

Absolutely.  I'm a little surprised that find_vma() basically disappeared
from the perf results.  I guess that it was cache cold after contending
on the mmap_lock rwsem.  But this is a very encouraging result.
Yin Fengwei April 12, 2023, 7:35 a.m. UTC | #5
On 4/10/2023 10:11 PM, Matthew Wilcox wrote:
> On Mon, Apr 10, 2023 at 04:53:19AM +0000, Yin, Fengwei wrote:
>> On Tue, 2023-04-04 at 16:23 +0100, Matthew Wilcox wrote:
>>> On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
>>>> The map_pages fs method should be safe to run under the VMA lock instead
>>>> of the mmap lock.  This should have a measurable reduction in contention
>>>> on the mmap lock.
>>>
>>> https://github.com/antonblanchard/will-it-scale/pull/37/files should
>>> be a good microbenchmark to report numbers from.  Obviously real-world
>>> benchmarks will be more compelling.
>>>
>>
>> Test result on my side with page_fault4 of will-it-scale in thread
>> mode is:
>>   15274196 (without the patch) -> 17291444 (with the patch)
>>
>> 13.2% improvement on an Ice Lake with 48C/96T + 192G RAM + ext4
>> filesystem.
> 
> Thanks!  That is really good news.
> 
>> The perf profile showed that mmap_lock contention was reduced a lot
>> (the grandchild functions of do_user_addr_fault() are removed for brevity):
>>
>> latest linux-next with the patch:
>>     51.78%--do_user_addr_fault
>>             |          
>>             |--49.09%--handle_mm_fault
>>             |--1.19%--lock_vma_under_rcu
>>             --1.09%--down_read
>>
>> latest linux-next without the patch:
>>     73.65%--do_user_addr_fault
>>             |          
>>             |--28.65%--handle_mm_fault
>>             |--17.22%--down_read_trylock
>>             |--10.92%--down_read
>>             |--9.20%--up_read
>>             --7.30%--find_vma
>>
>> My understanding is that down_read_trylock, down_read and up_read are all
>> related to the mmap_lock, so the reduction in mmap_lock contention is quite
>> obvious.
> 
> Absolutely.  I'm a little surprised that find_vma() basically disappeared
> from the perf results.  I guess that it was cache cold after contending
> on the mmap_lock rwsem.  But this is a very encouraging result.

Yes, find_vma() was a surprise, so I did more checking on it:
1. Re-ran the testing 3 more times in case I had made a stupid mistake.
   All runs show the same behavior for find_vma().

2. perf report for find_vma() shows:
	6.26%--find_vma
		|
		--0.66%--mt_find
			|
			--0.60%--mtree_range_walk

   Most of the time spent in find_vma() is not in mt_find(); it is in
   mmap_assert_locked(mm) (see the code sketch after this list).

3. perf annotate of find_vma shows:
          │    ffffffffa91e9f20 <load0>:
     0.07 │      nop
     0.07 │      sub  $0x10,%rsp
     0.01 │      mov  %gs:0x28,%rax
     0.05 │      mov  %rax,0x8(%rsp)
     0.02 │      xor  %eax,%eax
          │      mov  %rsi,(%rsp)
     0.00 │      mov  0x70(%rdi),%rax     --> This is rwsem_is_locked(&mm->mmap_lock)
    99.60 │      test %rax,%rax
          │    ↓ je   4e
     0.09 │      mov  $0xffffffffffffffff,%rdx
          │      mov  %rsp,%rsi

   I believe the line "mov  0x70(%rdi),%rax" is what actually accounts for the
   99.60% of find_vma()'s runtime attributed to "test %rax,%rax" (perf sampling
   skid tends to blame the instruction after an expensive load), and that load
   is the access to the mmap_lock. With this series, mmap_lock contention is
   reduced a lot with page_fault4 of will-it-scale, which also makes the
   mmap_lock access itself fast.
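
For reference, the find_vma() being profiled here is roughly the following in
maple-tree-era kernels (abridged sketch; mmap_assert_locked() lives in
include/linux/mmap_lock.h, and its rwsem_is_locked() check is only compiled
in when CONFIG_DEBUG_VM is enabled):

	struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
	{
		unsigned long index = addr;

		mmap_assert_locked(mm);
		/* The maple tree walk seen as mt_find/mtree_range_walk above. */
		return mt_find(&mm->mm_mt, &index, ULONG_MAX);
	}

	static inline void mmap_assert_locked(struct mm_struct *mm)
	{
		lockdep_assert_held(&mm->mmap_lock);
		/*
		 * rwsem_is_locked() is a single atomic_long_read() of
		 * mmap_lock.count -- the "mov 0x70(%rdi),%rax" above.  Under
		 * contention that cache line bounces between CPUs, so this
		 * one load can dominate the function's runtime.
		 */
		VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
	}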


Regards
Yin, Fengwei

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 9952bebd25b4..590ccaf6f7fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4521,8 +4521,6 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
 	vm_fault_t ret = 0;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
-		return VM_FAULT_RETRY;
 	/*
 	 * Let's call ->map_pages() first and use ->fault() as fallback
 	 * if page by the offset is not ready to be mapped (cold cache or
@@ -4534,6 +4532,8 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+		return VM_FAULT_RETRY;
 	ret = __do_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
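
Reading the two hunks together with their context, the reordered
do_read_fault() becomes roughly the sketch below; should_fault_around() is a
hypothetical stand-in for the actual fault-around condition, which these
hunks do not show:

	static vm_fault_t do_read_fault(struct vm_fault *vmf)
	{
		vm_fault_t ret = 0;

		/*
		 * Let's call ->map_pages() first and use ->fault() as fallback
		 * if page by the offset is not ready to be mapped (cold cache or
		 * something).
		 */
		if (should_fault_around(vmf)) {
			ret = do_fault_around(vmf);
			if (ret)
				return ret;
		}

		/*
		 * Fault-around only maps page-cache folios that are already
		 * present and up to date, so it can run under the VMA lock;
		 * __do_fault() may have to start I/O, so that still falls
		 * back to a retry under the mmap_lock.
		 */
		if (vmf->flags & FAULT_FLAG_VMA_LOCK)
			return VM_FAULT_RETRY;

		ret = __do_fault(vmf);
		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
			return ret;
		...
	}

The point of the reordering: before the patch, a VMA-locked fault bailed out
to VM_FAULT_RETRY immediately; after it, the common warm-page-cache case is
served entirely under the VMA lock, and only the cold-cache case falls back
to the mmap_lock.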