Message ID | 20230404135850.3673404-7-willy@infradead.org
---|---
State | New
Series | Avoid the mmap lock for fault-around
On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> The map_pages fs method should be safe to run under the VMA lock instead
> of the mmap lock.  This should have a measurable reduction in contention
> on the mmap lock.

https://github.com/antonblanchard/will-it-scale/pull/37/files should be a
good microbenchmark to report numbers from.  Obviously real-world
benchmarks will be more compelling.
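For anyone who wants to reproduce the contention locally, a minimal sketch of
this kind of microbenchmark is below.  It is not the actual will-it-scale
page_fault4 case from the linked pull request; the thread count, file size and
reporting interval are arbitrary illustrative choices.  Several threads
repeatedly mmap() a scratch file, read one byte per page to trigger
file-backed read faults (the fault-around path), munmap(), and count
iterations:

/*
 * Minimal page-fault microbenchmark sketch -- NOT the real page_fault4.
 * Each thread maps a private file-backed region, touches every page to
 * trigger read faults, unmaps, and repeats.  Poor scaling with the
 * thread count points at mmap_lock contention.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTHREADS 8                      /* arbitrary, for illustration */
#define FILESIZE (128UL * 1024 * 1024)  /* 128 MiB scratch file */

static int fd;
static volatile unsigned long counts[NTHREADS];

static void *worker(void *arg)
{
	unsigned long id = (unsigned long)arg;
	long pagesize = sysconf(_SC_PAGESIZE);

	for (;;) {
		char *p = mmap(NULL, FILESIZE, PROT_READ, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		/* Each first access to a page is a file-backed read fault. */
		for (unsigned long off = 0; off < FILESIZE; off += pagesize)
			(void)*(volatile char *)(p + off);
		munmap(p, FILESIZE);
		counts[id]++;
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	char path[] = "/tmp/pf_bench_XXXXXX";

	fd = mkstemp(path);
	if (fd < 0 || ftruncate(fd, FILESIZE) < 0) {
		perror("scratch file");
		return 1;
	}
	unlink(path);

	for (unsigned long i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);

	for (;;) {
		unsigned long before = 0, after = 0;

		for (int i = 0; i < NTHREADS; i++)
			before += counts[i];
		sleep(5);
		for (int i = 0; i < NTHREADS; i++)
			after += counts[i];
		printf("%lu map/fault/unmap iterations in 5s\n", after - before);
	}
	return 0;
}

Built with something like "gcc -O2 -pthread", the iterations-per-interval
number gives a rough fault throughput to compare with and without the series.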
On Tue, Apr 4, 2023 at 8:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> > The map_pages fs method should be safe to run under the VMA lock instead
> > of the mmap lock.  This should have a measurable reduction in contention
> > on the mmap lock.
>
> https://github.com/antonblanchard/will-it-scale/pull/37/files should
> be a good microbenchmark to report numbers from.  Obviously real-world
> benchmarks will be more compelling.

The series looks sane to me.  I'll run some tests on a NUMA machine to see
if anything breaks.  Thanks!
On Tue, 2023-04-04 at 16:23 +0100, Matthew Wilcox wrote:
> On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> > The map_pages fs method should be safe to run under the VMA lock
> > instead of the mmap lock.  This should have a measurable reduction
> > in contention on the mmap lock.
>
> https://github.com/antonblanchard/will-it-scale/pull/37/files should
> be a good microbenchmark to report numbers from.  Obviously real-world
> benchmarks will be more compelling.
>

The test result on my side with page_fault4 of will-it-scale in thread mode
is:
  15274196 (without the patch) -> 17291444 (with the patch)

That is a 13.2% improvement on an Ice Lake with 48C/96T + 192G RAM + an
ext4 filesystem.

The perf data shows the mmap_lock contention is reduced a lot
(the grandchild functions of do_user_addr_fault() are removed):

latest linux-next with the patch:
    51.78%--do_user_addr_fault
            |
            |--49.09%--handle_mm_fault
            |--1.19%--lock_vma_under_rcu
             --1.09%--down_read

latest linux-next without the patch:
    73.65%--do_user_addr_fault
            |
            |--28.65%--handle_mm_fault
            |--17.22%--down_read_trylock
            |--10.92%--down_read
            |--9.20%--up_read
             --7.30%--find_vma

My understanding is that down_read_trylock, down_read and up_read are all
related to the mmap_lock, so the reduction in mmap_lock contention is quite
obvious.

For functional testing, our 0day bot has already cherry-picked the patchset
and the testing is ongoing.  If any problem is hit during testing, 0day
will report it.  Thanks.


Regards
Yin, Fengwei
On Mon, Apr 10, 2023 at 04:53:19AM +0000, Yin, Fengwei wrote:
> On Tue, 2023-04-04 at 16:23 +0100, Matthew Wilcox wrote:
> > On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
> > > The map_pages fs method should be safe to run under the VMA lock
> > > instead of the mmap lock.  This should have a measurable reduction
> > > in contention on the mmap lock.
> >
> > https://github.com/antonblanchard/will-it-scale/pull/37/files should
> > be a good microbenchmark to report numbers from.  Obviously real-world
> > benchmarks will be more compelling.
>
> The test result on my side with page_fault4 of will-it-scale in thread
> mode is:
>   15274196 (without the patch) -> 17291444 (with the patch)
>
> That is a 13.2% improvement on an Ice Lake with 48C/96T + 192G RAM + an
> ext4 filesystem.

Thanks!  That is really good news.

> The perf data shows the mmap_lock contention is reduced a lot
> (the grandchild functions of do_user_addr_fault() are removed):
>
> latest linux-next with the patch:
>     51.78%--do_user_addr_fault
>             |
>             |--49.09%--handle_mm_fault
>             |--1.19%--lock_vma_under_rcu
>              --1.09%--down_read
>
> latest linux-next without the patch:
>     73.65%--do_user_addr_fault
>             |
>             |--28.65%--handle_mm_fault
>             |--17.22%--down_read_trylock
>             |--10.92%--down_read
>             |--9.20%--up_read
>              --7.30%--find_vma
>
> My understanding is that down_read_trylock, down_read and up_read are all
> related to the mmap_lock, so the reduction in mmap_lock contention is
> quite obvious.

Absolutely.  I'm a little surprised that find_vma() basically disappeared
from the perf results.  I guess that it was cache cold after contending
on the mmap_lock rwsem.  But this is a very encouraging result.
On 4/10/2023 10:11 PM, Matthew Wilcox wrote:
> On Mon, Apr 10, 2023 at 04:53:19AM +0000, Yin, Fengwei wrote:
>> On Tue, 2023-04-04 at 16:23 +0100, Matthew Wilcox wrote:
>>> On Tue, Apr 04, 2023 at 02:58:50PM +0100, Matthew Wilcox (Oracle) wrote:
>>>> The map_pages fs method should be safe to run under the VMA lock
>>>> instead of the mmap lock.  This should have a measurable reduction
>>>> in contention on the mmap lock.
>>>
>>> https://github.com/antonblanchard/will-it-scale/pull/37/files should
>>> be a good microbenchmark to report numbers from.  Obviously real-world
>>> benchmarks will be more compelling.
>>
>> The test result on my side with page_fault4 of will-it-scale in thread
>> mode is:
>>   15274196 (without the patch) -> 17291444 (with the patch)
>>
>> That is a 13.2% improvement on an Ice Lake with 48C/96T + 192G RAM + an
>> ext4 filesystem.
>
> Thanks!  That is really good news.
>
>> The perf data shows the mmap_lock contention is reduced a lot
>> (the grandchild functions of do_user_addr_fault() are removed):
>>
>> latest linux-next with the patch:
>>     51.78%--do_user_addr_fault
>>             |
>>             |--49.09%--handle_mm_fault
>>             |--1.19%--lock_vma_under_rcu
>>              --1.09%--down_read
>>
>> latest linux-next without the patch:
>>     73.65%--do_user_addr_fault
>>             |
>>             |--28.65%--handle_mm_fault
>>             |--17.22%--down_read_trylock
>>             |--10.92%--down_read
>>             |--9.20%--up_read
>>              --7.30%--find_vma
>>
>> My understanding is that down_read_trylock, down_read and up_read are all
>> related to the mmap_lock, so the reduction in mmap_lock contention is
>> quite obvious.
>
> Absolutely.  I'm a little surprised that find_vma() basically disappeared
> from the perf results.  I guess that it was cache cold after contending
> on the mmap_lock rwsem.  But this is a very encouraging result.

Yes, find_vma() was a surprise.  So I did more checking on it:

1. Re-ran the test 3 more times in case I had made a stupid mistake.
   All runs show the same behavior for find_vma().

2. perf report for find_vma() shows:

       6.26%--find_vma
              |
               --0.66%--mt_find
                         |
                          --0.60%--mtree_range_walk

   Most of the time spent in find_vma() is not in mt_find(); it is in
   mmap_assert_locked(mm).

3. perf annotate of find_vma shows:

         │    ffffffffa91e9f20 <load0>:
    0.07 │      nop
    0.07 │      sub    $0x10,%rsp
    0.01 │      mov    %gs:0x28,%rax
    0.05 │      mov    %rax,0x8(%rsp)
    0.02 │      xor    %eax,%eax
         │      mov    %rsi,(%rsp)
    0.00 │      mov    0x70(%rdi),%rax   --> This is rwsem_is_locked(&mm->mmap_lock)
   99.60 │      test   %rax,%rax
         │    ↓ je     4e
    0.09 │      mov    $0xffffffffffffffff,%rdx
         │      mov    %rsp,%rsi

   I believe the 99.60% of find_vma()'s runtime attributed to
   "test %rax,%rax" really belongs to the preceding
   "mov 0x70(%rdi),%rax" (presumably sampling skid), and that
   instruction is the access to the mmap_lock.

With this series, mmap_lock contention is reduced a lot with page_fault4
of will-it-scale, and that also makes the remaining mmap_lock accesses
fast.


Regards
Yin, Fengwei
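For reference, the chain being described looks roughly like the following in
the kernel sources of that period (paraphrased from include/linux/rwsem.h,
include/linux/mmap_lock.h and mm/mmap.c, with debug details trimmed; treat it
as a sketch, not verbatim kernel code).  The annotated "mov 0x70(%rdi),%rax"
is the load of mm->mmap_lock.count performed by rwsem_is_locked():

/* Paraphrased kernel code, trimmed for illustration -- not verbatim. */

static inline int rwsem_is_locked(struct rw_semaphore *sem)
{
	/* A plain load of the rwsem count word: the contended cache line. */
	return atomic_long_read(&sem->count) != 0;
}

static inline void mmap_assert_locked(struct mm_struct *mm)
{
	lockdep_assert_held(&mm->mmap_lock);
	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}

struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
	unsigned long index = addr;

	mmap_assert_locked(mm);		/* reads mm->mmap_lock.count */
	return mt_find(&mm->mm_mt, &index, ULONG_MAX);
}

When most faults take and drop mmap_lock, the count word bounces between CPUs
and even this read-only assertion is expensive; once the fault-around path
runs under the VMA lock, the same load hits a quiet cache line, which would
explain why find_vma() all but disappears from the profile above.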
diff --git a/mm/memory.c b/mm/memory.c
index 9952bebd25b4..590ccaf6f7fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4521,8 +4521,6 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
 	vm_fault_t ret = 0;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
-		return VM_FAULT_RETRY;
 	/*
 	 * Let's call ->map_pages() first and use ->fault() as fallback
 	 * if page by the offset is not ready to be mapped (cold cache or
@@ -4534,6 +4532,8 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 		return ret;
 	}
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+		return VM_FAULT_RETRY;
 	ret = __do_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
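The net effect of the two hunks is to move the FAULT_FLAG_VMA_LOCK bail-out
below the fault-around attempt, so a read fault satisfied by ->map_pages()
completes entirely under the VMA lock and only the ->fault() fallback still
retries under the mmap lock.  A condensed sketch of the resulting flow (the
fault-around eligibility test is abbreviated to a stand-in helper name, not
the real predicate):

static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
	vm_fault_t ret = 0;

	/* Fault-around first: ->map_pages() is safe under the VMA lock. */
	if (fault_around_is_worthwhile(vmf)) {	/* stand-in for the real check */
		ret = do_fault_around(vmf);
		if (ret)
			return ret;
	}

	/* Only the ->fault() fallback still has to retry under mmap_lock. */
	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
		return VM_FAULT_RETRY;

	ret = __do_fault(vmf);
	/* ... finish the fault as before ... */
	return ret;
}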
The map_pages fs method should be safe to run under the VMA lock instead
of the mmap lock.  This should have a measurable reduction in contention
on the mmap lock.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)