
mm: vma: skip anonymous vma when inserting vma to file rmap tree

Message ID 20250306214948.2939043-1-yang@os.amperecomputing.com (mailing list archive)
State New
Series mm: vma: skip anonymous vma when inserting vma to file rmap tree

Commit Message

Yang Shi March 6, 2025, 9:49 p.m. UTC
LKP reported an 800% performance improvement for the small-allocs benchmark
from vm-scalability [1] with the patch ("/dev/zero: make private mapping
full anonymous mapping") [2], but that patch was nack'ed since it changes
the output of smaps somewhat.

Profiling shows that one of the major sources of the performance
improvement is reduced contention on i_mmap_rwsem.

The small-allocs benchmark creates a large number of 40K-sized memory maps
by mmap'ing /dev/zero privately, then triggers page faults on the mappings.
When a private mapping of /dev/zero is created, an anonymous VMA is
created, but it has a valid vm_file.  The kernel generally assumes that
anonymous VMAs have a NULL vm_file; for example, mmap inserts a VMA into
the file rmap tree if its vm_file is not NULL.  So private /dev/zero
mappings get inserted into the file rmap tree, and this results in
contention on i_mmap_rwsem.  But the VMA is actually anonymous, so
inserting it into the file rmap tree is pointless.
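
For illustration, this is the kind of pattern the benchmark exercises (a
minimal userspace sketch; the 40K map size matches the benchmark, while the
iteration count and single-page touch are made up):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/zero", O_RDWR);

	if (fd < 0)
		return 1;

	for (int i = 0; i < 100000; i++) {
		/*
		 * A private mapping of /dev/zero is anonymous memory,
		 * but the resulting VMA still carries a valid vm_file,
		 * so (without this patch) it lands in the file rmap
		 * tree and contends on /dev/zero's i_mmap_rwsem.
		 */
		char *p = mmap(NULL, 40 * 1024, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, 4096);	/* trigger a page fault */
	}
	close(fd);
	return 0;
}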

Skip anonymous VMAs in this case.  An over-400% performance improvement
was reported with this approach [3].

That is not on par with the 800% improvement from the original patch.
This is because the page fault handler needs to access some members of
struct file when vm_file is not NULL, for example f_mode and f_mapping,
and they sit in the same cacheline as the file refcount.  When a file is
mmap'ed, the refcount is inc'ed and dec'ed, which causes bad cacheline
false sharing.  Further debugging showed that checking whether the VMA
is anonymous can alleviate the problem, but I'm not sure it is the best
way to handle it; maybe we should consider shuffling the layout of
struct file.
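
One way to check the layout on a given build (a debugging sketch; f_count,
f_mode and f_mapping are current field names, but the exact struct file
layout varies by kernel version and config, so treat pahole output as
authoritative):

#include <linux/fs.h>
#include <linux/cache.h>
#include <linux/printk.h>

/* Report which cacheline each struct file field of interest lands on. */
static void report_file_field_lines(void)
{
	pr_info("f_count:   cacheline %zu\n",
		offsetof(struct file, f_count) / L1_CACHE_BYTES);
	pr_info("f_mode:    cacheline %zu\n",
		offsetof(struct file, f_mode) / L1_CACHE_BYTES);
	pr_info("f_mapping: cacheline %zu\n",
		offsetof(struct file, f_mapping) / L1_CACHE_BYTES);
}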

However, it seems rare that real-life applications would create that many
maps by mmap'ing private /dev/zero and share the same struct file, so the
cache false sharing problem may not be that bad.  The i_mmap_rwsem
contention problem seems more real, since all private /dev/zero mappings,
even from different applications, share the same struct address_space and
hence the same i_mmap_rwsem.
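
The shared lock is easy to see from userspace: every open of /dev/zero
resolves to the same inode, hence the same address_space.  A quick
illustrative check:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat a, b;
	int fd1 = open("/dev/zero", O_RDONLY);
	int fd2 = open("/dev/zero", O_RDONLY);

	if (fd1 < 0 || fd2 < 0 || fstat(fd1, &a) || fstat(fd2, &b))
		return 1;
	/* Same inode implies the same i_mmap interval tree and rwsem. */
	printf("same inode: %s\n",
	       (a.st_ino == b.st_ino && a.st_dev == b.st_dev) ? "yes" : "no");
	close(fd1);
	close(fd2);
	return 0;
}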

[1] https://lore.kernel.org/linux-mm/202501281038.617c6b60-lkp@intel.com/
[2] https://lore.kernel.org/linux-mm/20250113223033.4054534-1-yang@os.amperecomputing.com/
[3] https://lore.kernel.org/linux-mm/Z6RshwXCWhAGoMOK@xsang-OptiPlex-9020/#t

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 mm/vma.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Comments

Lorenzo Stoakes March 7, 2025, 1:12 p.m. UTC | #1
On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
> LKP reported 800% performance improvement for small-allocs benchmark
> from vm-scalability [1] with patch ("/dev/zero: make private mapping
> full anonymous mapping") [2], but the patch was nack'ed since it changes
> the output of smaps somewhat.

Yeah, sorry about that, but unfortunately it's something we really do have
to think about (among other things, the VMA edge cases are always a source
of weirdness...)

>
> The profiling shows one of the major sources of the performance
> improvement is the less contention to i_mmap_rwsem.

Great work tracking that down! Sorry I lost track of the other thread.

>
> The small-allocs benchmark creates a lot of 40K size memory maps by
> mmap'ing private /dev/zero then triggers page fault on the mappings.
> When creating private mapping for /dev/zero, the anonymous VMA is
> created, but it has valid vm_file.  Kernel basically assumes anonymous
> VMAs should have NULL vm_file, for example, mmap inserts VMA to the file
> rmap tree if vm_file is not NULL.  So the private /dev/zero mapping
> will be inserted to the file rmap tree, this resulted in the contention
> to i_mmap_rwsem.  But it is actually anonymous VMA, so it is pointless
> to insert it to file rmap tree.

Ughhhh god haha.

>
> Skip anonymous VMA for this case.  Over 400% performance improvement was
> reported [3].

That's insane. Amazing work.

>
> It is not on par with the 800% improvement from the original patch.  It is
> because page fault handler needs to access some members of struct file
> if vm_file is not NULL, for example, f_mode and f_mapping.  They are in
> the same cacheline with file refcount.  When mmap'ing a file the file
> refcount is inc'ed and dec'ed, this caused bad cache false sharing
> problem.  The further debug showed checking whether the VMA is anonymous
> or not can alleviate the problem.  But I'm not sure whether it is the
> best way to handle it, maybe we should consider shuffle the layout of
> struct file.

Interesting, I guess you'll take a look at this also?

>
> However it sounds rare that real life applications would create that
> many maps with mmap'ing private /dev/zero and share the same struct
> file, so the cache false sharing problem may be not that bad.  But

Right

> i_mmap_rwsem contention problem seems more real since all /dev/zero
> private mappings even from different applications share the same struct
> address_space so the same i_mmap_rwsem.

Yeah, lord above, that's crazy. Again, great work!

>
> [1] https://lore.kernel.org/linux-mm/202501281038.617c6b60-lkp@intel.com/
> [2] https://lore.kernel.org/linux-mm/20250113223033.4054534-1-yang@os.amperecomputing.com/
> [3] https://lore.kernel.org/linux-mm/Z6RshwXCWhAGoMOK@xsang-OptiPlex-9020/#t
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>

Thanks, this looks sensible, other than nits below, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/vma.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index c7abef5177cc..f4cf85c32b7a 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1648,6 +1648,9 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
>  void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb,
>  			       struct vm_area_struct *vma)
>  {
> +	if (vma_is_anonymous(vma))
> +		return;
> +

This really needs a comment imo, something simple like:

/* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */

>  	if (vma->vm_file == NULL)
>  		return;
>
> @@ -1671,8 +1674,12 @@ void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
>   */
>  void unlink_file_vma(struct vm_area_struct *vma)
>  {
> -	struct file *file = vma->vm_file;
> +	struct file *file;
> +
> +	if (vma_is_anonymous(vma))
> +		return;

Same comment as above.

>
> +	file = vma->vm_file;
>  	if (file) {
>  		struct address_space *mapping = file->f_mapping;
>
> @@ -1684,9 +1691,13 @@ void unlink_file_vma(struct vm_area_struct *vma)
>
>  void vma_link_file(struct vm_area_struct *vma)
>  {
> -	struct file *file = vma->vm_file;
> +	struct file *file;
>  	struct address_space *mapping;
>
> +	if (vma_is_anonymous(vma))
> +		return;

Same comment as above.

> +
> +	file = vma->vm_file;
>  	if (file) {
>  		mapping = file->f_mapping;
>  		i_mmap_lock_write(mapping);
> --
> 2.47.0
>
Pedro Falcato March 7, 2025, 1:35 p.m. UTC | #2
On Fri, Mar 7, 2025 at 1:12 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
> > LKP reported 800% performance improvement for small-allocs benchmark
> > from vm-scalability [1] with patch ("/dev/zero: make private mapping
> > full anonymous mapping") [2], but the patch was nack'ed since it changes
> > the output of smaps somewhat.
>
> Yeah sorry about that, but unfortunately something we really do have to
> think about (among other things, the VMA edge cases are always the source
> of weirdness...)
>
> >
> > The profiling shows one of the major sources of the performance
> > improvement is the less contention to i_mmap_rwsem.
>
> Great work tracking that down! Sorry I lost track of the other thread.
>
> >
> > The small-allocs benchmark creates a lot of 40K size memory maps by
> > mmap'ing private /dev/zero then triggers page fault on the mappings.
> > When creating private mapping for /dev/zero, the anonymous VMA is
> > created, but it has valid vm_file.  Kernel basically assumes anonymous
> > VMAs should have NULL vm_file, for example, mmap inserts VMA to the file
> > rmap tree if vm_file is not NULL.  So the private /dev/zero mapping
> > will be inserted to the file rmap tree, this resulted in the contention
> > to i_mmap_rwsem.  But it is actually anonymous VMA, so it is pointless
> > to insert it to file rmap tree.
>
> Ughhhh god haha.
>
> >
> > Skip anonymous VMA for this case.  Over 400% performance improvement was
> > reported [3].
>
> That's insane. Amazing work.
>

Ok, so the real question (to Yang) is: who are these /dev/zero users
that require an insane degree of scalability, and why didn't they
switch to regular MAP_ANONYMOUS? Are they in the room with us?

> >
> > It is not on par with the 800% improvement from the original patch.  It is
> > because page fault handler needs to access some members of struct file
> > if vm_file is not NULL, for example, f_mode and f_mapping.  They are in
> > the same cacheline with file refcount.  When mmap'ing a file the file
> > refcount is inc'ed and dec'ed, this caused bad cache false sharing
> > problem.  The further debug showed checking whether the VMA is anonymous
> > or not can alleviate the problem.  But I'm not sure whether it is the
> > best way to handle it, maybe we should consider shuffle the layout of
> > struct file.
>
> Interesting, I guess you'll take a look at this also?

... And this is probably a non-issue in 99% of !/dev/zero mmaps unless
it's something like libc.so.6 at an insane rate of execs/second.

This seems like a patch in search of a problem and I really don't see
why we should wart up the mmap code otherwise. Not that I have a huge
problem with this patch, which is somewhat simple and obvious.
It'd be great if there was a real workload driving this rather than
useless synthetic benchmarks.
Lorenzo Stoakes March 7, 2025, 1:41 p.m. UTC | #3
On Fri, Mar 07, 2025 at 01:35:00PM +0000, Pedro Falcato wrote:
> On Fri, Mar 7, 2025 at 1:12 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
> > > LKP reported 800% performance improvement for small-allocs benchmark
> > > from vm-scalability [1] with patch ("/dev/zero: make private mapping
> > > full anonymous mapping") [2], but the patch was nack'ed since it changes
> > > the output of smaps somewhat.
> >
> > Yeah sorry about that, but unfortunately something we really do have to
> > think about (among other things, the VMA edge cases are always the source
> > of weirdness...)
> >
> > >
> > > The profiling shows one of the major sources of the performance
> > > improvement is the less contention to i_mmap_rwsem.
> >
> > Great work tracking that down! Sorry I lost track of the other thread.
> >
> > >
> > > The small-allocs benchmark creates a lot of 40K size memory maps by
> > > mmap'ing private /dev/zero then triggers page fault on the mappings.
> > > When creating private mapping for /dev/zero, the anonymous VMA is
> > > created, but it has valid vm_file.  Kernel basically assumes anonymous
> > > VMAs should have NULL vm_file, for example, mmap inserts VMA to the file
> > > rmap tree if vm_file is not NULL.  So the private /dev/zero mapping
> > > will be inserted to the file rmap tree, this resulted in the contention
> > > to i_mmap_rwsem.  But it is actually anonymous VMA, so it is pointless
> > > to insert it to file rmap tree.
> >
> > Ughhhh god haha.
> >
> > >
> > > Skip anonymous VMA for this case.  Over 400% performance improvement was
> > > reported [3].
> >
> > That's insane. Amazing work.
> >
>
> Ok, so the real question (to Yang) is: who are these /dev/zero users
> that require an insane degree of scalability, and why didn't they
> switch to regular MAP_ANONYMOUS? Are they in the room with us?

This could be said about a lot of benchmarks.

>
> > >
> > > It is not on par with the 800% improvement from the original patch.  It is
> > > because page fault handler needs to access some members of struct file
> > > if vm_file is not NULL, for example, f_mode and f_mapping.  They are in
> > > the same cacheline with file refcount.  When mmap'ing a file the file
> > > refcount is inc'ed and dec'ed, this caused bad cache false sharing
> > > problem.  The further debug showed checking whether the VMA is anonymous
> > > or not can alleviate the problem.  But I'm not sure whether it is the
> > > best way to handle it, maybe we should consider shuffle the layout of
> > > struct file.
> >
> > Interesting, I guess you'll take a look at this also?
>
> ... And this is probably a non-issue in 99% of !/dev/zero mmaps unless
> it's something like libc.so.6 at an insane rate of execs/second.

But the cost of fixing this is...?

>
> This seems like a patch in search of a problem and I really don't see
> why we should wart up the mmap code otherwise. Not that I have a huge
> problem with this patch, which is somewhat simple and obvious.
> It'd be great if there was a real workload driving this rather than
> useless synthetic benchmarks.

Disagree with first part. Disallowing a known-broken situation for very low
cost in the majority of cases, as well as documenting that such odd
creatures exist is valuable.

Improving benchmarks, however synthetic they may be, also valuable.

But on the latter bit, yes it'd be nice if we could get information on
real-life scenarios where this is an issue if you have it Yang.

>
> --
> Pedro

The patch is fine as-is AFAIC, and I am very happy to reduce lock
contention on heavily contended locks wherever I can, especially when the
cost of doing so, as in this case, is so low.
Yang Shi March 7, 2025, 5:34 p.m. UTC | #4
On 3/7/25 5:12 AM, Lorenzo Stoakes wrote:
> On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
>> LKP reported 800% performance improvement for small-allocs benchmark
>> from vm-scalability [1] with patch ("/dev/zero: make private mapping
>> full anonymous mapping") [2], but the patch was nack'ed since it changes
>> the output of smaps somewhat.
> Yeah sorry about that, but unfortunately something we really do have to
> think about (among other things, the VMA edge cases are always the source
> of weirdness...)
>
>> The profiling shows one of the major sources of the performance
>> improvement is the less contention to i_mmap_rwsem.
> Great work tracking that down! Sorry I lost track of the other thread.
>
>> The small-allocs benchmark creates a lot of 40K size memory maps by
>> mmap'ing private /dev/zero then triggers page fault on the mappings.
>> When creating private mapping for /dev/zero, the anonymous VMA is
>> created, but it has valid vm_file.  Kernel basically assumes anonymous
>> VMAs should have NULL vm_file, for example, mmap inserts VMA to the file
>> rmap tree if vm_file is not NULL.  So the private /dev/zero mapping
>> will be inserted to the file rmap tree, this resulted in the contention
>> to i_mmap_rwsem.  But it is actually anonymous VMA, so it is pointless
>> to insert it to file rmap tree.
> Ughhhh god haha.
>
>> Skip anonymous VMA for this case.  Over 400% performance improvement was
>> reported [3].
> That's insane. Amazing work.
>
>> It is not on par with the 800% improvement from the original patch.  It is
>> because page fault handler needs to access some members of struct file
>> if vm_file is not NULL, for example, f_mode and f_mapping.  They are in
>> the same cacheline with file refcount.  When mmap'ing a file the file
>> refcount is inc'ed and dec'ed, this caused bad cache false sharing
>> problem.  The further debug showed checking whether the VMA is anonymous
>> or not can alleviate the problem.  But I'm not sure whether it is the
>> best way to handle it, maybe we should consider shuffle the layout of
>> struct file.
> Interesting, I guess you'll take a look at this also?

I looked into it, but I can't determine the best way yet. We basically
have two options:

1. Regroup the layout of struct file, for example, move the refcount to
another cacheline. But I'm not sure whether that would cause other
problems; it needs extensive testing.
2. Put the refcount in a dedicated cacheline. struct file is currently
192 bytes; a dedicated cacheline would grow it to 256 bytes, and I'm not
sure whether that is worth it.
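
For option 2, the shape of the change might be something like the sketch
below (purely illustrative; the real struct file has many more fields and
the padding tradeoff needs measurement):

/*
 * Sketch of option 2: isolate the refcount on its own cacheline so the
 * get_file()/fput() traffic at mmap/unmap time does not bounce the line
 * holding read-mostly fields (f_mode, f_mapping) used by the fault path.
 * Field subset is illustrative; this is not the real struct file.
 */
struct file {
	/* read-mostly fields accessed in the fault path */
	fmode_t				f_mode;
	struct address_space		*f_mapping;
	const struct file_operations	*f_op;
	/* ... */

	/* written on every mmap/munmap; give it a cacheline of its own */
	atomic_long_t			f_count ____cacheline_aligned_in_smp;
};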

>
>> However it sounds rare that real life applications would create that
>> many maps with mmap'ing private /dev/zero and share the same struct
>> file, so the cache false sharing problem may be not that bad.  But
> Right
>
>> i_mmap_rwsem contention problem seems more real since all /dev/zero
>> private mappings even from different applications share the same struct
>> address_space so the same i_mmap_rwsem.
> Yeah, lord above, that's crazy. Again, great work!
>
>> [1] https://lore.kernel.org/linux-mm/202501281038.617c6b60-lkp@intel.com/
>> [2] https://lore.kernel.org/linux-mm/20250113223033.4054534-1-yang@os.amperecomputing.com/
>> [3] https://lore.kernel.org/linux-mm/Z6RshwXCWhAGoMOK@xsang-OptiPlex-9020/#t
>>
>> Reported-by: kernel test robot <oliver.sang@intel.com>
>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> Thanks, this looks sensible, other than nits below, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thank you.

>
>> ---
>>   mm/vma.c | 15 +++++++++++++--
>>   1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vma.c b/mm/vma.c
>> index c7abef5177cc..f4cf85c32b7a 100644
>> --- a/mm/vma.c
>> +++ b/mm/vma.c
>> @@ -1648,6 +1648,9 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
>>   void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb,
>>   			       struct vm_area_struct *vma)
>>   {
>> +	if (vma_is_anonymous(vma))
>> +		return;
>> +
> This really needs a comment imo, something simple like:
>
> /* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */

Thanks for the suggestion, will add the comment in the new spin.

Yang

>
>>   	if (vma->vm_file == NULL)
>>   		return;
>>
>> @@ -1671,8 +1674,12 @@ void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
>>    */
>>   void unlink_file_vma(struct vm_area_struct *vma)
>>   {
>> -	struct file *file = vma->vm_file;
>> +	struct file *file;
>> +
>> +	if (vma_is_anonymous(vma))
>> +		return;
> Same comment as above.
>
>> +	file = vma->vm_file;
>>   	if (file) {
>>   		struct address_space *mapping = file->f_mapping;
>>
>> @@ -1684,9 +1691,13 @@ void unlink_file_vma(struct vm_area_struct *vma)
>>
>>   void vma_link_file(struct vm_area_struct *vma)
>>   {
>> -	struct file *file = vma->vm_file;
>> +	struct file *file;
>>   	struct address_space *mapping;
>>
>> +	if (vma_is_anonymous(vma))
>> +		return;
> Same comment as above.
>
>> +
>> +	file = vma->vm_file;
>>   	if (file) {
>>   		mapping = file->f_mapping;
>>   		i_mmap_lock_write(mapping);
>> --
>> 2.47.0
>>
Yang Shi March 7, 2025, 5:51 p.m. UTC | #5
On 3/7/25 5:35 AM, Pedro Falcato wrote:
> On Fri, Mar 7, 2025 at 1:12 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>> On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
>>> LKP reported 800% performance improvement for small-allocs benchmark
>>> from vm-scalability [1] with patch ("/dev/zero: make private mapping
>>> full anonymous mapping") [2], but the patch was nack'ed since it changes
>>> the output of smaps somewhat.
>> Yeah sorry about that, but unfortunately something we really do have to
>> think about (among other things, the VMA edge cases are always the source
>> of weirdness...)
>>
>>> The profiling shows one of the major sources of the performance
>>> improvement is the less contention to i_mmap_rwsem.
>> Great work tracking that down! Sorry I lost track of the other thread.
>>
>>> The small-allocs benchmark creates a lot of 40K size memory maps by
>>> mmap'ing private /dev/zero then triggers page fault on the mappings.
>>> When creating private mapping for /dev/zero, the anonymous VMA is
>>> created, but it has valid vm_file.  Kernel basically assumes anonymous
>>> VMAs should have NULL vm_file, for example, mmap inserts VMA to the file
>>> rmap tree if vm_file is not NULL.  So the private /dev/zero mapping
>>> will be inserted to the file rmap tree, this resulted in the contention
>>> to i_mmap_rwsem.  But it is actually anonymous VMA, so it is pointless
>>> to insert it to file rmap tree.
>> Ughhhh god haha.
>>
>>> Skip anonymous VMA for this case.  Over 400% performance improvement was
>>> reported [3].
>> That's insane. Amazing work.
>>
> Ok, so the real question (to Yang) is: who are these /dev/zero users
> that require an insane degree of scalability, and why didn't they
> switch to regular MAP_ANONYMOUS? Are they in the room with us?

I wish I could. Who knows what applications use /dev/zero? But mmap'ing
private /dev/zero is definitely an established way to create anonymous
mappings, so we can't rule out that it *is* used.

>
>>> It is not on par with the 800% improvement from the original patch.  It is
>>> because page fault handler needs to access some members of struct file
>>> if vm_file is not NULL, for example, f_mode and f_mapping.  They are in
>>> the same cacheline with file refcount.  When mmap'ing a file the file
>>> refcount is inc'ed and dec'ed, this caused bad cache false sharing
>>> problem.  The further debug showed checking whether the VMA is anonymous
>>> or not can alleviate the problem.  But I'm not sure whether it is the
>>> best way to handle it, maybe we should consider shuffle the layout of
>>> struct file.
>> Interesting, I guess you'll take a look at this also?
> ... And this is probably a non-issue in 99% of !/dev/zero mmaps unless
> it's something like libc.so.6 at an insane rate of execs/second.
>
> This seems like a patch in search of a problem and I really don't see
> why we should wart up the mmap code otherwise. Not that I have a huge
> problem with this patch, which is somewhat simple and obvious.
> It'd be great if there was a real workload driving this rather than
> useless synthetic benchmarks.

Inserting an anonymous VMA into the file rmap tree is definitely not
expected by other parts of the kernel. This is broken behavior (or at
least a special case) IMHO. Making it behave the right way, with no
surprises (and less special-casing) for the rest of the kernel, is worth
it even without counting the potential performance gain.
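
For reference, the check the patch relies on is cheap: in recent kernels
vma_is_anonymous() just keys off vm_ops, and mmap_zero() leaves vm_ops
unset for MAP_PRIVATE mappings, which is why these VMAs count as
anonymous despite the valid vm_file:

/* include/linux/mm.h (recent kernels) */
static inline bool vma_is_anonymous(struct vm_area_struct *vma)
{
	return !vma->vm_ops;
}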

Thanks,
Yang

>
Yang Shi March 7, 2025, 5:58 p.m. UTC | #6
On 3/7/25 5:41 AM, Lorenzo Stoakes wrote:
> On Fri, Mar 07, 2025 at 01:35:00PM +0000, Pedro Falcato wrote:
>> On Fri, Mar 7, 2025 at 1:12 PM Lorenzo Stoakes
>> <lorenzo.stoakes@oracle.com> wrote:
>>> On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
>>>> LKP reported 800% performance improvement for small-allocs benchmark
>>>> from vm-scalability [1] with patch ("/dev/zero: make private mapping
>>>> full anonymous mapping") [2], but the patch was nack'ed since it changes
>>>> the output of smaps somewhat.
>>> Yeah sorry about that, but unfortunately something we really do have to
>>> think about (among other things, the VMA edge cases are always the source
>>> of weirdness...)
>>>
>>>> The profiling shows one of the major sources of the performance
>>>> improvement is the less contention to i_mmap_rwsem.
>>> Great work tracking that down! Sorry I lost track of the other thread.
>>>
>>>> The small-allocs benchmark creates a lot of 40K size memory maps by
>>>> mmap'ing private /dev/zero then triggers page fault on the mappings.
>>>> When creating private mapping for /dev/zero, the anonymous VMA is
>>>> created, but it has valid vm_file.  Kernel basically assumes anonymous
>>>> VMAs should have NULL vm_file, for example, mmap inserts VMA to the file
>>>> rmap tree if vm_file is not NULL.  So the private /dev/zero mapping
>>>> will be inserted to the file rmap tree, this resulted in the contention
>>>> to i_mmap_rwsem.  But it is actually anonymous VMA, so it is pointless
>>>> to insert it to file rmap tree.
>>> Ughhhh god haha.
>>>
>>>> Skip anonymous VMA for this case.  Over 400% performance improvement was
>>>> reported [3].
>>> That's insane. Amazing work.
>>>
>> Ok, so the real question (to Yang) is: who are these /dev/zero users
>> that require an insane degree of scalability, and why didn't they
>> switch to regular MAP_ANONYMOUS? Are they in the room with us?
> This could be said about a lot of benchmarks.
>
>>>> It is not on par with the 800% improvement from the original patch.  It is
>>>> because page fault handler needs to access some members of struct file
>>>> if vm_file is not NULL, for example, f_mode and f_mapping.  They are in
>>>> the same cacheline with file refcount.  When mmap'ing a file the file
>>>> refcount is inc'ed and dec'ed, this caused bad cache false sharing
>>>> problem.  The further debug showed checking whether the VMA is anonymous
>>>> or not can alleviate the problem.  But I'm not sure whether it is the
>>>> best way to handle it, maybe we should consider shuffle the layout of
>>>> struct file.
>>> Interesting, I guess you'll take a look at this also?
>> ... And this is probably a non-issue in 99% of !/dev/zero mmaps unless
>> it's something like libc.so.6 at an insane rate of execs/second.
> But the cost of fixing this is...?
>
>> This seems like a patch in search of a problem and I really don't see
>> why we should wart up the mmap code otherwise. Not that I have a huge
>> problem with this patch, which is somewhat simple and obvious.
>> It'd be great if there was a real workload driving this rather than
>> useless synthetic benchmarks.
> Disagree with first part. Disallowing a known-broken situation for very low
> cost in the majority of cases, as well as documenting that such odd
> creatures exist is valuable.

Yes, I agree. This is one of the points of this patch too. The commit
log may focus too much on the benchmark improvement; I will say more
about fixing the odd special-case behavior.

>
> Improving benchmarks, however synthetic they may be, also valuable.
>
> But on the latter bit, yes it'd be nice if we could get information on
> real-life scenarios where this is an issue if you have it Yang.

I wish I could. As I said in the reply to Pedro, creating anonymous
mappings by mmap'ing private /dev/zero is an established technique, so
we cannot rule out that it *is* used.

Thanks,
Yang

>
>> --
>> Pedro
> The patch is fine as-is AFAIC, and I am very happy to reduce lock
> contention on heavily contested locks wherever I can, especially when the
> cost for doing so, in this case, is so low.

Patch

diff --git a/mm/vma.c b/mm/vma.c
index c7abef5177cc..f4cf85c32b7a 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1648,6 +1648,9 @@  static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
 void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb,
 			       struct vm_area_struct *vma)
 {
+	if (vma_is_anonymous(vma))
+		return;
+
 	if (vma->vm_file == NULL)
 		return;
 
@@ -1671,8 +1674,12 @@  void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
  */
 void unlink_file_vma(struct vm_area_struct *vma)
 {
-	struct file *file = vma->vm_file;
+	struct file *file;
+
+	if (vma_is_anonymous(vma))
+		return;
 
+	file = vma->vm_file;
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
 
@@ -1684,9 +1691,13 @@  void unlink_file_vma(struct vm_area_struct *vma)
 
 void vma_link_file(struct vm_area_struct *vma)
 {
-	struct file *file = vma->vm_file;
+	struct file *file;
 	struct address_space *mapping;
 
+	if (vma_is_anonymous(vma))
+		return;
+
+	file = vma->vm_file;
 	if (file) {
 		mapping = file->f_mapping;
 		i_mmap_lock_write(mapping);