diff mbox

[RFC,v3,4/5] mm: mmap: zap pages with read mmap_sem for large mapping

Message ID 1530311985-31251-5-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yang Shi June 29, 2018, 10:39 p.m. UTC
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem from very beginning to all the way
down to the end, and doesn't release it in the middle. When unmapping
large mapping, it may take long time (take ~18 seconds to unmap 320GB
mapping with every single page mapped on an idle machine).

It is because munmap holds mmap_sem from very beginning to all the way
down to the end, and doesn't release it in the middle. When unmapping
large mapping, it may take long time (take ~18 seconds to unmap 320GB
mapping with every single page mapped on an idle machine).

Zapping pages is the most time consuming part, according to the
suggestion from Michal Hock [1], zapping pages can be done with holding
read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
the page fault to VM_DEAD vma will trigger SIGSEGV.

Define large mapping size thresh as PUD size or 1GB, just zap pages with
read mmap_sem for mappings which are >= thresh value.

If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
fallback to regular path since unmapping those mappings need acquire
write mmap_sem.

For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites remain intact for stability
reason.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

With the patched kernel, write mmap_sem hold time is dropped to us level
from second.

[1] https://lwn.net/Articles/753269/

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 134 insertions(+), 2 deletions(-)

Comments

Andrew Morton June 30, 2018, 1:28 a.m. UTC | #1
On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.
> 
> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.

Perhaps it would be better to treat all mappings in the fashion,
regardless of size.  Simpler code, lesser mmap_sem hold times, much
better testing coverage.  Is there any particular downside to doing
this?

> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.

So we'll still get huge latencies an softlockup warnings for some
usecases.  This is a problem!

> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.
> 
> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Where is this "regression and performance data"?  Something mising from
the changelog?

> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.
> 
> [1] https://lwn.net/Articles/753269/
> 
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>  	return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> +#endif
> +
> +/* Unmap large mapping early with acquiring read mmap_sem */
> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
> +			       size_t len, struct list_head *uf)

Can we have a comment describing what `uf' is and what it does? (at least)

> +{
> +	unsigned long end = 0;
> +	struct vm_area_struct *vma = NULL, *prev, *tmp;

`tmp' is a poor choice of identifier - it doesn't communicate either
the variable's type nor its purpose.

Perhaps rename vma to start_vma(?) and rename tmp to vma?

And declaring start_vma to be const would be a nice readability addition.

> +	bool success = false;
> +	int ret = 0;
> +
> +	if (!munmap_addr_sanity(start, len))
> +		return -EINVAL;
> +
> +	len = PAGE_ALIGN(len);
> +
> +	end = start + len;
> +
> +	/* Just deal with uf in regular path */
> +	if (unlikely(uf))
> +		goto regular_path;
> +
> +	if (len >= LARGE_MAP_THRESH) {
> +		/*
> +		 * need write mmap_sem to split vma and set VM_DEAD flag
> +		 * splitting vma up-front to save PITA to clean if it is failed
> +		 */
> +		down_write(&mm->mmap_sem);
> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
> +		if (ret != 1) {
> +			up_write(&mm->mmap_sem);
> +			return ret;

Can just use `goto out' here, and that would avoid the unpleasing use
of a deeply eembded `return'.

> +		}
> +		/* This ret value might be returned, so reset it */
> +		ret = 0;
> +
> +		/*
> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
> +		 * flag set or has uprobes set, need acquire write map_sem,
> +		 * so skip them in early zap. Just deal with such mapping in
> +		 * regular path.

For each case, please describe *why* mmap_sem must be held for writing.

> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
> +		 */
> +		tmp = vma;
> +		while (tmp && tmp->vm_start < end) {
> +			if (!can_madv_dontneed_vma(tmp) ||
> +			    vma_has_uprobes(tmp, start, end)) {
> +				up_write(&mm->mmap_sem);
> +				goto regular_path;
> +			}
> +			tmp = tmp->vm_next;
> +		}
> +		/*
> +		 * set VM_DEAD flag before tear down them.
> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
> +		 */
> +		tmp = vma;
> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
> +			tmp->vm_flags |= VM_DEAD;
> +		up_write(&mm->mmap_sem);
> +
> +		/* zap mappings with read mmap_sem */
> +		down_read(&mm->mmap_sem);

Use downgrade_write()?

> +		zap_page_range(vma, start, len);
> +		/* indicates early zap is success */
> +		success = true;
> +		up_read(&mm->mmap_sem);
> +	}
> +
> +regular_path:
> +	/* hold write mmap_sem for vma manipulation or regular path */
> +	if (down_write_killable(&mm->mmap_sem))
> +		return -EINTR;

Why is this _killable() while the preceding down_write() was not?

> +	if (success) {
> +		/* vmas have been zapped, here clean up pgtable and vmas */
> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
> +		struct mmu_gather tlb;
> +		tlb_gather_mmu(&tlb, mm, start, end);
> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
> +		tlb_finish_mmu(&tlb, start, end);
> +
> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +		arch_unmap(mm, vma, start, end);
> +		remove_vma_list(mm, vma);
> +	} else {
> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
> +		if (vma) {
> +			if (unlikely(uf)) {
> +				int ret = userfaultfd_unmap_prep(vma, start,
> +								 end, uf);
> +				if (ret)
> +					goto out;

Bug?  This `ret' shadows the other `ret' in this function.

> +			}
> +			if (mm->locked_vm) {
> +				tmp = vma;
> +				while (tmp && tmp->vm_start < end) {
> +					if (tmp->vm_flags & VM_LOCKED) {
> +						mm->locked_vm -= vma_pages(tmp);
> +						munlock_vma_pages_all(tmp);
> +					}
> +					tmp = tmp->vm_next;
> +				}
> +			}
> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +			unmap_region(mm, vma, prev, start, end);
> +			remove_vma_list(mm, vma);
> +		} else
> +			/* When mapping size < LARGE_MAP_THRESH */
> +			ret = do_munmap(mm, start, len, uf);
> +	}
> +
> +out:
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
>  /* Munmap is split into 2 main parts -- this part which finds
>   * what needs doing, and the areas themselves, which do the
>   * work.  This now handles partial unmappings.
> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	return 0;
>  }
>  
> +static int vm_munmap_zap_early(unsigned long start, size_t len)
> +{
> +	int ret;
> +	struct mm_struct *mm = current->mm;
> +	LIST_HEAD(uf);
> +
> +	ret = do_munmap_zap_early(mm, start, len, &uf);
> +	userfaultfd_unmap_complete(mm, &uf);
> +	return ret;
> +}
> +
>  int vm_munmap(unsigned long start, size_t len)
>  {
>  	int ret;
> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>  {
>  	profile_munmap(addr);
> -	return vm_munmap(addr, len);
> +	return vm_munmap_zap_early(addr, len);
>  }
>  
> -
>  /*
>   * Emulation of deprecated remap_file_pages() syscall.
>   */
> -- 
> 1.8.3.1
Andrew Morton June 30, 2018, 1:35 a.m. UTC | #2
On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:


And...

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 87dcf83..d61e08b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>  	return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> +#endif

So this assumes that 32-bit machines cannot have 1GB mappings (fair
enough) and this is the sole means by which we avoid falling into the
"len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at
least because for such machines, VM_DEAD is zero.

This is rather ugly and fragile.  And, I guess, explains why we can't
give all mappings this treatment: 32-bit machines can't do it.  And
we're adding a bunch of code to 32-bit kernels which will never be
executed.

I'm thinking it would be better to be much more explicit with "#ifdef
CONFIG_64BIT" in this code, rather than relying upon the above magic.

But I tend to think that the fact that we haven't solved anything on
locked vmas or on uprobed mappings is a shostopper for the whole
approach :(
Yang Shi June 30, 2018, 2:10 a.m. UTC | #3
On 6/29/18 6:28 PM, Andrew Morton wrote:
> On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> Zapping pages is the most time consuming part, according to the
>> suggestion from Michal Hock [1], zapping pages can be done with holding
>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
>> the page fault to VM_DEAD vma will trigger SIGSEGV.
>>
>> Define large mapping size thresh as PUD size or 1GB, just zap pages with
>> read mmap_sem for mappings which are >= thresh value.
> Perhaps it would be better to treat all mappings in the fashion,
> regardless of size.  Simpler code, lesser mmap_sem hold times, much
> better testing coverage.  Is there any particular downside to doing
> this?

Actually, my testing uses 4K size to improve the coverage. The only 
downside AFAICS is the cost of multiple acquiring/releasing lock may 
outpace the benefits.

>
>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
>> fallback to regular path since unmapping those mappings need acquire
>> write mmap_sem.
> So we'll still get huge latencies an softlockup warnings for some
> usecases.  This is a problem!

Because unmapping such area needs modify vm_flags, which need write 
mmap_sem, in current code base.

>
>> For the time being, just do this in munmap syscall path. Other
>> vm_munmap() or do_munmap() call sites remain intact for stability
>> reason.
>>
>> The below is some regression and performance data collected on a machine
>> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> Where is this "regression and performance data"?  Something mising from
> the changelog?

oops, it might be removed inadvertently.

>> With the patched kernel, write mmap_sem hold time is dropped to us level
>> from second.
>>
>> [1] https://lwn.net/Articles/753269/
>>
>> ...
>>
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>>   	return 1;
>>   }
>>   
>> +/* Consider PUD size or 1GB mapping as large mapping */
>> +#ifdef HPAGE_PUD_SIZE
>> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
>> +#else
>> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
>> +#endif
>> +
>> +/* Unmap large mapping early with acquiring read mmap_sem */
>> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
>> +			       size_t len, struct list_head *uf)
> Can we have a comment describing what `uf' is and what it does? (at least)

Sure.

>
>> +{
>> +	unsigned long end = 0;
>> +	struct vm_area_struct *vma = NULL, *prev, *tmp;
> `tmp' is a poor choice of identifier - it doesn't communicate either
> the variable's type nor its purpose.
>
> Perhaps rename vma to start_vma(?) and rename tmp to vma?
>
> And declaring start_vma to be const would be a nice readability addition.

Sounds ok.

>
>> +	bool success = false;
>> +	int ret = 0;
>> +
>> +	if (!munmap_addr_sanity(start, len))
>> +		return -EINVAL;
>> +
>> +	len = PAGE_ALIGN(len);
>> +
>> +	end = start + len;
>> +
>> +	/* Just deal with uf in regular path */
>> +	if (unlikely(uf))
>> +		goto regular_path;
>> +
>> +	if (len >= LARGE_MAP_THRESH) {
>> +		/*
>> +		 * need write mmap_sem to split vma and set VM_DEAD flag
>> +		 * splitting vma up-front to save PITA to clean if it is failed
>> +		 */
>> +		down_write(&mm->mmap_sem);
>> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
>> +		if (ret != 1) {
>> +			up_write(&mm->mmap_sem);
>> +			return ret;
> Can just use `goto out' here, and that would avoid the unpleasing use
> of a deeply eembded `return'.

Yes.

>
>> +		}
>> +		/* This ret value might be returned, so reset it */
>> +		ret = 0;
>> +
>> +		/*
>> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
>> +		 * flag set or has uprobes set, need acquire write map_sem,
>> +		 * so skip them in early zap. Just deal with such mapping in
>> +		 * regular path.
> For each case, please describe *why* mmap_sem must be held for writing.

Sure.

>
>> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
>> +		 */
>> +		tmp = vma;
>> +		while (tmp && tmp->vm_start < end) {
>> +			if (!can_madv_dontneed_vma(tmp) ||
>> +			    vma_has_uprobes(tmp, start, end)) {
>> +				up_write(&mm->mmap_sem);
>> +				goto regular_path;
>> +			}
>> +			tmp = tmp->vm_next;
>> +		}
>> +		/*
>> +		 * set VM_DEAD flag before tear down them.
>> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
>> +		 */
>> +		tmp = vma;
>> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
>> +			tmp->vm_flags |= VM_DEAD;
>> +		up_write(&mm->mmap_sem);
>> +
>> +		/* zap mappings with read mmap_sem */
>> +		down_read(&mm->mmap_sem);
> Use downgrade_write()?

Aha, thanks for reminding. I forgot this. Just focused on "upgrade_read" 
part suggested by Michal.

>
>> +		zap_page_range(vma, start, len);
>> +		/* indicates early zap is success */
>> +		success = true;
>> +		up_read(&mm->mmap_sem);
>> +	}
>> +
>> +regular_path:
>> +	/* hold write mmap_sem for vma manipulation or regular path */
>> +	if (down_write_killable(&mm->mmap_sem))
>> +		return -EINTR;
> Why is this _killable() while the preceding down_write() was not?

This is copied from the original do_munmap(). The preceding one could be 
_killable() too.

>
>> +	if (success) {
>> +		/* vmas have been zapped, here clean up pgtable and vmas */
>> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
>> +		struct mmu_gather tlb;
>> +		tlb_gather_mmu(&tlb, mm, start, end);
>> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
>> +		tlb_finish_mmu(&tlb, start, end);
>> +
>> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +		arch_unmap(mm, vma, start, end);
>> +		remove_vma_list(mm, vma);
>> +	} else {
>> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
>> +		if (vma) {
>> +			if (unlikely(uf)) {
>> +				int ret = userfaultfd_unmap_prep(vma, start,
>> +								 end, uf);
>> +				if (ret)
>> +					goto out;
> Bug?  This `ret' shadows the other `ret' in this function.

oops, it should just use the same "ret".

Thanks,
Yang

>
>> +			}
>> +			if (mm->locked_vm) {
>> +				tmp = vma;
>> +				while (tmp && tmp->vm_start < end) {
>> +					if (tmp->vm_flags & VM_LOCKED) {
>> +						mm->locked_vm -= vma_pages(tmp);
>> +						munlock_vma_pages_all(tmp);
>> +					}
>> +					tmp = tmp->vm_next;
>> +				}
>> +			}
>> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +			unmap_region(mm, vma, prev, start, end);
>> +			remove_vma_list(mm, vma);
>> +		} else
>> +			/* When mapping size < LARGE_MAP_THRESH */
>> +			ret = do_munmap(mm, start, len, uf);
>> +	}
>> +
>> +out:
>> +	up_write(&mm->mmap_sem);
>> +	return ret;
>> +}
>> +
>>   /* Munmap is split into 2 main parts -- this part which finds
>>    * what needs doing, and the areas themselves, which do the
>>    * work.  This now handles partial unmappings.
>> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>   	return 0;
>>   }
>>   
>> +static int vm_munmap_zap_early(unsigned long start, size_t len)
>> +{
>> +	int ret;
>> +	struct mm_struct *mm = current->mm;
>> +	LIST_HEAD(uf);
>> +
>> +	ret = do_munmap_zap_early(mm, start, len, &uf);
>> +	userfaultfd_unmap_complete(mm, &uf);
>> +	return ret;
>> +}
>> +
>>   int vm_munmap(unsigned long start, size_t len)
>>   {
>>   	int ret;
>> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>>   SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>   {
>>   	profile_munmap(addr);
>> -	return vm_munmap(addr, len);
>> +	return vm_munmap_zap_early(addr, len);
>>   }
>>   
>> -
>>   /*
>>    * Emulation of deprecated remap_file_pages() syscall.
>>    */
>> -- 
>> 1.8.3.1
Yang Shi June 30, 2018, 2:28 a.m. UTC | #4
On 6/29/18 6:35 PM, Andrew Morton wrote:
> On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
> And...
>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 87dcf83..d61e08b 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>>   	return 1;
>>   }
>>   
>> +/* Consider PUD size or 1GB mapping as large mapping */
>> +#ifdef HPAGE_PUD_SIZE
>> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
>> +#else
>> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
>> +#endif
> So this assumes that 32-bit machines cannot have 1GB mappings (fair
> enough) and this is the sole means by which we avoid falling into the
> "len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at
> least because for such machines, VM_DEAD is zero.
>
> This is rather ugly and fragile.  And, I guess, explains why we can't
> give all mappings this treatment: 32-bit machines can't do it.  And
> we're adding a bunch of code to 32-bit kernels which will never be
> executed.
>
> I'm thinking it would be better to be much more explicit with "#ifdef
> CONFIG_64BIT" in this code, rather than relying upon the above magic.
>
> But I tend to think that the fact that we haven't solved anything on
> locked vmas or on uprobed mappings is a shostopper for the whole
> approach :(

I agree it is not that perfect. But, it still could improve the most use 
cases.

For the locked vmas and hugetlb vmas, unmapping operations need modify 
vm_flags. But, I'm wondering we might be able to separate unmap and 
vm_flags update. Because we know they will be unmapped right away, the 
vm_flags might be able to be updated in write mmap_sem critical section 
before the actual unmap is called or after it. This is just off the top 
of my head.

For uprobed mappings, I'm not sure how vital it is to this case.

Thanks,
Yang

>
Andrew Morton June 30, 2018, 3:15 a.m. UTC | #5
On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> 
> 
> > we're adding a bunch of code to 32-bit kernels which will never be
> > executed.
> >
> > I'm thinking it would be better to be much more explicit with "#ifdef
> > CONFIG_64BIT" in this code, rather than relying upon the above magic.
> >
> > But I tend to think that the fact that we haven't solved anything on
> > locked vmas or on uprobed mappings is a shostopper for the whole
> > approach :(
> 
> I agree it is not that perfect. But, it still could improve the most use 
> cases.

Well, those unaddressed usecases will need to be fixed at some point. 
What's our plan for that?

Would one of your earlier designs have addressed all usecases?  I
expect the dumb unmap-a-little-bit-at-a-time approach would have?

> For the locked vmas and hugetlb vmas, unmapping operations need modify 
> vm_flags. But, I'm wondering we might be able to separate unmap and 
> vm_flags update. Because we know they will be unmapped right away, the 
> vm_flags might be able to be updated in write mmap_sem critical section 
> before the actual unmap is called or after it. This is just off the top 
> of my head.
> 
> For uprobed mappings, I'm not sure how vital it is to this case.
> 
> Thanks,
> Yang
> 
> >
Yang Shi June 30, 2018, 4:26 a.m. UTC | #6
On 6/29/18 8:15 PM, Andrew Morton wrote:
> On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>>
>>> we're adding a bunch of code to 32-bit kernels which will never be
>>> executed.
>>>
>>> I'm thinking it would be better to be much more explicit with "#ifdef
>>> CONFIG_64BIT" in this code, rather than relying upon the above magic.
>>>
>>> But I tend to think that the fact that we haven't solved anything on
>>> locked vmas or on uprobed mappings is a shostopper for the whole
>>> approach :(
>> I agree it is not that perfect. But, it still could improve the most use
>> cases.
> Well, those unaddressed usecases will need to be fixed at some point.

Yes, definitely.

> What's our plan for that?

As I mentioned in the earlier email, locked and hugetlb cases might be 
able to be solved by separating vm_flags update and actual unmap. I will 
look into it further later.

 From my point of view, uprobe mapping sounds not that vital.

>
> Would one of your earlier designs have addressed all usecases?  I
> expect the dumb unmap-a-little-bit-at-a-time approach would have?

Yes. The v1 design does unmap with holding write map_sem. So, the 
vm_flags update is not a problem.

Thanks,
Yang

>
>> For the locked vmas and hugetlb vmas, unmapping operations need modify
>> vm_flags. But, I'm wondering we might be able to separate unmap and
>> vm_flags update. Because we know they will be unmapped right away, the
>> vm_flags might be able to be updated in write mmap_sem critical section
>> before the actual unmap is called or after it. This is just off the top
>> of my head.
>>
>> For uprobed mappings, I'm not sure how vital it is to this case.
>>
>> Thanks,
>> Yang
>>
Kirill A . Shutemov July 2, 2018, 12:33 p.m. UTC | #7
On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.
> 
> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.
> 
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.
> 
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.
> 
> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> 
> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.
> 
> [1] https://lwn.net/Articles/753269/
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 134 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 87dcf83..d61e08b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>  	return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> +#endif

PUD_SIZE is defined everywhere.

> +
> +/* Unmap large mapping early with acquiring read mmap_sem */
> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
> +			       size_t len, struct list_head *uf)
> +{
> +	unsigned long end = 0;
> +	struct vm_area_struct *vma = NULL, *prev, *tmp;
> +	bool success = false;
> +	int ret = 0;
> +
> +	if (!munmap_addr_sanity(start, len))
> +		return -EINVAL;
> +
> +	len = PAGE_ALIGN(len);
> +
> +	end = start + len;
> +
> +	/* Just deal with uf in regular path */
> +	if (unlikely(uf))
> +		goto regular_path;
> +
> +	if (len >= LARGE_MAP_THRESH) {
> +		/*
> +		 * need write mmap_sem to split vma and set VM_DEAD flag
> +		 * splitting vma up-front to save PITA to clean if it is failed

What errors do you talk about? ENOMEM on VMA split? Anything else?

> +		 */
> +		down_write(&mm->mmap_sem);
> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
> +		if (ret != 1) {
> +			up_write(&mm->mmap_sem);
> +			return ret;
> +		}
> +		/* This ret value might be returned, so reset it */
> +		ret = 0;
> +
> +		/*
> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
> +		 * flag set or has uprobes set, need acquire write map_sem,
> +		 * so skip them in early zap. Just deal with such mapping in
> +		 * regular path.
> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
> +		 */
> +		tmp = vma;
> +		while (tmp && tmp->vm_start < end) {
> +			if (!can_madv_dontneed_vma(tmp) ||
> +			    vma_has_uprobes(tmp, start, end)) {
> +				up_write(&mm->mmap_sem);
> +				goto regular_path;
> +			}
> +			tmp = tmp->vm_next;
> +		}
> +		/*
> +		 * set VM_DEAD flag before tear down them.
> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
> +		 */
> +		tmp = vma;
> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
> +			tmp->vm_flags |= VM_DEAD;

I probably miss the explanation somewhere, but what's wrong with allowing
other thread to re-populate the VMA?

I would rather allow the VMA to be re-populated by other thread while we
are zapping the range. And later zap the range again under down_write.

It should also lead to consolidated regular path: take mmap_sem for write
and call do_munmap().

On the first path we just skip VMA we cannot deal with under
down_read(mmap_sem), regular path will take care of them.


> +		up_write(&mm->mmap_sem);
> +
> +		/* zap mappings with read mmap_sem */
> +		down_read(&mm->mmap_sem);

Yeah. There's race between up_write() and down_read().
Use downgrade, as Andrew suggested.

> +		zap_page_range(vma, start, len);
> +		/* indicates early zap is success */
> +		success = true;
> +		up_read(&mm->mmap_sem);

And here again.

This race can be avoided if we wouldn't carry vma to regular_path, but
just go directly to do_munmap().

> +	}
> +
> +regular_path:
> +	/* hold write mmap_sem for vma manipulation or regular path */
> +	if (down_write_killable(&mm->mmap_sem))
> +		return -EINTR;
> +	if (success) {
> +		/* vmas have been zapped, here clean up pgtable and vmas */
> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
> +		struct mmu_gather tlb;
> +		tlb_gather_mmu(&tlb, mm, start, end);
> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
> +		tlb_finish_mmu(&tlb, start, end);
> +
> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +		arch_unmap(mm, vma, start, end);
> +		remove_vma_list(mm, vma);
> +	} else {
> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
> +		if (vma) {
> +			if (unlikely(uf)) {
> +				int ret = userfaultfd_unmap_prep(vma, start,
> +								 end, uf);
> +				if (ret)
> +					goto out;
> +			}
> +			if (mm->locked_vm) {
> +				tmp = vma;
> +				while (tmp && tmp->vm_start < end) {
> +					if (tmp->vm_flags & VM_LOCKED) {
> +						mm->locked_vm -= vma_pages(tmp);
> +						munlock_vma_pages_all(tmp);
> +					}
> +					tmp = tmp->vm_next;
> +				}
> +			}
> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +			unmap_region(mm, vma, prev, start, end);
> +			remove_vma_list(mm, vma);
> +		} else
> +			/* When mapping size < LARGE_MAP_THRESH */
> +			ret = do_munmap(mm, start, len, uf);
> +	}
> +
> +out:
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
>  /* Munmap is split into 2 main parts -- this part which finds
>   * what needs doing, and the areas themselves, which do the
>   * work.  This now handles partial unmappings.
> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	return 0;
>  }
>  
> +static int vm_munmap_zap_early(unsigned long start, size_t len)
> +{
> +	int ret;
> +	struct mm_struct *mm = current->mm;
> +	LIST_HEAD(uf);
> +
> +	ret = do_munmap_zap_early(mm, start, len, &uf);
> +	userfaultfd_unmap_complete(mm, &uf);
> +	return ret;
> +}
> +
>  int vm_munmap(unsigned long start, size_t len)
>  {
>  	int ret;
> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>  {
>  	profile_munmap(addr);
> -	return vm_munmap(addr, len);
> +	return vm_munmap_zap_early(addr, len);
>  }
>  
> -
>  /*
>   * Emulation of deprecated remap_file_pages() syscall.
>   */
> -- 
> 1.8.3.1
>
Michal Hocko July 2, 2018, 12:49 p.m. UTC | #8
On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
[...]
> I probably miss the explanation somewhere, but what's wrong with allowing
> other thread to re-populate the VMA?

We have discussed that earlier and it boils down to how is racy access
to munmap supposed to behave. Right now we have either the original
content or SEGV. If we allow to simply madvise_dontneed before real
unmap we could get a new page as well. There might be (quite broken I
would say) user space code that would simply corrupt data silently that
way.
Michal Hocko July 2, 2018, 1:53 p.m. UTC | #9
On Sat 30-06-18 06:39:44, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding

s@Hock@Hocko@

> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.

This really deserves an explanation why the all dance is really needed.

It would be also good to mention how do you achieve the overal
consistency. E.g. you are dropping mmap_sem and then re-taking it for
write. What if any pending write lock succeeds and modify the address
space? Does it matter, why if not? 

> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.
> 
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.
> 
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.

What are those stability reasons?

> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> 
> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.

I haven't read through the implemenation carefuly TBH but the changelog
needs quite some work to explain the solution and resulting semantic of
munmap after the change.
Michal Hocko July 2, 2018, 2:05 p.m. UTC | #10
On Fri 29-06-18 20:15:47, Andrew Morton wrote:
[...]
> Would one of your earlier designs have addressed all usecases?  I
> expect the dumb unmap-a-little-bit-at-a-time approach would have?

It has been already pointed out that this will not work. You simply
cannot drop the mmap_sem during unmap because another thread could
change the address space under your feet. So you need some form of
VM_DEAD and handle concurrent and conflicting address space operations.
Yang Shi July 2, 2018, 5:07 p.m. UTC | #11
On 7/2/18 6:53 AM, Michal Hocko wrote:
> On Sat 30-06-18 06:39:44, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> Zapping pages is the most time consuming part, according to the
>> suggestion from Michal Hock [1], zapping pages can be done with holding
> s@Hock@Hocko@

Sorry for the wrong spelling.

>
>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
>> the page fault to VM_DEAD vma will trigger SIGSEGV.
> This really deserves an explanation why the all dance is really needed.
>
> It would be also good to mention how do you achieve the overal
> consistency. E.g. you are dropping mmap_sem and then re-taking it for
> write. What if any pending write lock succeeds and modify the address
> space? Does it matter, why if not?

Sure.

>
>> Define large mapping size thresh as PUD size or 1GB, just zap pages with
>> read mmap_sem for mappings which are >= thresh value.
>>
>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
>> fallback to regular path since unmapping those mappings need acquire
>> write mmap_sem.
>>
>> For the time being, just do this in munmap syscall path. Other
>> vm_munmap() or do_munmap() call sites remain intact for stability
>> reason.
> What are those stability reasons?

mmap() and mremap() may call do_munmap() as well, so it may introduce 
more race condition if they use the zap early version of do_munmap too. 
They would have much more chances to take mmap_sem to change address 
space and cause conflict.

And, it looks they are not the vital source of long period of write 
mmap_sem hold. So, it sounds not worth making things more complicated 
for the time being.

>
>> The below is some regression and performance data collected on a machine
>> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
>>
>> With the patched kernel, write mmap_sem hold time is dropped to us level
>> from second.
> I haven't read through the implemenation carefuly TBH but the changelog
> needs quite some work to explain the solution and resulting semantic of
> munmap after the change.

Thanks for the suggestion. Will polish the changelog.

Yang
Yang Shi July 2, 2018, 5:19 p.m. UTC | #12
On 7/2/18 5:33 AM, Kirill A. Shutemov wrote:
> On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> Zapping pages is the most time consuming part, according to the
>> suggestion from Michal Hock [1], zapping pages can be done with holding
>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
>> the page fault to VM_DEAD vma will trigger SIGSEGV.
>>
>> Define large mapping size thresh as PUD size or 1GB, just zap pages with
>> read mmap_sem for mappings which are >= thresh value.
>>
>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
>> fallback to regular path since unmapping those mappings need acquire
>> write mmap_sem.
>>
>> For the time being, just do this in munmap syscall path. Other
>> vm_munmap() or do_munmap() call sites remain intact for stability
>> reason.
>>
>> The below is some regression and performance data collected on a machine
>> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
>>
>> With the patched kernel, write mmap_sem hold time is dropped to us level
>> from second.
>>
>> [1] https://lwn.net/Articles/753269/
>>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 134 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 87dcf83..d61e08b 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>>   	return 1;
>>   }
>>   
>> +/* Consider PUD size or 1GB mapping as large mapping */
>> +#ifdef HPAGE_PUD_SIZE
>> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
>> +#else
>> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
>> +#endif
> PUD_SIZE is defined everywhere.

If THP is defined, otherwise it is:

#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

>
>> +
>> +/* Unmap large mapping early with acquiring read mmap_sem */
>> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
>> +			       size_t len, struct list_head *uf)
>> +{
>> +	unsigned long end = 0;
>> +	struct vm_area_struct *vma = NULL, *prev, *tmp;
>> +	bool success = false;
>> +	int ret = 0;
>> +
>> +	if (!munmap_addr_sanity(start, len))
>> +		return -EINVAL;
>> +
>> +	len = PAGE_ALIGN(len);
>> +
>> +	end = start + len;
>> +
>> +	/* Just deal with uf in regular path */
>> +	if (unlikely(uf))
>> +		goto regular_path;
>> +
>> +	if (len >= LARGE_MAP_THRESH) {
>> +		/*
>> +		 * need write mmap_sem to split vma and set VM_DEAD flag
>> +		 * splitting vma up-front to save PITA to clean if it is failed
> What errors do you talk about? ENOMEM on VMA split? Anything else?

Yes, ENOMEM on vma split.

>
>> +		 */
>> +		down_write(&mm->mmap_sem);
>> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
>> +		if (ret != 1) {
>> +			up_write(&mm->mmap_sem);
>> +			return ret;
>> +		}
>> +		/* This ret value might be returned, so reset it */
>> +		ret = 0;
>> +
>> +		/*
>> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
>> +		 * flag set or has uprobes set, need acquire write map_sem,
>> +		 * so skip them in early zap. Just deal with such mapping in
>> +		 * regular path.
>> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
>> +		 */
>> +		tmp = vma;
>> +		while (tmp && tmp->vm_start < end) {
>> +			if (!can_madv_dontneed_vma(tmp) ||
>> +			    vma_has_uprobes(tmp, start, end)) {
>> +				up_write(&mm->mmap_sem);
>> +				goto regular_path;
>> +			}
>> +			tmp = tmp->vm_next;
>> +		}
>> +		/*
>> +		 * set VM_DEAD flag before tear down them.
>> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
>> +		 */
>> +		tmp = vma;
>> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
>> +			tmp->vm_flags |= VM_DEAD;
> I probably miss the explanation somewhere, but what's wrong with allowing
> other thread to re-populate the VMA?
>
> I would rather allow the VMA to be re-populated by other thread while we
> are zapping the range. And later zap the range again under down_write.
>
> It should also lead to consolidated regular path: take mmap_sem for write
> and call do_munmap().
>
> On the first path we just skip VMA we cannot deal with under
> down_read(mmap_sem), regular path will take care of them.
>
>
>> +		up_write(&mm->mmap_sem);
>> +
>> +		/* zap mappings with read mmap_sem */
>> +		down_read(&mm->mmap_sem);
> Yeah. There's race between up_write() and down_read().
> Use downgrade, as Andrew suggested.
>
>> +		zap_page_range(vma, start, len);
>> +		/* indicates early zap is success */
>> +		success = true;
>> +		up_read(&mm->mmap_sem);
> And here again.
>
> This race can be avoided if we wouldn't carry vma to regular_path, but
> just go directly to do_munmap().

Thanks, Kirill. Yes, I did think about re-validating vmas before. This 
sounds reasonable to avoid the race. Although we spend more time in 
re-looking up vmas, but it should be very short, and the duplicate zap 
should be very short too.

Yang

>
>> +	}
>> +
>> +regular_path:
>> +	/* hold write mmap_sem for vma manipulation or regular path */
>> +	if (down_write_killable(&mm->mmap_sem))
>> +		return -EINTR;
>> +	if (success) {
>> +		/* vmas have been zapped, here clean up pgtable and vmas */
>> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
>> +		struct mmu_gather tlb;
>> +		tlb_gather_mmu(&tlb, mm, start, end);
>> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
>> +		tlb_finish_mmu(&tlb, start, end);
>> +
>> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +		arch_unmap(mm, vma, start, end);
>> +		remove_vma_list(mm, vma);
>> +	} else {
>> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
>> +		if (vma) {
>> +			if (unlikely(uf)) {
>> +				int ret = userfaultfd_unmap_prep(vma, start,
>> +								 end, uf);
>> +				if (ret)
>> +					goto out;
>> +			}
>> +			if (mm->locked_vm) {
>> +				tmp = vma;
>> +				while (tmp && tmp->vm_start < end) {
>> +					if (tmp->vm_flags & VM_LOCKED) {
>> +						mm->locked_vm -= vma_pages(tmp);
>> +						munlock_vma_pages_all(tmp);
>> +					}
>> +					tmp = tmp->vm_next;
>> +				}
>> +			}
>> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +			unmap_region(mm, vma, prev, start, end);
>> +			remove_vma_list(mm, vma);
>> +		} else
>> +			/* When mapping size < LARGE_MAP_THRESH */
>> +			ret = do_munmap(mm, start, len, uf);
>> +	}
>> +
>> +out:
>> +	up_write(&mm->mmap_sem);
>> +	return ret;
>> +}
>> +
>>   /* Munmap is split into 2 main parts -- this part which finds
>>    * what needs doing, and the areas themselves, which do the
>>    * work.  This now handles partial unmappings.
>> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>   	return 0;
>>   }
>>   
>> +static int vm_munmap_zap_early(unsigned long start, size_t len)
>> +{
>> +	int ret;
>> +	struct mm_struct *mm = current->mm;
>> +	LIST_HEAD(uf);
>> +
>> +	ret = do_munmap_zap_early(mm, start, len, &uf);
>> +	userfaultfd_unmap_complete(mm, &uf);
>> +	return ret;
>> +}
>> +
>>   int vm_munmap(unsigned long start, size_t len)
>>   {
>>   	int ret;
>> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>>   SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>   {
>>   	profile_munmap(addr);
>> -	return vm_munmap(addr, len);
>> +	return vm_munmap_zap_early(addr, len);
>>   }
>>   
>> -
>>   /*
>>    * Emulation of deprecated remap_file_pages() syscall.
>>    */
>> -- 
>> 1.8.3.1
>>
Andrew Morton July 2, 2018, 8:48 p.m. UTC | #13
On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> [...]
> > Would one of your earlier designs have addressed all usecases?  I
> > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> 
> It has been already pointed out that this will not work.

I said "one of".  There were others.

> You simply
> cannot drop the mmap_sem during unmap because another thread could
> change the address space under your feet. So you need some form of
> VM_DEAD and handle concurrent and conflicting address space operations.

Unclear that this is a problem.  If a thread does an unmap of a range
of virtual address space, there's no guarantee that upon return some
other thread has not already mapped new stuff into that address range. 
So what's changed?
Yang Shi July 3, 2018, 12:01 a.m. UTC | #14
On 6/29/18 9:26 PM, Yang Shi wrote:
>
>
> On 6/29/18 8:15 PM, Andrew Morton wrote:
>> On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi 
>> <yang.shi@linux.alibaba.com> wrote:
>>
>>>
>>>> we're adding a bunch of code to 32-bit kernels which will never be
>>>> executed.
>>>>
>>>> I'm thinking it would be better to be much more explicit with "#ifdef
>>>> CONFIG_64BIT" in this code, rather than relying upon the above magic.
>>>>
>>>> But I tend to think that the fact that we haven't solved anything on
>>>> locked vmas or on uprobed mappings is a shostopper for the whole
>>>> approach :(
>>> I agree it is not that perfect. But, it still could improve the most 
>>> use
>>> cases.
>> Well, those unaddressed usecases will need to be fixed at some point.
>
> Yes, definitely.
>
>> What's our plan for that?
>
> As I mentioned in the earlier email, locked and hugetlb cases might be 
> able to be solved by separating vm_flags update and actual unmap. I 
> will look into it further later.

By looking into this further, I think both mlocked and hugetlb vmas can 
be handled.

For mlocked vmas, it is easy since we acquires write mmap_sem before 
unmapping, so VM_LOCK flags can be cleared here then unmap, just like 
what the regular path does.

For hugetlb vmas, the VM_MAYSHARE flag is just checked by 
huge_pmd_share() in hugetlb_fault()->huge_pte_alloc(), another call site 
is dup_mm()->copy_page_range()->copy_hugetlb_page_range(), we don't care 
this call chain in this case.

So we may expand VM_DEAD to hugetlb_fault().  Michal suggested to check 
VM_DEAD in check_stable_address_space(), so it would be called in 
hugetlb_fault() too (not in current code), then the page fault handler 
would bail out before huge_pte_alloc() is called.

With this trick, we don't have to care about when the vm_flags is 
updated, we can unmap hugetlb vmas in read mmap_sem critical section, 
then update the vm_flags with write mmap_sem held or before the unmap.

Yang

>
> From my point of view, uprobe mapping sounds not that vital.
>
>>
>> Would one of your earlier designs have addressed all usecases? I
>> expect the dumb unmap-a-little-bit-at-a-time approach would have?
>
> Yes. The v1 design does unmap with holding write map_sem. So, the 
> vm_flags update is not a problem.
>
> Thanks,
> Yang
>
>>
>>> For the locked vmas and hugetlb vmas, unmapping operations need modify
>>> vm_flags. But, I'm wondering we might be able to separate unmap and
>>> vm_flags update. Because we know they will be unmapped right away, the
>>> vm_flags might be able to be updated in write mmap_sem critical section
>>> before the actual unmap is called or after it. This is just off the top
>>> of my head.
>>>
>>> For uprobed mappings, I'm not sure how vital it is to this case.
>>>
>>> Thanks,
>>> Yang
>>>
>
Michal Hocko July 3, 2018, 6:09 a.m. UTC | #15
On Mon 02-07-18 13:48:45, Andrew Morton wrote:
> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> > [...]
> > > Would one of your earlier designs have addressed all usecases?  I
> > > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> > 
> > It has been already pointed out that this will not work.
> 
> I said "one of".  There were others.

Well, I was aware only about two potential solutions. Either do the
heavy lifting under the shared lock and do the rest with the exlusive
one and this, drop the lock per parts. Maybe I have missed others?

> > You simply
> > cannot drop the mmap_sem during unmap because another thread could
> > change the address space under your feet. So you need some form of
> > VM_DEAD and handle concurrent and conflicting address space operations.
> 
> Unclear that this is a problem.  If a thread does an unmap of a range
> of virtual address space, there's no guarantee that upon return some
> other thread has not already mapped new stuff into that address range. 
> So what's changed?

Well, consider the following scenario:
Thread A = calling mmap(NULL, sizeA)
Thread B = calling munmap(addr, sizeB)

They do not use any external synchronization and rely on the atomic
munmap. Thread B only munmaps range that it knows belongs to it (e.g.
called mmap in the past). It should be clear that ThreadA should not
get an address from the addr, sizeB range, right? In the most simple case
it will not happen. But let's say that the addr, sizeB range has
unmapped holes for what ever reasons. Now anytime munmap drops the
exclusive lock after handling one VMA, Thread A might find its sizeA
range and use it. ThreadB then might remove this new range as soon as it
gets its exclusive lock again.

Is such a code safe? No it is not and I would call it fragile at best
but people tend to do weird things and atomic munmap behavior is
something they can easily depend on.

Another example would be an atomic address range probing by
MAP_FIXED_NOREPLACE. It would simply break for similar reasons.

I remember my attempt to make MAP_LOCKED consistent with mlock (if the
population fails then return -ENOMEM) and that required to drop the
shared mmap_sem and take it in exclusive mode (because we do not
have upgrade_read) and Linus was strongly against [1][2] for very
similar reasons. If you drop the lock you simply do not know what
happened under your feet.

[1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com
[2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com
Kirill A . Shutemov July 3, 2018, 8:07 a.m. UTC | #16
On Mon, Jul 02, 2018 at 10:19:32AM -0700, Yang Shi wrote:
> 
> 
> On 7/2/18 5:33 AM, Kirill A. Shutemov wrote:
> > On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
> > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > 300GB), the below hung task issue may happen occasionally.
> > > INFO: task ps:14018 blocked for more than 120 seconds.
> > >         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> > >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > message.
> > >   ps              D    0 14018      1 0x00000004
> > >    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > >    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > >    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > >   Call Trace:
> > >    [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > >    [<ffffffff817159e6>] schedule+0x36/0x80
> > >    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > >    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > >    [<ffffffff81717db0>] down_read+0x20/0x40
> > >    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > >    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> > >    [<ffffffff81241d87>] __vfs_read+0x37/0x150
> > >    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> > >    [<ffffffff81242266>] vfs_read+0x96/0x130
> > >    [<ffffffff812437b5>] SyS_read+0x55/0xc0
> > >    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> > > 
> > > It is because munmap holds mmap_sem from very beginning to all the way
> > > down to the end, and doesn't release it in the middle. When unmapping
> > > large mapping, it may take long time (take ~18 seconds to unmap 320GB
> > > mapping with every single page mapped on an idle machine).
> > > 
> > > It is because munmap holds mmap_sem from very beginning to all the way
> > > down to the end, and doesn't release it in the middle. When unmapping
> > > large mapping, it may take long time (take ~18 seconds to unmap 320GB
> > > mapping with every single page mapped on an idle machine).
> > > 
> > > Zapping pages is the most time consuming part, according to the
> > > suggestion from Michal Hock [1], zapping pages can be done with holding
> > > read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> > > mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> > > the page fault to VM_DEAD vma will trigger SIGSEGV.
> > > 
> > > Define large mapping size thresh as PUD size or 1GB, just zap pages with
> > > read mmap_sem for mappings which are >= thresh value.
> > > 
> > > If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> > > fallback to regular path since unmapping those mappings need acquire
> > > write mmap_sem.
> > > 
> > > For the time being, just do this in munmap syscall path. Other
> > > vm_munmap() or do_munmap() call sites remain intact for stability
> > > reason.
> > > 
> > > The below is some regression and performance data collected on a machine
> > > with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> > > 
> > > With the patched kernel, write mmap_sem hold time is dropped to us level
> > > from second.
> > > 
> > > [1] https://lwn.net/Articles/753269/
> > > 
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > > ---
> > >   mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 134 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 87dcf83..d61e08b 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
> > >   	return 1;
> > >   }
> > > +/* Consider PUD size or 1GB mapping as large mapping */
> > > +#ifdef HPAGE_PUD_SIZE
> > > +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> > > +#else
> > > +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> > > +#endif
> > PUD_SIZE is defined everywhere.
> 
> If THP is defined, otherwise it is:
> 
> #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

I'm talking about PUD_SIZE, not HPAGE_PUD_SIZE.
Kirill A . Shutemov July 3, 2018, 8:12 a.m. UTC | #17
On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> [...]
> > I probably miss the explanation somewhere, but what's wrong with allowing
> > other thread to re-populate the VMA?
> 
> We have discussed that earlier and it boils down to how is racy access
> to munmap supposed to behave. Right now we have either the original
> content or SEGV. If we allow to simply madvise_dontneed before real
> unmap we could get a new page as well. There might be (quite broken I
> would say) user space code that would simply corrupt data silently that
> way.

Okay, so we add a lot of complexity to accommodate broken userspace that
may or may not exist. Is it right? :)
Yang Shi July 3, 2018, 4:53 p.m. UTC | #18
On 7/2/18 11:09 PM, Michal Hocko wrote:
> On Mon 02-07-18 13:48:45, Andrew Morton wrote:
>> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
>>> [...]
>>>> Would one of your earlier designs have addressed all usecases?  I
>>>> expect the dumb unmap-a-little-bit-at-a-time approach would have?
>>> It has been already pointed out that this will not work.
>> I said "one of".  There were others.
> Well, I was aware only about two potential solutions. Either do the
> heavy lifting under the shared lock and do the rest with the exlusive
> one and this, drop the lock per parts. Maybe I have missed others?

There is the other one which I presented on LSFMM summit. But, actually 
it turns out that one looks very similar to the current under review one.

Yang

>
>>> You simply
>>> cannot drop the mmap_sem during unmap because another thread could
>>> change the address space under your feet. So you need some form of
>>> VM_DEAD and handle concurrent and conflicting address space operations.
>> Unclear that this is a problem.  If a thread does an unmap of a range
>> of virtual address space, there's no guarantee that upon return some
>> other thread has not already mapped new stuff into that address range.
>> So what's changed?
> Well, consider the following scenario:
> Thread A = calling mmap(NULL, sizeA)
> Thread B = calling munmap(addr, sizeB)
>
> They do not use any external synchronization and rely on the atomic
> munmap. Thread B only munmaps range that it knows belongs to it (e.g.
> called mmap in the past). It should be clear that ThreadA should not
> get an address from the addr, sizeB range, right? In the most simple case
> it will not happen. But let's say that the addr, sizeB range has
> unmapped holes for what ever reasons. Now anytime munmap drops the
> exclusive lock after handling one VMA, Thread A might find its sizeA
> range and use it. ThreadB then might remove this new range as soon as it
> gets its exclusive lock again.
>
> Is such a code safe? No it is not and I would call it fragile at best
> but people tend to do weird things and atomic munmap behavior is
> something they can easily depend on.
>
> Another example would be an atomic address range probing by
> MAP_FIXED_NOREPLACE. It would simply break for similar reasons.
>
> I remember my attempt to make MAP_LOCKED consistent with mlock (if the
> population fails then return -ENOMEM) and that required to drop the
> shared mmap_sem and take it in exclusive mode (because we do not
> have upgrade_read) and Linus was strongly against [1][2] for very
> similar reasons. If you drop the lock you simply do not know what
> happened under your feet.
>
> [1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com
> [2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com
Yang Shi July 3, 2018, 6:22 p.m. UTC | #19
On 7/2/18 11:09 PM, Michal Hocko wrote:
> On Mon 02-07-18 13:48:45, Andrew Morton wrote:
>> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
>>> [...]
>>>> Would one of your earlier designs have addressed all usecases?  I
>>>> expect the dumb unmap-a-little-bit-at-a-time approach would have?
>>> It has been already pointed out that this will not work.
>> I said "one of".  There were others.
> Well, I was aware only about two potential solutions. Either do the
> heavy lifting under the shared lock and do the rest with the exlusive
> one and this, drop the lock per parts. Maybe I have missed others?
>
>>> You simply
>>> cannot drop the mmap_sem during unmap because another thread could
>>> change the address space under your feet. So you need some form of
>>> VM_DEAD and handle concurrent and conflicting address space operations.
>> Unclear that this is a problem.  If a thread does an unmap of a range
>> of virtual address space, there's no guarantee that upon return some
>> other thread has not already mapped new stuff into that address range.
>> So what's changed?
> Well, consider the following scenario:
> Thread A = calling mmap(NULL, sizeA)
> Thread B = calling munmap(addr, sizeB)
>
> They do not use any external synchronization and rely on the atomic
> munmap. Thread B only munmaps range that it knows belongs to it (e.g.
> called mmap in the past). It should be clear that ThreadA should not
> get an address from the addr, sizeB range, right? In the most simple case
> it will not happen. But let's say that the addr, sizeB range has
> unmapped holes for what ever reasons. Now anytime munmap drops the
> exclusive lock after handling one VMA, Thread A might find its sizeA
> range and use it. ThreadB then might remove this new range as soon as it
> gets its exclusive lock again.

I'm a little bit confused here. If ThreadB already has unmapped that 
range, then ThreadA uses it. It sounds not like a problem since ThreadB 
should just go ahead to handle the next range when it gets its exclusive 
lock again, right? I don't think of why ThreadB would re-visit that 
range to remove it.

But, if ThreadA uses MAP_FIXED, it might remap some ranges, then ThreadB 
remove them. It might trigger SIGSEGV or SIGBUS, but this is not even 
guaranteed on vanilla kernel too if the application doesn't do any 
synchronization. It all depends on timing.

>
> Is such a code safe? No it is not and I would call it fragile at best
> but people tend to do weird things and atomic munmap behavior is
> something they can easily depend on.
>
> Another example would be an atomic address range probing by
> MAP_FIXED_NOREPLACE. It would simply break for similar reasons.

Yes, I agree, it could simply break MAP_FIXED_NOREPLACE.

Yang

>
> I remember my attempt to make MAP_LOCKED consistent with mlock (if the
> population fails then return -ENOMEM) and that required to drop the
> shared mmap_sem and take it in exclusive mode (because we do not
> have upgrade_read) and Linus was strongly against [1][2] for very
> similar reasons. If you drop the lock you simply do not know what
> happened under your feet.
>
> [1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com
> [2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com
Michal Hocko July 4, 2018, 8:13 a.m. UTC | #20
On Tue 03-07-18 11:22:17, Yang Shi wrote:
> 
> 
> On 7/2/18 11:09 PM, Michal Hocko wrote:
> > On Mon 02-07-18 13:48:45, Andrew Morton wrote:
> > > On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > > 
> > > > On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> > > > [...]
> > > > > Would one of your earlier designs have addressed all usecases?  I
> > > > > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> > > > It has been already pointed out that this will not work.
> > > I said "one of".  There were others.
> > Well, I was aware only about two potential solutions. Either do the
> > heavy lifting under the shared lock and do the rest with the exlusive
> > one and this, drop the lock per parts. Maybe I have missed others?
> > 
> > > > You simply
> > > > cannot drop the mmap_sem during unmap because another thread could
> > > > change the address space under your feet. So you need some form of
> > > > VM_DEAD and handle concurrent and conflicting address space operations.
> > > Unclear that this is a problem.  If a thread does an unmap of a range
> > > of virtual address space, there's no guarantee that upon return some
> > > other thread has not already mapped new stuff into that address range.
> > > So what's changed?
> > Well, consider the following scenario:
> > Thread A = calling mmap(NULL, sizeA)
> > Thread B = calling munmap(addr, sizeB)
> > 
> > They do not use any external synchronization and rely on the atomic
> > munmap. Thread B only munmaps range that it knows belongs to it (e.g.
> > called mmap in the past). It should be clear that ThreadA should not
> > get an address from the addr, sizeB range, right? In the most simple case
> > it will not happen. But let's say that the addr, sizeB range has
> > unmapped holes for what ever reasons. Now anytime munmap drops the
> > exclusive lock after handling one VMA, Thread A might find its sizeA
> > range and use it. ThreadB then might remove this new range as soon as it
> > gets its exclusive lock again.
> 
> I'm a little bit confused here. If ThreadB already has unmapped that range,
> then ThreadA uses it. It sounds not like a problem since ThreadB should just
> go ahead to handle the next range when it gets its exclusive lock again,
> right? I don't think of why ThreadB would re-visit that range to remove it.

Not if the new range overlap with the follow up range that ThreadB does.
Example

B: munmap [XXXXX]    [XXXXXX] [XXXXXXXXXX]
B: breaks the lock after processing the first vma.
A: mmap [XXXXXXXXXXXX]
B: munmap retakes the lock and revalidate from the last vm_end because
   the old vma->vm_next might be gone
B:               [XXX][XXXXX] [XXXXXXXXXX]

so you munmap part of the range. Sure you can plan some tricks and skip
over vmas that do not start above your last vma->vm_end or something
like that but I expect there are other can of worms hidden there.
diff mbox

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 87dcf83..d61e08b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@  static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
 	return 1;
 }
 
+/* Consider PUD size or 1GB mapping as large mapping */
+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
+#endif
+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+			       size_t len, struct list_head *uf)
+{
+	unsigned long end = 0;
+	struct vm_area_struct *vma = NULL, *prev, *tmp;
+	bool success = false;
+	int ret = 0;
+
+	if (!munmap_addr_sanity(start, len))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+
+	end = start + len;
+
+	/* Just deal with uf in regular path */
+	if (unlikely(uf))
+		goto regular_path;
+
+	if (len >= LARGE_MAP_THRESH) {
+		/*
+		 * need write mmap_sem to split vma and set VM_DEAD flag
+		 * splitting vma up-front to save PITA to clean if it is failed
+		 */
+		down_write(&mm->mmap_sem);
+		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
+		if (ret != 1) {
+			up_write(&mm->mmap_sem);
+			return ret;
+		}
+		/* This ret value might be returned, so reset it */
+		ret = 0;
+
+		/*
+		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
+		 * flag set or has uprobes set, need acquire write map_sem,
+		 * so skip them in early zap. Just deal with such mapping in
+		 * regular path.
+		 * Borrow can_madv_dontneed_vma() to check the conditions.
+		 */
+		tmp = vma;
+		while (tmp && tmp->vm_start < end) {
+			if (!can_madv_dontneed_vma(tmp) ||
+			    vma_has_uprobes(tmp, start, end)) {
+				up_write(&mm->mmap_sem);
+				goto regular_path;
+			}
+			tmp = tmp->vm_next;
+		}
+		/*
+		 * set VM_DEAD flag before tear down them.
+		 * page fault on VM_DEAD vma will trigger SIGSEGV.
+		 */
+		tmp = vma;
+		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
+			tmp->vm_flags |= VM_DEAD;
+		up_write(&mm->mmap_sem);
+
+		/* zap mappings with read mmap_sem */
+		down_read(&mm->mmap_sem);
+		zap_page_range(vma, start, len);
+		/* indicates early zap is success */
+		success = true;
+		up_read(&mm->mmap_sem);
+	}
+
+regular_path:
+	/* hold write mmap_sem for vma manipulation or regular path */
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+	if (success) {
+		/* vmas have been zapped, here clean up pgtable and vmas */
+		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
+		struct mmu_gather tlb;
+		tlb_gather_mmu(&tlb, mm, start, end);
+		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+			      next ? next->vm_start : USER_PGTABLES_CEILING);
+		tlb_finish_mmu(&tlb, start, end);
+
+		detach_vmas_to_be_unmapped(mm, vma, prev, end);
+		arch_unmap(mm, vma, start, end);
+		remove_vma_list(mm, vma);
+	} else {
+		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
+		if (vma) {
+			if (unlikely(uf)) {
+				int ret = userfaultfd_unmap_prep(vma, start,
+								 end, uf);
+				if (ret)
+					goto out;
+			}
+			if (mm->locked_vm) {
+				tmp = vma;
+				while (tmp && tmp->vm_start < end) {
+					if (tmp->vm_flags & VM_LOCKED) {
+						mm->locked_vm -= vma_pages(tmp);
+						munlock_vma_pages_all(tmp);
+					}
+					tmp = tmp->vm_next;
+				}
+			}
+			detach_vmas_to_be_unmapped(mm, vma, prev, end);
+			unmap_region(mm, vma, prev, start, end);
+			remove_vma_list(mm, vma);
+		} else
+			/* When mapping size < LARGE_MAP_THRESH */
+			ret = do_munmap(mm, start, len, uf);
+	}
+
+out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -2829,6 +2951,17 @@  int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+static int vm_munmap_zap_early(unsigned long start, size_t len)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	LIST_HEAD(uf);
+
+	ret = do_munmap_zap_early(mm, start, len, &uf);
+	userfaultfd_unmap_complete(mm, &uf);
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2848,10 +2981,9 @@  int vm_munmap(unsigned long start, size_t len)
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
-	return vm_munmap(addr, len);
+	return vm_munmap_zap_early(addr, len);
 }
 
-
 /*
  * Emulation of deprecated remap_file_pages() syscall.
  */