
[v2,6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

Message ID 20230830125654.21257-7-zhangpeng.00@bytedance.com (mailing list archive)
State New
Series Introduce __mt_dup() to improve the performance of fork()

Commit Message

Peng Zhang Aug. 30, 2023, 12:56 p.m. UTC
Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
directly modify the entries of VMAs in the new maple tree, which can
get better performance. The optimization effect is proportional to the
number of VMAs.

There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different numbers of VMAs.
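
The modified benchmark is not included here; a minimal sketch of the
idea (hypothetical code, not the actual modification) maps extra
one-page anonymous regions before the fork() loop, alternating the
protection so adjacent mappings cannot merge into a single VMA:

#include <sys/mman.h>
#include <unistd.h>

/* Error handling omitted for brevity. */
static void add_vmas(int nr)
{
	long page = sysconf(_SC_PAGESIZE);
	int i;

	for (i = 0; i < nr; i++)
		mmap(NULL, page, (i & 1) ? PROT_READ : PROT_NONE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}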

Below are the test numbers. There are 21 VMAs by default. The first row
indicates the number of added VMAs. The following two lines are the
number of fork() calls every 10 seconds. These numbers are different
from the test results in v1 because this time the benchmark is bound to
a CPU. This way the numbers are more stable.

  Increment of VMAs: 0      100     200     400     800     1600    3200    6400
6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
                     +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%

[1] https://github.com/kdlucas/byte-unixbench/tree/master

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 kernel/fork.c | 34 ++++++++++++++++++++++++++--------
 mm/mmap.c     | 14 ++++++++++++--
 2 files changed, 38 insertions(+), 10 deletions(-)

Comments

Liam R. Howlett Sept. 7, 2023, 8:14 p.m. UTC | #1
* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
> directly modify the entries of VMAs in the new maple tree, which can
> get better performance. The optimization effect is proportional to the
> number of VMAs.
> 
> There is a "spawn" in byte-unixbench[1], which can be used to test the
> performance of fork(). I modified it slightly to make it work with
> different numbers of VMAs.
> 
> Below are the test numbers. There are 21 VMAs by default. The first row
> indicates the number of added VMAs. The following two lines are the
> number of fork() calls every 10 seconds. These numbers are different
> from the test results in v1 because this time the benchmark is bound to
> a CPU. This way the numbers are more stable.
> 
>   Increment of VMAs: 0      100     200     400     800     1600    3200    6400
> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>                      +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%

Thanks!

Can you include 21 in this table since it's the default?

> 
> [1] https://github.com/kdlucas/byte-unixbench/tree/master
> 
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>  kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>  mm/mmap.c     | 14 ++++++++++++--
>  2 files changed, 38 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b6d20dfb9a8..e6299adefbd8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  	int retval;
>  	unsigned long charge = 0;
>  	LIST_HEAD(uf);
> -	VMA_ITERATOR(old_vmi, oldmm, 0);
>  	VMA_ITERATOR(vmi, mm, 0);
>  
>  	uprobe_start_dup_mmap();
> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		goto out;
>  	khugepaged_fork(mm, oldmm);
>  
> -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> -	if (retval)
> +	/* Use __mt_dup() to efficiently build an identical maple tree. */
> +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);

Apparently the flags should be GFP_KERNEL here so that compaction can
run.

> +	if (unlikely(retval))
>  		goto out;
>  
>  	mt_clear_in_rcu(vmi.mas.tree);
> -	for_each_vma(old_vmi, mpnt) {
> +	for_each_vma(vmi, mpnt) {
>  		struct file *file;
>  
>  		vma_start_write(mpnt);
>  		if (mpnt->vm_flags & VM_DONTCOPY) {
>  			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> +
> +			/*
> +			 * Since the new tree is exactly the same as the old one,
> +			 * we need to remove the unneeded VMAs.
> +			 */
> +			mas_store(&vmi.mas, NULL);
> +
> +			/*
> +			 * Even removing an entry may require memory allocation,
> +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
> +			 * from which VMA it failed. The case of encountering
> +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
> +			 */
> +			if (unlikely(mas_is_err(&vmi.mas))) {
> +				retval = xa_err(vmi.mas.node);
> +				mas_reset(&vmi.mas);
> +				if (mas_find(&vmi.mas, ULONG_MAX))
> +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
> +				goto loop_out;
> +			}
> +

Storing NULL may need extra space as you noted, so we need to be careful
what happens if we don't have that space.  We should have a testcase to
test this scenario.

mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
in this function, see vm_area_dup().
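
For example (untested, only to illustrate the suggested call):

	retval = mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);
	if (unlikely(retval))
		goto loop_out;	/* or whatever unwind we settle on below */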

Don't use the exit_mmap() path to undo a failed fork.  You've added
checks and complications to the exit path for all tasks in the very
unlikely event that we run out of memory when we hit a very unlikely
VM_DONTCOPY flag.

I see the issue with having a portion of the tree with new VMAs that are
accounted and a portion of the tree that has old VMAs that should not be
looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
we cannot add that complication to the exit path and then there is the
OOM race to worry about (maybe, I am not sure since this MM isn't
active yet).

Using what is done in exit_mmap() and do_vmi_align_munmap() as a
prototype, we can do something like the *untested* code below:

if (unlikely(mas_is_err(&vmi.mas))) {
	unsigned long max = vmi.index;

	retval = xa_err(vmi.mas.node);
	mas_set(&vmi.mas, 0);
	tmp = mas_find(&vmi.mas, ULONG_MAX);
	if (tmp) { /* Not the first VMA failed */
		unsigned long nr_accounted = 0;

		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
				true);
		do {
			if (vma->vm_flags & VM_ACCOUNT)
				nr_accounted += vma_pages(vma);
			remove_vma(vma, true);
			cond_resched();
			vma = mas_find(&vmi.mas, max - 1);
		} while (vma != NULL);

		vm_unacct_memory(nr_accounted);
	}
	__mt_destroy(&mm->mm_mt);
	goto loop_out;
}

Once exit_mmap() is called, the check for OOM (no vma) will catch that
nothing is left to do.

It might be worth making an inline function to do this to keep the fork
code clean.  We should test this by detecting a specific task name and
returning a failure at a given interval:

if (!strcmp(current->comm, "fork_test")) {
...
}
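
One possible shape for that hook (hypothetical code, for local testing
only, never for merging):

	/* Fail one in every 16 dup attempts for the "fork_test" task. */
	static inline bool should_fail_fork_dup(void)
	{
		static atomic_t attempts = ATOMIC_INIT(0);

		if (strcmp(current->comm, "fork_test"))
			return false;
		return !(atomic_inc_return(&attempts) % 16);
	}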


>  			continue;
>  		}
>  		charge = 0;
> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			hugetlb_dup_vma_private(tmp);
>  
>  		/* Link the vma into the MT */
> -		if (vma_iter_bulk_store(&vmi, tmp))
> -			goto fail_nomem_vmi_store;
> +		mas_store(&vmi.mas, tmp);
>  
>  		mm->map_count++;
>  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  	uprobe_end_dup_mmap();
>  	return retval;
>  
> -fail_nomem_vmi_store:
> -	unlink_anon_vmas(tmp);
>  fail_nomem_anon_vma_fork:
>  	mpol_put(vma_policy(tmp));
>  fail_nomem_policy:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b56a7f0c9f85..dfc6881be81c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>  	arch_exit_mmap(mm);
>  
>  	vma = mas_find(&mas, ULONG_MAX);
> -	if (!vma) {
> +	/*
> +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
> +	 * xa_is_zero(vma) may be true.
> +	 */
> +	if (!vma || xa_is_zero(vma)) {
>  		/* Can happen if dup_mmap() received an OOM */
>  		mmap_read_unlock(mm);
>  		return;
> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>  		remove_vma(vma, true);
>  		count++;
>  		cond_resched();
> -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> +		vma = mas_find(&mas, ULONG_MAX);
> +		/*
> +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
> +		 * do not need to be removed. Can happen if dup_mmap() fails to
> +		 * remove a VMA marked VM_DONTCOPY.
> +		 */
> +	} while (vma != NULL && !xa_is_zero(vma));
>  
>  	BUG_ON(count != mm->map_count);
>  
> -- 
> 2.20.1
>
Peng Zhang Sept. 8, 2023, 9:58 a.m. UTC | #2
On 2023/9/8 04:14, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>> directly modify the entries of VMAs in the new maple tree, which can
>> get better performance. The optimization effect is proportional to the
>> number of VMAs.
>>
>> There is a "spawn" in byte-unixbench[1], which can be used to test the
>> performance of fork(). I modified it slightly to make it work with
>> different numbers of VMAs.
>>
>> Below are the test numbers. There are 21 VMAs by default. The first row
>> indicates the number of added VMAs. The following two lines are the
>> number of fork() calls every 10 seconds. These numbers are different
>> from the test results in v1 because this time the benchmark is bound to
>> a CPU. This way the numbers are more stable.
>>
>>    Increment of VMAs: 0      100     200     400     800     1600    3200    6400
>> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
>> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>>                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%
> 
> Thanks!
> 
> Can you include 21 in this table since it's the default?
Maybe I didn't express it clearly: "Increment of VMAs" means the number
of VMAs added on top of the default 21 VMAs.
> 
>>
>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>>   mm/mmap.c     | 14 ++++++++++++--
>>   2 files changed, 38 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 3b6d20dfb9a8..e6299adefbd8 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	int retval;
>>   	unsigned long charge = 0;
>>   	LIST_HEAD(uf);
>> -	VMA_ITERATOR(old_vmi, oldmm, 0);
>>   	VMA_ITERATOR(vmi, mm, 0);
>>   
>>   	uprobe_start_dup_mmap();
>> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   		goto out;
>>   	khugepaged_fork(mm, oldmm);
>>   
>> -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> -	if (retval)
>> +	/* Use __mt_dup() to efficiently build an identical maple tree. */
>> +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> 
> Apparently the flags should be GFP_KERNEL here so that compaction can
> run.
OK, I'll change it to GFP_KERNEL.
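That is, presumably just:

	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
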
> 
>> +	if (unlikely(retval))
>>   		goto out;
>>   
>>   	mt_clear_in_rcu(vmi.mas.tree);
>> -	for_each_vma(old_vmi, mpnt) {
>> +	for_each_vma(vmi, mpnt) {
>>   		struct file *file;
>>   
>>   		vma_start_write(mpnt);
>>   		if (mpnt->vm_flags & VM_DONTCOPY) {
>>   			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>> +
>> +			/*
>> +			 * Since the new tree is exactly the same as the old one,
>> +			 * we need to remove the unneeded VMAs.
>> +			 */
>> +			mas_store(&vmi.mas, NULL);
>> +
>> +			/*
>> +			 * Even removing an entry may require memory allocation,
>> +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
>> +			 * from which VMA it failed. The case of encountering
>> +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
>> +			 */
>> +			if (unlikely(mas_is_err(&vmi.mas))) {
>> +				retval = xa_err(vmi.mas.node);
>> +				mas_reset(&vmi.mas);
>> +				if (mas_find(&vmi.mas, ULONG_MAX))
>> +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
>> +				goto loop_out;
>> +			}
>> +
> 
> Storing NULL may need extra space as you noted, so we need to be careful
> what happens if we don't have that space.  We should have a testcase to
> test this scenario.
> 
> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> in this function, see vm_area_dup().
> 
> Don't use the exit_mmap() path to undo a failed fork.  You've added
> checks and complications to the exit path for all tasks in the very
> unlikely event that we run out of memory when we hit a very unlikely
> VM_DONTCOPY flag.
> 
> I see the issue with having a portion of the tree with new VMAs that are
> accounted and a portion of the tree that has old VMAs that should not be
> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> we cannot add that complication to the exit path and then there is the
> OOM race to worry about (maybe, I am not sure since this MM isn't
> active yet).
> 
> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> prototype, we can do something like the *untested* code below:
> 
> if (unlikely(mas_is_err(&vmi.mas))) {
> 	unsigned long max = vmi.index;
> 
> 	retval = xa_err(vmi.mas.node);
> 	mas_set(&vmi.mas, 0);
> 	tmp = mas_find(&vmi.mas, ULONG_MAX);
> 	if (tmp) { /* Not the first VMA failed */
> 		unsigned long nr_accounted = 0;
> 
> 		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> 				true);
> 		do {
> 			if (vma->vm_flags & VM_ACCOUNT)
> 				nr_accounted += vma_pages(vma);
> 			remove_vma(vma, true);
> 			cond_resched();
> 			vma = mas_find(&vmi.mas, max - 1);
> 		} while (vma != NULL);
> 
> 		vm_unacct_memory(nr_accounted);
> 	}
> 	__mt_destroy(&mm->mm_mt);
> 	goto loop_out;
> }
> 
> Once exit_mmap() is called, the check for OOM (no vma) will catch that
> nothing is left to do.
> 
> It might be worth making an inline function to do this to keep the fork
> code clean.  We should test this by detecting a specific task name and
> returning a failure at a given interval:
> 
> if (!strcmp(current->comm, "fork_test")) {
> ...
> }

Thank you for your suggestion, I will do this in the next version.
> 
> 
>>   			continue;
>>   		}
>>   		charge = 0;
>> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   			hugetlb_dup_vma_private(tmp);
>>   
>>   		/* Link the vma into the MT */
>> -		if (vma_iter_bulk_store(&vmi, tmp))
>> -			goto fail_nomem_vmi_store;
>> +		mas_store(&vmi.mas, tmp);
>>   
>>   		mm->map_count++;
>>   		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	uprobe_end_dup_mmap();
>>   	return retval;
>>   
>> -fail_nomem_vmi_store:
>> -	unlink_anon_vmas(tmp);
>>   fail_nomem_anon_vma_fork:
>>   	mpol_put(vma_policy(tmp));
>>   fail_nomem_policy:
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index b56a7f0c9f85..dfc6881be81c 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>>   	arch_exit_mmap(mm);
>>   
>>   	vma = mas_find(&mas, ULONG_MAX);
>> -	if (!vma) {
>> +	/*
>> +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>> +	 * xa_is_zero(vma) may be true.
>> +	 */
>> +	if (!vma || xa_is_zero(vma)) {
>>   		/* Can happen if dup_mmap() received an OOM */
>>   		mmap_read_unlock(mm);
>>   		return;
>> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>>   		remove_vma(vma, true);
>>   		count++;
>>   		cond_resched();
>> -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>> +		vma = mas_find(&mas, ULONG_MAX);
>> +		/*
>> +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
>> +		 * do not need to be removed. Can happen if dup_mmap() fails to
>> +		 * remove a VMA marked VM_DONTCOPY.
>> +		 */
>> +	} while (vma != NULL && !xa_is_zero(vma));
>>   
>>   	BUG_ON(count != mm->map_count);
>>   
>> -- 
>> 2.20.1
>>
>
Liam R. Howlett Sept. 8, 2023, 4:07 p.m. UTC | #3
* Peng Zhang <zhangpeng.00@bytedance.com> [230908 05:59]:
> 
> 
> On 2023/9/8 04:14, Liam R. Howlett wrote:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
> > > Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
> > > directly modify the entries of VMAs in the new maple tree, which can
> > > get better performance. The optimization effect is proportional to the
> > > number of VMAs.
> > > 
> > > There is a "spawn" in byte-unixbench[1], which can be used to test the
> > > performance of fork(). I modified it slightly to make it work with
> > > different numbers of VMAs.
> > > 
> > > Below are the test numbers. There are 21 VMAs by default. The first row
> > > indicates the number of added VMAs. The following two lines are the
> > > number of fork() calls every 10 seconds. These numbers are different
> > > from the test results in v1 because this time the benchmark is bound to
> > > a CPU. This way the numbers are more stable.
> > > 
> > >    Increment of VMAs: 0      100     200     400     800     1600    3200    6400
> > > 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
> > > Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
> > >                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%
> > 
> > Thanks!
> > 
> > Can you include 21 in this table since it's the default?
> Maybe I didn't express it clearly: "Increment of VMAs" means the number
> of VMAs added on top of the default 21 VMAs.

Ah, I see.  Thanks.

> > 
> > > 
> > > [1] https://github.com/kdlucas/byte-unixbench/tree/master
> > > 
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > >   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
> > >   mm/mmap.c     | 14 ++++++++++++--
> > >   2 files changed, 38 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 3b6d20dfb9a8..e6299adefbd8 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   	int retval;
> > >   	unsigned long charge = 0;
> > >   	LIST_HEAD(uf);
> > > -	VMA_ITERATOR(old_vmi, oldmm, 0);
> > >   	VMA_ITERATOR(vmi, mm, 0);
> > >   	uprobe_start_dup_mmap();
> > > @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   		goto out;
> > >   	khugepaged_fork(mm, oldmm);
> > > -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> > > -	if (retval)
> > > +	/* Use __mt_dup() to efficiently build an identical maple tree. */
> > > +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> > 
> > Apparently the flags should be GFP_KERNEL here so that compaction can
> > run.
> OK, I'll change it to GFP_KERNEL.
> > 
> > > +	if (unlikely(retval))
> > >   		goto out;
> > >   	mt_clear_in_rcu(vmi.mas.tree);
> > > -	for_each_vma(old_vmi, mpnt) {
> > > +	for_each_vma(vmi, mpnt) {
> > >   		struct file *file;
> > >   		vma_start_write(mpnt);
> > >   		if (mpnt->vm_flags & VM_DONTCOPY) {
> > >   			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > +
> > > +			/*
> > > +			 * Since the new tree is exactly the same as the old one,
> > > +			 * we need to remove the unneeded VMAs.
> > > +			 */
> > > +			mas_store(&vmi.mas, NULL);
> > > +
> > > +			/*
> > > +			 * Even removing an entry may require memory allocation,
> > > +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
> > > +			 * from which VMA it failed. The case of encountering
> > > +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
> > > +			 */
> > > +			if (unlikely(mas_is_err(&vmi.mas))) {
> > > +				retval = xa_err(vmi.mas.node);
> > > +				mas_reset(&vmi.mas);
> > > +				if (mas_find(&vmi.mas, ULONG_MAX))
> > > +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > +				goto loop_out;
> > > +			}
> > > +
> > 
> > Storing NULL may need extra space as you noted, so we need to be careful
> > what happens if we don't have that space.  We should have a testcase to
> > test this scenario.
> > 
> > mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> > in this function, see vm_area_dup().
> > 
> > Don't use the exit_mmap() path to undo a failed fork.  You've added
> > checks and complications to the exit path for all tasks in the very
> > unlikely event that we run out of memory when we hit a very unlikely
> > VM_DONTCOPY flag.
> > 
> > I see the issue with having a portion of the tree with new VMAs that are
> > accounted and a portion of the tree that has old VMAs that should not be
> > looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> > we cannot add that complication to the exit path and then there is the
> > OOM race to worry about (maybe, I am not sure since this MM isn't
> > active yet).
> > 
> > Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> > prototype, we can do something like the *untested* code below:
> > 
> > if (unlikely(mas_is_err(&vmi.mas))) {
> > 	unsigned long max = vmi.index;
> > 
> > 	retval = xa_err(vmi.mas.node);
> > 	mas_set(&vmi.mas, 0);
> > 	tmp = mas_find(&vmi.mas, ULONG_MAX);
> > 	if (tmp) { /* Not the first VMA failed */
> > 		unsigned long nr_accounted = 0;
> > 
> > 		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> > 				true);
> > 		do {
> > 			if (vma->vm_flags & VM_ACCOUNT)
> > 				nr_accounted += vma_pages(vma);
> > 			remove_vma(vma, true);
> > 			cond_resched();
> > 			vma = mas_find(&vmi.mas, max - 1);
> > 		} while (vma != NULL);
> > 
> > 		vm_unacct_memory(nr_accounted);
> > 	}
> > 	__mt_destroy(&mm->mm_mt);
> > 	goto loop_out;
> > }
> > 
> > Once exit_mmap() is called, the check for OOM (no vma) will catch that
> > nothing is left to do.
> > 
> > It might be worth making an inline function to do this to keep the fork
> > code clean.  We should test this by detecting a specific task name and
> > returning a failure at a given interval:
> > 
> > if (!strcmp(current->comm, "fork_test")) {
> > ...
> > }
> 
> Thank you for your suggestion, I will do this in the next version.
> > 
> > 
> > >   			continue;
> > >   		}
> > >   		charge = 0;
> > > @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   			hugetlb_dup_vma_private(tmp);
> > >   		/* Link the vma into the MT */
> > > -		if (vma_iter_bulk_store(&vmi, tmp))
> > > -			goto fail_nomem_vmi_store;
> > > +		mas_store(&vmi.mas, tmp);
> > >   		mm->map_count++;
> > >   		if (!(tmp->vm_flags & VM_WIPEONFORK))
> > > @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   	uprobe_end_dup_mmap();
> > >   	return retval;
> > > -fail_nomem_vmi_store:
> > > -	unlink_anon_vmas(tmp);
> > >   fail_nomem_anon_vma_fork:
> > >   	mpol_put(vma_policy(tmp));
> > >   fail_nomem_policy:
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index b56a7f0c9f85..dfc6881be81c 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
> > >   	arch_exit_mmap(mm);
> > >   	vma = mas_find(&mas, ULONG_MAX);
> > > -	if (!vma) {
> > > +	/*
> > > +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
> > > +	 * xa_is_zero(vma) may be true.
> > > +	 */
> > > +	if (!vma || xa_is_zero(vma)) {
> > >   		/* Can happen if dup_mmap() received an OOM */
> > >   		mmap_read_unlock(mm);
> > >   		return;
> > > @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
> > >   		remove_vma(vma, true);
> > >   		count++;
> > >   		cond_resched();
> > > -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > > +		vma = mas_find(&mas, ULONG_MAX);
> > > +		/*
> > > +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
> > > +		 * do not need to be removed. Can happen if dup_mmap() fails to
> > > +		 * remove a VMA marked VM_DONTCOPY.
> > > +		 */
> > > +	} while (vma != NULL && !xa_is_zero(vma));
> > >   	BUG_ON(count != mm->map_count);
> > > -- 
> > > 2.20.1
> > > 
> >
Peng Zhang Sept. 15, 2023, 10:51 a.m. UTC | #4
On 2023/9/8 04:14, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>> directly modify the entries of VMAs in the new maple tree, which can
>> get better performance. The optimization effect is proportional to the
>> number of VMAs.
>>
>> There is a "spawn" in byte-unixbench[1], which can be used to test the
>> performance of fork(). I modified it slightly to make it work with
>> different numbers of VMAs.
>>
>> Below are the test numbers. There are 21 VMAs by default. The first row
>> indicates the number of added VMAs. The following two lines are the
>> number of fork() calls every 10 seconds. These numbers are different
>> from the test results in v1 because this time the benchmark is bound to
>> a CPU. This way the numbers are more stable.
>>
>>    Increment of VMAs: 0      100     200     400     800     1600    3200    6400
>> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
>> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>>                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%
> 
> Thanks!
> 
> Can you include 21 in this table since it's the default?
> 
>>
>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>>   mm/mmap.c     | 14 ++++++++++++--
>>   2 files changed, 38 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 3b6d20dfb9a8..e6299adefbd8 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	int retval;
>>   	unsigned long charge = 0;
>>   	LIST_HEAD(uf);
>> -	VMA_ITERATOR(old_vmi, oldmm, 0);
>>   	VMA_ITERATOR(vmi, mm, 0);
>>   
>>   	uprobe_start_dup_mmap();
>> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   		goto out;
>>   	khugepaged_fork(mm, oldmm);
>>   
>> -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> -	if (retval)
>> +	/* Use __mt_dup() to efficiently build an identical maple tree. */
>> +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> 
> Apparently the flags should be GFP_KERNEL here so that compaction can
> run.
> 
>> +	if (unlikely(retval))
>>   		goto out;
>>   
>>   	mt_clear_in_rcu(vmi.mas.tree);
>> -	for_each_vma(old_vmi, mpnt) {
>> +	for_each_vma(vmi, mpnt) {
>>   		struct file *file;
>>   
>>   		vma_start_write(mpnt);
>>   		if (mpnt->vm_flags & VM_DONTCOPY) {
>>   			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>> +
>> +			/*
>> +			 * Since the new tree is exactly the same as the old one,
>> +			 * we need to remove the unneeded VMAs.
>> +			 */
>> +			mas_store(&vmi.mas, NULL);
>> +
>> +			/*
>> +			 * Even removing an entry may require memory allocation,
>> +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
>> +			 * from which VMA it failed. The case of encountering
>> +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
>> +			 */
>> +			if (unlikely(mas_is_err(&vmi.mas))) {
>> +				retval = xa_err(vmi.mas.node);
>> +				mas_reset(&vmi.mas);
>> +				if (mas_find(&vmi.mas, ULONG_MAX))
>> +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
>> +				goto loop_out;
>> +			}
>> +
> 
> Storing NULL may need extra space as you noted, so we need to be careful
> what happens if we don't have that space.  We should have a testcase to
> test this scenario.
> 
> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> in this function, see vm_area_dup().
> 
> Don't use the exit_mmap() path to undo a failed fork.  You've added
> checks and complications to the exit path for all tasks in the very
> unlikely event that we run out of memory when we hit a very unlikely
> VM_DONTCOPY flag.
> 
> I see the issue with having a portion of the tree with new VMAs that are
> accounted and a portion of the tree that has old VMAs that should not be
> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> we cannot add that complication to the exit path and then there is the
> OOM race to worry about (maybe, I am not sure since this MM isn't
> active yet).
I encountered some errors after implementing the scheme you mentioned
below. This would also clutter fork.c and mmap.c, as some internal
functions would need to be made global.

I thought of another way to put everything into the maple tree. In non-RCU
mode, we can remove the last half of the tree without allocating any
memory. This requires modifications to the internal implementation of
mas_store().
Then remove the second half of the tree like this:

mas.index = 0;
mas.last = ULONG_MAX;
mas_store(&mas, NULL).

At least in non-RCU mode, we can do this, since we only need to merge
some nodes, or move some items to adjacent nodes.
However, this will increase the workload significantly.

> 
> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> prototype, we can do something like the *untested* code below:
> 
> if (unlikely(mas_is_err(&vmi.mas))) {
> 	unsigned long max = vmi.index;
> 
> 	retval = xa_err(vmi.mas.node);
> 	mas_set(&vmi.mas, 0);
> 	tmp = mas_find(&vmi.mas, ULONG_MAX);
> 	if (tmp) { /* Not the first VMA failed */
> 		unsigned long nr_accounted = 0;
> 
> 		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> 				true);
> 		do {
> 			if (vma->vm_flags & VM_ACCOUNT)
> 				nr_accounted += vma_pages(vma);
> 			remove_vma(vma, true);
> 			cond_resched();
> 			vma = mas_find(&vmi.mas, max - 1);
> 		} while (vma != NULL);
> 
> 		vm_unacct_memory(nr_accounted);
> 	}
> 	__mt_destroy(&mm->mm_mt);
> 	goto loop_out;
> }
> 
> Once exit_mmap() is called, the check for OOM (no vma) will catch that
> nothing is left to do.
> 
> It might be worth making an inline function to do this to keep the fork
> code clean.  We should test this by detecting a specific task name and
> returning a failure at a given interval:
> 
> if (!strcmp(current->comm, "fork_test")) {
> ...
> }
> 
> 
>>   			continue;
>>   		}
>>   		charge = 0;
>> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   			hugetlb_dup_vma_private(tmp);
>>   
>>   		/* Link the vma into the MT */
>> -		if (vma_iter_bulk_store(&vmi, tmp))
>> -			goto fail_nomem_vmi_store;
>> +		mas_store(&vmi.mas, tmp);
>>   
>>   		mm->map_count++;
>>   		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	uprobe_end_dup_mmap();
>>   	return retval;
>>   
>> -fail_nomem_vmi_store:
>> -	unlink_anon_vmas(tmp);
>>   fail_nomem_anon_vma_fork:
>>   	mpol_put(vma_policy(tmp));
>>   fail_nomem_policy:
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index b56a7f0c9f85..dfc6881be81c 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>>   	arch_exit_mmap(mm);
>>   
>>   	vma = mas_find(&mas, ULONG_MAX);
>> -	if (!vma) {
>> +	/*
>> +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>> +	 * xa_is_zero(vma) may be true.
>> +	 */
>> +	if (!vma || xa_is_zero(vma)) {
>>   		/* Can happen if dup_mmap() received an OOM */
>>   		mmap_read_unlock(mm);
>>   		return;
>> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>>   		remove_vma(vma, true);
>>   		count++;
>>   		cond_resched();
>> -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>> +		vma = mas_find(&mas, ULONG_MAX);
>> +		/*
>> +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
>> +		 * do not need to be removed. Can happen if dup_mmap() fails to
>> +		 * remove a VMA marked VM_DONTCOPY.
>> +		 */
>> +	} while (vma != NULL && !xa_is_zero(vma));
>>   
>>   	BUG_ON(count != mm->map_count);
>>   
>> -- 
>> 2.20.1
>>
>
Peng Zhang Sept. 15, 2023, 10:56 a.m. UTC | #5
On 2023/9/15 18:51, Peng Zhang wrote:
> 
> 
> On 2023/9/8 04:14, Liam R. Howlett wrote:
>> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
>>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>>> directly modify the entries of VMAs in the new maple tree, which can
>>> get better performance. The optimization effect is proportional to the
>>> number of VMAs.
>>>
>>> There is a "spawn" in byte-unixbench[1], which can be used to test the
>>> performance of fork(). I modified it slightly to make it work with
>>> different numbers of VMAs.
>>>
>>> Below are the test numbers. There are 21 VMAs by default. The first row
>>> indicates the number of added VMAs. The following two lines are the
>>> number of fork() calls every 10 seconds. These numbers are different
>>> from the test results in v1 because this time the benchmark is bound to
>>> a CPU. This way the numbers are more stable.
>>>
>>>    Increment of VMAs: 0      100     200     400     800     1600    3200    6400
>>> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
>>> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>>>                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%
>>
>> Thanks!
>>
>> Can you include 21 in this table since it's the default?
>>
>>>
>>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>>
>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>> ---
>>>   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>>>   mm/mmap.c     | 14 ++++++++++++--
>>>   2 files changed, 38 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 3b6d20dfb9a8..e6299adefbd8 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>       int retval;
>>>       unsigned long charge = 0;
>>>       LIST_HEAD(uf);
>>> -    VMA_ITERATOR(old_vmi, oldmm, 0);
>>>       VMA_ITERATOR(vmi, mm, 0);
>>>       uprobe_start_dup_mmap();
>>> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>           goto out;
>>>       khugepaged_fork(mm, oldmm);
>>> -    retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>>> -    if (retval)
>>> +    /* Use __mt_dup() to efficiently build an identical maple tree. */
>>> +    retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
>>
>> Apparently the flags should be GFP_KERNEL here so that compaction can
>> run.
>>
>>> +    if (unlikely(retval))
>>>           goto out;
>>>       mt_clear_in_rcu(vmi.mas.tree);
>>> -    for_each_vma(old_vmi, mpnt) {
>>> +    for_each_vma(vmi, mpnt) {
>>>           struct file *file;
>>>           vma_start_write(mpnt);
>>>           if (mpnt->vm_flags & VM_DONTCOPY) {
>>>               vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>>> +
>>> +            /*
>>> +             * Since the new tree is exactly the same as the old one,
>>> +             * we need to remove the unneeded VMAs.
>>> +             */
>>> +            mas_store(&vmi.mas, NULL);
>>> +
>>> +            /*
>>> +             * Even removing an entry may require memory allocation,
>>> +             * and if removal fails, we use XA_ZERO_ENTRY to mark
>>> +             * from which VMA it failed. The case of encountering
>>> +             * XA_ZERO_ENTRY will be handled in exit_mmap().
>>> +             */
>>> +            if (unlikely(mas_is_err(&vmi.mas))) {
>>> +                retval = xa_err(vmi.mas.node);
>>> +                mas_reset(&vmi.mas);
>>> +                if (mas_find(&vmi.mas, ULONG_MAX))
>>> +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
>>> +                goto loop_out;
>>> +            }
>>> +
>>
>> Storing NULL may need extra space as you noted, so we need to be careful
>> what happens if we don't have that space.  We should have a testcase to
>> test this scenario.
>>
>> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
>> in this function, see vm_area_dup().
>>
>> Don't use the exit_mmap() path to undo a failed fork.  You've added
>> checks and complications to the exit path for all tasks in the very
>> unlikely event that we run out of memory when we hit a very unlikely
>> VM_DONTCOPY flag.
>>
>> I see the issue with having a portion of the tree with new VMAs that are
>> accounted and a portion of the tree that has old VMAs that should not be
>> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
>> we cannot add that complication to the exit path and then there is the
>> OOM race to worry about (maybe, I am not sure since this MM isn't
>> active yet).
> I encountered some errors after implementing the scheme you mentioned
> below. This would also clutter fork.c and mmap.c, as some internal
> functions would need to be made global.
> 
> I thought of another way to put everything into the maple tree. In non-RCU
> mode, we can remove the last half of the tree without allocating any
> memory. This requires modifications to the internal implementation of
> mas_store().
> Then remove the second half of the tree like this:
> 
> mas.index = 0;
Sorry, typo.
Change to: mas.index = vma->vm_start
> mas.last = ULONG_MAX;
> mas_store(&mas, NULL).

> 
> At least in non-RCU mode, we can do this, since we only need to merge
> some nodes, or move some items to adjacent nodes.
> However, this will increase the workload significantly.
> 
>>
>> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
>> prototype, we can do something like the *untested* code below:
>>
>> if (unlikely(mas_is_err(&vmi.mas))) {
>>     unsigned long max = vmi.index;
>>
>>     retval = xa_err(vmi.mas.node);
>>     mas_set(&vmi.mas, 0);
>>     tmp = mas_find(&vmi.mas, ULONG_MAX);
>>     if (tmp) { /* Not the first VMA failed */
>>         unsigned long nr_accounted = 0;
>>
>>         unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
>>                 true);
>>         do {
>>             if (vma->vm_flags & VM_ACCOUNT)
>>                 nr_accounted += vma_pages(vma);
>>             remove_vma(vma, true);
>>             cond_resched();
>>             vma = mas_find(&vmi.mas, max - 1);
>>         } while (vma != NULL);
>>
>>         vm_unacct_memory(nr_accounted);
>>     }
>>     __mt_destroy(&mm->mm_mt);
>>     goto loop_out;
>> }
>>
>> Once exit_mmap() is called, the check for OOM (no vma) will catch that
>> nothing is left to do.
>>
>> It might be worth making an inline function to do this to keep the fork
>> code clean.  We should test this by detecting a specific task name and
>> returning a failure at a given interval:
>>
>> if (!strcmp(current->comm, "fork_test")) {
>> ...
>> }
>>
>>
>>>               continue;
>>>           }
>>>           charge = 0;
>>> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>               hugetlb_dup_vma_private(tmp);
>>>           /* Link the vma into the MT */
>>> -        if (vma_iter_bulk_store(&vmi, tmp))
>>> -            goto fail_nomem_vmi_store;
>>> +        mas_store(&vmi.mas, tmp);
>>>           mm->map_count++;
>>>           if (!(tmp->vm_flags & VM_WIPEONFORK))
>>> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>       uprobe_end_dup_mmap();
>>>       return retval;
>>> -fail_nomem_vmi_store:
>>> -    unlink_anon_vmas(tmp);
>>>   fail_nomem_anon_vma_fork:
>>>       mpol_put(vma_policy(tmp));
>>>   fail_nomem_policy:
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index b56a7f0c9f85..dfc6881be81c 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>>>       arch_exit_mmap(mm);
>>>       vma = mas_find(&mas, ULONG_MAX);
>>> -    if (!vma) {
>>> +    /*
>>> +     * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>>> +     * xa_is_zero(vma) may be true.
>>> +     */
>>> +    if (!vma || xa_is_zero(vma)) {
>>>           /* Can happen if dup_mmap() received an OOM */
>>>           mmap_read_unlock(mm);
>>>           return;
>>> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>>>           remove_vma(vma, true);
>>>           count++;
>>>           cond_resched();
>>> -    } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>>> +        vma = mas_find(&mas, ULONG_MAX);
>>> +        /*
>>> +         * If xa_is_zero(vma) is true, it means that subsequent VMAs
>>> +         * do not need to be removed. Can happen if dup_mmap() fails to
>>> +         * remove a VMA marked VM_DONTCOPY.
>>> +         */
>>> +    } while (vma != NULL && !xa_is_zero(vma));
>>>       BUG_ON(count != mm->map_count);
>>> -- 
>>> 2.20.1
>>>
>>
Liam R. Howlett Sept. 15, 2023, 8 p.m. UTC | #6
* Peng Zhang <zhangpeng.00@bytedance.com> [230915 06:57]:
> 
> 

...

> > > > +    if (unlikely(retval))
> > > >           goto out;
> > > >       mt_clear_in_rcu(vmi.mas.tree);
> > > > -    for_each_vma(old_vmi, mpnt) {
> > > > +    for_each_vma(vmi, mpnt) {
> > > >           struct file *file;
> > > >           vma_start_write(mpnt);
> > > >           if (mpnt->vm_flags & VM_DONTCOPY) {
> > > >               vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > > +
> > > > +            /*
> > > > +             * Since the new tree is exactly the same as the old one,
> > > > +             * we need to remove the unneeded VMAs.
> > > > +             */
> > > > +            mas_store(&vmi.mas, NULL);
> > > > +
> > > > +            /*
> > > > +             * Even removing an entry may require memory allocation,
> > > > +             * and if removal fails, we use XA_ZERO_ENTRY to mark
> > > > +             * from which VMA it failed. The case of encountering
> > > > +             * XA_ZERO_ENTRY will be handled in exit_mmap().
> > > > +             */
> > > > +            if (unlikely(mas_is_err(&vmi.mas))) {
> > > > +                retval = xa_err(vmi.mas.node);
> > > > +                mas_reset(&vmi.mas);
> > > > +                if (mas_find(&vmi.mas, ULONG_MAX))
> > > > +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > > +                goto loop_out;
> > > > +            }
> > > > +
> > > 
> > > Storing NULL may need extra space as you noted, so we need to be careful
> > > what happens if we don't have that space.  We should have a testcase to
> > > test this scenario.
> > > 
> > > mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> > > in this function, see vm_area_dup().
> > > 
> > > Don't use the exit_mmap() path to undo a failed fork.  You've added
> > > checks and complications to the exit path for all tasks in the very
> > > unlikely event that we run out of memory when we hit a very unlikely
> > > VM_DONTCOPY flag.
> > > 
> > > I see the issue with having a portion of the tree with new VMAs that are
> > > accounted and a portion of the tree that has old VMAs that should not be
> > > looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> > > we cannot add that complication to the exit path and then there is the
> > > OOM race to worry about (maybe, I am not sure since this MM isn't
> > > active yet).
> > I encountered some errors after implementing the scheme you mentioned
> > below.

What were the errors?  Maybe I missed something or there is another way.

> > This would also clutter fork.c and mmap.c, as some internal
> > functions would need to be made global.

Could it not be a new function in mm/mmap.c, declared in mm/internal.h,
that does the accounting and VMA freeing for [0, vma->vm_start)?

Maybe we could use it in the other areas that do this sort of work?
do_vmi_align_munmap() does something similar to what we need after the
"Point of no return".

> > 
> > I thought of another way to put everything into the maple tree. In non-RCU
> > mode, we can remove the last half of the tree without allocating any
> > memory. This requires modifications to the internal implementation of
> > mas_store().
> > Then remove the second half of the tree like this:
> > 
> > mas.index = 0;
> Sorry, typo.
> Change to: mas.index = vma->vm_start
> > mas.last = ULONG_MAX;
> > mas_store(&mas, NULL).

Well, we know we are not in RCU mode here, but I am concerned about this
going poorly.

> 
> > 
> > At least in non-RCU mode, we can do this, since we only need to merge
> > some nodes, or move some items to adjacent nodes.
> > However, this will increase the workload significantly.

In the unlikely event of an issue allocating memory, this would be
unwelcome.  If we can avoid it, it would be best.  I don't mind being
slow in error paths, but a significant workload would be rather bad on
an overloaded system.

> > 
> > > 
> > > Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> > > prototype, we can do something like the *untested* code below:
> > > 
> > > if (unlikely(mas_is_err(&vmi.mas))) {
> > >     unsigned long max = vmi.index;
> > > 
> > >     retval = xa_err(vmi.mas.node);
> > >     mas_set(&vmi.mas, 0);
> > >     tmp = mas_find(&vmi.mas, ULONG_MAX);
> > >     if (tmp) { /* Not the first VMA failed */
> > >         unsigned long nr_accounted = 0;
> > > 
> > >         unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> > >                 true);
> > >         do {
> > >             if (vma->vm_flags & VM_ACCOUNT)
> > >                 nr_accounted += vma_pages(vma);
> > >             remove_vma(vma, true);
> > >             cond_resched();
> > >             vma = mas_find(&vmi.mas, max - 1);
> > >         } while (vma != NULL);
> > > 
> > >         vm_unacct_memory(nr_accounted);
> > >     }
> > >     __mt_destroy(&mm->mm_mt);
> > >     goto loop_out;
> > > }
> > > 
> > > Once exit_mmap() is called, the check for OOM (no vma) will catch that
> > > nothing is left to do.
> > > 
> > > It might be worth making an inline function to do this to keep the fork
> > > code clean.  We should test this by detecting a specific task name and
> > > returning a failure at a given interval:
> > > 
> > > if (!strcmp(current->comm, "fork_test")) {
> > > ...
> > > }
> > > 
...


Thanks,
Liam
Peng Zhang Sept. 18, 2023, 1:14 p.m. UTC | #7
On 2023/9/16 04:00, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230915 06:57]:
>>
>>
> 
> ...
> 
>>>>> +    if (unlikely(retval))
>>>>>            goto out;
>>>>>        mt_clear_in_rcu(vmi.mas.tree);
>>>>> -    for_each_vma(old_vmi, mpnt) {
>>>>> +    for_each_vma(vmi, mpnt) {
>>>>>            struct file *file;
>>>>>            vma_start_write(mpnt);
>>>>>            if (mpnt->vm_flags & VM_DONTCOPY) {
>>>>>                vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>>>>> +
>>>>> +            /*
>>>>> +             * Since the new tree is exactly the same as the old one,
>>>>> +             * we need to remove the unneeded VMAs.
>>>>> +             */
>>>>> +            mas_store(&vmi.mas, NULL);
>>>>> +
>>>>> +            /*
>>>>> +             * Even removing an entry may require memory allocation,
>>>>> +             * and if removal fails, we use XA_ZERO_ENTRY to mark
>>>>> +             * from which VMA it failed. The case of encountering
>>>>> +             * XA_ZERO_ENTRY will be handled in exit_mmap().
>>>>> +             */
>>>>> +            if (unlikely(mas_is_err(&vmi.mas))) {
>>>>> +                retval = xa_err(vmi.mas.node);
>>>>> +                mas_reset(&vmi.mas);
>>>>> +                if (mas_find(&vmi.mas, ULONG_MAX))
>>>>> +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
>>>>> +                goto loop_out;
>>>>> +            }
>>>>> +
>>>>
>>>> Storing NULL may need extra space as you noted, so we need to be careful
>>>> what happens if we don't have that space.  We should have a testcase to
>>>> test this scenario.
>>>>
>>>> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
>>>> in this function, see vm_area_dup().
>>>>
>>>> Don't use the exit_mmap() path to undo a failed fork.  You've added
>>>> checks and complications to the exit path for all tasks in the very
>>>> unlikely event that we run out of memory when we hit a very unlikely
>>>> VM_DONTCOPY flag.
>>>>
>>>> I see the issue with having a portion of the tree with new VMAs that are
>>>> accounted and a portion of the tree that has old VMAs that should not be
>>>> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
>>>> we cannot add that complication to the exit path and then there is the
>>>> OOM race to worry about (maybe, I am not sure since this MM isn't
>>>> active yet).
>>> I encountered some errors after implementing the scheme you mentioned
>>> below.
> 
> What were the errors?  Maybe I missed something or there is another way.
I found the cause of the problem and fixed it; I tested the error path and
it seems to be working fine now.

The reason is that "free_pgd_range(tlb, addr, vma->vm_end, floor, next ?
next->vm_start : ceiling);" in free_pgtables() does not free all page
tables due to the existence of the last false VMA. I've fixed it.
Thanks.

> 
>>> This would also clutter fork.c and mmap.c, as some internal
>>> functions would need to be made global.
> 
> Could it not be a new function in mm/mmap.c and added to mm/internal.h
> that does the accounting and VMA freeing from [0 - vma->vm_start)?
> 
> Maybe we could use it in the other areas that do this sort of work?
> do_vmi_align_munmap() does something similar to what we need after the
> "Point of no return".
> 
>>>
>>> I thought of another way to put everything into the maple tree. In non-RCU
>>> mode, we can remove the last half of the tree without allocating any
>>> memory. This requires modifications to the internal implementation of
>>> mas_store().
>>> Then remove the second half of the tree like this:
>>>
>>> mas.index = 0;
>> Sorry, typo.
>> Change to: mas.index = vma->vm_start
>>> mas.last = ULONG_MAX;
>>> mas_store(&mas, NULL).
> 
> Well, we know we are not in RCU mode here, but I am concerned about this
> going poorly.
> 
>>
>>>
>>> At least in non-RCU mode, we can do this, since we only need to merge
>>> some nodes, or move some items to adjacent nodes.
>>> However, this will increase the workload significantly.
> 
> In the unlikely event of an issue allocating memory, this would be
> unwelcome.  If we can avoid it, it would be best.  I don't mind being
> slow in error paths, but a significant workload would be rather bad on
> an overloaded system.
> 
>>>
>>>>
>>>> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
>>>> prototype, we can do something like the *untested* code below:
>>>>
>>>> if (unlikely(mas_is_err(&vmi.mas))) {
>>>>      unsigned long max = vmi.index;
>>>>
>>>>      retval = xa_err(vmi.mas.node);
>>>>      mas_set(&vmi.mas, 0);
>>>>      tmp = mas_find(&vmi.mas, ULONG_MAX);
>>>>      if (tmp) { /* Not the first VMA failed */
>>>>          unsigned long nr_accounted = 0;
>>>>
>>>>          unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
>>>>                  true);
>>>>          do {
>>>>              if (vma->vm_flags & VM_ACCOUNT)
>>>>                  nr_accounted += vma_pages(vma);
>>>>              remove_vma(vma, true);
>>>>              cond_resched();
>>>>              vma = mas_find(&vmi.mas, max - 1);
>>>>          } while (vma != NULL);
>>>>
>>>>          vm_unacct_memory(nr_accounted);
>>>>      }
>>>>      __mt_destroy(&mm->mm_mt);
>>>>      goto loop_out;
>>>> }
>>>>
>>>> Once exit_mmap() is called, the check for OOM (no vma) will catch that
>>>> nothing is left to do.
>>>>
>>>> It might be worth making an inline function to do this to keep the fork
>>>> code clean.  We should test this by detecting a specific task name and
>>>> returning a failure at a given interval:
>>>>
>>>> if (!strcmp(current->comm, "fork_test")) {
>>>> ...
>>>> }
>>>>
> ...
> 
> 
> Thanks,
> Liam
>
Liam R. Howlett Sept. 18, 2023, 5:59 p.m. UTC | #8
* Peng Zhang <zhangpeng.00@bytedance.com> [230918 09:15]:
> 
> 
> On 2023/9/16 04:00, Liam R. Howlett wrote:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230915 06:57]:
> > > 
> > > 
> > 
> > ...
> > 
> > > > > > +    if (unlikely(retval))
> > > > > >            goto out;
> > > > > >        mt_clear_in_rcu(vmi.mas.tree);
> > > > > > -    for_each_vma(old_vmi, mpnt) {
> > > > > > +    for_each_vma(vmi, mpnt) {
> > > > > >            struct file *file;
> > > > > >            vma_start_write(mpnt);
> > > > > >            if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > > >                vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > > > > +
> > > > > > +            /*
> > > > > > +             * Since the new tree is exactly the same as the old one,
> > > > > > +             * we need to remove the unneeded VMAs.
> > > > > > +             */
> > > > > > +            mas_store(&vmi.mas, NULL);
> > > > > > +
> > > > > > +            /*
> > > > > > +             * Even removing an entry may require memory allocation,
> > > > > > +             * and if removal fails, we use XA_ZERO_ENTRY to mark
> > > > > > +             * from which VMA it failed. The case of encountering
> > > > > > +             * XA_ZERO_ENTRY will be handled in exit_mmap().
> > > > > > +             */
> > > > > > +            if (unlikely(mas_is_err(&vmi.mas))) {
> > > > > > +                retval = xa_err(vmi.mas.node);
> > > > > > +                mas_reset(&vmi.mas);
> > > > > > +                if (mas_find(&vmi.mas, ULONG_MAX))
> > > > > > +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > > > > +                goto loop_out;
> > > > > > +            }
> > > > > > +
> > > > > 
> > > > > Storing NULL may need extra space as you noted, so we need to be careful
> > > > > what happens if we don't have that space.  We should have a testcase to
> > > > > test this scenario.
> > > > > 
> > > > > mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> > > > > in this function, see vm_area_dup().
> > > > > 
> > > > > Don't use the exit_mmap() path to undo a failed fork.  You've added
> > > > > checks and complications to the exit path for all tasks in the very
> > > > > unlikely event that we run out of memory when we hit a very unlikely
> > > > > VM_DONTCOPY flag.
> > > > > 
> > > > > I see the issue with having a portion of the tree with new VMAs that are
> > > > > accounted and a portion of the tree that has old VMAs that should not be
> > > > > looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> > > > > we cannot add that complication to the exit path and then there is the
> > > > > OOM race to worry about (maybe, I am not sure since this MM isn't
> > > > > active yet).
> > > > I encountered some errors after implementing the scheme you mentioned
> > > > below.
> > 
> > What were the errors?  Maybe I missed something or there is another way.
> I found the cause of the problem and fixed it; I tested the error path and
> it seems to be working fine now.
> 
> The reason is that "free_pgd_range(tlb, addr, vma->vm_end, floor, next ?
> next->vm_start : ceiling);" in free_pgtables() does not free all page
> tables due to the existence of the last false VMA. I've fixed it.
> Thanks.

Sounds good.

Please Cc the maple tree mailing list (maple-tree@lists.infradead.org)
on v3 - we are looking forward to seeing it.

Thanks,
Liam

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..e6299adefbd8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -650,7 +650,6 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	int retval;
 	unsigned long charge = 0;
 	LIST_HEAD(uf);
-	VMA_ITERATOR(old_vmi, oldmm, 0);
 	VMA_ITERATOR(vmi, mm, 0);
 
 	uprobe_start_dup_mmap();
@@ -678,17 +677,39 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		goto out;
 	khugepaged_fork(mm, oldmm);
 
-	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
-	if (retval)
+	/* Use __mt_dup() to efficiently build an identical maple tree. */
+	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
+	if (unlikely(retval))
 		goto out;
 
 	mt_clear_in_rcu(vmi.mas.tree);
-	for_each_vma(old_vmi, mpnt) {
+	for_each_vma(vmi, mpnt) {
 		struct file *file;
 
 		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+
+			/*
+			 * Since the new tree is exactly the same as the old one,
+			 * we need to remove the unneeded VMAs.
+			 */
+			mas_store(&vmi.mas, NULL);
+
+			/*
+			 * Even removing an entry may require memory allocation,
+			 * and if removal fails, we use XA_ZERO_ENTRY to mark
+			 * from which VMA it failed. The case of encountering
+			 * XA_ZERO_ENTRY will be handled in exit_mmap().
+			 */
+			if (unlikely(mas_is_err(&vmi.mas))) {
+				retval = xa_err(vmi.mas.node);
+				mas_reset(&vmi.mas);
+				if (mas_find(&vmi.mas, ULONG_MAX))
+					mas_store(&vmi.mas, XA_ZERO_ENTRY);
+				goto loop_out;
+			}
+
 			continue;
 		}
 		charge = 0;
@@ -750,8 +771,7 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			hugetlb_dup_vma_private(tmp);
 
 		/* Link the vma into the MT */
-		if (vma_iter_bulk_store(&vmi, tmp))
-			goto fail_nomem_vmi_store;
+		mas_store(&vmi.mas, tmp);
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -778,8 +798,6 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	uprobe_end_dup_mmap();
 	return retval;
 
-fail_nomem_vmi_store:
-	unlink_anon_vmas(tmp);
 fail_nomem_anon_vma_fork:
 	mpol_put(vma_policy(tmp));
 fail_nomem_policy:
diff --git a/mm/mmap.c b/mm/mmap.c
index b56a7f0c9f85..dfc6881be81c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3196,7 +3196,11 @@  void exit_mmap(struct mm_struct *mm)
 	arch_exit_mmap(mm);
 
 	vma = mas_find(&mas, ULONG_MAX);
-	if (!vma) {
+	/*
+	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
+	 * xa_is_zero(vma) may be true.
+	 */
+	if (!vma || xa_is_zero(vma)) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_read_unlock(mm);
 		return;
@@ -3234,7 +3238,13 @@  void exit_mmap(struct mm_struct *mm)
 		remove_vma(vma, true);
 		count++;
 		cond_resched();
-	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+		vma = mas_find(&mas, ULONG_MAX);
+		/*
+		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
+		 * do not need to be removed. Can happen if dup_mmap() fails to
+		 * remove a VMA marked VM_DONTCOPY.
+		 */
+	} while (vma != NULL && !xa_is_zero(vma));
 
 	BUG_ON(count != mm->map_count);