
[RESEND,v2,1/2] mm: drop oom code from exit_mmap

Message ID 20220531223100.510392-1-surenb@google.com (mailing list archive)
State Accepted
Commit bf3980c85212fc71512d27a46f5aab66f46ca284
Series [RESEND,v2,1/2] mm: drop oom code from exit_mmap

Commit Message

Suren Baghdasaryan May 31, 2022, 10:30 p.m. UTC
The primary reason to invoke the oom reaper from the exit_mmap path used
to be to prevent excessive oom killing when the oom victim's exit races
with the oom reaper (see [1] for more details). The invocation has moved
around since then because of the interaction with the munlock logic, but
the underlying reason has remained the same (see [2]).

The munlock code is no longer a problem since [3], and there shouldn't be
any blocking operation before the memory is unmapped by exit_mmap, so the
oom reaper invocation can be dropped. The unmapping part can be done with
the non-exclusive mmap_sem; the exclusive one is only required when page
tables are freed.

Remove the oom_reaper from exit_mmap, which makes the code easier to
read. This is unlikely to make any observable difference, although some
microbenchmarks could benefit from one less branch that needs to be
evaluated even though it is almost never true.

[1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
Notes:
- Rebased over git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
mm-unstable branch per Andrew's request, but applies cleanly to Linus' ToT
- Conflicts with the maple-tree patchset. Resolving these was discussed in
https://lore.kernel.org/all/20220519223438.qx35hbpfnnfnpouw@revolver/

 include/linux/oom.h |  2 --
 mm/mmap.c           | 31 ++++++++++++-------------------
 mm/oom_kill.c       |  2 +-
 3 files changed, 13 insertions(+), 22 deletions(-)
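
For quick reference, the locking order the patched exit_mmap() ends up with
looks roughly like the sketch below. This is a trimmed illustration
reconstructed from the diff under "Patch" at the end of this page
(pre-maple-tree mm/mmap.c); the final remove_vma() walk and the accounting
are omitted, so treat it as a reading aid rather than the actual code.

void exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;

	/* mm's last user has gone, and it's about to be pulled down */
	mmu_notifier_release(mm);

	mmap_read_lock(mm);		/* the non-exclusive lock is enough to unmap */
	arch_exit_mmap(mm);

	vma = mm->mmap;
	if (!vma) {
		/* Can happen if dup_mmap() received an OOM */
		mmap_read_unlock(mm);
		return;
	}

	lru_add_drain();
	flush_cache_mm(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	/* Use -1 here to ensure all VMAs in the mm are unmapped */
	unmap_vmas(&tlb, vma, 0, -1);
	mmap_read_unlock(mm);

	/*
	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
	 * because the memory has been already freed.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);

	mmap_write_lock(mm);		/* exclusive lock only to free page tables */
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	/* (remove_vma() walk, mm->mmap = NULL and vm_unacct_memory() trimmed) */
	mmap_write_unlock(mm);
}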

Comments

Andrew Morton June 1, 2022, 9:36 p.m. UTC | #1
On Tue, 31 May 2022 15:30:59 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> The primary reason to invoke the oom reaper from the exit_mmap path used
> to be a prevention of an excessive oom killing if the oom victim exit
> races with the oom reaper (see [1] for more details). The invocation has
> moved around since then because of the interaction with the munlock
> logic but the underlying reason has remained the same (see [2]).
> 
> Munlock code is no longer a problem since [3] and there shouldn't be
> any blocking operation before the memory is unmapped by exit_mmap so
> the oom reaper invocation can be dropped. The unmapping part can be done
> with the non-exclusive mmap_sem and the exclusive one is only required
> when page tables are freed.
> 
> Remove the oom_reaper from exit_mmap which will make the code easier to
> read. This is really unlikely to make any observable difference although
> some microbenchmarks could benefit from one less branch that needs to be
> evaluated even though it almost never is true.
> 
> [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> 

I've just reinstated the mapletree patchset so there are some
conflicting changes.

> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -bool __oom_reap_task_mm(struct mm_struct *mm);
> -
>  long oom_badness(struct task_struct *p,
>  		unsigned long totalpages);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2b9305ed0dda..b7918e6bb0db 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3110,30 +3110,13 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> -	if (unlikely(mm_is_oom_victim(mm))) {
> -		/*
> -		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration.  Taking mm->mmap_lock for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_lock is
> -		 * dropped.
> -		 *
> -		 * Nothing can be holding mm->mmap_lock here and the above call
> -		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> -		 * __oom_reap_task_mm() will not block.
> -		 */
> -		(void)__oom_reap_task_mm(mm);
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> -	}
> -
> -	mmap_write_lock(mm);
> +	mmap_read_lock(mm);

Unclear why this patch fiddles with the mm_struct locking in this
fashion - changelogging that would have been helpful.

But iirc mapletree wants to retain a write_lock here, so I ended up with

void exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;
	unsigned long nr_accounted = 0;
	MA_STATE(mas, &mm->mm_mt, 0, 0);
	int count = 0;

	/* mm's last user has gone, and its about to be pulled down */
	mmu_notifier_release(mm);

	mmap_write_lock(mm);
	arch_exit_mmap(mm);

	vma = mas_find(&mas, ULONG_MAX);
	if (!vma) {
		/* Can happen if dup_mmap() received an OOM */
		mmap_write_unlock(mm);
		return;
	}

	lru_add_drain();
	flush_cache_mm(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	/* update_hiwater_rss(mm) here? but nobody should be looking */
	/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);

	/*
	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
	 * because the memory has been already freed. Do not bother checking
	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
		      USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	/*
	 * Walk the list again, actually closing and freeing it, with preemption
	 * enabled, without holding any MM locks besides the unreachable
	 * mmap_write_lock.
	 */
	do {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		remove_vma(vma);
		count++;
		cond_resched();
	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);

	BUG_ON(count != mm->map_count);

	trace_exit_mmap(mm);
	__mt_destroy(&mm->mm_mt);
	mm->mmap = NULL;
	mmap_write_unlock(mm);
	vm_unacct_memory(nr_accounted);
}
Suren Baghdasaryan June 1, 2022, 9:47 p.m. UTC | #2
On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 31 May 2022 15:30:59 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > The primary reason to invoke the oom reaper from the exit_mmap path used
> > to be a prevention of an excessive oom killing if the oom victim exit
> > races with the oom reaper (see [1] for more details). The invocation has
> > moved around since then because of the interaction with the munlock
> > logic but the underlying reason has remained the same (see [2]).
> >
> > Munlock code is no longer a problem since [3] and there shouldn't be
> > any blocking operation before the memory is unmapped by exit_mmap so
> > the oom reaper invocation can be dropped. The unmapping part can be done
> > with the non-exclusive mmap_sem and the exclusive one is only required
> > when page tables are freed.
> >
> > Remove the oom_reaper from exit_mmap which will make the code easier to
> > read. This is really unlikely to make any observable difference although
> > some microbenchmarks could benefit from one less branch that needs to be
> > evaluated even though it almost never is true.
> >
> > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> >
>
> I've just reinstated the mapletree patchset so there are some
> conflicting changes.
>
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> >       return 0;
> >  }
> >
> > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > -
> >  long oom_badness(struct task_struct *p,
> >               unsigned long totalpages);
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2b9305ed0dda..b7918e6bb0db 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3110,30 +3110,13 @@ void exit_mmap(struct mm_struct *mm)
> >       /* mm's last user has gone, and its about to be pulled down */
> >       mmu_notifier_release(mm);
> >
> > -     if (unlikely(mm_is_oom_victim(mm))) {
> > -             /*
> > -              * Manually reap the mm to free as much memory as possible.
> > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -              * reaper will not run on this mm again after mmap_lock is
> > -              * dropped.
> > -              *
> > -              * Nothing can be holding mm->mmap_lock here and the above call
> > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > -              * __oom_reap_task_mm() will not block.
> > -              */
> > -             (void)__oom_reap_task_mm(mm);
> > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > -     }
> > -
> > -     mmap_write_lock(mm);
> > +     mmap_read_lock(mm);
>
> Unclear why this patch fiddles with the mm_struct locking in this
> fashion - changelogging that would have been helpful.

Yeah, I should have clarified this in the description. Everything up
to unmap_vmas() can be done under mmap_read_lock and that way
oom-reaper and process_mrelease can do the unmapping in parallel with
exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
free_pgtables. I think maple trees do not change that except there is
no mm->mmap anymore, so the line at the end of exit_mmap where we
reset mm->mmap to NULL can be removed (I show that line below).
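
To make the intended ordering concrete, the interleaving can be pictured
roughly as below. This is an illustration put together for this discussion,
not code quoted from the patch; the reaper-side column paraphrases (from
memory) the MMF_OOM_SKIP check the reaper does under the read lock:

    exit_mmap()                           oom reaper / process_mrelease
    -----------                           -----------------------------
    mmap_read_lock(mm);                   mmap_read_lock(mm);
    unmap_vmas(...);                      if (!test_bit(MMF_OOM_SKIP, &mm->flags))
    mmap_read_unlock(mm);                          __oom_reap_task_mm(mm);
    set_bit(MMF_OOM_SKIP, &mm->flags);    mmap_read_unlock(mm);
    mmap_write_lock(mm);
    free_pgtables(...);
    tlb_finish_mmu(&tlb);
    mmap_write_unlock(mm);

The unmapping can overlap because both sides only hold the read lock for
it. By the time free_pgtables() runs, a reaper has either already seen
MMF_OOM_SKIP and backed off, or it still holds the read lock, which blocks
the mmap_write_lock() above until the reaper is done.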

>
> But iirc mapletree wants to retain a write_lock here, so I ended up with
>
> void exit_mmap(struct mm_struct *mm)
> {
>         struct mmu_gather tlb;
>         struct vm_area_struct *vma;
>         unsigned long nr_accounted = 0;
>         MA_STATE(mas, &mm->mm_mt, 0, 0);
>         int count = 0;
>
>         /* mm's last user has gone, and its about to be pulled down */
>         mmu_notifier_release(mm);
>
>         mmap_write_lock(mm);
>         arch_exit_mmap(mm);
>
>         vma = mas_find(&mas, ULONG_MAX);
>         if (!vma) {
>                 /* Can happen if dup_mmap() received an OOM */
>                 mmap_write_unlock(mm);
>                 return;
>         }
>
>         lru_add_drain();
>         flush_cache_mm(mm);
>         tlb_gather_mmu_fullmm(&tlb, mm);
>         /* update_hiwater_rss(mm) here? but nobody should be looking */
>         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
>         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
>
>         /*
>          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
>          * because the memory has been already freed. Do not bother checking
>          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
>          */
>         set_bit(MMF_OOM_SKIP, &mm->flags);
>         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
>                       USER_PGTABLES_CEILING);
>         tlb_finish_mmu(&tlb);
>
>         /*
>          * Walk the list again, actually closing and freeing it, with preemption
>          * enabled, without holding any MM locks besides the unreachable
>          * mmap_write_lock.
>          */
>         do {
>                 if (vma->vm_flags & VM_ACCOUNT)
>                         nr_accounted += vma_pages(vma);
>                 remove_vma(vma);
>                 count++;
>                 cond_resched();
>         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>
>         BUG_ON(count != mm->map_count);
>
>         trace_exit_mmap(mm);
>         __mt_destroy(&mm->mm_mt);
>         mm->mmap = NULL;

^^^ this line above needs to be removed when the patch is applied over
the maple tree patchset.


>         mmap_write_unlock(mm);
>         vm_unacct_memory(nr_accounted);
> }
>
Suren Baghdasaryan June 1, 2022, 9:50 p.m. UTC | #3
On Wed, Jun 1, 2022 at 2:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 31 May 2022 15:30:59 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > to be a prevention of an excessive oom killing if the oom victim exit
> > > races with the oom reaper (see [1] for more details). The invocation has
> > > moved around since then because of the interaction with the munlock
> > > logic but the underlying reason has remained the same (see [2]).
> > >
> > > Munlock code is no longer a problem since [3] and there shouldn't be
> > > any blocking operation before the memory is unmapped by exit_mmap so
> > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > when page tables are freed.
> > >
> > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > read. This is really unlikely to make any observable difference although
> > > some microbenchmarks could benefit from one less branch that needs to be
> > > evaluated even though it almost never is true.
> > >
> > > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> > >
> >
> > I've just reinstated the mapletree patchset so there are some
> > conflicting changes.
> >
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > >       return 0;
> > >  }
> > >
> > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > -
> > >  long oom_badness(struct task_struct *p,
> > >               unsigned long totalpages);
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2b9305ed0dda..b7918e6bb0db 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -3110,30 +3110,13 @@ void exit_mmap(struct mm_struct *mm)
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > > -     if (unlikely(mm_is_oom_victim(mm))) {
> > > -             /*
> > > -              * Manually reap the mm to free as much memory as possible.
> > > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > -              * reaper will not run on this mm again after mmap_lock is
> > > -              * dropped.
> > > -              *
> > > -              * Nothing can be holding mm->mmap_lock here and the above call
> > > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > -              * __oom_reap_task_mm() will not block.
> > > -              */
> > > -             (void)__oom_reap_task_mm(mm);
> > > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > > -     }
> > > -
> > > -     mmap_write_lock(mm);
> > > +     mmap_read_lock(mm);
> >
> > Unclear why this patch fiddles with the mm_struct locking in this
> > fashion - changelogging that would have been helpful.
>
> Yeah, I should have clarified this in the description. Everything up
> to unmap_vmas() can be done under mmap_read_lock and that way
> oom-reaper and process_mrelease can do the unmapping in parallel with
> exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
> mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
> free_pgtables. I think maple trees do not change that except there is
> no mm->mmap anymore, so the line at the end of exit_mmap where we
> reset mm->mmap to NULL can be removed (I show that line below).

In the current changelog I have this explanation:

"The unmapping part can be done with the non-exclusive mmap_sem and
the exclusive one is only required when page tables are freed."

Should I resend a v3 with a more detailed explanation of these
mmap_lock manipulations?

>
> >
> > But iirc mapletree wants to retain a write_lock here, so I ended up with
> >
> > void exit_mmap(struct mm_struct *mm)
> > {
> >         struct mmu_gather tlb;
> >         struct vm_area_struct *vma;
> >         unsigned long nr_accounted = 0;
> >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> >         int count = 0;
> >
> >         /* mm's last user has gone, and its about to be pulled down */
> >         mmu_notifier_release(mm);
> >
> >         mmap_write_lock(mm);
> >         arch_exit_mmap(mm);
> >
> >         vma = mas_find(&mas, ULONG_MAX);
> >         if (!vma) {
> >                 /* Can happen if dup_mmap() received an OOM */
> >                 mmap_write_unlock(mm);
> >                 return;
> >         }
> >
> >         lru_add_drain();
> >         flush_cache_mm(mm);
> >         tlb_gather_mmu_fullmm(&tlb, mm);
> >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> >
> >         /*
> >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> >          * because the memory has been already freed. Do not bother checking
> >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> >          */
> >         set_bit(MMF_OOM_SKIP, &mm->flags);
> >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> >                       USER_PGTABLES_CEILING);
> >         tlb_finish_mmu(&tlb);
> >
> >         /*
> >          * Walk the list again, actually closing and freeing it, with preemption
> >          * enabled, without holding any MM locks besides the unreachable
> >          * mmap_write_lock.
> >          */
> >         do {
> >                 if (vma->vm_flags & VM_ACCOUNT)
> >                         nr_accounted += vma_pages(vma);
> >                 remove_vma(vma);
> >                 count++;
> >                 cond_resched();
> >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> >
> >         BUG_ON(count != mm->map_count);
> >
> >         trace_exit_mmap(mm);
> >         __mt_destroy(&mm->mm_mt);
> >         mm->mmap = NULL;
>
> ^^^ this line above needs to be removed when the patch is applied over
> the maple tree patchset.
>
>
> >         mmap_write_unlock(mm);
> >         vm_unacct_memory(nr_accounted);
> > }
> >
Michal Hocko June 2, 2022, 6:53 a.m. UTC | #4
On Wed 01-06-22 14:47:41, Suren Baghdasaryan wrote:
> On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
[...]
> > But iirc mapletree wants to retain a write_lock here, so I ended up with
> >
> > void exit_mmap(struct mm_struct *mm)
> > {
> >         struct mmu_gather tlb;
> >         struct vm_area_struct *vma;
> >         unsigned long nr_accounted = 0;
> >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> >         int count = 0;
> >
> >         /* mm's last user has gone, and its about to be pulled down */
> >         mmu_notifier_release(mm);
> >
> >         mmap_write_lock(mm);
> >         arch_exit_mmap(mm);
> >
> >         vma = mas_find(&mas, ULONG_MAX);
> >         if (!vma) {
> >                 /* Can happen if dup_mmap() received an OOM */
> >                 mmap_write_unlock(mm);
> >                 return;
> >         }
> >
> >         lru_add_drain();
> >         flush_cache_mm(mm);
> >         tlb_gather_mmu_fullmm(&tlb, mm);
> >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> >
> >         /*
> >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> >          * because the memory has been already freed. Do not bother checking
> >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> >          */
> >         set_bit(MMF_OOM_SKIP, &mm->flags);
> >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> >                       USER_PGTABLES_CEILING);
> >         tlb_finish_mmu(&tlb);
> >
> >         /*
> >          * Walk the list again, actually closing and freeing it, with preemption
> >          * enabled, without holding any MM locks besides the unreachable
> >          * mmap_write_lock.
> >          */
> >         do {
> >                 if (vma->vm_flags & VM_ACCOUNT)
> >                         nr_accounted += vma_pages(vma);
> >                 remove_vma(vma);
> >                 count++;
> >                 cond_resched();
> >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> >
> >         BUG_ON(count != mm->map_count);
> >
> >         trace_exit_mmap(mm);
> >         __mt_destroy(&mm->mm_mt);
> >         mm->mmap = NULL;
> 
> ^^^ this line above needs to be removed when the patch is applied over
> the maple tree patchset.

I am not fully up to date on the maple tree changes. Could you explain
why resetting mm->mmap is not needed anymore please?
Liam R. Howlett June 2, 2022, 1:31 p.m. UTC | #5
* Michal Hocko <mhocko@suse.com> [220602 02:53]:
> On Wed 01-06-22 14:47:41, Suren Baghdasaryan wrote:
> > On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> [...]
> > > But iirc mapletree wants to retain a write_lock here, so I ended up with
> > >
> > > void exit_mmap(struct mm_struct *mm)
> > > {
> > >         struct mmu_gather tlb;
> > >         struct vm_area_struct *vma;
> > >         unsigned long nr_accounted = 0;
> > >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> > >         int count = 0;
> > >
> > >         /* mm's last user has gone, and its about to be pulled down */
> > >         mmu_notifier_release(mm);
> > >
> > >         mmap_write_lock(mm);
> > >         arch_exit_mmap(mm);
> > >
> > >         vma = mas_find(&mas, ULONG_MAX);
> > >         if (!vma) {
> > >                 /* Can happen if dup_mmap() received an OOM */
> > >                 mmap_write_unlock(mm);
> > >                 return;
> > >         }
> > >
> > >         lru_add_drain();
> > >         flush_cache_mm(mm);
> > >         tlb_gather_mmu_fullmm(&tlb, mm);
> > >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> > >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> > >
> > >         /*
> > >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > >          * because the memory has been already freed. Do not bother checking
> > >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > >          */
> > >         set_bit(MMF_OOM_SKIP, &mm->flags);
> > >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> > >                       USER_PGTABLES_CEILING);
> > >         tlb_finish_mmu(&tlb);
> > >
> > >         /*
> > >          * Walk the list again, actually closing and freeing it, with preemption
> > >          * enabled, without holding any MM locks besides the unreachable
> > >          * mmap_write_lock.
> > >          */
> > >         do {
> > >                 if (vma->vm_flags & VM_ACCOUNT)
> > >                         nr_accounted += vma_pages(vma);
> > >                 remove_vma(vma);
> > >                 count++;
> > >                 cond_resched();
> > >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > >
> > >         BUG_ON(count != mm->map_count);
> > >
> > >         trace_exit_mmap(mm);
> > >         __mt_destroy(&mm->mm_mt);
> > >         mm->mmap = NULL;
> > 
> > ^^^ this line above needs to be removed when the patch is applied over
> > the maple tree patchset.
> 
> I am not fully up to date on the maple tree changes. Could you explain
> why resetting mm->mmap is not needed anymore please?

The maple tree patch set removes the linked list, including mm->mmap.
The call to __mt_destroy() means none of the old VMAs can be found in
the race condition that mm->mmap = NULL was solving.


Thanks,
Liam
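
As an aside, the point above can be illustrated with a hypothetical VMA
walk (not code from either patchset; the nr_vmas counter is made up for
the example): with the linked list gone, walkers go through a maple tree
iterator instead of mm->mmap/vm_next, so after __mt_destroy() there is
nothing stale left to find and no list head to poison with NULL.

	struct vm_area_struct *vma;
	unsigned long nr_vmas = 0;

	/* before the maple tree: walk the mm->mmap linked list */
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		nr_vmas++;

	/* with the maple tree: walk mm->mm_mt through an iterator */
	MA_STATE(mas, &mm->mm_mt, 0, 0);
	while ((vma = mas_find(&mas, ULONG_MAX)) != NULL)
		nr_vmas++;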
Matthew Wilcox June 2, 2022, 1:39 p.m. UTC | #6
On Wed, Jun 01, 2022 at 02:47:41PM -0700, Suren Baghdasaryan wrote:
> > Unclear why this patch fiddles with the mm_struct locking in this
> > fashion - changelogging that would have been helpful.
> 
> Yeah, I should have clarified this in the description. Everything up
> to unmap_vmas() can be done under mmap_read_lock and that way
> oom-reaper and process_mrelease can do the unmapping in parallel with
> exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
> mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
> free_pgtables. I think maple trees do not change that except there is
> no mm->mmap anymore, so the line at the end of exit_mmap where we
> reset mm->mmap to NULL can be removed (I show that line below).

I don't understand why we _want_ unmapping to proceed in parallel?  Is it
so urgent to unmap these page tables that we need two processes doing
it at the same time?  And doesn't that just change the contention from
visible (contention on a lock) to invisible (contention on cachelines)?
Michal Hocko June 2, 2022, 2:08 p.m. UTC | #7
On Thu 02-06-22 13:31:27, Liam Howlett wrote:
> * Michal Hocko <mhocko@suse.com> [220602 02:53]:
> > On Wed 01-06-22 14:47:41, Suren Baghdasaryan wrote:
> > > On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > [...]
> > > > But iirc mapletree wants to retain a write_lock here, so I ended up with
> > > >
> > > > void exit_mmap(struct mm_struct *mm)
> > > > {
> > > >         struct mmu_gather tlb;
> > > >         struct vm_area_struct *vma;
> > > >         unsigned long nr_accounted = 0;
> > > >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> > > >         int count = 0;
> > > >
> > > >         /* mm's last user has gone, and its about to be pulled down */
> > > >         mmu_notifier_release(mm);
> > > >
> > > >         mmap_write_lock(mm);
> > > >         arch_exit_mmap(mm);
> > > >
> > > >         vma = mas_find(&mas, ULONG_MAX);
> > > >         if (!vma) {
> > > >                 /* Can happen if dup_mmap() received an OOM */
> > > >                 mmap_write_unlock(mm);
> > > >                 return;
> > > >         }
> > > >
> > > >         lru_add_drain();
> > > >         flush_cache_mm(mm);
> > > >         tlb_gather_mmu_fullmm(&tlb, mm);
> > > >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> > > >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> > > >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> > > >
> > > >         /*
> > > >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > >          * because the memory has been already freed. Do not bother checking
> > > >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > >          */
> > > >         set_bit(MMF_OOM_SKIP, &mm->flags);
> > > >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> > > >                       USER_PGTABLES_CEILING);
> > > >         tlb_finish_mmu(&tlb);
> > > >
> > > >         /*
> > > >          * Walk the list again, actually closing and freeing it, with preemption
> > > >          * enabled, without holding any MM locks besides the unreachable
> > > >          * mmap_write_lock.
> > > >          */
> > > >         do {
> > > >                 if (vma->vm_flags & VM_ACCOUNT)
> > > >                         nr_accounted += vma_pages(vma);
> > > >                 remove_vma(vma);
> > > >                 count++;
> > > >                 cond_resched();
> > > >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > > >
> > > >         BUG_ON(count != mm->map_count);
> > > >
> > > >         trace_exit_mmap(mm);
> > > >         __mt_destroy(&mm->mm_mt);
> > > >         mm->mmap = NULL;
> > > 
> > > ^^^ this line above needs to be removed when the patch is applied over
> > > the maple tree patchset.
> > 
> > I am not fully up to date on the maple tree changes. Could you explain
> > why resetting mm->mmap is not needed anymore please?
> 
> The maple tree patch set removes the linked list, including mm->mmap.
> The call to __mt_destroy() means none of the old VMAs can be found in
> the race condition that mm->mmap = NULL was solving.

Thanks for the clarification, Liam.
Suren Baghdasaryan June 2, 2022, 3:02 p.m. UTC | #8
On Thu, Jun 2, 2022 at 6:39 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jun 01, 2022 at 02:47:41PM -0700, Suren Baghdasaryan wrote:
> > > Unclear why this patch fiddles with the mm_struct locking in this
> > > fashion - changelogging that would have been helpful.
> >
> > Yeah, I should have clarified this in the description. Everything up
> > to unmap_vmas() can be done under mmap_read_lock and that way
> > oom-reaper and process_mrelease can do the unmapping in parallel with
> > exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
> > mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
> > free_pgtables. I think maple trees do not change that except there is
> > no mm->mmap anymore, so the line at the end of exit_mmap where we
> > reset mm->mmap to NULL can be removed (I show that line below).
>
> I don't understand why we _want_ unmapping to proceed in parallel?  Is it
> so urgent to unmap these page tables that we need two processes doing
> it at the same time?  And doesn't that just change the contention from
> visible (contention on a lock) to invisible (contention on cachelines)?

It's important for the process_madvise() syscall not to be blocked by a
potentially lower-priority task doing exit_mmap. I've seen such a
priority inversion happen when the dying process is running on a little
core and taking its time while a high-priority task waits in the
syscall, even though there is no reason for them to block each other.

>

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 02d1e7bbd8cd..6cdde62b078b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -106,8 +106,6 @@  static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
-
 long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 2b9305ed0dda..b7918e6bb0db 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3110,30 +3110,13 @@  void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_lock for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_lock is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_lock here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 */
-		(void)__oom_reap_task_mm(mm);
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-	}
-
-	mmap_write_lock(mm);
+	mmap_read_lock(mm);
 	arch_exit_mmap(mm);
 
 	vma = mm->mmap;
 	if (!vma) {
 		/* Can happen if dup_mmap() received an OOM */
-		mmap_write_unlock(mm);
+		mmap_read_unlock(mm);
 		return;
 	}
 
@@ -3143,6 +3126,16 @@  void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
+	mmap_read_unlock(mm);
+
+	/*
+	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
+	 * because the memory has been already freed. Do not bother checking
+	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
+	 */
+	set_bit(MMF_OOM_SKIP, &mm->flags);
+
+	mmap_write_lock(mm);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8a70bca67c94..98dca2b42357 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -538,7 +538,7 @@  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;