
[4/3] mm: drop MMF_OOM_SKIP from exit_mmap

Message ID YbHIaq9a0CtqRulE@dhcp22.suse.cz (mailing list archive)
State New
Series [v4,1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

Commit Message

Michal Hocko Dec. 9, 2021, 9:12 a.m. UTC
Do we want this on top?
----
From 58b04ae6dc97b0105ea2651daca55cf2386f69b4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 9 Dec 2021 10:07:51 +0100
Subject: [PATCH] mm: drop MMF_OOM_SKIP from exit_mmap

MMF_OOM_SKIP used to play a synchronization role between exit_mmap and
the oom reaper. Since the exclusive mmap_lock is now held in exit_mmap
across all destructive operations, the flag-based synchronization is no
longer needed and we can safely drop it. Just make sure that mm->mmap
is set to NULL so that nobody will access the freed vma list.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmap.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)
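
For reference, here is the resulting shape of exit_mmap() with this
patch applied, reconstructed from the hunks at the bottom of this page
(the unchanged middle of the function is elided):

    void exit_mmap(struct mm_struct *mm)
    {
            /* mm's last user has gone, and its about to be pulled down */
            mmu_notifier_release(mm);

            /* No mm_is_oom_victim()/MMF_OOM_SKIP special case anymore:
             * the exclusive mmap_lock covers the whole teardown. */
            mmap_write_lock(mm);
            if (mm->locked_vm)
                    unlock_range(mm->mmap, ULONG_MAX);

            /* ... unmap_vmas(), free_pgtables(), remove_vma() loop ... */

            mm->mmap = NULL;        /* nobody may touch the freed vma list */
            mmap_write_unlock(mm);
            vm_unacct_memory(nr_accounted);
    }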

Comments

Suren Baghdasaryan Dec. 9, 2021, 4:24 p.m. UTC | #1
On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Do we want this on top?

As we discussed in this thread
https://lore.kernel.org/all/YY4snVzZZZYhbigV@dhcp22.suse.cz,
__oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
unmap pages in parallel with exit_mmap without blocking each other.
Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
and has a negative impact on performance. So I thought the conclusion
of that thread was to keep that part. My understanding is that we also
wanted to remove MMF_OOM_SKIP as a follow-up patch but
__oom_reap_task_mm would stay.


> [...]
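
To ground the discussion that follows, here is a simplified,
from-memory sketch of the reaper side, showing how MMF_OOM_SKIP and
mmap_lock interact today; this is an approximation, not the verbatim
kernel code (see oom_reap_task_mm() in mm/oom_kill.c for the real
thing):

    static bool oom_reap_task_mm_sketch(struct mm_struct *mm)
    {
            bool ret = true;

            /* Back off while exit_mmap() (or anyone else) holds
             * mmap_lock for write; the reaper retries later. */
            if (!mmap_read_trylock(mm))
                    return false;

            /* MMF_OOM_SKIP marks an mm that needs no further reaping;
             * testing it under mmap_lock ensures the reaper never walks
             * a vma list that exit_mmap() has already freed. */
            if (test_bit(MMF_OOM_SKIP, &mm->flags))
                    goto out_unlock;

            ret = __oom_reap_task_mm(mm);   /* unmap what can be reaped */

    out_unlock:
            mmap_read_unlock(mm);
            return ret;
    }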
Michal Hocko Dec. 9, 2021, 4:47 p.m. UTC | #2
On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Do we want this on top?
> 
> As we discussed in this thread
> https://lore.kernel.org/all/YY4snVzZZZYhbigV@dhcp22.suse.cz,
> __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> unmap pages in parallel with exit_mmap without blocking each other.
> Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> and has a negative impact on performance. So I thought the conclusion
> of that thread was to keep that part. My understanding is that we also
> wanted to remove MMF_OOM_SKIP as a follow-up patch but
> __oom_reap_task_mm would stay.

OK, then we were talking past each other, I am afraid. I really wanted
to get rid of this oom-specific stuff from exit_mmap. It was there out
of necessity. With proper locking we can finally get rid of the crud.
As I've said previously, oom reaping has never been a hot path.

If we really want to optimize this path then I would much rather see a
generic solution which would allow moving the write lock down after
unmap_vmas. That would require the oom reaper to be able to handle
mlocked memory.
Suren Baghdasaryan Dec. 9, 2021, 5:06 p.m. UTC | #3
On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> [...]
>
> OK, then we were talking past each other, I am afraid. I really wanted
> to get rid of this oom-specific stuff from exit_mmap. It was there out
> of necessity. With proper locking we can finally get rid of the crud.
> As I've said previously, oom reaping has never been a hot path.
>
> If we really want to optimize this path then I would much rather see a
> generic solution which would allow moving the write lock down after
> unmap_vmas. That would require the oom reaper to be able to handle
> mlocked memory.

Ok, let's work on that and when that's done we can get rid of the oom
stuff in exit_mmap. I'll look into this over the weekend and will
likely be back with questions.
Thanks!

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Dec. 16, 2021, 2:26 a.m. UTC | #4
On Thu, Dec 9, 2021 at 9:06 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
>
> Ok, let's work on that and when that's done we can get rid of the oom
> stuff in exit_mmap. I'll look into this over the weekend and will
> likely be back with questions.

As promised, I have a question:
Any particular reason why munlock_vma_pages_range clears VM_LOCKED
before unlocking pages and not after (see:
https://elixir.bootlin.com/linux/latest/source/mm/mlock.c#L424)? Seems
to me if VM_LOCKED was reset at the end (with proper ordering) then
__oom_reap_task_mm would correctly skip VM_LOCKED vmas.
https://lore.kernel.org/lkml/20180514064824.534798031@linuxfoundation.org/
has this explanation:

"Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
vm_flags before actually doing the munlock to determine if any other
vmas are locking the same memory, the check for VM_LOCKED in the oom
reaper is racy."

but "to determine if any other vmas are locking the same memory"
explanation eludes me... Any insights?
Thanks,
Suren.

> Thanks!
>
> > --
> > Michal Hocko
> > SUSE Labs
Suren Baghdasaryan Dec. 16, 2021, 5:23 p.m. UTC | #5
On Thu, Dec 16, 2021 at 3:49 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Dec 15, 2021 at 06:26:11PM -0800, Suren Baghdasaryan wrote:
> > [...]
> >
> > As promised, I have a question:
> > Any particular reason why munlock_vma_pages_range clears VM_LOCKED
> > before unlocking pages and not after (see:
> > https://elixir.bootlin.com/linux/latest/source/mm/mlock.c#L424)? Seems
> > to me if VM_LOCKED was reset at the end (with proper ordering) then
> > __oom_reap_task_mm would correctly skip VM_LOCKED vmas.
> > https://lore.kernel.org/lkml/20180514064824.534798031@linuxfoundation.org/
> > has this explanation:
> >
> > "Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
> > vm_flags before actually doing the munlock to determine if any other
> > vmas are locking the same memory, the check for VM_LOCKED in the oom
> > reaper is racy."
> >
> > but "to determine if any other vmas are locking the same memory"
> > explanation eludes me... Any insights?
>
> A page's mlock state is determined by whether any of the vmas that map
> it are mlocked. The munlock code does:
>
> vma->vm_flags &= VM_LOCKED_CLEAR_MASK
> TestClearPageMlocked()
> isolate_lru_page()
> __munlock_isolated_page()
>   page_mlock()
>     rmap_walk() # for_each_vma()
>       page_mlock_one()
>         (vma->vm_flags & VM_LOCKED) && TestSetPageMlocked()
>
> If we didn't clear the VM_LOCKED flag first, racing threads could
> re-lock pages under us because they see that flag and think our vma
> wants those pages mlocked when we're in the process of munlocking.

Thanks for the explanation Johannes!
So far I haven't found an easy way to let __oom_reap_task_mm() run
concurrently with unlock_range(). Will keep exploring.

>
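
To make the race concrete, here is a simplified sketch of the rmap-side
check, loosely modeled on page_mlock_one() in mm/rmap.c (names and
signature approximated, page-table walking and locking elided):

    static bool page_mlock_one_sketch(struct page *page,
                                      struct vm_area_struct *vma,
                                      unsigned long address, void *unused)
    {
            /* A vma that has already cleared VM_LOCKED no longer pins
             * the page, so the walk moves on to the next mapping. */
            if (!(vma->vm_flags & VM_LOCKED))
                    return true;            /* continue the rmap walk */

            /* If the munlocker cleared VM_LOCKED only *after* the
             * per-page work, a stale flag observed here would re-mlock
             * a page that is in the middle of being munlocked. */
            mlock_vma_page(page);
            return false;                   /* page stays mlocked; stop */
    }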
Suren Baghdasaryan Dec. 30, 2021, 5:59 a.m. UTC | #6
On Thu, Dec 16, 2021 at 9:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
>
> Thanks for the explanation Johannes!
> So far I haven't found an easy way to let __oom_reap_task_mm() run
> concurrently with unlock_range(). Will keep exploring.

After some more digging I think there are two acceptable options:

1. Call unlock_range() under mmap_write_lock and then downgrade it to
read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
vmas in parallel like this:

    if (mm->locked_vm) {
        mmap_write_lock(mm);
        unlock_range(mm->mmap, ULONG_MAX);
        mmap_write_downgrade(mm);
    } else
        mmap_read_lock(mm);
...
    unmap_vmas(&tlb, vma, 0, -1);
    mmap_read_unlock(mm);
    mmap_write_lock(mm);
    free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
...
    mm->mmap = NULL;
    mmap_write_unlock(mm);

This way exit_mmap() might block __oom_reap_task_mm() but only for a
much shorter time, during the unlock_range() call.

2. Introduce another vm_flag mask similar to VM_LOCKED which is set
before munlock_vma_pages_range() clears VM_LOCKED so that
__oom_reap_task_mm() can identify vmas being unlocked and skip them.

Option 1 seems cleaner to me because it keeps the locking pattern
around unlock_range() in exit_mmap() consistent with all other places
it is used (in mremap() and munmap()) with mmap_write_lock taken.
WDYT?

> >
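
For concreteness, option 2 might look roughly like the sketch below.
The VM_MUNLOCK_UNDERWAY bit and both helpers are invented for
illustration; they do not exist in the kernel, and the memory-ordering
details mentioned above are deliberately elided:

    /* Invented for illustration only -- not a real kernel flag. */
    #define VM_MUNLOCK_UNDERWAY     0x100000000UL

    /* munlock side: publish "munlock in progress" before VM_LOCKED is
     * cleared, so observers can tell "never locked" apart from "being
     * unlocked right now". */
    static void munlock_vma_pages_range_sketch(struct vm_area_struct *vma)
    {
            vma->vm_flags |= VM_MUNLOCK_UNDERWAY;
            vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

            /* ... existing munlock_vma_pages_range() body ... */

            vma->vm_flags &= ~VM_MUNLOCK_UNDERWAY;
    }

    /* reaper side: the check __oom_reap_task_mm() would do per vma. */
    static bool vma_skip_for_reaper(struct vm_area_struct *vma)
    {
            return vma->vm_flags & (VM_LOCKED | VM_MUNLOCK_UNDERWAY);
    }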
Michal Hocko Dec. 30, 2021, 8:24 a.m. UTC | #7
On Wed 29-12-21 21:59:55, Suren Baghdasaryan wrote:
[...]
> After some more digging I think there are two acceptable options:
> 
> 1. Call unlock_range() under mmap_write_lock and then downgrade it to
> read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
> vmas in parallel like this:
> 
>     if (mm->locked_vm) {
>         mmap_write_lock(mm);
>         unlock_range(mm->mmap, ULONG_MAX);
>         mmap_write_downgrade(mm);
>     } else
>         mmap_read_lock(mm);
> ...
>     unmap_vmas(&tlb, vma, 0, -1);
>     mmap_read_unlock(mm);
>     mmap_write_lock(mm);
>     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> ...
>     mm->mmap = NULL;
>     mmap_write_unlock(mm);
> 
> This way exit_mmap() might block __oom_reap_task_mm() but only for a
> much shorter time, during the unlock_range() call.

IIRC unlock_range depends on the page lock at some stage and that can
mean this will block for a long time, or forever when the holder of the
lock depends on a memory allocation. This was the primary reason why the
oom reaper skips over mlocked vmas.

> 2. Introduce another vm_flag mask similar to VM_LOCKED which is set
> before munlock_vma_pages_range() clears VM_LOCKED so that
> __oom_reap_task_mm() can identify vmas being unlocked and skip them.
> 
> Option 1 seems cleaner to me because it keeps the locking pattern
> around unlock_range() in exit_mmap() consistent with all other places
> it is used (in mremap() and munmap()) with mmap_write_lock taken.
> WDYT?

It would be really great to make unlock_range oom reaper aware IMHO.

You do not quote your change in full, so it is not really clear
whether you are planning to drop __oom_reap_task_mm from exit_mmap as
well. If yes, then 1) could push the oom reaper to its timeout while
unlock_range is blocked on something, so that wouldn't be an
improvement. 2) sounds like a workaround to me as it doesn't really
address the underlying problem.

I have to say that I am not really a great fan of __oom_reap_task_mm in
exit_mmap but I would rather see it in place than make the surrounding
code more complex/tricky.
Suren Baghdasaryan Dec. 30, 2021, 5:29 p.m. UTC | #8
On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 29-12-21 21:59:55, Suren Baghdasaryan wrote:
> [...]
> > After some more digging I think there are two acceptable options:
> >
> > 1. Call unlock_range() under mmap_write_lock and then downgrade it to
> > read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
> > vmas in parallel like this:
> >
> > [...]
> >
> > This way exit_mmap() might block __oom_reap_task_mm() but only for a
> > much shorter time, during the unlock_range() call.
>
> IIRC unlock_range depends on the page lock at some stage and that can
> mean this will block for a long time, or forever when the holder of the
> lock depends on a memory allocation. This was the primary reason why the
> oom reaper skips over mlocked vmas.

Oh, I missed that detail. I thought __oom_reap_task_mm() skips locked
vmas only to avoid destroying pgds from under follow_page().

>
> > 2. Introduce another vm_flag mask similar to VM_LOCKED which is set
> > before munlock_vma_pages_range() clears VM_LOCKED so that
> > __oom_reap_task_mm() can identify vmas being unlocked and skip them.
> >
> > Option 1 seems cleaner to me because it keeps the locking pattern
> > around unlock_range() in exit_mmap() consistent with all other places
> > it is used (in mremap() and munmap()) with mmap_write_lock taken.
> > WDYT?
>
> It would be really great to make unlock_range oom reaper aware IMHO.

What exactly do you envision? Say unlock_range() knows that it's
racing with __oom_reap_task_mm() and that calling follow_page() is
unsafe without locking, what should it do?

>
> You do not quote your change in full, so it is not really clear
> whether you are planning to drop __oom_reap_task_mm from exit_mmap as
> well.

Yes, that was the plan.

> If yes, then 1) could push the oom reaper to its timeout while
> unlock_range is blocked on something, so that wouldn't be an
> improvement. 2) sounds like a workaround to me as it doesn't really
> address the underlying problem.

With (1) potentially blocking due to allocation I can see why this is a problem.
Agree about (2).

>
> I have to say that I am not really a great fan of __oom_reap_task_mm in
> exit_mmap but I would rather see it in place than make the surrounding
> code more complex/tricky.

Agree. So far I could not find a cleaner solution. I thought (1) would
be a good one but the point you made renders it invalid. If you
clarify your comment about making unlock_range oom reaper aware maybe
that will open a new line of investigation?
Thanks,
Suren.

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 3, 2022, 12:11 p.m. UTC | #9
On Thu 30-12-21 09:29:40, Suren Baghdasaryan wrote:
> On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >
> > It would be really great to make unlock_range oom reaper aware IMHO.
> 
> What exactly do you envision? Say unlock_range() knows that it's
> racing with __oom_reap_task_mm() and that calling follow_page() is
> unsafe without locking, what should it do?

My original plan was to make the page lock conditional and use
trylocking from the oom reaper (aka lockless context). It is OK to
simply bail out and leave some mlocked memory behind if there is
contention on a specific page. The overall objective is to free as much
memory as possible, not all of it.

IIRC Hugh was not a fan of this approach and he has mentioned that the
lock might not even be needed and that the area would benefit from a
cleanup rather than oom-reaper-specific hacks. I do tend to agree with
that. I just never managed to find any spare time for that though, and
heavily mlocked oom victims tend to be really rare.
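
A rough sketch of that conditional-trylock idea follows; the helper and
its lockless parameter are invented for illustration and are not an
existing kernel interface:

    /* Sketch: munlock page handling that a lockless caller (the oom
     * reaper) could use without risking an indefinite sleep. */
    static bool munlock_page_cond_sketch(struct page *page, bool lockless)
    {
            if (lockless) {
                    if (!trylock_page(page))
                            return false;   /* contended: leave the page
                                             * mlocked and move on */
            } else {
                    lock_page(page);        /* regular munlock may sleep */
            }

            /* ... TestClearPageMlocked(), isolation, accounting ... */

            unlock_page(page);
            return true;
    }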
Hugh Dickins Jan. 3, 2022, 9:16 p.m. UTC | #10
On Mon, 3 Jan 2022, Michal Hocko wrote:
> On Thu 30-12-21 09:29:40, Suren Baghdasaryan wrote:
> > On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > It would be really great to make unlock_range oom reaper aware IMHO.
> > 
> > What exactly do you envision? Say unlock_range() knows that it's
> > racing with __oom_reap_task_mm() and that calling follow_page() is
> > unsafe without locking, what should it do?
> 
> My original plan was to make the page lock conditional and use
> trylocking from the oom reaper (aka lockless context). It is OK to
> simply bail out and leave some mlocked memory behind if there is
> contention on a specific page. The overall objective is to free as much
> memory as possible, not all of it.
> 
> IIRC Hugh was not a fan of this approach and he has mentioned that the
> lock might not even be needed and that the area would benefit from a
> cleanup rather than oom-reaper-specific hacks. I do tend to agree with
> that. I just never managed to find any spare time for that though, and
> heavily mlocked oom victims tend to be really rare.

I forget when that was, and what I had in mind at that time.
But yes, by now I am very sure that munlocking needs a cleanup.

And I do have that cleanup (against a much older tree), but never
the time to rebase or post or shepherd it through N revisions.

It was 22 files changed, 464 insertions, 706 deletions:
which is too much to help with this immediate oom reaper question.

I'd better not drive this discussion further off-course; but it pains
me to see munlock_vma_pages obstructing, knowing there's a better way.

I wonder: what if I were to steal time and promise to post a
rebased series against 5.17-rc1 or rc2: not support it thereafter,
but might there be someone to pick it up and shepherd it through?
But there's no answer to that, without you seeing what it's like.

Hugh
Suren Baghdasaryan Jan. 4, 2022, 10:24 p.m. UTC | #11
On Mon, Jan 3, 2022 at 1:16 PM Hugh Dickins <hughd@google.com> wrote:
>
> [...]
>
> I forget when that was, and what I had in mind at that time.
> But yes, by now I am very sure that munlocking needs a cleanup.
>
> And I do have that cleanup (against a much older tree), but never
> the time to rebase or post or shepherd it through N revisions.

How old was that tree?

>
> It was 22 files changed, 464 insertions, 706 deletions:
> which is too much to help with this immediate oom reaper question.
>
> I'd better not drive this discussion further off-course; but it pains
> me to see munlock_vma_pages obstructing, knowing there's a better way.
>
> I wonder: what if I were to steal time and promise to post a
> rebased series against 5.17-rc1 or rc2: not support it thereafter,
> but might there be someone to pick it up and shepherd it through?
> But there's no answer to that, without you seeing what it's like.

I would be interested in taking a look and seeing if it can be upstreamed
and supported without bugging you too much.
Thanks,
Suren.

>
> Hugh

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index f4e09d390a07..0d6af9d89aa8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3129,28 +3129,6 @@  void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_lock for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_lock is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_lock here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling unlock_range(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
-		 */
-		(void)__oom_reap_task_mm(mm);
-
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-	}
-
 	mmap_write_lock(mm);
 	if (mm->locked_vm)
 		unlock_range(mm->mmap, ULONG_MAX);
@@ -3180,6 +3158,7 @@  void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 		cond_resched();
 	}
+	mm->mmap = NULL;
 	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
 }