
[v3,7/8] mm: drop VMA lock before waiting for migration

Message ID 20230627042321.1763765-8-surenb@google.com (mailing list archive)
State New, archived
Series Per-VMA lock support for swap and userfaults

Commit Message

Suren Baghdasaryan June 27, 2023, 4:23 a.m. UTC
migration_entry_wait does not need VMA lock, therefore it can be
dropped before waiting.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/memory.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Alistair Popple June 27, 2023, 8:02 a.m. UTC | #1
Suren Baghdasaryan <surenb@google.com> writes:

> migration_entry_wait does not need VMA lock, therefore it can be
> dropped before waiting.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/memory.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5caaa4c66ea2..bdf46fdc58d6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> -			migration_entry_wait(vma->vm_mm, vmf->pmd,
> -					     vmf->address);
> +			/* Save mm in case VMA lock is dropped */
> +			struct mm_struct *mm = vma->vm_mm;
> +
> +			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +				/*
> +				 * No need to hold VMA lock for migration.
> +				 * WARNING: vma can't be used after this!
> +				 */
> +				vma_end_read(vma);
> +				ret |= VM_FAULT_COMPLETED;

Doesn't this need to also set FAULT_FLAG_LOCK_DROPPED to ensure we don't
call vma_end_read() again in __handle_mm_fault()?

> +			}
> +			migration_entry_wait(mm, vmf->pmd, vmf->address);
>  		} else if (is_device_exclusive_entry(entry)) {
>  			vmf->page = pfn_swap_entry_to_page(entry);
>  			ret = remove_device_exclusive_entry(vmf);
Suren Baghdasaryan June 27, 2023, 3:35 p.m. UTC | #2
On Tue, Jun 27, 2023 at 1:06 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> > migration_entry_wait does not need VMA lock, therefore it can be
> > dropped before waiting.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/memory.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5caaa4c66ea2..bdf46fdc58d6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       entry = pte_to_swp_entry(vmf->orig_pte);
> >       if (unlikely(non_swap_entry(entry))) {
> >               if (is_migration_entry(entry)) {
> > -                     migration_entry_wait(vma->vm_mm, vmf->pmd,
> > -                                          vmf->address);
> > +                     /* Save mm in case VMA lock is dropped */
> > +                     struct mm_struct *mm = vma->vm_mm;
> > +
> > +                     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > +                             /*
> > +                              * No need to hold VMA lock for migration.
> > +                              * WARNING: vma can't be used after this!
> > +                              */
> > +                             vma_end_read(vma);
> > +                             ret |= VM_FAULT_COMPLETED;
>
> Doesn't this need to also set FAULT_FLAG_LOCK_DROPPED to ensure we don't
> call vma_end_read() again in __handle_mm_fault()?

Uh, right. Got lost during the last refactoring. Thanks for flagging!
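
A minimal sketch of what the fix would roughly look like, assuming
FAULT_FLAG_LOCK_DROPPED from earlier in this series is the right flag for
recording that the lock was already released (illustrative placement only,
not the final patch):

			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
				/*
				 * No need to hold VMA lock for migration.
				 * WARNING: vma can't be used after this!
				 */
				vma_end_read(vma);
				/*
				 * Record the drop so __handle_mm_fault()
				 * doesn't call vma_end_read() again.
				 */
				vmf->flags |= FAULT_FLAG_LOCK_DROPPED;
				ret |= VM_FAULT_COMPLETED;
			}
			migration_entry_wait(mm, vmf->pmd, vmf->address);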

>
> > +                     }
> > +                     migration_entry_wait(mm, vmf->pmd, vmf->address);
> >               } else if (is_device_exclusive_entry(entry)) {
> >                       vmf->page = pfn_swap_entry_to_page(entry);
> >                       ret = remove_device_exclusive_entry(vmf);
>
Peter Xu June 27, 2023, 3:49 p.m. UTC | #3
On Mon, Jun 26, 2023 at 09:23:20PM -0700, Suren Baghdasaryan wrote:
> migration_entry_wait does not need VMA lock, therefore it can be
> dropped before waiting.

Hmm, I'm not sure..

Note that we're still dereferencing *vmf->pmd when waiting, while *pmd lives
in the page table and IIUC is only guaranteed to be there if the vma is still
there.  Without either the mmap lock or the vma lock I don't see what makes
sure the pgtable is always there.  E.g. IIUC a race can happen where unmap()
runs right after vma_end_read() below but before pmdp_get_lockless() (inside
migration_entry_wait()); then pmdp_get_lockless() can read random things if
the pgtable has been freed.
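
A rough sketch of that interleaving (illustrative only):

	fault handler                           another thread
	-------------                           --------------
	vmf->orig_pte is a migration entry
	vma_end_read(vma)
	                                        unmap() tears down the VMA and
	                                        frees the page table that
	                                        vmf->pmd points into
	migration_entry_wait(mm, vmf->pmd, ...)
	  pmdp_get_lockless(vmf->pmd)           <-- reads freed memory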

> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/memory.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5caaa4c66ea2..bdf46fdc58d6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> -			migration_entry_wait(vma->vm_mm, vmf->pmd,
> -					     vmf->address);
> +			/* Save mm in case VMA lock is dropped */
> +			struct mm_struct *mm = vma->vm_mm;
> +
> +			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +				/*
> +				 * No need to hold VMA lock for migration.
> +				 * WARNING: vma can't be used after this!
> +				 */
> +				vma_end_read(vma);
> +				ret |= VM_FAULT_COMPLETED;
> +			}
> +			migration_entry_wait(mm, vmf->pmd, vmf->address);
>  		} else if (is_device_exclusive_entry(entry)) {
>  			vmf->page = pfn_swap_entry_to_page(entry);
>  			ret = remove_device_exclusive_entry(vmf);
> -- 
> 2.41.0.178.g377b9f9a00-goog
>
Suren Baghdasaryan June 27, 2023, 4:23 p.m. UTC | #4
On Tue, Jun 27, 2023 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Jun 26, 2023 at 09:23:20PM -0700, Suren Baghdasaryan wrote:
> > migration_entry_wait does not need VMA lock, therefore it can be
> > dropped before waiting.
>
> Hmm, I'm not sure..
>
> Note that we're still dereferencing *vmf->pmd when waiting, while *pmd lives
> in the page table and IIUC is only guaranteed to be there if the vma is still
> there.  Without either the mmap lock or the vma lock I don't see what makes
> sure the pgtable is always there.  E.g. IIUC a race can happen where unmap()
> runs right after vma_end_read() below but before pmdp_get_lockless() (inside
> migration_entry_wait()); then pmdp_get_lockless() can read random things if
> the pgtable has been freed.

That sounds correct. I thought the ptl would keep the pmd stable, but there
is a window between vma_end_read() and spin_lock(ptl) during which it can be
freed from under us. I think it would work if we did vma_end_read() after
spin_lock(ptl), but that requires code refactoring. I'll probably drop this
optimization from the patchset for now to keep things simple and will get
back to it later.
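
Roughly, that refactoring would have to look something like the sketch below
(hypothetical helper name and plumbing, untested; only meant to show where
the lock drop would move):

	/*
	 * Hypothetical variant of migration_entry_wait() that takes the vma
	 * so the per-VMA lock is dropped only once the pte lock is held and
	 * the page table can no longer be freed under us.
	 */
	static void migration_entry_wait_vma_locked(struct vm_area_struct *vma,
						    pmd_t *pmd, unsigned long address)
	{
		struct mm_struct *mm = vma->vm_mm;
		spinlock_t *ptl;
		pte_t *ptep;

		/*
		 * The VMA lock is still held here, so pmd and the pgtable it
		 * points into are stable while we take the pte lock.
		 */
		ptep = pte_offset_map_lock(mm, pmd, address, &ptl);

		/*
		 * With ptl held the pgtable can't go away, so the VMA lock
		 * can be dropped before sleeping.
		 * WARNING: vma can't be used after this!
		 */
		vma_end_read(vma);

		/*
		 * ... then recheck *ptep and wait for the migration to
		 * finish the way migration_entry_wait() does today,
		 * unlocking ptl before actually sleeping.
		 */
	}

do_swap_page() would call this instead of migration_entry_wait() when
FAULT_FLAG_VMA_LOCK is set (and still report the lock drop to the caller as
discussed above).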

>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/memory.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5caaa4c66ea2..bdf46fdc58d6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       entry = pte_to_swp_entry(vmf->orig_pte);
> >       if (unlikely(non_swap_entry(entry))) {
> >               if (is_migration_entry(entry)) {
> > -                     migration_entry_wait(vma->vm_mm, vmf->pmd,
> > -                                          vmf->address);
> > +                     /* Save mm in case VMA lock is dropped */
> > +                     struct mm_struct *mm = vma->vm_mm;
> > +
> > +                     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > +                             /*
> > +                              * No need to hold VMA lock for migration.
> > +                              * WARNING: vma can't be used after this!
> > +                              */
> > +                             vma_end_read(vma);
> > +                             ret |= VM_FAULT_COMPLETED;
> > +                     }
> > +                     migration_entry_wait(mm, vmf->pmd, vmf->address);
> >               } else if (is_device_exclusive_entry(entry)) {
> >                       vmf->page = pfn_swap_entry_to_page(entry);
> >                       ret = remove_device_exclusive_entry(vmf);
> > --
> > 2.41.0.178.g377b9f9a00-goog
> >
>
> --
> Peter Xu
>
Alistair Popple June 28, 2023, 3:22 a.m. UTC | #5
Suren Baghdasaryan <surenb@google.com> writes:

> On Tue, Jun 27, 2023 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Mon, Jun 26, 2023 at 09:23:20PM -0700, Suren Baghdasaryan wrote:
>> > migration_entry_wait does not need VMA lock, therefore it can be
>> > dropped before waiting.
>>
>> Hmm, I'm not sure..
>>
>> Note that we're still dereferencing *vmf->pmd when waiting, while *pmd lives
>> in the page table and IIUC is only guaranteed to be there if the vma is still
>> there.  Without either the mmap lock or the vma lock I don't see what makes
>> sure the pgtable is always there.  E.g. IIUC a race can happen where unmap()
>> runs right after vma_end_read() below but before pmdp_get_lockless() (inside
>> migration_entry_wait()); then pmdp_get_lockless() can read random things if
>> the pgtable has been freed.
>
> That sounds correct. I thought the ptl would keep the pmd stable, but there
> is a window between vma_end_read() and spin_lock(ptl) during which it can be
> freed from under us. I think it would work if we did vma_end_read() after
> spin_lock(ptl), but that requires code refactoring. I'll probably drop this
> optimization from the patchset for now to keep things simple and will get
> back to it later.

Oh thanks Peter, that's a good point. It could be made to work, but I agree
it's probably not worth the code refactoring at this point, so I'm ok if the
optimisation is dropped for now.

>>
>> >
>> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> > ---
>> >  mm/memory.c | 14 ++++++++++++--
>> >  1 file changed, 12 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 5caaa4c66ea2..bdf46fdc58d6 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       entry = pte_to_swp_entry(vmf->orig_pte);
>> >       if (unlikely(non_swap_entry(entry))) {
>> >               if (is_migration_entry(entry)) {
>> > -                     migration_entry_wait(vma->vm_mm, vmf->pmd,
>> > -                                          vmf->address);
>> > +                     /* Save mm in case VMA lock is dropped */
>> > +                     struct mm_struct *mm = vma->vm_mm;
>> > +
>> > +                     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
>> > +                             /*
>> > +                              * No need to hold VMA lock for migration.
>> > +                              * WARNING: vma can't be used after this!
>> > +                              */
>> > +                             vma_end_read(vma);
>> > +                             ret |= VM_FAULT_COMPLETED;
>> > +                     }
>> > +                     migration_entry_wait(mm, vmf->pmd, vmf->address);
>> >               } else if (is_device_exclusive_entry(entry)) {
>> >                       vmf->page = pfn_swap_entry_to_page(entry);
>> >                       ret = remove_device_exclusive_entry(vmf);
>> > --
>> > 2.41.0.178.g377b9f9a00-goog
>> >
>>
>> --
>> Peter Xu
>>

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 5caaa4c66ea2..bdf46fdc58d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-			migration_entry_wait(vma->vm_mm, vmf->pmd,
-					     vmf->address);
+			/* Save mm in case VMA lock is dropped */
+			struct mm_struct *mm = vma->vm_mm;
+
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/*
+				 * No need to hold VMA lock for migration.
+				 * WARNING: vma can't be used after this!
+				 */
+				vma_end_read(vma);
+				ret |= VM_FAULT_COMPLETED;
+			}
+			migration_entry_wait(mm, vmf->pmd, vmf->address);
 		} else if (is_device_exclusive_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);