
[1/6] mm: Allow per-VMA locks on file-backed VMAs

Message ID: 20230404135850.3673404-2-willy@infradead.org (mailing list archive)
State: New
Series: Avoid the mmap lock for fault-around

Commit Message

Matthew Wilcox (Oracle) April 4, 2023, 1:58 p.m. UTC
The fault path will immediately fail in handle_mm_fault(), so this
is the minimal step which allows the per-VMA lock to be taken on
file-backed VMAs.  There may be a small performance reduction as a
little unnecessary work will be done on each page fault.  See later
patches for the improvement.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

Comments

Suren Baghdasaryan April 7, 2023, 5:54 p.m. UTC | #1
On Tue, Apr 4, 2023 at 6:59 AM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> The fault path will immediately fail in handle_mm_fault(), so this
> is the minimal step which allows the per-VMA lock to be taken on
> file-backed VMAs.  There may be a small performance reduction as a
> little unnecessary work will be done on each page fault.  See later
> patches for the improvement.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/memory.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fdaec7772fff..f726f85f0081 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5223,6 +5223,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                                             flags & FAULT_FLAG_REMOTE))
>                 return VM_FAULT_SIGSEGV;
>
> +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
> +               return VM_FAULT_RETRY;
> +

There are count_vm_event(PGFAULT) and count_memcg_event_mm(vma->vm_mm,
PGFAULT) earlier in this function. Returning here and retrying I think
will double-count this page fault. Returning before this accounting
should fix this issue.
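
For illustration, the reordering I have in mind would look roughly like
this (a sketch of the entry of handle_mm_fault() with the new check
hoisted above the counters, not the posted patch):

	/* Bail out before the PGFAULT accounting so the fault is only
	 * counted once, when it is retried under the mmap lock. */
	if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
		return VM_FAULT_RETRY;

	count_vm_event(PGFAULT);
	count_memcg_event_mm(vma->vm_mm, PGFAULT);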

>         /*
>          * Enable the memcg OOM handling for faults triggered in user
>          * space.  Kernel faults are handled more gracefully.
> @@ -5275,12 +5278,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma)
>                 goto inval;
>
> -       /* Only anonymous vmas are supported for now */
> -       if (!vma_is_anonymous(vma))
> -               goto inval;
> -
>         /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
> -       if (!vma->anon_vma)
> +       if (vma_is_anonymous(vma) && !vma->anon_vma)
>                 goto inval;
>
>         if (!vma_start_read(vma))
> --
> 2.39.2
>
Matthew Wilcox (Oracle) April 7, 2023, 8:12 p.m. UTC | #2
On Fri, Apr 07, 2023 at 10:54:00AM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 4, 2023 at 6:59 AM Matthew Wilcox (Oracle)
> <willy@infradead.org> wrote:
> >
> > The fault path will immediately fail in handle_mm_fault(), so this
> > is the minimal step which allows the per-VMA lock to be taken on
> > file-backed VMAs.  There may be a small performance reduction as a
> > little unnecessary work will be done on each page fault.  See later
> > patches for the improvement.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  mm/memory.c | 9 ++++-----
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index fdaec7772fff..f726f85f0081 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5223,6 +5223,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >                                             flags & FAULT_FLAG_REMOTE))
> >                 return VM_FAULT_SIGSEGV;
> >
> > +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
> > +               return VM_FAULT_RETRY;
> > +
> 
> There are count_vm_event(PGFAULT) and count_memcg_event_mm(vma->vm_mm,
> PGFAULT) earlier in this function. Returning here and retrying I think
> will double-count this page fault. Returning before this accounting
> should fix this issue.

You're right, but this will be an issue with later patches in the series
anyway as we move the check further and further down the call-chain.
For that matter, it's an issue in do_swap_page() right now, isn't it?
I suppose we don't care too much because it's the rare case where we go
into do_swap_page() and so the stats are "correct enough".
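
(For reference, the do_swap_page() case is its early bail-out under the
per-VMA lock, which runs after the PGFAULT counters have already been
bumped; roughly:

	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
		ret = VM_FAULT_RETRY;
		goto out;
	}

so that fault gets counted again when it is retried under the mmap lock.)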
Suren Baghdasaryan April 7, 2023, 8:26 p.m. UTC | #3
On Fri, Apr 7, 2023 at 1:12 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 07, 2023 at 10:54:00AM -0700, Suren Baghdasaryan wrote:
> > On Tue, Apr 4, 2023 at 6:59 AM Matthew Wilcox (Oracle)
> > <willy@infradead.org> wrote:
> > >
> > > The fault path will immediately fail in handle_mm_fault(), so this
> > > is the minimal step which allows the per-VMA lock to be taken on
> > > file-backed VMAs.  There may be a small performance reduction as a
> > > little unnecessary work will be done on each page fault.  See later
> > > patches for the improvement.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > ---
> > >  mm/memory.c | 9 ++++-----
> > >  1 file changed, 4 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index fdaec7772fff..f726f85f0081 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5223,6 +5223,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >                                             flags & FAULT_FLAG_REMOTE))
> > >                 return VM_FAULT_SIGSEGV;
> > >
> > > +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
> > > +               return VM_FAULT_RETRY;
> > > +
> >
> > There are count_vm_event(PGFAULT) and count_memcg_event_mm(vma->vm_mm,
> > PGFAULT) earlier in this function. Returning here and retrying I think
> > will double-count this page fault. Returning before this accounting
> > should fix this issue.
>
> You're right, but this will be an issue with later patches in the series
> anyway as we move the check further and further down the call-chain.
> For that matter, it's an issue in do_swap_page() right now, isn't it?
> I suppose we don't care too much because it's the rare case where we go
> into do_swap_page() and so the stats are "correct enough".

True. do_swap_page() has the same issue. Can we move these
count_vm_event() calls to the end of handle_mm_fault():

vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
   unsigned int flags, struct pt_regs *regs)
{
       vm_fault_t ret;

       __set_current_state(TASK_RUNNING);

-       count_vm_event(PGFAULT);
-       count_memcg_event_mm(vma->vm_mm, PGFAULT);

       ret = sanitize_fault_flags(vma, &flags);
       if (ret)
-              return ret;
+              goto out;
       ...
       mm_account_fault(regs, address, flags, ret);
+out:
+       if (ret != VM_FAULT_RETRY) {
+              count_vm_event(PGFAULT);
+              count_memcg_event_mm(vma->vm_mm, PGFAULT);
+       }

       return ret;
}

?
Matthew Wilcox (Oracle) April 7, 2023, 9:36 p.m. UTC | #4
On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> True. do_swap_page() has the same issue. Can we move these
> count_vm_event() calls to the end of handle_mm_fault():

I was going to suggest something similar, so count that as an
Acked-by.  This will change the accounting for the current retry
situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
and return RETRY.  I think that's probably a good change though; I don't
see why applications should have their fault counters incremented twice
for that situation.

>        mm_account_fault(regs, address, flags, ret);
> +out:
> +       if (ret != VM_FAULT_RETRY) {
> +              count_vm_event(PGFAULT);
> +              count_memcg_event_mm(vma->vm_mm, PGFAULT);
> +       }

Nit: this is a bitmask, so we should probably have:

	if (!(ret & VM_FAULT_RETRY)) {

in case somebody's ORed it with VM_FAULT_MAJOR or something.
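
For example (toy values, just to illustrate the difference):

	vm_fault_t ret = VM_FAULT_MAJOR | VM_FAULT_RETRY;

	if (ret != VM_FAULT_RETRY)	/* true: would count a fault we are about to retry */
		count_vm_event(PGFAULT);

	if (!(ret & VM_FAULT_RETRY))	/* false: counting correctly deferred to the retry */
		count_vm_event(PGFAULT);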

Hm, I wonder if we're handling that correctly; if we need to start I/O,
do we preserve VM_FAULT_MAJOR returned from the first attempt?  Not
in a position where I can look at the code right now.
Suren Baghdasaryan April 7, 2023, 10:40 p.m. UTC | #5
On Fri, Apr 7, 2023 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> > True. do_swap_page() has the same issue. Can we move these
> > count_vm_event() calls to the end of handle_mm_fault():
>
> I was going to suggest something similar, so count that as an
> Acked-by.  This will change the accounting for the current retry
> situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
> and return RETRY.  I think that's probably a good change though; I don't
> see why applications should have their fault counters incremented twice
> for that situation.
>
> >        mm_account_fault(regs, address, flags, ret);
> > +out:
> > +       if (ret != VM_FAULT_RETRY) {
> > +              count_vm_event(PGFAULT);
> > +              count_memcg_event_mm(vma->vm_mm, PGFAULT);
> > +       }
>
> Nit: this is a bitmask, so we should probably have:
>
>         if (!(ret & VM_FAULT_RETRY)) {
>
> in case somebody's ORed it with VM_FAULT_MAJOR or something.
>
> Hm, I wonder if we're handling that correctly; if we need to start I/O,
> do we preserve VM_FAULT_MAJOR returned from the first attempt?  Not
> in a position where I can look at the code right now.

Interesting question. IIUC mm_account_fault() is the place where
VM_FAULT_MAJOR is used to perform minor/major fault accounting.
However it has an early exit:

        /*
         * We don't do accounting for some specific faults:
         *
         * - Unsuccessful faults (e.g. when the address wasn't valid).  That
         *   includes arch_vma_access_permitted() failing before reaching here.
         *   So this is not a "this many hardware page faults" counter.  We
         *   should use the hw profiling for that.
         *
         * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
         *   once they're completed.
         */
        if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
                return;

So, VM_FAULT_RETRY is ignored in the accounting path.
For the current do_swap_page() (the only user returning VM_FAULT_RETRY
this late) it's fine because we did not really do anything. However,
with your series, and with my current attempts to drop the vma lock,
do swapin_readahead() and return VM_FAULT_RETRY, the situation changes.
We need to register such a (VM_FAULT_MAJOR | VM_FAULT_RETRY) case as a
major fault, otherwise after the retry it will be accounted as only a
minor fault. Accounting the retry as a minor fault after the original
major fault is technically also double-counting, but probably not as
big a problem as missing the major fault accounting.
Suren Baghdasaryan April 7, 2023, 10:49 p.m. UTC | #6
On Fri, Apr 7, 2023 at 3:40 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Apr 7, 2023 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> > > True. do_swap_page() has the same issue. Can we move these
> > > count_vm_event() calls to the end of handle_mm_fault():
> >
> > I was going to suggest something similar, so count that as an
> > Acked-by.  This will change the accounting for the current retry
> > situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
> > and return RETRY.  I think that's probably a good change though; I don't
> > see why applications should have their fault counters incremented twice
> > for that situation.
> >
> > >        mm_account_fault(regs, address, flags, ret);
> > > +out:
> > > +       if (ret != VM_FAULT_RETRY) {
> > > +              count_vm_event(PGFAULT);
> > > +              count_memcg_event_mm(vma->vm_mm, PGFAULT);
> > > +       }
> >
> > Nit: this is a bitmask, so we should probably have:
> >
> >         if (!(ret & VM_FAULT_RETRY)) {
> >
> > in case somebody's ORed it with VM_FAULT_MAJOR or something.
> >
> > Hm, I wonder if we're handling that correctly; if we need to start I/O,
> > do we preserve VM_FAULT_MAJOR returned from the first attempt?  Not
> > in a position where I can look at the code right now.
>
> Interesting question. IIUC mm_account_fault() is the place where
> VM_FAULT_MAJOR is used to perform minor/major fault accounting.
> However it has an early exit:
>
>         /*
>          * We don't do accounting for some specific faults:
>          *
>          * - Unsuccessful faults (e.g. when the address wasn't valid).  That
>          *   includes arch_vma_access_permitted() failing before reaching here.
>          *   So this is not a "this many hardware page faults" counter.  We
>          *   should use the hw profiling for that.
>          *
>          * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
>          *   once they're completed.
>          */
>         if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
>                 return;
>
> So, VM_FAULT_RETRY is ignored in the accounting path.
> For the current do_swap_page() (the only user returning VM_FAULT_RETRY
> this late) it's fine because we did not really do anything. However,
> with your series, and with my current attempts to drop the vma lock,
> do swapin_readahead() and return VM_FAULT_RETRY, the situation changes.
> We need to register such a (VM_FAULT_MAJOR | VM_FAULT_RETRY) case as a
> major fault, otherwise after the retry it will be accounted as only a
> minor fault. Accounting the retry as a minor fault after the original
> major fault is technically also double-counting, but probably not as
> big a problem as missing the major fault accounting.

Actually, it looks like in most cases where VM_FAULT_MAJOR is returned
we also do count_vm_event(PGMAJFAULT) and
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT). mm_account_fault()
only updates current->{maj|min}_flt and calls perf_sw_event().
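
For reference, the usual pattern at such a return site looks roughly
like this (paraphrased, details vary by caller):

	/* I/O was needed to bring the page in. */
	ret = VM_FAULT_MAJOR;
	count_vm_event(PGMAJFAULT);
	count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);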
Suren Baghdasaryan April 14, 2023, 6:02 p.m. UTC | #7
On Fri, Apr 7, 2023 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> > True. do_swap_page() has the same issue. Can we move these
> > count_vm_event() calls to the end of handle_mm_fault():
>
> I was going to suggest something similar, so count that as an
> Acked-by.  This will change the accounting for the current retry
> situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
> and return RETRY.  I think that's probably a good change though; I don't
> see why applications should have their fault counters incremented twice
> for that situation.
>
> >        mm_account_fault(regs, address, flags, ret);
> > +out:
> > +       if (ret != VM_FAULT_RETRY) {
> > +              count_vm_event(PGFAULT);
> > +              count_memcg_event_mm(vma->vm_mm, PGFAULT);
> > +       }
>
> Nit: this is a bitmask, so we should probably have:
>
>         if (!(ret & VM_FAULT_RETRY)) {
>
> in case somebody's ORed it with VM_FAULT_MAJOR or something.

The fix is posted at
https://lore.kernel.org/all/20230414175444.1837474-1-surenb@google.com/

>
> Hm, I wonder if we're handling that correctly; if we need to start I/O,
> do we preserve VM_FAULT_MAJOR returned from the first attempt?  Not
> in a position where I can look at the code right now.

Patch

diff --git a/mm/memory.c b/mm/memory.c
index fdaec7772fff..f726f85f0081 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5223,6 +5223,9 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 					    flags & FAULT_FLAG_REMOTE))
 		return VM_FAULT_SIGSEGV;
 
+	if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
+		return VM_FAULT_RETRY;
+
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
 	 * space.  Kernel faults are handled more gracefully.
@@ -5275,12 +5278,8 @@  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma)
 		goto inval;
 
-	/* Only anonymous vmas are supported for now */
-	if (!vma_is_anonymous(vma))
-		goto inval;
-
 	/* find_mergeable_anon_vma uses adjacent vmas which are not locked */
-	if (!vma->anon_vma)
+	if (vma_is_anonymous(vma) && !vma->anon_vma)
 		goto inval;
 
 	if (!vma_start_read(vma))