
[v2,1/1] mm: do not increment pgfault stats when page fault handler retries

Message ID 20230415000818.1955007-1-surenb@google.com (mailing list archive)
State New

Commit Message

Suren Baghdasaryan April 15, 2023, 12:08 a.m. UTC
If the page fault handler requests a retry, we will count the fault
multiple times. This is a relatively harmless problem as the retry paths
are not often requested, and the only user-visible problem is that the
fault counter will be slightly higher than it should be.  Nevertheless,
userspace only took one fault, and should not see the fact that the
kernel had to retry the fault multiple times.
Move page fault accounting into mm_account_fault() and skip incomplete
faults which will be accounted upon completion.

Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

Comments

Matthew Wilcox April 17, 2023, 5:26 p.m. UTC | #1
On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
>  	/*
> -	 * We don't do accounting for some specific faults:
> -	 *
> -	 * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> -	 *   includes arch_vma_access_permitted() failing before reaching here.
> -	 *   So this is not a "this many hardware page faults" counter.  We
> -	 *   should use the hw profiling for that.
> -	 *
> -	 * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> -	 *   once they're completed.
> +	 * Do not account for incomplete faults (VM_FAULT_RETRY). They will be

I don't think you need the "(VM_FAULT_RETRY)" here.

> @@ -5180,21 +5186,22 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
>  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  			   unsigned int flags, struct pt_regs *regs)
>  {
> +	/* Copy vma->vm_mm in case mmap_lock is dropped and vma becomes unstable. */

How about:

	/* If the fault handler drops the mmap_lock, vma may be freed */

> +	struct mm_struct *mm = vma->vm_mm;
Peter Xu April 17, 2023, 7:40 p.m. UTC | #2
On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> If the page fault handler requests a retry, we will count the fault
> multiple times. This is a relatively harmless problem as the retry paths
> are not often requested, and the only user-visible problem is that the
> fault counter will be slightly higher than it should be.  Nevertheless,
> userspace only took one fault, and should not see the fact that the
> kernel had to retry the fault multiple times.
> Move page fault accounting into mm_account_fault() and skip incomplete
> faults which will be accounted upon completion.
> 
> Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
>  1 file changed, 26 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 01a23ad48a04..c3b709ceeed7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5080,24 +5080,30 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>   * updates.  However, note that the handling of PERF_COUNT_SW_PAGE_FAULTS should
>   * still be in per-arch page fault handlers at the entry of page fault.
>   */
> -static inline void mm_account_fault(struct pt_regs *regs,
> +static inline void mm_account_fault(struct mm_struct *mm, struct pt_regs *regs,
>  				    unsigned long address, unsigned int flags,
>  				    vm_fault_t ret)
>  {
>  	bool major;
>  
>  	/*
> -	 * We don't do accounting for some specific faults:
> -	 *
> -	 * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> -	 *   includes arch_vma_access_permitted() failing before reaching here.
> -	 *   So this is not a "this many hardware page faults" counter.  We
> -	 *   should use the hw profiling for that.
> -	 *
> -	 * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> -	 *   once they're completed.
> +	 * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> +	 * counted upon completion.
>  	 */
> -	if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> +	if (ret & VM_FAULT_RETRY)
> +		return;
> +
> +	/* Register both successful and failed faults in PGFAULT counters. */
> +	count_vm_event(PGFAULT);
> +	count_memcg_event_mm(mm, PGFAULT);

Is there a reason why the vm event accounting needs to be explicitly
different from the perf events right below in handling ERROR?

I get the point if this is to make sure the ERROR accounting is left
untouched for these two vm events after this patch. IOW, probably the only
concern right now is having RETRY counted much more than before (perhaps
worse with vma locking applied).

But since we're on this, I'm wondering whether we should also align the two
events (vm, perf) so they are reported in an aligned manner if we're
changing this anyway.  Any future reader will be confused about why they
account differently, IMHO, so if we need to differentiate we'd better add a
comment on why.

I'm wildly guessing that error faults are indeed very rare and probably
don't matter much at all.  I just think the code can be slightly cleaner if
the vm/perf accountings match, and simpler if we treat everything the same.
E.g., we could also drop the "goto out"s below.  What do you think?

Thanks,

> +
> +	/*
> +	 * Do not account for unsuccessful faults (e.g. when the address wasn't
> +	 * valid).  That includes arch_vma_access_permitted() failing before
> +	 * reaching here. So this is not a "this many hardware page faults"
> +	 * counter.  We should use the hw profiling for that.
> +	 */
> +	if (ret & VM_FAULT_ERROR)
>  		return;
>  
>  	/*
> @@ -5180,21 +5186,22 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
>  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  			   unsigned int flags, struct pt_regs *regs)
>  {
> +	/* Copy vma->vm_mm in case mmap_lock is dropped and vma becomes unstable. */
> +	struct mm_struct *mm = vma->vm_mm;
>  	vm_fault_t ret;
>  
>  	__set_current_state(TASK_RUNNING);
>  
> -	count_vm_event(PGFAULT);
> -	count_memcg_event_mm(vma->vm_mm, PGFAULT);
> -
>  	ret = sanitize_fault_flags(vma, &flags);
>  	if (ret)
> -		return ret;
> +		goto out;
>  
>  	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>  					    flags & FAULT_FLAG_INSTRUCTION,
> -					    flags & FAULT_FLAG_REMOTE))
> -		return VM_FAULT_SIGSEGV;
> +					    flags & FAULT_FLAG_REMOTE)) {
> +		ret = VM_FAULT_SIGSEGV;
> +		goto out;
> +	}
>  
>  	/*
>  	 * Enable the memcg OOM handling for faults triggered in user
> @@ -5223,8 +5230,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>  			mem_cgroup_oom_synchronize(false);
>  	}
> -
> -	mm_account_fault(regs, address, flags, ret);
> +out:
> +	mm_account_fault(mm, regs, address, flags, ret);
>  
>  	return ret;
>  }
> -- 
> 2.40.0.634.g4ca3ef3211-goog
> 
>
Suren Baghdasaryan April 17, 2023, 8:29 p.m. UTC | #3
On Mon, Apr 17, 2023 at 12:40 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> > If the page fault handler requests a retry, we will count the fault
> > multiple times. This is a relatively harmless problem as the retry paths
> > are not often requested, and the only user-visible problem is that the
> > fault counter will be slightly higher than it should be.  Nevertheless,
> > userspace only took one fault, and should not see the fact that the
> > kernel had to retry the fault multiple times.
> > Move page fault accounting into mm_account_fault() and skip incomplete
> > faults which will be accounted upon completion.
> >
> > Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
> >  1 file changed, 26 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 01a23ad48a04..c3b709ceeed7 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5080,24 +5080,30 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> >   * updates.  However, note that the handling of PERF_COUNT_SW_PAGE_FAULTS should
> >   * still be in per-arch page fault handlers at the entry of page fault.
> >   */
> > -static inline void mm_account_fault(struct pt_regs *regs,
> > +static inline void mm_account_fault(struct mm_struct *mm, struct pt_regs *regs,
> >                                   unsigned long address, unsigned int flags,
> >                                   vm_fault_t ret)
> >  {
> >       bool major;
> >
> >       /*
> > -      * We don't do accounting for some specific faults:
> > -      *
> > -      * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > -      *   includes arch_vma_access_permitted() failing before reaching here.
> > -      *   So this is not a "this many hardware page faults" counter.  We
> > -      *   should use the hw profiling for that.
> > -      *
> > -      * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > -      *   once they're completed.
> > +      * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > +      * counted upon completion.
> >        */
> > -     if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > +     if (ret & VM_FAULT_RETRY)
> > +             return;
> > +
> > +     /* Register both successful and failed faults in PGFAULT counters. */
> > +     count_vm_event(PGFAULT);
> > +     count_memcg_event_mm(mm, PGFAULT);
>
> Is there a reason why the vm event accounting needs to be explicitly
> different from the perf events right below in handling ERROR?
>
> I get the point if this is to make sure the ERROR accounting is left
> untouched for these two vm events after this patch. IOW, probably the only
> concern right now is having RETRY counted much more than before (perhaps
> worse with vma locking applied).
>
> But since we're on this, I'm wondering whether we should also align the two
> events (vm, perf) so they are reported in an aligned manner if we're
> changing this anyway.  Any future reader will be confused about why they
> account differently, IMHO, so if we need to differentiate we'd better add a
> comment on why.
>
> I'm wildly guessing that error faults are indeed very rare and probably
> don't matter much at all.  I just think the code can be slightly cleaner if
> the vm/perf accountings match, and simpler if we treat everything the same.
> E.g., we could also drop the "goto out"s below.  What do you think?

I think the rationale might be that vm accounting should account for
*all* events, including failing page faults, while for perf the corner
cases, which rarely happen, would have no tangible effect.
I don't have a strong position on this issue and kept it as-is to
avoid changing the current accounting approach. If we are fine with
such a consolidation, which would miss failing faults in vm accounting, I
can make the change.
Thanks,
Suren.

>
> Thanks,
>
> > +
> > +     /*
> > +      * Do not account for unsuccessful faults (e.g. when the address wasn't
> > +      * valid).  That includes arch_vma_access_permitted() failing before
> > +      * reaching here. So this is not a "this many hardware page faults"
> > +      * counter.  We should use the hw profiling for that.
> > +      */
> > +     if (ret & VM_FAULT_ERROR)
> >               return;
> >
> >       /*
> > @@ -5180,21 +5186,22 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
> >  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >                          unsigned int flags, struct pt_regs *regs)
> >  {
> > +     /* Copy vma->vm_mm in case mmap_lock is dropped and vma becomes unstable. */
> > +     struct mm_struct *mm = vma->vm_mm;
> >       vm_fault_t ret;
> >
> >       __set_current_state(TASK_RUNNING);
> >
> > -     count_vm_event(PGFAULT);
> > -     count_memcg_event_mm(vma->vm_mm, PGFAULT);
> > -
> >       ret = sanitize_fault_flags(vma, &flags);
> >       if (ret)
> > -             return ret;
> > +             goto out;
> >
> >       if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> >                                           flags & FAULT_FLAG_INSTRUCTION,
> > -                                         flags & FAULT_FLAG_REMOTE))
> > -             return VM_FAULT_SIGSEGV;
> > +                                         flags & FAULT_FLAG_REMOTE)) {
> > +             ret = VM_FAULT_SIGSEGV;
> > +             goto out;
> > +     }
> >
> >       /*
> >        * Enable the memcg OOM handling for faults triggered in user
> > @@ -5223,8 +5230,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >               if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> >                       mem_cgroup_oom_synchronize(false);
> >       }
> > -
> > -     mm_account_fault(regs, address, flags, ret);
> > +out:
> > +     mm_account_fault(mm, regs, address, flags, ret);
> >
> >       return ret;
> >  }
> > --
> > 2.40.0.634.g4ca3ef3211-goog
> >
> >
>
> --
> Peter Xu
>
Peter Xu April 17, 2023, 9:14 p.m. UTC | #4
On Mon, Apr 17, 2023 at 01:29:45PM -0700, Suren Baghdasaryan wrote:
> On Mon, Apr 17, 2023 at 12:40 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> > > If the page fault handler requests a retry, we will count the fault
> > > multiple times. This is a relatively harmless problem as the retry paths
> > > are not often requested, and the only user-visible problem is that the
> > > fault counter will be slightly higher than it should be.  Nevertheless,
> > > userspace only took one fault, and should not see the fact that the
> > > kernel had to retry the fault multiple times.
> > > Move page fault accounting into mm_account_fault() and skip incomplete
> > > faults which will be accounted upon completion.
> > >
> > > Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
> > >  1 file changed, 26 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 01a23ad48a04..c3b709ceeed7 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5080,24 +5080,30 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > >   * updates.  However, note that the handling of PERF_COUNT_SW_PAGE_FAULTS should
> > >   * still be in per-arch page fault handlers at the entry of page fault.
> > >   */
> > > -static inline void mm_account_fault(struct pt_regs *regs,
> > > +static inline void mm_account_fault(struct mm_struct *mm, struct pt_regs *regs,
> > >                                   unsigned long address, unsigned int flags,
> > >                                   vm_fault_t ret)
> > >  {
> > >       bool major;
> > >
> > >       /*
> > > -      * We don't do accounting for some specific faults:
> > > -      *
> > > -      * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > -      *   includes arch_vma_access_permitted() failing before reaching here.
> > > -      *   So this is not a "this many hardware page faults" counter.  We
> > > -      *   should use the hw profiling for that.
> > > -      *
> > > -      * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > -      *   once they're completed.
> > > +      * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > +      * counted upon completion.
> > >        */
> > > -     if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > +     if (ret & VM_FAULT_RETRY)
> > > +             return;
> > > +
> > > +     /* Register both successful and failed faults in PGFAULT counters. */

[1]

> > > +     count_vm_event(PGFAULT);
> > > +     count_memcg_event_mm(mm, PGFAULT);
> >
> > Is there a reason why the vm event accounting needs to be explicitly
> > different from the perf events right below in handling ERROR?
> >
> > I get the point if this is to make sure the ERROR accounting is left
> > untouched for these two vm events after this patch. IOW, probably the only
> > concern right now is having RETRY counted much more than before (perhaps
> > worse with vma locking applied).
> >
> > But since we're on this, I'm wondering whether we should also align the two
> > events (vm, perf) so they are reported in an aligned manner if we're
> > changing this anyway.  Any future reader will be confused about why they
> > account differently, IMHO, so if we need to differentiate we'd better add a
> > comment on why.
> >
> > I'm wildly guessing that error faults are indeed very rare and probably
> > don't matter much at all.  I just think the code can be slightly cleaner if
> > the vm/perf accountings match, and simpler if we treat everything the same.
> > E.g., we could also drop the "goto out"s below.  What do you think?
> 
> I think the rationale might be that vm accounting should account for
> *all* events, including failing page faults, while for perf the corner
> cases, which rarely happen, would have no tangible effect.

Note that it's not only for perf, but also task_struct.maj_flt|min_flt.

If we check the reasoning for "why ERROR shouldn't be accounted for perf
events", there's actually something valid in the comment:

	 * - Unsuccessful faults (e.g. when the address wasn't valid).  That
	 *   includes arch_vma_access_permitted() failing before reaching here.
	 *   So this is not a "this many hardware page faults" counter.  We
	 *   should use the hw profiling for that.

IMHO it suggests that if someone wants to trap either ERROR or RETRY, one
can use the hardware counters instead.  To me the same reasoning is just as
applicable to vm events, because vm events are not special in this case.

> I don't have a strong position on this issue and kept it as-is to
> avoid changing the current accounting approach. If we are fine with
> such a consolidation, which would miss failing faults in vm accounting, I
> can make the change.

I don't have a strong opinion either.  We used to change this path before
for perf events and task events and no one complained about the change.
I'd just bet the same for vm events:

https://lore.kernel.org/all/20200707225021.200906-1-peterx@redhat.com/

Please feel free to keep it as-is if you still feel unsafe about changing
the ERROR handling.  If so, would you mind slightly modifying [1] above to
explain why ERROR needs to be accounted in the vm accounting?  The current
comment only says "what it does" rather than why.

Thanks,
Suren Baghdasaryan April 17, 2023, 9:21 p.m. UTC | #5
On Mon, Apr 17, 2023 at 2:14 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 01:29:45PM -0700, Suren Baghdasaryan wrote:
> > On Mon, Apr 17, 2023 at 12:40 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> > > > If the page fault handler requests a retry, we will count the fault
> > > > multiple times. This is a relatively harmless problem as the retry paths
> > > > are not often requested, and the only user-visible problem is that the
> > > > fault counter will be slightly higher than it should be.  Nevertheless,
> > > > userspace only took one fault, and should not see the fact that the
> > > > kernel had to retry the fault multiple times.
> > > > Move page fault accounting into mm_account_fault() and skip incomplete
> > > > faults which will be accounted upon completion.
> > > >
> > > > Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
> > > >  1 file changed, 26 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 01a23ad48a04..c3b709ceeed7 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -5080,24 +5080,30 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > > >   * updates.  However, note that the handling of PERF_COUNT_SW_PAGE_FAULTS should
> > > >   * still be in per-arch page fault handlers at the entry of page fault.
> > > >   */
> > > > -static inline void mm_account_fault(struct pt_regs *regs,
> > > > +static inline void mm_account_fault(struct mm_struct *mm, struct pt_regs *regs,
> > > >                                   unsigned long address, unsigned int flags,
> > > >                                   vm_fault_t ret)
> > > >  {
> > > >       bool major;
> > > >
> > > >       /*
> > > > -      * We don't do accounting for some specific faults:
> > > > -      *
> > > > -      * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > -      *   includes arch_vma_access_permitted() failing before reaching here.
> > > > -      *   So this is not a "this many hardware page faults" counter.  We
> > > > -      *   should use the hw profiling for that.
> > > > -      *
> > > > -      * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > -      *   once they're completed.
> > > > +      * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > +      * counted upon completion.
> > > >        */
> > > > -     if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > +     if (ret & VM_FAULT_RETRY)
> > > > +             return;
> > > > +
> > > > +     /* Register both successful and failed faults in PGFAULT counters. */
>
> [1]
>
> > > > +     count_vm_event(PGFAULT);
> > > > +     count_memcg_event_mm(mm, PGFAULT);
> > >
> > > Is there a reason why the vm event accounting needs to be explicitly
> > > different from the perf events right below in handling ERROR?
> > >
> > > I get the point if this is to make sure the ERROR accounting is left
> > > untouched for these two vm events after this patch. IOW, probably the only
> > > concern right now is having RETRY counted much more than before (perhaps
> > > worse with vma locking applied).
> > >
> > > But since we're on this, I'm wondering whether we should also align the two
> > > events (vm, perf) so they are reported in an aligned manner if we're
> > > changing this anyway.  Any future reader will be confused about why they
> > > account differently, IMHO, so if we need to differentiate we'd better add a
> > > comment on why.
> > >
> > > I'm wildly guessing that error faults are indeed very rare and probably
> > > don't matter much at all.  I just think the code can be slightly cleaner if
> > > the vm/perf accountings match, and simpler if we treat everything the same.
> > > E.g., we could also drop the "goto out"s below.  What do you think?
> >
> > I think the rationale might be that vm accounting should account for
> > *all* events, including failing page faults, while for perf the corner
> > cases, which rarely happen, would have no tangible effect.
>
> Note that it's not only for perf, but also task_struct.maj_flt|min_flt.
>
> If we check the reasoning for "why ERROR shouldn't be accounted for perf
> events", there's actually something valid in the comment:
>
>          * - Unsuccessful faults (e.g. when the address wasn't valid).  That
>          *   includes arch_vma_access_permitted() failing before reaching here.
>          *   So this is not a "this many hardware page faults" counter.  We
>          *   should use the hw profiling for that.
>
> IMHO it suggests that if someone wants to trap either ERROR or RETRY, one
> can use the hardware counters instead.  To me the same reasoning is just as
> applicable to vm events, because vm events are not special in this case.
>
> > I don't have a strong position on this issue and kept it as-is to
> > avoid changing the current accounting approach. If we are fine with
> > such a consolidation, which would miss failing faults in vm accounting, I
> > can make the change.
>
> I don't have a strong opinion either.  We used to change this path before
> for perf events and task events and no one complained about the change.
> I'd just bet the same for vm events:
>
> https://lore.kernel.org/all/20200707225021.200906-1-peterx@redhat.com/

Ok, if these rare failures don't change anything then let's
consolidate the code. It should simplify things a bit and will account
faults in a consistent way. I'll post v3 shortly, incorporating your
and Matthew's feedback. Thanks for the input!

>
> Please feel free to keep it as-is if you still feel unsafe about changing
> the ERROR handling.  If so, would you mind slightly modifying [1] above to
> explain why ERROR needs to be accounted in the vm accounting?  The current
> comment only says "what it does" rather than why.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu April 17, 2023, 9:26 p.m. UTC | #6
On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> @@ -5223,8 +5230,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>  			mem_cgroup_oom_synchronize(false);
>  	}
> -
> -	mm_account_fault(regs, address, flags, ret);
> +out:
> +	mm_account_fault(mm, regs, address, flags, ret);

Ah, one more question: can this cached mm race with a destroying mm (just
like the vma race we wanted to avoid)?  The question still only applies to
the COMPLETE case, when the mmap read lock can be released.  Thanks,

>  
>  	return ret;
>  }
Suren Baghdasaryan April 17, 2023, 10:47 p.m. UTC | #7
On Mon, Apr 17, 2023 at 2:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> > @@ -5223,8 +5230,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >               if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> >                       mem_cgroup_oom_synchronize(false);
> >       }
> > -
> > -     mm_account_fault(regs, address, flags, ret);
> > +out:
> > +     mm_account_fault(mm, regs, address, flags, ret);
>
> Ah, one more question: can this cached mm race with a destroying mm (just
> like the vma race we wanted to avoid)?  The question still only applies to
> the COMPLETE case, when the mmap read lock can be released.  Thanks,

I believe that is impossible because whoever is calling the page fault
handler has stabilized the mm by getting a refcount.

>
> >
> >       return ret;
> >  }
>
> --
> Peter Xu
>
Matthew Wilcox April 17, 2023, 10:52 p.m. UTC | #8
On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> >  	/*
> > -	 * We don't do accounting for some specific faults:
> > -	 *
> > -	 * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > -	 *   includes arch_vma_access_permitted() failing before reaching here.
> > -	 *   So this is not a "this many hardware page faults" counter.  We
> > -	 *   should use the hw profiling for that.
> > -	 *
> > -	 * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > -	 *   once they're completed.
> > +	 * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > +	 * counted upon completion.
> >  	 */
> > -	if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > +	if (ret & VM_FAULT_RETRY)
> > +		return;
> > +
> > +	/* Register both successful and failed faults in PGFAULT counters. */
> > +	count_vm_event(PGFAULT);
> > +	count_memcg_event_mm(mm, PGFAULT);
> 
> Is there a reason why the vm event accounting needs to be explicitly
> different from the perf events right below in handling ERROR?

I think so.  ERROR is quite different from RETRY.  If we are, for
example, handling a SIGSEGV (perhaps a GC language?) that should be
accounted.  If we can't handle a page fault right now, and need to
retry within the kernel, that should not be accounted.
Suren Baghdasaryan April 17, 2023, 11:17 p.m. UTC | #9
On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > >     /*
> > > -    * We don't do accounting for some specific faults:
> > > -    *
> > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > -    *   should use the hw profiling for that.
> > > -    *
> > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > -    *   once they're completed.
> > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > +    * counted upon completion.
> > >      */
> > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > +   if (ret & VM_FAULT_RETRY)
> > > +           return;
> > > +
> > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > +   count_vm_event(PGFAULT);
> > > +   count_memcg_event_mm(mm, PGFAULT);
> >
> > Is there a reason why the vm event accounting needs to be explicitly
> > different from the perf events right below in handling ERROR?
>
> I think so.  ERROR is quite different from RETRY.  If we are, for
> example, handling a SIGSEGV (perhaps a GC language?) that should be
> accounted.  If we can't handle a page fault right now, and need to
> retry within the kernel, that should not be accounted.

IIUC, the question was about the differences in vm vs perf accounting
for errors, not the difference between ERROR and RETRY cases. Matthew,
are you answering the right question or did I misunderstand your
answer?

>
Matthew Wilcox April 18, 2023, 2:25 p.m. UTC | #10
On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > >     /*
> > > > -    * We don't do accounting for some specific faults:
> > > > -    *
> > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > -    *   should use the hw profiling for that.
> > > > -    *
> > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > -    *   once they're completed.
> > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > +    * counted upon completion.
> > > >      */
> > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > +   if (ret & VM_FAULT_RETRY)
> > > > +           return;
> > > > +
> > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > +   count_vm_event(PGFAULT);
> > > > +   count_memcg_event_mm(mm, PGFAULT);
> > >
> > > Is there a reason why the vm event accounting needs to be explicitly
> > > different from the perf events right below in handling ERROR?
> >
> > I think so.  ERROR is quite different from RETRY.  If we are, for
> > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > accounted.  If we can't handle a page fault right now, and need to
> > retry within the kernel, that should not be accounted.
> 
> IIUC, the question was about the differences in vm vs perf accounting
> for errors, not the difference between ERROR and RETRY cases. Matthew,
> are you answering the right question or did I misunderstand your
> answer?

Maybe I'm misunderstanding what you're proposing.  I thought the
proposal was to make neither ERROR nor RETRY increment the counters,
but if the proposal is to make ERROR increment the perf counters
instead, then that's cool with me.
Suren Baghdasaryan April 18, 2023, 2:54 p.m. UTC | #11
On Tue, Apr 18, 2023 at 7:25 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> > On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > > >     /*
> > > > > -    * We don't do accounting for some specific faults:
> > > > > -    *
> > > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > > -    *   should use the hw profiling for that.
> > > > > -    *
> > > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > > -    *   once they're completed.
> > > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > > +    * counted upon completion.
> > > > >      */
> > > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > > +   if (ret & VM_FAULT_RETRY)
> > > > > +           return;
> > > > > +
> > > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > > +   count_vm_event(PGFAULT);
> > > > > +   count_memcg_event_mm(mm, PGFAULT);
> > > >
> > > > Is there a reason why vm event accounting needs to be explicitly
> > > > different from the perf events right below when handling ERROR?
> > >
> > > I think so.  ERROR is quite different from RETRY.  If we are, for
> > > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > > accounted.  If we can't handle a page fault right now, and need to
> > > retry within the kernel, that should not be accounted.
> >
> > IIUC, the question was about the differences in vm vs perf accounting
> > for errors, not the difference between ERROR and RETRY cases. Matthew,
> > are you answering the right question or did I misunderstand your
> > answer?
>
> Maybe I'm misunderstanding what you're proposing.  I thought the
> proposal was to make neither ERROR nor RETRY increment the counters,
> but if the proposal is to make ERROR increment the perf counters
> instead, then that's cool with me.

Oh, I think now I understand your answer. You were not highlighting
the difference between the two but objecting to the proposal of not
counting both ERROR and RETRY. Am I on the same page now?
Matthew Wilcox April 18, 2023, 3:08 p.m. UTC | #12
On Tue, Apr 18, 2023 at 07:54:01AM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 18, 2023 at 7:25 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > > > >     /*
> > > > > > -    * We don't do accounting for some specific faults:
> > > > > > -    *
> > > > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > > > -    *   should use the hw profiling for that.
> > > > > > -    *
> > > > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > > > -    *   once they're completed.
> > > > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > > > +    * counted upon completion.
> > > > > >      */
> > > > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > > > +   if (ret & VM_FAULT_RETRY)
> > > > > > +           return;
> > > > > > +
> > > > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > > > +   count_vm_event(PGFAULT);
> > > > > > +   count_memcg_event_mm(mm, PGFAULT);
> > > > >
> > > > > Is there a reason why vm event accounting needs to be explicitly
> > > > > different from the perf events right below when handling ERROR?
> > > >
> > > > I think so.  ERROR is quite different from RETRY.  If we are, for
> > > > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > > > accounted.  If we can't handle a page fault right now, and need to
> > > > retry within the kernel, that should not be accounted.
> > >
> > > IIUC, the question was about the differences in vm vs perf accounting
> > > for errors, not the difference between ERROR and RETRY cases. Matthew,
> > > are you answering the right question or did I misunderstand your
> > > answer?
> >
> > Maybe I'm misunderstanding what you're proposing.  I thought the
> > proposal was to make neither ERROR nor RETRY increment the counters,
> > but if the proposal is to make ERROR increment the perf counters
> > instead, then that's cool with me.
> 
> Oh, I think now I understand your answer. You were not highlighting
> the difference between the two but objecting to the proposal of not
> counting both ERROR and RETRY. Am I on the same page now?

I think so.  Let's see your patch and then we can be sure we're talking
about the same thing ;-)
Peter Xu April 18, 2023, 3:19 p.m. UTC | #13
On Mon, Apr 17, 2023 at 03:47:57PM -0700, Suren Baghdasaryan wrote:
> On Mon, Apr 17, 2023 at 2:26 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> > > @@ -5223,8 +5230,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >               if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> > >                       mem_cgroup_oom_synchronize(false);
> > >       }
> > > -
> > > -     mm_account_fault(regs, address, flags, ret);
> > > +out:
> > > +     mm_account_fault(mm, regs, address, flags, ret);
> >
> > Ah, one more question.. can this cached mm race with a destroying mm (just
> > like the vma race we wanted to avoid)?  Still a question only applies to
> > COMPLETE case when mmap read lock can be released.  Thanks,
> 
> I believe that is impossible because whoever is calling the page fault
> handler has stabilized the mm by getting a refcount.

Do you have a hint on where that refcount is taken?

Btw, it's definitely not a question solely for this patch but a more common
question about the page fault path.  It's just that when I wanted to look for
any refcount boost (which I also expect to exist somewhere) I didn't really
see that in the current path (e.g. do_user_addr_fault() for x86_64).

I also had a quick look at do_exit(), but I didn't see where we
e.g. wait for all the threads to stop before recycling a mm.

I had a feeling that I must have missed something, but just want to make
sure it's the case.
Matthew Wilcox April 18, 2023, 3:32 p.m. UTC | #14
On Tue, Apr 18, 2023 at 11:19:40AM -0400, Peter Xu wrote:
> On Mon, Apr 17, 2023 at 03:47:57PM -0700, Suren Baghdasaryan wrote:
> > On Mon, Apr 17, 2023 at 2:26 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> > > > @@ -5223,8 +5230,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > > >               if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> > > >                       mem_cgroup_oom_synchronize(false);
> > > >       }
> > > > -
> > > > -     mm_account_fault(regs, address, flags, ret);
> > > > +out:
> > > > +     mm_account_fault(mm, regs, address, flags, ret);
> > >
> > > Ah, one more question.. can this cached mm race with a destroying mm (just
> > > like the vma race we wanted to avoid)?  Still a question only applies to
> > > COMPLETE case when mmap read lock can be released.  Thanks,
> > 
> > I believe that is impossible because whoever is calling the page fault
> > handler has stabilized the mm by getting a refcount.
> 
> Do you have a hint on where that refcount is taken?

... when we called clone()?  A thread by definition has a reference to
its own mm.

> Btw, it's definitely not a question solely for this patch but a more common
> question about the page fault path.  It's just that when I wanted to look for
> any refcount boost (which I also expect to exist somewhere) I didn't really
> see that in the current path (e.g. do_user_addr_fault() for x86_64).
> 
> I also had a quick look at do_exit(), but I didn't see where we
> e.g. wait for all the threads to stop before recycling a mm.
> 
> I had a feeling that I must have missed something, but just want to make
> sure it's the case.
> 
> -- 
> Peter Xu
>
Peter Xu April 18, 2023, 3:48 p.m. UTC | #15
On Tue, Apr 18, 2023 at 04:32:27PM +0100, Matthew Wilcox wrote:
> ... when we called clone()?  A thread by definition has a reference to
> its own mm.

Ah yes.. thanks!
Peter Xu April 18, 2023, 4:06 p.m. UTC | #16
On Tue, Apr 18, 2023 at 04:08:58PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 18, 2023 at 07:54:01AM -0700, Suren Baghdasaryan wrote:
> > On Tue, Apr 18, 2023 at 7:25 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > > > > >     /*
> > > > > > > -    * We don't do accounting for some specific faults:
> > > > > > > -    *
> > > > > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > > > > -    *   should use the hw profiling for that.
> > > > > > > -    *
> > > > > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > > > > -    *   once they're completed.
> > > > > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > > > > +    * counted upon completion.
> > > > > > >      */
> > > > > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > > > > +   if (ret & VM_FAULT_RETRY)
> > > > > > > +           return;
> > > > > > > +
> > > > > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > > > > +   count_vm_event(PGFAULT);
> > > > > > > +   count_memcg_event_mm(mm, PGFAULT);
> > > > > >
> > > > > > Is there a reason why vm event accounting needs to be explicitly
> > > > > > different from the perf events right below when handling ERROR?
> > > > >
> > > > > I think so.  ERROR is quite different from RETRY.  If we are, for
> > > > > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > > > > accounted.  If we can't handle a page fault right now, and need to
> > > > > retry within the kernel, that should not be accounted.
> > > >
> > > > IIUC, the question was about the differences in vm vs perf accounting
> > > > for errors, not the difference between ERROR and RETRY cases. Matthew,
> > > > are you answering the right question or did I misunderstand your
> > > > answer?
> > >
> > > Maybe I'm misunderstanding what you're proposing.  I thought the
> > > proposal was to make neither ERROR nor RETRY increment the counters,
> > > but if the proposal is to make ERROR increment the perf counters
> > > instead, then that's cool with me.
> > 
> > Oh, I think now I understand your answer. You were not highlighting
> > the difference between the two but objecting to the proposal of not
> > counting both ERROR and RETRY. Am I on the same page now?
> 
> I think so.  Let's see your patch and then we can be sure we're talking
> about the same thing ;-)

IMHO if there's no explicit reason to differentiate the events, we should
always account them the same way for vm,perf,... either with ERROR
accounted or not.

I am not sure whether accounting ERROR faults would matter for a mprotect()
use case, because they shouldn't rely on the counter to work but the SIGBUS
itself to be generated on page accesses with the sighandler doing work.

One trivial benefit of keeping ERROR accounted is that we only need to modify
the vm accounting ABI so both RETRY & ERROR will be adjusted to match perf/task
counters.  OTOH we can also change all to take ERROR into account, but then
we're modifying ABI for all counters.
Suren Baghdasaryan April 18, 2023, 4:45 p.m. UTC | #17
On Tue, Apr 18, 2023 at 8:48 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > ... when we called clone()?  A thread by definition has a reference to
> > its own mm.
>
> Ah yes.. thanks!

re: I also had a quick look at do_exit(), but I didn't see where
we e.g. wait for all the threads to stop before recycling a mm.

We recycle mm after all refcounts are dropped in the exit path:
  do_exit
    exit_mm
      mmput(if !mm->mm_users)
        mmdrop(if !mm->mm_count)
          free_mm

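To make the lifetime rules above concrete, here is a rough user-space model of
the two mm refcounts: mm_users (taken per CLONE_VM user at clone() time) and
mm_count (pinning the struct itself until the final mmdrop()). The struct and
helper names only mirror the kernel ones; this is a sketch, not kernel code
(the real counters are atomics):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of mm_struct's two refcounts:
 *  - mm_users: users of the address space; each CLONE_VM thread holds one,
 *    and mm_init() starts it at 1 for the first task
 *  - mm_count: references to the struct itself; last mmdrop() frees it
 */
struct mm_model {
	int mm_users;
	int mm_count;
	int freed;			/* stands in for free_mm() */
};

static struct mm_model *mm_alloc_model(void)
{
	struct mm_model *mm = calloc(1, sizeof(*mm));

	mm->mm_users = 1;		/* like atomic_set(&mm->mm_users, 1) */
	mm->mm_count = 1;
	return mm;
}

static void mmdrop_model(struct mm_model *mm)
{
	if (--mm->mm_count == 0)
		mm->freed = 1;		/* free_mm() in the real kernel */
}

static void mmget_model(struct mm_model *mm)	/* clone(CLONE_VM) */
{
	mm->mm_users++;
}

static void mmput_model(struct mm_model *mm)	/* exit_mm() */
{
	if (--mm->mm_users == 0)
		mmdrop_model(mm);	/* last user drops the struct pin */
}
```

So a thread taking a page fault always sees mm_users >= 1 on its own mm, and
the mm can only be recycled after that thread itself has gone through
exit_mm(), which is why caching the mm pointer across the fault is safe.
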
>
> --
> Peter Xu
>
Suren Baghdasaryan April 18, 2023, 5:17 p.m. UTC | #18
On Tue, Apr 18, 2023 at 9:06 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 04:08:58PM +0100, Matthew Wilcox wrote:
> > On Tue, Apr 18, 2023 at 07:54:01AM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Apr 18, 2023 at 7:25 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > > > > > >     /*
> > > > > > > > -    * We don't do accounting for some specific faults:
> > > > > > > > -    *
> > > > > > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > > > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > > > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > > > > > -    *   should use the hw profiling for that.
> > > > > > > > -    *
> > > > > > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > > > > > -    *   once they're completed.
> > > > > > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > > > > > +    * counted upon completion.
> > > > > > > >      */
> > > > > > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > > > > > +   if (ret & VM_FAULT_RETRY)
> > > > > > > > +           return;
> > > > > > > > +
> > > > > > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > > > > > +   count_vm_event(PGFAULT);
> > > > > > > > +   count_memcg_event_mm(mm, PGFAULT);
> > > > > > >
> > > > > > > Is there a reason why vm event accounting needs to be explicitly
> > > > > > > different from the perf events right below when handling ERROR?
> > > > > >
> > > > > > I think so.  ERROR is quite different from RETRY.  If we are, for
> > > > > > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > > > > > accounted.  If we can't handle a page fault right now, and need to
> > > > > > retry within the kernel, that should not be accounted.
> > > > >
> > > > > IIUC, the question was about the differences in vm vs perf accounting
> > > > > for errors, not the difference between ERROR and RETRY cases. Matthew,
> > > > > are you answering the right question or did I misunderstand your
> > > > > answer?
> > > >
> > > > Maybe I'm misunderstanding what you're proposing.  I thought the
> > > > proposal was to make neither ERROR nor RETRY increment the counters,
> > > > but if the proposal is to make ERROR increment the perf counters
> > > > instead, then that's cool with me.
> > >
> > > Oh, I think now I understand your answer. You were not highlighting
> > > the difference between the who but objecting to the proposal of not
> > > counting both ERROR and RETRY. Am I on the same page now?
> >
> > I think so.  Let's see your patch and then we can be sure we're talking
> > about the same thing ;-)
>
> IMHO if there's no explicit reason to differentiate the events, we should
> always account them the same way for vm,perf,... either with ERROR
> accounted or not.
>
> I am not sure whether accounting ERROR faults would matter for a mprotect()
> use case, because they shouldn't rely on the counter to work but the SIGBUS
> itself to be generated on page accesses with the sighandler doing work.

For that example with GC, these are valid page faults IIUC and current
PGFAULT counters do register them. Do we want to change that and
potentially break assumptions about these counters?

>
> One trivial benefit of keeping ERROR accounted is that we only need to modify
> the vm accounting ABI so both RETRY & ERROR will be adjusted to match perf/task
> counters.  OTOH we can also change all to take ERROR into account, but then
> we're modifying ABI for all counters.

So, not accounting them in both vm and perf would be problematic for
that GC example and similar cases.
Are we left with only two viable options?:
1. skip RETRY for vm and skip ERROR for both vm and perf (this patch)
2. skip RETRY for both vm and perf, account ERROR for both

#2 would go against the comment in mm_account_fault() saying that we
don't account for unsuccessful faults. I guess there must have been
some reason we were not accounting for them (such as access to a
faulty address is neither major nor minor fault)?

>
> --
> Peter Xu
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Peter Xu April 18, 2023, 6:46 p.m. UTC | #19
On Tue, Apr 18, 2023 at 09:45:52AM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 18, 2023 at 8:48 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Apr 18, 2023 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > > ... when we called clone()?  A thread by definition has a reference to
> > > its own mm.
> >
> > Ah yes.. thanks!
> 
> re: I also had a quick look at do_exit(), but I didn't see where
> we e.g. wait for all the threads to stop before recycling a mm.
> 
> We recycle mm after all refcounts are dropped in the exit path:
>   do_exit
>     exit_mm
>       mmput(if !mm->mm_users)
>         mmdrop(if !mm->mm_count)
>           free_mm

I assume Matthew means when the task_struct is created as part of
kernel_clone().

copy_mm() has:

	if (clone_flags & CLONE_VM) {
		mmget(oldmm);
		mm = oldmm;
	} else {
		mm = dup_mm(tsk, current->mm);
		if (!mm)
			return -ENOMEM;
	}

If CLONE_VM, we'll mmget() on the existing mm.  If !CLONE_VM, we'll just
create a new one with a reference held.  For the latter, I think that happens
in mm_init(), where it'll just set it to 1:

	atomic_set(&mm->mm_users, 1);

With mm_users>0, do_exit() will leave the mm_struct* alone since mmput()
will still be called but won't reach the final mmdrop().

Thanks,
Suren Baghdasaryan April 18, 2023, 9:48 p.m. UTC | #20
On Tue, Apr 18, 2023 at 10:17 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Apr 18, 2023 at 9:06 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Apr 18, 2023 at 04:08:58PM +0100, Matthew Wilcox wrote:
> > > On Tue, Apr 18, 2023 at 07:54:01AM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, Apr 18, 2023 at 7:25 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > > > > > > >     /*
> > > > > > > > > -    * We don't do accounting for some specific faults:
> > > > > > > > > -    *
> > > > > > > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > > > > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > > > > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > > > > > > -    *   should use the hw profiling for that.
> > > > > > > > > -    *
> > > > > > > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > > > > > > -    *   once they're completed.
> > > > > > > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > > > > > > +    * counted upon completion.
> > > > > > > > >      */
> > > > > > > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > > > > > > +   if (ret & VM_FAULT_RETRY)
> > > > > > > > > +           return;
> > > > > > > > > +
> > > > > > > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > > > > > > +   count_vm_event(PGFAULT);
> > > > > > > > > +   count_memcg_event_mm(mm, PGFAULT);
> > > > > > > >
> > > > > > > > Is there a reason why vm event accounting needs to be explicitly
> > > > > > > > different from the perf events right below when handling ERROR?
> > > > > > >
> > > > > > > I think so.  ERROR is quite different from RETRY.  If we are, for
> > > > > > > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > > > > > > accounted.  If we can't handle a page fault right now, and need to
> > > > > > > retry within the kernel, that should not be accounted.
> > > > > >
> > > > > > IIUC, the question was about the differences in vm vs perf accounting
> > > > > > for errors, not the difference between ERROR and RETRY cases. Matthew,
> > > > > > are you answering the right question or did I misunderstand your
> > > > > > answer?
> > > > >
> > > > > Maybe I'm misunderstanding what you're proposing.  I thought the
> > > > > proposal was to make neither ERROR nor RETRY increment the counters,
> > > > > but if the proposal is to make ERROR increment the perf counters
> > > > > instead, then that's cool with me.
> > > >
> > > > Oh, I think now I understand your answer. You were not highlighting
> > > > the difference between the two but objecting to the proposal of not
> > > > counting both ERROR and RETRY. Am I on the same page now?
> > >
> > > I think so.  Let's see your patch and then we can be sure we're talking
> > > about the same thing ;-)
> >
> > IMHO if there's no explicit reason to differentiate the events, we should
> > always account them the same way for vm,perf,... either with ERROR
> > accounted or not.
> >
> > I am not sure whether accounting ERROR faults would matter for a mprotect()
> > use case, because they shouldn't rely on the counter to work but the SIGBUS
> > itself to be generated on page accesses with the sighandler doing work.
>
> For that example with GC, these are valid page faults IIUC and current
> PGFAULT counters do register them. Do we want to change that and
> potentially break assumptions about these counters?
>
> >
> > One trivial benefit of keeping ERROR accounted is that we only need to modify
> > the vm accounting ABI so both RETRY & ERROR will be adjusted to match perf/task
> > counters.  OTOH we can also change all to take ERROR into account, but then
> > we're modifying ABI for all counters.
>
> So, not accounting them in both vm and perf would be problematic for
> that GC example and similar cases.
> Are we left with only two viable options?:
> 1. skip RETRY for vm and skip ERROR for both vm and perf (this patch)
> 2. skip RETRY for both vm and perf, account ERROR for both
>
> #2 would go against the comment in mm_account_fault() saying that we
> don't account for unsuccessful faults. I guess there must have been
> some reason we were not accounting for them (such as access to a
> faulty address is neither major nor minor fault)?

I did some digging in the history and it looks like the check for ERROR
was added after this discussion:
https://lore.kernel.org/all/20200624203412.GB64004@xz-x1/ and IIUC the
reason was that the previous code also skipped VM_FAULT_ERROR. Peter, is
that correct?

It seems this discussion is becoming longer than it should be. How
about we keep the behavior of all counters as they are to avoid
breaking any possible usecases and just fix the double-counting issue
for RETRY cases?
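
For what it's worth, the behavior being converged on here (skip RETRY
everywhere so a retried fault is counted exactly once, keep counting ERROR in
the PGFAULT vm events but not in the perf major/minor counters) can be modeled
in user space roughly as below. The flag values and counter names are
illustrative stand-ins, not the kernel's actual definitions:

```c
#include <assert.h>

/* Illustrative bits; the real VM_FAULT_* flags live in the kernel headers. */
#define VM_FAULT_RETRY	0x1
#define VM_FAULT_ERROR	0x2
#define VM_FAULT_MAJOR	0x4

struct fault_stats {
	int pgfault;	/* PGFAULT vm event / memcg counter */
	int perf_min;	/* minor-fault perf/task counter */
	int perf_maj;	/* major-fault perf/task counter */
};

/* Rough model of the accounting order being discussed. */
static void account_fault_model(struct fault_stats *s, unsigned int ret)
{
	/* Incomplete faults are skipped; counted once upon completion. */
	if (ret & VM_FAULT_RETRY)
		return;

	/* Successful and failed faults both register as PGFAULT. */
	s->pgfault++;

	/* Failed faults are not counted as major/minor perf events. */
	if (ret & VM_FAULT_ERROR)
		return;

	if (ret & VM_FAULT_MAJOR)
		s->perf_maj++;
	else
		s->perf_min++;
}
```

With this ordering, a fault that retries once and then completes as a major
fault bumps pgfault and perf_maj exactly once, while a failed fault bumps
only pgfault.
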

>
> >
> > --
> > Peter Xu
> >
Peter Xu April 18, 2023, 10:45 p.m. UTC | #21
On Tue, Apr 18, 2023 at 02:48:58PM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 18, 2023 at 10:17 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Apr 18, 2023 at 9:06 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Apr 18, 2023 at 04:08:58PM +0100, Matthew Wilcox wrote:
> > > > On Tue, Apr 18, 2023 at 07:54:01AM -0700, Suren Baghdasaryan wrote:
> > > > > On Tue, Apr 18, 2023 at 7:25 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> > > > > > > On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > > > > > > > >     /*
> > > > > > > > > > -    * We don't do accounting for some specific faults:
> > > > > > > > > > -    *
> > > > > > > > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > > > > > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > > > > > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > > > > > > > -    *   should use the hw profiling for that.
> > > > > > > > > > -    *
> > > > > > > > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > > > > > > > -    *   once they're completed.
> > > > > > > > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > > > > > > > +    * counted upon completion.
> > > > > > > > > >      */
> > > > > > > > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > > > > > > > +   if (ret & VM_FAULT_RETRY)
> > > > > > > > > > +           return;
> > > > > > > > > > +
> > > > > > > > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > > > > > > > +   count_vm_event(PGFAULT);
> > > > > > > > > > +   count_memcg_event_mm(mm, PGFAULT);
> > > > > > > > >
> > > > > > > > > Is there a reason why vm event accounting needs to be explicitly
> > > > > > > > > different from the perf events right below when handling ERROR?
> > > > > > > >
> > > > > > > > I think so.  ERROR is quite different from RETRY.  If we are, for
> > > > > > > > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > > > > > > > accounted.  If we can't handle a page fault right now, and need to
> > > > > > > > retry within the kernel, that should not be accounted.
> > > > > > >
> > > > > > > IIUC, the question was about the differences in vm vs perf accounting
> > > > > > > for errors, not the difference between ERROR and RETRY cases. Matthew,
> > > > > > > are you answering the right question or did I misunderstand your
> > > > > > > answer?
> > > > > >
> > > > > > Maybe I'm misunderstanding what you're proposing.  I thought the
> > > > > > proposal was to make neither ERROR nor RETRY increment the counters,
> > > > > > but if the proposal is to make ERROR increment the perf counters
> > > > > > instead, then that's cool with me.
> > > > >
> > > > > Oh, I think now I understand your answer. You were not highlighting
> > > > > the difference between the two but objecting to the proposal of not
> > > > > counting both ERROR and RETRY. Am I on the same page now?
> > > >
> > > > I think so.  Let's see your patch and then we can be sure we're talking
> > > > about the same thing ;-)
> > >
> > > IMHO if there's no explicit reason to differentiate the events, we should
> > > always account them the same way for vm,perf,... either with ERROR
> > > accounted or not.
> > >
> > > I am not sure whether accounting ERROR faults would matter for a mprotect()
> > > use case, because they shouldn't rely on the counter to work but the SIGBUS
> > > itself to be generated on page accesses with the sighandler doing work.
> >
> > For that example with GC, these are valid page faults IIUC and current
> > PGFAULT counters do register them. Do we want to change that and
> > potentially break assumptions about these counters?
> >
> > >
> > > One trivial benefit of keeping ERROR accounted is that we only need to modify
> > > the vm accounting ABI so both RETRY & ERROR will be adjusted to match perf/task
> > > counters.  OTOH we can also change all to take ERROR into account, but then
> > > we're modifying ABI for all counters.
> >
> > So, not accounting them in both vm and perf would be problematic for
> > that GC example and similar cases.
> > Are we left with only two viable options?:
> > 1. skip RETRY for vm and skip ERROR for both vm and perf (this patch)
> > 2. skip RETRY for both vm and perf, account ERROR for both
> >
> > #2 would go against the comment in mm_account_fault() saying that we
> > don't account for unsuccessful faults. I guess there must have been
> > some reason we were not accounting for them (such as access to a
> > faulty address is neither major nor minor fault)?
> 
> I did some digging in the history and looks like the check for ERROR
> was added after this discussion:
> https://lore.kernel.org/all/20200624203412.GB64004@xz-x1/ and IIUC the
> reason was that previous code also skipped VM_FAULT_ERROR. Peter, is
> that correct?

I think so.

It was a few years ago; what I remember is that we did change some of the
counters at least on some archs, and I don't remember anything being broken
for real after that.

> 
> It seems this discussion is becoming longer than it should be. How
> about we keep the behavior of all counters as they are to avoid
> breaking any possible usecases and just fix the double-counting issue
> for RETRY cases?

Sure, though again I hope we can add some comment explaining this, because
the resulting code can look a bit weird, handling different counters
differently.

IMHO it can be as simple as "we did accounting differently on different
types of counters for historical reasons, here just to make them compatible
with old kernels", maybe it will still help a bit when we read
this chunk of code again?

Thanks,
Suren Baghdasaryan April 18, 2023, 10:58 p.m. UTC | #22
On Tue, Apr 18, 2023 at 3:45 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 02:48:58PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Apr 18, 2023 at 10:17 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Apr 18, 2023 at 9:06 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Tue, Apr 18, 2023 at 04:08:58PM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Apr 18, 2023 at 07:54:01AM -0700, Suren Baghdasaryan wrote:
> > > > > > On Tue, Apr 18, 2023 at 7:25 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, Apr 17, 2023 at 04:17:45PM -0700, Suren Baghdasaryan wrote:
> > > > > > > > On Mon, Apr 17, 2023 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Apr 17, 2023 at 03:40:33PM -0400, Peter Xu wrote:
> > > > > > > > > > >     /*
> > > > > > > > > > > -    * We don't do accounting for some specific faults:
> > > > > > > > > > > -    *
> > > > > > > > > > > -    * - Unsuccessful faults (e.g. when the address wasn't valid).  That
> > > > > > > > > > > -    *   includes arch_vma_access_permitted() failing before reaching here.
> > > > > > > > > > > -    *   So this is not a "this many hardware page faults" counter.  We
> > > > > > > > > > > -    *   should use the hw profiling for that.
> > > > > > > > > > > -    *
> > > > > > > > > > > -    * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
> > > > > > > > > > > -    *   once they're completed.
> > > > > > > > > > > +    * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > > > > > > > > +    * counted upon completion.
> > > > > > > > > > >      */
> > > > > > > > > > > -   if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > > > > > > > > +   if (ret & VM_FAULT_RETRY)
> > > > > > > > > > > +           return;
> > > > > > > > > > > +
> > > > > > > > > > > +   /* Register both successful and failed faults in PGFAULT counters. */
> > > > > > > > > > > +   count_vm_event(PGFAULT);
> > > > > > > > > > > +   count_memcg_event_mm(mm, PGFAULT);
> > > > > > > > > >
> > > > > > > > > > Is there reason on why vm events accountings need to be explicitly
> > > > > > > > > > different from perf events right below on handling ERROR?
> > > > > > > > >
> > > > > > > > > I think so.  ERROR is quite different from RETRY.  If we are, for
> > > > > > > > > example, handling a SIGSEGV (perhaps a GC language?) that should be
> > > > > > > > > accounted.  If we can't handle a page fault right now, and need to
> > > > > > > > > retry within the kernel, that should not be accounted.
> > > > > > > >
> > > > > > > > IIUC, the question was about the differences in vm vs perf accounting
> > > > > > > > for errors, not the difference between ERROR and RETRY cases. Matthew,
> > > > > > > > are you answering the right question or did I misunderstand your
> > > > > > > > answer?
> > > > > > >
> > > > > > > Maybe I'm misunderstanding what you're proposing.  I thought the
> > > > > > > proposal was to make neither ERROR nor RETRY increment the counters,
> > > > > > > but if the proposal is to make ERROR increment the perf counters
> > > > > > > instead, then that's cool with me.
> > > > > >
> > > > > > Oh, I think now I understand your answer. You were not highlighting
> > > > > > the difference between the two but objecting to the proposal of not
> > > > > > counting both ERROR and RETRY. Am I on the same page now?
> > > > >
> > > > > I think so.  Let's see your patch and then we can be sure we're talking
> > > > > about the same thing ;-)
> > > >
> > > > IMHO if there's no explicit reason to differentiate the events, we should
> > > > always account them the same way for vm,perf,... either with ERROR
> > > > accounted or not.
> > > >
> > > > I am not sure whether accounting ERROR faults would matter for an
> > > > mprotect() use case, because such users shouldn't rely on the counter
> > > > but on the SIGBUS itself being generated on page accesses, with the
> > > > sighandler doing the work.
> > >
> > > For that example with GC, these are valid page faults IIUC and current
> > > PGFAULT counters do register them. Do we want to change that and
> > > potentially break assumptions about these counters?
> > >
> > > >
> > > > One trivial benefit of keeping the ERROR accounting is that we only
> > > > need to modify the vm accounting ABI, so both RETRY & ERROR will be
> > > > adjusted to match the perf/task counters.  OTOH we can also change all
> > > > of them to take ERROR into account, but then we're modifying the ABI
> > > > for all counters.
> > >
> > > So, not accounting them in both vm and perf would be problematic for
> > > that GC example and similar cases.
> > > Are we left with only two viable options?
> > > 1. skip RETRY for vm and skip ERROR for both vm and perf (this patch)
> > > 2. skip RETRY for both vm and perf, account ERROR for both
> > >
> > > #2 would go against the comment in mm_account_fault() saying that we
> > > don't account for unsuccessful faults. I guess there must have been
> > > some reason we were not accounting for them (such as access to a
> > > faulty address being neither a major nor a minor fault)?
> >
> > I did some digging in the history and looks like the check for ERROR
> > was added after this discussion:
> > https://lore.kernel.org/all/20200624203412.GB64004@xz-x1/ and IIUC the
> > reason was that previous code also skipped VM_FAULT_ERROR. Peter, is
> > that correct?
>
> I think so.
>
> It was a few years ago; what I remember is that we did change some of the
> counters, at least on some archs, and I don't remember anything being
> broken for real after that.
>
> >
> > It seems this discussion is becoming longer than it should be. How
> > about we keep the behavior of all counters as they are to avoid
> > breaking any possible use cases and just fix the double-counting issue
> > for RETRY cases?
>
> Sure, though again I hope we can add a comment explaining this, because
> the outcome of the code can look a bit weird, handling different counters
> differently.
>
> IMHO it can be as simple as "we did accounting differently on different
> types of counters for historical reasons, here just to make them compatible
> with old kernels"; maybe it will still help a bit when we read this chunk
> of code again?

Sure. How about replacing my prior "Register both successful and
failed faults in PGFAULT counters." comment with "To preserve the
behavior of older kernels, PGFAULT counters record both successful and
failed faults, as opposed to perf counters which ignore failed cases"?

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu April 18, 2023, 11:38 p.m. UTC | #23
On Tue, Apr 18, 2023 at 03:58:26PM -0700, Suren Baghdasaryan wrote:
> Sure. How about replacing my prior "Register both successful and
> failed faults in PGFAULT counters." comment with "To preserve the
> behavior of older kernels, PGFAULT counters record both successful and
> failed faults, as opposed to perf counters which ignore failed cases"

Looks good here, thanks!
Suren Baghdasaryan April 19, 2023, 6 p.m. UTC | #24
On Tue, Apr 18, 2023 at 4:38 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 03:58:26PM -0700, Suren Baghdasaryan wrote:
> > Sure. How about replacing my prior "Register both successful and
> > failed faults in PGFAULT counters." comment with "To preserve the
> > behavior of older kernels, PGFAULT counters record both successful and
> > failed faults, as opposed to perf counters which ignore failed cases"
>
> Looks good here, thanks!

v3 is posted at:
https://lore.kernel.org/all/20230419175836.3857458-1-surenb@google.com/
Thanks!

>
> --
> Peter Xu
>

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 01a23ad48a04..c3b709ceeed7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5080,24 +5080,30 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
  * updates.  However, note that the handling of PERF_COUNT_SW_PAGE_FAULTS should
  * still be in per-arch page fault handlers at the entry of page fault.
  */
-static inline void mm_account_fault(struct pt_regs *regs,
+static inline void mm_account_fault(struct mm_struct *mm, struct pt_regs *regs,
 				    unsigned long address, unsigned int flags,
 				    vm_fault_t ret)
 {
 	bool major;
 
 	/*
-	 * We don't do accounting for some specific faults:
-	 *
-	 * - Unsuccessful faults (e.g. when the address wasn't valid).  That
-	 *   includes arch_vma_access_permitted() failing before reaching here.
-	 *   So this is not a "this many hardware page faults" counter.  We
-	 *   should use the hw profiling for that.
-	 *
-	 * - Incomplete faults (VM_FAULT_RETRY).  They will only be counted
-	 *   once they're completed.
+	 * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
+	 * counted upon completion.
 	 */
-	if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
+	if (ret & VM_FAULT_RETRY)
+		return;
+
+	/* Register both successful and failed faults in PGFAULT counters. */
+	count_vm_event(PGFAULT);
+	count_memcg_event_mm(mm, PGFAULT);
+
+	/*
+	 * Do not account for unsuccessful faults (e.g. when the address wasn't
+	 * valid).  That includes arch_vma_access_permitted() failing before
+	 * reaching here. So this is not a "this many hardware page faults"
+	 * counter.  We should use the hw profiling for that.
+	 */
+	if (ret & VM_FAULT_ERROR)
 		return;
 
 	/*
@@ -5180,21 +5186,22 @@  static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
 vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
+	/* Copy vma->vm_mm in case mmap_lock is dropped and vma becomes unstable. */
+	struct mm_struct *mm = vma->vm_mm;
 	vm_fault_t ret;
 
 	__set_current_state(TASK_RUNNING);
 
-	count_vm_event(PGFAULT);
-	count_memcg_event_mm(vma->vm_mm, PGFAULT);
-
 	ret = sanitize_fault_flags(vma, &flags);
 	if (ret)
-		return ret;
+		goto out;
 
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
-					    flags & FAULT_FLAG_REMOTE))
-		return VM_FAULT_SIGSEGV;
+					    flags & FAULT_FLAG_REMOTE)) {
+		ret = VM_FAULT_SIGSEGV;
+		goto out;
+	}
 
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
@@ -5223,8 +5230,8 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
 			mem_cgroup_oom_synchronize(false);
 	}
-
-	mm_account_fault(regs, address, flags, ret);
+out:
+	mm_account_fault(mm, regs, address, flags, ret);
 
 	return ret;
 }