[001/163] mm/memory.c: avoid access flag update TLB flush for retried page fault

Message ID 20200807061706.unk5_0KtC%akpm@linux-foundation.org (mailing list archive)
State New, archived

Commit Message

Andrew Morton Aug. 7, 2020, 6:17 a.m. UTC
From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm/memory.c: avoid access flag update TLB flush for retried page fault

Recently we found a regression when running the will_it_scale/page_fault3
test on ARM64: over 70% down for the multi-process cases and over 20% down
for the multi-thread cases.  It turns out the regression is caused by
commit 89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
calling balance_dirty_pages() in write fault").

The test mmaps a memory-sized file and then writes to the mapping.  This
dirties all memory and triggers dirty page throttling, at which point that
upstream commit releases mmap_sem and retries the page fault.  The retried
page fault sees the correct PTEs installed by the first try, then updates
the dirty bit, clears the read-only bit, and flushes TLBs on ARM.  The
regression is caused by the excessive TLB flushing.  x86 is fine because it
doesn't clear the read-only bit, so there is no need to flush the TLB in
this case.

The page fault would be retried due to:
1. Waiting for page readahead
2. Waiting for page swapped in
3. Waiting for dirty pages throttling

In the first two cases no PTEs are set up at all, so the retried page
fault installs them and never reaches the dirty bit and read-only bit
update.  In case #3 the PTEs are usually already installed, so the retried
page fault does reach that update.  But it seems unnecessary to modify
those bits again for #3, since they should already have been set by the
first page fault try.

Of course a parallel page fault may set up PTEs, but we only need to care
about write faults.  If the parallel page fault set up a writable and dirty
PTE, the retried fault doesn't need to do anything extra.  If the parallel
page fault set up a clean read-only PTE, the retried fault should just call
do_wp_page() and then return, as the code snippet below shows:

if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(vmf);
}

With this fix the test results are back to normal.
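
The decision order described above can be sketched in user space as
follows.  This is only a model of the control flow, not kernel code: the
flag values are stand-ins chosen for illustration, and the fault_action
enum and model_pte_fault() are hypothetical names.

```c
#include <stdbool.h>

/* Stand-in flag bits for illustration; not taken from kernel headers. */
#define FAULT_FLAG_WRITE 0x01u
#define FAULT_FLAG_TRIED 0x20u

/* Hypothetical outcomes of the modelled fault path. */
enum fault_action {
	ACT_DO_WP_PAGE,   /* write fault on a read-only PTE */
	ACT_SKIP_UPDATE,  /* retried fault: go to unlock, PTE left untouched */
	ACT_UPDATE_PTE,   /* mark young (and dirty for writes), then maybe flush */
};

/* Mirrors the order of checks described in this commit message. */
static enum fault_action model_pte_fault(unsigned int flags, bool pte_writable)
{
	/* A write to a read-only PTE is always handed to do_wp_page(). */
	if ((flags & FAULT_FLAG_WRITE) && !pte_writable)
		return ACT_DO_WP_PAGE;

	/* A retried fault skips the dirty/young update and the flush. */
	if (flags & FAULT_FLAG_TRIED)
		return ACT_SKIP_UPDATE;

	return ACT_UPDATE_PTE;
}
```

For example, model_pte_fault(FAULT_FLAG_WRITE | FAULT_FLAG_TRIED, true)
yields ACT_SKIP_UPDATE, which is the retried case argued above to need no
further work; with pte_writable == false the same flags yield
ACT_DO_WP_PAGE.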

[yang.shi@linux.alibaba.com: incorporate comment from Will Deacon, update commit log per discussion]
  Link: http://lkml.kernel.org/r/1594848990-55657-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1594148072-91273-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Xu Yu <xuyu@linux.alibaba.com>
Debugged-by: Xu Yu <xuyu@linux.alibaba.com>
Tested-by: Xu Yu <xuyu@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Linus Torvalds Aug. 7, 2020, 6:17 p.m. UTC | #1
On Thu, Aug 6, 2020 at 11:17 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Yang Shi <yang.shi@linux.alibaba.com>
> Subject: mm/memory.c: avoid access flag update TLB flush for retried page fault

This is not the safe version that just avoids the extra TLB flush.

This is - once again - the thing that skips the whole mkdirty and page
table update too.

I'm not taking it this time _either_.

Andrew, please flush this garbage from your system.

                 Linus
Yang Shi Aug. 7, 2020, 8:53 p.m. UTC | #2
On Fri, Aug 7, 2020 at 11:17 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Aug 6, 2020 at 11:17 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Yang Shi <yang.shi@linux.alibaba.com>
> > Subject: mm/memory.c: avoid access flag update TLB flush for retried page fault
>
> This is not the safe version that just avoids the extra TLB flush.
>
> This is - once again - the thing that skips the whole mkdirty and page
> table update too.
>
> I'm not taking it this time _either_.

I suppose Catalin will submit his proposal (flush the local TLB on a
spurious TLB fault on ARM) for this specific regression, per the
discussion, right?

And the more general spurious TLB fault problem doesn't sound that
urgent, since it should be very rare.

>
> Andrew, please flush this garbage from your system.
>
>                  Linus
>
Linus Torvalds Aug. 8, 2020, 4:33 a.m. UTC | #3
On Fri, Aug 7, 2020 at 1:53 PM Yang Shi <shy828301@gmail.com> wrote:
>
> I suppose Catalin will submit his proposal (flush the local TLB on a
> spurious TLB fault on ARM) for this specific regression, per the
> discussion, right?

I think arm64 should do that regardless, yes.

But I would also be ok with a version that does the FAULT_FLAG_TRIED
testing, but does it only for that spurious TLB flushing.

This "let's not update the page tables at all" is wrong, when the only
problem was the TLB flushing.

So changing the current (but questionable)

                if (vmf->flags & FAULT_FLAG_WRITE)
                        flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);

to be

                if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_TRIED))
                        flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);

would be fine.

But this patch that changes any semantics outside just the flushing is
a complete no-no.

                Linus
Yang Shi Aug. 10, 2020, 5:48 p.m. UTC | #4
On Fri, Aug 7, 2020 at 9:34 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Aug 7, 2020 at 1:53 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > I suppose Catalin will submit his proposal (flush the local TLB on a
> > spurious TLB fault on ARM) for this specific regression, per the
> > discussion, right?
>
> I think arm64 should do that regardless, yes.
>
> But I would also be ok with a version that does the FAULT_FLAG_TRIED
> testing, but does it only for that spurious TLB flushing.
>
> This "let's not update the page tables at all" is wrong, when the only
> problem was the TLB flushing.
>
> So changing the current (but questionable)
>
>                 if (vmf->flags & FAULT_FLAG_WRITE)
>                         flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
>
> to be
>
>                 if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_TRIED))
>                         flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);

It looks like the retried fault still flushes the TLB with this change.

Shouldn't we do something like this to skip the spurious TLB flush:

@@ -4251,6 +4251,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
                                vmf->flags & FAULT_FLAG_WRITE)) {
                update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
        } else {
+               if (vmf->flags & FAULT_FLAG_TRIED)
+                       goto unlock;
+
                /*
                 * This is needed only for protection faults but the arch code
                 * is not yet telling us if this is a protection fault or not.

>
> would be fine.
>
> But this patch that changes any semantics outside just the flushing is
> a complete no-no.
>
>                 Linus
Linus Torvalds Aug. 10, 2020, 6:57 p.m. UTC | #5
On Mon, Aug 10, 2020 at 10:48 AM Yang Shi <shy828301@gmail.com> wrote:
>
> It looks like the retried fault still flushes the TLB with this change.
>
> Shouldn't we do something like this to skip the spurious TLB flush:

I have no idea what code-base you're basing your patches against, and
what you're comparing my patch to.

Your patch does *exactly* the same thing mine did. Except it does a
"goto unlock" to jump over the flush_tlb_fix_spurious_fault(), while
my pseudo-patch just changed the

                if (vmf->flags & FAULT_FLAG_WRITE)

to be a

                if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_TRIED))

but it has the same effect: it skips the flush_tlb_fix_spurious_fault().

So if you think your patch does something else, then your source code
doesn't match mine. The *only* thing you jumped over was that same
thing that I disabled.

Somebody is confused.

                    Linus
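
For readers following the exchange, the flush conditions quoted in this
thread can be compared by evaluating them over the two flags involved.
This is a user-space sketch, not kernel code: the flag values are
stand-ins, the function names are hypothetical, and each function simply
transcribes a condition as written in the corresponding message.  The
third function assumes that the flush following the comment quoted in #4
is the existing FAULT_FLAG_WRITE check.

```c
#include <stdbool.h>

/* Stand-in flag bits for illustration; not taken from kernel headers. */
#define FAULT_FLAG_WRITE 0x01u
#define FAULT_FLAG_TRIED 0x20u

/* Existing check quoted in #3: flush on write faults. */
static bool flush_current(unsigned int flags)
{
	return (flags & FAULT_FLAG_WRITE) != 0;
}

/* Condition exactly as written in the pseudo-patch in #3. */
static bool flush_pseudo_patch(unsigned int flags)
{
	return (flags & (FAULT_FLAG_WRITE | FAULT_FLAG_TRIED)) != 0;
}

/* Snippet in #4: "goto unlock" on TRIED before the flush is reached. */
static bool flush_goto_unlock(unsigned int flags)
{
	if (flags & FAULT_FLAG_TRIED)
		return false;
	return (flags & FAULT_FLAG_WRITE) != 0;
}
```

Evaluated this way, a retried write fault (FAULT_FLAG_WRITE |
FAULT_FLAG_TRIED) returns true from the first two functions and false from
the third, so the conditions as literally written differ for that input.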
Patch

--- a/mm/memory.c~mm-avoid-access-flag-update-tlb-flush-for-retried-page-fault
+++ a/mm/memory.c
@@ -4241,8 +4241,14 @@  static vm_fault_t handle_pte_fault(struc
 	if (vmf->flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
-		entry = pte_mkdirty(entry);
 	}
+
+	if (vmf->flags & FAULT_FLAG_TRIED)
+		goto unlock;
+
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		entry = pte_mkdirty(entry);
+
 	entry = pte_mkyoung(entry);
 	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
 				vmf->flags & FAULT_FLAG_WRITE)) {