[v2,0/5] mm/mprotect: avoid unnecessary TLB flushes

Message ID 20211021122112.592634-1-namit@vmware.com

Message

Nadav Amit Oct. 21, 2021, 12:21 p.m. UTC
From: Nadav Amit <namit@vmware.com>

This patch-set is intended to remove unnecessary TLB flushes. It is
based on feedback from v1 and several bugs I found in v1 myself.

Basically, there are 3 optimizations in this patch-set:
1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
   prevent the A/D bits from changing.
2. Use TLB batching infrastructure to batch flushes across VMAs and
   do better/fewer flushes.
3. Avoid TLB flushes on permission demotion.

Andrea asked for the aforementioned (2) to come after (3), but this
is not simple (specifically since change_prot_numa() needs the number
of pages affected).
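
As a rough illustration of how (2) and (3) fit together, here is a
minimal sketch of the core of the change (not the posted diff, and a
fragment rather than a full function; pte_may_need_flush() stands in
for the new x86 helper this series introduces, the name is an
assumption, while tlb_flush_pte_range() is the existing generic
mmu_gather primitive):

	/* Inside a change_pte_range()-style loop; the caller is assumed
	 * to bracket the walk with tlb_gather_mmu()/tlb_finish_mmu() so
	 * flushes are batched across VMAs (optimization 2). */
	oldpte = ptep_modify_prot_start(vma, addr, pte);
	ptent = pte_modify(oldpte, newprot);
	ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);

	/* Optimization 3: queue a flush only if a stale TLB entry could
	 * be harmful, i.e. a permission was removed. Pure promotion
	 * (e.g. adding write) can instead be fixed up at fault time. */
	if (pte_may_need_flush(oldpte, ptent))
		tlb_flush_pte_range(tlb, addr, PAGE_SIZE);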

There are many changes from v1 to v2, so consider the change log
partial.

v1->v2:
* Fix wrong detection of permission demotion [Andrea]
* Improve comments [Andrea]
* Handle THP [Andrea]
* Batch flushes across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of mmu_gather

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org

Nadav Amit (5):
  x86: Detection of Knights Landing A/D leak
  mm: avoid unnecessary flush on change_huge_pmd()
  x86/mm: check exec permissions on fault
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion

 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/pgtable.h       |  8 +++
 arch/x86/include/asm/pgtable_types.h |  2 +
 arch/x86/include/asm/tlbflush.h      | 80 +++++++++++++++++++++++
 arch/x86/kernel/cpu/intel.c          |  5 ++
 arch/x86/mm/fault.c                  | 11 +++-
 fs/exec.c                            |  6 +-
 include/asm-generic/tlb.h            | 14 +++++
 include/linux/huge_mm.h              |  5 +-
 include/linux/mm.h                   |  5 +-
 include/linux/pgtable.h              |  5 ++
 mm/huge_memory.c                     | 22 ++++---
 mm/mprotect.c                        | 94 +++++++++++++++-------------
 mm/pgtable-generic.c                 |  8 +++
 mm/userfaultfd.c                     |  6 +-
 15 files changed, 215 insertions(+), 57 deletions(-)
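
For orientation, the detection in patch 1 presumably boils down to
flagging the affected CPU model during init; a sketch under that
assumption, not the posted diff:

	/* arch/x86/kernel/cpu/intel.c (sketch): mark Knights Landing as
	 * affected by the erratum that lets the A/D bits be set in a
	 * PTE even after it has been cleared. */
	if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL)
		set_cpu_bug(c, X86_BUG_PTE_LEAK);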

Comments

Andrew Morton Oct. 22, 2021, 3:04 a.m. UTC | #1
On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <nadav.amit@gmail.com> wrote:

> This patch-set is intended to remove unnecessary TLB flushes. It is
> based on feedback from v1 and several bugs I found in v1 myself.
> 
> Basically, there are 3 optimizations in this patch-set:
> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
>    prevent the A/D bits from changing.
> 2. Use TLB batching infrastructure to batch flushes across VMAs and
>    do better/fewer flushes.
> 3. Avoid TLB flushes on permission demotion.
> 
> Andrea asked for the aforementioned (2) to come after (3), but this
> is not simple (specifically since change_prot_numa() needs the number
> of pages affected).

[1/5] appears to be a significant fix which should probably be
backported into -stable kernels.  If you agree with this then I suggest
it be prepared as a standalone patch, separate from the other four
patches.  With a cc:stable.

And the remaining patches are a performance optimization.  Has any
attempt been made to quantify the benefits?
Nadav Amit Oct. 22, 2021, 9:58 p.m. UTC | #2
> On Oct 21, 2021, at 8:04 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> This patch-set is intended to remove unnecessary TLB flushes. It is
>> based on feedback from v1 and several bugs I found in v1 myself.
>> 
>> Basically, there are 3 optimizations in this patch-set:
>> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
>>   prevent the A/D bits from changing.
>> 2. Use TLB batching infrastructure to batch flushes across VMAs and
>>   do better/fewer flushes.
>> 3. Avoid TLB flushes on permission demotion.
>> 
>> Andrea asked for the aforementioned (2) to come after (3), but this
>> is not simple (specifically since change_prot_numa() needs the number
>> of pages affected).
> 
> [1/5] appears to be a significant fix which should probably be
> backported into -stable kernels.  If you agree with this then I suggest
> it be prepared as a standalone patch, separate from the other four
> patches.  With a cc:stable.


There is no functional bug in the kernel. The Knights Landing bug
was eventually circumvented by changing the swap entry structure so
that the access/dirty bits would not overlap with the swap entry data.

> 
> And the remaining patches are a performance optimization.  Has any
> attempt been made to quantify the benefits?

I included some data before [1]. In general, what is saved is the
cost of a TLB flush/shootdown.

I will modify my benchmark to test huge pages (which were not covered
by the previous patch-set) and send results later. I will also try
nodejs to see whether the benefit is significant enough.
Nodejs crashed before (hence the 3rd patch added here), as it
exec-protects/unprotects pages; I will see whether the benefit shows
up in the benchmarks.
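
For context, this is what patch 3 guards against: once flushes on
exec-permission promotion are elided, a stale non-executable TLB
entry can trigger a spurious fault, and the fault handler has to
judge the fetch against the VMA rather than the stale PTE. A hedged
sketch of the kind of check involved, not the literal patch:

	/* arch/x86/mm/fault.c, access_error()-style logic (sketch): an
	 * instruction fetch from a VMA without VM_EXEC is a real error;
	 * a faulting fetch from an executable VMA can be treated as
	 * spurious (stale TLB) and retried. */
	if (unlikely((error_code & X86_PF_INSTR) &&
		     !(vma->vm_flags & VM_EXEC)))
		return 1;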

[ The motivation behind the patches is to later introduce a
userfaultfd writeprotectv interface; for my use-case, which is under
development, this proved to improve performance considerably. ]

[1] https://lore.kernel.org/linux-mm/DA49DBBB-FFEE-4ACC-BB6C-364D07533C5E@vmware.com/
Peter Zijlstra Oct. 25, 2021, 10:50 a.m. UTC | #3
On Thu, Oct 21, 2021 at 08:04:50PM -0700, Andrew Morton wrote:
> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> > This patch-set is intended to remove unnecessary TLB flushes. It is
> > based on feedback from v1 and several bugs I found in v1 myself.
> > 
> > Basically, there are 3 optimizations in this patch-set:
> > 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
> >    prevent the A/D bits from changing.
> > 2. Use TLB batching infrastructure to batch flushes across VMAs and
> >    do better/fewer flushes.
> > 3. Avoid TLB flushes on permission demotion.
> > 
> > Andrea asked for the aforementioned (2) to come after (3), but this
> > is not simple (specifically since change_prot_numa() needs the number
> > of pages affected).
> 
> [1/5] appears to be a significant fix which should probably be
> backported into -stable kernels.  If you agree with this then I suggest
> it be prepared as a standalone patch, separate from the other four
> patches.  With a cc:stable.

I am confused: 1/5 doesn't actually do *anything*. I also cannot find
any further usage of the introduced X86_BUG_PTE_LEAK.

I'm thinking patch #2 means to have something like:

	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

In the newly minted pmdp_invalidate_ad(), but alas, nothing there.
Nadav Amit Oct. 25, 2021, 4:42 p.m. UTC | #4
> On Oct 25, 2021, at 3:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Oct 21, 2021 at 08:04:50PM -0700, Andrew Morton wrote:
>> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>>> This patch-set is intended to remove unnecessary TLB flushes. It is
>>> based on feedback from v1 and several bugs I found in v1 myself.
>>> 
>>> Basically, there are 3 optimizations in this patch-set:
>>> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
>>>   prevent the A/D bits from changing.
>>> 2. Use TLB batching infrastructure to batch flushes across VMAs and
>>>   do better/fewer flushes.
>>> 3. Avoid TLB flushes on permission demotion.
>>> 
>>> Andrea asked for the aforementioned (2) to come after (3), but this
>>> is not simple (specifically since change_prot_numa() needs the number
>>> of pages affected).
>> 
>> [1/5] appears to be a significant fix which should probably be
>> backported into -stable kernels.  If you agree with this then I suggest
>> it be prepared as a standalone patch, separate from the other four
>> patches.  With a cc:stable.
> 
> I am confused, 1/5 doesn't actually do *anything*. I also cannot find
> any further usage of the introduced X86_BUG_PTE_LEAK.
> 
> I'm thinking patch #2 means to have something like:
> 
> 	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> 		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> 
> In the newly minted pmdp_invalidate_ad(), but alas, nothing there.

This change was intended only for pmdp_invalidate_ad() but somehow
got lost. I will add it there.

In the end I did not add the optimization to avoid TLB flushes on
(!dirty|write)->!write transitions, so I did not use it for the first
case that you mentioned. I think it is correct, but I am too wary to
include it now. Perhaps I will add it as a separate patch.
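
For reference, the intended end state is presumably something like
the following (a sketch that folds Peter's snippet into the new
helper; the exact placement and naming may differ in v3):

	/* Sketch of pmdp_invalidate_ad(): invalidate a PMD while
	 * tolerating concurrent hardware A/D updates. */
	pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
				 unsigned long address, pmd_t *pmdp)
	{
		pmd_t old = pmdp_establish(vma, address, pmdp,
					   pmd_mkinvalid(*pmdp));

		/* Knights Landing may still set A/D in the cleared
		 * entry; flush so the stale bits cannot leak. */
		if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
			flush_pmd_tlb_range(vma, address,
					    address + HPAGE_PMD_SIZE);
		return old;
	}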
Dave Hansen Oct. 26, 2021, 4:09 p.m. UTC | #5
On 10/22/21 2:58 PM, Nadav Amit wrote:
>> [1/5] appears to be a significant fix which should probably be
>> backported into -stable kernels.  If you agree with this then I suggest
>> it be prepared as a standalone patch, separate from the other four
>> patches.  With a cc:stable.
> 
> There is no functional bug in the kernel. The Knights Landing bug
> was eventually circumvented by changing the swap entry structure so
> that the access/dirty bits would not overlap with the swap entry data.

Yeah, it was a significant issue, but we fixed it here:

> commit 00839ee3b299303c6a5e26a0a2485427a3afcbbf
> Author: Dave Hansen <dave.hansen@linux.intel.com>
> Date:   Thu Jul 7 17:19:11 2016 -0700
> 
>     x86/mm: Move swap offset/type up in PTE to work around erratum
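
That commit moved the swap type/offset into PTE bits above
_PAGE_BIT_PROTNONE, so swap data no longer overlaps the
accessed/dirty bits the erratum can set. A tiny standalone check of
the non-overlap; the bit positions are assumptions taken from the
commit message, not verified against every kernel version:

	#include <assert.h>
	#include <stdint.h>

	#define _PAGE_BIT_ACCESSED	5	/* x86 PTE A bit (assumed) */
	#define _PAGE_BIT_DIRTY		6	/* x86 PTE D bit (assumed) */
	#define SWP_TYPE_FIRST_BIT	9	/* _PAGE_BIT_PROTNONE + 1 */

	int main(void)
	{
		uint64_t ad_mask  = (1ULL << _PAGE_BIT_ACCESSED) |
				    (1ULL << _PAGE_BIT_DIRTY);
		/* All swap data (type and offset) sits at bit 9 or above. */
		uint64_t swp_mask = ~0ULL << SWP_TYPE_FIRST_BIT;

		/* A stray A/D bit can no longer corrupt a swap entry. */
		assert((ad_mask & swp_mask) == 0);
		return 0;
	}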