
[2/2] mm/mprotect: do not flush on permission promotion

Message ID 20210925205423.168858-3-namit@vmware.com (mailing list archive)
State New
Series mm/mprotect: avoid unnecessary TLB flushes

Commit Message

Nadav Amit Sept. 25, 2021, 8:54 p.m. UTC
From: Nadav Amit <namit@vmware.com>

Currently, unprotecting a memory region with either mprotect() or uffd
causes a TLB flush. At least on x86, no TLB flush is needed when
protection is promoted.

Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().

For x86, besides the simple logic that PTE protection promotion or
changes of software bits do not require a flush, also add logic that
considers the dirty-bit. Changes to the access-bit do not trigger a TLB
flush, although architecturally they should, as Linux considers the
access-bit as a hint.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 40 +++++++++++++++++++++++++++++++++
 include/asm-generic/tlb.h       |  4 ++++
 mm/mprotect.c                   |  3 ++-
 3 files changed, 46 insertions(+), 1 deletion(-)

Comments

David Hildenbrand Oct. 7, 2021, 12:13 p.m. UTC | #1
On 25.09.21 22:54, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Currently, using mprotect() to unprotect a memory region or uffd to
> unprotect a memory region causes a TLB flush. At least on x86, as
> protection is promoted, no TLB flush is needed.
> 
> Add an arch-specific pte_may_need_flush() which tells whether a TLB
> flush is needed based on the old PTE and the new one. Implement an x86
> pte_may_need_flush().
> 
> For x86, PTE protection promotion or changes of software bits does
> require a flush, also add logic that considers the dirty-bit. Changes to
> the access-bit do not trigger a TLB flush, although architecturally they
> should, as Linux considers the access-bit as a hint.

Is the added LOC worth the benefit? IOW, do we have some benchmark that 
really benefits from that?
Nadav Amit Oct. 7, 2021, 4:16 p.m. UTC | #2
> On Oct 7, 2021, at 5:13 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 25.09.21 22:54, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> Currently, using mprotect() to unprotect a memory region or uffd to
>> unprotect a memory region causes a TLB flush. At least on x86, as
>> protection is promoted, no TLB flush is needed.
>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>> flush is needed based on the old PTE and the new one. Implement an x86
>> pte_may_need_flush().
>> For x86, PTE protection promotion or changes of software bits does
>> require a flush, also add logic that considers the dirty-bit. Changes to
>> the access-bit do not trigger a TLB flush, although architecturally they
>> should, as Linux considers the access-bit as a hint.
> 
> Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?

So you ask whether the added ~10 LOC (net) are worth the benefit?

Let’s start with the cost of this patch.

If you ask about complexity, I think that it is a rather simple
patch and documented as needed. Please be more concrete if you
think otherwise.

If you ask about the runtime overhead, my experience is that
such code, which mostly does bit operations, has negligible cost.
The execution time of mprotect code, and other similar pieces of
code, is mostly dominated by walking the page-tables & getting
the pages (which might require cold or random memory accesses),
acquiring the locks, and of course the TLB flushes that this
patch tries to eliminate.

As for the benefit: a TLB flush of a single PTE on x86 has an
overhead of ~200 cycles. If a TLB shootdown is needed, for instance
in multithreaded applications, this overhead can grow to a few
microseconds or even more, depending on the number of sockets,
whether the workload runs in a VM (and worse if CPUs are
overcommitted) and so on.

This overhead is completely unnecessary on many occasions, for
instance if you run mprotect() to add permissions or, as I noted in
my case, do something similar using userfaultfd. Note that the
potentially unnecessary TLB flush/shootdown takes place while you
hold the mmap-lock for write in the case of mprotect(), thereby
potentially preventing other threads from making progress during
that time.

On my in-development workload it was a considerable overhead
(I didn’t collect numbers though). Basically, I track dirty
pages using uffd, and every page-fault that can be easily
resolved by unprotecting causes a TLB flush/shootdown.
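
[ For illustration, a minimal sketch of the unprotect path described
above -- not code from this series. It assumes a uffd file descriptor
whose range was already registered with UFFDIO_REGISTER_MODE_WP;
resolve_wp_fault(), fault_addr and page_size are hypothetical names. ]

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Resolve a write-protect fault by removing the write protection. */
static int resolve_wp_fault(int uffd, unsigned long fault_addr,
			    unsigned long page_size)
{
	struct uffdio_writeprotect wp = {
		.range = {
			.start = fault_addr & ~(page_size - 1),
			.len   = page_size,
		},
		.mode = 0,	/* clear WP: a pure permission promotion */
	};

	/* Without this series, each such ioctl implies a TLB flush/shootdown. */
	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}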

If you want, I will write a microbenchmark and give you numbers.
If you are looking for further optimizations (although you did not
indicate so), such as doing the TLB batching from do_mprotect_key()
(i.e., batching across VMAs), we can discuss them and apply them on
top of these patches.
David Hildenbrand Oct. 7, 2021, 5:07 p.m. UTC | #3
On 07.10.21 18:16, Nadav Amit wrote:
> 
>> On Oct 7, 2021, at 5:13 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 25.09.21 22:54, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>> Currently, using mprotect() to unprotect a memory region or uffd to
>>> unprotect a memory region causes a TLB flush. At least on x86, as
>>> protection is promoted, no TLB flush is needed.
>>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>>> flush is needed based on the old PTE and the new one. Implement an x86
>>> pte_may_need_flush().
>>> For x86, PTE protection promotion or changes of software bits does
>>> require a flush, also add logic that considers the dirty-bit. Changes to
>>> the access-bit do not trigger a TLB flush, although architecturally they
>>> should, as Linux considers the access-bit as a hint.
>>
>> Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?
> 
> So you ask whether the added ~10 LOC (net) worth the benefit?

I read "3 files changed, 46 insertions(+), 1 deletion(-)" as
optimizing something without proof, so I naturally have to ask. So
this is just a "usually we optimize and show numbers to prove it"
comment.

> 
> Let’s start with the cost of this patch.
> 
> If you ask about complexity, I think that it is a rather simple
> patch and documented as needed. Please be more concrete if you
> think otherwise.

It is most certainly added complexity, although documented cleanly.

> 
> If you ask about the runtime overhead, my experience is that
> such code, which mostly does bit operations, has negligible cost.
> The execution time of mprotect code, and other similar pieces of
> code, is mostly dominated by walking the page-tables & getting
> the pages (which might require cold or random memory accesses),
> acquiring the locks, and of course the TLB flushes that this
> patch tries to eliminate.

I'm absolutely not concerned about runtime overhead :)

> 
> As for the benefit: TLB flush on x86 of a single PTE has an
> overhead of ~200 cycles. If a TLB shootdown is needed, for instance
> on multithreaded applications, this overhead can grow to few
> microseconds or even more, depending on the number of sockets,
> whether the workload runs in a VM (and worse if CPUs are
> overcommitted) and so on.
> 
> This overhead is completely unnecessary on many occasions. If
> you run mprotect() to add permissions, or as I noted in my case,
> to do something similar using userfaultfd. Note that the
> potentially unnecessary TLB flush/shootdown takes place while
> you hold the mmap-lock for write in the case of mprotect(),
> thereby potentially preventing other threads from making
> progress during that time.
> 
> On my in-development workload it was a considerable overhead
> (I didn’t collect numbers though). Basically, I track dirty
> pages using uffd, and every page-fault that can be easily
> resolved by unprotecting cause a TLB flush/shootdown.

Any numbers would be helpful.

> 
> If you want, I will write a microbenchmarks and give you numbers.
> If you look for further optimizations (although you did not indicate
> so), such as doing the TLB batching from do_mprotect_key(),
> (i.e. batching across VMAs), we can discuss it and apply it on
> top of these patches.

I think this patch itself is sufficient if we can show a benefit. I do
wonder whether existing benchmarks could already show a benefit; I feel
like they should if this makes a difference. Excessive mprotect() usage
(protect<>unprotect) isn't something unusual.
Nadav Amit Oct. 8, 2021, 6:06 a.m. UTC | #4
> On Oct 7, 2021, at 10:07 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 07.10.21 18:16, Nadav Amit wrote:
>>> On Oct 7, 2021, at 5:13 AM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 25.09.21 22:54, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> Currently, using mprotect() to unprotect a memory region or uffd to
>>>> unprotect a memory region causes a TLB flush. At least on x86, as
>>>> protection is promoted, no TLB flush is needed.
>>>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>>>> flush is needed based on the old PTE and the new one. Implement an x86
>>>> pte_may_need_flush().
>>>> For x86, PTE protection promotion or changes of software bits does
>>>> require a flush, also add logic that considers the dirty-bit. Changes to
>>>> the access-bit do not trigger a TLB flush, although architecturally they
>>>> should, as Linux considers the access-bit as a hint.
>>> 
>>> Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?
>> So you ask whether the added ~10 LOC (net) worth the benefit?
> 
> I read  "3 files changed, 46 insertions(+), 1 deletion(-)" to optimize something without proof, so I naturally have to ask. So this is just a "usually we optimize and show numbers to proof" comment.

These numbers are misleading, as they count comments and other non-code.

[snip]

> 
> Any numbers would be helpful.
> 
>> If you want, I will write a microbenchmarks and give you numbers.
>> If you look for further optimizations (although you did not indicate
>> so), such as doing the TLB batching from do_mprotect_key(),
>> (i.e. batching across VMAs), we can discuss it and apply it on
>> top of these patches.
> 
> I think this patch itself is sufficient if we can show a benefit; I do wonder if existing benchmarks could already show a benefit, I feel like they should if this makes a difference. Excessive mprotect() usage (protect<>unprotect) isn't something unusual.

I do not know of a concrete benchmark (other than my workload, which I cannot share right now) that does excessive mprotect() in a way that would actually be measurable in overall performance. I would argue that many optimizations in the kernel are such that they would not have a measurable benefit by themselves on common macrobenchmarks.

Anyhow, per your request I created a small micro-benchmark that runs mprotect(PROT_READ) and mprotect(PROT_READ|PROT_WRITE) in a loop and measured the time it took to do the latter (where a writeprotect is not needed). I ran the benchmark on a VM (guest) on top of KVM.

The cost (cycles) per mprotect(PROT_READ|PROT_WRITE) operation:

		1 thread		2 threads
		--------		---------
w/patch:	2496			2505			
w/o patch:	5342			10458

[ The results for 1 thread might seem strange, as one can expect the overhead in this case to be no more than ~250 cycles, which is the time a TLB invalidation of a single PTE takes. Yet, this overhead is probably related to “page fracturing”, which happens when the VM memory is backed by 4KB pages. In such scenarios, a single PTE invalidation in the VM can cause a full TLB flush on Intel. The full flush is needed to ensure that if the invalidated address was mapped through a huge page in the VM, any relevant 4KB mapping that is cached in the TLB (after fracturing due to the 4KB GPA->HPA mapping) would be removed. ]

Let me know if you want me to share the micro-benchmark with you. I am not going to mention the results in the commit log, because I think the overhead of unnecessary TLB invalidation is well established.
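
[ For reference, a minimal sketch of the kind of loop described
above -- not the benchmark that produced the numbers; the single-page
region, iteration count and the plain rdtsc-based timing are arbitrary
choices, so absolute results will differ. ]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define LEN	4096UL		/* one 4KB page */
#define ITERS	100000UL

int main(void)
{
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long long cycles = 0;

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;	/* fault in a writable PTE */

	for (unsigned long i = 0; i < ITERS; i++) {
		/* write-protect: a flush is needed here either way */
		mprotect(p, LEN, PROT_READ);

		/* promotion: the flush this patch avoids */
		unsigned long long t0 = __rdtsc();
		mprotect(p, LEN, PROT_READ | PROT_WRITE);
		cycles += __rdtsc() - t0;
	}

	printf("avg cycles per promoting mprotect(): %llu\n", cycles / ITERS);
	return 0;
}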
David Hildenbrand Oct. 8, 2021, 7:35 a.m. UTC | #5
>>
>> Any numbers would be helpful.
>>
>>> If you want, I will write a microbenchmarks and give you numbers.
>>> If you look for further optimizations (although you did not indicate
>>> so), such as doing the TLB batching from do_mprotect_key(),
>>> (i.e. batching across VMAs), we can discuss it and apply it on
>>> top of these patches.
>>
>> I think this patch itself is sufficient if we can show a benefit; I do wonder if existing benchmarks could already show a benefit, I feel like they should if this makes a difference. Excessive mprotect() usage (protect<>unprotect) isn't something unusual.
> 
> I do not know about a concrete benchmark (other than my workload, which I cannot share right now) that does excessive mprotect() in a way that would actually be measurable on the overall performance. I would argue that many many optimizations in the kernel are such that would not have so measurable benefit by themselves on common macrobenchmarks.
> 
> Anyhow, per your request I created a small micro-benchmark that runs mprotect(PROT_READ) and mprotect(PROT_READ|PROT_WRITE) in a loop and measured the time it took to do the latter (where a writeprotect is not needed). I ran the benchmark on a VM (guest) on top of KVM.
> 
> The cost (cycles) per mprotect(PROT_READ|PROT_WRITE) operation:
> 
> 		1 thread		2 threads
> 		--------		---------
> w/patch:	2496			2505			
> w/o patch:	5342			10458
> 

For my taste, the above numbers are sufficient, thanks!

> [ The results for 1 thread might seem strange as one can expect the overhead in this case to be no more than ~250 cycles, which is the time a TLB invalidation of a single PTE takes. Yet, this overhead are probably related to “page fracturing”, which happens when the VM memory is backed by 4KB pages. In such scenarios, a single PTE invalidation in the VM can cause on Intel a full TLB flush. The full flush is needed to ensure that if the invalidated address was mapped through huge page in the VM, any relevant 4KB mapping that is cached in the TLB (after fracturing due to the 4KB GPA->HPA mapping) would be removed.]

Very nice analysis :)

> 
> Let me know if you want me to share the micro-benchmark with you. I am not going to mention the results in the commit log, because I think the overhead of unnecessary TLB invalidation is well established.

Just let me clarify why I am asking at all. It could be that:

a) The optimization is effective and applicable to many workloads
b) The optimization is effective and applicable to some workloads 
("micro benchmark")
c) The optimization is ineffective
d) The optimization is wrong

IMHO: We can rule out d) by review and tests. We can rule out c) by 
simple benchmarks easily.

Maybe extend the patch description by something like:

"The benefit of this optimization can already be visible when doing 
mprotect(PROT_READ) -> mprotect(PROT_READ|PROT_WRITE) on a single 
thread, because we end up requiring basically no TLB flushes. The 
optimization gets even more significant with more threads. See [1] for 
simple micro benchmark results."

Something like that would be good enough for my taste.

Patch

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index b587a9ee9cb2..e74d13d174d1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,46 @@  static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
+/*
+ * pte_may_need_flush() checks whether a TLB flush is necessary according to x86
+ * architectural definition. Mere promotion of permissions does not require a
+ * TLB flush. Changes of software bits do not require a TLB flush. Demotion
+ * of the access-bit also does not trigger a TLB flush (although it is required
+ * architecturally) - see the comment in ptep_clear_flush_young().
+ *
+ * Further optimizations may be possible, such as avoiding a flush when clearing
+ * the write-bit on non-dirty entries. As those do not explicitly follow the
+ * specifications, they are not implemented (at least for now).
+ */
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+	const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+				     _PAGE_SOFTW3 | _PAGE_ACCESSED;
+	const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
+	pteval_t oldval = pte_val(oldpte);
+	pteval_t newval = pte_val(newpte);
+	pteval_t diff = oldval ^ newval;
+	pteval_t disable_mask = 0;
+
+	if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
+		disable_mask = _PAGE_NX;
+
+	/* new is non-present: need only if old is present */
+	if (pte_none(newpte))
+		return !pte_none(oldpte);
+
+	/*
+	 * Any change of PFN and any flag other than those that we consider
+	 * requires a flush (e.g., PAT, protection keys). To save flushes we do
+	 * not consider the access bit as it is considered by the kernel as
+	 * best-effort.
+	 */
+	return diff & ((oldval & enable_mask) |
+		       (newval & disable_mask) |
+		       ~(enable_mask | disable_mask | ignore_mask));
+}
+#define pte_may_need_flush pte_may_need_flush
+
 #endif /* !MODULE */
 
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2c68a545ffa7..5ca49f44bc38 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,10 @@  static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 	} while (0)
 #endif
 
+#ifndef pte_may_need_flush
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte) { return true; }
+#endif
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 075ff94aa51c..ae79df59a7a8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,8 @@  static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-			tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+			if (pte_may_need_flush(oldpte, ptent))
+				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);