Message ID: 20210607055131.156184-1-aneesh.kumar@linux.ibm.com (mailing list archive)
Series: Speedup mremap on ppc64
On Monday, 7 June 2021, Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote:
>
> This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
> the platform to support updating higher-level page tables without
> updating page table entries. This also needs to invalidate the Page Walk
> Cache on architectures supporting the same.
>
> Changes from v6:
> * Update ppc64 flush_tlb_range to invalidate page walk cache.

I'd really rather not do this; I'm not sure a micro-benchmark captures
everything.

Page tables coming from L2/L3 probably aren't the primary purpose or
biggest benefit of intermediate-level caches.

The situation on POWER with the nest MMU (coherent accelerators) is
magnified. They have huge page walk caches to make up for the fact that
they don't have data caches for walking page tables, which makes the
invalidation more painful both in terms of subsequent misses and in
latency to invalidate (it can be on the order of microseconds, whereas a
page invalidate is a couple of orders of magnitude faster).

Yes, it is a deficiency of the ppc invalidation architecture; we are
aware and would like to improve it, but for now this is what we have.

Thanks,
Nick

> * Add patches to fix race between mremap and page out
> * Add patch to fix build error with page table levels 2
>
> Changes from v5:
> * Drop patch mm/mremap: Move TLB flush outside page table lock
> * Add fixes for race between optimized mremap and page out
>
> Changes from v4:
> * Change function name and arguments based on review feedback.
>
> Changes from v3:
> * Fix build error reported by kernel test robot
> * Address review feedback.
>
> Changes from v2:
> * Switch from using mmu_gather to flush_pte_tlb_pwc_range()
>
> Changes from v1:
> * Rebase to recent upstream
> * Fix build issues with tlb_gather_mmu changes
>
> Aneesh Kumar K.V (11):
>   mm/mremap: Fix race between MOVE_PMD mremap and pageout
>   mm/mremap: Fix race between MOVE_PUD mremap and pageout
>   selftest/mremap_test: Update the test to handle pagesize other than 4K
>   selftest/mremap_test: Avoid crash with static build
>   mm/mremap: Convert huge PUD move to separate helper
>   mm/mremap: Don't enable optimized PUD move if page table levels is 2
>   mm/mremap: Use pmd/pud_poplulate to update page table entries
>   powerpc/mm/book3s64: Fix possible build error
>   mm/mremap: Allow arch runtime override
>   powerpc/book3s64/mm: Update flush_tlb_range to flush page walk cache
>   powerpc/mm: Enable HAVE_MOVE_PMD support
>
>  .../include/asm/book3s/64/tlbflush-radix.h   |   2 +
>  arch/powerpc/include/asm/tlb.h               |   6 +
>  arch/powerpc/mm/book3s64/radix_hugetlbpage.c |   8 +-
>  arch/powerpc/mm/book3s64/radix_tlb.c         |  70 +++++++----
>  arch/powerpc/platforms/Kconfig.cputype       |   2 +
>  include/linux/rmap.h                         |  13 +-
>  mm/mremap.c                                  | 104 +++++++++++++--
>  mm/page_vma_mapped.c                         |  43 ++++---
>  tools/testing/selftests/vm/mremap_test.c     | 118 ++++++++++--------
>  9 files changed, 251 insertions(+), 115 deletions(-)
>
> --
> 2.31.1
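For readers following along: the mechanism the cover letter describes can
be sketched roughly as below. This is loosely based on move_normal_pmd()
in mm/mremap.c, with locking, VMA validation, and error handling stripped
out, so treat it as an illustration of the idea rather than the actual
patch. Instead of copying up to 512 individual PTEs, the PTE page itself
is detached from the source PMD slot and re-linked under the destination:

    /*
     * Illustrative sketch only, loosely based on move_normal_pmd() in
     * mm/mremap.c. Locking, VMA checks, and error paths are omitted.
     */
    static bool move_pmd_sketch(struct vm_area_struct *vma,
                                unsigned long old_addr,
                                unsigned long new_addr,
                                pmd_t *old_pmd, pmd_t *new_pmd)
    {
            pmd_t pmd;

            if (!pmd_none(*new_pmd))        /* destination slot must be empty */
                    return false;

            pmd = *old_pmd;
            pmd_clear(old_pmd);             /* detach PTE page from old slot */

            /*
             * Stale translations must go before the move is visible.
             * On radix, this is the flush the series extends to also
             * invalidate the page walk cache, since the PTE page has
             * changed its position in the tree.
             */
            flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);

            /* Re-attach the same PTE page at the destination. */
            pmd_populate(vma->vm_mm, new_pmd, pmd_pgtable(pmd));
            return true;
    }

The flush_tlb_range() call is the cost being debated in the rest of this
thread.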
On 6/7/21 3:40 PM, Nick Piggin wrote:
> On Monday, 7 June 2021, Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote:
>
>> This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
>> the platform to support updating higher-level page tables without
>> updating page table entries. This also needs to invalidate the Page Walk
>> Cache on architectures supporting the same.
>>
>> Changes from v6:
>> * Update ppc64 flush_tlb_range to invalidate page walk cache.
>
> I'd really rather not do this; I'm not sure a micro-benchmark captures
> everything.
>
> Page tables coming from L2/L3 probably aren't the primary purpose or
> biggest benefit of intermediate-level caches.
>
> The situation on POWER with the nest MMU (coherent accelerators) is
> magnified. They have huge page walk caches to make up for the fact that
> they don't have data caches for walking page tables, which makes the
> invalidation more painful both in terms of subsequent misses and in
> latency to invalidate (it can be on the order of microseconds, whereas a
> page invalidate is a couple of orders of magnitude faster).

If we are using the nest MMU, we already upgrade that flush to
invalidate the page walk cache, right? That is, if we have a range
larger than PMD_SIZE, we upgrade the invalidate to a PID flush via

	flush_pid = nr_pages > tlb_single_page_flush_ceiling;

and if it is a PID flush and we are using the nest MMU, we already
upgrade RIC_FLUSH_TLB to RIC_FLUSH_ALL?

> Yes, it is a deficiency of the ppc invalidation architecture; we are
> aware and would like to improve it, but for now this is what we have.

-aneesh
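A condensed sketch of the upgrade logic Aneesh is pointing at, modelled
on radix__flush_tlb_range() in arch/powerpc/mm/book3s64/radix_tlb.c of
that era. The helper names are the real ones, but tlbie vs. tlbiel
selection, local flushes, THP page sizes, and preemption handling are
all elided, so read it as a summary rather than the exact code:

    /* Sketch of the flush-vs-upgrade decision in radix__flush_tlb_range(). */
    static void radix_flush_range_sketch(struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end)
    {
            unsigned long pid = mm->context.id;
            unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
            bool flush_pid = nr_pages > tlb_single_page_flush_ceiling;

            if (!flush_pid) {
                    /* Range flush: TLB entries only, PWC is preserved. */
                    _tlbie_va_range(start, end, pid, PAGE_SIZE,
                                    mmu_virtual_psize, false);
                    return;
            }

            /*
             * PID-wide flush. With coherent accelerators (nest MMU)
             * attached, RIC_FLUSH_TLB is escalated to RIC_FLUSH_ALL,
             * which also wipes the page walk cache -- the upgrade
             * Aneesh refers to above.
             */
            if (atomic_read(&mm->context.copros) > 0)
                    _tlbie_pid(pid, RIC_FLUSH_ALL);
            else
                    _tlbie_pid(pid, RIC_FLUSH_TLB);
    }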
Excerpts from Aneesh Kumar K.V's message of June 8, 2021 2:39 pm:
> On 6/7/21 3:40 PM, Nick Piggin wrote:
>> On Monday, 7 June 2021, Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote:
>>
>>> This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
>>> the platform to support updating higher-level page tables without
>>> updating page table entries. This also needs to invalidate the Page Walk
>>> Cache on architectures supporting the same.
>>>
>>> Changes from v6:
>>> * Update ppc64 flush_tlb_range to invalidate page walk cache.
>>
>> I'd really rather not do this; I'm not sure a micro-benchmark captures
>> everything.
>>
>> Page tables coming from L2/L3 probably aren't the primary purpose or
>> biggest benefit of intermediate-level caches.
>>
>> The situation on POWER with the nest MMU (coherent accelerators) is
>> magnified. They have huge page walk caches to make up for the fact that
>> they don't have data caches for walking page tables, which makes the
>> invalidation more painful both in terms of subsequent misses and in
>> latency to invalidate (it can be on the order of microseconds, whereas a
>> page invalidate is a couple of orders of magnitude faster).
>
> If we are using the nest MMU, we already upgrade that flush to
> invalidate the page walk cache, right? That is, if we have a range
> larger than PMD_SIZE, we upgrade the invalidate to a PID flush via
>
> 	flush_pid = nr_pages > tlb_single_page_flush_ceiling;

Not that we've tuned that parameter for a long time, and certainly not
with the nMMU. Quite possibly it should be higher for the nMMU because
of the big TLBs they have. (And what about == PMD_SIZE?)

> and if it is a PID flush and we are using the nest MMU, we already
> upgrade RIC_FLUSH_TLB to RIC_FLUSH_ALL?

Does P10 still have that bug?

At any rate, the core MMU I think still has the same issues, just less
pronounced. PWC invalidates take longer, and the PWC should have the
most benefit when CPU data caches are heavily used and aren't filled
with page table entries.

Thanks,
Nick
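For reference in the RIC_FLUSH_TLB vs. RIC_FLUSH_ALL exchange above,
these are the radix tlbie invalidation-control encodings, with values as
defined in arch/powerpc/mm/book3s64/radix_tlb.c:

    #define RIC_FLUSH_TLB 0  /* invalidate TLB entries only; PWC survives */
    #define RIC_FLUSH_PWC 1  /* invalidate the page walk cache only */
    #define RIC_FLUSH_ALL 2  /* invalidate both TLB and PWC */

The upgrade under discussion replaces encoding 0 with encoding 2, taking
the page walk cache down along with the TLB.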
On Mon, Jun 7, 2021 at 3:10 AM Nick Piggin <npiggin@gmail.com> wrote:
>
> I'd really rather not do this; I'm not sure a micro-benchmark captures
> everything.

I don't much care what powerpc code does _internally_ for this
architecture-specific mis-design issue, but I really don't want to see
more complex generic interfaces unless you have better hard numbers for
them.

So far the numbers are: "no observable difference".

It would have to be not just observable, but actually meaningful for me
to go "ok, we'll add this crazy flag that nobody else cares about".

And honestly, from everything I've seen on page table walker caches:
they are great, but once you start remapping big ranges and
invalidating megabytes of TLBs, the walker caches just aren't going to
be your issue.

But: numbers talk. I'd take the sane generic interfaces as a first cut.
If somebody then has really compelling numbers, we can _then_ look at
that "optimize for odd page table walker cache situation" case.

And in the meantime, maybe you can talk to the hardware people and tell
them that you want the "flush range" capability to work right, and that
if the walker cache is _so_ important they shouldn't have made it an
all-or-nothing flush.

Linus
Excerpts from Linus Torvalds's message of June 9, 2021 3:10 am:
> On Mon, Jun 7, 2021 at 3:10 AM Nick Piggin <npiggin@gmail.com> wrote:
>>
>> I'd really rather not do this; I'm not sure a micro-benchmark captures
>> everything.
>
> I don't much care what powerpc code does _internally_ for this
> architecture-specific mis-design issue, but I really don't want to see
> more complex generic interfaces unless you have better hard numbers
> for them.
>
> So far the numbers are: "no observable difference".
>
> It would have to be not just observable, but actually meaningful for
> me to go "ok, we'll add this crazy flag that nobody else cares about".

Fair enough, we will have to try to get more numbers then, I suppose.

> And honestly, from everything I've seen on page table walker caches:
> they are great, but once you start remapping big ranges and
> invalidating megabytes of TLBs, the walker caches just aren't going
> to be your issue.

Remapping big ranges is going to have to invalidate intermediate caches
(aka the PWC), and so is unmapping, so we're stuck with the big-hammer
PWC invalidate there anyway. It's mprotect and friends that would care
here, and possibly some THP thing... but I guess those are probably
down the list a little way.

I'm a bit less concerned about the PWCs that might be caching the
regions of the big mprotect() we just did, and more concerned about the
effect of flushing all unrelated caches, including on all other CPUs a
threaded program is running on. HANA and Java are threaded and do
mremaps, unfortunately.

> But: numbers talk. I'd take the sane generic interfaces as a first
> cut. If somebody then has really compelling numbers, we can _then_
> look at that "optimize for odd page table walker cache situation"
> case.

Yep, okay. It's not the end of the world (or if it is, we'd be able to
get numbers, presumably).

> And in the meantime, maybe you can talk to the hardware people and
> tell them that you want the "flush range" capability to work right,
> and that if the walker cache is _so_ important they shouldn't have
> made it an all-or-nothing flush.

I have, more than once :( Fixing that would fix the munmap etc. cases
as well, so yeah.

Thanks,
Nick