Message ID: 20210607055131.156184-1-aneesh.kumar@linux.ibm.com (mailing list archive)
Series: Speedup mremap on ppc64
On Monday, 7 June 2021, Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote:
>
> This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
> the platform to support updating higher-level page tables without
> updating page table entries. This also needs to invalidate the Page Walk
> Cache on architectures supporting the same.
>
> Changes from v6:
> * Update ppc64 flush_tlb_range to invalidate page walk cache.

I'd really rather not do this; I'm not sure a micro-benchmark captures
everything.

Page tables coming from L2/L3 probably aren't the primary purpose or
biggest benefit of intermediate-level caches.

The situation on POWER with the nest MMU (coherent accelerators) is
magnified. They have huge page walk caches to make up for the fact that
they don't have data caches for walking page tables, which makes the
invalidation more painful both in terms of subsequent misses and in
latency to invalidate (it can be on the order of microseconds, whereas a
page invalidate is a couple of orders of magnitude faster).

Yes, it is a deficiency of the ppc invalidation architecture; we are
aware and would like to improve it, but for now this is what we have.

Thanks,
Nick

> * Add patches to fix race between mremap and page out
> * Add patch to fix build error with page table levels 2
>
> Changes from v5:
> * Drop patch mm/mremap: Move TLB flush outside page table lock
> * Add fixes for race between optimized mremap and page out
>
> Changes from v4:
> * Change function name and arguments based on review feedback.
>
> Changes from v3:
> * Fix build error reported by kernel test robot
> * Address review feedback.
>
> Changes from v2:
> * Switch from using mmu_gather to flush_pte_tlb_pwc_range()
>
> Changes from v1:
> * Rebase to recent upstream
> * Fix build issues with tlb_gather_mmu changes
>
> Aneesh Kumar K.V (11):
>   mm/mremap: Fix race between MOVE_PMD mremap and pageout
>   mm/mremap: Fix race between MOVE_PUD mremap and pageout
>   selftest/mremap_test: Update the test to handle pagesize other than 4K
>   selftest/mremap_test: Avoid crash with static build
>   mm/mremap: Convert huge PUD move to separate helper
>   mm/mremap: Don't enable optimized PUD move if page table levels is 2
>   mm/mremap: Use pmd/pud_poplulate to update page table entries
>   powerpc/mm/book3s64: Fix possible build error
>   mm/mremap: Allow arch runtime override
>   powerpc/book3s64/mm: Update flush_tlb_range to flush page walk cache
>   powerpc/mm: Enable HAVE_MOVE_PMD support
>
>  .../include/asm/book3s/64/tlbflush-radix.h   |   2 +
>  arch/powerpc/include/asm/tlb.h               |   6 +
>  arch/powerpc/mm/book3s64/radix_hugetlbpage.c |   8 +-
>  arch/powerpc/mm/book3s64/radix_tlb.c         |  70 +++++++----
>  arch/powerpc/platforms/Kconfig.cputype       |   2 +
>  include/linux/rmap.h                         |  13 +-
>  mm/mremap.c                                  | 104 +++++++++++++--
>  mm/page_vma_mapped.c                         |  43 ++++---
>  tools/testing/selftests/vm/mremap_test.c     | 118 ++++++++++--------
>  9 files changed, 251 insertions(+), 115 deletions(-)
>
> --
> 2.31.1
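For readers following along: the mechanism the cover letter describes can
be sketched roughly as below. This is loosely based on move_normal_pmd()
in mm/mremap.c, with locking, VMA validation, and error handling stripped
out, so treat it as an illustration of the idea rather than the actual
patch. Instead of copying up to 512 individual PTEs, the PTE page itself
is detached from the source PMD slot and re-linked under the destination:

    /*
     * Illustrative sketch only, loosely based on move_normal_pmd() in
     * mm/mremap.c. Locking, VMA checks, and error paths are omitted.
     */
    static bool move_pmd_sketch(struct vm_area_struct *vma,
                                unsigned long old_addr,
                                unsigned long new_addr,
                                pmd_t *old_pmd, pmd_t *new_pmd)
    {
            pmd_t pmd;

            if (!pmd_none(*new_pmd))        /* destination slot must be empty */
                    return false;

            pmd = *old_pmd;
            pmd_clear(old_pmd);             /* detach PTE page from old slot */

            /*
             * Stale translations must go before the move is visible.
             * On radix, this is the flush the series extends to also
             * invalidate the page walk cache, since the PTE page has
             * changed its position in the tree.
             */
            flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);

            /* Re-attach the same PTE page at the destination. */
            pmd_populate(vma->vm_mm, new_pmd, pmd_pgtable(pmd));
            return true;
    }

The flush_tlb_range() call is the cost being debated in the rest of this
thread.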
On 6/7/21 3:40 PM, Nick Piggin wrote:
> On Monday, 7 June 2021, Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote:
>
>> This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
>> the platform to support updating higher-level page tables without
>> updating page table entries. This also needs to invalidate the Page Walk
>> Cache on architectures supporting the same.
>>
>> Changes from v6:
>> * Update ppc64 flush_tlb_range to invalidate page walk cache.
>
> I'd really rather not do this; I'm not sure a micro-benchmark captures
> everything.
>
> Page tables coming from L2/L3 probably aren't the primary purpose or
> biggest benefit of intermediate-level caches.
>
> The situation on POWER with the nest MMU (coherent accelerators) is
> magnified. They have huge page walk caches to make up for the fact that
> they don't have data caches for walking page tables, which makes the
> invalidation more painful both in terms of subsequent misses and in
> latency to invalidate (it can be on the order of microseconds, whereas a
> page invalidate is a couple of orders of magnitude faster).

If we are using the nest MMU, we already upgrade that flush to
invalidate the page walk cache, right? That is, if we have a range
larger than PMD_SIZE, we upgrade the invalidate to a PID flush via

	flush_pid = nr_pages > tlb_single_page_flush_ceiling;

and if it is a PID flush and we are using the nest MMU, we already
upgrade RIC_FLUSH_TLB to RIC_FLUSH_ALL?

> Yes, it is a deficiency of the ppc invalidation architecture; we are
> aware and would like to improve it, but for now this is what we have.

-aneesh
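A condensed sketch of the upgrade logic Aneesh is pointing at, modelled
on radix__flush_tlb_range() in arch/powerpc/mm/book3s64/radix_tlb.c of
that era. The helper names are the real ones, but tlbie vs. tlbiel
selection, local flushes, THP page sizes, and preemption handling are
all elided, so read it as a summary rather than the exact code:

    /* Sketch of the flush-vs-upgrade decision in radix__flush_tlb_range(). */
    static void radix_flush_range_sketch(struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end)
    {
            unsigned long pid = mm->context.id;
            unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
            bool flush_pid = nr_pages > tlb_single_page_flush_ceiling;

            if (!flush_pid) {
                    /* Range flush: TLB entries only, PWC is preserved. */
                    _tlbie_va_range(start, end, pid, PAGE_SIZE,
                                    mmu_virtual_psize, false);
                    return;
            }

            /*
             * PID-wide flush. With coherent accelerators (nest MMU)
             * attached, RIC_FLUSH_TLB is escalated to RIC_FLUSH_ALL,
             * which also wipes the page walk cache -- the upgrade
             * Aneesh refers to above.
             */
            if (atomic_read(&mm->context.copros) > 0)
                    _tlbie_pid(pid, RIC_FLUSH_ALL);
            else
                    _tlbie_pid(pid, RIC_FLUSH_TLB);
    }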
Excerpts from Aneesh Kumar K.V's message of June 8, 2021 2:39 pm:
> On 6/7/21 3:40 PM, Nick Piggin wrote:
>> On Monday, 7 June 2021, Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote:
>>
>>> This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
>>> the platform to support updating higher-level page tables without
>>> updating page table entries. This also needs to invalidate the Page Walk
>>> Cache on architectures supporting the same.
>>>
>>> Changes from v6:
>>> * Update ppc64 flush_tlb_range to invalidate page walk cache.
>>
>> I'd really rather not do this; I'm not sure a micro-benchmark captures
>> everything.
>>
>> Page tables coming from L2/L3 probably aren't the primary purpose or
>> biggest benefit of intermediate-level caches.
>>
>> The situation on POWER with the nest MMU (coherent accelerators) is
>> magnified. They have huge page walk caches to make up for the fact that
>> they don't have data caches for walking page tables, which makes the
>> invalidation more painful both in terms of subsequent misses and in
>> latency to invalidate (it can be on the order of microseconds, whereas a
>> page invalidate is a couple of orders of magnitude faster).
>
> If we are using the nest MMU, we already upgrade that flush to
> invalidate the page walk cache, right? That is, if we have a range
> larger than PMD_SIZE, we upgrade the invalidate to a PID flush via
>
> 	flush_pid = nr_pages > tlb_single_page_flush_ceiling;

Not that we've tuned that parameter for a long time, and certainly not
with the nMMU. Quite possibly it should be higher for the nMMU because
of the big TLBs they have. (And what about == PMD_SIZE?)

> and if it is a PID flush and we are using the nest MMU, we already
> upgrade RIC_FLUSH_TLB to RIC_FLUSH_ALL?

Does P10 still have that bug?

At any rate, the core MMU I think still has the same issues, just less
pronounced. PWC invalidates take longer, and the PWC should have the
most benefit when CPU data caches are heavily used and aren't filled
with page table entries.

Thanks,
Nick
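For reference in the RIC_FLUSH_TLB vs. RIC_FLUSH_ALL exchange above,
these are the radix tlbie invalidation-control encodings, with values as
defined in arch/powerpc/mm/book3s64/radix_tlb.c:

    #define RIC_FLUSH_TLB 0  /* invalidate TLB entries only; PWC survives */
    #define RIC_FLUSH_PWC 1  /* invalidate the page walk cache only */
    #define RIC_FLUSH_ALL 2  /* invalidate both TLB and PWC */

The upgrade under discussion replaces encoding 0 with encoding 2, taking
the page walk cache down along with the TLB.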
On Mon, Jun 7, 2021 at 3:10 AM Nick Piggin <npiggin@gmail.com> wrote:
>
> I'd really rather not do this; I'm not sure a micro-benchmark captures
> everything.

I don't much care what powerpc code does _internally_ for this
architecture-specific mis-design issue, but I really don't want to see
more complex generic interfaces unless you have better hard numbers for
them.

So far the numbers are: "no observable difference".

It would have to be not just observable, but actually meaningful for me
to go "ok, we'll add this crazy flag that nobody else cares about".

And honestly, from everything I've seen on page table walker caches:
they are great, but once you start remapping big ranges and
invalidating megabytes of TLBs, the walker caches just aren't going to
be your issue.

But: numbers talk. I'd take the sane generic interfaces as a first cut.
If somebody then has really compelling numbers, we can _then_ look at
that "optimize for odd page table walker cache situation" case.

And in the meantime, maybe you can talk to the hardware people and tell
them that you want the "flush range" capability to work right, and that
if the walker cache is _so_ important they shouldn't have made it an
all-or-nothing flush.

Linus
Excerpts from Linus Torvalds's message of June 9, 2021 3:10 am:
> On Mon, Jun 7, 2021 at 3:10 AM Nick Piggin <npiggin@gmail.com> wrote:
>>
>> I'd really rather not do this; I'm not sure a micro-benchmark captures
>> everything.
>
> I don't much care what powerpc code does _internally_ for this
> architecture-specific mis-design issue, but I really don't want to see
> more complex generic interfaces unless you have better hard numbers
> for them.
>
> So far the numbers are: "no observable difference".
>
> It would have to be not just observable, but actually meaningful for
> me to go "ok, we'll add this crazy flag that nobody else cares about".

Fair enough, we will have to try to get more numbers then, I suppose.

> And honestly, from everything I've seen on page table walker caches:
> they are great, but once you start remapping big ranges and
> invalidating megabytes of TLBs, the walker caches just aren't going
> to be your issue.

Remapping big ranges is going to have to invalidate intermediate caches
(aka the PWC), and so is unmapping, so we're stuck with the big-hammer
PWC invalidate there anyway. It's mprotect and friends that would care
here, and possibly some THP thing... but I guess those are probably
down the list a little way.

I'm a bit less concerned about the PWCs that might be caching the
regions of the big mprotect() we just did, and more concerned about the
effect of flushing all unrelated caches, including on all other CPUs a
threaded program is running on. HANA and Java are threaded and do
mremaps, unfortunately.

> But: numbers talk. I'd take the sane generic interfaces as a first
> cut. If somebody then has really compelling numbers, we can _then_
> look at that "optimize for odd page table walker cache situation"
> case.

Yep, okay. It's not the end of the world (or if it is, we'd be able to
get numbers, presumably).

> And in the meantime, maybe you can talk to the hardware people and
> tell them that you want the "flush range" capability to work right,
> and that if the walker cache is _so_ important they shouldn't have
> made it an all-or-nothing flush.

I have, more than once :( Fixing that would fix the munmap etc. cases
as well, so yeah.

Thanks,
Nick