[082/262] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN


Message ID

20211105203857.gaAdLZ-Vh%akpm@linux-foundation.org (mailing list archive)

State

New

Headers

DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org D9BD860174
Date: Fri, 05 Nov 2021 13:38:57 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, anton@ozlabs.org,
 benh@kernel.crashing.org, linux-mm@kvack.org, luto@kernel.org,
 mm-commits@vger.kernel.org, npiggin@gmail.com, paulus@ozlabs.org,
 rdunlap@infradead.org, torvalds@linux-foundation.org
Subject: [patch 082/262] powerpc/64s: enable
 MMU_LAZY_TLB_SHOOTDOWN
Message-ID: <20211105203857.gaAdLZ-Vh%akpm@linux-foundation.org>
In-Reply-To: <20211105133408.cccbb98b71a77d5e8430aba1@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

[001/262] scripts/spelling.txt: add more spellings to spelling.txt

Commit Message

Andrew Morton Nov. 5, 2021, 8:38 p.m. UTC

From: Nicholas Piggin <npiggin@gmail.com>
Subject: powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.

No real data for real world workloads unfortunately.  I think it's
always been a "known" cacheline, it just showed up badly on
will-it-scale tests recently when Anton was doing a sweep of low
hanging scalability issues on big systems.

We have some very big systems running certain in-memory databases that
get into very high contention on mutexes.  That pushes context switch
rates right up while idle times stay pretty high, which would produce a
lot of parallel context switching between user and idle threads, so we
might be getting a bit of this contention there.

It's not something at the top of profiles though.  And on
multi-threaded workloads like this, the normal refcounting of the user
mm still has fundamental contention.  It's tricky to get the change
tested on these workloads (machine time is very limited and I can't
drive the software).

I suspect it could also show in things that do high net or disk IO
rates (enough to need a lot of cores), and do some user processing
steps along the way.  You'd potentially get a lot of idle switching.

This infrastructure could be beneficial to other architectures.  The
cacheline is going to bounce in the same situations on other archs, so
I would say yes.  Rik had some patches to try to avoid it for x86 some
years ago; I don't know what happened to those.

The way powerpc has to maintain mm_cpumask for its TLB flushing makes
it relatively easy to do this shootdown, and we decided the additional
IPIs were less of a concern than the bouncing.  Others have different
concerns, but I tried to make it generic and add comments explaining
what other archs can do, or possibly different ways it might be
achieved.

Link: https://lkml.kernel.org/r/20210605014216.446867-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/Kconfig |    1 +
 1 file changed, 1 insertion(+)

--- a/arch/powerpc/Kconfig~powerpc-64s-enable-mmu_lazy_tlb_shootdown
+++ a/arch/powerpc/Kconfig
@@ -249,6 +249,7 @@  config PPC
 	select IRQ_FORCED_THREADING
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
+	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_SG_DMA_LENGTH
