[036/212] mm, slab: make flush_slab() possible to call with irqs enabled

Message ID	20210902215152.ibWfL_bvd%akpm@linux-foundation.org (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=eNaS=NY=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 217E5610A0 Date: Thu, 02 Sep 2021 14:51:52 -0700 From: Andrew Morton <akpm@linux-foundation.org> To: akpm@linux-foundation.org, bigeasy@linutronix.de, brouer@redhat.com, cl@linux.com, efault@gmx.de, iamjoonsoo.kim@lge.com, jannh@google.com, linux-mm@kvack.org, mgorman@techsingularity.net, mm-commits@vger.kernel.org, penberg@kernel.org, rientjes@google.com, tglx@linutronix.de, torvalds@linux-foundation.org, vbabka@suse.cz Subject: [patch 036/212] mm, slab: make flush_slab() possible to call with irqs enabled Message-ID: <20210902215152.ibWfL_bvd%akpm@linux-foundation.org> In-Reply-To: <20210902144820.78957dff93d7bea620d55a89@linux-foundation.org> User-Agent: s-nail v14.8.16 Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[001/212] ia64: fix typo in a comment \| expand [001/212] ia64: fix typo in a comment [002/212] ia64: fix #endif comment for reserve_elfcorehdr() [003/212] ia64: make reserve_elfcorehdr() static [004/212] ia64: make num_rsvd_regions static [005/212] ocfs2: remove an unnecessary condition [006/212] ocfs2: quota_local: fix possible uninitialized-variable access in ocfs2_local_read_info() [007/212] ocfs2: ocfs2_downconvert_lock failure results in deadlock [008/212] arch/csky/kernel/probes/kprobes.c: fix bugon.cocci warnings [009/212] mm, slub: don't call flush_all() from slab_debug_trace_open() [010/212] mm, slub: allocate private object map for debugfs listings [011/212] mm, slub: allocate private object map for validate_slab_cache() [012/212] mm, slub: don't disable irq for debug_check_no_locks_freed() [013/212] mm, slub: remove redundant unfreeze_partials() from put_cpu_partial() [014/212] mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab() [015/212] mm, slub: extract get_partial() from new_slab_objects() [016/212] mm, slub: dissolve new_slab_objects() into ___slab_alloc() [017/212] mm, slub: return slab page from get_partial() and set c->page afterwards [018/212] mm, slub: restructure new page checks in ___slab_alloc() [019/212] mm, slub: simplify kmem_cache_cpu and tid setup [020/212] mm, slub: move disabling/enabling irqs to ___slab_alloc() [021/212] mm, slub: do initial checks in ___slab_alloc() with irqs enabled [022/212] mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc() [023/212] mm, slub: restore irqs around calling new_slab() [024/212] mm, slub: validate slab from partial list or page allocator before making it cpu slab [025/212] mm, slub: check new pages with restored irqs [026/212] mm, slub: stop disabling irqs around get_partial() [027/212] mm, slub: move reset of c->page and freelist out of deactivate_slab() [028/212] mm, slub: make locking in deactivate_slab() irq-safe [029/212] mm, slub: call deactivate_slab() without disabling irqs [030/212] mm, slub: move irq control into unfreeze_partials() [031/212] mm, slub: discard slabs in unfreeze_partials() without irqs disabled [032/212] mm, slub: detach whole partial list at once in unfreeze_partials() [033/212] mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing [034/212] mm, slub: only disable irq with spin_lock in __unfreeze_partials() [035/212] mm, slub: don't disable irqs in slub_cpu_dead() [036/212] mm, slab: make flush_slab() possible to call with irqs enabled [037/212] mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context [038/212] mm: slub: make object_map_lock a raw_spinlock_t [039/212] mm, slub: optionally save/restore irqs in slab_[un]lock()/ [040/212] mm, slub: make slab_lock() disable irqs with PREEMPT_RT [041/212] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg [042/212] mm, slub: use migrate_disable() on PREEMPT_RT [043/212] mm, slub: convert kmem_cpu_slab protection to local_lock [044/212] mm/debug_vm_pgtable: introduce struct pgtable_debug_args [045/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in basic tests [046/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in leaf and savewrite tests [047/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in protnone and devmap tests [048/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in soft_dirty and swap tests [049/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in migration and thp tests [050/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in PTE modifying tests [051/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in PMD modifying tests [052/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in PUD modifying tests [053/212] mm/debug_vm_pgtable: use struct pgtable_debug_args in PGD and P4D modifying tests [054/212] mm/debug_vm_pgtable: remove unused code [055/212] mm/debug_vm_pgtable: fix corrupted page flag [056/212] mm: report a more useful address for reclaim acquisition [057/212] mm: add kernel_misc_reclaimable in show_free_areas [058/212] writeback: track number of inodes under writeback [059/212] writeback: reliably update bandwidth estimation [060/212] writeback: fix bandwidth estimate for spiky workload [061/212] writeback: rename domain_update_bandwidth() [062/212] writeback: use READ_ONCE for unlocked reads of writeback stats [063/212] mm: remove irqsave/restore locking from contexts with irqs enabled [064/212] fs: drop_caches: fix skipping over shadow cache inodes [065/212] fs: inode: count invalidated shadow pages in pginodesteal [066/212] writeback: memcg: simplify cgroup_writeback_by_id [067/212] include/linux/buffer_head.h: fix boolreturn.cocci warnings [068/212] mm: gup: remove set but unused local variable major [069/212] mm: gup: remove unneed local variable orig_refs [070/212] mm: gup: remove useless BUG_ON in __get_user_pages() [071/212] mm: gup: fix potential pgmap refcnt leak in __gup_device_huge() [072/212] mm: gup: use helper PAGE_ALIGNED in populate_vma_page_range() [073/212] mm/gup: documentation corrections for gup/pup [074/212] mm/gup: small refactoring: simplify try_grab_page() [075/212] mm/gup: remove try_get_page(), call try_get_compound_head() directly [076/212] fs, mm: fix race in unlinking swapfile [077/212] mm: delete unused get_kernel_page() [078/212] shmem: use raw_spinlock_t for ->stat_lock [079/212] shmem: remove unneeded variable ret [080/212] shmem: remove unneeded header file [081/212] shmem: remove unneeded function forward declaration [082/212] shmem: include header file to declare swap_info [083/212] huge tmpfs: fix fallocate(vanilla) advance over huge pages [084/212] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE [085/212] huge tmpfs: remove shrinklist addition from shmem_setattr() [086/212] huge tmpfs: revert shmem's use of transhuge_vma_enabled() [087/212] huge tmpfs: move shmem_huge_enabled() upwards [088/212] huge tmpfs: SGP_NOALLOC to stop collapse_file() on race [089/212] huge tmpfs: shmem_is_huge(vma, inode, index) [090/212] huge tmpfs: decide stat.st_blksize by shmem_is_huge() [091/212] shmem: shmem_writepage() split unlikely i915 THP [092/212] mm, memcg: add mem_cgroup_disabled checks in vmpressure and swap-related functions [093/212] mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg config [094/212] mm, memcg: inline swap-related functions to improve disabled memcg config [095/212] memcg: enable accounting for pids in nested pid namespaces [096/212] memcg: switch lruvec stats to rstat [097/212] memcg: infrastructure to flush memcg stats [098/212] memcg: charge fs_context and legacy_fs_context [099/212] memcg: enable accounting for mnt_cache entries [100/212] memcg: enable accounting for pollfd and select bits arrays [101/212] memcg: enable accounting for file lock caches [102/212] memcg: enable accounting for fasync_cache [103/212] memcg: enable accounting for new namesapces and struct nsproxy [104/212] memcg: enable accounting of ipc resources [105/212] memcg: enable accounting for signals [106/212] memcg: enable accounting for posix_timers_cache slab [107/212] memcg: enable accounting for ldt_struct objects [108/212] memcg: cleanup racy sum avoidance code [109/212] memcg: replace in_interrupt() by !in_task() in active_memcg() [110/212] mm: memcontrol: set the correct memcg swappiness restriction [111/212] mm, memcg: remove unused functions [112/212] mm, memcg: save some atomic ops when flush is already true [113/212] memcg: fix up drain_local_stock comment [114/212] memcg: make memcg->event_list_lock irqsafe [115/212] selftests/vm: use kselftest skip code for skipped tests [116/212] selftests: Fix spelling mistake "cann't" -> "cannot" [117/212] lazy tlb: introduce lazy mm refcount helper functions [118/212] lazy tlb: allow lazy tlb mm refcounting to be configurable [119/212] lazy tlb: shoot lazies, a non-refcounting lazy tlb option [120/212] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN [121/212] mmc: JZ4740: remove the flush_kernel_dcache_page call in jz4740_mmc_read_data [122/212] mmc: mmc_spi: replace flush_kernel_dcache_page with flush_dcache_page [123/212] scatterlist: replace flush_kernel_dcache_page with flush_dcache_page [125/212] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code [126/212] mm: change fault_in_pages_* to have an unsigned size parameter [127/212] mm/pagemap: add mmap_assert_locked() annotations to find_vma() [128/212] remap_file_pages: Use vma_lookup() instead of find_vma() [129/212] mm/mremap: fix memory account on do_munmap() failure [130/212] mm/bootmem_info.c: mark __init on register_page_bootmem_info_section [131/212] mm: sparse: pass section_nr to section_mark_present [132/212] mm: sparse: pass section_nr to find_memory_block [133/212] mm: sparse: remove __section_nr() function [134/212] mm/sparse: set SECTION_NID_SHIFT to 6 [135/212] include/linux/mmzone.h: avoid a warning in sparse memory support [136/212] mm/sparse: clarify pgdat_to_phys [137/212] mm/vmalloc: use batched page requests in bulk-allocator [138/212] mm/vmalloc: remove gfpflags_allow_blocking() check [139/212] lib/test_vmalloc.c: add a new 'nr_pages' parameter [140/212] mm/vmalloc: fix wrong behavior in vread [141/212] mm/kasan: move kasan.fault to mm/kasan/report.c [142/212] kasan: test: rework kmalloc_oob_right [143/212] kasan: test: avoid writing invalid memory [144/212] kasan: test: avoid corrupting memory via memset [145/212] kasan: test: disable kmalloc_memmove_invalid_size for HW_TAGS [146/212] kasan: test: only do kmalloc_uaf_memset for generic mode [147/212] kasan: test: clean up ksize_uaf [148/212] kasan: test: avoid corrupting memory in copy_user_test [149/212] kasan: test: avoid corrupting memory in kasan_rcu_uaf [150/212] mm/page_alloc: always initialize memory map for the holes [151/212] microblaze: simplify pte_alloc_one_kernel() [152/212] mm: introduce memmap_alloc() to unify memory map allocation [153/212] memblock: stop poisoning raw allocations [154/212] mm/page_alloc.c: fix 'zone_id' may be used uninitialized in this function warning [155/212] mm/page_alloc: make alloc_node_mem_map() __init rather than __ref [156/212] mm/page_alloc.c: use in_task() [157/212] mm/page_isolation: tracing: trace all test_pages_isolated failures [158/212] mm/hwpoison: remove unneeded variable unmap_success [159/212] mm/hwpoison: fix potential pte_unmap_unlock pte error [160/212] mm/hwpoison: change argument struct page hpagep to hpage [161/212] mm/hwpoison: fix some obsolete comments [162/212] mm: hwpoison: don't drop slab caches for offlining non-LRU page [163/212] doc: hwpoison: correct the support for hugepage [164/212] mm: hwpoison: dump page for unhandlable page [165/212] mm: fix panic caused by __page_handle_poison() [166/212] hugetlb: simplify prep_compound_gigantic_page ref count racing code [167/212] hugetlb: drop ref count earlier after page allocation [168/212] hugetlb: before freeing hugetlb page set dtor to appropriate value [169/212] hugetlb: fix hugetlb cgroup refcounting during vma split [170/212] userfaultfd: change mmap_changing to atomic [171/212] userfaultfd: prevent concurrent API initialization [172/212] selftests/vm/userfaultfd: wake after copy failure [173/212] mm/numa: automatically generate node migration order [174/212] mm/migrate: update node demotion order on hotplug events [175/212] mm/migrate: enable returning precise migrate_pages() success count [176/212] mm/migrate: demote pages during reclaim [177/212] mm/vmscan: add page demotion counter [178/212] mm/vmscan: add helper for querying ability to age anonymous pages [179/212] mm/vmscan: Consider anonymous pages without swap [180/212] mm/vmscan: never demote for memcg reclaim [181/212] mm/migrate: add sysfs interface to enable reclaim migration [182/212] mm/vmpressure: replace vmpressure_to_css() with vmpressure_to_memcg() [183/212] mm/vmscan: remove the PageDirty check after MADV_FREE pages are page_ref_freezed [184/212] mm/vmscan: remove misleading setting to sc->priority [185/212] mm/vmscan: remove unneeded return value of kswapd_run() [186/212] mm/vmscan: add 'else' to remove check_pending label [187/212] mm, vmscan: guarantee drop_slab_node() termination [188/212] mm: compaction: optimize proactive compaction deferrals [189/212] mm: compaction: support triggering of proactive compaction by user [190/212] mm/mempolicy: use readable NUMA_NO_NODE macro instead of magic number [191/212] mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes [192/212] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy [193/212] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY [194/212] mm/mempolicy: advertise new MPOL_PREFERRED_MANY [195/212] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies [196/212] mm/mempolicy.c: use in_task() in mempolicy_slab_node() [197/212] memblock: make memblock_find_in_range method private [198/212] mm: introduce process_mrelease system call [199/212] mm: wire up syscall process_mrelease [200/212] mm/migrate: correct kernel-doc notation [201/212] selftests: vm: add KSM merge test [202/212] selftests: vm: add KSM unmerge test [203/212] selftests: vm: add KSM zero page merging test [204/212] selftests: vm: add KSM merging across nodes test [205/212] mm: KSM: fix data type [206/212] selftests: vm: add KSM merging time test [207/212] selftests: vm: add COW time test for KSM pages [208/212] mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() [209/212] mm/vmstat: correct some wrong comments [210/212] mm/vmstat: simplify the array size calculation [211/212] mm/vmstat: remove unneeded return value [212/212] mm/madvise: add MADV_WILLNEED to process_madvise()

> > so that when you read that function on its own, it's clear that the > > lock is always held over that critical section - and the issue is that > > perhaps the lock was already taken by the caller. > > Actually that "already taken" becomes "caller does not need it/can't > even take the local lock as it's not local" (it's a cpu hot remove > handler on behalf of another, dead cpu). > > So would it work with something like the following cleanup on top later > after proper testing? (now just compile tested). Scroll downward... > ---8<--- > diff --git a/mm/slub.c b/mm/slub.c > index df1ac8aff86f..0d9e63e918f1 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2566,38 +2566,33 @@ static inline void unfreeze_partials_cpu(struct kmem_cache *s, > > #endif /* CONFIG_SLUB_CPU_PARTIAL */ > > -static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c, > - bool lock) > +static inline struct page * > +__detach_cpu_slab(struct kmem_cache *s, struct kmem_cache_cpu *c, > + void **freelist) > { > - unsigned long flags; > - void *freelist; > struct page *page; > > - if (lock) > - local_lock_irqsave(&s->cpu_slab->lock, flags); > - > - freelist = c->freelist; > page = c->page; > + *freelist = c->freelist; > > c->page = NULL; > c->freelist = NULL; > c->tid = next_tid(c->tid); > > - if (lock) > - local_unlock_irqrestore(&s->cpu_slab->lock, flags); > - > - if (page) > - deactivate_slab(s, page, freelist); > - > - stat(s, CPUSLAB_FLUSH); > + return page; > } > > static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) > { > struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu); > + struct page *page; > + void *freelist; > > - if (c->page) > - flush_slab(s, c, false); > + if (c->page) { > + page = __detach_cpu_slab(s, c, &freelist); > + deactivate_slab(s, page, freelist); > + stat(s, CPUSLAB_FLUSH); > + } > > unfreeze_partials_cpu(s, c); > } > @@ -2618,14 +2613,24 @@ static void flush_cpu_slab(struct work_struct *w) > struct kmem_cache *s; > struct kmem_cache_cpu *c; > struct slub_flush_work *sfw; > + struct page *page; > + void *freelist; > + unsigned long flags; > > sfw = container_of(w, struct slub_flush_work, work); > > s = sfw->s; > c = this_cpu_ptr(s->cpu_slab); > > - if (c->page) > - flush_slab(s, c, true); > + if (c->page) { > + local_lock_irqsave(&s->cpu_slab->lock, flags); > + page = __detach_cpu_slab(s, c, &freelist); > + local_unlock_irqrestore(&s->cpu_slab->lock, flags); > + > + if (page) > + deactivate_slab(s, page, freelist); > + stat(s, CPUSLAB_FLUSH); > + } > > unfreeze_partials(s); > } To my eyeballs, below duplication of a couple lines of initialization needed by the lockless function is less icky than the double return. --- mm/slub.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) --- a/mm/slub.c +++ b/mm/slub.c @@ -2566,15 +2566,13 @@ static inline void unfreeze_partials_cpu #endif /* CONFIG_SLUB_CPU_PARTIAL */ -static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c, - bool lock) +static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) { unsigned long flags; void *freelist; struct page *page; - if (lock) - local_lock_irqsave(&s->cpu_slab->lock, flags); + local_lock_irqsave(&s->cpu_slab->lock, flags); freelist = c->freelist; page = c->page; @@ -2583,8 +2581,7 @@ static inline void flush_slab(struct kme c->freelist = NULL; c->tid = next_tid(c->tid); - if (lock) - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_irqrestore(&s->cpu_slab->lock, flags); if (page) deactivate_slab(s, page, freelist); @@ -2595,11 +2592,19 @@ static inline void flush_slab(struct kme static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu); + struct page *page = c->page; + void *freelist = c->freelist; - if (c->page) - flush_slab(s, c, false); + c->page = NULL; + c->freelist = NULL; + c->tid = next_tid(c->tid); + + if (page) + deactivate_slab(s, page, freelist); unfreeze_partials_cpu(s, c); + + stat(s, CPUSLAB_FLUSH); } struct slub_flush_work { @@ -2625,7 +2630,7 @@ static void flush_cpu_slab(struct work_s c = this_cpu_ptr(s->cpu_slab); if (c->page) - flush_slab(s, c, true); + flush_slab(s, c); unfreeze_partials(s); }

[036/212] mm, slab: make flush_slab() possible to call with irqs enabled

Commit Message

Comments

Patch