From patchwork Thu Jan 21 17:21:53 2021
X-Patchwork-Submitter: Vlastimil Babka
X-Patchwork-Id: 12037231
From: Vlastimil Babka <vbabka@suse.cz>
To: vbabka@suse.cz
Cc: akpm@linux-foundation.org, bigeasy@linutronix.de, cl@linux.com, guro@fb.com,
    hannes@cmpxchg.org, iamjoonsoo.kim@lge.com, jannh@google.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org,
    minchan@kernel.org, penberg@kernel.org, rientjes@google.com,
    shakeelb@google.com, surenb@google.com, tglx@linutronix.de
Subject: [RFC 1/2] mm, vmscan: add priority field to struct shrink_control
Date: Thu, 21 Jan 2021 18:21:53 +0100
Message-Id: <20210121172154.27580-1-vbabka@suse.cz>

Slab reclaim works with reclaim priority, which influences how much to
reclaim, but the priority is not directly passed to individual shrinkers.
The next patch introduces a slab shrinker that uses the priority, so add
it to struct shrink_control and initialize it appropriately. We can then
also remove the priority parameter from do_shrink_slab() and
trace_mm_shrink_slab_start().
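For context (this sketch is not part of the patch and is not kernel code),
the following standalone program mirrors the scan-target computation seen
in the do_shrink_slab() hunk below, now driven by the priority carried in
struct shrink_control; DEFAULT_SEEKS is assumed to be 2, its usual kernel
value:

#include <stdio.h>

#define DEFAULT_SEEKS 2	/* assumed, matches current kernels */

/* delta = (freeable >> priority) * 4 / seeks, as in do_shrink_slab() */
static unsigned long long scan_delta(unsigned long freeable, int priority,
				     unsigned int seeks)
{
	unsigned long long delta = freeable >> priority;

	delta *= 4;
	return delta / seeks;	/* userspace stand-in for do_div() */
}

int main(void)
{
	unsigned long freeable = 1UL << 20;	/* 1M freeable objects */

	for (int prio = 12; prio >= 0; prio -= 4)
		printf("priority %2d -> scan target %llu objects\n",
		       prio, scan_delta(freeable, prio, DEFAULT_SEEKS));
	return 0;
}

The lower the numeric priority, the larger the share of freeable objects
that reclaim asks the shrinker to scan.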
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/shrinker.h      |  3 +++
 include/trace/events/vmscan.h |  8 +++-----
 mm/vmscan.c                   | 14 ++++++++------
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1066f052be4f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -29,6 +29,9 @@ struct shrink_control {
 	 */
 	unsigned long nr_scanned;
 
+	/* current reclaim priority */
+	int priority;
+
 	/* current memcg being shrunk (for memcg aware shrinkers) */
 	struct mem_cgroup *memcg;
 };
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 2070df64958e..d42e480977c6 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -185,11 +185,9 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 TRACE_EVENT(mm_shrink_slab_start,
 	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
 		long nr_objects_to_shrink, unsigned long cache_items,
-		unsigned long long delta, unsigned long total_scan,
-		int priority),
+		unsigned long long delta, unsigned long total_scan),
 
-	TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
-		priority),
+	TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
@@ -212,7 +210,7 @@ TRACE_EVENT(mm_shrink_slab_start,
 		__entry->cache_items = cache_items;
 		__entry->delta = delta;
 		__entry->total_scan = total_scan;
-		__entry->priority = priority;
+		__entry->priority = sc->priority;
 	),
 
 	TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 469016222cdb..bc5157625cec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -410,7 +410,7 @@ EXPORT_SYMBOL(unregister_shrinker);
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-				    struct shrinker *shrinker, int priority)
+				    struct shrinker *shrinker)
 {
 	unsigned long freed = 0;
 	unsigned long long delta;
@@ -439,7 +439,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 
 	total_scan = nr;
 	if (shrinker->seeks) {
-		delta = freeable >> priority;
+		delta = freeable >> shrinkctl->priority;
 		delta *= 4;
 		do_div(delta, shrinker->seeks);
 	} else {
@@ -484,7 +484,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		total_scan = freeable * 2;
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
-				   freeable, delta, total_scan, priority);
+				   freeable, delta, total_scan);
 
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
@@ -562,6 +562,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
+			.priority = priority,
 			.memcg = memcg,
 		};
 		struct shrinker *shrinker;
@@ -578,7 +579,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		    !(shrinker->flags & SHRINKER_NONSLAB))
 			continue;
 
-		ret = do_shrink_slab(&sc, shrinker, priority);
+		ret = do_shrink_slab(&sc, shrinker);
 		if (ret == SHRINK_EMPTY) {
 			clear_bit(i, map->map);
 			/*
@@ -597,7 +598,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			 *   set_bit()          do_shrink_slab()
 			 */
 			smp_mb__after_atomic();
-			ret = do_shrink_slab(&sc, shrinker, priority);
+			ret = do_shrink_slab(&sc, shrinker);
 			if (ret == SHRINK_EMPTY)
 				ret = 0;
 			else
@@ -666,10 +667,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
+			.priority = priority,
 			.memcg = memcg,
 		};
 
-		ret = do_shrink_slab(&sc, shrinker, priority);
+		ret = do_shrink_slab(&sc, shrinker);
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;

From patchwork Thu Jan 21 17:21:54 2021
X-Patchwork-Submitter: Vlastimil Babka
X-Patchwork-Id: 12037233
From: Vlastimil Babka <vbabka@suse.cz>
To: vbabka@suse.cz
Cc: akpm@linux-foundation.org, bigeasy@linutronix.de, cl@linux.com, guro@fb.com,
    hannes@cmpxchg.org, iamjoonsoo.kim@lge.com, jannh@google.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org,
    minchan@kernel.org, penberg@kernel.org, rientjes@google.com,
    shakeelb@google.com, surenb@google.com, tglx@linutronix.de
Subject: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs
Date: Thu, 21 Jan 2021 18:21:54 +0100
Message-Id: <20210121172154.27580-2-vbabka@suse.cz>
In-Reply-To: <20210121172154.27580-1-vbabka@suse.cz>
References: <20210121172154.27580-1-vbabka@suse.cz>

For performance reasons, SLUB doesn't keep all slabs on shared lists and
doesn't always free slabs immediately after all their objects are freed.
Namely:

- for each cache and cpu, there might be a "CPU slab" page, partially or
  fully free
- with SLUB_CPU_PARTIAL enabled (default y), there might be a number of
  "percpu partial slabs" for each cache and cpu, also partially or fully
  free
- for each cache and NUMA node, there are slabs on a per-node partial
  list; up to 10 of those may be empty

As Jann reports [1], the number of percpu partial slabs is supposed to be
limited by the number of free objects (up to 30), but due to imprecise
accounting this can deteriorate so that there are up to 30 free slabs. He
notes:

> Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> see something like 1.5MiB of pages with zero inuse objects stuck in
> percpu lists.

My observations match Jann's, and we have seen e.g. cases with 10 free
slabs per cpu. We can also confirm Jann's theory that on kernels before
the kmemcg rewrite (in v5.9) this issue is amplified, as there is a
separate set of kmem caches, with their own cpu slabs, per-cpu partial
and per-node partial lists, for each memcg and each cache that deals
with kmemcg-accounted objects.

The cached free slabs can therefore become a memory waste, increasing
memory pressure, causing more reclaim of actually used LRU pages, and
even causing OOMs (global, or memcg on older kernels).

SLUB provides __kmem_cache_shrink(), which can flush all the
abovementioned slabs, but it is currently called only in rare situations
or from a sysfs handler. The standard way to cooperate with reclaim is to
provide a shrinker, so this patch adds such a shrinker to call
__kmem_cache_shrink() systematically.

The shrinker design is however atypical. The usual design assumes that a
shrinker can easily count how many objects can be reclaimed, and then
reclaim a given number of objects. For SLUB, determining the number of
the various cached slabs would be a lot of work, and controlling
precisely how many to shrink would be impractical. Instead, the shrinker
is based on reclaim priority: at the lowest priority it shrinks a single
kmem cache, while at the highest it shrinks all of them. To do that
effectively, there is a new list, caches_to_shrink, where caches are
taken from the head and then moved to the tail. The existing slab_caches
list is unaffected, so e.g. the /proc/slabinfo order is not disrupted.

This approach should not cause excessive shrinking and IPI storms:

- if there are multiple reclaimers in parallel, only one can proceed,
  thanks to mutex_trylock(&slab_mutex); after unlocking, the caches that
  were just shrunk are at the tail of the list
- in flush_all(), we check whether a CPU actually has anything to flush
  (has_cpu_slab()) before sending an IPI
- CPU slab deactivation became more efficient with "mm, slub: splice cpu
  and page freelists in deactivate_slab()"

The result is that SLUB's per-cpu and per-node caches are trimmed of free
pages, and partially used pages have a higher chance of being either
reused or freed. The trimming effort is controlled by reclaim activity
and thus memory pressure. Before an OOM, a reclaim attempt at the highest
priority ensures that all caches are shrunk. Being a proper slab
shrinker, the shrinking is now also invoked as part of the drop_caches
sysctl operation.

[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
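To make the priority scaling concrete, here is a small standalone sketch
(not part of the patch, and not kernel code) that mirrors the mapping
described above; it assumes DEF_PRIORITY is 12, its usual kernel value:

#include <limits.h>
#include <stdio.h>

#define DEF_PRIORITY 12	/* assumed, matches current kernels */

/* Mirrors how many caches the shrinker walks at a given reclaim priority. */
static int caches_shrunk_at(int priority)
{
	int shift = DEF_PRIORITY - priority;

	if (priority == 0)
		return INT_MAX;	/* highest priority: shrink the whole list */
	if (shift < 0)
		shift = 0;
	return 1 << shift;
}

int main(void)
{
	for (int prio = DEF_PRIORITY; prio >= 0; prio--)
		printf("priority %2d -> shrink %d cache(s)\n",
		       prio, caches_shrunk_at(prio));
	return 0;
}

At DEF_PRIORITY a single cache is shrunk per scan_objects() invocation,
each step of lower priority doubles that count, and priority 0 walks the
whole list.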
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slub_def.h |  1 +
 mm/slub.c                | 76 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..6c4eeb30764d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -107,6 +107,7 @@ struct kmem_cache {
 	unsigned int red_left_pad;	/* Left redzone padding size */
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
+	struct list_head shrink_list;	/* List ordered for shrinking */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index c3141aa962be..bba05bd9287a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -123,6 +123,8 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
 #endif
 #endif
 
+static LIST_HEAD(caches_to_shrink);
+
 static inline bool kmem_cache_debug(struct kmem_cache *s)
 {
 	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
@@ -3933,6 +3935,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	int node;
 	struct kmem_cache_node *n;
 
+	list_del(&s->shrink_list);
+
 	flush_all(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
@@ -3985,6 +3989,69 @@ void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
 }
 #endif
 
+static unsigned long count_shrinkable_caches(struct shrinker *shrink,
+					     struct shrink_control *sc)
+{
+	/*
+	 * Determining how much there is to shrink would be so complex that
+	 * it's better to just pretend there always is something and scale
+	 * the actual effort based on sc->priority.
+	 */
+	return shrink->batch;
+}
+
+static unsigned long shrink_caches(struct shrinker *shrink,
+				   struct shrink_control *sc)
+{
+	struct kmem_cache *s;
+	int nr_to_shrink;
+	int ret = sc->nr_to_scan / 2;
+
+	nr_to_shrink = DEF_PRIORITY - sc->priority;
+	if (nr_to_shrink < 0)
+		nr_to_shrink = 0;
+
+	nr_to_shrink = 1 << nr_to_shrink;
+	if (sc->priority == 0) {
+		nr_to_shrink = INT_MAX;
+		ret = 0;
+	}
+
+	if (!mutex_trylock(&slab_mutex))
+		return SHRINK_STOP;
+
+	list_for_each_entry(s, &caches_to_shrink, shrink_list) {
+		__kmem_cache_shrink(s);
+		if (--nr_to_shrink == 0) {
+			list_bulk_move_tail(&caches_to_shrink,
+					    caches_to_shrink.next,
+					    &s->shrink_list);
+			break;
+		}
+	}
+
+	mutex_unlock(&slab_mutex);
+
+	/*
+	 * As long as we are not at the highest priority, pretend we freed
+	 * something, as we might not have processed all caches. This should
+	 * signal that it's worth retrying. Once we are at the highest
+	 * priority and shrink the whole list, pretend we didn't free
+	 * anything, because there's no point in trying again.
+	 *
+	 * Note the value is currently ignored in "normal" reclaim, but
+	 * drop_slab_node(), which handles the drop_caches sysctl, works
+	 * like this.
+	 */
+	return ret;
+}
+
+static struct shrinker slub_cache_shrinker = {
+	.count_objects = count_shrinkable_caches,
+	.scan_objects = shrink_caches,
+	.batch = 128,
+	.seeks = 0,
+};
+
 /********************************************************************
  *		Kmalloc subsystem
  *******************************************************************/
@@ -4424,6 +4491,8 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 #endif
 	}
 	list_add(&s->list, &slab_caches);
+	list_del(&static_cache->shrink_list);
+	list_add(&s->shrink_list, &caches_to_shrink);
 	return s;
 }
 
@@ -4480,6 +4549,8 @@ void __init kmem_cache_init(void)
 
 void __init kmem_cache_init_late(void)
 {
+	if (register_shrinker(&slub_cache_shrinker))
+		pr_err("SLUB: failed to register shrinker\n");
 }
 
 struct kmem_cache *
@@ -4518,11 +4589,14 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 
 	/* Mutex is not taken during early boot */
 	if (slab_state <= UP)
-		return 0;
+		goto out;
 
 	err = sysfs_slab_add(s);
 	if (err)
 		__kmem_cache_release(s);
+out:
+	if (!err)
+		list_add(&s->shrink_list, &caches_to_shrink);
 
 	return err;
 }