From patchwork Wed Oct 9 14:45:09 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Kirill A . Shutemov" X-Patchwork-Id: 11181329 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D210315AB for ; Wed, 9 Oct 2019 14:45:22 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 83382218AC for ; Wed, 9 Oct 2019 14:45:22 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=shutemov-name.20150623.gappssmtp.com header.i=@shutemov-name.20150623.gappssmtp.com header.b="B8+o2znH" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 83382218AC Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=shutemov.name Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 98F958E0005; Wed, 9 Oct 2019 10:45:21 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 9182E8E0003; Wed, 9 Oct 2019 10:45:21 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7DFE88E0005; Wed, 9 Oct 2019 10:45:21 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0056.hostedemail.com [216.40.44.56]) by kanga.kvack.org (Postfix) with ESMTP id 53E098E0003 for ; Wed, 9 Oct 2019 10:45:21 -0400 (EDT) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with SMTP id EC841180AD801 for ; Wed, 9 Oct 2019 14:45:20 +0000 (UTC) X-FDA: 76024519200.19.vest07_2e860d2176a1c X-Spam-Summary: 2,0,0,9cb23270193bdc0e,d41d8cd98f00b204,kirill@shutemov.name,:mhocko@kernel.org:vbabka@suse.cz:yang.shi@linux.alibaba.com:hannes@cmpxchg.org:hughd@google.com:rientjes@google.com:akpm@linux-foundation.org::linux-kernel@vger.kernel.org:kirill.shutemov@linux.intel.com,RULES_HIT:1:2:41:69:355:379:541:560:960:966:973:988:989:1260:1311:1314:1345:1437:1515:1605:1730:1747:1777:1792:2196:2199:2393:2553:2559:2562:2693:2740:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4051:4250:4321:4385:4605:5007:6117:6119:6120:6261:6653:7901:7903:8957:9010:9592:10004:11026:11232:11473:11658:11914:12043:12291:12296:12297:12438:12517:12519:12555:12679:12683:12895:12986:13161:13229:13894:14096:14877:21080:21324:21444:21451:21627:21740:30034:30054:30070:30090,0,RBL:209.85.208.196:@shutemov.name:.lbl8.mailshell.net-62.14.0.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:492,L UA_SUMMA X-HE-Tag: vest07_2e860d2176a1c X-Filterd-Recvd-Size: 11506 Received: from mail-lj1-f196.google.com (mail-lj1-f196.google.com [209.85.208.196]) by imf14.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 14:45:19 +0000 (UTC) Received: by mail-lj1-f196.google.com with SMTP id y3so2780250ljj.6 for ; Wed, 09 Oct 2019 07:45:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=shutemov-name.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=WHLqnBcAMWl9oxUxGSTJExK0Gib3hZ/+qFSag2SnMEc=; b=B8+o2znHRHD/V8dEhy4aDoTZdnArOLYa6TUorSUQmIk595Xl7NOFqChI8+KUigVn7R zlFYkGRa6y+deWLCikjjE/igMxZvZNiDVolisBb3GENV9jT9d9LN5v8vAYapFFXs1YKh G57ER7/3a8JEZa5VMordbF57zRzgyGSrIwMkscdM6N+iuwV8gRwzXrizlsEtSCB1QmOF G/QHCh3KsQ7gAGKE3kTgKEROxy3pHW7gervRWU1djCfyD0mtj5I0gNVkiGDcgltErZKC SlzmgRyy9/u2xJO86fivb3NFp7sxv74EzVNylni3AXBuK1WP+1rrmZeIjUS267SuYBfS st9Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=WHLqnBcAMWl9oxUxGSTJExK0Gib3hZ/+qFSag2SnMEc=; b=pdJbfnSGDuJwKw/eiIbPDmjNIbRwbN+zp+bKfFb+NJ6Kl8lSf8CW+dPovrB/6aREcH MsZPm8Br4EY7Ti7HMISejRx+7Epy+9lnpJ7Y5MpP/USrOyFPUbT5oIFosL6mUeIeeghH Rmtl5yXRqQefbIORwpRdUy7BaryfPR5qsq0BLjBFeyBiqy62pqkgsvhvg5r+8QwC2+OG WY6h2Il5JtNeX9NIoZvAY0ze0PT0fOcF6rpfDAxQMYrLtzQBePE2kGBuFcO1BM2eKM83 G+TWATmQYDQyvMSO7hxOHkBTyrQFkx9E6QZaBPlJQQKCvENFWwuNSkLjMVui8dAQMVXg BkPg== X-Gm-Message-State: APjAAAWH8uwYo3UWUZykbxrZkJq545CUt5Waq2esYS2Cu7hw39rn1iLw gdwQ2dAP26mCf+qHKnqRLKfW3A== X-Google-Smtp-Source: APXvYqxTbi648QQFAqz7HczUUQLCcbKuogovtYoFBs5AJSxYzt2EYoP7qLWmnUsU7NeYOv4zy2KZsQ== X-Received: by 2002:a2e:5dd5:: with SMTP id v82mr2693751lje.54.1570632317793; Wed, 09 Oct 2019 07:45:17 -0700 (PDT) Received: from box.localdomain ([86.57.175.117]) by smtp.gmail.com with ESMTPSA id c4sm514343lfm.4.2019.10.09.07.45.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Oct 2019 07:45:17 -0700 (PDT) From: "Kirill A. Shutemov" X-Google-Original-From: "Kirill A. Shutemov" Received: by box.localdomain (Postfix, from userid 1000) id 3DD02102BFA; Wed, 9 Oct 2019 17:45:17 +0300 (+03) To: Michal Hocko , Vlastimil Babka , Yang Shi Cc: hannes@cmpxchg.org, hughd@google.com, rientjes@google.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [RFC, PATCH] mm, thp: Try to bound number of pages on deferred split queue Date: Wed, 9 Oct 2019 17:45:09 +0300 Message-Id: <20191009144509.23649-1-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.21.0 MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: THPs on deferred split queue got split by shrinker if memory pressure comes. In absence of memory pressure, there is no bound on how long the deferred split queue can be. In extreme cases, deferred queue can grow to tens of gigabytes. It is suboptimal: even without memory pressure we can find better way to use the memory (page cache for instance). Make deferred_split_huge_page() to trigger a work that would split pages, if we have more than NR_PAGES_ON_QUEUE_TO_SPLIT on the queue. The split can fail (i.e. due to memory pinning by GUP), making the queue grow despite the effort. Rate-limit the work triggering to at most every NR_CALLS_TO_SPLIT calls of deferred_split_huge_page(). NR_PAGES_ON_QUEUE_TO_SPLIT and NR_CALLS_TO_SPLIT chosen arbitrarily and will likely require tweaking. The patch has risk to introduce performance regressions. For system with plenty of free memory, triggering the split would cost CPU time (~100ms per GB of THPs to split). I have doubts about the approach, so: Not-Signed-off-by: Kirill A. Shutemov --- include/linux/mmzone.h | 5 ++ mm/huge_memory.c | 129 ++++++++++++++++++++++++++++------------- mm/memcontrol.c | 3 + mm/page_alloc.c | 2 + 4 files changed, 100 insertions(+), 39 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index bda20282746b..f748542745ec 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -684,7 +684,12 @@ struct deferred_split { spinlock_t split_queue_lock; struct list_head split_queue; unsigned long split_queue_len; + unsigned int deferred_split_calls; + struct work_struct deferred_split_work; }; + +void flush_deferred_split_queue(struct work_struct *work); +void flush_deferred_split_queue_memcg(struct work_struct *work); #endif /* diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c5cb6dcd6c69..bb7bef856e38 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2842,43 +2842,6 @@ void free_transhuge_page(struct page *page) free_compound_page(page); } -void deferred_split_huge_page(struct page *page) -{ - struct deferred_split *ds_queue = get_deferred_split_queue(page); -#ifdef CONFIG_MEMCG - struct mem_cgroup *memcg = compound_head(page)->mem_cgroup; -#endif - unsigned long flags; - - VM_BUG_ON_PAGE(!PageTransHuge(page), page); - - /* - * The try_to_unmap() in page reclaim path might reach here too, - * this may cause a race condition to corrupt deferred split queue. - * And, if page reclaim is already handling the same page, it is - * unnecessary to handle it again in shrinker. - * - * Check PageSwapCache to determine if the page is being - * handled by page reclaim since THP swap would add the page into - * swap cache before calling try_to_unmap(). - */ - if (PageSwapCache(page)) - return; - - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); - if (list_empty(page_deferred_list(page))) { - count_vm_event(THP_DEFERRED_SPLIT_PAGE); - list_add_tail(page_deferred_list(page), &ds_queue->split_queue); - ds_queue->split_queue_len++; -#ifdef CONFIG_MEMCG - if (memcg) - memcg_set_shrinker_bit(memcg, page_to_nid(page), - deferred_split_shrinker.id); -#endif - } - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); -} - static unsigned long deferred_split_count(struct shrinker *shrink, struct shrink_control *sc) { @@ -2895,8 +2858,7 @@ static unsigned long deferred_split_count(struct shrinker *shrink, static unsigned long deferred_split_scan(struct shrinker *shrink, struct shrink_control *sc) { - struct pglist_data *pgdata = NODE_DATA(sc->nid); - struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + struct deferred_split *ds_queue = NULL; unsigned long flags; LIST_HEAD(list), *pos, *next; struct page *page; @@ -2906,6 +2868,10 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, if (sc->memcg) ds_queue = &sc->memcg->deferred_split_queue; #endif + if (!ds_queue) { + struct pglist_data *pgdata = NODE_DATA(sc->nid); + ds_queue = &pgdata->deferred_split_queue; + } spin_lock_irqsave(&ds_queue->split_queue_lock, flags); /* Take pin on all head pages to avoid freeing them under us */ @@ -2957,6 +2923,91 @@ static struct shrinker deferred_split_shrinker = { SHRINKER_NONSLAB, }; +static void __flush_deferred_split_queue(struct pglist_data *pgdata, + struct mem_cgroup *memcg) +{ + struct shrink_control sc; + + sc.nid = pgdata ? pgdata->node_id : 0; + sc.memcg = memcg; + sc.nr_to_scan = 0; /* Unlimited */ + + deferred_split_scan(NULL, &sc); +} + +void flush_deferred_split_queue(struct work_struct *work) +{ + struct deferred_split *ds_queue; + struct pglist_data *pgdata; + + ds_queue = container_of(work, struct deferred_split, + deferred_split_work); + pgdata = container_of(ds_queue, struct pglist_data, + deferred_split_queue); + __flush_deferred_split_queue(pgdata, NULL); +} + +#ifdef CONFIG_MEMCG +void flush_deferred_split_queue_memcg(struct work_struct *work) +{ + struct deferred_split *ds_queue; + struct mem_cgroup *memcg; + + ds_queue = container_of(work, struct deferred_split, + deferred_split_work); + memcg = container_of(ds_queue, struct mem_cgroup, + deferred_split_queue); + __flush_deferred_split_queue(NULL, memcg); +} +#endif + +#define NR_CALLS_TO_SPLIT 32 +#define NR_PAGES_ON_QUEUE_TO_SPLIT 16 + +void deferred_split_huge_page(struct page *page) +{ + struct deferred_split *ds_queue = get_deferred_split_queue(page); +#ifdef CONFIG_MEMCG + struct mem_cgroup *memcg = compound_head(page)->mem_cgroup; +#endif + unsigned long flags; + + VM_BUG_ON_PAGE(!PageTransHuge(page), page); + + /* + * The try_to_unmap() in page reclaim path might reach here too, + * this may cause a race condition to corrupt deferred split queue. + * And, if page reclaim is already handling the same page, it is + * unnecessary to handle it again in shrinker. + * + * Check PageSwapCache to determine if the page is being + * handled by page reclaim since THP swap would add the page into + * swap cache before calling try_to_unmap(). + */ + if (PageSwapCache(page)) + return; + + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); + if (list_empty(page_deferred_list(page))) { + count_vm_event(THP_DEFERRED_SPLIT_PAGE); + list_add_tail(page_deferred_list(page), &ds_queue->split_queue); + ds_queue->split_queue_len++; + ds_queue->deferred_split_calls++; +#ifdef CONFIG_MEMCG + if (memcg) + memcg_set_shrinker_bit(memcg, page_to_nid(page), + deferred_split_shrinker.id); +#endif + } + + if (ds_queue->split_queue_len > NR_PAGES_ON_QUEUE_TO_SPLIT && + ds_queue->deferred_split_calls > NR_CALLS_TO_SPLIT) { + ds_queue->deferred_split_calls = 0; + schedule_work(&ds_queue->deferred_split_work); + } + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); +} + #ifdef CONFIG_DEBUG_FS static int split_huge_pages_set(void *data, u64 val) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c313c49074ca..67305ec75fdc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5085,6 +5085,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void) spin_lock_init(&memcg->deferred_split_queue.split_queue_lock); INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue); memcg->deferred_split_queue.split_queue_len = 0; + memcg->deferred_split_queue.deferred_split_calls = 0; + INIT_WORK(&memcg->deferred_split_queue.deferred_split_work, + flush_deferred_split_queue_memcg); #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 15c2050c629b..2f52e538a26f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6674,6 +6674,8 @@ static void pgdat_init_split_queue(struct pglist_data *pgdat) spin_lock_init(&ds_queue->split_queue_lock); INIT_LIST_HEAD(&ds_queue->split_queue); ds_queue->split_queue_len = 0; + ds_queue->deferred_split_calls = 0; + INIT_WORK(&ds_queue->deferred_split_work, flush_deferred_split_queue); } #else static void pgdat_init_split_queue(struct pglist_data *pgdat) {}