From patchwork Sat Nov 16 09:16:58 2024
X-Patchwork-Submitter: Chen Ridong
X-Patchwork-Id: 13877536
From: Chen Ridong
To: akpm@linux-foundation.org, mhocko@suse.com, hannes@cmpxchg.org,
    yosryahmed@google.com, yuzhao@google.com, david@redhat.com,
    willy@infradead.org, ryan.roberts@arm.com, baohua@kernel.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, chenridong@huawei.com,
    wangweiyang2@huawei.com, xieym_ict@hotmail.com
Subject: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the tail of LRU after shrinking
Date: Sat, 16 Nov 2024 09:16:58 +0000
Message-Id: <20241116091658.1983491-2-chenridong@huaweicloud.com>
In-Reply-To: <20241116091658.1983491-1-chenridong@huaweicloud.com>
References: <20241116091658.1983491-1-chenridong@huaweicloud.com>

From: Chen Ridong

An issue was found with the following testing steps:

1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
2. Mount memcg v1, create a memcg named test_memcg, and set its memory
   limit to 2.1G and its memory+swap (memsw) limit to 3G.
3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg
   (a minimal allocation sketch is shown after these steps).
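For reference, below is a minimal sketch of the anon allocation used in
step 3. It is an assumption for illustration only; the report does not say
which tool was actually used. The program maps and touches about 2.2G of
anonymous memory and then sleeps, so it can be run inside test_memcg while
memory.usage_in_bytes and memory.memsw.usage_in_bytes are inspected. With a
2.1G memory limit, roughly 100M of the 2.2G is pushed out to swap.

/*
 * Illustrative helper (not part of the original report): allocate and
 * touch ~2.2G of anonymous memory, then sleep so usage can be inspected.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = (size_t)2200 << 20;	/* ~2.2G of anon memory */
	char *buf;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Touch every byte so the kernel really allocates anon folios. */
	memset(buf, 0x5a, size);

	puts("allocated; sleeping so memcg usage can be inspected");
	pause();
	return 0;
}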
It was found that:

  # cat memory.usage_in_bytes
  2144940032
  # cat memory.memsw.usage_in_bytes
  2255056896
  # free -h
                total        used        free
  Mem:           31Gi       2.1Gi        27Gi
  Swap:         1.0Gi       618Mi       405Mi

As shown above, test_memcg charged only about 100M of swap
(memsw.usage_in_bytes minus usage_in_bytes), but 618M of swap was in use
system-wide. That means roughly 500M of swap may be wasted, because other
memcgs cannot use that swap space.

It can be explained as follows:

1. When entering shrink_inactive_list(), folios are isolated from the tail
   of the inactive LRU. To keep it simple, assume only folioN is taken:

   inactive lru: folio1<->folio2<->folio3<->...<->folioN-1
   isolated list: folioN

2. In shrink_folio_list(), if folioN is a THP (2M), it may be split and
   added to the swap cache folio by folio. After a folio is added to the
   swap cache, writeback IO to swap is submitted, which is asynchronous.
   When shrink_folio_list() finishes, the isolated folios are moved back to
   the head of the inactive LRU. With 512 folios moved to the head, the
   inactive LRU looks like this:

   folioN512<->folioN511<->...<->folioN1<->folio1<->folio2<->...<->folioN-1

   IO was committed from folioN1 to folioN512; each later folio was added
   to the head of 'ret_folios' in shrink_folio_list(), so the resulting
   order is folioN512->folioN511->...->folioN1.

3. When a folio's writeback IO completes, the folio is rotated to the tail
   of the LRU, one by one. Assuming folioN1, folioN2, ..., folioN512
   complete in order (the order their IO was committed), they are rotated
   to the tail of the LRU in that order:

   folio1<->folio2<->...<->folioN-1<->folioN1<->...<->folioN511<->folioN512

   Folios at the tail of the LRU will then be reclaimed as soon as
   possible.

4. However, shrink_folio_list() and folio writeback are asynchronous. If a
   THP is split, shrink_folio_list() loops at least 512 times, so writeback
   of some folios may complete before shrink_folio_list() itself finishes,
   and those folios cannot be rotated to the tail of the LRU. The LRU may
   then look like this:

   folioN50<->folioN49<->...<->folioN1<->folio1<->folio2<->...<->folioN-1<->
   folioN51<->folioN52<->...<->folioN511<->folioN512

   Although folios N1-N50 have finished writing back, they are still at the
   head of the LRU. Their writeback completed while shrink_folio_list() was
   still looping, so folio_end_writeback()'s call to
   folio_rotate_reclaimable() could not move them to the tail of the LRU:
   at that moment they were not on the LRU at all, but still on the private
   'folio_list' (see the simplified sketch below). Since isolation scans
   the LRU from tail to head, it is hard to reach these folios again.

As a result, a large number of folios may have been added to the swap cache
but cannot be reclaimed in time, which reduces reclaim efficiency and
prevents other memcgs from using this swap space even if they trigger OOM.
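To make the failing path in step 4 concrete, here is a simplified sketch of
the rotate-on-writeback-completion check. It is a paraphrase for
illustration, not the actual mm/swap.c code: the real
folio_rotate_reclaimable() defers the move through per-CPU folio batches
and checks additional flags, and the helper name used here is hypothetical.

/*
 * Conceptual paraphrase of what folio_end_writeback() ->
 * folio_rotate_reclaimable() effectively does; not the real implementation.
 */
static void rotate_reclaimable_sketch(struct folio *folio)
{
	/*
	 * While shrink_folio_list() is still iterating, the folio has been
	 * isolated from the LRU and sits on the private 'folio_list', so
	 * this test fails and the rotation is silently skipped.
	 */
	if (!folio_test_lru(folio))
		return;

	/*
	 * Otherwise the folio is queued to be moved to the tail of its LRU
	 * list, where the next reclaim scan will find it first.
	 */
	lruvec_add_folio_tail(folio_lruvec(folio), folio);
}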
To fix this issue, folios whose writeback has completed should be moved to
the tail of the LRU, instead of always being placed at the head of the LRU
when shrink_folio_list() finishes. This is achieved in the following steps:

1. In shrink_folio_list(), folios that have had pageout IO committed are
   added to the head of 'folio_list', which is returned to the caller.

2. When shrink_folio_list() finishes, the number of folios that were paged
   out is known, and they are all at the head of 'folio_list', ready to be
   moved back to the LRU.

So, in move_folios_to_lru(), if a folio among the first 'nr_io' folios
(those that were paged out) has already completed writeback, move it to the
tail of the LRU; otherwise, move it to the head of the LRU.

Signed-off-by: Chen Ridong
---
 mm/vmscan.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..04f7eab9d818 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1046,6 +1046,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	struct folio_batch free_folios;
 	LIST_HEAD(ret_folios);
 	LIST_HEAD(demote_folios);
+	LIST_HEAD(pageout_folios);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
@@ -1061,7 +1062,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		struct address_space *mapping;
 		struct folio *folio;
 		enum folio_references references = FOLIOREF_RECLAIM;
-		bool dirty, writeback;
+		bool dirty, writeback, is_pageout = false;
 		unsigned int nr_pages;
 
 		cond_resched();
@@ -1384,6 +1385,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 				nr_pages = 1;
 			}
 			stat->nr_pageout += nr_pages;
+			is_pageout = true;
 
 			if (folio_test_writeback(folio))
 				goto keep;
@@ -1508,7 +1510,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 keep_locked:
 		folio_unlock(folio);
 keep:
-		list_add(&folio->lru, &ret_folios);
+		if (is_pageout)
+			list_add(&folio->lru, &pageout_folios);
+		else
+			list_add(&folio->lru, &ret_folios);
 		VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
 				folio_test_unevictable(folio), folio);
 	}
@@ -1551,6 +1556,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		free_unref_folios(&free_folios);
 
 	list_splice(&ret_folios, folio_list);
+	list_splice(&pageout_folios, folio_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 
 	if (plug)
@@ -1826,11 +1832,14 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
 
 /*
  * move_folios_to_lru() moves folios from private @list to appropriate LRU list.
+ * @lruvec: The lruvec the folios are moved to.
+ * @list: The list of folios to be moved to the lruvec.
+ * @nr_io: Number of folios at the head of @list that had IO committed.
  *
  * Returns the number of pages moved to the given lruvec.
  */
 static unsigned int move_folios_to_lru(struct lruvec *lruvec,
-		struct list_head *list)
+		struct list_head *list, unsigned int nr_io)
 {
 	int nr_pages, nr_moved = 0;
 	struct folio_batch free_folios;
@@ -1880,9 +1889,21 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		 * inhibits memcg migration).
 		 */
 		VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
-		lruvec_add_folio(lruvec, folio);
+		/*
+		 * If the folio has had IO committed and its writeback has
+		 * already completed, add it to the tail of the LRU so it
+		 * can be reclaimed as soon as possible.
+		 */
+		if (nr_io > 0 &&
+		    !folio_test_reclaim(folio) &&
+		    !folio_test_writeback(folio))
+			lruvec_add_folio_tail(lruvec, folio);
+		else
+			lruvec_add_folio(lruvec, folio);
+
 		nr_pages = folio_nr_pages(folio);
 		nr_moved += nr_pages;
+		nr_io = nr_io > nr_pages ? (nr_io - nr_pages) : 0;
 		if (folio_test_active(folio))
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
@@ -1960,7 +1981,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
 
 	spin_lock_irq(&lruvec->lru_lock);
-	move_folios_to_lru(lruvec, &folio_list);
+	move_folios_to_lru(lruvec, &folio_list, stat.nr_pageout);
 
 	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
 					stat.nr_demoted);
@@ -2111,8 +2132,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 */
 	spin_lock_irq(&lruvec->lru_lock);
 
-	nr_activate = move_folios_to_lru(lruvec, &l_active);
-	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
+	nr_activate = move_folios_to_lru(lruvec, &l_active, 0);
+	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive, 0);
 
 	__count_vm_events(PGDEACTIVATE, nr_deactivate);
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
@@ -4627,7 +4648,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	spin_lock_irq(&lruvec->lru_lock);
 
-	move_folios_to_lru(lruvec, &list);
+	move_folios_to_lru(lruvec, &list, 0);
 
 	walk = current->reclaim_state->mm_walk;
 	if (walk && walk->batched) {