From patchwork Mon Jul 29 02:13:06 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gao Xiang
X-Patchwork-Id: 13744170
From: Gao Xiang
To: Andrew Morton, Huang Ying, linux-mm@kvack.org, Matthew Wilcox,
 linux-kernel@vger.kernel.org
Cc: Gao Xiang
Subject: [PATCH v2] mm/migrate: fix deadlock in migrate_pages_batch() on
 large folios
Date: Mon, 29 Jul 2024 10:13:06 +0800
Message-ID: <20240729021306.398286-1-hsiangkao@linux.alibaba.com>
X-Mailer: git-send-email 2.43.5

Currently, migrate_pages_batch() can hold the locks of multiple folios
at once, taken in an arbitrary order.
Although folio_trylock() is used to avoid deadlock, as commit
2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously
firstly") mentioned, it seems try_split_folio() was missed: it still
calls folio_lock() unconditionally.

It was found by a compaction stress test when I explicitly enabled
EROFS compressed files to use large folios; I cannot reproduce it with
the same workload if large folio support is off (current mainline).
Typically, filesystem reads (with locked file-backed folios) could use
another bdev/meta inode to load some other I/Os (e.g. inode extent
metadata or caching compressed data), so the locking order will be:

  file-backed folios  (A)
     bdev/meta folios (B)

The following calltrace shows the deadlock:
   Thread 1 takes (B) lock and tries to take folio (A) lock
   Thread 2 takes (A) lock and tries to take folio (B) lock

[Thread 1]
INFO: task stress:1824 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress          state:D stack:0     pid:1824  tgid:1824  ppid:1822   flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
   <-- folio mapping ffff00036d69cb18 index 996  (**)
 __folio_lock+0x24/0x38
 migrate_pages_batch+0x77c/0xea0  // try_split_folio (mm/migrate.c:1486:2)
                                  // migrate_pages_batch (mm/migrate.c:1734:16)
   <--- LIST_HEAD(unmap_folios) has
          ..
          folio mapping 0xffff0000d184f1d8 index 1711; (*)
          folio mapping 0xffff0000d184f1d8 index 1712;
          ..
 migrate_pages+0xb28/0xe90
 compact_zone+0xa08/0x10f0
 compact_node+0x9c/0x180
 sysctl_compaction_handler+0x8c/0x118
 proc_sys_call_handler+0x1a8/0x280
 proc_sys_write+0x1c/0x30
 vfs_write+0x240/0x380
 ksys_write+0x78/0x118
 __arm64_sys_write+0x24/0x38
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

[Thread 2]
INFO: task stress:1825 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress          state:D stack:0     pid:1825  tgid:1825  ppid:1822   flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
   <-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*)
 __folio_lock+0x24/0x38
 z_erofs_runqueue+0x384/0x9c0 [erofs]
 z_erofs_readahead+0x21c/0x350 [erofs]
   <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
 read_pages+0x74/0x328
 page_cache_ra_order+0x26c/0x348
 ondemand_readahead+0x1c0/0x3a0
 page_cache_sync_ra+0x9c/0xc0
 filemap_get_pages+0xc4/0x708
 filemap_read+0x104/0x3a8
 generic_file_read_iter+0x4c/0x150
 vfs_read+0x27c/0x330
 ksys_pread64+0x84/0xd0
 __arm64_sys_pread64+0x28/0x40
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Cc: "Huang, Ying"
Cc: Matthew Wilcox
Signed-off-by: Gao Xiang
Reviewed-by: "Huang, Ying"
Acked-by: David Hildenbrand
---
v1: https://lore.kernel.org/r/20240728154913.4023977-1-hsiangkao@linux.alibaba.com
changes since v1:
 - pass in migrate_mode suggested by Huang, Ying:
    https://lore.kernel.org/r/87plqx0yh2.fsf@yhuang6-desk2.ccr.corp.intel.com

 mm/migrate.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 20cb9f5f7446..15c4330e40cd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1479,11 +1479,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 	return rc;
 }
 
-static inline int try_split_folio(struct folio *folio, struct list_head *split_folios)
+static inline int try_split_folio(struct folio *folio, struct list_head *split_folios,
+				  enum migrate_mode mode)
 {
 	int rc;
 
-	folio_lock(folio);
+	if (mode == MIGRATE_ASYNC) {
+		if (!folio_trylock(folio))
+			return -EAGAIN;
+	} else {
+		folio_lock(folio);
+	}
 	rc = split_folio_to_list(folio, split_folios);
 	folio_unlock(folio);
 	if (!rc)
@@ -1677,7 +1683,7 @@ static int migrate_pages_batch(struct list_head *from,
 			 */
 			if (nr_pages > 2 &&
 			   !list_empty(&folio->_deferred_list)) {
-				if (try_split_folio(folio, split_folios) == 0) {
+				if (!try_split_folio(folio, split_folios, mode)) {
 					nr_failed++;
 					stats->nr_thp_failed += is_thp;
 					stats->nr_thp_split += is_thp;
@@ -1699,7 +1705,7 @@ static int migrate_pages_batch(struct list_head *from,
 			if (!thp_migration_supported() && is_thp) {
 				nr_failed++;
 				stats->nr_thp_failed++;
-				if (!try_split_folio(folio, split_folios)) {
+				if (!try_split_folio(folio, split_folios, mode)) {
 					stats->nr_thp_split++;
 					stats->nr_split++;
 					continue;
@@ -1731,7 +1737,7 @@ static int migrate_pages_batch(struct list_head *from,
 				stats->nr_thp_failed += is_thp;
 				/* Large folio NUMA faulting doesn't split to retry. */
 				if (is_large && !nosplit) {
-					int ret = try_split_folio(folio, split_folios);
+					int ret = try_split_folio(folio, split_folios, mode);
 
 					if (!ret) {
 						stats->nr_thp_split += is_thp;
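
For readers who want to experiment with the locking pattern outside the
kernel, here is a minimal user-space sketch of the same idea using
pthread mutexes. It is only an analogy, not kernel code: struct object,
enum lock_mode, MODE_ASYNC, MODE_SYNC and try_split_object() are
hypothetical names invented for this illustration, standing in for a
folio, enum migrate_mode and try_split_folio(). The point it tries to
show is that a caller which may already hold other locks should, in
asynchronous mode, back off with -EAGAIN rather than sleep on a lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's enum migrate_mode. */
enum lock_mode { MODE_ASYNC, MODE_SYNC };

/* Hypothetical stand-in for a folio: an object protected by a lock. */
struct object {
	pthread_mutex_t lock;
	int split;	/* set to 1 once the "split" step has run */
};

/*
 * Same shape as the patched helper: in async mode never sleep on the
 * lock; return -EAGAIN so the caller can skip the object and retry
 * later. In sync mode, block until the lock is available.
 */
int try_split_object(struct object *obj, enum lock_mode mode)
{
	assert(obj != NULL);

	if (mode == MODE_ASYNC) {
		/* pthread_mutex_trylock() returns non-zero if the lock is held. */
		if (pthread_mutex_trylock(&obj->lock) != 0)
			return -EAGAIN;
	} else {
		pthread_mutex_lock(&obj->lock);
	}

	obj->split = 1;	/* placeholder for the real work done under the lock */
	pthread_mutex_unlock(&obj->lock);
	return 0;
}
```

With this shape, a caller in MODE_ASYNC that meets a contended object
gets -EAGAIN back and can move on to the next one, so an A/B versus B/A
ordering conflict shows up as a retry and not as two tasks sleeping on
each other.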