From patchwork Thu Oct 24 13:22:25 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yu Kuai
X-Patchwork-Id: 13848992
From: Yu Kuai
To: stable@vger.kernel.org, gregkh@linuxfoundation.org, harry.wentland@amd.com,
    sunpeng.li@amd.com, Rodrigo.Siqueira@amd.com, alexander.deucher@amd.com,
    christian.koenig@amd.com, Xinhui.Pan@amd.com, airlied@gmail.com,
    daniel@ffwll.ch, viro@zeniv.linux.org.uk, brauner@kernel.org,
    Liam.Howlett@oracle.com, akpm@linux-foundation.org, hughd@google.com,
    willy@infradead.org, sashal@kernel.org, srinivasan.shanmugam@amd.com,
    chiahsuan.chung@amd.com, mingo@kernel.org, mgorman@techsingularity.net,
    yukuai3@huawei.com, chengming.zhou@linux.dev, zhangpeng.00@bytedance.com,
    chuck.lever@oracle.com
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    maple-tree@lists.infradead.org, linux-mm@kvack.org,
    yukuai1@huaweicloud.com, yi.zhang@huawei.com, yangerkun@huawei.com
Subject: [PATCH 6.6 28/28] maple_tree: correct tree corruption on spanning store
Date: Thu, 24 Oct 2024 21:22:25 +0800
Message-Id: <20241024132225.2271667-13-yukuai1@huaweicloud.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20241024132225.2271667-1-yukuai1@huaweicloud.com>
References: <20241024132009.2267260-1-yukuai1@huaweicloud.com>
 <20241024132225.2271667-1-yukuai1@huaweicloud.com>
From: Lorenzo Stoakes

commit bea07fd63192b61209d48cbb81ef474cc3ee4c62 upstream.

Patch series "maple_tree: correct tree corruption on spanning store", v3.

There has been a nasty yet subtle maple tree corruption bug that appears
to have been in existence since the inception of the algorithm.  This bug
seems far more likely to happen since commit f8d112a4e657 ("mm/mmap:
avoid zeroing vma tree in mmap_region()"), which is the point at which
reports started to be submitted concerning this bug.

We were made definitively aware of the bug thanks to the kind efforts of
Bert Karwatzki, who helped enormously in my being able to track this down
and identify the cause of it.

The bug arises when an attempt is made to perform a spanning store across
two leaf nodes, where the right leaf node is the rightmost child of the
shared parent, AND the store completely consumes the rightmost node.

This results in mas_wr_spanning_store() mistakenly duplicating the new
and existing entries at the maximum pivot within the range, and thus
maple tree corruption.

The fix patch corrects this by detecting this scenario and disallowing
the mistaken duplicate copy.  Its commit message goes into great detail
as to how this occurs.

This series also includes a test which reliably reproduces the issue and
asserts that the fix works correctly.  Bert has kindly tested the fix and
confirmed it resolved his issues.  Mikhail Gavrilov also kindly reported
what appears to be precisely the same bug, which this fix should likewise
resolve.

This patch (of 2):

There has been a subtle bug present in the maple tree implementation from
its inception.

This arises from how stores are performed - when a store occurs, it will
overwrite overlapping ranges and adjust the tree as necessary to
accommodate this.

A range may always ultimately span two leaf nodes.  In this instance we
walk the two leaf nodes, determine which elements are not overwritten to
the left and to the right of the start and end of the range respectively,
and then rebalance the tree to contain these entries and the newly
inserted one.

This kind of store is dubbed a 'spanning store' and is implemented by
mas_wr_spanning_store().
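[ Editorial illustration, not part of the original patch: a minimal sketch
against the public maple tree API of how stores overwrite overlapping
ranges, and of the kind of store that takes the spanning path.  The
function name and entry values are arbitrary, and where leaf boundaries
fall depends on internal node capacity. ]

	#include <linux/maple_tree.h>
	#include <linux/xarray.h>

	static void overlapping_store_sketch(void)
	{
		DEFINE_MTREE(mt);

		/* Two adjacent ranges. */
		mtree_store_range(&mt, 0x0000, 0x2fff, xa_mk_value(1), GFP_KERNEL);
		mtree_store_range(&mt, 0x3000, 0x5fff, xa_mk_value(2), GFP_KERNEL);

		/*
		 * This store overlaps both existing entries, so they are
		 * trimmed around it, leaving:
		 *   [0x0000, 0x1fff] -> 1
		 *   [0x2000, 0x4fff] -> 3
		 *   [0x5000, 0x5fff] -> 2
		 * When the overwritten region lives in two different leaf
		 * nodes, this is a 'spanning store' and is handled by
		 * mas_wr_spanning_store().
		 */
		mtree_store_range(&mt, 0x2000, 0x4fff, xa_mk_value(3), GFP_KERNEL);

		mtree_destroy(&mt);
	}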
In order to reach this stage, mas_store_gfp() invokes
mas_wr_preallocate(), mas_wr_store_type() and mas_wr_walk() in turn to
walk the tree and update the object (mas) to traverse to the location
where the write should be performed, determining its store type.

When a spanning store is required, mas_wr_walk() returns false, stopping
at the parent node which contains the target range, and
mas_wr_store_type() marks the mas->store_type as wr_spanning_store to
denote this fact.

When we go to perform the store in mas_wr_spanning_store(), we first
determine the elements AFTER the END of the range we wish to store (that
is, to the right of the entry to be inserted) - we do this by walking to
the NEXT pivot in the tree (i.e.  r_mas.last + 1), starting at the node
we have just determined contains the range over which we intend to write.

We then turn our attention to the entries to the left of the entry we are
inserting, whose state is represented by l_mas, and copy these into a
'big node', which is a special node that contains enough slots to hold
two leaf nodes' worth of data.

We then copy the entry we wish to store immediately after this - the copy
and the insertion of the new entry are performed by mas_store_b_node().

After this we copy the elements to the right of the end of the range
which we are inserting, if we have not exceeded the length of the node
(i.e.  r_mas.offset <= r_mas.end).

Herein lies the bug - under very specific circumstances, this logic can
break and corrupt the maple tree.  Consider the following tree:

Height
  0                             Root Node
                                 /      \
                 pivot = 0xffff /        \ pivot = ULONG_MAX
                               /          \
  1                       A [-----]       ...
                            /   \
            pivot = 0x4fff /     \ pivot = 0xffff
                          /       \
  2 (LEAVES)          B [-----]  [-----] C
                                      ^--- Last pivot 0xffff.

Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
that all ranges expressed in maple tree code are inclusive):

1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
   determines that this is a spanning store across nodes B and C.  The
   mas state is set such that the current node from which we traverse
   further is node A.

2. In mas_wr_spanning_store() we try to find elements to the right of
   pivot 0xffff by searching for an index of 0x10000:

   - mas_wr_walk_index() invokes mas_wr_walk_descend() and
     mas_wr_node_walk() in turn.

   - mas_wr_node_walk() loops over entries in node A until EITHER it
     finds an entry whose pivot equals or exceeds 0x10000 OR it reaches
     the final entry.

   - Since no entry has a pivot equal to or exceeding 0x10000, pivot
     0xffff is selected, leading to node C.

   - mas_wr_walk_traverse() resets the mas state to traverse node C.  We
     loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
     in turn once again.

   - Again, we reach the last entry in node C, which has a pivot of
     0xffff.

3. We then copy the elements to the left of 0x4000 in node B to the big
   node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
   too.

4. We determine whether we have any entries to copy from the right of the
   end of the range via the r_mas.offset <= r_mas.end check - and with
   r_mas set up at the entry at pivot 0xffff, this check passes, so we
   DUPLICATE the entry at pivot 0xffff.

5. BUG! The maple tree is corrupted with a duplicate entry.

This requires a very specific set of circumstances - a spanning store
across two leaf nodes, whose range ends at the last pivot of the right
leaf node, which is also the last pivot of the shared parent node, i.e.
the range ends at that shared pivot.
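[ Editorial illustration, not part of the original patch: a hedged sketch
of a reproducer for the scenario above.  The function name, constants and
loop bounds are assumptions for illustration, not the actual regression
test added by patch 2/2 of this series. ]

	#include <linux/maple_tree.h>
	#include <linux/xarray.h>

	/* Illustrative only: aim for the shape described above, where the
	 * stored range ends at the shared last pivot of a leaf and its
	 * parent, then perform the spanning store that used to duplicate
	 * that pivot's entry. */
	static void __init check_spanning_store_corruption(void)
	{
		DEFINE_MTREE(mt);
		unsigned long i;

		/* Populate small adjacent ranges so the tree gains height
		 * and the boundary at 0xffff lands on the last slot of a
		 * leaf node.  The count needed depends on node capacity, so
		 * a real test iterates over a range of tree shapes. */
		for (i = 0; i <= 0xf; i++)
			mtree_store_range(&mt, i * 0x1000, (i + 1) * 0x1000 - 1,
					  xa_mk_value(i), GFP_KERNEL);

		/* The spanning store from the walkthrough above. */
		mtree_store_range(&mt, 0x4000, 0xffff, xa_mk_value(0x42),
				  GFP_KERNEL);

		/* With CONFIG_DEBUG_MAPLE_TREE, mt_validate() flags the
		 * duplicated entry at pivot 0xffff when the bug is present. */
		mt_validate(&mt);
		mtree_destroy(&mt);
	}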
A potential solution to this problem would simply be to reset the walk
each time we traverse r_mas, however given the rarity of this situation
it seems that would be rather inefficient.

Instead, this patch detects whether the right hand node is populated,
i.e.  actually has anything we need to copy.  We do so by only copying
elements from the right of the entry being inserted when the maximum
value present exceeds the last, rather than basing this on offset
position.

The patch also updates some comments and eliminates the unused bool
return value in mas_wr_walk_index().

The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma
tree in mmap_region()") seems to have made the probability of this event
much more likely, which is the point at which reports started to be
submitted concerning this bug.

The motivation for this change arose from Bert Karwatzki's report of
encountering mm instability after the release of kernel v6.12-rc1 which,
after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
options, was identified as maple tree corruption.

After Bert very generously provided his time and ability to reproduce
this event consistently, I was able to finally identify that the issue
discussed in this commit message was occurring for him.

Link: https://lkml.kernel.org/r/cover.1728314402.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Lorenzo Stoakes
Reported-by: Bert Karwatzki
Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
Tested-by: Bert Karwatzki
Reported-by: Mikhail Gavrilov
Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
Acked-by: Vlastimil Babka
Reviewed-by: Liam R. Howlett
Tested-by: Mikhail Gavrilov
Reviewed-by: Wei Yang
Cc: Matthew Wilcox
Cc: Sidhartha Kumar
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton
Signed-off-by: Yu Kuai
---
 lib/maple_tree.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 5328e08723d7..c57b6fc4db2e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -2239,6 +2239,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
 
 /*
  * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
+ *                      If @mas->index cannot be found within the containing
+ *                      node, we traverse to the last entry in the node.
  * @wr_mas: The maple write state
  *
  * Uses mas_slot_locked() and does not need to worry about dead nodes.
@@ -3655,7 +3657,7 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
 	return true;
 }
 
-static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
+static void mas_wr_walk_index(struct ma_wr_state *wr_mas)
 {
 	struct ma_state *mas = wr_mas->mas;
 
@@ -3664,11 +3666,9 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
 		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
 						  mas->offset);
 		if (ma_is_leaf(wr_mas->type))
-			return true;
+			return;
 		mas_wr_walk_traverse(wr_mas);
-
 	}
-	return true;
 }
 /*
  * mas_extend_spanning_null() - Extend a store of a %NULL to include surrounding %NULLs.
@@ -3899,8 +3899,8 @@ static inline int mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 	memset(&b_node, 0, sizeof(struct maple_big_node));
 	/* Copy l_mas and store the value in b_node. */
 	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
-	/* Copy r_mas into b_node. */
-	if (r_mas.offset <= r_mas.end)
+	/* Copy r_mas into b_node if there is anything to copy. */
+	if (r_mas.max > r_mas.last)
 		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
 			   &b_node, b_node.b_end + 1);
 	else
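[ Editorial illustration, not part of the original patch: why the new
condition is correct.  In the failing scenario, r_mas has walked to the
final entry of node C, so its offset is still within the node (the old
check passes) even though nothing lies to the right of the stored range
(r_mas.max == r_mas.last).  Below is a standalone, simplified sketch in
plain C; the struct is hypothetical and mirrors only the fields the check
consults, with illustrative slot numbers. ]

	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical mirror of the ma_state fields used by the check. */
	struct r_state {
		unsigned long last;   /* end of the range being stored */
		unsigned long max;    /* maximum pivot of the walked node */
		unsigned char offset; /* slot the walk stopped at */
		unsigned char end;    /* last occupied slot in the node */
	};

	int main(void)
	{
		/* The walk for 0x10000 stopped at the final entry of node C,
		 * whose pivot (and thus node maximum) is 0xffff - equal to
		 * the end of the range being stored. */
		struct r_state r = { .last = 0xffff, .max = 0xffff,
				     .offset = 9, .end = 9 };

		bool old_check = (r.offset <= r.end); /* true: duplicates 0xffff */
		bool new_check = (r.max > r.last);    /* false: nothing to copy */

		printf("old check copies right part: %s\n", old_check ? "yes" : "no");
		printf("new check copies right part: %s\n", new_check ? "yes" : "no");
		return 0;
	}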