From patchwork Wed Jan 25 01:57:38 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zach O'Keefe
X-Patchwork-Id: 13114948
Date: Tue, 24 Jan 2023 17:57:38 -0800
In-Reply-To: <20230125015738.912924-1-zokeefe@google.com>
References: <20230125015738.912924-1-zokeefe@google.com>
X-Mailer: git-send-email 2.39.1.405.gd4c25cc71f-goog
Message-ID: <20230125015738.912924-2-zokeefe@google.com>
Subject: [PATCH 2/2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
From: "Zach O'Keefe"
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Hugh Dickins, Yang Shi,
 "Zach O'Keefe", stable@vger.kernel.org

In commit 34488399fa08 ("mm/madvise: add file and
shmem support to MADV_COLLAPSE") we make the following change to
find_pmd_or_thp_or_none():

    -       if (!pmd_present(pmde))
    -               return SCAN_PMD_NULL;
    +       if (pmd_none(pmde))
    +               return SCAN_PMD_NONE;

This was for use by MADV_COLLAPSE file/shmem codepaths, where
MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
khugepaged race in, free the pte table, and clear the pmd.  Such
codepaths include:

A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
   already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
   mm/address.

In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to suitably identify that case separate
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc).
Regardless, the code is prepared to install a huge-pmd only when the
existing pmd entry is either a genuine pte-table-mapping-pmd, or the
none-pmd.

However, the commit introduces a logical hole; namely, that we've
allowed !none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds.  One example of an entry that could leak through
is a swap entry.  The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.

We want to put back the !pmd_present() check (below the pmd_none()
check), but need to be careful to deal with subtleties in pmd
transitions and treatments by the various archs.

The issue is that __split_huge_pmd_locked() temporarily clears the
present bit (or otherwise marks the entry as invalid), but pmd_present()
and pmd_trans_huge() still need to return true while the pmd is in this
transitory state.
For example, x86's pmd_present() also checks the _PAGE_PSE bit, riscv's
version also checks the _PAGE_LEAF bit, and arm64 also checks a
PMD_PRESENT_INVALID bit.

Covering all 4 cases for x86 (all checks done on the same pmd value):

1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
      is set.
      => huge-pmd
   b) We are currently racing with __split_huge_page().  The danger here
      is that we proceed as-if we have a huge-pmd, but really we are
      looking at a pte-mapping-pmd.  So, what is the risk of this
      danger?

      The only relevant path is:

        madvise_collapse() -> collapse_pte_mapped_thp()

      Where we might just incorrectly report back "success", when really
      the memory isn't pmd-backed.  This is fine, since split could
      happen immediately after (actually) successful madvise_collapse().
      So, it should be safe to just assume huge-pmd here.

2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is.
      => pte-table-mapping pmd (or PROT_NONE)
   b) devmap.  This routine can be called immediately after
      unlocking/locking mmap_lock -- or called with no locks held (see
      khugepaged_scan_mm_slot()), so previous VMA checks have since been
      invalidated.

3) !pmd_present() && pmd_trans_huge()
   Not possible.

4) !pmd_present() && !pmd_trans_huge()
   Neither PRESENT nor PROTNONE set
   => not present

I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, loongarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since
!pmd_present() always takes failure path).

Also, add a comment above find_pmd_or_thp_or_none() to help future
travelers reason about the validity of the code; namely, the possible
mutations that might happen out from under us, depending on how
mmap_lock is held (if at all).

Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Reported-by: Hugh Dickins
Signed-off-by: Zach O'Keefe
Cc: stable@vger.kernel.org
---

Request that this be pulled into stable since it's theoretically
possible (though I have no reproducer) that while mmap_lock is dropped,
racing thp migration installs a pmd migration entry which then has a
path to be consumed, unchecked, by pte_offset_map().

---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa38cae240b9..7ea668bbea70 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -941,6 +941,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return SCAN_SUCCEED;
 }
 
+/*
+ * See pmd_trans_unstable() for how the result may change out from
+ * underneath us, even if we hold mmap_lock in read.
+ */
 static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 				   unsigned long address,
 				   pmd_t **pmd)
@@ -959,8 +963,12 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 #endif
 	if (pmd_none(pmde))
 		return SCAN_PMD_NONE;
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
+	if (pmd_devmap(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;
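
For readers who want to poke at the check ordering outside the kernel,
the classification in find_pmd_or_thp_or_none() and the four x86 cases
from the commit message can be approximated by a small userspace model.
This is only a sketch under stated assumptions: every toy_* name, the
TOY_* flag bits and their positions are hypothetical and are not the
kernel's definitions, and the predicates merely follow the x86 behaviour
as described in the commit message (the PSE bit alone counts as
present).  It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the scan results named in the patch. */
enum toy_scan {
	TOY_SCAN_SUCCEED,
	TOY_SCAN_PMD_NULL,
	TOY_SCAN_PMD_NONE,
	TOY_SCAN_PMD_MAPPED,
};

/* Hypothetical flag bits: positions are illustrative, not the kernel's. */
enum {
	TOY_PRESENT  = 1u << 0,
	TOY_PSE      = 1u << 1,
	TOY_PROTNONE = 1u << 2,
	TOY_DEVMAP   = 1u << 3,
	TOY_SWAP     = 1u << 4,	/* non-none, non-present payload (e.g. a migration entry) */
	TOY_BAD      = 1u << 5,	/* stand-in for whatever pmd_bad() would reject */
};

/* Assumed x86-like predicates, following the commit message's description. */
static bool toy_none(unsigned int v)       { return v == 0; }
static bool toy_present(unsigned int v)    { return (v & (TOY_PRESENT | TOY_PROTNONE | TOY_PSE)) != 0; }
static bool toy_trans_huge(unsigned int v) { return (v & (TOY_PSE | TOY_DEVMAP)) == TOY_PSE; }
static bool toy_devmap(unsigned int v)     { return (v & TOY_DEVMAP) != 0; }
static bool toy_bad(unsigned int v)        { return (v & TOY_BAD) != 0; }

/* Check order before the patch: no present test, no devmap test. */
enum toy_scan toy_classify_old(unsigned int v)
{
	if (toy_none(v))
		return TOY_SCAN_PMD_NONE;
	if (toy_trans_huge(v))
		return TOY_SCAN_PMD_MAPPED;
	if (toy_bad(v))
		return TOY_SCAN_PMD_NULL;
	return TOY_SCAN_SUCCEED;
}

/* Check order after the patch. */
enum toy_scan toy_classify_new(unsigned int v)
{
	/* Case 3 (!present && trans_huge) cannot occur in this model. */
	assert(!(!toy_present(v) && toy_trans_huge(v)));

	if (toy_none(v))
		return TOY_SCAN_PMD_NONE;
	if (!toy_present(v))
		return TOY_SCAN_PMD_NULL;	/* case 4: swap/migration-like entry */
	if (toy_trans_huge(v))
		return TOY_SCAN_PMD_MAPPED;	/* case 1, including a racing split */
	if (toy_devmap(v))
		return TOY_SCAN_PMD_NULL;	/* case 2b */
	if (toy_bad(v))
		return TOY_SCAN_PMD_NULL;
	return TOY_SCAN_SUCCEED;		/* case 2a: pte-table-mapping pmd */
}

/* Search every toy value for case 3; returns true if one is found. */
bool toy_case3_reachable(void)
{
	for (unsigned int v = 0; v < (1u << 6); v++)
		if (!toy_present(v) && toy_trans_huge(v))
			return true;
	return false;
}
```

In this model, a value carrying only TOY_SWAP is accepted as a pte
table by toy_classify_old() but rejected by toy_classify_new(), which is
the hole described above.  A value carrying only TOY_PSE (the
transitory split state) is still reported as mapped by both.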