From patchwork Wed Jan 25 22:53:58 2023
X-Patchwork-Submitter: Zach O'Keefe
X-Patchwork-Id: 13116389
Date: Wed, 25 Jan 2023 14:53:58 -0800
Message-ID: <20230125225358.2576151-1-zokeefe@google.com>
Subject: [PATCH v2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
From: "Zach O'Keefe"
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Hugh Dickins, Yang Shi, "Zach O'Keefe", stable@vger.kernel.org

In commit 34488399fa08 ("mm/madvise: add file and
shmem support to MADV_COLLAPSE") we made the following change to
find_pmd_or_thp_or_none():

    -       if (!pmd_present(pmde))
    -               return SCAN_PMD_NULL;
    +       if (pmd_none(pmde))
    +               return SCAN_PMD_NONE;

This was for use by MADV_COLLAPSE file/shmem codepaths, where
MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
khugepaged race in, free the pte table, and clear the pmd.  Such
codepaths include:

A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
   already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
   mm/address.

In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to suitably identify that case separate
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc).
Regardless, the code is prepared to install a huge-pmd only when the
existing pmd entry is either a genuine pte-table-mapping-pmd, or the
none-pmd.

However, the commit introduces a logical hole; namely, that we've allowed
!none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds.  One example of an entry that could leak through
is a swap entry.  The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.

We want to put back the !pmd_present() check (below the pmd_none() check),
but need to be careful to deal with subtleties in pmd transitions and
treatments by various arch.

The issue is that __split_huge_pmd_locked() temporarily clears the present
bit (or otherwise marks the entry as invalid), but pmd_present() and
pmd_trans_huge() still need to return true while the pmd is in this
transitory state.
For example, x86's pmd_present() also checks the _PAGE_PSE bit, riscv's
version also checks the _PAGE_LEAF bit, and arm64 also checks a
PMD_PRESENT_INVALID bit.

Covering all 4 cases for x86 (all checks done on the same pmd value):

1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
      is set.
      => huge-pmd
   b) We are currently racing with __split_huge_page().  The danger here
      is that we proceed as-if we have a huge-pmd, but really we are
      looking at a pte-mapping-pmd.  So, what is the risk of this
      danger?

      The only relevant path is:

        madvise_collapse() -> collapse_pte_mapped_thp()

      Where we might just incorrectly report back "success", when really
      the memory isn't pmd-backed.  This is fine, since split could
      happen immediately after (actually) successful madvise_collapse().
      So, it should be safe to just assume huge-pmd here.

2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is.
      => pte-table-mapping pmd (or PROT_NONE)
   b) devmap.  This routine can be called immediately after
      unlocking/locking mmap_lock -- or called with no locks held (see
      khugepaged_scan_mm_slot()), so previous VMA checks have since been
      invalidated.

3) !pmd_present() && pmd_trans_huge()
   Not possible.

4) !pmd_present() && !pmd_trans_huge()
   Neither PRESENT nor PROTNONE set
   => not present

I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, loongarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since
!pmd_present() always takes the failure path).

Also, add a comment above find_pmd_or_thp_or_none() to help future
travelers reason about the validity of the code; namely, the possible
mutations that might happen out from under us, depending on how mmap_lock
is held (if at all).
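
For anyone who wants to poke at the four cases above outside the kernel,
here is a minimal userspace sketch of the check order this patch ends up
with.  It is a toy model only: every toy_* name and every bit value below
is made up for illustration, the predicates are simplified from the x86
description above, and toy_pmd_bad() is a single-flag stand-in for the
arch-specific pmd_bad().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy pmd value; NOT the real x86 encoding. */
#define TOY_PRESENT   (1u << 0)
#define TOY_PROTNONE  (1u << 1)
#define TOY_PSE       (1u << 2)
#define TOY_DEVMAP    (1u << 3)
#define TOY_BAD       (1u << 4)	/* stand-in for whatever pmd_bad() rejects */
#define TOY_SWAP_LIKE (1u << 8)	/* stand-in for a swap/migration payload */

typedef uint32_t toy_pmd_t;

enum toy_scan {
	TOY_SCAN_SUCCEED,
	TOY_SCAN_PMD_NULL,
	TOY_SCAN_PMD_NONE,
	TOY_SCAN_PMD_MAPPED,
};

static bool toy_pmd_none(toy_pmd_t p)
{
	return p == 0;
}

/* PSE counts as "present", so a pmd caught mid-split still tests present. */
static bool toy_pmd_present(toy_pmd_t p)
{
	return (p & (TOY_PRESENT | TOY_PROTNONE | TOY_PSE)) != 0;
}

/* Huge means PSE is set and DEVMAP is not. */
static bool toy_pmd_trans_huge(toy_pmd_t p)
{
	return (p & (TOY_PSE | TOY_DEVMAP)) == TOY_PSE;
}

static bool toy_pmd_devmap(toy_pmd_t p)
{
	return (p & TOY_DEVMAP) != 0;
}

static bool toy_pmd_bad(toy_pmd_t p)
{
	return (p & TOY_BAD) != 0;
}

/* Same check order as the patched find_pmd_or_thp_or_none(). */
enum toy_scan toy_classify(toy_pmd_t p)
{
	/* Case 3: in this model, trans_huge always implies present. */
	assert(!toy_pmd_trans_huge(p) || toy_pmd_present(p));

	if (toy_pmd_none(p))
		return TOY_SCAN_PMD_NONE;
	if (!toy_pmd_present(p))
		return TOY_SCAN_PMD_NULL;	/* case 4, e.g. swap-like entry */
	if (toy_pmd_trans_huge(p))
		return TOY_SCAN_PMD_MAPPED;	/* case 1, including mid-split */
	if (toy_pmd_devmap(p))
		return TOY_SCAN_PMD_NULL;	/* case 2b */
	if (toy_pmd_bad(p))
		return TOY_SCAN_PMD_NULL;
	return TOY_SCAN_SUCCEED;		/* case 2a */
}
```

In this model, toy_classify(TOY_SWAP_LIKE) yields TOY_SCAN_PMD_NULL, which
is the entry the restored !pmd_present() check is meant to catch, and a
value with only TOY_PSE set (the mid-split case 1b) is still reported as
TOY_SCAN_PMD_MAPPED.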

Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Reported-by: Hugh Dickins
Signed-off-by: Zach O'Keefe
Cc: stable@vger.kernel.org
Reviewed-by: Yang Shi
---
Request that this be pulled into stable since it's theoretically possible
(though I have no reproducer) that while mmap_lock is dropped, racing thp
migration installs a pmd migration entry which then has a path to be
consumed, unchecked, by pte_offset_map().

v1 -> v2: Fix typo

---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9548644bdb56..1face2ae5877 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -943,6 +943,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return SCAN_SUCCEED;
 }
 
+/*
+ * See pmd_trans_unstable() for how the result may change out from
+ * underneath us, even if we hold mmap_lock in read.
+ */
 static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 				   unsigned long address,
 				   pmd_t **pmd)
@@ -961,8 +965,12 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 #endif
 	if (pmd_none(pmde))
 		return SCAN_PMD_NONE;
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
+	if (pmd_devmap(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;