From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: James Houghton, Jann Horn, peterx@redhat.com, Andrew Morton, Andrea Arcangeli, Rik van Riel, Nadav Amit, Miaohe Lin, Muchun Song, Mike Kravetz, David Hildenbrand
Subject: [PATCH 03/10] mm/hugetlb: Document huge_pte_offset usage
Date: Tue, 29 Nov 2022 14:35:19 -0500
Message-Id: <20221129193526.3588187-4-peterx@redhat.com>
In-Reply-To: <20221129193526.3588187-1-peterx@redhat.com>

huge_pte_offset() is potentially a pgtable walker: it looks up the
pte_t* for a hugetlb address.

Normally it is always safe to walk a generic pgtable as long as we hold
the mmap lock for either read or write, because that guarantees the
pgtable pages stay valid for the whole walk.

That is not true for hugetlbfs, in particular for shared mappings: a
hugetlbfs pgtable can be freed by pmd unsharing.  This means that even
with the mmap lock of the current mm held, the PMD pgtable page can
still go away from under us if a pmd unshare can happen during the walk.

So there are two ways to make the walk safe even for a shared mapping:

  (1) Hold the hugetlb vma lock for either read or write.  This is safe
      because pmd unshare cannot happen at all.

  (2) Hold the i_mmap_rwsem for either read or write.  This is safe
      because even if a pmd unshare happens, the pgtable page cannot be
      freed from under us.

Document it.

Signed-off-by: Peter Xu
---
 include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..81efd9b9baa2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible for its thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd unsharing is not possible, so it'll
+ *      always be safe if we hold the mmap sem for either read or write.
+ *      This is normally always the case, IOW we don't need to do anything
+ *      special.
+ *
+ *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ *      pgtable page can go away from under us!  It can be done by a pmd
+ *      unshare with a follow up munmap() on the other process), then we
+ *      need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range (it also makes sure the pte_t we
+ *           read is the right and stable one), or,
+ *
+ *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ *           sure even if unshare happened the racy unmap() will wait until
+ *           i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest, which guarantees pte stability from pmd
+ * sharing pov, until the vma lock is released.  Option (2.2) doesn't protect
+ * a concurrent pmd unshare, but it makes sure the pgtable page is safe to
+ * access.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
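The rule in the comment added above huge_pte_offset() can be restated as a small decision function. This is a hypothetical model, not a kernel API: `struct walk_locks`, `enum walk_safety` and `huge_pte_walk_safety()` are invented for illustration, each flag means the lock is held for either read or write, and for shared mappings the model only checks the pmd-sharing conditions (2.1) and (2.2), ignoring whatever else keeps the vma itself alive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model (not a kernel API) of the locking rule documented
 * above huge_pte_offset().  Each flag means "held for read or write".
 */
struct walk_locks {
	bool vm_shared;		/* shared mapping: pmd unsharing is possible */
	bool mmap_lock;		/* mmap lock of the current mm */
	bool vma_lock;		/* hugetlb vma lock */
	bool i_mmap_rwsem;	/* i_mmap_rwsem of the hugetlb mapping */
};

enum walk_safety {
	WALK_UNSAFE,		/* pgtable page may be freed under the walker */
	WALK_PGTABLE_SAFE,	/* page stays valid, but a pmd unshare may race */
	WALK_NO_UNSHARE,	/* no pmd unshare can happen on the range */
};

static enum walk_safety huge_pte_walk_safety(const struct walk_locks *l)
{
	assert(l != NULL);

	if (!l->vm_shared)
		/* Rule (1): private mappings never unshare a pmd. */
		return l->mmap_lock ? WALK_NO_UNSHARE : WALK_UNSAFE;
	if (l->vma_lock)
		/* Rule (2.1): the vma lock keeps pmd unshare off the range. */
		return WALK_NO_UNSHARE;
	if (l->i_mmap_rwsem)
		/* Rule (2.2): unshare may happen, but the racing unmap waits. */
		return WALK_PGTABLE_SAFE;
	/* The mmap lock alone is not enough for a shared mapping. */
	return WALK_UNSAFE;
}
```

The key case the model makes explicit is a shared mapping walked with only the mmap lock held: it returns WALK_UNSAFE, which is exactly the situation the commit message warns about.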