[v3,3/9] mm/hugetlb: Document huge_pte_offset usage

Message ID	20221209170100.973970-4-peterx@redhat.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Miaohe Lin <linmiaohe@huawei.com>,
	David Hildenbrand <david@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	peterx@redhat.com,
	Andrea Arcangeli <aarcange@redhat.com>,
	Jann Horn <jannh@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Rik van Riel <riel@surriel.com>,
	Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH v3 3/9] mm/hugetlb: Document huge_pte_offset usage
Date: Fri, 9 Dec 2022 12:00:54 -0500
Message-Id: <20221209170100.973970-4-peterx@redhat.com>
In-Reply-To: <20221209170100.973970-1-peterx@redhat.com>
References: <20221209170100.973970-1-peterx@redhat.com>
MIME-Version: 1.0
Content-type: text/plain
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
Series
[v3,1/9] mm/hugetlb: Let vma_offset_start() to return start
[v3,2/9] mm/hugetlb: Don't wait for migration entry during follow page
[v3,3/9] mm/hugetlb: Document huge_pte_offset usage
[v3,4/9] mm/hugetlb: Move swap entry handling into vma lock when faulted
[v3,5/9] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
[v3,6/9] mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare
[v3,7/9] mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare
[v3,8/9] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
[v3,9/9] mm/hugetlb: Introduce hugetlb_walk()

Commit Message

Peter Xu Dec. 9, 2022, 5 p.m. UTC

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as the mmap
lock is held for either read or write, because that guarantees the pgtable
pages stay valid during the walk.

But that's not true for hugetlbfs, especially for shared mappings:
hugetlbfs can have its pgtable freed by pmd unsharing.  This means that
even with the mmap lock held for the current mm, the PMD pgtable page can
still go away from under us if pmd unsharing is possible during the walk.

So we have two ways to make it safe even for a shared mapping:

  (1) If the hugetlb vma lock is held for either read or write, it's okay
      because pmd unshare cannot happen at all.

  (2) If the i_mmap_rwsem lock is held for either read or write, it's okay
      because even if pmd unshare can happen, the pgtable page cannot be
      freed from under us.

Document it.
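
For illustration only (this is not part of the patch): the rule can be
restated as a small userspace truth table.  Everything below is a
hypothetical model.  struct model_locks, enum walk_safety and
model_walk_safety() are made-up names, not kernel APIs, and each boolean
merely stands in for "the caller holds this lock for read or write".

```c
#include <stdbool.h>

/* Hypothetical model of the locks a huge_pte_offset() caller may hold. */
struct model_locks {
	bool mmap_lock;		/* stands in for the mm's mmap lock */
	bool vma_lock;		/* stands in for the hugetlb vma lock */
	bool i_mmap_rwsem;	/* stands in for the mapping's i_mmap_rwsem */
};

enum walk_safety {
	WALK_UNSAFE,		/* none of the documented conditions is met */
	WALK_PGTABLE_SAFE,	/* pgtable page stays valid, unshare may race */
	WALK_NO_UNSHARE,	/* pmd unshare cannot happen during the walk */
};

/* Encodes only the rule as documented; it does not model the locks. */
static enum walk_safety model_walk_safety(bool vm_shared,
					  const struct model_locks *held)
{
	if (!vm_shared)
		/* Private mapping: pmd unsharing is not possible. */
		return held->mmap_lock ? WALK_NO_UNSHARE : WALK_UNSAFE;

	if (held->vma_lock)
		return WALK_NO_UNSHARE;		/* way (1) above */
	if (held->i_mmap_rwsem)
		return WALK_PGTABLE_SAFE;	/* way (2) above */

	/* Shared mapping with at most the mmap lock: not enough. */
	return WALK_UNSAFE;
}
```

In this model, a shared mapping walked with only the mmap lock held comes
out as WALK_UNSAFE, which is the case the new comment warns about.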

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..d755e2a7c0db 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -192,6 +192,38 @@  extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible for its thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd unsharing is not possible, so holding the
+ *      mmap_lock for either read or write is sufficient. Most callers
+ *      already hold the mmap_lock, so normally, no special action is
+ *      required.
+ *
+ *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ *      pgtable page can go away from under us!  It can be done by a pmd
+ *      unshare with a follow up munmap() on the other process), then we
+ *      need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range (it also makes sure the pte_t we
+ *           read is the right and stable one), or,
+ *
+ *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ *           sure even if unshare happened the racy unmap() will wait until
+ *           i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest: it guarantees pte stability, from the pmd
+ * sharing point of view, until the vma lock is released.  Option (2.2)
+ * doesn't protect against a concurrent pmd unshare, but it makes sure the
+ * pgtable page is safe to access.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
