
[RFC,v2,06/12] mm/hugetlb: Protect huge_pmd_share() with walker lock

Message ID 20221118011025.2178986-7-peterx@redhat.com (mailing list archive)
State New
Series mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Commit Message

Peter Xu Nov. 18, 2022, 1:10 a.m. UTC
huge_pmd_share() is normally called with the vma lock already held, which
makes it tempting to think the walker lock is not needed.  That is not
true: the vma lock only covers the "current" vma, not all the vmas whose
pgtables we are about to walk.

Taking every such vma lock could deadlock and would be hard to order.  The
safe approach is to use the walker lock, which guarantees the pgtable page
stays alive; we then need get_page_unless_zero() rather than plain
get_page() to make sure the pgtable page is not being freed in the
meantime.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)
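
As a quick illustration of the pattern this patch establishes, here is a
simplified sketch (it uses only names that appear in the patch, it is not
the literal loop, and the spte reset reflects the fixup squashed in below):

	hugetlb_walker_lock();		/* found pgtable pages cannot be freed while held */
	spte = huge_pte_offset(svma->vm_mm, saddr, vma_mmu_pagesize(svma));
	if (spte && !get_page_unless_zero(virt_to_page(spte)))
		spte = NULL;		/* ref already 0: the page is being freed */
	hugetlb_walker_unlock();	/* from here on we rely on the page ref we took */

The page reference taken under the walker lock is what keeps the shared
pgtable page alive after the lock is dropped; the lock itself only needs to
cover the window in which the page is located and the reference acquired.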

Comments

Peter Xu Nov. 18, 2022, 1:17 a.m. UTC | #1
On Thu, Nov 17, 2022 at 08:10:19PM -0500, Peter Xu wrote:
> huge_pmd_share() is normally called with the vma lock already held, which
> makes it tempting to think the walker lock is not needed.  That is not
> true: the vma lock only covers the "current" vma, not all the vmas whose
> pgtables we are about to walk.
> 
> Taking every such vma lock could deadlock and would be hard to order.  The
> safe approach is to use the walker lock, which guarantees the pgtable page
> stays alive; we then need get_page_unless_zero() rather than plain
> get_page() to make sure the pgtable page is not being freed in the
> meantime.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 61a1fa678172..5ef883184885 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7008,6 +7008,13 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  
>  	i_mmap_lock_read(mapping);
> +
> +	/*
> +	 * NOTE: even if we've got the vma read lock, here we still need to
> +	 * take the walker lock, because we're not walking the current vma,
> +	 * but some other mm's!
> +	 */
> +	hugetlb_walker_lock();
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -7016,12 +7023,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (saddr) {
>  			spte = huge_pte_offset(svma->vm_mm, saddr,
>  					       vma_mmu_pagesize(svma));
> -			if (spte) {
> -				get_page(virt_to_page(spte));
> +			/*
> +			 * When page ref==0, it means it's probably being
> +			 * freed; continue with the next one.
> +			 */
> +			if (spte && get_page_unless_zero(virt_to_page(spte)))
>  				break;
> -			}
>  		}
>  	}
> +	hugetlb_walker_unlock();
>  
>  	if (!spte)
>  		goto out;
> -- 
> 2.37.3
> 

Will squash this in..

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c77ae7a3d94..f9abede9de47 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7045,6 +7045,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
                         */
                        if (spte && get_page_unless_zero(virt_to_page(spte)))
                                break;
+                       else
+                               spte = NULL;
                }
        }
        hugetlb_walker_unlock();
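
The extra "else" matters: when get_page_unless_zero() fails, spte is left
pointing at a page whose reference we never obtained, and if that happens on
the last vma in the interval tree the loop exits with a stale pointer that
slips past the "if (!spte)" check further down.  Resetting spte turns the
failed grab into "nothing found".  With the fixup folded in, the loop body
reads roughly as follows (a sketch of the combined result, not the literal
diff):

	if (saddr) {
		spte = huge_pte_offset(svma->vm_mm, saddr,
				       vma_mmu_pagesize(svma));
		/* ref==0 means the pgtable page is already being freed */
		if (spte && get_page_unless_zero(virt_to_page(spte)))
			break;		/* got a reference to a shareable pmd page */
		spte = NULL;		/* otherwise forget it and keep looking */
	}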

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61a1fa678172..5ef883184885 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7008,6 +7008,13 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 
 	i_mmap_lock_read(mapping);
+
+	/*
+	 * NOTE: even if we've got the vma read lock, here we still need to
+	 * take the walker lock, because we're not walking the current vma,
+	 * but some other mm's!
+	 */
+	hugetlb_walker_lock();
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -7016,12 +7023,15 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr,
 					       vma_mmu_pagesize(svma));
-			if (spte) {
-				get_page(virt_to_page(spte));
+			/*
+			 * When page ref==0, it means it's probably being
+			 * freed; continue with the next one.
+			 */
+			if (spte && get_page_unless_zero(virt_to_page(spte)))
 				break;
-			}
 		}
 	}
+	hugetlb_walker_unlock();
 
 	if (!spte)
 		goto out;