
[RFC,v4,7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

Message ID 20220706202347.95150-8-mike.kravetz@oracle.com (mailing list archive)
State New
Series hugetlb: Change huge pmd sharing synchronization again

Commit Message

Mike Kravetz July 6, 2022, 8:23 p.m. UTC
Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.

This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.
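
For orientation, here is a condensed before/after of the caller in
remove_inode_hugepages(), taken from the diff below; the i_mmap locking and
the per-vma hugetlb_vma_maps_page() check now live inside the new helper:

        /* before: open-coded in remove_inode_hugepages() */
        if (unlikely(folio_mapped(folio))) {
                i_mmap_lock_write(mapping);
                hugetlb_vmdelete_list(&mapping->i_mmap,
                        index * pages_per_huge_page(h),
                        (index + 1) * pages_per_huge_page(h),
                        ZAP_FLAG_DROP_MARKER);
                i_mmap_unlock_write(mapping);
        }

        /* after: locking and the "does this vma map the page" check are in the helper */
        if (unlikely(folio_mapped(folio)))
                hugetlb_unmap_file_folio(h, mapping, folio, index);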

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 124 +++++++++++++++++++++++++++++++++----------
 1 file changed, 95 insertions(+), 29 deletions(-)

Comments

Miaohe Lin July 29, 2022, 2:02 a.m. UTC | #1
On 2022/7/7 4:23, Mike Kravetz wrote:
> Create the new routine hugetlb_unmap_file_folio that will unmap a single
> file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
> modified to do locking within the routine itself and check whether the
> page is mapped within a specific vma before unmapping.
> 
> This refactoring will be put to use and expanded upon in a subsequent
> patch adding vma specific locking.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c | 124 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 95 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 31bd4325fce5..0eac0ea2a245 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -396,6 +396,94 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
>  	return -EINVAL;
>  }
>  
> +/*
> + * Called with i_mmap_rwsem held for inode based vma maps.  This makes
> + * sure vma (and vm_mm) will not go away.  We also hold the hugetlb fault
> + * mutex for the page in the mapping.  So, we can not race with page being
> + * faulted into the vma.
> + */
> +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> +				unsigned long addr, struct page *page)
> +{
> +	pte_t *ptep, pte;
> +
> +	ptep = huge_pte_offset(vma->vm_mm, addr,
> +			huge_page_size(hstate_vma(vma)));
> +
> +	if (!ptep)
> +		return false;
> +
> +	pte = huge_ptep_get(ptep);
> +	if (huge_pte_none(pte) || !pte_present(pte))
> +		return false;
> +
> +	if (pte_page(pte) == page)
> +		return true;
> +
> +	return false;	/* WTH??? */

I'm sorry, but what does WTH mean? IIUC, this could happen if pte_page(pte) is a COW-ed private page?
vma_interval_tree_foreach() doesn't exclude the private mapping even after COW?
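
Roughly the scenario I have in mind, as a user-space sketch (assumes a 2MB
default huge page size, a hugetlbfs mount at /dev/hugepages with free huge
pages, and a hypothetical file name; error handling trimmed):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 2UL << 20;   /* one 2MB huge page */
        int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);
        char *shr, *priv, first;

        ftruncate(fd, len);

        /* Populate the file page (page cache) through a shared mapping. */
        shr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        shr[0] = 1;

        /*
         * Private mapping of the same offset: the vma is still linked
         * into the file's i_mmap interval tree.
         */
        priv = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        first = priv[0];        /* read fault maps the file page read-only */
        priv[0] = first + 1;    /* write fault COWs it into an anonymous copy */

        /*
         * A truncate now still visits the private vma via i_mmap, but its
         * pte maps the COW copy rather than the file page, so
         * hugetlb_vma_maps_page() would reach the final return.
         */
        ftruncate(fd, 0);

        munmap(priv, len);
        munmap(shr, len);
        close(fd);
        unlink("/dev/hugepages/example");
        return 0;
}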

Except for the above (trivial) point, this patch looks good to me.

Thanks.
Mike Kravetz July 29, 2022, 6:11 p.m. UTC | #2
On 07/29/22 10:02, Miaohe Lin wrote:
> On 2022/7/7 4:23, Mike Kravetz wrote:
> > Create the new routine hugetlb_unmap_file_folio that will unmap a single
> > file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
> > modified to do locking within the routine itself and check whether the
> > page is mapped within a specific vma before unmapping.
> > 
> > This refactoring will be put to use and expanded upon in a subsequent
> > patch adding vma specific locking.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  fs/hugetlbfs/inode.c | 124 +++++++++++++++++++++++++++++++++----------
> >  1 file changed, 95 insertions(+), 29 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 31bd4325fce5..0eac0ea2a245 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -396,6 +396,94 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
> >  	return -EINVAL;
> >  }
> >  
> > +/*
> > + * Called with i_mmap_rwsem held for inode based vma maps.  This makes
> > + * sure vma (and vm_mm) will not go away.  We also hold the hugetlb fault
> > + * mutex for the page in the mapping.  So, we can not race with page being
> > + * faulted into the vma.
> > + */
> > +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> > +				unsigned long addr, struct page *page)
> > +{
> > +	pte_t *ptep, pte;
> > +
> > +	ptep = huge_pte_offset(vma->vm_mm, addr,
> > +			huge_page_size(hstate_vma(vma)));
> > +
> > +	if (!ptep)
> > +		return false;
> > +
> > +	pte = huge_ptep_get(ptep);
> > +	if (huge_pte_none(pte) || !pte_present(pte))
> > +		return false;
> > +
> > +	if (pte_page(pte) == page)
> > +		return true;
> > +
> > +	return false;	/* WTH??? */
> 
> I'm sorry, but what does WTH mean? IIUC, this could happen if pte_page(pte) is a COW-ed private page?
> vma_interval_tree_foreach() doesn't exclude the private mapping even after COW?

My apologies, I left that comment in during development and should have removed
it.  WTH is an acronym for 'What the Heck?'.  I added it because I did not
think we should ever get to this return statement.

I am not sure whether the COW of a private page that you describe would get us
to this return statement.  In any case, if we get there we need to return false.
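
When I remove that stray comment, the tail of the routine could simply become
(a possible cleanup, not what the posted v4 does):

        /* possible cleanup, not in this patch: fold the last two branches */
        return pte_page(pte) == page;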

Thank you for your analysis and comments!
Miaohe Lin July 30, 2022, 2:15 a.m. UTC | #3
On 2022/7/30 2:11, Mike Kravetz wrote:
> On 07/29/22 10:02, Miaohe Lin wrote:
>> On 2022/7/7 4:23, Mike Kravetz wrote:
>>> Create the new routine hugetlb_unmap_file_folio that will unmap a single
>>> file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
>>> modified to do locking within the routine itself and check whether the
>>> page is mapped within a specific vma before unmapping.
>>>
>>> This refactoring will be put to use and expanded upon in a subsequent
>>> patch adding vma specific locking.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>  fs/hugetlbfs/inode.c | 124 +++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 95 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index 31bd4325fce5..0eac0ea2a245 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -396,6 +396,94 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
>>>  	return -EINVAL;
>>>  }
>>>  
>>> +/*
>>> + * Called with i_mmap_rwsem held for inode based vma maps.  This makes
>>> + * sure vma (and vm_mm) will not go away.  We also hold the hugetlb fault
>>> + * mutex for the page in the mapping.  So, we can not race with page being
>>> + * faulted into the vma.
>>> + */
>>> +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
>>> +				unsigned long addr, struct page *page)
>>> +{
>>> +	pte_t *ptep, pte;
>>> +
>>> +	ptep = huge_pte_offset(vma->vm_mm, addr,
>>> +			huge_page_size(hstate_vma(vma)));
>>> +
>>> +	if (!ptep)
>>> +		return false;
>>> +
>>> +	pte = huge_ptep_get(ptep);
>>> +	if (huge_pte_none(pte) || !pte_present(pte))
>>> +		return false;
>>> +
>>> +	if (pte_page(pte) == page)
>>> +		return true;
>>> +
>>> +	return false;	/* WTH??? */
>>
>> I'm sorry, but what does WTH mean? IIUC, this could happen if pte_page(pte) is a COW-ed private page?
>> vma_interval_tree_foreach() doesn't exclude the private mapping even after COW?
> 
> My apologies, I left that comment in during development and should have removed
> it.  WTH is an acronym for 'What the Heck?'.  I added it because I did not
> think we should ever get to this return statement.
> 

That's all right. Thanks for your hard work.

> I am not sure whether the COW of a private page that you describe would get us
> to this return statement.  In any case, if we get there we need to return false.
> 
> Thank you for your analysis and comments!
>

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 31bd4325fce5..0eac0ea2a245 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -396,6 +396,94 @@  static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
+/*
+ * Called with i_mmap_rwsem held for inode based vma maps.  This makes
+ * sure vma (and vm_mm) will not go away.  We also hold the hugetlb fault
+ * mutex for the page in the mapping.  So, we can not race with page being
+ * faulted into the vma.
+ */
+static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
+				unsigned long addr, struct page *page)
+{
+	pte_t *ptep, pte;
+
+	ptep = huge_pte_offset(vma->vm_mm, addr,
+			huge_page_size(hstate_vma(vma)));
+
+	if (!ptep)
+		return false;
+
+	pte = huge_ptep_get(ptep);
+	if (huge_pte_none(pte) || !pte_present(pte))
+		return false;
+
+	if (pte_page(pte) == page)
+		return true;
+
+	return false;	/* WTH??? */
+}
+
+/*
+ * Can vma_offset_start/vma_offset_end overflow on 32-bit arches?
+ * No, because the interval tree returns us only those vmas
+ * which overlap the truncated area starting at pgoff,
+ * and no vma on a 32-bit arch can span beyond the 4GB.
+ */
+static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
+{
+	if (vma->vm_pgoff < start)
+		return (start - vma->vm_pgoff) << PAGE_SHIFT;
+	else
+		return 0;
+}
+
+static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
+{
+	unsigned long t_end;
+
+	if (!end)
+		return vma->vm_end;
+
+	t_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+	if (t_end > vma->vm_end)
+		t_end = vma->vm_end;
+	return t_end;
+}
+
+/*
+ * Called with hugetlb fault mutex held.  Therefore, no more mappings to
+ * this folio can be created while executing the routine.
+ */
+static void hugetlb_unmap_file_folio(struct hstate *h,
+					struct address_space *mapping,
+					struct folio *folio, pgoff_t index)
+{
+	struct rb_root_cached *root = &mapping->i_mmap;
+	struct page *page = &folio->page;
+	struct vm_area_struct *vma;
+	unsigned long v_start;
+	unsigned long v_end;
+	pgoff_t start, end;
+
+	start = index * pages_per_huge_page(h);
+	end = ((index + 1) * pages_per_huge_page(h));
+
+	i_mmap_lock_write(mapping);
+
+	vma_interval_tree_foreach(vma, root, start, end - 1) {
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
+
+		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+			continue;
+
+		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
+				NULL, ZAP_FLAG_DROP_MARKER);
+	}
+
+	i_mmap_unlock_write(mapping);
+}
+
 static void
 hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 		      zap_flags_t zap_flags)
@@ -408,30 +496,13 @@  hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 	 * an inclusive "last".
 	 */
 	vma_interval_tree_foreach(vma, root, start, end ? end - 1 : ULONG_MAX) {
-		unsigned long v_offset;
+		unsigned long v_start;
 		unsigned long v_end;
 
-		/*
-		 * Can the expression below overflow on 32-bit arches?
-		 * No, because the interval tree returns us only those vmas
-		 * which overlap the truncated area starting at pgoff,
-		 * and no vma on a 32-bit arch can span beyond the 4GB.
-		 */
-		if (vma->vm_pgoff < start)
-			v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
-		else
-			v_offset = 0;
-
-		if (!end)
-			v_end = vma->vm_end;
-		else {
-			v_end = ((end - vma->vm_pgoff) << PAGE_SHIFT)
-							+ vma->vm_start;
-			if (v_end > vma->vm_end)
-				v_end = vma->vm_end;
-		}
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
 
-		unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
+		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
 				     NULL, zap_flags);
 	}
 }
@@ -504,14 +575,9 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * the fault mutex.  The mutex will prevent faults
 			 * until we finish removing the folio.
 			 */
-			if (unlikely(folio_mapped(folio))) {
-				i_mmap_lock_write(mapping);
-				hugetlb_vmdelete_list(&mapping->i_mmap,
-					index * pages_per_huge_page(h),
-					(index + 1) * pages_per_huge_page(h),
-					ZAP_FLAG_DROP_MARKER);
-				i_mmap_unlock_write(mapping);
-			}
+			if (unlikely(folio_mapped(folio)))
+				hugetlb_unmap_file_folio(h, mapping, folio,
+							index);
 
 			folio_lock(folio);
 			/*