
[RFC,v4,4/8] hugetlbfs: catch and handle truncate racing with page faults

Message ID 20220706202347.95150-5-mike.kravetz@oracle.com (mailing list archive)
State New
Series hugetlb: Change huge pmd sharing synchronization again

Commit Message

Mike Kravetz July 6, 2022, 8:23 p.m. UTC
Most hugetlb fault handling code checks for faults beyond i_size.
While there are early checks in the code paths, the most difficult
to handle are those discovered after taking the page table lock.
At this point, we have possibly allocated a page and consumed
associated reservations and possibly added the page to the page cache.

When discovering a fault beyond i_size, be sure to:
- Remove the page from page cache, else it will sit there until the
  file is removed.
- Do not restore any reservation for the page consumed.  Otherwise
  there will be an outstanding reservation for an offset beyond the
  end of file.

The 'truncation' code in remove_inode_hugepages must deal with fault
code potentially removing a page/folio from the cache after the page was
returned by filemap_get_folios and before locking the page.  This can be
discovered by a change in folio_mapping() after taking folio lock.  In
addition, this code must deal with fault code potentially consuming
and returning reservations.  To synchronize this, remove_inode_hugepages
will now take the fault mutex for ALL indices in the hole or truncated
range.  In this way, it KNOWS fault code has finished with the page/index
OR fault code will see the updated file size.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 88 ++++++++++++++++++++++++++++++--------------
 mm/hugetlb.c         | 39 +++++++++++++++-----
 2 files changed, 90 insertions(+), 37 deletions(-)

Comments

Miaohe Lin July 27, 2022, 9:20 a.m. UTC | #1
On 2022/7/7 4:23, Mike Kravetz wrote:
> Most hugetlb fault handling code checks for faults beyond i_size.
> While there are early checks in the code paths, the most difficult
> to handle are those discovered after taking the page table lock.
> At this point, we have possibly allocated a page and consumed
> associated reservations and possibly added the page to the page cache.
> 
> When discovering a fault beyond i_size, be sure to:
> - Remove the page from page cache, else it will sit there until the
>   file is removed.
> - Do not restore any reservation for the page consumed.  Otherwise
>   there will be an outstanding reservation for an offset beyond the
>   end of file.
> 
> The 'truncation' code in remove_inode_hugepages must deal with fault
> code potentially removing a page/folio from the cache after the page was
> returned by filemap_get_folios and before locking the page.  This can be
> discovered by a change in folio_mapping() after taking folio lock.  In
> addition, this code must deal with fault code potentially consuming
> and returning reservations.  To synchronize this, remove_inode_hugepages
> will now take the fault mutex for ALL indices in the hole or truncated
> range.  In this way, it KNOWS fault code has finished with the page/index
> OR fault code will see the updated file size.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---

<snip>

> @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  
>  	ptl = huge_pte_lock(h, mm, ptep);
>  	size = i_size_read(mapping->host) >> huge_page_shift(h);
> -	if (idx >= size)
> +	if (idx >= size) {
> +		beyond_i_size = true;

Thanks for your patch. There is one question:

Since races between hugetlb page faults and truncate are guarded by hugetlb_fault_mutex,
do we really need to check i_size again after taking the page table lock?

BTW: I will look into this series further when I have more time. Thanks for your work. :)
Mike Kravetz July 27, 2022, 7 p.m. UTC | #2
On 07/27/22 17:20, Miaohe Lin wrote:
> On 2022/7/7 4:23, Mike Kravetz wrote:
> > Most hugetlb fault handling code checks for faults beyond i_size.
> > While there are early checks in the code paths, the most difficult
> > to handle are those discovered after taking the page table lock.
> > At this point, we have possibly allocated a page and consumed
> > associated reservations and possibly added the page to the page cache.
> > 
> > When discovering a fault beyond i_size, be sure to:
> > - Remove the page from page cache, else it will sit there until the
> >   file is removed.
> > - Do not restore any reservation for the page consumed.  Otherwise
> >   there will be an outstanding reservation for an offset beyond the
> >   end of file.
> > 
> > The 'truncation' code in remove_inode_hugepages must deal with fault
> > code potentially removing a page/folio from the cache after the page was
> > returned by filemap_get_folios and before locking the page.  This can be
> > discovered by a change in folio_mapping() after taking folio lock.  In
> > addition, this code must deal with fault code potentially consuming
> > and returning reservations.  To synchronize this, remove_inode_hugepages
> > will now take the fault mutex for ALL indices in the hole or truncated
> > range.  In this way, it KNOWS fault code has finished with the page/index
> > OR fault code will see the updated file size.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> 
> <snip>
> 
> > @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >  
> >  	ptl = huge_pte_lock(h, mm, ptep);
> >  	size = i_size_read(mapping->host) >> huge_page_shift(h);
> > -	if (idx >= size)
> > +	if (idx >= size) {
> > +		beyond_i_size = true;
> 
> Thanks for your patch. There is one question:
> 
> Since races between hugetlb pagefault and truncate is guarded by hugetlb_fault_mutex,
> do we really need to check it again after taking the page table lock?
> 

Well, the fault mutex can only guard a single hugetlb page.  The fault mutex
is actually an array/table of mutexes hashed by mapping address and file index.
So, during truncation we take the mutex for each page as it is unmapped and
removed.  In other words, the fault mutex only synchronizes operations on one
specific page.  The idea with this patch is to coordinate the fault code and
truncate code when operating on the same page.

In addition, changing the file size happens early in the truncate process
before taking any locks/mutexes.
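
A simplified, kernel-style sketch of that hashing scheme, for illustration
only (the table size and hash below are assumptions, not the actual
hugetlb_fault_mutex_hash()/hugetlb_fault_mutex_table[] code in mm/hugetlb.c):

/* Simplified sketch, not the actual mm/hugetlb.c implementation. */
#define NUM_FAULT_MUTEXES	256	/* assumed power-of-two table size */

static struct mutex fault_mutex_table[NUM_FAULT_MUTEXES];

/* Hash a (mapping, file index) pair to one slot in the table. */
static u32 fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
{
	return hash_long((unsigned long)mapping ^ idx,
			 ilog2(NUM_FAULT_MUTEXES));
}

/*
 * Holding the mutex for index i only serializes against faults on index i
 * of that mapping; faults on other indices hash to other slots and proceed
 * concurrently.  Hence remove_inode_hugepages() in this patch cycles
 * lock/unlock of the mutex for every index in the truncated/punched range.
 */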
Miaohe Lin July 28, 2022, 2:02 a.m. UTC | #3
On 2022/7/28 3:00, Mike Kravetz wrote:
> On 07/27/22 17:20, Miaohe Lin wrote:
>> On 2022/7/7 4:23, Mike Kravetz wrote:
>>> Most hugetlb fault handling code checks for faults beyond i_size.
>>> While there are early checks in the code paths, the most difficult
>>> to handle are those discovered after taking the page table lock.
>>> At this point, we have possibly allocated a page and consumed
>>> associated reservations and possibly added the page to the page cache.
>>>
>>> When discovering a fault beyond i_size, be sure to:
>>> - Remove the page from page cache, else it will sit there until the
>>>   file is removed.
>>> - Do not restore any reservation for the page consumed.  Otherwise
>>>   there will be an outstanding reservation for an offset beyond the
>>>   end of file.
>>>
>>> The 'truncation' code in remove_inode_hugepages must deal with fault
>>> code potentially removing a page/folio from the cache after the page was
>>> returned by filemap_get_folios and before locking the page.  This can be
>>> discovered by a change in folio_mapping() after taking folio lock.  In
>>> addition, this code must deal with fault code potentially consuming
>>> and returning reservations.  To synchronize this, remove_inode_hugepages
>>> will now take the fault mutex for ALL indices in the hole or truncated
>>> range.  In this way, it KNOWS fault code has finished with the page/index
>>> OR fault code will see the updated file size.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>
>> <snip>
>>
>>> @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>  
>>>  	ptl = huge_pte_lock(h, mm, ptep);
>>>  	size = i_size_read(mapping->host) >> huge_page_shift(h);
>>> -	if (idx >= size)
>>> +	if (idx >= size) {
>>> +		beyond_i_size = true;
>>
>> Thanks for your patch. There is one question:
>>
>> Since races between hugetlb pagefault and truncate is guarded by hugetlb_fault_mutex,
>> do we really need to check it again after taking the page table lock?
>>
> 
> Well, the fault mutex can only guard a single hugetlb page.  The fault mutex
> is actually an array/table of mutexes hashed by mapping address and file index.
> So, during truncation we take take the mutex for each page as they are
> unmapped and removed.  So, the fault mutex only synchronizes operations
> on one specific page.  The idea with this patch is to coordinate the fault
> code and truncate code when operating on the same page.
> 
> In addition, changing the file size happens early in the truncate process
> before taking any locks/mutexes.

I wonder whether we can live with that to make the code simpler. If the file size changes
after hugetlb_fault checks i_size but before it takes the page table lock, wouldn't the
truncate code remove the hugetlb page from the page cache for us after hugetlb_fault
finishes, even if we don't roll back when re-checking i_size under the page table lock?

In short: if hugetlb_fault sees a truncated inode, back out early. If not, let the truncate
code do its work. Then we don't need to complicate the already complicated error path. Or
am I missing something?

Thanks.

>
Mike Kravetz July 28, 2022, 4:45 p.m. UTC | #4
On 07/28/22 10:02, Miaohe Lin wrote:
> On 2022/7/28 3:00, Mike Kravetz wrote:
> > On 07/27/22 17:20, Miaohe Lin wrote:
> >> On 2022/7/7 4:23, Mike Kravetz wrote:
> >>> Most hugetlb fault handling code checks for faults beyond i_size.
> >>> While there are early checks in the code paths, the most difficult
> >>> to handle are those discovered after taking the page table lock.
> >>> At this point, we have possibly allocated a page and consumed
> >>> associated reservations and possibly added the page to the page cache.
> >>>
> >>> When discovering a fault beyond i_size, be sure to:
> >>> - Remove the page from page cache, else it will sit there until the
> >>>   file is removed.
> >>> - Do not restore any reservation for the page consumed.  Otherwise
> >>>   there will be an outstanding reservation for an offset beyond the
> >>>   end of file.
> >>>
> >>> The 'truncation' code in remove_inode_hugepages must deal with fault
> >>> code potentially removing a page/folio from the cache after the page was
> >>> returned by filemap_get_folios and before locking the page.  This can be
> >>> discovered by a change in folio_mapping() after taking folio lock.  In
> >>> addition, this code must deal with fault code potentially consuming
> >>> and returning reservations.  To synchronize this, remove_inode_hugepages
> >>> will now take the fault mutex for ALL indices in the hole or truncated
> >>> range.  In this way, it KNOWS fault code has finished with the page/index
> >>> OR fault code will see the updated file size.
> >>>
> >>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>> ---
> >>
> >> <snip>
> >>
> >>> @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>>  
> >>>  	ptl = huge_pte_lock(h, mm, ptep);
> >>>  	size = i_size_read(mapping->host) >> huge_page_shift(h);
> >>> -	if (idx >= size)
> >>> +	if (idx >= size) {
> >>> +		beyond_i_size = true;
> >>
> >> Thanks for your patch. There is one question:
> >>
> >> Since races between hugetlb pagefault and truncate is guarded by hugetlb_fault_mutex,
> >> do we really need to check it again after taking the page table lock?
> >>
> > 
> > Well, the fault mutex can only guard a single hugetlb page.  The fault mutex
> > is actually an array/table of mutexes hashed by mapping address and file index.
> > So, during truncation we take take the mutex for each page as they are
> > unmapped and removed.  So, the fault mutex only synchronizes operations
> > on one specific page.  The idea with this patch is to coordinate the fault
> > code and truncate code when operating on the same page.
> > 
> > In addition, changing the file size happens early in the truncate process
> > before taking any locks/mutexes.
> 
> I wonder whether we can somewhat live with it to make code simpler. When changing the file size happens
> after checking i_size but before taking the page table lock in hugetlb_fault, the truncate code would
> remove the hugetlb page from the page cache for us after hugetlb_fault finishes if we don't roll back
> when checking i_size again under the page table lock?
> 
> In a word, if hugetlb_fault see a truncated inode, back out early. If not, let truncate code does its
> work. So we don't need to complicate the already complicated error path. Or am I miss something?
> 

Thank you! I believe your observations and suggestions are correct.

We can just let the fault code proceed after the early "idx >= size" check,
and let the truncation code remove the page.  This also eliminates the
need for patch 3 (hugetlbfs: move routine remove_huge_page to hugetlb.c).

I will make these changes in the next version.
David Hildenbrand Aug. 5, 2022, 4:28 p.m. UTC | #5
On 28.07.22 18:45, Mike Kravetz wrote:
> On 07/28/22 10:02, Miaohe Lin wrote:
>> On 2022/7/28 3:00, Mike Kravetz wrote:
>>> On 07/27/22 17:20, Miaohe Lin wrote:
>>>> On 2022/7/7 4:23, Mike Kravetz wrote:
>>>>> Most hugetlb fault handling code checks for faults beyond i_size.
>>>>> While there are early checks in the code paths, the most difficult
>>>>> to handle are those discovered after taking the page table lock.
>>>>> At this point, we have possibly allocated a page and consumed
>>>>> associated reservations and possibly added the page to the page cache.
>>>>>
>>>>> When discovering a fault beyond i_size, be sure to:
>>>>> - Remove the page from page cache, else it will sit there until the
>>>>>   file is removed.
>>>>> - Do not restore any reservation for the page consumed.  Otherwise
>>>>>   there will be an outstanding reservation for an offset beyond the
>>>>>   end of file.
>>>>>
>>>>> The 'truncation' code in remove_inode_hugepages must deal with fault
>>>>> code potentially removing a page/folio from the cache after the page was
>>>>> returned by filemap_get_folios and before locking the page.  This can be
>>>>> discovered by a change in folio_mapping() after taking folio lock.  In
>>>>> addition, this code must deal with fault code potentially consuming
>>>>> and returning reservations.  To synchronize this, remove_inode_hugepages
>>>>> will now take the fault mutex for ALL indices in the hole or truncated
>>>>> range.  In this way, it KNOWS fault code has finished with the page/index
>>>>> OR fault code will see the updated file size.
>>>>>
>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>> ---
>>>>
>>>> <snip>
>>>>
>>>>> @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>  
>>>>>  	ptl = huge_pte_lock(h, mm, ptep);
>>>>>  	size = i_size_read(mapping->host) >> huge_page_shift(h);
>>>>> -	if (idx >= size)
>>>>> +	if (idx >= size) {
>>>>> +		beyond_i_size = true;
>>>>
>>>> Thanks for your patch. There is one question:
>>>>
>>>> Since races between hugetlb pagefault and truncate is guarded by hugetlb_fault_mutex,
>>>> do we really need to check it again after taking the page table lock?
>>>>
>>>
>>> Well, the fault mutex can only guard a single hugetlb page.  The fault mutex
>>> is actually an array/table of mutexes hashed by mapping address and file index.
>>> So, during truncation we take take the mutex for each page as they are
>>> unmapped and removed.  So, the fault mutex only synchronizes operations
>>> on one specific page.  The idea with this patch is to coordinate the fault
>>> code and truncate code when operating on the same page.
>>>
>>> In addition, changing the file size happens early in the truncate process
>>> before taking any locks/mutexes.
>>
>> I wonder whether we can somewhat live with it to make code simpler. When changing the file size happens
>> after checking i_size but before taking the page table lock in hugetlb_fault, the truncate code would
>> remove the hugetlb page from the page cache for us after hugetlb_fault finishes if we don't roll back
>> when checking i_size again under the page table lock?
>>
>> In a word, if hugetlb_fault see a truncated inode, back out early. If not, let truncate code does its
>> work. So we don't need to complicate the already complicated error path. Or am I miss something?
>>
> 
> Thank you! I believe your observations and suggestions are correct.
> 
> We can just let the fault code proceed after the early "idx >= size",
> and let the truncation code remove the page.  This also eliminates the
> need for patch 3 (hugetlbfs: move routine remove_huge_page to hugetlb.c).

At least renaming the functions would be very welcome nonetheless :)

> 
> I will make these changes in the next version.

Just so I understand correctly, we want to let fault handling code back
out early if we find any incompatible change, and simply retry the
fault? I'm thinking about some kind of a high-level seqcount.
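
Roughly, such a high-level seqcount might look like the sketch below (purely
hypothetical; 'truncate_seq' is an invented field and nothing like this
exists in the hugetlb code today):

/* Hypothetical sketch only -- not existing hugetlb code. */

/* truncate / hole-punch side (writer): */
write_seqcount_begin(&info->truncate_seq);
i_size_write(inode, offset);
/* ... unmap and remove pages from the page cache ... */
write_seqcount_end(&info->truncate_seq);

/* fault side (reader): */
seq = read_seqcount_begin(&info->truncate_seq);
/* ... check i_size, allocate the page, prepare to install the pte ... */
if (read_seqcount_retry(&info->truncate_seq, seq)) {
	/* a truncate ran concurrently: back out and retry the whole fault */
	return VM_FAULT_RETRY;
}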
Mike Kravetz Aug. 5, 2022, 10:41 p.m. UTC | #6
On 08/05/22 18:28, David Hildenbrand wrote:
> On 28.07.22 18:45, Mike Kravetz wrote:
> > On 07/28/22 10:02, Miaohe Lin wrote:
> >> On 2022/7/28 3:00, Mike Kravetz wrote:
> >>> On 07/27/22 17:20, Miaohe Lin wrote:
> >>>> On 2022/7/7 4:23, Mike Kravetz wrote:
> >>>>> Most hugetlb fault handling code checks for faults beyond i_size.
> >>>>> While there are early checks in the code paths, the most difficult
> >>>>> to handle are those discovered after taking the page table lock.
> >>>>> At this point, we have possibly allocated a page and consumed
> >>>>> associated reservations and possibly added the page to the page cache.
> >>>>>
> >>>>> When discovering a fault beyond i_size, be sure to:
> >>>>> - Remove the page from page cache, else it will sit there until the
> >>>>>   file is removed.
> >>>>> - Do not restore any reservation for the page consumed.  Otherwise
> >>>>>   there will be an outstanding reservation for an offset beyond the
> >>>>>   end of file.
> >>>>>
> >>>>> The 'truncation' code in remove_inode_hugepages must deal with fault
> >>>>> code potentially removing a page/folio from the cache after the page was
> >>>>> returned by filemap_get_folios and before locking the page.  This can be
> >>>>> discovered by a change in folio_mapping() after taking folio lock.  In
> >>>>> addition, this code must deal with fault code potentially consuming
> >>>>> and returning reservations.  To synchronize this, remove_inode_hugepages
> >>>>> will now take the fault mutex for ALL indices in the hole or truncated
> >>>>> range.  In this way, it KNOWS fault code has finished with the page/index
> >>>>> OR fault code will see the updated file size.
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>>> ---
> >>>>
> >>>> <snip>
> >>>>
> >>>>> @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>>>>  
> >>>>>  	ptl = huge_pte_lock(h, mm, ptep);
> >>>>>  	size = i_size_read(mapping->host) >> huge_page_shift(h);
> >>>>> -	if (idx >= size)
> >>>>> +	if (idx >= size) {
> >>>>> +		beyond_i_size = true;
> >>>>
> >>>> Thanks for your patch. There is one question:
> >>>>
> >>>> Since races between hugetlb pagefault and truncate is guarded by hugetlb_fault_mutex,
> >>>> do we really need to check it again after taking the page table lock?
> >>>>
> >>>
> >>> Well, the fault mutex can only guard a single hugetlb page.  The fault mutex
> >>> is actually an array/table of mutexes hashed by mapping address and file index.
> >>> So, during truncation we take take the mutex for each page as they are
> >>> unmapped and removed.  So, the fault mutex only synchronizes operations
> >>> on one specific page.  The idea with this patch is to coordinate the fault
> >>> code and truncate code when operating on the same page.
> >>>
> >>> In addition, changing the file size happens early in the truncate process
> >>> before taking any locks/mutexes.
> >>
> >> I wonder whether we can somewhat live with it to make code simpler. When changing the file size happens
> >> after checking i_size but before taking the page table lock in hugetlb_fault, the truncate code would
> >> remove the hugetlb page from the page cache for us after hugetlb_fault finishes if we don't roll back
> >> when checking i_size again under the page table lock?
> >>
> >> In a word, if hugetlb_fault see a truncated inode, back out early. If not, let truncate code does its
> >> work. So we don't need to complicate the already complicated error path. Or am I miss something?
> >>
> > 
> > Thank you! I believe your observations and suggestions are correct.
> > 
> > We can just let the fault code proceed after the early "idx >= size",
> > and let the truncation code remove the page.  This also eliminates the
> > need for patch 3 (hugetlbfs: move routine remove_huge_page to hugetlb.c).
> 
> At least renaming the functions would be very welcome nonetheless :)

Agree.

> 
> > 
> > I will make these changes in the next version.
> 
> Just so I understand correctly, we want to let fault handling code back
> out early if we find any incompatible change, and simply retry the
> fault? I'm thinking about some kind of a high-level seqcount.
> 

Not exactly.

In the routine hugetlb_no_page, there are two (no, three) places where we
check for races with truncation to see if the fault is beyond the end of
the file.  The first two are before adding a newly allocated page to the
page cache.  The third check is after taking the page table lock to
update the pte.

The idea is to eliminate this third check that requires backing out the
page from the cache.  So, it is 'possible' that the fault code could add
a page beyond i_size.  With the updates to the truncation code (actually
remove_inode_hugepages), we know that this page beyond i_size will be
removed by the racing truncation code.

Hope that makes sense.
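
A heavily abridged sketch of the hugetlb_no_page() flow being discussed
(not the real function, just the shape of the checks; see mm/hugetlb.c):

static vm_fault_t hugetlb_no_page(...)
{
	...
	/* early check(s), before allocating or adding to the page cache */
	size = i_size_read(mapping->host) >> huge_page_shift(h);
	if (idx >= size)
		goto out;

	page = alloc_huge_page(vma, haddr, 0);	/* may consume a reservation */
	...
	if (vma->vm_flags & VM_MAYSHARE)
		hugetlb_add_to_page_cache(page, mapping, idx);
	...
	ptl = huge_pte_lock(h, mm, ptep);
	/* third check, under the page table lock -- the one to be eliminated */
	size = i_size_read(mapping->host) >> huge_page_shift(h);
	if (idx >= size)
		goto backout;	/* backing the page out here is the messy part */
	...
}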

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a878c672cf6d..31bd4325fce5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -443,11 +443,10 @@  hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  * truncation is indicated by end of range being LLONG_MAX
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() prevents page faults in the
- *	truncated range.  It checks i_size before allocation, and again after
- *	with the page table lock for the page held.  The same lock must be
- *	acquired to unmap a page.
+ *	maps and global counts.  Page faults can race with truncation.
+ *	During faults, hugetlb_no_page() checks i_size before page allocation,
+ *	and again after	obtaining page table lock.  It will 'back out'
+ *	allocations in the truncated range.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -456,27 +455,46 @@  hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  *	This is indicated if we find a mapped page.
  * Note: If the passed end of range value is beyond the end of file, but
  * not LLONG_MAX this routine still performs a hole punch operation.
+ *
+ * Since page faults can race with this routine, care must be taken as both
+ * modify huge page reservation data.  To somewhat synchronize these operations
+ * the hugetlb fault mutex is taken for EVERY index in the range to be hole
+ * punched or truncated.  In this way, we KNOW fault code will either have
+ * completed backout operations under the mutex, or fault code will see the
+ * updated file size and not allocate a page for offsets beyond truncated size.
+ * The parameter 'lm_end' indicates the offset of the end of hole or file
+ * before truncation.  For hole punch lm_end == lend.
  */
 static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
-				   loff_t lend)
+				   loff_t lend, loff_t lm_end)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
 	const pgoff_t end = lend >> huge_page_shift(h);
+	pgoff_t m_end = lm_end >> huge_page_shift(h);
+	pgoff_t m_start, m_index;
 	struct folio_batch fbatch;
 	pgoff_t next, index;
 	int i, freed = 0;
+	u32 hash;
 	bool truncate_op = (lend == LLONG_MAX);
 
 	folio_batch_init(&fbatch);
-	next = start;
+	next = m_start = start;
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
 		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
 			struct folio *folio = fbatch.folios[i];
-			u32 hash = 0;
 
 			index = folio->index;
+			/* Take fault mutex for missing folios before index */
+			for (m_index = m_start; m_index < index; m_index++) {
+				hash = hugetlb_fault_mutex_hash(mapping,
+								m_index);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			}
+			m_start = index + 1;
 			hash = hugetlb_fault_mutex_hash(mapping, index);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -485,13 +503,8 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * unmapped in caller.  Unmap (again) now after taking
 			 * the fault mutex.  The mutex will prevent faults
 			 * until we finish removing the folio.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
 			 */
 			if (unlikely(folio_mapped(folio))) {
-				BUG_ON(truncate_op);
-
 				i_mmap_lock_write(mapping);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
@@ -502,20 +515,30 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 			folio_lock(folio);
 			/*
-			 * We must free the huge page and remove from page
-			 * cache BEFORE removing the region/reserve map
-			 * (hugetlb_unreserve_pages).  In rare out of memory
-			 * conditions, removal of the region/reserve map could
-			 * fail. Correspondingly, the subpool and global
-			 * reserve usage count can need to be adjusted.
+			 * After locking page, make sure mapping is the same.
+			 * We could have raced with page fault populate and
+			 * backout code.
 			 */
-			VM_BUG_ON(HPageRestoreReserve(&folio->page));
-			hugetlb_delete_from_page_cache(&folio->page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
+			if (folio_mapping(folio) == mapping) {
+				/*
+				 * We must free the folio and remove from
+				 * page cache BEFORE removing the region/
+				 * reserve map (hugetlb_unreserve_pages).  In
+				 * rare out of memory conditions, removal of
+				 * the region/reserve map could fail.
+				 * Correspondingly, the subpool and global
+				 * reserve usage count can need to be adjusted.
+				 */
+				VM_BUG_ON(HPageRestoreReserve(&folio->page));
+				hugetlb_delete_from_page_cache(&folio->page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(
+						hugetlb_unreserve_pages(inode,
 							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
+						hugetlb_fix_reserve_counts(
+							inode);
+				}
 			}
 
 			folio_unlock(folio);
@@ -525,6 +548,13 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		cond_resched();
 	}
 
+	/* Take fault mutex for missing folios at end of range */
+	for (m_index = m_start; m_index < m_end; m_index++) {
+		hash = hugetlb_fault_mutex_hash(mapping, m_index);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	}
+
 	if (truncate_op)
 		(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
 }
@@ -532,8 +562,9 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
+	loff_t prev_size = i_size_read(inode);
 
-	remove_inode_hugepages(inode, 0, LLONG_MAX);
+	remove_inode_hugepages(inode, 0, LLONG_MAX, prev_size);
 
 	/*
 	 * Get the resv_map from the address space embedded in the inode.
@@ -553,6 +584,7 @@  static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	loff_t prev_size = i_size_read(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
@@ -563,7 +595,7 @@  static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
-	remove_inode_hugepages(inode, offset, LLONG_MAX);
+	remove_inode_hugepages(inode, offset, LLONG_MAX, prev_size);
 }
 
 static void hugetlbfs_zero_partial_page(struct hstate *h,
@@ -635,7 +667,7 @@  static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	/* Remove full pages from the file. */
 	if (hole_end > hole_start)
-		remove_inode_hugepages(inode, hole_start, hole_end);
+		remove_inode_hugepages(inode, hole_start, hole_end, hole_end);
 
 	inode_unlock(inode);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9f320c676e4..25f644a3a981 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5491,6 +5491,8 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool beyond_i_size = false;
+	bool reserve_alloc = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5548,6 +5550,8 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (HPageRestoreReserve(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5606,8 +5610,10 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
+	if (idx >= size) {
+		beyond_i_size = true;
 		goto backout;
+	}
 
 	ret = 0;
 	/* If pte changed from under us, retry */
@@ -5652,10 +5658,25 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
+	if (new_page) {
+		if (new_pagecache_page)
+			hugetlb_delete_from_page_cache(page);
+
+		/*
+		 * If reserve was consumed, make sure flag is set so that it
+		 * will be restored in free_huge_page().
+		 */
+		if (reserve_alloc)
+			SetHPageRestoreReserve(page);
+
+		/*
+		 * Do not restore reserve map entries beyond i_size.
+		 * Otherwise, there will be leaks when the file is removed.
+		 */
+		if (!beyond_i_size)
+			restore_reserve_on_error(h, vma, haddr, page);
+	}
 	unlock_page(page);
-	/* restore reserve for newly allocated pages not in page cache */
-	if (new_page && !new_pagecache_page)
-		restore_reserve_on_error(h, vma, haddr, page);
 	put_page(page);
 	goto out;
 }
@@ -5975,15 +5996,15 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * Recheck the i_size after holding PT lock to make sure not
 	 * to leave any page mapped (as page_mapped()) beyond the end
 	 * of the i_size (remove_inode_hugepages() is strict about
-	 * enforcing that). If we bail out here, we'll also leave a
-	 * page in the radix tree in the vm_shared case beyond the end
-	 * of the i_size, but remove_inode_hugepages() will take care
-	 * of it as soon as we drop the hugetlb_fault_mutex_table.
+	 * enforcing that). If we bail out here, remove the page
+	 * added to the radix tree.
 	 */
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	ret = -EFAULT;
-	if (idx >= size)
+	if (idx >= size) {
+		hugetlb_delete_from_page_cache(page);
 		goto out_release_unlock;
+	}
 
 	ret = -EEXIST;
 	/*