diff mbox series

[v2] mm: shmem: allow split THP when truncating THP partially

Message ID 1575420174-19171-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series [v2] mm: shmem: allow split THP when truncating THP partially | expand

Commit Message

Yang Shi Dec. 4, 2019, 12:42 a.m. UTC
Currently when truncating shmem file, if the range is partial of THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed unless the range cover the whole THP.
Even though all the subpages are truncated (randomly or sequentially),
the THP may still be kept in page cache.  This might be fine for some
usecases which prefer preserving THP.

But, when doing balloon inflation in QEMU, QEMU actually does hole punch
or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
So, when using shmem THP as memory backend QEMU inflation actually doesn't
work as expected since it doesn't free memory.  But, the inflation
usecase really needs get the memory freed.  Anonymous THP will not get
freed right away too but it will be freed eventually when all subpages are
unmapped, but shmem THP would still stay in page cache.

Split THP right away when doing partial hole punch, and if split fails
just clear the page so that read to the hole punched area would return
zero.

Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
v2: * Adopted the comment from Kirill.
    * Dropped fallocate mode flag, THP split is the default behavior.
    * Blended Huge's implementation with my v1 patch. TBH I'm not very keen to
      Hugh's find_get_entries() hack (basically neutral), but without that hack
      we have to rely on pagevec_release() to release extra pins and play with
      goto. This version does in this way. The patch is bigger than Hugh's due
      to extra comments to make the flow clear.

 mm/shmem.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 83 insertions(+), 37 deletions(-)

Comments

Hugh Dickins Dec. 5, 2019, 12:15 a.m. UTC | #1
On Wed, 4 Dec 2019, Yang Shi wrote:

> Currently when truncating shmem file, if the range is partial of THP
> (start or end is in the middle of THP), the pages actually will just get
> cleared rather than being freed unless the range cover the whole THP.
> Even though all the subpages are truncated (randomly or sequentially),
> the THP may still be kept in page cache.  This might be fine for some
> usecases which prefer preserving THP.
> 
> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> So, when using shmem THP as memory backend QEMU inflation actually doesn't
> work as expected since it doesn't free memory.  But, the inflation
> usecase really needs get the memory freed.  Anonymous THP will not get
> freed right away too but it will be freed eventually when all subpages are
> unmapped, but shmem THP would still stay in page cache.
> 
> Split THP right away when doing partial hole punch, and if split fails
> just clear the page so that read to the hole punched area would return
> zero.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
> v2: * Adopted the comment from Kirill.
>     * Dropped fallocate mode flag, THP split is the default behavior.
>     * Blended Huge's implementation with my v1 patch. TBH I'm not very keen to
>       Hugh's find_get_entries() hack (basically neutral), but without that hack

Thanks for giving it a try.  I'm not neutral about my find_get_entries()
hack: it surely had to go (without it, I'd have just pushed my own patch).
I've not noticed anything wrong with your patch, and it's in the right
direction, but I'm still not thrilled with it.  I also remember that I
got the looping wrong in my first internal attempt (fixed in what I sent),
and need to be very sure of the try-again-versus-move-on-to-next conditions
before agreeing to anything.  No rush, I'll come back to this in days or
month ahead: I'll try to find a less gotoey blend of yours and mine.

Hugh

>       we have to rely on pagevec_release() to release extra pins and play with
>       goto. This version does in this way. The patch is bigger than Hugh's due
>       to extra comments to make the flow clear.
> 
>  mm/shmem.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 83 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 220be9f..1ae0c7f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  	long nr_swaps_freed = 0;
>  	pgoff_t index;
>  	int i;
> +	bool split = false;
> +	struct page *page = NULL;
>  
>  	if (lend == -1)
>  		end = -1;	/* unsigned, so actually very big */
>  
>  	pagevec_init(&pvec);
>  	index = start;
> +retry:
>  	while (index < end) {
>  		pvec.nr = find_get_entries(mapping, index,
>  			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> @@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  		if (!pvec.nr)
>  			break;
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
> -			struct page *page = pvec.pages[i];
> +			split = false;
> +			page = pvec.pages[i];
>  
>  			index = indices[i];
>  			if (index >= end)
> @@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			if (!trylock_page(page))
>  				continue;
>  
> -			if (PageTransTail(page)) {
> -				/* Middle of THP: zero out the page */
> -				clear_highpage(page);
> -				unlock_page(page);
> -				continue;
> -			} else if (PageTransHuge(page)) {
> -				if (index == round_down(end, HPAGE_PMD_NR)) {
> +			if (PageTransCompound(page) && !unfalloc) {
> +				if (PageHead(page) &&
> +				    index != round_down(end, HPAGE_PMD_NR)) {
>  					/*
> -					 * Range ends in the middle of THP:
> -					 * zero out the page
> +					 * Fall through when punching whole
> +					 * THP.
>  					 */
> -					clear_highpage(page);
> -					unlock_page(page);
> -					continue;
> +					index += HPAGE_PMD_NR - 1;
> +					i += HPAGE_PMD_NR - 1;
> +				} else {
> +					/*
> +					 * Split THP for any partial hole
> +					 * punch.
> +					 */
> +					get_page(page);
> +					split = true;
> +					goto split;
>  				}
> -				index += HPAGE_PMD_NR - 1;
> -				i += HPAGE_PMD_NR - 1;
>  			}
>  
>  			if (!unfalloc || !PageUptodate(page)) {
> @@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			}
>  			unlock_page(page);
>  		}
> +split:
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		cond_resched();
> +
> +		if (split) {
> +			/*
> +			 * The pagevec_release() released all extra pins
> +			 * from pagevec lookup.  And we hold an extra pin
> +			 * and still have the page locked under us.
> +			 */
> +			if (!split_huge_page(page)) {
> +				unlock_page(page);
> +				put_page(page);
> +				/* Re-lookup page cache from current index */
> +				goto retry;
> +			}
> +
> +			/* Fail to split THP, move to next index */
> +			unlock_page(page);
> +			put_page(page);
> +		}
> +
>  		index++;
>  	}
>  
> @@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  		return;
>  
>  	index = start;
> +again:
>  	while (index < end) {
>  		cond_resched();
>  
> @@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			continue;
>  		}
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
> -			struct page *page = pvec.pages[i];
> +			split = false;
> +			page = pvec.pages[i];
>  
>  			index = indices[i];
>  			if (index >= end)
> @@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  
>  			lock_page(page);
>  
> -			if (PageTransTail(page)) {
> -				/* Middle of THP: zero out the page */
> -				clear_highpage(page);
> -				unlock_page(page);
> -				/*
> -				 * Partial thp truncate due 'start' in middle
> -				 * of THP: don't need to look on these pages
> -				 * again on !pvec.nr restart.
> -				 */
> -				if (index != round_down(end, HPAGE_PMD_NR))
> -					start++;
> -				continue;
> -			} else if (PageTransHuge(page)) {
> -				if (index == round_down(end, HPAGE_PMD_NR)) {
> +			if (PageTransCompound(page) && !unfalloc) {
> +				if (PageHead(page) &&
> +				    index != round_down(end, HPAGE_PMD_NR)) {
>  					/*
> -					 * Range ends in the middle of THP:
> -					 * zero out the page
> +					 * Fall through when punching whole
> +					 * THP.
>  					 */
> -					clear_highpage(page);
> -					unlock_page(page);
> -					continue;
> +					index += HPAGE_PMD_NR - 1;
> +					i += HPAGE_PMD_NR - 1;
> +				} else {
> +					/*
> +					 * Split THP for any partial hole
> +					 * punch.
> +					 */
> +					get_page(page);
> +					split = true;
> +					goto rescan_split;
>  				}
> -				index += HPAGE_PMD_NR - 1;
> -				i += HPAGE_PMD_NR - 1;
>  			}
>  
>  			if (!unfalloc || !PageUptodate(page)) {
> @@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			}
>  			unlock_page(page);
>  		}
> +rescan_split:
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
> +
> +		if (split) {
> +			/*
> +			 * The pagevec_release() released all extra pins
> +			 * from pagevec lookup.  And we hold an extra pin
> +			 * and still have the page locked under us.
> +			 */
> +			if (!split_huge_page(page)) {
> +				unlock_page(page);
> +				put_page(page);
> +				/* Re-lookup page cache from current index */
> +				goto again;
> +			}
> +
> +			/*
> +			 * Split fail, clear the page then move to next
> +			 * index.
> +			 */
> +			clear_highpage(page);
> +
> +			unlock_page(page);
> +			put_page(page);
> +		}
> +
>  		index++;
>  	}
>  
> -- 
> 1.8.3.1
> 
>
Yang Shi Dec. 5, 2019, 12:50 a.m. UTC | #2
On 12/4/19 4:15 PM, Hugh Dickins wrote:
> On Wed, 4 Dec 2019, Yang Shi wrote:
>
>> Currently when truncating shmem file, if the range is partial of THP
>> (start or end is in the middle of THP), the pages actually will just get
>> cleared rather than being freed unless the range cover the whole THP.
>> Even though all the subpages are truncated (randomly or sequentially),
>> the THP may still be kept in page cache.  This might be fine for some
>> usecases which prefer preserving THP.
>>
>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>> work as expected since it doesn't free memory.  But, the inflation
>> usecase really needs get the memory freed.  Anonymous THP will not get
>> freed right away too but it will be freed eventually when all subpages are
>> unmapped, but shmem THP would still stay in page cache.
>>
>> Split THP right away when doing partial hole punch, and if split fails
>> just clear the page so that read to the hole punched area would return
>> zero.
>>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> v2: * Adopted the comment from Kirill.
>>      * Dropped fallocate mode flag, THP split is the default behavior.
>>      * Blended Huge's implementation with my v1 patch. TBH I'm not very keen to
>>        Hugh's find_get_entries() hack (basically neutral), but without that hack
> Thanks for giving it a try.  I'm not neutral about my find_get_entries()
> hack: it surely had to go (without it, I'd have just pushed my own patch).

We are on the same page :-)

> I've not noticed anything wrong with your patch, and it's in the right
> direction, but I'm still not thrilled with it.  I also remember that I
> got the looping wrong in my first internal attempt (fixed in what I sent),
> and need to be very sure of the try-again-versus-move-on-to-next conditions
> before agreeing to anything.  No rush, I'll come back to this in days or
> month ahead: I'll try to find a less gotoey blend of yours and mine.

Yes, those goto look a little bit convoluted so I added a lot comments 
to improve the readability. Thanks for your time.

>
> Hugh
>
>>        we have to rely on pagevec_release() to release extra pins and play with
>>        goto. This version does in this way. The patch is bigger than Hugh's due
>>        to extra comments to make the flow clear.
>>
>>   mm/shmem.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------------------
>>   1 file changed, 83 insertions(+), 37 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 220be9f..1ae0c7f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   	long nr_swaps_freed = 0;
>>   	pgoff_t index;
>>   	int i;
>> +	bool split = false;
>> +	struct page *page = NULL;
>>   
>>   	if (lend == -1)
>>   		end = -1;	/* unsigned, so actually very big */
>>   
>>   	pagevec_init(&pvec);
>>   	index = start;
>> +retry:
>>   	while (index < end) {
>>   		pvec.nr = find_get_entries(mapping, index,
>>   			min(end - index, (pgoff_t)PAGEVEC_SIZE),
>> @@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		if (!pvec.nr)
>>   			break;
>>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>   
>>   			index = indices[i];
>>   			if (index >= end)
>> @@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			if (!trylock_page(page))
>>   				continue;
>>   
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>   					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>   					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto split;
>>   				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>   			}
>>   
>>   			if (!unfalloc || !PageUptodate(page)) {
>> @@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			}
>>   			unlock_page(page);
>>   		}
>> +split:
>>   		pagevec_remove_exceptionals(&pvec);
>>   		pagevec_release(&pvec);
>>   		cond_resched();
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup.  And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto retry;
>> +			}
>> +
>> +			/* Fail to split THP, move to next index */
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>   		index++;
>>   	}
>>   
>> @@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		return;
>>   
>>   	index = start;
>> +again:
>>   	while (index < end) {
>>   		cond_resched();
>>   
>> @@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			continue;
>>   		}
>>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>   
>>   			index = indices[i];
>>   			if (index >= end)
>> @@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   
>>   			lock_page(page);
>>   
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				/*
>> -				 * Partial thp truncate due 'start' in middle
>> -				 * of THP: don't need to look on these pages
>> -				 * again on !pvec.nr restart.
>> -				 */
>> -				if (index != round_down(end, HPAGE_PMD_NR))
>> -					start++;
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>   					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>   					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto rescan_split;
>>   				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>   			}
>>   
>>   			if (!unfalloc || !PageUptodate(page)) {
>> @@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			}
>>   			unlock_page(page);
>>   		}
>> +rescan_split:
>>   		pagevec_remove_exceptionals(&pvec);
>>   		pagevec_release(&pvec);
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup.  And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto again;
>> +			}
>> +
>> +			/*
>> +			 * Split fail, clear the page then move to next
>> +			 * index.
>> +			 */
>> +			clear_highpage(page);
>> +
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>   		index++;
>>   	}
>>   
>> -- 
>> 1.8.3.1
>>
>>
Yang Shi Jan. 14, 2020, 7:28 p.m. UTC | #3
On 12/4/19 4:15 PM, Hugh Dickins wrote:
> On Wed, 4 Dec 2019, Yang Shi wrote:
>
>> Currently when truncating shmem file, if the range is partial of THP
>> (start or end is in the middle of THP), the pages actually will just get
>> cleared rather than being freed unless the range cover the whole THP.
>> Even though all the subpages are truncated (randomly or sequentially),
>> the THP may still be kept in page cache.  This might be fine for some
>> usecases which prefer preserving THP.
>>
>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>> work as expected since it doesn't free memory.  But, the inflation
>> usecase really needs get the memory freed.  Anonymous THP will not get
>> freed right away too but it will be freed eventually when all subpages are
>> unmapped, but shmem THP would still stay in page cache.
>>
>> Split THP right away when doing partial hole punch, and if split fails
>> just clear the page so that read to the hole punched area would return
>> zero.
>>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> v2: * Adopted the comment from Kirill.
>>      * Dropped fallocate mode flag, THP split is the default behavior.
>>      * Blended Huge's implementation with my v1 patch. TBH I'm not very keen to
>>        Hugh's find_get_entries() hack (basically neutral), but without that hack
> Thanks for giving it a try.  I'm not neutral about my find_get_entries()
> hack: it surely had to go (without it, I'd have just pushed my own patch).
> I've not noticed anything wrong with your patch, and it's in the right
> direction, but I'm still not thrilled with it.  I also remember that I
> got the looping wrong in my first internal attempt (fixed in what I sent),
> and need to be very sure of the try-again-versus-move-on-to-next conditions
> before agreeing to anything.  No rush, I'll come back to this in days or
> month ahead: I'll try to find a less gotoey blend of yours and mine.

Hi Hugh,

Any update on this one?

Thanks,
Yang

>
> Hugh
>
>>        we have to rely on pagevec_release() to release extra pins and play with
>>        goto. This version does in this way. The patch is bigger than Hugh's due
>>        to extra comments to make the flow clear.
>>
>>   mm/shmem.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------------------
>>   1 file changed, 83 insertions(+), 37 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 220be9f..1ae0c7f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   	long nr_swaps_freed = 0;
>>   	pgoff_t index;
>>   	int i;
>> +	bool split = false;
>> +	struct page *page = NULL;
>>   
>>   	if (lend == -1)
>>   		end = -1;	/* unsigned, so actually very big */
>>   
>>   	pagevec_init(&pvec);
>>   	index = start;
>> +retry:
>>   	while (index < end) {
>>   		pvec.nr = find_get_entries(mapping, index,
>>   			min(end - index, (pgoff_t)PAGEVEC_SIZE),
>> @@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		if (!pvec.nr)
>>   			break;
>>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>   
>>   			index = indices[i];
>>   			if (index >= end)
>> @@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			if (!trylock_page(page))
>>   				continue;
>>   
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>   					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>   					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto split;
>>   				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>   			}
>>   
>>   			if (!unfalloc || !PageUptodate(page)) {
>> @@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			}
>>   			unlock_page(page);
>>   		}
>> +split:
>>   		pagevec_remove_exceptionals(&pvec);
>>   		pagevec_release(&pvec);
>>   		cond_resched();
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup.  And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto retry;
>> +			}
>> +
>> +			/* Fail to split THP, move to next index */
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>   		index++;
>>   	}
>>   
>> @@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		return;
>>   
>>   	index = start;
>> +again:
>>   	while (index < end) {
>>   		cond_resched();
>>   
>> @@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			continue;
>>   		}
>>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>   
>>   			index = indices[i];
>>   			if (index >= end)
>> @@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   
>>   			lock_page(page);
>>   
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				/*
>> -				 * Partial thp truncate due 'start' in middle
>> -				 * of THP: don't need to look on these pages
>> -				 * again on !pvec.nr restart.
>> -				 */
>> -				if (index != round_down(end, HPAGE_PMD_NR))
>> -					start++;
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>   					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>   					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto rescan_split;
>>   				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>   			}
>>   
>>   			if (!unfalloc || !PageUptodate(page)) {
>> @@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			}
>>   			unlock_page(page);
>>   		}
>> +rescan_split:
>>   		pagevec_remove_exceptionals(&pvec);
>>   		pagevec_release(&pvec);
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup.  And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto again;
>> +			}
>> +
>> +			/*
>> +			 * Split fail, clear the page then move to next
>> +			 * index.
>> +			 */
>> +			clear_highpage(page);
>> +
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>   		index++;
>>   	}
>>   
>> -- 
>> 1.8.3.1
>>
>>
Yang Shi Feb. 4, 2020, 11:27 p.m. UTC | #4
On 1/14/20 11:28 AM, Yang Shi wrote:
>
>
> On 12/4/19 4:15 PM, Hugh Dickins wrote:
>> On Wed, 4 Dec 2019, Yang Shi wrote:
>>
>>> Currently when truncating shmem file, if the range is partial of THP
>>> (start or end is in the middle of THP), the pages actually will just 
>>> get
>>> cleared rather than being freed unless the range cover the whole THP.
>>> Even though all the subpages are truncated (randomly or sequentially),
>>> the THP may still be kept in page cache.  This might be fine for some
>>> usecases which prefer preserving THP.
>>>
>>> But, when doing balloon inflation in QEMU, QEMU actually does hole 
>>> punch
>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not 
>>> used.
>>> So, when using shmem THP as memory backend QEMU inflation actually 
>>> doesn't
>>> work as expected since it doesn't free memory.  But, the inflation
>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>> freed right away too but it will be freed eventually when all 
>>> subpages are
>>> unmapped, but shmem THP would still stay in page cache.
>>>
>>> Split THP right away when doing partial hole punch, and if split fails
>>> just clear the page so that read to the hole punched area would return
>>> zero.
>>>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> ---
>>> v2: * Adopted the comment from Kirill.
>>>      * Dropped fallocate mode flag, THP split is the default behavior.
>>>      * Blended Huge's implementation with my v1 patch. TBH I'm not 
>>> very keen to
>>>        Hugh's find_get_entries() hack (basically neutral), but 
>>> without that hack
>> Thanks for giving it a try.  I'm not neutral about my find_get_entries()
>> hack: it surely had to go (without it, I'd have just pushed my own 
>> patch).
>> I've not noticed anything wrong with your patch, and it's in the right
>> direction, but I'm still not thrilled with it.  I also remember that I
>> got the looping wrong in my first internal attempt (fixed in what I 
>> sent),
>> and need to be very sure of the try-again-versus-move-on-to-next 
>> conditions
>> before agreeing to anything.  No rush, I'll come back to this in days or
>> month ahead: I'll try to find a less gotoey blend of yours and mine.
>
> Hi Hugh,
>
> Any update on this one?
>
> Thanks,
> Yang

Hi Hugh,

Ping. Any comment on this? I really hope it can make v5.7.

Thanks,
Yang

>
>>
>> Hugh
>>
>>>        we have to rely on pagevec_release() to release extra pins 
>>> and play with
>>>        goto. This version does in this way. The patch is bigger than 
>>> Hugh's due
>>>        to extra comments to make the flow clear.
>>>
>>>   mm/shmem.c | 120 
>>> ++++++++++++++++++++++++++++++++++++++++++-------------------
>>>   1 file changed, 83 insertions(+), 37 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 220be9f..1ae0c7f 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>       long nr_swaps_freed = 0;
>>>       pgoff_t index;
>>>       int i;
>>> +    bool split = false;
>>> +    struct page *page = NULL;
>>>         if (lend == -1)
>>>           end = -1;    /* unsigned, so actually very big */
>>>         pagevec_init(&pvec);
>>>       index = start;
>>> +retry:
>>>       while (index < end) {
>>>           pvec.nr = find_get_entries(mapping, index,
>>>               min(end - index, (pgoff_t)PAGEVEC_SIZE),
>>> @@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>           if (!pvec.nr)
>>>               break;
>>>           for (i = 0; i < pagevec_count(&pvec); i++) {
>>> -            struct page *page = pvec.pages[i];
>>> +            split = false;
>>> +            page = pvec.pages[i];
>>>                 index = indices[i];
>>>               if (index >= end)
>>> @@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>               if (!trylock_page(page))
>>>                   continue;
>>>   -            if (PageTransTail(page)) {
>>> -                /* Middle of THP: zero out the page */
>>> -                clear_highpage(page);
>>> -                unlock_page(page);
>>> -                continue;
>>> -            } else if (PageTransHuge(page)) {
>>> -                if (index == round_down(end, HPAGE_PMD_NR)) {
>>> +            if (PageTransCompound(page) && !unfalloc) {
>>> +                if (PageHead(page) &&
>>> +                    index != round_down(end, HPAGE_PMD_NR)) {
>>>                       /*
>>> -                     * Range ends in the middle of THP:
>>> -                     * zero out the page
>>> +                     * Fall through when punching whole
>>> +                     * THP.
>>>                        */
>>> -                    clear_highpage(page);
>>> -                    unlock_page(page);
>>> -                    continue;
>>> +                    index += HPAGE_PMD_NR - 1;
>>> +                    i += HPAGE_PMD_NR - 1;
>>> +                } else {
>>> +                    /*
>>> +                     * Split THP for any partial hole
>>> +                     * punch.
>>> +                     */
>>> +                    get_page(page);
>>> +                    split = true;
>>> +                    goto split;
>>>                   }
>>> -                index += HPAGE_PMD_NR - 1;
>>> -                i += HPAGE_PMD_NR - 1;
>>>               }
>>>                 if (!unfalloc || !PageUptodate(page)) {
>>> @@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>               }
>>>               unlock_page(page);
>>>           }
>>> +split:
>>>           pagevec_remove_exceptionals(&pvec);
>>>           pagevec_release(&pvec);
>>>           cond_resched();
>>> +
>>> +        if (split) {
>>> +            /*
>>> +             * The pagevec_release() released all extra pins
>>> +             * from pagevec lookup.  And we hold an extra pin
>>> +             * and still have the page locked under us.
>>> +             */
>>> +            if (!split_huge_page(page)) {
>>> +                unlock_page(page);
>>> +                put_page(page);
>>> +                /* Re-lookup page cache from current index */
>>> +                goto retry;
>>> +            }
>>> +
>>> +            /* Fail to split THP, move to next index */
>>> +            unlock_page(page);
>>> +            put_page(page);
>>> +        }
>>> +
>>>           index++;
>>>       }
>>>   @@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>           return;
>>>         index = start;
>>> +again:
>>>       while (index < end) {
>>>           cond_resched();
>>>   @@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>               continue;
>>>           }
>>>           for (i = 0; i < pagevec_count(&pvec); i++) {
>>> -            struct page *page = pvec.pages[i];
>>> +            split = false;
>>> +            page = pvec.pages[i];
>>>                 index = indices[i];
>>>               if (index >= end)
>>> @@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>                 lock_page(page);
>>>   -            if (PageTransTail(page)) {
>>> -                /* Middle of THP: zero out the page */
>>> -                clear_highpage(page);
>>> -                unlock_page(page);
>>> -                /*
>>> -                 * Partial thp truncate due 'start' in middle
>>> -                 * of THP: don't need to look on these pages
>>> -                 * again on !pvec.nr restart.
>>> -                 */
>>> -                if (index != round_down(end, HPAGE_PMD_NR))
>>> -                    start++;
>>> -                continue;
>>> -            } else if (PageTransHuge(page)) {
>>> -                if (index == round_down(end, HPAGE_PMD_NR)) {
>>> +            if (PageTransCompound(page) && !unfalloc) {
>>> +                if (PageHead(page) &&
>>> +                    index != round_down(end, HPAGE_PMD_NR)) {
>>>                       /*
>>> -                     * Range ends in the middle of THP:
>>> -                     * zero out the page
>>> +                     * Fall through when punching whole
>>> +                     * THP.
>>>                        */
>>> -                    clear_highpage(page);
>>> -                    unlock_page(page);
>>> -                    continue;
>>> +                    index += HPAGE_PMD_NR - 1;
>>> +                    i += HPAGE_PMD_NR - 1;
>>> +                } else {
>>> +                    /*
>>> +                     * Split THP for any partial hole
>>> +                     * punch.
>>> +                     */
>>> +                    get_page(page);
>>> +                    split = true;
>>> +                    goto rescan_split;
>>>                   }
>>> -                index += HPAGE_PMD_NR - 1;
>>> -                i += HPAGE_PMD_NR - 1;
>>>               }
>>>                 if (!unfalloc || !PageUptodate(page)) {
>>> @@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode 
>>> *inode, loff_t lstart, loff_t lend,
>>>               }
>>>               unlock_page(page);
>>>           }
>>> +rescan_split:
>>>           pagevec_remove_exceptionals(&pvec);
>>>           pagevec_release(&pvec);
>>> +
>>> +        if (split) {
>>> +            /*
>>> +             * The pagevec_release() released all extra pins
>>> +             * from pagevec lookup.  And we hold an extra pin
>>> +             * and still have the page locked under us.
>>> +             */
>>> +            if (!split_huge_page(page)) {
>>> +                unlock_page(page);
>>> +                put_page(page);
>>> +                /* Re-lookup page cache from current index */
>>> +                goto again;
>>> +            }
>>> +
>>> +            /*
>>> +             * Split fail, clear the page then move to next
>>> +             * index.
>>> +             */
>>> +            clear_highpage(page);
>>> +
>>> +            unlock_page(page);
>>> +            put_page(page);
>>> +        }
>>> +
>>>           index++;
>>>       }
>>>   --
>>> 1.8.3.1
>>>
>>>
>
Yang Shi Feb. 14, 2020, 12:38 a.m. UTC | #5
Hi Kirill,


Would you please help review this patch? I don't know why Hugh didn't 
response though I pinged him twice.


Thanks,

Yang



On 2/4/20 3:27 PM, Yang Shi wrote:
>
>
> On 1/14/20 11:28 AM, Yang Shi wrote:
>>
>>
>> On 12/4/19 4:15 PM, Hugh Dickins wrote:
>>> On Wed, 4 Dec 2019, Yang Shi wrote:
>>>
>>>> Currently when truncating shmem file, if the range is partial of THP
>>>> (start or end is in the middle of THP), the pages actually will 
>>>> just get
>>>> cleared rather than being freed unless the range cover the whole THP.
>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>> the THP may still be kept in page cache.  This might be fine for some
>>>> usecases which prefer preserving THP.
>>>>
>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole 
>>>> punch
>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not 
>>>> used.
>>>> So, when using shmem THP as memory backend QEMU inflation actually 
>>>> doesn't
>>>> work as expected since it doesn't free memory.  But, the inflation
>>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>>> freed right away too but it will be freed eventually when all 
>>>> subpages are
>>>> unmapped, but shmem THP would still stay in page cache.
>>>>
>>>> Split THP right away when doing partial hole punch, and if split fails
>>>> just clear the page so that read to the hole punched area would return
>>>> zero.
>>>>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>> ---
>>>> v2: * Adopted the comment from Kirill.
>>>>      * Dropped fallocate mode flag, THP split is the default behavior.
>>>>      * Blended Huge's implementation with my v1 patch. TBH I'm not 
>>>> very keen to
>>>>        Hugh's find_get_entries() hack (basically neutral), but 
>>>> without that hack
>>> Thanks for giving it a try.  I'm not neutral about my 
>>> find_get_entries()
>>> hack: it surely had to go (without it, I'd have just pushed my own 
>>> patch).
>>> I've not noticed anything wrong with your patch, and it's in the right
>>> direction, but I'm still not thrilled with it.  I also remember that I
>>> got the looping wrong in my first internal attempt (fixed in what I 
>>> sent),
>>> and need to be very sure of the try-again-versus-move-on-to-next 
>>> conditions
>>> before agreeing to anything.  No rush, I'll come back to this in 
>>> days or
>>> month ahead: I'll try to find a less gotoey blend of yours and mine.
>>
>> Hi Hugh,
>>
>> Any update on this one?
>>
>> Thanks,
>> Yang
>
> Hi Hugh,
>
> Ping. Any comment on this? I really hope it can make v5.7.
>
> Thanks,
> Yang
>
>>
>>>
>>> Hugh
>>>
>>>>        we have to rely on pagevec_release() to release extra pins 
>>>> and play with
>>>>        goto. This version does in this way. The patch is bigger 
>>>> than Hugh's due
>>>>        to extra comments to make the flow clear.
>>>>
>>>>   mm/shmem.c | 120 
>>>> ++++++++++++++++++++++++++++++++++++++++++-------------------
>>>>   1 file changed, 83 insertions(+), 37 deletions(-)
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 220be9f..1ae0c7f 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>       long nr_swaps_freed = 0;
>>>>       pgoff_t index;
>>>>       int i;
>>>> +    bool split = false;
>>>> +    struct page *page = NULL;
>>>>         if (lend == -1)
>>>>           end = -1;    /* unsigned, so actually very big */
>>>>         pagevec_init(&pvec);
>>>>       index = start;
>>>> +retry:
>>>>       while (index < end) {
>>>>           pvec.nr = find_get_entries(mapping, index,
>>>>               min(end - index, (pgoff_t)PAGEVEC_SIZE),
>>>> @@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>           if (!pvec.nr)
>>>>               break;
>>>>           for (i = 0; i < pagevec_count(&pvec); i++) {
>>>> -            struct page *page = pvec.pages[i];
>>>> +            split = false;
>>>> +            page = pvec.pages[i];
>>>>                 index = indices[i];
>>>>               if (index >= end)
>>>> @@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>               if (!trylock_page(page))
>>>>                   continue;
>>>>   -            if (PageTransTail(page)) {
>>>> -                /* Middle of THP: zero out the page */
>>>> -                clear_highpage(page);
>>>> -                unlock_page(page);
>>>> -                continue;
>>>> -            } else if (PageTransHuge(page)) {
>>>> -                if (index == round_down(end, HPAGE_PMD_NR)) {
>>>> +            if (PageTransCompound(page) && !unfalloc) {
>>>> +                if (PageHead(page) &&
>>>> +                    index != round_down(end, HPAGE_PMD_NR)) {
>>>>                       /*
>>>> -                     * Range ends in the middle of THP:
>>>> -                     * zero out the page
>>>> +                     * Fall through when punching whole
>>>> +                     * THP.
>>>>                        */
>>>> -                    clear_highpage(page);
>>>> -                    unlock_page(page);
>>>> -                    continue;
>>>> +                    index += HPAGE_PMD_NR - 1;
>>>> +                    i += HPAGE_PMD_NR - 1;
>>>> +                } else {
>>>> +                    /*
>>>> +                     * Split THP for any partial hole
>>>> +                     * punch.
>>>> +                     */
>>>> +                    get_page(page);
>>>> +                    split = true;
>>>> +                    goto split;
>>>>                   }
>>>> -                index += HPAGE_PMD_NR - 1;
>>>> -                i += HPAGE_PMD_NR - 1;
>>>>               }
>>>>                 if (!unfalloc || !PageUptodate(page)) {
>>>> @@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>               }
>>>>               unlock_page(page);
>>>>           }
>>>> +split:
>>>>           pagevec_remove_exceptionals(&pvec);
>>>>           pagevec_release(&pvec);
>>>>           cond_resched();
>>>> +
>>>> +        if (split) {
>>>> +            /*
>>>> +             * The pagevec_release() released all extra pins
>>>> +             * from pagevec lookup.  And we hold an extra pin
>>>> +             * and still have the page locked under us.
>>>> +             */
>>>> +            if (!split_huge_page(page)) {
>>>> +                unlock_page(page);
>>>> +                put_page(page);
>>>> +                /* Re-lookup page cache from current index */
>>>> +                goto retry;
>>>> +            }
>>>> +
>>>> +            /* Fail to split THP, move to next index */
>>>> +            unlock_page(page);
>>>> +            put_page(page);
>>>> +        }
>>>> +
>>>>           index++;
>>>>       }
>>>>   @@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>           return;
>>>>         index = start;
>>>> +again:
>>>>       while (index < end) {
>>>>           cond_resched();
>>>>   @@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>               continue;
>>>>           }
>>>>           for (i = 0; i < pagevec_count(&pvec); i++) {
>>>> -            struct page *page = pvec.pages[i];
>>>> +            split = false;
>>>> +            page = pvec.pages[i];
>>>>                 index = indices[i];
>>>>               if (index >= end)
>>>> @@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>                 lock_page(page);
>>>>   -            if (PageTransTail(page)) {
>>>> -                /* Middle of THP: zero out the page */
>>>> -                clear_highpage(page);
>>>> -                unlock_page(page);
>>>> -                /*
>>>> -                 * Partial thp truncate due 'start' in middle
>>>> -                 * of THP: don't need to look on these pages
>>>> -                 * again on !pvec.nr restart.
>>>> -                 */
>>>> -                if (index != round_down(end, HPAGE_PMD_NR))
>>>> -                    start++;
>>>> -                continue;
>>>> -            } else if (PageTransHuge(page)) {
>>>> -                if (index == round_down(end, HPAGE_PMD_NR)) {
>>>> +            if (PageTransCompound(page) && !unfalloc) {
>>>> +                if (PageHead(page) &&
>>>> +                    index != round_down(end, HPAGE_PMD_NR)) {
>>>>                       /*
>>>> -                     * Range ends in the middle of THP:
>>>> -                     * zero out the page
>>>> +                     * Fall through when punching whole
>>>> +                     * THP.
>>>>                        */
>>>> -                    clear_highpage(page);
>>>> -                    unlock_page(page);
>>>> -                    continue;
>>>> +                    index += HPAGE_PMD_NR - 1;
>>>> +                    i += HPAGE_PMD_NR - 1;
>>>> +                } else {
>>>> +                    /*
>>>> +                     * Split THP for any partial hole
>>>> +                     * punch.
>>>> +                     */
>>>> +                    get_page(page);
>>>> +                    split = true;
>>>> +                    goto rescan_split;
>>>>                   }
>>>> -                index += HPAGE_PMD_NR - 1;
>>>> -                i += HPAGE_PMD_NR - 1;
>>>>               }
>>>>                 if (!unfalloc || !PageUptodate(page)) {
>>>> @@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode 
>>>> *inode, loff_t lstart, loff_t lend,
>>>>               }
>>>>               unlock_page(page);
>>>>           }
>>>> +rescan_split:
>>>>           pagevec_remove_exceptionals(&pvec);
>>>>           pagevec_release(&pvec);
>>>> +
>>>> +        if (split) {
>>>> +            /*
>>>> +             * The pagevec_release() released all extra pins
>>>> +             * from pagevec lookup.  And we hold an extra pin
>>>> +             * and still have the page locked under us.
>>>> +             */
>>>> +            if (!split_huge_page(page)) {
>>>> +                unlock_page(page);
>>>> +                put_page(page);
>>>> +                /* Re-lookup page cache from current index */
>>>> +                goto again;
>>>> +            }
>>>> +
>>>> +            /*
>>>> +             * Split fail, clear the page then move to next
>>>> +             * index.
>>>> +             */
>>>> +            clear_highpage(page);
>>>> +
>>>> +            unlock_page(page);
>>>> +            put_page(page);
>>>> +        }
>>>> +
>>>>           index++;
>>>>       }
>>>>   --
>>>> 1.8.3.1
>>>>
>>>>
>>
>
Kirill A. Shutemov Feb. 14, 2020, 3:40 p.m. UTC | #6
On Thu, Feb 13, 2020 at 04:38:01PM -0800, Yang Shi wrote:
> Hi Kirill,
> 
> 
> Would you please help review this patch? I don't know why Hugh didn't
> response though I pinged him twice.

I have not noticed anything wrong with the patch.

But the function gets ugly beyond the reason. Any chance you could
restructure it to get somewhat maintainable? (It's not easy, I know).
Yang Shi Feb. 14, 2020, 5:17 p.m. UTC | #7
On 2/14/20 7:40 AM, Kirill A. Shutemov wrote:
> On Thu, Feb 13, 2020 at 04:38:01PM -0800, Yang Shi wrote:
>> Hi Kirill,
>>
>>
>> Would you please help review this patch? I don't know why Hugh didn't
>> response though I pinged him twice.
> I have not noticed anything wrong with the patch.
>
> But the function gets ugly beyond the reason. Any chance you could
> restructure it to get somewhat maintainable? (It's not easy, I know).

Yes, those goto looks not neat. I'm going to give it a try. Any 
suggestion is absolutely welcome.

>
Alexander Duyck Feb. 20, 2020, 6:16 p.m. UTC | #8
On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> Currently when truncating shmem file, if the range is partial of THP
> (start or end is in the middle of THP), the pages actually will just get
> cleared rather than being freed unless the range cover the whole THP.
> Even though all the subpages are truncated (randomly or sequentially),
> the THP may still be kept in page cache.  This might be fine for some
> usecases which prefer preserving THP.
>
> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> So, when using shmem THP as memory backend QEMU inflation actually doesn't
> work as expected since it doesn't free memory.  But, the inflation
> usecase really needs get the memory freed.  Anonymous THP will not get
> freed right away too but it will be freed eventually when all subpages are
> unmapped, but shmem THP would still stay in page cache.
>
> Split THP right away when doing partial hole punch, and if split fails
> just clear the page so that read to the hole punched area would return
> zero.
>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

One question I would have is if this is really the desired behavior we
are looking for?

By proactively splitting the THP you are likely going to see a
performance regression with the virtio-balloon driver enabled in QEMU.
I would suspect the response to that would be to update the QEMU code
to  identify the page size of the shared memory ramblock. At that
point I suspect it would start behaving the same as how it currently
handles anonymous memory, and the work done here would essentially
have been wasted other than triggering the desire to resolve this in
QEMU to avoid a performance regression.

The code for inflating a the balloon in virtio-balloon in QEMU can be
found here:
https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66

If there is a way for us to just populate the value obtained via
qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
which is the size I am assuming it is at since you indicated that it
is just freeing the base page size, then we could address the same
issue and likely get the desired outcome of freeing the entire THP
page when it is no longer used.

- Alex
Michael S. Tsirkin Feb. 21, 2020, 9:07 a.m. UTC | #9
On Thu, Feb 20, 2020 at 10:16:54AM -0800, Alexander Duyck wrote:
> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> >
> > Currently when truncating shmem file, if the range is partial of THP
> > (start or end is in the middle of THP), the pages actually will just get
> > cleared rather than being freed unless the range cover the whole THP.
> > Even though all the subpages are truncated (randomly or sequentially),
> > the THP may still be kept in page cache.  This might be fine for some
> > usecases which prefer preserving THP.
> >
> > But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> > So, when using shmem THP as memory backend QEMU inflation actually doesn't
> > work as expected since it doesn't free memory.  But, the inflation
> > usecase really needs get the memory freed.  Anonymous THP will not get
> > freed right away too but it will be freed eventually when all subpages are
> > unmapped, but shmem THP would still stay in page cache.
> >
> > Split THP right away when doing partial hole punch, and if split fails
> > just clear the page so that read to the hole punched area would return
> > zero.
> >
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> 
> One question I would have is if this is really the desired behavior we
> are looking for?
> 
> By proactively splitting the THP you are likely going to see a
> performance regression with the virtio-balloon driver enabled in QEMU.
> I would suspect the response to that would be to update the QEMU code
> to  identify the page size of the shared memory ramblock. At that
> point I suspect it would start behaving the same as how it currently
> handles anonymous memory, and the work done here would essentially
> have been wasted other than triggering the desire to resolve this in
> QEMU to avoid a performance regression.
> 
> The code for inflating a the balloon in virtio-balloon in QEMU can be
> found here:
> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
> 
> If there is a way for us to just populate the value obtained via
> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
> which is the size I am assuming it is at since you indicated that it
> is just freeing the base page size, then we could address the same
> issue and likely get the desired outcome of freeing the entire THP
> page when it is no longer used.
> 
> - Alex

Well that would be racy right? It could be THP when you call
the function, by the time you try to free it, it's already
split up ...


Two more points:

1. we can probably teach QEMU to always use the pbp
machinery - will be helpful to reduce number of madvise calls too.

2. Something we should do is teach balloon to
inflate using address/length pairs instead of PFNs.
This way we can pass a full THP in one go.
David Hildenbrand Feb. 21, 2020, 9:36 a.m. UTC | #10
On 21.02.20 10:07, Michael S. Tsirkin wrote:
> On Thu, Feb 20, 2020 at 10:16:54AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>
>>> Currently when truncating shmem file, if the range is partial of THP
>>> (start or end is in the middle of THP), the pages actually will just get
>>> cleared rather than being freed unless the range cover the whole THP.
>>> Even though all the subpages are truncated (randomly or sequentially),
>>> the THP may still be kept in page cache.  This might be fine for some
>>> usecases which prefer preserving THP.
>>>
>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>>> work as expected since it doesn't free memory.  But, the inflation
>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>> freed right away too but it will be freed eventually when all subpages are
>>> unmapped, but shmem THP would still stay in page cache.
>>>
>>> Split THP right away when doing partial hole punch, and if split fails
>>> just clear the page so that read to the hole punched area would return
>>> zero.
>>>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>
>> One question I would have is if this is really the desired behavior we
>> are looking for?
>>
>> By proactively splitting the THP you are likely going to see a
>> performance regression with the virtio-balloon driver enabled in QEMU.
>> I would suspect the response to that would be to update the QEMU code
>> to  identify the page size of the shared memory ramblock. At that
>> point I suspect it would start behaving the same as how it currently
>> handles anonymous memory, and the work done here would essentially
>> have been wasted other than triggering the desire to resolve this in
>> QEMU to avoid a performance regression.
>>
>> The code for inflating a the balloon in virtio-balloon in QEMU can be
>> found here:
>> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
>>
>> If there is a way for us to just populate the value obtained via
>> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
>> which is the size I am assuming it is at since you indicated that it
>> is just freeing the base page size, then we could address the same
>> issue and likely get the desired outcome of freeing the entire THP
>> page when it is no longer used.
>>
>> - Alex
> 
> Well that would be racy right? It could be THP when you call
> the function, by the time you try to free it, it's already
> split up ...
> 
> 
> Two more points:
> 
> 1. we can probably teach QEMU to always use the pbp
> machinery - will be helpful to reduce number of madvise calls too.

The pbp machinery only works in the special case where the target page
size > 4k and the guest is nice enough to send the 4k chunks of a target
page sequentially. If the guest sends random pages, it is not of any use.

> 
> 2. Something we should do is teach balloon to
> inflate using address/length pairs instead of PFNs.
> This way we can pass a full THP in one go.

The balloon works on 4k pages only. It is expected to break up THP and
harm performance. Or if that's not possible *do nothing*. Similar to
when balloon inflation is inhibited (e.g., vfio).

There was some work on huge page ballooning in a paper I read. But once
the guest is out of huge pages to report, it would want to fallback to
smaller granularity (down to 4k, to create real memory pressure), where
you would end up in the very same situation you are right now. So it's -
IMHO - only of limited used.

With what you suggest, you'll harm performance to reuse more memory.
IMHO, ballooning can be expected to harm performance. (after all, if you
inflate a 4k page in your guest, the guest won't be able to use a huge
page around that page anymore as well - until it compacts balloon
memory, resulting in new deflate/inflate steps). But I guess, it depends
on the use case ...
Yang Shi Feb. 21, 2020, 6:24 p.m. UTC | #11
On 2/20/20 10:16 AM, Alexander Duyck wrote:
> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>> Currently when truncating shmem file, if the range is partial of THP
>> (start or end is in the middle of THP), the pages actually will just get
>> cleared rather than being freed unless the range cover the whole THP.
>> Even though all the subpages are truncated (randomly or sequentially),
>> the THP may still be kept in page cache.  This might be fine for some
>> usecases which prefer preserving THP.
>>
>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>> work as expected since it doesn't free memory.  But, the inflation
>> usecase really needs get the memory freed.  Anonymous THP will not get
>> freed right away too but it will be freed eventually when all subpages are
>> unmapped, but shmem THP would still stay in page cache.
>>
>> Split THP right away when doing partial hole punch, and if split fails
>> just clear the page so that read to the hole punched area would return
>> zero.
>>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> One question I would have is if this is really the desired behavior we
> are looking for?
>
> By proactively splitting the THP you are likely going to see a
> performance regression with the virtio-balloon driver enabled in QEMU.
> I would suspect the response to that would be to update the QEMU code
> to  identify the page size of the shared memory ramblock. At that
> point I suspect it would start behaving the same as how it currently
> handles anonymous memory, and the work done here would essentially
> have been wasted other than triggering the desire to resolve this in
> QEMU to avoid a performance regression.
>
> The code for inflating a the balloon in virtio-balloon in QEMU can be
> found here:
> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
>
> If there is a way for us to just populate the value obtained via
> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
> which is the size I am assuming it is at since you indicated that it
> is just freeing the base page size, then we could address the same
> issue and likely get the desired outcome of freeing the entire THP
> page when it is no longer used.

If qemu could punch hole (this is how qemu free file-backed memory) in 
THP unit, either w/ or w/o the patch the THP won't get split since the 
whole THP will get truncated. But, if qemu has to free memory in sub-THP 
size due to whatever reason (for example, 1MB for every 2MB section), 
then we have to split THP otherwise no memory will be freed actually 
with the current code. It is not about performance, it is about really 
giving memory back to host.

>
> - Alex
Alexander Duyck Feb. 22, 2020, 12:24 a.m. UTC | #12
On Fri, Feb 21, 2020 at 10:24 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
>
> On 2/20/20 10:16 AM, Alexander Duyck wrote:
> > On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> >> Currently when truncating shmem file, if the range is partial of THP
> >> (start or end is in the middle of THP), the pages actually will just get
> >> cleared rather than being freed unless the range cover the whole THP.
> >> Even though all the subpages are truncated (randomly or sequentially),
> >> the THP may still be kept in page cache.  This might be fine for some
> >> usecases which prefer preserving THP.
> >>
> >> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> >> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> >> So, when using shmem THP as memory backend QEMU inflation actually doesn't
> >> work as expected since it doesn't free memory.  But, the inflation
> >> usecase really needs get the memory freed.  Anonymous THP will not get
> >> freed right away too but it will be freed eventually when all subpages are
> >> unmapped, but shmem THP would still stay in page cache.
> >>
> >> Split THP right away when doing partial hole punch, and if split fails
> >> just clear the page so that read to the hole punched area would return
> >> zero.
> >>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > One question I would have is if this is really the desired behavior we
> > are looking for?
> >
> > By proactively splitting the THP you are likely going to see a
> > performance regression with the virtio-balloon driver enabled in QEMU.
> > I would suspect the response to that would be to update the QEMU code
> > to  identify the page size of the shared memory ramblock. At that
> > point I suspect it would start behaving the same as how it currently
> > handles anonymous memory, and the work done here would essentially
> > have been wasted other than triggering the desire to resolve this in
> > QEMU to avoid a performance regression.
> >
> > The code for inflating a the balloon in virtio-balloon in QEMU can be
> > found here:
> > https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
> >
> > If there is a way for us to just populate the value obtained via
> > qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
> > which is the size I am assuming it is at since you indicated that it
> > is just freeing the base page size, then we could address the same
> > issue and likely get the desired outcome of freeing the entire THP
> > page when it is no longer used.
>
> If qemu could punch hole (this is how qemu free file-backed memory) in
> THP unit, either w/ or w/o the patch the THP won't get split since the
> whole THP will get truncated. But, if qemu has to free memory in sub-THP
> size due to whatever reason (for example, 1MB for every 2MB section),
> then we have to split THP otherwise no memory will be freed actually
> with the current code. It is not about performance, it is about really
> giving memory back to host.

I get that, but at the same time I am not sure if everyone will be
happy with the trade-off. That is my concern.

You may want to change the patch description above if that is the
case. Based on the description above it makes it sound as if the issue
is that QEMU is using hole punch or MADV_DONTNEED with the wrong
granularity. Based on your comment here it sounds like you want to
have the ability to break up the larger THP page as soon as you want
to push out a single 4K page from it.

I am not sure the description for the behavior of anonymous THP with
respect to QEMU makes sense either. Based on the description you made
it sound like it was somehow using the same process used for huge
pages. That isn't the case right? My understanding is that in the case
of an anonymous THP it is getting broken into 4K subpages and then
those are freed individually. That should leave you with the same
performance regression that I had brought up earlier.

So if anything it sounds like what you are really wanting is feature
parity with anonymous THP which will split the anonymous THP page when
a single 4K page within the THP is hit with MADV_DONTNEED.

- Alex
Alexander Duyck Feb. 22, 2020, 12:39 a.m. UTC | #13
On Fri, Feb 21, 2020 at 1:36 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.02.20 10:07, Michael S. Tsirkin wrote:
> > On Thu, Feb 20, 2020 at 10:16:54AM -0800, Alexander Duyck wrote:
> >> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> >>>
> >>> Currently when truncating shmem file, if the range is partial of THP
> >>> (start or end is in the middle of THP), the pages actually will just get
> >>> cleared rather than being freed unless the range cover the whole THP.
> >>> Even though all the subpages are truncated (randomly or sequentially),
> >>> the THP may still be kept in page cache.  This might be fine for some
> >>> usecases which prefer preserving THP.
> >>>
> >>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> >>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> >>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
> >>> work as expected since it doesn't free memory.  But, the inflation
> >>> usecase really needs get the memory freed.  Anonymous THP will not get
> >>> freed right away too but it will be freed eventually when all subpages are
> >>> unmapped, but shmem THP would still stay in page cache.
> >>>
> >>> Split THP right away when doing partial hole punch, and if split fails
> >>> just clear the page so that read to the hole punched area would return
> >>> zero.
> >>>
> >>> Cc: Hugh Dickins <hughd@google.com>
> >>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> >>
> >> One question I would have is if this is really the desired behavior we
> >> are looking for?
> >>
> >> By proactively splitting the THP you are likely going to see a
> >> performance regression with the virtio-balloon driver enabled in QEMU.
> >> I would suspect the response to that would be to update the QEMU code
> >> to  identify the page size of the shared memory ramblock. At that
> >> point I suspect it would start behaving the same as how it currently
> >> handles anonymous memory, and the work done here would essentially
> >> have been wasted other than triggering the desire to resolve this in
> >> QEMU to avoid a performance regression.
> >>
> >> The code for inflating a the balloon in virtio-balloon in QEMU can be
> >> found here:
> >> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
> >>
> >> If there is a way for us to just populate the value obtained via
> >> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
> >> which is the size I am assuming it is at since you indicated that it
> >> is just freeing the base page size, then we could address the same
> >> issue and likely get the desired outcome of freeing the entire THP
> >> page when it is no longer used.
> >>
> >> - Alex
> >
> > Well that would be racy right? It could be THP when you call
> > the function, by the time you try to free it, it's already
> > split up ...
> >
> >
> > Two more points:
> >
> > 1. we can probably teach QEMU to always use the pbp
> > machinery - will be helpful to reduce number of madvise calls too.
>
> The pbp machinery only works in the special case where the target page
> size > 4k and the guest is nice enough to send the 4k chunks of a target
> page sequentially. If the guest sends random pages, it is not of any use.

Honestly I hadn't looked that close at the code. I had looked it over
briefly when I was working on the page reporting logic and had decided
against even bothering with it when I decided to use the scatterlist
approach since I can simply ignore the pages that fall below the
lowest order supported for the reporting.

> >
> > 2. Something we should do is teach balloon to
> > inflate using address/length pairs instead of PFNs.
> > This way we can pass a full THP in one go.
>
> The balloon works on 4k pages only. It is expected to break up THP and
> harm performance. Or if that's not possible *do nothing*. Similar to
> when balloon inflation is inhibited (e.g., vfio).

Yes, but I think the point is that this is counter productive. If we
can allocate something up to MAX_ORDER - 1 and hand that to the
balloon driver instead then it would make the driver much more
efficient. We could basically just work from the highest available
order to the lowest and if that pushes us to the point of breaking up
THP pages then at that point it would make sense. Us allocating the
lower order pages first just makes it more difficult to go through and
compact pages back up to higher order. The goal should really always
be highest order to lowest order for inflation, and lowest to highest
for deflation. That way we put pressure on the guest to compact its
memory making it possible for us to squeeze it down even smaller and
provide more THP pages for the rest of the system.

> There was some work on huge page ballooning in a paper I read. But once
> the guest is out of huge pages to report, it would want to fallback to
> smaller granularity (down to 4k, to create real memory pressure), where
> you would end up in the very same situation you are right now. So it's -
> IMHO - only of limited used.

I wouldn't think it would be that limited of a use case. By having the
balloon inflate with higher order pages you should be able to put more
pressure on the guest to compact the memory and reduce fragmentation
instead of increasing it. If you have the balloon flushing out the
lower order pages it is sitting on when there is pressure it seems
like it would be more likely to reduce fragmentation further.

> With what you suggest, you'll harm performance to reuse more memory.
> IMHO, ballooning can be expected to harm performance. (after all, if you
> inflate a 4k page in your guest, the guest won't be able to use a huge
> page around that page anymore as well - until it compacts balloon
> memory, resulting in new deflate/inflate steps). But I guess, it depends
> on the use case ...

I think it depends on how you are using the balloon. If you have the
hypervisor only doing the MADV_DONTNEED on 2M pages, while letting it
fill the balloon in the guest with everything down to 4K it might lead
to enough memory churn to actually reduce the fragmentation as the
lower order pages are inflated/deflated as we maintain memory
pressure. It would probably be an interesting experiment if nothing
else, and probably wouldn't take much more than a few tweaks to make
use of inflation/deflation queues similar to what I did with the page
reporting/hinting interface and a bit of logic to try allocating from
highest order to lowest.
David Hildenbrand Feb. 24, 2020, 10:22 a.m. UTC | #14
>>> 1. we can probably teach QEMU to always use the pbp
>>> machinery - will be helpful to reduce number of madvise calls too.
>>
>> The pbp machinery only works in the special case where the target page
>> size > 4k and the guest is nice enough to send the 4k chunks of a target
>> page sequentially. If the guest sends random pages, it is not of any use.
> 
> Honestly I hadn't looked that close at the code. I had looked it over
> briefly when I was working on the page reporting logic and had decided
> against even bothering with it when I decided to use the scatterlist
> approach since I can simply ignore the pages that fall below the
> lowest order supported for the reporting.

Yes, it's rather a hack for a special use case.

> 
>>>
>>> 2. Something we should do is teach balloon to
>>> inflate using address/length pairs instead of PFNs.
>>> This way we can pass a full THP in one go.
>>
>> The balloon works on 4k pages only. It is expected to break up THP and
>> harm performance. Or if that's not possible *do nothing*. Similar to
>> when balloon inflation is inhibited (e.g., vfio).
> 
> Yes, but I think the point is that this is counter productive. If we
> can allocate something up to MAX_ORDER - 1 and hand that to the
> balloon driver instead then it would make the driver much more
> efficient. We could basically just work from the highest available
> order to the lowest and if that pushes us to the point of breaking up
> THP pages then at that point it would make sense. Us allocating the
> lower order pages first just makes it more difficult to go through and
> compact pages back up to higher order. The goal should really always
> be highest order to lowest order for inflation, and lowest to highest
> for deflation. That way we put pressure on the guest to compact its
> memory making it possible for us to squeeze it down even smaller and
> provide more THP pages for the rest of the system.

While the initial inflate path would be fine, I am more concerned about
deflation/balloon compaction handling (see below, split to order-0
pages). Because you really want to keep page compaction working.

Imagine you would allocate higher-order pages in your balloon that are
not movable, then the kernel would have less higher/order pages to work
with which might actually harm performance in your guest.

I think of it like that: Best performance is huge page in guest and
host. Medium performance is huge page in guest xor host. Worst
performance is no huge page.

If you take away huge pages in your guest for your balloon, you limit
the cases for "best performance", esp. less THP in your guest. You'll be
able to get medium performance if you inflate lower-order pages in your
guest but don't discard THP in your host - while having more huge pages
for THP available. You'll get worst performance if you inflate
lower-order pages in your guest and discard THP in your host.

> 
>> There was some work on huge page ballooning in a paper I read. But once
>> the guest is out of huge pages to report, it would want to fallback to
>> smaller granularity (down to 4k, to create real memory pressure), where
>> you would end up in the very same situation you are right now. So it's -
>> IMHO - only of limited used.
> 
> I wouldn't think it would be that limited of a use case. By having the
> balloon inflate with higher order pages you should be able to put more
> pressure on the guest to compact the memory and reduce fragmentation
> instead of increasing it. If you have the balloon flushing out the
> lower order pages it is sitting on when there is pressure it seems
> like it would be more likely to reduce fragmentation further.

As we have balloon compaction in place and balloon pages are movable, I
guess fragmentation is not really an issue.

> 
>> With what you suggest, you'll harm performance to reuse more memory.
>> IMHO, ballooning can be expected to harm performance. (after all, if you
>> inflate a 4k page in your guest, the guest won't be able to use a huge
>> page around that page anymore as well - until it compacts balloon
>> memory, resulting in new deflate/inflate steps). But I guess, it depends
>> on the use case ...
> 
> I think it depends on how you are using the balloon. If you have the
> hypervisor only doing the MADV_DONTNEED on 2M pages, while letting it
> fill the balloon in the guest with everything down to 4K it might lead
> to enough memory churn to actually reduce the fragmentation as the
> lower order pages are inflated/deflated as we maintain memory
> pressure. It would probably be an interesting experiment if nothing
> else, and probably wouldn't take much more than a few tweaks to make
> use of inflation/deflation queues similar to what I did with the page
> reporting/hinting interface and a bit of logic to try allocating from
> highest order to lowest.
> 

Especially page compaction/migration in the guest might be tricky. AFAIK
it only works on oder-0 pages. E.g., whenever you allocated a
higher-order page in the guest and reported it to your hypervisor, you
want to split it into separate order-0 pages before adding them to the
balloon list. Otherwise, you won't be able to tag them as movable and
handle them via the existing balloon compaction framework - and that
would be a major step backwards, because you would be heavily
fragmenting your guest (and even turning MAX_ORDER - 1 into unmovable
pages means that memory offlining/alloc_contig_range() users won't be
able to move such pages around anymore).

But then, the balloon compaction will result in single 4k pages getting
moved and deflated+inflated. Once you have order-0 pages in your list,
deflating higher-order pages becomes trickier.

E.g., have a look at the vmware balloon (drivers/misc/vmw_balloon.c). It
will allocate either 4k or 2MB pages, but won't be able to handle them
for balloon compaction. They don't even bother about other granularity.


Long story short: Inflating higher-order pages could be good for
inflation performance in some setups, but I think you'll have to fall
back to lower-order allocations +  balloon compaction on 4k.
Alexander Duyck Feb. 25, 2020, 12:13 a.m. UTC | #15
On Mon, Feb 24, 2020 at 2:22 AM David Hildenbrand <david@redhat.com> wrote:
>
>
> >>> 1. we can probably teach QEMU to always use the pbp
> >>> machinery - will be helpful to reduce number of madvise calls too.
> >>
> >> The pbp machinery only works in the special case where the target page
> >> size > 4k and the guest is nice enough to send the 4k chunks of a target
> >> page sequentially. If the guest sends random pages, it is not of any use.
> >
> > Honestly I hadn't looked that close at the code. I had looked it over
> > briefly when I was working on the page reporting logic and had decided
> > against even bothering with it when I decided to use the scatterlist
> > approach since I can simply ignore the pages that fall below the
> > lowest order supported for the reporting.
>
> Yes, it's rather a hack for a special use case.
>
> >
> >>>
> >>> 2. Something we should do is teach balloon to
> >>> inflate using address/length pairs instead of PFNs.
> >>> This way we can pass a full THP in one go.
> >>
> >> The balloon works on 4k pages only. It is expected to break up THP and
> >> harm performance. Or if that's not possible *do nothing*. Similar to
> >> when balloon inflation is inhibited (e.g., vfio).
> >
> > Yes, but I think the point is that this is counter productive. If we
> > can allocate something up to MAX_ORDER - 1 and hand that to the
> > balloon driver instead then it would make the driver much more
> > efficient. We could basically just work from the highest available
> > order to the lowest and if that pushes us to the point of breaking up
> > THP pages then at that point it would make sense. Us allocating the
> > lower order pages first just makes it more difficult to go through and
> > compact pages back up to higher order. The goal should really always
> > be highest order to lowest order for inflation, and lowest to highest
> > for deflation. That way we put pressure on the guest to compact its
> > memory making it possible for us to squeeze it down even smaller and
> > provide more THP pages for the rest of the system.
>
> While the initial inflate path would be fine, I am more concerned about
> deflation/balloon compaction handling (see below, split to order-0
> pages). Because you really want to keep page compaction working.
>
> Imagine you would allocate higher-order pages in your balloon that are
> not movable, then the kernel would have less higher/order pages to work
> with which might actually harm performance in your guest.
>
> I think of it like that: Best performance is huge page in guest and
> host. Medium performance is huge page in guest xor host. Worst
> performance is no huge page.
>
> If you take away huge pages in your guest for your balloon, you limit
> the cases for "best performance", esp. less THP in your guest. You'll be
> able to get medium performance if you inflate lower-order pages in your
> guest but don't discard THP in your host - while having more huge pages
> for THP available. You'll get worst performance if you inflate
> lower-order pages in your guest and discard THP in your host.

My concern is more the fact that while the balloon driver may support
migration there is a chance that the other threads holding onto the
pages might not. That is why I am thinking that if possible we should
avoid breaking up a THP page unless we have to. By doing that it puts
pressure on the guest to try and keep more of the memory it has more
compact so as to avoid fragmentation.

I guess the question is if pressuring the guest to compact the memory
to create more THP pages would add value versus letting the pressure
from the inflation cause more potential fragmentation.

> >
> >> There was some work on huge page ballooning in a paper I read. But once
> >> the guest is out of huge pages to report, it would want to fallback to
> >> smaller granularity (down to 4k, to create real memory pressure), where
> >> you would end up in the very same situation you are right now. So it's -
> >> IMHO - only of limited used.
> >
> > I wouldn't think it would be that limited of a use case. By having the
> > balloon inflate with higher order pages you should be able to put more
> > pressure on the guest to compact the memory and reduce fragmentation
> > instead of increasing it. If you have the balloon flushing out the
> > lower order pages it is sitting on when there is pressure it seems
> > like it would be more likely to reduce fragmentation further.
>
> As we have balloon compaction in place and balloon pages are movable, I
> guess fragmentation is not really an issue.

I'm not sure that is truly the case. My concern is that by allocating
the 4K pages we are breaking up the higher order pages and we aren't
necessarily guaranteed to obtain all pieces of the higher order page
when we break it up. As a result we could end up causing the THP pages
to be broken up and scattered between the balloon and other consumers
of memory. If one of those other consumers is responsible for the
fragmentation that is causing us to be working on trying to compact
the memory it is possible it would just make things worse.

> >
> >> With what you suggest, you'll harm performance to reuse more memory.
> >> IMHO, ballooning can be expected to harm performance. (after all, if you
> >> inflate a 4k page in your guest, the guest won't be able to use a huge
> >> page around that page anymore as well - until it compacts balloon
> >> memory, resulting in new deflate/inflate steps). But I guess, it depends
> >> on the use case ...
> >
> > I think it depends on how you are using the balloon. If you have the
> > hypervisor only doing the MADV_DONTNEED on 2M pages, while letting it
> > fill the balloon in the guest with everything down to 4K it might lead
> > to enough memory churn to actually reduce the fragmentation as the
> > lower order pages are inflated/deflated as we maintain memory
> > pressure. It would probably be an interesting experiment if nothing
> > else, and probably wouldn't take much more than a few tweaks to make
> > use of inflation/deflation queues similar to what I did with the page
> > reporting/hinting interface and a bit of logic to try allocating from
> > highest order to lowest.
> >
>
> Especially page compaction/migration in the guest might be tricky. AFAIK
> it only works on oder-0 pages. E.g., whenever you allocated a
> higher-order page in the guest and reported it to your hypervisor, you
> want to split it into separate order-0 pages before adding them to the
> balloon list. Otherwise, you won't be able to tag them as movable and
> handle them via the existing balloon compaction framework - and that
> would be a major step backwards, because you would be heavily
> fragmenting your guest (and even turning MAX_ORDER - 1 into unmovable
> pages means that memory offlining/alloc_contig_range() users won't be
> able to move such pages around anymore).

Yes, from what I can tell compaction will not touch anything that is
pageblock size or larger. I am not sure if that is an issue or not.

For migration is is a bit of a different story. It looks like there is
logic in place for migrating huge and transparent huge pages, but not
higher order pages. I'll have to take a look through the code some
more to see just how difficult it would be to support migrating a 2M
page. I can probably make it work if I just configure it as a
transparent huge page with the appropriate flags and bits in the page
being set.

> But then, the balloon compaction will result in single 4k pages getting
> moved and deflated+inflated. Once you have order-0 pages in your list,
> deflating higher-order pages becomes trickier.

I probably wouldn't want to maintain them as individual lists. In my
mind it would make more sense to have two separate lists with separate
handlers for each. Then in the event of something such as a deflate we
could choose what we free based on the number of pages we need to
free. That would allow us to deflate the balloon quicker in the case
of a low-memory condition which should improve our responsiveness. In
addition with the driver sitting on a reserve of higher-order pages it
could help to alleviate fragmentation in such a case as well since it
could release larger contiguous blocks of memory.

> E.g., have a look at the vmware balloon (drivers/misc/vmw_balloon.c). It
> will allocate either 4k or 2MB pages, but won't be able to handle them
> for balloon compaction. They don't even bother about other granularity.
>
>
> Long story short: Inflating higher-order pages could be good for
> inflation performance in some setups, but I think you'll have to fall
> back to lower-order allocations +  balloon compaction on 4k.

I'm not entirely sure that is the case. It seems like with a few
tweaks to things we could look at doing something like splitting the
balloon so that we have a 4K and a 2M balloon. At that point it would
just be a matter of registering a pair of address space handlers so
that the 2M balloons are handled correctly if there is a request to
migrate their memory. As far as compaction that is another story since
it looks like 2M pages will not be compacted.
Hugh Dickins Feb. 25, 2020, 3:46 a.m. UTC | #16
On Tue, 14 Jan 2020, Yang Shi wrote:
> On 12/4/19 4:15 PM, Hugh Dickins wrote:
> > On Wed, 4 Dec 2019, Yang Shi wrote:
> > 
> > > Currently when truncating shmem file, if the range is partial of THP
> > > (start or end is in the middle of THP), the pages actually will just get
> > > cleared rather than being freed unless the range cover the whole THP.
> > > Even though all the subpages are truncated (randomly or sequentially),
> > > the THP may still be kept in page cache.  This might be fine for some
> > > usecases which prefer preserving THP.
> > > 
> > > But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> > > So, when using shmem THP as memory backend QEMU inflation actually
> > > doesn't
> > > work as expected since it doesn't free memory.  But, the inflation
> > > usecase really needs get the memory freed.  Anonymous THP will not get
> > > freed right away too but it will be freed eventually when all subpages
> > > are
> > > unmapped, but shmem THP would still stay in page cache.
> > > 
> > > Split THP right away when doing partial hole punch, and if split fails
> > > just clear the page so that read to the hole punched area would return
> > > zero.
> > > 
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > > ---
> > > v2: * Adopted the comment from Kirill.
> > >      * Dropped fallocate mode flag, THP split is the default behavior.
> > >      * Blended Huge's implementation with my v1 patch. TBH I'm not very
> > > keen to
> > >        Hugh's find_get_entries() hack (basically neutral), but without
> > > that hack
> > Thanks for giving it a try.  I'm not neutral about my find_get_entries()
> > hack: it surely had to go (without it, I'd have just pushed my own patch).
> > I've not noticed anything wrong with your patch, and it's in the right
> > direction, but I'm still not thrilled with it.  I also remember that I
> > got the looping wrong in my first internal attempt (fixed in what I sent),
> > and need to be very sure of the try-again-versus-move-on-to-next conditions
> > before agreeing to anything.  No rush, I'll come back to this in days or
> > month ahead: I'll try to find a less gotoey blend of yours and mine.
> 
> Hi Hugh,
> 
> Any update on this one?

I apologize for my dreadful unresponsiveness.

I've spent the last day trying to love yours, and considering how mine
might be improved; but repeatedly arrived at the conclusion that mine is
about as nice as we can manage, just needing more comment to dignify it.

I did willingly call my find_get_entries() stopping at PageTransCompound
a hack; but now think we should just document that behavior and accept it.
The contortions of your patch come from the need to release those 14 extra
unwanted references: much better not to get them in the first place.

Neither of us handle a failed split optimally, we treat every tail as an
opportunity to retry: which is good to recover from transient failures,
but probably excessive.  And we both have to restart the pagevec after
each attempt, but at least I don't get 14 unwanted extras each time.

What of other find_get_entries() users and its pagevec_lookup_entries()
wrapper: does an argument to select this "stop at PageTransCompound"
behavior need to be added?

No.  The pagevec_lookup_entries() calls from mm/truncate.c prefer the
new behavior - evicting the head from page cache removes all the tails
along with it, so getting the tails a waste of time there too, just as
it was in shmem_undo_range().

Whereas shmem_unlock_mapping() and shmem_seek_hole_data(), as they
stand, are not removing pages from cache, but actually wanting to plod
through the tails.  So those two would be slowed a little, while the
pagevec collects 1 instead of 15 pages.  However: if we care about those
two at all, it's clear that we should speed them up, by noticing the
PageTransCompound case and accelerating over it, instead of plodding
through the tails.  Since we haven't bothered with that optimization
yet, I'm not very worried to slow them down a little until it's done.

I must take a look at where we stand with tmpfs 64-bit ino tomorrow,
then recomment my shmem_punch_compound() patch and post it properly,
probably day after.  (Reviewing it, I see I have a "page->index +
HPAGE_PMD_NR <= end" test which needs correcting - I tend to live
in the past, before v4.13 doubled the 32-bit MAX_LFS_FILESIZE.)

I notice that this thread has veered off into QEMU ballooning
territory: which may indeed be important, but there's nothing at all
that I can contribute on that.  I certainly do not want to slow down
anything important, but remain convinced that the correct filesystem
implementation for punching a hole is to punch a hole.

Hugh
David Hildenbrand Feb. 25, 2020, 8:09 a.m. UTC | #17
[...]

> I guess the question is if pressuring the guest to compact the memory
> to create more THP pages would add value versus letting the pressure
> from the inflation cause more potential fragmentation.

Would be interesting to see some actual numbers. Right now, it's just
speculations. I know that there are ideas to do proactive compaction,
maybe that has a similar effect.

[...]

> 
>>>
>>>> There was some work on huge page ballooning in a paper I read. But once
>>>> the guest is out of huge pages to report, it would want to fallback to
>>>> smaller granularity (down to 4k, to create real memory pressure), where
>>>> you would end up in the very same situation you are right now. So it's -
>>>> IMHO - only of limited used.
>>>
>>> I wouldn't think it would be that limited of a use case. By having the
>>> balloon inflate with higher order pages you should be able to put more
>>> pressure on the guest to compact the memory and reduce fragmentation
>>> instead of increasing it. If you have the balloon flushing out the
>>> lower order pages it is sitting on when there is pressure it seems
>>> like it would be more likely to reduce fragmentation further.
>>
>> As we have balloon compaction in place and balloon pages are movable, I
>> guess fragmentation is not really an issue.
> 
> I'm not sure that is truly the case. My concern is that by allocating
> the 4K pages we are breaking up the higher order pages and we aren't
> necessarily guaranteed to obtain all pieces of the higher order page
> when we break it up. As a result we could end up causing the THP pages
> to be broken up and scattered between the balloon and other consumers

We are allocating movable memory. We should be working on/creating
movable pageblocks. Yes, other concurrent allcoations can race with the
allocation. But after all, they are likely movable as well (because they
are allocating from a movable pageblock) and we do have compaction in
place. There are corner cases but in don't think they are very relevant.

[...]

>> Especially page compaction/migration in the guest might be tricky. AFAIK
>> it only works on oder-0 pages. E.g., whenever you allocated a
>> higher-order page in the guest and reported it to your hypervisor, you
>> want to split it into separate order-0 pages before adding them to the
>> balloon list. Otherwise, you won't be able to tag them as movable and
>> handle them via the existing balloon compaction framework - and that
>> would be a major step backwards, because you would be heavily
>> fragmenting your guest (and even turning MAX_ORDER - 1 into unmovable
>> pages means that memory offlining/alloc_contig_range() users won't be
>> able to move such pages around anymore).
> 
> Yes, from what I can tell compaction will not touch anything that is
> pageblock size or larger. I am not sure if that is an issue or not.
> 
> For migration is is a bit of a different story. It looks like there is
> logic in place for migrating huge and transparent huge pages, but not
> higher order pages. I'll have to take a look through the code some
> more to see just how difficult it would be to support migrating a 2M
> page. I can probably make it work if I just configure it as a
> transparent huge page with the appropriate flags and bits in the page
> being set.

Note: With virtio-balloon you actually don't necessarily want to migrate
the higher-order page. E.g., migrating a higher-order page might fail
because there is no migration target available. Instead, you would want
"migrate" to multiple smaller pieces. This is esp., interesting for
alloc_contig_range() users. Something that the current 4k pages can
handle just nicely.

> 
>> But then, the balloon compaction will result in single 4k pages getting
>> moved and deflated+inflated. Once you have order-0 pages in your list,
>> deflating higher-order pages becomes trickier.
> 
> I probably wouldn't want to maintain them as individual lists. In my
> mind it would make more sense to have two separate lists with separate
> handlers for each. Then in the event of something such as a deflate we
> could choose what we free based on the number of pages we need to
> free. That would allow us to deflate the balloon quicker in the case
> of a low-memory condition which should improve our responsiveness. In
> addition with the driver sitting on a reserve of higher-order pages it
> could help to alleviate fragmentation in such a case as well since it
> could release larger contiguous blocks of memory.
> 
>> E.g., have a look at the vmware balloon (drivers/misc/vmw_balloon.c). It
>> will allocate either 4k or 2MB pages, but won't be able to handle them
>> for balloon compaction. They don't even bother about other granularity.
>>
>>
>> Long story short: Inflating higher-order pages could be good for
>> inflation performance in some setups, but I think you'll have to fall
>> back to lower-order allocations +  balloon compaction on 4k.
> 
> I'm not entirely sure that is the case. It seems like with a few
> tweaks to things we could look at doing something like splitting the
> balloon so that we have a 4K and a 2M balloon. At that point it would
> just be a matter of registering a pair of address space handlers so
> that the 2M balloons are handled correctly if there is a request to
> migrate their memory. As far as compaction that is another story since
> it looks like 2M pages will not be compacted.

I am not convinced what you describe is a real issue that needs such a
solution. Maybe we can come up with numbers that prove this. (e.g.,
#THP, fragmentation, benchmark performance in your guest, etc.).

I'll try digging out that huge page ballooning for KVM paper, maybe that
has any value.
Alexander Duyck Feb. 25, 2020, 4:42 p.m. UTC | #18
On Tue, Feb 25, 2020 at 12:09 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> > I guess the question is if pressuring the guest to compact the memory
> > to create more THP pages would add value versus letting the pressure
> > from the inflation cause more potential fragmentation.
>
> Would be interesting to see some actual numbers. Right now, it's just
> speculations. I know that there are ideas to do proactive compaction,
> maybe that has a similar effect.
>
> [...]
>
> >
> >>>
> >>>> There was some work on huge page ballooning in a paper I read. But once
> >>>> the guest is out of huge pages to report, it would want to fallback to
> >>>> smaller granularity (down to 4k, to create real memory pressure), where
> >>>> you would end up in the very same situation you are right now. So it's -
> >>>> IMHO - only of limited used.
> >>>
> >>> I wouldn't think it would be that limited of a use case. By having the
> >>> balloon inflate with higher order pages you should be able to put more
> >>> pressure on the guest to compact the memory and reduce fragmentation
> >>> instead of increasing it. If you have the balloon flushing out the
> >>> lower order pages it is sitting on when there is pressure it seems
> >>> like it would be more likely to reduce fragmentation further.
> >>
> >> As we have balloon compaction in place and balloon pages are movable, I
> >> guess fragmentation is not really an issue.
> >
> > I'm not sure that is truly the case. My concern is that by allocating
> > the 4K pages we are breaking up the higher order pages and we aren't
> > necessarily guaranteed to obtain all pieces of the higher order page
> > when we break it up. As a result we could end up causing the THP pages
> > to be broken up and scattered between the balloon and other consumers
>
> We are allocating movable memory. We should be working on/creating
> movable pageblocks. Yes, other concurrent allcoations can race with the
> allocation. But after all, they are likely movable as well (because they
> are allocating from a movable pageblock) and we do have compaction in
> place. There are corner cases but in don't think they are very relevant.
>
> [...]

The main advantage as I see it though is that you can much more likely
inflate an entire THP page if you are allocating 2M pages versus 4K
pages simply because if another thread ends up stealing one of those
pages while you are trying to inflate the balloon it will be
problematic. In addition by switching over to the scatterlist approach
it would be possible to process 512 pages as a single entry which is
already more than double the number of PFNs currently supported by
virtio balloon.

> >> Especially page compaction/migration in the guest might be tricky. AFAIK
> >> it only works on oder-0 pages. E.g., whenever you allocated a
> >> higher-order page in the guest and reported it to your hypervisor, you
> >> want to split it into separate order-0 pages before adding them to the
> >> balloon list. Otherwise, you won't be able to tag them as movable and
> >> handle them via the existing balloon compaction framework - and that
> >> would be a major step backwards, because you would be heavily
> >> fragmenting your guest (and even turning MAX_ORDER - 1 into unmovable
> >> pages means that memory offlining/alloc_contig_range() users won't be
> >> able to move such pages around anymore).
> >
> > Yes, from what I can tell compaction will not touch anything that is
> > pageblock size or larger. I am not sure if that is an issue or not.
> >
> > For migration is is a bit of a different story. It looks like there is
> > logic in place for migrating huge and transparent huge pages, but not
> > higher order pages. I'll have to take a look through the code some
> > more to see just how difficult it would be to support migrating a 2M
> > page. I can probably make it work if I just configure it as a
> > transparent huge page with the appropriate flags and bits in the page
> > being set.
>
> Note: With virtio-balloon you actually don't necessarily want to migrate
> the higher-order page. E.g., migrating a higher-order page might fail
> because there is no migration target available. Instead, you would want
> "migrate" to multiple smaller pieces. This is esp., interesting for
> alloc_contig_range() users. Something that the current 4k pages can
> handle just nicely.

That is why I am thinking it would be worthwhile to explore the
current THP approaches being used. If I could get virtio balloon to
act as though it is allocating THP pages then what I would get is the
ability to handle migration and the fact that THP pages already get
broken up into lower order pages if they cannot be allocated as a
higher order page. In reality the balloon driver doesn't really care
about if the page is mapped as 1 2M page or 512 4K ones, however it
would be preferred if we could get the linear 2M mapping.

> >
> >> But then, the balloon compaction will result in single 4k pages getting
> >> moved and deflated+inflated. Once you have order-0 pages in your list,
> >> deflating higher-order pages becomes trickier.
> >
> > I probably wouldn't want to maintain them as individual lists. In my
> > mind it would make more sense to have two separate lists with separate
> > handlers for each. Then in the event of something such as a deflate we
> > could choose what we free based on the number of pages we need to
> > free. That would allow us to deflate the balloon quicker in the case
> > of a low-memory condition which should improve our responsiveness. In
> > addition with the driver sitting on a reserve of higher-order pages it
> > could help to alleviate fragmentation in such a case as well since it
> > could release larger contiguous blocks of memory.
> >
> >> E.g., have a look at the vmware balloon (drivers/misc/vmw_balloon.c). It
> >> will allocate either 4k or 2MB pages, but won't be able to handle them
> >> for balloon compaction. They don't even bother about other granularity.
> >>
> >>
> >> Long story short: Inflating higher-order pages could be good for
> >> inflation performance in some setups, but I think you'll have to fall
> >> back to lower-order allocations +  balloon compaction on 4k.
> >
> > I'm not entirely sure that is the case. It seems like with a few
> > tweaks to things we could look at doing something like splitting the
> > balloon so that we have a 4K and a 2M balloon. At that point it would
> > just be a matter of registering a pair of address space handlers so
> > that the 2M balloons are handled correctly if there is a request to
> > migrate their memory. As far as compaction that is another story since
> > it looks like 2M pages will not be compacted.
>
> I am not convinced what you describe is a real issue that needs such a
> solution. Maybe we can come up with numbers that prove this. (e.g.,
> #THP, fragmentation, benchmark performance in your guest, etc.).

As I see it there are several issues to be addressed here.

1. As the size of memory increases I am not certain that operating a
balloon at 4K page size is going to make sense for much longer.
2. Inflating with 4K pages is likely to force the guest memory to be
split up at the host level, and will be expensive as it requires an
operation per 4K page on the host.
3. Inflating with 4K pages makes it impossible to currently identify
if the entire THP has been freed since we can only inflate half of a
THP page at a time.

> I'll try digging out that huge page ballooning for KVM paper, maybe that
> has any value.

Thanks. Also if you have any specific benchmarks in mind that would be
useful as well for establishing the criteria for what a
proof-of-concept would need to accomplish.
David Hildenbrand Feb. 25, 2020, 6:02 p.m. UTC | #19
>>
>> Any update on this one?
> 
> I apologize for my dreadful unresponsiveness.
> 
> I've spent the last day trying to love yours, and considering how mine
> might be improved; but repeatedly arrived at the conclusion that mine is
> about as nice as we can manage, just needing more comment to dignify it.
> 
> I did willingly call my find_get_entries() stopping at PageTransCompound
> a hack; but now think we should just document that behavior and accept it.
> The contortions of your patch come from the need to release those 14 extra
> unwanted references: much better not to get them in the first place.
> 
> Neither of us handle a failed split optimally, we treat every tail as an
> opportunity to retry: which is good to recover from transient failures,
> but probably excessive.  And we both have to restart the pagevec after
> each attempt, but at least I don't get 14 unwanted extras each time.
> 
> What of other find_get_entries() users and its pagevec_lookup_entries()
> wrapper: does an argument to select this "stop at PageTransCompound"
> behavior need to be added?
> 
> No.  The pagevec_lookup_entries() calls from mm/truncate.c prefer the
> new behavior - evicting the head from page cache removes all the tails
> along with it, so getting the tails a waste of time there too, just as
> it was in shmem_undo_range().
> 
> Whereas shmem_unlock_mapping() and shmem_seek_hole_data(), as they
> stand, are not removing pages from cache, but actually wanting to plod
> through the tails.  So those two would be slowed a little, while the
> pagevec collects 1 instead of 15 pages.  However: if we care about those
> two at all, it's clear that we should speed them up, by noticing the
> PageTransCompound case and accelerating over it, instead of plodding
> through the tails.  Since we haven't bothered with that optimization
> yet, I'm not very worried to slow them down a little until it's done.
> 
> I must take a look at where we stand with tmpfs 64-bit ino tomorrow,
> then recomment my shmem_punch_compound() patch and post it properly,
> probably day after.  (Reviewing it, I see I have a "page->index +
> HPAGE_PMD_NR <= end" test which needs correcting - I tend to live
> in the past, before v4.13 doubled the 32-bit MAX_LFS_FILESIZE.)
> 
> I notice that this thread has veered off into QEMU ballooning
> territory: which may indeed be important, but there's nothing at all
> that I can contribute on that.  I certainly do not want to slow down
> anything important, but remain convinced that the correct filesystem
> implementation for punching a hole is to punch a hole.

I am not completely sure I follow all the shmem details (sorry!). But
trying to "punch a partial hole punch" into a hugetlbfs page will result
in the very same behavior as with shmem as of now, no?

FALLOC_FL_PUNCH_HOLE: "Within the specified range, partial filesystem
blocks are zeroed, and whole filesystem blocks are removed from the
file." ... After a successful call, subsequent reads from this range
will return zeros."

So, as long as we are talking about partial blocks the documented
behavior seems to be to only zero the memory.

Does this patch fix "FALLOC_FL_PUNCH_HOLE does not free blocks if called
in block granularity on shmem" (which would be a valid fix), or does it
try to implement something that is not documented? (removing partial
blocks when called in sub-block granularity)

I assume the latter, in which case I would interpret "punching a hole is
to punch a hole" as "punching sub-blocks will not free blocks".

(if somebody could enlighten me which important piece I am missing or
messing up, that would be great :) )
Hugh Dickins Feb. 25, 2020, 8:31 p.m. UTC | #20
On Tue, 25 Feb 2020, David Hildenbrand wrote:
> > 
> > I notice that this thread has veered off into QEMU ballooning
> > territory: which may indeed be important, but there's nothing at all
> > that I can contribute on that.  I certainly do not want to slow down
> > anything important, but remain convinced that the correct filesystem
> > implementation for punching a hole is to punch a hole.
> 
> I am not completely sure I follow all the shmem details (sorry!). But
> trying to "punch a partial hole punch" into a hugetlbfs page will result
> in the very same behavior as with shmem as of now, no?

I believe so.

> 
> FALLOC_FL_PUNCH_HOLE: "Within the specified range, partial filesystem
> blocks are zeroed, and whole filesystem blocks are removed from the
> file." ... After a successful call, subsequent reads from this range
> will return zeros."
> 
> So, as long as we are talking about partial blocks the documented
> behavior seems to be to only zero the memory.
> 
> Does this patch fix "FALLOC_FL_PUNCH_HOLE does not free blocks if called
> in block granularity on shmem" (which would be a valid fix),

Yes. The block size of tmpfs is (talking x86_64 for simplicity) 4KiB;
but when mounted huge, it transparently takes advantage of 2MiB extents
when it can.  Rather like a disk-based filesystem always presenting a
4KiB block interface, but stored on disk in multisector extents.

Whereas hugetlbfs is a different filesystem, which is and always has
been limited to supporting only certain larger block sizes.

> or does it
> try to implement something that is not documented? (removing partial
> blocks when called in sub-block granularity)

No.

> 
> I assume the latter, in which case I would interpret "punching a hole is
> to punch a hole" as "punching sub-blocks will not free blocks".
> 
> (if somebody could enlighten me which important piece I am missing or
> messing up, that would be great :) )
> 
> -- 
> Thanks,
> 
> David / dhildenb
Yang Shi Feb. 26, 2020, 5:31 p.m. UTC | #21
On 2/21/20 4:24 PM, Alexander Duyck wrote:
> On Fri, Feb 21, 2020 at 10:24 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>
>>
>> On 2/20/20 10:16 AM, Alexander Duyck wrote:
>>> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>> Currently when truncating shmem file, if the range is partial of THP
>>>> (start or end is in the middle of THP), the pages actually will just get
>>>> cleared rather than being freed unless the range cover the whole THP.
>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>> the THP may still be kept in page cache.  This might be fine for some
>>>> usecases which prefer preserving THP.
>>>>
>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>>>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>>>> work as expected since it doesn't free memory.  But, the inflation
>>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>>> freed right away too but it will be freed eventually when all subpages are
>>>> unmapped, but shmem THP would still stay in page cache.
>>>>
>>>> Split THP right away when doing partial hole punch, and if split fails
>>>> just clear the page so that read to the hole punched area would return
>>>> zero.
>>>>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> One question I would have is if this is really the desired behavior we
>>> are looking for?
>>>
>>> By proactively splitting the THP you are likely going to see a
>>> performance regression with the virtio-balloon driver enabled in QEMU.
>>> I would suspect the response to that would be to update the QEMU code
>>> to  identify the page size of the shared memory ramblock. At that
>>> point I suspect it would start behaving the same as how it currently
>>> handles anonymous memory, and the work done here would essentially
>>> have been wasted other than triggering the desire to resolve this in
>>> QEMU to avoid a performance regression.
>>>
>>> The code for inflating a the balloon in virtio-balloon in QEMU can be
>>> found here:
>>> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
>>>
>>> If there is a way for us to just populate the value obtained via
>>> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
>>> which is the size I am assuming it is at since you indicated that it
>>> is just freeing the base page size, then we could address the same
>>> issue and likely get the desired outcome of freeing the entire THP
>>> page when it is no longer used.
>> If qemu could punch hole (this is how qemu free file-backed memory) in
>> THP unit, either w/ or w/o the patch the THP won't get split since the
>> whole THP will get truncated. But, if qemu has to free memory in sub-THP
>> size due to whatever reason (for example, 1MB for every 2MB section),
>> then we have to split THP otherwise no memory will be freed actually
>> with the current code. It is not about performance, it is about really
>> giving memory back to host.
> I get that, but at the same time I am not sure if everyone will be
> happy with the trade-off. That is my concern.
>
> You may want to change the patch description above if that is the
> case. Based on the description above it makes it sound as if the issue
> is that QEMU is using hole punch or MADV_DONTNEED with the wrong
> granularity. Based on your comment here it sounds like you want to
> have the ability to break up the larger THP page as soon as you want
> to push out a single 4K page from it.

Yes, you are right. The commit log may be confusing. What I wanted to 
convey is QEMU has no idea if THP is used or not so it treats memory 
with base size unless hugetlbfs is used since QEMU is aware huge page is 
used in this case.
This may sounds irrelevant to the problem, I would just remove that.

>
> I am not sure the description for the behavior of anonymous THP with
> respect to QEMU makes sense either. Based on the description you made
> it sound like it was somehow using the same process used for huge
> pages. That isn't the case right? My understanding is that in the case
> of an anonymous THP it is getting broken into 4K subpages and then
> those are freed individually. That should leave you with the same
> performance regression that I had brought up earlier.

No, anonymous THP won't get split immediately and those memory also 
won't get freed immediately if QEMU does MADV_DONTNEED on sub THP range 
(for example, 1MB range in THP). The THP will get freed when:
1. Host has memory pressure. The THP will get split and unmapped pages 
will be freed.
2. Other sub pages in the same THP are MADV_DONTNEED'ed (eventually the 
whole THP get unmapped).

The difference between shmem and anonymous page is shmem will not get 
freed unless hole punch the whole THP, anonymous page will get freed 
sooner or later.

>
> So if anything it sounds like what you are really wanting is feature
> parity with anonymous THP which will split the anonymous THP page when
> a single 4K page within the THP is hit with MADV_DONTNEED.
>
> - Alex
Yang Shi Feb. 26, 2020, 5:43 p.m. UTC | #22
On 2/24/20 7:46 PM, Hugh Dickins wrote:
> On Tue, 14 Jan 2020, Yang Shi wrote:
>> On 12/4/19 4:15 PM, Hugh Dickins wrote:
>>> On Wed, 4 Dec 2019, Yang Shi wrote:
>>>
>>>> Currently when truncating shmem file, if the range is partial of THP
>>>> (start or end is in the middle of THP), the pages actually will just get
>>>> cleared rather than being freed unless the range cover the whole THP.
>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>> the THP may still be kept in page cache.  This might be fine for some
>>>> usecases which prefer preserving THP.
>>>>
>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>>>> So, when using shmem THP as memory backend QEMU inflation actually
>>>> doesn't
>>>> work as expected since it doesn't free memory.  But, the inflation
>>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>>> freed right away too but it will be freed eventually when all subpages
>>>> are
>>>> unmapped, but shmem THP would still stay in page cache.
>>>>
>>>> Split THP right away when doing partial hole punch, and if split fails
>>>> just clear the page so that read to the hole punched area would return
>>>> zero.
>>>>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>> ---
>>>> v2: * Adopted the comment from Kirill.
>>>>       * Dropped fallocate mode flag, THP split is the default behavior.
>>>>       * Blended Huge's implementation with my v1 patch. TBH I'm not very
>>>> keen to
>>>>         Hugh's find_get_entries() hack (basically neutral), but without
>>>> that hack
>>> Thanks for giving it a try.  I'm not neutral about my find_get_entries()
>>> hack: it surely had to go (without it, I'd have just pushed my own patch).
>>> I've not noticed anything wrong with your patch, and it's in the right
>>> direction, but I'm still not thrilled with it.  I also remember that I
>>> got the looping wrong in my first internal attempt (fixed in what I sent),
>>> and need to be very sure of the try-again-versus-move-on-to-next conditions
>>> before agreeing to anything.  No rush, I'll come back to this in days or
>>> month ahead: I'll try to find a less gotoey blend of yours and mine.
>> Hi Hugh,
>>
>> Any update on this one?
> I apologize for my dreadful unresponsiveness.
>
> I've spent the last day trying to love yours, and considering how mine
> might be improved; but repeatedly arrived at the conclusion that mine is
> about as nice as we can manage, just needing more comment to dignify it.

Those gotoes do seems convoluted, I do agree.

>
> I did willingly call my find_get_entries() stopping at PageTransCompound
> a hack; but now think we should just document that behavior and accept it.
> The contortions of your patch come from the need to release those 14 extra
> unwanted references: much better not to get them in the first place.
>
> Neither of us handle a failed split optimally, we treat every tail as an
> opportunity to retry: which is good to recover from transient failures,
> but probably excessive.  And we both have to restart the pagevec after
> each attempt, but at least I don't get 14 unwanted extras each time.
>
> What of other find_get_entries() users and its pagevec_lookup_entries()
> wrapper: does an argument to select this "stop at PageTransCompound"
> behavior need to be added?
>
> No.  The pagevec_lookup_entries() calls from mm/truncate.c prefer the
> new behavior - evicting the head from page cache removes all the tails
> along with it, so getting the tails a waste of time there too, just as
> it was in shmem_undo_range().

TBH I'm not a fun of this hack. This would bring in other confusion or 
complexity. Pagevec is supposed to count in the number of base page, now 
it would treat THP as one page, and there might be mixed base page and 
THP in one pagevec. But, I tend to agree avoiding getting those 14 extra 
pins at the first place might be a better approach. All the complexity 
are used to release those extra pins.

>
> Whereas shmem_unlock_mapping() and shmem_seek_hole_data(), as they
> stand, are not removing pages from cache, but actually wanting to plod
> through the tails.  So those two would be slowed a little, while the
> pagevec collects 1 instead of 15 pages.  However: if we care about those
> two at all, it's clear that we should speed them up, by noticing the
> PageTransCompound case and accelerating over it, instead of plodding
> through the tails.  Since we haven't bothered with that optimization
> yet, I'm not very worried to slow them down a little until it's done.
>
> I must take a look at where we stand with tmpfs 64-bit ino tomorrow,
> then recomment my shmem_punch_compound() patch and post it properly,
> probably day after.  (Reviewing it, I see I have a "page->index +
> HPAGE_PMD_NR <= end" test which needs correcting - I tend to live
> in the past, before v4.13 doubled the 32-bit MAX_LFS_FILESIZE.)
>
> I notice that this thread has veered off into QEMU ballooning
> territory: which may indeed be important, but there's nothing at all
> that I can contribute on that.  I certainly do not want to slow down
> anything important, but remain convinced that the correct filesystem
> implementation for punching a hole is to punch a hole.
>
> Hugh
David Hildenbrand Feb. 26, 2020, 5:45 p.m. UTC | #23
On 26.02.20 18:31, Yang Shi wrote:
> 
> 
> On 2/21/20 4:24 PM, Alexander Duyck wrote:
>> On Fri, Feb 21, 2020 at 10:24 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>
>>>
>>> On 2/20/20 10:16 AM, Alexander Duyck wrote:
>>>> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>>> Currently when truncating shmem file, if the range is partial of THP
>>>>> (start or end is in the middle of THP), the pages actually will just get
>>>>> cleared rather than being freed unless the range cover the whole THP.
>>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>>> the THP may still be kept in page cache.  This might be fine for some
>>>>> usecases which prefer preserving THP.
>>>>>
>>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>>>>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>>>>> work as expected since it doesn't free memory.  But, the inflation
>>>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>>>> freed right away too but it will be freed eventually when all subpages are
>>>>> unmapped, but shmem THP would still stay in page cache.
>>>>>
>>>>> Split THP right away when doing partial hole punch, and if split fails
>>>>> just clear the page so that read to the hole punched area would return
>>>>> zero.
>>>>>
>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>> One question I would have is if this is really the desired behavior we
>>>> are looking for?
>>>>
>>>> By proactively splitting the THP you are likely going to see a
>>>> performance regression with the virtio-balloon driver enabled in QEMU.
>>>> I would suspect the response to that would be to update the QEMU code
>>>> to  identify the page size of the shared memory ramblock. At that
>>>> point I suspect it would start behaving the same as how it currently
>>>> handles anonymous memory, and the work done here would essentially
>>>> have been wasted other than triggering the desire to resolve this in
>>>> QEMU to avoid a performance regression.
>>>>
>>>> The code for inflating a the balloon in virtio-balloon in QEMU can be
>>>> found here:
>>>> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
>>>>
>>>> If there is a way for us to just populate the value obtained via
>>>> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
>>>> which is the size I am assuming it is at since you indicated that it
>>>> is just freeing the base page size, then we could address the same
>>>> issue and likely get the desired outcome of freeing the entire THP
>>>> page when it is no longer used.
>>> If qemu could punch hole (this is how qemu free file-backed memory) in
>>> THP unit, either w/ or w/o the patch the THP won't get split since the
>>> whole THP will get truncated. But, if qemu has to free memory in sub-THP
>>> size due to whatever reason (for example, 1MB for every 2MB section),
>>> then we have to split THP otherwise no memory will be freed actually
>>> with the current code. It is not about performance, it is about really
>>> giving memory back to host.
>> I get that, but at the same time I am not sure if everyone will be
>> happy with the trade-off. That is my concern.
>>
>> You may want to change the patch description above if that is the
>> case. Based on the description above it makes it sound as if the issue
>> is that QEMU is using hole punch or MADV_DONTNEED with the wrong
>> granularity. Based on your comment here it sounds like you want to
>> have the ability to break up the larger THP page as soon as you want
>> to push out a single 4K page from it.
> 
> Yes, you are right. The commit log may be confusing. What I wanted to 
> convey is QEMU has no idea if THP is used or not so it treats memory 
> with base size unless hugetlbfs is used since QEMU is aware huge page is 
> used in this case.
> This may sounds irrelevant to the problem, I would just remove that.
> 
>>
>> I am not sure the description for the behavior of anonymous THP with
>> respect to QEMU makes sense either. Based on the description you made
>> it sound like it was somehow using the same process used for huge
>> pages. That isn't the case right? My understanding is that in the case
>> of an anonymous THP it is getting broken into 4K subpages and then
>> those are freed individually. That should leave you with the same
>> performance regression that I had brought up earlier.
> 
> No, anonymous THP won't get split immediately and those memory also 
> won't get freed immediately if QEMU does MADV_DONTNEED on sub THP range 
> (for example, 1MB range in THP). The THP will get freed when:
> 1. Host has memory pressure. The THP will get split and unmapped pages 
> will be freed.
> 2. Other sub pages in the same THP are MADV_DONTNEED'ed (eventually the 
> whole THP get unmapped).
> 
> The difference between shmem and anonymous page is shmem will not get 
> freed unless hole punch the whole THP, anonymous page will get freed 
> sooner or later.
> 

As far as I understood Hugh, the "page size" we'll see in QEMU via
fstatfs() is 4k, not 2MB. IMHO, that's the block size of the "device",
and breaking up THP is the right think to to obey the documentation of
"FALLOC_FL_PUNCH_HOLE".

IMHO THP is called "transparent" because it shouldn't have any such
visible side effects.

As always, anybody correct me if I am wrong here.
Yang Shi Feb. 26, 2020, 6 p.m. UTC | #24
On 2/26/20 9:45 AM, David Hildenbrand wrote:
> On 26.02.20 18:31, Yang Shi wrote:
>>
>> On 2/21/20 4:24 PM, Alexander Duyck wrote:
>>> On Fri, Feb 21, 2020 at 10:24 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>>
>>>> On 2/20/20 10:16 AM, Alexander Duyck wrote:
>>>>> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>>>> Currently when truncating shmem file, if the range is partial of THP
>>>>>> (start or end is in the middle of THP), the pages actually will just get
>>>>>> cleared rather than being freed unless the range cover the whole THP.
>>>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>>>> the THP may still be kept in page cache.  This might be fine for some
>>>>>> usecases which prefer preserving THP.
>>>>>>
>>>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>>>>>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>>>>>> work as expected since it doesn't free memory.  But, the inflation
>>>>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>>>>> freed right away too but it will be freed eventually when all subpages are
>>>>>> unmapped, but shmem THP would still stay in page cache.
>>>>>>
>>>>>> Split THP right away when doing partial hole punch, and if split fails
>>>>>> just clear the page so that read to the hole punched area would return
>>>>>> zero.
>>>>>>
>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>>> One question I would have is if this is really the desired behavior we
>>>>> are looking for?
>>>>>
>>>>> By proactively splitting the THP you are likely going to see a
>>>>> performance regression with the virtio-balloon driver enabled in QEMU.
>>>>> I would suspect the response to that would be to update the QEMU code
>>>>> to  identify the page size of the shared memory ramblock. At that
>>>>> point I suspect it would start behaving the same as how it currently
>>>>> handles anonymous memory, and the work done here would essentially
>>>>> have been wasted other than triggering the desire to resolve this in
>>>>> QEMU to avoid a performance regression.
>>>>>
>>>>> The code for inflating a the balloon in virtio-balloon in QEMU can be
>>>>> found here:
>>>>> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
>>>>>
>>>>> If there is a way for us to just populate the value obtained via
>>>>> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
>>>>> which is the size I am assuming it is at since you indicated that it
>>>>> is just freeing the base page size, then we could address the same
>>>>> issue and likely get the desired outcome of freeing the entire THP
>>>>> page when it is no longer used.
>>>> If qemu could punch hole (this is how qemu free file-backed memory) in
>>>> THP unit, either w/ or w/o the patch the THP won't get split since the
>>>> whole THP will get truncated. But, if qemu has to free memory in sub-THP
>>>> size due to whatever reason (for example, 1MB for every 2MB section),
>>>> then we have to split THP otherwise no memory will be freed actually
>>>> with the current code. It is not about performance, it is about really
>>>> giving memory back to host.
>>> I get that, but at the same time I am not sure if everyone will be
>>> happy with the trade-off. That is my concern.
>>>
>>> You may want to change the patch description above if that is the
>>> case. Based on the description above it makes it sound as if the issue
>>> is that QEMU is using hole punch or MADV_DONTNEED with the wrong
>>> granularity. Based on your comment here it sounds like you want to
>>> have the ability to break up the larger THP page as soon as you want
>>> to push out a single 4K page from it.
>> Yes, you are right. The commit log may be confusing. What I wanted to
>> convey is QEMU has no idea if THP is used or not so it treats memory
>> with base size unless hugetlbfs is used since QEMU is aware huge page is
>> used in this case.
>> This may sounds irrelevant to the problem, I would just remove that.
>>
>>> I am not sure the description for the behavior of anonymous THP with
>>> respect to QEMU makes sense either. Based on the description you made
>>> it sound like it was somehow using the same process used for huge
>>> pages. That isn't the case right? My understanding is that in the case
>>> of an anonymous THP it is getting broken into 4K subpages and then
>>> those are freed individually. That should leave you with the same
>>> performance regression that I had brought up earlier.
>> No, anonymous THP won't get split immediately and those memory also
>> won't get freed immediately if QEMU does MADV_DONTNEED on sub THP range
>> (for example, 1MB range in THP). The THP will get freed when:
>> 1. Host has memory pressure. The THP will get split and unmapped pages
>> will be freed.
>> 2. Other sub pages in the same THP are MADV_DONTNEED'ed (eventually the
>> whole THP get unmapped).
>>
>> The difference between shmem and anonymous page is shmem will not get
>> freed unless hole punch the whole THP, anonymous page will get freed
>> sooner or later.
>>
> As far as I understood Hugh, the "page size" we'll see in QEMU via
> fstatfs() is 4k, not 2MB. IMHO, that's the block size of the "device",
> and breaking up THP is the right think to to obey the documentation of
> "FALLOC_FL_PUNCH_HOLE".

This is what the patch attempts to accomplish.

>
> IMHO THP is called "transparent" because it shouldn't have any such
> visible side effects.

AFAICT, the lazy split is due to locking issue in partial unmap paths. 
Please refer to "Partial unmap and deferred_split_huge_page()" section 
in Documentation/vm/transhuge.rst.

>
> As always, anybody correct me if I am wrong here.
>
Hugh Dickins Feb. 27, 2020, 12:56 a.m. UTC | #25
On Wed, 26 Feb 2020, Yang Shi wrote:
> On 2/21/20 4:24 PM, Alexander Duyck wrote:
> > On Fri, Feb 21, 2020 at 10:24 AM Yang Shi <yang.shi@linux.alibaba.com>
> > wrote:
> > > On 2/20/20 10:16 AM, Alexander Duyck wrote:
> > > > On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com>
> > > > wrote:
> > > > > Currently when truncating shmem file, if the range is partial of THP
> > > > > (start or end is in the middle of THP), the pages actually will just
> > > > > get
> > > > > cleared rather than being freed unless the range cover the whole THP.
> > > > > Even though all the subpages are truncated (randomly or
> > > > > sequentially),
> > > > > the THP may still be kept in page cache.  This might be fine for some
> > > > > usecases which prefer preserving THP.
> > > > > 
> > > > > But, when doing balloon inflation in QEMU, QEMU actually does hole
> > > > > punch
> > > > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
> > > > > used.
> > > > > So, when using shmem THP as memory backend QEMU inflation actually
> > > > > doesn't
> > > > > work as expected since it doesn't free memory.  But, the inflation
> > > > > usecase really needs get the memory freed.  Anonymous THP will not
> > > > > get
> > > > > freed right away too but it will be freed eventually when all
> > > > > subpages are
> > > > > unmapped, but shmem THP would still stay in page cache.
> > > > > 
> > > > > Split THP right away when doing partial hole punch, and if split
> > > > > fails
> > > > > just clear the page so that read to the hole punched area would
> > > > > return
> > > > > zero.
> > > > > 
> > > > > Cc: Hugh Dickins <hughd@google.com>
> > > > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > > > One question I would have is if this is really the desired behavior we
> > > > are looking for?
> > > > 
> > > > By proactively splitting the THP you are likely going to see a
> > > > performance regression with the virtio-balloon driver enabled in QEMU.
> > > > I would suspect the response to that would be to update the QEMU code
> > > > to  identify the page size of the shared memory ramblock. At that
> > > > point I suspect it would start behaving the same as how it currently
> > > > handles anonymous memory, and the work done here would essentially
> > > > have been wasted other than triggering the desire to resolve this in
> > > > QEMU to avoid a performance regression.
> > > > 
> > > > The code for inflating a the balloon in virtio-balloon in QEMU can be
> > > > found here:
> > > > https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
> > > > 
> > > > If there is a way for us to just populate the value obtained via
> > > > qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
> > > > which is the size I am assuming it is at since you indicated that it
> > > > is just freeing the base page size, then we could address the same
> > > > issue and likely get the desired outcome of freeing the entire THP
> > > > page when it is no longer used.
> > > If qemu could punch hole (this is how qemu free file-backed memory) in
> > > THP unit, either w/ or w/o the patch the THP won't get split since the
> > > whole THP will get truncated. But, if qemu has to free memory in sub-THP
> > > size due to whatever reason (for example, 1MB for every 2MB section),
> > > then we have to split THP otherwise no memory will be freed actually
> > > with the current code. It is not about performance, it is about really
> > > giving memory back to host.
> > I get that, but at the same time I am not sure if everyone will be
> > happy with the trade-off. That is my concern.
> > 
> > You may want to change the patch description above if that is the
> > case. Based on the description above it makes it sound as if the issue
> > is that QEMU is using hole punch or MADV_DONTNEED with the wrong
> > granularity. Based on your comment here it sounds like you want to
> > have the ability to break up the larger THP page as soon as you want
> > to push out a single 4K page from it.
> 
> Yes, you are right. The commit log may be confusing. What I wanted to convey
> is QEMU has no idea if THP is used or not so it treats memory with base size
> unless hugetlbfs is used since QEMU is aware huge page is used in this case.
> This may sounds irrelevant to the problem, I would just remove that.

Oh, I'm sad to read that, since I was yanking most of your commit
message (as "Yang Shi writes") into my version, to give stronger
and independent justification for the change.

If I try to write about QEMU and ballooning myself, nonsense is sure to
emerge; but I don't know what part "I would just remove that" refers to.

May I beg you for an updated paragraph or two, explaining why you
want to see the change?

Thanks,
Hugh
Yang Shi Feb. 27, 2020, 1:14 a.m. UTC | #26
On 2/26/20 4:56 PM, Hugh Dickins wrote:
> On Wed, 26 Feb 2020, Yang Shi wrote:
>> On 2/21/20 4:24 PM, Alexander Duyck wrote:
>>> On Fri, Feb 21, 2020 at 10:24 AM Yang Shi <yang.shi@linux.alibaba.com>
>>> wrote:
>>>> On 2/20/20 10:16 AM, Alexander Duyck wrote:
>>>>> On Tue, Dec 3, 2019 at 4:43 PM Yang Shi <yang.shi@linux.alibaba.com>
>>>>> wrote:
>>>>>> Currently when truncating shmem file, if the range is partial of THP
>>>>>> (start or end is in the middle of THP), the pages actually will just
>>>>>> get
>>>>>> cleared rather than being freed unless the range cover the whole THP.
>>>>>> Even though all the subpages are truncated (randomly or
>>>>>> sequentially),
>>>>>> the THP may still be kept in page cache.  This might be fine for some
>>>>>> usecases which prefer preserving THP.
>>>>>>
>>>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole
>>>>>> punch
>>>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
>>>>>> used.
>>>>>> So, when using shmem THP as memory backend QEMU inflation actually
>>>>>> doesn't
>>>>>> work as expected since it doesn't free memory.  But, the inflation
>>>>>> usecase really needs get the memory freed.  Anonymous THP will not
>>>>>> get
>>>>>> freed right away too but it will be freed eventually when all
>>>>>> subpages are
>>>>>> unmapped, but shmem THP would still stay in page cache.
>>>>>>
>>>>>> Split THP right away when doing partial hole punch, and if split
>>>>>> fails
>>>>>> just clear the page so that read to the hole punched area would
>>>>>> return
>>>>>> zero.
>>>>>>
>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>>> One question I would have is if this is really the desired behavior we
>>>>> are looking for?
>>>>>
>>>>> By proactively splitting the THP you are likely going to see a
>>>>> performance regression with the virtio-balloon driver enabled in QEMU.
>>>>> I would suspect the response to that would be to update the QEMU code
>>>>> to  identify the page size of the shared memory ramblock. At that
>>>>> point I suspect it would start behaving the same as how it currently
>>>>> handles anonymous memory, and the work done here would essentially
>>>>> have been wasted other than triggering the desire to resolve this in
>>>>> QEMU to avoid a performance regression.
>>>>>
>>>>> The code for inflating a the balloon in virtio-balloon in QEMU can be
>>>>> found here:
>>>>> https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c#L66
>>>>>
>>>>> If there is a way for us to just populate the value obtained via
>>>>> qemu_ram_pagesize with the THP page size instead of leaving it at 4K,
>>>>> which is the size I am assuming it is at since you indicated that it
>>>>> is just freeing the base page size, then we could address the same
>>>>> issue and likely get the desired outcome of freeing the entire THP
>>>>> page when it is no longer used.
>>>> If qemu could punch hole (this is how qemu free file-backed memory) in
>>>> THP unit, either w/ or w/o the patch the THP won't get split since the
>>>> whole THP will get truncated. But, if qemu has to free memory in sub-THP
>>>> size due to whatever reason (for example, 1MB for every 2MB section),
>>>> then we have to split THP otherwise no memory will be freed actually
>>>> with the current code. It is not about performance, it is about really
>>>> giving memory back to host.
>>> I get that, but at the same time I am not sure if everyone will be
>>> happy with the trade-off. That is my concern.
>>>
>>> You may want to change the patch description above if that is the
>>> case. Based on the description above it makes it sound as if the issue
>>> is that QEMU is using hole punch or MADV_DONTNEED with the wrong
>>> granularity. Based on your comment here it sounds like you want to
>>> have the ability to break up the larger THP page as soon as you want
>>> to push out a single 4K page from it.
>> Yes, you are right. The commit log may be confusing. What I wanted to convey
>> is QEMU has no idea if THP is used or not so it treats memory with base size
>> unless hugetlbfs is used since QEMU is aware huge page is used in this case.
>> This may sounds irrelevant to the problem, I would just remove that.
> Oh, I'm sad to read that, since I was yanking most of your commit
> message (as "Yang Shi writes") into my version, to give stronger
> and independent justification for the change.
>
> If I try to write about QEMU and ballooning myself, nonsense is sure to
> emerge; but I don't know what part "I would just remove that" refers to.
>
> May I beg you for an updated paragraph or two, explaining why you
> want to see the change?

I think Alexander means this line "But, when doing balloon inflation in 
QEMU, QEMU actually does hole punch
or MADV_DONTNEED in base page size granulairty if hugetlbfs is not 
used." He thought it may confuse people thought this is QEMU issue. 
Actually, according to the later discussion, it sounds like a limitation 
from balloon driver, which just can deal with 4K size page. So, we could 
rephrase it to "Balloon inflation is handled in base page size."

>
> Thanks,
> Hugh
Matthew Wilcox Feb. 27, 2020, 1:16 a.m. UTC | #27
On Wed, Feb 26, 2020 at 09:43:53AM -0800, Yang Shi wrote:
> > No.  The pagevec_lookup_entries() calls from mm/truncate.c prefer the
> > new behavior - evicting the head from page cache removes all the tails
> > along with it, so getting the tails a waste of time there too, just as
> > it was in shmem_undo_range().
> 
> TBH I'm not a fun of this hack. This would bring in other confusion or
> complexity. Pagevec is supposed to count in the number of base page, now it
> would treat THP as one page, and there might be mixed base page and THP in
> one pagevec. But, I tend to agree avoiding getting those 14 extra pins at
> the first place might be a better approach. All the complexity are used to
> release those extra pins.

My long-term goal is to eradicate tail pages entirely, so a pagevec will
end up containing pages of different sizes.  If you want to help move
in this direction, I'd be awfully grateful.  But I wouldn't say that's
in any way a prerequisite for fixing this current problem.
Hugh Dickins Feb. 27, 2020, 1:37 a.m. UTC | #28
On Wed, 26 Feb 2020, Yang Shi wrote:
> On 2/24/20 7:46 PM, Hugh Dickins wrote:
> > 
> > I did willingly call my find_get_entries() stopping at PageTransCompound
> > a hack; but now think we should just document that behavior and accept it.
> > The contortions of your patch come from the need to release those 14 extra
> > unwanted references: much better not to get them in the first place.
> > 
> > Neither of us handle a failed split optimally, we treat every tail as an
> > opportunity to retry: which is good to recover from transient failures,
> > but probably excessive.  And we both have to restart the pagevec after
> > each attempt, but at least I don't get 14 unwanted extras each time.
> > 
> > What of other find_get_entries() users and its pagevec_lookup_entries()
> > wrapper: does an argument to select this "stop at PageTransCompound"
> > behavior need to be added?
> > 
> > No.  The pagevec_lookup_entries() calls from mm/truncate.c prefer the
> > new behavior - evicting the head from page cache removes all the tails
> > along with it, so getting the tails a waste of time there too, just as
> > it was in shmem_undo_range().
> 
> TBH I'm not a fun of this hack. This would bring in other confusion or
> complexity. Pagevec is supposed to count in the number of base page, now it
> would treat THP as one page, and there might be mixed base page and THP in
> one pagevec.

I agree that it would be horrid if find_get_entries() and
pagevec_lookup_entries() were switched to returning just one page
for a THP, demanding all callers to cope with its huge size along
with the small sizes of other pages in the vector.  I don't know how
to get such an interface to work at all: it's essential to be able
to deliver tail pages from a requested offset in the compound page.

No, that's not what the find_get_entries() modification does: it
takes advantage of the fact that no caller expects it to guarantee
a full pagevec, so terminates the pagevec early when it encounters
any head or tail subpage of the compound page.  Then the next call
to it (if caller does not have code to skip the extent - which
removal of head from page cache does) returns just the next tail,
etc, until all have been delivered.  All as small pages.

(Aside from the comments, I have made one adjustment to what I
showed before: though it appears now that hugetlbfs happens not
to use pagevec_lookup_entries(), not directly anyway, I'm more
comfortable checking PageTransHuge && !PageHuge, so that it would
not go one-at-a-time on hugetlbfs pages.  But found I was wrong
earlier when I said the "page->index + HPAGE_PMD_NR <= end" test
needed correcting for 32-bit: it's working on PageHead there, so
there's no chance of that "+ HPAGE_PMD_NR" wrapping around: left
unchanged, what it's doing is clearer that way than with macros.)

Hugh

> But, I tend to agree avoiding getting those 14 extra pins at the
> first place might be a better approach. All the complexity are used to
> release those extra pins.
Hugh Dickins Feb. 27, 2020, 1:47 a.m. UTC | #29
On Wed, 26 Feb 2020, Matthew Wilcox wrote:
> On Wed, Feb 26, 2020 at 09:43:53AM -0800, Yang Shi wrote:
> > > No.  The pagevec_lookup_entries() calls from mm/truncate.c prefer the
> > > new behavior - evicting the head from page cache removes all the tails
> > > along with it, so getting the tails a waste of time there too, just as
> > > it was in shmem_undo_range().
> > 
> > TBH I'm not a fun of this hack. This would bring in other confusion or
> > complexity. Pagevec is supposed to count in the number of base page, now it
> > would treat THP as one page, and there might be mixed base page and THP in
> > one pagevec. But, I tend to agree avoiding getting those 14 extra pins at
> > the first place might be a better approach. All the complexity are used to
> > release those extra pins.
> 
> My long-term goal is to eradicate tail pages entirely, so a pagevec will
> end up containing pages of different sizes.  If you want to help move
> in this direction, I'd be awfully grateful.  But I wouldn't say that's
> in any way a prerequisite for fixing this current problem.

You're right to be moving in that direction, but yes, that is a larger
task, and I think both Yang and I have to decline your awful gratitude :)

Hugh
diff mbox series

Patch

diff --git a/mm/shmem.c b/mm/shmem.c
index 220be9f..1ae0c7f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -806,12 +806,15 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	long nr_swaps_freed = 0;
 	pgoff_t index;
 	int i;
+	bool split = false;
+	struct page *page = NULL;
 
 	if (lend == -1)
 		end = -1;	/* unsigned, so actually very big */
 
 	pagevec_init(&pvec);
 	index = start;
+retry:
 	while (index < end) {
 		pvec.nr = find_get_entries(mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
@@ -819,7 +822,8 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		if (!pvec.nr)
 			break;
 		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+			split = false;
+			page = pvec.pages[i];
 
 			index = indices[i];
 			if (index >= end)
@@ -838,23 +842,24 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if (!trylock_page(page))
 				continue;
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
+			if (PageTransCompound(page) && !unfalloc) {
+				if (PageHead(page) &&
+				    index != round_down(end, HPAGE_PMD_NR)) {
 					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
+					 * Fall through when punching whole
+					 * THP.
 					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
+					index += HPAGE_PMD_NR - 1;
+					i += HPAGE_PMD_NR - 1;
+				} else {
+					/*
+					 * Split THP for any partial hole
+					 * punch.
+					 */
+					get_page(page);
+					split = true;
+					goto split;
 				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
 			}
 
 			if (!unfalloc || !PageUptodate(page)) {
@@ -866,9 +871,29 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
+split:
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		cond_resched();
+
+		if (split) {
+			/*
+			 * The pagevec_release() released all extra pins
+			 * from pagevec lookup.  And we hold an extra pin
+			 * and still have the page locked under us.
+			 */
+			if (!split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				/* Re-lookup page cache from current index */
+				goto retry;
+			}
+
+			/* Fail to split THP, move to next index */
+			unlock_page(page);
+			put_page(page);
+		}
+
 		index++;
 	}
 
@@ -901,6 +926,7 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		return;
 
 	index = start;
+again:
 	while (index < end) {
 		cond_resched();
 
@@ -916,7 +942,8 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			continue;
 		}
 		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+			split = false;
+			page = pvec.pages[i];
 
 			index = indices[i];
 			if (index >= end)
@@ -936,30 +963,24 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
 			lock_page(page);
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				/*
-				 * Partial thp truncate due 'start' in middle
-				 * of THP: don't need to look on these pages
-				 * again on !pvec.nr restart.
-				 */
-				if (index != round_down(end, HPAGE_PMD_NR))
-					start++;
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
+			if (PageTransCompound(page) && !unfalloc) {
+				if (PageHead(page) &&
+				    index != round_down(end, HPAGE_PMD_NR)) {
 					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
+					 * Fall through when punching whole
+					 * THP.
 					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
+					index += HPAGE_PMD_NR - 1;
+					i += HPAGE_PMD_NR - 1;
+				} else {
+					/*
+					 * Split THP for any partial hole
+					 * punch.
+					 */
+					get_page(page);
+					split = true;
+					goto rescan_split;
 				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
 			}
 
 			if (!unfalloc || !PageUptodate(page)) {
@@ -976,8 +997,33 @@  static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
+rescan_split:
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
+
+		if (split) {
+			/*
+			 * The pagevec_release() released all extra pins
+			 * from pagevec lookup.  And we hold an extra pin
+			 * and still have the page locked under us.
+			 */
+			if (!split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				/* Re-lookup page cache from current index */
+				goto again;
+			}
+
+			/*
+			 * Split fail, clear the page then move to next
+			 * index.
+			 */
+			clear_highpage(page);
+
+			unlock_page(page);
+			put_page(page);
+		}
+
 		index++;
 	}