diff mbox series

[RFC,10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped

Message ID 20241216165105.56185-11-dev.jain@arm.com (mailing list archive)
State New
Headers show
Series khugepaged: Asynchronous mTHP collapse | expand

Commit Message

Dev Jain Dec. 16, 2024, 4:51 p.m. UTC
We may hit a situation wherein we have a larger folio mapped. It is incorrect
to go ahead with the collapse since some pages will be unmapped, leading to
the entire folio getting unmapped. Therefore, skip the corresponding range.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
In the future, if at all it is required that at some point we want all the folios
in the system to be of a specific order, we may split these larger folios.

 mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

Comments

Ryan Roberts Dec. 18, 2024, 7:36 a.m. UTC | #1
On 16/12/2024 16:51, Dev Jain wrote:
> We may hit a situation wherein we have a larger folio mapped. It is incorrect
> to go ahead with the collapse since some pages will be unmapped, leading to
> the entire folio getting unmapped. Therefore, skip the corresponding range.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> In the future, if at all it is required that at some point we want all the folios
> in the system to be of a specific order, we may split these larger folios.
> 
>  mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8040b130e677..47e7c476b893 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -33,6 +33,7 @@ enum scan_result {
>  	SCAN_PMD_NULL,
>  	SCAN_PMD_NONE,
>  	SCAN_PMD_MAPPED,
> +	SCAN_PTE_MAPPED,
>  	SCAN_EXCEED_NONE_PTE,
>  	SCAN_EXCEED_SWAP_PTE,
>  	SCAN_EXCEED_SHARED_PTE,
> @@ -609,6 +610,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		folio = page_folio(page);
>  		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>  
> +		if (order !=HPAGE_PMD_ORDER && folio_order(folio) >= order) {
> +			result = SCAN_PTE_MAPPED;
> +			goto out;
> +		}
> +
>  		/* See hpage_collapse_scan_ptes(). */
>  		if (folio_likely_mapped_shared(folio)) {
>  			++shared;
> @@ -1369,6 +1375,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  	unsigned long orders;
>  	pte_t *pte, *_pte;
>  	spinlock_t *ptl;
> +	int found_order;
>  	pmd_t *pmd;
>  	int order;
>  
> @@ -1467,6 +1474,24 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  			goto out_unmap;
>  		}
>  
> +		found_order = folio_order(folio);
> +
> +		/*
> +		 * No point of scanning. Two options: if this folio was hit
> +		 * somewhere in the middle of the scan, then drop down the
> +		 * order. Or, completely skip till the end of this folio. The
> +		 * latter gives us a higher order to start with, with atmost
> +		 * 1 << order PTEs not collapsed; the former may force us
> +		 * to end up going below order 2 and exiting.
> +		 */
> +		if (order != HPAGE_PMD_ORDER && found_order >= order) {
> +			result = SCAN_PTE_MAPPED;
> +			_address += (PAGE_SIZE << found_order);
> +			_pte += (1UL << found_order);
> +			pte_unmap_unlock(pte, ptl);
> +			goto decide_order;
> +		}

It would be good if you can spell out the desired policy when khugepaged hits
partially unmapped large folios and unaligned large folios. I think the simple
approach is to always collapse them to fully mapped, aligned folios even if the
resulting order is smaller than the original. But I'm not sure that's definitely
going to always be the best thing.

Regardless, I'm struggling to understand the logic in this patch. Taking the
order of a folio based on having hit one of it's pages says anything about
whether the whole of that folio is mapped or not or it's alignment. And it's not
clear to me how we would get to a situation where we are scanning for a lower
order and find a (fully mapped, aligned) folio of higher order in the first place.

Let's assume the desired policy is that khugepaged should always collapse to
naturally aligned large folios. If there happens to be an existing aligned
order-4 folio that is fully mapped, we will identify that for collapse as part
of the scan for order-4. At that point, we should just notice that it is already
an aligned order-4 folio and bypass collapse. Of course we may have already
chosen to collapse it into a higher order, but we should definitely not get to a
lower order before we notice it.

Hmm... I guess if the sysfs thp settings have been changed then things could get
spicy... if order-8 was previously enabled and we have an order-8 folio, then it
get's disabled and khugepaged is scanning for order-4 (which is still enabled)
then hits the order-8; what's the expected policy? rework into 2 order-4 folios
or leave it as as single order-8?


> +
>  		/*
>  		 * We treat a single page as shared if any part of the THP
>  		 * is shared. "False negatives" from
> @@ -1550,6 +1575,10 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  		if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>  			goto out;
>  	}
> +	/* A larger folio was mapped; it will be skipped in next iteration */
> +	if (result == SCAN_PTE_MAPPED)
> +		goto decide_order;
> +
>  	if (result != SCAN_SUCCEED) {
>  
>  		/* Go to the next order. */
> @@ -1558,6 +1587,8 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  			goto out;
>  		goto maybe_mmap_lock;
>  	} else {
> +
> +decide_order:
>  		address = _address;
>  		pte = _pte;
>
Dev Jain Dec. 18, 2024, 9:34 a.m. UTC | #2
On 18/12/24 1:06 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>> to go ahead with the collapse since some pages will be unmapped, leading to
>> the entire folio getting unmapped. Therefore, skip the corresponding range.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> In the future, if at all it is required that at some point we want all the folios
>> in the system to be of a specific order, we may split these larger folios.
>>
>>   mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
>>   1 file changed, 31 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 8040b130e677..47e7c476b893 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -33,6 +33,7 @@ enum scan_result {
>>   	SCAN_PMD_NULL,
>>   	SCAN_PMD_NONE,
>>   	SCAN_PMD_MAPPED,
>> +	SCAN_PTE_MAPPED,
>>   	SCAN_EXCEED_NONE_PTE,
>>   	SCAN_EXCEED_SWAP_PTE,
>>   	SCAN_EXCEED_SHARED_PTE,
>> @@ -609,6 +610,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>   		folio = page_folio(page);
>>   		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>   
>> +		if (order !=HPAGE_PMD_ORDER && folio_order(folio) >= order) {
>> +			result = SCAN_PTE_MAPPED;
>> +			goto out;
>> +		}
>> +
>>   		/* See hpage_collapse_scan_ptes(). */
>>   		if (folio_likely_mapped_shared(folio)) {
>>   			++shared;
>> @@ -1369,6 +1375,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   	unsigned long orders;
>>   	pte_t *pte, *_pte;
>>   	spinlock_t *ptl;
>> +	int found_order;
>>   	pmd_t *pmd;
>>   	int order;
>>   
>> @@ -1467,6 +1474,24 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   			goto out_unmap;
>>   		}
>>   
>> +		found_order = folio_order(folio);
>> +
>> +		/*
>> +		 * No point of scanning. Two options: if this folio was hit
>> +		 * somewhere in the middle of the scan, then drop down the
>> +		 * order. Or, completely skip till the end of this folio. The
>> +		 * latter gives us a higher order to start with, with atmost
>> +		 * 1 << order PTEs not collapsed; the former may force us
>> +		 * to end up going below order 2 and exiting.
>> +		 */
>> +		if (order != HPAGE_PMD_ORDER && found_order >= order) {
>> +			result = SCAN_PTE_MAPPED;
>> +			_address += (PAGE_SIZE << found_order);
>> +			_pte += (1UL << found_order);
>> +			pte_unmap_unlock(pte, ptl);
>> +			goto decide_order;
>> +		}
> It would be good if you can spell out the desired policy when khugepaged hits
> partially unmapped large folios and unaligned large folios. I think the simple
> approach is to always collapse them to fully mapped, aligned folios even if the
> resulting order is smaller than the original. But I'm not sure that's definitely
> going to always be the best thing.
>
> Regardless, I'm struggling to understand the logic in this patch. Taking the
> order of a folio based on having hit one of it's pages says anything about
> whether the whole of that folio is mapped or not or it's alignment. And it's not
> clear to me how we would get to a situation where we are scanning for a lower
> order and find a (fully mapped, aligned) folio of higher order in the first place.
>
> Let's assume the desired policy is that khugepaged should always collapse to
> naturally aligned large folios. If there happens to be an existing aligned
> order-4 folio that is fully mapped, we will identify that for collapse as part
> of the scan for order-4. At that point, we should just notice that it is already
> an aligned order-4 folio and bypass collapse. Of course we may have already
> chosen to collapse it into a higher order, but we should definitely not get to a
> lower order before we notice it.
>
> Hmm... I guess if the sysfs thp settings have been changed then things could get
> spicy... if order-8 was previously enabled and we have an order-8 folio, then it
> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
> then hits the order-8; what's the expected policy? rework into 2 order-4 folios
> or leave it as as single order-8?

Exactly, sorry, I should have made it clear in the patch description that I am
handling the following scenario: there is a long running system on which we are
using order-8 folios, and now we decide to downgrade to order-4. Will it be a
good idea to take the pain of splitting order-8 to 16 order-4 folios? This should
be a rare situation in the first place, so I have currently decided to ignore the
folios set up by the previous sysfs setting and only focus on collapsing fresh memory.

Thinking again, a sys-admin deciding to downgrade order of folios, should do that in
the hopes of reducing internal fragmentation or increasing swap speed etc, so it makes
sense to shatter large folios....maybe we can have a sysfs tunable for this?

>
>> +
>>   		/*
>>   		 * We treat a single page as shared if any part of the THP
>>   		 * is shared. "False negatives" from
>> @@ -1550,6 +1575,10 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   		if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>>   			goto out;
>>   	}
>> +	/* A larger folio was mapped; it will be skipped in next iteration */
>> +	if (result == SCAN_PTE_MAPPED)
>> +		goto decide_order;
>> +
>>   	if (result != SCAN_SUCCEED) {
>>   
>>   		/* Go to the next order. */
>> @@ -1558,6 +1587,8 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   			goto out;
>>   		goto maybe_mmap_lock;
>>   	} else {
>> +
>> +decide_order:
>>   		address = _address;
>>   		pte = _pte;
>>
John Hubbard Dec. 19, 2024, 3:40 a.m. UTC | #3
On 12/18/24 1:34 AM, Dev Jain wrote:
> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>> On 16/12/2024 16:51, Dev Jain wrote:
>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
...
>> It would be good if you can spell out the desired policy when khugepaged hits
>> partially unmapped large folios and unaligned large folios. I think the simple
>> approach is to always collapse them to fully mapped, aligned folios even if the
>> resulting order is smaller than the original. But I'm not sure that's definitely
>> going to always be the best thing.
>>
>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>> order of a folio based on having hit one of it's pages says anything about
>> whether the whole of that folio is mapped or not or it's alignment. And it's not
>> clear to me how we would get to a situation where we are scanning for a lower
>> order and find a (fully mapped, aligned) folio of higher order in the first place.
>>
>> Let's assume the desired policy is that khugepaged should always collapse to
>> naturally aligned large folios. If there happens to be an existing aligned
>> order-4 folio that is fully mapped, we will identify that for collapse as part
>> of the scan for order-4. At that point, we should just notice that it is already
>> an aligned order-4 folio and bypass collapse. Of course we may have already
>> chosen to collapse it into a higher order, but we should definitely not get to a
>> lower order before we notice it.
>>
>> Hmm... I guess if the sysfs thp settings have been changed then things could get
>> spicy... if order-8 was previously enabled and we have an order-8 folio, then it
>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>> then hits the order-8; what's the expected policy? rework into 2 order-4 folios
>> or leave it as as single order-8?
> 
> Exactly, sorry, I should have made it clear in the patch description that I am
> handling the following scenario: there is a long running system on which we are
> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
> good idea to take the pain of splitting order-8 to 16 order-4 folios? This should
> be a rare situation in the first place, so I have currently decided to ignore the
> folios set up by the previous sysfs setting and only focus on collapsing fresh memory.
> 
> Thinking again, a sys-admin deciding to downgrade order of folios, should do that in
> the hopes of reducing internal fragmentation or increasing swap speed etc, so it makes
> sense to shatter large folios....maybe we can have a sysfs tunable for this?

Maybe we should not support it (at runtime) at all. We are trying to build
systems that don't require incredibly detailed sysadmin involvement, and
this level of tweaking qualifies, thoroughly, as "incredibly detailed
sysadmin micromanagement", imho.

Apologies for not having gone through the series in detail yet, but this
point jumped out at me.

thanks,
Zi Yan Dec. 19, 2024, 3:51 a.m. UTC | #4
On 18 Dec 2024, at 22:40, John Hubbard wrote:

> On 12/18/24 1:34 AM, Dev Jain wrote:
>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
> ...
>>> It would be good if you can spell out the desired policy when khugepaged hits
>>> partially unmapped large folios and unaligned large folios. I think the simple
>>> approach is to always collapse them to fully mapped, aligned folios even if the
>>> resulting order is smaller than the original. But I'm not sure that's definitely
>>> going to always be the best thing.
>>>
>>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>>> order of a folio based on having hit one of it's pages says anything about
>>> whether the whole of that folio is mapped or not or it's alignment. And it's not
>>> clear to me how we would get to a situation where we are scanning for a lower
>>> order and find a (fully mapped, aligned) folio of higher order in the first place.
>>>
>>> Let's assume the desired policy is that khugepaged should always collapse to
>>> naturally aligned large folios. If there happens to be an existing aligned
>>> order-4 folio that is fully mapped, we will identify that for collapse as part
>>> of the scan for order-4. At that point, we should just notice that it is already
>>> an aligned order-4 folio and bypass collapse. Of course we may have already
>>> chosen to collapse it into a higher order, but we should definitely not get to a
>>> lower order before we notice it.
>>>
>>> Hmm... I guess if the sysfs thp settings have been changed then things could get
>>> spicy... if order-8 was previously enabled and we have an order-8 folio, then it
>>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>>> then hits the order-8; what's the expected policy? rework into 2 order-4 folios
>>> or leave it as as single order-8?
>>
>> Exactly, sorry, I should have made it clear in the patch description that I am
>> handling the following scenario: there is a long running system on which we are
>> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
>> good idea to take the pain of splitting order-8 to 16 order-4 folios? This should
>> be a rare situation in the first place, so I have currently decided to ignore the
>> folios set up by the previous sysfs setting and only focus on collapsing fresh memory.
>>
>> Thinking again, a sys-admin deciding to downgrade order of folios, should do that in
>> the hopes of reducing internal fragmentation or increasing swap speed etc, so it makes
>> sense to shatter large folios....maybe we can have a sysfs tunable for this?
>
> Maybe we should not support it (at runtime) at all. We are trying to build
> systems that don't require incredibly detailed sysadmin involvement, and
> this level of tweaking qualifies, thoroughly, as "incredibly detailed
> sysadmin micromanagement", imho.

I agree.

Regarding sysctl thp settings, we probably want to change its meaning from
kernel only allows folios with enabled orders to appear in the system to
kernel actively creates folios with enabled orders. Otherwise, like Ryan
said above, if a sysadmin changes sysctl thp settings from order-8 only
to order-4 only, kernel needs to scan all memory to split all order-8 folios?
That sounds like madness.


--
Best Regards,
Yan, Zi
Dev Jain Dec. 19, 2024, 7:59 a.m. UTC | #5
On 19/12/24 9:10 am, John Hubbard wrote:
> On 12/18/24 1:34 AM, Dev Jain wrote:
>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>> We may hit a situation wherein we have a larger folio mapped. It is 
>>>> incorrect
>>>> to go ahead with the collapse since some pages will be unmapped, 
>>>> leading to
>>>> the entire folio getting unmapped. Therefore, skip the 
>>>> corresponding range.
> ...
>>> It would be good if you can spell out the desired policy when 
>>> khugepaged hits
>>> partially unmapped large folios and unaligned large folios. I think 
>>> the simple
>>> approach is to always collapse them to fully mapped, aligned folios 
>>> even if the
>>> resulting order is smaller than the original. But I'm not sure 
>>> that's definitely
>>> going to always be the best thing.
>>>
>>> Regardless, I'm struggling to understand the logic in this patch. 
>>> Taking the
>>> order of a folio based on having hit one of it's pages says anything 
>>> about
>>> whether the whole of that folio is mapped or not or it's alignment. 
>>> And it's not
>>> clear to me how we would get to a situation where we are scanning 
>>> for a lower
>>> order and find a (fully mapped, aligned) folio of higher order in 
>>> the first place.
>>>
>>> Let's assume the desired policy is that khugepaged should always 
>>> collapse to
>>> naturally aligned large folios. If there happens to be an existing 
>>> aligned
>>> order-4 folio that is fully mapped, we will identify that for 
>>> collapse as part
>>> of the scan for order-4. At that point, we should just notice that 
>>> it is already
>>> an aligned order-4 folio and bypass collapse. Of course we may have 
>>> already
>>> chosen to collapse it into a higher order, but we should definitely 
>>> not get to a
>>> lower order before we notice it.
>>>
>>> Hmm... I guess if the sysfs thp settings have been changed then 
>>> things could get
>>> spicy... if order-8 was previously enabled and we have an order-8 
>>> folio, then it
>>> get's disabled and khugepaged is scanning for order-4 (which is 
>>> still enabled)
>>> then hits the order-8; what's the expected policy? rework into 2 
>>> order-4 folios
>>> or leave it as as single order-8?
>>
>> Exactly, sorry, I should have made it clear in the patch description 
>> that I am
>> handling the following scenario: there is a long running system on 
>> which we are
>> using order-8 folios, and now we decide to downgrade to order-4. Will 
>> it be a
>> good idea to take the pain of splitting order-8 to 16 order-4 folios? 
>> This should
>> be a rare situation in the first place, so I have currently decided 
>> to ignore the
>> folios set up by the previous sysfs setting and only focus on 
>> collapsing fresh memory.
>>
>> Thinking again, a sys-admin deciding to downgrade order of folios, 
>> should do that in
>> the hopes of reducing internal fragmentation or increasing swap speed 
>> etc, so it makes
>> sense to shatter large folios....maybe we can have a sysfs tunable 
>> for this?
>
> Maybe we should not support it (at runtime) at all. We are trying to 
> build
> systems that don't require incredibly detailed sysadmin involvement, and
> this level of tweaking qualifies, thoroughly, as "incredibly detailed
> sysadmin micromanagement", imho.

Ryan pointed out one thing: what about unaligned, or partially mapped large
folios? For the previous sysfs settings, it may happen that we have an 
unaligned
order-8 folio, let us say it got unaligned due to mremap(). Then it is a 
good
idea to start from the order-4 aligned page and start collapsing memory so
that we can take advantage of the contig bit. Otherwise if it is a 
fully-mapped
aligned order-8 folio, then we anyways are abusing the contig bit advantage
so collapsing is pointless.
>
> Apologies for not having gone through the series in detail yet, but this
> point jumped out at me.
>
> thanks,
Dev Jain Dec. 19, 2024, 8:07 a.m. UTC | #6
On 19/12/24 1:29 pm, Dev Jain wrote:
>
> On 19/12/24 9:10 am, John Hubbard wrote:
>> On 12/18/24 1:34 AM, Dev Jain wrote:
>>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>> We may hit a situation wherein we have a larger folio mapped. It 
>>>>> is incorrect
>>>>> to go ahead with the collapse since some pages will be unmapped, 
>>>>> leading to
>>>>> the entire folio getting unmapped. Therefore, skip the 
>>>>> corresponding range.
>> ...
>>>> It would be good if you can spell out the desired policy when 
>>>> khugepaged hits
>>>> partially unmapped large folios and unaligned large folios. I think 
>>>> the simple
>>>> approach is to always collapse them to fully mapped, aligned folios 
>>>> even if the
>>>> resulting order is smaller than the original. But I'm not sure 
>>>> that's definitely
>>>> going to always be the best thing.
>>>>
>>>> Regardless, I'm struggling to understand the logic in this patch. 
>>>> Taking the
>>>> order of a folio based on having hit one of it's pages says 
>>>> anything about
>>>> whether the whole of that folio is mapped or not or it's alignment. 
>>>> And it's not
>>>> clear to me how we would get to a situation where we are scanning 
>>>> for a lower
>>>> order and find a (fully mapped, aligned) folio of higher order in 
>>>> the first place.
>>>>
>>>> Let's assume the desired policy is that khugepaged should always 
>>>> collapse to
>>>> naturally aligned large folios. If there happens to be an existing 
>>>> aligned
>>>> order-4 folio that is fully mapped, we will identify that for 
>>>> collapse as part
>>>> of the scan for order-4. At that point, we should just notice that 
>>>> it is already
>>>> an aligned order-4 folio and bypass collapse. Of course we may have 
>>>> already
>>>> chosen to collapse it into a higher order, but we should definitely 
>>>> not get to a
>>>> lower order before we notice it.
>>>>
>>>> Hmm... I guess if the sysfs thp settings have been changed then 
>>>> things could get
>>>> spicy... if order-8 was previously enabled and we have an order-8 
>>>> folio, then it
>>>> get's disabled and khugepaged is scanning for order-4 (which is 
>>>> still enabled)
>>>> then hits the order-8; what's the expected policy? rework into 2 
>>>> order-4 folios
>>>> or leave it as as single order-8?
>>>
>>> Exactly, sorry, I should have made it clear in the patch description 
>>> that I am
>>> handling the following scenario: there is a long running system on 
>>> which we are
>>> using order-8 folios, and now we decide to downgrade to order-4. 
>>> Will it be a
>>> good idea to take the pain of splitting order-8 to 16 order-4 
>>> folios? This should
>>> be a rare situation in the first place, so I have currently decided 
>>> to ignore the
>>> folios set up by the previous sysfs setting and only focus on 
>>> collapsing fresh memory.
>>>
>>> Thinking again, a sys-admin deciding to downgrade order of folios, 
>>> should do that in
>>> the hopes of reducing internal fragmentation or increasing swap 
>>> speed etc, so it makes
>>> sense to shatter large folios....maybe we can have a sysfs tunable 
>>> for this?
>>
>> Maybe we should not support it (at runtime) at all. We are trying to 
>> build
>> systems that don't require incredibly detailed sysadmin involvement, and
>> this level of tweaking qualifies, thoroughly, as "incredibly detailed
>> sysadmin micromanagement", imho.
>
> Ryan pointed out one thing: what about unaligned, or partially mapped 
> large
> folios? For the previous sysfs settings, it may happen that we have an 
> unaligned
> order-8 folio, let us say it got unaligned due to mremap(). Then it is 
> a good
> idea to start from the order-4 aligned page and start collapsing 
> memory so
> that we can take advantage of the contig bit. Otherwise if it is a 
> fully-mapped
> aligned order-8 folio, then we anyways are abusing the contig bit 
> advantage
> so collapsing is pointless.


In fact, in the current code, we are collapsing an unaligned PMD-size 
folio to an
aligned PMD-mapped folio; we will not see a block mapping in the PMD, and go
ahead with the scan...so the logic should be, skip the scan if the VAs 
and PAs are
aligned.

>>
>> Apologies for not having gone through the series in detail yet, but this
>> point jumped out at me.
>>
>> thanks,
>
Ryan Roberts Dec. 20, 2024, 11:57 a.m. UTC | #7
On 19/12/2024 08:07, Dev Jain wrote:
> 
> On 19/12/24 1:29 pm, Dev Jain wrote:
>>
>> On 19/12/24 9:10 am, John Hubbard wrote:
>>> On 12/18/24 1:34 AM, Dev Jain wrote:
>>>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>>>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>>>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
>>> ...
>>>>> It would be good if you can spell out the desired policy when khugepaged hits
>>>>> partially unmapped large folios and unaligned large folios. I think the simple
>>>>> approach is to always collapse them to fully mapped, aligned folios even if
>>>>> the
>>>>> resulting order is smaller than the original. But I'm not sure that's
>>>>> definitely
>>>>> going to always be the best thing.
>>>>>
>>>>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>>>>> order of a folio based on having hit one of it's pages says anything about
>>>>> whether the whole of that folio is mapped or not or it's alignment. And
>>>>> it's not
>>>>> clear to me how we would get to a situation where we are scanning for a lower
>>>>> order and find a (fully mapped, aligned) folio of higher order in the first
>>>>> place.
>>>>>
>>>>> Let's assume the desired policy is that khugepaged should always collapse to
>>>>> naturally aligned large folios. If there happens to be an existing aligned
>>>>> order-4 folio that is fully mapped, we will identify that for collapse as part
>>>>> of the scan for order-4. At that point, we should just notice that it is
>>>>> already
>>>>> an aligned order-4 folio and bypass collapse. Of course we may have already
>>>>> chosen to collapse it into a higher order, but we should definitely not get
>>>>> to a
>>>>> lower order before we notice it.
>>>>>
>>>>> Hmm... I guess if the sysfs thp settings have been changed then things
>>>>> could get
>>>>> spicy... if order-8 was previously enabled and we have an order-8 folio,
>>>>> then it
>>>>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>>>>> then hits the order-8; what's the expected policy? rework into 2 order-4
>>>>> folios
>>>>> or leave it as as single order-8?
>>>>
>>>> Exactly, sorry, I should have made it clear in the patch description that I am
>>>> handling the following scenario: there is a long running system on which we are
>>>> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
>>>> good idea to take the pain of splitting order-8 to 16 order-4 folios? This
>>>> should
>>>> be a rare situation in the first place, so I have currently decided to
>>>> ignore the
>>>> folios set up by the previous sysfs setting and only focus on collapsing
>>>> fresh memory.
>>>>
>>>> Thinking again, a sys-admin deciding to downgrade order of folios, should do
>>>> that in
>>>> the hopes of reducing internal fragmentation or increasing swap speed etc,
>>>> so it makes
>>>> sense to shatter large folios....maybe we can have a sysfs tunable for this?
>>>
>>> Maybe we should not support it (at runtime) at all. We are trying to build
>>> systems that don't require incredibly detailed sysadmin involvement, and
>>> this level of tweaking qualifies, thoroughly, as "incredibly detailed
>>> sysadmin micromanagement", imho.

Agreed that we definitely don't want any new controls for this.

>>
>> Ryan pointed out one thing: what about unaligned, or partially mapped large
>> folios? For the previous sysfs settings, it may happen that we have an unaligned
>> order-8 folio, let us say it got unaligned due to mremap(). Then it is a good
>> idea to start from the order-4 aligned page and start collapsing memory so
>> that we can take advantage of the contig bit. Otherwise if it is a fully-mapped
>> aligned order-8 folio, then we anyways are abusing the contig bit advantage
>> so collapsing is pointless.
> 
> 
> In fact, in the current code, we are collapsing an unaligned PMD-size folio to an
> aligned PMD-mapped folio; we will not see a block mapping in the PMD, and go
> ahead with the scan...so the logic should be, skip the scan if the VAs and PAs are
> aligned.

Here is my stab at what the policy should be. It sounds quite complex, but
actually I think the implementation is quite simple.

memory may initially be mapped partially or fully from large or small folios.
And the mapping may be naturally aligned for the folio order or not.

khugepaged's goal should always to end up with the largest, aligned orders that
are possible. It will only collapse to an enabled order, but it will never
collapse to a smaller or equal order than the source memory's contiguous mapping
_and_ aligment.

So if for example, the first (or last) 64K of a 128K folio is mapped on a 64K VA
boundary, if scanning for order-4 (64K) we would leave that partial mapping
alone. But if either less than 64K was mapped from the 128K folio, or it was
mapped in a way that it was not 64K aligned in VA, then it would be a candidate
for collapse to a new 64K folio.

I think this can all be implemented by remembering if all of the PFNs for the
VA-aligned order that is being scanned are present and congituous and if they
all belong to the same folio. If that is true, and if the first PFN is aligned
on a PA boundary for that order, then skip the collapse and move to the next
block (addr + (PAGE_SIZE << order)).

I think this will give us the properties we want; if we encounter a larger folio
that is fully mapped and aligned, we will leave it alone. If we encounter a mix
of small folios and partially mapped large folios we will collapse. If we
encounter a partial mapping of a larger folio but that partial mapping fills the
block we are scanning in an appropriately aligned manner, we will leave it alone.

Thanks,
Ryan

> 
>>>
>>> Apologies for not having gone through the series in detail yet, but this
>>> point jumped out at me.
>>>
>>> thanks,
>>
diff mbox series

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8040b130e677..47e7c476b893 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -33,6 +33,7 @@  enum scan_result {
 	SCAN_PMD_NULL,
 	SCAN_PMD_NONE,
 	SCAN_PMD_MAPPED,
+	SCAN_PTE_MAPPED,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -609,6 +610,11 @@  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
+		if (order !=HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED;
+			goto out;
+		}
+
 		/* See hpage_collapse_scan_ptes(). */
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
@@ -1369,6 +1375,7 @@  static int hpage_collapse_scan_ptes(struct mm_struct *mm,
 	unsigned long orders;
 	pte_t *pte, *_pte;
 	spinlock_t *ptl;
+	int found_order;
 	pmd_t *pmd;
 	int order;
 
@@ -1467,6 +1474,24 @@  static int hpage_collapse_scan_ptes(struct mm_struct *mm,
 			goto out_unmap;
 		}
 
+		found_order = folio_order(folio);
+
+		/*
+		 * No point of scanning. Two options: if this folio was hit
+		 * somewhere in the middle of the scan, then drop down the
+		 * order. Or, completely skip till the end of this folio. The
+		 * latter gives us a higher order to start with, with atmost
+		 * 1 << order PTEs not collapsed; the former may force us
+		 * to end up going below order 2 and exiting.
+		 */
+		if (order != HPAGE_PMD_ORDER && found_order >= order) {
+			result = SCAN_PTE_MAPPED;
+			_address += (PAGE_SIZE << found_order);
+			_pte += (1UL << found_order);
+			pte_unmap_unlock(pte, ptl);
+			goto decide_order;
+		}
+
 		/*
 		 * We treat a single page as shared if any part of the THP
 		 * is shared. "False negatives" from
@@ -1550,6 +1575,10 @@  static int hpage_collapse_scan_ptes(struct mm_struct *mm,
 		if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
 			goto out;
 	}
+	/* A larger folio was mapped; it will be skipped in next iteration */
+	if (result == SCAN_PTE_MAPPED)
+		goto decide_order;
+
 	if (result != SCAN_SUCCEED) {
 
 		/* Go to the next order. */
@@ -1558,6 +1587,8 @@  static int hpage_collapse_scan_ptes(struct mm_struct *mm,
 			goto out;
 		goto maybe_mmap_lock;
 	} else {
+
+decide_order:
 		address = _address;
 		pte = _pte;