
[v2,16/20] fs/proc/page: remove per-page mapcount dependency for /proc/kpagecount (CONFIG_NO_PAGE_MAPCOUNT)

Message ID: 20250224165603.1434404-17-david@redhat.com
State: New
Series: mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT

Commit Message

David Hildenbrand Feb. 24, 2025, 4:55 p.m. UTC
Let's implement an alternative when per-page mapcounts in large folios
are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.

For large folios, we'll return the per-page average mapcount within the
folio, except when the average is 0 but the folio is mapped: then we
return 1.

For hugetlb folios and for large folios that are fully mapped
into all address spaces, there is no change.

As an alternative, we could simply return 0 for non-hugetlb large folios,
or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.

But the information exposed by this interface can still be valuable, and
frequently we deal with fully-mapped large folios where the average
corresponds to the actual page mapcount. So we'll leave it like this for
now and document the new behavior.

Note: this interface is likely not very relevant for performance. If
ever required, we could try doing a rather expensive rmap walk to collect
precisely how often this folio page is mapped.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
 fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
 fs/proc/page.c                           | 19 ++++++++++++---
 3 files changed, 53 insertions(+), 4 deletions(-)

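For context, /proc/kpagecount is a file indexed by PFN, holding one
64-bit mapcount per page; consumers read it by seeking to pfn * 8. A
minimal userspace sketch of such a consumer (a hypothetical example, not
part of the patch; error handling trimmed, and reading the file
typically requires root):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read the mapcount /proc/kpagecount reports for one PFN. The file
 * holds one 64-bit value per page, indexed by PFN, so the entry for
 * a PFN lives at byte offset pfn * 8. */
static int64_t read_kpagecount(uint64_t pfn)
{
	int64_t count;
	int fd = open("/proc/kpagecount", O_RDONLY);

	if (fd < 0)
		return -1;
	if (pread(fd, &count, sizeof(count), pfn * sizeof(count)) !=
	    sizeof(count))
		count = -1;
	close(fd);
	return count;
}

int main(void)
{
	uint64_t pfn = 0x1000;	/* arbitrary example PFN */

	printf("kpagecount(%#llx) = %lld\n",
	       (unsigned long long)pfn, (long long)read_kpagecount(pfn));
	return 0;
}
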
Comments

Zi Yan Feb. 24, 2025, 8:40 p.m. UTC | #1
On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
> Let's implement an alternative when per-page mapcounts in large folios
> are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
>
> For large folios, we'll return the per-page average mapcount within the
> folio, except when the average is 0 but the folio is mapped: then we
> return 1.
>
> For hugetlb folios and for large folios that are fully mapped
> into all address spaces, there is no change.
>
> As an alternative, we could simply return 0 for non-hugetlb large folios,
> or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.
>
> But the information exposed by this interface can still be valuable, and
> frequently we deal with fully-mapped large folios where the average
> corresponds to the actual page mapcount. So we'll leave it like this for
> now and document the new behavior.
>
> Note: this interface is likely not very relevant for performance. If
> ever required, we could try doing a rather expensive rmap walk to collect
> precisely how often this folio page is mapped.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
>  fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
>  fs/proc/page.c                           | 19 ++++++++++++---
>  3 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
> index caba0f52dd36c..49590306c61a0 100644
> --- a/Documentation/admin-guide/mm/pagemap.rst
> +++ b/Documentation/admin-guide/mm/pagemap.rst
> @@ -42,7 +42,12 @@ There are four components to pagemap:
>     skip over unmapped regions.
>  
>   * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
> -   times each page is mapped, indexed by PFN.
> +   times each page is mapped, indexed by PFN. Some kernel configurations do
> +   not track the precise number of times a page part of a larger allocation
> +   (e.g., THP) is mapped. In these configurations, the average number of
> +   mappings per page in this larger allocation is returned instead. However,
> +   if any page of the large allocation is mapped, the returned value will
> +   be at least 1.
>  
>  The page-types tool in the tools/mm directory can be used to query the
>  number of times a page is mapped.
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 1695509370b88..16aa1fd260771 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
>  	return mapcount;
>  }
>  
> +/**
> + * folio_average_page_mapcount() - Average number of mappings per page in this
> + *				   folio
> + * @folio: The folio.
> + *
> + * The average number of present user page table entries that reference each
> + * page in this folio as tracked via the RMAP: either referenced directly
> + * (PTE) or as part of a larger area that covers this page (e.g., PMD).
> + *
> + * Returns: The average number of mappings per page in this folio. 0 for
> + * folios that are not mapped to user space or are not tracked via the RMAP
> + * (e.g., shared zeropage).
> + */
> +static inline int folio_average_page_mapcount(struct folio *folio)
> +{
> +	int mapcount, entire_mapcount;
> +	unsigned int adjust;
> +
> +	if (!folio_test_large(folio))
> +		return atomic_read(&folio->_mapcount) + 1;
> +
> +	mapcount = folio_large_mapcount(folio);
> +	entire_mapcount = folio_entire_mapcount(folio);
> +	if (mapcount <= entire_mapcount)
> +		return entire_mapcount;
> +	mapcount -= entire_mapcount;
> +
> +	adjust = folio_large_nr_pages(folio) / 2;

Is there any reason for choosing this adjust number? A comment might be
helpful in case people want to change it later, either with some reasoning
or just saying it is chosen empirically.

> +	return ((mapcount + adjust) >> folio_large_order(folio)) +
> +		entire_mapcount;
> +}
>  /*
>   * array.c
>   */
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index a55f5acefa974..4d3290cc69667 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -67,9 +67,22 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
>  		 * memmaps that were actually initialized.
>  		 */
>  		page = pfn_to_online_page(pfn);
> -		if (page)
> -			mapcount = folio_precise_page_mapcount(page_folio(page),
> -							       page);
> +		if (page) {
> +			struct folio *folio = page_folio(page);
> +
> +			if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
> +				mapcount = folio_precise_page_mapcount(folio, page);
> +			} else {
> +				/*
> +				 * Indicate the per-page average, but at least "1" for
> +				 * mapped folios.
> +				 */
> +				mapcount = folio_average_page_mapcount(folio);
> +				if (!mapcount && folio_test_large(folio) &&
> +				    folio_mapped(folio))
> +					mapcount = 1;

This should be part of folio_average_page_mapcount() right?
Otherwise, the comment on folio_average_page_mapcount() is not correct,
since it can return 0 when a folio is mapped to user space.

> +			}
> +		}
>  
>  		if (put_user(mapcount, out)) {
>  			ret = -EFAULT;
David Hildenbrand Feb. 24, 2025, 9:02 p.m. UTC | #2
On 24.02.25 21:40, Zi Yan wrote:
> On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
>> Let's implement an alternative when per-page mapcounts in large folios
>> are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
>>
>> For large folios, we'll return the per-page average mapcount within the
>> folio, except when the average is 0 but the folio is mapped: then we
>> return 1.
>>
>> For hugetlb folios and for large folios that are fully mapped
>> into all address spaces, there is no change.
>>
>> As an alternative, we could simply return 0 for non-hugetlb large folios,
>> or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.
>>
>> But the information exposed by this interface can still be valuable, and
>> frequently we deal with fully-mapped large folios where the average
>> corresponds to the actual page mapcount. So we'll leave it like this for
>> now and document the new behavior.
>>
>> Note: this interface is likely not very relevant for performance. If
>> ever required, we could try doing a rather expensive rmap walk to collect
>> precisely how often this folio page is mapped.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
>>   fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
>>   fs/proc/page.c                           | 19 ++++++++++++---
>>   3 files changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>> index caba0f52dd36c..49590306c61a0 100644
>> --- a/Documentation/admin-guide/mm/pagemap.rst
>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>> @@ -42,7 +42,12 @@ There are four components to pagemap:
>>      skip over unmapped regions.
>>   
>>    * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
>> -   times each page is mapped, indexed by PFN.
>> +   times each page is mapped, indexed by PFN. Some kernel configurations do
>> +   not track the precise number of times a page part of a larger allocation
>> +   (e.g., THP) is mapped. In these configurations, the average number of
>> +   mappings per page in this larger allocation is returned instead. However,
>> +   if any page of the large allocation is mapped, the returned value will
>> +   be at least 1.
>>   
>>   The page-types tool in the tools/mm directory can be used to query the
>>   number of times a page is mapped.
>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>> index 1695509370b88..16aa1fd260771 100644
>> --- a/fs/proc/internal.h
>> +++ b/fs/proc/internal.h
>> @@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
>>   	return mapcount;
>>   }
>>   
>> +/**
>> + * folio_average_page_mapcount() - Average number of mappings per page in this
>> + *				   folio
>> + * @folio: The folio.
>> + *
>> + * The average number of present user page table entries that reference each
>> + * page in this folio as tracked via the RMAP: either referenced directly
>> + * (PTE) or as part of a larger area that covers this page (e.g., PMD).
>> + *
>> + * Returns: The average number of mappings per page in this folio. 0 for
>> + * folios that are not mapped to user space or are not tracked via the RMAP
>> + * (e.g., shared zeropage).
>> + */
>> +static inline int folio_average_page_mapcount(struct folio *folio)
>> +{
>> +	int mapcount, entire_mapcount;
>> +	unsigned int adjust;
>> +
>> +	if (!folio_test_large(folio))
>> +		return atomic_read(&folio->_mapcount) + 1;
>> +
>> +	mapcount = folio_large_mapcount(folio);
>> +	entire_mapcount = folio_entire_mapcount(folio);
>> +	if (mapcount <= entire_mapcount)
>> +		return entire_mapcount;
>> +	mapcount -= entire_mapcount;
>> +
>> +	adjust = folio_large_nr_pages(folio) / 2;

Thanks for the review!

> 
> Is there any reason for choosing this adjust number? A comment might be
> helpful in case people want to change it later, either with some reasoning
> or just saying it is chosen empirically.

We're dividing by folio_large_nr_pages(folio) (shifting by 
folio_large_order(folio)), so this is not a magic number at all.

So this should be "ordinary" rounding.

Assume nr_pages = 512.

With 255 we want to round down, with 256 we want to round up.

255 / 512 = 0 :)
256 / 512 = 0 :(

Compared to:

(255 + (512 / 2)) / 512 = (255 + 256) / 512 = 0 :)
(256 + (512 / 2)) / 512 = (256 + 256) / 512 = 1 :)

> 
>> +	return ((mapcount + adjust) >> folio_large_order(folio)) +
>> +		entire_mapcount;
>> +}
>>   /*
>>    * array.c
>>    */
>> diff --git a/fs/proc/page.c b/fs/proc/page.c
>> index a55f5acefa974..4d3290cc69667 100644
>> --- a/fs/proc/page.c
>> +++ b/fs/proc/page.c
>> @@ -67,9 +67,22 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
>>   		 * memmaps that were actually initialized.
>>   		 */
>>   		page = pfn_to_online_page(pfn);
>> -		if (page)
>> -			mapcount = folio_precise_page_mapcount(page_folio(page),
>> -							       page);
>> +		if (page) {
>> +			struct folio *folio = page_folio(page);
>> +
>> +			if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
>> +				mapcount = folio_precise_page_mapcount(folio, page);
>> +			} else {
>> +				/*
>> +				 * Indicate the per-page average, but at least "1" for
>> +				 * mapped folios.
>> +				 */
>> +				mapcount = folio_average_page_mapcount(folio);
>> +				if (!mapcount && folio_test_large(folio) &&
>> +				    folio_mapped(folio))
>> +					mapcount = 1;
> 
> This should be part of folio_average_page_mapcount() right?

No, that's not desired.

> Otherwise, the comment on folio_average_page_mapcount() is not correct,
> since it can return 0 when a folio is mapped to user space.

It's misleading. I'll clarify the comment, probably simply saying:

Returns: The average number of mappings per page in this folio.

Thanks!
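
To spell out the arithmetic above as code (a sketch, not from the
patch): adding half of the divisor before an integer division turns
truncation into round-to-nearest, which is what
adjust = folio_large_nr_pages(folio) / 2 does ahead of the shift:

/* Round-to-nearest integer division; the shift in the patch is the
 * same operation with nr_pages = 1 << folio_large_order(folio).
 * For nr_pages = 512: 255 -> 0, 256 -> 1, 767 -> 1, 768 -> 2. */
static unsigned int avg_round_nearest(unsigned int mapcount,
				      unsigned int nr_pages)
{
	return (mapcount + nr_pages / 2) / nr_pages;
}
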
David Hildenbrand Feb. 24, 2025, 9:10 p.m. UTC | #3
On 24.02.25 22:02, David Hildenbrand wrote:
> On 24.02.25 21:40, Zi Yan wrote:
>> On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
>>> Let's implement an alternative when per-page mapcounts in large folios
>>> are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
>>>
>>> For large folios, we'll return the per-page average mapcount within the
>>> folio, except when the average is 0 but the folio is mapped: then we
>>> return 1.
>>>
>>> For hugetlb folios and for large folios that are fully mapped
>>> into all address spaces, there is no change.
>>>
>>> As an alternative, we could simply return 0 for non-hugetlb large folios,
>>> or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.
>>>
>>> But the information exposed by this interface can still be valuable, and
>>> frequently we deal with fully-mapped large folios where the average
>>> corresponds to the actual page mapcount. So we'll leave it like this for
>>> now and document the new behavior.
>>>
>>> Note: this interface is likely not very relevant for performance. If
>>> ever required, we could try doing a rather expensive rmap walk to collect
>>> precisely how often this folio page is mapped.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>    Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
>>>    fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
>>>    fs/proc/page.c                           | 19 ++++++++++++---
>>>    3 files changed, 53 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>> index caba0f52dd36c..49590306c61a0 100644
>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>> @@ -42,7 +42,12 @@ There are four components to pagemap:
>>>       skip over unmapped regions.
>>>    
>>>     * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
>>> -   times each page is mapped, indexed by PFN.
>>> +   times each page is mapped, indexed by PFN. Some kernel configurations do
>>> +   not track the precise number of times a page part of a larger allocation
>>> +   (e.g., THP) is mapped. In these configurations, the average number of
>>> +   mappings per page in this larger allocation is returned instead. However,
>>> +   if any page of the large allocation is mapped, the returned value will
>>> +   be at least 1.
>>>    
>>>    The page-types tool in the tools/mm directory can be used to query the
>>>    number of times a page is mapped.
>>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>>> index 1695509370b88..16aa1fd260771 100644
>>> --- a/fs/proc/internal.h
>>> +++ b/fs/proc/internal.h
>>> @@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
>>>    	return mapcount;
>>>    }
>>>    
>>> +/**
>>> + * folio_average_page_mapcount() - Average number of mappings per page in this
>>> + *				   folio
>>> + * @folio: The folio.
>>> + *
>>> + * The average number of present user page table entries that reference each
>>> + * page in this folio as tracked via the RMAP: either referenced directly
>>> + * (PTE) or as part of a larger area that covers this page (e.g., PMD).
>>> + *
>>> + * Returns: The average number of mappings per page in this folio. 0 for
>>> + * folios that are not mapped to user space or are not tracked via the RMAP
>>> + * (e.g., shared zeropage).
>>> + */
>>> +static inline int folio_average_page_mapcount(struct folio *folio)
>>> +{
>>> +	int mapcount, entire_mapcount;
>>> +	unsigned int adjust;
>>> +
>>> +	if (!folio_test_large(folio))
>>> +		return atomic_read(&folio->_mapcount) + 1;
>>> +
>>> +	mapcount = folio_large_mapcount(folio);
>>> +	entire_mapcount = folio_entire_mapcount(folio);
>>> +	if (mapcount <= entire_mapcount)
>>> +		return entire_mapcount;
>>> +	mapcount -= entire_mapcount;
>>> +
>>> +	adjust = folio_large_nr_pages(folio) / 2;
> 
> Thanks for the review!
> 
>>
>> Is there any reason for choosing this adjust number? A comment might be
>> helpful in case people want to change it later, either with some reasoning
>> or just saying it is chosen empirically.
> 
> We're dividing by folio_large_nr_pages(folio) (shifting by
> folio_large_order(folio)), so this is not a magic number at all.
> 
> So this should be "ordinary" rounding.
> 
> Assume nr_pages = 512.
> 
> With 255 we want to round down, with 256 we want to round up.
> 
> 255 / 512 = 0 :)
> 256 / 512 = 0 :(
> 
> Compared to:
> 
> (255 + (512 / 2)) / 512 = (255 + 256) / 512 = 0 :)
> (256 + (512 / 2)) / 512 = (256 + 256) / 512 = 1 :)

I think adding to the function doc:

"The average is calculated by rounding to the nearest integer."

might make it clearer.
Zi Yan Feb. 24, 2025, 9:10 p.m. UTC | #4
On 24 Feb 2025, at 16:02, David Hildenbrand wrote:

> On 24.02.25 21:40, Zi Yan wrote:
>> On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
>>> Let's implement an alternative when per-page mapcounts in large folios
>>> are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
>>>
>>> For large folios, we'll return the per-page average mapcount within the
>>> folio, except when the average is 0 but the folio is mapped: then we
>>> return 1.
>>>
>>> For hugetlb folios and for large folios that are fully mapped
>>> into all address spaces, there is no change.
>>>
>>> As an alternative, we could simply return 0 for non-hugetlb large folios,
>>> or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.
>>>
>>> But the information exposed by this interface can still be valuable, and
>>> frequently we deal with fully-mapped large folios where the average
>>> corresponds to the actual page mapcount. So we'll leave it like this for
>>> now and document the new behavior.
>>>
>>> Note: this interface is likely not very relevant for performance. If
>>> ever required, we could try doing a rather expensive rmap walk to collect
>>> precisely how often this folio page is mapped.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
>>>   fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
>>>   fs/proc/page.c                           | 19 ++++++++++++---
>>>   3 files changed, 53 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>> index caba0f52dd36c..49590306c61a0 100644
>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>> @@ -42,7 +42,12 @@ There are four components to pagemap:
>>>      skip over unmapped regions.
>>>     * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
>>> -   times each page is mapped, indexed by PFN.
>>> +   times each page is mapped, indexed by PFN. Some kernel configurations do
>>> +   not track the precise number of times a page part of a larger allocation
>>> +   (e.g., THP) is mapped. In these configurations, the average number of
>>> +   mappings per page in this larger allocation is returned instead. However,
>>> +   if any page of the large allocation is mapped, the returned value will
>>> +   be at least 1.
>>>    The page-types tool in the tools/mm directory can be used to query the
>>>   number of times a page is mapped.
>>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>>> index 1695509370b88..16aa1fd260771 100644
>>> --- a/fs/proc/internal.h
>>> +++ b/fs/proc/internal.h
>>> @@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
>>>   	return mapcount;
>>>   }
>>>  +/**
>>> + * folio_average_page_mapcount() - Average number of mappings per page in this
>>> + *				   folio
>>> + * @folio: The folio.
>>> + *
>>> + * The average number of present user page table entries that reference each
>>> + * page in this folio as tracked via the RMAP: either referenced directly
>>> + * (PTE) or as part of a larger area that covers this page (e.g., PMD).
>>> + *
>>> + * Returns: The average number of mappings per page in this folio. 0 for
>>> + * folios that are not mapped to user space or are not tracked via the RMAP
>>> + * (e.g., shared zeropage).
>>> + */
>>> +static inline int folio_average_page_mapcount(struct folio *folio)
>>> +{
>>> +	int mapcount, entire_mapcount;
>>> +	unsigned int adjust;
>>> +
>>> +	if (!folio_test_large(folio))
>>> +		return atomic_read(&folio->_mapcount) + 1;
>>> +
>>> +	mapcount = folio_large_mapcount(folio);
>>> +	entire_mapcount = folio_entire_mapcount(folio);
>>> +	if (mapcount <= entire_mapcount)
>>> +		return entire_mapcount;
>>> +	mapcount -= entire_mapcount;
>>> +
>>> +	adjust = folio_large_nr_pages(folio) / 2;
>
> Thanks for the review!
>
>>
>> Is there any reason for choosing this adjust number? A comment might be
>> helpful in case people want to change it later, either with some reasoning
>> or just saying it is chosen empirically.
>
> We're dividing by folio_large_nr_pages(folio) (shifting by folio_large_order(folio)), so this is not a magic number at all.
>
> So this should be "ordinary" rounding.

I thought the rounding would be (mapcount + 511) / 512. But
that means if one subpage is mapped, the average will be 1.
Your rounding means that if at least half of the subpages are
mapped, the average will be 1. Others might think that if 1/3
is mapped, the average should be 1. That is why I think adjust
looks like a magic number.

>
> Assume nr_pages = 512.
>
> With 255 we want to round down, with 256 we want to round up.
>
> 255 / 512 = 0 :)
> 256 / 512 = 0 :(
>
> Compared to:
>
> (255 + (512 / 2)) / 512 = (255 + 256) / 512 = 0 :)
> (256 + (512 / 2)) / 512 = (256 + 256) / 512 = 1 :)
>
>>
>>> +	return ((mapcount + adjust) >> folio_large_order(folio)) +
>>> +		entire_mapcount;
>>> +}
>>>   /*
>>>    * array.c
>>>    */
>>> diff --git a/fs/proc/page.c b/fs/proc/page.c
>>> index a55f5acefa974..4d3290cc69667 100644
>>> --- a/fs/proc/page.c
>>> +++ b/fs/proc/page.c
>>> @@ -67,9 +67,22 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
>>>   		 * memmaps that were actually initialized.
>>>   		 */
>>>   		page = pfn_to_online_page(pfn);
>>> -		if (page)
>>> -			mapcount = folio_precise_page_mapcount(page_folio(page),
>>> -							       page);
>>> +		if (page) {
>>> +			struct folio *folio = page_folio(page);
>>> +
>>> +			if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
>>> +				mapcount = folio_precise_page_mapcount(folio, page);
>>> +			} else {
>>> +				/*
>>> +				 * Indicate the per-page average, but at least "1" for
>>> +				 * mapped folios.
>>> +				 */
>>> +				mapcount = folio_average_page_mapcount(folio);
>>> +				if (!mapcount && folio_test_large(folio) &&
>>> +				    folio_mapped(folio))
>>> +					mapcount = 1;
>>
>> This should be part of folio_average_page_mapcount() right?
>
> No, that's not desired.
>
>> Otherwise, the comment on folio_average_page_mapcount() is not correct,
>> since it can return 0 when a folio is mapped to user space.
>
> It's misleading. I'll clarify the comment, probably simply saying:
>
> Returns: The average number of mappings per page in this folio.

Got it.

Best Regards,
Yan, Zi
David Hildenbrand Feb. 24, 2025, 9:15 p.m. UTC | #5
On 24.02.25 22:10, Zi Yan wrote:
> On 24 Feb 2025, at 16:02, David Hildenbrand wrote:
> 
>> On 24.02.25 21:40, Zi Yan wrote:
>>> On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
>>>> Let's implement an alternative when per-page mapcounts in large folios
>>>> are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
>>>>
>>>> For large folios, we'll return the per-page average mapcount within the
>>>> folio, except when the average is 0 but the folio is mapped: then we
>>>> return 1.
>>>>
>>>> For hugetlb folios and for large folios that are fully mapped
>>>> into all address spaces, there is no change.
>>>>
>>>> As an alternative, we could simply return 0 for non-hugetlb large folios,
>>>> or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.
>>>>
>>>> But the information exposed by this interface can still be valuable, and
>>>> frequently we deal with fully-mapped large folios where the average
>>>> corresponds to the actual page mapcount. So we'll leave it like this for
>>>> now and document the new behavior.
>>>>
>>>> Note: this interface is likely not very relevant for performance. If
>>>> ever required, we could try doing a rather expensive rmap walk to collect
>>>> precisely how often this folio page is mapped.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
>>>>    fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
>>>>    fs/proc/page.c                           | 19 ++++++++++++---
>>>>    3 files changed, 53 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>>> index caba0f52dd36c..49590306c61a0 100644
>>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>> @@ -42,7 +42,12 @@ There are four components to pagemap:
>>>>       skip over unmapped regions.
>>>>      * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
>>>> -   times each page is mapped, indexed by PFN.
>>>> +   times each page is mapped, indexed by PFN. Some kernel configurations do
>>>> +   not track the precise number of times a page part of a larger allocation
>>>> +   (e.g., THP) is mapped. In these configurations, the average number of
>>>> +   mappings per page in this larger allocation is returned instead. However,
>>>> +   if any page of the large allocation is mapped, the returned value will
>>>> +   be at least 1.
>>>>     The page-types tool in the tools/mm directory can be used to query the
>>>>    number of times a page is mapped.
>>>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>>>> index 1695509370b88..16aa1fd260771 100644
>>>> --- a/fs/proc/internal.h
>>>> +++ b/fs/proc/internal.h
>>>> @@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
>>>>    	return mapcount;
>>>>    }
>>>>   +/**
>>>> + * folio_average_page_mapcount() - Average number of mappings per page in this
>>>> + *				   folio
>>>> + * @folio: The folio.
>>>> + *
>>>> + * The average number of present user page table entries that reference each
>>>> + * page in this folio as tracked via the RMAP: either referenced directly
>>>> + * (PTE) or as part of a larger area that covers this page (e.g., PMD).
>>>> + *
>>>> + * Returns: The average number of mappings per page in this folio. 0 for
>>>> + * folios that are not mapped to user space or are not tracked via the RMAP
>>>> + * (e.g., shared zeropage).
>>>> + */
>>>> +static inline int folio_average_page_mapcount(struct folio *folio)
>>>> +{
>>>> +	int mapcount, entire_mapcount;
>>>> +	unsigned int adjust;
>>>> +
>>>> +	if (!folio_test_large(folio))
>>>> +		return atomic_read(&folio->_mapcount) + 1;
>>>> +
>>>> +	mapcount = folio_large_mapcount(folio);
>>>> +	entire_mapcount = folio_entire_mapcount(folio);
>>>> +	if (mapcount <= entire_mapcount)
>>>> +		return entire_mapcount;
>>>> +	mapcount -= entire_mapcount;
>>>> +
>>>> +	adjust = folio_large_nr_pages(folio) / 2;
>>
>> Thanks for the review!
>>
>>>
>>> Is there any reason for choosing this adjust number? A comment might be
>>> helpful in case people want to change it later, either with some reasoning
>>> or just saying it is chosen empirically.
>>
>> We're dividing by folio_large_nr_pages(folio) (shifting by folio_large_order(folio)), so this is not a magic number at all.
>>
>> So this should be "ordinary" rounding.
> 
> I thought the rounding would be (mapcount + 511) / 512.

Yes, that's "rounding up".

> But
> that means if one subpage is mapped, the average will be 1.
> Your rounding means that if at least half of the subpages are
> mapped, the average will be 1. Others might think that if 1/3
> is mapped, the average should be 1. That is why I think adjust
> looks like a magic number.

I think all callers could tolerate (or even benefit from) 
folio_average_page_mapcount() returning at least 1 in case any page is 
mapped.

There was a reason why I decided to round to the nearest integer instead.

Let me think about this once more, I went back and forth a couple of 
times on this.
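
The three rounding policies weighed in this exchange differ only in the
bias added before the division; side by side (a sketch, not part of the
patch):

/* Per-page average of mc mappings across an nr-page folio:
 * avg_down():    plain truncation; 1 mapped page out of 512 reads as 0.
 * avg_nearest(): what the patch does; at least half mapped reads as 1.
 * avg_up():      what (mapcount + 511) / 512 would do; any mapped page
 *                reads as at least 1. */
static inline int avg_down(int mc, int nr)    { return mc / nr; }
static inline int avg_nearest(int mc, int nr) { return (mc + nr / 2) / nr; }
static inline int avg_up(int mc, int nr)      { return (mc + nr - 1) / nr; }
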
Zi Yan Feb. 24, 2025, 9:23 p.m. UTC | #6
On 24 Feb 2025, at 16:15, David Hildenbrand wrote:

> On 24.02.25 22:10, Zi Yan wrote:
>> On 24 Feb 2025, at 16:02, David Hildenbrand wrote:
>>
>>> On 24.02.25 21:40, Zi Yan wrote:
>>>> On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
>>>>> Let's implement an alternative when per-page mapcounts in large folios
>>>>> are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
>>>>>
>>>>> For large folios, we'll return the per-page average mapcount within the
>>>>> folio, except when the average is 0 but the folio is mapped: then we
>>>>> return 1.
>>>>>
>>>>> For hugetlb folios and for large folios that are fully mapped
>>>>> into all address spaces, there is no change.
>>>>>
>>>>> As an alternative, we could simply return 0 for non-hugetlb large folios,
>>>>> or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.
>>>>>
>>>>> But the information exposed by this interface can still be valuable, and
>>>>> frequently we deal with fully-mapped large folios where the average
>>>>> corresponds to the actual page mapcount. So we'll leave it like this for
>>>>> now and document the new behavior.
>>>>>
>>>>> Note: this interface is likely not very relevant for performance. If
>>>>> ever required, we could try doing a rather expensive rmap walk to collect
>>>>> precisely how often this folio page is mapped.
>>>>>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>    Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
>>>>>    fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
>>>>>    fs/proc/page.c                           | 19 ++++++++++++---
>>>>>    3 files changed, 53 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>>>> index caba0f52dd36c..49590306c61a0 100644
>>>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>>> @@ -42,7 +42,12 @@ There are four components to pagemap:
>>>>>       skip over unmapped regions.
>>>>>      * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
>>>>> -   times each page is mapped, indexed by PFN.
>>>>> +   times each page is mapped, indexed by PFN. Some kernel configurations do
>>>>> +   not track the precise number of times a page part of a larger allocation
>>>>> +   (e.g., THP) is mapped. In these configurations, the average number of
>>>>> +   mappings per page in this larger allocation is returned instead. However,
>>>>> +   if any page of the large allocation is mapped, the returned value will
>>>>> +   be at least 1.
>>>>>     The page-types tool in the tools/mm directory can be used to query the
>>>>>    number of times a page is mapped.
>>>>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>>>>> index 1695509370b88..16aa1fd260771 100644
>>>>> --- a/fs/proc/internal.h
>>>>> +++ b/fs/proc/internal.h
>>>>> @@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
>>>>>    	return mapcount;
>>>>>    }
>>>>>   +/**
>>>>> + * folio_average_page_mapcount() - Average number of mappings per page in this
>>>>> + *				   folio
>>>>> + * @folio: The folio.
>>>>> + *
>>>>> + * The average number of present user page table entries that reference each
>>>>> + * page in this folio as tracked via the RMAP: either referenced directly
>>>>> + * (PTE) or as part of a larger area that covers this page (e.g., PMD).
>>>>> + *
>>>>> + * Returns: The average number of mappings per page in this folio. 0 for
>>>>> + * folios that are not mapped to user space or are not tracked via the RMAP
>>>>> + * (e.g., shared zeropage).
>>>>> + */
>>>>> +static inline int folio_average_page_mapcount(struct folio *folio)
>>>>> +{
>>>>> +	int mapcount, entire_mapcount;
>>>>> +	unsigned int adjust;
>>>>> +
>>>>> +	if (!folio_test_large(folio))
>>>>> +		return atomic_read(&folio->_mapcount) + 1;
>>>>> +
>>>>> +	mapcount = folio_large_mapcount(folio);
>>>>> +	entire_mapcount = folio_entire_mapcount(folio);
>>>>> +	if (mapcount <= entire_mapcount)
>>>>> +		return entire_mapcount;
>>>>> +	mapcount -= entire_mapcount;
>>>>> +
>>>>> +	adjust = folio_large_nr_pages(folio) / 2;
>>>
>>> Thanks for the review!
>>>
>>>>
>>>> Is there any reason for choosing this adjust number? A comment might be
>>>> helpful in case people want to change it later, either with some reasoning
>>>> or just saying it is chosen empirically.
>>>
>>> We're dividing by folio_large_nr_pages(folio) (shifting by folio_large_order(folio)), so this is not a magic number at all.
>>>
>>> So this should be "ordinary" rounding.
>>
>> I thought the rounding would be (mapcount + 511) / 512.
>
> Yes, that's "rounding up".
>
>> But
>> that means if one subpage is mapped, the average will be 1.
>> Your rounding means that if at least half of the subpages are
>> mapped, the average will be 1. Others might think that if 1/3
>> is mapped, the average should be 1. That is why I think adjust
>> looks like a magic number.
>
> I think all callers could tolerate (or even benefit from) folio_average_page_mapcount() returning at least 1 in case any page is mapped.
>
> There was a reason why I decided to round to the nearest integer instead.
>
> Let me think about this once more, I went back and forth a couple of times on this.

Sure. Your current choice might be good enough for now. My intent in
suggesting a comment here is just to let people know that the adjust
value can be changed in the future. :)


Best Regards,
Yan, Zi

Patch

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index caba0f52dd36c..49590306c61a0 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -42,7 +42,12 @@  There are four components to pagemap:
    skip over unmapped regions.
 
  * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
-   times each page is mapped, indexed by PFN.
+   times each page is mapped, indexed by PFN. Some kernel configurations do
+   not track the precise number of times a page part of a larger allocation
+   (e.g., THP) is mapped. In these configurations, the average number of
+   mappings per page in this larger allocation is returned instead. However,
+   if any page of the large allocation is mapped, the returned value will
+   be at least 1.
 
 The page-types tool in the tools/mm directory can be used to query the
 number of times a page is mapped.
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1695509370b88..16aa1fd260771 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -174,6 +174,37 @@  static inline int folio_precise_page_mapcount(struct folio *folio,
 	return mapcount;
 }
 
+/**
+ * folio_average_page_mapcount() - Average number of mappings per page in this
+ *				   folio
+ * @folio: The folio.
+ *
+ * The average number of present user page table entries that reference each
+ * page in this folio as tracked via the RMAP: either referenced directly
+ * (PTE) or as part of a larger area that covers this page (e.g., PMD).
+ *
+ * Returns: The average number of mappings per page in this folio. 0 for
+ * folios that are not mapped to user space or are not tracked via the RMAP
+ * (e.g., shared zeropage).
+ */
+static inline int folio_average_page_mapcount(struct folio *folio)
+{
+	int mapcount, entire_mapcount;
+	unsigned int adjust;
+
+	if (!folio_test_large(folio))
+		return atomic_read(&folio->_mapcount) + 1;
+
+	mapcount = folio_large_mapcount(folio);
+	entire_mapcount = folio_entire_mapcount(folio);
+	if (mapcount <= entire_mapcount)
+		return entire_mapcount;
+	mapcount -= entire_mapcount;
+
+	adjust = folio_large_nr_pages(folio) / 2;
+	return ((mapcount + adjust) >> folio_large_order(folio)) +
+		entire_mapcount;
+}
 /*
  * array.c
  */
diff --git a/fs/proc/page.c b/fs/proc/page.c
index a55f5acefa974..4d3290cc69667 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -67,9 +67,22 @@  static ssize_t kpagecount_read(struct file *file, char __user *buf,
 		 * memmaps that were actually initialized.
 		 */
 		page = pfn_to_online_page(pfn);
-		if (page)
-			mapcount = folio_precise_page_mapcount(page_folio(page),
-							       page);
+		if (page) {
+			struct folio *folio = page_folio(page);
+
+			if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
+				mapcount = folio_precise_page_mapcount(folio, page);
+			} else {
+				/*
+				 * Indicate the per-page average, but at least "1" for
+				 * mapped folios.
+				 */
+				mapcount = folio_average_page_mapcount(folio);
+				if (!mapcount && folio_test_large(folio) &&
+				    folio_mapped(folio))
+					mapcount = 1;
+			}
+		}
 
 		if (put_user(mapcount, out)) {
 			ret = -EFAULT;
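
To run the averaging with concrete numbers, a standalone userspace model
of the large-folio path (a sketch, not the kernel code: the parameters
stand in for the folio accessors, and the assumption that the large
mapcount counts each PTE mapping once and each entire mapping once is
inferred from the subtraction in the patch):

#include <stdio.h>

/* Model of the large-folio path of folio_average_page_mapcount():
 * subtract the entire mappings, average the remaining (PTE) mappings
 * over the folio with round-to-nearest, then add the entire mappings
 * back, since each of them maps every page in the folio once. */
static int average_page_mapcount(int large_mapcount, int entire_mapcount,
				 int order)
{
	int nr_pages = 1 << order;

	if (large_mapcount <= entire_mapcount)
		return entire_mapcount;
	large_mapcount -= entire_mapcount;
	return ((large_mapcount + nr_pages / 2) >> order) + entire_mapcount;
}

int main(void)
{
	/* Order-9 folio (512 pages), one PMD mapping plus 256 pages
	 * additionally mapped by PTE: the true average is
	 * (512 + 256) / 512 = 1.5, which rounds to 2. */
	printf("%d\n", average_page_mapcount(257, 1, 9)); /* 2 */

	/* No PMD mapping, 255 pages mapped by PTE: 255 / 512 rounds
	 * down to 0 even though the folio is mapped; kpagecount_read()
	 * bumps such folios to 1. */
	printf("%d\n", average_page_mapcount(255, 0, 9)); /* 0 */
	return 0;
}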