Message ID | 20250224165603.1434404-17-david@redhat.com |
---|---|
State | New |
Series | mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT |
On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
> Let's implement an alternative when per-page mapcounts in large folios
> are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
>
> For large folios, we'll return the per-page average mapcount within the
> folio, except when the average is 0 but the folio is mapped: then we
> return 1.
>
[...]
> +static inline int folio_average_page_mapcount(struct folio *folio)
> +{
> +	int mapcount, entire_mapcount;
> +	unsigned int adjust;
> +
> +	if (!folio_test_large(folio))
> +		return atomic_read(&folio->_mapcount) + 1;
> +
> +	mapcount = folio_large_mapcount(folio);
> +	entire_mapcount = folio_entire_mapcount(folio);
> +	if (mapcount <= entire_mapcount)
> +		return entire_mapcount;
> +	mapcount -= entire_mapcount;
> +
> +	adjust = folio_large_nr_pages(folio) / 2;

Is there any reason for choosing this adjust number? A comment might be
helpful in case people want to change it later, either with some reasoning
or just saying it is chosen empirically.

> +	return ((mapcount + adjust) >> folio_large_order(folio)) +
> +		entire_mapcount;
> +}
[...]
> +		if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
> +			mapcount = folio_precise_page_mapcount(folio, page);
> +		} else {
> +			/*
> +			 * Indicate the per-page average, but at least "1" for
> +			 * mapped folios.
> +			 */
> +			mapcount = folio_average_page_mapcount(folio);
> +			if (!mapcount && folio_test_large(folio) &&
> +			    folio_mapped(folio))
> +				mapcount = 1;

This should be part of folio_average_page_mapcount() right?
Otherwise, the comment on folio_average_page_mapcount() is not correct,
since it can return 0 when a folio is mapped to user space.
On 24.02.25 21:40, Zi Yan wrote:
> On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
[...]
>> +	adjust = folio_large_nr_pages(folio) / 2;

Thanks for the review!

> Is there any reason for choosing this adjust number? A comment might be
> helpful in case people want to change it later, either with some reasoning
> or just saying it is chosen empirically.

We're dividing by folio_large_nr_pages(folio) (shifting by
folio_large_order(folio)), so this is not a magic number at all.

So this should be "ordinary" rounding.

Assume nr_pages = 512.

With 255 we want to round down, with 256 we want to round up.

255 / 512 = 0 :)
256 / 512 = 0 :(

Compared to:

(255 + (512 / 2)) / 512 = (255 + 256) / 512 = 0 :)
(256 + (512 / 2)) / 512 = (256 + 256) / 512 = 1 :)

[...]
>> +			mapcount = folio_average_page_mapcount(folio);
>> +			if (!mapcount && folio_test_large(folio) &&
>> +			    folio_mapped(folio))
>> +				mapcount = 1;
>
> This should be part of folio_average_page_mapcount() right?

No, that's not desired.

> Otherwise, the comment on folio_average_page_mapcount() is not correct,
> since it can return 0 when a folio is mapped to user space.

It's misleading. I'll clarify the comment, probably simply saying:

	Returns: The average number of mappings per page in this folio.

Thanks!
On 24.02.25 22:02, David Hildenbrand wrote:
> On 24.02.25 21:40, Zi Yan wrote:
[...]
>>> +	adjust = folio_large_nr_pages(folio) / 2;
>
> Thanks for the review!
>
>> Is there any reason for choosing this adjust number? A comment might be
>> helpful in case people want to change it later, either with some reasoning
>> or just saying it is chosen empirically.
>
> We're dividing by folio_large_nr_pages(folio) (shifting by
> folio_large_order(folio)), so this is not a magic number at all.
>
> So this should be "ordinary" rounding.
>
> Assume nr_pages = 512.
>
> With 255 we want to round down, with 256 we want to round up.
>
> 255 / 512 = 0 :)
> 256 / 512 = 0 :(
>
> Compared to:
>
> (255 + (512 / 2)) / 512 = (255 + 256) / 512 = 0 :)
> (256 + (512 / 2)) / 512 = (256 + 256) / 512 = 1 :)

I think adding to the function doc:

	"The average is calculated by rounding to the nearest integer."

might make it clearer.
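As a concrete illustration of the arithmetic above, here is a minimal
standalone C sketch (illustrative only, not part of the patch; the
order-9, 512-page folio is an assumed example) mirroring the
"(mapcount + nr_pages / 2) >> order" computation:

	/*
	 * Illustrative only -- not from the patch. Compares plain
	 * truncation with the round-to-nearest computation used by
	 * folio_average_page_mapcount(), for an assumed order-9 folio.
	 */
	#include <stdio.h>

	int main(void)
	{
		const unsigned int order = 9;              /* assumed folio order */
		const unsigned int nr_pages = 1u << order; /* 512 */
		unsigned int mapcount;

		for (mapcount = 254; mapcount <= 257; mapcount++) {
			unsigned int truncated = mapcount >> order;
			unsigned int nearest =
				(mapcount + nr_pages / 2) >> order;

			/* 255 -> 0/0, 256 -> 0/1, matching the examples above */
			printf("mapcount=%3u truncated=%u nearest=%u\n",
			       mapcount, truncated, nearest);
		}
		return 0;
	}

For unsigned values this is the same ((x) + (d) / 2) / (d) idiom the
kernel's DIV_ROUND_CLOSEST() macro uses, which is why the adjust value
is ordinary rounding rather than a magic number.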
On 24 Feb 2025, at 16:02, David Hildenbrand wrote:

> On 24.02.25 21:40, Zi Yan wrote:
[...]
>>> +	adjust = folio_large_nr_pages(folio) / 2;
>
> Thanks for the review!
>
>> Is there any reason for choosing this adjust number? A comment might be
>> helpful in case people want to change it later, either with some reasoning
>> or just saying it is chosen empirically.
>
> We're dividing by folio_large_nr_pages(folio) (shifting by
> folio_large_order(folio)), so this is not a magic number at all.
>
> So this should be "ordinary" rounding.

I thought the rounding would be (mapcount + 511) / 512. But that means
if one subpage is mapped, the average will be 1. Your rounding means if
at least half of the subpages is mapped, the average will be 1. Others
might think 1/3 is mapped, the average will be 1. That is why I think
adjust looks like a magic number.

> Assume nr_pages = 512.
>
> With 255 we want to round down, with 256 we want to round up.
>
> 255 / 512 = 0 :)
> 256 / 512 = 0 :(
>
> Compared to:
>
> (255 + (512 / 2)) / 512 = (255 + 256) / 512 = 0 :)
> (256 + (512 / 2)) / 512 = (256 + 256) / 512 = 1 :)
>
[...]
>> This should be part of folio_average_page_mapcount() right?
>
> No, that's not desired.
>
>> Otherwise, the comment on folio_average_page_mapcount() is not correct,
>> since it can return 0 when a folio is mapped to user space.
>
> It's misleading. I'll clarify the comment, probably simply saying:
>
> 	Returns: The average number of mappings per page in this folio.

Got it.

Best Regards,
Yan, Zi
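The difference between the schemes can be stated as a threshold: the
smallest per-folio mapcount at which the reported average becomes 1.
A small sketch (illustrative only; the 512-page folio size is an
assumption) computing that threshold for round-down, round-to-nearest,
and the (mapcount + 511) / 512 round-up variant:

	/* Illustrative only: smallest mapcount yielding an average of 1. */
	#include <stdio.h>

	#define NR 512u	/* assumed pages per folio */

	static unsigned int avg_floor(unsigned int m)   { return m / NR; }
	static unsigned int avg_nearest(unsigned int m) { return (m + NR / 2) / NR; }
	static unsigned int avg_ceil(unsigned int m)    { return (m + NR - 1) / NR; }

	static unsigned int first_nonzero(unsigned int (*avg)(unsigned int))
	{
		unsigned int m;

		for (m = 1; m <= NR; m++)
			if (avg(m))
				return m;
		return 0;
	}

	int main(void)
	{
		printf("floor:   average reaches 1 at mapcount %u\n",
		       first_nonzero(avg_floor));	/* 512 */
		printf("nearest: average reaches 1 at mapcount %u\n",
		       first_nonzero(avg_nearest));	/* 256 */
		printf("ceil:    average reaches 1 at mapcount %u\n",
		       first_nonzero(avg_ceil));	/* 1 */
		return 0;
	}

Round-up reports 1 as soon as a single page is mapped, round-to-nearest
once half the pages are mapped, and plain truncation only once the
mapcount reaches the number of pages in the folio.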
On 24.02.25 22:10, Zi Yan wrote:
> On 24 Feb 2025, at 16:02, David Hildenbrand wrote:
[...]
>>> So this should be "ordinary" rounding.
>
> I thought the rounding would be (mapcount + 511) / 512.

Yes, that's "rounding up".

> But that means if one subpage is mapped, the average will be 1.
> Your rounding means if at least half of the subpages is mapped,
> the average will be 1. Others might think 1/3 is mapped,
> the average will be 1. That is why I think adjust looks like
> a magic number.

I think all callers could tolerate (or benefit) from
folio_average_page_mapcount() returning at least 1 in case any page is
mapped.

There was a reason why I decided to round to the nearest integer instead.

Let me think about this once more, I went back and forth a couple of
times on this.
On 24 Feb 2025, at 16:15, David Hildenbrand wrote:

> On 24.02.25 22:10, Zi Yan wrote:
[...]
>> I thought the rounding would be (mapcount + 511) / 512.
>
> Yes, that's "rounding up".
>
>> But that means if one subpage is mapped, the average will be 1.
>> Your rounding means if at least half of the subpages is mapped,
>> the average will be 1. Others might think 1/3 is mapped,
>> the average will be 1. That is why I think adjust looks like
>> a magic number.
>
> I think all callers could tolerate (or benefit) from
> folio_average_page_mapcount() returning at least 1 in case any page is
> mapped.
>
> There was a reason why I decided to round to the nearest integer instead.
>
> Let me think about this once more, I went back and forth a couple of
> times on this.

Sure. Your current choice might be good enough for now. My intent in
adding a comment here is just to let people know the adjust can be
changed in the future. :)

Best Regards,
Yan, Zi
Let's implement an alternative when per-page mapcounts in large folios
are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.

For large folios, we'll return the per-page average mapcount within the
folio, except when the average is 0 but the folio is mapped: then we
return 1.

For hugetlb folios and for large folios that are fully mapped
into all address spaces, there is no change.

As an alternative, we could simply return 0 for non-hugetlb large folios,
or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.

But the information exposed by this interface can still be valuable, and
frequently we deal with fully-mapped large folios where the average
corresponds to the actual page mapcount. So we'll leave it like this for
now and document the new behavior.

Note: this interface is likely not very relevant for performance. If
ever required, we could try doing a rather expensive rmap walk to collect
precisely how often this folio page is mapped.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
 fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
 fs/proc/page.c                           | 19 ++++++++++++---
 3 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index caba0f52dd36c..49590306c61a0 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -42,7 +42,12 @@ There are four components to pagemap:
    skip over unmapped regions.
 
 * ``/proc/kpagecount``. This file contains a 64-bit count of the number of
-   times each page is mapped, indexed by PFN.
+   times each page is mapped, indexed by PFN. Some kernel configurations do
+   not track the precise number of times a page part of a larger allocation
+   (e.g., THP) is mapped. In these configurations, the average number of
+   mappings per page in this larger allocation is returned instead. However,
+   if any page of the large allocation is mapped, the returned value will
+   be at least 1.
 
 The page-types tool in the tools/mm directory can be used to query the
 number of times a page is mapped.
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1695509370b88..16aa1fd260771 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
 	return mapcount;
 }
 
+/**
+ * folio_average_page_mapcount() - Average number of mappings per page in this
+ *				   folio
+ * @folio: The folio.
+ *
+ * The average number of present user page table entries that reference each
+ * page in this folio as tracked via the RMAP: either referenced directly
+ * (PTE) or as part of a larger area that covers this page (e.g., PMD).
+ *
+ * Returns: The average number of mappings per page in this folio. 0 for
+ * folios that are not mapped to user space or are not tracked via the RMAP
+ * (e.g., shared zeropage).
+ */
+static inline int folio_average_page_mapcount(struct folio *folio)
+{
+	int mapcount, entire_mapcount;
+	unsigned int adjust;
+
+	if (!folio_test_large(folio))
+		return atomic_read(&folio->_mapcount) + 1;
+
+	mapcount = folio_large_mapcount(folio);
+	entire_mapcount = folio_entire_mapcount(folio);
+	if (mapcount <= entire_mapcount)
+		return entire_mapcount;
+	mapcount -= entire_mapcount;
+
+	adjust = folio_large_nr_pages(folio) / 2;
+	return ((mapcount + adjust) >> folio_large_order(folio)) +
+		entire_mapcount;
+}
 /*
  * array.c
  */
diff --git a/fs/proc/page.c b/fs/proc/page.c
index a55f5acefa974..4d3290cc69667 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -67,9 +67,22 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
 	 * memmaps that were actually initialized.
 	 */
 	page = pfn_to_online_page(pfn);
-	if (page)
-		mapcount = folio_precise_page_mapcount(page_folio(page),
-						       page);
+	if (page) {
+		struct folio *folio = page_folio(page);
+
+		if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
+			mapcount = folio_precise_page_mapcount(folio, page);
+		} else {
+			/*
+			 * Indicate the per-page average, but at least "1" for
+			 * mapped folios.
+			 */
+			mapcount = folio_average_page_mapcount(folio);
+			if (!mapcount && folio_test_large(folio) &&
+			    folio_mapped(folio))
+				mapcount = 1;
+		}
+	}
 
 	if (put_user(mapcount, out)) {
 		ret = -EFAULT;
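To illustrate the interface this patch changes, here is a minimal
userspace sketch (an illustrative example, not from the patch: the PFN
is an arbitrary assumption, and reading usually requires root) that
fetches the mapcount for one PFN using the 8-bytes-per-PFN layout
documented in pagemap.rst:

	/* Illustrative sketch: read one /proc/kpagecount entry. */
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint64_t pfn = 0x1000;	/* assumed PFN, for illustration */
		uint64_t mapcount;
		int fd = open("/proc/kpagecount", O_RDONLY);

		if (fd < 0) {
			perror("open /proc/kpagecount");	/* usually needs root */
			return 1;
		}
		/* One 64-bit entry per PFN, so seek to pfn * 8. */
		if (pread(fd, &mapcount, sizeof(mapcount),
			  pfn * sizeof(mapcount)) != sizeof(mapcount)) {
			perror("pread");
			close(fd);
			return 1;
		}
		/*
		 * With CONFIG_NO_PAGE_MAPCOUNT this is the per-page average
		 * for large folios, clamped to at least 1 when mapped.
		 */
		printf("pfn 0x%llx mapped %llu time(s)\n",
		       (unsigned long long)pfn, (unsigned long long)mapcount);
		close(fd);
		return 0;
	}

On a CONFIG_NO_PAGE_MAPCOUNT kernel, a value of 1 for a page of a large
folio may therefore mean "mapped once" or merely "some pages of this
folio are mapped", per the documentation update above.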