
mm/rmap: do not add fully unmapped large folio to deferred split list

Message ID 20240411153232.169560-1-zi.yan@sent.com (mailing list archive)
State New

Commit Message

Zi Yan April 11, 2024, 3:32 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

In __folio_remove_rmap(), a large folio is added to the deferred split
list if any page in the folio loses its final mapping. However, the folio
may have been fully unmapped, in which case there is no need to add it to
the deferred split list at all. Fix this by checking the folio mapcount
before adding the folio to the deferred split list.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/rmap.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)


base-commit: ed7c95c95397baff9b40ba9b0919933eee2d7960
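
For readers skimming the thread, the gist of the change, condensed from
the diff quoted in the replies below (a sketch of the relevant hunks,
not the complete patch):

    /* mm/rmap.c, __folio_remove_rmap(), before: queue on any PTE-level unmap */
    if (folio_test_large(folio) && folio_test_anon(folio))
            if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
                    deferred_split_folio(folio);

    /* after (this patch): remember the remaining large mapcount ... */
    mapcount = atomic_sub_return(nr_pages, &folio->_large_mapcount) + 1;
    ...
    /* ... and skip the deferred split list when the folio is now fully unmapped */
    if (folio_test_large(folio) && folio_test_anon(folio))
            if ((level == RMAP_LEVEL_PTE && mapcount != 0) ||
                (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
                    deferred_split_folio(folio);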

Comments

David Hildenbrand April 11, 2024, 3:46 p.m. UTC | #1
On 11.04.24 17:32, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. It is possible that
> the folio is unmapped fully, but it is unnecessary to add the folio
> to deferred split list at all. Fix it by checking folio mapcount before
> adding a folio to deferred split list.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/rmap.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..d599a772e282 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>   		enum rmap_level level)
>   {
>   	atomic_t *mapped = &folio->_nr_pages_mapped;
> -	int last, nr = 0, nr_pmdmapped = 0;
> +	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>   	enum node_stat_item idx;
>   
>   	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>   			break;
>   		}
>   
> -		atomic_sub(nr_pages, &folio->_large_mapcount);
> +		mapcount = atomic_sub_return(nr_pages,
> +					     &folio->_large_mapcount) + 1;

That becomes a new memory barrier on some archs. Rather just re-read it 
below. Re-reading should be fine here.
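
A minimal sketch of that re-read, assuming the check stays where the
patch puts it and using the folio_large_mapcount() helper mentioned
later in this thread (placement and helper choice are assumptions, not
the final patch):

    atomic_sub(nr_pages, &folio->_large_mapcount);  /* plain RMW, no return value */
    ...
    if (folio_test_large(folio) && folio_test_anon(folio))
            /*
             * Re-read the large mapcount here instead of using
             * atomic_sub_return() above; folio_large_mapcount() is based on
             * a plain atomic_read(), i.e. an ordinary load without the
             * ordering some architectures attach to value-returning RMWs.
             */
            if ((level == RMAP_LEVEL_PTE && folio_large_mapcount(folio)) ||
                (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
                    deferred_split_folio(folio);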

>   		do {
>   			last = atomic_add_negative(-1, &page->_mapcount);
>   			if (last) {
> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>   		 * is still mapped.
>   		 */
>   		if (folio_test_large(folio) && folio_test_anon(folio))
> -			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> +			if ((level == RMAP_LEVEL_PTE &&
> +			     mapcount != 0) ||
> +			    (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>   				deferred_split_folio(folio);
>   	}

But I do wonder if we really care? Usually the folio will simply get 
freed afterwards, where we simply remove it from the list.

If it's pinned, we won't be able to free or reclaim, but it's rather a 
corner case ...

Is it really worth the added code? Not convinced.
Yang Shi April 11, 2024, 7:01 p.m. UTC | #2
On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.04.24 17:32, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> >
> > In __folio_remove_rmap(), a large folio is added to deferred split list
> > if any page in a folio loses its final mapping. It is possible that
> > the folio is unmapped fully, but it is unnecessary to add the folio
> > to deferred split list at all. Fix it by checking folio mapcount before
> > adding a folio to deferred split list.
> >
> > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > ---
> >   mm/rmap.c | 9 ++++++---
> >   1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2608c40dffad..d599a772e282 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >               enum rmap_level level)
> >   {
> >       atomic_t *mapped = &folio->_nr_pages_mapped;
> > -     int last, nr = 0, nr_pmdmapped = 0;
> > +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >       enum node_stat_item idx;
> >
> >       __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> > @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >                       break;
> >               }
> >
> > -             atomic_sub(nr_pages, &folio->_large_mapcount);
> > +             mapcount = atomic_sub_return(nr_pages,
> > +                                          &folio->_large_mapcount) + 1;
>
> That becomes a new memory barrier on some archs. Rather just re-read it
> below. Re-reading should be fine here.
>
> >               do {
> >                       last = atomic_add_negative(-1, &page->_mapcount);
> >                       if (last) {
> > @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >                * is still mapped.
> >                */
> >               if (folio_test_large(folio) && folio_test_anon(folio))
> > -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > +                     if ((level == RMAP_LEVEL_PTE &&
> > +                          mapcount != 0) ||
> > +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> >                               deferred_split_folio(folio);
> >       }
>
> But I do wonder if we really care? Usually the folio will simply get
> freed afterwards, where we simply remove it from the list.
>
> If it's pinned, we won't be able to free or reclaim, but it's rather a
> corner case ...
>
> Is it really worth the added code? Not convinced.

It is actually not only an optimization, but also fixes the broken
thp_deferred_split_page counter in /proc/vmstat.

The counter is supposed to count partially unmapped huge pages (the
ones placed on the deferred split queue), but it now counts fully
unmapped mTHP as well. For example, when a 64K THP is fully unmapped,
thp_deferred_split_page is not supposed to get incremented, but it is
now.

The counter is also useful for performance analysis, for example to
tell whether a workload did a lot of partial unmaps. So fixing the
counter seems worthwhile. Zi Yan should have mentioned this in the
commit log.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand April 11, 2024, 9:15 p.m. UTC | #3
On 11.04.24 21:01, Yang Shi wrote:
> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 11.04.24 17:32, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>> if any page in a folio loses its final mapping. It is possible that
>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>> to deferred split list at all. Fix it by checking folio mapcount before
>>> adding a folio to deferred split list.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>    mm/rmap.c | 9 ++++++---
>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..d599a772e282 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>                enum rmap_level level)
>>>    {
>>>        atomic_t *mapped = &folio->_nr_pages_mapped;
>>> -     int last, nr = 0, nr_pmdmapped = 0;
>>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>        enum node_stat_item idx;
>>>
>>>        __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>                        break;
>>>                }
>>>
>>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
>>> +             mapcount = atomic_sub_return(nr_pages,
>>> +                                          &folio->_large_mapcount) + 1;
>>
>> That becomes a new memory barrier on some archs. Rather just re-read it
>> below. Re-reading should be fine here.
>>
>>>                do {
>>>                        last = atomic_add_negative(-1, &page->_mapcount);
>>>                        if (last) {
>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>                 * is still mapped.
>>>                 */
>>>                if (folio_test_large(folio) && folio_test_anon(folio))
>>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> +                     if ((level == RMAP_LEVEL_PTE &&
>>> +                          mapcount != 0) ||
>>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>                                deferred_split_folio(folio);
>>>        }
>>
>> But I do wonder if we really care? Usually the folio will simply get
>> freed afterwards, where we simply remove it from the list.
>>
>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>> corner case ...
>>
>> Is it really worth the added code? Not convinced.
> 
> It is actually not only an optimization, but also fixed the broken
> thp_deferred_split_page counter in /proc/vmstat.
> 
> The counter actually counted the partially unmapped huge pages (so
> they are on deferred split queue), but it counts the fully unmapped
> mTHP as well now. For example, when a 64K THP is fully unmapped, the
> thp_deferred_split_page is not supposed to get inc'ed, but it does
> now.
> 
> The counter is also useful for performance analysis, for example,
> whether a workload did a lot of partial unmap or not. So fixing the
> counter seems worthy. Zi Yan should have mentioned this in the commit
> log.

Yes, all that is information that is missing from the patch description. 
If it's a fix, there should be a "Fixes:".

Likely we want to have a folio_large_mapcount() check in the code below. 
(I have yet to digest the condition where this happens -- can we have an 
example where we used to do the wrong thing and now would do the right 
thing as well?)
Yang Shi April 11, 2024, 9:59 p.m. UTC | #4
On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.04.24 21:01, Yang Shi wrote:
> > On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 11.04.24 17:32, Zi Yan wrote:
> >>> From: Zi Yan <ziy@nvidia.com>
> >>>
> >>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>> if any page in a folio loses its final mapping. It is possible that
> >>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>> to deferred split list at all. Fix it by checking folio mapcount before
> >>> adding a folio to deferred split list.
> >>>
> >>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>> ---
> >>>    mm/rmap.c | 9 ++++++---
> >>>    1 file changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..d599a772e282 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>                enum rmap_level level)
> >>>    {
> >>>        atomic_t *mapped = &folio->_nr_pages_mapped;
> >>> -     int last, nr = 0, nr_pmdmapped = 0;
> >>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>        enum node_stat_item idx;
> >>>
> >>>        __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>                        break;
> >>>                }
> >>>
> >>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
> >>> +             mapcount = atomic_sub_return(nr_pages,
> >>> +                                          &folio->_large_mapcount) + 1;
> >>
> >> That becomes a new memory barrier on some archs. Rather just re-read it
> >> below. Re-reading should be fine here.
> >>
> >>>                do {
> >>>                        last = atomic_add_negative(-1, &page->_mapcount);
> >>>                        if (last) {
> >>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>                 * is still mapped.
> >>>                 */
> >>>                if (folio_test_large(folio) && folio_test_anon(folio))
> >>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>> +                     if ((level == RMAP_LEVEL_PTE &&
> >>> +                          mapcount != 0) ||
> >>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> >>>                                deferred_split_folio(folio);
> >>>        }
> >>
> >> But I do wonder if we really care? Usually the folio will simply get
> >> freed afterwards, where we simply remove it from the list.
> >>
> >> If it's pinned, we won't be able to free or reclaim, but it's rather a
> >> corner case ...
> >>
> >> Is it really worth the added code? Not convinced.
> >
> > It is actually not only an optimization, but also fixed the broken
> > thp_deferred_split_page counter in /proc/vmstat.
> >
> > The counter actually counted the partially unmapped huge pages (so
> > they are on deferred split queue), but it counts the fully unmapped
> > mTHP as well now. For example, when a 64K THP is fully unmapped, the
> > thp_deferred_split_page is not supposed to get inc'ed, but it does
> > now.
> >
> > The counter is also useful for performance analysis, for example,
> > whether a workload did a lot of partial unmap or not. So fixing the
> > counter seems worthy. Zi Yan should have mentioned this in the commit
> > log.
>
> Yes, all that is information that is missing from the patch description.
> If it's a fix, there should be a "Fixes:".
>
> Likely we want to have a folio_large_mapcount() check in the code below.
> (I yet have to digest the condition where this happens -- can we have an
> example where we'd use to do the wrong thing and now would do the right
> thing as well?)

For example, map 1G of memory with 64K mTHP, then unmap the whole 1G or
some full 64K areas; you will see thp_deferred_split_page increase, but
it shouldn't.
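
A rough userspace sketch of that scenario (an illustration only: it
assumes a kernel without the workaround commit mentioned below, that
64K mTHP was enabled beforehand, e.g. via
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled, and that
nothing else on the system touches the counter):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static long read_deferred_split(void)
    {
            FILE *f = fopen("/proc/vmstat", "r");
            char line[128];
            long val = -1;

            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "thp_deferred_split_page %ld", &val) == 1)
                            break;
            fclose(f);
            return val;
    }

    int main(void)
    {
            const size_t len = 1UL << 30;   /* 1G of anonymous memory */
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            long before, after;

            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 1, len);            /* fault in, ideally as 64K mTHP */

            before = read_deferred_split();
            munmap(buf, len);               /* full unmap: counter should not move */
            after = read_deferred_split();

            printf("thp_deferred_split_page: %ld -> %ld\n", before, after);
            return 0;
    }

Without the workaround, the counter jumps across the munmap() even
though nothing was partially unmapped.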

It looks like __folio_remove_rmap() incorrectly detects whether the mTHP
is fully or partially unmapped by comparing the number of still-mapped
subpages to ENTIRELY_MAPPED, which only works as intended for
PMD-mappable THP.

However, I just realized this problem was kind of worked around by commit:

commit 98046944a1597f3a02b792dbe9665e9943b77f28
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Mar 29 14:59:33 2024 +0800

    mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
split statistics

    Now the mTHP can also be split or added into the deferred list, so add
    folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
    confusion with PMD mapped THP related statistics.

    Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

This commit made thp_deferred_split_page stop counting mTHP; it also
made thp_split_page stop counting mTHP.

However, Zi Yan's patch does make the code more robust, and we won't
need to worry about the miscounting issue anymore if we add
deferred_split_page and split_page counters for mTHP in the future.

> --
> Cheers,
>
> David / dhildenb
>
Zi Yan April 12, 2024, 2:21 p.m. UTC | #5
On 11 Apr 2024, at 17:59, Yang Shi wrote:

> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 11.04.24 21:01, Yang Shi wrote:
>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>> adding a folio to deferred split list.
>>>>>
>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>> ---
>>>>>    mm/rmap.c | 9 ++++++---
>>>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2608c40dffad..d599a772e282 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>                enum rmap_level level)
>>>>>    {
>>>>>        atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>> -     int last, nr = 0, nr_pmdmapped = 0;
>>>>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>        enum node_stat_item idx;
>>>>>
>>>>>        __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>                        break;
>>>>>                }
>>>>>
>>>>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>> +             mapcount = atomic_sub_return(nr_pages,
>>>>> +                                          &folio->_large_mapcount) + 1;
>>>>
>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>> below. Re-reading should be fine here.
>>>>
>>>>>                do {
>>>>>                        last = atomic_add_negative(-1, &page->_mapcount);
>>>>>                        if (last) {
>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>                 * is still mapped.
>>>>>                 */
>>>>>                if (folio_test_large(folio) && folio_test_anon(folio))
>>>>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>> +                     if ((level == RMAP_LEVEL_PTE &&
>>>>> +                          mapcount != 0) ||
>>>>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>>                                deferred_split_folio(folio);
>>>>>        }
>>>>
>>>> But I do wonder if we really care? Usually the folio will simply get
>>>> freed afterwards, where we simply remove it from the list.
>>>>
>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>> corner case ...
>>>>
>>>> Is it really worth the added code? Not convinced.
>>>
>>> It is actually not only an optimization, but also fixed the broken
>>> thp_deferred_split_page counter in /proc/vmstat.
>>>
>>> The counter actually counted the partially unmapped huge pages (so
>>> they are on deferred split queue), but it counts the fully unmapped
>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>> now.
>>>
>>> The counter is also useful for performance analysis, for example,
>>> whether a workload did a lot of partial unmap or not. So fixing the
>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>> log.
>>
>> Yes, all that is information that is missing from the patch description.
>> If it's a fix, there should be a "Fixes:".
>>
>> Likely we want to have a folio_large_mapcount() check in the code below.
>> (I yet have to digest the condition where this happens -- can we have an
>> example where we'd use to do the wrong thing and now would do the right
>> thing as well?)
>
> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
> some full 64K areas, you will see thp_deferred_split_page increased,
> but it shouldn't.
>
> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
> is fully unmapped or partially unmapped by comparing the number of
> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
> PMD-mappable THP.
>
> However I just realized this problem was kind of workaround'ed by commit:
>
> commit 98046944a1597f3a02b792dbe9665e9943b77f28
> Author: Baolin Wang <baolin.wang@linux.alibaba.com>
> Date:   Fri Mar 29 14:59:33 2024 +0800
>
>     mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
> split statistics
>
>     Now the mTHP can also be split or added into the deferred list, so add
>     folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
>     confusion with PMD mapped THP related statistics.
>
>     Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
>     Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>     Acked-by: David Hildenbrand <david@redhat.com>
>     Cc: Muchun Song <muchun.song@linux.dev>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>
> This commit made thp_deferred_split_page didn't count mTHP anymore, it
> also made thp_split_page didn't count mTHP anymore.
>
> However Zi Yan's patch does make the code more robust and we don't
> need to worry about the miscounting issue anymore if we will add
> deferred_split_page and split_page counters for mTHP in the future.

Actually, the patch above does not fix everything. Without my patch, a
fully unmapped PTE-mapped order-9 THP is also added to the deferred
split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is 512
(non-zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
the order-9 folio is folio_test_pmd_mappable().
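
In code terms, this is the pre-patch condition from the diff quoted
above, annotated with those values (a sketch to spell out the argument):

    /* fully unmapped PTE-mapped order-9 THP: nr == 512, nr_pmdmapped == 0 */
    if (folio_test_large(folio) && folio_test_anon(folio))
            if (level == RMAP_LEVEL_PTE ||  /* true for any PTE-level unmap */
                nr < nr_pmdmapped)          /* never evaluated */
                    deferred_split_folio(folio);
                    /*
                     * ... which then also bumps THP_DEFERRED_SPLIT_PAGE,
                     * because the order-9 folio is pmd-mappable.
                     */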

I will add this information in the next version.

--
Best Regards,
Yan, Zi
Zi Yan April 12, 2024, 2:31 p.m. UTC | #6
On 12 Apr 2024, at 10:21, Zi Yan wrote:

> On 11 Apr 2024, at 17:59, Yang Shi wrote:
>
>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 11.04.24 21:01, Yang Shi wrote:
>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>> adding a folio to deferred split list.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>> ---
>>>>>>    mm/rmap.c | 9 ++++++---
>>>>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                enum rmap_level level)
>>>>>>    {
>>>>>>        atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>> -     int last, nr = 0, nr_pmdmapped = 0;
>>>>>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>        enum node_stat_item idx;
>>>>>>
>>>>>>        __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                        break;
>>>>>>                }
>>>>>>
>>>>>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>> +             mapcount = atomic_sub_return(nr_pages,
>>>>>> +                                          &folio->_large_mapcount) + 1;
>>>>>
>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>>> below. Re-reading should be fine here.
>>>>>
>>>>>>                do {
>>>>>>                        last = atomic_add_negative(-1, &page->_mapcount);
>>>>>>                        if (last) {
>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                 * is still mapped.
>>>>>>                 */
>>>>>>                if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>>> +                     if ((level == RMAP_LEVEL_PTE &&
>>>>>> +                          mapcount != 0) ||
>>>>>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>>>                                deferred_split_folio(folio);
>>>>>>        }
>>>>>
>>>>> But I do wonder if we really care? Usually the folio will simply get
>>>>> freed afterwards, where we simply remove it from the list.
>>>>>
>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>>> corner case ...
>>>>>
>>>>> Is it really worth the added code? Not convinced.
>>>>
>>>> It is actually not only an optimization, but also fixed the broken
>>>> thp_deferred_split_page counter in /proc/vmstat.
>>>>
>>>> The counter actually counted the partially unmapped huge pages (so
>>>> they are on deferred split queue), but it counts the fully unmapped
>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>>> now.
>>>>
>>>> The counter is also useful for performance analysis, for example,
>>>> whether a workload did a lot of partial unmap or not. So fixing the
>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>>> log.
>>>
>>> Yes, all that is information that is missing from the patch description.
>>> If it's a fix, there should be a "Fixes:".
>>>
>>> Likely we want to have a folio_large_mapcount() check in the code below.
>>> (I yet have to digest the condition where this happens -- can we have an
>>> example where we'd use to do the wrong thing and now would do the right
>>> thing as well?)
>>
>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
>> some full 64K areas, you will see thp_deferred_split_page increased,
>> but it shouldn't.
>>
>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
>> is fully unmapped or partially unmapped by comparing the number of
>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
>> PMD-mappable THP.
>>
>> However I just realized this problem was kind of workaround'ed by commit:
>>
>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
>> Author: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Date:   Fri Mar 29 14:59:33 2024 +0800
>>
>>     mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
>> split statistics
>>
>>     Now the mTHP can also be split or added into the deferred list, so add
>>     folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
>>     confusion with PMD mapped THP related statistics.
>>
>>     Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
>>     Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>     Acked-by: David Hildenbrand <david@redhat.com>
>>     Cc: Muchun Song <muchun.song@linux.dev>
>>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>
>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
>> also made thp_split_page didn't count mTHP anymore.
>>
>> However Zi Yan's patch does make the code more robust and we don't
>> need to worry about the miscounting issue anymore if we will add
>> deferred_split_page and split_page counters for mTHP in the future.
>
> Actually, the patch above does not fix everything. A fully unmapped
> PTE-mapped order-9 THP is also added to deferred split list and
> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> the order-9 folio is folio_test_pmd_mappable().
>
> I will add this information in the next version.

It might warrant
Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
but even before that commit, fully unmapping a PTE-mapped order-9 THP still
increased THP_DEFERRED_SPLIT_PAGE, because the PTEs were unmapped individually
and the first PTE unmapping added the THP to the deferred split list. This means
commit b06dc281aa99 did not change anything, and the THP_DEFERRED_SPLIT_PAGE
increase before it was an artifact of the implementation. I will add this to the
commit log as well, without a Fixes tag.


--
Best Regards,
Yan, Zi
Zi Yan April 12, 2024, 2:35 p.m. UTC | #7
On 11 Apr 2024, at 11:46, David Hildenbrand wrote:

> On 11.04.24 17:32, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> In __folio_remove_rmap(), a large folio is added to deferred split list
>> if any page in a folio loses its final mapping. It is possible that
>> the folio is unmapped fully, but it is unnecessary to add the folio
>> to deferred split list at all. Fix it by checking folio mapcount before
>> adding a folio to deferred split list.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/rmap.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2608c40dffad..d599a772e282 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>   		enum rmap_level level)
>>   {
>>   	atomic_t *mapped = &folio->_nr_pages_mapped;
>> -	int last, nr = 0, nr_pmdmapped = 0;
>> +	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>   	enum node_stat_item idx;
>>    	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>   			break;
>>   		}
>>  -		atomic_sub(nr_pages, &folio->_large_mapcount);
>> +		mapcount = atomic_sub_return(nr_pages,
>> +					     &folio->_large_mapcount) + 1;
>
> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.

Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
below, but to save an atomic op, I chose to read mapcount here.

--
Best Regards,
Yan, Zi
Yang Shi April 12, 2024, 6:29 p.m. UTC | #8
On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 12 Apr 2024, at 10:21, Zi Yan wrote:
>
> > On 11 Apr 2024, at 17:59, Yang Shi wrote:
> >
> >> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 11.04.24 21:01, Yang Shi wrote:
> >>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>> From: Zi Yan <ziy@nvidia.com>
> >>>>>>
> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>> adding a folio to deferred split list.
> >>>>>>
> >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>>>> ---
> >>>>>>    mm/rmap.c | 9 ++++++---
> >>>>>>    1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>                enum rmap_level level)
> >>>>>>    {
> >>>>>>        atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>> -     int last, nr = 0, nr_pmdmapped = 0;
> >>>>>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>>        enum node_stat_item idx;
> >>>>>>
> >>>>>>        __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>                        break;
> >>>>>>                }
> >>>>>>
> >>>>>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>> +             mapcount = atomic_sub_return(nr_pages,
> >>>>>> +                                          &folio->_large_mapcount) + 1;
> >>>>>
> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it
> >>>>> below. Re-reading should be fine here.
> >>>>>
> >>>>>>                do {
> >>>>>>                        last = atomic_add_negative(-1, &page->_mapcount);
> >>>>>>                        if (last) {
> >>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>                 * is still mapped.
> >>>>>>                 */
> >>>>>>                if (folio_test_large(folio) && folio_test_anon(folio))
> >>>>>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>>>>> +                     if ((level == RMAP_LEVEL_PTE &&
> >>>>>> +                          mapcount != 0) ||
> >>>>>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> >>>>>>                                deferred_split_folio(folio);
> >>>>>>        }
> >>>>>
> >>>>> But I do wonder if we really care? Usually the folio will simply get
> >>>>> freed afterwards, where we simply remove it from the list.
> >>>>>
> >>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
> >>>>> corner case ...
> >>>>>
> >>>>> Is it really worth the added code? Not convinced.
> >>>>
> >>>> It is actually not only an optimization, but also fixed the broken
> >>>> thp_deferred_split_page counter in /proc/vmstat.
> >>>>
> >>>> The counter actually counted the partially unmapped huge pages (so
> >>>> they are on deferred split queue), but it counts the fully unmapped
> >>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
> >>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
> >>>> now.
> >>>>
> >>>> The counter is also useful for performance analysis, for example,
> >>>> whether a workload did a lot of partial unmap or not. So fixing the
> >>>> counter seems worthy. Zi Yan should have mentioned this in the commit
> >>>> log.
> >>>
> >>> Yes, all that is information that is missing from the patch description.
> >>> If it's a fix, there should be a "Fixes:".
> >>>
> >>> Likely we want to have a folio_large_mapcount() check in the code below.
> >>> (I yet have to digest the condition where this happens -- can we have an
> >>> example where we'd use to do the wrong thing and now would do the right
> >>> thing as well?)
> >>
> >> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
> >> some full 64K areas, you will see thp_deferred_split_page increased,
> >> but it shouldn't.
> >>
> >> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
> >> is fully unmapped or partially unmapped by comparing the number of
> >> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
> >> PMD-mappable THP.
> >>
> >> However I just realized this problem was kind of workaround'ed by commit:
> >>
> >> commit 98046944a1597f3a02b792dbe9665e9943b77f28
> >> Author: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> Date:   Fri Mar 29 14:59:33 2024 +0800
> >>
> >>     mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
> >> split statistics
> >>
> >>     Now the mTHP can also be split or added into the deferred list, so add
> >>     folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
> >>     confusion with PMD mapped THP related statistics.
> >>
> >>     Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
> >>     Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>     Acked-by: David Hildenbrand <david@redhat.com>
> >>     Cc: Muchun Song <muchun.song@linux.dev>
> >>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >>
> >> This commit made thp_deferred_split_page didn't count mTHP anymore, it
> >> also made thp_split_page didn't count mTHP anymore.
> >>
> >> However Zi Yan's patch does make the code more robust and we don't
> >> need to worry about the miscounting issue anymore if we will add
> >> deferred_split_page and split_page counters for mTHP in the future.
> >
> > Actually, the patch above does not fix everything. A fully unmapped
> > PTE-mapped order-9 THP is also added to deferred split list and
> > counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
> > (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> > the order-9 folio is folio_test_pmd_mappable().
> >
> > I will add this information in the next version.
>
> It might
> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
> due to implementation. I will add this to the commit log as well without Fixes
> tag.

Thanks for digging deeper. The problem may not have been that obvious
before mTHP, because a PMD-mappable THP is converted to PTE-mapped due
to partial unmap in most cases, whereas mTHP is always PTE-mapped in
the first place. The other reason is that batched rmap removal was not
supported before David's optimization.

Now we do have a reasonable motivation to make it precise, and it is
also easier to do so than before.

>
>
> --
> Best Regards,
> Yan, Zi
David Hildenbrand April 12, 2024, 7:06 p.m. UTC | #9
On 12.04.24 16:31, Zi Yan wrote:
> On 12 Apr 2024, at 10:21, Zi Yan wrote:
> 
>> On 11 Apr 2024, at 17:59, Yang Shi wrote:
>>
>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 11.04.24 21:01, Yang Shi wrote:
>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>>
>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>>> adding a folio to deferred split list.
>>>>>>>
>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>> ---
>>>>>>>     mm/rmap.c | 9 ++++++---
>>>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>                 enum rmap_level level)
>>>>>>>     {
>>>>>>>         atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>>> -     int last, nr = 0, nr_pmdmapped = 0;
>>>>>>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>>         enum node_stat_item idx;
>>>>>>>
>>>>>>>         __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>                         break;
>>>>>>>                 }
>>>>>>>
>>>>>>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>>> +             mapcount = atomic_sub_return(nr_pages,
>>>>>>> +                                          &folio->_large_mapcount) + 1;
>>>>>>
>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>>>> below. Re-reading should be fine here.
>>>>>>
>>>>>>>                 do {
>>>>>>>                         last = atomic_add_negative(-1, &page->_mapcount);
>>>>>>>                         if (last) {
>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>                  * is still mapped.
>>>>>>>                  */
>>>>>>>                 if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>>>> +                     if ((level == RMAP_LEVEL_PTE &&
>>>>>>> +                          mapcount != 0) ||
>>>>>>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>>>>                                 deferred_split_folio(folio);
>>>>>>>         }
>>>>>>
>>>>>> But I do wonder if we really care? Usually the folio will simply get
>>>>>> freed afterwards, where we simply remove it from the list.
>>>>>>
>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>>>> corner case ...
>>>>>>
>>>>>> Is it really worth the added code? Not convinced.
>>>>>
>>>>> It is actually not only an optimization, but also fixed the broken
>>>>> thp_deferred_split_page counter in /proc/vmstat.
>>>>>
>>>>> The counter actually counted the partially unmapped huge pages (so
>>>>> they are on deferred split queue), but it counts the fully unmapped
>>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>>>> now.
>>>>>
>>>>> The counter is also useful for performance analysis, for example,
>>>>> whether a workload did a lot of partial unmap or not. So fixing the
>>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>>>> log.
>>>>
>>>> Yes, all that is information that is missing from the patch description.
>>>> If it's a fix, there should be a "Fixes:".
>>>>
>>>> Likely we want to have a folio_large_mapcount() check in the code below.
>>>> (I yet have to digest the condition where this happens -- can we have an
>>>> example where we'd use to do the wrong thing and now would do the right
>>>> thing as well?)
>>>
>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
>>> some full 64K areas, you will see thp_deferred_split_page increased,
>>> but it shouldn't.
>>>
>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
>>> is fully unmapped or partially unmapped by comparing the number of
>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
>>> PMD-mappable THP.
>>>
>>> However I just realized this problem was kind of workaround'ed by commit:
>>>
>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
>>> Author: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Date:   Fri Mar 29 14:59:33 2024 +0800
>>>
>>>      mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
>>> split statistics
>>>
>>>      Now the mTHP can also be split or added into the deferred list, so add
>>>      folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
>>>      confusion with PMD mapped THP related statistics.
>>>
>>>      Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
>>>      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>      Acked-by: David Hildenbrand <david@redhat.com>
>>>      Cc: Muchun Song <muchun.song@linux.dev>
>>>      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>>
>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
>>> also made thp_split_page didn't count mTHP anymore.
>>>
>>> However Zi Yan's patch does make the code more robust and we don't
>>> need to worry about the miscounting issue anymore if we will add
>>> deferred_split_page and split_page counters for mTHP in the future.
>>
>> Actually, the patch above does not fix everything. A fully unmapped
>> PTE-mapped order-9 THP is also added to deferred split list and
>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
>> the order-9 folio is folio_test_pmd_mappable().
>>
>> I will add this information in the next version.
> 
> It might
> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
> due to implementation. I will add this to the commit log as well without Fixes
> tag.

Right, so it's always been a problem for PTE-mapped PMD-sized THP, and 
only with batching can we now do "better" -- though not fix it completely.

I'll reply separately to your other mail.
David Hildenbrand April 12, 2024, 7:32 p.m. UTC | #10
On 12.04.24 16:35, Zi Yan wrote:
> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> 
>> On 11.04.24 17:32, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>> if any page in a folio loses its final mapping. It is possible that
>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>> to deferred split list at all. Fix it by checking folio mapcount before
>>> adding a folio to deferred split list.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>    mm/rmap.c | 9 ++++++---
>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..d599a772e282 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>    		enum rmap_level level)
>>>    {
>>>    	atomic_t *mapped = &folio->_nr_pages_mapped;
>>> -	int last, nr = 0, nr_pmdmapped = 0;
>>> +	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>    	enum node_stat_item idx;
>>>     	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>    			break;
>>>    		}
>>>   -		atomic_sub(nr_pages, &folio->_large_mapcount);
>>> +		mapcount = atomic_sub_return(nr_pages,
>>> +					     &folio->_large_mapcount) + 1;
>>
>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> 
> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> below, but to save an atomic op, I chose to read mapcount here.

Some points:

(1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
RMW operations that return a value -- and how they interact with memory
barriers. Further, note that the relaxed variants are only optimized on
some architectures.

atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
access that should not be refetched. Usually cheaper than most other stuff
that involves atomics.

(2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
to figure out if the folio is now completely unmapped.

(3) There is one fundamental issue: if we are not batch-unmapping the whole
thing, we will still add the folios to the deferred split queue. Migration
would still do that, as would unmapping a folio that is covered by multiple VMAs.

(4) We should really avoid making common operations slower only to make
some unreliable stats less unreliable.


We should likely do something like the following, which might even be a bit
faster in some cases: when we unmap individual PTEs, checking _deferred_list
ahead of time avoids a function call:

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..356598b3dc3c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                  * page of the folio is unmapped and at least one page
                  * is still mapped.
                  */
-               if (folio_test_large(folio) && folio_test_anon(folio))
-                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
-                               deferred_split_folio(folio);
+               if (folio_test_large(folio) && folio_test_anon(folio) &&
+                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
+                   atomic_read(mapped) &&
+                   data_race(list_empty(&folio->_deferred_list)))
+                       deferred_split_folio(folio);
         }
  

I also thought about handling the scenario where we unmap the whole
thing in smaller chunks. We could detect "!atomic_read(mapped)", detect
that the folio is on the deferred split list, and simply remove it from
that list, incrementing a THP_UNDO_DEFERRED_SPLIT_PAGE event.

But it would be racy with concurrent remapping of the folio (might happen with
anon folios in corner cases I guess).

What we can do is the following, though:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dc30139590e6..f05cba1807f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
         ds_queue = get_deferred_split_queue(folio);
         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
         if (!list_empty(&folio->_deferred_list)) {
+               if (folio_test_pmd_mappable(folio))
+                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
                 ds_queue->split_queue_len--;
                 list_del_init(&folio->_deferred_list);
         }

Adding the right event of course.
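
A sketch of what "the right event" could look like; the event name is
taken from the diff above, but it does not exist upstream and the exact
placement is an assumption:

    /* include/linux/vm_event_item.h (sketch) */
            THP_DEFERRED_SPLIT_PAGE,
            THP_UNDO_DEFERRED_SPLIT_PAGE,   /* folio left the deferred list without being split */

    /* mm/vmstat.c, vmstat_text[] (sketch) */
            "thp_deferred_split_page",
            "thp_undo_deferred_split_page",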


Then it's easy to filter out these "temporarily added to the list, but never split
before the folio was freed" cases.
David Hildenbrand April 12, 2024, 7:36 p.m. UTC | #11
On 12.04.24 20:29, Yang Shi wrote:
> On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 12 Apr 2024, at 10:21, Zi Yan wrote:
>>
>>> On 11 Apr 2024, at 17:59, Yang Shi wrote:
>>>
>>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 11.04.24 21:01, Yang Shi wrote:
>>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>>>
>>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>>>> adding a folio to deferred split list.
>>>>>>>>
>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>>> ---
>>>>>>>>     mm/rmap.c | 9 ++++++---
>>>>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>>                 enum rmap_level level)
>>>>>>>>     {
>>>>>>>>         atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>>>> -     int last, nr = 0, nr_pmdmapped = 0;
>>>>>>>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>>>         enum node_stat_item idx;
>>>>>>>>
>>>>>>>>         __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>>                         break;
>>>>>>>>                 }
>>>>>>>>
>>>>>>>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>>>> +             mapcount = atomic_sub_return(nr_pages,
>>>>>>>> +                                          &folio->_large_mapcount) + 1;
>>>>>>>
>>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>>>>> below. Re-reading should be fine here.
>>>>>>>
>>>>>>>>                 do {
>>>>>>>>                         last = atomic_add_negative(-1, &page->_mapcount);
>>>>>>>>                         if (last) {
>>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>>                  * is still mapped.
>>>>>>>>                  */
>>>>>>>>                 if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>>>>> +                     if ((level == RMAP_LEVEL_PTE &&
>>>>>>>> +                          mapcount != 0) ||
>>>>>>>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>>>>>                                 deferred_split_folio(folio);
>>>>>>>>         }
>>>>>>>
>>>>>>> But I do wonder if we really care? Usually the folio will simply get
>>>>>>> freed afterwards, where we simply remove it from the list.
>>>>>>>
>>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>>>>> corner case ...
>>>>>>>
>>>>>>> Is it really worth the added code? Not convinced.
>>>>>>
>>>>>> It is actually not only an optimization, but also fixed the broken
>>>>>> thp_deferred_split_page counter in /proc/vmstat.
>>>>>>
>>>>>> The counter actually counted the partially unmapped huge pages (so
>>>>>> they are on deferred split queue), but it counts the fully unmapped
>>>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>>>>> now.
>>>>>>
>>>>>> The counter is also useful for performance analysis, for example,
>>>>>> whether a workload did a lot of partial unmap or not. So fixing the
>>>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>>>>> log.
>>>>>
>>>>> Yes, all that is information that is missing from the patch description.
>>>>> If it's a fix, there should be a "Fixes:".
>>>>>
>>>>> Likely we want to have a folio_large_mapcount() check in the code below.
>>>>> (I yet have to digest the condition where this happens -- can we have an
>>>>> example where we'd use to do the wrong thing and now would do the right
>>>>> thing as well?)
>>>>
>>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
>>>> some full 64K areas, you will see thp_deferred_split_page increased,
>>>> but it shouldn't.
>>>>
>>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
>>>> is fully unmapped or partially unmapped by comparing the number of
>>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
>>>> PMD-mappable THP.
>>>>
>>>> However I just realized this problem was kind of workaround'ed by commit:
>>>>
>>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
>>>> Author: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Date:   Fri Mar 29 14:59:33 2024 +0800
>>>>
>>>>      mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
>>>> split statistics
>>>>
>>>>      Now the mTHP can also be split or added into the deferred list, so add
>>>>      folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
>>>>      confusion with PMD mapped THP related statistics.
>>>>
>>>>      Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
>>>>      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>      Acked-by: David Hildenbrand <david@redhat.com>
>>>>      Cc: Muchun Song <muchun.song@linux.dev>
>>>>      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>>>
>>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
>>>> also made thp_split_page didn't count mTHP anymore.
>>>>
>>>> However Zi Yan's patch does make the code more robust and we don't
>>>> need to worry about the miscounting issue anymore if we will add
>>>> deferred_split_page and split_page counters for mTHP in the future.
>>>
>>> Actually, the patch above does not fix everything. A fully unmapped
>>> PTE-mapped order-9 THP is also added to deferred split list and
>>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
>>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
>>> the order-9 folio is folio_test_pmd_mappable().
>>>
>>> I will add this information in the next version.
>>
>> It might
>> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
>> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
>> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
>> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
>> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
>> due to implementation. I will add this to the commit log as well without Fixes
>> tag.
> 
> Thanks for digging deeper. The problem may be not that obvious before
> mTHP because PMD-mappable THP is converted to PTE-mapped due to
> partial unmap in most cases. But mTHP is always PTE-mapped in the
> first place. The other reason is batched rmap remove was not supported
> before David's optimization.

Yes.

> 
> Now we do have reasonable motivation to make it precise and it is also
> easier to do so than before.

If by "precise" you mean "less unreliable in some cases", yes. See my 
other mail.
Yang Shi April 12, 2024, 8:21 p.m. UTC | #12
On Fri, Apr 12, 2024 at 12:36 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.04.24 20:29, Yang Shi wrote:
> > On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On 12 Apr 2024, at 10:21, Zi Yan wrote:
> >>
> >>> On 11 Apr 2024, at 17:59, Yang Shi wrote:
> >>>
> >>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 11.04.24 21:01, Yang Shi wrote:
> >>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>>>> From: Zi Yan <ziy@nvidia.com>
> >>>>>>>>
> >>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>>>> adding a folio to deferred split list.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>>>>>> ---
> >>>>>>>>     mm/rmap.c | 9 ++++++---
> >>>>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>>>> --- a/mm/rmap.c
> >>>>>>>> +++ b/mm/rmap.c
> >>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>>>                 enum rmap_level level)
> >>>>>>>>     {
> >>>>>>>>         atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>>>> -     int last, nr = 0, nr_pmdmapped = 0;
> >>>>>>>> +     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>>>>         enum node_stat_item idx;
> >>>>>>>>
> >>>>>>>>         __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>>>                         break;
> >>>>>>>>                 }
> >>>>>>>>
> >>>>>>>> -             atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>>>> +             mapcount = atomic_sub_return(nr_pages,
> >>>>>>>> +                                          &folio->_large_mapcount) + 1;
> >>>>>>>
> >>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
> >>>>>>> below. Re-reading should be fine here.
> >>>>>>>
> >>>>>>>>                 do {
> >>>>>>>>                         last = atomic_add_negative(-1, &page->_mapcount);
> >>>>>>>>                         if (last) {
> >>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>>>                  * is still mapped.
> >>>>>>>>                  */
> >>>>>>>>                 if (folio_test_large(folio) && folio_test_anon(folio))
> >>>>>>>> -                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>>>>>>> +                     if ((level == RMAP_LEVEL_PTE &&
> >>>>>>>> +                          mapcount != 0) ||
> >>>>>>>> +                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> >>>>>>>>                                 deferred_split_folio(folio);
> >>>>>>>>         }
> >>>>>>>
> >>>>>>> But I do wonder if we really care? Usually the folio will simply get
> >>>>>>> freed afterwards, where we simply remove it from the list.
> >>>>>>>
> >>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
> >>>>>>> corner case ...
> >>>>>>>
> >>>>>>> Is it really worth the added code? Not convinced.
> >>>>>>
> >>>>>> It is actually not only an optimization, but also fixed the broken
> >>>>>> thp_deferred_split_page counter in /proc/vmstat.
> >>>>>>
> >>>>>> The counter actually counted the partially unmapped huge pages (so
> >>>>>> they are on deferred split queue), but it counts the fully unmapped
> >>>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
> >>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
> >>>>>> now.
> >>>>>>
> >>>>>> The counter is also useful for performance analysis, for example,
> >>>>>> whether a workload did a lot of partial unmap or not. So fixing the
> >>>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
> >>>>>> log.
> >>>>>
> >>>>> Yes, all that is information that is missing from the patch description.
> >>>>> If it's a fix, there should be a "Fixes:".
> >>>>>
> >>>>> Likely we want to have a folio_large_mapcount() check in the code below.
> >>>>> (I yet have to digest the condition where this happens -- can we have an
> >>>>> example where we'd use to do the wrong thing and now would do the right
> >>>>> thing as well?)
> >>>>
> >>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
> >>>> some full 64K areas, you will see thp_deferred_split_page increased,
> >>>> but it shouldn't.
> >>>>
> >>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
> >>>> is fully unmapped or partially unmapped by comparing the number of
> >>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
> >>>> PMD-mappable THP.
> >>>>
> >>>> However I just realized this problem was kind of workaround'ed by commit:
> >>>>
> >>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
> >>>> Author: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>> Date:   Fri Mar 29 14:59:33 2024 +0800
> >>>>
> >>>>      mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
> >>>> split statistics
> >>>>
> >>>>      Now the mTHP can also be split or added into the deferred list, so add
> >>>>      folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
> >>>>      confusion with PMD mapped THP related statistics.
> >>>>
> >>>>      Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
> >>>>      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>>      Acked-by: David Hildenbrand <david@redhat.com>
> >>>>      Cc: Muchun Song <muchun.song@linux.dev>
> >>>>      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >>>>
> >>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
> >>>> also made thp_split_page didn't count mTHP anymore.
> >>>>
> >>>> However Zi Yan's patch does make the code more robust and we don't
> >>>> need to worry about the miscounting issue anymore if we will add
> >>>> deferred_split_page and split_page counters for mTHP in the future.
> >>>
> >>> Actually, the patch above does not fix everything. A fully unmapped
> >>> PTE-mapped order-9 THP is also added to deferred split list and
> >>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
> >>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> >>> the order-9 folio is folio_test_pmd_mappable().
> >>>
> >>> I will add this information in the next version.
> >>
> >> It might
> >> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
> >> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
> >> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
> >> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
> >> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
> >> due to implementation. I will add this to the commit log as well without Fixes
> >> tag.
> >
> > Thanks for digging deeper. The problem may be not that obvious before
> > mTHP because PMD-mappable THP is converted to PTE-mapped due to
> > partial unmap in most cases. But mTHP is always PTE-mapped in the
> > first place. The other reason is batched rmap remove was not supported
> > before David's optimization.
>
> Yes.
>
> >
> > Now we do have reasonable motivation to make it precise and it is also
> > easier to do so than before.
>
> If by "precise" you mean "less unreliable in some cases", yes. See my
> other mail.

Yes, definitely. Making the unreliability at least somewhat acceptable is
the goal.

>
> --
> Cheers,
>
> David / dhildenb
>
Yang Shi April 12, 2024, 8:35 p.m. UTC | #13
On Fri, Apr 12, 2024 at 12:33 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.04.24 16:35, Zi Yan wrote:
> > On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >
> >> On 11.04.24 17:32, Zi Yan wrote:
> >>> From: Zi Yan <ziy@nvidia.com>
> >>>
> >>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>> if any page in a folio loses its final mapping. It is possible that
> >>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>> to deferred split list at all. Fix it by checking folio mapcount before
> >>> adding a folio to deferred split list.
> >>>
> >>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>> ---
> >>>    mm/rmap.c | 9 ++++++---
> >>>    1 file changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..d599a772e282 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>             enum rmap_level level)
> >>>    {
> >>>     atomic_t *mapped = &folio->_nr_pages_mapped;
> >>> -   int last, nr = 0, nr_pmdmapped = 0;
> >>> +   int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>     enum node_stat_item idx;
> >>>             __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>                     break;
> >>>             }
> >>>   -         atomic_sub(nr_pages, &folio->_large_mapcount);
> >>> +           mapcount = atomic_sub_return(nr_pages,
> >>> +                                        &folio->_large_mapcount) + 1;
> >>
> >> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >
> > Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> > below, but to save an atomic op, I chose to read mapcount here.
>
> Some points:
>
> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> RMW that return a value -- and how they interact with memory barriers.
> Further, how relaxed variants are only optimized on some architectures.
>
> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> access that should not be refetched. Usually cheaper than most other stuff
> that involves atomics.
>
> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> to figure out if the folio is now completely unmapped.
>
> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> thing, we will still add the folios to the deferred split queue. Migration
> would still do that, or if there are multiple VMAs covering a folio.

Maybe we can make the rmap removal code not touch the deferred split
queue in the migration path at all, because we know all the PTEs will be
converted to migration entries, and all the migration entries will be
converted back to PTEs regardless of whether try_to_migrate() succeeded
or not.
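
A rough sketch of what that could look like (untested, purely to
illustrate the idea; the RMAP_NO_DEFERRED_SPLIT flag and the plumbing
down from try_to_migrate_one() are hypothetical, nothing like that
exists today):

	/* hypothetical flag, set only by the migration unmap path */
	#define RMAP_NO_DEFERRED_SPLIT	(1 << 0)

	/*
	 * In __folio_remove_rmap(): skip the deferred split queue when the
	 * caller guarantees every PTE is being replaced by a migration entry.
	 */
	if (folio_test_large(folio) && folio_test_anon(folio) &&
	    !(rmap_flags & RMAP_NO_DEFERRED_SPLIT))
		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
			deferred_split_folio(folio);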

>
> (4) We should really avoid making common operations slower only to make
> some unreliable stats less unreliable.
>
>
> We should likely do something like the following, which might even be a bit
> faster in some cases because we avoid a function call in case we unmap
> individual PTEs by checking _deferred_list ahead of time
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..356598b3dc3c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                   * page of the folio is unmapped and at least one page
>                   * is still mapped.
>                   */
> -               if (folio_test_large(folio) && folio_test_anon(folio))
> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -                               deferred_split_folio(folio);
> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> +                   atomic_read(mapped) &&
> +                   data_race(list_empty(&folio->_deferred_list)))
> +                       deferred_split_folio(folio);
>          }
>
>
> I also thought about handling the scenario where we unmap the whole
> think in smaller chunks. We could detect "!atomic_read(mapped)" and
> detect that it is on the deferred split list, and simply remove it
> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>
> But it would be racy with concurrent remapping of the folio (might happen with
> anon folios in corner cases I guess).
>
> What we can do is the following, though:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index dc30139590e6..f05cba1807f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
>          ds_queue = get_deferred_split_queue(folio);
>          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>          if (!list_empty(&folio->_deferred_list)) {
> +               if (folio_test_pmd_mappable(folio))
> +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
>                  ds_queue->split_queue_len--;
>                  list_del_init(&folio->_deferred_list);
>          }
>
> Adding the right event of course.
>
>
> Then it's easy to filter out these "temporarily added to the list, but never split
> before the folio was freed" cases.
>
>
> --
> Cheers,
>
> David / dhildenb
>
Zi Yan April 12, 2024, 9:06 p.m. UTC | #14
On 12 Apr 2024, at 15:32, David Hildenbrand wrote:

> On 12.04.24 16:35, Zi Yan wrote:
>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>
>>> On 11.04.24 17:32, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>> if any page in a folio loses its final mapping. It is possible that
>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>> adding a folio to deferred split list.
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>>    mm/rmap.c | 9 ++++++---
>>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 2608c40dffad..d599a772e282 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>    		enum rmap_level level)
>>>>    {
>>>>    	atomic_t *mapped = &folio->_nr_pages_mapped;
>>>> -	int last, nr = 0, nr_pmdmapped = 0;
>>>> +	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>    	enum node_stat_item idx;
>>>>     	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>    			break;
>>>>    		}
>>>>   -		atomic_sub(nr_pages, &folio->_large_mapcount);
>>>> +		mapcount = atomic_sub_return(nr_pages,
>>>> +					     &folio->_large_mapcount) + 1;
>>>
>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>
>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>> below, but to save an atomic op, I chose to read mapcount here.
>
> Some points:
>
> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> RMW that return a value -- and how they interact with memory barriers.
> Further, how relaxed variants are only optimized on some architectures.
>
> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> access that should not be refetched. Usually cheaper than most other stuff
> that involves atomics.

I should have checked the actual implementation instead of being fooled
by the name. Will read about it. Thanks.

>
> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> to figure out if the folio is now completely unmapped.
>
> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> thing, we will still add the folios to the deferred split queue. Migration
> would still do that, or if there are multiple VMAs covering a folio.
>
> (4) We should really avoid making common operations slower only to make
> some unreliable stats less unreliable.
>
>
> We should likely do something like the following, which might even be a bit
> faster in some cases because we avoid a function call in case we unmap
> individual PTEs by checking _deferred_list ahead of time
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..356598b3dc3c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                  * page of the folio is unmapped and at least one page
>                  * is still mapped.
>                  */
> -               if (folio_test_large(folio) && folio_test_anon(folio))
> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -                               deferred_split_folio(folio);
> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> +                   atomic_read(mapped) &&
> +                   data_race(list_empty(&folio->_deferred_list)))

data_race() might not be needed, as Ryan pointed out[1]

> +                       deferred_split_folio(folio);
>         }
>
> I also thought about handling the scenario where we unmap the whole
> think in smaller chunks. We could detect "!atomic_read(mapped)" and
> detect that it is on the deferred split list, and simply remove it
> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>
> But it would be racy with concurrent remapping of the folio (might happen with
> anon folios in corner cases I guess).
>
> What we can do is the following, though:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index dc30139590e6..f05cba1807f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
>         ds_queue = get_deferred_split_queue(folio);
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>         if (!list_empty(&folio->_deferred_list)) {
> +               if (folio_test_pmd_mappable(folio))
> +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
>                 ds_queue->split_queue_len--;
>                 list_del_init(&folio->_deferred_list);
>         }
>
> Adding the right event of course.
>
>
> Then it's easy to filter out these "temporarily added to the list, but never split
> before the folio was freed" cases.

So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.

I wonder what THP_DEFERRED_SPLIT_PAGE is meant to count. If it counts THP deferred
splits, why not just move the counter to deferred_split_scan(), where the actual
split happens? Or does the counter have a different meaning?



[1] https://lore.kernel.org/linux-mm/e3e14098-eade-483e-a459-e43200b87941@arm.com/

--
Best Regards,
Yan, Zi
Yang Shi April 12, 2024, 10:29 p.m. UTC | #15
On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>
> > On 12.04.24 16:35, Zi Yan wrote:
> >> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>
> >>> On 11.04.24 17:32, Zi Yan wrote:
> >>>> From: Zi Yan <ziy@nvidia.com>
> >>>>
> >>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>> if any page in a folio loses its final mapping. It is possible that
> >>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>> adding a folio to deferred split list.
> >>>>
> >>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>> ---
> >>>>    mm/rmap.c | 9 ++++++---
> >>>>    1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 2608c40dffad..d599a772e282 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>                    enum rmap_level level)
> >>>>    {
> >>>>            atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>> -  int last, nr = 0, nr_pmdmapped = 0;
> >>>> +  int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>            enum node_stat_item idx;
> >>>>            __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>                            break;
> >>>>                    }
> >>>>   -                atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>> +          mapcount = atomic_sub_return(nr_pages,
> >>>> +                                       &folio->_large_mapcount) + 1;
> >>>
> >>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >>
> >> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >> below, but to save an atomic op, I chose to read mapcount here.
> >
> > Some points:
> >
> > (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> > RMW that return a value -- and how they interact with memory barriers.
> > Further, how relaxed variants are only optimized on some architectures.
> >
> > atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> > access that should not be refetched. Usually cheaper than most other stuff
> > that involves atomics.
>
> I should have checked the actual implementation instead of being fooled
> by the name. Will read about it. Thanks.
>
> >
> > (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> > to figure out if the folio is now completely unmapped.
> >
> > (3) There is one fundamental issue: if we are not batch-unmapping the whole
> > thing, we will still add the folios to the deferred split queue. Migration
> > would still do that, or if there are multiple VMAs covering a folio.
> >
> > (4) We should really avoid making common operations slower only to make
> > some unreliable stats less unreliable.
> >
> >
> > We should likely do something like the following, which might even be a bit
> > faster in some cases because we avoid a function call in case we unmap
> > individual PTEs by checking _deferred_list ahead of time
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2608c40dffad..356598b3dc3c 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >                  * page of the folio is unmapped and at least one page
> >                  * is still mapped.
> >                  */
> > -               if (folio_test_large(folio) && folio_test_anon(folio))
> > -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > -                               deferred_split_folio(folio);
> > +               if (folio_test_large(folio) && folio_test_anon(folio) &&
> > +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> > +                   atomic_read(mapped) &&
> > +                   data_race(list_empty(&folio->_deferred_list)))
>
> data_race() might not be needed, as Ryan pointed out[1]
>
> > +                       deferred_split_folio(folio);
> >         }
> >
> > I also thought about handling the scenario where we unmap the whole
> > think in smaller chunks. We could detect "!atomic_read(mapped)" and
> > detect that it is on the deferred split list, and simply remove it
> > from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >
> > But it would be racy with concurrent remapping of the folio (might happen with
> > anon folios in corner cases I guess).
> >
> > What we can do is the following, though:
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index dc30139590e6..f05cba1807f2 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> >         ds_queue = get_deferred_split_queue(folio);
> >         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >         if (!list_empty(&folio->_deferred_list)) {
> > +               if (folio_test_pmd_mappable(folio))
> > +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> >                 ds_queue->split_queue_len--;
> >                 list_del_init(&folio->_deferred_list);
> >         }
> >
> > Adding the right event of course.
> >
> >
> > Then it's easy to filter out these "temporarily added to the list, but never split
> > before the folio was freed" cases.
>
> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.

It is definitely possible that the THPs on the deferred split queue are
freed instead of split. For example, 1M is unmapped from a 2M THP, then
later the remaining 1M is unmapped, or the process exits before memory
pressure happens. So how can we tell they were just "temporarily added to
the list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
actually just counts how many pages are still on the deferred split queue.
That may be useful. However, the counter is typically used to estimate
how many THPs were partially unmapped during a period of time, so we
just need to know the initial value and the value when we read it
again.

>
> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
> splits, why not just move the counter to deferred_split_scan(), where the actual
> split happens. Or the counter has a different meaning?

deferred_split_scan() / deferred_split_count() can just return the
number of pages on a specific queue (a specific node with a specific
memcg), but THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
something? Or do you mean traversing all memcgs and all nodes? That
sounds like overkill.

>
>
>
> [1] https://lore.kernel.org/linux-mm/e3e14098-eade-483e-a459-e43200b87941@arm.com/
>
> --
> Best Regards,
> Yan, Zi
Zi Yan April 12, 2024, 10:59 p.m. UTC | #16
On 12 Apr 2024, at 18:29, Yang Shi wrote:

> On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>>
>>> On 12.04.24 16:35, Zi Yan wrote:
>>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>>
>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>> adding a folio to deferred split list.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>> ---
>>>>>>    mm/rmap.c | 9 ++++++---
>>>>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                    enum rmap_level level)
>>>>>>    {
>>>>>>            atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>> -  int last, nr = 0, nr_pmdmapped = 0;
>>>>>> +  int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>            enum node_stat_item idx;
>>>>>>            __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                            break;
>>>>>>                    }
>>>>>>   -                atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>> +          mapcount = atomic_sub_return(nr_pages,
>>>>>> +                                       &folio->_large_mapcount) + 1;
>>>>>
>>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>>
>>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>>> below, but to save an atomic op, I chose to read mapcount here.
>>>
>>> Some points:
>>>
>>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>>> RMW that return a value -- and how they interact with memory barriers.
>>> Further, how relaxed variants are only optimized on some architectures.
>>>
>>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>>> access that should not be refetched. Usually cheaper than most other stuff
>>> that involves atomics.
>>
>> I should have checked the actual implementation instead of being fooled
>> by the name. Will read about it. Thanks.
>>
>>>
>>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>>> to figure out if the folio is now completely unmapped.
>>>
>>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>>> thing, we will still add the folios to the deferred split queue. Migration
>>> would still do that, or if there are multiple VMAs covering a folio.
>>>
>>> (4) We should really avoid making common operations slower only to make
>>> some unreliable stats less unreliable.
>>>
>>>
>>> We should likely do something like the following, which might even be a bit
>>> faster in some cases because we avoid a function call in case we unmap
>>> individual PTEs by checking _deferred_list ahead of time
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..356598b3dc3c 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>                  * page of the folio is unmapped and at least one page
>>>                  * is still mapped.
>>>                  */
>>> -               if (folio_test_large(folio) && folio_test_anon(folio))
>>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> -                               deferred_split_folio(folio);
>>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
>>> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
>>> +                   atomic_read(mapped) &&
>>> +                   data_race(list_empty(&folio->_deferred_list)))
>>
>> data_race() might not be needed, as Ryan pointed out[1]
>>
>>> +                       deferred_split_folio(folio);
>>>         }
>>>
>>> I also thought about handling the scenario where we unmap the whole
>>> think in smaller chunks. We could detect "!atomic_read(mapped)" and
>>> detect that it is on the deferred split list, and simply remove it
>>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>>>
>>> But it would be racy with concurrent remapping of the folio (might happen with
>>> anon folios in corner cases I guess).
>>>
>>> What we can do is the following, though:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index dc30139590e6..f05cba1807f2 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
>>>         ds_queue = get_deferred_split_queue(folio);
>>>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>         if (!list_empty(&folio->_deferred_list)) {
>>> +               if (folio_test_pmd_mappable(folio))
>>> +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
>>>                 ds_queue->split_queue_len--;
>>>                 list_del_init(&folio->_deferred_list);
>>>         }
>>>
>>> Adding the right event of course.
>>>
>>>
>>> Then it's easy to filter out these "temporarily added to the list, but never split
>>> before the folio was freed" cases.
>>
>> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
>> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
>
> It is definitely possible that the THP on the deferred split queue are
> freed instead of split. For example, 1M is unmapped for a 2M THP, then
> later the remaining 1M is unmapped, or the process exits before memory
> pressure happens. So how come we can tell it is "temporarily added to
> list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> actually just counts how many pages are still on deferred split queue.
> It may be useful. However the counter is typically used to estimate
> how many THP are partially unmapped during a period of time. So we
> just need to know the initial value and the value when we read it
> again.
>
>>
>> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
>> splits, why not just move the counter to deferred_split_scan(), where the actual
>> split happens. Or the counter has a different meaning?
>
> The deferred_split_scan() / deferred_split_count() just can return the
> number of pages on a specific queue (a specific node with a specific
> memcg). But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
> something? Or you mean traverse all memcgs and all nodes? It sounds
> too overkilling.

I mean instead of increasing THP_DEFERRED_SPLIT_PAGE when a folio is added
to the split list, increase it when a folio is actually split in
deferred_split_scan(), regardless of which list the folio is on.
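
Something along these lines inside the list walk in deferred_split_scan()
(rough, untested sketch):

	/* only count folios that actually get split off the deferred list */
	if (!split_folio(folio)) {
		split++;
		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
	}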

--
Best Regards,
Yan, Zi
Yang Shi April 13, 2024, 12:50 a.m. UTC | #17
On Fri, Apr 12, 2024 at 3:59 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 12 Apr 2024, at 18:29, Yang Shi wrote:
>
> > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
> >>
> >>> On 12.04.24 16:35, Zi Yan wrote:
> >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>>>
> >>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>> From: Zi Yan <ziy@nvidia.com>
> >>>>>>
> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>> adding a folio to deferred split list.
> >>>>>>
> >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>>>> ---
> >>>>>>    mm/rmap.c | 9 ++++++---
> >>>>>>    1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>                    enum rmap_level level)
> >>>>>>    {
> >>>>>>            atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>> -  int last, nr = 0, nr_pmdmapped = 0;
> >>>>>> +  int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>>            enum node_stat_item idx;
> >>>>>>            __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>                            break;
> >>>>>>                    }
> >>>>>>   -                atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>> +          mapcount = atomic_sub_return(nr_pages,
> >>>>>> +                                       &folio->_large_mapcount) + 1;
> >>>>>
> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >>>>
> >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >>>> below, but to save an atomic op, I chose to read mapcount here.
> >>>
> >>> Some points:
> >>>
> >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> >>> RMW that return a value -- and how they interact with memory barriers.
> >>> Further, how relaxed variants are only optimized on some architectures.
> >>>
> >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> >>> access that should not be refetched. Usually cheaper than most other stuff
> >>> that involves atomics.
> >>
> >> I should have checked the actual implementation instead of being fooled
> >> by the name. Will read about it. Thanks.
> >>
> >>>
> >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> >>> to figure out if the folio is now completely unmapped.
> >>>
> >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> >>> thing, we will still add the folios to the deferred split queue. Migration
> >>> would still do that, or if there are multiple VMAs covering a folio.
> >>>
> >>> (4) We should really avoid making common operations slower only to make
> >>> some unreliable stats less unreliable.
> >>>
> >>>
> >>> We should likely do something like the following, which might even be a bit
> >>> faster in some cases because we avoid a function call in case we unmap
> >>> individual PTEs by checking _deferred_list ahead of time
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..356598b3dc3c 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>                  * page of the folio is unmapped and at least one page
> >>>                  * is still mapped.
> >>>                  */
> >>> -               if (folio_test_large(folio) && folio_test_anon(folio))
> >>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>> -                               deferred_split_folio(folio);
> >>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
> >>> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> >>> +                   atomic_read(mapped) &&
> >>> +                   data_race(list_empty(&folio->_deferred_list)))
> >>
> >> data_race() might not be needed, as Ryan pointed out[1]
> >>
> >>> +                       deferred_split_folio(folio);
> >>>         }
> >>>
> >>> I also thought about handling the scenario where we unmap the whole
> >>> think in smaller chunks. We could detect "!atomic_read(mapped)" and
> >>> detect that it is on the deferred split list, and simply remove it
> >>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >>>
> >>> But it would be racy with concurrent remapping of the folio (might happen with
> >>> anon folios in corner cases I guess).
> >>>
> >>> What we can do is the following, though:
> >>>
> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>> index dc30139590e6..f05cba1807f2 100644
> >>> --- a/mm/huge_memory.c
> >>> +++ b/mm/huge_memory.c
> >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> >>>         ds_queue = get_deferred_split_queue(folio);
> >>>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >>>         if (!list_empty(&folio->_deferred_list)) {
> >>> +               if (folio_test_pmd_mappable(folio))
> >>> +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> >>>                 ds_queue->split_queue_len--;
> >>>                 list_del_init(&folio->_deferred_list);
> >>>         }
> >>>
> >>> Adding the right event of course.
> >>>
> >>>
> >>> Then it's easy to filter out these "temporarily added to the list, but never split
> >>> before the folio was freed" cases.
> >>
> >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
> >
> > It is definitely possible that the THP on the deferred split queue are
> > freed instead of split. For example, 1M is unmapped for a 2M THP, then
> > later the remaining 1M is unmapped, or the process exits before memory
> > pressure happens. So how come we can tell it is "temporarily added to
> > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> > actually just counts how many pages are still on deferred split queue.
> > It may be useful. However the counter is typically used to estimate
> > how many THP are partially unmapped during a period of time. So we
> > just need to know the initial value and the value when we read it
> > again.
> >
> >>
> >> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
> >> splits, why not just move the counter to deferred_split_scan(), where the actual
> >> split happens. Or the counter has a different meaning?
> >
> > The deferred_split_scan() / deferred_split_count() just can return the
> > number of pages on a specific queue (a specific node with a specific
> > memcg). But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
> > something? Or you mean traverse all memcgs and all nodes? It sounds
> > too overkilling.
>
> I mean instead of increasing THP_DEFERRED_SPLIT_PAGE when a folio is added
> to the split list, increase it when a folio is split in deferred_split_scan(),
> regardless which list the folio is on.

That would overlap with thp_split_page. And what if memory pressure
never happens? The counter would stay 0 even though thousands of THPs
have been partially unmapped.

>
> --
> Best Regards,
> Yan, Zi
David Hildenbrand April 15, 2024, 3:13 p.m. UTC | #18
On 12.04.24 23:06, Zi Yan wrote:
> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
> 
>> On 12.04.24 16:35, Zi Yan wrote:
>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>
>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>> adding a folio to deferred split list.
>>>>>
>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>> ---
>>>>>     mm/rmap.c | 9 ++++++---
>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2608c40dffad..d599a772e282 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>     		enum rmap_level level)
>>>>>     {
>>>>>     	atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>> -	int last, nr = 0, nr_pmdmapped = 0;
>>>>> +	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>     	enum node_stat_item idx;
>>>>>      	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>     			break;
>>>>>     		}
>>>>>    -		atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>> +		mapcount = atomic_sub_return(nr_pages,
>>>>> +					     &folio->_large_mapcount) + 1;
>>>>
>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>
>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>> below, but to save an atomic op, I chose to read mapcount here.
>>
>> Some points:
>>
>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>> RMW that return a value -- and how they interact with memory barriers.
>> Further, how relaxed variants are only optimized on some architectures.
>>
>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>> access that should not be refetched. Usually cheaper than most other stuff
>> that involves atomics.
> 
> I should have checked the actual implementation instead of being fooled
> by the name. Will read about it. Thanks.
> 
>>
>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>> to figure out if the folio is now completely unmapped.
>>
>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>> thing, we will still add the folios to the deferred split queue. Migration
>> would still do that, or if there are multiple VMAs covering a folio.
>>
>> (4) We should really avoid making common operations slower only to make
>> some unreliable stats less unreliable.
>>
>>
>> We should likely do something like the following, which might even be a bit
>> faster in some cases because we avoid a function call in case we unmap
>> individual PTEs by checking _deferred_list ahead of time
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2608c40dffad..356598b3dc3c 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>                   * page of the folio is unmapped and at least one page
>>                   * is still mapped.
>>                   */
>> -               if (folio_test_large(folio) && folio_test_anon(folio))
>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>> -                               deferred_split_folio(folio);
>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
>> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
>> +                   atomic_read(mapped) &&
>> +                   data_race(list_empty(&folio->_deferred_list)))
> 
> data_race() might not be needed, as Ryan pointed out[1]

Right, I keep getting confused by that. Likely we should add data_race() 
only if we get actual reports.
David Hildenbrand April 15, 2024, 3:40 p.m. UTC | #19
On 13.04.24 00:29, Yang Shi wrote:
> On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>>
>>> On 12.04.24 16:35, Zi Yan wrote:
>>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>>
>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>> adding a folio to deferred split list.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>> ---
>>>>>>     mm/rmap.c | 9 ++++++---
>>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                     enum rmap_level level)
>>>>>>     {
>>>>>>             atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>> -  int last, nr = 0, nr_pmdmapped = 0;
>>>>>> +  int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>             enum node_stat_item idx;
>>>>>>             __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                             break;
>>>>>>                     }
>>>>>>    -                atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>> +          mapcount = atomic_sub_return(nr_pages,
>>>>>> +                                       &folio->_large_mapcount) + 1;
>>>>>
>>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>>
>>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>>> below, but to save an atomic op, I chose to read mapcount here.
>>>
>>> Some points:
>>>
>>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>>> RMW that return a value -- and how they interact with memory barriers.
>>> Further, how relaxed variants are only optimized on some architectures.
>>>
>>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>>> access that should not be refetched. Usually cheaper than most other stuff
>>> that involves atomics.
>>
>> I should have checked the actual implementation instead of being fooled
>> by the name. Will read about it. Thanks.
>>
>>>
>>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>>> to figure out if the folio is now completely unmapped.
>>>
>>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>>> thing, we will still add the folios to the deferred split queue. Migration
>>> would still do that, or if there are multiple VMAs covering a folio.
>>>
>>> (4) We should really avoid making common operations slower only to make
>>> some unreliable stats less unreliable.
>>>
>>>
>>> We should likely do something like the following, which might even be a bit
>>> faster in some cases because we avoid a function call in case we unmap
>>> individual PTEs by checking _deferred_list ahead of time
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..356598b3dc3c 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>                   * page of the folio is unmapped and at least one page
>>>                   * is still mapped.
>>>                   */
>>> -               if (folio_test_large(folio) && folio_test_anon(folio))
>>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> -                               deferred_split_folio(folio);
>>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
>>> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
>>> +                   atomic_read(mapped) &&
>>> +                   data_race(list_empty(&folio->_deferred_list)))
>>
>> data_race() might not be needed, as Ryan pointed out[1]
>>
>>> +                       deferred_split_folio(folio);
>>>          }
>>>
>>> I also thought about handling the scenario where we unmap the whole
>>> think in smaller chunks. We could detect "!atomic_read(mapped)" and
>>> detect that it is on the deferred split list, and simply remove it
>>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>>>
>>> But it would be racy with concurrent remapping of the folio (might happen with
>>> anon folios in corner cases I guess).
>>>
>>> What we can do is the following, though:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index dc30139590e6..f05cba1807f2 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
>>>          ds_queue = get_deferred_split_queue(folio);
>>>          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>          if (!list_empty(&folio->_deferred_list)) {
>>> +               if (folio_test_pmd_mappable(folio))
>>> +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
>>>                  ds_queue->split_queue_len--;
>>>                  list_del_init(&folio->_deferred_list);
>>>          }
>>>
>>> Adding the right event of course.
>>>
>>>
>>> Then it's easy to filter out these "temporarily added to the list, but never split
>>> before the folio was freed" cases.
>>
>> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
>> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
> 
> It is definitely possible that the THP on the deferred split queue are
> freed instead of split. For example, 1M is unmapped for a 2M THP, then
> later the remaining 1M is unmapped, or the process exits before memory
> pressure happens. So how come we can tell it is "temporarily added to
> list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> actually just counts how many pages are still on deferred split queue.

Not quite, I think. We don't have a counter for how many large folios
on the deferred list were actually split; I think we only have
THP_SPLIT_PAGE.

We could have
* THP_DEFERRED_SPLIT_PAGE
* THP_UNDO_DEFERRED_SPLIT_PAGE
* THP_PERFORM_DEFERRED_SPLIT_PAGE

Maybe that would catch more cases (not sure if all, though). Then, you 
could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE - 
THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.

That could give one a clearer picture of how deferred splitting interacts
with actual splitting (possibly under memory pressure), which is the whole
reason deferred splitting was added in the first place.
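
For THP_PERFORM_DEFERRED_SPLIT_PAGE, a rough, untested sketch of where it
could be counted in deferred_split_scan() (the event name is only the
proposal above, it does not exist today). Note the pmd-mappable test has
to be done before the split, because a successfully split folio is
order-0 afterwards:

	bool pmd_mappable = folio_test_pmd_mappable(folio);

	if (!split_folio(folio)) {
		split++;
		if (pmd_mappable)
			count_vm_event(THP_PERFORM_DEFERRED_SPLIT_PAGE);
	}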

> It may be useful. However the counter is typically used to estimate
> how many THP are partially unmapped during a period of time.

I'd say that's a bit of an abuse of the counter; or rather, reading
something into it that it never reliably represented.

I can easily write a program that keeps sending your counter to infinity 
simply by triggering that behavior in a loop, so it's all a bit shaky.

Something like Ryan's script makes more sense, where you get a clearer
picture of what's mapped where and how, because that information can be
much more valuable than just knowing whether a folio is mapped fully or
partially (again, relevant for dealing with memory waste).
David Hildenbrand April 15, 2024, 3:43 p.m. UTC | #20
On 12.04.24 22:35, Yang Shi wrote:
> On Fri, Apr 12, 2024 at 12:33 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.04.24 16:35, Zi Yan wrote:
>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>
>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>> adding a folio to deferred split list.
>>>>>
>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>> ---
>>>>>     mm/rmap.c | 9 ++++++---
>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2608c40dffad..d599a772e282 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>              enum rmap_level level)
>>>>>     {
>>>>>      atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>> -   int last, nr = 0, nr_pmdmapped = 0;
>>>>> +   int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>      enum node_stat_item idx;
>>>>>              __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>                      break;
>>>>>              }
>>>>>    -         atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>> +           mapcount = atomic_sub_return(nr_pages,
>>>>> +                                        &folio->_large_mapcount) + 1;
>>>>
>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>
>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>> below, but to save an atomic op, I chose to read mapcount here.
>>
>> Some points:
>>
>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>> RMW that return a value -- and how they interact with memory barriers.
>> Further, how relaxed variants are only optimized on some architectures.
>>
>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>> access that should not be refetched. Usually cheaper than most other stuff
>> that involves atomics.
>>
>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>> to figure out if the folio is now completely unmapped.
>>
>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>> thing, we will still add the folios to the deferred split queue. Migration
>> would still do that, or if there are multiple VMAs covering a folio.
> 
> Maybe we can let rmap remove code not touch deferred split queue in
> migration path at all because we know all the PTEs will be converted
> to migration entries. And all the migration entries will be converted
> back to PTEs regardless of whether try_to_migrate succeeded or not.

It would be just another band-aid I think :/ Maybe a worthwhile one, not
sure.
Yang Shi April 15, 2024, 5:54 p.m. UTC | #21
On Mon, Apr 15, 2024 at 8:40 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.04.24 00:29, Yang Shi wrote:
> > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
> >>
> >>> On 12.04.24 16:35, Zi Yan wrote:
> >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>>>
> >>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>> From: Zi Yan <ziy@nvidia.com>
> >>>>>>
> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>> adding a folio to deferred split list.
> >>>>>>
> >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>>>> ---
> >>>>>>     mm/rmap.c | 9 ++++++---
> >>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>                     enum rmap_level level)
> >>>>>>     {
> >>>>>>             atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>> -  int last, nr = 0, nr_pmdmapped = 0;
> >>>>>> +  int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>>             enum node_stat_item idx;
> >>>>>>             __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>                             break;
> >>>>>>                     }
> >>>>>>    -                atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>> +          mapcount = atomic_sub_return(nr_pages,
> >>>>>> +                                       &folio->_large_mapcount) + 1;
> >>>>>
> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >>>>
> >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >>>> below, but to save an atomic op, I chose to read mapcount here.
> >>>
> >>> Some points:
> >>>
> >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> >>> RMW that return a value -- and how they interact with memory barriers.
> >>> Further, how relaxed variants are only optimized on some architectures.
> >>>
> >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> >>> access that should not be refetched. Usually cheaper than most other stuff
> >>> that involves atomics.
> >>
> >> I should have checked the actual implementation instead of being fooled
> >> by the name. Will read about it. Thanks.
> >>
> >>>
> >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> >>> to figure out if the folio is now completely unmapped.
> >>>
> >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> >>> thing, we will still add the folios to the deferred split queue. Migration
> >>> would still do that, or if there are multiple VMAs covering a folio.
> >>>
> >>> (4) We should really avoid making common operations slower only to make
> >>> some unreliable stats less unreliable.
> >>>
> >>>
> >>> We should likely do something like the following, which might even be a bit
> >>> faster in some cases because we avoid a function call in case we unmap
> >>> individual PTEs by checking _deferred_list ahead of time
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..356598b3dc3c 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>                   * page of the folio is unmapped and at least one page
> >>>                   * is still mapped.
> >>>                   */
> >>> -               if (folio_test_large(folio) && folio_test_anon(folio))
> >>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>> -                               deferred_split_folio(folio);
> >>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
> >>> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> >>> +                   atomic_read(mapped) &&
> >>> +                   data_race(list_empty(&folio->_deferred_list)))
> >>
> >> data_race() might not be needed, as Ryan pointed out[1]
> >>
> >>> +                       deferred_split_folio(folio);
> >>>          }
> >>>
> >>> I also thought about handling the scenario where we unmap the whole
> >>> thing in smaller chunks. We could detect "!atomic_read(mapped)" and
> >>> detect that it is on the deferred split list, and simply remove it
> >>> from that list, incrementing a THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >>>
> >>> But it would be racy with concurrent remapping of the folio (might happen with
> >>> anon folios in corner cases I guess).
> >>>
> >>> What we can do is the following, though:
> >>>
> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>> index dc30139590e6..f05cba1807f2 100644
> >>> --- a/mm/huge_memory.c
> >>> +++ b/mm/huge_memory.c
> >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> >>>          ds_queue = get_deferred_split_queue(folio);
> >>>          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >>>          if (!list_empty(&folio->_deferred_list)) {
> >>> +               if (folio_test_pmd_mappable(folio))
> >>> +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> >>>                  ds_queue->split_queue_len--;
> >>>                  list_del_init(&folio->_deferred_list);
> >>>          }
> >>>
> >>> Adding the right event of course.
> >>>
> >>>
> >>> Then it's easy to filter out these "temporarily added to the list, but never split
> >>> before the folio was freed" cases.
> >>
> >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
> >
> > It is definitely possible that the THP on the deferred split queue are
> > freed instead of split. For example, 1M is unmapped for a 2M THP, then
> > later the remaining 1M is unmapped, or the process exits before memory
> > pressure happens. So how come we can tell it is "temporarily added to
> > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> > actually just counts how many pages are still on deferred split queue.
>
> Not quite I think. I don't think we have a counter that counts how many
> large folios on the deferred list were split. I think we only have
> THP_SPLIT_PAGE.

Yes, we just count how many THP were split regardless of why they got
split. They may be split from a deferred split queue due to memory
pressure, migration, etc.

>
> We could have
> * THP_DEFERRED_SPLIT_PAGE
> * THP_UNDO_DEFERRED_SPLIT_PAGE
> * THP_PERFORM_DEFERRED_SPLIT_PAGE
>
> Maybe that would catch more cases (not sure if all, though). Then, you
> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
>
> That could give one a clearer picture how deferred split interacts with
> actual splitting (possibly under memory pressure), the whole reason why
> deferred splitting was added after all.

I'm not quite sure whether there is a solid use case or not. If we
do, we could consider this. But a simpler counter may be
preferable.

>
> > It may be useful. However the counter is typically used to estimate
> > how many THP are partially unmapped during a period of time.
>
> I'd say it's a bit of an abuse of that counter; well, or interpreting
> something into the counter that that counter never reliably represented.

It was way more reliable than now.

>
> I can easily write a program that keeps sending your counter to infinity
> simply by triggering that behavior in a loop, so it's all a bit shaky.

I don't doubt that. But let's get back to reality. The counter used to
stay reasonable and reliable with most real life workloads before
mTHP. There may be over-counting, for example, when unmapping a
PTE-mapped THP which was not on a deferred split queue before. But
such a case is not common for real life workloads because the huge PMD
has to be split by partial unmap for most cases. And the partial unmap
will add the THP to deferred split queue.

But now a common workload, for example, just process exit, may
probably send the counter to infinity.

>
> Something like Ryans script makes more sense, where you get a clearer
> picture of what's mapped where and how. Because that information can be
> much more valuable than just knowing if it's mapped fully or partially
> (again, relevant for dealing with memory waste).

Ryan's script is very helpful. But the counter has existed and been
used for years, and it is a quick indicator and much easier to monitor
in a large-scale fleet.

If we think the reliability of the counter is not worth fixing, why
don't we just remove it. No counter is better than a broken counter.

>
> --
> Cheers,
>
> David / dhildenb
>
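
[As a concrete illustration of the "program that keeps sending your counter
to infinity" mentioned above, here is a hedged userspace sketch (not from the
thread). It assumes a 2MB PMD size, anon THP enabled (always or madvise), and
a kernel predating the fix under discussion: the mprotect() of a single page
only splits the PMD mapping, the folio itself stays fully mapped until the
final munmap(), yet each iteration should still bump thp_deferred_split_page
in /proc/vmstat on such a kernel.

/* deferred_split_counter_demo.c -- illustrative sketch only.
 * Build: gcc -O2 deferred_split_counter_demo.c -o demo
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PMD_SIZE	(2UL << 20)

static long read_thp_deferred_split(void)
{
	char line[256];
	long val = -1;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "thp_deferred_split_page %ld", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	long before = read_thp_deferred_split();

	for (int i = 0; i < 1000; i++) {
		/* Over-allocate so a 2MB-aligned chunk fits inside. */
		char *map = mmap(NULL, 2 * PMD_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (map == MAP_FAILED)
			return 1;
		char *thp = (char *)(((uintptr_t)map + PMD_SIZE - 1) &
				     ~(PMD_SIZE - 1));

		madvise(thp, PMD_SIZE, MADV_HUGEPAGE);
		memset(thp, 1, PMD_SIZE);	/* fault in, hopefully PMD-mapped */

		/* Split only the PMD mapping; every page stays mapped. */
		mprotect(thp, 4096, PROT_READ);

		/* Fully unmap: nothing was ever partially unmapped, yet a
		 * pre-fix kernel still queues the folio for deferred split
		 * and bumps the counter. */
		munmap(map, 2 * PMD_SIZE);
	}

	printf("thp_deferred_split_page: %ld -> %ld\n",
	       before, read_thp_deferred_split());
	return 0;
}
]
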
David Hildenbrand April 15, 2024, 7:19 p.m. UTC | #22
>>
>> We could have
>> * THP_DEFERRED_SPLIT_PAGE
>> * THP_UNDO_DEFERRED_SPLIT_PAGE
>> * THP_PERFORM_DEFERRED_SPLIT_PAGE
>>
>> Maybe that would catch more cases (not sure if all, though). Then, you
>> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
>> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
>>
>> That could give one a clearer picture how deferred split interacts with
>> actual splitting (possibly under memory pressure), the whole reason why
>> deferred splitting was added after all.
> 
> I'm not quite sure whether there is a solid use case or not. If we
> do, we could consider this. But a simpler counter may be
> preferable.

Yes.

> 
>>
>>> It may be useful. However the counter is typically used to estimate
>>> how many THP are partially unmapped during a period of time.
>>
>> I'd say it's a bit of an abuse of that counter; well, or interpreting
>> something into the counter that that counter never reliably represented.
> 
> It was way more reliable than now.

Correct me if I am wrong: now that we only adjust the counter for 
PMD-sized THP, it is as (un)reliable as it always was.

Or was there another unintended change by some of my cleanups or 
previous patches?

> 
>>
>> I can easily write a program that keeps sending your counter to infinity
>> simply by triggering that behavior in a loop, so it's all a bit shaky.
> 
> I don't doubt that. But let's get back to reality. The counter used to
> stay reasonable and reliable with most real life workloads before
> mTHP. There may be over-counting, for example, when unmapping a
> PTE-mapped THP which was not on a deferred split queue before. But
> such a case is not common for real life workloads because the huge PMD
> has to be split by partial unmap for most cases. And the partial unmap
> will add the THP to deferred split queue.
> 
> But now a common workload, for example, just process exit, may
> probably send the counter to infinity.

Agreed, that's stupid.

> 
>>
>> Something like Ryans script makes more sense, where you get a clearer
>> picture of what's mapped where and how. Because that information can be
>> much more valuable than just knowing if it's mapped fully or partially
> > (again, relevant for dealing with memory waste).
> 
> > Ryan's script is very helpful. But the counter has existed and been
> > used for years, and it is a quick indicator and much easier to monitor
> in a large-scale fleet.
> 
> If we think the reliability of the counter is not worth fixing, why
> don't we just remove it. No counter is better than a broken counter.

Again, is only counting the PMD-sized THPs "fixing" the old use cases? 
Then it should just stick around. And we can even optimize it for some 
more cases as proposed in this patch. But there is no easy way to "get 
it completely right" I'm afraid.
Yang Shi April 15, 2024, 9:16 p.m. UTC | #23
On Mon, Apr 15, 2024 at 12:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >> We could have
> >> * THP_DEFERRED_SPLIT_PAGE
> >> * THP_UNDO_DEFERRED_SPLIT_PAGE
> >> * THP_PERFORM_DEFERRED_SPLIT_PAGE
> >>
> >> Maybe that would catch more cases (not sure if all, though). Then, you
> >> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
> >> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
> >>
> >> That could give one a clearer picture how deferred split interacts with
> >> actual splitting (possibly under memory pressure), the whole reason why
> >> deferred splitting was added after all.
> >
> > I'm not quite sure whether there is a solid use case or not. If we
> > do, we could consider this. But a simpler counter may be
> > preferable.
>
> Yes.
>
> >
> >>
> >>> It may be useful. However the counter is typically used to estimate
> >>> how many THP are partially unmapped during a period of time.
> >>
> >> I'd say it's a bit of an abuse of that counter; well, or interpreting
> >> something into the counter that that counter never reliably represented.
> >
> > It was way more reliable than now.
>
> Correct me if I am wrong: now that we only adjust the counter for
> PMD-sized THP, it is as (un)reliable as it always was.

Yes. The problem introduced by mTHP was somewhat worked around by that commit.

>
> Or was there another unintended change by some of my cleanups or
> previous patches?

No, at least I haven't seen one so far.

>
> >
> >>
> >> I can easily write a program that keeps sending your counter to infinity
> >> simply by triggering that behavior in a loop, so it's all a bit shaky.
> >
> > I don't doubt that. But let's get back to reality. The counter used to
> > stay reasonable and reliable with most real life workloads before
> > mTHP. There may be over-counting, for example, when unmapping a
> > PTE-mapped THP which was not on a deferred split queue before. But
> > such a case is not common for real life workloads because the huge PMD
> > has to be split by partial unmap for most cases. And the partial unmap
> > will add the THP to deferred split queue.
> >
> > But now a common workload, for example, just process exit, may
> > probably send the counter to infinity.
>
> Agreed, that's stupid.
>
> >
> >>
> >> Something like Ryans script makes more sense, where you get a clearer
> >> picture of what's mapped where and how. Because that information can be
> >> much more valuable than just knowing if it's mapped fully or partially
> > > (again, relevant for dealing with memory waste).
> >
> > > Ryan's script is very helpful. But the counter has existed and been
> > > used for years, and it is a quick indicator and much easier to monitor
> > in a large-scale fleet.
> >
> > If we think the reliability of the counter is not worth fixing, why
> > don't we just remove it. No counter is better than a broken counter.
>
> Again, is only counting the PMD-sized THPs "fixing" the old use cases?

Yes

> Then it should just stick around. And we can even optimize it for some
> more cases as proposed in this patch. But there is no easy way to "get
> it completely right" I'm afraid.

I don't mean we should revert that "fixing"; my point is that we should
not rely on it, and that we should make the rmap removal code behave more
reliably regardless of whether we just count PMD-sized THPs or not.

>
> --
> Cheers,
>
> David / dhildenb
>
diff mbox series

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..d599a772e282 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1494,7 +1494,7 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 		enum rmap_level level)
 {
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int last, nr = 0, nr_pmdmapped = 0;
+	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
 	enum node_stat_item idx;
 
 	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1506,7 +1506,8 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 			break;
 		}
 
-		atomic_sub(nr_pages, &folio->_large_mapcount);
+		mapcount = atomic_sub_return(nr_pages,
+					     &folio->_large_mapcount) + 1;
 		do {
 			last = atomic_add_negative(-1, &page->_mapcount);
 			if (last) {
@@ -1554,7 +1555,9 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 		 * is still mapped.
 		 */
 		if (folio_test_large(folio) && folio_test_anon(folio))
-			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
+			if ((level == RMAP_LEVEL_PTE &&
+			     mapcount != 0) ||
+			    (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
 				deferred_split_folio(folio);
 	}