diff mbox series

[v3,2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES

Message ID 20240715192142.3241557-3-peterx@redhat.com (mailing list archive)
State New
Headers show
Series mm/mprotect: Fix dax puds | expand

Commit Message

Peter Xu July 15, 2024, 7:21 p.m. UTC
In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages
altered by protection changes") introduced "numa_huge_pte_updates" vmstat
entry, trying to capture how many huge ptes (in reality, PMD thps at that
time) are marked by NUMA balancing.

This patch proposes to remove it for some reasons.

Firstly, the name is misleading. We can have more than one way to have a
"huge pte" at least nowadays, and that's also the major goal of this patch,
where it paves way for PUD handling in change protection code paths.

PUDs are coming not only for dax (which has already came and yet broken..),
but also for pfnmaps and hugetlb pages.  The name will simply stop making
sense when PUD will start to be involved in mprotect() world.

It'll also make it not reasonable either if we boost the counter for both
pmd/puds.  In short, current accounting won't be right when PUD comes, so
the scheme was only suitable at that point in time where PUD wasn't even
possible.

Secondly, the accounting was simply not right from the start as long as it
was also affected by other call sites besides NUMA.  mprotect() is one,
while userfaultfd-wp also leverages change protection path to modify
pgtables.  If it wants to do right it needs to check the caller but it
never did; at least mprotect() should be there even in 2013.

It gives me the impression that nobody is seriously using this field, and
it's also impossible to be serious.

We may want to do it right if any NUMA developers would like it to exist,
but we should do that with all above resolved, on both considering PUDs,
but also on correct accountings.  That should be able to be done on top
when there's a real need of such.

Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/vm_event_item.h | 1 -
 mm/mprotect.c                 | 8 +-------
 mm/vmstat.c                   | 1 -
 3 files changed, 1 insertion(+), 9 deletions(-)

Comments

David Hildenbrand July 31, 2024, 12:18 p.m. UTC | #1
On 15.07.24 21:21, Peter Xu wrote:
> In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages
> altered by protection changes") introduced "numa_huge_pte_updates" vmstat
> entry, trying to capture how many huge ptes (in reality, PMD thps at that
> time) are marked by NUMA balancing.
> 
> This patch proposes to remove it for some reasons.
> 
> Firstly, the name is misleading. We can have more than one way to have a
> "huge pte" at least nowadays, and that's also the major goal of this patch,
> where it paves way for PUD handling in change protection code paths.
> 
> PUDs are coming not only for dax (which has already came and yet broken..),
> but also for pfnmaps and hugetlb pages.  The name will simply stop making
> sense when PUD will start to be involved in mprotect() world.
> 
> It'll also make it not reasonable either if we boost the counter for both
> pmd/puds.  In short, current accounting won't be right when PUD comes, so
> the scheme was only suitable at that point in time where PUD wasn't even
> possible.
> 
> Secondly, the accounting was simply not right from the start as long as it
> was also affected by other call sites besides NUMA.  mprotect() is one,
> while userfaultfd-wp also leverages change protection path to modify
> pgtables.  If it wants to do right it needs to check the caller but it
> never did; at least mprotect() should be there even in 2013.
> 
> It gives me the impression that nobody is seriously using this field, and
> it's also impossible to be serious.

It's weird and the implementation is ugly. The intention really was to 
only consider MM_CP_PROT_NUMA, but that apparently is not the case.

hugetlb/mprotect/... should have never been accounted.

[...]

>   
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 73d791d1caad..53656227f70d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = {
>   
>   #ifdef CONFIG_NUMA_BALANCING
>   	"numa_pte_updates",
> -	"numa_huge_pte_updates",
>   	"numa_hint_faults",
>   	"numa_hint_faults_local",
>   	"numa_pages_migrated",

It's a user-visible update. I assume most tools should be prepared for 
this stat missing (just like handling !CONFIG_NUMA_BALANCING).

Apparently it's documented [1][2] for some distros:

"The amount of transparent huge pages that were marked for NUMA hinting 
faults. In combination with numa_pte_updates the total address space 
that was marked can be calculated."

And now I realize that change_prot_numa() would account these PMD 
updates as well in numa_pte_updates and I am confused about the SUSE 
documentation: "In combination with numa_pte_updates" doesn't really 
apply, right?

At this point I don't know what's right or wrong.

If we'd want to fix it instead, the right thing to do would be doing the 
accounting only with MM_CP_PROT_NUMA. But then, numa_pte_updates is also 
wrongly updated I believe :(


[1] 
https://documentation.suse.com/de-de/sles/12-SP5/html/SLES-all/cha-tuning-numactl.html
[2] 
https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2749259_1.html
Peter Xu Aug. 4, 2024, 3:06 p.m. UTC | #2
On Wed, Jul 31, 2024 at 02:18:26PM +0200, David Hildenbrand wrote:
> On 15.07.24 21:21, Peter Xu wrote:
> > In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages
> > altered by protection changes") introduced "numa_huge_pte_updates" vmstat
> > entry, trying to capture how many huge ptes (in reality, PMD thps at that
> > time) are marked by NUMA balancing.
> > 
> > This patch proposes to remove it for some reasons.
> > 
> > Firstly, the name is misleading. We can have more than one way to have a
> > "huge pte" at least nowadays, and that's also the major goal of this patch,
> > where it paves way for PUD handling in change protection code paths.
> > 
> > PUDs are coming not only for dax (which has already came and yet broken..),
> > but also for pfnmaps and hugetlb pages.  The name will simply stop making
> > sense when PUD will start to be involved in mprotect() world.
> > 
> > It'll also make it not reasonable either if we boost the counter for both
> > pmd/puds.  In short, current accounting won't be right when PUD comes, so
> > the scheme was only suitable at that point in time where PUD wasn't even
> > possible.
> > 
> > Secondly, the accounting was simply not right from the start as long as it
> > was also affected by other call sites besides NUMA.  mprotect() is one,
> > while userfaultfd-wp also leverages change protection path to modify
> > pgtables.  If it wants to do right it needs to check the caller but it
> > never did; at least mprotect() should be there even in 2013.
> > 
> > It gives me the impression that nobody is seriously using this field, and
> > it's also impossible to be serious.
> 
> It's weird and the implementation is ugly. The intention really was to only
> consider MM_CP_PROT_NUMA, but that apparently is not the case.
> 
> hugetlb/mprotect/... should have never been accounted.
> 
> [...]
> 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 73d791d1caad..53656227f70d 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = {
> >   #ifdef CONFIG_NUMA_BALANCING
> >   	"numa_pte_updates",
> > -	"numa_huge_pte_updates",
> >   	"numa_hint_faults",
> >   	"numa_hint_faults_local",
> >   	"numa_pages_migrated",
> 
> It's a user-visible update. I assume most tools should be prepared for this
> stat missing (just like handling !CONFIG_NUMA_BALANCING).
> 
> Apparently it's documented [1][2] for some distros:

Yes, and AFAIU, [2] is a document to explain an issue relevant to numa
balancing, and I'd highly doubt [2] referenced [1] here; even the order of
the parameters are the same to be listed.

> 
> "The amount of transparent huge pages that were marked for NUMA hinting
> faults. In combination with numa_pte_updates the total address space that
> was marked can be calculated."
> 
> And now I realize that change_prot_numa() would account these PMD updates as
> well in numa_pte_updates and I am confused about the SUSE documentation: "In
> combination with numa_pte_updates" doesn't really apply, right?
> 
> At this point I don't know what's right or wrong.

Me neither, even without PUD involvement.

Talking about numa_pte_updates, hugetlb_change_protection() returns "number
of huge ptes", so one 2M hugetlb page is accounted once; while comparing to
the generic THP (change_protection_range()) it's HPAGE_PUD_NR.  It'll make
more sense to me if it sticks with PAGE_SIZE.  So all these counters look a
bit confusing.

> 
> If we'd want to fix it instead, the right thing to do would be doing the
> accounting only with MM_CP_PROT_NUMA. But then, numa_pte_updates is also
> wrongly updated I believe :(

Right.

I don't have a reason to change numa_pte_updates semantics yet so far, but
here there's the problem where numa_huge_pte_updates can be ambiguous when
there is even PUD involved.

In general, I don't know how I should treat this counter in PUD path even
if NUMA isn't involved in dax yet; it can be soon involved if we move on
with using this same path for hugetlb, or when 1G thp can be possible (with
Yu Zhao's TAO?).

One other thing I can do is I drop this patch, ignore NUMA_HUGE_PTE_UPDATES
in PUD dax processing for now.  It'll work for this series, but it'll still
be a problem later.  I figured maybe we should simply drop it from now.

Thanks,

> 
> 
> [1] https://documentation.suse.com/de-de/sles/12-SP5/html/SLES-all/cha-tuning-numactl.html
> [2] https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2749259_1.html
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Aug. 6, 2024, 1:02 p.m. UTC | #3
> Right.
> 
> I don't have a reason to change numa_pte_updates semantics yet so far, but
> here there's the problem where numa_huge_pte_updates can be ambiguous when
> there is even PUD involved.
> 
> In general, I don't know how I should treat this counter in PUD path even
> if NUMA isn't involved in dax yet; it can be soon involved if we move on
> with using this same path for hugetlb, or when 1G thp can be possible (with
> Yu Zhao's TAO?).

We shouldn't bother about it in the PUD path at all I think. Especially 
as long as NUMA hinting doesn't apply to any of what we would handle on 
the PUD path :)

> 
> One other thing I can do is I drop this patch, ignore NUMA_HUGE_PTE_UPDATES
> in PUD dax processing for now.  It'll work for this series, but it'll still
> be a problem later.  I figured maybe we should simply drop it from now.

It probably shouldn't block your other fixes and we should likely 
discuss that separately.

I agree that we should look into dropping that PMD counter completely.
Peter Xu Aug. 6, 2024, 4:26 p.m. UTC | #4
On Tue, Aug 06, 2024 at 03:02:00PM +0200, David Hildenbrand wrote:
> > Right.
> > 
> > I don't have a reason to change numa_pte_updates semantics yet so far, but
> > here there's the problem where numa_huge_pte_updates can be ambiguous when
> > there is even PUD involved.
> > 
> > In general, I don't know how I should treat this counter in PUD path even
> > if NUMA isn't involved in dax yet; it can be soon involved if we move on
> > with using this same path for hugetlb, or when 1G thp can be possible (with
> > Yu Zhao's TAO?).
> 
> We shouldn't bother about it in the PUD path at all I think. Especially as
> long as NUMA hinting doesn't apply to any of what we would handle on the PUD
> path :)

Hmm, I just noticed that hugetlb was never involved.. but then how about a
potential 1G THP?  Do you mean 1G THP will not be accounted in numa
balancing too even in the future?

The motivation I had this patch in this series is I want to be clear on how
I should treat this counter in pud path if it won't go.  And when people
compare the two paths we'll need to be clear why there's such difference if
I ignore it in pud path.

Per my current read on this counter, it might be an overkill to do that at
all, and it might be simpler we drop it now.

> 
> > 
> > One other thing I can do is I drop this patch, ignore NUMA_HUGE_PTE_UPDATES
> > in PUD dax processing for now.  It'll work for this series, but it'll still
> > be a problem later.  I figured maybe we should simply drop it from now.
> 
> It probably shouldn't block your other fixes and we should likely discuss
> that separately.
> 
> I agree that we should look into dropping that PMD counter completely.

No strong opinion here.  If we prefer keeping that as separate topic, I'll
drop this patch.  You're right, it's not yet relevant to the fix.

Thanks,
David Hildenbrand Aug. 6, 2024, 4:32 p.m. UTC | #5
On 06.08.24 18:26, Peter Xu wrote:
> On Tue, Aug 06, 2024 at 03:02:00PM +0200, David Hildenbrand wrote:
>>> Right.
>>>
>>> I don't have a reason to change numa_pte_updates semantics yet so far, but
>>> here there's the problem where numa_huge_pte_updates can be ambiguous when
>>> there is even PUD involved.
>>>
>>> In general, I don't know how I should treat this counter in PUD path even
>>> if NUMA isn't involved in dax yet; it can be soon involved if we move on
>>> with using this same path for hugetlb, or when 1G thp can be possible (with
>>> Yu Zhao's TAO?).
>>
>> We shouldn't bother about it in the PUD path at all I think. Especially as
>> long as NUMA hinting doesn't apply to any of what we would handle on the PUD
>> path :)
> 
> Hmm, I just noticed that hugetlb was never involved.. but then how about a
> potential 1G THP?  Do you mean 1G THP will not be accounted in numa
> balancing too even in the future?

My best guess is that you would want a separate counter for that. The 
old one was just badly named ...

72403b4a0fbd even spells out "NUMA huge PMD updates".


"NUMA huge PMD updates were the number of THP  updates which in 
combination can be used to calculate how many ptes were updated from 
userspace."

... which doesn't make sense if you don't know how "huge" the huge 
actually was. :)

> 
> The motivation I had this patch in this series is I want to be clear on how
> I should treat this counter in pud path if it won't go.  And when people
> compare the two paths we'll need to be clear why there's such difference if
> I ignore it in pud path.
> 
> Per my current read on this counter, it might be an overkill to do that at
> all, and it might be simpler we drop it now.

Fine with me. But I would send that out separately, not buried in this 
series. The we might actually get Mel to review (was he CCed?).

Up to you!
Peter Xu Aug. 6, 2024, 4:51 p.m. UTC | #6
On Tue, Aug 06, 2024 at 06:32:10PM +0200, David Hildenbrand wrote:
> On 06.08.24 18:26, Peter Xu wrote:
> > On Tue, Aug 06, 2024 at 03:02:00PM +0200, David Hildenbrand wrote:
> > > > Right.
> > > > 
> > > > I don't have a reason to change numa_pte_updates semantics yet so far, but
> > > > here there's the problem where numa_huge_pte_updates can be ambiguous when
> > > > there is even PUD involved.
> > > > 
> > > > In general, I don't know how I should treat this counter in PUD path even
> > > > if NUMA isn't involved in dax yet; it can be soon involved if we move on
> > > > with using this same path for hugetlb, or when 1G thp can be possible (with
> > > > Yu Zhao's TAO?).
> > > 
> > > We shouldn't bother about it in the PUD path at all I think. Especially as
> > > long as NUMA hinting doesn't apply to any of what we would handle on the PUD
> > > path :)
> > 
> > Hmm, I just noticed that hugetlb was never involved.. but then how about a
> > potential 1G THP?  Do you mean 1G THP will not be accounted in numa
> > balancing too even in the future?
> 
> My best guess is that you would want a separate counter for that. The old
> one was just badly named ...
> 
> 72403b4a0fbd even spells out "NUMA huge PMD updates".
> 
> 
> "NUMA huge PMD updates were the number of THP  updates which in combination
> can be used to calculate how many ptes were updated from userspace."
> 
> ... which doesn't make sense if you don't know how "huge" the huge actually
> was. :)
> 
> > 
> > The motivation I had this patch in this series is I want to be clear on how
> > I should treat this counter in pud path if it won't go.  And when people
> > compare the two paths we'll need to be clear why there's such difference if
> > I ignore it in pud path.
> > 
> > Per my current read on this counter, it might be an overkill to do that at
> > all, and it might be simpler we drop it now.
> 
> Fine with me. But I would send that out separately, not buried in this
> series. The we might actually get Mel to review (was he CCed?).

Yes he is.

Fair point, let's do this separately. It's just that when split I don't
feel strongly to push that patch alone..  no reason for me to push dropping
a counter that maybe some people can still use even if I don't.  More
important to me is how I should move on with PUD, then at least this is
fully discussed and ignoring is the option I'm ok.

I'll respin with this patch dropped as of now, then I'll add a comment in
the PUD patch mention this counter is ignored.

Thanks,
diff mbox series

Patch

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..2a3797fb6742 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -59,7 +59,6 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		OOM_KILL,
 #ifdef CONFIG_NUMA_BALANCING
 		NUMA_PTE_UPDATES,
-		NUMA_HUGE_PTE_UPDATES,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 222ab434da54..21172272695e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,7 +363,6 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 	pmd_t *pmd;
 	unsigned long next;
 	long pages = 0;
-	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
 	range.start = 0;
@@ -411,11 +410,8 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 				ret = change_huge_pmd(tlb, vma, pmd,
 						addr, newprot, cp_flags);
 				if (ret) {
-					if (ret == HPAGE_PMD_NR) {
+					if (ret == HPAGE_PMD_NR)
 						pages += HPAGE_PMD_NR;
-						nr_huge_updates++;
-					}
-
 					/* huge pmd was handled */
 					goto next;
 				}
@@ -435,8 +431,6 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 	if (range.start)
 		mmu_notifier_invalidate_range_end(&range);
 
-	if (nr_huge_updates)
-		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 73d791d1caad..53656227f70d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1313,7 +1313,6 @@  const char * const vmstat_text[] = {
 
 #ifdef CONFIG_NUMA_BALANCING
 	"numa_pte_updates",
-	"numa_huge_pte_updates",
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",