diff mbox series

[v2] mm, thp: always specify ineligible vmas as nh in smaps

Message ID alpine.DEB.2.21.1809241227370.241621@chino.kir.corp.google.com (mailing list archive)
State New, archived
Headers show
Series [v2] mm, thp: always specify ineligible vmas as nh in smaps | expand

Commit Message

David Rientjes Sept. 24, 2018, 7:30 p.m. UTC
Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
introduced a regression in that userspace cannot always determine the set
of vmas where thp is ineligible.

Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
to determine if a vma is eligible to be backed by hugepages.

Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
be disabled and emit "nh" as a flag for the corresponding vmas as part of
/proc/pid/smaps.  After the commit, thp is disabled by means of an mm
flag and "nh" is not emitted.

This causes smaps parsing libraries to assume a vma is eligible for thp
and ends up puzzling the user on why its memory is not backed by thp.

This also clears the "hg" flag to make the behavior of MADV_HUGEPAGE and
PR_SET_THP_DISABLE definitive.

Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2:
  - clear VM_HUGEPAGE per Vlastimil
  - update Documentation/filesystems/proc.txt to be explicit

 Documentation/filesystems/proc.txt | 12 ++++++++++--
 fs/proc/task_mmu.c                 | 14 +++++++++++++-
 2 files changed, 23 insertions(+), 3 deletions(-)

Comments

Michal Hocko Sept. 24, 2018, 7:56 p.m. UTC | #1
On Mon 24-09-18 12:30:07, David Rientjes wrote:
> Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> introduced a regression in that userspace cannot always determine the set
> of vmas where thp is ineligible.
> 
> Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> to determine if a vma is eligible to be backed by hugepages.

I was under impression that nh resp hg flags only tell about the madvise
status. How do you exactly use these flags in an application?

Your eligible rules as defined here:

> + [*] A process mapping is eligible to be backed by transparent hugepages (thp)
> +     depending on system-wide settings and the mapping itself.  See
> +     Documentation/admin-guide/mm/transhuge.rst for default behavior.  If a
> +     mapping has a flag of "nh", it is not eligible to be backed by hugepages
> +     in any condition, either because of prctl(PR_SET_THP_DISABLE) or
> +     madvise(MADV_NOHUGEPAGE).  PR_SET_THP_DISABLE takes precedence over any
> +     MADV_HUGEPAGE.

doesn't seem to match the reality. I do not see all the file backed
mappings to be nh marked. So is this really about eligibility rather
than the madvise status? Maybe it is just the above documentation that
needs to be updated.

That being said, I do not object to the patch, I am just trying to
understand what is the intended usage for the flag that does try to say
more than the madvise status.
Michal Hocko Sept. 24, 2018, 8:02 p.m. UTC | #2
On Mon 24-09-18 21:56:03, Michal Hocko wrote:
> On Mon 24-09-18 12:30:07, David Rientjes wrote:
> > Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > introduced a regression in that userspace cannot always determine the set
> > of vmas where thp is ineligible.
> > 
> > Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> > to determine if a vma is eligible to be backed by hugepages.
> 
> I was under impression that nh resp hg flags only tell about the madvise
> status. How do you exactly use these flags in an application?
> 
> Your eligible rules as defined here:
> 
> > + [*] A process mapping is eligible to be backed by transparent hugepages (thp)
> > +     depending on system-wide settings and the mapping itself.  See
> > +     Documentation/admin-guide/mm/transhuge.rst for default behavior.  If a
> > +     mapping has a flag of "nh", it is not eligible to be backed by hugepages
> > +     in any condition, either because of prctl(PR_SET_THP_DISABLE) or
> > +     madvise(MADV_NOHUGEPAGE).  PR_SET_THP_DISABLE takes precedence over any
> > +     MADV_HUGEPAGE.
> 
> doesn't seem to match the reality. I do not see all the file backed
> mappings to be nh marked. So is this really about eligibility rather
> than the madvise status? Maybe it is just the above documentation that
> needs to be updated.
> 
> That being said, I do not object to the patch, I am just trying to
> understand what is the intended usage for the flag that does try to say
> more than the madvise status.

And moreover, how is the PR_SET_THP_DISABLE any different from the
global THP disabled case. Do we want to set all vmas to nh as well?
Vlastimil Babka Sept. 24, 2018, 8:43 p.m. UTC | #3
On 9/24/18 10:02 PM, Michal Hocko wrote:
> On Mon 24-09-18 21:56:03, Michal Hocko wrote:
>> On Mon 24-09-18 12:30:07, David Rientjes wrote:
>>> Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
>>> introduced a regression in that userspace cannot always determine the set
>>> of vmas where thp is ineligible.
>>>
>>> Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
>>> to determine if a vma is eligible to be backed by hugepages.
>>
>> I was under impression that nh resp hg flags only tell about the madvise
>> status. How do you exactly use these flags in an application?
>>
>> Your eligible rules as defined here:
>>
>>> + [*] A process mapping is eligible to be backed by transparent hugepages (thp)
>>> +     depending on system-wide settings and the mapping itself.  See
>>> +     Documentation/admin-guide/mm/transhuge.rst for default behavior.  If a
>>> +     mapping has a flag of "nh", it is not eligible to be backed by hugepages
>>> +     in any condition, either because of prctl(PR_SET_THP_DISABLE) or
>>> +     madvise(MADV_NOHUGEPAGE).  PR_SET_THP_DISABLE takes precedence over any
>>> +     MADV_HUGEPAGE.
>>
>> doesn't seem to match the reality. I do not see all the file backed
>> mappings to be nh marked. So is this really about eligibility rather
>> than the madvise status? Maybe it is just the above documentation that
>> needs to be updated.

Yeah the change from madvise to eligibility in the doc seems to go too far.

>> That being said, I do not object to the patch, I am just trying to
>> understand what is the intended usage for the flag that does try to say
>> more than the madvise status.
> 
> And moreover, how is the PR_SET_THP_DISABLE any different from the
> global THP disabled case. Do we want to set all vmas to nh as well?

Probably not. It's easy to check the global status, but is it possible
to query for the prctl flags of a process? We are looking at process or
even vma-specific flags here. If the prctl was historically implemented
via VM_NOHUGEPAGE and thus reported as such in smaps, it makes sense to
do so even with the MMF_ flag IMHO?
Michal Hocko Sept. 25, 2018, 5:50 a.m. UTC | #4
On Mon 24-09-18 22:43:49, Vlastimil Babka wrote:
> On 9/24/18 10:02 PM, Michal Hocko wrote:
> > On Mon 24-09-18 21:56:03, Michal Hocko wrote:
[...]
> >> That being said, I do not object to the patch, I am just trying to
> >> understand what is the intended usage for the flag that does try to say
> >> more than the madvise status.
> > 
> > And moreover, how is the PR_SET_THP_DISABLE any different from the
> > global THP disabled case. Do we want to set all vmas to nh as well?
> 
> Probably not. It's easy to check the global status, but is it possible
> to query for the prctl flags of a process?

Dunno but I suspect there is no way to check for this.

> We are looking at process or
> even vma-specific flags here. If the prctl was historically implemented
> via VM_NOHUGEPAGE and thus reported as such in smaps, it makes sense to
> do so even with the MMF_ flag IMHO?

Yes if this breaks some userspace which relied on the previous behavior.
But if nothing really broke then I guess it would be better to have the
semantic as clear as possible. Go and check the global status to make
the whole picture doesn't look very sound to me. On the other hand this
VMA has a madvise flag on it sounds quite clear and you know what to
expect at least. Sure the hint might be ignored in the end but well,
these are hints they do not guarantee anything after all.
David Rientjes Sept. 25, 2018, 7:52 p.m. UTC | #5
On Mon, 24 Sep 2018, Vlastimil Babka wrote:

> On 9/24/18 10:02 PM, Michal Hocko wrote:
> > On Mon 24-09-18 21:56:03, Michal Hocko wrote:
> >> On Mon 24-09-18 12:30:07, David Rientjes wrote:
> >>> Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> >>> introduced a regression in that userspace cannot always determine the set
> >>> of vmas where thp is ineligible.
> >>>
> >>> Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> >>> to determine if a vma is eligible to be backed by hugepages.
> >>
> >> I was under impression that nh resp hg flags only tell about the madvise
> >> status. How do you exactly use these flags in an application?
> >>

This is used to identify heap mappings that should be able to fault thp 
but do not, and they normally point to a low-on-memory or fragmentation 
issue.  After commit 1860033237d4, our users of PR_SET_THP_DISABLE no 
longer show "nh" for their heap mappings so they get reported as having a 
low thp ratio when in reality it is disabled.  It is also used in 
automated testing to ensure that vmas get disabled for thp appropriately 
and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced 
this, and those tests now break.

> >> Your eligible rules as defined here:
> >>
> >>> + [*] A process mapping is eligible to be backed by transparent hugepages (thp)
> >>> +     depending on system-wide settings and the mapping itself.  See
> >>> +     Documentation/admin-guide/mm/transhuge.rst for default behavior.  If a
> >>> +     mapping has a flag of "nh", it is not eligible to be backed by hugepages
> >>> +     in any condition, either because of prctl(PR_SET_THP_DISABLE) or
> >>> +     madvise(MADV_NOHUGEPAGE).  PR_SET_THP_DISABLE takes precedence over any
> >>> +     MADV_HUGEPAGE.
> >>
> >> doesn't seem to match the reality. I do not see all the file backed
> >> mappings to be nh marked. So is this really about eligibility rather
> >> than the madvise status? Maybe it is just the above documentation that
> >> needs to be updated.
> 
> Yeah the change from madvise to eligibility in the doc seems to go too far.
> 

I'll reword this to explicitly state that "hg" and "nh" mappings either 
allow or disallow thp backing.
Michal Hocko Sept. 25, 2018, 8:29 p.m. UTC | #6
On Tue 25-09-18 12:52:09, David Rientjes wrote:
> On Mon, 24 Sep 2018, Vlastimil Babka wrote:
> 
> > On 9/24/18 10:02 PM, Michal Hocko wrote:
> > > On Mon 24-09-18 21:56:03, Michal Hocko wrote:
> > >> On Mon 24-09-18 12:30:07, David Rientjes wrote:
> > >>> Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > >>> introduced a regression in that userspace cannot always determine the set
> > >>> of vmas where thp is ineligible.
> > >>>
> > >>> Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> > >>> to determine if a vma is eligible to be backed by hugepages.
> > >>
> > >> I was under impression that nh resp hg flags only tell about the madvise
> > >> status. How do you exactly use these flags in an application?
> > >>
> 
> This is used to identify heap mappings that should be able to fault thp 
> but do not, and they normally point to a low-on-memory or fragmentation 
> issue.  After commit 1860033237d4, our users of PR_SET_THP_DISABLE no 
> longer show "nh" for their heap mappings so they get reported as having a 
> low thp ratio when in reality it is disabled.  

I am still not sure I understand the issue completely. How are PR_SET_THP_DISABLE
users any different from the global THP disabled case? Is this only
about the scope? E.g the one who checks for the state cannot check the
PR_SET_THP_DISABLE state? Besides that what are consequences of the
low ratio? Is this an example of somebody using the prctl and still
complaining or an external observer trying to do something useful which
ends up doing contrary?

> It is also used in 
> automated testing to ensure that vmas get disabled for thp appropriately 
> and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced 
> this, and those tests now break.

This sounds like a bit of an abuse to me. It shows how an internal
implementation detail leaks out to the userspace which is something we
should try to avoid.

> > >> Your eligible rules as defined here:
> > >>
> > >>> + [*] A process mapping is eligible to be backed by transparent hugepages (thp)
> > >>> +     depending on system-wide settings and the mapping itself.  See
> > >>> +     Documentation/admin-guide/mm/transhuge.rst for default behavior.  If a
> > >>> +     mapping has a flag of "nh", it is not eligible to be backed by hugepages
> > >>> +     in any condition, either because of prctl(PR_SET_THP_DISABLE) or
> > >>> +     madvise(MADV_NOHUGEPAGE).  PR_SET_THP_DISABLE takes precedence over any
> > >>> +     MADV_HUGEPAGE.
> > >>
> > >> doesn't seem to match the reality. I do not see all the file backed
> > >> mappings to be nh marked. So is this really about eligibility rather
> > >> than the madvise status? Maybe it is just the above documentation that
> > >> needs to be updated.
> > 
> > Yeah the change from madvise to eligibility in the doc seems to go too far.
> > 
> 
> I'll reword this to explicitly state that "hg" and "nh" mappings either 
> allow or disallow thp backing.

How are you going to distinguish a regular THP-able mapping then? I am
still not sure how this is supposed to work. Could you be more specific.
Let's say I have a THP-able mapping (shmem resp. anon for the current
implementation). What is the the matrix for hg/nh wrt. madvice/nomadvise
PR_SET_THP_DISABLE and global THP enabled/disable.
David Rientjes Sept. 25, 2018, 9:45 p.m. UTC | #7
On Tue, 25 Sep 2018, Michal Hocko wrote:

> > This is used to identify heap mappings that should be able to fault thp 
> > but do not, and they normally point to a low-on-memory or fragmentation 
> > issue.  After commit 1860033237d4, our users of PR_SET_THP_DISABLE no 
> > longer show "nh" for their heap mappings so they get reported as having a 
> > low thp ratio when in reality it is disabled.  
> 
> I am still not sure I understand the issue completely. How are PR_SET_THP_DISABLE
> users any different from the global THP disabled case? Is this only
> about the scope? E.g the one who checks for the state cannot check the
> PR_SET_THP_DISABLE state? Besides that what are consequences of the
> low ratio? Is this an example of somebody using the prctl and still
> complaining or an external observer trying to do something useful which
> ends up doing contrary?
> 

Yes, that is how I found out about this.  The system-wide policy can be 
determined from /sys/kernel/mm/transparent_hugepage/enabled.  If it is 
"always" and heap mappings are not being backed by hugepages and lack the 
"nh" flag, it was considered as a likely fragmentation issue before commit 
1860033237d4.  After commit 1860033237d4, the heap mapping for 
PR_SET_THP_DISABLE users was not showing it actually is prevented from 
faulting thp because of policy, not because of fragmentation.

> > It is also used in 
> > automated testing to ensure that vmas get disabled for thp appropriately 
> > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced 
> > this, and those tests now break.
> 
> This sounds like a bit of an abuse to me. It shows how an internal
> implementation detail leaks out to the userspace which is something we
> should try to avoid.
> 

Well, it's already how this has worked for years before commit 
1860033237d4 broke it.  Changing the implementation in the kernel is fine 
as long as you don't break userspace who relies on what is exported to it 
and is the only way to determine if MADV_NOHUGEPAGE is preventing it from 
being backed by hugepages.

> > I'll reword this to explicitly state that "hg" and "nh" mappings either 
> > allow or disallow thp backing.
> 
> How are you going to distinguish a regular THP-able mapping then? I am
> still not sure how this is supposed to work. Could you be more specific.

You look for "[heap]" in smaps to determine where the heap mapping is.
Andrew Morton Sept. 25, 2018, 10:04 p.m. UTC | #8
On Tue, 25 Sep 2018 14:45:19 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> > > It is also used in 
> > > automated testing to ensure that vmas get disabled for thp appropriately 
> > > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced 
> > > this, and those tests now break.
> > 
> > This sounds like a bit of an abuse to me. It shows how an internal
> > implementation detail leaks out to the userspace which is something we
> > should try to avoid.
> > 
> 
> Well, it's already how this has worked for years before commit 
> 1860033237d4 broke it.  Changing the implementation in the kernel is fine 
> as long as you don't break userspace who relies on what is exported to it 
> and is the only way to determine if MADV_NOHUGEPAGE is preventing it from 
> being backed by hugepages.

1860033237d4 was over a year ago so perhaps we don't need to be
too worried about restoring the old interface.  In which case
we have an opportunity to make improvements such as that suggested
by Michal?
David Rientjes Sept. 26, 2018, 12:55 a.m. UTC | #9
On Tue, 25 Sep 2018, Andrew Morton wrote:

> > > > It is also used in 
> > > > automated testing to ensure that vmas get disabled for thp appropriately 
> > > > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced 
> > > > this, and those tests now break.
> > > 
> > > This sounds like a bit of an abuse to me. It shows how an internal
> > > implementation detail leaks out to the userspace which is something we
> > > should try to avoid.
> > > 
> > 
> > Well, it's already how this has worked for years before commit 
> > 1860033237d4 broke it.  Changing the implementation in the kernel is fine 
> > as long as you don't break userspace who relies on what is exported to it 
> > and is the only way to determine if MADV_NOHUGEPAGE is preventing it from 
> > being backed by hugepages.
> 
> 1860033237d4 was over a year ago so perhaps we don't need to be
> too worried about restoring the old interface.  In which case
> we have an opportunity to make improvements such as that suggested
> by Michal?
> 

The only way to determine if a vma was thp disabled prior to this commit 
was parsing VmFlags from /proc/pid/smaps.  That was possible either 
through MADV_NOHUGEPAGE or PR_SET_THP_DISABLE.  It is perfectly legitimate 
for a test case to check if either are being set correctly through 
userspace libraries or through the kernel itself in the manner in which 
the kernel exports this information.  It is also perfectly legitimate for 
userspace to cull through information in the only way it is exported by 
the kernel to identify reasons for why applications are not having their 
heap backed by transparent hugepages: the mapping is disabled, the 
application is hitting the limit for its mem cgroup, we are low on memory, 
or there are fragmentation issues.  Differentiating between those is 
something our userspace does, and was broken by 1860033237d4.
Michal Hocko Sept. 26, 2018, 6:06 a.m. UTC | #10
On Tue 25-09-18 15:04:06, Andrew Morton wrote:
> On Tue, 25 Sep 2018 14:45:19 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
> > > > It is also used in 
> > > > automated testing to ensure that vmas get disabled for thp appropriately 
> > > > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced 
> > > > this, and those tests now break.
> > > 
> > > This sounds like a bit of an abuse to me. It shows how an internal
> > > implementation detail leaks out to the userspace which is something we
> > > should try to avoid.
> > > 
> > 
> > Well, it's already how this has worked for years before commit 
> > 1860033237d4 broke it.  Changing the implementation in the kernel is fine 
> > as long as you don't break userspace who relies on what is exported to it 
> > and is the only way to determine if MADV_NOHUGEPAGE is preventing it from 
> > being backed by hugepages.
> 
> 1860033237d4 was over a year ago so perhaps we don't need to be
> too worried about restoring the old interface.  In which case
> we have an opportunity to make improvements such as that suggested
> by Michal?

Yeah, can we add a way to export PR_SET_THP_DISABLE to userspace
somehow? E.g. /proc/<pid>/status. It is a process wide thing so
reporting it per VMA sounds strange at best.

This would also keep a sane (and currently documented) semantic for
the smaps flag to be really
    hg  - huge page advise flag
    nh  - no-huge page advise flag
diff mbox series

Patch

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -490,10 +490,18 @@  manner. The codes are the following:
     dd  - do not include area into core dump
     sd  - soft-dirty flag
     mm  - mixed map area
-    hg  - huge page advise flag
-    nh  - no-huge page advise flag
+    hg  - eligible for transparent hugepages [*]
+    nh  - not eligible for transparent hugepages [*]
     mg  - mergable advise flag
 
+ [*] A process mapping is eligible to be backed by transparent hugepages (thp)
+     depending on system-wide settings and the mapping itself.  See
+     Documentation/admin-guide/mm/transhuge.rst for default behavior.  If a
+     mapping has a flag of "nh", it is not eligible to be backed by hugepages
+     in any condition, either because of prctl(PR_SET_THP_DISABLE) or
+     madvise(MADV_NOHUGEPAGE).  PR_SET_THP_DISABLE takes precedence over any
+     MADV_HUGEPAGE.
+
 Note that there is no guarantee that every flag and associated mnemonic will
 be present in all further kernel releases. Things get changed, the flags may
 be vanished or the reverse -- new added.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -653,13 +653,25 @@  static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 	};
+	unsigned long flags = vma->vm_flags;
 	size_t i;
 
+	/*
+	 * Disabling thp is possible through both MADV_NOHUGEPAGE and
+	 * PR_SET_THP_DISABLE.  Both historically used VM_NOHUGEPAGE.  Since
+	 * the introduction of MMF_DISABLE_THP, however, userspace needs the
+	 * ability to detect vmas where thp is not eligible in the same manner.
+	 */
+	if (vma->vm_mm && test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) {
+		flags &= ~VM_HUGEPAGE;
+		flags |= VM_NOHUGEPAGE;
+	}
+
 	seq_puts(m, "VmFlags: ");
 	for (i = 0; i < BITS_PER_LONG; i++) {
 		if (!mnemonics[i][0])
 			continue;
-		if (vma->vm_flags & (1UL << i)) {
+		if (flags & (1UL << i)) {
 			seq_putc(m, mnemonics[i][0]);
 			seq_putc(m, mnemonics[i][1]);
 			seq_putc(m, ' ');