
[v3,0/8] Make khugepaged collapse readonly FS THP more consistent

Message ID 20220404200250.321455-1-shy828301@gmail.com (mailing list archive)

Message

Yang Shi April 4, 2022, 8:02 p.m. UTC
Changelog
v3: * Register mm to khugepaged in common mmap path instead of touching
      filesystem code (patch 8/8) suggested by Ted.
    * New patch 7/8 cleaned up and renamed khugepaged_enter_vma_merge()
      to khugepaged_enter_vma().
    * Collected acked-by from Song Liu for patch 1 ~ 6.
    * Rebased on top of 5.18-rc1.
    * Excluded linux-xfs and linux-ext4 list since the series doesn't
      touch fs code anymore, but keep linux-fsdevel posted. 
v2: * Collected reviewed-by tags from Miaohe Lin.
    * Fixed build error for patch 4/8.

The readonly FS THP relies on khugepaged to collapse THP for suitable
vmas.  But it is kind of "random luck" for khugepaged to see the
readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:
  - Anon huge pmd page fault
  - VMA merge
  - MADV_HUGEPAGE
  - Shmem mmap

If the above conditions are not met, even though khugepaged is enabled
it won't see readonly FS vmas at all.  MADV_HUGEPAGE could be specified
explicitly to tell khugepaged to collapse this area, but when khugepaged
mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE
is not set.
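The intended precedence can be sketched as a tiny userspace model (the mode and flag names mirror the kernel's, but the function itself is illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative userspace sketch of the scan-eligibility argument above.
 * Not kernel code: the kernel spreads this logic across several helpers. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

struct vma_flags {
    bool vm_hugepage;    /* set by MADV_HUGEPAGE */
    bool vm_nohugepage;  /* set by MADV_NOHUGEPAGE */
};

/* Should khugepaged scan this vma? */
bool khugepaged_should_scan(enum thp_mode mode, struct vma_flags f)
{
    if (f.vm_nohugepage)      /* VM_NOHUGEPAGE always wins */
        return false;
    if (mode == THP_ALWAYS)   /* "always": any suitable vma */
        return true;
    if (mode == THP_MADVISE)  /* "madvise": only opted-in vmas */
        return f.vm_hugepage;
    return false;             /* "never" */
}
```

The point of the series is the second branch: in "always" mode, eligibility should not additionally depend on which code path happened to register the vma.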

So make sure readonly FS vmas are registered to khugepaged to make the
behavior more consistent.

Registering suitable vmas in the common mmap path covers both readonly
FS vmas and shmem vmas, so the khugepaged calls in shmem.c are removed.
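The once-per-mm registration scheme can be sketched in userspace (names below are illustrative; in the kernel the "already registered" bit is MMF_VM_HUGEPAGE, and the registration adds an mm_slot to khugepaged's scan list):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of registering an mm with khugepaged from the mmap path.
 * Each mm carries a "known to khugepaged" bit; the first suitable vma
 * mapped into an mm enqueues that mm exactly once. */
struct mm_sim {
    bool mmf_vm_hugepage;   /* already on khugepaged's scan list? */
    int  enqueued;          /* how many times we were put on the list */
};

void khugepaged_enter_sim(struct mm_sim *mm, bool vma_suitable)
{
    if (!vma_suitable || mm->mmf_vm_hugepage)
        return;             /* ineligible, or already registered */
    mm->mmf_vm_hugepage = true;
    mm->enqueued++;         /* kernel: add an mm_slot to the scan list */
}
```

Calling one such hook from the common mmap path is what makes the behavior consistent: shmem, readonly FS and anon vmas all funnel through the same check, which is why the shmem.c special case can go away.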

Patches 1 ~ 7 are minor bug fixes, cleanups and preparation patches.
Patch 8 is the real meat.


Tested with khugepaged test in selftests and the testcase provided by
Vlastimil Babka in https://lore.kernel.org/lkml/df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz/
by commenting out MADV_HUGEPAGE call.


Yang Shi (8):
      sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE
      mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED
      mm: khugepaged: skip DAX vma
      mm: thp: only regular file could be THP eligible
      mm: khugepaged: make khugepaged_enter() void function
      mm: khugepaged: move some khugepaged_* functions to khugepaged.c
      mm: khugepaged: introduce khugepaged_enter_vma() helper
      mm: mmap: register suitable readonly file vmas for khugepaged

 include/linux/huge_mm.h        | 14 ++++++++++++
 include/linux/khugepaged.h     | 59 ++++++++++++---------------------------------------
 include/linux/sched/coredump.h |  3 ++-
 kernel/fork.c                  |  4 +---
 mm/huge_memory.c               | 15 ++++---------
 mm/khugepaged.c                | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
 mm/mmap.c                      | 14 ++++++++----
 mm/shmem.c                     | 12 -----------
 8 files changed, 88 insertions(+), 109 deletions(-)

Comments

Matthew Wilcox (Oracle) April 5, 2022, 12:16 a.m. UTC | #1
On Mon, Apr 04, 2022 at 01:02:42PM -0700, Yang Shi wrote:
> The readonly FS THP relies on khugepaged to collapse THP for suitable
> vmas.  But it is kind of "random luck" for khugepaged to see the
> readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:

I still don't see the point.  The effort should be put into
supporting large folios, not in making this hack work better.
Yang Shi April 5, 2022, 12:48 a.m. UTC | #2
On Mon, Apr 4, 2022 at 5:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 04, 2022 at 01:02:42PM -0700, Yang Shi wrote:
> > The readonly FS THP relies on khugepaged to collapse THP for suitable
> > vmas.  But it is kind of "random luck" for khugepaged to see the
> > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:
>
> I still don't see the point.  The effort should be put into
> supporting large folios, not in making this hack work better.

The series makes sense even if the hack is replaced by large
folios IMHO. The problem is that file VMAs may not be registered with
khugepaged consistently for some THP modes, for example "always",
regardless of whether they are readonly or the hack is gone. IIUC
even though the hack is replaced by the large folios, we still have
khugepaged to collapse pmd-mappable huge pages for both anonymous vmas
and file vmas, right? Or are you thinking about killing khugepaged
soon with supporting large folios?

Anyway it may make things clearer if the cover letter is rephrased to:

When khugepaged collapses file THPs, its behavior is not consistent.
It is kind of "random luck" for khugepaged to see the file vmas (see
report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
since currently the vmas are registered to khugepaged when:
  - Anon huge pmd page fault
  - VMA merge
  - MADV_HUGEPAGE
  - Shmem mmap

If the above conditions are not met, even though khugepaged is enabled
it won't see any file vma at all.  MADV_HUGEPAGE could be specified
explicitly to tell khugepaged to collapse this area, but when
khugepaged mode is "always" it should scan suitable vmas as long as
VM_NOHUGEPAGE is not set.

So make sure file vmas are registered to khugepaged to make the
behavior more consistent.
Matthew Wilcox (Oracle) April 27, 2022, 8:58 p.m. UTC | #3
On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> When khugepaged collapses file THPs, its behavior is not consistent.
> It is kind of "random luck" for khugepaged to see the file vmas (see
> report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> since currently the vmas are registered to khugepaged when:
>   - Anon huge pmd page fault
>   - VMA merge
>   - MADV_HUGEPAGE
>   - Shmem mmap
> 
> If the above conditions are not met, even though khugepaged is enabled
> it won't see any file vma at all.  MADV_HUGEPAGE could be specified
> explicitly to tell khugepaged to collapse this area, but when
> khugepaged mode is "always" it should scan suitable vmas as long as
> VM_NOHUGEPAGE is not set.

I don't see that as being true at all.  The point of this hack was that
applications which really knew what they were doing could enable it.
It makes no sense to me that setting "always" by the sysadmin for shmem
also force-enables ROTHP, even for applications which aren't aware of it.

Most telling, I think, is that Song Liu hasn't weighed in on this at
all.  It's clearly not important to the original author.
Yang Shi April 27, 2022, 10:38 p.m. UTC | #4
On Wed, Apr 27, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> > When khugepaged collapses file THPs, its behavior is not consistent.
> > It is kind of "random luck" for khugepaged to see the file vmas (see
> > report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> > since currently the vmas are registered to khugepaged when:
> >   - Anon huge pmd page fault
> >   - VMA merge
> >   - MADV_HUGEPAGE
> >   - Shmem mmap
> >
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see any file vma at all.  MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when
> > khugepaged mode is "always" it should scan suitable vmas as long as
> > VM_NOHUGEPAGE is not set.
>
> I don't see that as being true at all.  The point of this hack was that
> applications which really knew what they were doing could enable it.
> It makes no sense to me that setting "always" by the sysadmin for shmem
> also force-enables ROTHP, even for applications which aren't aware of it.
>
> Most telling, I think, is that Song Liu hasn't weighed in on this at
> all.  It's clearly not important to the original author.

I tend to agree that MADV_HUGEPAGE was what the original author had in
mind when this feature (or hack) was designed in the first place. And
"madvise" is definitely the recommended way to use THP, but I don't
think that means we should not care about "always" and assume nobody
uses it; otherwise the issue would not have been reported.
Yang Shi April 27, 2022, 11:16 p.m. UTC | #5
On Wed, Apr 27, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> > When khugepaged collapses file THPs, its behavior is not consistent.
> > It is kind of "random luck" for khugepaged to see the file vmas (see
> > report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> > since currently the vmas are registered to khugepaged when:
> >   - Anon huge pmd page fault
> >   - VMA merge
> >   - MADV_HUGEPAGE
> >   - Shmem mmap
> >
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see any file vma at all.  MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when
> > khugepaged mode is "always" it should scan suitable vmas as long as
> > VM_NOHUGEPAGE is not set.
>
> I don't see that as being true at all.  The point of this hack was that
> applications which really knew what they were doing could enable it.
> It makes no sense to me that setting "always" by the sysadmin for shmem
> also force-enables ROTHP, even for applications which aren't aware of it.
>
> Most telling, I think, is that Song Liu hasn't weighed in on this at
> all.  It's clearly not important to the original author.

Song Liu already acked the series, please see
https://lore.kernel.org/linux-mm/96F2D93B-2043-44C3-8062-C639372A0212@fb.com/
Vlastimil Babka May 9, 2022, 4:05 p.m. UTC | #6
On 4/4/22 22:02, Yang Shi wrote:
>  include/linux/huge_mm.h        | 14 ++++++++++++
>  include/linux/khugepaged.h     | 59 ++++++++++++---------------------------------------
>  include/linux/sched/coredump.h |  3 ++-
>  kernel/fork.c                  |  4 +---
>  mm/huge_memory.c               | 15 ++++---------
>  mm/khugepaged.c                | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
>  mm/mmap.c                      | 14 ++++++++----
>  mm/shmem.c                     | 12 -----------
>  8 files changed, 88 insertions(+), 109 deletions(-)
 
Resending my general feedback from mm-commits thread to include the
public ML's:

There are modestly fewer lines in the end, some duplicate code removed,
special casing in shmem.c removed; that's all good as it is. Also patch 8/8
became quite boring in v3: no need to change individual filesystems and also
no hook in the fault path, just the common mmap path. So I would just handle
patch 6 differently, as I just replied to it, and acked the rest.

That said it's still unfortunately rather a mess of functions that have
similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
So maybe still some space for further cleanups. But the series is fine as it
is so we don't have to wait for it now.

We could also consider that the tracking of which mm is to be scanned is
modelled after ksm, which has its own madvise flag but also no "always"
mode. What if for THP we only tracked actually madvised mm's, and in
"always" mode just scanned all mm's; would that allow ripping out some code
perhaps, while not adding too many unnecessary scans? If some processes are
being scanned without any effect, maybe track success separately, and scan
them less frequently etc. That could be ultimately more efficient than
painfully tracking just *eligibility* for scanning in "always" mode?
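One possible reading of the "scan less frequently" suggestion, as a purely hypothetical userspace sketch (no such mechanism exists in the kernel today): each mm starts hot, fruitless scans double its scan interval up to a cap, and a successful collapse resets it.

```c
#include <assert.h>

/* Hypothetical per-mm scan backoff, one interpretation of the idea above. */
struct scan_state {
    unsigned int interval;  /* scan this mm every `interval` passes */
    unsigned int skip;      /* passes left to skip before next scan */
};

#define MAX_INTERVAL 16u

int should_scan(struct scan_state *s)
{
    if (s->skip) {
        s->skip--;
        return 0;
    }
    return 1;
}

void scan_result(struct scan_state *s, int collapsed)
{
    if (collapsed)
        s->interval = 1;            /* productive mm: keep it hot */
    else if (s->interval < MAX_INTERVAL)
        s->interval *= 2;           /* fruitless scan: back off */
    s->skip = s->interval - 1;
}
```

Under this scheme, mm's that never yield a collapse converge to one scan every MAX_INTERVAL passes instead of being scanned every pass.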

Even more radical thing to consider (maybe that's a LSF/MM level topic, too
bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
in MGLRU, and I probably forgot something else. Maybe time to think about
unifying those scanners?
Yang Shi May 9, 2022, 8:34 p.m. UTC | #7
On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/4/22 22:02, Yang Shi wrote:
> >  include/linux/huge_mm.h        | 14 ++++++++++++
> >  include/linux/khugepaged.h     | 59 ++++++++++++---------------------------------------
> >  include/linux/sched/coredump.h |  3 ++-
> >  kernel/fork.c                  |  4 +---
> >  mm/huge_memory.c               | 15 ++++---------
> >  mm/khugepaged.c                | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
> >  mm/mmap.c                      | 14 ++++++++----
> >  mm/shmem.c                     | 12 -----------
> >  8 files changed, 88 insertions(+), 109 deletions(-)
>
> Resending my general feedback from mm-commits thread to include the
> public ML's:
>
> There's modestly less lines in the end, some duplicate code removed,
> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
> become quite boring in v3, no need to change individual filesystems and also
> no hook in fault path, just the common mmap path. So I would just handle
> patch 6 differently as I just replied to it, and acked the rest.
>
> That said it's still unfortunately rather a mess of functions that have
> similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
> So maybe still some space for further cleanups. But the series is fine as it
> is so we don't have to wait for it now.

Yeah, I agree that we do have a lot of thp checks. I will find some
time to look into it more deeply later.

>
> We could also consider that the tracking of which mm is to be scanned is
> modelled after ksm which has its own madvise flag, but also no "always"
> mode. What if for THP we only tracked actual THP madvised mm's, and in
> "always" mode just scanned all vm's, would that allow ripping out some code
> perhaps, while not adding too many unnecessary scans? If some processes are

Do you mean add all mm(s) to the scan list unconditionally? I don't
think it will scale.

> being scanned without any effect, maybe track success separately, and scan
> them less frequently etc. That could be ultimately more efficient than
> painfully tracking just *eligibility* for scanning in "always" mode?

Sounds like we need a couple of different lists, for example, inactive
and active? And promote or demote mm(s) between the two lists? TBH I
don't see too many benefits at the moment. Or did I misunderstand you?

>
> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
> in MGLRU, and I probably forgot something else. Maybe time to think about
> unifying those scanners?

We do have pagewalk (walk_page_range()) which is used by a couple of
mm features, for example, mlock, mempolicy, mprotect, etc. I'm not sure
whether it is feasible for khugepaged, ksm, etc or not since I didn't
look that hard. But I agree it should be worth looking at.

Vlastimil Babka May 10, 2022, 7:35 a.m. UTC | #8
On 5/9/22 22:34, Yang Shi wrote:
> On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 4/4/22 22:02, Yang Shi wrote:
>> >  include/linux/huge_mm.h        | 14 ++++++++++++
>> >  include/linux/khugepaged.h     | 59 ++++++++++++---------------------------------------
>> >  include/linux/sched/coredump.h |  3 ++-
>> >  kernel/fork.c                  |  4 +---
>> >  mm/huge_memory.c               | 15 ++++---------
>> >  mm/khugepaged.c                | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
>> >  mm/mmap.c                      | 14 ++++++++----
>> >  mm/shmem.c                     | 12 -----------
>> >  8 files changed, 88 insertions(+), 109 deletions(-)
>>
>> Resending my general feedback from mm-commits thread to include the
>> public ML's:
>>
>> There's modestly less lines in the end, some duplicate code removed,
>> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
>> become quite boring in v3, no need to change individual filesystems and also
>> no hook in fault path, just the common mmap path. So I would just handle
>> patch 6 differently as I just replied to it, and acked the rest.
>>
>> That said it's still unfortunately rather a mess of functions that have
>> similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
>> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
>> So maybe still some space for further cleanups. But the series is fine as it
>> is so we don't have to wait for it now.
> 
> Yeah, I agree that we do have a lot of thp checks. Will find some time to
> look into it deeper later.

Thanks.

>>
>> We could also consider that the tracking of which mm is to be scanned is
>> modelled after ksm which has its own madvise flag, but also no "always"
>> mode. What if for THP we only tracked actual THP madvised mm's, and in
>> "always" mode just scanned all vm's, would that allow ripping out some code
>> perhaps, while not adding too many unnecessary scans? If some processes are
> 
> Do you mean add all mm(s) to the scan list unconditionally? I don't
> think it will scale.

It might be interesting to find out how many mm's (percentage of all mm's)
are typically in the list with "always" enabled. I wouldn't be surprised if
it was nearly all of them. Having at least one large enough anonymous area
sounds like something all processes would have these days?

>> being scanned without any effect, maybe track success separately, and scan
>> them less frequently etc. That could be ultimately more efficient than
>> painfully tracking just *eligibility* for scanning in "always" mode?
> 
> Sounds like we need a couple of different lists, for example, inactive
> and active? And promote or demote mm(s) between the two lists? TBH I
> don't see too many benefits at the moment. Or I misunderstood you?

Yeah, something like that. It would of course require finding out whether
khugepaged is uselessly consuming too much cpu these days while not
processing quickly enough the mm's where it succeeds more.

>>
>> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
>> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
>> in MGLRU, and I probably forgot something else. Maybe time to think about
>> unifying those scanners?
> 
> We do have pagewalk (walk_page_range()) which is used by a couple of
> mm stuff, for example, mlock, mempolicy, mprotect, etc. I'm not sure
> whether it is feasible for khugepaged, ksm, etc, or not since I didn't
> look that hard. But I agree it should be worth looking at.

pagewalk is a framework to simplify writing code that processes page tables
for a given one-off task, yeah. But this would be something a bit different,
e.g. a kernel thread that does the sum of what khugepaged/ksm/etc do. Numa
balancing uses task_work instead of kthread so that would require
consideration of which mechanism the unified daemon would use.

Yang Shi May 10, 2022, 7:25 p.m. UTC | #9
On Tue, May 10, 2022 at 12:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/9/22 22:34, Yang Shi wrote:
> > On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 4/4/22 22:02, Yang Shi wrote:
> >> >  include/linux/huge_mm.h        | 14 ++++++++++++
> >> >  include/linux/khugepaged.h     | 59 ++++++++++++---------------------------------------
> >> >  include/linux/sched/coredump.h |  3 ++-
> >> >  kernel/fork.c                  |  4 +---
> >> >  mm/huge_memory.c               | 15 ++++---------
> >> >  mm/khugepaged.c                | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
> >> >  mm/mmap.c                      | 14 ++++++++----
> >> >  mm/shmem.c                     | 12 -----------
> >> >  8 files changed, 88 insertions(+), 109 deletions(-)
> >>
> >> Resending my general feedback from mm-commits thread to include the
> >> public ML's:
> >>
> >> There's modestly less lines in the end, some duplicate code removed,
> >> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
> >> become quite boring in v3, no need to change individual filesystems and also
> >> no hook in fault path, just the common mmap path. So I would just handle
> >> patch 6 differently as I just replied to it, and acked the rest.
> >>
> >> That said it's still unfortunately rather a mess of functions that have
> >> similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
> >> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
> >> So maybe still some space for further cleanups. But the series is fine as it
> >> is so we don't have to wait for it now.
> >
> > Yeah, I agree that we do have a lot of thp checks. Will find some time to
> > look into it deeper later.
>
> Thanks.
>
> >>
> >> We could also consider that the tracking of which mm is to be scanned is
> >> modelled after ksm which has its own madvise flag, but also no "always"
> >> mode. What if for THP we only tracked actual THP madvised mm's, and in
> >> "always" mode just scanned all vm's, would that allow ripping out some code
> >> perhaps, while not adding too many unnecessary scans? If some processes are
> >
> > Do you mean add all mm(s) to the scan list unconditionally? I don't
> > think it will scale.
>
> It might be interesting to find out how many mm's (percentage of all mm's)
> are typically in the list with "always" enabled. I wouldn't be surprised if
> it was nearly all of them. Having at least one large enough anonymous area
> sounds like something all processes would have these days?

Just did a simple test on my Fedora VM with "always", which is almost idle.

The number of user processes (excluding kernel thread) is:
[yangs@localhost ~]$ ps -aux --no-headers | awk '{if ($5 > 0) print $5}' | wc -l
113

The number of mm's on the khugepaged scan list, counted by a simple
drgn script, is 56. Below is the drgn session:
>>> i = 0
>>> mm_list = prog['khugepaged_scan'].mm_head.address_of_()
>>> for mm in list_for_each(mm_list):
...     i = i + 1
...
>>> print(i)

So ~50% of processes are on the list. For busy machines, the percentage
may be higher. And most big enough processes (with large anon mappings)
should be on the list.

>
> >> being scanned without any effect, maybe track success separately, and scan
> >> them less frequently etc. That could be ultimately more efficient than
> >> painfully tracking just *eligibility* for scanning in "always" mode?
> >
> > Sounds like we need a couple of different lists, for example, inactive
> > and active? And promote or demote mm(s) between the two lists? TBH I
> > don't see too many benefits at the moment. Or I misunderstood you?
>
> Yeah, something like that. It would of course require finding out whether
> khugepaged is consuming too much cpu uselessly these days while not
> processing fast enough mm's where it succeeds more.

Yeah, currently we don't have such statistics at mm or vma
granularity. But we may be able to get some stats by some
post-processing with trace events.

>
> >>
> >> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
> >> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
> >> in MGLRU, and I probably forgot something else. Maybe time to think about
> >> unifying those scanners?
> >
> > We do have pagewalk (walk_page_range()) which is used by a couple of
> > mm stuff, for example, mlock, mempolicy, mprotect, etc. I'm not sure
> > whether it is feasible for khugepaged, ksm, etc, or not since I didn't
> > look that hard. But I agree it should be worth looking at.
>
> pagewalk is a framework to simplify writing code that processes page tables
> for a given one-off task, yeah. But this would be something a bit different,
> e.g. a kernel thread that does the sum of what khugepaged/ksm/etc do. Numa
> balancing uses task_work instead of kthread so that would require
> consideration on which mechanism the unified daemon would use.

Aha, OK, you mean consolidate khugepaged, ksmd, etc into one kthread.
TBH I don't know whether that will work out or not. Maybe the first
step is to use the same page table walk framework for all of them?
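The "one walker, many consumers" shape being discussed can be modelled with a toy userspace sketch (purely illustrative; no such kernel interface exists, and a real unified scanner would have far more state per consumer):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a unified scanner: one walk over a range of pages, with
 * each registered consumer (think khugepaged, ksm, NUMA balancing)
 * getting a callback per page instead of running its own walk. */
typedef void (*scan_cb)(unsigned long pfn, void *priv);

struct scanner {
    scan_cb cb;
    void *priv;
};

void unified_walk(struct scanner *scanners, size_t n,
                  unsigned long start_pfn, unsigned long end_pfn)
{
    for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++)
        for (size_t i = 0; i < n; i++)  /* one pass, many consumers */
            scanners[i].cb(pfn, scanners[i].priv);
}

/* Sample consumer: just counts the pages it is shown. */
void count_pages(unsigned long pfn, void *priv)
{
    (void)pfn;
    (*(int *)priv)++;
}
```

The payoff in this model is that N consumers cost one traversal of the page tables instead of N; the open question raised above (kthread vs task_work) is which context would drive the walk.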
