Message ID: 20220404200250.321455-1-shy828301@gmail.com (mailing list archive)
Series: Make khugepaged collapse readonly FS THP more consistent
On Mon, Apr 04, 2022 at 01:02:42PM -0700, Yang Shi wrote:
> The readonly FS THP relies on khugepaged to collapse THP for suitable
> vmas. But it is kind of "random luck" for khugepaged to see the
> readonly FS vmas (see report:
> https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> since currently the vmas are registered to khugepaged when:

I still don't see the point. The effort should be put into supporting
large folios, not in making this hack work better.
On Mon, Apr 4, 2022 at 5:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 04, 2022 at 01:02:42PM -0700, Yang Shi wrote:
> > The readonly FS THP relies on khugepaged to collapse THP for suitable
> > vmas. But it is kind of "random luck" for khugepaged to see the
> > readonly FS vmas (see report:
> > https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> > since currently the vmas are registered to khugepaged when:
>
> I still don't see the point. The effort should be put into
> supporting large folios, not in making this hack work better.

The series makes sense even if the hack is replaced by large folios, IMHO.
The problem is that file VMAs may not be registered with khugepaged
consistently in some THP modes, for example "always", regardless of whether
they are readonly or whether the hack is gone. IIUC, even after the hack is
replaced by large folios, we will still have khugepaged collapsing
pmd-mappable huge pages for both anonymous vmas and file vmas, right? Or are
you thinking about killing khugepaged soon once large folios are supported?

Anyway, it may make things clearer if the cover letter is rephrased to:

When khugepaged collapses file THPs, its behavior is not consistent. It is
kind of "random luck" for khugepaged to see the file vmas (see report:
https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
since currently the vmas are registered to khugepaged when:
  - Anon huge pmd page fault
  - VMA merge
  - MADV_HUGEPAGE
  - Shmem mmap

If the above conditions are not met, even though khugepaged is enabled it
won't see any file vma at all. MADV_HUGEPAGE could be specified explicitly
to tell khugepaged to collapse this area, but when khugepaged mode is
"always" it should scan suitable vmas as long as VM_NOHUGEPAGE is not set.
So make sure file vmas are registered to khugepaged to make the behavior
more consistent.
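The inconsistency described above can be illustrated with a small model. This is a hedged sketch in plain Python, not kernel code; the function names, event names, and mode strings are invented for illustration only. It shows why a plain file mmap() can stay invisible to khugepaged even in "always" mode: none of the four registration events ever fires for it.

```python
# Illustrative model (plain Python, not kernel code) of the behavior
# described above: a VMA's mm is only scanned by khugepaged if one of
# the four registration events ever fired for it.

EVENTS = {"anon_huge_pmd_fault", "vma_merge", "madv_hugepage", "shmem_mmap"}

def registered_with_khugepaged(events_seen):
    """The mm lands on khugepaged's scan list only if at least one
    registration event happened for it."""
    return bool(EVENTS & set(events_seen))

def khugepaged_sees_vma(thp_mode, events_seen, vm_nohugepage=False):
    """Coarse model of whether khugepaged will ever scan a VMA."""
    if vm_nohugepage or thp_mode == "never":
        return False
    if thp_mode == "madvise" and "madv_hugepage" not in events_seen:
        return False
    # Even in "always" mode, an mm that was never registered is
    # simply never visited: the inconsistency the series addresses.
    return registered_with_khugepaged(events_seen)

# A readonly file VMA that was only mmap()ed: no registration event
# fired, so it is skipped even with thp_mode == "always".
```

The series' fix, in terms of this model, is to make the common file mmap path one of the registration events, so "always" mode behaves as its name suggests.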
On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> When khugepaged collapses file THPs, its behavior is not consistent.
> It is kind of "random luck" for khugepaged to see the file vmas (see
> report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> since currently the vmas are registered to khugepaged when:
>   - Anon huge pmd page fault
>   - VMA merge
>   - MADV_HUGEPAGE
>   - Shmem mmap
>
> If the above conditions are not met, even though khugepaged is enabled
> it won't see any file vma at all. MADV_HUGEPAGE could be specified
> explicitly to tell khugepaged to collapse this area, but when
> khugepaged mode is "always" it should scan suitable vmas as long as
> VM_NOHUGEPAGE is not set.

I don't see that as being true at all. The point of this hack was that
applications which really knew what they were doing could enable it. It
makes no sense to me that setting "always" by the sysadmin for shmem also
force-enables ROTHP, even for applications which aren't aware of it.

Most telling, I think, is that Song Liu hasn't weighed in on this at all.
It's clearly not important to the original author.
On Wed, Apr 27, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> > When khugepaged collapses file THPs, its behavior is not consistent.
> > It is kind of "random luck" for khugepaged to see the file vmas (see
> > report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> > since currently the vmas are registered to khugepaged when:
> >   - Anon huge pmd page fault
> >   - VMA merge
> >   - MADV_HUGEPAGE
> >   - Shmem mmap
> >
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see any file vma at all. MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when
> > khugepaged mode is "always" it should scan suitable vmas as long as
> > VM_NOHUGEPAGE is not set.
>
> I don't see that as being true at all. The point of this hack was that
> applications which really knew what they were doing could enable it.
> It makes no sense to me that setting "always" by the sysadmin for shmem
> also force-enables ROTHP, even for applications which aren't aware of it.
>
> Most telling, I think, is that Song Liu hasn't weighed in on this at
> all. It's clearly not important to the original author.

I tend to agree that MADV_HUGEPAGE should be preferred, since this feature
(or hack) was designed with madvise in mind in the first place. And
"madvise" is definitely the recommended way to use THP, but I don't think
that means we should ignore "always" and assume nobody uses it; otherwise
the issue would not have been reported.
On Wed, Apr 27, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> > When khugepaged collapses file THPs, its behavior is not consistent.
> > It is kind of "random luck" for khugepaged to see the file vmas (see
> > report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
> > since currently the vmas are registered to khugepaged when:
> >   - Anon huge pmd page fault
> >   - VMA merge
> >   - MADV_HUGEPAGE
> >   - Shmem mmap
> >
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see any file vma at all. MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when
> > khugepaged mode is "always" it should scan suitable vmas as long as
> > VM_NOHUGEPAGE is not set.
>
> I don't see that as being true at all. The point of this hack was that
> applications which really knew what they were doing could enable it.
> It makes no sense to me that setting "always" by the sysadmin for shmem
> also force-enables ROTHP, even for applications which aren't aware of it.
>
> Most telling, I think, is that Song Liu hasn't weighed in on this at
> all. It's clearly not important to the original author.

Song Liu already acked the series, please see
https://lore.kernel.org/linux-mm/96F2D93B-2043-44C3-8062-C639372A0212@fb.com/
On 4/4/22 22:02, Yang Shi wrote:
>  include/linux/huge_mm.h | 14 ++++++++++++
>  include/linux/khugepaged.h | 59 ++++++++++++---------------------------------------
>  include/linux/sched/coredump.h | 3 ++-
>  kernel/fork.c | 4 +---
>  mm/huge_memory.c | 15 ++++---------
>  mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
>  mm/mmap.c | 14 ++++++++----
>  mm/shmem.c | 12 -----------
>  8 files changed, 88 insertions(+), 109 deletions(-)

Resending my general feedback from the mm-commits thread to include the
public MLs:

There are modestly fewer lines in the end, some duplicate code removed, and
the special casing in shmem.c removed; that's all good as it is. Also patch
8/8 became quite boring in v3: no need to change individual filesystems and
no hook in the fault path, just the common mmap path. So I would just handle
patch 6 differently, as I just replied to it, and acked the rest.

That said, it's still unfortunately rather a mess of functions that have
similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr). So
maybe there is still some space for further cleanups. But the series is fine
as it is, so we don't have to wait for it now.

We could also consider that the tracking of which mm is to be scanned is
modelled after ksm, which has its own madvise flag but no "always" mode.
What if for THP we only tracked actually THP-madvised mm's, and in "always"
mode just scanned all mm's, would that allow ripping out some code perhaps,
while not adding too many unnecessary scans? If some processes are being
scanned without any effect, maybe track success separately, and scan them
less frequently etc. That could be ultimately more efficient than painfully
tracking just *eligibility* for scanning in "always" mode?
An even more radical thing to consider (maybe that's a LSF/MM level topic,
too bad :) is that we scan page tables in ksm, khugepaged, numa balancing,
and soon in MGLRU, and I probably forgot something else. Maybe it's time to
think about unifying those scanners?
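The "track success separately and scan less frequently" idea above can be sketched concretely. The following is a hedged illustration in plain Python, not kernel code; the `MmScanState` name, the exponential backoff, and the cap of 64 are all invented for the sketch. Mms whose scans keep failing to collapse anything accumulate an exponentially growing skip interval, while a single successful collapse resets the backoff so productive mms stay on a short leash.

```python
# Hedged sketch (invented names and thresholds, not kernel code) of the
# "scan unsuccessful mms less frequently" idea: every mm stays eligible,
# but consecutive fruitless scans grow an exponential skip interval,
# and one successful collapse resets it.

class MmScanState:
    def __init__(self):
        self.successes = 0
        self.failures = 0
        self.skip = 0  # khugepaged rounds to skip before the next scan

def record_scan(state, collapsed):
    if collapsed:
        state.successes += 1
        state.failures = 0
        state.skip = 0  # productive mms keep getting scanned every round
    else:
        state.failures += 1
        state.skip = min(2 ** state.failures, 64)  # capped backoff

def should_scan(state):
    """Called once per khugepaged round for each mm."""
    if state.skip > 0:
        state.skip -= 1
        return False
    return True
```

With something like this, scanning every mm in "always" mode might stay cheap enough in steady state, because barren mms quickly decay to one scan per 64 rounds.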
On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/4/22 22:02, Yang Shi wrote:
> >  include/linux/huge_mm.h | 14 ++++++++++++
> >  include/linux/khugepaged.h | 59 ++++++++++++---------------------------------
> >  include/linux/sched/coredump.h | 3 ++-
> >  kernel/fork.c | 4 +---
> >  mm/huge_memory.c | 15 ++++---------
> >  mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++----------------------------
> >  mm/mmap.c | 14 ++++++++----
> >  mm/shmem.c | 12 -----------
> >  8 files changed, 88 insertions(+), 109 deletions(-)
>
> Resending my general feedback from mm-commits thread to include the
> public ML's:
>
> There's modestly less lines in the end, some duplicate code removed,
> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
> become quite boring in v3, no need to change individual filesystems and also
> no hook in fault path, just the common mmap path. So I would just handle
> patch 6 differently as I just replied to it, and acked the rest.
>
> That said it's still unfortunately rather a mess of functions that have
> similar names. transhuge_vma_enabled(vma), hugepage_vma_check(vma),
> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
> So maybe still some space for further cleanups. But the series is fine as it
> is so we don't have to wait for it now.

Yeah, I agree that we do have a lot of thp checks. Will find some time to
look into it deeper later.

> We could also consider that the tracking of which mm is to be scanned is
> modelled after ksm which has its own madvise flag, but also no "always"
> mode. What if for THP we only tracked actual THP madvised mm's, and in
> "always" mode just scanned all vm's, would that allow ripping out some code
> perhaps, while not adding too many unnecessary scans? If some processes are

Do you mean add all mm(s) to the scan list unconditionally? I don't think it
will scale.
> being scanned without any effect, maybe track success separately, and scan
> them less frequently etc. That could be ultimately more efficient than
> painfully tracking just *eligibility* for scanning in "always" mode?

Sounds like we need a couple of different lists, for example, inactive and
active? And promote or demote mm(s) between the two lists? TBH I don't see
too many benefits at the moment. Or did I misunderstand you?

> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
> in MGLRU, and I probably forgot something else. Maybe time to think about
> unifying those scanners?

We do have pagewalk (walk_page_range()) which is used by a couple of mm
things, for example mlock, mempolicy, mprotect, etc. I'm not sure whether it
is feasible for khugepaged, ksm, etc. since I didn't look that hard. But I
agree it should be worth looking at.
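The walk_page_range() mentioned above is a callback-driven walker: the caller supplies per-level hooks through struct mm_walk_ops (pmd_entry, pte_entry, and so on) while the framework does the traversal. The following is a toy Python model of that shape only, with a nested dict standing in for real page tables; the dict layout and hook dispatch are invented for illustration and are not the kernel API.

```python
# Toy model (plain Python, not the kernel API) of the callback pattern
# behind walk_page_range(): one generic traversal, with per-user hooks
# supplied the way mm_walk_ops supplies pmd_entry/pte_entry. A nested
# dict {pmd_addr: {pte_addr: page}} stands in for real page tables.

def walk_page_range(page_table, ops):
    """Visit each pmd, then its ptes, calling whichever hooks are set."""
    results = []
    for pmd_addr in sorted(page_table):
        ptes = page_table[pmd_addr]
        if "pmd_entry" in ops:
            results.append(("pmd", ops["pmd_entry"](pmd_addr, ptes)))
        for pte_addr in sorted(ptes):
            if "pte_entry" in ops:
                results.append(("pte", ops["pte_entry"](pte_addr, ptes[pte_addr])))
    return results

# A khugepaged-like user only needs pmd-level visits; a ksm-like user
# only pte-level ones. Both reuse the same walker.
table = {0x1000: {0x10: "pg0", 0x11: "pg1"}, 0x2000: {0x20: "pg2"}}
pmd_visits = walk_page_range(table, {"pmd_entry": lambda addr, ptes: addr})
pte_visits = walk_page_range(table, {"pte_entry": lambda addr, page: page})
```

A unified scanner daemon would presumably look similar: one walk, with khugepaged-, ksm-, and numa-balancing-style work hung off the per-level hooks.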
On 5/9/22 22:34, Yang Shi wrote:
> On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 4/4/22 22:02, Yang Shi wrote:
>> >  include/linux/huge_mm.h | 14 ++++++++++++
>> >  include/linux/khugepaged.h | 59 ++++++++++++---------------------------------
>> >  include/linux/sched/coredump.h | 3 ++-
>> >  kernel/fork.c | 4 +---
>> >  mm/huge_memory.c | 15 ++++---------
>> >  mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
>> >  mm/mmap.c | 14 ++++++++----
>> >  mm/shmem.c | 12 -----------
>> >  8 files changed, 88 insertions(+), 109 deletions(-)
>>
>> Resending my general feedback from mm-commits thread to include the
>> public ML's:
>>
>> There's modestly less lines in the end, some duplicate code removed,
>> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
>> become quite boring in v3, no need to change individual filesystems and also
>> no hook in fault path, just the common mmap path. So I would just handle
>> patch 6 differently as I just replied to it, and acked the rest.
>>
>> That said it's still unfortunately rather a mess of functions that have
>> similar names. transhuge_vma_enabled(vma), hugepage_vma_check(vma),
>> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
>> So maybe still some space for further cleanups. But the series is fine as it
>> is so we don't have to wait for it now.
>
> Yeah, I agree that we do have a lot thp checks. Will find some time to
> look into it deeper later.

Thanks.

>> We could also consider that the tracking of which mm is to be scanned is
>> modelled after ksm which has its own madvise flag, but also no "always"
>> mode. What if for THP we only tracked actual THP madvised mm's, and in
>> "always" mode just scanned all vm's, would that allow ripping out some code
>> perhaps, while not adding too many unnecessary scans? If some processes are
>
> Do you mean add all mm(s) to the scan list unconditionally?
> I don't
> think it will scale.

It might be interesting to find out how many mm's (percentage of all mm's)
are typically in the list with "always" enabled. I wouldn't be surprised if
it was nearly all of them. Having at least one large enough anonymous area
sounds like something all processes would have these days?

>> being scanned without any effect, maybe track success separately, and scan
>> them less frequently etc. That could be ultimately more efficient than
>> painfully tracking just *eligibility* for scanning in "always" mode?
>
> Sounds like we need a couple of different lists, for example, inactive
> and active? And promote or demote mm(s) between the two lists? TBH I
> don't see too many benefits at the moment. Or I misunderstood you?

Yeah, something like that. It would of course require finding out whether
khugepaged is consuming too much cpu uselessly these days while not
processing fast enough mm's where it succeeds more.

>> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
>> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
>> in MGLRU, and I probably forgot something else. Maybe time to think about
>> unifying those scanners?
>
> We do have pagewalk (walk_page_range()) which is used by a couple of
> mm stuff, for example, mlock, mempolicy, mprotect, etc. I'm not sure
> whether it is feasible for khugepaged, ksm, etc, or not since I didn't
> look that hard. But I agree it should be worth looking at.

pagewalk is a framework to simplify writing code that processes page tables
for a given one-off task, yeah. But this would be something a bit different,
e.g. a kernel thread that does the sum of what khugepaged/ksm/etc do. Numa
balancing uses task_work instead of a kthread, so that would require
consideration of which mechanism the unified daemon would use.
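The active/inactive-list idea discussed above can be sketched as a toy scheduler. This is a hedged illustration in plain Python; the `TwoListScanner` name, the demotion threshold, and the inactive-scan period are all invented, and nothing here is kernel code. Mms that keep yielding no collapses are demoted to an inactive list that is only visited every Nth round; a successful collapse promotes an mm back.

```python
# Toy sketch (invented names/thresholds, not kernel code) of the
# active/inactive mm-list idea: productive mms are scanned every round,
# barren ones are demoted and visited only occasionally.

from collections import deque

class TwoListScanner:
    DEMOTE_AFTER = 3     # consecutive fruitless scans before demotion
    INACTIVE_PERIOD = 8  # visit inactive mms only every Nth round

    def __init__(self, mms):
        self.active = deque(mms)
        self.inactive = deque()
        self.misses = {mm: 0 for mm in mms}
        self.round = 0

    def next_batch(self):
        """Mms to scan this round: all active, plus inactive every Nth."""
        self.round += 1
        batch = list(self.active)
        if self.round % self.INACTIVE_PERIOD == 0:
            batch += list(self.inactive)
        return batch

    def report(self, mm, collapsed):
        if collapsed:
            self.misses[mm] = 0
            if mm in self.inactive:       # promote back to active
                self.inactive.remove(mm)
                self.active.append(mm)
        else:
            self.misses[mm] += 1
            if self.misses[mm] >= self.DEMOTE_AFTER and mm in self.active:
                self.active.remove(mm)    # demote to inactive
                self.inactive.append(mm)
```

Whether this beats the current eligibility tracking would, as noted above, depend on measuring how much cpu khugepaged actually wastes on unproductive mms today.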
On Tue, May 10, 2022 at 12:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/9/22 22:34, Yang Shi wrote:
> > On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 4/4/22 22:02, Yang Shi wrote:
> >> >  include/linux/huge_mm.h | 14 ++++++++++++
> >> >  include/linux/khugepaged.h | 59 ++++++++++++---------------------------------
> >> >  include/linux/sched/coredump.h | 3 ++-
> >> >  kernel/fork.c | 4 +---
> >> >  mm/huge_memory.c | 15 ++++---------
> >> >  mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
> >> >  mm/mmap.c | 14 ++++++++----
> >> >  mm/shmem.c | 12 -----------
> >> >  8 files changed, 88 insertions(+), 109 deletions(-)
> >>
> >> Resending my general feedback from mm-commits thread to include the
> >> public ML's:
> >>
> >> There's modestly less lines in the end, some duplicate code removed,
> >> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
> >> become quite boring in v3, no need to change individual filesystems and also
> >> no hook in fault path, just the common mmap path. So I would just handle
> >> patch 6 differently as I just replied to it, and acked the rest.
> >>
> >> That said it's still unfortunately rather a mess of functions that have
> >> similar names. transhuge_vma_enabled(vma), hugepage_vma_check(vma),
> >> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
> >> So maybe still some space for further cleanups. But the series is fine as it
> >> is so we don't have to wait for it now.
> >
> > Yeah, I agree that we do have a lot thp checks. Will find some time to
> > look into it deeper later.
>
> Thanks.
>
> >> We could also consider that the tracking of which mm is to be scanned is
> >> modelled after ksm which has its own madvise flag, but also no "always"
> >> mode.
> >> What if for THP we only tracked actual THP madvised mm's, and in
> >> "always" mode just scanned all vm's, would that allow ripping out some code
> >> perhaps, while not adding too many unnecessary scans? If some processes are
> >
> > Do you mean add all mm(s) to the scan list unconditionally? I don't
> > think it will scale.
>
> It might be interesting to find out how many mm's (percentage of all mm's)
> are typically in the list with "always" enabled. I wouldn't be surprised if
> it was nearly all of them. Having at least one large enough anonymous area
> sounds like something all processes would have these days?

Just did a simple test on my Fedora VM with "always", which is almost idle.
The number of user processes (excluding kernel threads) is:

[yangs@localhost ~]$ ps -aux --no-headers | awk '{if ($5 > 0) print $5}' | wc -l
113

The number of mms on the khugepaged scan list, counted by a simple drgn
script, is 56. The below is the drgn script:

>>> i = 0
>>> mm_list = prog['khugepaged_scan'].mm_head.address_of_()
>>> for mm in list_for_each(mm_list):
...     i = i + 1
...
>>> print(i)

So 50% of processes are on the list. For busy machines, the percentage may
be higher. And most big enough processes (with large anon mappings) should
be on the list.

> >> being scanned without any effect, maybe track success separately, and scan
> >> them less frequently etc. That could be ultimately more efficient than
> >> painfully tracking just *eligibility* for scanning in "always" mode?
> >
> > Sounds like we need a couple of different lists, for example, inactive
> > and active? And promote or demote mm(s) between the two lists? TBH I
> > don't see too many benefits at the moment. Or I misunderstood you?
>
> Yeah, something like that. It would of course require finding out whether
> khugepaged is consuming too much cpu uselessly these days while not
> processing fast enough mm's where it succeeds more.

Yeah, currently we don't have such statistics at mm or vma granularity.
But we may be able to get some stats by some post-processing with trace
events.

> >> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
> >> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
> >> in MGLRU, and I probably forgot something else. Maybe time to think about
> >> unifying those scanners?
> >
> > We do have pagewalk (walk_page_range()) which is used by a couple of
> > mm stuff, for example, mlock, mempolicy, mprotect, etc. I'm not sure
> > whether it is feasible for khugepaged, ksm, etc, or not since I didn't
> > look that hard. But I agree it should be worth looking at.
>
> pagewalk is a framework to simplify writing code that processes page tables
> for a given one-off task, yeah. But this would be something a bit different,
> e.g. a kernel thread that does the sum of what khugepaged/ksm/etc do. Numa
> balancing uses task_work instead of kthread so that would require
> consideration on which mechanism the unified daemon would use.

Aha, OK, you mean consolidating khugepaged, ksmd, etc. into one kthread. TBH
I don't know whether that will work out or not. Maybe the first step is to
use the same page table walk framework for all of them?