Message ID | 20220317234827.447799-1-shy828301@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Make khugepaged collapse readonly FS THP more consistent | expand |
> On Mar 17, 2022, at 4:48 PM, Yang Shi <shy828301@gmail.com> wrote: > > > Changelog > v2: * Collected reviewed-by tags from Miaohe Lin. > * Fixed build error for patch 4/8. > > The readonly FS THP relies on khugepaged to collapse THP for suitable > vmas. But it is kind of "random luck" for khugepaged to see the > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when: > - Anon huge pmd page fault > - VMA merge > - MADV_HUGEPAGE > - Shmem mmap > > If the above conditions are not met, even though khugepaged is enabled > it won't see readonly FS vmas at all. MADV_HUGEPAGE could be specified > explicitly to tell khugepaged to collapse this area, but when khugepaged > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE > is not set. > > So make sure readonly FS vmas are registered to khugepaged to make the > behavior more consistent. > > Registering the vmas in mmap path seems more preferred from performance > point of view since page fault path is definitely hot path. > > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches. > The patch 8 converts ext4 and xfs. We may need convert more filesystems, > but I'd like to hear some comments before doing that. > > > Tested with khugepaged test in selftests and the testcase provided by > Vlastimil Babka in https://lore.kernel.org/lkml/df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz/ > by commenting out MADV_HUGEPAGE call. LGTM. For the series: Acked-by: Song Liu <song@kernel.org> > > > b/fs/ext4/file.c | 4 +++ > b/fs/xfs/xfs_file.c | 4 +++ > b/include/linux/huge_mm.h | 9 +++++++ > b/include/linux/khugepaged.h | 69 +++++++++++++++++++++---------------------------------------- > b/include/linux/sched/coredump.h | 3 +- > b/kernel/fork.c | 4 --- > b/mm/huge_memory.c | 15 +++---------- > b/mm/khugepaged.c | 71 ++++++++++++++++++++++++++++++++++++++++++++------------------- > b/mm/shmem.c | 14 +++--------- > 9 files changed, 102 insertions(+), 91 deletions(-) >
On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote: > > Changelog > v2: * Collected reviewed-by tags from Miaohe Lin. > * Fixed build error for patch 4/8. > > The readonly FS THP relies on khugepaged to collapse THP for suitable > vmas. But it is kind of "random luck" for khugepaged to see the > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when: > - Anon huge pmd page fault > - VMA merge > - MADV_HUGEPAGE > - Shmem mmap > > If the above conditions are not met, even though khugepaged is enabled > it won't see readonly FS vmas at all. MADV_HUGEPAGE could be specified > explicitly to tell khugepaged to collapse this area, but when khugepaged > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE > is not set. > > So make sure readonly FS vmas are registered to khugepaged to make the > behavior more consistent. > > Registering the vmas in mmap path seems more preferred from performance > point of view since page fault path is definitely hot path. > > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches. > The patch 8 converts ext4 and xfs. We may need convert more filesystems, > but I'd like to hear some comments before doing that. After reading through the patchset, I have no idea what this is even doing or enabling. I can't comment on the last patch and it's effect on XFS because there's no high level explanation of the functionality or feature to provide me with the context in which I should be reviewing this patchset. I understand this has something to do with hugepages, but there's no explaination of exactly where huge pages are going to be used in the filesystem, what the problems with khugepaged and filesystems are that this apparently solves, what constraints it places on filesystems to enable huge pages to be used, etc. I'm guessing that the result is that we'll suddenly see huge pages in the page cache for some undefined set of files in some undefined set of workloads. But that doesn't help me understand any of the impacts it may have. e.g: - how does this relate to the folio conversion and use of large pages in the page cache? - why do we want two completely separate large page mechanisms in the page cache? - why is this limited to "read only VMAs" and how does the filesystem actually ensure that the VMAs are read only? - what happens if we have a file that huge pages mapped into the page cache via read only VMAs then has write() called on it via a different file descriptor and so we need to dirty the page cache that has huge pages in it? I've got a lot more questions, but to save me having to ask them, how about you explain what this new functionality actually does, why we need to support it, and why it is better than the fully writeable huge page support via folios that we already have in the works... Cheers, Dave.
On Fri, Mar 18, 2022 at 12:29:48PM +1100, Dave Chinner wrote: > On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote: > > > > Changelog > > v2: * Collected reviewed-by tags from Miaohe Lin. > > * Fixed build error for patch 4/8. > > > > The readonly FS THP relies on khugepaged to collapse THP for suitable > > vmas. But it is kind of "random luck" for khugepaged to see the > > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when: > > - Anon huge pmd page fault > > - VMA merge > > - MADV_HUGEPAGE > > - Shmem mmap > > > > If the above conditions are not met, even though khugepaged is enabled > > it won't see readonly FS vmas at all. MADV_HUGEPAGE could be specified > > explicitly to tell khugepaged to collapse this area, but when khugepaged > > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE > > is not set. > > > > So make sure readonly FS vmas are registered to khugepaged to make the > > behavior more consistent. > > > > Registering the vmas in mmap path seems more preferred from performance > > point of view since page fault path is definitely hot path. > > > > > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches. > > The patch 8 converts ext4 and xfs. We may need convert more filesystems, > > but I'd like to hear some comments before doing that. > > After reading through the patchset, I have no idea what this is even > doing or enabling. I can't comment on the last patch and it's effect > on XFS because there's no high level explanation of the > functionality or feature to provide me with the context in which I > should be reviewing this patchset. > > I understand this has something to do with hugepages, but there's no > explaination of exactly where huge pages are going to be used in the > filesystem, what the problems with khugepaged and filesystems are > that this apparently solves, what constraints it places on > filesystems to enable huge pages to be used, etc. > > I'm guessing that the result is that we'll suddenly see huge pages > in the page cache for some undefined set of files in some undefined > set of workloads. But that doesn't help me understand any of the > impacts it may have. e.g: > > - how does this relate to the folio conversion and use of large > pages in the page cache? > - why do we want two completely separate large page mechanisms in > the page cache? > - why is this limited to "read only VMAs" and how does the > filesystem actually ensure that the VMAs are read only? > - what happens if we have a file that huge pages mapped into the > page cache via read only VMAs then has write() called on it via a > different file descriptor and so we need to dirty the page cache > that has huge pages in it? > > I've got a lot more questions, but to save me having to ask them, > how about you explain what this new functionality actually does, why > we need to support it, and why it is better than the fully writeable > huge page support via folios that we already have in the works... Back in Puerto Rico when we set up the THP Cabal, we had two competing approaches for using larger pages in the page cache; mine (which turned into folios after I realised that THPs were the wrong model) and Song Liu's CONFIG_READ_ONLY_THP_FOR_FS. Song's patches were ready earlier (2019) and were helpful in unveiling some of the problems which needed to be fixed. The filesystem never sees the large pages because they're only used for read-only files, and the pages are already Uptodate at the point they're collapsed into a THP. So there's no changes needed to the filesystem. This collection of patches I'm agnostic about. As far as I can tell, they're a way to improve how often the ROTHP feature gets used. That doesn't really interest me since we're so close to having proper support for large pages/folios in filesystems. So I'm not particularly interested in improving a feature that we're about to delete. But I also don't like it that the filesystem now has to do something; the ROTHP feature is supposed to be completely transparent from the point of view of the filesystem.
On Thu, Mar 17, 2022 at 6:29 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote: > > > > Changelog > > v2: * Collected reviewed-by tags from Miaohe Lin. > > * Fixed build error for patch 4/8. > > > > The readonly FS THP relies on khugepaged to collapse THP for suitable > > vmas. But it is kind of "random luck" for khugepaged to see the > > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when: > > - Anon huge pmd page fault > > - VMA merge > > - MADV_HUGEPAGE > > - Shmem mmap > > > > If the above conditions are not met, even though khugepaged is enabled > > it won't see readonly FS vmas at all. MADV_HUGEPAGE could be specified > > explicitly to tell khugepaged to collapse this area, but when khugepaged > > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE > > is not set. > > > > So make sure readonly FS vmas are registered to khugepaged to make the > > behavior more consistent. > > > > Registering the vmas in mmap path seems more preferred from performance > > point of view since page fault path is definitely hot path. > > > > > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches. > > The patch 8 converts ext4 and xfs. We may need convert more filesystems, > > but I'd like to hear some comments before doing that. > > After reading through the patchset, I have no idea what this is even > doing or enabling. I can't comment on the last patch and it's effect > on XFS because there's no high level explanation of the > functionality or feature to provide me with the context in which I > should be reviewing this patchset. > > I understand this has something to do with hugepages, but there's no > explaination of exactly where huge pages are going to be used in the > filesystem, what the problems with khugepaged and filesystems are > that this apparently solves, what constraints it places on > filesystems to enable huge pages to be used, etc. > > I'm guessing that the result is that we'll suddenly see huge pages > in the page cache for some undefined set of files in some undefined > set of workloads. But that doesn't help me understand any of the > impacts it may have. e.g: Song introduced READ_ONLY_THP_FOR_FS back in 2019. It collapses huge pages for readonly executable file mappings to speed up the programs with huge text section. The huge page is allocated/collapsed by khugepaged instead of in page fault path. Vlastimil reported the huge pages are not collapsed consistently since the suitable MMs were not registered by khugepaged consistently as the commit log elaborated. So this patchset makes the behavior of khugepaged (for collapsing readonly file THP) more consistent. > > - how does this relate to the folio conversion and use of large > pages in the page cache? > - why do we want two completely separate large page mechanisms in > the page cache? It has nothing to do with the folio conversion. But once the file THP/huge page is fully supported, the READ_ONLY_THP_FOR_FS may be deprecated. However, making khugepaged collapse file THP more consistently is applicable to full file huge page support as well as long as we still use khugepaged to collapse THP. > - why is this limited to "read only VMAs" and how does the > filesystem actually ensure that the VMAs are read only? It uses the below check: (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) && (vma->vm_flags & VM_EXEC) && !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); This condition was introduced by READ_ONLY_THP_FOR_FS in the first place, not this patchset. > - what happens if we have a file that huge pages mapped into the > page cache via read only VMAs then has write() called on it via a > different file descriptor and so we need to dirty the page cache > that has huge pages in it? Once someone else opens the fd in write mode, the THP will be truncated and khugepaged will backoff IIUC. > > I've got a lot more questions, but to save me having to ask them, > how about you explain what this new functionality actually does, why > we need to support it, and why it is better than the fully writeable > huge page support via folios that we already have in the works... > > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com
On Thu, Mar 17, 2022 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Mar 18, 2022 at 12:29:48PM +1100, Dave Chinner wrote: > > On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote: > > > > > > Changelog > > > v2: * Collected reviewed-by tags from Miaohe Lin. > > > * Fixed build error for patch 4/8. > > > > > > The readonly FS THP relies on khugepaged to collapse THP for suitable > > > vmas. But it is kind of "random luck" for khugepaged to see the > > > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when: > > > - Anon huge pmd page fault > > > - VMA merge > > > - MADV_HUGEPAGE > > > - Shmem mmap > > > > > > If the above conditions are not met, even though khugepaged is enabled > > > it won't see readonly FS vmas at all. MADV_HUGEPAGE could be specified > > > explicitly to tell khugepaged to collapse this area, but when khugepaged > > > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE > > > is not set. > > > > > > So make sure readonly FS vmas are registered to khugepaged to make the > > > behavior more consistent. > > > > > > Registering the vmas in mmap path seems more preferred from performance > > > point of view since page fault path is definitely hot path. > > > > > > > > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches. > > > The patch 8 converts ext4 and xfs. We may need convert more filesystems, > > > but I'd like to hear some comments before doing that. > > > > After reading through the patchset, I have no idea what this is even > > doing or enabling. I can't comment on the last patch and it's effect > > on XFS because there's no high level explanation of the > > functionality or feature to provide me with the context in which I > > should be reviewing this patchset. > > > > I understand this has something to do with hugepages, but there's no > > explaination of exactly where huge pages are going to be used in the > > filesystem, what the problems with khugepaged and filesystems are > > that this apparently solves, what constraints it places on > > filesystems to enable huge pages to be used, etc. > > > > I'm guessing that the result is that we'll suddenly see huge pages > > in the page cache for some undefined set of files in some undefined > > set of workloads. But that doesn't help me understand any of the > > impacts it may have. e.g: > > > > - how does this relate to the folio conversion and use of large > > pages in the page cache? > > - why do we want two completely separate large page mechanisms in > > the page cache? > > - why is this limited to "read only VMAs" and how does the > > filesystem actually ensure that the VMAs are read only? > > - what happens if we have a file that huge pages mapped into the > > page cache via read only VMAs then has write() called on it via a > > different file descriptor and so we need to dirty the page cache > > that has huge pages in it? > > > > I've got a lot more questions, but to save me having to ask them, > > how about you explain what this new functionality actually does, why > > we need to support it, and why it is better than the fully writeable > > huge page support via folios that we already have in the works... > > Back in Puerto Rico when we set up the THP Cabal, we had two competing > approaches for using larger pages in the page cache; mine (which turned > into folios after I realised that THPs were the wrong model) and Song > Liu's CONFIG_READ_ONLY_THP_FOR_FS. Song's patches were ready earlier > (2019) and were helpful in unveiling some of the problems which needed > to be fixed. The filesystem never sees the large pages because they're > only used for read-only files, and the pages are already Uptodate at > the point they're collapsed into a THP. So there's no changes needed > to the filesystem. > > This collection of patches I'm agnostic about. As far as I can > tell, they're a way to improve how often the ROTHP feature gets used. > That doesn't really interest me since we're so close to having proper > support for large pages/folios in filesystems. So I'm not particularly > interested in improving a feature that we're about to delete. But I also > don't like it that the filesystem now has to do something; the ROTHP > feature is supposed to be completely transparent from the point of view > of the filesystem. I agree once page cache huge page is fully supported, READ_ONLY_THP_FOR_FS could be deprecated. But actually this patchset makes khugepaged collapse file THP more consistently. It guarantees the THP could be collapsed as long as file THP is supported and configured properly and there is suitable file vmas, it is not guaranteed by the current code. So it should be useful even though READ_ONLY_THP_FOR_FS is gone IMHO. >
On Fri, Mar 18, 2022 at 11:04:29AM -0700, Yang Shi wrote: > I agree once page cache huge page is fully supported, > READ_ONLY_THP_FOR_FS could be deprecated. But actually this patchset > makes khugepaged collapse file THP more consistently. It guarantees > the THP could be collapsed as long as file THP is supported and > configured properly and there is suitable file vmas, it is not > guaranteed by the current code. So it should be useful even though > READ_ONLY_THP_FOR_FS is gone IMHO. I don't know if it's a good thing or not. Experiments with 64k PAGE_SIZE on arm64 shows some benchmarks improving and others regressing. Just because we _can_ collapse a 2MB range of pages into a single 2MB page doesn't mean we _should_. I suspect the right size folio for any given file will depend on the access pattern. For example, dirtying a few bytes in a folio will result in the entire folio being written back. Is that what you want? Maybe! It may prompt the filesystem to defragment that range, which would be good. On the other hand, if you're bandwidth limited, it may decrease your performance. And if your media has limited write endurance, it may result in your drive wearing out more quickly. Changing the heuristics should come with data. Preferably from a wide range of systems and use cases. I know that's hard to do, but how else can we proceed? And I think you ignored my point that READ_ONLY_THP_FOR_FS required no changes to filesystems. It was completely invisible to them, by design. Now this patchset requires each filesystem to do something. That's not a great step. P.S. khugepaged currently does nothing if a range contains a compound page. It assumes that the page is compound because it's now a THP. Large folios break that assumption, so khugepaged will now never collapse a range which includes large folios. Thanks to commit mm/filemap: Support VM_HUGEPAGE for file mappings we'll always try to bring in PMD-sized pages for MADV_HUGEPAGE, so it _probably_ doesn't matter. But it's something we should watch for as filesystems grow support for large folios.
On Fri, Mar 18, 2022 at 11:48 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Mar 18, 2022 at 11:04:29AM -0700, Yang Shi wrote: > > I agree once page cache huge page is fully supported, > > READ_ONLY_THP_FOR_FS could be deprecated. But actually this patchset > > makes khugepaged collapse file THP more consistently. It guarantees > > the THP could be collapsed as long as file THP is supported and > > configured properly and there is suitable file vmas, it is not > > guaranteed by the current code. So it should be useful even though > > READ_ONLY_THP_FOR_FS is gone IMHO. > > I don't know if it's a good thing or not. Experiments with 64k > PAGE_SIZE on arm64 shows some benchmarks improving and others regressing. > Just because we _can_ collapse a 2MB range of pages into a single 2MB > page doesn't mean we _should_. I suspect the right size folio for any > given file will depend on the access pattern. For example, dirtying a > few bytes in a folio will result in the entire folio being written back. > Is that what you want? Maybe! It may prompt the filesystem to defragment > that range, which would be good. On the other hand, if you're bandwidth > limited, it may decrease your performance. And if your media has limited > write endurance, it may result in your drive wearing out more quickly. > > Changing the heuristics should come with data. Preferably from a wide > range of systems and use cases. I know that's hard to do, but how else > can we proceed? TBH I don't think it belongs to "change the heuristics". Its users' decision if their workloads could benefit from huge pages or not. They could set THP to always/madivse/never per their workloads. The patchset is aimed to fix the misbehavior. The user visible issue is even though users enable READ_ONLY_THP_FOR_FS and configure THP to "always" (khugepaged always runs) and do expect their huge text section is backed by THP but THP may not be collapsed. > > And I think you ignored my point that READ_ONLY_THP_FOR_FS required > no changes to filesystems. It was completely invisible to them, by > design. Now this patchset requires each filesystem to do something. > That's not a great step. I don't mean to ignore your point. I do understand it is not perfect. I was thinking about making it FS agnostic in the first place. But I didn't think of a perfect way to do it at that time, so I followed what tmpfs does. However, by rethinking this we may be able to call khugepaged_enter_file() in filemap_fault(). I was concerned about the overhead in the page fault path. But it may be neglectable since khugepaged_enter_file() does bail out in the first place if the mm is already registered in khugepaged, just the first page fault needs to go through all the check, but the first page fault is typically a major fault so the overhead should be not noticeable comparing to the overhead of I/O. Calling khugepaged_enter() in page fault path is the approach used by anonymous THP too. > > P.S. khugepaged currently does nothing if a range contains a compound > page. It assumes that the page is compound because it's now a THP. > Large folios break that assumption, so khugepaged will now never > collapse a range which includes large folios. Thanks to commit > mm/filemap: Support VM_HUGEPAGE for file mappings > we'll always try to bring in PMD-sized pages for MADV_HUGEPAGE, so > it _probably_ doesn't matter. But it's something we should watch > for as filesystems grow support for large folios. Yeah, I agree, thanks for reminding this. In addition I think the users of READ_ONLY_THP_FOR_FS should also expect the PMD-sized THP to be collapsed for their usecase with full page cache THP support since their benefits come from reduced TLB miss.
On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote: > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches. > The patch 8 converts ext4 and xfs. We may need convert more filesystems, > but I'd like to hear some comments before doing that. Adding a hard-coded call to khugepage_enter_file() in ext4 and xfs, and potentially, each file system, seems kludgy as all heck. Is there any reason not to simply call it in the mm code which calls f_op->mmap()? - Ted
On Wed, Mar 23, 2022 at 6:48 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote: > > > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches. > > The patch 8 converts ext4 and xfs. We may need convert more filesystems, > > but I'd like to hear some comments before doing that. > > Adding a hard-coded call to khugepage_enter_file() in ext4 and xfs, > and potentially, each file system, seems kludgy as all heck. Is there > any reason not to simply call it in the mm code which calls f_op->mmap()? Thanks, Ted. Very good point. I just didn't think of it. I think it is doable. We may be able to clean up the code further. > > - Ted