Message ID: 20240321220802.679544-1-peterx@redhat.com (mailing list archive)
Series: mm/gup: Unify hugetlb, part 2
On Thu, Mar 21, 2024 at 06:07:55PM -0400, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> The hugepd format for GUP is only used on PowerPC with hugetlbfs. There is
> some kernel usage of hugepd (see hugepd_populate_kernel() for PPC_8XX),
> however those pages are not candidates for GUP.
>
> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings") added a check to fail gup-fast if there's potential
> risk of violating GUP over writeback file systems. That should never apply
> to hugepd. Considering that hugepd is an old format (and even
> software-only), there's no plan to extend hugepd into other file-backed
> memory types that are prone to the same issue.
>
> Drop that check, not only because it'll never be true for hugepd per any
> known plan, but also because it paves the way for reusing the function
> outside fast-gup.
>
> To make sure we'll still remember this issue in case hugepd is ever
> extended to support non-hugetlbfs memories, add a rich comment above
> gup_huge_pd() explaining the issue with proper references.
>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
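For readers without the tree at hand: the dropped check guards against long-term writable fast-gup pins on file-backed pages that may undergo writeback; hugetlbfs pages never do, which is why the hugepd path can never trip it. A toy userspace model of that policy (not the kernel implementation; the names below are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the policy behind commit a6e79df92e4a: a long-term
 * writable GUP-fast pin must not target file-backed pages whose
 * filesystem can be under writeback, since the pin may corrupt data
 * being written back. Anonymous and hugetlbfs memory is safe, which
 * is why the check can never fire on the hugepd (hugetlb-only) path. */
enum backing { BACKING_ANON, BACKING_HUGETLBFS, BACKING_WRITEBACK_FS };

static bool longterm_write_pin_allowed(enum backing b,
                                       bool longterm, bool write)
{
	if (!longterm || !write)
		return true;	/* only long-term writable pins are risky */
	return b != BACKING_WRITEBACK_FS;
}
```

Short-term or read-only pins always pass; only the long-term writable combination is refused for writeback-capable file backings.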
On Thu, Mar 21, 2024 at 06:07:50PM -0400, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> v3:
> - Rebased to latest mm-unstable (a824831a082f, of March 21st)
> - Dropped the patch introducing pmd_thp_or_huge(); replaced such uses (and
>   also pXd_huge() users) with pXd_leaf() [Jason]
> - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> - Use IS_ENABLED() in follow_huge_pud() [Jason]
> - Remove redundant none pud check in follow_pud_mask() [Jason]
>
> rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
> v1: https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
> v2: https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com
>
> The series removes the hugetlb slow gup path after a previous refactor work
> [1], so that slow gup now uses the exact same path to process all kinds of
> memory including hugetlb.
>
> For the long term, we may want to remove most, if not all, call sites of
> huge_pte_offset(). It'll be ideal if that API can be completely dropped
> from the arch hugetlb API. This series is one small step towards merging
> hugetlb-specific code into generic mm paths. From that POV, this series
> removes one reference to huge_pte_offset() out of many others.

This remark would be a little easier to understand if you referred to
hugetlb_walk(), not huge_pte_offset() - the latter is only used to
implement hugetlb_walk() and isn't removed by this series, while a
single hugetlb_walk() call was removed.

Regardless, I think the point is spot on; the direction should be to
make the page table reading generic with minimal/no interaction with
hugetlbfs in the generic code.
After this series I would suggest doing the pagewalk.c stuff, as it is
very parallel to GUP slow (indeed it would be amazing to figure out a
way to make GUP slow and pagewalk.c use the same code without a
performance cost).

Some of the other core mm callers don't look too bad either; getting
to the point where hugetlb_walk() is internal to hugetlb.c would be a
nice step of reasonable size.

> One goal of such a route is that we can reconsider merging hugetlb features
> like High Granularity Mapping (HGM). It was not accepted in the past
> because it may add lots of hugetlb-specific code and make the mm code even
> harder to maintain. With a merged codeset, features like HGM can hopefully
> share some code with THP, either legacy (PMD+) or modern (contiguous PTEs).

Yeah, if all the special hugetlb stuff is using generic arch code and
generic page walkers (maybe with that special vma locking hook) it is
much easier to swallow than adding yet another special class of code
to all the page walkers.

> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup. Due to hugepd being a
> software-only solution (no hardware recognizes the hugepd format, so it's a
> purely artificial structure), there's a chance we can merge some or all
> hugepd formats with cont_pte in the future. That question is yet unsettled,
> awaiting an acknowledgement from the Power side.

At a minimum, I think we have a concurrence that the reading of the
hugepd entries should be fine through the standard contig pte APIs, and
so all the page walkers doing read-side operations could stop having
special hugepd code. It is just an implementation of contig pte with
the new property that the size of a PTE can be larger than a PMD
entry.

If, or how much, the hugepd write side remains special is the open
question, I think.
> As of now for this series, I kept the hugepd handling because we may still
> need to do so before getting a clearer picture of the future of hugepd. The
> other reason is simply that we already did it for fast-gup and most of the
> code is still around to be reused. It makes more sense to keep slow/fast
> gup behaving the same before a decision is made to remove hugepd.

Yeah, I think that is right for this series. Let's get this done and
then try to remove the hugepd read side.

Thanks,
Jason
Le 21/03/2024 à 23:07, peterx@redhat.com a écrit :
> From: Peter Xu <peterx@redhat.com>
>
> [...]
>
> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup. Due to hugepd being a
> software-only solution (no hardware recognizes the hugepd format, so it's a
> purely artificial structure), there's a chance we can merge some or all
> hugepd formats with cont_pte in the future. That question is yet unsettled,
> awaiting an acknowledgement from the Power side. As of now for this series,
> I kept the hugepd handling because we may still need to do so before
> getting a clearer picture of the future of hugepd. The other reason is
> simply that we already did it for fast-gup and most of the code is still
> around to be reused. It makes more sense to keep slow/fast gup behaving the
> same before a decision is made to remove hugepd.

It is not true that hugepd is a software-only solution: the PowerPC 8xx
HW matches the hugepd topology for 8M pages.

Christophe
On Fri, Mar 22, 2024 at 01:10:00PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 06:07:50PM -0400, peterx@redhat.com wrote:
> > From: Peter Xu <peterx@redhat.com>
> >
> > [...]
> >
> > For the long term, we may want to remove most, if not all, call sites of
> > huge_pte_offset(). It'll be ideal if that API can be completely dropped
> > from the arch hugetlb API. This series is one small step towards merging
> > hugetlb-specific code into generic mm paths. From that POV, this series
> > removes one reference to huge_pte_offset() out of many others.
>
> This remark would be a little easier to understand if you referred to
> hugetlb_walk(), not huge_pte_offset() - the latter is only used to
> implement hugetlb_walk() and isn't removed by this series, while a
> single hugetlb_walk() call was removed.

Right. Here huge_pte_offset() is the arch API that I hope we can remove;
hugetlb_walk() is simply the wrapper.

> Regardless, I think the point is spot on, the direction should be to
> make the page table reading generic with minimal/no interaction with
> hugetlbfs in the generic code.

Yes, and I also like your terms on calling them "pgtable readers".
It's a better way to describe the difference in that regard between
huge_pte_offset() vs. huge_pte_alloc(). Exactly, that's my goal: we should
start with the "readers". The writers might change semantics when merged,
and can be more challenging; I'll need to look into the details of each
one, like the page fault handlers. Such work may need to be analyzed case
by case, and this GUP part is definitely the low-hanging fruit compared to
the rest.

> After this series I would suggest doing the pagewalk.c stuff as it is
> very parallel to GUP slow (indeed it would be amazing to figure out a
> way to make GUP slow and pagewalk.c use the same code without a
> performance cost)

Yes. I hope there shouldn't be much perf degradation; I can do some more
measurements too when getting there. And btw, IIUC the major challenge of
pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
be that hard if that were the sole purpose. The better way to go is to
remove mm_walk_ops.hugetlb_entry() altogether, which will cause a change
in all callers; that's "the challenge".. pretty much labor work, not a
major technical challenge it seems. Not sure if that's good news or bad..

One thing I'll soon look into is allowing huge mappings for PFNMAP;
there's one request from the VFIO side for MMIO. Dropping
mm_walk_ops.hugetlb_entry() seems to be good for all such purposes; well,
I may need to bite the bullet here.. for either of the purposes to move
on.

> Some of the other core mm callers don't look too bad either, getting
> to the point where hugetlb_walk() is internal to hugetlb.c would be a
> nice step of reasonable size.

Agree.

> > One goal of such a route is that we can reconsider merging hugetlb features
> > like High Granularity Mapping (HGM). It was not accepted in the past
> > because it may add lots of hugetlb specific codes and make the mm code even
> > harder to maintain.
> > With a merged codeset, features like HGM can hopefully share some
> > code with THP, legacy (PMD+) or modern (contiguous PTEs).
>
> Yeah, if all the special hugetlb stuff is using generic arch code and
> generic page walkers (maybe with that special vma locking hook) it is
> much easier to swallow than adding yet another special class of code
> to all the page walkers.
>
> > To make it work, the generic slow gup code will need to at least understand
> > hugepd, which is already done like so in fast-gup. Due to hugepd being a
> > software-only solution (no hardware recognizes the hugepd format, so it's
> > a purely artificial structure), there's a chance we can merge some or all
> > hugepd formats with cont_pte in the future. That question is yet
> > unsettled, awaiting an acknowledgement from the Power side.
>
> At a minimum, I think we have a concurrence that the reading of the
> hugepd entries should be fine through the standard contig pte APIs and
> so all the page walkers doing read side operations could stop having
> special hugepd code. It is just an implementation of contig pte with
> the new property that the size of a PTE can be larger than a PMD
> entry.
>
> If, or how much, the hugepd write side remains special is the open
> question, I think.

It seems balls are rolling in that aspect; I haven't looked in depth at
Christophe's series yet, but it's great to have it started!

> > As of now for this series, I kept the hugepd handling because we may
> > still need to do so before getting a clearer picture of the future of
> > hugepd. The other reason is simply that we already did it for fast-gup
> > and most of the code is still around to be reused. It makes more sense
> > to keep slow/fast gup behaving the same before a decision is made to
> > remove hugepd.
>
> Yeah, I think that is right for this series. Let's get this done and
> then try to remove the hugepd read side.

Thanks a bunch for the reviews.
On Mon, Mar 25, 2024 at 02:58:48PM -0400, Peter Xu wrote:
> > This remark would be a little easier to understand if you referred to
> > hugetlb_walk(), not huge_pte_offset() - the latter is only used to
> > implement hugetlb_walk() and isn't removed by this series, while a
> > single hugetlb_walk() call was removed.
>
> Right. Here huge_pte_offset() is the arch API that I hope we can remove;
> hugetlb_walk() is simply the wrapper.

But arguably hugetlb_walk() is the thing that should go..

In the generic code we should really try to get away from the weird
hugetlb abuse of treating every level as a pte_t. That is not how the
generic page table API works, and that weirdness is part of what motivates
the arch API to supply special functions for reading. Not every arch can
just cast every level to a pte_t and still work. But that weirdness is
also saving a lot of code, so something else needs to be thought up..

> > Regardless, I think the point is spot on, the direction should be to
> > make the page table reading generic with minimal/no interaction with
> > hugetlbfs in the generic code.
>
> Yes, and I also like your terms on calling them "pgtable readers". It's a
> better way to describe the difference in that regard between
> huge_pte_offset() vs. huge_pte_alloc(). Exactly, that's my goal: we
> should start with the "readers".

Yeah, it makes a lot of sense to tackle the readers first - we are pretty
close now to having enough done to have generic readers.

I would imagine tackling everything outside mm/huge*.c to use the normal
page table API for reading. Then consider what to do with the reading in
mm/huge*.c.

> The writers might change semantics when merged, and can be more
> challenging; I'll need to look into the details of each one, like the
> page fault handlers. Such work may need to be analyzed case by case, and
> this GUP part is definitely the low-hanging fruit compared to the rest.
The write side is tricky; I think if the read side is sorted out then it
will be easier to reason about the write side. Today the write side is
paired with the special read side, and it is extra hard to understand if
there is something weird hidden in the arch.

> measurements too when getting there. And btw, IIUC the major challenge of
> pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
> be that hard if that were the sole purpose. The better way to go is to
> remove mm_walk_ops.hugetlb_entry() altogether, which will cause a change
> in all callers; that's "the challenge".. pretty much labor work, not a
> major technical challenge it seems. Not sure if that's good news or bad..

Ugh, that is a big pain. It is relying on that hugetlbfs trick of passing
in a pte that is not a pte to make the API generic..

The more I look at this the more I think we need to get to Matthew's idea
of having some kind of generic page table API that is not tightly tied to
level. Replacing the hugetlb trick of 'everything is a PTE' with 5 special
cases in every place seems just horrible.

  struct mm_walk_ops {
        int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
  };

And many cases really want something like:

  struct mm_walk_state state;

  if (!mm_walk_seek_leaf(state, mm, address))
        goto no_present;
  if (mm_walk_is_write(state)) ..

And detailed walking:

  for_each_pt_leaf(state, mm, address) {
        if (mm_walk_is_write(state)) ..
  }

Replacing it with a mm_walk_state that retains the level, or otherwise
allows decoding any entry, composes a lot better. Forced loop unrolling
can get back to the current code gen in a lot of places.

It also makes the power stuff a bit nicer, as the mm_walk_state could
automatically retain back pointers to the higher levels in the state
struct too...

The puzzle is how to do it and still get reasonably efficient codegen;
many operations are going to end up switching on some state->level to
know how to decode the entry.
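To make the shape of such a level-agnostic walker concrete, here is a toy userspace model (not kernel code; pt_entry, pt_state and pt_walk_seek_leaf() are invented names for illustration): a single seek that can terminate on a leaf at any level and records the level in the walk state, much like the mm_walk_state sketch above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a level-agnostic page-table seek. Each table has FANOUT
 * entries; an entry is either absent, a leaf (possibly a "huge" leaf
 * above the bottom level), or a pointer to the next-level table. */
#define FANOUT     4
#define LEVEL_BITS 2	/* log2(FANOUT) index bits consumed per level */

struct pt_entry {
	int present;
	int is_leaf;
	unsigned long pfn;		/* valid when is_leaf */
	struct pt_entry *table;		/* valid when !is_leaf */
};

struct pt_state {
	int level;			/* level the leaf was found at; 0 = bottom */
	unsigned long pfn;
};

/* Seek the leaf covering `index`; returns 1 and fills `st` if present. */
static int pt_walk_seek_leaf(struct pt_entry *table, int level,
			     unsigned long index, struct pt_state *st)
{
	for (;;) {
		struct pt_entry *e =
			&table[(index >> (LEVEL_BITS * level)) & (FANOUT - 1)];

		if (!e->present)
			return 0;
		if (e->is_leaf) {	/* one exit path for any level */
			st->level = level;
			st->pfn = e->pfn;
			return 1;
		}
		table = e->table;	/* descend one level */
		level--;
	}
}

/* A tiny two-level table: top slot 0 points to a bottom-level table of
 * base-page leaves, top slot 1 is a "huge" leaf found at level 1. */
static struct pt_entry bottom[FANOUT] = {
	[0] = { .present = 1, .is_leaf = 1, .pfn = 100 },
	[2] = { .present = 1, .is_leaf = 1, .pfn = 102 },
};
static struct pt_entry top[FANOUT] = {
	[0] = { .present = 1, .is_leaf = 0, .table = bottom },
	[1] = { .present = 1, .is_leaf = 1, .pfn = 200 },
};

/* Return the level of the leaf covering `index`, or -1 if not present. */
static int demo_leaf_level(unsigned long index)
{
	struct pt_state st;

	if (!pt_walk_seek_leaf(top, 1, index, &st))
		return -1;
	return st.level;
}
```

The point of the design: the caller never picks a per-level lookup helper; it only consults st.level after the fact, which is the composability the proposal is after.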
> One thing I'll soon look into is allowing huge mappings for PFNMAP;
> there's one request from the VFIO side for MMIO. Dropping
> mm_walk_ops.hugetlb_entry() seems to be good for all such purposes; well,
> I may need to bite the bullet here.. for either of the purposes to move
> on.

That would be a nice feature!

> > If, or how much, the hugepd write side remains special is the open
> > question, I think.
>
> It seems balls are rolling in that aspect; I haven't looked in depth at
> Christophe's series yet, but it's great to have it started!

Yes!

Jason
On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> The more I look at this the more I think we need to get to Matthew's
> idea of having some kind of generic page table API that is not tightly
> tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> with 5 special cases in every place seems just horrible.
>
> [...]
>
> The puzzle is how to do it and still get reasonably efficient codegen;
> many operations are going to end up switching on some state->level to
> know how to decode the entry.

These discussions are definitely constructive, thanks Jason. Very helpful.

I thought about this last week but got interrupted. It does make sense to
me; it looks pretty generic and is flexible enough as a top design. At
least that's what I thought.

However, now that I rethink it, and have looked more into the code when I
got the chance, it turns out this would be a major rewrite of almost every
walker.. It doesn't mean that this is a bad idea, but then I'll need to
compare it with the other approach, because there can be a huge difference
in when we can get that code ready, I think. :)

Consider that what we (or.. I) want to teach the pXd layers are two things
right now: (1) hugetlb mappings, and (2) MMIO (PFN) mappings.
That mostly shares the generic concept when working on the mm walkers no
matter which way we go, just with different treatment for different types
of memory. (2) is on top of the current code and new stuff, while (1) is a
refactoring with dropping the hugetlb_entry() hook point as the goal.

Taking the simplest mm walker (smaps) as an example, I think most of the
code is ready thanks to THP's existence, and also things like
vm_normal_page[_pmd]() which should already work even for pfnmaps; the pud
layer is missing but that should be trivial. It means we may have a chance
to drop hugetlb_entry() without a huge overhaul yet.

Now the important question I'm asking myself is: do we really need huge
p4d or even bigger? It's 512GB on x86, and we said "No 512 GiB pages yet"
(commit fe1e8c3e963) since 2017 - that is 7 years without changing this
fact. On non-x86, p4d_leaf() is never defined. Then it's also interesting
to see how much code is "ready" to handle p4d entries (by looking at
p4d_leaf() calls; much easier to see with the removal of the rest of the
huge APIs..) even if none exist.

So, can we over-engineer too much if we go the generic route now?
Considering that we already have most of the pmd/pud entries around in the
mm walker ops, so far it sounds better to leave it for later, until
further justified to be useful. And that won't block it if it's ever
justified to be needed; I'd say it can also be seen as a step forward if I
can manage to remove hugetlb_entry() first.

Comments welcome (before I start to work on anything..).

Thanks,
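As an aside, the 512GB figure comes straight from the radix arithmetic: with 4KiB base pages and 9 index bits per level on x86-64, each page-table level maps 512 times what the level below it maps. A tiny sanity check of that arithmetic:

```c
#include <assert.h>

/* Mapping size of a leaf at a given page-table level on x86-64:
 * 12 bits of page offset plus 9 index bits per level.
 * level 0 = PTE, 1 = PMD, 2 = PUD, 3 = P4D. */
static unsigned long long leaf_map_size(int level)
{
	return 1ULL << (12 + 9 * level);
}
```

So a PTE leaf maps 4KiB, a PMD leaf 2MiB, a PUD leaf 1GiB, and a hypothetical P4D leaf 512GiB.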
On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> > The more I look at this the more I think we need to get to Matthew's
> > idea of having some kind of generic page table API that is not tightly
> > tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> > with 5 special cases in every place seems just horrible.
> >
> > [...]
> >
> > The puzzle is how to do it and still get reasonably efficient codegen;
> > many operations are going to end up switching on some state->level to
> > know how to decode the entry.
>
> These discussions are definitely constructive, thanks Jason. Very helpful.
>
> I thought about this last week but got interrupted. It does make sense to
> me; it looks pretty generic and is flexible enough as a top design. At
> least that's what I thought.

Yeah, exactly..

> However, now that I rethink it, and have looked more into the code when I
> got the chance, it turns out this would be a major rewrite of almost every
> walker..

Indeed, it is why it may not be reasonable.

> Consider that what we (or..
I) want to teach the pXd layers are two things
> right now: (1) hugetlb mappings, and (2) MMIO (PFN) mappings. That mostly
> shares the generic concept when working on the mm walkers no matter which
> way we go, just with different treatment for different types of memory.
> (2) is on top of the current code and new stuff, while (1) is a
> refactoring with dropping the hugetlb_entry() hook point as the goal.

Right, I view this as a two-pronged attack.

On one front you teach the generic pXX_* macros to process huge pages and
push that around to the performance walkers like GUP.

On another front you want to replace use of the hugepte with the new
walkers.

The challenge with the hugepte code is that it is all structured to assume
one API that works at all levels, and that may be a hard fit to replace
with pXX_* functions.

The places that are easy to switch from hugetlb to pXX_* may as well do
so.

Other places may need a hugetlb replacement that has a similar abstraction
level of pointing to any page table level.

I think if you do the easy places for pXX conversion you will have a good
idea about what is needed for the hard places.

> Now the important question I'm asking myself is: do we really need huge
> p4d or even bigger?

Do we need huge p4d support with folios? Probably not..

Huge p4d support for pfnmap, e.g. in VFIO? Yes, I think that is possibly
interesting - but I wouldn't ask anyone to do the work :)

But then again we come back to power and its big list of page sizes and
variety :( Looks like some of them have huge sizes at the pgd level at
least.

> So, can we over-engineer too much if we go the generic route now?

Yes we can, and it will probably be slow as well. The pXX macros are the
most efficient if code can be adapted to use them.

> Considering that we already have most of the pmd/pud entries around in
> the mm walker ops.

Yeah, so you add pgd and maybe p4d, and then we don't need any generic
thing. If it is easy it would be nice.

Jason
On Fri, Apr 05, 2024 at 03:16:33PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> > [...]
>
> Right, I view this as a two-pronged attack.
>
> On one front you teach the generic pXX_* macros to process huge pages and
> push that around to the performance walkers like GUP.
>
> On another front you want to replace use of the hugepte with the new
> walkers.
>
> The challenge with the hugepte code is that it is all structured to
> assume one API that works at all levels, and that may be a hard fit to
> replace with pXX_* functions.

That's the core of the problem; otherwise I feel like I might be doing
something else already. I had a feeling that even if it's currently
efficient for hugetlb, we'll drop that sooner or later. The issue is that
at least the hugetlb pgtable format is exactly the same as the rest, and
as large folios grow it will reach the point where we complain more than
before about hugetlb doing its smart things on its own.

> The places that are easy to switch from hugetlb to pXX_* may as well do
> so.
>
> Other places may need a hugetlb replacement that has a similar
> abstraction level of pointing to any page table level.

IMHO this depends. Per my current knowledge, hugetlb is only special in
three forms:

  - huge mapping (well, this isn't that special..)
  - cont pte/pmd/...
  - hugepd

The fanciest one is actually hugepd.. but I plan to put that temporarily
aside - I haven't looked at Christophe's series yet; however I think we're
going to solve orthogonal issues, and his work will definitely help me in
reaching mine. I want to think things through first on my end to have a
plan.
If hugepd has a chance to be generalized, the worst case is that I'll
leverage CONFIG_ARCH_HAS_HUGEPD and only keep hugetlb_entry() for those
archs until hugepd becomes some cont-pXx variant. Then when I put
HAS_HUGEPD aside, I don't think it's very complicated, perhaps?

In short, hugetlb mappings shouldn't be special compared to other huge pXd
and large folio (cont-pXd) mappings for most of the walkers in my mind, if
not all. I need to look at all the walkers and there can be some tricky
ones, but I believe that applies in general. It's actually similar to what
I did with slow gup here.

Like this series, for cont-pXd we'll need multiple walks compared to
before (when with hugetlb_entry()), but for that part I'll provide some
performance tests too, and we also have a fallback plan, which is to
detect cont-pXd existence, which will also work for large folios.

> I think if you do the easy places for pXX conversion you will have a
> good idea about what is needed for the hard places.

Here IMHO we don't need to understand "what is the size of this hugetlb
vma", or "which level of pgtable this hugetlb vma's pages locate at",
because we may not need that; e.g., when we only want to collect some
smaps statistics. "Whether it's hugetlb" may matter, though. E.g., in the
mm walker we see a huge pmd: it can be a thp, it can be a hugetlb (when
hugetlb_entry is removed); we may need an extra check later to put things
into the right bucket, but for the walker itself it doesn't necessarily
need hugetlb_entry().

> Now the important question I'm asking myself is: do we really need huge
> p4d or even bigger?

> Do we need huge p4d support with folios? Probably not..
>
> Huge p4d support for pfnmap, e.g. in VFIO? Yes, I think that is possibly
> interesting - but I wouldn't ask anyone to do the work :)

Considering it does not yet exist, and we don't have a plan to work it out
(so far), maybe I can see this as a very implicit ack that I can put that
aside, at least for now.
:)

> But then again we come back to power and its big list of page sizes and
> variety :( Looks like some of them have huge sizes at the pgd level at
> least.

Yeah, this is something I want to be super clear on, because I may be
missing something: we don't have real huge pgd pages, right? Powerpc
doesn't even define p4d_leaf(), AFAICT.

  $ git grep p4d_leaf
  arch/powerpc/mm/book3s64/radix_pgtable.c: if (p4d_leaf(*p4d)) {
  arch/powerpc/mm/pgtable.c: if (p4d_leaf(p4d)) {
  arch/powerpc/mm/pgtable_64.c: if (p4d_leaf(p4d)) {
  arch/powerpc/mm/pgtable_64.c: VM_WARN_ON(!p4d_leaf(p4d));
  arch/powerpc/xmon/xmon.c: if (p4d_leaf(*p4dp)) {

They all constantly return false (which is the fallback)?

> > So, can we over-engineer too much if we go the generic route now?
>
> Yes we can, and it will probably be slow as well. The pXX macros are the
> most efficient if code can be adapted to use them.
>
> > Considering that we already have most of the pmd/pud entries around in
> > the mm walker ops.
>
> Yeah, so you add pgd and maybe p4d, and then we don't need any generic
> thing. If it is easy it would be nice.

I want to double-check the above on PowerPC first; otherwise I'd consider
leaving p4d+ alone, if that looks ok. It could be that I missed something
important; Christophe may want to chime in, if he has any thoughts or just
to clarify my mistakes.

Thanks,
On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote: > In short, hugetlb mappings shouldn't be special comparing to other huge pXd > and large folio (cont-pXd) mappings for most of the walkers in my mind, if > not all. I need to look at all the walkers and there can be some tricky > ones, but I believe that applies in general. It's actually similar to what > I did with slow gup here. I think that is the big question, I also haven't done the research to know the answer. At this point focusing on moving what is reasonable to the pXX_* API makes sense to me. Then reviewing what remains and making some decision. > Like this series, for cont-pXd we'll need multiple walks comparing to > before (when with hugetlb_entry()), but for that part I'll provide some > performance tests too, and we also have a fallback plan, which is to detect > cont-pXd existance, which will also work for large folios. I think we can optimize this pretty easy. > > I think if you do the easy places for pXX conversion you will have a > > good idea about what is needed for the hard places. > > Here IMHO we don't need to understand "what is the size of this hugetlb > vma" Yeh, I never really understood why hugetlb was linked to the VMA.. The page table is self describing, obviously. > or "which level of pgtable does this hugetlb vma pages locate", Ditto > because we may not need that, e.g., when we only want to collect some smaps > statistics. "whether it's hugetlb" may matter, though. E.g. in the mm > walker we see a huge pmd, it can be a thp, it can be a hugetlb (when > hugetlb_entry removed), we may need extra check later to put things into > the right bucket, but for the walker itself it doesn't necessarily need > hugetlb_entry(). Right, places may still need to know it is part of a huge VMA because we have special stuff linked to that. 
> > But then again we come back to power and its big list of page sizes > > and variety :( Looks like some there have huge sizes at the pgd level > > at least. > > Yeah this is something I want to be super clear, because I may miss > something: we don't have real pgd pages, right? Powerpc doesn't even > define p4d_leaf(), AFAICT. AFAICT it is because it hides it all in hugepd. If the goal is to purge hugepd then some of the options might turn out to convert hugepd into huge p4d/pgd, as I understand it. It would be nice to have certainty on this at least. We have effectively three APIs to parse a single page table and currently none of the APIs can return 100% of the data for power. Jason
On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote: > On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote: > > In short, hugetlb mappings shouldn't be special comparing to other huge pXd > > and large folio (cont-pXd) mappings for most of the walkers in my mind, if > > not all. I need to look at all the walkers and there can be some tricky > > ones, but I believe that applies in general. It's actually similar to what > > I did with slow gup here. > > I think that is the big question, I also haven't done the research to > know the answer. > > At this point focusing on moving what is reasonable to the pXX_* API > makes sense to me. Then reviewing what remains and making some > decision. > > > Like this series, for cont-pXd we'll need multiple walks comparing to > > before (when with hugetlb_entry()), but for that part I'll provide some > > performance tests too, and we also have a fallback plan, which is to detect > > cont-pXd existance, which will also work for large folios. > > I think we can optimize this pretty easy. > > > > I think if you do the easy places for pXX conversion you will have a > > > good idea about what is needed for the hard places. > > > > Here IMHO we don't need to understand "what is the size of this hugetlb > > vma" > > Yeh, I never really understood why hugetlb was linked to the VMA.. The > page table is self describing, obviously. Attaching to vma still makes sense to me, where we should definitely avoid a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are allocated, managed, ... totally differently. And since hugetlb is designed as file-based (which also makes sense to me, at least for now), it's also natural that it's vma-attached. > > > or "which level of pgtable does this hugetlb vma pages locate", > > Ditto > > > because we may not need that, e.g., when we only want to collect some smaps > > statistics. "whether it's hugetlb" may matter, though. E.g. 
in the mm
> > walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> > hugetlb_entry removed), we may need extra check later to put things into
> > the right bucket, but for the walker itself it doesn't necessarily need
> > hugetlb_entry().
>
> Right, places may still need to know it is part of a huge VMA because we
> have special stuff linked to that.
>
> > > But then again we come back to power and its big list of page sizes
> > > and variety :( Looks like some there have huge sizes at the pgd level
> > > at least.
> >
> > Yeah this is something I want to be super clear, because I may miss
> > something: we don't have real pgd pages, right? Powerpc doesn't even
> > define p4d_leaf(), AFAICT.
>
> AFAICT it is because it hides it all in hugepd.

IMHO one thing we can benefit from such a hugepd rework is: if we can squash all the hugepds like what Christophe does, then we push it one more layer down, and we have a good chance all things should just work.

So again my Power brain is close to zero, but now I'm referring to what Christophe shared in the other thread:

https://github.com/linuxppc/wiki/wiki/Huge-pages

Together with:

https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu

Where it has:

--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,6 +98,7 @@ config PPC_BOOK3S_64
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select HAVE_MOVE_PMD
@@ -290,6 +291,7 @@ config PPC_BOOK3S
 config PPC_E500
 	select FSL_EMB_PERFMON
 	bool
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
 	select PPC_SMP_MUXED_IPI
 	select PPC_DOORBELL

So I think it means we have three PowerPC systems that support hugepd right
now (besides the 8xx, where Christophe is trying to drop support); besides 8xx we still have book3s_64 and E500.

Let's check them one by one:

- book3s_64

  - hash

    - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It
      means after squashing it'll be a bunch of cont-pmd, all good.

    - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
      it'll be cont-pud. all good.

  - radix

    - 64K: largest 1G @pud, then cont-pmd after squashed. all good.

    - 4K: largest 1G @pud, then cont-pmd, all good.

- e500 & 8xx

  - both of them use 2-level pgtables (pgd + pte); after squashing hugepd
    @pgd level they become cont-pte. all good.

I think the trick here is that there'll be no pgd leaves after hugepd squashing to lower levels; then, since PowerPC seems to never have p4d, all things fall into pud or lower. We seem to be all good there?

>
> If the goal is to purge hugepd then some of the options might turn out
> to convert hugepd into huge p4d/pgd, as I understand it. It would be
> nice to have certainty on this at least.

Right. I hope the pmd/pud plan I proposed above can work with such an ambitious goal too. But review is very welcome from either you or Christophe.

PS: I think I'll also have a closer look at Christophe's series this week or next.

>
> We have effectively three APIs to parse a single page table and
> currently none of the APIs can return 100% of the data for power.

Thanks,
Le 10/04/2024 à 17:28, Peter Xu a écrit : > On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote: >> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote: >>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd >>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if >>> not all. I need to look at all the walkers and there can be some tricky >>> ones, but I believe that applies in general. It's actually similar to what >>> I did with slow gup here. >> >> I think that is the big question, I also haven't done the research to >> know the answer. >> >> At this point focusing on moving what is reasonable to the pXX_* API >> makes sense to me. Then reviewing what remains and making some >> decision. >> >>> Like this series, for cont-pXd we'll need multiple walks comparing to >>> before (when with hugetlb_entry()), but for that part I'll provide some >>> performance tests too, and we also have a fallback plan, which is to detect >>> cont-pXd existance, which will also work for large folios. >> >> I think we can optimize this pretty easy. >> >>>> I think if you do the easy places for pXX conversion you will have a >>>> good idea about what is needed for the hard places. >>> >>> Here IMHO we don't need to understand "what is the size of this hugetlb >>> vma" >> >> Yeh, I never really understood why hugetlb was linked to the VMA.. The >> page table is self describing, obviously. > > Attaching to vma still makes sense to me, where we should definitely avoid > a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are > allocated, managed, ... totally differently. > > And since hugetlb is designed as file-based (which also makes sense to me, > at least for now), it's also natural that it's vma-attached. > >> >>> or "which level of pgtable does this hugetlb vma pages locate", >> >> Ditto >> >>> because we may not need that, e.g., when we only want to collect some smaps >>> statistics. 
"whether it's hugetlb" may matter, though. E.g. in the mm >>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when >>> hugetlb_entry removed), we may need extra check later to put things into >>> the right bucket, but for the walker itself it doesn't necessarily need >>> hugetlb_entry(). >> >> Right, places may still need to know it is part of a huge VMA because we >> have special stuff linked to that. >> >>>> But then again we come back to power and its big list of page sizes >>>> and variety :( Looks like some there have huge sizes at the pgd level >>>> at least. >>> >>> Yeah this is something I want to be super clear, because I may miss >>> something: we don't have real pgd pages, right? Powerpc doesn't even >>> define p4d_leaf(), AFAICT. >> >> AFAICT it is because it hides it all in hugepd. > > IMHO one thing we can benefit from such hugepd rework is, if we can squash > all the hugepds like what Christophe does, then we push it one more layer > down, and we have a good chance all things should just work. 
> > So again my Power brain is close to zero, but now I'm referring to what > Christophe shared in the other thread: > > https://github.com/linuxppc/wiki/wiki/Huge-pages > > Together with: > > https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu > > Where it has: > > --- a/arch/powerpc/platforms/Kconfig.cputype > +++ b/arch/powerpc/platforms/Kconfig.cputype > @@ -98,6 +98,7 @@ config PPC_BOOK3S_64 > select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION > select ARCH_ENABLE_SPLIT_PMD_PTLOCK > select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE > select ARCH_SUPPORTS_HUGETLBFS > select ARCH_SUPPORTS_NUMA_BALANCING > select HAVE_MOVE_PMD > @@ -290,6 +291,7 @@ config PPC_BOOK3S > config PPC_E500 > select FSL_EMB_PERFMON > bool > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE > select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64 > select PPC_SMP_MUXED_IPI > select PPC_DOORBELL > > So I think it means we have three PowerPC systems that supports hugepd > right now (besides the 8xx which Christophe is trying to drop support > there), besides 8xx we still have book3s_64 and E500. > > Let's check one by one: > > - book3s_64 > > - hash > > - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It > means after squashing it'll be a bunch of cont-pmd, all good. > > - 4K: p4d also not used, largest pgsize pgd 128G, after squashed > it'll be cont-pud. all good. > > - radix > > - 64K: largest 1G @pud, then cont-pmd after squashed. all good. > > - 4K: largest 1G @pud, then cont-pmd, all good. > > - e500 & 8xx > > - both of them use 2-level pgtables (pgd + pte), after squashed hugepd > @pgd level they become cont-pte. all good. e500 has two modes: 32 bits and 64 bits. For 32 bits: 8xx is the only one handling it through HW-assisted pagetable walk hence requiring a 2-level whatever the pagesize is. 
On e500 it is all software, so pages 2M and larger should be cont-PGD (by the way, I'm a bit puzzled that on arches that have only 2 levels, i.e. PGD and PTE, the PGD entries are populated by a function called PMD_populate()).

Current situation for 8xx is illustrated here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx

I also tried to better illustrate e500/32 here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#e500

For 64 bits:
We have PTE/PMD/PUD/PGD, no P4D

See arch/powerpc/include/asm/nohash/64/pgtable-4k.h

> I think the trick here is there'll be no pgd leaves after hugepd squashing
> to lower levels, then since PowerPC seems to never have p4d, then all
> things fall into pud or lower. We seem to be all good there?
>
>> If the goal is to purge hugepd then some of the options might turn out
>> to convert hugepd into huge p4d/pgd, as I understand it. It would be
>> nice to have certainty on this at least.
>
> Right. I hope the pmd/pud plan I proposed above can already work with
> such an ambitious goal too. But review very welcomed from either you or
> Christophe.
>
> PS: I think I'll also have a closer look at Christophe's series this week
> or next.
>
>> We have effectively three APIs to parse a single page table and
>> currently none of the APIs can return 100% of the data for power.
>
> Thanks,
>
On Wed, Apr 10, 2024 at 04:30:41PM +0000, Christophe Leroy wrote: > > > Le 10/04/2024 à 17:28, Peter Xu a écrit : > > On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote: > >> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote: > >>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd > >>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if > >>> not all. I need to look at all the walkers and there can be some tricky > >>> ones, but I believe that applies in general. It's actually similar to what > >>> I did with slow gup here. > >> > >> I think that is the big question, I also haven't done the research to > >> know the answer. > >> > >> At this point focusing on moving what is reasonable to the pXX_* API > >> makes sense to me. Then reviewing what remains and making some > >> decision. > >> > >>> Like this series, for cont-pXd we'll need multiple walks comparing to > >>> before (when with hugetlb_entry()), but for that part I'll provide some > >>> performance tests too, and we also have a fallback plan, which is to detect > >>> cont-pXd existance, which will also work for large folios. > >> > >> I think we can optimize this pretty easy. > >> > >>>> I think if you do the easy places for pXX conversion you will have a > >>>> good idea about what is needed for the hard places. > >>> > >>> Here IMHO we don't need to understand "what is the size of this hugetlb > >>> vma" > >> > >> Yeh, I never really understood why hugetlb was linked to the VMA.. The > >> page table is self describing, obviously. > > > > Attaching to vma still makes sense to me, where we should definitely avoid > > a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are > > allocated, managed, ... totally differently. > > > > And since hugetlb is designed as file-based (which also makes sense to me, > > at least for now), it's also natural that it's vma-attached. 
> > > >> > >>> or "which level of pgtable does this hugetlb vma pages locate", > >> > >> Ditto > >> > >>> because we may not need that, e.g., when we only want to collect some smaps > >>> statistics. "whether it's hugetlb" may matter, though. E.g. in the mm > >>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when > >>> hugetlb_entry removed), we may need extra check later to put things into > >>> the right bucket, but for the walker itself it doesn't necessarily need > >>> hugetlb_entry(). > >> > >> Right, places may still need to know it is part of a huge VMA because we > >> have special stuff linked to that. > >> > >>>> But then again we come back to power and its big list of page sizes > >>>> and variety :( Looks like some there have huge sizes at the pgd level > >>>> at least. > >>> > >>> Yeah this is something I want to be super clear, because I may miss > >>> something: we don't have real pgd pages, right? Powerpc doesn't even > >>> define p4d_leaf(), AFAICT. > >> > >> AFAICT it is because it hides it all in hugepd. > > > > IMHO one thing we can benefit from such hugepd rework is, if we can squash > > all the hugepds like what Christophe does, then we push it one more layer > > down, and we have a good chance all things should just work. 
> > > > So again my Power brain is close to zero, but now I'm referring to what > > Christophe shared in the other thread: > > > > https://github.com/linuxppc/wiki/wiki/Huge-pages > > > > Together with: > > > > https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu > > > > Where it has: > > > > --- a/arch/powerpc/platforms/Kconfig.cputype > > +++ b/arch/powerpc/platforms/Kconfig.cputype > > @@ -98,6 +98,7 @@ config PPC_BOOK3S_64 > > select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION > > select ARCH_ENABLE_SPLIT_PMD_PTLOCK > > select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE > > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE > > select ARCH_SUPPORTS_HUGETLBFS > > select ARCH_SUPPORTS_NUMA_BALANCING > > select HAVE_MOVE_PMD > > @@ -290,6 +291,7 @@ config PPC_BOOK3S > > config PPC_E500 > > select FSL_EMB_PERFMON > > bool > > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE > > select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64 > > select PPC_SMP_MUXED_IPI > > select PPC_DOORBELL > > > > So I think it means we have three PowerPC systems that supports hugepd > > right now (besides the 8xx which Christophe is trying to drop support > > there), besides 8xx we still have book3s_64 and E500. > > > > Let's check one by one: > > > > - book3s_64 > > > > - hash > > > > - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It > > means after squashing it'll be a bunch of cont-pmd, all good. > > > > - 4K: p4d also not used, largest pgsize pgd 128G, after squashed > > it'll be cont-pud. all good. > > > > - radix > > > > - 64K: largest 1G @pud, then cont-pmd after squashed. all good. > > > > - 4K: largest 1G @pud, then cont-pmd, all good. > > > > - e500 & 8xx > > > > - both of them use 2-level pgtables (pgd + pte), after squashed hugepd > > @pgd level they become cont-pte. all good. > > e500 has two modes: 32 bits and 64 bits. 
>
> For 32 bits:
>
> 8xx is the only one handling it through HW-assisted pagetable walk hence
> requiring a 2-level whatever the pagesize is.

Hmm, I think maybe finally I get it..

I think the confusion came from when I saw there's always such a level-2 table described in Figure 8-5 of the manual:

https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

So I suppose you meant that for 8M, the PowerPC 8xx system hardware will be aware of such an 8M pgtable (from the level-1 entry, where it has bits 28-29 set to 011b), then it won't ever read anything starting from "Level-2 Descriptor 1" (it only reads the single entry "Level-2 Descriptor 0"), so fundamentally the hugepd format must look like that for 8xx?

But then perhaps it's still compatible with cont-pte, because the rest of the entries (pte index 1+) will simply be ignored by the hardware?

> On e500 it is all software so pages 2M and larger should be cont-PGD (by
> the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD
> and PTE, the PGD entries are populated by a function called PMD_populate()).

Yeah.. I am also wondering whether pgd_populate() could also work there (perhaps with some trivial changes, or maybe not even needed..), as when p4d/pud/pmd levels are missing, Linux should just do something like an enforced cast from pgd_t* -> pmd_t* in this case.

I think currently they're already not pgd, as __find_linux_pte() already skips pgd unconditionally:

	pgdp = pgdir + pgd_index(ea);
	p4dp = p4d_offset(pgdp, ea);

> Current situation for 8xx is illustrated here:
> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
>
> I also tried to better illustrate e500/32 here:
> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
>
> For 64 bits:
> We have PTE/PMD/PUD/PGD, no P4D
>
> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h

We don't have anything that is above pud in this category, right?
That's what I read from your wiki (and thanks for providing that in the first place; it helps a lot for me to understand how it works on PowerPC).

I want to make sure I can move on without caring about p4d/pgd leaves like we do right now, even after we can remove hugepd for good; in this case, since p4d is always missing, it's then about whether "pud|pmd|pte_leaf()" can also cover the pgd ones when that day comes, iiuc.

Thanks,

> >
> > I think the trick here is there'll be no pgd leaves after hugepd squashing
> > to lower levels, then since PowerPC seems to never have p4d, then all
> > things fall into pud or lower. We seem to be all good there?
> >
> >>
> >> If the goal is to purge hugepd then some of the options might turn out
> >> to convert hugepd into huge p4d/pgd, as I understand it. It would be
> >> nice to have certainty on this at least.
> >
> > Right. I hope the pmd/pud plan I proposed above can already work with
> > such an ambitious goal too. But review very welcomed from either you or
> > Christophe.
> >
> > PS: I think I'll also have a closer look at Christophe's series this week
> > or next.
> >
> >>
> >> We have effectively three APIs to parse a single page table and
> >> currently none of the APIs can return 100% of the data for power.
> >
> > Thanks,
> >
Le 10/04/2024 à 21:58, Peter Xu a écrit :
>>
>> e500 has two modes: 32 bits and 64 bits.
>>
>> For 32 bits:
>>
>> 8xx is the only one handling it through HW-assisted pagetable walk hence
>> requiring a 2-level whatever the pagesize is.
>
> Hmm I think maybe finally I get it..
>
> I think the confusion came from when I saw there's always such level-2
> table described in Figure 8-5 of the manual:
>
> https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

Yes, indeed that figure is confusing. Table 8-1 gives a pretty good idea of what is required. We only use MD_CTR[TWAM] = 1.

> So I suppose you meant for 8M, the PowerPC 8xx system hardware will be
> aware of such 8M pgtable (from level-1's entry, where it has bit 28-29 set
> 011b), then it won't ever read anything starting from "Level-2 Descriptor
> 1" (but only read the only entry "Level-2 Descriptor 0"), so fundamentally
> hugepd format must look like such for 8xx?
>
> But then perhaps it's still compatible with cont-pte because the rest
> entries (pte index 1+) will simply be ignored by the hardware?

Yes, still compatible with CONT-PTE, although things become tricky because you need two page tables to get the full 8M, so that's a kind of cont-PMD down to PTE level, as you can see in my RFC series.

>
>> On e500 it is all software so pages 2M and larger should be cont-PGD (by
>> the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD
>> and PTE, the PGD entries are populated by a function called PMD_populate()).
>
> Yeah.. I am also wondering whether pgd_populate() could also work there
> (perhaps with some trivial changes, or maybe not even needed..), as when
> p4d/pud/pmd levels are missing, linux should just do something like an
> enforced cast from pgd_t* -> pmd_t* in this case.
>
> I think currently they're already not pgd, as __find_linux_pte() already
> skipped pgd unconditionally:
>
> 	pgdp = pgdir + pgd_index(ea);
> 	p4dp = p4d_offset(pgdp, ea);
>

Yes, that's what is confusing: some parts of the code consider we have only a PGD and a PT, while other parts consider we have only a PMD and a PT.

>>
>> Current situation for 8xx is illustrated here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
>>
>> I also tried to better illustrate e500/32 here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
>>
>> For 64 bits:
>> We have PTE/PMD/PUD/PGD, no P4D
>>
>> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h
>
> We don't have anything that is above pud in this category, right? That's
> what I read from your wiki (and thanks for providing that in the first
> place; helps a lot for me to understand how it works on PowerPC).

Yes, thanks to Michael and Aneesh who initiated that Wiki page.

> I want to make sure if I can move on without caring on p4d/pgd leaves like
> what we do right now, even after we can remove hugepd for good, in this
> case since p4d is always missing, then it's about whether "pud|pmd|pte_leaf()"
> can also cover the pgd ones when that day comes, iiuc.

I guess so, but I'd like Aneesh and/or Michael to confirm as I'm not an expert on PPC64.

Christophe
From: Peter Xu <peterx@redhat.com>

v3:
- Rebased to latest mm-unstable (a824831a082f, of March 21st)
- Dropped the patch to introduce pmd_thp_or_huge(); replace such uses (and
  also pXd_huge() users) with pXd_leaf() [Jason]
- Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
- Use IS_ENABLED() in follow_huge_pud() [Jason]
- Remove redundant none pud check in follow_pud_mask() [Jason]

rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
v1:  https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
v2:  https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com

The series removes the hugetlb slow gup path after a previous refactor work [1], so that slow gup now uses the exact same path to process all kinds of memory, including hugetlb.

For the long term, we may want to remove most, if not all, call sites of huge_pte_offset(). It'll be ideal if that API can be completely dropped from the arch hugetlb API. This series is one small step towards merging hugetlb-specific code into the generic mm paths. From that POV, this series removes one reference to huge_pte_offset() out of many others.

One goal of such a route is that we can reconsider merging hugetlb features like High Granularity Mapping (HGM). It was not accepted in the past because it may add lots of hugetlb-specific code and make the mm code even harder to maintain. With a merged codeset, features like HGM can hopefully share some code with THP, whether legacy (PMD+) or modern (contiguous PTEs).

To make it work, the generic slow gup code will need to at least understand hugepd, which is already done like so in fast-gup. Since hugepd is a software-only solution (no hardware recognizes the hugepd format, so it's a purely artificial structure), there's a chance we can merge some or all hugepd formats with cont_pte in the future. That question is still unsettled, pending an acknowledgement from the Power side.
As of now for this series, I kept the hugepd handling because we may still need to do so before getting a clearer picture of the future of hugepd. The other reason is simply that we did it already for fast-gup and most of the code is still around to be reused. It'll make more sense to keep slow/fast gup behaving the same before a decision is made to remove hugepd.

There's one major difference for slow-gup on cont_pte / cont_pmd handling, currently supported on three architectures (aarch64, riscv, ppc). Before the series, slow gup was able to recognize e.g. cont_pte entries with the help of huge_pte_offset() when an hstate is around. Now that's gone, but it still works, by looking up the pgtable entries one by one.

It's not ideal, but hopefully this change should not yet affect major workloads. There's some more information in the commit message of the last patch. If this turns out to be a concern, we can consider teaching slow gup to recognize cont pte/pmd entries, and that should recover the lost performance. But I doubt its necessity for now, so I kept it as simple as it can be.

Test Done
=========

For x86_64, tested the full gup_test matrix over 2MB huge pages. For aarch64, tested the same over 64KB cont_pte huge pages.

One note is that this v3 didn't go through any ppc test anymore, as finding such a system always takes time. It's based on the fact that it was tested in previous versions, and this version should have zero change regarding the hugepd sections. If anyone (Christophe?) wants to give it a shot on PowerPC, please do and I would appreciate it: "./run_vmtests.sh -a -t gup_test" should do well enough (please consider [2] applied if hugepd is <1MB), as long as we're sure the hugepd pages are touched as expected.
Patch layout
=============

Patch 1-8:   Preparation works, or cleanups in relevant code paths
Patch 9-11:  Teach slow gup with all kinds of huge entries (pXd, hugepd)
Patch 12:    Drop hugetlb_follow_page_mask()

More information can be found in the commit messages of each patch. Any comments are welcome. Thanks.

[1] https://lore.kernel.org/all/20230628215310.73782-1-peterx@redhat.com
[2] https://lore.kernel.org/r/20240321215047.678172-1-peterx@redhat.com

Peter Xu (12):
  mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
  mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static
  mm: Make HPAGE_PXD_* macros even if !THP
  mm: Introduce vma_pgtable_walk_{begin|end}()
  mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
  mm/gup: Refactor record_subpages() to find 1st small page
  mm/gup: Handle hugetlb for no_page_table()
  mm/gup: Cache *pudp in follow_pud_mask()
  mm/gup: Handle huge pud for follow_pud_mask()
  mm/gup: Handle huge pmd for follow_pmd_mask()
  mm/gup: Handle hugepd for follow_page()
  mm/gup: Handle hugetlb in the generic follow_page_mask code

 include/linux/huge_mm.h |  25 +--
 include/linux/hugetlb.h |  16 +-
 include/linux/mm.h      |   3 +
 mm/Kconfig              |   6 +
 mm/gup.c                | 354 +++++++++++++++++++++++++++++++++-------
 mm/huge_memory.c        | 133 +---------------
 mm/hugetlb.c            |  75 +--------
 mm/internal.h           |   7 +-
 mm/memory.c             |  12 ++
 9 files changed, 337 insertions(+), 294 deletions(-)