Message ID: 20230929114421.3761121-1-ryan.roberts@arm.com (mailing list archive)
Series: variable-order, large folios for anonymous memory
On 29.09.23 13:44, Ryan Roberts wrote:
> Hi All,

Let me highlight some core decisions on the things discussed in the previous alignment meetings, and comment on them.

> This is v6 of a series to implement variable order, large folios for anonymous memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO", "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults:

Change number 1: Let's call these things THP.

Fine with me; I previously rooted for that but was told that end users could be confused. I think the important bit is that we don't mess up the stats, and when we talk about THP we default to "PMD-sized THP", unless we explicitly include the other ones.

I dislike exposing "orders" to the users; I'm happy to be convinced why I am wrong and it is a good idea.

So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as said, I think FreeBSD tends to call it "Medium-sized superpages". But what's small/medium/large is debatable. "Small" implies at least that it's smaller than what we used to know, which is a fact.

Can we also now use the terminology consistently? (e.g., "variable-order, large folios for anonymous memory" -> "Small-sized anonymous THP"; you can just point at the previous patch set name in the cover letter)

> 1) Since SW (the kernel) is dealing with larger chunks of memory than base pages, there are efficiency savings to be had; fewer page faults, batched PTE and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take advantage of HW TLB compression techniques. A reduction in TLB pressure speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> The major change in this revision is the addition of sysfs controls to allow this "small-order THP" to be enabled/disabled/configured independently of PMD-order THP. The approach I've taken differs a bit from previous discussions; instead of creating a whole new interface ("large_folio"), I'm extending THP. I personally think this makes things clearer and more extensible. See [6] for detailed rationale.

Change 2: sysfs interface.

If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I agree.

What we expose there and how, is TBD. Again, not a friend of "orders" and bitmaps at all. We can do better if we want to go down that path.

Maybe we should take a look at hugetlb, and how they added support for multiple sizes. What *might* make sense could be (depending on which values we actually support!):

/sys/kernel/mm/transparent_hugepage/hugepages-64kB/
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/

Each one would contain an "enabled" and "defrag" file. We want something minimal first? Start with the "enabled" option.

enabled: always [global] madvise never

Initially, we would set it for PMD-sized THP to "global" and for everything else to "never". That sounds reasonable at least to me, and we would be using what we learned from THP (as John suggested). That still gives reasonable flexibility without going too wild, and a better interface IMHO.

I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long as we can have something that won't hurt us later and still be able to use this in distributions within a reasonable timeframe. Enabling/disabling individual sizes does not sound too restrictive to me. And we could always add an "auto" setting later and default to that with a new kconfig knob.
If someone wants to configure it, why not. Let's just prepare a way to handle this "better" automatically in the future (if ever ...).

Change 3: Stats

> /proc/meminfo:
>   Introduce new "AnonHugePteMap" field, which reports the amount of
>   memory (in KiB) mapped from large folios globally (similar to
>   AnonHugePages field).

AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think we all agree on that. It should have been named "AnonPmdMapped" or "AnonHugePmdMapped"; too bad, we can't change that. "AnonHugePteMap" better be "AnonHugePteMapped".

But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense, because !PMD implied !performance, and one might have been able to troubleshoot that somehow. For PTE-mapped, it doesn't make much sense really; they are always PTE-mapped.

That also raises the question of how you would account a PTE-mapped THP. The whole thing? Only the parts that are mapped? Let's better not go down that path.

That leaves the question of why we would want to include them here at all in a special PTE-mapped way.

Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         1050624 kB

-> Only the last one gives the sum; the other stats don't even mention the other ones. [how do we get their stats, if at all?]

So maybe we only want a summary of how many anon huge pages of any size are allocated (independent of the PTE vs. PMD mapping), and some other source to eventually inspect how the different sizes behave.

But note that for non-PMD-sized file THP we don't even have special counters! ... so maybe we should also defer any such stats and come up with something uniform for all types of non-PMD-sized THP.

Same discussion applies to all other stats.
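As a sanity check on the hugetlb numbers above: "Hugetlb" is the only field that sums across all sizes, which is why it does not equal HugePages_Total * Hugepagesize. A quick sketch of the arithmetic (my illustration, nothing kernel-specific):

```python
# The hugetlb meminfo sample above: one 1 GiB page plus one 2 MiB page.
# Only the "Hugetlb" field sums across *all* sizes; HugePages_Total and
# Hugepagesize describe the default size (2 MiB here) only.
KIB_PER_GIB = 1024 * 1024   # 1 GiB in kB
KIB_PER_2MIB = 2048         # 2 MiB in kB

total_kib = 1 * KIB_PER_GIB + 1 * KIB_PER_2MIB
print(total_kib)  # 1050624 -- matches the "Hugetlb: 1050624 kB" line
```

So the per-size allocations are only visible elsewhere (the hugetlb sysfs tree), not in /proc/meminfo itself.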
> Because we now have runtime enable/disable control, I've removed the compile time Kconfig switch. It still defaults to runtime-disabled.
>
> NOTE: These changes should not be merged until the prerequisites are complete. These are in progress and tracked at [7].

We should probably list them here, and classify which ones we see as a strict requirement and which ones might be an optimization.

Now, these are just my thoughts, and I'm happy about other thoughts.
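To make the size/order relationship in the sysfs proposal concrete (my illustration, not part of the proposal itself; assumes 4 KiB base pages): each per-size directory name corresponds to a folio order via size = PAGE_SIZE << order, which is also why sizes are friendlier to users than raw orders:

```python
PAGE_SIZE_KB = 4  # assumption: 4 KiB base pages

def sysfs_dir(order: int) -> str:
    """Name of the proposed per-size directory for a given folio order."""
    size_kb = PAGE_SIZE_KB << order
    return f"/sys/kernel/mm/transparent_hugepage/hugepages-{size_kb}kB"

# Orders 4..9 reproduce the six directories listed in the thread;
# order 9 (2048 kB) is PMD-sized THP on this configuration.
for order in range(4, 10):
    print(sysfs_dir(order))
```

The same order maps to a different size on 16K/64K base-page kernels, which is another argument for naming the directories by size rather than by order.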
On 06/10/2023 21:06, David Hildenbrand wrote:
> On 29.09.23 13:44, Ryan Roberts wrote:
>> Hi All,
>
> Let me highlight some core decisions on the things discussed in the previous alignment meetings, and comment on them.
>
>> This is v6 of a series to implement variable order, large folios for anonymous memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO", "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults:
>
> Change number 1: Let's call these things THP.
>
> Fine with me; I previously rooted for that but was told that end users could be confused. I think the important bit is that we don't mess up the stats, and when we talk about THP we default to "PMD-sized THP", unless we explicitly include the other ones.
>
> I dislike exposing "orders" to the users; I'm happy to be convinced why I am wrong and it is a good idea.
>
> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as said, I think FreeBSD tends to call it "Medium-sized superpages". But what's small/medium/large is debatable. "Small" implies at least that it's smaller than what we used to know, which is a fact.
>
> Can we also now use the terminology consistently? (e.g., "variable-order, large folios for anonymous memory" -> "Small-sized anonymous THP"; you can just point at the previous patch set name in the cover letter)

Yes, absolutely. FWIW, I was deliberately not changing the title of the patchset so people could easily see it was an evolution of something posted before. But if it's the norm to change the title as the patchset evolves, I'm very happy to do that. And there are other places too, in commit logs, that I can tidy up.
I will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP" (that last one slightly different from what David suggested above - it means "small-sized THP" can still be grepped) unless others object.

>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base pages, there are efficiency savings to be had; fewer page faults, batched PTE and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take advantage of HW TLB compression techniques. A reduction in TLB pressure speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> The major change in this revision is the addition of sysfs controls to allow this "small-order THP" to be enabled/disabled/configured independently of PMD-order THP. The approach I've taken differs a bit from previous discussions; instead of creating a whole new interface ("large_folio"), I'm extending THP. I personally think this makes things clearer and more extensible. See [6] for detailed rationale.
>
> Change 2: sysfs interface.
>
> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I agree.
>
> What we expose there and how, is TBD. Again, not a friend of "orders" and bitmaps at all. We can do better if we want to go down that path.
>
> Maybe we should take a look at hugetlb, and how they added support for multiple sizes. What *might* make sense could be (depending on which values we actually support!):
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>
> Each one would contain an "enabled" and "defrag" file. We want something minimal first? Start with the "enabled" option.
>
> enabled: always [global] madvise never
>
> Initially, we would set it for PMD-sized THP to "global" and for everything else to "never".

My only reservation about this approach is the potential for a future need for a "one setting applied across all sizes" class of control (e.g. "auto"). I think we agreed in the previous meetings that chasing a solution for "auto" was a good aspiration to have, so it would be good to have a place where we can insert that in future. The main reason why I chose to expose the "anon_orders" control is because it is possible to both enable/disable the various sizes as well as specify (e.g.) "auto", without creating redundancy. But I agree that ideally we wouldn't expose orders to the user; I was attempting a compromise to simplify the "auto" case.

A potential (though feels quite complex) solution to make auto work with your proposal: Add "auto" as an option to the existing global enabled file, and to all of your proposed new enabled files. But it's only possible to *set* auto through the global file. And when it is set, all of the size-specific enabled files read back "auto" too. And any writes to the size-specific enabled files are ignored (or remembered but not enacted) until the global enabled file is changed away from auto.

But I'm not sure if adding a new option to the global enabled file might break compat?

> That sounds reasonable at least to me, and we would be using what we learned from THP (as John suggested).
> That still gives reasonable flexibility without going too wild, and a better interface IMHO.
>
> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long as we can have something that won't hurt us later and still be able to use this in distributions within a reasonable timeframe. Enabling/disabling individual sizes does not sound too restrictive to me. And we could always add an "auto" setting later and default to that with a new kconfig knob.
>
> If someone wants to configure it, why not. Let's just prepare a way to handle this "better" automatically in the future (if ever ...).
>
> Change 3: Stats
>
>> /proc/meminfo:
>>   Introduce new "AnonHugePteMap" field, which reports the amount of
>>   memory (in KiB) mapped from large folios globally (similar to
>>   AnonHugePages field).
>
> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think we all agree on that. It should have been named "AnonPmdMapped" or "AnonHugePmdMapped"; too bad, we can't change that.

Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and "AnonHugePteMapped", but I think that would likely break things. It's further complicated because vmstats prints it in PMD-size units, so it can't represent PTE-mapped memory in that counter.

> "AnonHugePteMap" better be "AnonHugePteMapped".

I agree, but I went with the shorter one because any longer and it would unalign the value, e.g.:

AnonHugePages:         0 kB
AnonHugePteMapped:     0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB

So we would need to decide which is preferable, or come up with a shorter name.

> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense, because !PMD implied !performance, and one might have been able to troubleshoot that somehow.
> For PTE-mapped, it doesn't make much sense really; they are always PTE-mapped.

I disagree; I've been using it a lot to debug performance issues. It tells you how much of your anon memory is allocated with large folios. And making that percentage bigger improves performance; fewer page faults, and with a separate contpte series on arm64, better use of the TLB. Reasons might include: poorly aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

I would actually argue for adding similar counters for file-backed memory too for the same reasons. (I actually posted an independent patch a while back that did this for file- and anon- memory, bucketed by size. But I think the idea of the bucketing was NAKed.)

> That also raises the question of how you would account a PTE-mapped THP. The whole thing? Only the parts that are mapped? Let's better not go down that path.

The approach I've taken in this series is the simple one - account every page that belongs to a large folio from when it is first mapped to last unmapped. Yes, in this case, you might not actually be mapping the full thing contiguously. But it gives a good indication.

I also considered accounting the whole folio only when all of its pages become mapped (although not worrying about them all being contiguous). That's still simple to implement for all counters except smaps. So I went with the simplest approach with the view that it's "good enough".

> That leaves the question of why we would want to include them here at all in a special PTE-mapped way.
>
> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>
> HugePages_Total:       1
> HugePages_Free:        1
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> Hugetlb:         1050624 kB
>
> -> Only the last one gives the sum; the other stats don't even mention the other ones. [how do we get their stats, if at all?]
There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages, surplus_hugepages. But this interface also constitutes the allocator, not just stats, I think.

> So maybe we only want a summary of how many anon huge pages of any size are allocated (independent of the PTE vs. PMD mapping),

Are you proposing (AnonHugePages + AnonHugePteMapped) here, or something else? If the former, then I don't really see the difference. We have to continue to expose PMD-size (AnonHugePages). So either add a PTE-only counter and derive the total, or add a total counter and derive PTE-only. I suspect I've misunderstood your point.

> and some other source to eventually inspect how the different sizes behave.
>
> But note that for non-PMD-sized file THP we don't even have special counters! ... so maybe we should also defer any such stats and come up with something uniform for all types of non-PMD-sized THP.

Indeed, I can see benefit in adding these for file THP - in fact I have a patch that does exactly that to help my development work. I had envisaged that we could add something like FileHugePteMapped, ShmemHugePteMapped that would follow the same semantics as AnonHugePteMapped.

> Same discussion applies to all other stats.
>
>> Because we now have runtime enable/disable control, I've removed the compile time Kconfig switch. It still defaults to runtime-disabled.
>>
>> NOTE: These changes should not be merged until the prerequisites are complete. These are in progress and tracked at [7].
>
> We should probably list them here, and classify which ones we see as a strict requirement and which ones might be an optimization.

I'll need some help with classifying them, so I'm showing my working. The final list that I would propose as strict requirements is at the bottom.
This is my list with status, as per my response to Yu in the other thread:

- David is working on "shared vs exclusive mappings"
- Zi Yan has posted an RFC for compaction
- Yin Fengwei's mlock series is now in mm-stable
- Yin Fengwei's madvise series is in 6.6
- I've reworked and posted a series for deferred_split_folio; although I've deprioritized it because Yu said it wasn't really a prerequisite.
- numa balancing depends on David's "shared vs exclusive mappings" work
- I've started looking at "large folios in swap cache" in the background, because I'm seeing some slow down with large folios, but we also agreed that wasn't a prerequisite

Although, since sending that, I've determined that when running kernel compilation across a high number of cores on arm64, the cost of splitting the folios gets large due to needing to broadcast the extra TLBIs. So I think the last point on that list may be a prerequisite after all. (I've been able to fix this by adding support for allocating large folios in the swap file, and avoiding the split - planning to send an RFC this week.)

There is also this set of things that you mentioned against "shared vs exclusive mappings", which I'm not sure if you are planning to cover as part of your work or if they are follow-on things that will need to be done:

(1) Detecting shared folios, to not mess with them while they are shared.
    MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
    replace cases where folio_estimated_sharers() == 1 would currently be the
    best we can do (and in some cases, page_mapcount() == 1).

And I recently discovered that khugepaged doesn't collapse file-backed pages to a PMD-size THP if they belong to a large folio, so I'm guessing it may also suffer the same behaviour for anon memory. I'm not sure if that's what your "khugepaged ..." comment refers to?
So taking all that and trying to put together a complete outstanding list for strict requirements:

- Shared vs Exclusive Mappings (DavidH)
- user-triggered page migration
- NUMA hinting/balancing
- Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
- Compaction of Large Folios (Zi Yan)
- Swap out small-size THP without Split (Ryan Roberts)

> Now, these are just my thoughts, and I'm happy about other thoughts.

As always, thanks for taking the time - I really appreciate it.

Thanks,
Ryan
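The two accounting options Ryan weighs earlier in the thread (count every page of a large folio from first map to last unmap, vs. only once all of its pages are mapped) can be contrasted in a toy model -- purely illustrative, not kernel code, with hypothetical function names:

```python
def accounted_kb_simple(folio_pages: int, mapped_pages: int, page_kb: int = 4) -> int:
    """Option taken in the series: once any page of the folio is mapped,
    account the whole folio until the last page is unmapped."""
    return folio_pages * page_kb if mapped_pages > 0 else 0

def accounted_kb_strict(folio_pages: int, mapped_pages: int, page_kb: int = 4) -> int:
    """Alternative considered: account the folio only when all of its pages
    are mapped (contiguity not required)."""
    return folio_pages * page_kb if mapped_pages == folio_pages else 0

# A 64 KiB folio (16 x 4 KiB pages) with only 3 pages currently mapped:
print(accounted_kb_simple(16, 3))  # 64 -> counted in full
print(accounted_kb_strict(16, 3))  # 0  -> not counted yet
```

The first variant can over-report (a partially mapped folio counts in full); the second under-reports until the last page faults in, and is harder to do for smaps. Neither captures whether the mapping is contiguous, which is David's point about "PTE-mapped" being only part of the story.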
[...]

>> I dislike exposing "orders" to the users; I'm happy to be convinced why I am wrong and it is a good idea.
>>
>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as said, I think FreeBSD tends to call it "Medium-sized superpages". But what's small/medium/large is debatable. "Small" implies at least that it's smaller than what we used to know, which is a fact.
>>
>> Can we also now use the terminology consistently? (e.g., "variable-order, large folios for anonymous memory" -> "Small-sized anonymous THP", you can just point at the previous patch set name in the cover letter)
>
> Yes, absolutely. FWIW, I was deliberately not changing the title of the patchset so people could easily see it was an evolution of something posted before. But if it's the norm to change the title as the patchset evolves, I'm very happy to do that. And there are other places too, in commit logs, that I can tidy up. I will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP" (that last one slightly different from what David suggested above - it means "small-sized THP" can still be grepped) unless others object.

Absolutely fine with me. Hoping other people will object when I talk nonsense or my suggestions don't make any sense. Or even better, propose something better :)

>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base pages, there are efficiency savings to be had; fewer page faults, batched PTE and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel overhead. This should benefit all architectures.
>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take advantage of HW TLB compression techniques. A reduction in TLB pressure speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>> The major change in this revision is the addition of sysfs controls to allow this "small-order THP" to be enabled/disabled/configured independently of PMD-order THP. The approach I've taken differs a bit from previous discussions; instead of creating a whole new interface ("large_folio"), I'm extending THP. I personally think this makes things clearer and more extensible. See [6] for detailed rationale.
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple sizes. What *might* make sense could be (depending on which values we actually support!):
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal first? Start with the "enabled" option.
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else to "never".
>
> My only reservation about this approach is the potential for a future need for a "one setting applied across all sizes" class of control (e.g. "auto"). I think we agreed in the previous meetings that chasing a solution for "auto" was a good aspiration to have, so it would be good to have a place where we can insert that in future.
> The main reason why I chose to expose the "anon_orders" control is because it is possible to both enable/disable the various sizes as well as specify (e.g.) "auto", without creating redundancy. But I agree that ideally we wouldn't expose orders to the user; I was attempting a compromise to simplify the "auto" case.
>
> A potential (though feels quite complex) solution to make auto work with your proposal: Add "auto" as an option to the existing global enabled file, and to all of your proposed new enabled files. But it's only possible to *set* auto through the global file. And when it is set, all of the size-specific enabled files read back "auto" too. And any writes to the size-specific enabled files are ignored (or remembered but not enacted) until the global enabled file is changed away from auto.

Yes, I think there are various ways forward regarding that. Or to enable "auto" mode only once all are "auto", and as soon as one is not "auto", just disable it. A simple

echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

would do to enable it. Or, have them all be "global" and have a global "auto" mode as you raised.

echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled

> But I'm not sure if adding a new option to the global enabled file might break compat?

I think we used to extend the "defrag" option, see

commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
Author: David Rientjes <rientjes@google.com>
Date:   Wed Feb 22 15:45:49 2017 -0800

    mm, thp: add new defer+madvise defrag option

So I suspect we could extend that one in a similar way.

But again, this is just the thing that came to mind when thinking about how to:
a) avoid orders
b) make it configurable and future-proof
c) make it look a bit consistent with other interfaces (hugetlb and existing thp)
d) still prepare for an auto mode that we want in the future

I'm happy to hear other ideas.
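To pin down the behaviour being debated, here is a toy state model of Ryan's "global auto overrides the per-size files" variant -- hypothetical semantics for illustration only, not an implementation of any merged interface:

```python
class ThpSysfsModel:
    """Toy model: "auto" can only be *set* via the global enabled file;
    while it is set, per-size files read back "auto" and writes to them
    are remembered but not enacted (one of Ryan's two suggested options)."""

    SIZE_OPTIONS = {"always", "global", "madvise", "never"}

    def __init__(self, sizes):
        self.global_enabled = "madvise"
        self._per_size = {s: "never" for s in sizes}  # David's proposed default

    def write_global(self, value):
        self.global_enabled = value   # "auto" is allowed here only

    def write_size(self, size, value):
        if value not in self.SIZE_OPTIONS:
            raise ValueError(value)   # "auto" rejected on per-size files
        self._per_size[size] = value  # remembered even while global is "auto"

    def read_size(self, size):
        return "auto" if self.global_enabled == "auto" else self._per_size[size]

m = ThpSysfsModel(["64kB", "2048kB"])
m.write_global("auto")
m.write_size("64kB", "madvise")   # remembered, not enacted
print(m.read_size("64kB"))        # auto
m.write_global("madvise")
print(m.read_size("64kB"))        # madvise
```

David's alternative ("auto only once all per-size files are auto") would instead derive the global mode from the per-size values; the model makes it easy to see that the two differ mainly in where the authoritative state lives.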
>> That sounds reasonable at least to me, and we would be using what we learned from THP (as John suggested). That still gives reasonable flexibility without going too wild, and a better interface IMHO.
>>
>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long as we can have something that won't hurt us later and still be able to use this in distributions within a reasonable timeframe. Enabling/disabling individual sizes does not sound too restrictive to me. And we could always add an "auto" setting later and default to that with a new kconfig knob.
>>
>> If someone wants to configure it, why not. Let's just prepare a way to handle this "better" automatically in the future (if ever ...).
>>
>> Change 3: Stats
>>
>>> /proc/meminfo:
>>>   Introduce new "AnonHugePteMap" field, which reports the amount of
>>>   memory (in KiB) mapped from large folios globally (similar to
>>>   AnonHugePages field).
>>
>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think we all agree on that. It should have been named "AnonPmdMapped" or "AnonHugePmdMapped"; too bad, we can't change that.
>
> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and "AnonHugePteMapped", but I think that would likely break things. It's further complicated because vmstats prints it in PMD-size units, so it can't represent PTE-mapped memory in that counter.

:/

>> "AnonHugePteMap" better be "AnonHugePteMapped".
>
> I agree, but I went with the shorter one because any longer and it would unalign the value, e.g.:
>
> AnonHugePages:         0 kB
> AnonHugePteMapped:     0 kB
> ShmemPmdMapped:        0 kB
> Shared_Hugetlb:        0 kB

Can't that be handled?
We surely have long stuff in there:

HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:  1081344 kB
ShmemPmdMapped:        0 kB

HardwareCorrupted has the same length as AnonHugePteMapped. But I'm not convinced about "AnonHugePteMapped" yet :)

> So we would need to decide which is preferable, or come up with a shorter name.
>
>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense, because !PMD implied !performance, and one might have been able to troubleshoot that somehow. For PTE-mapped, it doesn't make much sense really; they are always PTE-mapped.
>
> I disagree; I've been using it a lot to debug performance issues. It tells you how much of your anon memory is allocated with large folios. And making that percentage bigger improves performance; fewer page faults, and with a separate contpte series on arm64, better use of the TLB. Reasons might include: poorly aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

Just because a small-sized THP is PTE-mapped doesn't tell you anything, really. What you want to know is if it is "completely" and "consecutively" mapped such that the HW can actually benefit from it -- if HW even supports it. So "PTE-mapped THP" is just part of the story. And that's where it gets tricky, I think.

I agree that it's good for debugging, but then maybe it should a) live somewhere else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we also want similar stats somewhere.

One idea would be to expose such stats in a R/O fashion like "nr_allocated" or "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and friends. Of course, maybe tagging them with an "anon" prefix.

> I would actually argue for adding similar counters for file-backed memory too for the same reasons.
> (I actually posted an independent patch a while back that did this for file- and anon- memory, bucketed by size. But I think the idea of the bucketing was NAKed.)

For debugging, I *think* it might be valuable to see how many THP of each size are allocated. Tracking exactly "how is it mapped" is not easy to achieve, as we learned. PMD-mapped was easy, but also requires us to keep doing that tracking for all eternity ...

Do you have a pointer to the patch set? Did it try to squeeze it into /proc/meminfo?

>> That also raises the question of how you would account a PTE-mapped THP. The whole thing? Only the parts that are mapped? Let's better not go down that path.
>
> The approach I've taken in this series is the simple one - account every page that belongs to a large folio from when it is first mapped to last unmapped. Yes, in this case, you might not actually be mapping the full thing contiguously. But it gives a good indication.
>
> I also considered accounting the whole folio only when all of its pages become mapped (although not worrying about them all being contiguous). That's still simple to implement for all counters except smaps. So I went with the simplest approach with the view that it's "good enough".

If you take a look at "ShmemHugePages" and "FileHugePages", there we actually track them when they get allocated+freed, which is much easier than tracking when/how they are (un)mapped. But it's only done for PMD-sized THP for now.

>> That leaves the question of why we would want to include them here at all in a special PTE-mapped way.
>>
>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>
>> HugePages_Total:       1
>> HugePages_Free:        1
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:       2048 kB
>> Hugetlb:         1050624 kB
>>
>> -> Only the last one gives the sum; the other stats don't even mention the other ones. [how do we get their stats, if at all?]
> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages, surplus_hugepages. But this interface also constitutes the allocator, not just stats, I think.

Ah, I missed that we expose free vs. reserved vs. surplus ... there as well; I thought we would only have "nr_hugepages".

>> So maybe we only want a summary of how many anon huge pages of any size are allocated (independent of the PTE vs. PMD mapping),
>
> Are you proposing (AnonHugePages + AnonHugePteMapped) here, or something else? If the former, then I don't really see the difference. We have to continue to expose PMD-size (AnonHugePages). So either add a PTE-only counter and derive the total, or add a total counter and derive PTE-only. I suspect I've misunderstood your point.

I don't think we should go down the "PteMapped" path. Probably we want "bucketing" stats as you said, and maybe a global one that just combines everything (any THP). But naming will be difficult.

>> and some other source to eventually inspect how the different sizes behave.
>>
>> But note that for non-PMD-sized file THP we don't even have special counters! ... so maybe we should also defer any such stats and come up with something uniform for all types of non-PMD-sized THP.
>
> Indeed, I can see benefit in adding these for file THP - in fact I have a patch that does exactly that to help my development work. I had envisaged that we could add something like FileHugePteMapped, ShmemHugePteMapped that would follow the same semantics as AnonHugePteMapped.

Again, maybe we can find something that does not involve the "PteMapped" terminology and just gives us a big total of "allocated" THP. For detailed stats for debugging, maybe we can just use a different interface then.

>> Same discussion applies to all other stats.
>> >> >>> >>> Because we now have runtime enable/disable control, I've removed the compile >>> time Kconfig switch. It still defaults to runtime-disabled. >>> >>> NOTE: These changes should not be merged until the prerequisites are complete. >>> These are in progress and tracked at [7]. >> >> We should probably list them here, and classify which ones we see as a strict >> requirement, which ones might be an optimization. > > > I'll need some help with classifying them, so showing my working. Final list that > I would propose as strict requirements at bottom. > > This is my list with status, as per response to Yu in other thread: > > - David is working on "shared vs exclusive mappings" Probably "COW reuse support" is a separate item, although my approach would cover that. The question is whether the estimate we're using in most code for now would at least be sufficient to merge it. The estimate is easily wrong, but we do have that issue with PTE-mapped THP already. But that argument probably applies to most things here: the difference is that PTE-mapped THP are not the default, that's why nobody really cared. [I'm playing with an almost-lockless scheme right now and hope I have something running soonish -- as you know, I got distracted] > - Zi Yan has posted an RFC for compaction > - Yin Fengwei's mlock series is now in mm-stable > - Yin Fengwei's madvise series is in 6.6 > - I've reworked and posted a series for deferred_split_folio; although I've > deprioritised it because Yu said it wasn't really a pre-requisite. > - numa balancing depends on David's "shared vs exclusive mappings" work > - I've started looking at "large folios in swap cache" in the background, > because I'm seeing some slowdown with large folios, but we also agreed that > wasn't a prerequisite > Probably it would be good to talk about the items and how we would classify them in a meeting.
> Although, since sending that, I've determined that when running kernel > compilation across a high number of cores on arm64, the cost of splitting the > folios gets large due to needing to broadcast the extra TLBIs. So I think the > last point on that list may be a prerequisite after all. (I've been able to fix > this by adding support for allocating large folios in the swap file, and > avoiding the split - planning to send RFC this week). > > There is also this set of things that you mentioned against "shared vs exclusive > mappings", which I'm not sure if you are planning to cover as part of your work > or if they are follow-on things that will need to be done: > > (1) Detecting shared folios, to not mess with them while they are shared. > MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ... > replace cases where folio_estimated_sharers() == 1 would currently be the > best we can do (and in some cases, page_mapcount() == 1). > > And I recently discovered that khugepaged doesn't collapse file-backed pages to > a PMD-size THP if they belong to a large folio, so I'm guessing it may also > suffer the same behaviour for anon memory. I'm not sure if that's what your > "khugepaged ..." comment refers to? Yes. But I did not look into all the details yet. "khugepaged" collapse support to small-sized THP is probably also a very important item, although it might be less relevant than for PMD -- and I consider it future work. See below. > > So taking all that and trying to put together a complete outstanding list for > strict requirements: > > - Shared vs Exclusive Mappings (DavidH) > - user-triggered page migration > - NUMA hinting/balancing > - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios > - Compaction of Large Folios (Zi Yan) > - Swap out small-size THP without Split (Ryan Roberts) ^ that's going to be tough, I can promise. And the only way to live without that would be khugepaged support.
(because that's how it's all working for PMD-sized THP after all!) Once a PMD-sized THP was swapped out and evicted, it will always come back in order-0 folios. khugepaged will re-collapse into PMD-sized chunks. If we could do that for PTE-sized THP as well ... > > >> >> >> Now, these are just my thoughts, and I'm happy about other thoughts. > > As always, thanks for taking the time - I really appreciate it. Sure. Hoping others can comment. My gut feeling is that it's best to focus on getting the sysfs interface right + future-proof and handling the stats independently. While being a good debug mechanism, I wouldn't consider these stats a requirement: we don't have them for file/shmem small-sized thp so far as well. So maybe really better to handle the stats in meminfo and friends separately.
On 09/10/2023 17:22, David Hildenbrand wrote: > [...] > >>> >>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am >>> wrong and it is a good idea. >>> >>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as >>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's >>> small/medium/large is debatable. "Small" implies at least that it's smaller than >>> what we used to know, which is a fact. >>> >>> Can we also now use the terminology consistently? (e.g., "variable-order, large >>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point >>> at the previous patch set name in the cover letter) >> >> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset >> so people could easily see it was an evolution of something posted before. But >> if it's the norm to change the title as the patchset evolves, I'm very happy to >> do that. And there are other places too, in commit logs that I can tidy up. I >> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP" >> (that last one slightly different from what David suggested above - it means >> "small-sized THP" can still be grepped) unless others object. > > Absolutely fine with me. Hoping other people will object when I talk nonsense or > my suggestions don't make any sense. > > Or even better, propose something better :) > >> >>> >>>> >>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base >>>> pages, there are efficiency savings to be had; fewer page faults, >>>> batched PTE >>>> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel >>>> overhead. This should benefit all architectures. >>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take >>>> advantage of HW TLB compression techniques. A reduction in TLB pressure >>>> speeds up kernel and user space. 
arm64 systems have 2 mechanisms to >>>> coalesce >>>> TLB entries; "the contiguous bit" (architectural) and HPA (uarch). >>>> >>>> The major change in this revision is the addition of sysfs controls to allow >>>> this "small-order THP" to be enabled/disabled/configured independently of >>>> PMD-order THP. The approach I've taken differs a bit from previous discussions; >>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I >>>> personally think this makes things clearer and more extensible. See [6] for >>>> detailed rationale. >>> >>> Change 2: sysfs interface. >>> >>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I >>> agree. >>> >>> What we expose there and how, is TBD. Again, not a friend of "orders" and >>> bitmaps at all. We can do better if we want to go down that path. >>> >>> Maybe we should take a look at hugetlb, and how they added support for multiple >>> sizes. What *might* make sense could be (depending on which values we actually >>> support!) >>> >>> >>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ >>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/ >>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/ >>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/ >>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/ >>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/ >>> >>> Each one would contain an "enabled" and "defrag" file. We want something minimal >>> first? Start with the "enabled" option. >>> >>> >>> enabled: always [global] madvise never >>> >>> Initially, we would set it for PMD-sized THP to "global" and for everything else >>> to "never". >> >> My only reservation about this approach is the potential for a future need for a >> "one setting applied across all sizes" class of control (e.g. "auto"). 
I think >> we agreed in the previous meetings that chasing a solution for "auto" was a good >> aspiration to have, so it would be good to have a place where we can insert that in >> future. The main reason why I chose to expose the "anon_orders" control is >> because it is possible to both enable/disable the various sizes as well as >> specify (e.g.) "auto", without creating redundancy. But I agree that ideally >> we wouldn't expose orders to the user; I was attempting a compromise to simplify >> the "auto" case. >> >> A potential (though feels quite complex) solution to make auto work with your >> proposal: Add "auto" as an option to the existing global enabled file, and to >> all of your proposed new enabled files. But it's only possible to *set* auto >> through the global file. And when it is set, all of the size-specific enabled >> files read back "auto" too. And any writes to the size-specific enabled files >> are ignored (or remembered but not enacted) until the global enabled file is >> changed away from auto. > Yes, I think there are various ways forward regarding that. Or to enable "auto" > mode only once all are "auto", and as soon as one is not "auto", just disable > it. A simple > > echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled I'm not really a fan, because this implies that you have a period where "auto" is reported for a size, but it's not really in "auto" mode yet. > > Would do to enable it. Or, have them all be "global" and have a global "auto" > mode as you raised. > > echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled > echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled > Again, this isn't atomic either. I tend to prefer my proposal because it switches atomically - there are no weird intermediate states. Anyway, I guess the important point is we have demonstrated that your proposed interface could be extended to support "auto" in future, should we need it.
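For what it's worth, the size-named directories and an order-based bitmap encode the same information, but an order only pins down a size relative to the base page size. A small sketch of that mapping (not kernel code; 4 KiB base pages assumed unless stated):

```python
# Sketch of the order <-> size relationship behind the naming debate above.
# A directory name like "hugepages-64kB" pins down a size, whereas an order
# (as in the "anon_orders" control) means different sizes on different base
# page sizes -- one argument for exposing sizes rather than orders.
def hugepage_dir(order, page_shift=12):  # page_shift=12 assumes 4 KiB pages
    size_kb = (1 << (order + page_shift)) // 1024
    return f"hugepages-{size_kb}kB"

print(hugepage_dir(9))                 # hugepages-2048kB (PMD order on 4 KiB pages)
print(hugepage_dir(4))                 # hugepages-64kB
print(hugepage_dir(4, page_shift=16))  # hugepages-1024kB on 64 KiB base pages
```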
>> >> But I'm not sure if adding a new option to the global enabled file might break >> compat? > > I think we used to extend the "defrag" option, see > > commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1 > Author: David Rientjes <rientjes@google.com> > Date: Wed Feb 22 15:45:49 2017 -0800 > > mm, thp: add new defer+madvise defrag option > > > So I suspect we could extend that one in a similar way. > > But again, this is just the thing that came to mind when thinking about how to: > a) avoid orders > b) make it configurable and future-proof > c) make it look a bit consistent with other interfaces (hugetlb and > existing thp) > d) still prepare for an auto mode that we want in the future > > I'm happy to hear other ideas. > >> >>> >>> >>> That sounds reasonable at least to me, and we would be using what we learned >>> from THP (as John suggested). That still gives reasonable flexibility without >>> going too wild, and a better IMHO interface. >>> >>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long >>> as we can have something that won't hurt us later and still be able to use this >>> in distributions within a reasonable timeframe. Enabling/disabling individual >>> sizes does not sound too restrictive to me. And we could always add an "auto" >>> setting later and default to that with a new kconfig knob. >>> >>> If someone wants to configure it, why not. Let's just prepare a way to handle >>> this "better" automatically in the future (if ever ...). >>> >>> >>> Change 3: Stats >>> >>>> /proc/meminfo: >>>> Introduce new "AnonHugePteMap" field, which reports the amount of >>>> memory (in KiB) mapped from large folios globally (similar to >>>> AnonHugePages field). >>> >>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think >>> we all agree on that. It should have been named "AnonPmdMapped" or >>> "AnonHugePmdMapped", too bad, we can't change that. >> >> Yes agreed.
I did consider redefining "AnonHugePages" to cover PMD- and >> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and >> "AnonHugePteMapped", but I think that would likely break things. It's further >> complicated because vmstats prints it in PMD-size units, so can't represent >> PTE-mapped memory in that counter. > > :/ > >> >>> >>> "AnonHugePteMap" better be "AnonHugePteMapped". >> >> I agree, but I went with the shorter one because any longer and it would unalign >> the value e.g: >> >> AnonHugePages: 0 kB >> AnonHugePteMapped: 0 kB >> ShmemPmdMapped: 0 kB >> Shared_Hugetlb: 0 kB >> > > Can't that be handled? We surely have long stuff in there: > > HardwareCorrupted: 0 kB > AnonHugePages: 0 kB > ShmemHugePages: 1081344 kB > ShmemPmdMapped: 0 kB > > HardwareCorrupted has the same length as AnonHugePteMapped HardwareCorrupted is special-cased and has a field length of 5, so the largest value you can represent before it gets pushed out is ~97MB. I imagine that's plenty for HardwareCorrupted, but unlikely to be enough for AnonHugePteMapped. The standard field size is 8, which provides for ~95GB before becoming unaligned. > > But I'm not convinced about "AnonHugePteMapped" yet :) OK > >> So would need to decide which is preferable, or come up with a shorter name. >> >>> >>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why >>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense, >>> because !PMD implied !performance, and one might have been able to troubleshoot >>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always >>> PTE-mapped. >> >> I disagree; I've been using it a lot to debug performance issues. It tells you >> how much of your anon memory is allocated with large folios. And making that >> percentage bigger improves performance; fewer page faults, and with a separate >> contpte series on arm64, better use of the TLB.
Reasons might include: poorly >> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc. > > Just because a small-sized THP is PTE-mapped doesn't tell you anything, really. > What you want to know is if it is "completely" and "consecutively" mapped such > that the HW can actually benefit from it -- if HW even supports it. So > "PTE-mapped THP" is just part of the story. And that's where it gets tricky I > think. > > I agree that it's good for debugging, but then maybe it should a) live somewhere > else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we > also want similar stats somewhere. > > One idea would be to expose such stats in an R/O fashion like "nr_allocated" or > "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and > friends. Of course, maybe tagging them with "anon" prefix. I see your point, but I don't completely agree with it all. That said, given your conclusion at the bottom, perhaps we should park the discussion about the accounting for a separate series in future? Then we can focus on the ABI? > >> >> I would actually argue for adding similar counters for file-backed memory too >> for the same reasons. (I actually posted an independent patch a while back that >> did this for file- and anon- memory, bucketed by size. But I think the idea of >> the bucketing was NAKed. > For debugging, I *think* it might be valuable to see how many THP of each size > are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we > learned. PMD-mapped was easy, but also requires us to keep doing that tracking > for all eternity ... > > Do you have a pointer to the patch set? Did it try to squeeze it into > /proc/meminfo? I was actually only working on smaps/smaps_rollup, which has been my main tool for debugging. Patches at [1].
[1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/ > >> >>> >>> That also raises the question how you would account a PTE-mapped THP. The whole >>> thing? Only the parts that are mapped? Let's better not go down that path. >> >> The approach I've taken in this series is the simple one - account every page >> that belongs to a large folio from when it is first mapped to last unmapped. >> Yes, in this case, you might not actually be mapping the full thing >> contiguously. But it gives a good indication. >> >> I also considered accounting the whole folio only when all of its pages become >> mapped (although not worrying about them all being contiguous). That's still >> simple to implement for all counters except smaps. So went with the simplest >> approach with the view that it's "good enough". > > If you take a look at "ShmemHugePages" and "FileHugePages", there we actually > track them when they get allocated+freed, which is much easier than tracking > when/how they are (un)mapped. But it's only done for PMD-sized THP for now. > >> >>> >>> That leaves the question why we would want to include them here at all in a >>> special PTE-mapped way? >>> >>> >>> Again, let's look at hugetlb: I prepared one 1 GiB and one 2 MiB page. >>> >>> HugePages_Total: 1 >>> HugePages_Free: 1 >>> HugePages_Rsvd: 0 >>> HugePages_Surp: 0 >>> Hugepagesize: 2048 kB >>> Hugetlb: 1050624 kB >>> >>> -> Only the last one gives the sum, the other stats don't even mention the other >>> ones. [how do we get their stats, if at all?] >> >> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and >> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages, >> surplus_hugepages. But this interface also constitutes the allocator, not just >> stats, I think. > > Ah, I missed that we expose free vs. reserved vs. surplus ... there as well; I > thought we would only have "nr_hugepages".
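On the /proc/meminfo alignment point raised earlier (field length 5 for HardwareCorrupted vs the standard 8), the ~97MB and ~95GB limits are easy to sanity-check. A sketch, treating the field widths as stated in the discussion rather than as read out of fs/proc/meminfo.c:

```python
# A counter printed right-aligned in a fixed-width field stays aligned only
# while its value (in kB) fits in that width. Widths here follow the
# discussion above: 5 for HardwareCorrupted, 8 for most meminfo fields.
def max_aligned_kb(field_width):
    return 10 ** field_width - 1  # largest kB value that still fits

print(round(max_aligned_kb(5) / 1024, 1))           # 97.7 (MiB) -> the "~97MB"
print(round(max_aligned_kb(8) / (1024 * 1024), 1))  # 95.4 (GiB) -> the "~95GB"
```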
> >> >>> So maybe, we only want a summary of how many anon huge pages of any size are >>> allocated (independent of the PTE vs. PMD mapping), >> >> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If >> the former, then I don't really see the difference. We have to continue to >> expose PMD-size (AnonHugePages). So either add a PTE-only counter and derive the >> total, or add a total counter and derive PTE-only. I suspect I've misunderstood >> your point. > > I don't think we should go down the "PteMapped" path. Probably we want > "bucketing" stats as you said, and maybe a global one that just combines > everything (any THP). But naming will be difficult. > >> >>> and some other source to >>> eventually inspect how the different sizes behave. >>> >>> But note that for non-PMD-sized file THP we don't even have special counters! >>> ... so maybe we should also defer any such stats and come up with something >>> uniform for all types of non-PMD-sized THP. >> >> Indeed, I can see benefit in adding these for file THP - in fact I have a patch >> that does exactly that to help my development work. I had envisaged that we >> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow >> the same semantics as AnonHugePteMapped. > > Again, maybe we can find something that does not involve the "PteMapped" > terminology and just gives us a big total of "allocated" THP. For detailed stats > for debugging, maybe we can just use a different interface then. > >> >>> >>> >>> Same discussion applies to all other stats. >>> >>> >>>> >>>> Because we now have runtime enable/disable control, I've removed the compile >>>> time Kconfig switch. It still defaults to runtime-disabled. >>>> >>>> NOTE: These changes should not be merged until the prerequisites are complete. >>>> These are in progress and tracked at [7].
>>> >>> We should probably list them here, and classify which ones we see as a strict >>> requirement, which ones might be an optimization. >> >> >> I'll need some help with classifying them, so showing my working. Final list that >> I would propose as strict requirements at bottom. >> >> This is my list with status, as per response to Yu in other thread: >> >> - David is working on "shared vs exclusive mappings" > > Probably "COW reuse support" is a separate item, although my approach would > cover that. Yeah that's in the original thread as (2), but I thought we all agreed that is not a prerequisite so didn't bring it over here. > > The question is whether the estimate we're using in most code for now would at least > be sufficient to merge it. The estimate is easily wrong, but we do have that > issue with PTE-mapped THP already. Well as I understand it, at least the NUMA balancing code and khugepaged are ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped THP, but I assume it becomes untenable if large folios are the norm. Perhaps we can modify those paths to work with the current estimators in order to remove your work from the critical path - is that what you are getting at? > > But that argument probably applies to most things here: the difference is that > PTE-mapped THP are not the default, that's why nobody really cared. > > [I'm playing with an almost-lockless scheme right now and hope I have something > running soonish -- as you know, I got distracted] > >> - Zi Yan has posted an RFC for compaction >> - Yin Fengwei's mlock series is now in mm-stable >> - Yin Fengwei's madvise series is in 6.6 >> - I've reworked and posted a series for deferred_split_folio; although I've >> deprioritised it because Yu said it wasn't really a pre-requisite.
>> - numa balancing depends on David's "shared vs exclusive mappings" work >> - I've started looking at "large folios in swap cache" in the background, >> because I'm seeing some slowdown with large folios, but we also agreed that >> wasn't a prerequisite >> > > Probably it would be good to talk about the items and how we would classify them > in a meeting. Perhaps we can get a slot in Matthew's meeting tomorrow? > > >> Although, since sending that, I've determined that when running kernel >> compilation across a high number of cores on arm64, the cost of splitting the >> folios gets large due to needing to broadcast the extra TLBIs. So I think the >> last point on that list may be a prerequisite after all. (I've been able to fix >> this by adding support for allocating large folios in the swap file, and >> avoiding the split - planning to send RFC this week). >> >> There is also this set of things that you mentioned against "shared vs exclusive >> mappings", which I'm not sure if you are planning to cover as part of your work >> or if they are follow-on things that will need to be done: >> >> (1) Detecting shared folios, to not mess with them while they are shared. >> MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ... >> replace cases where folio_estimated_sharers() == 1 would currently be the >> best we can do (and in some cases, page_mapcount() == 1). >> >> And I recently discovered that khugepaged doesn't collapse file-backed pages to >> a PMD-size THP if they belong to a large folio, so I'm guessing it may also >> suffer the same behaviour for anon memory. I'm not sure if that's what your >> "khugepaged ..." comment refers to? > > Yes. But I did not look into all the details yet. > > "khugepaged" collapse support to small-sized THP is probably also a very important > item, although it might be less relevant than for PMD -- and I consider it > future work. See below. Yes I agree that it's definitely future work.
Nothing regresses from today's performance if you don't have that. > >> >> So taking all that and trying to put together a complete outstanding list for >> strict requirements: >> >> - Shared vs Exclusive Mappings (DavidH) >> - user-triggered page migration >> - NUMA hinting/balancing >> - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios >> - Compaction of Large Folios (Zi Yan) >> - Swap out small-size THP without Split (Ryan Roberts) > > ^ that's going to be tough, I can promise. And the only way to live without that > would be khugepaged support. (because that's how it's all working for PMD-sized > THP after all!) Are you referring specifically to the "swap out" line there? If so, it wasn't my plan that we would *swap in* large folios - only swap them *out* as large folios to avoid the cost of splitting. Then when they come back in, they come in as single pages, just like PMD-sized THP, if I've understood things correctly. I have a patch working and showing the perf improvement as a result. I'm planning to post an RFC today, hopefully. I don't see the swap-in side as a problem for the initial patch set. OK, they come in as single pages, so you lost the potential TLB benefits. But that's no worse than today's performance so not a regression. And the ratio of SW savings on THP allocation to HW savings from the TLB is very different for small-sized THP; much more of the benefit comes from the SW and that's still there. > > Once a PMD-sized THP was swapped out and evicted, it will always come back in > order-0 folios. khugepaged will re-collapse into PMD-sized chunks. If we could do > that for PTE-sized THP as well ... Yes, sure, but that's a future improvement, not a requirement to prevent regression vs today, right? > >> >> >>> >>> >>> Now, these are just my thoughts, and I'm happy about other thoughts. >> >> As always, thanks for taking the time - I really appreciate it. > > Sure. Hoping others can comment.
> > My gut feeling is that it's best to focus on getting the sysfs interface > right+future proof and handling the stats independently. While being a good > debug mechanism, I wouldn't consider these stats a requirement: we don't have > them for file/shmem small-sized thp so far as well. > > So maybe really better to handle the stats in meminfo and friends separately. > I'd be very happy with that approach if others are bought in. Thanks, Ryan
>> >> Yes, I think there are various ways forward regarding that. Or to enable "auto" >> mode only once all are "auto", and as soon as one is not "auto", just disable >> it. A simple >> >> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled > > I'm not really a fan, because this implies that you have a period where "auto" > is reported for a size, but it's not really in "auto" mode yet. I think there are various alternatives that are feasible. For most systems later, you'd want to just enable "auto" via a compile-time CONFIG option as default or via some cmdline option like "transparent_hugepage=auto". > >> >> Would do to enable it. Or, have them all be "global" and have a global "auto" >> mode as you raised. >> >> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled >> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled >> > > Again, this isn't atomic either. I tend to prefer my proposal because it > switches atomically - there are no weird intermediate states. Anyway, I guess > the important point is we have demonstrated that your proposed interface could > be extended to support "auto" in future, should we need it. I don't think the atomic switch is really relevant. But that's probably a separate discussion. [...] >> >> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really. >> What you want to know is if it is "completely" and "consecutively" mapped such >> that the HW can actually benefit from it -- if HW even supports it. So >> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I >> think. >> >> I agree that it's good for debugging, but then maybe it should a) live somewhere >> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we >> also want similar stats somewhere. >> >> One idea would be to expose such stats in an R/O fashion like "nr_allocated" or >> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and >> friends.
Of course, maybe tagging them with "anon" prefix. > > I see your point, but I don't completely agree with it all. That said, given > your conclusion at the bottom, perhaps we should park the discussion about the > accounting for a separate series in future? Then we can focus on the ABI? Yes! > >> >>> >>> I would actually argue for adding similar counters for file-backed memory too >>> for the same reasons. (I actually posted an independent patch a while back that >>> did this for file- and anon- memory, bucketed by size. But I think the idea of >>> the bucketing was NAKed. >> For debugging, I *think* it might be valuable to see how many THP of each size >> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we >> learned. PMD-mapped was easy, but also requires us to keep doing that tracking >> for all eternity ... >> >> Do you have a pointer to the patch set? Did it try to squeeze it into >> /proc/meminfo? > > I was actually only working on smaps/smaps_rollup, which has been my main tool > for debugging. Patches at [1]. > > [1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/ > Thanks for the pointer! [...] >>> >>> I'll need some help with classifying them, so showing my working. Final list that >>> I would propose as strict requirements at bottom. >>> >>> This is my list with status, as per response to Yu in other thread: >>> >>> - David is working on "shared vs exclusive mappings" >> >> Probably "COW reuse support" is a separate item, although my approach would >> cover that. > > Yeah that's in the original thread as (2), but I thought we all agreed > that is not a prerequisite so didn't bring it over here. Agreed. Having a full list of todo items might be reasonable. > >> >> The question is whether the estimate we're using in most code for now would at least >> be sufficient to merge it. The estimate is easily wrong, but we do have that >> issue with PTE-mapped THP already.
> > Well as I understand it, at least the NUMA balancing code and khugepaged are > ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped > THP, but I assume it becomes untenable if large folios are the norm. Perhaps we > can modify those paths to work with the current estimators in order to remove > your work from the critical path - is that what you are getting at? IMHO most of the code that now uses the estimate-logic is really suboptimal, but it's all we have. It's probably interesting to see where the false negatives/positives are tolerable for now ... I hate to be on the critical path ;) But I'm getting somewhere slowly but steadily (slowly, because I'm constantly distracted -- and apparently sick). [...] >> >> >>> Although, since sending that, I've determined that when running kernel >>> compilation across a high number of cores on arm64, the cost of splitting the >>> folios gets large due to needing to broadcast the extra TLBIs. So I think the >>> last point on that list may be a prerequisite after all. (I've been able to fix >>> this by adding support for allocating large folios in the swap file, and >>> avoiding the split - planning to send RFC this week). >>> >>> There is also this set of things that you mentioned against "shared vs exclusive >>> mappings", which I'm not sure if you are planning to cover as part of your work >>> or if they are follow-on things that will need to be done: >>> >>> (1) Detecting shared folios, to not mess with them while they are shared. >>> MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ... >>> replace cases where folio_estimated_sharers() == 1 would currently be the >>> best we can do (and in some cases, page_mapcount() == 1). >>> >>> And I recently discovered that khugepaged doesn't collapse file-backed pages to >>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also >>> suffer the same behaviour for anon memory.
I'm not sure if that's what your >>> "khugepaged ..." comment refers to? >> >> Yes. But I did not look into all the details yet. >> >> "khugepaged" collapse support to small-sized THP is probably also a very important >> item, although it might be less relevant than for PMD -- and I consider it >> future work. See below. > > Yes I agree that it's definitely future work. Nothing regresses from today's > performance if you don't have that. > >> >>> >>> So taking all that and trying to put together a complete outstanding list for >>> strict requirements: >>> >>> - Shared vs Exclusive Mappings (DavidH) >>> - user-triggered page migration >>> - NUMA hinting/balancing >>> - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios >>> - Compaction of Large Folios (Zi Yan) >>> - Swap out small-size THP without Split (Ryan Roberts) >> >> ^ that's going to be tough, I can promise. And the only way to live without that >> would be khugepaged support. (because that's how it's all working for PMD-sized >> THP after all!) > > Are you referring specifically to the "swap out" line there? If so, it wasn't my > plan that we would *swap in* large folios - only swap them *out* as large folios Ah! > to avoid the cost of splitting. Then when they come back in, they come in as > single pages, just like PMD-sized THP, if I've understood things correctly. I > have a patch working and showing the perf improvement as a result. I'm planning > to post an RFC today, hopefully. > > I don't see the swap-in side as a problem for the initial patch set. OK, they > come in as single pages, so you lost the potential TLB benefits. But that's no > worse than today's performance so not a regression. And the ratio of SW savings > on THP allocation to HW savings from the TLB is very different for small-sized > THP; much more of the benefit comes from the SW and that's still there. > >> >> Once a PMD-sized THP was swapped out and evicted, it will always come back in >> order-0 folios. 
khugepaged will re-collapse into PMD-sized chunks. If we could do >> that for PTE-sized THP as well ... > > Yes, sure, but that's a future improvement, not a requirement to prevent > regression vs today, right? Yes. You can't just currently assume: as soon as you swap, the whole benefit is gone because you end up with all order-0 pages. These are certainly not limiting "merge" factors IMHO, but it's certainly one of the things users of distributions will heavily complain about ;) PMD-sized THP are mostly self-healing in that sense. > >> >>> >>> >>>> >>>> >>>> Now, these are just my thoughts, and I'm happy about other thoughts. >>> >>> As always, thanks for taking the time - I really appreciate it. >> >> Sure. Hoping others can comment. >> >> My gut feeling is that it's best to focus on getting the sysfs interface >> right+future proof and handling the stats independently. While being a good >> debug mechanism, I wouldn't consider these stats a requirement: we don't have >> them for file/shmem small-sized thp so far as well. >> >> So maybe really better to handle the stats in meminfo and friends separately. >> > > I'd be very happy with that approach if others are bought in. Yeah. I'm expecting there still to be discussions, but then we shall also hear other proposals. memcg controls are IMHO certainly future work.
On 06/10/2023 21:06, David Hildenbrand wrote: > On 29.09.23 13:44, Ryan Roberts wrote: >> Hi All, > [...] >> NOTE: These changes should not be merged until the prerequisites are complete. >> These are in progress and tracked at [7]. > > We should probably list them here, and classify which one we see as strict a > requirement, which ones might be an optimization. > Bringing back the discussion of prerequisites to this thread following the discussion at the mm-alignment meeting on Wednesday. Slides, updated following discussion to reflect all the agreed items that are prerequisites and enhancements, are at [1]. I've taken a closer look at the situation with khugepaged, and can confirm that it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when small-sized THP is enabled+always. So I've fixed that test up and will add the patch to the next version of my series. So I believe the khugepaged prerequisite can be marked as done. [1] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA Thanks, Ryan
On 20/10/2023 13:33, Ryan Roberts wrote: > On 06/10/2023 21:06, David Hildenbrand wrote: >> On 29.09.23 13:44, Ryan Roberts wrote: >>> Hi All, >> > > [...] > >>> NOTE: These changes should not be merged until the prerequisites are complete. >>> These are in progress and tracked at [7]. >> >> We should probably list them here, and classify which one we see as strict a >> requirement, which ones might be an optimization. >> > > Bringing back the discussion of prerequisites to this thread following the > discussion at the mm-alignment meeting on Wednesday. > > Slides, updated following discussion to reflect all the agreed items that are > prerequisites and enhancements, are at [1]. > > I've taken a closer look at the situation with khugepaged, and can confirm that > it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice > though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when > small-sized THP is enabled+always. So I've fixed that test up and will add the > patch to the next version of my series. > > So I believe the khugepaged prerequisite can be marked as done. > > [1] > https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA Hi All, It's been a week since the mm alignment meeting discussion we had around prerequisites and the ABI. I haven't heard any further feedback on the ABI proposal, so I'm going to be optimistic and assume that nobody has found any fatal flaws in it :). Certainly, I think it held up to the potential future policies that Yu Zhao cited on the call - the possibility of preferring a smaller size over a bigger one, if the smaller size can be allocated without splitting a contiguous block. I think the suggestion of adding a per-size priority file would solve it. And in general because we have a per-size directory, that gives us lots of flexibility for growth. Anyway, given the lack of feedback, I'm proposing to spin a new version. 
I'm planning to do the following: - Drop the accounting patch (#3); we will continue to only account PMD-sized THP for now. We can add more counters in future if needed. page cache large folios haven't needed any new counters yet. - Pivot to the ABI proposed by DavidH; per-size directories in a similar shape to that used by hugetlb - Drop the "recommend" keyword patch (#6); For now, users will need to understand implicitly which sizes are beneficial to their HW perf - Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping patch #6 - Add patch for khugepaged selftest improvement (described in previous email above). - Ensure that PMD_ORDER is not assumed to be compile-time constant (current code is broken on powerpc) Please shout if you think this is the wrong approach. On the prerequisites front, we have 2 items still to land: - compaction; Zi Yan is working on a v2 - numa balancing; A developer has signed up and is working on it (I'll leave them to reveal themself as preferred). Thanks, Ryan
On 25.10.23 18:24, Ryan Roberts wrote: > On 20/10/2023 13:33, Ryan Roberts wrote: >> On 06/10/2023 21:06, David Hildenbrand wrote: >>> On 29.09.23 13:44, Ryan Roberts wrote: >>>> Hi All, >>> >> >> [...] >> >>>> NOTE: These changes should not be merged until the prerequisites are complete. >>>> These are in progress and tracked at [7]. >>> >>> We should probably list them here, and classify which one we see as strict a >>> requirement, which ones might be an optimization. >>> >> >> Bringing back the discussion of prerequistes to this thread following the >> discussion at the mm-alignment meeting on Wednesday. >> >> Slides, updated following discussion to reflect all the agreed items that are >> prerequisites and enhancements, are at [1]. >> >> I've taken a closer look at the situation with khugepaged, and can confirm that >> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice >> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when >> small-sized THP is enabled+always. So I've fixed that test up and will add the >> patch to the next version of my series. >> >> So I believe the khugepaged prerequisite can be marked as done. >> >> [1] >> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA > > Hi All, Hi, I wanted to remind people in the THP cabal meeting, but that either didn't happen or zoomed decided to not let me join :) > > It's been a week since the mm alignment meeting discussion we had around > prerequisites and the ABI. I haven't heard any further feedback on the ABI > proposal, so I'm going to be optimistic and assume that nobody has found any > fatal flaws in it :). 
After saying in the call probably 10 times that people should comment here if there are reasonable alternatives worth discussing, call me "optimistic" as well; but, it's only been a week and people might still be thinking about this. There were two things discussed in the call: * Yu brought up "lists" so we can have priorities. As briefly discussed in the call, this (a) might not be needed right now in an initial version; (b) the kernel might be able to handle that (or many cases) automatically, TBD. Adding lists now would kind-of set the semantics of that interface in stone. As you describe below, the approach discussed here could easily be extended to cover priorities, if need be. * Hugh raised the point that "bitmap of orders" could be replaced by "added THP sizes", which really is "bitmap of orders" shifted to the left. To configure 2 MiB + 64 KiB, one would get "2097152 + 65536" = "2162688" or in KiB "2112". Hm. Both approaches would require single-option files like "enable_always", "enable_madvise" ... which I don't particularly like, but who am I to judge. > > > > Certainly, I think it held up to the potential future policies that Yu Zhao > cited on the call - the possibility of preferring a smaller size over a bigger > one, if the smaller size can be allocated without splitting a contiguous block. > I think the suggestion of adding a per-size priority file would solve it. And in > general because we have a per-size directory, that gives us lots of flexibility > for growth. Jup, same opinion here. But again, I'm very happy to hear other alternatives and why they are better. > > Anyway, given the lack of feedback, I'm proposing to spin a new version. I'm > planning to do the following: > > - Drop the accounting patch (#3); we will continue to only account PMD-sized > THP for now. We can add more counters in future if needed. page cache large > folios haven't needed any new counters yet. 
> > - Pivot to the ABI proposed by DavidH; per-size directories in a similar shape > to that used by hugetlb > > - Drop the "recommend" keyword patch (#6); For now, users will need to > understand implicitly which sizes are beneficial to their HW perf > > - Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping > patch #6 > > - Add patch for khugepaged selftest improvement (described in previous email > above). > > - Ensure that PMD_ORDER is not assumed to be compile-time constant (current > code is broken on powerpc) > > Please shout if you think this is the wrong approach. I'll shout that this sounds good to me; rather wait a bit more for more opinions. It probably makes sense to post something after the (upcoming) merge window, if there are no further discussions here.
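For concreteness, the two user-space encodings contrasted above ("bitmap of orders" vs. Hugh's "added THP sizes") can be sketched with shell arithmetic; this assumes 4 KiB base pages (PAGE_SHIFT of 12) and is purely illustrative, not merged ABI:

```shell
# "bitmap of orders": one bit per folio order. With 4 KiB base pages,
# 64 KiB is order 4 and 2 MiB (PMD-sized) is order 9.
order_bitmap=$(( (1 << 4) | (1 << 9) ))

# "added THP sizes" is the same bitmap shifted left by PAGE_SHIFT:
# the sum of the enabled sizes in bytes.
size_sum=$(( order_bitmap << 12 ))

echo "orders bitmap: $order_bitmap"
echo "size sum: $size_sum bytes"            # 2097152 + 65536 = 2162688
echo "size sum: $(( size_sum >> 10 )) KiB"  # 2112
```

This makes David's observation concrete: the two encodings carry identical information, differing only by a fixed shift.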
On 10/25/23 09:24, Ryan Roberts wrote: > On the prerequisites front, we have 2 items still to land: > > - compaction; Zi Yan is working on a v2 > > - numa balancing; A developer has signed up and is working on it (I'll leave > them to reveal themself as preferred). > Oh yes, that's me, thanks for providing an easy place to reply-to, for that. thanks,
On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <david@redhat.com> wrote: > > On 25.10.23 18:24, Ryan Roberts wrote: > > On 20/10/2023 13:33, Ryan Roberts wrote: > >> On 06/10/2023 21:06, David Hildenbrand wrote: > >>> On 29.09.23 13:44, Ryan Roberts wrote: > >>>> Hi All, > >>> > >> > >> [...] > >> > >>>> NOTE: These changes should not be merged until the prerequisites are complete. > >>>> These are in progress and tracked at [7]. > >>> > >>> We should probably list them here, and classify which one we see as strict a > >>> requirement, which ones might be an optimization. > >>> > >> > >> Bringing back the discussion of prerequistes to this thread following the > >> discussion at the mm-alignment meeting on Wednesday. > >> > >> Slides, updated following discussion to reflect all the agreed items that are > >> prerequisites and enhancements, are at [1]. > >> > >> I've taken a closer look at the situation with khugepaged, and can confirm that > >> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice > >> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when > >> small-sized THP is enabled+always. So I've fixed that test up and will add the > >> patch to the next version of my series. > >> > >> So I believe the khugepaged prerequisite can be marked as done. > >> > >> [1] > >> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA > > > > Hi All, > > Hi, > > I wanted to remind people in the THP cabal meeting, but that either > didn't happen or zoomed decided to not let me join :) > > > > > It's been a week since the mm alignment meeting discussion we had around > > prerequisites and the ABI. I haven't heard any further feedback on the ABI > > proposal, so I'm going to be optimistic and assume that nobody has found any > > fatal flaws in it :). 
> > After saying in the call probably 10 times that people should comment > here if there are reasonable alternatives worth discussing, call me > "optimistic" as well; but, it's only been a week and people might still > be thinking about this/ > > There were two things discussed in the call: > > * Yu brought up "lists" so we can have priorities. As briefly discussed > in the call, this (a) might not be needed right now in an initial > version; (b) the kernel might be able to handle that (or many cases) > automatically, TBD. Adding lists now would kind-of set the semantics > of that interface in stone. As you describe below, the approach > discussed here could easily be extended to cover priorities, if need > be. I want to expand on this: the argument that "if you could allocate a higher order you should use it" is too simplistic. There are many reasons in addition to the one above that we want to "fall back" to higher orders, e.g., those higher orders are not on PCP or from the local node. When we consider the sequence of orders to try, user preference is just one of the parameters to the cost function. The bottom line is that I think we should all agree that there needs to be a cost function down the road, whatever it looks like. Otherwise I don't know how we can make "auto" happen. > * Hugh raised the point that "bitmap of orders" could be replaced by > "added THP sizes", which really is "bitmap of orders" shifted to the > left. To configure 2 MiB + 64Kib, one would get "2097152 + 65536" = > "2162688" or in KiB "2112". Hm. I'm not a big fan of the "bitmap of orders" approach, because it doesn't address my concern above. > Both approaches would require single-option files like "enable_always", > "enable_madvise" ... which I don't particularly like, but who am I to judge. 
> > > > > > Certainly, I think it held up to the potential future policies that Yu Zhou > > cited on the call - the possibility of preferring a smaller size over a bigger > > one, if the smaller size can be allocated without splitting a contiguous block. > > I think the suggestion of adding a per-size priority file would solve it. And in > > general because we have a per-size directory, that gives us lots of flexibility > > for growth. > > Jup, same opinion here. But again, I'm very happy to hear other > alternatives and why they are better. I'm not against David's proposal but I want to hear a lot more about "lots of flexibility for growth" before I'm fully convinced. Why do I need more convincing? When I brought up that we need to consider the priority of each order and the potential need to fall back to higher orders during the meeting, I got the impression people were surprised why we want to fall back to higher orders. TBH, I was surprised too that this possibility was never considered. I missed today's THP meeting too but I'll join next time and if anyone has more ideas on this, we can spend some time discussing it, especially on how LAF should cooperate with the page allocator to make better decisions.
On 25/10/2023 20:11, Yu Zhao wrote: > On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 25.10.23 18:24, Ryan Roberts wrote: >>> On 20/10/2023 13:33, Ryan Roberts wrote: >>>> On 06/10/2023 21:06, David Hildenbrand wrote: >>>>> On 29.09.23 13:44, Ryan Roberts wrote: >>>>>> Hi All, >>>>> >>>> >>>> [...] >>>> >>>>>> NOTE: These changes should not be merged until the prerequisites are complete. >>>>>> These are in progress and tracked at [7]. >>>>> >>>>> We should probably list them here, and classify which one we see as strict a >>>>> requirement, which ones might be an optimization. >>>>> >>>> >>>> Bringing back the discussion of prerequistes to this thread following the >>>> discussion at the mm-alignment meeting on Wednesday. >>>> >>>> Slides, updated following discussion to reflect all the agreed items that are >>>> prerequisites and enhancements, are at [1]. >>>> >>>> I've taken a closer look at the situation with khugepaged, and can confirm that >>>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice >>>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when >>>> small-sized THP is enabled+always. So I've fixed that test up and will add the >>>> patch to the next version of my series. >>>> >>>> So I believe the khugepaged prerequisite can be marked as done. >>>> >>>> [1] >>>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA >>> >>> Hi All, >> >> Hi, >> >> I wanted to remind people in the THP cabal meeting, but that either >> didn't happen or zoomed decided to not let me join :) I didn't make it yesterday either - was having to juggle child care. >> >>> >>> It's been a week since the mm alignment meeting discussion we had around >>> prerequisites and the ABI. 
I haven't heard any further feedback on the ABI >>> proposal, so I'm going to be optimistic and assume that nobody has found any >>> fatal flaws in it :). >> >> After saying in the call probably 10 times that people should comment >> here if there are reasonable alternatives worth discussing, call me >> "optimistic" as well; but, it's only been a week and people might still >> be thinking about this/ >> >> There were two things discussed in the call: >> >> * Yu brought up "lists" so we can have priorities. As briefly discussed >> in the call, this (a) might not be needed right now in an initial >> version; (b) the kernel might be able to handle that (or many cases) >> automatically, TBD. Adding lists now would kind-of set the semantics >> of that interface in stone. As you describe below, the approach >> discussed here could easily be extended to cover priorities, if need >> be. > > I want to expand on this: the argument that "if you could allocate a > higher order you should use it" is too simplistic. There are many > reasons in addition to the one above that we want to "fall back" to > higher orders, e.g., those higher orders are not on PCP or from the > local node. When we consider the sequence of orders to try, user > preference is just one of the parameters to the cost function. The > bottom line is that I think we should all agree that there needs to be > a cost function down the road, whatever it looks like. Otherwise I > don't know how we can make "auto" happen. I don't dispute that this sounds like it could be beneficial, but I see it as research to happen further down the road (as you say), and we don't know what that research might conclude. Also, I think the scope of this is bigger than anonymous memory - you would also likely want to look at the policy for page cache folio order too, since today that's based solely on readahead. So I see it as an optimization that is somewhat orthogonal to small-sized THP. 
The proposed interface does not imply any preference order - it only states which sizes the user wants the kernel to select from, so I think there is lots of freedom to change this down the track if the kernel wants to start using the buddy allocator's state as a signal to make its decisions. > >> * Hugh raised the point that "bitmap of orders" could be replaced by >> "added THP sizes", which really is "bitmap of orders" shifted to the >> left. To configure 2 MiB + 64Kib, one would get "2097152 + 65536" = >> "2162688" or in KiB "2112". Hm. > > I'm not a big fan of the "bitmap of orders" approach, because it > doesn't address my concern above. > >> Both approaches would require single-option files like "enable_always", >> "enable_madvise" ... which I don't particularly like, but who am I to judge. >> >> >>> >>> Certainly, I think it held up to the potential future policies that Yu Zhou >>> cited on the call - the possibility of preferring a smaller size over a bigger >>> one, if the smaller size can be allocated without splitting a contiguous block. >>> I think the suggestion of adding a per-size priority file would solve it. And in >>> general because we have a per-size directory, that gives us lots of flexibility >>> for growth. >> >> Jup, same opinion here. But again, I'm very happy to hear other >> alternatives and why they are better. > > I'm not against David's proposal but I want to hear a lot more about > "lots of flexibility for growth" before I'm fully convinced. My point was that in an abstract sense, there are properties a user may wish to apply individually to a size, which is catered for by having a per-size directory into which we can add more files if/when requirements for new per-size properties arise. There are also properties that may be applied globally, for which we have the top-level transparent_hugepage directory where properties can be extended or added. 
For your case around tighter integration with the buddy allocator, I could imagine a per-size file allowing the user to specify if the kernel should allow splitting a higher order to make a THP of that size (I'm not suggesting that's a good idea, I'm just pointing out that this sort of thing is possible with the interface). And we have discussed how the global enabled property could be extended to support "auto" [1]. But perhaps what we really need are lots more ideas for future directions for small-sized THP to allow us to evaluate this interface more widely. > Why do I > need more convincing? When I brought up that we need to consider the > priority of each order and the potential need to fall back to higher > orders during the meeting, I got the impression people were surprised > why we want to fall back to higher orders. TBH, I was surprised too > that this possibility was never considered. I missed today's THP > meeting too but I'll join next time and if anyone has more ideas on > this, we can spend some time discussing it, especially on how LAF > should cooperate with the page allocator to make better decisions. [1] https://lore.kernel.org/linux-mm/99f8294b-b4e5-424f-d761-24a70a82cc1a@redhat.com/ Thanks, Ryan
[...] >>> Hi, >>> >>> I wanted to remind people in the THP cabal meeting, but that either >>> didn't happen or zoomed decided to not let me join :) > > I didn't make it yesterday either - was having to juggle child care. I think it didn't happen, or started quite late (>20 min). > >>> >>>> >>>> It's been a week since the mm alignment meeting discussion we had around >>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI >>>> proposal, so I'm going to be optimistic and assume that nobody has found any >>>> fatal flaws in it :). >>> >>> After saying in the call probably 10 times that people should comment >>> here if there are reasonable alternatives worth discussing, call me >>> "optimistic" as well; but, it's only been a week and people might still >>> be thinking about this/ >>> >>> There were two things discussed in the call: >>> >>> * Yu brought up "lists" so we can have priorities. As briefly discussed >>> in the call, this (a) might not be needed right now in an initial >>> version; (b) the kernel might be able to handle that (or many cases) >>> automatically, TBD. Adding lists now would kind-of set the semantics >>> of that interface in stone. As you describe below, the approach >>> discussed here could easily be extended to cover priorities, if need >>> be. >> >> I want to expand on this: the argument that "if you could allocate a >> higher order you should use it" is too simplistic. There are many >> reasons in addition to the one above that we want to "fall back" to >> higher orders, e.g., those higher orders are not on PCP or from the >> local node. When we consider the sequence of orders to try, user >> preference is just one of the parameters to the cost function. The >> bottom line is that I think we should all agree that there needs to be >> a cost function down the road, whatever it looks like. Otherwise I >> don't know how we can make "auto" happen. 
I agree that there needs to be a cost function, and as pagecache showed that's independent of initial enablement. > > I don't dispute that this sounds like it could be beneficial, but I see it as > research to happen further down the road (as you say), and we don't know what > that research might conclude. Also, I think the scope of this is bigger than > anonymous memory - you would also likely want to look at the policy for page > cache folio order too, since today that's based solely on readahead. So I see it > as an optimization that is somewhat orthogonal to small-sized THP. Exactly my thoughts. The important thing is that we should plan ahead that we still have the option to let the admin configure if we cannot make this work automatically in the kernel. What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a single global toggle. > > The proposed interface does not imply any preference order - it only states > which sizes the user wants the kernel to select from, so I think there is lots > of freedom to change this down the track if the kernel wants to start using the > buddy allocator's state as a signal to make its decisions. Yes. [..] >>> Jup, same opinion here. But again, I'm very happy to hear other >>> alternatives and why they are better. >> >> I'm not against David's proposal but I want to hear a lot more about >> "lots of flexibility for growth" before I'm fully convinced. > > My point was that in an abstract sense, there are properties a user may wish to > apply individually to a size, which is catered for by having a per-size > directory into which we can add more files if/when requirements for new per-size > properties arise. There are also properties that may be applied globally, for > which we have the top-level transparent_hugepage directory where properties can > be extended or added. Exactly, well said. 
> > For your case around tighter integration with the buddy allocator, I could > imagine a per-size file allowing the user to specify if the kernel should allow > splitting a higher order to make a THP of that size (I'm not suggesting that's a > good idea, I'm just pointing out that this sort of thing is possible with the > interface). And we have discussed how the global enabled property could be > extended to support "auto" [1]. > > But perhaps what we really need are lots more ideas for future directions for > small-sized THP to allow us to evaluate this interface more widely. David R. motivated a future size-aware setting of the defrag option. As discussed we might want something similar to shmem_enabled. What will happen with khugepaged, nobody knows yet :) I could imagine exposing per-size boolean read-only properties like "native-hw-size" (PMD, cont-pte). But these things require much more thought.
On 06/10/2023 21:06, David Hildenbrand wrote: [...] > > Change 2: sysfs interface. > > If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I > agree. > > What we expose there and how, is TBD. Again, not a friend of "orders" and > bitmaps at all. We can do better if we want to go down that path. > > Maybe we should take a look at hugetlb, and how they added support for multiple > sizes. What *might* make sense could be (depending on which values we actually > support!) > > > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ > /sys/kernel/mm/transparent_hugepage/hugepages-128kB/ > /sys/kernel/mm/transparent_hugepage/hugepages-256kB/ > /sys/kernel/mm/transparent_hugepage/hugepages-512kB/ > /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/ > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/ > > Each one would contain an "enabled" and "defrag" file. We want something minimal > first? Start with the "enabled" option. > > > enabled: always [global] madvise never > > Initially, we would set it for PMD-sized THP to "global" and for everything else > to "never". Hi David, I've just started coding this, and it occurs to me that I might need a small clarification here; the existing global "enabled" control is used to drive decisions for both anonymous memory and (non-shmem) file-backed memory. But the proposed new per-size "enabled" is implicitly only controlling anon memory (for now). 1) Is this potentially confusing for the user? Should we rename the per-size controls to "anon_enabled"? Or is it preferable to just keep it vague for now so we can reuse the same control for file-backed memory in future? 2) The global control will continue to drive the file-backed memory decision (for now), even when hugepages-2048kB/enabled != "global"; agreed? Thanks, Ryan
On 31/10/2023 11:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>> [...]
>
> Hi David,
>
> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).
>
> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
> we can reuse the same control for file-backed memory in future?
>
> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>
> Thanks,
> Ryan

Also, an implementation question:

hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true regardless. Is that by design? I couldn't fathom any reasoning from the commit log:

bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
			bool smaps, bool in_pf, bool enforce_sysfs)
{
	if (!vma->vm_mm)		/* vdso */
		return false;

	/*
	 * Explicitly disabled through madvise or prctl, or some
	 * architectures may disable THP for some mappings, for
	 * example, s390 kvm.
	 */
	if ((vm_flags & VM_NOHUGEPAGE) ||
	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
		return false;

	/*
	 * If the hardware/firmware marked hugepage support disabled.
	 */
	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
		return false;

	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
	if (vma_is_dax(vma))
		return in_pf;				<<<<<<<<

	...
}
On 31.10.23 12:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>> [...]
>
> Hi David,

Hi!

> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).

Anon was (way) first, and pagecache later decided to reuse that one as an indication whether larger folios are desired.

For the pagecache, it's just a way to enable/disable it globally. As there is no memory waste, nobody currently really cares about the exact sizes the pagecache is allocating (maybe that will change at some point, maybe not, who knows).

> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
> we can reuse the same control for file-backed memory in future?

The latter would be my take. Just like we did with the global toggle.

> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?

That would be my take; it will allocate other sizes already, so just glue it to the global toggle and document for the other toggles that they only control anonymous THP for now.
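The semantics agreed in this exchange can be sketched as a toy model (the function and names below are illustrative, not the kernel's: per-size values fall back to the existing global toggle, and file-backed memory keeps following the global toggle for now):

```python
# Toy model of the agreed per-size "enabled" semantics (illustrative
# only; not the kernel's actual implementation or naming).

def effective_policy(per_size: str, global_policy: str, is_anon: bool) -> str:
    """Resolve the THP policy for one folio size.

    per_size      -- value of hugepages-<size>kB/enabled
    global_policy -- value of the existing global "enabled" control
    is_anon       -- anonymous memory (True) or file-backed (False)
    """
    if not is_anon:
        # File-backed memory is driven by the global toggle only (for now).
        return global_policy
    # "global" defers to the global toggle; anything else stands on its own.
    return global_policy if per_size == "global" else per_size

# PMD-size defaults to "global", so it tracks the global toggle:
assert effective_policy("global", "madvise", is_anon=True) == "madvise"
# Other sizes default to "never", regardless of the global toggle:
assert effective_policy("never", "always", is_anon=True) == "never"
# A non-"global" per-size value does not affect file-backed memory:
assert effective_policy("never", "always", is_anon=False) == "always"
```

This captures both answers above: question 1 (keep the name vague) and question 2 (file-backed decisions stay glued to the global control).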
On 31.10.23 12:55, Ryan Roberts wrote:
> On 31/10/2023 11:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>> [...]
>> [...]
>
> Also, an implementation question:
>
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:

The whole DAX "hugepage" and THP mixup is just plain confusing. We're simply using PUD/PMD mappings of DAX memory, and PMD/PTE-remap when required (VMA split I assume, COW).

It doesn't result in any memory waste, so who really cares how it's mapped? Apparently we want individual processes to just disable PMD/PUD mappings of DAX using the prctl and madvise. Maybe there are good reasons.

Looks like a design decision, probably some legacy leftovers.
On 31/10/2023 11:58, David Hildenbrand wrote:
> On 31.10.23 12:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>> [...]
>>
>> Hi David,
>
> Hi!
>
>> I've just started coding this, and it occurs to me that I might need a small
>> clarification here; the existing global "enabled" control is used to drive
>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>> now).
>
> Anon was (way) first, and pagecache later decided to reuse that one as an
> indication whether larger folios are desired.
>
> For the pagecache, it's just a way to enable/disable it globally. As there is no
> memory waste, nobody currently really cares about the exact sizes the pagecache
> is allocating (maybe that will change at some point, maybe not, who knows).

Yup. It's not _just_ about allocation though; it's also about collapse (MADV_COLLAPSE, khugepaged), which is supported for pagecache pages. I can imagine value in collapsing to various sizes that are beneficial for HW... anyway, that's for another day.

>> 1) Is this potentially confusing for the user? Should we rename the per-size
>> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
>> we can reuse the same control for file-backed memory in future?
>
> The latter would be my take. Just like we did with the global toggle.

ACK

>> 2) The global control will continue to drive the file-backed memory decision
>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>
> That would be my take; it will allocate other sizes already, so just glue it to
> the global toggle and document for the other toggles that they only control
> anonymous THP for now.

ACK
On 31/10/2023 12:03, David Hildenbrand wrote:
> On 31.10.23 12:55, Ryan Roberts wrote:
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>> [...]
>>> [...]
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
>> regardless. Is that by design? I couldn't fathom any reasoning from the
>> commit log:
>
> The whole DAX "hugepage" and THP mixup is just plain confusing. We're simply
> using PUD/PMD mappings of DAX memory, and PMD/PTE-remap when required (VMA
> split I assume, COW).
>
> It doesn't result in any memory waste, so who really cares how it's mapped?
> Apparently we want individual processes to just disable PMD/PUD mappings of DAX
> using the prctl and madvise. Maybe there are good reasons.
>
> Looks like a design decision, probably some legacy leftovers.

OK, I'll ensure I keep this behaviour. Thanks!
On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 31/10/2023 11:50, Ryan Roberts wrote:
> > On 06/10/2023 21:06, David Hildenbrand wrote:
> > [...]
> >> [...]
> > [...]
>
> Also, an implementation question:
>
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:

The enabled="never" is for anonymous VMAs; DAX VMAs are typically file VMAs.

> bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> 			bool smaps, bool in_pf, bool enforce_sysfs)
> {
> 	if (!vma->vm_mm)		/* vdso */
> 		return false;
>
> 	/*
> 	 * Explicitly disabled through madvise or prctl, or some
> 	 * architectures may disable THP for some mappings, for
> 	 * example, s390 kvm.
> 	 */
> 	if ((vm_flags & VM_NOHUGEPAGE) ||
> 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> 		return false;
>
> 	/*
> 	 * If the hardware/firmware marked hugepage support disabled.
> 	 */
> 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> 		return false;
>
> 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> 	if (vma_is_dax(vma))
> 		return in_pf;			<<<<<<<<
>
> 	...
> }
On 31/10/2023 18:29, Yang Shi wrote:
> On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>> [...]
>>> [...]
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
>> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:
>
> The enabled="never" is for anonymous VMAs; DAX VMAs are typically file VMAs.

That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and, more recently, for anything that implements huge_fault() - see 7a81751fcdeb833acc858e59082688e3020bfe12).
On Wed, Nov 1, 2023 at 7:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 31/10/2023 18:29, Yang Shi wrote:
> > On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> [...]
> >>
> >> Also, an implementation question:
> >>
> >> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> >> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
> >> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:
> >
> > The enabled="never" is for anonymous VMAs; DAX VMAs are typically file VMAs.
>
> That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file
> VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and, more recently, for

When READ_ONLY_THP_FOR_FS was implemented, file THP could only be collapsed by khugepaged, and khugepaged is started iff enabled != "never". So READ_ONLY_THP_FOR_FS has to honor it. Unfortunately there are some confusing exceptions... But anyway, DAX is not in the same class.

> anything that implements huge_fault() - see
> 7a81751fcdeb833acc858e59082688e3020bfe12).

IIUC this commit just gives the VMAs which implement huge_fault() a chance to handle the fault. Currently just DAX VMAs implement huge_fault() in the vanilla kernel AFAICT.
On 9/29/23 4:44 AM, Ryan Roberts wrote:
> Hi All,
>
> This is v6 of a series to implement variable order, large folios for anonymous
> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
> objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
...
>
> The major change in this revision is the addition of sysfs controls to allow
> this "small-order THP" to be enabled/disabled/configured independently of
> PMD-order THP. The approach I've taken differs a bit from previous discussions;
> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
> personally think this makes things clearer and more extensible. See [6] for
> detailed rationale.

Hi Ryan and all,

I've done some initial performance testing of this patchset on an arm64 SBSA server. When these patches are combined with the arm64 arch contpte patches in Ryan's git tree (he has conveniently combined everything here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on some memory-intensive workloads. Many test runs, conducted independently by different engineers and on different machines, have convinced me and my colleagues that this is an accurate result.

In order to achieve that result, we used the git tree in [1] with the following settings:

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders

This was on an aarch64 machine configured to use a 64KB base page size. That configuration means that the PMD size is 512MB, which is of course too large for practical use as a pure PMD-THP. However, with these small-size (less than PMD-sized) THPs, we get the improvements in TLB coverage, while still getting pages that are small enough to be effectively usable.
These results are admittedly limited to aarch64 CPUs so far (because the contpte TLB coalescing behavior plays a big role), but it's nice to see real performance numbers from real computers. Up until now, there has been some healthy discussion and debate about various aspects of this patchset. This data point shows that at least for some types of memory-intensive workloads (and I apologize for being vague, at this point, about exactly *which* workloads), the performance gains are really worth it: ~10x ! [1] https://gitlab.arm.com/linux-arm/linux-rr.git (branch: features/granule_perf/anonfolio-v6-contpte-v2) thanks,
On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
> I've done some initial performance testing of this patchset on an arm64
> SBSA server. When these patches are combined with the arm64 arch contpte
> patches in Ryan's git tree (he has conveniently combined everything
> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
> some memory-intensive workloads. Many test runs, conducted independently
> by different engineers and on different machines, have convinced me and
> my colleagues that this is an accurate result.
>
> In order to achieve that result, we used the git tree in [1] with the
> following settings:
>
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>
> This was on an aarch64 machine configured to use a 64KB base page size.
> That configuration means that the PMD size is 512MB, which is of course
> too large for practical use as a pure PMD-THP. However, with these
> small-size (less than PMD-sized) THPs, we get the improvements in TLB
> coverage, while still getting pages that are small enough to be
> effectively usable.

That is quite remarkable!

My hope is to abolish the 64kB page size configuration. i.e., instead of using the mixture of page sizes that you currently are -- 64k and 1M (right? Order-0, and order-4) -- that 4k, 64k and 2MB (order-0, order-4 and order-9) will provide better performance.

Have you run any experiments with a 4kB page size?
On 13/11/2023 05:18, Matthew Wilcox wrote:
> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>> [...]
>
> That is quite remarkable!

Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!

> My hope is to abolish the 64kB page size configuration. i.e., instead of
> using the mixture of page sizes that you currently are -- 64k and
> 1M (right? Order-0, and order-4)

Not quite; the contpte-size for a 64K page size is 2M/order-5. (And yes, it is 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that intuitively you would expect the order to remain constant, but it doesn't.)

The "recommend" setting above will actually enable order-3 as well, even though there is no HW benefit to this. So the full set of available memory sizes here is:

64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13

> , that 4k, 64k and 2MB (order-0,
> order-4 and order-9) will provide better performance.
>
> Have you run any experiments with a 4kB page size?

Agree that would be interesting with 64K small-sized THP enabled. And I'd love to get to a world where we universally deal in variable sized chunks of memory, aligned on 4K boundaries.

In my experience though, there are still some performance benefits to 64K base pages vs 4K+contpte; the page tables are more cache efficient for the former case - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the latter. In practice the HW will still only read 8 bytes in the latter, but that's taking up a full cache line, vs the former where a single cache line stores 8x 64K entries.

Thanks,
Ryan
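The geometry and page-table arithmetic quoted above are easy to sanity-check (pure arithmetic; the spans are the architectural values mentioned in the thread):

```python
# Sanity-checking the contpte geometry discussed above: the arm64
# contiguous-bit span depends on the base page size, so the folio *order*
# is not constant across base page sizes.

def order(region: int, page_size: int) -> int:
    """Folio order needed to map `region` bytes with `page_size` pages."""
    n = region // page_size
    assert n & (n - 1) == 0, "must be a power-of-two number of pages"
    return n.bit_length() - 1

K, M = 1024, 1024 * 1024
assert order(64 * K, 4 * K) == 4     # 4K pages:  contpte span is 64K
assert order(2 * M, 16 * K) == 7     # 16K pages: contpte span is 2M
assert order(2 * M, 64 * K) == 5     # 64K pages: contpte span is 2M
assert order(512 * M, 64 * K) == 13  # 64K pages: PMD size is 512M

# The page-table cache-footprint point: mapping 64K of memory takes one
# 8-byte PTE with a 64K base page, but sixteen PTEs (128 bytes) with 4K
# base pages, even though the HW reads only 8 bytes per translation.
PTE_SIZE = 8
assert (64 * K // (64 * K)) * PTE_SIZE == 8
assert (64 * K // (4 * K)) * PTE_SIZE == 128
# A single 64-byte cache line of 64K PTEs maps 8 x 64K = 512K of memory:
assert (64 // PTE_SIZE) * 64 * K == 512 * K
```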
On 2023/11/13 18:19, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>> [...]
>>
>> That is quite remarkable!
>
> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>
>> My hope is to abolish the 64kB page size configuration. i.e., instead of
>> using the mixture of page sizes that you currently are -- 64k and
>> 1M (right? Order-0, and order-4)
>
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (And yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't.)
>
> The "recommend" setting above will actually enable order-3 as well, even though
> there is no HW benefit to this. So the full set of available memory sizes here
> is:
>
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>
>> , that 4k, 64k and 2MB (order-0,
>> order-4 and order-9) will provide better performance.
>>
>> Have you run any experiments with a 4kB page size?
>
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world where we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
>
> In my experience though, there are still some performance benefits to 64K base
> pages vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter, but that's
> taking up a full cache line, vs the former where a single cache line stores 8x
> 64K entries.

We tested some benchmarks, e.g. unixbench, lmbench, sysbench, with v5 on an arm64 board (for better evaluation of anon large folios, we used ext4, which doesn't support large folios for now); we will test again and send the results once v7 is out.

1) base page 4k + without anon large folio
2) base page 64k + without anon large folio
3) base page 4k + with anon large folio + cont-pte (order = 4,0)

Most of the test results from v5 show that 3) has a good improvement vs 1), but is still lower than 2); also, for some latency-sensitive benchmarks, 2) and 3) may have poor performance vs 1).

Note, pcp_allowed_order requires order <= PAGE_ALLOC_COSTLY_ORDER=3; for 3), we may enlarge it for better scalability of page allocation on arm64. We didn't test this on v5, but will try enlarging it for v7.
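For context on the pcp_allowed_order note above, here is a simplified model of that check (loosely based on mm/page_alloc.c; exact behaviour varies by kernel version and configuration). The point is that an order-4 contpte folio on 4K base pages falls outside the per-CPU page list fast path:

```python
# Simplified model of the kernel's pcp_allowed_order() (loosely based on
# mm/page_alloc.c; details vary by version/config): per-CPU page lists
# serve orders up to PAGE_ALLOC_COSTLY_ORDER, plus the pageblock (PMD)
# order when THP is enabled.
PAGE_ALLOC_COSTLY_ORDER = 3
PAGEBLOCK_ORDER = 9          # 2M pageblocks with 4K base pages

def pcp_allowed_order(order: int, thp_enabled: bool = True) -> bool:
    if order <= PAGE_ALLOC_COSTLY_ORDER:
        return True
    return thp_enabled and order == PAGEBLOCK_ORDER

assert pcp_allowed_order(0) and pcp_allowed_order(3)
# Order-4 (64K contpte folios on 4K pages) misses the PCP fast path,
# which is why enlarging the limit is being considered for scalability:
assert not pcp_allowed_order(4)
assert pcp_allowed_order(9)
```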
On 13/11/2023 11:52, Kefeng Wang wrote:
> On 2023/11/13 18:19, Ryan Roberts wrote:
>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>> [...]
>> [...]
>
> We tested some benchmarks, e.g. unixbench, lmbench, sysbench, with v5 on
> an arm64 board (for better evaluation of anon large folios, we used ext4,
> which doesn't support large folios for now); we will test again and send
> the results once v7 is out.

Thanks for the testing and for posting the insights!

> 1) base page 4k + without anon large folio
> 2) base page 64k + without anon large folio
> 3) base page 4k + with anon large folio + cont-pte (order = 4,0)
>
> Most of the test results from v5 show that 3) has a good improvement
> vs 1), but is still lower than 2)

Do you have any understanding of what the shortfall is for these particular workloads? Certainly the cache spatial locality benefit of the 64K page tables could be a factor. But certainly for the workloads I've been looking at, a bigger factor is often the fact that executable file-backed memory (elf segments) is not in 64K folios and therefore not contpte-mapped. If the iTLB is under pressure this can help a lot.

I have a change (hack) to force all executable mappings to be read-ahead into 64K folios and this gives an improvement. But obviously that only works when the file system supports large folios (so not ext4 right now).

It would certainly be interesting to see just how close to native 64K we can get when employing these extra ideas.

> ; also, for some latency-sensitive
> benchmarks, 2) and 3) may have poor performance vs 1).
>
> Note, pcp_allowed_order requires order <= PAGE_ALLOC_COSTLY_ORDER=3; for
> 3), we may enlarge it for better scalability of page allocation on arm64.
> We didn't test this on v5, but will try enlarging it for v7.

Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable to be rebased on v6.7-rc1. I'd be interested to see your results.
On 11/13/23 2:19 AM, Ryan Roberts wrote: > On 13/11/2023 05:18, Matthew Wilcox wrote: >> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote: >>> I've done some initial performance testing of this patchset on an arm64 >>> SBSA server. When these patches are combined with the arm64 arch contpte >>> patches in Ryan's git tree (he has conveniently combined everything >>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on >>> some memory-intensive workloads. Many test runs, conducted independently >>> by different engineers and on different machines, have convinced me and >>> my colleagues that this is an accurate result. >>> >>> In order to achieve that result, we used the git tree in [1] with >>> following settings: >>> >>> echo always >/sys/kernel/mm/transparent_hugepage/enabled >>> echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders >>> >>> This was on a aarch64 machine configure to use a 64KB base page size. >>> That configuration means that the PMD size is 512MB, which is of course >>> too large for practical use as a pure PMD-THP. However, with with these >>> small-size (less than PMD-sized) THPs, we get the improvements in TLB >>> coverage, while still getting pages that are small enough to be >>> effectively usable. >> >> That is quite remarkable! > > Yes, agreed - thanks for sharing these results! A very nice Monday morning boost! > >> >> My hope is to abolish the 64kB page size configuration. ie instead of

We've found that a 64KB base page size provides better performance for HPC and AI workloads than a 4KB base size, at least for these kinds of servers. In fact, the 4KB config is considered odd and I'd have to look around to get one. It's mostly a TLB coverage issue because, again, the problem typically has a very large memory footprint. So even though it would be nice from a software point of view, there's a real need for this.

>> using the mixture of page sizes that you currently are -- 64k and >> 1M (right?
Order-0, and order-4) > > Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is > 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that > intuitively you would expect the order to remain constant, but it doesn't). > > The "recommend" setting above will actually enable order-3 as well even though > there is no HW benefit to this. So the full set of available memory sizes here is: > > 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13

Yes, and to provide some further details about the test runs, I went so far as to test individual anon_orders (for example, anon_orders=0x20), in order to isolate behavior and see what's really going on. On this hardware, anything with 2MB page sizes (which corresponds to anon_orders=0x20, as I recall) or larger gets the 10x boost. It's an interesting on/off behavior. This particular server design and workload combination really prefers 2MB pages, even if they are held together with contpte instead of a real PMD entry.

> >> , that 4k, 64k and 2MB (order-0, >> order-4 and order-9) will provide better performance. >> >> Have you run any experiements with a 4kB page size? > > Agree that would be interesting with 64K small-sized THP enabled. And I'd love > to get to a world were we universally deal in variable sized chunks of memory, > aligned on 4K boundaries. > > In my experience though, there are still some performance benefits to 64K base > page vs 4K+contpte; the page tables are more cache efficient for the former case > - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the > latter. In practice the HW will still only read 8 bytes in the latter but that's > taking up a full cache line vs the former where a single cache line stores 8x > 64K entries. > > Thanks, > Ryan > thanks,
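John's anon_orders=0x20 experiment can be decoded mechanically: each set bit N enables order-N folios of size page_size << N. A small sketch of that decoding (note the anon_orders sysfs interface comes from this RFC series and may change, so treat these semantics as an assumption drawn from the thread, not a stable kernel ABI):

```python
# Decode an anon_orders bitmask into enabled folio sizes for a base page size.
# Assumption (from this thread): bit N set => order-N enabled, size = page << N.

def decode_anon_orders(mask, page_size):
    sizes = {}
    order = 0
    while mask:
        if mask & 1:
            sizes[order] = page_size << order
        mask >>= 1
        order += 1
    return sizes

KB = 1024

# John's isolated run: anon_orders=0x20 on a 64K-base-page kernel.
print(decode_anon_orders(0x20, 64 * KB))   # {5: 2097152} -> order-5 = 2M

# The "recommend" set Ryan lists for 64K pages: orders 0, 3, 5 and 13.
mask = (1 << 0) | (1 << 3) | (1 << 5) | (1 << 13)
for order, size in decode_anon_orders(mask, 64 * KB).items():
    print(f"order-{order}: {size // KB}K")
# order-0: 64K, order-3: 512K, order-5: 2048K, order-13: 524288K
```

This reproduces the 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13 set quoted above, and shows why 0x20 isolates exactly the 2M contpte size.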
On 2023/11/13 20:12, Ryan Roberts wrote: > On 13/11/2023 11:52, Kefeng Wang wrote: >> >> >> On 2023/11/13 18:19, Ryan Roberts wrote: >>> On 13/11/2023 05:18, Matthew Wilcox wrote: >>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote: >>>>> I've done some initial performance testing of this patchset on an arm64 >>>>> SBSA server. When these patches are combined with the arm64 arch contpte >>>>> patches in Ryan's git tree (he has conveniently combined everything >>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on >>>>> some memory-intensive workloads. Many test runs, conducted independently >>>>> by different engineers and on different machines, have convinced me and >>>>> my colleagues that this is an accurate result. >>>>> >>>>> In order to achieve that result, we used the git tree in [1] with >>>>> following settings: >>>>> >>>>> echo always >/sys/kernel/mm/transparent_hugepage/enabled >>>>> echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders >>>>> >>>>> This was on a aarch64 machine configure to use a 64KB base page size. >>>>> That configuration means that the PMD size is 512MB, which is of course >>>>> too large for practical use as a pure PMD-THP. However, with with these >>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB >>>>> coverage, while still getting pages that are small enough to be >>>>> effectively usable. >>>> >>>> That is quite remarkable! >>> >>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost! >>> >>>> >>>> My hope is to abolish the 64kB page size configuration. ie instead of >>>> using the mixture of page sizes that you currently are -- 64k and >>>> 1M (right? Order-0, and order-4) >>> >>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is >>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that >>> intuitively you would expect the order to remain constant, but it doesn't). 
>>> >>> The "recommend" setting above will actually enable order-3 as well even though >>> there is no HW benefit to this. So the full set of available memory sizes here >>> is: >>> >>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13 >>> >>>> , that 4k, 64k and 2MB (order-0, >>>> order-4 and order-9) will provide better performance. >>>> >>>> Have you run any experiements with a 4kB page size? >>> >>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love >>> to get to a world were we universally deal in variable sized chunks of memory, >>> aligned on 4K boundaries. >>> >>> In my experience though, there are still some performance benefits to 64K base >>> page vs 4K+contpte; the page tables are more cache efficient for the former case >>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the >>> latter. In practice the HW will still only read 8 bytes in the latter but that's >>> taking up a full cache line vs the former where a single cache line stores 8x >>> 64K entries. >> >> We test some benchmark, eg, unixbench, lmbench, sysbench, with v5 on >> arm64 board(for better evaluation of anon large folio, using ext4, >> which don't support large folio for now), will test again and send >> the results once v7 out. > > Thanks for the testing and for posting the insights! > >> >> 1) base page 4k + without anon large folio >> 2) base page 64k + without anon large folio >> 3) base page 4k + with anon large folio + cont-pte(order = 4,0) >> >> Most of the test results from v5 show the 3) have a good improvement >> vs 1), but still low than 2) > > Do you have any understanding what the shortfall is for these particular > workloads? Certainly the cache spatial locality benefit of the 64K page tables > could be a factor. But certainly for the workloads I've been looking at, a > bigger factor is often the fact that executable file-backed memory (elf > segments) are not in 64K folios and therefore not contpte-mapped. 
If the iTLB is > under pressure this can help a lot. I have a change (hack) to force all > executable mappings to be read-ahead into 64K folios and this gives an > improvement. But obviously that only works when the file system supports large > folios (so not ext4 right now). It would certainly be interesting to see just > how close to native 64K we can get when employing these extra ideas.

No detailed analysis, but with base page 64k:
- fewer page faults
- fewer TLB operations
- less zone-lock contention (pcp)
- less buddy split/merge
- no reclaim/compaction when allocating a 64k page, and no fallback logic
- exec folios
- faster page table operations?
...

> >> , also for some latency-sensitive >> benchmark, 2) and 3) maybe have poor performance vs 1). >> >> Note, for pcp_allowed_order, order <= PAGE_ALLOC_COSTLY_ORDER=3, for >> 3), we maybe enlarge it for better scalability when page allocation >> on arm64, not test on v5, will try to enlarge it on v7. > > Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable > to be rebased on v6.7-rc1. I'd be interested to see your results. >

Glad to see it.

>> >>> >>> Thanks, >>> Ryan >>> >>> >>> > >
On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote: > On 13/11/2023 05:18, Matthew Wilcox wrote: > > My hope is to abolish the 64kB page size configuration. ie instead of > > using the mixture of page sizes that you currently are -- 64k and > > 1M (right? Order-0, and order-4) > > Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is > 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that > intuitively you would expect the order to remain constant, but it doesn't). > > The "recommend" setting above will actually enable order-3 as well even though > there is no HW benefit to this. So the full set of available memory sizes here is: > > 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13 > > > , that 4k, 64k and 2MB (order-0, > > order-4 and order-9) will provide better performance. > > > > Have you run any experiements with a 4kB page size? > > Agree that would be interesting with 64K small-sized THP enabled. And I'd love > to get to a world were we universally deal in variable sized chunks of memory, > aligned on 4K boundaries. > > In my experience though, there are still some performance benefits to 64K base > page vs 4K+contpte; the page tables are more cache efficient for the former case > - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the > latter. In practice the HW will still only read 8 bytes in the latter but that's > taking up a full cache line vs the former where a single cache line stores 8x > 64K entries. This is going to depend on your workload though -- if you're using more 2MB than 64kB, you get to elide a layer of page table with 4k base, rather than taking up 4 cache lines with a 64k base.
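Ryan's point that "you would expect the order to remain constant, but it doesn't" falls out of the arm64 contiguous-bit geometry: the number of contiguous entries per granule differs. A sketch, where the per-granule entry counts (16 at 4K, 128 at 16K, 32 at 64K) are an assumption drawn from the arm64 architecture rather than stated in the thread:

```python
# Why the contpte order is not constant across base page sizes (arm64).
# Assumed contiguous-bit entry counts: 4K -> 16, 16K -> 128, 64K -> 32.

CONT_PTES = {4 * 1024: 16, 16 * 1024: 128, 64 * 1024: 32}

for page, n in sorted(CONT_PTES.items()):
    span = page * n                 # bytes covered by one contiguous run
    order = n.bit_length() - 1      # log2 of the entry count
    print(f"{page >> 10}K base: contpte span {span >> 10}K = order-{order}")
# 4K base: contpte span 64K = order-4
# 16K base: contpte span 2048K = order-7
# 64K base: contpte span 2048K = order-5
```

This matches the 64K/order-4, 2M/order-7 and 2M/order-5 triples Ryan gives above: the span converges on 2M for the larger granules, so the order shrinks as the base page grows.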
On 13/11/2023 15:04, Matthew Wilcox wrote: > On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote: >> On 13/11/2023 05:18, Matthew Wilcox wrote: >>> My hope is to abolish the 64kB page size configuration. ie instead of >>> using the mixture of page sizes that you currently are -- 64k and >>> 1M (right? Order-0, and order-4) >> >> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is >> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that >> intuitively you would expect the order to remain constant, but it doesn't). >> >> The "recommend" setting above will actually enable order-3 as well even though >> there is no HW benefit to this. So the full set of available memory sizes here is: >> >> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13 >> >>> , that 4k, 64k and 2MB (order-0, >>> order-4 and order-9) will provide better performance. >>> >>> Have you run any experiements with a 4kB page size? >> >> Agree that would be interesting with 64K small-sized THP enabled. And I'd love >> to get to a world were we universally deal in variable sized chunks of memory, >> aligned on 4K boundaries. >> >> In my experience though, there are still some performance benefits to 64K base >> page vs 4K+contpte; the page tables are more cache efficient for the former case >> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the >> latter. In practice the HW will still only read 8 bytes in the latter but that's >> taking up a full cache line vs the former where a single cache line stores 8x >> 64K entries. > > This is going to depend on your workload though -- if you're using more > 2MB than 64kB, you get to elide a layer of page table with 4k base, > rather than taking up 4 cache lines with a 64k base.

True, but again depending on workload/config, you may have fewer levels of lookup for the 64K native case in the first place because you consume more VA bits at each level.
On Tue, Nov 14, 2023 at 10:57:07AM +0000, Ryan Roberts wrote: > On 13/11/2023 15:04, Matthew Wilcox wrote: > > On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote: > >> On 13/11/2023 05:18, Matthew Wilcox wrote: > >>> My hope is to abolish the 64kB page size configuration. ie instead of > >>> using the mixture of page sizes that you currently are -- 64k and > >>> 1M (right? Order-0, and order-4) > >> > >> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is > >> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that > >> intuitively you would expect the order to remain constant, but it doesn't). > >> > >> The "recommend" setting above will actually enable order-3 as well even though > >> there is no HW benefit to this. So the full set of available memory sizes here is: > >> > >> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13 > >> > >>> , that 4k, 64k and 2MB (order-0, > >>> order-4 and order-9) will provide better performance. > >>> > >>> Have you run any experiements with a 4kB page size? > >> > >> Agree that would be interesting with 64K small-sized THP enabled. And I'd love > >> to get to a world were we universally deal in variable sized chunks of memory, > >> aligned on 4K boundaries. > >> > >> In my experience though, there are still some performance benefits to 64K base > >> page vs 4K+contpte; the page tables are more cache efficient for the former case > >> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the > >> latter. In practice the HW will still only read 8 bytes in the latter but that's > >> taking up a full cache line vs the former where a single cache line stores 8x > >> 64K entries. > > > > This is going to depend on your workload though -- if you're using more > > 2MB than 64kB, you get to elide a layer of page table with 4k base, > > rather than taking up 4 cache lines with a 64k base. 
> > True, but again depending on workload/config, you may have few levels of lookup > for the 64K native case in the first place because you consume more VA bits at > each level. Sorry, missed this email ... let's work it through. With 4k, and a 48-bit VA space, we get 12 bits at the lowest level, then 9 bits each layer, so 4 * 9 + 12 = 48. With a 2MB allocation, we eliminate the bottom layer and examine three cachelines to find it (PGD entry, PUD entry, PMD entry) With 64k, we get 16 bits at the lowest level, then 13 bits each layer, so 3 * 13 + 16 = 55. With a 2MB allocation, we can't eliminate the bottom layer, so we still have to examine three cachelines to find it (PGD, PMD, PTE). If you can fit into a 42-bit address space, you can reduce it by one cache miss, but my impression is that applications which use 21 bits of address space for a single allocation want more address space than your average application.