Message ID: 20200928175428.4110504-1-zi.yan@sent.com (mailing list archive)
Series: 1GB PUD THP support on x86_64
On Mon 28-09-20 13:53:58, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> Hi all,
>
> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
>
> Other than PUD THP, we had some discussion on generating THPs and contiguous
> physical memory via a synchronous system call [0]. I am planning to send out a
> separate patchset on it later, since I feel that it can be done independently of
> PUD THP support.

While the technical challenges for the kernel implementation can be
discussed before the user API is decided, I believe we cannot simply add
something now and then decide about a proper interface. I have raised a few
basic questions we should find answers for before any interface is added.
Let me copy them here for easier reference:

- THP allocation time - #PF and/or madvise context
- lazy/sync instantiation
- huge page sizes controllable by the userspace?
- aggressiveness - how hard to try
- internal fragmentation - allow to create THPs on sparsely or unpopulated
  ranges
- do we need some sort of access control or privilege check as some THPs
  would be a really scarce resource (like those that require pre-reservation).
On 30 Sep 2020, at 7:55, Michal Hocko wrote:
> On Mon 28-09-20 13:53:58, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
[...]
> While the technical challenges for the kernel implementation can be
> discussed before the user API is decided, I believe we cannot simply add
> something now and then decide about a proper interface. I have raised a few
> basic questions we should find answers for before any interface is added.
> Let me copy them here for easier reference

Sure. Thank you for doing this.

For this new interface, I think it should generate THPs out of populated
memory regions synchronously. It would be a complement to khugepaged, which
generates THPs asynchronously in the background.

> - THP allocation time - #PF and/or madvise context

I am not sure this is relevant, since the new interface is supposed to
operate on populated memory regions. For THP allocation, madvise and
the options from /sys/kernel/mm/transparent_hugepage/defrag should give
enough choices to users.

> - lazy/sync instantiation

I would say the new interface only does sync instantiation. madvise has
provided the lazy instantiation option by adding MADV_HUGEPAGE to populated
memory regions and letting khugepaged generate THPs from them.

> - huge page sizes controllable by the userspace?

It might be good to allow advanced users to choose the page sizes, so they
have better control of their applications. For normal users, we can provide
best-effort service. Different options can be provided for these two cases.

The new interface might want to inform the user how many THPs are generated
after the call, so they can decide what to do with the memory region.

> - aggressiveness - how hard to try

The new interface would try as hard as it can, since I assume users really
want THPs when they use this interface.

> - internal fragmentation - allow to create THPs on sparsely or unpopulated
>   ranges

The new interface would only operate on populated memory regions. A
MAP_POPULATE-like option can be added if necessary.

> - do we need some sort of access control or privilege check as some THPs
>   would be a really scarce resource (like those that require pre-reservation).

It seems too much to me. I suppose if we provide page size options to users
when generating THPs, user apps could coordinate among themselves. BTW, do we
have access control for hugetlb pages? If yes, we could borrow their method.

—
Best Regards,
Yan Zi
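For reference, a minimal userspace sketch of the existing "lazy" path
mentioned above: mark an aligned, populated region with MADV_HUGEPAGE and let
khugepaged (or the page-fault path, depending on the defrag setting) back it
with THPs. The sizes below are illustrative only.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SZ_2M (2UL << 20)

    int main(void)
    {
        void *buf;
        size_t len = 64 * SZ_2M;              /* 128MB, a multiple of the PMD size */

        if (posix_memalign(&buf, SZ_2M, len)) /* PMD-aligned region */
            return 1;

        madvise(buf, len, MADV_HUGEPAGE);     /* a hint only, not a guarantee */
        memset(buf, 0, len);                  /* populate; khugepaged may collapse later */
        return 0;
    }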
On Thu 01-10-20 11:14:14, Zi Yan wrote:
> On 30 Sep 2020, at 7:55, Michal Hocko wrote:
> > On Mon 28-09-20 13:53:58, Zi Yan wrote:
[...]
> Sure. Thank you for doing this.
>
> For this new interface, I think it should generate THPs out of populated
> memory regions synchronously. It would be a complement to khugepaged, which
> generates THPs asynchronously in the background.
>
> > - THP allocation time - #PF and/or madvise context
>
> I am not sure this is relevant, since the new interface is supposed to
> operate on populated memory regions. For THP allocation, madvise and
> the options from /sys/kernel/mm/transparent_hugepage/defrag should give
> enough choices to users.

OK, so no #PF, this makes things easier.

> > - lazy/sync instantiation
>
> I would say the new interface only does sync instantiation. madvise has
> provided the lazy instantiation option by adding MADV_HUGEPAGE to populated
> memory regions and letting khugepaged generate THPs from them.

OK

> > - huge page sizes controllable by the userspace?
>
> It might be good to allow advanced users to choose the page sizes, so they
> have better control of their applications.

Could you elaborate more? Those advanced users can use hugetlb, right?
They get a very good control over page size and pool preallocation etc.
So they can get what they need - assuming there is enough memory.

> For normal users, we can provide
> best-effort service. Different options can be provided for these two cases.

Do we really need two sync mechanisms to compact physical memory? This adds
API complexity because it has to cover all possible huge pages and that can
be a large set of sizes. We already have that choice for the hugetlb mmap
interface but that is needed to cover all existing setups. I would argue this
doesn't make the API particularly easy to use.

> The new interface might want to inform the user how many THPs are generated
> after the call, so they can decide what to do with the memory region.

Why would that be useful? /proc/<pid>/smaps should give a good picture
already, right?

> > - aggressiveness - how hard to try
>
> The new interface would try as hard as it can, since I assume users really
> want THPs when they use this interface.
>
> > - internal fragmentation - allow to create THPs on sparsely or unpopulated
> >   ranges
>
> The new interface would only operate on populated memory regions. A
> MAP_POPULATE-like option can be added if necessary.

OK, so initially you do not want to populate more memory. How do you envision
a future extension to provide such functionality? A different API, or a
modification to the existing one?
> > - do we need some sort of access control or privilege check as some THPs
> >   would be a really scarce resource (like those that require pre-reservation).
>
> It seems too much to me. I suppose if we provide page size options to users
> when generating THPs, user apps could coordinate among themselves. BTW, do we
> have access control for hugetlb pages? If yes, we could borrow their method.

We do not. Well, there is a hugetlb cgroup controller but I am not sure this
is the right method. The lack of hugetlb access control is a serious
shortcoming which has turned this interface into an "only first class
citizens" feature, requiring very close coordination with an admin.
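For reference, a rough userspace illustration of the /proc/<pid>/smaps check
mentioned above; it sums the AnonHugePages: counters to see how much of a
process is already backed by (PMD-sized) THP. The field name is as reported
by current kernels.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/smaps", "r");
        char line[256];
        unsigned long kb, total = 0;

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
                total += kb;                  /* accumulate per-VMA THP usage */
        fclose(f);
        printf("AnonHugePages total: %lu kB\n", total);
        return 0;
    }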
>>> - huge page sizes controllable by the userspace? >> >> It might be good to allow advanced users to choose the page sizes, so they >> have better control of their applications. > > Could you elaborate more? Those advanced users can use hugetlb, right? > They get a very good control over page size and pool preallocation etc. > So they can get what they need - assuming there is enough memory. > I am still not convinced that 1G THP (TGP :) ) are really what we want to support. I can understand that there are some use cases that might benefit from it, especially: "I want a lot of memory, give me memory in any granularity you have, I absolutely don't care - but of course, more TGP might be good for performance." Say, you want a 5GB region, but only have a single 1GB hugepage lying around. hugetlbfs allocation will fail. But then, do we really want to optimize for such (very special?) use cases via " 58 files changed, 2396 insertions(+), 460 deletions(-)" ? I think gigantic pages are a sparse resource. Only selected applications *really* depend on them and benefit from them. Let these special applications handle it explicitly. Can we have a summary of use cases that would really benefit from this change?
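As a point of comparison, the explicit route mentioned above (letting special
applications handle gigantic pages themselves) looks roughly like the sketch
below with hugetlbfs. It assumes 1GB pages were pre-reserved, e.g. by booting
with hugepagesz=1G hugepages=N or via
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages, and that the
MAP_HUGE_1GB encoding is available (defined manually here in case the libc
headers do not expose it).

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_1GB                 /* 30 == log2(1GB), encoded at MAP_HUGE_SHIFT (26) */
    #define MAP_HUGE_1GB (30 << 26)
    #endif

    int main(void)
    {
        size_t len = 1UL << 30;          /* one gigantic page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);

        if (p == MAP_FAILED) {           /* fails unless a free 1GB page is reserved */
            perror("mmap");
            return 1;
        }

        munmap(p, len);
        return 0;
    }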
On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>> - huge page sizes controllable by the userspace?
> >>
> >> It might be good to allow advanced users to choose the page sizes, so they
> >> have better control of their applications.
> >
> > Could you elaborate more? Those advanced users can use hugetlb, right?
> > They get a very good control over page size and pool preallocation etc.
> > So they can get what they need - assuming there is enough memory.
> >
> I am still not convinced that 1G THP (TGP :) ) are really what we want
> to support. I can understand that there are some use cases that might
> benefit from it, especially:

Well, I would say that internal support for larger huge pages (e.g. 1GB)
that can transparently split under memory pressure is a useful
functionality. I cannot really judge how complex that would be, considering
that 2MB THP have turned out to be quite a pain, but the situation has
settled over time. Maybe our current code base is prepared for that much
better.

Exposing that interface to the userspace is a different story of course.
I do agree that we likely do not want to be very explicit about that.
E.g. an interface for address space defragmentation without any more
specifics sounds like a useful feature to me. It will be up to the
kernel to decide which huge pages to use.
On 02.10.20 10:10, Michal Hocko wrote: > On Fri 02-10-20 09:50:02, David Hildenbrand wrote: >>>>> - huge page sizes controllable by the userspace? >>>> >>>> It might be good to allow advanced users to choose the page sizes, so they >>>> have better control of their applications. >>> >>> Could you elaborate more? Those advanced users can use hugetlb, right? >>> They get a very good control over page size and pool preallocation etc. >>> So they can get what they need - assuming there is enough memory. >>> >> >> I am still not convinced that 1G THP (TGP :) ) are really what we want >> to support. I can understand that there are some use cases that might >> benefit from it, especially: > > Well, I would say that internal support for larger huge pages (e.g. 1GB) > that can transparently split under memory pressure is a useful > funtionality. I cannot really judge how complex that would be Right, but that's then something different than serving (scarce, unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing wrong about *real* THP support, meaning, e.g., grouping consecutive pages and converting them back and forth on demand. (E.g., 1GB -> multiple 2MB -> multiple single pages), for example, when having to migrate such a gigantic page. But that's very different from our existing gigantic page code as far as I can tell. > consideting that 2MB THP have turned out to be quite a pain but > situation has settled over time. Maybe our current code base is prepared > for that much better. > > Exposing that interface to the userspace is a different story of course. > I do agree that we likely do not want to be very explicit about that. > E.g. an interface for address space defragmentation without any more > specifics sounds like a useful feature to me. It will be up to the > kernel to decide which huge pages to use. Yes, I think one important feature would be that we don't end up placing a gigantic page where only a handful of pages are actually populated without green light from the application - because that's what some user space applications care about (not consuming more memory than intended. IIUC, this is also what this patch set does). I'm fine with placing gigantic pages if it really just "defragments" the address space layout, without filling unpopulated holes. Then, this would be mostly invisible to user space, and we really wouldn't have to care about any configuration.
On 2 Oct 2020, at 4:30, David Hildenbrand wrote: > On 02.10.20 10:10, Michal Hocko wrote: >> On Fri 02-10-20 09:50:02, David Hildenbrand wrote: >>>>>> - huge page sizes controllable by the userspace? >>>>> >>>>> It might be good to allow advanced users to choose the page sizes, so they >>>>> have better control of their applications. >>>> >>>> Could you elaborate more? Those advanced users can use hugetlb, right? >>>> They get a very good control over page size and pool preallocation etc. >>>> So they can get what they need - assuming there is enough memory. >>>> >>> >>> I am still not convinced that 1G THP (TGP :) ) are really what we want >>> to support. I can understand that there are some use cases that might >>> benefit from it, especially: >> >> Well, I would say that internal support for larger huge pages (e.g. 1GB) >> that can transparently split under memory pressure is a useful >> funtionality. I cannot really judge how complex that would be > > Right, but that's then something different than serving (scarce, > unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing > wrong about *real* THP support, meaning, e.g., grouping consecutive > pages and converting them back and forth on demand. (E.g., 1GB -> > multiple 2MB -> multiple single pages), for example, when having to > migrate such a gigantic page. But that's very different from our > existing gigantic page code as far as I can tell. Serving 1GB PUD THPs from CMA is a compromise, since we do not want to bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator, which needs section size increase. In addition, unmoveable pages cannot be allocated in CMA, so allocating 1GB pages has much higher chance from it than from ZONE_NORMAL. >> consideting that 2MB THP have turned out to be quite a pain but >> situation has settled over time. Maybe our current code base is prepared >> for that much better. I am planning to refactor my code further to reduce the amount of the added code, since PUD THP is very similar to PMD THP. One thing I want to achieve is to enable split_huge_page to split any order of pages to a group of any lower order of pages. A lot of code in this patchset is replicating the same behavior of PMD THP at PUD level. It might be possible to deduplicate most of the code. >> >> Exposing that interface to the userspace is a different story of course. >> I do agree that we likely do not want to be very explicit about that. >> E.g. an interface for address space defragmentation without any more >> specifics sounds like a useful feature to me. It will be up to the >> kernel to decide which huge pages to use. > > Yes, I think one important feature would be that we don't end up placing > a gigantic page where only a handful of pages are actually populated > without green light from the application - because that's what some user > space applications care about (not consuming more memory than intended. > IIUC, this is also what this patch set does). I'm fine with placing > gigantic pages if it really just "defragments" the address space layout, > without filling unpopulated holes. > > Then, this would be mostly invisible to user space, and we really > wouldn't have to care about any configuration. I agree that the interface should be as simple as no configuration to most users. But I also wonder why we have hugetlbfs to allow users to specify different kinds of page sizes, which seems against the discussion above. Are we assuming advanced users should always use hugetlbfs instead of THPs? 
— Best Regards, Yan Zi
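For context, a back-of-the-envelope restatement of the MAX_ORDER remark in
the message above (x86_64 with 4KB base pages; the exact constraints depend
on the configuration):

    /*
     * 1GB / 4KB            = 262144 pages = 2^18  -> an order-18 allocation
     * default MAX_ORDER    = 11            -> largest buddy block is order-10 = 4MB
     *
     * So serving 1GB pages straight from the buddy allocator would need
     * MAX_ORDER raised to roughly 19-20 (and, with SPARSEMEM, a section size
     * of at least 1GB), which is why the patchset falls back to CMA for the
     * actual allocation.
     */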
On 2 Oct 2020, at 3:50, David Hildenbrand wrote: >>>> - huge page sizes controllable by the userspace? >>> >>> It might be good to allow advanced users to choose the page sizes, so they >>> have better control of their applications. >> >> Could you elaborate more? Those advanced users can use hugetlb, right? >> They get a very good control over page size and pool preallocation etc. >> So they can get what they need - assuming there is enough memory. >> > > I am still not convinced that 1G THP (TGP :) ) are really what we want > to support. I can understand that there are some use cases that might > benefit from it, especially: > > "I want a lot of memory, give me memory in any granularity you have, I > absolutely don't care - but of course, more TGP might be good for > performance." Say, you want a 5GB region, but only have a single 1GB > hugepage lying around. hugetlbfs allocation will fail. > > > But then, do we really want to optimize for such (very special?) use > cases via " 58 files changed, 2396 insertions(+), 460 deletions(-)" ? I am planning to further refactor my code to reduce the size and make it more general to support any size of THPs. As Matthew’s patchset[1] is removing kernel’s THP size assumption, it might be a good time to make THP support more general. > > I think gigantic pages are a sparse resource. Only selected applications > *really* depend on them and benefit from them. Let these special > applications handle it explicitly. > > Can we have a summary of use cases that would really benefit from this > change? For large machine learning applications, 1GB pages give good performance boost[2]. NVIDIA DGX A100 box now has 1TB memory, which means 1GB pages are not that sparse in GPU-equipped infrastructure[3]. In addition, @Roman Gushchin should be able to provide a more concrete story from his side. [1] https://lore.kernel.org/linux-mm/20200908195539.25896-1-willy@infradead.org/ [2] http://learningsys.org/neurips19/assets/papers/18_CameraReadySubmission_MLSys_NeurIPS_2019.pdf [3] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf — Best Regards, Yan Zi
On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote: > On 2 Oct 2020, at 4:30, David Hildenbrand wrote: > > Yes, I think one important feature would be that we don't end up placing > > a gigantic page where only a handful of pages are actually populated > > without green light from the application - because that's what some user > > space applications care about (not consuming more memory than intended. > > IIUC, this is also what this patch set does). I'm fine with placing > > gigantic pages if it really just "defragments" the address space layout, > > without filling unpopulated holes. > > > > Then, this would be mostly invisible to user space, and we really > > wouldn't have to care about any configuration. > > I agree that the interface should be as simple as no configuration to > most users. But I also wonder why we have hugetlbfs to allow users to > specify different kinds of page sizes, which seems against the discussion > above. Are we assuming advanced users should always use hugetlbfs instead > of THPs? Evolution doesn't always produce the best outcomes ;-) A perennial mistake we've made is "Oh, this is a strange & new & weird feature that most applications will never care about, let's put it in hugetlbfs where nobody will notice and we don't have to think about it in the core VM" And then what was initially strange & new & weird gradually becomes something that most applications just want to have happen automatically, and telling them all to go use hugetlbfs becomes untenable, so we move the feature into the core VM. It is absurd that my phone is attempting to manage a million 4kB pages. I think even trying to manage a quarter-million 16kB pages is too much work, and really it would be happier managing 65,000 64kB pages. Extend that into the future a decade or two, and we'll be expecting that it manages memory in megabyte sized units and uses PMD and PUD mappings by default. PTE mappings will still be used, but very much on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into smaller pages to not waste too much memory when mapping it" basis. So, yeah, PUD sized mappings have problems today, but we should be writing software now so a Pixel 15 in a decade can boot a kernel built five years from now and have PUD mappings Just Work without requiring the future userspace programmer to "use hugetlbfs". One of the longer-term todo items is to support variable sized THPs for anonymous memory, just like I've done for the pagecache. With that in place, I think scaling up from PMD sized pages to PUD sized pages starts to look more natural. Itanium and PA-RISC (two architectures that will never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards. The RiscV spec you pointed me at the other day confines itself to adding support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB sizes would be possible additions in the future. But, back to today, what to do with this patchset? Even on my 16GB laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB page is ever the right decision to make. But my laptop runs a "mixed" workload, and if you could convince me that Firefox would run 10% faster by using a 1GB page as its in-memory cache, well, I'd be sold. I do like having the kernel figure out what's in the best interests of the system as a whole. Apps don't have enough information, and while they can provide hints, they're often wrong. So, let's say an app maps 8GB of anonymous memory. 
As the app accesses it, we should probably start by allocating 4kB pages to back that memory. As time goes on and that memory continues to be accessed and more memory is accessed, it makes sense to keep track of that, replacing the existing 4kB pages with, say, 16-64kB pages and allocating newly accessed memory with larger pages. Eventually that should grow to 2MB allocations and PMD mappings. And then continue on, all the way to 1GB pages. We also need to be able to figure out that it's not being effective any more. One of the issues with tracing accessed/dirty at the 1GB level is that writing an entire 1GB page is going to take 0.25 seconds on a x4 gen3 PCIe link. I know swapping sucks, but that's extreme. So to use 1GB pages effectively today, we need to fragment them before choosing to swap them out (*) Maybe that's the point where we can start to say "OK, this sized mapping might not be effective any more". On the other hand, that might not work for some situations. Imagine, eg, a matrix multiply (everybody's favourite worst-case scenario). C = A * B where each of A, B and C is too large to fit in DRAM. There are going to be points of the calculation where each element of A is going to be walked sequentially, and so it'd be nice to use larger PTEs to map it, but then we need to destroy that almost immediately to allow other things to use the memory. I think I'm leaning towards not merging this patchset yet. I'm in agreement with the goals (allowing systems to use PUD-sized pages automatically), but I think we need to improve the infrastructure to make it work well automatically. Does that make sense? (*) It would be nice if hardware provided a way to track D/A on a sub-PTE level when using PMD/PUD sized mappings. I don't know of any that does that today.
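The 0.25-second figure above works out roughly as follows (approximate,
ignoring protocol overhead):

    /*
     * PCIe gen3 x4 ~ 8 GT/s per lane * 4 lanes * 128b/130b encoding / 8 bits per byte
     *              ~ 3.9 GB/s
     * 1 GB / 3.9 GB/s ~ 0.26 s to write back a single 1GB page to such a device
     */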
On Mon, Oct 05, 2020 at 04:55:53PM +0100, Matthew Wilcox wrote: > On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote: > > On 2 Oct 2020, at 4:30, David Hildenbrand wrote: > > > Yes, I think one important feature would be that we don't end up placing > > > a gigantic page where only a handful of pages are actually populated > > > without green light from the application - because that's what some user > > > space applications care about (not consuming more memory than intended. > > > IIUC, this is also what this patch set does). I'm fine with placing > > > gigantic pages if it really just "defragments" the address space layout, > > > without filling unpopulated holes. > > > > > > Then, this would be mostly invisible to user space, and we really > > > wouldn't have to care about any configuration. > > > > I agree that the interface should be as simple as no configuration to > > most users. But I also wonder why we have hugetlbfs to allow users to > > specify different kinds of page sizes, which seems against the discussion > > above. Are we assuming advanced users should always use hugetlbfs instead > > of THPs? > > Evolution doesn't always produce the best outcomes ;-) > > A perennial mistake we've made is "Oh, this is a strange & new & weird > feature that most applications will never care about, let's put it in > hugetlbfs where nobody will notice and we don't have to think about it > in the core VM" > > And then what was initially strange & new & weird gradually becomes > something that most applications just want to have happen automatically, > and telling them all to go use hugetlbfs becomes untenable, so we move > the feature into the core VM. > > It is absurd that my phone is attempting to manage a million 4kB pages. > I think even trying to manage a quarter-million 16kB pages is too much > work, and really it would be happier managing 65,000 64kB pages. > > Extend that into the future a decade or two, and we'll be expecting > that it manages memory in megabyte sized units and uses PMD and PUD > mappings by default. PTE mappings will still be used, but very much > on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into > smaller pages to not waste too much memory when mapping it" basis. So, > yeah, PUD sized mappings have problems today, but we should be writing > software now so a Pixel 15 in a decade can boot a kernel built five > years from now and have PUD mappings Just Work without requiring the > future userspace programmer to "use hugetlbfs". > > One of the longer-term todo items is to support variable sized THPs for > anonymous memory, just like I've done for the pagecache. With that in > place, I think scaling up from PMD sized pages to PUD sized pages starts > to look more natural. Itanium and PA-RISC (two architectures that will > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards. > The RiscV spec you pointed me at the other day confines itself to adding > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB > sizes would be possible additions in the future. +1 > But, back to today, what to do with this patchset? Even on my 16GB > laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB > page is ever the right decision to make. But my laptop runs a "mixed" > workload, and if you could convince me that Firefox would run 10% faster > by using a 1GB page as its in-memory cache, well, I'd be sold. > > I do like having the kernel figure out what's in the best interests of the > system as a whole. 
Apps don't have enough information, and while they > can provide hints, they're often wrong. It's definitely true for many cases, but not true for some other cases. For example, we're running hhvm ( https://hhvm.com/ ) on a large number of machines. Hhvm is known to have a significant performance benefit when using hugepages. Exact numbers depend on the exact workload and configuration, but there is a noticeable difference (in single digits of percents) between using 4k pages only, 4k pages and 2MB pages, and 4k, 2MB and some 1GB pages. As now, we have to use hugetlbfs mostly because of the lack of 1GB THP support. It has some significant downsides: e.g. hugetlb memory is not properly accounted on a memory cgroup level, it requires additional "management", etc. If we could allocate 1GB THPs with something like new madvise, having all memcg stats working and destroy them transparently on the application exit, it's already valuable. > So, let's say an app maps 8GB > of anonymous memory. As the app accesses it, we should probably start > by allocating 4kB pages to back that memory. As time goes on and that > memory continues to be accessed and more memory is accessed, it makes > sense to keep track of that, replacing the existing 4kB pages with, say, > 16-64kB pages and allocating newly accessed memory with larger pages. > Eventually that should grow to 2MB allocations and PMD mappings. > And then continue on, all the way to 1GB pages. > > We also need to be able to figure out that it's not being effective > any more. One of the issues with tracing accessed/dirty at the 1GB level > is that writing an entire 1GB page is going to take 0.25 seconds on a x4 > gen3 PCIe link. I know swapping sucks, but that's extreme. So to use > 1GB pages effectively today, we need to fragment them before choosing to > swap them out (*) Maybe that's the point where we can start to say "OK, > this sized mapping might not be effective any more". On the other hand, > that might not work for some situations. Imagine, eg, a matrix multiply > (everybody's favourite worst-case scenario). C = A * B where each of A, > B and C is too large to fit in DRAM. There are going to be points of the > calculation where each element of A is going to be walked sequentially, > and so it'd be nice to use larger PTEs to map it, but then we need to > destroy that almost immediately to allow other things to use the memory. > > > I think I'm leaning towards not merging this patchset yet. Please, correct me if I'm wrong, but in my understanding the effort required for a proper 1 GB THP support can be roughly split into two parts: 1) technical support of PUD-sized THPs, 2) heuristics to create and destroy them automatically . The second part will likely require a lot of experimenting and fine-tuning and obviously depends on the working part 1. So I don't see why we should postpone the part 1, if only it doesn't add too much overhead (which is not the case, right?). If the problem is the introduction of a semi-dead code, we can put it under a config option (I would prefer to not do it though). > I'm in > agreement with the goals (allowing systems to use PUD-sized pages > automatically), but I think we need to improve the infrastructure to > make it work well automatically. Does that make sense? Is there a plan for this? How can we make sure there we're making a forward progress here? Thank you!
On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote: > On 2 Oct 2020, at 4:30, David Hildenbrand wrote: > > > On 02.10.20 10:10, Michal Hocko wrote: > >> On Fri 02-10-20 09:50:02, David Hildenbrand wrote: > >>>>>> - huge page sizes controllable by the userspace? > >>>>> > >>>>> It might be good to allow advanced users to choose the page sizes, so they > >>>>> have better control of their applications. > >>>> > >>>> Could you elaborate more? Those advanced users can use hugetlb, right? > >>>> They get a very good control over page size and pool preallocation etc. > >>>> So they can get what they need - assuming there is enough memory. > >>>> > >>> > >>> I am still not convinced that 1G THP (TGP :) ) are really what we want > >>> to support. I can understand that there are some use cases that might > >>> benefit from it, especially: > >> > >> Well, I would say that internal support for larger huge pages (e.g. 1GB) > >> that can transparently split under memory pressure is a useful > >> funtionality. I cannot really judge how complex that would be > > > > Right, but that's then something different than serving (scarce, > > unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing > > wrong about *real* THP support, meaning, e.g., grouping consecutive > > pages and converting them back and forth on demand. (E.g., 1GB -> > > multiple 2MB -> multiple single pages), for example, when having to > > migrate such a gigantic page. But that's very different from our > > existing gigantic page code as far as I can tell. > > Serving 1GB PUD THPs from CMA is a compromise, since we do not want to > bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator, > which needs section size increase. In addition, unmoveable pages cannot > be allocated in CMA, so allocating 1GB pages has much higher chance from > it than from ZONE_NORMAL. s/higher chances/non-zero chances Currently we have nothing that prevents the fragmentation of the memory with unmovable pages on the 1GB scale. It means that in a common case it's highly unlikely to find a continuous GB without any unmovable page. As now CMA seems to be the only working option. However it seems there are other use cases for the allocation of continuous 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using 1GB pages can reduce the fragmentation of the direct mapping. So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale. E.g. something like a second level of pageblocks. That would allow to group all unmovable memory in few 1GB blocks and have more 1GB regions available for gigantic THPs and other use cases. I'm looking now into how it can be done. If anybody has any ideas here, I'll appreciate a lot. Thanks!
On 05.10.20 19:16, Roman Gushchin wrote: > On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote: >> On 2 Oct 2020, at 4:30, David Hildenbrand wrote: >> >>> On 02.10.20 10:10, Michal Hocko wrote: >>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote: >>>>>>>> - huge page sizes controllable by the userspace? >>>>>>> >>>>>>> It might be good to allow advanced users to choose the page sizes, so they >>>>>>> have better control of their applications. >>>>>> >>>>>> Could you elaborate more? Those advanced users can use hugetlb, right? >>>>>> They get a very good control over page size and pool preallocation etc. >>>>>> So they can get what they need - assuming there is enough memory. >>>>>> >>>>> >>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want >>>>> to support. I can understand that there are some use cases that might >>>>> benefit from it, especially: >>>> >>>> Well, I would say that internal support for larger huge pages (e.g. 1GB) >>>> that can transparently split under memory pressure is a useful >>>> funtionality. I cannot really judge how complex that would be >>> >>> Right, but that's then something different than serving (scarce, >>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing >>> wrong about *real* THP support, meaning, e.g., grouping consecutive >>> pages and converting them back and forth on demand. (E.g., 1GB -> >>> multiple 2MB -> multiple single pages), for example, when having to >>> migrate such a gigantic page. But that's very different from our >>> existing gigantic page code as far as I can tell. >> >> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to >> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator, >> which needs section size increase. In addition, unmoveable pages cannot >> be allocated in CMA, so allocating 1GB pages has much higher chance from >> it than from ZONE_NORMAL. > > s/higher chances/non-zero chances Well, the longer the system runs (and consumes a significant amount of available main memory), the less likely it is. > > Currently we have nothing that prevents the fragmentation of the memory > with unmovable pages on the 1GB scale. It means that in a common case > it's highly unlikely to find a continuous GB without any unmovable page. > As now CMA seems to be the only working option. > And I completely dislike the use of CMA in this context (for example, allocating via CMA and freeing via the buddy by patching CMA when splitting up PUDs ...). > However it seems there are other use cases for the allocation of continuous > 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using > 1GB pages can reduce the fragmentation of the direct mapping. Yes, see RFC v1 where I already cced Mike. > > So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale. > E.g. something like a second level of pageblocks. That would allow to group > all unmovable memory in few 1GB blocks and have more 1GB regions available for > gigantic THPs and other use cases. I'm looking now into how it can be done. Anything bigger than sections is somewhat problematic: you have to track that data somewhere. It cannot be the section (in contrast to pageblocks) > If anybody has any ideas here, I'll appreciate a lot. I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That somewhat mimics what CMA does (when sized reasonably), works well with memory hot(un)plug, and is immune to misconfiguration. 
Within such a zone, we can try to optimize the placement of larger blocks.
>> I think gigantic pages are a sparse resource. Only selected applications
>> *really* depend on them and benefit from them. Let these special
>> applications handle it explicitly.
>>
>> Can we have a summary of use cases that would really benefit from this
>> change?
>
> For large machine learning applications, 1GB pages give good performance boost[2].
> NVIDIA DGX A100 box now has 1TB memory, which means 1GB pages are not
> that sparse in GPU-equipped infrastructure[3].

Well, they *are* sparse and there are absolutely no guarantees until you
reserve them via CMA, which is just plain ugly IMHO. In the same setup, you
can most probably use hugetlbfs and achieve a similar result. Not saying it
is very user-friendly.
>>> considering that 2MB THP have turned out to be quite a pain but
>>> situation has settled over time. Maybe our current code base is prepared
>>> for that much better.
>
> I am planning to refactor my code further to reduce the amount of
> the added code, since PUD THP is very similar to PMD THP. One thing
> I want to achieve is to enable split_huge_page to split any order of
> pages to a group of any lower order of pages. A lot of code in this
> patchset is replicating the same behavior of PMD THP at PUD level.
> It might be possible to deduplicate most of the code.
>
>>> Exposing that interface to the userspace is a different story of course.
>>> I do agree that we likely do not want to be very explicit about that.
>>> E.g. an interface for address space defragmentation without any more
>>> specifics sounds like a useful feature to me. It will be up to the
>>> kernel to decide which huge pages to use.
>>
>> Yes, I think one important feature would be that we don't end up placing
>> a gigantic page where only a handful of pages are actually populated
>> without green light from the application - because that's what some user
>> space applications care about (not consuming more memory than intended.
>> IIUC, this is also what this patch set does). I'm fine with placing
>> gigantic pages if it really just "defragments" the address space layout,
>> without filling unpopulated holes.
>>
>> Then, this would be mostly invisible to user space, and we really
>> wouldn't have to care about any configuration.
>
> I agree that the interface should be as simple as no configuration to
> most users. But I also wonder why we have hugetlbfs to allow users to
> specify different kinds of page sizes, which seems against the discussion
> above. Are we assuming advanced users should always use hugetlbfs instead
> of THPs?

Well, with hugetlbfs you get real control over which pagesizes to use.
No mixture, guarantees.

In some environments you might want to control which application gets
which pagesize. I know of database applications and hypervisors that
sometimes really want 2MB huge pages instead of 1GB huge pages. And
sometimes you really want/need 1GB huge pages (e.g., low-latency
applications, real-time KVM, ...).

Simple example: KVM with postcopy live migration

While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
on demand (via userfaultfd) is painfully slow / impractical.
On 5 Oct 2020, at 13:39, David Hildenbrand wrote:
[...]
> Well, with hugetlbfs you get real control over which pagesizes to use.
> No mixture, guarantees.
>
> In some environments you might want to control which application gets
> which pagesize. I know of database applications and hypervisors that
> sometimes really want 2MB huge pages instead of 1GB huge pages. And
> sometimes you really want/need 1GB huge pages (e.g., low-latency
> applications, real-time KVM, ...).
>
> Simple example: KVM with postcopy live migration
>
> While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
> on demand (via userfaultfd) is painfully slow / impractical.

The real control of hugetlbfs comes from the interfaces provided by
the kernel. If the kernel provides similar interfaces to control page sizes
of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
comes from system memory fragmentation, and hugetlbfs does not have this
mixture because of its special allocation pools, not because of the code
itself. If THPs were allocated from the same pools, they would act
the same as hugetlbfs. What am I missing here?

I just do not get why hugetlbfs is so special that it can have pagesize
fine control when normal pages cannot get. The "it should be invisible
to userspace" argument suddenly does not hold for hugetlbfs.

—
Best Regards,
Yan Zi
On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
> On 05.10.20 19:16, Roman Gushchin wrote:
[...]
> >>> However it seems there are other use cases for the allocation of continuous
> >>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> >>> 1GB pages can reduce the fragmentation of the direct mapping.
> >>
> >> Yes, see RFC v1 where I already cced Mike.
> >>
> >>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> >>> E.g. something like a second level of pageblocks. That would allow to group
> >>> all unmovable memory in few 1GB blocks and have more 1GB regions available for
> >>> gigantic THPs and other use cases. I'm looking now into how it can be done.
> > Anything bigger than sections is somewhat problematic: you have to track
> > that data somewhere. It cannot be the section (in contrast to pageblocks)

Well, it's not a large amount of data: the number of 1GB regions is not that
high even on very large machines.

> > > If anybody has any ideas here, I'll appreciate a lot.
>
> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
> somewhat mimics what CMA does (when sized reasonably), works well with
> memory hot(un)plug, and is immune to misconfiguration. Within such a
> zone, we can try to optimize the placement of larger blocks.

Thank you for pointing at it!

The main problem with it is the same as with ZONE_MOVABLE: it does require
a boot-time educated guess on a good size. I admit that the CMA does too.
But I really hope that a long-term solution will not require any
pre-configuration.

I do not see why fundamentally we can't group unmovable allocations in (a few)
1GB regions. Basically all we need to do is to choose a nearby 2MB block if we
don't have enough free pages in the unmovable free list and are going to steal
a new 2MB block. I know it doesn't work this way, but just as an illustration.
In reality, when stealing a block, under some conditions we might want to
steal the whole 1GB region. In this case the following unmovable allocations
will not lead to stealing new blocks from (potentially) different 1GB regions.

I have no working code yet, just thinking in this direction.

Thanks!
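To make the illustration above a bit more concrete, here is a purely
hypothetical sketch of that policy. None of these names or helpers exist in
the kernel; the sketch only shows the shape of the idea: when an unmovable
allocation has to steal a 2MB block, prefer a 1GB region that already hosts
unmovable memory, so clean regions stay usable for gigantic pages.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical bookkeeping for one 1GB region; not a real kernel structure. */
    struct giga_region {
        unsigned long start_pfn;
        bool has_unmovable;          /* already "spoiled" for gigantic allocations */
    };

    /* Hypothetical fallback choice for an unmovable allocation stealing a 2MB block. */
    struct giga_region *pick_region_for_unmovable(struct giga_region *r, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++)
            if (r[i].has_unmovable)
                return &r[i];        /* reuse an already-tainted 1GB region */

        if (n == 0)
            return NULL;

        r[0].has_unmovable = true;   /* otherwise taint a fresh region and remember it */
        return &r[0];
    }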
On 05.10.20 20:25, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>> On 05.10.20 19:16, Roman Gushchin wrote:
[...]
>>> However it seems there are other use cases for the allocation of continuous
>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>
>> Yes, see RFC v1 where I already cced Mike.
>>
>>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
>>> E.g. something like a second level of pageblocks. That would allow to group
>>> all unmovable memory in few 1GB blocks and have more 1GB regions available for
>>> gigantic THPs and other use cases. I'm looking now into how it can be done.
>> Anything bigger than sections is somewhat problematic: you have to track
>> that data somewhere. It cannot be the section (in contrast to pageblocks)
>
> Well, it's not a large amount of data: the number of 1GB regions is not that
> high even on very large machines.

Yes, but then you can have very sparse systems. And some use cases would
actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
optimizing memory efficiency by turning off banks and such ...

>>> If anybody has any ideas here, I'll appreciate a lot.
>>
>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>> somewhat mimics what CMA does (when sized reasonably), works well with
>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>> zone, we can try to optimize the placement of larger blocks.
>
> Thank you for pointing at it!
>
> The main problem with it is the same as with ZONE_MOVABLE: it does require
> a boot-time educated guess on a good size. I admit that the CMA does too.

"Educated guess" of ratios like 1:1, 1:2, and even 1:4 (known from
highmem times) are usually perfectly fine. And if you mess up - in
comparison to CMA - you won't shoot yourself in the foot, you get less
gigantic pages - which is usually better than before. I consider that a
clear win. Perfect? No. Can we be perfect? Unlikely.

In comparison to CMA / ZONE_MOVABLE, a bad guess won't cause instabilities.
> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If the kernel provides similar interfaces to control page sizes
> of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
> comes from system memory fragmentation, and hugetlbfs does not have this
> mixture because of its special allocation pools, not because of the code

With hugetlbfs, you have a guarantee that all pages within your VMA have the
same page size. This is an important property. With THP you have the
guarantee that any page can be operated on, as if it would be base-page
granularity.

Example: KVM on s390x

a) It cannot deal with THP. If you supply THP, the kernel will simply split
   up all THP and prohibit new ones from getting formed. All works well
   (well, no speedup because no THP).
b) It can deal with 1MB huge pages (in some configurations).
c) It cannot deal with 2G huge pages.

So user space really has to control which pagesize to use in case of
hugetlbfs.

> itself. If THPs were allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

Did I mention that I dislike taking THP from the CMA pool? ;)

> I just do not get why hugetlbfs is so special that it can have pagesize
> fine control when normal pages cannot get. The "it should be invisible
> to userspace" argument suddenly does not hold for hugetlbfs.

It's not about "cannot get", it's about "do we need it". We do have a
trigger "THP yes/no". I wonder in which cases that wouldn't be sufficient.

The name "Transparent" implies that they *should* be transparent to user
space. This, unfortunately, is not completely true:

1. Performance aspects: Breaking up THP is bad for performance. This can be
   observed fairly easily when using 4k-based memory ballooning in
   virtualized environments. If we stick to the current THP size (e.g., 2MB),
   we are mostly fine. Breaking up 1G THP into 2MB THP when required is
   completely acceptable.

2. Wasting memory: Touch a 4K page, get 2M populated. Somewhat acceptable /
   controllable. Touch 4K, get 1G populated is not desirable.

And I think we mostly agree that we should operate only on fully-populated
ranges to replace by 1G THP. But then, there is no observable difference
between 1G THP and 2M THP from a user space point of view, except
performance.

So we are debating about "Should the kernel tell us that we can use 1G THP
for a VMA". What if we were suddenly to support 2G THP (look at arm64 and
how they support all kinds of huge pages for hugetlbfs)? Do we really need
*another* trigger?

What Michal proposed (IIUC) is rather user space telling the kernel "this
large memory range here is *really* important for performance, please try
to optimize the memory layout, give me the best you've got".
MADV_HUGEPAGE_1GB is just ugly.
On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote: > On 05.10.20 20:25, Roman Gushchin wrote: > > On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote: > >> On 05.10.20 19:16, Roman Gushchin wrote: > >>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote: > >>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote: > >>>> > >>>>> On 02.10.20 10:10, Michal Hocko wrote: > >>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote: > >>>>>>>>>> - huge page sizes controllable by the userspace? > >>>>>>>>> > >>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they > >>>>>>>>> have better control of their applications. > >>>>>>>> > >>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right? > >>>>>>>> They get a very good control over page size and pool preallocation etc. > >>>>>>>> So they can get what they need - assuming there is enough memory. > >>>>>>>> > >>>>>>> > >>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want > >>>>>>> to support. I can understand that there are some use cases that might > >>>>>>> benefit from it, especially: > >>>>>> > >>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB) > >>>>>> that can transparently split under memory pressure is a useful > >>>>>> functionality. I cannot really judge how complex that would be > >>>>> > >>>>> Right, but that's then something different than serving (scarce, > >>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing > >>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive > >>>>> pages and converting them back and forth on demand. (E.g., 1GB -> > >>>>> multiple 2MB -> multiple single pages), for example, when having to > >>>>> migrate such a gigantic page. But that's very different from our > >>>>> existing gigantic page code as far as I can tell. > >>>> > >>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to > >>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator, > >>>> which needs section size increase. In addition, unmoveable pages cannot > >>>> be allocated in CMA, so allocating 1GB pages has much higher chance from > >>>> it than from ZONE_NORMAL. > >>> > >>> s/higher chances/non-zero chances > >> > >> Well, the longer the system runs (and consumes a significant amount of > >> available main memory), the less likely it is. > >> > >>> > >>> Currently we have nothing that prevents the fragmentation of the memory > >>> with unmovable pages on the 1GB scale. It means that in a common case > >>> it's highly unlikely to find a continuous GB without any unmovable page. > >>> As now CMA seems to be the only working option. > >>> > >> > >> And I completely dislike the use of CMA in this context (for example, > >> allocating via CMA and freeing via the buddy by patching CMA when > >> splitting up PUDs ...). > >> > >>> However it seems there are other use cases for the allocation of continuous > >>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using > >>> 1GB pages can reduce the fragmentation of the direct mapping. > >> > >> Yes, see RFC v1 where I already cced Mike. > >> > >>> > >>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale. > >>> E.g.
something like a second level of pageblocks. That would allow to group > >>> all unmovable memory in few 1GB blocks and have more 1GB regions available for > >>> gigantic THPs and other use cases. I'm looking now into how it can be done. > >> > >> Anything bigger than sections is somewhat problematic: you have to track > >> that data somewhere. It cannot be the section (in contrast to pageblocks) > > > > Well, it's not a large amount of data: the number of 1GB regions is not that > > high even on very large machines. > > Yes, but then you can have very sparse systems. And some use cases would > actually want to avoid fragmentation on smaller levels (e.g., 128MB) - > optimizing memory efficiency by turning off banks and such ... It's definitely a good question. > > > >> > >>> If anybody has any ideas here, I'll appreciate a lot. > >> > >> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That > >> somewhat mimics what CMA does (when sized reasonably), works well with > >> memory hot(un)plug, and is immune to misconfiguration. Within such a > >> zone, we can try to optimize the placement of larger blocks. > > > > Thank you for pointing at it! > > > > The main problem with it is the same as with ZONE_MOVABLE: it does require > > a boot-time educated guess on a good size. I admit that the CMA does too. > > "Educated guess" of ratios like 1:1, 1:2, and even 1:4 (known from > highmem times) are usually perfectly fine. And if you mess up - in > comparison to CMA - you won't shoot yourself in the foot, you get fewer > gigantic pages - which is usually better than before. I consider that a > clear win. Perfect? No. Can we be perfect? Unlikely. I'm not necessarily opposing your idea, I just think it will be tricky to not introduce an additional overhead if the ratio is not perfectly chosen. And there is simply a cost of adding a zone. But fundamentally we're speaking about the same thing: grouping pages by their movability on a smaller scale. With a new zone we'll split pages into two parts with a fixed border, with a new pageblock layer in 1GB blocks. I think the agreement is that we need such functionality. Thanks!
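To make the bookkeeping cost concrete, here is a purely hypothetical sketch (nothing from this thread or patchset; all names are invented) of what a per-1GB grouping table could look like. The point is that the raw data really is tiny; the open questions are rather where to hang it off in a sparse physical address space and how to teach the allocator to honor it.

/* Hypothetical per-1GB-region grouping metadata, for illustration only.
 * One byte per 1GB region is enough for a migratetype-like tag, so even
 * a 4TB machine needs only 4096 bytes of it. */
#include <stdint.h>
#include <stdio.h>

#define GIGA_SHIFT	30	/* 1GB regions */

enum giga_group {		/* invented names */
	GIGA_MOVABLE = 0,
	GIGA_UNMOVABLE,
};

/* Cover a 46-bit physical address space: 64KB of metadata in total. */
static uint8_t giga_group_map[1UL << (46 - GIGA_SHIFT)];

static inline unsigned long pfn_to_giga(unsigned long pfn)
{
	return pfn >> (GIGA_SHIFT - 12);	/* 4KB pages per 1GB region */
}

int main(void)
{
	unsigned long pfn = 0x123456;	/* arbitrary example pfn */

	giga_group_map[pfn_to_giga(pfn)] = GIGA_UNMOVABLE;
	printf("metadata for a 46-bit physical address space: %zu KB\n",
	       sizeof(giga_group_map) / 1024);
	return 0;
}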
On 5 Oct 2020, at 11:55, Matthew Wilcox wrote: > On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote: >> On 2 Oct 2020, at 4:30, David Hildenbrand wrote: >>> Yes, I think one important feature would be that we don't end up placing >>> a gigantic page where only a handful of pages are actually populated >>> without green light from the application - because that's what some user >>> space applications care about (not consuming more memory than intended. >>> IIUC, this is also what this patch set does). I'm fine with placing >>> gigantic pages if it really just "defragments" the address space layout, >>> without filling unpopulated holes. >>> >>> Then, this would be mostly invisible to user space, and we really >>> wouldn't have to care about any configuration. >> >> I agree that the interface should be as simple as no configuration to >> most users. But I also wonder why we have hugetlbfs to allow users to >> specify different kinds of page sizes, which seems against the discussion >> above. Are we assuming advanced users should always use hugetlbfs instead >> of THPs? > > Evolution doesn't always produce the best outcomes ;-) > > A perennial mistake we've made is "Oh, this is a strange & new & weird > feature that most applications will never care about, let's put it in > hugetlbfs where nobody will notice and we don't have to think about it > in the core VM" > > And then what was initially strange & new & weird gradually becomes > something that most applications just want to have happen automatically, > and telling them all to go use hugetlbfs becomes untenable, so we move > the feature into the core VM. > > It is absurd that my phone is attempting to manage a million 4kB pages. > I think even trying to manage a quarter-million 16kB pages is too much > work, and really it would be happier managing 65,000 64kB pages. > > Extend that into the future a decade or two, and we'll be expecting > that it manages memory in megabyte sized units and uses PMD and PUD > mappings by default. PTE mappings will still be used, but very much > on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into > smaller pages to not waste too much memory when mapping it" basis. So, > yeah, PUD sized mappings have problems today, but we should be writing > software now so a Pixel 15 in a decade can boot a kernel built five > years from now and have PUD mappings Just Work without requiring the > future userspace programmer to "use hugetlbfs". I agree. > > One of the longer-term todo items is to support variable sized THPs for > anonymous memory, just like I've done for the pagecache. With that in > place, I think scaling up from PMD sized pages to PUD sized pages starts > to look more natural. Itanium and PA-RISC (two architectures that will > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards. > The RiscV spec you pointed me at the other day confines itself to adding > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB > sizes would be possible additions in the future. Just to understand the todo items clearly. With your pagecache patchset, kernel should be able to understand variable sized THPs no matter they are anonymous or not, right? For anonymous memory, we need kernel policies to decide what THP sizes to use at allocation, what to do when under memory pressure, and so on. In terms of implementation, THP split function needs to support from any order to any lower order. Anything I am missing here? > > But, back to today, what to do with this patchset? 
Even on my 16GB > laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB > page is ever the right decision to make. But my laptop runs a "mixed" > workload, and if you could convince me that Firefox would run 10% faster > by using a 1GB page as its in-memory cache, well, I'd be sold. > > I do like having the kernel figure out what's in the best interests of the > system as a whole. Apps don't have enough information, and while they > can provide hints, they're often wrong. So, let's say an app maps 8GB > of anonymous memory. As the app accesses it, we should probably start > by allocating 4kB pages to back that memory. As time goes on and that > memory continues to be accessed and more memory is accessed, it makes > sense to keep track of that, replacing the existing 4kB pages with, say, > 16-64kB pages and allocating newly accessed memory with larger pages. > Eventually that should grow to 2MB allocations and PMD mappings. > And then continue on, all the way to 1GB pages. > > We also need to be able to figure out that it's not being effective > any more. One of the issues with tracing accessed/dirty at the 1GB level > is that writing an entire 1GB page is going to take 0.25 seconds on a x4 > gen3 PCIe link. I know swapping sucks, but that's extreme. So to use > 1GB pages effectively today, we need to fragment them before choosing to > swap them out (*) Maybe that's the point where we can start to say "OK, > this sized mapping might not be effective any more". On the other hand, > that might not work for some situations. Imagine, eg, a matrix multiply > (everybody's favourite worst-case scenario). C = A * B where each of A, > B and C is too large to fit in DRAM. There are going to be points of the > calculation where each element of A is going to be walked sequentially, > and so it'd be nice to use larger PTEs to map it, but then we need to > destroy that almost immediately to allow other things to use the memory. > > > I think I'm leaning towards not merging this patchset yet. I'm in > agreement with the goals (allowing systems to use PUD-sized pages > automatically), but I think we need to improve the infrastructure to > make it work well automatically. Does that make sense? I agree that this patchset should not be merged in the current form. I think PUD THP support is a part of variable sized THP support, but current form of the patchset does not have the “variable sized THP” spirit yet and is more like a special PUD case support. I guess some changes to existing THP code to make PUD THP less a special case would make the whole patchset more acceptable? Can you elaborate more on the infrastructure part? Thanks. > > (*) It would be nice if hardware provided a way to track D/A on a sub-PTE > level when using PMD/PUD sized mappings. I don't know of any that does > that today. I agree it would be a nice hardware feature, but it also has a high cost. Each TLB would support this with 1024 bits, which is about 16 TLB entry size, assuming each entry takes 8B space. Now it becomes why not having a bigger TLB. ;) — Best Regards, Yan Zi
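Spelling out the arithmetic behind the two figures above (a sketch; the bandwidth number assumes the raw x4 PCIe gen3 link rate with 128b/130b encoding and no protocol overhead, and the TLB estimate assumes one accessed plus one dirty bit per 2MB sub-region of a 1GB entry):

#include <stdio.h>

int main(void)
{
	/* x4 gen3: 4 lanes * 8 GT/s * 128/130 / 8 bits ~= 3.94e9 bytes/s */
	double link_bytes_per_s = 3.94e9;
	double gig = 1UL << 30;

	/* ~0.27s to write out one 1GB page -- Matthew's "0.25 seconds" */
	printf("1GB swap-out over x4 gen3: %.2f s\n", gig / link_bytes_per_s);

	/* 512 x 2MB sub-regions, 2 bits each = 1024 bits = 128 bytes,
	 * i.e. 16 times an 8-byte TLB entry -- Zi Yan's estimate. */
	int bits = 512 * 2;
	printf("sub-PTE A/D tracking: %d bits = %d bytes = %d x 8-byte entries\n",
	       bits, bits / 8, bits / 8 / 8);
	return 0;
}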
On Mon, Oct 05, 2020 at 03:12:55PM -0400, Zi Yan wrote: > On 5 Oct 2020, at 11:55, Matthew Wilcox wrote: > > One of the longer-term todo items is to support variable sized THPs for > > anonymous memory, just like I've done for the pagecache. With that in > > place, I think scaling up from PMD sized pages to PUD sized pages starts > > to look more natural. Itanium and PA-RISC (two architectures that will > > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards. > > The RiscV spec you pointed me at the other day confines itself to adding > > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB > > sizes would be possible additions in the future. > > Just to understand the todo items clearly. With your pagecache patchset, > kernel should be able to understand variable sized THPs no matter they > are anonymous or not, right? ... yes ... modulo bugs and places I didn't fix because only anonymous pages can get there ;-) There are still quite a few references to HPAGE_PMD_MASK / SIZE / NR and I couldn't swear that they're all related to things which are actually PMD sized. I did fix a couple of places where the anonymous path assumed that pages were PMD sized because I thought we'd probably want to do that sooner rather than later. > For anonymous memory, we need kernel policies > to decide what THP sizes to use at allocation, what to do when under > memory pressure, and so on. In terms of implementation, THP split function > needs to support from any order to any lower order. Anything I am missing here? I think that's the bulk of the work. The swap code also needs work so we don't have to split pages to swap them out. > > I think I'm leaning towards not merging this patchset yet. I'm in > > agreement with the goals (allowing systems to use PUD-sized pages > > automatically), but I think we need to improve the infrastructure to > > make it work well automatically. Does that make sense? > > I agree that this patchset should not be merged in the current form. > I think PUD THP support is a part of variable sized THP support, but > current form of the patchset does not have the “variable sized THP” > spirit yet and is more like a special PUD case support. I guess some > changes to existing THP code to make PUD THP less a special case would > make the whole patchset more acceptable? > > Can you elaborate more on the infrastructure part? Thanks. Oh, this paragraph was just summarising the above. We need to be consistently using thp_size() instead of HPAGE_PMD_SIZE, etc. I haven't put much effort yet into supporting pages which are larger than PMD-size -- that is, if a page is mapped with a PMD entry, we assume it's PMD-sized. Once we can allocate a larger-than-PMD sized page, that's off. I assume a lot of that is dealt with in your patchset, although I haven't audited it to check for that. > > (*) It would be nice if hardware provided a way to track D/A on a sub-PTE > > level when using PMD/PUD sized mappings. I don't know of any that does > > that today. > > I agree it would be a nice hardware feature, but it also has a high cost. > Each TLB would support this with 1024 bits, which is about 16 TLB entry size, > assuming each entry takes 8B space. Now it becomes why not having a bigger > TLB. ;) Oh, we don't have to track at the individual-page level for this to be useful. Let's take the RISC-V Sv39 page table entry format as an example: 63-54 attributes 53-28 PPN2 27-19 PPN1 18-10 PPN0 9-8 RSW 7-0 DAGUXWRV For a 2MB page, we currently insist that 18-10 are zero. 
If we repurpose eight of those nine bits as A/D bits, we can track at 512kB granularity. For 1GB pages, we can use 16 of the 18 bits to track A/D at 128MB granularity. It's not great, but it is quite cheap!
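A small sketch of that encoding (illustration only, not actual RISC-V page-table code; mark_sub_ad and SUB_AD_SHIFT are invented names): pick the sub-region an offset falls into and set its accessed/dirty pair in the otherwise-unused low PPN bits of the leaf entry, 4 pairs per 2MB entry (512KB granularity) or 8 pairs per 1GB entry (128MB granularity).

#include <stdint.h>
#include <stdio.h>

#define SUB_AD_SHIFT	10	/* bits 10+ are the unused PPN bits in a leaf megapage PTE */

/* Mark the sub-region covering 'offset' accessed, and optionally dirty,
 * for a leaf mapping of 'map_size' bytes split into 'nregions' tracked
 * sub-regions. */
static uint64_t mark_sub_ad(uint64_t pte, uint64_t offset,
			    uint64_t map_size, int nregions, int dirty)
{
	int idx = offset / (map_size / nregions);

	pte |= 1ULL << (SUB_AD_SHIFT + 2 * idx);		/* accessed */
	if (dirty)
		pte |= 1ULL << (SUB_AD_SHIFT + 2 * idx + 1);	/* dirty */
	return pte;
}

int main(void)
{
	/* 2MB mapping, write at offset 1.5MB -> 4th 512KB region dirtied */
	uint64_t pte = mark_sub_ad(0, 3 * (512 << 10) + 123, 2 << 20, 4, 1);
	printf("2MB PTE sub-A/D bits: 0x%llx\n",
	       (unsigned long long)(pte >> SUB_AD_SHIFT));

	/* 1GB mapping, read at offset 300MB -> 3rd 128MB region accessed */
	pte = mark_sub_ad(0, 300ULL << 20, 1ULL << 30, 8, 0);
	printf("1GB PTE sub-A/D bits: 0x%llx\n",
	       (unsigned long long)(pte >> SUB_AD_SHIFT));
	return 0;
}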
On 05.10.20 21:11, Roman Gushchin wrote: > On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote: >> On 05.10.20 20:25, Roman Gushchin wrote: >>> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote: >>>> On 05.10.20 19:16, Roman Gushchin wrote: >>>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote: >>>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote: >>>>>> >>>>>>> On 02.10.20 10:10, Michal Hocko wrote: >>>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote: >>>>>>>>>>>> - huge page sizes controllable by the userspace? >>>>>>>>>>> >>>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they >>>>>>>>>>> have better control of their applications. >>>>>>>>>> >>>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right? >>>>>>>>>> They get a very good control over page size and pool preallocation etc. >>>>>>>>>> So they can get what they need - assuming there is enough memory. >>>>>>>>>> >>>>>>>>> >>>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want >>>>>>>>> to support. I can understand that there are some use cases that might >>>>>>>>> benefit from it, especially: >>>>>>>> >>>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB) >>>>>>>> that can transparently split under memory pressure is a useful >>>>>>>> functionality. I cannot really judge how complex that would be >>>>>>> >>>>>>> Right, but that's then something different than serving (scarce, >>>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing >>>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive >>>>>>> pages and converting them back and forth on demand. (E.g., 1GB -> >>>>>>> multiple 2MB -> multiple single pages), for example, when having to >>>>>>> migrate such a gigantic page. But that's very different from our >>>>>>> existing gigantic page code as far as I can tell. >>>>>> >>>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to >>>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator, >>>>>> which needs section size increase. In addition, unmoveable pages cannot >>>>>> be allocated in CMA, so allocating 1GB pages has much higher chance from >>>>>> it than from ZONE_NORMAL. >>>>> >>>>> s/higher chances/non-zero chances >>>> >>>> Well, the longer the system runs (and consumes a significant amount of >>>> available main memory), the less likely it is. >>>> >>>>> >>>>> Currently we have nothing that prevents the fragmentation of the memory >>>>> with unmovable pages on the 1GB scale. It means that in a common case >>>>> it's highly unlikely to find a continuous GB without any unmovable page. >>>>> As now CMA seems to be the only working option. >>>>> >>>> >>>> And I completely dislike the use of CMA in this context (for example, >>>> allocating via CMA and freeing via the buddy by patching CMA when >>>> splitting up PUDs ...). >>>> >>>>> However it seems there are other use cases for the allocation of continuous >>>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using >>>>> 1GB pages can reduce the fragmentation of the direct mapping. >>>> >>>> Yes, see RFC v1 where I already cced Mike. >>>> >>>>> >>>>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale. 
>>>>> E.g. something like a second level of pageblocks. That would allow to group >>>>> all unmovable memory in few 1GB blocks and have more 1GB regions available for >>>>> gigantic THPs and other use cases. I'm looking now into how it can be done. >>>> >>>> Anything bigger than sections is somewhat problematic: you have to track >>>> that data somewhere. It cannot be the section (in contrast to pageblocks) >>> >>> Well, it's not a large amount of data: the number of 1GB regions is not that >>> high even on very large machines. >> >> Yes, but then you can have very sparse systems. And some use cases would >> actually want to avoid fragmentation on smaller levels (e.g., 128MB) - >> optimizing memory efficiency by turning off banks and such ... > > It's definitely a good question. Oh, and I forgot that there might be users that want bigger granularity :) (primarily, memory hotunplug that wants to avoid ZONE_MOVABLE but still have higher chances to eventually unplug some memory) > >>> >>>> >>>>> If anybody has any ideas here, I'll appreciate a lot. >>>> >>>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That >>>> somewhat mimics what CMA does (when sized reasonably), works well with >>>> memory hot(un)plug, and is immune to misconfiguration. Within such a >>>> zone, we can try to optimize the placement of larger blocks. >>> >>> Thank you for pointing at it! >>> >>> The main problem with it is the same as with ZONE_MOVABLE: it does require >>> a boot-time educated guess on a good size. I admit that the CMA does too. >> >> "Educated guess" of ratios like 1:1, 1:2, and even 1:4 (known from >> highmem times) are usually perfectly fine. And if you mess up - in >> comparison to CMA - you won't shoot yourself in the foot, you get fewer >> gigantic pages - which is usually better than before. I consider that a >> clear win. Perfect? No. Can we be perfect? Unlikely. > > I'm not necessarily opposing your idea, I just think it will be tricky > to not introduce an additional overhead if the ratio is not perfectly > chosen. And there is simply a cost of adding a zone. Not sure this will be really visible - and if your kernel requires more than 20%..50% unmovable data then something is usually really fishy/special. The nice thing is that Linux will try to "auto-optimize" within each zone already. My gut feeling is that it's way easier to teach Linux (add zone, add mmop_type, build zonelists, split memory similar to movablecore) - however, that doesn't imply that it's better. We'll have to see. > > But fundamentally we're speaking about the same thing: grouping pages > by their movability on a smaller scale. With a new zone we'll split > pages into two parts with a fixed border, with a new pageblock layer > in 1GB blocks. I also discussed moving the border on demand, which is way more tricky and would definitely be stuff for the future. There are some papers about similar fragmentation-avoidance techniques, mostly in the context of energy efficiency IIRC. Especially: - PALLOC: https://ieeexplore.ieee.org/document/6925999 - Adaptive-buddy: https://ieeexplore.ieee.org/document/7397629?reload=true&arnumber=7397629 IIRC, the problem with such approaches is that they are quite invasive and degrade some workloads due to overhead. > > I think the agreement is that we need such functionality. Yeah, on my long todo list. I'll be prototyping ZONE_PREFER_MOVABLE soon, to see how it looks/feels/performs.
On Mon 05-10-20 14:05:17, Zi Yan wrote: > On 5 Oct 2020, at 13:39, David Hildenbrand wrote: > > >>>> consideting that 2MB THP have turned out to be quite a pain but > >>>> situation has settled over time. Maybe our current code base is prepared > >>>> for that much better. > >> > >> I am planning to refactor my code further to reduce the amount of > >> the added code, since PUD THP is very similar to PMD THP. One thing > >> I want to achieve is to enable split_huge_page to split any order of > >> pages to a group of any lower order of pages. A lot of code in this > >> patchset is replicating the same behavior of PMD THP at PUD level. > >> It might be possible to deduplicate most of the code. > >> > >>>> > >>>> Exposing that interface to the userspace is a different story of course. > >>>> I do agree that we likely do not want to be very explicit about that. > >>>> E.g. an interface for address space defragmentation without any more > >>>> specifics sounds like a useful feature to me. It will be up to the > >>>> kernel to decide which huge pages to use. > >>> > >>> Yes, I think one important feature would be that we don't end up placing > >>> a gigantic page where only a handful of pages are actually populated > >>> without green light from the application - because that's what some user > >>> space applications care about (not consuming more memory than intended. > >>> IIUC, this is also what this patch set does). I'm fine with placing > >>> gigantic pages if it really just "defragments" the address space layout, > >>> without filling unpopulated holes. > >>> > >>> Then, this would be mostly invisible to user space, and we really > >>> wouldn't have to care about any configuration. > >> > >> > >> I agree that the interface should be as simple as no configuration to > >> most users. But I also wonder why we have hugetlbfs to allow users to > >> specify different kinds of page sizes, which seems against the discussion > >> above. Are we assuming advanced users should always use hugetlbfs instead > >> of THPs? > > > > Well, with hugetlbfs you get a real control over which pagesizes to use. > > No mixture, guarantees. > > > > In some environments you might want to control which application gets > > which pagesize. I know of database applications and hypervisors that > > sometimes really want 2MB huge pages instead of 1GB huge pages. And > > sometimes you really want/need 1GB huge pages (e.g., low-latency > > applications, real-time KVM, ...). > > > > Simple example: KVM with postcopy live migration > > > > While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages > > on demand (via userfaultdfd) is a painfully slow / impractical. > > > The real control of hugetlbfs comes from the interfaces provided by > the kernel. If kernel provides similar interfaces to control page sizes > of THPs, it should work the same as hugetlbfs. Mixing page sizes usually > comes from system memory fragmentation and hugetlbfs does not have this > mixture because of its special allocation pools not because of the code > itself. Not really. Hugetlb is defined to provide a consistent and single page size access to the memory. To the degree that you fail early if you cannot guarantee that. This is not an implementation detail. This is the semantic of the feature. Control goes along with the interface. > If THPs are allocated from the same pools, they would act > the same as hugetlbfs. What am I missing here? THPs are a completely different beast. 
They are aiming to be transparent so that the user doesn't really have to control them explicitly[1]. They should be dynamically created and demoted as the system manages resources behind the user's back. In short they optimize rather than guarantee. This is also the reason why complete control sounds quite alien to me. Say you explicitly ask for THP_SIZEFOO but the kernel decides a completely different size later on. What is the actual contract you as a user are getting? In an ideal world the kernel would pick up the best large page automagically. I am a bit skeptical we will reach such an enlightenment soon (if ever) so a certain level of hinting is likely needed to prevent the 2MB THP fiasco again [1]. But the control should correspond to the functionality users are getting. > I just do not get why hugetlbfs is so special that it can have pagesize > fine control when normal pages cannot get. The "it should be invisible > to userspace" argument suddenly does not hold for hugetlbfs. In short it provides a guarantee. Does the above clarify it a bit? [1] this is not entirely true though because there is a non-trivial admin interface around THP. Mostly because they turned out to be too transparent and many people do care about internal fragmentation, allocation latency, locality (small page on a local node or a large one on a slightly further one?) or simply follow a cargo cult - just have a look how many admin guides recommend disabling THPs. We got seriously burned by 2MB THP because of the way they were enforced on users.
From: Zi Yan <ziy@nvidia.com> Hi all, This patchset adds support for 1GB PUD THP on x86_64. It is on top of v5.9-rc5-mmots-2020-09-18-21-23. It is also available at: https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23 Other than PUD THP, we had some discussion on generating THPs and contiguous physical memory via a synchronous system call [0]. I am planning to send out a separate patchset on it later, since I feel that it can be done independently of PUD THP support. Any comment or suggestion is welcome. Thanks. Motivation ==== The patchset is trying to provide a more transparent way of boosting virtual memory performance by leveraging gigantic TLB entries compared to hugetlbfs pages [1,2]. Roman also said he would provide performance numbers of using 1GB PUD THP once the patchset is in a relatively good shape [1]. Patchset organization: ==== 1. Patch 1 and 2: Jason's PUD entry READ_ONCE patch to walk_page_range to give a consistent read of PUD entries during lockless page table walks. I also add a PMD entry READ_ONCE patch, since PMD level walk_page_range has the same lockless behavior as PUD level. 2. Patch 3: THP page table deposit now uses a single linked list to enable hierarchical page table deposit, i.e., depositing a PMD page to which 512 PTE pages are deposited. Every page table page has a deposit_head and a deposit_node. For example, when storing 512 PTE pages to a PMD page, the PMD page's deposit_head links to a PTE page's deposit_node, which links to another PTE page's deposit_node. 3. Patch 4,5,6: helper functions for allocating page table pages for PUD THPs and changes to thp_order and thp_nr. 4. Patch 7 to 23: PUD THP implementation. It is broken into small patches for easy review. 5. Patch 24, 25: new page size encoding for MADV_HUGEPAGE and MADV_NOHUGEPAGE in madvise. User can specify THP size. Only MADV_HUGEPAGE_1GB is accepted. VM_HUGEPAGE_PUD is added to vm_flags to store the information at bit 37. You are welcome to suggest any other approach. 6. Patch 26, 27: enable_pud_thp and hpage_pud_size are added to /sys/kernel/mm/transparent_hugepage/. enable_pud_thp is set to never by default. 7. Patch 28, 29: PUD THPs are allocated only from boot-time reserved CMA regions. The CMA regions can be used for other movable page allocations. Design for PUD-, PMD-, and PTE-mapped PUD THP ==== One additional design compared to PMD THP is the support for PMD-mapped PUD THP, since the original THP design supports PUD-mapped and PTE-mapped PUD THP automatically. PMD mapcounts are stored at (512*N + 3) subpages (N = 0 to 511) and 512*N subpages are called PMDPageInPUD. A PUDDoubleMap bit is stored at the third subpage of a PUD THP, using the same page flag position as DoubleMap (stored at the second subpage of a PMD THP), to indicate a PUD THP with both PUD and PMD mappings. A PUD THP looks like: ┌───┬───┬───┬───┬─────┬───┬───┬───┬───┬────────┬──────┐ │ H │ T │ T │ T │ ... │ T │ T │ T │ T │ ... │ T │ │ 0 │ 1 │ 2 │ 3 │ │512│513│514│515│ │262143│ └───┴───┴───┴───┴─────┴───┴───┴───┴───┴────────┴──────┘ PMDPageInPUD pages in a PUD THP (only the first two PMDPageInPUD pages are shown below). Note that PMDPageInPUD pages are identified by their relative position to the head page of the PUD THP and are still tail pages except the first one, so H_0, T_512, T_1024, ... T_512x511 are all PMDPageInPUD pages: ┌────────────┬──────────────┬────────────┬──────────────┬───────────────────┐ │PMDPageInPUD│ ... │PMDPageInPUD│ ...
│ the remaining │ │ page │ 511 subpages │ page │ 511 subpages │ 510x512 subpages │ └────────────┴──────────────┴────────────┴──────────────┴───────────────────┘ Mapcount positions: * For each subpage, its PTE mapcount is _mapcount, the same as PMD THP. * For PUD THP, its PUD-mapping uses compound_mapcount at T_1 the same as PMD THP. * For PMD-mapped PUD THP, its PMD-mapping uses compound_mapcount at T_3, T_515, ..., T_512x511+3. It is called sub_compound_mapcount. PUDDoubleMap and DoubleMap in PUD THP: * PUDDoubleMap is stored at the page flag of T_2 (third subpage), reusing the DoubleMap's position. * DoubleMap is stored at the page flags of T_1 (second subpage), T_513, ..., T_512x511+1. [0] https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/ [1] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/ [2] https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/ Changelog from RFC v1 ==== 1. Add Jason's PUD entry READ_ONCE patch and my PMD entry READ_ONCE patch to get consistent page table entry reading in lockless page table walks. 2. Use single linked list for page table page deposit instead of pagechain data structure from RFC v1. 3. Address Kirill's comments. 4. Remove PUD page allocation via alloc_contig_pages(), using cma_alloc only. 5. Add madvise flag MADV_HUGEPAGE_1GB to explicitly enable PUD THP on specific VMAs instead of reusing MADV_HUGEPAGE. A new vm_flags VM_HUGEPAGE_PUD is added to achieve this. 6. Break large patches in v1 into small ones for easy review. Jason Gunthorpe (1): mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked Zi Yan (29): mm: pagewalk: use READ_ONCE when reading the PMD entry unlocked mm: thp: use single linked list for THP page table page deposit. mm: add new helper functions to allocate one PMD page with 512 PTE pages. mm: thp: add page table deposit/withdraw functions for PUD THP. mm: change thp_order and thp_nr as we will have not just PMD THPs. mm: thp: add anonymous PUD THP page fault support without enabling it. mm: thp: add PUD THP support for copy_huge_pud. mm: thp: add PUD THP support to zap_huge_pud. fs: proc: add PUD THP kpageflag. mm: thp: handling PUD THP reference bit. mm: rmap: add mappped/unmapped page order to anonymous page rmap functions. mm: rmap: add map_order to page_remove_anon_compound_rmap. mm: thp: add PUD THP split_huge_pud_page() function. mm: thp: add PUD THP to deferred split list when PUD mapping is gone. mm: debug: adapt dump_page to PUD THP. mm: thp: PUD THP COW splits PUD page and falls back to PMD page. mm: thp: PUD THP follow_p*d_page() support. mm: stats: make smap stats understand PUD THPs. mm: page_vma_walk: teach it about PMD-mapped PUD THP. mm: thp: PUD THP support in try_to_unmap(). mm: thp: split PUD THPs at page reclaim. mm: support PUD THP pagemap support. mm: madvise: add page size options to MADV_HUGEPAGE and MADV_NOHUGEPAGE. mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37. mm: thp: add a global knob to enable/disable PUD THPs. mm: thp: make PUD THP size public. hugetlb: cma: move cma reserve function to cma.c. mm: thp: use cma reservation for pud thp allocation. mm: thp: enable anonymous PUD THP at page fault path. 
.../admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/mm/transhuge.rst | 1 + arch/arm64/mm/hugetlbpage.c | 2 +- arch/powerpc/mm/hugetlbpage.c | 2 +- arch/x86/include/asm/pgalloc.h | 69 ++ arch/x86/include/asm/pgtable.h | 26 + arch/x86/kernel/setup.c | 8 +- arch/x86/mm/pgtable.c | 38 + drivers/base/node.c | 3 + fs/proc/meminfo.c | 2 + fs/proc/page.c | 2 + fs/proc/task_mmu.c | 200 +++- include/linux/cma.h | 18 + include/linux/huge_mm.h | 84 +- include/linux/hugetlb.h | 12 - include/linux/memcontrol.h | 5 + include/linux/mm.h | 42 +- include/linux/mm_types.h | 11 +- include/linux/mmu_notifier.h | 13 + include/linux/mmzone.h | 1 + include/linux/page-flags.h | 48 + include/linux/pagewalk.h | 4 +- include/linux/pgtable.h | 34 + include/linux/rmap.h | 10 +- include/linux/swap.h | 2 + include/linux/vm_event_item.h | 7 + include/uapi/asm-generic/mman-common.h | 23 + include/uapi/linux/kernel-page-flags.h | 1 + kernel/events/uprobes.c | 4 +- kernel/fork.c | 10 +- mm/cma.c | 119 +++ mm/debug.c | 6 +- mm/gup.c | 60 +- mm/hmm.c | 16 +- mm/huge_memory.c | 899 +++++++++++++++++- mm/hugetlb.c | 117 +-- mm/khugepaged.c | 16 +- mm/ksm.c | 4 +- mm/madvise.c | 76 +- mm/mapping_dirty_helpers.c | 6 +- mm/memcontrol.c | 43 +- mm/memory.c | 28 +- mm/mempolicy.c | 29 +- mm/migrate.c | 12 +- mm/mincore.c | 10 +- mm/page_alloc.c | 53 +- mm/page_vma_mapped.c | 171 +++- mm/pagewalk.c | 47 +- mm/pgtable-generic.c | 49 +- mm/ptdump.c | 3 +- mm/rmap.c | 300 ++++-- mm/swap.c | 30 + mm/swap_slots.c | 2 + mm/swapfile.c | 11 +- mm/userfaultfd.c | 2 +- mm/util.c | 22 +- mm/vmscan.c | 33 +- mm/vmstat.c | 8 + 58 files changed, 2396 insertions(+), 460 deletions(-) -- 2.28.0
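For completeness, a rough sketch of how user space would exercise the interface this series adds. It assumes a kernel built with these patches, uapi headers that define MADV_HUGEPAGE_1GB (added by the madvise patch of this series; not in mainline), enable_pud_thp switched away from its default "never" (patch 26), and a CMA area reserved at boot (patches 28 and 29). Alignment handling is omitted, although a real user would likely need a 1GB-aligned range.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <linux/mman.h>	/* MADV_HUGEPAGE_1GB comes from the patched uapi headers */

#ifndef MADV_HUGEPAGE_1GB
#error "needs the uapi headers patched by this series"
#endif

int main(void)
{
	size_t len = 1UL << 30;
	/* Ordinary anonymous memory, no MAP_HUGETLB involved. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Per the cover letter, this sets VM_HUGEPAGE_PUD on the VMA so the
	 * fault path may allocate a 1GB PUD THP from the reserved CMA area. */
	if (madvise(p, len, MADV_HUGEPAGE_1GB)) {
		perror("madvise(MADV_HUGEPAGE_1GB)");
		return 1;
	}
	memset(p, 1, len);	/* first touch can fault in the PUD THP;
				 * smaps is taught about PUD THPs by the
				 * "make smap stats understand PUD THPs" patch */
	munmap(p, len);
	return 0;
}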