Message ID: 20210805190253.2795604-1-zi.yan@sent.com
Series: Make MAX_ORDER adjustable as a kernel boot time parameter.
On 8/5/21 9:02 PM, Zi Yan wrote: > From: Zi Yan <ziy@nvidia.com> > Patch 3 restores the pfn_valid_within() check when buddy allocator can merge > pages across memory sections. The check was removed when ARM64 gets rid of holes > in zones, but holes can appear in zones again after this patchset. To me that's most unwelcome resurrection. I kinda missed it was going away and now I can't even rejoice? I assume the systems that will be bumping max_order have a lot of memory. Are they going to have many holes? What if we just sacrificed the memory that would have a hole and don't add it to buddy at all?
On 06.08.21 17:36, Vlastimil Babka wrote: > On 8/5/21 9:02 PM, Zi Yan wrote: >> From: Zi Yan <ziy@nvidia.com> > >> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge >> pages across memory sections. The check was removed when ARM64 gets rid of holes >> in zones, but holes can appear in zones again after this patchset. > > To me that's most unwelcome resurrection. I kinda missed it was going away and > now I can't even rejoice? I assume the systems that will be bumping max_order > have a lot of memory. Are they going to have many holes? What if we just > sacrificed the memory that would have a hole and don't add it to buddy at all? I think the old implementation was just horrible and the description we have here still suffers from that old crap: "but holes can appear in zones again". No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages that are partially a hole. And to be precise, "hole" here means "there is no memmap" and not "there is a hole but it has a valid memmap". But IIRC, we now have under SPARSEMEM always a complete memmap for a complete memory sections (when talking about system RAM, ZONE_DEVICE is different but we don't really care for now I think). So instead of introducing what we had before, I think we should look into something that doesn't confuse each person that stumbles over it out there. What does pfn_valid_within() even mean in the new context? pfn_valid() is most probably no longer what we really want, as we're dealing with multiple sections that might be online or offline; in the old world, this was different, as a MAX_ORDER -1 page was completely contained in a memory section that was either online or offline. I'd imagine something that expresses something different in the context of sparsemem: "Some page orders, such as MAX_ORDER -1, might span multiple memory sections. Each memory section has a completely valid memmap if online. Memory sections might either be completely online or completely offline. pfn_to_online_page() might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But it will certainly be consistent within one memory section." Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we can actually do a binary search to identify boundaries, instead of having to check each and every page in the range. Is what I describe the actual reason why we introduce pfn_valid_within() ? (and might better introduce something new, with a better fitting name?)
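Since the online state is consistent within one memory section, the check described above could be reduced to one probe per section rather than one per page. A minimal sketch, with an invented helper name and not taken from the series, assuming the block is naturally aligned to its own size:

#include <linux/mmzone.h>
#include <linux/memory_hotplug.h>

static bool range_is_online(unsigned long start_pfn, unsigned int order)
{
	unsigned long pfn, end_pfn = start_pfn + (1UL << order);

	/* One probe per section: online state is consistent inside a section. */
	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION)
		if (!pfn_to_online_page(pfn))
			return false;
	return true;
}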
On 8/5/21 9:02 PM, Zi Yan wrote: > From: Zi Yan <ziy@nvidia.com> > > Hi all, > > This patchset add support for kernel boot time adjustable MAX_ORDER, so that > user can change the largest size of pages obtained from buddy allocator. It also > removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that > buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is > set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51. > > Motivation > === > > This enables kernel to allocate 1GB pages and is necessary for my ongoing work > on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with > after some discussion with David Hildenbrand on what methods should be used for > allocating gigantic pages[2], since other approaches like using CMA allocator or > alloc_contig_pages() are regarded as suboptimal. I see references [1] [2] etc but no list of links at the end, did you forget to include them?
On 8/6/21 6:16 PM, David Hildenbrand wrote: > On 06.08.21 17:36, Vlastimil Babka wrote: >> On 8/5/21 9:02 PM, Zi Yan wrote: >>> From: Zi Yan <ziy@nvidia.com> >> >>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge >>> pages across memory sections. The check was removed when ARM64 gets rid of holes >>> in zones, but holes can appear in zones again after this patchset. >> >> To me that's most unwelcome resurrection. I kinda missed it was going away and >> now I can't even rejoice? I assume the systems that will be bumping max_order >> have a lot of memory. Are they going to have many holes? What if we just >> sacrificed the memory that would have a hole and don't add it to buddy at all? > > I think the old implementation was just horrible and the description we have > here still suffers from that old crap: "but holes can appear in zones again". > No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages > that are partially a hole. > > And to be precise, "hole" here means "there is no memmap" and not "there is a > hole but it has a valid memmap". Yes. > But IIRC, we now have under SPARSEMEM always a complete memmap for a complete > memory sections (when talking about system RAM, ZONE_DEVICE is different but we > don't really care for now I think). > > So instead of introducing what we had before, I think we should look into > something that doesn't confuse each person that stumbles over it out there. What > does pfn_valid_within() even mean in the new context? pfn_valid() is most > probably no longer what we really want, as we're dealing with multiple sections > that might be online or offline; in the old world, this was different, as a > MAX_ORDER -1 page was completely contained in a memory section that was either > online or offline. > > I'd imagine something that expresses something different in the context of > sparsemem: > > "Some page orders, such as MAX_ORDER -1, might span multiple memory sections. > Each memory section has a completely valid memmap if online. Memory sections > might either be completely online or completely offline. pfn_to_online_page() > might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But > it will certainly be consistent within one memory section." > > Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we > can actually do a binary search to identify boundaries, instead of having to > check each and every page in the range. > > Is what I describe the actual reason why we introduce pfn_valid_within() ? (and > might better introduce something new, with a better fitting name?) What I don't like is mainly the re-addition of pfn_valid_within() (or whatever we'd call it) into __free_one_page() for performance reasons, and also to various pfn scanners (compaction) for performance and "I must not forget to check this, or do I?" confusion reasons. It would be really great if we could keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to achieve that: 1. we create memmap for MAX_ORDER blocks, pages in sections not online are marked as reserved or some other state that allows us to do checks such as "is there a buddy? no" without accessing a missing memmap 2. smaller blocks than MAX_ORDER are not released to buddy allocator I think 1 would be more work, but less wasteful in the end?
On 06.08.21 18:54, Vlastimil Babka wrote: > On 8/6/21 6:16 PM, David Hildenbrand wrote: >> On 06.08.21 17:36, Vlastimil Babka wrote: >>> On 8/5/21 9:02 PM, Zi Yan wrote: >>>> From: Zi Yan <ziy@nvidia.com> >>> >>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge >>>> pages across memory sections. The check was removed when ARM64 gets rid of holes >>>> in zones, but holes can appear in zones again after this patchset. >>> >>> To me that's most unwelcome resurrection. I kinda missed it was going away and >>> now I can't even rejoice? I assume the systems that will be bumping max_order >>> have a lot of memory. Are they going to have many holes? What if we just >>> sacrificed the memory that would have a hole and don't add it to buddy at all? >> >> I think the old implementation was just horrible and the description we have >> here still suffers from that old crap: "but holes can appear in zones again". >> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages >> that are partially a hole. >> >> And to be precise, "hole" here means "there is no memmap" and not "there is a >> hole but it has a valid memmap". > > Yes. > >> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete >> memory sections (when talking about system RAM, ZONE_DEVICE is different but we >> don't really care for now I think). >> >> So instead of introducing what we had before, I think we should look into >> something that doesn't confuse each person that stumbles over it out there. What >> does pfn_valid_within() even mean in the new context? pfn_valid() is most >> probably no longer what we really want, as we're dealing with multiple sections >> that might be online or offline; in the old world, this was different, as a >> MAX_ORDER -1 page was completely contained in a memory section that was either >> online or offline. >> >> I'd imagine something that expresses something different in the context of >> sparsemem: >> >> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections. >> Each memory section has a completely valid memmap if online. Memory sections >> might either be completely online or completely offline. pfn_to_online_page() >> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But >> it will certainly be consistent within one memory section." >> >> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we >> can actually do a binary search to identify boundaries, instead of having to >> check each and every page in the range. >> >> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and >> might better introduce something new, with a better fitting name?) > > What I don't like is mainly the re-addition of pfn_valid_within() (or whatever > we'd call it) into __free_one_page() for performance reasons, and also to > various pfn scanners (compaction) for performance and "I must not forget to > check this, or do I?" confusion reasons. It would be really great if we could > keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to > achieve that: > > 1. we create memmap for MAX_ORDER blocks, pages in sections not online are > marked as reserved or some other state that allows us to do checks such as "is > there a buddy? no" without accessing a missing memmap > 2. smaller blocks than MAX_ORDER are not released to buddy allocator > > I think 1 would be more work, but less wasteful in the end? 
It will end up seriously messing with memory hot(un)plug. It's not sufficient if there is a memmap (pfn_valid()), it has to be online (pfn_to_online_page()) to actually have a meaning. So you'd have to allocate a memmap for all such memory sections, initialize it to all PG_Reserved ("memory hole") and mark these memory sections online. Further, you need memory block devices that are initialized and online. So far so good, although wasteful. What happens if someone hotplugs a memory block that doesn't span a complete MAX_ORDER -1 page? Broken. The only "workaround" would be requiring that MAX_ORDER - 1 cannot be bigger than memory blocks (memory_block_size_bytes()). The memory block size determines our hot(un)plug granularity and can (on some archs already) be determined at runtime. As both (MAX_ORDER and memory_block_size_bytes) would be determined at runtime, for example, by an admin explicitly requesting it, this might be feasible. Memory hot(un)plug / onlining / offlining would most probably work naturally (although the hot(un)plug granularity is then limited to e.g., 1GiB memory blocks). But if that's what an admin requests on the command line, so be it. What might need some thought, though, is having overlapping sections/such memory blocks with devmem. Sub-section hotadd has to continue working unless we want to break some PMEM devices seriously.
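A rough sketch of the "workaround" constraint above, i.e. rejecting a runtime MAX_ORDER whose largest block would exceed the hot(un)plug granularity. It assumes sparsemem with memory hotplug configured, and the function name is invented rather than taken from the patchset:

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/memory.h>
#include <linux/mm.h>

static int __init check_runtime_max_order(unsigned int max_order)
{
	unsigned long block_pages = memory_block_size_bytes() >> PAGE_SHIFT;

	/*
	 * The largest buddy block (order MAX_ORDER - 1) must not exceed
	 * the memory hot(un)plug granularity.
	 */
	if ((1UL << (max_order - 1)) > block_pages)
		return -EINVAL;
	return 0;
}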
On 6 Aug 2021, at 12:32, Vlastimil Babka wrote: > On 8/5/21 9:02 PM, Zi Yan wrote: >> From: Zi Yan <ziy@nvidia.com> >> >> Hi all, >> >> This patchset add support for kernel boot time adjustable MAX_ORDER, so that >> user can change the largest size of pages obtained from buddy allocator. It also >> removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that >> buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is >> set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51. >> >> Motivation >> === >> >> This enables kernel to allocate 1GB pages and is necessary for my ongoing work >> on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with >> after some discussion with David Hildenbrand on what methods should be used for >> allocating gigantic pages[2], since other approaches like using CMA allocator or >> alloc_contig_pages() are regarded as suboptimal. > > I see references [1] [2] etc but no list of links at the end, did you > forgot to include? Ah, I missed that. Sorry. Here are the links: [1] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi.yan@sent.com/ [2] https://lore.kernel.org/linux-mm/e132fdd9-65af-1cad-8a6e-71844ebfe6a2@redhat.com/ [3] https://lore.kernel.org/linux-mm/289DA3C0-9AE5-4992-A35A-C13FCE4D8544@nvidia.com/ BTW, [3] is the performance test I did to compare MAX_ORDER=11 and MAX_ORDER=20, which showed no performance impact of increasing MAX_ORDER. In addition, I would like to share more detail on my plan on supporting 1GB PUD THP. This patchset is the first step, enabling kernel to allocate 1GB pages, so that user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory fragmentation handling for pages up to MAX_ORDER, since currently pageblock size is still limited by memory section size. As a result, I will explore solutions like having additional larger pageblocks (up to MAX_ORDER) to counter memory fragmentation. I will discover what else needs to be solved as I gradually improve 1GB PUD THP support. — Best Regards, Yan, Zi
On 6 Aug 2021, at 13:08, David Hildenbrand wrote: > On 06.08.21 18:54, Vlastimil Babka wrote: >> On 8/6/21 6:16 PM, David Hildenbrand wrote: >>> On 06.08.21 17:36, Vlastimil Babka wrote: >>>> On 8/5/21 9:02 PM, Zi Yan wrote: >>>>> From: Zi Yan <ziy@nvidia.com> >>>> >>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge >>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes >>>>> in zones, but holes can appear in zones again after this patchset. >>>> >>>> To me that's most unwelcome resurrection. I kinda missed it was going away and >>>> now I can't even rejoice? I assume the systems that will be bumping max_order >>>> have a lot of memory. Are they going to have many holes? What if we just >>>> sacrificed the memory that would have a hole and don't add it to buddy at all? >>> >>> I think the old implementation was just horrible and the description we have >>> here still suffers from that old crap: "but holes can appear in zones again". >>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages >>> that are partially a hole. >>> >>> And to be precise, "hole" here means "there is no memmap" and not "there is a >>> hole but it has a valid memmap". >> >> Yes. >> >>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete >>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we >>> don't really care for now I think). >>> >>> So instead of introducing what we had before, I think we should look into >>> something that doesn't confuse each person that stumbles over it out there. What >>> does pfn_valid_within() even mean in the new context? pfn_valid() is most >>> probably no longer what we really want, as we're dealing with multiple sections >>> that might be online or offline; in the old world, this was different, as a >>> MAX_ORDER -1 page was completely contained in a memory section that was either >>> online or offline. >>> >>> I'd imagine something that expresses something different in the context of >>> sparsemem: >>> >>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections. >>> Each memory section has a completely valid memmap if online. Memory sections >>> might either be completely online or completely offline. pfn_to_online_page() >>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But >>> it will certainly be consistent within one memory section." >>> >>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we >>> can actually do a binary search to identify boundaries, instead of having to >>> check each and every page in the range. >>> >>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and >>> might better introduce something new, with a better fitting name?) >> >> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever >> we'd call it) into __free_one_page() for performance reasons, and also to >> various pfn scanners (compaction) for performance and "I must not forget to >> check this, or do I?" confusion reasons. It would be really great if we could >> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to >> achieve that: >> >> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are >> marked as reserved or some other state that allows us to do checks such as "is >> there a buddy? no" without accessing a missing memmap >> 2. 
smaller blocks than MAX_ORDER are not released to buddy allocator >> >> I think 1 would be more work, but less wasteful in the end? > > It will end up seriously messing with memory hot(un)plug. It's not sufficient if there is a memmap (pfn_valid()), it has to be online (pfn_to_online_page()) to actually have a meaning. > > So you'd have to allocate a memmap for all such memory sections, initialize it to all PG_Reserved ("memory hole") and mark these memory sections online. Further, you need memory block devices that are initialized and online. > > So far so good, although wasteful. What happens if someone hotplugs a memory block that doesn't span a complete MAX_ORDER -1 page? Broken. > > > The only "workaround" would be requiring that MAX_ORDER - 1 cannot be bigger than memory blocks (memory_block_size_bytes()). The memory block size determines our hot(un)plug granularity and can (on some archs already) be determined at runtime. As both (MAX_ORDER and memory_block_size_bytes) would be determined at runtime, for example, by an admin explicitly requesting it, this might be feasible. > > > Memory hot(un)plug / onlining / offlining would most probably work naturally (although the hot(un)plug granularity is then limited to e.g., 1GiB memory blocks). But if that's what an admin requests on the command line, so be it. > > What might need some thought, though, is having overlapping sections/such memory blocks with devmem. Sub-section hotadd has to continue working unless we want to break some PMEM devices seriously. Thanks a lot for your valuable inputs! Yes, this might work. But it seems to also defeat the purpose of sparse memory, which allow only memmapping present PFNs, right? Also it requires a lot more intrusive changes, which might not be accepted easily. I will look into the cost of the added pfn checks and try to optimize it. One thing I can think of is that these non-present PFNs should only appear at the beginning and at the end of a zone, since HOLES_IN_ZONE is gone, maybe I just need to store and check PFN range of a zone instead of checking memory section validity and modify the zone PFN range during memory hot(un)plug. For offline pages in the middle of a zone, struct page still exists and PageBuddy() returns false, since PG_offline is set, right? — Best Regards, Yan, Zi
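For illustration, the zone-span variant of the check suggested above could be as small as the sketch below (helper name invented, not part of the patchset). Note that David's reply further down points out that holes can also sit in the middle of a zone, so a check like this would not be sufficient on its own:

#include <linux/mmzone.h>

static inline bool pfn_in_zone_span(const struct zone *zone, unsigned long pfn)
{
	/* A candidate buddy pfn must lie within the zone's spanned range. */
	return pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone);
}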
On Fri, 6 Aug 2021, Zi Yan wrote: > > In addition, I would like to share more detail on my plan on supporting 1GB PUD THP. > This patchset is the first step, enabling kernel to allocate 1GB pages, so that > user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using > alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory > fragmentation handling for pages up to MAX_ORDER, since currently pageblock size > is still limited by memory section size. As a result, I will explore solutions > like having additional larger pageblocks (up to MAX_ORDER) to counter memory > fragmentation. I will discover what else needs to be solved as I gradually improve > 1GB PUD THP support. Sorry to be blunt, but let me state my opinion: 2MB THPs have given and continue to give us more than enough trouble. Complicating the kernel's mm further, just to allow 1GB THPs, seems a very bad tradeoff to me. I understand that it's an appealing personal project; but for the sake of all the rest of us, please leave 1GB huge pages to hugetlbfs (until the day when we are all using 2MB base pages). Hugh
On 6 Aug 2021, at 16:27, Hugh Dickins wrote: > On Fri, 6 Aug 2021, Zi Yan wrote: >> >> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP. >> This patchset is the first step, enabling kernel to allocate 1GB pages, so that >> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using >> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory >> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size >> is still limited by memory section size. As a result, I will explore solutions >> like having additional larger pageblocks (up to MAX_ORDER) to counter memory >> fragmentation. I will discover what else needs to be solved as I gradually improve >> 1GB PUD THP support. > > Sorry to be blunt, but let me state my opinion: 2MB THPs have given and > continue to give us more than enough trouble. Complicating the kernel's > mm further, just to allow 1GB THPs, seems a very bad tradeoff to me. I > understand that it's an appealing personal project; but for the sake of > of all the rest of us, please leave 1GB huge pages to hugetlbfs (until > the day when we are all using 2MB base pages). I do not agree with you. 2MB THP provides good performance, while letting us keep using 4KB base pages. The 2MB THP implementation is the price we pay to get the performance. This patchset removes the tie between MAX_ORDER and section size to allow >2MB page allocation, which is useful in many places. 1GB THP is one of the users. Gigantic pages also improve device performance, like GPUs (e.g., AMD GPUs can use any power of two up to 1GB pages[1], which I just learnt). Also could you point out which part of my patchset complicates kernel’s mm? I could try to simplify it if possible. In addition, I am not sure hugetlbfs is the way to go. THP is managed by core mm, whereas hugetlbfs has its own code for memory management. As hugetlbfs gets popular, more core mm functionalities have been replicated and added to hugetlbfs codebase. It is not a good tradeoff either. One of the reasons I work on 1GB THP is that Roman from Facebook explicitly mentioned they want to use THP in place of hugetlbfs[2]. I think it might be more constructive to point out the existing issues in THP so that we can improve the code together. BTW, I am also working on simplifying THP code like generalizing THP split[3] and planning to simplify page table manipulation code by reviving Kirill’s idea[4]. [1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@amd.com/ [2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/ [3] https://lwn.net/Articles/837928/ [4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/ — Best Regards, Yan, Zi
On Fri, Aug 06, 2021 at 01:27:27PM -0700, Hugh Dickins wrote: > On Fri, 6 Aug 2021, Zi Yan wrote: > > > > In addition, I would like to share more detail on my plan on supporting 1GB PUD THP. > > This patchset is the first step, enabling kernel to allocate 1GB pages, so that > > user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using > > alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory > > fragmentation handling for pages up to MAX_ORDER, since currently pageblock size > > is still limited by memory section size. As a result, I will explore solutions > > like having additional larger pageblocks (up to MAX_ORDER) to counter memory > > fragmentation. I will discover what else needs to be solved as I gradually improve > > 1GB PUD THP support. > > Sorry to be blunt, but let me state my opinion: 2MB THPs have given and > continue to give us more than enough trouble. Complicating the kernel's > mm further, just to allow 1GB THPs, seems a very bad tradeoff to me. I > understand that it's an appealing personal project; but for the sake of > of all the rest of us, please leave 1GB huge pages to hugetlbfs (until > the day when we are all using 2MB base pages). I respect your opinion, Hugh. You, more than most of us, have spent an inordinate amount of time debugging huge page related issues. I also share your misgivings about the potential performance improvements for 1GB pages. They're too big for all but the most unusual of special cases. This hasn't been helped by the scarce number of 1GB TLB entries in Intel CPUs until very recently (and even those are hard to come by today). I do not think they are of interest for the page cache (as I'm fond of observing, if you have 7GB/s storage (eg the Samsung 980 Pro), you can take seven page faults per second). I am, however, of the opinion that 2MB pages give us so much trouble because they're so very special. Few people exercise those code paths and it's easy to break them without noticing. This is partly why I want to do arbitrary-order pages. If everybody is running with compound pages all the time, we'll see the corner cases often, and people other than Hugh, Kirill and Mike will be able to work on them. Now, I'm not planning on working on arbitrary-order anonymous pages myself. I think I have enough to deal with in the page cache & filesystems. But I'm happy to help out when I can be useful. I think 256kB pages are probably optimal at the moment for file-backed memory, so I'm not planning on exploring the space above PMD_ORDER myself. But there have already been some important areas of collaboration between the 1GB effort and the folio effort.
On Sat, Aug 07, 2021 at 02:10:55AM +0100, Matthew Wilcox wrote: > This hasn't been helped by the scarce number of 1GB TLB entries in Intel > CPUs until very recently (and even those are hard to come by today). Minor correction to this. I just fixed x86info to work on my Core i7-1165G7 (Tiger Lake, launched about a year ago) and discovered it has: L1 Store Only TLB: 1GB/4MB/2MB/4KB pages, fully associative, 16 entries L1 Load Only TLB: 1GB pages, fully associative, 8 entries L2 Unified TLB: 1GB/4KB pages, 8-way associative, 1024 entries My prior laptop (i7-7500U, Kaby Lake, 2016) has only 4x 1GB TLB entries at the L1 level, and no support for L2 1GB pages. So this speaks to Intel finally taking performance of 1GB TLB entries seriously. Perhaps more seriously than they need to for a laptop with 16GB of memory! There are Xeon-W versions of this CPU, so I imagine that there are versions which can support 1TB or more memory. I still think that 1GB pages are too big for anything but specialist use cases, but those do exist.
On Fri, Aug 06, 2021 at 06:54:14PM +0200, Vlastimil Babka wrote: > On 8/6/21 6:16 PM, David Hildenbrand wrote: > > On 06.08.21 17:36, Vlastimil Babka wrote: > >> On 8/5/21 9:02 PM, Zi Yan wrote: > >>> From: Zi Yan <ziy@nvidia.com> > >> > >>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge > >>> pages across memory sections. The check was removed when ARM64 gets rid of holes > >>> in zones, but holes can appear in zones again after this patchset. > >> > >> To me that's most unwelcome resurrection. I kinda missed it was going away and > >> now I can't even rejoice? I assume the systems that will be bumping max_order > >> have a lot of memory. Are they going to have many holes? What if we just > >> sacrificed the memory that would have a hole and don't add it to buddy at all? > > > > I think the old implementation was just horrible and the description we have > > here still suffers from that old crap: "but holes can appear in zones again". > > No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages > > that are partially a hole. > > > > And to be precise, "hole" here means "there is no memmap" and not "there is a > > hole but it has a valid memmap". > > Yes. > > > But IIRC, we now have under SPARSEMEM always a complete memmap for a complete > > memory sections (when talking about system RAM, ZONE_DEVICE is different but we > > don't really care for now I think). > > > > So instead of introducing what we had before, I think we should look into > > something that doesn't confuse each person that stumbles over it out there. What > > does pfn_valid_within() even mean in the new context? pfn_valid() is most > > probably no longer what we really want, as we're dealing with multiple sections > > that might be online or offline; in the old world, this was different, as a > > MAX_ORDER -1 page was completely contained in a memory section that was either > > online or offline. > > > > I'd imagine something that expresses something different in the context of > > sparsemem: > > > > "Some page orders, such as MAX_ORDER -1, might span multiple memory sections. > > Each memory section has a completely valid memmap if online. Memory sections > > might either be completely online or completely offline. pfn_to_online_page() > > might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But > > it will certainly be consistent within one memory section." > > > > Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we > > can actually do a binary search to identify boundaries, instead of having to > > check each and every page in the range. > > > > Is what I describe the actual reason why we introduce pfn_valid_within() ? (and > > might better introduce something new, with a better fitting name?) > > What I don't like is mainly the re-addition of pfn_valid_within() (or whatever > we'd call it) into __free_one_page() for performance reasons, and also to > various pfn scanners (compaction) for performance and "I must not forget to > check this, or do I?" confusion reasons. It would be really great if we could > keep a guarantee that memmap exists for MAX_ORDER blocks. Maybe I'm missing something, but what if we use different constants to define maximal allocation order buddy can handle and maximal size of reasonable physically contiguous chunk? I've skimmed through the uses of MAX_ORDER and it seems that many of those could use, say, pageblock_order of several megabytes rather than MAX_ORDER. 
If we only use MAX_ORDER to denote the maximal allocation order and keep PFN walkers using the smaller pageblock order, then we'll be able to have a valid memory map in all the cases. Then we can work out how to make migration/compaction etc. process 1G pages. > I see two ways to achieve that: > > 1. we create memmap for MAX_ORDER blocks, pages in sections not online are > marked as reserved or some other state that allows us to do checks such as "is > there a buddy? no" without accessing a missing memmap > 2. smaller blocks than MAX_ORDER are not released to buddy allocator > > I think 1 would be more work, but less wasteful in the end?
On Fri, 6 Aug 2021, Zi Yan wrote: > On 6 Aug 2021, at 16:27, Hugh Dickins wrote: > > On Fri, 6 Aug 2021, Zi Yan wrote: > >> > >> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP. > >> This patchset is the first step, enabling kernel to allocate 1GB pages, so that > >> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using > >> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory > >> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size > >> is still limited by memory section size. As a result, I will explore solutions > >> like having additional larger pageblocks (up to MAX_ORDER) to counter memory > >> fragmentation. I will discover what else needs to be solved as I gradually improve > >> 1GB PUD THP support. > > > > Sorry to be blunt, but let me state my opinion: 2MB THPs have given and > > continue to give us more than enough trouble. Complicating the kernel's > > mm further, just to allow 1GB THPs, seems a very bad tradeoff to me. I > > understand that it's an appealing personal project; but for the sake of > > of all the rest of us, please leave 1GB huge pages to hugetlbfs (until > > the day when we are all using 2MB base pages). > > I do not agree with you. 2MB THP provides good performance, while letting us > keep using 4KB base pages. The 2MB THP implementation is the price we pay > to get the performance. This patchset removes the tie between MAX_ORDER > and section size to allow >2MB page allocation, which is useful in many > places. 1GB THP is one of the users. Gigantic pages also improve > device performance, like GPUs (e.g., AMD GPUs can use any power of two up to > 1GB pages[1], which I just learnt). Also could you point out which part > of my patchset complicates kernel’s mm? I could try to simplify it if > possible. > > In addition, I am not sure hugetlbfs is the way to go. THP is managed by > core mm, whereas hugetlbfs has its own code for memory management. > As hugetlbfs gets popular, more core mm functionalities have been > replicated and added to hugetlbfs codebase. It is not a good tradeoff > either. One of the reasons I work on 1GB THP is that Roman from Facebook > explicitly mentioned they want to use THP in place of hugetlbfs[2]. > > I think it might be more constructive to point out the existing issues > in THP so that we can improve the code together. BTW, I am also working > on simplifying THP code like generalizing THP split[3] and planning to > simplify page table manipulation code by reviving Kirill’s idea[4]. You may have good reasons for working on huge PUD entry support; and perhaps we have different understandings of "THP". Fragmentation: that's what horrifies me about 1GB THP. The dark side of THP is compaction. People have put in a lot of effort to get compaction working as well as it currently does, but getting 512 adjacent 4k pages is not easy. Getting 512*512 adjacent 4k pages is very much harder. Please put in the work on compaction before you attempt to support 1GB THP. Related fears: unexpected latencies; unacceptable variance between runs; frequent rebooting of machines to get back to an unfragmented state; page table code that most of us will never be in a position to test. Sorry, no, I'm not reading your patches: that's not personal, it's just that I've more than enough to do already, and must make choices. 
Hugh > > [1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@amd.com/ > [2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/ > [3] https://lwn.net/Articles/837928/ > [4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/ > > — > Best Regards, > Yan, Zi
On Sat, 7 Aug 2021, Matthew Wilcox wrote: > > I am, however, of the opinion that 2MB pages give us so much trouble > because they're so very special. Few people exercise those code paths and > it's easy to break them without noticing. This is partly why I want to > do arbitrary-order pages. If everybody is running with compound pages > all the time, we'll see the corner cases often, and people other than > Hugh, Kirill and Mike will be able to work on them. I don't entirely agree. I'm all for your use of compound pages in page cache, but don't think its problems are representative of the problems in aiming for a PMD (or PUD) bar, with the weird page table transitions we expect of "THP" there. I haven't checked: is your use of compound pages in page cache still limited to the HAVE_ARCH_TRANSPARENT_HUGEPAGE architectures? When the others could just as well use compound pages in page cache too. Hugh
On 06.08.21 20:24, Zi Yan wrote: > On 6 Aug 2021, at 13:08, David Hildenbrand wrote: > >> On 06.08.21 18:54, Vlastimil Babka wrote: >>> On 8/6/21 6:16 PM, David Hildenbrand wrote: >>>> On 06.08.21 17:36, Vlastimil Babka wrote: >>>>> On 8/5/21 9:02 PM, Zi Yan wrote: >>>>>> From: Zi Yan <ziy@nvidia.com> >>>>> >>>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge >>>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes >>>>>> in zones, but holes can appear in zones again after this patchset. >>>>> >>>>> To me that's most unwelcome resurrection. I kinda missed it was going away and >>>>> now I can't even rejoice? I assume the systems that will be bumping max_order >>>>> have a lot of memory. Are they going to have many holes? What if we just >>>>> sacrificed the memory that would have a hole and don't add it to buddy at all? >>>> >>>> I think the old implementation was just horrible and the description we have >>>> here still suffers from that old crap: "but holes can appear in zones again". >>>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages >>>> that are partially a hole. >>>> >>>> And to be precise, "hole" here means "there is no memmap" and not "there is a >>>> hole but it has a valid memmap". >>> >>> Yes. >>> >>>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete >>>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we >>>> don't really care for now I think). >>>> >>>> So instead of introducing what we had before, I think we should look into >>>> something that doesn't confuse each person that stumbles over it out there. What >>>> does pfn_valid_within() even mean in the new context? pfn_valid() is most >>>> probably no longer what we really want, as we're dealing with multiple sections >>>> that might be online or offline; in the old world, this was different, as a >>>> MAX_ORDER -1 page was completely contained in a memory section that was either >>>> online or offline. >>>> >>>> I'd imagine something that expresses something different in the context of >>>> sparsemem: >>>> >>>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections. >>>> Each memory section has a completely valid memmap if online. Memory sections >>>> might either be completely online or completely offline. pfn_to_online_page() >>>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But >>>> it will certainly be consistent within one memory section." >>>> >>>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we >>>> can actually do a binary search to identify boundaries, instead of having to >>>> check each and every page in the range. >>>> >>>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and >>>> might better introduce something new, with a better fitting name?) >>> >>> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever >>> we'd call it) into __free_one_page() for performance reasons, and also to >>> various pfn scanners (compaction) for performance and "I must not forget to >>> check this, or do I?" confusion reasons. It would be really great if we could >>> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to >>> achieve that: >>> >>> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are >>> marked as reserved or some other state that allows us to do checks such as "is >>> there a buddy? 
no" without accessing a missing memmap >>> 2. smaller blocks than MAX_ORDER are not released to buddy allocator >>> >>> I think 1 would be more work, but less wasteful in the end? >> >> It will end up seriously messing with memory hot(un)plug. It's not sufficient if there is a memmap (pfn_valid()), it has to be online (pfn_to_online_page()) to actually have a meaning. >> >> So you'd have to allocate a memmap for all such memory sections, initialize it to all PG_Reserved ("memory hole") and mark these memory sections online. Further, you need memory block devices that are initialized and online. >> >> So far so good, although wasteful. What happens if someone hotplugs a memory block that doesn't span a complete MAX_ORDER -1 page? Broken. >> >> >> The only "workaround" would be requiring that MAX_ORDER - 1 cannot be bigger than memory blocks (memory_block_size_bytes()). The memory block size determines our hot(un)plug granularity and can (on some archs already) be determined at runtime. As both (MAX_ORDER and memory_block_size_bytes) would be determined at runtime, for example, by an admin explicitly requesting it, this might be feasible. >> >> >> Memory hot(un)plug / onlining / offlining would most probably work naturally (although the hot(un)plug granularity is then limited to e.g., 1GiB memory blocks). But if that's what an admin requests on the command line, so be it. >> >> What might need some thought, though, is having overlapping sections/such memory blocks with devmem. Sub-section hotadd has to continue working unless we want to break some PMEM devices seriously. > > Thanks a lot for your valuable inputs! > > Yes, this might work. But it seems to also defeat the purpose of sparse memory, which allow only memmapping present PFNs, right? Not really. I will only be suboptimal for corner cases. Except corner cases for devemem, we already always populate the memmap for complete memory sections. Now, we would populate the memmap for all memory sections spanning a MAX_ORDER - 1 page, if bigger than a section. Will it matter in practice? I doubt it. I consider 1 GiB allocations only relevant for really big machines. There, we don't really expect to have a lot of random memory holes. On a 1 TiB machine, with 1 GiB memory blocks and 1 GiB max_order - 1, you don't expect to have a completely fragmented memory layout such that allocating additional memmap for some memory sections really makes a difference. > Also it requires a lot more intrusive changes, which might not be accepted easily. I guess it should require quite minimal changes in contrast to what you propose. What we should have to do is a) Check that the configured MAX_ORDER - 1 is effectively not bigger than the memory block size b) Initialize all sections spanning a MAX_ORDER - 1 during boot, we won't even have to mess with memory blocks et al. All that's required is parsing/processing early parameters in the right order. That sound very little intrusive compared to what you propose. Actually, I think what you propose would be an optimization of that approach. > I will look into the cost of the added pfn checks and try to optimize it. One thing I can think of is that these non-present PFNs should only appear at the beginning and at the end of a zone, since HOLES_IN_ZONE is gone, maybe I just need to store and check PFN range of a zone instead of checking memory section validity and modify the zone PFN range during memory hot(un)plug. 
> For offline pages in the middle of a zone, struct page still exists and PageBuddy() returns false, since PG_offline is set, right? I think we can have quite some crazy "sparse" layouts where you can have random holes within a zone, not only at the beginning/end. Offline pages can be identified using pfn_to_online_page(). You must not touch their memmap, not even to check for PageBuddy(). PG_offline is a special case where pfn_to_online_page() returns true and the memmap is valid, however, the pages are logically offline and might get logically onlined later -- primarily used in virtualized environments, for example, with memory ballooning. You can treat PG_offline pages like they are online, they just are accounted differently (!managed) and shouldn't be touched; but otherwise, they are just like any other allocated page.
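Expressed as a hypothetical helper (name invented), the rule described above is that only a page which sits in an online section and is not PG_offline may be inspected further:

#include <linux/memory_hotplug.h>
#include <linux/page-flags.h>

static struct page *usable_online_page(unsigned long pfn)
{
	struct page *page = pfn_to_online_page(pfn);

	if (!page)		/* offline section: memmap must not be touched */
		return NULL;
	if (PageOffline(page))	/* valid memmap, but logically offline */
		return NULL;
	return page;		/* online page; PageBuddy() etc. may be checked */
}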
On 05.08.21 21:02, Zi Yan wrote: > From: Zi Yan <ziy@nvidia.com> > > Hi all, > > This patchset add support for kernel boot time adjustable MAX_ORDER, so that > user can change the largest size of pages obtained from buddy allocator. It also > removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that > buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is > set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51. > > Motivation > === > > This enables kernel to allocate 1GB pages and is necessary for my ongoing work > on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with > after some discussion with David Hildenbrand on what methods should be used for > allocating gigantic pages[2], since other approaches like using CMA allocator or > alloc_contig_pages() are regarded as suboptimal. > > This also prevents increasing SECTION_SIZE_BITS when increasing MAX_ORDER, since > increasing SECTION_SIZE_BITS is not desirable as memory hotadd/hotremove chunk > size will be increased as well, causing memory management difficulty for VMs. > > In addition, make MAX_ORDER a kernel boot time parameter can enable user to > adjust buddy allocator without recompiling the kernel for their own needs, so > that one can still have a small MAX_ORDER if he/she does not need to allocate > gigantic pages like 1GB PUD THPs. > > Background > === > > At the moment, kernel imposes MAX_ORDER - 1 + PAGE_SHFIT < SECTION_SIZE_BITS > restriction. This prevents buddy allocator merging pages across memory sections, > as PFNs might not be contiguous and code like page++ would fail. But this would > not be an issue when SPARSEMEM_VMEMMAP is set, since all struct page are > virtually contiguous. In addition, as long as buddy allocator checks the PFN > validity during buddy page merging (done in Patch 3), pages allocated from > buddy allocator can be manipulated by code like page++. > > > Description > === > > I tested the patchset on both x86_64 and ARM64 at 4KB, 16KB, and 64KB base > pages. The systems boot and ltp mm test suite finished without issue. Also > memory hotplug worked on x86_64 when I tested. It definitely needs more tests > and reviews for other architectures. > > In terms of the concerns on performance degradation if MAX_ORDER is increased, > I did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on > x86_64 machines and saw no performance difference[3]. > > Patch 1 excludes MAX_ORDER check from 32bit vdso compilation. The check uses > irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation. The > exclusion does not break the check in 32bit kernel, since the check will still > be performed during other kernel component compilation. > > Patch 2 gives FORCE_MAX_ZONEORDER a better name. > > Patch 3 restores the pfn_valid_within() check when buddy allocator can merge > pages across memory sections. The check was removed when ARM64 gets rid of holes > in zones, but holes can appear in zones again after this patchset. > > Patch 4-11 convert the use of MAX_ORDER to SECTION_SIZE_BITS or its derivative > constants, since these code places use MAX_ORDER as boundary check for > physically contiguous pages, where SECTION_SIZE_BITS should be used. After this > patchset, MAX_ORDER can go beyond SECTION_SIZE_BITS, the code can break. > I separate changes to different patches for easy review and can merge them into > a single one if that works better. 
> > Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER > when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64. > > Patch 13 converts statically allocated arrays with MAX_ORDER length to dynamic > ones if possible and prepares for making MAX_ORDER a boot time parameter. > > Patch 14 adds a new MIN_MAX_ORDER constant to replace soon-to-be-dynamic > MAX_ORDER for places where converting static array to dynamic one is causing > hassle and not necessary, i.e., ARM64 hypervisor page allocation and SLAB. > > Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter. > > > Any suggestion and/or comment is welcome. Thanks. > > > TODO > === > > 1. Redo the performance comparison tests using this patchset to understand the > performance implication of changing MAX_ORDER. 2. Make alloc_contig_range() cleanly deal with pageblock_order instead of MAX_ORDER - 1 to not force the minimal CMA area size/alignment to be e.g., 1 GiB instead of 4 MiB and to keep virtio-mem working as expected. virtio-mem short term would mean disallowing initialization when an incompatible setup (MAX_ORDER_NR_PAGE > SECTION_NR_PAGES) is detected and bailing out loud that the admin has to fix that on the command line. I have optimizing alloc_contig_range() on my todo list, to get rid of the MAX_ORDER -1 dependency in virtio-mem; but I have no idea when I'll have time to work on that.
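For reference, a concrete instance of the MAX_ORDER/SECTION_SIZE_BITS restriction quoted in the Background above, assuming x86_64 defaults of 4KB base pages (PAGE_SHIFT = 12) and 128MB sections (SECTION_SIZE_BITS = 27):

- A 1GB page is 2^30 bytes = 2^18 base pages, i.e. an order-18 allocation, which needs MAX_ORDER of at least 19.
- A 128MB section holds 2^15 base pages, so the existing restriction caps the largest buddy block at order 15 (128MB); the default MAX_ORDER = 11 tops out at order-10 (4MB) blocks.
- Allocating 1GB pages from the buddy allocator therefore needs either 1GB sections (SECTION_SIZE_BITS >= 30), with the hot(un)plug granularity problems discussed above, or lifting the restriction when SPARSEMEM_VMEMMAP provides a virtually contiguous memmap, which is what this series does.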
On Sun, Aug 08, 2021 at 09:29:29PM -0700, Hugh Dickins wrote: > On Sat, 7 Aug 2021, Matthew Wilcox wrote: > > > > I am, however, of the opinion that 2MB pages give us so much trouble > > because they're so very special. Few people exercise those code paths and > > it's easy to break them without noticing. This is partly why I want to > > do arbitrary-order pages. If everybody is running with compound pages > > all the time, we'll see the corner cases often, and people other than > > Hugh, Kirill and Mike will be able to work on them. > > I don't entirely agree. I'm all for your use of compound pages in page > cache, but don't think its problems are representative of the problems > in aiming for a PMD (or PUD) bar, with the weird page table transitions > we expect of "THP" there. > > I haven't checked: is your use of compound pages in page cache still > limited to the HAVE_ARCH_TRANSPARENT_HUGEPAGE architectures? When > the others could just as well use compound pages in page cache too. It is no longer gated by whether TRANSPARENT_HUGEPAGE is enabled or not. That's a relatively recent change, and I can't say that I've tested it. I'll give it a try today on a PA-RISC system I've booted recently. One of the followup pieces of work that I hope somebody other than myself will undertake is using 64KB PTEs on a 4KB PAGE_SIZE ARM/POWER machine if the stars align.
From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for a kernel boot time adjustable MAX_ORDER, so that users can change the largest size of pages obtained from the buddy allocator. It also removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that the buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.

Motivation
===

This enables the kernel to allocate 1GB pages and is necessary for my ongoing work on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with after some discussion with David Hildenbrand on what methods should be used for allocating gigantic pages[2], since other approaches like using the CMA allocator or alloc_contig_pages() are regarded as suboptimal.

This also avoids the need to increase SECTION_SIZE_BITS when increasing MAX_ORDER, since increasing SECTION_SIZE_BITS is not desirable: the memory hotadd/hotremove chunk size would be increased as well, causing memory management difficulty for VMs.

In addition, making MAX_ORDER a kernel boot time parameter enables users to adjust the buddy allocator for their own needs without recompiling the kernel, so that one can still have a small MAX_ORDER if he/she does not need to allocate gigantic pages like 1GB PUD THPs.

Background
===

At the moment, the kernel imposes the MAX_ORDER - 1 + PAGE_SHIFT < SECTION_SIZE_BITS restriction. This prevents the buddy allocator from merging pages across memory sections, as PFNs might not be contiguous and code like page++ would fail. But this is not an issue when SPARSEMEM_VMEMMAP is set, since all struct pages are virtually contiguous. In addition, as long as the buddy allocator checks PFN validity during buddy page merging (done in Patch 3), pages allocated from the buddy allocator can be manipulated by code like page++.

Description
===

I tested the patchset on both x86_64 and ARM64 at 4KB, 16KB, and 64KB base pages. The systems boot and the ltp mm test suite finished without issue. Also memory hotplug worked on x86_64 when I tested. It definitely needs more tests and reviews for other architectures.

In terms of the concerns about performance degradation if MAX_ORDER is increased, I did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on x86_64 machines and saw no performance difference[3].

Patch 1 excludes the MAX_ORDER check from 32bit vdso compilation. The check uses the irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation. The exclusion does not break the check in the 32bit kernel, since the check will still be performed during other kernel component compilation.

Patch 2 gives FORCE_MAX_ZONEORDER a better name.

Patch 3 restores the pfn_valid_within() check when the buddy allocator can merge pages across memory sections. The check was removed when ARM64 got rid of holes in zones, but holes can appear in zones again after this patchset.

Patches 4-11 convert the use of MAX_ORDER to SECTION_SIZE_BITS or its derivative constants, since these code places use MAX_ORDER as a boundary check for physically contiguous pages, where SECTION_SIZE_BITS should be used. After this patchset, MAX_ORDER can go beyond SECTION_SIZE_BITS and such code could break. I separate the changes into different patches for easy review and can merge them into a single one if that works better.

Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.

Patch 13 converts statically allocated arrays with MAX_ORDER length to dynamic ones if possible and prepares for making MAX_ORDER a boot time parameter.

Patch 14 adds a new MIN_MAX_ORDER constant to replace soon-to-be-dynamic MAX_ORDER for places where converting a static array to a dynamic one causes hassle and is not necessary, i.e., ARM64 hypervisor page allocation and SLAB.

Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter.

Any suggestion and/or comment is welcome. Thanks.

TODO
===

1. Redo the performance comparison tests using this patchset to understand the performance implication of changing MAX_ORDER.

Zi Yan (15):
  arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit vdso.
  arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER
  mm: check pfn validity when buddy allocator can merge pages across mem sections.
  mm: prevent pageblock size being larger than section size.
  mm/memory_hotplug: online pages at section size.
  mm: use PAGES_PER_SECTION instead for mem_map_offset/next().
  mm: hugetlb: use PAGES_PER_SECTION to check mem_map discontiguity
  fs: proc: use PAGES_PER_SECTION for page offline checking period.
  virtio: virtio_mem: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES
  virtio: virtio_balloon: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES.
  mm/page_reporting: report pages at section size instead of MAX_ORDER.
  mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.
  mm: convert MAX_ORDER sized static arrays to dynamic ones.
  mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant.
  mm: make MAX_ORDER a kernel boot time parameter.

 .../admin-guide/kdump/vmcoreinfo.rst         |   2 +-
 .../admin-guide/kernel-parameters.txt        |   5 +
 arch/Kconfig                                 |   4 +
 arch/arc/Kconfig                             |   2 +-
 arch/arm/Kconfig                             |   2 +-
 arch/arm/configs/imx_v6_v7_defconfig         |   2 +-
 arch/arm/configs/milbeaut_m10v_defconfig     |   2 +-
 arch/arm/configs/oxnas_v6_defconfig          |   2 +-
 arch/arm/configs/sama7_defconfig             |   2 +-
 arch/arm64/Kconfig                           |   2 +-
 arch/arm64/kvm/hyp/include/nvhe/gfp.h        |   2 +-
 arch/arm64/kvm/hyp/nvhe/page_alloc.c         |   3 +-
 arch/csky/Kconfig                            |   2 +-
 arch/ia64/Kconfig                            |   2 +-
 arch/ia64/include/asm/sparsemem.h            |   6 +-
 arch/m68k/Kconfig.cpu                        |   2 +-
 arch/mips/Kconfig                            |   2 +-
 arch/nios2/Kconfig                           |   2 +-
 arch/powerpc/Kconfig                         |   2 +-
 arch/powerpc/configs/85xx/ge_imp3a_defconfig |   2 +-
 arch/powerpc/configs/fsl-emb-nonhw.config    |   2 +-
 arch/sh/configs/ecovec24_defconfig           |   2 +-
 arch/sh/mm/Kconfig                           |   2 +-
 arch/sparc/Kconfig                           |   2 +-
 arch/x86/entry/vdso/Makefile                 |   1 +
 arch/xtensa/Kconfig                          |   2 +-
 drivers/gpu/drm/ttm/ttm_device.c             |   7 +-
 drivers/gpu/drm/ttm/ttm_pool.c               |  58 +++++++++-
 drivers/virtio/virtio_balloon.c              |   2 +-
 drivers/virtio/virtio_mem.c                  |  12 +-
 fs/proc/kcore.c                              |   2 +-
 include/drm/ttm/ttm_pool.h                   |   4 +-
 include/linux/memory_hotplug.h               |   1 +
 include/linux/mmzone.h                       |  56 ++++++++-
 include/linux/pageblock-flags.h              |   7 +-
 include/linux/slab.h                         |   8 +-
 mm/Kconfig                                   |  16 +++
 mm/compaction.c                              |  20 ++--
 mm/hugetlb.c                                 |   2 +-
 mm/internal.h                                |   4 +-
 mm/memory_hotplug.c                          |  18 ++-
 mm/page_alloc.c                              | 108 ++++++++++++++++--
 mm/page_isolation.c                          |   7 +-
 mm/page_owner.c                              |  14 ++-
 mm/page_reporting.c                          |   3 +-
 mm/slab.c                                    |   2 +-
 mm/slub.c                                    |   7 +-
 mm/vmscan.c                                  |   1 -
 48 files changed, 334 insertions(+), 86 deletions(-)