Message ID | 20250206185109.1210657-1-fvdl@google.com (mailing list archive) |
---|---|
Series | hugetlb/CMA improvements for large systems |
On Thu, Feb 06, 2025 at 06:50:40PM +0000, Frank van der Linden wrote:
> v3:
> * Fix SPDX comment include file format.
> * Add new hugetlb_cma.* files to MAINTAINERS
> * Document new ranges/ subdir in CMA debugfs.
> * Fix powerpc compilation for config without HAVE_BOOTMEM_INFO_NODE
> * Fix various other nits found by kernel test robot.
> * Use a PFN value of -1 to indicate a non-mirrored mapping
>   in sparse-vmemmap.c, not 0.
> * Fix incorrect if() statement that got mangled in cma.c
>
> v2:
> * Add missing CMA debugfs code.
> * Minor cleanups in hugetlb_cma changes.
> * Move hugetlb_cma code to its own file to further clean
>   things up.
>
> On large systems, we observed some issues with hugetlb and CMA:
>
> 1) When specifying a large number of hugetlb boot pages (hugepages=
>    on the commandline), the kernel may run out of memory before it
>    even gets to HVO. For example, if you have a 3072G system, and
>    want to use 3024 1G hugetlb pages for VMs, that should leave
>    you plenty of space for the hypervisor, provided you have the
>    hugetlb vmemmap optimization (HVO) enabled. However, since
>    the vmemmap pages are always allocated first, and then later
>    in boot freed, you will actually run yourself out of memory
>    before you can do HVO. This means not getting all the hugetlb
>    pages you want, and worse, failure to boot if there is an
>    allocation failure in the system from which it can't recover.
>
> 2) There is a system setup where you might want to use hugetlb_cma
>    with a large value (say, again, 3024 out of 3072G like above),
>    and then lower that if system usage allows it, to make room
>    for non-hugetlb processes. For this, a variation of the problem
>    above applies: the kernel runs out of unmovable space to allocate
>    from before you finish boot, since your CMA area takes up all
>    the space.
>
> 3) CMA wants to use one big contiguous area for allocations. Which
>    fails if you have the aforementioned 3T system with a gap in the
>    middle of physical memory (like the < 40bits BIOS DMA area seen on
>    some AMD systems). You then won't be able to set up a CMA area for
>    one of the NUMA nodes, leading to loss of half of your hugetlb
>    CMA area.
>
> 4) Under the scenario mentioned in 2), when trying to grow the
>    number of hugetlb pages after dropping it for a while, new
>    CMA allocations may fail occasionally. This is not unexpected,
>    some transient references on pages may prevent cma_alloc
>    from succeeding under memory pressure. However, the hugetlb
>    code then falls back to a normal contiguous alloc, which may
>    end up succeeding. This is not always desired behavior. If
>    you have a large CMA area, then the kernel has a restricted
>    amount of memory it can do unmovable allocations from (a well
>    known issue). A normal contiguous alloc may eat further in to
>    this space.

Hi Frank,

While I plan to keep reviewing the series, I think it would make sense
to split this patchset into two smaller ones.
The way I see it, we are trying to deal with two different problems and their
solutions.

1) pre-hvo at boot time
2) multi-range support of CMA (only used for hugetlb)

I did not go through the entire patchset yet, so I ignore whether the
respective patches to tackle these two problems are really dependent on
each other, but I think that would be very interesting to consider a
patchset per solution if that is not the case.

IMHO, it would ease review quite a lot.

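For readers who want to see why problem 1) in the cover letter bites, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not code from the series. It assumes 4 KiB base pages and a 64-byte struct page (typical for x86-64, but not stated in the thread), and it assumes that HVO keeps a single vmemmap page per hugetlb page. The helper names are hypothetical.

```python
# Rough vmemmap cost for the 3072G / 3024 x 1G example in the cover letter.
# All constants below are assumptions, not values taken from the patch series.
GIB = 1 << 30
MIB = 1 << 20
BASE_PAGE = 4096    # assumed base page size in bytes
STRUCT_PAGE = 64    # assumed sizeof(struct page) in bytes


def vmemmap_bytes(mem_bytes):
    """Bytes of struct page metadata needed to describe mem_bytes of memory."""
    return mem_bytes // BASE_PAGE * STRUCT_PAGE


def hvo_vmemmap_bytes(nr_hugepages, kept_pages=1):
    """vmemmap remaining after HVO, assuming kept_pages pages stay per hugetlb page."""
    return nr_hugepages * kept_pages * BASE_PAGE


total = 3072 * GIB            # system memory from the example
nr_huge = 3024                # requested 1G hugetlb pages
huge_bytes = nr_huge * GIB

left_for_system = total - huge_bytes              # memory not handed to hugetlb
vmemmap_before = vmemmap_bytes(huge_bytes)        # allocated before HVO can run
vmemmap_after = hvo_vmemmap_bytes(nr_huge)        # what remains once HVO has run

print(f"left for the system:        {left_for_system / GIB:.2f} GiB")
print(f"hugetlb vmemmap before HVO: {vmemmap_before / GIB:.2f} GiB")
print(f"hugetlb vmemmap after HVO:  {vmemmap_after / MIB:.2f} MiB")
```

Under these assumptions the 3024 hugetlb pages leave 48 GiB for everything else, while their vmemmap needs 47.25 GiB up front. After HVO only about 12 MiB of that would remain, but the full amount has to be allocated first, which is the ordering problem the cover letter describes.
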
On Mon, Feb 10, 2025 at 10:40 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Thu, Feb 06, 2025 at 06:50:40PM +0000, Frank van der Linden wrote:
> > v3:
> > * Fix SPDX comment include file format.
> > * Add new hugetlb_cma.* files to MAINTAINERS
> > * Document new ranges/ subdir in CMA debugfs.
> > * Fix powerpc compilation for config without HAVE_BOOTMEM_INFO_NODE
> > * Fix various other nits found by kernel test robot.
> > * Use a PFN value of -1 to indicate a non-mirrored mapping
> >   in sparse-vmemmap.c, not 0.
> > * Fix incorrect if() statement that got mangled in cma.c
> >
> > v2:
> > * Add missing CMA debugfs code.
> > * Minor cleanups in hugetlb_cma changes.
> > * Move hugetlb_cma code to its own file to further clean
> >   things up.
> >
> > On large systems, we observed some issues with hugetlb and CMA:
> >
> > 1) When specifying a large number of hugetlb boot pages (hugepages=
> >    on the commandline), the kernel may run out of memory before it
> >    even gets to HVO. For example, if you have a 3072G system, and
> >    want to use 3024 1G hugetlb pages for VMs, that should leave
> >    you plenty of space for the hypervisor, provided you have the
> >    hugetlb vmemmap optimization (HVO) enabled. However, since
> >    the vmemmap pages are always allocated first, and then later
> >    in boot freed, you will actually run yourself out of memory
> >    before you can do HVO. This means not getting all the hugetlb
> >    pages you want, and worse, failure to boot if there is an
> >    allocation failure in the system from which it can't recover.
> >
> > 2) There is a system setup where you might want to use hugetlb_cma
> >    with a large value (say, again, 3024 out of 3072G like above),
> >    and then lower that if system usage allows it, to make room
> >    for non-hugetlb processes. For this, a variation of the problem
> >    above applies: the kernel runs out of unmovable space to allocate
> >    from before you finish boot, since your CMA area takes up all
> >    the space.
> >
> > 3) CMA wants to use one big contiguous area for allocations. Which
> >    fails if you have the aforementioned 3T system with a gap in the
> >    middle of physical memory (like the < 40bits BIOS DMA area seen on
> >    some AMD systems). You then won't be able to set up a CMA area for
> >    one of the NUMA nodes, leading to loss of half of your hugetlb
> >    CMA area.
> >
> > 4) Under the scenario mentioned in 2), when trying to grow the
> >    number of hugetlb pages after dropping it for a while, new
> >    CMA allocations may fail occasionally. This is not unexpected,
> >    some transient references on pages may prevent cma_alloc
> >    from succeeding under memory pressure. However, the hugetlb
> >    code then falls back to a normal contiguous alloc, which may
> >    end up succeeding. This is not always desired behavior. If
> >    you have a large CMA area, then the kernel has a restricted
> >    amount of memory it can do unmovable allocations from (a well
> >    known issue). A normal contiguous alloc may eat further in to
> >    this space.
>
> Hi Frank,
>
> While I plan to keep reviewing the series, I think it would make sense
> to split this patchset into two smaller ones.
> The way I see it, we are trying to deal with two different problems and their
> solutions.
>
> 1) pre-hvo at boot time
> 2) multi-range support of CMA (only used for hugetlb)
>
> I did not go through the entire patchset yet, so I ignore whether the
> respective patches to tackle these two problems are really dependent on
> each other, but I think that would be very interesting to consider a
> patchset per solution if that is not the case.
>
> IMHO, it would ease review quite a lot.

Hi Oscar,

Thanks a lot for reviewing this series.

I certainly could split it up, but here are the dependencies (it's
actually 3 parts):

1. Multi-range CMA (used by hugetlb) (patches 1-4)
2. Pre-HVO for hugetlb bootmem pages (patches 5-22)
3. Enable hugepages= (and pre-HVO) for CMA (patches 23-28)

1 and 2 are independent. 3 depends on 1 and 2.

So, I could post 1) and 2) simultaneously, and 3) would have to wait
until 1) and 2) are resolved.

Andrew, do you have any thoughts on splitting it up?

- Frank

On Mon, 10 Feb 2025 10:56:50 -0800 Frank van der Linden <fvdl@google.com> wrote:

> > Hi Frank,
> >
> > While I plan to keep reviewing the series, I think it would make sense
> > to split this patchset into two smaller ones.
> > The way I see it, we are trying to deal with two different problems and their
> > solutions.
> >
> > 1) pre-hvo at boot time
> > 2) multi-range support of CMA (only used for hugetlb)
> >
> > I did not go through the entire patchset yet, so I ignore whether the
> > respective patches to tackle these two problems are really dependent on
> > each other, but I think that would be very interesting to consider a
> > patchset per solution if that is not the case.
> >
> > IMHO, it would ease review quite a lot.
>
> Hi Oscar,
>
> Thanks a lot for reviewing this series.
>
> I certainly could split it up, but here are the dependencies (it's
> actually 3 parts):
>
> 1. Multi-range CMA (used by hugetlb) (patches 1-4)
> 2. Pre-HVO for hugetlb bootmem pages (patches 5-22)
> 3. Enable hugepages= (and pre-HVO) for CMA (patches 23-28)
>
> 1 and 2 are independent. 3 depends on 1 and 2.
>
> So, I could post 1) and 2) simultaneously, and 3) would have to wait
> until 1) and 2) are resolved.
>
> Andrew, do you have any thoughts on splitting it up?

I don't see much trouble with the above dependencies - we can consider
the three series to be an all-or-nothing thing.

Such a splitup would be the same patches, packaged slightly
differently. The main difference would be the presence of two more
[0/n] cover letters, presumably also repackaging existing material. I
don't see a lot of benefit personally.

On Mon, Feb 10, 2025 at 3:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 10 Feb 2025 10:56:50 -0800 Frank van der Linden <fvdl@google.com> wrote:
>
> > > Hi Frank,
> > >
> > > While I plan to keep reviewing the series, I think it would make sense
> > > to split this patchset into two smaller ones.
> > > The way I see it, we are trying to deal with two different problems and their
> > > solutions.
> > >
> > > 1) pre-hvo at boot time
> > > 2) multi-range support of CMA (only used for hugetlb)
> > >
> > > I did not go through the entire patchset yet, so I ignore whether the
> > > respective patches to tackle these two problems are really dependent on
> > > each other, but I think that would be very interesting to consider a
> > > patchset per solution if that is not the case.
> > >
> > > IMHO, it would ease review quite a lot.
> >
> > Hi Oscar,
> >
> > Thanks a lot for reviewing this series.
> >
> > I certainly could split it up, but here are the dependencies (it's
> > actually 3 parts):
> >
> > 1. Multi-range CMA (used by hugetlb) (patches 1-4)
> > 2. Pre-HVO for hugetlb bootmem pages (patches 5-22)
> > 3. Enable hugepages= (and pre-HVO) for CMA (patches 23-28)
> >
> > 1 and 2 are independent. 3 depends on 1 and 2.
> >
> > So, I could post 1) and 2) simultaneously, and 3) would have to wait
> > until 1) and 2) are resolved.
> >
> > Andrew, do you have any thoughts on splitting it up?
>
> I don't see much trouble with the above dependencies - we can consider
> the three series to be an all-or-nothing thing.
>
> Such a splitup would be the same patches, packaged slightly
> differently. The main difference would be the presence of two more
> [0/n] cover letters, presumably also repackaging existing material. I
> don't see a lot of benefit personally.
>

Thanks Andrew. Here's what I can do: keep the series as a whole, but
note at the top of the cover letter that parts can be reviewed /
applied independently.

I hope that works out for everyone.

- Frank