Message ID | 20241202202058.3249628-1-fvdl@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/hugetlb: optionally pre-zero hugetlb pages | expand |
On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote: > Fresh hugetlb pages are zeroed out when they are faulted in, > just like with all other page types. This can take up a good > amount of time for larger page sizes (e.g. around 40 > milliseconds for a 1G page on a recent AMD-based system). > > This normally isn't a problem, since hugetlb pages are typically > mapped by the application for a long time, and the initial > delay when touching them isn't much of an issue. > > However, there are some use cases where a large number of hugetlb > pages are touched when an application (such as a VM backed by these > pages) starts. For 256 1G pages and 40ms per page, this would take > 10 seconds, a noticeable delay. The current huge page zeroing code is not that great to begin with. There was a patchset posted some time ago to remedy at least some of it: https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/ but it apparently fell through the cracks. Any games with "background zeroing" are notoriously crappy and I would argue one should exhaust other avenues before going there -- at the end of the day the cost of zeroing will have to get paid. To that end I would suggest picking up the patchset and experimenting with more variants of the zeroing code (for example for 1G it may be it is faster to employ SIMD usage in the routine). If this is really such a problem I wonder if this could start as a series of 2MB pages instead faulted as needed, eventually promoted to 1G after passing some threshold?
Hi Frank, kernel test robot noticed the following build warnings: [auto build test WARNING on v6.12] [cannot apply to akpm-mm/mm-everything brauner-vfs/vfs.all linus/master v6.13-rc1 next-20241128] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Frank-van-der-Linden/mm-hugetlb-optionally-pre-zero-hugetlb-pages/20241203-042817 base: v6.12 patch link: https://lore.kernel.org/r/20241202202058.3249628-1-fvdl%40google.com patch subject: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages config: powerpc-ps3_defconfig (https://download.01.org/0day-ci/archive/20241203/202412030519.W14yll4e-lkp@intel.com/config) compiler: powerpc64-linux-gcc (GCC) 14.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241203/202412030519.W14yll4e-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202412030519.W14yll4e-lkp@intel.com/ All warnings (new ones prefixed by >>): mm/hugetlb.c: In function 'prezero_enabled_store': >> mm/hugetlb.c:4290:13: warning: variable 'err' set but not used [-Wunused-but-set-variable] 4290 | int err; | ^~~ vim +/err +4290 mm/hugetlb.c 4284 4285 static ssize_t prezero_enabled_store(struct kobject *kobj, 4286 struct kobj_attribute *attr, 4287 const char *buf, size_t count) 4288 { 4289 struct hstate *h; > 4290 int err; 4291 long val; 4292 bool prezero_enabled; 4293 4294 err = kstrtol(buf, 10, &val); 4295 if (val != 0 && val != 1) 4296 return -EINVAL; 4297 4298 prezero_enabled = !!val; 4299 4300 h = kobj_to_hstate(kobj, NULL); 4301 if (prezero_enabled == h->prezero_enabled) 4302 return count; 4303 4304 h->prezero_enabled = prezero_enabled; 4305 4306 mem_hotplug_begin(); 4307 khzerod_run_hstate(h); 4308 mem_hotplug_done(); 4309 4310 return count; 4311 } 4312 HSTATE_ATTR(prezero_enabled); 4313
On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote: > > Fresh hugetlb pages are zeroed out when they are faulted in, > > just like with all other page types. This can take up a good > > amount of time for larger page sizes (e.g. around 40 > > milliseconds for a 1G page on a recent AMD-based system). > > > > This normally isn't a problem, since hugetlb pages are typically > > mapped by the application for a long time, and the initial > > delay when touching them isn't much of an issue. > > > > However, there are some use cases where a large number of hugetlb > > pages are touched when an application (such as a VM backed by these > > pages) starts. For 256 1G pages and 40ms per page, this would take > > 10 seconds, a noticeable delay. > > The current huge page zeroing code is not that great to begin with. > > There was a patchset posted some time ago to remedy at least some of it: > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/ > > but it apparently fell through the cracks. Hi Mateusz, thanks for your reply. I am aware of that patch set, yes. The discussion around it evolved in to one about kernel preemption and the evilness of cond_resched(). You can certainly improve the time it takes to zero out a 1G page by optimizing the code that does it. See also, for example, https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@google.com/ However, while, say, a 50% improvement in zero out time, at the max, is nice, this still leaves the faulting process spending considerable time doing it. Like you say, that's cost that needs to be paid - but it would be good to avoid paying it inline. This patch avoids doing that altogether, leading to a basically 100% improvement under reasonably good circumstances. > > Any games with "background zeroing" are notoriously crappy and I would > argue one should exhaust other avenues before going there -- at the end > of the day the cost of zeroing will have to get paid. I understand that the concept of background prezeroing has been, and will be, met with some resistance. But, do you have any specific concerns with the patch I posted? It's pretty well isolated from the rest of the code, and optional. > > To that end I would suggest picking up the patchset and experimenting > with more variants of the zeroing code (for example for 1G it may be it > is faster to employ SIMD usage in the routine). See above - happy to pick up older patch(es) as a separate effort, but they won't fully solve the issue for the scenario I'm describing. > > If this is really such a problem I wonder if this could start as a > series of 2MB pages instead faulted as needed, eventually promoted to > 1G after passing some threshold? This idea sounds similar to HGM (high granularity mapping), an idea which was originally posted for the purpose of live migration of VMs (but never made it in). It's not trivial, and seems like overkill. Again, my patch is non-invasive and optional, so I think it's better in that regard. - Frank
On Mon 02-12-24 14:50:49, Frank van der Linden wrote: > On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Any games with "background zeroing" are notoriously crappy and I would > > argue one should exhaust other avenues before going there -- at the end > > of the day the cost of zeroing will have to get paid. > > I understand that the concept of background prezeroing has been, and > will be, met with some resistance. But, do you have any specific > concerns with the patch I posted? It's pretty well isolated from the > rest of the code, and optional. The biggest concern I have is that the overhead is payed by everybody on the system - it is considered to be a system overhead regardless only part of the workload benefits from hugetlb pages. In other words the workload using those pages is not accounted for the use completely. If the startup latency is a real problem is there a way to workaround that in the userspace by preallocating hugetlb pages ahead of time before those VMs are launched and hand over already pre-allocated pages?
On 03.12.24 13:06, Michal Hocko wrote: > On Mon 02-12-24 14:50:49, Frank van der Linden wrote: >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >>> Any games with "background zeroing" are notoriously crappy and I would >>> argue one should exhaust other avenues before going there -- at the end >>> of the day the cost of zeroing will have to get paid. >> >> I understand that the concept of background prezeroing has been, and >> will be, met with some resistance. But, do you have any specific >> concerns with the patch I posted? It's pretty well isolated from the >> rest of the code, and optional. > > The biggest concern I have is that the overhead is payed by everybody on > the system - it is considered to be a system overhead regardless only > part of the workload benefits from hugetlb pages. In other words the > workload using those pages is not accounted for the use completely. > > If the startup latency is a real problem is there a way to workaround > that in the userspace by preallocating hugetlb pages ahead of time > before those VMs are launched and hand over already pre-allocated pages? In QEMU we support parallel preallocation, and even NUMA-aware preallocation, to reduce VM startup times with hugetlb. We discussed improvements as referred to by Mateusz to further speed up preallocation, but for now parallel+NUMA-aware preallocation seems to do the trick.
On 02/12/2024 21:58, Mateusz Guzik wrote: > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote: >> Fresh hugetlb pages are zeroed out when they are faulted in, >> just like with all other page types. This can take up a good >> amount of time for larger page sizes (e.g. around 40 >> milliseconds for a 1G page on a recent AMD-based system). >> >> This normally isn't a problem, since hugetlb pages are typically >> mapped by the application for a long time, and the initial >> delay when touching them isn't much of an issue. >> >> However, there are some use cases where a large number of hugetlb >> pages are touched when an application (such as a VM backed by these >> pages) starts. For 256 1G pages and 40ms per page, this would take >> 10 seconds, a noticeable delay. > > The current huge page zeroing code is not that great to begin with. > > There was a patchset posted some time ago to remedy at least some of it: > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/ > > but it apparently fell through the cracks. > It didn't fell through the cracks for sure Just had a detour into preempt=auto before resuming the main work. But that seems to be done in the last merge window with the lazy preempt stuff. I think Ankur was planning on following that series above soon-ish. Adding him here, such that he keeps me honest :)
On 03/12/2024 12:06, Michal Hocko wrote: > On Mon 02-12-24 14:50:49, Frank van der Linden wrote: >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >>> Any games with "background zeroing" are notoriously crappy and I would >>> argue one should exhaust other avenues before going there -- at the end >>> of the day the cost of zeroing will have to get paid. >> >> I understand that the concept of background prezeroing has been, and >> will be, met with some resistance. But, do you have any specific >> concerns with the patch I posted? It's pretty well isolated from the >> rest of the code, and optional. > > The biggest concern I have is that the overhead is payed by everybody on > the system - it is considered to be a system overhead regardless only > part of the workload benefits from hugetlb pages. In other words the > workload using those pages is not accounted for the use completely. > > If the startup latency is a real problem is there a way to workaround > that in the userspace by preallocating hugetlb pages ahead of time > before those VMs are launched and hand over already pre-allocated pages? It should be relatively simple to actually do this. Me and Mike had experimented ourselves a couple years back but we never had the chance to send it over. IIRC if we: - add the PageZeroed tracking bit when a page is zeroed - clear it in the write (fixup/non-fixup) fault-path [somewhat similar to this series I suspect] Then what's left is to change the lookup of free hugetlb pages (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed pages. Provided we don't track its 'cleared' state, there's no UAPI change in behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and free them back 'as zeroed' to implement a userspace scrubber. And in principle existing apps should see no difference. The amount of changes is consequently significantly smaller (or it looked as such in a quick PoC years back). Something extra on the top would perhaps be the ability so select a lookup heuristic such that we can pick the search method of non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc) without too much of a dance. Joao
On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 03/12/2024 12:06, Michal Hocko wrote: > > If the startup latency is a real problem is there a way to workaround > > that in the userspace by preallocating hugetlb pages ahead of time > > before those VMs are launched and hand over already pre-allocated pages? > > It should be relatively simple to actually do this. Me and Mike had experimented > ourselves a couple years back but we never had the chance to send it over. IIRC > if we: > > - add the PageZeroed tracking bit when a page is zeroed > - clear it in the write (fixup/non-fixup) fault-path > > [somewhat similar to this series I suspect] > > Then what's left is to change the lookup of free hugetlb pages > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed > pages. Provided we don't track its 'cleared' state, there's no UAPI change in > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and > free them back 'as zeroed' to implement a userspace scrubber. And in principle > existing apps should see no difference. The amount of changes is consequently > significantly smaller (or it looked as such in a quick PoC years back). > > Something extra on the top would perhaps be the ability so select a lookup > heuristic such that we can pick the search method of > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc) > without too much of a dance. > Ye after the qemu prefaulting got pointed out I started thinking about a userlevel daemon which would do the work proposed here. Except I got stuck at a good way to do it. The mmap + load from the area + munmap triple does work but also entails more overhead than necessary, but I only have some handwaving how to not do it. :) Suppose a daemon of the sort exists and there is a machine with 4 or more NUMA domains to deal with. Further suppose it spawns at least one thread per such domain and tasksets them accordingly. Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter indicating how many pages to zero out (or even just accept one page). This would avoid crap on munmap. This would still need majority of the patch, but all the zeroing policy would be taken out. Key point being that whatever specific behavior one sees fit, they can implement it in userspace, preventing future kernel patches to add more tweaks.
On Tue, Dec 3, 2024 at 4:57 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@oracle.com> wrote: > > > > On 03/12/2024 12:06, Michal Hocko wrote: > > > If the startup latency is a real problem is there a way to workaround > > > that in the userspace by preallocating hugetlb pages ahead of time > > > before those VMs are launched and hand over already pre-allocated pages? > > > > It should be relatively simple to actually do this. Me and Mike had experimented > > ourselves a couple years back but we never had the chance to send it over. IIRC > > if we: > > > > - add the PageZeroed tracking bit when a page is zeroed > > - clear it in the write (fixup/non-fixup) fault-path > > > > [somewhat similar to this series I suspect] > > > > Then what's left is to change the lookup of free hugetlb pages > > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed > > pages. Provided we don't track its 'cleared' state, there's no UAPI change in > > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and > > free them back 'as zeroed' to implement a userspace scrubber. And in principle > > existing apps should see no difference. The amount of changes is consequently > > significantly smaller (or it looked as such in a quick PoC years back). > > > > Something extra on the top would perhaps be the ability so select a lookup > > heuristic such that we can pick the search method of > > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic > > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc) > > without too much of a dance. > > > > Ye after the qemu prefaulting got pointed out I started thinking about > a userlevel daemon which would do the work proposed here. > > Except I got stuck at a good way to do it. The mmap + load from the > area + munmap triple does work but also entails more overhead than > necessary, but I only have some handwaving how to not do it. :) > > Suppose a daemon of the sort exists and there is a machine with 4 or > more NUMA domains to deal with. Further suppose it spawns at least one > thread per such domain and tasksets them accordingly. > > Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter > indicating how many pages to zero out (or even just accept one page). > This would avoid crap on munmap. > > This would still need majority of the patch, but all the zeroing > policy would be taken out. Key point being that whatever specific > behavior one sees fit, they can implement it in userspace, preventing > future kernel patches to add more tweaks. How about this for a rough sketch (which I have 0 intention of implementing myself): /dev/hugepagectl or whatever is created with a bunch of ioctls, notably: - something to query hugepage stats - an event generated for epoll if count in any domain goes below a threshold - something to zero a page of given size from the free list Perhaps make it so that fds require an upfront ioctl to set a numa domain of interest before poll works -- for example if there is one thread per domain, each of them sleeps on its own relevant fd. Or maybe someone still wants the main thread to get the full view so they poll on all of them. then a google internal tool can react however it sees fit without waking up in a periodic fashion. (replace google with any other company which may want to mess with this). optional: - allocating and zeroing (but not mmaping!) a page then a party which shares the file descriptor could obtain it by passing the fd to mmap. munmap would just free it as it does now. this would allow qemu et al to avoid the mmap/munmap dance just to zero, but I don't know how useful it is for them
On 03/12/2024 15:57, Mateusz Guzik wrote: > On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@oracle.com> wrote: >> >> On 03/12/2024 12:06, Michal Hocko wrote: >>> If the startup latency is a real problem is there a way to workaround >>> that in the userspace by preallocating hugetlb pages ahead of time >>> before those VMs are launched and hand over already pre-allocated pages? >> >> It should be relatively simple to actually do this. Me and Mike had experimented >> ourselves a couple years back but we never had the chance to send it over. IIRC >> if we: >> >> - add the PageZeroed tracking bit when a page is zeroed >> - clear it in the write (fixup/non-fixup) fault-path >> >> [somewhat similar to this series I suspect] >> >> Then what's left is to change the lookup of free hugetlb pages >> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed >> pages. Provided we don't track its 'cleared' state, there's no UAPI change in >> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and >> free them back 'as zeroed' to implement a userspace scrubber. And in principle >> existing apps should see no difference. The amount of changes is consequently >> significantly smaller (or it looked as such in a quick PoC years back). >> >> Something extra on the top would perhaps be the ability so select a lookup >> heuristic such that we can pick the search method of >> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic >> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc) >> without too much of a dance. >> > > Ye after the qemu prefaulting got pointed out I started thinking about > a userlevel daemon which would do the work proposed here. > > Except I got stuck at a good way to do it. The mmap + load from the > area + munmap triple does work but also entails more overhead than > necessary, but I only have some handwaving how to not do it. :) > What I was trying to suggest above is that it would be no different that how you use hugetlb. I am not enterily sure I follow the triple work part on unmap. > Suppose a daemon of the sort exists and there is a machine with 4 or > more NUMA domains to deal with. Further suppose it spawns at least one > thread per such domain and tasksets them accordingly. > > Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter > indicating how many pages to zero out (or even just accept one page). > This would avoid crap on munmap. > > This would still need majority of the patch, but all the zeroing> policy would be taken out. Key point being that whatever specific > behavior one sees fit, they can implement it in userspace, preventing > future kernel patches to add more tweaks. Kernel should still ensure it tracks if it's cleared or not -- so what I said above was just letting the allocation zero out the page or not (if it's not zeroed already) and just tweak the dirtyness of pages it picks before installing PTEs. A scrubber would pick only dirty pages (and maybe fail if there aren't any), and a VMM would pick clean pages (taking advantage of the scrubber work). An explicit zero sounds a somewhat limiting ... but hmm What throws all this away (in primary MM) is the prefaulting with write as we would clear the PageCleared bit all the time (I think that's what you mean 'crap on unmap'?). But there could be hope for systems with a secondary pagetables (with paging), where the secondary faulting is the one in control of the cleared status. That is because reads inside the VM ultimately trigger secondary-VM read-faults and get fixed up later with write on writes. Well, at least it would work given we don't prefault secondary page tables yet...
On Tue, Dec 3, 2024 at 6:26 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 03/12/2024 12:06, Michal Hocko wrote: > > On Mon 02-12-24 14:50:49, Frank van der Linden wrote: > >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > >>> Any games with "background zeroing" are notoriously crappy and I would > >>> argue one should exhaust other avenues before going there -- at the end > >>> of the day the cost of zeroing will have to get paid. > >> > >> I understand that the concept of background prezeroing has been, and > >> will be, met with some resistance. But, do you have any specific > >> concerns with the patch I posted? It's pretty well isolated from the > >> rest of the code, and optional. > > > > The biggest concern I have is that the overhead is payed by everybody on > > the system - it is considered to be a system overhead regardless only > > part of the workload benefits from hugetlb pages. In other words the > > workload using those pages is not accounted for the use completely. > > > > If the startup latency is a real problem is there a way to workaround > > that in the userspace by preallocating hugetlb pages ahead of time > > before those VMs are launched and hand over already pre-allocated pages? > > It should be relatively simple to actually do this. Me and Mike had experimented > ourselves a couple years back but we never had the chance to send it over. IIRC > if we: > > - add the PageZeroed tracking bit when a page is zeroed > - clear it in the write (fixup/non-fixup) fault-path > > [somewhat similar to this series I suspect] > > Then what's left is to change the lookup of free hugetlb pages > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed > pages. Provided we don't track its 'cleared' state, there's no UAPI change in > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and > free them back 'as zeroed' to implement a userspace scrubber. And in principle > existing apps should see no difference. The amount of changes is consequently > significantly smaller (or it looked as such in a quick PoC years back). This would work, and is easy to do, but: * You now have a userspace daemon that depends on kernel-internal behavior. * It has no way to track how much work is left to do or what needs to be done (unless it is part of an application that is the only user of hugetlbfs on the system). > > Something extra on the top would perhaps be the ability so select a lookup > heuristic such that we can pick the search method of > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc) > without too much of a dance. Again, that would probably work, but if you take a step back: you now have a kernel behavior that can be guided in certain directions, but no guarantees and no stats to see if things are working out. And an explicit allocation method option (basically: take from the head or the tail of the freelist). The picture is getting murkier. At least with the patch I sent you have a clearly defined, optional, behavior that can be switched on or off, and stats to see if it's working. I do understand the argument against having pre-zeroing not being accounted to the current thread. I would counter that benefiting from work by kernel threads is not unheard of in the kernel today already. Also, the other proposals so far also have another thread doing the zeroing - it just is explicitly started by userspace. So, the cost is still not paid by the user of the pages. You just end up with explicitly controlling who does pay the cost. Which I suppose is better, but it's still not trivial to get it completely right (you perhaps could do it at the container level with some trickery). What we have done so far is to bind the khzerod threads introduced in this patch to CPUs in such a way that it doesn't interfere with the rest of the system. Which you would also have to do with any userspace solution. Again, this is optional - if you are a system manager who prefers to have the resources used by zeroing hugetlb pages to be explicitly accounted to the actual user, you can not enable this behavior (it's off by default). I guess I can summarize my thoughts like this: while I understand the argument against doing this outside of the context of the actual user of the pages, this is 1) optional, and 2) so far the other solutions introduce interfaces that I don't think are that great, or would require maintaining a hugetlb 'shadow pool' in userspace through hugetlbfs files. - Frank
On Tue, Dec 3, 2024 at 10:43 AM Frank van der Linden <fvdl@google.com> wrote: > > On Tue, Dec 3, 2024 at 6:26 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > > > On 03/12/2024 12:06, Michal Hocko wrote: > > > On Mon 02-12-24 14:50:49, Frank van der Linden wrote: > > >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > >>> Any games with "background zeroing" are notoriously crappy and I would > > >>> argue one should exhaust other avenues before going there -- at the end > > >>> of the day the cost of zeroing will have to get paid. > > >> > > >> I understand that the concept of background prezeroing has been, and > > >> will be, met with some resistance. But, do you have any specific > > >> concerns with the patch I posted? It's pretty well isolated from the > > >> rest of the code, and optional. > > > > > > The biggest concern I have is that the overhead is payed by everybody on > > > the system - it is considered to be a system overhead regardless only > > > part of the workload benefits from hugetlb pages. In other words the > > > workload using those pages is not accounted for the use completely. > > > > > > If the startup latency is a real problem is there a way to workaround > > > that in the userspace by preallocating hugetlb pages ahead of time > > > before those VMs are launched and hand over already pre-allocated pages? > > > > It should be relatively simple to actually do this. Me and Mike had experimented > > ourselves a couple years back but we never had the chance to send it over. IIRC > > if we: > > > > - add the PageZeroed tracking bit when a page is zeroed > > - clear it in the write (fixup/non-fixup) fault-path > > > > [somewhat similar to this series I suspect] > > > > Then what's left is to change the lookup of free hugetlb pages > > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed > > pages. Provided we don't track its 'cleared' state, there's no UAPI change in > > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and > > free them back 'as zeroed' to implement a userspace scrubber. And in principle > > existing apps should see no difference. The amount of changes is consequently > > significantly smaller (or it looked as such in a quick PoC years back). > > This would work, and is easy to do, but: > * You now have a userspace daemon that depends on kernel-internal behavior. > * It has no way to track how much work is left to do or what needs > to be done (unless it is part of an application that is the only user > of hugetlbfs on the system). > > > > > Something extra on the top would perhaps be the ability so select a lookup > > heuristic such that we can pick the search method of > > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic > > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc) > > without too much of a dance. > > Again, that would probably work, but if you take a step back: you now > have a kernel behavior that can be guided in certain directions, but > no guarantees and no stats to see if things are working out. And an > explicit allocation method option (basically: take from the head or > the tail of the freelist). The picture is getting murkier. At least > with the patch I sent you have a clearly defined, optional, behavior > that can be switched on or off, and stats to see if it's working. > > I do understand the argument against having pre-zeroing not being > accounted to the current thread. I would counter that benefiting from > work by kernel threads is not unheard of in the kernel today already. > Also, the other proposals so far also have another thread doing the > zeroing - it just is explicitly started by userspace. So, the cost is > still not paid by the user of the pages. You just end up with > explicitly controlling who does pay the cost. Which I suppose is > better, but it's still not trivial to get it completely right (you > perhaps could do it at the container level with some trickery). > > What we have done so far is to bind the khzerod threads introduced in > this patch to CPUs in such a way that it doesn't interfere with the > rest of the system. Which you would also have to do with any userspace > solution. > > Again, this is optional - if you are a system manager who prefers to > have the resources used by zeroing hugetlb pages to be explicitly > accounted to the actual user, you can not enable this behavior (it's > off by default). > > I guess I can summarize my thoughts like this: while I understand the > argument against doing this outside of the context of the actual user > of the pages, this is 1) optional, and 2) so far the other solutions > introduce interfaces that I don't think are that great, or would > require maintaining a hugetlb 'shadow pool' in userspace through > hugetlbfs files. One more thing: any userspace solution has one basic problem: the hugetlb pages will be unavailable while they are being zeroed out, as the userspace process or thread will have to map+touch them, taking them off the freelist. So now another process that needs the hugetlb pages, expecting them to be there based on initial configuration and what it's done so far, may end up getting unexpected -ENOMEM because one or more pages have been temporarily allocated by userspace prezero threads. My patch doesn't have that issue - the pages stay on the freelist, the total number of available pages does not change. In the rare case that a freshly allocated page is being prezeroed, you'll just have to wait until it's done (taking up no more time than doing it yourself). Now, you can implement something like this in userspace (if I get -ENOMEM, check with the prezero thread or process), but it's not great. - Frank
Mateusz Guzik <mjguzik@gmail.com> writes: > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote: >> Fresh hugetlb pages are zeroed out when they are faulted in, >> just like with all other page types. This can take up a good >> amount of time for larger page sizes (e.g. around 40 >> milliseconds for a 1G page on a recent AMD-based system). >> >> This normally isn't a problem, since hugetlb pages are typically >> mapped by the application for a long time, and the initial >> delay when touching them isn't much of an issue. >> >> However, there are some use cases where a large number of hugetlb >> pages are touched when an application (such as a VM backed by these >> pages) starts. For 256 1G pages and 40ms per page, this would take >> 10 seconds, a noticeable delay. > > The current huge page zeroing code is not that great to begin with. Yeah definitely suboptimal. The current huge page zeroing code is both slow and it trashes the cache while zeroing. > There was a patchset posted some time ago to remedy at least some of it: > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/ > > but it apparently fell through the cracks. As Joao mentioned that got side tracked due to the preempt-lazy stuff. Now that lazy is in, I plan to follow up on the zeroing work. > Any games with "background zeroing" are notoriously crappy and I would > argue one should exhaust other avenues before going there -- at the end > of the day the cost of zeroing will have to get paid. Yeah and the background zeroing has dual cost: the cost in CPU time plus the indirect cost to other processes due to the trashing of L3 etc. Ankur
Joao Martins <joao.m.martins@oracle.com> writes: > On 02/12/2024 21:58, Mateusz Guzik wrote: >> On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote: >>> Fresh hugetlb pages are zeroed out when they are faulted in, >>> just like with all other page types. This can take up a good >>> amount of time for larger page sizes (e.g. around 40 >>> milliseconds for a 1G page on a recent AMD-based system). >>> >>> This normally isn't a problem, since hugetlb pages are typically >>> mapped by the application for a long time, and the initial >>> delay when touching them isn't much of an issue. >>> >>> However, there are some use cases where a large number of hugetlb >>> pages are touched when an application (such as a VM backed by these >>> pages) starts. For 256 1G pages and 40ms per page, this would take >>> 10 seconds, a noticeable delay. >> >> The current huge page zeroing code is not that great to begin with. >> >> There was a patchset posted some time ago to remedy at least some of it: >> https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/ >> >> but it apparently fell through the cracks. >> > > It didn't fell through the cracks for sure > > Just had a detour into preempt=auto before resuming the main work. But that > seems to be done in the last merge window with the lazy preempt stuff. I think > Ankur was planning on following that series above soon-ish. > > Adding him here, such that he keeps me honest :) Thanks for Ccing Joao. -- ankur
On Tue, Dec 3, 2024 at 4:05 PM Ankur Arora <ankur.a.arora@oracle.com> wrote: > > > Mateusz Guzik <mjguzik@gmail.com> writes: > > > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote: > >> Fresh hugetlb pages are zeroed out when they are faulted in, > >> just like with all other page types. This can take up a good > >> amount of time for larger page sizes (e.g. around 40 > >> milliseconds for a 1G page on a recent AMD-based system). > >> > >> This normally isn't a problem, since hugetlb pages are typically > >> mapped by the application for a long time, and the initial > >> delay when touching them isn't much of an issue. > >> > >> However, there are some use cases where a large number of hugetlb > >> pages are touched when an application (such as a VM backed by these > >> pages) starts. For 256 1G pages and 40ms per page, this would take > >> 10 seconds, a noticeable delay. > > > > The current huge page zeroing code is not that great to begin with. > > Yeah definitely suboptimal. The current huge page zeroing code is > both slow and it trashes the cache while zeroing. > > > There was a patchset posted some time ago to remedy at least some of it: > > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/ > > > > but it apparently fell through the cracks. > > As Joao mentioned that got side tracked due to the preempt-lazy stuff. > Now that lazy is in, I plan to follow up on the zeroing work. > > > Any games with "background zeroing" are notoriously crappy and I would > > argue one should exhaust other avenues before going there -- at the end > > of the day the cost of zeroing will have to get paid. > > Yeah and the background zeroing has dual cost: the cost in CPU time plus > the indirect cost to other processes due to the trashing of L3 etc. I'm not sure what you mean here - any caching side effects of zeroing happen regardless of who does it, right? It doesn't matter if it's a kthread or the calling thread. If you're concerned about the caching side effects in general, using non-temporal instructions helps (e.g. movnti on x86). See the link I mentioned for a patch that was sent years ago ( https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@google.com/ ). Using movnti on x86 definitely helps performance (up to 50% in my experiments). Which is great, but it still leaves considerable delay for the use case I mentioned. - Frank
Frank van der Linden <fvdl@google.com> writes: > On Tue, Dec 3, 2024 at 4:05 PM Ankur Arora <ankur.a.arora@oracle.com> wrote: >> >> >> Mateusz Guzik <mjguzik@gmail.com> writes: >> >> > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote: >> >> Fresh hugetlb pages are zeroed out when they are faulted in, >> >> just like with all other page types. This can take up a good >> >> amount of time for larger page sizes (e.g. around 40 >> >> milliseconds for a 1G page on a recent AMD-based system). >> >> >> >> This normally isn't a problem, since hugetlb pages are typically >> >> mapped by the application for a long time, and the initial >> >> delay when touching them isn't much of an issue. >> >> >> >> However, there are some use cases where a large number of hugetlb >> >> pages are touched when an application (such as a VM backed by these >> >> pages) starts. For 256 1G pages and 40ms per page, this would take >> >> 10 seconds, a noticeable delay. >> > >> > The current huge page zeroing code is not that great to begin with. >> >> Yeah definitely suboptimal. The current huge page zeroing code is >> both slow and it trashes the cache while zeroing. >> >> > There was a patchset posted some time ago to remedy at least some of it: >> > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/ >> > >> > but it apparently fell through the cracks. >> >> As Joao mentioned that got side tracked due to the preempt-lazy stuff. >> Now that lazy is in, I plan to follow up on the zeroing work. >> >> > Any games with "background zeroing" are notoriously crappy and I would >> > argue one should exhaust other avenues before going there -- at the end >> > of the day the cost of zeroing will have to get paid. >> >> Yeah and the background zeroing has dual cost: the cost in CPU time plus >> the indirect cost to other processes due to the trashing of L3 etc. > > I'm not sure what you mean here - any caching side effects of zeroing > happen regardless of who does it, right? Sure. > It doesn't matter if it's a > kthread or the calling thread. As other people point out it's more a matter of accruing it to the right context. The noise will always spill over but userspace can use cpu cgroups etc to to limit how far these effects are seen. Additionally, this kthread will be doing bulk zeroing while a user thread would zero as needed. Though I guess for the VM prefaulting case it's likely similar. > If you're concerned about the caching side effects in general, using > non-temporal instructions helps (e.g. movnti on x86). See the link I > mentioned for a patch that was sent years ago ( > https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@google.com/ > ). Using movnti on x86 definitely helps performance (up to 50% in my > experiments). Which is great, but it still leaves considerable delay > for the use case I mentioned. In my testing at least on AMD you can get a lot more than 50% improvement. See for instance the CLZERO (or the REP STOS) numbers here: https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/ -- ankur
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 5cf327337e22..7a6749959372 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -893,8 +893,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, error = PTR_ERR(folio); goto out; } - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); - __folio_mark_uptodate(folio); + clear_hugetlb_folio(folio, addr); error = hugetlb_add_to_page_cache(folio, mapping, index); if (unlikely(error)) { restore_reserve_on_error(h, &pseudo_vma, addr, folio); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index e4697539b665..31d56190a4fb 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -586,6 +586,17 @@ generic_hugetlb_get_unmapped_area(struct file *file, unsigned long addr, * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed. * HPG_raw_hwp_unreliable - Set when the hugetlb page has a hwpoison sub-page * that is not tracked by raw_hwp_page list. + * HPG_pre_zeroed - page was pre-zeroed, clear_huge_page not needed. + * Synchronization: hugetlb_lock held when set by pre-zero kthread. + * Only valid to read outside hugetlb_lock once the page is off + * the freelist, and HPG_zero_busy is clear. Always cleared when a + * page is put (back) on the freelist. + * HPG_zero_busy - page is being zeroed by the pre-zero kthread. + * Synchronization: set and cleared by the pre-zero kthread with + * hugetlb_lock held. Access by others is read-only. Once the page + * is off the freelist, this can only change from set -> clear, + * which the new page owner must wait for. Always cleared + * when a page is put (back) on the freelist. */ enum hugetlb_page_flags { HPG_restore_reserve = 0, @@ -594,6 +605,8 @@ enum hugetlb_page_flags { HPG_freed, HPG_vmemmap_optimized, HPG_raw_hwp_unreliable, + HPG_pre_zeroed, + HPG_zero_busy, __NR_HPAGEFLAGS, }; @@ -653,6 +666,8 @@ HPAGEFLAG(Temporary, temporary) HPAGEFLAG(Freed, freed) HPAGEFLAG(VmemmapOptimized, vmemmap_optimized) HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable) +HPAGEFLAG(PreZeroed, pre_zeroed) +HPAGEFLAG(ZeroBusy, zero_busy) #ifdef CONFIG_HUGETLB_PAGE @@ -678,6 +693,19 @@ struct hstate { unsigned int nr_huge_pages_node[MAX_NUMNODES]; unsigned int free_huge_pages_node[MAX_NUMNODES]; unsigned int surplus_huge_pages_node[MAX_NUMNODES]; + + unsigned long free_huge_pages_zero; + unsigned int free_huge_pages_zero_node[MAX_NUMNODES]; + + /* Wait queue for the prezero thread */ + wait_queue_head_t hzerod_wait[MAX_NUMNODES]; + /* Queue to wait for a hugetlb folio that is being prezeroed */ + wait_queue_head_t dqzero_wait[MAX_NUMNODES]; + /* Prezero threads (one per node) */ + struct task_struct *hzerod[MAX_NUMNODES]; + + bool prezero_enabled; + char name[HSTATE_NAME_LEN]; }; @@ -699,6 +727,7 @@ int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping pgoff_t idx); void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, unsigned long address, struct folio *folio); +void clear_hugetlb_folio(struct folio *folio, unsigned long address); /* arch callback */ int __init __alloc_bootmem_huge_page(struct hstate *h, int nid); @@ -1035,6 +1064,9 @@ void hugetlb_unregister_node(struct node *node); */ bool is_raw_hwpoison_page_in_hugepage(struct page *page); +void khzerod_run(int nid); +void khzerod_stop(int nid); + #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; @@ -1239,6 +1271,18 @@ static inline bool hugetlbfs_pagecache_present( { return false; } + +static inline void khzerod_run(int nid) +{ +} + +static inline void khzerod_stop(int nid) +{ +} + +static inline void clear_hugetlb_folio(struct folio *folio, unsigned long address) +{ +} #endif /* CONFIG_HUGETLB_PAGE */ static inline spinlock_t *huge_pte_lock(struct hstate *h, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 190fa05635f4..0c9a80851bf5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -44,6 +44,8 @@ #include <linux/io.h> #include <linux/hugetlb.h> #include <linux/hugetlb_cgroup.h> +#include <linux/memcontrol.h> +#include <linux/memory_hotplug.h> #include <linux/node.h> #include <linux/page_owner.h> #include "internal.h" @@ -68,6 +70,20 @@ static bool __initdata parsed_valid_hugepagesz = true; static bool __initdata parsed_default_hugepagesz; static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata; +static void khzerod_wakeup_node(struct hstate *h, int nid); +static void hpage_wait_zerobusy(struct hstate *h, struct folio *folio); +static void khzerod_run_all(void); +static void khzerod_run_hstate(struct hstate *h); + +/* + * Mutex to protect prezero kthread stopping / starting. + * This is really per <hstate,nid> tuple, but since this + * only happens when prezeroing is enabled/disabled via + * sysfs (rarely), a mutex per <hstate,nid> would be + * overkill. + */ +static DEFINE_MUTEX(prezero_chg_lock); + /* * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages, * free_huge_pages, and surplus_huge_pages. @@ -1309,6 +1325,16 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg) return false; } +/* + * Clear flags for either a fresh page or one that is being + * added to the free list. + */ +static inline void prep_clear_prezero(struct folio *folio) +{ + folio_clear_hugetlb_pre_zeroed(folio); + folio_clear_hugetlb_zero_busy(folio); +} + static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio) { int nid = folio_nid(folio); @@ -1316,14 +1342,16 @@ static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio) lockdep_assert_held(&hugetlb_lock); VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); - list_move(&folio->lru, &h->hugepage_freelists[nid]); + list_move_tail(&folio->lru, &h->hugepage_freelists[nid]); h->free_huge_pages++; h->free_huge_pages_node[nid]++; + prep_clear_prezero(folio); folio_set_hugetlb_freed(folio); + khzerod_wakeup_node(h, nid); } -static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h, - int nid) +static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h, int nid, + gfp_t gfp_mask) { struct folio *folio; bool pin = !!(current->flags & PF_MEMALLOC_PIN); @@ -1333,6 +1361,16 @@ static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h, if (pin && !folio_is_longterm_pinnable(folio)) continue; + /* + * This shouldn't happen, as hugetlb pages are never allocated + * with GFP_ATOMIC. But be paranoid and check for it, as + * a zero_busy page might cause a sleep later in + * hpage_wait_zerobusy(). + */ + if (WARN_ON_ONCE(folio_test_hugetlb_zero_busy(folio) && + !gfpflags_allow_blocking(gfp_mask))) + continue; + if (folio_test_hwpoison(folio)) continue; @@ -1341,6 +1379,12 @@ static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h, folio_clear_hugetlb_freed(folio); h->free_huge_pages--; h->free_huge_pages_node[nid]--; + if (folio_test_hugetlb_pre_zeroed(folio) || + folio_test_hugetlb_zero_busy(folio)) { + h->free_huge_pages_zero_node[nid]--; + h->free_huge_pages_zero--; + } + return folio; } @@ -1377,7 +1421,7 @@ static struct folio *dequeue_hugetlb_folio_nodemask(struct hstate *h, gfp_t gfp_ continue; node = zone_to_nid(zone); - folio = dequeue_hugetlb_folio_node_exact(h, node); + folio = dequeue_hugetlb_folio_node_exact(h, node, gfp_mask); if (folio) return folio; } @@ -1605,6 +1649,17 @@ static void remove_hugetlb_folio(struct hstate *h, struct folio *folio, folio_clear_hugetlb_freed(folio); h->free_huge_pages--; h->free_huge_pages_node[nid]--; + folio_clear_hugetlb_freed(folio); + } + /* + * Adjust the zero page counters now. Note that + * if a page is currently being zeroed, that + * will be waited for in update_and_free_page() + */ + if (folio_test_hugetlb_pre_zeroed(folio) || + folio_test_hugetlb_zero_busy(folio)) { + h->free_huge_pages_zero--; + h->free_huge_pages_zero_node[nid]--; } if (adjust_surplus) { h->surplus_huge_pages--; @@ -1658,6 +1713,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, { bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio); + VM_BUG_ON_FOLIO(folio_test_hugetlb_zero_busy(folio), folio); + if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return; @@ -1743,6 +1800,7 @@ static void free_hpage_workfn(struct work_struct *work) */ h = size_to_hstate(folio_size(folio)); + hpage_wait_zerobusy(h, folio); __update_and_free_hugetlb_folio(h, folio); cond_resched(); @@ -1759,7 +1817,8 @@ static inline void flush_free_hpage_work(struct hstate *h) static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio, bool atomic) { - if (!folio_test_hugetlb_vmemmap_optimized(folio) || !atomic) { + if ((!folio_test_hugetlb_zero_busy(folio) && + !folio_test_hugetlb_vmemmap_optimized(folio)) || !atomic) { __update_and_free_hugetlb_folio(h, folio); return; } @@ -1974,6 +2033,8 @@ static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int ni { __prep_new_hugetlb_folio(h, folio); spin_lock_irq(&hugetlb_lock); + folio_clear_hugetlb_freed(folio); + prep_clear_prezero(folio); __prep_account_new_huge_page(h, nid); spin_unlock_irq(&hugetlb_lock); } @@ -2419,6 +2480,13 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, preferred_nid, nmask); if (folio) { spin_unlock_irq(&hugetlb_lock); + /* + * The contents of this page will be completely + * overwritten immediately, as its a migration + * target, so no clearing is needed. Do wait in + * case khzerod was working on it, though. + */ + hpage_wait_zerobusy(h, folio); return folio; } } @@ -3066,6 +3134,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, spin_unlock_irq(&hugetlb_lock); + hpage_wait_zerobusy(h, folio); + hugetlb_set_folio_subpool(folio, spool); map_commit = vma_commit_reservation(h, vma, addr); @@ -3503,6 +3573,8 @@ static void __init hugetlb_init_hstates(void) h->demote_order = h2->order; } } + + khzerod_run_all(); } static void __init report_hugepages(void) @@ -4184,15 +4256,72 @@ static ssize_t demote_size_store(struct kobject *kobj, } HSTATE_ATTR(demote_size); +static ssize_t free_hugepages_zero_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct hstate *h; + unsigned long free_huge_pages_zero; + int nid; + + h = kobj_to_hstate(kobj, &nid); + if (nid == NUMA_NO_NODE) + free_huge_pages_zero = h->free_huge_pages_zero; + else + free_huge_pages_zero = h->free_huge_pages_zero_node[nid]; + + return sprintf(buf, "%lu\n", free_huge_pages_zero); +} +HSTATE_ATTR_RO(free_hugepages_zero); + +static ssize_t prezero_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + struct hstate *h = kobj_to_hstate(kobj, NULL); + + return sprintf(buf, "%d\n", h->prezero_enabled ? 1 : 0); +} + +static ssize_t prezero_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + struct hstate *h; + int err; + long val; + bool prezero_enabled; + + err = kstrtol(buf, 10, &val); + if (val != 0 && val != 1) + return -EINVAL; + + prezero_enabled = !!val; + + h = kobj_to_hstate(kobj, NULL); + if (prezero_enabled == h->prezero_enabled) + return count; + + h->prezero_enabled = prezero_enabled; + + mem_hotplug_begin(); + khzerod_run_hstate(h); + mem_hotplug_done(); + + return count; +} +HSTATE_ATTR(prezero_enabled); + static struct attribute *hstate_attrs[] = { &nr_hugepages_attr.attr, &nr_overcommit_hugepages_attr.attr, &free_hugepages_attr.attr, &resv_hugepages_attr.attr, &surplus_hugepages_attr.attr, + &free_hugepages_zero_attr.attr, #ifdef CONFIG_NUMA &nr_hugepages_mempolicy_attr.attr, #endif + &prezero_enabled_attr.attr, NULL, }; @@ -4265,6 +4394,7 @@ static struct node_hstate node_hstates[MAX_NUMNODES]; static struct attribute *per_node_hstate_attrs[] = { &nr_hugepages_attr.attr, &free_hugepages_attr.attr, + &free_hugepages_zero_attr.attr, &surplus_hugepages_attr.attr, NULL, }; @@ -4501,6 +4631,232 @@ bool __init __attribute((weak)) arch_hugetlb_valid_size(unsigned long size) return size == HPAGE_SIZE; } +struct khzerod_info { + struct hstate *h; + int nid; +}; + +/* + * Clear a hugetlb page. + * + * The caller has already made sure that the page is not + * being actively zeroed out in the background. + * + * If it wasn't zeroed out, do it ourselves. + */ +void clear_hugetlb_folio(struct folio *folio, unsigned long address) +{ + if (!folio_test_hugetlb_pre_zeroed(folio)) + folio_zero_user(folio, address); + + __folio_mark_uptodate(folio); +} + +/* + * Once a page has been taken off the freelist, the new page owner + * must wait for the pre-zero kthread to finish if it happens + * to be working on this page (which should be rare). + */ +static void hpage_wait_zerobusy(struct hstate *h, struct folio *folio) +{ + if (!folio_test_hugetlb_zero_busy(folio)) + return; + + spin_lock_irq(&hugetlb_lock); + + wait_event_cmd(h->dqzero_wait[folio_nid(folio)], + !folio_test_hugetlb_zero_busy(folio), + spin_unlock_irq(&hugetlb_lock), + spin_lock_irq(&hugetlb_lock)); + + spin_unlock_irq(&hugetlb_lock); +} + +/* + * This per-node-per-hstate kthread pre-zeroes pages that are on + * the freelist. They remain on the freelist while this is being + * done. When pre-zeroing is done, they are moved to the head + * of the list. + * + * Pages are left on the freelist because an allocation should not + * fail just because the page is being prezeroed. In the rare + * corner case that a page that is being worked on by this + * thread is taken as part of an allocation, the caller will + * wait for the prezero to finish (see hpage_wait_zerobusy). + */ +static int khzerod(void *p) +{ + struct khzerod_info *ki = p; + struct hstate *h = ki->h; + int nid = ki->nid; + unsigned int *freep, *zerop; + struct folio *folio; + const struct cpumask *cpumask = cpumask_of_node(nid); + struct list_head *freelist; + + if (!cpumask_empty(cpumask)) + set_cpus_allowed_ptr(current, cpumask); + + freep = &h->free_huge_pages_node[nid]; + zerop = &h->free_huge_pages_zero_node[nid]; + freelist = &h->hugepage_freelists[nid]; + + do { + wait_event_interruptible(h->hzerod_wait[nid], + *zerop < *freep || kthread_should_stop()); + + if (kthread_should_stop()) + break; + + spin_lock_irq(&hugetlb_lock); + if (*zerop == *freep || list_empty(freelist)) { + spin_unlock_irq(&hugetlb_lock); + continue; + } + + folio = list_last_entry(freelist, struct folio, lru); + if (folio_test_hugetlb_pre_zeroed(folio)) { + spin_unlock_irq(&hugetlb_lock); + continue; + } + + folio_set_hugetlb_zero_busy(folio); + /* + * Incrementing this here is a bit of a fib, since + * the page hasn't been cleared yet (it will be done + * immediately after dropping the lock below). But + * it keeps the count consistent with the overall + * free count in case the page gets taken off the + * freelist while we're working on it. + */ + (*zerop)++; + h->free_huge_pages_zero++; + spin_unlock_irq(&hugetlb_lock); + + /* + * HWPoison pages may show up on the freelist. + * Don't try to zero it out, but do set the flag + * and counts, so that we don't consider it again. + */ + if (!folio_test_hwpoison(folio)) + folio_zero_user(folio, 0); + + spin_lock_irq(&hugetlb_lock); + folio_set_hugetlb_pre_zeroed(folio); + folio_clear_hugetlb_zero_busy(folio); + + /* + * If the page is still on the free list, move + * it to the head. + */ + if (folio_test_hugetlb_freed(folio)) + list_move(&folio->lru, freelist); + + /* + * If someone was waiting for the zero to + * finish, wake them up. + */ + if (waitqueue_active(&h->dqzero_wait[nid])) + wake_up(&h->dqzero_wait[nid]); + spin_unlock_irq(&hugetlb_lock); + + } while (1); + + kfree(ki); + return 0; +} + +static void khzerod_run_hstate_nid(struct hstate *h, int nid) +{ + struct khzerod_info *ki; + struct task_struct *t; + + mutex_lock(&prezero_chg_lock); + + if (!h->prezero_enabled) { + if (h->hzerod[nid] != NULL) { + kthread_stop(h->hzerod[nid]); + h->hzerod[nid] = NULL; + } + goto out; + } + + if (h->hzerod[nid] != NULL) + goto out; + + ki = kmalloc(sizeof(*ki), GFP_KERNEL); + if (ki == NULL) + goto out; + ki->h = h; + ki->nid = nid; + t = kthread_run(khzerod, ki, "khzerod%d-%lukB", + nid, huge_page_size(h) / 1024); + if (IS_ERR(t)) { + kfree(ki); + pr_err("could not run khzerod on node %d for size %lukB\n", + nid, huge_page_size(h) / 1024); + } else + h->hzerod[nid] = t; + +out: + mutex_unlock(&prezero_chg_lock); +} + +static void khzerod_stop_hstate_nid(struct hstate *h, int nid) +{ + mutex_lock(&prezero_chg_lock); + + if (h->hzerod[nid] != NULL) { + kthread_stop(h->hzerod[nid]); + h->hzerod[nid] = NULL; + } + + mutex_unlock(&prezero_chg_lock); +} + +/* + * (re-)run all pre-zero kthreads for a node. kthreads + * that are currently running, but should no longer run + * (because the prezero enable flag was changed) are stopped. + */ +void khzerod_run(int nid) +{ + struct hstate *h; + + for_each_hstate(h) + khzerod_run_hstate_nid(h, nid); +} + +static void khzerod_run_all(void) +{ + int nid; + + for_each_node_state(nid, N_MEMORY) + khzerod_run(nid); +} + +static void khzerod_run_hstate(struct hstate *h) +{ + int nid; + + for_each_node_state(nid, N_MEMORY) + khzerod_run_hstate_nid(h, nid); +} + +void khzerod_stop(int nid) +{ + struct hstate *h; + + for_each_hstate(h) + khzerod_stop_hstate_nid(h, nid); +} + +static void khzerod_wakeup_node(struct hstate *h, int nid) +{ + if (h->hzerod[nid]) + wake_up(&h->hzerod_wait[nid]); +} + void __init hugetlb_add_hstate(unsigned int order) { struct hstate *h; @@ -4515,8 +4871,11 @@ void __init hugetlb_add_hstate(unsigned int order) __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key); h->order = order; h->mask = ~(huge_page_size(h) - 1); - for (i = 0; i < MAX_NUMNODES; ++i) + for (i = 0; i < MAX_NUMNODES; ++i) { INIT_LIST_HEAD(&h->hugepage_freelists[i]); + init_waitqueue_head(&h->hzerod_wait[i]); + init_waitqueue_head(&h->dqzero_wait[i]); + } INIT_LIST_HEAD(&h->hugepage_activelist); h->next_nid_to_alloc = first_memory_node; h->next_nid_to_free = first_memory_node; @@ -6139,8 +6498,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, ret = 0; goto out; } - folio_zero_user(folio, vmf->real_address); - __folio_mark_uptodate(folio); + clear_hugetlb_folio(folio, vmf->address); new_folio = true; if (vma->vm_flags & VM_MAYSHARE) { diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 621ae1015106..8398e4d0130d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1217,6 +1217,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, kswapd_run(nid); kcompactd_run(nid); + khzerod_run(nid); writeback_set_ratelimit(); @@ -2092,6 +2093,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, if (arg.status_change_nid >= 0) { kcompactd_stop(node); kswapd_stop(node); + khzerod_stop(node); } writeback_set_ratelimit();
Fresh hugetlb pages are zeroed out when they are faulted in, just like with all other page types. This can take up a good amount of time for larger page sizes (e.g. around 40 milliseconds for a 1G page on a recent AMD-based system). This normally isn't a problem, since hugetlb pages are typically mapped by the application for a long time, and the initial delay when touching them isn't much of an issue. However, there are some use cases where a large number of hugetlb pages are touched when an application (such as a VM backed by these pages) starts. For 256 1G pages and 40ms per page, this would take 10 seconds, a noticeable delay. To make the above scenario faster, optionally pre-zero hugetlb pages. This is done by a kthread for each hstate/node combination that is eligible for pre-zeroing. If there are pages to be pre-zeroed, the kthread starts at the tail of the per-node free list, and zeroes out pages, walking the list backwards until done. To make this logic work, new allocations are always taken from the head, and freed pages are added to the tail. After a page has been pre-zeroed, it is moved to the head of the list. The kthread runs on init, and is also woken up when a page is put back on the free list. A thread that allocates a hugetlb page can examine the prezero flag to see whether it's been zeroed out already. If this is not set, it must do so itself, as before. For the rare corner where the kthread is busy zeroing out the only page available on the freelist, the caller must wait, which is done via a per-hstate wait queue. Prezeroing can be switched on or off per hugetlb page size (hstate) by a sysfs value, /sys/kernel/mm/hugepages/hugepage-XXX/prezero_enabled If this is set to 0 (default), no pre-zeroing is done. If it's set to 1, all hugetlb pages of that size are pre-zeroed. A new sysfs value, free_huge_pages_zero, tracks the number of free prezeroed pages per hstate. Some test results: a simple test with a loop that maps new hugetlb pages and touches them, run on an AMD Milan system, touching 512 1G pages. It records the time in microseconds that a new touch (fault) takes, and the number of hits, misses and waits. Without prezeroing With prezeroing total time 20253322us total time 760us average 39557us average 1us min 38199us min 0us max 41347us max 10us prezero hits 0 prezero hits 512 prezero misses 512 prezero misses 0 prezero waits 0 prezero waits 0 Repeating the loop 16 times, meaning the background zeroing is not yet complete when the loop restarts (so the worst case scenario for prezeroing with these parameters): Without prezeroing With prezeroing total time 321927564us total time 111980668us average 39297us average 13669us min 38174us min 0us max 41935us max 41876us prezero hits 0 prezero hits 5299 prezero misses 8192 prezero misses 2877 prezero waits 0 prezero waits 16 So, even in this scenario, the average fault time is still quite a bit faster, and the maximum fault time is never bigger than without prezeroing. Signed-off-by: Frank van der Linden <fvdl@google.com> --- fs/hugetlbfs/inode.c | 3 +- include/linux/hugetlb.h | 44 +++++ mm/hugetlb.c | 374 +++++++++++++++++++++++++++++++++++++++- mm/memory_hotplug.c | 2 + 4 files changed, 413 insertions(+), 10 deletions(-)