mm/hugetlb: optionally pre-zero hugetlb pages

Message ID 20241202202058.3249628-1-fvdl@google.com (mailing list archive)
State New

Commit Message

Frank van der Linden Dec. 2, 2024, 8:20 p.m. UTC
Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 40
milliseconds for a 1G page on a recent AMD-based system).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial
delay when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by these
pages) starts. For 256 1G pages and 40ms per page, this would take
10 seconds, a noticeable delay.

To make the above scenario faster, optionally pre-zero hugetlb
pages. This is done by a kthread for each hstate/node combination
that is eligible for pre-zeroing.

If there are pages to be pre-zeroed, the kthread starts at the tail
of the per-node free list, and zeroes out pages, walking the list
backwards until done. To make this logic work, new allocations
are always taken from the head, and freed pages are added to the tail.
After a page has been pre-zeroed, it is moved to the head of the list.

The kthread runs on init, and is also woken up when a page is put
back on the free list.

A thread that allocates a hugetlb page can examine the prezero
flag to see whether it's been zeroed out already. If this is
not set, it must do so itself, as before.

For the rare corner case where the kthread is busy zeroing out the only
page available on the freelist, the caller must wait, which is done
via a per-hstate wait queue.
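
The list discipline described above can be modelled in user space. The
following is only a sketch and not code from the patch: struct page_model,
struct free_list and the fl_*() helpers are hypothetical names, the "zeroed"
field stands in for the prezero flag, and locking and the per-hstate wait
queue are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of one per-node hugetlb free list. */
struct page_model {
	struct page_model *prev, *next;
	bool zeroed;			/* stands in for the prezero flag */
};

struct free_list {
	struct page_model head;		/* sentinel: head.next is first, head.prev is last */
};

static void fl_init(struct free_list *fl)
{
	fl->head.prev = fl->head.next = &fl->head;
	fl->head.zeroed = false;
}

static bool fl_empty(const struct free_list *fl)
{
	return fl->head.next == &fl->head;
}

static void fl_insert_between(struct page_model *p,
			      struct page_model *prev,
			      struct page_model *next)
{
	p->prev = prev;
	p->next = next;
	prev->next = p;
	next->prev = p;
}

static void fl_unlink(struct page_model *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
	p->prev = p->next = NULL;
}

/* Freed pages are treated as dirty and are added to the tail. */
static void fl_free_page(struct free_list *fl, struct page_model *p)
{
	p->zeroed = false;
	fl_insert_between(p, fl->head.prev, &fl->head);
}

/* Allocations always come from the head; returns NULL if the list is empty. */
static struct page_model *fl_alloc_page(struct free_list *fl)
{
	struct page_model *p;

	if (fl_empty(fl))
		return NULL;
	p = fl->head.next;
	fl_unlink(p);
	return p;	/* caller checks p->zeroed and zeroes the page itself if unset */
}

/*
 * One step of the background thread: if the tail page is still dirty,
 * "zero" it and move it to the head. Returns false once the tail page is
 * already zeroed (or the list is empty), i.e. when there is no work left.
 */
static bool fl_prezero_step(struct free_list *fl)
{
	struct page_model *p = fl->head.prev;

	if (fl_empty(fl) || p->zeroed)
		return false;
	fl_unlink(p);
	p->zeroed = true;	/* real code would clear the page contents here */
	fl_insert_between(p, &fl->head, fl->head.next);
	return true;
}
```

In this model the zeroed pages always form a prefix at the head and the dirty
pages a suffix at the tail, which is why looking only at the tail is enough
for the background step to know when it is done.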

Prezeroing can be switched on or off per hugetlb page size (hstate)
by a sysfs value, /sys/kernel/mm/hugepages/hugepage-XXX/prezero_enabled.
If this is set to 0 (default), no pre-zeroing is done. If
it's set to 1, all hugetlb pages of that size are pre-zeroed.

A new sysfs value, free_huge_pages_zero, tracks the number of
free prezeroed pages per hstate.
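
For illustration, a monitoring tool could read these two values roughly as
sketched below. This assumes a kernel with this patch applied (neither file
exists otherwise) and the usual mainline directory naming
hugepages-<size>kB, which the "hugepage-XXX" above abbreviates. The helper
names read_long_file() and hstate_attr_path() are made up for this example.

```c
#include <stdio.h>

/* Read a single integer from a sysfs-style file. Returns 0 on success, -1 on failure. */
static int read_long_file(const char *path, long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Build the path of a per-hstate attribute, for example
 * hstate_attr_path(buf, sizeof(buf), 1048576, "free_huge_pages_zero").
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int hstate_attr_path(char *buf, size_t len,
			    unsigned long size_kb, const char *attr)
{
	int n = snprintf(buf, len,
			 "/sys/kernel/mm/hugepages/hugepages-%lukB/%s",
			 size_kb, attr);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

A caller would build the path for "prezero_enabled" or
"free_huge_pages_zero" for the page size of interest and treat a failed read
as "prezeroing not supported on this kernel".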

Some test results: a simple test with a loop that maps new
hugetlb pages and touches them, run on an AMD Milan system,
touching 512 1G pages. It records the time in microseconds
that a new touch (fault) takes, and the number of hits,
misses and waits.

                 Without prezeroing   With prezeroing

total time       20253322us           760us
average          39557us              1us
min              38199us              0us
max              41347us              10us
prezero hits     0                    512
prezero misses   512                  0
prezero waits    0                    0

Repeating the loop 16 times, meaning the background zeroing is not yet
complete when the loop restarts (so the worst case scenario for
prezeroing with these parameters):

                 Without prezeroing   With prezeroing

total time       321927564us          111980668us
average          39297us              13669us
min              38174us              0us
max              41935us              41876us
prezero hits     0                    5299
prezero misses   8192                 2877
prezero waits    0                    16

So, even in this scenario, the average fault time is still quite
a bit faster, and the maximum fault time is never bigger than
without prezeroing.
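
The figures above can be cross-checked with a few lines of arithmetic. The
helpers below are purely illustrative (their names do not come from the
patch): the averages are the total time divided by the number of faults,
rounded down, and the 10 second estimate from the introduction is simply
256 pages at 40ms each.

```c
#include <stdint.h>

/* Average fault time in microseconds, rounded down as in the tables. */
static uint64_t avg_fault_us(uint64_t total_us, uint64_t faults)
{
	return faults ? total_us / faults : 0;
}

/* Time spent zeroing inline at fault time, in milliseconds. */
static uint64_t inline_zero_ms(uint64_t pages, uint64_t ms_per_page)
{
	return pages * ms_per_page;
}

/* Every fault is either a prezero hit, a miss or a wait. */
static uint64_t total_faults(uint64_t hits, uint64_t misses, uint64_t waits)
{
	return hits + misses + waits;
}
```

With the numbers from the second run, total_faults(5299, 2877, 16) gives
8192, which matches 16 iterations over 512 pages, and about 65% of the
faults found a page that was already zeroed.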

Signed-off-by: Frank van der Linden <fvdl@google.com>
---
 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  44 +++++
 mm/hugetlb.c            | 374 +++++++++++++++++++++++++++++++++++++++-
 mm/memory_hotplug.c     |   2 +
 4 files changed, 413 insertions(+), 10 deletions(-)

Comments

Mateusz Guzik Dec. 2, 2024, 9:58 p.m. UTC | #1
On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
> 
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
> 
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.

The current huge page zeroing code is not that great to begin with.

There was a patchset posted some time ago to remedy at least some of it:
https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/

but it apparently fell through the cracks.

Any games with "background zeroing" are notoriously crappy and I would
argue one should exhaust other avenues before going there -- at the end
of the day the cost of zeroing will have to get paid.

To that end I would suggest picking up the patchset and experimenting
with more variants of the zeroing code (for example for 1G it may be it
is faster to employ SIMD usage in the routine).

If this is really such a problem I wonder if this could start as a
series of 2MB pages instead faulted as needed, eventually promoted to
1G after passing some threshold?
kernel test robot Dec. 2, 2024, 10:07 p.m. UTC | #2
Hi Frank,

kernel test robot noticed the following build warnings:

[auto build test WARNING on v6.12]
[cannot apply to akpm-mm/mm-everything brauner-vfs/vfs.all linus/master v6.13-rc1 next-20241128]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Frank-van-der-Linden/mm-hugetlb-optionally-pre-zero-hugetlb-pages/20241203-042817
base:   v6.12
patch link:    https://lore.kernel.org/r/20241202202058.3249628-1-fvdl%40google.com
patch subject: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages
config: powerpc-ps3_defconfig (https://download.01.org/0day-ci/archive/20241203/202412030519.W14yll4e-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241203/202412030519.W14yll4e-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412030519.W14yll4e-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/hugetlb.c: In function 'prezero_enabled_store':
>> mm/hugetlb.c:4290:13: warning: variable 'err' set but not used [-Wunused-but-set-variable]
    4290 |         int err;
         |             ^~~


vim +/err +4290 mm/hugetlb.c

  4284	
  4285	static ssize_t prezero_enabled_store(struct kobject *kobj,
  4286						  struct kobj_attribute *attr,
  4287						  const char *buf, size_t count)
  4288	{
  4289		struct hstate *h;
> 4290		int err;
  4291		long val;
  4292		bool prezero_enabled;
  4293	
  4294		err = kstrtol(buf, 10, &val);
  4295		if (val != 0 && val != 1)
  4296			return -EINVAL;
  4297	
  4298		prezero_enabled = !!val;
  4299	
  4300		h = kobj_to_hstate(kobj, NULL);
  4301		if (prezero_enabled == h->prezero_enabled)
  4302			return count;
  4303	
  4304		h->prezero_enabled = prezero_enabled;
  4305	
  4306		mem_hotplug_begin();
  4307		khzerod_run_hstate(h);
  4308		mem_hotplug_done();
  4309	
  4310		return count;
  4311	}
  4312	HSTATE_ATTR(prezero_enabled);
  4313
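
The warning points at a real problem in the listing above: if kstrtol()
fails, val is read without having been set. In the kernel the minimal change
would presumably be to return err right after the kstrtol() call when it is
non-zero. The user-space stand-in below sketches the same idea with
strtol(); parse_prezero_value() is a hypothetical name, and it is not a
drop-in replacement for the kernel helper (for one, strtol() accepts leading
whitespace where kstrtol() does not).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * User-space stand-in for the parsing step of prezero_enabled_store().
 * Returns 0 and sets *enabled on success, -EINVAL otherwise.
 */
static int parse_prezero_value(const char *buf, bool *enabled)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -EINVAL;		/* conversion failed: do not look at val */
	if (*end == '\n')		/* sysfs writes usually end in a newline */
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage after the number */
	if (val != 0 && val != 1)
		return -EINVAL;

	*enabled = (val == 1);
	return 0;
}
```

The important part is the ordering: the conversion result is checked before
val is compared against 0 and 1, so a failed parse can never fall through to
the range check with an uninitialized value.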
Frank van der Linden Dec. 2, 2024, 10:50 p.m. UTC | #3
On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
> > Fresh hugetlb pages are zeroed out when they are faulted in,
> > just like with all other page types. This can take up a good
> > amount of time for larger page sizes (e.g. around 40
> > milliseconds for a 1G page on a recent AMD-based system).
> >
> > This normally isn't a problem, since hugetlb pages are typically
> > mapped by the application for a long time, and the initial
> > delay when touching them isn't much of an issue.
> >
> > However, there are some use cases where a large number of hugetlb
> > pages are touched when an application (such as a VM backed by these
> > pages) starts. For 256 1G pages and 40ms per page, this would take
> > 10 seconds, a noticeable delay.
>
> The current huge page zeroing code is not that great to begin with.
>
> There was a patchset posted some time ago to remedy at least some of it:
> https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/
>
> but it apparently fell through the cracks.

Hi Mateusz, thanks for your reply.

I am aware of that patch set, yes. The discussion around it evolved
into one about kernel preemption and the evilness of cond_resched().

You can certainly improve the time it takes to zero out a 1G page by
optimizing the code that does it. See also, for example,
https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@google.com/

However, while, say, a 50% improvement in zero out time, at the max,
is nice, this still leaves the faulting process spending considerable
time doing it. Like you say, that's cost that needs to be paid - but
it would be good to avoid paying it inline. This patch avoids doing
that altogether, leading to a basically 100% improvement under
reasonably good circumstances.

>
> Any games with "background zeroing" are notoriously crappy and I would
> argue one should exhaust other avenues before going there -- at the end
> of the day the cost of zeroing will have to get paid.

I understand that the concept of background prezeroing has been, and
will be, met with some resistance. But, do you have any specific
concerns with the patch I posted? It's pretty well isolated from the
rest of the code, and optional.

>
> To that end I would suggest picking up the patchset and experimenting
> with more variants of the zeroing code (for example for 1G it may be it
> is faster to employ SIMD usage in the routine).

See above - happy to pick up older patch(es) as a separate effort, but
they won't fully solve the issue for the scenario I'm describing.

>
> If this is really such a problem I wonder if this could start as a
> series of 2MB pages instead faulted as needed, eventually promoted to
> 1G after passing some threshold?

This idea sounds similar to HGM (high granularity mapping), an idea
which was originally posted for the purpose of live migration of VMs
(but never made it in). It's not trivial, and seems like overkill.
Again, my patch is non-invasive and optional, so I think it's better
in that regard.

- Frank
Michal Hocko Dec. 3, 2024, 12:06 p.m. UTC | #4
On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > Any games with "background zeroing" are notoriously crappy and I would
> > argue one should exhaust other avenues before going there -- at the end
> > of the day the cost of zeroing will have to get paid.
> 
> I understand that the concept of background prezeroing has been, and
> will be, met with some resistance. But, do you have any specific
> concerns with the patch I posted? It's pretty well isolated from the
> rest of the code, and optional.

The biggest concern I have is that the overhead is paid by everybody on
the system - it is considered to be a system overhead even though only
part of the workload benefits from hugetlb pages. In other words, the
workload using those pages is not fully accounted for their use.

If the startup latency is a real problem, is there a way to work around
that in userspace by preallocating hugetlb pages ahead of time,
before those VMs are launched, and handing over the already preallocated pages?
David Hildenbrand Dec. 3, 2024, 12:17 p.m. UTC | #5
On 03.12.24 13:06, Michal Hocko wrote:
> On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
>> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> Any games with "background zeroing" are notoriously crappy and I would
>>> argue one should exhaust other avenues before going there -- at the end
>>> of the day the cost of zeroing will have to get paid.
>>
>> I understand that the concept of background prezeroing has been, and
>> will be, met with some resistance. But, do you have any specific
>> concerns with the patch I posted? It's pretty well isolated from the
>> rest of the code, and optional.
> 
> The biggest concern I have is that the overhead is payed by everybody on
> the system - it is considered to be a system overhead regardless only
> part of the workload benefits from hugetlb pages. In other words the
> workload using those pages is not accounted for the use completely.
> 
> If the startup latency is a real problem is there a way to workaround
> that in the userspace by preallocating hugetlb pages ahead of time
> before those VMs are launched and hand over already pre-allocated pages?

In QEMU we support parallel preallocation, and even NUMA-aware 
preallocation, to reduce VM startup times with hugetlb.

We discussed improvements as referred to by Mateusz to further speed up 
preallocation, but for now parallel+NUMA-aware preallocation seems to do 
the trick.
Joao Martins Dec. 3, 2024, 2:02 p.m. UTC | #6
On 02/12/2024 21:58, Mateusz Guzik wrote:
> On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
>> Fresh hugetlb pages are zeroed out when they are faulted in,
>> just like with all other page types. This can take up a good
>> amount of time for larger page sizes (e.g. around 40
>> milliseconds for a 1G page on a recent AMD-based system).
>>
>> This normally isn't a problem, since hugetlb pages are typically
>> mapped by the application for a long time, and the initial
>> delay when touching them isn't much of an issue.
>>
>> However, there are some use cases where a large number of hugetlb
>> pages are touched when an application (such as a VM backed by these
>> pages) starts. For 256 1G pages and 40ms per page, this would take
>> 10 seconds, a noticeable delay.
> 
> The current huge page zeroing code is not that great to begin with.
> 
> There was a patchset posted some time ago to remedy at least some of it:
> https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/
> 
> but it apparently fell through the cracks.
> 

It didn't fall through the cracks, for sure.

Just had a detour into preempt=auto before resuming the main work. But that
seems to be done in the last merge window with the lazy preempt stuff. I think
Ankur was planning on following that series above soon-ish.

Adding him here, such that he keeps me honest :)
Joao Martins Dec. 3, 2024, 2:26 p.m. UTC | #7
On 03/12/2024 12:06, Michal Hocko wrote:
> On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
>> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> Any games with "background zeroing" are notoriously crappy and I would
>>> argue one should exhaust other avenues before going there -- at the end
>>> of the day the cost of zeroing will have to get paid.
>>
>> I understand that the concept of background prezeroing has been, and
>> will be, met with some resistance. But, do you have any specific
>> concerns with the patch I posted? It's pretty well isolated from the
>> rest of the code, and optional.
> 
> The biggest concern I have is that the overhead is payed by everybody on
> the system - it is considered to be a system overhead regardless only
> part of the workload benefits from hugetlb pages. In other words the
> workload using those pages is not accounted for the use completely.
> 
> If the startup latency is a real problem is there a way to workaround
> that in the userspace by preallocating hugetlb pages ahead of time
> before those VMs are launched and hand over already pre-allocated pages?

It should be relatively simple to actually do this. Me and Mike had experimented
ourselves a couple years back but we never had the chance to send it over. IIRC
if we:

- add the PageZeroed tracking bit when a page is zeroed
- clear it in the write (fixup/non-fixup) fault-path

[somewhat similar to this series I suspect]

Then what's left is to change the lookup of free hugetlb pages
(dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
pages. Provided we don't track its 'cleared' state, there's no UAPI change in
behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
free them back 'as zeroed' to implement a userspace scrubber. And in principle
existing apps should see no difference. The amount of changes is consequently
significantly smaller (or it looked as such in a quick PoC years back).

Something extra on top would perhaps be the ability to select a lookup
heuristic such that we can pick the search method of
non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
UAPI) to allow a scrubber to easily coexist with a hugepage user (e.g. a VMM, etc)
without too much of a dance.

	Joao
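
To make the idea above more concrete, here is a toy user-space model of the
bookkeeping it implies. It is only a sketch under stated assumptions: struct
hpage_model, enum lookup_policy and the helper functions are invented for
this example, the three policies mirror the
non-zero-first/only-nonzero/zeroed options mentioned above, and nothing here
corresponds to an existing kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>

enum lookup_policy {
	LOOKUP_NONZERO_FIRST,	/* prefer pages that still need zeroing */
	LOOKUP_NONZERO_ONLY,	/* scrubber: fail if every free page is already zeroed */
	LOOKUP_ZEROED_FIRST,	/* e.g. a VMM: prefer pages that are already zeroed */
};

struct hpage_model {
	bool free;	/* page is on the free list */
	bool zeroed;	/* stands in for the proposed PageZeroed bit */
};

/* Take a free page from the pool according to the policy, or return NULL. */
static struct hpage_model *dequeue_page(struct hpage_model *pool, size_t n,
					enum lookup_policy policy)
{
	bool want_zeroed = (policy == LOOKUP_ZEROED_FIRST);
	struct hpage_model *fallback = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct hpage_model *p = &pool[i];

		if (!p->free)
			continue;
		if (p->zeroed == want_zeroed) {
			p->free = false;
			return p;
		}
		if (!fallback)
			fallback = p;	/* first free page of the other kind */
	}

	if (policy == LOOKUP_NONZERO_ONLY)
		return NULL;
	if (fallback)
		fallback->free = false;
	return fallback;
}

/* Read fault: the page is zeroed if it was not already, and stays all-zero. */
static void read_fault(struct hpage_model *p)
{
	p->zeroed = true;
}

/* Write fault: the contents may change, so the tracking bit is cleared. */
static void write_fault(struct hpage_model *p)
{
	p->zeroed = false;
}

/* Freeing keeps the bit: a page that was only read goes back 'as zeroed'. */
static void release_page(struct hpage_model *p)
{
	p->free = true;
}
```

A scrubber in this model loops on dequeue_page(..., LOOKUP_NONZERO_ONLY),
read_fault() and release_page() until the lookup fails, while a regular user
calls dequeue_page(..., LOOKUP_ZEROED_FIRST) and benefits from the
scrubber's work.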
Mateusz Guzik Dec. 3, 2024, 3:57 p.m. UTC | #8
On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > If the startup latency is a real problem is there a way to workaround
> > that in the userspace by preallocating hugetlb pages ahead of time
> > before those VMs are launched and hand over already pre-allocated pages?
>
> It should be relatively simple to actually do this. Me and Mike had experimented
> ourselves a couple years back but we never had the chance to send it over. IIRC
> if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault-path
>
> [somewhat similar to this series I suspect]
>
> Then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> free them back 'as zeroed' to implement a userspace scrubber. And in principle
> existing apps should see no difference. The amount of changes is consequently
> significantly smaller (or it looked as such in a quick PoC years back).
>
> Something extra on the top would perhaps be the ability so select a lookup
> heuristic such that we can pick the search method of
> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> without too much of a dance.
>

Ye after the qemu prefaulting got pointed out I started thinking about
a userlevel daemon which would do the work proposed here.

Except I got stuck at a good way to do it. The mmap + load from the
area + munmap triple does work but also entails more overhead than
necessary, but I only have some handwaving how to not do it. :)

Suppose a daemon of the sort exists and there is a machine with 4 or
more NUMA domains to deal with. Further suppose it spawns at least one
thread per such domain and tasksets them accordingly.

Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
indicating how many pages to zero out (or even just accept one page).
This would avoid crap on munmap.

This would still need the majority of the patch, but all the zeroing
policy would be taken out. The key point is that whatever specific
behavior one sees fit, they can implement it in userspace, avoiding
future kernel patches that add more tweaks.
Mateusz Guzik Dec. 3, 2024, 4:17 p.m. UTC | #9
On Tue, Dec 3, 2024 at 4:57 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >
> > On 03/12/2024 12:06, Michal Hocko wrote:
> > > If the startup latency is a real problem is there a way to workaround
> > > that in the userspace by preallocating hugetlb pages ahead of time
> > > before those VMs are launched and hand over already pre-allocated pages?
> >
> > It should be relatively simple to actually do this. Me and Mike had experimented
> > ourselves a couple years back but we never had the chance to send it over. IIRC
> > if we:
> >
> > - add the PageZeroed tracking bit when a page is zeroed
> > - clear it in the write (fixup/non-fixup) fault-path
> >
> > [somewhat similar to this series I suspect]
> >
> > Then what's left is to change the lookup of free hugetlb pages
> > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> > pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> > free them back 'as zeroed' to implement a userspace scrubber. And in principle
> > existing apps should see no difference. The amount of changes is consequently
> > significantly smaller (or it looked as such in a quick PoC years back).
> >
> > Something extra on the top would perhaps be the ability so select a lookup
> > heuristic such that we can pick the search method of
> > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> > without too much of a dance.
> >
>
> Ye after the qemu prefaulting got pointed out I started thinking about
> a userlevel daemon which would do the work proposed here.
>
> Except I got stuck at a good way to do it. The mmap + load from the
> area + munmap triple does work but also entails more overhead than
> necessary, but I only have some handwaving how to not do it. :)
>
> Suppose a daemon of the sort exists and there is a machine with 4 or
> more NUMA domains to deal with. Further suppose it spawns at least one
> thread per such domain and tasksets them accordingly.
>
> Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
> indicating how many pages to zero out (or even just accept one page).
> This would avoid crap on munmap.
>
> This would still need majority of the patch, but all the zeroing
> policy would be taken out. Key point being that whatever specific
> behavior one sees fit, they can implement it in userspace, preventing
> future kernel patches to add more tweaks.

How about this for a rough sketch (which I have 0 intention of
implementing myself):

/dev/hugepagectl or whatever is created with a bunch of ioctls, notably:
- something to query hugepage stats
- an event generated for epoll if count in any domain goes below a threshold
- something to zero a page of given size from the free list

Perhaps make it so that fds require an upfront ioctl to set a numa
domain of interest before poll works -- for example if there is one
thread per domain, each of them sleeps on its own relevant fd. Or
maybe someone still wants the main thread to get the full view so they
poll on all of them.

then a google internal tool can react however it sees fit without
waking up in a periodic fashion. (replace google with any other
company which may want to mess with this).

optional:
- allocating and zeroing (but not mmaping!) a page

then a party which shares the file descriptor could obtain it by
passing the fd to mmap. munmap would just free it as it does now. this
would allow qemu et al to avoid the mmap/munmap dance just to zero,
but I don't know how useful it is for them
Joao Martins Dec. 3, 2024, 4:21 p.m. UTC | #10
On 03/12/2024 15:57, Mateusz Guzik wrote:
> On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 03/12/2024 12:06, Michal Hocko wrote:
>>> If the startup latency is a real problem is there a way to workaround
>>> that in the userspace by preallocating hugetlb pages ahead of time
>>> before those VMs are launched and hand over already pre-allocated pages?
>>
>> It should be relatively simple to actually do this. Me and Mike had experimented
>> ourselves a couple years back but we never had the chance to send it over. IIRC
>> if we:
>>
>> - add the PageZeroed tracking bit when a page is zeroed
>> - clear it in the write (fixup/non-fixup) fault-path
>>
>> [somewhat similar to this series I suspect]
>>
>> Then what's left is to change the lookup of free hugetlb pages
>> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
>> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
>> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
>> free them back 'as zeroed' to implement a userspace scrubber. And in principle
>> existing apps should see no difference. The amount of changes is consequently
>> significantly smaller (or it looked as such in a quick PoC years back).
>>
>> Something extra on the top would perhaps be the ability so select a lookup
>> heuristic such that we can pick the search method of
>> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
>> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
>> without too much of a dance.
>>
> 
> Ye after the qemu prefaulting got pointed out I started thinking about
> a userlevel daemon which would do the work proposed here.
> 
> Except I got stuck at a good way to do it. The mmap + load from the
> area + munmap triple does work but also entails more overhead than
> necessary, but I only have some handwaving how to not do it. :)
> 
What I was trying to suggest above is that it would be no different than how you
use hugetlb. I am not entirely sure I follow the triple work part on unmap.

> Suppose a daemon of the sort exists and there is a machine with 4 or
> more NUMA domains to deal with. Further suppose it spawns at least one
> thread per such domain and tasksets them accordingly.
> 
> Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
> indicating how many pages to zero out (or even just accept one page).
> This would avoid crap on munmap.
> 
> This would still need majority of the patch, but all the zeroing
> policy would be taken out. Key point being that whatever specific
> behavior one sees fit, they can implement it in userspace, preventing
> future kernel patches to add more tweaks.

Kernel should still ensure it tracks if it's cleared or not -- so what I said
above was just letting the allocation zero out the page or not (if it's not
zeroed already) and just tweak the dirtiness of pages it picks before installing
PTEs. A scrubber would pick only dirty pages (and maybe fail if there aren't
any), and a VMM would pick clean pages (taking advantage of the scrubber work).
An explicit zero sounds somewhat limiting ... but hmm

What throws all this away (in primary MM) is the prefaulting with write as we
would clear the PageCleared bit all the time (I think that's what you mean by 'crap
on unmap'?).

But there could be hope for systems with secondary page tables (with paging),
where the secondary faulting is the one in control of the cleared status. That
is because reads inside the VM ultimately trigger secondary-VM read-faults and
get fixed up later with write on writes.

Well, at least it would work given we don't prefault secondary page tables yet...
Frank van der Linden Dec. 3, 2024, 6:43 p.m. UTC | #11
On Tue, Dec 3, 2024 at 6:26 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
> >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>> Any games with "background zeroing" are notoriously crappy and I would
> >>> argue one should exhaust other avenues before going there -- at the end
> >>> of the day the cost of zeroing will have to get paid.
> >>
> >> I understand that the concept of background prezeroing has been, and
> >> will be, met with some resistance. But, do you have any specific
> >> concerns with the patch I posted? It's pretty well isolated from the
> >> rest of the code, and optional.
> >
> > The biggest concern I have is that the overhead is payed by everybody on
> > the system - it is considered to be a system overhead regardless only
> > part of the workload benefits from hugetlb pages. In other words the
> > workload using those pages is not accounted for the use completely.
> >
> > If the startup latency is a real problem is there a way to workaround
> > that in the userspace by preallocating hugetlb pages ahead of time
> > before those VMs are launched and hand over already pre-allocated pages?
>
> It should be relatively simple to actually do this. Me and Mike had experimented
> ourselves a couple years back but we never had the chance to send it over. IIRC
> if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault-path
>
> [somewhat similar to this series I suspect]
>
> Then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> free them back 'as zeroed' to implement a userspace scrubber. And in principle
> existing apps should see no difference. The amount of changes is consequently
> significantly smaller (or it looked as such in a quick PoC years back).

This would work, and is easy to do, but:
  * You now have a userspace daemon that depends on kernel-internal behavior.
  * It has no way to track how much work is left to do or what needs
to be done (unless it is part of an application that is the only user
of hugetlbfs on the system).

>
> Something extra on the top would perhaps be the ability so select a lookup
> heuristic such that we can pick the search method of
> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> without too much of a dance.

Again, that would probably work, but if you take a step back: you now
have a kernel behavior that can be guided in certain directions, but
no guarantees and no stats to see if things are working out. And an
explicit allocation method option (basically: take from the head or
the tail of the freelist). The picture is getting murkier. At least
with the patch I sent you have a clearly defined, optional, behavior
that can be switched on or off, and stats to see if it's working.

I do understand the objection that pre-zeroing is not
accounted to the current thread. I would counter that benefiting from
work by kernel threads is not unheard of in the kernel today already.
Also, the other proposals so far also have another thread doing the
zeroing - it just is explicitly started by userspace. So, the cost is
still not paid by the user of the pages. You just end up with
explicitly controlling who does pay the cost. Which I suppose is
better, but it's still not trivial to get it completely right (you
perhaps could do it at the container level with some trickery).

What we have done so far is to bind the khzerod threads introduced in
this patch to CPUs in such a way that it doesn't interfere with the
rest of the system. Which you would also have to do with any userspace
solution.

Again, this is optional - if you are a system manager who prefers to
have the resources used by zeroing hugetlb pages explicitly
accounted to the actual user, you can simply leave this behavior disabled (it's
off by default).

I guess I can summarize my thoughts like this: while I understand the
argument against doing this outside of the context of the actual user
of the pages, this is 1) optional, and 2) so far the other solutions
introduce interfaces that I don't think are that great, or would
require maintaining a hugetlb 'shadow pool' in userspace through
hugetlbfs files.

- Frank
Frank van der Linden Dec. 3, 2024, 8:15 p.m. UTC | #12
On Tue, Dec 3, 2024 at 10:43 AM Frank van der Linden <fvdl@google.com> wrote:
>
> On Tue, Dec 3, 2024 at 6:26 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >
> > On 03/12/2024 12:06, Michal Hocko wrote:
> > > On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
> > >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >>> Any games with "background zeroing" are notoriously crappy and I would
> > >>> argue one should exhaust other avenues before going there -- at the end
> > >>> of the day the cost of zeroing will have to get paid.
> > >>
> > >> I understand that the concept of background prezeroing has been, and
> > >> will be, met with some resistance. But, do you have any specific
> > >> concerns with the patch I posted? It's pretty well isolated from the
> > >> rest of the code, and optional.
> > >
> > > > The biggest concern I have is that the overhead is paid by everybody on
> > > > the system - it is considered to be a system overhead even though only
> > > > part of the workload benefits from hugetlb pages. In other words, the
> > > > workload using those pages is not fully accounted for that use.
> > > >
> > > > If the startup latency is a real problem, is there a way to work around
> > > > that in userspace by preallocating hugetlb pages ahead of time,
> > > > before those VMs are launched, and handing over the already pre-allocated pages?
> >
> > > It should be relatively simple to actually do this. Mike and I experimented
> > > with this a couple of years back, but we never had the chance to send it over. IIRC
> > > if we:
> >
> > - add the PageZeroed tracking bit when a page is zeroed
> > - clear it in the write (fixup/non-fixup) fault-path
> >
> > [somewhat similar to this series I suspect]
> >
> > Then what's left is to change the lookup of free hugetlb pages
> > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> > pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> > free them back 'as zeroed' to implement a userspace scrubber. And in principle
> > existing apps should see no difference. The amount of changes is consequently
> > significantly smaller (or it looked as such in a quick PoC years back).
>
> This would work, and is easy to do, but:
>   * You now have a userspace daemon that depends on kernel-internal behavior.
>   * It has no way to track how much work is left to do or what needs
> to be done (unless it is part of an application that is the only user
> of hugetlbfs on the system).
>
> >
> > > Something extra on top would perhaps be the ability to select a lookup
> > heuristic such that we can pick the search method of
> > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> > without too much of a dance.
>
> Again, that would probably work, but if you take a step back: you now
> have a kernel behavior that can be guided in certain directions, but
> no guarantees and no stats to see if things are working out. And an
> explicit allocation method option (basically: take from the head or
> the tail of the freelist). The picture is getting murkier. At least
> with the patch I sent you have a clearly defined, optional, behavior
> that can be switched on or off, and stats to see if it's working.
>
> I do understand the objection that pre-zeroing is not accounted to
> the thread that uses the pages. I would counter that benefiting from
> work done by kernel threads is not unheard of in the kernel today.
> Also, the other proposals so far also have another thread doing the
> zeroing - it just is explicitly started by userspace. So, the cost is
> still not paid by the user of the pages. You just end up with
> explicitly controlling who does pay the cost. Which I suppose is
> better, but it's still not trivial to get it completely right (you
> perhaps could do it at the container level with some trickery).
>
> What we have done so far is to bind the khzerod threads introduced in
> this patch to CPUs in such a way that it doesn't interfere with the
> rest of the system. Which you would also have to do with any userspace
> solution.
>
> Again, this is optional - if you are a system manager who prefers
> the resources used for zeroing hugetlb pages to be explicitly
> accounted to the actual user, you can simply leave this behavior
> disabled (it's off by default).
>
> I guess I can summarize my thoughts like this: while I understand the
> argument against doing this outside of the context of the actual user
> of the pages, this is 1) optional, and 2) so far the other solutions
> introduce interfaces that I don't think are that great, or would
> require maintaining a hugetlb 'shadow pool' in userspace through
> hugetlbfs files.

One more thing: any userspace solution has one basic problem: the
hugetlb pages will be unavailable while they are being zeroed out, as
the userspace process or thread will have to map+touch them, taking
them off the freelist. So now another process that needs the hugetlb
pages, expecting them to be there based on initial configuration and
what it's done so far, may end up getting unexpected -ENOMEM because
one or more pages have been temporarily allocated by userspace prezero
threads.

My patch doesn't have that issue - the pages stay on the freelist, the
total number of available pages does not change. In the rare case that
a freshly allocated page is being prezeroed, you'll just have to wait
until it's done (taking up no more time than doing it yourself).
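
To make the freelist discipline concrete, here is a toy single-threaded userspace model of it; it is a sketch, not the kernel code. All `toy_*` names are hypothetical, and the kernel's `hugetlb_lock` and wait queue are left out. Frees go to the tail, allocations come from the head, and the zeroer claims the tail page without unlinking it, so the free count never drops while a page is being zeroed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a hugetlb folio; all names here are hypothetical. */
struct toy_page {
	struct toy_page *prev, *next;
	bool on_freelist;	/* models HPG_freed */
	bool pre_zeroed;	/* models HPG_pre_zeroed */
	bool zero_busy;		/* models HPG_zero_busy */
};

/* Circular list with a sentinel: list.next is the head, list.prev the tail. */
struct toy_freelist {
	struct toy_page list;
	size_t nr_free;
	size_t nr_zero;
};

void toy_init(struct toy_freelist *fl)
{
	fl->list.next = fl->list.prev = &fl->list;
	fl->nr_free = fl->nr_zero = 0;
}

void toy_unlink(struct toy_page *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
}

void toy_link_after(struct toy_page *pos, struct toy_page *p)
{
	p->prev = pos;
	p->next = pos->next;
	pos->next->prev = p;
	pos->next = p;
}

/* Freed pages go to the tail with both zeroing flags cleared. */
void toy_free(struct toy_freelist *fl, struct toy_page *p)
{
	p->pre_zeroed = p->zero_busy = false;
	p->on_freelist = true;
	toy_link_after(fl->list.prev, p);
	fl->nr_free++;
}

/* Allocations always take the head, where pre-zeroed pages collect. */
struct toy_page *toy_alloc(struct toy_freelist *fl)
{
	struct toy_page *p = fl->list.next;

	if (p == &fl->list)
		return NULL;
	toy_unlink(p);
	p->on_freelist = false;
	fl->nr_free--;
	if (p->pre_zeroed || p->zero_busy)
		fl->nr_zero--;
	return p;	/* a real caller must wait while p->zero_busy is set */
}

/* Zeroer, step 1: claim the tail page but leave it on the list. */
struct toy_page *toy_zero_begin(struct toy_freelist *fl)
{
	struct toy_page *p = fl->list.prev;

	if (p == &fl->list || p->pre_zeroed)
		return NULL;
	p->zero_busy = true;
	fl->nr_zero++;
	return p;
}

/* Zeroer, step 2: mark done; move to the head if nobody took the page. */
void toy_zero_end(struct toy_freelist *fl, struct toy_page *p)
{
	assert(p->zero_busy);
	p->pre_zeroed = true;
	p->zero_busy = false;
	if (p->on_freelist) {
		toy_unlink(p);
		toy_link_after(&fl->list, p);
	}
}
```

Between `toy_zero_begin()` and `toy_zero_end()` the page is still counted in `nr_free` and can still be handed out by `toy_alloc()`, which is the property a map+touch userspace scrubber cannot provide.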

Now, you can implement something like this in userspace (if I get
-ENOMEM, check with the prezero thread or process), but it's not
great.

- Frank
Ankur Arora Dec. 4, 2024, 12:05 a.m. UTC | #13
Mateusz Guzik <mjguzik@gmail.com> writes:

> On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
>> Fresh hugetlb pages are zeroed out when they are faulted in,
>> just like with all other page types. This can take up a good
>> amount of time for larger page sizes (e.g. around 40
>> milliseconds for a 1G page on a recent AMD-based system).
>>
>> This normally isn't a problem, since hugetlb pages are typically
>> mapped by the application for a long time, and the initial
>> delay when touching them isn't much of an issue.
>>
>> However, there are some use cases where a large number of hugetlb
>> pages are touched when an application (such as a VM backed by these
>> pages) starts. For 256 1G pages and 40ms per page, this would take
>> 10 seconds, a noticeable delay.
>
> The current huge page zeroing code is not that great to begin with.

Yeah definitely suboptimal. The current huge page zeroing code is
both slow and it trashes the cache while zeroing.

> There was a patchset posted some time ago to remedy at least some of it:
> https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/
>
> but it apparently fell through the cracks.

As Joao mentioned, that got sidetracked by the preempt-lazy work.
Now that lazy is in, I plan to follow up on the zeroing work.

> Any games with "background zeroing" are notoriously crappy and I would
> argue one should exhaust other avenues before going there -- at the end
> of the day the cost of zeroing will have to get paid.

Yeah and the background zeroing has dual cost: the cost in CPU time plus
the indirect cost to other processes due to the trashing of L3 etc.

Ankur
Ankur Arora Dec. 4, 2024, 12:06 a.m. UTC | #14
Joao Martins <joao.m.martins@oracle.com> writes:

> On 02/12/2024 21:58, Mateusz Guzik wrote:
>> On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
>>> Fresh hugetlb pages are zeroed out when they are faulted in,
>>> just like with all other page types. This can take up a good
>>> amount of time for larger page sizes (e.g. around 40
>>> milliseconds for a 1G page on a recent AMD-based system).
>>>
>>> This normally isn't a problem, since hugetlb pages are typically
>>> mapped by the application for a long time, and the initial
>>> delay when touching them isn't much of an issue.
>>>
>>> However, there are some use cases where a large number of hugetlb
>>> pages are touched when an application (such as a VM backed by these
>>> pages) starts. For 256 1G pages and 40ms per page, this would take
>>> 10 seconds, a noticeable delay.
>>
>> The current huge page zeroing code is not that great to begin with.
>>
>> There was a patchset posted some time ago to remedy at least some of it:
>> https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/
>>
>> but it apparently fell through the cracks.
>>
>
> It didn't fall through the cracks, for sure.
>
> Just had a detour into preempt=auto before resuming the main work. But that
> seems to be done in the last merge window with the lazy preempt stuff. I think
> Ankur was planning on following that series above soon-ish.
>
> Adding him here, such that he keeps me honest :)

Thanks for Cc'ing me, Joao.

--
ankur
Frank van der Linden Dec. 4, 2024, 5:01 p.m. UTC | #15
On Tue, Dec 3, 2024 at 4:05 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
>
> > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
> >> Fresh hugetlb pages are zeroed out when they are faulted in,
> >> just like with all other page types. This can take up a good
> >> amount of time for larger page sizes (e.g. around 40
> >> milliseconds for a 1G page on a recent AMD-based system).
> >>
> >> This normally isn't a problem, since hugetlb pages are typically
> >> mapped by the application for a long time, and the initial
> >> delay when touching them isn't much of an issue.
> >>
> >> However, there are some use cases where a large number of hugetlb
> >> pages are touched when an application (such as a VM backed by these
> >> pages) starts. For 256 1G pages and 40ms per page, this would take
> >> 10 seconds, a noticeable delay.
> >
> > The current huge page zeroing code is not that great to begin with.
>
> Yeah definitely suboptimal. The current huge page zeroing code is
> both slow and it trashes the cache while zeroing.
>
> > There was a patchset posted some time ago to remedy at least some of it:
> > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/
> >
> > but it apparently fell through the cracks.
>
> As Joao mentioned that got side tracked due to the preempt-lazy stuff.
> Now that lazy is in, I plan to follow up on the zeroing work.
>
> > Any games with "background zeroing" are notoriously crappy and I would
> > argue one should exhaust other avenues before going there -- at the end
> > of the day the cost of zeroing will have to get paid.
>
> Yeah and the background zeroing has dual cost: the cost in CPU time plus
> the indirect cost to other processes due to the trashing of L3 etc.

I'm not sure what you mean here - any caching side effects of zeroing
happen regardless of who does it, right? It doesn't matter if it's a
kthread or the calling thread.

If you're concerned about the caching side effects in general, using
non-temporal instructions helps (e.g. movnti on x86). See the link I
mentioned for a patch that was sent years ago (
https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@google.com/
). Using movnti on x86 definitely helps performance (up to 50% in my
experiments). Which is great, but it still leaves considerable delay
for the use case I mentioned.
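
For readers unfamiliar with non-temporal stores, here is a minimal userspace sketch of the idea; it is not the code from the linked patch. It assumes an x86-64 compiler that provides the SSE2 intrinsic `_mm_stream_si64()` (which emits `movnti`) and falls back to `memset()` elsewhere. The function name `zero_nontemporal()` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_MOVNTI 1
#endif

/*
 * Zero len bytes at buf. On x86-64 the bulk of the range is written
 * with movnti, which bypasses the cache hierarchy instead of pulling
 * the destination lines in; other targets just use memset().
 */
void zero_nontemporal(void *buf, size_t len)
{
	unsigned char *p = buf;

	assert(buf != NULL || len == 0);

#ifdef HAVE_MOVNTI
	/* Ordinary stores until p is 8-byte aligned. */
	size_t head = (size_t)(-(uintptr_t)p & 7);

	if (head > len)
		head = len;
	memset(p, 0, head);
	p += head;
	len -= head;

	/* Non-temporal 8-byte stores for the aligned middle part. */
	while (len >= 8) {
		_mm_stream_si64((long long *)p, 0);
		p += 8;
		len -= 8;
	}

	/* Order the weakly-ordered streaming stores before later stores. */
	_mm_sfence();
#endif
	/* Remaining tail bytes (or the whole range without movnti). */
	memset(p, 0, len);
}
```

Whether this is faster than a cached `memset()` depends on the CPU and on whether the caller reads the page soon afterwards, so it should be measured rather than assumed.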

- Frank
Ankur Arora Dec. 4, 2024, 7:57 p.m. UTC | #16
Frank van der Linden <fvdl@google.com> writes:

> On Tue, Dec 3, 2024 at 4:05 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Mateusz Guzik <mjguzik@gmail.com> writes:
>>
>> > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
>> >> Fresh hugetlb pages are zeroed out when they are faulted in,
>> >> just like with all other page types. This can take up a good
>> >> amount of time for larger page sizes (e.g. around 40
>> >> milliseconds for a 1G page on a recent AMD-based system).
>> >>
>> >> This normally isn't a problem, since hugetlb pages are typically
>> >> mapped by the application for a long time, and the initial
>> >> delay when touching them isn't much of an issue.
>> >>
>> >> However, there are some use cases where a large number of hugetlb
>> >> pages are touched when an application (such as a VM backed by these
>> >> pages) starts. For 256 1G pages and 40ms per page, this would take
>> >> 10 seconds, a noticeable delay.
>> >
>> > The current huge page zeroing code is not that great to begin with.
>>
>> Yeah definitely suboptimal. The current huge page zeroing code is
>> both slow and it trashes the cache while zeroing.
>>
>> > There was a patchset posted some time ago to remedy at least some of it:
>> > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/
>> >
>> > but it apparently fell through the cracks.
>>
>> As Joao mentioned that got side tracked due to the preempt-lazy stuff.
>> Now that lazy is in, I plan to follow up on the zeroing work.
>>
>> > Any games with "background zeroing" are notoriously crappy and I would
>> > argue one should exhaust other avenues before going there -- at the end
>> > of the day the cost of zeroing will have to get paid.
>>
>> Yeah and the background zeroing has dual cost: the cost in CPU time plus
>> the indirect cost to other processes due to the trashing of L3 etc.
>
> I'm not sure what you mean here - any caching side effects of zeroing
> happen regardless of who does it, right?

Sure.

> It doesn't matter if it's a
> kthread or the calling thread.

As other people point out, it's more a matter of accruing it to the
right context. The noise will always spill over, but userspace can use
CPU cgroups etc. to limit how far these effects are seen.

Additionally, this kthread will be doing bulk zeroing while a user
thread would zero as needed. Though I guess for the VM prefaulting
case it's likely similar.

> If you're concerned about the caching side effects in general, using
> non-temporal instructions helps (e.g. movnti on x86). See the link I
> mentioned for a patch that was sent years ago (
> https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@google.com/
> ). Using movnti on x86 definitely helps performance (up to 50% in my
> experiments). Which is great, but it still leaves considerable delay
> for the use case I mentioned.

In my testing, at least on AMD, you can get a lot more than a 50%
improvement.

See for instance the CLZERO (or the REP STOS) numbers here: https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/

--
ankur

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 5cf327337e22..7a6749959372 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -893,8 +893,7 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(folio);
 			goto out;
 		}
-		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
-		__folio_mark_uptodate(folio);
+		clear_hugetlb_folio(folio, addr);
 		error = hugetlb_add_to_page_cache(folio, mapping, index);
 		if (unlikely(error)) {
 			restore_reserve_on_error(h, &pseudo_vma, addr, folio);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4697539b665..31d56190a4fb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -586,6 +586,17 @@  generic_hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
  * HPG_raw_hwp_unreliable - Set when the hugetlb page has a hwpoison sub-page
  *     that is not tracked by raw_hwp_page list.
+ * HPG_pre_zeroed - page was pre-zeroed, folio_zero_user() not needed.
+ *	Synchronization: hugetlb_lock held when set by pre-zero kthread.
+ *	Only valid to read outside hugetlb_lock once the page is off
+ *	the freelist, and HPG_zero_busy is clear. Always cleared when a
+ *	page is put (back) on the freelist.
+ * HPG_zero_busy - page is being zeroed by the pre-zero kthread.
+ *	Synchronization: set and cleared by the pre-zero kthread with
+ *	hugetlb_lock held. Access by others is read-only. Once the page
+ *	is off the freelist, this can only change from set -> clear,
+ *	which the new page owner must wait for. Always cleared
+ *	when a page is put (back) on the freelist.
  */
 enum hugetlb_page_flags {
 	HPG_restore_reserve = 0,
@@ -594,6 +605,8 @@  enum hugetlb_page_flags {
 	HPG_freed,
 	HPG_vmemmap_optimized,
 	HPG_raw_hwp_unreliable,
+	HPG_pre_zeroed,
+	HPG_zero_busy,
 	__NR_HPAGEFLAGS,
 };
 
@@ -653,6 +666,8 @@  HPAGEFLAG(Temporary, temporary)
 HPAGEFLAG(Freed, freed)
 HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
 HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
+HPAGEFLAG(PreZeroed, pre_zeroed)
+HPAGEFLAG(ZeroBusy, zero_busy)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
@@ -678,6 +693,19 @@  struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+
+	unsigned long free_huge_pages_zero;
+	unsigned int free_huge_pages_zero_node[MAX_NUMNODES];
+
+	/* Wait queue for the prezero thread */
+	wait_queue_head_t hzerod_wait[MAX_NUMNODES];
+	/* Queue to wait for a hugetlb folio that is being prezeroed */
+	wait_queue_head_t dqzero_wait[MAX_NUMNODES];
+	/* Prezero threads (one per node) */
+	struct task_struct *hzerod[MAX_NUMNODES];
+
+	bool prezero_enabled;
+
 	char name[HSTATE_NAME_LEN];
 };
 
@@ -699,6 +727,7 @@  int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping
 			pgoff_t idx);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct folio *folio);
+void clear_hugetlb_folio(struct folio *folio, unsigned long address);
 
 /* arch callback */
 int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
@@ -1035,6 +1064,9 @@  void hugetlb_unregister_node(struct node *node);
  */
 bool is_raw_hwpoison_page_in_hugepage(struct page *page);
 
+void khzerod_run(int nid);
+void khzerod_stop(int nid);
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1239,6 +1271,18 @@  static inline bool hugetlbfs_pagecache_present(
 {
 	return false;
 }
+
+static inline void khzerod_run(int nid)
+{
+}
+
+static inline void khzerod_stop(int nid)
+{
+}
+
+static inline void clear_hugetlb_folio(struct folio *folio, unsigned long address)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..0c9a80851bf5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -44,6 +44,8 @@ 
 #include <linux/io.h>
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/memory_hotplug.h>
 #include <linux/node.h>
 #include <linux/page_owner.h>
 #include "internal.h"
@@ -68,6 +70,20 @@  static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
 static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
 
+static void khzerod_wakeup_node(struct hstate *h, int nid);
+static void hpage_wait_zerobusy(struct hstate *h, struct folio *folio);
+static void khzerod_run_all(void);
+static void khzerod_run_hstate(struct hstate *h);
+
+/*
+ * Mutex to protect prezero kthread stopping / starting.
+ * This is really per <hstate,nid> tuple, but since this
+ * only happens when prezeroing is enabled/disabled via
+ * sysfs (rarely), a mutex per <hstate,nid> would be
+ * overkill.
+ */
+static DEFINE_MUTEX(prezero_chg_lock);
+
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
  * free_huge_pages, and surplus_huge_pages.
@@ -1309,6 +1325,16 @@  static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+/*
+ * Clear flags for either a fresh page or one that is being
+ * added to the free list.
+ */
+static inline void prep_clear_prezero(struct folio *folio)
+{
+	folio_clear_hugetlb_pre_zeroed(folio);
+	folio_clear_hugetlb_zero_busy(folio);
+}
+
 static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	int nid = folio_nid(folio);
@@ -1316,14 +1342,16 @@  static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
 	lockdep_assert_held(&hugetlb_lock);
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
 
-	list_move(&folio->lru, &h->hugepage_freelists[nid]);
+	list_move_tail(&folio->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
+	prep_clear_prezero(folio);
 	folio_set_hugetlb_freed(folio);
+	khzerod_wakeup_node(h, nid);
 }
 
-static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
-								int nid)
+static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h, int nid,
+		gfp_t gfp_mask)
 {
 	struct folio *folio;
 	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
@@ -1333,6 +1361,16 @@  static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
 		if (pin && !folio_is_longterm_pinnable(folio))
 			continue;
 
+		/*
+		 * This shouldn't happen, as hugetlb pages are never allocated
+		 * with GFP_ATOMIC. But be paranoid and check for it, as
+		 * a zero_busy page might cause a sleep later in
+		 * hpage_wait_zerobusy().
+		 */
+		if (WARN_ON_ONCE(folio_test_hugetlb_zero_busy(folio) &&
+					!gfpflags_allow_blocking(gfp_mask)))
+			continue;
+
 		if (folio_test_hwpoison(folio))
 			continue;
 
@@ -1341,6 +1379,12 @@  static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
 		folio_clear_hugetlb_freed(folio);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
+		if (folio_test_hugetlb_pre_zeroed(folio) ||
+		    folio_test_hugetlb_zero_busy(folio)) {
+			h->free_huge_pages_zero_node[nid]--;
+			h->free_huge_pages_zero--;
+		}
+
 		return folio;
 	}
 
@@ -1377,7 +1421,7 @@  static struct folio *dequeue_hugetlb_folio_nodemask(struct hstate *h, gfp_t gfp_
 			continue;
 		node = zone_to_nid(zone);
 
-		folio = dequeue_hugetlb_folio_node_exact(h, node);
+		folio = dequeue_hugetlb_folio_node_exact(h, node, gfp_mask);
 		if (folio)
 			return folio;
 	}
@@ -1605,6 +1649,17 @@  static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
 		folio_clear_hugetlb_freed(folio);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
+		folio_clear_hugetlb_freed(folio);
+	}
+	/*
+	 * Adjust the zero page counters now. Note that
+	 * if a page is currently being zeroed, that
+	 * will be waited for in update_and_free_page()
+	 */
+	if (folio_test_hugetlb_pre_zeroed(folio) ||
+	    folio_test_hugetlb_zero_busy(folio)) {
+		h->free_huge_pages_zero--;
+		h->free_huge_pages_zero_node[nid]--;
 	}
 	if (adjust_surplus) {
 		h->surplus_huge_pages--;
@@ -1658,6 +1713,8 @@  static void __update_and_free_hugetlb_folio(struct hstate *h,
 {
 	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
 
+	VM_BUG_ON_FOLIO(folio_test_hugetlb_zero_busy(folio), folio);
+
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
@@ -1743,6 +1800,7 @@  static void free_hpage_workfn(struct work_struct *work)
 		 */
 		h = size_to_hstate(folio_size(folio));
 
+		hpage_wait_zerobusy(h, folio);
 		__update_and_free_hugetlb_folio(h, folio);
 
 		cond_resched();
@@ -1759,7 +1817,8 @@  static inline void flush_free_hpage_work(struct hstate *h)
 static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
 				 bool atomic)
 {
-	if (!folio_test_hugetlb_vmemmap_optimized(folio) || !atomic) {
+	if ((!folio_test_hugetlb_zero_busy(folio) &&
+	     !folio_test_hugetlb_vmemmap_optimized(folio)) || !atomic) {
 		__update_and_free_hugetlb_folio(h, folio);
 		return;
 	}
@@ -1974,6 +2033,8 @@  static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int ni
 {
 	__prep_new_hugetlb_folio(h, folio);
 	spin_lock_irq(&hugetlb_lock);
+	folio_clear_hugetlb_freed(folio);
+	prep_clear_prezero(folio);
 	__prep_account_new_huge_page(h, nid);
 	spin_unlock_irq(&hugetlb_lock);
 }
@@ -2419,6 +2480,13 @@  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 						preferred_nid, nmask);
 		if (folio) {
 			spin_unlock_irq(&hugetlb_lock);
+			/*
+			 * The contents of this page will be completely
+			 * overwritten immediately, as it's a migration
+			 * target, so no clearing is needed. Do wait in
+			 * case khzerod was working on it, though.
+			 */
+			hpage_wait_zerobusy(h, folio);
 			return folio;
 		}
 	}
@@ -3066,6 +3134,8 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	spin_unlock_irq(&hugetlb_lock);
 
+	hpage_wait_zerobusy(h, folio);
+
 	hugetlb_set_folio_subpool(folio, spool);
 
 	map_commit = vma_commit_reservation(h, vma, addr);
@@ -3503,6 +3573,8 @@  static void __init hugetlb_init_hstates(void)
 				h->demote_order = h2->order;
 		}
 	}
+
+	khzerod_run_all();
 }
 
 static void __init report_hugepages(void)
@@ -4184,15 +4256,72 @@  static ssize_t demote_size_store(struct kobject *kobj,
 }
 HSTATE_ATTR(demote_size);
 
+static ssize_t free_hugepages_zero_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h;
+	unsigned long free_huge_pages_zero;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NUMA_NO_NODE)
+		free_huge_pages_zero = h->free_huge_pages_zero;
+	else
+		free_huge_pages_zero = h->free_huge_pages_zero_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages_zero);
+}
+HSTATE_ATTR_RO(free_hugepages_zero);
+
+static ssize_t prezero_enabled_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
+	return sprintf(buf, "%d\n", h->prezero_enabled ? 1 : 0);
+}
+
+static ssize_t prezero_enabled_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct hstate *h;
+	int err;
+	long val;
+	bool prezero_enabled;
+
+	err = kstrtol(buf, 10, &val);
+	if (err || (val != 0 && val != 1))
+		return -EINVAL;
+
+	prezero_enabled = !!val;
+
+	h = kobj_to_hstate(kobj, NULL);
+	if (prezero_enabled == h->prezero_enabled)
+		return count;
+
+	h->prezero_enabled = prezero_enabled;
+
+	mem_hotplug_begin();
+	khzerod_run_hstate(h);
+	mem_hotplug_done();
+
+	return count;
+}
+HSTATE_ATTR(prezero_enabled);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
 	&free_hugepages_attr.attr,
 	&resv_hugepages_attr.attr,
 	&surplus_hugepages_attr.attr,
+	&free_hugepages_zero_attr.attr,
 #ifdef CONFIG_NUMA
 	&nr_hugepages_mempolicy_attr.attr,
 #endif
+	&prezero_enabled_attr.attr,
 	NULL,
 };
 
@@ -4265,6 +4394,7 @@  static struct node_hstate node_hstates[MAX_NUMNODES];
 static struct attribute *per_node_hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&free_hugepages_attr.attr,
+	&free_hugepages_zero_attr.attr,
 	&surplus_hugepages_attr.attr,
 	NULL,
 };
@@ -4501,6 +4631,232 @@  bool __init __attribute((weak)) arch_hugetlb_valid_size(unsigned long size)
 	return size == HPAGE_SIZE;
 }
 
+struct khzerod_info {
+	struct hstate *h;
+	int nid;
+};
+
+/*
+ * Clear a hugetlb page.
+ *
+ * The caller has already made sure that the page is not
+ * being actively zeroed out in the background.
+ *
+ * If it wasn't zeroed out, do it ourselves.
+ */
+void clear_hugetlb_folio(struct folio *folio, unsigned long address)
+{
+	if (!folio_test_hugetlb_pre_zeroed(folio))
+		folio_zero_user(folio, address);
+
+	__folio_mark_uptodate(folio);
+}
+
+/*
+ * Once a page has been taken off the freelist, the new page owner
+ * must wait for the pre-zero kthread to finish if it happens
+ * to be working on this page (which should be rare).
+ */
+static void hpage_wait_zerobusy(struct hstate *h, struct folio *folio)
+{
+	if (!folio_test_hugetlb_zero_busy(folio))
+		return;
+
+	spin_lock_irq(&hugetlb_lock);
+
+	wait_event_cmd(h->dqzero_wait[folio_nid(folio)],
+		       !folio_test_hugetlb_zero_busy(folio),
+		       spin_unlock_irq(&hugetlb_lock),
+		       spin_lock_irq(&hugetlb_lock));
+
+	spin_unlock_irq(&hugetlb_lock);
+}
+
+/*
+ * This per-node-per-hstate kthread pre-zeroes pages that are on
+ * the freelist. They remain on the freelist while this is being
+ * done. When pre-zeroing is done, they are moved to the head
+ * of the list.
+ *
+ * Pages are left on the freelist because an allocation should not
+ * fail just because the page is being prezeroed. In the rare
+ * corner case that a page that is being worked on by this
+ * thread is taken as part of an allocation, the caller will
+ * wait for the prezero to finish (see hpage_wait_zerobusy).
+ */
+static int khzerod(void *p)
+{
+	struct khzerod_info *ki = p;
+	struct hstate *h = ki->h;
+	int nid = ki->nid;
+	unsigned int *freep, *zerop;
+	struct folio *folio;
+	const struct cpumask *cpumask = cpumask_of_node(nid);
+	struct list_head *freelist;
+
+	if (!cpumask_empty(cpumask))
+		set_cpus_allowed_ptr(current, cpumask);
+
+	freep = &h->free_huge_pages_node[nid];
+	zerop = &h->free_huge_pages_zero_node[nid];
+	freelist = &h->hugepage_freelists[nid];
+
+	do {
+		wait_event_interruptible(h->hzerod_wait[nid],
+				*zerop < *freep || kthread_should_stop());
+
+		if (kthread_should_stop())
+			break;
+
+		spin_lock_irq(&hugetlb_lock);
+		if (*zerop == *freep || list_empty(freelist)) {
+			spin_unlock_irq(&hugetlb_lock);
+			continue;
+		}
+
+		folio = list_last_entry(freelist, struct folio, lru);
+		if (folio_test_hugetlb_pre_zeroed(folio)) {
+			spin_unlock_irq(&hugetlb_lock);
+			continue;
+		}
+
+		folio_set_hugetlb_zero_busy(folio);
+		/*
+		 * Incrementing this here is a bit of a fib, since
+		 * the page hasn't been cleared yet (it will be done
+		 * immediately after dropping the lock below). But
+		 * it keeps the count consistent with the overall
+		 * free count in case the page gets taken off the
+		 * freelist while we're working on it.
+		 */
+		(*zerop)++;
+		h->free_huge_pages_zero++;
+		spin_unlock_irq(&hugetlb_lock);
+
+		/*
+		 * HWPoison pages may show up on the freelist.
+		 * Don't try to zero it out, but do set the flag
+		 * and counts, so that we don't consider it again.
+		 */
+		if (!folio_test_hwpoison(folio))
+			folio_zero_user(folio, 0);
+
+		spin_lock_irq(&hugetlb_lock);
+		folio_set_hugetlb_pre_zeroed(folio);
+		folio_clear_hugetlb_zero_busy(folio);
+
+		/*
+		 * If the page is still on the free list, move
+		 * it to the head.
+		 */
+		if (folio_test_hugetlb_freed(folio))
+			list_move(&folio->lru, freelist);
+
+		/*
+		 * If someone was waiting for the zero to
+		 * finish, wake them up.
+		 */
+		if (waitqueue_active(&h->dqzero_wait[nid]))
+			wake_up(&h->dqzero_wait[nid]);
+		spin_unlock_irq(&hugetlb_lock);
+
+	} while (1);
+
+	kfree(ki);
+	return 0;
+}
+
+static void khzerod_run_hstate_nid(struct hstate *h, int nid)
+{
+	struct khzerod_info *ki;
+	struct task_struct *t;
+
+	mutex_lock(&prezero_chg_lock);
+
+	if (!h->prezero_enabled) {
+		if (h->hzerod[nid] != NULL) {
+			kthread_stop(h->hzerod[nid]);
+			h->hzerod[nid] = NULL;
+		}
+		goto out;
+	}
+
+	if (h->hzerod[nid] != NULL)
+		goto out;
+
+	ki = kmalloc(sizeof(*ki), GFP_KERNEL);
+	if (ki == NULL)
+		goto out;
+	ki->h = h;
+	ki->nid = nid;
+	t = kthread_run(khzerod, ki, "khzerod%d-%lukB",
+			nid, huge_page_size(h) / 1024);
+	if (IS_ERR(t)) {
+		kfree(ki);
+		pr_err("could not run khzerod on node %d for size %lukB\n",
+		       nid, huge_page_size(h) / 1024);
+	} else
+		h->hzerod[nid] = t;
+
+out:
+	mutex_unlock(&prezero_chg_lock);
+}
+
+static void khzerod_stop_hstate_nid(struct hstate *h, int nid)
+{
+	mutex_lock(&prezero_chg_lock);
+
+	if (h->hzerod[nid] != NULL) {
+		kthread_stop(h->hzerod[nid]);
+		h->hzerod[nid] = NULL;
+	}
+
+	mutex_unlock(&prezero_chg_lock);
+}
+
+/*
+ * (Re-)run all pre-zero kthreads for a node. Kthreads
+ * that are currently running, but should no longer run
+ * (because the prezero_enabled flag was changed) are stopped.
+ */
+void khzerod_run(int nid)
+{
+	struct hstate *h;
+
+	for_each_hstate(h)
+		khzerod_run_hstate_nid(h, nid);
+}
+
+static void khzerod_run_all(void)
+{
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		khzerod_run(nid);
+}
+
+static void khzerod_run_hstate(struct hstate *h)
+{
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		khzerod_run_hstate_nid(h, nid);
+}
+
+void khzerod_stop(int nid)
+{
+	struct hstate *h;
+
+	for_each_hstate(h)
+		khzerod_stop_hstate_nid(h, nid);
+}
+
+static void khzerod_wakeup_node(struct hstate *h, int nid)
+{
+	if (h->hzerod[nid])
+		wake_up(&h->hzerod_wait[nid]);
+}
+
 void __init hugetlb_add_hstate(unsigned int order)
 {
 	struct hstate *h;
@@ -4515,8 +4871,11 @@  void __init hugetlb_add_hstate(unsigned int order)
 	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
 	h->order = order;
 	h->mask = ~(huge_page_size(h) - 1);
-	for (i = 0; i < MAX_NUMNODES; ++i)
+	for (i = 0; i < MAX_NUMNODES; ++i) {
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+		init_waitqueue_head(&h->hzerod_wait[i]);
+		init_waitqueue_head(&h->dqzero_wait[i]);
+	}
 	INIT_LIST_HEAD(&h->hugepage_activelist);
 	h->next_nid_to_alloc = first_memory_node;
 	h->next_nid_to_free = first_memory_node;
@@ -6139,8 +6498,7 @@  static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 				ret = 0;
 			goto out;
 		}
-		folio_zero_user(folio, vmf->real_address);
-		__folio_mark_uptodate(folio);
+		clear_hugetlb_folio(folio, vmf->address);
 		new_folio = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 621ae1015106..8398e4d0130d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1217,6 +1217,7 @@  int online_pages(unsigned long pfn, unsigned long nr_pages,
 
 	kswapd_run(nid);
 	kcompactd_run(nid);
+	khzerod_run(nid);
 
 	writeback_set_ratelimit();
 
@@ -2092,6 +2093,7 @@  int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	if (arg.status_change_nid >= 0) {
 		kcompactd_stop(node);
 		kswapd_stop(node);
+		khzerod_stop(node);
 	}
 
 	writeback_set_ratelimit();