[RFC,v4,11/13] mm: parallelize deferred struct page initialization within each node

Message ID 20181105165558.11698-12-daniel.m.jordan@oracle.com (mailing list archive)
State New, archived
Series ktask: multithread CPU-intensive kernel work

Commit Message

Daniel Jordan Nov. 5, 2018, 4:55 p.m. UTC
Deferred struct page initialization currently runs one thread per node,
which is a bottleneck during boot on big machines.  Use ktask within each
pgdatinit thread to parallelize the struct page initialization, allowing
the system to take better advantage of its memory bandwidth.

Because the system is not fully up yet and most CPUs are idle, use more
than the default maximum number of ktask threads.  The kernel doesn't
know a given system's memory bandwidth, which would determine the most
efficient number of threads, so there's some guesswork involved.  In
testing, a reasonable value turned out to be about a quarter of the CPUs
on the node.
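
For example, on the 88-CPU, two-socket machine in the results below
(assuming one NUMA node per socket), each node has 44 CPUs, so each
pgdatinit thread would start DIV_ROUND_UP(44, 4) = 11 ktask threads.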

__free_pages_core used to increase the zone's managed page count by the
number of pages being freed.  To accommodate multiple threads, however,
account the number of freed pages in an atomic counter shared across the
ktask threads, and bump the managed page count with the total once ktask
is finished.
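
In sketch form, the accounting pattern is (simplified from the patch
below, not verbatim):

	struct deferred_args {
		int nid;
		int zid;
		atomic64_t nr_pages;	/* shared across ktask threads */
	};

	/* Each ktask thread accumulates its chunk's count ... */
	atomic64_add(nr_pages_this_chunk, &args->nr_pages);

	/* ... and the pgdatinit thread applies the total at the end. */
	zone->managed_pages += atomic64_read(&args.nr_pages);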

Test:    Boot the machine with deferred struct page init three times

Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
         2 sockets

kernel                    speedup   max time per   stdev (ms)
                                    node (ms)

baseline (4.15-rc2)                         5860          8.6
ktask                       9.56x            613         12.4
---

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory,
         8 sockets

kernel                    speedup   max time per   stdev (ms)
                                    node (ms)

baseline (4.15-rc2)                         1261          1.9
ktask                       3.88x            325          5.0

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <Pavel.Tatashin@microsoft.com>
---
 mm/page_alloc.c | 91 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 78 insertions(+), 13 deletions(-)

Comments

Elliott, Robert (Servers) Nov. 10, 2018, 3:48 a.m. UTC | #1
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> owner@vger.kernel.org> On Behalf Of Daniel Jordan
> Sent: Monday, November 05, 2018 10:56 AM
> Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
> 
> ...  The kernel doesn't
> know the memory bandwidth of a given system to get the most efficient
> number of threads, so there's some guesswork involved.  

The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
that kind of information, and could facilitate automatic tuning.

There was discussion last year about kernel support for it:
https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwisler@linux.intel.com/


> In testing, a reasonable value turned out to be about a quarter of the
> CPUs on the node.
...
> +	/*
> +	 * We'd like to know the memory bandwidth of the chip to calculate the
> +	 * most efficient number of threads to start, but we can't.  In
> +	 * testing, a good value for a variety of systems was a quarter of the
> +	 * CPUs on the node.
> +	 */
> +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);


You might want to base that calculation on and limit the threads to
physical cores, not hyperthreaded cores.

---
Robert Elliott, HPE Persistent Memory
Daniel Jordan Nov. 12, 2018, 4:54 p.m. UTC | #2
On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent Memory) wrote:
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> > owner@vger.kernel.org> On Behalf Of Daniel Jordan
> > Sent: Monday, November 05, 2018 10:56 AM
> > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> > 
> > ...  The kernel doesn't
> > know the memory bandwidth of a given system to get the most efficient
> > number of threads, so there's some guesswork involved.  
> 
> The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
> that kind of information, and could facilitate automatic tuning.
> 
> There was discussion last year about kernel support for it:
> https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwisler@linux.intel.com/

Thanks for bringing this up.  I'm traveling but will take a closer look when I
get back.

> > In testing, a reasonable value turned out to be about a quarter of the
> > CPUs on the node.
> ...
> > +	/*
> > +	 * We'd like to know the memory bandwidth of the chip to calculate the
> > +	 * most efficient number of threads to start, but we can't.  In
> > +	 * testing, a good value for a variety of systems was a quarter of the
> > +	 * CPUs on the node.
> > +	 */
> > +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> 
> 
> You might want to base that calculation on and limit the threads to
> physical cores, not hyperthreaded cores.

Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
don't have data that shows that in this case.
Elliott, Robert (Servers) Nov. 12, 2018, 10:15 p.m. UTC | #3
> -----Original Message-----
> From: Daniel Jordan <daniel.m.jordan@oracle.com>
> Sent: Monday, November 12, 2018 11:54 AM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>; linux-mm@kvack.org;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; aarcange@redhat.com;
> aaron.lu@intel.com; akpm@linux-foundation.org; alex.williamson@redhat.com;
> bsd@redhat.com; darrick.wong@oracle.com; dave.hansen@linux.intel.com;
> jgg@mellanox.com; jwadams@google.com; jiangshanlai@gmail.com;
> mhocko@kernel.org; mike.kravetz@oracle.com; Pavel.Tatashin@microsoft.com;
> prasad.singamsetty@oracle.com; rdunlap@infradead.org;
> steven.sistare@oracle.com; tim.c.chen@intel.com; tj@kernel.org;
> vbabka@suse.cz
> Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
> 
> On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> Memory) wrote:
> > > -----Original Message-----
> > > From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> > > owner@vger.kernel.org> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > >
...
> > > In testing, a reasonable value turned out to be about a quarter of the
> > > CPUs on the node.
> > ...
> > > +	/*
> > > +	 * We'd like to know the memory bandwidth of the chip to calculate the
> > > +	 * most efficient number of threads to start, but we can't.  In
> > > +	 * testing, a good value for a variety of systems was a quarter of the
> > > +	 * CPUs on the node.
> > > +	 */
> > > +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> >
> >
> > You might want to base that calculation on and limit the threads to
> > physical cores, not hyperthreaded cores.
> 
> Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> don't have data that shows that in this case.

I think that's only if there are some register-based calculations to do while
waiting. If both threads are just doing memory accesses, they'll both stall, and
there doesn't seem to be any benefit in having two contexts generate the IOs
rather than one (at least on the systems I've used). I think it takes longer
to switch contexts than to just turn around the next IO.


---
Robert Elliott, HPE Persistent Memory
Daniel Jordan Nov. 19, 2018, 4:01 p.m. UTC | #4
On Mon, Nov 12, 2018 at 10:15:46PM +0000, Elliott, Robert (Persistent Memory) wrote:
> 
> 
> > -----Original Message-----
> > From: Daniel Jordan <daniel.m.jordan@oracle.com>
> > Sent: Monday, November 12, 2018 11:54 AM
> > To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
> > Cc: Daniel Jordan <daniel.m.jordan@oracle.com>; linux-mm@kvack.org;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; aarcange@redhat.com;
> > aaron.lu@intel.com; akpm@linux-foundation.org; alex.williamson@redhat.com;
> > bsd@redhat.com; darrick.wong@oracle.com; dave.hansen@linux.intel.com;
> > jgg@mellanox.com; jwadams@google.com; jiangshanlai@gmail.com;
> > mhocko@kernel.org; mike.kravetz@oracle.com; Pavel.Tatashin@microsoft.com;
> > prasad.singamsetty@oracle.com; rdunlap@infradead.org;
> > steven.sistare@oracle.com; tim.c.chen@intel.com; tj@kernel.org;
> > vbabka@suse.cz
> > Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> > 
> > On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> > Memory) wrote:
> > > > -----Original Message-----
> > > > From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> > > > owner@vger.kernel.org> On Behalf Of Daniel Jordan
> > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > initialization within each node
> > > >
> ...
> > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > CPUs on the node.
> > > ...
> > > > +	/*
> > > > +	 * We'd like to know the memory bandwidth of the chip to calculate the
> > > > +	 * most efficient number of threads to start, but we can't.  In
> > > > +	 * testing, a good value for a variety of systems was a quarter of the
> > > > +	 * CPUs on the node.
> > > > +	 */
> > > > +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > >
> > >
> > > You might want to base that calculation on and limit the threads to
> > > physical cores, not hyperthreaded cores.
> > 
> > Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> > don't have data that shows that in this case.
> 
> I think that's only if there are some register-based calculations to do while
> waiting. If both threads are just doing memory accesses, they'll both stall, and
> there doesn't seem to be any benefit in having two contexts generate the IOs
> rather than one (at least on the systems I've used). I think it takes longer
> to switch contexts than to just turn around the next IO.

(Sorry for the delay, Plumbers is over now...)

I guess we're both just waving our hands without data.  I've only got x86, so
using a quarter of the CPUs rules out HT on my end.  Do you have a system that
you can test this on, where using a quarter of the CPUs will involve HT?

Thanks,
Daniel
Daniel Jordan Nov. 19, 2018, 4:29 p.m. UTC | #5
On Mon, Nov 12, 2018 at 08:54:12AM -0800, Daniel Jordan wrote:
> On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent Memory) wrote:
> > > -----Original Message-----
> > > From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> > > owner@vger.kernel.org> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > > 
> > > ...  The kernel doesn't
> > > know the memory bandwidth of a given system to get the most efficient
> > > number of threads, so there's some guesswork involved.  
> > 
> > The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
> > that kind of information, and could facilitate automatic tuning.
> > 
> > There was discussion last year about kernel support for it:
> > https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwisler@linux.intel.com/
> 
> Thanks for bringing this up.  I'm traveling but will take a closer look when I
> get back.

So this series would give the total bandwidth for a memory target, but there's
not a way to map that to a CPU count.  In other words, it seems we couldn't
determine how many CPUs it takes to reach the max bandwidth.  If I haven't
missed something, I'm going to remove that comment.
Elliott, Robert (Servers) Nov. 27, 2018, 12:12 a.m. UTC | #6
> -----Original Message-----
> From: Daniel Jordan [mailto:daniel.m.jordan@oracle.com]
> Sent: Monday, November 19, 2018 10:02 AM
> On Mon, Nov 12, 2018 at 10:15:46PM +0000, Elliott, Robert (Persistent Memory) wrote:
> >
> > > -----Original Message-----
> > > From: Daniel Jordan <daniel.m.jordan@oracle.com>
> > > Sent: Monday, November 12, 2018 11:54 AM
> > >
> > > On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> > > Memory) wrote:
> > > > > -----Original Message-----
> > > > > From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> > > > > owner@vger.kernel.org> On Behalf Of Daniel Jordan
> > > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > > initialization within each node
> > > > >
> > ...
> > > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > > CPUs on the node.
> > > > ...
> > > > > +	/*
> > > > > +	 * We'd like to know the memory bandwidth of the chip to calculate the
> > > > > +	 * most efficient number of threads to start, but we can't.  In
> > > > > +	 * testing, a good value for a variety of systems was a quarter of the
> > > > > +	 * CPUs on the node.
> > > > > +	 */
> > > > > +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > > >
> > > >
> > > > You might want to base that calculation on and limit the threads to
> > > > physical cores, not hyperthreaded cores.
> > >
> > > Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> > > don't have data that shows that in this case.
> >
> > I think that's only if there are some register-based calculations to do while
> > waiting. If both threads are just doing memory accesses, they'll both stall, and
> > there doesn't seem to be any benefit in having two contexts generate the IOs
> > rather than one (at least on the systems I've used). I think it takes longer
> > to switch contexts than to just turn around the next IO.
> 
> (Sorry for the delay, Plumbers is over now...)
> 
> I guess we're both just waving our hands without data.  I've only got x86, so
> using a quarter of the CPUs rules out HT on my end.  Do you have a system that
> you can test this on, where using a quarter of the CPUs will involve HT?

I ran a short test with:
* HPE ProLiant DL360 Gen9 system
* Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and 
  18 hyperthreaded cores (36-53)
* DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
* fio workload generator
* cores on one CPU socket talking to a pmem device on the same CPU
* large (1 MiB) random writes (to minimize the threads getting CPU cache
  hits from each other)

Results:
* 31.7 GB/s    four threads, four physical cores (0,1,2,3)
* 22.2 GB/s    four threads, two physical cores (0,1,36,37)
* 21.4 GB/s    two threads, two physical cores (0,1)
* 12.1 GB/s    two threads, one physical core (0,36)
* 11.2 GB/s    one thread, one physical core (0)

So, I think it's important that the initialization threads run on
separate physical cores.

For the number of cores to use, one approach is:
    memory bandwidth (number of interleaved channels * speed)
divided by 
    CPU core max sustained write bandwidth

For example, this 2133 MT/s system is roughly:
    68 GB/s    (4 * 17 GB/s nominal)
divided by
    11.2 GB/s  (one core's performance)
which is 
    6 cores

ACPI HMAT will report that 68 GB/s number.  I'm not sure of
a good way to discover the 11.2 GB/s number.
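
In code, assuming both values were somehow available to the kernel (the
first from ACPI HMAT, the second from some calibration; the variable
names here are made up), the sizing would be roughly:

	/* hypothetical: both bandwidths in MB/s */
	nr_threads = DIV_ROUND_UP(node_mem_bw, core_write_bw);
	/* on this system: 68000 / 11200 rounds up to 7, near the ~6 above */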


fio job file:
[global]
direct=1
ioengine=sync
norandommap
randrepeat=0
bs=1M
runtime=20
time_based=1
group_reporting
thread
gtod_reduce=1
zero_buffers
cpus_allowed_policy=split
# pick the desired number of threads
numjobs=4
numjobs=2
numjobs=1

# CPU0: cores 0-17, hyperthreaded cores 36-53
[pmem0]
filename=/dev/pmem0
# pick the desired cpus_allowed list
cpus_allowed=0,1,2,3
cpus_allowed=0,1,36,37
cpus_allowed=0,36
cpus_allowed=0,1
cpus_allowed=0
rw=randwrite

Although most CPU time is in movnti instructions (non-temporal stores),
there is overhead in clearing the page cache and in the pmem block
driver; those won't be present in your initialization function. 
perf top shows:
  82.00%  [kernel]                [k] memcpy_flushcache
   5.23%  [kernel]                [k] gup_pgd_range
   3.41%  [kernel]                [k] __blkdev_direct_IO_simple
   2.38%  [kernel]                [k] pmem_make_request
   1.46%  [kernel]                [k] write_pmem
   1.29%  [kernel]                [k] pmem_do_bvec


---
Robert Elliott, HPE Persistent Memory
Daniel Jordan Nov. 27, 2018, 8:23 p.m. UTC | #7
On Tue, Nov 27, 2018 at 12:12:28AM +0000, Elliott, Robert (Persistent Memory) wrote:
> I ran a short test with:
> * HPE ProLiant DL360 Gen9 system
> * Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and 
>   18 hyperthreaded cores (36-53)
> * DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
> * fio workload generator
> * cores on one CPU socket talking to a pmem device on the same CPU
> * large (1 MiB) random writes (to minimize the threads getting CPU cache
>   hits from each other)
> 
> Results:
> * 31.7 GB/s    four threads, four physical cores (0,1,2,3)
> * 22.2 GB/s    four threads, two physical cores (0,1,36,37)
> * 21.4 GB/s    two threads, two physical cores (0,1)
> * 12.1 GB/s    two threads, one physical core (0,36)
> * 11.2 GB/s    one thread, one physical core (0)
> 
> So, I think it's important that the initialization threads run on
> separate physical cores.

Thanks for running this.  And fair enough, in this test using both siblings
gives only a 4-8% speedup over one, so it makes sense to use only cores in the
calculation.

As for how to actually do this, some arches have smp_num_siblings, but there
should be a generic interface to provide that.

It's also possible to calculate this from the existing
topology_sibling_cpumask, but the first option is better IMHO.  Open to
suggestions.
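
A rough sketch of the topology_sibling_cpumask route (untested; assumes
the sibling mask includes the CPU itself, and ignores early-boot
allocation details):

	static int nr_phys_cores(const struct cpumask *node_mask)
	{
		cpumask_var_t seen;
		int cpu, cores = 0;

		if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
			return cpumask_weight(node_mask); /* fall back to all CPUs */

		for_each_cpu(cpu, node_mask) {
			if (cpumask_test_cpu(cpu, seen))
				continue; /* hyperthread sibling already counted */
			cpumask_or(seen, seen, topology_sibling_cpumask(cpu));
			cores++;
		}
		free_cpumask_var(seen);
		return cores;
	}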

> For the number of cores to use, one approach is:
>     memory bandwidth (number of interleaved channels * speed)
> divided by 
>     CPU core max sustained write bandwidth
> 
> For example, this 2133 MT/s system is roughly:
>     68 GB/s    (4 * 17 GB/s nominal)
> divided by
>     11.2 GB/s  (one core's performance)
> which is 
>     6 cores
> 
> ACPI HMAT will report that 68 GB/s number.  I'm not sure of
> a good way to discover the 11.2 GB/s number.

Yes, this would be nice to do if we could know the per-core number, with the
caveat that a single number like this would be most useful for the CPU-memory
pair it was calculated for, so the kernel could at least calculate it for jobs
operating on local memory.

Some BogoMIPS-like calibration may work, but I'll wait for ACPI HMAT support in
the kernel.
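
If someone did go the calibration route, a hypothetical probe might time
a large streaming write, something like the following (buf and the
sizing are made up for illustration, and plain memset may not generate
the non-temporal stores being measured above):

	ktime_t t0;
	u64 ns, bw;

	t0 = ktime_get();
	memset(buf, 0, SZ_64M);		/* buf: preallocated 64 MiB buffer */
	ns = ktime_to_ns(ktime_sub(ktime_get(), t0));
	bw = div64_u64((u64)SZ_64M * NSEC_PER_SEC, ns);	/* bytes/sec */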

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..fe7b681567ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@ 
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/ktask.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -1275,7 +1276,6 @@  void __free_pages_core(struct page *page, unsigned int order)
 		set_page_count(p, 0);
 	}
 
-	page_zone(page)->managed_pages += nr_pages;
 	set_page_refcounted(page);
 	__free_pages(page, order);
 }
@@ -1340,6 +1340,7 @@  void __init memblock_free_pages(struct page *page, unsigned long pfn,
 	if (early_page_uninitialised(pfn))
 		return;
 	__free_pages_core(page, order);
+	page_zone(page)->managed_pages += 1UL << order;
 }
 
 /*
@@ -1477,23 +1478,31 @@  deferred_pfn_valid(int nid, unsigned long pfn,
 	return true;
 }
 
+struct deferred_args {
+	int nid;
+	int zid;
+	atomic64_t nr_pages;
+};
+
 /*
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
-				       unsigned long end_pfn)
+static int __init deferred_free_pages(int nid, int zid, unsigned long pfn,
+				      unsigned long end_pfn)
 {
 	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
-	unsigned long nr_free = 0;
+	unsigned long nr_free = 0, nr_pages = 0;
 
 	for (; pfn < end_pfn; pfn++) {
 		if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
 			deferred_free_range(pfn - nr_free, nr_free);
+			nr_pages += nr_free;
 			nr_free = 0;
 		} else if (!(pfn & nr_pgmask)) {
 			deferred_free_range(pfn - nr_free, nr_free);
+			nr_pages += nr_free;
 			nr_free = 1;
 			touch_nmi_watchdog();
 		} else {
@@ -1502,16 +1511,27 @@  static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
 	}
 	/* Free the last block of pages to allocator */
 	deferred_free_range(pfn - nr_free, nr_free);
+	nr_pages += nr_free;
+
+	return nr_pages;
+}
+
+static int __init deferred_free_chunk(unsigned long pfn, unsigned long end_pfn,
+				      struct deferred_args *args)
+{
+	unsigned long nr_pages = deferred_free_pages(args->nid, args->zid, pfn,
+						     end_pfn);
+	atomic64_add(nr_pages, &args->nr_pages);
+	return KTASK_RETURN_SUCCESS;
 }
 
 /*
  * Initialize struct pages.  We minimize pfn page lookups and scheduler checks
  * by performing it only once every pageblock_nr_pages.
- * Return number of pages initialized.
+ * Return number of pages initialized in deferred_args.
  */
-static unsigned long  __init deferred_init_pages(int nid, int zid,
-						 unsigned long pfn,
-						 unsigned long end_pfn)
+static int __init deferred_init_pages(int nid, int zid, unsigned long pfn,
+				      unsigned long end_pfn)
 {
 	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
@@ -1531,7 +1551,17 @@  static unsigned long  __init deferred_init_pages(int nid, int zid,
 		__init_single_page(page, pfn, zid, nid);
 		nr_pages++;
 	}
-	return (nr_pages);
+
+	return nr_pages;
+}
+
+static int __init deferred_init_chunk(unsigned long pfn, unsigned long end_pfn,
+				      struct deferred_args *args)
+{
+	unsigned long nr_pages = deferred_init_pages(args->nid, args->zid, pfn,
+						     end_pfn);
+	atomic64_add(nr_pages, &args->nr_pages);
+	return KTASK_RETURN_SUCCESS;
 }
 
 /* Initialise remaining memory on a node */
@@ -1540,13 +1570,15 @@  static int __init deferred_init_memmap(void *data)
 	pg_data_t *pgdat = data;
 	int nid = pgdat->node_id;
 	unsigned long start = jiffies;
-	unsigned long nr_pages = 0;
+	unsigned long nr_init = 0, nr_free = 0;
 	unsigned long spfn, epfn, first_init_pfn, flags;
 	phys_addr_t spa, epa;
 	int zid;
 	struct zone *zone;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
 	u64 i;
+	unsigned long nr_node_cpus;
+	struct ktask_node kn;
 
 	/* Bind memory initialisation thread to a local node if possible */
 	if (!cpumask_empty(cpumask))
@@ -1560,6 +1592,14 @@  static int __init deferred_init_memmap(void *data)
 		return 0;
 	}
 
+	/*
+	 * We'd like to know the memory bandwidth of the chip to calculate the
+	 * most efficient number of threads to start, but we can't.  In
+	 * testing, a good value for a variety of systems was a quarter of the
+	 * CPUs on the node.
+	 */
+	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
+
 	/* Sanity check boundaries */
 	BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
 	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
@@ -1580,21 +1620,46 @@  static int __init deferred_init_memmap(void *data)
 	 * page in __free_one_page()).
 	 */
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+		struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) };
+		DEFINE_KTASK_CTL(ctl, deferred_init_chunk, &args,
+				 KTASK_PTE_MINCHUNK);
+		ktask_ctl_set_max_threads(&ctl, nr_node_cpus);
+
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
+
+		kn.kn_start	= (void *)spfn;
+		kn.kn_task_size	= (spfn < epfn) ? epfn - spfn : 0;
+		kn.kn_nid	= nid;
+		(void) ktask_run_numa(&kn, 1, &ctl);
+
+		nr_init += atomic64_read(&args.nr_pages);
 	}
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+		struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) };
+		DEFINE_KTASK_CTL(ctl, deferred_free_chunk, &args,
+				 KTASK_PTE_MINCHUNK);
+		ktask_ctl_set_max_threads(&ctl, nr_node_cpus);
+
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		deferred_free_pages(nid, zid, spfn, epfn);
+
+		kn.kn_start	= (void *)spfn;
+		kn.kn_task_size	= (spfn < epfn) ? epfn - spfn : 0;
+		kn.kn_nid	= nid;
+		(void) ktask_run_numa(&kn, 1, &ctl);
+
+		nr_free += atomic64_read(&args.nr_pages);
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
+	VM_BUG_ON(nr_init != nr_free);
+
+	zone->managed_pages += nr_free;
 
-	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
+	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_free,
 					jiffies_to_msecs(jiffies - start));
 
 	pgdat_init_report_one_done();