mbox series

[0/7] padata: parallelize deferred page init

Message ID 20200430201125.532129-1-daniel.m.jordan@oracle.com (mailing list archive)
Headers show
Series padata: parallelize deferred page init | expand

Message

Daniel Jordan April 30, 2020, 8:11 p.m. UTC
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem, and this series is the
first step.

It extends padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 7.

The first user is deferred struct page init, a large bottleneck in
kernel boot--actually the largest for us and likely others too.  This
path doesn't require concurrency limits, resource control, or priority
adjustments like future users will (vfio, hugetlb fallocate, munmap)
because it happens during boot when the system is otherwise idle and
waiting on page init to finish.

This has been tested on a variety of x86 systems and speeds up kernel
boot by 6% to 49% by making deferred init 63% to 91% faster.  Patch 6
has detailed numbers.  Test results from other systems appreciated.

This series is based on v5.6 plus these three from mmotm:

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

All of the above can be found in this branch:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v1
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v1

The future users and related features are available as work-in-progress
here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.3
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.3

Thanks to everyone who commented on the last version of this[0],
including Alex Williamson, Jason Gunthorpe, Jonathan Corbet, Michal
Hocko, Pavel Machek, Peter Zijlstra, Randy Dunlap, Robert Elliott, Tejun
Heo, and Zi Yan.

RFC v4 -> padata v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[0] https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/

Daniel Jordan (7):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: move zone iterator outside of deferred_init_maxorder()
  mm: parallelize deferred_init_memmap()
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 include/linux/padata.h            |  43 ++++-
 init/main.c                       |   2 +
 kernel/padata.c                   | 277 ++++++++++++++++++++++++------
 mm/Kconfig                        |   6 +-
 mm/page_alloc.c                   | 118 ++++++-------
 6 files changed, 355 insertions(+), 132 deletions(-)


base-commit: 7111951b8d4973bda27ff663f2cf18b663d15b48
prerequisite-patch-id: 4ad522141e1119a325a9799dad2bd982fbac8b7c
prerequisite-patch-id: 169273327e56f5461101a71dfbd6b4cfd4570cf0
prerequisite-patch-id: 0f34692c8a9673d4c4f6a3545cf8ec3a2abf8620

Comments

Andrew Morton April 30, 2020, 9:31 p.m. UTC | #1
On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> Sometimes the kernel doesn't take full advantage of system memory
> bandwidth, leading to a single CPU spending excessive time in
> initialization paths where the data scales with memory size.
> 
> Multithreading naturally addresses this problem, and this series is the
> first step.
> 
> It extends padata, a framework that handles many parallel singlethreaded
> jobs, to handle multithreaded jobs as well by adding support for
> splitting up the work evenly, specifying a minimum amount of work that's
> appropriate for one helper thread to do, load balancing between helpers,
> and coordinating them.  More documentation in patches 4 and 7.
> 
> The first user is deferred struct page init, a large bottleneck in
> kernel boot--actually the largest for us and likely others too.  This
> path doesn't require concurrency limits, resource control, or priority
> adjustments like future users will (vfio, hugetlb fallocate, munmap)
> because it happens during boot when the system is otherwise idle and
> waiting on page init to finish.
> 
> This has been tested on a variety of x86 systems and speeds up kernel
> boot by 6% to 49% by making deferred init 63% to 91% faster.

How long is this up-to-91% in seconds?  If it's 91% of a millisecond
then not impressed.  If it's 91% of two weeks then better :)

Relatedly, how important is boot time on these large machines anyway? 
They presumably have lengthy uptimes so boot time is relatively
unimportant?

IOW, can you please explain more fully why this patchset is valuable to
our users?
Pasha Tatashin April 30, 2020, 9:40 p.m. UTC | #2
On Thu, Apr 30, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
>
> > Sometimes the kernel doesn't take full advantage of system memory
> > bandwidth, leading to a single CPU spending excessive time in
> > initialization paths where the data scales with memory size.
> >
> > Multithreading naturally addresses this problem, and this series is the
> > first step.
> >
> > It extends padata, a framework that handles many parallel singlethreaded
> > jobs, to handle multithreaded jobs as well by adding support for
> > splitting up the work evenly, specifying a minimum amount of work that's
> > appropriate for one helper thread to do, load balancing between helpers,
> > and coordinating them.  More documentation in patches 4 and 7.
> >
> > The first user is deferred struct page init, a large bottleneck in
> > kernel boot--actually the largest for us and likely others too.  This
> > path doesn't require concurrency limits, resource control, or priority
> > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > because it happens during boot when the system is otherwise idle and
> > waiting on page init to finish.
> >
> > This has been tested on a variety of x86 systems and speeds up kernel
> > boot by 6% to 49% by making deferred init 63% to 91% faster.
>
> How long is this up-to-91% in seconds?  If it's 91% of a millisecond
> then not impressed.  If it's 91% of two weeks then better :)
>
> Relatedly, how important is boot time on these large machines anyway?
> They presumably have lengthy uptimes so boot time is relatively
> unimportant?

Large machines indeed have a lengthy uptime, but they also can host a
large number of VMs meaning that downtime of the host increases the
downtime of VMs in cloud environments. Some VMs might be very sensible
to downtime: game servers, traders, etc.

>
> IOW, can you please explain more fully why this patchset is valuable to
> our users?
Josh Triplett May 1, 2020, 12:50 a.m. UTC | #3
On Thu, Apr 30, 2020 at 02:31:31PM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> > Sometimes the kernel doesn't take full advantage of system memory
> > bandwidth, leading to a single CPU spending excessive time in
> > initialization paths where the data scales with memory size.
> > 
> > Multithreading naturally addresses this problem, and this series is the
> > first step.
> > 
> > It extends padata, a framework that handles many parallel singlethreaded
> > jobs, to handle multithreaded jobs as well by adding support for
> > splitting up the work evenly, specifying a minimum amount of work that's
> > appropriate for one helper thread to do, load balancing between helpers,
> > and coordinating them.  More documentation in patches 4 and 7.
> > 
> > The first user is deferred struct page init, a large bottleneck in
> > kernel boot--actually the largest for us and likely others too.  This
> > path doesn't require concurrency limits, resource control, or priority
> > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > because it happens during boot when the system is otherwise idle and
> > waiting on page init to finish.
> > 
> > This has been tested on a variety of x86 systems and speeds up kernel
> > boot by 6% to 49% by making deferred init 63% to 91% faster.
> 
> How long is this up-to-91% in seconds?  If it's 91% of a millisecond
> then not impressed.  If it's 91% of two weeks then better :)

Some test results on a system with 96 CPUs and 192GB of memory:

Without this patch series:
[    0.487132] node 0 initialised, 23398907 pages in 292ms
[    0.499132] node 1 initialised, 24189223 pages in 304ms
...
[    0.629376] Run /sbin/init as init process

With this patch series:
[    0.227868] node 0 initialised, 23398907 pages in 28ms
[    0.230019] node 1 initialised, 24189223 pages in 28ms
...
[    0.361069] Run /sbin/init as init process

That makes a huge difference; memory initialization is the largest
remaining component of boot time.

> Relatedly, how important is boot time on these large machines anyway? 
> They presumably have lengthy uptimes so boot time is relatively
> unimportant?

Cloud systems and other virtual machines may have a huge amount of
memory but not necessarily run for a long time; on such systems, boot
time becomes critically important. Reducing boot time on cloud systems
and VMs makes the difference between "leave running to reduce latency"
and "just boot up when needed".

- Josh Triplett
Josh Triplett May 1, 2020, 1:09 a.m. UTC | #4
On Thu, Apr 30, 2020 at 04:11:18PM -0400, Daniel Jordan wrote:
> Sometimes the kernel doesn't take full advantage of system memory
> bandwidth, leading to a single CPU spending excessive time in
> initialization paths where the data scales with memory size.
> 
> Multithreading naturally addresses this problem, and this series is the
> first step.
> 
> It extends padata, a framework that handles many parallel singlethreaded
> jobs, to handle multithreaded jobs as well by adding support for
> splitting up the work evenly, specifying a minimum amount of work that's
> appropriate for one helper thread to do, load balancing between helpers,
> and coordinating them.  More documentation in patches 4 and 7.
> 
> The first user is deferred struct page init, a large bottleneck in
> kernel boot--actually the largest for us and likely others too.  This
> path doesn't require concurrency limits, resource control, or priority
> adjustments like future users will (vfio, hugetlb fallocate, munmap)
> because it happens during boot when the system is otherwise idle and
> waiting on page init to finish.
> 
> This has been tested on a variety of x86 systems and speeds up kernel
> boot by 6% to 49% by making deferred init 63% to 91% faster.  Patch 6
> has detailed numbers.  Test results from other systems appreciated.
> 
> This series is based on v5.6 plus these three from mmotm:
> 
>   mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
>   mm-initialize-deferred-pages-with-interrupts-enabled.patch
>   mm-call-cond_resched-from-deferred_init_memmap.patch
> 
> All of the above can be found in this branch:
> 
>   git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v1
>   https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v1

For the series (and the three prerequisite patches):

Tested-by: Josh Triplett <josh@joshtriplett.org>

Thank you for writing this, and thank you for working towards
upstreaming it!
Daniel Jordan May 1, 2020, 2:40 a.m. UTC | #5
On Thu, Apr 30, 2020 at 05:40:59PM -0400, Pavel Tatashin wrote:
> On Thu, Apr 30, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> >
> > > Sometimes the kernel doesn't take full advantage of system memory
> > > bandwidth, leading to a single CPU spending excessive time in
> > > initialization paths where the data scales with memory size.
> > >
> > > Multithreading naturally addresses this problem, and this series is the
> > > first step.
> > >
> > > It extends padata, a framework that handles many parallel singlethreaded
> > > jobs, to handle multithreaded jobs as well by adding support for
> > > splitting up the work evenly, specifying a minimum amount of work that's
> > > appropriate for one helper thread to do, load balancing between helpers,
> > > and coordinating them.  More documentation in patches 4 and 7.
> > >
> > > The first user is deferred struct page init, a large bottleneck in
> > > kernel boot--actually the largest for us and likely others too.  This
> > > path doesn't require concurrency limits, resource control, or priority
> > > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > > because it happens during boot when the system is otherwise idle and
> > > waiting on page init to finish.
> > >
> > > This has been tested on a variety of x86 systems and speeds up kernel
> > > boot by 6% to 49% by making deferred init 63% to 91% faster.
> >
> > How long is this up-to-91% in seconds?  If it's 91% of a millisecond
> > then not impressed.  If it's 91% of two weeks then better :)

The largest system I could test had 384G per node and saved 1.5 out of 4
seconds.

> > Relatedly, how important is boot time on these large machines anyway?
> > They presumably have lengthy uptimes so boot time is relatively
> > unimportant?
> 
> Large machines indeed have a lengthy uptime, but they also can host a
> large number of VMs meaning that downtime of the host increases the
> downtime of VMs in cloud environments. Some VMs might be very sensible
> to downtime: game servers, traders, etc.
>
> > IOW, can you please explain more fully why this patchset is valuable to
> > our users?

I'll let the users speak for themselves, but I have a similar use case to Pavel
of limiting the downtime of VMs running on these large systems, and spinning up
instances as fast as possible is also desirable for our cloud users.
Daniel Jordan May 1, 2020, 2:48 a.m. UTC | #6
On Thu, Apr 30, 2020 at 06:09:35PM -0700, Josh Triplett wrote:
> On Thu, Apr 30, 2020 at 04:11:18PM -0400, Daniel Jordan wrote:
> > Sometimes the kernel doesn't take full advantage of system memory
> > bandwidth, leading to a single CPU spending excessive time in
> > initialization paths where the data scales with memory size.
> > 
> > Multithreading naturally addresses this problem, and this series is the
> > first step.
> > 
> > It extends padata, a framework that handles many parallel singlethreaded
> > jobs, to handle multithreaded jobs as well by adding support for
> > splitting up the work evenly, specifying a minimum amount of work that's
> > appropriate for one helper thread to do, load balancing between helpers,
> > and coordinating them.  More documentation in patches 4 and 7.
> > 
> > The first user is deferred struct page init, a large bottleneck in
> > kernel boot--actually the largest for us and likely others too.  This
> > path doesn't require concurrency limits, resource control, or priority
> > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > because it happens during boot when the system is otherwise idle and
> > waiting on page init to finish.
> > 
> > This has been tested on a variety of x86 systems and speeds up kernel
> > boot by 6% to 49% by making deferred init 63% to 91% faster.  Patch 6
> > has detailed numbers.  Test results from other systems appreciated.
> > 
> > This series is based on v5.6 plus these three from mmotm:
> > 
> >   mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
> >   mm-initialize-deferred-pages-with-interrupts-enabled.patch
> >   mm-call-cond_resched-from-deferred_init_memmap.patch
> > 
> > All of the above can be found in this branch:
> > 
> >   git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v1
> >   https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v1
> 
> For the series (and the three prerequisite patches):
> 
> Tested-by: Josh Triplett <josh@joshtriplett.org>

Appreciate the runs, Josh, thanks.