[0/4] kernfs: proposed locking and concurrency improvement

Message ID 159038508228.276051.14042452586133971255.stgit@mickey.themaw.net (mailing list archive)

Message

Ian Kent May 25, 2020, 5:46 a.m. UTC
For very large systems with hundreds of CPUs and TBs of RAM, booting can
take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require the
kernel parameters udev.children-max=1024 and
systemd.default_timeout_start_sec=3600 to prevent dropping into
emergency mode.

Gathering information about what's happening during the boot is a bit
challenging. But the two main issues appeared to be a large number of
path lookups for non-existent files, and high lock contention in the VFS
during path walks, particularly in the dentry allocation code path.

The underlying cause of this was believed to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration.

This patch series tries to reduce the locking needed during path walks,
based on the assumption that there are many path walks and that a fairly
large portion of them are for non-existent paths.

This was done by adding negative dentry caching to kernfs (to avoid a
continual alloc/free cycle of dentries for non-existent paths) and by
introducing a read/write semaphore to increase kernfs concurrency during
path walks.
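
To make the intent a little more concrete, here is a rough, illustrative
sketch of the two ideas. It is not the actual patches; names such as
kernfs_rwsem and dir.rev are placeholders for whatever the series ends
up using, and it assumes the existing kernfs helpers in fs/kernfs/dir.c
(kernfs_find_ns(), kernfs_get_inode(), kernfs_dentry_node()). Lookups
take the lock shared so concurrent path walks no longer serialise, a
failed lookup hashes a cacheable negative dentry, and revalidation only
drops that negative dentry if the parent directory has changed since it
was created:

static DECLARE_RWSEM(kernfs_rwsem);	/* replaces the global kernfs mutex */

static struct dentry *kernfs_iop_lookup(struct inode *dir,
					struct dentry *dentry,
					unsigned int flags)
{
	struct kernfs_node *parent = dir->i_private;
	struct kernfs_node *kn;
	struct inode *inode = NULL;

	down_read(&kernfs_rwsem);	/* shared: lookups run concurrently */
	kn = kernfs_find_ns(parent, dentry->d_name.name, NULL); /* ns handling omitted */
	if (kn && kernfs_active(kn)) {
		inode = kernfs_get_inode(dir->i_sb, kn);
		if (!inode)
			inode = ERR_PTR(-ENOMEM);
	}
	/* remember the parent's state this (possibly negative) dentry saw */
	dentry->d_time = parent->dir.rev;
	up_read(&kernfs_rwsem);

	/* with a NULL inode this hashes a reusable negative dentry */
	return d_splice_alias(inode, dentry);
}

static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
{
	if (flags & LOOKUP_RCU)
		return -ECHILD;

	if (d_really_is_negative(dentry)) {
		struct kernfs_node *parent;
		int valid;

		down_read(&kernfs_rwsem);
		parent = kernfs_dentry_node(dentry->d_parent);
		/* still valid if nothing was added or removed in the parent */
		valid = parent && dentry->d_time == parent->dir.rev;
		up_read(&kernfs_rwsem);
		return valid;
	}

	/* ... existing checks for positive dentries unchanged ... */
	return 1;
}

/*
 * The directory modification paths (kernfs_add_one(), kernfs_remove(),
 * rename) would take down_write(&kernfs_rwsem) and bump the parent's
 * dir.rev, invalidating any cached negative dentries beneath it.
 */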

With these changes the kernel parameters udev.children-max=2048 and
systemd.default_timeout_start_sec=300 are still needed to get the
fastest boot times, which are now under 5 minutes.

There may be opportunities for further improvement, but the series here
has seen a fair amount of testing. Thinking about what else could be
done, and discussing it with Rick Lindsley, I suspect further changes
will be harder to implement for somewhat less improvement, so I think
what we have here is a good start for now.

I think what's needed now is patch review and, if we can get through
that, sending the series via linux-next for broader exposure in the hope
of having it merged into mainline.
---

Ian Kent (4):
      kernfs: switch kernfs to use an rwsem
      kernfs: move revalidate to be near lookup
      kernfs: improve kernfs path resolution
      kernfs: use revision to identify directory node changes


 fs/kernfs/dir.c             |  283 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   16 +-
 fs/kernfs/kernfs-internal.h |   29 ++++
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    5 +
 7 files changed, 232 insertions(+), 121 deletions(-)

--
Ian

Comments

Greg KH May 25, 2020, 6:16 a.m. UTC | #1
On Mon, May 25, 2020 at 01:46:59PM +0800, Ian Kent wrote:
> For very large systems with hundreds of CPUs and TBs of RAM booting can
> take a very long time.
> 
> Initial reports showed that booting a configuration of several hundred
> CPUs and 64TB of RAM would take more than 30 minutes and require kernel
> parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600
> to prevent dropping into emergency mode.
> 
> Gathering information about what's happening during the boot is a bit
> challenging. But two main issues appeared to be, a large number of path
> lookups for non-existent files, and high lock contention in the VFS during
> path walks particularly in the dentry allocation code path.
> 
> The underlying cause of this was believed to be the sheer number of sysfs
> memory objects, 100,000+ for a 64TB memory configuration.

Independent of your kernfs changes, why do we really need to represent
all of this memory with that many different "memory objects"?  What is
that providing to userspace?

I remember Ben Herrenschmidt did a lot of work on some of the kernfs and
other functions to make large-memory systems boot faster and to remove
some of the complexity in our functions, but that too did not look into
why we needed to create so many objects in the first place.

Perhaps you might want to look there instead?

thanks,

greg k-h
Ian Kent May 25, 2020, 7:23 a.m. UTC | #2
On Mon, 2020-05-25 at 08:16 +0200, Greg Kroah-Hartman wrote:
> On Mon, May 25, 2020 at 01:46:59PM +0800, Ian Kent wrote:
> > For very large systems with hundreds of CPUs and TBs of RAM booting
> > can
> > take a very long time.
> > 
> > Initial reports showed that booting a configuration of several
> > hundred
> > CPUs and 64TB of RAM would take more than 30 minutes and require
> > kernel
> > parameters of udev.children-max=1024
> > systemd.default_timeout_start_sec=3600
> > to prevent dropping into emergency mode.
> > 
> > Gathering information about what's happening during the boot is a
> > bit
> > challenging. But two main issues appeared to be, a large number of
> > path
> > lookups for non-existent files, and high lock contention in the VFS
> > during
> > path walks particularly in the dentry allocation code path.
> > 
> > The underlying cause of this was believed to be the sheer number of
> > sysfs
> > memory objects, 100,000+ for a 64TB memory configuration.
> 
> Independent of your kernfs changes, why do we really need to
> represent
> all of this memory with that many different "memory objects"?  What
> is
> that providing to userspace?
> 
> I remember Ben Herrenschmidt did a lot of work on some of the kernfs
> and
> other functions to make large-memory systems boot faster to remove
> some
> of the complexity in our functions, but that too did not look into
> why
> we needed to create so many objects in the first place.
> 
> Perhaps you might want to look there instead?

I presumed it was a hardware design requirement or IBM VM design
requirement.

Perhaps Rick can find out more on that question.

Ian
Greg KH May 25, 2020, 7:31 a.m. UTC | #3
On Mon, May 25, 2020 at 03:23:35PM +0800, Ian Kent wrote:
> On Mon, 2020-05-25 at 08:16 +0200, Greg Kroah-Hartman wrote:
> > On Mon, May 25, 2020 at 01:46:59PM +0800, Ian Kent wrote:
> > > For very large systems with hundreds of CPUs and TBs of RAM booting
> > > can
> > > take a very long time.
> > > 
> > > Initial reports showed that booting a configuration of several
> > > hundred
> > > CPUs and 64TB of RAM would take more than 30 minutes and require
> > > kernel
> > > parameters of udev.children-max=1024
> > > systemd.default_timeout_start_sec=3600
> > > to prevent dropping into emergency mode.
> > > 
> > > Gathering information about what's happening during the boot is a
> > > bit
> > > challenging. But two main issues appeared to be, a large number of
> > > path
> > > lookups for non-existent files, and high lock contention in the VFS
> > > during
> > > path walks particularly in the dentry allocation code path.
> > > 
> > > The underlying cause of this was believed to be the sheer number of
> > > sysfs
> > > memory objects, 100,000+ for a 64TB memory configuration.
> > 
> > Independent of your kernfs changes, why do we really need to
> > represent
> > all of this memory with that many different "memory objects"?  What
> > is
> > that providing to userspace?
> > 
> > I remember Ben Herrenschmidt did a lot of work on some of the kernfs
> > and
> > other functions to make large-memory systems boot faster to remove
> > some
> > of the complexity in our functions, but that too did not look into
> > why
> > we needed to create so many objects in the first place.
> > 
> > Perhaps you might want to look there instead?
> 
> I presumed it was a hardware design requirement or IBM VM design
> requirement.
> 
> Perhaps Rick can find out more on that question.

Also, why do you need to create the devices _when_ you create them?  Can
you wait until after init is up and running to start populating the
device tree with them?  That way boot can be moving on and disks can be
spinning up earlier?

Also, what about just hot-adding all of that memory after init happens?

Those two options only delay the long delay, but it could allow other
things to be moving and speed up the overall boot process.

thanks,

greg k-h
Rick Lindsley May 27, 2020, 12:44 p.m. UTC | #4
On 5/24/20 11:16 PM, Greg Kroah-Hartman wrote:

> Independent of your kernfs changes, why do we really need to represent
> all of this memory with that many different "memory objects"?  What is
> that providing to userspace?
> 
> I remember Ben Herrenschmidt did a lot of work on some of the kernfs and
> other functions to make large-memory systems boot faster to remove some
> of the complexity in our functions, but that too did not look into why
> we needed to create so many objects in the first place.

That was my first choice too.  Unfortunately, I was not consulted on this design decision, and now it's out there.  It is, as you guessed, a hardware "feature".  The hw believes there is value in identifying memory in 256MB chunks.  There are, unfortunately, 2^18 or over 250,000 of those on a 64TB system, compared with dozens or maybe even hundreds of other devices.
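
Just to spell the arithmetic out (a trivial back-of-the-envelope check,
assuming the 256MB block size above, i.e. one memoryNNN object per block
under /sys/devices/system/memory):

#include <stdio.h>

int main(void)
{
	unsigned long long ram   = 64ULL << 40;		/* 64TB of RAM */
	unsigned long long block = 256ULL << 20;	/* 256MB per memory block */

	/* 2^46 / 2^28 = 2^18 = 262144 memory objects */
	printf("%llu memory blocks\n", ram / block);
	return 0;
}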

We considered a revamping of the boot process - delaying some devices, reordering operations and such - but deemed that more dangerous to other architectures.  Although this change is driven by a particular architecture, the changes we've identified are architecture independent.  The risk of breaking something else is much lower than if we start reordering boot steps.

> Also, why do you need to create the devices _when_ you create them?  Can
> you wait until after init is up and running to start populating the
> device tree with them?  That way boot can be moving on and disks can be
> spinning up earlier?

I'm not a systemd expert, unfortunately, so I don't know if it needs to happen *right* then or not.  I do know that upon successful boot, a ps reveals many systemd children still reporting in.  It's not that we're waiting on everybody; the contention is causing a delay in the discovery of key devices like disks, and *that* leads to timeouts firing in systemd rules.  Any workaround bent on dodging the problem tends to get exponentially worse when the numbers change.  We noticed this problem at 32TB, designed some timeout changes and udev options to improve it, only to have both fail at 64TB.  Worse, at 64TB, larger timeouts and udev options failed to work consistently anymore.

There are two times we do coldplugs - once in the initramfs, and then again after we switch over to the actual root.  I did try omitting memory devices after the switchover.  Much faster!  So, why is the second one necessary?  Are there some architectures that need that?  I've not found anyone who can answer that, so going that route presents us with a different big risk.

Rick