[v2,0/6] kernfs: proposed locking and concurrency improvement

Message ID 159237905950.89469.6559073274338175600.stgit@mickey.themaw.net

Message

Ian Kent June 17, 2020, 7:37 a.m. UTC
For very large IBM Power mainframe systems with hundreds of CPUs and TBs
of RAM, booting can take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes, and require the kernel
parameters udev.children-max=1024 and systemd.default_timeout_start_sec=3600
to prevent dropping into emergency mode.

Gathering information about what's happening during boot is a bit
challenging, but two main issues emerged: a large number of path lookups
for non-existent files, and very high lock contention in the VFS during
path walks, particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration, since the hardware
divides the memory into 256MB logical blocks. This is believed to be due
either to the IBM Power hardware design or to a requirement of the mainframe
software used to create logical partitions (LPARs, which are used to
install an operating system to provide services), since these can be made
up of a wide range of resources: CPUs, memory, disks, etc.

It's not yet clear whether the creation of sysfs nodes for these memory
devices can be postponed or spread out over a longer period of time.
That's because the high overhead looks to be due to the notifications
received by udev, which invokes a systemd program for each of them, and
attempts by the systemd folks to improve this have not focused on changing
the handling of these notifications, possibly because of difficulties with
doing so. This remains an avenue of investigation.

Kernel traces show there are many path walks, with a fairly large portion
of those for non-existent paths. However, looking at the systemd code
invoked by the udev action, it appears there's only one additional lookup
for each invocation, so the large number of negative lookups is most likely
due to the large number of notifications rather than a fault in the
systemd program.

The series here tries to reduce the locking needed during path walks
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (for non-existent
paths) to avoid a continual alloc/free cycle of dentries, and by introducing
a read/write semaphore to increase kernfs concurrency during path walks.
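
To illustrate the locking pattern involved, here is a minimal sketch (not
code taken from the patches; the semaphore name and helpers are
placeholders): path walk lookups take the semaphore shared so they can run
concurrently, while node creation and removal still take it exclusive.

    #include <linux/rwsem.h>
    #include <linux/kernfs.h>

    /* Stand-in for the global kernfs_mutex that the series replaces. */
    static DECLARE_RWSEM(example_kernfs_rwsem);

    /* Path walk side: concurrent lookups only need shared access. */
    static struct kernfs_node *example_lookup(struct kernfs_node *parent,
                                              const char *name)
    {
            struct kernfs_node *kn = NULL;

            down_read(&example_kernfs_rwsem);
            /* rbtree search for 'name' under 'parent' goes here */
            up_read(&example_kernfs_rwsem);
            return kn;
    }

    /* Modification side: node add/remove still takes the lock exclusive. */
    static void example_add_node(struct kernfs_node *parent,
                                 struct kernfs_node *kn)
    {
            down_write(&example_kernfs_rwsem);
            /* link kn under parent and bump the directory revision used
             * to keep cached negative dentries valid */
            up_write(&example_kernfs_rwsem);
    }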

With these changes we still need kernel parameters of udev.children-max=2048
and systemd.default_timeout_start_sec=300 for the fastest boot times of
under 5 minutes.

There may be opportunities for further improvement, but the series here
has seen a fair amount of testing, along with some thought about what else
those improvements could be. Having discussed it with Rick Lindsley, I
suspect further improvements will get more difficult to implement for
somewhat less gain, so I think what we have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (6):
      kernfs: switch kernfs to use an rwsem
      kernfs: move revalidate to be near lookup
      kernfs: improve kernfs path resolution
      kernfs: use revision to identify directory node changes
      kernfs: refactor attr locking
      kernfs: make attr_mutex a local kernfs node lock


 fs/kernfs/dir.c             |  284 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   58 +++++----
 fs/kernfs/kernfs-internal.h |   29 ++++
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian

Comments

Tejun Heo June 19, 2020, 3:38 p.m. UTC | #1
Hello, Ian.

On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> The series here tries to reduce the locking needed during path walks
> based on the assumption that there are many path walks with a fairly
> large portion of those for non-existent paths, as described above.
> 
> That was done by adding kernfs negative dentry caching (non-existent
> paths) to avoid continual alloc/free cycle of dentries and a read/write
> semaphore introduced to increase kernfs concurrency during path walks.
> 
> With these changes we still need kernel parameters of udev.children-max=2048
> and systemd.default_timeout_start_sec=300 for the fastest boot times of
> under 5 minutes.

I don't have strong objections to the series but the rationales don't seem
particularly strong. It's solving a suspected problem but only half way. It
isn't clear whether this can be the long term solution for the problem
machine and whether it will benefit anyone else in a meaningful way either.

I think Greg already asked this but how are the 100,000+ memory objects
used? Is that justified in the first place?

Thanks.
Rick Lindsley June 19, 2020, 8:41 p.m. UTC | #2
On 6/19/20 8:38 AM, Tejun Heo wrote:

> I don't have strong objections to the series but the rationales don't seem
> particularly strong. It's solving a suspected problem but only half way. It
> isn't clear whether this can be the long term solution for the problem
> machine and whether it will benefit anyone else in a meaningful way either.

I don't understand your statement about solving the problem halfway.  Could you elaborate?

> I think Greg already asked this but how are the 100,000+ memory objects
> used? Is that justified in the first place?

They are used for hotplugging and partitioning memory.  The size of the segments (and thus the number of them) is dictated by the underlying hardware.

Rick
Tejun Heo June 19, 2020, 10:23 p.m. UTC | #3
On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> On 6/19/20 8:38 AM, Tejun Heo wrote:
> 
> > I don't have strong objections to the series but the rationales don't seem
> > particularly strong. It's solving a suspected problem but only half way. It
> > isn't clear whether this can be the long term solution for the problem
> > machine and whether it will benefit anyone else in a meaningful way either.
> 
> I don't understand your statement about solving the problem halfway. Could
> you elaborate?

Spending 5 minutes during boot creating sysfs objects doesn't seem like a
particularly good solution and I don't know whether anyone else would
experience similar issues. Again, not necessarily against improving the
scalability of kernfs code but the use case seems a bit out there.

> > I think Greg already asked this but how are the 100,000+ memory objects
> > used? Is that justified in the first place?
> 
> They are used for hotplugging and partitioning memory. The size of the
> segments (and thus the number of them) is dictated by the underlying
> hardware.

This sounds so bad. There gotta be a better interface for that, right?

Thanks.
Rick Lindsley June 20, 2020, 2:44 a.m. UTC | #4
On 6/19/20 3:23 PM, Tejun Heo wrote:

> Spending 5 minutes during boot creating sysfs objects doesn't seem like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving the
> scalability of kernfs code but the use case seems a bit out there.

Creating sysfs objects is not a new "solution".  We already do it, and it can take over an hour on a machine with 64TB of memory.  We're not *adding* this ... we're *improving* something that has been around for a long time.

>> They are used for hotplugging and partitioning memory. The size of the
>> segments (and thus the number of them) is dictated by the underlying
>> hardware.
> 
> This sounds so bad. There gotta be a better interface for that, right?

Again; this is not new.  Easily changing the state of devices by writing new values into kernfs is one of the reasons kernfs exists.

     echo 0 > /sys/devices/system/memory/memory10374/online

and boom - you've taken memory chunk 10374 offline.

These changes are not just a whim.  I used lockstat to measure contention during boot.  The addition of 250,000 "devices" in parallel created tremendous contention on the kernfs_mutex and, it appears, on one of the directories within it where memory nodes are created.  Using a mutex means that the details of that mutex must bounce around all the cpus ... did I mention 1500+ cpus?  A whole lot of thrash ...

These patches turn that mutex into a rw semaphore, and any thread checking for the mere existence of something can get by with a read lock. lockstat showed that about 90% of the mutex contentions were in a read path and only 10% in a write path.  Switching to a rw semaphore meant that 90% of the time there was no waiting and the thread proceeded with its caches intact.  Total time spent waiting on either read or write decreased by 75%, and should scale much better with subsequent hardware upgrades.

With contention on this particular data structure reduced, we saw thrash on related control structures decrease markedly too.  The contention for the lockref taken in dput dropped 66% and, likely due to reduced thrash, the time used waiting for that structure dropped 99%.

Rick
Ian Kent June 21, 2020, 3:21 a.m. UTC | #5
On Fri, 2020-06-19 at 11:38 -0400, Tejun Heo wrote:
> Hello, Ian.
> 
> On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> > The series here tries to reduce the locking needed during path
> > walks
> > based on the assumption that there are many path walks with a
> > fairly
> > large portion of those for non-existent paths, as described above.
> > 
> > That was done by adding kernfs negative dentry caching (non-
> > existent
> > paths) to avoid continual alloc/free cycle of dentries and a
> > read/write
> > semaphore introduced to increase kernfs concurrency during path
> > walks.
> > 
> > With these changes we still need kernel parameters of
> > udev.children-max=2048
> > and systemd.default_timeout_start_sec=300 for the fastest boot
> > times of
> > under 5 minutes.
> 
> I don't have strong objections to the series but the rationales don't
> seem
> particularly strong. It's solving a suspected problem but only half
> way. It
> isn't clear whether this can be the long term solution for the
> problem
> machine and whether it will benefit anyone else in a meaningful way
> either.
> 
> I think Greg already asked this but how are the 100,000+ memory
> objects
> used? Is that justified in the first place?

The problem is real enough; however, whether improvements can be made
in other areas, flowing on from the arch-specific device creation
notifications, is not clear cut.

There's no question that there is very high contention between the
VFS and sysfs and that's all the series is trying to improve, nothing
more.

What both you and Greg have raised are good questions but are
unfortunately very difficult to answer.

I tried to add some discussion about it, to the extent that I could,
in the cover letter.

Basically the division of memory into 256M chunks is something that's
needed to provide flexibility for arbitrary partition creation (a set
of allocated hardware that's used for, essentially, a bare metal OS
install). Whether that's many small partitions for load-balanced server
farms (or whatever) or much larger partitions for demanding
applications, such as Oracle systems, is not something that can be
known in advance.

So the division into small memory chunks can't change.

The question of sysfs node creation, what uses the nodes, and when they
are used, is much harder to answer.

I'm not able to find that out, and I doubt even IBM would know, whether
their customers use applications that need to consult the sysfs
file system for this information, or when it's needed, if it is needed
at all. So I'm stuck on this question.

One thing is for sure though: it would be (at the very least) risky
for a vendor to assume they either aren't needed, or aren't needed early
during system startup.

OTOH I've looked at what gets invoked on udev notifications (which
is the source of the heavy path walk activity, though I admit I need to
dig deeper) and that doesn't appear to be doing anything obviously
wrong, so that much seems OK.

For my part, as long as the series proves to be sound, why not take it;
it does substantially reduce contention between the VFS and sysfs
in the face of heavy sysfs path walk activity, so I think that
stands alone as sufficient to consider the change worthwhile.

Ian
Ian Kent June 21, 2020, 4:55 a.m. UTC | #6
On Fri, 2020-06-19 at 18:23 -0400, Tejun Heo wrote:
> On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> > On 6/19/20 8:38 AM, Tejun Heo wrote:
> > 
> > > I don't have strong objections to the series but the rationales
> > > don't seem
> > > particularly strong. It's solving a suspected problem but only
> > > half way. It
> > > isn't clear whether this can be the long term solution for the
> > > problem
> > > machine and whether it will benefit anyone else in a meaningful
> > > way either.
> > 
> > I don't understand your statement about solving the problem
> > halfway. Could
> > you elaborate?
> 
> Spending 5 minutes during boot creating sysfs objects doesn't seem
> like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving
> the
> scalability of kernfs code but the use case seems a bit out there.
> 
> > > I think Greg already asked this but how are the 100,000+ memory
> > > objects
> > > used? Is that justified in the first place?
> > 
> > They are used for hotplugging and partitioning memory. The size of
> > the
> > segments (and thus the number of them) is dictated by the
> > underlying
> > hardware.
> 
> This sounds so bad. There gotta be a better interface for that,
> right?

I'm still struggling a bit to grasp what you're getting at but ...

Maybe you're talking about the underlying notification system, where
a notification is sent for every event.

There's nothing new about that problem and it's becoming increasingly
clear that existing kernel notification sub-systems don't scale well.

Mount handling is a current example; it's one of the areas David
Howells is trying to improve, and that's taken years now to get as
far as it has.

It seems to me that any improvement in the area here would need a
different solution, perhaps something along the lines of merging
multiple notifications, carrying more context in notifications,
or the like. Something like notification merging to reduce
notification volume might eventually be useful for David's
notifications sub-system too (and, I think, the design of that
sub-system could probably accommodate that sort of change away
from the problematic anonymous notification sub-systems we have
now).

But it's taken a long time to get that far with that project, and
the case here would have a far more significant impact on a fairly
large number of sub-systems, both kernel and user space, so all I
can hope for from this discussion is to raise awareness of the need,
so that it's recognised and thought can go into approaches to
improving it.

So, while the questions you ask are valid and your concerns real,
it's unrealistic to think there's a simple solution that can be
implemented in short order. Raising awareness of the problem is all
that can be done now, so that fundamental, and probably widespread,
improvements might be implemented over time.

But if I misunderstand your thinking on this please elaborate further.

Ian
Tejun Heo June 22, 2020, 5:48 p.m. UTC | #7
Hello, Ian.

On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > They are used for hotplugging and partitioning memory. The size of
> > > the
> > > segments (and thus the number of them) is dictated by the
> > > underlying
> > > hardware.
> > 
> > This sounds so bad. There gotta be a better interface for that,
> > right?
> 
> I'm still struggling a bit to grasp what your getting at but ...

I was more trying to say that the sysfs device interface with per-object
directory isn't the right interface for this sort of usage at all. Are these
even real hardware pieces which can be plugged in and out? While being a
discrete piece of hardware isn't a requirement to be a device model device,
the whole thing is designed with such use cases in mind. It definitely isn't
the right design for representing six digit number of logical entities.

It should be obvious that representing each consecutive memory range with a
separate directory entry is far from an optimal way of representing
something like this. It's outright silly.

> Maybe your talking about the underlying notifications system where
> a notification is sent for every event.
> 
> There's nothing new about that problem and it's becoming increasingly
> clear that existing kernel notification sub-systems don't scale well.
> 
> Mount handling is a current example which is one of the areas David
> Howells is trying to improve and that's taken years now to get as
> far as it has.

There sure are scalability issues everywhere that need to be improved but
there also are cases where the issue is the approach itself rather than
scalability and I'm wondering whether this is the latter.

Thanks.
Tejun Heo June 22, 2020, 5:53 p.m. UTC | #8
On Fri, Jun 19, 2020 at 07:44:29PM -0700, Rick Lindsley wrote:
>     echo 0 > /sys/devices//system/memory/memory10374/online
> 
> and boom - you've taken memory chunk 10374 offline.
> 
> These changes are not just a whim. I used lockstat to measure contention
> during boot. The addition of 250,000 "devices" in parallel created
> tremendous contention on the kernfs_mutex and, it appears, on one of the
> directories within it where memory nodes are created. Using a mutex means
> that the details of that mutex must bounce around all the cpus ... did I
> mention 1500+ cpus? A whole lot of thrash ...

I don't know. The above highlights the absurdity of the approach itself to
me. You seem to be aware of it too in writing: 250,000 "devices".

Thanks.
Greg Kroah-Hartman June 22, 2020, 6:03 p.m. UTC | #9
On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> Hello, Ian.
> 
> On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > They are used for hotplugging and partitioning memory. The size of
> > > > the
> > > > segments (and thus the number of them) is dictated by the
> > > > underlying
> > > > hardware.
> > > 
> > > This sounds so bad. There gotta be a better interface for that,
> > > right?
> > 
> > I'm still struggling a bit to grasp what your getting at but ...
> 
> I was more trying to say that the sysfs device interface with per-object
> directory isn't the right interface for this sort of usage at all. Are these
> even real hardware pieces which can be plugged in and out? While being a
> discrete piece of hardware isn't a requirement to be a device model device,
> the whole thing is designed with such use cases on mind. It definitely isn't
> the right design for representing six digit number of logical entities.
> 
> It should be obvious that representing each consecutive memory range with a
> separate directory entry is far from an optimal way of representing
> something like this. It's outright silly.

I agree.  And again, Ian, you are just "kicking the problem down the
road" if we accept these patches.  Please fix this up properly so that
this interface is correctly fixed to not do looney things like this.

thanks,

greg k-h
Rick Lindsley June 22, 2020, 9:22 p.m. UTC | #10
On 6/22/20 10:53 AM, Tejun Heo wrote:

> I don't know. The above highlights the absurdity of the approach itself to
> me. You seem to be aware of it too in writing: 250,000 "devices".

Just because it is absurd doesn't mean it wasn't built that way :)

I agree, and I'm trying to influence the next hardware design.  However, what's already out there is memory units that must be accessed in 256MB blocks.  If you want to remove/add a GB, that's really 4 blocks of memory you're manipulating, to the hardware.  Those blocks have to be registered and recognized by the kernel for that to work.

Rick
Rick Lindsley June 22, 2020, 9:27 p.m. UTC | #11
On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:

> It should be obvious that representing each consecutive memory range with a
> separate directory entry is far from an optimal way of representing
> something like this. It's outright silly.

On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:

> I agree.  And again, Ian, you are just "kicking the problem down the
> road" if we accept these patches.  Please fix this up properly so that
> this interface is correctly fixed to not do looney things like this.

Given that we cannot change the underlying machine representation of this hardware, what do you (all, not just you Greg) consider to be "properly"?

Rick
Ian Kent June 23, 2020, 5:09 a.m. UTC | #12
On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > Hello, Ian.
> > 
> > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > They are used for hotplugging and partitioning memory. The
> > > > > size of
> > > > > the
> > > > > segments (and thus the number of them) is dictated by the
> > > > > underlying
> > > > > hardware.
> > > > 
> > > > This sounds so bad. There gotta be a better interface for that,
> > > > right?
> > > 
> > > I'm still struggling a bit to grasp what your getting at but ...
> > 
> > I was more trying to say that the sysfs device interface with per-
> > object
> > directory isn't the right interface for this sort of usage at all.
> > Are these
> > even real hardware pieces which can be plugged in and out? While
> > being a
> > discrete piece of hardware isn't a requirement to be a device model
> > device,
> > the whole thing is designed with such use cases on mind. It
> > definitely isn't
> > the right design for representing six digit number of logical
> > entities.
> > 
> > It should be obvious that representing each consecutive memory
> > range with a
> > separate directory entry is far from an optimal way of representing
> > something like this. It's outright silly.
> 
> I agree.  And again, Ian, you are just "kicking the problem down the
> road" if we accept these patches.  Please fix this up properly so
> that
> this interface is correctly fixed to not do looney things like this.

Fine, mitigating this problem isn't the end of the story, and you
don't want to accept a change to mitigate it because that could
mean no further discussion on it and no further work toward solving
it.

But it seems to me a "proper" solution to this will cross a number
of areas so this isn't just "my" problem and, as you point out, it's
likely to become increasingly problematic over time.

So what are your ideas and recommendations on how to handle hotplug
memory at this granularity for this much RAM (and larger amounts)?

Ian
Greg Kroah-Hartman June 23, 2020, 5:21 a.m. UTC | #13
On Mon, Jun 22, 2020 at 02:27:38PM -0700, Rick Lindsley wrote:
> 
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> 
> > It should be obvious that representing each consecutive memory range with a
> > separate directory entry is far from an optimal way of representing
> > something like this. It's outright silly.
> 
> On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:
> 
> > I agree.  And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches.  Please fix this up properly so that
> > this interface is correctly fixed to not do looney things like this.
> 
> Given that we cannot change the underlying machine representation of
> this hardware, what do you (all, not just you Greg) consider to be
> "properly"?

Change the userspace representation of the hardware then.  Why does
userspace care about so many individual blocks, what happens if you
provide them a larger granularity?  I can't imagine userspace really
wants to see 20k devices and manage them individually, where is the code
that does that?

What happens if you delay adding the devices until after booting?
Userspace should be event driven and only handle things after it sees
the devices being present, so try delaying and seeing what happens to
prevent this from keeping boot from progressing.

thanks,

greg k-h
Greg Kroah-Hartman June 23, 2020, 6:02 a.m. UTC | #14
On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > Hello, Ian.
> > > 
> > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > They are used for hotplugging and partitioning memory. The
> > > > > > size of
> > > > > > the
> > > > > > segments (and thus the number of them) is dictated by the
> > > > > > underlying
> > > > > > hardware.
> > > > > 
> > > > > This sounds so bad. There gotta be a better interface for that,
> > > > > right?
> > > > 
> > > > I'm still struggling a bit to grasp what your getting at but ...
> > > 
> > > I was more trying to say that the sysfs device interface with per-
> > > object
> > > directory isn't the right interface for this sort of usage at all.
> > > Are these
> > > even real hardware pieces which can be plugged in and out? While
> > > being a
> > > discrete piece of hardware isn't a requirement to be a device model
> > > device,
> > > the whole thing is designed with such use cases on mind. It
> > > definitely isn't
> > > the right design for representing six digit number of logical
> > > entities.
> > > 
> > > It should be obvious that representing each consecutive memory
> > > range with a
> > > separate directory entry is far from an optimal way of representing
> > > something like this. It's outright silly.
> > 
> > I agree.  And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches.  Please fix this up properly so
> > that
> > this interface is correctly fixed to not do looney things like this.
> 
> Fine, mitigating this problem isn't the end of the story, and you
> don't want to do accept a change to mitigate it because that could
> mean no further discussion on it and no further work toward solving
> it.
> 
> But it seems to me a "proper" solution to this will cross a number
> of areas so this isn't just "my" problem and, as you point out, it's
> likely to become increasingly problematic over time.
> 
> So what are your ideas and recommendations on how to handle hotplug
> memory at this granularity for this much RAM (and larger amounts)?

First off, this is not my platform, and not my problem, so it's funny
you ask me :)

Anyway, as I have said before, my first guesses would be:
	- increase the granularity size of the "memory chunks", reducing
	  the number of devices you create.
	- delay creating the devices until way after booting, or do it
	  on a totally different path/thread/workqueue/whatever to
	  prevent delay at booting

And then there's always:
	- don't create them at all, only do so if userspace asks
	  you to.

You all have the userspace tools/users for this interface and know it
best to know what will work for them.  If you don't, then hey, let's
just delete the whole thing and see who screams :)

thanks,

greg k-h
Ian Kent June 23, 2020, 8:01 a.m. UTC | #15
On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > Hello, Ian.
> > > > 
> > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > The
> > > > > > > size of
> > > > > > > the
> > > > > > > segments (and thus the number of them) is dictated by the
> > > > > > > underlying
> > > > > > > hardware.
> > > > > > 
> > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > that,
> > > > > > right?
> > > > > 
> > > > > I'm still struggling a bit to grasp what your getting at but
> > > > > ...
> > > > 
> > > > I was more trying to say that the sysfs device interface with
> > > > per-
> > > > object
> > > > directory isn't the right interface for this sort of usage at
> > > > all.
> > > > Are these
> > > > even real hardware pieces which can be plugged in and out?
> > > > While
> > > > being a
> > > > discrete piece of hardware isn't a requirement to be a device
> > > > model
> > > > device,
> > > > the whole thing is designed with such use cases on mind. It
> > > > definitely isn't
> > > > the right design for representing six digit number of logical
> > > > entities.
> > > > 
> > > > It should be obvious that representing each consecutive memory
> > > > range with a
> > > > separate directory entry is far from an optimal way of
> > > > representing
> > > > something like this. It's outright silly.
> > > 
> > > I agree.  And again, Ian, you are just "kicking the problem down
> > > the
> > > road" if we accept these patches.  Please fix this up properly so
> > > that
> > > this interface is correctly fixed to not do looney things like
> > > this.
> > 
> > Fine, mitigating this problem isn't the end of the story, and you
> > don't want to do accept a change to mitigate it because that could
> > mean no further discussion on it and no further work toward solving
> > it.
> > 
> > But it seems to me a "proper" solution to this will cross a number
> > of areas so this isn't just "my" problem and, as you point out,
> > it's
> > likely to become increasingly problematic over time.
> > 
> > So what are your ideas and recommendations on how to handle hotplug
> > memory at this granularity for this much RAM (and larger amounts)?
> 
> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Sorry, but I don't think it's funny at all.

It's not "my platform" either, I'm just the poor old sole that
took this on because, on the face of it, it's a file system
problem as claimed by others that looked at it and promptly
washed their hands of it.

I don't see how asking for your advice is out of order at all.

> 
> Anyway, as I have said before, my first guesses would be:
> 	- increase the granularity size of the "memory chunks",
> reducing
> 	  the number of devices you create.

Yes, I didn't get that from your initial comments but you've said
it a couple of times recently and I do get it now.

I'll try and find someone appropriate to consult about that and
see where it goes.

> 	- delay creating the devices until way after booting, or do it
> 	  on a totally different path/thread/workqueue/whatever to
> 	  prevent delay at booting

When you first said this it sounded like an ugly workaround to me.
But perhaps it isn't (I'm not really convinced it is TBH), so it's
probably worth trying to follow up on too.

> 
> And then there's always:
> 	- don't create them at all, only only do so if userspace asks
> 	  you to.

At first glance the impression I get from this is that it's an even
uglier workaround than delaying it, but it might actually be the most
sensible way to handle this, as it's been called, silliness.

We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
the dentry op ->d_automount() to be called on access so, from a path
walk perspective, the dentries could just appear when needed.

The question I'd need to answer is whether the kernfs nodes exist so
->d_automount() can discover if the node lookup is valid, and I think
the answer might be yes (but we would need to suppress udev
notifications for S_AUTOMOUNT nodes).
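
As a rough sketch of the idea (illustrative only, nothing I've
implemented, and all the names here are made up), the VFS side would
look something like:

    #include <linux/fs.h>
    #include <linux/dcache.h>
    #include <linux/mount.h>

    /* Mark the node's inode at creation time; the VFS then sets
     * DCACHE_NEED_AUTOMOUNT on dentries instantiated for it. */
    static void memdev_mark_on_demand(struct inode *inode)
    {
            inode->i_flags |= S_AUTOMOUNT;
    }

    /* Called by the VFS on first access to a flagged dentry. */
    static struct vfsmount *memdev_d_automount(struct path *path)
    {
            /* populate/validate the kernfs node on demand here ... */

            /* returning NULL means there's nothing to mount and the
             * path walk just continues over the existing dentry */
            return NULL;
    }

    static const struct dentry_operations memdev_dops = {
            .d_automount    = memdev_d_automount,
    };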

The catch will be that this is "not" mounting per-se, so anything
I do would probably be seen as an ugly hack that subverts the VFS
automount support.

If I could find a way to reconcile that I could probably do this.

Al, what say you on this?

> 
> You all have the userspace tools/users for this interface and know it
> best to know what will work for them.  If you don't, then hey, let's
> just delete the whole thing and see who screams :)

Please, no joking, I'm finding it hard enough to cope with this
disappointment as it is, ;)

Ian
Ian Kent June 23, 2020, 8:29 a.m. UTC | #16
On Tue, 2020-06-23 at 16:01 +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > > 
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The
> > > > > > > > size of
> > > > > > > > the
> > > > > > > > segments (and thus the number of them) is dictated by
> > > > > > > > the
> > > > > > > > underlying
> > > > > > > > hardware.
> > > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that,
> > > > > > > right?
> > > > > > 
> > > > > > I'm still struggling a bit to grasp what your getting at
> > > > > > but
> > > > > > ...
> > > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-
> > > > > object
> > > > > directory isn't the right interface for this sort of usage at
> > > > > all.
> > > > > Are these
> > > > > even real hardware pieces which can be plugged in and out?
> > > > > While
> > > > > being a
> > > > > discrete piece of hardware isn't a requirement to be a device
> > > > > model
> > > > > device,
> > > > > the whole thing is designed with such use cases on mind. It
> > > > > definitely isn't
> > > > > the right design for representing six digit number of logical
> > > > > entities.
> > > > > 
> > > > > It should be obvious that representing each consecutive
> > > > > memory
> > > > > range with a
> > > > > separate directory entry is far from an optimal way of
> > > > > representing
> > > > > something like this. It's outright silly.
> > > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem
> > > > down
> > > > the
> > > > road" if we accept these patches.  Please fix this up properly
> > > > so
> > > > that
> > > > this interface is correctly fixed to not do looney things like
> > > > this.
> > > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to do accept a change to mitigate it because that
> > > could
> > > mean no further discussion on it and no further work toward
> > > solving
> > > it.
> > > 
> > > But it seems to me a "proper" solution to this will cross a
> > > number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's
> > > likely to become increasingly problematic over time.
> > > 
> > > So what are your ideas and recommendations on how to handle
> > > hotplug
> > > memory at this granularity for this much RAM (and larger
> > > amounts)?
> > 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Sorry, but I don't think it's funny at all.
> 
> It's not "my platform" either, I'm just the poor old sole that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
> 
> I don't see how asking for your advice is out of order at all.
> 
> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks",
> > reducing
> > 	  the number of devices you create.
> 
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
> 
> I'll try and find someone appropriate to consult about that and
> see where it goes.
> 
> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> When you first said this it sounded like a ugly workaround to me.
> But perhaps it isn't (I'm not really convinced it is TBH), so it's
> probably worth trying to follow up on too.
> 
> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> At first glance the impression I get from this is that it's an even
> uglier work around than delaying it but it might actually the most
> sensible way to handle this, as it's been called, silliness.
> 
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
> the dentry op ->d_automount() to be called on access so, from a path
> walk perspective, the dentries could just appear when needed.
> 
> The question I'd need to answer is do the kernfs nodes exist so
> ->d_automount() can discover if the node lookup is valid, and I think
> the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).

Or maybe, taking the notion of on-demand dentry creation further,
"nothing" more than not triggering udev notifications when nodes
are created, and letting the VFS create dentries on access alone,
is all that would be needed.

I'm really not sure about how this would work yet ...

Ian
Rick Lindsley June 23, 2020, 9:33 a.m. UTC | #17
On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:

> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Weeeelll, not your platform perhaps but MAINTAINERS does list you first and Tejun second as maintainers for kernfs.  So in that sense, any patches would need to go thru you.  So, your opinions do matter.

  
> Anyway, as I have said before, my first guesses would be:
> 	- increase the granularity size of the "memory chunks", reducing
> 	  the number of devices you create.

This would mean finding every utility that relies on this behavior.  That may be possible, although not easy, for distro or platform software, but it's hard to guess what user-related utilities may have been created by other consumers of those distros or that platform.  In any case, removing an interface without warning is a hanging offense in many Linux circles.

> 	- delay creating the devices until way after booting, or do it
> 	  on a totally different path/thread/workqueue/whatever to
> 	  prevent delay at booting

This has been considered, but it again requires a full list of utilities relying on this interface and determining which of them may want to run before the devices are "loaded" at boot time.  It may be few, or even zero, but it would be a much more disruptive change in the boot process than what we are suggesting.

> And then there's always:
> 	- don't create them at all, only only do so if userspace asks
> 	  you to.

If they are done in parallel on demand, you'll see the same problem (load average of 1000+, contention in the same spot.)  You obviously won't hold up the boot, of course, but your utility and anything else running on the machine will take an unexpected pause ... for somewhere between 30 and 90 minutes.  Seems equally unfriendly.

A variant of this, which does have a positive effect, is to observe that coldplug during initramfs does seem to load up the memory device tree without incident.  We do a second coldplug after we switch roots and this is the one that runs into timer issues.  I have asked "those that should know" why there is a second coldplug.  I can guess but would prefer to know to avoid that screaming option.  If that second coldplug is unnecessary for the kernfs memory interfaces to work correctly, then that is an alternate, and perhaps even better solution.  (It wouldn't change the fact that kernfs was not built for speed and this problem remains below the surface to trip up another.)

However, nobody I've found can say that is safe, and I'm not fond of the 'see who screams' test solution.

> You all have the userspace tools/users for this interface and know it
> best to know what will work for them.  If you don't, then hey, let's
> just delete the whole thing and see who screams :)

I guess I'm puzzled by why everyone seems offended by suggesting we change a mutex to a rw semaphore.  In a vacuum, sure, but we have before and after numbers.  Wouldn't the same cavalier logic apply?  Why not change it and see who screams?

I haven't heard any criticism of the patch itself - I'm hearing criticism of the problem.  This problem is not specific to memory devices.  As we get larger systems, we'll see it elsewhere. We do already see a mild form of this when fibre finds 1000-2000 fibre disks and goes to add them in parallel.  Small memory chunks introduce the problem at a level two orders of magnitude bigger, but eventually other devices will be subject to it too.  Why not address this now?

'Doctor, it hurts when I do this'
'Then don't do that'

Funny as a joke.  Less funny as a review comment.

Rick
Greg Kroah-Hartman June 23, 2020, 11:45 a.m. UTC | #18
On Tue, Jun 23, 2020 at 02:33:48AM -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
> 
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
> 
> Weeeelll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs.  So in that sense,
> any patches would need to go thru you.  So, your opinions do matter.

Sure, but "help, I'm abusing your code interface, so fix your code
interface and not my caller code" really isn't the best mantra :)

> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks", reducing
> > 	  the number of devices you create.
> 
> This would mean finding every utility that relies on this behavior.
> That may be possible, although not easy, for distro or platform
> software, but it's hard to guess what user-related utilities may have
> been created by other consumers of those distros or that platform.  In
> any case, removing an interface without warning is a hanging offense
> in many Linux circles.

I agree, so find out who uses it!  You can search all debian tools
easily.  You can ask any closed-source setup tools that are on your
platforms if they use it.  You can "break it and see if anyone notices",
which is what we do all the time.

The "hanging offence" is "I'm breaking this even if you are using it!".

> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> This has been considered, but it again requires a full list of utilities relying on this interface and determining which of them may want to run before the devices are "loaded" at boot time.  It may be few, or even zero, but it would be a much more disruptive change in the boot process than what we are suggesting.

Is that really the case?  I strongly suggest you all do some research
here.

Oh, and wrap your email lines please...

> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> If they are done in parallel on demand, you'll see the same problem (load average of 1000+, contention in the same spot.)  You obviously won't hold up the boot, of course, but your utility and anything else running on the machine will take an unexpected pause ... for somewhere between 30 and 90 minutes.  Seems equally unfriendly.

I agree, but it shouldn't be shutting down the whole box, other stuff
should run just fine, right?  Is this platform really that "weak" that
it can't handle this happening in a single thread/cpu?

> A variant of this, which does have a positive effect, is to observe that coldplug during initramfs does seem to load up the memory device tree without incident.  We do a second coldplug after we switch roots and this is the one that runs into timer issues.  I have asked "those that should know" why there is a second coldplug.  I can guess but would prefer to know to avoid that screaming option.  If that second coldplug is unnecessary for the kernfs memory interfaces to work correctly, then that is an alternate, and perhaps even better solution.  (It wouldn't change the fact that kernfs was not built for speed and this problem remains below the surface to trip up another.)
> 
> However, nobody I've found can say that is safe, and I'm not fond of the 'see who screams' test solution.
> 
> > You all have the userspace tools/users for this interface and know it
> > best to know what will work for them.  If you don't, then hey, let's
> > just delete the whole thing and see who screams :)
> 
> I guess I'm puzzled by why everyone seems offended by suggesting we change a mutex to a rw semaphore.  In a vacuum, sure, but we have before and after numbers.  Wouldn't the same cavalier logic apply?  Why not change it and see who screams?

I am offended as a number of years ago this same user of kernfs/sysfs
did a lot of work to reduce the number of contentions in kernfs for this
same reason.  After that work was done, "all was good".  Now this comes
along again, blaming kernfs/sysfs, not the caller.

Memory is only going to get bigger over time, you might want to fix it
this way and then run away.  But we have to maintain this for the next
20+ years, and you are not solving the root-problem here.  It will come
back again, right?

> I haven't heard any criticism of the patch itself - I'm hearing criticism of the problem.  This problem is not specific to memory devices.  As we get larger systems,  we'll see it elsewhere. We do already see a mild form of this when fibre finds 1000-2000 fibre disks and goes to add them in parallel.  Small memory chunks introduces the problem at a level two orders of magnitude bigger, but eventually other devices will be subject to it too.  Why not address this now?

1-2k devices are easy to handle, we handle 30k scsi devices today with
no problem at all, and have for 15+ years.  We are "lucky" there that
the hardware is slower than kernfs/sysfs so that we are not the
bottleneck at all.

> 'Doctor, it hurts when I do this'
> 'Then don't do that'
> 
> Funny as a joke.  Less funny as a review comment.

Treat the system as a whole please, don't go for a short-term fix that
we all know is not solving the real problem here.

thanks,

greg k-h
Greg Kroah-Hartman June 23, 2020, 11:49 a.m. UTC | #19
On Tue, Jun 23, 2020 at 04:01:52PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > > 
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The
> > > > > > > > size of
> > > > > > > > the
> > > > > > > > segments (and thus the number of them) is dictated by the
> > > > > > > > underlying
> > > > > > > > hardware.
> > > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that,
> > > > > > > right?
> > > > > > 
> > > > > > I'm still struggling a bit to grasp what your getting at but
> > > > > > ...
> > > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-
> > > > > object
> > > > > directory isn't the right interface for this sort of usage at
> > > > > all.
> > > > > Are these
> > > > > even real hardware pieces which can be plugged in and out?
> > > > > While
> > > > > being a
> > > > > discrete piece of hardware isn't a requirement to be a device
> > > > > model
> > > > > device,
> > > > > the whole thing is designed with such use cases on mind. It
> > > > > definitely isn't
> > > > > the right design for representing six digit number of logical
> > > > > entities.
> > > > > 
> > > > > It should be obvious that representing each consecutive memory
> > > > > range with a
> > > > > separate directory entry is far from an optimal way of
> > > > > representing
> > > > > something like this. It's outright silly.
> > > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem down
> > > > the
> > > > road" if we accept these patches.  Please fix this up properly so
> > > > that
> > > > this interface is correctly fixed to not do looney things like
> > > > this.
> > > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to do accept a change to mitigate it because that could
> > > mean no further discussion on it and no further work toward solving
> > > it.
> > > 
> > > But it seems to me a "proper" solution to this will cross a number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's
> > > likely to become increasingly problematic over time.
> > > 
> > > So what are your ideas and recommendations on how to handle hotplug
> > > memory at this granularity for this much RAM (and larger amounts)?
> > 
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
> 
> Sorry, but I don't think it's funny at all.
> 
> It's not "my platform" either, I'm just the poor old sole that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
> 
> I don't see how asking for your advice is out of order at all.
> 
> > 
> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks",
> > reducing
> > 	  the number of devices you create.
> 
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
> 
> I'll try and find someone appropriate to consult about that and
> see where it goes.
> 
> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> When you first said this it sounded like a ugly workaround to me.
> But perhaps it isn't (I'm not really convinced it is TBH), so it's
> probably worth trying to follow up on too.

It's not a workaround, it lets the rest of the system come up and do
useful things while you are still discovering parts of the system that
are not up and running.  We do this all the time for lots of
drivers/devices/subsystems, why is memory any different here?

> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> At first glance the impression I get from this is that it's an even
> uglier work around than delaying it but it might actually the most
> sensible way to handle this, as it's been called, silliness.
> 
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
> the dentry op ->d_automount() to be called on access so, from a path
> walk perspective, the dentries could just appear when needed.
> 
> The question I'd need to answer is do the kernfs nodes exist so
> ->d_automount() can discover if the node lookup is valid, and I think
> the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).
> 
> The catch will be that this is "not" mounting per-se, so anything
> I do would probably be seen as an ugly hack that subverts the VFS
> automount support.
> 
> If I could find a way to reconcile that I could probably do this.

I am not meaning to do this at the fs layer, but at the device layer.

Why not wait until someone goes "hey, I wonder what my memory layout is,
let's go ask the kernel to probe all of that."  Or some other such
"delayed initialization" method.  Don't mess with the fs for this,
that's probably the wrong layer for all of this.

thanks,

greg k-h
Ian Kent June 23, 2020, 11:51 a.m. UTC | #20
On Tue, 2020-06-23 at 02:33 -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
> 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Weeeelll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs.  So in that sense,
> any patches would need to go thru you.  So, your opinions do matter.
> 
>   
> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks",
> > reducing
> > 	  the number of devices you create.
> 
> This would mean finding every utility that relies on this
> behavior.  That may be possible, although not easy, for distro or
> platform software, but it's hard to guess what user-related utilities
> may have been created by other consumers of those distros or that
> platform.  In any case, removing an interface without warning is a
> hanging offense in many Linux circles.
> 
> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> This has been considered, but it again requires a full list of
> utilities relying on this interface and determining which of them may
> want to run before the devices are "loaded" at boot time.  It may be
> few, or even zero, but it would be a much more disruptive change in
> the boot process than what we are suggesting.
> 
> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> If they are done in parallel on demand, you'll see the same problem
> (load average of 1000+, contention in the same spot.)  You obviously
> won't hold up the boot, of course, but your utility and anything else
> running on the machine will take an unexpected pause ... for
> somewhere between 30 and 90 minutes.  Seems equally unfriendly.
> 
> A variant of this, which does have a positive effect, is to observe
> that coldplug during initramfs does seem to load up the memory device
> tree without incident.  We do a second coldplug after we switch roots
> and this is the one that runs into timer issues.  I have asked "those
> that should know" why there is a second coldplug.  I can guess but
> would prefer to know to avoid that screaming option.  If that second
> coldplug is unnecessary for the kernfs memory interfaces to work
> correctly, then that is an alternate, and perhaps even better
> solution.  (It wouldn't change the fact that kernfs was not built for
> speed and this problem remains below the surface to trip up another.)

We might still need the patches here for that on-demand mechanism
to be feasible.

For example, for an ls of the node directory it should be doable to
enumerate the nodes in readdir without creating dentries, but there's
the inevitable stat() of each path that follows, which would probably
lead to similar contention.
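
By way of illustration (a user-space sketch, nothing to do with the
series itself), this is roughly what an "ls -l" of the directory boils
down to; it's the per-entry stat(), not the readdir() pass, that
triggers a ->lookup() (and dentry creation) for every node not already
in the dcache:

    #include <dirent.h>
    #include <limits.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            const char *dir = "/sys/devices/system/memory";
            struct dirent *de;
            struct stat st;
            char path[PATH_MAX];
            DIR *d = opendir(dir);

            if (!d)
                    return 1;
            while ((de = readdir(d)) != NULL) {
                    /* one getdents() pass: no per-entry path lookup */
                    snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
                    /* each stat() is a path walk -> ->lookup() -> dentry */
                    stat(path, &st);
            }
            closedir(d);
            return 0;
    }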

And changing the division of the entries into sub-directories would
inevitably break anything that does actually need to access them.

Ian
Rick Lindsley June 23, 2020, 10:55 p.m. UTC | #21
On 6/23/20 4:45 AM, Greg Kroah-Hartman wrote:

> Sure, but "help, I'm abusing your code interface, so fix your code
> interface and not my caller code" really isn't the best mantra :)

Well, those are your words, not mine.  What we're saying is, "we've 
identified an interface that doesn't scale in this situation, but we 
have a way to make it scale to all known configurations."

 > I am offended as a number of years ago this same user of kernfs/sysfs
 > did a lot of work to reduce the number of contentions in kernfs for
 > this same reason.  After that work was done, "all was good".  Now
 > this comes along again, blaming kernfs/sysfs, not the caller.

Ok. I don't know about the history, but I can tell you "blame" is not 
the word I'd use.  As hardware changes, Linux also changes, and over "a 
number of years" it's not surprising to me if basic assumptions changed 
again and led us back to a place we've been before.  That's not an 
indictment.  It just IS.

 > Memory is only going to get bigger over time, you might want to fix it
 > this way and then run away.  But we have to maintain this for the next
 > 20+ years, and you are not solving the root-problem here.  It will
 > come back again, right?

If hardware vendors insist on dealing with small blocks of memory in 
large aggregates, then yes it could.  You'll have to trust that I am 
also in discussion with hardware architects about how that is a very bad 
architecture and it's time to move past decades of assumptions and think bigger.  Separate 
audience, equally contentious discussion.  But the bottom line is, it's 
out there already and can't be walked back.

Your response here seems to center on "kernfs was never designed for 
that."  If so, we're in agreement.   We're suggesting a way it can be 
extended to be more robust, with no (apparent) side effects.  I'd like 
to discuss the merits of the patch itself.

Rick
Tejun Heo June 23, 2020, 11:13 p.m. UTC | #22
Hello, Rick.

On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > I don't know. The above highlights the absurdity of the approach itself to
> > me. You seem to be aware of it too in writing: 250,000 "devices".
> 
> Just because it is absurd doesn't mean it wasn't built that way :)
> 
> I agree, and I'm trying to influence the next hardware design. However,

I'm not saying that the hardware should not segment things into however many
pieces that it wants / needs to. That part is fine.

> what's already out there is memory units that must be accessed in 256MB
> blocks. If you want to remove/add a GB, that's really 4 blocks of memory
> you're manipulating, to the hardware. Those blocks have to be registered
> and recognized by the kernel for that to work.

The problem is fitting that into an interface which wholly doesn't fit that
particular requirement. It's not that difficult to imagine different ways to
represent however many memory slots, right? It'd take work to make sure that
integrates well with whatever tooling or use cases but once done this
particular problem will be resolved permanently and the whole thing will
look a lot less silly. Wouldn't that be better?

Thanks.
Rick Lindsley June 24, 2020, 9:04 a.m. UTC | #23
Thanks, Tejun, appreciate the feedback.

On 6/23/20 4:13 PM, Tejun Heo wrote:

> The problem is fitting that into an interface which wholly doesn't fit that
> particular requirement. It's not that difficult to imagine different ways to
> represent however many memory slots, right? 

Perhaps we have different views of how this is showing up.  systemd is 
the primary instigator of the boot issues.

Systemd runs

     /usr/lib/systemd/system/systemd-udev-trigger.service

which does a udev trigger, specifically

     /usr/bin/udevadm trigger --type=devices --action=add

as part of its post-initramfs coldplug.  It then waits for that to 
finish, under the watch of a timer.
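
Conceptually (a simplified sketch I'm using to illustrate the mechanism,
not udevadm's actual implementation), that trigger amounts to walking
sysfs and writing the action into every device's uevent file, so a 64TB
configuration with 100,000+ memory blocks turns into 100,000+
synthesized "add" events for udev and systemd to process:

/*
 * Simplified approximation (illustrative only, not udevadm's code) of
 * "udevadm trigger --type=devices --action=add": walk /sys/devices and
 * write "add" into every uevent attribute found, which makes the kernel
 * re-emit an "add" uevent for that device.
 */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int trigger(const char *path, const struct stat *sb,
                   int typeflag, struct FTW *ftwbuf)
{
    (void)sb;
    if (typeflag == FTW_F && strcmp(path + ftwbuf->base, "uevent") == 0) {
        int fd = open(path, O_WRONLY);

        if (fd >= 0) {
            /* each write synthesizes one "add" uevent for udev to handle */
            if (write(fd, "add\n", 4) < 0)
                perror(path);
            close(fd);
        }
    }
    return 0;   /* keep walking the tree */
}

int main(void)
{
    /* FTW_PHYS: don't follow the many symlinks under /sys */
    /* one uevent file per device: 100,000+ on a large memory config */
    return nftw("/sys/devices", trigger, 64, FTW_PHYS);
}

Each of those re-emitted uevents then fans out into udev workers and
systemd handling, which is where the storm of path walks over /sys
comes from.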

So, the kernel itself is reporting these devices to systemd. It gets 
that information from talking to the hardware.  That means, then, that 
the obfuscation must either start in the kernel itself (it lies to 
systemd), or start in systemd when it handles the devices it got from 
the kernel.  If the kernel lies, then the actual granularity is not 
available to any user utilities.

Unless you're suggesting a new interface be created that would allow 
utilities to determine the "real" memory addresses available for 
manipulation.  But then the changes you describe would not stay 
contained; an unknown number of auxiliary utilities would need updating too.

Having one subsystem lie to another seems like the start of a bad idea, 
anyway.  When the hardware management console, separate from Linux, 
reports a memory error, or adds or deletes memory in a guest system, 
it's not going to be manipulating spoofed addresses that are only a 
Linux construct.

In contrast, the provided patch fixes the observed problem with no 
ripple effect to other subsystems or utilities.

Greg had suggested
     Treat the system as a whole please, don't go for a short-term
     fix that we all know is not solving the real problem here.

Your solution affects multiple subsystems; this one affects one.  Which 
is the whole system approach in terms of risk?  You mentioned you 
support 30k scsi disks but only because they are slow so the 
inefficiencies of kernfs don't show.  That doesn't bother you?

Rick
Greg Kroah-Hartman June 24, 2020, 9:27 a.m. UTC | #24
On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:
> In contrast, the provided patch fixes the observed problem with no ripple
> effect to other subsystems or utilities.

Your patch, as-is, is fine, but to somehow think that this is going to
solve your real problem here is the issue we keep raising.

The real problem is you have way too many devices that somehow need to
all get probed at boot time before you can do anything else.

> Greg had suggested
>     Treat the system as a whole please, don't go for a short-term
>     fix that we all know is not solving the real problem here.
> 
> Your solution affects multiple subsystems; this one affects one.  Which is
> the whole system approach in terms of risk?  You mentioned you support 30k
> scsi disks but only because they are slow so the inefficiencies of kernfs
> don't show.  That doesn't bother you?

Systems with 30k devices do not have any problems that I know of
because they do not do foolish things like stall booting before they are
all discovered :)

What are the odds that, if we take this patch, you all will have to come back
in a few years with some other change to the API due to even larger memory
sizes happening?  What happens if you boot on a system with this change
and with 10x the memory you currently have?  Try simulating that by
creating 10x the number of devices and see what happens.  Does the
bottleneck still remain in kernfs or is it somewhere else?

thanks,

greg k-h
Tejun Heo June 24, 2020, 1:19 p.m. UTC | #25
Hello, Rick.

On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:
> In contrast, the provided patch fixes the observed problem with no ripple
> effect to other subsystems or utilities.
> 
> Greg had suggested
>     Treat the system as a whole please, don't go for a short-term
>     fix that we all know is not solving the real problem here.
> 
> Your solution affects multiple subsystems; this one affects one.  Which is
> the whole system approach in terms of risk?  You mentioned you support 30k
> scsi disks but only because they are slow so the inefficiencies of kernfs
> don't show.  That doesn't bother you?

I suggest putting honest thoughts into finding a long term solution instead
of these rhetorical retorts. If you really can't see how ill-suited the
current use of the interface and the proposed solution are, I'm not sure how
better to communicate that to you.

Thanks.
Ian Kent June 25, 2020, 8:15 a.m. UTC | #26
On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> Hello, Rick.
> 
> On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > I don't know. The above highlights the absurdity of the approach
> > > itself to
> > > me. You seem to be aware of it too in writing: 250,000 "devices".
> > 
> > Just because it is absurd doesn't mean it wasn't built that way :)
> > 
> > I agree, and I'm trying to influence the next hardware design.
> > However,
> 
> I'm not saying that the hardware should not segment things into
> however many
> pieces that it wants / needs to. That part is fine.
> 
> > what's already out there is memory units that must be accessed in
> > 256MB
> > blocks. If you want to remove/add a GB, that's really 4 blocks of
> > memory
> > you're manipulating, to the hardware. Those blocks have to be
> > registered
> > and recognized by the kernel for that to work.
> 
> The problem is fitting that into an interface which wholly doesn't
> fit that
> particular requirement. It's not that difficult to imagine different
> ways to
> represent however many memory slots, right? It'd take work to make
> sure that
> integrates well with whatever tooling or use cases but once done this
> particular problem will be resolved permanently and the whole thing
> will
> look a lot less silly. Wouldn't that be better?

Well, no, I am finding it difficult to imagine different ways to
represent this, but perhaps that's because I'm blinkered about what
a solution might look like because of my file system focus.

Can "anyone" throw out some ideas with a little more detail than we
have had so far so we can maybe start to formulate an actual plan of
what needs to be done.

Ian
Greg Kroah-Hartman June 25, 2020, 9:43 a.m. UTC | #27
On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > Hello, Rick.
> > 
> > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > I don't know. The above highlights the absurdity of the approach
> > > > itself to
> > > > me. You seem to be aware of it too in writing: 250,000 "devices".
> > > 
> > > Just because it is absurd doesn't mean it wasn't built that way :)
> > > 
> > > I agree, and I'm trying to influence the next hardware design.
> > > However,
> > 
> > I'm not saying that the hardware should not segment things into
> > however many
> > pieces that it wants / needs to. That part is fine.
> > 
> > > what's already out there is memory units that must be accessed in
> > > 256MB
> > > blocks. If you want to remove/add a GB, that's really 4 blocks of
> > > memory
> > > you're manipulating, to the hardware. Those blocks have to be
> > > registered
> > > and recognized by the kernel for that to work.
> > 
> > The problem is fitting that into an interface which wholly doesn't
> > fit that
> > particular requirement. It's not that difficult to imagine different
> > ways to
> > represent however many memory slots, right? It'd take work to make
> > sure that
> > integrates well with whatever tooling or use cases but once done this
> > particular problem will be resolved permanently and the whole thing
> > will
> > look a lot less silly. Wouldn't that be better?
> 
> Well, no, I am finding it difficult to imagine different ways to
> represent this, but perhaps that's because I'm blinkered about what
> a solution might look like because of my file system focus.
> 
> Can "anyone" throw out some ideas with a little more detail than we
> have had so far, so we can maybe start to formulate an actual plan of
> what needs to be done?

I think both Tejun and I have provided a number of alternatives for you
all to look into, and yet you all keep saying that those are impossible
for some unknown reason.

It's not up to me to tell you what to do to fix your broken interfaces
as only you all know who is using this and how to handle those changes.

It is up to me to say "don't do that!" and to refuse patches that don't
solve the root problem here.  I'll review these later on (I have 1500+
patches to review at the moment) as these are a nice
micro-optimization...

And as this conversation seems to just be going in circles, I think this is
going to be my last response to it...

greg k-h
Ian Kent June 26, 2020, 12:19 a.m. UTC | #28
On Thu, 2020-06-25 at 11:43 +0200, Greg Kroah-Hartman wrote:
> On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> > On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > > Hello, Rick.
> > > 
> > > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > > I don't know. The above highlights the absurdity of the
> > > > > approach
> > > > > itself to
> > > > > me. You seem to be aware of it too in writing: 250,000
> > > > > "devices".
> > > > 
> > > > Just because it is absurd doesn't mean it wasn't built that way
> > > > :)
> > > > 
> > > > I agree, and I'm trying to influence the next hardware design.
> > > > However,
> > > 
> > > I'm not saying that the hardware should not segment things into
> > > however many
> > > pieces that it wants / needs to. That part is fine.
> > > 
> > > > what's already out there is memory units that must be accessed
> > > > in
> > > > 256MB
> > > > blocks. If you want to remove/add a GB, that's really 4 blocks
> > > > of
> > > > memory
> > > > you're manipulating, to the hardware. Those blocks have to be
> > > > registered
> > > > and recognized by the kernel for that to work.
> > > 
> > > The problem is fitting that into an interface which wholly
> > > doesn't
> > > fit that
> > > particular requirement. It's not that difficult to imagine
> > > different
> > > ways to
> > > represent however many memory slots, right? It'd take work to
> > > make
> > > sure that
> > > integrates well with whatever tooling or use cases but once done
> > > this
> > > particular problem will be resolved permanently and the whole
> > > thing
> > > will
> > > look a lot less silly. Wouldn't that be better?
> > 
> > Well, no, I am finding it difficult to imagine different ways to
> > represent this, but perhaps that's because I'm blinkered about what
> > a solution might look like because of my file system focus.
> > 
> > Can "anyone" throw out some ideas with a little more detail than we
> > have had so far, so we can maybe start to formulate an actual plan
> > of what needs to be done?
> 
> I think both Tejun and I have provided a number of alternatives for
> you
> all to look into, and yet you all keep saying that those are
> impossible
> for some unknown reason.

Yes, those comments are a starting point to be sure.
And continuing on that path isn't helping anyone.

That's why I'm asking for your input on what a solution you
would see as adequate might look like to you (and Tejun).

> 
> It's not up to me to tell you what to do to fix your broken
> interfaces
> as only you all know who is using this and how to handle those
> changes.

But it would be useful to go into a little more detail, based on
your own experience, about what you think a suitable solution might
be.

That surely needs to be taken into account and used to guide the
direction of our investigation into what to do next.

> 
> It is up to me to say "don't do that!" and to refuse patches that
> don't
> solve the root problem here.  I'll review these later on (I have
> 1500+
> patches to review at the moment) as these are a nice
> micro-optimization...

Sure, and I get the "I don't want another post-and-run set of
patches that I have to maintain forever and that don't fully solve
the problem" view, so any ideas, and perhaps a little more detail
on where we might go with this, would be very much appreciated.

> 
> And as this conversation seems to just be going in circles, I think this
> is
> going to be my last response to it...

Which is why I'm asking this: I really would like to see this
discussion change course and become useful.

Ian