mbox series

[v2,0/6] kernfs: proposed locking and concurrency improvement

Message ID 159237905950.89469.6559073274338175600.stgit@mickey.themaw.net (mailing list archive)
Headers show
Series kernfs: proposed locking and concurrency improvement | expand

Message

Ian Kent June 17, 2020, 7:37 a.m. UTC
For very large IBM Power mainframe systems with hundreds of CPUs and TBs
of RAM booting can take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require kernel
parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600
to prevent dropping into emergency mode.

Gathering information about what's happening during the boot is a bit
challenging but two main issues appeared to be: a large number of path
lookups for non-existent files, and very high lock contention in the VFS
during path walks particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration as the hardware
divides the memory into 256MB logical blocks. This is believed to be due
to either IBM Power hardware design or a requirement of the mainframe
software used to create logical partitions (LPARs, that are used to
install an operating system to provide services), since these can be made
up of a wide range of resources, CPU, Memory, disks, etc.

It's unclear yet whether the creation of syfs nodes for these memory
devices can be postponed or spread out over a larger amount of time.
That's because the high overhead looks to be due to notifications received
by udev which invokes a systemd program for them and attempts by systemd
folks to improve this have not focused on changing the handling of these
notifications, possibly because of difficulties with doing so. This
remains an avenue of investigation.

Kernel traces show there are many path walks with a fairly large portion
of those for non-existent paths. However, looking at the systemd code
invoked by the udev action it appears there's only one additional lookup
for each invocation so the large number of negative lookups is most likely
due to the large number of notifications rather than a fault with the
systemd program.

The series here tries to reduce the locking needed during path walks
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (non-existent
paths) to avoid continual alloc/free cycle of dentries and a read/write
semaphore introduced to increase kernfs concurrency during path walks.

With these changes we still need kernel parameters of udev.children-max=2048
and systemd.default_timeout_start_sec=300 for the fastest boot times of
under 5 minutes.

There may be opportunities for further improvements but the series here
has seen a fair amount of testing and thinking about what else these could
be. Discussing it with Rick Lindsay, I suspect improvements will get more
difficult to implement for somewhat less improvement so I think what we
have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (6):
      kernfs: switch kernfs to use an rwsem
      kernfs: move revalidate to be near lookup
      kernfs: improve kernfs path resolution
      kernfs: use revision to identify directory node changes
      kernfs: refactor attr locking
      kernfs: make attr_mutex a local kernfs node lock


 fs/kernfs/dir.c             |  284 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   58 +++++----
 fs/kernfs/kernfs-internal.h |   29 ++++
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian

Comments

Tejun Heo June 19, 2020, 3:38 p.m. UTC | #1
Hello, Ian.

On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> The series here tries to reduce the locking needed during path walks
> based on the assumption that there are many path walks with a fairly
> large portion of those for non-existent paths, as described above.
> 
> That was done by adding kernfs negative dentry caching (non-existent
> paths) to avoid continual alloc/free cycle of dentries and a read/write
> semaphore introduced to increase kernfs concurrency during path walks.
> 
> With these changes we still need kernel parameters of udev.children-max=2048
> and systemd.default_timeout_start_sec=300 for the fastest boot times of
> under 5 minutes.

I don't have strong objections to the series but the rationales don't seem
particularly strong. It's solving a suspected problem but only half way. It
isn't clear whether this can be the long term solution for the problem
machine and whether it will benefit anyone else in a meaningful way either.

I think Greg already asked this but how are the 100,000+ memory objects
used? Is that justified in the first place?

Thanks.
Rick Lindsley June 19, 2020, 8:41 p.m. UTC | #2
On 6/19/20 8:38 AM, Tejun Heo wrote:

> I don't have strong objections to the series but the rationales don't seem
> particularly strong. It's solving a suspected problem but only half way. It
> isn't clear whether this can be the long term solution for the problem
> machine and whether it will benefit anyone else in a meaningful way either.

I don't understand your statement about solving the problem halfway.  Could you elaborate?

> I think Greg already asked this but how are the 100,000+ memory objects
> used? Is that justified in the first place?

They are used for hotplugging and partitioning memory.  The size of the segments (and thus the number of them) is dictated by the underlying hardware.

Rick
Tejun Heo June 19, 2020, 10:23 p.m. UTC | #3
On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> On 6/19/20 8:38 AM, Tejun Heo wrote:
> 
> > I don't have strong objections to the series but the rationales don't seem
> > particularly strong. It's solving a suspected problem but only half way. It
> > isn't clear whether this can be the long term solution for the problem
> > machine and whether it will benefit anyone else in a meaningful way either.
> 
> I don't understand your statement about solving the problem halfway. Could
> you elaborate?

Spending 5 minutes during boot creating sysfs objects doesn't seem like a
particularly good solution and I don't know whether anyone else would
experience similar issues. Again, not necessarily against improving the
scalability of kernfs code but the use case seems a bit out there.

> > I think Greg already asked this but how are the 100,000+ memory objects
> > used? Is that justified in the first place?
> 
> They are used for hotplugging and partitioning memory. The size of the
> segments (and thus the number of them) is dictated by the underlying
> hardware.

This sounds so bad. There gotta be a better interface for that, right?

Thanks.
Rick Lindsley June 20, 2020, 2:44 a.m. UTC | #4
On 6/19/20 3:23 PM, Tejun Heo wrote:

> Spending 5 minutes during boot creating sysfs objects doesn't seem like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving the
> scalability of kernfs code but the use case seems a bit out there.

Creating sysfs objects is not a new "solution".  We already do it, and it can take over an hour on a machine with 64TB of memory.  We're not *adding* this ... we're *improving* something that has been around for a long time.

>> They are used for hotplugging and partitioning memory. The size of the
>> segments (and thus the number of them) is dictated by the underlying
>> hardware.
> 
> This sounds so bad. There gotta be a better interface for that, right?

Again; this is not new.  Easily changing the state of devices by writing new values into kernfs is one of the reasons kernfs exists.

     echo 0 > /sys/devices//system/memory/memory10374/online

and boom - you've taken memory chunk 10374 offline.

These changes are not just a whim.  I used lockstat to measure contention during boot.  The addition of 250,000 "devices" in parallel created tremendous contention on the kernfs_mutex and, it appears, on one of the directories within it where memory nodes are created.  Using a mutex means that the details of that mutex must bounce around all the cpus ... did I mention 1500+ cpus?  A whole lot of thrash ...

These patches turn that mutex into a rw semaphore, and any thread checking for the mere existence of something can get by with a read lock. lockstat showed that about 90% of the mutex contentions were in a read path and only 10% in a write path.  Switching to a rw semaphore meant that 90% of the time there was no waiting and the thread proceeded with its caches intact.  Total time spent waiting on either read or write decreased by 75%, and should scale much better with subsequent hardware upgrades.

With contention on this particular data structure reduced, we saw thrash on related control structures decrease markedly too.  The contention for the lockref taken in dput dropped 66% and, likely due to reduced thrash, the time used waiting for that structure dropped 99%.

Rick
Ian Kent June 21, 2020, 3:21 a.m. UTC | #5
On Fri, 2020-06-19 at 11:38 -0400, Tejun Heo wrote:
> Hello, Ian.
> 
> On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> > The series here tries to reduce the locking needed during path
> > walks
> > based on the assumption that there are many path walks with a
> > fairly
> > large portion of those for non-existent paths, as described above.
> > 
> > That was done by adding kernfs negative dentry caching (non-
> > existent
> > paths) to avoid continual alloc/free cycle of dentries and a
> > read/write
> > semaphore introduced to increase kernfs concurrency during path
> > walks.
> > 
> > With these changes we still need kernel parameters of
> > udev.children-max=2048
> > and systemd.default_timeout_start_sec=300 for the fastest boot
> > times of
> > under 5 minutes.
> 
> I don't have strong objections to the series but the rationales don't
> seem
> particularly strong. It's solving a suspected problem but only half
> way. It
> isn't clear whether this can be the long term solution for the
> problem
> machine and whether it will benefit anyone else in a meaningful way
> either.
> 
> I think Greg already asked this but how are the 100,000+ memory
> objects
> used? Is that justified in the first place?

The problem is real enough, however, whether improvements can be made
in other areas flowing on from the arch specific device creation
notifications is not clear cut.

There's no question that there is very high contention between the
VFS and sysfs and that's all the series is trying to improve, nothing
more.

What both you and Greg have raised are good questions but are
unfortunately very difficult to answer.

I tried to add some discussion about it, to the extent that I could,
in the cover letter.

Basically the division of memory into 256M chunks is something that's
needed to provide flexibility for arbitrary partition creation (a set
of hardware allocated that's used for, essentially, a bare metal OS
install). Whether that's many small partitions for load balanced server
farms (or whatever) or much larger partitions for for demanding
applications, such as Oracle systems, is not something that can be
known in advance.

So the division into small memory chunks can't change.

The question of sysfs node creation, what uses them and when they
are used is much harder.

I'm not able to find that out and, I doubt even IBM would know, if
their customers use applications that need to consult the sysfs
file system for this information or when it's needed if it is need
at all. So I'm stuck on this question.

One thing is for sure though, it would be (at the very least) risky
for a vendor to assume they either aren't needed or aren't needed early
during system start up.

OTOH I've looked at what gets invoked on udev notifications (which
is the source of the heavy path walk activity, I admit I need to
dig deeper) and that doesn't appear to be doing anything obviously
wrong so that far seems ok.

For my part, as long as the series proves to be sound, why not,
it does substantially reduce contention between the VFS and sysfs
is the face of heavy sysfs path walk activity so I think that
stands alone as sufficient to consider the change worthwhile.

Ian
Ian Kent June 21, 2020, 4:55 a.m. UTC | #6
On Fri, 2020-06-19 at 18:23 -0400, Tejun Heo wrote:
> On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> > On 6/19/20 8:38 AM, Tejun Heo wrote:
> > 
> > > I don't have strong objections to the series but the rationales
> > > don't seem
> > > particularly strong. It's solving a suspected problem but only
> > > half way. It
> > > isn't clear whether this can be the long term solution for the
> > > problem
> > > machine and whether it will benefit anyone else in a meaningful
> > > way either.
> > 
> > I don't understand your statement about solving the problem
> > halfway. Could
> > you elaborate?
> 
> Spending 5 minutes during boot creating sysfs objects doesn't seem
> like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving
> the
> scalability of kernfs code but the use case seems a bit out there.
> 
> > > I think Greg already asked this but how are the 100,000+ memory
> > > objects
> > > used? Is that justified in the first place?
> > 
> > They are used for hotplugging and partitioning memory. The size of
> > the
> > segments (and thus the number of them) is dictated by the
> > underlying
> > hardware.
> 
> This sounds so bad. There gotta be a better interface for that,
> right?

I'm still struggling a bit to grasp what your getting at but ...

Maybe your talking about the underlying notifications system where
a notification is sent for every event.

There's nothing new about that problem and it's becoming increasingly
clear that existing kernel notification sub-systems don't scale well.

Mount handling is a current example which is one of the areas David
Howells is trying to improve and that's taken years now to get as
far as it has.

It seems to me that any improvements in the area here would have a
different solution, perhaps something along the lines of multiple
notification merging, increased context carried in notifications,
or the like. Something like the notification merging to reduce
notification volume might eventually be useful for David's
notifications sub-system too (and, I think the design of that
sub-system could probably accommodate that sort of change away
from the problematic anonymous notification sub-systems we have
now).

But it's taken a long time to get that far with that project and
the case here would have a far more significant impact on a fairly
large number of sub-systems, both kernel and user space, so all I
can hope for with this discussion is to raise awareness of the need
so that it's recognised and thought about approaches to improving
it can happen.

So, while the questions you ask are valid and your concerns real,
it's unrealistic to think there's a simple solution that can be
implemented in short order. Problem awareness is all that can be done
now so that fundamental and probably wide spread improvements might
be able to be implemented over time.

But if I misunderstand your thinking on this please elaborate further.

Ian
Tejun Heo June 22, 2020, 5:48 p.m. UTC | #7
Hello, Ian.

On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > They are used for hotplugging and partitioning memory. The size of
> > > the
> > > segments (and thus the number of them) is dictated by the
> > > underlying
> > > hardware.
> > 
> > This sounds so bad. There gotta be a better interface for that,
> > right?
> 
> I'm still struggling a bit to grasp what your getting at but ...

I was more trying to say that the sysfs device interface with per-object
directory isn't the right interface for this sort of usage at all. Are these
even real hardware pieces which can be plugged in and out? While being a
discrete piece of hardware isn't a requirement to be a device model device,
the whole thing is designed with such use cases on mind. It definitely isn't
the right design for representing six digit number of logical entities.

It should be obvious that representing each consecutive memory range with a
separate directory entry is far from an optimal way of representing
something like this. It's outright silly.

> Maybe your talking about the underlying notifications system where
> a notification is sent for every event.
> 
> There's nothing new about that problem and it's becoming increasingly
> clear that existing kernel notification sub-systems don't scale well.
> 
> Mount handling is a current example which is one of the areas David
> Howells is trying to improve and that's taken years now to get as
> far as it has.

There sure are scalability issues everywhere that needs to be improved but
there also are cases where the issue is the approach itself rather than
scalability and I'm wondering whether this is the latter.

Thanks.
Tejun Heo June 22, 2020, 5:53 p.m. UTC | #8
On Fri, Jun 19, 2020 at 07:44:29PM -0700, Rick Lindsley wrote:
>     echo 0 > /sys/devices//system/memory/memory10374/online
> 
> and boom - you've taken memory chunk 10374 offline.
> 
> These changes are not just a whim. I used lockstat to measure contention
> during boot. The addition of 250,000 "devices" in parallel created
> tremendous contention on the kernfs_mutex and, it appears, on one of the
> directories within it where memory nodes are created. Using a mutex means
> that the details of that mutex must bounce around all the cpus ... did I
> mention 1500+ cpus? A whole lot of thrash ...

I don't know. The above highlights the absurdity of the approach itself to
me. You seem to be aware of it too in writing: 250,000 "devices".

Thanks.
Greg Kroah-Hartman June 22, 2020, 6:03 p.m. UTC | #9
On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> Hello, Ian.
> 
> On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > They are used for hotplugging and partitioning memory. The size of
> > > > the
> > > > segments (and thus the number of them) is dictated by the
> > > > underlying
> > > > hardware.
> > > 
> > > This sounds so bad. There gotta be a better interface for that,
> > > right?
> > 
> > I'm still struggling a bit to grasp what your getting at but ...
> 
> I was more trying to say that the sysfs device interface with per-object
> directory isn't the right interface for this sort of usage at all. Are these
> even real hardware pieces which can be plugged in and out? While being a
> discrete piece of hardware isn't a requirement to be a device model device,
> the whole thing is designed with such use cases on mind. It definitely isn't
> the right design for representing six digit number of logical entities.
> 
> It should be obvious that representing each consecutive memory range with a
> separate directory entry is far from an optimal way of representing
> something like this. It's outright silly.

I agree.  And again, Ian, you are just "kicking the problem down the
road" if we accept these patches.  Please fix this up properly so that
this interface is correctly fixed to not do looney things like this.

thanks,

greg k-h
Rick Lindsley June 22, 2020, 9:22 p.m. UTC | #10
On 6/22/20 10:53 AM, Tejun Heo wrote:

> I don't know. The above highlights the absurdity of the approach itself to
> me. You seem to be aware of it too in writing: 250,000 "devices".

Just because it is absurd doesn't mean it wasn't built that way :)

I agree, and I'm trying to influence the next hardware design.  However, what's already out there is memory units that must be accessed in 256MB blocks.  If you want to remove/add a GB, that's really 4 blocks of memory you're manipulating, to the hardware.  Those blocks have to be registered and recognized by the kernel for that to work.

Rick
Rick Lindsley June 22, 2020, 9:27 p.m. UTC | #11
On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:

> It should be obvious that representing each consecutive memory range with a
> separate directory entry is far from an optimal way of representing
> something like this. It's outright silly.

On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:

> I agree.  And again, Ian, you are just "kicking the problem down the
> road" if we accept these patches.  Please fix this up properly so that
> this interface is correctly fixed to not do looney things like this.

Given that we cannot change the underlying machine representation of this hardware, what do you (all, not just you Greg) consider to be "properly"?

Rick
Ian Kent June 23, 2020, 5:09 a.m. UTC | #12
On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > Hello, Ian.
> > 
> > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > They are used for hotplugging and partitioning memory. The
> > > > > size of
> > > > > the
> > > > > segments (and thus the number of them) is dictated by the
> > > > > underlying
> > > > > hardware.
> > > > 
> > > > This sounds so bad. There gotta be a better interface for that,
> > > > right?
> > > 
> > > I'm still struggling a bit to grasp what your getting at but ...
> > 
> > I was more trying to say that the sysfs device interface with per-
> > object
> > directory isn't the right interface for this sort of usage at all.
> > Are these
> > even real hardware pieces which can be plugged in and out? While
> > being a
> > discrete piece of hardware isn't a requirement to be a device model
> > device,
> > the whole thing is designed with such use cases on mind. It
> > definitely isn't
> > the right design for representing six digit number of logical
> > entities.
> > 
> > It should be obvious that representing each consecutive memory
> > range with a
> > separate directory entry is far from an optimal way of representing
> > something like this. It's outright silly.
> 
> I agree.  And again, Ian, you are just "kicking the problem down the
> road" if we accept these patches.  Please fix this up properly so
> that
> this interface is correctly fixed to not do looney things like this.

Fine, mitigating this problem isn't the end of the story, and you
don't want to do accept a change to mitigate it because that could
mean no further discussion on it and no further work toward solving
it.

But it seems to me a "proper" solution to this will cross a number
of areas so this isn't just "my" problem and, as you point out, it's
likely to become increasingly problematic over time.

So what are your ideas and recommendations on how to handle hotplug
memory at this granularity for this much RAM (and larger amounts)?

Ian
Greg Kroah-Hartman June 23, 2020, 5:21 a.m. UTC | #13
On Mon, Jun 22, 2020 at 02:27:38PM -0700, Rick Lindsley wrote:
> 
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> 
> > It should be obvious that representing each consecutive memory range with a
> > separate directory entry is far from an optimal way of representing
> > something like this. It's outright silly.
> 
> On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:
> 
> > I agree.  And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches.  Please fix this up properly so that
> > this interface is correctly fixed to not do looney things like this.
> 
> Given that we cannot change the underlying machine representation of
> this hardware, what do you (all, not just you Greg) consider to be
> "properly"?

Change the userspace representation of the hardware then.  Why does
userspace care about so many individual blocks, what happens if you
provide them a larger granularity?  I can't imagine userspace really
wants to see 20k devices and manage them individually, where is the code
that does that?

What happens if you delay adding the devices until after booting?
Userspace should be event driven and only handle things after it sees
the devices being present, so try delaying and seeing what happens to
prevent this from keeping boot from progressing.

thanks,

greg k-h
Greg Kroah-Hartman June 23, 2020, 6:02 a.m. UTC | #14
On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > Hello, Ian.
> > > 
> > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > They are used for hotplugging and partitioning memory. The
> > > > > > size of
> > > > > > the
> > > > > > segments (and thus the number of them) is dictated by the
> > > > > > underlying
> > > > > > hardware.
> > > > > 
> > > > > This sounds so bad. There gotta be a better interface for that,
> > > > > right?
> > > > 
> > > > I'm still struggling a bit to grasp what your getting at but ...
> > > 
> > > I was more trying to say that the sysfs device interface with per-
> > > object
> > > directory isn't the right interface for this sort of usage at all.
> > > Are these
> > > even real hardware pieces which can be plugged in and out? While
> > > being a
> > > discrete piece of hardware isn't a requirement to be a device model
> > > device,
> > > the whole thing is designed with such use cases on mind. It
> > > definitely isn't
> > > the right design for representing six digit number of logical
> > > entities.
> > > 
> > > It should be obvious that representing each consecutive memory
> > > range with a
> > > separate directory entry is far from an optimal way of representing
> > > something like this. It's outright silly.
> > 
> > I agree.  And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches.  Please fix this up properly so
> > that
> > this interface is correctly fixed to not do looney things like this.
> 
> Fine, mitigating this problem isn't the end of the story, and you
> don't want to do accept a change to mitigate it because that could
> mean no further discussion on it and no further work toward solving
> it.
> 
> But it seems to me a "proper" solution to this will cross a number
> of areas so this isn't just "my" problem and, as you point out, it's
> likely to become increasingly problematic over time.
> 
> So what are your ideas and recommendations on how to handle hotplug
> memory at this granularity for this much RAM (and larger amounts)?

First off, this is not my platform, and not my problem, so it's funny
you ask me :)

Anyway, as I have said before, my first guesses would be:
	- increase the granularity size of the "memory chunks", reducing
	  the number of devices you create.
	- delay creating the devices until way after booting, or do it
	  on a totally different path/thread/workqueue/whatever to
	  prevent delay at booting

And then there's always:
	- don't create them at all, only only do so if userspace asks
	  you to.

You all have the userspace tools/users for this interface and know it
best to know what will work for them.  If you don't, then hey, let's
just delete the whole thing and see who screams :)

thanks,

greg k-h
Ian Kent June 23, 2020, 8:01 a.m. UTC | #15
On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > Hello, Ian.
> > > > 
> > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > The
> > > > > > > size of
> > > > > > > the
> > > > > > > segments (and thus the number of them) is dictated by the
> > > > > > > underlying
> > > > > > > hardware.
> > > > > > 
> > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > that,
> > > > > > right?
> > > > > 
> > > > > I'm still struggling a bit to grasp what your getting at but
> > > > > ...
> > > > 
> > > > I was more trying to say that the sysfs device interface with
> > > > per-
> > > > object
> > > > directory isn't the right interface for this sort of usage at
> > > > all.
> > > > Are these
> > > > even real hardware pieces which can be plugged in and out?
> > > > While
> > > > being a
> > > > discrete piece of hardware isn't a requirement to be a device
> > > > model
> > > > device,
> > > > the whole thing is designed with such use cases on mind. It
> > > > definitely isn't
> > > > the right design for representing six digit number of logical
> > > > entities.
> > > > 
> > > > It should be obvious that representing each consecutive memory
> > > > range with a
> > > > separate directory entry is far from an optimal way of
> > > > representing
> > > > something like this. It's outright silly.
> > > 
> > > I agree.  And again, Ian, you are just "kicking the problem down
> > > the
> > > road" if we accept these patches.  Please fix this up properly so
> > > that
> > > this interface is correctly fixed to not do looney things like
> > > this.
> > 
> > Fine, mitigating this problem isn't the end of the story, and you
> > don't want to do accept a change to mitigate it because that could
> > mean no further discussion on it and no further work toward solving
> > it.
> > 
> > But it seems to me a "proper" solution to this will cross a number
> > of areas so this isn't just "my" problem and, as you point out,
> > it's
> > likely to become increasingly problematic over time.
> > 
> > So what are your ideas and recommendations on how to handle hotplug
> > memory at this granularity for this much RAM (and larger amounts)?
> 
> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Sorry, but I don't think it's funny at all.

It's not "my platform" either, I'm just the poor old sole that
took this on because, on the face of it, it's a file system
problem as claimed by others that looked at it and promptly
washed their hands of it.

I don't see how asking for your advice is out of order at all.

> 
> Anyway, as I have said before, my first guesses would be:
> 	- increase the granularity size of the "memory chunks",
> reducing
> 	  the number of devices you create.

Yes, I didn't get that from your initial comments but you've said
it a couple of times recently and I do get it now.

I'll try and find someone appropriate to consult about that and
see where it goes.

> 	- delay creating the devices until way after booting, or do it
> 	  on a totally different path/thread/workqueue/whatever to
> 	  prevent delay at booting

When you first said this it sounded like a ugly workaround to me.
But perhaps it isn't (I'm not really convinced it is TBH), so it's
probably worth trying to follow up on too.

> 
> And then there's always:
> 	- don't create them at all, only only do so if userspace asks
> 	  you to.

At first glance the impression I get from this is that it's an even
uglier work around than delaying it but it might actually the most
sensible way to handle this, as it's been called, silliness.

We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
the dentry op ->d_automount() to be called on access so, from a path
walk perspective, the dentries could just appear when needed.

The question I'd need to answer is do the kernfs nodes exist so
->d_automount() can discover if the node lookup is valid, and I think
the answer might be yes (but we would need to suppress udev
notifications for S_AUTOMOUNT nodes).

The catch will be that this is "not" mounting per-se, so anything
I do would probably be seen as an ugly hack that subverts the VFS
automount support.

If I could find a way to reconcile that I could probably do this.

Al, what say you on this?

> 
> You all have the userspace tools/users for this interface and know it
> best to know what will work for them.  If you don't, then hey, let's
> just delete the whole thing and see who screams :)

Please, no joking, I'm finding it hard enough to cope with this
disappointment as it is, ;)

Ian
Ian Kent June 23, 2020, 8:29 a.m. UTC | #16
On Tue, 2020-06-23 at 16:01 +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > > 
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The
> > > > > > > > size of
> > > > > > > > the
> > > > > > > > segments (and thus the number of them) is dictated by
> > > > > > > > the
> > > > > > > > underlying
> > > > > > > > hardware.
> > > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that,
> > > > > > > right?
> > > > > > 
> > > > > > I'm still struggling a bit to grasp what your getting at
> > > > > > but
> > > > > > ...
> > > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-
> > > > > object
> > > > > directory isn't the right interface for this sort of usage at
> > > > > all.
> > > > > Are these
> > > > > even real hardware pieces which can be plugged in and out?
> > > > > While
> > > > > being a
> > > > > discrete piece of hardware isn't a requirement to be a device
> > > > > model
> > > > > device,
> > > > > the whole thing is designed with such use cases on mind. It
> > > > > definitely isn't
> > > > > the right design for representing six digit number of logical
> > > > > entities.
> > > > > 
> > > > > It should be obvious that representing each consecutive
> > > > > memory
> > > > > range with a
> > > > > separate directory entry is far from an optimal way of
> > > > > representing
> > > > > something like this. It's outright silly.
> > > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem
> > > > down
> > > > the
> > > > road" if we accept these patches.  Please fix this up properly
> > > > so
> > > > that
> > > > this interface is correctly fixed to not do looney things like
> > > > this.
> > > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to do accept a change to mitigate it because that
> > > could
> > > mean no further discussion on it and no further work toward
> > > solving
> > > it.
> > > 
> > > But it seems to me a "proper" solution to this will cross a
> > > number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's
> > > likely to become increasingly problematic over time.
> > > 
> > > So what are your ideas and recommendations on how to handle
> > > hotplug
> > > memory at this granularity for this much RAM (and larger
> > > amounts)?
> > 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Sorry, but I don't think it's funny at all.
> 
> It's not "my platform" either, I'm just the poor old sole that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
> 
> I don't see how asking for your advice is out of order at all.
> 
> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks",
> > reducing
> > 	  the number of devices you create.
> 
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
> 
> I'll try and find someone appropriate to consult about that and
> see where it goes.
> 
> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> When you first said this it sounded like a ugly workaround to me.
> But perhaps it isn't (I'm not really convinced it is TBH), so it's
> probably worth trying to follow up on too.
> 
> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> At first glance the impression I get from this is that it's an even
> uglier work around than delaying it but it might actually the most
> sensible way to handle this, as it's been called, silliness.
> 
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
> the dentry op ->d_automount() to be called on access so, from a path
> walk perspective, the dentries could just appear when needed.
> 
> The question I'd need to answer is do the kernfs nodes exist so
> ->d_automount() can discover if the node lookup is valid, and I think
> the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).

Or maybe taking the notion of on-demand dentry creation further
is "nothing" more than not triggering udev notifications when
nodes are created letting the VFS create dentries on access alone
is all that would be needed.

I'm really not sure about how this would work yet ...

Ian
Rick Lindsley June 23, 2020, 9:33 a.m. UTC | #17
On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:

> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Weeeelll, not your platform perhaps but MAINTAINERS does list you first and Tejun second as maintainers for kernfs.  So in that sense, any patches would need to go thru you.  So, your opinions do matter.

  
> Anyway, as I have said before, my first guesses would be:
> 	- increase the granularity size of the "memory chunks", reducing
> 	  the number of devices you create.

This would mean finding every utility that relies on this behavior.  That may be possible, although not easy, for distro or platform software, but it's hard to guess what user-related utilities may have been created by other consumers of those distros or that platform.  In any case, removing an interface without warning is a hanging offense in many Linux circles.

> 	- delay creating the devices until way after booting, or do it
> 	  on a totally different path/thread/workqueue/whatever to
> 	  prevent delay at booting

This has been considered, but it again requires a full list of utilities relying on this interface and determining which of them may want to run before the devices are "loaded" at boot time.  It may be few, or even zero, but it would be a much more disruptive change in the boot process than what we are suggesting.

> And then there's always:
> 	- don't create them at all, only only do so if userspace asks
> 	  you to.

If they are done in parallel on demand, you'll see the same problem (load average of 1000+, contention in the same spot.)  You obviously won't hold up the boot, of course, but your utility and anything else running on the machine will take an unexpected pause ... for somewhere between 30 and 90 minutes.  Seems equally unfriendly.

A variant of this, which does have a positive effect, is to observe that coldplug during initramfs does seem to load up the memory device tree without incident.  We do a second coldplug after we switch roots and this is the one that runs into timer issues.  I have asked "those that should know" why there is a second coldplug.  I can guess but would prefer to know to avoid that screaming option.  If that second coldplug is unnecessary for the kernfs memory interfaces to work correctly, then that is an alternate, and perhaps even better solution.  (It wouldn't change the fact that kernfs was not built for speed and this problem remains below the surface to trip up another.)

However, nobody I've found can say that is safe, and I'm not fond of the 'see who screams' test solution.

> You all have the userspace tools/users for this interface and know it
> best to know what will work for them.  If you don't, then hey, let's
> just delete the whole thing and see who screams :)

I guess I'm puzzled by why everyone seems offended by suggesting we change a mutex to a rw semaphore.  In a vacuum, sure, but we have before and after numbers.  Wouldn't the same cavalier logic apply?  Why not change it and see who screams?

I haven't heard any criticism of the patch itself - I'm hearing criticism of the problem.  This problem is not specific to memory devices.  As we get larger systems,  we'll see it elsewhere. We do already see a mild form of this when fibre finds 1000-2000 fibre disks and goes to add them in parallel.  Small memory chunks introduces the problem at a level two orders of magnitude bigger, but eventually other devices will be subject to it too.  Why not address this now?

'Doctor, it hurts when I do this'
'Then don't do that'

Funny as a joke.  Less funny as a review comment.

Rick
Greg Kroah-Hartman June 23, 2020, 11:45 a.m. UTC | #18
On Tue, Jun 23, 2020 at 02:33:48AM -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
> 
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
> 
> Weeeelll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs.  So in that sense,
> any patches would need to go thru you.  So, your opinions do matter.

Sure, but "help, I'm abusing your code interface, so fix your code
interface and not my caller code" really isn't the best mantra :)

> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks", reducing
> > 	  the number of devices you create.
> 
> This would mean finding every utility that relies on this behavior.
> That may be possible, although not easy, for distro or platform
> software, but it's hard to guess what user-related utilities may have
> been created by other consumers of those distros or that platform.  In
> any case, removing an interface without warning is a hanging offense
> in many Linux circles.

I agree, so find out who uses it!  You can search all debian tools
easily.  You can ask any closed-source setup tools that are on your
platforms if they use it.  You can "break it and see if anyone notices",
which is what we do all the time.

The "hanging offence" is "I'm breaking this even if you are using it!".

> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> This has been considered, but it again requires a full list of utilities relying on this interface and determining which of them may want to run before the devices are "loaded" at boot time.  It may be few, or even zero, but it would be a much more disruptive change in the boot process than what we are suggesting.

Is that really the case?  I strongly suggest you all do some research
here.

Oh, and wrap your email lines please...

> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> If they are done in parallel on demand, you'll see the same problem (load average of 1000+, contention in the same spot.)  You obviously won't hold up the boot, of course, but your utility and anything else running on the machine will take an unexpected pause ... for somewhere between 30 and 90 minutes.  Seems equally unfriendly.

I agree, but it shouldn't be shutting down the whole box, other stuff
should run just fine, right?  Is this platform really that "weak" that
it can't handle this happening in a single thread/cpu?

> A variant of this, which does have a positive effect, is to observe that coldplug during initramfs does seem to load up the memory device tree without incident.  We do a second coldplug after we switch roots and this is the one that runs into timer issues.  I have asked "those that should know" why there is a second coldplug.  I can guess but would prefer to know to avoid that screaming option.  If that second coldplug is unnecessary for the kernfs memory interfaces to work correctly, then that is an alternate, and perhaps even better solution.  (It wouldn't change the fact that kernfs was not built for speed and this problem remains below the surface to trip up another.)
> 
> However, nobody I've found can say that is safe, and I'm not fond of the 'see who screams' test solution.
> 
> > You all have the userspace tools/users for this interface and know it
> > best to know what will work for them.  If you don't, then hey, let's
> > just delete the whole thing and see who screams :)
> 
> I guess I'm puzzled by why everyone seems offended by suggesting we change a mutex to a rw semaphore.  In a vacuum, sure, but we have before and after numbers.  Wouldn't the same cavalier logic apply?  Why not change it and see who screams?

I am offended as a number of years ago this same user of kernfs/sysfs
did a lot of work to reduce the number of contentions in kernfs for this
same reason.  After that work was done, "all was good".  Now this comes
along again, blaming kernfs/sysfs, not the caller.

Memory is only going to get bigger over time, you might want to fix it
this way and then run away.  But we have to maintain this for the next
20+ years, and you are not solving the root-problem here.  It will come
back again, right?

> I haven't heard any criticism of the patch itself - I'm hearing criticism of the problem.  This problem is not specific to memory devices.  As we get larger systems,  we'll see it elsewhere. We do already see a mild form of this when fibre finds 1000-2000 fibre disks and goes to add them in parallel.  Small memory chunks introduces the problem at a level two orders of magnitude bigger, but eventually other devices will be subject to it too.  Why not address this now?

1-2k devices are easy to handle, we handle 30k scsi devices today with
no problem at all, and have for 15+ years.  We are "lucky" there that
the hardware is slower than kernfs/sysfs so that we are not the
bottleneck at all.

> 'Doctor, it hurts when I do this'
> 'Then don't do that'
> 
> Funny as a joke.  Less funny as a review comment.

Treat the system as a whole please, don't go for a short-term fix that
we all know is not solving the real problem here.

thanks,

greg k-h
Greg Kroah-Hartman June 23, 2020, 11:49 a.m. UTC | #19
On Tue, Jun 23, 2020 at 04:01:52PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > > 
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The
> > > > > > > > size of
> > > > > > > > the
> > > > > > > > segments (and thus the number of them) is dictated by the
> > > > > > > > underlying
> > > > > > > > hardware.
> > > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that,
> > > > > > > right?
> > > > > > 
> > > > > > I'm still struggling a bit to grasp what your getting at but
> > > > > > ...
> > > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-
> > > > > object
> > > > > directory isn't the right interface for this sort of usage at
> > > > > all.
> > > > > Are these
> > > > > even real hardware pieces which can be plugged in and out?
> > > > > While
> > > > > being a
> > > > > discrete piece of hardware isn't a requirement to be a device
> > > > > model
> > > > > device,
> > > > > the whole thing is designed with such use cases on mind. It
> > > > > definitely isn't
> > > > > the right design for representing six digit number of logical
> > > > > entities.
> > > > > 
> > > > > It should be obvious that representing each consecutive memory
> > > > > range with a
> > > > > separate directory entry is far from an optimal way of
> > > > > representing
> > > > > something like this. It's outright silly.
> > > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem down
> > > > the
> > > > road" if we accept these patches.  Please fix this up properly so
> > > > that
> > > > this interface is correctly fixed to not do looney things like
> > > > this.
> > > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to do accept a change to mitigate it because that could
> > > mean no further discussion on it and no further work toward solving
> > > it.
> > > 
> > > But it seems to me a "proper" solution to this will cross a number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's
> > > likely to become increasingly problematic over time.
> > > 
> > > So what are your ideas and recommendations on how to handle hotplug
> > > memory at this granularity for this much RAM (and larger amounts)?
> > 
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
> 
> Sorry, but I don't think it's funny at all.
> 
> It's not "my platform" either, I'm just the poor old sole that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
> 
> I don't see how asking for your advice is out of order at all.
> 
> > 
> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks",
> > reducing
> > 	  the number of devices you create.
> 
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
> 
> I'll try and find someone appropriate to consult about that and
> see where it goes.
> 
> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> When you first said this it sounded like a ugly workaround to me.
> But perhaps it isn't (I'm not really convinced it is TBH), so it's
> probably worth trying to follow up on too.

It's not a workaround, it lets the rest of the system come up and do
useful things while you are still discovering parts of the system that
are not up and running.  We do this all the time for lots of
drivers/devices/subsystems, why is memory any different here?

> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> At first glance the impression I get from this is that it's an even
> uglier work around than delaying it but it might actually the most
> sensible way to handle this, as it's been called, silliness.
> 
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
> the dentry op ->d_automount() to be called on access so, from a path
> walk perspective, the dentries could just appear when needed.
> 
> The question I'd need to answer is do the kernfs nodes exist so
> ->d_automount() can discover if the node lookup is valid, and I think
> the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).
> 
> The catch will be that this is "not" mounting per-se, so anything
> I do would probably be seen as an ugly hack that subverts the VFS
> automount support.
> 
> If I could find a way to reconcile that I could probably do this.

I am not meaning to do this at the fs layer, but at the device layer.

Why not wait until someone goes "hey, I wonder what my memory layout is,
let's go ask the kernel to probe all of that."  Or some other such
"delayed initialization" method.  Don't mess with the fs for this,
that's probably the wrong layer for all of this.

thanks,

greg k-h
Ian Kent June 23, 2020, 11:51 a.m. UTC | #20
On Tue, 2020-06-23 at 02:33 -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
> 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Weeeelll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs.  So in that sense,
> any patches would need to go thru you.  So, your opinions do matter.
> 
>   
> > Anyway, as I have said before, my first guesses would be:
> > 	- increase the granularity size of the "memory chunks",
> > reducing
> > 	  the number of devices you create.
> 
> This would mean finding every utility that relies on this
> behavior.  That may be possible, although not easy, for distro or
> platform software, but it's hard to guess what user-related utilities
> may have been created by other consumers of those distros or that
> platform.  In any case, removing an interface without warning is a
> hanging offense in many Linux circles.
> 
> > 	- delay creating the devices until way after booting, or do it
> > 	  on a totally different path/thread/workqueue/whatever to
> > 	  prevent delay at booting
> 
> This has been considered, but it again requires a full list of
> utilities relying on this interface and determining which of them may
> want to run before the devices are "loaded" at boot time.  It may be
> few, or even zero, but it would be a much more disruptive change in
> the boot process than what we are suggesting.
> 
> > And then there's always:
> > 	- don't create them at all, only only do so if userspace asks
> > 	  you to.
> 
> If they are done in parallel on demand, you'll see the same problem
> (load average of 1000+, contention in the same spot.)  You obviously
> won't hold up the boot, of course, but your utility and anything else
> running on the machine will take an unexpected pause ... for
> somewhere between 30 and 90 minutes.  Seems equally unfriendly.
> 
> A variant of this, which does have a positive effect, is to observe
> that coldplug during initramfs does seem to load up the memory device
> tree without incident.  We do a second coldplug after we switch roots
> and this is the one that runs into timer issues.  I have asked "those
> that should know" why there is a second coldplug.  I can guess but
> would prefer to know to avoid that screaming option.  If that second
> coldplug is unnecessary for the kernfs memory interfaces to work
> correctly, then that is an alternate, and perhaps even better
> solution.  (It wouldn't change the fact that kernfs was not built for
> speed and this problem remains below the surface to trip up another.)

We might still need the patches here for that on-demand mechanism
to be feasible.

For example, for an ls of the node directory it should be doable to
enumerate the nodes in readdir without creating dentries but there's
the inevitable stat() of each path that follows that would probably
lead to similar contention.

And changing the division of the entries into sub-directories would
inevitably break anything that does actually need to access them.

Ian
Rick Lindsley June 23, 2020, 10:55 p.m. UTC | #21
On 6/23/20 4:45 AM, Greg Kroah-Hartman wrote:

> Sure, but "help, I'm abusing your code interface, so fix your code
> interface and not my caller code" really isn't the best mantra :)

Well, those are your words, not mine.  What we're saying is, "we've 
identified an interface that doesn't scale in this situation, but we 
have a way to make it scale to all known configurations."

 > I am offended as a number of years ago this same user of kernfs/sysfs
 > did a lot of work to reduce the number of contentions in kernfs for
 > this same reason.  After that work was done, "all was good".  Now
 > this comes along again, blaming kernfs/sysfs, not the caller.

Ok. I don't know about the history, but I can tell you "blame" is not 
the word I'd use.  As hardware changes, Linux also changes, and over "a 
number of years" it's not surprising to me if basic assumptions changed 
again and led us back to a place we've been before.  That's not an 
indictment.  It just IS.

 > Memory is only going to get bigger over time, you might want to fix it
 > this way and then run away.  But we have to maintain this for the next
 > 20+ years, and you are not solving the root-problem here.  It will
 > come back again, right?

If hardware vendors insist on dealing with small blocks of memory in 
large aggregates, then yes it could.  You'll have to trust that I am 
also in discussion with hardware architects about how that is a very bad 
architecture and it's time to change decades and think bigger.  Separate 
audience, equally contentious discussion.  But the bottom line is, it's 
out there already and can't be walked back.

Your response here seems to center on "kernfs was never designed for 
that."  If so, we're in agreement.   We're suggesting a way it can be 
extended to be more robust, with no (apparent) side effects.  I'd like 
to discuss the merits of the patch itself.

Rick
Tejun Heo June 23, 2020, 11:13 p.m. UTC | #22
Hello, Rick.

On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > I don't know. The above highlights the absurdity of the approach itself to
> > me. You seem to be aware of it too in writing: 250,000 "devices".
> 
> Just because it is absurd doesn't mean it wasn't built that way :)
> 
> I agree, and I'm trying to influence the next hardware design. However,

I'm not saying that the hardware should not segment things into however many
pieces that it wants / needs to. That part is fine.

> what's already out there is memory units that must be accessed in 256MB
> blocks. If you want to remove/add a GB, that's really 4 blocks of memory
> you're manipulating, to the hardware. Those blocks have to be registered
> and recognized by the kernel for that to work.

The problem is fitting that into an interface which wholly doesn't fit that
particular requirement. It's not that difficult to imagine different ways to
represent however many memory slots, right? It'd take work to make sure that
integrates well with whatever tooling or use cases but once done this
particular problem will be resolved permanently and the whole thing will
look a lot less silly. Wouldn't that be better?

Thanks.
Rick Lindsley June 24, 2020, 9:04 a.m. UTC | #23
Thanks, Tejun, appreciate the feedback.

On 6/23/20 4:13 PM, Tejun Heo wrote:

> The problem is fitting that into an interface which wholly doesn't fit that
> particular requirement. It's not that difficult to imagine different ways to
> represent however many memory slots, right? 

Perhaps we have different views of how this is showing up.  systemd is 
the primary instigator of the boot issues.

Systemd runs

     /usr/lib/systemd/system/systemd-udev-trigger.service

which does a udev trigger, specifically

     /usr/bin/udevadm trigger --type=devices --action=add

as part of its post-initramfs coldplug.  It then waits for that to 
finish, under the watch of a timer.

So, the kernel itself is reporting these devices to systemd. It gets 
that information from talking to the hardware.  That means, then, that 
the obfuscation must either start in the kernel itself (it lies to 
systemd), or start in systemd when it handles the devices it got from 
the kernel.  If the kernel lies, then the actual granularity is not 
available to any user utilities.

Unless you're suggesting a new interface be created that would allow 
utilities to determine the "real" memory addresses available for 
manipulation.  But the changes you describe cannot be limited to the 
unknown number of auxiliary utilities.

Having one subsystem lie to another seems like the start of a bad idea, 
anyway.  When the hardware management console, separate from Linux, 
reports a memory error, or adds or deletes memory in a guest system, 
it's not going to be manipulating spoofed addresses that are only a 
Linux construct.

In contrast, the provided patch fixes the observed problem with no 
ripple effect to other subsystems or utilities.

Greg had suggested
     Treat the system as a whole please, don't go for a short-term
     fix that we all know is not solving the real problem here.

Your solution affects multiple subsystems; this one affects one.  Which 
is the whole system approach in terms of risk?  You mentioned you 
support 30k scsi disks but only because they are slow so the 
inefficiencies of kernfs don't show.  That doesn't bother you?

Rick
Greg Kroah-Hartman June 24, 2020, 9:27 a.m. UTC | #24
On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:
> In contrast, the provided patch fixes the observed problem with no ripple
> effect to other subsystems or utilities.

Your patch, as-is, is fine, but to somehow think that this is going to
solve your real problem here is the issue we keep raising.

The real problem is you have way too many devices that somehow need to
all get probed at boot time before you can do anything else.

> Greg had suggested
>     Treat the system as a whole please, don't go for a short-term
>     fix that we all know is not solving the real problem here.
> 
> Your solution affects multiple subsystems; this one affects one.  Which is
> the whole system approach in terms of risk?  You mentioned you support 30k
> scsi disks but only because they are slow so the inefficiencies of kernfs
> don't show.  That doesn't bother you?

Systems with 30k of devices do not have any problems that I know of
because they do not do foolish things like stall booting before they are
all discovered :)

What's the odds that if we take this patch, you all have to come back in
a few years with some other change to the api due to even larger memory
sizes happening?  What happens if you boot on a system with this change
and with 10x the memory you currently have?  Try simulating that by
creating 10x the number of devices and see what happens.  Does the
bottleneck still remain in kernfs or is it somewhere else?

thanks,

greg k-h
Tejun Heo June 24, 2020, 1:19 p.m. UTC | #25
Hello, Rick.

On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:
> In contrast, the provided patch fixes the observed problem with no ripple
> effect to other subsystems or utilities.
> 
> Greg had suggested
>     Treat the system as a whole please, don't go for a short-term
>     fix that we all know is not solving the real problem here.
> 
> Your solution affects multiple subsystems; this one affects one.  Which is
> the whole system approach in terms of risk?  You mentioned you support 30k
> scsi disks but only because they are slow so the inefficiencies of kernfs
> don't show.  That doesn't bother you?

I suggest putting honest thoughts into finding a long term solution instead
of these rhetorical retorts. If you really can't see how ill-suited the
current use of interface and proposed solution is, I'm not sure how better
to communicate them to you.

Thanks.
Ian Kent June 25, 2020, 8:15 a.m. UTC | #26
On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> Hello, Rick.
> 
> On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > I don't know. The above highlights the absurdity of the approach
> > > itself to
> > > me. You seem to be aware of it too in writing: 250,000 "devices".
> > 
> > Just because it is absurd doesn't mean it wasn't built that way :)
> > 
> > I agree, and I'm trying to influence the next hardware design.
> > However,
> 
> I'm not saying that the hardware should not segment things into
> however many
> pieces that it wants / needs to. That part is fine.
> 
> > what's already out there is memory units that must be accessed in
> > 256MB
> > blocks. If you want to remove/add a GB, that's really 4 blocks of
> > memory
> > you're manipulating, to the hardware. Those blocks have to be
> > registered
> > and recognized by the kernel for that to work.
> 
> The problem is fitting that into an interface which wholly doesn't
> fit that
> particular requirement. It's not that difficult to imagine different
> ways to
> represent however many memory slots, right? It'd take work to make
> sure that
> integrates well with whatever tooling or use cases but once done this
> particular problem will be resolved permanently and the whole thing
> will
> look a lot less silly. Wouldn't that be better?

Well, no, I am finding it difficult to imagine different ways to
represent this but perhaps that's because I'm blinker eyed on what
a solution might look like because of my file system focus.

Can "anyone" throw out some ideas with a little more detail than we
have had so far so we can maybe start to formulate an actual plan of
what needs to be done.

Ian
Greg Kroah-Hartman June 25, 2020, 9:43 a.m. UTC | #27
On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > Hello, Rick.
> > 
> > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > I don't know. The above highlights the absurdity of the approach
> > > > itself to
> > > > me. You seem to be aware of it too in writing: 250,000 "devices".
> > > 
> > > Just because it is absurd doesn't mean it wasn't built that way :)
> > > 
> > > I agree, and I'm trying to influence the next hardware design.
> > > However,
> > 
> > I'm not saying that the hardware should not segment things into
> > however many
> > pieces that it wants / needs to. That part is fine.
> > 
> > > what's already out there is memory units that must be accessed in
> > > 256MB
> > > blocks. If you want to remove/add a GB, that's really 4 blocks of
> > > memory
> > > you're manipulating, to the hardware. Those blocks have to be
> > > registered
> > > and recognized by the kernel for that to work.
> > 
> > The problem is fitting that into an interface which wholly doesn't
> > fit that
> > particular requirement. It's not that difficult to imagine different
> > ways to
> > represent however many memory slots, right? It'd take work to make
> > sure that
> > integrates well with whatever tooling or use cases but once done this
> > particular problem will be resolved permanently and the whole thing
> > will
> > look a lot less silly. Wouldn't that be better?
> 
> Well, no, I am finding it difficult to imagine different ways to
> represent this but perhaps that's because I'm blinker eyed on what
> a solution might look like because of my file system focus.
> 
> Can "anyone" throw out some ideas with a little more detail than we
> have had so far so we can maybe start to formulate an actual plan of
> what needs to be done.

I think both Tejun and I have provided a number of alternatives for you
all to look into, and yet you all keep saying that those are impossible
for some unknown reason.

It's not up to me to tell you what to do to fix your broken interfaces
as only you all know who is using this and how to handle those changes.

It is up to me to say "don't do that!" and to refuse patches that don't
solve the root problem here.  I'll review these later on (I have 1500+
patches to review at the moment) as these are a nice
micro-optimization...

And as this conversation seems to just going in circles, I think this is
going to be my last response to it...

greg k-h
Ian Kent June 26, 2020, 12:19 a.m. UTC | #28
On Thu, 2020-06-25 at 11:43 +0200, Greg Kroah-Hartman wrote:
> On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> > On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > > Hello, Rick.
> > > 
> > > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > > I don't know. The above highlights the absurdity of the
> > > > > approach
> > > > > itself to
> > > > > me. You seem to be aware of it too in writing: 250,000
> > > > > "devices".
> > > > 
> > > > Just because it is absurd doesn't mean it wasn't built that way
> > > > :)
> > > > 
> > > > I agree, and I'm trying to influence the next hardware design.
> > > > However,
> > > 
> > > I'm not saying that the hardware should not segment things into
> > > however many
> > > pieces that it wants / needs to. That part is fine.
> > > 
> > > > what's already out there is memory units that must be accessed
> > > > in
> > > > 256MB
> > > > blocks. If you want to remove/add a GB, that's really 4 blocks
> > > > of
> > > > memory
> > > > you're manipulating, to the hardware. Those blocks have to be
> > > > registered
> > > > and recognized by the kernel for that to work.
> > > 
> > > The problem is fitting that into an interface which wholly
> > > doesn't
> > > fit that
> > > particular requirement. It's not that difficult to imagine
> > > different
> > > ways to
> > > represent however many memory slots, right? It'd take work to
> > > make
> > > sure that
> > > integrates well with whatever tooling or use cases but once done
> > > this
> > > particular problem will be resolved permanently and the whole
> > > thing
> > > will
> > > look a lot less silly. Wouldn't that be better?
> > 
> > Well, no, I am finding it difficult to imagine different ways to
> > represent this but perhaps that's because I'm blinker eyed on what
> > a solution might look like because of my file system focus.
> > 
> > Can "anyone" throw out some ideas with a little more detail than we
> > have had so far so we can maybe start to formulate an actual plan
> > of
> > what needs to be done.
> 
> I think both Tejun and I have provided a number of alternatives for
> you
> all to look into, and yet you all keep saying that those are
> impossible
> for some unknown reason.

Yes, those comments are a starting point to be sure.
And continuing on that path isn't helping anyone.

That's why I'm asking for your input on what a solution you
would see as adequate might look like to you (and Tejun).

> 
> It's not up to me to tell you what to do to fix your broken
> interfaces
> as only you all know who is using this and how to handle those
> changes.

But it would be useful to go into a little more detail, based on
your own experience, about what you think a suitable solution might
be.

That surely needs to be taken into account and used to guide the
direction of our investigation of what we do.

> 
> It is up to me to say "don't do that!" and to refuse patches that
> don't
> solve the root problem here.  I'll review these later on (I have
> 1500+
> patches to review at the moment) as these are a nice
> micro-optimization...

Sure, and I get the "I don't want another post and run set of
patches that I have to maintain forever that don't fully solve
the problem" view and any ideas and perhaps a little more detail
on where we might go with this would be very much appreciated.

> 
> And as this conversation seems to just going in circles, I think this
> is
> going to be my last response to it...

Which is why I'm asking this, I really would like to see this
discussion change course and become useful.

Ian
Fox Chen Dec. 10, 2020, 4:44 p.m. UTC | #29
Hi,

I found this series of patches solves exact the problem I am trying to solve.
https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlchen@gmail.com/

The problem is reported by Brice Goglin on thread:
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
https://lore.kernel.org/lkml/X60dvJoT4fURcnsF@kroah.com/

I independently comfirmed that on a 96-core AWS c5.metal server.
Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id 1000 times.
With a single thread it takes ~2.5 us for each open+read+close.
With one thread per core, 96 threads running simultaneously takes 540 us 
for each of the same operation (without much variation) -- 200x slower than the 
single thread one. 

My Benchmark code is here:
https://github.com/foxhlchen/sysfs_benchmark

The problem can only be observed in large machines (>=16 cores).
The more cores you have the slower it can be.

Perf shows that CPUs spend most of the time (>80%) waiting on mutex locks in 
kernfs_iop_permission and kernfs_dop_revalidate.

After applying this, performance gets huge boost -- with the fastest one at ~30 us 
to the worst at ~180 us (most of on spin_locks, the delay just stacking up, very
similar to the performance on ext4). 

I hope this problem can justifies this series of patches. A big mutex in kernfs
is really not nice. Due to this BIG LOCK, concurrency in kernfs is almost NONE,
even though you do operations on different files, they are contentious.

As we get more and more cores on normal machines and because sysfs provides such
important information, this problem should be fix. So please reconsider accepting
the patches.

For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun mentioned here 
(https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/), maybe a global 
rwsem for kn->iattr will be better??



thanks,
fox
Ian Kent Dec. 11, 2020, 2:01 a.m. UTC | #30
On Thu, 2020-12-10 at 16:44 +0000, Fox Chen wrote:
> Hi,
> 
> I found this series of patches solves exact the problem I am trying
> to solve.
> https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlchen@gmail.com/

Right.

> 
> The problem is reported by Brice Goglin on thread:
> Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
> https://lore.kernel.org/lkml/X60dvJoT4fURcnsF@kroah.com/
> 
> I independently comfirmed that on a 96-core AWS c5.metal server.
> Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id
> 1000 times.
> With a single thread it takes ~2.5 us for each open+read+close.
> With one thread per core, 96 threads running simultaneously takes 540
> us 
> for each of the same operation (without much variation) -- 200x
> slower than the 
> single thread one. 

Right, interesting that the it's actually a problem on such
small system configurations.

I didn't think it would be evident on hardware that doesn't
have a much larger configuration.

> 
> My Benchmark code is here:
> https://github.com/foxhlchen/sysfs_benchmark
> 
> The problem can only be observed in large machines (>=16 cores).
> The more cores you have the slower it can be.
> 
> Perf shows that CPUs spend most of the time (>80%) waiting on mutex
> locks in 
> kernfs_iop_permission and kernfs_dop_revalidate.
> 
> After applying this, performance gets huge boost -- with the fastest
> one at ~30 us 
> to the worst at ~180 us (most of on spin_locks, the delay just
> stacking up, very
> similar to the performance on ext4). 

That's the problem isn't it.

Unfortunately we don't get large improvements for nothing so I
was constantly thinking, what have I done here that isn't ok ...
and I don't have an answer for that.

The series needs review from others for that but we didn't get
that far.

> 
> I hope this problem can justifies this series of patches. A big mutex
> in kernfs
> is really not nice. Due to this BIG LOCK, concurrency in kernfs is
> almost NONE,
> even though you do operations on different files, they are
> contentious.

Well, as much as I don't like to admit it, Greg (and Tejun) do have
a point about the number of sysfs files used when there is a very
large amount of RAM. But IIUC the suggestion of altering the sysfs
representation for this devices memory would introduce all sorts
of problems, it then being different form all device memory
representations (systemd udev coldplug for example).

But I think your saying there are also visible improvements elsewhere
too, which is to be expected of course.

Let's not forget that, as the maintainer, Greg has every right to
be reluctant to take changes because he is likely to end up owning
and maintaining the changes. That can lead to considerable overhead
and frustration if the change isn't quite right and it's very hard
to be sure there aren't hidden problems with it.

Fact is that, due to Greg's rejection, there was much more focus
by the reporter to fix it at the source but I have no control over
that, I only know that it helped to get things moving.

Given the above, I was considering posting the series again and
asking for the series to be re-considered but since I annoyed
Greg so much the first time around I've been reluctant to do so.

But now is a good time I guess, Greg, please, would you re-consider
possibly accepting these patches?

I would really like some actual review of what I have done from
people like yourself and Al. Ha, after that they might well not
be ok anyway!

> 
> As we get more and more cores on normal machines and because sysfs
> provides such
> important information, this problem should be fix. So please
> reconsider accepting
> the patches.
> 
> For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> mentioned here 
> (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/),
> maybe a global 
> rwsem for kn->iattr will be better??

I wasn't sure about that, IIRC a spin lock could be used around the
initial check and checked again at the end which would probably have
been much faster but much less conservative and a bit more ugly so
I just went the conservative path since there was so much change
already.

Ian
Ian Kent Dec. 11, 2020, 2:17 a.m. UTC | #31
On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> 
> > For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> > mentioned here 
> > (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/),
> > maybe a global 
> > rwsem for kn->iattr will be better??
> 
> I wasn't sure about that, IIRC a spin lock could be used around the
> initial check and checked again at the end which would probably have
> been much faster but much less conservative and a bit more ugly so
> I just went the conservative path since there was so much change
> already.

Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
it.

Based on what Tejun said it sounds like that needs work.

Ian
Ian Kent Dec. 13, 2020, 3:46 a.m. UTC | #32
On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > Tejun
> > > mentioned here 
> > > (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/),
> > > maybe a global 
> > > rwsem for kn->iattr will be better??
> > 
> > I wasn't sure about that, IIRC a spin lock could be used around the
> > initial check and checked again at the end which would probably
> > have
> > been much faster but much less conservative and a bit more ugly so
> > I just went the conservative path since there was so much change
> > already.
> 
> Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
> it.
> 
> Based on what Tejun said it sounds like that needs work.

Those attribute handling patches were meant to allow taking the rw
sem read lock instead of the write lock for kernfs_refresh_inode()
updates, with the added locking to protect the inode attributes
update since it's called from the VFS both with and without the
inode lock.

Looking around it looks like kernfs_iattrs() is called from multiple
places without a node database lock at all.

I'm thinking that, to keep my proposed change straight forward
and on topic, I should just leave kernfs_refresh_inode() taking
the node db write lock for now and consider the attributes handling
as a separate change. Once that's done we could reconsider what's
needed to use the node db read lock in kernfs_refresh_inode().

It will reduce the effectiveness of the series but it would make
this change much more complicated, and is somewhat off-topic, and
could hamper the chances of reviewers spotting problem with it.

Ian
Fox Chen Dec. 14, 2020, 6:14 a.m. UTC | #33
On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> wrote:
>
> On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > > Tejun
> > > > mentioned here
> > > > (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/),
> > > > maybe a global
> > > > rwsem for kn->iattr will be better??
> > >
> > > I wasn't sure about that, IIRC a spin lock could be used around the
> > > initial check and checked again at the end which would probably
> > > have
> > > been much faster but much less conservative and a bit more ugly so
> > > I just went the conservative path since there was so much change
> > > already.
> >
> > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
> > it.
> >
> > Based on what Tejun said it sounds like that needs work.
>
> Those attribute handling patches were meant to allow taking the rw
> sem read lock instead of the write lock for kernfs_refresh_inode()
> updates, with the added locking to protect the inode attributes
> update since it's called from the VFS both with and without the
> inode lock.

Oh, understood. I was asking also because lock on kn->attr_mutex drags
concurrent performance.

> Looking around it looks like kernfs_iattrs() is called from multiple
> places without a node database lock at all.
>
> I'm thinking that, to keep my proposed change straight forward
> and on topic, I should just leave kernfs_refresh_inode() taking
> the node db write lock for now and consider the attributes handling
> as a separate change. Once that's done we could reconsider what's
> needed to use the node db read lock in kernfs_refresh_inode().

You meant taking write lock of kernfs_rwsem for kernfs_refresh_inode()??
It may be a lot slower in my benchmark, let me test it.

> It will reduce the effectiveness of the series but it would make
> this change much more complicated, and is somewhat off-topic, and
> could hamper the chances of reviewers spotting problem with it.
>
> Ian
>


thanks,
fox
Ian Kent Dec. 14, 2020, 1:30 p.m. UTC | #34
On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> wrote:
> > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > > > Tejun
> > > > > mentioned here
> > > > > (
> > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > ),
> > > > > maybe a global
> > > > > rwsem for kn->iattr will be better??
> > > > 
> > > > I wasn't sure about that, IIRC a spin lock could be used around
> > > > the
> > > > initial check and checked again at the end which would probably
> > > > have
> > > > been much faster but much less conservative and a bit more ugly
> > > > so
> > > > I just went the conservative path since there was so much
> > > > change
> > > > already.
> > > 
> > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > remember
> > > it.
> > > 
> > > Based on what Tejun said it sounds like that needs work.
> > 
> > Those attribute handling patches were meant to allow taking the rw
> > sem read lock instead of the write lock for kernfs_refresh_inode()
> > updates, with the added locking to protect the inode attributes
> > update since it's called from the VFS both with and without the
> > inode lock.
> 
> Oh, understood. I was asking also because lock on kn->attr_mutex
> drags
> concurrent performance.
> 
> > Looking around it looks like kernfs_iattrs() is called from
> > multiple
> > places without a node database lock at all.
> > 
> > I'm thinking that, to keep my proposed change straight forward
> > and on topic, I should just leave kernfs_refresh_inode() taking
> > the node db write lock for now and consider the attributes handling
> > as a separate change. Once that's done we could reconsider what's
> > needed to use the node db read lock in kernfs_refresh_inode().
> 
> You meant taking write lock of kernfs_rwsem for
> kernfs_refresh_inode()??
> It may be a lot slower in my benchmark, let me test it.

Yes, but make sure the write lock of kernfs_rwsem is being taken
not the read lock.

That's a mistake I had initially?

Still, that attributes handling is, I think, sufficient to warrant
a separate change since it looks like it might need work, the kernfs
node db probably should be kept stable for those attribute updates
but equally the existence of an instantiated dentry might mitigate
the it.

Some people might just know whether it's ok or not but I would like
to check the callers to work out what's going on.

In any case it's academic if GCH isn't willing to consider the series
for review and possible merge.

Ian
Fox Chen Dec. 15, 2020, 8:33 a.m. UTC | #35
On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> wrote:
> > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > > > > Tejun
> > > > > > mentioned here
> > > > > > (
> > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > > ),
> > > > > > maybe a global
> > > > > > rwsem for kn->iattr will be better??
> > > > >
> > > > > I wasn't sure about that, IIRC a spin lock could be used around
> > > > > the
> > > > > initial check and checked again at the end which would probably
> > > > > have
> > > > > been much faster but much less conservative and a bit more ugly
> > > > > so
> > > > > I just went the conservative path since there was so much
> > > > > change
> > > > > already.
> > > >
> > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > remember
> > > > it.
> > > >
> > > > Based on what Tejun said it sounds like that needs work.
> > >
> > > Those attribute handling patches were meant to allow taking the rw
> > > sem read lock instead of the write lock for kernfs_refresh_inode()
> > > updates, with the added locking to protect the inode attributes
> > > update since it's called from the VFS both with and without the
> > > inode lock.
> >
> > Oh, understood. I was asking also because lock on kn->attr_mutex
> > drags
> > concurrent performance.
> >
> > > Looking around it looks like kernfs_iattrs() is called from
> > > multiple
> > > places without a node database lock at all.
> > >
> > > I'm thinking that, to keep my proposed change straight forward
> > > and on topic, I should just leave kernfs_refresh_inode() taking
> > > the node db write lock for now and consider the attributes handling
> > > as a separate change. Once that's done we could reconsider what's
> > > needed to use the node db read lock in kernfs_refresh_inode().
> >
> > You meant taking write lock of kernfs_rwsem for
> > kernfs_refresh_inode()??
> > It may be a lot slower in my benchmark, let me test it.
>
> Yes, but make sure the write lock of kernfs_rwsem is being taken
> not the read lock.
>
> That's a mistake I had initially?
>
> Still, that attributes handling is, I think, sufficient to warrant
> a separate change since it looks like it might need work, the kernfs
> node db probably should be kept stable for those attribute updates
> but equally the existence of an instantiated dentry might mitigate
> the it.
>
> Some people might just know whether it's ok or not but I would like
> to check the callers to work out what's going on.
>
> In any case it's academic if GCH isn't willing to consider the series
> for review and possible merge.
>
Hi Ian

I removed kn->attr_mutex and changed read lock to write lock for
kernfs_refresh_inode

down_write(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
up_write(&kernfs_rwsem);


Unfortunate, changes in this way make things worse,  my benchmark runs
100% slower than upstream sysfs.  :(
open+read+close a sysfs file concurrently took 1000us. (Currently,
sysfs with a big mutex kernfs_mutex only takes ~500us
for one open+read+close operation concurrently)

|--45.93%--kernfs_iop_permission
                                  |                                |
                  |          |          |          |
                                  |                                |
                  |          |          |
|--22.55%--down_write
                                  |                                |
                  |          |          |          |          |
                                  |                                |
                  |          |          |          |
--20.69%--rwsem_down_write_slowpath
                                  |                                |
                  |          |          |          |
  |
                                  |                                |
                  |          |          |          |
  |--8.89%--schedule

perf showed most of the time had been spent on kernfs_iop_permission


thanks,
fox
Ian Kent Dec. 15, 2020, 12:59 p.m. UTC | #36
On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote:
> > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, 
> > > > > > > as
> > > > > > > Tejun
> > > > > > > mentioned here
> > > > > > > (
> > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > > > ),
> > > > > > > maybe a global
> > > > > > > rwsem for kn->iattr will be better??
> > > > > > 
> > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > around
> > > > > > the
> > > > > > initial check and checked again at the end which would
> > > > > > probably
> > > > > > have
> > > > > > been much faster but much less conservative and a bit more
> > > > > > ugly
> > > > > > so
> > > > > > I just went the conservative path since there was so much
> > > > > > change
> > > > > > already.
> > > > > 
> > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > remember
> > > > > it.
> > > > > 
> > > > > Based on what Tejun said it sounds like that needs work.
> > > > 
> > > > Those attribute handling patches were meant to allow taking the
> > > > rw
> > > > sem read lock instead of the write lock for
> > > > kernfs_refresh_inode()
> > > > updates, with the added locking to protect the inode attributes
> > > > update since it's called from the VFS both with and without the
> > > > inode lock.
> > > 
> > > Oh, understood. I was asking also because lock on kn->attr_mutex
> > > drags
> > > concurrent performance.
> > > 
> > > > Looking around it looks like kernfs_iattrs() is called from
> > > > multiple
> > > > places without a node database lock at all.
> > > > 
> > > > I'm thinking that, to keep my proposed change straight forward
> > > > and on topic, I should just leave kernfs_refresh_inode() taking
> > > > the node db write lock for now and consider the attributes
> > > > handling
> > > > as a separate change. Once that's done we could reconsider
> > > > what's
> > > > needed to use the node db read lock in kernfs_refresh_inode().
> > > 
> > > You meant taking write lock of kernfs_rwsem for
> > > kernfs_refresh_inode()??
> > > It may be a lot slower in my benchmark, let me test it.
> > 
> > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > not the read lock.
> > 
> > That's a mistake I had initially?
> > 
> > Still, that attributes handling is, I think, sufficient to warrant
> > a separate change since it looks like it might need work, the
> > kernfs
> > node db probably should be kept stable for those attribute updates
> > but equally the existence of an instantiated dentry might mitigate
> > the it.
> > 
> > Some people might just know whether it's ok or not but I would like
> > to check the callers to work out what's going on.
> > 
> > In any case it's academic if GCH isn't willing to consider the
> > series
> > for review and possible merge.
> > 
> Hi Ian
> 
> I removed kn->attr_mutex and changed read lock to write lock for
> kernfs_refresh_inode
> 
> down_write(&kernfs_rwsem);
> kernfs_refresh_inode(kn, inode);
> up_write(&kernfs_rwsem);
> 
> 
> Unfortunate, changes in this way make things worse,  my benchmark
> runs
> 100% slower than upstream sysfs.  :(
> open+read+close a sysfs file concurrently took 1000us. (Currently,
> sysfs with a big mutex kernfs_mutex only takes ~500us
> for one open+read+close operation concurrently)

Right, so it does need attention nowish.

I'll have a look at it in a while, I really need to get a new autofs
release out, and there are quite a few changes, and testing is seeing
a number of errors, some old, some newly introduced. It's proving
difficult.

> 
> > --45.93%--kernfs_iop_permission
>                                   |                                |
>                   |          |          |          |
>                                   |                                |
>                   |          |          |
> > --22.55%--down_write
>                                   |                                |
>                   |          |          |          |          |
>                                   |                                |
>                   |          |          |          |
> --20.69%--rwsem_down_write_slowpath
>                                   |                                |
>                   |          |          |          |
>   |
>                                   |                                |
>                   |          |          |          |
>   |--8.89%--schedule
> 
> perf showed most of the time had been spent on kernfs_iop_permission
> 
> 
> thanks,
> fox
Ian Kent Dec. 17, 2020, 4:46 a.m. UTC | #37
On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote:
> > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > >attr_mutex, 
> > > > > > > > as
> > > > > > > > Tejun
> > > > > > > > mentioned here
> > > > > > > > (
> > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > > > > ),
> > > > > > > > maybe a global
> > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > 
> > > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > > around
> > > > > > > the
> > > > > > > initial check and checked again at the end which would
> > > > > > > probably
> > > > > > > have
> > > > > > > been much faster but much less conservative and a bit
> > > > > > > more
> > > > > > > ugly
> > > > > > > so
> > > > > > > I just went the conservative path since there was so much
> > > > > > > change
> > > > > > > already.
> > > > > > 
> > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > > remember
> > > > > > it.
> > > > > > 
> > > > > > Based on what Tejun said it sounds like that needs work.
> > > > > 
> > > > > Those attribute handling patches were meant to allow taking
> > > > > the
> > > > > rw
> > > > > sem read lock instead of the write lock for
> > > > > kernfs_refresh_inode()
> > > > > updates, with the added locking to protect the inode
> > > > > attributes
> > > > > update since it's called from the VFS both with and without
> > > > > the
> > > > > inode lock.
> > > > 
> > > > Oh, understood. I was asking also because lock on kn-
> > > > >attr_mutex
> > > > drags
> > > > concurrent performance.
> > > > 
> > > > > Looking around it looks like kernfs_iattrs() is called from
> > > > > multiple
> > > > > places without a node database lock at all.
> > > > > 
> > > > > I'm thinking that, to keep my proposed change straight
> > > > > forward
> > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > taking
> > > > > the node db write lock for now and consider the attributes
> > > > > handling
> > > > > as a separate change. Once that's done we could reconsider
> > > > > what's
> > > > > needed to use the node db read lock in
> > > > > kernfs_refresh_inode().
> > > > 
> > > > You meant taking write lock of kernfs_rwsem for
> > > > kernfs_refresh_inode()??
> > > > It may be a lot slower in my benchmark, let me test it.
> > > 
> > > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > > not the read lock.
> > > 
> > > That's a mistake I had initially?
> > > 
> > > Still, that attributes handling is, I think, sufficient to
> > > warrant
> > > a separate change since it looks like it might need work, the
> > > kernfs
> > > node db probably should be kept stable for those attribute
> > > updates
> > > but equally the existence of an instantiated dentry might
> > > mitigate
> > > the it.
> > > 
> > > Some people might just know whether it's ok or not but I would
> > > like
> > > to check the callers to work out what's going on.
> > > 
> > > In any case it's academic if GCH isn't willing to consider the
> > > series
> > > for review and possible merge.
> > > 
> > Hi Ian
> > 
> > I removed kn->attr_mutex and changed read lock to write lock for
> > kernfs_refresh_inode
> > 
> > down_write(&kernfs_rwsem);
> > kernfs_refresh_inode(kn, inode);
> > up_write(&kernfs_rwsem);
> > 
> > 
> > Unfortunate, changes in this way make things worse,  my benchmark
> > runs
> > 100% slower than upstream sysfs.  :(
> > open+read+close a sysfs file concurrently took 1000us. (Currently,
> > sysfs with a big mutex kernfs_mutex only takes ~500us
> > for one open+read+close operation concurrently)
> 
> Right, so it does need attention nowish.
> 
> I'll have a look at it in a while, I really need to get a new autofs
> release out, and there are quite a few changes, and testing is seeing
> a number of errors, some old, some newly introduced. It's proving
> difficult.

I've taken a breather for the autofs testing and had a look at this.

I think my original analysis of this was wrong.

Could you try this patch please.
I'm not sure how much difference it will make but, in principle,
it's much the same as the previous approach except it doesn't
increase the kernfs node struct size or mess with the other
attribute handling code.

Note, this is not even compile tested.

kernfs: use kernfs read lock in .getattr() and .permission()

From: Ian Kent <raven@themaw.net>

From Documenation/filesystems.rst and (slightly outdated) comments
in fs/attr.c the inode i_rwsem is used for attribute handling.

This lock satisfies the requirememnts needed to reduce lock contention,
namely a per-object lock needs to be used rather than a file system
global lock with the kernfs node db held stable for read operations.

In particular it should reduce lock contention seen when calling the
kernfs .permission() method.

The inode methods .getattr() and .permission() do not hold the inode
i_rwsem lock when called as they are usually read operations. Also
the .permission() method checks for rcu-walk mode and returns -ECHILD
to the VFS if it is set. So the i_rwsem lock can be used in
kernfs_iop_getattr() and kernfs_iop_permission() to protect the inode
update done by kernfs_refresh_inode(). Using this lock allows the
kernfs node db write lock in these functions to be changed to a read
lock.

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index ddaf18198935..568037e9efe9 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	inode_lock(inode);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
+	inode_unlock(inode);
 
 	generic_fillattr(inode, stat);
 	return 0;
@@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode *inode, int mask)
 
 	kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	inode_lock(inode);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
+	inode_unlock(inode);
 
 	return generic_permission(inode, mask);
 }
Fox Chen Dec. 17, 2020, 8:54 a.m. UTC | #38
On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net> wrote:
>
> On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote:
> > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > >attr_mutex,
> > > > > > > > > as
> > > > > > > > > Tejun
> > > > > > > > > mentioned here
> > > > > > > > > (
> > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > > > > > ),
> > > > > > > > > maybe a global
> > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > >
> > > > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > > > around
> > > > > > > > the
> > > > > > > > initial check and checked again at the end which would
> > > > > > > > probably
> > > > > > > > have
> > > > > > > > been much faster but much less conservative and a bit
> > > > > > > > more
> > > > > > > > ugly
> > > > > > > > so
> > > > > > > > I just went the conservative path since there was so much
> > > > > > > > change
> > > > > > > > already.
> > > > > > >
> > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > > > remember
> > > > > > > it.
> > > > > > >
> > > > > > > Based on what Tejun said it sounds like that needs work.
> > > > > >
> > > > > > Those attribute handling patches were meant to allow taking
> > > > > > the
> > > > > > rw
> > > > > > sem read lock instead of the write lock for
> > > > > > kernfs_refresh_inode()
> > > > > > updates, with the added locking to protect the inode
> > > > > > attributes
> > > > > > update since it's called from the VFS both with and without
> > > > > > the
> > > > > > inode lock.
> > > > >
> > > > > Oh, understood. I was asking also because lock on kn-
> > > > > >attr_mutex
> > > > > drags
> > > > > concurrent performance.
> > > > >
> > > > > > Looking around it looks like kernfs_iattrs() is called from
> > > > > > multiple
> > > > > > places without a node database lock at all.
> > > > > >
> > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > forward
> > > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > > taking
> > > > > > the node db write lock for now and consider the attributes
> > > > > > handling
> > > > > > as a separate change. Once that's done we could reconsider
> > > > > > what's
> > > > > > needed to use the node db read lock in
> > > > > > kernfs_refresh_inode().
> > > > >
> > > > > You meant taking write lock of kernfs_rwsem for
> > > > > kernfs_refresh_inode()??
> > > > > It may be a lot slower in my benchmark, let me test it.
> > > >
> > > > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > > > not the read lock.
> > > >
> > > > That's a mistake I had initially?
> > > >
> > > > Still, that attributes handling is, I think, sufficient to
> > > > warrant
> > > > a separate change since it looks like it might need work, the
> > > > kernfs
> > > > node db probably should be kept stable for those attribute
> > > > updates
> > > > but equally the existence of an instantiated dentry might
> > > > mitigate
> > > > the it.
> > > >
> > > > Some people might just know whether it's ok or not but I would
> > > > like
> > > > to check the callers to work out what's going on.
> > > >
> > > > In any case it's academic if GCH isn't willing to consider the
> > > > series
> > > > for review and possible merge.
> > > >
> > > Hi Ian
> > >
> > > I removed kn->attr_mutex and changed read lock to write lock for
> > > kernfs_refresh_inode
> > >
> > > down_write(&kernfs_rwsem);
> > > kernfs_refresh_inode(kn, inode);
> > > up_write(&kernfs_rwsem);
> > >
> > >
> > > Unfortunate, changes in this way make things worse,  my benchmark
> > > runs
> > > 100% slower than upstream sysfs.  :(
> > > open+read+close a sysfs file concurrently took 1000us. (Currently,
> > > sysfs with a big mutex kernfs_mutex only takes ~500us
> > > for one open+read+close operation concurrently)
> >
> > Right, so it does need attention nowish.
> >
> > I'll have a look at it in a while, I really need to get a new autofs
> > release out, and there are quite a few changes, and testing is seeing
> > a number of errors, some old, some newly introduced. It's proving
> > difficult.
>
> I've taken a breather for the autofs testing and had a look at this.

Thanks. :)

> I think my original analysis of this was wrong.
>
> Could you try this patch please.
> I'm not sure how much difference it will make but, in principle,
> it's much the same as the previous approach except it doesn't
> increase the kernfs node struct size or mess with the other
> attribute handling code.
>
> Note, this is not even compile tested.

I failed to apply this patch. So based on the original six patches, I
manually removed kn->attr_mutex, and added
inode_lock/inode_unlock to those two functions, they were like:

int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
                       u32 request_mask, unsigned int query_flags)
{
        struct inode *inode = d_inode(path->dentry);
        struct kernfs_node *kn = inode->i_private;

        inode_lock(inode);
        down_read(&kernfs_rwsem);
        kernfs_refresh_inode(kn, inode);
        up_read(&kernfs_rwsem);
        inode_unlock(inode);

        generic_fillattr(inode, stat);
        return 0;
}

int kernfs_iop_permission(struct inode *inode, int mask)
{
        struct kernfs_node *kn;

        if (mask & MAY_NOT_BLOCK)
                return -ECHILD;

        kn = inode->i_private;

        inode_lock(inode);
        down_read(&kernfs_rwsem);
        kernfs_refresh_inode(kn, inode);
        up_read(&kernfs_rwsem);
        inode_unlock(inode);

        return generic_permission(inode, mask);
}

But I couldn't boot the kernel and there was no error on the screen.
I guess it was deadlocked on /sys creation?? :D

> kernfs: use kernfs read lock in .getattr() and .permission()
>
> From: Ian Kent <raven@themaw.net>
>
> From Documenation/filesystems.rst and (slightly outdated) comments
> in fs/attr.c the inode i_rwsem is used for attribute handling.
>
> This lock satisfies the requirememnts needed to reduce lock contention,
> namely a per-object lock needs to be used rather than a file system
> global lock with the kernfs node db held stable for read operations.
>
> In particular it should reduce lock contention seen when calling the
> kernfs .permission() method.
>
> The inode methods .getattr() and .permission() do not hold the inode
> i_rwsem lock when called as they are usually read operations. Also
> the .permission() method checks for rcu-walk mode and returns -ECHILD
> to the VFS if it is set. So the i_rwsem lock can be used in
> kernfs_iop_getattr() and kernfs_iop_permission() to protect the inode
> update done by kernfs_refresh_inode(). Using this lock allows the
> kernfs node db write lock in these functions to be changed to a read
> lock.
>
> Signed-off-by: Ian Kent <raven@themaw.net>
> ---
>  fs/kernfs/inode.c |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index ddaf18198935..568037e9efe9 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
>         struct inode *inode = d_inode(path->dentry);
>         struct kernfs_node *kn = inode->i_private;
>
> -       down_write(&kernfs_rwsem);
> +       inode_lock(inode);
> +       down_read(&kernfs_rwsem);
>         kernfs_refresh_inode(kn, inode);
> -       up_write(&kernfs_rwsem);
> +       up_read(&kernfs_rwsem);
> +       inode_unlock(inode);
>
>         generic_fillattr(inode, stat);
>         return 0;
> @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode *inode, int mask)
>
>         kn = inode->i_private;
>
> -       down_write(&kernfs_rwsem);
> +       inode_lock(inode);
> +       down_read(&kernfs_rwsem);
>         kernfs_refresh_inode(kn, inode);
> -       up_write(&kernfs_rwsem);
> +       up_read(&kernfs_rwsem);
> +       inode_unlock(inode);
>
>         return generic_permission(inode, mask);
>  }
>


thanks,
fox
Ian Kent Dec. 17, 2020, 10:09 a.m. UTC | #39
On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net> wrote:
> > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net
> > > > > > >
> > > > > > wrote:
> > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > > > attr_mutex,
> > > > > > > > > > as
> > > > > > > > > > Tejun
> > > > > > > > > > mentioned here
> > > > > > > > > > (
> > > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > > > > > > ),
> > > > > > > > > > maybe a global
> > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > 
> > > > > > > > > I wasn't sure about that, IIRC a spin lock could be
> > > > > > > > > used
> > > > > > > > > around
> > > > > > > > > the
> > > > > > > > > initial check and checked again at the end which
> > > > > > > > > would
> > > > > > > > > probably
> > > > > > > > > have
> > > > > > > > > been much faster but much less conservative and a bit
> > > > > > > > > more
> > > > > > > > > ugly
> > > > > > > > > so
> > > > > > > > > I just went the conservative path since there was so
> > > > > > > > > much
> > > > > > > > > change
> > > > > > > > > already.
> > > > > > > > 
> > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > didn't
> > > > > > > > remember
> > > > > > > > it.
> > > > > > > > 
> > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > work.
> > > > > > > 
> > > > > > > Those attribute handling patches were meant to allow
> > > > > > > taking
> > > > > > > the
> > > > > > > rw
> > > > > > > sem read lock instead of the write lock for
> > > > > > > kernfs_refresh_inode()
> > > > > > > updates, with the added locking to protect the inode
> > > > > > > attributes
> > > > > > > update since it's called from the VFS both with and
> > > > > > > without
> > > > > > > the
> > > > > > > inode lock.
> > > > > > 
> > > > > > Oh, understood. I was asking also because lock on kn-
> > > > > > > attr_mutex
> > > > > > drags
> > > > > > concurrent performance.
> > > > > > 
> > > > > > > Looking around it looks like kernfs_iattrs() is called
> > > > > > > from
> > > > > > > multiple
> > > > > > > places without a node database lock at all.
> > > > > > > 
> > > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > > forward
> > > > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > > > taking
> > > > > > > the node db write lock for now and consider the
> > > > > > > attributes
> > > > > > > handling
> > > > > > > as a separate change. Once that's done we could
> > > > > > > reconsider
> > > > > > > what's
> > > > > > > needed to use the node db read lock in
> > > > > > > kernfs_refresh_inode().
> > > > > > 
> > > > > > You meant taking write lock of kernfs_rwsem for
> > > > > > kernfs_refresh_inode()??
> > > > > > It may be a lot slower in my benchmark, let me test it.
> > > > > 
> > > > > Yes, but make sure the write lock of kernfs_rwsem is being
> > > > > taken
> > > > > not the read lock.
> > > > > 
> > > > > That's a mistake I had initially?
> > > > > 
> > > > > Still, that attributes handling is, I think, sufficient to
> > > > > warrant
> > > > > a separate change since it looks like it might need work, the
> > > > > kernfs
> > > > > node db probably should be kept stable for those attribute
> > > > > updates
> > > > > but equally the existence of an instantiated dentry might
> > > > > mitigate
> > > > > the it.
> > > > > 
> > > > > Some people might just know whether it's ok or not but I
> > > > > would
> > > > > like
> > > > > to check the callers to work out what's going on.
> > > > > 
> > > > > In any case it's academic if GCH isn't willing to consider
> > > > > the
> > > > > series
> > > > > for review and possible merge.
> > > > > 
> > > > Hi Ian
> > > > 
> > > > I removed kn->attr_mutex and changed read lock to write lock
> > > > for
> > > > kernfs_refresh_inode
> > > > 
> > > > down_write(&kernfs_rwsem);
> > > > kernfs_refresh_inode(kn, inode);
> > > > up_write(&kernfs_rwsem);
> > > > 
> > > > 
> > > > Unfortunate, changes in this way make things worse,  my
> > > > benchmark
> > > > runs
> > > > 100% slower than upstream sysfs.  :(
> > > > open+read+close a sysfs file concurrently took 1000us.
> > > > (Currently,
> > > > sysfs with a big mutex kernfs_mutex only takes ~500us
> > > > for one open+read+close operation concurrently)
> > > 
> > > Right, so it does need attention nowish.
> > > 
> > > I'll have a look at it in a while, I really need to get a new
> > > autofs
> > > release out, and there are quite a few changes, and testing is
> > > seeing
> > > a number of errors, some old, some newly introduced. It's proving
> > > difficult.
> > 
> > I've taken a breather for the autofs testing and had a look at
> > this.
> 
> Thanks. :)
> 
> > I think my original analysis of this was wrong.
> > 
> > Could you try this patch please.
> > I'm not sure how much difference it will make but, in principle,
> > it's much the same as the previous approach except it doesn't
> > increase the kernfs node struct size or mess with the other
> > attribute handling code.
> > 
> > Note, this is not even compile tested.
> 
> I failed to apply this patch. So based on the original six patches, I
> manually removed kn->attr_mutex, and added
> inode_lock/inode_unlock to those two functions, they were like:
> 
> int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
>                        u32 request_mask, unsigned int query_flags)
> {
>         struct inode *inode = d_inode(path->dentry);
>         struct kernfs_node *kn = inode->i_private;
> 
>         inode_lock(inode);
>         down_read(&kernfs_rwsem);
>         kernfs_refresh_inode(kn, inode);
>         up_read(&kernfs_rwsem);
>         inode_unlock(inode);
> 
>         generic_fillattr(inode, stat);
>         return 0;
> }
> 
> int kernfs_iop_permission(struct inode *inode, int mask)
> {
>         struct kernfs_node *kn;
> 
>         if (mask & MAY_NOT_BLOCK)
>                 return -ECHILD;
> 
>         kn = inode->i_private;
> 
>         inode_lock(inode);
>         down_read(&kernfs_rwsem);
>         kernfs_refresh_inode(kn, inode);
>         up_read(&kernfs_rwsem);
>         inode_unlock(inode);
> 
>         return generic_permission(inode, mask);
> }
> 
> But I couldn't boot the kernel and there was no error on the screen.
> I guess it was deadlocked on /sys creation?? :D

Right, I guess the locking documentation is out of date. I'm guessing
the inode lock is taken somewhere over the .permission() call. If that
usage is consistent it's easy fixed, if the usage is inconsistent it's
hard to deal with and amounts to a bug.

I'll have another look at it.

Also, it sounds like I'm working from a more recent series.

I had 8 patches, dropped the last three and added the one I posted.
If I can work out what's going on I'll post the series for you to
check.

Ian

> 
> > kernfs: use kernfs read lock in .getattr() and .permission()
> > 
> > From: Ian Kent <raven@themaw.net>
> > 
> > From Documenation/filesystems.rst and (slightly outdated) comments
> > in fs/attr.c the inode i_rwsem is used for attribute handling.
> > 
> > This lock satisfies the requirememnts needed to reduce lock
> > contention,
> > namely a per-object lock needs to be used rather than a file system
> > global lock with the kernfs node db held stable for read
> > operations.
> > 
> > In particular it should reduce lock contention seen when calling
> > the
> > kernfs .permission() method.
> > 
> > The inode methods .getattr() and .permission() do not hold the
> > inode
> > i_rwsem lock when called as they are usually read operations. Also
> > the .permission() method checks for rcu-walk mode and returns
> > -ECHILD
> > to the VFS if it is set. So the i_rwsem lock can be used in
> > kernfs_iop_getattr() and kernfs_iop_permission() to protect the
> > inode
> > update done by kernfs_refresh_inode(). Using this lock allows the
> > kernfs node db write lock in these functions to be changed to a
> > read
> > lock.
> > 
> > Signed-off-by: Ian Kent <raven@themaw.net>
> > ---
> >  fs/kernfs/inode.c |   12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > index ddaf18198935..568037e9efe9 100644
> > --- a/fs/kernfs/inode.c
> > +++ b/fs/kernfs/inode.c
> > @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path
> > *path, struct kstat *stat,
> >         struct inode *inode = d_inode(path->dentry);
> >         struct kernfs_node *kn = inode->i_private;
> > 
> > -       down_write(&kernfs_rwsem);
> > +       inode_lock(inode);
> > +       down_read(&kernfs_rwsem);
> >         kernfs_refresh_inode(kn, inode);
> > -       up_write(&kernfs_rwsem);
> > +       up_read(&kernfs_rwsem);
> > +       inode_unlock(inode);
> > 
> >         generic_fillattr(inode, stat);
> >         return 0;
> > @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode *inode,
> > int mask)
> > 
> >         kn = inode->i_private;
> > 
> > -       down_write(&kernfs_rwsem);
> > +       inode_lock(inode);
> > +       down_read(&kernfs_rwsem);
> >         kernfs_refresh_inode(kn, inode);
> > -       up_write(&kernfs_rwsem);
> > +       up_read(&kernfs_rwsem);
> > +       inode_unlock(inode);
> > 
> >         return generic_permission(inode, mask);
> >  }
> > 
> 
> thanks,
> fox
Ian Kent Dec. 17, 2020, 11:09 a.m. UTC | #40
On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote:
> On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net> wrote:
> > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <
> > > > > > > raven@themaw.net
> > > > > > > wrote:
> > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > > > > attr_mutex,
> > > > > > > > > > > as
> > > > > > > > > > > Tejun
> > > > > > > > > > > mentioned here
> > > > > > > > > > > (
> > > > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > > > > > > > ),
> > > > > > > > > > > maybe a global
> > > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > > 
> > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be
> > > > > > > > > > used
> > > > > > > > > > around
> > > > > > > > > > the
> > > > > > > > > > initial check and checked again at the end which
> > > > > > > > > > would
> > > > > > > > > > probably
> > > > > > > > > > have
> > > > > > > > > > been much faster but much less conservative and a
> > > > > > > > > > bit
> > > > > > > > > > more
> > > > > > > > > > ugly
> > > > > > > > > > so
> > > > > > > > > > I just went the conservative path since there was
> > > > > > > > > > so
> > > > > > > > > > much
> > > > > > > > > > change
> > > > > > > > > > already.
> > > > > > > > > 
> > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > > didn't
> > > > > > > > > remember
> > > > > > > > > it.
> > > > > > > > > 
> > > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > > work.
> > > > > > > > 
> > > > > > > > Those attribute handling patches were meant to allow
> > > > > > > > taking
> > > > > > > > the
> > > > > > > > rw
> > > > > > > > sem read lock instead of the write lock for
> > > > > > > > kernfs_refresh_inode()
> > > > > > > > updates, with the added locking to protect the inode
> > > > > > > > attributes
> > > > > > > > update since it's called from the VFS both with and
> > > > > > > > without
> > > > > > > > the
> > > > > > > > inode lock.
> > > > > > > 
> > > > > > > Oh, understood. I was asking also because lock on kn-
> > > > > > > > attr_mutex
> > > > > > > drags
> > > > > > > concurrent performance.
> > > > > > > 
> > > > > > > > Looking around it looks like kernfs_iattrs() is called
> > > > > > > > from
> > > > > > > > multiple
> > > > > > > > places without a node database lock at all.
> > > > > > > > 
> > > > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > > > forward
> > > > > > > > and on topic, I should just leave
> > > > > > > > kernfs_refresh_inode()
> > > > > > > > taking
> > > > > > > > the node db write lock for now and consider the
> > > > > > > > attributes
> > > > > > > > handling
> > > > > > > > as a separate change. Once that's done we could
> > > > > > > > reconsider
> > > > > > > > what's
> > > > > > > > needed to use the node db read lock in
> > > > > > > > kernfs_refresh_inode().
> > > > > > > 
> > > > > > > You meant taking write lock of kernfs_rwsem for
> > > > > > > kernfs_refresh_inode()??
> > > > > > > It may be a lot slower in my benchmark, let me test it.
> > > > > > 
> > > > > > Yes, but make sure the write lock of kernfs_rwsem is being
> > > > > > taken
> > > > > > not the read lock.
> > > > > > 
> > > > > > That's a mistake I had initially?
> > > > > > 
> > > > > > Still, that attributes handling is, I think, sufficient to
> > > > > > warrant
> > > > > > a separate change since it looks like it might need work,
> > > > > > the
> > > > > > kernfs
> > > > > > node db probably should be kept stable for those attribute
> > > > > > updates
> > > > > > but equally the existence of an instantiated dentry might
> > > > > > mitigate
> > > > > > the it.
> > > > > > 
> > > > > > Some people might just know whether it's ok or not but I
> > > > > > would
> > > > > > like
> > > > > > to check the callers to work out what's going on.
> > > > > > 
> > > > > > In any case it's academic if GCH isn't willing to consider
> > > > > > the
> > > > > > series
> > > > > > for review and possible merge.
> > > > > > 
> > > > > Hi Ian
> > > > > 
> > > > > I removed kn->attr_mutex and changed read lock to write lock
> > > > > for
> > > > > kernfs_refresh_inode
> > > > > 
> > > > > down_write(&kernfs_rwsem);
> > > > > kernfs_refresh_inode(kn, inode);
> > > > > up_write(&kernfs_rwsem);
> > > > > 
> > > > > 
> > > > > Unfortunate, changes in this way make things worse,  my
> > > > > benchmark
> > > > > runs
> > > > > 100% slower than upstream sysfs.  :(
> > > > > open+read+close a sysfs file concurrently took 1000us.
> > > > > (Currently,
> > > > > sysfs with a big mutex kernfs_mutex only takes ~500us
> > > > > for one open+read+close operation concurrently)
> > > > 
> > > > Right, so it does need attention nowish.
> > > > 
> > > > I'll have a look at it in a while, I really need to get a new
> > > > autofs
> > > > release out, and there are quite a few changes, and testing is
> > > > seeing
> > > > a number of errors, some old, some newly introduced. It's
> > > > proving
> > > > difficult.
> > > 
> > > I've taken a breather for the autofs testing and had a look at
> > > this.
> > 
> > Thanks. :)
> > 
> > > I think my original analysis of this was wrong.
> > > 
> > > Could you try this patch please.
> > > I'm not sure how much difference it will make but, in principle,
> > > it's much the same as the previous approach except it doesn't
> > > increase the kernfs node struct size or mess with the other
> > > attribute handling code.
> > > 
> > > Note, this is not even compile tested.
> > 
> > I failed to apply this patch. So based on the original six patches,
> > I
> > manually removed kn->attr_mutex, and added
> > inode_lock/inode_unlock to those two functions, they were like:
> > 
> > int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
> >                        u32 request_mask, unsigned int query_flags)
> > {
> >         struct inode *inode = d_inode(path->dentry);
> >         struct kernfs_node *kn = inode->i_private;
> > 
> >         inode_lock(inode);
> >         down_read(&kernfs_rwsem);
> >         kernfs_refresh_inode(kn, inode);
> >         up_read(&kernfs_rwsem);
> >         inode_unlock(inode);
> > 
> >         generic_fillattr(inode, stat);
> >         return 0;
> > }
> > 
> > int kernfs_iop_permission(struct inode *inode, int mask)
> > {
> >         struct kernfs_node *kn;
> > 
> >         if (mask & MAY_NOT_BLOCK)
> >                 return -ECHILD;
> > 
> >         kn = inode->i_private;
> > 
> >         inode_lock(inode);
> >         down_read(&kernfs_rwsem);
> >         kernfs_refresh_inode(kn, inode);
> >         up_read(&kernfs_rwsem);
> >         inode_unlock(inode);
> > 
> >         return generic_permission(inode, mask);
> > }
> > 
> > But I couldn't boot the kernel and there was no error on the
> > screen.
> > I guess it was deadlocked on /sys creation?? :D
> 
> Right, I guess the locking documentation is out of date. I'm guessing
> the inode lock is taken somewhere over the .permission() call. If
> that
> usage is consistent it's easy fixed, if the usage is inconsistent
> it's
> hard to deal with and amounts to a bug.

Yes, it is called, both shared on open, and exclusive on open
create, and without the inode lock at all at the start of path
resolution.

That can't really be called a VFS bug since .permission() is
meant to check permissions not update the inode.

This is probably what lead to the attr patches I had.

If a suitable place to put a local per-object lock can't be
found for this, other than in the kernfs_node, then it's a
real problem from a contention POV.

What could be done is to make the kernfs node attr_mutex
a pointer and dynamically allocate it but even that is too
costly a size addition to the kernfs node structure as
Tejun has said.

Those patches I referred to clearly aren't finished because
the eighth one is empty, which followed a patch I have titled
"kernfs: make attr_mutex a local kernfs node lock".

I obviously gave up on it when the series was rejected.
But I'll give it some more thought.

Ian

> 
> I'll have another look at it.
> 
> Also, it sounds like I'm working from a more recent series.
> 
> I had 8 patches, dropped the last three and added the one I posted.
> If I can work out what's going on I'll post the series for you to
> check.
> 
> Ian
> 
> > > kernfs: use kernfs read lock in .getattr() and .permission()
> > > 
> > > From: Ian Kent <raven@themaw.net>
> > > 
> > > From Documenation/filesystems.rst and (slightly outdated)
> > > comments
> > > in fs/attr.c the inode i_rwsem is used for attribute handling.
> > > 
> > > This lock satisfies the requirememnts needed to reduce lock
> > > contention,
> > > namely a per-object lock needs to be used rather than a file
> > > system
> > > global lock with the kernfs node db held stable for read
> > > operations.
> > > 
> > > In particular it should reduce lock contention seen when calling
> > > the
> > > kernfs .permission() method.
> > > 
> > > The inode methods .getattr() and .permission() do not hold the
> > > inode
> > > i_rwsem lock when called as they are usually read operations.
> > > Also
> > > the .permission() method checks for rcu-walk mode and returns
> > > -ECHILD
> > > to the VFS if it is set. So the i_rwsem lock can be used in
> > > kernfs_iop_getattr() and kernfs_iop_permission() to protect the
> > > inode
> > > update done by kernfs_refresh_inode(). Using this lock allows the
> > > kernfs node db write lock in these functions to be changed to a
> > > read
> > > lock.
> > > 
> > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > ---
> > >  fs/kernfs/inode.c |   12 ++++++++----
> > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > index ddaf18198935..568037e9efe9 100644
> > > --- a/fs/kernfs/inode.c
> > > +++ b/fs/kernfs/inode.c
> > > @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path
> > > *path, struct kstat *stat,
> > >         struct inode *inode = d_inode(path->dentry);
> > >         struct kernfs_node *kn = inode->i_private;
> > > 
> > > -       down_write(&kernfs_rwsem);
> > > +       inode_lock(inode);
> > > +       down_read(&kernfs_rwsem);
> > >         kernfs_refresh_inode(kn, inode);
> > > -       up_write(&kernfs_rwsem);
> > > +       up_read(&kernfs_rwsem);
> > > +       inode_unlock(inode);
> > > 
> > >         generic_fillattr(inode, stat);
> > >         return 0;
> > > @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode
> > > *inode,
> > > int mask)
> > > 
> > >         kn = inode->i_private;
> > > 
> > > -       down_write(&kernfs_rwsem);
> > > +       inode_lock(inode);
> > > +       down_read(&kernfs_rwsem);
> > >         kernfs_refresh_inode(kn, inode);
> > > -       up_write(&kernfs_rwsem);
> > > +       up_read(&kernfs_rwsem);
> > > +       inode_unlock(inode);
> > > 
> > >         return generic_permission(inode, mask);
> > >  }
> > > 
> > 
> > thanks,
> > fox
Ian Kent Dec. 17, 2020, 11:48 a.m. UTC | #41
On Thu, 2020-12-17 at 19:09 +0800, Ian Kent wrote:
> On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote:
> > On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> > > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net>
> > > > > > wrote:
> > > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <
> > > > > > > > raven@themaw.net
> > > > > > > > wrote:
> > > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > > > > > attr_mutex,
> > > > > > > > > > > > as
> > > > > > > > > > > > Tejun
> > > > > > > > > > > > mentioned here
> > > > > > > > > > > > (
> > > > > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/
> > > > > > > > > > > > ),
> > > > > > > > > > > > maybe a global
> > > > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > > > 
> > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could
> > > > > > > > > > > be
> > > > > > > > > > > used
> > > > > > > > > > > around
> > > > > > > > > > > the
> > > > > > > > > > > initial check and checked again at the end which
> > > > > > > > > > > would
> > > > > > > > > > > probably
> > > > > > > > > > > have
> > > > > > > > > > > been much faster but much less conservative and a
> > > > > > > > > > > bit
> > > > > > > > > > > more
> > > > > > > > > > > ugly
> > > > > > > > > > > so
> > > > > > > > > > > I just went the conservative path since there was
> > > > > > > > > > > so
> > > > > > > > > > > much
> > > > > > > > > > > change
> > > > > > > > > > > already.
> > > > > > > > > > 
> > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > > > didn't
> > > > > > > > > > remember
> > > > > > > > > > it.
> > > > > > > > > > 
> > > > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > > > work.
> > > > > > > > > 
> > > > > > > > > Those attribute handling patches were meant to allow
> > > > > > > > > taking
> > > > > > > > > the
> > > > > > > > > rw
> > > > > > > > > sem read lock instead of the write lock for
> > > > > > > > > kernfs_refresh_inode()
> > > > > > > > > updates, with the added locking to protect the inode
> > > > > > > > > attributes
> > > > > > > > > update since it's called from the VFS both with and
> > > > > > > > > without
> > > > > > > > > the
> > > > > > > > > inode lock.
> > > > > > > > 
> > > > > > > > Oh, understood. I was asking also because lock on kn-
> > > > > > > > > attr_mutex
> > > > > > > > drags
> > > > > > > > concurrent performance.
> > > > > > > > 
> > > > > > > > > Looking around it looks like kernfs_iattrs() is
> > > > > > > > > called
> > > > > > > > > from
> > > > > > > > > multiple
> > > > > > > > > places without a node database lock at all.
> > > > > > > > > 
> > > > > > > > > I'm thinking that, to keep my proposed change
> > > > > > > > > straight
> > > > > > > > > forward
> > > > > > > > > and on topic, I should just leave
> > > > > > > > > kernfs_refresh_inode()
> > > > > > > > > taking
> > > > > > > > > the node db write lock for now and consider the
> > > > > > > > > attributes
> > > > > > > > > handling
> > > > > > > > > as a separate change. Once that's done we could
> > > > > > > > > reconsider
> > > > > > > > > what's
> > > > > > > > > needed to use the node db read lock in
> > > > > > > > > kernfs_refresh_inode().
> > > > > > > > 
> > > > > > > > You meant taking write lock of kernfs_rwsem for
> > > > > > > > kernfs_refresh_inode()??
> > > > > > > > It may be a lot slower in my benchmark, let me test it.
> > > > > > > 
> > > > > > > Yes, but make sure the write lock of kernfs_rwsem is
> > > > > > > being
> > > > > > > taken
> > > > > > > not the read lock.
> > > > > > > 
> > > > > > > That's a mistake I had initially?
> > > > > > > 
> > > > > > > Still, that attributes handling is, I think, sufficient
> > > > > > > to
> > > > > > > warrant
> > > > > > > a separate change since it looks like it might need work,
> > > > > > > the
> > > > > > > kernfs
> > > > > > > node db probably should be kept stable for those
> > > > > > > attribute
> > > > > > > updates
> > > > > > > but equally the existence of an instantiated dentry might
> > > > > > > mitigate
> > > > > > > the it.
> > > > > > > 
> > > > > > > Some people might just know whether it's ok or not but I
> > > > > > > would
> > > > > > > like
> > > > > > > to check the callers to work out what's going on.
> > > > > > > 
> > > > > > > In any case it's academic if GCH isn't willing to
> > > > > > > consider
> > > > > > > the
> > > > > > > series
> > > > > > > for review and possible merge.
> > > > > > > 
> > > > > > Hi Ian
> > > > > > 
> > > > > > I removed kn->attr_mutex and changed read lock to write
> > > > > > lock
> > > > > > for
> > > > > > kernfs_refresh_inode
> > > > > > 
> > > > > > down_write(&kernfs_rwsem);
> > > > > > kernfs_refresh_inode(kn, inode);
> > > > > > up_write(&kernfs_rwsem);
> > > > > > 
> > > > > > 
> > > > > > Unfortunate, changes in this way make things worse,  my
> > > > > > benchmark
> > > > > > runs
> > > > > > 100% slower than upstream sysfs.  :(
> > > > > > open+read+close a sysfs file concurrently took 1000us.
> > > > > > (Currently,
> > > > > > sysfs with a big mutex kernfs_mutex only takes ~500us
> > > > > > for one open+read+close operation concurrently)
> > > > > 
> > > > > Right, so it does need attention nowish.
> > > > > 
> > > > > I'll have a look at it in a while, I really need to get a new
> > > > > autofs
> > > > > release out, and there are quite a few changes, and testing
> > > > > is
> > > > > seeing
> > > > > a number of errors, some old, some newly introduced. It's
> > > > > proving
> > > > > difficult.
> > > > 
> > > > I've taken a breather for the autofs testing and had a look at
> > > > this.
> > > 
> > > Thanks. :)
> > > 
> > > > I think my original analysis of this was wrong.
> > > > 
> > > > Could you try this patch please.
> > > > I'm not sure how much difference it will make but, in
> > > > principle,
> > > > it's much the same as the previous approach except it doesn't
> > > > increase the kernfs node struct size or mess with the other
> > > > attribute handling code.
> > > > 
> > > > Note, this is not even compile tested.
> > > 
> > > I failed to apply this patch. So based on the original six
> > > patches,
> > > I
> > > manually removed kn->attr_mutex, and added
> > > inode_lock/inode_unlock to those two functions, they were like:
> > > 
> > > int kernfs_iop_getattr(const struct path *path, struct kstat
> > > *stat,
> > >                        u32 request_mask, unsigned int
> > > query_flags)
> > > {
> > >         struct inode *inode = d_inode(path->dentry);
> > >         struct kernfs_node *kn = inode->i_private;
> > > 
> > >         inode_lock(inode);
> > >         down_read(&kernfs_rwsem);
> > >         kernfs_refresh_inode(kn, inode);
> > >         up_read(&kernfs_rwsem);
> > >         inode_unlock(inode);
> > > 
> > >         generic_fillattr(inode, stat);
> > >         return 0;
> > > }
> > > 
> > > int kernfs_iop_permission(struct inode *inode, int mask)
> > > {
> > >         struct kernfs_node *kn;
> > > 
> > >         if (mask & MAY_NOT_BLOCK)
> > >                 return -ECHILD;
> > > 
> > >         kn = inode->i_private;
> > > 
> > >         inode_lock(inode);
> > >         down_read(&kernfs_rwsem);
> > >         kernfs_refresh_inode(kn, inode);
> > >         up_read(&kernfs_rwsem);
> > >         inode_unlock(inode);
> > > 
> > >         return generic_permission(inode, mask);
> > > }
> > > 
> > > But I couldn't boot the kernel and there was no error on the
> > > screen.
> > > I guess it was deadlocked on /sys creation?? :D
> > 
> > Right, I guess the locking documentation is out of date. I'm
> > guessing
> > the inode lock is taken somewhere over the .permission() call. If
> > that
> > usage is consistent it's easy fixed, if the usage is inconsistent
> > it's
> > hard to deal with and amounts to a bug.
> 
> Yes, it is called, both shared on open, and exclusive on open
> create, and without the inode lock at all at the start of path
> resolution.
> 
> That can't really be called a VFS bug since .permission() is
> meant to check permissions not update the inode.
> 
> This is probably what lead to the attr patches I had.
> 
> If a suitable place to put a local per-object lock can't be
> found for this, other than in the kernfs_node, then it's a
> real problem from a contention POV.
> 
> What could be done is to make the kernfs node attr_mutex
> a pointer and dynamically allocate it but even that is too
> costly a size addition to the kernfs node structure as
> Tejun has said.

I guess the question to ask is, is there really a need to
call kernfs_refresh_inode() from functions that are usually
reading/checking functions.

Would it be sufficient to refresh the inode in the write/set
operations in (if there's any) places where things like
setattr_copy() is not already called?

Perhaps GKH or Tejun could comment on this?

Ian

> 
> Those patches I referred to clearly aren't finished because
> the eighth one is empty, which followed a patch I have titled
> "kernfs: make attr_mutex a local kernfs node lock".
> 
> I obviously gave up on it when the series was rejected.
> But I'll give it some more thought.
> 
> Ian
> 
> > I'll have another look at it.
> > 
> > Also, it sounds like I'm working from a more recent series.
> > 
> > I had 8 patches, dropped the last three and added the one I posted.
> > If I can work out what's going on I'll post the series for you to
> > check.
> > 
> > Ian
> > 
> > > > kernfs: use kernfs read lock in .getattr() and .permission()
> > > > 
> > > > From: Ian Kent <raven@themaw.net>
> > > > 
> > > > From Documenation/filesystems.rst and (slightly outdated)
> > > > comments
> > > > in fs/attr.c the inode i_rwsem is used for attribute handling.
> > > > 
> > > > This lock satisfies the requirememnts needed to reduce lock
> > > > contention,
> > > > namely a per-object lock needs to be used rather than a file
> > > > system
> > > > global lock with the kernfs node db held stable for read
> > > > operations.
> > > > 
> > > > In particular it should reduce lock contention seen when
> > > > calling
> > > > the
> > > > kernfs .permission() method.
> > > > 
> > > > The inode methods .getattr() and .permission() do not hold the
> > > > inode
> > > > i_rwsem lock when called as they are usually read operations.
> > > > Also
> > > > the .permission() method checks for rcu-walk mode and returns
> > > > -ECHILD
> > > > to the VFS if it is set. So the i_rwsem lock can be used in
> > > > kernfs_iop_getattr() and kernfs_iop_permission() to protect the
> > > > inode
> > > > update done by kernfs_refresh_inode(). Using this lock allows
> > > > the
> > > > kernfs node db write lock in these functions to be changed to a
> > > > read
> > > > lock.
> > > > 
> > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > ---
> > > >  fs/kernfs/inode.c |   12 ++++++++----
> > > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > > index ddaf18198935..568037e9efe9 100644
> > > > --- a/fs/kernfs/inode.c
> > > > +++ b/fs/kernfs/inode.c
> > > > @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path
> > > > *path, struct kstat *stat,
> > > >         struct inode *inode = d_inode(path->dentry);
> > > >         struct kernfs_node *kn = inode->i_private;
> > > > 
> > > > -       down_write(&kernfs_rwsem);
> > > > +       inode_lock(inode);
> > > > +       down_read(&kernfs_rwsem);
> > > >         kernfs_refresh_inode(kn, inode);
> > > > -       up_write(&kernfs_rwsem);
> > > > +       up_read(&kernfs_rwsem);
> > > > +       inode_unlock(inode);
> > > > 
> > > >         generic_fillattr(inode, stat);
> > > >         return 0;
> > > > @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode
> > > > *inode,
> > > > int mask)
> > > > 
> > > >         kn = inode->i_private;
> > > > 
> > > > -       down_write(&kernfs_rwsem);
> > > > +       inode_lock(inode);
> > > > +       down_read(&kernfs_rwsem);
> > > >         kernfs_refresh_inode(kn, inode);
> > > > -       up_write(&kernfs_rwsem);
> > > > +       up_read(&kernfs_rwsem);
> > > > +       inode_unlock(inode);
> > > > 
> > > >         return generic_permission(inode, mask);
> > > >  }
> > > > 
> > > 
> > > thanks,
> > > fox
Tejun Heo Dec. 17, 2020, 3:14 p.m. UTC | #42
Hello,

On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > What could be done is to make the kernfs node attr_mutex
> > a pointer and dynamically allocate it but even that is too
> > costly a size addition to the kernfs node structure as
> > Tejun has said.
> 
> I guess the question to ask is, is there really a need to
> call kernfs_refresh_inode() from functions that are usually
> reading/checking functions.
> 
> Would it be sufficient to refresh the inode in the write/set
> operations in (if there's any) places where things like
> setattr_copy() is not already called?
> 
> Perhaps GKH or Tejun could comment on this?

My memory is a bit hazy but invalidations on reads is how sysfs namespace is
implemented, so I don't think there's an easy around that. The only thing I
can think of is embedding the lock into attrs and doing xchg dance when
attaching it.

Thanks.
Ian Kent Dec. 18, 2020, 7:36 a.m. UTC | #43
On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> Hello,
> 
> On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > What could be done is to make the kernfs node attr_mutex
> > > a pointer and dynamically allocate it but even that is too
> > > costly a size addition to the kernfs node structure as
> > > Tejun has said.
> > 
> > I guess the question to ask is, is there really a need to
> > call kernfs_refresh_inode() from functions that are usually
> > reading/checking functions.
> > 
> > Would it be sufficient to refresh the inode in the write/set
> > operations in (if there's any) places where things like
> > setattr_copy() is not already called?
> > 
> > Perhaps GKH or Tejun could comment on this?
> 
> My memory is a bit hazy but invalidations on reads is how sysfs
> namespace is
> implemented, so I don't think there's an easy around that. The only
> thing I
> can think of is embedding the lock into attrs and doing xchg dance
> when
> attaching it.

Sounds like your saying it would be ok to add a lock to the
attrs structure, am I correct?

Assuming it is then, to keep things simple, use two locks.

One global lock for the allocation and an attrs lock for all the
attrs field updates including the kernfs_refresh_inode() update.

The critical section for the global lock could be reduced and it
changed to a spin lock.

In __kernfs_iattrs() we would have something like:

take the allocation lock
do the allocated checks
  assign if existing attrs
  release the allocation lock
  return existing if found
othewise
  release the allocation lock

allocate and initialize attrs

take the allocation lock
check if someone beat us to it
  free and grab exiting attrs
otherwise
  assign the new attrs
release the allocation lock
return attrs

Add a spinlock to the attrs struct and use it everywhere for
field updates.

Am I on the right track or can you see problems with this?

Ian
Fox Chen Dec. 18, 2020, 8:01 a.m. UTC | #44
On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> wrote:
>
> On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > What could be done is to make the kernfs node attr_mutex
> > > > a pointer and dynamically allocate it but even that is too
> > > > costly a size addition to the kernfs node structure as
> > > > Tejun has said.
> > >
> > > I guess the question to ask is, is there really a need to
> > > call kernfs_refresh_inode() from functions that are usually
> > > reading/checking functions.
> > >
> > > Would it be sufficient to refresh the inode in the write/set
> > > operations in (if there's any) places where things like
> > > setattr_copy() is not already called?
> > >
> > > Perhaps GKH or Tejun could comment on this?
> >
> > My memory is a bit hazy but invalidations on reads is how sysfs
> > namespace is
> > implemented, so I don't think there's an easy around that. The only
> > thing I
> > can think of is embedding the lock into attrs and doing xchg dance
> > when
> > attaching it.
>
> Sounds like your saying it would be ok to add a lock to the
> attrs structure, am I correct?
>
> Assuming it is then, to keep things simple, use two locks.
>
> One global lock for the allocation and an attrs lock for all the
> attrs field updates including the kernfs_refresh_inode() update.
>
> The critical section for the global lock could be reduced and it
> changed to a spin lock.
>
> In __kernfs_iattrs() we would have something like:
>
> take the allocation lock
> do the allocated checks
>   assign if existing attrs
>   release the allocation lock
>   return existing if found
> othewise
>   release the allocation lock
>
> allocate and initialize attrs
>
> take the allocation lock
> check if someone beat us to it
>   free and grab exiting attrs
> otherwise
>   assign the new attrs
> release the allocation lock
> return attrs
>
> Add a spinlock to the attrs struct and use it everywhere for
> field updates.
>
> Am I on the right track or can you see problems with this?
>
> Ian
>

umm, we update the inode in kernfs_refresh_inode, right??  So I guess
the problem is how can we protect the inode when kernfs_refresh_inode
is called, not the attrs??


thanks,
fox
Ian Kent Dec. 18, 2020, 11:21 a.m. UTC | #45
On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> wrote:
> > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > What could be done is to make the kernfs node attr_mutex
> > > > > a pointer and dynamically allocate it but even that is too
> > > > > costly a size addition to the kernfs node structure as
> > > > > Tejun has said.
> > > > 
> > > > I guess the question to ask is, is there really a need to
> > > > call kernfs_refresh_inode() from functions that are usually
> > > > reading/checking functions.
> > > > 
> > > > Would it be sufficient to refresh the inode in the write/set
> > > > operations in (if there's any) places where things like
> > > > setattr_copy() is not already called?
> > > > 
> > > > Perhaps GKH or Tejun could comment on this?
> > > 
> > > My memory is a bit hazy but invalidations on reads is how sysfs
> > > namespace is
> > > implemented, so I don't think there's an easy around that. The
> > > only
> > > thing I
> > > can think of is embedding the lock into attrs and doing xchg
> > > dance
> > > when
> > > attaching it.
> > 
> > Sounds like your saying it would be ok to add a lock to the
> > attrs structure, am I correct?
> > 
> > Assuming it is then, to keep things simple, use two locks.
> > 
> > One global lock for the allocation and an attrs lock for all the
> > attrs field updates including the kernfs_refresh_inode() update.
> > 
> > The critical section for the global lock could be reduced and it
> > changed to a spin lock.
> > 
> > In __kernfs_iattrs() we would have something like:
> > 
> > take the allocation lock
> > do the allocated checks
> >   assign if existing attrs
> >   release the allocation lock
> >   return existing if found
> > othewise
> >   release the allocation lock
> > 
> > allocate and initialize attrs
> > 
> > take the allocation lock
> > check if someone beat us to it
> >   free and grab exiting attrs
> > otherwise
> >   assign the new attrs
> > release the allocation lock
> > return attrs
> > 
> > Add a spinlock to the attrs struct and use it everywhere for
> > field updates.
> > 
> > Am I on the right track or can you see problems with this?
> > 
> > Ian
> > 
> 
> umm, we update the inode in kernfs_refresh_inode, right??  So I guess
> the problem is how can we protect the inode when kernfs_refresh_inode
> is called, not the attrs??

But the attrs (which is what's copied from) were protected by the
mutex lock (IIUC) so dealing with the inode attributes implies
dealing with the kernfs node attrs too.

For example in kernfs_iop_setattr() the call to setattr_copy() copies
the node attrs to the inode under the same mutex lock. So, if a read
lock is used the copy in kernfs_refresh_inode() is no longer protected,
it needs to be protected in a different way.

Ian
Fox Chen Dec. 18, 2020, 1:20 p.m. UTC | #46
On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> wrote:
>
> On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> wrote:
> > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > a pointer and dynamically allocate it but even that is too
> > > > > > costly a size addition to the kernfs node structure as
> > > > > > Tejun has said.
> > > > >
> > > > > I guess the question to ask is, is there really a need to
> > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > reading/checking functions.
> > > > >
> > > > > Would it be sufficient to refresh the inode in the write/set
> > > > > operations in (if there's any) places where things like
> > > > > setattr_copy() is not already called?
> > > > >
> > > > > Perhaps GKH or Tejun could comment on this?
> > > >
> > > > My memory is a bit hazy but invalidations on reads is how sysfs
> > > > namespace is
> > > > implemented, so I don't think there's an easy around that. The
> > > > only
> > > > thing I
> > > > can think of is embedding the lock into attrs and doing xchg
> > > > dance
> > > > when
> > > > attaching it.
> > >
> > > Sounds like your saying it would be ok to add a lock to the
> > > attrs structure, am I correct?
> > >
> > > Assuming it is then, to keep things simple, use two locks.
> > >
> > > One global lock for the allocation and an attrs lock for all the
> > > attrs field updates including the kernfs_refresh_inode() update.
> > >
> > > The critical section for the global lock could be reduced and it
> > > changed to a spin lock.
> > >
> > > In __kernfs_iattrs() we would have something like:
> > >
> > > take the allocation lock
> > > do the allocated checks
> > >   assign if existing attrs
> > >   release the allocation lock
> > >   return existing if found
> > > othewise
> > >   release the allocation lock
> > >
> > > allocate and initialize attrs
> > >
> > > take the allocation lock
> > > check if someone beat us to it
> > >   free and grab exiting attrs
> > > otherwise
> > >   assign the new attrs
> > > release the allocation lock
> > > return attrs
> > >
> > > Add a spinlock to the attrs struct and use it everywhere for
> > > field updates.
> > >
> > > Am I on the right track or can you see problems with this?
> > >
> > > Ian
> > >
> >
> > umm, we update the inode in kernfs_refresh_inode, right??  So I guess
> > the problem is how can we protect the inode when kernfs_refresh_inode
> > is called, not the attrs??
>
> But the attrs (which is what's copied from) were protected by the
> mutex lock (IIUC) so dealing with the inode attributes implies
> dealing with the kernfs node attrs too.
>
> For example in kernfs_iop_setattr() the call to setattr_copy() copies
> the node attrs to the inode under the same mutex lock. So, if a read
> lock is used the copy in kernfs_refresh_inode() is no longer protected,
> it needs to be protected in a different way.
>

Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for .setattr but
 no lock for .getattr (misdocumented?? sometimes they have as you've found out)?
What does it protect against?? Because .permission does a similar thing
here -- updating inode attributes, the goal is to provide the same
protection level
for .permission as for .setattr, am I right???


thanks,
fox
Tejun Heo Dec. 18, 2020, 2:59 p.m. UTC | #47
Hello,

On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote:
> Sounds like your saying it would be ok to add a lock to the
> attrs structure, am I correct?

Yeah, adding a lock to attrs is a lot less of a problem and it looks like
it's gonna have to be either that or hashed locks, which might actually make
sense if we're worried about the size of attrs (I don't think we need to).

Thanks.
Ian Kent Dec. 19, 2020, 12:53 a.m. UTC | #48
On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> wrote:
> > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > > a pointer and dynamically allocate it but even that is
> > > > > > > too
> > > > > > > costly a size addition to the kernfs node structure as
> > > > > > > Tejun has said.
> > > > > > 
> > > > > > I guess the question to ask is, is there really a need to
> > > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > > reading/checking functions.
> > > > > > 
> > > > > > Would it be sufficient to refresh the inode in the
> > > > > > write/set
> > > > > > operations in (if there's any) places where things like
> > > > > > setattr_copy() is not already called?
> > > > > > 
> > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > 
> > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > sysfs
> > > > > namespace is
> > > > > implemented, so I don't think there's an easy around that.
> > > > > The
> > > > > only
> > > > > thing I
> > > > > can think of is embedding the lock into attrs and doing xchg
> > > > > dance
> > > > > when
> > > > > attaching it.
> > > > 
> > > > Sounds like your saying it would be ok to add a lock to the
> > > > attrs structure, am I correct?
> > > > 
> > > > Assuming it is then, to keep things simple, use two locks.
> > > > 
> > > > One global lock for the allocation and an attrs lock for all
> > > > the
> > > > attrs field updates including the kernfs_refresh_inode()
> > > > update.
> > > > 
> > > > The critical section for the global lock could be reduced and
> > > > it
> > > > changed to a spin lock.
> > > > 
> > > > In __kernfs_iattrs() we would have something like:
> > > > 
> > > > take the allocation lock
> > > > do the allocated checks
> > > >   assign if existing attrs
> > > >   release the allocation lock
> > > >   return existing if found
> > > > othewise
> > > >   release the allocation lock
> > > > 
> > > > allocate and initialize attrs
> > > > 
> > > > take the allocation lock
> > > > check if someone beat us to it
> > > >   free and grab exiting attrs
> > > > otherwise
> > > >   assign the new attrs
> > > > release the allocation lock
> > > > return attrs
> > > > 
> > > > Add a spinlock to the attrs struct and use it everywhere for
> > > > field updates.
> > > > 
> > > > Am I on the right track or can you see problems with this?
> > > > 
> > > > Ian
> > > > 
> > > 
> > > umm, we update the inode in kernfs_refresh_inode, right??  So I
> > > guess
> > > the problem is how can we protect the inode when
> > > kernfs_refresh_inode
> > > is called, not the attrs??
> > 
> > But the attrs (which is what's copied from) were protected by the
> > mutex lock (IIUC) so dealing with the inode attributes implies
> > dealing with the kernfs node attrs too.
> > 
> > For example in kernfs_iop_setattr() the call to setattr_copy()
> > copies
> > the node attrs to the inode under the same mutex lock. So, if a
> > read
> > lock is used the copy in kernfs_refresh_inode() is no longer
> > protected,
> > it needs to be protected in a different way.
> > 
> 
> Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for
> .setattr but
>  no lock for .getattr (misdocumented?? sometimes they have as you've
> found out)?
> What does it protect against?? Because .permission does a similar
> thing
> here -- updating inode attributes, the goal is to provide the same
> protection level
> for .permission as for .setattr, am I right???

As far as the documentation goes that's probably my misunderstanding
of it.

It does happen that the VFS makes assumptions about how call backs
are meant to be used.

Read like call backs, like .getattr() and .permission() are meant to
be used, well, like read like functions so the VFS should be ok to
take locks or not based on the operation context at hand.

So it's not about the locking for these call backs per se, it's about
the context in which they are called.

For example, in link_path_walk(), at the beginning of the component
lookup loop (essentially for the containing directory at that point),
may_lookup() is called which leads to a call to .permission() without
any inode lock held at that point.

But file opens (possibly following a path walk to resolve a path)
are different.

For example, do_filp_open() calls path_openat() which leads to a
call to open_last_lookups(), which leads to a call to .permission()
along the way. And in this case there are two contexts, an open()
create or one without create, the former needing the exclusive inode
lock and the later able to use the shared lock.

So it's about the locking needed for the encompassing operation that
is being done not about those functions specifically.

TBH the VFS is very complex and Al has a much, much better
understanding of it than I do so he would need to be the one to answer
whether it's the file systems responsibility to use these calls in the
way the VFS expects.

My belief is that if a file system needs to use a call back in a way
that's in conflict with what the VFS expects it's the file systems'
responsibility to deal with the side effects.

Ian
Ian Kent Dec. 19, 2020, 7:08 a.m. UTC | #49
On Fri, 2020-12-18 at 09:59 -0500, Tejun Heo wrote:
> Hello,
> 
> On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote:
> > Sounds like your saying it would be ok to add a lock to the
> > attrs structure, am I correct?
> 
> Yeah, adding a lock to attrs is a lot less of a problem and it looks
> like
> it's gonna have to be either that or hashed locks, which might
> actually make
> sense if we're worried about the size of attrs (I don't think we need
> to).

Maybe that isn't needed.

And looking further I see there's a race that kernfs can't do anything
about between kernfs_refresh_inode() and fs/inode.c:update_times().

kernfs could avoid fighting with the VFS to keep the attributes set to
those of the kernfs node by using the inode operation .update_times()
and, if it makes sense, the kernfs node attributes that it wants to be
updated on file system activity could also be updated here.

I can't find any reason why this shouldn't be done but kernfs is
fairly widely used in other kernel subsystems so what does everyone
think of this patch, updated to set kernfs node attributes that
should be updated of course, see comment in the patch?

kernfs: fix attributes update race

From: Ian Kent <raven@themaw.net>

kernfs uses kernfs_refresh_inode() (called from kernfs_iop_getattr()
and kernfs_iop_permission()) to keep the inode attributes set to the
attibutes of the kernfs node.

But there is no way for kernfs to prevent racing with the function
fs/inode.c:update_times().

The better choice is to use the inode operation .update_times() and
just let the VFS use the generic functions for .getattr() and
.permission().

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c           |   37 ++++++++++++++-----------------------
 fs/kernfs/kernfs-internal.h |    4 +---
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fc2469a20fed..51780329590c 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -24,9 +24,8 @@ static const struct address_space_operations kernfs_aops = {
 };
 
 static const struct inode_operations kernfs_iops = {
-	.permission	= kernfs_iop_permission,
+	.update_time	= kernfs_update_time,
 	.setattr	= kernfs_iop_setattr,
-	.getattr	= kernfs_iop_getattr,
 	.listxattr	= kernfs_iop_listxattr,
 };
 
@@ -183,18 +182,26 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 		set_nlink(inode, kn->dir.subdirs + 2);
 }
 
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-		       u32 request_mask, unsigned int query_flags)
+static int kernfs_iop_update_time(struct inode *inode, struct timespec64 *time, int flags)
 {
-	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
+	struct kernfs_iattrs *attrs;
 
 	mutex_lock(&kernfs_mutex);
+	attrs = kernfs_iattrs(kn);
+	if (!attrs) {
+		mutex_unlock(&kernfs_mutex);
+		return -ENOMEM;
+	}
+
+	/* Which kernfs node attributes should be updated from
+	 * time?
+	 */
+
 	kernfs_refresh_inode(kn, inode);
 	mutex_unlock(&kernfs_mutex);
 
-	generic_fillattr(inode, stat);
-	return 0;
+	return 0
 }
 
 static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
@@ -272,22 +279,6 @@ void kernfs_evict_inode(struct inode *inode)
 	kernfs_put(kn);
 }
 
-int kernfs_iop_permission(struct inode *inode, int mask)
-{
-	struct kernfs_node *kn;
-
-	if (mask & MAY_NOT_BLOCK)
-		return -ECHILD;
-
-	kn = inode->i_private;
-
-	mutex_lock(&kernfs_mutex);
-	kernfs_refresh_inode(kn, inode);
-	mutex_unlock(&kernfs_mutex);
-
-	return generic_permission(inode, mask);
-}
-
 int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
 		     void *value, size_t size)
 {
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7ee97ef59184..98d08b928f93 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -89,10 +89,8 @@ extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
  */
 extern const struct xattr_handler *kernfs_xattr_handlers[];
 void kernfs_evict_inode(struct inode *inode);
-int kernfs_iop_permission(struct inode *inode, int mask);
+int kernfs_update_time(struct inode *inode, struct timespec64 *time, int flags);
 int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr);
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-		       u32 request_mask, unsigned int query_flags);
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
 int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
Fox Chen Dec. 19, 2020, 7:47 a.m. UTC | #50
On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@themaw.net> wrote:
>
> On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> wrote:
> > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > > > a pointer and dynamically allocate it but even that is
> > > > > > > > too
> > > > > > > > costly a size addition to the kernfs node structure as
> > > > > > > > Tejun has said.
> > > > > > >
> > > > > > > I guess the question to ask is, is there really a need to
> > > > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > > > reading/checking functions.
> > > > > > >
> > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > write/set
> > > > > > > operations in (if there's any) places where things like
> > > > > > > setattr_copy() is not already called?
> > > > > > >
> > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > >
> > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > sysfs
> > > > > > namespace is
> > > > > > implemented, so I don't think there's an easy around that.
> > > > > > The
> > > > > > only
> > > > > > thing I
> > > > > > can think of is embedding the lock into attrs and doing xchg
> > > > > > dance
> > > > > > when
> > > > > > attaching it.
> > > > >
> > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > attrs structure, am I correct?
> > > > >
> > > > > Assuming it is then, to keep things simple, use two locks.
> > > > >
> > > > > One global lock for the allocation and an attrs lock for all
> > > > > the
> > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > update.
> > > > >
> > > > > The critical section for the global lock could be reduced and
> > > > > it
> > > > > changed to a spin lock.
> > > > >
> > > > > In __kernfs_iattrs() we would have something like:
> > > > >
> > > > > take the allocation lock
> > > > > do the allocated checks
> > > > >   assign if existing attrs
> > > > >   release the allocation lock
> > > > >   return existing if found
> > > > > othewise
> > > > >   release the allocation lock
> > > > >
> > > > > allocate and initialize attrs
> > > > >
> > > > > take the allocation lock
> > > > > check if someone beat us to it
> > > > >   free and grab exiting attrs
> > > > > otherwise
> > > > >   assign the new attrs
> > > > > release the allocation lock
> > > > > return attrs
> > > > >
> > > > > Add a spinlock to the attrs struct and use it everywhere for
> > > > > field updates.
> > > > >
> > > > > Am I on the right track or can you see problems with this?
> > > > >
> > > > > Ian
> > > > >
> > > >
> > > > umm, we update the inode in kernfs_refresh_inode, right??  So I
> > > > guess
> > > > the problem is how can we protect the inode when
> > > > kernfs_refresh_inode
> > > > is called, not the attrs??
> > >
> > > But the attrs (which is what's copied from) were protected by the
> > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > dealing with the kernfs node attrs too.
> > >
> > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > copies
> > > the node attrs to the inode under the same mutex lock. So, if a
> > > read
> > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > protected,
> > > it needs to be protected in a different way.
> > >
> >
> > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for
> > .setattr but
> >  no lock for .getattr (misdocumented?? sometimes they have as you've
> > found out)?
> > What does it protect against?? Because .permission does a similar
> > thing
> > here -- updating inode attributes, the goal is to provide the same
> > protection level
> > for .permission as for .setattr, am I right???
>
> As far as the documentation goes that's probably my misunderstanding
> of it.
>
> It does happen that the VFS makes assumptions about how call backs
> are meant to be used.
>
> Read like call backs, like .getattr() and .permission() are meant to
> be used, well, like read like functions so the VFS should be ok to
> take locks or not based on the operation context at hand.
>
> So it's not about the locking for these call backs per se, it's about
> the context in which they are called.
>
> For example, in link_path_walk(), at the beginning of the component
> lookup loop (essentially for the containing directory at that point),
> may_lookup() is called which leads to a call to .permission() without
> any inode lock held at that point.
>
> But file opens (possibly following a path walk to resolve a path)
> are different.
>
> For example, do_filp_open() calls path_openat() which leads to a
> call to open_last_lookups(), which leads to a call to .permission()
> along the way. And in this case there are two contexts, an open()
> create or one without create, the former needing the exclusive inode
> lock and the later able to use the shared lock.
>
> So it's about the locking needed for the encompassing operation that
> is being done not about those functions specifically.
>
> TBH the VFS is very complex and Al has a much, much better
> understanding of it than I do so he would need to be the one to answer
> whether it's the file systems responsibility to use these calls in the
> way the VFS expects.
>
> My belief is that if a file system needs to use a call back in a way
> that's in conflict with what the VFS expects it's the file systems'
> responsibility to deal with the side effects.
>

Thanks for clarifying. Ian.

Yeah, it's complex and confusing and it's very hard to spot lock
context by reading VFS code.

I put code like this:
        if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
                if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
                        pr_warn("kernfs iop_permission inode WRITE
lock is held");
                } else if (lockdep_is_held_type(&inode->i_rwsem, 1)) {
                        pr_warn("kernfs iop_permission inode READ lock
is held");
                }
        } else {
                pr_warn("kernfs iop_permission inode lock is NOT held");
        }

in both .permission & .getattr. Then I do some open/read/write/close
to /sys, interestingly, all log outputs suggest they are in WRITE lock
context.

and I put dump_stack() to the lock-is-held if branch, it prints a lot
of following context:

[  481.795445] CPU: 0 PID: 1 Comm: systemd Not tainted 5.10.0-dirty #25
[  481.795446] Hardware name: Parallels Software International Inc.
Parallels Virtual Platform/Parallels Virtual Platform, BIOS 15.1.5
(47309) 09/26/2020
[  481.795446] Call Trace:
[  481.795448] dump_stack (lib/dump_stack.c:120)
[  481.795450] kernfs_iop_permission (fs/kernfs/inode.c:295)
[  481.795452] inode_permission.part.0 (fs/namei.c:398 fs/namei.c:463)
[  481.795454] may_open (fs/namei.c:2875)
[  481.795456] path_openat (fs/namei.c:3250 fs/namei.c:3369)
[  481.795458] ? ___sys_sendmsg (net/socket.c:2411)
[  481.795460] ? preempt_count_add (./include/linux/ftrace.h:821
kernel/sched/core.c:4166 kernel/sched/core.c:4163
kernel/sched/core.c:4191)
[  481.795461] ? sock_poll (net/socket.c:1265)
[  481.795463] do_filp_open (fs/namei.c:3396)
[  481.795466] ? __check_object_size (mm/usercopy.c:240
mm/usercopy.c:286 mm/usercopy.c:256)
[  481.795469] ? _raw_spin_unlock
(./arch/x86/include/asm/preempt.h:102
./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  481.795470] do_sys_openat2 (fs/open.c:1168)
[  481.795472] __x64_sys_openat (fs/open.c:1195)
[  481.795473] do_syscall_64 (arch/x86/entry/common.c:46)
[  481.795475] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:127)
[  481.795476] RIP: 0033:0x7f6b31d69c94

Surprisingly, I didn't see the lock holding code along the path.


thanks,
fox
Tejun Heo Dec. 19, 2020, 4:23 p.m. UTC | #51
Hello,

On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> And looking further I see there's a race that kernfs can't do anything
> about between kernfs_refresh_inode() and fs/inode.c:update_times().

Do kernfs files end up calling into that path tho? Doesn't look like it to
me but if so yeah we'd need to override the update_time for kernfs.

> +static int kernfs_iop_update_time(struct inode *inode, struct timespec64 *time, int flags)
>  {
> -	struct inode *inode = d_inode(path->dentry);
>  	struct kernfs_node *kn = inode->i_private;
> +	struct kernfs_iattrs *attrs;
>  
>  	mutex_lock(&kernfs_mutex);
> +	attrs = kernfs_iattrs(kn);
> +	if (!attrs) {
> +		mutex_unlock(&kernfs_mutex);
> +		return -ENOMEM;
> +	}
> +
> +	/* Which kernfs node attributes should be updated from
> +	 * time?
> +	 */
> +
>  	kernfs_refresh_inode(kn, inode);
>  	mutex_unlock(&kernfs_mutex);

I don't see how this would reflect the changes from kernfs_setattr() into
the attached inode. This would actually make the attr updates obviously racy
- the userland visible attrs would be stale until the inode gets reclaimed
and then when it gets reinstantiated it'd show the latest information.

That said, if you wanna take the direction where attr updates are reflected
to the associated inode when the change occurs, which makes sense, the right
thing to do would be making kernfs_setattr() update the associated inode if
existent.

Thanks.
Ian Kent Dec. 19, 2020, 11:52 p.m. UTC | #52
On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> Hello,
> 
> On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > And looking further I see there's a race that kernfs can't do
> > anything
> > about between kernfs_refresh_inode() and fs/inode.c:update_times().
> 
> Do kernfs files end up calling into that path tho? Doesn't look like
> it to
> me but if so yeah we'd need to override the update_time for kernfs.

Sorry, the below was very hastily done and not what I would actually
propose.

The main point of it was the question

+	/* Which kernfs node attributes should be updated from
+	 * time?
+	 */

but looking at it again this morning I think the node iattr fields
that might need to be updated would be atime, ctime and mtime only,
maybe not ctime ... not sure.

What do you think?

Also, if kn->attr == NULL it should fall back to what the VFS
currently does.

The update_times() function is one of the few places where the
VFS updates the inode times.

The idea is that the reason kernfs needs to overwrite the inode
attributes is to reset what the VFS might have done but if kernfs
has this inode operation they won't need to be overwritten since
they won't have changed.

There may be other places where the attributes (or an attribute)
are set by the VFS, I haven't finished checking that yet so my
suggestion might not be entirely valid.

What I need to do is work out what kernfs node attributes, if any,
should be updated by .update_times(). If I go by what
kernfs_refresh_inode() does now then that would be none but shouldn't
atime at least be updated in the node iattr.

> > +static int kernfs_iop_update_time(struct inode *inode, struct
> > timespec64 *time, int flags)
> >  {
> > -	struct inode *inode = d_inode(path->dentry);
> >  	struct kernfs_node *kn = inode->i_private;
> > +	struct kernfs_iattrs *attrs;
> >  
> >  	mutex_lock(&kernfs_mutex);
> > +	attrs = kernfs_iattrs(kn);
> > +	if (!attrs) {
> > +		mutex_unlock(&kernfs_mutex);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	/* Which kernfs node attributes should be updated from
> > +	 * time?
> > +	 */
> > +
> >  	kernfs_refresh_inode(kn, inode);
> >  	mutex_unlock(&kernfs_mutex);
> 
> I don't see how this would reflect the changes from kernfs_setattr()
> into
> the attached inode. This would actually make the attr updates
> obviously racy
> - the userland visible attrs would be stale until the inode gets
> reclaimed
> and then when it gets reinstantiated it'd show the latest
> information.

Right, I will have to think about that, but as I say above this
isn't really what I would propose.

If .update_times() sticks strictly to what kernfs_refresh_inode()
does now then it would set the inode attributes from the node iattr
only.

> 
> That said, if you wanna take the direction where attr updates are
> reflected
> to the associated inode when the change occurs, which makes sense,
> the right
> thing to do would be making kernfs_setattr() update the associated
> inode if
> existent.

Mmm, that's a good point but it looks like the inode isn't available
there.

Ian
Ian Kent Dec. 20, 2020, 1:37 a.m. UTC | #53
On Sun, 2020-12-20 at 07:52 +0800, Ian Kent wrote:
> On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> > Hello,
> > 
> > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > > And looking further I see there's a race that kernfs can't do
> > > anything
> > > about between kernfs_refresh_inode() and
> > > fs/inode.c:update_times().
> > 
> > Do kernfs files end up calling into that path tho? Doesn't look
> > like
> > it to
> > me but if so yeah we'd need to override the update_time for kernfs.

You are correct, update_time() will only be called during symlink
following and only to update atime.

So this isn't sufficient to update the inode attributes to reflect
changes make by things like kernfs_setattr() or when the directory
link count changes ...

Sigh!
Fox Chen Dec. 21, 2020, 9:28 a.m. UTC | #54
On Sun, Dec 20, 2020 at 7:52 AM Ian Kent <raven@themaw.net> wrote:
>
> On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> > Hello,
> >
> > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > > And looking further I see there's a race that kernfs can't do
> > > anything
> > > about between kernfs_refresh_inode() and fs/inode.c:update_times().
> >
> > Do kernfs files end up calling into that path tho? Doesn't look like
> > it to
> > me but if so yeah we'd need to override the update_time for kernfs.
>
> Sorry, the below was very hastily done and not what I would actually
> propose.
>
> The main point of it was the question
>
> +       /* Which kernfs node attributes should be updated from
> +        * time?
> +        */
>
> but looking at it again this morning I think the node iattr fields
> that might need to be updated would be atime, ctime and mtime only,
> maybe not ctime ... not sure.
>
> What do you think?
>
> Also, if kn->attr == NULL it should fall back to what the VFS
> currently does.
>
> The update_times() function is one of the few places where the
> VFS updates the inode times.
>
> The idea is that the reason kernfs needs to overwrite the inode
> attributes is to reset what the VFS might have done but if kernfs
> has this inode operation they won't need to be overwritten since
> they won't have changed.
>
> There may be other places where the attributes (or an attribute)
> are set by the VFS, I haven't finished checking that yet so my
> suggestion might not be entirely valid.
>
> What I need to do is work out what kernfs node attributes, if any,
> should be updated by .update_times(). If I go by what
> kernfs_refresh_inode() does now then that would be none but shouldn't
> atime at least be updated in the node iattr.
>
> > > +static int kernfs_iop_update_time(struct inode *inode, struct
> > > timespec64 *time, int flags)
> > >  {
> > > -   struct inode *inode = d_inode(path->dentry);
> > >     struct kernfs_node *kn = inode->i_private;
> > > +   struct kernfs_iattrs *attrs;
> > >
> > >     mutex_lock(&kernfs_mutex);
> > > +   attrs = kernfs_iattrs(kn);
> > > +   if (!attrs) {
> > > +           mutex_unlock(&kernfs_mutex);
> > > +           return -ENOMEM;
> > > +   }
> > > +
> > > +   /* Which kernfs node attributes should be updated from
> > > +    * time?
> > > +    */
> > > +
> > >     kernfs_refresh_inode(kn, inode);
> > >     mutex_unlock(&kernfs_mutex);
> >
> > I don't see how this would reflect the changes from kernfs_setattr()
> > into
> > the attached inode. This would actually make the attr updates
> > obviously racy
> > - the userland visible attrs would be stale until the inode gets
> > reclaimed
> > and then when it gets reinstantiated it'd show the latest
> > information.
>
> Right, I will have to think about that, but as I say above this
> isn't really what I would propose.
>
> If .update_times() sticks strictly to what kernfs_refresh_inode()
> does now then it would set the inode attributes from the node iattr
> only.
>
> >
> > That said, if you wanna take the direction where attr updates are
> > reflected
> > to the associated inode when the change occurs, which makes sense,
> > the right
> > thing to do would be making kernfs_setattr() update the associated
> > inode if
> > existent.
>
> Mmm, that's a good point but it looks like the inode isn't available
> there.
>
Is it possible to embed super block somewhere, then we can call
kernfs_get_inode to get inode in kernfs_setattr???


thanks,
fox
Ian Kent Dec. 22, 2020, 2:17 a.m. UTC | #55
On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote:
> On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@themaw.net> wrote:
> > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > > What could be done is to make the kernfs node
> > > > > > > > > attr_mutex
> > > > > > > > > a pointer and dynamically allocate it but even that
> > > > > > > > > is
> > > > > > > > > too
> > > > > > > > > costly a size addition to the kernfs node structure
> > > > > > > > > as
> > > > > > > > > Tejun has said.
> > > > > > > > 
> > > > > > > > I guess the question to ask is, is there really a need
> > > > > > > > to
> > > > > > > > call kernfs_refresh_inode() from functions that are
> > > > > > > > usually
> > > > > > > > reading/checking functions.
> > > > > > > > 
> > > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > > write/set
> > > > > > > > operations in (if there's any) places where things like
> > > > > > > > setattr_copy() is not already called?
> > > > > > > > 
> > > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > > > 
> > > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > > sysfs
> > > > > > > namespace is
> > > > > > > implemented, so I don't think there's an easy around
> > > > > > > that.
> > > > > > > The
> > > > > > > only
> > > > > > > thing I
> > > > > > > can think of is embedding the lock into attrs and doing
> > > > > > > xchg
> > > > > > > dance
> > > > > > > when
> > > > > > > attaching it.
> > > > > > 
> > > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > > attrs structure, am I correct?
> > > > > > 
> > > > > > Assuming it is then, to keep things simple, use two locks.
> > > > > > 
> > > > > > One global lock for the allocation and an attrs lock for
> > > > > > all
> > > > > > the
> > > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > > update.
> > > > > > 
> > > > > > The critical section for the global lock could be reduced
> > > > > > and
> > > > > > it
> > > > > > changed to a spin lock.
> > > > > > 
> > > > > > In __kernfs_iattrs() we would have something like:
> > > > > > 
> > > > > > take the allocation lock
> > > > > > do the allocated checks
> > > > > >   assign if existing attrs
> > > > > >   release the allocation lock
> > > > > >   return existing if found
> > > > > > othewise
> > > > > >   release the allocation lock
> > > > > > 
> > > > > > allocate and initialize attrs
> > > > > > 
> > > > > > take the allocation lock
> > > > > > check if someone beat us to it
> > > > > >   free and grab exiting attrs
> > > > > > otherwise
> > > > > >   assign the new attrs
> > > > > > release the allocation lock
> > > > > > return attrs
> > > > > > 
> > > > > > Add a spinlock to the attrs struct and use it everywhere
> > > > > > for
> > > > > > field updates.
> > > > > > 
> > > > > > Am I on the right track or can you see problems with this?
> > > > > > 
> > > > > > Ian
> > > > > > 
> > > > > 
> > > > > umm, we update the inode in kernfs_refresh_inode, right??  So
> > > > > I
> > > > > guess
> > > > > the problem is how can we protect the inode when
> > > > > kernfs_refresh_inode
> > > > > is called, not the attrs??
> > > > 
> > > > But the attrs (which is what's copied from) were protected by
> > > > the
> > > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > > dealing with the kernfs node attrs too.
> > > > 
> > > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > > copies
> > > > the node attrs to the inode under the same mutex lock. So, if a
> > > > read
> > > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > > protected,
> > > > it needs to be protected in a different way.
> > > > 
> > > 
> > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem
> > > for
> > > .setattr but
> > >  no lock for .getattr (misdocumented?? sometimes they have as
> > > you've
> > > found out)?
> > > What does it protect against?? Because .permission does a similar
> > > thing
> > > here -- updating inode attributes, the goal is to provide the
> > > same
> > > protection level
> > > for .permission as for .setattr, am I right???
> > 
> > As far as the documentation goes that's probably my
> > misunderstanding
> > of it.
> > 
> > It does happen that the VFS makes assumptions about how call backs
> > are meant to be used.
> > 
> > Read like call backs, like .getattr() and .permission() are meant
> > to
> > be used, well, like read like functions so the VFS should be ok to
> > take locks or not based on the operation context at hand.
> > 
> > So it's not about the locking for these call backs per se, it's
> > about
> > the context in which they are called.
> > 
> > For example, in link_path_walk(), at the beginning of the component
> > lookup loop (essentially for the containing directory at that
> > point),
> > may_lookup() is called which leads to a call to .permission()
> > without
> > any inode lock held at that point.
> > 
> > But file opens (possibly following a path walk to resolve a path)
> > are different.
> > 
> > For example, do_filp_open() calls path_openat() which leads to a
> > call to open_last_lookups(), which leads to a call to .permission()
> > along the way. And in this case there are two contexts, an open()
> > create or one without create, the former needing the exclusive
> > inode
> > lock and the later able to use the shared lock.
> > 
> > So it's about the locking needed for the encompassing operation
> > that
> > is being done not about those functions specifically.
> > 
> > TBH the VFS is very complex and Al has a much, much better
> > understanding of it than I do so he would need to be the one to
> > answer
> > whether it's the file systems responsibility to use these calls in
> > the
> > way the VFS expects.
> > 
> > My belief is that if a file system needs to use a call back in a
> > way
> > that's in conflict with what the VFS expects it's the file systems'
> > responsibility to deal with the side effects.
> > 
> 
> Thanks for clarifying. Ian.
> 
> Yeah, it's complex and confusing and it's very hard to spot lock
> context by reading VFS code.
> 
> I put code like this:
>         if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
>                 if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
>                         pr_warn("kernfs iop_permission inode WRITE
> lock is held");
>                 } else if (lockdep_is_held_type(&inode->i_rwsem, 1))
> {
>                         pr_warn("kernfs iop_permission inode READ
> lock
> is held");
>                 }
>         } else {
>                 pr_warn("kernfs iop_permission inode lock is NOT
> held");
>         }
> 
> in both .permission & .getattr. Then I do some open/read/write/close
> to /sys, interestingly, all log outputs suggest they are in WRITE
> lock
> context.

The thing is in open_last_lookups() called from path_openat() we
have:
	if (open_flag & O_CREAT)
        	        inode_lock(dir->d_inode);
	        else
        	        inode_lock_shared(dir->d_inode);

and from there it's - lookup_open() and possibly may_o_create() which
calls inode_permission() and onto .permission().

So, strictly speaking, it should be taken exclusive because you would
expect may_o_create() to be called only on O_CREATE.

But all the possible cases would need to be checked and if it is taken
shared anywhere we are in trouble.

Another example is in link_path_walk() we have a call to may_lookup()
which leads to a call to .permission() without the lock being held.

So there are a bunch of cases to check and knowing you are the owner
of the lock if it is held is at least risky when concurrency is high
and there's a possibility the lock can be taken shared.

Ian