
[v4,0/5] kernfs: proposed locking and concurrency improvement

Message ID 162077975380.14498.11347675368470436331.stgit@web.messagingengine.com (mailing list archive)
Series kernfs: proposed locking and concurrency improvement

Message

Ian Kent May 12, 2021, 12:38 a.m. UTC
There have been a few reports of contention on the kernfs_mutex during
path walks: a case on very large IBM systems seen by myself, a report by
Brice Goglin followed up by Fox Chen, and a couple of other reports I've
since seen from CoreOS users.

The common thread is a large number of kernfs path walks being slowed
down by kernfs_mutex contention.

The problem is that changes to the VFS over time have increased its
concurrency capabilities to the extent that kernfs's use of a mutex is
no longer appropriate. There's also a less common problem: walks for
non-existent paths can cause contention if there are quite a few of them.

This patch series is relatively straightforward.

All it does is take advantage of VFS negative dentry caching to avoid
needless dentry alloc/free cycles for lookups of paths that don't exist,
and change the kernfs_mutex to a read/write semaphore.
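
For illustration only (this is a simplified sketch, not the patches
themselves, and it assumes the semaphore ends up being named
kernfs_rwsem), the combined effect on the lookup path is roughly:

static DECLARE_RWSEM(kernfs_rwsem);	/* replaces DEFINE_MUTEX(kernfs_mutex) */

static struct dentry *kernfs_iop_lookup(struct inode *dir,
					struct dentry *dentry,
					unsigned int flags)
{
	struct kernfs_node *parent = dir->i_private;
	struct kernfs_node *kn;
	struct inode *inode = NULL;

	down_read(&kernfs_rwsem);	/* readers can now walk concurrently */
	kn = kernfs_find_ns(parent, dentry->d_name.name, NULL);
	if (kn && kernfs_active(kn)) {
		inode = kernfs_get_inode(dir->i_sb, kn);
		if (!inode)
			inode = ERR_PTR(-ENOMEM);
	}
	/*
	 * If the name wasn't found inode stays NULL and d_splice_alias()
	 * instantiates a negative dentry, so repeated lookups of the same
	 * missing path are satisfied from the dcache instead of doing a
	 * dentry alloc/lookup/free cycle every time.
	 */
	up_read(&kernfs_rwsem);
	return d_splice_alias(inode, dentry);
}

The other half of the negative dentry caching (not shown) is having
kernfs_dop_revalidate() drop a cached negative dentry once a node with
that name appears, so later lookups see the new node.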

The patch that tried to stay in VFS rcu-walk mode during path walks has
been dropped for two reasons. First, it doesn't actually give very much
improvement and, second, if there's a place where mistakes could go
unnoticed it would be in that path. This makes the patch series simpler
to review and reduces the likelihood of problems going unnoticed and
popping up later.

The patch that used a revision to identify whether a directory has
changed has also been dropped. If the directory has changed, the dentry
revision needs to be updated to avoid subsequent rb tree searches, and
after changing to a read/write semaphore that update also requires a
lock. But the d_lock is the only lock available at this point, and it
might itself be contended.

Changes since v3:
- remove unneeded indirection when referencing the super block.
- check if inode attribute update is actually needed.

Changes since v2:
- actually fix the inode attribute update locking.
- drop the patch that tried to stay in rcu-walk mode.
- drop the use a revision to identify if a directory has changed patch.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (5):
      kernfs: move revalidate to be near lookup
      kernfs: use VFS negative dentry caching
      kernfs: switch kernfs to use an rwsem
      kernfs: use i_lock to protect concurrent inode updates
      kernfs: add kernfs_need_inode_refresh()


 fs/kernfs/dir.c             | 170 ++++++++++++++++++++----------------
 fs/kernfs/file.c            |   4 +-
 fs/kernfs/inode.c           |  45 ++++++++--
 fs/kernfs/kernfs-internal.h |   5 +-
 fs/kernfs/mount.c           |  12 +--
 fs/kernfs/symlink.c         |   4 +-
 include/linux/kernfs.h      |   2 +-
 7 files changed, 147 insertions(+), 95 deletions(-)

--
Ian

Comments

Greg Kroah-Hartman May 12, 2021, 6:21 a.m. UTC | #1
On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote:
> There have been a few instances of contention on the kernfs_mutex during
> path walks, a case on very large IBM systems seen by myself, a report by
> Brice Goglin and followed up by Fox Chen, and I've since seen a couple
> of other reports by CoreOS users.
> 
> The common thread is a large number of kernfs path walks leading to
> slowness of path walks due to kernfs_mutex contention.
> 
> The problem being that changes to the VFS over some time have increased
> it's concurrency capabilities to an extent that kernfs's use of a mutex
> is no longer appropriate. There's also an issue of walks for non-existent
> paths causing contention if there are quite a few of them which is a less
> common problem.
> 
> This patch series is relatively straight forward.
> 
> All it does is add the ability to take advantage of VFS negative dentry
> caching to avoid needless dentry alloc/free cycles for lookups of paths
> that don't exit and change the kernfs_mutex to a read/write semaphore.
> 
> The patch that tried to stay in VFS rcu-walk mode during path walks has
> been dropped for two reasons. First, it doesn't actually give very much
> improvement and, second, if there's a place where mistakes could go
> unnoticed it would be in that path. This makes the patch series simpler
> to review and reduces the likelihood of problems going unnoticed and
> popping up later.
> 
> The patch to use a revision to identify if a directory has changed has
> also been dropped. If the directory has changed the dentry revision
> needs to be updated to avoid subsequent rb tree searches and after
> changing to use a read/write semaphore the update also requires a lock.
> But the d_lock is the only lock available at this point which might
> itself be contended.
> 
> Changes since v3:
> - remove unneeded indirection when referencing the super block.
> - check if inode attribute update is actually needed.
> 
> Changes since v2:
> - actually fix the inode attribute update locking.
> - drop the patch that tried to stay in rcu-walk mode.
> - drop the use a revision to identify if a directory has changed patch.
> 
> Changes since v1:
> - fix locking in .permission() and .getattr() by re-factoring the attribute
>   handling code.
> ---
> 
> Ian Kent (5):
>       kernfs: move revalidate to be near lookup
>       kernfs: use VFS negative dentry caching
>       kernfs: switch kernfs to use an rwsem
>       kernfs: use i_lock to protect concurrent inode updates
>       kernfs: add kernfs_need_inode_refresh()
> 
> 
>  fs/kernfs/dir.c             | 170 ++++++++++++++++++++----------------
>  fs/kernfs/file.c            |   4 +-
>  fs/kernfs/inode.c           |  45 ++++++++--
>  fs/kernfs/kernfs-internal.h |   5 +-
>  fs/kernfs/mount.c           |  12 +--
>  fs/kernfs/symlink.c         |   4 +-
>  include/linux/kernfs.h      |   2 +-
>  7 files changed, 147 insertions(+), 95 deletions(-)
> 
> --
> Ian
> 

Any benchmark numbers that you ran that are better/worse with this patch
series?  That would be good to know, otherwise you aren't changing
functionality here, so why would we take these changes?  :)

thanks,

greg k-h
Fox Chen May 12, 2021, 7:16 a.m. UTC | #2
On Wed, May 12, 2021 at 2:21 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote:
> > There have been a few instances of contention on the kernfs_mutex during
> > path walks, a case on very large IBM systems seen by myself, a report by
> > Brice Goglin and followed up by Fox Chen, and I've since seen a couple
> > of other reports by CoreOS users.
> >
> > The common thread is a large number of kernfs path walks leading to
> > slowness of path walks due to kernfs_mutex contention.
> >
> > The problem being that changes to the VFS over some time have increased
> > it's concurrency capabilities to an extent that kernfs's use of a mutex
> > is no longer appropriate. There's also an issue of walks for non-existent
> > paths causing contention if there are quite a few of them which is a less
> > common problem.
> >
> > This patch series is relatively straight forward.
> >
> > All it does is add the ability to take advantage of VFS negative dentry
> > caching to avoid needless dentry alloc/free cycles for lookups of paths
> > that don't exit and change the kernfs_mutex to a read/write semaphore.
> >
> > The patch that tried to stay in VFS rcu-walk mode during path walks has
> > been dropped for two reasons. First, it doesn't actually give very much
> > improvement and, second, if there's a place where mistakes could go
> > unnoticed it would be in that path. This makes the patch series simpler
> > to review and reduces the likelihood of problems going unnoticed and
> > popping up later.
> >
> > The patch to use a revision to identify if a directory has changed has
> > also been dropped. If the directory has changed the dentry revision
> > needs to be updated to avoid subsequent rb tree searches and after
> > changing to use a read/write semaphore the update also requires a lock.
> > But the d_lock is the only lock available at this point which might
> > itself be contended.
> >
> > Changes since v3:
> > - remove unneeded indirection when referencing the super block.
> > - check if inode attribute update is actually needed.
> >
> > Changes since v2:
> > - actually fix the inode attribute update locking.
> > - drop the patch that tried to stay in rcu-walk mode.
> > - drop the use a revision to identify if a directory has changed patch.
> >
> > Changes since v1:
> > - fix locking in .permission() and .getattr() by re-factoring the attribute
> >   handling code.
> > ---
> >
> > Ian Kent (5):
> >       kernfs: move revalidate to be near lookup
> >       kernfs: use VFS negative dentry caching
> >       kernfs: switch kernfs to use an rwsem
> >       kernfs: use i_lock to protect concurrent inode updates
> >       kernfs: add kernfs_need_inode_refresh()
> >
> >
> >  fs/kernfs/dir.c             | 170 ++++++++++++++++++++----------------
> >  fs/kernfs/file.c            |   4 +-
> >  fs/kernfs/inode.c           |  45 ++++++++--
> >  fs/kernfs/kernfs-internal.h |   5 +-
> >  fs/kernfs/mount.c           |  12 +--
> >  fs/kernfs/symlink.c         |   4 +-
> >  include/linux/kernfs.h      |   2 +-
> >  7 files changed, 147 insertions(+), 95 deletions(-)
> >
> > --
> > Ian
> >
>
> Any benchmark numbers that you ran that are better/worse with this patch
> series?  That woul dbe good to know, otherwise you aren't changing
> functionality here, so why would we take these changes?  :)

Let me run it on my benchmark and bring the results back to you.

> thanks,
>
> greg k-h


thanks,
fox
Fox Chen May 12, 2021, 8:47 a.m. UTC | #3
Hi,

I ran it on my benchmark (https://github.com/foxhlchen/sysfs_benchmark).

machine: aws c5 (Intel Xeon with 96 logical cores)
kernel: v5.12
benchmark: create 96 threads, bind each one to its own core, then run an
open+read+close loop on a sysfs file in all of them simultaneously, 1000
times (a rough sketch of the loop is included below).
result:
Without the patchset, an open+read+close operation takes 550-570 us and
perf shows significant time (>40%) spent in mutex_lock.
After applying it, that operation takes 410-440 us and perf shows only
~4% of the time in mutex_lock.

It's weird: I don't see a huge performance boost compared to v2, even
though the perf report no longer shows a mutex problem.
I've put the console outputs and perf reports in the attachment for your
reference.
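
For readers without the repo handy, a rough sketch of that loop (not the
actual benchmark code; the sysfs path is just an example and the
timing/perf collection is omitted) looks like:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#define NTHREADS 96
#define LOOPS    1000

static const char *path = "/sys/devices/system/cpu/cpu0/online"; /* example */

static void *worker(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;
	char buf[64];

	/* pin this thread to one logical core */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (int i = 0; i < LOOPS; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			continue;
		read(fd, buf, sizeof(buf));
		close(fd);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}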


thanks,
fox
Fox Chen May 12, 2021, 8:54 a.m. UTC | #4
On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> wrote:
>
> Hi,
>
> I ran it on my benchmark (https://github.com/foxhlchen/sysfs_benchmark).
>
> machine: aws c5 (Intel Xeon with 96 logical cores)
> kernel: v5.12
> benchmark: create 96 threads and bind them to each core then run
> open+read+close on a sysfs file simultaneously for 1000 times.
> result:
> Without the patchset, an open+read+close operation takes 550-570 us,
> perf shows significant time(>40%) spending on mutex_lock.
> After applying it, it takes 410-440 us for that operation and perf
> shows only ~4% time on mutex_lock.
>
> It's weird, I don't see a huge performance boost compared to v2, even

I meant I don't see a huge performance boost here and it's way worse than v2.
IIRC, for v2 the fastest run only took 40us.


> though there is no mutex problem from the perf report.
> I've put console outputs and perf reports on the attachment for your reference.
>
>
> thanks,
> fox

fox
Ian Kent May 13, 2021, 1:50 p.m. UTC | #5
On Wed, 2021-05-12 at 08:21 +0200, Greg Kroah-Hartman wrote:
> On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote:
> > There have been a few instances of contention on the kernfs_mutex
> > during
> > path walks, a case on very large IBM systems seen by myself, a
> > report by
> > Brice Goglin and followed up by Fox Chen, and I've since seen a
> > couple
> > of other reports by CoreOS users.
> > 
> > The common thread is a large number of kernfs path walks leading to
> > slowness of path walks due to kernfs_mutex contention.
> > 
> > The problem being that changes to the VFS over some time have
> > increased
> > it's concurrency capabilities to an extent that kernfs's use of a
> > mutex
> > is no longer appropriate. There's also an issue of walks for non-
> > existent
> > paths causing contention if there are quite a few of them which is
> > a less
> > common problem.
> > 
> > This patch series is relatively straight forward.
> > 
> > All it does is add the ability to take advantage of VFS negative
> > dentry
> > caching to avoid needless dentry alloc/free cycles for lookups of
> > paths
> > that don't exit and change the kernfs_mutex to a read/write
> > semaphore.
> > 
> > The patch that tried to stay in VFS rcu-walk mode during path walks
> > has
> > been dropped for two reasons. First, it doesn't actually give very
> > much
> > improvement and, second, if there's a place where mistakes could go
> > unnoticed it would be in that path. This makes the patch series
> > simpler
> > to review and reduces the likelihood of problems going unnoticed
> > and
> > popping up later.
> > 
> > The patch to use a revision to identify if a directory has changed
> > has
> > also been dropped. If the directory has changed the dentry revision
> > needs to be updated to avoid subsequent rb tree searches and after
> > changing to use a read/write semaphore the update also requires a
> > lock.
> > But the d_lock is the only lock available at this point which might
> > itself be contended.
> > 
> > Changes since v3:
> > - remove unneeded indirection when referencing the super block.
> > - check if inode attribute update is actually needed.
> > 
> > Changes since v2:
> > - actually fix the inode attribute update locking.
> > - drop the patch that tried to stay in rcu-walk mode.
> > - drop the use a revision to identify if a directory has changed
> > patch.
> > 
> > Changes since v1:
> > - fix locking in .permission() and .getattr() by re-factoring the
> > attribute
> >   handling code.
> > ---
> > 
> > Ian Kent (5):
> >       kernfs: move revalidate to be near lookup
> >       kernfs: use VFS negative dentry caching
> >       kernfs: switch kernfs to use an rwsem
> >       kernfs: use i_lock to protect concurrent inode updates
> >       kernfs: add kernfs_need_inode_refresh()
> > 
> > 
> >  fs/kernfs/dir.c             | 170 ++++++++++++++++++++------------
> > ----
> >  fs/kernfs/file.c            |   4 +-
> >  fs/kernfs/inode.c           |  45 ++++++++--
> >  fs/kernfs/kernfs-internal.h |   5 +-
> >  fs/kernfs/mount.c           |  12 +--
> >  fs/kernfs/symlink.c         |   4 +-
> >  include/linux/kernfs.h      |   2 +-
> >  7 files changed, 147 insertions(+), 95 deletions(-)
> > 
> > --
> > Ian
> > 
> 
> Any benchmark numbers that you ran that are better/worse with this
> patch
> series?  That woul dbe good to know, otherwise you aren't changing
> functionality here, so why would we take these changes?  :)

Hi Greg,

I'm sorry, I don't have a benchmark.

My continued work on this has been driven by the report from
Brice Goglin and Fox Chen, and also because I've seen a couple
of other reports of kernfs_mutex contention that are resolved
by the series.

Unfortunately the two reports I've seen fairly recently are on
kernels that are about as far away from the upstream kernel
as you can get so probably aren't useful in making my case.

The report I've worked on most recently is on CoreOS/Kubernetes
(based on RHEL-8.3) where the machine load goes to around 870
after loading what they call an OpenShift performance profile.

I looked at some sysrq dumps and they show a seemingly endless
number of processes waiting on the kernfs_mutex.

I tried to look at the Kubernetes source, but it's written in Go,
so a lot of time would be needed to work out what's going on;
I'm trying to find someone to help with that.

All I can say from looking at the Kubernetes source is it has
quite a few sysfs paths in it so I assume it uses sysfs fairly
heavily.

The other problem I saw was also on CoreOS/Kubernetes.
A vmcore analysis showed kernfs_mutex contention, but with a
different set of processes and not as significant as the former
problem.

So, even though this isn't against the current upstream, there
isn't much difference in the kernfs/sysfs source between those
two kernels and given the results of Brice and Fox I fear I'll
be seeing more of this as time goes by.

I'm fairly confident that the user space applications aren't
optimal (although you may have stronger words for it than that).
I was hoping you would agree that it's sensible for the kernel
to protect itself to the extent that it can, provided the change
is straightforward enough.

I have tried to make the patches as simple as possible without
losing much of the improvement, to minimize any potential ongoing
maintenance burden.

So, I'm sorry I can't offer you more incentive to consider the
series, but I remain hopeful you will.

If there is anything you would like me to follow up on please
ask and, if I can, I will do what's requested.

Ian
Ian Kent May 13, 2021, 2:10 p.m. UTC | #6
On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> wrote:
> > 
> > Hi,
> > 
> > I ran it on my benchmark (
> > https://github.com/foxhlchen/sysfs_benchmark).
> > 
> > machine: aws c5 (Intel Xeon with 96 logical cores)
> > kernel: v5.12
> > benchmark: create 96 threads and bind them to each core then run
> > open+read+close on a sysfs file simultaneously for 1000 times.
> > result:
> > Without the patchset, an open+read+close operation takes 550-570
> > us,
> > perf shows significant time(>40%) spending on mutex_lock.
> > After applying it, it takes 410-440 us for that operation and perf
> > shows only ~4% time on mutex_lock.
> > 
> > It's weird, I don't see a huge performance boost compared to v2,
> > even
> 
> I meant I don't see a huge performance boost here and it's way worse
> than v2.
> IIRC, for v2 fastest one only takes 40us

Thanks Fox,

I'll have a look at those reports but this is puzzling.

Perhaps the added overhead of checking whether an update is
needed costs more than expected, more than just taking the
lock and being done with it. Then there's the v2 series ...
I'll see if I can dig out your reports on that too.

> 
> 
> > though there is no mutex problem from the perf report.
> > I've put console outputs and perf reports on the attachment for
> > your reference.

Yep, thanks.
Ian
Greg Kroah-Hartman May 13, 2021, 3:19 p.m. UTC | #7
On Thu, May 13, 2021 at 09:50:19PM +0800, Ian Kent wrote:
> On Wed, 2021-05-12 at 08:21 +0200, Greg Kroah-Hartman wrote:
> > On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote:
> > > There have been a few instances of contention on the kernfs_mutex
> > > during
> > > path walks, a case on very large IBM systems seen by myself, a
> > > report by
> > > Brice Goglin and followed up by Fox Chen, and I've since seen a
> > > couple
> > > of other reports by CoreOS users.
> > > 
> > > The common thread is a large number of kernfs path walks leading to
> > > slowness of path walks due to kernfs_mutex contention.
> > > 
> > > The problem being that changes to the VFS over some time have
> > > increased
> > > it's concurrency capabilities to an extent that kernfs's use of a
> > > mutex
> > > is no longer appropriate. There's also an issue of walks for non-
> > > existent
> > > paths causing contention if there are quite a few of them which is
> > > a less
> > > common problem.
> > > 
> > > This patch series is relatively straight forward.
> > > 
> > > All it does is add the ability to take advantage of VFS negative
> > > dentry
> > > caching to avoid needless dentry alloc/free cycles for lookups of
> > > paths
> > > that don't exit and change the kernfs_mutex to a read/write
> > > semaphore.
> > > 
> > > The patch that tried to stay in VFS rcu-walk mode during path walks
> > > has
> > > been dropped for two reasons. First, it doesn't actually give very
> > > much
> > > improvement and, second, if there's a place where mistakes could go
> > > unnoticed it would be in that path. This makes the patch series
> > > simpler
> > > to review and reduces the likelihood of problems going unnoticed
> > > and
> > > popping up later.
> > > 
> > > The patch to use a revision to identify if a directory has changed
> > > has
> > > also been dropped. If the directory has changed the dentry revision
> > > needs to be updated to avoid subsequent rb tree searches and after
> > > changing to use a read/write semaphore the update also requires a
> > > lock.
> > > But the d_lock is the only lock available at this point which might
> > > itself be contended.
> > > 
> > > Changes since v3:
> > > - remove unneeded indirection when referencing the super block.
> > > - check if inode attribute update is actually needed.
> > > 
> > > Changes since v2:
> > > - actually fix the inode attribute update locking.
> > > - drop the patch that tried to stay in rcu-walk mode.
> > > - drop the use a revision to identify if a directory has changed
> > > patch.
> > > 
> > > Changes since v1:
> > > - fix locking in .permission() and .getattr() by re-factoring the
> > > attribute
> > >   handling code.
> > > ---
> > > 
> > > Ian Kent (5):
> > >       kernfs: move revalidate to be near lookup
> > >       kernfs: use VFS negative dentry caching
> > >       kernfs: switch kernfs to use an rwsem
> > >       kernfs: use i_lock to protect concurrent inode updates
> > >       kernfs: add kernfs_need_inode_refresh()
> > > 
> > > 
> > >  fs/kernfs/dir.c             | 170 ++++++++++++++++++++------------
> > > ----
> > >  fs/kernfs/file.c            |   4 +-
> > >  fs/kernfs/inode.c           |  45 ++++++++--
> > >  fs/kernfs/kernfs-internal.h |   5 +-
> > >  fs/kernfs/mount.c           |  12 +--
> > >  fs/kernfs/symlink.c         |   4 +-
> > >  include/linux/kernfs.h      |   2 +-
> > >  7 files changed, 147 insertions(+), 95 deletions(-)
> > > 
> > > --
> > > Ian
> > > 
> > 
> > Any benchmark numbers that you ran that are better/worse with this
> > patch
> > series?  That woul dbe good to know, otherwise you aren't changing
> > functionality here, so why would we take these changes?  :)
> 
> Hi Greg,
> 
> I'm sorry, I don't have a benchmark.
> 
> My continued work on this has been driven by the report from
> Brice Goglin and Fox Chen, and also because I've seen a couple
> of other reports of kernfs_mutex contention that is resolved
> by the series.
> 
> Unfortunately the two reports I've seen fairly recently are on
> kernels that are about as far away from the upstream kernel
> as you can get so probably aren't useful in making my case.
> 
> The report I've worked on most recently is on CoreOS/Kunbernetes
> (based on RHEL-8.3) where the machine load goes to around 870
> after loading what they call an OpenShift performance profile.
> 
> I looked at some sysreq dumps and they have a seemingly endless
> number of processes waiting on the kernfs_mutex.
> 
> I tried to look at the Kubernetes source but it's written in
> go so there would need to be a lot of time spent to work out
> what's going on, I'm trying to find someone to help with that.
> 
> All I can say from looking at the Kubernetes source is it has
> quite a few sysfs paths in it so I assume it uses sysfs fairly
> heavily.
> 
> The other problem I saw was also on CoreOS/Kunernetes.
> A vmcore analysis showed kernfs_mutex contention but with a
> different set of processes and not as significant as the former
> problem.
> 
> So, even though this isn't against the current upstream, there
> isn't much difference in the kernfs/sysfs source between those
> two kernels and given the results of Brice and Fox I fear I'll
> be seeing more of this as time goes by.
> 
> I'm fairly confident that the user space applications aren't
> optimal (although you may have stronger words for it than that)
> I was hoping you would agree that it's sensible for the kernel
> to protect itself to the extent that it can provided the change
> is straight forward enough.
> 
> I have tried to make the patches as simple as possible without
> loosing much of the improvement to minimize any potential ongoing
> maintenance burden.
> 
> So, I'm sorry I can't offer you more incentive to consider the
> series, but I remain hopeful you will.

At the very least, if you could test the series on those "older" systems
and say "booting went from X seconds to Y seconds!", that would help.

Otherwise, while changes are nice, without a real-world test that this
actually made any difference at all, why would we take these changes?

thanks,

greg k-h
Fox Chen May 13, 2021, 3:37 p.m. UTC | #8
Hi Ian

On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> wrote:
>
> On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I ran it on my benchmark (
> > > https://github.com/foxhlchen/sysfs_benchmark).
> > >
> > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > kernel: v5.12
> > > benchmark: create 96 threads and bind them to each core then run
> > > open+read+close on a sysfs file simultaneously for 1000 times.
> > > result:
> > > Without the patchset, an open+read+close operation takes 550-570
> > > us,
> > > perf shows significant time(>40%) spending on mutex_lock.
> > > After applying it, it takes 410-440 us for that operation and perf
> > > shows only ~4% time on mutex_lock.
> > >
> > > It's weird, I don't see a huge performance boost compared to v2,
> > > even
> >
> > I meant I don't see a huge performance boost here and it's way worse
> > than v2.
> > IIRC, for v2 fastest one only takes 40us
>
> Thanks Fox,
>
> I'll have a look at those reports but this is puzzling.
>
> Perhaps the added overhead of the check if an update is
> needed is taking more than expected and more than just
> taking the lock and being done with it. Then there's
> the v2 series ... I'll see if I can dig out your reports
> on those too.

Apologies, I was mistaken: it's compared to V3, not V2.  The previous
benchmark report is here:
https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/

> >
> >
> > > though there is no mutex problem from the perf report.
> > > I've put console outputs and perf reports on the attachment for
> > > your reference.
>
> Yep, thanks.
> Ian
>

thanks,
fox
Ian Kent May 14, 2021, 1:02 a.m. UTC | #9
On Thu, 2021-05-13 at 17:19 +0200, Greg Kroah-Hartman wrote:
> On Thu, May 13, 2021 at 09:50:19PM +0800, Ian Kent wrote:
> > On Wed, 2021-05-12 at 08:21 +0200, Greg Kroah-Hartman wrote:
> > > On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote:
> > > > There have been a few instances of contention on the
> > > > kernfs_mutex
> > > > during
> > > > path walks, a case on very large IBM systems seen by myself, a
> > > > report by
> > > > Brice Goglin and followed up by Fox Chen, and I've since seen a
> > > > couple
> > > > of other reports by CoreOS users.
> > > > 
> > > > The common thread is a large number of kernfs path walks
> > > > leading to
> > > > slowness of path walks due to kernfs_mutex contention.
> > > > 
> > > > The problem being that changes to the VFS over some time have
> > > > increased
> > > > it's concurrency capabilities to an extent that kernfs's use of
> > > > a
> > > > mutex
> > > > is no longer appropriate. There's also an issue of walks for
> > > > non-
> > > > existent
> > > > paths causing contention if there are quite a few of them which
> > > > is
> > > > a less
> > > > common problem.
> > > > 
> > > > This patch series is relatively straight forward.
> > > > 
> > > > All it does is add the ability to take advantage of VFS
> > > > negative
> > > > dentry
> > > > caching to avoid needless dentry alloc/free cycles for lookups
> > > > of
> > > > paths
> > > > that don't exit and change the kernfs_mutex to a read/write
> > > > semaphore.
> > > > 
> > > > The patch that tried to stay in VFS rcu-walk mode during path
> > > > walks
> > > > has
> > > > been dropped for two reasons. First, it doesn't actually give
> > > > very
> > > > much
> > > > improvement and, second, if there's a place where mistakes
> > > > could go
> > > > unnoticed it would be in that path. This makes the patch series
> > > > simpler
> > > > to review and reduces the likelihood of problems going
> > > > unnoticed
> > > > and
> > > > popping up later.
> > > > 
> > > > The patch to use a revision to identify if a directory has
> > > > changed
> > > > has
> > > > also been dropped. If the directory has changed the dentry
> > > > revision
> > > > needs to be updated to avoid subsequent rb tree searches and
> > > > after
> > > > changing to use a read/write semaphore the update also requires
> > > > a
> > > > lock.
> > > > But the d_lock is the only lock available at this point which
> > > > might
> > > > itself be contended.
> > > > 
> > > > Changes since v3:
> > > > - remove unneeded indirection when referencing the super block.
> > > > - check if inode attribute update is actually needed.
> > > > 
> > > > Changes since v2:
> > > > - actually fix the inode attribute update locking.
> > > > - drop the patch that tried to stay in rcu-walk mode.
> > > > - drop the use a revision to identify if a directory has
> > > > changed
> > > > patch.
> > > > 
> > > > Changes since v1:
> > > > - fix locking in .permission() and .getattr() by re-factoring
> > > > the
> > > > attribute
> > > >   handling code.
> > > > ---
> > > > 
> > > > Ian Kent (5):
> > > >       kernfs: move revalidate to be near lookup
> > > >       kernfs: use VFS negative dentry caching
> > > >       kernfs: switch kernfs to use an rwsem
> > > >       kernfs: use i_lock to protect concurrent inode updates
> > > >       kernfs: add kernfs_need_inode_refresh()
> > > > 
> > > > 
> > > >  fs/kernfs/dir.c             | 170 ++++++++++++++++++++--------
> > > > ----
> > > > ----
> > > >  fs/kernfs/file.c            |   4 +-
> > > >  fs/kernfs/inode.c           |  45 ++++++++--
> > > >  fs/kernfs/kernfs-internal.h |   5 +-
> > > >  fs/kernfs/mount.c           |  12 +--
> > > >  fs/kernfs/symlink.c         |   4 +-
> > > >  include/linux/kernfs.h      |   2 +-
> > > >  7 files changed, 147 insertions(+), 95 deletions(-)
> > > > 
> > > > --
> > > > Ian
> > > > 
> > > 
> > > Any benchmark numbers that you ran that are better/worse with
> > > this
> > > patch
> > > series?  That woul dbe good to know, otherwise you aren't
> > > changing
> > > functionality here, so why would we take these changes?  :)
> > 
> > Hi Greg,
> > 
> > I'm sorry, I don't have a benchmark.
> > 
> > My continued work on this has been driven by the report from
> > Brice Goglin and Fox Chen, and also because I've seen a couple
> > of other reports of kernfs_mutex contention that is resolved
> > by the series.
> > 
> > Unfortunately the two reports I've seen fairly recently are on
> > kernels that are about as far away from the upstream kernel
> > as you can get so probably aren't useful in making my case.
> > 
> > The report I've worked on most recently is on CoreOS/Kunbernetes
> > (based on RHEL-8.3) where the machine load goes to around 870
> > after loading what they call an OpenShift performance profile.
> > 
> > I looked at some sysreq dumps and they have a seemingly endless
> > number of processes waiting on the kernfs_mutex.
> > 
> > I tried to look at the Kubernetes source but it's written in
> > go so there would need to be a lot of time spent to work out
> > what's going on, I'm trying to find someone to help with that.
> > 
> > All I can say from looking at the Kubernetes source is it has
> > quite a few sysfs paths in it so I assume it uses sysfs fairly
> > heavily.
> > 
> > The other problem I saw was also on CoreOS/Kunernetes.
> > A vmcore analysis showed kernfs_mutex contention but with a
> > different set of processes and not as significant as the former
> > problem.
> > 
> > So, even though this isn't against the current upstream, there
> > isn't much difference in the kernfs/sysfs source between those
> > two kernels and given the results of Brice and Fox I fear I'll
> > be seeing more of this as time goes by.
> > 
> > I'm fairly confident that the user space applications aren't
> > optimal (although you may have stronger words for it than that)
> > I was hoping you would agree that it's sensible for the kernel
> > to protect itself to the extent that it can provided the change
> > is straight forward enough.
> > 
> > I have tried to make the patches as simple as possible without
> > loosing much of the improvement to minimize any potential ongoing
> > maintenance burden.
> > 
> > So, I'm sorry I can't offer you more incentive to consider the
> > series, but I remain hopeful you will.
> 
> At the very least, if you could test the series on those "older"
> systems
> and say "booting went from X seconds to Y seconds!".

The last test I did was on the system showing high load, and the
load went from around 870 to around 3. It completely resolved the
reported problem.

I need to have the current patches re-tested, and that can take a
while, and I need to look at Fox's results. I'm thinking the
additional patch in v4 is probably not needed.

Ian
Ian Kent May 14, 2021, 1:34 a.m. UTC | #10
On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> Hi Ian
> 
> On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> wrote:
> > 
> > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com>
> > > wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > I ran it on my benchmark (
> > > > https://github.com/foxhlchen/sysfs_benchmark).
> > > > 
> > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > kernel: v5.12
> > > > benchmark: create 96 threads and bind them to each core then
> > > > run
> > > > open+read+close on a sysfs file simultaneously for 1000 times.
> > > > result:
> > > > Without the patchset, an open+read+close operation takes 550-
> > > > 570
> > > > us,
> > > > perf shows significant time(>40%) spending on mutex_lock.
> > > > After applying it, it takes 410-440 us for that operation and
> > > > perf
> > > > shows only ~4% time on mutex_lock.
> > > > 
> > > > It's weird, I don't see a huge performance boost compared to
> > > > v2,
> > > > even
> > > 
> > > I meant I don't see a huge performance boost here and it's way
> > > worse
> > > than v2.
> > > IIRC, for v2 fastest one only takes 40us
> > 
> > Thanks Fox,
> > 
> > I'll have a look at those reports but this is puzzling.
> > 
> > Perhaps the added overhead of the check if an update is
> > needed is taking more than expected and more than just
> > taking the lock and being done with it. Then there's
> > the v2 series ... I'll see if I can dig out your reports
> > on those too.
> 
> Apologies, I was mistaken, it's compared to V3, not V2.  The previous
> benchmark report is here.
> https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/

Are all these tests using a single file name in the open/read/close
loop?

That being the case, the per-object inode lock will behave like a
mutex, and once contention occurs any speed benefit of a spinlock
over a mutex (or rwsem) will disappear.

In this case, changing from a write lock to a read lock in those
functions and adding the per-inode lock does nothing but add the
overhead of taking the read lock. Similarly, adding the update
check function also just adds overhead and, as we see, once
contention starts it has a cumulative effect that's often not
linear.
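
To make the pattern being discussed concrete, it is roughly the
following (an illustrative sketch only, not the actual patch, and the
check function's signature here is assumed):

static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
	down_read(&kernfs_rwsem);	/* shared: concurrent walks allowed */
	spin_lock(&inode->i_lock);	/* per-object: serialize this update */
	if (kernfs_need_inode_refresh(kn, inode)) {	/* signature assumed */
		/* copy kn->iattr (uid/gid/mode/times) into the VFS inode */
	}
	spin_unlock(&inode->i_lock);
	up_read(&kernfs_rwsem);
}

With a single shared file every thread hits the same inode, so the
i_lock serializes them just as the old mutex did, and the read lock
plus the check only add work on top of that.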

The whole idea of a read lock/per-object spin lock was to reduce
the possibility of contention for paths other than the same path,
while not impacting same-path accesses too much, for an overall
gain. Based on this I'm thinking the update check function is
probably not worth keeping; it just adds unnecessary churn and
has a negative impact for same-file contention access patterns.

I think that using multiple paths, at least one per test process
(so if you are running 16 processes use at least 16 different
files, the same set in each process), and selecting one at random
for each iteration of the open loop would better simulate real-world
access patterns.
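
For example, a drop-in variant of the worker() from the benchmark
sketch earlier in the thread (the paths here are only placeholders,
and rand_r() needs <stdlib.h> in addition to the headers used there):

static const char *paths[] = {
	"/sys/devices/system/cpu/cpu0/online",	/* placeholders, use at */
	"/sys/devices/system/cpu/cpu1/online",	/* least one per thread */
	/* ... */
};

static void *worker(void *arg)
{
	unsigned int seed = (unsigned long)arg + 1;	/* per-thread PRNG state */
	char buf[64];

	/* CPU pinning as before, omitted here */
	for (int i = 0; i < LOOPS; i++) {
		const char *p = paths[rand_r(&seed) %
				      (sizeof(paths) / sizeof(paths[0]))];
		int fd = open(p, O_RDONLY);

		if (fd < 0)
			continue;
		read(fd, buf, sizeof(buf));
		close(fd);
	}
	return NULL;
}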


Ian
Fox Chen May 14, 2021, 2:34 a.m. UTC | #11
On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote:
>
> On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> > Hi Ian
> >
> > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> wrote:
> > >
> > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I ran it on my benchmark (
> > > > > https://github.com/foxhlchen/sysfs_benchmark).
> > > > >
> > > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > > kernel: v5.12
> > > > > benchmark: create 96 threads and bind them to each core then
> > > > > run
> > > > > open+read+close on a sysfs file simultaneously for 1000 times.
> > > > > result:
> > > > > Without the patchset, an open+read+close operation takes 550-
> > > > > 570
> > > > > us,
> > > > > perf shows significant time(>40%) spending on mutex_lock.
> > > > > After applying it, it takes 410-440 us for that operation and
> > > > > perf
> > > > > shows only ~4% time on mutex_lock.
> > > > >
> > > > > It's weird, I don't see a huge performance boost compared to
> > > > > v2,
> > > > > even
> > > >
> > > > I meant I don't see a huge performance boost here and it's way
> > > > worse
> > > > than v2.
> > > > IIRC, for v2 fastest one only takes 40us
> > >
> > > Thanks Fox,
> > >
> > > I'll have a look at those reports but this is puzzling.
> > >
> > > Perhaps the added overhead of the check if an update is
> > > needed is taking more than expected and more than just
> > > taking the lock and being done with it. Then there's
> > > the v2 series ... I'll see if I can dig out your reports
> > > on those too.
> >
> > Apologies, I was mistaken, it's compared to V3, not V2.  The previous
> > benchmark report is here.
> > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/
>
> Are all these tests using a single file name in the open/read/close
> loop?

Yes, because it's easy to implement yet enough to trigger the
mutex_lock contention.

And you are right, it's not a real-life pattern, but on the bright
side, it proves the original mutex_lock problem is gone. :)

> That being the case the per-object inode lock will behave like a
> mutex and once contention occurs any speed benefits of a spinlock
> over a mutex (or rwsem) will disappear.
>
> In this case changing from a write lock to a read lock in those
> functions and adding the inode mutex will do nothing but add the
> overhead of taking the read lock. And similarly adding the update
> check function also just adds overhead and, as we see, once
> contention starts it has a cumulative effect that's often not
> linear.
>
> The whole idea of a read lock/per-object spin lock was to reduce
> the possibility of contention for paths other than the same path
> while not impacting same path accesses too much for an overall
> gain. Based on this I'm thinking the update check function is
> probably not worth keeping, it just adds unnecessary churn and
> has a negative impact for same file contention access patterns.
>
> I think that using multiple paths, at least one per test process
> (so if you are running 16 processes use at least 16 different
> files, the same in each process), and selecting one at random
> for each loop of the open would better simulate real world
> access patterns.
>
>
> Ian
>


thanks,
fox
Ian Kent May 17, 2021, 1:32 a.m. UTC | #12
On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote:
> On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote:
> > 
> > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> > > Hi Ian
> > > 
> > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > 
> > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen
> > > > > <foxhlchen@gmail.com>
> > > > > wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > I ran it on my benchmark (
> > > > > > https://github.com/foxhlchen/sysfs_benchmark).
> > > > > > 
> > > > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > > > kernel: v5.12
> > > > > > benchmark: create 96 threads and bind them to each core
> > > > > > then
> > > > > > run
> > > > > > open+read+close on a sysfs file simultaneously for 1000
> > > > > > times.
> > > > > > result:
> > > > > > Without the patchset, an open+read+close operation takes
> > > > > > 550-
> > > > > > 570
> > > > > > us,
> > > > > > perf shows significant time(>40%) spending on mutex_lock.
> > > > > > After applying it, it takes 410-440 us for that operation
> > > > > > and
> > > > > > perf
> > > > > > shows only ~4% time on mutex_lock.
> > > > > > 
> > > > > > It's weird, I don't see a huge performance boost compared
> > > > > > to
> > > > > > v2,
> > > > > > even
> > > > > 
> > > > > I meant I don't see a huge performance boost here and it's
> > > > > way
> > > > > worse
> > > > > than v2.
> > > > > IIRC, for v2 fastest one only takes 40us
> > > > 
> > > > Thanks Fox,
> > > > 
> > > > I'll have a look at those reports but this is puzzling.
> > > > 
> > > > Perhaps the added overhead of the check if an update is
> > > > needed is taking more than expected and more than just
> > > > taking the lock and being done with it. Then there's
> > > > the v2 series ... I'll see if I can dig out your reports
> > > > on those too.
> > > 
> > > Apologies, I was mistaken, it's compared to V3, not V2.  The
> > > previous
> > > benchmark report is here.
> > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/
> > 
> > Are all these tests using a single file name in the open/read/close
> > loop?
> 
> Yes,  because It's easy to implement yet enough to trigger the
> mutex_lock.
> 
> And you are right It's not a real-life pattern, but on the bright
> side, it proves there is no original mutex_lock problem anymore. :)

I've been looking at your reports and they are quite interesting.

> 
> > That being the case the per-object inode lock will behave like a
> > mutex and once contention occurs any speed benefits of a spinlock
> > over a mutex (or rwsem) will disappear.
> > 
> > In this case changing from a write lock to a read lock in those
> > functions and adding the inode mutex will do nothing but add the
> > overhead of taking the read lock. And similarly adding the update
> > check function also just adds overhead and, as we see, once
> > contention starts it has a cumulative effect that's often not
> > linear.
> > 
> > The whole idea of a read lock/per-object spin lock was to reduce
> > the possibility of contention for paths other than the same path
> > while not impacting same path accesses too much for an overall
> > gain. Based on this I'm thinking the update check function is
> > probably not worth keeping, it just adds unnecessary churn and
> > has a negative impact for same file contention access patterns.

The reports indicate (to me anyway) that the slowdown isn't
due to kernfs. It looks more like kernfs is now putting pressure
on the VFS, mostly on the file table lock, though there's also a
mild amount of contention on a few other locks now.

That's a whole different problem, and the file table handling
functions don't appear to have any obvious problems; they are
doing what they have to do and that can't be avoided.

That's definitely out of scope for these changes.

And, as you'd expect, once any appreciable amount of contention
happens our measurements go out the window, certainly with
respect to kernfs.

It also doesn't change my opinion that checking whether an inode
attribute update is needed in kernfs isn't useful since, IIUC,
that file table lock contention would occur even if you were
using different paths.

So I'll drop that patch from the series.

Ian
> > 
> > I think that using multiple paths, at least one per test process
> > (so if you are running 16 processes use at least 16 different
> > files, the same in each process), and selecting one at random
> > for each loop of the open would better simulate real world
> > access patterns.
> > 
> > 
> > Ian
> > 
> 
> 
> thanks,
> fox
Fox Chen May 18, 2021, 8:26 a.m. UTC | #13
On Mon, May 17, 2021 at 9:32 AM Ian Kent <raven@themaw.net> wrote:
>
> On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote:
> > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote:
> > >
> > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> > > > Hi Ian
> > > >
> > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > >
> > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen
> > > > > > <foxhlchen@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I ran it on my benchmark (
> > > > > > > https://github.com/foxhlchen/sysfs_benchmark).
> > > > > > >
> > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > > > > kernel: v5.12
> > > > > > > benchmark: create 96 threads and bind them to each core
> > > > > > > then
> > > > > > > run
> > > > > > > open+read+close on a sysfs file simultaneously for 1000
> > > > > > > times.
> > > > > > > result:
> > > > > > > Without the patchset, an open+read+close operation takes
> > > > > > > 550-
> > > > > > > 570
> > > > > > > us,
> > > > > > > perf shows significant time(>40%) spending on mutex_lock.
> > > > > > > After applying it, it takes 410-440 us for that operation
> > > > > > > and
> > > > > > > perf
> > > > > > > shows only ~4% time on mutex_lock.
> > > > > > >
> > > > > > > It's weird, I don't see a huge performance boost compared
> > > > > > > to
> > > > > > > v2,
> > > > > > > even
> > > > > >
> > > > > > I meant I don't see a huge performance boost here and it's
> > > > > > way
> > > > > > worse
> > > > > > than v2.
> > > > > > IIRC, for v2 fastest one only takes 40us
> > > > >
> > > > > Thanks Fox,
> > > > >
> > > > > I'll have a look at those reports but this is puzzling.
> > > > >
> > > > > Perhaps the added overhead of the check if an update is
> > > > > needed is taking more than expected and more than just
> > > > > taking the lock and being done with it. Then there's
> > > > > the v2 series ... I'll see if I can dig out your reports
> > > > > on those too.
> > > >
> > > > Apologies, I was mistaken, it's compared to V3, not V2.  The
> > > > previous
> > > > benchmark report is here.
> > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/
> > >
> > > Are all these tests using a single file name in the open/read/close
> > > loop?
> >
> > Yes,  because It's easy to implement yet enough to trigger the
> > mutex_lock.
> >
> > And you are right It's not a real-life pattern, but on the bright
> > side, it proves there is no original mutex_lock problem anymore. :)
>
> I've been looking at your reports and they are quite interesting.
>
> >
> > > That being the case the per-object inode lock will behave like a
> > > mutex and once contention occurs any speed benefits of a spinlock
> > > over a mutex (or rwsem) will disappear.
> > >
> > > In this case changing from a write lock to a read lock in those
> > > functions and adding the inode mutex will do nothing but add the
> > > overhead of taking the read lock. And similarly adding the update
> > > check function also just adds overhead and, as we see, once
> > > contention starts it has a cumulative effect that's often not
> > > linear.
> > >
> > > The whole idea of a read lock/per-object spin lock was to reduce
> > > the possibility of contention for paths other than the same path
> > > while not impacting same path accesses too much for an overall
> > > gain. Based on this I'm thinking the update check function is
> > > probably not worth keeping, it just adds unnecessary churn and
> > > has a negative impact for same file contention access patterns.
>
> The reports indicate (to me anyway) that the slowdown isn't
> due to kernfs. It looks more like kernfs is now putting pressure
> on the VFS, mostly on the file table lock but it looks like
> there's a mild amount of contention on a few other locks as well
> now.

That's correct. I ran my benchmark on ext4 and the result was similarly
slow. But it shouldn't be, because I remember testing it before and it
was very fast, and you can see the v3 result was much faster. So I ran
this benchmark again on an AWS c5a, which also has 96 cores but with
AMD CPUs. The result was amazing: the fastest run had a 10x boost
(~40us), very similar to the v3 one (see attachment). I guess my
previous benchmark of v3 was run on a c5a.

I can't figure out why it is so slow on Intel CPUs. I also tried
c5.metal, which runs on a physical machine, and the result is still
slow (~200us). But anyway, at least it shows this patchset solves the
mutex_lock problem and can bring up to a 10x boost on some occasions.

> That's a whole different problem and those file table handling
> functions don't appear to have any obvious problems so they are
> doing what they have to do and that can't be avoided.
>
> That's definitely out of scope for these changes.
>
> And, as you'd expect, once any appreciable amount of contention
> happens our measurements go out the window, certainly with
> respect to kernfs.
>
> It also doesn't change my option that checking if an inode
> attribute update is needed in kernfs isn't useful since, IIUC
> that file table lock contention would result even if you were
> using different paths.
>
> So I'll drop that patch from the series.
>
> Ian
> > >
> > > I think that using multiple paths, at least one per test process
> > > (so if you are running 16 processes use at least 16 different
> > > files, the same in each process), and selecting one at random
> > > for each loop of the open would better simulate real world
> > > access patterns.
> > >
> > >
> > > Ian
> > >
> >
> >
> > thanks,
> > fox
>
>


thanks,
fox
Ian Kent May 27, 2021, 1:23 a.m. UTC | #14
On Mon, 2021-05-17 at 09:32 +0800, Ian Kent wrote:
> On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote:
> > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote:
> > > 
> > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> > > > Hi Ian
> > > > 
> > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > 
> > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen
> > > > > > <foxhlchen@gmail.com>
> > > > > > wrote:
> > > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I ran it on my benchmark (
> > > > > > > https://github.com/foxhlchen/sysfs_benchmark).
> > > > > > > 
> > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > > > > kernel: v5.12
> > > > > > > benchmark: create 96 threads and bind them to each core
> > > > > > > then
> > > > > > > run
> > > > > > > open+read+close on a sysfs file simultaneously for 1000
> > > > > > > times.
> > > > > > > result:
> > > > > > > Without the patchset, an open+read+close operation takes
> > > > > > > 550-
> > > > > > > 570
> > > > > > > us,
> > > > > > > perf shows significant time(>40%) spending on mutex_lock.
> > > > > > > After applying it, it takes 410-440 us for that operation
> > > > > > > and
> > > > > > > perf
> > > > > > > shows only ~4% time on mutex_lock.
> > > > > > > 
> > > > > > > It's weird, I don't see a huge performance boost compared
> > > > > > > to
> > > > > > > v2,
> > > > > > > even
> > > > > > 
> > > > > > I meant I don't see a huge performance boost here and it's
> > > > > > way
> > > > > > worse
> > > > > > than v2.
> > > > > > IIRC, for v2 fastest one only takes 40us
> > > > > 
> > > > > Thanks Fox,
> > > > > 
> > > > > I'll have a look at those reports but this is puzzling.
> > > > > 
> > > > > Perhaps the added overhead of the check if an update is
> > > > > needed is taking more than expected and more than just
> > > > > taking the lock and being done with it. Then there's
> > > > > the v2 series ... I'll see if I can dig out your reports
> > > > > on those too.
> > > > 
> > > > Apologies, I was mistaken, it's compared to V3, not V2.  The
> > > > previous
> > > > benchmark report is here.
> > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/
> > > 
> > > Are all these tests using a single file name in the
> > > open/read/close
> > > loop?
> > 
> > Yes,  because It's easy to implement yet enough to trigger the
> > mutex_lock.
> > 
> > And you are right It's not a real-life pattern, but on the bright
> > side, it proves there is no original mutex_lock problem anymore. :)
> 
> I've been looking at your reports and they are quite interesting.
> 
> > 
> > > That being the case the per-object inode lock will behave like a
> > > mutex and once contention occurs any speed benefits of a spinlock
> > > over a mutex (or rwsem) will disappear.
> > > 
> > > In this case changing from a write lock to a read lock in those
> > > functions and adding the inode mutex will do nothing but add the
> > > overhead of taking the read lock. And similarly adding the update
> > > check function also just adds overhead and, as we see, once
> > > contention starts it has a cumulative effect that's often not
> > > linear.
> > > 
> > > The whole idea of a read lock/per-object spin lock was to reduce
> > > the possibility of contention for paths other than the same path
> > > while not impacting same path accesses too much for an overall
> > > gain. Based on this I'm thinking the update check function is
> > > probably not worth keeping, it just adds unnecessary churn and
> > > has a negative impact for same file contention access patterns.
> 
> The reports indicate (to me anyway) that the slowdown isn't
> due to kernfs. It looks more like kernfs is now putting pressure
> on the VFS, mostly on the file table lock but it looks like
> there's a mild amount of contention on a few other locks as well
> now.
> 
> That's a whole different problem and those file table handling
> functions don't appear to have any obvious problems so they are
> doing what they have to do and that can't be avoided.
> 
> That's definitely out of scope for these changes.
> 
> And, as you'd expect, once any appreciable amount of contention
> happens our measurements go out the window, certainly with
> respect to kernfs.
> 
> It also doesn't change my option that checking if an inode
> attribute update is needed in kernfs isn't useful since, IIUC
> that file table lock contention would result even if you were
> using different paths.
> 
> So I'll drop that patch from the series.

It will probably not make any difference, but for completeness
and after some thought, I felt I should post what I think is
happening with the benchmark.

The benchmark application runs a pthreads thread for each CPU
and goes into a tight open/read/close loop to demonstrate the
contention that can occur on the kernfs mutex as the number of
CPUs grows.

But that tight open/read/close loop causes contention on the VFS
file table because the pthreads threads share the process file
table.
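
To illustrate why that matters (a simplified sketch, not the actual
fs/file.c code; find_next_free_fd() is a made-up placeholder): every
open() has to reserve a descriptor in the files_struct shared by all
threads of the process, and that reservation is serialized by
files->file_lock:

static int alloc_fd_sketch(struct files_struct *files)
{
	int fd;

	spin_lock(&files->file_lock);	/* one lock per process, shared by all threads */
	fd = find_next_free_fd(files);	/* placeholder for the real fd table search */
	/* mark fd as in use ... */
	spin_unlock(&files->file_lock);

	return fd;
}

A tight open/close loop in every thread of one process funnels through
this single per-process lock, and close() takes the same lock to
release the descriptor.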

So the poor performance is actually evidence the last patch is
doing what it's meant to do rather than evidence of a regression
with the series.

The benchmark is putting pressure on the process file table on
some hardware configurations but those critical sections are small
and there's really nothing obvious that can be done to improve the
file table locking.

It is, however, important to remember that the benchmark application's
access pattern is far from a normal access pattern.

So I don't see the need for a new revision of the series with the
last patch dropped.

If there's a change of heart and the series were to be merged, I'll
leave whether to include this last patch to the discretion of the
maintainer, as the bulk of the improvement comes from the earlier
patches anyway.

> 
> Ian
> > > 
> > > I think that using multiple paths, at least one per test process
> > > (so if you are running 16 processes use at least 16 different
> > > files, the same in each process), and selecting one at random
> > > for each loop of the open would better simulate real world
> > > access patterns.
> > > 
> > > 
> > > Ian
> > > 
> > 
> > 
> > thanks,
> > fox
>
Greg Kroah-Hartman May 27, 2021, 6:50 a.m. UTC | #15
On Thu, May 27, 2021 at 09:23:06AM +0800, Ian Kent wrote:
> On Mon, 2021-05-17 at 09:32 +0800, Ian Kent wrote:
> > On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote:
> > > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote:
> > > > 
> > > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> > > > > Hi Ian
> > > > > 
> > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > 
> > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen
> > > > > > > <foxhlchen@gmail.com>
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > I ran it on my benchmark (
> > > > > > > > https://github.com/foxhlchen/sysfs_benchmark).
> > > > > > > > 
> > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > > > > > kernel: v5.12
> > > > > > > > benchmark: create 96 threads and bind them to each core
> > > > > > > > then
> > > > > > > > run
> > > > > > > > open+read+close on a sysfs file simultaneously for 1000
> > > > > > > > times.
> > > > > > > > result:
> > > > > > > > Without the patchset, an open+read+close operation takes
> > > > > > > > 550-
> > > > > > > > 570
> > > > > > > > us,
> > > > > > > > perf shows significant time(>40%) spending on mutex_lock.
> > > > > > > > After applying it, it takes 410-440 us for that operation
> > > > > > > > and
> > > > > > > > perf
> > > > > > > > shows only ~4% time on mutex_lock.
> > > > > > > > 
> > > > > > > > It's weird, I don't see a huge performance boost compared
> > > > > > > > to
> > > > > > > > v2,
> > > > > > > > even
> > > > > > > 
> > > > > > > I meant I don't see a huge performance boost here and it's
> > > > > > > way
> > > > > > > worse
> > > > > > > than v2.
> > > > > > > IIRC, for v2 fastest one only takes 40us
> > > > > > 
> > > > > > Thanks Fox,
> > > > > > 
> > > > > > I'll have a look at those reports but this is puzzling.
> > > > > > 
> > > > > > Perhaps the added overhead of the check if an update is
> > > > > > needed is taking more than expected and more than just
> > > > > > taking the lock and being done with it. Then there's
> > > > > > the v2 series ... I'll see if I can dig out your reports
> > > > > > on those too.
> > > > > 
> > > > > Apologies, I was mistaken, it's compared to V3, not V2.  The
> > > > > previous
> > > > > benchmark report is here.
> > > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/
> > > > 
> > > > Are all these tests using a single file name in the
> > > > open/read/close
> > > > loop?
> > > 
> > > Yes,  because It's easy to implement yet enough to trigger the
> > > mutex_lock.
> > > 
> > > And you are right It's not a real-life pattern, but on the bright
> > > side, it proves there is no original mutex_lock problem anymore. :)
> > 
> > I've been looking at your reports and they are quite interesting.
> > 
> > > 
> > > > That being the case the per-object inode lock will behave like a
> > > > mutex and once contention occurs any speed benefits of a spinlock
> > > > over a mutex (or rwsem) will disappear.
> > > > 
> > > > In this case changing from a write lock to a read lock in those
> > > > functions and adding the inode mutex will do nothing but add the
> > > > overhead of taking the read lock. And similarly adding the update
> > > > check function also just adds overhead and, as we see, once
> > > > contention starts it has a cumulative effect that's often not
> > > > linear.
> > > > 
> > > > The whole idea of a read lock/per-object spin lock was to reduce
> > > > the possibility of contention for paths other than the same path
> > > > while not impacting same path accesses too much for an overall
> > > > gain. Based on this I'm thinking the update check function is
> > > > probably not worth keeping, it just adds unnecessary churn and
> > > > has a negative impact for same file contention access patterns.
> > 
> > The reports indicate (to me anyway) that the slowdown isn't
> > due to kernfs. It looks more like kernfs is now putting pressure
> > on the VFS, mostly on the file table lock but it looks like
> > there's a mild amount of contention on a few other locks as well
> > now.
> > 
> > That's a whole different problem and those file table handling
> > functions don't appear to have any obvious problems so they are
> > doing what they have to do and that can't be avoided.
> > 
> > That's definitely out of scope for these changes.
> > 
> > And, as you'd expect, once any appreciable amount of contention
> > happens our measurements go out the window, certainly with
> > respect to kernfs.
> > 
> > It also doesn't change my opinion that checking if an inode
> > attribute update is needed in kernfs isn't useful since, IIUC
> > that file table lock contention would result even if you were
> > using different paths.
> > 
> > So I'll drop that patch from the series.
> 
> It will probably not make any difference, but for completeness
> and after some thought, I felt I should post what I think is
> happening with the benchmark.
> 
> The benchmark application runs a pthreads thread for each CPU
> and goes into a tight open/read/close loop to demonstrate the
> contention that can occur on the kernfs mutex as the number of
> CPUs grows.
> 
> But that tight open/read/close loop causes contention on the VFS
> file table because the pthreads threads share the process file
> table.
> 
> So the poor performance is actually evidence that the last patch is
> doing what it's meant to do, rather than evidence of a regression
> in the series.
> 
> The benchmark is putting pressure on the process file table on
> some hardware configurations but those critical sections are small
> and there's really nothing obvious that can be done to improve the
> file table locking.
> 
> It is, however, important to remember that the benchmark application's
> access pattern is far from a normal access pattern.
> 
> So I don't see the need for a new revision of the series with the
> last patch dropped.
> 
> If there's a change of heart and the series were to be merged, I'll
> leave whether to include this last patch to the discretion of the
> maintainer, as the bulk of the improvement comes from the earlier
> patches anyway.

Can you please resubmit the series, it is long-gone from my review
queue.

thanks,

greg k-h
Ian Kent May 28, 2021, 5:45 a.m. UTC | #16
On Thu, 2021-05-27 at 08:50 +0200, Greg Kroah-Hartman wrote:
> On Thu, May 27, 2021 at 09:23:06AM +0800, Ian Kent wrote:
> > On Mon, 2021-05-17 at 09:32 +0800, Ian Kent wrote:
> > > On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote:
> > > > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > 
> > > > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> > > > > > Hi Ian
> > > > > > 
> > > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent
> > > > > > <raven@themaw.net>
> > > > > > wrote:
> > > > > > > 
> > > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen
> > > > > > > > <foxhlchen@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > I ran it on my benchmark (
> > > > > > > > > https://github.com/foxhlchen/sysfs_benchmark).
> > > > > > > > > 
> > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > > > > > > kernel: v5.12
> > > > > > > > > benchmark: create 96 threads and bind them to each
> > > > > > > > > core
> > > > > > > > > then
> > > > > > > > > run
> > > > > > > > > open+read+close on a sysfs file simultaneously for
> > > > > > > > > 1000
> > > > > > > > > times.
> > > > > > > > > result:
> > > > > > > > > Without the patchset, an open+read+close operation
> > > > > > > > > takes
> > > > > > > > > 550-
> > > > > > > > > 570
> > > > > > > > > us,
> > > > > > > > > perf shows significant time(>40%) spending on
> > > > > > > > > mutex_lock.
> > > > > > > > > After applying it, it takes 410-440 us for that
> > > > > > > > > operation
> > > > > > > > > and
> > > > > > > > > perf
> > > > > > > > > shows only ~4% time on mutex_lock.
> > > > > > > > > 
> > > > > > > > > It's weird, I don't see a huge performance boost
> > > > > > > > > compared
> > > > > > > > > to
> > > > > > > > > v2,
> > > > > > > > > even
> > > > > > > > 
> > > > > > > > I meant I don't see a huge performance boost here and
> > > > > > > > it's
> > > > > > > > way
> > > > > > > > worse
> > > > > > > > than v2.
> > > > > > > > IIRC, for v2 fastest one only takes 40us
> > > > > > > 
> > > > > > > Thanks Fox,
> > > > > > > 
> > > > > > > I'll have a look at those reports but this is puzzling.
> > > > > > > 
> > > > > > > Perhaps the added overhead of the check if an update is
> > > > > > > needed is taking more than expected and more than just
> > > > > > > taking the lock and being done with it. Then there's
> > > > > > > the v2 series ... I'll see if I can dig out your reports
> > > > > > > on those too.
> > > > > > 
> > > > > > Apologies, I was mistaken, it's compared to V3, not V2. 
> > > > > > The
> > > > > > previous
> > > > > > benchmark report is here.
> > > > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/
> > > > > 
> > > > > Are all these tests using a single file name in the
> > > > > open/read/close
> > > > > loop?
> > > > 
> > > > Yes,  because It's easy to implement yet enough to trigger the
> > > > mutex_lock.
> > > > 
> > > > And you are right It's not a real-life pattern, but on the
> > > > bright
> > > > side, it proves there is no original mutex_lock problem
> > > > anymore. :)
> > > 
> > > I've been looking at your reports and they are quite interesting.
> > > 
> > > > 
> > > > > That being the case the per-object inode lock will behave
> > > > > like a
> > > > > mutex and once contention occurs any speed benefits of a
> > > > > spinlock
> > > > > over a mutex (or rwsem) will disappear.
> > > > > 
> > > > > In this case changing from a write lock to a read lock in
> > > > > those
> > > > > functions and adding the inode mutex will do nothing but add
> > > > > the
> > > > > overhead of taking the read lock. And similarly adding the
> > > > > update
> > > > > check function also just adds overhead and, as we see, once
> > > > > contention starts it has a cumulative effect that's often not
> > > > > linear.
> > > > > 
> > > > > The whole idea of a read lock/per-object spin lock was to
> > > > > reduce
> > > > > the possibility of contention for paths other than the same
> > > > > path
> > > > > while not impacting same path accesses too much for an
> > > > > overall
> > > > > gain. Based on this I'm thinking the update check function is
> > > > > probably not worth keeping, it just adds unnecessary churn
> > > > > and
> > > > > has a negative impact for same file contention access
> > > > > patterns.
> > > 
> > > The reports indicate (to me anyway) that the slowdown isn't
> > > due to kernfs. It looks more like kernfs is now putting pressure
> > > on the VFS, mostly on the file table lock but it looks like
> > > there's a mild amount of contention on a few other locks as well
> > > now.
> > > 
> > > That's a whole different problem and those file table handling
> > > functions don't appear to have any obvious problems so they are
> > > doing what they have to do and that can't be avoided.
> > > 
> > > That's definitely out of scope for these changes.
> > > 
> > > And, as you'd expect, once any appreciable amount of contention
> > > happens our measurements go out the window, certainly with
> > > respect to kernfs.
> > > 
> > > It also doesn't change my opinion that checking if an inode
> > > attribute update is needed in kernfs isn't useful since, IIUC
> > > that file table lock contention would result even if you were
> > > using different paths.
> > > 
> > > So I'll drop that patch from the series.
> > 
> > It will probably not make any difference, but for completeness
> > and after some thought, I felt I should post what I think is
> > happening with the benchmark.
> > 
> > The benchmark application runs a pthreads thread for each CPU
> > and goes into a tight open/read/close loop to demonstrate the
> > contention that can occur on the kernfs mutex as the number of
> > CPUs grows.
> > 
> > But that tight open/read/close loop causes contention on the VFS
> > file table because the pthreads threads share the process file
> > table.
> > 
> > So the poor performance is actually evidence that the last patch is
> > doing what it's meant to do, rather than evidence of a regression
> > in the series.
> > 
> > The benchmark is putting pressure on the process file table on
> > some hardware configurations but those critical sections are small
> > and there's really nothing obvious that can be done to improve the
> > file table locking.
> > 
> > It is, however, important to remember that the benchmark application's
> > access pattern is far from a normal access pattern.
> > 
> > So I don't see the need for a new revision of the series with the
> > last patch dropped.
> > 
> > If there's a change of heart and the series were to be merged, I'll
> > leave whether to include this last patch to the discretion of the
> > maintainer, as the bulk of the improvement comes from the earlier
> > patches anyway.
> 
> Can you please resubmit the series, it is long-gone from my review
> queue.

Sure, will do as soon as I can.

Thanks
Ian