[0/6] kernfs: proposed locking and concurrency improvement

Message ID: 160862320263.291330.9467216031366035418.stgit@mickey.themaw.net (mailing list archive)
Message

Ian Kent Dec. 22, 2020, 7:47 a.m. UTC
Here is what I currently have for the patch series we were discussing
recently.

I'm interested to see how this goes with the problem you are seeing.

The last patch in the series (the attributes update patch) has seen
no more than compile testing, so I hope I haven't messed up on that.

Please keep in mind that Greg's continued silence on the question
of whether the series might be re-considered indicates to me that
he has not changed his position on this.

---

Ian Kent (6):
      kernfs: move revalidate to be near lookup
      kernfs: use VFS negative dentry caching
      kernfs: use revision to identify directory node changes
      kernfs: switch kernfs to use an rwsem
      kernfs: stay in rcu-walk mode if possible
      kernfs: add a spinlock to kernfs iattrs for inode updates


 fs/kernfs/dir.c             |  285 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   19 ++-
 fs/kernfs/kernfs-internal.h |   30 ++++-
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    5 +
 7 files changed, 238 insertions(+), 121 deletions(-)

--
Ian

Comments

Fox Chen Dec. 23, 2020, 7:30 a.m. UTC | #1
Hi Ian,

On Tue, Dec 22, 2020 at 3:47 PM Ian Kent <raven@themaw.net> wrote:
>
> Here is what I currently have for the patch series we were discussing
> recently.
>
> I'm interested to see how this goes with the problem you are seeing.
>
> The last patch in the series (the attributes update patch) has seen
> no more than compile testing, I hope I haven't messed up on that.

OK, I just ignored the last patch (kernfs: add a spinlock to kernfs
iattrs for inode updates) because I cannot boot the kernel after
applying it.

Applying the first five patches, my benchmark shows no improvement this time;
one open+read+close cycle takes 500us. :(
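
(Each measured cycle is essentially just an open+read+close of a sysfs
attribute, run concurrently from multiple threads; roughly like the
sketch below, with the path only as an illustration, not the actual
benchmark:)

#include <fcntl.h>
#include <unistd.h>

static void one_cycle(void)
{
        char buf[4096];
        /* illustrative sysfs path, not the one the benchmark uses */
        int fd = open("/sys/devices/system/cpu/cpu0/topology/core_id",
                      O_RDONLY);

        if (fd >= 0) {
                read(fd, buf, sizeof(buf));
                close(fd);
        }
}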

perf suggests that down_write in kernfs_iop_permission hurts performance
the most (full report in the attachment), which makes sense to me, since
down_write is no different from the mutex here: both allow only one
holder at a time.

sigh... It's really hard to fix.

   |
   |--51.81%--inode_permission
   |          |
   |           --51.36%--kernfs_iop_permission
   |                     |
   |                     |--50.41%--down_write
   |                     |          |
   |                     |           --50.32%--rwsem_down_write_slowpath
   |                     |                     |
   |                     |                      --49.67%--rwsem_optimistic_spin
   |                     |                                |
   |                     |                                 --49.31%--osq_lock
   |                     |
   |                      --0.74%--up_write
   |                                rwsem_wake.isra.0
   |                                |
   |                                 --0.69%--wake_up_q
   |                                           |
   |                                            --0.62%--try_to_wake_up
   |                                                      |
   |                                                       --0.54%--_raw_spin_lock_irqsave
   |                                                                 |
   |                                                                  --0.52%--native_queued
   |
    --29.90%--walk_component
              |
               --29.72%--lookup_fast
                         |
                          --29.51%--kernfs_dop_revalidate
                                    |
                                    |--28.88%--down_read
                                    |          |
                                    |           --28.75%--rwsem_down_read_slowpath
                                    |                     |
                                    |                      --28.50%--rwsem_optimistic_spin
                                    |                                |
                                    |                                 --28.38%--osq_lock
                                    |
                                     --0.59%--up_read
                                               rwsem_wake.isra.0
                                               |
                                                --0.54%--wake_up_q
                                                          |
                                                           --0.53%--try_to_wake_up

> Please keep in mind that Greg's continued silence on the question
> of whether the series might be re-considered indicates to me that
> he has not changed his position on this.

Since the problem doesn't break the system, it doesn't need immediate
attention, so I bet it's labeled with a low priority on Greg's to-do
list. And it's the merge window & holiday season. Let's just give him
more time. :)


> ---
>
> Ian Kent (6):
>       kernfs: move revalidate to be near lookup
>       kernfs: use VFS negative dentry caching
>       kernfs: use revision to identify directory node changes
>       kernfs: switch kernfs to use an rwsem
>       kernfs: stay in rcu-walk mode if possible
>       kernfs: add a spinlock to kernfs iattrs for inode updates
>
>
>  fs/kernfs/dir.c             |  285 ++++++++++++++++++++++++++++---------------
>  fs/kernfs/file.c            |    4 -
>  fs/kernfs/inode.c           |   19 ++-
>  fs/kernfs/kernfs-internal.h |   30 ++++-
>  fs/kernfs/mount.c           |   12 +-
>  fs/kernfs/symlink.c         |    4 -
>  include/linux/kernfs.h      |    5 +
>  7 files changed, 238 insertions(+), 121 deletions(-)
>
> --
> Ian
>


thanks,
fox
Ian Kent Jan. 4, 2021, 12:42 a.m. UTC | #2
On Wed, 2020-12-23 at 15:30 +0800, Fox Chen wrote:
> Hi Ian,
> 
> On Tue, Dec 22, 2020 at 3:47 PM Ian Kent <raven@themaw.net> wrote:
> > Here is what I currently have for the patch series we were
> > discussing
> > recently.
> > 
> > I'm interested to see how this goes with the problem you are
> > seeing.
> > 
> > The last patch in the series (the attributes update patch) has seen
> > no more than compile testing, I hope I haven't messed up on that.
> 
> OK, I just ignored the last patch (kernfs: add a spinlock to kernfs
> iattrs for inode updates)
> because I can not boot the kernel after applying it.

Right, clearly I haven't got that last patch right yet.

It looks like the problem there is that the iattrs structure might
not be allocated at the time so clearly I can't just use the spin
lock inside.

It would be simple enough to resolve except for the need to set
the inode link count too.
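
Roughly, the shape of the problem is something like this (just a sketch,
and the attrs->lock field is hypothetical since that patch isn't quoted
in this thread):

static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
        struct kernfs_iattrs *attrs = kn->iattr;

        inode->i_mode = kn->mode;
        if (attrs) {
                /* a spinlock embedded in the iattrs (hypothetical field)
                 * can only help once attrs has been allocated ... */
                spin_lock(&attrs->lock);
                set_inode_attr(inode, attrs);
                spin_unlock(&attrs->lock);
        }

        /* ... while the link count update below still needs to be
         * serialised against concurrent callers even when attrs is NULL */
        if (kernfs_type(kn) == KERNFS_DIR)
                set_nlink(inode, kn->dir.subdirs + 2);
}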

Ian

> 
> Applying the first five patches, my benchmark shows no improvement
> this time,
> one open+read+close cycle spends 500us. :(
> 
> perf suggests that down_write in kernfs_iop_permission drags the most
> performance
> (full report see attachment) which makes sense to me, since
> down_write
> has no difference
> here compared to the mutex, they all allow only one thread running
> simultaneously.
> 
> sigh... It's really hard to fix.
> 
>    |
>    |--51.81%--inode_permission
>    |          |
>    |           --51.36%--kernfs_iop_permission
>    |                     |
>    |                     |--50.41%--down_write
>    |                     |          |
>    |                     |           --50.32%--rwsem_down_write_slowpath
>    |                     |                     |
>    |                     |                      --49.67%--rwsem_optimistic_spin
>    |                     |                                |
>    |                     |                                 --49.31%--osq_lock
>    |                     |
>    |                      --0.74%--up_write
>    |                                rwsem_wake.isra.0
>    |                                |
>    |                                 --0.69%--wake_up_q
>    |                                           |
>    |                                            --0.62%--try_to_wake_up
>    |                                                      |
>    |                                                       --0.54%--_raw_spin_lock_irqsave
>    |                                                                 |
>    |                                                                  --0.52%--native_queued
>    |
>     --29.90%--walk_component
>               |
>                --29.72%--lookup_fast
>                          |
>                           --29.51%--kernfs_dop_revalidate
>                                     |
>                                     |--28.88%--down_read
>                                     |          |
>                                     |           --28.75%--rwsem_down_read_slowpath
>                                     |                     |
>                                     |                      --28.50%--rwsem_optimistic_spin
>                                     |                                |
>                                     |                                 --28.38%--osq_lock
>                                     |
>                                      --0.59%--up_read
>                                                rwsem_wake.isra.0
>                                                |
>                                                 --0.54%--wake_up_q
>                                                           |
>                                                            --0.53%--try_to_wake_up
> 
> > Please keep in mind that Greg's continued silence on the question
> > of whether the series might be re-considered indicates to me that
> > he has not changed his position on this.
> 
> Since the problem doesn't break the system, it doesn't need immediate
> attention.
> So I bet it must be labeled with a low priority in Greg's to-do list.
> And it's merging
> window & holiday season. Let's just give him more time. :)
> 
> 
> > ---
> > 
> > Ian Kent (6):
> >       kernfs: move revalidate to be near lookup
> >       kernfs: use VFS negative dentry caching
> >       kernfs: use revision to identify directory node changes
> >       kernfs: switch kernfs to use an rwsem
> >       kernfs: stay in rcu-walk mode if possible
> >       kernfs: add a spinlock to kernfs iattrs for inode updates
> > 
> > 
> >  fs/kernfs/dir.c             |  285 ++++++++++++++++++++++++++++---
> > ------------
> >  fs/kernfs/file.c            |    4 -
> >  fs/kernfs/inode.c           |   19 ++-
> >  fs/kernfs/kernfs-internal.h |   30 ++++-
> >  fs/kernfs/mount.c           |   12 +-
> >  fs/kernfs/symlink.c         |    4 -
> >  include/linux/kernfs.h      |    5 +
> >  7 files changed, 238 insertions(+), 121 deletions(-)
> > 
> > --
> > Ian
> > 
> 
> thanks,
> fox
Fox Chen Jan. 6, 2021, 2:38 a.m. UTC | #3
Hi Ian,

I am rethinking this problem. Can we simply use a global lock?

In your original patch 5, you have a global mutex attr_mutex to
protect attr; if we change it to an rwsem, is it enough to protect both
the inode and attr while still allowing concurrent reads?

Like this patch I submitted (clearly I missed the __kernfs_iattrs part,
but it's just about that idea):
https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/



thanks,
fox
Ian Kent Jan. 11, 2021, 3:19 a.m. UTC | #4
On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> Hi Ian,
> 
> I am rethinking this problem. Can we simply use a global lock?
> 
>  In your original patch 5, you have a global mutex attr_mutex to
> protect attr, if we change it to a rwsem, is it enough to protect
> both
> inode and attr while having the concurrent read ability?
> 
> like this patch I submitted. ( clearly, I missed __kernfs_iattrs
> part,
> but just about that idea )
> https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/

I don't think so.

kernfs_refresh_inode() writes to the inode so taking a read lock
will allow multiple processes to concurrently update it which is
what we need to avoid.
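
That is, with only the read lock held both tasks can be inside
kernfs_refresh_inode() at the same time, something like (a sketch of
the interleaving, not actual code):

  task A                             task B
  down_read(&kernfs_rwsem)           down_read(&kernfs_rwsem)
  kernfs_refresh_inode(kn, inode)    kernfs_refresh_inode(kn, inode)
    inode->i_mode = kn->mode           inode->i_mode = kn->mode
    set_inode_attr(inode, attrs)       set_inode_attr(inode, attrs)
    set_nlink(inode, ...)              set_nlink(inode, ...)
  up_read(&kernfs_rwsem)             up_read(&kernfs_rwsem)

with both tasks storing to the same inode fields and nothing ordering
the stores.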

It's possibly even more interesting.

For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might alter
the inode link count (I don't know if that would be the sort of thing
they would do but kernfs can't possibly know either). Both of these
functions rely on the VFS locking for exclusion but the inode link
count is updated in kernfs_refresh_inode() too.

That's the case now, without any patches.

I'm not entirely sure what's going on in kernfs_refresh_inode().

It could be as simple as being called with a NULL inode because
the dentry concerned is negative at that point. I haven't had
time to look closely at it TBH but I have been thinking about it.

Ian

> 
> 
> 
> thanks,
> fox
Ian Kent Jan. 11, 2021, 4:20 a.m. UTC | #5
On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > Hi Ian,
> > 
> > I am rethinking this problem. Can we simply use a global lock?
> > 
> >  In your original patch 5, you have a global mutex attr_mutex to
> > protect attr, if we change it to a rwsem, is it enough to protect
> > both
> > inode and attr while having the concurrent read ability?
> > 
> > like this patch I submitted. ( clearly, I missed __kernfs_iattrs
> > part,
> > but just about that idea )
> > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> 
> I don't think so.
> 
> kernfs_refresh_inode() writes to the inode so taking a read lock
> will allow multiple processes to concurrently update it which is
> what we need to avoid.
> 
> It's possibly even more interesting.
> 
> For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might alter
> the inode link count (I don't know if that would be the sort of thing
> they would do but kernfs can't possibly know either). Both of these
> functions rely on the VFS locking for exclusion but the inode link
> count is updated in kernfs_refresh_inode() too.
> 
> That's the case now, without any patches.

So it's not so easy to get the inode from just the kernfs object
so these probably aren't a problem ...

> 
> I'm not entirely sure what's going on in kernfs_refresh_inode().
> 
> It could be as simple as being called with a NULL inode because
> the dentry concerned is negative at that point. I haven't had
> time to look closely at it TBH but I have been thinking about it.

Certainly this can be called without a struct iattr having been
allocated ... and given it probably needs to remain a pointer
rather than embedded in the node the inode link count update
can't easily be protected from concurrent updates.

If it was ok to do the allocation at inode creation the problem
becomes much simpler to resolve but I thought there were concerns
about ram consumption (although I don't think that was exactly what
was said?).

Ian
Fox Chen Jan. 11, 2021, 7:04 a.m. UTC | #6
On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > Hi Ian,
> > >
> > > I am rethinking this problem. Can we simply use a global lock?
> > >
> > >  In your original patch 5, you have a global mutex attr_mutex to
> > > protect attr, if we change it to a rwsem, is it enough to protect
> > > both
> > > inode and attr while having the concurrent read ability?
> > >
> > > like this patch I submitted. ( clearly, I missed __kernfs_iattrs
> > > part,
> > > but just about that idea )
> > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> >
> > I don't think so.
> >
> > kernfs_refresh_inode() writes to the inode so taking a read lock
> > will allow multiple processes to concurrently update it which is
> > what we need to avoid.

Oh, got it. I missed the inode part. my bad. :(

> > It's possibly even more interesting.
> >
> > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might alter
> > the inode link count (I don't know if that would be the sort of thing
> > they would do but kernfs can't possibly know either). Both of these
> > functions rely on the VFS locking for exclusion but the inode link
> > count is updated in kernfs_refresh_inode() too.
> >
> > That's the case now, without any patches.
>
> So it's not so easy to get the inode from just the kernfs object
> so these probably aren't a problem ...

IIUC it's only when dop->revalidate or iop->lookup is called that the
result of rmdir/mkdir will be synced with the VFS.

kernfs_node is detached from the vfs inode/dentry to save ram.

> >
> > I'm not entirely sure what's going on in kernfs_refresh_inode().
> >
> > It could be as simple as being called with a NULL inode because
> > the dentry concerned is negative at that point. I haven't had
> > time to look closely at it TBH but I have been thinking about it.

Um, it shouldn't be called with a NULL inode, right?

inode->i_mode = kn->mode;

otherwise this would crash.

> Certainly this can be called without a struct iattr having been
> allocated ... and given it probably needs to remain a pointer
> rather than embedded in the node the inode link count update
> can't easily be protected from concurrent updates.
>
> If it was ok to do the allocation at inode creation the problem
> becomes much simpler to resolve but I thought there were concerns
> about ram consumption (although I don't think that was exactly what
> was said?).
>

you meant iattr to be allocated at inode creation time??
yes, I think so. it's due to ram consumption.



thanks,
fox
Ian Kent Jan. 11, 2021, 8:42 a.m. UTC | #7
On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net> wrote:
> > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > Hi Ian,
> > > > 
> > > > I am rethinking this problem. Can we simply use a global lock?
> > > > 
> > > >  In your original patch 5, you have a global mutex attr_mutex
> > > > to
> > > > protect attr, if we change it to a rwsem, is it enough to
> > > > protect
> > > > both
> > > > inode and attr while having the concurrent read ability?
> > > > 
> > > > like this patch I submitted. ( clearly, I missed
> > > > __kernfs_iattrs
> > > > part,
> > > > but just about that idea )
> > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > 
> > > I don't think so.
> > > 
> > > kernfs_refresh_inode() writes to the inode so taking a read lock
> > > will allow multiple processes to concurrently update it which is
> > > what we need to avoid.
> 
> Oh, got it. I missed the inode part. my bad. :(
> 
> > > It's possibly even more interesting.
> > > 
> > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might
> > > alter
> > > the inode link count (I don't know if that would be the sort of
> > > thing
> > > they would do but kernfs can't possibly know either). Both of
> > > these
> > > functions rely on the VFS locking for exclusion but the inode
> > > link
> > > count is updated in kernfs_refresh_inode() too.
> > > 
> > > That's the case now, without any patches.
> > 
> > So it's not so easy to get the inode from just the kernfs object
> > so these probably aren't a problem ...
> 
> IIUC only when dop->revalidate, iop->lookup being called, the result
> of rmdir/mkdir will be sync with vfs.

I don't quite get what you mean here?

Do you mean something like: VFS objects are created on user access
to the file system, given that user access generally means path
resolution possibly followed by some operation?

I guess those VFS objects will go away some time after the access,
but even though the code looks like that should happen pretty quickly
afterwards, I've observed that these objects stay around longer than
expected. Otherwise there wouldn't be any use in maintaining a least
recently used list of dentry candidates eligible to be discarded.

> 
> kernfs_node is detached from vfs inode/dentry to save ram.
> 
> > > I'm not entirely sure what's going on in kernfs_refresh_inode().
> > > 
> > > It could be as simple as being called with a NULL inode because
> > > the dentry concerned is negative at that point. I haven't had
> > > time to look closely at it TBH but I have been thinking about it.
> 
> um, It shouldn't be called with a NULL inode, right?
> 
> inode->i_mode = kn->mode;
> 
> otherwise will crash.

Yes, you're right about that.

> 
> > Certainly this can be called without a struct iattr having been
> > allocated ... and given it probably needs to remain a pointer
> > rather than embedded in the node the inode link count update
> > can't easily be protected from concurrent updates.
> > 
> > If it was ok to do the allocation at inode creation the problem
> > becomes much simpler to resolve but I thought there were concerns
> > about ram consumption (although I don't think that was exactly what
> > was said?).
> > 
> 
> you meant iattr to be allocated at inode creation time??
> yes, I think so. it's due to ram consumption.

I did, yes.

The actual problem is dealing with multiple concurrent updates to
the inode link count, the rest can work.

Ian
Fox Chen Jan. 11, 2021, 9:02 a.m. UTC | #8
On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net> wrote:
> > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > Hi Ian,
> > > > >
> > > > > I am rethinking this problem. Can we simply use a global lock?
> > > > >
> > > > >  In your original patch 5, you have a global mutex attr_mutex
> > > > > to
> > > > > protect attr, if we change it to a rwsem, is it enough to
> > > > > protect
> > > > > both
> > > > > inode and attr while having the concurrent read ability?
> > > > >
> > > > > like this patch I submitted. ( clearly, I missed
> > > > > __kernfs_iattrs
> > > > > part,
> > > > > but just about that idea )
> > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > >
> > > > I don't think so.
> > > >
> > > > kernfs_refresh_inode() writes to the inode so taking a read lock
> > > > will allow multiple processes to concurrently update it which is
> > > > what we need to avoid.
> >
> > Oh, got it. I missed the inode part. my bad. :(
> >
> > > > It's possibly even more interesting.
> > > >
> > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might
> > > > alter
> > > > the inode link count (I don't know if that would be the sort of
> > > > thing
> > > > they would do but kernfs can't possibly know either). Both of
> > > > these
> > > > functions rely on the VFS locking for exclusion but the inode
> > > > link
> > > > count is updated in kernfs_refresh_inode() too.
> > > >
> > > > That's the case now, without any patches.
> > >
> > > So it's not so easy to get the inode from just the kernfs object
> > > so these probably aren't a problem ...
> >
> > IIUC only when dop->revalidate, iop->lookup being called, the result
> > of rmdir/mkdir will be sync with vfs.
>
> Don't quite get what you mean here?
>
> Do you mean something like, VFS objects are created on user access
> to the file system. Given that user access generally means path
> resolution possibly followed by some operation.
>
> I guess those VFS objects will go away some time after the access
> but even thought the code looks like that should happen pretty
> quickly after I've observed that these objects stay around longer
> than expected. There wouldn't be any use in maintaining a least
> recently used list of dentry candidates eligible to discard.

Yes, that is what I meant. I think the duration may depend on the
current ram pressure, though I'm not quite sure; I'm still digging into
this part of the code.

> >
> > kernfs_node is detached from vfs inode/dentry to save ram.
> >
> > > > I'm not entirely sure what's going on in kernfs_refresh_inode().
> > > >
> > > > It could be as simple as being called with a NULL inode because
> > > > the dentry concerned is negative at that point. I haven't had
> > > > time to look closely at it TBH but I have been thinking about it.
> >
> > um, It shouldn't be called with a NULL inode, right?
> >
> > inode->i_mode = kn->mode;
> >
> > otherwise will crash.
>
> Yes, you're right about that.
>
> >
> > > Certainly this can be called without a struct iattr having been
> > > allocated ... and given it probably needs to remain a pointer
> > > rather than embedded in the node the inode link count update
> > > can't easily be protected from concurrent updates.
> > >
> > > If it was ok to do the allocation at inode creation the problem
> > > becomes much simpler to resolve but I thought there were concerns
> > > about ram consumption (although I don't think that was exactly what
> > > was said?).
> > >
> >
> > you meant iattr to be allocated at inode creation time??
> > yes, I think so. it's due to ram consumption.
>
> I did, yes.
>
> The actual problem is dealing with multiple concurrent updates to
> the inode link count, the rest can work.
>
> Ian
>

fox
Ian Kent Jan. 12, 2021, 12:27 a.m. UTC | #9
On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net> wrote:
> > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > Hi Ian,
> > > > > > 
> > > > > > I am rethinking this problem. Can we simply use a global
> > > > > > lock?
> > > > > > 
> > > > > >  In your original patch 5, you have a global mutex
> > > > > > attr_mutex
> > > > > > to
> > > > > > protect attr, if we change it to a rwsem, is it enough to
> > > > > > protect
> > > > > > both
> > > > > > inode and attr while having the concurrent read ability?
> > > > > > 
> > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > __kernfs_iattrs
> > > > > > part,
> > > > > > but just about that idea )
> > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > 
> > > > > I don't think so.
> > > > > 
> > > > > kernfs_refresh_inode() writes to the inode so taking a read
> > > > > lock
> > > > > will allow multiple processes to concurrently update it which
> > > > > is
> > > > > what we need to avoid.
> > > 
> > > Oh, got it. I missed the inode part. my bad. :(
> > > 
> > > > > It's possibly even more interesting.
> > > > > 
> > > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might
> > > > > alter
> > > > > the inode link count (I don't know if that would be the sort
> > > > > of
> > > > > thing
> > > > > they would do but kernfs can't possibly know either). Both of
> > > > > these
> > > > > functions rely on the VFS locking for exclusion but the inode
> > > > > link
> > > > > count is updated in kernfs_refresh_inode() too.
> > > > > 
> > > > > That's the case now, without any patches.
> > > > 
> > > > So it's not so easy to get the inode from just the kernfs
> > > > object
> > > > so these probably aren't a problem ...
> > > 
> > > IIUC only when dop->revalidate, iop->lookup being called, the
> > > result
> > > of rmdir/mkdir will be sync with vfs.
> > 
> > Don't quite get what you mean here?
> > 
> > Do you mean something like, VFS objects are created on user access
> > to the file system. Given that user access generally means path
> > resolution possibly followed by some operation.
> > 
> > I guess those VFS objects will go away some time after the access
> > but even thought the code looks like that should happen pretty
> > quickly after I've observed that these objects stay around longer
> > than expected. There wouldn't be any use in maintaining a least
> > recently used list of dentry candidates eligible to discard.
> 
> Yes, that is what I meant. I think the duration may depend on the
> current ram pressure. though not quite sure, I'm still digging this
> part of code.

The dentries on the LRU list are the ones that will get pruned
for things like memory pressure and drop caches invocations.

But it's a bit hard to grok that because it happens separately, so it's
not something you pick up (if you're like me) from linearly scanning
the code.

Ian

> 
> > > kernfs_node is detached from vfs inode/dentry to save ram.
> > > 
> > > > > I'm not entirely sure what's going on in
> > > > > kernfs_refresh_inode().
> > > > > 
> > > > > It could be as simple as being called with a NULL inode
> > > > > because
> > > > > the dentry concerned is negative at that point. I haven't had
> > > > > time to look closely at it TBH but I have been thinking about
> > > > > it.
> > > 
> > > um, It shouldn't be called with a NULL inode, right?
> > > 
> > > inode->i_mode = kn->mode;
> > > 
> > > otherwise will crash.
> > 
> > Yes, you're right about that.
> > 
> > > > Certainly this can be called without a struct iattr having been
> > > > allocated ... and given it probably needs to remain a pointer
> > > > rather than embedded in the node the inode link count update
> > > > can't easily be protected from concurrent updates.
> > > > 
> > > > If it was ok to do the allocation at inode creation the problem
> > > > becomes much simpler to resolve but I thought there were
> > > > concerns
> > > > about ram consumption (although I don't think that was exactly
> > > > what
> > > > was said?).
> > > > 
> > > 
> > > you meant iattr to be allocated at inode creation time??
> > > yes, I think so. it's due to ram consumption.
> > 
> > I did, yes.
> > 
> > The actual problem is dealing with multiple concurrent updates to
> > the inode link count, the rest can work.
> > 
> > Ian
> > 
> 
> fox
Ian Kent Jan. 13, 2021, 5:17 a.m. UTC | #10
On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net> wrote:
> > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > Hi Ian,
> > > > > > 
> > > > > > I am rethinking this problem. Can we simply use a global
> > > > > > lock?
> > > > > > 
> > > > > >  In your original patch 5, you have a global mutex
> > > > > > attr_mutex
> > > > > > to
> > > > > > protect attr, if we change it to a rwsem, is it enough to
> > > > > > protect
> > > > > > both
> > > > > > inode and attr while having the concurrent read ability?
> > > > > > 
> > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > __kernfs_iattrs
> > > > > > part,
> > > > > > but just about that idea )
> > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > 
> > > > > I don't think so.
> > > > > 
> > > > > kernfs_refresh_inode() writes to the inode so taking a read
> > > > > lock
> > > > > will allow multiple processes to concurrently update it which
> > > > > is
> > > > > what we need to avoid.
> > > 
> > > Oh, got it. I missed the inode part. my bad. :(
> > > 
> > > > > It's possibly even more interesting.
> > > > > 
> > > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might
> > > > > alter
> > > > > the inode link count (I don't know if that would be the sort
> > > > > of
> > > > > thing
> > > > > they would do but kernfs can't possibly know either). Both of
> > > > > these
> > > > > functions rely on the VFS locking for exclusion but the inode
> > > > > link
> > > > > count is updated in kernfs_refresh_inode() too.
> > > > > 
> > > > > That's the case now, without any patches.
> > > > 
> > > > So it's not so easy to get the inode from just the kernfs
> > > > object
> > > > so these probably aren't a problem ...
> > > 
> > > IIUC only when dop->revalidate, iop->lookup being called, the
> > > result
> > > of rmdir/mkdir will be sync with vfs.
> > 
> > Don't quite get what you mean here?
> > 
> > Do you mean something like, VFS objects are created on user access
> > to the file system. Given that user access generally means path
> > resolution possibly followed by some operation.
> > 
> > I guess those VFS objects will go away some time after the access
> > but even thought the code looks like that should happen pretty
> > quickly after I've observed that these objects stay around longer
> > than expected. There wouldn't be any use in maintaining a least
> > recently used list of dentry candidates eligible to discard.
> 
> Yes, that is what I meant. I think the duration may depend on the
> current ram pressure. though not quite sure, I'm still digging this
> part of code.
> 
> > > kernfs_node is detached from vfs inode/dentry to save ram.
> > > 
> > > > > I'm not entirely sure what's going on in
> > > > > kernfs_refresh_inode().
> > > > > 
> > > > > It could be as simple as being called with a NULL inode
> > > > > because
> > > > > the dentry concerned is negative at that point. I haven't had
> > > > > time to look closely at it TBH but I have been thinking about
> > > > > it.
> > > 
> > > um, It shouldn't be called with a NULL inode, right?
> > > 
> > > inode->i_mode = kn->mode;
> > > 
> > > otherwise will crash.
> > 
> > Yes, you're right about that.
> > 
> > > > Certainly this can be called without a struct iattr having been
> > > > allocated ... and given it probably needs to remain a pointer
> > > > rather than embedded in the node the inode link count update
> > > > can't easily be protected from concurrent updates.
> > > > 
> > > > If it was ok to do the allocation at inode creation the problem
> > > > becomes much simpler to resolve but I thought there were
> > > > concerns
> > > > about ram consumption (although I don't think that was exactly
> > > > what
> > > > was said?).
> > > > 
> > > 
> > > you meant iattr to be allocated at inode creation time??
> > > yes, I think so. it's due to ram consumption.
> > 
> > I did, yes.
> > 
> > The actual problem is dealing with multiple concurrent updates to
> > the inode link count, the rest can work.

Umm ... maybe I've been trying to do this in the wrong place all
along.

You know the inode i_lock looks like the sensible thing to use to
protect these updates.

Something like this for that last patch should work:

kernfs: use i_lock to protect concurrent inode updates

From: Ian Kent <raven@themaw.net>

The inode operations .permission() and .getattr() use the kernfs node
write lock but all that's needed is to keep the rb tree stable while
updating the inode attributes as well as protecting the update itself
against concurrent changes.

And .permission() is called frequently during path walks and can cause
quite a bit of contention between kernfs node operations and path
walks when the number of concurrent walks is high.

To change kernfs_iop_getattr() and kernfs_iop_permission() to take
the rwsem read lock instead of the write lock, an additional lock is
needed to protect against multiple processes concurrently updating
the inode attributes and link count in kernfs_refresh_inode().

The inode i_lock seems like the sensible thing to use to protect these
inode attribute updates so use it in kernfs_refresh_inode().

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index ddaf18198935..e26fa5115821 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -171,6 +171,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 {
 	struct kernfs_iattrs *attrs = kn->iattr;
 
+	spin_lock(&inode->i_lock);
 	inode->i_mode = kn->mode;
 	if (attrs)
 		/*
@@ -181,6 +182,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 
 	if (kernfs_type(kn) == KERNFS_DIR)
 		set_nlink(inode, kn->dir.subdirs + 2);
+	spin_unlock(&inode->i_lock);
 }
 
 int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
@@ -189,9 +191,9 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
 	generic_fillattr(inode, stat);
 	return 0;
@@ -281,9 +283,9 @@ int kernfs_iop_permission(struct inode *inode, int mask)
 
 	kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
 	return generic_permission(inode, mask);
 }
Fox Chen Jan. 13, 2021, 7 a.m. UTC | #11
On Wed, Jan 13, 2021 at 1:17 PM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> > On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net> wrote:
> > > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > > Hi Ian,
> > > > > > >
> > > > > > > I am rethinking this problem. Can we simply use a global
> > > > > > > lock?
> > > > > > >
> > > > > > >  In your original patch 5, you have a global mutex
> > > > > > > attr_mutex
> > > > > > > to
> > > > > > > protect attr, if we change it to a rwsem, is it enough to
> > > > > > > protect
> > > > > > > both
> > > > > > > inode and attr while having the concurrent read ability?
> > > > > > >
> > > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > > __kernfs_iattrs
> > > > > > > part,
> > > > > > > but just about that idea )
> > > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > >
> > > > > > I don't think so.
> > > > > >
> > > > > > kernfs_refresh_inode() writes to the inode so taking a read
> > > > > > lock
> > > > > > will allow multiple processes to concurrently update it which
> > > > > > is
> > > > > > what we need to avoid.
> > > >
> > > > Oh, got it. I missed the inode part. my bad. :(
> > > >
> > > > > > It's possibly even more interesting.
> > > > > >
> > > > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir() might
> > > > > > alter
> > > > > > the inode link count (I don't know if that would be the sort
> > > > > > of
> > > > > > thing
> > > > > > they would do but kernfs can't possibly know either). Both of
> > > > > > these
> > > > > > functions rely on the VFS locking for exclusion but the inode
> > > > > > link
> > > > > > count is updated in kernfs_refresh_inode() too.
> > > > > >
> > > > > > That's the case now, without any patches.
> > > > >
> > > > > So it's not so easy to get the inode from just the kernfs
> > > > > object
> > > > > so these probably aren't a problem ...
> > > >
> > > > IIUC only when dop->revalidate, iop->lookup being called, the
> > > > result
> > > > of rmdir/mkdir will be sync with vfs.
> > >
> > > Don't quite get what you mean here?
> > >
> > > Do you mean something like, VFS objects are created on user access
> > > to the file system. Given that user access generally means path
> > > resolution possibly followed by some operation.
> > >
> > > I guess those VFS objects will go away some time after the access
> > > but even thought the code looks like that should happen pretty
> > > quickly after I've observed that these objects stay around longer
> > > than expected. There wouldn't be any use in maintaining a least
> > > recently used list of dentry candidates eligible to discard.
> >
> > Yes, that is what I meant. I think the duration may depend on the
> > current ram pressure. though not quite sure, I'm still digging this
> > part of code.
> >
> > > > kernfs_node is detached from vfs inode/dentry to save ram.
> > > >
> > > > > > I'm not entirely sure what's going on in
> > > > > > kernfs_refresh_inode().
> > > > > >
> > > > > > It could be as simple as being called with a NULL inode
> > > > > > because
> > > > > > the dentry concerned is negative at that point. I haven't had
> > > > > > time to look closely at it TBH but I have been thinking about
> > > > > > it.
> > > >
> > > > um, It shouldn't be called with a NULL inode, right?
> > > >
> > > > inode->i_mode = kn->mode;
> > > >
> > > > otherwise will crash.
> > >
> > > Yes, you're right about that.
> > >
> > > > > Certainly this can be called without a struct iattr having been
> > > > > allocated ... and given it probably needs to remain a pointer
> > > > > rather than embedded in the node the inode link count update
> > > > > can't easily be protected from concurrent updates.
> > > > >
> > > > > If it was ok to do the allocation at inode creation the problem
> > > > > becomes much simpler to resolve but I thought there were
> > > > > concerns
> > > > > about ram consumption (although I don't think that was exactly
> > > > > what
> > > > > was said?).
> > > > >
> > > >
> > > > you meant iattr to be allocated at inode creation time??
> > > > yes, I think so. it's due to ram consumption.
> > >
> > > I did, yes.
> > >
> > > The actual problem is dealing with multiple concurrent updates to
> > > the inode link count, the rest can work.
>
> Umm ... maybe I've been trying to do this in the wrong place all
> along.
>
> You know the inode i_lock looks like the sensible thing to use to
> protect these updates.
>
> Something like this for that last patch should work:
>
> kernfs: use i_lock to protect concurrent inode updates
>
> From: Ian Kent <raven@themaw.net>
>
> The inode operations .permission() and .getattr() use the kernfs node
> write lock but all that's needed is to keep the rb tree stable while
> updating the inode attributes as well as protecting the update itself
> against concurrent changes.
>
> And .permission() is called frequently during path walks and can cause
> quite a bit of contention between kernfs node operations and path
> walks when the number of concurrent walks is high.
>
> To change kernfs_iop_getattr() and kernfs_iop_permission() to take
> the rw sem read lock instead of the write lock an additional lock is
> needed to protect against multiple processes concurrently updating
> the inode attributes and link count in kernfs_refresh_inode().
>
> The inode i_lock seems like the sensible thing to use to protect these
> inode attribute updates so use it in kernfs_refresh_inode().
>
> Signed-off-by: Ian Kent <raven@themaw.net>
> ---
>  fs/kernfs/inode.c |   10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index ddaf18198935..e26fa5115821 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -171,6 +171,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
>  {
>         struct kernfs_iattrs *attrs = kn->iattr;
>
> +       spin_lock(&inode->i_lock);
>         inode->i_mode = kn->mode;
>         if (attrs)
>                 /*
> @@ -181,6 +182,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
>
>         if (kernfs_type(kn) == KERNFS_DIR)
>                 set_nlink(inode, kn->dir.subdirs + 2);
> +       spin_unlock(&inode->i_lock);
>  }
>
>  int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
> @@ -189,9 +191,9 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
>         struct inode *inode = d_inode(path->dentry);
>         struct kernfs_node *kn = inode->i_private;
>
> -       down_write(&kernfs_rwsem);
> +       down_read(&kernfs_rwsem);
>         kernfs_refresh_inode(kn, inode);
> -       up_write(&kernfs_rwsem);
> +       up_read(&kernfs_rwsem);
>
>         generic_fillattr(inode, stat);
>         return 0;
> @@ -281,9 +283,9 @@ int kernfs_iop_permission(struct inode *inode, int mask)
>
>         kn = inode->i_private;
>
> -       down_write(&kernfs_rwsem);
> +       down_read(&kernfs_rwsem);
>         kernfs_refresh_inode(kn, inode);
> -       up_write(&kernfs_rwsem);
> +       up_read(&kernfs_rwsem);
>
>         return generic_permission(inode, mask);
>  }
>

It looks good on my local machine; let me test my benchmark on a big machine. :)

Also, I wonder why i_lock? What if I use a local spin_lock like the one
below, will there be any difference?

static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
        struct kernfs_iattrs *attrs = kn->iattr;
        static DEFINE_SPINLOCK(inode_lock);

        spin_lock(&inode_lock);
        inode->i_mode = kn->mode;
        if (attrs)
                /*
                 * kernfs_node has non-default attributes get them from
                 * persistent copy in kernfs_node.
                 */
                set_inode_attr(inode, attrs);

        if (kernfs_type(kn) == KERNFS_DIR)
                set_nlink(inode, kn->dir.subdirs + 2);
        spin_unlock(&inode_lock);
}



thanks,
fox
Ian Kent Jan. 13, 2021, 7:47 a.m. UTC | #12
On Wed, 2021-01-13 at 15:00 +0800, Fox Chen wrote:
> On Wed, Jan 13, 2021 at 1:17 PM Ian Kent <raven@themaw.net> wrote:
> > On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> > > On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > > > Hi Ian,
> > > > > > > > 
> > > > > > > > I am rethinking this problem. Can we simply use a
> > > > > > > > global
> > > > > > > > lock?
> > > > > > > > 
> > > > > > > >  In your original patch 5, you have a global mutex
> > > > > > > > attr_mutex
> > > > > > > > to
> > > > > > > > protect attr, if we change it to a rwsem, is it enough
> > > > > > > > to
> > > > > > > > protect
> > > > > > > > both
> > > > > > > > inode and attr while having the concurrent read
> > > > > > > > ability?
> > > > > > > > 
> > > > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > > > __kernfs_iattrs
> > > > > > > > part,
> > > > > > > > but just about that idea )
> > > > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > > > 
> > > > > > > I don't think so.
> > > > > > > 
> > > > > > > kernfs_refresh_inode() writes to the inode so taking a
> > > > > > > read
> > > > > > > lock
> > > > > > > will allow multiple processes to concurrently update it
> > > > > > > which
> > > > > > > is
> > > > > > > what we need to avoid.
> > > > > 
> > > > > Oh, got it. I missed the inode part. my bad. :(
> > > > > 
> > > > > > > It's possibly even more interesting.
> > > > > > > 
> > > > > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir()
> > > > > > > might
> > > > > > > alter
> > > > > > > the inode link count (I don't know if that would be the
> > > > > > > sort
> > > > > > > of
> > > > > > > thing
> > > > > > > they would do but kernfs can't possibly know either).
> > > > > > > Both of
> > > > > > > these
> > > > > > > functions rely on the VFS locking for exclusion but the
> > > > > > > inode
> > > > > > > link
> > > > > > > count is updated in kernfs_refresh_inode() too.
> > > > > > > 
> > > > > > > That's the case now, without any patches.
> > > > > > 
> > > > > > So it's not so easy to get the inode from just the kernfs
> > > > > > object
> > > > > > so these probably aren't a problem ...
> > > > > 
> > > > > IIUC only when dop->revalidate, iop->lookup being called, the
> > > > > result
> > > > > of rmdir/mkdir will be sync with vfs.
> > > > 
> > > > Don't quite get what you mean here?
> > > > 
> > > > Do you mean something like, VFS objects are created on user
> > > > access
> > > > to the file system. Given that user access generally means path
> > > > resolution possibly followed by some operation.
> > > > 
> > > > I guess those VFS objects will go away some time after the
> > > > access
> > > > but even thought the code looks like that should happen pretty
> > > > quickly after I've observed that these objects stay around
> > > > longer
> > > > than expected. There wouldn't be any use in maintaining a least
> > > > recently used list of dentry candidates eligible to discard.
> > > 
> > > Yes, that is what I meant. I think the duration may depend on the
> > > current ram pressure. though not quite sure, I'm still digging
> > > this
> > > part of code.
> > > 
> > > > > kernfs_node is detached from vfs inode/dentry to save ram.
> > > > > 
> > > > > > > I'm not entirely sure what's going on in
> > > > > > > kernfs_refresh_inode().
> > > > > > > 
> > > > > > > It could be as simple as being called with a NULL inode
> > > > > > > because
> > > > > > > the dentry concerned is negative at that point. I haven't
> > > > > > > had
> > > > > > > time to look closely at it TBH but I have been thinking
> > > > > > > about
> > > > > > > it.
> > > > > 
> > > > > um, It shouldn't be called with a NULL inode, right?
> > > > > 
> > > > > inode->i_mode = kn->mode;
> > > > > 
> > > > > otherwise will crash.
> > > > 
> > > > Yes, you're right about that.
> > > > 
> > > > > > Certainly this can be called without a struct iattr having
> > > > > > been
> > > > > > allocated ... and given it probably needs to remain a
> > > > > > pointer
> > > > > > rather than embedded in the node the inode link count
> > > > > > update
> > > > > > can't easily be protected from concurrent updates.
> > > > > > 
> > > > > > If it was ok to do the allocation at inode creation the
> > > > > > problem
> > > > > > becomes much simpler to resolve but I thought there were
> > > > > > concerns
> > > > > > about ram consumption (although I don't think that was
> > > > > > exactly
> > > > > > what
> > > > > > was said?).
> > > > > > 
> > > > > 
> > > > > you meant iattr to be allocated at inode creation time??
> > > > > yes, I think so. it's due to ram consumption.
> > > > 
> > > > I did, yes.
> > > > 
> > > > The actual problem is dealing with multiple concurrent updates
> > > > to
> > > > the inode link count, the rest can work.
> > 
> > Umm ... maybe I've been trying to do this in the wrong place all
> > along.
> > 
> > You know the inode i_lock looks like the sensible thing to use to
> > protect these updates.
> > 
> > Something like this for that last patch should work:
> > 
> > kernfs: use i_lock to protect concurrent inode updates
> > 
> > From: Ian Kent <raven@themaw.net>
> > 
> > The inode operations .permission() and .getattr() use the kernfs
> > node
> > write lock but all that's needed is to keep the rb tree stable
> > while
> > updating the inode attributes as well as protecting the update
> > itself
> > against concurrent changes.
> > 
> > And .permission() is called frequently during path walks and can
> > cause
> > quite a bit of contention between kernfs node operations and
> > path
> > walks when the number of concurrent walks is high.
> > To change kernfs_iop_getattr() and kernfs_iop_permission() to take
> > the rw sem read lock instead of the write lock an additional lock is
> > needed to protect against multiple processes concurrently updating
> > the inode attributes and link count in kernfs_refresh_inode().
> > 
> > The inode i_lock seems like the sensible thing to use to protect
> > these
> > inode attribute updates so use it in kernfs_refresh_inode().
> > 
> > Signed-off-by: Ian Kent <raven@themaw.net>
> > ---
> >  fs/kernfs/inode.c |   10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > index ddaf18198935..e26fa5115821 100644
> > --- a/fs/kernfs/inode.c
> > +++ b/fs/kernfs/inode.c
> > @@ -171,6 +171,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
> >  {
> >         struct kernfs_iattrs *attrs = kn->iattr;
> > 
> > +       spin_lock(&inode->i_lock);
> >         inode->i_mode = kn->mode;
> >         if (attrs)
> >                 /*
> > @@ -181,6 +182,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
> > 
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 set_nlink(inode, kn->dir.subdirs + 2);
> > +       spin_unlock(&inode->i_lock);
> >  }
> > 
> >  int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
> > @@ -189,9 +191,9 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
> >         struct inode *inode = d_inode(path->dentry);
> >         struct kernfs_node *kn = inode->i_private;
> > 
> > -       down_write(&kernfs_rwsem);
> > +       down_read(&kernfs_rwsem);
> >         kernfs_refresh_inode(kn, inode);
> > -       up_write(&kernfs_rwsem);
> > +       up_read(&kernfs_rwsem);
> > 
> >         generic_fillattr(inode, stat);
> >         return 0;
> > @@ -281,9 +283,9 @@ int kernfs_iop_permission(struct inode *inode, int mask)
> > 
> >         kn = inode->i_private;
> > 
> > -       down_write(&kernfs_rwsem);
> > +       down_read(&kernfs_rwsem);
> >         kernfs_refresh_inode(kn, inode);
> > -       up_write(&kernfs_rwsem);
> > +       up_read(&kernfs_rwsem);
> > 
> >         return generic_permission(inode, mask);
> >  }
> > 
> 
> It looks good on my local machine, let me test my benchmark on a big
> machine. :)
> 
> Also, I wonder why i_lock?? what if I use a local spin_lock, will
> there be any difference???

I think that amounts to using a global lock (i.e. static), not a
per-object lock, which is what's needed to reduce contention.
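
i.e. roughly (not actual kernfs code, just a sketch of the difference):

/* one static lock shared by every kernfs inode, so refreshes of
 * unrelated inodes all serialise on the same lock */
static DEFINE_SPINLOCK(inode_lock);

spin_lock(&inode_lock);
/* ... update inode ... */
spin_unlock(&inode_lock);

/* i_lock is a field of struct inode, so this only serialises
 * concurrent refreshes of the same inode */
spin_lock(&inode->i_lock);
/* ... update inode ... */
spin_unlock(&inode->i_lock);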

> 
> static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode
> *inode)
> {
>         struct kernfs_iattrs *attrs = kn->iattr;
>         static DEFINE_SPINLOCK(inode_lock);
> 
>         spin_lock(&inode_lock);
>         inode->i_mode = kn->mode;
>         if (attrs)
>                 /*
>                  * kernfs_node has non-default attributes get them
> from
>                  * persistent copy in kernfs_node.
>                  */
>                 set_inode_attr(inode, attrs);
> 
>         if (kernfs_type(kn) == KERNFS_DIR)
>                 set_nlink(inode, kn->dir.subdirs + 2);
>         spin_unlock(&inode_lock);
> }
> 
> 
> 
> thanks,
> fox
Ian Kent Jan. 13, 2021, 7:50 a.m. UTC | #13
On Wed, 2021-01-13 at 15:47 +0800, Ian Kent wrote:
> On Wed, 2021-01-13 at 15:00 +0800, Fox Chen wrote:
> > On Wed, Jan 13, 2021 at 1:17 PM Ian Kent <raven@themaw.net> wrote:
> > > On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> > > > On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > > > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net
> > > > > > >
> > > > > > wrote:
> > > > > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > > > > Hi Ian,
> > > > > > > > > 
> > > > > > > > > I am rethinking this problem. Can we simply use a
> > > > > > > > > global
> > > > > > > > > lock?
> > > > > > > > > 
> > > > > > > > >  In your original patch 5, you have a global mutex
> > > > > > > > > attr_mutex
> > > > > > > > > to
> > > > > > > > > protect attr, if we change it to a rwsem, is it
> > > > > > > > > enough
> > > > > > > > > to
> > > > > > > > > protect
> > > > > > > > > both
> > > > > > > > > inode and attr while having the concurrent read
> > > > > > > > > ability?
> > > > > > > > > 
> > > > > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > > > > __kernfs_iattrs
> > > > > > > > > part,
> > > > > > > > > but just about that idea )
> > > > > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > > > > 
> > > > > > > > I don't think so.
> > > > > > > > 
> > > > > > > > kernfs_refresh_inode() writes to the inode so taking a
> > > > > > > > read
> > > > > > > > lock
> > > > > > > > will allow multiple processes to concurrently update it
> > > > > > > > which
> > > > > > > > is
> > > > > > > > what we need to avoid.
> > > > > > 
> > > > > > Oh, got it. I missed the inode part. my bad. :(
> > > > > > 
> > > > > > > > It's possibly even more interesting.
> > > > > > > > 
> > > > > > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir()
> > > > > > > > might
> > > > > > > > alter
> > > > > > > > the inode link count (I don't know if that would be the
> > > > > > > > sort
> > > > > > > > of
> > > > > > > > thing
> > > > > > > > they would do but kernfs can't possibly know either).
> > > > > > > > Both of
> > > > > > > > these
> > > > > > > > functions rely on the VFS locking for exclusion but the
> > > > > > > > inode
> > > > > > > > link
> > > > > > > > count is updated in kernfs_refresh_inode() too.
> > > > > > > > 
> > > > > > > > That's the case now, without any patches.
> > > > > > > 
> > > > > > > So it's not so easy to get the inode from just the kernfs
> > > > > > > object
> > > > > > > so these probably aren't a problem ...
> > > > > > 
> > > > > > IIUC only when dop->revalidate, iop->lookup being called,
> > > > > > the
> > > > > > result
> > > > > > of rmdir/mkdir will be sync with vfs.
> > > > > 
> > > > > Don't quite get what you mean here?
> > > > > 
> > > > > Do you mean something like, VFS objects are created on user
> > > > > access
> > > > > to the file system. Given that user access generally means
> > > > > path
> > > > > resolution possibly followed by some operation.
> > > > > 
> > > > > I guess those VFS objects will go away some time after the
> > > > > access
> > > > > but even though the code looks like that should happen
> > > > > pretty
> > > > > quickly after I've observed that these objects stay around
> > > > > longer
> > > > > than expected. There wouldn't be any use in maintaining a
> > > > > least
> > > > > recently used list of dentry candidates eligible to discard.
> > > > 
> > > > Yes, that is what I meant. I think the duration may depend on
> > > > the
> > > > current ram pressure. though not quite sure, I'm still digging
> > > > this
> > > > part of code.
> > > > 
> > > > > > kernfs_node is detached from vfs inode/dentry to save ram.
> > > > > > 
> > > > > > > > I'm not entirely sure what's going on in
> > > > > > > > kernfs_refresh_inode().
> > > > > > > > 
> > > > > > > > It could be as simple as being called with a NULL inode
> > > > > > > > because
> > > > > > > > the dentry concerned is negative at that point. I
> > > > > > > > haven't
> > > > > > > > had
> > > > > > > > time to look closely at it TBH but I have been thinking
> > > > > > > > about
> > > > > > > > it.
> > > > > > 
> > > > > > um, It shouldn't be called with a NULL inode, right?
> > > > > > 
> > > > > > inode->i_mode = kn->mode;
> > > > > > 
> > > > > > otherwise will crash.
> > > > > 
> > > > > Yes, you're right about that.
> > > > > 
> > > > > > > Certainly this can be called without a struct iattr
> > > > > > > having
> > > > > > > been
> > > > > > > allocated ... and given it probably needs to remain a
> > > > > > > pointer
> > > > > > > rather than embedded in the node the inode link count
> > > > > > > update
> > > > > > > can't easily be protected from concurrent updates.
> > > > > > > 
> > > > > > > If it was ok to do the allocation at inode creation the
> > > > > > > problem
> > > > > > > becomes much simpler to resolve but I thought there were
> > > > > > > concerns
> > > > > > > about ram consumption (although I don't think that was
> > > > > > > exactly
> > > > > > > what
> > > > > > > was said?).
> > > > > > > 
> > > > > > 
> > > > > > you meant iattr to be allocated at inode creation time??
> > > > > > yes, I think so. it's due to ram consumption.
> > > > > 
> > > > > I did, yes.
> > > > > 
> > > > > The actual problem is dealing with multiple concurrent
> > > > > updates
> > > > > to
> > > > > the inode link count, the rest can work.
> > > 
> > > Umm ... maybe I've been trying to do this in the wrong place all
> > > along.
> > > 
> > > You know the inode i_lock looks like the sensible thing to use to
> > > protect these updates.
> > > 
> > > Something like this for that last patch should work:
> > > 
> > > kernfs: use i_lock to protect concurrent inode updates
> > > 
> > > From: Ian Kent <raven@themaw.net>
> > > 
> > > The inode operations .permission() and .getattr() use the kernfs
> > > node
> > > write lock but all that's needed is to keep the rb tree stable
> > > while
> > > updating the inode attributes as well as protecting the update
> > > itself
> > > against concurrent changes.
> > > 
> > > And .permission() is called frequently during path walks and can
> > > cause
> > > quite a bit of contention between kernfs node operations and
> > > path
> > > walks when the number of concurrent walks is high.
> > > 
> > > To change kernfs_iop_getattr() and kernfs_iop_permission() to
> > > take
> > > the rw sem read lock instead of the write lock, an additional lock
> > > is
> > > needed to protect against multiple processes concurrently
> > > updating
> > > the inode attributes and link count in kernfs_refresh_inode().
> > > 
> > > The inode i_lock seems like the sensible thing to use to protect
> > > these
> > > inode attribute updates so use it in kernfs_refresh_inode().
> > > 
> > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > ---
> > >  fs/kernfs/inode.c |   10 ++++++----
> > >  1 file changed, 6 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > index ddaf18198935..e26fa5115821 100644
> > > --- a/fs/kernfs/inode.c
> > > +++ b/fs/kernfs/inode.c
> > > @@ -171,6 +171,7 @@ static void kernfs_refresh_inode(struct
> > > kernfs_node *kn, struct inode *inode)
> > >  {
> > >         struct kernfs_iattrs *attrs = kn->iattr;
> > > 
> > > +       spin_lock(&inode->i_lock);
> > >         inode->i_mode = kn->mode;
> > >         if (attrs)
> > >                 /*
> > > @@ -181,6 +182,7 @@ static void kernfs_refresh_inode(struct
> > > kernfs_node *kn, struct inode *inode)
> > > 
> > >         if (kernfs_type(kn) == KERNFS_DIR)
> > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > +       spin_unlock(&inode->i_lock);
> > >  }
> > > 
> > >  int kernfs_iop_getattr(const struct path *path, struct kstat
> > > *stat,
> > > @@ -189,9 +191,9 @@ int kernfs_iop_getattr(const struct path
> > > *path,
> > > struct kstat *stat,
> > >         struct inode *inode = d_inode(path->dentry);
> > >         struct kernfs_node *kn = inode->i_private;
> > > 
> > > -       down_write(&kernfs_rwsem);
> > > +       down_read(&kernfs_rwsem);
> > >         kernfs_refresh_inode(kn, inode);
> > > -       up_write(&kernfs_rwsem);
> > > +       up_read(&kernfs_rwsem);
> > > 
> > >         generic_fillattr(inode, stat);
> > >         return 0;
> > > @@ -281,9 +283,9 @@ int kernfs_iop_permission(struct inode
> > > *inode,
> > > int mask)
> > > 
> > >         kn = inode->i_private;
> > > 
> > > -       down_write(&kernfs_rwsem);
> > > +       down_read(&kernfs_rwsem);
> > >         kernfs_refresh_inode(kn, inode);
> > > -       up_write(&kernfs_rwsem);
> > > +       up_read(&kernfs_rwsem);
> > > 
> > >         return generic_permission(inode, mask);
> > >  }
> > > 
> > 
> > It looks good on my local machine, let me test my benchmark on a
> > big
> > machine. :)
> > 
> > Also, I wonder why i_lock?? what if I use a local spin_lock, will
> > there be any difference???
> 
> I think that amounts to using a global lock (ie. static) not a
> per-object lock which is needed to reduce contention.

And this lock is used for similar purposes in quite a few other
places so it seems like sensible, consistent usage.

> 
> > static void kernfs_refresh_inode(struct kernfs_node *kn, struct
> > inode
> > *inode)
> > {
> >         struct kernfs_iattrs *attrs = kn->iattr;
> >         static DEFINE_SPINLOCK(inode_lock);
> > 
> >         spin_lock(&inode_lock);
> >         inode->i_mode = kn->mode;
> >         if (attrs)
> >                 /*
> >                  * kernfs_node has non-default attributes get them
> > from
> >                  * persistent copy in kernfs_node.
> >                  */
> >                 set_inode_attr(inode, attrs);
> > 
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 set_nlink(inode, kn->dir.subdirs + 2);
> >         spin_unlock(&inode_lock);
> > }
> > 
> > 
> > 
> > thanks,
> > fox
Fox Chen Jan. 13, 2021, 8 a.m. UTC | #14
On Wed, Jan 13, 2021 at 3:51 PM Ian Kent <raven@themaw.net> wrote:
>
> On Wed, 2021-01-13 at 15:47 +0800, Ian Kent wrote:
> > On Wed, 2021-01-13 at 15:00 +0800, Fox Chen wrote:
> > > On Wed, Jan 13, 2021 at 1:17 PM Ian Kent <raven@themaw.net> wrote:
> > > > On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> > > > > On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > > > > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net
> > > > > > > >
> > > > > > > wrote:
> > > > > > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > > > > > Hi Ian,
> > > > > > > > > >
> > > > > > > > > > I am rethinking this problem. Can we simply use a
> > > > > > > > > > global
> > > > > > > > > > lock?
> > > > > > > > > >
> > > > > > > > > >  In your original patch 5, you have a global mutex
> > > > > > > > > > attr_mutex
> > > > > > > > > > to
> > > > > > > > > > protect attr, if we change it to a rwsem, is it
> > > > > > > > > > enough
> > > > > > > > > > to
> > > > > > > > > > protect
> > > > > > > > > > both
> > > > > > > > > > inode and attr while having the concurrent read
> > > > > > > > > > ability?
> > > > > > > > > >
> > > > > > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > > > > > __kernfs_iattrs
> > > > > > > > > > part,
> > > > > > > > > > but just about that idea )
> > > > > > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > > > > >
> > > > > > > > > I don't think so.
> > > > > > > > >
> > > > > > > > > kernfs_refresh_inode() writes to the inode so taking a
> > > > > > > > > read
> > > > > > > > > lock
> > > > > > > > > will allow multiple processes to concurrently update it
> > > > > > > > > which
> > > > > > > > > is
> > > > > > > > > what we need to avoid.
> > > > > > >
> > > > > > > Oh, got it. I missed the inode part. my bad. :(
> > > > > > >
> > > > > > > > > It's possibly even more interesting.
> > > > > > > > >
> > > > > > > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir()
> > > > > > > > > might
> > > > > > > > > alter
> > > > > > > > > the inode link count (I don't know if that would be the
> > > > > > > > > sort
> > > > > > > > > of
> > > > > > > > > thing
> > > > > > > > > they would do but kernfs can't possibly know either).
> > > > > > > > > Both of
> > > > > > > > > these
> > > > > > > > > functions rely on the VFS locking for exclusion but the
> > > > > > > > > inode
> > > > > > > > > link
> > > > > > > > > count is updated in kernfs_refresh_inode() too.
> > > > > > > > >
> > > > > > > > > That's the case now, without any patches.
> > > > > > > >
> > > > > > > > So it's not so easy to get the inode from just the kernfs
> > > > > > > > object
> > > > > > > > so these probably aren't a problem ...
> > > > > > >
> > > > > > > IIUC only when dop->revalidate, iop->lookup being called,
> > > > > > > the
> > > > > > > result
> > > > > > > of rmdir/mkdir will be sync with vfs.
> > > > > >
> > > > > > Don't quite get what you mean here?
> > > > > >
> > > > > > Do you mean something like, VFS objects are created on user
> > > > > > access
> > > > > > to the file system. Given that user access generally means
> > > > > > path
> > > > > > resolution possibly followed by some operation.
> > > > > >
> > > > > > I guess those VFS objects will go away some time after the
> > > > > > access
> > > > > > but even though the code looks like that should happen
> > > > > > pretty
> > > > > > quickly after I've observed that these objects stay around
> > > > > > longer
> > > > > > than expected. There wouldn't be any use in maintaining a
> > > > > > least
> > > > > > recently used list of dentry candidates eligible to discard.
> > > > >
> > > > > Yes, that is what I meant. I think the duration may depend on
> > > > > the
> > > > > current ram pressure. though not quite sure, I'm still digging
> > > > > this
> > > > > part of code.
> > > > >
> > > > > > > kernfs_node is detached from vfs inode/dentry to save ram.
> > > > > > >
> > > > > > > > > I'm not entirely sure what's going on in
> > > > > > > > > kernfs_refresh_inode().
> > > > > > > > >
> > > > > > > > > It could be as simple as being called with a NULL inode
> > > > > > > > > because
> > > > > > > > > the dentry concerned is negative at that point. I
> > > > > > > > > haven't
> > > > > > > > > had
> > > > > > > > > time to look closely at it TBH but I have been thinking
> > > > > > > > > about
> > > > > > > > > it.
> > > > > > >
> > > > > > > um, It shouldn't be called with a NULL inode, right?
> > > > > > >
> > > > > > > inode->i_mode = kn->mode;
> > > > > > >
> > > > > > > otherwise will crash.
> > > > > >
> > > > > > Yes, you're right about that.
> > > > > >
> > > > > > > > Certainly this can be called without a struct iattr
> > > > > > > > having
> > > > > > > > been
> > > > > > > > allocated ... and given it probably needs to remain a
> > > > > > > > pointer
> > > > > > > > rather than embedded in the node the inode link count
> > > > > > > > update
> > > > > > > > can't easily be protected from concurrent updates.
> > > > > > > >
> > > > > > > > If it was ok to do the allocation at inode creation the
> > > > > > > > problem
> > > > > > > > becomes much simpler to resolve but I thought there were
> > > > > > > > concerns
> > > > > > > > about ram consumption (although I don't think that was
> > > > > > > > exactly
> > > > > > > > what
> > > > > > > > was said?).
> > > > > > > >
> > > > > > >
> > > > > > > you meant iattr to be allocated at inode creation time??
> > > > > > > yes, I think so. it's due to ram consumption.
> > > > > >
> > > > > > I did, yes.
> > > > > >
> > > > > > The actual problem is dealing with multiple concurrent
> > > > > > updates
> > > > > > to
> > > > > > the inode link count, the rest can work.
> > > >
> > > > Umm ... maybe I've been trying to do this in the wrong place all
> > > > along.
> > > >
> > > > You know the inode i_lock looks like the sensible thing to use to
> > > > protect these updates.
> > > >
> > > > Something like this for that last patch should work:
> > > >
> > > > kernfs: use i_lock to protect concurrent inode updates
> > > >
> > > > From: Ian Kent <raven@themaw.net>
> > > >
> > > > The inode operations .permission() and .getattr() use the kernfs
> > > > node
> > > > write lock but all that's needed is to keep the rb tree stable
> > > > while
> > > > updating the inode attributes as well as protecting the update
> > > > itself
> > > > against concurrent changes.
> > > >
> > > > And .permission() is called frequently during path walks and can
> > > > cause
> > > > quite a bit of contention between kernfs node operations and
> > > > path
> > > > walks when the number of concurrent walks is high.
> > > >
> > > > To change kernfs_iop_getattr() and kernfs_iop_permission() to
> > > > take
> > > > the rw sem read lock instead of the write lock, an additional lock
> > > > is
> > > > needed to protect against multiple processes concurrently
> > > > updating
> > > > the inode attributes and link count in kernfs_refresh_inode().
> > > >
> > > > The inode i_lock seems like the sensible thing to use to protect
> > > > these
> > > > inode attribute updates so use it in kernfs_refresh_inode().
> > > >
> > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > ---
> > > >  fs/kernfs/inode.c |   10 ++++++----
> > > >  1 file changed, 6 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > > index ddaf18198935..e26fa5115821 100644
> > > > --- a/fs/kernfs/inode.c
> > > > +++ b/fs/kernfs/inode.c
> > > > @@ -171,6 +171,7 @@ static void kernfs_refresh_inode(struct
> > > > kernfs_node *kn, struct inode *inode)
> > > >  {
> > > >         struct kernfs_iattrs *attrs = kn->iattr;
> > > >
> > > > +       spin_lock(&inode->i_lock);
> > > >         inode->i_mode = kn->mode;
> > > >         if (attrs)
> > > >                 /*
> > > > @@ -181,6 +182,7 @@ static void kernfs_refresh_inode(struct
> > > > kernfs_node *kn, struct inode *inode)
> > > >
> > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > > +       spin_unlock(&inode->i_lock);
> > > >  }
> > > >
> > > >  int kernfs_iop_getattr(const struct path *path, struct kstat
> > > > *stat,
> > > > @@ -189,9 +191,9 @@ int kernfs_iop_getattr(const struct path
> > > > *path,
> > > > struct kstat *stat,
> > > >         struct inode *inode = d_inode(path->dentry);
> > > >         struct kernfs_node *kn = inode->i_private;
> > > >
> > > > -       down_write(&kernfs_rwsem);
> > > > +       down_read(&kernfs_rwsem);
> > > >         kernfs_refresh_inode(kn, inode);
> > > > -       up_write(&kernfs_rwsem);
> > > > +       up_read(&kernfs_rwsem);
> > > >
> > > >         generic_fillattr(inode, stat);
> > > >         return 0;
> > > > @@ -281,9 +283,9 @@ int kernfs_iop_permission(struct inode
> > > > *inode,
> > > > int mask)
> > > >
> > > >         kn = inode->i_private;
> > > >
> > > > -       down_write(&kernfs_rwsem);
> > > > +       down_read(&kernfs_rwsem);
> > > >         kernfs_refresh_inode(kn, inode);
> > > > -       up_write(&kernfs_rwsem);
> > > > +       up_read(&kernfs_rwsem);
> > > >
> > > >         return generic_permission(inode, mask);
> > > >  }
> > > >
> > >
> > > It looks good on my local machine, let me test my benchmark on a
> > > big
> > > machine. :)
> > >
> > > Also, I wonder why i_lock?? what if I use a local spin_lock, will
> > > there be any difference???
> >
> > I think that amounts to using a global lock (ie. static) not a
> > per-object lock which is needed to reduce contention.
>
> And this lock is used for similar purposes in quite a few other
> places so it seems like sensible, consistent usage.

Oh, yes, that's awesome. And it would never be used to protect inode
operations on the VFS side the way i_rwsem is. Sounds great!

> >
> > > static void kernfs_refresh_inode(struct kernfs_node *kn, struct
> > > inode
> > > *inode)
> > > {
> > >         struct kernfs_iattrs *attrs = kn->iattr;
> > >         static DEFINE_SPINLOCK(inode_lock);
> > >
> > >         spin_lock(&inode_lock);
> > >         inode->i_mode = kn->mode;
> > >         if (attrs)
> > >                 /*
> > >                  * kernfs_node has non-default attributes get them
> > > from
> > >                  * persistent copy in kernfs_node.
> > >                  */
> > >                 set_inode_attr(inode, attrs);
> > >
> > >         if (kernfs_type(kn) == KERNFS_DIR)
> > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > >         spin_unlock(&inode_lock);
> > > }
> > >
> > >
> > >
> > > thanks,
> > > fox
>
Fox Chen Jan. 14, 2021, 3:20 a.m. UTC | #15
On Wed, Jan 13, 2021 at 4:00 PM Fox Chen <foxhlchen@gmail.com> wrote:
>
> On Wed, Jan 13, 2021 at 3:51 PM Ian Kent <raven@themaw.net> wrote:
> >
> > On Wed, 2021-01-13 at 15:47 +0800, Ian Kent wrote:
> > > On Wed, 2021-01-13 at 15:00 +0800, Fox Chen wrote:
> > > > On Wed, Jan 13, 2021 at 1:17 PM Ian Kent <raven@themaw.net> wrote:
> > > > > On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> > > > > > On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net>
> > > > > > wrote:
> > > > > > > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > > > > > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <raven@themaw.net
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > > > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > > > > > > Hi Ian,
> > > > > > > > > > >
> > > > > > > > > > > I am rethinking this problem. Can we simply use a
> > > > > > > > > > > global
> > > > > > > > > > > lock?
> > > > > > > > > > >
> > > > > > > > > > >  In your original patch 5, you have a global mutex
> > > > > > > > > > > attr_mutex
> > > > > > > > > > > to
> > > > > > > > > > > protect attr, if we change it to a rwsem, is it
> > > > > > > > > > > enough
> > > > > > > > > > > to
> > > > > > > > > > > protect
> > > > > > > > > > > both
> > > > > > > > > > > inode and attr while having the concurrent read
> > > > > > > > > > > ability?
> > > > > > > > > > >
> > > > > > > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > > > > > > __kernfs_iattrs
> > > > > > > > > > > part,
> > > > > > > > > > > but just about that idea )
> > > > > > > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > > > > > >
> > > > > > > > > > I don't think so.
> > > > > > > > > >
> > > > > > > > > > kernfs_refresh_inode() writes to the inode so taking a
> > > > > > > > > > read
> > > > > > > > > > lock
> > > > > > > > > > will allow multiple processes to concurrently update it
> > > > > > > > > > which
> > > > > > > > > > is
> > > > > > > > > > what we need to avoid.
> > > > > > > >
> > > > > > > > Oh, got it. I missed the inode part. my bad. :(
> > > > > > > >
> > > > > > > > > > It's possibly even more interesting.
> > > > > > > > > >
> > > > > > > > > > For example, kernfs_iop_rmdir() and kernfs_iop_mkdir()
> > > > > > > > > > might
> > > > > > > > > > alter
> > > > > > > > > > the inode link count (I don't know if that would be the
> > > > > > > > > > sort
> > > > > > > > > > of
> > > > > > > > > > thing
> > > > > > > > > > they would do but kernfs can't possibly know either).
> > > > > > > > > > Both of
> > > > > > > > > > these
> > > > > > > > > > functions rely on the VFS locking for exclusion but the
> > > > > > > > > > inode
> > > > > > > > > > link
> > > > > > > > > > count is updated in kernfs_refresh_inode() too.
> > > > > > > > > >
> > > > > > > > > > That's the case now, without any patches.
> > > > > > > > >
> > > > > > > > > So it's not so easy to get the inode from just the kernfs
> > > > > > > > > object
> > > > > > > > > so these probably aren't a problem ...
> > > > > > > >
> > > > > > > > IIUC only when dop->revalidate, iop->lookup being called,
> > > > > > > > the
> > > > > > > > result
> > > > > > > > of rmdir/mkdir will be sync with vfs.
> > > > > > >
> > > > > > > Don't quite get what you mean here?
> > > > > > >
> > > > > > > Do you mean something like, VFS objects are created on user
> > > > > > > access
> > > > > > > to the file system. Given that user access generally means
> > > > > > > path
> > > > > > > resolution possibly followed by some operation.
> > > > > > >
> > > > > > > I guess those VFS objects will go away some time after the
> > > > > > > access
> > > > > > > but even though the code looks like that should happen
> > > > > > > pretty
> > > > > > > quickly after I've observed that these objects stay around
> > > > > > > longer
> > > > > > > than expected. There wouldn't be any use in maintaining a
> > > > > > > least
> > > > > > > recently used list of dentry candidates eligible to discard.
> > > > > >
> > > > > > Yes, that is what I meant. I think the duration may depend on
> > > > > > the
> > > > > > current ram pressure. though not quite sure, I'm still digging
> > > > > > this
> > > > > > part of code.
> > > > > >
> > > > > > > > kernfs_node is detached from vfs inode/dentry to save ram.
> > > > > > > >
> > > > > > > > > > I'm not entirely sure what's going on in
> > > > > > > > > > kernfs_refresh_inode().
> > > > > > > > > >
> > > > > > > > > > It could be as simple as being called with a NULL inode
> > > > > > > > > > because
> > > > > > > > > > the dentry concerned is negative at that point. I
> > > > > > > > > > haven't
> > > > > > > > > > had
> > > > > > > > > > time to look closely at it TBH but I have been thinking
> > > > > > > > > > about
> > > > > > > > > > it.
> > > > > > > >
> > > > > > > > um, It shouldn't be called with a NULL inode, right?
> > > > > > > >
> > > > > > > > inode->i_mode = kn->mode;
> > > > > > > >
> > > > > > > > otherwise will crash.
> > > > > > >
> > > > > > > Yes, you're right about that.
> > > > > > >
> > > > > > > > > Certainly this can be called without a struct iattr
> > > > > > > > > having
> > > > > > > > > been
> > > > > > > > > allocated ... and given it probably needs to remain a
> > > > > > > > > pointer
> > > > > > > > > rather than embedded in the node the inode link count
> > > > > > > > > update
> > > > > > > > > can't easily be protected from concurrent updates.
> > > > > > > > >
> > > > > > > > > If it was ok to do the allocation at inode creation the
> > > > > > > > > problem
> > > > > > > > > becomes much simpler to resolve but I thought there were
> > > > > > > > > concerns
> > > > > > > > > about ram consumption (although I don't think that was
> > > > > > > > > exactly
> > > > > > > > > what
> > > > > > > > > was said?).
> > > > > > > > >
> > > > > > > >
> > > > > > > > you meant iattr to be allocated at inode creation time??
> > > > > > > > yes, I think so. it's due to ram consumption.
> > > > > > >
> > > > > > > I did, yes.
> > > > > > >
> > > > > > > The actual problem is dealing with multiple concurrent
> > > > > > > updates
> > > > > > > to
> > > > > > > the inode link count, the rest can work.
> > > > >
> > > > > Umm ... maybe I've been trying to do this in the wrong place all
> > > > > along.
> > > > >
> > > > > You know the inode i_lock looks like the sensible thing to use to
> > > > > protect these updates.
> > > > >
> > > > > Something like this for that last patch should work:
> > > > >
> > > > > kernfs: use i_lock to protect concurrent inode updates
> > > > >
> > > > > From: Ian Kent <raven@themaw.net>
> > > > >
> > > > > The inode operations .permission() and .getattr() use the kernfs
> > > > > node
> > > > > write lock but all that's needed is to keep the rb tree stable
> > > > > while
> > > > > updating the inode attributes as well as protecting the update
> > > > > itself
> > > > > against concurrent changes.
> > > > >
> > > > > And .permission() is called frequently during path walks and can
> > > > > cause
> > > > > quite a bit of contention between kernfs node operations and
> > > > > path
> > > > > walks when the number of concurrent walks is high.
> > > > >
> > > > > To change kernfs_iop_getattr() and kernfs_iop_permission() to
> > > > > take
> > > > > the rw sem read lock instead of the write lock, an additional lock
> > > > > is
> > > > > needed to protect against multiple processes concurrently
> > > > > updating
> > > > > the inode attributes and link count in kernfs_refresh_inode().
> > > > >
> > > > > The inode i_lock seems like the sensible thing to use to protect
> > > > > these
> > > > > inode attribute updates so use it in kernfs_refresh_inode().
> > > > >
> > > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > > ---
> > > > >  fs/kernfs/inode.c |   10 ++++++----
> > > > >  1 file changed, 6 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > > > index ddaf18198935..e26fa5115821 100644
> > > > > --- a/fs/kernfs/inode.c
> > > > > +++ b/fs/kernfs/inode.c
> > > > > @@ -171,6 +171,7 @@ static void kernfs_refresh_inode(struct
> > > > > kernfs_node *kn, struct inode *inode)
> > > > >  {
> > > > >         struct kernfs_iattrs *attrs = kn->iattr;
> > > > >
> > > > > +       spin_lock(&inode->i_lock);
> > > > >         inode->i_mode = kn->mode;
> > > > >         if (attrs)
> > > > >                 /*
> > > > > @@ -181,6 +182,7 @@ static void kernfs_refresh_inode(struct
> > > > > kernfs_node *kn, struct inode *inode)
> > > > >
> > > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > > > +       spin_unlock(&inode->i_lock);
> > > > >  }
> > > > >
> > > > >  int kernfs_iop_getattr(const struct path *path, struct kstat
> > > > > *stat,
> > > > > @@ -189,9 +191,9 @@ int kernfs_iop_getattr(const struct path
> > > > > *path,
> > > > > struct kstat *stat,
> > > > >         struct inode *inode = d_inode(path->dentry);
> > > > >         struct kernfs_node *kn = inode->i_private;
> > > > >
> > > > > -       down_write(&kernfs_rwsem);
> > > > > +       down_read(&kernfs_rwsem);
> > > > >         kernfs_refresh_inode(kn, inode);
> > > > > -       up_write(&kernfs_rwsem);
> > > > > +       up_read(&kernfs_rwsem);
> > > > >
> > > > >         generic_fillattr(inode, stat);
> > > > >         return 0;
> > > > > @@ -281,9 +283,9 @@ int kernfs_iop_permission(struct inode
> > > > > *inode,
> > > > > int mask)
> > > > >
> > > > >         kn = inode->i_private;
> > > > >
> > > > > -       down_write(&kernfs_rwsem);
> > > > > +       down_read(&kernfs_rwsem);
> > > > >         kernfs_refresh_inode(kn, inode);
> > > > > -       up_write(&kernfs_rwsem);
> > > > > +       up_read(&kernfs_rwsem);
> > > > >
> > > > >         return generic_permission(inode, mask);
> > > > >  }
> > > > >
> > > >
> > > > It looks good on my local machine, let me test my benchmark on a
> > > > big
> > > > machine. :)

I've tested it on a 96-core machine and it looks good. :)
Before the patch, an open+read+close cycle took 500us with little
variation.
After the patch, the fastest cycle was around 40us and the slowest was
around 120us (because of the spinlock, the latency stacks up, which I
don't think we can avoid).
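
For reference, the cycle being timed is essentially just the following.
This is only a rough sketch with a placeholder sysfs path and iteration
count, not the exact benchmark source, and one instance is run per core
to generate the contention:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Rough sketch of the open+read+close cycle timed above; the sysfs
 * path and iteration count are placeholders, run one copy per core
 * to reproduce the contention.
 */
int main(void)
{
	const char *path = "/sys/devices/system/cpu/cpu0/topology/core_id";
	struct timespec start, end;
	char buf[64];
	long iters = 100000;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf)) < 0) {
			close(fd);
			return 1;
		}
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("avg %ld ns per cycle\n",
	       ((end.tv_sec - start.tv_sec) * 1000000000L +
		(end.tv_nsec - start.tv_nsec)) / iters);
	return 0;
}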

the perf report is roughly like this:

              --99.26%--entry_SYSCALL_64_after_hwframe
                        |
                         --99.13%--do_syscall_64
                                   |
                                   |--60.23%--__x64_sys_openat
                                   |          do_sys_open
                                   |          do_sys_openat2
                                   |          |
                                   |          |--20.99%--fd_install
                                   |          |          |
                                   |          |           --20.53%--__fd_install
                                   |          |                     |
                                   |          |                      --20.25%--_raw_spin_lock
                                   |          |                                native_queued_spin_lock_slowpath
                                   |          |
                                   |          |--20.26%--get_unused_fd_flags
                                   |          |          |
                                   |          |          |--19.59%--__alloc_fd
                                   |          |          |          |
                                   |          |          |           --18.87%--_raw_spin_lock
                                   |          |          |                     native_queued_spin_lock_slowpath
                                   |          |          |
                                   |          |           --0.68%--_raw_spin_lock
                                   |          |
                                   |           --18.47%--do_filp_open
                                   |                     |
                                   |                      --18.40%--path_openat
                                   |                                |
                                   |                                |--7.69%--link_path_walk.part.0
                                   |                                |          |
                                   |                                |          |--4.38%--inode_permission
                                   |                                |          |          |
                                   |                                |          |          |--3.03%--kernfs_iop_permission
                                   |                                |          |          |          |
                                   |                                |          |          |          |--1.33%--kernfs_refresh_inode
                                   |                                |          |          |          |          _raw_spin_lock
                                   |                                |          |          |          |          native_queued_spin_lock_slowpath



thanks,
fox
Ian Kent Jan. 14, 2021, 5:37 a.m. UTC | #16
On Wed, 2021-01-13 at 16:00 +0800, Fox Chen wrote:
> On Wed, Jan 13, 2021 at 3:51 PM Ian Kent <raven@themaw.net> wrote:
> > On Wed, 2021-01-13 at 15:47 +0800, Ian Kent wrote:
> > > On Wed, 2021-01-13 at 15:00 +0800, Fox Chen wrote:
> > > > On Wed, Jan 13, 2021 at 1:17 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > On Mon, 2021-01-11 at 17:02 +0800, Fox Chen wrote:
> > > > > > On Mon, Jan 11, 2021 at 4:42 PM Ian Kent <raven@themaw.net>
> > > > > > wrote:
> > > > > > > On Mon, 2021-01-11 at 15:04 +0800, Fox Chen wrote:
> > > > > > > > On Mon, Jan 11, 2021 at 12:20 PM Ian Kent <
> > > > > > > > raven@themaw.net
> > > > > > > > wrote:
> > > > > > > > > On Mon, 2021-01-11 at 11:19 +0800, Ian Kent wrote:
> > > > > > > > > > On Wed, 2021-01-06 at 10:38 +0800, Fox Chen wrote:
> > > > > > > > > > > Hi Ian,
> > > > > > > > > > > 
> > > > > > > > > > > I am rethinking this problem. Can we simply use a
> > > > > > > > > > > global
> > > > > > > > > > > lock?
> > > > > > > > > > > 
> > > > > > > > > > >  In your original patch 5, you have a global
> > > > > > > > > > > mutex
> > > > > > > > > > > attr_mutex
> > > > > > > > > > > to
> > > > > > > > > > > protect attr, if we change it to a rwsem, is it
> > > > > > > > > > > enough
> > > > > > > > > > > to
> > > > > > > > > > > protect
> > > > > > > > > > > both
> > > > > > > > > > > inode and attr while having the concurrent read
> > > > > > > > > > > ability?
> > > > > > > > > > > 
> > > > > > > > > > > like this patch I submitted. ( clearly, I missed
> > > > > > > > > > > __kernfs_iattrs
> > > > > > > > > > > part,
> > > > > > > > > > > but just about that idea )
> > > > > > > > > > > https://lore.kernel.org/lkml/20201207084333.179132-1-foxhlchen@gmail.com/
> > > > > > > > > > 
> > > > > > > > > > I don't think so.
> > > > > > > > > > 
> > > > > > > > > > kernfs_refresh_inode() writes to the inode so
> > > > > > > > > > taking a
> > > > > > > > > > read
> > > > > > > > > > lock
> > > > > > > > > > will allow multiple processes to concurrently
> > > > > > > > > > update it
> > > > > > > > > > which
> > > > > > > > > > is
> > > > > > > > > > what we need to avoid.
> > > > > > > > 
> > > > > > > > Oh, got it. I missed the inode part. my bad. :(
> > > > > > > > 
> > > > > > > > > > It's possibly even more interesting.
> > > > > > > > > > 
> > > > > > > > > > For example, kernfs_iop_rmdir() and
> > > > > > > > > > kernfs_iop_mkdir()
> > > > > > > > > > might
> > > > > > > > > > alter
> > > > > > > > > > the inode link count (I don't know if that would be
> > > > > > > > > > the
> > > > > > > > > > sort
> > > > > > > > > > of
> > > > > > > > > > thing
> > > > > > > > > > they would do but kernfs can't possibly know
> > > > > > > > > > either).
> > > > > > > > > > Both of
> > > > > > > > > > these
> > > > > > > > > > functions rely on the VFS locking for exclusion but
> > > > > > > > > > the
> > > > > > > > > > inode
> > > > > > > > > > link
> > > > > > > > > > count is updated in kernfs_refresh_inode() too.
> > > > > > > > > > 
> > > > > > > > > > That's the case now, without any patches.
> > > > > > > > > 
> > > > > > > > > So it's not so easy to get the inode from just the
> > > > > > > > > kernfs
> > > > > > > > > object
> > > > > > > > > so these probably aren't a problem ...
> > > > > > > > 
> > > > > > > > IIUC only when dop->revalidate, iop->lookup being
> > > > > > > > called,
> > > > > > > > the
> > > > > > > > result
> > > > > > > > of rmdir/mkdir will be sync with vfs.
> > > > > > > 
> > > > > > > Don't quite get what you mean here?
> > > > > > > 
> > > > > > > Do you mean something like, VFS objects are created on
> > > > > > > user
> > > > > > > access
> > > > > > > to the file system. Given that user access generally
> > > > > > > means
> > > > > > > path
> > > > > > > resolution possibly followed by some operation.
> > > > > > > 
> > > > > > > I guess those VFS objects will go away some time after
> > > > > > > the
> > > > > > > access
> > > > > > > but even though the code looks like that should happen
> > > > > > > pretty
> > > > > > > quickly after I've observed that these objects stay
> > > > > > > around
> > > > > > > longer
> > > > > > > than expected. There wouldn't be any use in maintaining a
> > > > > > > least
> > > > > > > recently used list of dentry candidates eligible to
> > > > > > > discard.
> > > > > > 
> > > > > > Yes, that is what I meant. I think the duration may depend
> > > > > > on
> > > > > > the
> > > > > > current ram pressure. though not quite sure, I'm still
> > > > > > digging
> > > > > > this
> > > > > > part of code.
> > > > > > 
> > > > > > > > kernfs_node is detached from vfs inode/dentry to save
> > > > > > > > ram.
> > > > > > > > 
> > > > > > > > > > I'm not entirely sure what's going on in
> > > > > > > > > > kernfs_refresh_inode().
> > > > > > > > > > 
> > > > > > > > > > It could be as simple as being called with a NULL
> > > > > > > > > > inode
> > > > > > > > > > because
> > > > > > > > > > the dentry concerned is negative at that point. I
> > > > > > > > > > haven't
> > > > > > > > > > had
> > > > > > > > > > time to look closely at it TBH but I have been
> > > > > > > > > > thinking
> > > > > > > > > > about
> > > > > > > > > > it.
> > > > > > > > 
> > > > > > > > um, It shouldn't be called with a NULL inode, right?
> > > > > > > > 
> > > > > > > > inode->i_mode = kn->mode;
> > > > > > > > 
> > > > > > > > otherwise will crash.
> > > > > > > 
> > > > > > > Yes, you're right about that.
> > > > > > > 
> > > > > > > > > Certainly this can be called without a struct iattr
> > > > > > > > > having
> > > > > > > > > been
> > > > > > > > > allocated ... and given it probably needs to remain a
> > > > > > > > > pointer
> > > > > > > > > rather than embedded in the node the inode link count
> > > > > > > > > update
> > > > > > > > > can't easily be protected from concurrent updates.
> > > > > > > > > 
> > > > > > > > > If it was ok to do the allocation at inode creation
> > > > > > > > > the
> > > > > > > > > problem
> > > > > > > > > becomes much simpler to resolve but I thought there
> > > > > > > > > were
> > > > > > > > > concerns
> > > > > > > > > about ram consumption (although I don't think that
> > > > > > > > > was
> > > > > > > > > exactly
> > > > > > > > > what
> > > > > > > > > was said?).
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > you meant iattr to be allocated at inode creation
> > > > > > > > time??
> > > > > > > > yes, I think so. it's due to ram consumption.
> > > > > > > 
> > > > > > > I did, yes.
> > > > > > > 
> > > > > > > The actual problem is dealing with multiple concurrent
> > > > > > > updates
> > > > > > > to
> > > > > > > the inode link count, the rest can work.
> > > > > 
> > > > > Umm ... maybe I've been trying to do this in the wrong place
> > > > > all
> > > > > along.
> > > > > 
> > > > > You know the inode i_lock looks like the sensible thing to
> > > > > use to
> > > > > protect these updates.
> > > > > 
> > > > > Something like this for that last patch should work:
> > > > > 
> > > > > kernfs: use i_lock to protect concurrent inode updates
> > > > > 
> > > > > From: Ian Kent <raven@themaw.net>
> > > > > 
> > > > > The inode operations .permission() and .getattr() use the
> > > > > kernfs
> > > > > node
> > > > > write lock but all that's needed is to keep the rb tree
> > > > > stable
> > > > > while
> > > > > updating the inode attributes as well as protecting the
> > > > > update
> > > > > itself
> > > > > against concurrent changes.
> > > > > 
> > > > > And .permission() is called frequently during path walks and
> > > > > can
> > > > > cause
> > > > > quite a bit of contention between kernfs node operations and
> > > > > path
> > > > > walks when the number of concurrent walks is high.
> > > > > 
> > > > > To change kernfs_iop_getattr() and kernfs_iop_permission() to
> > > > > take
> > > > > the rw sem read lock instead of the write lock, an additional lock
> > > > > lock
> > > > > is
> > > > > needed to protect against multiple processes concurrently
> > > > > updating
> > > > > the inode attributes and link count in
> > > > > kernfs_refresh_inode().
> > > > > 
> > > > > The inode i_lock seems like the sensible thing to use to
> > > > > protect
> > > > > these
> > > > > inode attribute updates so use it in kernfs_refresh_inode().
> > > > > 
> > > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > > ---
> > > > >  fs/kernfs/inode.c |   10 ++++++----
> > > > >  1 file changed, 6 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > > > index ddaf18198935..e26fa5115821 100644
> > > > > --- a/fs/kernfs/inode.c
> > > > > +++ b/fs/kernfs/inode.c
> > > > > @@ -171,6 +171,7 @@ static void kernfs_refresh_inode(struct
> > > > > kernfs_node *kn, struct inode *inode)
> > > > >  {
> > > > >         struct kernfs_iattrs *attrs = kn->iattr;
> > > > > 
> > > > > +       spin_lock(&inode->i_lock);
> > > > >         inode->i_mode = kn->mode;
> > > > >         if (attrs)
> > > > >                 /*
> > > > > @@ -181,6 +182,7 @@ static void kernfs_refresh_inode(struct
> > > > > kernfs_node *kn, struct inode *inode)
> > > > > 
> > > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > > > +       spin_unlock(&inode->i_lock);
> > > > >  }
> > > > > 
> > > > >  int kernfs_iop_getattr(const struct path *path, struct kstat
> > > > > *stat,
> > > > > @@ -189,9 +191,9 @@ int kernfs_iop_getattr(const struct path
> > > > > *path,
> > > > > struct kstat *stat,
> > > > >         struct inode *inode = d_inode(path->dentry);
> > > > >         struct kernfs_node *kn = inode->i_private;
> > > > > 
> > > > > -       down_write(&kernfs_rwsem);
> > > > > +       down_read(&kernfs_rwsem);
> > > > >         kernfs_refresh_inode(kn, inode);
> > > > > -       up_write(&kernfs_rwsem);
> > > > > +       up_read(&kernfs_rwsem);
> > > > > 
> > > > >         generic_fillattr(inode, stat);
> > > > >         return 0;
> > > > > @@ -281,9 +283,9 @@ int kernfs_iop_permission(struct inode
> > > > > *inode,
> > > > > int mask)
> > > > > 
> > > > >         kn = inode->i_private;
> > > > > 
> > > > > -       down_write(&kernfs_rwsem);
> > > > > +       down_read(&kernfs_rwsem);
> > > > >         kernfs_refresh_inode(kn, inode);
> > > > > -       up_write(&kernfs_rwsem);
> > > > > +       up_read(&kernfs_rwsem);
> > > > > 
> > > > >         return generic_permission(inode, mask);
> > > > >  }
> > > > > 
> > > > 
> > > > It looks good on my local machine, let me test my benchmark on
> > > > a
> > > > big
> > > > machine. :)
> > > > 
> > > > Also, I wonder why i_lock?? what if I use a local spin_lock,
> > > > will
> > > > there be any difference???
> > > 
> > > I think that amounts to using a global lock (ie. static) not a
> > > per-object lock which is needed to reduce contention.
> > 
> > And this lock is used for similar purposes in quite a few other
> > places so it seems like sensible, consistent usage.
> 
> Oh, yes, that's awesome. And it would never be used to protect iop
> operations on VFS side as i_rwsem does. Sounds great!

I'm not quite done yet though.

I will need to spend some time checking for any places where the VFS
updates those inode attributes or the inode link count to verify there
are no more possible races than there were before changing to use the
read lock.

Ian
> 
> > > > static void kernfs_refresh_inode(struct kernfs_node *kn, struct
> > > > inode
> > > > *inode)
> > > > {
> > > >         struct kernfs_iattrs *attrs = kn->iattr;
> > > >         static DEFINE_SPINLOCK(inode_lock);
> > > > 
> > > >         spin_lock(&inode_lock);
> > > >         inode->i_mode = kn->mode;
> > > >         if (attrs)
> > > >                 /*
> > > >                  * kernfs_node has non-default attributes get
> > > > them
> > > > from
> > > >                  * persistent copy in kernfs_node.
> > > >                  */
> > > >                 set_inode_attr(inode, attrs);
> > > > 
> > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > >         spin_unlock(&inode_lock);
> > > > }
> > > > 
> > > > 
> > > > 
> > > > thanks,
> > > > fox