diff mbox series

[v6,5/7] kernfs: use i_lock to protect concurrent inode updates

Message ID 162322868275.361452.17585267026652222121.stgit@web.messagingengine.com (mailing list archive)
State New
Headers show
Series kernfs: proposed locking and concurrency improvement | expand

Commit Message

Ian Kent June 9, 2021, 8:51 a.m. UTC
The inode operations .permission() and .getattr() use the kernfs node
write lock but all that's needed is to keep the rb tree stable while
updating the inode attributes as well as protecting the update itself
against concurrent changes.

And .permission() is called frequently during path walks and can cause
quite a bit of contention between kernfs node operations and path
walks when the number of concurrent walks is high.

To change kernfs_iop_getattr() and kernfs_iop_permission() to take
the rw sem read lock instead of the write lock an additional lock is
needed to protect against multiple processes concurrently updating
the inode attributes and link count in kernfs_refresh_inode().

The inode i_lock seems like the sensible thing to use to protect these
inode attribute updates so use it in kernfs_refresh_inode().

The last hunk in the patch, applied to kernfs_fill_super(), is possibly
not needed but taking the lock was present originally and I prefer to
continue to take it so the rb tree is held stable during the call to
kernfs_refresh_inode() made by kernfs_get_inode().

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c |   10 ++++++----
 fs/kernfs/mount.c |    4 ++--
 2 files changed, 8 insertions(+), 6 deletions(-)

Comments

Miklos Szeredi June 11, 2021, 1:11 p.m. UTC | #1
On Wed, 9 Jun 2021 at 10:52, Ian Kent <raven@themaw.net> wrote:
>
> The inode operations .permission() and .getattr() use the kernfs node
> write lock but all that's needed is to keep the rb tree stable while
> updating the inode attributes as well as protecting the update itself
> against concurrent changes.
>
> And .permission() is called frequently during path walks and can cause
> quite a bit of contention between kernfs node operations and path
> walks when the number of concurrent walks is high.
>
> To change kernfs_iop_getattr() and kernfs_iop_permission() to take
> the rw sem read lock instead of the write lock an additional lock is
> needed to protect against multiple processes concurrently updating
> the inode attributes and link count in kernfs_refresh_inode().
>
> The inode i_lock seems like the sensible thing to use to protect these
> inode attribute updates so use it in kernfs_refresh_inode().
>
> The last hunk in the patch, applied to kernfs_fill_super(), is possibly
> not needed but taking the lock was present originally and I prefer to
> continue to take it so the rb tree is held stable during the call to
> kernfs_refresh_inode() made by kernfs_get_inode().
>
> Signed-off-by: Ian Kent <raven@themaw.net>

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Al Viro June 12, 2021, 1:45 a.m. UTC | #2
On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> The inode operations .permission() and .getattr() use the kernfs node
> write lock but all that's needed is to keep the rb tree stable while
> updating the inode attributes as well as protecting the update itself
> against concurrent changes.

Huh?  Where does it access the rbtree at all?  Confused...

> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 3b01e9e61f14e..6728ecd81eb37 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
>  {
>  	struct kernfs_iattrs *attrs = kn->iattr;
>  
> +	spin_lock(&inode->i_lock);
>  	inode->i_mode = kn->mode;
>  	if (attrs)
>  		/*
> @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
>  
>  	if (kernfs_type(kn) == KERNFS_DIR)
>  		set_nlink(inode, kn->dir.subdirs + 2);
> +	spin_unlock(&inode->i_lock);
>  }

Even more so - just what are you serializing here?  That code synchronizes inode
metadata with those in kernfs_node.  Suppose you've got two threads doing
->permission(); the first one gets through kernfs_refresh_inode() and goes into
generic_permission().  No locks are held, so kernfs_refresh_inode() from another
thread can run in parallel with generic_permission().

If that's not a problem, why two kernfs_refresh_inode() done in parallel would
be a problem?

Thread 1:
	permission
		done refresh, all locks released now
Thread 2:
	change metadata in kernfs_node
Thread 2:
	permission
		goes into refresh, copying metadata into inode
Thread 1:
		generic_permission()
No locks in common between the last two operations, so
we generic_permission() might see partially updated metadata.
Either we don't give a fuck (in which case I don't understand
what purpose does that ->i_lock serve) *or* we need the exclusion
to cover a wider area.
Ian Kent June 13, 2021, 1:31 a.m. UTC | #3
On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > The inode operations .permission() and .getattr() use the kernfs
> > node
> > write lock but all that's needed is to keep the rb tree stable
> > while
> > updating the inode attributes as well as protecting the update
> > itself
> > against concurrent changes.
> 
> Huh?  Where does it access the rbtree at all?  Confused...

That description's wrong, I'll fix that.
 
> 
> > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > index 3b01e9e61f14e..6728ecd81eb37 100644
> > --- a/fs/kernfs/inode.c
> > +++ b/fs/kernfs/inode.c
> > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  {
> >         struct kernfs_iattrs *attrs = kn->iattr;
> >  
> > +       spin_lock(&inode->i_lock);
> >         inode->i_mode = kn->mode;
> >         if (attrs)
> >                 /*
> > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 set_nlink(inode, kn->dir.subdirs + 2);
> > +       spin_unlock(&inode->i_lock);
> >  }
> 
> Even more so - just what are you serializing here?  That code
> synchronizes inode
> metadata with those in kernfs_node.  Suppose you've got two threads
> doing
> ->permission(); the first one gets through kernfs_refresh_inode() and
> goes into
> generic_permission().  No locks are held, so kernfs_refresh_inode()
> from another
> thread can run in parallel with generic_permission().
> 
> If that's not a problem, why two kernfs_refresh_inode() done in
> parallel would
> be a problem?
> 
> Thread 1:
>         permission
>                 done refresh, all locks released now
> Thread 2:
>         change metadata in kernfs_node
> Thread 2:
>         permission
>                 goes into refresh, copying metadata into inode
> Thread 1:
>                 generic_permission()
> No locks in common between the last two operations, so
> we generic_permission() might see partially updated metadata.
> Either we don't give a fuck (in which case I don't understand
> what purpose does that ->i_lock serve) *or* we need the exclusion
> to cover a wider area.
Ian Kent June 14, 2021, 1:32 a.m. UTC | #4
On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > The inode operations .permission() and .getattr() use the kernfs
> > node
> > write lock but all that's needed is to keep the rb tree stable
> > while
> > updating the inode attributes as well as protecting the update
> > itself
> > against concurrent changes.
> 
> Huh?  Where does it access the rbtree at all?  Confused...
> 
> > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > index 3b01e9e61f14e..6728ecd81eb37 100644
> > --- a/fs/kernfs/inode.c
> > +++ b/fs/kernfs/inode.c
> > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  {
> >         struct kernfs_iattrs *attrs = kn->iattr;
> >  
> > +       spin_lock(&inode->i_lock);
> >         inode->i_mode = kn->mode;
> >         if (attrs)
> >                 /*
> > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 set_nlink(inode, kn->dir.subdirs + 2);
> > +       spin_unlock(&inode->i_lock);
> >  }
> 
> Even more so - just what are you serializing here?  That code
> synchronizes inode
> metadata with those in kernfs_node.  Suppose you've got two threads
> doing
> ->permission(); the first one gets through kernfs_refresh_inode() and
> goes into
> generic_permission().  No locks are held, so kernfs_refresh_inode()
> from another
> thread can run in parallel with generic_permission().
> 
> If that's not a problem, why two kernfs_refresh_inode() done in
> parallel would
> be a problem?
> 
> Thread 1:
>         permission
>                 done refresh, all locks released now
> Thread 2:
>         change metadata in kernfs_node
> Thread 2:
>         permission
>                 goes into refresh, copying metadata into inode
> Thread 1:
>                 generic_permission()
> No locks in common between the last two operations, so
> we generic_permission() might see partially updated metadata.
> Either we don't give a fuck (in which case I don't understand
> what purpose does that ->i_lock serve) *or* we need the exclusion
> to cover a wider area.

This didn't occur to me, obviously.

It seems to me this can happen with the original code too although
using a mutex might reduce the likelihood of it happening.

Still ->permission() is meant to be a read-only function so the VFS
shouldn't need to care about it.

Do you have any suggestions on how to handle this.
Perhaps the only way is to ensure the inode is updated only in
functions that are expected to do this.

Ian
Ian Kent June 14, 2021, 6:52 a.m. UTC | #5
On Mon, 2021-06-14 at 09:32 +0800, Ian Kent wrote:
> On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > > The inode operations .permission() and .getattr() use the kernfs
> > > node
> > > write lock but all that's needed is to keep the rb tree stable
> > > while
> > > updating the inode attributes as well as protecting the update
> > > itself
> > > against concurrent changes.
> > 
> > Huh?  Where does it access the rbtree at all?  Confused...
> > 
> > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > index 3b01e9e61f14e..6728ecd81eb37 100644
> > > --- a/fs/kernfs/inode.c
> > > +++ b/fs/kernfs/inode.c
> > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > > kernfs_node *kn, struct inode *inode)
> > >  {
> > >         struct kernfs_iattrs *attrs = kn->iattr;
> > >  
> > > +       spin_lock(&inode->i_lock);
> > >         inode->i_mode = kn->mode;
> > >         if (attrs)
> > >                 /*
> > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > > kernfs_node *kn, struct inode *inode)
> > >  
> > >         if (kernfs_type(kn) == KERNFS_DIR)
> > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > +       spin_unlock(&inode->i_lock);
> > >  }
> > 
> > Even more so - just what are you serializing here?  That code
> > synchronizes inode
> > metadata with those in kernfs_node.  Suppose you've got two threads
> > doing
> > ->permission(); the first one gets through kernfs_refresh_inode()
> > and
> > goes into
> > generic_permission().  No locks are held, so kernfs_refresh_inode()
> > from another
> > thread can run in parallel with generic_permission().
> > 
> > If that's not a problem, why two kernfs_refresh_inode() done in
> > parallel would
> > be a problem?
> > 
> > Thread 1:
> >         permission
> >                 done refresh, all locks released now
> > Thread 2:
> >         change metadata in kernfs_node
> > Thread 2:
> >         permission
> >                 goes into refresh, copying metadata into inode
> > Thread 1:
> >                 generic_permission()
> > No locks in common between the last two operations, so
> > we generic_permission() might see partially updated metadata.
> > Either we don't give a fuck (in which case I don't understand
> > what purpose does that ->i_lock serve) *or* we need the exclusion
> > to cover a wider area.
> 
> This didn't occur to me, obviously.
> 
> It seems to me this can happen with the original code too although
> using a mutex might reduce the likelihood of it happening.
> 
> Still ->permission() is meant to be a read-only function so the VFS
> shouldn't need to care about it.
> 
> Do you have any suggestions on how to handle this.
> Perhaps the only way is to ensure the inode is updated only in
> functions that are expected to do this.

IIRC Greg and Tejun weren't averse to adding a field to the 
struct kernfs_iattrs, but there were concerns about increasing
memory usage.

Because of this I think the best way to handle this would be to
broaden the scope of the i_lock to cover the generic calls in
kernfs_iop_getattr() and kernfs_iop_permission(). The only other
call to kernfs_refresh_inode() is at inode initialization and
then only for I_NEW inodes so that should be ok. Also both
generic_permission() and generic_fillattr() are reading from the
inode so not likely to need to take the i_lock any time soon (is
this a reasonable assumption Al?).

Do you think this is a sensible way to go Al?

Ian
Ian Kent June 14, 2021, 7:16 a.m. UTC | #6
On Mon, 2021-06-14 at 14:52 +0800, Ian Kent wrote:
> On Mon, 2021-06-14 at 09:32 +0800, Ian Kent wrote:
> > On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> > > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > > > The inode operations .permission() and .getattr() use the
> > > > kernfs
> > > > node
> > > > write lock but all that's needed is to keep the rb tree stable
> > > > while
> > > > updating the inode attributes as well as protecting the update
> > > > itself
> > > > against concurrent changes.
> > > 
> > > Huh?  Where does it access the rbtree at all?  Confused...
> > > 
> > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > > index 3b01e9e61f14e..6728ecd81eb37 100644
> > > > --- a/fs/kernfs/inode.c
> > > > +++ b/fs/kernfs/inode.c
> > > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > > > kernfs_node *kn, struct inode *inode)
> > > >  {
> > > >         struct kernfs_iattrs *attrs = kn->iattr;
> > > >  
> > > > +       spin_lock(&inode->i_lock);
> > > >         inode->i_mode = kn->mode;
> > > >         if (attrs)
> > > >                 /*
> > > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > > > kernfs_node *kn, struct inode *inode)
> > > >  
> > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > > +       spin_unlock(&inode->i_lock);
> > > >  }
> > > 
> > > Even more so - just what are you serializing here?  That code
> > > synchronizes inode
> > > metadata with those in kernfs_node.  Suppose you've got two
> > > threads
> > > doing
> > > ->permission(); the first one gets through kernfs_refresh_inode()
> > > and
> > > goes into
> > > generic_permission().  No locks are held, so
> > > kernfs_refresh_inode()
> > > from another
> > > thread can run in parallel with generic_permission().
> > > 
> > > If that's not a problem, why two kernfs_refresh_inode() done in
> > > parallel would
> > > be a problem?
> > > 
> > > Thread 1:
> > >         permission
> > >                 done refresh, all locks released now
> > > Thread 2:
> > >         change metadata in kernfs_node
> > > Thread 2:
> > >         permission
> > >                 goes into refresh, copying metadata into inode
> > > Thread 1:
> > >                 generic_permission()
> > > No locks in common between the last two operations, so
> > > we generic_permission() might see partially updated metadata.
> > > Either we don't give a fuck (in which case I don't understand
> > > what purpose does that ->i_lock serve) *or* we need the exclusion
> > > to cover a wider area.
> > 
> > This didn't occur to me, obviously.
> > 
> > It seems to me this can happen with the original code too although
> > using a mutex might reduce the likelihood of it happening.
> > 
> > Still ->permission() is meant to be a read-only function so the VFS
> > shouldn't need to care about it.
> > 
> > Do you have any suggestions on how to handle this.
> > Perhaps the only way is to ensure the inode is updated only in
> > functions that are expected to do this.
> 
> IIRC Greg and Tejun weren't averse to adding a field to the 
> struct kernfs_iattrs, but there were concerns about increasing
> memory usage.
> 
> Because of this I think the best way to handle this would be to
> broaden the scope of the i_lock to cover the generic calls in
> kernfs_iop_getattr() and kernfs_iop_permission(). The only other
> call to kernfs_refresh_inode() is at inode initialization and
> then only for I_NEW inodes so that should be ok. Also both
> generic_permission() and generic_fillattr() are reading from the
> inode so not likely to need to take the i_lock any time soon (is
> this a reasonable assumption Al?).
> 
> Do you think this is a sensible way to go Al?

Unless of course we don't care about taking a lock here at all,
Greg, Tejun?


Ian
diff mbox series

Patch

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3b01e9e61f14e..6728ecd81eb37 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -172,6 +172,7 @@  static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 {
 	struct kernfs_iattrs *attrs = kn->iattr;
 
+	spin_lock(&inode->i_lock);
 	inode->i_mode = kn->mode;
 	if (attrs)
 		/*
@@ -182,6 +183,7 @@  static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 
 	if (kernfs_type(kn) == KERNFS_DIR)
 		set_nlink(inode, kn->dir.subdirs + 2);
+	spin_unlock(&inode->i_lock);
 }
 
 int kernfs_iop_getattr(struct user_namespace *mnt_userns,
@@ -191,9 +193,9 @@  int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
 	generic_fillattr(&init_user_ns, inode, stat);
 	return 0;
@@ -284,9 +286,9 @@  int kernfs_iop_permission(struct user_namespace *mnt_userns,
 
 	kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
 	return generic_permission(&init_user_ns, inode, mask);
 }
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index baa4155ba2edf..f2f909d09f522 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@  static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	inode = kernfs_get_inode(sb, info->root->kn);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 	if (!inode) {
 		pr_debug("kernfs: could not get root inode\n");
 		return -ENOMEM;