
[RFC,3/4] lockref: rework CMPXCHG_LOOP to handle contention better

Message ID 20240802-openfast-v1-3-a1cff2a33063@kernel.org (mailing list archive)
State New
Series fs: try an opportunistic lookup for O_CREAT opens too

Commit Message

Jeff Layton Aug. 2, 2024, 9:45 p.m. UTC
In a later patch, we want to change the open(..., O_CREAT) codepath to
avoid taking the inode->i_rwsem for write when the dentry already exists.
When we tested that initially, the performance devolved significantly
due to contention for the parent's d_lockref spinlock.

There are two problems with lockrefs today: First, once any concurrent
task takes the spinlock, they all end up taking the spinlock, which is
much more costly than a single cmpxchg operation. The second problem is
that once any task fails to cmpxchg 100 times, it falls back to the
spinlock. The upshot there is that even moderate contention can cause a
fallback to serialized spinlocking, which worsens performance.

This patch changes CMPXCHG_LOOP in 2 ways:

First, change the loop to spin instead of falling back to a locked
codepath when the spinlock is held. Once the lock is released, allow the
task to continue trying its cmpxchg loop as before instead of taking the
lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
Just continue infinitely.

This greatly reduces contention on the lockref when there are large
numbers of concurrent increments and decrements occurring.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
 1 file changed, 32 insertions(+), 53 deletions(-)

Comments

Mateusz Guzik Aug. 3, 2024, 4:44 a.m. UTC | #1
On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> In a later patch, we want to change the open(..., O_CREAT) codepath to
> avoid taking the inode->i_rwsem for write when the dentry already exists.
> When we tested that initially, the performance devolved significantly
> due to contention for the parent's d_lockref spinlock.
> 
> There are two problems with lockrefs today: First, once any concurrent
> task takes the spinlock, they all end up taking the spinlock, which is
> much more costly than a single cmpxchg operation. The second problem is
> that once any task fails to cmpxchg 100 times, it falls back to the
> spinlock. The upshot there is that even moderate contention can cause a
> fallback to serialized spinlocking, which worsens performance.
> 
> This patch changes CMPXCHG_LOOP in 2 ways:
> 
> First, change the loop to spin instead of falling back to a locked
> codepath when the spinlock is held. Once the lock is released, allow the
> task to continue trying its cmpxchg loop as before instead of taking the
> lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> Just continue infinitely.
> 
> This greatly reduces contention on the lockref when there are large
> numbers of concurrent increments and decrements occurring.
> 

I already tried this approach, and unfortunately it can reduce performance.

The key problem is that in some corner cases the lock can be
continuously held and queued on, making the fast path always fail and
making all the spinning actively waste time (and notably pull on the
cacheline).

See this for more details:
https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/

However, I *suspect* that in the case you are optimizing here (open +
O_CREAT of an existing file), the lockref on the parent can be avoided
altogether with some hackery, and that's what should be done here.

When it comes to lockref in the vfs in general, most uses can be elided
with some hackery (see the above thread); that work is an early WIP
(the LSMs are a massive headache).

For open calls which *do* need to take a real ref, the hackery does not
help, of course.

This is where I think decoupling the ref from the lock is the best way
forward. For that to work, the dentry must hang around after the last
unref (already the case thanks to RCU, and dput() even explicitly
handles that!), and there needs to be a way to block new refs
atomically -- that can be done with a cmpxchg from the 0-ref state to a
flag blocking new refs from coming in. I have that as a WIP as well.
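
Roughly, and purely as an illustration (the names and details here are
made up, not the actual WIP), the shape could be something like:

-------------------8<----------------------
#include <linux/atomic.h>

/*
 * Illustrative sketch only: the ref lives in a plain atomic and a
 * reserved value blocks new refs once the count has dropped to zero.
 */
#define D_REF_BLOCKED	(-1)	/* hypothetical "no new refs" marker */

static bool dentry_ref_get(atomic_t *ref)
{
	int old = atomic_read(ref);

	do {
		if (old == D_REF_BLOCKED)	/* teardown won the race */
			return false;
	} while (!atomic_try_cmpxchg(ref, &old, old + 1));
	return true;
}

/* returns true if the caller is now responsible for retiring the dentry */
static bool dentry_ref_put(atomic_t *ref)
{
	if (!atomic_dec_and_test(ref))
		return false;
	/* 0 -> BLOCKED only succeeds if nobody grabbed a new ref meanwhile */
	return atomic_cmpxchg(ref, 0, D_REF_BLOCKED) == 0;
}
-------------------8<----------------------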


Mateusz Guzik Aug. 3, 2024, 9:09 a.m. UTC | #2
On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > When we tested that initially, the performance devolved significantly
> > due to contention for the parent's d_lockref spinlock.
> >
> > There are two problems with lockrefs today: First, once any concurrent
> > task takes the spinlock, they all end up taking the spinlock, which is
> > much more costly than a single cmpxchg operation. The second problem is
> > that once any task fails to cmpxchg 100 times, it falls back to the
> > spinlock. The upshot there is that even moderate contention can cause a
> > fallback to serialized spinlocking, which worsens performance.
> >
> > This patch changes CMPXCHG_LOOP in 2 ways:
> >
> > First, change the loop to spin instead of falling back to a locked
> > codepath when the spinlock is held. Once the lock is released, allow the
> > task to continue trying its cmpxchg loop as before instead of taking the
> > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > Just continue infinitely.
> >
> > This greatly reduces contention on the lockref when there are large
> > numbers of concurrent increments and decrements occurring.
> >
>
> This was already tried by me and it unfortunately can reduce performance.
>

Oh wait, I misread the patch based on what I tried there. Spinning
indefinitely waiting for the lock to be free is a no-go, as it loses
the forward progress guarantee (and it is possible for the lock to be
continuously held). Spinning only up to an arbitrary point wins in some
tests and loses in others.

Either way, as described below, chances are decent that:
1. there is an easy way to avoid lockref_get/put on the parent if the
file is already there, dodging the problem
... and even if that's not true,
2. lockref can be ditched in favor of atomics. Apart from some minor
refactoring this all looks perfectly doable and I have a WIP. I will
try to find the time next week to sort it out.

> Key problem is that in some corner cases the lock can be continuously
> held and be queued on, making the fast path always fail and making all
> the spins actively waste time (and notably pull on the cacheline).
>
> See this for more details:
> https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
>
> However, I *suspect* in the case you are optimizing here (open + O_CREAT
> of an existing file) lockref on the parent can be avoided altogether
> with some hackery and that's what should be done here.
>
> When it comes to lockref in vfs in general, most uses can be elided with
> some hackery (see the above thread) which is in early WIP (the LSMs are
> a massive headache).
>
> For open calls which *do* need to take a real ref the hackery does not
> help of course.
>
> This is where I think decoupling ref from the lock is the best way
> forward. For that to work the dentry must hang around after the last
> unref (already done thanks to RCU and dput even explicitly handles that
> already!) and there needs to be a way to block new refs atomically --
> can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> coming in. I have that as a WIP as well.
>
>
Jeff Layton Aug. 3, 2024, 10:55 a.m. UTC | #3
On Sat, 2024-08-03 at 06:44 +0200, Mateusz Guzik wrote:
> On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > When we tested that initially, the performance devolved significantly
> > due to contention for the parent's d_lockref spinlock.
> > 
> > There are two problems with lockrefs today: First, once any concurrent
> > task takes the spinlock, they all end up taking the spinlock, which is
> > much more costly than a single cmpxchg operation. The second problem is
> > that once any task fails to cmpxchg 100 times, it falls back to the
> > spinlock. The upshot there is that even moderate contention can cause a
> > fallback to serialized spinlocking, which worsens performance.
> > 
> > This patch changes CMPXCHG_LOOP in 2 ways:
> > 
> > First, change the loop to spin instead of falling back to a locked
> > codepath when the spinlock is held. Once the lock is released, allow the
> > task to continue trying its cmpxchg loop as before instead of taking the
> > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > Just continue infinitely.
> > 
> > This greatly reduces contention on the lockref when there are large
> > numbers of concurrent increments and decrements occurring.
> > 
> 
> This was already tried by me and it unfortunately can reduce performance.
> 
> Key problem is that in some corner cases the lock can be continuously
> held and be queued on, making the fast path always fail and making all
> the spins actively waste time (and notably pull on the cacheline).
> 

The cacheline contention does seem like a real problem with this
approach.

> See this for more details:
> https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> 
> However, I *suspect* in the case you are optimizing here (open + O_CREAT
> of an existing file) lockref on the parent can be avoided altogether
> with some hackery and that's what should be done here.
> 

Unfortunately, I don't think we can avoid it in this codepath:

-------------------8<----------------------
	if (!(open_flag & O_CREAT)) {
		...
	} else {
		/* create side of things */
		if (nd->flags & LOOKUP_RCU) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
		}
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
		/* trailing slashes? */
		if (unlikely(nd->last.name[nd->last.len]))
			return ERR_PTR(-EISDIR);
	}
-------------------8<----------------------

The problem here is the audit_inode call, which can do a GFP_KERNEL
allocation. We can't stay in RCU mode for that, and we need a reference
to "dir" (at least with the current way audit works).

> When it comes to lockref in vfs in general, most uses can be elided with
> some hackery (see the above thread) which is in early WIP (the LSMs are
> a massive headache).
> 
> For open calls which *do* need to take a real ref the hackery does not
> help of course.
> 
> This is where I think decoupling ref from the lock is the best way
> forward. For that to work the dentry must hang around after the last
> unref (already done thanks to RCU and dput even explicitly handles that
> already!) and there needs to be a way to block new refs atomically --
> can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> coming in. I have that as a WIP as well.
> 

These both sound very interesting. FWIW, Josef also started looking at
decoupling the refcount and lock, but I don't think he's gotten very
far yet.

I'm happy to help test some of this too if you get to that point. The
4th patch in this RFC series really amps up the contention for the
lockref once the i_rwsem isn't being touched.

Jeff Layton Aug. 3, 2024, 10:59 a.m. UTC | #4
On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > 
> > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > When we tested that initially, the performance devolved significantly
> > > due to contention for the parent's d_lockref spinlock.
> > > 
> > > There are two problems with lockrefs today: First, once any concurrent
> > > task takes the spinlock, they all end up taking the spinlock, which is
> > > much more costly than a single cmpxchg operation. The second problem is
> > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > spinlock. The upshot there is that even moderate contention can cause a
> > > fallback to serialized spinlocking, which worsens performance.
> > > 
> > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > 
> > > First, change the loop to spin instead of falling back to a locked
> > > codepath when the spinlock is held. Once the lock is released, allow the
> > > task to continue trying its cmpxchg loop as before instead of taking the
> > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > Just continue infinitely.
> > > 
> > > This greatly reduces contention on the lockref when there are large
> > > numbers of concurrent increments and decrements occurring.
> > > 
> > 
> > This was already tried by me and it unfortunately can reduce performance.
> > 
> 
> Oh wait I misread the patch based on what I tried there. Spinning
> indefinitely waiting for the lock to be free is a no-go as it loses
> the forward progress guarantee (and it is possible to get the lock
> being continuously held). Only spinning up to an arbitrary point wins
> some in some tests and loses in others.
> 

I'm a little confused about the forward progress guarantee here. Does
that exist today at all? ISTM that falling back to spin_lock() after a
certain number of retries doesn't guarantee any forward progress. You
can still just end up spinning on the lock forever once that happens,
no?

> Either way, as described below, chances are decent that:
> 1. there is an easy way to not lockref_get/put on the parent if the
> file is already there, dodging the problem
> .. and even if that's not true
> 2. lockref can be ditched in favor of atomics. apart from some minor
> refactoring this all looks perfectly doable and I have a wip. I will
> try to find the time next week to sort it out
> 

Like I said in the earlier mail, I don't think we can stay in RCU mode
because of the audit_inode call. I'm definitely interested in your WIP
though!

> > Key problem is that in some corner cases the lock can be continuously
> > held and be queued on, making the fast path always fail and making all
> > the spins actively waste time (and notably pull on the cacheline).
> > 
> > See this for more details:
> > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> > 
> > However, I *suspect* in the case you are optimizing here (open + O_CREAT
> > of an existing file) lockref on the parent can be avoided altogether
> > with some hackery and that's what should be done here.
> > 
> > When it comes to lockref in vfs in general, most uses can be elided with
> > some hackery (see the above thread) which is in early WIP (the LSMs are
> > a massive headache).
> > 
> > For open calls which *do* need to take a real ref the hackery does not
> > help of course.
> > 
> > This is where I think decoupling ref from the lock is the best way
> > forward. For that to work the dentry must hang around after the last
> > unref (already done thanks to RCU and dput even explicitly handles that
> > already!) and there needs to be a way to block new refs atomically --
> > can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> > coming in. I have that as a WIP as well.
> > 
> > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> > >  1 file changed, 32 insertions(+), 53 deletions(-)
> > > 
> > > diff --git a/lib/lockref.c b/lib/lockref.c
> > > index 2afe4c5d8919..b76941043fe9 100644
> > > --- a/lib/lockref.c
> > > +++ b/lib/lockref.c
> > > @@ -8,22 +8,25 @@
> > >   * Note that the "cmpxchg()" reloads the "old" value for the
> > >   * failure case.
> > >   */
> > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                     \
> > > -     int retry = 100;                                                        \
> > > -     struct lockref old;                                                     \
> > > -     BUILD_BUG_ON(sizeof(old) != 8);                                         \
> > > -     old.lock_count = READ_ONCE(lockref->lock_count);                        \
> > > -     while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
> > > -             struct lockref new = old;                                       \
> > > -             CODE                                                            \
> > > -             if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > -                                              &old.lock_count,               \
> > > -                                              new.lock_count))) {            \
> > > -                     SUCCESS;                                                \
> > > -             }                                                               \
> > > -             if (!--retry)                                                   \
> > > -                     break;                                                  \
> > > -     }                                                                       \
> > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                             \
> > > +     struct lockref old;                                                             \
> > > +     BUILD_BUG_ON(sizeof(old) != 8);                                                 \
> > > +     old.lock_count = READ_ONCE(lockref->lock_count);                                \
> > > +     for (;;) {                                                                      \
> > > +             struct lockref new = old;                                               \
> > > +                                                                                     \
> > > +             if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {        \
> > > +                     CODE                                                            \
> > > +                     if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > +                                                      &old.lock_count,               \
> > > +                                                      new.lock_count))) {            \
> > > +                             SUCCESS;                                                \
> > > +                     }                                                               \
> > > +             } else {                                                                \
> > > +                     cpu_relax();                                                    \
> > > +                     old.lock_count = READ_ONCE(lockref->lock_count);                \
> > > +             }                                                                       \
> > > +     }                                                                               \
> > >  } while (0)
> > > 
> > >  #else
> > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> > >       ,
> > >               return;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     lockref->count++;
> > > -     spin_unlock(&lockref->lock);
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >  }
> > >  EXPORT_SYMBOL(lockref_get);
> > > 
> > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> > >   */
> > >  int lockref_get_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count <= 0)
> > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_zero);
> > > 
> > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> > >   */
> > >  int lockref_put_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count--;
> > >               if (old.count <= 1)
> > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 1) {
> > > -             lockref->count--;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_not_zero);
> > > 
> > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> > >       ,
> > >               return new.count;
> > >       );
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >       return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_return);
> > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> > >   */
> > >  int lockref_get_not_dead(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count < 0)
> > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count >= 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_dead);
> > > 
> > > --
> > > 2.45.2
> > > 
> 
> 
>
Mateusz Guzik Aug. 3, 2024, 11:21 a.m. UTC | #5
On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > > When we tested that initially, the performance devolved significantly
> > > > due to contention for the parent's d_lockref spinlock.
> > > >
> > > > There are two problems with lockrefs today: First, once any concurrent
> > > > task takes the spinlock, they all end up taking the spinlock, which is
> > > > much more costly than a single cmpxchg operation. The second problem is
> > > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > > spinlock. The upshot there is that even moderate contention can cause a
> > > > fallback to serialized spinlocking, which worsens performance.
> > > >
> > > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > >
> > > > First, change the loop to spin instead of falling back to a locked
> > > > codepath when the spinlock is held. Once the lock is released, allow the
> > > > task to continue trying its cmpxchg loop as before instead of taking the
> > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > > Just continue infinitely.
> > > >
> > > > This greatly reduces contention on the lockref when there are large
> > > > numbers of concurrent increments and decrements occurring.
> > > >
> > >
> > > This was already tried by me and it unfortunately can reduce performance.
> > >
> >
> > Oh wait I misread the patch based on what I tried there. Spinning
> > indefinitely waiting for the lock to be free is a no-go as it loses
> > the forward progress guarantee (and it is possible to get the lock
> > being continuously held). Only spinning up to an arbitrary point wins
> > some in some tests and loses in others.
> >
>
> I'm a little confused about the forward progress guarantee here. Does
> that exist today at all? ISTM that falling back to spin_lock() after a
> certain number of retries doesn't guarantee any forward progress. You
> can still just end up spinning on the lock forever once that happens,
> no?
>

There is the implicit assumption that everyone holds locks for a
finite time. I agree there are no guarantees otherwise if that's what
you meant.

In this case, since spinlocks are queued, a constant stream of lock
holders will make the lock appear taken indefinitely even if they all
hold it for a short period.

Stock lockref will give up on the atomics immediately and is guaranteed
to change the ref, thanks to queueing up on the lock.

Lockref as proposed in this patch won't be able to do anything as long
as the lock trading is taking place.
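
(For reference, the "spin up to an arbitrary point" variant mentioned
above would look roughly like the sketch below; the retry budget is an
arbitrary placeholder, and the spin_lock() slowpaths in the callers
have to stay.)

-------------------8<----------------------
#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
	int retry = 100;	/* arbitrary spin/retry budget */		\
	struct lockref old;							\
	BUILD_BUG_ON(sizeof(old) != 8);						\
	old.lock_count = READ_ONCE(lockref->lock_count);			\
	while (retry--) {							\
		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
			struct lockref new = old;				\
			CODE							\
			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,	\
							 &old.lock_count,	\
							 new.lock_count))) {	\
				SUCCESS;					\
			}							\
		} else {							\
			/* lock held: wait a bit instead of giving up */	\
			cpu_relax();						\
			old.lock_count = READ_ONCE(lockref->lock_count);	\
		}								\
	}									\
	/* budget exhausted: fall through to the caller's spin_lock() path */	\
} while (0)
-------------------8<----------------------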

> > Either way, as described below, chances are decent that:
> > 1. there is an easy way to not lockref_get/put on the parent if the
> > file is already there, dodging the problem
> > .. and even if that's not true
> > 2. lockref can be ditched in favor of atomics. apart from some minor
> > refactoring this all looks perfectly doable and I have a wip. I will
> > try to find the time next week to sort it out
> >
>
> Like I said in the earlier mail, I don't think we can stay in RCU mode
> because of the audit_inode call. I'm definitely interested in your WIP
> though!
>

Well, audit may be hackable so that it works in RCU most of the time,
but that's not something I'm interested in.

Sorting out the lockref situation would definitely help other things
(notably opening the same file RO).

Anyhow, one idea is to temporarily disable atomic ops with a flag in
the counter; a fallback plan is to loosen lockref so that it can do
transitions other than 0->1->2 with atomics, even if the lock is held.

I have not looked at this in over a month, so I'm going to need to
refresh my memory on the details; I do remember there was some stuff
to massage first.

Anyhow, I expect a working WIP some time in the upcoming week.

--
Mateusz Guzik <mjguzik gmail.com>
Jeff Layton Aug. 3, 2024, 11:32 a.m. UTC | #6
On Sat, 2024-08-03 at 13:21 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > > 
> > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > > > When we tested that initially, the performance devolved significantly
> > > > > due to contention for the parent's d_lockref spinlock.
> > > > > 
> > > > > There are two problems with lockrefs today: First, once any concurrent
> > > > > task takes the spinlock, they all end up taking the spinlock, which is
> > > > > much more costly than a single cmpxchg operation. The second problem is
> > > > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > > > spinlock. The upshot there is that even moderate contention can cause a
> > > > > fallback to serialized spinlocking, which worsens performance.
> > > > > 
> > > > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > > > 
> > > > > First, change the loop to spin instead of falling back to a locked
> > > > > codepath when the spinlock is held. Once the lock is released, allow the
> > > > > task to continue trying its cmpxchg loop as before instead of taking the
> > > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > > > Just continue infinitely.
> > > > > 
> > > > > This greatly reduces contention on the lockref when there are large
> > > > > numbers of concurrent increments and decrements occurring.
> > > > > 
> > > > 
> > > > This was already tried by me and it unfortunately can reduce performance.
> > > > 
> > > 
> > > Oh wait I misread the patch based on what I tried there. Spinning
> > > indefinitely waiting for the lock to be free is a no-go as it loses
> > > the forward progress guarantee (and it is possible to get the lock
> > > being continuously held). Only spinning up to an arbitrary point wins
> > > some in some tests and loses in others.
> > > 
> > 
> > I'm a little confused about the forward progress guarantee here. Does
> > that exist today at all? ISTM that falling back to spin_lock() after a
> > certain number of retries doesn't guarantee any forward progress. You
> > can still just end up spinning on the lock forever once that happens,
> > no?
> > 
> 
> There is the implicit assumption that everyone holds locks for a
> finite time. I agree there are no guarantees otherwise if that's what
> you meant.
> 
> In this case, since spinlocks are queued, a constant stream of lock
> holders will make the lock appear taken indefinitely even if they all
> hold it for a short period.
> 
> Stock lockref will give up atomics immediately and make sure to change
> the ref thanks to queueing up.
> 
> Lockref as proposed in this patch wont be able to do anything as long
> as the lock trading is taking place.
> 

Got it, thanks. This spinning is very simplistic, so I could see that
you could have one task continually getting shuffled to the end of the
queue.

> > > Either way, as described below, chances are decent that:
> > > 1. there is an easy way to not lockref_get/put on the parent if the
> > > file is already there, dodging the problem
> > > .. and even if that's not true
> > > 2. lockref can be ditched in favor of atomics. apart from some minor
> > > refactoring this all looks perfectly doable and I have a wip. I will
> > > try to find the time next week to sort it out
> > > 
> > 
> > Like I said in the earlier mail, I don't think we can stay in RCU mode
> > because of the audit_inode call. I'm definitely interested in your WIP
> > though!
> > 
> 
> well audit may be hackable so that it works in rcu most of the time,
> but that's not something i'm interested in

Audit is not my favorite area of the kernel to work in either. I don't
see a good way to make it RCU-friendly, but I haven't looked too hard
yet either. It would be nice to be able to do some of the auditing
under RCU or a spinlock.

>
> sorting out the lockref situation would definitely help other stuff
> (notably opening the same file RO).
> 

Indeed. It's clear that the current implementation is a real
scalability problem in a lot of situations.

> anyhow one idea is to temporarily disable atomic ops with a flag in
> the counter, a fallback plan is to loosen lockref so that it can do
> transitions other than 0->1->2 with atomics, even if the lock is held.
> 
> I have not looked at this in over a month, I'm going to need to
> refresh my memory on the details, I do remember there was some stuff
> to massage first.
> 
> Anyhow, I expect a working WIP some time in the upcoming week.
> 

Great, I'll stay tuned.

Thanks!
Christian Brauner Aug. 5, 2024, 11:44 a.m. UTC | #7
> Audit not my favorite area of the kernel to work in either. I don't see
> a good way to make it rcu-friendly, but I haven't looked too hard yet
> either. It would be nice to be able to do some of the auditing under
> rcu or spinlock.

For audit, your main option is to dodge the problem: check whether
audit is active and only drop out of RCU if it is. That sidesteps the
issue. I'm fairly certain that a lot of systems don't really have
audit active.

From a brief look at audit, it would be quite involved to make it work
just under RCU. Not just because it does various allocations, but also
because it reads fscaps from disk and so on. That's not going to work
unless we add a vfs-based fscaps cache similar to what we do for ACLs,
and I find that very unlikely.
Jeff Layton Aug. 5, 2024, 12:52 p.m. UTC | #8
On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote:
> > Audit not my favorite area of the kernel to work in either. I don't see
> > a good way to make it rcu-friendly, but I haven't looked too hard yet
> > either. It would be nice to be able to do some of the auditing under
> > rcu or spinlock.
> 
> For audit your main option is to dodge the problem and check whether
> audit is active and only drop out of rcu if it is. That sidesteps the
> problem. I'm somewhat certain that a lot of systems don't really have
> audit active.
> 

I did have an earlier version of 4/4 that checked audit_context() and
stayed in RCU mode if it came back NULL. I can resurrect that if you
think it's worthwhile.

> From a brief look at audit it would be quite involved to make it work
> just under rcu. Not just because it does various allocation but it also
> reads fscaps from disk and so on. That's not going to work unless we add
> a vfs based fscaps cache similar to what we do for acls. I find that
> very unlikely. 

Yeah. It wants to record a lot of (variable-length) information at very
inconvenient times. I think we're sort of stuck with it though until
someone has a vision on how to do this in a non-blocking way.

Handwavy thought: there is some similarity between what audit_inode
does and tracepoints, and tracepoints can be called in all sorts of
contexts. I wonder if we could leverage the same infrastructure
somehow? The catch here is that we can't just drop audit records if
things go wrong.
Christian Brauner Aug. 6, 2024, 11:36 a.m. UTC | #9
On Mon, Aug 05, 2024 at 08:52:28AM GMT, Jeff Layton wrote:
> On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote:
> > > Audit not my favorite area of the kernel to work in either. I don't see
> > > a good way to make it rcu-friendly, but I haven't looked too hard yet
> > > either. It would be nice to be able to do some of the auditing under
> > > rcu or spinlock.
> > 
> > For audit your main option is to dodge the problem and check whether
> > audit is active and only drop out of rcu if it is. That sidesteps the
> > problem. I'm somewhat certain that a lot of systems don't really have
> > audit active.
> > 
> 
> I did have an earlier version of 4/4 that checked audit_context() and
> stayed in RCU mode if it comes back NULL. I can resurrect that if you
> think it's worthwhile.

Let's at least see what it looks like. Maybe just use a helper local to
fs/namei.c that returns ECHILD if audit is available and 0 otherwise?
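
Something minimal along these lines (the name and the exact check are
illustrative):

-------------------8<----------------------
/* fs/namei.c */
static int audit_may_stay_rcu(void)
{
	/*
	 * Only force the caller out of RCU mode when audit is actually
	 * going to record this syscall; otherwise audit_inode() is a no-op
	 * anyway.
	 */
	if (unlikely(audit_context()))
		return -ECHILD;
	return 0;
}
-------------------8<----------------------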

> > From a brief look at audit it would be quite involved to make it work
> > just under rcu. Not just because it does various allocation but it also
> > reads fscaps from disk and so on. That's not going to work unless we add
> > a vfs based fscaps cache similar to what we do for acls. I find that
> > very unlikely. 
> 
> Yeah. It wants to record a lot of (variable-length) information at very
> inconvenient times. I think we're sort of stuck with it though until
> someone has a vision on how to do this in a non-blocking way.
> 
> Handwavy thought: there is some similarity to tracepoints in what
> audit_inode does, and tracepoints are able to be called in all sorts of
> contexts. I wonder if we could leverage the same infrastructure
> somehow? The catch here is that we can't just drop audit records if
> things go wrong.

I can't say much about the tracepoint idea as I lack the necessary
details around their implementation.

I think the better way forward is a model with a fastpath and a
slowpath. Under RCU, audit_inode() returns -ECHILD if it sees that it
needs to do anything it couldn't do in a non-blocking way, and then
path lookup can drop out of RCU and call audit_inode() again.

I think this wouldn't be extremely terrible. It would amount to adding
an AUDIT_MAY_NOT_BLOCK flag to audit_inode(), and then on -ECHILD
audit_inode() gets called again without that flag.
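
On the caller side that would look roughly like the sketch below
(AUDIT_MAY_NOT_BLOCK and the non-void return of audit_inode() are
hypothetical; audit_inode() returns void today):

-------------------8<----------------------
	/* in the O_CREAT branch, while still under LOOKUP_RCU */
	if (nd->flags & LOOKUP_RCU) {
		/* non-blocking attempt first */
		if (audit_inode(nd->name, dir,
				AUDIT_INODE_PARENT | AUDIT_MAY_NOT_BLOCK) == -ECHILD) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
			/* blocking retry without the flag */
			audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
		}
	} else {
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
	}
-------------------8<----------------------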

Over time, if people are interested, they could then make more and more
stuff available under RCU for audit.

Patch

diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..b76941043fe9 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -8,22 +8,25 @@ 
  * Note that the "cmpxchg()" reloads the "old" value for the
  * failure case.
  */
-#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
-	int retry = 100;							\
-	struct lockref old;							\
-	BUILD_BUG_ON(sizeof(old) != 8);						\
-	old.lock_count = READ_ONCE(lockref->lock_count);			\
-	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {  	\
-		struct lockref new = old;					\
-		CODE								\
-		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
-						 &old.lock_count,		\
-						 new.lock_count))) {		\
-			SUCCESS;						\
-		}								\
-		if (!--retry)							\
-			break;							\
-	}									\
+#define CMPXCHG_LOOP(CODE, SUCCESS) do {						\
+	struct lockref old;								\
+	BUILD_BUG_ON(sizeof(old) != 8);							\
+	old.lock_count = READ_ONCE(lockref->lock_count);				\
+	for (;;) {									\
+		struct lockref new = old;						\
+											\
+		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
+			CODE								\
+			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
+							 &old.lock_count,		\
+							 new.lock_count))) {		\
+				SUCCESS;						\
+			}								\
+		} else {								\
+			cpu_relax();							\
+			old.lock_count = READ_ONCE(lockref->lock_count);		\
+		}									\
+	}										\
 } while (0)
 
 #else
@@ -46,10 +49,8 @@  void lockref_get(struct lockref *lockref)
 	,
 		return;
 	);
-
-	spin_lock(&lockref->lock);
-	lockref->count++;
-	spin_unlock(&lockref->lock);
+	/* should never get here */
+	WARN_ON_ONCE(1);
 }
 EXPORT_SYMBOL(lockref_get);
 
@@ -60,8 +61,6 @@  EXPORT_SYMBOL(lockref_get);
  */
 int lockref_get_not_zero(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count++;
 		if (old.count <= 0)
@@ -69,15 +68,9 @@  int lockref_get_not_zero(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count > 0) {
-		lockref->count++;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_get_not_zero);
 
@@ -88,8 +81,6 @@  EXPORT_SYMBOL(lockref_get_not_zero);
  */
 int lockref_put_not_zero(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count--;
 		if (old.count <= 1)
@@ -97,15 +88,9 @@  int lockref_put_not_zero(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count > 1) {
-		lockref->count--;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_put_not_zero);
 
@@ -125,6 +110,8 @@  int lockref_put_return(struct lockref *lockref)
 	,
 		return new.count;
 	);
+	/* should never get here */
+	WARN_ON_ONCE(1);
 	return -1;
 }
 EXPORT_SYMBOL(lockref_put_return);
@@ -171,8 +158,6 @@  EXPORT_SYMBOL(lockref_mark_dead);
  */
 int lockref_get_not_dead(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count++;
 		if (old.count < 0)
@@ -180,14 +165,8 @@  int lockref_get_not_dead(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count >= 0) {
-		lockref->count++;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_get_not_dead);