Message ID: 20240802-openfast-v1-3-a1cff2a33063@kernel.org (mailing list archive)
State: New
Series: fs: try an opportunistic lookup for O_CREAT opens too
On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> In a later patch, we want to change the open(..., O_CREAT) codepath to
> avoid taking the inode->i_rwsem for write when the dentry already exists.
> When we tested that initially, the performance devolved significantly
> due to contention for the parent's d_lockref spinlock.
>
> There are two problems with lockrefs today: First, once any concurrent
> task takes the spinlock, they all end up taking the spinlock, which is
> much more costly than a single cmpxchg operation. The second problem is
> that once any task fails to cmpxchg 100 times, it falls back to the
> spinlock. The upshot there is that even moderate contention can cause a
> fallback to serialized spinlocking, which worsens performance.
>
> This patch changes CMPXCHG_LOOP in 2 ways:
>
> First, change the loop to spin instead of falling back to a locked
> codepath when the spinlock is held. Once the lock is released, allow the
> task to continue trying its cmpxchg loop as before instead of taking the
> lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> Just continue infinitely.
>
> This greatly reduces contention on the lockref when there are large
> numbers of concurrent increments and decrements occurring.
>

This was already tried by me and it unfortunately can reduce performance.

Key problem is that in some corner cases the lock can be continuously
held and be queued on, making the fast path always fail and making all
the spins actively waste time (and notably pull on the cacheline).

See this for more details:
https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/

However, I *suspect* in the case you are optimizing here (open + O_CREAT
of an existing file) lockref on the parent can be avoided altogether
with some hackery and that's what should be done here.
When it comes to lockref in vfs in general, most uses can be elided with
some hackery (see the above thread) which is in early WIP (the LSMs are
a massive headache).

For open calls which *do* need to take a real ref the hackery does not
help of course.

This is where I think decoupling ref from the lock is the best way
forward. For that to work the dentry must hang around after the last
unref (already done thanks to RCU and dput even explicitly handles that
already!) and there needs to be a way to block new refs atomically --
can be done with cmpxchg from a 0-ref state to a flag blocking new refs
coming in. I have that as a WIP as well.

> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
>  1 file changed, 32 insertions(+), 53 deletions(-)
>
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 2afe4c5d8919..b76941043fe9 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -8,22 +8,25 @@
>   * Note that the "cmpxchg()" reloads the "old" value for the
>   * failure case.
>   */
> -#define CMPXCHG_LOOP(CODE, SUCCESS) do { \
> -	int retry = 100; \
> -	struct lockref old; \
> -	BUILD_BUG_ON(sizeof(old) != 8); \
> -	old.lock_count = READ_ONCE(lockref->lock_count); \
> -	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
> -		struct lockref new = old; \
> -		CODE \
> -		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \
> -						 &old.lock_count, \
> -						 new.lock_count))) { \
> -			SUCCESS; \
> -		} \
> -		if (!--retry) \
> -			break; \
> -	} \
> +#define CMPXCHG_LOOP(CODE, SUCCESS) do { \
> +	struct lockref old; \
> +	BUILD_BUG_ON(sizeof(old) != 8); \
> +	old.lock_count = READ_ONCE(lockref->lock_count); \
> +	for (;;) { \
> +		struct lockref new = old; \
> +		\
> +		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
> +			CODE \
> +			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \
> +							 &old.lock_count, \
> +							 new.lock_count))) { \
> +				SUCCESS; \
> +			} \
> +		} else { \
> +			cpu_relax(); \
> +			old.lock_count = READ_ONCE(lockref->lock_count); \
> +		} \
> +	} \
>  } while (0)
>
>  #else
> @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
>  	,
>  	return;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	lockref->count++;
> -	spin_unlock(&lockref->lock);
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
>  }
>  EXPORT_SYMBOL(lockref_get);
>
> @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
>   */
>  int lockref_get_not_zero(struct lockref *lockref)
>  {
> -	int retval;
> -
>  	CMPXCHG_LOOP(
>  		new.count++;
>  		if (old.count <= 0)
> @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
>  		,
>  		return 1;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	retval = 0;
> -	if (lockref->count > 0) {
> -		lockref->count++;
> -		retval = 1;
> -	}
> -	spin_unlock(&lockref->lock);
> -	return retval;
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
> +	return -1;
>  }
>  EXPORT_SYMBOL(lockref_get_not_zero);
>
> @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
>   */
>  int lockref_put_not_zero(struct lockref *lockref)
>  {
> -	int retval;
> -
>  	CMPXCHG_LOOP(
>  		new.count--;
>  		if (old.count <= 1)
> @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
>  		,
>  		return 1;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	retval = 0;
> -	if (lockref->count > 1) {
> -		lockref->count--;
> -		retval = 1;
> -	}
> -	spin_unlock(&lockref->lock);
> -	return retval;
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
> +	return -1;
>  }
>  EXPORT_SYMBOL(lockref_put_not_zero);
>
> @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
>  		,
>  		return new.count;
>  	);
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
>  	return -1;
>  }
>  EXPORT_SYMBOL(lockref_put_return);
> @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
>   */
>  int lockref_get_not_dead(struct lockref *lockref)
>  {
> -	int retval;
> -
>  	CMPXCHG_LOOP(
>  		new.count++;
>  		if (old.count < 0)
> @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
>  		,
>  		return 1;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	retval = 0;
> -	if (lockref->count >= 0) {
> -		lockref->count++;
> -		retval = 1;
> -	}
> -	spin_unlock(&lockref->lock);
> -	return retval;
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
> +	return -1;
>  }
>  EXPORT_SYMBOL(lockref_get_not_dead);
>
> --
> 2.45.2
>
On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > [... commit message snipped ...]
>
> This was already tried by me and it unfortunately can reduce performance.
>

Oh wait I misread the patch based on what I tried there. Spinning
indefinitely waiting for the lock to be free is a no-go as it loses
the forward progress guarantee (and it is possible to get the lock
being continuously held). Only spinning up to an arbitrary point wins
some in some tests and loses in others.

Either way, as described below, chances are decent that:
1. there is an easy way to not lockref_get/put on the parent if the
file is already there, dodging the problem
.. and even if that's not true
2. lockref can be ditched in favor of atomics. Apart from some minor
refactoring this all looks perfectly doable and I have a WIP. I will
try to find the time next week to sort it out.

> However, I *suspect* in the case you are optimizing here (open + O_CREAT
> of an existing file) lockref on the parent can be avoided altogether
> with some hackery and that's what should be done here.
>
> [... rest of quoted mail and patch snipped ...]
On Sat, 2024-08-03 at 06:44 +0200, Mateusz Guzik wrote:
> On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > [... commit message snipped ...]
>
> This was already tried by me and it unfortunately can reduce performance.
>
> Key problem is that in some corner cases the lock can be continuously
> held and be queued on, making the fast path always fail and making all
> the spins actively waste time (and notably pull on the cacheline).
>

The cacheline contention does seem like a real problem with this
approach.

> See this for more details:
> https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
>
> However, I *suspect* in the case you are optimizing here (open + O_CREAT
> of an existing file) lockref on the parent can be avoided altogether
> with some hackery and that's what should be done here.
>

Unfortunately I don't think we can in this codepath:

-------------------8<----------------------
	if (!(open_flag & O_CREAT)) {
		...
	} else {
		/* create side of things */
		if (nd->flags & LOOKUP_RCU) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
		}
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
		/* trailing slashes? */
		if (unlikely(nd->last.name[nd->last.len]))
			return ERR_PTR(-EISDIR);
	}
-------------------8<----------------------

The problem here is the audit_inode call, which can do a GFP_KERNEL
allocation. We can't stay in RCU mode for that, and we need a reference
to "dir" (at least with the current way audit works).

> When it comes to lockref in vfs in general, most uses can be elided with
> some hackery (see the above thread) which is in early WIP (the LSMs are
> a massive headache).
>
> For open calls which *do* need to take a real ref the hackery does not
> help of course.
>
> This is where I think decoupling ref from the lock is the best way
> forward. For that to work the dentry must hang around after the last
> unref (already done thanks to RCU and dput even explicitly handles that
> already!) and there needs to be a way to block new refs atomically --
> can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> coming in. I have that as a WIP as well.
>

These both sound very interesting. FWIW, Josef also started looking at
decoupling the refcount and lock, but I don't think he's gotten very far
yet. I'm happy to help test some of this too if you get to that point.
The 4th patch in this RFC series really amps up the contention for the
lockref once the i_rwsem isn't being touched.
> > [... quoted patch snipped ...]
On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > [... commit message and earlier discussion snipped ...]
>
> Oh wait I misread the patch based on what I tried there. Spinning
> indefinitely waiting for the lock to be free is a no-go as it loses
> the forward progress guarantee (and it is possible to get the lock
> being continuously held). Only spinning up to an arbitrary point wins
> some in some tests and loses in others.
>

I'm a little confused about the forward progress guarantee here. Does
that exist today at all? ISTM that falling back to spin_lock() after a
certain number of retries doesn't guarantee any forward progress. You
can still just end up spinning on the lock forever once that happens,
no?

> Either way, as described below, chances are decent that:
> 1. there is an easy way to not lockref_get/put on the parent if the
> file is already there, dodging the problem
> .. and even if that's not true
> 2. lockref can be ditched in favor of atomics. apart from some minor
> refactoring this all looks perfectly doable and I have a wip. I will
> try to find the time next week to sort it out
>

Like I said in the earlier mail, I don't think we can stay in RCU mode
because of the audit_inode call. I'm definitely interested in your WIP
though!

> > [... rest of quoted mail and patch snipped ...]
On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote: > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to > > > > avoid taking the inode->i_rwsem for write when the dentry already exists. > > > > When we tested that initially, the performance devolved significantly > > > > due to contention for the parent's d_lockref spinlock. > > > > > > > > There are two problems with lockrefs today: First, once any concurrent > > > > task takes the spinlock, they all end up taking the spinlock, which is > > > > much more costly than a single cmpxchg operation. The second problem is > > > > that once any task fails to cmpxchg 100 times, it falls back to the > > > > spinlock. The upshot there is that even moderate contention can cause a > > > > fallback to serialized spinlocking, which worsens performance. > > > > > > > > This patch changes CMPXCHG_LOOP in 2 ways: > > > > > > > > First, change the loop to spin instead of falling back to a locked > > > > codepath when the spinlock is held. Once the lock is released, allow the > > > > task to continue trying its cmpxchg loop as before instead of taking the > > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > > > > Just continue infinitely. > > > > > > > > This greatly reduces contention on the lockref when there are large > > > > numbers of concurrent increments and decrements occurring. > > > > > > > > > > This was already tried by me and it unfortunately can reduce performance. > > > > > > > Oh wait I misread the patch based on what I tried there. Spinning > > indefinitely waiting for the lock to be free is a no-go as it loses > > the forward progress guarantee (and it is possible to get the lock > > being continuously held). 
Only spinning up to an arbitrary point wins > > some in some tests and loses in others. > > > > I'm a little confused about the forward progress guarantee here. Does > that exist today at all? ISTM that falling back to spin_lock() after a > certain number of retries doesn't guarantee any forward progress. You > can still just end up spinning on the lock forever once that happens, > no? > There is the implicit assumption that everyone holds locks for a finite time. I agree there are no guarantees otherwise if that's what you meant. In this case, since spinlocks are queued, a constant stream of lock holders will make the lock appear taken indefinitely even if they all hold it for a short period. Stock lockref will give up atomics immediately and make sure to change the ref thanks to queueing up. Lockref as proposed in this patch wont be able to do anything as long as the lock trading is taking place. > > Either way, as described below, chances are decent that: > > 1. there is an easy way to not lockref_get/put on the parent if the > > file is already there, dodging the problem > > .. and even if that's not true > > 2. lockref can be ditched in favor of atomics. apart from some minor > > refactoring this all looks perfectly doable and I have a wip. I will > > try to find the time next week to sort it out > > > > Like I said in the earlier mail, I don't think we can stay in RCU mode > because of the audit_inode call. I'm definitely interested in your WIP > though! > well audit may be hackable so that it works in rcu most of the time, but that's not something i'm interested in sorting out the lockref situation would definitely help other stuff (notably opening the same file RO). anyhow one idea is to temporarily disable atomic ops with a flag in the counter, a fallback plan is to loosen lockref so that it can do transitions other than 0->1->2 with atomics, even if the lock is held. 
I have not looked at this in over a month, I'm going to need to refresh my memory on the details, I do remember there was some stuff to massage first. Anyhow, I expect a working WIP some time in the upcoming week. -- Mateusz Guzik <mjguzik gmail.com>
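[Editor's note] The "temporarily disable atomic ops with a flag in the counter" idea floated above can be sketched in userspace roughly as follows. This is a hypothetical illustration, not kernel code: `ATOMICS_DISABLED`, `ref_get_fast` and friends are made-up names, and a 32-bit word stands in for the packed lockref.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Reserve one bit of the count word to mean "atomic transitions are
 * disabled", so a lock holder can force all other tasks onto the
 * locked slow path without them ever touching the lock cacheline
 * with a cmpxchg. */
#define ATOMICS_DISABLED (1u << 31)

/* Try to bump the reference count with cmpxchg alone; return false if
 * the disable flag is set, meaning the caller must take the spinlock. */
static bool ref_get_fast(_Atomic uint32_t *count)
{
	uint32_t old = atomic_load_explicit(count, memory_order_relaxed);

	while (!(old & ATOMICS_DISABLED)) {
		if (atomic_compare_exchange_weak_explicit(count, &old, old + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
		/* old was reloaded by the failed compare-exchange; retry */
	}
	return false; /* flag set: fall back to the locked slow path */
}

/* A lock holder sets the flag around its critical section... */
static void ref_disable_atomics(_Atomic uint32_t *count)
{
	atomic_fetch_or_explicit(count, ATOMICS_DISABLED, memory_order_relaxed);
}

/* ...and clears it on release, re-enabling the lockless fast path. */
static void ref_enable_atomics(_Atomic uint32_t *count)
{
	atomic_fetch_and_explicit(count, ~ATOMICS_DISABLED, memory_order_relaxed);
}
```

The point of the flag is that contending tasks fail fast instead of repeatedly pulling the cacheline with doomed cmpxchg attempts while the lock is held.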
On Sat, 2024-08-03 at 13:21 +0200, Mateusz Guzik wrote: > On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote: > > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to > > > > > avoid taking the inode->i_rwsem for write when the dentry already exists. > > > > > When we tested that initially, the performance devolved significantly > > > > > due to contention for the parent's d_lockref spinlock. > > > > > > > > > > There are two problems with lockrefs today: First, once any concurrent > > > > > task takes the spinlock, they all end up taking the spinlock, which is > > > > > much more costly than a single cmpxchg operation. The second problem is > > > > > that once any task fails to cmpxchg 100 times, it falls back to the > > > > > spinlock. The upshot there is that even moderate contention can cause a > > > > > fallback to serialized spinlocking, which worsens performance. > > > > > > > > > > This patch changes CMPXCHG_LOOP in 2 ways: > > > > > > > > > > First, change the loop to spin instead of falling back to a locked > > > > > codepath when the spinlock is held. Once the lock is released, allow the > > > > > task to continue trying its cmpxchg loop as before instead of taking the > > > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > > > > > Just continue infinitely. > > > > > > > > > > This greatly reduces contention on the lockref when there are large > > > > > numbers of concurrent increments and decrements occurring. > > > > > > > > > > > > > This was already tried by me and it unfortunately can reduce performance. > > > > > > > > > > Oh wait I misread the patch based on what I tried there. 
Spinning > > > indefinitely waiting for the lock to be free is a no-go as it loses > > > the forward progress guarantee (and it is possible to get the lock > > > being continuously held). Only spinning up to an arbitrary point wins > > > some in some tests and loses in others. > > > > > > > I'm a little confused about the forward progress guarantee here. Does > > that exist today at all? ISTM that falling back to spin_lock() after a > > certain number of retries doesn't guarantee any forward progress. You > > can still just end up spinning on the lock forever once that happens, > > no? > > > > There is the implicit assumption that everyone holds locks for a > finite time. I agree there are no guarantees otherwise if that's what > you meant. > > In this case, since spinlocks are queued, a constant stream of lock > holders will make the lock appear taken indefinitely even if they all > hold it for a short period. > > Stock lockref will give up atomics immediately and make sure to change > the ref thanks to queueing up. > > Lockref as proposed in this patch wont be able to do anything as long > as the lock trading is taking place. > Got it, thanks. This spinning is very simplistic, so I could see that you could have one task continually getting shuffled to the end of the queue. > > > Either way, as described below, chances are decent that: > > > 1. there is an easy way to not lockref_get/put on the parent if the > > > file is already there, dodging the problem > > > .. and even if that's not true > > > 2. lockref can be ditched in favor of atomics. apart from some minor > > > refactoring this all looks perfectly doable and I have a wip. I will > > > try to find the time next week to sort it out > > > > > > > Like I said in the earlier mail, I don't think we can stay in RCU mode > > because of the audit_inode call. I'm definitely interested in your WIP > > though! 
> > > > well audit may be hackable so that it works in rcu most of the time, > but that's not something i'm interested in Audit is not my favorite area of the kernel to work in either. I don't see a good way to make it rcu-friendly, but I haven't looked too hard yet either. It would be nice to be able to do some of the auditing under rcu or spinlock. > > sorting out the lockref situation would definitely help other stuff > (notably opening the same file RO). > Indeed. It's clear that the current implementation is a real scalability problem in a lot of situations. > anyhow one idea is to temporarily disable atomic ops with a flag in > the counter, a fallback plan is to loosen lockref so that it can do > transitions other than 0->1->2 with atomics, even if the lock is held. > > I have not looked at this in over a month, I'm going to need to > refresh my memory on the details, I do remember there was some stuff > to massage first. > > Anyhow, I expect a working WIP some time in the upcoming week. > Great, I'll stay tuned. Thanks!
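[Editor's note] The queueing behaviour being discussed — a constant stream of holders keeping the lock continuously "held" from the perspective of a lockless cmpxchg loop — follows from the FIFO nature of queued spinlocks. A minimal userspace ticket-lock model illustrates the mechanism (the kernel uses qspinlocks, not this; all names here are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Lockers take the next ticket and are served strictly in FIFO order. */
struct ticket_lock {
	_Atomic uint32_t next;  /* next ticket to hand out */
	_Atomic uint32_t owner; /* ticket currently being served */
};

/* Take a ticket and wait for our turn; returns the ticket for clarity. */
static uint32_t ticket_lock_acquire(struct ticket_lock *l)
{
	uint32_t ticket = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != ticket)
		; /* spin: FIFO order guarantees each waiter is eventually served */
	return ticket;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
}

/* The lock looks held whenever any ticket is outstanding, which is why
 * back-to-back lockers can starve an outside cmpxchg spinner that only
 * proceeds when the lock looks free. */
static bool ticket_lock_is_held(struct ticket_lock *l)
{
	return atomic_load(&l->next) != atomic_load(&l->owner);
}
```

Queued waiters get the forward-progress guarantee; a task that only ever polls `ticket_lock_is_held()` and never queues does not.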
> Audit not my favorite area of the kernel to work in either. I don't see > a good way to make it rcu-friendly, but I haven't looked too hard yet > either. It would be nice to be able to do some of the auditing under > rcu or spinlock. For audit your main option is to dodge the problem and check whether audit is active and only drop out of rcu if it is. That sidesteps the problem. I'm somewhat certain that a lot of systems don't really have audit active. From a brief look at audit it would be quite involved to make it work just under rcu. Not just because it does various allocations but it also reads fscaps from disk and so on. That's not going to work unless we add a vfs based fscaps cache similar to what we do for acls. I find that very unlikely.
On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote: > > Audit not my favorite area of the kernel to work in either. I don't see > > a good way to make it rcu-friendly, but I haven't looked too hard yet > > either. It would be nice to be able to do some of the auditing under > > rcu or spinlock. > > For audit your main option is to dodge the problem and check whether > audit is active and only drop out of rcu if it is. That sidesteps the > problem. I'm somewhat certain that a lot of systems don't really have > audit active. > I did have an earlier version of 4/4 that checked audit_context() and stayed in RCU mode if it comes back NULL. I can resurrect that if you think it's worthwhile. > From a brief look at audit it would be quite involved to make it work > just under rcu. Not just because it does various allocation but it also > reads fscaps from disk and so on. That's not going to work unless we add > a vfs based fscaps cache similar to what we do for acls. I find that > very unlikely. Yeah. It wants to record a lot of (variable-length) information at very inconvenient times. I think we're sort of stuck with it though until someone has a vision on how to do this in a non-blocking way. Handwavy thought: there is some similarity to tracepoints in what audit_inode does, and tracepoints are able to be called in all sorts of contexts. I wonder if we could leverage the same infrastructure somehow? The catch here is that we can't just drop audit records if things go wrong.
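[Editor's note] The audit_context() check described above — stay in RCU-walk when it returns NULL — could look roughly like the sketch below. audit_context() is a real kernel function (it returns the current task's audit context or NULL); the helper name and the userspace stub standing in for it are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's audit_context(), which returns
 * the current task's audit context, or NULL when nothing is recording. */
static void *fake_audit_context;
static void *audit_context(void) { return fake_audit_context; }

/* Hypothetical fs/namei.c-local helper: force a drop out of RCU-walk
 * (-ECHILD) only when audit actually has a context to fill in, letting
 * the common no-audit case stay on the lockless path. */
static int try_rcu_audit_check(void)
{
	if (audit_context())
		return -ECHILD; /* audit wants to record: leave RCU mode */
	return 0;               /* no audit context: safe to stay in RCU walk */
}
```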
On Mon, Aug 05, 2024 at 08:52:28AM GMT, Jeff Layton wrote: > On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote: > > > Audit not my favorite area of the kernel to work in either. I don't see > > > a good way to make it rcu-friendly, but I haven't looked too hard yet > > > either. It would be nice to be able to do some of the auditing under > > > rcu or spinlock. > > > > For audit your main option is to dodge the problem and check whether > > audit is active and only drop out of rcu if it is. That sidesteps the > > problem. I'm somewhat certain that a lot of systems don't really have > > audit active. > > > > I did have an earlier version of 4/4 that checked audit_context() and > stayed in RCU mode if it comes back NULL. I can resurrect that if you > think it's worthwhile. Let's at least see what it looks like. Maybe just use a helper local to fs/namei.c that returns ECHILD if audit is available and 0 otherwise? > > From a brief look at audit it would be quite involved to make it work > > just under rcu. Not just because it does various allocation but it also > > reads fscaps from disk and so on. That's not going to work unless we add > > a vfs based fscaps cache similar to what we do for acls. I find that > > very unlikely. > > Yeah. It wants to record a lot of (variable-length) information at very > inconvenient times. I think we're sort of stuck with it though until > someone has a vision on how to do this in a non-blocking way. > > Handwavy thought: there is some similarity to tracepoints in what > audit_inode does, and tracepoints are able to be called in all sorts of > contexts. I wonder if we could leverage the same infrastructure > somehow? The catch here is that we can't just drop audit records if > things go wrong. I can't say much about the tracepoint idea as I lack the necessary details around their implementation. I think the better way forward is a model with a fastpath and a slowpath. 
Under RCU audit_inode() returns -ECHILD if it sees that it needs to end up doing anything it couldn't do in a non-blocking way, and then path lookup can drop out of RCU and call audit_inode() again. I think this wouldn't be extremely terrible. It would amount to adding a flag to audit_inode(), AUDIT_MAY_NOT_BLOCK, and then on ECHILD audit_inode() gets called again without that flag. Over time, if people are interested, they could then make more and more stuff available under rcu for audit.
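[Editor's note] The fastpath/slowpath split proposed above can be modelled like this. The AUDIT_MAY_NOT_BLOCK flag name comes from the email itself; everything else (the stub audit_inode() signature, the globals) is a userspace stand-in for illustration, not the kernel API.

```c
#include <errno.h>
#include <stdbool.h>

#define AUDIT_MAY_NOT_BLOCK 0x01

/* Whether recording this inode would require blocking work (allocations,
 * reading fscaps from disk, ...). Stubbed as a global for illustration. */
static bool audit_needs_blocking;
static int audit_records;

/* Fastpath-aware audit_inode(): under AUDIT_MAY_NOT_BLOCK, bail out with
 * -ECHILD as soon as blocking work would be needed; otherwise record. */
static int audit_inode(unsigned int flags)
{
	if ((flags & AUDIT_MAY_NOT_BLOCK) && audit_needs_blocking)
		return -ECHILD; /* caller must drop out of RCU and retry */
	audit_records++;        /* record the access (blocking work elided) */
	return 0;
}

/* Caller pattern: try under RCU first, retry without the flag on -ECHILD. */
static int lookup_with_audit(void)
{
	int err = audit_inode(AUDIT_MAY_NOT_BLOCK);

	if (err == -ECHILD) {
		/* ... legitimize the path / drop out of RCU here ... */
		err = audit_inode(0);
	}
	return err;
}
```

Crucially, no audit record is ever dropped: the -ECHILD return only defers the recording to the blocking retry.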
diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..b76941043fe9 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -8,22 +8,25 @@
  * Note that the "cmpxchg()" reloads the "old" value for the
  * failure case.
  */
-#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
-	int retry = 100;							\
-	struct lockref old;							\
-	BUILD_BUG_ON(sizeof(old) != 8);						\
-	old.lock_count = READ_ONCE(lockref->lock_count);			\
-	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
-		struct lockref new = old;					\
-		CODE								\
-		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
-						 &old.lock_count,		\
-						 new.lock_count))) {		\
-			SUCCESS;						\
-		}								\
-		if (!--retry)							\
-			break;							\
-	}									\
+#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
+	struct lockref old;							\
+	BUILD_BUG_ON(sizeof(old) != 8);						\
+	old.lock_count = READ_ONCE(lockref->lock_count);			\
+	for (;;) {								\
+		struct lockref new = old;					\
+										\
+		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
+			CODE							\
+			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,	\
+							 &old.lock_count,	\
+							 new.lock_count))) {	\
+				SUCCESS;					\
+			}							\
+		} else {							\
+			cpu_relax();						\
+			old.lock_count = READ_ONCE(lockref->lock_count);	\
+		}								\
+	}									\
 } while (0)
 
 #else
@@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
 	,
 		return;
 	);
-
-	spin_lock(&lockref->lock);
-	lockref->count++;
-	spin_unlock(&lockref->lock);
+	/* should never get here */
+	WARN_ON_ONCE(1);
 }
 EXPORT_SYMBOL(lockref_get);
@@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
  */
 int lockref_get_not_zero(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count++;
 		if (old.count <= 0)
@@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count > 0) {
-		lockref->count++;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_get_not_zero);
@@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
  */
 int lockref_put_not_zero(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count--;
 		if (old.count <= 1)
@@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count > 1) {
-		lockref->count--;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_put_not_zero);
@@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
 	,
 		return new.count;
 	);
+	/* should never get here */
+	WARN_ON_ONCE(1);
 	return -1;
 }
 EXPORT_SYMBOL(lockref_put_return);
@@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
  */
 int lockref_get_not_dead(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count++;
 		if (old.count < 0)
@@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count >= 0) {
-		lockref->count++;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_get_not_dead);
In a later patch, we want to change the open(..., O_CREAT) codepath to avoid taking the inode->i_rwsem for write when the dentry already exists. When we tested that initially, the performance devolved significantly due to contention for the parent's d_lockref spinlock.

There are two problems with lockrefs today: First, once any concurrent task takes the spinlock, they all end up taking the spinlock, which is much more costly than a single cmpxchg operation. The second problem is that once any task fails to cmpxchg 100 times, it falls back to the spinlock. The upshot there is that even moderate contention can cause a fallback to serialized spinlocking, which worsens performance.

This patch changes CMPXCHG_LOOP in 2 ways:

First, change the loop to spin instead of falling back to a locked codepath when the spinlock is held. Once the lock is released, allow the task to continue trying its cmpxchg loop as before instead of taking the lock. Second, don't allow the cmpxchg loop to give up after 100 retries. Just continue infinitely.

This greatly reduces contention on the lockref when there are large numbers of concurrent increments and decrements occurring.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
 1 file changed, 32 insertions(+), 53 deletions(-)
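[Editor's note] For readers skimming the archive, a small userspace model of the behaviour this patch gives CMPXCHG_LOOP: spin while the embedded lock is held and retry the cmpxchg indefinitely, never falling back to the spinlock. Illustrative only — `model_lockref_get` and `LOCK_BIT` are made-up stand-ins (a single 64-bit word modelling struct lockref's packed lock+count), not the kernel macro.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_BIT (1ull << 32) /* models the embedded spinlock being held */

static void model_lockref_get(_Atomic uint64_t *lock_count)
{
	uint64_t old = atomic_load_explicit(lock_count, memory_order_relaxed);

	for (;;) {
		if (!(old & LOCK_BIT)) {
			/* lock looks free: attempt the increment atomically */
			if (atomic_compare_exchange_weak_explicit(
					lock_count, &old, old + 1,
					memory_order_relaxed,
					memory_order_relaxed))
				return;
			/* old reloaded on failure; retry, never take the lock */
		} else {
			/* lock held: reread and wait for release instead of
			 * queueing on the spinlock */
			old = atomic_load_explicit(lock_count,
						   memory_order_relaxed);
		}
	}
}
```

As the thread above discusses, this loop has no forward-progress guarantee of its own: if LOCK_BIT is handed continuously from holder to holder, the cmpxchg never gets a chance to succeed.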