Message ID: 1475257257-23072-1-git-send-email-bdegraaf@codeaurora.org (mailing list archive)
State: New, archived
Hi Brent, On 30/09/16 18:40, Brent DeGraaf wrote: > Prior spinlock code solely used load-acquire and store-release > semantics to ensure ordering of the spinlock lock and the area it > protects. However, store-release semantics and ordinary stores do > not protect against accesses to the protected area being observed > prior to the access that locks the lock itself. > > While the load-acquire and store-release ordering is sufficient > when the spinlock routines themselves are strictly used, other > kernel code that references the lock values directly (e.g. lockrefs) > could observe changes to the area protected by the spinlock prior > to observance of the lock itself being in a locked state, despite > the fact that the spinlock logic itself is correct. > > Barriers were added to all the locking routines wherever necessary > to ensure that outside observers which read the lock values directly > will not observe changes to the protected data before the lock itself > is observed. > > Signed-off-by: Brent DeGraaf <bdegraaf@codeaurora.org> > --- > arch/arm64/include/asm/spinlock.h | 59 ++++++++++++++++++++++++++++++++++++--- > 1 file changed, 55 insertions(+), 4 deletions(-) > > diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h > index 89206b5..4dd0977 100644 > --- a/arch/arm64/include/asm/spinlock.h > +++ b/arch/arm64/include/asm/spinlock.h > @@ -106,7 +106,20 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) > > /* Did we get the lock? */ > " eor %w1, %w0, %w0, ror #16\n" > -" cbz %w1, 3f\n" > +" cbnz %w1, 4f\n" > + /* > + * Yes: The store done on this cpu was the one that locked the lock. > + * Store-release one-way barrier on LL/SC means that accesses coming > + * after this could be reordered into the critical section of the > + * load-acquire/store-release, where we did not own the lock. On LSE, > + * even the one-way barrier of the store-release semantics is missing, > + * so LSE needs an explicit barrier here as well. Without this, the > + * changed contents of the area protected by the spinlock could be > + * observed prior to the lock. What is that last sentence supposed to mean? If the lock is free, then surely any previous writes to the data it's protecting would have already been observed by the release semantics of the previous unlock? If the lock is currently held, what do we care about the state of the data while we're still spinning on the lock itself? And if someone's touching the data without having acquired *or* released the lock, why is there a lock in the first place? This seems like a very expensive way of papering over broken callers :/ Robin. > + */ > +" dmb ish\n" > +" b 3f\n" > +"4:\n" > /* > * No: spin on the owner. Send a local event to avoid missing an > * unlock before the exclusive load. > @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) > " ldaxrh %w2, %4\n" > " eor %w1, %w2, %w0, lsr #16\n" > " cbnz %w1, 2b\n" > - /* We got the lock. Critical section starts here. */ > + /* > + * We got the lock and have observed the prior owner's store-release. > + * In this case, the one-way barrier of the prior owner that we > + * observed combined with the one-way barrier of our load-acquire is > + * enough to ensure accesses to the protected area coming after this > + * are not accessed until we own the lock. In this case, other > + * observers will not see our changes prior to observing the lock > + * itself. Critical locked section starts here. 
> + */ > "3:" > : "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock) > : "Q" (lock->owner), "I" (1 << TICKET_SHIFT) > @@ -137,6 +158,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock) > " add %w0, %w0, %3\n" > " stxr %w1, %w0, %2\n" > " cbnz %w1, 1b\n" > + /* > + * We got the lock with a successful store-release: Store-release > + * one-way barrier means accesses coming after this could be observed > + * before the lock is observed as locked. > + */ > + " dmb ish\n" > + " nop\n" > "2:", > /* LSE atomics */ > " ldr %w0, %2\n" > @@ -146,6 +174,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock) > " casa %w0, %w1, %2\n" > " and %w1, %w1, #0xffff\n" > " eor %w1, %w1, %w0, lsr #16\n" > + " cbnz %w1, 1f\n" > + /* > + * We got the lock with the LSE casa store. > + * A barrier is required to ensure accesses coming from the > + * critical section of the lock are not observed before our lock. > + */ > + " dmb ish\n" > "1:") > : "=&r" (lockval), "=&r" (tmp), "+Q" (*lock) > : "I" (1 << TICKET_SHIFT) > @@ -212,6 +247,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw) > " cbnz %w0, 1b\n" > " stxr %w0, %w2, %1\n" > " cbnz %w0, 2b\n" > + /* > + * Lock is not ours until the store, which has no implicit barrier. > + * Barrier is needed so our writes to the protected area are not > + * observed before our lock ownership is observed. > + */ > + " dmb ish\n" > " nop", > /* LSE atomics */ > "1: mov %w0, wzr\n" > @@ -221,7 +262,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw) > " cbz %w0, 2b\n" > " wfe\n" > " b 1b\n" > - "3:") > + /* > + * Casa doesn't use store-release semantics. Even if it did, > + * it would not protect us from our writes being observed before > + * our ownership is observed. Barrier is required. > + */ > + "3: dmb ish") > : "=&r" (tmp), "+Q" (rw->lock) > : "r" (0x80000000) > : "memory"); > @@ -299,7 +345,12 @@ static inline void arch_read_lock(arch_rwlock_t *rw) > " tbnz %w1, #31, 1b\n" > " casa %w0, %w1, %2\n" > " sbc %w0, %w1, %w0\n" > - " cbnz %w0, 2b") > + " cbnz %w0, 2b\n" > + /* > + * Need to ensure that our reads of the area protected by the lock > + * are not observed before our lock ownership is observed. > + */ > + " dmb ish\n") > : "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock) > : > : "cc", "memory"); >
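[Editor's note] For readers following the thread, here is a minimal sketch (kernel-style C) of the access pattern under debate: a conventional locker/writer on one CPU and a lockref-style lockless observer on another. The `struct prot` type and function names are hypothetical, introduced only for illustration; the question Robin raises above is whether the observer can ever see the protected data change while the lock value it read still appears unlocked, and whether that is the spinlock's problem to solve.

#include <linux/spinlock.h>
#include <linux/compiler.h>

/* Hypothetical structure, for illustration only. */
struct prot {
	arch_spinlock_t lock;	/* arm64 ticket lock */
	int data;		/* data nominally protected by 'lock' */
};

/* CPU0: conventional user of the lock. */
static void writer(struct prot *p)
{
	arch_spin_lock(&p->lock);	/* load-acquire + store of "locked" */
	WRITE_ONCE(p->data, 1);		/* store inside the critical section */
	arch_spin_unlock(&p->lock);	/* store-release */
}

/* CPU1: lockless observer, in the style of lockref's fast path. */
static int observer(struct prot *p)
{
	arch_spinlock_t v = READ_ONCE(p->lock);	/* plain load, no acquire */

	if (arch_spin_value_unlocked(v))
		return READ_ONCE(p->data);	/* can this already be 1 while
						 * v still looks unlocked? */
	return -1;
}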
On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote: > Prior spinlock code solely used load-acquire and store-release > semantics to ensure ordering of the spinlock lock and the area it > protects. However, store-release semantics and ordinary stores do > not protect against accesses to the protected area being observed > prior to the access that locks the lock itself. > > While the load-acquire and store-release ordering is sufficient > when the spinlock routines themselves are strictly used, other > kernel code that references the lock values directly (e.g. lockrefs) > could observe changes to the area protected by the spinlock prior > to observance of the lock itself being in a locked state, despite > the fact that the spinlock logic itself is correct. > > Barriers were added to all the locking routines wherever necessary > to ensure that outside observers which read the lock values directly > will not observe changes to the protected data before the lock itself > is observed. I cannot see this going in. You're making spinlocks far more expensive in the common case that doesn't need this. Please enumerate the special cases (there's more than just lockref?) and fix those.
On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote: > Prior spinlock code solely used load-acquire and store-release > semantics to ensure ordering of the spinlock lock and the area it > protects. However, store-release semantics and ordinary stores do > not protect against accesses to the protected area being observed > prior to the access that locks the lock itself. > > While the load-acquire and store-release ordering is sufficient > when the spinlock routines themselves are strictly used, other > kernel code that references the lock values directly (e.g. lockrefs) Isn't the problem with lockref the fact that arch_spin_value_unlocked() isn't a load-acquire, and therefore the CPU in question doesn't need to observe the contents of the critical section etc..? That is, wouldn't fixing arch_spin_value_unlocked() by making that an smp_load_acquire() fix things much better? > could observe changes to the area protected by the spinlock prior > to observance of the lock itself being in a locked state, despite > the fact that the spinlock logic itself is correct.
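[Editor's note] For context, the lockref fast path Peter is referring to looks roughly like the sketch below, modelled on the CMPXCHG_LOOP macro in lib/lockref.c of this era and simplified into a standalone function with a hypothetical name; the field spelling follows the lockref union of a 64-bit lock_count with a spinlock/count pair. The initial READ_ONCE is a plain load with no acquire semantics, which is the property under discussion.

#include <linux/types.h>
#include <linux/lockref.h>
#include <linux/spinlock.h>

/*
 * Hypothetical, simplified rendering of lockref's cmpxchg fast path:
 * bump the count only while the embedded spinlock appears unlocked,
 * without ever taking the lock. Returns false if the caller should
 * fall back to the spin_lock() slow path.
 */
static bool lockref_inc_if_unlocked(struct lockref *lockref)
{
	struct lockref old, new, prev;

	old.lock_count = READ_ONCE(lockref->lock_count);
	while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
		prev = old;
		new = old;
		new.count++;
		/*
		 * The 64-bit cmpxchg covers the lock word and the count
		 * together: it only stores if nothing (including a
		 * spin_lock()) has changed either half since our read,
		 * and otherwise hands back the current value to retry with.
		 */
		old.lock_count = cmpxchg64_relaxed(&lockref->lock_count,
						   old.lock_count,
						   new.lock_count);
		if (old.lock_count == prev.lock_count)
			return true;
		cpu_relax();
	}
	return false;
}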
On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote: > Prior spinlock code solely used load-acquire and store-release > semantics to ensure ordering of the spinlock lock and the area it > protects. However, store-release semantics and ordinary stores do > not protect against accesses to the protected area being observed > prior to the access that locks the lock itself. > > While the load-acquire and store-release ordering is sufficient > when the spinlock routines themselves are strictly used, other > kernel code that references the lock values directly (e.g. lockrefs) > could observe changes to the area protected by the spinlock prior > to observance of the lock itself being in a locked state, despite > the fact that the spinlock logic itself is correct. If the spinlock logic is correct, why are we changing that, and not the lockref code that you say has a problem? What exactly goes wrong in the lockref code? Can you give a concrete example? Why does the lockref code accesses lock-protected fields without taking the lock first? Wouldn't concurrent modification be a problem regardless? > + /* > + * Yes: The store done on this cpu was the one that locked the lock. > + * Store-release one-way barrier on LL/SC means that accesses coming > + * after this could be reordered into the critical section of the I assume you meant s/store-release/load-acquire/ here. This does not make sense to me otherwise. > + * load-acquire/store-release, where we did not own the lock. On LSE, > + * even the one-way barrier of the store-release semantics is missing, Likewise (for the LSE case description). > + * so LSE needs an explicit barrier here as well. Without this, the > + * changed contents of the area protected by the spinlock could be > + * observed prior to the lock. > + */ By whom? We generally expect that if data is protected by a lock, you take the lock before accessing it. If you expect concurrent lockless readers, then there's a requirement on the writer side to explicitly provide the ordering it requires -- spinlocks are not expected to provide that. So, why aren't those observers taking the lock? What pattern of accesses are made by readers and writers such that there is a problem? What does this result in? > +" dmb ish\n" > +" b 3f\n" > +"4:\n" > /* > * No: spin on the owner. Send a local event to avoid missing an > * unlock before the exclusive load. > @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) > " ldaxrh %w2, %4\n" > " eor %w1, %w2, %w0, lsr #16\n" > " cbnz %w1, 2b\n" > - /* We got the lock. Critical section starts here. */ > + /* > + * We got the lock and have observed the prior owner's store-release. > + * In this case, the one-way barrier of the prior owner that we > + * observed combined with the one-way barrier of our load-acquire is > + * enough to ensure accesses to the protected area coming after this > + * are not accessed until we own the lock. In this case, other > + * observers will not see our changes prior to observing the lock > + * itself. Critical locked section starts here. > + */ Each of these comments ends up covers, and their repeated presence makes the code harder to read. If there's a common problem, note it once at the top of the file. Thanks, Mark.
On 2016-09-30 14:43, Robin Murphy wrote: >> + * so LSE needs an explicit barrier here as well. Without this, the >> + * changed contents of the area protected by the spinlock could be >> + * observed prior to the lock. > > What is that last sentence supposed to mean? If the lock is free, then > surely any previous writes to the data it's protecting would have > already been observed by the release semantics of the previous unlock? > If the lock is currently held, what do we care about the state of the > data while we're still spinning on the lock itself? And if someone's > touching the data without having acquired *or* released the lock, why > is > there a lock in the first place? > > This seems like a very expensive way of papering over broken callers :/ > > Robin. > Thanks for your question. First off, I saw no negative impact to performance as a result of introducing these barriers running a wide variety of use cases, both for mobile and server-class devices ranging from 4 to 24 cpus. Yes, it does protect lockref, which observes the spinlocks in a non-conventional way. In fact, with this code in place, the performance of Linus' test which runs stat like crazy actually improved in the range of 1-2% (I suspect this is due to fewer failures due to contention on the protected count lockref uses). The lockref code can be made compliant, but not by a single load-acquire--it has to take the lock. Turning off CONFIG_ARCH_USE_CMPXCHG_LOCKREF is the most obvious solution as it forces lockref.c to take the lock. That, however, comes at a very high performance cost: 30-50% on Linus' stat test on a 24-core system. For larger systems, this performance gap will get even worse. With the above in mind, it seems that supporting lockref's unorthodox method of dealing with the lock is the better alternative, as it helps, rather than hurts, arm64 performance. Brent
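[Editor's note] For reference, the fallback Brent is measuring when CONFIG_ARCH_USE_CMPXCHG_LOCKREF is turned off is essentially the fully locked path. A simplified sketch (hypothetical function name, modelled on what lockref_get() degenerates to without the cmpxchg fast path):

#include <linux/lockref.h>
#include <linux/spinlock.h>

/*
 * Sketch of the locked slow path: every reference-count update takes and
 * releases the spinlock, which is what costs 30-50% in the stat-heavy
 * test Brent describes when the cmpxchg fast path is compiled out.
 */
static void lockref_get_locked(struct lockref *lockref)
{
	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}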
On 2016-09-30 15:05, Peter Zijlstra wrote: > On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote: >> Prior spinlock code solely used load-acquire and store-release >> semantics to ensure ordering of the spinlock lock and the area it >> protects. However, store-release semantics and ordinary stores do >> not protect against accesses to the protected area being observed >> prior to the access that locks the lock itself. >> >> While the load-acquire and store-release ordering is sufficient >> when the spinlock routines themselves are strictly used, other >> kernel code that references the lock values directly (e.g. lockrefs) > > Isn't the problem with lockref the fact that arch_spin_value_unlocked() > isn't a load-acquire, and therefore the CPU in question doesn't need to > observe the contents of the critical section etc..? > > That is, wouldn't fixing arch_spin_value_unlocked() by making that an > smp_load_acquire() fix things much better? > >> could observe changes to the area protected by the spinlock prior >> to observance of the lock itself being in a locked state, despite >> the fact that the spinlock logic itself is correct. Thanks for your comments. The load-acquire would not be enough for lockref, as it can still observe the changed data out of order. To ensure order, lockref has to take the lock, which comes at a high performance cost. Turning off the config option CONFIG_ARCH_USE_CMPXCHG_LOCKREF, which forces arch_spin_lock calls reduced my multicore performance between 30 and 50 percent using Linus' "stat" test that was part of the grounds for introducing lockref. On the other hand, I did not see any negative impact to performance by the new barriers, in large part probably because they only tend to come into play when locks are not heavily contended in the case of ticket locks. I have not yet found any other spinlock "abuses" in the kernel besides lockref, but locks are referenced in a large number of places that includes drivers, which are dynamic. It is arguable that I could remove the barriers to the read/write locks, as lockref doesn't use those, but it seemed to me to be safer and more "normal" to ensure that the locked write to the lock itself is visible prior to the changed contents of the protected area. Brent
On 2016-09-30 15:32, Mark Rutland wrote: > On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote: >> Prior spinlock code solely used load-acquire and store-release >> semantics to ensure ordering of the spinlock lock and the area it >> protects. However, store-release semantics and ordinary stores do >> not protect against accesses to the protected area being observed >> prior to the access that locks the lock itself. >> >> While the load-acquire and store-release ordering is sufficient >> when the spinlock routines themselves are strictly used, other >> kernel code that references the lock values directly (e.g. lockrefs) >> could observe changes to the area protected by the spinlock prior >> to observance of the lock itself being in a locked state, despite >> the fact that the spinlock logic itself is correct. > > If the spinlock logic is correct, why are we changing that, and not the > lockref > code that you say has a problem? > > What exactly goes wrong in the lockref code? Can you give a concrete > example? > > Why does the lockref code accesses lock-protected fields without taking > the > lock first? Wouldn't concurrent modification be a problem regardless? > >> + /* >> + * Yes: The store done on this cpu was the one that locked the lock. >> + * Store-release one-way barrier on LL/SC means that accesses coming >> + * after this could be reordered into the critical section of the > > I assume you meant s/store-release/load-acquire/ here. This does not > make sense > to me otherwise. > >> + * load-acquire/store-release, where we did not own the lock. On >> LSE, >> + * even the one-way barrier of the store-release semantics is >> missing, > > Likewise (for the LSE case description). > >> + * so LSE needs an explicit barrier here as well. Without this, the >> + * changed contents of the area protected by the spinlock could be >> + * observed prior to the lock. >> + */ > > By whom? We generally expect that if data is protected by a lock, you > take the > lock before accessing it. If you expect concurrent lockless readers, > then > there's a requirement on the writer side to explicitly provide the > ordering it > requires -- spinlocks are not expected to provide that. More details are in my response to Robin, but there is an API arm64 supports in spinlock.h which is used by lockref to determine whether a lock is free or not. For that code to work properly without adding these barriers, that API needs to take the lock. I tested that configuration, and it cost us heavily in terms of lockref performance in the form of a 30 to 50 percent performance loss. On the other hand, I have not seen any performance degradation due to the introduction of these barriers. > > So, why aren't those observers taking the lock? lockref doesn't take the lock specifically because it is slower. > > What pattern of accesses are made by readers and writers such that > there is a > problem? I added the barriers to the readers/writers because I do not know these are not similarly abused. There is a lot of driver code out there, and ensuring order is the safest way to be sure we don't get burned by something similar to the lockref access. > > What does this result in? > No measureable negative performance impact. However, the lockref performance actually improved slightly (between 1 and 2 percent on my 24-core test system) due to the change. >> +" dmb ish\n" >> +" b 3f\n" >> +"4:\n" >> /* >> * No: spin on the owner. Send a local event to avoid missing an >> * unlock before the exclusive load. 
>> @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t >> *lock) >> " ldaxrh %w2, %4\n" >> " eor %w1, %w2, %w0, lsr #16\n" >> " cbnz %w1, 2b\n" >> - /* We got the lock. Critical section starts here. */ >> + /* >> + * We got the lock and have observed the prior owner's >> store-release. >> + * In this case, the one-way barrier of the prior owner that we >> + * observed combined with the one-way barrier of our load-acquire is >> + * enough to ensure accesses to the protected area coming after this >> + * are not accessed until we own the lock. In this case, other >> + * observers will not see our changes prior to observing the lock >> + * itself. Critical locked section starts here. >> + */ > > Each of these comments ends up covers, and their repeated presence > makes the > code harder to read. If there's a common problem, note it once at the > top of > the file. I added these comments to make it crystal clear that the absence of a barrier at this point was deliberate, and that I did consider each code path. > > Thanks, > Mark.
Hi Brent, Evidently my questions weren't sufficiently clear; even with your answers it's not clear to me what precise issue you're attempting to solve. I've tried to be more specific this time. At a high-level, can you clarify whether you're attempting to solve is: (a) a functional correctness issue (e.g. data corruption) (b) a performance issue And whether this was seen in practice, or found through code inspection? On Sat, Oct 01, 2016 at 12:11:36PM -0400, bdegraaf@codeaurora.org wrote: > On 2016-09-30 15:32, Mark Rutland wrote: > >On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote: > >>+ * so LSE needs an explicit barrier here as well. Without this, the > >>+ * changed contents of the area protected by the spinlock could be > >>+ * observed prior to the lock. > >>+ */ > > > >By whom? We generally expect that if data is protected by a lock, you take > >the lock before accessing it. If you expect concurrent lockless readers, > >then there's a requirement on the writer side to explicitly provide the > >ordering it requires -- spinlocks are not expected to provide that. > > More details are in my response to Robin, but there is an API arm64 supports > in spinlock.h which is used by lockref to determine whether a lock is free or > not. For that code to work properly without adding these barriers, that API > needs to take the lock. Can you please provide a concrete example of a case where things go wrong, citing the code (or access pattern) in question? e.g. as in the commit messages for: 8e86f0b409a44193 ("arm64: atomics: fix use of acquire + release for full barrier semantics) 859b7a0e89120505 ("mm/slub: fix lockups on PREEMPT && !SMP kernels") (for the latter, I messed up the register -> var mapping in one paragraph, but the style and reasoning is otherwise sound). In the absence of a concrete example as above, it's very difficult to reason about what the underlying issue is, and what a valid fix would be for said issue. > >What pattern of accesses are made by readers and writers such that there is > >a problem? Please note here I was asking specifically w.r.t. the lockref code, e.g. which loads could see stale data, and what does the code do based on this value such that there is a problem. > I added the barriers to the readers/writers because I do not know these are > not similarly abused. There is a lot of driver code out there, and ensuring > order is the safest way to be sure we don't get burned by something similar > to the lockref access. Making the architecture-provided primitives overly strong harms performance and efficiency (in general), makes the code harder to maintain and optimise in future, and only masks the issue (which could crop up on other architectures, for instance). Thanks, Mark.
On 2016-10-01 14:11, Mark Rutland wrote: > Hi Brent, > > Evidently my questions weren't sufficiently clear; even with your > answers it's > not clear to me what precise issue you're attempting to solve. I've > tried to be > more specific this time. > > At a high-level, can you clarify whether you're attempting to solve is: > > (a) a functional correctness issue (e.g. data corruption) > (b) a performance issue > > And whether this was seen in practice, or found through code > inspection? > > On Sat, Oct 01, 2016 at 12:11:36PM -0400, bdegraaf@codeaurora.org > wrote: >> On 2016-09-30 15:32, Mark Rutland wrote: >> >On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote: >> >>+ * so LSE needs an explicit barrier here as well. Without this, the >> >>+ * changed contents of the area protected by the spinlock could be >> >>+ * observed prior to the lock. >> >>+ */ >> > >> >By whom? We generally expect that if data is protected by a lock, you take >> >the lock before accessing it. If you expect concurrent lockless readers, >> >then there's a requirement on the writer side to explicitly provide the >> >ordering it requires -- spinlocks are not expected to provide that. >> >> More details are in my response to Robin, but there is an API arm64 >> supports >> in spinlock.h which is used by lockref to determine whether a lock is >> free or >> not. For that code to work properly without adding these barriers, >> that API >> needs to take the lock. > > Can you please provide a concrete example of a case where things go > wrong, > citing the code (or access pattern) in question? e.g. as in the commit > messages > for: > > 8e86f0b409a44193 ("arm64: atomics: fix use of acquire + release for > full barrier semantics) > 859b7a0e89120505 ("mm/slub: fix lockups on PREEMPT && !SMP kernels") > > (for the latter, I messed up the register -> var mapping in one > paragraph, but > the style and reasoning is otherwise sound). > > In the absence of a concrete example as above, it's very difficult to > reason > about what the underlying issue is, and what a valid fix would be for > said > issue. > >> >What pattern of accesses are made by readers and writers such that there is >> >a problem? > > Please note here I was asking specifically w.r.t. the lockref code, > e.g. which > loads could see stale data, and what does the code do based on this > value such > that there is a problem. > >> I added the barriers to the readers/writers because I do not know >> these are >> not similarly abused. There is a lot of driver code out there, and >> ensuring >> order is the safest way to be sure we don't get burned by something >> similar >> to the lockref access. > > Making the architecture-provided primitives overly strong harms > performance and > efficiency (in general), makes the code harder to maintain and optimise > in > future, and only masks the issue (which could crop up on other > architectures, > for instance). > > Thanks, > Mark. Thinking about this, as the reader/writer code has no known "abuse" case, I'll remove it from the patchset, then provide a v2 patchset with a detailed explanation for the lockref problem using the commits you provided as an example, as well as performance consideration. Brent
On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org wrote: > Thinking about this, as the reader/writer code has no known "abuse" > case, I'll remove it from the patchset, then provide a v2 patchset > with a detailed explanation for the lockref problem using the commits > you provided as an example, as well as performance consideration. Please, fix lockref in those patches. Touching the spinlock because lockref does something dubious really is the wrong thing.
On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org wrote: > On 2016-10-01 14:11, Mark Rutland wrote: > >Hi Brent, > > > >Evidently my questions weren't sufficiently clear; even with your > >answers it's not clear to me what precise issue you're attempting to > >solve. I've tried to be more specific this time. > > > >At a high-level, can you clarify whether you're attempting to solve is: > > > >(a) a functional correctness issue (e.g. data corruption) > >(b) a performance issue > > > >And whether this was seen in practice, or found through code > >inspection? > Thinking about this, as the reader/writer code has no known "abuse" > case, I'll remove it from the patchset, then provide a v2 patchset > with a detailed explanation for the lockref problem using the commits > you provided as an example, as well as performance consideration. If there's a functional problem, let's consider that in isolation first. Once we understand that, then we can consider doing what is optimal. As should be obvious from the above, I'm confused because this patch conflates functional details with performance optimisations which (to me) sound architecturally dubious. I completely agree with Peter that if the problem lies with lockref, it should be solved in the lockref code. Thanks, Mark.
On 2016-10-04 06:12, Mark Rutland wrote: > On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org > wrote: >> On 2016-10-01 14:11, Mark Rutland wrote: >> >Hi Brent, >> > >> >Evidently my questions weren't sufficiently clear; even with your >> >answers it's not clear to me what precise issue you're attempting to >> >solve. I've tried to be more specific this time. >> > >> >At a high-level, can you clarify whether you're attempting to solve is: >> > >> >(a) a functional correctness issue (e.g. data corruption) >> >(b) a performance issue >> > >> >And whether this was seen in practice, or found through code >> >inspection? > >> Thinking about this, as the reader/writer code has no known "abuse" >> case, I'll remove it from the patchset, then provide a v2 patchset >> with a detailed explanation for the lockref problem using the commits >> you provided as an example, as well as performance consideration. > > If there's a functional problem, let's consider that in isolation > first. > Once we understand that, then we can consider doing what is optimal. > > As should be obvious from the above, I'm confused because this patch > conflates functional details with performance optimisations which (to > me) sound architecturally dubious. > > I completely agree with Peter that if the problem lies with lockref, it > should be solved in the lockref code. > > Thanks, > Mark. After looking at this, the problem is not with the lockref code per se: it is a problem with arch_spin_value_unlocked(). In the out-of-order case, arch_spin_value_unlocked() can return TRUE for a spinlock that is in fact locked but the lock is not observable yet via an ordinary load. Other than ensuring order on the locking side (as the prior patch did), there is a way to make arch_spin_value_unlocked()'s TRUE return value deterministic, but it requires that it does a write-back to the lock to ensure we didn't observe the unlocked value while another agent was in the process of writing back a locked value. Brent
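[Editor's note] For reference, the helper Brent is discussing performs no memory access of its own; in this era's arch/arm64/include/asm/spinlock.h it is (approximately) just a comparison of the two ticket halves of a value the caller has already loaded, so any ordering question attaches to the caller's load rather than to the helper:

/*
 * Approximate form of the arm64 ticket-lock helper under discussion:
 * the value is considered unlocked when the next-ticket and
 * owner-ticket halves are equal. Note there is no load here at all.
 */
static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
	return lock.owner == lock.next;
}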
On 2016-10-04 13:53, bdegraaf@codeaurora.org wrote: > On 2016-10-04 06:12, Mark Rutland wrote: >> On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org >> wrote: >>> On 2016-10-01 14:11, Mark Rutland wrote: >>> >Hi Brent, >>> > >>> >Evidently my questions weren't sufficiently clear; even with your >>> >answers it's not clear to me what precise issue you're attempting to >>> >solve. I've tried to be more specific this time. >>> > >>> >At a high-level, can you clarify whether you're attempting to solve is: >>> > >>> >(a) a functional correctness issue (e.g. data corruption) >>> >(b) a performance issue >>> > >>> >And whether this was seen in practice, or found through code >>> >inspection? >> >>> Thinking about this, as the reader/writer code has no known "abuse" >>> case, I'll remove it from the patchset, then provide a v2 patchset >>> with a detailed explanation for the lockref problem using the commits >>> you provided as an example, as well as performance consideration. >> >> If there's a functional problem, let's consider that in isolation >> first. >> Once we understand that, then we can consider doing what is optimal. >> >> As should be obvious from the above, I'm confused because this patch >> conflates functional details with performance optimisations which (to >> me) sound architecturally dubious. >> >> I completely agree with Peter that if the problem lies with lockref, >> it >> should be solved in the lockref code. >> >> Thanks, >> Mark. > > After looking at this, the problem is not with the lockref code per se: > it is > a problem with arch_spin_value_unlocked(). In the out-of-order case, > arch_spin_value_unlocked() can return TRUE for a spinlock that is in > fact > locked but the lock is not observable yet via an ordinary load. Other > than > ensuring order on the locking side (as the prior patch did), there is a > way > to make arch_spin_value_unlock's TRUE return value deterministic, but > it > requires that it does a write-back to the lock to ensure we didn't > observe > the unlocked value while another agent was in process of writing back a > locked value. > > Brent Scratch that--things get complicated as the lock itself gets "cloned," which could happen during the out-of-order window. I'll post back later after I've analyzed it fully.
Hi Brent, Could you *please* clarify if you are trying to solve: (a) a correctness issue (e.g. data corruption) seen in practice. (b) a correctness issue (e.g. data corruption) found by inspection. (c) A performance issue, seen in practice. (d) A performance issue, found by inspection. Any one of these is fine; we just need to know in order to be able to help effectively, and so far it hasn't been clear. On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf@codeaurora.org wrote: > After looking at this, the problem is not with the lockref code per > se: it is a problem with arch_spin_value_unlocked(). In the > out-of-order case, arch_spin_value_unlocked() can return TRUE for a > spinlock that is in fact locked but the lock is not observable yet via > an ordinary load. Given arch_spin_value_unlocked() doesn't perform any load itself, I assume the ordinary load that you are referring to is the READ_ONCE() early in CMPXCHG_LOOP(). It's worth noting that even if we ignore ordering and assume a sequentially-consistent machine, READ_ONCE() can give us a stale value. We could perform the read, then another agent can acquire the lock, then we can move onto the cmpxchg(), i.e. CPU0 CPU1 old = READ_ONCE(x.lock_val) spin_lock(x.lock) cmpxchg(x.lock_val, old, new) spin_unlock(x.lock) If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg should return an up-to-date value which we will then retry with. > Other than ensuring order on the locking side (as the prior patch > did), there is a way to make arch_spin_value_unlock's TRUE return > value deterministic, In general, this cannot be made deterministic. As above, there is a race that cannot be avoided. > but it requires that it does a write-back to the lock to ensure we > didn't observe the unlocked value while another agent was in process > of writing back a locked value. The cmpxchg gives us this guarantee. If it successfully stores, then the value it observed was the same as READ_ONCE() saw, and the update was atomic. There *could* have been an intervening sequence between the READ_ONCE and cmpxchg (e.g. put(); get()) but that's not problematic for lockref. Until you've taken your reference it was possible that things changed underneath you. Thanks, Mark.
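[Editor's note] Mark's point about stale values being benign can be illustrated with a small generic retry loop (a hypothetical example on a plain 64-bit counter, not code from the thread): the cmpxchg either succeeds against exactly the value that was read, or it fails and returns the current value, which the caller then retries with.

#include <linux/types.h>
#include <linux/atomic.h>
#include <linux/compiler.h>

/* Illustrative only: increment a 64-bit counter with a cmpxchg retry loop. */
static void add_one(u64 *val)
{
	u64 old = READ_ONCE(*val);

	for (;;) {
		u64 seen = cmpxchg64(val, old, old + 1);

		if (seen == old)
			break;		/* our snapshot was still current */
		old = seen;		/* lost the race: retry with the fresh value */
	}
}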
On 2016-10-04 15:12, Mark Rutland wrote: > Hi Brent, > > Could you *please* clarify if you are trying to solve: > > (a) a correctness issue (e.g. data corruption) seen in practice. > (b) a correctness issue (e.g. data corruption) found by inspection. > (c) A performance issue, seen in practice. > (d) A performance issue, found by inspection. > > Any one of these is fine; we just need to know in order to be able to > help effectively, and so far it hasn't been clear. > > On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf@codeaurora.org > wrote: >> After looking at this, the problem is not with the lockref code per >> se: it is a problem with arch_spin_value_unlocked(). In the >> out-of-order case, arch_spin_value_unlocked() can return TRUE for a >> spinlock that is in fact locked but the lock is not observable yet via >> an ordinary load. > > Given arch_spin_value_unlocked() doesn't perform any load itself, I > assume the ordinary load that you are referring to is the READ_ONCE() > early in CMPXCHG_LOOP(). > > It's worth noting that even if we ignore ordering and assume a > sequentially-consistent machine, READ_ONCE() can give us a stale value. > We could perform the read, then another agent can acquire the lock, > then > we can move onto the cmpxchg(), i.e. > > CPU0 CPU1 > old = READ_ONCE(x.lock_val) > spin_lock(x.lock) > cmpxchg(x.lock_val, old, new) > spin_unlock(x.lock) > > If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg > should return an up-to-date value which we will then retry with. > >> Other than ensuring order on the locking side (as the prior patch >> did), there is a way to make arch_spin_value_unlock's TRUE return >> value deterministic, > > In general, this cannot be made deterministic. As above, there is a > race > that cannot be avoided. > >> but it requires that it does a write-back to the lock to ensure we >> didn't observe the unlocked value while another agent was in process >> of writing back a locked value. > > The cmpxchg gives us this guarantee. If it successfully stores, then > the > value it observed was the same as READ_ONCE() saw, and the update was > atomic. > > There *could* have been an intervening sequence between the READ_ONCE > and cmpxchg (e.g. put(); get()) but that's not problematic for lockref. > Until you've taken your reference it was possible that things changed > underneath you. > > Thanks, > Mark. Mark, I found the problem. Back in September of 2013, arm64 atomics were broken due to missing barriers in certain situations, but the problem at that time was undiscovered. Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at that time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed. d2212b4 appeared to be OK at that time because the additional barrier requirements of this specific code sequence were not yet discovered, and this change was consistent with the arm64 atomic code of that time. Around February of 2014, some discovery led Will to correct the problem with the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which has an excellent explanation of potential ordering problems with the same code sequence used by lockref.c. With this updated understanding, the earlier commit (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted. Because acquire/release semantics are insufficient for the full ordering, the single barrier after the store exclusive is the best approach, similar to Will's atomic barrier fix. Best regards, Brent
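[Editor's note] At the call site, the change Brent is proposing amounts to dropping the _relaxed suffix. A hedged sketch of the two variants follows (hypothetical wrapper and config symbol; the LL/SC expansions in the comments follow the pattern described in commit 8e86f0b rather than disassembly of any particular build):

#include <linux/types.h>
#include <linux/lockref.h>

/* ILLUSTRATIVE_RELAXED is a hypothetical switch, used only for comparison. */
static u64 lockref_cmpxchg(struct lockref *lockref, u64 old, u64 new)
{
#ifdef ILLUSTRATIVE_RELAXED
	/*
	 * Current code (since d2212b4): no ordering guarantees.
	 * Approximate LL/SC expansion:
	 *   1: ldxr  x0, [ptr]
	 *      cmp   x0, old ; b.ne 2f
	 *      stxr  w1, new, [ptr]
	 *      cbnz  w1, 1b
	 *   2:
	 */
	return cmpxchg64_relaxed(&lockref->lock_count, old, new);
#else
	/*
	 * What reverting d2212b4 would restore: a fully ordered cmpxchg,
	 * which on arm64 after 8e86f0b is built as a release store plus a
	 * trailing barrier. Approximate LL/SC expansion:
	 *   1: ldxr  x0, [ptr]
	 *      cmp   x0, old ; b.ne 2f
	 *      stlxr w1, new, [ptr]
	 *      cbnz  w1, 1b
	 *      dmb   ish
	 *   2:
	 */
	return cmpxchg64(&lockref->lock_count, old, new);
#endif
}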
On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf@codeaurora.org wrote: > On 2016-10-04 15:12, Mark Rutland wrote: > >Hi Brent, > > > >Could you *please* clarify if you are trying to solve: > > > >(a) a correctness issue (e.g. data corruption) seen in practice. > >(b) a correctness issue (e.g. data corruption) found by inspection. > >(c) A performance issue, seen in practice. > >(d) A performance issue, found by inspection. > > > >Any one of these is fine; we just need to know in order to be able to > >help effectively, and so far it hasn't been clear. Brent, you forgot to state which: 'a-d' is the case here. > I found the problem. > > Back in September of 2013, arm64 atomics were broken due to missing barriers > in certain situations, but the problem at that time was undiscovered. > > Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at > that > time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed. > > d2212b4 appeared to be OK at that time because the additional barrier > requirements of this specific code sequence were not yet discovered, and > this change was consistent with the arm64 atomic code of that time. > > Around February of 2014, some discovery led Will to correct the problem with > the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which > has an excellent explanation of potential ordering problems with the same > code sequence used by lockref.c. > > With this updated understanding, the earlier commit > (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted. > > Because acquire/release semantics are insufficient for the full ordering, > the single barrier after the store exclusive is the best approach, similar > to Will's atomic barrier fix. This again does not in fact describe the problem. What is the problem with lockref, and how (refer the earlier a-d multiple choice answer) was this found. Now, I have been looking, and we have some idea what you _might_ be alluding to, but please explain which accesses get reordered how and cause problems.
On 2016-10-05 10:55, bdegraaf@codeaurora.org wrote: > On 2016-10-04 15:12, Mark Rutland wrote: >> Hi Brent, >> >> Could you *please* clarify if you are trying to solve: >> >> (a) a correctness issue (e.g. data corruption) seen in practice. >> (b) a correctness issue (e.g. data corruption) found by inspection. >> (c) A performance issue, seen in practice. >> (d) A performance issue, found by inspection. >> >> Any one of these is fine; we just need to know in order to be able to >> help effectively, and so far it hasn't been clear. >> >> On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf@codeaurora.org >> wrote: >>> After looking at this, the problem is not with the lockref code per >>> se: it is a problem with arch_spin_value_unlocked(). In the >>> out-of-order case, arch_spin_value_unlocked() can return TRUE for a >>> spinlock that is in fact locked but the lock is not observable yet >>> via >>> an ordinary load. >> >> Given arch_spin_value_unlocked() doesn't perform any load itself, I >> assume the ordinary load that you are referring to is the READ_ONCE() >> early in CMPXCHG_LOOP(). >> >> It's worth noting that even if we ignore ordering and assume a >> sequentially-consistent machine, READ_ONCE() can give us a stale >> value. >> We could perform the read, then another agent can acquire the lock, >> then >> we can move onto the cmpxchg(), i.e. >> >> CPU0 CPU1 >> old = READ_ONCE(x.lock_val) >> spin_lock(x.lock) >> cmpxchg(x.lock_val, old, new) >> spin_unlock(x.lock) >> >> If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg >> should return an up-to-date value which we will then retry with. >> >>> Other than ensuring order on the locking side (as the prior patch >>> did), there is a way to make arch_spin_value_unlock's TRUE return >>> value deterministic, >> >> In general, this cannot be made deterministic. As above, there is a >> race >> that cannot be avoided. >> >>> but it requires that it does a write-back to the lock to ensure we >>> didn't observe the unlocked value while another agent was in process >>> of writing back a locked value. >> >> The cmpxchg gives us this guarantee. If it successfully stores, then >> the >> value it observed was the same as READ_ONCE() saw, and the update was >> atomic. >> >> There *could* have been an intervening sequence between the READ_ONCE >> and cmpxchg (e.g. put(); get()) but that's not problematic for >> lockref. >> Until you've taken your reference it was possible that things changed >> underneath you. >> >> Thanks, >> Mark. > > Mark, > > I found the problem. > > Back in September of 2013, arm64 atomics were broken due to missing > barriers > in certain situations, but the problem at that time was undiscovered. > > Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in > at that > time and changed the correct cmpxchg64 in lockref.c to > cmpxchg64_relaxed. > > d2212b4 appeared to be OK at that time because the additional barrier > requirements of this specific code sequence were not yet discovered, > and > this change was consistent with the arm64 atomic code of that time. > > Around February of 2014, some discovery led Will to correct the problem > with > the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, > which > has an excellent explanation of potential ordering problems with the > same > code sequence used by lockref.c. > > With this updated understanding, the earlier commit > (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted. 
> > Because acquire/release semantics are insufficient for the full > ordering, > the single barrier after the store exclusive is the best approach, > similar > to Will's atomic barrier fix. > > Best regards, > Brent FYI, this is a "b" type fix (correctness fix based on code inspection).
On 2016-10-05 11:10, Peter Zijlstra wrote: > On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf@codeaurora.org > wrote: >> On 2016-10-04 15:12, Mark Rutland wrote: >> >Hi Brent, >> > >> >Could you *please* clarify if you are trying to solve: >> > >> >(a) a correctness issue (e.g. data corruption) seen in practice. >> >(b) a correctness issue (e.g. data corruption) found by inspection. >> >(c) A performance issue, seen in practice. >> >(d) A performance issue, found by inspection. >> > >> >Any one of these is fine; we just need to know in order to be able to >> >help effectively, and so far it hasn't been clear. > > Brent, you forgot to state which: 'a-d' is the case here. > >> I found the problem. >> >> Back in September of 2013, arm64 atomics were broken due to missing >> barriers >> in certain situations, but the problem at that time was undiscovered. >> >> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in >> at >> that >> time and changed the correct cmpxchg64 in lockref.c to >> cmpxchg64_relaxed. >> >> d2212b4 appeared to be OK at that time because the additional barrier >> requirements of this specific code sequence were not yet discovered, >> and >> this change was consistent with the arm64 atomic code of that time. >> >> Around February of 2014, some discovery led Will to correct the >> problem with >> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, >> which >> has an excellent explanation of potential ordering problems with the >> same >> code sequence used by lockref.c. >> >> With this updated understanding, the earlier commit >> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted. >> >> Because acquire/release semantics are insufficient for the full >> ordering, >> the single barrier after the store exclusive is the best approach, >> similar >> to Will's atomic barrier fix. > > This again does not in fact describe the problem. > > What is the problem with lockref, and how (refer the earlier a-d > multiple choice answer) was this found. > > Now, I have been looking, and we have some idea what you _might_ be > alluding to, but please explain which accesses get reordered how and > cause problems. Sorry for the confusion, this was a "b" item (correctness fix based on code inspection. I had sent an answer to this yesterday, but didn't realize that it was in a separate, private email thread. I'll work out the before/after problem scenarios and send them along once I've hashed them out (it may take a while for me to paint a clear picture). In the meantime, however, consider that even without the spinlock code in the picture, lockref needs to treat the cmpxchg as a full system-level atomic, because multiple agents could access the value in a variety of timings. Since atomics similar to this are barriered on arm64 since 8e86f0b, the access to lockref should be similar. Brent
On 2016-10-05 11:30, bdegraaf@codeaurora.org wrote: > On 2016-10-05 11:10, Peter Zijlstra wrote: >> On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf@codeaurora.org >> wrote: >>> On 2016-10-04 15:12, Mark Rutland wrote: >>> >Hi Brent, >>> > >>> >Could you *please* clarify if you are trying to solve: >>> > >>> >(a) a correctness issue (e.g. data corruption) seen in practice. >>> >(b) a correctness issue (e.g. data corruption) found by inspection. >>> >(c) A performance issue, seen in practice. >>> >(d) A performance issue, found by inspection. >>> > >>> >Any one of these is fine; we just need to know in order to be able to >>> >help effectively, and so far it hasn't been clear. >> >> Brent, you forgot to state which: 'a-d' is the case here. >> >>> I found the problem. >>> >>> Back in September of 2013, arm64 atomics were broken due to missing >>> barriers >>> in certain situations, but the problem at that time was undiscovered. >>> >>> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in >>> at >>> that >>> time and changed the correct cmpxchg64 in lockref.c to >>> cmpxchg64_relaxed. >>> >>> d2212b4 appeared to be OK at that time because the additional barrier >>> requirements of this specific code sequence were not yet discovered, >>> and >>> this change was consistent with the arm64 atomic code of that time. >>> >>> Around February of 2014, some discovery led Will to correct the >>> problem with >>> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, >>> which >>> has an excellent explanation of potential ordering problems with the >>> same >>> code sequence used by lockref.c. >>> >>> With this updated understanding, the earlier commit >>> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted. >>> >>> Because acquire/release semantics are insufficient for the full >>> ordering, >>> the single barrier after the store exclusive is the best approach, >>> similar >>> to Will's atomic barrier fix. >> >> This again does not in fact describe the problem. >> >> What is the problem with lockref, and how (refer the earlier a-d >> multiple choice answer) was this found. >> >> Now, I have been looking, and we have some idea what you _might_ be >> alluding to, but please explain which accesses get reordered how and >> cause problems. > > Sorry for the confusion, this was a "b" item (correctness fix based on > code > inspection. I had sent an answer to this yesterday, but didn't realize > that > it was in a separate, private email thread. > > I'll work out the before/after problem scenarios and send them along > once > I've hashed them out (it may take a while for me to paint a clear > picture). > In the meantime, however, consider that even without the spinlock code > in > the picture, lockref needs to treat the cmpxchg as a full system-level > atomic, > because multiple agents could access the value in a variety of timings. > Since > atomics similar to this are barriered on arm64 since 8e86f0b, the > access to > lockref should be similar. > > Brent I am still working through some additional analyses for mixed accesses, but I thought I'd send along some sample commit text for the fix as it currently stands. Please feel free to comment if you see something that needs clarification. Brent Text: All arm64 lockref accesses that occur without taking the spinlock must behave like true atomics, ensuring successive operations are all done sequentially. 
Currently the lockref accesses, when decompiled, look like the following sequence:

  <Lockref "unlocked" Access [A]>

  // Lockref "unlocked" (B)
  1: ldxr   x0, [B]      // Exclusive load
     <change lock_count B>
     stxr   w1, x0, [B]
     cbnz   w1, 1b

  <Lockref "unlocked" Access [C]>

Even though access to the lock_count is protected by exclusives, this is not enough to guarantee order: the lock_count must change atomically, in order, so the only permitted ordering would be:

  A -> B -> C

Unfortunately, this is not the case by the letter of the architecture and, in fact, the accesses to A and C are not protected by any sort of barrier, and hence are permitted to reorder freely, resulting in orderings such as

  Bl -> A -> C -> Bs

In this specific scenario, since "change lock_count" could be an increment, a decrement or even a set to a specific value, there could be trouble. With more agents accessing the lockref without taking the lock, even scenarios where the cmpxchg passes falsely can be encountered, as there is no guarantee that the "old" value will not match exactly a newer value due to out-of-order access by a combination of agents that increment and decrement the lock_count by the same amount.

Since multiple agents are accessing this without locking the spinlock, this access must have the same protections in place as atomics do in the arch's atomic.h. Fortunately, the fix is not complicated: merely removing the errant _relaxed option on the cmpxchg64 is enough to introduce exactly the same code sequence justified in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67 to fix arm64 atomics:

  1: ldxr   x0, [B]
     <change lock_count>
     stlxr  w1, x0, [B]
     cbnz   w1, 1b
     dmb    ish
Brent, On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf@codeaurora.org wrote: > I am still working through some additional analyses for mixed accesses, > but I thought I'd send along some sample commit text for the fix as it > currently stands. Please feel free to comment if you see something that > needs clarification. Everything from this point down needs clarification. > All arm64 lockref accesses that occur without taking the spinlock must > behave like true atomics, ensuring successive operations are all done > sequentially. What is a "true atomic"? What do you mean by "successive"? What do you mean by "done sequentially"? The guarantee provided by lockref is that, if you hold the spinlock, then you don't need to use atomics to inspect the reference count, as it is guaranteed to be stable. You can't just go around replacing spin_lock calls with lockref_get -- that's not what this is about. > Currently > the lockref accesses, when decompiled, look like the following sequence: > > <Lockref "unlocked" Access [A]> > > // Lockref "unlocked" (B) > 1: ldxr x0, [B] // Exclusive load > <change lock_count B> > stxr w1, x0, [B] > cbnz w1, 1b > > <Lockref "unlocked" Access [C]> > > Even though access to the lock_count is protected by exclusives, this is not > enough > to guarantee order: The lock_count must change atomically, in order, so the > only > permitted ordering would be: > A -> B -> C Says who? Please point me at a piece of code that relies on this. I'm willing to believe that are bugs in this area, but waving your hands around and saying certain properties "must" hold is not helpful unless you can say *why* they must hold and *where* that is required. > Unfortunately, this is not the case by the letter of the architecture and, > in fact, > the accesses to A and C are not protected by any sort of barrier, and hence > are > permitted to reorder freely, resulting in orderings such as > > Bl -> A -> C -> Bs Again, why is this a problem? It's exactly the same as if you did: spin_lock(lock); inc_ref_cnt(); spin_unlock(lock); Accesses outside of the critical section can still be reordered. Big deal. > In this specific scenario, since "change lock_count" could be an > increment, a decrement or even a set to a specific value, there could be > trouble. What trouble? > With more agents accessing the lockref without taking the lock, even > scenarios where the cmpxchg passes falsely can be encountered, as there is > no guarantee that the the "old" value will not match exactly a newer value > due to out-of-order access by a combination of agents that increment and > decrement the lock_count by the same amount. This is the A-B-A problem, but I don't see why it affects us here. We're dealing with a single reference count. > Since multiple agents are accessing this without locking the spinlock, > this access must have the same protections in place as atomics do in the > arch's atomic.h. Why? I don't think that it does. Have a look at how lockref is used by the dcache code: it's really about keeping a reference to a dentry, which may be in the process of being unhashed and removed. The interaction with concurrent updaters to the dentry itself is handled using a seqlock, which does have the necessary barriers. Yes, the code is extremely complicated, but given that you're reporting issues based on code inspection, then you'll need to understand what you're changing. 
> Fortunately, the fix is not complicated: merely removing the errant > _relaxed option on the cmpxchg64 is enough to introduce exactly the same > code sequence justified in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67 > to fix arm64 atomics. I introduced cmpxchg64_relaxed precisely for the lockref case. I still don't see a compelling reason to strengthen it. If you think there's a bug, please spend the effort to describe how it manifests and what can actually go wrong in the existing codebase. Your previous patches fixing so-called bugs found by inspection have both turned out to be bogus, so I'm sorry, but I'm not exactly leaping on your contributions to this. Will
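[Editor's note] For readers unfamiliar with the dcache pattern Will points at, the ordering for lockless readers comes from a seqcount retry loop around the fields being read, not from the lockref cmpxchg itself. A generic sketch using the stock seqcount API (hypothetical `struct example`; not the actual fs/dcache.c code):

#include <linux/seqlock.h>

/* Hypothetical structure standing in for a dentry-like object. */
struct example {
	seqcount_t seq;
	int field_a;
	int field_b;
};

/*
 * Lockless read side: retry until a consistent snapshot is seen. The
 * barriers inside read_seqcount_begin()/read_seqcount_retry() provide the
 * ordering against a concurrent writer who bumps 'seq' around its updates.
 */
static void read_fields(struct example *e, int *a, int *b)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&e->seq);
		*a = e->field_a;
		*b = e->field_b;
	} while (read_seqcount_retry(&e->seq, seq));
}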
On 2016-10-13 07:02, Will Deacon wrote: > Brent, > > On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf@codeaurora.org > wrote: > > Everything from this point down needs clarification. > >> All arm64 lockref accesses that occur without taking the spinlock must >> behave like true atomics, ensuring successive operations are all done >> sequentially. > > What is a "true atomic"? What do you mean by "successive"? What do you > mean by "done sequentially"? > Despite the use case in dentry, lockref itself is a generic locking API, and any problems I describe here are with the generic API itself, not necessarily the dentry use case. I'm not patching dentry--I'm fixing lockref. By necessity, the API must do its update atomically, as keeping things correct involves potential spinlock access by other agents which may opt to use the spinlock API or the lockref API at their discretion. With the current arm64 spinlock implementation, it is possible for the lockref to observe the changed contents of the protected count without observing the spinlock being locked, which could lead to missed changes to the lock_count itself, because any calculations made on it could be overwritten or completed in a different sequence. (Spinlock locked access is obtained with a simple store under certain scenarios. My attempt to fix this in the spinlock code was met with resistance saying it should be addressed in lockref, since that is the API that would encounter the issue. These changes were initiated in response to that request. Additional ordering problems were uncovered when I looked into lockref itself.) The example below involves only a single agent exactly as you explain the problem in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67. Even for a single execution agent, this means that the code below could access out of order. As lockref is a generic API, it doesn't matter whether dentry does this or not. By "done sequentially," I mean "accessed in program order." As far as "true atomic" goes, I am referring to an atomic in the same sense you did in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67. > The guarantee provided by lockref is that, if you hold the spinlock, > then > you don't need to use atomics to inspect the reference count, as it is > guaranteed to be stable. You can't just go around replacing spin_lock > calls with lockref_get -- that's not what this is about. > I am not sure where you got the idea that I was referring to replacing spinlocks with lockref calls. That is not the foundation for this fix. >> Currently >> the lockref accesses, when decompiled, look like the following >> sequence: >> >> <Lockref "unlocked" Access [A]> >> >> // Lockref "unlocked" (B) >> 1: ldxr x0, [B] // Exclusive load >> <change lock_count B> >> stxr w1, x0, [B] >> cbnz w1, 1b >> >> <Lockref "unlocked" Access [C]> >> >> Even though access to the lock_count is protected by exclusives, this >> is not >> enough >> to guarantee order: The lock_count must change atomically, in order, >> so the >> only >> permitted ordering would be: >> A -> B -> C > > Says who? Please point me at a piece of code that relies on this. I'm > willing to believe that are bugs in this area, but waving your hands > around > and saying certain properties "must" hold is not helpful unless you can > say *why* they must hold and *where* that is required. > The lockref code must access in order, because other agents can observe it via spinlock OR lockref APIs. Again, this is a generic API, not an explicit part of dentry. 
Other code will use it, and the manner in which it is used in dentry is not relevant. What lock_count is changed to is not prescribed by the lockref API. There is no guarantee whether it will be an add, subtract, multiply, divide, set to some explicit value, etc. But the changes must be done in program order and observable in that same order by other agents; therefore, the spinlock and lock_count must be accessed atomically, and observed to change atomically at the system level. I am not off base saying lockref is an atomic access. Here are some references: Under Documentation/filesystems/path-lookup.md, the dentry->d_lockref mechanism is described as an atomic access. At the time lockref was introduced, The Linux Foundation gave a presentation at LinuxCon 2014 that can be found at the following link: https://events.linuxfoundation.org/sites/events/files/slides/linuxcon-2014-locking-final.pdf On page 46, it outlines the lockref API. The first lines of the slide give the relevant details. Lockref • *Generic* mechanism to *atomically* update a reference count that is protected by a spinlock without actually acquiring the spinlock itself. While dentry's use is mentioned, this API is not restricted to the use case of dentry. >> Unfortunately, this is not the case by the letter of the architecture >> and, >> in fact, >> the accesses to A and C are not protected by any sort of barrier, and >> hence >> are >> permitted to reorder freely, resulting in orderings such as >> >> Bl -> A -> C -> Bs > > Again, why is this a problem? It's exactly the same as if you did: > > spin_lock(lock); > inc_ref_cnt(); > spin_unlock(lock); > > Accesses outside of the critical section can still be reordered. Big > deal. > Since the current code resembles but actually has *fewer* ordering effects compared to the example used by your atomic.h commit, even though A->B->C is in program order, those accesses could be performed out of order according to your own commit text on commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67. Taking spin_lock/spin_unlock, however, includes ordering by nature of the load-acquire observing the store-release of a prior unlock, so ordering is enforced with the spinlock version of accesses. The lockref itself has *no* ordering enforced, unless a locked state is encountered and it falls back to the spinlock code. So this is a fundamental difference between lockref and spinlock. So, no, lockref ordering is currently not exactly the same as spinlock--but it should be. >> In this specific scenario, since "change lock_count" could be an >> increment, a decrement or even a set to a specific value, there could >> be >> trouble. > > What trouble? > Take, for example, a use case where the ref count is either positive or zero. If increments and decrements hit out of order, a decrement that was supposed to come after an increment would instead do nothing if the value of the count started at zero. Then when the increment hit later, the ref count would remain positive, with a net effect of +1 to the ref count instead of +1-1=0. Again, however, the lockref does not specify how the contents of lock_count are manipulated; it was only meant to guarantee that the updates are done atomically when the lock is not held.
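To make that concrete, below is a minimal user-space model of the lockless update pattern under discussion (a sketch only: this is not the kernel's lib/lockref.c, and the packing, the masks, and the model_lockref_get name are invented for illustration). It mirrors the ldxr/stxr sequence quoted earlier: a 64-bit word packs a lock half and a count half, and the count is bumped with a single relaxed compare-and-swap while the lock half reads as unlocked, so the update of the word itself is atomic but nothing orders the accesses around it.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOCKED_MASK 0xffffffff00000000ull   /* model: high half stands in for the spinlock */
#define COUNT_MASK  0x00000000ffffffffull   /* model: low half is the reference count      */

static _Atomic uint64_t lock_count;         /* packed lock + count word */

static bool model_lockref_get(void)
{
        uint64_t old = atomic_load_explicit(&lock_count, memory_order_relaxed);

        while ((old & LOCKED_MASK) == 0) {          /* lock half observed unlocked */
                uint64_t newval = old + 1;          /* bump the count half */

                /*
                 * Relaxed CAS: the packed word changes atomically, but
                 * nothing before or after this call is ordered by it.
                 */
                if (atomic_compare_exchange_weak_explicit(&lock_count, &old, newval,
                                                          memory_order_relaxed,
                                                          memory_order_relaxed))
                        return true;
                /* CAS failed: 'old' now holds the current value; retry. */
        }
        return false;   /* lock half looked locked: a real caller would fall back to the spinlock */
}

int main(void)
{
        if (model_lockref_get())
                printf("count = %llu\n",
                       (unsigned long long)(atomic_load(&lock_count) & COUNT_MASK));
        return 0;
}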
>> With more agents accessing the lockref without taking the lock, even >> scenarios where the cmpxchg passes falsely can be encountered, as >> there is >> no guarantee that the "old" value will not match exactly a newer >> value >> due to out-of-order access by a combination of agents that increment >> and >> decrement the lock_count by the same amount. > > This is the A-B-A problem, but I don't see why it affects us here. > We're > dealing with a single reference count. > If lockref accesses were to occur on many PEs, there are all sorts of things that could happen in terms of which agent wins and what value lock_count ends up with. My point is simply that each access should be atomic because lockref is a generic API and was intended to be a lockless atomic access. Leaving this problem open until someone else introduces a use that exposes it, which could happen in the main kernel code, is probably not a good idea, as it could prove difficult to track down. >> Since multiple agents are accessing this without locking the spinlock, >> this access must have the same protections in place as atomics do in >> the >> arch's atomic.h. > > Why? I don't think that it does. Have a look at how lockref is used by > the dcache code: it's really about keeping a reference to a dentry, > which may be in the process of being unhashed and removed. The > interaction with concurrent updaters to the dentry itself is handled > using a seqlock, which does have the necessary barriers. Yes, the code > is extremely complicated, but given that you're reporting issues based > on code inspection, then you'll need to understand what you're > changing. > Again, this is a generic API, not an API married to dentry. If it were for dentry's sole use, it should not be accessible outside of the dentry code. While the cmpxchg64_relaxed case may be OK for dentry, it is not OK for the generic case. >> Fortunately, the fix is not complicated: merely removing the errant >> _relaxed option on the cmpxchg64 is enough to introduce exactly the >> same >> code sequence justified in commit >> 8e86f0b409a44193f1587e87b69c5dcf8f65be67 >> to fix arm64 atomics. > I introduced cmpxchg64_relaxed precisely for the lockref case. I still > don't see a compelling reason to strengthen it. If you think there's a > bug, > please spend the effort to describe how it manifests and what can > actually > go wrong in the existing codebase. Your previous patches fixing > so-called > bugs found by inspection have both turned out to be bogus, so I'm > sorry, > but I'm not exactly leaping on your contributions to this. > > Will I have detailed the problems here, and they are with the generic case, no hand waving required. On a further note, it is not accurate to say that my prior patches were bogus: One called attention to a yet-to-be-corrected problem in the ARMv8 Programmer's Guide, and the other was sidestepped by a refactor that addressed the problem I set out to fix with a control flow change. Since that problem was the fundamental reason I had worked on the gettime code in the first place, I abandoned my effort. The refactor that fixed the control-flow problem, however, is still missing on v4.7 and earlier kernels (sequence lock logic should be verified prior to the isb that demarcates the virtual counter register read).
I have confirmed this is an issue on various ARMv8 hardware, sometimes obtaining identical register values from multiple reads that were delayed enough that they should have shown changes--evidence that the register read occurred before the seqlock update had finished (the control-flow problem). Brent
On Thu, Oct 13, 2016 at 04:00:47PM -0400, bdegraaf@codeaurora.org wrote: > On 2016-10-13 07:02, Will Deacon wrote: > >Brent, > > > >On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf@codeaurora.org wrote: > > > >Everything from this point down needs clarification. > > > >>All arm64 lockref accesses that occur without taking the spinlock must > >>behave like true atomics, ensuring successive operations are all done > >>sequentially. > > > >What is a "true atomic"? What do you mean by "successive"? What do you > >mean by "done sequentially"? > > Despite the use case in dentry, lockref itself is a generic locking API, and > any problems I describe here are with the generic API itself, not necessarily > the dentry use case. I'm not patching dentry--I'm fixing lockref. Having spent the best part of a week looking at this myself, my view is that if there is any problem it's simply that the semantics of lockref are unclear; we can fix that by clarifying the imprecise wording in the lockref documentation (i.e. the comment block in <linux/lockref.h>). As far as I can tell, the only fundamental requirement is that an agent holding the lock won't see the count change under its feet. Ordering requirements for agents performing updates without holding the lock were never strictly defined, and the de-facto ordering requirements of existing users (e.g. none in the case of the dentry cache) appear to be met. [...] > At the time lockref was introduced, The Linux Foundation gave a presentation > at LinuxCon 2014 that can be found at the following link: > > https://events.linuxfoundation.org/sites/events/files/slides/linuxcon-2014-locking-final.pdf > > On page 46, it outlines the lockref API. The first lines of the slide give > the > relevant details. > > Lockref > • *Generic* mechanism to *atomically* update a reference count that is > protected by a spinlock without actually acquiring the spinlock itself. ... which is exactly what lockref does today. The count is updated atomically (i.e. there are no intervening stores between the load and store to the count) it's just that this atomic update has no enforced ordering against other memory accesses. This is a generically useful primitive, as-is. > >Again, why is this a problem? It's exactly the same as if you did: > > > > spin_lock(lock); > > inc_ref_cnt(); > > spin_unlock(lock); > > > >Accesses outside of the critical section can still be reordered. Big deal. > > Since the current code resembles but actually has *fewer* ordering effects > compared to the example used by your atomic.h commit, even though A->B->C is > in program order, it could access out of order according to your own commit > text on commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67. This case is not comparable. The atomic_* API has a documented requirement that the atomic operations in question operate as full barriers, as is called out in the commit message. Lockref has no such documented requirement. [...] > Again, this is a generic API, not an API married to dentry. If it were for > dentry's sole use, it should not be accessible outside of the dentry code. > While the cmpxchg64_relaxed case may be OK for dentry, it is not OK for the > generic case. Again, you are assuming a specific set of semantics that lockref is not documented as having (and which contemporary code does not require). If you have a use-case for which you want stronger semantics, that is a different story entirely. Thanks, Mark.
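In C11 terms, the distinction being drawn here between "updated atomically" and "ordered against other memory accesses" looks roughly like the fragment below. This is an illustrative model only: the kernel's cmpxchg64() and cmpxchg64_relaxed() are arch-specific primitives, not wrappers around these calls, and the seq_cst flavor is just a stand-in for the full-barrier semantics the atomic_* API documents (see commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67).

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t word;

/* Relaxed CAS: the update is atomic, but no surrounding load or store is ordered by it. */
static int cas_relaxed(uint64_t *expected, uint64_t desired)
{
        return atomic_compare_exchange_strong_explicit(&word, expected, desired,
                                                       memory_order_relaxed,
                                                       memory_order_relaxed);
}

/*
 * "Full" CAS, modeled with seq_cst: the read-modify-write acts as both an
 * acquire and a release, so earlier accesses cannot be reordered after it
 * and later accesses cannot be reordered before it.
 */
static int cas_full(uint64_t *expected, uint64_t desired)
{
        return atomic_compare_exchange_strong_explicit(&word, expected, desired,
                                                       memory_order_seq_cst,
                                                       memory_order_seq_cst);
}

int main(void)
{
        uint64_t expected = 0;

        cas_relaxed(&expected, 1);      /* word: 0 -> 1, neighbouring accesses unconstrained */
        expected = 1;
        cas_full(&expected, 2);         /* word: 1 -> 2, neighbouring accesses cannot cross  */
        printf("word = %llu\n", (unsigned long long)atomic_load(&word));
        return 0;
}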
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h index 89206b5..4dd0977 100644 --- a/arch/arm64/include/asm/spinlock.h +++ b/arch/arm64/include/asm/spinlock.h @@ -106,7 +106,20 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) /* Did we get the lock? */ " eor %w1, %w0, %w0, ror #16\n" -" cbz %w1, 3f\n" +" cbnz %w1, 4f\n" + /* + * Yes: The store done on this cpu was the one that locked the lock. + * Store-release one-way barrier on LL/SC means that accesses coming + * after this could be reordered into the critical section of the + * load-acquire/store-release, where we did not own the lock. On LSE, + * even the one-way barrier of the store-release semantics is missing, + * so LSE needs an explicit barrier here as well. Without this, the + * changed contents of the area protected by the spinlock could be + * observed prior to the lock. + */ +" dmb ish\n" +" b 3f\n" +"4:\n" /* * No: spin on the owner. Send a local event to avoid missing an * unlock before the exclusive load. @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) " ldaxrh %w2, %4\n" " eor %w1, %w2, %w0, lsr #16\n" " cbnz %w1, 2b\n" - /* We got the lock. Critical section starts here. */ + /* + * We got the lock and have observed the prior owner's store-release. + * In this case, the one-way barrier of the prior owner that we + * observed combined with the one-way barrier of our load-acquire is + * enough to ensure accesses to the protected area coming after this + * are not accessed until we own the lock. In this case, other + * observers will not see our changes prior to observing the lock + * itself. Critical locked section starts here. + */ "3:" : "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock) : "Q" (lock->owner), "I" (1 << TICKET_SHIFT) @@ -137,6 +158,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock) " add %w0, %w0, %3\n" " stxr %w1, %w0, %2\n" " cbnz %w1, 1b\n" + /* + * We got the lock with a successful store-release: Store-release + * one-way barrier means accesses coming after this could be observed + * before the lock is observed as locked. + */ + " dmb ish\n" + " nop\n" "2:", /* LSE atomics */ " ldr %w0, %2\n" @@ -146,6 +174,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock) " casa %w0, %w1, %2\n" " and %w1, %w1, #0xffff\n" " eor %w1, %w1, %w0, lsr #16\n" + " cbnz %w1, 1f\n" + /* + * We got the lock with the LSE casa store. + * A barrier is required to ensure accesses coming from the + * critical section of the lock are not observed before our lock. + */ + " dmb ish\n" "1:") : "=&r" (lockval), "=&r" (tmp), "+Q" (*lock) : "I" (1 << TICKET_SHIFT) @@ -212,6 +247,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw) " cbnz %w0, 1b\n" " stxr %w0, %w2, %1\n" " cbnz %w0, 2b\n" + /* + * Lock is not ours until the store, which has no implicit barrier. + * Barrier is needed so our writes to the protected area are not + * observed before our lock ownership is observed. + */ + " dmb ish\n" " nop", /* LSE atomics */ "1: mov %w0, wzr\n" @@ -221,7 +262,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw) " cbz %w0, 2b\n" " wfe\n" " b 1b\n" - "3:") + /* + * Casa doesn't use store-release semantics. Even if it did, + * it would not protect us from our writes being observed before + * our ownership is observed. Barrier is required. 
+ */ + "3: dmb ish") : "=&r" (tmp), "+Q" (rw->lock) : "r" (0x80000000) : "memory"); @@ -299,7 +345,12 @@ static inline void arch_read_lock(arch_rwlock_t *rw) " tbnz %w1, #31, 1b\n" " casa %w0, %w1, %2\n" " sbc %w0, %w1, %w0\n" - " cbnz %w0, 2b") + " cbnz %w0, 2b\n" + /* + * Need to ensure that our reads of the area protected by the lock + * are not observed before our lock ownership is observed. + */ + " dmb ish\n") : "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock) : : "cc", "memory");