Message ID | 20230128035902.1758726-1-joel@joelfernandes.org (mailing list archive) |
---|---|
State | New, archived |
Series | [v4] srcu: Clarify comments on memory barrier "E" |
On Sat, Jan 28, 2023 at 03:59:01AM +0000, Joel Fernandes (Google) wrote:
> During a flip, we have a full memory barrier before srcu_idx is incremented.
>
> The idea is we intend to order the first phase scan's read of lock
> counters with the flipping of the index.
>
> However, such ordering is already enforced because of the
> control-dependency between the 2 scans. We would be flipping the index
> only if lock and unlock counts matched.
>
> But such match will not happen if there was a pending reader before the flip
> in the first place (observation courtesy Mathieu Desnoyers).
>
> The litmus test below shows this:
> (test courtesy Frederic Weisbecker, Changes for ctrldep by Boqun/me):

Much better, thank you!

I of course did the usual wordsmithing, as shown below. Does this
version capture your intent and understanding?

							Thanx, Paul

------------------------------------------------------------------------

commit 963f34624beb2af1ec08527e637d16ab6a1dacbd
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Sat Jan 28 03:59:01 2023 +0000

srcu: Clarify comments on memory barrier "E"

There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.

The purpose of E is to order the preceding scan's read of lock counters
against the flipping of the ->srcu_idx, in order to prevent new readers
from continuing to use the old ->srcu_idx value, which might needlessly
extend the grace period.

However, this ordering is already enforced because of the control
dependency between the preceding scan and the ->srcu_idx flip.
This control dependency exists because atomic_long_read() is used
to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
->srcu_unlock_count[] counts match. And such a match cannot happen when
there is an in-flight reader that started before the flip (observation
courtesy Mathieu Desnoyers).

The litmus test below (courtesy of Frederic Weisbecker, with changes
for ctrldep by Boqun and Joel) shows this:

C srcu
(*
 * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
 *
 * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
 * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
 * relations. Combining the ->ppo with ->rf, a cycle is impossible.
 *)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);

	// FLIP
	if (lock1 == unlock1) { // Control dep
		smp_mb(); // E // Remove E and still passes.
		WRITE_ONCE(*IDX, 1);
		smp_mb(); // D

		// SCAN2
		unlock0 = READ_ONCE(*UNLOCK0);
		smp_mb(); // A
		lock0 = READ_ONCE(*LOCK0);
	}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx1;
	int idx2;

	// 1st reader
	idx1 = READ_ONCE(*IDX);
	if (idx1 == 0) { // Control dep
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:lock1=1 /\ 1:idx1=1)

More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.

This commit therefore clarifies the comment on memory barrier E.

Why not also remove that redundant smp_mb()?

Because control dependencies are quite fragile due to their not being
recognized by most compilers and tools. Control dependencies therefore
exact an ongoing maintenance burden, and such a burden cannot be justified
in this slowpath. Therefore, that smp_mb() stays until such time as
its overhead becomes a measurable problem in a real workload running on
a real production system, or until such time as compilers start paying
attention to this sort of control dependency.

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index c541b82646b63..cd46fe063e50f 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1085,16 +1085,36 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 static void srcu_flip(struct srcu_struct *ssp)
 {
 	/*
-	 * Ensure that if this updater saw a given reader's increment
-	 * from __srcu_read_lock(), that reader was using an old value
-	 * of ->srcu_idx. Also ensure that if a given reader sees the
-	 * new value of ->srcu_idx, this updater's earlier scans cannot
-	 * have seen that reader's increments (which is OK, because this
-	 * grace period need not wait on that reader).
+	 * Because the flip of ->srcu_idx is executed only if the
+	 * preceding call to srcu_readers_active_idx_check() found that
+	 * the ->srcu_unlock_count[] and ->srcu_lock_count[] sums matched
+	 * and because that summing uses atomic_long_read(), there is
+	 * ordering due to a control dependency between that summing and
+	 * the WRITE_ONCE() in this call to srcu_flip(). This ordering
+	 * ensures that if this updater saw a given reader's increment from
+	 * __srcu_read_lock(), that reader was using a value of ->srcu_idx
+	 * from before the previous call to srcu_flip(), which should be
+	 * quite rare. This ordering thus helps forward progress because
+	 * the grace period could otherwise be delayed by additional
+	 * calls to __srcu_read_lock() using that old (soon to be new)
+	 * value of ->srcu_idx.
+	 *
+	 * This sum-equality check and ordering also ensures that if
+	 * a given call to __srcu_read_lock() uses the new value of
+	 * ->srcu_idx, this updater's earlier scans cannot have seen
+	 * that reader's increments, which is all to the good, because
+	 * this grace period need not wait on that reader. After all,
+	 * if those earlier scans had seen that reader, there would have
+	 * been a sum mismatch and this code would not be reached.
+	 *
+	 * This means that the following smp_mb() is redundant, but
+	 * it stays until either (1) Compilers learn about this sort of
+	 * control dependency or (2) Some production workload running on
+	 * a production system is unduly delayed by this slowpath smp_mb().
 	 */
 	smp_mb(); /* E */ /* Pairs with B and C. */
 
-	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1);
+	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); // Flip the counter.
 
 	/*
 	 * Ensure that if the updater misses an __srcu_read_unlock()
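For readers following along, below is a deliberately simplified sketch of the
reader-side counter scheme that the litmus test's P1 abstracts. It assumes
kernel context (READ_ONCE(), WRITE_ONCE(), smp_mb()) and uses plain shared
counters instead of the per-CPU counters and this_cpu-style increments of the
real __srcu_read_lock()/__srcu_read_unlock(); all names here are illustrative
only, not the kernel's:

```c
/* Simplified sketch only; not the kernel's per-CPU implementation. */
struct srcu_sketch {
	int idx;			/* plays the role of ->srcu_idx       */
	unsigned long lock_count[2];	/* plays ->srcu_lock_count[]          */
	unsigned long unlock_count[2];	/* plays ->srcu_unlock_count[]        */
};

static int sketch_read_lock(struct srcu_sketch *sp)
{
	int idx = READ_ONCE(sp->idx) & 0x1;

	/* Same shape as P1 above: load IDX, then bump the LOCK counter. */
	WRITE_ONCE(sp->lock_count[idx], READ_ONCE(sp->lock_count[idx]) + 1);
	smp_mb(); /* B */ /* Order the increment before the critical section. */
	return idx;
}

static void sketch_read_unlock(struct srcu_sketch *sp, int idx)
{
	smp_mb(); /* C */ /* Order the critical section before the increment. */
	WRITE_ONCE(sp->unlock_count[idx], READ_ONCE(sp->unlock_count[idx]) + 1);
}
```

The only point of the sketch is to show where memory barriers B and C sit
relative to the counter increments that the updater later sums and compares.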
On Sat, Jan 28, 2023 at 1:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Sat, Jan 28, 2023 at 03:59:01AM +0000, Joel Fernandes (Google) wrote:
[...]
>
> Much better, thank you!
>
> I of course did the usual wordsmithing, as shown below. Does this
> version capture your intent and understanding?

It looks good to me.

According to [1], the architecture at least should not be reordering
read-write control dependency. Only read-read is problematic. But I am
not 100% sure, is that not true?

For the compiler, you are saying that read-write control dependency
can be reordered even with *ONCE() accesses? In other words, the
flipping of idx can happen in ->po order before the locks and unlock
counts match? That sounds sort of like a broken compiler.

[1] https://lpc.events/event/7/contributions/821/attachments/598/1075/LPC_2020_--_Dependency_ordering.pdf

More comments below:

> This control dependency exists because atomic_long_read() is used
> to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
> and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
> ->srcu_unlock_count[] counts match. And such a match cannot happen when
> there is an in-flight reader that started before the flip (observation
> courtesy Mathieu Desnoyers).

Agreed.

[...]

> + * This means that the following smp_mb() is redundant, but
> + * it stays until either (1) Compilers learn about this sort of
> + * control dependency or (2) Some production workload running on
> + * a production system is unduly delayed by this slowpath smp_mb().
> + */

I agree that a read-write control dependency reordering by the
compiler can cause a reader to observe the flipped srcu_idx too soon,
thus perhaps delaying the grace period from ending (because the second
scan will now end up waiting on that reader..).

Thanks,

 - Joel

[...]
On Sat, Jan 28, 2023 at 04:16:34PM -0500, Joel Fernandes wrote:
> On Sat, Jan 28, 2023 at 1:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
>
> It looks good to me.
>
> According to [1], the architecture at least should not be reordering
> read-write control dependency. Only read-read is problematic. But I am
> not 100% sure, is that not true?

Agreed, READ_ONCE() or stronger through condition to WRITE_ONCE()
or stronger is ordered. Replace that WRITE_ONCE() with any type of
unordered read and all bets are off.

And now that the ARM folks chimed in, this is a solid guarantee at
the hardware level.

Not so much at the compiler level. Oddly enough, compilers do provide
ordering for plain C-language stores, but that is helpful only if no
other CPU or thread is concurrently accessing that variable.

> For the compiler, you are saying that read-write control dependency
> can be reordered even with *ONCE() accesses? In other words, the
> flipping of idx can happen in ->po order before the locks and unlock
> counts match? That sounds sort of like a broken compiler.

One case where a sane compiler can reasonably enable the hardware to
do the reordering is where you have the same WRITE_ONCE() in both the
then-clause and else-clause of an "if" statement. Another is if it can
somehow prove something about the value returned from that READ_ONCE(),
for example, if that value is used to index a singleton array, then the
compiler has to do the READ_ONCE(), but it can then assume that the
value returned was zero, throwing away the real value returned.

Fun with undefined behavior!

> [1] https://lpc.events/event/7/contributions/821/attachments/598/1075/LPC_2020_--_Dependency_ordering.pdf
>
> More comments below:

[...]

> I agree that a read-write control dependency reordering by the
> compiler can cause a reader to observe the flipped srcu_idx too soon,
> thus perhaps delaying the grace period from ending (because the second
> scan will now end up waiting on that reader..).

Very good! I will push the commit out on -rcu.

							Thanx, Paul
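As an illustration of the first case described in the exchange above (the same
WRITE_ONCE() on both legs of an "if"), here is a minimal, hypothetical sketch,
not taken from the kernel source; do_something() and do_something_else() are
stand-in helpers:

```c
/*
 * Hypothetical fragment: the same WRITE_ONCE() on both legs of the "if"
 * lets the compiler hoist the store out of the conditional.
 */
void before(int *x, int *y)
{
	int q = READ_ONCE(*x);

	if (q) {
		WRITE_ONCE(*y, 1);
		do_something();
	} else {
		WRITE_ONCE(*y, 1);	/* Same store on both legs... */
		do_something_else();
	}
}

/*
 * ...so the compiler may legitimately emit the equivalent of the following,
 * after which the store to *y is no longer conditioned on the load from *x
 * and the CPU is free to reorder the store before the load.
 */
void after(int *x, int *y)
{
	int q = READ_ONCE(*x);

	WRITE_ONCE(*y, 1);
	if (q)
		do_something();
	else
		do_something_else();
}
```

This fragility is the reason the commit keeps smp_mb() E even though the
litmus test passes without it.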
On Sat, Jan 28, 2023 at 09:09:04PM -0800, Paul E. McKenney wrote:
> On Sat, Jan 28, 2023 at 04:16:34PM -0500, Joel Fernandes wrote:
> > On Sat, Jan 28, 2023 at 1:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > [1] https://lpc.events/event/7/contributions/821/attachments/598/1075/LPC_2020_--_Dependency_ordering.pdf
> >
> > More comments below:

Except that it was pointed out to me that the Co-developed-by tags also
need Signed-off-by tags. If there are no objections to the update shown
below, I will fix this on my next rebase.

							Thanx, Paul

------------------------------------------------------------------------

commit 6c135bb38c55d354527a6659cbf2f4e7e20b4360
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Sat Jan 28 03:59:01 2023 +0000

srcu: Clarify comments on memory barrier "E"

There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.

[...]

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

[...]
On 2023-02-07 22:38, Paul E. McKenney wrote:
[...]
>
> Except that it was pointed out to me that the Co-developed-by tags also
> need Signed-off-by tags. If there are no objections to the update shown
> below, I will fix this on my next rebase.

No objection from me, thanks!

Mathieu
On Tue, Feb 07, 2023 at 10:48:42PM -0500, Mathieu Desnoyers wrote:
> On 2023-02-07 22:38, Paul E. McKenney wrote:
> [...]
> >
> > Except that it was pointed out to me that the Co-developed-by tags also
> > need Signed-off-by tags. If there are no objections to the update shown
> > below, I will fix this on my next rebase.
>
> No objection from me, thanks!

And I have updated this commit, thank you all!

							Thanx, Paul
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 1c304fec89c0..2872998edbb7 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -983,12 +983,20 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 static void srcu_flip(struct srcu_struct *ssp)
 {
 	/*
-	 * Ensure that if this updater saw a given reader's increment
-	 * from __srcu_read_lock(), that reader was using an old value
-	 * of ->srcu_idx. Also ensure that if a given reader sees the
-	 * new value of ->srcu_idx, this updater's earlier scans cannot
-	 * have seen that reader's increments (which is OK, because this
-	 * grace period need not wait on that reader).
+	 * Control dependency (causality) between the before-flip
+	 * srcu_readers_active_idx_check() and a call to srcu_flip(), ensures
+	 * that we end up here only if lock and unlock counts match. This fact
+	 * ensures that if this updater saw a given reader's increment from
+	 * __srcu_read_lock(), that reader was using an old value of
+	 * ->srcu_idx. That is why the lock and unlock counts matched in the
+	 * first place. The causality also ensures that if a given reader sees
+	 * the new value of ->srcu_idx, this updater's earlier scans cannot
+	 * have seen that reader's increments (which is OK, because this grace
+	 * period need not wait on that reader), because again, that would
+	 * cause a lock/unlock count mismatch and we not end up here.
+	 *
+	 * So we don't really need the following smp_mb() before incrementing
+	 * srcu_idx, however we have it anyway for additional safety.
 	 */
 	smp_mb(); /* E */ /* Pairs with B and C. */
During a flip, we have a full memory barrier before srcu_idx is incremented.

The idea is we intend to order the first phase scan's read of lock
counters with the flipping of the index.

However, such ordering is already enforced because of the
control-dependency between the 2 scans. We would be flipping the index
only if lock and unlock counts matched.

But such match will not happen if there was a pending reader before the flip
in the first place (observation courtesy Mathieu Desnoyers).

The litmus test below shows this:
(test courtesy Frederic Weisbecker, Changes for ctrldep by Boqun/me):

C srcu
(*
 * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
 *
 * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
 * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
 * relations. Combining the ->ppo with ->rf, a cycle is impossible.
 *)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);

	// FLIP
	if (lock1 == unlock1) { // Control dep
		smp_mb(); // E // Remove E and still passes.
		WRITE_ONCE(*IDX, 1);
		smp_mb(); // D

		// SCAN2
		unlock0 = READ_ONCE(*UNLOCK0);
		smp_mb(); // A
		lock0 = READ_ONCE(*LOCK0);
	}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx1;
	int idx2;

	// 1st reader
	idx1 = READ_ONCE(*IDX);
	if (idx1 == 0) { // Control dep
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:lock1=1 /\ 1:idx1=1)

More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.

This commit therefore clarifies the comment on memory barrier E while
keeping the memory barrier anyway for extra safety (since it is on a
slow path anyway).

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v1->v2: Update changelog, keep old comments.
v2->v3: Moar changelog updates.
v3->v4: Keep smp_mb() and just update comments

 cc_list               |  8 ++++++++
 kernel/rcu/srcutree.c | 20 ++++++++++++++------
 send.sh               |  5 +++++
 to_list               |  1 +
 4 files changed, 28 insertions(+), 6 deletions(-)
 create mode 100644 cc_list
 create mode 100755 send.sh
 create mode 100644 to_list
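For anyone who wants to reproduce the result, one possible way to run the
litmus test is with herd7 from herdtools7 together with the Linux-kernel
memory model shipped under tools/memory-model/ in the kernel source tree. The
file name srcu-flip.litmus below is only an assumed name for a file containing
the test above:

```
$ cd tools/memory-model
$ herd7 -conf linux-kernel.cfg srcu-flip.litmus
```

With or without memory barrier E, herd7 should report the "exists" clause as
never being satisfied (an "Observation ... Never" result), which is the claim
made in the changelog.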