Message ID | 20221218191310.130904-1-joel@joelfernandes.org (mailing list archive)
---|---
Series | srcu: Remove pre-flip memory barrier
On 2022-12-18 14:13, Joel Fernandes (Google) wrote: > Hello, I believe the pre-flip memory barrier is not required. The only reason I > can say to remove it, other than the possibility that it is unnecessary, is to > not have extra code that does not help. However, since we are issuing a fully > memory-barrier after the flip, I cannot say that it hurts to do it anyway. > > For this reason, please consider these patches as "informational", than a > "please merge". :-) Though, feel free to consider merging if you agree! > > All SRCU scenarios pass with these, with 6 hours of testing. Hi Joel, Please have a look at the comments in my side-rcu implementation [1, 2]. It is similar to what SRCU does (per-cpu counter based grace period tracking), but implemented for userspace. The comments explain why this works without the memory barrier you identify as useless in SRCU. Following my implementation of side-rcu, I reviewed the SRCU comments and identified that the barrier "/* E */" appears to be useless. I even discussed this privately with Paul E. McKenney. My implementation and comments go further though, and skip the period "flip" entirely if the first pass observes that all readers (in both periods) are quiescent. The most relevant comment in side-rcu is: * The grace period completes when it observes that there are no active * readers within each of the periods. * * The active_readers state is initially true for each period, until the * grace period observes that no readers are present for each given * period, at which point the active_readers state becomes false. So I agree with the clarifications you propose here, but I think we can improve the grace period implementation further by clarifying the SRCU grace period model. Thanks, Mathieu [1] https://github.com/efficios/libside/blob/master/src/rcu.h [2] https://github.com/efficios/libside/blob/master/src/rcu.c > > thanks, > > - Joel > > Joel Fernandes (Google) (2): > srcu: Remove comment about prior read lock counts > srcu: Remove memory barrier "E" as it is not required > > kernel/rcu/srcutree.c | 10 ---------- > 1 file changed, 10 deletions(-) > > -- > 2.39.0.314.g84b9a713c41-goog >
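For readers unfamiliar with the scheme Mathieu describes, here is a minimal user-space sketch of a two-period grace period that skips the flip when both periods are already quiescent. It is illustrative only: toy types, a single global counter pair per period instead of per-CPU counters, and none of libside's actual APIs or memory-ordering details.

#include <stdatomic.h>
#include <stdbool.h>

struct period {
	atomic_ulong lock_count;
	atomic_ulong unlock_count;
};

struct gp_state {
	struct period period[2];
	atomic_uint current_idx;	/* which period new readers increment */
};

static bool period_quiescent(struct period *p)
{
	unsigned long unlocks = atomic_load(&p->unlock_count);

	atomic_thread_fence(memory_order_seq_cst);	/* pairs with reader-side barriers */
	return atomic_load(&p->lock_count) == unlocks;
}

void toy_synchronize(struct gp_state *gp)
{
	/*
	 * First pass: if no readers are active in either period, the flip
	 * can be skipped entirely; it only matters for forward progress.
	 */
	if (period_quiescent(&gp->period[0]) && period_quiescent(&gp->period[1]))
		return;

	atomic_fetch_xor(&gp->current_idx, 1);		/* flip */

	/* Wait (poll, in this toy version) until both periods drain. */
	while (!period_quiescent(&gp->period[0]) ||
	       !period_quiescent(&gp->period[1]))
		;
}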
Hi Mathieu, On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-18 14:13, Joel Fernandes (Google) wrote: > > Hello, I believe the pre-flip memory barrier is not required. The only reason I > > can say to remove it, other than the possibility that it is unnecessary, is to > > not have extra code that does not help. However, since we are issuing a fully > > memory-barrier after the flip, I cannot say that it hurts to do it anyway. > > > > For this reason, please consider these patches as "informational", than a > > "please merge". :-) Though, feel free to consider merging if you agree! > > > > All SRCU scenarios pass with these, with 6 hours of testing. > > Hi Joel, > > Please have a look at the comments in my side-rcu implementation [1, 2]. > It is similar to what SRCU does (per-cpu counter based grace period > tracking), but implemented for userspace. The comments explain why this > works without the memory barrier you identify as useless in SRCU. > > Following my implementation of side-rcu, I reviewed the SRCU comments > and identified that the barrier "/* E */" appears to be useless. I even > discussed this privately with Paul E. McKenney. > > My implementation and comments go further though, and skip the period > "flip" entirely if the first pass observes that all readers (in both > periods) are quiescent. Actually in SRCU, the first pass scans only 1 index, then does the flip, and the second pass scans the second index. Without doing a flip, an index cannot be scanned for forward progress reasons because it is still "active". So I am curious how you can skip flip and still scan both indexes? I will dig more into your implementation to learn more. > The most relevant comment in side-rcu is: > > * The grace period completes when it observes that there are no active > * readers within each of the periods. > * > * The active_readers state is initially true for each period, until the > * grace period observes that no readers are present for each given > * period, at which point the active_readers state becomes false. > > So I agree with the clarifications you propose here, but I think we can > improve the grace period implementation further by clarifying the SRCU > grace period model. Thanks a lot, I am curious how you do the "detection of no new readers" part without globally doing some kind of synchronization. I will dig more into your implementation to learn more. Thanks, - Joel > > Thanks, > > Mathieu > > > [1] https://github.com/efficios/libside/blob/master/src/rcu.h > [2] https://github.com/efficios/libside/blob/master/src/rcu.c > > > > > thanks, > > > > - Joel > > > > Joel Fernandes (Google) (2): > > srcu: Remove comment about prior read lock counts > > srcu: Remove memory barrier "E" as it is not required > > > > kernel/rcu/srcutree.c | 10 ---------- > > 1 file changed, 10 deletions(-) > > > > -- > > 2.39.0.314.g84b9a713c41-goog > > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
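For contrast, here is a similarly simplified sketch of the sequence Joel describes for SRCU: scan the inactive index, flip, then scan the other index. It reuses the toy gp_state/period_quiescent helpers from the sketch following Mathieu's message above and is not the real srcutree.c code.

void toy_srcu_synchronize(struct gp_state *gp)
{
	unsigned int old_idx = atomic_load(&gp->current_idx) ^ 1;

	/* First pass: wait for readers on the currently inactive index. */
	while (!period_quiescent(&gp->period[old_idx]))
		;

	/* Flip: new readers now increment the other index. */
	atomic_fetch_xor(&gp->current_idx, 1);
	/* The real code issues a full memory barrier after the flip here. */

	/* Second pass: wait for readers on the previously active index. */
	while (!period_quiescent(&gp->period[old_idx ^ 1]))
		;
}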
On Sun, Dec 18, 2022 at 04:30:33PM -0500, Joel Fernandes wrote: > Hi Mathieu, > > On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: > > > > On 2022-12-18 14:13, Joel Fernandes (Google) wrote: > > > Hello, I believe the pre-flip memory barrier is not required. The only reason I > > > can say to remove it, other than the possibility that it is unnecessary, is to > > > not have extra code that does not help. However, since we are issuing a fully > > > memory-barrier after the flip, I cannot say that it hurts to do it anyway. > > > > > > For this reason, please consider these patches as "informational", than a > > > "please merge". :-) Though, feel free to consider merging if you agree! > > > > > > All SRCU scenarios pass with these, with 6 hours of testing. > > > > Hi Joel, > > > > Please have a look at the comments in my side-rcu implementation [1, 2]. > > It is similar to what SRCU does (per-cpu counter based grace period > > tracking), but implemented for userspace. The comments explain why this > > works without the memory barrier you identify as useless in SRCU. > > > > Following my implementation of side-rcu, I reviewed the SRCU comments > > and identified that the barrier "/* E */" appears to be useless. I even > > discussed this privately with Paul E. McKenney. > > > > My implementation and comments go further though, and skip the period > > "flip" entirely if the first pass observes that all readers (in both > > periods) are quiescent. > > Actually in SRCU, the first pass scans only 1 index, then does the > flip, and the second pass scans the second index. Without doing a > flip, an index cannot be scanned for forward progress reasons because > it is still "active". So I am curious how you can skip flip and still > scan both indexes? I will dig more into your implementation to learn more. > > > The most relevant comment in side-rcu is: > > > > * The grace period completes when it observes that there are no active > > * readers within each of the periods. > > * > > * The active_readers state is initially true for each period, until the > > * grace period observes that no readers are present for each given > > * period, at which point the active_readers state becomes false. > > > > So I agree with the clarifications you propose here, but I think we can > > improve the grace period implementation further by clarifying the SRCU > > grace period model. > > Thanks a lot, I am curious how you do the "detection of no new > readers" part without globally doing some kind of synchronization. I > will dig more into your implementation to learn more. It is very good to see the interest in SRCU internals! Just out of an abundance of caution, I restate the requirements from the synchronize_srcu() header comment: * There are memory-ordering constraints implied by synchronize_srcu(). * On systems with more than one CPU, when synchronize_srcu() returns, * each CPU is guaranteed to have executed a full memory barrier since * the end of its last corresponding SRCU read-side critical section * whose beginning preceded the call to synchronize_srcu(). In addition, * each CPU having an SRCU read-side critical section that extends beyond * the return from synchronize_srcu() is guaranteed to have executed a * full memory barrier after the beginning of synchronize_srcu() and before * the beginning of that SRCU read-side critical section. 
Note that these * guarantees include CPUs that are offline, idle, or executing in user mode, * as well as CPUs that are executing in the kernel. * * Furthermore, if CPU A invoked synchronize_srcu(), which returned * to its caller on CPU B, then both CPU A and CPU B are guaranteed * to have executed a full memory barrier during the execution of * synchronize_srcu(). This guarantee applies even if CPU A and CPU B * are the same CPU, but again only if the system has more than one CPU. * * Of course, these memory-ordering guarantees apply only when * synchronize_srcu(), srcu_read_lock(), and srcu_read_unlock() are * passed the same srcu_struct structure. And from the __call_srcu() header comment: * Note that all CPUs must agree that the grace period extended beyond * all pre-existing SRCU read-side critical section. On systems with * more than one CPU, this means that when "func()" is invoked, each CPU * is guaranteed to have executed a full memory barrier since the end of * its last corresponding SRCU read-side critical section whose beginning * preceded the call to call_srcu(). It also means that each CPU executing * an SRCU read-side critical section that continues beyond the start of * "func()" must have executed a memory barrier after the call_srcu() * but before the beginning of that SRCU read-side critical section. * Note that these guarantees include CPUs that are offline, idle, or * executing in user mode, as well as CPUs that are executing in the kernel. * * Furthermore, if CPU A invoked call_srcu() and CPU B invoked the * resulting SRCU callback function "func()", then both CPU A and CPU * B are guaranteed to execute a full memory barrier during the time * interval between the call to call_srcu() and the invocation of "func()". * This guarantee applies even if CPU A and CPU B are the same CPU (but * again only if the system has more than one CPU). * * Of course, these guarantees apply only for invocations of call_srcu(), * srcu_read_lock(), and srcu_read_unlock() that are all passed the same * srcu_struct structure. Thanx, Paul
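As a concrete illustration of why callers care about the guarantee quoted above, the standard remove-then-free pattern relies on it. This is a kernel-style sketch of ordinary SRCU usage, not code from this thread.

#include <linux/srcu.h>
#include <linux/slab.h>

struct foo {
	int data;
};

static struct foo __rcu *gp_ptr;
DEFINE_SRCU(my_srcu);

int reader(void)
{
	int ret, idx;
	struct foo *p;

	idx = srcu_read_lock(&my_srcu);
	p = srcu_dereference(gp_ptr, &my_srcu);
	ret = p ? p->data : -1;
	srcu_read_unlock(&my_srcu, idx);
	return ret;
}

void updater(struct foo *newp)
{
	struct foo *old = rcu_dereference_protected(gp_ptr, 1);

	rcu_assign_pointer(gp_ptr, newp);
	/*
	 * After this returns, every reader that might have seen 'old' has
	 * finished, and the ordering guarantees quoted above ensure its
	 * accesses to 'old' are complete before the kfree().
	 */
	synchronize_srcu(&my_srcu);
	kfree(old);
}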
On 2022-12-18 16:30, Joel Fernandes wrote: > Hi Mathieu, > > On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2022-12-18 14:13, Joel Fernandes (Google) wrote: >>> Hello, I believe the pre-flip memory barrier is not required. The only reason I >>> can say to remove it, other than the possibility that it is unnecessary, is to >>> not have extra code that does not help. However, since we are issuing a fully >>> memory-barrier after the flip, I cannot say that it hurts to do it anyway. >>> >>> For this reason, please consider these patches as "informational", than a >>> "please merge". :-) Though, feel free to consider merging if you agree! >>> >>> All SRCU scenarios pass with these, with 6 hours of testing. >> >> Hi Joel, >> >> Please have a look at the comments in my side-rcu implementation [1, 2]. >> It is similar to what SRCU does (per-cpu counter based grace period >> tracking), but implemented for userspace. The comments explain why this >> works without the memory barrier you identify as useless in SRCU. >> >> Following my implementation of side-rcu, I reviewed the SRCU comments >> and identified that the barrier "/* E */" appears to be useless. I even >> discussed this privately with Paul E. McKenney. >> >> My implementation and comments go further though, and skip the period >> "flip" entirely if the first pass observes that all readers (in both >> periods) are quiescent. > > Actually in SRCU, the first pass scans only 1 index, then does the > flip, and the second pass scans the second index. Without doing a > flip, an index cannot be scanned for forward progress reasons because > it is still "active". So I am curious how you can skip flip and still > scan both indexes? I will dig more into your implementation to learn more. If we look at SRCU read-side: int __srcu_read_lock(struct srcu_struct *ssp) { int idx; idx = READ_ONCE(ssp->srcu_idx) & 0x1; this_cpu_inc(ssp->sda->srcu_lock_count[idx]); smp_mb(); /* B */ /* Avoid leaking the critical section. */ return idx; } If the thread is preempted for a long period of time between load of ssp->srcu_idx and increment of srcu_lock_count[idx], this means this thread can appear as a "new reader" for the idx period at any arbitrary time in the future, independently of which period is the current one within a future grace period. As a result, the grace period algorithm needs to inherently support the fact that a "new reader" can appear in any of the two periods, independently of the current period state. As a result, this means that while within period "0", we _need_ to allow newly coming readers to appear as we scan period "0". As a result, we can simply scan both periods 0/1 for reader quiescence, even while new readers appear within those periods. As a result, flipping between periods 0/1 is just relevant for forward progress, not for correctness. As a result, we can remove barrier /* E */. Thoughts ? Thanks, Mathieu > >> The most relevant comment in side-rcu is: >> >> * The grace period completes when it observes that there are no active >> * readers within each of the periods. >> * >> * The active_readers state is initially true for each period, until the >> * grace period observes that no readers are present for each given >> * period, at which point the active_readers state becomes false. >> >> So I agree with the clarifications you propose here, but I think we can >> improve the grace period implementation further by clarifying the SRCU >> grace period model. 
> > Thanks a lot, I am curious how you do the "detection of no new > readers" part without globally doing some kind of synchronization. I > will dig more into your implementation to learn more. > > Thanks, > > - Joel > > > >> >> Thanks, >> >> Mathieu >> >> >> [1] https://github.com/efficios/libside/blob/master/src/rcu.h >> [2] https://github.com/efficios/libside/blob/master/src/rcu.c >> >>> >>> thanks, >>> >>> - Joel >>> >>> Joel Fernandes (Google) (2): >>> srcu: Remove comment about prior read lock counts >>> srcu: Remove memory barrier "E" as it is not required >>> >>> kernel/rcu/srcutree.c | 10 ---------- >>> 1 file changed, 10 deletions(-) >>> >>> -- >>> 2.39.0.314.g84b9a713c41-goog >>> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
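The window Mathieu is pointing at can be drawn as a timeline. This is purely illustrative annotation of the existing reader code, not new code.

/*
 * Reader (task T)                         Grace-period thread
 * ---------------                         -------------------
 * idx = READ_ONCE(ssp->srcu_idx);  // 0
 *   <preempted for a long time>
 *                                         scans idx 0, finds no readers
 *                                         flips ssp->srcu_idx to 1
 *                                         scans idx 1, finds no readers
 *                                         ... later grace periods run ...
 * this_cpu_inc(...srcu_lock_count[0]);
 * // T now shows up as a "new" reader in
 * // period 0, even though period 1 is
 * // the current one.
 */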
Hi Mathieu, On Sun, Dec 18, 2022 at 6:38 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-18 16:30, Joel Fernandes wrote: > > Hi Mathieu, > > > > On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 2022-12-18 14:13, Joel Fernandes (Google) wrote: > >>> Hello, I believe the pre-flip memory barrier is not required. The only reason I > >>> can say to remove it, other than the possibility that it is unnecessary, is to > >>> not have extra code that does not help. However, since we are issuing a fully > >>> memory-barrier after the flip, I cannot say that it hurts to do it anyway. > >>> > >>> For this reason, please consider these patches as "informational", than a > >>> "please merge". :-) Though, feel free to consider merging if you agree! > >>> > >>> All SRCU scenarios pass with these, with 6 hours of testing. > >> > >> Hi Joel, > >> > >> Please have a look at the comments in my side-rcu implementation [1, 2]. > >> It is similar to what SRCU does (per-cpu counter based grace period > >> tracking), but implemented for userspace. The comments explain why this > >> works without the memory barrier you identify as useless in SRCU. > >> > >> Following my implementation of side-rcu, I reviewed the SRCU comments > >> and identified that the barrier "/* E */" appears to be useless. I even > >> discussed this privately with Paul E. McKenney. > >> > >> My implementation and comments go further though, and skip the period > >> "flip" entirely if the first pass observes that all readers (in both > >> periods) are quiescent. > > > > Actually in SRCU, the first pass scans only 1 index, then does the > > flip, and the second pass scans the second index. Without doing a > > flip, an index cannot be scanned for forward progress reasons because > > it is still "active". So I am curious how you can skip flip and still > > scan both indexes? I will dig more into your implementation to learn more. > > If we look at SRCU read-side: > > int __srcu_read_lock(struct srcu_struct *ssp) > { > int idx; > > idx = READ_ONCE(ssp->srcu_idx) & 0x1; > this_cpu_inc(ssp->sda->srcu_lock_count[idx]); > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > return idx; > } > > If the thread is preempted for a long period of time between load of > ssp->srcu_idx and increment of srcu_lock_count[idx], this means this > thread can appear as a "new reader" for the idx period at any arbitrary > time in the future, independently of which period is the current one > within a future grace period. > > As a result, the grace period algorithm needs to inherently support the > fact that a "new reader" can appear in any of the two periods, > independently of the current period state. > > As a result, this means that while within period "0", we _need_ to allow > newly coming readers to appear as we scan period "0". Sure, it already does handle it but that is I believe it is a corner case, not the norm. > As a result, we can simply scan both periods 0/1 for reader quiescence, > even while new readers appear within those periods. I think this is a bit dangerous. Yes there is the preemption thing you mentioned above, but that is bounded since you can only have a fixed number of tasks that underwent that preemption, and it is quite rare in the sense, each reader should get preempted just after sampling idx but not incrementing lock count. 
However, if we scan while new readers appear (outside of the above preemption problem), we can have counter wrap causing a false match much quicker. The scan loop is: check_readers(idx) { count_all_unlocks(idx); smp_mb(); count_all_locks(idx); bool done = (locks == unlocks) if (done) { // readers are done, end scan for this idx. } else { // try again later } } So if check_readers() got preempted just after the smp_mb(), then you can have lots of tasks enter and exit the read-side critical section and increment the locks count. Eventually locks == unlocks will happen, and it is screwed. Sure this is also theoretical, but yeah that issue can be made "worse" by scanning active readers deliberately, especially when such readers can also nest arbitrarily. > As a result, flipping between periods 0/1 is just relevant for forward > progress, not for correctness. Sure, agreed, forward progress. > > As a result, we can remove barrier /* E */. > IMO, the "E" is not needed at all even if you do the flip because of the reasons in the patch series changelogs, so whether we do flip less frequently, we still don't need "E" IMHO. Thanks! - Joel > Thoughts ? > > Thanks, > > Mathieu > > > > > >> The most relevant comment in side-rcu is: > >> > >> * The grace period completes when it observes that there are no active > >> * readers within each of the periods. > >> * > >> * The active_readers state is initially true for each period, until the > >> * grace period observes that no readers are present for each given > >> * period, at which point the active_readers state becomes false. > >> > >> So I agree with the clarifications you propose here, but I think we can > >> improve the grace period implementation further by clarifying the SRCU > >> grace period model. > > > > Thanks a lot, I am curious how you do the "detection of no new > > readers" part without globally doing some kind of synchronization. I > > will dig more into your implementation to learn more. > > > > Thanks, > > > > - Joel > > > > > > > >> > >> Thanks, > >> > >> Mathieu > >> > >> > >> [1] https://github.com/efficios/libside/blob/master/src/rcu.h > >> [2] https://github.com/efficios/libside/blob/master/src/rcu.c > >> > >>> > >>> thanks, > >>> > >>> - Joel > >>> > >>> Joel Fernandes (Google) (2): > >>> srcu: Remove comment about prior read lock counts > >>> srcu: Remove memory barrier "E" as it is not required > >>> > >>> kernel/rcu/srcutree.c | 10 ---------- > >>> 1 file changed, 10 deletions(-) > >>> > >>> -- > >>> 2.39.0.314.g84b9a713c41-goog > >>> > >> > >> -- > >> Mathieu Desnoyers > >> EfficiOS Inc. > >> https://www.efficios.com > >> > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
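A self-contained rendering of the scan loop sketched above, with the wrap hazard marked. The names and the fixed per-CPU layout are illustrative, not the actual srcutree.c code.

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

struct toy_cpu_counts {
	atomic_ulong lock_count[2];
	atomic_ulong unlock_count[2];
};
static struct toy_cpu_counts cpu_counts[NR_CPUS];

static bool check_readers(int idx)
{
	unsigned long unlocks = 0, locks = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		unlocks += atomic_load(&cpu_counts[cpu].unlock_count[idx]);

	atomic_thread_fence(memory_order_seq_cst);	/* the smp_mb() above */

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		locks += atomic_load(&cpu_counts[cpu].lock_count[idx]);

	/*
	 * If this function is preempted around the fence while readers keep
	 * entering and exiting, 'locks' can eventually wrap and equal
	 * 'unlocks' even though readers are still active: the false match
	 * described above.
	 */
	return locks == unlocks;
}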
On Sun, Dec 18, 2022 at 7:04 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > Hi Mathieu, > > On Sun, Dec 18, 2022 at 6:38 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: > > > > On 2022-12-18 16:30, Joel Fernandes wrote: > > > Hi Mathieu, > > > > > > On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers > > > <mathieu.desnoyers@efficios.com> wrote: > > >> > > >> On 2022-12-18 14:13, Joel Fernandes (Google) wrote: > > >>> Hello, I believe the pre-flip memory barrier is not required. The only reason I > > >>> can say to remove it, other than the possibility that it is unnecessary, is to > > >>> not have extra code that does not help. However, since we are issuing a fully > > >>> memory-barrier after the flip, I cannot say that it hurts to do it anyway. > > >>> > > >>> For this reason, please consider these patches as "informational", than a > > >>> "please merge". :-) Though, feel free to consider merging if you agree! > > >>> > > >>> All SRCU scenarios pass with these, with 6 hours of testing. > > >> > > >> Hi Joel, > > >> > > >> Please have a look at the comments in my side-rcu implementation [1, 2]. > > >> It is similar to what SRCU does (per-cpu counter based grace period > > >> tracking), but implemented for userspace. The comments explain why this > > >> works without the memory barrier you identify as useless in SRCU. > > >> > > >> Following my implementation of side-rcu, I reviewed the SRCU comments > > >> and identified that the barrier "/* E */" appears to be useless. I even > > >> discussed this privately with Paul E. McKenney. > > >> > > >> My implementation and comments go further though, and skip the period > > >> "flip" entirely if the first pass observes that all readers (in both > > >> periods) are quiescent. > > > > > > Actually in SRCU, the first pass scans only 1 index, then does the > > > flip, and the second pass scans the second index. Without doing a > > > flip, an index cannot be scanned for forward progress reasons because > > > it is still "active". So I am curious how you can skip flip and still > > > scan both indexes? I will dig more into your implementation to learn more. > > > > If we look at SRCU read-side: > > > > int __srcu_read_lock(struct srcu_struct *ssp) > > { > > int idx; > > > > idx = READ_ONCE(ssp->srcu_idx) & 0x1; > > this_cpu_inc(ssp->sda->srcu_lock_count[idx]); > > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > > return idx; > > } > > > > If the thread is preempted for a long period of time between load of > > ssp->srcu_idx and increment of srcu_lock_count[idx], this means this > > thread can appear as a "new reader" for the idx period at any arbitrary > > time in the future, independently of which period is the current one > > within a future grace period. > > > > As a result, the grace period algorithm needs to inherently support the > > fact that a "new reader" can appear in any of the two periods, > > independently of the current period state. > > > > As a result, this means that while within period "0", we _need_ to allow > > newly coming readers to appear as we scan period "0". > > Sure, it already does handle it but that is I believe it is a corner > case, not the norm. > > > As a result, we can simply scan both periods 0/1 for reader quiescence, > > even while new readers appear within those periods. > > I think this is a bit dangerous. 
Yes there is the preemption thing you > mentioned above, but that is bounded since you can only have a fixed > number of tasks that underwent that preemption, and it is quite rare > in the sense, each reader should get preempted just after sampling idx > but not incrementing lock count. > > However, if we scan while new readers appear (outside of the above > preemption problem), we can have counter wrap causing a false match > much quicker. > The scan loop is: > check_readers(idx) { > count_all_unlocks(idx); > smp_mb(); > count_all_locks(idx); > bool done = (locks == unlocks) > if (done) { > // readers are done, end scan for this idx. > } else { > // try again later > } > } > > So if check_readers() got preempted just after the smp_mb(), then you > can have lots of tasks enter and exit the read-side critical section > and increment the locks count. Eventually locks == unlocks will > happen, and it is screwed. Sure this is also theoretical, but yeah > that issue can be made "worse" by scanning active readers > deliberately, especially when such readers can also nest arbitrarily. > > > As a result, flipping between periods 0/1 is just relevant for forward > > progress, not for correctness. > > Sure, agreed, forward progress. Adding to the last statement "But also correctness as described above". thanks, - Joel
On 2022-12-18 19:24, Joel Fernandes wrote: > On Sun, Dec 18, 2022 at 7:04 PM Joel Fernandes <joel@joelfernandes.org> wrote: >> >> Hi Mathieu, >> >> On Sun, Dec 18, 2022 at 6:38 PM Mathieu Desnoyers >> <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2022-12-18 16:30, Joel Fernandes wrote: >>>> Hi Mathieu, >>>> >>>> On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers >>>> <mathieu.desnoyers@efficios.com> wrote: >>>>> >>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote: >>>>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I >>>>>> can say to remove it, other than the possibility that it is unnecessary, is to >>>>>> not have extra code that does not help. However, since we are issuing a fully >>>>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway. >>>>>> >>>>>> For this reason, please consider these patches as "informational", than a >>>>>> "please merge". :-) Though, feel free to consider merging if you agree! >>>>>> >>>>>> All SRCU scenarios pass with these, with 6 hours of testing. >>>>> >>>>> Hi Joel, >>>>> >>>>> Please have a look at the comments in my side-rcu implementation [1, 2]. >>>>> It is similar to what SRCU does (per-cpu counter based grace period >>>>> tracking), but implemented for userspace. The comments explain why this >>>>> works without the memory barrier you identify as useless in SRCU. >>>>> >>>>> Following my implementation of side-rcu, I reviewed the SRCU comments >>>>> and identified that the barrier "/* E */" appears to be useless. I even >>>>> discussed this privately with Paul E. McKenney. >>>>> >>>>> My implementation and comments go further though, and skip the period >>>>> "flip" entirely if the first pass observes that all readers (in both >>>>> periods) are quiescent. >>>> >>>> Actually in SRCU, the first pass scans only 1 index, then does the >>>> flip, and the second pass scans the second index. Without doing a >>>> flip, an index cannot be scanned for forward progress reasons because >>>> it is still "active". So I am curious how you can skip flip and still >>>> scan both indexes? I will dig more into your implementation to learn more. >>> >>> If we look at SRCU read-side: >>> >>> int __srcu_read_lock(struct srcu_struct *ssp) >>> { >>> int idx; >>> >>> idx = READ_ONCE(ssp->srcu_idx) & 0x1; >>> this_cpu_inc(ssp->sda->srcu_lock_count[idx]); >>> smp_mb(); /* B */ /* Avoid leaking the critical section. */ >>> return idx; >>> } >>> >>> If the thread is preempted for a long period of time between load of >>> ssp->srcu_idx and increment of srcu_lock_count[idx], this means this >>> thread can appear as a "new reader" for the idx period at any arbitrary >>> time in the future, independently of which period is the current one >>> within a future grace period. >>> >>> As a result, the grace period algorithm needs to inherently support the >>> fact that a "new reader" can appear in any of the two periods, >>> independently of the current period state. >>> >>> As a result, this means that while within period "0", we _need_ to allow >>> newly coming readers to appear as we scan period "0". >> >> Sure, it already does handle it but that is I believe it is a corner >> case, not the norm. >> >>> As a result, we can simply scan both periods 0/1 for reader quiescence, >>> even while new readers appear within those periods. >> >> I think this is a bit dangerous. 
Yes there is the preemption thing you >> mentioned above, but that is bounded since you can only have a fixed >> number of tasks that underwent that preemption, and it is quite rare >> in the sense, each reader should get preempted just after sampling idx >> but not incrementing lock count. >> >> However, if we scan while new readers appear (outside of the above >> preemption problem), we can have counter wrap causing a false match >> much quicker. >> The scan loop is: >> check_readers(idx) { >> count_all_unlocks(idx); >> smp_mb(); >> count_all_locks(idx); >> bool done = (locks == unlocks) >> if (done) { >> // readers are done, end scan for this idx. >> } else { >> // try again later >> } >> } >> >> So if check_readers() got preempted just after the smp_mb(), then you >> can have lots of tasks enter and exit the read-side critical section >> and increment the locks count. Eventually locks == unlocks will >> happen, and it is screwed. Sure this is also theoretical, but yeah >> that issue can be made "worse" by scanning active readers >> deliberately, especially when such readers can also nest arbitrarily. >> >>> As a result, flipping between periods 0/1 is just relevant for forward >>> progress, not for correctness. >> >> Sure, agreed, forward progress. > > Adding to the last statement "But also correctness as described above". Exactly how many entry/exit of the read-side critical section while the grace period is preempted do you need to trigger this ? On a 64-bit system, where 64-bit counters are used, AFAIU this need to be exactly 2^64 read-side critical sections. There are other synchronization algorithms such as seqlocks which are quite happy with much less protection against overflow (using a 32-bit counter even on 64-bit architectures). For practical purposes, I suspect this issue is really just theoretical. Or am I missing your point ? Thanks, Mathieu > > thanks, > > - Joel
On Sun, Dec 18, 2022 at 8:49 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: [...] > >>>>> > >>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote: > >>>>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I > >>>>>> can say to remove it, other than the possibility that it is unnecessary, is to > >>>>>> not have extra code that does not help. However, since we are issuing a fully > >>>>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway. > >>>>>> > >>>>>> For this reason, please consider these patches as "informational", than a > >>>>>> "please merge". :-) Though, feel free to consider merging if you agree! > >>>>>> > >>>>>> All SRCU scenarios pass with these, with 6 hours of testing. > >>>>> > >>>>> Hi Joel, > >>>>> > >>>>> Please have a look at the comments in my side-rcu implementation [1, 2]. > >>>>> It is similar to what SRCU does (per-cpu counter based grace period > >>>>> tracking), but implemented for userspace. The comments explain why this > >>>>> works without the memory barrier you identify as useless in SRCU. > >>>>> > >>>>> Following my implementation of side-rcu, I reviewed the SRCU comments > >>>>> and identified that the barrier "/* E */" appears to be useless. I even > >>>>> discussed this privately with Paul E. McKenney. > >>>>> > >>>>> My implementation and comments go further though, and skip the period > >>>>> "flip" entirely if the first pass observes that all readers (in both > >>>>> periods) are quiescent. > >>>> > >>>> Actually in SRCU, the first pass scans only 1 index, then does the > >>>> flip, and the second pass scans the second index. Without doing a > >>>> flip, an index cannot be scanned for forward progress reasons because > >>>> it is still "active". So I am curious how you can skip flip and still > >>>> scan both indexes? I will dig more into your implementation to learn more. > >>> > >>> If we look at SRCU read-side: > >>> > >>> int __srcu_read_lock(struct srcu_struct *ssp) > >>> { > >>> int idx; > >>> > >>> idx = READ_ONCE(ssp->srcu_idx) & 0x1; > >>> this_cpu_inc(ssp->sda->srcu_lock_count[idx]); > >>> smp_mb(); /* B */ /* Avoid leaking the critical section. */ > >>> return idx; > >>> } > >>> > >>> If the thread is preempted for a long period of time between load of > >>> ssp->srcu_idx and increment of srcu_lock_count[idx], this means this > >>> thread can appear as a "new reader" for the idx period at any arbitrary > >>> time in the future, independently of which period is the current one > >>> within a future grace period. > >>> > >>> As a result, the grace period algorithm needs to inherently support the > >>> fact that a "new reader" can appear in any of the two periods, > >>> independently of the current period state. > >>> > >>> As a result, this means that while within period "0", we _need_ to allow > >>> newly coming readers to appear as we scan period "0". > >> > >> Sure, it already does handle it but that is I believe it is a corner > >> case, not the norm. > >> > >>> As a result, we can simply scan both periods 0/1 for reader quiescence, > >>> even while new readers appear within those periods. > >> > >> I think this is a bit dangerous. Yes there is the preemption thing you > >> mentioned above, but that is bounded since you can only have a fixed > >> number of tasks that underwent that preemption, and it is quite rare > >> in the sense, each reader should get preempted just after sampling idx > >> but not incrementing lock count. 
> >> > >> However, if we scan while new readers appear (outside of the above > >> preemption problem), we can have counter wrap causing a false match > >> much quicker. > >> The scan loop is: > >> check_readers(idx) { > >> count_all_unlocks(idx); > >> smp_mb(); > >> count_all_locks(idx); > >> bool done = (locks == unlocks) > >> if (done) { > >> // readers are done, end scan for this idx. > >> } else { > >> // try again later > >> } > >> } > >> > >> So if check_readers() got preempted just after the smp_mb(), then you > >> can have lots of tasks enter and exit the read-side critical section > >> and increment the locks count. Eventually locks == unlocks will > >> happen, and it is screwed. Sure this is also theoretical, but yeah > >> that issue can be made "worse" by scanning active readers > >> deliberately, especially when such readers can also nest arbitrarily. > >> > >>> As a result, flipping between periods 0/1 is just relevant for forward > >>> progress, not for correctness. > >> > >> Sure, agreed, forward progress. > > > > Adding to the last statement "But also correctness as described above". > > Exactly how many entry/exit of the read-side critical section while the > grace period is preempted do you need to trigger this ? It depends on how many readers are active during the preemption of the scan code. Say the preemption happened after per-CPU unlock counts were totalled. Then AFAICS, if there are N active readers which need the grace period to wait, you need (2^sizeof(int) - N) number of lock+unlock to happen. > On a 64-bit system, where 64-bit counters are used, AFAIU this need to > be exactly 2^64 read-side critical sections. Yes, but what about 32-bit systems? > There are other synchronization algorithms such as seqlocks which are > quite happy with much less protection against overflow (using a 32-bit > counter even on 64-bit architectures). The seqlock is an interesting point. > For practical purposes, I suspect this issue is really just theoretical. I have to ask, what is the benefit of avoiding a flip and scanning active readers? Is the issue about grace period delay or performance? If so, it might be worth prototyping that approach and measuring using rcutorture/rcuscale. If there is significant benefit to current approach, then IMO it is worth exploring. > Or am I missing your point ? No, I think you are not. Let me know if I missed something. Thanks, - Joel > > > > > > thanks, > > > > - Joel > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Sun, Dec 18, 2022 at 8:49 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: > [...] > > >>>>> > > >>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote: > > >>>>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I > > >>>>>> can say to remove it, other than the possibility that it is unnecessary, is to > > >>>>>> not have extra code that does not help. However, since we are issuing a fully > > >>>>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway. > > >>>>>> > > >>>>>> For this reason, please consider these patches as "informational", than a > > >>>>>> "please merge". :-) Though, feel free to consider merging if you agree! > > >>>>>> > > >>>>>> All SRCU scenarios pass with these, with 6 hours of testing. > > >>>>> > > >>>>> Hi Joel, > > >>>>> > > >>>>> Please have a look at the comments in my side-rcu implementation [1, 2]. > > >>>>> It is similar to what SRCU does (per-cpu counter based grace period > > >>>>> tracking), but implemented for userspace. The comments explain why this > > >>>>> works without the memory barrier you identify as useless in SRCU. > > >>>>> > > >>>>> Following my implementation of side-rcu, I reviewed the SRCU comments > > >>>>> and identified that the barrier "/* E */" appears to be useless. I even > > >>>>> discussed this privately with Paul E. McKenney. > > >>>>> > > >>>>> My implementation and comments go further though, and skip the period > > >>>>> "flip" entirely if the first pass observes that all readers (in both > > >>>>> periods) are quiescent. > > >>>> > > >>>> Actually in SRCU, the first pass scans only 1 index, then does the > > >>>> flip, and the second pass scans the second index. Without doing a > > >>>> flip, an index cannot be scanned for forward progress reasons because > > >>>> it is still "active". So I am curious how you can skip flip and still > > >>>> scan both indexes? I will dig more into your implementation to learn more. > > >>> > > >>> If we look at SRCU read-side: > > >>> > > >>> int __srcu_read_lock(struct srcu_struct *ssp) > > >>> { > > >>> int idx; > > >>> > > >>> idx = READ_ONCE(ssp->srcu_idx) & 0x1; > > >>> this_cpu_inc(ssp->sda->srcu_lock_count[idx]); > > >>> smp_mb(); /* B */ /* Avoid leaking the critical section. */ > > >>> return idx; > > >>> } > > >>> > > >>> If the thread is preempted for a long period of time between load of > > >>> ssp->srcu_idx and increment of srcu_lock_count[idx], this means this > > >>> thread can appear as a "new reader" for the idx period at any arbitrary > > >>> time in the future, independently of which period is the current one > > >>> within a future grace period. > > >>> > > >>> As a result, the grace period algorithm needs to inherently support the > > >>> fact that a "new reader" can appear in any of the two periods, > > >>> independently of the current period state. > > >>> > > >>> As a result, this means that while within period "0", we _need_ to allow > > >>> newly coming readers to appear as we scan period "0". > > >> > > >> Sure, it already does handle it but that is I believe it is a corner > > >> case, not the norm. > > >> > > >>> As a result, we can simply scan both periods 0/1 for reader quiescence, > > >>> even while new readers appear within those periods. > > >> > > >> I think this is a bit dangerous. 
Yes there is the preemption thing you > > >> mentioned above, but that is bounded since you can only have a fixed > > >> number of tasks that underwent that preemption, and it is quite rare > > >> in the sense, each reader should get preempted just after sampling idx > > >> but not incrementing lock count. > > >> > > >> However, if we scan while new readers appear (outside of the above > > >> preemption problem), we can have counter wrap causing a false match > > >> much quicker. > > >> The scan loop is: > > >> check_readers(idx) { > > >> count_all_unlocks(idx); > > >> smp_mb(); > > >> count_all_locks(idx); > > >> bool done = (locks == unlocks) > > >> if (done) { > > >> // readers are done, end scan for this idx. > > >> } else { > > >> // try again later > > >> } > > >> } > > >> > > >> So if check_readers() got preempted just after the smp_mb(), then you > > >> can have lots of tasks enter and exit the read-side critical section > > >> and increment the locks count. Eventually locks == unlocks will > > >> happen, and it is screwed. Sure this is also theoretical, but yeah > > >> that issue can be made "worse" by scanning active readers > > >> deliberately, especially when such readers can also nest arbitrarily. > > >> > > >>> As a result, flipping between periods 0/1 is just relevant for forward > > >>> progress, not for correctness. > > >> > > >> Sure, agreed, forward progress. > > > > > > Adding to the last statement "But also correctness as described above". > > > > Exactly how many entry/exit of the read-side critical section while the > > grace period is preempted do you need to trigger this ? > > It depends on how many readers are active during the preemption of the > scan code. Say the preemption happened after per-CPU unlock counts > were totalled. Then AFAICS, if there are N active readers which need > the grace period to wait, you need (2^sizeof(int) - N) number of > lock+unlock to happen. Sorry, here I meant (2^(sizeof(unsigned long) * 8) - N) which is 2^32 on 32-bit systems. thanks, - Joel > > On a 64-bit system, where 64-bit counters are used, AFAIU this need to > > be exactly 2^64 read-side critical sections. > > Yes, but what about 32-bit systems? > > > There are other synchronization algorithms such as seqlocks which are > > quite happy with much less protection against overflow (using a 32-bit > > counter even on 64-bit architectures). > > The seqlock is an interesting point. > > > For practical purposes, I suspect this issue is really just theoretical. > > I have to ask, what is the benefit of avoiding a flip and scanning > active readers? Is the issue about grace period delay or performance? > If so, it might be worth prototyping that approach and measuring using > rcutorture/rcuscale. If there is significant benefit to current > approach, then IMO it is worth exploring. > > > Or am I missing your point ? > > No, I think you are not. Let me know if I missed something. > > Thanks, > > - Joel > > > > > > > > > > > > thanks, > > > > > > - Joel > > > > -- > > Mathieu Desnoyers > > EfficiOS Inc. > > https://www.efficios.com > >
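To put rough numbers on the 2^32 versus 2^64 distinction being discussed, here is a back-of-the-envelope calculation; the 10^8 lock+unlock pairs per second rate is purely an assumption for the arithmetic.

#include <stdio.h>

int main(void)
{
	double rate = 1e8;	/* assumed lock+unlock pairs per second */

	printf("32-bit wrap: ~%.0f seconds\n", 4294967296.0 / rate);
	printf("64-bit wrap: ~%.0f years\n",
	       18446744073709551616.0 / rate / (3600.0 * 24 * 365));
	return 0;
}

At that (deliberately high) assumed rate the 32-bit counter wraps in well under a minute while the 64-bit counter takes thousands of years, which is the asymmetry behind the 32-bit concern raised here.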
On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> can say to remove it, other than the possibility that it is unnecessary, is to
> not have extra code that does not help. However, since we are issuing a fully
> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
>
> For this reason, please consider these patches as "informational", than a
> "please merge". :-) Though, feel free to consider merging if you agree!
>
> All SRCU scenarios pass with these, with 6 hours of testing.
>
> thanks,
>
> - Joel
>
> Joel Fernandes (Google) (2):
> srcu: Remove comment about prior read lock counts
> srcu: Remove memory barrier "E" as it is not required

And litmus tests confirm that "E" does not really do what the comments
say, PTAL:

Test 1:

C mbe

(*
 * Result: sometimes
 * Does previous scan see old reader's lock count, if a new reader saw
 * the new srcu_idx?
 *)

{}

P0(int *lockcount, int *srcu_idx) // updater
{
	int r0;

	r0 = READ_ONCE(*lockcount);
	smp_mb(); // E
	WRITE_ONCE(*srcu_idx, 1);
}

P1(int *lockcount, int *srcu_idx) // reader
{
	int r0;

	WRITE_ONCE(*lockcount, 1); // previous reader
	smp_mb(); // B+C
	r0 = READ_ONCE(*srcu_idx); // new reader
}

exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *)

Test 2:

C mbe2

(*
 * Result: sometimes
 * If updater saw reader's lock count, was that reader using the old idx?
 *)

{}

P0(int *lockcount, int *srcu_idx) // updater
{
	int r0;

	r0 = READ_ONCE(*lockcount);
	smp_mb(); // E
	WRITE_ONCE(*srcu_idx, 1);
}

P1(int *lockcount, int *srcu_idx) // reader
{
	int r0;
	int r1;

	r1 = READ_ONCE(*srcu_idx); // previous reader
	WRITE_ONCE(*lockcount, 1); // previous reader
	smp_mb(); // B+C
	r0 = READ_ONCE(*srcu_idx); // new reader
}

exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *)

thanks,

- Joel

>
>  kernel/rcu/srcutree.c | 10 ----------
>  1 file changed, 10 deletions(-)
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
On Mon, Dec 19, 2022 at 11:07:17PM -0500, Joel Fernandes wrote: > On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google) > <joel@joelfernandes.org> wrote: > > > > Hello, I believe the pre-flip memory barrier is not required. The only reason I > > can say to remove it, other than the possibility that it is unnecessary, is to > > not have extra code that does not help. However, since we are issuing a fully > > memory-barrier after the flip, I cannot say that it hurts to do it anyway. > > > > For this reason, please consider these patches as "informational", than a > > "please merge". :-) Though, feel free to consider merging if you agree! > > > > All SRCU scenarios pass with these, with 6 hours of testing. > > > > thanks, > > > > - Joel > > > > Joel Fernandes (Google) (2): > > srcu: Remove comment about prior read lock counts > > srcu: Remove memory barrier "E" as it is not required > > And litmus tests confirm that "E" does not really do what the comments > say, PTAL: > Test 1: > C mbe > (* > * Result: sometimes > * Does previous scan see old reader's lock count, if a new reader saw > the new srcu_idx? > *) > > {} > > P0(int *lockcount, int *srcu_idx) // updater > { > int r0; > r0 = READ_ONCE(*lockcount); > smp_mb(); // E > WRITE_ONCE(*srcu_idx, 1); > } > > P1(int *lockcount, int *srcu_idx) // reader > { > int r0; > WRITE_ONCE(*lockcount, 1); // previous reader > smp_mb(); // B+C > r0 = READ_ONCE(*srcu_idx); // new reader > } > exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *) > > Test 2: > C mbe2 > > (* > * Result: sometimes > * If updater saw reader's lock count, was that reader using the old idx? > *) > > {} > > P0(int *lockcount, int *srcu_idx) // updater > { > int r0; > r0 = READ_ONCE(*lockcount); > smp_mb(); // E > WRITE_ONCE(*srcu_idx, 1); > } > > P1(int *lockcount, int *srcu_idx) // reader > { > int r0; > int r1; > r1 = READ_ONCE(*srcu_idx); // previous reader > WRITE_ONCE(*lockcount, 1); // previous reader > smp_mb(); // B+C > r0 = READ_ONCE(*srcu_idx); // new reader > } > exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *) Actually, starring at this some more, there is some form of dependency on the idx in order to build the address where the reader must write the lockcount to. Litmus doesn't support arrays but assuming that &ssp->sda->srcu_lock_count == 0 (note the & in the beginning), it could be modelized that way (I'm eluding the unlock part to simplify): --- C w-depend-r { PLOCK=LOCK0; } // updater P0(int *LOCK0, int *LOCK1, int **PLOCK) { int lock1; lock1 = READ_ONCE(*LOCK1); // READ from inactive idx smp_mb(); WRITE_ONCE(*PLOCK, LOCK1); // Flip idx } // reader P1(int **PLOCK) { int *plock; plock = READ_ONCE(*PLOCK); // Read active idx WRITE_ONCE(*plock, 1); // Write to active idx } exists (0:lock0=1) // never happens --- * Is it an address dependency? But the LKMM refers to that only in terms of LOAD - LOAD ordering. * Is it a control dependency? But the LKMM refers to that only in terms of branch with "if". So I don't know the name of the pattern but it makes litmus happy. Thanks.
On Tue, Dec 20, 2022 at 01:34:43PM +0100, Frederic Weisbecker wrote: > On Mon, Dec 19, 2022 at 11:07:17PM -0500, Joel Fernandes wrote: > > On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google) > > <joel@joelfernandes.org> wrote: > > > > > > Hello, I believe the pre-flip memory barrier is not required. The only reason I > > > can say to remove it, other than the possibility that it is unnecessary, is to > > > not have extra code that does not help. However, since we are issuing a fully > > > memory-barrier after the flip, I cannot say that it hurts to do it anyway. > > > > > > For this reason, please consider these patches as "informational", than a > > > "please merge". :-) Though, feel free to consider merging if you agree! > > > > > > All SRCU scenarios pass with these, with 6 hours of testing. > > > > > > thanks, > > > > > > - Joel > > > > > > Joel Fernandes (Google) (2): > > > srcu: Remove comment about prior read lock counts > > > srcu: Remove memory barrier "E" as it is not required > > > > And litmus tests confirm that "E" does not really do what the comments > > say, PTAL: > > Test 1: > > C mbe > > (* > > * Result: sometimes > > * Does previous scan see old reader's lock count, if a new reader saw > > the new srcu_idx? > > *) > > > > {} > > > > P0(int *lockcount, int *srcu_idx) // updater > > { > > int r0; > > r0 = READ_ONCE(*lockcount); > > smp_mb(); // E > > WRITE_ONCE(*srcu_idx, 1); > > } > > > > P1(int *lockcount, int *srcu_idx) // reader > > { > > int r0; > > WRITE_ONCE(*lockcount, 1); // previous reader > > smp_mb(); // B+C > > r0 = READ_ONCE(*srcu_idx); // new reader > > } > > exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *) > > > > Test 2: > > C mbe2 > > > > (* > > * Result: sometimes > > * If updater saw reader's lock count, was that reader using the old idx? > > *) > > > > {} > > > > P0(int *lockcount, int *srcu_idx) // updater > > { > > int r0; > > r0 = READ_ONCE(*lockcount); > > smp_mb(); // E > > WRITE_ONCE(*srcu_idx, 1); > > } > > > > P1(int *lockcount, int *srcu_idx) // reader > > { > > int r0; > > int r1; > > r1 = READ_ONCE(*srcu_idx); // previous reader > > WRITE_ONCE(*lockcount, 1); // previous reader > > smp_mb(); // B+C > > r0 = READ_ONCE(*srcu_idx); // new reader > > } > > exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *) > > Actually, starring at this some more, there is some form of dependency > on the idx in order to build the address where the reader must write the > lockcount to. Litmus doesn't support arrays but assuming that > &ssp->sda->srcu_lock_count == 0 (note the & in the beginning), it > could be modelized that way (I'm eluding the unlock part to simplify): > > --- > C w-depend-r > > { > PLOCK=LOCK0; > } > > // updater > P0(int *LOCK0, int *LOCK1, int **PLOCK) > { > int lock1; > > lock1 = READ_ONCE(*LOCK1); // READ from inactive idx > smp_mb(); > WRITE_ONCE(*PLOCK, LOCK1); // Flip idx > } > > // reader > P1(int **PLOCK) > { > int *plock; > > plock = READ_ONCE(*PLOCK); // Read active idx > WRITE_ONCE(*plock, 1); // Write to active idx > } > > exists (0:lock0=1) // never happens That's lock1=1, lemme do it again: C w-depend-r { PLOCK=LOCK0; } // updater P0(int *LOCK1, int **PLOCK) { int lock1; lock1 = READ_ONCE(*LOCK1); // READ from inactive idx smp_mb(); WRITE_ONCE(*PLOCK, LOCK1); // Flip idx } // reader P1(int **PLOCK) { int *plock; plock = READ_ONCE(*PLOCK); // Read active idx WRITE_ONCE(*plock, 1); // Write to active idx } exists (0:lock1=1) (* never *)
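In kernel terms, the dependency being modelled above runs from the load of the published index/pointer to the store into the counter it selects, what is usually called an address dependency to a store. A plain-C rendering of the same pattern (illustrative only, not srcutree.c):

static unsigned long bank[2];
static unsigned long *current_bank = &bank[0];

unsigned long updater(void)
{
	unsigned long snap;

	snap = READ_ONCE(bank[1]);		/* scan the inactive bank */
	smp_mb();
	WRITE_ONCE(current_bank, &bank[1]);	/* flip: publish the other bank */
	return snap;
}

void reader(void)
{
	unsigned long *b;

	b = READ_ONCE(current_bank);		/* load the active bank pointer */
	/*
	 * The store's address depends on the load above, so it cannot be
	 * performed against the old bank "before" that load: hardware does
	 * not speculate stores to a yet-unknown address.
	 */
	WRITE_ONCE(*b, READ_ONCE(*b) + 1);
}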
> On Dec 20, 2022, at 7:40 AM, Frederic Weisbecker <frederic@kernel.org> wrote: > > On Tue, Dec 20, 2022 at 01:34:43PM +0100, Frederic Weisbecker wrote: >>> On Mon, Dec 19, 2022 at 11:07:17PM -0500, Joel Fernandes wrote: >>> On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google) >>> <joel@joelfernandes.org> wrote: >>>> >>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I >>>> can say to remove it, other than the possibility that it is unnecessary, is to >>>> not have extra code that does not help. However, since we are issuing a fully >>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway. >>>> >>>> For this reason, please consider these patches as "informational", than a >>>> "please merge". :-) Though, feel free to consider merging if you agree! >>>> >>>> All SRCU scenarios pass with these, with 6 hours of testing. >>>> >>>> thanks, >>>> >>>> - Joel >>>> >>>> Joel Fernandes (Google) (2): >>>> srcu: Remove comment about prior read lock counts >>>> srcu: Remove memory barrier "E" as it is not required >>> >>> And litmus tests confirm that "E" does not really do what the comments >>> say, PTAL: >>> Test 1: >>> C mbe >>> (* >>> * Result: sometimes >>> * Does previous scan see old reader's lock count, if a new reader saw >>> the new srcu_idx? >>> *) >>> >>> {} >>> >>> P0(int *lockcount, int *srcu_idx) // updater >>> { >>> int r0; >>> r0 = READ_ONCE(*lockcount); >>> smp_mb(); // E >>> WRITE_ONCE(*srcu_idx, 1); >>> } >>> >>> P1(int *lockcount, int *srcu_idx) // reader >>> { >>> int r0; >>> WRITE_ONCE(*lockcount, 1); // previous reader >>> smp_mb(); // B+C >>> r0 = READ_ONCE(*srcu_idx); // new reader >>> } >>> exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *) >>> >>> Test 2: >>> C mbe2 >>> >>> (* >>> * Result: sometimes >>> * If updater saw reader's lock count, was that reader using the old idx? >>> *) >>> >>> {} >>> >>> P0(int *lockcount, int *srcu_idx) // updater >>> { >>> int r0; >>> r0 = READ_ONCE(*lockcount); >>> smp_mb(); // E >>> WRITE_ONCE(*srcu_idx, 1); >>> } >>> >>> P1(int *lockcount, int *srcu_idx) // reader >>> { >>> int r0; >>> int r1; >>> r1 = READ_ONCE(*srcu_idx); // previous reader >>> WRITE_ONCE(*lockcount, 1); // previous reader >>> smp_mb(); // B+C >>> r0 = READ_ONCE(*srcu_idx); // new reader >>> } >>> exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *) >> >> Actually, starring at this some more, there is some form of dependency >> on the idx in order to build the address where the reader must write the >> lockcount to. 
Litmus doesn't support arrays but assuming that >> &ssp->sda->srcu_lock_count == 0 (note the & in the beginning), it >> could be modelized that way (I'm eluding the unlock part to simplify): >> >> --- >> C w-depend-r >> >> { >> PLOCK=LOCK0; >> } >> >> // updater >> P0(int *LOCK0, int *LOCK1, int **PLOCK) >> { >> int lock1; >> >> lock1 = READ_ONCE(*LOCK1); // READ from inactive idx >> smp_mb(); >> WRITE_ONCE(*PLOCK, LOCK1); // Flip idx >> } >> >> // reader >> P1(int **PLOCK) >> { >> int *plock; >> >> plock = READ_ONCE(*PLOCK); // Read active idx >> WRITE_ONCE(*plock, 1); // Write to active idx >> } >> >> exists (0:lock0=1) // never happens > > That's lock1=1, lemme do it again: > > C w-depend-r > > { > PLOCK=LOCK0; > } > > // updater > P0(int *LOCK1, int **PLOCK) > { > int lock1; > > lock1 = READ_ONCE(*LOCK1); // READ from inactive idx > smp_mb(); > WRITE_ONCE(*PLOCK, LOCK1); // Flip idx > } > > // reader > P1(int **PLOCK) > { > int *plock; > > plock = READ_ONCE(*PLOCK); // Read active idx > WRITE_ONCE(*plock, 1); // Write to active idx I am a bit lost here, why would the reader want to write to the active idx? The reader does not update the idx, only the lock count. Further, the comment does not talk about implicit memory ordering, it’s talking about explicit ordering due to B+C on one side, and E on the other. Thanks! - Joel > } > > exists (0:lock1=1) (* never *)
On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote: > > C w-depend-r > > > > { > > PLOCK=LOCK0; > > } > > > > // updater > > P0(int *LOCK1, int **PLOCK) > > { > > int lock1; > > > > lock1 = READ_ONCE(*LOCK1); // READ from inactive idx > > smp_mb(); > > WRITE_ONCE(*PLOCK, LOCK1); // Flip idx > > } > > > > // reader > > P1(int **PLOCK) > > { > > int *plock; > > > > plock = READ_ONCE(*PLOCK); // Read active idx > > WRITE_ONCE(*plock, 1); // Write to active idx > > I am a bit lost here, why would the reader want to write to the active idx? > The reader does not update the idx, only the lock count. So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right? The write is then displayed that way: this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); But things could be also thought the other way around with idx being the base address and ssp->sda->srcu_lock_count being the offset. this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter); That would require to change some high level types but the result would be the same from the memory point of view (and even from the ASM point of view). In the end we are dealing with the same address and access. Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change. So it can be zero for example. Then the above increment becomes: this_cpu_inc(idx.counter); And then it can be modelized as in the above litmus test. I had to play that trick because litmus doesn't support arrays but I believe it stands. Now of course I may well have got something wrong since I've always been terrible at maths... > > Further, the comment does not talk about implicit memory ordering, it’s talking about explicit ordering due to B+C on one side, and E on the other. Not arguing I'm also still confused by the comment... Thanks.
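Frederic's base/offset argument leans on the fact that C array indexing is symmetric, which can be checked directly with a trivial stand-alone example:

#include <stdio.h>

int main(void)
{
	unsigned long counters[2] = { 0, 0 };
	int idx = 1;

	counters[idx]++;	/* the usual spelling: *(counters + idx) */
	idx[counters]++;	/* same address: *(idx + counters) */
	printf("%lu\n", counters[1]);	/* prints 2 */
	return 0;
}

Either way the generated access is to the same address, which is why treating idx as the base and the counter array as the offset, as in the litmus modelization above, describes the very same memory access.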
> On Dec 20, 2022, at 9:07 AM, Frederic Weisbecker <frederic@kernel.org> wrote: > > On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote: >>> C w-depend-r >>> >>> { >>> PLOCK=LOCK0; >>> } >>> >>> // updater >>> P0(int *LOCK1, int **PLOCK) >>> { >>> int lock1; >>> >>> lock1 = READ_ONCE(*LOCK1); // READ from inactive idx >>> smp_mb(); >>> WRITE_ONCE(*PLOCK, LOCK1); // Flip idx >>> } >>> >>> // reader >>> P1(int **PLOCK) >>> { >>> int *plock; >>> >>> plock = READ_ONCE(*PLOCK); // Read active idx >>> WRITE_ONCE(*plock, 1); // Write to active idx >> >> I am a bit lost here, why would the reader want to write to the active idx? >> The reader does not update the idx, only the lock count. > > So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right? > The write is then displayed that way: > > this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); > > But things could be also thought the other way around with idx being the base address and > ssp->sda->srcu_lock_count being the offset. > > this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter); > > That would require to change some high level types but the result would be the same from > the memory point of view (and even from the ASM point of view). In the end we > are dealing with the same address and access. > > Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change. > So it can be zero for example. Then the above increment becomes: > > this_cpu_inc(idx.counter); > > And then it can be modelized as in the above litmus test. > > I had to play that trick because litmus doesn't support arrays but I believe > it stands. Now of course I may well have got something wrong since I've always > been terrible at maths... Ah ok, I get where you were going with that. Yes there is address dependency between reading idx and writing lock count. But IMHO, the access on the update side is trying to order write to index, and reads from a lock count of a previous index (as far as E / B+C is concerned). So IMHO, on the read side you have to consider 2 consecutive readers and not the same reader in order to pair the same accesses correctly. But I could be missing something. >> Further, the comment does not talk about implicit memory ordering, it’s talking about explicit ordering due to B+C on one side, and E on the other. > > Not arguing I'm also still confused by the comment... ;-) Thanks, - Joel > > Thanks.
On 2022-12-19 20:04, Joel Fernandes wrote: > On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: >> >> On Sun, Dec 18, 2022 at 8:49 PM Mathieu Desnoyers >> <mathieu.desnoyers@efficios.com> wrote: >> [...] >>>>>>>> >>>>>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote: [...] >>>>> >>>>> I think this is a bit dangerous. Yes there is the preemption thing you >>>>> mentioned above, but that is bounded since you can only have a fixed >>>>> number of tasks that underwent that preemption, and it is quite rare >>>>> in the sense, each reader should get preempted just after sampling idx >>>>> but not incrementing lock count. >>>>> >>>>> However, if we scan while new readers appear (outside of the above >>>>> preemption problem), we can have counter wrap causing a false match >>>>> much quicker. >>>>> The scan loop is: >>>>> check_readers(idx) { >>>>> count_all_unlocks(idx); >>>>> smp_mb(); >>>>> count_all_locks(idx); >>>>> bool done = (locks == unlocks) >>>>> if (done) { >>>>> // readers are done, end scan for this idx. >>>>> } else { >>>>> // try again later >>>>> } >>>>> } >>>>> >>>>> So if check_readers() got preempted just after the smp_mb(), then you >>>>> can have lots of tasks enter and exit the read-side critical section >>>>> and increment the locks count. Eventually locks == unlocks will >>>>> happen, and it is screwed. Sure this is also theoretical, but yeah >>>>> that issue can be made "worse" by scanning active readers >>>>> deliberately, especially when such readers can also nest arbitrarily. >>>>> >>>>>> As a result, flipping between periods 0/1 is just relevant for forward >>>>>> progress, not for correctness. >>>>> >>>>> Sure, agreed, forward progress. >>>> >>>> Adding to the last statement "But also correctness as described above". >>> >>> Exactly how many entry/exit of the read-side critical section while the >>> grace period is preempted do you need to trigger this ? >> >> It depends on how many readers are active during the preemption of the >> scan code. Say the preemption happened after per-CPU unlock counts >> were totalled. Then AFAICS, if there are N active readers which need >> the grace period to wait, you need (2^sizeof(int) - N) number of >> lock+unlock to happen. > > Sorry, here I meant (2^(sizeof(unsigned long) * 8) - N) which is 2^32 > on 32-bit systems. And 2^64 on 64-bit systems. > > thanks, > > - Joel > > >>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to >>> be exactly 2^64 read-side critical sections. >> >> Yes, but what about 32-bit systems? The overflow indeed happens after 2^32 increments, just like seqlock. The question we need to ask is therefore: if 2^32 is good enough for seqlock, why isn't it good enough for SRCU ? >> >>> There are other synchronization algorithms such as seqlocks which are >>> quite happy with much less protection against overflow (using a 32-bit >>> counter even on 64-bit architectures). >> >> The seqlock is an interesting point. >> >>> For practical purposes, I suspect this issue is really just theoretical. >> >> I have to ask, what is the benefit of avoiding a flip and scanning >> active readers? Is the issue about grace period delay or performance? >> If so, it might be worth prototyping that approach and measuring using >> rcutorture/rcuscale. If there is significant benefit to current >> approach, then IMO it is worth exploring. 
The main benefit I expect is improved performance of the grace period implementation in common cases where there are few or no readers present, especially on machines with many cpus. It allows scanning both periods (0/1) for each cpu within the same pass, therefore loading both period's unlock counters sitting in the same cache line at once (improved locality), and then loading both period's lock counters, also sitting in the same cache line. It also allows skipping the period flip entirely if there are no readers present, which is an -arguably- tiny performance improvement as well. Thanks, Mathieu >> >>> Or am I missing your point ? >> >> No, I think you are not. Let me know if I missed something. >> >> Thanks, >> >> - Joel >> >> >>> >>> >>>> >>>> thanks, >>>> >>>> - Joel >>> >>> -- >>> Mathieu Desnoyers >>> EfficiOS Inc. >>> https://www.efficios.com >>>
Hi Mathieu, On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-19 20:04, Joel Fernandes wrote: > > On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: [...] > >>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to > >>> be exactly 2^64 read-side critical sections. > >> > >> Yes, but what about 32-bit systems? > > The overflow indeed happens after 2^32 increments, just like seqlock. > The question we need to ask is therefore: if 2^32 is good enough for > seqlock, why isn't it good enough for SRCU ? I think Paul said wrap around does happen with SRCU on 32-bit but I'll let him talk more about it. If 32-bit is good enough, let us also drop the size of the counters for 64-bit then? > >>> There are other synchronization algorithms such as seqlocks which are > >>> quite happy with much less protection against overflow (using a 32-bit > >>> counter even on 64-bit architectures). > >> > >> The seqlock is an interesting point. > >> > >>> For practical purposes, I suspect this issue is really just theoretical. > >> > >> I have to ask, what is the benefit of avoiding a flip and scanning > >> active readers? Is the issue about grace period delay or performance? > >> If so, it might be worth prototyping that approach and measuring using > >> rcutorture/rcuscale. If there is significant benefit to current > >> approach, then IMO it is worth exploring. > > The main benefit I expect is improved performance of the grace period > implementation in common cases where there are few or no readers > present, especially on machines with many cpus. > > It allows scanning both periods (0/1) for each cpu within the same pass, > therefore loading both period's unlock counters sitting in the same > cache line at once (improved locality), and then loading both period's > lock counters, also sitting in the same cache line. > > It also allows skipping the period flip entirely if there are no readers > present, which is an -arguably- tiny performance improvement as well. The issue of counter wrap aside, what if a new reader always shows up in the active index being scanned, then can you not delay the GP indefinitely? It seems like writer-starvation is possible then (sure it is possible also with preemption after reader-index-sampling, but scanning active index deliberately will make that worse). Seqlock does not have such writer starvation just because the writer does not care about what the readers are doing. That said, the approach of scanning both counters does seem attractive for when there are no readers, for the reasons you mentioned. Maybe a heuristic to count the number of readers might help? If we are not reader-heavy, then scan both. Otherwise, just scan the inactive ones, and also couple that heuristic with the number of CPUs. I am interested in working on such a design with you! Let us do it and prototype/measure. ;-) Thanks.
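To make the 32-bit wrap scenario discussed above concrete, here is a toy user-space illustration (not kernel code; the numbers follow Joel's 2^32 - N estimate with N = 1): once enough new readers enter while the scan is preempted, the lock total wraps back to the previously sampled unlock total and the period looks falsely quiescent.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t unlock_snapshot = 0;  /* unlock total sampled before the scan was preempted */
	uint32_t locks = 0;

	locks += 1;           /* N = 1 pre-existing reader the GP must still wait for */
	locks += UINT32_MAX;  /* 2^32 - 1 new readers enter during the preemption */

	/* The lock total has wrapped back to the unlock snapshot: a false match. */
	printf("%s\n", locks == unlock_snapshot ? "false match" : "still waiting");
	return 0;
}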
On 2022-12-20 13:05, Joel Fernandes wrote: > Hi Mathieu, > > On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2022-12-19 20:04, Joel Fernandes wrote: >>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: > [...] >>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to >>>>> be exactly 2^64 read-side critical sections. >>>> >>>> Yes, but what about 32-bit systems? >> >> The overflow indeed happens after 2^32 increments, just like seqlock. >> The question we need to ask is therefore: if 2^32 is good enough for >> seqlock, why isn't it good enough for SRCU ? > > I think Paul said wrap around does happen with SRCU on 32-bit but I'll > let him talk more about it. If 32-bit is good enough, let us also drop > the size of the counters for 64-bit then? > >>>>> There are other synchronization algorithms such as seqlocks which are >>>>> quite happy with much less protection against overflow (using a 32-bit >>>>> counter even on 64-bit architectures). >>>> >>>> The seqlock is an interesting point. >>>> >>>>> For practical purposes, I suspect this issue is really just theoretical. >>>> >>>> I have to ask, what is the benefit of avoiding a flip and scanning >>>> active readers? Is the issue about grace period delay or performance? >>>> If so, it might be worth prototyping that approach and measuring using >>>> rcutorture/rcuscale. If there is significant benefit to current >>>> approach, then IMO it is worth exploring. >> >> The main benefit I expect is improved performance of the grace period >> implementation in common cases where there are few or no readers >> present, especially on machines with many cpus. >> >> It allows scanning both periods (0/1) for each cpu within the same pass, >> therefore loading both period's unlock counters sitting in the same >> cache line at once (improved locality), and then loading both period's >> lock counters, also sitting in the same cache line. >> >> It also allows skipping the period flip entirely if there are no readers >> present, which is an -arguably- tiny performance improvement as well. > > The issue of counter wrap aside, what if a new reader always shows up > in the active index being scanned, then can you not delay the GP > indefinitely? It seems like writer-starvation is possible then (sure > it is possible also with preemption after reader-index-sampling, but > scanning active index deliberately will make that worse). Seqlock does > not have such writer starvation just because the writer does not care > about what the readers are doing. No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods. If there are readers in the "non-current" period, the grace period waits for them. If there are readers in the "current" period, it flips the period and then waits for them. > > That said, the approach of scanning both counters does seem attractive > for when there are no readers, for the reasons you mentioned. Maybe a > heuristic to count the number of readers might help? If we are not > reader-heavy, then scan both. Otherwise, just scan the inactive ones, > and also couple that heuristic with the number of CPUs. I am > interested in working on such a design with you! Let us do it and > prototype/measure. 
;-) Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass. I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :) Thanks, Mathieu
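For readers following along, a rough user-space sketch of the grace-period structure Mathieu describes (loosely modeled on side-rcu; every name is invented, the busy-waits stand in for the real wait/wakeup machinery, and the overflow concern discussed earlier is ignored):

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

struct cpu_counts {
	/*
	 * Both periods' counters sit next to each other, so one pass over
	 * a cpu touches a single cache line for the unlock side and a
	 * single cache line for the lock side.
	 */
	atomic_ulong lock[2];
	atomic_ulong unlock[2];
};

static struct cpu_counts cpu_data[NR_CPUS];
static atomic_int period;  /* currently active period: 0 or 1 */

/* True if no reader is still active in the given period. */
static bool period_quiescent(int idx)
{
	unsigned long locks = 0, unlocks = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		unlocks += atomic_load(&cpu_data[cpu].unlock[idx]);

	/* Order the unlock loads before the lock loads. */
	atomic_thread_fence(memory_order_seq_cst);

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		locks += atomic_load(&cpu_data[cpu].lock[idx]);

	return locks == unlocks;
}

static void toy_synchronize(void)
{
	int cur = atomic_load(&period);

	atomic_thread_fence(memory_order_seq_cst);  /* start-of-GP barrier */

	/* First pass: scan both periods; skip the flip if nobody is there. */
	if (period_quiescent(cur ^ 1) && period_quiescent(cur))
		goto done;

	/* Wait out readers in the non-current period. */
	while (!period_quiescent(cur ^ 1))
		;  /* a real implementation would sleep/retry here */

	/* Flip the period, then wait out readers that sampled the old one. */
	atomic_store(&period, cur ^ 1);
	atomic_thread_fence(memory_order_seq_cst);
	while (!period_quiescent(cur))
		;
done:
	atomic_thread_fence(memory_order_seq_cst);  /* end-of-GP barrier */
}

int main(void)
{
	toy_synchronize();  /* with no readers registered, the first pass suffices */
	return 0;
}

The point of interest is the first pass: it reads both periods' counters per cpu in one sweep and only falls back to the flip-and-wait sequence when readers are present.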
> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-20 13:05, Joel Fernandes wrote: >> Hi Mathieu, >>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers >>> <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2022-12-19 20:04, Joel Fernandes wrote: >>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: >> [...] >>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to >>>>>> be exactly 2^64 read-side critical sections. >>>>> >>>>> Yes, but what about 32-bit systems? >>> >>> The overflow indeed happens after 2^32 increments, just like seqlock. >>> The question we need to ask is therefore: if 2^32 is good enough for >>> seqlock, why isn't it good enough for SRCU ? >> I think Paul said wrap around does happen with SRCU on 32-bit but I'll >> let him talk more about it. If 32-bit is good enough, let us also drop >> the size of the counters for 64-bit then? >>>>>> There are other synchronization algorithms such as seqlocks which are >>>>>> quite happy with much less protection against overflow (using a 32-bit >>>>>> counter even on 64-bit architectures). >>>>> >>>>> The seqlock is an interesting point. >>>>> >>>>>> For practical purposes, I suspect this issue is really just theoretical. >>>>> >>>>> I have to ask, what is the benefit of avoiding a flip and scanning >>>>> active readers? Is the issue about grace period delay or performance? >>>>> If so, it might be worth prototyping that approach and measuring using >>>>> rcutorture/rcuscale. If there is significant benefit to current >>>>> approach, then IMO it is worth exploring. >>> >>> The main benefit I expect is improved performance of the grace period >>> implementation in common cases where there are few or no readers >>> present, especially on machines with many cpus. >>> >>> It allows scanning both periods (0/1) for each cpu within the same pass, >>> therefore loading both period's unlock counters sitting in the same >>> cache line at once (improved locality), and then loading both period's >>> lock counters, also sitting in the same cache line. >>> >>> It also allows skipping the period flip entirely if there are no readers >>> present, which is an -arguably- tiny performance improvement as well. >> The issue of counter wrap aside, what if a new reader always shows up >> in the active index being scanned, then can you not delay the GP >> indefinitely? It seems like writer-starvation is possible then (sure >> it is possible also with preemption after reader-index-sampling, but >> scanning active index deliberately will make that worse). Seqlock does >> not have such writer starvation just because the writer does not care >> about what the readers are doing. > > No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods. > > If there are readers in the "non-current" period, the grace period waits for them. > > If there are readers in the "current" period, it flips the period and then waits for them. Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. Sorry I did not know the details of your side-RCU to know you were already doing something like that. 
> >> That said, the approach of scanning both counters does seem attractive >> for when there are no readers, for the reasons you mentioned. Maybe a >> heuristic to count the number of readers might help? If we are not >> reader-heavy, then scan both. Otherwise, just scan the inactive ones, >> and also couple that heuristic with the number of CPUs. I am >> interested in working on such a design with you! Let us do it and >> prototype/measure. ;-) > > Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass. Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful. > I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :) Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after. I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… Cheers, - Joel > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On 2022-12-20 13:29, Joel Fernandes wrote: > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… I strongly suspect the memory barrier after flip is useless for the same reasons I mentioned explaining why the barrier before the flip is useless. However, we need to double-check that we have memory barriers at the beginning and end of synchronize_srcu, and between load of "unlock" counters and load of "lock" counters. Where is the barrier at the beginning of synchronize_srcu ? Thanks, Mathieu
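For reference, the "between load of unlock counters and load of lock counters" barrier Mathieu mentions is the smp_mb() in srcu_readers_active_idx_check(); roughly (paraphrased from memory of the 6.1-era kernel/rcu/srcutree.c, so treat the exact shape as approximate):

static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
{
	unsigned long unlocks;

	unlocks = srcu_readers_unlock_idx(ssp, idx);  /* sum the unlock counters */
	smp_mb(); /* A */                             /* order unlock loads before lock loads */
	return srcu_readers_lock_idx(ssp, idx) == unlocks;
}

The open question in the rest of the thread is about the barriers at the very beginning and end of synchronize_srcu(), not this one.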
On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-20 13:29, Joel Fernandes wrote: > > > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… > > I strongly suspect the memory barrier after flip is useless for the same > reasons I mentioned explaining why the barrier before the flip is useless. > > However, we need to double-check that we have memory barriers at the > beginning and end of synchronize_srcu, and between load of "unlock" > counters and load of "lock" counters. > > Where is the barrier at the beginning of synchronize_srcu ? I believe we don't need another memory barrier at the beginning of synchronize_srcu() (but this part of my SRCU study is still pending ;)) . The grace period guarantee (read-side critical sections don't span the GP) is already enforced by the memory barrier between scanning for all unlocks, and scanning for all locks (Paired with corresponding memory barriers on the read-side). Accordingly, before we scan all locks and match lock == unlock, there is an smp_mb(). Why is that not sufficient? Thanks, - Joel > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
> On Dec 20, 2022, at 1:29 PM, Joel Fernandes <joel@joelfernandes.org> wrote: > > > >>> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2022-12-20 13:05, Joel Fernandes wrote: >>> Hi Mathieu, >>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers >>>> <mathieu.desnoyers@efficios.com> wrote: >>>> >>>> On 2022-12-19 20:04, Joel Fernandes wrote: >>>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>> [...] >>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to >>>>>>> be exactly 2^64 read-side critical sections. >>>>>> >>>>>> Yes, but what about 32-bit systems? >>>> >>>> The overflow indeed happens after 2^32 increments, just like seqlock. >>>> The question we need to ask is therefore: if 2^32 is good enough for >>>> seqlock, why isn't it good enough for SRCU ? >>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll >>> let him talk more about it. If 32-bit is good enough, let us also drop >>> the size of the counters for 64-bit then? >>>>>>> There are other synchronization algorithms such as seqlocks which are >>>>>>> quite happy with much less protection against overflow (using a 32-bit >>>>>>> counter even on 64-bit architectures). >>>>>> >>>>>> The seqlock is an interesting point. >>>>>> >>>>>>> For practical purposes, I suspect this issue is really just theoretical. >>>>>> >>>>>> I have to ask, what is the benefit of avoiding a flip and scanning >>>>>> active readers? Is the issue about grace period delay or performance? >>>>>> If so, it might be worth prototyping that approach and measuring using >>>>>> rcutorture/rcuscale. If there is significant benefit to current >>>>>> approach, then IMO it is worth exploring. >>>> >>>> The main benefit I expect is improved performance of the grace period >>>> implementation in common cases where there are few or no readers >>>> present, especially on machines with many cpus. >>>> >>>> It allows scanning both periods (0/1) for each cpu within the same pass, >>>> therefore loading both period's unlock counters sitting in the same >>>> cache line at once (improved locality), and then loading both period's >>>> lock counters, also sitting in the same cache line. >>>> >>>> It also allows skipping the period flip entirely if there are no readers >>>> present, which is an -arguably- tiny performance improvement as well. >>> The issue of counter wrap aside, what if a new reader always shows up >>> in the active index being scanned, then can you not delay the GP >>> indefinitely? It seems like writer-starvation is possible then (sure >>> it is possible also with preemption after reader-index-sampling, but >>> scanning active index deliberately will make that worse). Seqlock does >>> not have such writer starvation just because the writer does not care >>> about what the readers are doing. >> >> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods. >> >> If there are readers in the "non-current" period, the grace period waits for them. >> >> If there are readers in the "current" period, it flips the period and then waits for them. > > Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. 
Sorry I did not know the details of your side-RCU to know you were already doing something like that. > >> >>> That said, the approach of scanning both counters does seem attractive >>> for when there are no readers, for the reasons you mentioned. Maybe a >>> heuristic to count the number of readers might help? If we are not >>> reader-heavy, then scan both. Otherwise, just scan the inactive ones, >>> and also couple that heuristic with the number of CPUs. I am >>> interested in working on such a design with you! Let us do it and >>> prototype/measure. ;-) >> >> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass. > > Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful. > >> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :) > > Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after. > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… In my view, the mb between the totaling of unlocks and totaling of locks serves as the mb that is required to enforce the GP guarantee, which I think is what Mathieu is referring to. Neeraj, do you agree? Thanks. > > Cheers, > > - Joel > > >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
On Tue, Dec 20, 2022 at 09:20:08AM -0500, Joel Fernandes wrote: > > On Dec 20, 2022, at 9:07 AM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote: > >>> C w-depend-r > >>> > >>> { > >>> PLOCK=LOCK0; > >>> } > >>> > >>> // updater > >>> P0(int *LOCK1, int **PLOCK) > >>> { > >>> int lock1; > >>> > >>> lock1 = READ_ONCE(*LOCK1); // READ from inactive idx > >>> smp_mb(); > >>> WRITE_ONCE(*PLOCK, LOCK1); // Flip idx > >>> } > >>> > >>> // reader > >>> P1(int **PLOCK) > >>> { > >>> int *plock; > >>> > >>> plock = READ_ONCE(*PLOCK); // Read active idx > >>> WRITE_ONCE(*plock, 1); // Write to active idx > >> > >> I am a bit lost here, why would the reader want to write to the active idx? > >> The reader does not update the idx, only the lock count. > > > > So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right? > > The write is then displayed that way: > > > > this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); > > > > But things could be also thought the other way around with idx being the base address and > > ssp->sda->srcu_lock_count being the offset. > > > > this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter); > > > > That would require to change some high level types but the result would be the same from > > the memory point of view (and even from the ASM point of view). In the end we > > are dealing with the same address and access. > > > > Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change. > > So it can be zero for example. Then the above increment becomes: > > > > this_cpu_inc(idx.counter); > > > > And then it can be modelized as in the above litmus test. > > > > I had to play that trick because litmus doesn't support arrays but I believe > > it stands. Now of course I may well have got something wrong since I've always > > been terrible at maths... > > Ah ok, I get where you were going with that. Yes there is address dependency > between reading idx and writing lock count. But IMHO, the access on the update > side is trying to order write to index, and reads from a lock count of a > previous index (as far as E / B+C is concerned). So IMHO, on the read side you > have to consider 2 consecutive readers and not the same reader in order to > pair the same accesses correctly. But I could be missing something. And you're right, for the first part of the comment (let's call that (1)): * Ensure that if this updater saw a given reader's increment * from __srcu_read_lock(), that reader was using an old value * of ->srcu_idx. My litmus test shows the ordering displayed in the second part of the comment (call it (2)): * Also ensure that if a given reader sees the * new value of ->srcu_idx, this updater's earlier scans cannot * have seen that reader's increments (which is OK, because this * grace period need not wait on that reader). _ In (1), E indeed pairs with B and C _ In (2), E pairs with the address-dependency between idx and lock_count. Thanks.
On Tue, Dec 20, 2022 at 02:01:30PM -0500, Mathieu Desnoyers wrote: > On 2022-12-20 13:29, Joel Fernandes wrote: > > > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… > > I strongly suspect the memory barrier after flip is useless for the same > reasons I mentioned explaining why the barrier before the flip is useless. > > However, we need to double-check that we have memory barriers at the > beginning and end of synchronize_srcu, and between load of "unlock" counters > and load of "lock" counters. > > Where is the barrier at the beginning of synchronize_srcu ? rcu_seq_snap() ? > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Tue, Dec 20, 2022 at 07:06:57PM +0000, Joel Fernandes wrote: > On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: > > > > On 2022-12-20 13:29, Joel Fernandes wrote: > > > > > > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… > > > > I strongly suspect the memory barrier after flip is useless for the same > > reasons I mentioned explaining why the barrier before the flip is useless. > > > > However, we need to double-check that we have memory barriers at the > > beginning and end of synchronize_srcu, and between load of "unlock" > > counters and load of "lock" counters. > > > > Where is the barrier at the beginning of synchronize_srcu ? > > I believe we don't need another memory barrier at the beginning of > synchronize_srcu() (but this part of my SRCU study is still pending > ;)) . The grace period guarantee (read-side critical sections don't > span the GP) is already enforced by the memory barrier between > scanning for all unlocks, and scanning for all locks (Paired with > corresponding memory barriers on the read-side). > > Accordingly, before we scan all locks and match lock == unlock, there > is an smp_mb(). Why is that not sufficient? That's not enough, you still need a barrier between the updater's pre-GP accesses and the scans, so that post-GP read side sees the updater's pre-GP accesses: UPDATER READER ------- ------ WRITE A WRITE srcu_read_lock smp_mb() //rcu_seq_snap() smp_mb() READ srcu_read_lock //scans READ A Thanks. > > Thanks, > > - Joel > > > > > Thanks, > > > > Mathieu > > > > -- > > Mathieu Desnoyers > > EfficiOS Inc. > > https://www.efficios.com > >
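The diagram above can be written as a litmus test in the style of the earlier ones in this thread (my construction, not Frederic's): with a full barrier on both sides this is the classic store-buffering pattern, so the bad outcome is forbidden.

C gp-start-vs-reader

(*
 * Result: never
 * If the updater's scan misses the reader's srcu_read_lock() increment
 * (so the grace period will not wait for that reader), the reader must
 * see the updater's pre-GP write.
 *)

{}

P0(int *a, int *lockcount) // updater
{
	int r0;

	WRITE_ONCE(*a, 1);           // pre-GP access
	smp_mb();                    // full barrier between pre-GP accesses and the scan
	r0 = READ_ONCE(*lockcount);  // scan the reader's lock count
}

P1(int *a, int *lockcount) // reader
{
	int r1;

	WRITE_ONCE(*lockcount, 1);   // __srcu_read_lock()
	smp_mb();                    // B
	r1 = READ_ONCE(*a);          // access inside the critical section
}

exists (0:r0=0 /\ 1:r1=0) (* Bad outcome. *)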
On Tue, Dec 20, 2022 at 6:05 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Tue, Dec 20, 2022 at 07:06:57PM +0000, Joel Fernandes wrote: > > On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > On 2022-12-20 13:29, Joel Fernandes wrote: > > > > > > > > > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… > > > > > > I strongly suspect the memory barrier after flip is useless for the same > > > reasons I mentioned explaining why the barrier before the flip is useless. > > > > > > However, we need to double-check that we have memory barriers at the > > > beginning and end of synchronize_srcu, and between load of "unlock" > > > counters and load of "lock" counters. > > > > > > Where is the barrier at the beginning of synchronize_srcu ? > > > > I believe we don't need another memory barrier at the beginning of > > synchronize_srcu() (but this part of my SRCU study is still pending > > ;)) . The grace period guarantee (read-side critical sections don't > > span the GP) is already enforced by the memory barrier between > > scanning for all unlocks, and scanning for all locks (Paired with > > corresponding memory barriers on the read-side). > > > > Accordingly, before we scan all locks and match lock == unlock, there > > is an smp_mb(). Why is that not sufficient? > > That's not enough, you still need a barrier between the updater's pre-GP > accesses and the scans, so that post-GP read side sees the updater's pre-GP > accesses: > > > UPDATER READER > ------- ------ > WRITE A WRITE srcu_read_lock > smp_mb() //rcu_seq_snap() smp_mb() > READ srcu_read_lock //scans READ A But see the comments also in srcu_readers_active_idx_check() * Needs to be a smp_mb() as the read side may * contain a read from a variable that is written to before the * synchronize_srcu() in the write side So that appears to be already covered. Or is your point that the scans are not happening on the same CPU as the pre-GP writer, as scans are happening from workqueue ? Perhaps that comment misled me. Confused, - Joel
On Tue, Dec 20, 2022 at 12:00:58PM -0500, Mathieu Desnoyers wrote: > On 2022-12-19 20:04, Joel Fernandes wrote: > The main benefit I expect is improved performance of the grace period > implementation in common cases where there are few or no readers present, > especially on machines with many cpus. > > It allows scanning both periods (0/1) for each cpu within the same pass, > therefore loading both period's unlock counters sitting in the same cache > line at once (improved locality), and then loading both period's lock > counters, also sitting in the same cache line. > > It also allows skipping the period flip entirely if there are no readers > present, which is an -arguably- tiny performance improvement as well. I would indeed expect performance improvement if there are no readers in the active period/idx but if there are, it's a performance penalty due to the extra scans. So my mean questions are: * Is the no-present-readers the most likely case? I guess it depends on the ssp. * Does the SRCU update side deserve to be optimized with added code (because we are not debating about removing the flip, rather about adding a fast-path and keep the flip as a slow-path) * The SRCU machinery is already quite complicated. Look how we little things lock ourselves in for days doing our exegesis of SRCU state machine. And halfway through it we are still debating some ordering. Is it worth adding a new path there? Thanks.
On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Tue, Dec 20, 2022 at 09:20:08AM -0500, Joel Fernandes wrote: > > > On Dec 20, 2022, at 9:07 AM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote: > > >>> C w-depend-r > > >>> > > >>> { > > >>> PLOCK=LOCK0; > > >>> } > > >>> > > >>> // updater > > >>> P0(int *LOCK1, int **PLOCK) > > >>> { > > >>> int lock1; > > >>> > > >>> lock1 = READ_ONCE(*LOCK1); // READ from inactive idx > > >>> smp_mb(); > > >>> WRITE_ONCE(*PLOCK, LOCK1); // Flip idx > > >>> } > > >>> > > >>> // reader > > >>> P1(int **PLOCK) > > >>> { > > >>> int *plock; > > >>> > > >>> plock = READ_ONCE(*PLOCK); // Read active idx > > >>> WRITE_ONCE(*plock, 1); // Write to active idx > > >> > > >> I am a bit lost here, why would the reader want to write to the active idx? > > >> The reader does not update the idx, only the lock count. > > > > > > So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right? > > > The write is then displayed that way: > > > > > > this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); > > > > > > But things could be also thought the other way around with idx being the base address and > > > ssp->sda->srcu_lock_count being the offset. > > > > > > this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter); > > > > > > That would require to change some high level types but the result would be the same from > > > the memory point of view (and even from the ASM point of view). In the end we > > > are dealing with the same address and access. > > > > > > Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change. > > > So it can be zero for example. Then the above increment becomes: > > > > > > this_cpu_inc(idx.counter); > > > > > > And then it can be modelized as in the above litmus test. > > > > > > I had to play that trick because litmus doesn't support arrays but I believe > > > it stands. Now of course I may well have got something wrong since I've always > > > been terrible at maths... > > > > Ah ok, I get where you were going with that. Yes there is address dependency > > between reading idx and writing lock count. But IMHO, the access on the update > > side is trying to order write to index, and reads from a lock count of a > > previous index (as far as E / B+C is concerned). So IMHO, on the read side you > > have to consider 2 consecutive readers and not the same reader in order to > > pair the same accesses correctly. But I could be missing something. > > And you're right, for the first part of the comment (let's call that (1)): > > * Ensure that if this updater saw a given reader's increment > * from __srcu_read_lock(), that reader was using an old value > * of ->srcu_idx. > > My litmus test shows the ordering displayed in the second part of the comment > (call it (2)): > > * Also ensure that if a given reader sees the > * new value of ->srcu_idx, this updater's earlier scans cannot > * have seen that reader's increments (which is OK, because this > * grace period need not wait on that reader). > > _ In (1), E indeed pairs with B and C Agreed about (1). > _ In (2), E pairs with the address-dependency between idx and lock_count. But that is not the only reason. If that was the only reason for (2), then there is an smp_mb() just before the next-scan post-flip before the lock counts are read. The comment also says that: "D" pairs with "C" and does not mention "C" being an addr dependency. 
So IMHO, for (2) you need to look at the unlock count access and the idx access. Something like this: https://hastebin.com/raw/gadeliquta . Though that is complex as fuck. Or did I miss something subtle? Thanks.
On Tue, Dec 20, 2022 at 06:46:10PM -0500, Joel Fernandes wrote: > On Tue, Dec 20, 2022 at 6:05 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > On Tue, Dec 20, 2022 at 07:06:57PM +0000, Joel Fernandes wrote: > > > On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers > > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > > > On 2022-12-20 13:29, Joel Fernandes wrote: > > > > > > > > > > > > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… > > > > > > > > I strongly suspect the memory barrier after flip is useless for the same > > > > reasons I mentioned explaining why the barrier before the flip is useless. > > > > > > > > However, we need to double-check that we have memory barriers at the > > > > beginning and end of synchronize_srcu, and between load of "unlock" > > > > counters and load of "lock" counters. > > > > > > > > Where is the barrier at the beginning of synchronize_srcu ? > > > > > > I believe we don't need another memory barrier at the beginning of > > > synchronize_srcu() (but this part of my SRCU study is still pending > > > ;)) . The grace period guarantee (read-side critical sections don't > > > span the GP) is already enforced by the memory barrier between > > > scanning for all unlocks, and scanning for all locks (Paired with > > > corresponding memory barriers on the read-side). > > > > > > Accordingly, before we scan all locks and match lock == unlock, there > > > is an smp_mb(). Why is that not sufficient? > > > > That's not enough, you still need a barrier between the updater's pre-GP > > accesses and the scans, so that post-GP read side sees the updater's pre-GP > > accesses: > > > > > > UPDATER READER > > ------- ------ > > WRITE A WRITE srcu_read_lock > > smp_mb() //rcu_seq_snap() smp_mb() > > READ srcu_read_lock //scans READ A > > But see the comments also in srcu_readers_active_idx_check() > > * Needs to be a smp_mb() as the read side may > * contain a read from a variable that is written to before the > * synchronize_srcu() in the write side > > So that appears to be already covered. Or is your point that the scans > are not happening on the same CPU as the pre-GP writer, as scans are > happening from workqueue ? Nah I think you're right. Although I guess we still need the barrier between updater's pre-gp accesses and srcu_unlock scans... > > Perhaps that comment misled me. > > Confused, > > - Joel
On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > Agreed about (1). > > > _ In (2), E pairs with the address-dependency between idx and lock_count. > > But that is not the only reason. If that was the only reason for (2), > then there is an smp_mb() just before the next-scan post-flip before > the lock counts are read. The post-flip barrier makes sure the new idx is visible on the next READER's turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" may appear unordered from the update side POV if there is no barrier between the scan and the flip. If you remove the smp_mb() from the litmus test I sent, things explode.
On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > > On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > Agreed about (1). > > > > > _ In (2), E pairs with the address-dependency between idx and lock_count. > > > > But that is not the only reason. If that was the only reason for (2), > > then there is an smp_mb() just before the next-scan post-flip before > > the lock counts are read. > > The post-flip barrier makes sure the new idx is visible on the next READER's > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > may appear unordered from the update side POV if there is no barrier between the > scan and the flip. > > If you remove the smp_mb() from the litmus test I sent, things explode. Or rather, look at it the other way, if there is no barrier between the lock scan and the index flip (E), then the index flip can appear to be written before the lock is read. Which means you may start activating the index before you finish reading it (at least it appears that way from the readers pont of view).
> On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: >> Agreed about (1). >> >>> _ In (2), E pairs with the address-dependency between idx and lock_count. >> >> But that is not the only reason. If that was the only reason for (2), >> then there is an smp_mb() just before the next-scan post-flip before >> the lock counts are read. > > The post-flip barrier makes sure the new idx is visible on the next READER's > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > may appear unordered from the update side POV if there is no barrier between the > scan and the flip. > > If you remove the smp_mb() from the litmus test I sent, things explode. Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency. In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A. So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip). That’s just my opinion and let’s discuss more because I cannot rule out that I am missing something with this complicated topic ;-) Thanks.
On 2022-12-20 17:57, Frederic Weisbecker wrote: > On Tue, Dec 20, 2022 at 02:01:30PM -0500, Mathieu Desnoyers wrote: >> On 2022-12-20 13:29, Joel Fernandes wrote: >>> >> >>> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… >> >> I strongly suspect the memory barrier after flip is useless for the same >> reasons I mentioned explaining why the barrier before the flip is useless. >> >> However, we need to double-check that we have memory barriers at the >> beginning and end of synchronize_srcu, and between load of "unlock" counters >> and load of "lock" counters. >> >> Where is the barrier at the beginning of synchronize_srcu ? > > rcu_seq_snap() ? The memory barrier in rcu_seq_snap is not at the very beginning of synchronize_srcu. For example we have: unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp) { // Any prior manipulation of SRCU-protected data must happen // before the load from ->srcu_gp_seq. smp_mb(); return rcu_seq_snap(&ssp->srcu_gp_seq); which happens to have an explicit barrier before issuing rcu_seq_snap(). So my question still stands: where is the memory barrier at the beginning of synchronize_srcu ? The memory ordering constraint I am concerned about here is: * [...] In addition, * each CPU having an SRCU read-side critical section that extends beyond * the return from synchronize_srcu() is guaranteed to have executed a * full memory barrier after the beginning of synchronize_srcu() and before * the beginning of that SRCU read-side critical section. [...] So if we have a SRCU read-side critical section that begins after the beginning of synchronize_srcu, but before its first memory barrier, it would miss the guarantee that the full memory barrier is issued before the beginning of that SRCU read-side critical section. IOW, that memory barrier needs to be at the very beginning of the grace period. I suspect that the memory barrier in srcu_read_lock() invoked by srcu_gp_start_if_needed may happen to be early enough, but I'm not sure, and it does not appear to be documented as such. Thanks, Mathieu > >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
On 2022-12-20 19:58, Frederic Weisbecker wrote: > On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: >> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: >>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: >>> Agreed about (1). >>> >>>> _ In (2), E pairs with the address-dependency between idx and lock_count. >>> >>> But that is not the only reason. If that was the only reason for (2), >>> then there is an smp_mb() just before the next-scan post-flip before >>> the lock counts are read. >> >> The post-flip barrier makes sure the new idx is visible on the next READER's >> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" >> may appear unordered from the update side POV if there is no barrier between the >> scan and the flip. >> >> If you remove the smp_mb() from the litmus test I sent, things explode. > > Or rather, look at it the other way, if there is no barrier between the lock > scan and the index flip (E), then the index flip can appear to be written before the > lock is read. Which means you may start activating the index before you finish > reading it (at least it appears that way from the readers pont of view). Considering that you can have pre-existing readers from arbitrary index appearing anywhere in the grace period (because a reader can fetch the index and be preempted for an arbitrary amount of time before incrementing the lock count), the grace period algorithm needs to deal with the fact that a newcoming reader can appear in a given index either before or after the flip. I don't see how flipping the index before or after loading the unlock/lock values would break anything (except for unlikely counter overflow situations as previously discussed). Thanks, Mathieu
On 2022-12-20 19:07, Frederic Weisbecker wrote: > On Tue, Dec 20, 2022 at 12:00:58PM -0500, Mathieu Desnoyers wrote: >> On 2022-12-19 20:04, Joel Fernandes wrote: >> The main benefit I expect is improved performance of the grace period >> implementation in common cases where there are few or no readers present, >> especially on machines with many cpus. >> >> It allows scanning both periods (0/1) for each cpu within the same pass, >> therefore loading both period's unlock counters sitting in the same cache >> line at once (improved locality), and then loading both period's lock >> counters, also sitting in the same cache line. >> >> It also allows skipping the period flip entirely if there are no readers >> present, which is an -arguably- tiny performance improvement as well. > > I would indeed expect performance improvement if there are no readers in the > active period/idx but if there are, it's a performance penalty due to the extra > scans. > > So my mean questions are: > > * Is the no-present-readers the most likely case? I guess it depends on the ssp. > > * Does the SRCU update side deserve to be optimized with added code (because > we are not debating about removing the flip, rather about adding a fast-path > and keep the flip as a slow-path) > > * The SRCU machinery is already quite complicated. Look how we little things lock > ourselves in for days doing our exegesis of SRCU state machine. And halfway > through it we are still debating some ordering. Is it worth adding a new path there? I'm not arguing for making things more complex unless there are good reasons to do so. However I think we badly need to improve the documentation of the memory barriers in SRCU, because the claimed barrier pairing is odd. Thanks, Mathieu
On 2022-12-20 15:55, Joel Fernandes wrote: > > >> On Dec 20, 2022, at 1:29 PM, Joel Fernandes <joel@joelfernandes.org> wrote: >> >> >> >>>> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>>> >>>> On 2022-12-20 13:05, Joel Fernandes wrote: >>>> Hi Mathieu, >>>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers >>>>> <mathieu.desnoyers@efficios.com> wrote: >>>>> >>>>> On 2022-12-19 20:04, Joel Fernandes wrote: >>>>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>> [...] >>>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to >>>>>>>> be exactly 2^64 read-side critical sections. >>>>>>> >>>>>>> Yes, but what about 32-bit systems? >>>>> >>>>> The overflow indeed happens after 2^32 increments, just like seqlock. >>>>> The question we need to ask is therefore: if 2^32 is good enough for >>>>> seqlock, why isn't it good enough for SRCU ? >>>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll >>>> let him talk more about it. If 32-bit is good enough, let us also drop >>>> the size of the counters for 64-bit then? >>>>>>>> There are other synchronization algorithms such as seqlocks which are >>>>>>>> quite happy with much less protection against overflow (using a 32-bit >>>>>>>> counter even on 64-bit architectures). >>>>>>> >>>>>>> The seqlock is an interesting point. >>>>>>> >>>>>>>> For practical purposes, I suspect this issue is really just theoretical. >>>>>>> >>>>>>> I have to ask, what is the benefit of avoiding a flip and scanning >>>>>>> active readers? Is the issue about grace period delay or performance? >>>>>>> If so, it might be worth prototyping that approach and measuring using >>>>>>> rcutorture/rcuscale. If there is significant benefit to current >>>>>>> approach, then IMO it is worth exploring. >>>>> >>>>> The main benefit I expect is improved performance of the grace period >>>>> implementation in common cases where there are few or no readers >>>>> present, especially on machines with many cpus. >>>>> >>>>> It allows scanning both periods (0/1) for each cpu within the same pass, >>>>> therefore loading both period's unlock counters sitting in the same >>>>> cache line at once (improved locality), and then loading both period's >>>>> lock counters, also sitting in the same cache line. >>>>> >>>>> It also allows skipping the period flip entirely if there are no readers >>>>> present, which is an -arguably- tiny performance improvement as well. >>>> The issue of counter wrap aside, what if a new reader always shows up >>>> in the active index being scanned, then can you not delay the GP >>>> indefinitely? It seems like writer-starvation is possible then (sure >>>> it is possible also with preemption after reader-index-sampling, but >>>> scanning active index deliberately will make that worse). Seqlock does >>>> not have such writer starvation just because the writer does not care >>>> about what the readers are doing. >>> >>> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods. >>> >>> If there are readers in the "non-current" period, the grace period waits for them. >>> >>> If there are readers in the "current" period, it flips the period and then waits for them. 
>> >> Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. Sorry I did not know the details of your side-RCU to know you were already doing something like that. >> >>> >>>> That said, the approach of scanning both counters does seem attractive >>>> for when there are no readers, for the reasons you mentioned. Maybe a >>>> heuristic to count the number of readers might help? If we are not >>>> reader-heavy, then scan both. Otherwise, just scan the inactive ones, >>>> and also couple that heuristic with the number of CPUs. I am >>>> interested in working on such a design with you! Let us do it and >>>> prototype/measure. ;-) >>> >>> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass. >> >> Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful. >> >>> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :) >> >> Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after. >> >> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… > > In my view, the mb between the totaling of unlocks and totaling of locks serves as the mb that is required to enforce the GP guarantee, which I think is what Mathieu is referring to. > No, AFAIU you also need barriers at the beginning and end of synchronize_srcu to provide those guarantees: * There are memory-ordering constraints implied by synchronize_srcu(). Need for a barrier at the end of synchronize_srcu(): * On systems with more than one CPU, when synchronize_srcu() returns, * each CPU is guaranteed to have executed a full memory barrier since * the end of its last corresponding SRCU read-side critical section * whose beginning preceded the call to synchronize_srcu(). Need for a barrier at the beginning of synchronize_srcu(): * In addition, * each CPU having an SRCU read-side critical section that extends beyond * the return from synchronize_srcu() is guaranteed to have executed a * full memory barrier after the beginning of synchronize_srcu() and before * the beginning of that SRCU read-side critical section. Note that these * guarantees include CPUs that are offline, idle, or executing in user mode, * as well as CPUs that are executing in the kernel. Thanks, Mathieu > Neeraj, do you agree? > > Thanks. > > > > > >> >> Cheers, >> >> - Joel >> >> >>> >>> Thanks, >>> >>> Mathieu >>> >>> -- >>> Mathieu Desnoyers >>> EfficiOS Inc. >>> https://www.efficios.com >>>
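The end-of-GP guarantee Mathieu quotes can be sketched the same way (again my construction, following the conventions of the earlier litmus tests): if the scan observes the reader's srcu_read_unlock() increment, the updater's post-GP accesses must see everything done inside that critical section. This is the message-passing pattern with full barriers on both sides, so the bad outcome is forbidden.

C gp-end-vs-reader

(*
 * Result: never
 * If the updater's scan sees the reader's srcu_read_unlock() increment,
 * the updater's post-GP accesses see the reader's critical section.
 *)

{}

P0(int *b, int *unlockcount) // reader
{
	WRITE_ONCE(*b, 1);            // access inside the critical section
	smp_mb();                     // C (in __srcu_read_unlock())
	WRITE_ONCE(*unlockcount, 1);  // __srcu_read_unlock()
}

P1(int *b, int *unlockcount) // updater
{
	int r0;
	int r1;

	r0 = READ_ONCE(*unlockcount); // scan the unlock counts
	smp_mb();                     // A, and the barrier at the end of the GP
	r1 = READ_ONCE(*b);           // post-GP access
}

exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)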
> On Dec 20, 2022, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-20 19:58, Frederic Weisbecker wrote: >>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: >>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: >>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: >>>> Agreed about (1). >>>> >>>>> _ In (2), E pairs with the address-dependency between idx and lock_count. >>>> >>>> But that is not the only reason. If that was the only reason for (2), >>>> then there is an smp_mb() just before the next-scan post-flip before >>>> the lock counts are read. >>> >>> The post-flip barrier makes sure the new idx is visible on the next READER's >>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" >>> may appear unordered from the update side POV if there is no barrier between the >>> scan and the flip. >>> >>> If you remove the smp_mb() from the litmus test I sent, things explode. >> Or rather, look at it the other way, if there is no barrier between the lock >> scan and the index flip (E), then the index flip can appear to be written before the >> lock is read. Which means you may start activating the index before you finish >> reading it (at least it appears that way from the readers pont of view). > > Considering that you can have pre-existing readers from arbitrary index appearing anywhere in the grace period (because a reader can fetch the > index and be preempted for an arbitrary amount of time before incrementing the lock count), the grace period algorithm needs to deal with the fact that a newcoming reader can appear in a given index either before or after the flip. > > I don't see how flipping the index before or after loading the unlock/lock values would break anything (except for unlikely counter overflow situations as previously discussed). If you say unlikely, that means it can happen some times which is bad enough ;-). Maybe you mean impossible. I would not settle for anything less than keeping the memory barrier around if it helps unlikely cases, but only D does help for the theoretical wrapping/overflow issue. E is broken and does not even help the theoretical issue IMO. And both D and E do not affect correctness IMO. Anyway in all likelihood, I will be trying to remove E completely and clarify docs on D in the coming weeks. And also try to drop the size of the counters per our discussions Thanks. > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
> On Dec 20, 2022, at 10:51 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-20 15:55, Joel Fernandes wrote: >>>> On Dec 20, 2022, at 1:29 PM, Joel Fernandes <joel@joelfernandes.org> wrote: >>> >>> >>> >>>>> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>>>> >>>>> On 2022-12-20 13:05, Joel Fernandes wrote: >>>>> Hi Mathieu, >>>>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers >>>>>> <mathieu.desnoyers@efficios.com> wrote: >>>>>> >>>>>> On 2022-12-19 20:04, Joel Fernandes wrote: >>>>>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>> [...] >>>>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to >>>>>>>>> be exactly 2^64 read-side critical sections. >>>>>>>> >>>>>>>> Yes, but what about 32-bit systems? >>>>>> >>>>>> The overflow indeed happens after 2^32 increments, just like seqlock. >>>>>> The question we need to ask is therefore: if 2^32 is good enough for >>>>>> seqlock, why isn't it good enough for SRCU ? >>>>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll >>>>> let him talk more about it. If 32-bit is good enough, let us also drop >>>>> the size of the counters for 64-bit then? >>>>>>>>> There are other synchronization algorithms such as seqlocks which are >>>>>>>>> quite happy with much less protection against overflow (using a 32-bit >>>>>>>>> counter even on 64-bit architectures). >>>>>>>> >>>>>>>> The seqlock is an interesting point. >>>>>>>> >>>>>>>>> For practical purposes, I suspect this issue is really just theoretical. >>>>>>>> >>>>>>>> I have to ask, what is the benefit of avoiding a flip and scanning >>>>>>>> active readers? Is the issue about grace period delay or performance? >>>>>>>> If so, it might be worth prototyping that approach and measuring using >>>>>>>> rcutorture/rcuscale. If there is significant benefit to current >>>>>>>> approach, then IMO it is worth exploring. >>>>>> >>>>>> The main benefit I expect is improved performance of the grace period >>>>>> implementation in common cases where there are few or no readers >>>>>> present, especially on machines with many cpus. >>>>>> >>>>>> It allows scanning both periods (0/1) for each cpu within the same pass, >>>>>> therefore loading both period's unlock counters sitting in the same >>>>>> cache line at once (improved locality), and then loading both period's >>>>>> lock counters, also sitting in the same cache line. >>>>>> >>>>>> It also allows skipping the period flip entirely if there are no readers >>>>>> present, which is an -arguably- tiny performance improvement as well. >>>>> The issue of counter wrap aside, what if a new reader always shows up >>>>> in the active index being scanned, then can you not delay the GP >>>>> indefinitely? It seems like writer-starvation is possible then (sure >>>>> it is possible also with preemption after reader-index-sampling, but >>>>> scanning active index deliberately will make that worse). Seqlock does >>>>> not have such writer starvation just because the writer does not care >>>>> about what the readers are doing. >>>> >>>> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods. >>>> >>>> If there are readers in the "non-current" period, the grace period waits for them. 
>>>> >>>> If there are readers in the "current" period, it flips the period and then waits for them. >>> >>> Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. Sorry I did not know the details of your side-RCU to know you were already doing something like that. >>> >>>> >>>>> That said, the approach of scanning both counters does seem attractive >>>>> for when there are no readers, for the reasons you mentioned. Maybe a >>>>> heuristic to count the number of readers might help? If we are not >>>>> reader-heavy, then scan both. Otherwise, just scan the inactive ones, >>>>> and also couple that heuristic with the number of CPUs. I am >>>>> interested in working on such a design with you! Let us do it and >>>>> prototype/measure. ;-) >>>> >>>> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass. >>> >>> Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful. >>> >>>> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :) >>> >>> Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after. >>> >>> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… >> In my view, the mb between the totaling of unlocks and totaling of locks serves as the mb that is required to enforce the GP guarantee, which I think is what Mathieu is referring to. > > No, AFAIU you also need barriers at the beginning and end of synchronize_srcu to provide those guarantees: My bad, I got too hung up on the scan code. Indeed we need additional ordering on synchronize side. Anyway, the full memory barriers are already implemented in the synchronize code AFAICS (beginning and end). At least one of them full memory barriers directly appears at the end of __synchronize_srcu(). But I dont want to say something stupid in the middle of the night, so I will take my time to get back on that. Thanks, Joel > > * There are memory-ordering constraints implied by synchronize_srcu(). > > Need for a barrier at the end of synchronize_srcu(): > > * On systems with more than one CPU, when synchronize_srcu() returns, > * each CPU is guaranteed to have executed a full memory barrier since > * the end of its last corresponding SRCU read-side critical section > * whose beginning preceded the call to synchronize_srcu(). > > Need for a barrier at the beginning of synchronize_srcu(): > > * In addition, > * each CPU having an SRCU read-side critical section that extends beyond > * the return from synchronize_srcu() is guaranteed to have executed a > * full memory barrier after the beginning of synchronize_srcu() and before > * the beginning of that SRCU read-side critical section. Note that these > * guarantees include CPUs that are offline, idle, or executing in user mode, > * as well as CPUs that are executing in the kernel. > > Thanks, > > Mathieu > >> Neeraj, do you agree? >> Thanks. 
>>> >>> Cheers, >>> >>> - Joel >>> >>> >>>> >>>> Thanks, >>>> >>>> Mathieu >>>> >>>> -- >>>> Mathieu Desnoyers >>>> EfficiOS Inc. >>>> https://www.efficios.com >>>> > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote: > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > >> Agreed about (1). > >> > >>> _ In (2), E pairs with the address-dependency between idx and lock_count. > >> > >> But that is not the only reason. If that was the only reason for (2), > >> then there is an smp_mb() just before the next-scan post-flip before > >> the lock counts are read. > > > > The post-flip barrier makes sure the new idx is visible on the next READER's > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > > may appear unordered from the update side POV if there is no barrier between the > > scan and the flip. > > > > If you remove the smp_mb() from the litmus test I sent, things explode. > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency. > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A. > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip). > > That’s just my opinion and let’s discuss more because I cannot rule out that I > am missing something with this complicated topic ;-) I must be missing something. I often do. Ok let's put that on litmus: ---- C srcu {} // updater P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int lock1; int unlock1; int lock0; int unlock0; // SCAN1 unlock1 = READ_ONCE(*UNLOCK1); smp_mb(); // A lock1 = READ_ONCE(*LOCK1); // FLIP smp_mb(); // E WRITE_ONCE(*IDX, 1); smp_mb(); // D // SCAN2 unlock0 = READ_ONCE(*UNLOCK0); smp_mb(); // A lock0 = READ_ONCE(*LOCK0); } // reader P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int tmp; int idx; // 1st reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } // second reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } } exists (0:lock1!=0) --- This gives: Test srcu Allowed States 1 0:lock1=0; No Witnesses Positive: 0 Negative: 9 Condition exists (not (0:lock1=0)) Observation srcu Never 0 9 Time srcu 0.57 Hash=855d17de503814d2421602174f307c59 Now if I comment out the "smp_mb() /* E */" line this gives: Test srcu Allowed States 3 0:lock1=0; 0:lock1=1; 0:lock1=2; Ok Witnesses Positive: 4 Negative: 9 Condition exists (not (0:lock1=0)) Observation srcu Sometimes 4 9 Thanks
On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > On 2022-12-20 17:57, Frederic Weisbecker wrote: > > On Tue, Dec 20, 2022 at 02:01:30PM -0500, Mathieu Desnoyers wrote: > > > On 2022-12-20 13:29, Joel Fernandes wrote: > > > > > > > > > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed… > > > > > > I strongly suspect the memory barrier after flip is useless for the same > > > reasons I mentioned explaining why the barrier before the flip is useless. > > > > > > However, we need to double-check that we have memory barriers at the > > > beginning and end of synchronize_srcu, and between load of "unlock" counters > > > and load of "lock" counters. > > > > > > Where is the barrier at the beginning of synchronize_srcu ? > > > > rcu_seq_snap() ? > > The memory barrier in rcu_seq_snap is not at the very beginning of synchronize_srcu. > > For example we have: > > unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp) > { > // Any prior manipulation of SRCU-protected data must happen > // before the load from ->srcu_gp_seq. > smp_mb(); > return rcu_seq_snap(&ssp->srcu_gp_seq); > > which happens to have an explicit barrier before issuing rcu_seq_snap(). SRCU (or even RCU) polling is special in that it may rely on a concurrent updater to start the grace period, hence why you need two more barriers here (the second is in poll_state_synchronize_srcu()) so that: * If the polling updater started polling (calling get_state_synchronize_srcu()) before the traditional updater started the grace period, the latter must propagate the changes from the polling updater to all CPUs. * If the polling updater started polling (calling get_state_synchronize_srcu()) after the traditional updater started the grace period, it must wait for a subsequent grace period (rcu_seq_snap() will return that subsequent sequence). * If the polling updater checks (and thereby finishes) polling (calling poll_state_synchronize_srcu()) after the traditional updater completes the grace period, the polling updater sees the propagated barrier. * If the polling updater checks polling (calling poll_state_synchronize_srcu()) before the traditional updater completes the grace period, keep polling. > So my question still stands: where is the memory barrier at the beginning of > synchronize_srcu ? I still think rcu_seq_snap() is what you're looking for. > > The memory ordering constraint I am concerned about here is: > > * [...] In addition, > * each CPU having an SRCU read-side critical section that extends beyond > * the return from synchronize_srcu() is guaranteed to have executed a > * full memory barrier after the beginning of synchronize_srcu() and before > * the beginning of that SRCU read-side critical section. [...] > > So if we have a SRCU read-side critical section that begins after the beginning > of synchronize_srcu, but before its first memory barrier, it would miss the > guarantee that the full memory barrier is issued before the beginning of that > SRCU read-side critical section. IOW, that memory barrier needs to be at the > very beginning of the grace period. I'm confused, what's wrong with this ? UPDATER READER ------- ------ STORE X = 1 STORE srcu_read_lock++ // rcu_seq_snap() smp_mb() smp_mb() READ X // scans READ srcu_read_lock
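For orientation, the two polling entry points discussed above are used roughly as follows. This is a kernel-context fragment, not standalone code; only the two SRCU API calls are real, while ssp, gp_cookie and free_old() are placeholders for this sketch.

/* Rough usage sketch of the SRCU polling API referenced above. */
static unsigned long gp_cookie;

static void updater_publish(struct srcu_struct *ssp)
{
        /*
         * Snapshot the grace-period sequence.  This alone does not start
         * a grace period, which is why the discussion above considers a
         * concurrent updater driving it.
         */
        gp_cookie = get_state_synchronize_srcu(ssp);
}

static void updater_reclaim(struct srcu_struct *ssp)
{
        /*
         * True once a full grace period has elapsed since the snapshot;
         * otherwise keep polling, or fall back to synchronize_srcu().
         */
        if (poll_state_synchronize_srcu(ssp, gp_cookie))
                free_old();
}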
On Tue, Dec 20, 2022 at 10:43:25PM -0500, Mathieu Desnoyers wrote: > On 2022-12-20 19:58, Frederic Weisbecker wrote: > > On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > > > > On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > Agreed about (1). > > > > > > > > > _ In (2), E pairs with the address-dependency between idx and lock_count. > > > > > > > > But that is not the only reason. If that was the only reason for (2), > > > > then there is an smp_mb() just before the next-scan post-flip before > > > > the lock counts are read. > > > > > > The post-flip barrier makes sure the new idx is visible on the next READER's > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > > > may appear unordered from the update side POV if there is no barrier between the > > > scan and the flip. > > > > > > If you remove the smp_mb() from the litmus test I sent, things explode. > > > > Or rather, look at it the other way, if there is no barrier between the lock > > scan and the index flip (E), then the index flip can appear to be written before the > > lock is read. Which means you may start activating the index before you finish > > reading it (at least it appears that way from the readers pont of view). > > Considering that you can have pre-existing readers from arbitrary index > appearing anywhere in the grace period (because a reader can fetch the > index and be preempted for an arbitrary amount of time before incrementing > the lock count), the grace period algorithm needs to deal with the fact that > a newcoming reader can appear in a given index either before or after the > flip. True but the number of preempted tasks is bound and there is a forward progress guarantee. > I don't see how flipping the index before or after loading the unlock/lock > values would break anything (except for unlikely counter overflow situations > as previously discussed). Forward progress guarantee. Thanks. > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Tue, Dec 20, 2022 at 11:26:12PM -0500, Joel Fernandes wrote: > > > > On Dec 20, 2022, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > On 2022-12-20 19:58, Frederic Weisbecker wrote: > >>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: > >>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > >>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > >>>> Agreed about (1). > >>>> > >>>>> _ In (2), E pairs with the address-dependency between idx and lock_count. > >>>> > >>>> But that is not the only reason. If that was the only reason for (2), > >>>> then there is an smp_mb() just before the next-scan post-flip before > >>>> the lock counts are read. > >>> > >>> The post-flip barrier makes sure the new idx is visible on the next READER's > >>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > >>> may appear unordered from the update side POV if there is no barrier between the > >>> scan and the flip. > >>> > >>> If you remove the smp_mb() from the litmus test I sent, things explode. > >> Or rather, look at it the other way, if there is no barrier between the lock > >> scan and the index flip (E), then the index flip can appear to be written before the > >> lock is read. Which means you may start activating the index before you finish > >> reading it (at least it appears that way from the readers pont of view). > > > > Considering that you can have pre-existing readers from arbitrary index appearing anywhere in the grace period (because a reader can fetch the > > index and be preempted for an arbitrary amount of time before incrementing the lock count), the grace period algorithm needs to deal with the fact that a newcoming reader can appear in a given index either before or after the flip. > > > > I don't see how flipping the index before or after loading the unlock/lock values would break anything (except for unlikely counter overflow situations as previously discussed). > > If you say unlikely, that means it can happen some times which is bad enough ;-). Maybe you mean impossible. I would not settle for anything less than keeping the memory barrier around if it helps unlikely cases, but only D does help for the theoretical wrapping/overflow issue. E is broken and does not even help the theoretical issue IMO. And both D and E do not affect correctness IMO. 
And here is why D is needed: C D {} // updater P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int lock1; int unlock1; int lock0; int unlock0; // SCAN1 unlock1 = READ_ONCE(*UNLOCK1); smp_mb(); // A lock1 = READ_ONCE(*LOCK1); // FLIP smp_mb(); // E WRITE_ONCE(*IDX, 1); smp_mb(); // D // SCAN 2 unlock0 = READ_ONCE(*UNLOCK0); smp_mb(); // A lock0 = READ_ONCE(*LOCK0); } // reader P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int tmp; int idx; // 1st reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } // second reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } // third reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } } exists (0:unlock0=0 /\ 1:idx=0) States 6 0:unlock0=0; 1:idx=1; 0:unlock0=1; 1:idx=0; 0:unlock0=1; 1:idx=1; 0:unlock0=2; 1:idx=0; 0:unlock0=2; 1:idx=1; 0:unlock0=3; 1:idx=0; No Witnesses Positive: 0 Negative: 14 Condition exists (0:unlock0=0 /\ 1:idx=0) Observation D Never 0 14 But then if you comment out "smp_mb() /* D */": Test D Allowed States 7 0:unlock0=0; 1:idx=0; 0:unlock0=0; 1:idx=1; 0:unlock0=1; 1:idx=0; 0:unlock0=1; 1:idx=1; 0:unlock0=2; 1:idx=0; 0:unlock0=2; 1:idx=1; 0:unlock0=3; 1:idx=0; Ok Witnesses Positive: 2 Negative: 14 Condition exists (0:unlock0=0 /\ 1:idx=0) Observation D Sometimes 2 14 Without D I guess things would eventually fix up but that would require an extra loop in SCAN2. Thanks.
On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote: > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote: > > > > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > >> Agreed about (1). > > >> > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count. > > >> > > >> But that is not the only reason. If that was the only reason for (2), > > >> then there is an smp_mb() just before the next-scan post-flip before > > >> the lock counts are read. > > > > > > The post-flip barrier makes sure the new idx is visible on the next READER's > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > > > may appear unordered from the update side POV if there is no barrier between the > > > scan and the flip. > > > > > > If you remove the smp_mb() from the litmus test I sent, things explode. > > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency. > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A. > > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip). > > > > That’s just my opinion and let’s discuss more because I cannot rule out that I > > am missing something with this complicated topic ;-) > > I must be missing something. I often do. > > Ok let's put that on litmus: > > ---- > C srcu > > {} > > // updater > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int lock1; > int unlock1; > int lock0; > int unlock0; > > // SCAN1 > unlock1 = READ_ONCE(*UNLOCK1); > smp_mb(); // A > lock1 = READ_ONCE(*LOCK1); > > // FLIP > smp_mb(); // E In real code there is a control dependency between the READ_ONCE() above and the WRITE_ONCE() before, i.e. only flip the idx when lock1 == unlock1, maybe try with the P0 below? 
Untested due to not having herd on this computer ;-) > WRITE_ONCE(*IDX, 1); > smp_mb(); // D > > // SCAN2 > unlock0 = READ_ONCE(*UNLOCK0); > smp_mb(); // A > lock0 = READ_ONCE(*LOCK0); > } > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int lock1; int unlock1; int lock0; int unlock0; // SCAN1 unlock1 = READ_ONCE(*UNLOCK1); smp_mb(); // A lock1 = READ_ONCE(*LOCK1); // FLIP if (unlock1 == lock1) { smp_mb(); // E WRITE_ONCE(*IDX, 1); smp_mb(); // D // SCAN2 unlock0 = READ_ONCE(*UNLOCK0); smp_mb(); // A lock0 = READ_ONCE(*LOCK0); } } Regards, Boqun > // reader > P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int tmp; > int idx; > > // 1st reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > > // second reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > } > > exists (0:lock1!=0) > --- > > This gives: > > Test srcu Allowed > States 1 > 0:lock1=0; > No > Witnesses > Positive: 0 Negative: 9 > Condition exists (not (0:lock1=0)) > Observation srcu Never 0 9 > Time srcu 0.57 > Hash=855d17de503814d2421602174f307c59 > > Now if I comment out the "smp_mb() /* E */" line this gives: > > Test srcu Allowed > States 3 > 0:lock1=0; > 0:lock1=1; > 0:lock1=2; > Ok > Witnesses > Positive: 4 Negative: 9 > Condition exists (not (0:lock1=0)) > Observation srcu Sometimes 4 9 > > Thanks
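The ordering the suggested branch relies on is the usual LKMM rule that a control dependency from a marked load to a dependent marked store is preserved. A standalone illustration of just that property, hypothetical and not part of the SRCU model (it is essentially the stock load-buffering-plus-control-dependency pattern from tools/memory-model):

C lb-ctrl-sketch

(*
 * Illustrative only: with the control dependency on P0 and the full
 * barrier on P1, the load-buffering outcome below is forbidden.
 *)

{}

P0(int *x, int *y)
{
        int r0;

        r0 = READ_ONCE(*x);
        if (r0 == 1)
                WRITE_ONCE(*y, 1);      // ordered after the read of x by the branch

}

P1(int *x, int *y)
{
        int r1;

        r1 = READ_ONCE(*y);
        smp_mb();
        WRITE_ONCE(*x, 1);
}

exists (0:r0=1 /\ 1:r1=1)

herd7 reports Never for this exists clause; the same read-to-write ordering is what the unlock1 == lock1 branch gives the updater.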
On 2022-12-20 23:26, Joel Fernandes wrote: > > >> On Dec 20, 2022, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >> >> On 2022-12-20 19:58, Frederic Weisbecker wrote: >>>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: >>>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: >>>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: >>>>> Agreed about (1). >>>>> >>>>>> _ In (2), E pairs with the address-dependency between idx and lock_count. >>>>> >>>>> But that is not the only reason. If that was the only reason for (2), >>>>> then there is an smp_mb() just before the next-scan post-flip before >>>>> the lock counts are read. >>>> >>>> The post-flip barrier makes sure the new idx is visible on the next READER's >>>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" >>>> may appear unordered from the update side POV if there is no barrier between the >>>> scan and the flip. >>>> >>>> If you remove the smp_mb() from the litmus test I sent, things explode. >>> Or rather, look at it the other way, if there is no barrier between the lock >>> scan and the index flip (E), then the index flip can appear to be written before the >>> lock is read. Which means you may start activating the index before you finish >>> reading it (at least it appears that way from the readers pont of view). >> >> Considering that you can have pre-existing readers from arbitrary index appearing anywhere in the grace period (because a reader can fetch the >> index and be preempted for an arbitrary amount of time before incrementing the lock count), the grace period algorithm needs to deal with the fact that a newcoming reader can appear in a given index either before or after the flip. >> >> I don't see how flipping the index before or after loading the unlock/lock values would break anything (except for unlikely counter overflow situations as previously discussed). > > If you say unlikely, that means it can happen some times which is bad enough ;-). Maybe you mean impossible. I mean that if we have a synchronize_srcu preemption long enough to get 2^32 or 2^64 concurrent srcu read-side critical sections, I strongly suspect that RCU stall detection will yell loudly. And if it does not already, then we should make it so. So I mean "impossible unless the system is already unusable", rather than just "unlikely". Thanks, Mathieu > I would not settle for anything less than keeping the memory barrier around if it helps unlikely cases, but only D does help for the theoretical wrapping/overflow issue. E is broken and does not even help the theoretical issue IMO. And both D and E do not affect correctness IMO. > > Anyway in all likelihood, I will be trying to remove E completely and clarify docs on D in the coming weeks. And also try to drop the size of the counters per our discussions > > Thanks. > > > >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
On 2022-12-21 06:59, Frederic Weisbecker wrote: > On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: [...] >> >> The memory ordering constraint I am concerned about here is: >> >> * [...] In addition, >> * each CPU having an SRCU read-side critical section that extends beyond >> * the return from synchronize_srcu() is guaranteed to have executed a >> * full memory barrier after the beginning of synchronize_srcu() and before >> * the beginning of that SRCU read-side critical section. [...] >> >> So if we have a SRCU read-side critical section that begins after the beginning >> of synchronize_srcu, but before its first memory barrier, it would miss the >> guarantee that the full memory barrier is issued before the beginning of that >> SRCU read-side critical section. IOW, that memory barrier needs to be at the >> very beginning of the grace period. > > I'm confused, what's wrong with this ? > > UPDATER READER > ------- ------ > STORE X = 1 STORE srcu_read_lock++ > // rcu_seq_snap() smp_mb() > smp_mb() READ X > // scans > READ srcu_read_lock What you refer to here is only memory ordering of the store to X and load from X wrt loading/increment of srcu_read_lock, which is internal to the srcu implementation. If we really want to model the provided high-level memory ordering guarantees, we should consider a scenario where SRCU is used for its memory ordering properties to synchronize other variables. I'm concerned about the following Dekker scenario, where synchronize_srcu() and srcu_read_lock/unlock would be used instead of memory barriers: Initial state: X = 0, Y = 0 Thread A Thread B --------------------------------------------- STORE X = 1 STORE Y = 1 synchronize_srcu() srcu_read_lock() r1 = LOAD X srcu_read_unlock() r0 = LOAD Y BUG_ON(!r0 && !r1) So in the synchronize_srcu implementation, there appears to be two major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, or it uses an already started gp/expedited gp. When snapshotting with rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq load means that it does not order prior memory accesses before that load. This sequence value is then used to identify which gp_seq to wait for when piggy-backing on another already-started gp. I worry about reordering between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to piggy-back on an already-started gp. I suspect that the implicit barrier in srcu_read_lock() invoked at the beginning of srcu_gp_start_if_needed() is really the barrier that makes all this behave as expected. But without documentation it's rather hard to follow. Thanks, Mathieu
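The Dekker guarantee above can be written as a litmus sketch in the same style as the other tests in this thread. This is not a model of the SRCU implementation: the smp_mb() calls merely stand in for the full ordering that synchronize_srcu() and the read-side critical section are documented to provide, which is exactly the property under discussion.

C srcu-dekker-sketch

(*
 * Hypothetical sketch, not an SRCU model.  P0's smp_mb() stands in for
 * synchronize_srcu(), P1's for the read-side critical section; if the
 * documented guarantees hold, the exists clause below must never fire.
 *)

{}

P0(int *X, int *Y)
{
        int r0;

        WRITE_ONCE(*X, 1);
        smp_mb();       // synchronize_srcu()
        r0 = READ_ONCE(*Y);
}

P1(int *X, int *Y)
{
        int r1;

        WRITE_ONCE(*Y, 1);
        smp_mb();       // srcu_read_lock(); LOAD X; srcu_read_unlock()
        r1 = READ_ONCE(*X);
}

exists (0:r0=0 /\ 1:r1=0)

Like the other tests in this thread, it can be checked with herd7 against the kernel memory model, e.g. "herd7 -conf linux-kernel.cfg srcu-dekker-sketch.litmus" run from tools/memory-model, assuming that toolchain is installed.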
On 2022-12-21 07:11, Frederic Weisbecker wrote: > On Tue, Dec 20, 2022 at 10:43:25PM -0500, Mathieu Desnoyers wrote: >> On 2022-12-20 19:58, Frederic Weisbecker wrote: >>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: >>>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: >>>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: >>>>> Agreed about (1). >>>>> >>>>>> _ In (2), E pairs with the address-dependency between idx and lock_count. >>>>> >>>>> But that is not the only reason. If that was the only reason for (2), >>>>> then there is an smp_mb() just before the next-scan post-flip before >>>>> the lock counts are read. >>>> >>>> The post-flip barrier makes sure the new idx is visible on the next READER's >>>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" >>>> may appear unordered from the update side POV if there is no barrier between the >>>> scan and the flip. >>>> >>>> If you remove the smp_mb() from the litmus test I sent, things explode. >>> >>> Or rather, look at it the other way, if there is no barrier between the lock >>> scan and the index flip (E), then the index flip can appear to be written before the >>> lock is read. Which means you may start activating the index before you finish >>> reading it (at least it appears that way from the readers pont of view). >> >> Considering that you can have pre-existing readers from arbitrary index >> appearing anywhere in the grace period (because a reader can fetch the >> index and be preempted for an arbitrary amount of time before incrementing >> the lock count), the grace period algorithm needs to deal with the fact that >> a newcoming reader can appear in a given index either before or after the >> flip. > > True but the number of preempted tasks is bound and there is a forward progress guarantee. > >> I don't see how flipping the index before or after loading the unlock/lock >> values would break anything (except for unlikely counter overflow situations >> as previously discussed). > > Forward progress guarantee. Considering a coherent cache, the store-buffer will ensure that the index flip eventually reaches all readers. This bounds the time during which readers can flood the current index, and therefore guarantees forward progress. AFAIK the Linux kernel does not support architectures with incoherent caches. So I still don't see how having the barrier before or after the index flip is useful for forward progress. Thanks, Mathieu > > Thanks. > >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote: > On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote: > > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote: > > > > > > > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > >> Agreed about (1). > > > >> > > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count. > > > >> > > > >> But that is not the only reason. If that was the only reason for (2), > > > >> then there is an smp_mb() just before the next-scan post-flip before > > > >> the lock counts are read. > > > > > > > > The post-flip barrier makes sure the new idx is visible on the next READER's > > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > > > > may appear unordered from the update side POV if there is no barrier between the > > > > scan and the flip. > > > > > > > > If you remove the smp_mb() from the litmus test I sent, things explode. > > > > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency. > > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A. > > > > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip). > > > > > > That’s just my opinion and let’s discuss more because I cannot rule out that I > > > am missing something with this complicated topic ;-) > > > > I must be missing something. I often do. > > > > Ok let's put that on litmus: > > > > ---- > > C srcu > > > > {} > > > > // updater > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > { > > int lock1; > > int unlock1; > > int lock0; > > int unlock0; > > > > // SCAN1 > > unlock1 = READ_ONCE(*UNLOCK1); > > smp_mb(); // A > > lock1 = READ_ONCE(*LOCK1); > > > > // FLIP > > smp_mb(); // E > > In real code there is a control dependency between the READ_ONCE() above > and the WRITE_ONCE() before, i.e. only flip the idx when lock1 == > unlock1, maybe try with the P0 below? Untested due to not having herd on > this computer ;-) > > > WRITE_ONCE(*IDX, 1); > > smp_mb(); // D > > > > // SCAN2 > > unlock0 = READ_ONCE(*UNLOCK0); > > smp_mb(); // A > > lock0 = READ_ONCE(*LOCK0); > > } > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int lock1; > int unlock1; > int lock0; > int unlock0; > > // SCAN1 > unlock1 = READ_ONCE(*UNLOCK1); > smp_mb(); // A > lock1 = READ_ONCE(*LOCK1); > > // FLIP > if (unlock1 == lock1) { > smp_mb(); // E > WRITE_ONCE(*IDX, 1); > smp_mb(); // D > > // SCAN2 > unlock0 = READ_ONCE(*UNLOCK0); > smp_mb(); // A > lock0 = READ_ONCE(*LOCK0); > } > } That becomes the below (same effect). 
C D {} // updater P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int lock1; int unlock1; int lock0; int unlock0; // SCAN1 unlock1 = READ_ONCE(*UNLOCK1); smp_mb(); // A lock1 = READ_ONCE(*LOCK1); if (unlock1 == lock1) { // FLIP smp_mb(); // E WRITE_ONCE(*IDX, 1); smp_mb(); // D // SCAN 2 unlock0 = READ_ONCE(*UNLOCK0); smp_mb(); // A lock0 = READ_ONCE(*LOCK0); } } // reader P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int tmp; int idx; // 1st reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } // second reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } // third reader idx = READ_ONCE(*IDX); if (idx == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } } exists (0:unlock0=0 /\ 1:idx=0)
On Wed, Dec 21, 2022 at 5:20 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2022-12-21 07:11, Frederic Weisbecker wrote: > > On Tue, Dec 20, 2022 at 10:43:25PM -0500, Mathieu Desnoyers wrote: > >> On 2022-12-20 19:58, Frederic Weisbecker wrote: > >>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote: > >>>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > >>>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > >>>>> Agreed about (1). > >>>>> > >>>>>> _ In (2), E pairs with the address-dependency between idx and lock_count. > >>>>> > >>>>> But that is not the only reason. If that was the only reason for (2), > >>>>> then there is an smp_mb() just before the next-scan post-flip before > >>>>> the lock counts are read. > >>>> > >>>> The post-flip barrier makes sure the new idx is visible on the next READER's > >>>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > >>>> may appear unordered from the update side POV if there is no barrier between the > >>>> scan and the flip. > >>>> > >>>> If you remove the smp_mb() from the litmus test I sent, things explode. > >>> > >>> Or rather, look at it the other way, if there is no barrier between the lock > >>> scan and the index flip (E), then the index flip can appear to be written before the > >>> lock is read. Which means you may start activating the index before you finish > >>> reading it (at least it appears that way from the readers pont of view). > >> > >> Considering that you can have pre-existing readers from arbitrary index > >> appearing anywhere in the grace period (because a reader can fetch the > >> index and be preempted for an arbitrary amount of time before incrementing > >> the lock count), the grace period algorithm needs to deal with the fact that > >> a newcoming reader can appear in a given index either before or after the > >> flip. > > > > True but the number of preempted tasks is bound and there is a forward progress guarantee. > > > >> I don't see how flipping the index before or after loading the unlock/lock > >> values would break anything (except for unlikely counter overflow situations > >> as previously discussed). > > > > Forward progress guarantee. > > Considering a coherent cache, the store-buffer will ensure that the > index flip eventually reaches all readers. This bounds the time during > which readers can flood the current index, and therefore guarantees > forward progress. AFAIK the Linux kernel does not support architectures > with incoherent caches. > > So I still don't see how having the barrier before or after the index > flip is useful for forward progress. Even though eventually the writes on either side will make it, I think you have little lost opportunity without the "D" memory barrier without satisfying the store-buffer pattern. i.e. even if the idx update was seen by the read-side, its unlock count contributions to the now-inactive idx may not be seen. I think that makes it slightly better for making the grace period end sooner. I think that's what you "scan both" idea also tries to achieve, so for that reason you should like "D" as well. But yes it depends on how forward-progress is defined. If defined as starvation - no the post/pref flip MBs are not helpful for that (eventually the writes will make it). If defined as not losing an opportunity to end the GP sooner, yes the flip-related memory barriers does help that. As for whether or not we need "E", that is up for debate. 
I am intrigued by Frederic's litmus tests showing that it helps, because my litmus tests don't show it. Thanks.
Ah Frederic, I think you nailed it. E is required to order the flip write with the control-dependency on the READ side. I can confirm the below test with bad condition shows the previous reader sees the post-flip index when it shouldn't have. Please see below modifications to your Litmus test. I think we should document it in the code that E pairs with the control-dependency between idx read and lock count write. C srcu {} // updater P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int lock1; int unlock1; int lock0; int unlock0; // SCAN1 unlock1 = READ_ONCE(*UNLOCK1); smp_mb(); // A lock1 = READ_ONCE(*LOCK1); // FLIP smp_mb(); // E -------------------- required to make the bad condition not happen. WRITE_ONCE(*IDX, 1); smp_mb(); // D // SCAN2 unlock0 = READ_ONCE(*UNLOCK0); smp_mb(); // A lock0 = READ_ONCE(*LOCK0); } // reader P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) { int tmp; int idx1; int idx2; // 1st reader idx1 = READ_ONCE(*IDX); if (idx1 == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } // second reader idx2 = READ_ONCE(*IDX); if (idx2 == 0) { tmp = READ_ONCE(*LOCK0); WRITE_ONCE(*LOCK0, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK0); WRITE_ONCE(*UNLOCK0, tmp + 1); } else { tmp = READ_ONCE(*LOCK1); WRITE_ONCE(*LOCK1, tmp + 1); smp_mb(); /* B and C */ tmp = READ_ONCE(*UNLOCK1); WRITE_ONCE(*UNLOCK1, tmp + 1); } } exists (0:lock1=1 /\ 1:idx1=1 /\ 1:idx2=1 ) (* bad condition: 1st reader saw flip *) On Wed, Dec 21, 2022 at 5:30 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote: > > On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote: > > > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote: > > > > > > > > > > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > > > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > >> Agreed about (1). > > > > >> > > > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count. > > > > >> > > > > >> But that is not the only reason. If that was the only reason for (2), > > > > >> then there is an smp_mb() just before the next-scan post-flip before > > > > >> the lock counts are read. > > > > > > > > > > The post-flip barrier makes sure the new idx is visible on the next READER's > > > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > > > > > may appear unordered from the update side POV if there is no barrier between the > > > > > scan and the flip. > > > > > > > > > > If you remove the smp_mb() from the litmus test I sent, things explode. > > > > > > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency. 
> > > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A. > > > > > > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip). > > > > > > > > That’s just my opinion and let’s discuss more because I cannot rule out that I > > > > am missing something with this complicated topic ;-) > > > > > > I must be missing something. I often do. > > > > > > Ok let's put that on litmus: > > > > > > ---- > > > C srcu > > > > > > {} > > > > > > // updater > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > > { > > > int lock1; > > > int unlock1; > > > int lock0; > > > int unlock0; > > > > > > // SCAN1 > > > unlock1 = READ_ONCE(*UNLOCK1); > > > smp_mb(); // A > > > lock1 = READ_ONCE(*LOCK1); > > > > > > // FLIP > > > smp_mb(); // E > > > > In real code there is a control dependency between the READ_ONCE() above > > and the WRITE_ONCE() before, i.e. only flip the idx when lock1 == > > unlock1, maybe try with the P0 below? Untested due to not having herd on > > this computer ;-) > > > > > WRITE_ONCE(*IDX, 1); > > > smp_mb(); // D > > > > > > // SCAN2 > > > unlock0 = READ_ONCE(*UNLOCK0); > > > smp_mb(); // A > > > lock0 = READ_ONCE(*LOCK0); > > > } > > > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > { > > int lock1; > > int unlock1; > > int lock0; > > int unlock0; > > > > // SCAN1 > > unlock1 = READ_ONCE(*UNLOCK1); > > smp_mb(); // A > > lock1 = READ_ONCE(*LOCK1); > > > > // FLIP > > if (unlock1 == lock1) { > > smp_mb(); // E > > WRITE_ONCE(*IDX, 1); > > smp_mb(); // D > > > > // SCAN2 > > unlock0 = READ_ONCE(*UNLOCK0); > > smp_mb(); // A > > lock0 = READ_ONCE(*LOCK0); > > } > > } > > That becomes the below (same effect). 
> > C D > > {} > > // updater > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int lock1; > int unlock1; > int lock0; > int unlock0; > > // SCAN1 > unlock1 = READ_ONCE(*UNLOCK1); > smp_mb(); // A > lock1 = READ_ONCE(*LOCK1); > > if (unlock1 == lock1) { > // FLIP > smp_mb(); // E > WRITE_ONCE(*IDX, 1); > smp_mb(); // D > > // SCAN 2 > unlock0 = READ_ONCE(*UNLOCK0); > smp_mb(); // A > lock0 = READ_ONCE(*LOCK0); > } > } > > // reader > P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int tmp; > int idx; > > // 1st reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > > // second reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > > // third reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > } > > exists (0:unlock0=0 /\ 1:idx=0) >
Mathieu pointed out to me on IRC that adding the control dep on the update side removes even that need for E. And I see that is what Boqun was suggesting above and that Frederic modified. See updated litmus here: https://www.irccloud.com/pastebin/TrXacogO/ With this, I am not seeing the "bad condition" happen. On Wed, Dec 21, 2022 at 7:33 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > Ah Frederic, I think you nailed it. E is required to order the flip > write with the control-dependency on the READ side. I can confirm the > below test with bad condition shows the previous reader sees the > post-flip index when it shouldn't have. Please see below modifications > to your Litmus test. > > I think we should document it in the code that E pairs with the > control-dependency between idx read and lock count write. > > C srcu > {} > // updater > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int lock1; > int unlock1; > int lock0; > int unlock0; > > // SCAN1 > unlock1 = READ_ONCE(*UNLOCK1); > smp_mb(); // A > lock1 = READ_ONCE(*LOCK1); > > // FLIP > smp_mb(); // E -------------------- required to make the bad > condition not happen. > WRITE_ONCE(*IDX, 1); > smp_mb(); // D > > // SCAN2 > unlock0 = READ_ONCE(*UNLOCK0); > smp_mb(); // A > lock0 = READ_ONCE(*LOCK0); > } > > // reader > P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int tmp; > int idx1; > int idx2; > > // 1st reader > idx1 = READ_ONCE(*IDX); > if (idx1 == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > > // second reader > idx2 = READ_ONCE(*IDX); > if (idx2 == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > } > > exists (0:lock1=1 /\ 1:idx1=1 /\ 1:idx2=1 ) (* bad condition: 1st > reader saw flip *) > > > > > > On Wed, Dec 21, 2022 at 5:30 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote: > > > On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote: > > > > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote: > > > > > > > > > > > > > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > > > > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > > > > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > >> Agreed about (1). > > > > > >> > > > > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count. > > > > > >> > > > > > >> But that is not the only reason. If that was the only reason for (2), > > > > > >> then there is an smp_mb() just before the next-scan post-flip before > > > > > >> the lock counts are read. > > > > > > > > > > > > The post-flip barrier makes sure the new idx is visible on the next READER's > > > > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > > > > > > may appear unordered from the update side POV if there is no barrier between the > > > > > > scan and the flip. 
> > > > > > > > > > > > If you remove the smp_mb() from the litmus test I sent, things explode. > > > > > > > > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency. > > > > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A. > > > > > > > > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip). > > > > > > > > > > That’s just my opinion and let’s discuss more because I cannot rule out that I > > > > > am missing something with this complicated topic ;-) > > > > > > > > I must be missing something. I often do. > > > > > > > > Ok let's put that on litmus: > > > > > > > > ---- > > > > C srcu > > > > > > > > {} > > > > > > > > // updater > > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > > > { > > > > int lock1; > > > > int unlock1; > > > > int lock0; > > > > int unlock0; > > > > > > > > // SCAN1 > > > > unlock1 = READ_ONCE(*UNLOCK1); > > > > smp_mb(); // A > > > > lock1 = READ_ONCE(*LOCK1); > > > > > > > > // FLIP > > > > smp_mb(); // E > > > > > > In real code there is a control dependency between the READ_ONCE() above > > > and the WRITE_ONCE() before, i.e. only flip the idx when lock1 == > > > unlock1, maybe try with the P0 below? Untested due to not having herd on > > > this computer ;-) > > > > > > > WRITE_ONCE(*IDX, 1); > > > > smp_mb(); // D > > > > > > > > // SCAN2 > > > > unlock0 = READ_ONCE(*UNLOCK0); > > > > smp_mb(); // A > > > > lock0 = READ_ONCE(*LOCK0); > > > > } > > > > > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > > { > > > int lock1; > > > int unlock1; > > > int lock0; > > > int unlock0; > > > > > > // SCAN1 > > > unlock1 = READ_ONCE(*UNLOCK1); > > > smp_mb(); // A > > > lock1 = READ_ONCE(*LOCK1); > > > > > > // FLIP > > > if (unlock1 == lock1) { > > > smp_mb(); // E > > > WRITE_ONCE(*IDX, 1); > > > smp_mb(); // D > > > > > > // SCAN2 > > > unlock0 = READ_ONCE(*UNLOCK0); > > > smp_mb(); // A > > > lock0 = READ_ONCE(*LOCK0); > > > } > > > } > > > > That becomes the below (same effect). 
> > > > C D > > > > {} > > > > // updater > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > { > > int lock1; > > int unlock1; > > int lock0; > > int unlock0; > > > > // SCAN1 > > unlock1 = READ_ONCE(*UNLOCK1); > > smp_mb(); // A > > lock1 = READ_ONCE(*LOCK1); > > > > if (unlock1 == lock1) { > > // FLIP > > smp_mb(); // E > > WRITE_ONCE(*IDX, 1); > > smp_mb(); // D > > > > // SCAN 2 > > unlock0 = READ_ONCE(*UNLOCK0); > > smp_mb(); // A > > lock0 = READ_ONCE(*LOCK0); > > } > > } > > > > // reader > > P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > { > > int tmp; > > int idx; > > > > // 1st reader > > idx = READ_ONCE(*IDX); > > if (idx == 0) { > > tmp = READ_ONCE(*LOCK0); > > WRITE_ONCE(*LOCK0, tmp + 1); > > smp_mb(); /* B and C */ > > tmp = READ_ONCE(*UNLOCK0); > > WRITE_ONCE(*UNLOCK0, tmp + 1); > > } else { > > tmp = READ_ONCE(*LOCK1); > > WRITE_ONCE(*LOCK1, tmp + 1); > > smp_mb(); /* B and C */ > > tmp = READ_ONCE(*UNLOCK1); > > WRITE_ONCE(*UNLOCK1, tmp + 1); > > } > > > > // second reader > > idx = READ_ONCE(*IDX); > > if (idx == 0) { > > tmp = READ_ONCE(*LOCK0); > > WRITE_ONCE(*LOCK0, tmp + 1); > > smp_mb(); /* B and C */ > > tmp = READ_ONCE(*UNLOCK0); > > WRITE_ONCE(*UNLOCK0, tmp + 1); > > } else { > > tmp = READ_ONCE(*LOCK1); > > WRITE_ONCE(*LOCK1, tmp + 1); > > smp_mb(); /* B and C */ > > tmp = READ_ONCE(*UNLOCK1); > > WRITE_ONCE(*UNLOCK1, tmp + 1); > > } > > > > // third reader > > idx = READ_ONCE(*IDX); > > if (idx == 0) { > > tmp = READ_ONCE(*LOCK0); > > WRITE_ONCE(*LOCK0, tmp + 1); > > smp_mb(); /* B and C */ > > tmp = READ_ONCE(*UNLOCK0); > > WRITE_ONCE(*UNLOCK0, tmp + 1); > > } else { > > tmp = READ_ONCE(*LOCK1); > > WRITE_ONCE(*LOCK1, tmp + 1); > > smp_mb(); /* B and C */ > > tmp = READ_ONCE(*UNLOCK1); > > WRITE_ONCE(*UNLOCK1, tmp + 1); > > } > > } > > > > exists (0:unlock0=0 /\ 1:idx=0) > >
On Wed, Dec 21, 2022 at 06:30:05PM +0100, Frederic Weisbecker wrote: > On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote: > > On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote: > > > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote: > > > > > > > > > > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote: > > > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > >> Agreed about (1). > > > > >> > > > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count. > > > > >> > > > > >> But that is not the only reason. If that was the only reason for (2), > > > > >> then there is an smp_mb() just before the next-scan post-flip before > > > > >> the lock counts are read. > > > > > > > > > > The post-flip barrier makes sure the new idx is visible on the next READER's > > > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]" > > > > > may appear unordered from the update side POV if there is no barrier between the > > > > > scan and the flip. > > > > > > > > > > If you remove the smp_mb() from the litmus test I sent, things explode. > > > > > > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency. > > > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A. > > > > > > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip). > > > > > > > > That’s just my opinion and let’s discuss more because I cannot rule out that I > > > > am missing something with this complicated topic ;-) > > > > > > I must be missing something. I often do. > > > > > > Ok let's put that on litmus: > > > > > > ---- > > > C srcu > > > > > > {} > > > > > > // updater > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > > { > > > int lock1; > > > int unlock1; > > > int lock0; > > > int unlock0; > > > > > > // SCAN1 > > > unlock1 = READ_ONCE(*UNLOCK1); > > > smp_mb(); // A > > > lock1 = READ_ONCE(*LOCK1); > > > > > > // FLIP > > > smp_mb(); // E > > > > In real code there is a control dependency between the READ_ONCE() above > > and the WRITE_ONCE() before, i.e. only flip the idx when lock1 == > > unlock1, maybe try with the P0 below? 
Untested due to not having herd on > > this computer ;-) > > > > > WRITE_ONCE(*IDX, 1); > > > smp_mb(); // D > > > > > > // SCAN2 > > > unlock0 = READ_ONCE(*UNLOCK0); > > > smp_mb(); // A > > > lock0 = READ_ONCE(*LOCK0); > > > } > > > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > > { > > int lock1; > > int unlock1; > > int lock0; > > int unlock0; > > > > // SCAN1 > > unlock1 = READ_ONCE(*UNLOCK1); > > smp_mb(); // A > > lock1 = READ_ONCE(*LOCK1); > > > > // FLIP > > if (unlock1 == lock1) { > > smp_mb(); // E > > WRITE_ONCE(*IDX, 1); > > smp_mb(); // D > > > > // SCAN2 > > unlock0 = READ_ONCE(*UNLOCK0); > > smp_mb(); // A > > lock0 = READ_ONCE(*LOCK0); > > } > > } > > That becomes the below (same effect). > By "same effect" you mean removing E results in the exist-clause triggered? If so, then our environments disagree with each other ;-) I "download" your litmus test (frederic.litmus) and created an variant (frederic-remove-E.litmus) that only removes the E, both of the litmus tests are in the attachment. And here is the result on my env: for frederic.litmus: Test D Allowed States 6 0:unlock0=0; 1:idx=1; 0:unlock0=1; 1:idx=0; 0:unlock0=1; 1:idx=1; 0:unlock0=2; 1:idx=0; 0:unlock0=2; 1:idx=1; 0:unlock0=3; 1:idx=0; No Witnesses Positive: 0 Negative: 14 Condition exists (0:unlock0=0 /\ 1:idx=0) Observation D Never 0 14 Time D 54.22 Hash=eead834f635201cde8ceb21250e33381 for frederic-remove-E.litmus: Test D Allowed States 6 0:unlock0=0; 1:idx=1; 0:unlock0=1; 1:idx=0; 0:unlock0=1; 1:idx=1; 0:unlock0=2; 1:idx=0; 0:unlock0=2; 1:idx=1; 0:unlock0=3; 1:idx=0; No Witnesses Positive: 0 Negative: 14 Condition exists (0:unlock0=0 /\ 1:idx=0) Observation D Never 0 14 Time D 53.50 Hash=c2579f542cf9f87af125c5792999dc44 Regards, Boqun > C D > > {} > > // updater > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int lock1; > int unlock1; > int lock0; > int unlock0; > > // SCAN1 > unlock1 = READ_ONCE(*UNLOCK1); > smp_mb(); // A > lock1 = READ_ONCE(*LOCK1); > > if (unlock1 == lock1) { > // FLIP > smp_mb(); // E > WRITE_ONCE(*IDX, 1); > smp_mb(); // D > > // SCAN 2 > unlock0 = READ_ONCE(*UNLOCK0); > smp_mb(); // A > lock0 = READ_ONCE(*LOCK0); > } > } > > // reader > P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) > { > int tmp; > int idx; > > // 1st reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > > // second reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > > // third reader > idx = READ_ONCE(*IDX); > if (idx == 0) { > tmp = READ_ONCE(*LOCK0); > WRITE_ONCE(*LOCK0, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK0); > WRITE_ONCE(*UNLOCK0, tmp + 1); > } else { > tmp = READ_ONCE(*LOCK1); > WRITE_ONCE(*LOCK1, tmp + 1); > smp_mb(); /* B and C */ > tmp = READ_ONCE(*UNLOCK1); > WRITE_ONCE(*UNLOCK1, tmp + 1); > } > } > > exists (0:unlock0=0 /\ 1:idx=0) > C D {} // updater P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1) 
{
        int lock1;
        int unlock1;
        int lock0;
        int unlock0;

        // SCAN1
        unlock1 = READ_ONCE(*UNLOCK1);
        smp_mb(); // A
        lock1 = READ_ONCE(*LOCK1);

        if (unlock1 == lock1) {
                // FLIP
                smp_mb(); // E
                WRITE_ONCE(*IDX, 1);
                smp_mb(); // D

                // SCAN 2
                unlock0 = READ_ONCE(*UNLOCK0);
                smp_mb(); // A
                lock0 = READ_ONCE(*LOCK0);
        }
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int tmp;
        int idx;

        // 1st reader
        idx = READ_ONCE(*IDX);
        if (idx == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }

        // second reader
        idx = READ_ONCE(*IDX);
        if (idx == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }

        // third reader
        idx = READ_ONCE(*IDX);
        if (idx == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }
}

exists (0:unlock0=0 /\ 1:idx=0)

C D

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int lock1;
        int unlock1;
        int lock0;
        int unlock0;

        // SCAN1
        unlock1 = READ_ONCE(*UNLOCK1);
        smp_mb(); // A
        lock1 = READ_ONCE(*LOCK1);

        if (unlock1 == lock1) {
                // FLIP
                WRITE_ONCE(*IDX, 1);
                smp_mb(); // D

                // SCAN 2
                unlock0 = READ_ONCE(*UNLOCK0);
                smp_mb(); // A
                lock0 = READ_ONCE(*LOCK0);
        }
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int tmp;
        int idx;

        // 1st reader
        idx = READ_ONCE(*IDX);
        if (idx == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }

        // second reader
        idx = READ_ONCE(*IDX);
        if (idx == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }

        // third reader
        idx = READ_ONCE(*IDX);
        if (idx == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }
}

exists (0:unlock0=0 /\ 1:idx=0)
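For anyone wanting to reproduce Boqun's runs above: assuming the two attached files are checked against the Linux-kernel memory model shipped in the kernel tree's tools/memory-model/ directory (Boqun does not say exactly how he invoked herd), the usual invocation, run from that directory, would be along these lines:

        herd7 -conf linux-kernel.cfg frederic.litmus
        herd7 -conf linux-kernel.cfg frederic-remove-E.litmus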
On Wed, Dec 21, 2022 at 12:19:47PM -0800, Boqun Feng wrote: > On Wed, Dec 21, 2022 at 06:30:05PM +0100, Frederic Weisbecker wrote: > By "same effect" you mean removing E results in the exist-clause > triggered? If so, then our environments disagree with each other ;-) Nope, removing D :-) (And yeah probably I misread your previous mail and we weren't talking about the same thing...) Thanks.
On Thu, Dec 22, 2022 at 01:16:12PM +0100, Frederic Weisbecker wrote: > On Wed, Dec 21, 2022 at 12:19:47PM -0800, Boqun Feng wrote: > > On Wed, Dec 21, 2022 at 06:30:05PM +0100, Frederic Weisbecker wrote: > > By "same effect" you mean removing E results in the exist-clause > > triggered? If so, then our environments disagree with each other ;-) > > Nope, removing D :-) > > (And yeah probably I misread your previous mail and we weren't talking > about the same thing...) Yeah checking that again, I got confused in the discussion between my two tests srcu-E and srcu-D. Anyway, you were right about the control dependency :) > > Thanks.
On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > On 2022-12-21 06:59, Frederic Weisbecker wrote: > > On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > [...] > > > > > > The memory ordering constraint I am concerned about here is: > > > > > > * [...] In addition, > > > * each CPU having an SRCU read-side critical section that extends beyond > > > * the return from synchronize_srcu() is guaranteed to have executed a > > > * full memory barrier after the beginning of synchronize_srcu() and before > > > * the beginning of that SRCU read-side critical section. [...] > > > > > > So if we have a SRCU read-side critical section that begins after the beginning > > > of synchronize_srcu, but before its first memory barrier, it would miss the > > > guarantee that the full memory barrier is issued before the beginning of that > > > SRCU read-side critical section. IOW, that memory barrier needs to be at the > > > very beginning of the grace period. > > > > I'm confused, what's wrong with this ? > > > > UPDATER READER > > ------- ------ > > STORE X = 1 STORE srcu_read_lock++ > > // rcu_seq_snap() smp_mb() > > smp_mb() READ X > > // scans > > READ srcu_read_lock > > What you refer to here is only memory ordering of the store to X and load > from X wrt loading/increment of srcu_read_lock, which is internal to the > srcu implementation. If we really want to model the provided high-level > memory ordering guarantees, we should consider a scenario where SRCU is used > for its memory ordering properties to synchronize other variables. > > I'm concerned about the following Dekker scenario, where synchronize_srcu() > and srcu_read_lock/unlock would be used instead of memory barriers: > > Initial state: X = 0, Y = 0 > > Thread A Thread B > --------------------------------------------- > STORE X = 1 STORE Y = 1 > synchronize_srcu() > srcu_read_lock() > r1 = LOAD X > srcu_read_unlock() > r0 = LOAD Y > > BUG_ON(!r0 && !r1) > > So in the synchronize_srcu implementation, there appears to be two > major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > or it uses an already started gp/expedited gp. When snapshotting with > rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > load means that it does not order prior memory accesses before that load. > This sequence value is then used to identify which gp_seq to wait for when > piggy-backing on another already-started gp. I worry about reordering > between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > piggy-back on an already-started gp. > > I suspect that the implicit barrier in srcu_read_lock() invoked at the > beginning of srcu_gp_start_if_needed() is really the barrier that makes > all this behave as expected. But without documentation it's rather hard to > follow. Oh ok I see now. It might be working that way by accident or on forgotten purpose. In any case, we really want to add a comment above that __srcu_read_lock_nmisafe() call.
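Mathieu's Dekker scenario can also be written as a litmus test, assuming herd7's LKMM support for the SRCU primitives (srcu_read_lock(), srcu_read_unlock() and synchronize_srcu() as modeled in tools/memory-model). Note that this only restates the guarantee being relied upon against LKMM's abstract grace-period model; it cannot by itself show which srcutree barrier provides that ordering, which is the open question in this subthread.

C srcu-dekker

{}

// Thread A of Mathieu's scenario
P0(int *X, int *Y, struct srcu_struct *s)
{
        int r0;

        WRITE_ONCE(*X, 1);
        synchronize_srcu(s);
        r0 = READ_ONCE(*Y);
}

// Thread B of Mathieu's scenario
P1(int *X, int *Y, struct srcu_struct *s)
{
        int r1;
        int idx;

        WRITE_ONCE(*Y, 1);
        idx = srcu_read_lock(s);
        r1 = READ_ONCE(*X);
        srcu_read_unlock(s, idx);
}

exists (0:r0=0 /\ 1:r1=0) (* the BUG_ON(!r0 && !r1) condition *)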
> On Dec 22, 2022, at 7:40 AM, Frederic Weisbecker <frederic@kernel.org> wrote: > > On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: >>> On 2022-12-21 06:59, Frederic Weisbecker wrote: >>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: >> [...] >>>> >>>> The memory ordering constraint I am concerned about here is: >>>> >>>> * [...] In addition, >>>> * each CPU having an SRCU read-side critical section that extends beyond >>>> * the return from synchronize_srcu() is guaranteed to have executed a >>>> * full memory barrier after the beginning of synchronize_srcu() and before >>>> * the beginning of that SRCU read-side critical section. [...] >>>> >>>> So if we have a SRCU read-side critical section that begins after the beginning >>>> of synchronize_srcu, but before its first memory barrier, it would miss the >>>> guarantee that the full memory barrier is issued before the beginning of that >>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the >>>> very beginning of the grace period. >>> >>> I'm confused, what's wrong with this ? >>> >>> UPDATER READER >>> ------- ------ >>> STORE X = 1 STORE srcu_read_lock++ >>> // rcu_seq_snap() smp_mb() >>> smp_mb() READ X >>> // scans >>> READ srcu_read_lock >> >> What you refer to here is only memory ordering of the store to X and load >> from X wrt loading/increment of srcu_read_lock, which is internal to the >> srcu implementation. If we really want to model the provided high-level >> memory ordering guarantees, we should consider a scenario where SRCU is used >> for its memory ordering properties to synchronize other variables. >> >> I'm concerned about the following Dekker scenario, where synchronize_srcu() >> and srcu_read_lock/unlock would be used instead of memory barriers: >> >> Initial state: X = 0, Y = 0 >> >> Thread A Thread B >> --------------------------------------------- >> STORE X = 1 STORE Y = 1 >> synchronize_srcu() >> srcu_read_lock() >> r1 = LOAD X >> srcu_read_unlock() >> r0 = LOAD Y >> >> BUG_ON(!r0 && !r1) >> >> So in the synchronize_srcu implementation, there appears to be two >> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, >> or it uses an already started gp/expedited gp. When snapshotting with >> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq >> load means that it does not order prior memory accesses before that load. >> This sequence value is then used to identify which gp_seq to wait for when >> piggy-backing on another already-started gp. I worry about reordering >> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to >> piggy-back on an already-started gp. >> >> I suspect that the implicit barrier in srcu_read_lock() invoked at the >> beginning of srcu_gp_start_if_needed() is really the barrier that makes >> all this behave as expected. But without documentation it's rather hard to >> follow. > > Oh ok I see now. It might be working that way by accident or on forgotten > purpose. In any case, we really want to add a comment above that > __srcu_read_lock_nmisafe() call. Agreed on the analysis and the need for comments, thanks! I think we also ought to document some more cases like, how the memory barriers here relate to the other memory barrier A inside the scan loop. But I guess for me I need to understand the GP guarantee ordering first before attempting documenting that aspect. Oh well, thank God for holidays. ;-) Thanks, - Joel >
On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > > On 2022-12-21 06:59, Frederic Weisbecker wrote: > > > On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > > [...] > > > > > > > > The memory ordering constraint I am concerned about here is: > > > > > > > > * [...] In addition, > > > > * each CPU having an SRCU read-side critical section that extends beyond > > > > * the return from synchronize_srcu() is guaranteed to have executed a > > > > * full memory barrier after the beginning of synchronize_srcu() and before > > > > * the beginning of that SRCU read-side critical section. [...] > > > > > > > > So if we have a SRCU read-side critical section that begins after the beginning > > > > of synchronize_srcu, but before its first memory barrier, it would miss the > > > > guarantee that the full memory barrier is issued before the beginning of that > > > > SRCU read-side critical section. IOW, that memory barrier needs to be at the > > > > very beginning of the grace period. > > > > > > I'm confused, what's wrong with this ? > > > > > > UPDATER READER > > > ------- ------ > > > STORE X = 1 STORE srcu_read_lock++ > > > // rcu_seq_snap() smp_mb() > > > smp_mb() READ X > > > // scans > > > READ srcu_read_lock > > > > What you refer to here is only memory ordering of the store to X and load > > from X wrt loading/increment of srcu_read_lock, which is internal to the > > srcu implementation. If we really want to model the provided high-level > > memory ordering guarantees, we should consider a scenario where SRCU is used > > for its memory ordering properties to synchronize other variables. > > > > I'm concerned about the following Dekker scenario, where synchronize_srcu() > > and srcu_read_lock/unlock would be used instead of memory barriers: > > > > Initial state: X = 0, Y = 0 > > > > Thread A Thread B > > --------------------------------------------- > > STORE X = 1 STORE Y = 1 > > synchronize_srcu() > > srcu_read_lock() > > r1 = LOAD X > > srcu_read_unlock() > > r0 = LOAD Y > > > > BUG_ON(!r0 && !r1) > > > > So in the synchronize_srcu implementation, there appears to be two > > major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > > or it uses an already started gp/expedited gp. When snapshotting with > > rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > > load means that it does not order prior memory accesses before that load. > > This sequence value is then used to identify which gp_seq to wait for when > > piggy-backing on another already-started gp. I worry about reordering > > between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > > piggy-back on an already-started gp. > > > > I suspect that the implicit barrier in srcu_read_lock() invoked at the > > beginning of srcu_gp_start_if_needed() is really the barrier that makes > > all this behave as expected. But without documentation it's rather hard to > > follow. > > Oh ok I see now. It might be working that way by accident or on forgotten > purpose. In any case, we really want to add a comment above that > __srcu_read_lock_nmisafe() call. Another test for the safety (or not) of removing either D or E is to move that WRITE_ONCE() to follow (or, respectively, precede) the adjacent scans. Thanx, Paul
> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: >>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: >>> On 2022-12-21 06:59, Frederic Weisbecker wrote: >>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: >>> [...] >>>>> >>>>> The memory ordering constraint I am concerned about here is: >>>>> >>>>> * [...] In addition, >>>>> * each CPU having an SRCU read-side critical section that extends beyond >>>>> * the return from synchronize_srcu() is guaranteed to have executed a >>>>> * full memory barrier after the beginning of synchronize_srcu() and before >>>>> * the beginning of that SRCU read-side critical section. [...] >>>>> >>>>> So if we have a SRCU read-side critical section that begins after the beginning >>>>> of synchronize_srcu, but before its first memory barrier, it would miss the >>>>> guarantee that the full memory barrier is issued before the beginning of that >>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the >>>>> very beginning of the grace period. >>>> >>>> I'm confused, what's wrong with this ? >>>> >>>> UPDATER READER >>>> ------- ------ >>>> STORE X = 1 STORE srcu_read_lock++ >>>> // rcu_seq_snap() smp_mb() >>>> smp_mb() READ X >>>> // scans >>>> READ srcu_read_lock >>> >>> What you refer to here is only memory ordering of the store to X and load >>> from X wrt loading/increment of srcu_read_lock, which is internal to the >>> srcu implementation. If we really want to model the provided high-level >>> memory ordering guarantees, we should consider a scenario where SRCU is used >>> for its memory ordering properties to synchronize other variables. >>> >>> I'm concerned about the following Dekker scenario, where synchronize_srcu() >>> and srcu_read_lock/unlock would be used instead of memory barriers: >>> >>> Initial state: X = 0, Y = 0 >>> >>> Thread A Thread B >>> --------------------------------------------- >>> STORE X = 1 STORE Y = 1 >>> synchronize_srcu() >>> srcu_read_lock() >>> r1 = LOAD X >>> srcu_read_unlock() >>> r0 = LOAD Y >>> >>> BUG_ON(!r0 && !r1) >>> >>> So in the synchronize_srcu implementation, there appears to be two >>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, >>> or it uses an already started gp/expedited gp. When snapshotting with >>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq >>> load means that it does not order prior memory accesses before that load. >>> This sequence value is then used to identify which gp_seq to wait for when >>> piggy-backing on another already-started gp. I worry about reordering >>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to >>> piggy-back on an already-started gp. >>> >>> I suspect that the implicit barrier in srcu_read_lock() invoked at the >>> beginning of srcu_gp_start_if_needed() is really the barrier that makes >>> all this behave as expected. But without documentation it's rather hard to >>> follow. >> >> Oh ok I see now. It might be working that way by accident or on forgotten >> purpose. In any case, we really want to add a comment above that >> __srcu_read_lock_nmisafe() call. > > Another test for the safety (or not) of removing either D or E is > to move that WRITE_ONCE() to follow (or, respectively, precede) the > adjacent scans. Good idea, though I believe the MBs that the above talk about are not the flip ones. 
They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. So that (flipping MBs) is unrelated, or did I miss something? Thanks. > > Thanx, Paul
On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote: > > > > On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > >>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > >>> On 2022-12-21 06:59, Frederic Weisbecker wrote: > >>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > >>> [...] > >>>>> > >>>>> The memory ordering constraint I am concerned about here is: > >>>>> > >>>>> * [...] In addition, > >>>>> * each CPU having an SRCU read-side critical section that extends beyond > >>>>> * the return from synchronize_srcu() is guaranteed to have executed a > >>>>> * full memory barrier after the beginning of synchronize_srcu() and before > >>>>> * the beginning of that SRCU read-side critical section. [...] > >>>>> > >>>>> So if we have a SRCU read-side critical section that begins after the beginning > >>>>> of synchronize_srcu, but before its first memory barrier, it would miss the > >>>>> guarantee that the full memory barrier is issued before the beginning of that > >>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the > >>>>> very beginning of the grace period. > >>>> > >>>> I'm confused, what's wrong with this ? > >>>> > >>>> UPDATER READER > >>>> ------- ------ > >>>> STORE X = 1 STORE srcu_read_lock++ > >>>> // rcu_seq_snap() smp_mb() > >>>> smp_mb() READ X > >>>> // scans > >>>> READ srcu_read_lock > >>> > >>> What you refer to here is only memory ordering of the store to X and load > >>> from X wrt loading/increment of srcu_read_lock, which is internal to the > >>> srcu implementation. If we really want to model the provided high-level > >>> memory ordering guarantees, we should consider a scenario where SRCU is used > >>> for its memory ordering properties to synchronize other variables. > >>> > >>> I'm concerned about the following Dekker scenario, where synchronize_srcu() > >>> and srcu_read_lock/unlock would be used instead of memory barriers: > >>> > >>> Initial state: X = 0, Y = 0 > >>> > >>> Thread A Thread B > >>> --------------------------------------------- > >>> STORE X = 1 STORE Y = 1 > >>> synchronize_srcu() > >>> srcu_read_lock() > >>> r1 = LOAD X > >>> srcu_read_unlock() > >>> r0 = LOAD Y > >>> > >>> BUG_ON(!r0 && !r1) > >>> > >>> So in the synchronize_srcu implementation, there appears to be two > >>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > >>> or it uses an already started gp/expedited gp. When snapshotting with > >>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > >>> load means that it does not order prior memory accesses before that load. > >>> This sequence value is then used to identify which gp_seq to wait for when > >>> piggy-backing on another already-started gp. I worry about reordering > >>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > >>> piggy-back on an already-started gp. > >>> > >>> I suspect that the implicit barrier in srcu_read_lock() invoked at the > >>> beginning of srcu_gp_start_if_needed() is really the barrier that makes > >>> all this behave as expected. But without documentation it's rather hard to > >>> follow. > >> > >> Oh ok I see now. It might be working that way by accident or on forgotten > >> purpose. In any case, we really want to add a comment above that > >> __srcu_read_lock_nmisafe() call. 
> > > > Another test for the safety (or not) of removing either D or E is > > to move that WRITE_ONCE() to follow (or, respectively, precede) the > > adjacent scans. > > Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. > > So that (flipping MBs) is unrelated, or did I miss something? The thought is to manually similate in the source code the maximum memory-reference reordering that a maximally hostile compiler and CPU would be permitted to carry out. So yes, given that there are other memory barriers before and after, these other memory barriers limit how far the flip may be moved in the source code. Here I am talking about the memory barriers associated with the flip, but the same trick can of course be applied to other memory barriers. In general, remove a given memory barrier and (in the source code) maximally rearrange the memory references that were previously ordered by the memory barrier in question. Again, the presence of other memory barriers will limit the permitted maximal source-code rearrangement. Thanx, Paul
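As one concrete reading of Paul's thought experiment for barrier E: with E removed, nothing stops the flip (a plain store) from being reordered before the lock-count read of the preceding scan, although barrier A still keeps it after the unlock-count read. In the notation of the litmus tests earlier in the thread, the manually "pre-reordered" updater would look roughly like the sketch below; this is only an illustration of the source-rearrangement trick, not proposed code.

// updater with E removed and the flip hoisted as far up as barrier A allows
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int lock1;
        int unlock1;
        int lock0;
        int unlock0;

        // SCAN1
        unlock1 = READ_ONCE(*UNLOCK1);
        smp_mb();                       // A
        WRITE_ONCE(*IDX, 1);            // FLIP, moved to precede the lock scan
        lock1 = READ_ONCE(*LOCK1);
        smp_mb();                       // D

        // SCAN2
        unlock0 = READ_ONCE(*UNLOCK0);
        smp_mb();                       // A
        lock0 = READ_ONCE(*LOCK0);
}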
> On Dec 22, 2022, at 1:53 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote: >> >> >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: >>> >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote: >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: >>>>> [...] >>>>>>> >>>>>>> The memory ordering constraint I am concerned about here is: >>>>>>> >>>>>>> * [...] In addition, >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before >>>>>>> * the beginning of that SRCU read-side critical section. [...] >>>>>>> >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the >>>>>>> guarantee that the full memory barrier is issued before the beginning of that >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the >>>>>>> very beginning of the grace period. >>>>>> >>>>>> I'm confused, what's wrong with this ? >>>>>> >>>>>> UPDATER READER >>>>>> ------- ------ >>>>>> STORE X = 1 STORE srcu_read_lock++ >>>>>> // rcu_seq_snap() smp_mb() >>>>>> smp_mb() READ X >>>>>> // scans >>>>>> READ srcu_read_lock >>>>> >>>>> What you refer to here is only memory ordering of the store to X and load >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the >>>>> srcu implementation. If we really want to model the provided high-level >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used >>>>> for its memory ordering properties to synchronize other variables. >>>>> >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu() >>>>> and srcu_read_lock/unlock would be used instead of memory barriers: >>>>> >>>>> Initial state: X = 0, Y = 0 >>>>> >>>>> Thread A Thread B >>>>> --------------------------------------------- >>>>> STORE X = 1 STORE Y = 1 >>>>> synchronize_srcu() >>>>> srcu_read_lock() >>>>> r1 = LOAD X >>>>> srcu_read_unlock() >>>>> r0 = LOAD Y >>>>> >>>>> BUG_ON(!r0 && !r1) >>>>> >>>>> So in the synchronize_srcu implementation, there appears to be two >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, >>>>> or it uses an already started gp/expedited gp. When snapshotting with >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq >>>>> load means that it does not order prior memory accesses before that load. >>>>> This sequence value is then used to identify which gp_seq to wait for when >>>>> piggy-backing on another already-started gp. I worry about reordering >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to >>>>> piggy-back on an already-started gp. >>>>> >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes >>>>> all this behave as expected. But without documentation it's rather hard to >>>>> follow. >>>> >>>> Oh ok I see now. It might be working that way by accident or on forgotten >>>> purpose. 
In any case, we really want to add a comment above that >>>> __srcu_read_lock_nmisafe() call. >>> >>> Another test for the safety (or not) of removing either D or E is >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the >>> adjacent scans. >> >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. >> >> So that (flipping MBs) is unrelated, or did I miss something? > > The thought is to manually similate in the source code the maximum > memory-reference reordering that a maximally hostile compiler and CPU > would be permitted to carry out. So yes, given that there are other > memory barriers before and after, these other memory barriers limit how > far the flip may be moved in the source code. > > Here I am talking about the memory barriers associated with the flip, > but the same trick can of course be applied to other memory barriers. > In general, remove a given memory barrier and (in the source code) > maximally rearrange the memory references that were previously ordered > by the memory barrier in question. > > Again, the presence of other memory barriers will limit the permitted > maximal source-code rearrangement. Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers. Thanks. > > Thanx, Paul
On Thu, Dec 22, 2022 at 01:56:17PM -0500, Joel Fernandes wrote: > > > > On Dec 22, 2022, at 1:53 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote: > >> > >> > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > >>> > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote: > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > >>>>> [...] > >>>>>>> > >>>>>>> The memory ordering constraint I am concerned about here is: > >>>>>>> > >>>>>>> * [...] In addition, > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before > >>>>>>> * the beginning of that SRCU read-side critical section. [...] > >>>>>>> > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the > >>>>>>> very beginning of the grace period. > >>>>>> > >>>>>> I'm confused, what's wrong with this ? > >>>>>> > >>>>>> UPDATER READER > >>>>>> ------- ------ > >>>>>> STORE X = 1 STORE srcu_read_lock++ > >>>>>> // rcu_seq_snap() smp_mb() > >>>>>> smp_mb() READ X > >>>>>> // scans > >>>>>> READ srcu_read_lock > >>>>> > >>>>> What you refer to here is only memory ordering of the store to X and load > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the > >>>>> srcu implementation. If we really want to model the provided high-level > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used > >>>>> for its memory ordering properties to synchronize other variables. > >>>>> > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu() > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers: > >>>>> > >>>>> Initial state: X = 0, Y = 0 > >>>>> > >>>>> Thread A Thread B > >>>>> --------------------------------------------- > >>>>> STORE X = 1 STORE Y = 1 > >>>>> synchronize_srcu() > >>>>> srcu_read_lock() > >>>>> r1 = LOAD X > >>>>> srcu_read_unlock() > >>>>> r0 = LOAD Y > >>>>> > >>>>> BUG_ON(!r0 && !r1) > >>>>> > >>>>> So in the synchronize_srcu implementation, there appears to be two > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > >>>>> or it uses an already started gp/expedited gp. When snapshotting with > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > >>>>> load means that it does not order prior memory accesses before that load. > >>>>> This sequence value is then used to identify which gp_seq to wait for when > >>>>> piggy-backing on another already-started gp. I worry about reordering > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > >>>>> piggy-back on an already-started gp. > >>>>> > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes > >>>>> all this behave as expected. 
But without documentation it's rather hard to > >>>>> follow. > >>>> > >>>> Oh ok I see now. It might be working that way by accident or on forgotten > >>>> purpose. In any case, we really want to add a comment above that > >>>> __srcu_read_lock_nmisafe() call. > >>> > >>> Another test for the safety (or not) of removing either D or E is > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the > >>> adjacent scans. > >> > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. > >> > >> So that (flipping MBs) is unrelated, or did I miss something? > > > > The thought is to manually similate in the source code the maximum > > memory-reference reordering that a maximally hostile compiler and CPU > > would be permitted to carry out. So yes, given that there are other > > memory barriers before and after, these other memory barriers limit how > > far the flip may be moved in the source code. > > > > Here I am talking about the memory barriers associated with the flip, > > but the same trick can of course be applied to other memory barriers. > > In general, remove a given memory barrier and (in the source code) > > maximally rearrange the memory references that were previously ordered > > by the memory barrier in question. > > > > Again, the presence of other memory barriers will limit the permitted > > maximal source-code rearrangement. > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers. First, are we talking about the same barrier? I am talking about E. Yes, this would require a bit of restructuring. The overall approach would be something like this, in SRCU_STATE_SCAN1: 1. Scan the unlocks. 2. smp_mb(); /* A */ 3. Flip the index. 4. Scan the locks. 5. If unlocks == locks, advance the state to SRCU_STATE_SCAN2. 6. Otherwise, execute the current SRCU_STATE_SCAN1 code. Give or take the usual devils in the details. Alternatively, remove E and hammer it on a weakly ordered system. Thanx, Paul
On Thu, Dec 22, 2022 at 2:45 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Thu, Dec 22, 2022 at 01:56:17PM -0500, Joel Fernandes wrote: > > > > > > > On Dec 22, 2022, at 1:53 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote: > > >> > > >> > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > > >>> > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote: > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > > >>>>> [...] > > >>>>>>> > > >>>>>>> The memory ordering constraint I am concerned about here is: > > >>>>>>> > > >>>>>>> * [...] In addition, > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before > > >>>>>>> * the beginning of that SRCU read-side critical section. [...] > > >>>>>>> > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the > > >>>>>>> very beginning of the grace period. > > >>>>>> > > >>>>>> I'm confused, what's wrong with this ? > > >>>>>> > > >>>>>> UPDATER READER > > >>>>>> ------- ------ > > >>>>>> STORE X = 1 STORE srcu_read_lock++ > > >>>>>> // rcu_seq_snap() smp_mb() > > >>>>>> smp_mb() READ X > > >>>>>> // scans > > >>>>>> READ srcu_read_lock > > >>>>> > > >>>>> What you refer to here is only memory ordering of the store to X and load > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the > > >>>>> srcu implementation. If we really want to model the provided high-level > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used > > >>>>> for its memory ordering properties to synchronize other variables. > > >>>>> > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu() > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers: > > >>>>> > > >>>>> Initial state: X = 0, Y = 0 > > >>>>> > > >>>>> Thread A Thread B > > >>>>> --------------------------------------------- > > >>>>> STORE X = 1 STORE Y = 1 > > >>>>> synchronize_srcu() > > >>>>> srcu_read_lock() > > >>>>> r1 = LOAD X > > >>>>> srcu_read_unlock() > > >>>>> r0 = LOAD Y > > >>>>> > > >>>>> BUG_ON(!r0 && !r1) > > >>>>> > > >>>>> So in the synchronize_srcu implementation, there appears to be two > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > > >>>>> load means that it does not order prior memory accesses before that load. > > >>>>> This sequence value is then used to identify which gp_seq to wait for when > > >>>>> piggy-backing on another already-started gp. I worry about reordering > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > > >>>>> piggy-back on an already-started gp. 
> > >>>>> > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes > > >>>>> all this behave as expected. But without documentation it's rather hard to > > >>>>> follow. > > >>>> > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten > > >>>> purpose. In any case, we really want to add a comment above that > > >>>> __srcu_read_lock_nmisafe() call. > > >>> > > >>> Another test for the safety (or not) of removing either D or E is > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the > > >>> adjacent scans. > > >> > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. > > >> > > >> So that (flipping MBs) is unrelated, or did I miss something? > > > > > > The thought is to manually similate in the source code the maximum > > > memory-reference reordering that a maximally hostile compiler and CPU > > > would be permitted to carry out. So yes, given that there are other > > > memory barriers before and after, these other memory barriers limit how > > > far the flip may be moved in the source code. > > > > > > Here I am talking about the memory barriers associated with the flip, > > > but the same trick can of course be applied to other memory barriers. > > > In general, remove a given memory barrier and (in the source code) > > > maximally rearrange the memory references that were previously ordered > > > by the memory barrier in question. > > > > > > Again, the presence of other memory barriers will limit the permitted > > > maximal source-code rearrangement. > > > > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers. > > First, are we talking about the same barrier? I am talking about E. No, in the last part you replied to above, Mathieu and Frederic were talking about the need for memory barriers in synchronize_srcu(). That has nothing to do with E: <quote> I suspect that the implicit barrier in srcu_read_lock() invoked at the beginning of srcu_gp_start_if_needed() is really the barrier that makes all this behave as expected. </quote> We need to order code prior to synchronize_srcu() wrt the start of the grace period, so that readers that started after the grace period started see those side-effects as they may not be waited on (they are too late). > Alternatively, remove E and hammer it on a weakly ordered system. But E is useless, see below patch link with litmus test [1]. I don't see what hammering it on a weakly ordered system buys if it does not even execute for the cases when it is supposed to help because of causality with the control dependency of the prior scan :-/ Thanks, -Joel [1] https://lore.kernel.org/rcu/19A03A16-9F38-481F-824D-4C883AB8B105@joelfernandes.org/T/#m60e37d1277a8dd626bb95f92083e4e602c429704
On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote: [...] > > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > >>> > > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote: > > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > > > >>>>> [...] > > > >>>>>>> > > > >>>>>>> The memory ordering constraint I am concerned about here is: > > > >>>>>>> > > > >>>>>>> * [...] In addition, > > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond > > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a > > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before > > > >>>>>>> * the beginning of that SRCU read-side critical section. [...] > > > >>>>>>> > > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning > > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the > > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that > > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the > > > >>>>>>> very beginning of the grace period. > > > >>>>>> > > > >>>>>> I'm confused, what's wrong with this ? > > > >>>>>> > > > >>>>>> UPDATER READER > > > >>>>>> ------- ------ > > > >>>>>> STORE X = 1 STORE srcu_read_lock++ > > > >>>>>> // rcu_seq_snap() smp_mb() > > > >>>>>> smp_mb() READ X > > > >>>>>> // scans > > > >>>>>> READ srcu_read_lock > > > >>>>> > > > >>>>> What you refer to here is only memory ordering of the store to X and load > > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the > > > >>>>> srcu implementation. If we really want to model the provided high-level > > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used > > > >>>>> for its memory ordering properties to synchronize other variables. > > > >>>>> > > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu() > > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers: > > > >>>>> > > > >>>>> Initial state: X = 0, Y = 0 > > > >>>>> > > > >>>>> Thread A Thread B > > > >>>>> --------------------------------------------- > > > >>>>> STORE X = 1 STORE Y = 1 > > > >>>>> synchronize_srcu() > > > >>>>> srcu_read_lock() > > > >>>>> r1 = LOAD X > > > >>>>> srcu_read_unlock() > > > >>>>> r0 = LOAD Y > > > >>>>> > > > >>>>> BUG_ON(!r0 && !r1) > > > >>>>> > > > >>>>> So in the synchronize_srcu implementation, there appears to be two > > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with > > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > > > >>>>> load means that it does not order prior memory accesses before that load. > > > >>>>> This sequence value is then used to identify which gp_seq to wait for when > > > >>>>> piggy-backing on another already-started gp. I worry about reordering > > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > > > >>>>> piggy-back on an already-started gp. 
> > > >>>>> > > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the > > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes > > > >>>>> all this behave as expected. But without documentation it's rather hard to > > > >>>>> follow. > > > >>>> > > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten > > > >>>> purpose. In any case, we really want to add a comment above that > > > >>>> __srcu_read_lock_nmisafe() call. > > > >>> > > > >>> Another test for the safety (or not) of removing either D or E is > > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the > > > >>> adjacent scans. > > > >> > > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. > > > >> > > > >> So that (flipping MBs) is unrelated, or did I miss something? > > > > > > > > The thought is to manually similate in the source code the maximum > > > > memory-reference reordering that a maximally hostile compiler and CPU > > > > would be permitted to carry out. So yes, given that there are other > > > > memory barriers before and after, these other memory barriers limit how > > > > far the flip may be moved in the source code. > > > > > > > > Here I am talking about the memory barriers associated with the flip, > > > > but the same trick can of course be applied to other memory barriers. > > > > In general, remove a given memory barrier and (in the source code) > > > > maximally rearrange the memory references that were previously ordered > > > > by the memory barrier in question. > > > > > > > > Again, the presence of other memory barriers will limit the permitted > > > > maximal source-code rearrangement. > > > > > > > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers. > > > > First, are we talking about the same barrier? I am talking about E. > > No, in the last part you replied to above, Mathieu and Frederic were > talking about the need for memory barriers in synchronize_srcu(). That > has nothing to do with E: > <quote> > I suspect that the implicit barrier in srcu_read_lock() invoked at the > beginning of srcu_gp_start_if_needed() is really the barrier that makes > all this behave as expected. > </quote> > > We need to order code prior to synchronize_srcu() wrt the start of the > grace period, so that readers that started after the grace period > started see those side-effects as they may not be waited on (they are > too late). My thought is this is achieved by the srcu_read_lock() before the grace period is started, and the start of the grace period (which is basically the smp_mb() in the first scan). So from memory ordering PoV, if synchronize_rcu() spans the GP, and readers don't span the GP, that means the reader does not span the synchronize_rcu() which is the GP guarantee. But I could be missing something. I will dig more on my side. Thanks.
On Fri, Dec 23, 2022 at 11:12:06AM -0500, Joel Fernandes wrote: > On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote: > [...] > > > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > >>> > > > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > > > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > > > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote: > > > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > > > > >>>>> [...] > > > > >>>>>>> > > > > >>>>>>> The memory ordering constraint I am concerned about here is: > > > > >>>>>>> > > > > >>>>>>> * [...] In addition, > > > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond > > > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a > > > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before > > > > >>>>>>> * the beginning of that SRCU read-side critical section. [...] > > > > >>>>>>> > > > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning > > > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the > > > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that > > > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the > > > > >>>>>>> very beginning of the grace period. > > > > >>>>>> > > > > >>>>>> I'm confused, what's wrong with this ? > > > > >>>>>> > > > > >>>>>> UPDATER READER > > > > >>>>>> ------- ------ > > > > >>>>>> STORE X = 1 STORE srcu_read_lock++ > > > > >>>>>> // rcu_seq_snap() smp_mb() > > > > >>>>>> smp_mb() READ X > > > > >>>>>> // scans > > > > >>>>>> READ srcu_read_lock > > > > >>>>> > > > > >>>>> What you refer to here is only memory ordering of the store to X and load > > > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the > > > > >>>>> srcu implementation. If we really want to model the provided high-level > > > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used > > > > >>>>> for its memory ordering properties to synchronize other variables. > > > > >>>>> > > > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu() > > > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers: > > > > >>>>> > > > > >>>>> Initial state: X = 0, Y = 0 > > > > >>>>> > > > > >>>>> Thread A Thread B > > > > >>>>> --------------------------------------------- > > > > >>>>> STORE X = 1 STORE Y = 1 > > > > >>>>> synchronize_srcu() > > > > >>>>> srcu_read_lock() > > > > >>>>> r1 = LOAD X > > > > >>>>> srcu_read_unlock() > > > > >>>>> r0 = LOAD Y > > > > >>>>> > > > > >>>>> BUG_ON(!r0 && !r1) > > > > >>>>> > > > > >>>>> So in the synchronize_srcu implementation, there appears to be two > > > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > > > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with > > > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > > > > >>>>> load means that it does not order prior memory accesses before that load. > > > > >>>>> This sequence value is then used to identify which gp_seq to wait for when > > > > >>>>> piggy-backing on another already-started gp. 
I worry about reordering > > > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > > > > >>>>> piggy-back on an already-started gp. > > > > >>>>> > > > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the > > > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes > > > > >>>>> all this behave as expected. But without documentation it's rather hard to > > > > >>>>> follow. > > > > >>>> > > > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten > > > > >>>> purpose. In any case, we really want to add a comment above that > > > > >>>> __srcu_read_lock_nmisafe() call. > > > > >>> > > > > >>> Another test for the safety (or not) of removing either D or E is > > > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the > > > > >>> adjacent scans. > > > > >> > > > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. > > > > >> > > > > >> So that (flipping MBs) is unrelated, or did I miss something? > > > > > > > > > > The thought is to manually similate in the source code the maximum > > > > > memory-reference reordering that a maximally hostile compiler and CPU > > > > > would be permitted to carry out. So yes, given that there are other > > > > > memory barriers before and after, these other memory barriers limit how > > > > > far the flip may be moved in the source code. > > > > > > > > > > Here I am talking about the memory barriers associated with the flip, > > > > > but the same trick can of course be applied to other memory barriers. > > > > > In general, remove a given memory barrier and (in the source code) > > > > > maximally rearrange the memory references that were previously ordered > > > > > by the memory barrier in question. > > > > > > > > > > Again, the presence of other memory barriers will limit the permitted > > > > > maximal source-code rearrangement. > > > > > > > > > > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers. > > > > > > First, are we talking about the same barrier? I am talking about E. Apologies. I am a bit fixated on E because it is the one you guys proposed eliminating. ;-) > > No, in the last part you replied to above, Mathieu and Frederic were > > talking about the need for memory barriers in synchronize_srcu(). That > > has nothing to do with E: > > <quote> > > I suspect that the implicit barrier in srcu_read_lock() invoked at the > > beginning of srcu_gp_start_if_needed() is really the barrier that makes > > all this behave as expected. > > </quote> > > > > We need to order code prior to synchronize_srcu() wrt the start of the > > grace period, so that readers that started after the grace period > > started see those side-effects as they may not be waited on (they are > > too late). > > My thought is this is achieved by the srcu_read_lock() before the > grace period is started, and the start of the grace period (which is > basically the smp_mb() in the first scan). 
> > So from memory ordering PoV, if synchronize_rcu() spans the GP, and > readers don't span the GP, that means the reader does not span the > synchronize_rcu() which is the GP guarantee. > > But I could be missing something. I will dig more on my side. Thanks. Could you please specifically identify that srcu_read_lock()? Also which version you are looking at. ;-) Thanx, Paul
On Fri, Dec 23, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Dec 23, 2022 at 11:12:06AM -0500, Joel Fernandes wrote: > > On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > [...] > > > > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > >>> > > > > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > > > > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > > > > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote: > > > > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > > > > > >>>>> [...] > > > > > >>>>>>> > > > > > >>>>>>> The memory ordering constraint I am concerned about here is: > > > > > >>>>>>> > > > > > >>>>>>> * [...] In addition, > > > > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond > > > > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a > > > > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before > > > > > >>>>>>> * the beginning of that SRCU read-side critical section. [...] > > > > > >>>>>>> > > > > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning > > > > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the > > > > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that > > > > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the > > > > > >>>>>>> very beginning of the grace period. > > > > > >>>>>> > > > > > >>>>>> I'm confused, what's wrong with this ? > > > > > >>>>>> > > > > > >>>>>> UPDATER READER > > > > > >>>>>> ------- ------ > > > > > >>>>>> STORE X = 1 STORE srcu_read_lock++ > > > > > >>>>>> // rcu_seq_snap() smp_mb() > > > > > >>>>>> smp_mb() READ X > > > > > >>>>>> // scans > > > > > >>>>>> READ srcu_read_lock > > > > > >>>>> > > > > > >>>>> What you refer to here is only memory ordering of the store to X and load > > > > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the > > > > > >>>>> srcu implementation. If we really want to model the provided high-level > > > > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used > > > > > >>>>> for its memory ordering properties to synchronize other variables. > > > > > >>>>> > > > > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu() > > > > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers: > > > > > >>>>> > > > > > >>>>> Initial state: X = 0, Y = 0 > > > > > >>>>> > > > > > >>>>> Thread A Thread B > > > > > >>>>> --------------------------------------------- > > > > > >>>>> STORE X = 1 STORE Y = 1 > > > > > >>>>> synchronize_srcu() > > > > > >>>>> srcu_read_lock() > > > > > >>>>> r1 = LOAD X > > > > > >>>>> srcu_read_unlock() > > > > > >>>>> r0 = LOAD Y > > > > > >>>>> > > > > > >>>>> BUG_ON(!r0 && !r1) > > > > > >>>>> > > > > > >>>>> So in the synchronize_srcu implementation, there appears to be two > > > > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > > > > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with > > > > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > > > > > >>>>> load means that it does not order prior memory accesses before that load. 
> > > > > >>>>> This sequence value is then used to identify which gp_seq to wait for when > > > > > >>>>> piggy-backing on another already-started gp. I worry about reordering > > > > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > > > > > >>>>> piggy-back on an already-started gp. > > > > > >>>>> > > > > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the > > > > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes > > > > > >>>>> all this behave as expected. But without documentation it's rather hard to > > > > > >>>>> follow. > > > > > >>>> > > > > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten > > > > > >>>> purpose. In any case, we really want to add a comment above that > > > > > >>>> __srcu_read_lock_nmisafe() call. > > > > > >>> > > > > > >>> Another test for the safety (or not) of removing either D or E is > > > > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the > > > > > >>> adjacent scans. > > > > > >> > > > > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. > > > > > >> > > > > > >> So that (flipping MBs) is unrelated, or did I miss something? > > > > > > > > > > > > The thought is to manually similate in the source code the maximum > > > > > > memory-reference reordering that a maximally hostile compiler and CPU > > > > > > would be permitted to carry out. So yes, given that there are other > > > > > > memory barriers before and after, these other memory barriers limit how > > > > > > far the flip may be moved in the source code. > > > > > > > > > > > > Here I am talking about the memory barriers associated with the flip, > > > > > > but the same trick can of course be applied to other memory barriers. > > > > > > In general, remove a given memory barrier and (in the source code) > > > > > > maximally rearrange the memory references that were previously ordered > > > > > > by the memory barrier in question. > > > > > > > > > > > > Again, the presence of other memory barriers will limit the permitted > > > > > > maximal source-code rearrangement. > > > > > > > > > > > > > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers. > > > > > > > > First, are we talking about the same barrier? I am talking about E. > > Apologies. I am a bit fixated on E because it is the one you guys > proposed eliminating. ;-) Ah ok, no worries. :-) > > > No, in the last part you replied to above, Mathieu and Frederic were > > > talking about the need for memory barriers in synchronize_srcu(). That > > > has nothing to do with E: > > > <quote> > > > I suspect that the implicit barrier in srcu_read_lock() invoked at the > > > beginning of srcu_gp_start_if_needed() is really the barrier that makes > > > all this behave as expected. > > > </quote> > > > > > > We need to order code prior to synchronize_srcu() wrt the start of the > > > grace period, so that readers that started after the grace period > > > started see those side-effects as they may not be waited on (they are > > > too late). 
> > > > My thought is this is achieved by the srcu_read_lock() before the > > grace period is started, and the start of the grace period (which is > > basically the smp_mb() in the first scan). > > > > So from memory ordering PoV, if synchronize_rcu() spans the GP, and > > readers don't span the GP, that means the reader does not span the > > synchronize_rcu() which is the GP guarantee. > > > > But I could be missing something. I will dig more on my side. Thanks. > > Could you please specifically identify that srcu_read_lock()? > > Also which version you are looking at. ;-) Should be this one in current -rcu: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/kernel/rcu/srcutree.c#n1669 Thanks, - Joel
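The full Dekker scenario quoted above (synchronize_srcu() on one side, an SRCU read-side critical section on the other) can likewise be expressed against the LKMM, which models srcu_read_lock(), srcu_read_unlock() and synchronize_srcu(). Again this is a sketch with made-up test and register names, and it exercises the documented grace-period guarantee as encoded in the model, not any particular barrier placement inside srcutree.c.

C SB+sync-srcu+srcu-rscs

(*
 * Sketch of the Dekker scenario from this thread: P0 stores to X and
 * calls synchronize_srcu() before loading Y; P1 stores to Y and then
 * loads X inside an SRCU read-side critical section.  Can both loads
 * return zero?  Expected result, per the documented SRCU grace-period
 * guarantee: Never.
 *)

{}

P0(int *x, int *y, struct srcu_struct *s)
{
        int r0;

        WRITE_ONCE(*x, 1);
        synchronize_srcu(s);
        r0 = READ_ONCE(*y);
}

P1(int *x, int *y, struct srcu_struct *s)
{
        int r1;
        int r2;

        WRITE_ONCE(*y, 1);
        r2 = srcu_read_lock(s);
        r1 = READ_ONCE(*x);
        srcu_read_unlock(s, r2);
}

exists (0:r0=0 /\ 1:r1=0)

The expected herd7 result is Never, matching the BUG_ON() in the quoted pseudo-code: either the grace period waits on the critical section, in which case the reader's earlier store to Y is visible to the load of Y after synchronize_srcu() returns, or the critical section begins too late to be waited on, in which case the documented guarantee (a full barrier on the reader's CPU after the beginning of synchronize_srcu() and before that critical section) is what makes the store to X visible to the load of X -- exactly the property whose implementation the thread is pinning down.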
On Fri, Dec 23, 2022 at 03:10:40PM -0500, Joel Fernandes wrote: > On Fri, Dec 23, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Dec 23, 2022 at 11:12:06AM -0500, Joel Fernandes wrote: > > > On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > [...] > > > > > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > >>> > > > > > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote: > > > > > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote: > > > > > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote: > > > > > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote: > > > > > > >>>>> [...] > > > > > > >>>>>>> > > > > > > >>>>>>> The memory ordering constraint I am concerned about here is: > > > > > > >>>>>>> > > > > > > >>>>>>> * [...] In addition, > > > > > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond > > > > > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a > > > > > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before > > > > > > >>>>>>> * the beginning of that SRCU read-side critical section. [...] > > > > > > >>>>>>> > > > > > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning > > > > > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the > > > > > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that > > > > > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the > > > > > > >>>>>>> very beginning of the grace period. > > > > > > >>>>>> > > > > > > >>>>>> I'm confused, what's wrong with this ? > > > > > > >>>>>> > > > > > > >>>>>> UPDATER READER > > > > > > >>>>>> ------- ------ > > > > > > >>>>>> STORE X = 1 STORE srcu_read_lock++ > > > > > > >>>>>> // rcu_seq_snap() smp_mb() > > > > > > >>>>>> smp_mb() READ X > > > > > > >>>>>> // scans > > > > > > >>>>>> READ srcu_read_lock > > > > > > >>>>> > > > > > > >>>>> What you refer to here is only memory ordering of the store to X and load > > > > > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the > > > > > > >>>>> srcu implementation. If we really want to model the provided high-level > > > > > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used > > > > > > >>>>> for its memory ordering properties to synchronize other variables. > > > > > > >>>>> > > > > > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu() > > > > > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers: > > > > > > >>>>> > > > > > > >>>>> Initial state: X = 0, Y = 0 > > > > > > >>>>> > > > > > > >>>>> Thread A Thread B > > > > > > >>>>> --------------------------------------------- > > > > > > >>>>> STORE X = 1 STORE Y = 1 > > > > > > >>>>> synchronize_srcu() > > > > > > >>>>> srcu_read_lock() > > > > > > >>>>> r1 = LOAD X > > > > > > >>>>> srcu_read_unlock() > > > > > > >>>>> r0 = LOAD Y > > > > > > >>>>> > > > > > > >>>>> BUG_ON(!r0 && !r1) > > > > > > >>>>> > > > > > > >>>>> So in the synchronize_srcu implementation, there appears to be two > > > > > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp, > > > > > > >>>>> or it uses an already started gp/expedited gp. 
When snapshotting with > > > > > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq > > > > > > >>>>> load means that it does not order prior memory accesses before that load. > > > > > > >>>>> This sequence value is then used to identify which gp_seq to wait for when > > > > > > >>>>> piggy-backing on another already-started gp. I worry about reordering > > > > > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to > > > > > > >>>>> piggy-back on an already-started gp. > > > > > > >>>>> > > > > > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the > > > > > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes > > > > > > >>>>> all this behave as expected. But without documentation it's rather hard to > > > > > > >>>>> follow. > > > > > > >>>> > > > > > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten > > > > > > >>>> purpose. In any case, we really want to add a comment above that > > > > > > >>>> __srcu_read_lock_nmisafe() call. > > > > > > >>> > > > > > > >>> Another test for the safety (or not) of removing either D or E is > > > > > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the > > > > > > >>> adjacent scans. > > > > > > >> > > > > > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end. > > > > > > >> > > > > > > >> So that (flipping MBs) is unrelated, or did I miss something? > > > > > > > > > > > > > > The thought is to manually similate in the source code the maximum > > > > > > > memory-reference reordering that a maximally hostile compiler and CPU > > > > > > > would be permitted to carry out. So yes, given that there are other > > > > > > > memory barriers before and after, these other memory barriers limit how > > > > > > > far the flip may be moved in the source code. > > > > > > > > > > > > > > Here I am talking about the memory barriers associated with the flip, > > > > > > > but the same trick can of course be applied to other memory barriers. > > > > > > > In general, remove a given memory barrier and (in the source code) > > > > > > > maximally rearrange the memory references that were previously ordered > > > > > > > by the memory barrier in question. > > > > > > > > > > > > > > Again, the presence of other memory barriers will limit the permitted > > > > > > > maximal source-code rearrangement. > > > > > > > > > > > > > > > > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers. > > > > > > > > > > First, are we talking about the same barrier? I am talking about E. > > > > Apologies. I am a bit fixated on E because it is the one you guys > > proposed eliminating. ;-) > > Ah ok, no worries. :-) > > > > > No, in the last part you replied to above, Mathieu and Frederic were > > > > talking about the need for memory barriers in synchronize_srcu(). That > > > > has nothing to do with E: > > > > <quote> > > > > I suspect that the implicit barrier in srcu_read_lock() invoked at the > > > > beginning of srcu_gp_start_if_needed() is really the barrier that makes > > > > all this behave as expected. 
> > > > </quote> > > > > > > > > We need to order code prior to synchronize_srcu() wrt the start of the > > > > grace period, so that readers that started after the grace period > > > > started see those side-effects as they may not be waited on (they are > > > > too late). > > > > > > My thought is this is achieved by the srcu_read_lock() before the > > > grace period is started, and the start of the grace period (which is > > > basically the smp_mb() in the first scan). > > > > > > So from memory ordering PoV, if synchronize_rcu() spans the GP, and > > > readers don't span the GP, that means the reader does not span the > > > synchronize_rcu() which is the GP guarantee. > > > > > > But I could be missing something. I will dig more on my side. Thanks. > > > > Could you please specifically identify that srcu_read_lock()? > > > > Also which version you are looking at. ;-) > > Should be this one in current -rcu: > https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/kernel/rcu/srcutree.c#n1669 That gets me this line of code: int ss_state = READ_ONCE(ssp->srcu_size_state); Which is in srcu_torture_stats_print(), so I am guessing that something got lost in translation. ;-) Thanx, Paul