mbox series

[RFC,0/2] srcu: Remove pre-flip memory barrier

Message ID 20221218191310.130904-1-joel@joelfernandes.org (mailing list archive)
Headers show
Series srcu: Remove pre-flip memory barrier | expand

Message

Joel Fernandes Dec. 18, 2022, 7:13 p.m. UTC
Hello, I believe the pre-flip memory barrier is not required. The only reason I
can say to remove it, other than the possibility that it is unnecessary, is to
not have extra code that does not help. However, since we are issuing a fully
memory-barrier after the flip, I cannot say that it hurts to do it anyway.

For this reason, please consider these patches as "informational", than a
"please merge". :-) Though, feel free to consider merging if you agree!

All SRCU scenarios pass with these, with 6 hours of testing.

thanks,

 - Joel

Joel Fernandes (Google) (2):
srcu: Remove comment about prior read lock counts
srcu: Remove memory barrier "E" as it is not required

kernel/rcu/srcutree.c | 10 ----------
1 file changed, 10 deletions(-)

--
2.39.0.314.g84b9a713c41-goog

Comments

Mathieu Desnoyers Dec. 18, 2022, 8:57 p.m. UTC | #1
On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> can say to remove it, other than the possibility that it is unnecessary, is to
> not have extra code that does not help. However, since we are issuing a fully
> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> 
> For this reason, please consider these patches as "informational", than a
> "please merge". :-) Though, feel free to consider merging if you agree!
> 
> All SRCU scenarios pass with these, with 6 hours of testing.

Hi Joel,

Please have a look at the comments in my side-rcu implementation [1, 2]. 
It is similar to what SRCU does (per-cpu counter based grace period 
tracking), but implemented for userspace. The comments explain why this 
works without the memory barrier you identify as useless in SRCU.

Following my implementation of side-rcu, I reviewed the SRCU comments 
and identified that the barrier "/* E */" appears to be useless. I even 
discussed this privately with Paul E. McKenney.

My implementation and comments go further though, and skip the period 
"flip" entirely if the first pass observes that all readers (in both 
periods) are quiescent.

The most relevant comment in side-rcu is:

  * The grace period completes when it observes that there are no active
  * readers within each of the periods.
  *
  * The active_readers state is initially true for each period, until the
  * grace period observes that no readers are present for each given
  * period, at which point the active_readers state becomes false.

So I agree with the clarifications you propose here, but I think we can 
improve the grace period implementation further by clarifying the SRCU 
grace period model.

Thanks,

Mathieu


[1] https://github.com/efficios/libside/blob/master/src/rcu.h
[2] https://github.com/efficios/libside/blob/master/src/rcu.c

> 
> thanks,
> 
>   - Joel
> 
> Joel Fernandes (Google) (2):
> srcu: Remove comment about prior read lock counts
> srcu: Remove memory barrier "E" as it is not required
> 
> kernel/rcu/srcutree.c | 10 ----------
> 1 file changed, 10 deletions(-)
> 
> --
> 2.39.0.314.g84b9a713c41-goog
>
Joel Fernandes Dec. 18, 2022, 9:30 p.m. UTC | #2
Hi Mathieu,

On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> > Hello, I believe the pre-flip memory barrier is not required. The only reason I
> > can say to remove it, other than the possibility that it is unnecessary, is to
> > not have extra code that does not help. However, since we are issuing a fully
> > memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> >
> > For this reason, please consider these patches as "informational", than a
> > "please merge". :-) Though, feel free to consider merging if you agree!
> >
> > All SRCU scenarios pass with these, with 6 hours of testing.
>
> Hi Joel,
>
> Please have a look at the comments in my side-rcu implementation [1, 2].
> It is similar to what SRCU does (per-cpu counter based grace period
> tracking), but implemented for userspace. The comments explain why this
> works without the memory barrier you identify as useless in SRCU.
>
> Following my implementation of side-rcu, I reviewed the SRCU comments
> and identified that the barrier "/* E */" appears to be useless. I even
> discussed this privately with Paul E. McKenney.
>
> My implementation and comments go further though, and skip the period
> "flip" entirely if the first pass observes that all readers (in both
> periods) are quiescent.

Actually in SRCU, the first pass scans only 1 index, then does the
flip, and the second pass scans the second index. Without doing a
flip, an index cannot be scanned for forward progress reasons because
it is still "active". So I am curious how you can skip flip and still
scan both indexes? I will dig more into your implementation to learn more.

> The most relevant comment in side-rcu is:
>
>   * The grace period completes when it observes that there are no active
>   * readers within each of the periods.
>   *
>   * The active_readers state is initially true for each period, until the
>   * grace period observes that no readers are present for each given
>   * period, at which point the active_readers state becomes false.
>
> So I agree with the clarifications you propose here, but I think we can
> improve the grace period implementation further by clarifying the SRCU
> grace period model.

Thanks a lot, I am curious how you do the "detection of no new
readers" part without globally doing some kind of synchronization. I
will dig more into your implementation to learn more.

Thanks,

 - Joel



>
> Thanks,
>
> Mathieu
>
>
> [1] https://github.com/efficios/libside/blob/master/src/rcu.h
> [2] https://github.com/efficios/libside/blob/master/src/rcu.c
>
> >
> > thanks,
> >
> >   - Joel
> >
> > Joel Fernandes (Google) (2):
> > srcu: Remove comment about prior read lock counts
> > srcu: Remove memory barrier "E" as it is not required
> >
> > kernel/rcu/srcutree.c | 10 ----------
> > 1 file changed, 10 deletions(-)
> >
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Paul E. McKenney Dec. 18, 2022, 11:26 p.m. UTC | #3
On Sun, Dec 18, 2022 at 04:30:33PM -0500, Joel Fernandes wrote:
> Hi Mathieu,
> 
> On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> > > Hello, I believe the pre-flip memory barrier is not required. The only reason I
> > > can say to remove it, other than the possibility that it is unnecessary, is to
> > > not have extra code that does not help. However, since we are issuing a fully
> > > memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> > >
> > > For this reason, please consider these patches as "informational", than a
> > > "please merge". :-) Though, feel free to consider merging if you agree!
> > >
> > > All SRCU scenarios pass with these, with 6 hours of testing.
> >
> > Hi Joel,
> >
> > Please have a look at the comments in my side-rcu implementation [1, 2].
> > It is similar to what SRCU does (per-cpu counter based grace period
> > tracking), but implemented for userspace. The comments explain why this
> > works without the memory barrier you identify as useless in SRCU.
> >
> > Following my implementation of side-rcu, I reviewed the SRCU comments
> > and identified that the barrier "/* E */" appears to be useless. I even
> > discussed this privately with Paul E. McKenney.
> >
> > My implementation and comments go further though, and skip the period
> > "flip" entirely if the first pass observes that all readers (in both
> > periods) are quiescent.
> 
> Actually in SRCU, the first pass scans only 1 index, then does the
> flip, and the second pass scans the second index. Without doing a
> flip, an index cannot be scanned for forward progress reasons because
> it is still "active". So I am curious how you can skip flip and still
> scan both indexes? I will dig more into your implementation to learn more.
> 
> > The most relevant comment in side-rcu is:
> >
> >   * The grace period completes when it observes that there are no active
> >   * readers within each of the periods.
> >   *
> >   * The active_readers state is initially true for each period, until the
> >   * grace period observes that no readers are present for each given
> >   * period, at which point the active_readers state becomes false.
> >
> > So I agree with the clarifications you propose here, but I think we can
> > improve the grace period implementation further by clarifying the SRCU
> > grace period model.
> 
> Thanks a lot, I am curious how you do the "detection of no new
> readers" part without globally doing some kind of synchronization. I
> will dig more into your implementation to learn more.

It is very good to see the interest in SRCU internals!

Just out of an abundance of caution, I restate the requirements from
the synchronize_srcu() header comment:

 * There are memory-ordering constraints implied by synchronize_srcu().
 * On systems with more than one CPU, when synchronize_srcu() returns,
 * each CPU is guaranteed to have executed a full memory barrier since
 * the end of its last corresponding SRCU read-side critical section
 * whose beginning preceded the call to synchronize_srcu().  In addition,
 * each CPU having an SRCU read-side critical section that extends beyond
 * the return from synchronize_srcu() is guaranteed to have executed a
 * full memory barrier after the beginning of synchronize_srcu() and before
 * the beginning of that SRCU read-side critical section.  Note that these
 * guarantees include CPUs that are offline, idle, or executing in user mode,
 * as well as CPUs that are executing in the kernel.
 *
 * Furthermore, if CPU A invoked synchronize_srcu(), which returned
 * to its caller on CPU B, then both CPU A and CPU B are guaranteed
 * to have executed a full memory barrier during the execution of
 * synchronize_srcu().  This guarantee applies even if CPU A and CPU B
 * are the same CPU, but again only if the system has more than one CPU.
 *
 * Of course, these memory-ordering guarantees apply only when
 * synchronize_srcu(), srcu_read_lock(), and srcu_read_unlock() are
 * passed the same srcu_struct structure.

And from the __call_srcu() header comment:

 * Note that all CPUs must agree that the grace period extended beyond
 * all pre-existing SRCU read-side critical section.  On systems with
 * more than one CPU, this means that when "func()" is invoked, each CPU
 * is guaranteed to have executed a full memory barrier since the end of
 * its last corresponding SRCU read-side critical section whose beginning
 * preceded the call to call_srcu().  It also means that each CPU executing
 * an SRCU read-side critical section that continues beyond the start of
 * "func()" must have executed a memory barrier after the call_srcu()
 * but before the beginning of that SRCU read-side critical section.
 * Note that these guarantees include CPUs that are offline, idle, or
 * executing in user mode, as well as CPUs that are executing in the kernel.
 *
 * Furthermore, if CPU A invoked call_srcu() and CPU B invoked the
 * resulting SRCU callback function "func()", then both CPU A and CPU
 * B are guaranteed to execute a full memory barrier during the time
 * interval between the call to call_srcu() and the invocation of "func()".
 * This guarantee applies even if CPU A and CPU B are the same CPU (but
 * again only if the system has more than one CPU).
 *
 * Of course, these guarantees apply only for invocations of call_srcu(),
 * srcu_read_lock(), and srcu_read_unlock() that are all passed the same
 * srcu_struct structure.

							Thanx, Paul
Mathieu Desnoyers Dec. 18, 2022, 11:38 p.m. UTC | #4
On 2022-12-18 16:30, Joel Fernandes wrote:
> Hi Mathieu,
> 
> On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
>>> can say to remove it, other than the possibility that it is unnecessary, is to
>>> not have extra code that does not help. However, since we are issuing a fully
>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
>>>
>>> For this reason, please consider these patches as "informational", than a
>>> "please merge". :-) Though, feel free to consider merging if you agree!
>>>
>>> All SRCU scenarios pass with these, with 6 hours of testing.
>>
>> Hi Joel,
>>
>> Please have a look at the comments in my side-rcu implementation [1, 2].
>> It is similar to what SRCU does (per-cpu counter based grace period
>> tracking), but implemented for userspace. The comments explain why this
>> works without the memory barrier you identify as useless in SRCU.
>>
>> Following my implementation of side-rcu, I reviewed the SRCU comments
>> and identified that the barrier "/* E */" appears to be useless. I even
>> discussed this privately with Paul E. McKenney.
>>
>> My implementation and comments go further though, and skip the period
>> "flip" entirely if the first pass observes that all readers (in both
>> periods) are quiescent.
> 
> Actually in SRCU, the first pass scans only 1 index, then does the
> flip, and the second pass scans the second index. Without doing a
> flip, an index cannot be scanned for forward progress reasons because
> it is still "active". So I am curious how you can skip flip and still
> scan both indexes? I will dig more into your implementation to learn more.

If we look at SRCU read-side:

int __srcu_read_lock(struct srcu_struct *ssp)
{
         int idx;

         idx = READ_ONCE(ssp->srcu_idx) & 0x1;
         this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
         smp_mb(); /* B */  /* Avoid leaking the critical section. */
         return idx;
}

If the thread is preempted for a long period of time between load of 
ssp->srcu_idx and increment of srcu_lock_count[idx], this means this
thread can appear as a "new reader" for the idx period at any arbitrary 
time in the future, independently of which period is the current one 
within a future grace period.

As a result, the grace period algorithm needs to inherently support the 
fact that a "new reader" can appear in any of the two periods, 
independently of the current period state.

As a result, this means that while within period "0", we _need_ to allow 
newly coming readers to appear as we scan period "0".

As a result, we can simply scan both periods 0/1 for reader quiescence, 
even while new readers appear within those periods.

As a result, flipping between periods 0/1 is just relevant for forward 
progress, not for correctness.

As a result, we can remove barrier /* E */.

Thoughts ?

Thanks,

Mathieu


> 
>> The most relevant comment in side-rcu is:
>>
>>    * The grace period completes when it observes that there are no active
>>    * readers within each of the periods.
>>    *
>>    * The active_readers state is initially true for each period, until the
>>    * grace period observes that no readers are present for each given
>>    * period, at which point the active_readers state becomes false.
>>
>> So I agree with the clarifications you propose here, but I think we can
>> improve the grace period implementation further by clarifying the SRCU
>> grace period model.
> 
> Thanks a lot, I am curious how you do the "detection of no new
> readers" part without globally doing some kind of synchronization. I
> will dig more into your implementation to learn more.
> 
> Thanks,
> 
>   - Joel
> 
> 
> 
>>
>> Thanks,
>>
>> Mathieu
>>
>>
>> [1] https://github.com/efficios/libside/blob/master/src/rcu.h
>> [2] https://github.com/efficios/libside/blob/master/src/rcu.c
>>
>>>
>>> thanks,
>>>
>>>    - Joel
>>>
>>> Joel Fernandes (Google) (2):
>>> srcu: Remove comment about prior read lock counts
>>> srcu: Remove memory barrier "E" as it is not required
>>>
>>> kernel/rcu/srcutree.c | 10 ----------
>>> 1 file changed, 10 deletions(-)
>>>
>>> --
>>> 2.39.0.314.g84b9a713c41-goog
>>>
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
Joel Fernandes Dec. 19, 2022, 12:04 a.m. UTC | #5
Hi Mathieu,

On Sun, Dec 18, 2022 at 6:38 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2022-12-18 16:30, Joel Fernandes wrote:
> > Hi Mathieu,
> >
> > On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> >>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> >>> can say to remove it, other than the possibility that it is unnecessary, is to
> >>> not have extra code that does not help. However, since we are issuing a fully
> >>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> >>>
> >>> For this reason, please consider these patches as "informational", than a
> >>> "please merge". :-) Though, feel free to consider merging if you agree!
> >>>
> >>> All SRCU scenarios pass with these, with 6 hours of testing.
> >>
> >> Hi Joel,
> >>
> >> Please have a look at the comments in my side-rcu implementation [1, 2].
> >> It is similar to what SRCU does (per-cpu counter based grace period
> >> tracking), but implemented for userspace. The comments explain why this
> >> works without the memory barrier you identify as useless in SRCU.
> >>
> >> Following my implementation of side-rcu, I reviewed the SRCU comments
> >> and identified that the barrier "/* E */" appears to be useless. I even
> >> discussed this privately with Paul E. McKenney.
> >>
> >> My implementation and comments go further though, and skip the period
> >> "flip" entirely if the first pass observes that all readers (in both
> >> periods) are quiescent.
> >
> > Actually in SRCU, the first pass scans only 1 index, then does the
> > flip, and the second pass scans the second index. Without doing a
> > flip, an index cannot be scanned for forward progress reasons because
> > it is still "active". So I am curious how you can skip flip and still
> > scan both indexes? I will dig more into your implementation to learn more.
>
> If we look at SRCU read-side:
>
> int __srcu_read_lock(struct srcu_struct *ssp)
> {
>          int idx;
>
>          idx = READ_ONCE(ssp->srcu_idx) & 0x1;
>          this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
>          smp_mb(); /* B */  /* Avoid leaking the critical section. */
>          return idx;
> }
>
> If the thread is preempted for a long period of time between load of
> ssp->srcu_idx and increment of srcu_lock_count[idx], this means this
> thread can appear as a "new reader" for the idx period at any arbitrary
> time in the future, independently of which period is the current one
> within a future grace period.
>
> As a result, the grace period algorithm needs to inherently support the
> fact that a "new reader" can appear in any of the two periods,
> independently of the current period state.
>
> As a result, this means that while within period "0", we _need_ to allow
> newly coming readers to appear as we scan period "0".

Sure, it already does handle it but that is I believe it is a corner
case, not the norm.

> As a result, we can simply scan both periods 0/1 for reader quiescence,
> even while new readers appear within those periods.

I think this is a bit dangerous. Yes there is the preemption thing you
mentioned above, but that is bounded since you can only have a fixed
number of tasks that underwent that preemption, and it is quite rare
in the sense, each reader should get preempted just after sampling idx
but not incrementing lock count.

However, if we scan while new readers appear (outside of the above
preemption problem), we can have counter wrap causing a false match
much quicker.
The scan loop is:
check_readers(idx) {
   count_all_unlocks(idx);
   smp_mb();
   count_all_locks(idx);
   bool done = (locks == unlocks)
   if (done) {
         // readers are done, end scan for this idx.
   } else {
         // try again later
   }
}

So if check_readers() got preempted just after the smp_mb(), then you
can have lots of tasks enter and exit the read-side critical section
and increment the locks count. Eventually locks == unlocks will
happen, and it is screwed. Sure this is also theoretical, but yeah
that issue can be made "worse" by scanning active readers
deliberately, especially when such readers can also nest arbitrarily.

> As a result, flipping between periods 0/1 is just relevant for forward
> progress, not for correctness.

Sure, agreed, forward progress.

>
> As a result, we can remove barrier /* E */.
>

IMO, the "E" is not needed at all even if you do the flip because of
the reasons in the patch series changelogs, so whether we do flip less
frequently, we still don't need "E" IMHO.

Thanks!

 - Joel


> Thoughts ?
>
> Thanks,
>
> Mathieu
>
>
> >
> >> The most relevant comment in side-rcu is:
> >>
> >>    * The grace period completes when it observes that there are no active
> >>    * readers within each of the periods.
> >>    *
> >>    * The active_readers state is initially true for each period, until the
> >>    * grace period observes that no readers are present for each given
> >>    * period, at which point the active_readers state becomes false.
> >>
> >> So I agree with the clarifications you propose here, but I think we can
> >> improve the grace period implementation further by clarifying the SRCU
> >> grace period model.
> >
> > Thanks a lot, I am curious how you do the "detection of no new
> > readers" part without globally doing some kind of synchronization. I
> > will dig more into your implementation to learn more.
> >
> > Thanks,
> >
> >   - Joel
> >
> >
> >
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >>
> >> [1] https://github.com/efficios/libside/blob/master/src/rcu.h
> >> [2] https://github.com/efficios/libside/blob/master/src/rcu.c
> >>
> >>>
> >>> thanks,
> >>>
> >>>    - Joel
> >>>
> >>> Joel Fernandes (Google) (2):
> >>> srcu: Remove comment about prior read lock counts
> >>> srcu: Remove memory barrier "E" as it is not required
> >>>
> >>> kernel/rcu/srcutree.c | 10 ----------
> >>> 1 file changed, 10 deletions(-)
> >>>
> >>> --
> >>> 2.39.0.314.g84b9a713c41-goog
> >>>
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> https://www.efficios.com
> >>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Joel Fernandes Dec. 19, 2022, 12:24 a.m. UTC | #6
On Sun, Dec 18, 2022 at 7:04 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Mathieu,
>
> On Sun, Dec 18, 2022 at 6:38 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > On 2022-12-18 16:30, Joel Fernandes wrote:
> > > Hi Mathieu,
> > >
> > > On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers
> > > <mathieu.desnoyers@efficios.com> wrote:
> > >>
> > >> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> > >>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> > >>> can say to remove it, other than the possibility that it is unnecessary, is to
> > >>> not have extra code that does not help. However, since we are issuing a fully
> > >>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> > >>>
> > >>> For this reason, please consider these patches as "informational", than a
> > >>> "please merge". :-) Though, feel free to consider merging if you agree!
> > >>>
> > >>> All SRCU scenarios pass with these, with 6 hours of testing.
> > >>
> > >> Hi Joel,
> > >>
> > >> Please have a look at the comments in my side-rcu implementation [1, 2].
> > >> It is similar to what SRCU does (per-cpu counter based grace period
> > >> tracking), but implemented for userspace. The comments explain why this
> > >> works without the memory barrier you identify as useless in SRCU.
> > >>
> > >> Following my implementation of side-rcu, I reviewed the SRCU comments
> > >> and identified that the barrier "/* E */" appears to be useless. I even
> > >> discussed this privately with Paul E. McKenney.
> > >>
> > >> My implementation and comments go further though, and skip the period
> > >> "flip" entirely if the first pass observes that all readers (in both
> > >> periods) are quiescent.
> > >
> > > Actually in SRCU, the first pass scans only 1 index, then does the
> > > flip, and the second pass scans the second index. Without doing a
> > > flip, an index cannot be scanned for forward progress reasons because
> > > it is still "active". So I am curious how you can skip flip and still
> > > scan both indexes? I will dig more into your implementation to learn more.
> >
> > If we look at SRCU read-side:
> >
> > int __srcu_read_lock(struct srcu_struct *ssp)
> > {
> >          int idx;
> >
> >          idx = READ_ONCE(ssp->srcu_idx) & 0x1;
> >          this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
> >          smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >          return idx;
> > }
> >
> > If the thread is preempted for a long period of time between load of
> > ssp->srcu_idx and increment of srcu_lock_count[idx], this means this
> > thread can appear as a "new reader" for the idx period at any arbitrary
> > time in the future, independently of which period is the current one
> > within a future grace period.
> >
> > As a result, the grace period algorithm needs to inherently support the
> > fact that a "new reader" can appear in any of the two periods,
> > independently of the current period state.
> >
> > As a result, this means that while within period "0", we _need_ to allow
> > newly coming readers to appear as we scan period "0".
>
> Sure, it already does handle it but that is I believe it is a corner
> case, not the norm.
>
> > As a result, we can simply scan both periods 0/1 for reader quiescence,
> > even while new readers appear within those periods.
>
> I think this is a bit dangerous. Yes there is the preemption thing you
> mentioned above, but that is bounded since you can only have a fixed
> number of tasks that underwent that preemption, and it is quite rare
> in the sense, each reader should get preempted just after sampling idx
> but not incrementing lock count.
>
> However, if we scan while new readers appear (outside of the above
> preemption problem), we can have counter wrap causing a false match
> much quicker.
> The scan loop is:
> check_readers(idx) {
>    count_all_unlocks(idx);
>    smp_mb();
>    count_all_locks(idx);
>    bool done = (locks == unlocks)
>    if (done) {
>          // readers are done, end scan for this idx.
>    } else {
>          // try again later
>    }
> }
>
> So if check_readers() got preempted just after the smp_mb(), then you
> can have lots of tasks enter and exit the read-side critical section
> and increment the locks count. Eventually locks == unlocks will
> happen, and it is screwed. Sure this is also theoretical, but yeah
> that issue can be made "worse" by scanning active readers
> deliberately, especially when such readers can also nest arbitrarily.
>
> > As a result, flipping between periods 0/1 is just relevant for forward
> > progress, not for correctness.
>
> Sure, agreed, forward progress.

Adding to the last statement "But also correctness as described above".

thanks,

 - Joel
Mathieu Desnoyers Dec. 19, 2022, 1:50 a.m. UTC | #7
On 2022-12-18 19:24, Joel Fernandes wrote:
> On Sun, Dec 18, 2022 at 7:04 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>
>> Hi Mathieu,
>>
>> On Sun, Dec 18, 2022 at 6:38 PM Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> On 2022-12-18 16:30, Joel Fernandes wrote:
>>>> Hi Mathieu,
>>>>
>>>> On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers
>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>
>>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
>>>>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
>>>>>> can say to remove it, other than the possibility that it is unnecessary, is to
>>>>>> not have extra code that does not help. However, since we are issuing a fully
>>>>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
>>>>>>
>>>>>> For this reason, please consider these patches as "informational", than a
>>>>>> "please merge". :-) Though, feel free to consider merging if you agree!
>>>>>>
>>>>>> All SRCU scenarios pass with these, with 6 hours of testing.
>>>>>
>>>>> Hi Joel,
>>>>>
>>>>> Please have a look at the comments in my side-rcu implementation [1, 2].
>>>>> It is similar to what SRCU does (per-cpu counter based grace period
>>>>> tracking), but implemented for userspace. The comments explain why this
>>>>> works without the memory barrier you identify as useless in SRCU.
>>>>>
>>>>> Following my implementation of side-rcu, I reviewed the SRCU comments
>>>>> and identified that the barrier "/* E */" appears to be useless. I even
>>>>> discussed this privately with Paul E. McKenney.
>>>>>
>>>>> My implementation and comments go further though, and skip the period
>>>>> "flip" entirely if the first pass observes that all readers (in both
>>>>> periods) are quiescent.
>>>>
>>>> Actually in SRCU, the first pass scans only 1 index, then does the
>>>> flip, and the second pass scans the second index. Without doing a
>>>> flip, an index cannot be scanned for forward progress reasons because
>>>> it is still "active". So I am curious how you can skip flip and still
>>>> scan both indexes? I will dig more into your implementation to learn more.
>>>
>>> If we look at SRCU read-side:
>>>
>>> int __srcu_read_lock(struct srcu_struct *ssp)
>>> {
>>>           int idx;
>>>
>>>           idx = READ_ONCE(ssp->srcu_idx) & 0x1;
>>>           this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
>>>           smp_mb(); /* B */  /* Avoid leaking the critical section. */
>>>           return idx;
>>> }
>>>
>>> If the thread is preempted for a long period of time between load of
>>> ssp->srcu_idx and increment of srcu_lock_count[idx], this means this
>>> thread can appear as a "new reader" for the idx period at any arbitrary
>>> time in the future, independently of which period is the current one
>>> within a future grace period.
>>>
>>> As a result, the grace period algorithm needs to inherently support the
>>> fact that a "new reader" can appear in any of the two periods,
>>> independently of the current period state.
>>>
>>> As a result, this means that while within period "0", we _need_ to allow
>>> newly coming readers to appear as we scan period "0".
>>
>> Sure, it already does handle it but that is I believe it is a corner
>> case, not the norm.
>>
>>> As a result, we can simply scan both periods 0/1 for reader quiescence,
>>> even while new readers appear within those periods.
>>
>> I think this is a bit dangerous. Yes there is the preemption thing you
>> mentioned above, but that is bounded since you can only have a fixed
>> number of tasks that underwent that preemption, and it is quite rare
>> in the sense, each reader should get preempted just after sampling idx
>> but not incrementing lock count.
>>
>> However, if we scan while new readers appear (outside of the above
>> preemption problem), we can have counter wrap causing a false match
>> much quicker.
>> The scan loop is:
>> check_readers(idx) {
>>     count_all_unlocks(idx);
>>     smp_mb();
>>     count_all_locks(idx);
>>     bool done = (locks == unlocks)
>>     if (done) {
>>           // readers are done, end scan for this idx.
>>     } else {
>>           // try again later
>>     }
>> }
>>
>> So if check_readers() got preempted just after the smp_mb(), then you
>> can have lots of tasks enter and exit the read-side critical section
>> and increment the locks count. Eventually locks == unlocks will
>> happen, and it is screwed. Sure this is also theoretical, but yeah
>> that issue can be made "worse" by scanning active readers
>> deliberately, especially when such readers can also nest arbitrarily.
>>
>>> As a result, flipping between periods 0/1 is just relevant for forward
>>> progress, not for correctness.
>>
>> Sure, agreed, forward progress.
> 
> Adding to the last statement "But also correctness as described above".

Exactly how many entry/exit of the read-side critical section while the 
grace period is preempted do you need to trigger this ?

On a 64-bit system, where 64-bit counters are used, AFAIU this need to 
be exactly 2^64 read-side critical sections.

There are other synchronization algorithms such as seqlocks which are 
quite happy with much less protection against overflow (using a 32-bit 
counter even on 64-bit architectures).

For practical purposes, I suspect this issue is really just theoretical.

Or am I missing your point ?

Thanks,

Mathieu


> 
> thanks,
> 
>   - Joel
Joel Fernandes Dec. 20, 2022, 12:55 a.m. UTC | #8
On Sun, Dec 18, 2022 at 8:49 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
[...]
> >>>>>
> >>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> >>>>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> >>>>>> can say to remove it, other than the possibility that it is unnecessary, is to
> >>>>>> not have extra code that does not help. However, since we are issuing a fully
> >>>>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> >>>>>>
> >>>>>> For this reason, please consider these patches as "informational", than a
> >>>>>> "please merge". :-) Though, feel free to consider merging if you agree!
> >>>>>>
> >>>>>> All SRCU scenarios pass with these, with 6 hours of testing.
> >>>>>
> >>>>> Hi Joel,
> >>>>>
> >>>>> Please have a look at the comments in my side-rcu implementation [1, 2].
> >>>>> It is similar to what SRCU does (per-cpu counter based grace period
> >>>>> tracking), but implemented for userspace. The comments explain why this
> >>>>> works without the memory barrier you identify as useless in SRCU.
> >>>>>
> >>>>> Following my implementation of side-rcu, I reviewed the SRCU comments
> >>>>> and identified that the barrier "/* E */" appears to be useless. I even
> >>>>> discussed this privately with Paul E. McKenney.
> >>>>>
> >>>>> My implementation and comments go further though, and skip the period
> >>>>> "flip" entirely if the first pass observes that all readers (in both
> >>>>> periods) are quiescent.
> >>>>
> >>>> Actually in SRCU, the first pass scans only 1 index, then does the
> >>>> flip, and the second pass scans the second index. Without doing a
> >>>> flip, an index cannot be scanned for forward progress reasons because
> >>>> it is still "active". So I am curious how you can skip flip and still
> >>>> scan both indexes? I will dig more into your implementation to learn more.
> >>>
> >>> If we look at SRCU read-side:
> >>>
> >>> int __srcu_read_lock(struct srcu_struct *ssp)
> >>> {
> >>>           int idx;
> >>>
> >>>           idx = READ_ONCE(ssp->srcu_idx) & 0x1;
> >>>           this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
> >>>           smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >>>           return idx;
> >>> }
> >>>
> >>> If the thread is preempted for a long period of time between load of
> >>> ssp->srcu_idx and increment of srcu_lock_count[idx], this means this
> >>> thread can appear as a "new reader" for the idx period at any arbitrary
> >>> time in the future, independently of which period is the current one
> >>> within a future grace period.
> >>>
> >>> As a result, the grace period algorithm needs to inherently support the
> >>> fact that a "new reader" can appear in any of the two periods,
> >>> independently of the current period state.
> >>>
> >>> As a result, this means that while within period "0", we _need_ to allow
> >>> newly coming readers to appear as we scan period "0".
> >>
> >> Sure, it already does handle it but that is I believe it is a corner
> >> case, not the norm.
> >>
> >>> As a result, we can simply scan both periods 0/1 for reader quiescence,
> >>> even while new readers appear within those periods.
> >>
> >> I think this is a bit dangerous. Yes there is the preemption thing you
> >> mentioned above, but that is bounded since you can only have a fixed
> >> number of tasks that underwent that preemption, and it is quite rare
> >> in the sense, each reader should get preempted just after sampling idx
> >> but not incrementing lock count.
> >>
> >> However, if we scan while new readers appear (outside of the above
> >> preemption problem), we can have counter wrap causing a false match
> >> much quicker.
> >> The scan loop is:
> >> check_readers(idx) {
> >>     count_all_unlocks(idx);
> >>     smp_mb();
> >>     count_all_locks(idx);
> >>     bool done = (locks == unlocks)
> >>     if (done) {
> >>           // readers are done, end scan for this idx.
> >>     } else {
> >>           // try again later
> >>     }
> >> }
> >>
> >> So if check_readers() got preempted just after the smp_mb(), then you
> >> can have lots of tasks enter and exit the read-side critical section
> >> and increment the locks count. Eventually locks == unlocks will
> >> happen, and it is screwed. Sure this is also theoretical, but yeah
> >> that issue can be made "worse" by scanning active readers
> >> deliberately, especially when such readers can also nest arbitrarily.
> >>
> >>> As a result, flipping between periods 0/1 is just relevant for forward
> >>> progress, not for correctness.
> >>
> >> Sure, agreed, forward progress.
> >
> > Adding to the last statement "But also correctness as described above".
>
> Exactly how many entry/exit of the read-side critical section while the
> grace period is preempted do you need to trigger this ?

It depends on how many readers are active during the preemption of the
scan code. Say the preemption happened after per-CPU unlock counts
were totalled. Then AFAICS, if there are N active readers which need
the grace period to wait, you need (2^sizeof(int) - N) number of
lock+unlock to happen.

> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
> be exactly 2^64 read-side critical sections.

Yes, but what about 32-bit systems?

> There are other synchronization algorithms such as seqlocks which are
> quite happy with much less protection against overflow (using a 32-bit
> counter even on 64-bit architectures).

The seqlock is an interesting point.

> For practical purposes, I suspect this issue is really just theoretical.

I have to ask, what is the benefit of avoiding a flip and scanning
active readers? Is the issue about grace period delay or performance?
If so, it might be worth prototyping that approach and measuring using
rcutorture/rcuscale. If there is significant benefit to current
approach, then IMO it is worth exploring.

> Or am I missing your point ?

No, I think you are not. Let me know if I missed something.

Thanks,

 - Joel


>
>
> >
> > thanks,
> >
> >   - Joel
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Joel Fernandes Dec. 20, 2022, 1:04 a.m. UTC | #9
On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sun, Dec 18, 2022 at 8:49 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> [...]
> > >>>>>
> > >>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> > >>>>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> > >>>>>> can say to remove it, other than the possibility that it is unnecessary, is to
> > >>>>>> not have extra code that does not help. However, since we are issuing a fully
> > >>>>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> > >>>>>>
> > >>>>>> For this reason, please consider these patches as "informational", than a
> > >>>>>> "please merge". :-) Though, feel free to consider merging if you agree!
> > >>>>>>
> > >>>>>> All SRCU scenarios pass with these, with 6 hours of testing.
> > >>>>>
> > >>>>> Hi Joel,
> > >>>>>
> > >>>>> Please have a look at the comments in my side-rcu implementation [1, 2].
> > >>>>> It is similar to what SRCU does (per-cpu counter based grace period
> > >>>>> tracking), but implemented for userspace. The comments explain why this
> > >>>>> works without the memory barrier you identify as useless in SRCU.
> > >>>>>
> > >>>>> Following my implementation of side-rcu, I reviewed the SRCU comments
> > >>>>> and identified that the barrier "/* E */" appears to be useless. I even
> > >>>>> discussed this privately with Paul E. McKenney.
> > >>>>>
> > >>>>> My implementation and comments go further though, and skip the period
> > >>>>> "flip" entirely if the first pass observes that all readers (in both
> > >>>>> periods) are quiescent.
> > >>>>
> > >>>> Actually in SRCU, the first pass scans only 1 index, then does the
> > >>>> flip, and the second pass scans the second index. Without doing a
> > >>>> flip, an index cannot be scanned for forward progress reasons because
> > >>>> it is still "active". So I am curious how you can skip flip and still
> > >>>> scan both indexes? I will dig more into your implementation to learn more.
> > >>>
> > >>> If we look at SRCU read-side:
> > >>>
> > >>> int __srcu_read_lock(struct srcu_struct *ssp)
> > >>> {
> > >>>           int idx;
> > >>>
> > >>>           idx = READ_ONCE(ssp->srcu_idx) & 0x1;
> > >>>           this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
> > >>>           smp_mb(); /* B */  /* Avoid leaking the critical section. */
> > >>>           return idx;
> > >>> }
> > >>>
> > >>> If the thread is preempted for a long period of time between load of
> > >>> ssp->srcu_idx and increment of srcu_lock_count[idx], this means this
> > >>> thread can appear as a "new reader" for the idx period at any arbitrary
> > >>> time in the future, independently of which period is the current one
> > >>> within a future grace period.
> > >>>
> > >>> As a result, the grace period algorithm needs to inherently support the
> > >>> fact that a "new reader" can appear in any of the two periods,
> > >>> independently of the current period state.
> > >>>
> > >>> As a result, this means that while within period "0", we _need_ to allow
> > >>> newly coming readers to appear as we scan period "0".
> > >>
> > >> Sure, it already does handle it but that is I believe it is a corner
> > >> case, not the norm.
> > >>
> > >>> As a result, we can simply scan both periods 0/1 for reader quiescence,
> > >>> even while new readers appear within those periods.
> > >>
> > >> I think this is a bit dangerous. Yes there is the preemption thing you
> > >> mentioned above, but that is bounded since you can only have a fixed
> > >> number of tasks that underwent that preemption, and it is quite rare
> > >> in the sense, each reader should get preempted just after sampling idx
> > >> but not incrementing lock count.
> > >>
> > >> However, if we scan while new readers appear (outside of the above
> > >> preemption problem), we can have counter wrap causing a false match
> > >> much quicker.
> > >> The scan loop is:
> > >> check_readers(idx) {
> > >>     count_all_unlocks(idx);
> > >>     smp_mb();
> > >>     count_all_locks(idx);
> > >>     bool done = (locks == unlocks)
> > >>     if (done) {
> > >>           // readers are done, end scan for this idx.
> > >>     } else {
> > >>           // try again later
> > >>     }
> > >> }
> > >>
> > >> So if check_readers() got preempted just after the smp_mb(), then you
> > >> can have lots of tasks enter and exit the read-side critical section
> > >> and increment the locks count. Eventually locks == unlocks will
> > >> happen, and it is screwed. Sure this is also theoretical, but yeah
> > >> that issue can be made "worse" by scanning active readers
> > >> deliberately, especially when such readers can also nest arbitrarily.
> > >>
> > >>> As a result, flipping between periods 0/1 is just relevant for forward
> > >>> progress, not for correctness.
> > >>
> > >> Sure, agreed, forward progress.
> > >
> > > Adding to the last statement "But also correctness as described above".
> >
> > Exactly how many entry/exit of the read-side critical section while the
> > grace period is preempted do you need to trigger this ?
>
> It depends on how many readers are active during the preemption of the
> scan code. Say the preemption happened after per-CPU unlock counts
> were totalled. Then AFAICS, if there are N active readers which need
> the grace period to wait, you need (2^sizeof(int) - N) number of
> lock+unlock to happen.

Sorry, here I meant (2^(sizeof(unsigned long) * 8) - N) which is 2^32
on 32-bit systems.

thanks,

 - Joel


> > On a 64-bit system, where 64-bit counters are used, AFAIU this need to
> > be exactly 2^64 read-side critical sections.
>
> Yes, but what about 32-bit systems?
>
> > There are other synchronization algorithms such as seqlocks which are
> > quite happy with much less protection against overflow (using a 32-bit
> > counter even on 64-bit architectures).
>
> The seqlock is an interesting point.
>
> > For practical purposes, I suspect this issue is really just theoretical.
>
> I have to ask, what is the benefit of avoiding a flip and scanning
> active readers? Is the issue about grace period delay or performance?
> If so, it might be worth prototyping that approach and measuring using
> rcutorture/rcuscale. If there is significant benefit to current
> approach, then IMO it is worth exploring.
>
> > Or am I missing your point ?
>
> No, I think you are not. Let me know if I missed something.
>
> Thanks,
>
>  - Joel
>
>
> >
> >
> > >
> > > thanks,
> > >
> > >   - Joel
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > https://www.efficios.com
> >
Joel Fernandes Dec. 20, 2022, 4:07 a.m. UTC | #10
On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> can say to remove it, other than the possibility that it is unnecessary, is to
> not have extra code that does not help. However, since we are issuing a fully
> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
>
> For this reason, please consider these patches as "informational", than a
> "please merge". :-) Though, feel free to consider merging if you agree!
>
> All SRCU scenarios pass with these, with 6 hours of testing.
>
> thanks,
>
>  - Joel
>
> Joel Fernandes (Google) (2):
> srcu: Remove comment about prior read lock counts
> srcu: Remove memory barrier "E" as it is not required

And litmus tests confirm that "E" does not really do what the comments
say, PTAL:
Test 1:
C mbe
(*
 * Result: sometimes
 * Does previous scan see old reader's lock count, if a new reader saw
the new srcu_idx?
 *)

{}

P0(int *lockcount, int *srcu_idx) // updater
{
        int r0;
        r0 = READ_ONCE(*lockcount);
        smp_mb();       // E
        WRITE_ONCE(*srcu_idx, 1);
}

P1(int *lockcount, int *srcu_idx) // reader
{
        int r0;
        WRITE_ONCE(*lockcount, 1); // previous reader
        smp_mb();       // B+C
        r0 = READ_ONCE(*srcu_idx); // new reader
}
exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *)

Test 2:
C mbe2

(*
 * Result: sometimes
 * If updater saw reader's lock count, was that reader using the old idx?
 *)

{}

P0(int *lockcount, int *srcu_idx) // updater
{
        int r0;
        r0 = READ_ONCE(*lockcount);
        smp_mb();       // E
        WRITE_ONCE(*srcu_idx, 1);
}

P1(int *lockcount, int *srcu_idx) // reader
{
        int r0;
        int r1;
        r1 = READ_ONCE(*srcu_idx); // previous reader
        WRITE_ONCE(*lockcount, 1); // previous reader
        smp_mb();       // B+C
        r0 = READ_ONCE(*srcu_idx); // new reader
}
exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *)

thanks,

 - Joel


>
> kernel/rcu/srcutree.c | 10 ----------
> 1 file changed, 10 deletions(-)
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
Frederic Weisbecker Dec. 20, 2022, 12:34 p.m. UTC | #11
On Mon, Dec 19, 2022 at 11:07:17PM -0500, Joel Fernandes wrote:
> On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > Hello, I believe the pre-flip memory barrier is not required. The only reason I
> > can say to remove it, other than the possibility that it is unnecessary, is to
> > not have extra code that does not help. However, since we are issuing a fully
> > memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> >
> > For this reason, please consider these patches as "informational", than a
> > "please merge". :-) Though, feel free to consider merging if you agree!
> >
> > All SRCU scenarios pass with these, with 6 hours of testing.
> >
> > thanks,
> >
> >  - Joel
> >
> > Joel Fernandes (Google) (2):
> > srcu: Remove comment about prior read lock counts
> > srcu: Remove memory barrier "E" as it is not required
> 
> And litmus tests confirm that "E" does not really do what the comments
> say, PTAL:
> Test 1:
> C mbe
> (*
>  * Result: sometimes
>  * Does previous scan see old reader's lock count, if a new reader saw
> the new srcu_idx?
>  *)
> 
> {}
> 
> P0(int *lockcount, int *srcu_idx) // updater
> {
>         int r0;
>         r0 = READ_ONCE(*lockcount);
>         smp_mb();       // E
>         WRITE_ONCE(*srcu_idx, 1);
> }
> 
> P1(int *lockcount, int *srcu_idx) // reader
> {
>         int r0;
>         WRITE_ONCE(*lockcount, 1); // previous reader
>         smp_mb();       // B+C
>         r0 = READ_ONCE(*srcu_idx); // new reader
> }
> exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *)
> 
> Test 2:
> C mbe2
> 
> (*
>  * Result: sometimes
>  * If updater saw reader's lock count, was that reader using the old idx?
>  *)
> 
> {}
> 
> P0(int *lockcount, int *srcu_idx) // updater
> {
>         int r0;
>         r0 = READ_ONCE(*lockcount);
>         smp_mb();       // E
>         WRITE_ONCE(*srcu_idx, 1);
> }
> 
> P1(int *lockcount, int *srcu_idx) // reader
> {
>         int r0;
>         int r1;
>         r1 = READ_ONCE(*srcu_idx); // previous reader
>         WRITE_ONCE(*lockcount, 1); // previous reader
>         smp_mb();       // B+C
>         r0 = READ_ONCE(*srcu_idx); // new reader
> }
> exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *)

Actually, starring at this some more, there is some form of dependency
on the idx in order to build the address where the reader must write the
lockcount to. Litmus doesn't support arrays but assuming that
&ssp->sda->srcu_lock_count == 0 (note the & in the beginning), it
could be modelized that way (I'm eluding the unlock part to simplify):

---
C w-depend-r

{
	PLOCK=LOCK0;
}

// updater
P0(int *LOCK0, int *LOCK1, int **PLOCK)
{
	int lock1;

	lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
	smp_mb();
	WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
}

// reader
P1(int **PLOCK)
{
	int *plock;

	plock = READ_ONCE(*PLOCK); 	// Read active idx
	WRITE_ONCE(*plock, 1); // Write to active idx
}

exists (0:lock0=1) // never happens
---

* Is it an address dependency? But the LKMM refers to that only in
  terms of LOAD - LOAD ordering.

* Is it a control dependency? But the LKMM refers to that only in
  terms of branch with "if".

So I don't know the name of the pattern but it makes litmus happy.

Thanks.
Frederic Weisbecker Dec. 20, 2022, 12:40 p.m. UTC | #12
On Tue, Dec 20, 2022 at 01:34:43PM +0100, Frederic Weisbecker wrote:
> On Mon, Dec 19, 2022 at 11:07:17PM -0500, Joel Fernandes wrote:
> > On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > Hello, I believe the pre-flip memory barrier is not required. The only reason I
> > > can say to remove it, other than the possibility that it is unnecessary, is to
> > > not have extra code that does not help. However, since we are issuing a fully
> > > memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> > >
> > > For this reason, please consider these patches as "informational", than a
> > > "please merge". :-) Though, feel free to consider merging if you agree!
> > >
> > > All SRCU scenarios pass with these, with 6 hours of testing.
> > >
> > > thanks,
> > >
> > >  - Joel
> > >
> > > Joel Fernandes (Google) (2):
> > > srcu: Remove comment about prior read lock counts
> > > srcu: Remove memory barrier "E" as it is not required
> > 
> > And litmus tests confirm that "E" does not really do what the comments
> > say, PTAL:
> > Test 1:
> > C mbe
> > (*
> >  * Result: sometimes
> >  * Does previous scan see old reader's lock count, if a new reader saw
> > the new srcu_idx?
> >  *)
> > 
> > {}
> > 
> > P0(int *lockcount, int *srcu_idx) // updater
> > {
> >         int r0;
> >         r0 = READ_ONCE(*lockcount);
> >         smp_mb();       // E
> >         WRITE_ONCE(*srcu_idx, 1);
> > }
> > 
> > P1(int *lockcount, int *srcu_idx) // reader
> > {
> >         int r0;
> >         WRITE_ONCE(*lockcount, 1); // previous reader
> >         smp_mb();       // B+C
> >         r0 = READ_ONCE(*srcu_idx); // new reader
> > }
> > exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *)
> > 
> > Test 2:
> > C mbe2
> > 
> > (*
> >  * Result: sometimes
> >  * If updater saw reader's lock count, was that reader using the old idx?
> >  *)
> > 
> > {}
> > 
> > P0(int *lockcount, int *srcu_idx) // updater
> > {
> >         int r0;
> >         r0 = READ_ONCE(*lockcount);
> >         smp_mb();       // E
> >         WRITE_ONCE(*srcu_idx, 1);
> > }
> > 
> > P1(int *lockcount, int *srcu_idx) // reader
> > {
> >         int r0;
> >         int r1;
> >         r1 = READ_ONCE(*srcu_idx); // previous reader
> >         WRITE_ONCE(*lockcount, 1); // previous reader
> >         smp_mb();       // B+C
> >         r0 = READ_ONCE(*srcu_idx); // new reader
> > }
> > exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *)
> 
> Actually, starring at this some more, there is some form of dependency
> on the idx in order to build the address where the reader must write the
> lockcount to. Litmus doesn't support arrays but assuming that
> &ssp->sda->srcu_lock_count == 0 (note the & in the beginning), it
> could be modelized that way (I'm eluding the unlock part to simplify):
> 
> ---
> C w-depend-r
> 
> {
> 	PLOCK=LOCK0;
> }
> 
> // updater
> P0(int *LOCK0, int *LOCK1, int **PLOCK)
> {
> 	int lock1;
> 
> 	lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
> 	smp_mb();
> 	WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
> }
> 
> // reader
> P1(int **PLOCK)
> {
> 	int *plock;
> 
> 	plock = READ_ONCE(*PLOCK); 	// Read active idx
> 	WRITE_ONCE(*plock, 1); // Write to active idx
> }
> 
> exists (0:lock0=1) // never happens

That's lock1=1, lemme do it again:

C w-depend-r

{
	PLOCK=LOCK0;
}

// updater
P0(int *LOCK1, int **PLOCK)
{
	int lock1;

	lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
	smp_mb();
	WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
}

// reader
P1(int **PLOCK)
{
	int *plock;

	plock = READ_ONCE(*PLOCK); 	// Read active idx
	WRITE_ONCE(*plock, 1); // Write to active idx
}

exists (0:lock1=1) (* never *)
Joel Fernandes Dec. 20, 2022, 1:44 p.m. UTC | #13
> On Dec 20, 2022, at 7:40 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Tue, Dec 20, 2022 at 01:34:43PM +0100, Frederic Weisbecker wrote:
>>> On Mon, Dec 19, 2022 at 11:07:17PM -0500, Joel Fernandes wrote:
>>> On Sun, Dec 18, 2022 at 2:13 PM Joel Fernandes (Google)
>>> <joel@joelfernandes.org> wrote:
>>>> 
>>>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
>>>> can say to remove it, other than the possibility that it is unnecessary, is to
>>>> not have extra code that does not help. However, since we are issuing a fully
>>>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
>>>> 
>>>> For this reason, please consider these patches as "informational", than a
>>>> "please merge". :-) Though, feel free to consider merging if you agree!
>>>> 
>>>> All SRCU scenarios pass with these, with 6 hours of testing.
>>>> 
>>>> thanks,
>>>> 
>>>> - Joel
>>>> 
>>>> Joel Fernandes (Google) (2):
>>>> srcu: Remove comment about prior read lock counts
>>>> srcu: Remove memory barrier "E" as it is not required
>>> 
>>> And litmus tests confirm that "E" does not really do what the comments
>>> say, PTAL:
>>> Test 1:
>>> C mbe
>>> (*
>>> * Result: sometimes
>>> * Does previous scan see old reader's lock count, if a new reader saw
>>> the new srcu_idx?
>>> *)
>>> 
>>> {}
>>> 
>>> P0(int *lockcount, int *srcu_idx) // updater
>>> {
>>>        int r0;
>>>        r0 = READ_ONCE(*lockcount);
>>>        smp_mb();       // E
>>>        WRITE_ONCE(*srcu_idx, 1);
>>> }
>>> 
>>> P1(int *lockcount, int *srcu_idx) // reader
>>> {
>>>        int r0;
>>>        WRITE_ONCE(*lockcount, 1); // previous reader
>>>        smp_mb();       // B+C
>>>        r0 = READ_ONCE(*srcu_idx); // new reader
>>> }
>>> exists (0:r0=0 /\ 1:r0=1) (* Bad outcome. *)
>>> 
>>> Test 2:
>>> C mbe2
>>> 
>>> (*
>>> * Result: sometimes
>>> * If updater saw reader's lock count, was that reader using the old idx?
>>> *)
>>> 
>>> {}
>>> 
>>> P0(int *lockcount, int *srcu_idx) // updater
>>> {
>>>        int r0;
>>>        r0 = READ_ONCE(*lockcount);
>>>        smp_mb();       // E
>>>        WRITE_ONCE(*srcu_idx, 1);
>>> }
>>> 
>>> P1(int *lockcount, int *srcu_idx) // reader
>>> {
>>>        int r0;
>>>        int r1;
>>>        r1 = READ_ONCE(*srcu_idx); // previous reader
>>>        WRITE_ONCE(*lockcount, 1); // previous reader
>>>        smp_mb();       // B+C
>>>        r0 = READ_ONCE(*srcu_idx); // new reader
>>> }
>>> exists (0:r0=1 /\ 1:r1=1) (* Bad outcome. *)
>> 
>> Actually, starring at this some more, there is some form of dependency
>> on the idx in order to build the address where the reader must write the
>> lockcount to. Litmus doesn't support arrays but assuming that
>> &ssp->sda->srcu_lock_count == 0 (note the & in the beginning), it
>> could be modelized that way (I'm eluding the unlock part to simplify):
>> 
>> ---
>> C w-depend-r
>> 
>> {
>>    PLOCK=LOCK0;
>> }
>> 
>> // updater
>> P0(int *LOCK0, int *LOCK1, int **PLOCK)
>> {
>>    int lock1;
>> 
>>    lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
>>    smp_mb();
>>    WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
>> }
>> 
>> // reader
>> P1(int **PLOCK)
>> {
>>    int *plock;
>> 
>>    plock = READ_ONCE(*PLOCK);    // Read active idx
>>    WRITE_ONCE(*plock, 1); // Write to active idx
>> }
>> 
>> exists (0:lock0=1) // never happens
> 
> That's lock1=1, lemme do it again:
> 
> C w-depend-r
> 
> {
>    PLOCK=LOCK0;
> }
> 
> // updater
> P0(int *LOCK1, int **PLOCK)
> {
>    int lock1;
> 
>    lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
>    smp_mb();
>    WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
> }
> 
> // reader
> P1(int **PLOCK)
> {
>    int *plock;
> 
>    plock = READ_ONCE(*PLOCK);    // Read active idx
>    WRITE_ONCE(*plock, 1); // Write to active idx

I am a bit lost here, why would the reader want to write to the active idx? The reader does not update the idx, only the lock count.

Further, the comment does not talk about implicit memory ordering, it’s talking about explicit ordering due to B+C on one side, and E on the other.

Thanks!

- Joel



> }
> 
> exists (0:lock1=1) (* never *)
Frederic Weisbecker Dec. 20, 2022, 2:07 p.m. UTC | #14
On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote:
> > C w-depend-r
> > 
> > {
> >    PLOCK=LOCK0;
> > }
> > 
> > // updater
> > P0(int *LOCK1, int **PLOCK)
> > {
> >    int lock1;
> > 
> >    lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
> >    smp_mb();
> >    WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
> > }
> > 
> > // reader
> > P1(int **PLOCK)
> > {
> >    int *plock;
> > 
> >    plock = READ_ONCE(*PLOCK);    // Read active idx
> >    WRITE_ONCE(*plock, 1); // Write to active idx
> 
> I am a bit lost here, why would the reader want to write to the active idx?
> The reader does not update the idx, only the lock count.

So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right?
The write is then displayed that way:

     this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);

But things could be also thought the other way around with idx being the base address and
ssp->sda->srcu_lock_count being the offset.

     this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter);

That would require to change some high level types but the result would be the same from
the memory point of view (and even from the ASM point of view). In the end we
are dealing with the same address and access.

Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change.
So it can be zero for example. Then the above increment becomes:

   this_cpu_inc(idx.counter);

And then it can be modelized as in the above litmus test.

I had to play that trick because litmus doesn't support arrays but I believe
it stands. Now of course I may well have got something wrong since I've always
been terrible at maths...

> 
> Further, the comment does not talk about implicit memory ordering, it’s talking about explicit ordering due to B+C on one side, and E on the other.

Not arguing I'm also still confused by the comment...

Thanks.
Joel Fernandes Dec. 20, 2022, 2:20 p.m. UTC | #15
> On Dec 20, 2022, at 9:07 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote:
>>> C w-depend-r
>>> 
>>> {
>>>   PLOCK=LOCK0;
>>> }
>>> 
>>> // updater
>>> P0(int *LOCK1, int **PLOCK)
>>> {
>>>   int lock1;
>>> 
>>>   lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
>>>   smp_mb();
>>>   WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
>>> }
>>> 
>>> // reader
>>> P1(int **PLOCK)
>>> {
>>>   int *plock;
>>> 
>>>   plock = READ_ONCE(*PLOCK);    // Read active idx
>>>   WRITE_ONCE(*plock, 1); // Write to active idx
>> 
>> I am a bit lost here, why would the reader want to write to the active idx?
>> The reader does not update the idx, only the lock count.
> 
> So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right?
> The write is then displayed that way:
> 
>     this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
> 
> But things could be also thought the other way around with idx being the base address and
> ssp->sda->srcu_lock_count being the offset.
> 
>     this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter);
> 
> That would require to change some high level types but the result would be the same from
> the memory point of view (and even from the ASM point of view). In the end we
> are dealing with the same address and access.
> 
> Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change.
> So it can be zero for example. Then the above increment becomes:
> 
>   this_cpu_inc(idx.counter);
> 
> And then it can be modelized as in the above litmus test.
> 
> I had to play that trick because litmus doesn't support arrays but I believe
> it stands. Now of course I may well have got something wrong since I've always
> been terrible at maths...

Ah ok, I get where you were going with that. Yes there is address dependency between reading idx and writing lock count. But IMHO, the access on the update side is trying to order write to index, and reads from a lock count of a previous index (as far as E / B+C is concerned). So IMHO, on the read side you have to consider 2 consecutive readers and not the same reader in order to pair the same accesses correctly. But I could be missing something.

>> Further, the comment does not talk about implicit memory ordering, it’s talking about explicit ordering due to B+C on one side, and E on the other.
> 
> Not arguing I'm also still confused by the comment...

;-)

Thanks,

 - Joel


> 
> Thanks.
Mathieu Desnoyers Dec. 20, 2022, 5 p.m. UTC | #16
On 2022-12-19 20:04, Joel Fernandes wrote:
> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>
>> On Sun, Dec 18, 2022 at 8:49 PM Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>> [...]
>>>>>>>>
>>>>>>>> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
[...]

>>>>>
>>>>> I think this is a bit dangerous. Yes there is the preemption thing you
>>>>> mentioned above, but that is bounded since you can only have a fixed
>>>>> number of tasks that underwent that preemption, and it is quite rare
>>>>> in the sense, each reader should get preempted just after sampling idx
>>>>> but not incrementing lock count.
>>>>>
>>>>> However, if we scan while new readers appear (outside of the above
>>>>> preemption problem), we can have counter wrap causing a false match
>>>>> much quicker.
>>>>> The scan loop is:
>>>>> check_readers(idx) {
>>>>>      count_all_unlocks(idx);
>>>>>      smp_mb();
>>>>>      count_all_locks(idx);
>>>>>      bool done = (locks == unlocks)
>>>>>      if (done) {
>>>>>            // readers are done, end scan for this idx.
>>>>>      } else {
>>>>>            // try again later
>>>>>      }
>>>>> }
>>>>>
>>>>> So if check_readers() got preempted just after the smp_mb(), then you
>>>>> can have lots of tasks enter and exit the read-side critical section
>>>>> and increment the locks count. Eventually locks == unlocks will
>>>>> happen, and it is screwed. Sure this is also theoretical, but yeah
>>>>> that issue can be made "worse" by scanning active readers
>>>>> deliberately, especially when such readers can also nest arbitrarily.
>>>>>
>>>>>> As a result, flipping between periods 0/1 is just relevant for forward
>>>>>> progress, not for correctness.
>>>>>
>>>>> Sure, agreed, forward progress.
>>>>
>>>> Adding to the last statement "But also correctness as described above".
>>>
>>> Exactly how many entry/exit of the read-side critical section while the
>>> grace period is preempted do you need to trigger this ?
>>
>> It depends on how many readers are active during the preemption of the
>> scan code. Say the preemption happened after per-CPU unlock counts
>> were totalled. Then AFAICS, if there are N active readers which need
>> the grace period to wait, you need (2^sizeof(int) - N) number of
>> lock+unlock to happen.
> 
> Sorry, here I meant (2^(sizeof(unsigned long) * 8) - N) which is 2^32
> on 32-bit systems.

And 2^64 on 64-bit systems.

> 
> thanks,
> 
>   - Joel
> 
> 
>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
>>> be exactly 2^64 read-side critical sections.
>>
>> Yes, but what about 32-bit systems?

The overflow indeed happens after 2^32 increments, just like seqlock.
The question we need to ask is therefore: if 2^32 is good enough for 
seqlock, why isn't it good enough for SRCU ?

>>
>>> There are other synchronization algorithms such as seqlocks which are
>>> quite happy with much less protection against overflow (using a 32-bit
>>> counter even on 64-bit architectures).
>>
>> The seqlock is an interesting point.
>>
>>> For practical purposes, I suspect this issue is really just theoretical.
>>
>> I have to ask, what is the benefit of avoiding a flip and scanning
>> active readers? Is the issue about grace period delay or performance?
>> If so, it might be worth prototyping that approach and measuring using
>> rcutorture/rcuscale. If there is significant benefit to current
>> approach, then IMO it is worth exploring.

The main benefit I expect is improved performance of the grace period 
implementation in common cases where there are few or no readers 
present, especially on machines with many cpus.

It allows scanning both periods (0/1) for each cpu within the same pass, 
therefore loading both period's unlock counters sitting in the same 
cache line at once (improved locality), and then loading both period's 
lock counters, also sitting in the same cache line.

It also allows skipping the period flip entirely if there are no readers 
present, which is an -arguably- tiny performance improvement as well.

Thanks,

Mathieu

>>
>>> Or am I missing your point ?
>>
>> No, I think you are not. Let me know if I missed something.
>>
>> Thanks,
>>
>>   - Joel
>>
>>
>>>
>>>
>>>>
>>>> thanks,
>>>>
>>>>    - Joel
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
Joel Fernandes Dec. 20, 2022, 6:05 p.m. UTC | #17
Hi Mathieu,

On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2022-12-19 20:04, Joel Fernandes wrote:
> > On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
[...]
> >>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
> >>> be exactly 2^64 read-side critical sections.
> >>
> >> Yes, but what about 32-bit systems?
>
> The overflow indeed happens after 2^32 increments, just like seqlock.
> The question we need to ask is therefore: if 2^32 is good enough for
> seqlock, why isn't it good enough for SRCU ?

I think Paul said wrap around does happen with SRCU on 32-bit but I'll
let him talk more about it. If 32-bit is good enough, let us also drop
the size of the counters for 64-bit then?

> >>> There are other synchronization algorithms such as seqlocks which are
> >>> quite happy with much less protection against overflow (using a 32-bit
> >>> counter even on 64-bit architectures).
> >>
> >> The seqlock is an interesting point.
> >>
> >>> For practical purposes, I suspect this issue is really just theoretical.
> >>
> >> I have to ask, what is the benefit of avoiding a flip and scanning
> >> active readers? Is the issue about grace period delay or performance?
> >> If so, it might be worth prototyping that approach and measuring using
> >> rcutorture/rcuscale. If there is significant benefit to current
> >> approach, then IMO it is worth exploring.
>
> The main benefit I expect is improved performance of the grace period
> implementation in common cases where there are few or no readers
> present, especially on machines with many cpus.
>
> It allows scanning both periods (0/1) for each cpu within the same pass,
> therefore loading both period's unlock counters sitting in the same
> cache line at once (improved locality), and then loading both period's
> lock counters, also sitting in the same cache line.
>
> It also allows skipping the period flip entirely if there are no readers
> present, which is an -arguably- tiny performance improvement as well.

The issue of counter wrap aside, what if a new reader always shows up
in the active index being scanned, then can you not delay the GP
indefinitely? It seems like writer-starvation is possible then (sure
it is possible also with preemption after reader-index-sampling, but
scanning active index deliberately will make that worse). Seqlock does
not have such writer starvation just because the writer does not care
about what the readers are doing.

That said, the approach of scanning both counters does seem attractive
for when there are no readers, for the reasons you mentioned. Maybe a
heuristic to count the number of readers might help? If we are not
reader-heavy, then scan both. Otherwise, just scan the inactive ones,
and also couple that heuristic with the number of CPUs. I am
interested in working on such a design with you! Let us do it and
prototype/measure. ;-)

Thanks.
Mathieu Desnoyers Dec. 20, 2022, 6:14 p.m. UTC | #18
On 2022-12-20 13:05, Joel Fernandes wrote:
> Hi Mathieu,
> 
> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2022-12-19 20:04, Joel Fernandes wrote:
>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> [...]
>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
>>>>> be exactly 2^64 read-side critical sections.
>>>>
>>>> Yes, but what about 32-bit systems?
>>
>> The overflow indeed happens after 2^32 increments, just like seqlock.
>> The question we need to ask is therefore: if 2^32 is good enough for
>> seqlock, why isn't it good enough for SRCU ?
> 
> I think Paul said wrap around does happen with SRCU on 32-bit but I'll
> let him talk more about it. If 32-bit is good enough, let us also drop
> the size of the counters for 64-bit then?
> 
>>>>> There are other synchronization algorithms such as seqlocks which are
>>>>> quite happy with much less protection against overflow (using a 32-bit
>>>>> counter even on 64-bit architectures).
>>>>
>>>> The seqlock is an interesting point.
>>>>
>>>>> For practical purposes, I suspect this issue is really just theoretical.
>>>>
>>>> I have to ask, what is the benefit of avoiding a flip and scanning
>>>> active readers? Is the issue about grace period delay or performance?
>>>> If so, it might be worth prototyping that approach and measuring using
>>>> rcutorture/rcuscale. If there is significant benefit to current
>>>> approach, then IMO it is worth exploring.
>>
>> The main benefit I expect is improved performance of the grace period
>> implementation in common cases where there are few or no readers
>> present, especially on machines with many cpus.
>>
>> It allows scanning both periods (0/1) for each cpu within the same pass,
>> therefore loading both period's unlock counters sitting in the same
>> cache line at once (improved locality), and then loading both period's
>> lock counters, also sitting in the same cache line.
>>
>> It also allows skipping the period flip entirely if there are no readers
>> present, which is an -arguably- tiny performance improvement as well.
> 
> The issue of counter wrap aside, what if a new reader always shows up
> in the active index being scanned, then can you not delay the GP
> indefinitely? It seems like writer-starvation is possible then (sure
> it is possible also with preemption after reader-index-sampling, but
> scanning active index deliberately will make that worse). Seqlock does
> not have such writer starvation just because the writer does not care
> about what the readers are doing.

No, it's not possible for "current index" readers to starve the g.p. 
with the side-rcu scheme, because the initial pass (sampling both 
periods) only opportunistically skips flipping the period if there 
happens to be no readers in both periods.

If there are readers in the "non-current" period, the grace period waits 
for them.

If there are readers in the "current" period, it flips the period and 
then waits for them.

> 
> That said, the approach of scanning both counters does seem attractive
> for when there are no readers, for the reasons you mentioned. Maybe a
> heuristic to count the number of readers might help? If we are not
> reader-heavy, then scan both. Otherwise, just scan the inactive ones,
> and also couple that heuristic with the number of CPUs. I am
> interested in working on such a design with you! Let us do it and
> prototype/measure. ;-)

Considering that it would add extra complexity, I'm unsure what that 
extra heuristic would improve over just scanning both periods in the 
first pass.

I'll be happy to work with you on such a design :) I think we can borrow 
quite a few concepts from side-rcu for this. Please be aware that my 
time is limited though, as I'm currently supposed to be on vacation. :)

Thanks,

Mathieu
Joel Fernandes Dec. 20, 2022, 6:29 p.m. UTC | #19
> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2022-12-20 13:05, Joel Fernandes wrote:
>> Hi Mathieu,
>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>> 
>>> On 2022-12-19 20:04, Joel Fernandes wrote:
>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>> [...]
>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
>>>>>> be exactly 2^64 read-side critical sections.
>>>>> 
>>>>> Yes, but what about 32-bit systems?
>>> 
>>> The overflow indeed happens after 2^32 increments, just like seqlock.
>>> The question we need to ask is therefore: if 2^32 is good enough for
>>> seqlock, why isn't it good enough for SRCU ?
>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll
>> let him talk more about it. If 32-bit is good enough, let us also drop
>> the size of the counters for 64-bit then?
>>>>>> There are other synchronization algorithms such as seqlocks which are
>>>>>> quite happy with much less protection against overflow (using a 32-bit
>>>>>> counter even on 64-bit architectures).
>>>>> 
>>>>> The seqlock is an interesting point.
>>>>> 
>>>>>> For practical purposes, I suspect this issue is really just theoretical.
>>>>> 
>>>>> I have to ask, what is the benefit of avoiding a flip and scanning
>>>>> active readers? Is the issue about grace period delay or performance?
>>>>> If so, it might be worth prototyping that approach and measuring using
>>>>> rcutorture/rcuscale. If there is significant benefit to current
>>>>> approach, then IMO it is worth exploring.
>>> 
>>> The main benefit I expect is improved performance of the grace period
>>> implementation in common cases where there are few or no readers
>>> present, especially on machines with many cpus.
>>> 
>>> It allows scanning both periods (0/1) for each cpu within the same pass,
>>> therefore loading both period's unlock counters sitting in the same
>>> cache line at once (improved locality), and then loading both period's
>>> lock counters, also sitting in the same cache line.
>>> 
>>> It also allows skipping the period flip entirely if there are no readers
>>> present, which is an -arguably- tiny performance improvement as well.
>> The issue of counter wrap aside, what if a new reader always shows up
>> in the active index being scanned, then can you not delay the GP
>> indefinitely? It seems like writer-starvation is possible then (sure
>> it is possible also with preemption after reader-index-sampling, but
>> scanning active index deliberately will make that worse). Seqlock does
>> not have such writer starvation just because the writer does not care
>> about what the readers are doing.
> 
> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods.
> 
> If there are readers in the "non-current" period, the grace period waits for them.
> 
> If there are readers in the "current" period, it flips the period and then waits for them.

Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. Sorry I did not know the details of your side-RCU to know you were already doing something like that.

> 
>> That said, the approach of scanning both counters does seem attractive
>> for when there are no readers, for the reasons you mentioned. Maybe a
>> heuristic to count the number of readers might help? If we are not
>> reader-heavy, then scan both. Otherwise, just scan the inactive ones,
>> and also couple that heuristic with the number of CPUs. I am
>> interested in working on such a design with you! Let us do it and
>> prototype/measure. ;-)
> 
> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass.

Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful.

> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :)

Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after.

I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…

Cheers,

 - Joel


> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Mathieu Desnoyers Dec. 20, 2022, 7:01 p.m. UTC | #20
On 2022-12-20 13:29, Joel Fernandes wrote:
> 

> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…

I strongly suspect the memory barrier after flip is useless for the same 
reasons I mentioned explaining why the barrier before the flip is useless.

However, we need to double-check that we have memory barriers at the 
beginning and end of synchronize_srcu, and between load of "unlock" 
counters and load of "lock" counters.

Where is the barrier at the beginning of synchronize_srcu ?

Thanks,

Mathieu
Joel Fernandes Dec. 20, 2022, 7:06 p.m. UTC | #21
On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2022-12-20 13:29, Joel Fernandes wrote:
> >
>
> > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
>
> I strongly suspect the memory barrier after flip is useless for the same
> reasons I mentioned explaining why the barrier before the flip is useless.
>
> However, we need to double-check that we have memory barriers at the
> beginning and end of synchronize_srcu, and between load of "unlock"
> counters and load of "lock" counters.
>
> Where is the barrier at the beginning of synchronize_srcu ?

I believe we don't need another memory barrier at the beginning of
synchronize_srcu() (but this part of my SRCU study is still pending
;)) . The grace period guarantee (read-side critical sections don't
span the GP) is already enforced by the memory barrier between
scanning for all unlocks, and scanning for all locks (Paired with
corresponding memory barriers on the read-side).

Accordingly, before we scan all locks and match lock == unlock, there
is an smp_mb(). Why is that not sufficient?

Thanks,

 - Joel

>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Joel Fernandes Dec. 20, 2022, 8:55 p.m. UTC | #22
> On Dec 20, 2022, at 1:29 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
> 
>>> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>> 
>>> On 2022-12-20 13:05, Joel Fernandes wrote:
>>> Hi Mathieu,
>>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers
>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>> 
>>>> On 2022-12-19 20:04, Joel Fernandes wrote:
>>>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>> [...]
>>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
>>>>>>> be exactly 2^64 read-side critical sections.
>>>>>> 
>>>>>> Yes, but what about 32-bit systems?
>>>> 
>>>> The overflow indeed happens after 2^32 increments, just like seqlock.
>>>> The question we need to ask is therefore: if 2^32 is good enough for
>>>> seqlock, why isn't it good enough for SRCU ?
>>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll
>>> let him talk more about it. If 32-bit is good enough, let us also drop
>>> the size of the counters for 64-bit then?
>>>>>>> There are other synchronization algorithms such as seqlocks which are
>>>>>>> quite happy with much less protection against overflow (using a 32-bit
>>>>>>> counter even on 64-bit architectures).
>>>>>> 
>>>>>> The seqlock is an interesting point.
>>>>>> 
>>>>>>> For practical purposes, I suspect this issue is really just theoretical.
>>>>>> 
>>>>>> I have to ask, what is the benefit of avoiding a flip and scanning
>>>>>> active readers? Is the issue about grace period delay or performance?
>>>>>> If so, it might be worth prototyping that approach and measuring using
>>>>>> rcutorture/rcuscale. If there is significant benefit to current
>>>>>> approach, then IMO it is worth exploring.
>>>> 
>>>> The main benefit I expect is improved performance of the grace period
>>>> implementation in common cases where there are few or no readers
>>>> present, especially on machines with many cpus.
>>>> 
>>>> It allows scanning both periods (0/1) for each cpu within the same pass,
>>>> therefore loading both period's unlock counters sitting in the same
>>>> cache line at once (improved locality), and then loading both period's
>>>> lock counters, also sitting in the same cache line.
>>>> 
>>>> It also allows skipping the period flip entirely if there are no readers
>>>> present, which is an -arguably- tiny performance improvement as well.
>>> The issue of counter wrap aside, what if a new reader always shows up
>>> in the active index being scanned, then can you not delay the GP
>>> indefinitely? It seems like writer-starvation is possible then (sure
>>> it is possible also with preemption after reader-index-sampling, but
>>> scanning active index deliberately will make that worse). Seqlock does
>>> not have such writer starvation just because the writer does not care
>>> about what the readers are doing.
>> 
>> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods.
>> 
>> If there are readers in the "non-current" period, the grace period waits for them.
>> 
>> If there are readers in the "current" period, it flips the period and then waits for them.
> 
> Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. Sorry I did not know the details of your side-RCU to know you were already doing something like that.
> 
>> 
>>> That said, the approach of scanning both counters does seem attractive
>>> for when there are no readers, for the reasons you mentioned. Maybe a
>>> heuristic to count the number of readers might help? If we are not
>>> reader-heavy, then scan both. Otherwise, just scan the inactive ones,
>>> and also couple that heuristic with the number of CPUs. I am
>>> interested in working on such a design with you! Let us do it and
>>> prototype/measure. ;-)
>> 
>> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass.
> 
> Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful.
> 
>> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :)
> 
> Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after.
> 
> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…

In my view,  the mb between the totaling of unlocks and totaling of locks serves as the mb that is required to enforce the GP guarantee, which I think is what Mathieu is referring to.

Neeraj, do you agree?

Thanks.





> 
> Cheers,
> 
> - Joel
> 
> 
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
Frederic Weisbecker Dec. 20, 2022, 10:44 p.m. UTC | #23
On Tue, Dec 20, 2022 at 09:20:08AM -0500, Joel Fernandes wrote:
> > On Dec 20, 2022, at 9:07 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > 
> > On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote:
> >>> C w-depend-r
> >>> 
> >>> {
> >>>   PLOCK=LOCK0;
> >>> }
> >>> 
> >>> // updater
> >>> P0(int *LOCK1, int **PLOCK)
> >>> {
> >>>   int lock1;
> >>> 
> >>>   lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
> >>>   smp_mb();
> >>>   WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
> >>> }
> >>> 
> >>> // reader
> >>> P1(int **PLOCK)
> >>> {
> >>>   int *plock;
> >>> 
> >>>   plock = READ_ONCE(*PLOCK);    // Read active idx
> >>>   WRITE_ONCE(*plock, 1); // Write to active idx
> >> 
> >> I am a bit lost here, why would the reader want to write to the active idx?
> >> The reader does not update the idx, only the lock count.
> > 
> > So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right?
> > The write is then displayed that way:
> > 
> >     this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
> > 
> > But things could be also thought the other way around with idx being the base address and
> > ssp->sda->srcu_lock_count being the offset.
> > 
> >     this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter);
> > 
> > That would require to change some high level types but the result would be the same from
> > the memory point of view (and even from the ASM point of view). In the end we
> > are dealing with the same address and access.
> > 
> > Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change.
> > So it can be zero for example. Then the above increment becomes:
> > 
> >   this_cpu_inc(idx.counter);
> > 
> > And then it can be modelized as in the above litmus test.
> > 
> > I had to play that trick because litmus doesn't support arrays but I believe
> > it stands. Now of course I may well have got something wrong since I've always
> > been terrible at maths...
> 
> Ah ok, I get where you were going with that. Yes there is address dependency
> between reading idx and writing lock count. But IMHO, the access on the update
> side is trying to order write to index, and reads from a lock count of a
> previous index (as far as E / B+C is concerned). So IMHO, on the read side you
> have to consider 2 consecutive readers and not the same reader in order to
> pair the same accesses correctly. But I could be missing something.

And you're right, for the first part of the comment (let's call that (1)):

	 * Ensure that if this updater saw a given reader's increment
	 * from __srcu_read_lock(), that reader was using an old value
	 * of ->srcu_idx.

My litmus test shows the ordering displayed in the second part of the comment
(call it (2)):

         * Also ensure that if a given reader sees the
	 * new value of ->srcu_idx, this updater's earlier scans cannot
	 * have seen that reader's increments (which is OK, because this
	 * grace period need not wait on that reader).

_ In (1), E indeed pairs with B and C
_ In (2), E pairs with the address-dependency between idx and lock_count.

Thanks.
Frederic Weisbecker Dec. 20, 2022, 10:57 p.m. UTC | #24
On Tue, Dec 20, 2022 at 02:01:30PM -0500, Mathieu Desnoyers wrote:
> On 2022-12-20 13:29, Joel Fernandes wrote:
> > 
> 
> > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
> 
> I strongly suspect the memory barrier after flip is useless for the same
> reasons I mentioned explaining why the barrier before the flip is useless.
> 
> However, we need to double-check that we have memory barriers at the
> beginning and end of synchronize_srcu, and between load of "unlock" counters
> and load of "lock" counters.
> 
> Where is the barrier at the beginning of synchronize_srcu ?

rcu_seq_snap() ?

> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Frederic Weisbecker Dec. 20, 2022, 11:05 p.m. UTC | #25
On Tue, Dec 20, 2022 at 07:06:57PM +0000, Joel Fernandes wrote:
> On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > On 2022-12-20 13:29, Joel Fernandes wrote:
> > >
> >
> > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
> >
> > I strongly suspect the memory barrier after flip is useless for the same
> > reasons I mentioned explaining why the barrier before the flip is useless.
> >
> > However, we need to double-check that we have memory barriers at the
> > beginning and end of synchronize_srcu, and between load of "unlock"
> > counters and load of "lock" counters.
> >
> > Where is the barrier at the beginning of synchronize_srcu ?
> 
> I believe we don't need another memory barrier at the beginning of
> synchronize_srcu() (but this part of my SRCU study is still pending
> ;)) . The grace period guarantee (read-side critical sections don't
> span the GP) is already enforced by the memory barrier between
> scanning for all unlocks, and scanning for all locks (Paired with
> corresponding memory barriers on the read-side).
> 
> Accordingly, before we scan all locks and match lock == unlock, there
> is an smp_mb(). Why is that not sufficient?

That's not enough, you still need a barrier between the updater's pre-GP
accesses and the scans, so that post-GP read side sees the updater's pre-GP
accesses:


            UPDATER                        READER
            -------                        ------
            WRITE A                        WRITE srcu_read_lock
            smp_mb() //rcu_seq_snap()      smp_mb()
            READ srcu_read_lock //scans    READ A

Thanks.

> 
> Thanks,
> 
>  - Joel
> 
> >
> > Thanks,
> >
> > Mathieu
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > https://www.efficios.com
> >
Joel Fernandes Dec. 20, 2022, 11:46 p.m. UTC | #26
On Tue, Dec 20, 2022 at 6:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Tue, Dec 20, 2022 at 07:06:57PM +0000, Joel Fernandes wrote:
> > On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > >
> > > On 2022-12-20 13:29, Joel Fernandes wrote:
> > > >
> > >
> > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
> > >
> > > I strongly suspect the memory barrier after flip is useless for the same
> > > reasons I mentioned explaining why the barrier before the flip is useless.
> > >
> > > However, we need to double-check that we have memory barriers at the
> > > beginning and end of synchronize_srcu, and between load of "unlock"
> > > counters and load of "lock" counters.
> > >
> > > Where is the barrier at the beginning of synchronize_srcu ?
> >
> > I believe we don't need another memory barrier at the beginning of
> > synchronize_srcu() (but this part of my SRCU study is still pending
> > ;)) . The grace period guarantee (read-side critical sections don't
> > span the GP) is already enforced by the memory barrier between
> > scanning for all unlocks, and scanning for all locks (Paired with
> > corresponding memory barriers on the read-side).
> >
> > Accordingly, before we scan all locks and match lock == unlock, there
> > is an smp_mb(). Why is that not sufficient?
>
> That's not enough, you still need a barrier between the updater's pre-GP
> accesses and the scans, so that post-GP read side sees the updater's pre-GP
> accesses:
>
>
>             UPDATER                        READER
>             -------                        ------
>             WRITE A                        WRITE srcu_read_lock
>             smp_mb() //rcu_seq_snap()      smp_mb()
>             READ srcu_read_lock //scans    READ A

But see the comments also in srcu_readers_active_idx_check()

* Needs to be a smp_mb() as the read side may
* contain a read from a variable that is written to before the
* synchronize_srcu() in the write side

So that appears to be already covered. Or is your point that the scans
are not happening on the same CPU as the pre-GP writer, as scans are
happening from workqueue ?

Perhaps that comment misled me.

Confused,

 - Joel
Frederic Weisbecker Dec. 21, 2022, 12:07 a.m. UTC | #27
On Tue, Dec 20, 2022 at 12:00:58PM -0500, Mathieu Desnoyers wrote:
> On 2022-12-19 20:04, Joel Fernandes wrote:
> The main benefit I expect is improved performance of the grace period
> implementation in common cases where there are few or no readers present,
> especially on machines with many cpus.
> 
> It allows scanning both periods (0/1) for each cpu within the same pass,
> therefore loading both period's unlock counters sitting in the same cache
> line at once (improved locality), and then loading both period's lock
> counters, also sitting in the same cache line.
> 
> It also allows skipping the period flip entirely if there are no readers
> present, which is an -arguably- tiny performance improvement as well.

I would indeed expect performance improvement if there are no readers in the
active period/idx but if there are, it's a performance penalty due to the extra
scans.

So my mean questions are:

* Is the no-present-readers the most likely case? I guess it depends on the ssp.

* Does the SRCU update side deserve to be optimized with added code (because
  we are not debating about removing the flip, rather about adding a fast-path
  and keep the flip as a slow-path)
  
* The SRCU machinery is already quite complicated. Look how we little things lock
  ourselves in for days doing our exegesis of SRCU state machine. And halfway
  through it we are still debating some ordering. Is it worth adding a new path there?

Thanks.
Joel Fernandes Dec. 21, 2022, 12:15 a.m. UTC | #28
On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Tue, Dec 20, 2022 at 09:20:08AM -0500, Joel Fernandes wrote:
> > > On Dec 20, 2022, at 9:07 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > >
> > > On Tue, Dec 20, 2022 at 08:44:40AM -0500, Joel Fernandes wrote:
> > >>> C w-depend-r
> > >>>
> > >>> {
> > >>>   PLOCK=LOCK0;
> > >>> }
> > >>>
> > >>> // updater
> > >>> P0(int *LOCK1, int **PLOCK)
> > >>> {
> > >>>   int lock1;
> > >>>
> > >>>   lock1 = READ_ONCE(*LOCK1); // READ from inactive idx
> > >>>   smp_mb();
> > >>>   WRITE_ONCE(*PLOCK, LOCK1); // Flip idx
> > >>> }
> > >>>
> > >>> // reader
> > >>> P1(int **PLOCK)
> > >>> {
> > >>>   int *plock;
> > >>>
> > >>>   plock = READ_ONCE(*PLOCK);    // Read active idx
> > >>>   WRITE_ONCE(*plock, 1); // Write to active idx
> > >>
> > >> I am a bit lost here, why would the reader want to write to the active idx?
> > >> The reader does not update the idx, only the lock count.
> > >
> > > So &ssp->sda->srcu_lock_count is the base address and idx is the offset, right?
> > > The write is then displayed that way:
> > >
> > >     this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
> > >
> > > But things could be also thought the other way around with idx being the base address and
> > > ssp->sda->srcu_lock_count being the offset.
> > >
> > >     this_cpu_inc(idx[ssp->sda->srcu_lock_count].counter);
> > >
> > > That would require to change some high level types but the result would be the same from
> > > the memory point of view (and even from the ASM point of view). In the end we
> > > are dealing with the same address and access.
> > >
> > > Now ssp->sda->srcu_lock_count is a constant address value. It doesn't change.
> > > So it can be zero for example. Then the above increment becomes:
> > >
> > >   this_cpu_inc(idx.counter);
> > >
> > > And then it can be modelized as in the above litmus test.
> > >
> > > I had to play that trick because litmus doesn't support arrays but I believe
> > > it stands. Now of course I may well have got something wrong since I've always
> > > been terrible at maths...
> >
> > Ah ok, I get where you were going with that. Yes there is address dependency
> > between reading idx and writing lock count. But IMHO, the access on the update
> > side is trying to order write to index, and reads from a lock count of a
> > previous index (as far as E / B+C is concerned). So IMHO, on the read side you
> > have to consider 2 consecutive readers and not the same reader in order to
> > pair the same accesses correctly. But I could be missing something.
>
> And you're right, for the first part of the comment (let's call that (1)):
>
>          * Ensure that if this updater saw a given reader's increment
>          * from __srcu_read_lock(), that reader was using an old value
>          * of ->srcu_idx.
>
> My litmus test shows the ordering displayed in the second part of the comment
> (call it (2)):
>
>          * Also ensure that if a given reader sees the
>          * new value of ->srcu_idx, this updater's earlier scans cannot
>          * have seen that reader's increments (which is OK, because this
>          * grace period need not wait on that reader).
>
> _ In (1), E indeed pairs with B and C

Agreed about (1).

> _ In (2), E pairs with the address-dependency between idx and lock_count.

But that is not the only reason. If that was the only reason for (2),
then there is an smp_mb() just before the next-scan post-flip before
the lock counts are read.

The comment also says that: "D" pairs with "C" and does not mention
"C" being an addr dependency.  So IMHO, you need to look at the unlock
count access and the idx access, for (2). Something like this:
https://hastebin.com/raw/gadeliquta . Though that is complex as fuck.

Or did I miss something subtle?

Thanks.
Frederic Weisbecker Dec. 21, 2022, 12:27 a.m. UTC | #29
On Tue, Dec 20, 2022 at 06:46:10PM -0500, Joel Fernandes wrote:
> On Tue, Dec 20, 2022 at 6:05 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > On Tue, Dec 20, 2022 at 07:06:57PM +0000, Joel Fernandes wrote:
> > > On Tue, Dec 20, 2022 at 7:01 PM Mathieu Desnoyers
> > > <mathieu.desnoyers@efficios.com> wrote:
> > > >
> > > > On 2022-12-20 13:29, Joel Fernandes wrote:
> > > > >
> > > >
> > > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
> > > >
> > > > I strongly suspect the memory barrier after flip is useless for the same
> > > > reasons I mentioned explaining why the barrier before the flip is useless.
> > > >
> > > > However, we need to double-check that we have memory barriers at the
> > > > beginning and end of synchronize_srcu, and between load of "unlock"
> > > > counters and load of "lock" counters.
> > > >
> > > > Where is the barrier at the beginning of synchronize_srcu ?
> > >
> > > I believe we don't need another memory barrier at the beginning of
> > > synchronize_srcu() (but this part of my SRCU study is still pending
> > > ;)) . The grace period guarantee (read-side critical sections don't
> > > span the GP) is already enforced by the memory barrier between
> > > scanning for all unlocks, and scanning for all locks (Paired with
> > > corresponding memory barriers on the read-side).
> > >
> > > Accordingly, before we scan all locks and match lock == unlock, there
> > > is an smp_mb(). Why is that not sufficient?
> >
> > That's not enough, you still need a barrier between the updater's pre-GP
> > accesses and the scans, so that post-GP read side sees the updater's pre-GP
> > accesses:
> >
> >
> >             UPDATER                        READER
> >             -------                        ------
> >             WRITE A                        WRITE srcu_read_lock
> >             smp_mb() //rcu_seq_snap()      smp_mb()
> >             READ srcu_read_lock //scans    READ A
> 
> But see the comments also in srcu_readers_active_idx_check()
> 
> * Needs to be a smp_mb() as the read side may
> * contain a read from a variable that is written to before the
> * synchronize_srcu() in the write side
> 
> So that appears to be already covered. Or is your point that the scans
> are not happening on the same CPU as the pre-GP writer, as scans are
> happening from workqueue ?

Nah I think you're right. Although I guess we still need the barrier between
updater's pre-gp accesses and srcu_unlock scans...


> 
> Perhaps that comment misled me.
> 
> Confused,
> 
>  - Joel
Frederic Weisbecker Dec. 21, 2022, 12:49 a.m. UTC | #30
On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> Agreed about (1).
> 
> > _ In (2), E pairs with the address-dependency between idx and lock_count.
> 
> But that is not the only reason. If that was the only reason for (2),
> then there is an smp_mb() just before the next-scan post-flip before
> the lock counts are read.

The post-flip barrier makes sure the new idx is visible on the next READER's
turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
may appear unordered from the update side POV if there is no barrier between the
scan and the flip.

If you remove the smp_mb() from the litmus test I sent, things explode.
Frederic Weisbecker Dec. 21, 2022, 12:58 a.m. UTC | #31
On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> > On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > Agreed about (1).
> > 
> > > _ In (2), E pairs with the address-dependency between idx and lock_count.
> > 
> > But that is not the only reason. If that was the only reason for (2),
> > then there is an smp_mb() just before the next-scan post-flip before
> > the lock counts are read.
> 
> The post-flip barrier makes sure the new idx is visible on the next READER's
> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> may appear unordered from the update side POV if there is no barrier between the
> scan and the flip.
> 
> If you remove the smp_mb() from the litmus test I sent, things explode.

Or rather, look at it the other way, if there is no barrier between the lock
scan and the index flip (E), then the index flip can appear to be written before the
lock is read. Which means you may start activating the index before you finish
reading it (at least it appears that way from the readers pont of view).
Joel Fernandes Dec. 21, 2022, 2:41 a.m. UTC | #32
> On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>> Agreed about (1).
>> 
>>> _ In (2), E pairs with the address-dependency between idx and lock_count.
>> 
>> But that is not the only reason. If that was the only reason for (2),
>> then there is an smp_mb() just before the next-scan post-flip before
>> the lock counts are read.
> 
> The post-flip barrier makes sure the new idx is visible on the next READER's
> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> may appear unordered from the update side POV if there is no barrier between the
> scan and the flip.
> 
> If you remove the smp_mb() from the litmus test I sent, things explode.

Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency.
In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A.

So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip).

That’s just my opinion and let’s discuss more because I cannot rule out that I am missing something with this complicated topic ;-)

Thanks.
Mathieu Desnoyers Dec. 21, 2022, 3:34 a.m. UTC | #33
On 2022-12-20 17:57, Frederic Weisbecker wrote:
> On Tue, Dec 20, 2022 at 02:01:30PM -0500, Mathieu Desnoyers wrote:
>> On 2022-12-20 13:29, Joel Fernandes wrote:
>>>
>>
>>> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
>>
>> I strongly suspect the memory barrier after flip is useless for the same
>> reasons I mentioned explaining why the barrier before the flip is useless.
>>
>> However, we need to double-check that we have memory barriers at the
>> beginning and end of synchronize_srcu, and between load of "unlock" counters
>> and load of "lock" counters.
>>
>> Where is the barrier at the beginning of synchronize_srcu ?
> 
> rcu_seq_snap() ?

The memory barrier in rcu_seq_snap is not at the very beginning of synchronize_srcu.

For example we have:

unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp)
{
         // Any prior manipulation of SRCU-protected data must happen
         // before the load from ->srcu_gp_seq.
         smp_mb();
         return rcu_seq_snap(&ssp->srcu_gp_seq);

which happens to have an explicit barrier before issuing rcu_seq_snap().

So my question still stands: where is the memory barrier at the beginning of synchronize_srcu ?

The memory ordering constraint I am concerned about here is:

  * [...] In addition,
  * each CPU having an SRCU read-side critical section that extends beyond
  * the return from synchronize_srcu() is guaranteed to have executed a
  * full memory barrier after the beginning of synchronize_srcu() and before
  * the beginning of that SRCU read-side critical section. [...]

So if we have a SRCU read-side critical section that begins after the beginning
of synchronize_srcu, but before its first memory barrier, it would miss the
guarantee that the full memory barrier is issued before the beginning of that
SRCU read-side critical section. IOW, that memory barrier needs to be at the
very beginning of the grace period.

I suspect that the memory barrier in srcu_read_lock() invoked by srcu_gp_start_if_needed
may happen to be early enough, but I'm not sure, and it does not appear to be documented
as such.

Thanks,

Mathieu

> 
>>
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
Mathieu Desnoyers Dec. 21, 2022, 3:43 a.m. UTC | #34
On 2022-12-20 19:58, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>>> Agreed about (1).
>>>
>>>> _ In (2), E pairs with the address-dependency between idx and lock_count.
>>>
>>> But that is not the only reason. If that was the only reason for (2),
>>> then there is an smp_mb() just before the next-scan post-flip before
>>> the lock counts are read.
>>
>> The post-flip barrier makes sure the new idx is visible on the next READER's
>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
>> may appear unordered from the update side POV if there is no barrier between the
>> scan and the flip.
>>
>> If you remove the smp_mb() from the litmus test I sent, things explode.
> 
> Or rather, look at it the other way, if there is no barrier between the lock
> scan and the index flip (E), then the index flip can appear to be written before the
> lock is read. Which means you may start activating the index before you finish
> reading it (at least it appears that way from the readers pont of view).

Considering that you can have pre-existing readers from arbitrary index 
appearing anywhere in the grace period (because a reader can fetch the
index and be preempted for an arbitrary amount of time before 
incrementing the lock count), the grace period algorithm needs to deal 
with the fact that a newcoming reader can appear in a given index either 
before or after the flip.

I don't see how flipping the index before or after loading the 
unlock/lock values would break anything (except for unlikely counter 
overflow situations as previously discussed).

Thanks,

Mathieu
Mathieu Desnoyers Dec. 21, 2022, 3:47 a.m. UTC | #35
On 2022-12-20 19:07, Frederic Weisbecker wrote:
> On Tue, Dec 20, 2022 at 12:00:58PM -0500, Mathieu Desnoyers wrote:
>> On 2022-12-19 20:04, Joel Fernandes wrote:
>> The main benefit I expect is improved performance of the grace period
>> implementation in common cases where there are few or no readers present,
>> especially on machines with many cpus.
>>
>> It allows scanning both periods (0/1) for each cpu within the same pass,
>> therefore loading both period's unlock counters sitting in the same cache
>> line at once (improved locality), and then loading both period's lock
>> counters, also sitting in the same cache line.
>>
>> It also allows skipping the period flip entirely if there are no readers
>> present, which is an -arguably- tiny performance improvement as well.
> 
> I would indeed expect performance improvement if there are no readers in the
> active period/idx but if there are, it's a performance penalty due to the extra
> scans.
> 
> So my mean questions are:
> 
> * Is the no-present-readers the most likely case? I guess it depends on the ssp.
> 
> * Does the SRCU update side deserve to be optimized with added code (because
>    we are not debating about removing the flip, rather about adding a fast-path
>    and keep the flip as a slow-path)
>    
> * The SRCU machinery is already quite complicated. Look how we little things lock
>    ourselves in for days doing our exegesis of SRCU state machine. And halfway
>    through it we are still debating some ordering. Is it worth adding a new path there?

I'm not arguing for making things more complex unless there are good 
reasons to do so. However I think we badly need to improve the 
documentation of the memory barriers in SRCU, because the claimed 
barrier pairing is odd.

Thanks,

Mathieu
Mathieu Desnoyers Dec. 21, 2022, 3:52 a.m. UTC | #36
On 2022-12-20 15:55, Joel Fernandes wrote:
> 
> 
>> On Dec 20, 2022, at 1:29 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>>
>> 
>>
>>>> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>> On 2022-12-20 13:05, Joel Fernandes wrote:
>>>> Hi Mathieu,
>>>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers
>>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>
>>>>> On 2022-12-19 20:04, Joel Fernandes wrote:
>>>>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>> [...]
>>>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
>>>>>>>> be exactly 2^64 read-side critical sections.
>>>>>>>
>>>>>>> Yes, but what about 32-bit systems?
>>>>>
>>>>> The overflow indeed happens after 2^32 increments, just like seqlock.
>>>>> The question we need to ask is therefore: if 2^32 is good enough for
>>>>> seqlock, why isn't it good enough for SRCU ?
>>>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll
>>>> let him talk more about it. If 32-bit is good enough, let us also drop
>>>> the size of the counters for 64-bit then?
>>>>>>>> There are other synchronization algorithms such as seqlocks which are
>>>>>>>> quite happy with much less protection against overflow (using a 32-bit
>>>>>>>> counter even on 64-bit architectures).
>>>>>>>
>>>>>>> The seqlock is an interesting point.
>>>>>>>
>>>>>>>> For practical purposes, I suspect this issue is really just theoretical.
>>>>>>>
>>>>>>> I have to ask, what is the benefit of avoiding a flip and scanning
>>>>>>> active readers? Is the issue about grace period delay or performance?
>>>>>>> If so, it might be worth prototyping that approach and measuring using
>>>>>>> rcutorture/rcuscale. If there is significant benefit to current
>>>>>>> approach, then IMO it is worth exploring.
>>>>>
>>>>> The main benefit I expect is improved performance of the grace period
>>>>> implementation in common cases where there are few or no readers
>>>>> present, especially on machines with many cpus.
>>>>>
>>>>> It allows scanning both periods (0/1) for each cpu within the same pass,
>>>>> therefore loading both period's unlock counters sitting in the same
>>>>> cache line at once (improved locality), and then loading both period's
>>>>> lock counters, also sitting in the same cache line.
>>>>>
>>>>> It also allows skipping the period flip entirely if there are no readers
>>>>> present, which is an -arguably- tiny performance improvement as well.
>>>> The issue of counter wrap aside, what if a new reader always shows up
>>>> in the active index being scanned, then can you not delay the GP
>>>> indefinitely? It seems like writer-starvation is possible then (sure
>>>> it is possible also with preemption after reader-index-sampling, but
>>>> scanning active index deliberately will make that worse). Seqlock does
>>>> not have such writer starvation just because the writer does not care
>>>> about what the readers are doing.
>>>
>>> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods.
>>>
>>> If there are readers in the "non-current" period, the grace period waits for them.
>>>
>>> If there are readers in the "current" period, it flips the period and then waits for them.
>>
>> Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. Sorry I did not know the details of your side-RCU to know you were already doing something like that.
>>
>>>
>>>> That said, the approach of scanning both counters does seem attractive
>>>> for when there are no readers, for the reasons you mentioned. Maybe a
>>>> heuristic to count the number of readers might help? If we are not
>>>> reader-heavy, then scan both. Otherwise, just scan the inactive ones,
>>>> and also couple that heuristic with the number of CPUs. I am
>>>> interested in working on such a design with you! Let us do it and
>>>> prototype/measure. ;-)
>>>
>>> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass.
>>
>> Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful.
>>
>>> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :)
>>
>> Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after.
>>
>> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
> 
> In my view,  the mb between the totaling of unlocks and totaling of locks serves as the mb that is required to enforce the GP guarantee, which I think is what Mathieu is referring to.
> 

No, AFAIU you also need barriers at the beginning and end of synchronize_srcu to provide those guarantees:

  * There are memory-ordering constraints implied by synchronize_srcu().

Need for a barrier at the end of synchronize_srcu():

  * On systems with more than one CPU, when synchronize_srcu() returns,
  * each CPU is guaranteed to have executed a full memory barrier since
  * the end of its last corresponding SRCU read-side critical section
  * whose beginning preceded the call to synchronize_srcu().

Need for a barrier at the beginning of synchronize_srcu():

  * In addition,
  * each CPU having an SRCU read-side critical section that extends beyond
  * the return from synchronize_srcu() is guaranteed to have executed a
  * full memory barrier after the beginning of synchronize_srcu() and before
  * the beginning of that SRCU read-side critical section.  Note that these
  * guarantees include CPUs that are offline, idle, or executing in user mode,
  * as well as CPUs that are executing in the kernel.

Thanks,

Mathieu

> Neeraj, do you agree?
> 
> Thanks.
> 
> 
> 
> 
> 
>>
>> Cheers,
>>
>> - Joel
>>
>>
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>> -- 
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
Joel Fernandes Dec. 21, 2022, 4:26 a.m. UTC | #37
> On Dec 20, 2022, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2022-12-20 19:58, Frederic Weisbecker wrote:
>>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
>>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
>>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>>>> Agreed about (1).
>>>> 
>>>>> _ In (2), E pairs with the address-dependency between idx and lock_count.
>>>> 
>>>> But that is not the only reason. If that was the only reason for (2),
>>>> then there is an smp_mb() just before the next-scan post-flip before
>>>> the lock counts are read.
>>> 
>>> The post-flip barrier makes sure the new idx is visible on the next READER's
>>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
>>> may appear unordered from the update side POV if there is no barrier between the
>>> scan and the flip.
>>> 
>>> If you remove the smp_mb() from the litmus test I sent, things explode.
>> Or rather, look at it the other way, if there is no barrier between the lock
>> scan and the index flip (E), then the index flip can appear to be written before the
>> lock is read. Which means you may start activating the index before you finish
>> reading it (at least it appears that way from the readers pont of view).
> 
> Considering that you can have pre-existing readers from arbitrary index appearing anywhere in the grace period (because a reader can fetch the
> index and be preempted for an arbitrary amount of time before incrementing the lock count), the grace period algorithm needs to deal with the fact that a newcoming reader can appear in a given index either before or after the flip.
> 
> I don't see how flipping the index before or after loading the unlock/lock values would break anything (except for unlikely counter overflow situations as previously discussed).

If you say unlikely, that means it can happen some times which is bad enough ;-). Maybe you mean impossible. I would not settle for anything less than keeping the memory barrier around if it helps unlikely cases, but only D does help for the theoretical wrapping/overflow issue. E is broken and does not even help the theoretical issue IMO. And both D and E do not affect correctness IMO.

Anyway in all likelihood, I will be trying to remove E completely and clarify docs on D in the coming weeks. And also try to drop the size of the counters per our discussions

Thanks.



> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Joel Fernandes Dec. 21, 2022, 5:02 a.m. UTC | #38
> On Dec 20, 2022, at 10:51 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2022-12-20 15:55, Joel Fernandes wrote:
>>>> On Dec 20, 2022, at 1:29 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>>> 
>>> 
>>> 
>>>>> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>>> 
>>>>> On 2022-12-20 13:05, Joel Fernandes wrote:
>>>>> Hi Mathieu,
>>>>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers
>>>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>> 
>>>>>> On 2022-12-19 20:04, Joel Fernandes wrote:
>>>>>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>> [...]
>>>>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this need to
>>>>>>>>> be exactly 2^64 read-side critical sections.
>>>>>>>> 
>>>>>>>> Yes, but what about 32-bit systems?
>>>>>> 
>>>>>> The overflow indeed happens after 2^32 increments, just like seqlock.
>>>>>> The question we need to ask is therefore: if 2^32 is good enough for
>>>>>> seqlock, why isn't it good enough for SRCU ?
>>>>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll
>>>>> let him talk more about it. If 32-bit is good enough, let us also drop
>>>>> the size of the counters for 64-bit then?
>>>>>>>>> There are other synchronization algorithms such as seqlocks which are
>>>>>>>>> quite happy with much less protection against overflow (using a 32-bit
>>>>>>>>> counter even on 64-bit architectures).
>>>>>>>> 
>>>>>>>> The seqlock is an interesting point.
>>>>>>>> 
>>>>>>>>> For practical purposes, I suspect this issue is really just theoretical.
>>>>>>>> 
>>>>>>>> I have to ask, what is the benefit of avoiding a flip and scanning
>>>>>>>> active readers? Is the issue about grace period delay or performance?
>>>>>>>> If so, it might be worth prototyping that approach and measuring using
>>>>>>>> rcutorture/rcuscale. If there is significant benefit to current
>>>>>>>> approach, then IMO it is worth exploring.
>>>>>> 
>>>>>> The main benefit I expect is improved performance of the grace period
>>>>>> implementation in common cases where there are few or no readers
>>>>>> present, especially on machines with many cpus.
>>>>>> 
>>>>>> It allows scanning both periods (0/1) for each cpu within the same pass,
>>>>>> therefore loading both period's unlock counters sitting in the same
>>>>>> cache line at once (improved locality), and then loading both period's
>>>>>> lock counters, also sitting in the same cache line.
>>>>>> 
>>>>>> It also allows skipping the period flip entirely if there are no readers
>>>>>> present, which is an -arguably- tiny performance improvement as well.
>>>>> The issue of counter wrap aside, what if a new reader always shows up
>>>>> in the active index being scanned, then can you not delay the GP
>>>>> indefinitely? It seems like writer-starvation is possible then (sure
>>>>> it is possible also with preemption after reader-index-sampling, but
>>>>> scanning active index deliberately will make that worse). Seqlock does
>>>>> not have such writer starvation just because the writer does not care
>>>>> about what the readers are doing.
>>>> 
>>>> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happens to be no readers in both periods.
>>>> 
>>>> If there are readers in the "non-current" period, the grace period waits for them.
>>>> 
>>>> If there are readers in the "current" period, it flips the period and then waits for them.
>>> 
>>> Ok glad you already do that, this is what I was sort of leaning at in my previous email as well, that is doing a hybrid approach. Sorry I did not know the details of your side-RCU to know you were already doing something like that.
>>> 
>>>> 
>>>>> That said, the approach of scanning both counters does seem attractive
>>>>> for when there are no readers, for the reasons you mentioned. Maybe a
>>>>> heuristic to count the number of readers might help? If we are not
>>>>> reader-heavy, then scan both. Otherwise, just scan the inactive ones,
>>>>> and also couple that heuristic with the number of CPUs. I am
>>>>> interested in working on such a design with you! Let us do it and
>>>>> prototype/measure. ;-)
>>>> 
>>>> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass.
>>> 
>>> Makes sense, I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful.
>>> 
>>>> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :)
>>> 
>>> Oh, I was more referring to after the holidays. I am also starting vacation soon and limited In cycles ;-). It is probably better to enjoy the holidays and come back to this after.
>>> 
>>> I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
>> In my view,  the mb between the totaling of unlocks and totaling of locks serves as the mb that is required to enforce the GP guarantee, which I think is what Mathieu is referring to.
> 
> No, AFAIU you also need barriers at the beginning and end of synchronize_srcu to provide those guarantees:

My bad, I got too hung up on the scan code. Indeed we need additional ordering on synchronize side.

Anyway, the full memory barriers are already implemented in the synchronize code AFAICS (beginning and end). At least one of them full memory barriers directly appears at the end of __synchronize_srcu(). But I dont want to say something stupid in the middle of the night, so I will take my time to get back on that.

Thanks,

Joel


> 
> * There are memory-ordering constraints implied by synchronize_srcu().
> 
> Need for a barrier at the end of synchronize_srcu():
> 
> * On systems with more than one CPU, when synchronize_srcu() returns,
> * each CPU is guaranteed to have executed a full memory barrier since
> * the end of its last corresponding SRCU read-side critical section
> * whose beginning preceded the call to synchronize_srcu().
> 
> Need for a barrier at the beginning of synchronize_srcu():
> 
> * In addition,
> * each CPU having an SRCU read-side critical section that extends beyond
> * the return from synchronize_srcu() is guaranteed to have executed a
> * full memory barrier after the beginning of synchronize_srcu() and before
> * the beginning of that SRCU read-side critical section.  Note that these
> * guarantees include CPUs that are offline, idle, or executing in user mode,
> * as well as CPUs that are executing in the kernel.
> 
> Thanks,
> 
> Mathieu
> 
>> Neeraj, do you agree?
>> Thanks.
>>> 
>>> Cheers,
>>> 
>>> - Joel
>>> 
>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Mathieu
>>>> 
>>>> -- 
>>>> Mathieu Desnoyers
>>>> EfficiOS Inc.
>>>> https://www.efficios.com
>>>> 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Frederic Weisbecker Dec. 21, 2022, 11:26 a.m. UTC | #39
On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > 
> > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >> Agreed about (1).
> >> 
> >>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> >> 
> >> But that is not the only reason. If that was the only reason for (2),
> >> then there is an smp_mb() just before the next-scan post-flip before
> >> the lock counts are read.
> > 
> > The post-flip barrier makes sure the new idx is visible on the next READER's
> > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> > may appear unordered from the update side POV if there is no barrier between the
> > scan and the flip.
> > 
> > If you remove the smp_mb() from the litmus test I sent, things explode.
> 
> Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency.
> In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A.
> 
> So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip).
> 
> That’s just my opinion and let’s discuss more because I cannot rule out that I
> am missing something with this complicated topic ;-)

I must be missing something. I often do.

Ok let's put that on litmus:

----
C srcu

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);
	
	// FLIP
	smp_mb(); // E
	WRITE_ONCE(*IDX, 1);
	smp_mb(); // D
	
	// SCAN2
	unlock0 = READ_ONCE(*UNLOCK0);
	smp_mb(); // A
	lock0 = READ_ONCE(*LOCK0);
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx;

	// 1st reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// second reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:lock1!=0)
---

This gives:

Test srcu Allowed
States 1
0:lock1=0;
No
Witnesses
Positive: 0 Negative: 9
Condition exists (not (0:lock1=0))
Observation srcu Never 0 9
Time srcu 0.57
Hash=855d17de503814d2421602174f307c59

Now if I comment out the "smp_mb() /* E */" line this gives:

Test srcu Allowed
States 3
0:lock1=0;
0:lock1=1;
0:lock1=2;
Ok
Witnesses
Positive: 4 Negative: 9
Condition exists (not (0:lock1=0))
Observation srcu Sometimes 4 9

Thanks
Frederic Weisbecker Dec. 21, 2022, 11:59 a.m. UTC | #40
On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> On 2022-12-20 17:57, Frederic Weisbecker wrote:
> > On Tue, Dec 20, 2022 at 02:01:30PM -0500, Mathieu Desnoyers wrote:
> > > On 2022-12-20 13:29, Joel Fernandes wrote:
> > > > 
> > > 
> > > > I do want to finish my memory barrier studies of SRCU over the holidays since I have been deep in the hole with that already. Back to the post flip memory barrier here since I think now even that might not be needed…
> > > 
> > > I strongly suspect the memory barrier after flip is useless for the same
> > > reasons I mentioned explaining why the barrier before the flip is useless.
> > > 
> > > However, we need to double-check that we have memory barriers at the
> > > beginning and end of synchronize_srcu, and between load of "unlock" counters
> > > and load of "lock" counters.
> > > 
> > > Where is the barrier at the beginning of synchronize_srcu ?
> > 
> > rcu_seq_snap() ?
> 
> The memory barrier in rcu_seq_snap is not at the very beginning of synchronize_srcu.
> 
> For example we have:
> 
> unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp)
> {
>         // Any prior manipulation of SRCU-protected data must happen
>         // before the load from ->srcu_gp_seq.
>         smp_mb();
>         return rcu_seq_snap(&ssp->srcu_gp_seq);
> 
> which happens to have an explicit barrier before issuing rcu_seq_snap().

SRCU (or even RCU) polling is special in that it may rely on a concurrent updater
to start the grace period, hence why you need two more barriers here (the second
is in poll_state_synchronize_srcu()) so that:

* If the polling updater started polling (calling get_state_synchronize_srcu())
  before the traditional updater started the grace period, the latter must
  propagate the changes from the polling updater to all CPUs.

* If the polling updater started polling (calling get_state_synchronize_srcu())
  after the traditional updater started the grace period, it must wait for a
  subsequent grace period (rcu_seq_snap() will return that subsequent sequence).

* If the polling updater checks (and thereby finishes) polling (calling poll_state_synchronize_srcu())
  after the traditional updater completes the grace period, the polling updater sees
  the propagated barrier.

* If the polling updater checks polling (calling poll_state_synchronize_srcu())
  before the traditional updater completes the grace period, keep polling.

> So my question still stands: where is the memory barrier at the beginning of
> synchronize_srcu ?

I still think rcu_seq_snap() is what you're looking for.

> 
> The memory ordering constraint I am concerned about here is:
> 
>  * [...] In addition,
>  * each CPU having an SRCU read-side critical section that extends beyond
>  * the return from synchronize_srcu() is guaranteed to have executed a
>  * full memory barrier after the beginning of synchronize_srcu() and before
>  * the beginning of that SRCU read-side critical section. [...]
> 
> So if we have a SRCU read-side critical section that begins after the beginning
> of synchronize_srcu, but before its first memory barrier, it would miss the
> guarantee that the full memory barrier is issued before the beginning of that
> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> very beginning of the grace period.

I'm confused, what's wrong with this ?

UPDATER                  READER
-------                  ------
STORE X = 1              STORE srcu_read_lock++
// rcu_seq_snap()        smp_mb()
smp_mb()                 READ X
// scans
READ srcu_read_lock
Frederic Weisbecker Dec. 21, 2022, 12:11 p.m. UTC | #41
On Tue, Dec 20, 2022 at 10:43:25PM -0500, Mathieu Desnoyers wrote:
> On 2022-12-20 19:58, Frederic Weisbecker wrote:
> > On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
> > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> > > > On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > Agreed about (1).
> > > > 
> > > > > _ In (2), E pairs with the address-dependency between idx and lock_count.
> > > > 
> > > > But that is not the only reason. If that was the only reason for (2),
> > > > then there is an smp_mb() just before the next-scan post-flip before
> > > > the lock counts are read.
> > > 
> > > The post-flip barrier makes sure the new idx is visible on the next READER's
> > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> > > may appear unordered from the update side POV if there is no barrier between the
> > > scan and the flip.
> > > 
> > > If you remove the smp_mb() from the litmus test I sent, things explode.
> > 
> > Or rather, look at it the other way, if there is no barrier between the lock
> > scan and the index flip (E), then the index flip can appear to be written before the
> > lock is read. Which means you may start activating the index before you finish
> > reading it (at least it appears that way from the readers pont of view).
> 
> Considering that you can have pre-existing readers from arbitrary index
> appearing anywhere in the grace period (because a reader can fetch the
> index and be preempted for an arbitrary amount of time before incrementing
> the lock count), the grace period algorithm needs to deal with the fact that
> a newcoming reader can appear in a given index either before or after the
> flip.

True but the number of preempted tasks is bound and there is a forward progress guarantee.

> I don't see how flipping the index before or after loading the unlock/lock
> values would break anything (except for unlikely counter overflow situations
> as previously discussed).

Forward progress guarantee.

Thanks.

> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Frederic Weisbecker Dec. 21, 2022, 2:04 p.m. UTC | #42
On Tue, Dec 20, 2022 at 11:26:12PM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 20, 2022, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> > On 2022-12-20 19:58, Frederic Weisbecker wrote:
> >>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
> >>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> >>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >>>> Agreed about (1).
> >>>> 
> >>>>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> >>>> 
> >>>> But that is not the only reason. If that was the only reason for (2),
> >>>> then there is an smp_mb() just before the next-scan post-flip before
> >>>> the lock counts are read.
> >>> 
> >>> The post-flip barrier makes sure the new idx is visible on the next READER's
> >>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> >>> may appear unordered from the update side POV if there is no barrier between the
> >>> scan and the flip.
> >>> 
> >>> If you remove the smp_mb() from the litmus test I sent, things explode.
> >> Or rather, look at it the other way, if there is no barrier between the lock
> >> scan and the index flip (E), then the index flip can appear to be written before the
> >> lock is read. Which means you may start activating the index before you finish
> >> reading it (at least it appears that way from the readers pont of view).
> > 
> > Considering that you can have pre-existing readers from arbitrary index appearing anywhere in the grace period (because a reader can fetch the
> > index and be preempted for an arbitrary amount of time before incrementing the lock count), the grace period algorithm needs to deal with the fact that a newcoming reader can appear in a given index either before or after the flip.
> > 
> > I don't see how flipping the index before or after loading the unlock/lock values would break anything (except for unlikely counter overflow situations as previously discussed).
> 
> If you say unlikely, that means it can happen some times which is bad enough ;-). Maybe you mean impossible. I would not settle for anything less than keeping the memory barrier around if it helps unlikely cases, but only D does help for the theoretical wrapping/overflow issue. E is broken and does not even help the theoretical issue IMO. And both D and E do not affect correctness IMO.

And here is why D is needed:

C D

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);
	
	// FLIP
	smp_mb(); // E
	WRITE_ONCE(*IDX, 1);
	smp_mb(); // D
	
	// SCAN 2
	unlock0 = READ_ONCE(*UNLOCK0);
	smp_mb(); // A
	lock0 = READ_ONCE(*LOCK0);
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx;

	// 1st reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// second reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// third reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:unlock0=0 /\ 1:idx=0)

States 6
0:unlock0=0; 1:idx=1;
0:unlock0=1; 1:idx=0;
0:unlock0=1; 1:idx=1;
0:unlock0=2; 1:idx=0;
0:unlock0=2; 1:idx=1;
0:unlock0=3; 1:idx=0;
No
Witnesses
Positive: 0 Negative: 14
Condition exists (0:unlock0=0 /\ 1:idx=0)
Observation D Never 0 14


But then if you comment out "smp_mb() /* D */":

Test D Allowed
States 7
0:unlock0=0; 1:idx=0;
0:unlock0=0; 1:idx=1;
0:unlock0=1; 1:idx=0;
0:unlock0=1; 1:idx=1;
0:unlock0=2; 1:idx=0;
0:unlock0=2; 1:idx=1;
0:unlock0=3; 1:idx=0;
Ok
Witnesses
Positive: 2 Negative: 14
Condition exists (0:unlock0=0 /\ 1:idx=0)
Observation D Sometimes 2 14


Without D I guess things would eventually fix up but that would require an
extra loop in SCAN2.

Thanks.
Boqun Feng Dec. 21, 2022, 4:02 p.m. UTC | #43
On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote:
> > 
> > 
> > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > 
> > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > >> Agreed about (1).
> > >> 
> > >>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> > >> 
> > >> But that is not the only reason. If that was the only reason for (2),
> > >> then there is an smp_mb() just before the next-scan post-flip before
> > >> the lock counts are read.
> > > 
> > > The post-flip barrier makes sure the new idx is visible on the next READER's
> > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> > > may appear unordered from the update side POV if there is no barrier between the
> > > scan and the flip.
> > > 
> > > If you remove the smp_mb() from the litmus test I sent, things explode.
> > 
> > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency.
> > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A.
> > 
> > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip).
> > 
> > That’s just my opinion and let’s discuss more because I cannot rule out that I
> > am missing something with this complicated topic ;-)
> 
> I must be missing something. I often do.
> 
> Ok let's put that on litmus:
> 
> ----
> C srcu
> 
> {}
> 
> // updater
> P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
> 	int lock1;
> 	int unlock1;
> 	int lock0;
> 	int unlock0;
> 
> 	// SCAN1
> 	unlock1 = READ_ONCE(*UNLOCK1);
> 	smp_mb(); // A
> 	lock1 = READ_ONCE(*LOCK1);
> 	
> 	// FLIP
> 	smp_mb(); // E

In real code there is a control dependency between the READ_ONCE() above
and the WRITE_ONCE() before, i.e. only flip the idx when lock1 ==
unlock1, maybe try with the P0 below? Untested due to not having herd on
this computer ;-)

> 	WRITE_ONCE(*IDX, 1);
> 	smp_mb(); // D
> 	
> 	// SCAN2
> 	unlock0 = READ_ONCE(*UNLOCK0);
> 	smp_mb(); // A
> 	lock0 = READ_ONCE(*LOCK0);
> }
> 
	P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
	{
		int lock1;
		int unlock1;
		int lock0;
		int unlock0;

		// SCAN1
		unlock1 = READ_ONCE(*UNLOCK1);
		smp_mb(); // A
		lock1 = READ_ONCE(*LOCK1);
		
		// FLIP
		if (unlock1 == lock1) {
			smp_mb(); // E
			WRITE_ONCE(*IDX, 1);
			smp_mb(); // D
			
			// SCAN2
			unlock0 = READ_ONCE(*UNLOCK0);
			smp_mb(); // A
			lock0 = READ_ONCE(*LOCK0);
		}
	}

Regards,
Boqun

> // reader
> P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
> 	int tmp;
> 	int idx;
> 
> 	// 1st reader
> 	idx = READ_ONCE(*IDX);
> 	if (idx == 0) {
> 		tmp = READ_ONCE(*LOCK0);
> 		WRITE_ONCE(*LOCK0, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK0);
> 		WRITE_ONCE(*UNLOCK0, tmp + 1);
> 	} else {
> 		tmp = READ_ONCE(*LOCK1);
> 		WRITE_ONCE(*LOCK1, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK1);
> 		WRITE_ONCE(*UNLOCK1, tmp + 1);
> 	}
> 	
> 	// second reader
> 	idx = READ_ONCE(*IDX);
> 	if (idx == 0) {
> 		tmp = READ_ONCE(*LOCK0);
> 		WRITE_ONCE(*LOCK0, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK0);
> 		WRITE_ONCE(*UNLOCK0, tmp + 1);
> 	} else {
> 		tmp = READ_ONCE(*LOCK1);
> 		WRITE_ONCE(*LOCK1, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK1);
> 		WRITE_ONCE(*UNLOCK1, tmp + 1);
> 	}
> }
> 
> exists (0:lock1!=0)
> ---
> 
> This gives:
> 
> Test srcu Allowed
> States 1
> 0:lock1=0;
> No
> Witnesses
> Positive: 0 Negative: 9
> Condition exists (not (0:lock1=0))
> Observation srcu Never 0 9
> Time srcu 0.57
> Hash=855d17de503814d2421602174f307c59
> 
> Now if I comment out the "smp_mb() /* E */" line this gives:
> 
> Test srcu Allowed
> States 3
> 0:lock1=0;
> 0:lock1=1;
> 0:lock1=2;
> Ok
> Witnesses
> Positive: 4 Negative: 9
> Condition exists (not (0:lock1=0))
> Observation srcu Sometimes 4 9
> 
> Thanks
Mathieu Desnoyers Dec. 21, 2022, 4:30 p.m. UTC | #44
On 2022-12-20 23:26, Joel Fernandes wrote:
> 
> 
>> On Dec 20, 2022, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2022-12-20 19:58, Frederic Weisbecker wrote:
>>>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
>>>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
>>>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>>>>> Agreed about (1).
>>>>>
>>>>>> _ In (2), E pairs with the address-dependency between idx and lock_count.
>>>>>
>>>>> But that is not the only reason. If that was the only reason for (2),
>>>>> then there is an smp_mb() just before the next-scan post-flip before
>>>>> the lock counts are read.
>>>>
>>>> The post-flip barrier makes sure the new idx is visible on the next READER's
>>>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
>>>> may appear unordered from the update side POV if there is no barrier between the
>>>> scan and the flip.
>>>>
>>>> If you remove the smp_mb() from the litmus test I sent, things explode.
>>> Or rather, look at it the other way, if there is no barrier between the lock
>>> scan and the index flip (E), then the index flip can appear to be written before the
>>> lock is read. Which means you may start activating the index before you finish
>>> reading it (at least it appears that way from the readers pont of view).
>>
>> Considering that you can have pre-existing readers from arbitrary index appearing anywhere in the grace period (because a reader can fetch the
>> index and be preempted for an arbitrary amount of time before incrementing the lock count), the grace period algorithm needs to deal with the fact that a newcoming reader can appear in a given index either before or after the flip.
>>
>> I don't see how flipping the index before or after loading the unlock/lock values would break anything (except for unlikely counter overflow situations as previously discussed).
> 
> If you say unlikely, that means it can happen some times which is bad enough ;-). Maybe you mean impossible.

I mean that if we have a synchronize_srcu preemption long enough to get 
2^32 or 2^64 concurrent srcu read-side critical sections, I strongly 
suspect that RCU stall detection will yell loudly. And if it does not 
already, then we should make it so.

So I mean "impossible unless the system is already unusable", rather 
than just "unlikely".

Thanks,

Mathieu

> I would not settle for anything less than keeping the memory barrier around if it helps unlikely cases, but only D does help for the theoretical wrapping/overflow issue. E is broken and does not even help the theoretical issue IMO. And both D and E do not affect correctness IMO.
> 
> Anyway in all likelihood, I will be trying to remove E completely and clarify docs on D in the coming weeks. And also try to drop the size of the counters per our discussions
> 
> Thanks.
> 
> 
> 
>>
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
Mathieu Desnoyers Dec. 21, 2022, 5:11 p.m. UTC | #45
On 2022-12-21 06:59, Frederic Weisbecker wrote:
> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
[...]
>>
>> The memory ordering constraint I am concerned about here is:
>>
>>   * [...] In addition,
>>   * each CPU having an SRCU read-side critical section that extends beyond
>>   * the return from synchronize_srcu() is guaranteed to have executed a
>>   * full memory barrier after the beginning of synchronize_srcu() and before
>>   * the beginning of that SRCU read-side critical section. [...]
>>
>> So if we have a SRCU read-side critical section that begins after the beginning
>> of synchronize_srcu, but before its first memory barrier, it would miss the
>> guarantee that the full memory barrier is issued before the beginning of that
>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
>> very beginning of the grace period.
> 
> I'm confused, what's wrong with this ?
> 
> UPDATER                  READER
> -------                  ------
> STORE X = 1              STORE srcu_read_lock++
> // rcu_seq_snap()        smp_mb()
> smp_mb()                 READ X
> // scans
> READ srcu_read_lock

What you refer to here is only memory ordering of the store to X and 
load from X wrt loading/increment of srcu_read_lock, which is internal 
to the srcu implementation. If we really want to model the provided 
high-level memory ordering guarantees, we should consider a scenario 
where SRCU is used for its memory ordering properties to synchronize 
other variables.

I'm concerned about the following Dekker scenario, where 
synchronize_srcu() and srcu_read_lock/unlock would be used instead of 
memory barriers:

Initial state: X = 0, Y = 0

Thread A                   Thread B
---------------------------------------------
STORE X = 1                STORE Y = 1
synchronize_srcu()
                            srcu_read_lock()
                            r1 = LOAD X
                            srcu_read_unlock()
r0 = LOAD Y

BUG_ON(!r0 && !r1)

So in the synchronize_srcu implementation, there appears to be two
major scenarios: either srcu_gp_start_if_needed starts a gp or expedited 
gp, or it uses an already started gp/expedited gp. When snapshotting 
with rcu_seq_snap, the fact that the memory barrier is after the 
ssp->srcu_gp_seq load means that it does not order prior memory accesses 
before that load. This sequence value is then used to identify which 
gp_seq to wait for when piggy-backing on another already-started gp. I 
worry about reordering between STORE X = 1 and load of ssp->srcu_gp_seq, 
which is then used to piggy-back on an already-started gp.

I suspect that the implicit barrier in srcu_read_lock() invoked at the
beginning of srcu_gp_start_if_needed() is really the barrier that makes
all this behave as expected. But without documentation it's rather hard 
to follow.

Thanks,

Mathieu
Mathieu Desnoyers Dec. 21, 2022, 5:20 p.m. UTC | #46
On 2022-12-21 07:11, Frederic Weisbecker wrote:
> On Tue, Dec 20, 2022 at 10:43:25PM -0500, Mathieu Desnoyers wrote:
>> On 2022-12-20 19:58, Frederic Weisbecker wrote:
>>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
>>>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
>>>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>>>>> Agreed about (1).
>>>>>
>>>>>> _ In (2), E pairs with the address-dependency between idx and lock_count.
>>>>>
>>>>> But that is not the only reason. If that was the only reason for (2),
>>>>> then there is an smp_mb() just before the next-scan post-flip before
>>>>> the lock counts are read.
>>>>
>>>> The post-flip barrier makes sure the new idx is visible on the next READER's
>>>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
>>>> may appear unordered from the update side POV if there is no barrier between the
>>>> scan and the flip.
>>>>
>>>> If you remove the smp_mb() from the litmus test I sent, things explode.
>>>
>>> Or rather, look at it the other way, if there is no barrier between the lock
>>> scan and the index flip (E), then the index flip can appear to be written before the
>>> lock is read. Which means you may start activating the index before you finish
>>> reading it (at least it appears that way from the readers pont of view).
>>
>> Considering that you can have pre-existing readers from arbitrary index
>> appearing anywhere in the grace period (because a reader can fetch the
>> index and be preempted for an arbitrary amount of time before incrementing
>> the lock count), the grace period algorithm needs to deal with the fact that
>> a newcoming reader can appear in a given index either before or after the
>> flip.
> 
> True but the number of preempted tasks is bound and there is a forward progress guarantee.
> 
>> I don't see how flipping the index before or after loading the unlock/lock
>> values would break anything (except for unlikely counter overflow situations
>> as previously discussed).
> 
> Forward progress guarantee.

Considering a coherent cache, the store-buffer will ensure that the 
index flip eventually reaches all readers. This bounds the time during 
which readers can flood the current index, and therefore guarantees 
forward progress. AFAIK the Linux kernel does not support architectures 
with incoherent caches.

So I still don't see how having the barrier before or after the index 
flip is useful for forward progress.

Thanks,

Mathieu

> 
> Thanks.
> 
>>
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
Frederic Weisbecker Dec. 21, 2022, 5:30 p.m. UTC | #47
On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote:
> On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote:
> > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote:
> > > 
> > > 
> > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > 
> > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > >> Agreed about (1).
> > > >> 
> > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> > > >> 
> > > >> But that is not the only reason. If that was the only reason for (2),
> > > >> then there is an smp_mb() just before the next-scan post-flip before
> > > >> the lock counts are read.
> > > > 
> > > > The post-flip barrier makes sure the new idx is visible on the next READER's
> > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> > > > may appear unordered from the update side POV if there is no barrier between the
> > > > scan and the flip.
> > > > 
> > > > If you remove the smp_mb() from the litmus test I sent, things explode.
> > > 
> > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency.
> > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A.
> > > 
> > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip).
> > > 
> > > That’s just my opinion and let’s discuss more because I cannot rule out that I
> > > am missing something with this complicated topic ;-)
> > 
> > I must be missing something. I often do.
> > 
> > Ok let's put that on litmus:
> > 
> > ----
> > C srcu
> > 
> > {}
> > 
> > // updater
> > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > {
> > 	int lock1;
> > 	int unlock1;
> > 	int lock0;
> > 	int unlock0;
> > 
> > 	// SCAN1
> > 	unlock1 = READ_ONCE(*UNLOCK1);
> > 	smp_mb(); // A
> > 	lock1 = READ_ONCE(*LOCK1);
> > 	
> > 	// FLIP
> > 	smp_mb(); // E
> 
> In real code there is a control dependency between the READ_ONCE() above
> and the WRITE_ONCE() before, i.e. only flip the idx when lock1 ==
> unlock1, maybe try with the P0 below? Untested due to not having herd on
> this computer ;-)
> 
> > 	WRITE_ONCE(*IDX, 1);
> > 	smp_mb(); // D
> > 	
> > 	// SCAN2
> > 	unlock0 = READ_ONCE(*UNLOCK0);
> > 	smp_mb(); // A
> > 	lock0 = READ_ONCE(*LOCK0);
> > }
> > 
> 	P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> 	{
> 		int lock1;
> 		int unlock1;
> 		int lock0;
> 		int unlock0;
> 
> 		// SCAN1
> 		unlock1 = READ_ONCE(*UNLOCK1);
> 		smp_mb(); // A
> 		lock1 = READ_ONCE(*LOCK1);
> 		
> 		// FLIP
> 		if (unlock1 == lock1) {
> 			smp_mb(); // E
> 			WRITE_ONCE(*IDX, 1);
> 			smp_mb(); // D
> 			
> 			// SCAN2
> 			unlock0 = READ_ONCE(*UNLOCK0);
> 			smp_mb(); // A
> 			lock0 = READ_ONCE(*LOCK0);
> 		}
> 	}

That becomes the below (same effect).

C D

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);
	
	if (unlock1 == lock1) {
		// FLIP
		smp_mb(); // E
		WRITE_ONCE(*IDX, 1);
		smp_mb(); // D
	
		// SCAN 2
		unlock0 = READ_ONCE(*UNLOCK0);
		smp_mb(); // A
		lock0 = READ_ONCE(*LOCK0);
	}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx;

	// 1st reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// second reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// third reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:unlock0=0 /\ 1:idx=0)
Joel Fernandes Dec. 21, 2022, 6:18 p.m. UTC | #48
On Wed, Dec 21, 2022 at 5:20 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2022-12-21 07:11, Frederic Weisbecker wrote:
> > On Tue, Dec 20, 2022 at 10:43:25PM -0500, Mathieu Desnoyers wrote:
> >> On 2022-12-20 19:58, Frederic Weisbecker wrote:
> >>> On Wed, Dec 21, 2022 at 01:49:57AM +0100, Frederic Weisbecker wrote:
> >>>> On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> >>>>> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >>>>> Agreed about (1).
> >>>>>
> >>>>>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> >>>>>
> >>>>> But that is not the only reason. If that was the only reason for (2),
> >>>>> then there is an smp_mb() just before the next-scan post-flip before
> >>>>> the lock counts are read.
> >>>>
> >>>> The post-flip barrier makes sure the new idx is visible on the next READER's
> >>>> turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> >>>> may appear unordered from the update side POV if there is no barrier between the
> >>>> scan and the flip.
> >>>>
> >>>> If you remove the smp_mb() from the litmus test I sent, things explode.
> >>>
> >>> Or rather, look at it the other way, if there is no barrier between the lock
> >>> scan and the index flip (E), then the index flip can appear to be written before the
> >>> lock is read. Which means you may start activating the index before you finish
> >>> reading it (at least it appears that way from the readers pont of view).
> >>
> >> Considering that you can have pre-existing readers from arbitrary index
> >> appearing anywhere in the grace period (because a reader can fetch the
> >> index and be preempted for an arbitrary amount of time before incrementing
> >> the lock count), the grace period algorithm needs to deal with the fact that
> >> a newcoming reader can appear in a given index either before or after the
> >> flip.
> >
> > True but the number of preempted tasks is bound and there is a forward progress guarantee.
> >
> >> I don't see how flipping the index before or after loading the unlock/lock
> >> values would break anything (except for unlikely counter overflow situations
> >> as previously discussed).
> >
> > Forward progress guarantee.
>
> Considering a coherent cache, the store-buffer will ensure that the
> index flip eventually reaches all readers. This bounds the time during
> which readers can flood the current index, and therefore guarantees
> forward progress. AFAIK the Linux kernel does not support architectures
> with incoherent caches.
>
> So I still don't see how having the barrier before or after the index
> flip is useful for forward progress.

Even though eventually the writes on either side will make it, I think
you have little lost opportunity without the "D" memory barrier
without satisfying the store-buffer pattern. i.e. even if the idx
update was seen by the read-side, its unlock count contributions to
the now-inactive idx may not be seen.

I think that makes it slightly better for making the grace period end
sooner. I think that's what you "scan both" idea also tries to
achieve, so for that reason you should like "D" as well.

But yes it depends on how forward-progress is defined. If defined as
starvation - no the post/pref flip MBs are not helpful for that
(eventually the writes will make it). If defined as not losing an
opportunity to end the GP sooner, yes the flip-related memory barriers
does help that.

As for whether or not we need "E", that is up for debate. I am
intrigued by Frederick's litmus tests showing that it helps, because
my litmus tests don't show it.

Thanks.
Joel Fernandes Dec. 21, 2022, 7:33 p.m. UTC | #49
Ah Frederic, I think you nailed it. E is required to order the flip
write with the control-dependency on the READ side. I can confirm the
below test with bad condition shows the previous reader sees the
post-flip index when it shouldn't have. Please see below modifications
to your Litmus test.

I think we should document it in the code that E pairs with the
control-dependency between idx read and lock count write.

C srcu
{}
// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int lock1;
        int unlock1;
        int lock0;
        int unlock0;

        // SCAN1
        unlock1 = READ_ONCE(*UNLOCK1);
        smp_mb(); // A
        lock1 = READ_ONCE(*LOCK1);

        // FLIP
        smp_mb(); // E -------------------- required to make the bad
condition not happen.
        WRITE_ONCE(*IDX, 1);
        smp_mb(); // D

        // SCAN2
        unlock0 = READ_ONCE(*UNLOCK0);
        smp_mb(); // A
        lock0 = READ_ONCE(*LOCK0);
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int tmp;
        int idx1;
        int idx2;

        // 1st reader
        idx1 = READ_ONCE(*IDX);
        if (idx1 == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }

        // second reader
        idx2 = READ_ONCE(*IDX);
        if (idx2 == 0) {
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }
}

exists (0:lock1=1 /\ 1:idx1=1 /\ 1:idx2=1 )  (* bad condition: 1st
reader saw flip *)





On Wed, Dec 21, 2022 at 5:30 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote:
> > On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote:
> > > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote:
> > > >
> > > >
> > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > >
> > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> > > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > >> Agreed about (1).
> > > > >>
> > > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> > > > >>
> > > > >> But that is not the only reason. If that was the only reason for (2),
> > > > >> then there is an smp_mb() just before the next-scan post-flip before
> > > > >> the lock counts are read.
> > > > >
> > > > > The post-flip barrier makes sure the new idx is visible on the next READER's
> > > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> > > > > may appear unordered from the update side POV if there is no barrier between the
> > > > > scan and the flip.
> > > > >
> > > > > If you remove the smp_mb() from the litmus test I sent, things explode.
> > > >
> > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency.
> > > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A.
> > > >
> > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip).
> > > >
> > > > That’s just my opinion and let’s discuss more because I cannot rule out that I
> > > > am missing something with this complicated topic ;-)
> > >
> > > I must be missing something. I often do.
> > >
> > > Ok let's put that on litmus:
> > >
> > > ----
> > > C srcu
> > >
> > > {}
> > >
> > > // updater
> > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > > {
> > >     int lock1;
> > >     int unlock1;
> > >     int lock0;
> > >     int unlock0;
> > >
> > >     // SCAN1
> > >     unlock1 = READ_ONCE(*UNLOCK1);
> > >     smp_mb(); // A
> > >     lock1 = READ_ONCE(*LOCK1);
> > >
> > >     // FLIP
> > >     smp_mb(); // E
> >
> > In real code there is a control dependency between the READ_ONCE() above
> > and the WRITE_ONCE() before, i.e. only flip the idx when lock1 ==
> > unlock1, maybe try with the P0 below? Untested due to not having herd on
> > this computer ;-)
> >
> > >     WRITE_ONCE(*IDX, 1);
> > >     smp_mb(); // D
> > >
> > >     // SCAN2
> > >     unlock0 = READ_ONCE(*UNLOCK0);
> > >     smp_mb(); // A
> > >     lock0 = READ_ONCE(*LOCK0);
> > > }
> > >
> >       P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> >       {
> >               int lock1;
> >               int unlock1;
> >               int lock0;
> >               int unlock0;
> >
> >               // SCAN1
> >               unlock1 = READ_ONCE(*UNLOCK1);
> >               smp_mb(); // A
> >               lock1 = READ_ONCE(*LOCK1);
> >
> >               // FLIP
> >               if (unlock1 == lock1) {
> >                       smp_mb(); // E
> >                       WRITE_ONCE(*IDX, 1);
> >                       smp_mb(); // D
> >
> >                       // SCAN2
> >                       unlock0 = READ_ONCE(*UNLOCK0);
> >                       smp_mb(); // A
> >                       lock0 = READ_ONCE(*LOCK0);
> >               }
> >       }
>
> That becomes the below (same effect).
>
> C D
>
> {}
>
> // updater
> P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
>         int lock1;
>         int unlock1;
>         int lock0;
>         int unlock0;
>
>         // SCAN1
>         unlock1 = READ_ONCE(*UNLOCK1);
>         smp_mb(); // A
>         lock1 = READ_ONCE(*LOCK1);
>
>         if (unlock1 == lock1) {
>                 // FLIP
>                 smp_mb(); // E
>                 WRITE_ONCE(*IDX, 1);
>                 smp_mb(); // D
>
>                 // SCAN 2
>                 unlock0 = READ_ONCE(*UNLOCK0);
>                 smp_mb(); // A
>                 lock0 = READ_ONCE(*LOCK0);
>         }
> }
>
> // reader
> P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
>         int tmp;
>         int idx;
>
>         // 1st reader
>         idx = READ_ONCE(*IDX);
>         if (idx == 0) {
>                 tmp = READ_ONCE(*LOCK0);
>                 WRITE_ONCE(*LOCK0, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK0);
>                 WRITE_ONCE(*UNLOCK0, tmp + 1);
>         } else {
>                 tmp = READ_ONCE(*LOCK1);
>                 WRITE_ONCE(*LOCK1, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK1);
>                 WRITE_ONCE(*UNLOCK1, tmp + 1);
>         }
>
>         // second reader
>         idx = READ_ONCE(*IDX);
>         if (idx == 0) {
>                 tmp = READ_ONCE(*LOCK0);
>                 WRITE_ONCE(*LOCK0, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK0);
>                 WRITE_ONCE(*UNLOCK0, tmp + 1);
>         } else {
>                 tmp = READ_ONCE(*LOCK1);
>                 WRITE_ONCE(*LOCK1, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK1);
>                 WRITE_ONCE(*UNLOCK1, tmp + 1);
>         }
>
>         // third reader
>         idx = READ_ONCE(*IDX);
>         if (idx == 0) {
>                 tmp = READ_ONCE(*LOCK0);
>                 WRITE_ONCE(*LOCK0, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK0);
>                 WRITE_ONCE(*UNLOCK0, tmp + 1);
>         } else {
>                 tmp = READ_ONCE(*LOCK1);
>                 WRITE_ONCE(*LOCK1, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK1);
>                 WRITE_ONCE(*UNLOCK1, tmp + 1);
>         }
> }
>
> exists (0:unlock0=0 /\ 1:idx=0)
>
Joel Fernandes Dec. 21, 2022, 7:57 p.m. UTC | #50
Mathieu pointed out to me on IRC that adding the control dep on the
update side removes even that need for E. And I see that is what Boqun
was suggesting above and that Frederic modified.

See updated litmus here: https://www.irccloud.com/pastebin/TrXacogO/

With this, I am not seeing the "bad condition" happen.

On Wed, Dec 21, 2022 at 7:33 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Ah Frederic, I think you nailed it. E is required to order the flip
> write with the control-dependency on the READ side. I can confirm the
> below test with bad condition shows the previous reader sees the
> post-flip index when it shouldn't have. Please see below modifications
> to your Litmus test.
>
> I think we should document it in the code that E pairs with the
> control-dependency between idx read and lock count write.
>
> C srcu
> {}
> // updater
> P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
>         int lock1;
>         int unlock1;
>         int lock0;
>         int unlock0;
>
>         // SCAN1
>         unlock1 = READ_ONCE(*UNLOCK1);
>         smp_mb(); // A
>         lock1 = READ_ONCE(*LOCK1);
>
>         // FLIP
>         smp_mb(); // E -------------------- required to make the bad
> condition not happen.
>         WRITE_ONCE(*IDX, 1);
>         smp_mb(); // D
>
>         // SCAN2
>         unlock0 = READ_ONCE(*UNLOCK0);
>         smp_mb(); // A
>         lock0 = READ_ONCE(*LOCK0);
> }
>
> // reader
> P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
>         int tmp;
>         int idx1;
>         int idx2;
>
>         // 1st reader
>         idx1 = READ_ONCE(*IDX);
>         if (idx1 == 0) {
>                 tmp = READ_ONCE(*LOCK0);
>                 WRITE_ONCE(*LOCK0, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK0);
>                 WRITE_ONCE(*UNLOCK0, tmp + 1);
>         } else {
>                 tmp = READ_ONCE(*LOCK1);
>                 WRITE_ONCE(*LOCK1, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK1);
>                 WRITE_ONCE(*UNLOCK1, tmp + 1);
>         }
>
>         // second reader
>         idx2 = READ_ONCE(*IDX);
>         if (idx2 == 0) {
>                 tmp = READ_ONCE(*LOCK0);
>                 WRITE_ONCE(*LOCK0, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK0);
>                 WRITE_ONCE(*UNLOCK0, tmp + 1);
>         } else {
>                 tmp = READ_ONCE(*LOCK1);
>                 WRITE_ONCE(*LOCK1, tmp + 1);
>                 smp_mb(); /* B and C */
>                 tmp = READ_ONCE(*UNLOCK1);
>                 WRITE_ONCE(*UNLOCK1, tmp + 1);
>         }
> }
>
> exists (0:lock1=1 /\ 1:idx1=1 /\ 1:idx2=1 )  (* bad condition: 1st
> reader saw flip *)
>
>
>
>
>
> On Wed, Dec 21, 2022 at 5:30 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote:
> > > On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote:
> > > > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote:
> > > > >
> > > > >
> > > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> > > > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > > >> Agreed about (1).
> > > > > >>
> > > > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> > > > > >>
> > > > > >> But that is not the only reason. If that was the only reason for (2),
> > > > > >> then there is an smp_mb() just before the next-scan post-flip before
> > > > > >> the lock counts are read.
> > > > > >
> > > > > > The post-flip barrier makes sure the new idx is visible on the next READER's
> > > > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> > > > > > may appear unordered from the update side POV if there is no barrier between the
> > > > > > scan and the flip.
> > > > > >
> > > > > > If you remove the smp_mb() from the litmus test I sent, things explode.
> > > > >
> > > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency.
> > > > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A.
> > > > >
> > > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip).
> > > > >
> > > > > That’s just my opinion and let’s discuss more because I cannot rule out that I
> > > > > am missing something with this complicated topic ;-)
> > > >
> > > > I must be missing something. I often do.
> > > >
> > > > Ok let's put that on litmus:
> > > >
> > > > ----
> > > > C srcu
> > > >
> > > > {}
> > > >
> > > > // updater
> > > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > > > {
> > > >     int lock1;
> > > >     int unlock1;
> > > >     int lock0;
> > > >     int unlock0;
> > > >
> > > >     // SCAN1
> > > >     unlock1 = READ_ONCE(*UNLOCK1);
> > > >     smp_mb(); // A
> > > >     lock1 = READ_ONCE(*LOCK1);
> > > >
> > > >     // FLIP
> > > >     smp_mb(); // E
> > >
> > > In real code there is a control dependency between the READ_ONCE() above
> > > and the WRITE_ONCE() before, i.e. only flip the idx when lock1 ==
> > > unlock1, maybe try with the P0 below? Untested due to not having herd on
> > > this computer ;-)
> > >
> > > >     WRITE_ONCE(*IDX, 1);
> > > >     smp_mb(); // D
> > > >
> > > >     // SCAN2
> > > >     unlock0 = READ_ONCE(*UNLOCK0);
> > > >     smp_mb(); // A
> > > >     lock0 = READ_ONCE(*LOCK0);
> > > > }
> > > >
> > >       P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > >       {
> > >               int lock1;
> > >               int unlock1;
> > >               int lock0;
> > >               int unlock0;
> > >
> > >               // SCAN1
> > >               unlock1 = READ_ONCE(*UNLOCK1);
> > >               smp_mb(); // A
> > >               lock1 = READ_ONCE(*LOCK1);
> > >
> > >               // FLIP
> > >               if (unlock1 == lock1) {
> > >                       smp_mb(); // E
> > >                       WRITE_ONCE(*IDX, 1);
> > >                       smp_mb(); // D
> > >
> > >                       // SCAN2
> > >                       unlock0 = READ_ONCE(*UNLOCK0);
> > >                       smp_mb(); // A
> > >                       lock0 = READ_ONCE(*LOCK0);
> > >               }
> > >       }
> >
> > That becomes the below (same effect).
> >
> > C D
> >
> > {}
> >
> > // updater
> > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > {
> >         int lock1;
> >         int unlock1;
> >         int lock0;
> >         int unlock0;
> >
> >         // SCAN1
> >         unlock1 = READ_ONCE(*UNLOCK1);
> >         smp_mb(); // A
> >         lock1 = READ_ONCE(*LOCK1);
> >
> >         if (unlock1 == lock1) {
> >                 // FLIP
> >                 smp_mb(); // E
> >                 WRITE_ONCE(*IDX, 1);
> >                 smp_mb(); // D
> >
> >                 // SCAN 2
> >                 unlock0 = READ_ONCE(*UNLOCK0);
> >                 smp_mb(); // A
> >                 lock0 = READ_ONCE(*LOCK0);
> >         }
> > }
> >
> > // reader
> > P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > {
> >         int tmp;
> >         int idx;
> >
> >         // 1st reader
> >         idx = READ_ONCE(*IDX);
> >         if (idx == 0) {
> >                 tmp = READ_ONCE(*LOCK0);
> >                 WRITE_ONCE(*LOCK0, tmp + 1);
> >                 smp_mb(); /* B and C */
> >                 tmp = READ_ONCE(*UNLOCK0);
> >                 WRITE_ONCE(*UNLOCK0, tmp + 1);
> >         } else {
> >                 tmp = READ_ONCE(*LOCK1);
> >                 WRITE_ONCE(*LOCK1, tmp + 1);
> >                 smp_mb(); /* B and C */
> >                 tmp = READ_ONCE(*UNLOCK1);
> >                 WRITE_ONCE(*UNLOCK1, tmp + 1);
> >         }
> >
> >         // second reader
> >         idx = READ_ONCE(*IDX);
> >         if (idx == 0) {
> >                 tmp = READ_ONCE(*LOCK0);
> >                 WRITE_ONCE(*LOCK0, tmp + 1);
> >                 smp_mb(); /* B and C */
> >                 tmp = READ_ONCE(*UNLOCK0);
> >                 WRITE_ONCE(*UNLOCK0, tmp + 1);
> >         } else {
> >                 tmp = READ_ONCE(*LOCK1);
> >                 WRITE_ONCE(*LOCK1, tmp + 1);
> >                 smp_mb(); /* B and C */
> >                 tmp = READ_ONCE(*UNLOCK1);
> >                 WRITE_ONCE(*UNLOCK1, tmp + 1);
> >         }
> >
> >         // third reader
> >         idx = READ_ONCE(*IDX);
> >         if (idx == 0) {
> >                 tmp = READ_ONCE(*LOCK0);
> >                 WRITE_ONCE(*LOCK0, tmp + 1);
> >                 smp_mb(); /* B and C */
> >                 tmp = READ_ONCE(*UNLOCK0);
> >                 WRITE_ONCE(*UNLOCK0, tmp + 1);
> >         } else {
> >                 tmp = READ_ONCE(*LOCK1);
> >                 WRITE_ONCE(*LOCK1, tmp + 1);
> >                 smp_mb(); /* B and C */
> >                 tmp = READ_ONCE(*UNLOCK1);
> >                 WRITE_ONCE(*UNLOCK1, tmp + 1);
> >         }
> > }
> >
> > exists (0:unlock0=0 /\ 1:idx=0)
> >
Boqun Feng Dec. 21, 2022, 8:19 p.m. UTC | #51
On Wed, Dec 21, 2022 at 06:30:05PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 08:02:28AM -0800, Boqun Feng wrote:
> > On Wed, Dec 21, 2022 at 12:26:29PM +0100, Frederic Weisbecker wrote:
> > > On Tue, Dec 20, 2022 at 09:41:17PM -0500, Joel Fernandes wrote:
> > > > 
> > > > 
> > > > > On Dec 20, 2022, at 7:50 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > > 
> > > > > On Tue, Dec 20, 2022 at 07:15:00PM -0500, Joel Fernandes wrote:
> > > > >> On Tue, Dec 20, 2022 at 5:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > >> Agreed about (1).
> > > > >> 
> > > > >>> _ In (2), E pairs with the address-dependency between idx and lock_count.
> > > > >> 
> > > > >> But that is not the only reason. If that was the only reason for (2),
> > > > >> then there is an smp_mb() just before the next-scan post-flip before
> > > > >> the lock counts are read.
> > > > > 
> > > > > The post-flip barrier makes sure the new idx is visible on the next READER's
> > > > > turn, but it doesn't protect against the fact that "READ idx then WRITE lock[idx]"
> > > > > may appear unordered from the update side POV if there is no barrier between the
> > > > > scan and the flip.
> > > > > 
> > > > > If you remove the smp_mb() from the litmus test I sent, things explode.
> > > > 
> > > > Sure I see what you are saying and it’s a valid point as well. However why do you need memory barrier D (labeled such in the kernel code) for that? You already have a memory barrier A before the lock count is read. That will suffice for the ordering pairing with the addr dependency.
> > > > In other words, if updater sees readers lock counts, then reader would be making those lock count updates on post-flip inactive index, not the one being scanned as you wanted, and you will accomplish that just with the mem barrier A.
> > > > 
> > > > So D fixes the above issue you are talking about (lock count update), however that is already fixed by the memory barrier A. But you still need D for the issue I mentioned (unlock counts vs flip).
> > > > 
> > > > That’s just my opinion and let’s discuss more because I cannot rule out that I
> > > > am missing something with this complicated topic ;-)
> > > 
> > > I must be missing something. I often do.
> > > 
> > > Ok let's put that on litmus:
> > > 
> > > ----
> > > C srcu
> > > 
> > > {}
> > > 
> > > // updater
> > > P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > > {
> > > 	int lock1;
> > > 	int unlock1;
> > > 	int lock0;
> > > 	int unlock0;
> > > 
> > > 	// SCAN1
> > > 	unlock1 = READ_ONCE(*UNLOCK1);
> > > 	smp_mb(); // A
> > > 	lock1 = READ_ONCE(*LOCK1);
> > > 	
> > > 	// FLIP
> > > 	smp_mb(); // E
> > 
> > In real code there is a control dependency between the READ_ONCE() above
> > and the WRITE_ONCE() before, i.e. only flip the idx when lock1 ==
> > unlock1, maybe try with the P0 below? Untested due to not having herd on
> > this computer ;-)
> > 
> > > 	WRITE_ONCE(*IDX, 1);
> > > 	smp_mb(); // D
> > > 	
> > > 	// SCAN2
> > > 	unlock0 = READ_ONCE(*UNLOCK0);
> > > 	smp_mb(); // A
> > > 	lock0 = READ_ONCE(*LOCK0);
> > > }
> > > 
> > 	P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> > 	{
> > 		int lock1;
> > 		int unlock1;
> > 		int lock0;
> > 		int unlock0;
> > 
> > 		// SCAN1
> > 		unlock1 = READ_ONCE(*UNLOCK1);
> > 		smp_mb(); // A
> > 		lock1 = READ_ONCE(*LOCK1);
> > 		
> > 		// FLIP
> > 		if (unlock1 == lock1) {
> > 			smp_mb(); // E
> > 			WRITE_ONCE(*IDX, 1);
> > 			smp_mb(); // D
> > 			
> > 			// SCAN2
> > 			unlock0 = READ_ONCE(*UNLOCK0);
> > 			smp_mb(); // A
> > 			lock0 = READ_ONCE(*LOCK0);
> > 		}
> > 	}
> 
> That becomes the below (same effect).
> 

By "same effect" you mean removing E results in the exist-clause
triggered? If so, then our environments disagree with each other ;-)

I "download" your litmus test (frederic.litmus) and created an variant
(frederic-remove-E.litmus) that only removes the E, both of the litmus
tests are in the attachment.

And here is the result on my env:

for frederic.litmus:

	Test D Allowed
	States 6
	0:unlock0=0; 1:idx=1;
	0:unlock0=1; 1:idx=0;
	0:unlock0=1; 1:idx=1;
	0:unlock0=2; 1:idx=0;
	0:unlock0=2; 1:idx=1;
	0:unlock0=3; 1:idx=0;
	No
	Witnesses
	Positive: 0 Negative: 14
	Condition exists (0:unlock0=0 /\ 1:idx=0)
	Observation D Never 0 14
	Time D 54.22
	Hash=eead834f635201cde8ceb21250e33381

for frederic-remove-E.litmus:

	Test D Allowed
	States 6
	0:unlock0=0; 1:idx=1;
	0:unlock0=1; 1:idx=0;
	0:unlock0=1; 1:idx=1;
	0:unlock0=2; 1:idx=0;
	0:unlock0=2; 1:idx=1;
	0:unlock0=3; 1:idx=0;
	No
	Witnesses
	Positive: 0 Negative: 14
	Condition exists (0:unlock0=0 /\ 1:idx=0)
	Observation D Never 0 14
	Time D 53.50
	Hash=c2579f542cf9f87af125c5792999dc44

Regards,
Boqun

> C D
> 
> {}
> 
> // updater
> P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
> 	int lock1;
> 	int unlock1;
> 	int lock0;
> 	int unlock0;
> 
> 	// SCAN1
> 	unlock1 = READ_ONCE(*UNLOCK1);
> 	smp_mb(); // A
> 	lock1 = READ_ONCE(*LOCK1);
> 	
> 	if (unlock1 == lock1) {
> 		// FLIP
> 		smp_mb(); // E
> 		WRITE_ONCE(*IDX, 1);
> 		smp_mb(); // D
> 	
> 		// SCAN 2
> 		unlock0 = READ_ONCE(*UNLOCK0);
> 		smp_mb(); // A
> 		lock0 = READ_ONCE(*LOCK0);
> 	}
> }
> 
> // reader
> P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
> {
> 	int tmp;
> 	int idx;
> 
> 	// 1st reader
> 	idx = READ_ONCE(*IDX);
> 	if (idx == 0) {
> 		tmp = READ_ONCE(*LOCK0);
> 		WRITE_ONCE(*LOCK0, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK0);
> 		WRITE_ONCE(*UNLOCK0, tmp + 1);
> 	} else {
> 		tmp = READ_ONCE(*LOCK1);
> 		WRITE_ONCE(*LOCK1, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK1);
> 		WRITE_ONCE(*UNLOCK1, tmp + 1);
> 	}
> 	
> 	// second reader
> 	idx = READ_ONCE(*IDX);
> 	if (idx == 0) {
> 		tmp = READ_ONCE(*LOCK0);
> 		WRITE_ONCE(*LOCK0, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK0);
> 		WRITE_ONCE(*UNLOCK0, tmp + 1);
> 	} else {
> 		tmp = READ_ONCE(*LOCK1);
> 		WRITE_ONCE(*LOCK1, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK1);
> 		WRITE_ONCE(*UNLOCK1, tmp + 1);
> 	}
> 	
> 	// third reader
> 	idx = READ_ONCE(*IDX);
> 	if (idx == 0) {
> 		tmp = READ_ONCE(*LOCK0);
> 		WRITE_ONCE(*LOCK0, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK0);
> 		WRITE_ONCE(*UNLOCK0, tmp + 1);
> 	} else {
> 		tmp = READ_ONCE(*LOCK1);
> 		WRITE_ONCE(*LOCK1, tmp + 1);
> 		smp_mb(); /* B and C */
> 		tmp = READ_ONCE(*UNLOCK1);
> 		WRITE_ONCE(*UNLOCK1, tmp + 1);
> 	}
> }
> 
> exists (0:unlock0=0 /\ 1:idx=0)
>
C D

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);
	
	if (unlock1 == lock1) {
		// FLIP
		smp_mb(); // E
		WRITE_ONCE(*IDX, 1);
		smp_mb(); // D
	
		// SCAN 2
		unlock0 = READ_ONCE(*UNLOCK0);
		smp_mb(); // A
		lock0 = READ_ONCE(*LOCK0);
	}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx;

	// 1st reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// second reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// third reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:unlock0=0 /\ 1:idx=0)
C D

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);
	
	if (unlock1 == lock1) {
		// FLIP
		WRITE_ONCE(*IDX, 1);
		smp_mb(); // D
	
		// SCAN 2
		unlock0 = READ_ONCE(*UNLOCK0);
		smp_mb(); // A
		lock0 = READ_ONCE(*LOCK0);
	}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx;

	// 1st reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// second reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
	
	// third reader
	idx = READ_ONCE(*IDX);
	if (idx == 0) {
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:unlock0=0 /\ 1:idx=0)
Frederic Weisbecker Dec. 22, 2022, 12:16 p.m. UTC | #52
On Wed, Dec 21, 2022 at 12:19:47PM -0800, Boqun Feng wrote:
> On Wed, Dec 21, 2022 at 06:30:05PM +0100, Frederic Weisbecker wrote:
> By "same effect" you mean removing E results in the exist-clause
> triggered? If so, then our environments disagree with each other ;-)

Nope, removing D :-)

(And yeah probably I misread your previous mail and we weren't talking
about the same thing...)

Thanks.
Frederic Weisbecker Dec. 22, 2022, 12:24 p.m. UTC | #53
On Thu, Dec 22, 2022 at 01:16:12PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 12:19:47PM -0800, Boqun Feng wrote:
> > On Wed, Dec 21, 2022 at 06:30:05PM +0100, Frederic Weisbecker wrote:
> > By "same effect" you mean removing E results in the exist-clause
> > triggered? If so, then our environments disagree with each other ;-)
> 
> Nope, removing D :-)
> 
> (And yeah probably I misread your previous mail and we weren't talking
> about the same thing...)

Yeah checking that again, I got confused in the discussion between my two
tests srcu-E and srcu-D. Anyway, you were right about the control dependency :)

> 
> Thanks.
Frederic Weisbecker Dec. 22, 2022, 12:40 p.m. UTC | #54
On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> > On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> [...]
> > > 
> > > The memory ordering constraint I am concerned about here is:
> > > 
> > >   * [...] In addition,
> > >   * each CPU having an SRCU read-side critical section that extends beyond
> > >   * the return from synchronize_srcu() is guaranteed to have executed a
> > >   * full memory barrier after the beginning of synchronize_srcu() and before
> > >   * the beginning of that SRCU read-side critical section. [...]
> > > 
> > > So if we have a SRCU read-side critical section that begins after the beginning
> > > of synchronize_srcu, but before its first memory barrier, it would miss the
> > > guarantee that the full memory barrier is issued before the beginning of that
> > > SRCU read-side critical section. IOW, that memory barrier needs to be at the
> > > very beginning of the grace period.
> > 
> > I'm confused, what's wrong with this ?
> > 
> > UPDATER                  READER
> > -------                  ------
> > STORE X = 1              STORE srcu_read_lock++
> > // rcu_seq_snap()        smp_mb()
> > smp_mb()                 READ X
> > // scans
> > READ srcu_read_lock
> 
> What you refer to here is only memory ordering of the store to X and load
> from X wrt loading/increment of srcu_read_lock, which is internal to the
> srcu implementation. If we really want to model the provided high-level
> memory ordering guarantees, we should consider a scenario where SRCU is used
> for its memory ordering properties to synchronize other variables.
> 
> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> and srcu_read_lock/unlock would be used instead of memory barriers:
> 
> Initial state: X = 0, Y = 0
> 
> Thread A                   Thread B
> ---------------------------------------------
> STORE X = 1                STORE Y = 1
> synchronize_srcu()
>                            srcu_read_lock()
>                            r1 = LOAD X
>                            srcu_read_unlock()
> r0 = LOAD Y
> 
> BUG_ON(!r0 && !r1)
> 
> So in the synchronize_srcu implementation, there appears to be two
> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> or it uses an already started gp/expedited gp. When snapshotting with
> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> load means that it does not order prior memory accesses before that load.
> This sequence value is then used to identify which gp_seq to wait for when
> piggy-backing on another already-started gp. I worry about reordering
> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> piggy-back on an already-started gp.
> 
> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> all this behave as expected. But without documentation it's rather hard to
> follow.

Oh ok I see now. It might be working that way by accident or on forgotten
purpose. In any case, we really want to add a comment above that
__srcu_read_lock_nmisafe() call.
Joel Fernandes Dec. 22, 2022, 1:19 p.m. UTC | #55
> On Dec 22, 2022, at 7:40 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
>> [...]
>>>> 
>>>> The memory ordering constraint I am concerned about here is:
>>>> 
>>>>  * [...] In addition,
>>>>  * each CPU having an SRCU read-side critical section that extends beyond
>>>>  * the return from synchronize_srcu() is guaranteed to have executed a
>>>>  * full memory barrier after the beginning of synchronize_srcu() and before
>>>>  * the beginning of that SRCU read-side critical section. [...]
>>>> 
>>>> So if we have a SRCU read-side critical section that begins after the beginning
>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
>>>> guarantee that the full memory barrier is issued before the beginning of that
>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
>>>> very beginning of the grace period.
>>> 
>>> I'm confused, what's wrong with this ?
>>> 
>>> UPDATER                  READER
>>> -------                  ------
>>> STORE X = 1              STORE srcu_read_lock++
>>> // rcu_seq_snap()        smp_mb()
>>> smp_mb()                 READ X
>>> // scans
>>> READ srcu_read_lock
>> 
>> What you refer to here is only memory ordering of the store to X and load
>> from X wrt loading/increment of srcu_read_lock, which is internal to the
>> srcu implementation. If we really want to model the provided high-level
>> memory ordering guarantees, we should consider a scenario where SRCU is used
>> for its memory ordering properties to synchronize other variables.
>> 
>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
>> and srcu_read_lock/unlock would be used instead of memory barriers:
>> 
>> Initial state: X = 0, Y = 0
>> 
>> Thread A                   Thread B
>> ---------------------------------------------
>> STORE X = 1                STORE Y = 1
>> synchronize_srcu()
>>                           srcu_read_lock()
>>                           r1 = LOAD X
>>                           srcu_read_unlock()
>> r0 = LOAD Y
>> 
>> BUG_ON(!r0 && !r1)
>> 
>> So in the synchronize_srcu implementation, there appears to be two
>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
>> or it uses an already started gp/expedited gp. When snapshotting with
>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
>> load means that it does not order prior memory accesses before that load.
>> This sequence value is then used to identify which gp_seq to wait for when
>> piggy-backing on another already-started gp. I worry about reordering
>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
>> piggy-back on an already-started gp.
>> 
>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
>> all this behave as expected. But without documentation it's rather hard to
>> follow.
> 
> Oh ok I see now. It might be working that way by accident or on forgotten
> purpose. In any case, we really want to add a comment above that
> __srcu_read_lock_nmisafe() call.

Agreed on the analysis and the need for comments, thanks!

I think we also ought to document some more cases like, how the memory barriers here relate to the other memory barrier A inside the scan loop. But I guess for me I need to understand the GP guarantee ordering first before attempting documenting that aspect. Oh well, thank God for holidays. ;-)

Thanks,

 - Joel



>
Paul E. McKenney Dec. 22, 2022, 4:43 p.m. UTC | #56
On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> > On 2022-12-21 06:59, Frederic Weisbecker wrote:
> > > On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> > [...]
> > > > 
> > > > The memory ordering constraint I am concerned about here is:
> > > > 
> > > >   * [...] In addition,
> > > >   * each CPU having an SRCU read-side critical section that extends beyond
> > > >   * the return from synchronize_srcu() is guaranteed to have executed a
> > > >   * full memory barrier after the beginning of synchronize_srcu() and before
> > > >   * the beginning of that SRCU read-side critical section. [...]
> > > > 
> > > > So if we have a SRCU read-side critical section that begins after the beginning
> > > > of synchronize_srcu, but before its first memory barrier, it would miss the
> > > > guarantee that the full memory barrier is issued before the beginning of that
> > > > SRCU read-side critical section. IOW, that memory barrier needs to be at the
> > > > very beginning of the grace period.
> > > 
> > > I'm confused, what's wrong with this ?
> > > 
> > > UPDATER                  READER
> > > -------                  ------
> > > STORE X = 1              STORE srcu_read_lock++
> > > // rcu_seq_snap()        smp_mb()
> > > smp_mb()                 READ X
> > > // scans
> > > READ srcu_read_lock
> > 
> > What you refer to here is only memory ordering of the store to X and load
> > from X wrt loading/increment of srcu_read_lock, which is internal to the
> > srcu implementation. If we really want to model the provided high-level
> > memory ordering guarantees, we should consider a scenario where SRCU is used
> > for its memory ordering properties to synchronize other variables.
> > 
> > I'm concerned about the following Dekker scenario, where synchronize_srcu()
> > and srcu_read_lock/unlock would be used instead of memory barriers:
> > 
> > Initial state: X = 0, Y = 0
> > 
> > Thread A                   Thread B
> > ---------------------------------------------
> > STORE X = 1                STORE Y = 1
> > synchronize_srcu()
> >                            srcu_read_lock()
> >                            r1 = LOAD X
> >                            srcu_read_unlock()
> > r0 = LOAD Y
> > 
> > BUG_ON(!r0 && !r1)
> > 
> > So in the synchronize_srcu implementation, there appears to be two
> > major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> > or it uses an already started gp/expedited gp. When snapshotting with
> > rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> > load means that it does not order prior memory accesses before that load.
> > This sequence value is then used to identify which gp_seq to wait for when
> > piggy-backing on another already-started gp. I worry about reordering
> > between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> > piggy-back on an already-started gp.
> > 
> > I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > all this behave as expected. But without documentation it's rather hard to
> > follow.
> 
> Oh ok I see now. It might be working that way by accident or on forgotten
> purpose. In any case, we really want to add a comment above that
> __srcu_read_lock_nmisafe() call.

Another test for the safety (or not) of removing either D or E is
to move that WRITE_ONCE() to follow (or, respectively, precede) the
adjacent scans.

							Thanx, Paul
Joel Fernandes Dec. 22, 2022, 6:19 p.m. UTC | #57
> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
>>> [...]
>>>>> 
>>>>> The memory ordering constraint I am concerned about here is:
>>>>> 
>>>>>  * [...] In addition,
>>>>>  * each CPU having an SRCU read-side critical section that extends beyond
>>>>>  * the return from synchronize_srcu() is guaranteed to have executed a
>>>>>  * full memory barrier after the beginning of synchronize_srcu() and before
>>>>>  * the beginning of that SRCU read-side critical section. [...]
>>>>> 
>>>>> So if we have a SRCU read-side critical section that begins after the beginning
>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
>>>>> guarantee that the full memory barrier is issued before the beginning of that
>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
>>>>> very beginning of the grace period.
>>>> 
>>>> I'm confused, what's wrong with this ?
>>>> 
>>>> UPDATER                  READER
>>>> -------                  ------
>>>> STORE X = 1              STORE srcu_read_lock++
>>>> // rcu_seq_snap()        smp_mb()
>>>> smp_mb()                 READ X
>>>> // scans
>>>> READ srcu_read_lock
>>> 
>>> What you refer to here is only memory ordering of the store to X and load
>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
>>> srcu implementation. If we really want to model the provided high-level
>>> memory ordering guarantees, we should consider a scenario where SRCU is used
>>> for its memory ordering properties to synchronize other variables.
>>> 
>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
>>> and srcu_read_lock/unlock would be used instead of memory barriers:
>>> 
>>> Initial state: X = 0, Y = 0
>>> 
>>> Thread A                   Thread B
>>> ---------------------------------------------
>>> STORE X = 1                STORE Y = 1
>>> synchronize_srcu()
>>>                           srcu_read_lock()
>>>                           r1 = LOAD X
>>>                           srcu_read_unlock()
>>> r0 = LOAD Y
>>> 
>>> BUG_ON(!r0 && !r1)
>>> 
>>> So in the synchronize_srcu implementation, there appears to be two
>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
>>> or it uses an already started gp/expedited gp. When snapshotting with
>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
>>> load means that it does not order prior memory accesses before that load.
>>> This sequence value is then used to identify which gp_seq to wait for when
>>> piggy-backing on another already-started gp. I worry about reordering
>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
>>> piggy-back on an already-started gp.
>>> 
>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
>>> all this behave as expected. But without documentation it's rather hard to
>>> follow.
>> 
>> Oh ok I see now. It might be working that way by accident or on forgotten
>> purpose. In any case, we really want to add a comment above that
>> __srcu_read_lock_nmisafe() call.
> 
> Another test for the safety (or not) of removing either D or E is
> to move that WRITE_ONCE() to follow (or, respectively, precede) the
> adjacent scans.

Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.

So that (flipping MBs) is unrelated, or did I miss something?

Thanks.



> 
>                            Thanx, Paul
Paul E. McKenney Dec. 22, 2022, 6:53 p.m. UTC | #58
On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> >>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> >>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> >>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> >>> [...]
> >>>>> 
> >>>>> The memory ordering constraint I am concerned about here is:
> >>>>> 
> >>>>>  * [...] In addition,
> >>>>>  * each CPU having an SRCU read-side critical section that extends beyond
> >>>>>  * the return from synchronize_srcu() is guaranteed to have executed a
> >>>>>  * full memory barrier after the beginning of synchronize_srcu() and before
> >>>>>  * the beginning of that SRCU read-side critical section. [...]
> >>>>> 
> >>>>> So if we have a SRCU read-side critical section that begins after the beginning
> >>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> >>>>> guarantee that the full memory barrier is issued before the beginning of that
> >>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> >>>>> very beginning of the grace period.
> >>>> 
> >>>> I'm confused, what's wrong with this ?
> >>>> 
> >>>> UPDATER                  READER
> >>>> -------                  ------
> >>>> STORE X = 1              STORE srcu_read_lock++
> >>>> // rcu_seq_snap()        smp_mb()
> >>>> smp_mb()                 READ X
> >>>> // scans
> >>>> READ srcu_read_lock
> >>> 
> >>> What you refer to here is only memory ordering of the store to X and load
> >>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> >>> srcu implementation. If we really want to model the provided high-level
> >>> memory ordering guarantees, we should consider a scenario where SRCU is used
> >>> for its memory ordering properties to synchronize other variables.
> >>> 
> >>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> >>> and srcu_read_lock/unlock would be used instead of memory barriers:
> >>> 
> >>> Initial state: X = 0, Y = 0
> >>> 
> >>> Thread A                   Thread B
> >>> ---------------------------------------------
> >>> STORE X = 1                STORE Y = 1
> >>> synchronize_srcu()
> >>>                           srcu_read_lock()
> >>>                           r1 = LOAD X
> >>>                           srcu_read_unlock()
> >>> r0 = LOAD Y
> >>> 
> >>> BUG_ON(!r0 && !r1)
> >>> 
> >>> So in the synchronize_srcu implementation, there appears to be two
> >>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> >>> or it uses an already started gp/expedited gp. When snapshotting with
> >>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> >>> load means that it does not order prior memory accesses before that load.
> >>> This sequence value is then used to identify which gp_seq to wait for when
> >>> piggy-backing on another already-started gp. I worry about reordering
> >>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> >>> piggy-back on an already-started gp.
> >>> 
> >>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> >>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> >>> all this behave as expected. But without documentation it's rather hard to
> >>> follow.
> >> 
> >> Oh ok I see now. It might be working that way by accident or on forgotten
> >> purpose. In any case, we really want to add a comment above that
> >> __srcu_read_lock_nmisafe() call.
> > 
> > Another test for the safety (or not) of removing either D or E is
> > to move that WRITE_ONCE() to follow (or, respectively, precede) the
> > adjacent scans.
> 
> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
> 
> So that (flipping MBs) is unrelated, or did I miss something?

The thought is to manually similate in the source code the maximum
memory-reference reordering that a maximally hostile compiler and CPU
would be permitted to carry out.  So yes, given that there are other
memory barriers before and after, these other memory barriers limit how
far the flip may be moved in the source code.

Here I am talking about the memory barriers associated with the flip,
but the same trick can of course be applied to other memory barriers.
In general, remove a given memory barrier and (in the source code)
maximally rearrange the memory references that were previously ordered
by the memory barrier in question.

Again, the presence of other memory barriers will limit the permitted
maximal source-code rearrangement.

							Thanx, Paul
Joel Fernandes Dec. 22, 2022, 6:56 p.m. UTC | #59
> On Dec 22, 2022, at 1:53 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote:
>> 
>> 
>>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
>>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
>>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
>>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
>>>>> [...]
>>>>>>> 
>>>>>>> The memory ordering constraint I am concerned about here is:
>>>>>>> 
>>>>>>> * [...] In addition,
>>>>>>> * each CPU having an SRCU read-side critical section that extends beyond
>>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a
>>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before
>>>>>>> * the beginning of that SRCU read-side critical section. [...]
>>>>>>> 
>>>>>>> So if we have a SRCU read-side critical section that begins after the beginning
>>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
>>>>>>> guarantee that the full memory barrier is issued before the beginning of that
>>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
>>>>>>> very beginning of the grace period.
>>>>>> 
>>>>>> I'm confused, what's wrong with this ?
>>>>>> 
>>>>>> UPDATER                  READER
>>>>>> -------                  ------
>>>>>> STORE X = 1              STORE srcu_read_lock++
>>>>>> // rcu_seq_snap()        smp_mb()
>>>>>> smp_mb()                 READ X
>>>>>> // scans
>>>>>> READ srcu_read_lock
>>>>> 
>>>>> What you refer to here is only memory ordering of the store to X and load
>>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
>>>>> srcu implementation. If we really want to model the provided high-level
>>>>> memory ordering guarantees, we should consider a scenario where SRCU is used
>>>>> for its memory ordering properties to synchronize other variables.
>>>>> 
>>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
>>>>> and srcu_read_lock/unlock would be used instead of memory barriers:
>>>>> 
>>>>> Initial state: X = 0, Y = 0
>>>>> 
>>>>> Thread A                   Thread B
>>>>> ---------------------------------------------
>>>>> STORE X = 1                STORE Y = 1
>>>>> synchronize_srcu()
>>>>>                          srcu_read_lock()
>>>>>                          r1 = LOAD X
>>>>>                          srcu_read_unlock()
>>>>> r0 = LOAD Y
>>>>> 
>>>>> BUG_ON(!r0 && !r1)
>>>>> 
>>>>> So in the synchronize_srcu implementation, there appears to be two
>>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
>>>>> or it uses an already started gp/expedited gp. When snapshotting with
>>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
>>>>> load means that it does not order prior memory accesses before that load.
>>>>> This sequence value is then used to identify which gp_seq to wait for when
>>>>> piggy-backing on another already-started gp. I worry about reordering
>>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
>>>>> piggy-back on an already-started gp.
>>>>> 
>>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
>>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
>>>>> all this behave as expected. But without documentation it's rather hard to
>>>>> follow.
>>>> 
>>>> Oh ok I see now. It might be working that way by accident or on forgotten
>>>> purpose. In any case, we really want to add a comment above that
>>>> __srcu_read_lock_nmisafe() call.
>>> 
>>> Another test for the safety (or not) of removing either D or E is
>>> to move that WRITE_ONCE() to follow (or, respectively, precede) the
>>> adjacent scans.
>> 
>> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
>> 
>> So that (flipping MBs) is unrelated, or did I miss something?
> 
> The thought is to manually similate in the source code the maximum
> memory-reference reordering that a maximally hostile compiler and CPU
> would be permitted to carry out.  So yes, given that there are other
> memory barriers before and after, these other memory barriers limit how
> far the flip may be moved in the source code.
> 
> Here I am talking about the memory barriers associated with the flip,
> but the same trick can of course be applied to other memory barriers.
> In general, remove a given memory barrier and (in the source code)
> maximally rearrange the memory references that were previously ordered
> by the memory barrier in question.
> 
> Again, the presence of other memory barriers will limit the permitted
> maximal source-code rearrangement.


Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers.

Thanks.




> 
>                            Thanx, Paul
Paul E. McKenney Dec. 22, 2022, 7:45 p.m. UTC | #60
On Thu, Dec 22, 2022 at 01:56:17PM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 22, 2022, at 1:53 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote:
> >> 
> >> 
> >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> 
> >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> >>>>> [...]
> >>>>>>> 
> >>>>>>> The memory ordering constraint I am concerned about here is:
> >>>>>>> 
> >>>>>>> * [...] In addition,
> >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond
> >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a
> >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before
> >>>>>>> * the beginning of that SRCU read-side critical section. [...]
> >>>>>>> 
> >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning
> >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> >>>>>>> guarantee that the full memory barrier is issued before the beginning of that
> >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> >>>>>>> very beginning of the grace period.
> >>>>>> 
> >>>>>> I'm confused, what's wrong with this ?
> >>>>>> 
> >>>>>> UPDATER                  READER
> >>>>>> -------                  ------
> >>>>>> STORE X = 1              STORE srcu_read_lock++
> >>>>>> // rcu_seq_snap()        smp_mb()
> >>>>>> smp_mb()                 READ X
> >>>>>> // scans
> >>>>>> READ srcu_read_lock
> >>>>> 
> >>>>> What you refer to here is only memory ordering of the store to X and load
> >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> >>>>> srcu implementation. If we really want to model the provided high-level
> >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used
> >>>>> for its memory ordering properties to synchronize other variables.
> >>>>> 
> >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> >>>>> and srcu_read_lock/unlock would be used instead of memory barriers:
> >>>>> 
> >>>>> Initial state: X = 0, Y = 0
> >>>>> 
> >>>>> Thread A                   Thread B
> >>>>> ---------------------------------------------
> >>>>> STORE X = 1                STORE Y = 1
> >>>>> synchronize_srcu()
> >>>>>                          srcu_read_lock()
> >>>>>                          r1 = LOAD X
> >>>>>                          srcu_read_unlock()
> >>>>> r0 = LOAD Y
> >>>>> 
> >>>>> BUG_ON(!r0 && !r1)
> >>>>> 
> >>>>> So in the synchronize_srcu implementation, there appears to be two
> >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> >>>>> or it uses an already started gp/expedited gp. When snapshotting with
> >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> >>>>> load means that it does not order prior memory accesses before that load.
> >>>>> This sequence value is then used to identify which gp_seq to wait for when
> >>>>> piggy-backing on another already-started gp. I worry about reordering
> >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> >>>>> piggy-back on an already-started gp.
> >>>>> 
> >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> >>>>> all this behave as expected. But without documentation it's rather hard to
> >>>>> follow.
> >>>> 
> >>>> Oh ok I see now. It might be working that way by accident or on forgotten
> >>>> purpose. In any case, we really want to add a comment above that
> >>>> __srcu_read_lock_nmisafe() call.
> >>> 
> >>> Another test for the safety (or not) of removing either D or E is
> >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the
> >>> adjacent scans.
> >> 
> >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
> >> 
> >> So that (flipping MBs) is unrelated, or did I miss something?
> > 
> > The thought is to manually similate in the source code the maximum
> > memory-reference reordering that a maximally hostile compiler and CPU
> > would be permitted to carry out.  So yes, given that there are other
> > memory barriers before and after, these other memory barriers limit how
> > far the flip may be moved in the source code.
> > 
> > Here I am talking about the memory barriers associated with the flip,
> > but the same trick can of course be applied to other memory barriers.
> > In general, remove a given memory barrier and (in the source code)
> > maximally rearrange the memory references that were previously ordered
> > by the memory barrier in question.
> > 
> > Again, the presence of other memory barriers will limit the permitted
> > maximal source-code rearrangement.
> 
> 
> Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers.

First, are we talking about the same barrier?  I am talking about E.

Yes, this would require a bit of restructuring.  The overall
approach would be something like this, in SRCU_STATE_SCAN1:

1.	Scan the unlocks.

2.	smp_mb(); /* A */

3.	Flip the index.

4.	Scan the locks.

5.	If unlocks == locks, advance the state to SRCU_STATE_SCAN2.

6.	Otherwise, execute the current SRCU_STATE_SCAN1 code.

Give or take the usual devils in the details.

Alternatively, remove E and hammer it on a weakly ordered system.

							Thanx, Paul
Joel Fernandes Dec. 23, 2022, 4:43 a.m. UTC | #61
On Thu, Dec 22, 2022 at 2:45 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Dec 22, 2022 at 01:56:17PM -0500, Joel Fernandes wrote:
> >
> >
> > > On Dec 22, 2022, at 1:53 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote:
> > >>
> > >>
> > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >>>
> > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> > >>>>> [...]
> > >>>>>>>
> > >>>>>>> The memory ordering constraint I am concerned about here is:
> > >>>>>>>
> > >>>>>>> * [...] In addition,
> > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond
> > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a
> > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before
> > >>>>>>> * the beginning of that SRCU read-side critical section. [...]
> > >>>>>>>
> > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning
> > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that
> > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> > >>>>>>> very beginning of the grace period.
> > >>>>>>
> > >>>>>> I'm confused, what's wrong with this ?
> > >>>>>>
> > >>>>>> UPDATER                  READER
> > >>>>>> -------                  ------
> > >>>>>> STORE X = 1              STORE srcu_read_lock++
> > >>>>>> // rcu_seq_snap()        smp_mb()
> > >>>>>> smp_mb()                 READ X
> > >>>>>> // scans
> > >>>>>> READ srcu_read_lock
> > >>>>>
> > >>>>> What you refer to here is only memory ordering of the store to X and load
> > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> > >>>>> srcu implementation. If we really want to model the provided high-level
> > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used
> > >>>>> for its memory ordering properties to synchronize other variables.
> > >>>>>
> > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers:
> > >>>>>
> > >>>>> Initial state: X = 0, Y = 0
> > >>>>>
> > >>>>> Thread A                   Thread B
> > >>>>> ---------------------------------------------
> > >>>>> STORE X = 1                STORE Y = 1
> > >>>>> synchronize_srcu()
> > >>>>>                          srcu_read_lock()
> > >>>>>                          r1 = LOAD X
> > >>>>>                          srcu_read_unlock()
> > >>>>> r0 = LOAD Y
> > >>>>>
> > >>>>> BUG_ON(!r0 && !r1)
> > >>>>>
> > >>>>> So in the synchronize_srcu implementation, there appears to be two
> > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> > >>>>> or it uses an already started gp/expedited gp. When snapshotting with
> > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> > >>>>> load means that it does not order prior memory accesses before that load.
> > >>>>> This sequence value is then used to identify which gp_seq to wait for when
> > >>>>> piggy-backing on another already-started gp. I worry about reordering
> > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> > >>>>> piggy-back on an already-started gp.
> > >>>>>
> > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > >>>>> all this behave as expected. But without documentation it's rather hard to
> > >>>>> follow.
> > >>>>
> > >>>> Oh ok I see now. It might be working that way by accident or on forgotten
> > >>>> purpose. In any case, we really want to add a comment above that
> > >>>> __srcu_read_lock_nmisafe() call.
> > >>>
> > >>> Another test for the safety (or not) of removing either D or E is
> > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the
> > >>> adjacent scans.
> > >>
> > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
> > >>
> > >> So that (flipping MBs) is unrelated, or did I miss something?
> > >
> > > The thought is to manually similate in the source code the maximum
> > > memory-reference reordering that a maximally hostile compiler and CPU
> > > would be permitted to carry out.  So yes, given that there are other
> > > memory barriers before and after, these other memory barriers limit how
> > > far the flip may be moved in the source code.
> > >
> > > Here I am talking about the memory barriers associated with the flip,
> > > but the same trick can of course be applied to other memory barriers.
> > > In general, remove a given memory barrier and (in the source code)
> > > maximally rearrange the memory references that were previously ordered
> > > by the memory barrier in question.
> > >
> > > Again, the presence of other memory barriers will limit the permitted
> > > maximal source-code rearrangement.
> >
> >
> > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers.
>
> First, are we talking about the same barrier?  I am talking about E.

No, in the last part you replied to above, Mathieu and Frederic were
talking about the need for memory barriers in synchronize_srcu(). That
has nothing to do with E:
<quote>
 I suspect that the implicit barrier in srcu_read_lock() invoked at the
 beginning of srcu_gp_start_if_needed() is really the barrier that makes
 all this behave as expected.
</quote>

We need to order code prior to synchronize_srcu() wrt the start of the
grace period, so that readers that started after the grace period
started see those side-effects as they may not be waited on (they are
too late).

> Alternatively, remove E and hammer it on a weakly ordered system.

But E is useless, see below patch link with litmus test [1]. I don't
see what hammering it on a weakly ordered system buys if it does not
even execute for the cases when it is supposed to help because of
causality with the control dependency of the prior scan :-/

Thanks,

  -Joel
[1] https://lore.kernel.org/rcu/19A03A16-9F38-481F-824D-4C883AB8B105@joelfernandes.org/T/#m60e37d1277a8dd626bb95f92083e4e602c429704
Joel Fernandes Dec. 23, 2022, 4:12 p.m. UTC | #62
On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
[...]
> > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >>>
> > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> > > >>>>> [...]
> > > >>>>>>>
> > > >>>>>>> The memory ordering constraint I am concerned about here is:
> > > >>>>>>>
> > > >>>>>>> * [...] In addition,
> > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond
> > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a
> > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before
> > > >>>>>>> * the beginning of that SRCU read-side critical section. [...]
> > > >>>>>>>
> > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning
> > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that
> > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> > > >>>>>>> very beginning of the grace period.
> > > >>>>>>
> > > >>>>>> I'm confused, what's wrong with this ?
> > > >>>>>>
> > > >>>>>> UPDATER                  READER
> > > >>>>>> -------                  ------
> > > >>>>>> STORE X = 1              STORE srcu_read_lock++
> > > >>>>>> // rcu_seq_snap()        smp_mb()
> > > >>>>>> smp_mb()                 READ X
> > > >>>>>> // scans
> > > >>>>>> READ srcu_read_lock
> > > >>>>>
> > > >>>>> What you refer to here is only memory ordering of the store to X and load
> > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> > > >>>>> srcu implementation. If we really want to model the provided high-level
> > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used
> > > >>>>> for its memory ordering properties to synchronize other variables.
> > > >>>>>
> > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers:
> > > >>>>>
> > > >>>>> Initial state: X = 0, Y = 0
> > > >>>>>
> > > >>>>> Thread A                   Thread B
> > > >>>>> ---------------------------------------------
> > > >>>>> STORE X = 1                STORE Y = 1
> > > >>>>> synchronize_srcu()
> > > >>>>>                          srcu_read_lock()
> > > >>>>>                          r1 = LOAD X
> > > >>>>>                          srcu_read_unlock()
> > > >>>>> r0 = LOAD Y
> > > >>>>>
> > > >>>>> BUG_ON(!r0 && !r1)
> > > >>>>>
> > > >>>>> So in the synchronize_srcu implementation, there appears to be two
> > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with
> > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> > > >>>>> load means that it does not order prior memory accesses before that load.
> > > >>>>> This sequence value is then used to identify which gp_seq to wait for when
> > > >>>>> piggy-backing on another already-started gp. I worry about reordering
> > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> > > >>>>> piggy-back on an already-started gp.
> > > >>>>>
> > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > > >>>>> all this behave as expected. But without documentation it's rather hard to
> > > >>>>> follow.
> > > >>>>
> > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten
> > > >>>> purpose. In any case, we really want to add a comment above that
> > > >>>> __srcu_read_lock_nmisafe() call.
> > > >>>
> > > >>> Another test for the safety (or not) of removing either D or E is
> > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the
> > > >>> adjacent scans.
> > > >>
> > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
> > > >>
> > > >> So that (flipping MBs) is unrelated, or did I miss something?
> > > >
> > > > The thought is to manually similate in the source code the maximum
> > > > memory-reference reordering that a maximally hostile compiler and CPU
> > > > would be permitted to carry out.  So yes, given that there are other
> > > > memory barriers before and after, these other memory barriers limit how
> > > > far the flip may be moved in the source code.
> > > >
> > > > Here I am talking about the memory barriers associated with the flip,
> > > > but the same trick can of course be applied to other memory barriers.
> > > > In general, remove a given memory barrier and (in the source code)
> > > > maximally rearrange the memory references that were previously ordered
> > > > by the memory barrier in question.
> > > >
> > > > Again, the presence of other memory barriers will limit the permitted
> > > > maximal source-code rearrangement.
> > >
> > >
> > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers.
> >
> > First, are we talking about the same barrier?  I am talking about E.
>
> No, in the last part you replied to above, Mathieu and Frederic were
> talking about the need for memory barriers in synchronize_srcu(). That
> has nothing to do with E:
> <quote>
>  I suspect that the implicit barrier in srcu_read_lock() invoked at the
>  beginning of srcu_gp_start_if_needed() is really the barrier that makes
>  all this behave as expected.
> </quote>
>
> We need to order code prior to synchronize_srcu() wrt the start of the
> grace period, so that readers that started after the grace period
> started see those side-effects as they may not be waited on (they are
> too late).

My thought is this is achieved by the srcu_read_lock() before the
grace period is started, and the start of the grace period (which is
basically the smp_mb() in the first scan).

So from memory ordering PoV, if synchronize_rcu() spans the GP, and
readers don't span the GP, that means the reader does not span the
synchronize_rcu() which is the GP guarantee.

But I could be missing something. I will dig more on my side. Thanks.
Paul E. McKenney Dec. 23, 2022, 6:15 p.m. UTC | #63
On Fri, Dec 23, 2022 at 11:12:06AM -0500, Joel Fernandes wrote:
> On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> [...]
> > > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >>>
> > > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> > > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> > > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> > > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> > > > >>>>> [...]
> > > > >>>>>>>
> > > > >>>>>>> The memory ordering constraint I am concerned about here is:
> > > > >>>>>>>
> > > > >>>>>>> * [...] In addition,
> > > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond
> > > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a
> > > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before
> > > > >>>>>>> * the beginning of that SRCU read-side critical section. [...]
> > > > >>>>>>>
> > > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning
> > > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> > > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that
> > > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> > > > >>>>>>> very beginning of the grace period.
> > > > >>>>>>
> > > > >>>>>> I'm confused, what's wrong with this ?
> > > > >>>>>>
> > > > >>>>>> UPDATER                  READER
> > > > >>>>>> -------                  ------
> > > > >>>>>> STORE X = 1              STORE srcu_read_lock++
> > > > >>>>>> // rcu_seq_snap()        smp_mb()
> > > > >>>>>> smp_mb()                 READ X
> > > > >>>>>> // scans
> > > > >>>>>> READ srcu_read_lock
> > > > >>>>>
> > > > >>>>> What you refer to here is only memory ordering of the store to X and load
> > > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> > > > >>>>> srcu implementation. If we really want to model the provided high-level
> > > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used
> > > > >>>>> for its memory ordering properties to synchronize other variables.
> > > > >>>>>
> > > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> > > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers:
> > > > >>>>>
> > > > >>>>> Initial state: X = 0, Y = 0
> > > > >>>>>
> > > > >>>>> Thread A                   Thread B
> > > > >>>>> ---------------------------------------------
> > > > >>>>> STORE X = 1                STORE Y = 1
> > > > >>>>> synchronize_srcu()
> > > > >>>>>                          srcu_read_lock()
> > > > >>>>>                          r1 = LOAD X
> > > > >>>>>                          srcu_read_unlock()
> > > > >>>>> r0 = LOAD Y
> > > > >>>>>
> > > > >>>>> BUG_ON(!r0 && !r1)
> > > > >>>>>
> > > > >>>>> So in the synchronize_srcu implementation, there appears to be two
> > > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> > > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with
> > > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> > > > >>>>> load means that it does not order prior memory accesses before that load.
> > > > >>>>> This sequence value is then used to identify which gp_seq to wait for when
> > > > >>>>> piggy-backing on another already-started gp. I worry about reordering
> > > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> > > > >>>>> piggy-back on an already-started gp.
> > > > >>>>>
> > > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > > > >>>>> all this behave as expected. But without documentation it's rather hard to
> > > > >>>>> follow.
> > > > >>>>
> > > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten
> > > > >>>> purpose. In any case, we really want to add a comment above that
> > > > >>>> __srcu_read_lock_nmisafe() call.
> > > > >>>
> > > > >>> Another test for the safety (or not) of removing either D or E is
> > > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the
> > > > >>> adjacent scans.
> > > > >>
> > > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
> > > > >>
> > > > >> So that (flipping MBs) is unrelated, or did I miss something?
> > > > >
> > > > > The thought is to manually similate in the source code the maximum
> > > > > memory-reference reordering that a maximally hostile compiler and CPU
> > > > > would be permitted to carry out.  So yes, given that there are other
> > > > > memory barriers before and after, these other memory barriers limit how
> > > > > far the flip may be moved in the source code.
> > > > >
> > > > > Here I am talking about the memory barriers associated with the flip,
> > > > > but the same trick can of course be applied to other memory barriers.
> > > > > In general, remove a given memory barrier and (in the source code)
> > > > > maximally rearrange the memory references that were previously ordered
> > > > > by the memory barrier in question.
> > > > >
> > > > > Again, the presence of other memory barriers will limit the permitted
> > > > > maximal source-code rearrangement.
> > > >
> > > >
> > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers.
> > >
> > > First, are we talking about the same barrier?  I am talking about E.

Apologies.  I am a bit fixated on E because it is the one you guys
proposed eliminating.  ;-)

> > No, in the last part you replied to above, Mathieu and Frederic were
> > talking about the need for memory barriers in synchronize_srcu(). That
> > has nothing to do with E:
> > <quote>
> >  I suspect that the implicit barrier in srcu_read_lock() invoked at the
> >  beginning of srcu_gp_start_if_needed() is really the barrier that makes
> >  all this behave as expected.
> > </quote>
> >
> > We need to order code prior to synchronize_srcu() wrt the start of the
> > grace period, so that readers that started after the grace period
> > started see those side-effects as they may not be waited on (they are
> > too late).
> 
> My thought is this is achieved by the srcu_read_lock() before the
> grace period is started, and the start of the grace period (which is
> basically the smp_mb() in the first scan).
> 
> So from memory ordering PoV, if synchronize_rcu() spans the GP, and
> readers don't span the GP, that means the reader does not span the
> synchronize_rcu() which is the GP guarantee.
> 
> But I could be missing something. I will dig more on my side. Thanks.

Could you please specifically identify that srcu_read_lock()?

Also which version you are looking at.  ;-)

							Thanx, Paul
Joel Fernandes Dec. 23, 2022, 8:10 p.m. UTC | #64
On Fri, Dec 23, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Dec 23, 2022 at 11:12:06AM -0500, Joel Fernandes wrote:
> > On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > [...]
> > > > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >>>
> > > > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> > > > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> > > > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> > > > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> > > > > >>>>> [...]
> > > > > >>>>>>>
> > > > > >>>>>>> The memory ordering constraint I am concerned about here is:
> > > > > >>>>>>>
> > > > > >>>>>>> * [...] In addition,
> > > > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond
> > > > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a
> > > > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before
> > > > > >>>>>>> * the beginning of that SRCU read-side critical section. [...]
> > > > > >>>>>>>
> > > > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning
> > > > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> > > > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that
> > > > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> > > > > >>>>>>> very beginning of the grace period.
> > > > > >>>>>>
> > > > > >>>>>> I'm confused, what's wrong with this ?
> > > > > >>>>>>
> > > > > >>>>>> UPDATER                  READER
> > > > > >>>>>> -------                  ------
> > > > > >>>>>> STORE X = 1              STORE srcu_read_lock++
> > > > > >>>>>> // rcu_seq_snap()        smp_mb()
> > > > > >>>>>> smp_mb()                 READ X
> > > > > >>>>>> // scans
> > > > > >>>>>> READ srcu_read_lock
> > > > > >>>>>
> > > > > >>>>> What you refer to here is only memory ordering of the store to X and load
> > > > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> > > > > >>>>> srcu implementation. If we really want to model the provided high-level
> > > > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used
> > > > > >>>>> for its memory ordering properties to synchronize other variables.
> > > > > >>>>>
> > > > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> > > > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers:
> > > > > >>>>>
> > > > > >>>>> Initial state: X = 0, Y = 0
> > > > > >>>>>
> > > > > >>>>> Thread A                   Thread B
> > > > > >>>>> ---------------------------------------------
> > > > > >>>>> STORE X = 1                STORE Y = 1
> > > > > >>>>> synchronize_srcu()
> > > > > >>>>>                          srcu_read_lock()
> > > > > >>>>>                          r1 = LOAD X
> > > > > >>>>>                          srcu_read_unlock()
> > > > > >>>>> r0 = LOAD Y
> > > > > >>>>>
> > > > > >>>>> BUG_ON(!r0 && !r1)
> > > > > >>>>>
> > > > > >>>>> So in the synchronize_srcu implementation, there appears to be two
> > > > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> > > > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with
> > > > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> > > > > >>>>> load means that it does not order prior memory accesses before that load.
> > > > > >>>>> This sequence value is then used to identify which gp_seq to wait for when
> > > > > >>>>> piggy-backing on another already-started gp. I worry about reordering
> > > > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> > > > > >>>>> piggy-back on an already-started gp.
> > > > > >>>>>
> > > > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > > > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > > > > >>>>> all this behave as expected. But without documentation it's rather hard to
> > > > > >>>>> follow.
> > > > > >>>>
> > > > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten
> > > > > >>>> purpose. In any case, we really want to add a comment above that
> > > > > >>>> __srcu_read_lock_nmisafe() call.
> > > > > >>>
> > > > > >>> Another test for the safety (or not) of removing either D or E is
> > > > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the
> > > > > >>> adjacent scans.
> > > > > >>
> > > > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
> > > > > >>
> > > > > >> So that (flipping MBs) is unrelated, or did I miss something?
> > > > > >
> > > > > > The thought is to manually similate in the source code the maximum
> > > > > > memory-reference reordering that a maximally hostile compiler and CPU
> > > > > > would be permitted to carry out.  So yes, given that there are other
> > > > > > memory barriers before and after, these other memory barriers limit how
> > > > > > far the flip may be moved in the source code.
> > > > > >
> > > > > > Here I am talking about the memory barriers associated with the flip,
> > > > > > but the same trick can of course be applied to other memory barriers.
> > > > > > In general, remove a given memory barrier and (in the source code)
> > > > > > maximally rearrange the memory references that were previously ordered
> > > > > > by the memory barrier in question.
> > > > > >
> > > > > > Again, the presence of other memory barriers will limit the permitted
> > > > > > maximal source-code rearrangement.
> > > > >
> > > > >
> > > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers.
> > > >
> > > > First, are we talking about the same barrier?  I am talking about E.
>
> Apologies.  I am a bit fixated on E because it is the one you guys
> proposed eliminating.  ;-)

Ah ok, no worries. :-)

> > > No, in the last part you replied to above, Mathieu and Frederic were
> > > talking about the need for memory barriers in synchronize_srcu(). That
> > > has nothing to do with E:
> > > <quote>
> > >  I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > >  beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > >  all this behave as expected.
> > > </quote>
> > >
> > > We need to order code prior to synchronize_srcu() wrt the start of the
> > > grace period, so that readers that started after the grace period
> > > started see those side-effects as they may not be waited on (they are
> > > too late).
> >
> > My thought is this is achieved by the srcu_read_lock() before the
> > grace period is started, and the start of the grace period (which is
> > basically the smp_mb() in the first scan).
> >
> > So from memory ordering PoV, if synchronize_rcu() spans the GP, and
> > readers don't span the GP, that means the reader does not span the
> > synchronize_rcu() which is the GP guarantee.
> >
> > But I could be missing something. I will dig more on my side. Thanks.
>
> Could you please specifically identify that srcu_read_lock()?
>
> Also which version you are looking at.  ;-)

Should be this one in current -rcu:
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/kernel/rcu/srcutree.c#n1669

Thanks,

 - Joel
Paul E. McKenney Dec. 23, 2022, 8:52 p.m. UTC | #65
On Fri, Dec 23, 2022 at 03:10:40PM -0500, Joel Fernandes wrote:
> On Fri, Dec 23, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Dec 23, 2022 at 11:12:06AM -0500, Joel Fernandes wrote:
> > > On Thu, Dec 22, 2022 at 11:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > [...]
> > > > > > >>>> On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > >>>
> > > > > > >>> On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> > > > > > >>>>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> > > > > > >>>>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> > > > > > >>>>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> > > > > > >>>>> [...]
> > > > > > >>>>>>>
> > > > > > >>>>>>> The memory ordering constraint I am concerned about here is:
> > > > > > >>>>>>>
> > > > > > >>>>>>> * [...] In addition,
> > > > > > >>>>>>> * each CPU having an SRCU read-side critical section that extends beyond
> > > > > > >>>>>>> * the return from synchronize_srcu() is guaranteed to have executed a
> > > > > > >>>>>>> * full memory barrier after the beginning of synchronize_srcu() and before
> > > > > > >>>>>>> * the beginning of that SRCU read-side critical section. [...]
> > > > > > >>>>>>>
> > > > > > >>>>>>> So if we have a SRCU read-side critical section that begins after the beginning
> > > > > > >>>>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> > > > > > >>>>>>> guarantee that the full memory barrier is issued before the beginning of that
> > > > > > >>>>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> > > > > > >>>>>>> very beginning of the grace period.
> > > > > > >>>>>>
> > > > > > >>>>>> I'm confused, what's wrong with this ?
> > > > > > >>>>>>
> > > > > > >>>>>> UPDATER                  READER
> > > > > > >>>>>> -------                  ------
> > > > > > >>>>>> STORE X = 1              STORE srcu_read_lock++
> > > > > > >>>>>> // rcu_seq_snap()        smp_mb()
> > > > > > >>>>>> smp_mb()                 READ X
> > > > > > >>>>>> // scans
> > > > > > >>>>>> READ srcu_read_lock
> > > > > > >>>>>
> > > > > > >>>>> What you refer to here is only memory ordering of the store to X and load
> > > > > > >>>>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> > > > > > >>>>> srcu implementation. If we really want to model the provided high-level
> > > > > > >>>>> memory ordering guarantees, we should consider a scenario where SRCU is used
> > > > > > >>>>> for its memory ordering properties to synchronize other variables.
> > > > > > >>>>>
> > > > > > >>>>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> > > > > > >>>>> and srcu_read_lock/unlock would be used instead of memory barriers:
> > > > > > >>>>>
> > > > > > >>>>> Initial state: X = 0, Y = 0
> > > > > > >>>>>
> > > > > > >>>>> Thread A                   Thread B
> > > > > > >>>>> ---------------------------------------------
> > > > > > >>>>> STORE X = 1                STORE Y = 1
> > > > > > >>>>> synchronize_srcu()
> > > > > > >>>>>                          srcu_read_lock()
> > > > > > >>>>>                          r1 = LOAD X
> > > > > > >>>>>                          srcu_read_unlock()
> > > > > > >>>>> r0 = LOAD Y
> > > > > > >>>>>
> > > > > > >>>>> BUG_ON(!r0 && !r1)
> > > > > > >>>>>
> > > > > > >>>>> So in the synchronize_srcu implementation, there appears to be two
> > > > > > >>>>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> > > > > > >>>>> or it uses an already started gp/expedited gp. When snapshotting with
> > > > > > >>>>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> > > > > > >>>>> load means that it does not order prior memory accesses before that load.
> > > > > > >>>>> This sequence value is then used to identify which gp_seq to wait for when
> > > > > > >>>>> piggy-backing on another already-started gp. I worry about reordering
> > > > > > >>>>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> > > > > > >>>>> piggy-back on an already-started gp.
> > > > > > >>>>>
> > > > > > >>>>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > > > > > >>>>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > > > > > >>>>> all this behave as expected. But without documentation it's rather hard to
> > > > > > >>>>> follow.
> > > > > > >>>>
> > > > > > >>>> Oh ok I see now. It might be working that way by accident or on forgotten
> > > > > > >>>> purpose. In any case, we really want to add a comment above that
> > > > > > >>>> __srcu_read_lock_nmisafe() call.
> > > > > > >>>
> > > > > > >>> Another test for the safety (or not) of removing either D or E is
> > > > > > >>> to move that WRITE_ONCE() to follow (or, respectively, precede) the
> > > > > > >>> adjacent scans.
> > > > > > >>
> > > > > > >> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
> > > > > > >>
> > > > > > >> So that (flipping MBs) is unrelated, or did I miss something?
> > > > > > >
> > > > > > > The thought is to manually similate in the source code the maximum
> > > > > > > memory-reference reordering that a maximally hostile compiler and CPU
> > > > > > > would be permitted to carry out.  So yes, given that there are other
> > > > > > > memory barriers before and after, these other memory barriers limit how
> > > > > > > far the flip may be moved in the source code.
> > > > > > >
> > > > > > > Here I am talking about the memory barriers associated with the flip,
> > > > > > > but the same trick can of course be applied to other memory barriers.
> > > > > > > In general, remove a given memory barrier and (in the source code)
> > > > > > > maximally rearrange the memory references that were previously ordered
> > > > > > > by the memory barrier in question.
> > > > > > >
> > > > > > > Again, the presence of other memory barriers will limit the permitted
> > > > > > > maximal source-code rearrangement.
> > > > > >
> > > > > >
> > > > > > Makes sense if the memory barrier is explicit. In this case, the memory barriers are implicit apparently, with a srcu_read_lock() in the beginning of synchronize_rcu() having the implicit / indirect memory barrier. So I am not sure if that can be implemented without breaking SRCU readers.
> > > > >
> > > > > First, are we talking about the same barrier?  I am talking about E.
> >
> > Apologies.  I am a bit fixated on E because it is the one you guys
> > proposed eliminating.  ;-)
> 
> Ah ok, no worries. :-)
> 
> > > > No, in the last part you replied to above, Mathieu and Frederic were
> > > > talking about the need for memory barriers in synchronize_srcu(). That
> > > > has nothing to do with E:
> > > > <quote>
> > > >  I suspect that the implicit barrier in srcu_read_lock() invoked at the
> > > >  beginning of srcu_gp_start_if_needed() is really the barrier that makes
> > > >  all this behave as expected.
> > > > </quote>
> > > >
> > > > We need to order code prior to synchronize_srcu() wrt the start of the
> > > > grace period, so that readers that started after the grace period
> > > > started see those side-effects as they may not be waited on (they are
> > > > too late).
> > >
> > > My thought is this is achieved by the srcu_read_lock() before the
> > > grace period is started, and the start of the grace period (which is
> > > basically the smp_mb() in the first scan).
> > >
> > > So from memory ordering PoV, if synchronize_rcu() spans the GP, and
> > > readers don't span the GP, that means the reader does not span the
> > > synchronize_rcu() which is the GP guarantee.
> > >
> > > But I could be missing something. I will dig more on my side. Thanks.
> >
> > Could you please specifically identify that srcu_read_lock()?
> >
> > Also which version you are looking at.  ;-)
> 
> Should be this one in current -rcu:
> https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/kernel/rcu/srcutree.c#n1669

That gets me this line of code:

	int ss_state = READ_ONCE(ssp->srcu_size_state);

Which is in srcu_torture_stats_print(), so I am guessing that something
got lost in translation.  ;-)

							Thanx, Paul