
[04/10] rcu/nocb: Remove needless full barrier after callback advancing

Message ID 20230908203603.5865-5-frederic@kernel.org (mailing list archive)
State New, archived
Series rcu cleanups

Commit Message

Frederic Weisbecker Sept. 8, 2023, 8:35 p.m. UTC
A full barrier is issued from nocb_gp_wait() upon callbacks advancing,
so as to order grace period completion before callbacks execution.

However, these two events are already ordered by the
smp_mb__after_unlock_lock() barrier within the call to
raw_spin_lock_rcu_node() that is necessary for callbacks advancing to
happen.

The following litmus test shows the kind of guarantee that this barrier
provides:

	C smp_mb__after_unlock_lock

	{}

	// rcu_gp_cleanup()
	P0(spinlock_t *rnp_lock, int *gpnum)
	{
		// Grace period cleanup increases the gp sequence number
		spin_lock(rnp_lock);
		WRITE_ONCE(*gpnum, 1);
		spin_unlock(rnp_lock);
	}

	// nocb_gp_wait()
	P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready)
	{
		int r1;

		// Call rcu_advance_cbs() from nocb_gp_wait()
		spin_lock(nocb_lock);
		spin_lock(rnp_lock);
		smp_mb__after_unlock_lock();
		r1 = READ_ONCE(*gpnum);
		WRITE_ONCE(*cb_ready, 1);
		spin_unlock(rnp_lock);
		spin_unlock(nocb_lock);
	}

	// nocb_cb_wait()
	P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
	{
		int r2;

		// rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
		spin_lock(nocb_lock);
		r2 = READ_ONCE(*cb_ready);
		spin_unlock(nocb_lock);

		// Actual callback execution
		WRITE_ONCE(*cb_executed, 1);
	}

	P3(int *cb_executed, int *gpnum)
	{
		int r3;

		WRITE_ONCE(*cb_executed, 2);
		smp_mb();
		r3 = READ_ONCE(*gpnum);
	}

	exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *)

Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is
removed. This barrier orders the grace period completion against
callbacks advancing and even later callbacks invocation, thanks to the
opportunistic propagation via the ->nocb_lock to nocb_cb_wait().

Therefore the smp_mb() placed after callbacks advancing can be safely
removed.
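
Such a litmus test can be checked against the Linux-kernel memory model
with herd7, along the lines of (assuming the linux-kernel.cfg and model
files from tools/memory-model, and the test saved in a .litmus file):

	herd7 -conf linux-kernel.cfg smp_mb__after_unlock_lock.litmus

herd7 then reports whether the "exists" clause can ever be satisfied.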

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_nocb.h | 1 -
 1 file changed, 1 deletion(-)

Comments

Joel Fernandes Sept. 9, 2023, 4:31 a.m. UTC | #1
On Fri, Sep 08, 2023 at 10:35:57PM +0200, Frederic Weisbecker wrote:
> [...]
>
> 	// nocb_cb_wait()
> 	P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
> 	{
> 		int r2;
> 
> 		// rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
> 		spin_lock(nocb_lock);
> 		r2 = READ_ONCE(*cb_ready);
> 		spin_unlock(nocb_lock);
> 
> 		// Actual callback execution
> 		WRITE_ONCE(*cb_executed, 1);

So, related to this, something in the docs caught my attention under "Callback
Invocation" [1]:

<quote>
However, if the callback function communicates to other CPUs, for example,
doing a wakeup, then it is that function's responsibility to maintain
ordering. For example, if the callback function wakes up a task that runs on
some other CPU, proper ordering must be in place in both the callback function
and the task being awakened. To see why this is important, consider the top
half of the grace-period cleanup diagram. The callback might be running on a
CPU corresponding to the leftmost leaf rcu_node structure, and awaken a task
that is to run on a CPU corresponding to the rightmost leaf rcu_node
structure, and the grace-period kernel thread might not yet have reached the
rightmost leaf. In this case, the grace period's memory ordering might not
yet have reached that CPU, so again the callback function and the awakened
task must supply proper ordering.
</quote>

I believe this text is for non-nocb, but if we apply it to the nocb case,
let's see what happens.

In the litmus test, the rcu_advance_cbs() happened on P1, however the callback is
executing on P2. That sounds very similar to the non-nocb world described in
the text where a callback tries to wake something up on a different CPU and
needs to take care of all the ordering.

So unless I'm missing something (quite possible), P2 must see the update to
gpnum as well. However, per your litmus test, the only thing P2 does is
acquire the nocb_lock. I don't see how it is guaranteed to see gpnum == 1.
I am curious what happens in your litmus test if you read gpnum into a
register and check for it.
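
For concreteness, a sketch of that variant (hypothetical: gpnum is
additionally passed to P2 of the test above, with the extra read in r3):

	// nocb_cb_wait(), with the extra diagnostic read of gpnum
	P2(spinlock_t *nocb_lock, int *gpnum, int *cb_ready, int *cb_executed)
	{
		int r2;
		int r3;

		spin_lock(nocb_lock);
		r2 = READ_ONCE(*cb_ready);
		r3 = READ_ONCE(*gpnum);
		spin_unlock(nocb_lock);

		WRITE_ONCE(*cb_executed, 1);
	}

	exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0) (* Forbidden, per the analysis below. *)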

So maybe the memory barriers you are deleting need to be kept in place? Idk.

thanks,

 - Joel

[1] https://docs.kernel.org/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html#callback-invocation
Boqun Feng Sept. 9, 2023, 6:22 p.m. UTC | #2
On Sat, Sep 09, 2023 at 04:31:25AM +0000, Joel Fernandes wrote:
> [...]
>
> So unless I'm missing something (quite possible), P2 must see the update to
> gpnum as well. However, per your litmus test, the only thing P2 does is
> acquire the nocb_lock. I don't see how it is guaranteed to see gpnum == 1.

Because P1 writes cb_ready under nocb_lock, and P2 reads cb_ready under
nocb_lock as well, if P2 reads P1's write then we know the serialized
order of locking is P1 first (i.e. the spin_lock(nocb_lock) on P2 reads
from the spin_unlock(nocb_lock) on P1), in other words:

(fact #1)

	unlock(nocb_lock) // on P1
	->rfe
	lock(nocb_lock) // on P2

so if P1 reads P0's write on gpnum

(assumption #1)

	W(gpnum)=1 // on P0
	->rfe
	R(gpnum)=1 // on P1

and we have

(fact #2)

	R(gpnum)=1 // on P1
	->(po; [UL])
	unlock(nocb_lock) // on P1

combine them and you get

	W(gpnum)=1 // on P0
	->rfe           // assumption #1
	->(po; [UL])    // fact #2
	->rfe           // fact #1
	lock(nocb_lock) // on P2
	->([LKR]; po)
	M // any access on P2 after spin_lock(nocb_lock);

so
	W(gpnum)=1 // on P0
	->rfe ->po-unlock-lock-po
	M // on P2

and po-unlock-lock-po is A-cumul, hence "->rfe ->po-unlock-lock-po" or
"rfe; po-unlock-lock-po" is a cumul-fence, hence it's a ->prop, which
means the write of gpnum on P0 propagates to P2 before any memory
accesses after spin_lock(nocb_lock)?
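
In litmus form, that propagation step reduces to a three-CPU sketch of
the same shape (made-up variable names, same reasoning as above):

	C rfe-po-unlock-lock-po

	{}

	P0(int *x)
	{
		WRITE_ONCE(*x, 1);
	}

	P1(spinlock_t *s, int *x, int *flag)
	{
		int r1;

		spin_lock(s);
		r1 = READ_ONCE(*x);
		WRITE_ONCE(*flag, 1);
		spin_unlock(s);
	}

	P2(spinlock_t *s, int *x, int *flag)
	{
		int r2;
		int r3;

		spin_lock(s);
		r2 = READ_ONCE(*flag);
		r3 = READ_ONCE(*x);
		spin_unlock(s);
	}

	exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0) (* Forbidden. *)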

Of course, I haven't looked into the bigger picture here (whether the
barrier is for something else, etc.) ;-)

Regards,
Boqun

Joel Fernandes Sept. 10, 2023, 4:09 a.m. UTC | #3
On Sat, Sep 09, 2023 at 11:22:48AM -0700, Boqun Feng wrote:
> [...]
>
> so
> 	W(gpnum)=1 // on P0
> 	->rfe ->po-unlock-lock-po
> 	M // on P2
> 
> and po-unlock-lock-po is A-cumul, hence "->rfe ->po-unlock-lock-po" or
> "rfe; po-unlock-lock-po" is a cumul-fence, hence it's a ->prop, which
> means the write of gpnum on P0 propagates to P2 before any memory
> accesses after spin_lock(nocb_lock)?

You and Frederic are right. I confirmed this by running herd7 as well.

Also, he adds a ->co between P2 and P3, which is why the
smp_mb__after_unlock_lock() helps to keep the propagation intact. It's pretty
much the R pattern (sketched below) extended across 4 CPUs.
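
For reference, the base two-CPU form, along the lines of
R+fencembonceonces from tools/memory-model/litmus-tests:

	C R+fencembonceonces

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		smp_mb();
		WRITE_ONCE(*y, 1);
	}

	P1(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*y, 2);
		smp_mb();
		r0 = READ_ONCE(*x);
	}

	exists (y=2 /\ 1:r0=0) (* Forbidden. *)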

We should probably document these in the RCU memory ordering docs.

thanks,

 - Joel
Paul E. McKenney Sept. 10, 2023, 10:22 a.m. UTC | #4
On Sun, Sep 10, 2023 at 12:09:23AM -0400, Joel Fernandes wrote:
> [...]
>
> You and Frederic are right. I confirmed this by running herd7 as well.
> 
> Also, he adds a ->co between P2 and P3, which is why the
> smp_mb__after_unlock_lock() helps to keep the propagation intact. It's pretty
> much the R pattern extended across 4 CPUs.
> 
> We should probably document these in the RCU memory ordering docs.

I never have heard of R showing up in real code before, so that is
worth a few words in the documentation as well.  ;-)

							Thanx, Paul
Frederic Weisbecker Sept. 10, 2023, 8:17 p.m. UTC | #5
On Sun, Sep 10, 2023 at 12:09:23AM -0400, Joel Fernandes wrote:
> [...]
>
> You and Frederic are right. I confirmed this by running herd7 as well.
> 
> Also, he adds a ->co between P2 and P3, which is why the
> smp_mb__after_unlock_lock() helps to keep the propagation intact. It's pretty
> much the R pattern extended across 4 CPUs.
> 
> We should probably document these in the RCU memory ordering docs.

I have to trust you guys on that; I haven't managed to spend time on
tools/memory-model/Documentation/explanation.txt yet. But glad you sorted
it out.

Frederic Weisbecker Sept. 10, 2023, 8:29 p.m. UTC | #6
On Sun, Sep 10, 2023 at 10:17:15PM +0200, Frederic Weisbecker wrote:
> On Sun, Sep 10, 2023 at 12:09:23AM -0400, Joel Fernandes wrote:
> > [...]
> >
> > We should probably document these in the RCU memory ordering docs.
> 
> I have to trust you guys on that; I haven't managed to spend time on
> tools/memory-model/Documentation/explanation.txt yet. But glad you sorted
> it out.

Oh, and I can indeed add a comment in rcu_do_batch() -> rcu_nocb_lock_irqsave()
to explain how the guaranteed ordering against grace period completion is
enforced. Will do on the next take.
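
Something like this, perhaps (hypothetical wording, not the final comment):

	/*
	 * Callback invocation happens under ->nocb_lock.  The
	 * rcu_advance_cbs() call in nocb_gp_wait() reads the new grace
	 * period sequence number under the rnp lock, whose
	 * smp_mb__after_unlock_lock() orders that read before the
	 * subsequent ->nocb_lock release, so acquiring ->nocb_lock here
	 * orders grace period completion before callback invocation.
	 */
	rcu_nocb_lock_irqsave(rdp, flags);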

Thanks.

Patch

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 6e63ba4788e1..2dc76f5e6e78 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -779,7 +779,6 @@  static void nocb_gp_wait(struct rcu_data *my_rdp)
 		if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
 			needwake = rdp->nocb_cb_sleep;
 			WRITE_ONCE(rdp->nocb_cb_sleep, false);
-			smp_mb(); /* CB invocation -after- GP end. */
 		} else {
 			needwake = false;
 		}