
rcu: use try_cmpxchg in check_cpu_stall

Message ID 20230228155121.3416-1-ubizjak@gmail.com (mailing list archive)
State New, archived
Series rcu: use try_cmpxchg in check_cpu_stall

Commit Message

Uros Bizjak Feb. 28, 2023, 3:51 p.m. UTC
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
check_cpu_stall.  x86 CMPXCHG instruction returns success in ZF flag, so
this change saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).

No functional change intended.

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 kernel/rcu/tree_stall.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Joel Fernandes Feb. 28, 2023, 8:39 p.m. UTC | #1
On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> check_cpu_stall.  x86 CMPXCHG instruction returns success in ZF flag, so
> this change saves a compare after cmpxchg (and related move instruction in
> front of cmpxchg).

In my codegen, I am not seeing mov instruction before the cmp removed, how
can that be? The rax has to be populated with a mov before cmpxchg right?

So try_cmpxchg gives: mov, cmpxchg, cmp, jne
Whereas cmpxchg gives: mov, cmpxchg, mov, jne

So yeah you got rid of compare, but I am not seeing reduction in moves.
Either way, I think it is an improvement due to dropping cmp so:

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


> 
> No functional change intended.
> 
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
> ---
>  kernel/rcu/tree_stall.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index b10b8349bb2a..d81c88e66b42 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -760,7 +760,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>  	jn = jiffies + ULONG_MAX / 2;
>  	if (rcu_gp_in_progress() &&
>  	    (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
> -	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> +	    try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>  
>  		/*
>  		 * If a virtual machine is stopped by the host it can look to
> @@ -778,7 +778,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>  
>  	} else if (rcu_gp_in_progress() &&
>  		   ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
> -		   cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> +		   try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>  
>  		/*
>  		 * If a virtual machine is stopped by the host it can look to
> -- 
> 2.39.2
>
Joel Fernandes Feb. 28, 2023, 8:56 p.m. UTC | #2
> On Feb 28, 2023, at 3:39 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
>> Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
>> check_cpu_stall.  x86 CMPXCHG instruction returns success in ZF flag, so
>> this change saves a compare after cmpxchg (and related move instruction in
>> front of cmpxchg).
> 
> In my codegen, I am not seeing mov instruction before the cmp removed, how
> can that be? The rax has to be populated with a mov before cmpxchg right?
> 
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
> 
> So yeah you got rid of compare, but I am not seeing reduction in moves.
> Either way, I think it is an improvement due to dropping cmp so:
> 
> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Ah, there probably is one less mov to hold the value to compare against,
but maybe not always?

Would be nice to clarify that in the changelog anyway.

Thanks,

 - Joel

> 
> thanks,
> 
> - Joel
> 
> 
>> 
>> No functional change intended.
>> 
>> Cc: "Paul E. McKenney" <paulmck@kernel.org>
>> Cc: Frederic Weisbecker <frederic@kernel.org>
>> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
>> Cc: Josh Triplett <josh@joshtriplett.org>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
>> ---
>> kernel/rcu/tree_stall.h | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
>> index b10b8349bb2a..d81c88e66b42 100644
>> --- a/kernel/rcu/tree_stall.h
>> +++ b/kernel/rcu/tree_stall.h
>> @@ -760,7 +760,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>>    jn = jiffies + ULONG_MAX / 2;
>>    if (rcu_gp_in_progress() &&
>>        (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
>> -        cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> +        try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>> 
>>        /*
>>         * If a virtual machine is stopped by the host it can look to
>> @@ -778,7 +778,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>> 
>>    } else if (rcu_gp_in_progress() &&
>>           ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
>> -           cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> +           try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>> 
>>        /*
>>         * If a virtual machine is stopped by the host it can look to
>> -- 
>> 2.39.2
>>
Steven Rostedt Feb. 28, 2023, 9:03 p.m. UTC | #3
On Tue, 28 Feb 2023 20:39:30 +0000
Joel Fernandes <joel@joelfernandes.org> wrote:

> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > check_cpu_stall.  x86 CMPXCHG instruction returns success in ZF flag, so
> > this change saves a compare after cmpxchg (and related move instruction in
> > front of cmpxchg).  
> 
> In my codegen, I am not seeing mov instruction before the cmp removed, how
> can that be? The rax has to be populated with a mov before cmpxchg right?
> 
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
> 
> So yeah you got rid of compare, but I am not seeing reduction in moves.
> Either way, I think it is an improvement due to dropping cmp so:

Did you get the above backwards?

Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
Uros sent to me for the ring buffer, the code went from:

0000000000000070 <ring_buffer_record_off>:
      70:       48 8d 4f 08             lea    0x8(%rdi),%rcx
      74:       8b 57 08                mov    0x8(%rdi),%edx
      77:       89 d6                   mov    %edx,%esi
      79:       89 d0                   mov    %edx,%eax
      7b:       81 ce 00 00 10 00       or     $0x100000,%esi
      81:       f0 0f b1 31             lock cmpxchg %esi,(%rcx)
      85:       39 d0                   cmp    %edx,%eax
      87:       75 eb                   jne    74 <ring_buffer_record_off+0x4>
      89:       e9 00 00 00 00          jmp    8e <ring_buffer_record_off+0x1e>
                        8a: R_X86_64_PLT32      __x86_return_thunk-0x4
      8e:       66 90                   xchg   %ax,%ax


  To

00000000000001a0 <ring_buffer_record_off>:
     1a0:       8b 47 08                mov    0x8(%rdi),%eax
     1a3:       48 8d 4f 08             lea    0x8(%rdi),%rcx
     1a7:       89 c2                   mov    %eax,%edx
     1a9:       81 ca 00 00 10 00       or     $0x100000,%edx
     1af:       f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
     1b4:       75 05                   jne    1bb <ring_buffer_record_off+0x1b>
     1b6:       e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
     1bb:       89 c2                   mov    %eax,%edx
     1bd:       81 ca 00 00 10 00       or     $0x100000,%edx
     1c3:       f0 0f b1 11             lock cmpxchg %edx,(%rcx)
     1c7:       75 f2                   jne    1bb <ring_buffer_record_off+0x1b>
     1c9:       e9 00 00 00 00          jmp    1ce <ring_buffer_record_off+0x2e>
                        1ca: R_X86_64_PLT32     __x86_return_thunk-0x4
     1ce:       66 90                   xchg   %ax,%ax


It does add a bit more code, but the fast path seems better (where the
cmpxchg succeeds). That would be:

00000000000001a0 <ring_buffer_record_off>:
     1a0:       8b 47 08                mov    0x8(%rdi),%eax
     1a3:       48 8d 4f 08             lea    0x8(%rdi),%rcx
     1a7:       89 c2                   mov    %eax,%edx
     1a9:       81 ca 00 00 10 00       or     $0x100000,%edx
     1af:       f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
     1b4:       75 05                   jne    1bb <ring_buffer_record_off+0x1b>
     1b6:       e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4


Where there's only two moves and no cmp, where the former has 3 moves and a
cmp in the fast path.

-- Steve
Paul E. McKenney Feb. 28, 2023, 9:29 p.m. UTC | #4
On Tue, Feb 28, 2023 at 04:03:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Feb 2023 20:39:30 +0000
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > > check_cpu_stall.  x86 CMPXCHG instruction returns success in ZF flag, so
> > > this change saves a compare after cmpxchg (and related move instruction in
> > > front of cmpxchg).  
> > 
> > In my codegen, I am not seeing mov instruction before the cmp removed, how
> > can that be? The rax has to be populated with a mov before cmpxchg right?
> > 
> > So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> > Whereas cmpxchg gives: mov, cmpxchg, mov, jne
> > 
> > So yeah you got rid of compare, but I am not seeing reduction in moves.
> > Either way, I think it is an improvement due to dropping cmp so:
> 
> Did you get the above backwards?
> 
> Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
> Uros sent to me for the ring buffer, the code went from:
> 
> 0000000000000070 <ring_buffer_record_off>:
>       70:       48 8d 4f 08             lea    0x8(%rdi),%rcx
>       74:       8b 57 08                mov    0x8(%rdi),%edx
>       77:       89 d6                   mov    %edx,%esi
>       79:       89 d0                   mov    %edx,%eax
>       7b:       81 ce 00 00 10 00       or     $0x100000,%esi
>       81:       f0 0f b1 31             lock cmpxchg %esi,(%rcx)
>       85:       39 d0                   cmp    %edx,%eax
>       87:       75 eb                   jne    74 <ring_buffer_record_off+0x4>
>       89:       e9 00 00 00 00          jmp    8e <ring_buffer_record_off+0x1e>
>                         8a: R_X86_64_PLT32      __x86_return_thunk-0x4
>       8e:       66 90                   xchg   %ax,%ax
> 
> 
>   To
> 
> 00000000000001a0 <ring_buffer_record_off>:
>      1a0:       8b 47 08                mov    0x8(%rdi),%eax
>      1a3:       48 8d 4f 08             lea    0x8(%rdi),%rcx
>      1a7:       89 c2                   mov    %eax,%edx
>      1a9:       81 ca 00 00 10 00       or     $0x100000,%edx
>      1af:       f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
>      1b4:       75 05                   jne    1bb <ring_buffer_record_off+0x1b>
>      1b6:       e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
>                         1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
>      1bb:       89 c2                   mov    %eax,%edx
>      1bd:       81 ca 00 00 10 00       or     $0x100000,%edx
>      1c3:       f0 0f b1 11             lock cmpxchg %edx,(%rcx)
>      1c7:       75 f2                   jne    1bb <ring_buffer_record_off+0x1b>
>      1c9:       e9 00 00 00 00          jmp    1ce <ring_buffer_record_off+0x2e>
>                         1ca: R_X86_64_PLT32     __x86_return_thunk-0x4
>      1ce:       66 90                   xchg   %ax,%ax
> 
> 
> It does add a bit more code, but the fast path seems better (where the
> cmpxchg succeeds). That would be:
> 
> 00000000000001a0 <ring_buffer_record_off>:
>      1a0:       8b 47 08                mov    0x8(%rdi),%eax
>      1a3:       48 8d 4f 08             lea    0x8(%rdi),%rcx
>      1a7:       89 c2                   mov    %eax,%edx
>      1a9:       81 ca 00 00 10 00       or     $0x100000,%edx
>      1af:       f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
>      1b4:       75 05                   jne    1bb <ring_buffer_record_off+0x1b>
>      1b6:       e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
>                         1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
> 
> 
> Where there's only two moves and no cmp, where the former has 3 moves and a
> cmp in the fast path.

All well and good, but the stall-warning code is nowhere near a fastpath.

Is try_cmpxchg() considered more readable in this context?

							Thanx, Paul
Steven Rostedt Feb. 28, 2023, 9:41 p.m. UTC | #5
On Tue, 28 Feb 2023 13:29:11 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> All well and good, but the stall-warning code is nowhere near a fastpath.
> 
> Is try_cmpxchg() considered more readable in this context?


-	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+	    try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {

It's basically the same :-/

But looking at this use case, I'd actually NAK it, as it is misleading.

As try_cmpxchg() is used to get rid of the updating of the old value. As in
the ring buffer code we had:

void ring_buffer_record_off(struct trace_buffer *buffer)
{
	unsigned int rd;
	unsigned int new_rd;

	do {
		rd = atomic_read(&buffer->record_disabled);
		new_rd = rd | RB_BUFFER_OFF;
	} while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
}

and the try_cmpxchg() converted it to:

void ring_buffer_record_off(struct trace_buffer *buffer)
{
	unsigned int rd;
	unsigned int new_rd;

	rd = atomic_read(&buffer->record_disabled);
	do {
		new_rd = rd | RB_BUFFER_OFF;
	} while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
}

Which got rid of the need to constantly update the rd variable (cmpxchg
will load rax with the value read, so it removes the need for an extra
move).

But in your case, we don't need to update js, in which case the
try_cmpxchg() does.

The patch that Uros sent me for the ring buffer code also does some of
that, which I feel is wrong.

So with that, I would nack the patch.

-- Steve
Paul E. McKenney Feb. 28, 2023, 9:56 p.m. UTC | #6
On Tue, Feb 28, 2023 at 04:41:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Feb 2023 13:29:11 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > All well and good, but the stall-warning code is nowhere near a fastpath.
> > 
> > Is try_cmpxchg() considered more readable in this context?
> 
> 
> -	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> +	    try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
> 
> It's basically the same :-/

That was my assessment.  ;-)

> But looking at this use case, I'd actually NAK it, as it is misleading.
> 
> As try_cmpxchg() is used to get rid of the updating of the old value. As in
> the ring buffer code we had:
> 
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
> 	unsigned int rd;
> 	unsigned int new_rd;
> 
> 	do {
> 		rd = atomic_read(&buffer->record_disabled);
> 		new_rd = rd | RB_BUFFER_OFF;
> 	} while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
> }
> 
> and the try_cmpxchg() converted it to:
> 
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
> 	unsigned int rd;
> 	unsigned int new_rd;
> 
> 	rd = atomic_read(&buffer->record_disabled);
> 	do {
> 		new_rd = rd | RB_BUFFER_OFF;
> 	} while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> }
> 
> Which got rid of the need to constantly update the rd variable (cmpxchg
> will load rax with the value read, so it removes the need for an extra
> move).
> 
> But in your case, we don't need to update js, in which case the
> try_cmpxchg() does.
> 
> The patch that Uros sent me for the ring buffer code also does some of
> that, which I feel is wrong.
> 
> So with that, I would nack the patch.

OK, I will leave this one out.

							Thanx, Paul
Joel Fernandes Feb. 28, 2023, 11:30 p.m. UTC | #7
Hey Steve,

On Tue, Feb 28, 2023 at 4:41 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 28 Feb 2023 13:29:11 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > All well and good, but the stall-warning code is nowhere near a fastpath.
> >
> > Is try_cmpxchg() considered more readable in this context?
>
>
> -           cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> +           try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>
> It's basically the same :-/
>
> But looking at this use case, I'd actually NAK it, as it is misleading.

I'm trying to parse this. You are saying it is misleading, because it
updates js when it doesn't need to?

> As try_cmpxchg() is used to get rid of the updating of the old value. As in
> the ring buffer code we had:
>
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
>         unsigned int rd;
>         unsigned int new_rd;
>
>         do {
>                 rd = atomic_read(&buffer->record_disabled);
>                 new_rd = rd | RB_BUFFER_OFF;
>         } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);

Here you actually meant "rd" as the second parameter without the & ?

> }
>
> and the try_cmpxchg() converted it to:
>
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
>         unsigned int rd;
>         unsigned int new_rd;
>
>         rd = atomic_read(&buffer->record_disabled);
>         do {
>                 new_rd = rd | RB_BUFFER_OFF;
>         } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> }
>
> Which got rid of the need to constantly update the rd variable (cmpxchg
> will load rax with the value read, so it removes the need for an extra
> move).

So that's a good thing?

>
> But in your case, we don't need to update js, in which case the
> try_cmpxchg() does.

Right, it has lesser value here but I'm curious why you feel it also
doesn't belong in that ring buffer loop you shared (or did you mean,
it does belong there but not in other ftrace code modified by Uros?).

> The patch that Uros sent me for the ring buffer code also does some of
> that, which I feel is wrong.

I am guessing that is code *other than* the ring buffer loop above.

> So with that, I would nack the patch.

Dropping it for RCU is fine with me as well.

(And yes for the other thread, I had it backwards, try_cmpxchg is what
drops the need for the cmp instruction -- sorry.)

thanks,

 -Joel
Steven Rostedt March 1, 2023, 12:08 a.m. UTC | #8
On Tue, 28 Feb 2023 18:30:14 -0500
Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > But looking at this use case, I'd actually NAK it, as it is misleading.  
> 
> I'm trying to parse this. You are saying it is misleading, because it
> updates js when it doesn't need to?

Correct.

> 
> > As try_cmpxchg() is used to get rid of the updating of the old value. As in
> > the ring buffer code we had:
> >
> > void ring_buffer_record_off(struct trace_buffer *buffer)
> > {
> >         unsigned int rd;
> >         unsigned int new_rd;
> >
> >         do {
> >                 rd = atomic_read(&buffer->record_disabled);
> >                 new_rd = rd | RB_BUFFER_OFF;
> >         } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);  
> 
> Here you actually meant "rd" as the second parameter without the & ?

Yes, I cut and pasted the updated code and incorrectly try to revert it in
this example :-p

> 
> > }
> >
> > and the try_cmpxchg() converted it to:
> >
> > void ring_buffer_record_off(struct trace_buffer *buffer)
> > {
> >         unsigned int rd;
> >         unsigned int new_rd;
> >
> >         rd = atomic_read(&buffer->record_disabled);
> >         do {
> >                 new_rd = rd | RB_BUFFER_OFF;
> >         } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> > }
> >
> > Which got rid of the need to constantly update the rd variable (cmpxchg
> > will load rax with the value read, so it removes the need for an extra
> > move).  
> 
> So that's a good thing?

Yes. For looping, try_cmpxchg() is the proper function to use. But in the
RCU case (and other cases in the ring-buffer patch) there is no loop, and
no need to modify the "old" variable.

> 
> >
> > But in your case, we don't need to update js, in which case the
> > try_cmpxchg() does.  
> 
> Right, it has lesser value here but I'm curious why you feel it also
> doesn't belong in that ring buffer loop you shared (or did you mean,
> it does belong there but not in other ftrace code modified by Uros?).

The ring buffer patch had more than one change, where half the updates were
fine, and half were not.

-- Steve
Joel Fernandes March 1, 2023, 12:45 a.m. UTC | #9
Hey Steve,

On Tue, Feb 28, 2023 at 7:08 PM Steven Rostedt <rostedt@goodmis.org> wrote:
[...]
> >
> > >
> > > But in your case, we don't need to update js, in which case the
> > > try_cmpxchg() does.
> >
> > Right, it has lesser value here but I'm curious why you feel it also
> > doesn't belong in that ring buffer loop you shared (or did you mean,
> > it does belong there but not in other ftrace code modified by Uros?).
>
> The ring buffer patch had more than one change, where half the updates were
> fine, and half were not.

Thanks for all the clarifications. Good to know such loop design
patterns can benefit from it!


 - Joel
Uros Bizjak March 1, 2023, 9:52 a.m. UTC | #10
On Tue, Feb 28, 2023 at 9:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > check_cpu_stall.  x86 CMPXCHG instruction returns success in ZF flag, so
> > this change saves a compare after cmpxchg (and related move instruction in
> > front of cmpxchg).
>
> In my codegen, I am not seeing mov instruction before the cmp removed, how
> can that be? The rax has to be populated with a mov before cmpxchg right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne

(I assume the above is reversed)

You have quite a new compiler (e.g. gcc-12+) which is able to reorder
basic blocks, reschedule instructions and create fast paths through
the code based on likely/unlikely annotations in the definition of
try_cmpxchg. [On a related note, some code growth is allowed to the
compiler in the hot path since the kernel is compiled with -O2, not
-Os.] gcc-10.3.1 improves the code from tree.c from:

    a1c5:    0f 84 53 03 00 00        je     a51e <rcu_sched_clock_irq+0x70e>
    a1cb:    48 89 c8                 mov    %rcx,%rax
    a1ce:    f0 48 0f b1 35 00 00     lock cmpxchg %rsi,0x0(%rip)
  # a1d7 <rcu_sched_clock_irq+0x3c7>
    a1d5:    00 00
            a1d3: R_X86_64_PC32    .data+0xf9c
    a1d7:    48 39 c1                 cmp    %rax,%rcx
    a1da:    0f 85 3e 03 00 00        jne    a51e <rcu_sched_clock_irq+0x70e>

to:

    a1d0:    0f 84 49 03 00 00        je     a51f <rcu_sched_clock_irq+0x70f>
    a1d6:    f0 48 0f b1 35 00 00     lock cmpxchg %rsi,0x0(%rip)
  # a1df <rcu_sched_clock_irq+0x3cf>
    a1dd:    00 00
            a1db: R_X86_64_PC32    .data+0xf9c
    a1df:    0f 85 3a 03 00 00        jne    a51f <rcu_sched_clock_irq+0x70f>

The change brings the following code size improvement:

  text    data     bss     dec     hex filename
 57712    9945      86   67743   1089f tree-new.o
 57760    9945      86   67791   108cf tree-old.o

Small change, but the result of effectively an almost mechanical
one-line substitutions.

Uros.

> So yeah you got rid of compare, but I am not seeing reduction in moves.
> Either way, I think it is an improvement due to dropping cmp so:
>
> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> thanks,
>
>  - Joel
>
>
> >
> > No functional change intended.
> >
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> > Cc: Josh Triplett <josh@joshtriplett.org>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> > Cc: Joel Fernandes <joel@joelfernandes.org>
> > Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
> > ---
> >  kernel/rcu/tree_stall.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> > index b10b8349bb2a..d81c88e66b42 100644
> > --- a/kernel/rcu/tree_stall.h
> > +++ b/kernel/rcu/tree_stall.h
> > @@ -760,7 +760,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
> >       jn = jiffies + ULONG_MAX / 2;
> >       if (rcu_gp_in_progress() &&
> >           (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
> > -         cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> > +         try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
> >
> >               /*
> >                * If a virtual machine is stopped by the host it can look to
> > @@ -778,7 +778,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
> >
> >       } else if (rcu_gp_in_progress() &&
> >                  ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
> > -                cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> > +                try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
> >
> >               /*
> >                * If a virtual machine is stopped by the host it can look to
> > --
> > 2.39.2
> >
Uros Bizjak March 1, 2023, 10:28 a.m. UTC | #11
On Wed, Mar 1, 2023 at 1:08 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 28 Feb 2023 18:30:14 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > But looking at this use case, I'd actually NAK it, as it is misleading.
> >
> > I'm trying to parse this. You are saying it is misleading, because it
> > updates js when it doesn't need to?
>
> Correct.

I'm a bit late to the discussion (well, I have to sleep from time to
time, too), but in the hope that everybody interested in this issue
will find the reply, I'll try to clarify the "updates" claim:

The try_cmpxchg is written in such a way that benefits loops as well
as linear code, in the latter case it depends on the compiler to
eliminate the dead assignment.

When changing linear code from cmpxchg to try_cmpxchg, one has to take
care that the variable, passed by reference, is unused after cmpxchg,
so it can be considered as a temporary variable (as said elsewhere,
the alternative is to copy the value to a local temporary variable and
pass the pointer to this variable to try_cmpxchg - the compiler will
eliminate the assignment if the original variable is unused).

Even in linear code, the conversion from cmpxchg to try_cmpxchg is
able to eliminate assignment and compare, as can be seen when the code
is compiled with gcc-10.3.1:

    a1c5:    0f 84 53 03 00 00        je     a51e <rcu_sched_clock_irq+0x70e>
    a1cb:    48 89 c8                 mov    %rcx,%rax
    a1ce:    f0 48 0f b1 35 00 00     lock cmpxchg %rsi,0x0(%rip)
  # a1d7 <rcu_sched_clock_irq+0x3c7>
    a1d5:    00 00
            a1d3: R_X86_64_PC32    .data+0xf9c
    a1d7:    48 39 c1                 cmp    %rax,%rcx
    a1da:    0f 85 3e 03 00 00        jne    a51e <rcu_sched_clock_irq+0x70e>

to:

    a1d0:    0f 84 49 03 00 00        je     a51f <rcu_sched_clock_irq+0x70f>
    a1d6:    f0 48 0f b1 35 00 00     lock cmpxchg %rsi,0x0(%rip)
  # a1df <rcu_sched_clock_irq+0x3cf>
    a1dd:    00 00
            a1db: R_X86_64_PC32    .data+0xf9c
    a1df:    0f 85 3a 03 00 00        jne    a51f <rcu_sched_clock_irq+0x70f>

Newer compilers (e.g. gcc-12+) are able to use likely/unlikely
annotations to reorder the code, so the change is less visible. But
due to reordering, even targets that don't define try_cmpxchg natively
benefit from the change, please see thread at [1].

These benefits are the reason the change to try_cmpxchg was accepted
also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
name a few commits, with a thumbs-up and a claim that the new code is
actually *clearer* at the merge commit [4].

I really think that the above demonstrates various improvements, and
it would be unfortunate not to consider them.

[1] https://lore.kernel.org/lkml/871qwgmqws.fsf@mpe.ellerman.id.au/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4e1da8fe031303599e78f88e0dad9f44272e4f99
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8baceabca656d5ef4494cdeb3b9b9fbb844ac613
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91bc559d8d3aed488b4b50e9eba1d7ebb1da7bbf

Uros.

> >
> > > As try_cmpxchg() is used to get rid of the updating of the old value. As in
> > > the ring buffer code we had:
> > >
> > > void ring_buffer_record_off(struct trace_buffer *buffer)
> > > {
> > >         unsigned int rd;
> > >         unsigned int new_rd;
> > >
> > >         do {
> > >                 rd = atomic_read(&buffer->record_disabled);
> > >                 new_rd = rd | RB_BUFFER_OFF;
> > >         } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
> >
> > Hear you actually meant "rd" as the second parameter without the & ?
>
> Yes, I cut and pasted the updated code and incorrectly try to revert it in
> this example :-p
>
> >
> > > }
> > >
> > > and the try_cmpxchg() converted it to:
> > >
> > > void ring_buffer_record_off(struct trace_buffer *buffer)
> > > {
> > >         unsigned int rd;
> > >         unsigned int new_rd;
> > >
> > >         rd = atomic_read(&buffer->record_disabled);
> > >         do {
> > >                 new_rd = rd | RB_BUFFER_OFF;
> > >         } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> > > }
> > >
> > > Which got rid of the need to constantly update the rd variable (cmpxchg
> > > will load rax with the value read, so it removes the need for an extra
> > > move).
> >
> > So that's a good thing?
>
> Yes. For looping, try_cmpxchg() is the proper function to use. But in the
> RCU case (and other cases in the ring-buffer patch) there is no loop, and
> no need to modify the "old" variable.
>
> >
> > >
> > > But in your case, we don't need to update js, in which case the
> > > try_cmpxchg() does.
> >
> > Right, it has lesser value here but I'm curious why you feel it also
> > doesn't belong in that ring buffer loop you shared (or did you mean,
> > it does belong there but not in other ftrace code modified by Uros?).
>
> The ring buffer patch had more than one change, where half the updates were
> fine, and half were not.
>
> -- Steve
Steven Rostedt March 1, 2023, 4:38 p.m. UTC | #12
On Wed, 1 Mar 2023 11:28:47 +0100
Uros Bizjak <ubizjak@gmail.com> wrote:

> These benefits are the reason the change to try_cmpxchg was accepted
> also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> name a few commits, with a thumbs-up and a claim that the new code is
> actually *clearer* at the merge commit [4].

I'll say it here too. I really like Joel's suggestion of having a
cmpxchg_success() that does not have the added side effect of modifying the
old variable.

I think that would allow for the arch optimizations that you are trying to
achieve, as well as remove the side effect that might cause issues down the
road.

-- Steve
Uros Bizjak March 1, 2023, 6:43 p.m. UTC | #13
On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 1 Mar 2023 11:28:47 +0100
> Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > These benefits are the reason the change to try_cmpxchg was accepted
> > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > name a few commits, with a thumbs-up and a claim that the new code is
> > actually *clearer* at the merge commit [4].
>
> I'll say it here too. I really like Joel's suggestion of having a
> cmpxchg_success() that does not have the added side effect of modifying the
> old variable.
>
> I think that would allow for the arch optimizations that you are trying to
> achieve, as well as remove the side effect that might cause issues down the
> road.

Attached patch implements this suggestion.

Uros.
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b10b8349bb2a..229263ebba3b 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -709,6 +709,12 @@ static void print_cpu_stall(unsigned long gps)
 	set_preempt_need_resched();
 }
 
+#define cmpxchg_success(ptr, old, new)				\
+({								\
+	__typeof__(*(ptr)) __tmp = (old);			\
+	try_cmpxchg((ptr), &__tmp, (new));			\
+})
+
 static void check_cpu_stall(struct rcu_data *rdp)
 {
 	bool didstall = false;
@@ -760,7 +766,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
 	jn = jiffies + ULONG_MAX / 2;
 	if (rcu_gp_in_progress() &&
 	    (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
-	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+	    cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
 
 		/*
 		 * If a virtual machine is stopped by the host it can look to
@@ -778,7 +784,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
 
 	} else if (rcu_gp_in_progress() &&
 		   ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
-		   cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+		   cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
 
 		/*
 		 * If a virtual machine is stopped by the host it can look to
Steven Rostedt March 1, 2023, 6:52 p.m. UTC | #14
On Wed, 1 Mar 2023 19:43:34 +0100
Uros Bizjak <ubizjak@gmail.com> wrote:

> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 1 Mar 2023 11:28:47 +0100
> > Uros Bizjak <ubizjak@gmail.com> wrote:
> >  
> > > These benefits are the reason the change to try_cmpxchg was accepted
> > > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > > name a few commits, with a thumbs-up and a claim that the new code is
> > > actually *clearer* at the merge commit [4].  
> >
> > I'll say it here too. I really like Joel's suggestion of having a
> > cmpxchg_success() that does not have the added side effect of modifying the
> > old variable.
> >
> > I think that would allow for the arch optimizations that you are trying to
> > achieve, as well as remove the side effect that might cause issues down the
> > road.  
> 
> Attached patch implements this suggestion.

I like it!

Anyway to make this more generic?

-- Steve
Uros Bizjak March 1, 2023, 7:18 p.m. UTC | #15
On Wed, Mar 1, 2023 at 7:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 1 Mar 2023 19:43:34 +0100
> Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Wed, 1 Mar 2023 11:28:47 +0100
> > > Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > > These benefits are the reason the change to try_cmpxchg was accepted
> > > > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > > > name a few commits, with a thumbs-up and a claim that the new code is
> > > > actually *clearer* at the merge commit [4].
> > >
> > > I'll say it here too. I really like Joel's suggestion of having a
> > > cmpxchg_success() that does not have the added side effect of modifying the
> > > old variable.
> > >
> > > I think that would allow for the arch optimizations that you are trying to
> > > achieve, as well as remove the side effect that might cause issues down the
> > > road.
> >
> > Attached patch implements this suggestion.
>
> I like it!

Thanks!

> Anyway to make this more generic?

If we want to put the definition in generic headers, then we also need
to define acquire/release/relaxed and 64bit variants. ATM, we have two
sites that require this definition and I think that for now we could
live with two instances of the same definition in two separate
subsystems. But this would definitely be a good future addition. There
is some code in the form of

if (cmpxchg(ptr, 0, 1) == 0)

that cannot be converted to try_cmpxchg, but can use cmpxchg_success.

Uros.
Steven Rostedt March 1, 2023, 7:43 p.m. UTC | #16
On Wed, 1 Mar 2023 20:18:11 +0100
Uros Bizjak <ubizjak@gmail.com> wrote:

> If we want to put the definition in generic headers, then we also need
> to define acquire/release/relaxed and 64bit variants. ATM, we have two
> sites that require this definition and I think that for now we could
> live with two instances of the same definition in two separate
> subsystems. But this would definitely be a good future addition. There
> is some code in the form of
> 
> if (cmpxchg (ptr, 0, 1) == 0)
> 
> that can not be converted to try_cmpxchg, but can use cmpxchg_success.

And even modify code that uses temp variables. For example, where you
modified the ring buffer code to use try_cmpxchg(), I could convert your:

static int rb_head_page_replace(struct buffer_page *old,
				struct buffer_page *new)
{
	unsigned long *ptr = (unsigned long *)&old->list.prev->next;
	unsigned long val;

	val = *ptr & ~RB_FLAG_MASK;
	val |= RB_PAGE_HEAD;

	return try_cmpxchg(ptr, &val, (unsigned long)&new->list);
}

Into just:

static int rb_head_page_replace(struct buffer_page *old,
				struct buffer_page *new)
{
	unsigned long *ptr = (unsigned long *)&old->list.prev->next;
	unsigned long val;

	val = *ptr & ~RB_FLAG_MASK;

	return cmpxchg_success(ptr, val | RB_PAGE_HEAD, (unsigned long)&new->list);
}

-- Steve
Joel Fernandes March 1, 2023, 7:45 p.m. UTC | #17
> On Mar 1, 2023, at 2:18 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> On Wed, Mar 1, 2023 at 7:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>> 
>>> On Wed, 1 Mar 2023 19:43:34 +0100
>>> Uros Bizjak <ubizjak@gmail.com> wrote:
>>> 
>>> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>>> 
>>>> On Wed, 1 Mar 2023 11:28:47 +0100
>>>> Uros Bizjak <ubizjak@gmail.com> wrote:
>>>> 
>>>>> These benefits are the reason the change to try_cmpxchg was accepted
>>>>> also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
>>>>> name a few commits, with a thumbs-up and a claim that the new code is
>>>>> actually *clearer* at the merge commit [4].
>>>> 
>>>> I'll say it here too. I really like Joel's suggestion of having a
>>>> cmpxchg_success() that does not have the added side effect of modifying the
>>>> old variable.
>>>> 
>>>> I think that would allow for the arch optimizations that you are trying to
>>>> achieve, as well as remove the side effect that might cause issues down the
>>>> road.
>>> 
>>> Attached patch implements this suggestion.
>> 
>> I like it!

Me too :)

> Thanks!
> 
>> Anyway to make this more generic?
> 
> If we want to put the definition in generic headers, then we also need
> to define acquire/release/relaxed and 64bit variants. ATM, we have two
> sites that require this definition and I think that for now we could
> live with two instances of the same definition in two separate
> subsystems. But this would definitely be a good future addition. There
> is some code in the form of
> 
> if (cmpxchg (ptr, 0, 1) == 0)
> 
> that can not be converted to try_cmpxchg, but can use cmpxchg_success.

I would prefer if we can put it in generic headers instead of duplicating
across ftrace and RCU.

thanks,

 - Joel

> 
> Uros.
Joel Fernandes March 1, 2023, 7:46 p.m. UTC | #18
> On Mar 1, 2023, at 2:43 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Wed, 1 Mar 2023 20:18:11 +0100
> Uros Bizjak <ubizjak@gmail.com> wrote:
> 
>> If we want to put the definition in generic headers, then we also need
>> to define acquire/release/relaxed and 64bit variants. ATM, we have two
>> sites that require this definition and I think that for now we could
>> live with two instances of the same definition in two separate
>> subsystems. But this would definitely be a good future addition. There
>> is some code in the form of
>> 
>> if (cmpxchg (ptr, 0, 1) == 0)
>> 
>> that can not be converted to try_cmpxchg, but can use cmpxchg_success.
> 
> And even modify code that uses temp variables. For example, where you
> modified the ring buffer code to use try_cmpxchg(), I could convert your:
> 
> static int rb_head_page_replace(struct buffer_page *old,
>                struct buffer_page *new)
> {
>    unsigned long *ptr = (unsigned long *)&old->list.prev->next;
>    unsigned long val;
> 
>    val = *ptr & ~RB_FLAG_MASK;
>    val |= RB_PAGE_HEAD;
> 
>    return try_cmpxchg(ptr, &val, (unsigned long)&new->list);
> }
> 
> Into just:
> 
> static int rb_head_page_replace(struct buffer_page *old,
>                struct buffer_page *new)
> {
>    unsigned long *ptr = (unsigned long *)&old->list.prev->next;
>    unsigned long val;
> 
>    val = *ptr & ~RB_FLAG_MASK;
> 
>    return cmpxchg_success(ptr, val | RB_PAGE_HEAD, (unsigned long)&new->list);
> }

Nice! Looks much better.

Thanks,

 - Joel

> 
> -- Steve
Paul E. McKenney March 1, 2023, 8:08 p.m. UTC | #19
On Wed, Mar 01, 2023 at 07:43:34PM +0100, Uros Bizjak wrote:
> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 1 Mar 2023 11:28:47 +0100
> > Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > > These benefits are the reason the change to try_cmpxchg was accepted
> > > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > > name a few commits, with a thumbs-up and a claim that the new code is
> > > actually *clearer* at the merge commit [4].
> >
> > I'll say it here too. I really like Joel's suggestion of having a
> > cmpxchg_success() that does not have the added side effect of modifying the
> > old variable.
> >
> > I think that would allow for the arch optimizations that you are trying to
> > achieve, as well as remove the side effect that might cause issues down the
> > road.
> 
> Attached patch implements this suggestion.

Please help me out here.

Why on earth are we even discussing making this change to code that
normally never executes?  Performance is not a consideration here.

What am I missing here?  Is there some sort of forward-progress
issue that this change addresses?

							Thanx, Paul

> Uros.

> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index b10b8349bb2a..229263ebba3b 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -709,6 +709,12 @@ static void print_cpu_stall(unsigned long gps)
>  	set_preempt_need_resched();
>  }
>  
> +#define cmpxchg_success(ptr, old, new)				\
> +({								\
> +	__typeof__(*(ptr)) __tmp = (old);			\
> +	try_cmpxchg((ptr), &__tmp, (new));			\
> +})
> +
>  static void check_cpu_stall(struct rcu_data *rdp)
>  {
>  	bool didstall = false;
> @@ -760,7 +766,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>  	jn = jiffies + ULONG_MAX / 2;
>  	if (rcu_gp_in_progress() &&
>  	    (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
> -	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> +	    cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>  
>  		/*
>  		 * If a virtual machine is stopped by the host it can look to
> @@ -778,7 +784,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>  
>  	} else if (rcu_gp_in_progress() &&
>  		   ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
> -		   cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> +		   cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>  
>  		/*
>  		 * If a virtual machine is stopped by the host it can look to
Joel Fernandes March 1, 2023, 8:18 p.m. UTC | #20
> On Mar 1, 2023, at 3:08 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Wed, Mar 01, 2023 at 07:43:34PM +0100, Uros Bizjak wrote:
>>> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>> 
>>> On Wed, 1 Mar 2023 11:28:47 +0100
>>> Uros Bizjak <ubizjak@gmail.com> wrote:
>>> 
>>>> These benefits are the reason the change to try_cmpxchg was accepted
>>>> also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
>>>> name a few commits, with a thumbs-up and a claim that the new code is
>>>> actually *clearer* at the merge commit [4].
>>> 
>>> I'll say it here too. I really like Joel's suggestion of having a
>>> cmpxchg_success() that does not have the added side effect of modifying the
>>> old variable.
>>> 
>>> I think that would allow for the arch optimizations that you are trying to
>>> achieve, as well as remove the side effect that might cause issues down the
>>> road.
>> 
>> Attached patch implements this suggestion.
> 
> Please help me out here.
> 
> Why on earth are we even discussing making this change to code that
> normally never executes?  Performance is not a consideration here.
> 
> What am I missing here?  Is there some sort of forward-progress
> issue that this change addresses?

I do not think it is anything with performance. The suggestion just makes
the code easier to read. In the case of ftrace (not RCU), it results in further
deleted lines of code.

Maybe it got confusing because we are discussing the change as it
applies to both ftrace and RCU.

You could argue that it has to do with performance in the fast path, but it
is probably down in the noise.

 - Joel 


> 
>                            Thanx, Paul
> 
>> Uros.
> 
>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
>> index b10b8349bb2a..229263ebba3b 100644
>> --- a/kernel/rcu/tree_stall.h
>> +++ b/kernel/rcu/tree_stall.h
>> @@ -709,6 +709,12 @@ static void print_cpu_stall(unsigned long gps)
>>    set_preempt_need_resched();
>> }
>> 
>> +#define cmpxchg_success(ptr, old, new)                \
>> +({                                \
>> +    __typeof__(*(ptr)) __tmp = (old);            \
>> +    try_cmpxchg((ptr), &__tmp, (new));            \
>> +})
>> +
>> static void check_cpu_stall(struct rcu_data *rdp)
>> {
>>    bool didstall = false;
>> @@ -760,7 +766,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>>    jn = jiffies + ULONG_MAX / 2;
>>    if (rcu_gp_in_progress() &&
>>        (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
>> -        cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> +        cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>> 
>>        /*
>>         * If a virtual machine is stopped by the host it can look to
>> @@ -778,7 +784,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>> 
>>    } else if (rcu_gp_in_progress() &&
>>           ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
>> -           cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> +           cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>> 
>>        /*
>>         * If a virtual machine is stopped by the host it can look to
>
Steven Rostedt March 1, 2023, 8:18 p.m. UTC | #21
On Wed, 1 Mar 2023 12:08:20 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > Attached patch implements this suggestion.  
> 
> Please help me out here.
> 
> Why on earth are we even discussing making this change to code that
> normally never executes?  Performance is not a consideration here.
> 
> What am I missing here?  Is there some sort of forward-progress
> issue that this change addresses?

Well, we sorta hijacked this thread. It turned into a more general
discussion, as there is code that this change will be useful for
(ring_buffer.c), but we just happen to be having the discussion here.

Where it will at most remove some text and give you back a few extra bytes
of memory ;-)

But if we do use cmpxchg_success() IMHO, it does improve readability.

> -	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> +	    cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {

-- Steve
Paul E. McKenney March 1, 2023, 8:36 p.m. UTC | #22
On Wed, Mar 01, 2023 at 03:18:26PM -0500, Steven Rostedt wrote:
> On Wed, 1 Mar 2023 12:08:20 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > Attached patch implements this suggestion.  
> > 
> > Please help me out here.
> > 
> > Why on earth are we even discussing making this change to code that
> > normally never executes?  Performance is not a consideration here.
> > 
> > What am I missing here?  Is there some sort of forward-progress
> > issue that this change addresses?
> 
> Well, we sorta hijacked this thread. It turned into a more general
> discussion, as there is code that this change will be useful for
> (ring_buffer.c), but we just happen to be having the discussion here.
> 
> Where it will at most remove some text and give you back a few extra bytes
> of memory ;-)
> 
> But if we do use cmpxchg_success() IMHO, it does improve readability.
> 
> > -	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> > +	    cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {

Some years down the road, should cmpxchg_success() be on the tip of
the tongue of every kernel hacker, perhaps.  Or perhaps not.

In the meantime, we have yet another abysmally documented atomic
operation that is not well known throughout the community.  And then the
people coming across this curse everyone who had anything to do with it,
as they search the source code, dig through assembly output, and so on
trying to work out exactly what this thing does.

Sorry, but no way.

Again, unless there is some sort of forward-progress argument or
similar convincing argument.

							Thanx, Paul
Steven Rostedt March 1, 2023, 8:46 p.m. UTC | #23
On Wed, 1 Mar 2023 12:36:45 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> Some years down the road, should cmpxchg_success() be on the tip of
> the tongue of every kernel hacker, perhaps.  Or perhaps not.

A bit of a catch-22 I would say. It will only become something everyone
knows if it exists.

> 
> In the meantime, we have yet another abysmally documented atomic

Is it?

> operation that is not well known throughout the community.  And then the
> people coming across this curse everyone who had anything to do with it,
> as they search the source code, dig through assembly output, and so on
> trying to work out exactly what this thing does.
> 
> Sorry, but no way.
> 
> Again, unless there is some sort of forward-progress argument or
> similar convincing argument.

Speaking of forward progress...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/atomic_t.txt#n316

Anyway, I'm guessing this will not become part of rcu any time soon. But
for the ring buffer, I would happily take it.

-- Steve
Paul E. McKenney March 1, 2023, 9:29 p.m. UTC | #24
On Wed, Mar 01, 2023 at 03:46:41PM -0500, Steven Rostedt wrote:
> On Wed, 1 Mar 2023 12:36:45 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > Some years down the road, should cmpxchg_success() be on the tip of
> > the tongue of every kernel hacker, perhaps.  Or perhaps not.
> 
> A bit of a catch-22 I would say. It will only become something everyone
> knows if it exists.
> 
> > In the meantime, we have yet another abysmally documented atomic
> 
> Is it?

It sure is.

> > operation that is not well known throughout the community.  And then the
> > people coming across this curse everyone who had anything to do with it,
> > as they search the source code, dig through assembly output, and so on
> > trying to work out exactly what this thing does.
> > 
> > Sorry, but no way.
> > 
> > Again, unless there is some sort of forward-progress argument or
> > similar convincing argument.
> 
> Speaking of forward progress...
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/atomic_t.txt#n316

Yes, that is probably the best starting point.

And if you have been around long enough, you know that this is in fact
the best starting point.  Plus if you can correctly interpret the words
in that document.  And if you are familiar with the entire document.
But otherwise, good luck learning the semantics for something like
atomic_fetch_add_release().

> Anyway, I'm guessing this will not become part of rcu any time soon. But
> for the ring buffer, I would happily take it.

Certainly not unless someone comes up with a good reason for it.

							Thanx, Paul
Patch

diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b10b8349bb2a..d81c88e66b42 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -760,7 +760,7 @@  static void check_cpu_stall(struct rcu_data *rdp)
 	jn = jiffies + ULONG_MAX / 2;
 	if (rcu_gp_in_progress() &&
 	    (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
-	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+	    try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
 
 		/*
 		 * If a virtual machine is stopped by the host it can look to
@@ -778,7 +778,7 @@  static void check_cpu_stall(struct rcu_data *rdp)
 
 	} else if (rcu_gp_in_progress() &&
 		   ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
-		   cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+		   try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
 
 		/*
 		 * If a virtual machine is stopped by the host it can look to