[v3,0/5] Implement call_rcu_lazy() and miscellaneous fixes

Message ID 20220713213237.1596225-1-joel@joelfernandes.org (mailing list archive)

Message

Joel Fernandes July 13, 2022, 9:32 p.m. UTC
Hello!

Please find the next improved version of call_rcu_lazy() attached.  The main
differences from the previous versions are:
- v2 fixed an rcu_barrier() hang (which I found to be due to a missing GP
  thread wakeup); in this version the wakeup is limited to rcu_barrier() only,
  as requested by Paul.
- Fixed checkpatch and build robot issues.
- Some more changes to 'lazy' parameter passing and consolidation of segcblist
  functions.
- More testing via rcutorture and rcuscale.

Note that the power tests below were run on the v2 patches.  I expect similar
power improvements from this version, but I have not yet tested power with it.

Following are the power savings we saw on top of RCU_NOCB_CPU on an Intel
platform with v2.  The observation is that, due to a 'trickle down' effect of
RCU callbacks, the system is very lightly loaded but constantly runs a few RCU
callbacks very often.  This misleads the power management hardware into
treating the system as active when it is in fact idle.

For example, when the ChromeOS screen is off and the user is not doing anything
on the system, we can see big power savings.
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58
CorWatt = 0.04

After:
Pk%pc10 = 81.28
PkgWatt = 0.41
CorWatt = 0.03
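
For a rough sense of scale, the before/after figures above can be turned into
relative changes.  The following is only a minimal sketch in C; both helper
names are hypothetical and are not part of the series:

```c
#include <assert.h>

/* Hypothetical helper: percentage change going from 'before' to 'after'. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);	/* a relative change needs a non-zero baseline */
	return (after - before) / before * 100.0;
}

/* Hypothetical helper: difference in percentage points, for values that
 * are already percentages (such as Pk%pc10). */
static double point_delta(double before, double after)
{
	return after - before;
}
```

With the numbers above, pct_change(0.58, 0.41) comes out at roughly -29%
for PkgWatt, and point_delta(72.13, 81.28) is 9.15 percentage points for
Pk%pc10.  CorWatt is reported to only two decimal places, so its relative
change is much less reliable.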

Further, when the ChromeOS screen is ON but the system is idle or lightly
loaded, we can see that the display pipeline is constantly queuing RCU
callbacks due to open/close of file descriptors associated with graphics
buffers. This is attributed to the file_free_rcu() path, which this patch
series also touches.

On memory pressure, on timeout, or when the queue grows too big, we initiate a
flush of the bypass lists holding the lazy CBs.
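
The three flush triggers can be pictured as a single predicate.  This is a
userspace sketch only: the struct layouts, field names, and thresholds are all
assumptions made for illustration, and the real logic in tree_nocb.h may be
organized quite differently.

```c
#include <stdbool.h>

/* Illustrative state for one lazy bypass list (all names are assumptions). */
struct lazy_list {
	unsigned long nr_lazy;		/* lazy callbacks currently queued */
	unsigned long first_queued;	/* tick at which the oldest one was queued */
};

/* Illustrative limits (assumptions, not values from the patches). */
struct lazy_limits {
	unsigned long max_queued;	/* flush once the list grows past this */
	unsigned long max_age;		/* flush once the oldest CB is this old */
};

/* Return true if the lazy callbacks should be flushed now. */
static bool lazy_should_flush(const struct lazy_list *l,
			      const struct lazy_limits *lim,
			      unsigned long now, bool memory_pressure)
{
	if (l->nr_lazy == 0)
		return false;		/* nothing queued, nothing to flush */
	if (memory_pressure)
		return true;		/* trigger 1: memory pressure */
	if (l->nr_lazy > lim->max_queued)
		return true;		/* trigger 2: queue grew too big */
	/* trigger 3: timeout; unsigned subtraction tolerates counter wrap */
	return now - l->first_queued >= lim->max_age;
}
```

Keeping the check in one place makes it easy to see that an empty list is never
flushed and that any one of the three conditions is sufficient.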

Similar results can be achieved by increasing jiffies_till_first_fqs, however
that also has the effect of slowing down RCU. In particular, I saw a huge
slowdown of the function graph tracer when increasing it. That may be possible
to fix via the rcu_expedited=1 boot parameter, but call_rcu_lazy() offers an
alternative to slowing down ALL call_rcu() users globally. Furthermore, the
jiffies_till_first_fqs approach still causes a wakeup of the main RCU GP
kthread, whereas with this work we delay even those wakeups.
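
To get a feel for why batching delays wakeups, here is a toy userspace model
in C.  It is only a sketch: the function name, the tick-based clock, and both
thresholds are assumptions made for illustration and do not correspond to the
actual patches.

```c
/*
 * Toy model: nr_cbs lazy callbacks arrive 'gap' ticks apart.  A batch is
 * flushed (one wakeup) when its oldest callback is at least 'flush_after'
 * ticks old, or when it reaches 'max_batch' callbacks.
 * Returns the number of wakeups.
 */
static unsigned long lazy_wakeups(unsigned long nr_cbs, unsigned long gap,
				  unsigned long flush_after,
				  unsigned long max_batch)
{
	unsigned long wakeups = 0, queued = 0, first = 0, t, i;

	for (i = 0; i < nr_cbs; i++) {
		t = i * gap;
		/* An older batch that timed out is flushed before this arrival. */
		if (queued && t - first >= flush_after) {
			wakeups++;
			queued = 0;
		}
		if (!queued)
			first = t;	/* this callback starts a new batch */
		queued++;
		if (queued >= max_batch) {	/* the batch grew too big */
			wakeups++;
			queued = 0;
		}
	}
	if (queued)
		wakeups++;	/* the last batch is eventually flushed by timeout */
	return wakeups;
}
```

In this model, 100 callbacks arriving 10 ticks apart with a 100-tick flush
timeout and a large batch limit produce 10 wakeups, versus up to 100 if each
callback were handled on its own.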

One drawback of this series is that if another frequent RCU callback that is
not lazy creeps up in the future, it will again hurt power. However, I believe
identifying and fixing those is a more reasonable approach than slowing RCU
down for the whole system.

Disclaimer: I have intentionally not CC'd other subsystem maintainers (like
net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
of review and agreements.

Joel Fernandes (Google) (4):
  rcu: Introduce call_rcu_lazy() API implementation
  rcuscale: Add laziness and kfree tests
  fs: Move call_rcu() to call_rcu_lazy() in some paths
  rcutorture: Add test code for call_rcu_lazy()

Vineeth Pillai (1):
  rcu: shrinker for lazy rcu

 fs/dcache.c                                   |   4 +-
 fs/eventpoll.c                                |   2 +-
 fs/file_table.c                               |   2 +-
 fs/inode.c                                    |   2 +-
 include/linux/rcu_segcblist.h                 |   1 +
 include/linux/rcupdate.h                      |   6 +
 kernel/rcu/Kconfig                            |   8 +
 kernel/rcu/rcu.h                              |  12 +
 kernel/rcu/rcu_segcblist.c                    |  15 +-
 kernel/rcu/rcu_segcblist.h                    |  20 +-
 kernel/rcu/rcuscale.c                         |  74 +++++-
 kernel/rcu/rcutorture.c                       |  60 ++++-
 kernel/rcu/tree.c                             | 132 ++++++----
 kernel/rcu/tree.h                             |  10 +-
 kernel/rcu/tree_nocb.h                        | 239 ++++++++++++++----
 .../selftests/rcutorture/configs/rcu/TREE11   |  18 ++
 .../rcutorture/configs/rcu/TREE11.boot        |   8 +
 17 files changed, 508 insertions(+), 105 deletions(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

Comments

Paul E. McKenney July 14, 2022, 8:51 p.m. UTC | #1
On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
> Hello!
> 
> Please find the next improved version of call_rcu_lazy() attached.  The main
> difference between the previous versions is that:
> - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
>   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
>   requested by Paul.
> - Fixed checkpatch and build robot issues.
> - Some more changes to 'lazy' parameter passing and consolidation of segcblist
>   functions.
> - more testing via rcutorture and rcuscale.

Thank you!  What I am going to do is to pull these into an experimental
not-for-mainline branch and run the usual set of rcutorture tests.
I will then take a look at the patches.

> Note that these tests were run on v2 patches, I am expecting similar power
> improvements however I've not yet tested power.
> 
> Following are power savings we saw on top of RCU_NOCB_CPU on an Intel platform
> in v2.  The observation is that due to a 'trickle down' effect of RCU
> callbacks, the system is very lightly loaded but constantly running few RCU
> callbacks very often. This confuses the power management hardware that the
> system is active, when it is in fact idle.
> 
> For example, when ChromeOS screen is off and user is not doing anything on the
> system, we can see big power savings.
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> CorWatt = 0.04
> 
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41
> CorWatt = 0.03

When you update these numbers, please explain what they all are and
evaluate them in the cover letter (or in the relevant patch's commit log).
For final submission, please also include some estimate of the variance.
For example, CorWatt might be essentially the same both before and after,
as in 0.035 and 0.034, or there might be a large difference, as in 0.044
and 0.025.  The 81.28 might be constant in all four digits (ha!), or it
might vary between (say) 80 and 83.  And so on.
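
One simple way to produce such an estimate is to repeat each measurement
several times and report the mean together with the sample variance.  A
minimal sketch in C follows; the helper names are hypothetical and the caller
supplies the per-run readings:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: arithmetic mean of n samples (n must be > 0). */
static double sample_mean(const double *x, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Hypothetical helper: unbiased sample variance (n must be > 1). */
static double sample_variance(const double *x, size_t n)
{
	double m, acc = 0.0;
	size_t i;

	assert(n > 1);
	m = sample_mean(x, n);
	for (i = 0; i < n; i++)
		acc += (x[i] - m) * (x[i] - m);
	return acc / (double)(n - 1);	/* divide by n - 1, not n */
}
```

Reporting these two values for the "before" and "after" runs of each metric
would show whether a difference such as 0.04 versus 0.03 is larger than the
run-to-run noise.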

Based on our earlier emails, my guess is that Pk%pc10 is the percent of
time that the system is in a low-power state (bigger is better), PkgWatt
is power consumed by the CPU chip (smaller is better), and CorWatt is
power consumed by the CPU core (again, smaller is better).

It all might sound painfully obvious to you right now, but please have
pity on future readers of this series.

> Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> can see that the display pipeline is constantly doing RCU callback queuing due
> to open/close of file descriptors associated with graphics buffers. This is
> attributed to the file_free_rcu() path which this patch series also touches.
> 
> On memory pressure, timeout or queue growing too big, we initiate a flush of
> the bypass lists holding the lazy CBs.
> 
> Similar results can be achieved by increasing jiffies_till_first_fqs, however
> that also has the effect of slowing down RCU. Especially I saw huge slow down

In the final submission, please quantify "huge slow down".  ;-)

> of function graph tracer when increasing that. That may be possible to fix via
> rcu_expedited=1 boot parameter, however call_rcu_lazy() provides another option
> over slowing down ALL call_rcu() globally. Further using jiffies_till_first_fqs
> approach will still cause a wake up of the main RCU GP kthread, with this work
> we delay even those wakeups.
> 
> One drawback of this series is, if another frequent RCU callback creeps up in
> the future, that's not lazy, then that will again hurt the power. However, I
> believe identifying and fixing those is a more reasonable approach than slowing
> RCU down for the whole system.

Like I said earlier, you are the official call_rcu_lazy() whack-a-mole
developer.  ;-)

							Thanx, Paul

Joel Fernandes July 14, 2022, 9:33 p.m. UTC | #2
On 7/14/2022 4:51 PM, Paul E. McKenney wrote:
> On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
>> Hello!
>>
>> Please find the next improved version of call_rcu_lazy() attached.  The main
>> difference between the previous versions is that:
>> - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
>>   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
>>   requested by Paul.
>> - Fixed checkpatch and build robot issues.
>> - Some more changes to 'lazy' parameter passing and consolidation of segcblist
>>   functions.
>> - more testing via rcutorture and rcuscale.
> 
> Thank you!  What I am going to do is to pull these into an experimental
> not-for-mainline branch and run the usual set of rcutorture tests.
> I will then take a look at the patches.

Thanks, that sounds great.

> 
>> Note that these tests were run on v2 patches, I am expecting similar power
>> improvements however I've not yet tested power.
>>

>> Following are power savings we saw on top of RCU_NOCB_CPU on an Intel platform
>> in v2.  The observation is that due to a 'trickle down' effect of RCU
>> callbacks, the system is very lightly loaded but constantly running few RCU
>> callbacks very often. This confuses the power management hardware that the
>> system is active, when it is in fact idle.
>>
>> For example, when ChromeOS screen is off and user is not doing anything on the
>> system, we can see big power savings.
>> Before:
>> Pk%pc10 = 72.13
>> PkgWatt = 0.58
>> CorWatt = 0.04
>>
>> After:
>> Pk%pc10 = 81.28
>> PkgWatt = 0.41
>> CorWatt = 0.03
> 
> When you update these numbers, please explain what they all are and
> evaluate them in the cover letter (or in the relevant patch's commit log).
> For final submission, please also include some estimate of the variance.
> For example, CorWatt might be essentially the same both before and after,
> as in 0.035 and 0.034, or there might be a large difference, as in 0.044
> and 0.025.  The 81.28 might be constant in all four digits (ha!), or it
> might vary between (say) 80 and 83.  And so on.

Sure thanks for the suggestions and will do.

> 
> Based on our earlier emails, my guess is that Pk%pc10 is the percent of
> time that the system is in a low-power state (bigger is better), PkgWatt
> is power consumed by the CPU chip (smaller is better), and CorWatt is
> power consumed by the CPU core (again, smaller is better).

Yes that's correct.


>> Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
>> can see that the display pipeline is constantly doing RCU callback queuing due
>> to open/close of file descriptors associated with graphics buffers. This is
>> attributed to the file_free_rcu() path which this patch series also touches.
>>
>> On memory pressure, timeout or queue growing too big, we initiate a flush of
>> the bypass lists holding the lazy CBs.
>>
>> Similar results can be achieved by increasing jiffies_till_first_fqs, however
>> that also has the effect of slowing down RCU. Especially I saw huge slow down
> 
> In the final submission, please quantify "huge slow down".  ;-)

Sure, will do. IIRC it was something like 30 seconds to stop the function
graph tracer versus the usual 2-3 seconds.

>> of function graph tracer when increasing that. That may be possible to fix via
>> rcu_expedited=1 boot parameter, however call_rcu_lazy() provides another option
>> over slowing down ALL call_rcu() globally. Further using jiffies_till_first_fqs
>> approach will still cause a wake up of the main RCU GP kthread, with this work
>> we delay even those wakeups.
>>
>> One drawback of this series is, if another frequent RCU callback creeps up in
>> the future, that's not lazy, then that will again hurt the power. However, I
>> believe identifying and fixing those is a more reasonable approach than slowing
>> RCU down for the whole system.
> 
> Like I said earlier, you are the official call_rcu_lazy() whack-a-mole
> developer.  ;-)

Haha..

Thanks,

- Joel
Paul E. McKenney July 14, 2022, 10:21 p.m. UTC | #3
On Thu, Jul 14, 2022 at 01:51:54PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
> > Hello!
> > 
> > Please find the next improved version of call_rcu_lazy() attached.  The main
> > difference between the previous versions is that:
> > - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
> >   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
> >   requested by Paul.
> > - Fixed checkpatch and build robot issues.
> > - Some more changes to 'lazy' parameter passing and consolidation of segcblist
> >   functions.
> > - more testing via rcutorture and rcuscale.
> 
> Thank you!  What I am going to do is to pull these into an experimental
> not-for-mainline branch and run the usual set of rcutorture tests.
> I will then take a look at the patches.

And there were a few conflicts with the nocb patch series in -rcu.
The allegedly conflict-resolved series is here: joel.2022.07.14a
Please let me know if I messed something up.

Again, this is an experimental series of commits for my testing.  And for
anyone else who would like to test against -rcu, for that matter.

						Thanx, Paul

> > Note that these tests were run on v2 patches, I am expecting similar power
> > improvements however I've not yet tested power.
> > 
> > Following are power savings we saw on top of RCU_NOCB_CPU on an Intel platform
> > in v2.  The observation is that due to a 'trickle down' effect of RCU
> > callbacks, the system is very lightly loaded but constantly running few RCU
> > callbacks very often. This confuses the power management hardware that the
> > system is active, when it is in fact idle.
> > 
> > For example, when ChromeOS screen is off and user is not doing anything on the
> > system, we can see big power savings.
> > Before:
> > Pk%pc10 = 72.13
> > PkgWatt = 0.58
> > CorWatt = 0.04
> > 
> > After:
> > Pk%pc10 = 81.28
> > PkgWatt = 0.41
> > CorWatt = 0.03
> 
> When you update these numbers, please explain what they all are and
> evaluate them in the cover letter (or in the relevant patch's commit log).
> For final submission, please also include some estimate of the variance.
> For example, CorWatt might be essentially the same both before and after,
> as in 0.035 and 0.034, or there might be a large difference, as in 0.044
> and 0.025.  The 81.28 might be constant in all four digits (ha!), or it
> might vary between (say) 80 and 83.  And so on.
> 
> Based on our earlier emails, my guess is that Pk%pc10 is the percent of
> time that the system is in a low-power state (bigger is better), PkgWatt
> is power consumed by the CPU chip (smaller is better), and CorWatt is
> power consumed by the CPU core (again, smaller is better).
> 
> It all might sound painfully obvious to you right now, but please have
> pity on future readers of this series.
> 
> > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > can see that the display pipeline is constantly doing RCU callback queuing due
> > to open/close of file descriptors associated with graphics buffers. This is
> > attributed to the file_free_rcu() path which this patch series also touches.
> > 
> > On memory pressure, timeout or queue growing too big, we initiate a flush of
> > the bypass lists holding the lazy CBs.
> > 
> > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > that also has the effect of slowing down RCU. Especially I saw huge slow down
> 
> In the final submission, please quantify "huge slow down".  ;-)
> 
> > of function graph tracer when increasing that. That may be possible to fix via
> > rcu_expedited=1 boot parameter, however call_rcu_lazy() provides another option
> > over slowing down ALL call_rcu() globally. Further using jiffies_till_first_fqs
> > approach will still cause a wake up of the main RCU GP kthread, with this work
> > we delay even those wakeups.
> > 
> > One drawback of this series is, if another frequent RCU callback creeps up in
> > the future, that's not lazy, then that will again hurt the power. However, I
> > believe identifying and fixing those is a more reasonable approach than slowing
> > RCU down for the whole system.
> 
> Like I said earlier, you are the official call_rcu_lazy() whack-a-mole
> developer.  ;-)
> 
> 							Thanx, Paul
> 
Joel Fernandes July 15, 2022, 3:18 p.m. UTC | #4
On Thu, Jul 14, 2022 at 03:21:31PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 14, 2022 at 01:51:54PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
> > > Hello!
> > > 
> > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > difference between the previous versions is that:
> > > - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
> > >   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
> > >   requested by Paul.
> > > - Fixed checkpatch and build robot issues.
> > > - Some more changes to 'lazy' parameter passing and consolidation of segcblist
> > >   functions.
> > > - more testing via rcutorture and rcuscale.
> > 
> > Thank you!  What I am going to do is to pull these into an experimental
> > not-for-mainline branch and run the usual set of rcutorture tests.
> > I will then take a look at the patches.
> 
> And there were a few conflicts with the nocb patch series in -rcu.
> The allegedly conflict-resolved series is here: joel.2022.07.14a
> Please let me know if I messed something up.

Thanks, it looks OK. There is one robot fix for the hexagon arch, where I think
TREE_RCU is disabled; could you apply the diff below to patch 1/5?

Or, I can also just keep it in my version of 1/5 to go out with the next rev.

---8<-----------------------

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index c358387fd223..aa3243e49506 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -464,6 +464,14 @@ enum rcutorture_type {
 	INVALID_RCU_FLAVOR
 };
 
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 			    unsigned long *gp_seq);
@@ -475,14 +483,6 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
 void rcu_gp_set_torture_wait(int duration);
 void rcu_force_call_rcu_to_lazy(bool force);
 
-#if defined(CONFIG_RCU_LAZY)
-unsigned long rcu_lazy_get_jiffies_till_flush(void);
-void rcu_lazy_set_jiffies_till_flush(unsigned long j);
-#else
-static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
-static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
-#endif
-
 #else
 static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
 					  int *flags, unsigned long *gp_seq)
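
The diff moves the CONFIG_RCU_LAZY declarations and their stubs out of the
CONFIG_TREE_RCU block, so they stay visible when TREE_RCU is not set.  The
stub pattern itself can be tried outside the kernel.  The following is a
userspace sketch in C, assuming CONFIG_RCU_LAZY is left undefined so that the
inline stubs are what gets compiled; the caller at the bottom is hypothetical:

```c
/*
 * Userspace sketch of the pattern in the diff above.  CONFIG_RCU_LAZY is
 * deliberately left undefined here, so the #else branch is compiled.
 * Defining it would require real definitions to link against.
 */
#if defined(CONFIG_RCU_LAZY)
unsigned long rcu_lazy_get_jiffies_till_flush(void);
void rcu_lazy_set_jiffies_till_flush(unsigned long j);
#else
static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
#endif

/*
 * Hypothetical caller: it builds the same way whether or not the option is
 * set, because the stubs keep the same signatures as the real functions.
 */
static unsigned long save_and_set_flush_delay(unsigned long new_delay)
{
	unsigned long old = rcu_lazy_get_jiffies_till_flush();

	rcu_lazy_set_jiffies_till_flush(new_delay);
	return old;
}
```

With the stubs in effect the getter always returns 0 and the setter is a
no-op, so callers need no #ifdef of their own.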
Paul E. McKenney July 15, 2022, 3:29 p.m. UTC | #5
On Fri, Jul 15, 2022 at 03:18:04PM +0000, Joel Fernandes wrote:
> On Thu, Jul 14, 2022 at 03:21:31PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 14, 2022 at 01:51:54PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
> > > > Hello!
> > > > 
> > > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > > difference between the previous versions is that:
> > > > - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
> > > >   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
> > > >   requested by Paul.
> > > > - Fixed checkpatch and build robot issues.
> > > > - Some more changes to 'lazy' parameter passing and consolidation of segcblist
> > > >   functions.
> > > > - more testing via rcutorture and rcuscale.
> > > 
> > > Thank you!  What I am going to do is to pull these into an experimental
> > > not-for-mainline branch and run the usual set of rcutorture tests.
> > > I will then take a look at the patches.
> > 
> > And there were a few conflicts with the nocb patch series in -rcu.
> > The allegedly conflict-resolved series is here: joel.2022.07.14a
> > Please let me know if I messed something up.
> 
> Thanks, it looks Ok. There is one robot fix for hexagon's arch where I think
> TREE_RCU is disabled, could you apply the diff below to patch 1/5 ?
> 
> Or, I can also just keep it in my version of 1/5 to go out with the next rev.

Given that I am not testing on hexagon, I will let you fix this one on
the next rev.  If someone out there is testing this branch on hexagon,
they should feel free to apply your patch locally.  ;-)

							Thanx, Paul

Joel Fernandes July 15, 2022, 3:40 p.m. UTC | #6
On Fri, Jul 15, 2022 at 08:29:37AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 15, 2022 at 03:18:04PM +0000, Joel Fernandes wrote:
> > On Thu, Jul 14, 2022 at 03:21:31PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 14, 2022 at 01:51:54PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
> > > > > Hello!
> > > > > 
> > > > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > > > difference between the previous versions is that:
> > > > > - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
> > > > >   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
> > > > >   requested by Paul.
> > > > > - Fixed checkpatch and build robot issues.
> > > > > - Some more changes to 'lazy' parameter passing and consolidation of segcblist
> > > > >   functions.
> > > > > - more testing via rcutorture and rcuscale.
> > > > 
> > > > Thank you!  What I am going to do is to pull these into an experimental
> > > > not-for-mainline branch and run the usual set of rcutorture tests.
> > > > I will then take a look at the patches.
> > > 
> > > And there were a few conflicts with the nocb patch series in -rcu.
> > > The allegedly conflict-resolved series is here: joel.2022.07.14a
> > > Please let me know if I messed something up.
> > 
> > Thanks, it looks Ok. There is one robot fix for hexagon's arch where I think
> > TREE_RCU is disabled, could you apply the diff below to patch 1/5 ?
> > 
> > Or, I can also just keep it in my version of 1/5 to go out with the next rev.
> 
> Given that I am not testing on hexagon, I will let you fix this one on
> the next rev.  If someone out there is testing this branch on hexagon,
> they should feel free to apply your patch locally.  ;-)

I am pretty sure this is the feature that was going to make Hexagon widely
adopted, but your call ;-)

thanks,

 - Joel
Joel Fernandes July 15, 2022, 3:50 p.m. UTC | #7
On Fri, Jul 15, 2022 at 03:40:33PM +0000, Joel Fernandes wrote:
> On Fri, Jul 15, 2022 at 08:29:37AM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 15, 2022 at 03:18:04PM +0000, Joel Fernandes wrote:
> > > On Thu, Jul 14, 2022 at 03:21:31PM -0700, Paul E. McKenney wrote:
> > > > On Thu, Jul 14, 2022 at 01:51:54PM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
> > > > > > Hello!
> > > > > > 
> > > > > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > > > > difference between the previous versions is that:
> > > > > > - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
> > > > > >   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
> > > > > >   requested by Paul.
> > > > > > - Fixed checkpatch and build robot issues.
> > > > > > - Some more changes to 'lazy' parameter passing and consolidation of segcblist
> > > > > >   functions.
> > > > > > - more testing via rcutorture and rcuscale.
> > > > > 
> > > > > Thank you!  What I am going to do is to pull these into an experimental
> > > > > not-for-mainline branch and run the usual set of rcutorture tests.
> > > > > I will then take a look at the patches.
> > > > 
> > > > And there were a few conflicts with the nocb patch series in -rcu.
> > > > The allegedly conflict-resolved series is here: joel.2022.07.14a
> > > > Please let me know if I messed something up.
> > > 
> > > Thanks, it looks Ok. There is one robot fix for hexagon's arch where I think
> > > TREE_RCU is disabled, could you apply the diff below to patch 1/5 ?
> > > 
> > > Or, I can also just keep it in my version of 1/5 to go out with the next rev.
> > 
> > Given that I am not testing on hexagon, I will let you fix this one on
> > the next rev.  If someone out there is testing this branch on hexagon,
> > they should feel free to apply your patch locally.  ;-)
> 
> I am pretty sure this is the feature that was going to make Hexagon widely
> adopted, but your call ;-)

Jokes apart, I am sure this feature will be useful to a lot of folks and
architectures, but let's keep this diff for the next revision as you said.

Actually never heard of Hexagon till today... 32-bit and 4x VLIW, Cool! :
https://en.wikipedia.org/wiki/Qualcomm_Hexagon

thanks,

 - Joel
Paul E. McKenney July 15, 2022, 5:17 p.m. UTC | #8
On Fri, Jul 15, 2022 at 03:50:17PM +0000, Joel Fernandes wrote:
> On Fri, Jul 15, 2022 at 03:40:33PM +0000, Joel Fernandes wrote:
> > On Fri, Jul 15, 2022 at 08:29:37AM -0700, Paul E. McKenney wrote:
> > > On Fri, Jul 15, 2022 at 03:18:04PM +0000, Joel Fernandes wrote:
> > > > On Thu, Jul 14, 2022 at 03:21:31PM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Jul 14, 2022 at 01:51:54PM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
> > > > > > > Hello!
> > > > > > > 
> > > > > > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > > > > > difference between the previous versions is that:
> > > > > > > - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
> > > > > > >   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
> > > > > > >   requested by Paul.
> > > > > > > - Fixed checkpatch and build robot issues.
> > > > > > > - Some more changes to 'lazy' parameter passing and consolidation of segcblist
> > > > > > >   functions.
> > > > > > > - more testing via rcutorture and rcuscale.
> > > > > > 
> > > > > > Thank you!  What I am going to do is to pull these into an experimental
> > > > > > not-for-mainline branch and run the usual set of rcutorture tests.
> > > > > > I will then take a look at the patches.
> > > > > 
> > > > > And there were a few conflicts with the nocb patch series in -rcu.
> > > > > The allegedly conflict-resolved series is here: joel.2022.07.14a
> > > > > Please let me know if I messed something up.
> > > > 
> > > > Thanks, it looks Ok. There is one robot fix for hexagon's arch where I think
> > > > TREE_RCU is disabled, could you apply the diff below to patch 1/5 ?
> > > > 
> > > > Or, I can also just keep it in my version of 1/5 to go out with the next rev.
> > > 
> > > Given that I am not testing on hexagon, I will let you fix this one on
> > > the next rev.  If someone out there is testing this branch on hexagon,
> > > they should feel free to apply your patch locally.  ;-)
> > 
> > I am pretty sure this is the feature that was going to make Hexagon widely
> > adopted, but your call ;-)

Then they should be highly motivated to apply your patch.  ;-)

> Jokes apart, I am sure this feature will be useful to a lot of folks and
> architectures but let's keep this diff for the next revision as you said.
> 
> Actually never heard of Hexagon till today... 32-bit and 4x VLIW, Cool! :
> https://en.wikipedia.org/wiki/Qualcomm_Hexagon

Indeed, there are still about 20 different types of processors running Linux.

							Thanx, Paul
Joel Fernandes Aug. 23, 2022, 5:19 p.m. UTC | #9
On 7/14/2022 4:51 PM, Paul E. McKenney wrote:
> On Wed, Jul 13, 2022 at 09:32:32PM +0000, Joel Fernandes (Google) wrote:
>> Hello!
>>
>> Please find the next improved version of call_rcu_lazy() attached.  The main
>> difference between the previous versions is that:
>> - In v2 rcu_barrier is fixed to not hang (I found this to be due to a missing
>>   GP thread wakeup), now I am limiting this wake up only to rcu_barrier() as
>>   requested by Paul.
>> - Fixed checkpatch and build robot issues.
>> - Some more changes to 'lazy' parameter passing and consolidation of segcblist
>>   functions.
>> - more testing via rcutorture and rcuscale.
> 
> Thank you!  What I am going to do is to pull these into an experimental
> not-for-mainline branch and run the usual set of rcutorture tests.
> I will then take a look at the patches.
> 
>> Note that these tests were run on v2 patches, I am expecting similar power
>> improvements however I've not yet tested power.
>>
>> Following are power savings we saw on top of RCU_NOCB_CPU on an Intel platform
>> in v2.  The observation is that due to a 'trickle down' effect of RCU
>> callbacks, the system is very lightly loaded but constantly running few RCU
>> callbacks very often. This confuses the power management hardware that the
>> system is active, when it is in fact idle.
>>
>> For example, when ChromeOS screen is off and user is not doing anything on the
>> system, we can see big power savings.
>> Before:
>> Pk%pc10 = 72.13
>> PkgWatt = 0.58
>> CorWatt = 0.04
>>
>> After:
>> Pk%pc10 = 81.28
>> PkgWatt = 0.41
>> CorWatt = 0.03
> 
> When you update these numbers, please explain what they all are and
> evaluate them in the cover letter (or in the relevant patch's commit log).
> For final submission, please also include some estimate of the variance.
> For example, CorWatt might be essentially the same both before and after,
> as in 0.035 and 0.034, or there might be a large difference, as in 0.044
> and 0.025.  The 81.28 might be constant in all four digits (ha!), or it
> might vary between (say) 80 and 83.  And so on.
> 
> Based on our earlier emails, my guess is that Pk%pc10 is the percent of
> time that the system is in a low-power state (bigger is better), PkgWatt
> is power consumed by the CPU chip (smaller is better), and CorWatt is
> power consumed by the CPU core (again, smaller is better).
> 
> It all might sound painfully obvious to you right now, but please have
> pity on future readers of this series.

Following is some power data I collected with turbostat on an x86 ChromeOS
ADL machine. Power on the upstream kernel already looks better than in my
earlier tests, but I still see a 5-8% power improvement with the patches.

Rushikesh, Sitanshu, Neeraj, Vlad - could you test power with these as well?
The following branch boots on ChromeOS ADL:
https://github.com/joelagnel/linux-kernel/tree/chromeos-kernelupstream-5.19-rc4.aug22

This is the output from turbostat, run as:
turbostat -S -s PkgWatt,CorWatt --interval 5
PkgWatt - summary of package power in Watts, sampled over a 5-second interval.
CorWatt - summary of core power in Watts, sampled over a 5-second interval.

I could not get Pk%pc10 on the upstream kernel; it always shows 0. Maybe that
is a turbostat version issue or something.

+----------+-------------------------+-------------------------+
|          | Screen on idle (After)  | Screen on idle (Before) |
+----------+------------+------------+------------+------------+
|          | PkgWatt    | CorWatt    | PkgWatt    | CorWatt    |
+----------+------------+------------+------------+------------+
|          | 0.6100     | 0.0500     | 0.6700     | 0.0500     |
|          | 0.5800     | 0.0400     | 0.6200     | 0.0500     |
|          | 0.5800     | 0.0400     | 0.6500     | 0.0500     |
|          | 0.6600     | 0.0800     | 0.6200     | 0.0500     |
|          | 0.5900     | 0.0400     | 0.6200     | 0.0500     |
|          | 0.5900     | 0.0400     | 0.6400     | 0.0600     |
|          | 0.6000     | 0.0500     | 0.6700     | 0.0600     |
|          | 0.6400     | 0.0600     | 0.6200     | 0.0500     |
|          | 0.6000     | 0.0500     | 0.6200     | 0.0500     |
|          | 0.5800     | 0.0400     | 0.6200     | 0.0500     |
|          | 0.6000     | 0.0400     | 0.6200     | 0.0500     |
|          | 0.6000     | 0.0500     | 0.7200     | 0.0900     |
+----------+------------+------------+------------+------------+
| Variance | 0.0006     | 0.0001     | 0.0010     | 0.0001     |
| Mean     | 0.6025     | 0.0483     | 0.6408     | 0.0550     |
+----------+------------+------------+------------+------------+
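
For anyone who wants to re-check the summary rows, here is a minimal Python
sketch that recomputes the mean and variance of the "Screen on idle" PkgWatt
samples and the relative saving. The samples are copied from the table above.
The use of the sample variance (n - 1 divisor) is an assumption on my part,
and the names after_pkg, before_pkg and summarize are only illustrative.

```python
from statistics import mean, variance

# PkgWatt samples (Watts) copied from the "Screen on idle" table.
after_pkg = [0.61, 0.58, 0.58, 0.66, 0.59, 0.59,
             0.60, 0.64, 0.60, 0.58, 0.60, 0.60]
before_pkg = [0.67, 0.62, 0.65, 0.62, 0.62, 0.64,
              0.67, 0.62, 0.62, 0.62, 0.62, 0.72]


def summarize(samples):
    """Return (mean, sample variance) for a list of power readings."""
    # Assumption: sample variance (n - 1 divisor), not population variance.
    return mean(samples), variance(samples)


after_mean, after_var = summarize(after_pkg)
before_mean, before_var = summarize(before_pkg)

# Relative package-power saving of "After" versus "Before", in percent.
saving_pct = 100.0 * (before_mean - after_mean) / before_mean

print(f"After:  mean={after_mean:.4f} W  variance={after_var:.4f}")
print(f"Before: mean={before_mean:.4f} W  variance={before_var:.4f}")
print(f"PkgWatt saving: {saving_pct:.1f}%")
```

The same helper can be pointed at the CorWatt columns or at the screen-off
samples by swapping in the corresponding lists.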

+----------+---------------------------+---------------------------+
|          | Lid closed + no suspend   | Lid closed + no suspend   |
|          | Screen off idle (After)   | Screen off idle (Before)  |
+----------+-------------+-------------+-------------+-------------+
|          | PkgWatt     | CorWatt     | PkgWatt     | CorWatt     |
+----------+-------------+-------------+-------------+-------------+
|          | 0.3100      | 0.0700      | 0.3500      | 0.0700      |
|          | 0.3100      | 0.0700      | 0.3200      | 0.0700      |
|          | 0.3300      | 0.0700      | 0.4100      | 0.0900      |
|          | 0.3100      | 0.0600      | 0.3400      | 0.0700      |
|          | 0.3300      | 0.0700      | 0.3200      | 0.0700      |
|          | 0.3800      | 0.0900      | 0.3400      | 0.0700      |
|          | 0.3600      | 0.0700      | 0.3400      | 0.0700      |
|          | 0.3100      | 0.0700      | 0.3300      | 0.0700      |
|          | 0.3300      | 0.0700      | 0.3300      | 0.0800      |
|          | 0.3700      | 0.0800      | 0.3200      | 0.0700      |
|          | 0.3500      | 0.0800      | 0.3400      | 0.0700      |
|          | 0.2900      | 0.0600      | 0.4300      | 0.1100      |
|          | 0.3300      | 0.0700      | 0.3400      | 0.0700      |
|          | 0.3300      | 0.0700      | 0.3200      | 0.0700      |
|          | 0.3400      | 0.0700      | 0.3600      | 0.0800      |
|          | 0.3000      | 0.0700      | 0.3200      | 0.0700      |
|          | 0.3400      | 0.0700      | 0.3300      | 0.0700      |
|          | 0.3400      | 0.0900      | 0.3500      | 0.0800      |
|          | 0.3400      | 0.0700      | 0.3800      | 0.0800      |
+----------+-------------+-------------+-------------+-------------+
| Variance | 0.0005      | 0.0001      | 0.0010      | 0.0001      |
| Mean     | 0.3028      | 0.0664      | 0.3174      | 0.0700      |
+----------+-------------+-------------+-------------+-------------+

Thanks,

 - Joel