[v2,0/8] Implement call_rcu_lazy() and miscellaneous fixes

Message ID 20220622225102.2112026-1-joel@joelfernandes.org (mailing list archive)

Message

Joel Fernandes June 22, 2022, 10:50 p.m. UTC
Hello!
Please find the next, improved version of call_rcu_lazy() attached.  The main
difference from the previous version is that it now uses bypass lists, and
thus handles rcu_barrier() and hotplug situations, with some small changes
to those parts.

I also don't see the TREE07 RCU stall from v1 anymore.

We have some numbers from v1 below (testing on v2 is in progress). Rushikesh,
feel free to pull these patches into your tree. Just to note, you will also
need to pull the call_rcu_lazy() user patches from v1. I have dropped them from
this series to keep it focused on the feature code first.

Following are the power savings we see on top of RCU_NOCB_CPU on an Intel
platform. The observation is that, due to a 'trickle down' effect of RCU
callbacks, the system is very lightly loaded but constantly runs a few RCU
callbacks very often. This fools the power management hardware into treating
the system as active when it is in fact idle.

For example, when the ChromeOS screen is off and the user is not doing anything
on the system, we can see big power savings.
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58
CorWatt = 0.04

After:
Pk%pc10 = 81.28
PkgWatt = 0.41
CorWatt = 0.03

Further, when the ChromeOS screen is ON but the system is idle or lightly
loaded, we can see that the display pipeline is constantly queuing RCU
callbacks due to open/close of file descriptors associated with graphics
buffers. This is attributed to the file_free_rcu() path, which this patch
series also touches.

This patch series adds a simple but effective, and lockless implementation of
RCU callback batching. On memory pressure, on timeout, or when the queue grows
too big, we initiate a flush of one or more per-CPU lists.

Similar results can be achieved by increasing jiffies_till_first_fqs; however,
that also has the effect of slowing down RCU. In particular, I saw a huge
slowdown of the function graph tracer when increasing it.

One drawback of this series is that if another frequent RCU callback that is
not lazy creeps up in the future, it will again hurt power. However, I
believe identifying and fixing those is a more reasonable approach than slowing
RCU down for the whole system.

Disclaimer: I have intentionally not CC'd other subsystem maintainers (like
net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
of review and agreements.

Joel Fernandes (Google) (7):
  rcu: Introduce call_rcu_lazy() API implementation
  fs: Move call_rcu() to call_rcu_lazy() in some paths
  rcu/nocb: Add option to force all call_rcu() to lazy
  rcu/nocb: Wake up gp thread when flushing
  rcuscale: Add test for using call_rcu_lazy() to emulate kfree_rcu()
  rcu/nocb: Rewrite deferred wake up logic to be more clean
  rcu/kfree: Fix kfree_rcu_shrink_count() return value

Vineeth Pillai (1):
  rcu: shrinker for lazy rcu

 fs/dcache.c                   |   4 +-
 fs/eventpoll.c                |   2 +-
 fs/file_table.c               |   2 +-
 fs/inode.c                    |   2 +-
 include/linux/rcu_segcblist.h |   1 +
 include/linux/rcupdate.h      |   6 +
 kernel/rcu/Kconfig            |   8 ++
 kernel/rcu/rcu.h              |   8 ++
 kernel/rcu/rcu_segcblist.c    |  19 +++
 kernel/rcu/rcu_segcblist.h    |  24 ++++
 kernel/rcu/rcuscale.c         |  64 +++++++++-
 kernel/rcu/tree.c             |  35 +++++-
 kernel/rcu/tree.h             |  10 +-
 kernel/rcu/tree_nocb.h        | 217 +++++++++++++++++++++++++++-------
 14 files changed, 345 insertions(+), 57 deletions(-)

Comments

Paul E. McKenney June 26, 2022, 3:12 a.m. UTC | #1
On Wed, Jun 22, 2022 at 10:50:53PM +0000, Joel Fernandes (Google) wrote:
> 
> Hello!
> Please find the next improved version of call_rcu_lazy() attached.  The main
> difference between the previous version is that it is now using bypass lists,
> and thus handling rcu_barrier() and hotplug situations, with some small changes
> to those parts.
> 
> I also don't see the TREE07 RCU stall from v1 anymore.
> 
> We have some numbers from v1 below (testing on v2 is in progress). Rushikesh,
> feel free to pull these patches into your tree. Just to note, you will also
> need to pull the call_rcu_lazy() user patches from v1. I have dropped in this
> series, just to make the series focus on the feature code first.
> 
> Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> The observation is that due to a 'trickle down' effect of RCU callbacks, the
> system is very lightly loaded but constantly running few RCU callbacks very
> often. This confuses the power management hardware that the system is active,
> when it is in fact idle.
> 
> For example, when ChromeOS screen is off and user is not doing anything on the
> system, we can see big power savings.
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> CorWatt = 0.04
> 
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41
> CorWatt = 0.03

So not quite 30% savings in power at the package level?  Not bad at all!

> Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> can see that the display pipeline is constantly doing RCU callback queuing due
> to open/close of file descriptors associated with graphics buffers. This is
> attributed to the file_free_rcu() path which this patch series also touches.
> 
> This patch series adds a simple but effective, and lockless implementation of
> RCU callback batching. On memory pressure, timeout or queue growing too big, we
> initiate a flush of one or more per-CPU lists.

It is no longer lockless, correct?  Or am I missing something subtle?

Full disclosure: I don't see a whole lot of benefit to its being lockless.
But truth in advertising!  ;-)

> Similar results can be achieved by increasing jiffies_till_first_fqs, however
> that also has the effect of slowing down RCU. Especially I saw huge slow down
> of function graph tracer when increasing that.
> 
> One drawback of this series is, if another frequent RCU callback creeps up in
> the future, that's not lazy, then that will again hurt the power. However, I
> believe identifying and fixing those is a more reasonable approach than slowing
> RCU down for the whole system.

Very good!  I have you down as the official call_rcu_lazy() whack-a-mole
developer.  ;-)

							Thanx, Paul

Joel Fernandes July 8, 2022, 4:17 a.m. UTC | #2
On Sat, Jun 25, 2022 at 08:12:06PM -0700, Paul E. McKenney wrote:
> On Wed, Jun 22, 2022 at 10:50:53PM +0000, Joel Fernandes (Google) wrote:
> > 
> > Hello!
> > Please find the next improved version of call_rcu_lazy() attached.  The main
> > difference between the previous version is that it is now using bypass lists,
> > and thus handling rcu_barrier() and hotplug situations, with some small changes
> > to those parts.
> > 
> > I also don't see the TREE07 RCU stall from v1 anymore.
> > 
> > We have some numbers from v1 below (testing on v2 is in progress). Rushikesh,
> > feel free to pull these patches into your tree. Just to note, you will also
> > need to pull the call_rcu_lazy() user patches from v1. I have dropped in this
> > series, just to make the series focus on the feature code first.
> > 
> > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > system is very lightly loaded but constantly running few RCU callbacks very
> > often. This confuses the power management hardware that the system is active,
> > when it is in fact idle.
> > 
> > For example, when ChromeOS screen is off and user is not doing anything on the
> > system, we can see big power savings.
> > Before:
> > Pk%pc10 = 72.13
> > PkgWatt = 0.58
> > CorWatt = 0.04
> > 
> > After:
> > Pk%pc10 = 81.28
> > PkgWatt = 0.41
> > CorWatt = 0.03
> 
> So not quite 30% savings in power at the package level?  Not bad at all!

Yes this is the package residency amount, not the amount of power. This % is
not power.

> > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > can see that the display pipeline is constantly doing RCU callback queuing due
> > to open/close of file descriptors associated with graphics buffers. This is
> > attributed to the file_free_rcu() path which this patch series also touches.
> > 
> > This patch series adds a simple but effective, and lockless implementation of
> > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > initiate a flush of one or more per-CPU lists.
> 
> It is no longer lockless, correct?  Or am I missing something subtle?
> 
> Full disclosure: I don't see a whole lot of benefit to its being lockless.
> But truth in advertising!  ;-)

Yes, you are right. Maybe a better way I could put it is it is "lock
contention less" :D

> > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > of function graph tracer when increasing that.
> > 
> > One drawback of this series is, if another frequent RCU callback creeps up in
> > the future, that's not lazy, then that will again hurt the power. However, I
> > believe identifying and fixing those is a more reasonable approach than slowing
> > RCU down for the whole system.
> 
> Very good!  I have you down as the official call_rcu_lazy() whack-a-mole
> developer.  ;-)

:-D

thanks,

 - Joel
Paul E. McKenney July 8, 2022, 10:45 p.m. UTC | #3
On Fri, Jul 08, 2022 at 04:17:30AM +0000, Joel Fernandes wrote:
> On Sat, Jun 25, 2022 at 08:12:06PM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 22, 2022 at 10:50:53PM +0000, Joel Fernandes (Google) wrote:
> > > 
> > > Hello!
> > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > difference between the previous version is that it is now using bypass lists,
> > > and thus handling rcu_barrier() and hotplug situations, with some small changes
> > > to those parts.
> > > 
> > > I also don't see the TREE07 RCU stall from v1 anymore.
> > > 
> > > We have some numbers from v1 below (testing on v2 is in progress). Rushikesh,
> > > feel free to pull these patches into your tree. Just to note, you will also
> > > need to pull the call_rcu_lazy() user patches from v1. I have dropped in this
> > > series, just to make the series focus on the feature code first.
> > > 
> > > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > > system is very lightly loaded but constantly running few RCU callbacks very
> > > often. This confuses the power management hardware that the system is active,
> > > when it is in fact idle.
> > > 
> > > For example, when ChromeOS screen is off and user is not doing anything on the
> > > system, we can see big power savings.
> > > Before:
> > > Pk%pc10 = 72.13
> > > PkgWatt = 0.58
> > > CorWatt = 0.04
> > > 
> > > After:
> > > Pk%pc10 = 81.28
> > > PkgWatt = 0.41
> > > CorWatt = 0.03
> > 
> > So not quite 30% savings in power at the package level?  Not bad at all!
> 
> Yes this is the package residency amount, not the amount of power. This % is
> not power.

So what exactly is PkgWatt, then?  If you can say.  That is where I was
getting the 30% from.

> > > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > > can see that the display pipeline is constantly doing RCU callback queuing due
> > > to open/close of file descriptors associated with graphics buffers. This is
> > > attributed to the file_free_rcu() path which this patch series also touches.
> > > 
> > > This patch series adds a simple but effective, and lockless implementation of
> > > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > > initiate a flush of one or more per-CPU lists.
> > 
> > It is no longer lockless, correct?  Or am I missing something subtle?
> > 
> > Full disclosure: I don't see a whole lot of benefit to its being lockless.
> > But truth in advertising!  ;-)
> 
> Yes, you are right. Maybe a better way I could put it is it is "lock
> contention less" :D

Yes, "reduced lock contention" would be a good phrase.  As long as you
carefully indicate exactly what scenario with greater lock contention
you are comparing to.

But aren't you acquiring the bypass lock at about the same rate as it
would be acquired without laziness?  What am I missing here?

							Thanx, Paul

Joel Fernandes July 10, 2022, 1:38 a.m. UTC | #4
On Fri, Jul 08, 2022 at 03:45:14PM -0700, Paul E. McKenney wrote:
> On Fri, Jul 08, 2022 at 04:17:30AM +0000, Joel Fernandes wrote:
> > On Sat, Jun 25, 2022 at 08:12:06PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 22, 2022 at 10:50:53PM +0000, Joel Fernandes (Google) wrote:
> > > > 
> > > > Hello!
> > > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > > difference between the previous version is that it is now using bypass lists,
> > > > and thus handling rcu_barrier() and hotplug situations, with some small changes
> > > > to those parts.
> > > > 
> > > > I also don't see the TREE07 RCU stall from v1 anymore.
> > > > 
> > > > We have some numbers from v1 below (testing on v2 is in progress). Rushikesh,
> > > > feel free to pull these patches into your tree. Just to note, you will also
> > > > need to pull the call_rcu_lazy() user patches from v1. I have dropped in this
> > > > series, just to make the series focus on the feature code first.
> > > > 
> > > > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > > > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > > > system is very lightly loaded but constantly running few RCU callbacks very
> > > > often. This confuses the power management hardware that the system is active,
> > > > when it is in fact idle.
> > > > 
> > > > For example, when ChromeOS screen is off and user is not doing anything on the
> > > > system, we can see big power savings.
> > > > Before:
> > > > Pk%pc10 = 72.13
> > > > PkgWatt = 0.58
> > > > CorWatt = 0.04
> > > > 
> > > > After:
> > > > Pk%pc10 = 81.28
> > > > PkgWatt = 0.41
> > > > CorWatt = 0.03
> > > 
> > > So not quite 30% savings in power at the package level?  Not bad at all!
> > 
> > Yes this is the package residency amount, not the amount of power. This % is
> > not power.
> 
> So what exactly is PkgWatt, then?  If you can say.  That is where I was
> getting the 30% from.

It's the total package power (SoC power), so not just the CPU but also
the interconnect, other controllers and other blocks in there.

This output is from the turbostat program and the number is mentioned in the
manpage:
"PkgWatt Watts consumed by the whole package."
https://manpages.debian.org/testing/linux-cpupower/turbostat.8.en.html


> > > > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > > > can see that the display pipeline is constantly doing RCU callback queuing due
> > > > to open/close of file descriptors associated with graphics buffers. This is
> > > > attributed to the file_free_rcu() path which this patch series also touches.
> > > > 
> > > > This patch series adds a simple but effective, and lockless implementation of
> > > > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > > > initiate a flush of one or more per-CPU lists.
> > > 
> > > It is no longer lockless, correct?  Or am I missing something subtle?
> > > 
> > > Full disclosure: I don't see a whole lot of benefit to its being lockless.
> > > But truth in advertising!  ;-)
> > 
> > Yes, you are right. Maybe a better way I could put it is it is "lock
> > contention less" :D
> 
> Yes, "reduced lock contention" would be a good phrase.  As long as you
> carefully indicate exactly what scenario with greater lock contention
> you are comparing to.
> 
> But aren't you acquiring the bypass lock at about the same rate as it
> would be acquired without laziness?  What am I missing here?

You are right; why don't I just drop the locking phrases from the summary?
Anyway, the main win from this work is not related to locking.

thanks,

 - Joel

Paul E. McKenney July 10, 2022, 3:47 p.m. UTC | #5
On Sun, Jul 10, 2022 at 01:38:01AM +0000, Joel Fernandes wrote:
> On Fri, Jul 08, 2022 at 03:45:14PM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 08, 2022 at 04:17:30AM +0000, Joel Fernandes wrote:
> > > On Sat, Jun 25, 2022 at 08:12:06PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jun 22, 2022 at 10:50:53PM +0000, Joel Fernandes (Google) wrote:
> > > > > 
> > > > > Hello!
> > > > > Please find the next improved version of call_rcu_lazy() attached.  The main
> > > > > difference between the previous version is that it is now using bypass lists,
> > > > > and thus handling rcu_barrier() and hotplug situations, with some small changes
> > > > > to those parts.
> > > > > 
> > > > > I also don't see the TREE07 RCU stall from v1 anymore.
> > > > > 
> > > > > We have some numbers from v1 below (testing on v2 is in progress). Rushikesh,
> > > > > feel free to pull these patches into your tree. Just to note, you will also
> > > > > need to pull the call_rcu_lazy() user patches from v1. I have dropped in this
> > > > > series, just to make the series focus on the feature code first.
> > > > > 
> > > > > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > > > > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > > > > system is very lightly loaded but constantly running few RCU callbacks very
> > > > > often. This confuses the power management hardware that the system is active,
> > > > > when it is in fact idle.
> > > > > 
> > > > > For example, when ChromeOS screen is off and user is not doing anything on the
> > > > > system, we can see big power savings.
> > > > > Before:
> > > > > Pk%pc10 = 72.13
> > > > > PkgWatt = 0.58
> > > > > CorWatt = 0.04
> > > > > 
> > > > > After:
> > > > > Pk%pc10 = 81.28
> > > > > PkgWatt = 0.41
> > > > > CorWatt = 0.03
> > > > 
> > > > So not quite 30% savings in power at the package level?  Not bad at all!
> > > 
> > > Yes this is the package residency amount, not the amount of power. This % is
> > > not power.
> > 
> > So what exactly is PkgWatt, then?  If you can say.  That is where I was
> > getting the 30% from.
> 
> It's the total package power (SoC power), so not just the CPU but also
> the interconnect, other controllers and other blocks in there.
> 
> This output is from the turbostat program and the number is mentioned in the
> manpage:
> "PkgWatt Watts consumed by the whole package."
> https://manpages.debian.org/testing/linux-cpupower/turbostat.8.en.html

Are we back to about a 30% savings in power at the package level?  ;-)

Either way, please quantify your "big power savings" by calculating and
stating a percentage decrease.

> > > > > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > > > > can see that the display pipeline is constantly doing RCU callback queuing due
> > > > > to open/close of file descriptors associated with graphics buffers. This is
> > > > > attributed to the file_free_rcu() path which this patch series also touches.
> > > > > 
> > > > > This patch series adds a simple but effective, and lockless implementation of
> > > > > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > > > > initiate a flush of one or more per-CPU lists.
> > > > 
> > > > It is no longer lockless, correct?  Or am I missing something subtle?
> > > > 
> > > > Full disclosure: I don't see a whole lot of benefit to its being lockless.
> > > > But truth in advertising!  ;-)
> > > 
> > > Yes, you are right. Maybe a better way I could put it is it is "lock
> > > contention less" :D
> > 
> > Yes, "reduced lock contention" would be a good phrase.  As long as you
> > carefully indicate exactly what scenario with greater lock contention
> > you are comparing to.
> > 
> > But aren't you acquiring the bypass lock at about the same rate as it
> > would be acquired without laziness?  What am I missing here?
> 
> You are right; why don't I just drop the locking phrases from the summary?
> Anyway, the main win from this work is not related to locking.

Sounds good!

							Thanx, Paul
