[RFC,v1,00/14] Implement call_rcu_lazy() and miscellaneous fixes

Message ID 20220512030442.2530552-1-joel@joelfernandes.org (mailing list archive)

Message

Joel Fernandes May 12, 2022, 3:04 a.m. UTC
Hello!
Please find attached the proof-of-concept version of call_rcu_lazy(). This
gives a lot of power savings when the CPUs are relatively idle. Huge thanks
to Rushikesh Kadam from Intel for investigating it with me.

Some numbers below:

The following are power savings we see on top of RCU_NOCB_CPU on an Intel
platform. The observation is that, due to a 'trickle down' effect of RCU
callbacks, the system is very lightly loaded but constantly runs a few RCU
callbacks very often. This fools the power management hardware into treating
the system as active, when it is in fact idle.

For example, when the ChromeOS screen is off and the user is not doing
anything on the system, we can see big power savings.
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58
CorWatt = 0.04

After:
Pk%pc10 = 81.28
PkgWatt = 0.41
CorWatt = 0.03

Further, when the ChromeOS screen is ON but the system is idle or lightly
loaded, we can see that the display pipeline is constantly queuing RCU
callbacks due to open/close of file descriptors associated with graphics
buffers. This is attributed to the file_free_rcu() path, which this patch
series also touches.

This patch series adds a simple, effective, and lockless implementation of
RCU callback batching. On memory pressure, a timeout, or the queue growing
too big, we initiate a flush of one or more per-CPU lists.

Similar results can be achieved by increasing jiffies_till_first_fqs;
however, that also has the effect of slowing down RCU. In particular, I saw
a huge slowdown of the function graph tracer when increasing it.

One drawback of this series is that if another frequent RCU callback that
is not lazy creeps up in the future, it will again hurt power. However, I
believe identifying and fixing those is a more reasonable approach than
slowing down RCU for the whole system.

NOTE: A debug patch is added to the series to toggle /proc/sys/kernel/rcu_lazy
at runtime, turning laziness on or off globally. It defaults to on. Further,
please use the sysctls in lazy.c for tuning the parameters that affect the
flushing.

Disclaimer 1: Don't boot your personal system on it yet anticipating power
savings: TREE07 still causes RCU stalls and I am looking more into that, but
I believe this series should be good for general testing.

Disclaimer 2: I have intentionally not CC'd other subsystem maintainers
(like net, fs) to keep the noise low, and will CC them in the future after
1 or 2 rounds of review and agreement.

Joel Fernandes (Google) (14):
  rcu: Add a lock-less lazy RCU implementation
  workqueue: Add a lazy version of queue_rcu_work()
  block/blk-ioc: Move call_rcu() to call_rcu_lazy()
  cred: Move call_rcu() to call_rcu_lazy()
  fs: Move call_rcu() to call_rcu_lazy() in some paths
  kernel: Move various core kernel usages to call_rcu_lazy()
  security: Move call_rcu() to call_rcu_lazy()
  net/core: Move call_rcu() to call_rcu_lazy()
  lib: Move call_rcu() to call_rcu_lazy()
  kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  i915: Move call_rcu() to call_rcu_lazy()
  rcu/kfree: remove useless monitor_todo flag
  rcu/kfree: Fix kfree_rcu_shrink_count() return value
  DEBUG: Toggle rcu_lazy and tune at runtime

 block/blk-ioc.c                            |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
 fs/dcache.c                                |   4 +-
 fs/eventpoll.c                             |   2 +-
 fs/file_table.c                            |   3 +-
 fs/inode.c                                 |   2 +-
 include/linux/rcupdate.h                   |   6 +
 include/linux/sched/sysctl.h               |   4 +
 include/linux/workqueue.h                  |   1 +
 kernel/cred.c                              |   2 +-
 kernel/exit.c                              |   2 +-
 kernel/pid.c                               |   2 +-
 kernel/rcu/Kconfig                         |   8 ++
 kernel/rcu/Makefile                        |   1 +
 kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
 kernel/rcu/rcu.h                           |   5 +
 kernel/rcu/tree.c                          |  28 ++--
 kernel/sysctl.c                            |  23 ++++
 kernel/time/posix-timers.c                 |   2 +-
 kernel/workqueue.c                         |  25 ++++
 lib/radix-tree.c                           |   2 +-
 lib/xarray.c                               |   2 +-
 net/core/dst.c                             |   2 +-
 security/security.c                        |   2 +-
 security/selinux/avc.c                     |   4 +-
 25 files changed, 255 insertions(+), 34 deletions(-)
 create mode 100644 kernel/rcu/lazy.c

Comments

Joel Fernandes May 12, 2022, 3:17 a.m. UTC | #1
On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
> [ cover letter snipped ]

I forgot to add Disclaimer 3: this breaks rcu_barrier(), and support
for that definitely needs work.

thanks,

 - Joel
Uladzislau Rezki May 12, 2022, 1:09 p.m. UTC | #2
Hello, Joel!

Which kernel version have you used for this series?

--
Uladzislau Rezki

On Thu, May 12, 2022 at 5:18 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> [ quoted cover letter and reply snipped ]
Uladzislau Rezki May 12, 2022, 1:56 p.m. UTC | #3
Never mind. I ported it onto 5.10.

On Thu, May 12, 2022 at 3:09 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> [ quoted thread snipped ]
Joel Fernandes May 12, 2022, 2:03 p.m. UTC | #4
On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> Never mind. I port it into 5.10

Oh, this is on mainline, sorry about that. If you want, I have a tree here
for 5.10, although it does not have the kfree changes; everything else is
the same:
https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4

thanks,

 - Joel



Uladzislau Rezki May 12, 2022, 2:37 p.m. UTC | #5
> On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > Never mind. I port it into 5.10
> 
> Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> 5.10 , although that does not have the kfree changes, everything else is
> ditto.
> https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> 
No problem. The two kfree_rcu patches are not so important in this series,
so I have backported the rest into my 5.10 kernel, because the latest kernel
is not so easy to bring up and run on my device :)

--
Uladzislau Rezki
Joel Fernandes May 12, 2022, 4:09 p.m. UTC | #6
On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > Never mind. I port it into 5.10
> >
> > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > 5.10 , although that does not have the kfree changes, everything else is
> > ditto.
> > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> >
> No problem. kfree_rcu two patches are not so important in this series.
> So i have backported them into my 5.10 kernel because the latest kernel
> is not so easy to up and run on my device :)

Actually, I was going to write here that apparently some tests are showing
kfree_rcu()->call_rcu_lazy() causing a possible regression. So it is
good to drop those for your initial testing!

Thanks,

 - Joel
Uladzislau Rezki May 12, 2022, 4:32 p.m. UTC | #7
> On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > Never mind. I port it into 5.10
> > >
> > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > 5.10 , although that does not have the kfree changes, everything else is
> > > ditto.
> > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > >
> > No problem. kfree_rcu two patches are not so important in this series.
> > So i have backported them into my 5.10 kernel because the latest kernel
> > is not so easy to up and run on my device :)
> 
> Actually I was going to write here, apparently some tests are showing
> kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> good to drop those for your initial testing!
> 
Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not
so important for kfree_rcu() because we batch requests there anyway.
One thing that I would like to improve in kfree_rcu() is better utilization
of page slots.

I will share my results either tomorrow or on Monday. I hope that is fine.

--
Uladzislau Rezki
Paul E. McKenney May 13, 2022, 12:23 a.m. UTC | #8
On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> > [ cover letter snipped ]
> > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > I believe this series should be good for general testing.

Sometimes OOM conditions result in stalls.

> > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> > of review and agreements.

We will of course need them to look at the call_rcu_lazy() conversions
at some point, but in the meantime, experimentation is fine.  I looked
at a few, but quickly decided to defer to the people with a better
understanding of the code.

> I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> support for that definitely needs work.

Good to know.  ;-)

With this in place, can the system survive a userspace close(open())
loop, or does that result in OOM?  (I am not worried about battery
lifetime while close(open()) is running, just OOM resistance.)

Does waiting for the shrinker to kick in suffice, or should the
system pressure be taken into account?  As in the "total" numbers
from /proc/pressure/memory.

Again, it is very good to see this series!

							Thanx, Paul
Joel Fernandes May 13, 2022, 2:45 p.m. UTC | #9
On Thu, May 12, 2022 at 8:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> > On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > > [ cover letter snipped ]
> > > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > > I believe this series should be good for general testing.
>
> Sometimes OOM conditions result in stalls.

I see.

> > I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> > support for that definitely needs work.
>
> Good to know.  ;-)
>
> With this in place, can the system survive a userspace close(open())
> loop, or does that result in OOM?  (I am not worried about battery
> lifetime while close(open()) is running, just OOM resistance.)

Yes, in my testing it survived. I even dropped memory to 512MB and did
the open/close loop test. I believe it also survives because we don't
let the list grow too big (other than shrinker flushing).

>
> Does waiting for the shrinker to kick in suffice, or should the
> system pressure be taken into account?  As in the "total" numbers
> from /proc/pressure/memory.

I did not find it necessary to take system memory pressure into account.

> Again, it is very good to see this series!

Thanks, I appreciate that. I am excited about battery life savings in
millions of battery-powered devices ;-) Even on my grandmom's Android
phone ;-)

Thanks,

 - Joel
Joel Fernandes May 13, 2022, 2:51 p.m. UTC | #10
On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> > > [ earlier thread snipped ]
> > > Actually I was going to write here, apparently some tests are showing
> > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > good to drop those for your initial testing!
> > >
> > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not
> > so important for kfree_rcu() because we do batch requests there anyway.
> > One thing that i would like to improve in kfree_rcu() is a better utilization
> > of page slots.
> >
> > I will share my results either tomorrow or on Monday. I hope that is fine.
> >
>
> Here we go with some data from our Android handset that runs a 5.10 kernel.
> The test case I checked was a "static image" use case. Condition: screen ON
> with all connectivity disabled.
>
> 1.
> The first data I took is how many wakeups the RCU subsystem causes during
> this test case when everything is mostly idle. Duration is 360 seconds:
>
> <snip>
> serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu

Nice! Do you mind sharing this script? I was just telling Rushikesh
that we want something like this during testing. I appreciate it. Also,
if we could dump timer wakeup reasons/callbacks, that would be awesome.

FWIW, I wrote a BPF tool that periodically dumps callbacks and can
share that with you on request as well. It is probably not in shape for
mainline though (Makefile missing and such).

> name:                         rcub/0 pid:         16 woken-up     2     interval: min 86772734  max 86772734    avg 43386367
> name:                        rcuop/7 pid:         69 woken-up     4     interval: min  4189     max  8050       avg  5049
> name:                        rcuop/6 pid:         62 woken-up    55     interval: min  6910     max 42592159    avg 3818752
[..]

> There is a big improvement in the lazy case. The number of wake-ups got
> reduced quite a lot, which is really good!

Cool!

> 2.
> Please find attached two power plots for the same test case. One is for
> regular use of call_rcu() and the second is the "lazy" usage. There is a
> slight difference in power, ~2 mA. Even though it is rather small, it is
> detectable and solid, which is also important; thus it proves the concept.
> Please note it might be more power efficient on other arches and platforms,
> because of differences in HW design related to CPU C-states and the energy
> needed to enter and exit those deep power states.

Nice! I wonder if you still have other frequent callbacks on your
system that are getting queued during the tests. Could you dump the
rcu_callbacks trace event and see if you have any CBs frequently
called that the series did not address?

Also, one more thing I was curious about is - do you see savings when
you pin the rcu threads to the LITTLE CPUs of the system? The theory
being, not disturbing the BIG CPUs which are more power hungry may let
them go into a deeper idle state and save power (due to leakage
current and so forth).
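
For the experiment, the pinning itself can be scripted from userspace with
sched_setaffinity(). A minimal sketch (assuming, hypothetically, that the LITTLE
cluster is CPUs 0-3; the numbering varies by SoC):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Parse a cpulist string such as "0-3" or "0,2" into a cpu_set_t. */
static int parse_cpulist(const char *s, cpu_set_t *set)
{
	char buf[256], *tok, *save = NULL;

	CPU_ZERO(set);
	strncpy(buf, s, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (tok = strtok_r(buf, ",", &save); tok;
	     tok = strtok_r(NULL, ",", &save)) {
		int lo, hi, c;

		if (sscanf(tok, "%d-%d", &lo, &hi) == 2) {
			for (c = lo; c <= hi; c++)
				CPU_SET(c, set);
		} else if (sscanf(tok, "%d", &lo) == 1) {
			CPU_SET(lo, set);
		} else {
			return -1;
		}
	}
	return 0;
}

/* Pin one kthread (e.g. a pid from "pgrep rcuop") to the given cpulist. */
static int pin_task(pid_t pid, const char *cpulist)
{
	cpu_set_t set;

	if (parse_cpulist(cpulist, &set))
		return -1;
	return sched_setaffinity(pid, sizeof(set), &set);
}
```

This is effectively what "taskset -pc 0-3 <pid>" does per thread; the nocb
kthreads could also be confined wholesale with cpuset cgroups.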

> So front-lazy batching is something worth having, IMHO :)

Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
it, we can also add your data about your investigation to the LPC
slides (with attribution to you).

thanks,

 - Joel
Uladzislau Rezki May 13, 2022, 3:43 p.m. UTC | #11
> On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > Never mind. I ported it to 5.10
> > > > > >
> > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > 5.10, although that does not have the kfree changes; everything else is
> > > > > > ditto.
> > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > >
> > > > > No problem. The two kfree_rcu patches are not so important in this series,
> > > > > so I have backported them into my 5.10 kernel, because the latest kernel
> > > > > is not so easy to bring up and run on my device :)
> > > >
> > > > Actually, I was going to write here that apparently some tests are showing
> > > > kfree_rcu()->call_rcu_lazy() causing a possible regression, so it is
> > > > good to drop those for your initial testing!
> > > >
> > > Yep, I dropped both. The one that makes use of call_rcu_lazy() does not seem
> > > so important for kfree_rcu(), because we batch requests there anyway.
> > > One thing that I would like to improve in kfree_rcu() is better utilization
> > > of page slots.
> > >
> > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > >
> >
> > Here we go with some data on our Android handset that runs a 5.10 kernel. The test
> > case I have checked was a "static image" use case. The condition is: screen ON with
> > all connectivity disabled.
> >
> > 1.
> > The first data I took is how many wakeups the RCU subsystem causes during this test
> > case when everything is pretty much idling. Duration is 360 seconds:
> >
> > <snip>
> > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> 
> Nice! Do you mind sharing this script? I was just telling Rushikesh
> that we want something like this during testing. Appreciate it. Also,
> if we dump timer wakeup reasons/callbacks that would also be awesome.
> 
Please find it in the attachment. I wrote it once upon a time and use it
to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
so just compile it.

How to use it:
1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
2. ./perf script -i ./perf.data > foo.script
3. ./perf_script_parser ./foo.script
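
The parsing itself is straightforward. A hypothetical sketch of the per-line
extraction step (not the actual attached perf_script_parser.c) could look like:

```c
#include <stddef.h>
#include <string.h>

/*
 * Extract the comm= field from one "perf script" sched_wakeup line.
 * Returns 1 and copies the task name into out (of size n) on success,
 * 0 if the line is not a wakeup event.
 */
static int wakeup_comm(const char *line, char *out, size_t n)
{
	const char *p;
	size_t i = 0;

	if (!strstr(line, "sched:sched_wakeup:"))
		return 0;
	p = strstr(line, "comm=");
	if (!p)
		return 0;
	for (p += 5; *p && *p != ' ' && i + 1 < n; p++)
		out[i++] = *p;
	out[i] = '\0';
	return 1;
}
```

A full tool would then keep per-comm wakeup counters and the min/max/avg of the
intervals between wakeups, which is what the output above reports.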

>
> FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> share that with you on request as well. That is probably not in a
> shape for mainline though (Makefile missing and such).
> 
Yep, please share!

> > name:                         rcub/0 pid:         16 woken-up     2     interval: min 86772734  max 86772734    avg 43386367
> > name:                        rcuop/7 pid:         69 woken-up     4     interval: min  4189     max  8050       avg  5049
> > name:                        rcuop/6 pid:         62 woken-up    55     interval: min  6910     max 42592159    avg 3818752
> [..]
> 
> > There is a big improvement in lazy case. Number of wake-ups got reduced quite a lot
> > and it is really good!
> 
> Cool!
> 
> > 2.
> > Please find attached two power plots for the same test case. One is related to a
> > regular use of call_rcu() and the second one is "lazy" usage. There is a slight
> > difference in power, ~2mA. Even though it is rather small, it is detectable and
> > consistent, which is also important, so it proves the concept. Please note it might
> > be more power efficient on other arches and platforms, because of different HW
> > designs related to CPU C-states and the energy needed to enter/exit those deep power states.
> 
> Nice! I wonder if you still have other frequent callbacks on your
> system that are getting queued during the tests. Could you dump the
> rcu_callbacks trace event and see if you have any CBs frequently
> called that the series did not address?
>
I have pretty much the following:
<snip>
    rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
    rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
    rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
    rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
    rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
    rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
<snip>

so almost everything is batched.
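
A quick, hypothetical helper to sanity-check that from such a dump, by pulling
the CBs= value out of each rcu_batch_start line so the batch sizes can be
averaged:

```c
#include <stdlib.h>
#include <string.h>

/* Return the CBs= value from one rcu_batch_start trace line, or -1. */
static long batch_cbs(const char *line)
{
	const char *p = strstr(line, "rcu_batch_start:");

	if (!p)
		return -1;
	p = strstr(p, "CBs=");
	if (!p)
		return -1;
	return strtol(p + 4, NULL, 10);
}
```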

>
> Also, one more thing I was curious about is - do you see savings when
> you pin the rcu threads to the LITTLE CPUs of the system? The theory
> being, not disturbing the BIG CPUs which are more power hungry may let
> them go into a deeper idle state and save power (due to leakage
> current and so forth).
> 
I did some experimenting with pinning the nocb threads to the LITTLE cluster.
For idle use cases I did not see any power gain. For heavy ones I see that the
"big" CPUs are also invoking callbacks and are busy with it quite often.
Probably I should think of some use case where I can detect the power
difference. If you have something, please let me know.

> > So front-lazy batching is something worth having, IMHO :)
> 
> Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> it, we can also add your data about your investigation to the LPC
> slides (with attribution to you).
> 
No problem; since we will give a talk at LPC, the more data we have, the more
convincing we are :)

--
Uladzislau Rezki
Joel Fernandes May 14, 2022, 2:25 p.m. UTC | #12
On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > Never mind. I ported it to 5.10
> > > > > > >
> > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > > 5.10, although that does not have the kfree changes; everything else is
> > > > > > > ditto.
> > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > >
> > > > > > No problem. The two kfree_rcu patches are not so important in this series,
> > > > > > so I have backported them into my 5.10 kernel, because the latest kernel
> > > > > > is not so easy to bring up and run on my device :)
> > > > >
> > > > > Actually, I was going to write here that apparently some tests are showing
> > > > > kfree_rcu()->call_rcu_lazy() causing a possible regression, so it is
> > > > > good to drop those for your initial testing!
> > > > >
> > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() does not seem
> > > > so important for kfree_rcu(), because we batch requests there anyway.
> > > > One thing that I would like to improve in kfree_rcu() is better utilization
> > > > of page slots.
> > > >
> > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > >
> > >
> > > Here we go with some data on our Android handset that runs a 5.10 kernel. The test
> > > case I have checked was a "static image" use case. The condition is: screen ON with
> > > all connectivity disabled.
> > >
> > > 1.
> > > The first data I took is how many wakeups the RCU subsystem causes during this test
> > > case when everything is pretty much idling. Duration is 360 seconds:
> > >
> > > <snip>
> > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > 
> > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > that we want something like this during testing. Appreciate it. Also,
> > if we dump timer wakeup reasons/callbacks that would also be awesome.
> > 
> Please find it in the attachment. I wrote it once upon a time and use it
> to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
> so just compile it.
> 
> How to use it:
> 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> 2. ./perf script -i ./perf.data > foo.script
> 3. ./perf_script_parser ./foo.script

Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
use "perf sched record" and "perf sched report --sort latency" to get wakeup
latencies.

> > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > share that with you on request as well. That is probably not in a
> > shape for mainline though (Makefile missing and such).
> > 
> Yep, please share!

Sure, check out my bcc repo from here:
https://github.com/joelagnel/bcc

Build this project, then cd libbpf-tools and run make. This should produce a
static binary 'rcutop' which you can push to Android. You have to build for
ARM, which bcc should have instructions for. I have also included the rcutop
diff at the end of this file for reference.

> > > 2.
> > > Please find attached two power plots for the same test case. One is related to a
> > > regular use of call_rcu() and the second one is "lazy" usage. There is a slight
> > > difference in power, ~2mA. Even though it is rather small, it is detectable and
> > > consistent, which is also important, so it proves the concept. Please note it might
> > > be more power efficient on other arches and platforms, because of different HW
> > > designs related to CPU C-states and the energy needed to enter/exit those deep power states.
> > 
> > Nice! I wonder if you still have other frequent callbacks on your
> > system that are getting queued during the tests. Could you dump the
> > rcu_callbacks trace event and see if you have any CBs frequently
> > called that the series did not address?
> >
> I have pretty much the following:
> <snip>
>     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
>     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
>     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
>     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
>     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
>     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> <snip>
> 
> so almost everything is batched.

Nice, glad to know this is happening even without the kfree_rcu() changes.

> > Also, one more thing I was curious about is - do you see savings when
> > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > being, not disturbing the BIG CPUs which are more power hungry may let
> > them go into a deeper idle state and save power (due to leakage
> > current and so forth).
> > 
> I did some experimenting with pinning the nocb threads to the LITTLE cluster.
> For idle use cases I did not see any power gain. For heavy ones I see that the
> "big" CPUs are also invoking callbacks and are busy with it quite often.
> Probably I should think of some use case where I can detect the power
> difference. If you have something, please let me know.

Yeah, probably screen off + audio playback might be a good one, because it
lightly loads the CPUs.

> > > So front-lazy batching is something worth having, IMHO :)
> > 
> > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> > it, we can also add your data about your investigation to the LPC
> > slides (with attribution to you).
> > 
> No problem; since we will give a talk at LPC, the more data we have, the more
> convincing we are :)

I forget, you did mention you are OK with presenting with us, right? It would
be great if you present your data when we come to the Android part, if you are
OK with it. I'll start a common slide deck soon and share it so you, Rushikesh
and I can add slides to it and present together.

thanks,

 - Joel

---8<-----------------------

From: Joel Fernandes <joelaf@google.com>
Subject: [PATCH] rcutop

Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 libbpf-tools/Makefile     |   1 +
 libbpf-tools/rcutop.bpf.c |  56 ++++++++
 libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
 libbpf-tools/rcutop.h     |   8 ++
 4 files changed, 353 insertions(+)
 create mode 100644 libbpf-tools/rcutop.bpf.c
 create mode 100644 libbpf-tools/rcutop.c
 create mode 100644 libbpf-tools/rcutop.h

diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
index e60ec409..0d4cdff2 100644
--- a/libbpf-tools/Makefile
+++ b/libbpf-tools/Makefile
@@ -42,6 +42,7 @@ APPS = \
 	klockstat \
 	ksnoop \
 	llcstat \
+	rcutop \
 	mountsnoop \
 	numamove \
 	offcputime \
diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
new file mode 100644
index 00000000..8287bbe2
--- /dev/null
+++ b/libbpf-tools/rcutop.bpf.c
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+/* Copyright (c) 2021 Hengqi Chen */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+#include "rcutop.h"
+#include "maps.bpf.h"
+
+#define MAX_ENTRIES	10240
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, void *);
+	__type(value, int);
+} cbs_queued SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, void *);
+	__type(value, int);
+} cbs_executed SEC(".maps");
+
+SEC("tracepoint/rcu/rcu_callback")
+int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
+{
+	void *key = ctx->func;
+	int *val = NULL;
+	static const int zero;
+
+	val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
+	if (val) {
+		__sync_fetch_and_add(val, 1);
+	}
+
+	return 0;
+}
+
+SEC("tracepoint/rcu/rcu_invoke_callback")
+int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
+{
+	void *key = ctx->func;
+	int *val;
+	int zero = 0;
+
+	val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
+	if (val) {
+		__sync_fetch_and_add(val, 1);
+	}
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
new file mode 100644
index 00000000..35795875
--- /dev/null
+++ b/libbpf-tools/rcutop.c
@@ -0,0 +1,288 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+
+/*
+ * rcutop
+ * Copyright (c) 2022 Joel Fernandes
+ *
+ * 05-May-2022   Joel Fernandes   Created this.
+ */
+#include <argp.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include "rcutop.h"
+#include "rcutop.skel.h"
+#include "btf_helpers.h"
+#include "trace_helpers.h"
+
+#define warn(...) fprintf(stderr, __VA_ARGS__)
+#define OUTPUT_ROWS_LIMIT 10240
+
+static volatile sig_atomic_t exiting = 0;
+
+static bool clear_screen = true;
+static int output_rows = 20;
+static int interval = 1;
+static int count = 99999999;
+static bool verbose = false;
+
+const char *argp_program_version = "rcutop 0.1";
+const char *argp_program_bug_address =
+"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
+const char argp_program_doc[] =
+"Show RCU callback queuing and execution stats.\n"
+"\n"
+"USAGE: rcutop [-h] [interval] [count]\n"
+"\n"
+"EXAMPLES:\n"
+"    rcutop            # rcu activity top, refresh every 1s\n"
+"    rcutop 5 10       # 5s summaries, 10 times\n";
+
+static const struct argp_option opts[] = {
+	{ "noclear", 'C', NULL, 0, "Don't clear the screen" },
+	{ "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
+	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
+	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long rows;
+	static int pos_args;
+
+	switch (key) {
+		case 'C':
+			clear_screen = false;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		case 'h':
+			argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
+			break;
+		case 'r':
+			errno = 0;
+			rows = strtol(arg, NULL, 10);
+			if (errno || rows <= 0) {
+				warn("invalid rows: %s\n", arg);
+				argp_usage(state);
+			}
+			output_rows = rows;
+			if (output_rows > OUTPUT_ROWS_LIMIT)
+				output_rows = OUTPUT_ROWS_LIMIT;
+			break;
+		case ARGP_KEY_ARG:
+			errno = 0;
+			if (pos_args == 0) {
+				interval = strtol(arg, NULL, 10);
+				if (errno || interval <= 0) {
+					warn("invalid interval\n");
+					argp_usage(state);
+				}
+			} else if (pos_args == 1) {
+				count = strtol(arg, NULL, 10);
+				if (errno || count <= 0) {
+					warn("invalid count\n");
+					argp_usage(state);
+				}
+			} else {
+				warn("unrecognized positional argument: %s\n", arg);
+				argp_usage(state);
+			}
+			pos_args++;
+			break;
+		default:
+			return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+	if (level == LIBBPF_DEBUG && !verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static void sig_int(int signo)
+{
+	exiting = 1;
+}
+
+static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
+		struct rcutop_bpf *obj)
+{
+	void *key, **prev_key = NULL;
+	int n, err = 0;
+	int qfd = bpf_map__fd(obj->maps.cbs_queued);
+	int efd = bpf_map__fd(obj->maps.cbs_executed);
+	const struct ksym *ksym;
+	FILE *f;
+	time_t t;
+	struct tm *tm;
+	char ts[16], buf[256];
+
+	f = fopen("/proc/loadavg", "r");
+	if (f) {
+		time(&t);
+		tm = localtime(&t);
+		strftime(ts, sizeof(ts), "%H:%M:%S", tm);
+		memset(buf, 0, sizeof(buf));
+		n = fread(buf, 1, sizeof(buf), f);
+		if (n)
+			printf("%8s loadavg: %s\n", ts, buf);
+		fclose(f);
+	}
+
+	printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
+
+	while (1) {
+		int qcount = 0, ecount = 0;
+
+		err = bpf_map_get_next_key(qfd, prev_key, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
+			return err;
+		}
+
+		err = bpf_map_lookup_elem(qfd, &key, &qcount);
+		if (err) {
+			warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
+			return err;
+		}
+		prev_key = &key;
+
+		bpf_map_lookup_elem(efd, &key, &ecount);
+
+		ksym = ksyms__map_addr(ksyms, (unsigned long)key);
+		printf("%-32s %-6d %-6d\n",
+				ksym ? ksym->name : "Unknown",
+				qcount, ecount);
+	}
+	printf("\n");
+	prev_key = NULL;
+	while (1) {
+		err = bpf_map_get_next_key(qfd, prev_key, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
+			return err;
+		}
+		err = bpf_map_delete_elem(qfd, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				continue;
+			}
+			warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
+			return err;
+		}
+
+		bpf_map_delete_elem(efd, &key);
+		prev_key = &key;
+	}
+
+	return err;
+}
+
+int main(int argc, char **argv)
+{
+	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
+	static const struct argp argp = {
+		.options = opts,
+		.parser = parse_arg,
+		.doc = argp_program_doc,
+	};
+	struct rcutop_bpf *obj;
+	int err;
+	struct syms_cache *syms_cache = NULL;
+	struct ksyms *ksyms = NULL;
+
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+	libbpf_set_print(libbpf_print_fn);
+
+	err = ensure_core_btf(&open_opts);
+	if (err) {
+		fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
+		return 1;
+	}
+
+	obj = rcutop_bpf__open_opts(&open_opts);
+	if (!obj) {
+		warn("failed to open BPF object\n");
+		return 1;
+	}
+
+	err = rcutop_bpf__load(obj);
+	if (err) {
+		warn("failed to load BPF object: %d\n", err);
+		goto cleanup;
+	}
+
+	err = rcutop_bpf__attach(obj);
+	if (err) {
+		warn("failed to attach BPF programs: %d\n", err);
+		goto cleanup;
+	}
+
+	ksyms = ksyms__load();
+	if (!ksyms) {
+		fprintf(stderr, "failed to load kallsyms\n");
+		goto cleanup;
+	}
+
+	syms_cache = syms_cache__new(0);
+	if (!syms_cache) {
+		fprintf(stderr, "failed to create syms_cache\n");
+		goto cleanup;
+	}
+
+	if (signal(SIGINT, sig_int) == SIG_ERR) {
+		warn("can't set signal handler: %s\n", strerror(errno));
+		err = 1;
+		goto cleanup;
+	}
+
+	while (1) {
+		sleep(interval);
+
+		if (clear_screen) {
+			err = system("clear");
+			if (err)
+				goto cleanup;
+		}
+
+		err = print_stat(ksyms, syms_cache, obj);
+		if (err)
+			goto cleanup;
+
+		count--;
+		if (exiting || !count)
+			goto cleanup;
+	}
+
+cleanup:
+	rcutop_bpf__destroy(obj);
+	cleanup_core_btf(&open_opts);
+
+	return err != 0;
+}
diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
new file mode 100644
index 00000000..cb2a3557
--- /dev/null
+++ b/libbpf-tools/rcutop.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __RCUTOP_H
+#define __RCUTOP_H
+
+#define PATH_MAX	4096
+#define TASK_COMM_LEN	16
+
+#endif /* __RCUTOP_H */
Uladzislau Rezki May 14, 2022, 7:01 p.m. UTC | #13
> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > Never mind. I ported it to 5.10
> > > > > > > >
> > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > > > 5.10, although that does not have the kfree changes; everything else is
> > > > > > > > ditto.
> > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > > >
> > > > > > > No problem. The two kfree_rcu patches are not so important in this series,
> > > > > > > so I have backported them into my 5.10 kernel, because the latest kernel
> > > > > > > is not so easy to bring up and run on my device :)
> > > > > >
> > > > > > Actually, I was going to write here that apparently some tests are showing
> > > > > > kfree_rcu()->call_rcu_lazy() causing a possible regression, so it is
> > > > > > good to drop those for your initial testing!
> > > > > >
> > > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() does not seem
> > > > > so important for kfree_rcu(), because we batch requests there anyway.
> > > > > One thing that I would like to improve in kfree_rcu() is better utilization
> > > > > of page slots.
> > > > >
> > > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > > >
> > > >
> > > > Here we go with some data on our Android handset that runs a 5.10 kernel. The test
> > > > case I have checked was a "static image" use case. The condition is: screen ON with
> > > > all connectivity disabled.
> > > >
> > > > 1.
> > > > The first data I took is how many wakeups the RCU subsystem causes during this test
> > > > case when everything is pretty much idling. Duration is 360 seconds:
> > > >
> > > > <snip>
> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > > 
> > > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > > that we want something like this during testing. Appreciate it. Also,
> > > if we dump timer wakeup reasons/callbacks that would also be awesome.
> > > 
> > Please find it in the attachment. I wrote it once upon a time and use it
> > to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
> > so just compile it.
> > 
> > How to use it:
> > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> > 2. ./perf script -i ./perf.data > foo.script
> > 3. ./perf_script_parser ./foo.script
> 
> Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
> use "perf sched record" and "perf sched report --sort latency" to get wakeup
> latencies.
> 
Good. So "perf sched report --sort latency" must be related to wakeup delays?

> > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > > share that with you on request as well. That is probably not in a
> > > shape for mainline though (Makefile missing and such).
> > > 
> > Yep, please share!
> 
> Sure, check out my bcc repo from here:
> https://github.com/joelagnel/bcc
> 
> Build this project, then cd libbpf-tools and run make. This should produce a
> static binary 'rcutop' which you can push to Android. You have to build for
> ARM, which bcc should have instructions for. I have also included the rcutop
> diff at the end of this file for reference.
> 
Cool. Will try it later :)

> > > > 2.
> > > > Please find attached two power plots for the same test case. One is related to a
> > > > regular use of call_rcu() and the second one is "lazy" usage. There is a slight
> > > > difference in power, ~2mA. Even though it is rather small, it is detectable and
> > > > consistent, which is also important, so it proves the concept. Please note it might
> > > > be more power efficient on other arches and platforms, because of different HW
> > > > designs related to CPU C-states and the energy needed to enter/exit those deep power states.
> > > 
> > > Nice! I wonder if you still have other frequent callbacks on your
> > > system that are getting queued during the tests. Could you dump the
> > > rcu_callbacks trace event and see if you have any CBs frequently
> > > called that the series did not address?
> > >
> > I have pretty much the following:
> > <snip>
> >     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
> >     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
> >     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
> >     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
> >     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> >     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> > 
> > so almost everything is batched.
> 
> Nice, glad to know this is happening even without the kfree_rcu() changes.
> 
kvfree_rcu() does the job quite well, even though, as I mentioned, I would like
to send a patch for better page slot utilization. But, yes, since there is a
batching mechanism, call_rcu_lazy() does not give any additional effect.


> > > Also, one more thing I was curious about is - do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > > 
> > I did some experimenting with pinning the nocb threads to the LITTLE cluster.
> > For idle use cases I did not see any power gain. For heavy ones I see that the
> > "big" CPUs are also invoking callbacks and are busy with it quite often.
> > Probably I should think of some use case where I can detect the power
> > difference. If you have something, please let me know.
> 
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
> 
Hm.. Will see and might check, just in case! I have implemented a simple
call_rcu() static workload generator in order to examine this in a more
controlled environment, so I would like to see how it behaves under that
static workload generator.

> > > > So front-lazy batching is something worth having, IMHO :)
> > > 
> > > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> > > it, we can also add your data about your investigation to the LPC
> > > slides (with attribution to you).
> > > 
> > No problem; since we will give a talk at LPC, the more data we have, the more
> > convincing we are :)
> 
> I forget, you did mention you are OK with presenting with us, right? It would
> be great if you present your data when we come to Android, if you are OK with
> it. I'll start a common slide deck soon and share it so you, Rushikesh, and I
> can add slides to it and present together.
> 
I am OK with presenting together with you. I can prepare some data from the
testing on the Android big.LITTLE platform that we use for our customers.

> thanks,
> 
>  - Joel
> 
> ---8<-----------------------
> 
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] rcutop
> 
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  libbpf-tools/Makefile     |   1 +
>  libbpf-tools/rcutop.bpf.c |  56 ++++++++
>  libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
>  libbpf-tools/rcutop.h     |   8 ++
>  4 files changed, 353 insertions(+)
>  create mode 100644 libbpf-tools/rcutop.bpf.c
>  create mode 100644 libbpf-tools/rcutop.c
>  create mode 100644 libbpf-tools/rcutop.h
> 
> diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
> index e60ec409..0d4cdff2 100644
> --- a/libbpf-tools/Makefile
> +++ b/libbpf-tools/Makefile
> @@ -42,6 +42,7 @@ APPS = \
>  	klockstat \
>  	ksnoop \
>  	llcstat \
> +	rcutop \
>  	mountsnoop \
>  	numamove \
>  	offcputime \
> diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
> new file mode 100644
> index 00000000..8287bbe2
> --- /dev/null
> +++ b/libbpf-tools/rcutop.bpf.c
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +/* Copyright (c) 2021 Hengqi Chen */
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_tracing.h>
> +#include "rcutop.h"
> +#include "maps.bpf.h"
> +
> +#define MAX_ENTRIES	10240
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, MAX_ENTRIES);
> +	__type(key, void *);
> +	__type(value, int);
> +} cbs_queued SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, MAX_ENTRIES);
> +	__type(key, void *);
> +	__type(value, int);
> +} cbs_executed SEC(".maps");
> +
> +SEC("tracepoint/rcu/rcu_callback")
> +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
> +{
> +	void *key = ctx->func;
> +	int *val = NULL;
> +	static const int zero;
> +
> +	val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
> +	if (val) {
> +		__sync_fetch_and_add(val, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +SEC("tracepoint/rcu/rcu_invoke_callback")
> +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
> +{
> +	void *key = ctx->func;
> +	int *val;
> +	int zero = 0;
> +
> +	val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
> +	if (val) {
> +		__sync_fetch_and_add(val, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
> new file mode 100644
> index 00000000..35795875
> --- /dev/null
> +++ b/libbpf-tools/rcutop.c
> @@ -0,0 +1,288 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +
> +/*
> + * rcutop
> + * Copyright (c) 2022 Joel Fernandes
> + *
> + * 05-May-2022   Joel Fernandes   Created this.
> + */
> +#include <argp.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +#include "rcutop.h"
> +#include "rcutop.skel.h"
> +#include "btf_helpers.h"
> +#include "trace_helpers.h"
> +
> +#define warn(...) fprintf(stderr, __VA_ARGS__)
> +#define OUTPUT_ROWS_LIMIT 10240
> +
> +static volatile sig_atomic_t exiting = 0;
> +
> +static bool clear_screen = true;
> +static int output_rows = 20;
> +static int interval = 1;
> +static int count = 99999999;
> +static bool verbose = false;
> +
> +const char *argp_program_version = "rcutop 0.1";
> +const char *argp_program_bug_address =
> +"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
> +const char argp_program_doc[] =
> +"Show RCU callback queuing and execution stats.\n"
> +"\n"
> +"USAGE: rcutop [-h] [interval] [count]\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    rcutop            # rcu activity top, refresh every 1s\n"
> +"    rcutop 5 10       # 5s summaries, 10 times\n";
> +
> +static const struct argp_option opts[] = {
> +	{ "noclear", 'C', NULL, 0, "Don't clear the screen" },
> +	{ "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
> +	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
> +	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
> +	{},
> +};
> +
> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +	long rows;
> +	static int pos_args;
> +
> +	switch (key) {
> +		case 'C':
> +			clear_screen = false;
> +			break;
> +		case 'v':
> +			verbose = true;
> +			break;
> +		case 'h':
> +			argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
> +			break;
> +		case 'r':
> +			errno = 0;
> +			rows = strtol(arg, NULL, 10);
> +			if (errno || rows <= 0) {
> +				warn("invalid rows: %s\n", arg);
> +				argp_usage(state);
> +			}
> +			output_rows = rows;
> +			if (output_rows > OUTPUT_ROWS_LIMIT)
> +				output_rows = OUTPUT_ROWS_LIMIT;
> +			break;
> +		case ARGP_KEY_ARG:
> +			errno = 0;
> +			if (pos_args == 0) {
> +				interval = strtol(arg, NULL, 10);
> +				if (errno || interval <= 0) {
> +					warn("invalid interval\n");
> +					argp_usage(state);
> +				}
> +			} else if (pos_args == 1) {
> +				count = strtol(arg, NULL, 10);
> +				if (errno || count <= 0) {
> +					warn("invalid count\n");
> +					argp_usage(state);
> +				}
> +			} else {
> +				warn("unrecognized positional argument: %s\n", arg);
> +				argp_usage(state);
> +			}
> +			pos_args++;
> +			break;
> +		default:
> +			return ARGP_ERR_UNKNOWN;
> +	}
> +	return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +	if (level == LIBBPF_DEBUG && !verbose)
> +		return 0;
> +	return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +	exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +		struct rcutop_bpf *obj)
> +{
> +	void *key, **prev_key = NULL;
> +	int n, err = 0;
> +	int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +	int efd = bpf_map__fd(obj->maps.cbs_executed);
> +	const struct ksym *ksym;
> +	FILE *f;
> +	time_t t;
> +	struct tm *tm;
> +	char ts[16], buf[256];
> +
> +	f = fopen("/proc/loadavg", "r");
> +	if (f) {
> +		time(&t);
> +		tm = localtime(&t);
> +		strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +		memset(buf, 0, sizeof(buf));
> +		n = fread(buf, 1, sizeof(buf), f);
> +		if (n)
> +			printf("%8s loadavg: %s\n", ts, buf);
> +		fclose(f);
> +	}
> +
> +	printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +	while (1) {
> +		int qcount = 0, ecount = 0;
> +
> +		err = bpf_map_get_next_key(qfd, prev_key, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				break;
> +			}
> +			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +
> +		err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +		if (err) {
> +			warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +		prev_key = &key;
> +
> +		bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +		ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +		printf("%-32s %-6d %-6d\n",
> +				ksym ? ksym->name : "Unknown",
> +				qcount, ecount);
> +	}
> +	printf("\n");
> +	prev_key = NULL;
> +	while (1) {
> +		err = bpf_map_get_next_key(qfd, prev_key, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				break;
> +			}
> +			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +		err = bpf_map_delete_elem(qfd, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				continue;
> +			}
> +			warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +
> +		bpf_map_delete_elem(efd, &key);
> +		prev_key = &key;
> +	}
> +
> +	return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +	static const struct argp argp = {
> +		.options = opts,
> +		.parser = parse_arg,
> +		.doc = argp_program_doc,
> +	};
> +	struct rcutop_bpf *obj;
> +	int err;
> +	struct syms_cache *syms_cache = NULL;
> +	struct ksyms *ksyms = NULL;
> +
> +	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +	if (err)
> +		return err;
> +
> +	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +	libbpf_set_print(libbpf_print_fn);
> +
> +	err = ensure_core_btf(&open_opts);
> +	if (err) {
> +		fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +		return 1;
> +	}
> +
> +	obj = rcutop_bpf__open_opts(&open_opts);
> +	if (!obj) {
> +		warn("failed to open BPF object\n");
> +		return 1;
> +	}
> +
> +	err = rcutop_bpf__load(obj);
> +	if (err) {
> +		warn("failed to load BPF object: %d\n", err);
> +		goto cleanup;
> +	}
> +
> +	err = rcutop_bpf__attach(obj);
> +	if (err) {
> +		warn("failed to attach BPF programs: %d\n", err);
> +		goto cleanup;
> +	}
> +
> +	ksyms = ksyms__load();
> +	if (!ksyms) {
> +		fprintf(stderr, "failed to load kallsyms\n");
> +		goto cleanup;
> +	}
> +
> +	syms_cache = syms_cache__new(0);
> +	if (!syms_cache) {
> +		fprintf(stderr, "failed to create syms_cache\n");
> +		goto cleanup;
> +	}
> +
> +	if (signal(SIGINT, sig_int) == SIG_ERR) {
> +		warn("can't set signal handler: %s\n", strerror(errno));
> +		err = 1;
> +		goto cleanup;
> +	}
> +
> +	while (1) {
> +		sleep(interval);
> +
> +		if (clear_screen) {
> +			err = system("clear");
> +			if (err)
> +				goto cleanup;
> +		}
> +
> +		err = print_stat(ksyms, syms_cache, obj);
> +		if (err)
> +			goto cleanup;
> +
> +		count--;
> +		if (exiting || !count)
> +			goto cleanup;
> +	}
> +
> +cleanup:
> +	rcutop_bpf__destroy(obj);
> +	cleanup_core_btf(&open_opts);
> +
> +	return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX	4096
> +#define TASK_COMM_LEN	16
> +
> +#endif /* __RCUTOP_H */
> -- 
> 2.36.0.550.gb090851708-goog
>
Joel Fernandes June 13, 2022, 6:53 p.m. UTC | #14
On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> Hello!
> Please find the proof of concept version of call_rcu_lazy() attached. This
> gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> Rushikesh Kadam from Intel for investigating it with me.

Just a status update: we're reworking this code to use the bypass lists, which
take care of a lot of corner cases. I've had to make changes to the rcuog
thread code to handle wakeups a bit differently, and such. Initial testing is
looking good. Currently working on the early flushing of the lazy CBs based
on memory pressure.

Should be hopefully sending v2 soon!

thanks,

 - Joel



> 
> Some numbers below:
> 
> Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> The observation is that due to a 'trickle down' effect of RCU callbacks, the
> system is very lightly loaded but constantly running few RCU callbacks very
> often. This confuses the power management hardware that the system is active,
> when it is in fact idle.
> 
> For example, when ChromeOS screen is off and user is not doing anything on the
> system, we can see big power savings.
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> CorWatt = 0.04
> 
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41
> CorWatt = 0.03
> 
> Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> can see that the display pipeline is constantly doing RCU callback queuing due
> to open/close of file descriptors associated with graphics buffers. This is
> attributed to the file_free_rcu() path which this patch series also touches.
> 
> This patch series adds a simple but effective, and lockless implementation of
> RCU callback batching. On memory pressure, timeout or queue growing too big, we
> initiate a flush of one or more per-CPU lists.
> 
> Similar results can be achieved by increasing jiffies_till_first_fqs, however
> that also has the effect of slowing down RCU. Especially I saw huge slow down
> of function graph tracer when increasing that.
> 
> One drawback of this series is, if another frequent RCU callback creeps up in
> the future, that's not lazy, then that will again hurt the power. However, I
> believe identifying and fixing those is a more reasonable approach than slowing
> RCU down for the whole system.
> 
> NOTE: A debug patch is added in the series to toggle /proc/sys/kernel/rcu_lazy
> at runtime to turn it on or off globally. It defaults to on. Further, please
> use the sysctls in lazy.c for further tuning of parameters that affect the
> flushing.
> 
> Disclaimer 1: Don't boot your personal system on it yet anticipating power
> savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> I believe this series should be good for general testing.
> 
> Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> of review and agreements.
> 
> Joel Fernandes (Google) (14):
>   rcu: Add a lock-less lazy RCU implementation
>   workqueue: Add a lazy version of queue_rcu_work()
>   block/blk-ioc: Move call_rcu() to call_rcu_lazy()
>   cred: Move call_rcu() to call_rcu_lazy()
>   fs: Move call_rcu() to call_rcu_lazy() in some paths
>   kernel: Move various core kernel usages to call_rcu_lazy()
>   security: Move call_rcu() to call_rcu_lazy()
>   net/core: Move call_rcu() to call_rcu_lazy()
>   lib: Move call_rcu() to call_rcu_lazy()
>   kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
>   i915: Move call_rcu() to call_rcu_lazy()
>   rcu/kfree: remove useless monitor_todo flag
>   rcu/kfree: Fix kfree_rcu_shrink_count() return value
>   DEBUG: Toggle rcu_lazy and tune at runtime
> 
>  block/blk-ioc.c                            |   2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
>  fs/dcache.c                                |   4 +-
>  fs/eventpoll.c                             |   2 +-
>  fs/file_table.c                            |   3 +-
>  fs/inode.c                                 |   2 +-
>  include/linux/rcupdate.h                   |   6 +
>  include/linux/sched/sysctl.h               |   4 +
>  include/linux/workqueue.h                  |   1 +
>  kernel/cred.c                              |   2 +-
>  kernel/exit.c                              |   2 +-
>  kernel/pid.c                               |   2 +-
>  kernel/rcu/Kconfig                         |   8 ++
>  kernel/rcu/Makefile                        |   1 +
>  kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
>  kernel/rcu/rcu.h                           |   5 +
>  kernel/rcu/tree.c                          |  28 ++--
>  kernel/sysctl.c                            |  23 ++++
>  kernel/time/posix-timers.c                 |   2 +-
>  kernel/workqueue.c                         |  25 ++++
>  lib/radix-tree.c                           |   2 +-
>  lib/xarray.c                               |   2 +-
>  net/core/dst.c                             |   2 +-
>  security/security.c                        |   2 +-
>  security/selinux/avc.c                     |   4 +-
>  25 files changed, 255 insertions(+), 34 deletions(-)
>  create mode 100644 kernel/rcu/lazy.c
> 
> -- 
> 2.36.0.550.gb090851708-goog
>
Paul E. McKenney June 13, 2022, 10:48 p.m. UTC | #15
On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote:
> On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> > Hello!
> > Please find the proof of concept version of call_rcu_lazy() attached. This
> > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > Rushikesh Kadam from Intel for investigating it with me.
> 
> Just a status update: we're reworking this code to use the bypass lists, which
> take care of a lot of corner cases. I've had to make changes to the rcuog
> thread code to handle wakeups a bit differently, and such. Initial testing is
> looking good. Currently working on the early flushing of the lazy CBs based
> on memory pressure.
> 
> Should be hopefully sending v2 soon!

Looking forward to seeing it!

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> 
> > 
> > Some numbers below:
> > 
> > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > system is very lightly loaded but constantly running few RCU callbacks very
> > often. This confuses the power management hardware that the system is active,
> > when it is in fact idle.
> > 
> > For example, when ChromeOS screen is off and user is not doing anything on the
> > system, we can see big power savings.
> > Before:
> > Pk%pc10 = 72.13
> > PkgWatt = 0.58
> > CorWatt = 0.04
> > 
> > After:
> > Pk%pc10 = 81.28
> > PkgWatt = 0.41
> > CorWatt = 0.03
> > 
> > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > can see that the display pipeline is constantly doing RCU callback queuing due
> > to open/close of file descriptors associated with graphics buffers. This is
> > attributed to the file_free_rcu() path which this patch series also touches.
> > 
> > This patch series adds a simple but effective, and lockless implementation of
> > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > initiate a flush of one or more per-CPU lists.
> > 
> > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > of function graph tracer when increasing that.
> > 
> > One drawback of this series is, if another frequent RCU callback creeps up in
> > the future, that's not lazy, then that will again hurt the power. However, I
> > believe identifying and fixing those is a more reasonable approach than slowing
> > RCU down for the whole system.
> > 
> > NOTE: A debug patch is added in the series to toggle /proc/sys/kernel/rcu_lazy
> > at runtime to turn it on or off globally. It defaults to on. Further, please
> > use the sysctls in lazy.c for further tuning of parameters that affect the
> > flushing.
> > 
> > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > I believe this series should be good for general testing.
> > 
> > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> > of review and agreements.
> > 
> > Joel Fernandes (Google) (14):
> >   rcu: Add a lock-less lazy RCU implementation
> >   workqueue: Add a lazy version of queue_rcu_work()
> >   block/blk-ioc: Move call_rcu() to call_rcu_lazy()
> >   cred: Move call_rcu() to call_rcu_lazy()
> >   fs: Move call_rcu() to call_rcu_lazy() in some paths
> >   kernel: Move various core kernel usages to call_rcu_lazy()
> >   security: Move call_rcu() to call_rcu_lazy()
> >   net/core: Move call_rcu() to call_rcu_lazy()
> >   lib: Move call_rcu() to call_rcu_lazy()
> >   kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
> >   i915: Move call_rcu() to call_rcu_lazy()
> >   rcu/kfree: remove useless monitor_todo flag
> >   rcu/kfree: Fix kfree_rcu_shrink_count() return value
> >   DEBUG: Toggle rcu_lazy and tune at runtime
> > 
> >  block/blk-ioc.c                            |   2 +-
> >  drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
> >  fs/dcache.c                                |   4 +-
> >  fs/eventpoll.c                             |   2 +-
> >  fs/file_table.c                            |   3 +-
> >  fs/inode.c                                 |   2 +-
> >  include/linux/rcupdate.h                   |   6 +
> >  include/linux/sched/sysctl.h               |   4 +
> >  include/linux/workqueue.h                  |   1 +
> >  kernel/cred.c                              |   2 +-
> >  kernel/exit.c                              |   2 +-
> >  kernel/pid.c                               |   2 +-
> >  kernel/rcu/Kconfig                         |   8 ++
> >  kernel/rcu/Makefile                        |   1 +
> >  kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
> >  kernel/rcu/rcu.h                           |   5 +
> >  kernel/rcu/tree.c                          |  28 ++--
> >  kernel/sysctl.c                            |  23 ++++
> >  kernel/time/posix-timers.c                 |   2 +-
> >  kernel/workqueue.c                         |  25 ++++
> >  lib/radix-tree.c                           |   2 +-
> >  lib/xarray.c                               |   2 +-
> >  net/core/dst.c                             |   2 +-
> >  security/security.c                        |   2 +-
> >  security/selinux/avc.c                     |   4 +-
> >  25 files changed, 255 insertions(+), 34 deletions(-)
> >  create mode 100644 kernel/rcu/lazy.c
> > 
> > -- 
> > 2.36.0.550.gb090851708-goog
> >
Joel Fernandes June 16, 2022, 4:26 p.m. UTC | #16
On Mon, Jun 13, 2022 at 6:48 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote:
> > On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> > > Hello!
> > > Please find the proof of concept version of call_rcu_lazy() attached. This
> > > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > > Rushikesh Kadam from Intel for investigating it with me.
> >
> > Just a status update: we're reworking this code to use the bypass lists, which
> > take care of a lot of corner cases. I've had to make changes to the rcuog
> > thread code to handle wakeups a bit differently, and such. Initial testing is
> > looking good. Currently working on the early flushing of the lazy CBs based
> > on memory pressure.
> >
> > Should be hopefully sending v2 soon!
>
> Looking forward to seeing it!

Looks like rebasing on Paul's dev branch is needed for me to even apply and
test this on our downstream kernels. Why, you ask? Because our downstream
folks are indeed using upstream 5.19-rc3. =) The rebase finally succeeded and
now I can test the lazy bypass downstream before sending out the next
RFC...

 - Joel
Joel Fernandes Aug. 9, 2022, 2:25 a.m. UTC | #17
On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > Never mind. I port it into 5.10
> > > > > > > >
> > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > > > > ditto.
> > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > > >
> > > > > > > No problem. kfree_rcu two patches are not so important in this series.
> > > > > > > So i have backported them into my 5.10 kernel because the latest kernel
> > > > > > > is not so easy to up and run on my device :)
> > > > > >
> > > > > > Actually I was going to write here, apparently some tests are showing
> > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > > > > good to drop those for your initial testing!
> > > > > >
> > > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not
> > > > > so important for kfree_rcu() because we do batch requests there anyway.
> > > > > One thing that I would like to improve in kfree_rcu() is a better utilization
> > > > > of page slots.
> > > > >
> > > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > > >
> > > >
> > > > Here we go with some data from our Android handset that runs a 5.10 kernel. The test
> > > > case I checked was a "static image" use case. Condition is: screen ON with
> > > > all connectivity disabled.
> > > >
> > > > 1.
> > > > The first data I took is how many wakeups the RCU subsystem causes during this
> > > > test case when everything is pretty much idle. Duration is 360 seconds:
> > > >
> > > > <snip>
> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > >
> > > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > > that we want something like this during testing. Appreciate it. Also,
> > > if we dump timer wakeup reasons/callbacks that would also be awesome.
> > >
> > Please find it in the attachment. I wrote it once upon a time and use it
> > to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
> > so just compile it.
> >
> > How to use it:
> > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> > 2. ./perf script -i ./perf.data > foo.script
> > 3. ./perf_script_parser ./foo.script
>
> Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
> use "perf sched record" and "perf sched report --sort latency" to get wakeup
> latencies.
>
> > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > > share that with you on request as well. That is probably not in shape
> > > for mainline though (Makefile missing and such).
> > >
> > Yep, please share!
>
> Sure, check out my bcc repo from here:
> https://github.com/joelagnel/bcc

Kind of random, but this is why I love sharing my tools with others. I
actually forgot that I checked in the code for my rcutop here, and
this email is how I found it, lol :)

Did you happen to try it out too btw? :)

Thanks,

Joel


>
> Build this project, then cd libbpf-tools and run make. This should produce a
> static binary 'rcutop' which you can push to Android. You have to build for
> ARM, which bcc should have instructions for. I have also included the rcutop
> diff at the end of this file for reference.
>
> > > > 2.
> > > > Please find in attachment two power plots. The same test case. One is related to a
> > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference
> > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid
> > > > what is also important, thus it proofs the concept. Please note it might be more power
> > > > efficient for other arches and platforms. Because of different HW design that is related
> > > > to C-states of CPU and energy that is needed to in/out of those deep power states.
> > >
> > > Nice! I wonder if you still have other frequent callbacks on your
> > > system that are getting queued during the tests. Could you dump the
> > > rcu_callbacks trace event and see if you have any CBs frequently
> > > called that the series did not address?
> > >
> > I have pretty much like this:
> > <snip>
> >     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
> >     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
> >     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
> >     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
> >     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> >     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> >
> > so almost everything is batched.
>
> Nice, glad to know this is happening even without the kfree_rcu() changes.
>
> > > Also, one more thing I was curious about is - do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > >
> > I did some experimenting to pin nocbs to the little cluster. For idle use
> > cases I did not see any power gain. For the heavy one I see that the "big"
> > CPUs are also invoking and busy with it quite often. Probably I should think
> > of some use case where I can detect the power difference. If you have
> > something please let me know.
>
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
>
> > > > So a front-lazy-batching is something worth to have, IMHO :)
> > >
> > > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> > > it, we can add your data to the LPC slides as well about your
> > > investigation (with attribution to you).
> > >
> > No problem; since we will give a talk at LPC, the more data we have, the
> > more convincing we are :)
>
> I forget, you did mention you are OK with presenting with us, right? It would
> be great if you present your data when we come to Android, if you are OK with
> it. I'll start a common slide deck soon and share it so you, Rushikesh, and I
> can add slides to it and present together.
>
> thanks,
>
>  - Joel
>
> ---8<-----------------------
>
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] rcutop
>
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  libbpf-tools/Makefile     |   1 +
>  libbpf-tools/rcutop.bpf.c |  56 ++++++++
>  libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
>  libbpf-tools/rcutop.h     |   8 ++
>  4 files changed, 353 insertions(+)
>  create mode 100644 libbpf-tools/rcutop.bpf.c
>  create mode 100644 libbpf-tools/rcutop.c
>  create mode 100644 libbpf-tools/rcutop.h
>
> diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
> index e60ec409..0d4cdff2 100644
> --- a/libbpf-tools/Makefile
> +++ b/libbpf-tools/Makefile
> @@ -42,6 +42,7 @@ APPS = \
>         klockstat \
>         ksnoop \
>         llcstat \
> +       rcutop \
>         mountsnoop \
>         numamove \
>         offcputime \
> diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
> new file mode 100644
> index 00000000..8287bbe2
> --- /dev/null
> +++ b/libbpf-tools/rcutop.bpf.c
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +/* Copyright (c) 2021 Hengqi Chen */
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_tracing.h>
> +#include "rcutop.h"
> +#include "maps.bpf.h"
> +
> +#define MAX_ENTRIES    10240
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(max_entries, MAX_ENTRIES);
> +       __type(key, void *);
> +       __type(value, int);
> +} cbs_queued SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(max_entries, MAX_ENTRIES);
> +       __type(key, void *);
> +       __type(value, int);
> +} cbs_executed SEC(".maps");
> +
> +SEC("tracepoint/rcu/rcu_callback")
> +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
> +{
> +       void *key = ctx->func;
> +       int *val = NULL;
> +       static const int zero;
> +
> +       val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
> +       if (val) {
> +               __sync_fetch_and_add(val, 1);
> +       }
> +
> +       return 0;
> +}
> +
> +SEC("tracepoint/rcu/rcu_invoke_callback")
> +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
> +{
> +       void *key = ctx->func;
> +       int *val = NULL;
> +       static const int zero;
> +
> +       val = bpf_map_lookup_or_try_init(&cbs_executed, &key, &zero);
> +       if (val) {
> +               __sync_fetch_and_add(val, 1);
> +       }
> +
> +       return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
> new file mode 100644
> index 00000000..35795875
> --- /dev/null
> +++ b/libbpf-tools/rcutop.c
> @@ -0,0 +1,288 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +
> +/*
> + * rcutop
> + * Copyright (c) 2022 Joel Fernandes
> + *
> + * 05-May-2022   Joel Fernandes   Created this.
> + */
> +#include <argp.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +#include "rcutop.h"
> +#include "rcutop.skel.h"
> +#include "btf_helpers.h"
> +#include "trace_helpers.h"
> +
> +#define warn(...) fprintf(stderr, __VA_ARGS__)
> +#define OUTPUT_ROWS_LIMIT 10240
> +
> +static volatile sig_atomic_t exiting = 0;
> +
> +static bool clear_screen = true;
> +static int output_rows = 20;
> +static int interval = 1;
> +static int count = 99999999;
> +static bool verbose = false;
> +
> +const char *argp_program_version = "rcutop 0.1";
> +const char *argp_program_bug_address =
> +"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
> +const char argp_program_doc[] =
> +"Show RCU callback queuing and execution stats.\n"
> +"\n"
> +"USAGE: rcutop [-h] [interval] [count]\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    rcutop            # rcu activity top, refresh every 1s\n"
> +"    rcutop 5 10       # 5s summaries, 10 times\n";
> +
> +static const struct argp_option opts[] = {
> +       { "noclear", 'C', NULL, 0, "Don't clear the screen" },
> +       { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
> +       { "verbose", 'v', NULL, 0, "Verbose debug output" },
> +       { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
> +       {},
> +};
> +
> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +       long rows;
> +       static int pos_args;
> +
> +       switch (key) {
> +               case 'C':
> +                       clear_screen = false;
> +                       break;
> +               case 'v':
> +                       verbose = true;
> +                       break;
> +               case 'h':
> +                       argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
> +                       break;
> +               case 'r':
> +                       errno = 0;
> +                       rows = strtol(arg, NULL, 10);
> +                       if (errno || rows <= 0) {
> +                               warn("invalid rows: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       output_rows = rows;
> +                       if (output_rows > OUTPUT_ROWS_LIMIT)
> +                               output_rows = OUTPUT_ROWS_LIMIT;
> +                       break;
> +               case ARGP_KEY_ARG:
> +                       errno = 0;
> +                       if (pos_args == 0) {
> +                               interval = strtol(arg, NULL, 10);
> +                               if (errno || interval <= 0) {
> +                                       warn("invalid interval\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else if (pos_args == 1) {
> +                               count = strtol(arg, NULL, 10);
> +                               if (errno || count <= 0) {
> +                                       warn("invalid count\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else {
> +                               warn("unrecognized positional argument: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       pos_args++;
> +                       break;
> +               default:
> +                       return ARGP_ERR_UNKNOWN;
> +       }
> +       return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +       if (level == LIBBPF_DEBUG && !verbose)
> +               return 0;
> +       return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +       exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +               struct rcutop_bpf *obj)
> +{
> +       void *key, **prev_key = NULL;
> +       int n, err = 0;
> +       int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +       int efd = bpf_map__fd(obj->maps.cbs_executed);
> +       const struct ksym *ksym;
> +       FILE *f;
> +       time_t t;
> +       struct tm *tm;
> +       char ts[16], buf[256];
> +
> +       f = fopen("/proc/loadavg", "r");
> +       if (f) {
> +               time(&t);
> +               tm = localtime(&t);
> +               strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +               memset(buf, 0, sizeof(buf));
> +               n = fread(buf, 1, sizeof(buf), f);
> +               if (n)
> +                       printf("%8s loadavg: %s\n", ts, buf);
> +               fclose(f);
> +       }
> +
> +       printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +       while (1) {
> +               int qcount = 0, ecount = 0;
> +
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +               if (err) {
> +                       warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               prev_key = &key;
> +
> +               bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +               ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +               printf("%-32s %-6d %-6d\n",
> +                               ksym ? ksym->name : "Unknown",
> +                               qcount, ecount);
> +       }
> +       printf("\n");
> +       prev_key = NULL;
> +       while (1) {
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               err = bpf_map_delete_elem(qfd, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               continue;
> +                       }
> +                       warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               bpf_map_delete_elem(efd, &key);
> +               prev_key = &key;
> +       }
> +
> +       return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +       static const struct argp argp = {
> +               .options = opts,
> +               .parser = parse_arg,
> +               .doc = argp_program_doc,
> +       };
> +       struct rcutop_bpf *obj;
> +       int err;
> +       struct syms_cache *syms_cache = NULL;
> +       struct ksyms *ksyms = NULL;
> +
> +       err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +       if (err)
> +               return err;
> +
> +       libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +       libbpf_set_print(libbpf_print_fn);
> +
> +       err = ensure_core_btf(&open_opts);
> +       if (err) {
> +               fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +               return 1;
> +       }
> +
> +       obj = rcutop_bpf__open_opts(&open_opts);
> +       if (!obj) {
> +               warn("failed to open BPF object\n");
> +               return 1;
> +       }
> +
> +       err = rcutop_bpf__load(obj);
> +       if (err) {
> +               warn("failed to load BPF object: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       err = rcutop_bpf__attach(obj);
> +       if (err) {
> +               warn("failed to attach BPF programs: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       ksyms = ksyms__load();
> +       if (!ksyms) {
> +               fprintf(stderr, "failed to load kallsyms\n");
> +               goto cleanup;
> +       }
> +
> +       syms_cache = syms_cache__new(0);
> +       if (!syms_cache) {
> +               fprintf(stderr, "failed to create syms_cache\n");
> +               goto cleanup;
> +       }
> +
> +       if (signal(SIGINT, sig_int) == SIG_ERR) {
> +               warn("can't set signal handler: %s\n", strerror(errno));
> +               err = 1;
> +               goto cleanup;
> +       }
> +
> +       while (1) {
> +               sleep(interval);
> +
> +               if (clear_screen) {
> +                       err = system("clear");
> +                       if (err)
> +                               goto cleanup;
> +               }
> +
> +               err = print_stat(ksyms, syms_cache, obj);
> +               if (err)
> +                       goto cleanup;
> +
> +               count--;
> +               if (exiting || !count)
> +                       goto cleanup;
> +       }
> +
> +cleanup:
> +       rcutop_bpf__destroy(obj);
> +       cleanup_core_btf(&open_opts);
> +
> +       return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX       4096
> +#define TASK_COMM_LEN  16
> +
> +#endif /* __RCUTOP_H */
> --
> 2.36.0.550.gb090851708-goog
>

On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > Never mind. I will port it into 5.10
> > > > > > > >
> > > > > > > > Oh, this is on mainline. Sorry about that. If you want, I have a tree here
> > > > > > > > for 5.10, although it does not have the kfree changes; everything else is
> > > > > > > > the same:
> > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > > >
> > > > > > > No problem. The two kfree_rcu patches are not so important in this series,
> > > > > > > so I have backported the others into my 5.10 kernel, because the latest
> > > > > > > kernel is not so easy to bring up and run on my device :)
> > > > > >
> > > > > > Actually, I was going to mention here that apparently some tests are
> > > > > > showing kfree_rcu()->call_rcu_lazy() causing a possible regression, so
> > > > > > it is good to drop those for your initial testing!
> > > > > >
> > > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not
> > > > > so important for kfree_rcu(), because we batch requests there anyway.
> > > > > One thing that I would like to improve in kfree_rcu() is better utilization
> > > > > of page slots.
> > > > >
> > > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > > >
> > > >
> > > > Here is some data from our Android handset running a 5.10 kernel. The test
> > > > case I checked was a "static image" use case. Condition is: screen ON with
> > > > all connectivity disabled.
> > > >
> > > > 1.
> > > > The first data I took is how many wakeups the RCU subsystem causes during this
> > > > test case while the system is mostly idle. Duration is 360 seconds:
> > > >
> > > > <snip>
> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > >
> > > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > > that we want something like this during testing. Appreciate it. Also,
> > > if we could dump timer wakeup reasons/callbacks, that would be awesome.
> > >
> > Please find it attached. I wrote it a while back and use it to parse
> > "perf script" output, i.e. the raw data. The file name is perf_script_parser.c,
> > so just compile it.
> >
> > How to use it:
> > 1. Run perf: './perf sched record -a -- sleep <seconds to collect data>'
> > 2. ./perf script -i ./perf.data > foo.script
> > 3. ./perf_script_parser ./foo.script
>
> Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
> use "perf sched record" and "perf sched report --sort latency" to get wakeup
> latencies.
>
> > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > > share that with you on request as well. That is probably not in shape
> > > for mainline though (Makefile integration missing and such).
> > >
> > Yep, please share!
>
> Sure, check out my bcc repo from here:
> https://github.com/joelagnel/bcc
>
> Build this project, then cd libbpf-tools and run make. This should produce a
> static binary 'rcutop' which you can push to Android. You have to build for
> ARM, which bcc has instructions for. I have also included the rcutop
> diff at the end of this email for reference.
>
> > > > 2.
> > > > Please find attached two power plots for the same test case: one for regular
> > > > use of call_rcu() and one for "lazy" usage. There is a slight difference in
> > > > power, ~2mA. Even though it is rather small, it is detectable and consistent,
> > > > which is also important, so it proves the concept. Please note it might be
> > > > more power efficient on other arches and platforms, because of differences in
> > > > HW design related to CPU C-states and the energy needed to enter and exit
> > > > those deep power states.
> > >
> > > Nice! I wonder if you still have other frequent callbacks on your
> > > system that are getting queued during the tests. Could you dump the
> > > rcu_callbacks trace event and see if you have any CBs frequently
> > > called that the series did not address?
> > >
> > I have pretty much like this:
> > <snip>
> >     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
> >     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
> >     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
> >     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
> >     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> >     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> >
> > so almost everything is batched.
>
> Nice, glad to know this is happening even without the kfree_rcu() changes.
>
> > > Also, one more thing I was curious about: do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > >
> > I did some experimenting with pinning the nocb threads to the little cluster.
> > For idle use cases I did not see any power gain. For heavy ones I see that the
> > "big" CPUs are also invoking callbacks and are busy with them quite often. I
> > should probably think of a use case where I can detect the power difference.
> > If you have something, please let me know.
>
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
>
> > > > So front-lazy-batching is something worth having, IMHO :)
> > >
> > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with
> > > it, we can add your data to the LPC slides as well about your
> > > investigation (with attribution to you).
> > >
> > No problem; since we will give a talk at LPC, the more data we have, the
> > more convincing we are :)
>
> I forget, you did mention you are Ok with presenting with us right? It would
> be great if you present your data when we come to Android, if you are OK with
> it. I'll start a common slide deck soon and share so you, Rushikesh and me
> can add slides to it and present together.
>
> thanks,
>
>  - Joel
>
> [ rcutop patch snipped; it is quoted in full earlier in this thread ]