Message ID | 20220512030442.2530552-1-joel@joelfernandes.org (mailing list archive)
---|---
Series | Implement call_rcu_lazy() and miscellaneous fixes
On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
>
> Hello!
> Please find the proof of concept version of call_rcu_lazy() attached. This
> gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> Rushikesh Kadam from Intel for investigating it with me.
>
> Some numbers below:
>
> Following are power savings we see on top of RCU_NOCB_CPU on an Intel
> platform. The observation is that, due to a 'trickle down' effect of RCU
> callbacks, the system is very lightly loaded but constantly running a few RCU
> callbacks very often. This convinces the power management hardware that the
> system is active, when it is in fact idle.
>
> For example, when the ChromeOS screen is off and the user is not doing
> anything on the system, we can see big power savings.
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> CorWatt = 0.04
>
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41
> CorWatt = 0.03
>
> Further, when the ChromeOS screen is ON but the system is idle or lightly
> loaded, we can see that the display pipeline is constantly queuing RCU
> callbacks due to the open/close of file descriptors associated with graphics
> buffers. This is attributed to the file_free_rcu() path, which this patch
> series also touches.
>
> This patch series adds a simple but effective, lockless implementation of
> RCU callback batching. On memory pressure, timeout, or the queue growing too
> big, we initiate a flush of one or more per-CPU lists.
>
> Similar results can be achieved by increasing jiffies_till_first_fqs, but
> that also has the effect of slowing down RCU. In particular, I saw a huge
> slowdown of the function graph tracer when increasing it.
>
> One drawback of this series is that if another frequent RCU callback that is
> not lazy creeps up in the future, it will again hurt the power. However, I
> believe identifying and fixing those is a more reasonable approach than
> slowing RCU down for the whole system.
>
> NOTE: A debug patch is added to the series to toggle /proc/sys/kernel/rcu_lazy
> at runtime and turn it on or off globally. It defaults to on. Further, please
> use the sysctls in lazy.c for further tuning of the parameters that affect
> the flushing.
>
> Disclaimer 1: Don't boot your personal system on it yet anticipating power
> savings, as TREE07 still causes RCU stalls and I am looking more into that,
> but I believe this series should be good for general testing.
>
> Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> of review and agreement.

I forgot to add Disclaimer 3: this breaks rcu_barrier(), and support for that
definitely needs work.

thanks,

- Joel
Hello, Joel!

Which kernel version have you used for this series?

--
Uladzislau Rezki

On Thu, May 12, 2022 at 5:18 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> > [ full cover letter snipped ]
Never mind, I ported it to 5.10.

On Thu, May 12, 2022 at 3:09 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> Hello, Joel!
>
> Which kernel version have you used for this series?
>
> [ snip ]
>
> --
> Uladzislau Rezki
On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> Never mind. I port it into 5.10

Oh, this is on mainline. Sorry about that. If you want, I have a tree here for
5.10, although that does not have the kfree changes; everything else is ditto.

https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4

thanks,

- Joel
> On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > Never mind. I port it into 5.10
>
> Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> 5.10, although that does not have the kfree changes, everything else is
> ditto.
> https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
>
No problem. The two kfree_rcu patches are not so important in this series, so
I have backported the rest into my 5.10 kernel, because the latest kernel is
not so easy to bring up and run on my device :)

--
Uladzislau Rezki
On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> [ snip ]
>
> No problem. kfree_rcu two patches are not so important in this series.
> So i have backported them into my 5.10 kernel because the latest kernel
> is not so easy to up and run on my device :)

Actually, I was going to write here that apparently some tests are showing the
kfree_rcu()->call_rcu_lazy() conversion causing a possible regression. So it
is good to drop those for your initial testing!

Thanks,

- Joel
> On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > [ snip ]
>
> Actually I was going to write here, apparently some tests are showing
> kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> good to drop those for your initial testing!
>
Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not so
important for kfree_rcu() because we do batch requests there anyway. One thing
that I would like to improve in kfree_rcu() is better utilization of the page
slots.

I will share my results either tomorrow or on Monday. I hope that is fine.

--
Uladzislau Rezki
On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > [ snip ]
> >
> > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > I believe this series should be good for general testing.

Sometimes OOM conditions result in stalls.

> > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> > of review and agreements.

We will of course need them to look at the call_rcu_lazy() conversions at some
point, but in the meantime, experimentation is fine. I looked at a few, but
quickly decided to defer to the people with a better understanding of the
code.

> I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> support for that definitely needs work.

Good to know. ;-)

With this in place, can the system survive a userspace close(open()) loop, or
does that result in OOM? (I am not worried about battery lifetime while
close(open()) is running, just OOM resistance.)

Does waiting for the shrinker to kick in suffice, or should the system
pressure be taken into account? As in the "total" numbers from
/proc/pressure/memory.

Again, it is very good to see this series!

							Thanx, Paul
On Thu, May 12, 2022 at 8:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> > [ snip ]
> > > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > > I believe this series should be good for general testing.
>
> Sometimes OOM conditions result in stalls.

I see.

> > I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> > support for that definitely needs work.
>
> Good to know. ;-)
>
> With this in place, can the system survive a userspace close(open())
> loop, or does that result in OOM? (I am not worried about battery
> lifetime while close(open()) is running, just OOM resistance.)

Yes, in my testing it survived. I even dropped memory to 512MB and did the
open/close loop test. I believe it also survives because we don't let the
list grow too big (other than the shrinker flushing).

> Does waiting for the shrinker to kick in suffice, or should the
> system pressure be taken into account? As in the "total" numbers
> from /proc/pressure/memory.

I did not find that taking system memory pressure into account is necessary.

> Again, it is very good to see this series!

Thanks, I appreciate that. I am excited about battery life savings in
millions of battery-powered devices ;-) Even on my grandmom's android
phone ;-)

Thanks,

- Joel
On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> [ snip ]
>
> Here we go with some data on our Android handset that runs a 5.10 kernel.
> The test case I checked was a "static image" use case. Condition is: screen
> ON with all connectivity disabled.
>
> 1.
> The first data I took is how many wakeups the RCU subsystem causes during
> this test case when everything is pretty much idling. Duration is 360
> seconds:
>
> <snip>
> serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu

Nice! Do you mind sharing this script? I was just talking to Rushikesh about
how we want something like this during testing. Appreciate it. Also, if we
could dump timer wakeup reasons/callbacks, that would be awesome.

FWIW, I wrote a BPF tool that periodically dumps callbacks and can share that
with you on request as well. That is probably not in a shape for mainline
though (Makefile missing and such).

> name: rcub/0  pid: 16 woken-up 2  interval: min 86772734 max 86772734 avg 43386367
> name: rcuop/7 pid: 69 woken-up 4  interval: min 4189     max 8050     avg 5049
> name: rcuop/6 pid: 62 woken-up 55 interval: min 6910     max 42592159 avg 3818752
[..]

> There is a big improvement in the lazy case. The number of wake-ups got
> reduced quite a lot, and it is really good!

Cool!

> 2.
> Please find in the attachment two power plots for the same test case. One is
> for regular use of call_rcu() and the second one is "lazy" usage. There is a
> slight difference in power, ~2mA. Even though it is rather small, it is
> detectable and solid, which is also important; thus it proves the concept.
> Please note it might be more power efficient on other arches and platforms,
> because of differences in HW design related to CPU C-states and the energy
> needed to enter and exit those deep power states.

Nice! I wonder if you still have other frequent callbacks on your system that
are getting queued during the tests. Could you dump the rcu_callbacks trace
event and see if you have any CBs frequently called that the series did not
address?

Also, one more thing I was curious about: do you see savings when you pin the
rcu threads to the LITTLE CPUs of the system? The theory being that not
disturbing the BIG CPUs, which are more power hungry, may let them go into a
deeper idle state and save power (due to leakage current and so forth).

> So front-lazy-batching is something worth having, IMHO :)

Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with it, we can
add your data about your investigation to the LPC slides as well (with
attribution to you).

thanks,

- Joel
> On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > [ snip ]
>
> Nice! Do you mind sharing this script? I was just talking to Rushikesh
> that we want something like this during testing. Appreciate it. Also,
> if we dump timer wakeup reasons/callbacks that would also be awesome.
>
Please find it in the attachment. I wrote it some time ago to parse "perf
script" output, i.e. raw data. The file name is perf_script_parser.c, so just
compile it.

How to use it:
1. Run perf: './perf sched record -a -- sleep "how long in seconds you want to collect data"'
2. ./perf script -i ./perf.data > foo.script
3. ./perf_script_parser ./foo.script

> FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> share that with you on request as well. That is probably not in a
> shape for mainline though (Makefile missing and such).
>
Yep, please share!

> [ snip ]
>
> Nice! I wonder if you still have other frequent callbacks on your
> system that are getting queued during the tests. Could you dump the
> rcu_callbacks trace event and see if you have any CBs frequently
> called that the series did not address?
> I have pretty much like this: <snip> rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15 rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16 rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10 <snip> so almost everything is batched. > > Also, one more thing I was curious about is - do you see savings when > you pin the rcu threads to the LITTLE CPUs of the system? The theory > being, not disturbing the BIG CPUs which are more power hungry may let > them go into a deeper idle state and save power (due to leakage > current and so forth). > I did some experimenting to pin nocbs to a little cluster. For idle use cases i did not see any power gain. 
For the heavy one, I see that the "big" CPUs are also invoking callbacks
and are busy with it quite often. Probably I should think of some use case
where I can detect the power difference. If you have something, please let
me know.

> > So a front-lazy-batching is something worth having, IMHO :)
>
> Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with
> it, we can add your data to the LPC slides as well about your
> investigation (with attribution to you).
>
No problem, since we will give a talk at LPC; the more data we have, the
more convincing we are :)

--
Uladzislau Rezki
On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote: > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote: > > > > > > > > Never mind. I port it into 5.10 > > > > > > > > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for > > > > > > > 5.10 , although that does not have the kfree changes, everything else is > > > > > > > ditto. > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4 > > > > > > > > > > > > > No problem. kfree_rcu two patches are not so important in this series. > > > > > > So i have backported them into my 5.10 kernel because the latest kernel > > > > > > is not so easy to up and run on my device :) > > > > > > > > > > Actually I was going to write here, apparently some tests are showing > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is > > > > > good to drop those for your initial testing! > > > > > > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not > > > > so important for kfree_rcu() because we do batch requests there anyway. > > > > One thing that i would like to improve in kfree_rcu() is a better utilization > > > > of page slots. > > > > > > > > I will share my results either tomorrow or on Monday. I hope that is fine. > > > > > > > > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test > > > case i have checked was a "static image" use case. Condition is: screen ON with > > > disabled all connectivity. > > > > > > 1. > > > First data i took is how many wakeups cause an RCU subsystem during this test case > > > when everything is pretty idling. 
Duration is 360 seconds: > > > > > > <snip> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu > > > > Nice! Do you mind sharing this script? I was just talking to Rushikesh > > that we want something like this during testing. Appreciate it. Also, > > if we dump timer wakeup reasons/callbacks that would also be awesome. > > > Please find in attachment. I wrote it once upon a time and make use of > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c > so just compile it. > > How to use it: > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"' > 2. ./perf script -i ./perf.data > foo.script > 3. ./perf_script_parser ./foo.script Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also use "perf sched record" and "perf sched report --sort latency" to get wakeup latencies. > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can > > share that with you on request as well. That is probably not in a > > shape for mainline though (Makefile missing and such). > > > Yep, please share! Sure, check out my bcc repo from here: https://github.com/joelagnel/bcc Build this project, then cd libbpf-tools and run make. This should produce a static binary 'rcutop' which you can push to Android. You have to build for ARM which bcc should have instructions for. I have also included the rcuptop diff at the end of this file for reference. > > > 2. > > > Please find in attachment two power plots. The same test case. One is related to a > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid > > > what is also important, thus it proofs the concept. Please note it might be more power > > > efficient for other arches and platforms. 
Because of different HW design that is related > > > to C-states of CPU and energy that is needed to in/out of those deep power states. > > > > Nice! I wonder if you still have other frequent callbacks on your > > system that are getting queued during the tests. Could you dump the > > rcu_callbacks trace event and see if you have any CBs frequently > > called that the series did not address? > > > I have pretty much like this: > <snip> > rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 > rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 > rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 > rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 > rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15 > rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10 > <snip> > > so almost 
everything is batched.

Nice, glad to know this is happening even without the kfree_rcu() changes.

> > Also, one more thing I was curious about is - do you see savings when
> > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > being, not disturbing the BIG CPUs which are more power hungry may let
> > them go into a deeper idle state and save power (due to leakage
> > current and so forth).
> >
> I did some experimenting to pin nocbs to a little cluster. For idle use
> cases i did not see any power gain. For heavy one i see that "big" CPUs
> are also invoking and busy with it quite often. Probably i should think
> of some use case where i can detect the power difference. If you have
> something please let me know.

Yeah, probably screen off + audio playback might be a good one, because it
lightly loads the CPUs.

> > > So a front-lazy-batching is something worth to have, IMHO :)
> >
> > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with
> > it, we can add your data to the LPC slides as well about your
> > investigation (with attribution to you).
> >
> No problem, since we will give a talk on LPC, more data we have more
> convincing we are :)

I forget: you did mention you are OK with presenting with us, right? It
would be great if you present your data when we come to Android, if you
are OK with it. I'll start a common slide deck soon and share it so you,
Rushikesh, and I can add slides and present together.
thanks, - Joel ---8<----------------------- From: Joel Fernandes <joelaf@google.com> Subject: [PATCH] rcutop Signed-off-by: Joel Fernandes <joelaf@google.com> --- libbpf-tools/Makefile | 1 + libbpf-tools/rcutop.bpf.c | 56 ++++++++ libbpf-tools/rcutop.c | 288 ++++++++++++++++++++++++++++++++++++++ libbpf-tools/rcutop.h | 8 ++ 4 files changed, 353 insertions(+) create mode 100644 libbpf-tools/rcutop.bpf.c create mode 100644 libbpf-tools/rcutop.c create mode 100644 libbpf-tools/rcutop.h diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile index e60ec409..0d4cdff2 100644 --- a/libbpf-tools/Makefile +++ b/libbpf-tools/Makefile @@ -42,6 +42,7 @@ APPS = \ klockstat \ ksnoop \ llcstat \ + rcutop \ mountsnoop \ numamove \ offcputime \ diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c new file mode 100644 index 00000000..8287bbe2 --- /dev/null +++ b/libbpf-tools/rcutop.bpf.c @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* Copyright (c) 2021 Hengqi Chen */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> +#include <bpf/bpf_tracing.h> +#include "rcutop.h" +#include "maps.bpf.h" + +#define MAX_ENTRIES 10240 + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_ENTRIES); + __type(key, void *); + __type(value, int); +} cbs_queued SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_ENTRIES); + __type(key, void *); + __type(value, int); +} cbs_executed SEC(".maps"); + +SEC("tracepoint/rcu/rcu_callback") +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx) +{ + void *key = ctx->func; + int *val = NULL; + static const int zero; + + val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero); + if (val) { + __sync_fetch_and_add(val, 1); + } + + return 0; +} + +SEC("tracepoint/rcu/rcu_invoke_callback") +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx) +{ + void *key = ctx->func; + int *val; + 
int zero = 0; + + val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero); + if (val) { + __sync_fetch_and_add(val, 1); + } + + return 0; +} + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c new file mode 100644 index 00000000..35795875 --- /dev/null +++ b/libbpf-tools/rcutop.c @@ -0,0 +1,288 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ + +/* + * rcutop + * Copyright (c) 2022 Joel Fernandes + * + * 05-May-2022 Joel Fernandes Created this. + */ +#include <argp.h> +#include <errno.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h> +#include <unistd.h> + +#include <bpf/libbpf.h> +#include <bpf/bpf.h> +#include "rcutop.h" +#include "rcutop.skel.h" +#include "btf_helpers.h" +#include "trace_helpers.h" + +#define warn(...) fprintf(stderr, __VA_ARGS__) +#define OUTPUT_ROWS_LIMIT 10240 + +static volatile sig_atomic_t exiting = 0; + +static bool clear_screen = true; +static int output_rows = 20; +static int interval = 1; +static int count = 99999999; +static bool verbose = false; + +const char *argp_program_version = "rcutop 0.1"; +const char *argp_program_bug_address = +"https://github.com/iovisor/bcc/tree/master/libbpf-tools"; +const char argp_program_doc[] = +"Show RCU callback queuing and execution stats.\n" +"\n" +"USAGE: rcutop [-h] [interval] [count]\n" +"\n" +"EXAMPLES:\n" +" rcutop # rcu activity top, refresh every 1s\n" +" rcutop 5 10 # 5s summaries, 10 times\n"; + +static const struct argp_option opts[] = { + { "noclear", 'C', NULL, 0, "Don't clear the screen" }, + { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" }, + { "verbose", 'v', NULL, 0, "Verbose debug output" }, + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, + {}, +}; + +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + long rows; + static int pos_args; + + switch (key) { + case 'C': + clear_screen = 
false; + break; + case 'v': + verbose = true; + break; + case 'h': + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); + break; + case 'r': + errno = 0; + rows = strtol(arg, NULL, 10); + if (errno || rows <= 0) { + warn("invalid rows: %s\n", arg); + argp_usage(state); + } + output_rows = rows; + if (output_rows > OUTPUT_ROWS_LIMIT) + output_rows = OUTPUT_ROWS_LIMIT; + break; + case ARGP_KEY_ARG: + errno = 0; + if (pos_args == 0) { + interval = strtol(arg, NULL, 10); + if (errno || interval <= 0) { + warn("invalid interval\n"); + argp_usage(state); + } + } else if (pos_args == 1) { + count = strtol(arg, NULL, 10); + if (errno || count <= 0) { + warn("invalid count\n"); + argp_usage(state); + } + } else { + warn("unrecognized positional argument: %s\n", arg); + argp_usage(state); + } + pos_args++; + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !verbose) + return 0; + return vfprintf(stderr, format, args); +} + +static void sig_int(int signo) +{ + exiting = 1; +} + +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache, + struct rcutop_bpf *obj) +{ + void *key, **prev_key = NULL; + int n, err = 0; + int qfd = bpf_map__fd(obj->maps.cbs_queued); + int efd = bpf_map__fd(obj->maps.cbs_executed); + const struct ksym *ksym; + FILE *f; + time_t t; + struct tm *tm; + char ts[16], buf[256]; + + f = fopen("/proc/loadavg", "r"); + if (f) { + time(&t); + tm = localtime(&t); + strftime(ts, sizeof(ts), "%H:%M:%S", tm); + memset(buf, 0, sizeof(buf)); + n = fread(buf, 1, sizeof(buf), f); + if (n) + printf("%8s loadavg: %s\n", ts, buf); + fclose(f); + } + + printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed"); + + while (1) { + int qcount = 0, ecount = 0; + + err = bpf_map_get_next_key(qfd, prev_key, &key); + if (err) { + if (errno == ENOENT) { + err = 0; + break; + } + warn("bpf_map_get_next_key failed: 
%s\n", strerror(errno)); + return err; + } + + err = bpf_map_lookup_elem(qfd, &key, &qcount); + if (err) { + warn("bpf_map_lookup_elem failed: %s\n", strerror(errno)); + return err; + } + prev_key = &key; + + bpf_map_lookup_elem(efd, &key, &ecount); + + ksym = ksyms__map_addr(ksyms, (unsigned long)key); + printf("%-32s %-6d %-6d\n", + ksym ? ksym->name : "Unknown", + qcount, ecount); + } + printf("\n"); + prev_key = NULL; + while (1) { + err = bpf_map_get_next_key(qfd, prev_key, &key); + if (err) { + if (errno == ENOENT) { + err = 0; + break; + } + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); + return err; + } + err = bpf_map_delete_elem(qfd, &key); + if (err) { + if (errno == ENOENT) { + err = 0; + continue; + } + warn("bpf_map_delete_elem failed: %s\n", strerror(errno)); + return err; + } + + bpf_map_delete_elem(efd, &key); + prev_key = &key; + } + + return err; +} + +int main(int argc, char **argv) +{ + LIBBPF_OPTS(bpf_object_open_opts, open_opts); + static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, + }; + struct rcutop_bpf *obj; + int err; + struct syms_cache *syms_cache = NULL; + struct ksyms *ksyms = NULL; + + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) + return err; + + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); + libbpf_set_print(libbpf_print_fn); + + err = ensure_core_btf(&open_opts); + if (err) { + fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); + return 1; + } + + obj = rcutop_bpf__open_opts(&open_opts); + if (!obj) { + warn("failed to open BPF object\n"); + return 1; + } + + err = rcutop_bpf__load(obj); + if (err) { + warn("failed to load BPF object: %d\n", err); + goto cleanup; + } + + err = rcutop_bpf__attach(obj); + if (err) { + warn("failed to attach BPF programs: %d\n", err); + goto cleanup; + } + + ksyms = ksyms__load(); + if (!ksyms) { + fprintf(stderr, "failed to load kallsyms\n"); + goto cleanup; + } + + syms_cache = 
syms_cache__new(0); + if (!syms_cache) { + fprintf(stderr, "failed to create syms_cache\n"); + goto cleanup; + } + + if (signal(SIGINT, sig_int) == SIG_ERR) { + warn("can't set signal handler: %s\n", strerror(errno)); + err = 1; + goto cleanup; + } + + while (1) { + sleep(interval); + + if (clear_screen) { + err = system("clear"); + if (err) + goto cleanup; + } + + err = print_stat(ksyms, syms_cache, obj); + if (err) + goto cleanup; + + count--; + if (exiting || !count) + goto cleanup; + } + +cleanup: + rcutop_bpf__destroy(obj); + cleanup_core_btf(&open_opts); + + return err != 0; +} diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h new file mode 100644 index 00000000..cb2a3557 --- /dev/null +++ b/libbpf-tools/rcutop.h @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +#ifndef __RCUTOP_H +#define __RCUTOP_H + +#define PATH_MAX 4096 +#define TASK_COMM_LEN 16 + +#endif /* __RCUTOP_H */
> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote: > > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote: > > > > > > > > > Never mind. I port it into 5.10 > > > > > > > > > > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for > > > > > > > > 5.10 , although that does not have the kfree changes, everything else is > > > > > > > > ditto. > > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4 > > > > > > > > > > > > > > > No problem. kfree_rcu two patches are not so important in this series. > > > > > > > So i have backported them into my 5.10 kernel because the latest kernel > > > > > > > is not so easy to up and run on my device :) > > > > > > > > > > > > Actually I was going to write here, apparently some tests are showing > > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is > > > > > > good to drop those for your initial testing! > > > > > > > > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not > > > > > so important for kfree_rcu() because we do batch requests there anyway. > > > > > One thing that i would like to improve in kfree_rcu() is a better utilization > > > > > of page slots. > > > > > > > > > > I will share my results either tomorrow or on Monday. I hope that is fine. > > > > > > > > > > > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test > > > > case i have checked was a "static image" use case. Condition is: screen ON with > > > > disabled all connectivity. > > > > > > > > 1. > > > > First data i took is how many wakeups cause an RCU subsystem during this test case > > > > when everything is pretty idling. 
Duration is 360 seconds: > > > > > > > > <snip> > > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu > > > > > > Nice! Do you mind sharing this script? I was just talking to Rushikesh > > > that we want something like this during testing. Appreciate it. Also, > > > if we dump timer wakeup reasons/callbacks that would also be awesome. > > > > > Please find in attachment. I wrote it once upon a time and make use of > > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c > > so just compile it. > > > > How to use it: > > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"' > > 2. ./perf script -i ./perf.data > foo.script > > 3. ./perf_script_parser ./foo.script > > Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also > use "perf sched record" and "perf sched report --sort latency" to get wakeup > latencies. > Good. "perf sched report --sort latency" it must be related to wakeup delays? > > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can > > > share that with you on request as well. That is probably not in a > > > shape for mainline though (Makefile missing and such). > > > > > Yep, please share! > > Sure, check out my bcc repo from here: > https://github.com/joelagnel/bcc > > Build this project, then cd libbpf-tools and run make. This should produce a > static binary 'rcutop' which you can push to Android. You have to build for > ARM which bcc should have instructions for. I have also included the rcuptop > diff at the end of this file for reference. > Cool. Will try it later :) > > > > 2. > > > > Please find in attachment two power plots. The same test case. One is related to a > > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference > > > > in power, it is ~2mA. 
Event though it is rather small but it is detectable and solid > > > > what is also important, thus it proofs the concept. Please note it might be more power > > > > efficient for other arches and platforms. Because of different HW design that is related > > > > to C-states of CPU and energy that is needed to in/out of those deep power states. > > > > > > Nice! I wonder if you still have other frequent callbacks on your > > > system that are getting queued during the tests. Could you dump the > > > rcu_callbacks trace event and see if you have any CBs frequently > > > called that the series did not address? > > > > > I have pretty much like this: > > <snip> > > rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 > > rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 > > rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 > > rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 > > rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > 
rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> > rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> > rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> >
> > so almost everything is batched.
>
> Nice, glad to know this is happening even without the kfree_rcu() changes.
>
kvfree_rcu() does the job quite well, even though, as I mentioned, I would
like to send a patch about better page-slot utilization. But yes, since
there is already a batching mechanism there, call_rcu_lazy() does not add
any further benefit.

> > > Also, one more thing I was curious about is - do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > >
> > I did some experimenting to pin nocbs to a little cluster. For idle use
> > cases i did not see any power gain. For heavy one i see that "big" CPUs
> > are also invoking and busy with it quite often. Probably i should think
> > of some use case where i can detect the power difference. If you have
> > something please let me know.
>
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
>
Hm.. I will see and might check it, just in case! I have implemented a
simple static call_rcu() workload generator in order to examine this in a
more controlled environment, so I would like to see how it behaves under a
static workload.

> > > > So a front-lazy-batching is something worth to have, IMHO :)
> > >
> > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with
> > > it, we can add your data to the LPC slides as well about your
> > > investigation (with attribution to you).
> > > > > No problem, since we will give a talk on LPC, more data we have more > > convincing we are :) > > I forget, you did mention you are Ok with presenting with us right? It would > be great if you present your data when we come to Android, if you are OK with > it. I'll start a common slide deck soon and share so you, Rushikesh and me > can add slides to it and present together. > I am OK to report with you together. I can do and prepare some data related to examination on our Android and big.LITTLE platform that we use for our customers. > thanks, > > - Joel > > ---8<----------------------- > > From: Joel Fernandes <joelaf@google.com> > Subject: [PATCH] rcutop > > Signed-off-by: Joel Fernandes <joelaf@google.com> > --- > libbpf-tools/Makefile | 1 + > libbpf-tools/rcutop.bpf.c | 56 ++++++++ > libbpf-tools/rcutop.c | 288 ++++++++++++++++++++++++++++++++++++++ > libbpf-tools/rcutop.h | 8 ++ > 4 files changed, 353 insertions(+) > create mode 100644 libbpf-tools/rcutop.bpf.c > create mode 100644 libbpf-tools/rcutop.c > create mode 100644 libbpf-tools/rcutop.h > > diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile > index e60ec409..0d4cdff2 100644 > --- a/libbpf-tools/Makefile > +++ b/libbpf-tools/Makefile > @@ -42,6 +42,7 @@ APPS = \ > klockstat \ > ksnoop \ > llcstat \ > + rcutop \ > mountsnoop \ > numamove \ > offcputime \ > diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c > new file mode 100644 > index 00000000..8287bbe2 > --- /dev/null > +++ b/libbpf-tools/rcutop.bpf.c > @@ -0,0 +1,56 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +/* Copyright (c) 2021 Hengqi Chen */ > +#include <vmlinux.h> > +#include <bpf/bpf_helpers.h> > +#include <bpf/bpf_core_read.h> > +#include <bpf/bpf_tracing.h> > +#include "rcutop.h" > +#include "maps.bpf.h" > + > +#define MAX_ENTRIES 10240 > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} 
cbs_queued SEC(".maps"); > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_executed SEC(".maps"); > + > +SEC("tracepoint/rcu/rcu_callback") > +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx) > +{ > + void *key = ctx->func; > + int *val = NULL; > + static const int zero; > + > + val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +SEC("tracepoint/rcu/rcu_invoke_callback") > +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx) > +{ > + void *key = ctx->func; > + int *val; > + int zero = 0; > + > + val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +char LICENSE[] SEC("license") = "Dual BSD/GPL"; > diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c > new file mode 100644 > index 00000000..35795875 > --- /dev/null > +++ b/libbpf-tools/rcutop.c > @@ -0,0 +1,288 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > + > +/* > + * rcutop > + * Copyright (c) 2022 Joel Fernandes > + * > + * 05-May-2022 Joel Fernandes Created this. > + */ > +#include <argp.h> > +#include <errno.h> > +#include <signal.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <time.h> > +#include <unistd.h> > + > +#include <bpf/libbpf.h> > +#include <bpf/bpf.h> > +#include "rcutop.h" > +#include "rcutop.skel.h" > +#include "btf_helpers.h" > +#include "trace_helpers.h" > + > +#define warn(...) 
fprintf(stderr, __VA_ARGS__) > +#define OUTPUT_ROWS_LIMIT 10240 > + > +static volatile sig_atomic_t exiting = 0; > + > +static bool clear_screen = true; > +static int output_rows = 20; > +static int interval = 1; > +static int count = 99999999; > +static bool verbose = false; > + > +const char *argp_program_version = "rcutop 0.1"; > +const char *argp_program_bug_address = > +"https://github.com/iovisor/bcc/tree/master/libbpf-tools"; > +const char argp_program_doc[] = > +"Show RCU callback queuing and execution stats.\n" > +"\n" > +"USAGE: rcutop [-h] [interval] [count]\n" > +"\n" > +"EXAMPLES:\n" > +" rcutop # rcu activity top, refresh every 1s\n" > +" rcutop 5 10 # 5s summaries, 10 times\n"; > + > +static const struct argp_option opts[] = { > + { "noclear", 'C', NULL, 0, "Don't clear the screen" }, > + { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" }, > + { "verbose", 'v', NULL, 0, "Verbose debug output" }, > + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, > + {}, > +}; > + > +static error_t parse_arg(int key, char *arg, struct argp_state *state) > +{ > + long rows; > + static int pos_args; > + > + switch (key) { > + case 'C': > + clear_screen = false; > + break; > + case 'v': > + verbose = true; > + break; > + case 'h': > + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); > + break; > + case 'r': > + errno = 0; > + rows = strtol(arg, NULL, 10); > + if (errno || rows <= 0) { > + warn("invalid rows: %s\n", arg); > + argp_usage(state); > + } > + output_rows = rows; > + if (output_rows > OUTPUT_ROWS_LIMIT) > + output_rows = OUTPUT_ROWS_LIMIT; > + break; > + case ARGP_KEY_ARG: > + errno = 0; > + if (pos_args == 0) { > + interval = strtol(arg, NULL, 10); > + if (errno || interval <= 0) { > + warn("invalid interval\n"); > + argp_usage(state); > + } > + } else if (pos_args == 1) { > + count = strtol(arg, NULL, 10); > + if (errno || count <= 0) { > + warn("invalid count\n"); > + argp_usage(state); > + } > + } else { > + warn("unrecognized 
positional argument: %s\n", arg); > + argp_usage(state); > + } > + pos_args++; > + break; > + default: > + return ARGP_ERR_UNKNOWN; > + } > + return 0; > +} > + > +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) > +{ > + if (level == LIBBPF_DEBUG && !verbose) > + return 0; > + return vfprintf(stderr, format, args); > +} > + > +static void sig_int(int signo) > +{ > + exiting = 1; > +} > + > +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache, > + struct rcutop_bpf *obj) > +{ > + void *key, **prev_key = NULL; > + int n, err = 0; > + int qfd = bpf_map__fd(obj->maps.cbs_queued); > + int efd = bpf_map__fd(obj->maps.cbs_executed); > + const struct ksym *ksym; > + FILE *f; > + time_t t; > + struct tm *tm; > + char ts[16], buf[256]; > + > + f = fopen("/proc/loadavg", "r"); > + if (f) { > + time(&t); > + tm = localtime(&t); > + strftime(ts, sizeof(ts), "%H:%M:%S", tm); > + memset(buf, 0, sizeof(buf)); > + n = fread(buf, 1, sizeof(buf), f); > + if (n) > + printf("%8s loadavg: %s\n", ts, buf); > + fclose(f); > + } > + > + printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed"); > + > + while (1) { > + int qcount = 0, ecount = 0; > + > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + > + err = bpf_map_lookup_elem(qfd, &key, &qcount); > + if (err) { > + warn("bpf_map_lookup_elem failed: %s\n", strerror(errno)); > + return err; > + } > + prev_key = &key; > + > + bpf_map_lookup_elem(efd, &key, &ecount); > + > + ksym = ksyms__map_addr(ksyms, (unsigned long)key); > + printf("%-32s %-6d %-6d\n", > + ksym ? 
ksym->name : "Unknown", > + qcount, ecount); > + } > + printf("\n"); > + prev_key = NULL; > + while (1) { > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + err = bpf_map_delete_elem(qfd, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + continue; > + } > + warn("bpf_map_delete_elem failed: %s\n", strerror(errno)); > + return err; > + } > + > + bpf_map_delete_elem(efd, &key); > + prev_key = &key; > + } > + > + return err; > +} > + > +int main(int argc, char **argv) > +{ > + LIBBPF_OPTS(bpf_object_open_opts, open_opts); > + static const struct argp argp = { > + .options = opts, > + .parser = parse_arg, > + .doc = argp_program_doc, > + }; > + struct rcutop_bpf *obj; > + int err; > + struct syms_cache *syms_cache = NULL; > + struct ksyms *ksyms = NULL; > + > + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); > + if (err) > + return err; > + > + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); > + libbpf_set_print(libbpf_print_fn); > + > + err = ensure_core_btf(&open_opts); > + if (err) { > + fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); > + return 1; > + } > + > + obj = rcutop_bpf__open_opts(&open_opts); > + if (!obj) { > + warn("failed to open BPF object\n"); > + return 1; > + } > + > + err = rcutop_bpf__load(obj); > + if (err) { > + warn("failed to load BPF object: %d\n", err); > + goto cleanup; > + } > + > + err = rcutop_bpf__attach(obj); > + if (err) { > + warn("failed to attach BPF programs: %d\n", err); > + goto cleanup; > + } > + > + ksyms = ksyms__load(); > + if (!ksyms) { > + fprintf(stderr, "failed to load kallsyms\n"); > + goto cleanup; > + } > + > + syms_cache = syms_cache__new(0); > + if (!syms_cache) { > + fprintf(stderr, "failed to create syms_cache\n"); > + goto cleanup; > + } > + > + if (signal(SIGINT, sig_int) == SIG_ERR) { > + warn("can't 
set signal handler: %s\n", strerror(errno)); > + err = 1; > + goto cleanup; > + } > + > + while (1) { > + sleep(interval); > + > + if (clear_screen) { > + err = system("clear"); > + if (err) > + goto cleanup; > + } > + > + err = print_stat(ksyms, syms_cache, obj); > + if (err) > + goto cleanup; > + > + count--; > + if (exiting || !count) > + goto cleanup; > + } > + > +cleanup: > + rcutop_bpf__destroy(obj); > + cleanup_core_btf(&open_opts); > + > + return err != 0; > +} > diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h > new file mode 100644 > index 00000000..cb2a3557 > --- /dev/null > +++ b/libbpf-tools/rcutop.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +#ifndef __RCUTOP_H > +#define __RCUTOP_H > + > +#define PATH_MAX 4096 > +#define TASK_COMM_LEN 16 > + > +#endif /* __RCUTOP_H */ > -- > 2.36.0.550.gb090851708-goog >
On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote: > Hello! > Please find the proof of concept version of call_rcu_lazy() attached. This > gives a lot of savings when the CPUs are relatively idle. Huge thanks to > Rushikesh Kadam from Intel for investigating it with me. Just a status update, we're reworking this code to use the bypass lists which take care of a lot of corner cases. I've had to make changes to the rcuog thread code to handle wakeups a bit differently and such. Initial testing is looking good. Currently working on the early-flushing of the lazy CBs based on memory pressure. Should hopefully be sending v2 soon! thanks, - Joel > > Some numbers below: > > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform. > The observation is that due to a 'trickle down' effect of RCU callbacks, the > system is very lightly loaded but constantly running few RCU callbacks very > often. This confuses the power management hardware that the system is active, > when it is in fact idle. > > For example, when ChromeOS screen is off and user is not doing anything on the > system, we can see big power savings. > Before: > Pk%pc10 = 72.13 > PkgWatt = 0.58 > CorWatt = 0.04 > > After: > Pk%pc10 = 81.28 > PkgWatt = 0.41 > CorWatt = 0.03 > > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we > can see that the display pipeline is constantly doing RCU callback queuing due > to open/close of file descriptors associated with graphics buffers. This is > attributed to the file_free_rcu() path which this patch series also touches. > > This patch series adds a simple but effective, and lockless implementation of > RCU callback batching. On memory pressure, timeout or queue growing too big, we > initiate a flush of one or more per-CPU lists. > > Similar results can be achieved by increasing jiffies_till_first_fqs, however > that also has the effect of slowing down RCU. 
In particular, I saw a huge slowdown > of the function graph tracer when increasing it. > > One drawback of this series is that if another frequent RCU callback creeps up in > the future that is not lazy, then it will again hurt the power. However, I > believe identifying and fixing those is a more reasonable approach than slowing > RCU down for the whole system. > > NOTE: A debug patch is added in the series to toggle /proc/sys/kernel/rcu_lazy > at runtime to turn it on or off globally. It defaults to on. Further, please > use the sysctls in lazy.c for further tuning of parameters that affect the > flushing. > > Disclaimer 1: Don't boot your personal system on it yet anticipating power > savings, as TREE07 still causes RCU stalls and I am looking more into that, but > I believe this series should be good for general testing. > > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds > of review and agreements. 
> > Joel Fernandes (Google) (14): > rcu: Add a lock-less lazy RCU implementation > workqueue: Add a lazy version of queue_rcu_work() > block/blk-ioc: Move call_rcu() to call_rcu_lazy() > cred: Move call_rcu() to call_rcu_lazy() > fs: Move call_rcu() to call_rcu_lazy() in some paths > kernel: Move various core kernel usages to call_rcu_lazy() > security: Move call_rcu() to call_rcu_lazy() > net/core: Move call_rcu() to call_rcu_lazy() > lib: Move call_rcu() to call_rcu_lazy() > kfree/rcu: Queue RCU work via queue_rcu_work_lazy() > i915: Move call_rcu() to call_rcu_lazy() > rcu/kfree: remove useless monitor_todo flag > rcu/kfree: Fix kfree_rcu_shrink_count() return value > DEBUG: Toggle rcu_lazy and tune at runtime > > block/blk-ioc.c | 2 +- > drivers/gpu/drm/i915/gem/i915_gem_object.c | 2 +- > fs/dcache.c | 4 +- > fs/eventpoll.c | 2 +- > fs/file_table.c | 3 +- > fs/inode.c | 2 +- > include/linux/rcupdate.h | 6 + > include/linux/sched/sysctl.h | 4 + > include/linux/workqueue.h | 1 + > kernel/cred.c | 2 +- > kernel/exit.c | 2 +- > kernel/pid.c | 2 +- > kernel/rcu/Kconfig | 8 ++ > kernel/rcu/Makefile | 1 + > kernel/rcu/lazy.c | 153 +++++++++++++++++++++ > kernel/rcu/rcu.h | 5 + > kernel/rcu/tree.c | 28 ++-- > kernel/sysctl.c | 23 ++++ > kernel/time/posix-timers.c | 2 +- > kernel/workqueue.c | 25 ++++ > lib/radix-tree.c | 2 +- > lib/xarray.c | 2 +- > net/core/dst.c | 2 +- > security/security.c | 2 +- > security/selinux/avc.c | 4 +- > 25 files changed, 255 insertions(+), 34 deletions(-) > create mode 100644 kernel/rcu/lazy.c > > -- > 2.36.0.550.gb090851708-goog >
On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote: > On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote: > > Hello! > > Please find the proof of concept version of call_rcu_lazy() attached. This > > gives a lot of savings when the CPUs are relatively idle. Huge thanks to > > Rushikesh Kadam from Intel for investigating it with me. > > Just a status update, we're reworking this code to use the bypass lists which > take care of a lot of corner cases. I've had to make changes to the rcuog > thread code to handle wakeups a bit differently and such. Initial testing is > looking good. Currently working on the early-flushing of the lazy CBs based > on memory pressure. > > Should hopefully be sending v2 soon! Looking forward to seeing it! Thanx, Paul > thanks, > > - Joel > > > > > > > > Some numbers below: > > > > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform. > > The observation is that due to a 'trickle down' effect of RCU callbacks, the > > system is very lightly loaded but constantly running few RCU callbacks very > > often. This confuses the power management hardware that the system is active, > > when it is in fact idle. > > > > For example, when ChromeOS screen is off and user is not doing anything on the > > system, we can see big power savings. > > Before: > > Pk%pc10 = 72.13 > > PkgWatt = 0.58 > > CorWatt = 0.04 > > > > After: > > Pk%pc10 = 81.28 > > PkgWatt = 0.41 > > CorWatt = 0.03 > > > > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we > > can see that the display pipeline is constantly doing RCU callback queuing due > > to open/close of file descriptors associated with graphics buffers. This is > > attributed to the file_free_rcu() path which this patch series also touches. > > > > This patch series adds a simple but effective, and lockless implementation of > > RCU callback batching. 
On memory pressure, timeout or queue growing too big, we > > initiate a flush of one or more per-CPU lists. > > > > Similar results can be achieved by increasing jiffies_till_first_fqs, however > > that also has the effect of slowing down RCU. In particular, I saw a huge slowdown > > of the function graph tracer when increasing it. > > > > One drawback of this series is that if another frequent RCU callback creeps up in > > the future that is not lazy, then it will again hurt the power. However, I > > believe identifying and fixing those is a more reasonable approach than slowing > > RCU down for the whole system. > > > > NOTE: A debug patch is added in the series to toggle /proc/sys/kernel/rcu_lazy > > at runtime to turn it on or off globally. It defaults to on. Further, please > > use the sysctls in lazy.c for further tuning of parameters that affect the > > flushing. > > > > Disclaimer 1: Don't boot your personal system on it yet anticipating power > > savings, as TREE07 still causes RCU stalls and I am looking more into that, but > > I believe this series should be good for general testing. > > > > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like > > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds > > of review and agreements. 
> > > > Joel Fernandes (Google) (14): > > rcu: Add a lock-less lazy RCU implementation > > workqueue: Add a lazy version of queue_rcu_work() > > block/blk-ioc: Move call_rcu() to call_rcu_lazy() > > cred: Move call_rcu() to call_rcu_lazy() > > fs: Move call_rcu() to call_rcu_lazy() in some paths > > kernel: Move various core kernel usages to call_rcu_lazy() > > security: Move call_rcu() to call_rcu_lazy() > > net/core: Move call_rcu() to call_rcu_lazy() > > lib: Move call_rcu() to call_rcu_lazy() > > kfree/rcu: Queue RCU work via queue_rcu_work_lazy() > > i915: Move call_rcu() to call_rcu_lazy() > > rcu/kfree: remove useless monitor_todo flag > > rcu/kfree: Fix kfree_rcu_shrink_count() return value > > DEBUG: Toggle rcu_lazy and tune at runtime > > > > block/blk-ioc.c | 2 +- > > drivers/gpu/drm/i915/gem/i915_gem_object.c | 2 +- > > fs/dcache.c | 4 +- > > fs/eventpoll.c | 2 +- > > fs/file_table.c | 3 +- > > fs/inode.c | 2 +- > > include/linux/rcupdate.h | 6 + > > include/linux/sched/sysctl.h | 4 + > > include/linux/workqueue.h | 1 + > > kernel/cred.c | 2 +- > > kernel/exit.c | 2 +- > > kernel/pid.c | 2 +- > > kernel/rcu/Kconfig | 8 ++ > > kernel/rcu/Makefile | 1 + > > kernel/rcu/lazy.c | 153 +++++++++++++++++++++ > > kernel/rcu/rcu.h | 5 + > > kernel/rcu/tree.c | 28 ++-- > > kernel/sysctl.c | 23 ++++ > > kernel/time/posix-timers.c | 2 +- > > kernel/workqueue.c | 25 ++++ > > lib/radix-tree.c | 2 +- > > lib/xarray.c | 2 +- > > net/core/dst.c | 2 +- > > security/security.c | 2 +- > > security/selinux/avc.c | 4 +- > > 25 files changed, 255 insertions(+), 34 deletions(-) > > create mode 100644 kernel/rcu/lazy.c > > > > -- > > 2.36.0.550.gb090851708-goog > >
On Mon, Jun 13, 2022 at 6:48 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote: > > On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote: > > > Hello! > > > Please find the proof of concept version of call_rcu_lazy() attached. This > > > gives a lot of savings when the CPUs are relatively idle. Huge thanks to > > > Rushikesh Kadam from Intel for investigating it with me. > > > > Just a status update, we're reworking this code to use the bypass lists which > > take care of a lot of corner cases. I've had to make changes to the rcuog > > thread code to handle wakeups a bit differently and such. Initial testing is > > looking good. Currently working on the early-flushing of the lazy CBs based > > on memory pressure. > > > > Should hopefully be sending v2 soon! > > Looking forward to seeing it! Looks like rebasing on Paul's dev branch is needed for me to even apply and test on our downstream kernels. Why, you ask? Because our downstream folks are indeed using upstream 5.19-rc3. =) Finally the rebase succeeded and now I can test lazy bypass downstream before sending out the next RFC... - Joel
On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote: > > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote: > > > > > > > > > Never mind. I port it into 5.10 > > > > > > > > > > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for > > > > > > > > 5.10 , although that does not have the kfree changes, everything else is > > > > > > > > ditto. > > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4 > > > > > > > > > > > > > > > No problem. kfree_rcu two patches are not so important in this series. > > > > > > > So i have backported them into my 5.10 kernel because the latest kernel > > > > > > > is not so easy to up and run on my device :) > > > > > > > > > > > > Actually I was going to write here, apparently some tests are showing > > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is > > > > > > good to drop those for your initial testing! > > > > > > > > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not > > > > > so important for kfree_rcu() because we do batch requests there anyway. > > > > > One thing that i would like to improve in kfree_rcu() is a better utilization > > > > > of page slots. > > > > > > > > > > I will share my results either tomorrow or on Monday. I hope that is fine. > > > > > > > > > > > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test > > > > case i have checked was a "static image" use case. Condition is: screen ON with > > > > disabled all connectivity. > > > > > > > > 1. 
> > > > First data i took is how many wakeups cause an RCU subsystem during this test case > > > > when everything is pretty idling. Duration is 360 seconds: > > > > > > > > <snip> > > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu > > > > > > Nice! Do you mind sharing this script? I was just talking to Rushikesh > > > that we want something like this during testing. Appreciate it. Also, > > > if we dump timer wakeup reasons/callbacks that would also be awesome. > > > > > Please find in attachment. I wrote it once upon a time and make use of > > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c > > so just compile it. > > > > How to use it: > > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"' > > 2. ./perf script -i ./perf.data > foo.script > > 3. ./perf_script_parser ./foo.script > > Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also > use "perf sched record" and "perf sched report --sort latency" to get wakeup > latencies. > > > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can > > > share that with you on request as well. That is probably not in a > > > shape for mainline though (Makefile missing and such). > > > > > Yep, please share! > > Sure, check out my bcc repo from here: > https://github.com/joelagnel/bcc Kind of random, but this is why I love sharing my tools with others. I actually forgot that I checked in the code for my rcutop here, and this email is how I found it, lol :) Did you happen to try it out too btw? :) Thanks, Joel > > Build this project, then cd libbpf-tools and run make. This should produce a > static binary 'rcutop' which you can push to Android. You have to build for > ARM which bcc should have instructions for. I have also included the rcuptop > diff at the end of this file for reference. > > > > > 2. > > > > Please find in attachment two power plots. 
The same test case. One is related to a > > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference > > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid > > > > what is also important, thus it proofs the concept. Please note it might be more power > > > > efficient for other arches and platforms. Because of different HW design that is related > > > > to C-states of CPU and energy that is needed to in/out of those deep power states. > > > > > > Nice! I wonder if you still have other frequent callbacks on your > > > system that are getting queued during the tests. Could you dump the > > > rcu_callbacks trace event and see if you have any CBs frequently > > > called that the series did not address? > > > > > I have pretty much like this: > > <snip> > > rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 > > rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 > > rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 > > rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 > > 
rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15 > > rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10 > > <snip> > > > > so almost everything is batched. > > Nice, glad to know this is happening even without the kfree_rcu() changes. > > > > Also, one more thing I was curious about is - do you see savings when > > > you pin the rcu threads to the LITTLE CPUs of the system? The theory > > > being, not disturbing the BIG CPUs which are more power hungry may let > > > them go into a deeper idle state and save power (due to leakage > > > current and so forth). > > > > > I did some experimenting to pin nocbs to a little cluster. For idle use > > cases i did not see any power gain. For heavy one i see that "big" CPUs > > are also invoking and busy with it quite often. Probably i should think > > of some use case where i can detect the power difference. If you have > > something please let me know. > > Yeah, probably screen off + audio playback might be a good one, because it > lightly loads the CPUs. > > > > > So a front-lazy-batching is something worth to have, IMHO :) > > > > > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with > > > it, we can add your data to the LPC slides as well about your > > > investigation (with attribution to you). > > > > > No problem, since we will give a talk on LPC, more data we have more > > convincing we are :) > > I forget, you did mention you are Ok with presenting with us right? It would > be great if you present your data when we come to Android, if you are OK with > it. I'll start a common slide deck soon and share so you, Rushikesh and me > can add slides to it and present together. 
> > thanks, > > - Joel > > ---8<----------------------- > > From: Joel Fernandes <joelaf@google.com> > Subject: [PATCH] rcutop > > Signed-off-by: Joel Fernandes <joelaf@google.com> > --- > libbpf-tools/Makefile | 1 + > libbpf-tools/rcutop.bpf.c | 56 ++++++++ > libbpf-tools/rcutop.c | 288 ++++++++++++++++++++++++++++++++++++++ > libbpf-tools/rcutop.h | 8 ++ > 4 files changed, 353 insertions(+) > create mode 100644 libbpf-tools/rcutop.bpf.c > create mode 100644 libbpf-tools/rcutop.c > create mode 100644 libbpf-tools/rcutop.h > > diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile > index e60ec409..0d4cdff2 100644 > --- a/libbpf-tools/Makefile > +++ b/libbpf-tools/Makefile > @@ -42,6 +42,7 @@ APPS = \ > klockstat \ > ksnoop \ > llcstat \ > + rcutop \ > mountsnoop \ > numamove \ > offcputime \ > diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c > new file mode 100644 > index 00000000..8287bbe2 > --- /dev/null > +++ b/libbpf-tools/rcutop.bpf.c > @@ -0,0 +1,56 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +/* Copyright (c) 2021 Hengqi Chen */ > +#include <vmlinux.h> > +#include <bpf/bpf_helpers.h> > +#include <bpf/bpf_core_read.h> > +#include <bpf/bpf_tracing.h> > +#include "rcutop.h" > +#include "maps.bpf.h" > + > +#define MAX_ENTRIES 10240 > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_queued SEC(".maps"); > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_executed SEC(".maps"); > + > +SEC("tracepoint/rcu/rcu_callback") > +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx) > +{ > + void *key = ctx->func; > + int *val = NULL; > + static const int zero; > + > + val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > 
+SEC("tracepoint/rcu/rcu_invoke_callback") > +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx) > +{ > + void *key = ctx->func; > + int *val; > + int zero = 0; > + > + val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +char LICENSE[] SEC("license") = "Dual BSD/GPL"; > diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c > new file mode 100644 > index 00000000..35795875 > --- /dev/null > +++ b/libbpf-tools/rcutop.c > @@ -0,0 +1,288 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > + > +/* > + * rcutop > + * Copyright (c) 2022 Joel Fernandes > + * > + * 05-May-2022 Joel Fernandes Created this. > + */ > +#include <argp.h> > +#include <errno.h> > +#include <signal.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <time.h> > +#include <unistd.h> > + > +#include <bpf/libbpf.h> > +#include <bpf/bpf.h> > +#include "rcutop.h" > +#include "rcutop.skel.h" > +#include "btf_helpers.h" > +#include "trace_helpers.h" > + > +#define warn(...) 
fprintf(stderr, __VA_ARGS__) > +#define OUTPUT_ROWS_LIMIT 10240 > + > +static volatile sig_atomic_t exiting = 0; > + > +static bool clear_screen = true; > +static int output_rows = 20; > +static int interval = 1; > +static int count = 99999999; > +static bool verbose = false; > + > +const char *argp_program_version = "rcutop 0.1"; > +const char *argp_program_bug_address = > +"https://github.com/iovisor/bcc/tree/master/libbpf-tools"; > +const char argp_program_doc[] = > +"Show RCU callback queuing and execution stats.\n" > +"\n" > +"USAGE: rcutop [-h] [interval] [count]\n" > +"\n" > +"EXAMPLES:\n" > +" rcutop # rcu activity top, refresh every 1s\n" > +" rcutop 5 10 # 5s summaries, 10 times\n"; > + > +static const struct argp_option opts[] = { > + { "noclear", 'C', NULL, 0, "Don't clear the screen" }, > + { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" }, > + { "verbose", 'v', NULL, 0, "Verbose debug output" }, > + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, > + {}, > +}; > + > +static error_t parse_arg(int key, char *arg, struct argp_state *state) > +{ > + long rows; > + static int pos_args; > + > + switch (key) { > + case 'C': > + clear_screen = false; > + break; > + case 'v': > + verbose = true; > + break; > + case 'h': > + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); > + break; > + case 'r': > + errno = 0; > + rows = strtol(arg, NULL, 10); > + if (errno || rows <= 0) { > + warn("invalid rows: %s\n", arg); > + argp_usage(state); > + } > + output_rows = rows; > + if (output_rows > OUTPUT_ROWS_LIMIT) > + output_rows = OUTPUT_ROWS_LIMIT; > + break; > + case ARGP_KEY_ARG: > + errno = 0; > + if (pos_args == 0) { > + interval = strtol(arg, NULL, 10); > + if (errno || interval <= 0) { > + warn("invalid interval\n"); > + argp_usage(state); > + } > + } else if (pos_args == 1) { > + count = strtol(arg, NULL, 10); > + if (errno || count <= 0) { > + warn("invalid count\n"); > + argp_usage(state); > + } > + } else { > + warn("unrecognized 
positional argument: %s\n", arg); > + argp_usage(state); > + } > + pos_args++; > + break; > + default: > + return ARGP_ERR_UNKNOWN; > + } > + return 0; > +} > + > +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) > +{ > + if (level == LIBBPF_DEBUG && !verbose) > + return 0; > + return vfprintf(stderr, format, args); > +} > + > +static void sig_int(int signo) > +{ > + exiting = 1; > +} > + > +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache, > + struct rcutop_bpf *obj) > +{ > + void *key, **prev_key = NULL; > + int n, err = 0; > + int qfd = bpf_map__fd(obj->maps.cbs_queued); > + int efd = bpf_map__fd(obj->maps.cbs_executed); > + const struct ksym *ksym; > + FILE *f; > + time_t t; > + struct tm *tm; > + char ts[16], buf[256]; > + > + f = fopen("/proc/loadavg", "r"); > + if (f) { > + time(&t); > + tm = localtime(&t); > + strftime(ts, sizeof(ts), "%H:%M:%S", tm); > + memset(buf, 0, sizeof(buf)); > + n = fread(buf, 1, sizeof(buf), f); > + if (n) > + printf("%8s loadavg: %s\n", ts, buf); > + fclose(f); > + } > + > + printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed"); > + > + while (1) { > + int qcount = 0, ecount = 0; > + > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + > + err = bpf_map_lookup_elem(qfd, &key, &qcount); > + if (err) { > + warn("bpf_map_lookup_elem failed: %s\n", strerror(errno)); > + return err; > + } > + prev_key = &key; > + > + bpf_map_lookup_elem(efd, &key, &ecount); > + > + ksym = ksyms__map_addr(ksyms, (unsigned long)key); > + printf("%-32s %-6d %-6d\n", > + ksym ? 
ksym->name : "Unknown", > + qcount, ecount); > + } > + printf("\n"); > + prev_key = NULL; > + while (1) { > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + err = bpf_map_delete_elem(qfd, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + continue; > + } > + warn("bpf_map_delete_elem failed: %s\n", strerror(errno)); > + return err; > + } > + > + bpf_map_delete_elem(efd, &key); > + prev_key = &key; > + } > + > + return err; > +} > + > +int main(int argc, char **argv) > +{ > + LIBBPF_OPTS(bpf_object_open_opts, open_opts); > + static const struct argp argp = { > + .options = opts, > + .parser = parse_arg, > + .doc = argp_program_doc, > + }; > + struct rcutop_bpf *obj; > + int err; > + struct syms_cache *syms_cache = NULL; > + struct ksyms *ksyms = NULL; > + > + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); > + if (err) > + return err; > + > + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); > + libbpf_set_print(libbpf_print_fn); > + > + err = ensure_core_btf(&open_opts); > + if (err) { > + fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); > + return 1; > + } > + > + obj = rcutop_bpf__open_opts(&open_opts); > + if (!obj) { > + warn("failed to open BPF object\n"); > + return 1; > + } > + > + err = rcutop_bpf__load(obj); > + if (err) { > + warn("failed to load BPF object: %d\n", err); > + goto cleanup; > + } > + > + err = rcutop_bpf__attach(obj); > + if (err) { > + warn("failed to attach BPF programs: %d\n", err); > + goto cleanup; > + } > + > + ksyms = ksyms__load(); > + if (!ksyms) { > + fprintf(stderr, "failed to load kallsyms\n"); > + goto cleanup; > + } > + > + syms_cache = syms_cache__new(0); > + if (!syms_cache) { > + fprintf(stderr, "failed to create syms_cache\n"); > + goto cleanup; > + } > + > + if (signal(SIGINT, sig_int) == SIG_ERR) { > + warn("can't 
set signal handler: %s\n", strerror(errno)); > + err = 1; > + goto cleanup; > + } > + > + while (1) { > + sleep(interval); > + > + if (clear_screen) { > + err = system("clear"); > + if (err) > + goto cleanup; > + } > + > + err = print_stat(ksyms, syms_cache, obj); > + if (err) > + goto cleanup; > + > + count--; > + if (exiting || !count) > + goto cleanup; > + } > + > +cleanup: > + rcutop_bpf__destroy(obj); > + cleanup_core_btf(&open_opts); > + > + return err != 0; > +} > diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h > new file mode 100644 > index 00000000..cb2a3557 > --- /dev/null > +++ b/libbpf-tools/rcutop.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +#ifndef __RCUTOP_H > +#define __RCUTOP_H > + > +#define PATH_MAX 4096 > +#define TASK_COMM_LEN 16 > + > +#endif /* __RCUTOP_H */ > -- > 2.36.0.550.gb090851708-goog > On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote: > > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote: > > > > > > > > > Never mind. I port it into 5.10 > > > > > > > > > > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for > > > > > > > > 5.10 , although that does not have the kfree changes, everything else is > > > > > > > > ditto. > > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4 > > > > > > > > > > > > > > > No problem. kfree_rcu two patches are not so important in this series. 
> > > > > > > So i have backported them into my 5.10 kernel because the latest kernel > > > > > > > is not so easy to up and run on my device :) > > > > > > > > > > > > Actually I was going to write here, apparently some tests are showing > > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is > > > > > > good to drop those for your initial testing! > > > > > > > > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not > > > > > so important for kfree_rcu() because we do batch requests there anyway. > > > > > One thing that i would like to improve in kfree_rcu() is a better utilization > > > > > of page slots. > > > > > > > > > > I will share my results either tomorrow or on Monday. I hope that is fine. > > > > > > > > > > > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test > > > > case i have checked was a "static image" use case. Condition is: screen ON with > > > > disabled all connectivity. > > > > > > > > 1. > > > > First data i took is how many wakeups cause an RCU subsystem during this test case > > > > when everything is pretty idling. Duration is 360 seconds: > > > > > > > > <snip> > > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu > > > > > > Nice! Do you mind sharing this script? I was just talking to Rushikesh > > > that we want something like this during testing. Appreciate it. Also, > > > if we dump timer wakeup reasons/callbacks that would also be awesome. > > > > > Please find in attachment. I wrote it once upon a time and make use of > > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c > > so just compile it. > > > > How to use it: > > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"' > > 2. ./perf script -i ./perf.data > foo.script > > 3. ./perf_script_parser ./foo.script > > Thanks a lot for sharing this. 
I think it will be quite useful. FWIW, I also > use "perf sched record" and "perf sched report --sort latency" to get wakeup > latencies. > > > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can > > > share that with you on request as well. That is probably not in a > > > shape for mainline though (Makefile missing and such). > > > > > Yep, please share! > > Sure, check out my bcc repo from here: > https://github.com/joelagnel/bcc > > Build this project, then cd libbpf-tools and run make. This should produce a > static binary 'rcutop' which you can push to Android. You have to build for > ARM which bcc should have instructions for. I have also included the rcuptop > diff at the end of this file for reference. > > > > > 2. > > > > Please find in attachment two power plots. The same test case. One is related to a > > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference > > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid > > > > what is also important, thus it proofs the concept. Please note it might be more power > > > > efficient for other arches and platforms. Because of different HW design that is related > > > > to C-states of CPU and energy that is needed to in/out of those deep power states. > > > > > > Nice! I wonder if you still have other frequent callbacks on your > > > system that are getting queued during the tests. Could you dump the > > > rcu_callbacks trace event and see if you have any CBs frequently > > > called that the series did not address? 
> > > > > It looks pretty much like this: > > <snip> > > rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 > > rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 > > rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 > > rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 > > rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15 > > rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10 > > <snip> > > > > so almost everything is batched. > > Nice, glad to know this is happening even without the kfree_rcu() changes. > > > > Also, one more thing I was curious about: do you see savings when > > > you pin the rcu threads to the LITTLE CPUs of the system? 
The theory > > > being, not disturbing the BIG CPUs, which are more power hungry, may let > > > them go into a deeper idle state and save power (due to leakage > > > current and so forth). > > > > > I did some experimenting to pin nocbs to a little cluster. For idle use > > cases I did not see any power gain. For a heavy one I see that the "big" CPUs > > are also invoking callbacks and busy with them quite often. Probably I should think > > of some use case where I can detect the power difference. If you have > > something please let me know. > > Yeah, probably screen off + audio playback might be a good one, because it > lightly loads the CPUs. > > > > > So front-lazy-batching is something worth having, IMHO :) > > > > > > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with > > > it, we can also add your data about your investigation to the LPC > > > slides (with attribution to you). > > > > > No problem, since we will give a talk at LPC, the more data we have, the more > > convincing we are :) > > I forget, you did mention you are OK with presenting with us, right? It would > be great if you present your data when we come to Android, if you are OK with > it. I'll start a common slide deck soon and share it so you, Rushikesh, and I > can add slides to it and present together. 
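[Editor's note: a rough sketch of the nocb-thread pinning experiment discussed above. The assumption that CPUs 0-3 form the LITTLE cluster is illustrative (adjust for your SoC); pgrep and taskset are the standard procps/util-linux tools, not something from this thread.]

```shell
# Assume CPUs 0-3 are the LITTLE cluster (an assumption; adjust per SoC).
little_cpus="0 1 2 3"
# Build the hex affinity mask taskset expects: bit N set => CPU N allowed.
mask=0
for c in $little_cpus; do
    mask=$((mask | (1 << c)))
done
printf 'affinity mask: %x\n' "$mask"   # CPUs 0-3 -> mask f
# On a live system (as root) you would then pin the rcuo* kthreads:
#   for pid in $(pgrep '^rcuo'); do taskset -p "$mask" "$pid"; done
```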
> > thanks, > > - Joel > > ---8<----------------------- > > From: Joel Fernandes <joelaf@google.com> > Subject: [PATCH] rcutop > > Signed-off-by: Joel Fernandes <joelaf@google.com> > --- > libbpf-tools/Makefile | 1 + > libbpf-tools/rcutop.bpf.c | 56 ++++++++ > libbpf-tools/rcutop.c | 288 ++++++++++++++++++++++++++++++++++++++ > libbpf-tools/rcutop.h | 8 ++ > 4 files changed, 353 insertions(+) > create mode 100644 libbpf-tools/rcutop.bpf.c > create mode 100644 libbpf-tools/rcutop.c > create mode 100644 libbpf-tools/rcutop.h > > diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile > index e60ec409..0d4cdff2 100644 > --- a/libbpf-tools/Makefile > +++ b/libbpf-tools/Makefile > @@ -42,6 +42,7 @@ APPS = \ > klockstat \ > ksnoop \ > llcstat \ > + rcutop \ > mountsnoop \ > numamove \ > offcputime \ > diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c > new file mode 100644 > index 00000000..8287bbe2 > --- /dev/null > +++ b/libbpf-tools/rcutop.bpf.c > @@ -0,0 +1,56 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +/* Copyright (c) 2021 Hengqi Chen */ > +#include <vmlinux.h> > +#include <bpf/bpf_helpers.h> > +#include <bpf/bpf_core_read.h> > +#include <bpf/bpf_tracing.h> > +#include "rcutop.h" > +#include "maps.bpf.h" > + > +#define MAX_ENTRIES 10240 > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_queued SEC(".maps"); > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_executed SEC(".maps"); > + > +SEC("tracepoint/rcu/rcu_callback") > +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx) > +{ > + void *key = ctx->func; > + int *val = NULL; > + static const int zero; > + > + val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > 
+SEC("tracepoint/rcu/rcu_invoke_callback") > +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx) > +{ > + void *key = ctx->func; > + int *val; > + int zero = 0; > + > + val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +char LICENSE[] SEC("license") = "Dual BSD/GPL"; > diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c > new file mode 100644 > index 00000000..35795875 > --- /dev/null > +++ b/libbpf-tools/rcutop.c > @@ -0,0 +1,288 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > + > +/* > + * rcutop > + * Copyright (c) 2022 Joel Fernandes > + * > + * 05-May-2022 Joel Fernandes Created this. > + */ > +#include <argp.h> > +#include <errno.h> > +#include <signal.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <time.h> > +#include <unistd.h> > + > +#include <bpf/libbpf.h> > +#include <bpf/bpf.h> > +#include "rcutop.h" > +#include "rcutop.skel.h" > +#include "btf_helpers.h" > +#include "trace_helpers.h" > + > +#define warn(...) 
fprintf(stderr, __VA_ARGS__) > +#define OUTPUT_ROWS_LIMIT 10240 > + > +static volatile sig_atomic_t exiting = 0; > + > +static bool clear_screen = true; > +static int output_rows = 20; > +static int interval = 1; > +static int count = 99999999; > +static bool verbose = false; > + > +const char *argp_program_version = "rcutop 0.1"; > +const char *argp_program_bug_address = > +"https://github.com/iovisor/bcc/tree/master/libbpf-tools"; > +const char argp_program_doc[] = > +"Show RCU callback queuing and execution stats.\n" > +"\n" > +"USAGE: rcutop [-h] [interval] [count]\n" > +"\n" > +"EXAMPLES:\n" > +" rcutop # rcu activity top, refresh every 1s\n" > +" rcutop 5 10 # 5s summaries, 10 times\n"; > + > +static const struct argp_option opts[] = { > + { "noclear", 'C', NULL, 0, "Don't clear the screen" }, > + { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" }, > + { "verbose", 'v', NULL, 0, "Verbose debug output" }, > + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, > + {}, > +}; > + > +static error_t parse_arg(int key, char *arg, struct argp_state *state) > +{ > + long rows; > + static int pos_args; > + > + switch (key) { > + case 'C': > + clear_screen = false; > + break; > + case 'v': > + verbose = true; > + break; > + case 'h': > + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); > + break; > + case 'r': > + errno = 0; > + rows = strtol(arg, NULL, 10); > + if (errno || rows <= 0) { > + warn("invalid rows: %s\n", arg); > + argp_usage(state); > + } > + output_rows = rows; > + if (output_rows > OUTPUT_ROWS_LIMIT) > + output_rows = OUTPUT_ROWS_LIMIT; > + break; > + case ARGP_KEY_ARG: > + errno = 0; > + if (pos_args == 0) { > + interval = strtol(arg, NULL, 10); > + if (errno || interval <= 0) { > + warn("invalid interval\n"); > + argp_usage(state); > + } > + } else if (pos_args == 1) { > + count = strtol(arg, NULL, 10); > + if (errno || count <= 0) { > + warn("invalid count\n"); > + argp_usage(state); > + } > + } else { > + warn("unrecognized 
positional argument: %s\n", arg); > + argp_usage(state); > + } > + pos_args++; > + break; > + default: > + return ARGP_ERR_UNKNOWN; > + } > + return 0; > +} > + > +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) > +{ > + if (level == LIBBPF_DEBUG && !verbose) > + return 0; > + return vfprintf(stderr, format, args); > +} > + > +static void sig_int(int signo) > +{ > + exiting = 1; > +} > + > +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache, > + struct rcutop_bpf *obj) > +{ > + void *key, **prev_key = NULL; > + int n, err = 0; > + int qfd = bpf_map__fd(obj->maps.cbs_queued); > + int efd = bpf_map__fd(obj->maps.cbs_executed); > + const struct ksym *ksym; > + FILE *f; > + time_t t; > + struct tm *tm; > + char ts[16], buf[256]; > + > + f = fopen("/proc/loadavg", "r"); > + if (f) { > + time(&t); > + tm = localtime(&t); > + strftime(ts, sizeof(ts), "%H:%M:%S", tm); > + memset(buf, 0, sizeof(buf)); > + n = fread(buf, 1, sizeof(buf), f); > + if (n) > + printf("%8s loadavg: %s\n", ts, buf); > + fclose(f); > + } > + > + printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed"); > + > + while (1) { > + int qcount = 0, ecount = 0; > + > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + > + err = bpf_map_lookup_elem(qfd, &key, &qcount); > + if (err) { > + warn("bpf_map_lookup_elem failed: %s\n", strerror(errno)); > + return err; > + } > + prev_key = &key; > + > + bpf_map_lookup_elem(efd, &key, &ecount); > + > + ksym = ksyms__map_addr(ksyms, (unsigned long)key); > + printf("%-32s %-6d %-6d\n", > + ksym ? 
ksym->name : "Unknown", > + qcount, ecount); > + } > + printf("\n"); > + prev_key = NULL; > + while (1) { > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + err = bpf_map_delete_elem(qfd, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + continue; > + } > + warn("bpf_map_delete_elem failed: %s\n", strerror(errno)); > + return err; > + } > + > + bpf_map_delete_elem(efd, &key); > + prev_key = &key; > + } > + > + return err; > +} > + > +int main(int argc, char **argv) > +{ > + LIBBPF_OPTS(bpf_object_open_opts, open_opts); > + static const struct argp argp = { > + .options = opts, > + .parser = parse_arg, > + .doc = argp_program_doc, > + }; > + struct rcutop_bpf *obj; > + int err; > + struct syms_cache *syms_cache = NULL; > + struct ksyms *ksyms = NULL; > + > + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); > + if (err) > + return err; > + > + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); > + libbpf_set_print(libbpf_print_fn); > + > + err = ensure_core_btf(&open_opts); > + if (err) { > + fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); > + return 1; > + } > + > + obj = rcutop_bpf__open_opts(&open_opts); > + if (!obj) { > + warn("failed to open BPF object\n"); > + return 1; > + } > + > + err = rcutop_bpf__load(obj); > + if (err) { > + warn("failed to load BPF object: %d\n", err); > + goto cleanup; > + } > + > + err = rcutop_bpf__attach(obj); > + if (err) { > + warn("failed to attach BPF programs: %d\n", err); > + goto cleanup; > + } > + > + ksyms = ksyms__load(); > + if (!ksyms) { > + fprintf(stderr, "failed to load kallsyms\n"); > + goto cleanup; > + } > + > + syms_cache = syms_cache__new(0); > + if (!syms_cache) { > + fprintf(stderr, "failed to create syms_cache\n"); > + goto cleanup; > + } > + > + if (signal(SIGINT, sig_int) == SIG_ERR) { > + warn("can't 
set signal handler: %s\n", strerror(errno)); > + err = 1; > + goto cleanup; > + } > + > + while (1) { > + sleep(interval); > + > + if (clear_screen) { > + err = system("clear"); > + if (err) > + goto cleanup; > + } > + > + err = print_stat(ksyms, syms_cache, obj); > + if (err) > + goto cleanup; > + > + count--; > + if (exiting || !count) > + goto cleanup; > + } > + > +cleanup: > + rcutop_bpf__destroy(obj); > + cleanup_core_btf(&open_opts); > + > + return err != 0; > +} > diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h > new file mode 100644 > index 00000000..cb2a3557 > --- /dev/null > +++ b/libbpf-tools/rcutop.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +#ifndef __RCUTOP_H > +#define __RCUTOP_H > + > +#define PATH_MAX 4096 > +#define TASK_COMM_LEN 16 > + > +#endif /* __RCUTOP_H */ > -- > 2.36.0.550.gb090851708-goog >
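[Editor's note, not part of the patch: a quick shell mock-up of the table layout print_stat() above produces, built from its actual printf format strings, together with the invocations from the argp usage text. The callback name and counts in the sample row are made up.]

```shell
# rcutop invocation (from the argp doc string in the patch):
#   ./rcutop        # rcu activity top, refresh every 1s
#   ./rcutop 5 10   # 5s summaries, 10 times
# Each interval prints a table using these formats from print_stat():
printf '%-32s %-6s %-6s\n' "Callback" "Queued" "Executed"
printf '%-32s %-6d %-6d\n' "file_free_rcu" 128 126   # sample row, made-up counts
```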