[v5,00/18] Implement call_rcu_lazy() and miscellaneous fixes

Message ID: 20220901221720.1105021-1-joel@joelfernandes.org

Message

Joel Fernandes Sept. 1, 2022, 10:17 p.m. UTC
Here is v5 of call_rcu_lazy(), based on the latest RCU -dev branch. The main changes are:
- moved the length field into rcu_data (Frederic's suggestion)
- added new traces to aid debugging and testing.
	- the new trace patch (along with the rcuscale and rcutorture tests)
	  gives confidence that the patches work well. The series is also
	  tested on real ChromeOS hardware, and boot time looks good even
	  though lazy callbacks are being queued (i.e. the lazy ones do not
	  affect the synchronous non-lazy ones that affect boot time).
- rewrote some parts of the core patch.
- for rcutop, please apply the diff in the following link to the BCC repo:
  https://lore.kernel.org/r/Yn+79v4q+1c9lXdc@google.com
  Then, cd libbpf-tools/ and run make to build the rcutop static binary.
  (If you need an x86 binary, ping me and I'll email you).
  In the future, I will attempt to make rcutop buildable within the kernel
  repo. This is already done for another tool (see tools/bpf/runqslower),
  so it is doable.

The two mm patches are the ones Vlastimil pulled into slab-next. I included
them in this series so that the tracing patch builds.
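
Most of the per-subsystem patches listed below are one-line conversions of
the following shape (an illustrative sketch; call_rcu_lazy() is assumed to
keep call_rcu()'s two-argument signature, per the include/linux/rcupdate.h
change in the diffstat, and the fs/inode.c call site is just an example):

	/* Before: immediate (non-lazy) queuing of the inode-free callback. */
	call_rcu(&inode->i_rcu, i_callback);

	/* After: lazily batched queuing; same arguments, same callback. */
	call_rcu_lazy(&inode->i_rcu, i_callback);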

Previous series was posted here:
 https://lore.kernel.org/all/20220713213237.1596225-1-joel@joelfernandes.org/

Linked below [1] is some power data I collected with Turbostat on an x86
ChromeOS ADL machine. The numbers are not based on -next, but rather on a
5.19 kernel, as that is what boots on my ChromeOS machine.

The data was collected by running Turbostat as:
turbostat -S -s PkgWatt,CorWatt --interval 5
PkgWatt - package power in Watts, summarized over each 5-second interval.
CorWatt - core power in Watts, summarized over each 5-second interval.

[1] https://lore.kernel.org/r/b4145008-6840-fc69-a6d6-e38bc218009d@joelfernandes.org

Joel Fernandes (Google) (15):
rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask
rcu: Fix late wakeup when flush of bypass cblist happens
rcu: Move trace_rcu_callback() before bypassing
rcu: Introduce call_rcu_lazy() API implementation
rcu: Add per-CB tracing for queuing, flush and invocation.
rcuscale: Add laziness and kfree tests
rcutorture: Add test code for call_rcu_lazy()
fs: Move call_rcu() to call_rcu_lazy() in some paths
cred: Move call_rcu() to call_rcu_lazy()
security: Move call_rcu() to call_rcu_lazy()
net/core: Move call_rcu() to call_rcu_lazy()
kernel: Move various core kernel usages to call_rcu_lazy()
lib: Move call_rcu() to call_rcu_lazy()
i915: Move call_rcu() to call_rcu_lazy()
fork: Move thread_stack_free_rcu() to call_rcu_lazy()

Vineeth Pillai (1):
rcu: shrinker for lazy rcu

Vlastimil Babka (2):
mm/slub: perform free consistency checks before call_rcu
mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head

drivers/gpu/drm/i915/gem/i915_gem_object.c    |   2 +-
fs/dcache.c                                   |   4 +-
fs/eventpoll.c                                |   2 +-
fs/file_table.c                               |   2 +-
fs/inode.c                                    |   2 +-
include/linux/rcupdate.h                      |   6 +
include/linux/types.h                         |  44 +++
include/trace/events/rcu.h                    |  69 ++++-
kernel/cred.c                                 |   2 +-
kernel/exit.c                                 |   2 +-
kernel/fork.c                                 |   6 +-
kernel/pid.c                                  |   2 +-
kernel/rcu/Kconfig                            |  19 ++
kernel/rcu/rcu.h                              |  12 +
kernel/rcu/rcu_segcblist.c                    |  23 +-
kernel/rcu/rcu_segcblist.h                    |   8 +
kernel/rcu/rcuscale.c                         |  74 ++++-
kernel/rcu/rcutorture.c                       |  60 +++-
kernel/rcu/tree.c                             | 187 ++++++++----
kernel/rcu/tree.h                             |  13 +-
kernel/rcu/tree_nocb.h                        | 282 +++++++++++++++---
kernel/time/posix-timers.c                    |   2 +-
lib/radix-tree.c                              |   2 +-
lib/xarray.c                                  |   2 +-
mm/slab.h                                     |  54 ++--
mm/slub.c                                     |  20 +-
net/core/dst.c                                |   2 +-
security/security.c                           |   2 +-
security/selinux/avc.c                        |   4 +-
.../selftests/rcutorture/configs/rcu/CFLIST   |   1 +
.../selftests/rcutorture/configs/rcu/TREE11   |  18 ++
.../rcutorture/configs/rcu/TREE11.boot        |   8 +
32 files changed, 783 insertions(+), 153 deletions(-)
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

--
2.37.2.789.g6183377224-goog

Comments

Paul E. McKenney Sept. 3, 2022, 3:22 p.m. UTC | #1
On Thu, Sep 01, 2022 at 10:17:02PM +0000, Joel Fernandes (Google) wrote:
> Here is v5 of call_rcu_lazy() based on the latest RCU -dev branch. Main changes are:
> - moved length field into rcu_data (Frederic suggestion)
> - added new traces to aid debugging and testing.
> 	- the new trace patch (along with the rcuscale and rcutorture tests)
> 	  gives confidence that the patches work well. Also it is tested on
> 	  real ChromeOS hardware and the boot time is looking good even though
> 	  lazy callbacks are being queued (i.e. the lazy ones do not effect the
> 	  synchronous non-lazy ones that effect boot time)
> - rewrote some parts of the core patch.
> - for rcutop, please apply the diff in the following link to the BCC repo:
>   https://lore.kernel.org/r/Yn+79v4q+1c9lXdc@google.com
>   Then, cd libbpf-tools/ and run make to build the rcutop static binary.
>   (If you need an x86 binary, ping me and I'll email you).
>   In the future, I will attempt to make rcutop built within the kernel repo.
>   This is already done for another tool (see tools/bpf/runqslower) so is doable.
> 
> The 2 mm patches are what Vlastimil pulled into slab-next. I included them in
> this series so that the tracing patch builds.
> 
> Previous series was posted here:
>  https://lore.kernel.org/all/20220713213237.1596225-1-joel@joelfernandes.org/
> 
> Linked below [1] is some power data I collected with Turbostat on an x86
> ChromeOS ADL machine. The numbers are not based on -next, but rather 5.19
> kernel as that's what booted on my ChromeOS machine).
> 
> These are output by Turbostat, by running:
> turbostat -S -s PkgWatt,CorWatt --interval 5
> PkgWatt - summary of package power in Watts 5 second interval.
> CoreWatt - summary of core power in Watts 5 second interval.
> 
> [1] https://lore.kernel.org/r/b4145008-6840-fc69-a6d6-e38bc218009d@joelfernandes.org

Thank you for all your work on this!

I have pulled these in for testing and review.  Some work is required
for them to be ready for mainline.

> Joel Fernandes (Google) (15):

I took these:

> rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask
> rcu: Fix late wakeup when flush of bypass cblist happens
> rcu: Move trace_rcu_callback() before bypassing
> rcu: Introduce call_rcu_lazy() API implementation

You and Frederic need to come to agreement on "rcu: Fix late wakeup when
flush of bypass cblist happens".

These have some difficulties, so I put them on top of the stack:

> rcu: Add per-CB tracing for queuing, flush and invocation.
	This one breaks 32-bit MIPS.
> rcuscale: Add laziness and kfree tests
	I am concerned that this one has OOM problems.
> rcutorture: Add test code for call_rcu_lazy()
	This one does not need a separate scenario, but instead a separate
	rcutorture.gp_lazy boot parameter.  For a rough template for this
	sort of change, please see the rcutorture changes in this commit:

	76ea364161e7 ("rcu: Add full-sized polling for start_poll()")

	The key point is that every rcutorture.torture_type=rcu test then
	becomes a call_rcu_lazy() test.  (Unless explicitly overridden.)
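
	Roughly the following shape, to make the intent concrete; this is
	a hedged sketch rather than the patch I have in mind, and the
	gp_lazy name plus the deferred-free hook are just illustrations:

	/*
	 * Sketch only: a boolean module parameter, following the existing
	 * gp_normal/gp_exp convention, that makes the plain "rcu" torture
	 * type queue its deferred frees with call_rcu_lazy().  That way
	 * every rcutorture.torture_type=rcu run exercises the lazy path
	 * unless explicitly overridden on the boot/module command line.
	 */
	torture_param(bool, gp_lazy, false,
		      "Use call_rcu_lazy() for deferred frees");

	/* Modified version of rcutorture's existing deferred-free hook. */
	static void rcu_torture_deferred_free(struct rcu_torture *p)
	{
		if (gp_lazy)
			call_rcu_lazy(&p->rtort_rcu, rcu_torture_cb);
		else
			call_rcu(&p->rtort_rcu, rcu_torture_cb);
	}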

I took these, though they need to go up their respective trees or get acks
from their respective maintainers.

> fs: Move call_rcu() to call_rcu_lazy() in some paths
> cred: Move call_rcu() to call_rcu_lazy()
> security: Move call_rcu() to call_rcu_lazy()
> net/core: Move call_rcu() to call_rcu_lazy()
> kernel: Move various core kernel usages to call_rcu_lazy()
> lib: Move call_rcu() to call_rcu_lazy()
> i915: Move call_rcu() to call_rcu_lazy()
> fork: Move thread_stack_free_rcu() to call_rcu_lazy()
> 
> Vineeth Pillai (1):

I took this:

> rcu: shrinker for lazy rcu
> 
> Vlastimil Babka (2):

As noted earlier, I took these strictly for testing.  I do not expect
to push them into -next, let alone into mainline.

> mm/slub: perform free consistency checks before call_rcu
> mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head

These are all on the -rcu not-yet-for-mainline branch lazy.2022.09.03b.

							Thanx, Paul

> drivers/gpu/drm/i915/gem/i915_gem_object.c    |   2 +-
> fs/dcache.c                                   |   4 +-
> fs/eventpoll.c                                |   2 +-
> fs/file_table.c                               |   2 +-
> fs/inode.c                                    |   2 +-
> include/linux/rcupdate.h                      |   6 +
> include/linux/types.h                         |  44 +++
> include/trace/events/rcu.h                    |  69 ++++-
> kernel/cred.c                                 |   2 +-
> kernel/exit.c                                 |   2 +-
> kernel/fork.c                                 |   6 +-
> kernel/pid.c                                  |   2 +-
> kernel/rcu/Kconfig                            |  19 ++
> kernel/rcu/rcu.h                              |  12 +
> kernel/rcu/rcu_segcblist.c                    |  23 +-
> kernel/rcu/rcu_segcblist.h                    |   8 +
> kernel/rcu/rcuscale.c                         |  74 ++++-
> kernel/rcu/rcutorture.c                       |  60 +++-
> kernel/rcu/tree.c                             | 187 ++++++++----
> kernel/rcu/tree.h                             |  13 +-
> kernel/rcu/tree_nocb.h                        | 282 +++++++++++++++---
> kernel/time/posix-timers.c                    |   2 +-
> lib/radix-tree.c                              |   2 +-
> lib/xarray.c                                  |   2 +-
> mm/slab.h                                     |  54 ++--
> mm/slub.c                                     |  20 +-
> net/core/dst.c                                |   2 +-
> security/security.c                           |   2 +-
> security/selinux/avc.c                        |   4 +-
> .../selftests/rcutorture/configs/rcu/CFLIST   |   1 +
> .../selftests/rcutorture/configs/rcu/TREE11   |  18 ++
> .../rcutorture/configs/rcu/TREE11.boot        |   8 +
> 32 files changed, 783 insertions(+), 153 deletions(-)
> create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
> create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> 
> --
> 2.37.2.789.g6183377224-goog
>
Joel Fernandes Sept. 20, 2022, 11:40 a.m. UTC | #2
+rcu@vger for archives.



On Tue, Sep 20, 2022 at 7:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> On 9/20/2022 6:29 AM, Joel Fernandes wrote:
> >
> >
> > On 9/19/2022 7:40 PM, Joel Fernandes wrote:
> >>
> >>
> >> On 9/19/2022 6:32 PM, Paul E. McKenney wrote:
> >>> On Mon, Sep 19, 2022 at 05:38:19PM -0400, Joel Fernandes wrote:
> >>>> On Mon, Sep 19, 2022 at 9:11 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>>>>
> >>>>> On Mon, Sep 19, 2022 at 10:08:28AM +0200, Uladzislau Rezki wrote:
> >>>>>> On Wed, Sep 14, 2022 at 04:22:51AM -0700, Paul E. McKenney wrote:
> >>>>>>> On Wed, Sep 14, 2022 at 11:38:32AM +0100, Uladzislau Rezki wrote:
> >>>>>>>> On Wed, Sep 14, 2022, 11:23 Frederic Weisbecker <frederic@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>>> On Wed, Sep 14, 2022 at 02:34:23AM -0700, Paul E. McKenney wrote:
> >>>>>>>>>> Agreed.  There could be a list of functions in call_rcu() that are
> >>>>>>>>>> treated specially.
> >>>>>>>>>>
> >>>>>>>>>> But please be careful.  Maintainers of a given function might have
> >>>>>>>>>> something to say about that function being added to that list.  ;-)
> >>>>>>>>>>
> >>>>>>>>>> But one big advantage of this is that the list of functions could be
> >>>>>>>>>> private to ChromeOS and/or Android.  Maintainers might still be annoyed
> >>>>>>>>>> about bugs sent to them whose root cause was addition of their function
> >>>>>>>>>> to the call_rcu() list, though.  ;-)
> >>>>>>>>>
> >>>>>>>>> Are you sure you want this feature to be usable and debuggable by Android
> >>>>>>>>> only?
> >>>>>>>>>
> >>>>>>>> Adding on top of Frederic sentence. Sorry for that 
Paul E. McKenney Sept. 20, 2022, 12:32 p.m. UTC | #3
On Tue, Sep 20, 2022 at 07:40:01AM -0400, Joel Fernandes wrote:
> +rcu@vger for archives.
> On Tue, Sep 20, 2022 at 7:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > On 9/19/2022 7:40 PM, Joel Fernandes wrote:
> > >> On 9/19/2022 6:32 PM, Paul E. McKenney wrote:
> > >>> On Mon, Sep 19, 2022 at 05:38:19PM -0400, Joel Fernandes wrote:
> > >>>> On Mon, Sep 19, 2022 at 9:11 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >>>>>
> > >>>>> On Mon, Sep 19, 2022 at 10:08:28AM +0200, Uladzislau Rezki wrote:
> > >>>>>> On Wed, Sep 14, 2022 at 04:22:51AM -0700, Paul E. McKenney wrote:
> > >>>>>>> On Wed, Sep 14, 2022 at 11:38:32AM +0100, Uladzislau Rezki wrote:
> > >>>>>>>> On Wed, Sep 14, 2022, 11:23 Frederic Weisbecker <frederic@kernel.org> wrote:
> > >>>>>>>>
> > >>>>>>>>> On Wed, Sep 14, 2022 at 02:34:23AM -0700, Paul E. McKenney wrote:
> > >>>>>>>>>> Agreed.  There could be a list of functions in call_rcu() that are
> > >>>>>>>>>> treated specially.
> > >>>>>>>>>>
> > >>>>>>>>>> But please be careful.  Maintainers of a given function might have
> > >>>>>>>>>> something to say about that function being added to that list.  ;-)
> > >>>>>>>>>>
> > >>>>>>>>>> But one big advantage of this is that the list of functions could be
> > >>>>>>>>>> private to ChromeOS and/or Android.  Maintainers might still be annoyed
> > >>>>>>>>>> about bugs sent to them whose root cause was addition of their function
> > >>>>>>>>>> to the call_rcu() list, though.  ;-)
> > >>>>>>>>>
> > >>>>>>>>> Are you sure you want this feature to be usable and debuggable by Android
> > >>>>>>>>> only?
> > >>>>>>>>>
> > >>>>>>>> Adding on top of Frederic sentence. Sorry for that 
Joel Fernandes Sept. 20, 2022, 5:55 p.m. UTC | #4
Sorry for the slightly delayed reply; today was crazy at work.

On Tue, Sep 20, 2022 at 05:32:08AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 20, 2022 at 07:40:01AM -0400, Joel Fernandes wrote:
> > +rcu@vger for archives.
> > On Tue, Sep 20, 2022 at 7:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > On 9/19/2022 7:40 PM, Joel Fernandes wrote:
> > > >> On 9/19/2022 6:32 PM, Paul E. McKenney wrote:
> > > >>> On Mon, Sep 19, 2022 at 05:38:19PM -0400, Joel Fernandes wrote:
> > > >>>> On Mon, Sep 19, 2022 at 9:11 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >>>>>
> > > >>>>> On Mon, Sep 19, 2022 at 10:08:28AM +0200, Uladzislau Rezki wrote:
> > > >>>>>> On Wed, Sep 14, 2022 at 04:22:51AM -0700, Paul E. McKenney wrote:
> > > >>>>>>> On Wed, Sep 14, 2022 at 11:38:32AM +0100, Uladzislau Rezki wrote:
> > > >>>>>>>> On Wed, Sep 14, 2022, 11:23 Frederic Weisbecker <frederic@kernel.org> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> On Wed, Sep 14, 2022 at 02:34:23AM -0700, Paul E. McKenney wrote:
> > > >>>>>>>>>> Agreed.  There could be a list of functions in call_rcu() that are
> > > >>>>>>>>>> treated specially.
> > > >>>>>>>>>>
> > > >>>>>>>>>> But please be careful.  Maintainers of a given function might have
> > > >>>>>>>>>> something to say about that function being added to that list.  ;-)
> > > >>>>>>>>>>
> > > >>>>>>>>>> But one big advantage of this is that the list of functions could be
> > > >>>>>>>>>> private to ChromeOS and/or Android.  Maintainers might still be annoyed
> > > >>>>>>>>>> about bugs sent to them whose root cause was addition of their function
> > > >>>>>>>>>> to the call_rcu() list, though.  ;-)
> > > >>>>>>>>>
> > > >>>>>>>>> Are you sure you want this feature to be usable and debuggable by Android
> > > >>>>>>>>> only?
> > > >>>>>>>>>
> > > >>>>>>>> Adding on top of Frederic sentence. Sorry for that 
Paul E. McKenney Sept. 20, 2022, 7:03 p.m. UTC | #5
On Tue, Sep 20, 2022 at 05:55:53PM +0000, Joel Fernandes wrote:
> Sorry for slightly delayed reply, today was crazy at work.

I know that feeling!

> On Tue, Sep 20, 2022 at 05:32:08AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 20, 2022 at 07:40:01AM -0400, Joel Fernandes wrote:
> > > +rcu@vger for archives.
> > > On Tue, Sep 20, 2022 at 7:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > On 9/19/2022 7:40 PM, Joel Fernandes wrote:
> > > > >> On 9/19/2022 6:32 PM, Paul E. McKenney wrote:
> > > > >>> On Mon, Sep 19, 2022 at 05:38:19PM -0400, Joel Fernandes wrote:
> > > > >>>> On Mon, Sep 19, 2022 at 9:11 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >>>>>
> > > > >>>>> On Mon, Sep 19, 2022 at 10:08:28AM +0200, Uladzislau Rezki wrote:
> > > > >>>>>> On Wed, Sep 14, 2022 at 04:22:51AM -0700, Paul E. McKenney wrote:
> > > > >>>>>>> On Wed, Sep 14, 2022 at 11:38:32AM +0100, Uladzislau Rezki wrote:
> > > > >>>>>>>> On Wed, Sep 14, 2022, 11:23 Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> On Wed, Sep 14, 2022 at 02:34:23AM -0700, Paul E. McKenney wrote:
> > > > >>>>>>>>>> Agreed.  There could be a list of functions in call_rcu() that are
> > > > >>>>>>>>>> treated specially.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> But please be careful.  Maintainers of a given function might have
> > > > >>>>>>>>>> something to say about that function being added to that list.  ;-)
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> But one big advantage of this is that the list of functions could be
> > > > >>>>>>>>>> private to ChromeOS and/or Android.  Maintainers might still be annoyed
> > > > >>>>>>>>>> about bugs sent to them whose root cause was addition of their function
> > > > >>>>>>>>>> to the call_rcu() list, though.  ;-)
> > > > >>>>>>>>>
> > > > >>>>>>>>> Are you sure you want this feature to be usable and debuggable by Android
> > > > >>>>>>>>> only?
> > > > >>>>>>>>>
> > > > >>>>>>>> Adding on top of Frederic sentence. Sorry for that 
Joel Fernandes Sept. 20, 2022, 10:27 p.m. UTC | #6
On Tue, Sep 20, 2022 at 3:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > > > > >>>> Are you saying we should not make call_rcu() default-lazy for
> > > > > >>>> everyone? According to Thomas, anyone expecting call_rcu() to finish
> > > > > >>>> in any reasonable amount of time is a bug, and that usecase needs to
> > > > > >>>> be fixed to use synchronize_rcu(). Are we saying now that that's not
> > > > > >>>> our direction?
> > > > > >>>
> > > > > >>> I am saying that Thomas can sometimes get quite enthusiastic about
> > > > > >>> a particular direction.  Plus I strongly suspect that Thomas has not
> > > > > >>> actually reviewed all of the RCU callbacks.
> > > > > >>>
> > > > > >>> Of course, I agree that it won't hurt to try out Thomas's suggestion,
> > > > > >>> which is why I did not object at the time.  But we should be fully
> > > > > >>> prepared for it to blow up in our faces.  ;-)
> > > > > >>
> > > > > >> Makes sense, thanks for clarifying. My impression was he does expect some pain
> > > > > >> in the short-term but the pain is worth not introduction a new opt-in API.
> > > > > >>
> > > > > >> I am wondering if putting it behind a CONFIG option will delay the pain too
> > > > > >> much. Thoughts? I am assuming that we want the CONFIG to turn it off by default.
> > > > > >
> > > > > > I tried a poor man's version of Thomas's idea by doing
> > > > > > jiffies_till_first_fqs=1000 and rcu.rcu_expedited=1 boot parameters. Of course
> > > > > > this just for trying has no regard for memory pressure and such so it is in the
> > > > > > 'do not try at home' category.
> > > > > >
> > > > > > Then I tried jiffies_till_first_fqs=128 and compare before/after kernel logs.
> > > > > >
> > > > > > These are on ChromeOS hardware (5.19-rc4 kernel)
> > > > > > For jiffies_till_first_fqs=1000 Everything is fine until this point:
> > > > > >     8.383646] Freeing unused kernel image (initmem) memory: 1676K
> > > > > >    11.378880] Write protecting the kernel read-only data: 28672k
> > > > > >
> > > > > > I do not see this 3 second delay with jiffies_till_first_fqs=128.
> > > > > >     8.411866] Freeing unused kernel image (initmem) memory: 1676K
> > > > > >     8.656701] Write protecting the kernel read-only data: 28672k
> > > > >
> > > > > At least this could be because of the rodata write protection stuff calling
> > > > > rcu_barrier():
> > > > >
> > > > > From init/main.c
> > > > >
> > > > > #ifdef CONFIG_STRICT_KERNEL_RWX
> > > > > static void mark_readonly(void)
> > > > > {
> > > > >         if (rodata_enabled) {
> > > > >                 /*
> > > > >                  * load_module() results in W+X mappings, which are cleaned
> > > > >                  * up with call_rcu().  Let's make sure that queued work is
> > > > >                  * flushed so that we don't hit false positives looking for
> > > > >                  * insecure pages which are W+X.
> > > > >                  */
> > > > >                 rcu_barrier();
> > > > >                 mark_rodata_ro();
> > > > >                 rodata_test();
> > > > >         } else
> > > > >                 pr_info("Kernel memory protection disabled.\n");
> > > > > }
> > > > >
> > > > >
> > > > > I think rcu_barrier() does not play well with jiffiest_till_first_fqs (probably
> > > > > because of the missing wake ups I found in the existing code, I should try that
> > > > > patch on this test..).
> > > > >
> > > > > Is it expected that a sleep involving jiffies_till_first_fqs does not get
> > > > > disturbed due to rcu_barrier() ? That seems like a bug to me. At least for
> > > > > !NOCB, I don't see any call the rcu_gp_kthread_wake() from rcu_barrier()!
> > >
> > > Another approach is to take your current stack of patches, but only those
> > > to RCU, and then as Thomas suggests, make call_rcu() behave lazily and
> > > make a call_rcu_expedited() that retains the current behavior.  Keep the
> > > current grace-period timings.
> >
> > Yes these sound good. But one issue is, we do not know all the cases that
> > needed call_rcu_expedited(). There is the per-cpu refcount which needs it,
> > but I'm not aware of any others. If you take the patches, can we add convert
> > to call_rcu_expedited() as regressions happen?
>
> Exactly.
>
> Several of the calls to wait_rcu_gp() would need their arguments to
> change from call_rcu() to call_rcu_expedited().  Oddly enough, including
> the call from synchronize_rcu_expedited() in the case where there are
> to be no expedited grace periods.  ;-)

Oh true, I totally missed that. So if we are going this way (everything
async becomes lazy by default, and we selectively make callers non-lazy),
then we can't do without a call_rcu_expedited() API in the patchset,
right? So I'll go work on that API then. It should be somewhat trivial
to do, I think.

One question - call_rcu_expedited() here just means "use regular non-lazy
RCU", not "use expedited grace periods", right? In that case, is it
better to call it call_rcu_fast() or call_rcu_nonlazy(), to avoid
colliding with the "expedited" terminology of the
synchronize_{s,}rcu_expedited() APIs?
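
To make sure I'm picturing the same thing, here is a minimal sketch of
that split (the __call_rcu_common() helper and its lazy flag are
placeholders of mine, not code from this series, and the non-lazy name
is still to be decided):

	/* call_rcu() queues the callback lazily by default... */
	void call_rcu(struct rcu_head *head, rcu_callback_t func)
	{
		__call_rcu_common(head, func, /* lazy = */ true);
	}

	/*
	 * ...and callers that need today's grace-period timing use a
	 * separate entry point, whatever it ends up being called.
	 */
	void call_rcu_expedited(struct rcu_head *head, rcu_callback_t func)
	{
		__call_rcu_common(head, func, /* lazy = */ false);
	}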

> > > Then make synchronize_rcu() and friends use call_rcu_expedited() and see
> > > if you can get away with only a small number of call_rcu_expedited()
> > > "upgrades".
> >
> > Right. It is a bit of a chicken-and-egg problem, if we have to whitelist
> > everything, then we have to blacklist as we go. If we blacklist everything,
> > then we may miss whitelisting something.
> >
> > Thoughts?
>
> Try it and see what happens.  We can always fall back to call_rcu_lazy().

Ok :) I'll rewrite my patchset this way and push it out soon without
further ado, and I'll fix any regressions I notice by converting to
the _fast() API.

> > Meanwhile just to clarify- I am very much invested in seeing the current
> > stack of patches merged soon give or take the _expedited() conversions,
> > however I did the above jiffies_till_first experiments in the hopes they
> > would show the ones that need such conversion quickly.
>
> I suggest doing the experiment on top of your patch stack, but excluding
> the call_rcu_lazy() conversios.  That way, as we come to agreement on
> pieces we can pull the patches in, so that the experiments do not get
> in the way.

Yes, sounds good!

thanks,

 - Joel
Paul E. McKenney Sept. 20, 2022, 10:35 p.m. UTC | #7
On Tue, Sep 20, 2022 at 06:27:52PM -0400, Joel Fernandes wrote:
> On Tue, Sep 20, 2022 at 3:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> [...]
> > > > > > >>>> Are you saying we should not make call_rcu() default-lazy for
> > > > > > >>>> everyone? According to Thomas, anyone expecting call_rcu() to finish
> > > > > > >>>> in any reasonable amount of time is a bug, and that usecase needs to
> > > > > > >>>> be fixed to use synchronize_rcu(). Are we saying now that that's not
> > > > > > >>>> our direction?
> > > > > > >>>
> > > > > > >>> I am saying that Thomas can sometimes get quite enthusiastic about
> > > > > > >>> a particular direction.  Plus I strongly suspect that Thomas has not
> > > > > > >>> actually reviewed all of the RCU callbacks.
> > > > > > >>>
> > > > > > >>> Of course, I agree that it won't hurt to try out Thomas's suggestion,
> > > > > > >>> which is why I did not object at the time.  But we should be fully
> > > > > > >>> prepared for it to blow up in our faces.  ;-)
> > > > > > >>
> > > > > > >> Makes sense, thanks for clarifying. My impression was he does expect some pain
> > > > > > >> in the short-term but the pain is worth not introduction a new opt-in API.
> > > > > > >>
> > > > > > >> I am wondering if putting it behind a CONFIG option will delay the pain too
> > > > > > >> much. Thoughts? I am assuming that we want the CONFIG to turn it off by default.
> > > > > > >
> > > > > > > I tried a poor man's version of Thomas's idea by doing
> > > > > > > jiffies_till_first_fqs=1000 and rcu.rcu_expedited=1 boot parameters. Of course
> > > > > > > this just for trying has no regard for memory pressure and such so it is in the
> > > > > > > 'do not try at home' category.
> > > > > > >
> > > > > > > Then I tried jiffies_till_first_fqs=128 and compare before/after kernel logs.
> > > > > > >
> > > > > > > These are on ChromeOS hardware (5.19-rc4 kernel)
> > > > > > > For jiffies_till_first_fqs=1000 Everything is fine until this point:
> > > > > > >     8.383646] Freeing unused kernel image (initmem) memory: 1676K
> > > > > > >    11.378880] Write protecting the kernel read-only data: 28672k
> > > > > > >
> > > > > > > I do not see this 3 second delay with jiffies_till_first_fqs=128.
> > > > > > >     8.411866] Freeing unused kernel image (initmem) memory: 1676K
> > > > > > >     8.656701] Write protecting the kernel read-only data: 28672k
> > > > > >
> > > > > > At least this could be because of the rodata write protection stuff calling
> > > > > > rcu_barrier():
> > > > > >
> > > > > > From init/main.c
> > > > > >
> > > > > > #ifdef CONFIG_STRICT_KERNEL_RWX
> > > > > > static void mark_readonly(void)
> > > > > > {
> > > > > >         if (rodata_enabled) {
> > > > > >                 /*
> > > > > >                  * load_module() results in W+X mappings, which are cleaned
> > > > > >                  * up with call_rcu().  Let's make sure that queued work is
> > > > > >                  * flushed so that we don't hit false positives looking for
> > > > > >                  * insecure pages which are W+X.
> > > > > >                  */
> > > > > >                 rcu_barrier();
> > > > > >                 mark_rodata_ro();
> > > > > >                 rodata_test();
> > > > > >         } else
> > > > > >                 pr_info("Kernel memory protection disabled.\n");
> > > > > > }
> > > > > >
> > > > > >
> > > > > > I think rcu_barrier() does not play well with jiffiest_till_first_fqs (probably
> > > > > > because of the missing wake ups I found in the existing code, I should try that
> > > > > > patch on this test..).
> > > > > >
> > > > > > Is it expected that a sleep involving jiffies_till_first_fqs does not get
> > > > > > disturbed due to rcu_barrier() ? That seems like a bug to me. At least for
> > > > > > !NOCB, I don't see any call the rcu_gp_kthread_wake() from rcu_barrier()!
> > > >
> > > > Another approach is to take your current stack of patches, but only those
> > > > to RCU, and then as Thomas suggests, make call_rcu() behave lazily and
> > > > make a call_rcu_expedited() that retains the current behavior.  Keep the
> > > > current grace-period timings.
> > >
> > > Yes these sound good. But one issue is, we do not know all the cases that
> > > needed call_rcu_expedited(). There is the per-cpu refcount which needs it,
> > > but I'm not aware of any others. If you take the patches, can we add convert
> > > to call_rcu_expedited() as regressions happen?
> >
> > Exactly.
> >
> > Several of the calls to wait_rcu_gp() would need their arguments to
> > change from call_rcu() to call_rcu_expedited().  Oddly enough, including
> > the call from synchronize_rcu_expedited() in the case where there are
> > to be no expedited grace periods.  ;-)
> 
> Oh true, I totally missed that. So if we are going with it this way
> (everything async is lazy and then selectively make them non-lazy),
> then we can't do without a call_rcu_expedited() API in the patchset.
> Right? So I'll go work on that API then. It should be somewhat trivial
> to do, I think.

Yes, it is needed.

> One question - call_rcu_expedited() just means "use regular non lazy
> rcu", not "use expedited grace periods" right? In this case, is it
> better to call it call_rcu_fast() or call_rcu_nonlazy() to avoid
> conflicts with the "expedited" terminology in the
> synchronize_{s,}rcu_expedited() APIs?

I agree that call_rcu_expedited() is likely to cause confusion.
The call_rcu_fast() is nice and short, but the result really is not all
that fast.  The call_rcu_nonlazy() sounds good, but I am sure that as
usual there will be no shortage of bikeshedders.  ;-)

> > > > Then make synchronize_rcu() and friends use call_rcu_expedited() and see
> > > > if you can get away with only a small number of call_rcu_expedited()
> > > > "upgrades".
> > >
> > > Right. It is a bit of a chicken-and-egg problem, if we have to whitelist
> > > everything, then we have to blacklist as we go. If we blacklist everything,
> > > then we may miss whitelisting something.
> > >
> > > Thoughts?
> >
> > Try it and see what happens.  We can always fall back to call_rcu_lazy().
> 
> Ok :) I'll rewrite my patchset this way and push it out soon without
> further ado, and I'll fix any regressions I notice by converting to
> the _fast() API.

Sounds good!

								Thanx, Paul

> > > Meanwhile just to clarify- I am very much invested in seeing the current
> > > stack of patches merged soon give or take the _expedited() conversions,
> > > however I did the above jiffies_till_first experiments in the hopes they
> > > would show the ones that need such conversion quickly.
> >
> > I suggest doing the experiment on top of your patch stack, but excluding
> > the call_rcu_lazy() conversios.  That way, as we come to agreement on
> > pieces we can pull the patches in, so that the experiments do not get
> > in the way.
> 
> Yes, sounds good!
> 
> thanks,
> 
>  - Joel
Joel Fernandes Sept. 20, 2022, 10:40 p.m. UTC | #8
On Tue, Sep 20, 2022 at 6:35 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Sep 20, 2022 at 06:27:52PM -0400, Joel Fernandes wrote:
> > On Tue, Sep 20, 2022 at 3:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [...]
> > One question - call_rcu_expedited() just means "use regular non lazy
> > rcu", not "use expedited grace periods" right? In this case, is it
> > better to call it call_rcu_fast() or call_rcu_nonlazy() to avoid
> > conflicts with the "expedited" terminology in the
> > synchronize_{s,}rcu_expedited() APIs?
>
> I agree that call_rcu_expedited() is likely to cause confusion.
> The call_rcu_fast() is nice and short, but the result really is not all
> that fast.  The call_rcu_nonlazy() sounds good, but I am sure that as
> usual there will be no shortage of bikeshedders.  ;-)

call_rcu_tglx() and then we can get Thomas's tags :-D Jk.

call_rcu_faster() since it is faster than regular call_rcu()

call_rcu_hurry_up_already(), likely a NAK but sounds cool :-D

Jokes aside,
call_rcu_flush() - because the CBs are not batched like regular call_rcu()?

thanks,

- Joel
Paul E. McKenney Sept. 21, 2022, 2:21 a.m. UTC | #9
On Tue, Sep 20, 2022 at 06:40:48PM -0400, Joel Fernandes wrote:
> On Tue, Sep 20, 2022 at 6:35 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, Sep 20, 2022 at 06:27:52PM -0400, Joel Fernandes wrote:
> > > On Tue, Sep 20, 2022 at 3:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > [...]
> > > One question - call_rcu_expedited() just means "use regular non lazy
> > > rcu", not "use expedited grace periods" right? In this case, is it
> > > better to call it call_rcu_fast() or call_rcu_nonlazy() to avoid
> > > conflicts with the "expedited" terminology in the
> > > synchronize_{s,}rcu_expedited() APIs?
> >
> > I agree that call_rcu_expedited() is likely to cause confusion.
> > The call_rcu_fast() is nice and short, but the result really is not all
> > that fast.  The call_rcu_nonlazy() sounds good, but I am sure that as
> > usual there will be no shortage of bikeshedders.  ;-)
> 
> call_rcu_tglx() and then we can Thomas's tags :-D Jk.
> 
> call_rcu_faster() since it is faster than regular call_rcu()
> 
> call_rcu_hurry_up_already()  , likely a NAK but sounds cool :-D
> 
> Jokes aside,
> call_rcu_flush() - because the CBs are not batched like regular call_rcu() ?

Possibly, possibly...

							Thanx, Paul