Message ID | 20220512030442.2530552-2-joel@joelfernandes.org |
---|---|
State | Superseded |
Series | Implement call_rcu_lazy() and miscellaneous fixes |
On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > Implement timer-based RCU callback batching. The batch is flushed > whenever a certain amount of time has passed, or the batch on a > particular CPU grows too big. Also memory pressure can flush it. > > Locking is avoided to reduce lock contention when queuing and dequeuing > happens on different CPUs of a per-cpu list, such as when shrinker > context is running on different CPU. Also not having to use locks keeps > the per-CPU structure size small. > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> It is very good to see this! Inevitable comments and questions below. Thanx, Paul > --- > include/linux/rcupdate.h | 6 ++ > kernel/rcu/Kconfig | 8 +++ > kernel/rcu/Makefile | 1 + > kernel/rcu/lazy.c | 145 +++++++++++++++++++++++++++++++++++++++ > kernel/rcu/rcu.h | 5 ++ > kernel/rcu/tree.c | 2 + > 6 files changed, 167 insertions(+) > create mode 100644 kernel/rcu/lazy.c > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 88b42eb46406..d0a6c4f5172c 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void) > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > +#ifdef CONFIG_RCU_LAZY > +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func); > +#else > +#define call_rcu_lazy(head, func) call_rcu(head, func) > +#endif > + > /* Internal to kernel */ > void rcu_init(void); > extern int rcu_scheduler_active __read_mostly; > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > index bf8e341e75b4..c09715079829 100644 > --- a/kernel/rcu/Kconfig > +++ b/kernel/rcu/Kconfig > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB > Say N here if you hate read-side memory barriers. > Take the default if you are unsure. > > +config RCU_LAZY > + bool "RCU callback lazy invocation functionality" > + depends on RCU_NOCB_CPU > + default y This "default y" is OK for experimentation, but for mainline we should not be forcing this on unsuspecting people. ;-) > + help > + To save power, batch RCU callbacks and flush after delay, memory > + pressure or callback list growing too big. > + > endmenu # "RCU Subsystem" > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile > index 0cfb009a99b9..8968b330d6e0 100644 > --- a/kernel/rcu/Makefile > +++ b/kernel/rcu/Makefile > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o > obj-$(CONFIG_TREE_RCU) += tree.o > obj-$(CONFIG_TINY_RCU) += tiny.o > obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o > +obj-$(CONFIG_RCU_LAZY) += lazy.o > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c > new file mode 100644 > index 000000000000..55e406cfc528 > --- /dev/null > +++ b/kernel/rcu/lazy.c > @@ -0,0 +1,145 @@ > +/* > + * Lockless lazy-RCU implementation. > + */ > +#include <linux/rcupdate.h> > +#include <linux/shrinker.h> > +#include <linux/workqueue.h> > +#include "rcu.h" > + > +// How much to batch before flushing? > +#define MAX_LAZY_BATCH 2048 > + > +// How much to wait before flushing? > +#define MAX_LAZY_JIFFIES 10000 That is more than a minute on a HZ=10 system. Are you sure that you did not mean "(10 * HZ)" or some such? > + > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON > +// later to ensure that rcu_head and lazy_rcu_head are of the same size. 
> +struct lazy_rcu_head { > + struct llist_node llist_node; > + void (*func)(struct callback_head *head); > +} __attribute__((aligned(sizeof(void *)))); This needs a build-time check that rcu_head and lazy_rcu_head are of the same size. Maybe something like this in some appropriate context: BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy)); Never mind! I see you have this in rcu_init_lazy(). Plus I now see that you also mention this in the above comments. ;-) > + > +struct rcu_lazy_pcp { > + struct llist_head head; > + struct delayed_work work; > + atomic_t count; > +}; > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); > + > +// Lockless flush of CPU, can be called concurrently. > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) > +{ > + struct llist_node *node = llist_del_all(&rlp->head); > + struct lazy_rcu_head *cursor, *temp; > + > + if (!node) > + return; At this point, the list is empty but the count is non-zero. Can that cause a problem? (For the existing callback lists, this would be OK.) > + llist_for_each_entry_safe(cursor, temp, node, llist_node) { > + struct rcu_head *rh = (struct rcu_head *)cursor; > + debug_rcu_head_unqueue(rh); Good to see this check! > + call_rcu(rh, rh->func); > + atomic_dec(&rlp->count); > + } > +} > + > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > +{ > + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > + struct rcu_lazy_pcp *rlp; > + > + preempt_disable(); > + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); Whitespace issue, please fix. > + preempt_enable(); > + > + if (debug_rcu_head_queue((void *)head)) { > + // Probable double call_rcu(), just leak. > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > + __func__, head); > + > + // Mark as success and leave. > + return; > + } > + > + // Queue to per-cpu llist > + head->func = func; > + llist_add(&head->llist_node, &rlp->head); Suppose that there are a bunch of preemptions between the preempt_enable() above and this point, so that the current CPU's list has lots of callbacks, but zero ->count. Can that cause a problem? In the past, this sort of thing has been an issue for rcu_barrier() and friends. > + // Flush queue if too big You will also need to check for early boot use. I -think- it suffice to simply skip the following "if" statement when rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices. The reason being that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT kernels won't expire them until the softirq kthreads have been spawned. Which is OK, as it just means that call_rcu_lazy() is a bit more lazy than expected that early. Except that call_rcu() can be invoked even before rcu_init() has been invoked, which is therefore also before rcu_init_lazy() has been invoked. I thefore suggest something like this at the very start of this function: if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) call_rcu(head_rcu, func); The goal is that people can replace call_rcu() with call_rcu_lazy() without having to worry about invocation during early boot. Huh. "head_rcu"? Why not "rhp"? Or follow call_rcu() with "head", though "rhp" is more consistent with the RCU pointer initials approach. > + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { > + lazy_rcu_flush_cpu(rlp); > + } else { > + if (!delayed_work_pending(&rlp->work)) { This check is racy because the work might run to completion right at this point. 
Wouldn't it be better to rely on the internal check of WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()? > + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); > + } > + } > +} > + > +static unsigned long > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > +{ > + unsigned long count = 0; > + int cpu; > + > + /* Snapshot count of all CPUs */ > + for_each_possible_cpu(cpu) { > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > + > + count += atomic_read(&rlp->count); > + } > + > + return count; > +} > + > +static unsigned long > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > +{ > + int cpu, freed = 0; > + > + for_each_possible_cpu(cpu) { > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > + unsigned long count; > + > + count = atomic_read(&rlp->count); > + lazy_rcu_flush_cpu(rlp); > + sc->nr_to_scan -= count; > + freed += count; > + if (sc->nr_to_scan <= 0) > + break; > + } > + > + return freed == 0 ? SHRINK_STOP : freed; This is a bit surprising given the stated aim of SHRINK_STOP to indicate potential deadlocks. But this pattern is common, including on the kvfree_rcu() path, so OK! ;-) Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is that as well. > +} > + > +/* > + * This function is invoked after MAX_LAZY_JIFFIES timeout. > + */ > +static void lazy_work(struct work_struct *work) > +{ > + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); > + > + lazy_rcu_flush_cpu(rlp); > +} > + > +static struct shrinker lazy_rcu_shrinker = { > + .count_objects = lazy_rcu_shrink_count, > + .scan_objects = lazy_rcu_shrink_scan, > + .batch = 0, > + .seeks = DEFAULT_SEEKS, > +}; > + > +void __init rcu_lazy_init(void) > +{ > + int cpu; > + > + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); > + > + for_each_possible_cpu(cpu) { > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > + INIT_DELAYED_WORK(&rlp->work, lazy_work); > + } > + > + if (register_shrinker(&lazy_rcu_shrinker)) > + pr_err("Failed to register lazy_rcu shrinker!\n"); > +} > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 24b5f2c2de87..a5f4b44f395f 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); > static inline void show_rcu_tasks_trace_gp_kthread(void) {} > #endif > > +#ifdef CONFIG_RCU_LAZY > +void rcu_lazy_init(void); > +#else > +static inline void rcu_lazy_init(void) {} > +#endif > #endif /* __LINUX_RCU_H */ > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index a4c25a6283b0..ebdf6f7c9023 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4775,6 +4775,8 @@ void __init rcu_init(void) > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > else > qovld_calc = qovld; > + > + rcu_lazy_init(); > } > > #include "tree_stall.h" > -- > 2.36.0.550.gb090851708-goog >
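[Editorial note: the sketch below illustrates how the queueing path might look with the review comments above folded in, i.e. an HZ-based flush delay, an early-boot fallback to plain call_rcu(), the "rhp" naming, and relying on queue_delayed_work()'s own WORK_STRUCT_PENDING_BIT test instead of delayed_work_pending(). The structure and helper names (lazy_rcu_head, rcu_lazy_pcp_ins, lazy_rcu_flush_cpu) come from the patch; everything else is a hedged sketch, not the revision that was eventually posted.]

```c
#define MAX_LAZY_BATCH		2048
#define MAX_LAZY_JIFFIES	(10 * HZ)	/* constant in wall-clock terms, not in jiffies */

void call_rcu_lazy(struct rcu_head *rhp, rcu_callback_t func)
{
	struct lazy_rcu_head *head = (struct lazy_rcu_head *)rhp;
	struct rcu_lazy_pcp *rlp;

	/* Too early for timers/workqueues: behave exactly like call_rcu(). */
	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) {
		call_rcu(rhp, func);
		return;
	}

	if (debug_rcu_head_queue((void *)head)) {
		/* Probable double call_rcu(); just leak it. */
		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
			  __func__, head);
		return;
	}

	preempt_disable();
	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
	preempt_enable();

	/* Queue to the per-CPU llist. */
	head->func = func;
	llist_add(&head->llist_node, &rlp->head);

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
		/* Batch too big: flush now. */
		lazy_rcu_flush_cpu(rlp);
	} else {
		/*
		 * queue_delayed_work() already tests WORK_STRUCT_PENDING_BIT
		 * atomically, so no racy delayed_work_pending() check is
		 * needed before arming the flush timer.
		 */
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
	}
}
```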
Hi Paul, Thanks for bearing with my slightly late reply, I did "time blocking" technique to work on RCU today! ;-) On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote: > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > > Implement timer-based RCU callback batching. The batch is flushed > > whenever a certain amount of time has passed, or the batch on a > > particular CPU grows too big. Also memory pressure can flush it. > > > > Locking is avoided to reduce lock contention when queuing and dequeuing > > happens on different CPUs of a per-cpu list, such as when shrinker > > context is running on different CPU. Also not having to use locks keeps > > the per-CPU structure size small. > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > It is very good to see this! Inevitable comments and questions below. Thanks for taking a look! > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > index bf8e341e75b4..c09715079829 100644 > > --- a/kernel/rcu/Kconfig > > +++ b/kernel/rcu/Kconfig > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB > > Say N here if you hate read-side memory barriers. > > Take the default if you are unsure. > > > > +config RCU_LAZY > > + bool "RCU callback lazy invocation functionality" > > + depends on RCU_NOCB_CPU > > + default y > > This "default y" is OK for experimentation, but for mainline we should > not be forcing this on unsuspecting people. ;-) Agreed. I'll default 'n' :) > > + help > > + To save power, batch RCU callbacks and flush after delay, memory > > + pressure or callback list growing too big. > > + > > endmenu # "RCU Subsystem" > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile > > index 0cfb009a99b9..8968b330d6e0 100644 > > --- a/kernel/rcu/Makefile > > +++ b/kernel/rcu/Makefile > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o > > obj-$(CONFIG_TREE_RCU) += tree.o > > obj-$(CONFIG_TINY_RCU) += tiny.o > > obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o > > +obj-$(CONFIG_RCU_LAZY) += lazy.o > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c > > new file mode 100644 > > index 000000000000..55e406cfc528 > > --- /dev/null > > +++ b/kernel/rcu/lazy.c > > @@ -0,0 +1,145 @@ > > +/* > > + * Lockless lazy-RCU implementation. > > + */ > > +#include <linux/rcupdate.h> > > +#include <linux/shrinker.h> > > +#include <linux/workqueue.h> > > +#include "rcu.h" > > + > > +// How much to batch before flushing? > > +#define MAX_LAZY_BATCH 2048 > > + > > +// How much to wait before flushing? > > +#define MAX_LAZY_JIFFIES 10000 > > That is more than a minute on a HZ=10 system. Are you sure that you > did not mean "(10 * HZ)" or some such? Yes, you are right. I need to change that to be constant regardless of HZ. I will make the change as you suggest. > > + > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size. > > +struct lazy_rcu_head { > > + struct llist_node llist_node; > > + void (*func)(struct callback_head *head); > > +} __attribute__((aligned(sizeof(void *)))); > > This needs a build-time check that rcu_head and lazy_rcu_head are of > the same size. Maybe something like this in some appropriate context: > > BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy)); > > Never mind! I see you have this in rcu_init_lazy(). 
Plus I now see that > you also mention this in the above comments. ;-) Cool, great minds think alike! ;-) > > + > > +struct rcu_lazy_pcp { > > + struct llist_head head; > > + struct delayed_work work; > > + atomic_t count; > > +}; > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); > > + > > +// Lockless flush of CPU, can be called concurrently. > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) > > +{ > > + struct llist_node *node = llist_del_all(&rlp->head); > > + struct lazy_rcu_head *cursor, *temp; > > + > > + if (!node) > > + return; > > At this point, the list is empty but the count is non-zero. Can > that cause a problem? (For the existing callback lists, this would > be OK.) > > > + llist_for_each_entry_safe(cursor, temp, node, llist_node) { > > + struct rcu_head *rh = (struct rcu_head *)cursor; > > + debug_rcu_head_unqueue(rh); > > Good to see this check! > > > + call_rcu(rh, rh->func); > > + atomic_dec(&rlp->count); > > + } > > +} > > + > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > > +{ > > + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > > + struct rcu_lazy_pcp *rlp; > > + > > + preempt_disable(); > > + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); > > Whitespace issue, please fix. Fixed, thanks. > > + preempt_enable(); > > + > > + if (debug_rcu_head_queue((void *)head)) { > > + // Probable double call_rcu(), just leak. > > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > > + __func__, head); > > + > > + // Mark as success and leave. > > + return; > > + } > > + > > + // Queue to per-cpu llist > > + head->func = func; > > + llist_add(&head->llist_node, &rlp->head); > > Suppose that there are a bunch of preemptions between the preempt_enable() > above and this point, so that the current CPU's list has lots of > callbacks, but zero ->count. Can that cause a problem? > > In the past, this sort of thing has been an issue for rcu_barrier() > and friends. Thanks, I think I dropped the ball on this. You have given me something to think about. I am thinking on first thought that setting the count in advance of populating the list should do the trick. I will look more on it. > > + // Flush queue if too big > > You will also need to check for early boot use. > > I -think- it suffice to simply skip the following "if" statement when > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices. The reason being > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT > kernels won't expire them until the softirq kthreads have been spawned. > > Which is OK, as it just means that call_rcu_lazy() is a bit more > lazy than expected that early. > > Except that call_rcu() can be invoked even before rcu_init() has been > invoked, which is therefore also before rcu_init_lazy() has been invoked. In other words, you are concerned that too many lazy callbacks might be pending before rcu_init() is called? I am going through the kfree_rcu() threads/patches involving RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before the scheduler is running causes a crash or warnings? > I thefore suggest something like this at the very start of this function: > > if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) > call_rcu(head_rcu, func); > > The goal is that people can replace call_rcu() with call_rcu_lazy() > without having to worry about invocation during early boot. Yes, this seems safer. I don't expect much power savings during system boot process anyway ;-). 
I believe perhaps a static branch would work better to take a branch out from what is likely a fast path. > Huh. "head_rcu"? Why not "rhp"? Or follow call_rcu() with "head", > though "rhp" is more consistent with the RCU pointer initials approach. Fixed, thanks. > > + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { > > + lazy_rcu_flush_cpu(rlp); > > + } else { > > + if (!delayed_work_pending(&rlp->work)) { > > This check is racy because the work might run to completion right at > this point. Wouldn't it be better to rely on the internal check of > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()? Oops, agreed. Will make it as you suggest. thanks, - Joel > > + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); > > + } > > + } > > +} > > + > > +static unsigned long > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > > +{ > > + unsigned long count = 0; > > + int cpu; > > + > > + /* Snapshot count of all CPUs */ > > + for_each_possible_cpu(cpu) { > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > + > > + count += atomic_read(&rlp->count); > > + } > > + > > + return count; > > +} > > + > > +static unsigned long > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > > +{ > > + int cpu, freed = 0; > > + > > + for_each_possible_cpu(cpu) { > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > + unsigned long count; > > + > > + count = atomic_read(&rlp->count); > > + lazy_rcu_flush_cpu(rlp); > > + sc->nr_to_scan -= count; > > + freed += count; > > + if (sc->nr_to_scan <= 0) > > + break; > > + } > > + > > + return freed == 0 ? SHRINK_STOP : freed; > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate > potential deadlocks. But this pattern is common, including on the > kvfree_rcu() path, so OK! ;-) > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is > that as well. > > > +} > > + > > +/* > > + * This function is invoked after MAX_LAZY_JIFFIES timeout. 
> > + */ > > +static void lazy_work(struct work_struct *work) > > +{ > > + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); > > + > > + lazy_rcu_flush_cpu(rlp); > > +} > > + > > +static struct shrinker lazy_rcu_shrinker = { > > + .count_objects = lazy_rcu_shrink_count, > > + .scan_objects = lazy_rcu_shrink_scan, > > + .batch = 0, > > + .seeks = DEFAULT_SEEKS, > > +}; > > + > > +void __init rcu_lazy_init(void) > > +{ > > + int cpu; > > + > > + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); > > + > > + for_each_possible_cpu(cpu) { > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > + INIT_DELAYED_WORK(&rlp->work, lazy_work); > > + } > > + > > + if (register_shrinker(&lazy_rcu_shrinker)) > > + pr_err("Failed to register lazy_rcu shrinker!\n"); > > +} > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > index 24b5f2c2de87..a5f4b44f395f 100644 > > --- a/kernel/rcu/rcu.h > > +++ b/kernel/rcu/rcu.h > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); > > static inline void show_rcu_tasks_trace_gp_kthread(void) {} > > #endif > > > > +#ifdef CONFIG_RCU_LAZY > > +void rcu_lazy_init(void); > > +#else > > +static inline void rcu_lazy_init(void) {} > > +#endif > > #endif /* __LINUX_RCU_H */ > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index a4c25a6283b0..ebdf6f7c9023 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void) > > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > > else > > qovld_calc = qovld; > > + > > + rcu_lazy_init(); > > } > > > > #include "tree_stall.h" > > -- > > 2.36.0.550.gb090851708-goog > >
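[Editorial note: the static-branch idea Joel floats above might look roughly like the following. The key name and the point at which it is flipped are illustrative assumptions, not part of the posted patch.]

```c
#include <linux/jump_label.h>

/* Hypothetical key name for illustration only. */
static DEFINE_STATIC_KEY_FALSE(rcu_lazy_active);

/* Flipped once timers and workqueues are usable, e.g. late in RCU boot. */
void rcu_lazy_set_active(void)
{
	static_branch_enable(&rcu_lazy_active);
}

void call_rcu_lazy(struct rcu_head *rhp, rcu_callback_t func)
{
	/* Until the key is flipped, behave exactly like call_rcu(). */
	if (!static_branch_likely(&rcu_lazy_active)) {
		call_rcu(rhp, func);
		return;
	}

	/* ... normal lazy queueing path as in the patch ... */
}
```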
On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote: > Hi Paul, > > Thanks for bearing with my slightly late reply, I did "time blocking" > technique to work on RCU today! ;-) > > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote: > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > > > Implement timer-based RCU callback batching. The batch is flushed > > > whenever a certain amount of time has passed, or the batch on a > > > particular CPU grows too big. Also memory pressure can flush it. > > > > > > Locking is avoided to reduce lock contention when queuing and dequeuing > > > happens on different CPUs of a per-cpu list, such as when shrinker > > > context is running on different CPU. Also not having to use locks keeps > > > the per-CPU structure size small. > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > It is very good to see this! Inevitable comments and questions below. > > Thanks for taking a look! > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > index bf8e341e75b4..c09715079829 100644 > > > --- a/kernel/rcu/Kconfig > > > +++ b/kernel/rcu/Kconfig > > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB > > > Say N here if you hate read-side memory barriers. > > > Take the default if you are unsure. > > > > > > +config RCU_LAZY > > > + bool "RCU callback lazy invocation functionality" > > > + depends on RCU_NOCB_CPU > > > + default y > > > > This "default y" is OK for experimentation, but for mainline we should > > not be forcing this on unsuspecting people. ;-) > > Agreed. I'll default 'n' :) > > > > + help > > > + To save power, batch RCU callbacks and flush after delay, memory > > > + pressure or callback list growing too big. > > > + > > > endmenu # "RCU Subsystem" > > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile > > > index 0cfb009a99b9..8968b330d6e0 100644 > > > --- a/kernel/rcu/Makefile > > > +++ b/kernel/rcu/Makefile > > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o > > > obj-$(CONFIG_TREE_RCU) += tree.o > > > obj-$(CONFIG_TINY_RCU) += tiny.o > > > obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o > > > +obj-$(CONFIG_RCU_LAZY) += lazy.o > > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c > > > new file mode 100644 > > > index 000000000000..55e406cfc528 > > > --- /dev/null > > > +++ b/kernel/rcu/lazy.c > > > @@ -0,0 +1,145 @@ > > > +/* > > > + * Lockless lazy-RCU implementation. > > > + */ > > > +#include <linux/rcupdate.h> > > > +#include <linux/shrinker.h> > > > +#include <linux/workqueue.h> > > > +#include "rcu.h" > > > + > > > +// How much to batch before flushing? > > > +#define MAX_LAZY_BATCH 2048 > > > + > > > +// How much to wait before flushing? > > > +#define MAX_LAZY_JIFFIES 10000 > > > > That is more than a minute on a HZ=10 system. Are you sure that you > > did not mean "(10 * HZ)" or some such? > > Yes, you are right. I need to change that to be constant regardless of HZ. I > will make the change as you suggest. > > > > + > > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while > > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON > > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size. 
> > > +struct lazy_rcu_head { > > > + struct llist_node llist_node; > > > + void (*func)(struct callback_head *head); > > > +} __attribute__((aligned(sizeof(void *)))); > > > > This needs a build-time check that rcu_head and lazy_rcu_head are of > > the same size. Maybe something like this in some appropriate context: > > > > BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy)); > > > > Never mind! I see you have this in rcu_init_lazy(). Plus I now see that > > you also mention this in the above comments. ;-) > > Cool, great minds think alike! ;-) > > > > + > > > +struct rcu_lazy_pcp { > > > + struct llist_head head; > > > + struct delayed_work work; > > > + atomic_t count; > > > +}; > > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); > > > + > > > +// Lockless flush of CPU, can be called concurrently. > > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) > > > +{ > > > + struct llist_node *node = llist_del_all(&rlp->head); > > > + struct lazy_rcu_head *cursor, *temp; > > > + > > > + if (!node) > > > + return; > > > > At this point, the list is empty but the count is non-zero. Can > > that cause a problem? (For the existing callback lists, this would > > be OK.) > > > > > + llist_for_each_entry_safe(cursor, temp, node, llist_node) { > > > + struct rcu_head *rh = (struct rcu_head *)cursor; > > > + debug_rcu_head_unqueue(rh); > > > > Good to see this check! > > > > > + call_rcu(rh, rh->func); > > > + atomic_dec(&rlp->count); > > > + } > > > +} > > > + > > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > > > +{ > > > + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > > > + struct rcu_lazy_pcp *rlp; > > > + > > > + preempt_disable(); > > > + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); > > > > Whitespace issue, please fix. > > Fixed, thanks. > > > > + preempt_enable(); > > > + > > > + if (debug_rcu_head_queue((void *)head)) { > > > + // Probable double call_rcu(), just leak. > > > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > > > + __func__, head); > > > + > > > + // Mark as success and leave. > > > + return; > > > + } > > > + > > > + // Queue to per-cpu llist > > > + head->func = func; > > > + llist_add(&head->llist_node, &rlp->head); > > > > Suppose that there are a bunch of preemptions between the preempt_enable() > > above and this point, so that the current CPU's list has lots of > > callbacks, but zero ->count. Can that cause a problem? > > > > In the past, this sort of thing has been an issue for rcu_barrier() > > and friends. > > Thanks, I think I dropped the ball on this. You have given me something to > think about. I am thinking on first thought that setting the count in advance > of populating the list should do the trick. I will look more on it. That can work, but don't forget the need for proper memory ordering. Another approach would be to what the current bypass list does, namely count the callback in the ->cblist structure. > > > + // Flush queue if too big > > > > You will also need to check for early boot use. > > > > I -think- it suffice to simply skip the following "if" statement when > > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices. The reason being > > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT > > kernels won't expire them until the softirq kthreads have been spawned. > > > > Which is OK, as it just means that call_rcu_lazy() is a bit more > > lazy than expected that early. 
> > > > Except that call_rcu() can be invoked even before rcu_init() has been > > invoked, which is therefore also before rcu_init_lazy() has been invoked. > > In other words, you are concerned that too many lazy callbacks might be > pending before rcu_init() is called? In other words, I am concerned that bad things might happen if fields in a rcu_lazy_pcp structure are used before they are initialized. I am not worried about too many lazy callbacks before rcu_init() because the non-lazy callbacks (which these currently are) are allowed to pile up until RCU's grace-period kthreads have been spawned. There might come a time when RCU callbacks need to be invoked earlier, but that will be a separate problem to solve when and if, but with the benefit of the additional information derived from seeing the problem actually happen. > I am going through the kfree_rcu() threads/patches involving > RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before > the scheduler is running causes a crash or warnings? There are a lot of different issues that arise in different phases of boot. In this particular case, my primary concern is the use-before-initialized bug. > > I thefore suggest something like this at the very start of this function: > > > > if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) > > call_rcu(head_rcu, func); > > > > The goal is that people can replace call_rcu() with call_rcu_lazy() > > without having to worry about invocation during early boot. > > Yes, this seems safer. I don't expect much power savings during system boot > process anyway ;-). I believe perhaps a static branch would work better to > take a branch out from what is likely a fast path. A static branch would be fine. Maybe encapsulate it in a static inline function for all such comparisons, but most such comparisons are far from anything resembling a fastpath, so the main potential benefit would be added readability. Which could be a compelling reason in and of itself. Thanx, Paul > > Huh. "head_rcu"? Why not "rhp"? Or follow call_rcu() with "head", > > though "rhp" is more consistent with the RCU pointer initials approach. > > Fixed, thanks. > > > > + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { > > > + lazy_rcu_flush_cpu(rlp); > > > + } else { > > > + if (!delayed_work_pending(&rlp->work)) { > > > > This check is racy because the work might run to completion right at > > this point. Wouldn't it be better to rely on the internal check of > > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()? > > Oops, agreed. Will make it as you suggest. 
> > thanks, > > - Joel > > > > + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); > > > + } > > > + } > > > +} > > > + > > > +static unsigned long > > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > > > +{ > > > + unsigned long count = 0; > > > + int cpu; > > > + > > > + /* Snapshot count of all CPUs */ > > > + for_each_possible_cpu(cpu) { > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > + > > > + count += atomic_read(&rlp->count); > > > + } > > > + > > > + return count; > > > +} > > > + > > > +static unsigned long > > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > > > +{ > > > + int cpu, freed = 0; > > > + > > > + for_each_possible_cpu(cpu) { > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > + unsigned long count; > > > + > > > + count = atomic_read(&rlp->count); > > > + lazy_rcu_flush_cpu(rlp); > > > + sc->nr_to_scan -= count; > > > + freed += count; > > > + if (sc->nr_to_scan <= 0) > > > + break; > > > + } > > > + > > > + return freed == 0 ? SHRINK_STOP : freed; > > > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate > > potential deadlocks. But this pattern is common, including on the > > kvfree_rcu() path, so OK! ;-) > > > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is > > that as well. > > > > > +} > > > + > > > +/* > > > + * This function is invoked after MAX_LAZY_JIFFIES timeout. > > > + */ > > > +static void lazy_work(struct work_struct *work) > > > +{ > > > + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); > > > + > > > + lazy_rcu_flush_cpu(rlp); > > > +} > > > + > > > +static struct shrinker lazy_rcu_shrinker = { > > > + .count_objects = lazy_rcu_shrink_count, > > > + .scan_objects = lazy_rcu_shrink_scan, > > > + .batch = 0, > > > + .seeks = DEFAULT_SEEKS, > > > +}; > > > + > > > +void __init rcu_lazy_init(void) > > > +{ > > > + int cpu; > > > + > > > + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); > > > + > > > + for_each_possible_cpu(cpu) { > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > + INIT_DELAYED_WORK(&rlp->work, lazy_work); > > > + } > > > + > > > + if (register_shrinker(&lazy_rcu_shrinker)) > > > + pr_err("Failed to register lazy_rcu shrinker!\n"); > > > +} > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > index 24b5f2c2de87..a5f4b44f395f 100644 > > > --- a/kernel/rcu/rcu.h > > > +++ b/kernel/rcu/rcu.h > > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); > > > static inline void show_rcu_tasks_trace_gp_kthread(void) {} > > > #endif > > > > > > +#ifdef CONFIG_RCU_LAZY > > > +void rcu_lazy_init(void); > > > +#else > > > +static inline void rcu_lazy_init(void) {} > > > +#endif > > > #endif /* __LINUX_RCU_H */ > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > index a4c25a6283b0..ebdf6f7c9023 100644 > > > --- a/kernel/rcu/tree.c > > > +++ b/kernel/rcu/tree.c > > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void) > > > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > > > else > > > qovld_calc = qovld; > > > + > > > + rcu_lazy_init(); > > > } > > > > > > #include "tree_stall.h" > > > -- > > > 2.36.0.550.gb090851708-goog > > >
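[Editorial note: a sketch of the "count before list" ordering discussed above, assuming the per-CPU structure from the patch; the helper name lazy_rcu_queue() is invented for illustration. The barrier choice is only a starting point, and whether this suffices for an rcu_barrier()-style sampler would still need the careful review Paul calls for.]

```c
/*
 * Enqueue side only: bump ->count before publishing the node, so a
 * concurrent sampler (rcu_barrier()-style code or the shrinker) can
 * over-count but never under-count. The flush side keeps decrementing
 * ->count only after each callback has been handed to call_rcu().
 */
static void lazy_rcu_queue(struct rcu_lazy_pcp *rlp, struct lazy_rcu_head *head)
{
	atomic_inc(&rlp->count);
	/* Order the count update before the node becomes visible. */
	smp_mb__after_atomic();
	llist_add(&head->llist_node, &rlp->head);

	if (atomic_read(&rlp->count) >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
}
```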
Some extra but not overlapping comments with already mentioned in this email thread: > Implement timer-based RCU callback batching. The batch is flushed > whenever a certain amount of time has passed, or the batch on a > particular CPU grows too big. Also memory pressure can flush it. > > Locking is avoided to reduce lock contention when queuing and dequeuing > happens on different CPUs of a per-cpu list, such as when shrinker > context is running on different CPU. Also not having to use locks keeps > the per-CPU structure size small. > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > include/linux/rcupdate.h | 6 ++ > kernel/rcu/Kconfig | 8 +++ > kernel/rcu/Makefile | 1 + > kernel/rcu/lazy.c | 145 +++++++++++++++++++++++++++++++++++++++ > kernel/rcu/rcu.h | 5 ++ > kernel/rcu/tree.c | 2 + > 6 files changed, 167 insertions(+) > create mode 100644 kernel/rcu/lazy.c > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 88b42eb46406..d0a6c4f5172c 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void) > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > +#ifdef CONFIG_RCU_LAZY > +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func); > +#else > +#define call_rcu_lazy(head, func) call_rcu(head, func) > +#endif > + > /* Internal to kernel */ > void rcu_init(void); > extern int rcu_scheduler_active __read_mostly; > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > index bf8e341e75b4..c09715079829 100644 > --- a/kernel/rcu/Kconfig > +++ b/kernel/rcu/Kconfig > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB > Say N here if you hate read-side memory barriers. > Take the default if you are unsure. > > +config RCU_LAZY > + bool "RCU callback lazy invocation functionality" > + depends on RCU_NOCB_CPU > + default y > + help > + To save power, batch RCU callbacks and flush after delay, memory > + pressure or callback list growing too big. > + > endmenu # "RCU Subsystem" > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile > index 0cfb009a99b9..8968b330d6e0 100644 > --- a/kernel/rcu/Makefile > +++ b/kernel/rcu/Makefile > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o > obj-$(CONFIG_TREE_RCU) += tree.o > obj-$(CONFIG_TINY_RCU) += tiny.o > obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o > +obj-$(CONFIG_RCU_LAZY) += lazy.o > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c > new file mode 100644 > index 000000000000..55e406cfc528 > --- /dev/null > +++ b/kernel/rcu/lazy.c > @@ -0,0 +1,145 @@ > +/* > + * Lockless lazy-RCU implementation. > + */ > +#include <linux/rcupdate.h> > +#include <linux/shrinker.h> > +#include <linux/workqueue.h> > +#include "rcu.h" > + > +// How much to batch before flushing? > +#define MAX_LAZY_BATCH 2048 > + > +// How much to wait before flushing? > +#define MAX_LAZY_JIFFIES 10000 > + > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON > +// later to ensure that rcu_head and lazy_rcu_head are of the same size. > +struct lazy_rcu_head { > + struct llist_node llist_node; > + void (*func)(struct callback_head *head); > +} __attribute__((aligned(sizeof(void *)))); > + > +struct rcu_lazy_pcp { > + struct llist_head head; > + struct delayed_work work; > + atomic_t count; > +}; > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); > + > +// Lockless flush of CPU, can be called concurrently. 
> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) > +{ > + struct llist_node *node = llist_del_all(&rlp->head); > + struct lazy_rcu_head *cursor, *temp; > + > + if (!node) > + return; > + > + llist_for_each_entry_safe(cursor, temp, node, llist_node) { > + struct rcu_head *rh = (struct rcu_head *)cursor; > + debug_rcu_head_unqueue(rh); > + call_rcu(rh, rh->func); > + atomic_dec(&rlp->count); > + } > +} > + > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > +{ > + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > + struct rcu_lazy_pcp *rlp; > + > + preempt_disable(); > + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); > + preempt_enable(); > Can we get rid of such explicit disabling/enabling preemption? > + > + if (debug_rcu_head_queue((void *)head)) { > + // Probable double call_rcu(), just leak. > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > + __func__, head); > + > + // Mark as success and leave. > + return; > + } > + > + // Queue to per-cpu llist > + head->func = func; > + llist_add(&head->llist_node, &rlp->head); > + > + // Flush queue if too big > + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { > + lazy_rcu_flush_cpu(rlp); > Can we just schedule the work instead of drawn from the caller context? For example it can be a hard-irq context. > + } else { > + if (!delayed_work_pending(&rlp->work)) { > + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); > + } > + } > +} EXPORT_SYMBOL_GPL()? to be able to use in kernel modules. > + > +static unsigned long > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > +{ > + unsigned long count = 0; > + int cpu; > + > + /* Snapshot count of all CPUs */ > + for_each_possible_cpu(cpu) { > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > + > + count += atomic_read(&rlp->count); > + } > + > + return count; > +} > + > +static unsigned long > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > +{ > + int cpu, freed = 0; > + > + for_each_possible_cpu(cpu) { > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > + unsigned long count; > + > + count = atomic_read(&rlp->count); > + lazy_rcu_flush_cpu(rlp); > + sc->nr_to_scan -= count; > + freed += count; > + if (sc->nr_to_scan <= 0) > + break; > + } > + > + return freed == 0 ? SHRINK_STOP : freed; > +} > + > +/* > + * This function is invoked after MAX_LAZY_JIFFIES timeout. 
> + */ > +static void lazy_work(struct work_struct *work) > +{ > + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); > + > + lazy_rcu_flush_cpu(rlp); > +} > + > +static struct shrinker lazy_rcu_shrinker = { > + .count_objects = lazy_rcu_shrink_count, > + .scan_objects = lazy_rcu_shrink_scan, > + .batch = 0, > + .seeks = DEFAULT_SEEKS, > +}; > + > +void __init rcu_lazy_init(void) > +{ > + int cpu; > + > + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); > + > + for_each_possible_cpu(cpu) { > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > + INIT_DELAYED_WORK(&rlp->work, lazy_work); > + } > + > + if (register_shrinker(&lazy_rcu_shrinker)) > + pr_err("Failed to register lazy_rcu shrinker!\n"); > +} > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 24b5f2c2de87..a5f4b44f395f 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); > static inline void show_rcu_tasks_trace_gp_kthread(void) {} > #endif > > +#ifdef CONFIG_RCU_LAZY > +void rcu_lazy_init(void); > +#else > +static inline void rcu_lazy_init(void) {} > +#endif > #endif /* __LINUX_RCU_H */ > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index a4c25a6283b0..ebdf6f7c9023 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4775,6 +4775,8 @@ void __init rcu_init(void) > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > else > qovld_calc = qovld; > + > + rcu_lazy_init(); > } > > #include "tree_stall.h" > -- > 2.36.0.550.gb090851708-goog >
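[Editorial note: the two smaller suggestions above might translate into the condensed sketch below: raw_cpu_ptr() in place of the explicit preempt_disable()/preempt_enable() pair (occasionally landing on a neighbouring CPU's list is harmless for this use), and deferring the over-limit flush to the workqueue rather than draining a potentially long list from what may be hard-irq context. The double-free check and early-boot handling discussed earlier are omitted for brevity, and the zero-delay mod_delayed_work() call is an assumption for illustration.]

```c
void call_rcu_lazy(struct rcu_head *rhp, rcu_callback_t func)
{
	struct lazy_rcu_head *head = (struct lazy_rcu_head *)rhp;
	/* raw_cpu_ptr(): no preempt_disable()/preempt_enable() pair needed. */
	struct rcu_lazy_pcp *rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);

	head->func = func;
	llist_add(&head->llist_node, &rlp->head);

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		/* Flush soon, but from workqueue context, not from here. */
		mod_delayed_work(system_wq, &rlp->work, 0);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
}
EXPORT_SYMBOL_GPL(call_rcu_lazy);
```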
On Sat, May 14, 2022 at 09:34:21AM -0700, Paul E. McKenney wrote: > On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote: > > Hi Paul, > > > > Thanks for bearing with my slightly late reply, I did "time blocking" > > technique to work on RCU today! ;-) > > > > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote: > > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > > > > Implement timer-based RCU callback batching. The batch is flushed > > > > whenever a certain amount of time has passed, or the batch on a > > > > particular CPU grows too big. Also memory pressure can flush it. > > > > > > > > Locking is avoided to reduce lock contention when queuing and dequeuing > > > > happens on different CPUs of a per-cpu list, such as when shrinker > > > > context is running on different CPU. Also not having to use locks keeps > > > > the per-CPU structure size small. > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > It is very good to see this! Inevitable comments and questions below. > > > > Thanks for taking a look! > > > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > > index bf8e341e75b4..c09715079829 100644 > > > > --- a/kernel/rcu/Kconfig > > > > +++ b/kernel/rcu/Kconfig > > > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB > > > > Say N here if you hate read-side memory barriers. > > > > Take the default if you are unsure. > > > > > > > > +config RCU_LAZY > > > > + bool "RCU callback lazy invocation functionality" > > > > + depends on RCU_NOCB_CPU > > > > + default y > > > > > > This "default y" is OK for experimentation, but for mainline we should > > > not be forcing this on unsuspecting people. ;-) > > > > Agreed. I'll default 'n' :) > > > > > > + help > > > > + To save power, batch RCU callbacks and flush after delay, memory > > > > + pressure or callback list growing too big. > > > > + > > > > endmenu # "RCU Subsystem" > > > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile > > > > index 0cfb009a99b9..8968b330d6e0 100644 > > > > --- a/kernel/rcu/Makefile > > > > +++ b/kernel/rcu/Makefile > > > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o > > > > obj-$(CONFIG_TREE_RCU) += tree.o > > > > obj-$(CONFIG_TINY_RCU) += tiny.o > > > > obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o > > > > +obj-$(CONFIG_RCU_LAZY) += lazy.o > > > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c > > > > new file mode 100644 > > > > index 000000000000..55e406cfc528 > > > > --- /dev/null > > > > +++ b/kernel/rcu/lazy.c > > > > @@ -0,0 +1,145 @@ > > > > +/* > > > > + * Lockless lazy-RCU implementation. > > > > + */ > > > > +#include <linux/rcupdate.h> > > > > +#include <linux/shrinker.h> > > > > +#include <linux/workqueue.h> > > > > +#include "rcu.h" > > > > + > > > > +// How much to batch before flushing? > > > > +#define MAX_LAZY_BATCH 2048 > > > > + > > > > +// How much to wait before flushing? > > > > +#define MAX_LAZY_JIFFIES 10000 > > > > > > That is more than a minute on a HZ=10 system. Are you sure that you > > > did not mean "(10 * HZ)" or some such? > > > > Yes, you are right. I need to change that to be constant regardless of HZ. I > > will make the change as you suggest. > > > > > > + > > > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while > > > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON > > > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size. 
> > > > +struct lazy_rcu_head { > > > > + struct llist_node llist_node; > > > > + void (*func)(struct callback_head *head); > > > > +} __attribute__((aligned(sizeof(void *)))); > > > > > > This needs a build-time check that rcu_head and lazy_rcu_head are of > > > the same size. Maybe something like this in some appropriate context: > > > > > > BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy)); > > > > > > Never mind! I see you have this in rcu_init_lazy(). Plus I now see that > > > you also mention this in the above comments. ;-) > > > > Cool, great minds think alike! ;-) > > > > > > + > > > > +struct rcu_lazy_pcp { > > > > + struct llist_head head; > > > > + struct delayed_work work; > > > > + atomic_t count; > > > > +}; > > > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); > > > > + > > > > +// Lockless flush of CPU, can be called concurrently. > > > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) > > > > +{ > > > > + struct llist_node *node = llist_del_all(&rlp->head); > > > > + struct lazy_rcu_head *cursor, *temp; > > > > + > > > > + if (!node) > > > > + return; > > > > > > At this point, the list is empty but the count is non-zero. Can > > > that cause a problem? (For the existing callback lists, this would > > > be OK.) > > > > > > > + llist_for_each_entry_safe(cursor, temp, node, llist_node) { > > > > + struct rcu_head *rh = (struct rcu_head *)cursor; > > > > + debug_rcu_head_unqueue(rh); > > > > > > Good to see this check! > > > > > > > + call_rcu(rh, rh->func); > > > > + atomic_dec(&rlp->count); > > > > + } > > > > +} > > > > + > > > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > > > > +{ > > > > + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > > > > + struct rcu_lazy_pcp *rlp; > > > > + > > > > + preempt_disable(); > > > > + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); > > > > > > Whitespace issue, please fix. > > > > Fixed, thanks. > > > > > > + preempt_enable(); > > > > + > > > > + if (debug_rcu_head_queue((void *)head)) { > > > > + // Probable double call_rcu(), just leak. > > > > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > > > > + __func__, head); > > > > + > > > > + // Mark as success and leave. > > > > + return; > > > > + } > > > > + > > > > + // Queue to per-cpu llist > > > > + head->func = func; > > > > + llist_add(&head->llist_node, &rlp->head); > > > > > > Suppose that there are a bunch of preemptions between the preempt_enable() > > > above and this point, so that the current CPU's list has lots of > > > callbacks, but zero ->count. Can that cause a problem? > > > > > > In the past, this sort of thing has been an issue for rcu_barrier() > > > and friends. > > > > Thanks, I think I dropped the ball on this. You have given me something to > > think about. I am thinking on first thought that setting the count in advance > > of populating the list should do the trick. I will look more on it. > > That can work, but don't forget the need for proper memory ordering. > > Another approach would be to what the current bypass list does, namely > count the callback in the ->cblist structure. I have been thinking about this, and I also arrived at the same conclusion that it is easier to make it a segcblist structure which has the functions do the memory ordering. Then we can make rcu_barrier() locklessly sample the ->len of the new per-cpu structure, I *think* it might work. 
However, I was actually considering another situation outside of rcu_barrier, where the preemptions you mentioned would cause the list to appear to be empty if one were to rely on count, but actually have a lot of stuff queued. This causes shrinkers to not know that there is a lot of memory available to free. I don't think its that big an issue, if we can assume the caller's process context to have sufficient priority. Obviously using a raw spinlock makes the process context to be highest priority briefly but without the lock, we don't get that. But yeah that can be sorted by just updating the count proactively, and then doing the queuing operations. Another advantage of using the segcblist structure, is, we can also record the grace period numbers in it, and avoid triggering RCU from the timer if gp already elapsed (similar to the rcu-poll family of functions). WDYT? thanks, - Joel > > > > + // Flush queue if too big > > > > > > You will also need to check for early boot use. > > > > > > I -think- it suffice to simply skip the following "if" statement when > > > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices. The reason being > > > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT > > > kernels won't expire them until the softirq kthreads have been spawned. > > > > > > Which is OK, as it just means that call_rcu_lazy() is a bit more > > > lazy than expected that early. > > > > > > Except that call_rcu() can be invoked even before rcu_init() has been > > > invoked, which is therefore also before rcu_init_lazy() has been invoked. > > > > In other words, you are concerned that too many lazy callbacks might be > > pending before rcu_init() is called? > > In other words, I am concerned that bad things might happen if fields > in a rcu_lazy_pcp structure are used before they are initialized. > > I am not worried about too many lazy callbacks before rcu_init() because > the non-lazy callbacks (which these currently are) are allowed to pile > up until RCU's grace-period kthreads have been spawned. There might > come a time when RCU callbacks need to be invoked earlier, but that will > be a separate problem to solve when and if, but with the benefit of the > additional information derived from seeing the problem actually happen. > > > I am going through the kfree_rcu() threads/patches involving > > RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before > > the scheduler is running causes a crash or warnings? > > There are a lot of different issues that arise in different phases > of boot. In this particular case, my primary concern is the > use-before-initialized bug. > > > > I thefore suggest something like this at the very start of this function: > > > > > > if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) > > > call_rcu(head_rcu, func); > > > > > > The goal is that people can replace call_rcu() with call_rcu_lazy() > > > without having to worry about invocation during early boot. > > > > Yes, this seems safer. I don't expect much power savings during system boot > > process anyway ;-). I believe perhaps a static branch would work better to > > take a branch out from what is likely a fast path. > > A static branch would be fine. Maybe encapsulate it in a static inline > function for all such comparisons, but most such comparisons are far from > anything resembling a fastpath, so the main potential benefit would be > added readability. Which could be a compelling reason in and of itself. > > Thanx, Paul > > > > Huh. "head_rcu"? 
Why not "rhp"? Or follow call_rcu() with "head", > > > though "rhp" is more consistent with the RCU pointer initials approach. > > > > Fixed, thanks. > > > > > > + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { > > > > + lazy_rcu_flush_cpu(rlp); > > > > + } else { > > > > + if (!delayed_work_pending(&rlp->work)) { > > > > > > This check is racy because the work might run to completion right at > > > this point. Wouldn't it be better to rely on the internal check of > > > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()? > > > > Oops, agreed. Will make it as you suggest. > > > > thanks, > > > > - Joel > > > > > > + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); > > > > + } > > > > + } > > > > +} > > > > + > > > > +static unsigned long > > > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > > > > +{ > > > > + unsigned long count = 0; > > > > + int cpu; > > > > + > > > > + /* Snapshot count of all CPUs */ > > > > + for_each_possible_cpu(cpu) { > > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > > + > > > > + count += atomic_read(&rlp->count); > > > > + } > > > > + > > > > + return count; > > > > +} > > > > + > > > > +static unsigned long > > > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > > > > +{ > > > > + int cpu, freed = 0; > > > > + > > > > + for_each_possible_cpu(cpu) { > > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > > + unsigned long count; > > > > + > > > > + count = atomic_read(&rlp->count); > > > > + lazy_rcu_flush_cpu(rlp); > > > > + sc->nr_to_scan -= count; > > > > + freed += count; > > > > + if (sc->nr_to_scan <= 0) > > > > + break; > > > > + } > > > > + > > > > + return freed == 0 ? SHRINK_STOP : freed; > > > > > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate > > > potential deadlocks. But this pattern is common, including on the > > > kvfree_rcu() path, so OK! ;-) > > > > > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is > > > that as well. > > > > > > > +} > > > > + > > > > +/* > > > > + * This function is invoked after MAX_LAZY_JIFFIES timeout. 
> > > > + */ > > > > +static void lazy_work(struct work_struct *work) > > > > +{ > > > > + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); > > > > + > > > > + lazy_rcu_flush_cpu(rlp); > > > > +} > > > > + > > > > +static struct shrinker lazy_rcu_shrinker = { > > > > + .count_objects = lazy_rcu_shrink_count, > > > > + .scan_objects = lazy_rcu_shrink_scan, > > > > + .batch = 0, > > > > + .seeks = DEFAULT_SEEKS, > > > > +}; > > > > + > > > > +void __init rcu_lazy_init(void) > > > > +{ > > > > + int cpu; > > > > + > > > > + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); > > > > + > > > > + for_each_possible_cpu(cpu) { > > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > > + INIT_DELAYED_WORK(&rlp->work, lazy_work); > > > > + } > > > > + > > > > + if (register_shrinker(&lazy_rcu_shrinker)) > > > > + pr_err("Failed to register lazy_rcu shrinker!\n"); > > > > +} > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > index 24b5f2c2de87..a5f4b44f395f 100644 > > > > --- a/kernel/rcu/rcu.h > > > > +++ b/kernel/rcu/rcu.h > > > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); > > > > static inline void show_rcu_tasks_trace_gp_kthread(void) {} > > > > #endif > > > > > > > > +#ifdef CONFIG_RCU_LAZY > > > > +void rcu_lazy_init(void); > > > > +#else > > > > +static inline void rcu_lazy_init(void) {} > > > > +#endif > > > > #endif /* __LINUX_RCU_H */ > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > index a4c25a6283b0..ebdf6f7c9023 100644 > > > > --- a/kernel/rcu/tree.c > > > > +++ b/kernel/rcu/tree.c > > > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void) > > > > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > > > > else > > > > qovld_calc = qovld; > > > > + > > > > + rcu_lazy_init(); > > > > } > > > > > > > > #include "tree_stall.h" > > > > -- > > > > 2.36.0.550.gb090851708-goog > > > >
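[Editorial note: a very rough sketch of the segcblist direction Joel describes above, in which ->len is maintained by the rcu_segcblist helpers with the ordering that rcu_barrier() and the shrinker would need. The segcblist list operations are not lockless, so this variant takes a per-CPU lock around the enqueue; the structure layout, lock, and names are assumptions for discussion, not the posted patch, and recording grace-period numbers in the segments is not shown.]

```c
#include <linux/spinlock.h>
#include "rcu_segcblist.h"

struct rcu_lazy_pcp {
	struct rcu_segcblist	cblist;		/* ->len maintained by the helpers */
	raw_spinlock_t		lock;
	struct delayed_work	work;
};
static DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);

/*
 * rcu_lazy_init() would rcu_segcblist_init() and raw_spin_lock_init()
 * each per-CPU instance (not shown).
 */
void call_rcu_lazy(struct rcu_head *rhp, rcu_callback_t func)
{
	struct rcu_lazy_pcp *rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);
	unsigned long flags;
	long len;

	rhp->func = func;
	raw_spin_lock_irqsave(&rlp->lock, flags);
	rcu_segcblist_enqueue(&rlp->cblist, rhp);	/* also bumps ->len */
	len = rcu_segcblist_n_cbs(&rlp->cblist);
	raw_spin_unlock_irqrestore(&rlp->lock, flags);

	if (len >= MAX_LAZY_BATCH)
		mod_delayed_work(system_wq, &rlp->work, 0);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
}

/* rcu_barrier()-style code or the shrinker can now sample locklessly: */
static long lazy_rcu_pending(int cpu)
{
	return rcu_segcblist_n_cbs(&per_cpu_ptr(&rcu_lazy_pcp_ins, cpu)->cblist);
}
```

The tradeoff, as the thread notes, is reintroducing a lock on the enqueue path in exchange for the counting and ordering guarantees the segcblist code already provides.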
On Fri, May 27, 2022 at 11:12:03PM +0000, Joel Fernandes wrote: > On Sat, May 14, 2022 at 09:34:21AM -0700, Paul E. McKenney wrote: > > On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote: > > > Hi Paul, > > > > > > Thanks for bearing with my slightly late reply, I did "time blocking" > > > technique to work on RCU today! ;-) > > > > > > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote: > > > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > > > > > Implement timer-based RCU callback batching. The batch is flushed > > > > > whenever a certain amount of time has passed, or the batch on a > > > > > particular CPU grows too big. Also memory pressure can flush it. > > > > > > > > > > Locking is avoided to reduce lock contention when queuing and dequeuing > > > > > happens on different CPUs of a per-cpu list, such as when shrinker > > > > > context is running on different CPU. Also not having to use locks keeps > > > > > the per-CPU structure size small. > > > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > It is very good to see this! Inevitable comments and questions below. > > > > > > Thanks for taking a look! > > > > > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > > > index bf8e341e75b4..c09715079829 100644 > > > > > --- a/kernel/rcu/Kconfig > > > > > +++ b/kernel/rcu/Kconfig > > > > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB > > > > > Say N here if you hate read-side memory barriers. > > > > > Take the default if you are unsure. > > > > > > > > > > +config RCU_LAZY > > > > > + bool "RCU callback lazy invocation functionality" > > > > > + depends on RCU_NOCB_CPU > > > > > + default y > > > > > > > > This "default y" is OK for experimentation, but for mainline we should > > > > not be forcing this on unsuspecting people. ;-) > > > > > > Agreed. I'll default 'n' :) > > > > > > > > + help > > > > > + To save power, batch RCU callbacks and flush after delay, memory > > > > > + pressure or callback list growing too big. > > > > > + > > > > > endmenu # "RCU Subsystem" > > > > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile > > > > > index 0cfb009a99b9..8968b330d6e0 100644 > > > > > --- a/kernel/rcu/Makefile > > > > > +++ b/kernel/rcu/Makefile > > > > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o > > > > > obj-$(CONFIG_TREE_RCU) += tree.o > > > > > obj-$(CONFIG_TINY_RCU) += tiny.o > > > > > obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o > > > > > +obj-$(CONFIG_RCU_LAZY) += lazy.o > > > > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c > > > > > new file mode 100644 > > > > > index 000000000000..55e406cfc528 > > > > > --- /dev/null > > > > > +++ b/kernel/rcu/lazy.c > > > > > @@ -0,0 +1,145 @@ > > > > > +/* > > > > > + * Lockless lazy-RCU implementation. > > > > > + */ > > > > > +#include <linux/rcupdate.h> > > > > > +#include <linux/shrinker.h> > > > > > +#include <linux/workqueue.h> > > > > > +#include "rcu.h" > > > > > + > > > > > +// How much to batch before flushing? > > > > > +#define MAX_LAZY_BATCH 2048 > > > > > + > > > > > +// How much to wait before flushing? > > > > > +#define MAX_LAZY_JIFFIES 10000 > > > > > > > > That is more than a minute on a HZ=10 system. Are you sure that you > > > > did not mean "(10 * HZ)" or some such? > > > > > > Yes, you are right. I need to change that to be constant regardless of HZ. I > > > will make the change as you suggest. 
> > > > > > > > + > > > > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while > > > > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON > > > > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size. > > > > > +struct lazy_rcu_head { > > > > > + struct llist_node llist_node; > > > > > + void (*func)(struct callback_head *head); > > > > > +} __attribute__((aligned(sizeof(void *)))); > > > > > > > > This needs a build-time check that rcu_head and lazy_rcu_head are of > > > > the same size. Maybe something like this in some appropriate context: > > > > > > > > BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy)); > > > > > > > > Never mind! I see you have this in rcu_init_lazy(). Plus I now see that > > > > you also mention this in the above comments. ;-) > > > > > > Cool, great minds think alike! ;-) > > > > > > > > + > > > > > +struct rcu_lazy_pcp { > > > > > + struct llist_head head; > > > > > + struct delayed_work work; > > > > > + atomic_t count; > > > > > +}; > > > > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); > > > > > + > > > > > +// Lockless flush of CPU, can be called concurrently. > > > > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) > > > > > +{ > > > > > + struct llist_node *node = llist_del_all(&rlp->head); > > > > > + struct lazy_rcu_head *cursor, *temp; > > > > > + > > > > > + if (!node) > > > > > + return; > > > > > > > > At this point, the list is empty but the count is non-zero. Can > > > > that cause a problem? (For the existing callback lists, this would > > > > be OK.) > > > > > > > > > + llist_for_each_entry_safe(cursor, temp, node, llist_node) { > > > > > + struct rcu_head *rh = (struct rcu_head *)cursor; > > > > > + debug_rcu_head_unqueue(rh); > > > > > > > > Good to see this check! > > > > > > > > > + call_rcu(rh, rh->func); > > > > > + atomic_dec(&rlp->count); > > > > > + } > > > > > +} > > > > > + > > > > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > > > > > +{ > > > > > + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > > > > > + struct rcu_lazy_pcp *rlp; > > > > > + > > > > > + preempt_disable(); > > > > > + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); > > > > > > > > Whitespace issue, please fix. > > > > > > Fixed, thanks. > > > > > > > > + preempt_enable(); > > > > > + > > > > > + if (debug_rcu_head_queue((void *)head)) { > > > > > + // Probable double call_rcu(), just leak. > > > > > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > > > > > + __func__, head); > > > > > + > > > > > + // Mark as success and leave. > > > > > + return; > > > > > + } > > > > > + > > > > > + // Queue to per-cpu llist > > > > > + head->func = func; > > > > > + llist_add(&head->llist_node, &rlp->head); > > > > > > > > Suppose that there are a bunch of preemptions between the preempt_enable() > > > > above and this point, so that the current CPU's list has lots of > > > > callbacks, but zero ->count. Can that cause a problem? > > > > > > > > In the past, this sort of thing has been an issue for rcu_barrier() > > > > and friends. > > > > > > Thanks, I think I dropped the ball on this. You have given me something to > > > think about. I am thinking on first thought that setting the count in advance > > > of populating the list should do the trick. I will look more on it. > > > > That can work, but don't forget the need for proper memory ordering. 
> > > > Another approach would be to what the current bypass list does, namely > > count the callback in the ->cblist structure. > > I have been thinking about this, and I also arrived at the same conclusion > that it is easier to make it a segcblist structure which has the functions do > the memory ordering. Then we can make rcu_barrier() locklessly sample the > ->len of the new per-cpu structure, I *think* it might work. > > However, I was actually considering another situation outside of rcu_barrier, > where the preemptions you mentioned would cause the list to appear to be > empty if one were to rely on count, but actually have a lot of stuff queued. > This causes shrinkers to not know that there is a lot of memory available to > free. I don't think its that big an issue, if we can assume the caller's > process context to have sufficient priority. Obviously using a raw spinlock > makes the process context to be highest priority briefly but without the > lock, we don't get that. But yeah that can be sorted by just updating the > count proactively, and then doing the queuing operations. > > Another advantage of using the segcblist structure, is, we can also record > the grace period numbers in it, and avoid triggering RCU from the timer if gp > already elapsed (similar to the rcu-poll family of functions). > > WDYT? What you are seeing is that the design space for call_rcu_lazy() is huge, and the tradeoffs between energy efficiency and complexity are not well understood. It is therefore necessary to quickly evaluate alternatives. Yes, you could simply implement all of them and test them, but that might you take longer than you would like. Plus you likely need to put in place more testing than rcutorture currently provides. Given what we have seen thus far, I think that the first step has to be to evaluate what can be obtained with this approach compared with what can be obtained with other approaches. What we have right now, more or less, is this: o Offloading buys about 15%. o Slowing grace periods buys about another 7%, but requires userspace changes to prevent degrading user-visible performance. o The initial call_rcu_lazy() prototype buys about 1%. True, this is not apples to apples because the first two were measured on ChromeOS and this one on Android. Which means that apples to apples evaluation is also required. But this is the information I currently have at hand, and it is probably no more than a factor of two off of what would be seen on ChromeOS. Or is there some ChromeOS data that tells a different story? After all, for all I know, Android might still be expediting all normal grace periods. At which point, the question becomes "how to make up that 7%?" After all, it is not likely that anyone is going to leave that much battery lifetime on the table. Here are the possibilities that I currently see: o Slow down grace periods, and also bite the bullet and make userspace changes to speed up the RCU grace periods during critical operations. o Slow down grace periods, but leverage in-kernel changes for some of those operations. For example, RCU uses pm_notifier() to cause grace periods to be expedited during suspend and hibernation. The requests for expediting are atomically counted, so there can be other similar setups. o Choose a more moderate slowing down of the grace period so as to minimize (or maybe even eliminate) the need for the aforementioned userspace changes. This also leaves battery lifetime on the table, but considerably less of it. 
o Implement more selective slowdowns, such as call_rcu_lazy(). Other selective slowdowns include in-RCU heuristics such as forcing the current grace period to end once the entire system has gone idle. (There is prior work detecting full-system idleness for NO_HZ_FULL, but smaller systems could use more aggressive techniques.) I am sure that there are additional possibilities. Note that it is possible to do more than one of these, if that proves helpful. But given this, one question is "What is the most you can possibly obtain from call_rcu_lazy()?" If you have a platform with enough memory, one way to obtain an upper bound is to make call_rcu_lazy() simply leak the memory. If that amount of savings does not meet the need, then other higher-level approaches will be needed, whether as alternatives or as additions to the call_rcu_lazy() approach. Should call_rcu_lazy() prove to be part of the mix, here are a (very) few of the tradeoffs: o Put lazy callbacks on their own list or not? o If not, use the bypass list? If so, is it the case that call_rcu_lazy() is just call_rcu() on non-rcu_nocbs CPUs? Or add the complexity required to use the bypass list on rcu_nocbs CPUs but to use ->cblist otherwise? o If so: o Use segmented list with marked grace periods? Keep in mind that such a list can track only two grace periods. o Use a plain list and have grace-period start simply drain the list? o Does call_rcu_lazy() do anything to ensure that additional grace periods that exist only for the benefit of lazy callbacks are maximally shared among all CPUs' lazy callbacks? If so, what? (Keeping in mind that failing to share such grace periods burns battery lifetime.) o When there is ample memory, how long are lazy callbacks allowed to sleep? Forever? If not forever, what feeds into computing the timeout? o What is used to determine when memory is low, so that laziness has become a net negative? o What other conditions, if any, should prompt motivating lazy callbacks? (See above for the start of a grace period motivating lazy callbacks.) In short, you need to cast your net pretty wide on this one. It has not yet been very carefully explored, so there are likely to be surprises, maybe even good surprises. ;-) Thanx, Paul > thanks, > > - Joel > > > > > > + // Flush queue if too big > > > > > > > > You will also need to check for early boot use. > > > > > > > > I -think- it suffice to simply skip the following "if" statement when > > > > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices. The reason being > > > > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT > > > > kernels won't expire them until the softirq kthreads have been spawned. > > > > > > > > Which is OK, as it just means that call_rcu_lazy() is a bit more > > > > lazy than expected that early. > > > > > > > > Except that call_rcu() can be invoked even before rcu_init() has been > > > > invoked, which is therefore also before rcu_init_lazy() has been invoked. > > > > > > In other words, you are concerned that too many lazy callbacks might be > > > pending before rcu_init() is called? > > > > In other words, I am concerned that bad things might happen if fields > > in a rcu_lazy_pcp structure are used before they are initialized. > > > > I am not worried about too many lazy callbacks before rcu_init() because > > the non-lazy callbacks (which these currently are) are allowed to pile > > up until RCU's grace-period kthreads have been spawned. 
There might > > come a time when RCU callbacks need to be invoked earlier, but that will > > be a separate problem to solve when and if, but with the benefit of the > > additional information derived from seeing the problem actually happen. > > > > > I am going through the kfree_rcu() threads/patches involving > > > RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before > > > the scheduler is running causes a crash or warnings? > > > > There are a lot of different issues that arise in different phases > > of boot. In this particular case, my primary concern is the > > use-before-initialized bug. > > > > > > I thefore suggest something like this at the very start of this function: > > > > > > > > if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) > > > > call_rcu(head_rcu, func); > > > > > > > > The goal is that people can replace call_rcu() with call_rcu_lazy() > > > > without having to worry about invocation during early boot. > > > > > > Yes, this seems safer. I don't expect much power savings during system boot > > > process anyway ;-). I believe perhaps a static branch would work better to > > > take a branch out from what is likely a fast path. > > > > A static branch would be fine. Maybe encapsulate it in a static inline > > function for all such comparisons, but most such comparisons are far from > > anything resembling a fastpath, so the main potential benefit would be > > added readability. Which could be a compelling reason in and of itself. > > > > Thanx, Paul > > > > > > Huh. "head_rcu"? Why not "rhp"? Or follow call_rcu() with "head", > > > > though "rhp" is more consistent with the RCU pointer initials approach. > > > > > > Fixed, thanks. > > > > > > > > + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { > > > > > + lazy_rcu_flush_cpu(rlp); > > > > > + } else { > > > > > + if (!delayed_work_pending(&rlp->work)) { > > > > > > > > This check is racy because the work might run to completion right at > > > > this point. Wouldn't it be better to rely on the internal check of > > > > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()? > > > > > > Oops, agreed. Will make it as you suggest. > > > > > > thanks, > > > > > > - Joel > > > > > > > > + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); > > > > > + } > > > > > + } > > > > > +} > > > > > + > > > > > +static unsigned long > > > > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > > > > > +{ > > > > > + unsigned long count = 0; > > > > > + int cpu; > > > > > + > > > > > + /* Snapshot count of all CPUs */ > > > > > + for_each_possible_cpu(cpu) { > > > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > > > + > > > > > + count += atomic_read(&rlp->count); > > > > > + } > > > > > + > > > > > + return count; > > > > > +} > > > > > + > > > > > +static unsigned long > > > > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > > > > > +{ > > > > > + int cpu, freed = 0; > > > > > + > > > > > + for_each_possible_cpu(cpu) { > > > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > > > + unsigned long count; > > > > > + > > > > > + count = atomic_read(&rlp->count); > > > > > + lazy_rcu_flush_cpu(rlp); > > > > > + sc->nr_to_scan -= count; > > > > > + freed += count; > > > > > + if (sc->nr_to_scan <= 0) > > > > > + break; > > > > > + } > > > > > + > > > > > + return freed == 0 ? 
SHRINK_STOP : freed; > > > > > > > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate > > > > potential deadlocks. But this pattern is common, including on the > > > > kvfree_rcu() path, so OK! ;-) > > > > > > > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is > > > > that as well. > > > > > > > > > +} > > > > > + > > > > > +/* > > > > > + * This function is invoked after MAX_LAZY_JIFFIES timeout. > > > > > + */ > > > > > +static void lazy_work(struct work_struct *work) > > > > > +{ > > > > > + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); > > > > > + > > > > > + lazy_rcu_flush_cpu(rlp); > > > > > +} > > > > > + > > > > > +static struct shrinker lazy_rcu_shrinker = { > > > > > + .count_objects = lazy_rcu_shrink_count, > > > > > + .scan_objects = lazy_rcu_shrink_scan, > > > > > + .batch = 0, > > > > > + .seeks = DEFAULT_SEEKS, > > > > > +}; > > > > > + > > > > > +void __init rcu_lazy_init(void) > > > > > +{ > > > > > + int cpu; > > > > > + > > > > > + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); > > > > > + > > > > > + for_each_possible_cpu(cpu) { > > > > > + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > > > > > + INIT_DELAYED_WORK(&rlp->work, lazy_work); > > > > > + } > > > > > + > > > > > + if (register_shrinker(&lazy_rcu_shrinker)) > > > > > + pr_err("Failed to register lazy_rcu shrinker!\n"); > > > > > +} > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > index 24b5f2c2de87..a5f4b44f395f 100644 > > > > > --- a/kernel/rcu/rcu.h > > > > > +++ b/kernel/rcu/rcu.h > > > > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); > > > > > static inline void show_rcu_tasks_trace_gp_kthread(void) {} > > > > > #endif > > > > > > > > > > +#ifdef CONFIG_RCU_LAZY > > > > > +void rcu_lazy_init(void); > > > > > +#else > > > > > +static inline void rcu_lazy_init(void) {} > > > > > +#endif > > > > > #endif /* __LINUX_RCU_H */ > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > > index a4c25a6283b0..ebdf6f7c9023 100644 > > > > > --- a/kernel/rcu/tree.c > > > > > +++ b/kernel/rcu/tree.c > > > > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void) > > > > > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > > > > > else > > > > > qovld_calc = qovld; > > > > > + > > > > > + rcu_lazy_init(); > > > > > } > > > > > > > > > > #include "tree_stall.h" > > > > > -- > > > > > 2.36.0.550.gb090851708-goog > > > > >
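To make the "update the count proactively" idea discussed above concrete, here is a sketch of one possible queueing order against the structures in this patch; the ordering shown is an untested illustration of the memory-ordering caveat, not a finished fix:

	/* Publish the count before the node so that a concurrent reader
	 * (the shrinker, or a future rcu_barrier() hook) never sees
	 * queued lazy callbacks that the count does not yet reflect. */
	atomic_inc(&rlp->count);
	smp_mb__after_atomic();		/* count visible before llist_add() */
	llist_add(&head->llist_node, &rlp->head);

	/* lazy_rcu_flush_cpu() decrements only after handing each callback
	 * to call_rcu(), so ->count can briefly overstate, but never
	 * understate, the number of queued lazy callbacks. */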
Hi Paul, I just setup thunderbird on an old machine just to reply to LKML. Apologies if the formatting is weird. Replies below: On 5/28/22 13:57, Paul E. McKenney wrote: [..] >>>>>> + preempt_enable(); >>>>>> + >>>>>> + if (debug_rcu_head_queue((void *)head)) { >>>>>> + // Probable double call_rcu(), just leak. >>>>>> + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", >>>>>> + __func__, head); >>>>>> + >>>>>> + // Mark as success and leave. >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> + // Queue to per-cpu llist >>>>>> + head->func = func; >>>>>> + llist_add(&head->llist_node, &rlp->head); >>>>> >>>>> Suppose that there are a bunch of preemptions between the preempt_enable() >>>>> above and this point, so that the current CPU's list has lots of >>>>> callbacks, but zero ->count. Can that cause a problem? >>>>> >>>>> In the past, this sort of thing has been an issue for rcu_barrier() >>>>> and friends. >>>> >>>> Thanks, I think I dropped the ball on this. You have given me something to >>>> think about. I am thinking on first thought that setting the count in advance >>>> of populating the list should do the trick. I will look more on it. >>> >>> That can work, but don't forget the need for proper memory ordering. >>> >>> Another approach would be to what the current bypass list does, namely >>> count the callback in the ->cblist structure. >> >> I have been thinking about this, and I also arrived at the same conclusion >> that it is easier to make it a segcblist structure which has the functions do >> the memory ordering. Then we can make rcu_barrier() locklessly sample the >> ->len of the new per-cpu structure, I *think* it might work. >> >> However, I was actually considering another situation outside of rcu_barrier, >> where the preemptions you mentioned would cause the list to appear to be >> empty if one were to rely on count, but actually have a lot of stuff queued. >> This causes shrinkers to not know that there is a lot of memory available to >> free. I don't think its that big an issue, if we can assume the caller's >> process context to have sufficient priority. Obviously using a raw spinlock >> makes the process context to be highest priority briefly but without the >> lock, we don't get that. But yeah that can be sorted by just updating the >> count proactively, and then doing the queuing operations. >> >> Another advantage of using the segcblist structure, is, we can also record >> the grace period numbers in it, and avoid triggering RCU from the timer if gp >> already elapsed (similar to the rcu-poll family of functions). >> >> WDYT? > > What you are seeing is that the design space for call_rcu_lazy() is huge, > and the tradeoffs between energy efficiency and complexity are not well > understood. It is therefore necessary to quickly evaluate alternatives. > Yes, you could simply implement all of them and test them, but that > might you take longer than you would like. Plus you likely need to > put in place more testing than rcutorture currently provides. > > Given what we have seen thus far, I think that the first step has to > be to evaluate what can be obtained with this approach compared with > what can be obtained with other approaches. What we have right now, > more or less, is this: > > o Offloading buys about 15%. > > o Slowing grace periods buys about another 7%, but requires > userspace changes to prevent degrading user-visible performance. Agreed. > > o The initial call_rcu_lazy() prototype buys about 1%. Well, that depends on the usecase. 
A busy system saves you less because it is busy anyway, so RCU being quiet does not help much. From my cover letter, you can see the idle power savings is a whopping 29% with this patch series. Check the PC10 state residencies and the package wattage. When the ChromeOS screen is off and the user is not doing anything on the device:
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58
After:
Pk%pc10 = 81.28
PkgWatt = 0.41
> > True, this is not apples to apples because the first two were > measured on ChromeOS and this one on Android. Yes agreed. I think Vlad is not seeing any jaw-dropping savings, because he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to also check if rcu callbacks are indeed not being queued / quiet after applying the patch, and if not then there may still be some call_rcu()s that need conversion to lazy, on his setup. For example, he might have a driver on his ARM platform that's queuing RCU CBs constantly. I was thinking that perhaps a debug patch can help quickly nail down such RCU noise, in the way of a warning. Of course, debug CONFIG should never be enabled in production, so it would be something along the lines of how lockdep is used for debugging. >Which means that > apples to apples evaluation is also required. But this is the > information I currently have at hand, and it is probably no more > than a factor of two off of what would be seen on ChromeOS. > > Or is there some ChromeOS data that tells a different story? > After all, for all I know, Android might still be expediting > all normal grace periods. > > At which point, the question becomes "how to make up that 7%?" After all, > it is not likely that anyone is going to leave that much battery lifetime > on the table. Here are the possibilities that I currently see: > > o Slow down grace periods, and also bite the bullet and make > userspace changes to speed up the RCU grace periods during > critical operations. We tried this, and tracing suffers quite a lot. The system also felt "sluggish" which I suspect is because of synchronize_rcu() slowdowns in other paths. > > o Slow down grace periods, but leverage in-kernel changes for > some of those operations. For example, RCU uses pm_notifier() > to cause grace periods to be expedited during suspend and Nice! Did not know this. > hibernation. The requests for expediting are atomically counted, > so there can be other similar setups. 
If you have a platform with enough > memory, one way to obtain an upper bound is to make call_rcu_lazy() > simply leak the memory. If that amount of savings does not meet the > need, then other higher-level approaches will be needed, whether > as alternatives or as additions to the call_rcu_lazy() approach. > > Should call_rcu_lazy() prove to be part of the mix, here are a (very) > few of the tradeoffs: > > o Put lazy callbacks on their own list or not? > > o If not, use the bypass list? If so, is it the > case that call_rcu_lazy() is just call_rcu() on > non-rcu_nocbs CPUs? > > Or add the complexity required to use the bypass > list on rcu_nocbs CPUs but to use ->cblist otherwise? For bypass list, I am kind of reluctant because the "flush time" of the bypass list may not match the laziness we seek? For example, I want to allow user to configure to flush the CBs only after 15 seconds or so, if the CB list does not grow too big. That might hurt folks using the bypass list, but requiring a quicker response. Also, does doing so not prevent usage of lazy CBs on systems without NOCB? So if we want to future-proof this, I guess that might not be a good decision. > > o If so: > > o Use segmented list with marked grace periods? > Keep in mind that such a list can track only > two grace periods. > > o Use a plain list and have grace-period start > simply drain the list? I want to use the segmented list, regardless of whether we use the bypass list or not, because we get those memory barriers and rcu_barrier() lockless sampling of ->len, for free :). > > o Does call_rcu_lazy() do anything to ensure that additional grace > periods that exist only for the benefit of lazy callbacks are > maximally shared among all CPUs' lazy callbacks? If so, what? > (Keeping in mind that failing to share such grace periods > burns battery lifetime.) I could be missing your point, can you give example of how you want the behavior to be? I am not thinking of creating separate GP cycles just for lazy CBs. The GP cycle will be the same that non-lazy uses. A lazy CB will just record what is the current or last GP, and then wait. Once the timer expires, it will check what is the current or last GP. If a new GP was not started, then it starts a new GP. If a new GP was started but not completed, it can simply call_rcu(). If a new GP started and completed, it does not start new GP and just executes CB, something like that. Before the flush happens, if multiple lazy CBs were queued in the meanwhile and the GP seq counters moved forward, then all those lazy CBs will share the same GP (either not starting a new one, or calling call_rcu() to share the ongoing on, something like that). > o When there is ample memory, how long are lazy callbacks allowed > to sleep? Forever? If not forever, what feeds into computing > the timeout? Yeah, the timeout is fixed currently. I agree this is a good thing to optimize. > o What is used to determine when memory is low, so that laziness > has become a net negative? The assumption I think we make is that the shrinkers will constantly flush out lazy CBs if there is memory pressure. > o What other conditions, if any, should prompt motivating lazy > callbacks? (See above for the start of a grace period motivating > lazy callbacks.) > > In short, you need to cast your net pretty wide on this one. It has > not yet been very carefully explored, so there are likely to be surprises, > maybe even good surprises. 
;-) Cool that sounds like some good opportunity to work on something cool ;-) Happy memorial day! Off for lunch with some friends and then back to tinkering. Thanks, - Joel > > Thanx, Paul > >> thanks, >> >> - Joel >> >>>>>> + // Flush queue if too big >>>>> >>>>> You will also need to check for early boot use. >>>>> >>>>> I -think- it suffice to simply skip the following "if" statement when >>>>> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices. The reason being >>>>> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT >>>>> kernels won't expire them until the softirq kthreads have been spawned. >>>>> >>>>> Which is OK, as it just means that call_rcu_lazy() is a bit more >>>>> lazy than expected that early. >>>>> >>>>> Except that call_rcu() can be invoked even before rcu_init() has been >>>>> invoked, which is therefore also before rcu_init_lazy() has been invoked. >>>> >>>> In other words, you are concerned that too many lazy callbacks might be >>>> pending before rcu_init() is called? >>> >>> In other words, I am concerned that bad things might happen if fields >>> in a rcu_lazy_pcp structure are used before they are initialized. >>> >>> I am not worried about too many lazy callbacks before rcu_init() because >>> the non-lazy callbacks (which these currently are) are allowed to pile >>> up until RCU's grace-period kthreads have been spawned. There might >>> come a time when RCU callbacks need to be invoked earlier, but that will >>> be a separate problem to solve when and if, but with the benefit of the >>> additional information derived from seeing the problem actually happen. >>> >>>> I am going through the kfree_rcu() threads/patches involving >>>> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before >>>> the scheduler is running causes a crash or warnings? >>> >>> There are a lot of different issues that arise in different phases >>> of boot. In this particular case, my primary concern is the >>> use-before-initialized bug. >>> >>>>> I thefore suggest something like this at the very start of this function: >>>>> >>>>> if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) >>>>> call_rcu(head_rcu, func); >>>>> >>>>> The goal is that people can replace call_rcu() with call_rcu_lazy() >>>>> without having to worry about invocation during early boot. >>>> >>>> Yes, this seems safer. I don't expect much power savings during system boot >>>> process anyway ;-). I believe perhaps a static branch would work better to >>>> take a branch out from what is likely a fast path. >>> >>> A static branch would be fine. Maybe encapsulate it in a static inline >>> function for all such comparisons, but most such comparisons are far from >>> anything resembling a fastpath, so the main potential benefit would be >>> added readability. Which could be a compelling reason in and of itself. >>> >>> Thanx, Paul >>> >>>>> Huh. "head_rcu"? Why not "rhp"? Or follow call_rcu() with "head", >>>>> though "rhp" is more consistent with the RCU pointer initials approach. >>>> >>>> Fixed, thanks. >>>> >>>>>> + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { >>>>>> + lazy_rcu_flush_cpu(rlp); >>>>>> + } else { >>>>>> + if (!delayed_work_pending(&rlp->work)) { >>>>> >>>>> This check is racy because the work might run to completion right at >>>>> this point. Wouldn't it be better to rely on the internal check of >>>>> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()? >>>> >>>> Oops, agreed. Will make it as you suggest. 
>>>> >>>> thanks, >>>> >>>> - Joel >>>> >>>>>> + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); >>>>>> + } >>>>>> + } >>>>>> +} >>>>>> + >>>>>> +static unsigned long >>>>>> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) >>>>>> +{ >>>>>> + unsigned long count = 0; >>>>>> + int cpu; >>>>>> + >>>>>> + /* Snapshot count of all CPUs */ >>>>>> + for_each_possible_cpu(cpu) { >>>>>> + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); >>>>>> + >>>>>> + count += atomic_read(&rlp->count); >>>>>> + } >>>>>> + >>>>>> + return count; >>>>>> +} >>>>>> + >>>>>> +static unsigned long >>>>>> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) >>>>>> +{ >>>>>> + int cpu, freed = 0; >>>>>> + >>>>>> + for_each_possible_cpu(cpu) { >>>>>> + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); >>>>>> + unsigned long count; >>>>>> + >>>>>> + count = atomic_read(&rlp->count); >>>>>> + lazy_rcu_flush_cpu(rlp); >>>>>> + sc->nr_to_scan -= count; >>>>>> + freed += count; >>>>>> + if (sc->nr_to_scan <= 0) >>>>>> + break; >>>>>> + } >>>>>> + >>>>>> + return freed == 0 ? SHRINK_STOP : freed; >>>>> >>>>> This is a bit surprising given the stated aim of SHRINK_STOP to indicate >>>>> potential deadlocks. But this pattern is common, including on the >>>>> kvfree_rcu() path, so OK! ;-) >>>>> >>>>> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is >>>>> that as well. >>>>> >>>>>> +} >>>>>> + >>>>>> +/* >>>>>> + * This function is invoked after MAX_LAZY_JIFFIES timeout. >>>>>> + */ >>>>>> +static void lazy_work(struct work_struct *work) >>>>>> +{ >>>>>> + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); >>>>>> + >>>>>> + lazy_rcu_flush_cpu(rlp); >>>>>> +} >>>>>> + >>>>>> +static struct shrinker lazy_rcu_shrinker = { >>>>>> + .count_objects = lazy_rcu_shrink_count, >>>>>> + .scan_objects = lazy_rcu_shrink_scan, >>>>>> + .batch = 0, >>>>>> + .seeks = DEFAULT_SEEKS, >>>>>> +}; >>>>>> + >>>>>> +void __init rcu_lazy_init(void) >>>>>> +{ >>>>>> + int cpu; >>>>>> + >>>>>> + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); >>>>>> + >>>>>> + for_each_possible_cpu(cpu) { >>>>>> + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); >>>>>> + INIT_DELAYED_WORK(&rlp->work, lazy_work); >>>>>> + } >>>>>> + >>>>>> + if (register_shrinker(&lazy_rcu_shrinker)) >>>>>> + pr_err("Failed to register lazy_rcu shrinker!\n"); >>>>>> +} >>>>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h >>>>>> index 24b5f2c2de87..a5f4b44f395f 100644 >>>>>> --- a/kernel/rcu/rcu.h >>>>>> +++ b/kernel/rcu/rcu.h >>>>>> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); >>>>>> static inline void show_rcu_tasks_trace_gp_kthread(void) {} >>>>>> #endif >>>>>> >>>>>> +#ifdef CONFIG_RCU_LAZY >>>>>> +void rcu_lazy_init(void); >>>>>> +#else >>>>>> +static inline void rcu_lazy_init(void) {} >>>>>> +#endif >>>>>> #endif /* __LINUX_RCU_H */ >>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >>>>>> index a4c25a6283b0..ebdf6f7c9023 100644 >>>>>> --- a/kernel/rcu/tree.c >>>>>> +++ b/kernel/rcu/tree.c >>>>>> @@ -4775,6 +4775,8 @@ void __init rcu_init(void) >>>>>> qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; >>>>>> else >>>>>> qovld_calc = qovld; >>>>>> + >>>>>> + rcu_lazy_init(); >>>>>> } >>>>>> >>>>>> #include "tree_stall.h" >>>>>> -- >>>>>> 2.36.0.550.gb090851708-goog >>>>>>
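The "record the grace-period number and avoid triggering RCU if it already elapsed" idea floated above could, for example, lean on the existing polled grace-period API. A sketch follows, using the cursor/rh names from lazy_rcu_flush_cpu(); the gp_snap field is hypothetical (and would conflict with the current same-size-as-rcu_head constraint on lazy_rcu_head), so this is only an illustration:

	/* At queue time, in call_rcu_lazy(): */
	head->gp_snap = get_state_synchronize_rcu();

	/* At flush time, for each dequeued callback: if a full grace
	 * period already elapsed while it sat in the lazy list, invoke
	 * it directly (rcu_barrier() would need to account for this);
	 * otherwise share the ongoing or next grace period. */
	if (poll_state_synchronize_rcu(cursor->gp_snap))
		rh->func(rh);
	else
		call_rcu(rh, rh->func);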
Hi Vlad, Thanks for the comments. I replied inline: On 5/17/22 05:07, Uladzislau Rezki wrote: >> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile >> index 0cfb009a99b9..8968b330d6e0 100644 >> --- a/kernel/rcu/Makefile >> +++ b/kernel/rcu/Makefile >> @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o >> obj-$(CONFIG_TREE_RCU) += tree.o >> obj-$(CONFIG_TINY_RCU) += tiny.o >> obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o >> +obj-$(CONFIG_RCU_LAZY) += lazy.o >> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c >> new file mode 100644 >> index 000000000000..55e406cfc528 >> --- /dev/null >> +++ b/kernel/rcu/lazy.c >> @@ -0,0 +1,145 @@ >> +/* >> + * Lockless lazy-RCU implementation. >> + */ >> +#include <linux/rcupdate.h> >> +#include <linux/shrinker.h> >> +#include <linux/workqueue.h> >> +#include "rcu.h" >> + >> +// How much to batch before flushing? >> +#define MAX_LAZY_BATCH 2048 >> + >> +// How much to wait before flushing? >> +#define MAX_LAZY_JIFFIES 10000 >> + >> +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while >> +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON >> +// later to ensure that rcu_head and lazy_rcu_head are of the same size. >> +struct lazy_rcu_head { >> + struct llist_node llist_node; >> + void (*func)(struct callback_head *head); >> +} __attribute__((aligned(sizeof(void *)))); >> + >> +struct rcu_lazy_pcp { >> + struct llist_head head; >> + struct delayed_work work; >> + atomic_t count; >> +}; >> +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); >> + >> +// Lockless flush of CPU, can be called concurrently. >> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) >> +{ >> + struct llist_node *node = llist_del_all(&rlp->head); >> + struct lazy_rcu_head *cursor, *temp; >> + >> + if (!node) >> + return; >> + >> + llist_for_each_entry_safe(cursor, temp, node, llist_node) { >> + struct rcu_head *rh = (struct rcu_head *)cursor; >> + debug_rcu_head_unqueue(rh); >> + call_rcu(rh, rh->func); >> + atomic_dec(&rlp->count); >> + } >> +} >> + >> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) >> +{ >> + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; >> + struct rcu_lazy_pcp *rlp; >> + >> + preempt_disable(); >> + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); >> + preempt_enable(); >> > Can we get rid of such explicit disabling/enabling preemption? Ok I'll try. Last I checked, something needs to disable preemption to prevent warnings with sampling the current processor ID. >> + >> + if (debug_rcu_head_queue((void *)head)) { >> + // Probable double call_rcu(), just leak. >> + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", >> + __func__, head); >> + >> + // Mark as success and leave. >> + return; >> + } >> + >> + // Queue to per-cpu llist >> + head->func = func; >> + llist_add(&head->llist_node, &rlp->head); >> + >> + // Flush queue if too big >> + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { >> + lazy_rcu_flush_cpu(rlp); >> > Can we just schedule the work instead of drawn from the caller context? > For example it can be a hard-irq context. You raise a good point. Ok I'll do that. Though if the CB list is small, I would prefer to do it inline. I will look more into it. > >> + } else { >> + if (!delayed_work_pending(&rlp->work)) { >> + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); >> + } >> + } >> +} > EXPORT_SYMBOL_GPL()? to be able to use in kernel modules. > Sure, will fix. Thanks, - Joel
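On the preemption question raised above, two possible shapes for the per-CPU access, depending on whether an occasional migration between the lookup and the llist_add() is acceptable (it is harmless for a lockless llist); this is a sketch against the patch's structures, not a tested change:

	/* Option 1: the lockless llist tolerates landing on a remote
	 * CPU's list, so the non-checking accessor avoids the warning
	 * without touching preemption at all. */
	rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);

	/* Option 2: keep preemption disabled across the whole enqueue
	 * instead of only across the pointer lookup. */
	rlp = get_cpu_ptr(&rcu_lazy_pcp_ins);
	llist_add(&head->llist_node, &rlp->head);
	put_cpu_ptr(&rcu_lazy_pcp_ins);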
On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote: > Hi Paul, > > I just setup thunderbird on an old machine just to reply to LKML. > Apologies if the formatting is weird. Replies below: Looks fine at first glance. ;-) > On 5/28/22 13:57, Paul E. McKenney wrote: > [..] > >>>>>> + preempt_enable(); > >>>>>> + > >>>>>> + if (debug_rcu_head_queue((void *)head)) { > >>>>>> + // Probable double call_rcu(), just leak. > >>>>>> + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > >>>>>> + __func__, head); > >>>>>> + > >>>>>> + // Mark as success and leave. > >>>>>> + return; > >>>>>> + } > >>>>>> + > >>>>>> + // Queue to per-cpu llist > >>>>>> + head->func = func; > >>>>>> + llist_add(&head->llist_node, &rlp->head); > >>>>> > >>>>> Suppose that there are a bunch of preemptions between the preempt_enable() > >>>>> above and this point, so that the current CPU's list has lots of > >>>>> callbacks, but zero ->count. Can that cause a problem? > >>>>> > >>>>> In the past, this sort of thing has been an issue for rcu_barrier() > >>>>> and friends. > >>>> > >>>> Thanks, I think I dropped the ball on this. You have given me something to > >>>> think about. I am thinking on first thought that setting the count in advance > >>>> of populating the list should do the trick. I will look more on it. > >>> > >>> That can work, but don't forget the need for proper memory ordering. > >>> > >>> Another approach would be to what the current bypass list does, namely > >>> count the callback in the ->cblist structure. > >> > >> I have been thinking about this, and I also arrived at the same conclusion > >> that it is easier to make it a segcblist structure which has the functions do > >> the memory ordering. Then we can make rcu_barrier() locklessly sample the > >> ->len of the new per-cpu structure, I *think* it might work. > >> > >> However, I was actually considering another situation outside of rcu_barrier, > >> where the preemptions you mentioned would cause the list to appear to be > >> empty if one were to rely on count, but actually have a lot of stuff queued. > >> This causes shrinkers to not know that there is a lot of memory available to > >> free. I don't think its that big an issue, if we can assume the caller's > >> process context to have sufficient priority. Obviously using a raw spinlock > >> makes the process context to be highest priority briefly but without the > >> lock, we don't get that. But yeah that can be sorted by just updating the > >> count proactively, and then doing the queuing operations. > >> > >> Another advantage of using the segcblist structure, is, we can also record > >> the grace period numbers in it, and avoid triggering RCU from the timer if gp > >> already elapsed (similar to the rcu-poll family of functions). > >> > >> WDYT? > > > > What you are seeing is that the design space for call_rcu_lazy() is huge, > > and the tradeoffs between energy efficiency and complexity are not well > > understood. It is therefore necessary to quickly evaluate alternatives. > > Yes, you could simply implement all of them and test them, but that > > might you take longer than you would like. Plus you likely need to > > put in place more testing than rcutorture currently provides. > > > > Given what we have seen thus far, I think that the first step has to > > be to evaluate what can be obtained with this approach compared with > > what can be obtained with other approaches. What we have right now, > > more or less, is this: > > > > o Offloading buys about 15%. 
> > > > o Slowing grace periods buys about another 7%, but requires > > userspace changes to prevent degrading user-visible performance. > > Agreed. > > > > > o The initial call_rcu_lazy() prototype buys about 1%. > > Well, that depends on the usecase. A busy system saves you less because > it is busy anyway, so RCU being quiet does not help much. > > >From my cover letter, you can see the idle power savings is a whopping > 29% with this patch series. Check the PC10 state residencies and the > package wattage. > > When ChromeOS screen is off and user is not doing anything on the > Before: > Pk%pc10 = 72.13 > PkgWatt = 0.58 > > After: > Pk%pc10 = 81.28 > PkgWatt = 0.41 These numbers do make a better case for this. On the other hand, that patch series was missing things like rcu_barrier() support and might also be converting more call_rcu() invocations to call_rcu_lazy() invocations. It is therefore necessary to take these numbers with a grain of salt, much though I hope that they turn out to be realistic. > > > > True, this is not apples to apples because the first two were > > measured on ChromeOS and this one on Android. > > Yes agreed. I think Vlad is not seeing any jaw dropping savings, because > he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to > also check if rcu callbacks are indeed not being quued / quiet after > applying the patch, and if not then there may still be some call_rcu()s > that need conversion to lazy, on his setup. For example, he might have a > driver on his ARM platform that's queing RCU CBs constantly. > > I was thinking that perhaps a debug patch can help quickly nail down > such RCU noise, in the way of a warning. Of course, debug CONFIG should > never be enabled in production, so it would be something along the lines > of how lockdep is used for debugging. The trick would be figuring out which of the callbacks are noise. I suspect that userspace help would be required. Or maybe just leverage the existing event traces and let userspace sort it out. > >Which means that > > apples to apples evaluation is also required. But this is the > > information I currently have at hand, and it is probably no more > > than a factor of two off of what would be seen on ChromeOS. > > > > Or is there some ChromeOS data that tells a different story? > > After all, for all I know, Android might still be expediting > > all normal grace periods. > > > > At which point, the question becomes "how to make up that 7%?" After all, > > it is not likely that anyone is going to leave that much battery lifetime > > on the table. Here are the possibilities that I currently see: > > > > o Slow down grace periods, and also bite the bullet and make > > userspace changes to speed up the RCU grace periods during > > critical operations. > > We tried this, and tracing suffers quiet a lot. The system also felt > "sluggish" which I suspect is because of synchronize_rcu() slow downs in > other paths. In what way does tracing suffer? Removing tracepoints? And yes, this approach absolutely requires finding code paths with user-visible grace periods. Which sounds like you tried the "Slow down grace periods" part but not the "bit the bullet" part. ;-) > > o Slow down grace periods, but leverage in-kernel changes for > > some of those operations. For example, RCU uses pm_notifier() > > to cause grace periods to be expedited during suspend and > > Nice! Did not know this. > > > hibernation. The requests for expediting are atomically counted, > > so there can be other similar setups. 
> > I do like slowing down grace periods globally, because its easy, but I > think it can have quite a few ripple effects if in the future a user of > call_rcu() does not expect lazy behavior :( but still gets it. Same like > how synchronize_rcu_mult() suffered back in its day. No matter what the approach, there are going to be ripple effects. > > o Choose a more moderate slowing down of the grace period so as to > > minimize (or maybe even eliminate) the need for the aforementioned > > userspace changes. This also leaves battery lifetime on the > > table, but considerably less of it. > > > > o Implement more selective slowdowns, such as call_rcu_lazy(). > > > > Other selective slowdowns include in-RCU heuristics such as > > forcing the current grace period to end once the entire > > system has gone idle. (There is prior work detecting > > full-system idleness for NO_HZ_FULL, but smaller systems > > could use more aggressive techniques.) > > > > I am sure that there are additional possibilities. Note that it is > > possible to do more than one of these, if that proves helpful. > > Sure! > > But given this, one question is "What is the most you can possibly > > obtain from call_rcu_lazy()?" If you have a platform with enough > > memory, one way to obtain an upper bound is to make call_rcu_lazy() > > simply leak the memory. If that amount of savings does not meet the > > need, then other higher-level approaches will be needed, whether > > as alternatives or as additions to the call_rcu_lazy() approach. > > > > Should call_rcu_lazy() prove to be part of the mix, here are a (very) > > few of the tradeoffs: > > > > o Put lazy callbacks on their own list or not? > > > > o If not, use the bypass list? If so, is it the > > case that call_rcu_lazy() is just call_rcu() on > > non-rcu_nocbs CPUs? > > > > Or add the complexity required to use the bypass > > list on rcu_nocbs CPUs but to use ->cblist otherwise? > > For bypass list, I am kind of reluctant because the "flush time" of the > bypass list may not match the laziness we seek? For example, I want to > allow user to configure to flush the CBs only after 15 seconds or so, if > the CB list does not grow too big. That might hurt folks using the > bypass list, but requiring a quicker response. > > Also, does doing so not prevent usage of lazy CBs on systems without > NOCB? So if we want to future-proof this, I guess that might not be a > good decision. True enough, but would this future actually arrive? After all, if someone cared enough about energy efficiency to use call_rcu_lazy(), why wouldn't they also offload callbacks? On the flush-time mismatch, if there are any non-lazy callbacks in the list, it costs you nothing to let the lazy callbacks tag along through the grace period. So one approach would be to use the current flush time if there are non-lazy callbacks, but use the longer flush time if all of the callbacks in the list are lazy callbacks. > > o If so: > > > > o Use segmented list with marked grace periods? > > Keep in mind that such a list can track only > > two grace periods. > > > > o Use a plain list and have grace-period start > > simply drain the list? > > I want to use the segmented list, regardless of whether we use the > bypass list or not, because we get those memory barriers and > rcu_barrier() lockless sampling of ->len, for free :). The bypass list also gets you the needed memory barriers and lockless sampling of ->len. As does any other type of list as long as the ->cblist.len field accounts for the lazy callbacks. 
So the main difference is the tracking of grace periods, and even then, only those grace periods for which this CPU has no non-lazy callbacks. Or, in the previously discussed case where a single rcuoc kthread serves multiple CPUs, only those grace periods for which this group of CPUs has no non-lazy callbacks. And on devices with few CPUs, wouldn't you make do with one rcuoc kthread for the system, at least in the common case where there is no callback flooding? > > o Does call_rcu_lazy() do anything to ensure that additional grace > > periods that exist only for the benefit of lazy callbacks are > > maximally shared among all CPUs' lazy callbacks? If so, what? > > (Keeping in mind that failing to share such grace periods > > burns battery lifetime.) > > I could be missing your point, can you give example of how you want the > behavior to be? CPU 0 has 55 lazy callbacks and one non-lazy callback. So the system does a grace-period computation and CPU 0 wakes up its rcuoc kthread. Given that the full price is being paid anyway, why not also invoke those 55 lazy callbacks? If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23 lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks, again the full price of grace-period computation and rcuoc wakeups is being paid, so why not also invoke those 26 lazy callbacks? On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other rcuoc kthread, and with the same numbers of callbacks as in the previous paragraph, you get to make a choice. Do you do an extra wakeup for CPU 0's rcuoc kthread? Or do you potentially do an extra grace-period computation some time in the future? Suppose further that the CPU that would be awakened for CPU 0's rcuoc kthread is already non-idle. At that point, this wakeup is quite cheap. Except that the wakeup-or-don't choice is made at the beginning of the grace period, and that CPU's idleness at that point might not be all that well correlated with its idleness at the time of the wakeup at the end of that grace period. Nevertheless, idleness statistics could help trade off needless from-idle wakeups for needless grace periods. Which sounds like a good reason to consolidate rcuoc processing on non-busy systems. But more to the point... Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that RCU is otherwise idle. We really want one (not three!) grace periods to eventually handle those lazy callbacks, right? > I am not thinking of creating separate GP cycles just for lazy CBs. The > GP cycle will be the same that non-lazy uses. A lazy CB will just record > what is the current or last GP, and then wait. Once the timer expires, > it will check what is the current or last GP. If a new GP was not > started, then it starts a new GP. From my perspective, this new GP is in fact a separate GP just for lazy callbacks. What am I missing here? Also, if your timer triggers the check, isn't that an additional potential wakeup from idle? Would it make sense to also minimize those wakeups? (This is one motivation for grace periods to pick up lazy callbacks.) > If a new GP was started but not > completed, it can simply call_rcu(). If a new GP started and completed, > it does not start new GP and just executes CB, something like that. 
> Before the flush happens, if multiple lazy CBs were queued in the > meanwhile and the GP seq counters moved forward, then all those lazy CBs > will share the same GP (either not starting a new one, or calling > call_rcu() to share the ongoing on, something like that). You could move the ready-to-invoke callbacks to the end of the DONE segment of ->cblist, assuming that you also adjust rcu_barrier() appropriately. If rcu_barrier() is rare, you could simply flush whatever lazy list before entraining the rcu_barrier() callback. > > o When there is ample memory, how long are lazy callbacks allowed > > to sleep? Forever? If not forever, what feeds into computing > > the timeout? > > Yeah, the timeout is fixed currently. I agree this is a good thing to > optimize. First see what the behavior is. If the timeout is long enough that there is almost never a lazy-only grace period, then maybe a fixed timeout is good enough. > > o What is used to determine when memory is low, so that laziness > > has become a net negative? > > The assumption I think we make is that the shrinkers will constantly > flush out lazy CBs if there is memory pressure. It might be worth looking at SPI. That could help avoid a grace-period wait when clearing out lazy callbacks. > > o What other conditions, if any, should prompt motivating lazy > > callbacks? (See above for the start of a grace period motivating > > lazy callbacks.) > > > > In short, you need to cast your net pretty wide on this one. It has > > not yet been very carefully explored, so there are likely to be surprises, > > maybe even good surprises. ;-) > > Cool that sounds like some good opportunity to work on something cool ;-) Which comes back around to your point of needing evaluation of extra RCU work. Lots of tradeoffs between different wakeup sources and grace periods. Fine-grained views into the black box will be helpful. ;-) > Happy memorial day! Off for lunch with some friends and then back to > tinkering. Have a good holiday! Thanx, Paul > Thanks, > > - Joel > > > > > Thanx, Paul > > > >> thanks, > >> > >> - Joel > >> > >>>>>> + // Flush queue if too big > >>>>> > >>>>> You will also need to check for early boot use. > >>>>> > >>>>> I -think- it suffice to simply skip the following "if" statement when > >>>>> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices. The reason being > >>>>> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT > >>>>> kernels won't expire them until the softirq kthreads have been spawned. > >>>>> > >>>>> Which is OK, as it just means that call_rcu_lazy() is a bit more > >>>>> lazy than expected that early. > >>>>> > >>>>> Except that call_rcu() can be invoked even before rcu_init() has been > >>>>> invoked, which is therefore also before rcu_init_lazy() has been invoked. > >>>> > >>>> In other words, you are concerned that too many lazy callbacks might be > >>>> pending before rcu_init() is called? > >>> > >>> In other words, I am concerned that bad things might happen if fields > >>> in a rcu_lazy_pcp structure are used before they are initialized. > >>> > >>> I am not worried about too many lazy callbacks before rcu_init() because > >>> the non-lazy callbacks (which these currently are) are allowed to pile > >>> up until RCU's grace-period kthreads have been spawned. 
There might > >>> come a time when RCU callbacks need to be invoked earlier, but that will > >>> be a separate problem to solve when and if, but with the benefit of the > >>> additional information derived from seeing the problem actually happen. > >>> > >>>> I am going through the kfree_rcu() threads/patches involving > >>>> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before > >>>> the scheduler is running causes a crash or warnings? > >>> > >>> There are a lot of different issues that arise in different phases > >>> of boot. In this particular case, my primary concern is the > >>> use-before-initialized bug. > >>> > >>>>> I thefore suggest something like this at the very start of this function: > >>>>> > >>>>> if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) > >>>>> call_rcu(head_rcu, func); > >>>>> > >>>>> The goal is that people can replace call_rcu() with call_rcu_lazy() > >>>>> without having to worry about invocation during early boot. > >>>> > >>>> Yes, this seems safer. I don't expect much power savings during system boot > >>>> process anyway ;-). I believe perhaps a static branch would work better to > >>>> take a branch out from what is likely a fast path. > >>> > >>> A static branch would be fine. Maybe encapsulate it in a static inline > >>> function for all such comparisons, but most such comparisons are far from > >>> anything resembling a fastpath, so the main potential benefit would be > >>> added readability. Which could be a compelling reason in and of itself. > >>> > >>> Thanx, Paul > >>> > >>>>> Huh. "head_rcu"? Why not "rhp"? Or follow call_rcu() with "head", > >>>>> though "rhp" is more consistent with the RCU pointer initials approach. > >>>> > >>>> Fixed, thanks. > >>>> > >>>>>> + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { > >>>>>> + lazy_rcu_flush_cpu(rlp); > >>>>>> + } else { > >>>>>> + if (!delayed_work_pending(&rlp->work)) { > >>>>> > >>>>> This check is racy because the work might run to completion right at > >>>>> this point. Wouldn't it be better to rely on the internal check of > >>>>> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()? > >>>> > >>>> Oops, agreed. Will make it as you suggest. > >>>> > >>>> thanks, > >>>> > >>>> - Joel > >>>> > >>>>>> + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); > >>>>>> + } > >>>>>> + } > >>>>>> +} > >>>>>> + > >>>>>> +static unsigned long > >>>>>> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > >>>>>> +{ > >>>>>> + unsigned long count = 0; > >>>>>> + int cpu; > >>>>>> + > >>>>>> + /* Snapshot count of all CPUs */ > >>>>>> + for_each_possible_cpu(cpu) { > >>>>>> + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > >>>>>> + > >>>>>> + count += atomic_read(&rlp->count); > >>>>>> + } > >>>>>> + > >>>>>> + return count; > >>>>>> +} > >>>>>> + > >>>>>> +static unsigned long > >>>>>> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > >>>>>> +{ > >>>>>> + int cpu, freed = 0; > >>>>>> + > >>>>>> + for_each_possible_cpu(cpu) { > >>>>>> + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > >>>>>> + unsigned long count; > >>>>>> + > >>>>>> + count = atomic_read(&rlp->count); > >>>>>> + lazy_rcu_flush_cpu(rlp); > >>>>>> + sc->nr_to_scan -= count; > >>>>>> + freed += count; > >>>>>> + if (sc->nr_to_scan <= 0) > >>>>>> + break; > >>>>>> + } > >>>>>> + > >>>>>> + return freed == 0 ? 
SHRINK_STOP : freed; > >>>>> > >>>>> This is a bit surprising given the stated aim of SHRINK_STOP to indicate > >>>>> potential deadlocks. But this pattern is common, including on the > >>>>> kvfree_rcu() path, so OK! ;-) > >>>>> > >>>>> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is > >>>>> that as well. > >>>>> > >>>>>> +} > >>>>>> + > >>>>>> +/* > >>>>>> + * This function is invoked after MAX_LAZY_JIFFIES timeout. > >>>>>> + */ > >>>>>> +static void lazy_work(struct work_struct *work) > >>>>>> +{ > >>>>>> + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); > >>>>>> + > >>>>>> + lazy_rcu_flush_cpu(rlp); > >>>>>> +} > >>>>>> + > >>>>>> +static struct shrinker lazy_rcu_shrinker = { > >>>>>> + .count_objects = lazy_rcu_shrink_count, > >>>>>> + .scan_objects = lazy_rcu_shrink_scan, > >>>>>> + .batch = 0, > >>>>>> + .seeks = DEFAULT_SEEKS, > >>>>>> +}; > >>>>>> + > >>>>>> +void __init rcu_lazy_init(void) > >>>>>> +{ > >>>>>> + int cpu; > >>>>>> + > >>>>>> + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); > >>>>>> + > >>>>>> + for_each_possible_cpu(cpu) { > >>>>>> + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); > >>>>>> + INIT_DELAYED_WORK(&rlp->work, lazy_work); > >>>>>> + } > >>>>>> + > >>>>>> + if (register_shrinker(&lazy_rcu_shrinker)) > >>>>>> + pr_err("Failed to register lazy_rcu shrinker!\n"); > >>>>>> +} > >>>>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > >>>>>> index 24b5f2c2de87..a5f4b44f395f 100644 > >>>>>> --- a/kernel/rcu/rcu.h > >>>>>> +++ b/kernel/rcu/rcu.h > >>>>>> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); > >>>>>> static inline void show_rcu_tasks_trace_gp_kthread(void) {} > >>>>>> #endif > >>>>>> > >>>>>> +#ifdef CONFIG_RCU_LAZY > >>>>>> +void rcu_lazy_init(void); > >>>>>> +#else > >>>>>> +static inline void rcu_lazy_init(void) {} > >>>>>> +#endif > >>>>>> #endif /* __LINUX_RCU_H */ > >>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > >>>>>> index a4c25a6283b0..ebdf6f7c9023 100644 > >>>>>> --- a/kernel/rcu/tree.c > >>>>>> +++ b/kernel/rcu/tree.c > >>>>>> @@ -4775,6 +4775,8 @@ void __init rcu_init(void) > >>>>>> qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > >>>>>> else > >>>>>> qovld_calc = qovld; > >>>>>> + > >>>>>> + rcu_lazy_init(); > >>>>>> } > >>>>>> > >>>>>> #include "tree_stall.h" > >>>>>> -- > >>>>>> 2.36.0.550.gb090851708-goog > >>>>>>
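To make the early-boot fallback and static-branch encapsulation discussed above concrete, here is a minimal sketch. The key name, the point at which it gets enabled, and the elided queuing path are assumptions for illustration, not part of the posted patch:

#include <linux/jump_label.h>
#include <linux/rcupdate.h>

/* Hypothetical static key, flipped once RCU (and timers) are fully up. */
static DEFINE_STATIC_KEY_FALSE(rcu_lazy_boot_done);

/* Sketch: called from rcu_init(), or wherever laziness becomes safe. */
void rcu_lazy_mark_boot_done(void)
{
	static_branch_enable(&rcu_lazy_boot_done);
}

void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
{
	/*
	 * Too early: per-CPU lazy state may not be initialized and queued
	 * timers may not fire yet, so fall back to plain call_rcu().
	 */
	if (!static_branch_likely(&rcu_lazy_boot_done)) {
		call_rcu(head, func);
		return;
	}

	/* ... normal lazy queuing path goes here ... */
}

Calling schedule_delayed_work() unconditionally, and thereby relying on the WORK_STRUCT_PENDING_BIT test inside queue_delayed_work_on(), likewise closes the delayed_work_pending() race noted in this exchange.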
Hi Paul, I am sending in one more reply before I sleep ;-) On 5/30/22 12:42, Paul E. McKenney wrote: > On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote: >> Hi Paul, >> >> I just setup thunderbird on an old machine just to reply to LKML. >> Apologies if the formatting is weird. Replies below: > > Looks fine at first glance. ;-) > >> On 5/28/22 13:57, Paul E. McKenney wrote: >> [..] >>>>>>>> + preempt_enable(); >>>>>>>> + >>>>>>>> + if (debug_rcu_head_queue((void *)head)) { >>>>>>>> + // Probable double call_rcu(), just leak. >>>>>>>> + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", >>>>>>>> + __func__, head); >>>>>>>> + >>>>>>>> + // Mark as success and leave. >>>>>>>> + return; >>>>>>>> + } >>>>>>>> + >>>>>>>> + // Queue to per-cpu llist >>>>>>>> + head->func = func; >>>>>>>> + llist_add(&head->llist_node, &rlp->head); >>>>>>> >>>>>>> Suppose that there are a bunch of preemptions between the preempt_enable() >>>>>>> above and this point, so that the current CPU's list has lots of >>>>>>> callbacks, but zero ->count. Can that cause a problem? >>>>>>> >>>>>>> In the past, this sort of thing has been an issue for rcu_barrier() >>>>>>> and friends. >>>>>> >>>>>> Thanks, I think I dropped the ball on this. You have given me something to >>>>>> think about. I am thinking on first thought that setting the count in advance >>>>>> of populating the list should do the trick. I will look more on it. >>>>> >>>>> That can work, but don't forget the need for proper memory ordering. >>>>> >>>>> Another approach would be to what the current bypass list does, namely >>>>> count the callback in the ->cblist structure. >>>> >>>> I have been thinking about this, and I also arrived at the same conclusion >>>> that it is easier to make it a segcblist structure which has the functions do >>>> the memory ordering. Then we can make rcu_barrier() locklessly sample the >>>> ->len of the new per-cpu structure, I *think* it might work. >>>> >>>> However, I was actually considering another situation outside of rcu_barrier, >>>> where the preemptions you mentioned would cause the list to appear to be >>>> empty if one were to rely on count, but actually have a lot of stuff queued. >>>> This causes shrinkers to not know that there is a lot of memory available to >>>> free. I don't think its that big an issue, if we can assume the caller's >>>> process context to have sufficient priority. Obviously using a raw spinlock >>>> makes the process context to be highest priority briefly but without the >>>> lock, we don't get that. But yeah that can be sorted by just updating the >>>> count proactively, and then doing the queuing operations. >>>> >>>> Another advantage of using the segcblist structure, is, we can also record >>>> the grace period numbers in it, and avoid triggering RCU from the timer if gp >>>> already elapsed (similar to the rcu-poll family of functions). >>>> >>>> WDYT? >>> >>> What you are seeing is that the design space for call_rcu_lazy() is huge, >>> and the tradeoffs between energy efficiency and complexity are not well >>> understood. It is therefore necessary to quickly evaluate alternatives. >>> Yes, you could simply implement all of them and test them, but that >>> might you take longer than you would like. Plus you likely need to >>> put in place more testing than rcutorture currently provides. 
>>> >>> Given what we have seen thus far, I think that the first step has to >>> be to evaluate what can be obtained with this approach compared with >>> what can be obtained with other approaches. What we have right now, >>> more or less, is this: >>> >>> o Offloading buys about 15%. >>> >>> o Slowing grace periods buys about another 7%, but requires >>> userspace changes to prevent degrading user-visible performance. >> >> Agreed. >> >>> >>> o The initial call_rcu_lazy() prototype buys about 1%. >> >> Well, that depends on the usecase. A busy system saves you less because >> it is busy anyway, so RCU being quiet does not help much. >> >> >From my cover letter, you can see the idle power savings is a whopping >> 29% with this patch series. Check the PC10 state residencies and the >> package wattage. >> >> When ChromeOS screen is off and user is not doing anything on the >> Before: >> Pk%pc10 = 72.13 >> PkgWatt = 0.58 >> >> After: >> Pk%pc10 = 81.28 >> PkgWatt = 0.41 > > These numbers do make a better case for this. On the other hand, that > patch series was missing things like rcu_barrier() support and might also > be converting more call_rcu() invocations to call_rcu_lazy() invocations. > It is therefore necessary to take these numbers with a grain of salt, > much though I hope that they turn out to be realistic. Makes sense, thank you. Actually the promising savings showed up when also doing jiffies_till_first_fqs increases, so that's why I did the prototype of call_rcu_lazy to prove that we can achieve same thing instead of slowing GP for everything. I believe the savings comes from just not kicking the RCU GP machinery at all. Agreed on the grain of salt part. Only a test of a final design can tell us more accurately what the savings are. >>> >>> True, this is not apples to apples because the first two were >>> measured on ChromeOS and this one on Android. >> >> Yes agreed. I think Vlad is not seeing any jaw dropping savings, because >> he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to >> also check if rcu callbacks are indeed not being quued / quiet after >> applying the patch, and if not then there may still be some call_rcu()s >> that need conversion to lazy, on his setup. For example, he might have a >> driver on his ARM platform that's queing RCU CBs constantly. >> >> I was thinking that perhaps a debug patch can help quickly nail down >> such RCU noise, in the way of a warning. Of course, debug CONFIG should >> never be enabled in production, so it would be something along the lines >> of how lockdep is used for debugging. > > The trick would be figuring out which of the callbacks are noise. > I suspect that userspace help would be required. Or maybe just leverage > the existing event traces and let userspace sort it out. Ok. >>> Which means that >>> apples to apples evaluation is also required. But this is the >>> information I currently have at hand, and it is probably no more >>> than a factor of two off of what would be seen on ChromeOS. >>> >>> Or is there some ChromeOS data that tells a different story? >>> After all, for all I know, Android might still be expediting >>> all normal grace periods. >>> >>> At which point, the question becomes "how to make up that 7%?" After all, >>> it is not likely that anyone is going to leave that much battery lifetime >>> on the table. 
Here are the possibilities that I currently see: >>> >>> o Slow down grace periods, and also bite the bullet and make >>> userspace changes to speed up the RCU grace periods during >>> critical operations. >> >> We tried this, and tracing suffers quiet a lot. The system also felt >> "sluggish" which I suspect is because of synchronize_rcu() slow downs in >> other paths. > > In what way does tracing suffer? Removing tracepoints? Yes. Start and stopping function tracer goes from 5 seconds to like 30 seconds or so. > And yes, this approach absolutely requires finding code paths with > user-visible grace periods. Which sounds like you tried the "Slow down > grace periods" part but not the "bit the bullet" part. ;-) True, I got so scared looking at the performance that I just decided to play it safe and do selective call_rcu() lazifying, than making the everything lazy. >>> But given this, one question is "What is the most you can possibly >>> obtain from call_rcu_lazy()?" If you have a platform with enough >>> memory, one way to obtain an upper bound is to make call_rcu_lazy() >>> simply leak the memory. If that amount of savings does not meet the >>> need, then other higher-level approaches will be needed, whether >>> as alternatives or as additions to the call_rcu_lazy() approach. >>> >>> Should call_rcu_lazy() prove to be part of the mix, here are a (very) >>> few of the tradeoffs: >>> >>> o Put lazy callbacks on their own list or not? >>> >>> o If not, use the bypass list? If so, is it the >>> case that call_rcu_lazy() is just call_rcu() on >>> non-rcu_nocbs CPUs? >>> >>> Or add the complexity required to use the bypass >>> list on rcu_nocbs CPUs but to use ->cblist otherwise? >> >> For bypass list, I am kind of reluctant because the "flush time" of the >> bypass list may not match the laziness we seek? For example, I want to >> allow user to configure to flush the CBs only after 15 seconds or so, if >> the CB list does not grow too big. That might hurt folks using the >> bypass list, but requiring a quicker response. >> >> Also, does doing so not prevent usage of lazy CBs on systems without >> NOCB? So if we want to future-proof this, I guess that might not be a >> good decision. > > True enough, but would this future actually arrive? After all, if > someone cared enough about energy efficiency to use call_rcu_lazy(), > why wouldn't they also offload callbacks? I am not sure, but I also don't mind making it depend on NOCB for now (see below). > On the flush-time mismatch, if there are any non-lazy callbacks in the > list, it costs you nothing to let the lazy callbacks tag along through > the grace period. So one approach would be to use the current flush > time if there are non-lazy callbacks, but use the longer flush time if > all of the callbacks in the list are lazy callbacks. Cool! >>> o If so: >>> >>> o Use segmented list with marked grace periods? >>> Keep in mind that such a list can track only >>> two grace periods. >>> >>> o Use a plain list and have grace-period start >>> simply drain the list? >> >> I want to use the segmented list, regardless of whether we use the >> bypass list or not, because we get those memory barriers and >> rcu_barrier() lockless sampling of ->len, for free :). > > The bypass list also gets you the needed memory barriers and lockless > sampling of ->len. As does any other type of list as long as the > ->cblist.len field accounts for the lazy callbacks. 
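Here is a rough sketch of the flush-time selection suggested above: keep the normal short deadline whenever any non-lazy callback is queued, and stretch it only when everything in the list is lazy. The constants and the helper name are illustrative assumptions:

#include <linux/jiffies.h>

/* Illustrative values only; the real numbers would need tuning. */
#define BYPASS_FLUSH_JIFFIES	(HZ / 10)	/* normal, short deadline */
#define LAZY_FLUSH_JIFFIES	(10 * HZ)	/* lazy-only, much longer */

/*
 * Pick the flush deadline for a callback list: if every queued callback
 * is lazy, nothing is waiting on a grace period, so a much longer
 * timeout is acceptable; otherwise keep the short one so that non-lazy
 * callbacks are not delayed.
 */
static unsigned long flush_deadline(long n_lazy, long n_total)
{
	return jiffies + (n_lazy == n_total ? LAZY_FLUSH_JIFFIES
					    : BYPASS_FLUSH_JIFFIES);
}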
> > So the main difference is the tracking of grace periods, and even then, > only those grace periods for which this CPU has no non-lazy callbacks. > Or, in the previously discussed case where a single rcuoc kthread serves > multiple CPUs, only those grace periods for which this group of CPUs > has no non-lazy callbacks. Unless we assume that everything in the bypass list is lazy, can we really use it to track both lazy and non lazy CBs, and grace periods? Currently the struct looks like this: struct rcu_segcblist { struct rcu_head *head; struct rcu_head **tails[RCU_CBLIST_NSEGS]; unsigned long gp_seq[RCU_CBLIST_NSEGS]; #ifdef CONFIG_RCU_NOCB_CPU atomic_long_t len; #else long len; #endif long seglen[RCU_CBLIST_NSEGS]; u8 flags; }; So now, it would need to be like this? struct rcu_segcblist { struct rcu_head *head; struct rcu_head **tails[RCU_CBLIST_NSEGS]; unsigned long gp_seq[RCU_CBLIST_NSEGS]; #ifdef CONFIG_RCU_NOCB_CPU struct rcu_head *lazy_head; struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS]; unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS]; atomic_long_t lazy_len; #else long len; #endif long seglen[RCU_CBLIST_NSEGS]; u8 flags; }; > And on devices with few CPUs, wouldn't you make do with one rcuoc kthread > for the system, at least in the common case where there is no callback > flooding? When Rushikesh tried to reduce the number of callback threads, he did not see much improvement in power. I don't think the additional wake ups of extra rcuoc threads makes too much difference - the big power improvement comes from not even kicking rcu_preempt / rcuog threads... >>> o Does call_rcu_lazy() do anything to ensure that additional grace >>> periods that exist only for the benefit of lazy callbacks are >>> maximally shared among all CPUs' lazy callbacks? If so, what? >>> (Keeping in mind that failing to share such grace periods >>> burns battery lifetime.) >> >> I could be missing your point, can you give example of how you want the >> behavior to be? > > CPU 0 has 55 lazy callbacks and one non-lazy callback. So the system > does a grace-period computation and CPU 0 wakes up its rcuoc kthread. > Given that the full price is begin paid anyway, why not also invoke > those 55 lazy callbacks? > > If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23 > lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks, > again the full price of grace-period computation and rcuoc wakeups is > being paid, so why not also invoke those 26 lazy callbacks? > > On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other > rcuoc kthread, and with the same numbers of callbacks as in the previous > paragraph, you get to make a choice. Do you do an extra wakeup for > CPU 0's rcuoc kthread? Or do you potentially do an extra grace-period > computation some time in the future? > > Suppose further that the CPU that would be awakened for CPU 0's rcuoc > kthread is already non-idle. At that point, this wakeup is quite cheap. > Except that the wakeup-or-don't choice is made at the beginning of the > grace period, and that CPU's idleness at that point might not be all > that well correlated with its idleness at the time of the wakeup at the > end of that grace period. Nevertheless, idleness statistics could > help trade off needless from-idle wakeups for needless grace periods. > > Which sound like a good reason to consolidate rcuoc processing on > non-busy systems. > > But more to the point... 
> > Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that > RCU is otherwise idle. We really want one (not three!) grace periods > to eventually handle those lazy callbacks, right? Yes. I think I understand you now, yes we do want to reduce the number of grace periods. But I think that is an optimization which is not strictly necessary to get the power savings this patch series demonstrates. To get the power savings shown here, we need all the RCU threads to be quiet as much as possible, and not start the rcu_preempt thread's state machine until when necessary. >> I am not thinking of creating separate GP cycles just for lazy CBs. The >> GP cycle will be the same that non-lazy uses. A lazy CB will just record >> what is the current or last GP, and then wait. Once the timer expires, >> it will check what is the current or last GP. If a new GP was not >> started, then it starts a new GP. > > From my perspective, this new GP is in fact a separate GP just for lazy > callbacks. > > What am I missing here? I think my thought for power savings is slightly different from yours. I see lots of power savings when not even going to RCU when system is idle - that appears to be a lot like what the bypass list does - it avoids call_rcu() completely and does an early exit: if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) return; // Enqueued onto ->nocb_bypass, so just leave. I think amortizing cost of grace period computation across lazy CBs may give power savings, but that is secondary to the goal of not even poking RCU (which an early exit like the above does). So maybe we can just talk about that goal first? > > Also, if your timer triggers the check, isn't that an additional potential > wakeup from idle? Would it make sense to also minimize those wakeups? > (This is one motivation for grace periods to pick up lazy callbacks.) > >> If a new GP was started but not >> completed, it can simply call_rcu(). If a new GP started and completed, >> it does not start new GP and just executes CB, something like that. >> Before the flush happens, if multiple lazy CBs were queued in the >> meanwhile and the GP seq counters moved forward, then all those lazy CBs >> will share the same GP (either not starting a new one, or calling >> call_rcu() to share the ongoing on, something like that). > > You could move the ready-to-invoke callbacks to the end of the DONE > segment of ->cblist, assuming that you also adjust rcu_barrier() > appropriately. If rcu_barrier() is rare, you could simply flush whatever > lazy list before entraining the rcu_barrier() callback. Got it. >>> o When there is ample memory, how long are lazy callbacks allowed >>> to sleep? Forever? If not forever, what feeds into computing >>> the timeout? >> >> Yeah, the timeout is fixed currently. I agree this is a good thing to >> optimize. > > First see what the behavior is. If the timeout is long enough that > there is almost never a lazy-only grace period, then maybe a fixed > timeout is good enough. Ok. >>> o What is used to determine when memory is low, so that laziness >>> has become a net negative? >> >> The assumption I think we make is that the shrinkers will constantly >> flush out lazy CBs if there is memory pressure. > > It might be worth looking at SPI. That could help avoid a grace-period > wait when clearing out lazy callbacks. You mean PSI? > >>> o What other conditions, if any, should prompt motivating lazy >>> callbacks? (See above for the start of a grace period motivating >>> lazy callbacks.) 
>>> >>> In short, you need to cast your net pretty wide on this one. It has >>> not yet been very carefully explored, so there are likely to be surprises, >>> maybe even good surprises. ;-) >> >> Cool that sounds like some good opportunity to work on something cool ;-) > > Which comes back around to your point of needing evaluation of extra > RCU work. Lots of tradeoffs between different wakeup sources and grace > periods. Fine-grained views into the black box will be helpful. ;-) Thanks for the conversations. I very much like the idea of using the bypass list, but I am not sure if the current segcblist structure will support both lazy and non-lazy CBs. Basically, if it contains both call_rcu() and call_rcu_lazy() CBs, that means either all of them are lazy or all of them are not, right? Unless we are also talking about making additional changes to the rcu_segcblist struct to accommodate both types of CBs. I don't mind admitting to my illiteracy about callback offloading and bypass lists, but using bypass lists for this problem might improve my literacy nonetheless, so that does motivate me to use them if you can help me out with the above questions :D Thanks, - Joel
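One way to picture the timer-expiry check described above (record the grace-period state when queuing, then decide at flush time whether a grace period is still needed) is a sketch built on the existing polled grace-period API. The per-CPU structure and the two helpers at the bottom are hypothetical:

#include <linux/llist.h>
#include <linux/rcupdate.h>

/* Hypothetical per-CPU lazy batch, for illustration only. */
struct lazy_batch {
	struct llist_head head;
	unsigned long gp_snap;	/* from get_state_synchronize_rcu() */
};

/* At queue time: snapshot the current grace-period state. */
static void lazy_batch_note_gp(struct lazy_batch *lb)
{
	lb->gp_snap = get_state_synchronize_rcu();
}

/*
 * At timer expiry: if a full grace period has already elapsed since the
 * snapshot, the batched callbacks are ready and no new grace period is
 * needed; otherwise hand them to call_rcu() so that they share whatever
 * grace period is (or will be) in flight.
 */
static void lazy_batch_flush(struct lazy_batch *lb)
{
	if (poll_state_synchronize_rcu(lb->gp_snap))
		lazy_batch_invoke(lb);			/* hypothetical helper */
	else
		lazy_batch_requeue_call_rcu(lb);	/* hypothetical helper */
}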
On Mon, May 30, 2022 at 10:12:05PM -0400, Joel Fernandes wrote: > Hi Paul, > > I am sending in one more reply before I sleep ;-) Likewise. ;-) > On 5/30/22 12:42, Paul E. McKenney wrote: > > On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote: > >> Hi Paul, > >> > >> I just setup thunderbird on an old machine just to reply to LKML. > >> Apologies if the formatting is weird. Replies below: > > > > Looks fine at first glance. ;-) > > > >> On 5/28/22 13:57, Paul E. McKenney wrote: > >> [..] > >>>>>>>> + preempt_enable(); > >>>>>>>> + > >>>>>>>> + if (debug_rcu_head_queue((void *)head)) { > >>>>>>>> + // Probable double call_rcu(), just leak. > >>>>>>>> + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > >>>>>>>> + __func__, head); > >>>>>>>> + > >>>>>>>> + // Mark as success and leave. > >>>>>>>> + return; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + // Queue to per-cpu llist > >>>>>>>> + head->func = func; > >>>>>>>> + llist_add(&head->llist_node, &rlp->head); > >>>>>>> > >>>>>>> Suppose that there are a bunch of preemptions between the preempt_enable() > >>>>>>> above and this point, so that the current CPU's list has lots of > >>>>>>> callbacks, but zero ->count. Can that cause a problem? > >>>>>>> > >>>>>>> In the past, this sort of thing has been an issue for rcu_barrier() > >>>>>>> and friends. > >>>>>> > >>>>>> Thanks, I think I dropped the ball on this. You have given me something to > >>>>>> think about. I am thinking on first thought that setting the count in advance > >>>>>> of populating the list should do the trick. I will look more on it. > >>>>> > >>>>> That can work, but don't forget the need for proper memory ordering. > >>>>> > >>>>> Another approach would be to what the current bypass list does, namely > >>>>> count the callback in the ->cblist structure. > >>>> > >>>> I have been thinking about this, and I also arrived at the same conclusion > >>>> that it is easier to make it a segcblist structure which has the functions do > >>>> the memory ordering. Then we can make rcu_barrier() locklessly sample the > >>>> ->len of the new per-cpu structure, I *think* it might work. > >>>> > >>>> However, I was actually considering another situation outside of rcu_barrier, > >>>> where the preemptions you mentioned would cause the list to appear to be > >>>> empty if one were to rely on count, but actually have a lot of stuff queued. > >>>> This causes shrinkers to not know that there is a lot of memory available to > >>>> free. I don't think its that big an issue, if we can assume the caller's > >>>> process context to have sufficient priority. Obviously using a raw spinlock > >>>> makes the process context to be highest priority briefly but without the > >>>> lock, we don't get that. But yeah that can be sorted by just updating the > >>>> count proactively, and then doing the queuing operations. > >>>> > >>>> Another advantage of using the segcblist structure, is, we can also record > >>>> the grace period numbers in it, and avoid triggering RCU from the timer if gp > >>>> already elapsed (similar to the rcu-poll family of functions). > >>>> > >>>> WDYT? > >>> > >>> What you are seeing is that the design space for call_rcu_lazy() is huge, > >>> and the tradeoffs between energy efficiency and complexity are not well > >>> understood. It is therefore necessary to quickly evaluate alternatives. > >>> Yes, you could simply implement all of them and test them, but that > >>> might you take longer than you would like. 
Plus you likely need to > >>> put in place more testing than rcutorture currently provides. > >>> > >>> Given what we have seen thus far, I think that the first step has to > >>> be to evaluate what can be obtained with this approach compared with > >>> what can be obtained with other approaches. What we have right now, > >>> more or less, is this: > >>> > >>> o Offloading buys about 15%. > >>> > >>> o Slowing grace periods buys about another 7%, but requires > >>> userspace changes to prevent degrading user-visible performance. > >> > >> Agreed. > >> > >>> > >>> o The initial call_rcu_lazy() prototype buys about 1%. > >> > >> Well, that depends on the usecase. A busy system saves you less because > >> it is busy anyway, so RCU being quiet does not help much. > >> > >> >From my cover letter, you can see the idle power savings is a whopping > >> 29% with this patch series. Check the PC10 state residencies and the > >> package wattage. > >> > >> When ChromeOS screen is off and user is not doing anything on the > >> Before: > >> Pk%pc10 = 72.13 > >> PkgWatt = 0.58 > >> > >> After: > >> Pk%pc10 = 81.28 > >> PkgWatt = 0.41 > > > > These numbers do make a better case for this. On the other hand, that > > patch series was missing things like rcu_barrier() support and might also > > be converting more call_rcu() invocations to call_rcu_lazy() invocations. > > It is therefore necessary to take these numbers with a grain of salt, > > much though I hope that they turn out to be realistic. > > Makes sense, thank you. Actually the promising savings showed up when > also doing jiffies_till_first_fqs increases, so that's why I did the > prototype of call_rcu_lazy to prove that we can achieve same thing > instead of slowing GP for everything. I believe the savings comes from > just not kicking the RCU GP machinery at all. Agreed on reducing the number of grace periods. After all, that is the entire purpose of the earlier boot-parameter changes that slowed down the grace periods. But the other side of this same coin is that if the grace-period machinery is being kicked anyway by some non-lazy callback, then making the lazy callbacks take advantage of that forced grace period might well save an entire grace period later on. Take advantage of the one that is happening anyway to possibly avoid having to do a full grace period solely for the benefit of lazy callbacks later on. > Agreed on the grain of salt part. Only a test of a final design can tell > us more accurately what the savings are. With careful experiment design, we can get bounds earlier. > >>> > >>> True, this is not apples to apples because the first two were > >>> measured on ChromeOS and this one on Android. > >> > >> Yes agreed. I think Vlad is not seeing any jaw dropping savings, because > >> he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to > >> also check if rcu callbacks are indeed not being quued / quiet after > >> applying the patch, and if not then there may still be some call_rcu()s > >> that need conversion to lazy, on his setup. For example, he might have a > >> driver on his ARM platform that's queing RCU CBs constantly. > >> > >> I was thinking that perhaps a debug patch can help quickly nail down > >> such RCU noise, in the way of a warning. Of course, debug CONFIG should > >> never be enabled in production, so it would be something along the lines > >> of how lockdep is used for debugging. > > > > The trick would be figuring out which of the callbacks are noise. 
> > I suspect that userspace help would be required. Or maybe just leverage > > the existing event traces and let userspace sort it out. > > Ok. > > >>> Which means that > >>> apples to apples evaluation is also required. But this is the > >>> information I currently have at hand, and it is probably no more > >>> than a factor of two off of what would be seen on ChromeOS. > >>> > >>> Or is there some ChromeOS data that tells a different story? > >>> After all, for all I know, Android might still be expediting > >>> all normal grace periods. > >>> > >>> At which point, the question becomes "how to make up that 7%?" After all, > >>> it is not likely that anyone is going to leave that much battery lifetime > >>> on the table. Here are the possibilities that I currently see: > >>> > >>> o Slow down grace periods, and also bite the bullet and make > >>> userspace changes to speed up the RCU grace periods during > >>> critical operations. > >> > >> We tried this, and tracing suffers quiet a lot. The system also felt > >> "sluggish" which I suspect is because of synchronize_rcu() slow downs in > >> other paths. > > > > In what way does tracing suffer? Removing tracepoints? > > Yes. Start and stopping function tracer goes from 5 seconds to like 30 > seconds or so. > > > And yes, this approach absolutely requires finding code paths with > > user-visible grace periods. Which sounds like you tried the "Slow down > > grace periods" part but not the "bit the bullet" part. ;-) > > True, I got so scared looking at the performance that I just decided to > play it safe and do selective call_rcu() lazifying, than making the > everything lazy. For what definition of "safe", exactly? ;-) > >>> But given this, one question is "What is the most you can possibly > >>> obtain from call_rcu_lazy()?" If you have a platform with enough > >>> memory, one way to obtain an upper bound is to make call_rcu_lazy() > >>> simply leak the memory. If that amount of savings does not meet the > >>> need, then other higher-level approaches will be needed, whether > >>> as alternatives or as additions to the call_rcu_lazy() approach. > >>> > >>> Should call_rcu_lazy() prove to be part of the mix, here are a (very) > >>> few of the tradeoffs: > >>> > >>> o Put lazy callbacks on their own list or not? > >>> > >>> o If not, use the bypass list? If so, is it the > >>> case that call_rcu_lazy() is just call_rcu() on > >>> non-rcu_nocbs CPUs? > >>> > >>> Or add the complexity required to use the bypass > >>> list on rcu_nocbs CPUs but to use ->cblist otherwise? > >> > >> For bypass list, I am kind of reluctant because the "flush time" of the > >> bypass list may not match the laziness we seek? For example, I want to > >> allow user to configure to flush the CBs only after 15 seconds or so, if > >> the CB list does not grow too big. That might hurt folks using the > >> bypass list, but requiring a quicker response. > >> > >> Also, does doing so not prevent usage of lazy CBs on systems without > >> NOCB? So if we want to future-proof this, I guess that might not be a > >> good decision. > > > > True enough, but would this future actually arrive? After all, if > > someone cared enough about energy efficiency to use call_rcu_lazy(), > > why wouldn't they also offload callbacks? > > I am not sure, but I also don't mind making it depend on NOCB for now > (see below). Very good. 
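For completeness, the "upper bound" experiment quoted above is nearly a one-liner; this is purely a measurement hack (it leaks every callback) and obviously not something to merge:

/*
 * Measurement-only sketch: never invoke (and never free) lazy callbacks.
 * This deliberately leaks memory, but bounds the energy savings that an
 * infinitely lazy call_rcu_lazy() could ever provide.
 */
void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
{
	(void)head;
	(void)func;
}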
> > On the flush-time mismatch, if there are any non-lazy callbacks in the > > list, it costs you nothing to let the lazy callbacks tag along through > > the grace period. So one approach would be to use the current flush > > time if there are non-lazy callbacks, but use the longer flush time if > > all of the callbacks in the list are lazy callbacks. > > Cool! > > >>> o If so: > >>> > >>> o Use segmented list with marked grace periods? > >>> Keep in mind that such a list can track only > >>> two grace periods. > >>> > >>> o Use a plain list and have grace-period start > >>> simply drain the list? > >> > >> I want to use the segmented list, regardless of whether we use the > >> bypass list or not, because we get those memory barriers and > >> rcu_barrier() lockless sampling of ->len, for free :). > > > > The bypass list also gets you the needed memory barriers and lockless > > sampling of ->len. As does any other type of list as long as the > > ->cblist.len field accounts for the lazy callbacks. > > > > So the main difference is the tracking of grace periods, and even then, > > only those grace periods for which this CPU has no non-lazy callbacks. > > Or, in the previously discussed case where a single rcuoc kthread serves > > multiple CPUs, only those grace periods for which this group of CPUs > > has no non-lazy callbacks. > > Unless we assume that everything in the bypass list is lazy, can we > really use it to track both lazy and non lazy CBs, and grace periods? Why not just count the number of lazy callbacks in the bypass list? There is already a count of the total number of callbacks in ->nocb_bypass.len. This list is protected by a lock, so protect the count of lazy callbacks with that same lock. Then just compare the number of lazy callbacks to rcu_cblist_n_cbs(&rdp->nocb_bypass). If they are equal, all callbacks are lazy. What am I missing here? > Currently the struct looks like this: > > struct rcu_segcblist { > struct rcu_head *head; > struct rcu_head **tails[RCU_CBLIST_NSEGS]; > unsigned long gp_seq[RCU_CBLIST_NSEGS]; > #ifdef CONFIG_RCU_NOCB_CPU > atomic_long_t len; > #else > long len; > #endif > long seglen[RCU_CBLIST_NSEGS]; > u8 flags; > }; > > So now, it would need to be like this? > > struct rcu_segcblist { > struct rcu_head *head; > struct rcu_head **tails[RCU_CBLIST_NSEGS]; > unsigned long gp_seq[RCU_CBLIST_NSEGS]; > #ifdef CONFIG_RCU_NOCB_CPU > struct rcu_head *lazy_head; > struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS]; > unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS]; > atomic_long_t lazy_len; > #else > long len; > #endif > long seglen[RCU_CBLIST_NSEGS]; > u8 flags; > }; I freely confess that I am not loving this arrangement. Large increase in state space, but little benefit that I can see. Again, what am I missing here? > > And on devices with few CPUs, wouldn't you make do with one rcuoc kthread > > for the system, at least in the common case where there is no callback > > flooding? > > When Rushikesh tried to reduce the number of callback threads, he did > not see much improvement in power. I don't think the additional wake ups > of extra rcuoc threads makes too much difference - the big power > improvement comes from not even kicking rcu_preempt / rcuog threads... I suspect that this might come into play later on. It is often the case that a given technique's effectiveness depends on the starting point. 
> >>> o Does call_rcu_lazy() do anything to ensure that additional grace > >>> periods that exist only for the benefit of lazy callbacks are > >>> maximally shared among all CPUs' lazy callbacks? If so, what? > >>> (Keeping in mind that failing to share such grace periods > >>> burns battery lifetime.) > >> > >> I could be missing your point, can you give example of how you want the > >> behavior to be? > > > > CPU 0 has 55 lazy callbacks and one non-lazy callback. So the system > > does a grace-period computation and CPU 0 wakes up its rcuoc kthread. > > Given that the full price is begin paid anyway, why not also invoke > > those 55 lazy callbacks? > > > > If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23 > > lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks, > > again the full price of grace-period computation and rcuoc wakeups is > > being paid, so why not also invoke those 26 lazy callbacks? > > > > On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other > > rcuoc kthread, and with the same numbers of callbacks as in the previous > > paragraph, you get to make a choice. Do you do an extra wakeup for > > CPU 0's rcuoc kthread? Or do you potentially do an extra grace-period > > computation some time in the future? > > > > Suppose further that the CPU that would be awakened for CPU 0's rcuoc > > kthread is already non-idle. At that point, this wakeup is quite cheap. > > Except that the wakeup-or-don't choice is made at the beginning of the > > grace period, and that CPU's idleness at that point might not be all > > that well correlated with its idleness at the time of the wakeup at the > > end of that grace period. Nevertheless, idleness statistics could > > help trade off needless from-idle wakeups for needless grace periods. > > > > Which sound like a good reason to consolidate rcuoc processing on > > non-busy systems. > > > > But more to the point... > > > > Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that > > RCU is otherwise idle. We really want one (not three!) grace periods > > to eventually handle those lazy callbacks, right? > > Yes. I think I understand you now, yes we do want to reduce the number > of grace periods. But I think that is an optimization which is not > strictly necessary to get the power savings this patch series > demonstrates. To get the power savings shown here, we need all the RCU > threads to be quiet as much as possible, and not start the rcu_preempt > thread's state machine until when necessary. I believe that it will be both an optimization and a simplification. Again, if you have to start the grace-period machinery anyway, you might as well take care of the lazy callbacks while you are at it. > >> I am not thinking of creating separate GP cycles just for lazy CBs. The > >> GP cycle will be the same that non-lazy uses. A lazy CB will just record > >> what is the current or last GP, and then wait. Once the timer expires, > >> it will check what is the current or last GP. If a new GP was not > >> started, then it starts a new GP. > > > > From my perspective, this new GP is in fact a separate GP just for lazy > > callbacks. > > > > What am I missing here? > > I think my thought for power savings is slightly different from yours. 
I > see lots of power savings when not even going to RCU when system is idle > - that appears to be a lot like what the bypass list does - it avoids > call_rcu() completely and does an early exit: > > if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) > return; // Enqueued onto ->nocb_bypass, so just leave. > > I think amortizing cost of grace period computation across lazy CBs may > give power savings, but that is secondary to the goal of not even poking > RCU (which an early exit like the above does). So maybe we can just talk > about that goal first? You completely lost me on this one. My point is not to amortize the cost of grace-period computation across the lazy CBs. My point is instead to completely avoid doing any extra grace-period computation for the lazy CBs in cases where grace-period computations are happening anyway. Of course, if there are no non-lazy callbacks, then there won't be any happening-anyway grace periods, and then, yes, there will eventually need to be a grace period specially for the benefit of the lazy callbacks. And taking this piecewise is unlikely to give good results, especially given that workloads evolve over time, much though I understand your desire for doing this one piece at a time. > > Also, if your timer triggers the check, isn't that an additional potential > > wakeup from idle? Would it make sense to also minimize those wakeups? > > (This is one motivation for grace periods to pick up lazy callbacks.) > > > >> If a new GP was started but not > >> completed, it can simply call_rcu(). If a new GP started and completed, > >> it does not start new GP and just executes CB, something like that. > >> Before the flush happens, if multiple lazy CBs were queued in the > >> meanwhile and the GP seq counters moved forward, then all those lazy CBs > >> will share the same GP (either not starting a new one, or calling > >> call_rcu() to share the ongoing on, something like that). > > > > You could move the ready-to-invoke callbacks to the end of the DONE > > segment of ->cblist, assuming that you also adjust rcu_barrier() > > appropriately. If rcu_barrier() is rare, you could simply flush whatever > > lazy list before entraining the rcu_barrier() callback. > > Got it. > > >>> o When there is ample memory, how long are lazy callbacks allowed > >>> to sleep? Forever? If not forever, what feeds into computing > >>> the timeout? > >> > >> Yeah, the timeout is fixed currently. I agree this is a good thing to > >> optimize. > > > > First see what the behavior is. If the timeout is long enough that > > there is almost never a lazy-only grace period, then maybe a fixed > > timeout is good enough. > > Ok. > > >>> o What is used to determine when memory is low, so that laziness > >>> has become a net negative? > >> > >> The assumption I think we make is that the shrinkers will constantly > >> flush out lazy CBs if there is memory pressure. > > > > It might be worth looking at SPI. That could help avoid a grace-period > > wait when clearing out lazy callbacks. > > You mean PSI? Yes, pressure stall information. Apologies for my confusion! > >>> o What other conditions, if any, should prompt motivating lazy > >>> callbacks? (See above for the start of a grace period motivating > >>> lazy callbacks.) > >>> > >>> In short, you need to cast your net pretty wide on this one. It has > >>> not yet been very carefully explored, so there are likely to be surprises, > >>> maybe even good surprises. 
;-) > >> > >> Cool that sounds like some good opportunity to work on something cool ;-) > > > > Which comes back around to your point of needing evaluation of extra > > RCU work. Lots of tradeoffs between different wakeup sources and grace > > periods. Fine-grained views into the black box will be helpful. ;-) > > Thanks for the conversations. I very much like the idea of using the bypass > list, but I am not sure if the current segcblist structure will support > both lazy and non-lazy CBs. Basically, if it contains both call_rcu() > and call_rcu_lazy() CBs, that means either all of them are lazy or all > of them are not, right? Unless we are also talking about making > additional changes to the rcu_segcblist struct to accommodate both types of CBs. Do you need to do anything different to lazy callbacks once they have been associated with a grace period that has been started? Your current patch treats them the same. If that approach continues to work, then why do you need to track laziness downstream? > I don't mind admitting to my illiteracy about callback offloading and > bypass lists, but using bypass lists for this problem might improve my > literacy nonetheless, so that does motivate me to use them if you can > help me out with the above questions :D More detail on the grace-period machinery will also be helpful to you. ;-) Thanx, Paul
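Here is a minimal sketch of the lazy-callback counting described in this message. The ->nocb_bypass list, rcu_cblist_enqueue(), and rcu_cblist_n_cbs() are existing pieces; the lazy_len field and the helper names are assumptions, and the usual ->cblist.len accounting is left unchanged and elided:

/* Sketch: lazy_len would be a new field in struct rcu_data, protected by
 * the same lock that already guards ->nocb_bypass. */

static void bypass_enqueue(struct rcu_data *rdp, struct rcu_head *rhp, bool lazy)
{
	/* Caller holds the ->nocb_bypass lock. */
	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
	if (lazy)
		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
}

/* True if every callback currently in the bypass list is lazy. */
static bool bypass_all_lazy(struct rcu_data *rdp)
{
	return READ_ONCE(rdp->lazy_len) == rcu_cblist_n_cbs(&rdp->nocb_bypass);
}

/* Any flush of ->nocb_bypass into ->cblist resets the count with it. */
static void bypass_note_flush(struct rcu_data *rdp)
{
	WRITE_ONCE(rdp->lazy_len, 0);
}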
On 5/31/22 00:26, Paul E. McKenney wrote: [...] >>>>> Which means that >>>>> apples to apples evaluation is also required. But this is the >>>>> information I currently have at hand, and it is probably no more >>>>> than a factor of two off of what would be seen on ChromeOS. >>>>> >>>>> Or is there some ChromeOS data that tells a different story? >>>>> After all, for all I know, Android might still be expediting >>>>> all normal grace periods. >>>>> >>>>> At which point, the question becomes "how to make up that 7%?" After all, >>>>> it is not likely that anyone is going to leave that much battery lifetime >>>>> on the table. Here are the possibilities that I currently see: >>>>> >>>>> o Slow down grace periods, and also bite the bullet and make >>>>> userspace changes to speed up the RCU grace periods during >>>>> critical operations. >>>> >>>> We tried this, and tracing suffers quiet a lot. The system also felt >>>> "sluggish" which I suspect is because of synchronize_rcu() slow downs in >>>> other paths. >>> >>> In what way does tracing suffer? Removing tracepoints? >> >> Yes. Start and stopping function tracer goes from 5 seconds to like 30 >> seconds or so. >> >>> And yes, this approach absolutely requires finding code paths with >>> user-visible grace periods. Which sounds like you tried the "Slow down >>> grace periods" part but not the "bit the bullet" part. ;-) >> >> True, I got so scared looking at the performance that I just decided to >> play it safe and do selective call_rcu() lazifying, than making the >> everything lazy. > > For what definition of "safe", exactly? ;-) I mean the usual, like somebody complaining performance issues crept up in their workload :) > >>> On the flush-time mismatch, if there are any non-lazy callbacks in the >>> list, it costs you nothing to let the lazy callbacks tag along through >>> the grace period. So one approach would be to use the current flush >>> time if there are non-lazy callbacks, but use the longer flush time if >>> all of the callbacks in the list are lazy callbacks. >> >> Cool! >> >>>>> o If so: >>>>> >>>>> o Use segmented list with marked grace periods? >>>>> Keep in mind that such a list can track only >>>>> two grace periods. >>>>> >>>>> o Use a plain list and have grace-period start >>>>> simply drain the list? >>>> >>>> I want to use the segmented list, regardless of whether we use the >>>> bypass list or not, because we get those memory barriers and >>>> rcu_barrier() lockless sampling of ->len, for free :). >>> >>> The bypass list also gets you the needed memory barriers and lockless >>> sampling of ->len. As does any other type of list as long as the >>> ->cblist.len field accounts for the lazy callbacks. >>> >>> So the main difference is the tracking of grace periods, and even then, >>> only those grace periods for which this CPU has no non-lazy callbacks. >>> Or, in the previously discussed case where a single rcuoc kthread serves >>> multiple CPUs, only those grace periods for which this group of CPUs >>> has no non-lazy callbacks. >> >> Unless we assume that everything in the bypass list is lazy, can we >> really use it to track both lazy and non lazy CBs, and grace periods? > > Why not just count the number of lazy callbacks in the bypass list? > There > is already a count of the total number of callbacks in ->nocb_bypass.len. > This list is protected by a lock, so protect the count of lazy callbacks > with that same lock. That's a good idea, I can try that. 
Then just compare the number of lazy callbacks to > rcu_cblist_n_cbs(&rdp->nocb_bypass). If they are equal, all callbacks > are lazy. > > What am I missing here? There could be more issues that are incompatible with bypass lists, but nothing that I feel cannot be worked around. Example: 1. Say 5 lazy CBs are queued onto the bypass list (while the regular cblist is empty). 2. Now say 10000 non-lazy CBs are queued. As per the comments, these have to go to the bypass list to keep rcu_barrier() from breaking. 3. Because this causes the bypass list to overflow, all the lazy + non-lazy CBs have to be flushed to the main ->cblist. If only the non-lazy CBs are flushed, rcu_barrier() might break. If all are flushed, then the lazy ones lose their laziness property, as RCU will be immediately kicked off to process GPs on their behalf. This can be fixed by making rcu_barrier() queue both a lazy and a non-lazy CB, and only flushing the non-lazy CBs to the ->cblist on a bypass-list overflow, I think. Or, we flush both lazy and non-lazy CBs to the ->cblist just to keep it simple. I think that should be OK, since if there are a lot of CBs queued in a short time, I don't think there is much opportunity for power savings anyway IMHO. >> Currently the struct looks like this: >> >> struct rcu_segcblist { >> struct rcu_head *head; >> struct rcu_head **tails[RCU_CBLIST_NSEGS]; >> unsigned long gp_seq[RCU_CBLIST_NSEGS]; >> #ifdef CONFIG_RCU_NOCB_CPU >> atomic_long_t len; >> #else >> long len; >> #endif >> long seglen[RCU_CBLIST_NSEGS]; >> u8 flags; >> }; >> >> So now, it would need to be like this? >> >> struct rcu_segcblist { >> struct rcu_head *head; >> struct rcu_head **tails[RCU_CBLIST_NSEGS]; >> unsigned long gp_seq[RCU_CBLIST_NSEGS]; >> #ifdef CONFIG_RCU_NOCB_CPU >> struct rcu_head *lazy_head; >> struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS]; >> unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS]; >> atomic_long_t lazy_len; >> #else >> long len; >> #endif >> long seglen[RCU_CBLIST_NSEGS]; >> u8 flags; >> }; > > I freely confess that I am not loving this arrangement. Large increase > in state space, but little benefit that I can see. Again, what am I > missing here? I somehow thought that tracking GPs separately for the lazy CBs would require duplicating the rcu_head pointers/double-pointers in this struct. As you pointed out, just tracking the lazy len may be sufficient. > >>> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread >>> for the system, at least in the common case where there is no callback >>> flooding? >> >> When Rushikesh tried to reduce the number of callback threads, he did >> not see much improvement in power. I don't think the additional wake ups >> of extra rcuoc threads makes too much difference - the big power >> improvement comes from not even kicking rcu_preempt / rcuog threads... > > I suspect that this might come into play later on. It is often the case > that a given technique's effectiveness depends on the starting point. > >>>>> o Does call_rcu_lazy() do anything to ensure that additional grace >>>>> periods that exist only for the benefit of lazy callbacks are >>>>> maximally shared among all CPUs' lazy callbacks? If so, what? >>>>> (Keeping in mind that failing to share such grace periods >>>>> burns battery lifetime.) >>>> >>>> I could be missing your point, can you give example of how you want the >>>> behavior to be? >>> >>> CPU 0 has 55 lazy callbacks and one non-lazy callback. So the system >>> does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
>>> Given that the full price is begin paid anyway, why not also invoke >>> those 55 lazy callbacks? >>> >>> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23 >>> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks, >>> again the full price of grace-period computation and rcuoc wakeups is >>> being paid, so why not also invoke those 26 lazy callbacks? >>> >>> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other >>> rcuoc kthread, and with the same numbers of callbacks as in the previous >>> paragraph, you get to make a choice. Do you do an extra wakeup for >>> CPU 0's rcuoc kthread? Or do you potentially do an extra grace-period >>> computation some time in the future? >>> >>> Suppose further that the CPU that would be awakened for CPU 0's rcuoc >>> kthread is already non-idle. At that point, this wakeup is quite cheap. >>> Except that the wakeup-or-don't choice is made at the beginning of the >>> grace period, and that CPU's idleness at that point might not be all >>> that well correlated with its idleness at the time of the wakeup at the >>> end of that grace period. Nevertheless, idleness statistics could >>> help trade off needless from-idle wakeups for needless grace periods. >>> >>> Which sound like a good reason to consolidate rcuoc processing on >>> non-busy systems. >>> >>> But more to the point... >>> >>> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that >>> RCU is otherwise idle. We really want one (not three!) grace periods >>> to eventually handle those lazy callbacks, right? >> >> Yes. I think I understand you now, yes we do want to reduce the number >> of grace periods. But I think that is an optimization which is not >> strictly necessary to get the power savings this patch series >> demonstrates. To get the power savings shown here, we need all the RCU >> threads to be quiet as much as possible, and not start the rcu_preempt >> thread's state machine until when necessary. > > I believe that it will be both an optimization and a simplification. > Again, if you have to start the grace-period machinery anyway, you might > as well take care of the lazy callbacks while you are at it. Sure, sounds good. I agree with sharing grace periods instead of starting a new one.. >>>> I am not thinking of creating separate GP cycles just for lazy CBs. The >>>> GP cycle will be the same that non-lazy uses. A lazy CB will just record >>>> what is the current or last GP, and then wait. Once the timer expires, >>>> it will check what is the current or last GP. If a new GP was not >>>> started, then it starts a new GP. >>> >>> From my perspective, this new GP is in fact a separate GP just for lazy >>> callbacks. >>> >>> What am I missing here? >> >> I think my thought for power savings is slightly different from yours. I >> see lots of power savings when not even going to RCU when system is idle >> - that appears to be a lot like what the bypass list does - it avoids >> call_rcu() completely and does an early exit: >> >> if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) >> return; // Enqueued onto ->nocb_bypass, so just leave. >> >> I think amortizing cost of grace period computation across lazy CBs may >> give power savings, but that is secondary to the goal of not even poking >> RCU (which an early exit like the above does). So maybe we can just talk >> about that goal first? > > You completely lost me on this one. My point is not to amortize the cost > of grace-period computation across the lazy CBs. 
My point is instead > to completely avoid doing any extra grace-period computation for the > lazy CBs in cases where grace-period computations are happening anyway. > Of course, if there are no non-lazy callbacks, then there won't be > any happening-anyway grace periods, and then, yes, there will eventually > need to be a grace period specially for the benefit of the lazy > callbacks. > > And taking this piecewise is unlikely to give good results, especially > given that workloads evolve over time, much though I understand your > desire for doing this one piece at a time. Sure ok, I agree with you on that. >>> Also, if your timer triggers the check, isn't that an additional potential >>> wakeup from idle? Would it make sense to also minimize those wakeups? >>> (This is one motivation for grace periods to pick up lazy callbacks.) >>> >>>> If a new GP was started but not >>>> completed, it can simply call_rcu(). If a new GP started and completed, >>>> it does not start new GP and just executes CB, something like that. >>>> Before the flush happens, if multiple lazy CBs were queued in the >>>> meanwhile and the GP seq counters moved forward, then all those lazy CBs >>>> will share the same GP (either not starting a new one, or calling >>>> call_rcu() to share the ongoing on, something like that). >>> >>> You could move the ready-to-invoke callbacks to the end of the DONE >>> segment of ->cblist, assuming that you also adjust rcu_barrier() >>> appropriately. If rcu_barrier() is rare, you could simply flush whatever >>> lazy list before entraining the rcu_barrier() callback. >> >> Got it. >> >>>>> o When there is ample memory, how long are lazy callbacks allowed >>>>> to sleep? Forever? If not forever, what feeds into computing >>>>> the timeout? >>>> >>>> Yeah, the timeout is fixed currently. I agree this is a good thing to >>>> optimize. >>> >>> First see what the behavior is. If the timeout is long enough that >>> there is almost never a lazy-only grace period, then maybe a fixed >>> timeout is good enough. >> >> Ok. >> >>>>> o What is used to determine when memory is low, so that laziness >>>>> has become a net negative? >>>> >>>> The assumption I think we make is that the shrinkers will constantly >>>> flush out lazy CBs if there is memory pressure. >>> >>> It might be worth looking at SPI. That could help avoid a grace-period >>> wait when clearing out lazy callbacks. >> >> You mean PSI? > > Yes, pressure stall information. Apologies for my confusion! Ok, I will look into it, thanks. >>>>> o What other conditions, if any, should prompt motivating lazy >>>>> callbacks? (See above for the start of a grace period motivating >>>>> lazy callbacks.) >>>>> >>>>> In short, you need to cast your net pretty wide on this one. It has >>>>> not yet been very carefully explored, so there are likely to be surprises, >>>>> maybe even good surprises. ;-) >>>> >>>> Cool that sounds like some good opportunity to work on something cool ;-) >>> >>> Which comes back around to your point of needing evaluation of extra >>> RCU work. Lots of tradeoffs between different wakeup sources and grace >>> periods. Fine-grained views into the black box will be helpful. ;-) >> >> Thanks for the conversations. I very much like the idea of using bypass >> list, but I am not sure if the current segcblist structure will support >> both lazy and non-lazy CBs. Basically, if it contains both call_rcu() >> and call_rcu_lazy() CBs, that means either all of them are lazy or all >> of them are not right? 
Unless we are also talking about making >> additional changes to the rcu_segcblist struct to accommodate both types of CBs. > > Do you need to do anything different to lazy callbacks once they have > been associated with a grace period that has been started? Your current > patch treats them the same. If that approach continues to work, then > why do you need to track laziness downstream? Yes, I am rather confused on this part. You are saying that just tracking the "lazy length" is sufficient to avoid kicking off the RCU machinery, and that knowing exactly which CBs are lazy and which are not is not needed, right? If I misunderstood that, could you clarify what you mean by "track laziness downstream"? Thanks, - Joel
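And for the overflow case in the example above, the "just flush everything" option could be as small as the following sketch, patterned on existing bypass-flush helpers (rcu_cblist_flush_enqueue() and rcu_segcblist_insert_pend_cbs()). The lazy_len reset refers to the hypothetical counter from the earlier sketch, and ->cblist.len is assumed to already account for bypass callbacks, as it does today:

/*
 * Sketch: on bypass-list overflow, move all bypass callbacks (lazy and
 * non-lazy alike) into ->cblist so that rcu_barrier() ordering still
 * holds, and clear the hypothetical lazy count along the way.
 */
static void bypass_overflow_flush(struct rcu_data *rdp)
{
	struct rcu_cblist rcl;

	/* Caller holds the ->nocb_bypass lock and ->nocb_lock. */
	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
	WRITE_ONCE(rdp->lazy_len, 0);
}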
On Tue, May 31, 2022 at 12:11:00PM -0400, Joel Fernandes wrote: > On 5/31/22 00:26, Paul E. McKenney wrote: > [...] > >>>>> Which means that > >>>>> apples to apples evaluation is also required. But this is the > >>>>> information I currently have at hand, and it is probably no more > >>>>> than a factor of two off of what would be seen on ChromeOS. > >>>>> > >>>>> Or is there some ChromeOS data that tells a different story? > >>>>> After all, for all I know, Android might still be expediting > >>>>> all normal grace periods. > >>>>> > >>>>> At which point, the question becomes "how to make up that 7%?" After all, > >>>>> it is not likely that anyone is going to leave that much battery lifetime > >>>>> on the table. Here are the possibilities that I currently see: > >>>>> > >>>>> o Slow down grace periods, and also bite the bullet and make > >>>>> userspace changes to speed up the RCU grace periods during > >>>>> critical operations. > >>>> > >>>> We tried this, and tracing suffers quiet a lot. The system also felt > >>>> "sluggish" which I suspect is because of synchronize_rcu() slow downs in > >>>> other paths. > >>> > >>> In what way does tracing suffer? Removing tracepoints? > >> > >> Yes. Start and stopping function tracer goes from 5 seconds to like 30 > >> seconds or so. > >> > >>> And yes, this approach absolutely requires finding code paths with > >>> user-visible grace periods. Which sounds like you tried the "Slow down > >>> grace periods" part but not the "bit the bullet" part. ;-) > >> > >> True, I got so scared looking at the performance that I just decided to > >> play it safe and do selective call_rcu() lazifying, than making the > >> everything lazy. > > > > For what definition of "safe", exactly? ;-) > > I mean the usual, like somebody complaining performance issues crept up > in their workload :) It is not that difficult. Just do something like this: # run a statistically significant set of tracer tests echo 1 > /sys/kernel/rcu_expedited # run a statistically significant set of tracer tests It is quite possible that you will decide that you want grace periods to be expedited during tracing operations even on non-battery-powered systems. > >>> On the flush-time mismatch, if there are any non-lazy callbacks in the > >>> list, it costs you nothing to let the lazy callbacks tag along through > >>> the grace period. So one approach would be to use the current flush > >>> time if there are non-lazy callbacks, but use the longer flush time if > >>> all of the callbacks in the list are lazy callbacks. > >> > >> Cool! > >> > >>>>> o If so: > >>>>> > >>>>> o Use segmented list with marked grace periods? > >>>>> Keep in mind that such a list can track only > >>>>> two grace periods. > >>>>> > >>>>> o Use a plain list and have grace-period start > >>>>> simply drain the list? > >>>> > >>>> I want to use the segmented list, regardless of whether we use the > >>>> bypass list or not, because we get those memory barriers and > >>>> rcu_barrier() lockless sampling of ->len, for free :). > >>> > >>> The bypass list also gets you the needed memory barriers and lockless > >>> sampling of ->len. As does any other type of list as long as the > >>> ->cblist.len field accounts for the lazy callbacks. > >>> > >>> So the main difference is the tracking of grace periods, and even then, > >>> only those grace periods for which this CPU has no non-lazy callbacks. 
> >>> Or, in the previously discussed case where a single rcuoc kthread serves > >>> multiple CPUs, only those grace periods for which this group of CPUs > >>> has no non-lazy callbacks. > >> > >> Unless we assume that everything in the bypass list is lazy, can we > >> really use it to track both lazy and non lazy CBs, and grace periods? > > > > Why not just count the number of lazy callbacks in the bypass list? > > There > > is already a count of the total number of callbacks in ->nocb_bypass.len. > > This list is protected by a lock, so protect the count of lazy callbacks > > with that same lock. > > That's a good idea, I can try that. > > Then just compare the number of lazy callbacks to > > rcu_cblist_n_cbs(&rdp->nocb_bypass). If they are equal, all callbacks > > are lazy. > > > > What am I missing here? > > There could be more issues that are incompatible with bypass lists, but > nothing that I feel cannot be worked around. Good! > Example: > 1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is > empty). > 2. Now say 10000 non-lazy CBs are queued. As per the comments, these > have to go to the bypass list to keep rcu_barrier() from breaking. > 3. Because this causes the bypass list to overflow, all the lazy + > non-lazy CBs have to flushed to the main -cblist. > > If only the non-lazy CBs are flushed, rcu_barrier() might break. If all > are flushed, then the lazy ones lose their laziness property as RCU will > be immediately kicked off to process GPs on their behalf. Exactly why is this loss of laziness a problem? You are doing that grace period for the 10,000 non-lazy callbacks anyway, so what difference could the five non-lazy callbacks possibly make? > This can fixed by making rcu_barrier() queue both a lazy and non-lazy > CB, and only flushing the non-lazy CBs on a bypass list overflow, to the > ->cblist, I think. I don't see anything that needs fixing. If you are doing a grace period anyway, just process the lazy callbacks along with the non-lazy callbacks. After all, you are paying for that grace period anyway. And handling the lazy callbacks with that grace period means that you don't need a later grace period for those five lazy callbacks. So running the lazy callbacks into the grace period required by the non-lazy callbacks is a pure win, right? If it is not a pure win, please explain exactly what is being lost. > Or, we flush both -lazy and non-lazy CBs to the ->cblist just to keep it > simple. I think that should be OK since if there are a lot of CBs queued > in a short time, I don't think there is much opportunity for power > savings anyway IMHO. I believe that it will be simpler, faster, and more energy efficient to do it this way, flushing everything from the bypass list to ->cblist. Again, leaving the lazy callbacks lying around means that there must be a later battery-draining grace period that might not be required otherwise. > >> Currently the struct looks like this: > >> > >> struct rcu_segcblist { > >> struct rcu_head *head; > >> struct rcu_head **tails[RCU_CBLIST_NSEGS]; > >> unsigned long gp_seq[RCU_CBLIST_NSEGS]; > >> #ifdef CONFIG_RCU_NOCB_CPU > >> atomic_long_t len; > >> #else > >> long len; > >> #endif > >> long seglen[RCU_CBLIST_NSEGS]; > >> u8 flags; > >> }; > >> > >> So now, it would need to be like this? 
> >> > >> struct rcu_segcblist { > >> struct rcu_head *head; > >> struct rcu_head **tails[RCU_CBLIST_NSEGS]; > >> unsigned long gp_seq[RCU_CBLIST_NSEGS]; > >> #ifdef CONFIG_RCU_NOCB_CPU > >> struct rcu_head *lazy_head; > >> struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS]; > >> unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS]; > >> atomic_long_t lazy_len; > >> #else > >> long len; > >> #endif > >> long seglen[RCU_CBLIST_NSEGS]; > >> u8 flags; > >> }; > > > > I freely confess that I am not loving this arrangement. Large increase > > in state space, but little benefit that I can see. Again, what am I > > missing here? > > I somehow thought tracking GPs separately for the lazy CBs requires > duplication of the rcu_head pointers/double-points in this struct. As > you pointed, just tracking the lazy len may be sufficient. Here is hoping! After all, if you thought that taking care of applications that need expediting of grace periods is scary, well, now... > >>> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread > >>> for the system, at least in the common case where there is no callback > >>> flooding? > >> > >> When Rushikesh tried to reduce the number of callback threads, he did > >> not see much improvement in power. I don't think the additional wake ups > >> of extra rcuoc threads makes too much difference - the big power > >> improvement comes from not even kicking rcu_preempt / rcuog threads... > > > > I suspect that this might come into play later on. It is often the case > > that a given technique's effectiveness depends on the starting point. > > > >>>>> o Does call_rcu_lazy() do anything to ensure that additional grace > >>>>> periods that exist only for the benefit of lazy callbacks are > >>>>> maximally shared among all CPUs' lazy callbacks? If so, what? > >>>>> (Keeping in mind that failing to share such grace periods > >>>>> burns battery lifetime.) > >>>> > >>>> I could be missing your point, can you give example of how you want the > >>>> behavior to be? > >>> > >>> CPU 0 has 55 lazy callbacks and one non-lazy callback. So the system > >>> does a grace-period computation and CPU 0 wakes up its rcuoc kthread. > >>> Given that the full price is begin paid anyway, why not also invoke > >>> those 55 lazy callbacks? > >>> > >>> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23 > >>> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks, > >>> again the full price of grace-period computation and rcuoc wakeups is > >>> being paid, so why not also invoke those 26 lazy callbacks? > >>> > >>> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other > >>> rcuoc kthread, and with the same numbers of callbacks as in the previous > >>> paragraph, you get to make a choice. Do you do an extra wakeup for > >>> CPU 0's rcuoc kthread? Or do you potentially do an extra grace-period > >>> computation some time in the future? > >>> > >>> Suppose further that the CPU that would be awakened for CPU 0's rcuoc > >>> kthread is already non-idle. At that point, this wakeup is quite cheap. > >>> Except that the wakeup-or-don't choice is made at the beginning of the > >>> grace period, and that CPU's idleness at that point might not be all > >>> that well correlated with its idleness at the time of the wakeup at the > >>> end of that grace period. Nevertheless, idleness statistics could > >>> help trade off needless from-idle wakeups for needless grace periods. 
> >>> > >>> Which sound like a good reason to consolidate rcuoc processing on > >>> non-busy systems. > >>> > >>> But more to the point... > >>> > >>> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that > >>> RCU is otherwise idle. We really want one (not three!) grace periods > >>> to eventually handle those lazy callbacks, right? > >> > >> Yes. I think I understand you now, yes we do want to reduce the number > >> of grace periods. But I think that is an optimization which is not > >> strictly necessary to get the power savings this patch series > >> demonstrates. To get the power savings shown here, we need all the RCU > >> threads to be quiet as much as possible, and not start the rcu_preempt > >> thread's state machine until when necessary. > > > > I believe that it will be both an optimization and a simplification. > > Again, if you have to start the grace-period machinery anyway, you might > > as well take care of the lazy callbacks while you are at it. > > Sure, sounds good. I agree with sharing grace periods instead of > starting a new one.. As you say, sounds good. ;-) > >>>> I am not thinking of creating separate GP cycles just for lazy CBs. The > >>>> GP cycle will be the same that non-lazy uses. A lazy CB will just record > >>>> what is the current or last GP, and then wait. Once the timer expires, > >>>> it will check what is the current or last GP. If a new GP was not > >>>> started, then it starts a new GP. > >>> > >>> From my perspective, this new GP is in fact a separate GP just for lazy > >>> callbacks. > >>> > >>> What am I missing here? > >> > >> I think my thought for power savings is slightly different from yours. I > >> see lots of power savings when not even going to RCU when system is idle > >> - that appears to be a lot like what the bypass list does - it avoids > >> call_rcu() completely and does an early exit: > >> > >> if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) > >> return; // Enqueued onto ->nocb_bypass, so just leave. > >> > >> I think amortizing cost of grace period computation across lazy CBs may > >> give power savings, but that is secondary to the goal of not even poking > >> RCU (which an early exit like the above does). So maybe we can just talk > >> about that goal first? > > > > You completely lost me on this one. My point is not to amortize the cost > > of grace-period computation across the lazy CBs. My point is instead > > to completely avoid doing any extra grace-period computation for the > > lazy CBs in cases where grace-period computations are happening anyway. > > Of course, if there are no non-lazy callbacks, then there won't be > > any happening-anyway grace periods, and then, yes, there will eventually > > need to be a grace period specially for the benefit of the lazy > > callbacks. > > > > And taking this piecewise is unlikely to give good results, especially > > given that workloads evolve over time, much though I understand your > > desire for doing this one piece at a time. > > Sure ok, I agree with you on that. Very good! > >>> Also, if your timer triggers the check, isn't that an additional potential > >>> wakeup from idle? Would it make sense to also minimize those wakeups? > >>> (This is one motivation for grace periods to pick up lazy callbacks.) > >>> > >>>> If a new GP was started but not > >>>> completed, it can simply call_rcu(). If a new GP started and completed, > >>>> it does not start new GP and just executes CB, something like that. 
> >>>> Before the flush happens, if multiple lazy CBs were queued in the > >>>> meanwhile and the GP seq counters moved forward, then all those lazy CBs > >>>> will share the same GP (either not starting a new one, or calling > >>>> call_rcu() to share the ongoing on, something like that). > >>> > >>> You could move the ready-to-invoke callbacks to the end of the DONE > >>> segment of ->cblist, assuming that you also adjust rcu_barrier() > >>> appropriately. If rcu_barrier() is rare, you could simply flush whatever > >>> lazy list before entraining the rcu_barrier() callback. > >> > >> Got it. > >> > >>>>> o When there is ample memory, how long are lazy callbacks allowed > >>>>> to sleep? Forever? If not forever, what feeds into computing > >>>>> the timeout? > >>>> > >>>> Yeah, the timeout is fixed currently. I agree this is a good thing to > >>>> optimize. > >>> > >>> First see what the behavior is. If the timeout is long enough that > >>> there is almost never a lazy-only grace period, then maybe a fixed > >>> timeout is good enough. > >> > >> Ok. > >> > >>>>> o What is used to determine when memory is low, so that laziness > >>>>> has become a net negative? > >>>> > >>>> The assumption I think we make is that the shrinkers will constantly > >>>> flush out lazy CBs if there is memory pressure. > >>> > >>> It might be worth looking at SPI. That could help avoid a grace-period > >>> wait when clearing out lazy callbacks. > >> > >> You mean PSI? > > > > Yes, pressure stall information. Apologies for my confusion! > > Ok, I will look into it, thanks. And I bet that RCU should be using it elsewhere as well. > >>>>> o What other conditions, if any, should prompt motivating lazy > >>>>> callbacks? (See above for the start of a grace period motivating > >>>>> lazy callbacks.) > >>>>> > >>>>> In short, you need to cast your net pretty wide on this one. It has > >>>>> not yet been very carefully explored, so there are likely to be surprises, > >>>>> maybe even good surprises. ;-) > >>>> > >>>> Cool that sounds like some good opportunity to work on something cool ;-) > >>> > >>> Which comes back around to your point of needing evaluation of extra > >>> RCU work. Lots of tradeoffs between different wakeup sources and grace > >>> periods. Fine-grained views into the black box will be helpful. ;-) > >> > >> Thanks for the conversations. I very much like the idea of using bypass > >> list, but I am not sure if the current segcblist structure will support > >> both lazy and non-lazy CBs. Basically, if it contains both call_rcu() > >> and call_rcu_lazy() CBs, that means either all of them are lazy or all > >> of them are not right? Unless we are also talking about making > >> additional changes to rcu_segcblist struct to accomodate both types of CBs. > > > > Do you need to do anything different to lazy callbacks once they have > > been associated with a grace period has been started? Your current > > patch treats them the same. If that approach continues to work, then > > why do you need to track laziness downstream? > > Yes I am confused a lot on this part. You are saying just tracking the > "lazy ength" is sufficient to not have to kick off RCU machinery, and > knowing exactly which CBs are lazy and which are not, is not needed > right? If I misunderstood that, could you clarify what you mean by > "track laziness downstream" ? Pretty much. What I am saying is that the concept of laziness is helpful primarily while the callbacks are waiting for a grace period to start. 
Lazy callbacks are much less aggressive about starting new grace periods than are non-lazy callbacks. But once the grace period has started, why treat them differently? From what you said earlier, your testing thus far shows that the biggest energy savings is delaying/merging grace periods rather than the details of callback processing within and after the grace period, right? And even if differences in callback processing are warranted, there are lots of ways to make that happen, many of which do not require propagation of per-callback laziness into and through the grace-period machinery. Thanx, Paul
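A rough sketch of the counting approach suggested in the reply above: keep only a count of lazy callbacks next to the existing bypass list, protected by the same lock, and compare it against rcu_cblist_n_cbs() to decide whether everything queued is lazy. The lazy_len field is hypothetical, not an existing rcu_data member.

// Sketch: laziness tracked only as a count beside ->nocb_bypass.
// rdp->lazy_len is a hypothetical field guarded by ->nocb_bypass_lock.
static bool rcu_nocb_bypass_all_lazy(struct rcu_data *rdp)
{
	bool all_lazy;

	rcu_nocb_bypass_lock(rdp);
	all_lazy = rdp->lazy_len &&
		   rdp->lazy_len == rcu_cblist_n_cbs(&rdp->nocb_bypass);
	rcu_nocb_bypass_unlock(rdp);

	return all_lazy;
}

// Any flush of the bypass list into ->cblist ends the laziness of
// everything that was queued, so the count simply drops to zero.
static void rcu_nocb_bypass_count_flush(struct rcu_data *rdp)
{
	lockdep_assert_held(&rdp->nocb_bypass_lock);
	WRITE_ONCE(rdp->lazy_len, 0);
}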
On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote: [..] > > Example: > > 1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is > > empty). > > 2. Now say 10000 non-lazy CBs are queued. As per the comments, these > > have to go to the bypass list to keep rcu_barrier() from breaking. > > 3. Because this causes the bypass list to overflow, all the lazy + > > non-lazy CBs have to flushed to the main -cblist. > > > > If only the non-lazy CBs are flushed, rcu_barrier() might break. If all > > are flushed, then the lazy ones lose their laziness property as RCU will > > be immediately kicked off to process GPs on their behalf. > > Exactly why is this loss of laziness a problem? You are doing that > grace period for the 10,000 non-lazy callbacks anyway, so what difference > could the five non-lazy callbacks possibly make? It does not make any difference, I kind of answered my own question. I was thinking out loud in this thread (Sorry). > > This can fixed by making rcu_barrier() queue both a lazy and non-lazy > > CB, and only flushing the non-lazy CBs on a bypass list overflow, to the > > ->cblist, I think. > > I don't see anything that needs fixing. If you are doing a grace period > anyway, just process the lazy callbacks along with the non-lazy callbacks. > After all, you are paying for that grace period anyway. And handling > the lazy callbacks with that grace period means that you don't need a > later grace period for those five lazy callbacks. So running the lazy > callbacks into the grace period required by the non-lazy callbacks is > a pure win, right? > > If it is not a pure win, please explain exactly what is being lost. Agreed. As discussed on IRC, we can only care about increment of the lazy length, and the flush will drop it to 1 or 0. No need to design for partial flushing for now as no usecase. > > Or, we flush both -lazy and non-lazy CBs to the ->cblist just to keep it > > simple. I think that should be OK since if there are a lot of CBs queued > > in a short time, I don't think there is much opportunity for power > > savings anyway IMHO. > > I believe that it will be simpler, faster, and more energy efficient to > do it this way, flushing everything from the bypass list to ->cblist. > Again, leaving the lazy callbacks lying around means that there must be a > later battery-draining grace period that might not be required otherwise. Perfect. > > >> Currently the struct looks like this: > > >> > > >> struct rcu_segcblist { > > >> struct rcu_head *head; > > >> struct rcu_head **tails[RCU_CBLIST_NSEGS]; > > >> unsigned long gp_seq[RCU_CBLIST_NSEGS]; > > >> #ifdef CONFIG_RCU_NOCB_CPU > > >> atomic_long_t len; > > >> #else > > >> long len; > > >> #endif > > >> long seglen[RCU_CBLIST_NSEGS]; > > >> u8 flags; > > >> }; > > >> > > >> So now, it would need to be like this? > > >> > > >> struct rcu_segcblist { > > >> struct rcu_head *head; > > >> struct rcu_head **tails[RCU_CBLIST_NSEGS]; > > >> unsigned long gp_seq[RCU_CBLIST_NSEGS]; > > >> #ifdef CONFIG_RCU_NOCB_CPU > > >> struct rcu_head *lazy_head; > > >> struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS]; > > >> unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS]; > > >> atomic_long_t lazy_len; > > >> #else > > >> long len; > > >> #endif > > >> long seglen[RCU_CBLIST_NSEGS]; > > >> u8 flags; > > >> }; > > > > > > I freely confess that I am not loving this arrangement. Large increase > > > in state space, but little benefit that I can see. Again, what am I > > > missing here? 
> > I somehow thought tracking GPs separately for the lazy CBs requires > duplication of the rcu_head pointers/double-pointers in this struct. As > you pointed out, just tracking the lazy len may be sufficient. > > Here is hoping! > > After all, if you thought that taking care of applications that need > expediting of grace periods is scary, well, now... Haha... my fear is that I don't know all the applications requiring expedited GPs, and I keep getting surprised by new RCU usages that pop up in the system, or new systems. For one, a number of tools and processes use ftrace directly in the system, and it may not be practical to chase down every tool. Some of them start tracing randomly in the system. Handling it in-kernel would be best if possible. Productive email discussion indeed! On to writing the code :P Thanks, - Joel
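In other words, the heavily duplicated layout quoted earlier can collapse to a single extra counter. Whether that counter ultimately lives in rcu_segcblist or next to the bypass list in rcu_data is part of the open design, so the lazy_len field below is purely illustrative.

struct rcu_segcblist {
	struct rcu_head *head;
	struct rcu_head **tails[RCU_CBLIST_NSEGS];
	unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
	atomic_long_t len;
	atomic_long_t lazy_len;		// hypothetical: CBs content to wait
#else
	long len;
#endif
	long seglen[RCU_CBLIST_NSEGS];
	u8 flags;
};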
On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote: > On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote: [ . . . ] > > Here is hoping! > > > > After all, if you thought that taking care of applications that need > > expediting of grace periods is scary, well, now... > > Haha... my fear is I don't know all the applications requiring expedited GP > and I keep getting surprised by new RCU usages that pop up in the system, or > new systems. > > For one, a number of tools and processes, use ftrace directly in the system, > and it may not be practical to chase down every tool. Some of them start > tracing randomly in the system. Handling it in-kernel itself would be best if > possible. Shouldn't it be possible to hook into the kernel code that runs when the sysfs tracing files are updated? As in, why not both? ;-) > Productive email discussion indeed! On to writing the code :P Should be fun! ;-) Thanx, Paul
On Tue, May 31, 2022 at 12:25:32PM -0700, Paul E. McKenney wrote: > On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote: > > On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote: > > [ . . . ] > > > > Here is hoping! > > > > > > After all, if you thought that taking care of applications that need > > > expediting of grace periods is scary, well, now... > > > > Haha... my fear is I don't know all the applications requiring expedited GP > > and I keep getting surprised by new RCU usages that pop up in the system, or > > new systems. > > > > For one, a number of tools and processes, use ftrace directly in the system, > > and it may not be practical to chase down every tool. Some of them start > > tracing randomly in the system. Handling it in-kernel itself would be best if > > possible. > > Shouldn't it be possible to hook into the kernel code that runs when > the sysfs tracing files are updated? As in, why not both? ;-) Yes, Point taken on this, let us do both solutions. thanks, - Joel
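A heavily hypothetical sketch of the in-kernel half of "both solutions": bracket tracer start/stop with the existing rcu_expedite_gp()/rcu_unexpedite_gp() nesting counters. The two wrapper functions and their hook points in the tracing core are assumptions; only the rcu_*expedite_gp() calls are existing APIs.

#include <linux/rcupdate.h>

// Hypothetical helpers, to be called from wherever the tracing core
// starts and finishes updating tracepoints (exact hook points TBD).
// rcu_expedite_gp()/rcu_unexpedite_gp() nest, so paired calls are safe.
static void tracing_expedite_rcu_start(void)
{
	rcu_expedite_gp();	// Speed up GPs while tracing is reconfigured.
}

static void tracing_expedite_rcu_end(void)
{
	rcu_unexpedite_gp();	// Back to normal (possibly lazy) behavior.
}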
On Tue, May 31, 2022 at 5:29 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Tue, May 31, 2022 at 12:25:32PM -0700, Paul E. McKenney wrote: > > On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote: > > > On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote: > > > > [ . . . ] > > > > > > Here is hoping! > > > > > > > > After all, if you thought that taking care of applications that need > > > > expediting of grace periods is scary, well, now... > > > > > > Haha... my fear is I don't know all the applications requiring expedited GP > > > and I keep getting surprised by new RCU usages that pop up in the system, or > > > new systems. > > > > > > For one, a number of tools and processes, use ftrace directly in the system, > > > and it may not be practical to chase down every tool. Some of them start > > > tracing randomly in the system. Handling it in-kernel itself would be best if > > > possible. > > > > Shouldn't it be possible to hook into the kernel code that runs when > > the sysfs tracing files are updated? As in, why not both? ;-) > > Yes, Point taken on this, let us do both solutions. It appears Paul has handled both hotplug and rcu_barrier ordering issues for bypass lists, so this makes it an obvious choice to solve those problems... :-) Thanks, - Joel
On Mon, May 30, 2022 at 10:54:26AM -0400, Joel Fernandes wrote: > >> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > >> +{ > >> + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > >> + struct rcu_lazy_pcp *rlp; > >> + > >> + preempt_disable(); > >> + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); > >> + preempt_enable(); > >> > > Can we get rid of such explicit disabling/enabling preemption? > > Ok I'll try. Last I checked, something needs to disable preemption to > prevent warnings with sampling the current processor ID. raw_cpu_ptr() Thanks.
On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote: > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > > + preempt_enable(); > > + > > + if (debug_rcu_head_queue((void *)head)) { > > + // Probable double call_rcu(), just leak. > > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > > + __func__, head); > > + > > + // Mark as success and leave. > > + return; > > + } > > + > > + // Queue to per-cpu llist > > + head->func = func; > > + llist_add(&head->llist_node, &rlp->head); > > Suppose that there are a bunch of preemptions between the preempt_enable() > above and this point, so that the current CPU's list has lots of > callbacks, but zero ->count. Can that cause a problem? > > In the past, this sort of thing has been an issue for rcu_barrier() > and friends. Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I might have missed that somewhere in the patchset though. Thanks.
On Wed, Jun 01, 2022 at 04:24:54PM +0200, Frederic Weisbecker wrote: > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote: > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > > > + preempt_enable(); > > > + > > > + if (debug_rcu_head_queue((void *)head)) { > > > + // Probable double call_rcu(), just leak. > > > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > > > + __func__, head); > > > + > > > + // Mark as success and leave. > > > + return; > > > + } > > > + > > > + // Queue to per-cpu llist > > > + head->func = func; > > > + llist_add(&head->llist_node, &rlp->head); > > > > Suppose that there are a bunch of preemptions between the preempt_enable() > > above and this point, so that the current CPU's list has lots of > > callbacks, but zero ->count. Can that cause a problem? > > > > In the past, this sort of thing has been an issue for rcu_barrier() > > and friends. > > Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I > might have missed that somewhere in the patchset though. It should, but Joel deferred rcu_barrier() handling for this initial prototype. So be careful when playing with it. ;-) Thanx, Paul
Hi Fred, On Wed, Jun 1, 2022 at 10:24 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote: > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote: > > > + preempt_enable(); > > > + > > > + if (debug_rcu_head_queue((void *)head)) { > > > + // Probable double call_rcu(), just leak. > > > + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", > > > + __func__, head); > > > + > > > + // Mark as success and leave. > > > + return; > > > + } > > > + > > > + // Queue to per-cpu llist > > > + head->func = func; > > > + llist_add(&head->llist_node, &rlp->head); > > > > Suppose that there are a bunch of preemptions between the preempt_enable() > > above and this point, so that the current CPU's list has lots of > > callbacks, but zero ->count. Can that cause a problem? > > > > In the past, this sort of thing has been an issue for rcu_barrier() > > and friends. > > Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I > might have missed that somewhere in the patchset though. Yes, it should. This initial prototype did not handle it and was mentioned in one of the replies to the cover letter. However, in the v2 of this, I am planning to use Paul's idea of sticking these in the bypass list. It simplifies a lot of things which bypass handles (hotplug, rcu_barrier, de-offload, etc). Thanks, - Joel
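Until the bypass-based v2 exists, the rcu_barrier() gap in the current prototype could be closed by draining every per-CPU lazy list before the barrier entrains its callbacks. lazy_rcu_flush_cpu() and rcu_lazy_pcp_ins come from the RFC patch; the hook itself is an assumption and would go away once the bypass list is used.

// Sketch: drain all lazy llists so no callback can hide from
// rcu_barrier(). Would need to run before the barrier queues its
// own callbacks on each CPU.
static void rcu_lazy_flush_all(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);

		lazy_rcu_flush_cpu(rlp);
	}
}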
On Wed, Jun 1, 2022 at 10:12 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Mon, May 30, 2022 at 10:54:26AM -0400, Joel Fernandes wrote: > > >> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) > > >> +{ > > >> + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; > > >> + struct rcu_lazy_pcp *rlp; > > >> + > > >> + preempt_disable(); > > >> + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); > > >> + preempt_enable(); > > >> > > > Can we get rid of such explicit disabling/enabling preemption? > > > > Ok I'll try. Last I checked, something needs to disable preemption to > > prevent warnings with sampling the current processor ID. > > raw_cpu_ptr() Indeed, thanks! - Joel
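Concretely, the raw_cpu_ptr() suggestion turns the queueing prologue of the RFC patch into something like the following sketch; the rest of the function is unchanged from the posted code.

void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
{
	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
	// raw_cpu_ptr() avoids the preemption-debug warning without
	// pinning the task: migrating right after this load is harmless
	// because the llist and the atomic count accept cross-CPU updates.
	struct rcu_lazy_pcp *rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);

	if (debug_rcu_head_queue((void *)head)) {
		// Probable double call_rcu(), just leak.
		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
			  __func__, head);
		return;
	}

	// Queue to the per-CPU llist.
	head->func = func;
	llist_add(&head->llist_node, &rlp->head);

	// Flush if the batch is too big, otherwise arm the flush timer.
	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else if (!delayed_work_pending(&rlp->work))
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
}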
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 88b42eb46406..d0a6c4f5172c 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void) #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ +#ifdef CONFIG_RCU_LAZY +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func); +#else +#define call_rcu_lazy(head, func) call_rcu(head, func) +#endif + /* Internal to kernel */ void rcu_init(void); extern int rcu_scheduler_active __read_mostly; diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index bf8e341e75b4..c09715079829 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB Say N here if you hate read-side memory barriers. Take the default if you are unsure. +config RCU_LAZY + bool "RCU callback lazy invocation functionality" + depends on RCU_NOCB_CPU + default y + help + To save power, batch RCU callbacks and flush after delay, memory + pressure or callback list growing too big. + endmenu # "RCU Subsystem" diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile index 0cfb009a99b9..8968b330d6e0 100644 --- a/kernel/rcu/Makefile +++ b/kernel/rcu/Makefile @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o obj-$(CONFIG_TREE_RCU) += tree.o obj-$(CONFIG_TINY_RCU) += tiny.o obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o +obj-$(CONFIG_RCU_LAZY) += lazy.o diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c new file mode 100644 index 000000000000..55e406cfc528 --- /dev/null +++ b/kernel/rcu/lazy.c @@ -0,0 +1,145 @@ +/* + * Lockless lazy-RCU implementation. + */ +#include <linux/rcupdate.h> +#include <linux/shrinker.h> +#include <linux/workqueue.h> +#include "rcu.h" + +// How much to batch before flushing? +#define MAX_LAZY_BATCH 2048 + +// How much to wait before flushing? +#define MAX_LAZY_JIFFIES 10000 + +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON +// later to ensure that rcu_head and lazy_rcu_head are of the same size. +struct lazy_rcu_head { + struct llist_node llist_node; + void (*func)(struct callback_head *head); +} __attribute__((aligned(sizeof(void *)))); + +struct rcu_lazy_pcp { + struct llist_head head; + struct delayed_work work; + atomic_t count; +}; +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins); + +// Lockless flush of CPU, can be called concurrently. +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp) +{ + struct llist_node *node = llist_del_all(&rlp->head); + struct lazy_rcu_head *cursor, *temp; + + if (!node) + return; + + llist_for_each_entry_safe(cursor, temp, node, llist_node) { + struct rcu_head *rh = (struct rcu_head *)cursor; + debug_rcu_head_unqueue(rh); + call_rcu(rh, rh->func); + atomic_dec(&rlp->count); + } +} + +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func) +{ + struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu; + struct rcu_lazy_pcp *rlp; + + preempt_disable(); + rlp = this_cpu_ptr(&rcu_lazy_pcp_ins); + preempt_enable(); + + if (debug_rcu_head_queue((void *)head)) { + // Probable double call_rcu(), just leak. + WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", + __func__, head); + + // Mark as success and leave. 
+ return; + } + + // Queue to per-cpu llist + head->func = func; + llist_add(&head->llist_node, &rlp->head); + + // Flush queue if too big + if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) { + lazy_rcu_flush_cpu(rlp); + } else { + if (!delayed_work_pending(&rlp->work)) { + schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES); + } + } +} + +static unsigned long +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) +{ + unsigned long count = 0; + int cpu; + + /* Snapshot count of all CPUs */ + for_each_possible_cpu(cpu) { + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); + + count += atomic_read(&rlp->count); + } + + return count; +} + +static unsigned long +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) +{ + int cpu, freed = 0; + + for_each_possible_cpu(cpu) { + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); + unsigned long count; + + count = atomic_read(&rlp->count); + lazy_rcu_flush_cpu(rlp); + sc->nr_to_scan -= count; + freed += count; + if (sc->nr_to_scan <= 0) + break; + } + + return freed == 0 ? SHRINK_STOP : freed; +} + +/* + * This function is invoked after MAX_LAZY_JIFFIES timeout. + */ +static void lazy_work(struct work_struct *work) +{ + struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work); + + lazy_rcu_flush_cpu(rlp); +} + +static struct shrinker lazy_rcu_shrinker = { + .count_objects = lazy_rcu_shrink_count, + .scan_objects = lazy_rcu_shrink_scan, + .batch = 0, + .seeks = DEFAULT_SEEKS, +}; + +void __init rcu_lazy_init(void) +{ + int cpu; + + BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head)); + + for_each_possible_cpu(cpu) { + struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu); + INIT_DELAYED_WORK(&rlp->work, lazy_work); + } + + if (register_shrinker(&lazy_rcu_shrinker)) + pr_err("Failed to register lazy_rcu shrinker!\n"); +} diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 24b5f2c2de87..a5f4b44f395f 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void); static inline void show_rcu_tasks_trace_gp_kthread(void) {} #endif +#ifdef CONFIG_RCU_LAZY +void rcu_lazy_init(void); +#else +static inline void rcu_lazy_init(void) {} +#endif #endif /* __LINUX_RCU_H */ diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a4c25a6283b0..ebdf6f7c9023 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4775,6 +4775,8 @@ void __init rcu_init(void) qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; else qovld_calc = qovld; + + rcu_lazy_init(); } #include "tree_stall.h"
Implement timer-based RCU callback batching. The batch is flushed whenever a certain amount of time has passed, or the batch on a particular CPU grows too big. Also memory pressure can flush it. Locking is avoided to reduce lock contention when queuing and dequeuing happens on different CPUs of a per-cpu list, such as when shrinker context is running on different CPU. Also not having to use locks keeps the per-CPU structure size small. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- include/linux/rcupdate.h | 6 ++ kernel/rcu/Kconfig | 8 +++ kernel/rcu/Makefile | 1 + kernel/rcu/lazy.c | 145 +++++++++++++++++++++++++++++++++++++++ kernel/rcu/rcu.h | 5 ++ kernel/rcu/tree.c | 2 + 6 files changed, 167 insertions(+) create mode 100644 kernel/rcu/lazy.c
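For context, a hypothetical caller of the new API; the struct and function names below are made up for illustration. Any call_rcu() user whose callback is not latency sensitive could be converted the same way.

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int data;
	struct rcu_head rcu;
};

static void foo_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct foo, rcu));
}

// Freeing path that can tolerate batching: let the callback be lazy.
static void foo_release(struct foo *fp)
{
	call_rcu_lazy(&fp->rcu, foo_free_rcu);
}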