Message ID | 20240819165939.745801-7-kent.overstreet@linux.dev (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | rcu_pending | expand |
I still remain quite skeptical of this one, but it has improved since June's version. Responses inline. Thanx, Paul In the meantime, thank you for avoiding reaching into RCU's and SRCU's innards. This makes it reasonable to have you put this file into your code, at least initially. That way, you get what you want *now* and us RCU guys are not committing to maintain it before it is ready for us to do so. On Mon, Aug 19, 2024 at 12:59:32PM -0400, Kent Overstreet wrote: > Generic data structure for explicitly tracking pending RCU items, > allowing items to be dequeued (i.e. allocate from items pending > freeing). Works with conventional RCU and SRCU, and possibly other RCU > flavors in the future, meaning this can serve as a more generic > replacement for SLAB_TYPESAFE_BY_RCU. I added a few of the slab guys on CC for his thoughts on what would be required to make slabs typesafe by other forms of RCU. > Pending items are tracked in radix trees; if memory allocation fails, we > fall back to linked lists. I still have strong misgivings about radix trees and cache locality. Poor cache locality on the other side of the grace period can result in OOMs during callback-flooding events, hence the move of kfree_rcu() and friends to pages of pointers. And, as you note below, radix trees don't merge nicely. > A rcu_pending is initialized with a callback, which is invoked when > pending items's grace periods have expired. Two types of callback > processing are handled specially: > > - RCU_PENDING_KVFREE_FN > > New backend for kvfree_rcu(). Slightly faster, and eliminates the > synchronize_rcu() slowpath in kvfree_rcu_mightsleep() - instead, an > rcu_head is allocated if we don't have one and can't use the radix > tree The advantage of synchronize_rcu() is that it can proceed without memory allocation. If you block on allocation, even of a 16-byte rcu_head structure, you can go into OOM-induced deadlock. It might well make sense to do an rcu_head allocation with GFP flags that try reasonably hard, but still allow failure before falling all the way back to synchronize_rcu(). (And for all I know, this might have been tested and found wanting, but Uladzislau Rezki (CCed) would know.) But a hard wait on that allocation is asking for trouble. > TODO: > - add a shrinker (as in the existing kvfree_rcu implementation) so that > memory reclaim can free expired objects if callback processing isn't > keeping up, and to expedite a grace period if we're under memory > pressure and too much memory is stranded by RCU There is work underway to make the current callback lists take advantage of expedited grace periods, transparent to the caller. This allows the shrinker (or whatever) to speed up everything by simply invoking synchronize_rcu_expedited(). This speedup includes callback processing because it avoids "bubbles" in the callback processing that can occur when the next grace period has not yet completed, but all callbacks from earlier grace periods have been invoked. > - add a counter for amount of memory pending As noted earlier, for things like call_rcu() and synchronize_rcu(), this needs help from the callers of those functions. > - RCU_PENDING_CALL_RCU_FN > > Accelerated backend for call_rcu() - pending callbacks are tracked in > a radix tree to eliminate linked list overhead. But to add radix-tree overhead. And to eliminate the ability to do O(1) list merges. Is this really a win? > to serve as replacement backends for kvfree_rcu() and call_rcu(); these > may be of interest to other uses (e.g. SLAB_TYPESAFE_BY_RCU users). > > Note: > > Internally, we're using a single rearming call_rcu() callback for > notifications from the core RCU subsystem for notifications when objects > are ready to be processed. > > Ideally we would be getting a callback every time a grace period > completes for which we have objects, but that would require multiple > rcu_heads in flight, and since the number of gp sequence numbers with > uncompleted callbacks is not bounded, we can't do that yet. Doesn't the call from __rcu_pending_enqueue() to process_finished_items() serve this purpose? After all, the case that causes trouble is the one where lots of things are being very frequently queued. > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > --- > include/linux/rcu_pending.h | 25 ++ > kernel/rcu/Makefile | 2 +- > kernel/rcu/pending.c | 603 ++++++++++++++++++++++++++++++++++++ > 3 files changed, 629 insertions(+), 1 deletion(-) > create mode 100644 include/linux/rcu_pending.h > create mode 100644 kernel/rcu/pending.c > > diff --git a/include/linux/rcu_pending.h b/include/linux/rcu_pending.h > new file mode 100644 > index 000000000000..a875c640da8d > --- /dev/null > +++ b/include/linux/rcu_pending.h > @@ -0,0 +1,25 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _LINUX_RCU_PENDING_H > +#define _LINUX_RCU_PENDING_H > + > +struct rcu_pending; > +typedef void (*rcu_pending_process_fn)(struct rcu_pending *, struct rcu_head *); > + > +struct rcu_pending_pcpu; > + > +struct rcu_pending { > + struct rcu_pending_pcpu __percpu *p; > + struct srcu_struct *srcu; > + rcu_pending_process_fn process; > +}; > + > +void rcu_pending_enqueue(struct rcu_pending *pending, struct rcu_head *obj); > +struct rcu_head *rcu_pending_dequeue(struct rcu_pending *pending); > +struct rcu_head *rcu_pending_dequeue_from_all(struct rcu_pending *pending); > + > +void rcu_pending_exit(struct rcu_pending *pending); > +int rcu_pending_init(struct rcu_pending *pending, > + struct srcu_struct *srcu, > + rcu_pending_process_fn process); > + > +#endif /* _LINUX_RCU_PENDING_H */ > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile > index 0cfb009a99b9..2582f0324a11 100644 > --- a/kernel/rcu/Makefile > +++ b/kernel/rcu/Makefile > @@ -7,7 +7,7 @@ ifeq ($(CONFIG_KCSAN),y) > KBUILD_CFLAGS += -g -fno-omit-frame-pointer > endif > > -obj-y += update.o sync.o > +obj-y += update.o sync.o pending.o > obj-$(CONFIG_TREE_SRCU) += srcutree.o > obj-$(CONFIG_TINY_SRCU) += srcutiny.o > obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o > diff --git a/kernel/rcu/pending.c b/kernel/rcu/pending.c > new file mode 100644 > index 000000000000..c0e2351ba198 > --- /dev/null > +++ b/kernel/rcu/pending.c > @@ -0,0 +1,603 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#define pr_fmt(fmt) "%s() " fmt "\n", __func__ > + > +#include <linux/darray.h> > +#include <linux/generic-radix-tree.h> > +#include <linux/mm.h> > +#include <linux/percpu.h> > +#include <linux/rcu_pending.h> > +#include <linux/slab.h> > +#include <linux/srcu.h> > +#include <linux/vmalloc.h> > + > +#include "rcu.h" > + > +#define static_array_for_each(_a, _i) \ > + for (typeof(&(_a)[0]) _i = _a; \ > + _i < (_a) + ARRAY_SIZE(_a); \ > + _i++) > + > +enum rcu_pending_special { > + RCU_PENDING_KVFREE = 1, > + RCU_PENDING_CALL_RCU = 2, > +}; > + > +#define RCU_PENDING_KVFREE_FN ((rcu_pending_process_fn) (ulong) RCU_PENDING_KVFREE) > +#define RCU_PENDING_CALL_RCU_FN ((rcu_pending_process_fn) (ulong) RCU_PENDING_CALL_RCU) > + > +static inline unsigned long __get_state_synchronize_rcu(struct srcu_struct *ssp) > +{ > + return ssp > + ? get_state_synchronize_srcu(ssp) > + : get_state_synchronize_rcu(); > +} > + > +static inline unsigned long __start_poll_synchronize_rcu(struct srcu_struct *ssp) > +{ > + return ssp > + ? start_poll_synchronize_srcu(ssp) > + : start_poll_synchronize_rcu(); > +} > + > +static inline bool __poll_state_synchronize_rcu(struct srcu_struct *ssp, unsigned long cookie) > +{ > + return ssp > + ? poll_state_synchronize_srcu(ssp, cookie) > + : poll_state_synchronize_rcu(cookie); > +} > + > +static inline void __rcu_barrier(struct srcu_struct *ssp) > +{ > + return ssp > + ? srcu_barrier(ssp) > + : rcu_barrier(); > +} > + > +static inline void __call_rcu(struct srcu_struct *ssp, struct rcu_head *rhp, > + rcu_callback_t func) > +{ > + if (ssp) > + call_srcu(ssp, rhp, func); > + else > + call_rcu(rhp, func); > +} I do sympathize with the desire to provide RCU-variant-oblivious API members, and maybe SRCU and RCU will be all that is required. You might need call_rcu_hurry()? > +struct rcu_pending_seq { > + /* > + * We're using a radix tree like a vector - we're just pushing elements > + * onto the end; we're using a radix tree instead of an actual vector to > + * avoid reallocation overhead > + */ > + GENRADIX(struct rcu_head *) objs; OK, I will bite... Why not a doubly-linked list of pages of pointers? Are you encountering situations where page-sized allocations fail, but smaller allocations proceed nicely? But that should be rare, and also should be handledd by your fallback to linked list of rcu_head structures, right? > + size_t nr; > + struct rcu_head **cursor; > + unsigned long seq; > +}; > + > +struct rcu_pending_list { > + struct rcu_head *head; > + struct rcu_head *tail; > + unsigned long seq; > +}; > + > +struct rcu_pending_pcpu { > + struct rcu_pending *parent; > + spinlock_t lock; > + int cpu; > + > + /* > + * We can't bound the number of unprocessed gp sequence numbers, and we > + * can't efficiently merge radix trees for expired grace periods, so we > + * need darray/vector: > + */ > + DARRAY_PREALLOCATED(struct rcu_pending_seq, 4) objs; This is a big strength of the current linked list of pages: Merging is trivial. Then there need only be one list for everything for which a grace period has completed. I ran out of time at this point. I do have additional comments further down due to searching for things, but this is where I will continue my review before the end of this week. > + /* Third entry is for expired objects: */ > + struct rcu_pending_list lists[NUM_ACTIVE_RCU_POLL_OLDSTATE + 1]; > + > + struct rcu_head cb; > + bool cb_armed; > + struct work_struct work; > +}; > + > +static bool __rcu_pending_has_pending(struct rcu_pending_pcpu *p) > +{ > + if (p->objs.nr) > + return true; > + > + static_array_for_each(p->lists, i) > + if (i->head) > + return true; > + > + return false; > +} > + > +static void rcu_pending_list_merge(struct rcu_pending_list *l1, > + struct rcu_pending_list *l2) > +{ > + if (!l1->head) > + l1->head = l2->head; > + else > + l1->tail->next = l2->head; > + l1->tail = l2->tail; > + > + l2->head = l2->tail = NULL; > +} The rcu_segcblist structure handles this sort of thing straightforwardly. > +static void rcu_pending_list_add(struct rcu_pending_list *l, > + struct rcu_head *n) > +{ > + if (!l->head) > + l->head = n; > + else > + l->tail->next = n; > + l->tail = n; > + n->next = NULL; > +} > + > +static void merge_expired_lists(struct rcu_pending_pcpu *p) > +{ > + struct rcu_pending_list *expired = &p->lists[NUM_ACTIVE_RCU_POLL_OLDSTATE]; > + > + for (struct rcu_pending_list *i = p->lists; i < expired; i++) > + if (i->head && __poll_state_synchronize_rcu(p->parent->srcu, i->seq)) > + rcu_pending_list_merge(expired, i); > +} > + > +static noinline void __process_finished_items(struct rcu_pending *pending, > + struct rcu_pending_pcpu *p, > + unsigned long flags) > +{ > + struct rcu_pending_list *expired = &p->lists[NUM_ACTIVE_RCU_POLL_OLDSTATE]; > + struct rcu_pending_seq objs = {}; > + struct rcu_head *list = NULL; > + > + if (p->objs.nr && > + __poll_state_synchronize_rcu(pending->srcu, p->objs.data[0].seq)) { > + objs = p->objs.data[0]; > + darray_remove_item(&p->objs, p->objs.data); > + } > + > + merge_expired_lists(p); > + > + list = expired->head; > + expired->head = expired->tail = NULL; > + > + spin_unlock_irqrestore(&p->lock, flags); > + > + switch ((ulong) pending->process) { > + case RCU_PENDING_KVFREE: > + for (size_t i = 0; i < objs.nr; ) { > + size_t nr_this_node = min(GENRADIX_NODE_SIZE / sizeof(void *), objs.nr - i); > + > + kfree_bulk(nr_this_node, (void **) genradix_ptr(&objs.objs, i)); > + i += nr_this_node; > + } > + genradix_free(&objs.objs); > + > + while (list) { > + struct rcu_head *obj = list; > + list = obj->next; > + > + /* > + * low bit of pointer indicates whether rcu_head needs > + * to be freed - kvfree_rcu_mightsleep() > + */ > + BUILD_BUG_ON(ARCH_SLAB_MINALIGN == 0); > + > + void *ptr = (void *)(((unsigned long) obj->func) & ~1UL); > + kvfree(ptr); > + > + bool free_head = ((unsigned long) obj->func) & 1UL; > + if (free_head) > + kfree(obj); > + } > + > + break; > + > + case RCU_PENDING_CALL_RCU: > + for (size_t i = 0; i < objs.nr; i++) { > + struct rcu_head *obj = *genradix_ptr(&objs.objs, i); > + obj->func(obj); > + } > + genradix_free(&objs.objs); > + > + while (list) { > + struct rcu_head *obj = list; > + list = obj->next; > + obj->func(obj); > + } > + break; > + > + default: > + for (size_t i = 0; i < objs.nr; i++) > + pending->process(pending, *genradix_ptr(&objs.objs, i)); > + genradix_free(&objs.objs); > + > + while (list) { > + struct rcu_head *obj = list; > + list = obj->next; > + pending->process(pending, obj); > + } > + break; > + } > +} > + > +static bool process_finished_items(struct rcu_pending *pending, > + struct rcu_pending_pcpu *p, > + unsigned long flags) > +{ > + /* > + * XXX: we should grab the gp seq once and avoid multiple function > + * calls, this is called from __rcu_pending_enqueue() fastpath in > + * may_sleep==true mode > + */ > + if ((p->objs.nr && __poll_state_synchronize_rcu(pending->srcu, p->objs.data[0].seq)) || > + (p->lists[0].head && __poll_state_synchronize_rcu(pending->srcu, p->lists[0].seq)) || > + (p->lists[1].head && __poll_state_synchronize_rcu(pending->srcu, p->lists[1].seq)) || One approach would be to create a variant of poll_state_synchronize_rcu() that takes multiple cookies, and returns the index of the first one that has expired. Would that help? > + p->lists[2].head) { > + __process_finished_items(pending, p, flags); > + return true; > + } > + > + return false; > +} > + > +static void rcu_pending_work(struct work_struct *work) > +{ > + struct rcu_pending_pcpu *p = > + container_of(work, struct rcu_pending_pcpu, work); > + struct rcu_pending *pending = p->parent; > + unsigned long flags; > + > + do { > + spin_lock_irqsave(&p->lock, flags); > + } while (process_finished_items(pending, p, flags)); > + > + spin_unlock_irqrestore(&p->lock, flags); > +} > + > +static void rcu_pending_rcu_cb(struct rcu_head *rcu) > +{ > + struct rcu_pending_pcpu *p = container_of(rcu, struct rcu_pending_pcpu, cb); > + > + schedule_work_on(p->cpu, &p->work); > + > + unsigned long flags; > + spin_lock_irqsave(&p->lock, flags); > + if (__rcu_pending_has_pending(p)) > + __call_rcu(p->parent->srcu, &p->cb, rcu_pending_rcu_cb); > + else > + p->cb_armed = false; > + spin_unlock_irqrestore(&p->lock, flags); > +} > + > +static __always_inline struct rcu_pending_seq * > +get_object_radix(struct rcu_pending_pcpu *p, unsigned long seq) > +{ > + darray_for_each_reverse(p->objs, objs) > + if (objs->seq == seq) > + return objs; > + > + if (darray_push_gfp(&p->objs, ((struct rcu_pending_seq) { .seq = seq }), GFP_ATOMIC)) > + return NULL; > + > + return &darray_last(p->objs); > +} > + > +static noinline bool > +rcu_pending_enqueue_list(struct rcu_pending_pcpu *p, unsigned long seq, > + struct rcu_head *head, void *ptr, > + unsigned long *flags) > +{ > + if (ptr) { > + if (!head) { > + /* > + * kvfree_rcu_mightsleep(): we weren't passed an > + * rcu_head, but we need one: use the low bit of the > + * ponter to free to flag that the head needs to be > + * freed as well: > + */ > + ptr = (void *)(((unsigned long) ptr)|1UL); > + head = kmalloc(sizeof(*head), __GFP_NOWARN); > + if (!head) { > + spin_unlock_irqrestore(&p->lock, *flags); > + head = kmalloc(sizeof(*head), GFP_KERNEL|__GFP_NOFAIL); > + /* > + * dropped lock, did GFP_KERNEL allocation, > + * check for gp expiration > + */ > + if (unlikely(__poll_state_synchronize_rcu(p->parent->srcu, seq))) { > + kvfree(--ptr); > + kfree(head); > + spin_lock_irqsave(&p->lock, *flags); > + return false; > + } > + } > + } > + > + head->func = ptr; > + } > +again: > + for (struct rcu_pending_list *i = p->lists; > + i < p->lists + NUM_ACTIVE_RCU_POLL_OLDSTATE; i++) { > + if (i->seq == seq) { > + rcu_pending_list_add(i, head); > + return false; > + } > + } > + > + for (struct rcu_pending_list *i = p->lists; > + i < p->lists + NUM_ACTIVE_RCU_POLL_OLDSTATE; i++) { > + if (!i->head) { > + i->seq = seq; > + rcu_pending_list_add(i, head); > + return true; > + } > + } > + > + merge_expired_lists(p); > + goto again; > +} > + > +/* > + * __rcu_pending_enqueue: enqueue a pending RCU item, to be processed (via > + * pending->pracess) once grace period elapses. > + * > + * Attempt to enqueue items onto a radix tree; if memory allocation fails, fall > + * back to a linked list. > + * > + * - If @ptr is NULL, we're enqueuing an item for a generic @pending with a > + * process callback > + * > + * - If @ptr and @head are both not NULL, we're kvfree_rcu() > + * > + * - If @ptr is not NULL and @head is, we're kvfree_rcu_mightsleep() > + * > + * - If @may_sleep is true, will do GFP_KERNEL memory allocations and process > + * expired items. > + */ > +static __always_inline void > +__rcu_pending_enqueue(struct rcu_pending *pending, struct rcu_head *head, > + void *ptr, bool may_sleep) > +{ > + > + struct rcu_pending_pcpu *p; > + struct rcu_pending_seq *objs; > + struct genradix_node *new_node = NULL; > + unsigned long seq, flags; > + bool start_gp = false; > + > + BUG_ON((ptr != NULL) != (pending->process == RCU_PENDING_KVFREE_FN)); > + > + local_irq_save(flags); > + p = this_cpu_ptr(pending->p); > + spin_lock(&p->lock); > + seq = __get_state_synchronize_rcu(pending->srcu); > +restart: > + if (may_sleep && > + unlikely(process_finished_items(pending, p, flags))) > + goto check_expired; > + > + /* > + * In kvfree_rcu() mode, the radix tree is only for slab pointers so > + * that we can do kfree_bulk() - vmalloc pointers always use the linked > + * list: > + */ > + if (ptr && unlikely(is_vmalloc_addr_inlined(ptr))) > + goto list_add; > + > + objs = get_object_radix(p, seq); > + if (unlikely(!objs)) > + goto list_add; > + > + if (unlikely(!objs->cursor)) { > + /* > + * New radix tree nodes must be added under @p->lock because the > + * tree root is in a darray that can be resized (typically, > + * genradix supports concurrent unlocked allocation of new > + * nodes) - hence preallocation and the retry loop: > + */ > + objs->cursor = genradix_ptr_alloc_preallocated_inlined(&objs->objs, > + objs->nr, &new_node, GFP_ATOMIC|__GFP_NOWARN); > + if (unlikely(!objs->cursor)) { > + if (may_sleep) { > + spin_unlock_irqrestore(&p->lock, flags); > + > + gfp_t gfp = GFP_KERNEL; > + if (!head) > + gfp |= __GFP_NOFAIL; > + > + new_node = genradix_alloc_node(gfp); > + if (!new_node) > + may_sleep = false; > + goto check_expired; > + } > +list_add: > + start_gp = rcu_pending_enqueue_list(p, seq, head, ptr, &flags); > + goto start_gp; > + } > + } > + > + *objs->cursor++ = ptr ?: head; > + /* zero cursor if we hit the end of a radix tree node: */ > + if (!(((ulong) objs->cursor) & (GENRADIX_NODE_SIZE - 1))) > + objs->cursor = NULL; > + start_gp = !objs->nr; > + objs->nr++; > +start_gp: > + if (unlikely(start_gp)) { > + /* > + * We only have one callback (ideally, we would have one for > + * every outstanding graceperiod) - so if our callback is > + * already in flight, we may still have to start a grace period > + * (since we used get_state() above, not start_poll()) > + */ > + if (!p->cb_armed) { > + p->cb_armed = true; > + __call_rcu(pending->srcu, &p->cb, rcu_pending_rcu_cb); > + } else { > + __start_poll_synchronize_rcu(pending->srcu); > + } > + } > + spin_unlock_irqrestore(&p->lock, flags); > +free_node: > + if (new_node) > + genradix_free_node(new_node); > + return; > +check_expired: > + if (unlikely(__poll_state_synchronize_rcu(pending->srcu, seq))) { > + switch ((ulong) pending->process) { > + case RCU_PENDING_KVFREE: > + kvfree(ptr); > + break; > + case RCU_PENDING_CALL_RCU: > + head->func(head); > + break; > + default: > + pending->process(pending, head); > + break; > + } > + goto free_node; > + } > + > + local_irq_save(flags); > + p = this_cpu_ptr(pending->p); > + spin_lock(&p->lock); > + goto restart; > +} > + > +void rcu_pending_enqueue(struct rcu_pending *pending, struct rcu_head *obj) > +{ > + __rcu_pending_enqueue(pending, obj, NULL, true); > +} > + > +static struct rcu_head *rcu_pending_pcpu_dequeue(struct rcu_pending_pcpu *p) > +{ > + struct rcu_head *ret = NULL; > + > + spin_lock_irq(&p->lock); > + darray_for_each(p->objs, objs) > + if (objs->nr) { > + ret = *genradix_ptr(&objs->objs, --objs->nr); > + objs->cursor = NULL; > + if (!objs->nr) > + genradix_free(&objs->objs); > + goto out; > + } > + > + static_array_for_each(p->lists, i) > + if (i->head) { > + ret = i->head; > + i->head = ret->next; > + if (!i->head) > + i->tail = NULL; > + goto out; > + } > +out: > + spin_unlock_irq(&p->lock); > + > + return ret; > +} > + > +struct rcu_head *rcu_pending_dequeue(struct rcu_pending *pending) > +{ > + return rcu_pending_pcpu_dequeue(raw_cpu_ptr(pending->p)); > +} > + > +struct rcu_head *rcu_pending_dequeue_from_all(struct rcu_pending *pending) > +{ > + struct rcu_head *ret = rcu_pending_dequeue(pending); > + > + if (ret) > + return ret; > + > + int cpu; > + for_each_possible_cpu(cpu) { > + ret = rcu_pending_pcpu_dequeue(per_cpu_ptr(pending->p, cpu)); > + if (ret) > + break; > + } > + return ret; > +} > + > +static bool rcu_pending_has_pending_or_armed(struct rcu_pending *pending) > +{ > + int cpu; > + for_each_possible_cpu(cpu) { > + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); > + spin_lock_irq(&p->lock); > + if (__rcu_pending_has_pending(p) || p->cb_armed) { > + spin_unlock_irq(&p->lock); > + return true; > + } > + spin_unlock_irq(&p->lock); > + } > + > + return false; > +} > + > +void rcu_pending_exit(struct rcu_pending *pending) > +{ > + int cpu; > + > + if (!pending->p) > + return; > + > + while (rcu_pending_has_pending_or_armed(pending)) { > + __rcu_barrier(pending->srcu); > + > + for_each_possible_cpu(cpu) { > + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); > + flush_work(&p->work); > + } > + } > + > + for_each_possible_cpu(cpu) { > + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); > + flush_work(&p->work); > + } > + > + for_each_possible_cpu(cpu) { > + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); > + > + static_array_for_each(p->lists, i) > + WARN_ON(i->head); > + WARN_ON(p->objs.nr); > + darray_exit(&p->objs); > + } > + free_percpu(pending->p); > +} > + > +/** > + * rcu_pending_init: - initialize a rcu_pending > + * > + * @pending: Object to init > + * @srcu: May optionally be used with an srcu_struct; if NULL, uses normal > + * RCU flavor > + * @process: Callback function invoked on objects once their RCU barriers > + * have completed; if NULL, kvfree() is used. > + */ > +int rcu_pending_init(struct rcu_pending *pending, > + struct srcu_struct *srcu, > + rcu_pending_process_fn process) > +{ > + pending->p = alloc_percpu(struct rcu_pending_pcpu); > + if (!pending->p) > + return -ENOMEM; > + > + int cpu; > + for_each_possible_cpu(cpu) { > + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); > + p->parent = pending; > + p->cpu = cpu; > + spin_lock_init(&p->lock); > + darray_init(&p->objs); > + INIT_WORK(&p->work, rcu_pending_work); > + } > + > + pending->srcu = srcu; > + pending->process = process; > + > + return 0; > +} > -- > 2.45.2 >
On Mon, Aug 19, 2024 at 03:58:26PM GMT, Paul E. McKenney wrote: > I still remain quite skeptical of this one, but it has improved since > June's version. > > Responses inline. > > Thanx, Paul > > In the meantime, thank you for avoiding reaching into RCU's and SRCU's > innards. This makes it reasonable to have you put this file into your > code, at least initially. That way, you get what you want *now* and us > RCU guys are not committing to maintain it before it is ready for us to > do so. The RCU interfaces do force a lot of function calls for things that should be inlined though, and the gratuitious interface fragmentation between RCU and SRCU is... annoying. > I still have strong misgivings about radix trees and cache locality. > Poor cache locality on the other side of the grace period can result > in OOMs during callback-flooding events, hence the move of kfree_rcu() > and friends to pages of pointers. And, as you note below, radix trees > don't merge nicely. Cache locality where? On the enqueue side, which is the fastpath, this uses a cursor - we're not walking the radix tree every time. On the processing side, where we're kfree_bulk()ing entire radix tree nodes, the radix tree is going to have better cache locality than a list of pages. > The advantage of synchronize_rcu() is that it can proceed without > memory allocation. If you block on allocation, even of a 16-byte > rcu_head structure, you can go into OOM-induced deadlock. > > It might well make sense to do an rcu_head allocation with GFP flags > that try reasonably hard, but still allow failure before falling all > the way back to synchronize_rcu(). (And for all I know, this might have > been tested and found wanting, but Uladzislau Rezki (CCed) would know.) > But a hard wait on that allocation is asking for trouble. That's reasonable - as long as we're trying the 16 byte allocation before doing a synchronize_rcu(). > There is work underway to make the current callback lists take advantage > of expedited grace periods, transparent to the caller. This allows > the shrinker (or whatever) to speed up everything by simply invoking > synchronize_rcu_expedited(). This speedup includes callback processing > because it avoids "bubbles" in the callback processing that can occur > when the next grace period has not yet completed, but all callbacks from > earlier grace periods have been invoked. Glad to here, that was first on my list when adding a shrinker to this code. > > - RCU_PENDING_CALL_RCU_FN > > > > Accelerated backend for call_rcu() - pending callbacks are tracked in > > a radix tree to eliminate linked list overhead. > > But to add radix-tree overhead. And to eliminate the ability to do O(1) > list merges. Is this really a win? As mentioned, there's a cursor so we're not adding radix-tree overhead to the fast path, and there's no need to merge lists for expired objects - that's all handled fine. But yes, using it for call_rcu() would need more performance justification. I haven't seen workloads where call_rcu() performance particularly matters, so it's not something I'm pushing for - I included that because it's minimal code and other people might know of workloads where we do want it. > > Ideally we would be getting a callback every time a grace period > > completes for which we have objects, but that would require multiple > > rcu_heads in flight, and since the number of gp sequence numbers with > > uncompleted callbacks is not bounded, we can't do that yet. > > Doesn't the call from __rcu_pending_enqueue() to process_finished_items() > serve this purpose? After all, the case that causes trouble is the one > where lots of things are being very frequently queued. No, because that's unpredictable, and we don't process pending items in enqueue() unless we know we can sleep (i.e., we don't do it if the caller is latency sensitive).
On Mon, Aug 19, 2024 at 07:59:34PM -0400, Kent Overstreet wrote: > On Mon, Aug 19, 2024 at 03:58:26PM GMT, Paul E. McKenney wrote: > > I still remain quite skeptical of this one, but it has improved since > > June's version. > > > > Responses inline. > > > > Thanx, Paul > > > > In the meantime, thank you for avoiding reaching into RCU's and SRCU's > > innards. This makes it reasonable to have you put this file into your > > code, at least initially. That way, you get what you want *now* and us > > RCU guys are not committing to maintain it before it is ready for us to > > do so. > > The RCU interfaces do force a lot of function calls for things that > should be inlined though, and the gratuitious interface fragmentation > between RCU and SRCU is... annoying. I have gotten requests for a variant of SRCU that dispenses with the read-side memory barriers, which would allow you to dispense with RCU. Maybe an srcu_read_lock_lite() and srcu_read_unlock_lite(), similar to the _nmisafe() variants. There is always a price, and in this case the price would be that srcu_read_lock_lite() and srcu_read_unlock_lite() be restricted to code regions where RCU is watching. But from what I know, fs code must already abide by that restriction. Would that help? > > I still have strong misgivings about radix trees and cache locality. > > Poor cache locality on the other side of the grace period can result > > in OOMs during callback-flooding events, hence the move of kfree_rcu() > > and friends to pages of pointers. And, as you note below, radix trees > > don't merge nicely. > > Cache locality where? > > On the enqueue side, which is the fastpath, this uses a cursor - we're > not walking the radix tree every time. > > On the processing side, where we're kfree_bulk()ing entire radix tree > nodes, the radix tree is going to have better cache locality than a list > of pages. Interesting. What testing or analysis did you do in the course of arriving at this conclusion? In the meantime I will assume that the analysis involves your radix-tree nodes being more than one page in size. > > The advantage of synchronize_rcu() is that it can proceed without > > memory allocation. If you block on allocation, even of a 16-byte > > rcu_head structure, you can go into OOM-induced deadlock. > > > > It might well make sense to do an rcu_head allocation with GFP flags > > that try reasonably hard, but still allow failure before falling all > > the way back to synchronize_rcu(). (And for all I know, this might have > > been tested and found wanting, but Uladzislau Rezki (CCed) would know.) > > But a hard wait on that allocation is asking for trouble. > > That's reasonable - as long as we're trying the 16 byte allocation > before doing a synchronize_rcu(). It might or might not be reasonable, depending on what is going on in the rest of the system. The current kfree_rcu() code can run the system out of full pages, but it won't impede other code allocating smaller blocks of memory. We could easily change it to allocate individual rcu_head structures, but doing so would not necessarily be a win in OOM conditions, again, depending on what else is going on. We are very likely to add this, but also very likely to have a "chicken switch" to allow the system administrator to control it. But I freely concede that if your radix tree is using multi-page nodes, then your code might well have greater need to allocate individual rcu_head structures than does the current kfree_rcu() implementation. > > There is work underway to make the current callback lists take advantage > > of expedited grace periods, transparent to the caller. This allows > > the shrinker (or whatever) to speed up everything by simply invoking > > synchronize_rcu_expedited(). This speedup includes callback processing > > because it avoids "bubbles" in the callback processing that can occur > > when the next grace period has not yet completed, but all callbacks from > > earlier grace periods have been invoked. > > Glad to here, that was first on my list when adding a shrinker to this > code. Glad you approve. This has been on my list for quite some time, and we now have thing in place to make it easy. Well, easier, anyway. > > > - RCU_PENDING_CALL_RCU_FN > > > > > > Accelerated backend for call_rcu() - pending callbacks are tracked in > > > a radix tree to eliminate linked list overhead. > > > > But to add radix-tree overhead. And to eliminate the ability to do O(1) > > list merges. Is this really a win? > > As mentioned, there's a cursor so we're not adding radix-tree overhead > to the fast path, and there's no need to merge lists for expired objects > - that's all handled fine. I took a quick look at __rcu_pending_enqueue() and my eyes are still bleeding. Concatenating linked list of pages is way simpler, way faster, and way more robust. > But yes, using it for call_rcu() would need more performance > justification. I haven't seen workloads where call_rcu() performance > particularly matters, so it's not something I'm pushing for - I included > that because it's minimal code and other people might know of workloads > where we do want it. Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback functions usually access the memory block passed to them, which means that they are incurring that per-element cache miss in any case. > > > Ideally we would be getting a callback every time a grace period > > > completes for which we have objects, but that would require multiple > > > rcu_heads in flight, and since the number of gp sequence numbers with > > > uncompleted callbacks is not bounded, we can't do that yet. > > > > Doesn't the call from __rcu_pending_enqueue() to process_finished_items() > > serve this purpose? After all, the case that causes trouble is the one > > where lots of things are being very frequently queued. > > No, because that's unpredictable, and we don't process pending items in > enqueue() unless we know we can sleep (i.e., we don't do it if the > caller is latency sensitive). It is possible to do this in an extremely lightweight fashion in the common case. Thanx, Paul
On Mon, Aug 26, 2024 at 07:44:12AM GMT, Paul E. McKenney wrote: > I have gotten requests for a variant of SRCU that dispenses with the > read-side memory barriers, which would allow you to dispense with RCU. > Maybe an srcu_read_lock_lite() and srcu_read_unlock_lite(), similar to > the _nmisafe() variants. There is always a price, and in this case the > price would be that srcu_read_lock_lite() and srcu_read_unlock_lite() > be restricted to code regions where RCU is watching. But from what I > know, fs code must already abide by that restriction. > > Would that help? I don't think that helps here, but that is something I'd be interested in. What I would like for this is: - a single #define for RCU_NR_OLDSTATES for both RCU and SRCU - a way of getting the current RCU gp sequence number without a function call, since we hit that on every single enqueue(). One of my development versions added a function to RCU and SRCU for getting a pointer to the internal sequence number, which we'd call at init time; this lets us skip the function call and a branch. Another thing that might make the code a bit easier to reason about is if rcu_read_lock() also worked as an srcu_read_lock() - is something the SRCU variant you're talking about previously would provide? In my current version of the code, we call __get_state_synchronize_rcu() after we've disabled interrupts; we know the rcu gp seq isn't going to tick while we're in the critical path. But this doesn't apply if it's for SRCU, and I don't want to add if (src) srcu_read_lock() branches to it. Not at all essential, the races that result from this are harmless, but if we e.g. decide it's worthwhile to only kick off a gp if it hasn't ticked (i.e. only kick rcu if we're still on seq of the object being enqueued) it'd be nice. > > On the enqueue side, which is the fastpath, this uses a cursor - we're > > not walking the radix tree every time. > > > > On the processing side, where we're kfree_bulk()ing entire radix tree > > nodes, the radix tree is going to have better cache locality than a list > > of pages. > > Interesting. What testing or analysis did you do in the course of > arriving at this conclusion? In the meantime I will assume that the > analysis involves your radix-tree nodes being more than one page in size. No, the radix tree nodes are 512 bytes; that's been sufficient in my testing. (also, please don't refer to PAGE_SIZE outside of mm/ code without a _good_ reason; that's something we've been trying to clean up.) I'll try to post some performance numbers when I have some time. > It might or might not be reasonable, depending on what is going on in the > rest of the system. The current kfree_rcu() code can run the system out > of full pages, but it won't impede other code allocating smaller blocks > of memory. We could easily change it to allocate individual rcu_head > structures, but doing so would not necessarily be a win in OOM conditions, > again, depending on what else is going on. As long as the thread calling kvfree_rcu_mightsleep() can block without blocking memory reclaim it'll be safe. We might want to tweak the GFP flags so to tell the allocator "don't try too hard, bail out so we can check if the gp has expired". > I took a quick look at __rcu_pending_enqueue() and my eyes are still > bleeding. Concatenating linked list of pages is way simpler, way faster, > and way more robust. Funny, I had the same thoughts trying to read your code... :) But, most of __rcu_pending_enqueue() is slowpaths; the fastpath is just objs = get_object_radix(p, seq); /* lookup seq in p->objs */ *objs->cursor++ = ptr ?: head; /* zero cursor if we hit the end of a radix tree node: */ if (!(((ulong) objs->cursor) & (GENRADIX_NODE_SIZE - 1))) objs->cursor = NULL; start_gp = !objs->nr; objs->nr++; So I think the performance concerns are moot, and for robustness - memory allocation failure always turns into "use the linked lists of objects", which works similarly to the old code. > > But yes, using it for call_rcu() would need more performance > > justification. I haven't seen workloads where call_rcu() performance > > particularly matters, so it's not something I'm pushing for - I included > > that because it's minimal code and other people might know of workloads > > where we do want it. > > Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback > functions usually access the memory block passed to them, which means > that they are incurring that per-element cache miss in any case. True. But this would allow us to prefetch those objects (several iterations in advance). > > > > Ideally we would be getting a callback every time a grace period > > > > completes for which we have objects, but that would require multiple > > > > rcu_heads in flight, and since the number of gp sequence numbers with > > > > uncompleted callbacks is not bounded, we can't do that yet. > > > > > > Doesn't the call from __rcu_pending_enqueue() to process_finished_items() > > > serve this purpose? After all, the case that causes trouble is the one > > > where lots of things are being very frequently queued. > > > > No, because that's unpredictable, and we don't process pending items in > > enqueue() unless we know we can sleep (i.e., we don't do it if the > > caller is latency sensitive). > > It is possible to do this in an extremely lightweight fashion in the > common case. Just processing a few items? hmm, would we want to though, when otherwise we'd be calling kfree_bulk()? I think icache locality means we'd generally prefer not to.
On Mon, Aug 26, 2024 at 11:17:29AM -0400, Kent Overstreet wrote: > On Mon, Aug 26, 2024 at 07:44:12AM GMT, Paul E. McKenney wrote: > > I have gotten requests for a variant of SRCU that dispenses with the > > read-side memory barriers, which would allow you to dispense with RCU. > > Maybe an srcu_read_lock_lite() and srcu_read_unlock_lite(), similar to > > the _nmisafe() variants. There is always a price, and in this case the > > price would be that srcu_read_lock_lite() and srcu_read_unlock_lite() > > be restricted to code regions where RCU is watching. But from what I > > know, fs code must already abide by that restriction. > > > > Would that help? > > I don't think that helps here, but that is something I'd be interested > in. > > What I would like for this is: > - a single #define for RCU_NR_OLDSTATES for both RCU and SRCU > > - a way of getting the current RCU gp sequence number without a > function call, since we hit that on every single enqueue(). One of my > development versions added a function to RCU and SRCU for getting a > pointer to the internal sequence number, which we'd call at init > time; this lets us skip the function call and a branch. > > Another thing that might make the code a bit easier to reason about is > if rcu_read_lock() also worked as an srcu_read_lock() - is something the > SRCU variant you're talking about previously would provide? Yes, srcu_read_lock_lite() would return an value that you later pass to the corresponding srcu_read_unlock_lite(), just as srcu_read_lock() and srcu_read_unlock() do now. And they would have the same grace-period algorithm, and thus the same value for NUM_ACTIVE_SRCU_POLL_OLDSTATE, as requested above. But get_state_synchronize_srcu() is still a function call. But the offer of a function that checks multiple grace-period-state cookies in one call still stands. > In my current version of the code, we call __get_state_synchronize_rcu() > after we've disabled interrupts; we know the rcu gp seq isn't going to > tick while we're in the critical path. But this doesn't apply if it's > for SRCU, and I don't want to add if (src) srcu_read_lock() branches to > it. Actually, disabling interrupts does *not* prevent RCU's grace-period sequence number from changing. For example, the following really can happen: o RCU notices that the current grace period can end. o A CPU disables interrupts here. o A call to get_state_synchronize_rcu() returns a cookie corresponding to the end of the grace period following the current one. o RCU ends the current grace period, therefore updating the grace-period sequence number. o RCU starts a new grace period, therefore updating the grace-period sequence number once again. o RCU cannot complete this new grace period until that CPU re-enables interrupts, but it has already updated its grace-period sequence number not once, but twice. So if your code knows that RCU's grace-period sequence number cannot change while a given CPU has interrupts disabled, that code has a bug. A low-probability bug, perhaps, but if your code is running on enough systems, it will make its presence known. > Not at all essential, the races that result from this are harmless, but > if we e.g. decide it's worthwhile to only kick off a gp if it hasn't > ticked (i.e. only kick rcu if we're still on seq of the object being > enqueued) it'd be nice. Why not call get_state_synchronize_rcu(), and ask for a new grace period if the value returned is different than that from the previous call? > > > On the enqueue side, which is the fastpath, this uses a cursor - we're > > > not walking the radix tree every time. > > > > > > On the processing side, where we're kfree_bulk()ing entire radix tree > > > nodes, the radix tree is going to have better cache locality than a list > > > of pages. > > > > Interesting. What testing or analysis did you do in the course of > > arriving at this conclusion? In the meantime I will assume that the > > analysis involves your radix-tree nodes being more than one page in size. > > No, the radix tree nodes are 512 bytes; that's been sufficient in my > testing. > > (also, please don't refer to PAGE_SIZE outside of mm/ code without a > _good_ reason; that's something we've been trying to clean up.) Moving kfree_rcu() into mm/ will fix that problem. Plus we did discuss this with the mm folks back in the day, especially about the GFP flags, so apparently there was a sufficiently good reason. Maybe still is. > I'll try to post some performance numbers when I have some time. I look forward to seeing what you come up with. > > It might or might not be reasonable, depending on what is going on in the > > rest of the system. The current kfree_rcu() code can run the system out > > of full pages, but it won't impede other code allocating smaller blocks > > of memory. We could easily change it to allocate individual rcu_head > > structures, but doing so would not necessarily be a win in OOM conditions, > > again, depending on what else is going on. > > As long as the thread calling kvfree_rcu_mightsleep() can block without > blocking memory reclaim it'll be safe. We might want to tweak the GFP > flags so to tell the allocator "don't try too hard, bail out so we can > check if the gp has expired". There can be quite a difference between "safe" and "does well in actual workloads". > > I took a quick look at __rcu_pending_enqueue() and my eyes are still > > bleeding. Concatenating linked list of pages is way simpler, way faster, > > and way more robust. > > Funny, I had the same thoughts trying to read your code... :) Amazing how much easier it is to generate new code than to understand existing code, isn't it? ;-) > But, most of __rcu_pending_enqueue() is slowpaths; the fastpath is just > > objs = get_object_radix(p, seq); /* lookup seq in p->objs */ > > *objs->cursor++ = ptr ?: head; > /* zero cursor if we hit the end of a radix tree node: */ > if (!(((ulong) objs->cursor) & (GENRADIX_NODE_SIZE - 1))) > objs->cursor = NULL; > start_gp = !objs->nr; > objs->nr++; > > So I think the performance concerns are moot, and for robustness - > memory allocation failure always turns into "use the linked lists of > objects", which works similarly to the old code. Bugs live on slowpaths as well as on fastpaths. In fact, they tend to accumulate there. > > > But yes, using it for call_rcu() would need more performance > > > justification. I haven't seen workloads where call_rcu() performance > > > particularly matters, so it's not something I'm pushing for - I included > > > that because it's minimal code and other people might know of workloads > > > where we do want it. > > > > Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback > > functions usually access the memory block passed to them, which means > > that they are incurring that per-element cache miss in any case. > > True. But this would allow us to prefetch those objects (several > iterations in advance). I need to see a CPU on which this actually make a significant difference before adding this sort of complexity. > > > > > Ideally we would be getting a callback every time a grace period > > > > > completes for which we have objects, but that would require multiple > > > > > rcu_heads in flight, and since the number of gp sequence numbers with > > > > > uncompleted callbacks is not bounded, we can't do that yet. > > > > > > > > Doesn't the call from __rcu_pending_enqueue() to process_finished_items() > > > > serve this purpose? After all, the case that causes trouble is the one > > > > where lots of things are being very frequently queued. > > > > > > No, because that's unpredictable, and we don't process pending items in > > > enqueue() unless we know we can sleep (i.e., we don't do it if the > > > caller is latency sensitive). > > > > It is possible to do this in an extremely lightweight fashion in the > > common case. > > Just processing a few items? hmm, would we want to though, when > otherwise we'd be calling kfree_bulk()? I think icache locality means > we'd generally prefer not to. You might not want to yet, but you eventually would want this. Thanx, Paul
On Mon, Aug 26, 2024 at 09:01:54AM GMT, Paul E. McKenney wrote: > But get_state_synchronize_srcu() is still a function call. But the > offer of a function that checks multiple grace-period-state cookies in > one call still stands. It shouldn't be though, it's just reading a sequence number - I'd much prefer if it could be at least a static inline. Which you won't want to do, because it would mean exposing RCU private data structures; hence my approach of exposing an RCU-only api for getting a pointer to the sequence number. > > > In my current version of the code, we call __get_state_synchronize_rcu() > > after we've disabled interrupts; we know the rcu gp seq isn't going to > > tick while we're in the critical path. But this doesn't apply if it's > > for SRCU, and I don't want to add if (src) srcu_read_lock() branches to > > it. > > Actually, disabling interrupts does *not* prevent RCU's grace-period > sequence number from changing. For example, the following really can > happen: > > o RCU notices that the current grace period can end. > > o A CPU disables interrupts here. > > o A call to get_state_synchronize_rcu() returns a cookie > corresponding to the end of the grace period following the > current one. > > o RCU ends the current grace period, therefore updating the > grace-period sequence number. > > o RCU starts a new grace period, therefore updating the > grace-period sequence number once again. > > o RCU cannot complete this new grace period until that CPU > re-enables interrupts, but it has already updated its grace-period > sequence number not once, but twice. > > So if your code knows that RCU's grace-period sequence number cannot > change while a given CPU has interrupts disabled, that code has a bug. > A low-probability bug, perhaps, but if your code is running on enough > systems, it will make its presence known. Ok, good to know > > > Not at all essential, the races that result from this are harmless, but > > if we e.g. decide it's worthwhile to only kick off a gp if it hasn't > > ticked (i.e. only kick rcu if we're still on seq of the object being > > enqueued) it'd be nice. > > Why not call get_state_synchronize_rcu(), and ask for a new grace period > if the value returned is different than that from the previous call? We don't want to falsely think the object expires later than it actually does, and have more accumulated work than we need to. > > Funny, I had the same thoughts trying to read your code... :) > > Amazing how much easier it is to generate new code than to understand > existing code, isn't it? ;-) I would have much preferred if your existing code worked with SRCU. You think I'm doing this for fun? > > > Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback > > > functions usually access the memory block passed to them, which means > > > that they are incurring that per-element cache miss in any case. > > > > True. But this would allow us to prefetch those objects (several > > iterations in advance). > > I need to see a CPU on which this actually make a significant difference > before adding this sort of complexity. We would of course want benchmarks to show that this was worthwhile before switching call_rcu(), since absent a performance improvement we'd want to stick with the approach that doesn't allocate memory. > > Just processing a few items? hmm, would we want to though, when > > otherwise we'd be calling kfree_bulk()? I think icache locality means > > we'd generally prefer not to. > > You might not want to yet, but you eventually would want this. Because?
On Mon, Aug 26, 2024 at 01:09:15PM -0400, Kent Overstreet wrote: > On Mon, Aug 26, 2024 at 09:01:54AM GMT, Paul E. McKenney wrote: > > But get_state_synchronize_srcu() is still a function call. But the > > offer of a function that checks multiple grace-period-state cookies in > > one call still stands. > > It shouldn't be though, it's just reading a sequence number - I'd much > prefer if it could be at least a static inline. > > Which you won't want to do, because it would mean exposing RCU private > data structures; hence my approach of exposing an RCU-only api for > getting a pointer to the sequence number. True enough. In addition, I have not yet been able to come up with any safe use of this sequence number that does not require memory ordering. And -that- I most emphatically won't want to export to the caller. Hence my offer of a function taking multiple grace-period-state cookies in one go, allowing the memory-ordering overhead to be amortized over the set of cookies. > > > In my current version of the code, we call __get_state_synchronize_rcu() > > > after we've disabled interrupts; we know the rcu gp seq isn't going to > > > tick while we're in the critical path. But this doesn't apply if it's > > > for SRCU, and I don't want to add if (src) srcu_read_lock() branches to > > > it. > > > > Actually, disabling interrupts does *not* prevent RCU's grace-period > > sequence number from changing. For example, the following really can > > happen: > > > > o RCU notices that the current grace period can end. > > > > o A CPU disables interrupts here. > > > > o A call to get_state_synchronize_rcu() returns a cookie > > corresponding to the end of the grace period following the > > current one. > > > > o RCU ends the current grace period, therefore updating the > > grace-period sequence number. > > > > o RCU starts a new grace period, therefore updating the > > grace-period sequence number once again. > > > > o RCU cannot complete this new grace period until that CPU > > re-enables interrupts, but it has already updated its grace-period > > sequence number not once, but twice. > > > > So if your code knows that RCU's grace-period sequence number cannot > > change while a given CPU has interrupts disabled, that code has a bug. > > A low-probability bug, perhaps, but if your code is running on enough > > systems, it will make its presence known. > > Ok, good to know > > > > Not at all essential, the races that result from this are harmless, but > > > if we e.g. decide it's worthwhile to only kick off a gp if it hasn't > > > ticked (i.e. only kick rcu if we're still on seq of the object being > > > enqueued) it'd be nice. > > > > Why not call get_state_synchronize_rcu(), and ask for a new grace period > > if the value returned is different than that from the previous call? > > We don't want to falsely think the object expires later than it actually > does, and have more accumulated work than we need to. Yes, that can happen, but grace periods are long and that race window will normally be very short. It will not matter in actual practice. > > > Funny, I had the same thoughts trying to read your code... :) > > > > Amazing how much easier it is to generate new code than to understand > > existing code, isn't it? ;-) > > I would have much preferred if your existing code worked with SRCU. You > think I'm doing this for fun? To be fair, the "for fun" reason seemed to me to be one of the more positive explanations for your providing an unsolicited rip-and-replace patch for kfree_rcu() without prior consultation. ;-) But, to your point, why haven't we already implemented kfree_srcu()? Because in v6.10, there are six use cases for it, and they appear to have infrequent updates. Thus, kfree_srcu() is not yet not worth its weight. > > > > Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback > > > > functions usually access the memory block passed to them, which means > > > > that they are incurring that per-element cache miss in any case. > > > > > > True. But this would allow us to prefetch those objects (several > > > iterations in advance). > > > > I need to see a CPU on which this actually make a significant difference > > before adding this sort of complexity. > > We would of course want benchmarks to show that this was worthwhile > before switching call_rcu(), since absent a performance improvement we'd > want to stick with the approach that doesn't allocate memory. Agreed. Furthermore, any approach that does allocate memory needs a non-allocating fallback. Out-of-memory deadlocks are not fun. > > > Just processing a few items? hmm, would we want to though, when > > > otherwise we'd be calling kfree_bulk()? I think icache locality means > > > we'd generally prefer not to. > > > > You might not want to yet, but you eventually would want this. > > Because? TL;DR: Lessons learned from bitter experience. For more detail, please read on. Because certain inconvenient laws of physics (and I am looking at *you*, finite speed of light and atomic nature of matter!) mean that global agreement across a computer system is expensive. This is especially a problem if synchronous global agreement is required, which is one reason why reader-writer locking is so slow, one way or another. (You can either make readers slow or you can make updaters *really* slow.) RCU does not completely avoid the cost of global agreement. After all, at the end of a grace period, there must be global agreement that all pre-existing readers have completed. But the computation of that global agreement can be (and is) spread over time. This spreading has two good effects: First, it permits CPUs and tasks to use lighter-weight operations for high-frequency operations such as rcu_read_lock(), and second, the grace-period requests that arrived during one grace-period computation can all be satisfied by the next grace-period computation, in turn permitting the overhead of that next computation to be amortized over all those requests. Which is one reason why grace periods are not unconditionally expedited. But it is all high-amortization fun and games only until there is risk of exhausting memory. At that point, RCU takes steps to expedite the current grace period, albeit in a light-weight fashion compared to synchronize_rcu_expedited(). It has proven unwise to wait for mm to complain to RCU (for example, via a shrinker), so RCU does this lightweight expediting using heuristics based on the number and rate of callbacks on each CPU. And yes, RCU also uses shrinkers, just not as the first line of defense. So why don't we have such heuristics in SRCU? Because SRCU has not yet been subjected to workloads that flood SRCU with callbacks, so it has not yet proven necessary. Obviously, should SRCU start getting flooded, SRCU will start taking whatever forms of evasive action are indicated by the situation at hand, quite likely including expediting the current SRCU grace period. Until that time, we keep it simple. Or simpler, anyway. Thanx, Paul
On Fri, Aug 30, 2024 at 12:01:26PM GMT, Paul E. McKenney wrote: > On Mon, Aug 26, 2024 at 01:09:15PM -0400, Kent Overstreet wrote: > > On Mon, Aug 26, 2024 at 09:01:54AM GMT, Paul E. McKenney wrote: > > > But get_state_synchronize_srcu() is still a function call. But the > > > offer of a function that checks multiple grace-period-state cookies in > > > one call still stands. > > > > It shouldn't be though, it's just reading a sequence number - I'd much > > prefer if it could be at least a static inline. > > > > Which you won't want to do, because it would mean exposing RCU private > > data structures; hence my approach of exposing an RCU-only api for > > getting a pointer to the sequence number. > > True enough. > > In addition, I have not yet been able to come up with any safe use of > this sequence number that does not require memory ordering. And -that- > I most emphatically won't want to export to the caller. > > Hence my offer of a function taking multiple grace-period-state cookies > in one go, allowing the memory-ordering overhead to be amortized over > the set of cookies. I'd have to have an idea of what that would look like to comment. > > We don't want to falsely think the object expires later than it actually > > does, and have more accumulated work than we need to. > > Yes, that can happen, but grace periods are long and that race window > will normally be very short. It will not matter in actual practice. True, it's not much of a practical consideration. Part of the reason I bring this up is I'm coming to this with experience with the bcachefs journal flushing API, which feels quite similar in a lot of ways - we have items associated with sequence numbers, and we may later need to flush and wait for a given sequence number to be persistent. That code just feels easier to reason about when we're being explicit about which sequence number we're talking about whenever possible. Granted it's not as necessary with RCU - but who knows, maybe someday? Just wanted to share the thought. > > I would have much preferred if your existing code worked with SRCU. You > > think I'm doing this for fun? > > To be fair, the "for fun" reason seemed to me to be one of the more > positive explanations for your providing an unsolicited rip-and-replace > patch for kfree_rcu() without prior consultation. ;-) Well, partly yes; one of the things that I derive the most satisfaction from is clean modular solutions to interesting problems with nice simple APIs, where I get to slim down the original chunk of code and cut out something grotty :) There hit a point when working on the btree key cache lock contention issues where I realized I was onto something nice, and thought - "this might be worth doing right". And I still feel that way. Slimming down rcu/tree.c seems like a nice win, and if we could slim down the slab code by pulling out SLAB_TYPESAFE_BY_RCU too, that'd be a nice win. I really like codebases that are nicely organized, where individual components aren't too big and have a clear purpose, where I can still find my way around them years later :) > But, to your point, why haven't we already implemented kfree_srcu()? > > Because in v6.10, there are six use cases for it, and they appear to have > infrequent updates. Thus, kfree_srcu() is not yet not worth its weight. Only if you leave out bcachefs. > > We would of course want benchmarks to show that this was worthwhile > > before switching call_rcu(), since absent a performance improvement we'd > > want to stick with the approach that doesn't allocate memory. > > Agreed. > > Furthermore, any approach that does allocate memory needs a non-allocating > fallback. Out-of-memory deadlocks are not fun. Try telling that to the filesystem people who are currently obsessed with __GFP_NOFAIL... But the fallbacks are already there - if you didn't get that far through reading the code, the -ENOMEM fallback works pretty much exactly like how call_rcu() works currently, with three lists of pending objects (one for each outstanding grace period, and a list for expired objects). > > > > Just processing a few items? hmm, would we want to though, when > > > > otherwise we'd be calling kfree_bulk()? I think icache locality means > > > > we'd generally prefer not to. > > > > > > You might not want to yet, but you eventually would want this. > > > > Because? > > TL;DR: Lessons learned from bitter experience. > > For more detail, please read on. > > Because certain inconvenient laws of physics (and I am looking at *you*, > finite speed of light and atomic nature of matter!) mean that global > agreement across a computer system is expensive. This is especially > a problem if synchronous global agreement is required, which is one > reason why reader-writer locking is so slow, one way or another. > (You can either make readers slow or you can make updaters *really* slow.) > > RCU does not completely avoid the cost of global agreement. After all, > at the end of a grace period, there must be global agreement that all > pre-existing readers have completed. But the computation of that global > agreement can be (and is) spread over time. This spreading has two > good effects: First, it permits CPUs and tasks to use lighter-weight > operations for high-frequency operations such as rcu_read_lock(), and > second, the grace-period requests that arrived during one grace-period > computation can all be satisfied by the next grace-period computation, > in turn permitting the overhead of that next computation to be amortized > over all those requests. Which is one reason why grace periods are not > unconditionally expedited. *nod* There's a fun algorithm for ticking the global RCU sequence number, is there not? I assume it's where tree-RCU gets its name, but I wasn't able to find documented or in the code. My version goes like: construct a binary tree, where every CPU is a leaf node. When a CPU hits a quiescent point, it compares its node's sequence number to the root; if they're the same, increment its node. On increment, check sibling node; if they're now identical, tick parent node; recurse up to the root. Do I have that right? > But it is all high-amortization fun and games only until there is risk > of exhausting memory. At that point, RCU takes steps to expedite > the current grace period, albeit in a light-weight fashion compared > to synchronize_rcu_expedited(). It has proven unwise to wait for mm > to complain to RCU (for example, via a shrinker), so RCU does this > lightweight expediting using heuristics based on the number and rate of > callbacks on each CPU. And yes, RCU also uses shrinkers, just not as > the first line of defense. So that's interesting. I think it would be really valuable to get those heuristics better documented (and perhaps better factored out; if we got them in rcu_pending we might make it easier for mm people to find and understand that code, vs. having to dig through rcu/tree.c). Speaking from painful personal experience, memory reclaim issues are one of the most frustrating things to debug - they never crop up in a controlled environment, it's always users hitting them at the worst time and I always have very little to go on, and there's _so many_ things that can go wrong. Hence my current work to improve shrinker debugging and add it to the show_mem.c oom report, and my recent ask for actual byte counters on amount of memory stranded by RCU - it would be lovely if we could have that included in the show_mem.c report (perhaps when heuristics tell us they're likely to be relevant), along with anything else we might want when debugging (are we waiting on an unusually long grace period?).
diff --git a/include/linux/rcu_pending.h b/include/linux/rcu_pending.h new file mode 100644 index 000000000000..a875c640da8d --- /dev/null +++ b/include/linux/rcu_pending.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_RCU_PENDING_H +#define _LINUX_RCU_PENDING_H + +struct rcu_pending; +typedef void (*rcu_pending_process_fn)(struct rcu_pending *, struct rcu_head *); + +struct rcu_pending_pcpu; + +struct rcu_pending { + struct rcu_pending_pcpu __percpu *p; + struct srcu_struct *srcu; + rcu_pending_process_fn process; +}; + +void rcu_pending_enqueue(struct rcu_pending *pending, struct rcu_head *obj); +struct rcu_head *rcu_pending_dequeue(struct rcu_pending *pending); +struct rcu_head *rcu_pending_dequeue_from_all(struct rcu_pending *pending); + +void rcu_pending_exit(struct rcu_pending *pending); +int rcu_pending_init(struct rcu_pending *pending, + struct srcu_struct *srcu, + rcu_pending_process_fn process); + +#endif /* _LINUX_RCU_PENDING_H */ diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile index 0cfb009a99b9..2582f0324a11 100644 --- a/kernel/rcu/Makefile +++ b/kernel/rcu/Makefile @@ -7,7 +7,7 @@ ifeq ($(CONFIG_KCSAN),y) KBUILD_CFLAGS += -g -fno-omit-frame-pointer endif -obj-y += update.o sync.o +obj-y += update.o sync.o pending.o obj-$(CONFIG_TREE_SRCU) += srcutree.o obj-$(CONFIG_TINY_SRCU) += srcutiny.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o diff --git a/kernel/rcu/pending.c b/kernel/rcu/pending.c new file mode 100644 index 000000000000..c0e2351ba198 --- /dev/null +++ b/kernel/rcu/pending.c @@ -0,0 +1,603 @@ +// SPDX-License-Identifier: GPL-2.0 +#define pr_fmt(fmt) "%s() " fmt "\n", __func__ + +#include <linux/darray.h> +#include <linux/generic-radix-tree.h> +#include <linux/mm.h> +#include <linux/percpu.h> +#include <linux/rcu_pending.h> +#include <linux/slab.h> +#include <linux/srcu.h> +#include <linux/vmalloc.h> + +#include "rcu.h" + +#define static_array_for_each(_a, _i) \ + for (typeof(&(_a)[0]) _i = _a; \ + _i < (_a) + ARRAY_SIZE(_a); \ + _i++) + +enum rcu_pending_special { + RCU_PENDING_KVFREE = 1, + RCU_PENDING_CALL_RCU = 2, +}; + +#define RCU_PENDING_KVFREE_FN ((rcu_pending_process_fn) (ulong) RCU_PENDING_KVFREE) +#define RCU_PENDING_CALL_RCU_FN ((rcu_pending_process_fn) (ulong) RCU_PENDING_CALL_RCU) + +static inline unsigned long __get_state_synchronize_rcu(struct srcu_struct *ssp) +{ + return ssp + ? get_state_synchronize_srcu(ssp) + : get_state_synchronize_rcu(); +} + +static inline unsigned long __start_poll_synchronize_rcu(struct srcu_struct *ssp) +{ + return ssp + ? start_poll_synchronize_srcu(ssp) + : start_poll_synchronize_rcu(); +} + +static inline bool __poll_state_synchronize_rcu(struct srcu_struct *ssp, unsigned long cookie) +{ + return ssp + ? poll_state_synchronize_srcu(ssp, cookie) + : poll_state_synchronize_rcu(cookie); +} + +static inline void __rcu_barrier(struct srcu_struct *ssp) +{ + return ssp + ? srcu_barrier(ssp) + : rcu_barrier(); +} + +static inline void __call_rcu(struct srcu_struct *ssp, struct rcu_head *rhp, + rcu_callback_t func) +{ + if (ssp) + call_srcu(ssp, rhp, func); + else + call_rcu(rhp, func); +} + +struct rcu_pending_seq { + /* + * We're using a radix tree like a vector - we're just pushing elements + * onto the end; we're using a radix tree instead of an actual vector to + * avoid reallocation overhead + */ + GENRADIX(struct rcu_head *) objs; + size_t nr; + struct rcu_head **cursor; + unsigned long seq; +}; + +struct rcu_pending_list { + struct rcu_head *head; + struct rcu_head *tail; + unsigned long seq; +}; + +struct rcu_pending_pcpu { + struct rcu_pending *parent; + spinlock_t lock; + int cpu; + + /* + * We can't bound the number of unprocessed gp sequence numbers, and we + * can't efficiently merge radix trees for expired grace periods, so we + * need darray/vector: + */ + DARRAY_PREALLOCATED(struct rcu_pending_seq, 4) objs; + + /* Third entry is for expired objects: */ + struct rcu_pending_list lists[NUM_ACTIVE_RCU_POLL_OLDSTATE + 1]; + + struct rcu_head cb; + bool cb_armed; + struct work_struct work; +}; + +static bool __rcu_pending_has_pending(struct rcu_pending_pcpu *p) +{ + if (p->objs.nr) + return true; + + static_array_for_each(p->lists, i) + if (i->head) + return true; + + return false; +} + +static void rcu_pending_list_merge(struct rcu_pending_list *l1, + struct rcu_pending_list *l2) +{ + if (!l1->head) + l1->head = l2->head; + else + l1->tail->next = l2->head; + l1->tail = l2->tail; + + l2->head = l2->tail = NULL; +} + +static void rcu_pending_list_add(struct rcu_pending_list *l, + struct rcu_head *n) +{ + if (!l->head) + l->head = n; + else + l->tail->next = n; + l->tail = n; + n->next = NULL; +} + +static void merge_expired_lists(struct rcu_pending_pcpu *p) +{ + struct rcu_pending_list *expired = &p->lists[NUM_ACTIVE_RCU_POLL_OLDSTATE]; + + for (struct rcu_pending_list *i = p->lists; i < expired; i++) + if (i->head && __poll_state_synchronize_rcu(p->parent->srcu, i->seq)) + rcu_pending_list_merge(expired, i); +} + +static noinline void __process_finished_items(struct rcu_pending *pending, + struct rcu_pending_pcpu *p, + unsigned long flags) +{ + struct rcu_pending_list *expired = &p->lists[NUM_ACTIVE_RCU_POLL_OLDSTATE]; + struct rcu_pending_seq objs = {}; + struct rcu_head *list = NULL; + + if (p->objs.nr && + __poll_state_synchronize_rcu(pending->srcu, p->objs.data[0].seq)) { + objs = p->objs.data[0]; + darray_remove_item(&p->objs, p->objs.data); + } + + merge_expired_lists(p); + + list = expired->head; + expired->head = expired->tail = NULL; + + spin_unlock_irqrestore(&p->lock, flags); + + switch ((ulong) pending->process) { + case RCU_PENDING_KVFREE: + for (size_t i = 0; i < objs.nr; ) { + size_t nr_this_node = min(GENRADIX_NODE_SIZE / sizeof(void *), objs.nr - i); + + kfree_bulk(nr_this_node, (void **) genradix_ptr(&objs.objs, i)); + i += nr_this_node; + } + genradix_free(&objs.objs); + + while (list) { + struct rcu_head *obj = list; + list = obj->next; + + /* + * low bit of pointer indicates whether rcu_head needs + * to be freed - kvfree_rcu_mightsleep() + */ + BUILD_BUG_ON(ARCH_SLAB_MINALIGN == 0); + + void *ptr = (void *)(((unsigned long) obj->func) & ~1UL); + kvfree(ptr); + + bool free_head = ((unsigned long) obj->func) & 1UL; + if (free_head) + kfree(obj); + } + + break; + + case RCU_PENDING_CALL_RCU: + for (size_t i = 0; i < objs.nr; i++) { + struct rcu_head *obj = *genradix_ptr(&objs.objs, i); + obj->func(obj); + } + genradix_free(&objs.objs); + + while (list) { + struct rcu_head *obj = list; + list = obj->next; + obj->func(obj); + } + break; + + default: + for (size_t i = 0; i < objs.nr; i++) + pending->process(pending, *genradix_ptr(&objs.objs, i)); + genradix_free(&objs.objs); + + while (list) { + struct rcu_head *obj = list; + list = obj->next; + pending->process(pending, obj); + } + break; + } +} + +static bool process_finished_items(struct rcu_pending *pending, + struct rcu_pending_pcpu *p, + unsigned long flags) +{ + /* + * XXX: we should grab the gp seq once and avoid multiple function + * calls, this is called from __rcu_pending_enqueue() fastpath in + * may_sleep==true mode + */ + if ((p->objs.nr && __poll_state_synchronize_rcu(pending->srcu, p->objs.data[0].seq)) || + (p->lists[0].head && __poll_state_synchronize_rcu(pending->srcu, p->lists[0].seq)) || + (p->lists[1].head && __poll_state_synchronize_rcu(pending->srcu, p->lists[1].seq)) || + p->lists[2].head) { + __process_finished_items(pending, p, flags); + return true; + } + + return false; +} + +static void rcu_pending_work(struct work_struct *work) +{ + struct rcu_pending_pcpu *p = + container_of(work, struct rcu_pending_pcpu, work); + struct rcu_pending *pending = p->parent; + unsigned long flags; + + do { + spin_lock_irqsave(&p->lock, flags); + } while (process_finished_items(pending, p, flags)); + + spin_unlock_irqrestore(&p->lock, flags); +} + +static void rcu_pending_rcu_cb(struct rcu_head *rcu) +{ + struct rcu_pending_pcpu *p = container_of(rcu, struct rcu_pending_pcpu, cb); + + schedule_work_on(p->cpu, &p->work); + + unsigned long flags; + spin_lock_irqsave(&p->lock, flags); + if (__rcu_pending_has_pending(p)) + __call_rcu(p->parent->srcu, &p->cb, rcu_pending_rcu_cb); + else + p->cb_armed = false; + spin_unlock_irqrestore(&p->lock, flags); +} + +static __always_inline struct rcu_pending_seq * +get_object_radix(struct rcu_pending_pcpu *p, unsigned long seq) +{ + darray_for_each_reverse(p->objs, objs) + if (objs->seq == seq) + return objs; + + if (darray_push_gfp(&p->objs, ((struct rcu_pending_seq) { .seq = seq }), GFP_ATOMIC)) + return NULL; + + return &darray_last(p->objs); +} + +static noinline bool +rcu_pending_enqueue_list(struct rcu_pending_pcpu *p, unsigned long seq, + struct rcu_head *head, void *ptr, + unsigned long *flags) +{ + if (ptr) { + if (!head) { + /* + * kvfree_rcu_mightsleep(): we weren't passed an + * rcu_head, but we need one: use the low bit of the + * ponter to free to flag that the head needs to be + * freed as well: + */ + ptr = (void *)(((unsigned long) ptr)|1UL); + head = kmalloc(sizeof(*head), __GFP_NOWARN); + if (!head) { + spin_unlock_irqrestore(&p->lock, *flags); + head = kmalloc(sizeof(*head), GFP_KERNEL|__GFP_NOFAIL); + /* + * dropped lock, did GFP_KERNEL allocation, + * check for gp expiration + */ + if (unlikely(__poll_state_synchronize_rcu(p->parent->srcu, seq))) { + kvfree(--ptr); + kfree(head); + spin_lock_irqsave(&p->lock, *flags); + return false; + } + } + } + + head->func = ptr; + } +again: + for (struct rcu_pending_list *i = p->lists; + i < p->lists + NUM_ACTIVE_RCU_POLL_OLDSTATE; i++) { + if (i->seq == seq) { + rcu_pending_list_add(i, head); + return false; + } + } + + for (struct rcu_pending_list *i = p->lists; + i < p->lists + NUM_ACTIVE_RCU_POLL_OLDSTATE; i++) { + if (!i->head) { + i->seq = seq; + rcu_pending_list_add(i, head); + return true; + } + } + + merge_expired_lists(p); + goto again; +} + +/* + * __rcu_pending_enqueue: enqueue a pending RCU item, to be processed (via + * pending->pracess) once grace period elapses. + * + * Attempt to enqueue items onto a radix tree; if memory allocation fails, fall + * back to a linked list. + * + * - If @ptr is NULL, we're enqueuing an item for a generic @pending with a + * process callback + * + * - If @ptr and @head are both not NULL, we're kvfree_rcu() + * + * - If @ptr is not NULL and @head is, we're kvfree_rcu_mightsleep() + * + * - If @may_sleep is true, will do GFP_KERNEL memory allocations and process + * expired items. + */ +static __always_inline void +__rcu_pending_enqueue(struct rcu_pending *pending, struct rcu_head *head, + void *ptr, bool may_sleep) +{ + + struct rcu_pending_pcpu *p; + struct rcu_pending_seq *objs; + struct genradix_node *new_node = NULL; + unsigned long seq, flags; + bool start_gp = false; + + BUG_ON((ptr != NULL) != (pending->process == RCU_PENDING_KVFREE_FN)); + + local_irq_save(flags); + p = this_cpu_ptr(pending->p); + spin_lock(&p->lock); + seq = __get_state_synchronize_rcu(pending->srcu); +restart: + if (may_sleep && + unlikely(process_finished_items(pending, p, flags))) + goto check_expired; + + /* + * In kvfree_rcu() mode, the radix tree is only for slab pointers so + * that we can do kfree_bulk() - vmalloc pointers always use the linked + * list: + */ + if (ptr && unlikely(is_vmalloc_addr_inlined(ptr))) + goto list_add; + + objs = get_object_radix(p, seq); + if (unlikely(!objs)) + goto list_add; + + if (unlikely(!objs->cursor)) { + /* + * New radix tree nodes must be added under @p->lock because the + * tree root is in a darray that can be resized (typically, + * genradix supports concurrent unlocked allocation of new + * nodes) - hence preallocation and the retry loop: + */ + objs->cursor = genradix_ptr_alloc_preallocated_inlined(&objs->objs, + objs->nr, &new_node, GFP_ATOMIC|__GFP_NOWARN); + if (unlikely(!objs->cursor)) { + if (may_sleep) { + spin_unlock_irqrestore(&p->lock, flags); + + gfp_t gfp = GFP_KERNEL; + if (!head) + gfp |= __GFP_NOFAIL; + + new_node = genradix_alloc_node(gfp); + if (!new_node) + may_sleep = false; + goto check_expired; + } +list_add: + start_gp = rcu_pending_enqueue_list(p, seq, head, ptr, &flags); + goto start_gp; + } + } + + *objs->cursor++ = ptr ?: head; + /* zero cursor if we hit the end of a radix tree node: */ + if (!(((ulong) objs->cursor) & (GENRADIX_NODE_SIZE - 1))) + objs->cursor = NULL; + start_gp = !objs->nr; + objs->nr++; +start_gp: + if (unlikely(start_gp)) { + /* + * We only have one callback (ideally, we would have one for + * every outstanding graceperiod) - so if our callback is + * already in flight, we may still have to start a grace period + * (since we used get_state() above, not start_poll()) + */ + if (!p->cb_armed) { + p->cb_armed = true; + __call_rcu(pending->srcu, &p->cb, rcu_pending_rcu_cb); + } else { + __start_poll_synchronize_rcu(pending->srcu); + } + } + spin_unlock_irqrestore(&p->lock, flags); +free_node: + if (new_node) + genradix_free_node(new_node); + return; +check_expired: + if (unlikely(__poll_state_synchronize_rcu(pending->srcu, seq))) { + switch ((ulong) pending->process) { + case RCU_PENDING_KVFREE: + kvfree(ptr); + break; + case RCU_PENDING_CALL_RCU: + head->func(head); + break; + default: + pending->process(pending, head); + break; + } + goto free_node; + } + + local_irq_save(flags); + p = this_cpu_ptr(pending->p); + spin_lock(&p->lock); + goto restart; +} + +void rcu_pending_enqueue(struct rcu_pending *pending, struct rcu_head *obj) +{ + __rcu_pending_enqueue(pending, obj, NULL, true); +} + +static struct rcu_head *rcu_pending_pcpu_dequeue(struct rcu_pending_pcpu *p) +{ + struct rcu_head *ret = NULL; + + spin_lock_irq(&p->lock); + darray_for_each(p->objs, objs) + if (objs->nr) { + ret = *genradix_ptr(&objs->objs, --objs->nr); + objs->cursor = NULL; + if (!objs->nr) + genradix_free(&objs->objs); + goto out; + } + + static_array_for_each(p->lists, i) + if (i->head) { + ret = i->head; + i->head = ret->next; + if (!i->head) + i->tail = NULL; + goto out; + } +out: + spin_unlock_irq(&p->lock); + + return ret; +} + +struct rcu_head *rcu_pending_dequeue(struct rcu_pending *pending) +{ + return rcu_pending_pcpu_dequeue(raw_cpu_ptr(pending->p)); +} + +struct rcu_head *rcu_pending_dequeue_from_all(struct rcu_pending *pending) +{ + struct rcu_head *ret = rcu_pending_dequeue(pending); + + if (ret) + return ret; + + int cpu; + for_each_possible_cpu(cpu) { + ret = rcu_pending_pcpu_dequeue(per_cpu_ptr(pending->p, cpu)); + if (ret) + break; + } + return ret; +} + +static bool rcu_pending_has_pending_or_armed(struct rcu_pending *pending) +{ + int cpu; + for_each_possible_cpu(cpu) { + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); + spin_lock_irq(&p->lock); + if (__rcu_pending_has_pending(p) || p->cb_armed) { + spin_unlock_irq(&p->lock); + return true; + } + spin_unlock_irq(&p->lock); + } + + return false; +} + +void rcu_pending_exit(struct rcu_pending *pending) +{ + int cpu; + + if (!pending->p) + return; + + while (rcu_pending_has_pending_or_armed(pending)) { + __rcu_barrier(pending->srcu); + + for_each_possible_cpu(cpu) { + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); + flush_work(&p->work); + } + } + + for_each_possible_cpu(cpu) { + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); + flush_work(&p->work); + } + + for_each_possible_cpu(cpu) { + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); + + static_array_for_each(p->lists, i) + WARN_ON(i->head); + WARN_ON(p->objs.nr); + darray_exit(&p->objs); + } + free_percpu(pending->p); +} + +/** + * rcu_pending_init: - initialize a rcu_pending + * + * @pending: Object to init + * @srcu: May optionally be used with an srcu_struct; if NULL, uses normal + * RCU flavor + * @process: Callback function invoked on objects once their RCU barriers + * have completed; if NULL, kvfree() is used. + */ +int rcu_pending_init(struct rcu_pending *pending, + struct srcu_struct *srcu, + rcu_pending_process_fn process) +{ + pending->p = alloc_percpu(struct rcu_pending_pcpu); + if (!pending->p) + return -ENOMEM; + + int cpu; + for_each_possible_cpu(cpu) { + struct rcu_pending_pcpu *p = per_cpu_ptr(pending->p, cpu); + p->parent = pending; + p->cpu = cpu; + spin_lock_init(&p->lock); + darray_init(&p->objs); + INIT_WORK(&p->work, rcu_pending_work); + } + + pending->srcu = srcu; + pending->process = process; + + return 0; +}
Generic data structure for explicitly tracking pending RCU items, allowing items to be dequeued (i.e. allocate from items pending freeing). Works with conventional RCU and SRCU, and possibly other RCU flavors in the future, meaning this can serve as a more generic replacement for SLAB_TYPESAFE_BY_RCU. Pending items are tracked in radix trees; if memory allocation fails, we fall back to linked lists. A rcu_pending is initialized with a callback, which is invoked when pending items's grace periods have expired. Two types of callback processing are handled specially: - RCU_PENDING_KVFREE_FN New backend for kvfree_rcu(). Slightly faster, and eliminates the synchronize_rcu() slowpath in kvfree_rcu_mightsleep() - instead, an rcu_head is allocated if we don't have one and can't use the radix tree TODO: - add a shrinker (as in the existing kvfree_rcu implementation) so that memory reclaim can free expired objects if callback processing isn't keeping up, and to expedite a grace period if we're under memory pressure and too much memory is stranded by RCU - add a counter for amount of memory pending - RCU_PENDING_CALL_RCU_FN Accelerated backend for call_rcu() - pending callbacks are tracked in a radix tree to eliminate linked list overhead. to serve as replacement backends for kvfree_rcu() and call_rcu(); these may be of interest to other uses (e.g. SLAB_TYPESAFE_BY_RCU users). Note: Internally, we're using a single rearming call_rcu() callback for notifications from the core RCU subsystem for notifications when objects are ready to be processed. Ideally we would be getting a callback every time a grace period completes for which we have objects, but that would require multiple rcu_heads in flight, and since the number of gp sequence numbers with uncompleted callbacks is not bounded, we can't do that yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> --- include/linux/rcu_pending.h | 25 ++ kernel/rcu/Makefile | 2 +- kernel/rcu/pending.c | 603 ++++++++++++++++++++++++++++++++++++ 3 files changed, 629 insertions(+), 1 deletion(-) create mode 100644 include/linux/rcu_pending.h create mode 100644 kernel/rcu/pending.c