[RFC,1/4] hazptr: Add initial implementation of hazard pointers

Message ID 20240917143402.930114-2-boqun.feng@gmail.com (mailing list archive)
State New
Series Add hazard pointers to kernel

Commit Message

Boqun Feng Sept. 17, 2024, 2:33 p.m. UTC
Hazard pointers [1] provide a way to dynamically distribute refcounting
and can be used to improve the scalability of refcounting without
significant space cost.

Hazard pointers are similar to RCU: they build synchronization between
two parties, readers and updaters. Readers are refcount users: they
acquire and release refcounts. Updaters clean up objects when no reader
is referencing them (via call_hazptr()). The difference is that instead
of waiting for a grace period, hazard pointers can free an object as
soon as no one is referencing it. This means that for a particular
workload, hazard pointers may have a smaller memory footprint due to
fewer pending callbacks.

The synchronization between readers and updaters is built around "hazard
pointer slots": a slot readers can use to store a pointer value.

Reader side protection:

1.	Read the value of a pointer to the target data element.
2.	Store it to a hazard pointer slot.
3.	Enforce full ordering (e.g. smp_mb()).
4.	Re-read the original pointer, reset the slot and retry if the
	value changed.
5.	Otherwise, the continued existence of the target data element
	is guaranteed.
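
A minimal sketch of these reader-side steps, assuming a node type
"struct mynode" and a global pointer "gp" (illustrative names only; the
actual __hazptr_tryprotect() below stores the address of the embedded
callback_head in the slot rather than the object pointer itself):

	static struct mynode *reader_protect(hazptr_t *slot, struct mynode **gp)
	{
		struct mynode *p;

		p = READ_ONCE(*gp);		/* step 1 */
		if (!p)
			return NULL;
		WRITE_ONCE(*slot, p);		/* step 2 */
		smp_mb();			/* step 3 */
		if (READ_ONCE(*gp) != p) {	/* step 4: pointer changed */
			WRITE_ONCE(*slot, NULL);
			return NULL;		/* caller may retry */
		}
		return p;			/* step 5: p is protected */
	}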

Updater side check:

1.	Unpublish the target data element (e.g. setting the pointer
	value to NULL).
2.	Enforce full ordering.
3.	Read the value from a hazard pointer slot.
4.	If the value doesn't match the target data element, then this
	slot doesn't represent a reference to it.
5.	Otherwise, updater needs to re-check (step 3).
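
Correspondingly, a sketch of the updater-side check (in this patch the
scan is actually performed asynchronously by a workqueue after
call_hazptr(); the synchronous loop below, with assumed "gp" and
"slots" names, only illustrates the steps above):

	static void updater_retire(struct mynode **gp, struct mynode *p,
				   hazptr_t *slots, int nr_slots)
	{
		WRITE_ONCE(*gp, NULL);			/* step 1: unpublish */
		smp_mb();				/* step 2 */

		for (int i = 0; i < nr_slots; i++) {
			/* steps 3-5: re-check until the slot no longer matches p */
			while (READ_ONCE(slots[i]) == (hazptr_t)p)
				cpu_relax();
		}

		kfree(p);	/* no reader can still reference p */
	}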

To distribute accesses to hazptr slots across different contexts,
hazptr_context is introduced. Users need to define/allocate their own
hazptr_context in order to allocate hazard pointer slots.
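
As a usage sketch (with an illustrative "struct mynode", global pointer
and free callback; the functions are the ones declared in hazptr.h
below):

	struct mynode {
		int data;
		struct callback_head rh;	/* storage for call_hazptr() */
	};

	static struct mynode *gp;		/* the published pointer */
	static struct hazptr_context my_ctx;	/* e.g. one per reader/CPU/task */

	static void mynode_free(struct callback_head *head)
	{
		kfree(container_of(head, struct mynode, rh));
	}

	/* Once, before the first reader: init_hazptr_context(&my_ctx); */

	/* Reader side: protect, use, release. */
	static int reader(void)
	{
		hazptr_t *hzp = hazptr_alloc(&my_ctx);	/* NULL if all slots busy */
		struct mynode *p;
		int data = -1;

		if (!hzp)
			return -EBUSY;

		p = hazptr_tryprotect(hzp, gp, rh);
		if (p) {
			data = READ_ONCE(p->data);	/* p cannot be freed here */
			hazptr_clear(hzp);
		}
		hazptr_free(&my_ctx, hzp);

		return data;
	}

	/* Updater side: unpublish, then hand the object to call_hazptr(). */
	static void update(struct mynode *newp)
	{
		struct mynode *old = gp;

		WRITE_ONCE(gp, newp);
		if (old)
			call_hazptr(&old->rh, mynode_free);
	}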

For the updater side to confirm that no reference exists, it needs to
scan all the possible slots. To speed up this process, hazptr_context
also contains an rbtree node for each slot so that the updater can
cache the reader-scan results in an rbtree. The rbtree nodes are
pre-allocated this way to prevent "allocating memory in order to free
memory" in extreme cases.

[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
     lock-free objects," in IEEE Transactions on Parallel and
     Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004

Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/hazptr.h |  83 ++++++++
 kernel/Makefile        |   1 +
 kernel/hazptr.c        | 463 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 547 insertions(+)
 create mode 100644 include/linux/hazptr.h
 create mode 100644 kernel/hazptr.c

Comments

Mathieu Desnoyers Sept. 18, 2024, 8:27 a.m. UTC | #1
On 2024-09-17 16:33, Boqun Feng wrote:
[...]
> 
> The synchronization between readers and updaters is built around "hazard
> pointer slots": a slot readers can use to store a pointer value.
> 
> Reader side protection:
> 
> 1.	Read the value of a pointer to the target data element.
> 2.	Store it to a hazard pointer slot.
> 3.	Enforce full ordering (e.g. smp_mb()).
> 4.	Re-read the original pointer, reset the slot and retry if the
> 	value changed.
> 5.	Otherwise, the continued existence of the target data element
> 	is guaranteed.
> 
> Updater side check:
> 
> 1.	Unpublish the target data element (e.g. setting the pointer
> 	value to NULL).
> 2.	Enforce full ordering.
> 3.	Read the value from a hazard pointer slot.
> 4.	If the value doesn't match the target data element, then this
> 	slot doesn't represent a reference to it.
> 5.	Otherwise, updater needs to re-check (step 3).

Cool! I look forward to seeing where this is meant to be used. I would
expect it to be a useful tool to speed up reference counting of
things like the mm_struct and for TLB flush IPIs.

On a related note, with a userspace port in mind, the membarrier(2)
syscall can be useful to turn the smp_mb() in (3) from the reader
side into a simple compiler barrier, assuming (2) from the updater
is using membarrier. If IPIs are acceptable (or already required) for
some kernel use-cases, this means a similar asymmetric fence scheme
could be used to speed up readers.
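
For the userspace port, a rough sketch of that asymmetric scheme
(assuming the process registered for expedited private membarrier;
everything except the membarrier(2) commands is illustrative):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

#define barrier()	__asm__ __volatile__("" ::: "memory")

/* Once per process, before using the expedited command. */
static int hazptr_membarrier_register(void)
{
	return syscall(__NR_membarrier,
		       MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
}

/* Reader: step (3) becomes a compiler barrier on the fast path. */
static void *hazptr_protect_light(void * volatile *slot, void * volatile *gp)
{
	void *p = *gp;

	*slot = p;
	barrier();		/* no hardware fence here */
	if (*gp != p) {
		*slot = NULL;
		return NULL;
	}
	return p;
}

/* Updater: after unpublishing, force a barrier on all running threads. */
static void hazptr_updater_fence(void)
{
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}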

Thanks,

Mathieu
Alan Huang Sept. 18, 2024, 3:17 p.m. UTC | #2
On Sept. 17, 2024, at 22:33, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> Hazard pointers [1] provide a way to dynamically distribute refcounting
> and can be used to improve the scalability of refcounting without
> significant space cost.
> 
> Hazard pointers are similar to RCU: they build the synchronization
> between two part, readers and updaters. Readers are refcount users, they
> acquire and release refcount. Updaters cleanup objects when there are no
> reader referencing them (via call_hazptr()). The difference is instead
> of waiting for a grace period, hazard pointers can free an object as
> long as there is no one referencing the object. This means for a
> particular workload, hazard pointers may have smaller memory footprint
> due to fewer pending callbacks.
> 
> The synchronization between readers and updaters is built around "hazard
> pointer slots": a slot readers can use to store a pointer value.
> 
> Reader side protection:
> 
> 1. Read the value of a pointer to the target data element.
> 2. Store it to a hazard pointer slot.
> 3. Enforce full ordering (e.g. smp_mb()).
> 4. Re-read the original pointer, reset the slot and retry if the
> value changed.
> 5. Otherwise, the continued existence of the target data element
> is guaranteed.
> 
> Updater side check:
> 
> 1. Unpublish the target data element (e.g. setting the pointer
> value to NULL).
> 2. Enforce full ordering.
> 3. Read the value from a hazard pointer slot.
> 4. If the value doesn't match the target data element, then this
> slot doesn't represent a reference to it.
> 5. Otherwise, updater needs to re-check (step 3).
> 
> To distribute the accesses of hazptr slots from different contexts,
> hazptr_context is introduced. Users need to define/allocate their own
> hazptr_context to allocate hazard pointer slots.
> 
> For the updater side to confirm no existing reference, it needs to scan
> all the possible slots, and to speed up this process, hazptr_context
> also contains an rbtree node for each slot so that updater can cache the
> reader-scan result in an rbtree. The rbtree nodes are pre-allocated in
> this way to prevent "allocate memory to free memory" in extreme cases.
> 
> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
>     lock-free objects," in IEEE Transactions on Parallel and
>     Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> 
> Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Co-developed-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> include/linux/hazptr.h |  83 ++++++++
> kernel/Makefile        |   1 +
> kernel/hazptr.c        | 463 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 547 insertions(+)
> create mode 100644 include/linux/hazptr.h
> create mode 100644 kernel/hazptr.c
> 
> diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
> new file mode 100644
> index 000000000000..4548ca8c75eb
> --- /dev/null
> +++ b/include/linux/hazptr.h
> @@ -0,0 +1,83 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_HAZPTR_H
> +#define _LINUX_HAZPTR_H
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rbtree.h>
> +
> +/* Hazard pointer slot. */
> +typedef void* hazptr_t;
> +
> +#define HAZPTR_SLOT_PER_CTX 8
> +
> +struct hazptr_slot_snap {
> + struct rb_node node;
> + hazptr_t slot;
> +};
> +
> +/*
> + * A set of hazard pointer slots for a context.
> + *
> + * The context can be a task, CPU or reader (depends on the use case). It may
> + * be allocated statically or dynamically. It can only be used after calling
> + * init_hazptr_context(), and users are also responsible to call
> + * cleaup_hazptr_context() when it's not used any more.
> + */
> +struct hazptr_context {
> + // The lock of the percpu context lists.
> + spinlock_t *lock;
> +
> + struct list_head list;
> + struct hazptr_slot_snap snaps[HAZPTR_SLOT_PER_CTX];
> + ____cacheline_aligned hazptr_t slots[HAZPTR_SLOT_PER_CTX];
> +};
> +
> +void init_hazptr_context(struct hazptr_context *hzcp);
> +void cleanup_hazptr_context(struct hazptr_context *hzcp);
> +hazptr_t *hazptr_alloc(struct hazptr_context *hzcp);
> +void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzp);
> +
> +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
> +#define hazptr_protect(hzp, gp, field) ({ \
> + typeof(gp) ___p; \
> + \
> + ___p = hazptr_tryprotect(hzp, gp, field); \
> + BUG_ON(!___p); \

hazptr_tryprotect might return NULL, do you need a loop here?

> + ___p; \
> +})
> +
> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
> + void *const *p,
> + unsigned long head_offset)
> +{
> + void *ptr;
> + struct callback_head *head;
> +
> + ptr = READ_ONCE(*p);
> +
> + if (ptr == NULL)
> + return NULL;
> +
> + head = (struct callback_head *)(ptr + head_offset);
> +
> + WRITE_ONCE(*hzp, head);
> + smp_mb();
> +
> + ptr = READ_ONCE(*p); // read again
> +
> + if (ptr + head_offset != head) { // pointer changed
> + WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
> + return NULL;
> + } else
> + return ptr;
> +}
> +
> +static inline void hazptr_clear(hazptr_t *hzp)
> +{
> + /* Pairs with smp_load_acquire() in reader scan. */
> + smp_store_release(hzp, NULL);
> +}
> +
> +void call_hazptr(struct callback_head *head, rcu_callback_t func);
> +#endif
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3c13240dfc9f..7927264b9870 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -50,6 +50,7 @@ obj-y += rcu/
> obj-y += livepatch/
> obj-y += dma/
> obj-y += entry/
> +obj-y += hazptr.o
> obj-$(CONFIG_MODULES) += module/
> 
> obj-$(CONFIG_KCMP) += kcmp.o
> diff --git a/kernel/hazptr.c b/kernel/hazptr.c
> new file mode 100644
> index 000000000000..f22ccc2a4a62
> --- /dev/null
> +++ b/kernel/hazptr.c
> @@ -0,0 +1,463 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/spinlock.h>
> +#include <linux/cleanup.h>
> +#include <linux/hazptr.h>
> +#include <linux/percpu.h>
> +#include <linux/workqueue.h>
> +
> +#define HAZPTR_UNUSED (1ul)
> +
> +/* Per-CPU data for hazard pointer management. */
> +struct hazptr_percpu {
> + spinlock_t ctx_lock; /* hazptr context list lock */
> + struct list_head ctx_list; /* hazptr context list */
> + spinlock_t callback_lock; /* Per-CPU callback queue lock */
> + struct callback_head *queued; /* Per-CPU callback queue */
> + struct callback_head **tail;
> +};
> +
> +/*
> + * Per-CPU data contains context lists and callbacks, which are maintained in a
> + * per-CPU maner to reduce potential contention. This means a global scan (for
> + * readers or callbacks) has to take each CPU's lock, but it's less problematic
> + * because that's a slowpath.
> + */
> +DEFINE_PER_CPU(struct hazptr_percpu, hzpcpu);
> +
> +/* A RBTree that stores the reader scan result of all hazptr slot */
> +struct hazptr_reader_tree {
> + spinlock_t lock;
> + struct rb_root root;
> +};
> +
> +static void init_hazptr_reader_tree(struct hazptr_reader_tree *tree)
> +{
> + spin_lock_init(&tree->lock);
> + tree->root = RB_ROOT;
> +}
> +
> +static bool is_null_or_unused(hazptr_t slot)
> +{
> + return slot == NULL || ((unsigned long)slot) == HAZPTR_UNUSED;
> +}
> +
> +static int hazptr_node_cmp(const void *key, const struct rb_node *curr)
> +{
> + unsigned long slot = (unsigned long)key;
> + struct hazptr_slot_snap *curr_snap = container_of(curr, struct hazptr_slot_snap, node);
> + unsigned long curr_slot = (unsigned long)(curr_snap->slot);
> +
> + if (slot < curr_slot)
> + return -1;
> + else if (slot > curr_slot)
> + return 1;
> + else
> + return 0;
> +}
> +
> +static bool hazptr_node_less(struct rb_node *new, const struct rb_node *curr)
> +{
> + struct hazptr_slot_snap *new_snap = container_of(new, struct hazptr_slot_snap, node);
> +
> + return hazptr_node_cmp((void *)new_snap->slot, curr) < 0;
> +}
> +
> +/* Add the snapshot into a search tree. tree->lock must be held. */
> +static inline void reader_add_locked(struct hazptr_reader_tree *tree,
> +     struct hazptr_slot_snap *snap)
> +{
> + lockdep_assert_held(&tree->lock);
> + BUG_ON(is_null_or_unused(snap->slot));
> +
> + rb_add(&snap->node, &tree->root, hazptr_node_less);
> +}
> +
> +static inline void reader_add(struct hazptr_reader_tree *tree,
> +      struct hazptr_slot_snap *snap)
> +{
> + guard(spinlock_irqsave)(&tree->lock);
> +
> + reader_add_locked(tree, snap);
> +}
> +
> +/* Delete the snapshot from a search tree. tree->lock must be held. */
> +static inline void reader_del_locked(struct hazptr_reader_tree *tree,
> +     struct hazptr_slot_snap *snap)
> +{
> + lockdep_assert_held(&tree->lock);
> +
> + rb_erase(&snap->node, &tree->root);
> +}
> +
> +static inline void reader_del(struct hazptr_reader_tree *tree,
> +      struct hazptr_slot_snap *snap)
> +{
> + guard(spinlock_irqsave)(&tree->lock);
> +
> + reader_del_locked(tree, snap);
> +}
> +
> +/* Find whether a read exists. tree->lock must be held. */
> +static inline bool reader_exist_locked(struct hazptr_reader_tree *tree,
> +       unsigned long slot)
> +{
> + lockdep_assert_held(&tree->lock);
> +
> + return !!rb_find((void *)slot, &tree->root, hazptr_node_cmp);
> +}
> +
> +static inline bool reader_exist(struct hazptr_reader_tree *tree,
> + unsigned long slot)
> +{
> + guard(spinlock_irqsave)(&tree->lock);
> +
> + return reader_exist_locked(tree, slot);
> +}
> +
> +/*
> + * Scan the readers from one hazptr context and update the global readers tree.
> + *
> + * Must be called with hzcp->lock held.
> + */
> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
> +       struct hazptr_context *hzcp)
> +{
> + lockdep_assert_held(hzcp->lock);
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + /*
> + * Pairs with smp_store_release() in hazptr_{clear,free}().
> + *
> + * Ensure
> + *
> + * <reader> <updater>
> + *
> + * [access protected pointers]
> + * hazptr_clear();
> + *   smp_store_release()
> + * // in reader scan.
> + * smp_load_acquire(); // is null or unused.
> + * [run callbacks] // all accesses from
> + * // reader must be
> + * // observed.
> + */
> + hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> +
> + if (!is_null_or_unused(val)) {
> + struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> +
> + // Already in the tree, need to remove first.
> + if (!is_null_or_unused(snap->slot)) {
> + reader_del(tree, snap);
> + }
> + snap->slot = val;
> + reader_add(tree, snap);
> + }
> + }
> +}
> +
> +/*
> + * Initialize a hazptr context.
> + *
> + * Must be called before using the context for hazptr allocation.
> + */
> +void init_hazptr_context(struct hazptr_context *hzcp)
> +{
> + struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + hzcp->slots[i] = (hazptr_t)HAZPTR_UNUSED;
> + hzcp->snaps[i].slot = (hazptr_t)HAZPTR_UNUSED;
> + }
> +
> + guard(spinlock_irqsave)(&pcpu->ctx_lock);
> + list_add(&hzcp->list, &pcpu->ctx_list);
> + hzcp->lock = &pcpu->ctx_lock;
> +}
> +
> +struct hazptr_struct {
> + struct work_struct work;
> + bool scheduled;
> +
> + // list of callbacks, we move percpu queued callbacks into the global
> + // queued list in workqueue function.
> + spinlock_t callback_lock;
> + struct callback_head *queued;
> + struct callback_head **tail;
> +
> + struct hazptr_reader_tree readers;
> +};
> +
> +static struct hazptr_struct hazptr_struct;
> +
> +/*
> + * Clean up hazptr context.
> + *
> + * Must call before freeing the context. This function also removes any
> + * reference held by the hazard pointer slots in the context, even
> + * hazptr_clear() or hazptr_free() is not called previously.
> + */
> +void cleanup_hazptr_context(struct hazptr_context *hzcp)
> +{
> + if (hzcp->lock) {
> + scoped_guard(spinlock_irqsave, hzcp->lock) {
> + list_del(&hzcp->list);
> + hzcp->lock = NULL;
> + }
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> +
> + if (!is_null_or_unused(snap->slot))
> + reader_del(&hazptr_struct.readers, snap);
> + }
> + }
> +}
> +
> +/*
> + * Allocate a hazptr slot from a hazptr_context.
> + *
> + * Return: NULL means fail to allocate, otherwise the address of the allocated
> + * slot.
> + */
> +hazptr_t *hazptr_alloc(struct hazptr_context *hzcp)
> +{
> + unsigned long unused;
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + if (((unsigned long)READ_ONCE(hzcp->slots[i])) == HAZPTR_UNUSED) {
> + unused = HAZPTR_UNUSED;
> +
> + /*
> + * rwm-sequence is relied on here.
> + *
> + * This is needed since in case of a previous reader:
> + *
> + * <reader 1> <reader 2> <updater>
> + * [access protected pointers]
> + * hazptr_free():
> + *   smp_store_release(); // hzptr == UNUSED
> + * hazptr_alloc():
> + *  try_cmpxchg_relaxed(); // hzptr == NULL
> + *
> + * // in reader scan
> + * smp_load_acquire(); // is null
> + * [run callbacks]
> + *
> + * Because of the rwm-sequence of release operations,
> + * when callbacks are run, accesses from reader 1 must
> + * be already observed by the updater.
> + */
> + if (try_cmpxchg_relaxed(&hzcp->slots[i], &unused, NULL)) {
> + return (hazptr_t *)&hzcp->slots[i];
> + }
> + }
> + }
> +
> + return NULL;
> +}
> +
> +/* Free a hazptr slot. */
> +void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzptr)
> +{
> + WARN_ON(((unsigned long)*hzptr) == HAZPTR_UNUSED);
> +
> + /* Pairs with smp_load_acquire() in reader scan */
> + smp_store_release(hzptr, (void *)HAZPTR_UNUSED);
> +}
> +
> +/* Scan all possible readers, and update the global reader tree. */
> +static void check_readers(struct hazptr_struct *hzst)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
> + struct hazptr_context *ctx;
> +
> + guard(spinlock_irqsave)(&pcpu->ctx_lock);
> + list_for_each_entry(ctx, &pcpu->ctx_list, list)
> + hazptr_context_snap_readers_locked(&hzst->readers, ctx);
> + }
> +}
> +
> +/*
> + * Start the background work for hazptr callback handling if not started.
> + *
> + * Must be called with hazptr_struct lock held.
> + */
> +static void kick_hazptr_work(void)
> +{
> + if (hazptr_struct.scheduled)
> + return;
> +
> + queue_work(system_wq, &hazptr_struct.work);
> + hazptr_struct.scheduled = true;
> +}
> +
> +/*
> + * Check which callbacks are ready to be called.
> + *
> + * Return: a callback list that no reader is referencing the corresponding
> + * objects.
> + */
> +static struct callback_head *do_hazptr(struct hazptr_struct *hzst)
> +{
> + struct callback_head *tmp, **curr;
> + struct callback_head *todo = NULL, **todo_tail = &todo;
> +
> + // Currently, the lock is unnecessary, but maybe we will have multiple
> + // work_structs sharing the same callback list in the future;
> + guard(spinlock_irqsave)(&hzst->callback_lock);
> +
> + curr = &hzst->queued;
> +
> + while ((tmp = *curr)) {
> + // No reader, we can free the object.
> + if (!reader_exist(&hzst->readers, (unsigned long)tmp)) {
> + // Add tmp into todo.
> + *todo_tail = tmp;
> + todo_tail = &tmp->next;
> +
> + // Delete tmp from ->queued and move to the next entry.
> + *curr = tmp->next;
> + } else
> + curr = &tmp->next;
> + }
> +
> + hzst->tail = curr;
> +
> + // Keep checking if callback list is not empty.
> + if (hzst->queued)
> + kick_hazptr_work();
> +
> + *todo_tail = NULL;
> +
> + return todo;
> +}
> +
> +static void hazptr_work_func(struct work_struct *work)
> +{
> + struct hazptr_struct *hzst = container_of(work, struct hazptr_struct, work);
> + struct callback_head *todo;
> +
> + // Advance callbacks from hzpcpu to hzst
> + scoped_guard(spinlock_irqsave, &hzst->callback_lock) {
> + int cpu;
> +
> + hzst->scheduled = false;
> + for_each_possible_cpu(cpu) {
> + struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
> +
> + guard(spinlock)(&pcpu->callback_lock);
> +
> + if (pcpu->queued) {
> + *(hzst->tail) = pcpu->queued;
> + hzst->tail = pcpu->tail;
> + pcpu->queued = NULL;
> + pcpu->tail = &pcpu->queued;
> + }
> + }
> + }
> +
> + // Pairs with the smp_mb() on the reader side. This guarantees that if
> + // the hazptr work picked up the callback from an updater and the
> + // updater set the global pointer to NULL before enqueue the callback,
> + // the hazptr work must observe a reader that protects the hazptr before
> + // the updater.
> + //
> + // <reader> <updater> <hazptr work>
> + // ptr = READ_ONCE(gp);
> + // WRITE_ONCE(*hazptr, ptr);
> + // smp_mb(); // => ->strong-fence
> + // tofree = gp;
> + // ptr = READ_ONCE(gp); // re-read, gp is not NULL
> + // // => ->fre
> + // WRITE_ONCE(gp, NULL);
> + // call_hazptr(gp, ...):
> + //  lock(->callback_lock);
> + //  [queued the callback]
> + //  unlock(->callback_lock);// => ->po-unlock-lock-po
> + // lock(->callback_lock);
> + // [move from hzpcpu to hzst]
> + //
> + // smp_mb(); => ->strong-fence
> + // ptr = READ_ONCE(*hazptr);
> + // // ptr == gp, otherwise => ->fre
> + //
> + // If ptr != gp, it means there exists a circle with the following
> + // memory ordering relationships:
> + // ->strong-fence ->fre ->po-unlock-lock-po ->strong-fence ->fre
> + // => (due to the definition of prop)
> + // ->strong-fence ->prop ->strong-fence ->hb ->prop
> + // => (rotate the circle)
> + // ->prop ->strong-fence ->prop ->strong-fence ->hb
> + // => (due to the definition of pb)
> + // ->pb ->pb
> + // but pb is acyclic according to LKMM, so this cannot happen.
> + smp_mb();
> + check_readers(hzst);
> +
> + todo = do_hazptr(hzst);
> +
> + while (todo) {
> + struct callback_head *next = todo->next;
> + void (*func)(struct callback_head *) = todo->func;
> +
> + func(todo);
> +
> + todo = next;
> + }
> +}
> +
> +/*
> + * Put @head into the cleanup callback queue.
> + *
> + * @func(@head) will be called after no one is referencing the object. Caller
> + * must ensure the object has been unpublished before calling this.
> + */
> +void call_hazptr(struct callback_head *head, rcu_callback_t func)
> +{
> + struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
> +
> + head->func = func;
> + head->next = NULL;
> +
> + scoped_guard(spinlock_irqsave, &pcpu->callback_lock) {
> + *(pcpu->tail) = head;
> + pcpu->tail = &head->next;
> + }
> +
> + guard(spinlock_irqsave)(&hazptr_struct.callback_lock);
> + kick_hazptr_work();
> +}
> +
> +static int init_hazptr_struct(void)
> +{
> + int cpu;
> +
> + INIT_WORK(&hazptr_struct.work, hazptr_work_func);
> +
> + spin_lock_init(&hazptr_struct.callback_lock);
> + hazptr_struct.queued = NULL;
> + hazptr_struct.tail = &hazptr_struct.queued;
> +
> + for_each_possible_cpu(cpu) {
> + struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
> +
> + spin_lock_init(&pcpu->ctx_lock);
> + INIT_LIST_HEAD(&pcpu->ctx_list);
> +
> + spin_lock_init(&pcpu->callback_lock);
> + pcpu->queued = NULL;
> + pcpu->tail = &pcpu->queued;
> +
> + }
> +
> + init_hazptr_reader_tree(&hazptr_struct.readers);
> +
> + return 0;
> +}
> +
> +early_initcall(init_hazptr_struct);
> -- 
> 2.45.2
> 
>
Jann Horn Sept. 19, 2024, 12:12 a.m. UTC | #3
On Tue, Sep 17, 2024 at 4:33 PM Boqun Feng <boqun.feng@gmail.com> wrote:
> Hazard pointers [1] provide a way to dynamically distribute refcounting
> and can be used to improve the scalability of refcounting without
> significant space cost.

> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
> +                                       void *const *p,
> +                                       unsigned long head_offset)
> +{
> +       void *ptr;
> +       struct callback_head *head;
> +
> +       ptr = READ_ONCE(*p);
> +
> +       if (ptr == NULL)
> +               return NULL;
> +
> +       head = (struct callback_head *)(ptr + head_offset);
> +
> +       WRITE_ONCE(*hzp, head);
> +       smp_mb();
> +
> +       ptr = READ_ONCE(*p); // read again
> +
> +       if (ptr + head_offset != head) { // pointer changed
> +               WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
> +               return NULL;
> +       } else
> +               return ptr;
> +}

I got nerdsniped by the Plumbers talk. So, about that smp_mb()...

I think you should be able to avoid the smp_mb() using relaxed atomics
(on architectures that have those), at the cost of something like a
cmpxchg-acquire sandwiched between a load-acquire and a relaxed load?
I'm not sure how their cost compares to an smp_mb() though.

Something like this, *assuming there can only be one context at a time
waiting for a given hazptr_t*:


typedef struct {
  /* consists of: marker bit in least significant bit, rest is normal pointer */
  atomic_long_t value;
} hazptr_t;

/* assumes that hzp is currently set to NULL (but it may contain a
marker bit) */
static inline void *__hazptr_tryprotect(hazptr_t *hzp, void *const *p) {
  /* note that the loads of these three operations are ordered using
acquire semantics */
  void *ptr = smp_load_acquire(p);
  /* set pointer while leaving marker bit intact */
  unsigned long hazard_scanning =
atomic_long_fetch_or_acquire((unsigned long)ptr, &hzp->value);
  if (unlikely(hazard_scanning)) {
    BUG_ON(hazard_scanning != 1);
    /* slowpath, concurrent hazard pointer waiter */
    smp_mb();
  }
  if (READ_ONCE(*p) != ptr) { /* recheck failed, pointer changed */
    atomic_long_and(1UL, &hzp->value); /* drop our pointer, keep any marker bit */
    return NULL;
  }
  return ptr;
}

/* simplified for illustration, assumes there's only a single hazard
pointer @hzp that could point to @ptr */
static void remove_pointer_and_wait_for_hazard(hazptr_t *hzp, void
*ptr, void *const *p) {
  WRITE_ONCE(*p, NULL);
  smp_mb();
  /* set marker bit */
  atomic_long_or(1UL, &hzp->value);
  while ((void *)(atomic_long_read(&hzp->value) & ~1UL) == ptr)
    wait();
  /* clear marker bit */
  atomic_long_and(~1UL, &hzp->value);
}


The idea would be that the possible orderings when these two functions
race against each other are:

Ordering A: The atomic_long_fetch_or_acquire() in
__hazptr_tryprotect() happens after the atomic_long_or(), two
subcases:
Ordering A1 (slowpath): atomic_long_fetch_or_acquire() is ordered
before the atomic_long_and() in remove_pointer_and_wait_for_hazard(),
so the marker bit is observed set, "hazard_scanning" is true. We go on
the safe slowpath which is like the original patch, so it's safe.
Ordering A2 (recheck fails): atomic_long_fetch_or_acquire() is ordered
after the atomic_long_and() in remove_pointer_and_wait_for_hazard(),
so the subsequent READ_ONCE(*p) is also ordered after the
atomic_long_and(), which is ordered after the WRITE_ONCE(*p, NULL), so
the READ_ONCE(*p) recheck must see a NULL pointer and fail.
Ordering B (hazard pointer visible): The
atomic_long_fetch_or_acquire() in __hazptr_tryprotect() happens before
the atomic_long_or(). In that case, it also happens before the
atomic_long_read() in remove_pointer_and_wait_for_hazard(), so the
hazard pointer will be visible to
remove_pointer_and_wait_for_hazard().

But this seems pretty gnarly/complicated, so even if my 2AM reasoning
ability is correct, actually implementing this might not be a good
idea... and it definitely wouldn't help on X86 at all, since X86
doesn't have such nice relaxed RMW ops.
Lai Jiangshan Sept. 19, 2024, 6:39 a.m. UTC | #4
On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:

> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
> +                                              struct hazptr_context *hzcp)
> +{
> +       lockdep_assert_held(hzcp->lock);
> +
> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> +               /*
> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
> +                *
> +                * Ensure
> +                *
> +                * <reader>             <updater>
> +                *
> +                * [access protected pointers]
> +                * hazptr_clear();
> +                *   smp_store_release()
> +                *                      // in reader scan.
> +                *                      smp_load_acquire(); // is null or unused.
> +                *                      [run callbacks] // all accesses from
> +                *                                      // reader must be
> +                *                                      // observed.
> +                */
> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> +
> +               if (!is_null_or_unused(val)) {
> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> +
> +                       // Already in the tree, need to remove first.
> +                       if (!is_null_or_unused(snap->slot)) {
> +                               reader_del(tree, snap);
> +                       }
> +                       snap->slot = val;
> +                       reader_add(tree, snap);
> +               }
> +       }
> +}

Hello

I'm curious about whether there are any possible memory leaks here.

It seems that call_hazptr() never frees the memory until the slot is
set to another valid value.

In the code here, the snap is not deleted when hzcp->slots[i] is null/unused
but snap->slot is not, which I think it should be.

And it can cause unneeded deletion and addition of the snap if the slot
value is unchanged.

I'm not so sure...

Thanks
Lai
Boqun Feng Sept. 19, 2024, 6:56 a.m. UTC | #5
On Wed, Sep 18, 2024 at 11:17:37PM +0800, Alan Huang wrote:
[...]
> > +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
> > +#define hazptr_protect(hzp, gp, field) ({ \
> > + typeof(gp) ___p; \
> > + \
> > + ___p = hazptr_tryprotect(hzp, gp, field); \
> > + BUG_ON(!___p); \
> 
> hazptr_tryprotect might return NULL, do you need a loop here?
> 

Thanks for the review. It's me who didn't do a good job here on the
documentation: hazptr_protect() is supposed to be used for the case where
readers know the gp won't change.
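
For example (a sketch, with gp known to stay non-NULL and unchanged
here, and an illustrative callback_head field named "rh"):

	struct mynode *p = hazptr_protect(hzp, gp, rh);	/* BUG_ON()s instead of retrying */

	/* ... use p ... */
	hazptr_clear(hzp);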

Regards,
Boqun

> > + ___p; \
> > +})
> > +
Boqun Feng Sept. 19, 2024, 7:10 a.m. UTC | #6
On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> > +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
> > +                                              struct hazptr_context *hzcp)
> > +{
> > +       lockdep_assert_held(hzcp->lock);
> > +
> > +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> > +               /*
> > +                * Pairs with smp_store_release() in hazptr_{clear,free}().
> > +                *
> > +                * Ensure
> > +                *
> > +                * <reader>             <updater>
> > +                *
> > +                * [access protected pointers]
> > +                * hazptr_clear();
> > +                *   smp_store_release()
> > +                *                      // in reader scan.
> > +                *                      smp_load_acquire(); // is null or unused.
> > +                *                      [run callbacks] // all accesses from
> > +                *                                      // reader must be
> > +                *                                      // observed.
> > +                */
> > +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> > +
> > +               if (!is_null_or_unused(val)) {
> > +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> > +
> > +                       // Already in the tree, need to remove first.
> > +                       if (!is_null_or_unused(snap->slot)) {
> > +                               reader_del(tree, snap);
> > +                       }
> > +                       snap->slot = val;
> > +                       reader_add(tree, snap);
> > +               }
> > +       }
> > +}
> 
> Hello
> 
> I'm curious about whether there are any possible memory leaks here.
> 
> It seems that call_hazptr() never frees the memory until the slot is
> set to another valid value.
> 
> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
> and snap->slot is not which I think it should be.
> 
> And it can cause unneeded deletion and addition of the snap if the slot
> value is unchanged.
> 

I think you're right. (Although the node will eventually be deleted at
cleanup_hazptr_context(), there could be a long-lived
hazptr_context.) It should be:

		hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
		struct hazptr_slot_snap *snap = &hzcp->snaps[i];

		if (val != snap->slot) { // val changed, need to update the tree node.
			// Already in the tree, need to remove first.
			if (!is_null_or_unused(snap->slot)) {
				reader_del(tree, snap);
			}

			// use the latest snapshot.
			snap->slot = val;

			// Add it into tree if there is a reader
			if (!is_null_or_unused(val))
				reader_add(tree, snap);
		}

Regards,
Boqun

> I'm not so sure...
> 
> Thanks
> Lai
Alan Huang Sept. 19, 2024, 12:33 p.m. UTC | #7
On Sept. 19, 2024, at 15:10, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
>> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> 
>>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
>>> +                                              struct hazptr_context *hzcp)
>>> +{
>>> +       lockdep_assert_held(hzcp->lock);
>>> +
>>> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
>>> +               /*
>>> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
>>> +                *
>>> +                * Ensure
>>> +                *
>>> +                * <reader>             <updater>
>>> +                *
>>> +                * [access protected pointers]
>>> +                * hazptr_clear();
>>> +                *   smp_store_release()
>>> +                *                      // in reader scan.
>>> +                *                      smp_load_acquire(); // is null or unused.
>>> +                *                      [run callbacks] // all accesses from
>>> +                *                                      // reader must be
>>> +                *                                      // observed.
>>> +                */
>>> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> +
>>> +               if (!is_null_or_unused(val)) {
>>> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> +
>>> +                       // Already in the tree, need to remove first.
>>> +                       if (!is_null_or_unused(snap->slot)) {
>>> +                               reader_del(tree, snap);
>>> +                       }
>>> +                       snap->slot = val;
>>> +                       reader_add(tree, snap);
>>> +               }
>>> +       }
>>> +}
>> 
>> Hello
>> 
>> I'm curious about whether there are any possible memory leaks here.
>> 
>> It seems that call_hazptr() never frees the memory until the slot is
>> set to another valid value.
>> 
>> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
>> and snap->slot is not which I think it should be.
>> 
>> And it can cause unneeded deletion and addition of the snap if the slot
>> value is unchanged.
>> 
> 
> I think you're right. (Although the node will be eventually deleted at
> cleanup_hazptr_context(), however there could be a long-live
> hazptr_context). It should be:
> 
> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> 
> if (val != snap->slot) { // val changed, need to update the tree node.
> // Already in the tree, need to remove first.
> if (!is_null_or_unused(snap->slot)) {
> reader_del(tree, snap);
> }
> 
> // use the latest snapshot.
> snap->slot = val;
> 
> // Add it into tree if there is a reader
> if (!is_null_or_unused(val))
> reader_add(tree, snap);
> }

With this changed, and forcing users to call hazptr_clear() like rcu_read_unlock(), could we remove
the reader_del() in cleanup_hazptr_context(), and then remove the tree->lock?

> 
> Regards,
> Boqun
> 
>> I'm not so sure...
>> 
>> Thanks
>> Lai
Alan Huang Sept. 19, 2024, 1:57 p.m. UTC | #8
On Sept. 19, 2024, at 15:10, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
>> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> 
>>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
>>> +                                              struct hazptr_context *hzcp)
>>> +{
>>> +       lockdep_assert_held(hzcp->lock);
>>> +
>>> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
>>> +               /*
>>> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
>>> +                *
>>> +                * Ensure
>>> +                *
>>> +                * <reader>             <updater>
>>> +                *
>>> +                * [access protected pointers]
>>> +                * hazptr_clear();
>>> +                *   smp_store_release()
>>> +                *                      // in reader scan.
>>> +                *                      smp_load_acquire(); // is null or unused.
>>> +                *                      [run callbacks] // all accesses from
>>> +                *                                      // reader must be
>>> +                *                                      // observed.
>>> +                */
>>> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> +
>>> +               if (!is_null_or_unused(val)) {
>>> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> +
>>> +                       // Already in the tree, need to remove first.
>>> +                       if (!is_null_or_unused(snap->slot)) {
>>> +                               reader_del(tree, snap);
>>> +                       }
>>> +                       snap->slot = val;
>>> +                       reader_add(tree, snap);
>>> +               }
>>> +       }
>>> +}
>> 
>> Hello
>> 
>> I'm curious about whether there are any possible memory leaks here.
>> 
>> It seems that call_hazptr() never frees the memory until the slot is
>> set to another valid value.
>> 
>> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
>> and snap->slot is not which I think it should be.
>> 
>> And it can cause unneeded deletion and addition of the snap if the slot
>> value is unchanged.
>> 
> 
> I think you're right. (Although the node will be eventually deleted at
> cleanup_hazptr_context(), however there could be a long-live
> hazptr_context). It should be:
> 
> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> 
> if (val != snap->slot) { // val changed, need to update the tree node.
> // Already in the tree, need to remove first.
> if (!is_null_or_unused(snap->slot)) {
> reader_del(tree, snap);
> }
> 
> // use the latest snapshot.
> snap->slot = val;
> 
> // Add it into tree if there is a reader
> if (!is_null_or_unused(val))
> reader_add(tree, snap);
> }

It seems that two different hazptr_contexts can't be used to protect the same pointer?

Otherwise the following can happen?

thread1                          thread2                          thread3 (worker)           thread4
hazptr_tryprotect(hzp1, ptr1)    hazptr_tryprotect(hzp2, ptr1)
                                                                  add ptr1 to tree
hazptr_clear(hzp1)
hazptr_tryprotect(hzp1, ptr2)
                                                                  delete ptr1 from tree      unpub ptr1
                                                                                             call_hazptr(ptr1)
                                                                                             oops: invoke ptr1's callback
Or am I missing something?

> 
> Regards,
> Boqun
> 
>> I'm not so sure...
>> 
>> Thanks
>> Lai
Jonas Oberhauser Sept. 19, 2024, 2 p.m. UTC | #9
On 9/17/2024 at 4:33 PM, Boqun Feng wrote:
> +#define hazptr_protect(hzp, gp, field) ({				\
> +	typeof(gp) ___p;						\
> +									\
> +	___p = hazptr_tryprotect(hzp, gp, field);			\
> +	BUG_ON(!___p);							\
> +	___p;								\
> +})


Hi Boqun, why crash instead of retry?

jonas
Alan Huang Sept. 19, 2024, 4:10 p.m. UTC | #10
On Sept. 19, 2024, at 15:10, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
>> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> 
>>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
>>> +                                              struct hazptr_context *hzcp)
>>> +{
>>> +       lockdep_assert_held(hzcp->lock);
>>> +
>>> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
>>> +               /*
>>> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
>>> +                *
>>> +                * Ensure
>>> +                *
>>> +                * <reader>             <updater>
>>> +                *
>>> +                * [access protected pointers]
>>> +                * hazptr_clear();
>>> +                *   smp_store_release()
>>> +                *                      // in reader scan.
>>> +                *                      smp_load_acquire(); // is null or unused.
>>> +                *                      [run callbacks] // all accesses from
>>> +                *                                      // reader must be
>>> +                *                                      // observed.
>>> +                */
>>> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> +
>>> +               if (!is_null_or_unused(val)) {
>>> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> +
>>> +                       // Already in the tree, need to remove first.
>>> +                       if (!is_null_or_unused(snap->slot)) {
>>> +                               reader_del(tree, snap);
>>> +                       }
>>> +                       snap->slot = val;
>>> +                       reader_add(tree, snap);
>>> +               }
>>> +       }
>>> +}
>> 
>> Hello
>> 
>> I'm curious about whether there are any possible memory leaks here.
>> 
>> It seems that call_hazptr() never frees the memory until the slot is
>> set to another valid value.
>> 
>> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
>> and snap->slot is not which I think it should be.
>> 
>> And it can cause unneeded deletion and addition of the snap if the slot
>> value is unchanged.
>> 
> 
> I think you're right. (Although the node will be eventually deleted at
> cleanup_hazptr_context(), however there could be a long-live
> hazptr_context). It should be:
> 
> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> 
> if (val != snap->slot) { // val changed, need to update the tree node.
> // Already in the tree, need to remove first.
> if (!is_null_or_unused(snap->slot)) {
> reader_del(tree, snap);
> }
> 
> // use the latest snapshot.
> snap->slot = val;
> 
> // Add it into tree if there is a reader
> if (!is_null_or_unused(val))
> reader_add(tree, snap);
> }

Even using the same context, if two slots are used to protect the same pointer, say ptr1,
then if the second slot is reused for ptr2, ptr1's callback will be invoked even though the first slot still holds ptr1.

The original patch also has this problem.

> 
> Regards,
> Boqun
> 
>> I'm not so sure...
>> 
>> Thanks
>> Lai
Jonas Oberhauser Sept. 19, 2024, 6:07 p.m. UTC | #11
On 9/19/2024 at 8:56 AM, Boqun Feng wrote:
> On Wed, Sep 18, 2024 at 11:17:37PM +0800, Alan Huang wrote:
> [...]
>>> +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
>>> +#define hazptr_protect(hzp, gp, field) ({ \
>>> + typeof(gp) ___p; \
>>> + \
>>> + ___p = hazptr_tryprotect(hzp, gp, field); \
>>> + BUG_ON(!___p); \
>>
>> hazptr_tryprotect might return NULL, do you need a loop here?
>>
> 
> Thanks for the review. It's me who didn't do a good job here on the
> documentation. hazptr_protect() is supposed to use for the case where
> readers know the gp won't change.
> 
> Regards,
> Boqun
> 
>>> + ___p; \
>>> +})
>>> +

Oh, disregard my other e-mail, I hadn't seen this discussion.

Do you have any specific use case of this in mind? If you know that the 
pointer can't change, I would assume you can also just read the pointer 
normally and assign to the hazard pointer without a fence, no?

have fun, jonas
Boqun Feng Sept. 19, 2024, 6:58 p.m. UTC | #12
On Thu, Sep 19, 2024 at 09:57:12PM +0800, Alan Huang wrote:
[...]
> > 
> > I think you're right. (Although the node will be eventually deleted at
> > cleanup_hazptr_context(), however there could be a long-live
> > hazptr_context). It should be:
> > 
> > hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> > struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> > 
> > if (val != snap->slot) { // val changed, need to update the tree node.
> > // Already in the tree, need to remove first.
> > if (!is_null_or_unused(snap->slot)) {
> > reader_del(tree, snap);
> > }
> > 
> > // use the latest snapshot.
> > snap->slot = val;
> > 
> > // Add it into tree if there is a reader
> > if (!is_null_or_unused(val))
> > reader_add(tree, snap);
> > }
> 
> It seems like that two different hazptr_context can’t be used to protect the same pointer?
> 
> Otherwise the following can happen?
> 
> thread1 					thread2  					 thread3(worker) 	      thread4
> hazptr_tryprotect(hzp1, ptr1)   hazptr_tryprotect(hzp2, ptr1) 
> 												 add ptr1 to tree

Note that we have a snapshot rb_node for each hazard pointer slot, so
thread3 actually would add two rb_nodes with ->slot == ptr1 here.

> hazptr_clear(hzp1) 
> hazptr_tryprotect(hzp1, ptr2) 
> 												 delete ptr1 from tree     unpub ptr1

Therefore, there is still one rb_node with ->slot == ptr1 in the tree
after the deletion, so updaters won't invoke ptr1's callback.

Regards,
Boqun

> 																       call_hazptr(ptr1)
> 																       oops: invoke ptr1's callback
> Or am I missing something?
> 
> > 
> > Regards,
> > Boqun
> > 
> >> I'm not so sure...
> >> 
> >> Thanks
> >> Lai
> 
>
Alan Huang Sept. 19, 2024, 7:53 p.m. UTC | #13
On Sept. 20, 2024, at 02:58, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 09:57:12PM +0800, Alan Huang wrote:
> [...]
>>> 
>>> I think you're right. (Although the node will be eventually deleted at
>>> cleanup_hazptr_context(), however there could be a long-live
>>> hazptr_context). It should be:
>>> 
>>> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> 
>>> if (val != snap->slot) { // val changed, need to update the tree node.
>>> // Already in the tree, need to remove first.
>>> if (!is_null_or_unused(snap->slot)) {
>>> reader_del(tree, snap);
>>> }
>>> 
>>> // use the latest snapshot.
>>> snap->slot = val;
>>> 
>>> // Add it into tree if there is a reader
>>> if (!is_null_or_unused(val))
>>> reader_add(tree, snap);
>>> }
>> 
>> It seems like that two different hazptr_context can’t be used to protect the same pointer?
>> 
>> Otherwise the following can happen?
>> 
>> thread1  thread2    thread3(worker)        thread4
>> hazptr_tryprotect(hzp1, ptr1)   hazptr_tryprotect(hzp2, ptr1) 
>>  add ptr1 to tree
> 
> Note that we have snapshot rb_node for each hazard pointer slot, so here
> thread3 actually would add two rb_nodes with ->slot == ptr1 here.

Ok, good to know the rbtree can have multiple nodes with the same key.

Thanks for the explanation!

> 
>> hazptr_clear(hzp1) 
>> hazptr_tryprotect(hzp1, ptr2) 
>>  delete ptr1 from tree     unpub ptr1
> 
> Therefore, there is still one rb_node with ->slot == ptr1 in the tree
> after the deletion, so updaters won't invoke ptr1's callback.
> 
> Regards,
> Boqun
> 
>>        call_hazptr(ptr1)
>>        oops: invoke ptr1's callback
>> Or am I missing something?
>> 
>>> 
>>> Regards,
>>> Boqun
>>> 
>>>> I'm not so sure...
>>>> 
>>>> Thanks
>>>> Lai
Patch

diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
new file mode 100644
index 000000000000..4548ca8c75eb
--- /dev/null
+++ b/include/linux/hazptr.h
@@ -0,0 +1,83 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_HAZPTR_H
+#define _LINUX_HAZPTR_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rbtree.h>
+
+/* Hazard pointer slot. */
+typedef void* hazptr_t;
+
+#define HAZPTR_SLOT_PER_CTX 8
+
+struct hazptr_slot_snap {
+	struct rb_node node;
+	hazptr_t slot;
+};
+
+/*
+ * A set of hazard pointer slots for a context.
+ *
+ * The context can be a task, CPU or reader (depending on the use case). It may
+ * be allocated statically or dynamically. It can only be used after calling
+ * init_hazptr_context(), and users are also responsible for calling
+ * cleanup_hazptr_context() when it is not used any more.
+ */
+struct hazptr_context {
+	// The lock of the percpu context lists.
+	spinlock_t *lock;
+
+	struct list_head list;
+	struct hazptr_slot_snap snaps[HAZPTR_SLOT_PER_CTX];
+	____cacheline_aligned hazptr_t slots[HAZPTR_SLOT_PER_CTX];
+};
+
+void init_hazptr_context(struct hazptr_context *hzcp);
+void cleanup_hazptr_context(struct hazptr_context *hzcp);
+hazptr_t *hazptr_alloc(struct hazptr_context *hzcp);
+void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzp);
+
+#define hazptr_tryprotect(hzp, gp, field)	(typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
+#define hazptr_protect(hzp, gp, field) ({				\
+	typeof(gp) ___p;						\
+									\
+	___p = hazptr_tryprotect(hzp, gp, field);			\
+	BUG_ON(!___p);							\
+	___p;								\
+})
+
+static inline void *__hazptr_tryprotect(hazptr_t *hzp,
+					void *const *p,
+					unsigned long head_offset)
+{
+	void *ptr;
+	struct callback_head *head;
+
+	ptr = READ_ONCE(*p);
+
+	if (ptr == NULL)
+		return NULL;
+
+	head = (struct callback_head *)(ptr + head_offset);
+
+	WRITE_ONCE(*hzp, head);
+	smp_mb();
+
+	ptr = READ_ONCE(*p); // read again
+
+	if (ptr + head_offset != head) { // pointer changed
+		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
+		return NULL;
+	} else
+		return ptr;
+}
+
+static inline void hazptr_clear(hazptr_t *hzp)
+{
+	/* Pairs with smp_load_acquire() in reader scan. */
+	smp_store_release(hzp, NULL);
+}
+
+void call_hazptr(struct callback_head *head, rcu_callback_t func);
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..7927264b9870 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@  obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += hazptr.o
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/hazptr.c b/kernel/hazptr.c
new file mode 100644
index 000000000000..f22ccc2a4a62
--- /dev/null
+++ b/kernel/hazptr.c
@@ -0,0 +1,463 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/spinlock.h>
+#include <linux/cleanup.h>
+#include <linux/hazptr.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+
+#define HAZPTR_UNUSED (1ul)
+
+/* Per-CPU data for hazard pointer management. */
+struct hazptr_percpu {
+	spinlock_t ctx_lock;		/* hazptr context list lock */
+	struct list_head ctx_list;	/* hazptr context list */
+	spinlock_t callback_lock;	/* Per-CPU callback queue lock */
+	struct callback_head *queued;	/* Per-CPU callback queue */
+	struct callback_head **tail;
+};
+
+/*
+ * Per-CPU data contains context lists and callbacks, which are maintained in a
+ * per-CPU manner to reduce potential contention. This means a global scan (for
+ * readers or callbacks) has to take each CPU's lock, but it's less problematic
+ * because that's a slowpath.
+ */
+DEFINE_PER_CPU(struct hazptr_percpu, hzpcpu);
+
+/* An rbtree that stores the reader-scan results of all hazptr slots */
+struct hazptr_reader_tree {
+	spinlock_t lock;
+	struct rb_root root;
+};
+
+static void init_hazptr_reader_tree(struct hazptr_reader_tree *tree)
+{
+	spin_lock_init(&tree->lock);
+	tree->root = RB_ROOT;
+}
+
+static bool is_null_or_unused(hazptr_t slot)
+{
+	return slot == NULL || ((unsigned long)slot) == HAZPTR_UNUSED;
+}
+
+static int hazptr_node_cmp(const void *key, const struct rb_node *curr)
+{
+	unsigned long slot = (unsigned long)key;
+	struct hazptr_slot_snap *curr_snap = container_of(curr, struct hazptr_slot_snap, node);
+	unsigned long curr_slot = (unsigned long)(curr_snap->slot);
+
+	if (slot < curr_slot)
+		return -1;
+	else if (slot > curr_slot)
+		return 1;
+	else
+		return 0;
+}
+
+static bool hazptr_node_less(struct rb_node *new, const struct rb_node *curr)
+{
+	struct hazptr_slot_snap *new_snap = container_of(new, struct hazptr_slot_snap, node);
+
+	return hazptr_node_cmp((void *)new_snap->slot, curr) < 0;
+}
+
+/* Add the snapshot into a search tree. tree->lock must be held. */
+static inline void reader_add_locked(struct hazptr_reader_tree *tree,
+				     struct hazptr_slot_snap *snap)
+{
+	lockdep_assert_held(&tree->lock);
+	BUG_ON(is_null_or_unused(snap->slot));
+
+	rb_add(&snap->node, &tree->root, hazptr_node_less);
+}
+
+static inline void reader_add(struct hazptr_reader_tree *tree,
+			      struct hazptr_slot_snap *snap)
+{
+	guard(spinlock_irqsave)(&tree->lock);
+
+	reader_add_locked(tree, snap);
+}
+
+/* Delete the snapshot from a search tree. tree->lock must be held. */
+static inline void reader_del_locked(struct hazptr_reader_tree *tree,
+				     struct hazptr_slot_snap *snap)
+{
+	lockdep_assert_held(&tree->lock);
+
+	rb_erase(&snap->node, &tree->root);
+}
+
+static inline void reader_del(struct hazptr_reader_tree *tree,
+			      struct hazptr_slot_snap *snap)
+{
+	guard(spinlock_irqsave)(&tree->lock);
+
+	reader_del_locked(tree, snap);
+}
+
+/* Find whether a reader exists. tree->lock must be held. */
+static inline bool reader_exist_locked(struct hazptr_reader_tree *tree,
+				       unsigned long slot)
+{
+	lockdep_assert_held(&tree->lock);
+
+	return !!rb_find((void *)slot, &tree->root, hazptr_node_cmp);
+}
+
+static inline bool reader_exist(struct hazptr_reader_tree *tree,
+				unsigned long slot)
+{
+	guard(spinlock_irqsave)(&tree->lock);
+
+	return reader_exist_locked(tree, slot);
+}
+
+/*
+ * Scan the readers from one hazptr context and update the global readers tree.
+ *
+ * Must be called with hzcp->lock held.
+ */
+static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
+					       struct hazptr_context *hzcp)
+{
+	lockdep_assert_held(hzcp->lock);
+
+	for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+		/*
+		 * Pairs with smp_store_release() in hazptr_{clear,free}().
+		 *
+		 * Ensure
+		 *
+		 * <reader>		<updater>
+		 *
+		 * [access protected pointers]
+		 * hazptr_clear();
+		 *   smp_store_release()
+		 *			// in reader scan.
+		 *			smp_load_acquire(); // is null or unused.
+		 *			[run callbacks] // all accesses from
+		 *					// reader must be
+		 *					// observed.
+		 */
+		hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
+
+		if (!is_null_or_unused(val)) {
+			struct hazptr_slot_snap *snap = &hzcp->snaps[i];
+
+			// Already in the tree, need to remove first.
+			if (!is_null_or_unused(snap->slot)) {
+				reader_del(tree, snap);
+			}
+			snap->slot = val;
+			reader_add(tree, snap);
+		}
+	}
+}
+
+/*
+ * Initialize a hazptr context.
+ *
+ * Must be called before using the context for hazptr allocation.
+ */
+void init_hazptr_context(struct hazptr_context *hzcp)
+{
+	struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
+
+	for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+		hzcp->slots[i] = (hazptr_t)HAZPTR_UNUSED;
+		hzcp->snaps[i].slot = (hazptr_t)HAZPTR_UNUSED;
+	}
+
+	guard(spinlock_irqsave)(&pcpu->ctx_lock);
+	list_add(&hzcp->list, &pcpu->ctx_list);
+	hzcp->lock = &pcpu->ctx_lock;
+}
+
+struct hazptr_struct {
+	struct work_struct work;
+	bool scheduled;
+
+	// List of callbacks; we move per-CPU queued callbacks into the global
+	// queued list in the workqueue function.
+	spinlock_t callback_lock;
+	struct callback_head *queued;
+	struct callback_head **tail;
+
+	struct hazptr_reader_tree readers;
+};
+
+static struct hazptr_struct hazptr_struct;
+
+/*
+ * Clean up hazptr context.
+ *
+ * Must be called before freeing the context. This function also removes any
+ * reference held by the hazard pointer slots in the context, even if
+ * hazptr_clear() or hazptr_free() was not called previously.
+ */
+void cleanup_hazptr_context(struct hazptr_context *hzcp)
+{
+	if (hzcp->lock) {
+		scoped_guard(spinlock_irqsave, hzcp->lock) {
+			list_del(&hzcp->list);
+			hzcp->lock = NULL;
+		}
+
+		for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+			struct hazptr_slot_snap *snap = &hzcp->snaps[i];
+
+			if (!is_null_or_unused(snap->slot))
+				reader_del(&hazptr_struct.readers, snap);
+		}
+	}
+}
+
+/*
+ * Allocate a hazptr slot from a hazptr_context.
+ *
+ * Return: NULL if allocation fails, otherwise the address of the allocated
+ * slot.
+ */
+hazptr_t *hazptr_alloc(struct hazptr_context *hzcp)
+{
+	unsigned long unused;
+
+	for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+		if (((unsigned long)READ_ONCE(hzcp->slots[i])) == HAZPTR_UNUSED) {
+			unused = HAZPTR_UNUSED;
+
+			/*
+			 * rwm-sequence is relied on here.
+			 *
+			 * This is needed since in case of a previous reader:
+			 *
+			 * <reader 1>		<reader 2>		<updater>
+			 * [access protected pointers]
+			 * hazptr_free():
+			 *   smp_store_release(); // hzptr == UNUSED
+			 *			hazptr_alloc():
+			 *			  try_cmpxchg_relaxed(); // hzptr == NULL
+			 *
+			 *						// in reader scan
+			 *						smp_load_acquire(); // is null
+			 *						[run callbacks]
+			 *
+			 * Because of the rwm-sequence of release operations,
+			 * when callbacks are run, accesses from reader 1 must
+			 * be already observed by the updater.
+			 */
+			if (try_cmpxchg_relaxed(&hzcp->slots[i], &unused, NULL)) {
+				return (hazptr_t *)&hzcp->slots[i];
+			}
+		}
+	}
+
+	return NULL;
+}
+
+/* Free a hazptr slot. */
+void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzptr)
+{
+	WARN_ON(((unsigned long)*hzptr) == HAZPTR_UNUSED);
+
+	/* Pairs with smp_load_acquire() in reader scan */
+	smp_store_release(hzptr, (void *)HAZPTR_UNUSED);
+}
+
+/* Scan all possible readers, and update the global reader tree. */
+static void check_readers(struct hazptr_struct *hzst)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+		struct hazptr_context *ctx;
+
+		guard(spinlock_irqsave)(&pcpu->ctx_lock);
+		list_for_each_entry(ctx, &pcpu->ctx_list, list)
+			hazptr_context_snap_readers_locked(&hzst->readers, ctx);
+	}
+}
+
+/*
+ * Start the background work for hazptr callback handling if not started.
+ *
+ * Must be called with hazptr_struct.callback_lock held.
+ */
+static void kick_hazptr_work(void)
+{
+	if (hazptr_struct.scheduled)
+		return;
+
+	queue_work(system_wq, &hazptr_struct.work);
+	hazptr_struct.scheduled = true;
+}
+
+/*
+ * Check which callbacks are ready to be called.
+ *
+ * Return: a list of callbacks whose corresponding objects are no longer
+ * referenced by any reader.
+ */
+static struct callback_head *do_hazptr(struct hazptr_struct *hzst)
+{
+	struct callback_head *tmp, **curr;
+	struct callback_head *todo = NULL, **todo_tail = &todo;
+
+	// Currently, the lock is unnecessary, but maybe we will have multiple
+	// work_structs sharing the same callback list in the future.
+	guard(spinlock_irqsave)(&hzst->callback_lock);
+
+	curr = &hzst->queued;
+
+	while ((tmp = *curr)) {
+		// No reader, we can free the object.
+		if (!reader_exist(&hzst->readers, (unsigned long)tmp)) {
+			// Add tmp into todo.
+			*todo_tail = tmp;
+			todo_tail = &tmp->next;
+
+			// Delete tmp from ->queued and move to the next entry.
+			*curr = tmp->next;
+		} else
+			curr = &tmp->next;
+	}
+
+	hzst->tail = curr;
+
+	// Keep checking if callback list is not empty.
+	if (hzst->queued)
+		kick_hazptr_work();
+
+	*todo_tail = NULL;
+
+	return todo;
+}
+
+static void hazptr_work_func(struct work_struct *work)
+{
+	struct hazptr_struct *hzst = container_of(work, struct hazptr_struct, work);
+	struct callback_head *todo;
+
+	// Advance callbacks from hzpcpu to hzst
+	scoped_guard(spinlock_irqsave, &hzst->callback_lock) {
+		int cpu;
+
+		hzst->scheduled = false;
+		for_each_possible_cpu(cpu) {
+			struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+
+			guard(spinlock)(&pcpu->callback_lock);
+
+			if (pcpu->queued) {
+				*(hzst->tail) = pcpu->queued;
+				hzst->tail = pcpu->tail;
+				pcpu->queued = NULL;
+				pcpu->tail = &pcpu->queued;
+			}
+		}
+	}
+
+	// Pairs with the smp_mb() on the reader side. This guarantees that if
+	// the hazptr work picked up the callback from an updater and the
+	// updater set the global pointer to NULL before enqueuing the callback,
+	// the hazptr work must observe a reader that protects the hazptr before
+	// the updater.
+	//
+	// <reader>		<updater>		<hazptr work>
+	// ptr = READ_ONCE(gp);
+	// WRITE_ONCE(*hazptr, ptr);
+	// smp_mb(); // => ->strong-fence
+	//			tofree = gp;
+	// ptr = READ_ONCE(gp); // re-read, gp is not NULL
+	//			// => ->fre
+	//			WRITE_ONCE(gp, NULL);
+	//			call_hazptr(gp, ...):
+	//			  lock(->callback_lock);
+	//			  [queued the callback]
+	//			  unlock(->callback_lock);// => ->po-unlock-lock-po
+	//						lock(->callback_lock);
+	//						[move from hzpcpu to hzst]
+	//
+	//						smp_mb(); => ->strong-fence
+	//						ptr = READ_ONCE(*hazptr);
+	//						// ptr == gp, otherwise => ->fre
+	//
+	// If ptr != gp, it means there exists a circle with the following
+	// memory ordering relationships:
+	//	->strong-fence ->fre ->po-unlock-lock-po ->strong-fence ->fre
+	// => (due to the definition of prop)
+	//	->strong-fence ->prop ->strong-fence ->hb ->prop
+	// => (rotate the circle)
+	//	->prop ->strong-fence ->prop ->strong-fence ->hb
+	// => (due to the definition of pb)
+	//	->pb ->pb
+	// but pb is acyclic according to LKMM, so this cannot happen.
+	smp_mb();
+	check_readers(hzst);
+
+	todo = do_hazptr(hzst);
+
+	while (todo) {
+		struct callback_head *next = todo->next;
+		void (*func)(struct callback_head *) = todo->func;
+
+		func(todo);
+
+		todo = next;
+	}
+}
+
+/*
+ * Put @head into the cleanup callback queue.
+ *
+ * @func(@head) will be called after no one is referencing the object. Caller
+ * must ensure the object has been unpublished before calling this.
+ */
+void call_hazptr(struct callback_head *head, rcu_callback_t func)
+{
+	struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
+
+	head->func = func;
+	head->next = NULL;
+
+	scoped_guard(spinlock_irqsave, &pcpu->callback_lock) {
+		*(pcpu->tail) = head;
+		pcpu->tail = &head->next;
+	}
+
+	guard(spinlock_irqsave)(&hazptr_struct.callback_lock);
+	kick_hazptr_work();
+}
+
+static int init_hazptr_struct(void)
+{
+	int cpu;
+
+	INIT_WORK(&hazptr_struct.work, hazptr_work_func);
+
+	spin_lock_init(&hazptr_struct.callback_lock);
+	hazptr_struct.queued = NULL;
+	hazptr_struct.tail = &hazptr_struct.queued;
+
+	for_each_possible_cpu(cpu) {
+		struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+
+		spin_lock_init(&pcpu->ctx_lock);
+		INIT_LIST_HEAD(&pcpu->ctx_list);
+
+		spin_lock_init(&pcpu->callback_lock);
+		pcpu->queued = NULL;
+		pcpu->tail = &pcpu->queued;
+
+	}
+
+	init_hazptr_reader_tree(&hazptr_struct.readers);
+
+	return 0;
+}
+
+early_initcall(init_hazptr_struct);