diff mbox series

[RFC,1/4] hazptr: Add initial implementation of hazard pointers

Message ID 20240917143402.930114-2-boqun.feng@gmail.com (mailing list archive)
State New
Headers show
Series Add hazard pointers to kernel | expand

Commit Message

Boqun Feng Sept. 17, 2024, 2:33 p.m. UTC
Hazard pointers [1] provide a way to dynamically distribute refcounting
and can be used to improve the scalability of refcounting without
significant space cost.

Hazard pointers are similar to RCU: they build the synchronization
between two parties, readers and updaters. Readers are refcount users:
they acquire and release refcounts. Updaters clean up objects when there
are no readers referencing them (via call_hazptr()). The difference is
that instead of waiting for a grace period, hazard pointers can free an
object as soon as nothing references it. This means that for a
particular workload, hazard pointers may have a smaller memory footprint
due to fewer pending callbacks.

The synchronization between readers and updaters is built around "hazard
pointer slots": a slot readers can use to store a pointer value.

Reader side protection:

1.	Read the value of a pointer to the target data element.
2.	Store it to a hazard pointer slot.
3.	Enforce full ordering (e.g. smp_mb()).
4.	Re-read the original pointer, reset the slot and retry if the
	value changed.
5.	Otherwise, the continued existence of the target data element
	is guaranteed.

Updater side check:

1.	Unpublish the target data element (e.g. setting the pointer
	value to NULL).
2.	Enforce full ordering.
3.	Read the value from a hazard pointer slot.
4.	If the value doesn't match the target data element, then this
	slot doesn't represent a reference to it.
5.	Otherwise, the updater needs to re-check (step 3); see the
	sketch below.
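
Below is a minimal C sketch of the two sides above, assuming a single
global pointer 'gp' and a single slot 'slot' (illustrative only; the
in-kernel interface is hazptr_tryprotect()/call_hazptr() introduced by
this patch):

	static void *reader_protect(void **gp, void **slot)
	{
		void *ptr = READ_ONCE(*gp);		/* step 1 */

		if (!ptr)
			return NULL;
		WRITE_ONCE(*slot, ptr);			/* step 2 */
		smp_mb();				/* step 3 */
		if (READ_ONCE(*gp) != ptr) {		/* step 4: changed, bail */
			WRITE_ONCE(*slot, NULL);
			return NULL;
		}
		return ptr;				/* step 5: protected */
	}

	static void updater_wait(void **gp, void **slot, void *ptr)
	{
		WRITE_ONCE(*gp, NULL);			/* step 1: unpublish */
		smp_mb();				/* step 2 */
		while (READ_ONCE(*slot) == ptr)		/* steps 3-5: re-check */
			cpu_relax();
		/* no reader holds 'ptr' any more; safe to reclaim it */
	}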

To distribute the accesses to hazptr slots from different contexts,
hazptr_context is introduced. Users need to define or allocate their own
hazptr_context from which to allocate hazard pointer slots.

For the updater side to confirm there is no existing reference, it needs
to scan all the possible slots. To speed up this process, hazptr_context
also contains an rbtree node for each slot so that the updater can cache
the reader-scan result in an rbtree. The rbtree nodes are pre-allocated
in this way to avoid having to "allocate memory in order to free memory"
in extreme cases.
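
For illustration, a possible usage sketch (hypothetical names: 'struct
foo', 'foo_gp' and foo_release() are not part of this patch; the
hazptr_context is assumed to have been set up elsewhere with
init_hazptr_context() and torn down with cleanup_hazptr_context()):

	#include <linux/hazptr.h>
	#include <linux/slab.h>

	struct foo {
		int data;
		struct callback_head head;	/* storage for call_hazptr() */
	};

	static struct foo *foo_gp;		/* the protected global pointer */

	static void foo_release(struct callback_head *head)
	{
		kfree(container_of(head, struct foo, head));
	}

	/* Reader */
	static int foo_read(struct hazptr_context *ctx)
	{
		hazptr_t *hzp = hazptr_alloc(ctx);
		struct foo *p;
		int val = -1;

		if (!hzp)
			return -1;	/* all HAZPTR_SLOT_PER_CTX slots in use */

		p = hazptr_tryprotect(hzp, foo_gp, head);
		if (p) {
			val = p->data;	/* safe: the object is not freed while the slot holds it */
			hazptr_clear(hzp);
		}
		hazptr_free(ctx, hzp);
		return val;
	}

	/* Updater (assuming a single updater or external serialization) */
	static void foo_replace(struct foo *newp)
	{
		struct foo *old = foo_gp;

		WRITE_ONCE(foo_gp, newp);	/* unpublish the old object */
		if (old)
			call_hazptr(&old->head, foo_release);
	}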

[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
     lock-free objects," in IEEE Transactions on Parallel and
     Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004

Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/hazptr.h |  83 ++++++++
 kernel/Makefile        |   1 +
 kernel/hazptr.c        | 463 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 547 insertions(+)
 create mode 100644 include/linux/hazptr.h
 create mode 100644 kernel/hazptr.c

Comments

Mathieu Desnoyers Sept. 18, 2024, 8:27 a.m. UTC | #1
On 2024-09-17 16:33, Boqun Feng wrote:
[...]
> 
> The synchronization between readers and updaters is built around "hazard
> pointer slots": a slot readers can use to store a pointer value.
> 
> Reader side protection:
> 
> 1.	Read the value of a pointer to the target data element.
> 2.	Store it to a hazard pointer slot.
> 3.	Enforce full ordering (e.g. smp_mb()).
> 4.	Re-read the original pointer, reset the slot and retry if the
> 	value changed.
> 5.	Otherwise, the continued existence of the target data element
> 	is guaranteed.
> 
> Updater side check:
> 
> 1.	Unpublish the target data element (e.g. setting the pointer
> 	value to NULL).
> 2.	Enforce full ordering.
> 3.	Read the value from a hazard pointer slot.
> 4.	If the value doesn't match the target data element, then this
> 	slot doesn't represent a reference to it.
> 5.	Otherwise, updater needs to re-check (step 3).

Cool! I look forward to seeing where this is meant to be used. I would
expect it to be a useful tool to speed up reference counting of
things like the mm_struct and for TLB flush IPIs.

On a related note, with a userspace port in mind, the membarrier(2)
syscall can be useful to turn the smp_mb() in (3) from the reader
side into a simple compiler barrier, assuming (2) from the updater
is using membarrier. If IPIs are acceptable (or already required) for
some kernel use-cases, this means a similar asymmetric fence scheme
could be used to speed up readers.
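
For the userspace case, a rough sketch of that asymmetric scheme could
look like the following (illustrative only; assumes MEMBARRIER_CMD_GLOBAL
is available and the updater can tolerate its latency):

	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Reader side: replaces smp_mb() in step (3) with a compiler barrier. */
	static inline void reader_publish_fence(void)
	{
		__asm__ __volatile__("" ::: "memory");
	}

	/* Updater side: step (2) issues a memory barrier on all running threads. */
	static inline void updater_heavy_fence(void)
	{
		syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
	}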

Thanks,

Mathieu
Alan Huang Sept. 18, 2024, 3:17 p.m. UTC | #2
On Sept. 17, 2024, at 22:33, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> Hazard pointers [1] provide a way to dynamically distribute refcounting
> and can be used to improve the scalability of refcounting without
> significant space cost.
> 
> Hazard pointers are similar to RCU: they build the synchronization
> between two part, readers and updaters. Readers are refcount users, they
> acquire and release refcount. Updaters cleanup objects when there are no
> reader referencing them (via call_hazptr()). The difference is instead
> of waiting for a grace period, hazard pointers can free an object as
> long as there is no one referencing the object. This means for a
> particular workload, hazard pointers may have smaller memory footprint
> due to fewer pending callbacks.
> 
> The synchronization between readers and updaters is built around "hazard
> pointer slots": a slot readers can use to store a pointer value.
> 
> Reader side protection:
> 
> 1. Read the value of a pointer to the target data element.
> 2. Store it to a hazard pointer slot.
> 3. Enforce full ordering (e.g. smp_mb()).
> 4. Re-read the original pointer, reset the slot and retry if the
> value changed.
> 5. Otherwise, the continued existence of the target data element
> is guaranteed.
> 
> Updater side check:
> 
> 1. Unpublish the target data element (e.g. setting the pointer
> value to NULL).
> 2. Enforce full ordering.
> 3. Read the value from a hazard pointer slot.
> 4. If the value doesn't match the target data element, then this
> slot doesn't represent a reference to it.
> 5. Otherwise, updater needs to re-check (step 3).
> 
> To distribute the accesses of hazptr slots from different contexts,
> hazptr_context is introduced. Users need to define/allocate their own
> hazptr_context to allocate hazard pointer slots.
> 
> For the updater side to confirm no existing reference, it needs to scan
> all the possible slots, and to speed up this process, hazptr_context
> also contains an rbtree node for each slot so that updater can cache the
> reader-scan result in an rbtree. The rbtree nodes are pre-allocated in
> this way to prevent "allocate memory to free memory" in extreme cases.
> 
> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
>     lock-free objects," in IEEE Transactions on Parallel and
>     Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> 
> Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Co-developed-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> include/linux/hazptr.h |  83 ++++++++
> kernel/Makefile        |   1 +
> kernel/hazptr.c        | 463 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 547 insertions(+)
> create mode 100644 include/linux/hazptr.h
> create mode 100644 kernel/hazptr.c
> 
> diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
> new file mode 100644
> index 000000000000..4548ca8c75eb
> --- /dev/null
> +++ b/include/linux/hazptr.h
> @@ -0,0 +1,83 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_HAZPTR_H
> +#define _LINUX_HAZPTR_H
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rbtree.h>
> +
> +/* Hazard pointer slot. */
> +typedef void* hazptr_t;
> +
> +#define HAZPTR_SLOT_PER_CTX 8
> +
> +struct hazptr_slot_snap {
> + struct rb_node node;
> + hazptr_t slot;
> +};
> +
> +/*
> + * A set of hazard pointer slots for a context.
> + *
> + * The context can be a task, CPU or reader (depends on the use case). It may
> + * be allocated statically or dynamically. It can only be used after calling
> + * init_hazptr_context(), and users are also responsible for calling
> + * cleanup_hazptr_context() when it's not used any more.
> + */
> +struct hazptr_context {
> + // The lock of the percpu context lists.
> + spinlock_t *lock;
> +
> + struct list_head list;
> + struct hazptr_slot_snap snaps[HAZPTR_SLOT_PER_CTX];
> + ____cacheline_aligned hazptr_t slots[HAZPTR_SLOT_PER_CTX];
> +};
> +
> +void init_hazptr_context(struct hazptr_context *hzcp);
> +void cleanup_hazptr_context(struct hazptr_context *hzcp);
> +hazptr_t *hazptr_alloc(struct hazptr_context *hzcp);
> +void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzp);
> +
> +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
> +#define hazptr_protect(hzp, gp, field) ({ \
> + typeof(gp) ___p; \
> + \
> + ___p = hazptr_tryprotect(hzp, gp, field); \
> + BUG_ON(!___p); \

hazptr_tryprotect might return NULL, do you need a loop here?

> + ___p; \
> +})
> +
> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
> + void *const *p,
> + unsigned long head_offset)
> +{
> + void *ptr;
> + struct callback_head *head;
> +
> + ptr = READ_ONCE(*p);
> +
> + if (ptr == NULL)
> + return NULL;
> +
> + head = (struct callback_head *)(ptr + head_offset);
> +
> + WRITE_ONCE(*hzp, head);
> + smp_mb();
> +
> + ptr = READ_ONCE(*p); // read again
> +
> + if (ptr + head_offset != head) { // pointer changed
> + WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
> + return NULL;
> + } else
> + return ptr;
> +}
> +
> +static inline void hazptr_clear(hazptr_t *hzp)
> +{
> + /* Pairs with smp_load_acquire() in reader scan. */
> + smp_store_release(hzp, NULL);
> +}
> +
> +void call_hazptr(struct callback_head *head, rcu_callback_t func);
> +#endif
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3c13240dfc9f..7927264b9870 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -50,6 +50,7 @@ obj-y += rcu/
> obj-y += livepatch/
> obj-y += dma/
> obj-y += entry/
> +obj-y += hazptr.o
> obj-$(CONFIG_MODULES) += module/
> 
> obj-$(CONFIG_KCMP) += kcmp.o
> diff --git a/kernel/hazptr.c b/kernel/hazptr.c
> new file mode 100644
> index 000000000000..f22ccc2a4a62
> --- /dev/null
> +++ b/kernel/hazptr.c
> @@ -0,0 +1,463 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/spinlock.h>
> +#include <linux/cleanup.h>
> +#include <linux/hazptr.h>
> +#include <linux/percpu.h>
> +#include <linux/workqueue.h>
> +
> +#define HAZPTR_UNUSED (1ul)
> +
> +/* Per-CPU data for hazard pointer management. */
> +struct hazptr_percpu {
> + spinlock_t ctx_lock; /* hazptr context list lock */
> + struct list_head ctx_list; /* hazptr context list */
> + spinlock_t callback_lock; /* Per-CPU callback queue lock */
> + struct callback_head *queued; /* Per-CPU callback queue */
> + struct callback_head **tail;
> +};
> +
> +/*
> + * Per-CPU data contains context lists and callbacks, which are maintained in a
> + * per-CPU manner to reduce potential contention. This means a global scan (for
> + * readers or callbacks) has to take each CPU's lock, but it's less problematic
> + * because that's a slowpath.
> + */
> +DEFINE_PER_CPU(struct hazptr_percpu, hzpcpu);
> +
> +/* An rbtree that stores the reader-scan result of all hazptr slots */
> +struct hazptr_reader_tree {
> + spinlock_t lock;
> + struct rb_root root;
> +};
> +
> +static void init_hazptr_reader_tree(struct hazptr_reader_tree *tree)
> +{
> + spin_lock_init(&tree->lock);
> + tree->root = RB_ROOT;
> +}
> +
> +static bool is_null_or_unused(hazptr_t slot)
> +{
> + return slot == NULL || ((unsigned long)slot) == HAZPTR_UNUSED;
> +}
> +
> +static int hazptr_node_cmp(const void *key, const struct rb_node *curr)
> +{
> + unsigned long slot = (unsigned long)key;
> + struct hazptr_slot_snap *curr_snap = container_of(curr, struct hazptr_slot_snap, node);
> + unsigned long curr_slot = (unsigned long)(curr_snap->slot);
> +
> + if (slot < curr_slot)
> + return -1;
> + else if (slot > curr_slot)
> + return 1;
> + else
> + return 0;
> +}
> +
> +static bool hazptr_node_less(struct rb_node *new, const struct rb_node *curr)
> +{
> + struct hazptr_slot_snap *new_snap = container_of(new, struct hazptr_slot_snap, node);
> +
> + return hazptr_node_cmp((void *)new_snap->slot, curr) < 0;
> +}
> +
> +/* Add the snapshot into a search tree. tree->lock must be held. */
> +static inline void reader_add_locked(struct hazptr_reader_tree *tree,
> +     struct hazptr_slot_snap *snap)
> +{
> + lockdep_assert_held(&tree->lock);
> + BUG_ON(is_null_or_unused(snap->slot));
> +
> + rb_add(&snap->node, &tree->root, hazptr_node_less);
> +}
> +
> +static inline void reader_add(struct hazptr_reader_tree *tree,
> +      struct hazptr_slot_snap *snap)
> +{
> + guard(spinlock_irqsave)(&tree->lock);
> +
> + reader_add_locked(tree, snap);
> +}
> +
> +/* Delete the snapshot from a search tree. tree->lock must be held. */
> +static inline void reader_del_locked(struct hazptr_reader_tree *tree,
> +     struct hazptr_slot_snap *snap)
> +{
> + lockdep_assert_held(&tree->lock);
> +
> + rb_erase(&snap->node, &tree->root);
> +}
> +
> +static inline void reader_del(struct hazptr_reader_tree *tree,
> +      struct hazptr_slot_snap *snap)
> +{
> + guard(spinlock_irqsave)(&tree->lock);
> +
> + reader_del_locked(tree, snap);
> +}
> +
> +/* Find whether a reader exists. tree->lock must be held. */
> +static inline bool reader_exist_locked(struct hazptr_reader_tree *tree,
> +       unsigned long slot)
> +{
> + lockdep_assert_held(&tree->lock);
> +
> + return !!rb_find((void *)slot, &tree->root, hazptr_node_cmp);
> +}
> +
> +static inline bool reader_exist(struct hazptr_reader_tree *tree,
> + unsigned long slot)
> +{
> + guard(spinlock_irqsave)(&tree->lock);
> +
> + return reader_exist_locked(tree, slot);
> +}
> +
> +/*
> + * Scan the readers from one hazptr context and update the global readers tree.
> + *
> + * Must be called with hzcp->lock held.
> + */
> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
> +       struct hazptr_context *hzcp)
> +{
> + lockdep_assert_held(hzcp->lock);
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + /*
> + * Pairs with smp_store_release() in hazptr_{clear,free}().
> + *
> + * Ensure
> + *
> + * <reader> <updater>
> + *
> + * [access protected pointers]
> + * hazptr_clear();
> + *   smp_store_release()
> + * // in reader scan.
> + * smp_load_acquire(); // is null or unused.
> + * [run callbacks] // all accesses from
> + * // reader must be
> + * // observed.
> + */
> + hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> +
> + if (!is_null_or_unused(val)) {
> + struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> +
> + // Already in the tree, need to remove first.
> + if (!is_null_or_unused(snap->slot)) {
> + reader_del(tree, snap);
> + }
> + snap->slot = val;
> + reader_add(tree, snap);
> + }
> + }
> +}
> +
> +/*
> + * Initialize a hazptr context.
> + *
> + * Must be called before using the context for hazptr allocation.
> + */
> +void init_hazptr_context(struct hazptr_context *hzcp)
> +{
> + struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + hzcp->slots[i] = (hazptr_t)HAZPTR_UNUSED;
> + hzcp->snaps[i].slot = (hazptr_t)HAZPTR_UNUSED;
> + }
> +
> + guard(spinlock_irqsave)(&pcpu->ctx_lock);
> + list_add(&hzcp->list, &pcpu->ctx_list);
> + hzcp->lock = &pcpu->ctx_lock;
> +}
> +
> +struct hazptr_struct {
> + struct work_struct work;
> + bool scheduled;
> +
> + // list of callbacks, we move percpu queued callbacks into the global
> + // queued list in workqueue function.
> + spinlock_t callback_lock;
> + struct callback_head *queued;
> + struct callback_head **tail;
> +
> + struct hazptr_reader_tree readers;
> +};
> +
> +static struct hazptr_struct hazptr_struct;
> +
> +/*
> + * Clean up hazptr context.
> + *
> + * Must be called before freeing the context. This function also removes any
> + * reference held by the hazard pointer slots in the context, even if
> + * hazptr_clear() or hazptr_free() was not called previously.
> + */
> +void cleanup_hazptr_context(struct hazptr_context *hzcp)
> +{
> + if (hzcp->lock) {
> + scoped_guard(spinlock_irqsave, hzcp->lock) {
> + list_del(&hzcp->list);
> + hzcp->lock = NULL;
> + }
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> +
> + if (!is_null_or_unused(snap->slot))
> + reader_del(&hazptr_struct.readers, snap);
> + }
> + }
> +}
> +
> +/*
> + * Allocate a hazptr slot from a hazptr_context.
> + *
> + * Return: NULL means fail to allocate, otherwise the address of the allocated
> + * slot.
> + */
> +hazptr_t *hazptr_alloc(struct hazptr_context *hzcp)
> +{
> + unsigned long unused;
> +
> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> + if (((unsigned long)READ_ONCE(hzcp->slots[i])) == HAZPTR_UNUSED) {
> + unused = HAZPTR_UNUSED;
> +
> + /*
> + * rmw-sequence is relied on here.
> + *
> + * This is needed since in case of a previous reader:
> + *
> + * <reader 1> <reader 2> <updater>
> + * [access protected pointers]
> + * hazptr_free():
> + *   smp_store_release(); // hzptr == UNUSED
> + * hazptr_alloc():
> + *  try_cmpxchg_relaxed(); // hzptr == NULL
> + *
> + * // in reader scan
> + * smp_load_acquire(); // is null
> + * [run callbacks]
> + *
> + * Because of the rmw-sequence of release operations,
> + * when callbacks are run, accesses from reader 1 must
> + * be already observed by the updater.
> + */
> + if (try_cmpxchg_relaxed(&hzcp->slots[i], &unused, NULL)) {
> + return (hazptr_t *)&hzcp->slots[i];
> + }
> + }
> + }
> +
> + return NULL;
> +}
> +
> +/* Free a hazptr slot. */
> +void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzptr)
> +{
> + WARN_ON(((unsigned long)*hzptr) == HAZPTR_UNUSED);
> +
> + /* Pairs with smp_load_acquire() in reader scan */
> + smp_store_release(hzptr, (void *)HAZPTR_UNUSED);
> +}
> +
> +/* Scan all possible readers, and update the global reader tree. */
> +static void check_readers(struct hazptr_struct *hzst)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
> + struct hazptr_context *ctx;
> +
> + guard(spinlock_irqsave)(&pcpu->ctx_lock);
> + list_for_each_entry(ctx, &pcpu->ctx_list, list)
> + hazptr_context_snap_readers_locked(&hzst->readers, ctx);
> + }
> +}
> +
> +/*
> + * Start the background work for hazptr callback handling if not started.
> + *
> + * Must be called with hazptr_struct lock held.
> + */
> +static void kick_hazptr_work(void)
> +{
> + if (hazptr_struct.scheduled)
> + return;
> +
> + queue_work(system_wq, &hazptr_struct.work);
> + hazptr_struct.scheduled = true;
> +}
> +
> +/*
> + * Check which callbacks are ready to be called.
> + *
> + * Return: a callback list that no reader is referencing the corresponding
> + * objects.
> + */
> +static struct callback_head *do_hazptr(struct hazptr_struct *hzst)
> +{
> + struct callback_head *tmp, **curr;
> + struct callback_head *todo = NULL, **todo_tail = &todo;
> +
> + // Currently, the lock is unnecessary, but maybe we will have multiple
> + // work_structs sharing the same callback list in the future;
> + guard(spinlock_irqsave)(&hzst->callback_lock);
> +
> + curr = &hzst->queued;
> +
> + while ((tmp = *curr)) {
> + // No reader, we can free the object.
> + if (!reader_exist(&hzst->readers, (unsigned long)tmp)) {
> + // Add tmp into todo.
> + *todo_tail = tmp;
> + todo_tail = &tmp->next;
> +
> + // Delete tmp from ->queued and move to the next entry.
> + *curr = tmp->next;
> + } else
> + curr = &tmp->next;
> + }
> +
> + hzst->tail = curr;
> +
> + // Keep checking if callback list is not empty.
> + if (hzst->queued)
> + kick_hazptr_work();
> +
> + *todo_tail = NULL;
> +
> + return todo;
> +}
> +
> +static void hazptr_work_func(struct work_struct *work)
> +{
> + struct hazptr_struct *hzst = container_of(work, struct hazptr_struct, work);
> + struct callback_head *todo;
> +
> + // Advance callbacks from hzpcpu to hzst
> + scoped_guard(spinlock_irqsave, &hzst->callback_lock) {
> + int cpu;
> +
> + hzst->scheduled = false;
> + for_each_possible_cpu(cpu) {
> + struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
> +
> + guard(spinlock)(&pcpu->callback_lock);
> +
> + if (pcpu->queued) {
> + *(hzst->tail) = pcpu->queued;
> + hzst->tail = pcpu->tail;
> + pcpu->queued = NULL;
> + pcpu->tail = &pcpu->queued;
> + }
> + }
> + }
> +
> + // Pairs with the smp_mb() on the reader side. This guarantees that if
> + // the hazptr work picked up the callback from an updater and the
> + // updater set the global pointer to NULL before enqueueing the callback,
> + // the hazptr work must observe a reader that protects the hazptr before
> + // the updater.
> + //
> + // <reader> <updater> <hazptr work>
> + // ptr = READ_ONCE(gp);
> + // WRITE_ONCE(*hazptr, ptr);
> + // smp_mb(); // => ->strong-fence
> + // tofree = gp;
> + // ptr = READ_ONCE(gp); // re-read, gp is not NULL
> + // // => ->fre
> + // WRITE_ONCE(gp, NULL);
> + // call_hazptr(gp, ...):
> + //  lock(->callback_lock);
> + //  [queued the callback]
> + //  unlock(->callback_lock);// => ->po-unlock-lock-po
> + // lock(->callback_lock);
> + // [move from hzpcpu to hzst]
> + //
> + // smp_mb(); => ->strong-fence
> + // ptr = READ_ONCE(*hazptr);
> + // // ptr == gp, otherwise => ->fre
> + //
> + // If ptr != gp, it means there exists a circle with the following
> + // memory ordering relationships:
> + // ->strong-fence ->fre ->po-unlock-lock-po ->strong-fence ->fre
> + // => (due to the definition of prop)
> + // ->strong-fence ->prop ->strong-fence ->hb ->prop
> + // => (rotate the circle)
> + // ->prop ->strong-fence ->prop ->strong-fence ->hb
> + // => (due to the definition of pb)
> + // ->pb ->pb
> + // but pb is acyclic according to LKMM, so this cannot happen.
> + smp_mb();
> + check_readers(hzst);
> +
> + todo = do_hazptr(hzst);
> +
> + while (todo) {
> + struct callback_head *next = todo->next;
> + void (*func)(struct callback_head *) = todo->func;
> +
> + func(todo);
> +
> + todo = next;
> + }
> +}
> +
> +/*
> + * Put @head into the cleanup callback queue.
> + *
> + * @func(@head) will be called after no one is referencing the object. Caller
> + * must ensure the object has been unpublished before calling this.
> + */
> +void call_hazptr(struct callback_head *head, rcu_callback_t func)
> +{
> + struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
> +
> + head->func = func;
> + head->next = NULL;
> +
> + scoped_guard(spinlock_irqsave, &pcpu->callback_lock) {
> + *(pcpu->tail) = head;
> + pcpu->tail = &head->next;
> + }
> +
> + guard(spinlock_irqsave)(&hazptr_struct.callback_lock);
> + kick_hazptr_work();
> +}
> +
> +static int init_hazptr_struct(void)
> +{
> + int cpu;
> +
> + INIT_WORK(&hazptr_struct.work, hazptr_work_func);
> +
> + spin_lock_init(&hazptr_struct.callback_lock);
> + hazptr_struct.queued = NULL;
> + hazptr_struct.tail = &hazptr_struct.queued;
> +
> + for_each_possible_cpu(cpu) {
> + struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
> +
> + spin_lock_init(&pcpu->ctx_lock);
> + INIT_LIST_HEAD(&pcpu->ctx_list);
> +
> + spin_lock_init(&pcpu->callback_lock);
> + pcpu->queued = NULL;
> + pcpu->tail = &pcpu->queued;
> +
> + }
> +
> + init_hazptr_reader_tree(&hazptr_struct.readers);
> +
> + return 0;
> +}
> +
> +early_initcall(init_hazptr_struct);
> -- 
> 2.45.2
> 
>
Jann Horn Sept. 19, 2024, 12:12 a.m. UTC | #3
On Tue, Sep 17, 2024 at 4:33 PM Boqun Feng <boqun.feng@gmail.com> wrote:
> Hazard pointers [1] provide a way to dynamically distribute refcounting
> and can be used to improve the scalability of refcounting without
> significant space cost.

> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
> +                                       void *const *p,
> +                                       unsigned long head_offset)
> +{
> +       void *ptr;
> +       struct callback_head *head;
> +
> +       ptr = READ_ONCE(*p);
> +
> +       if (ptr == NULL)
> +               return NULL;
> +
> +       head = (struct callback_head *)(ptr + head_offset);
> +
> +       WRITE_ONCE(*hzp, head);
> +       smp_mb();
> +
> +       ptr = READ_ONCE(*p); // read again
> +
> +       if (ptr + head_offset != head) { // pointer changed
> +               WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
> +               return NULL;
> +       } else
> +               return ptr;
> +}

I got nerdsniped by the Plumbers talk. So, about that smp_mb()...

I think you should be able to avoid the smp_mb() using relaxed atomics
(on architectures that have those), at the cost of something like a
cmpxchg-acquire sandwiched between a load-acquire and a relaxed load?
I'm not sure how their cost compares to an smp_mb() though.

Something like this, *assuming there can only be one context at a time
waiting for a given hazptr_t*:


typedef struct {
  /* consists of: marker bit in least significant bit, rest is normal pointer */
  atomic_long_t value;
} hazptr_t;

/* assumes that hzp is currently set to NULL (but it may contain a marker bit) */
static inline void *__hazptr_tryprotect(hazptr_t *hzp, void *const *p) {
  /* note that the loads of these three operations are ordered using
     acquire semantics */
  void *ptr = smp_load_acquire(p);
  /* set pointer while leaving marker bit intact */
  unsigned long hazard_scanning =
      atomic_long_fetch_or_acquire((unsigned long)ptr, &hzp->value);
  if (unlikely(hazard_scanning)) {
    BUG_ON(hazard_scanning != 1);
    /* slowpath, concurrent hazard pointer waiter */
    smp_mb();
  }
  if (READ_ONCE(*p) != ptr) { /* recheck failed, pointer changed */
    atomic_long_and(~1UL, &hzp->value);
    return NULL;
  }
  return ptr;
}

/* simplified for illustration, assumes there's only a single hazard
   pointer @hzp that could point to @ptr */
static void remove_pointer_and_wait_for_hazard(hazptr_t *hzp, void *ptr,
                                               void *const *p) {
  WRITE_ONCE(*p, NULL);
  smp_mb();
  /* set marker bit */
  atomic_long_or(1UL, &hzp->value);
  while ((void *)(atomic_long_read(&hzp->value) & ~1UL) == ptr)
    wait();
  /* clear marker bit */
  atomic_long_and(~1UL, &hzp->value);
}


The idea would be that the possible orderings when these two functions
race against each other are:

Ordering A: The atomic_long_fetch_or_acquire() in
__hazptr_tryprotect() happens after the atomic_long_or(), two
subcases:
Ordering A1 (slowpath): atomic_long_fetch_or_acquire() is ordered
before the atomic_long_and() in remove_pointer_and_wait_for_hazard(),
so the marker bit is observed set, "hazard_scanning" is true. We go on
the safe slowpath which is like the original patch, so it's safe.
Ordering A2 (recheck fails): atomic_long_fetch_or_acquire() is ordered
after the atomic_long_and() in remove_pointer_and_wait_for_hazard(),
so the subsequent READ_ONCE(*p) is also ordered after the
atomic_long_and(), which is ordered after the WRITE_ONCE(*p, NULL), so
the READ_ONCE(*p) recheck must see a NULL pointer and fail.
Ordering B (hazard pointer visible): The
atomic_long_fetch_or_acquire() in __hazptr_tryprotect() happens before
the atomic_long_or(). In that case, it also happens before the
atomic_long_read() in remove_pointer_and_wait_for_hazard(), so the
hazard pointer will be visible to
remove_pointer_and_wait_for_hazard().

But this seems pretty gnarly/complicated, so even if my 2AM reasoning
ability is correct, actually implementing this might not be a good
idea... and it definitely wouldn't help on X86 at all, since X86
doesn't have such nice relaxed RMW ops.
Lai Jiangshan Sept. 19, 2024, 6:39 a.m. UTC | #4
On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:

> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
> +                                              struct hazptr_context *hzcp)
> +{
> +       lockdep_assert_held(hzcp->lock);
> +
> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> +               /*
> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
> +                *
> +                * Ensure
> +                *
> +                * <reader>             <updater>
> +                *
> +                * [access protected pointers]
> +                * hazptr_clear();
> +                *   smp_store_release()
> +                *                      // in reader scan.
> +                *                      smp_load_acquire(); // is null or unused.
> +                *                      [run callbacks] // all accesses from
> +                *                                      // reader must be
> +                *                                      // observed.
> +                */
> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> +
> +               if (!is_null_or_unused(val)) {
> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> +
> +                       // Already in the tree, need to remove first.
> +                       if (!is_null_or_unused(snap->slot)) {
> +                               reader_del(tree, snap);
> +                       }
> +                       snap->slot = val;
> +                       reader_add(tree, snap);
> +               }
> +       }
> +}

Hello

I'm curious about whether there are any possible memory leaks here.

It seems that call_hazptr() never frees the memory until the slot is
set to another valid value.

In the code here, the snap is not deleted when hzcp->slots[i] is
null/unused but snap->slot is not, which I think it should be.

And it can cause unneeded deletion and addition of the snap if the slot
value is unchanged.

I'm not so sure...

Thanks
Lai
Boqun Feng Sept. 19, 2024, 6:56 a.m. UTC | #5
On Wed, Sep 18, 2024 at 11:17:37PM +0800, Alan Huang wrote:
[...]
> > +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
> > +#define hazptr_protect(hzp, gp, field) ({ \
> > + typeof(gp) ___p; \
> > + \
> > + ___p = hazptr_tryprotect(hzp, gp, field); \
> > + BUG_ON(!___p); \
> 
> hazptr_tryprotect might return NULL, do you need a loop here?
> 

Thanks for the review. It's me who didn't do a good job here on the
documentation. hazptr_protect() is supposed to be used for the case
where readers know the gp won't change.
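
For illustration (hypothetical 'struct obj' with a callback_head field
named 'head', and a global pointer 'gp'; none of these names are from
the patch), the two variants would be used roughly like:

	/* gp is known to stay non-NULL and unchanged for this reader: */
	struct obj *p = hazptr_protect(hzp, gp, head);

	/* gp may change (or become NULL) concurrently: */
	struct obj *q;

	do {
		q = hazptr_tryprotect(hzp, gp, head);
	} while (!q && READ_ONCE(gp));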

Regards,
Boqun

> > + ___p; \
> > +})
> > +
Boqun Feng Sept. 19, 2024, 7:10 a.m. UTC | #6
On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> > +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
> > +                                              struct hazptr_context *hzcp)
> > +{
> > +       lockdep_assert_held(hzcp->lock);
> > +
> > +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
> > +               /*
> > +                * Pairs with smp_store_release() in hazptr_{clear,free}().
> > +                *
> > +                * Ensure
> > +                *
> > +                * <reader>             <updater>
> > +                *
> > +                * [access protected pointers]
> > +                * hazptr_clear();
> > +                *   smp_store_release()
> > +                *                      // in reader scan.
> > +                *                      smp_load_acquire(); // is null or unused.
> > +                *                      [run callbacks] // all accesses from
> > +                *                                      // reader must be
> > +                *                                      // observed.
> > +                */
> > +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> > +
> > +               if (!is_null_or_unused(val)) {
> > +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> > +
> > +                       // Already in the tree, need to remove first.
> > +                       if (!is_null_or_unused(snap->slot)) {
> > +                               reader_del(tree, snap);
> > +                       }
> > +                       snap->slot = val;
> > +                       reader_add(tree, snap);
> > +               }
> > +       }
> > +}
> 
> Hello
> 
> I'm curious about whether there are any possible memory leaks here.
> 
> It seems that call_hazptr() never frees the memory until the slot is
> set to another valid value.
> 
> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
> and snap->slot is not which I think it should be.
> 
> And it can cause unneeded deletion and addition of the snap if the slot
> value is unchanged.
> 

I think you're right. (Although the node will eventually be deleted at
cleanup_hazptr_context(), there could be a long-lived hazptr_context.)
It should be:

		hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
		struct hazptr_slot_snap *snap = &hzcp->snaps[i];

		if (val != snap->slot) { // val changed, need to update the tree node.
			// Already in the tree, need to remove first.
			if (!is_null_or_unused(snap->slot)) {
				reader_del(tree, snap);
			}

			// use the latest snapshot.
			snap->slot = val;

			// Add it into tree if there is a reader
			if (!is_null_or_unused(val))
				reader_add(tree, snap);
		}

Regards,
Boqun

> I'm not so sure...
> 
> Thanks
> Lai
Alan Huang Sept. 19, 2024, 12:33 p.m. UTC | #7
On Sept. 19, 2024, at 15:10, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
>> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> 
>>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
>>> +                                              struct hazptr_context *hzcp)
>>> +{
>>> +       lockdep_assert_held(hzcp->lock);
>>> +
>>> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
>>> +               /*
>>> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
>>> +                *
>>> +                * Ensure
>>> +                *
>>> +                * <reader>             <updater>
>>> +                *
>>> +                * [access protected pointers]
>>> +                * hazptr_clear();
>>> +                *   smp_store_release()
>>> +                *                      // in reader scan.
>>> +                *                      smp_load_acquire(); // is null or unused.
>>> +                *                      [run callbacks] // all accesses from
>>> +                *                                      // reader must be
>>> +                *                                      // observed.
>>> +                */
>>> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> +
>>> +               if (!is_null_or_unused(val)) {
>>> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> +
>>> +                       // Already in the tree, need to remove first.
>>> +                       if (!is_null_or_unused(snap->slot)) {
>>> +                               reader_del(tree, snap);
>>> +                       }
>>> +                       snap->slot = val;
>>> +                       reader_add(tree, snap);
>>> +               }
>>> +       }
>>> +}
>> 
>> Hello
>> 
>> I'm curious about whether there are any possible memory leaks here.
>> 
>> It seems that call_hazptr() never frees the memory until the slot is
>> set to another valid value.
>> 
>> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
>> and snap->slot is not which I think it should be.
>> 
>> And it can cause unneeded deletion and addition of the snap if the slot
>> value is unchanged.
>> 
> 
> I think you're right. (Although the node will be eventually deleted at
> cleanup_hazptr_context(), however there could be a long-live
> hazptr_context). It should be:
> 
> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> 
> if (val != snap->slot) { // val changed, need to update the tree node.
> // Already in the tree, need to remove first.
> if (!is_null_or_unused(snap->slot)) {
> reader_del(tree, snap);
> }
> 
> // use the latest snapshot.
> snap->slot = val;
> 
> // Add it into tree if there is a reader
> if (!is_null_or_unused(val))
> reader_add(tree, snap);
> }

With this changed, and if we force users to call hazptr_clear() like
rcu_read_unlock(), could we remove the reader_del() in
cleanup_hazptr_context(), and then remove the tree->lock?

> 
> Regards,
> Boqun
> 
>> I'm not so sure...
>> 
>> Thanks
>> Lai
Alan Huang Sept. 19, 2024, 1:57 p.m. UTC | #8
On Sept. 19, 2024, at 15:10, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
>> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> 
>>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
>>> +                                              struct hazptr_context *hzcp)
>>> +{
>>> +       lockdep_assert_held(hzcp->lock);
>>> +
>>> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
>>> +               /*
>>> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
>>> +                *
>>> +                * Ensure
>>> +                *
>>> +                * <reader>             <updater>
>>> +                *
>>> +                * [access protected pointers]
>>> +                * hazptr_clear();
>>> +                *   smp_store_release()
>>> +                *                      // in reader scan.
>>> +                *                      smp_load_acquire(); // is null or unused.
>>> +                *                      [run callbacks] // all accesses from
>>> +                *                                      // reader must be
>>> +                *                                      // observed.
>>> +                */
>>> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> +
>>> +               if (!is_null_or_unused(val)) {
>>> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> +
>>> +                       // Already in the tree, need to remove first.
>>> +                       if (!is_null_or_unused(snap->slot)) {
>>> +                               reader_del(tree, snap);
>>> +                       }
>>> +                       snap->slot = val;
>>> +                       reader_add(tree, snap);
>>> +               }
>>> +       }
>>> +}
>> 
>> Hello
>> 
>> I'm curious about whether there are any possible memory leaks here.
>> 
>> It seems that call_hazptr() never frees the memory until the slot is
>> set to another valid value.
>> 
>> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
>> and snap->slot is not which I think it should be.
>> 
>> And it can cause unneeded deletion and addition of the snap if the slot
>> value is unchanged.
>> 
> 
> I think you're right. (Although the node will be eventually deleted at
> cleanup_hazptr_context(), however there could be a long-live
> hazptr_context). It should be:
> 
> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> 
> if (val != snap->slot) { // val changed, need to update the tree node.
> // Already in the tree, need to remove first.
> if (!is_null_or_unused(snap->slot)) {
> reader_del(tree, snap);
> }
> 
> // use the latest snapshot.
> snap->slot = val;
> 
> // Add it into tree if there is a reader
> if (!is_null_or_unused(val))
> reader_add(tree, snap);
> }

It seems that two different hazptr_contexts can't be used to protect the same pointer?

Otherwise the following can happen?

thread1                        thread2                        thread3 (worker)        thread4
hazptr_tryprotect(hzp1, ptr1)  hazptr_tryprotect(hzp2, ptr1)
                                                              add ptr1 to tree
hazptr_clear(hzp1)
hazptr_tryprotect(hzp1, ptr2)
                                                              delete ptr1 from tree   unpub ptr1
                                                                                      call_hazptr(ptr1)
                                                                                      oops: invoke ptr1's callback
Or am I missing something?

> 
> Regards,
> Boqun
> 
>> I'm not so sure...
>> 
>> Thanks
>> Lai
Jonas Oberhauser Sept. 19, 2024, 2 p.m. UTC | #9
Am 9/17/2024 um 4:33 PM schrieb Boqun Feng:
> +#define hazptr_protect(hzp, gp, field) ({				\
> +	typeof(gp) ___p;						\
> +									\
> +	___p = hazptr_tryprotect(hzp, gp, field);			\
> +	BUG_ON(!___p);							\
> +	___p;								\
> +})


Hi Boqun, why crash instead of retry?

jonas
Alan Huang Sept. 19, 2024, 4:10 p.m. UTC | #10
On Sept. 19, 2024, at 15:10, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote:
>> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> 
>>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
>>> +                                              struct hazptr_context *hzcp)
>>> +{
>>> +       lockdep_assert_held(hzcp->lock);
>>> +
>>> +       for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
>>> +               /*
>>> +                * Pairs with smp_store_release() in hazptr_{clear,free}().
>>> +                *
>>> +                * Ensure
>>> +                *
>>> +                * <reader>             <updater>
>>> +                *
>>> +                * [access protected pointers]
>>> +                * hazptr_clear();
>>> +                *   smp_store_release()
>>> +                *                      // in reader scan.
>>> +                *                      smp_load_acquire(); // is null or unused.
>>> +                *                      [run callbacks] // all accesses from
>>> +                *                                      // reader must be
>>> +                *                                      // observed.
>>> +                */
>>> +               hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> +
>>> +               if (!is_null_or_unused(val)) {
>>> +                       struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> +
>>> +                       // Already in the tree, need to remove first.
>>> +                       if (!is_null_or_unused(snap->slot)) {
>>> +                               reader_del(tree, snap);
>>> +                       }
>>> +                       snap->slot = val;
>>> +                       reader_add(tree, snap);
>>> +               }
>>> +       }
>>> +}
>> 
>> Hello
>> 
>> I'm curious about whether there are any possible memory leaks here.
>> 
>> It seems that call_hazptr() never frees the memory until the slot is
>> set to another valid value.
>> 
>> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused
>> and snap->slot is not which I think it should be.
>> 
>> And it can cause unneeded deletion and addition of the snap if the slot
>> value is unchanged.
>> 
> 
> I think you're right. (Although the node will be eventually deleted at
> cleanup_hazptr_context(), however there could be a long-live
> hazptr_context). It should be:
> 
> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> 
> if (val != snap->slot) { // val changed, need to update the tree node.
> // Already in the tree, need to remove first.
> if (!is_null_or_unused(snap->slot)) {
> reader_del(tree, snap);
> }
> 
> // use the latest snapshot.
> snap->slot = val;
> 
> // Add it into tree if there is a reader
> if (!is_null_or_unused(val))
> reader_add(tree, snap);
> }

Even using the same context, if two slots are used to protect the same
pointer, call it ptr1, then if the second slot is reused for ptr2, ptr1's
callback will be invoked even though the first slot still holds ptr1.

The original patch also has this problem.

> 
> Regards,
> Boqun
> 
>> I'm not so sure...
>> 
>> Thanks
>> Lai
Jonas Oberhauser Sept. 19, 2024, 6:07 p.m. UTC | #11
Am 9/19/2024 um 8:56 AM schrieb Boqun Feng:
> On Wed, Sep 18, 2024 at 11:17:37PM +0800, Alan Huang wrote:
> [...]
>>> +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
>>> +#define hazptr_protect(hzp, gp, field) ({ \
>>> + typeof(gp) ___p; \
>>> + \
>>> + ___p = hazptr_tryprotect(hzp, gp, field); \
>>> + BUG_ON(!___p); \
>>
>> hazptr_tryprotect might return NULL, do you need a loop here?
>>
> 
> Thanks for the review. It's me who didn't do a good job here on the
> documentation. hazptr_protect() is supposed to use for the case where
> readers know the gp won't change.
> 
> Regards,
> Boqun
> 
>>> + ___p; \
>>> +})
>>> +

Oh, disregard my other e-mail, I hadn't seen this discussion.

Do you have any specific use case of this in mind? If you know that the 
pointer can't change, I would assume you can also just read the pointer 
normally and assign to the hazard pointer without a fence, no?

have fun, jonas
Boqun Feng Sept. 19, 2024, 6:58 p.m. UTC | #12
On Thu, Sep 19, 2024 at 09:57:12PM +0800, Alan Huang wrote:
[...]
> > 
> > I think you're right. (Although the node will be eventually deleted at
> > cleanup_hazptr_context(), however there could be a long-live
> > hazptr_context). It should be:
> > 
> > hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
> > struct hazptr_slot_snap *snap = &hzcp->snaps[i];
> > 
> > if (val != snap->slot) { // val changed, need to update the tree node.
> > // Already in the tree, need to remove first.
> > if (!is_null_or_unused(snap->slot)) {
> > reader_del(tree, snap);
> > }
> > 
> > // use the latest snapshot.
> > snap->slot = val;
> > 
> > // Add it into tree if there is a reader
> > if (!is_null_or_unused(val))
> > reader_add(tree, snap);
> > }
> 
> It seems like that two different hazptr_context can’t be used to protect the same pointer?
> 
> Otherwise the following can happen?
> 
> thread1 					thread2  					 thread3(worker) 	      thread4
> hazptr_tryprotect(hzp1, ptr1)   hazptr_tryprotect(hzp2, ptr1) 
> 												 add ptr1 to tree

Note that we have a snapshot rb_node for each hazard pointer slot, so
thread3 would actually add two rb_nodes with ->slot == ptr1 here.

> hazptr_clear(hzp1) 
> hazptr_tryprotect(hzp1, ptr2) 
> 												 delete ptr1 from tree     unpub ptr1

Therefore, there is still one rb_node with ->slot == ptr1 in the tree
after the deletion, so updaters won't invoke ptr1's callback.

Regards,
Boqun

> 																       call_hazptr(ptr1)
> 																       oops: invoke ptr1's callback
> Or am I missing something?
> 
> > 
> > Regards,
> > Boqun
> > 
> >> I'm not so sure...
> >> 
> >> Thanks
> >> Lai
> 
>
Alan Huang Sept. 19, 2024, 7:53 p.m. UTC | #13
On Sept. 20, 2024, at 02:58, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Thu, Sep 19, 2024 at 09:57:12PM +0800, Alan Huang wrote:
> [...]
>>> 
>>> I think you're right. (Although the node will be eventually deleted at
>>> cleanup_hazptr_context(), however there could be a long-live
>>> hazptr_context). It should be:
>>> 
>>> hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
>>> struct hazptr_slot_snap *snap = &hzcp->snaps[i];
>>> 
>>> if (val != snap->slot) { // val changed, need to update the tree node.
>>> // Already in the tree, need to remove first.
>>> if (!is_null_or_unused(snap->slot)) {
>>> reader_del(tree, snap);
>>> }
>>> 
>>> // use the latest snapshot.
>>> snap->slot = val;
>>> 
>>> // Add it into tree if there is a reader
>>> if (!is_null_or_unused(val))
>>> reader_add(tree, snap);
>>> }
>> 
>> It seems like that two different hazptr_context can’t be used to protect the same pointer?
>> 
>> Otherwise the following can happen?
>> 
>> thread1  thread2    thread3(worker)        thread4
>> hazptr_tryprotect(hzp1, ptr1)   hazptr_tryprotect(hzp2, ptr1) 
>>  add ptr1 to tree
> 
> Note that we have snapshot rb_node for each hazard pointer slot, so here
> thread3 actually would add two rb_nodes with ->slot == ptr1 here.

Ok, good to know the rbtree can have multiple nodes with the same key.

Thanks for the explanation!

> 
>> hazptr_clear(hzp1) 
>> hazptr_tryprotect(hzp1, ptr2) 
>>  delete ptr1 from tree     unpub ptr1
> 
> Therefore, there is still one rb_node with ->slot == ptr1 in the tree
> after the deletion, so updaters won't invoke ptr1's callback.
> 
> Regards,
> Boqun
> 
>>        call_hazptr(ptr1)
>>        oops: invoke ptr1's callback
>> Or am I missing something?
>> 
>>> 
>>> Regards,
>>> Boqun
>>> 
>>>> I'm not so sure...
>>>> 
>>>> Thanks
>>>> Lai
Jonas Oberhauser Sept. 19, 2024, 8:30 p.m. UTC | #14
Am 9/19/2024 um 2:12 AM schrieb Jann Horn:
> On Tue, Sep 17, 2024 at 4:33 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> Hazard pointers [1] provide a way to dynamically distribute refcounting
>> and can be used to improve the scalability of refcounting without
>> significant space cost.
> 
>> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
>> +                                       void *const *p,
>> +                                       unsigned long head_offset)
>> +{
>> +       void *ptr;
>> +       struct callback_head *head;
>> +
>> +       ptr = READ_ONCE(*p);
>> +
>> +       if (ptr == NULL)
>> +               return NULL;
>> +
>> +       head = (struct callback_head *)(ptr + head_offset);
>> +
>> +       WRITE_ONCE(*hzp, head);
>> +       smp_mb();
>> +
>> +       ptr = READ_ONCE(*p); // read again
>> +
>> +       if (ptr + head_offset != head) { // pointer changed
>> +               WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
>> +               return NULL;
>> +       } else
>> +               return ptr;
>> +}
> 
> I got nerdsniped by the Plumbers talk. So, about that smp_mb()...
> 
> I think you should be able to avoid the smp_mb() using relaxed atomics
> (on architectures that have those), at the cost of something like a
> cmpxchg-acquire sandwiched between a load-acquire and a relaxed load?
> I'm not sure how their cost compares to an smp_mb() though.



We have done a similar scheme before, and on some architectures (not 
x86) the RMW is slightly cheaper than the mb.

Your reasoning is a bit simplified because it seems to assume a stronger 
concept of ordering than LKMM has, but even with LKMM's ordering your 
code seems fine.

I feel it can even be simplified a little; the hazard bit does not seem
necessary.

Assuming atomic operations for everything racy, relaxed unless stated 
otherwise:

(R)eader:

    old = read p // I don't think this needs acq, because of address dependencies (*)
    haz ||=_acq old
    if (read p != old) retry;
    *old

(W)riter:

    p =_??? ... // assuming we don't set it to null this needs a rel
    --- mb ---
    haz ||= 0
    while (read_acq haz == old) retry;
    delete old


In order to get a use-after-free, both of the  R:read p  need to read 
before  W:p = ... , so because of the W:mb, they execute before  W:haz||=0 .

Also, for the use-after-free,  W:read_acq haz  needs to read before (the 
write part of)  R:haz||=_acq old .


Then the  W:haz ||= 0  also needs to read before (the write part of) 
R:haz||=_acq old  because of coherence on the same location.

Since both of them are atomic RMW, the  W:haz||= 0  also needs to write 
before (the write part of)  R:haz||=_acq old , and in the absence of 
non-RMW writes (so assuming no thread will just store to the hazard 
pointer), this implies that the latter reads-from an rmw-sequence-later 
store than  W:haz||=0 , which therefore executes before  R:haz||=_acq old .

But because of the acquire barrier,  R:haz||=_acq old  executes before 
the second  R:read p .


This gives a cycle
    (2nd)R:read p  ->xb  W:haz||=0  ->xb  R:haz||=_acq  ->xb  (2nd)R:read p

and therefore is forbidden.

Note that in this argument, the two  R:read p  are not necessarily 
reading from the same store. Because of ABA, it can happen that they 
read from distinct stores and see the same value.

It does require clearing hazard pointers with an RMW like atomic_and(0) 
(**). The combination of the two RMW (for setting & clearing the hazard 
pointer) might in total be slower again than an mb.


(I took the liberty to add an acquire barrier in the writer's while loop 
for the case where we read from the (not shown) release store of the 
reader that clears the hazard pointer. It's arguable whether that acq is 
needed since one could argue that a delete is a kind of store, in which 
case control dependency would handle it.)

have fun,
   jonas


(*
   you talk about sandwiching, and the data dependency does guarantee 
some weaker form of sandwiching than the acq, but I don't think 
sandwiching is required at all. If you happened to be able to set the 
hazard pointer before reading the pointer, that would just extend the 
protected area, wouldn't it?
)

(**
I think if you clear the pointer with a store, the hazard bit does not 
help. You could be overwriting the hazard bit, and the RMWs that set the 
hazard bit might never propagate to your CPU.

Also in your code you probably meant to clear the whole hazard pointer 
in the retry code of the reader, not just the hazard bit.)
Jonas Oberhauser Sept. 20, 2024, 7:41 a.m. UTC | #15
Am 9/17/2024 um 4:33 PM schrieb Boqun Feng:
> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
> +					void *const *p,
> +					unsigned long head_offset)
> +{
> +	void *ptr;
> +	struct callback_head *head;
> +
> +	ptr = READ_ONCE(*p);
> +
> +	if (ptr == NULL)
> +		return NULL;
> +
> +	head = (struct callback_head *)(ptr + head_offset);
> +
> +	WRITE_ONCE(*hzp, head);
> +	smp_mb();
> +
> +	ptr = READ_ONCE(*p); // read again
> +
> +	if (ptr + head_offset != head) { // pointer changed
> +		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
> +		return NULL;
> +	} else
> +		return ptr;
> +}

There is a subtle potential for ABA issues here.

If the compiler replaces 'return ptr;' with 'return head - 
head_offset;', then you do not have an address dependency from the 
second read.

In this case, in ABA, the first read can read from a stale store, then 
the second read reads the same value from a newer store but only 
establishes control-dependency based synchronization with that store; 
any reads from *ptr could be speculatively executed before doing the 
second ptr = READ_ONCE(*p).

Therefore you could read the object state before it is properly 
reinitialized by the second store.

I'm not sure what the most efficient fix is or if you just want to 
gamble that "the compiler will never do that".
I guess either READ_ONCE(ptr) or a compiler barrier before return ptr 
might do it?

Have fun,
    jonas
Jonas Oberhauser Sept. 20, 2024, 7:43 a.m. UTC | #16
Am 9/19/2024 um 10:30 PM schrieb Jonas Oberhauser:
> 
> 
> 
> Am 9/19/2024 um 2:12 AM schrieb Jann Horn:
>> On Tue, Sep 17, 2024 at 4:33 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>>> Hazard pointers [1] provide a way to dynamically distribute refcounting
>>> and can be used to improve the scalability of refcounting without
>>> significant space cost.
>>
>>> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
>>> +                                       void *const *p,
>>> +                                       unsigned long head_offset)
>>> +{
>>> +       void *ptr;
>>> +       struct callback_head *head;
>>> +
>>> +       ptr = READ_ONCE(*p);
>>> +
>>> +       if (ptr == NULL)
>>> +               return NULL;
>>> +
>>> +       head = (struct callback_head *)(ptr + head_offset);
>>> +
>>> +       WRITE_ONCE(*hzp, head);
>>> +       smp_mb();
>>> +
>>> +       ptr = READ_ONCE(*p); // read again
>>> +
>>> +       if (ptr + head_offset != head) { // pointer changed
>>> +               WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
>>> +               return NULL;
>>> +       } else
>>> +               return ptr;
>>> +}
>>
>> I got nerdsniped by the Plumbers talk. So, about that smp_mb()...
>>
>> I think you should be able to avoid the smp_mb() using relaxed atomics
>> (on architectures that have those), at the cost of something like a
>> cmpxchg-acquire sandwiched between a load-acquire and a relaxed load?
>> I'm not sure how their cost compares to an smp_mb() though.
> 
> 
> 
> We have done a similar scheme before, and on some architectures (not 
> x86) the RMW is slightly cheaper than the mb.
> 
> Your reasoning is a bit simplified because it seems to assume a stronger 
> concept of ordering than LKMM has, but even with LKMM's ordering your 
> code seems fine.
> 
> I feel it can even be simplified a little, the hazard bit does not seem 
> necessary.
> 
> Assuming atomic operations for everything racy, relaxed unless stated 
> otherwise:
> 
> (R)eader:
> 
>     old = read p // I don't think this needs acq, because of address 
> dependencies (*)
>     haz ||=_acq old
>     if (read p != old) retry;

I realized before going to bed that I copied a subtle mistake here from
Jann's code: we need an address dependency from this read, or it is not
ABA safe.
This can be done with the small detour that Boqun used:

      head = read p      // I don't think this needs acq, because of address dependencies (*)
      haz ||=_acq head
      old = read p       // again
      if (head != old) retry;
      barrier();         // ensure compiler does not use its knowledge that head == old to do *head instead!
      *old               // definitely use the second read pointer so hardware can't speculatively dereference this before the second read!


Have a good time,
   jonas
Boqun Feng Sept. 25, 2024, 10:02 a.m. UTC | #17
Hi Jonas,

On Fri, Sep 20, 2024 at 09:41:15AM +0200, Jonas Oberhauser wrote:
> 
> 
> Am 9/17/2024 um 4:33 PM schrieb Boqun Feng:
> > +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
> > +					void *const *p,
> > +					unsigned long head_offset)
> > +{
> > +	void *ptr;
> > +	struct callback_head *head;
> > +
> > +	ptr = READ_ONCE(*p);
> > +
> > +	if (ptr == NULL)
> > +		return NULL;
> > +
> > +	head = (struct callback_head *)(ptr + head_offset);
> > +
> > +	WRITE_ONCE(*hzp, head);
> > +	smp_mb();
> > +
> > +	ptr = READ_ONCE(*p); // read again
> > +
> > +	if (ptr + head_offset != head) { // pointer changed
> > +		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
> > +		return NULL;
> > +	} else
> > +		return ptr;
> > +}
> 
> There is a subtle potential for ABA issues here.
> 
> If the compiler replaces 'return ptr;' with 'return head - head_offset;',
> then you do not have an address dependency from the second read.
> 
> In this case, in ABA, the first read can read from a stale store, then the
> second read reads the same value from a newer store but only establishes
> control-dependency based synchronization with that store; any reads from
> *ptr could be speculatively executed before doing the second ptr =
> READ_ONCE(*p).
> 
> Therefore you could read the object state before it is properly
> reinitialized by the second store.
> 

Thanks for taking a look, and nice find!

> I'm not sure what the most efficient fix is or if you just want to gamble
> that "the compiler will never do that".
> I guess either READ_ONCE(ptr) or a compiler barrier before return ptr might
> do it?
> 

I think the root cause of this is that the compiler can replace 'ptr' with
'head - head_offset' based on the pointer-value comparison. A fix would be
to convert the pointers to unsigned long and do the comparison on those:

	if ((unsigned long)ptr + head_offset != (unsigned long)head) {
		WRITE_ONCE(*hzp, NULL);
		return NULL;
	} else
		return ptr;

Because of the conversion, compilers lose the pointer-equality
information and therefore cannot replace 'ptr' with 'head - head_offset'. Of
course, if we are really worried about compilers being too "smart", we
can always do the comparison in asm code, then compilers don't know
anything of the equality between 'ptr' and 'head - head_offset'.
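
Put together with the function above, that would look roughly like
this (an untested sketch):

static inline void *__hazptr_tryprotect(hazptr_t *hzp,
					void *const *p,
					unsigned long head_offset)
{
	void *ptr;
	struct callback_head *head;

	ptr = READ_ONCE(*p);

	if (ptr == NULL)
		return NULL;

	head = (struct callback_head *)(ptr + head_offset);

	WRITE_ONCE(*hzp, head);
	smp_mb();

	ptr = READ_ONCE(*p); // read again

	// Compare as unsigned long, so the compiler cannot use the
	// equality to replace 'ptr' with 'head - head_offset' below.
	if ((unsigned long)ptr + head_offset != (unsigned long)head) {
		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
		return NULL;
	} else
		return ptr;
}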

Regards,
boqun

> Have fun,
>    jonas
>
Jonas Oberhauser Sept. 25, 2024, 10:11 a.m. UTC | #18
Am 9/25/2024 um 12:02 PM schrieb Boqun Feng:
> Hi Jonas,
>
> Of
> course, if we are really worried about compilers being too "smart"

Ah, I see you know me better and better...

> we can always do the comparison in asm code, then compilers don't know
> anything of the equality between 'ptr' and 'head - head_offset'.
Yes, but then a simple compiler barrier between the comparison and 
returning ptr would also do the trick, right? And maybe easier on the eyes.


Have fun,
    jonas
Boqun Feng Sept. 25, 2024, 10:45 a.m. UTC | #19
On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote:
> 
> 
> Am 9/25/2024 um 12:02 PM schrieb Boqun Feng:
> > Hi Jonas,
> > 
> > Of
> > course, if we are really worried about compilers being too "smart"
> 
> Ah, I see you know me better and better...
> 
> > we can always do the comparison in asm code, then compilers don't know
> > anything of the equality between 'ptr' and 'head - head_offset'.
> Yes, but then a simple compiler barrier between the comparison and returning
> ptr would also do the trick, right? And maybe easier on the eyes.
> 

The thing about putting a compiler barrier there is that it will prevent all
compiler reorderings, and some of those reorderings may contribute to
better codegen. (I know in this case we have a smp_mb(), but compilers
can still move unrelated code up to the second load for optimization
purposes.) An asm comparison is cheaper in this respect. But TBH, compilers
should provide a way to compare pointer values without using the result
as a pointer-equality proof; if "convert to unsigned long" doesn't work,
some other way should.

Regards,
Boqun

> 
> Have fun,
>    jonas
>
Mathieu Desnoyers Sept. 25, 2024, 11:59 a.m. UTC | #20
On 2024-09-25 12:45, Boqun Feng wrote:
> On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote:
>>
>>
>> Am 9/25/2024 um 12:02 PM schrieb Boqun Feng:
>>> Hi Jonas,
>>>
>>> Of
>>> course, if we are really worried about compilers being too "smart"
>>
>> Ah, I see you know me better and better...
>>
>>> we can always do the comparison in asm code, then compilers don't know
>>> anything of the equality between 'ptr' and 'head - head_offset'.
>> Yes, but then a simple compiler barrier between the comparison and returning
>> ptr would also do the trick, right? And maybe easier on the eyes.
>>
> 
> The thing about putting a compiler barrier is that it will prevent all
> compiler reorderings, and some of the reordering may contribute to
> better codegen. (I know in this case, we have a smp_mb(), but still
> compilers can move unrelated code upto the second load for optimization
> purpose). Asm comparison is cheaper in this way. But TBH, compilers
> should provide a way to compare pointer values without using the result
> for pointer equality proof, if "convert to unsigned long" doesn't work,
> some other ways should work.
> 

Based on Documentation/RCU/rcu_dereference.rst :

-       Be very careful about comparing pointers obtained from
         rcu_dereference() against non-NULL values.  As Linus Torvalds
         explained, if the two pointers are equal, the compiler could
         substitute the pointer you are comparing against for the pointer
         obtained from rcu_dereference().  For example::

                 p = rcu_dereference(gp);
                 if (p == &default_struct)
                         do_default(p->a);

         Because the compiler now knows that the value of "p" is exactly
         the address of the variable "default_struct", it is free to
         transform this code into the following::

                 p = rcu_dereference(gp);
                 if (p == &default_struct)
                         do_default(default_struct.a);

         On ARM and Power hardware, the load from "default_struct.a"
         can now be speculated, such that it might happen before the
         rcu_dereference().  This could result in bugs due to misordering.

So I am not only concerned about compiler proofs here, as it appears
that the speculation done by the CPU can also cause issues on some
architectures.

Thanks,

Mathieu

> Regards,
> Boqun
> 
>>
>> Have fun,
>>     jonas
>>
Boqun Feng Sept. 25, 2024, 12:16 p.m. UTC | #21
On Wed, Sep 25, 2024 at 01:59:06PM +0200, Mathieu Desnoyers wrote:
> On 2024-09-25 12:45, Boqun Feng wrote:
> > On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote:
> > > 
> > > 
> > > Am 9/25/2024 um 12:02 PM schrieb Boqun Feng:
> > > > Hi Jonas,
> > > > 
> > > > Of
> > > > course, if we are really worried about compilers being too "smart"
> > > 
> > > Ah, I see you know me better and better...
> > > 
> > > > we can always do the comparison in asm code, then compilers don't know
> > > > anything of the equality between 'ptr' and 'head - head_offset'.
> > > Yes, but then a simple compiler barrier between the comparison and returning
> > > ptr would also do the trick, right? And maybe easier on the eyes.
> > > 
> > 
> > The thing about putting a compiler barrier is that it will prevent all
> > compiler reorderings, and some of the reordering may contribute to
> > better codegen. (I know in this case, we have a smp_mb(), but still
> > compilers can move unrelated code upto the second load for optimization
> > purpose). Asm comparison is cheaper in this way. But TBH, compilers
> > should provide a way to compare pointer values without using the result
> > for pointer equality proof, if "convert to unsigned long" doesn't work,
> > some other ways should work.
> > 
> 
> Based on Documentation/RCU/rcu_dereference.rst :
> 
> -       Be very careful about comparing pointers obtained from
>         rcu_dereference() against non-NULL values.  As Linus Torvalds
>         explained, if the two pointers are equal, the compiler could
>         substitute the pointer you are comparing against for the pointer
>         obtained from rcu_dereference().  For example::
> 
>                 p = rcu_dereference(gp);
>                 if (p == &default_struct)
>                         do_default(p->a);
> 
>         Because the compiler now knows that the value of "p" is exactly
>         the address of the variable "default_struct", it is free to
>         transform this code into the following::
> 
>                 p = rcu_dereference(gp);
>                 if (p == &default_struct)
>                         do_default(default_struct.a);
> 
>         On ARM and Power hardware, the load from "default_struct.a"
>         can now be speculated, such that it might happen before the
>         rcu_dereference().  This could result in bugs due to misordering.
> 
> So I am not only concerned about compiler proofs here, as it appears
> that the speculation done by the CPU can also cause issues on some
> architectures.
> 

Note that the issue is caused by the compiler replacing one pointer with
the other; CPU speculation alone won't cause the issue, because all
architectures Linux supports honor address dependencies (except for
Alpha, but on Alpha READ_ONCE() has a smp_mb() after the load). That
is: if the compiler generates code as the code is written, there should
be no problem.

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> > Regards,
> > Boqun
> > 
> > > 
> > > Have fun,
> > >     jonas
> > > 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Jonas Oberhauser Sept. 25, 2024, 12:19 p.m. UTC | #22
Am 9/25/2024 um 1:59 PM schrieb Mathieu Desnoyers:
> On 2024-09-25 12:45, Boqun Feng wrote:
>> On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote:
>>>
>>>
>>> Am 9/25/2024 um 12:02 PM schrieb Boqun Feng:
>>>> Hi Jonas,
>>>>
>>>> Of
>>>> course, if we are really worried about compilers being too "smart"
>>>
>>> Ah, I see you know me better and better...
>>>
>>>> we can always do the comparison in asm code, then compilers don't know
>>>> anything of the equality between 'ptr' and 'head - head_offset'.
>>> Yes, but then a simple compiler barrier between the comparison and 
>>> returning
>>> ptr would also do the trick, right? And maybe easier on the eyes.
>>>
>>
>> The thing about putting a compiler barrier is that it will prevent all
>> compiler reorderings, and some of the reordering may contribute to
>> better codegen. (I know in this case, we have a smp_mb(), but still
>> compilers can move unrelated code upto the second load for optimization
>> purpose). Asm comparison is cheaper in this way. But TBH, compilers
>> should provide a way to compare pointer values without using the result
>> for pointer equality proof, if "convert to unsigned long" doesn't work,
>> some other ways should work.
>>
> 
> Based on Documentation/RCU/rcu_dereference.rst :
> 
> -       Be very careful about comparing pointers obtained from
>          rcu_dereference() against non-NULL values.  As Linus Torvalds
>          explained, if the two pointers are equal, the compiler could
>          substitute the pointer you are comparing against for the pointer
>          obtained from rcu_dereference().  For example::
> 
>                  p = rcu_dereference(gp);
>                  if (p == &default_struct)
>                          do_default(p->a);
> 
>          Because the compiler now knows that the value of "p" is exactly
>          the address of the variable "default_struct", it is free to
>          transform this code into the following::
> 
>                  p = rcu_dereference(gp);
>                  if (p == &default_struct)
>                          do_default(default_struct.a);
> 
>          On ARM and Power hardware, the load from "default_struct.a"
>          can now be speculated, such that it might happen before the
>          rcu_dereference().  This could result in bugs due to misordering.
> 
> So I am not only concerned about compiler proofs here, as it appears
> that the speculation done by the CPU can also cause issues on some
> architectures.

No, this is only possible in this example because of the compiler first 
doing some other optimizations (like what I mentioned on Boqun's 
original patch).

If you can ensure that the instruction sequence corresponds to more or less

t = load p // again
// on alpha: dep fence
...

*t

then you can be sure that there is an address dependency which orders 
the access. This is guaranteed by LKMM, or if you don't trust LKMM, also 
by Arm, Power, Alpha etc.

The extra dep fence on alpha is automatically inserted if you use 
READ_ONCE as boqun did (and I assumed your uatomic_load or whatever is 
doing the same thing, but I didn't check).

Given that the hazard-pointer-protected object presumably is not a 
single static non-freeable object, but some dynamically allocated 
object, it is pretty much impossible for the compiler to guess the 
address like in the example you shared above.

Note that inside the if, the code after the transformation is 
do_default(default_struct.a);
where the address being loaded from is known to the hardware before it loads from gp.

That would not be the case here (if the compiler optimization is ruled out).

jonas
Mathieu Desnoyers Sept. 25, 2024, 12:47 p.m. UTC | #23
On 2024-09-25 14:16, Boqun Feng wrote:
> On Wed, Sep 25, 2024 at 01:59:06PM +0200, Mathieu Desnoyers wrote:
>> On 2024-09-25 12:45, Boqun Feng wrote:
>>> On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote:
>>>>
>>>>
>>>> Am 9/25/2024 um 12:02 PM schrieb Boqun Feng:
>>>>> Hi Jonas,
>>>>>
>>>>> Of
>>>>> course, if we are really worried about compilers being too "smart"
>>>>
>>>> Ah, I see you know me better and better...
>>>>
>>>>> we can always do the comparison in asm code, then compilers don't know
>>>>> anything of the equality between 'ptr' and 'head - head_offset'.
>>>> Yes, but then a simple compiler barrier between the comparison and returning
>>>> ptr would also do the trick, right? And maybe easier on the eyes.
>>>>
>>>
>>> The thing about putting a compiler barrier is that it will prevent all
>>> compiler reorderings, and some of the reordering may contribute to
>>> better codegen. (I know in this case, we have a smp_mb(), but still
>>> compilers can move unrelated code upto the second load for optimization
>>> purpose). Asm comparison is cheaper in this way. But TBH, compilers
>>> should provide a way to compare pointer values without using the result
>>> for pointer equality proof, if "convert to unsigned long" doesn't work,
>>> some other ways should work.
>>>
>>
>> Based on Documentation/RCU/rcu_dereference.rst :
>>
>> -       Be very careful about comparing pointers obtained from
>>          rcu_dereference() against non-NULL values.  As Linus Torvalds
>>          explained, if the two pointers are equal, the compiler could
>>          substitute the pointer you are comparing against for the pointer
>>          obtained from rcu_dereference().  For example::
>>
>>                  p = rcu_dereference(gp);
>>                  if (p == &default_struct)
>>                          do_default(p->a);
>>
>>          Because the compiler now knows that the value of "p" is exactly
>>          the address of the variable "default_struct", it is free to
>>          transform this code into the following::
>>
>>                  p = rcu_dereference(gp);
>>                  if (p == &default_struct)
>>                          do_default(default_struct.a);
>>
>>          On ARM and Power hardware, the load from "default_struct.a"
>>          can now be speculated, such that it might happen before the
>>          rcu_dereference().  This could result in bugs due to misordering.
>>
>> So I am not only concerned about compiler proofs here, as it appears
>> that the speculation done by the CPU can also cause issues on some
>> architectures.
>>
> 
> Note that the issue is caused by compiler replacing one pointer with
> the other, CPU speculation won't cause the issue solely, because all
> architectures Linux supports honor address dependencies (except for
> Alpha, but on Alpha, READ_ONCE() has a smp_mb() after the load). That
> is: if the compiler generates code as the code is written, there should
> be no problem.

I am starting to wonder if it would not be more robust to just bite
the bullet and add the inline asm helpers to do the pointer comparison
outside of the compiler's knowledge for each architecture. This should
work fine in the short term, until compilers eventually give us a
"compare pointers without allowing the compiler to infer anything about
pointer equality".

Like so:

#include <stdio.h>

#define __str_1(x)  #x
#define __str(x)    __str_1(x)

/* x86-64 */
#define bne_ptr(_a, _b, _label) \
     asm goto ( \
         "cmpq %[a], %[b]\n\t" \
         "jne %l[" __str(_label) "]\n\t" \
         : : [a] "r" (_a), [b] "r" (_b) \
         : : _label)

int x;

int v[2];

int main(void)
{
     bne_ptr(v, v + 1, label_same);
     x = 1;
label_same:
     printf("%d\n", x);
     return 0;
}


> 
> Regards,
> Boqun
> 
>> Thanks,
>>
>> Mathieu
>>
>>> Regards,
>>> Boqun
>>>
>>>>
>>>> Have fun,
>>>>      jonas
>>>>
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
Mathieu Desnoyers Sept. 25, 2024, 1:10 p.m. UTC | #24
On 2024-09-25 14:47, Mathieu Desnoyers wrote:
[...]

> Like so:
> 
> #include <stdio.h>
> 
> #define __str_1(x)  #x
> #define __str(x)    __str_1(x)
> 
> /* x86-64 */
> #define bne_ptr(_a, _b, _label) \
>      asm goto ( \
>          "cmpq %[a], %[b]\n\t" \
>          "jne %l[" __str(_label) "]\n\t" \
>          : : [a] "r" (_a), [b] "r" (_b) \
>          : : _label)
> 
> int x;
> 
> int v[2];
> 
> int main(void)
> {
>      bne_ptr(v, v + 1, label_same);
>      x = 1;
> label_same:

Note that this label should probably be called "label_ne".
I flipped the macro logic without changing the labels.

Thanks,

Mathieu

>      printf("%d\n", x);
>      return 0;
> }
> 
> 
>>
>> Regards,
>> Boqun
>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>> Regards,
>>>> Boqun
>>>>
>>>>>
>>>>> Have fun,
>>>>>      jonas
>>>>>
>>>
>>> -- 
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
>
Mathieu Desnoyers Sept. 25, 2024, 1:20 p.m. UTC | #25
On 2024-09-25 15:10, Mathieu Desnoyers wrote:
[...]

Cleaner without goto in the user code:

#include <stdio.h>
#include <stdbool.h>

static inline
bool same_ptr(void *a, void *b)
{
     asm goto (
         "cmpq %[a], %[b]\n\t"
         "jne %l[ne]\n\t"
         : : [a] "r" (a), [b] "r" (b)
         : : ne);
     return true;
ne:
     return false;
}

int x;

int v[2];

int main(void)
{
     if (same_ptr(v, v + 1))
         x = 1;
     printf("%d\n", x);
     return 0;
}
Mathieu Desnoyers Sept. 26, 2024, 6:16 a.m. UTC | #26
On 2024-09-25 15:20, Mathieu Desnoyers wrote:
[...]
> static inline
> bool same_ptr(void *a, void *b)
> {
>      asm goto (
>          "cmpq %[a], %[b]\n\t"
>          "jne %l[ne]\n\t"
>          : : [a] "r" (a), [b] "r" (b)
>          : : ne);
>      return true;
> ne:
>      return false;
> }

Based on the information provided in this email thread,
it appears the only concern when it comes to comparing a
pointer loaded by rcu_dereference() with another pointer
is the possibility of compiler optimizations.

In the specific case of hazard pointer hpref_hp_get(), this
function does both loads which are then compared with one
another. Therefore, it is not possible for the user to
provide a comparison value known at compile-time, except
in the unlikely scenario where the hazard pointer would
be const, which does not really make much sense.

Therefore, just using rcu_dereference() for the second
load should be fine.

Thanks,

Mathieu
Jonas Oberhauser Sept. 26, 2024, 3:53 p.m. UTC | #27
Am 9/26/2024 um 8:16 AM schrieb Mathieu Desnoyers:
> On 2024-09-25 15:20, Mathieu Desnoyers wrote:
> [...]
>> static inline
>> bool same_ptr(void *a, void *b)
>> {
>>      asm goto (
>>          "cmpq %[a], %[b]\n\t"
>>          "jne %l[ne]\n\t"
>>          : : [a] "r" (a), [b] "r" (b)
>>          : : ne);
>>      return true;
>> ne:
>>      return false;
>> }
> 
> Based on the information provided in this email thread,
> it appears the only concern when it comes to comparing a
> pointer loaded by rcu_dereference() with another pointer
> is the possibility of compiler optimizations.
> 
> In the specific case of hazard pointer hpref_hp_get(), this
> function does both loads which are then compared with one
> another. Therefore, it is not possible for the user to
> provide a comparison value known at compile-time, except
> in the unlikely scenario where the hazard pointer would
> be const, which does not really make much sense.
> 
> Therefore, just using rcu_dereference() for the second
> load should be fine.
> 
> Thanks,
> 
> Mathieu
> 

No, the issue introduced by the compiler optimization (or by your 
original patch) is that the CPU can speculatively load from the first 
pointer as soon as it has completed the load of that pointer:


node = ...
// t's value can be computed here, since the value of node is known
...
node2 = ...
if (node == node2) // just a control dependency, doesn't prevent speculatively loading node
    t = *node

if we can force the compiler's hand, it will generate this code which 
provides the necessary ordering at hardware level:

node = ...
// t can't be computed here since the value of node2 is not known here
...
node2 = ...
// t can be computed here
if (node == node2)
    t = *node2


Kind regards,

   jonas
Linus Torvalds Sept. 26, 2024, 4:12 p.m. UTC | #28
On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser
<jonas.oberhauser@huaweicloud.com> wrote:
>
> No, the issue introduced by the compiler optimization (or by your
> original patch) is that the CPU can speculatively load from the first
> pointer as soon as it has completed the load of that pointer:

You mean the compiler can do it. The inline asm has no impact on what
the CPU does. The conditional isn't a barrier for the actual hardware.
But once the compiler doesn't try to do it, the data dependency on the
address does end up being an ordering constraint on the hardware too
(I'm happy to say that I haven't heard from the crazies that want
value prediction in a long time).

Just use a barrier.  Or make sure to use the proper ordered memory
accesses when possible. Don't use an inline asm for the compare - we
don't even have anything insane like that as a portable helper, and we
shouldn't have it.

                   Linus
Jonas Oberhauser Sept. 26, 2024, 4:40 p.m. UTC | #29
Am 9/26/2024 um 6:12 PM schrieb Linus Torvalds:
> On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser
> <jonas.oberhauser@huaweicloud.com> wrote:
>>
>> No, the issue introduced by the compiler optimization (or by your
>> original patch) is that the CPU can speculatively load from the first
>> pointer as soon as it has completed the load of that pointer:
> 
> You mean the compiler can do it.

What I mean is that if we only use rcu_dereference for the second load 
(and not either some form of compiler barrier or an acquire load), then 
the compiler can transform the second program from my previous e-mail 
(which if mapped 1:1 to hardware would be correct because hardware 
ensures the ordering based on the address dependency) into the first one 
(which is incorrect).

In particular, the compiler can change

  if (node == node2) t = *node2;

into

  if (node == node2) t = *node;

and then the CPU can speculatively read *node before knowing the value 
of node2.

The compiler can also speculatively read *node in this case, but that is 
not what I meant.

The code in Mathieu's original patch is already like the latter one and 
is broken even if the compiler does not do any optimizations.


> The inline asm has no impact on what
> the CPU does. The conditional isn't a barrier for the actual hardware.
> But once the compiler doesn't try to do it, the data dependency on the
> address does end up being an ordering constraint on the hardware too

Exactly. The inline asm would prevent the compiler from doing the 
transformation though, which would mean that the address dependency 
appears in the final compiler output.

> Just use a barrier.  Or make sure to use the proper ordered memory
> accesses when possible. 
 >
> Don't use an inline asm for the compare - we
> don't even have anything insane like that as a portable helper, and we
> shouldn't have it.

I'm glad you say that :))

I would also just use a barrier before returning the pointer.

Boqun seems to be unhappy with a barrier though, because it would 
theoretically also forbid unrelated optimizations.
But I have not seen any evidence that there are any unrelated 
optimizations going on in the first place that would be forbidden by this.

Have fun,
   jonas
Linus Torvalds Sept. 26, 2024, 4:54 p.m. UTC | #30
On Thu, 26 Sept 2024 at 09:40, Jonas Oberhauser
<jonas.oberhauser@huaweicloud.com> wrote:
>
> Boqun seems to be unhappy with a barrier though, because it would
> theoretically also forbid unrelated optimizations.

Well, doing a "barrier()" is kind of a big hammer thing, but honestly,
I don't think we've ever seen any real situation where it makes a
noticeable difference. Yes, it can pessimize compiler output more than
strictly necessary, but the kind of code generation issues it causes
tends to be the non-problematic kind (and particularly the kind that
even a trivial OoO core will deal with well).

We do have some more directed compiler barriers available, and this
code might be able to use OPTIMIZER_HIDE_VAR() for example. It's kind
of a "single variable value barrier".

Honestly, we don't use it much. It just tends to be _too_specific. But
it is there if somebody wants to use it.

> But I have not seen any evidence that there are any unrelated
> optimizations going on in the first place that would be forbidden by this.

Compared to something like "smp_mb()", which is not just a compiler
barrier but also generates typically very expensive instructions that
completely mess with an OoO core, a regular compiler barrier is a
complete non-issue. When you have those two close to each other, you'd
have to make up some very odd situation where the plain "barrier()" is
even noticeable.

               Linus
Boqun Feng Sept. 27, 2024, 12:01 a.m. UTC | #31
On Thu, Sep 26, 2024 at 09:54:33AM -0700, Linus Torvalds wrote:
> On Thu, 26 Sept 2024 at 09:40, Jonas Oberhauser
> <jonas.oberhauser@huaweicloud.com> wrote:
> >
> > Boqun seems to be unhappy with a barrier though, because it would
> > theoretically also forbid unrelated optimizations.
> 
> Well, doing a "barrier()" is kind of a big hammer thing, but honestly,
> I don't think we've ever seen any real situation where it makes a
> noticeable difference. Yes, it can pessimize compiler output more than
> strictly necessary, but the kind of code generation issues it causes
> tends to be the non-problematic kind (and particularly the kind that
> even a trivial OoO core will deal with well).
> 
> We do have some more directed compiler barriers available, and this
> code might be able to use OPTIMIZER_HIDE_VAR() for example. It's kind
> of a "single variable value barrier".
> 

Hmm.. this seems like it can do the trick? 

	#define ADDRESS_EQ(var, expr)							\
	({										\
		bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr);	\
											\
		OPTIMIZER_HIDE_VAR(var);						\
		_____cmp_res;								\
	})

i.e. compare the address and hide the equality information immediately,
so in hazptr code:

	ptr = READ_ONCE(*p);	// first read

	if (ptr == NULL)
		return NULL;

	head = (struct callback_head *)(ptr + head_offset);

	WRITE_ONCE(*hzp, head);
	smp_mb();

	ptr = READ_ONCE(*p); // read again

	if (!ADDRESS_EQ(ptr, (void *)head - head_offset)) { // pointer changed
		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
		return NULL;
	} else {
		// Optimizer lost the information on the value of 'ptr',
		// so it cannot replace it with head - head_offset.
		return ptr;
	}

Regards,
Boqun

> Honestly, we don't use it much. It just tends to be _too_specific. But
> it is there if somebody wants to use it.
> 
> > But I have not seen any evidence that there are any unrelated
> > optimizations going on in the first place that would be forbidden by this.
> 
> Compared to something like "smp_mb()", which is not just a compiler
> barrier but also generates typically very expensive instructions that
> completely mess with an OoO core, a regular compiler barrier is a
> complete non-issue. When you have those two close to each other, you'd
> have to make up some very odd situation where the plain "barrier()" is
> even noticeable.
> 
>                Linus
>
Mathieu Desnoyers Sept. 27, 2024, 1:20 a.m. UTC | #32
On 2024-09-26 18:12, Linus Torvalds wrote:
> On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser
> <jonas.oberhauser@huaweicloud.com> wrote:
>>
>> No, the issue introduced by the compiler optimization (or by your
>> original patch) is that the CPU can speculatively load from the first
>> pointer as soon as it has completed the load of that pointer:
> 
> You mean the compiler can do it. The inline asm has no impact on what
> the CPU does. The conditional isn't a barrier for the actual hardware.
> But once the compiler doesn't try to do it, the data dependency on the
> address does end up being an ordering constraint on the hardware too
> (I'm happy to say that I haven't heard from the crazies that want
> value prediction in a long time).
> 
> Just use a barrier.  Or make sure to use the proper ordered memory
> accesses when possible. Don't use an inline asm for the compare - we
> don't even have anything insane like that as a portable helper, and we
> shouldn't have it.

How does the compiler barrier help in any way here ?

I am concerned about the compiler SSA GVN (Global Value Numbering)
optimizations, and I don't think a compiler barrier solves anything.
(or I'm missing something obvious)

I was concerned about the suggestion from Jonas to use "node2"
rather than "node" after the equality check as a way to ensure
the intended register is used to return the pointer, because after
the SSA GVN optimization pass, AFAIU this won't help in any way.
I have a set of examples below that show gcc using the result of the
first load, and clang using the result of the second load (on
both x86-64 and aarch64). Likewise when a load-acquire is used as the
second load, which I find odd. Hopefully mixing this optimization
from gcc with speculation still abides by the memory model.

Only the asm goto approach ensures that gcc uses the result from
the second load.

#include <stdbool.h>

#define READ_ONCE(x)   \
     (*(__volatile__  __typeof__(x) *)&(x))

static inline
bool same_ptr(void *a, void *b)
{
     asm goto (
#ifdef __x86_64__
         "cmpq %[a], %[b]\n\t"
         "jne %l[ne]\n\t"
#elif defined(__aarch64__)
         "cmp %[a], %[b]\n\t"
         "bne %l[ne]\n\t"
#else
# error "unimplemented"
#endif
         : : [a] "r" (a), [b] "r" (b)
         : : ne);
     return true;
ne:
     return false;
}

int *p;

int fct_2_volatile(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (a != b);
     return *b;
}

int fct_volatile_acquire(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = __atomic_load_n(&p, __ATOMIC_ACQUIRE);
     } while (a != b);
     return *b;
}

int fct_asm_compare(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (!same_ptr(a, b));
     return *b;
}

x86-64 gcc 14.2:

fct_2_volatile:
  mov    rax,QWORD PTR [rip+0x0]        # 7 <fct_2_volatile+0x7>
  mov    rdx,QWORD PTR [rip+0x0]        # e <fct_2_volatile+0xe>
  cmp    rax,rdx
  jne    0 <fct_2_volatile>
  mov    eax,DWORD PTR [rax]
  ret
  cs nop WORD PTR [rax+rax*1+0x0]
fct_volatile_acquire:
  mov    rax,QWORD PTR [rip+0x0]        # 27 <fct_volatile_acquire+0x7>
  mov    rdx,QWORD PTR [rip+0x0]        # 2e <fct_volatile_acquire+0xe>
  cmp    rax,rdx
  jne    20 <fct_volatile_acquire>
  mov    eax,DWORD PTR [rax]
  ret
  cs nop WORD PTR [rax+rax*1+0x0]
fct_asm_compare:
  mov    rdx,QWORD PTR [rip+0x0]        # 47 <fct_asm_compare+0x7>
  mov    rax,QWORD PTR [rip+0x0]        # 4e <fct_asm_compare+0xe>
  cmp    rax,rdx
  jne    40 <fct_asm_compare>
  mov    eax,DWORD PTR [rax]
  ret
main:
  xor    eax,eax
  ret

x86-64 clang 19.1.0:

fct_2_volatile:
  mov    rcx,QWORD PTR [rip+0x0]        # 7 <fct_2_volatile+0x7>
  mov    rax,QWORD PTR [rip+0x0]        # e <fct_2_volatile+0xe>
  cmp    rcx,rax
  jne    0 <fct_2_volatile>
  mov    eax,DWORD PTR [rax]
  ret
  cs nop WORD PTR [rax+rax*1+0x0]
fct_volatile_acquire:
  mov    rcx,QWORD PTR [rip+0x0]        # 27 <fct_volatile_acquire+0x7>
  mov    rax,QWORD PTR [rip+0x0]        # 2e <fct_volatile_acquire+0xe>
  cmp    rcx,rax
  jne    20 <fct_volatile_acquire>
  mov    eax,DWORD PTR [rax]
  ret
  cs nop WORD PTR [rax+rax*1+0x0]
fct_asm_compare:
  mov    rcx,QWORD PTR [rip+0x0]        # 47 <fct_asm_compare+0x7>
  mov    rax,QWORD PTR [rip+0x0]        # 4e <fct_asm_compare+0xe>
  cmp    rax,rcx
  jne    40 <fct_asm_compare>
  mov    eax,DWORD PTR [rax]
  ret
  cs nop WORD PTR [rax+rax*1+0x0]
main:
  xor    eax,eax
  ret

ARM64 gcc 14.2.0:

fct_2_volatile:
         adrp    x0, .LANCHOR0
         add     x0, x0, :lo12:.LANCHOR0
.L2:
         ldr     x1, [x0]
         ldr     x2, [x0]
         cmp     x1, x2
         bne     .L2
         ldr     w0, [x1]
         ret
fct_volatile_acquire:
         adrp    x0, .LANCHOR0
         add     x0, x0, :lo12:.LANCHOR0
.L6:
         ldr     x1, [x0]
         ldar    x2, [x0]
         cmp     x1, x2
         bne     .L6
         ldr     w0, [x1]
         ret
fct_asm_compare:
         adrp    x1, .LANCHOR0
         add     x1, x1, :lo12:.LANCHOR0
.L9:
         ldr     x2, [x1]
         ldr     x0, [x1]
         cmp x2, x0
         bne .L9

         ldr     w0, [x0]
         ret
p:
         .zero   8

armv8-a clang (trunk):

fct_2_volatile:
  adrp	x8, 0 <fct_2_volatile>
  ldr	x10, [x8]
  ldr	x9, [x8]
  cmp	x10, x9
  b.ne	4 <fct_2_volatile+0x4>  // b.any
  ldr	w0, [x9]
  ret
fct_volatile_acquire:
  adrp	x8, 0 <fct_2_volatile>
  add	x8, x8, #0x0
  ldr	x10, [x8]
  ldar	x9, [x8]
  cmp	x10, x9
  b.ne	24 <fct_volatile_acquire+0x8>  // b.any
  ldr	w0, [x9]
  ret
fct_asm_compare:
  adrp	x8, 0 <fct_2_volatile>
  ldr	x9, [x8]
  ldr	x8, [x8]
  cmp	x9, x8
  b.ne	3c <fct_asm_compare>  // b.any
  ldr	w0, [x8]
  ret
main:
  mov	w0, wzr
Mathieu Desnoyers Sept. 27, 2024, 1:30 a.m. UTC | #33
On 2024-09-27 02:01, Boqun Feng wrote:
> 	#define ADDRESS_EQ(var, expr)							\
> 	({										\
> 		bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr);	\
> 											\
> 		OPTIMIZER_HIDE_VAR(var);						\
> 		_____cmp_res;								\
> 	})

If the goal is to ensure gcc uses the register populated by the
second load, I'm afraid it does not work. AFAIU, "hiding" the dependency
chain does not prevent the SSA GVN optimization from combining the
registers as being one and choosing one arbitrary source. "hiding"
the dependency chain before or after the comparison won't help here.

int fct_hide_var_compare(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (!ADDRESS_EQ(a, b));
     return *b;
}

gcc 14.2 x86-64:

fct_hide_var_compare:
  mov    rax,QWORD PTR [rip+0x0]        # 67 <fct_hide_var_compare+0x7>
  mov    rdx,QWORD PTR [rip+0x0]        # 6e <fct_hide_var_compare+0xe>
  cmp    rax,rdx
  jne    60 <fct_hide_var_compare>
  mov    eax,DWORD PTR [rax]
  ret
main:
  xor    eax,eax
  ret

gcc 14.2.0 ARM64:

fct_hide_var_compare:
         adrp    x0, .LANCHOR0
         add     x0, x0, :lo12:.LANCHOR0
.L12:
         ldr     x1, [x0]
         ldr     x2, [x0]
         cmp     x1, x2
         bne     .L12
         ldr     w0, [x1]
         ret
p:
         .zero   8
Boqun Feng Sept. 27, 2024, 1:37 a.m. UTC | #34
On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote:
> On 2024-09-27 02:01, Boqun Feng wrote:
>> 	#define ADDRESS_EQ(var, expr)							\
>> 	({										\
>> 		bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr);	\
>> 											\
>> 		OPTIMIZER_HIDE_VAR(var);						\
>> 		_____cmp_res;								\
>> 	})
>
> If the goal is to ensure gcc uses the register populated by the
> second, I'm afraid it does not work. AFAIU, "hiding" the dependency
> chain does not prevent the SSA GVN optimization from combining the
> registers as being one and choosing one arbitrary source. "hiding"
> the dependency chain before or after the comparison won't help here.
>
> int fct_hide_var_compare(void)
> {
>      int *a, *b;
>
>      do {
>          a = READ_ONCE(p);
>          asm volatile ("" : : : "memory");
>          b = READ_ONCE(p);
>      } while (!ADDRESS_EQ(a, b));

Note that ADDRESS_EQ() only hides the first parameter, so this should be ADDRESS_EQ(b, a).

Regards,
Boqun

>      return *b;
> }
>
> gcc 14.2 x86-64:
>
> fct_hide_var_compare:
>   mov    rax,QWORD PTR [rip+0x0]        # 67 <fct_hide_var_compare+0x7>
>   mov    rdx,QWORD PTR [rip+0x0]        # 6e <fct_hide_var_compare+0xe>
>   cmp    rax,rdx
>   jne    60 <fct_hide_var_compare>
>   mov    eax,DWORD PTR [rax]
>   ret
> main:
>   xor    eax,eax
>   ret
>
> gcc 14.2.0 ARM64:
>
> fct_hide_var_compare:
>          adrp    x0, .LANCHOR0
>          add     x0, x0, :lo12:.LANCHOR0
> .L12:
>          ldr     x1, [x0]
>          ldr     x2, [x0]
>          cmp     x1, x2
>          bne     .L12
>          ldr     w0, [x1]
>          ret
> p:
>          .zero   8
>
>
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
Boqun Feng Sept. 27, 2024, 4:28 a.m. UTC | #35
On Fri, Sep 27, 2024 at 09:37:50AM +0800, Boqun Feng wrote:
> 
> 
> On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote:
> > On 2024-09-27 02:01, Boqun Feng wrote:
> >> 	#define ADDRESS_EQ(var, expr)							\
> >> 	({										\
> >> 		bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr);	\
> >> 											\
> >> 		OPTIMIZER_HIDE_VAR(var);						\
> >> 		_____cmp_res;								\
> >> 	})
> >
> > If the goal is to ensure gcc uses the register populated by the
> > second, I'm afraid it does not work. AFAIU, "hiding" the dependency
> > chain does not prevent the SSA GVN optimization from combining the

Note it's not hiding the dependency, rather the equality,

> > registers as being one and choosing one arbitrary source. "hiding"

After OPTIMIZER_HIDE_VAR(var), the compiler doesn't know whether 'var' is
equal to 'expr' anymore, because OPTIMIZER_HIDE_VAR(var) uses "=r"(var)
to indicate the output is overwritten. So when 'var' is referred to later,
the compiler cannot use the register holding the 'expr' value or any other
register that has the same value, because 'var' may have a different
value from the compiler's POV.
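
For reference, OPTIMIZER_HIDE_VAR() is essentially this (a minimal
rendering of the include/linux/compiler.h definition, modulo the
surrounding comments):

	#ifndef OPTIMIZER_HIDE_VAR
	#define OPTIMIZER_HIDE_VAR(var)						\
		__asm__ ("" : "=r" (var) : "0" (var))
	#endif

The "=r" output tied to the "0" input keeps the value in the same
register, but forces the compiler to treat it as possibly rewritten by
the asm.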

> > the dependency chain before or after the comparison won't help here.
> >
> > int fct_hide_var_compare(void)
> > {
> >      int *a, *b;
> >
> >      do {
> >          a = READ_ONCE(p);
> >          asm volatile ("" : : : "memory");
> >          b = READ_ONCE(p);
> >      } while (!ADDRESS_EQ(a, b));
> 
> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a).
> 

I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile
result shows it can prevent the issue:

gcc 14.2 x86-64:

fct_hide_var_compare:
.L2:
        mov     rcx, QWORD PTR p[rip]
        mov     rdx, QWORD PTR p[rip]
        mov     rax, rdx
        cmp     rcx, rdx
        jne     .L2
        mov     eax, DWORD PTR [rax]
        ret

gcc 14.2.0 ARM64:

fct_hide_var_compare:
        adrp    x2, p
        add     x2, x2, :lo12:p
.L2:
        ldr     x3, [x2]
        ldr     x1, [x2]
        mov     x0, x1
        cmp     x3, x1
        bne     .L2
        ldr     w0, [x0]
        ret

Link to godbolt:

	https://godbolt.org/z/a7jsfzjxY

Regards,
Boqun

> Regards,
> Boqun
> 
> >      return *b;
> > }
> >
> > gcc 14.2 x86-64:
> >
> > fct_hide_var_compare:
> >   mov    rax,QWORD PTR [rip+0x0]        # 67 <fct_hide_var_compare+0x7>
> >   mov    rdx,QWORD PTR [rip+0x0]        # 6e <fct_hide_var_compare+0xe>
> >   cmp    rax,rdx
> >   jne    60 <fct_hide_var_compare>
> >   mov    eax,DWORD PTR [rax]
> >   ret
> > main:
> >   xor    eax,eax
> >   ret
> >
> > gcc 14.2.0 ARM64:
> >
> > fct_hide_var_compare:
> >          adrp    x0, .LANCHOR0
> >          add     x0, x0, :lo12:.LANCHOR0
> > .L12:
> >          ldr     x1, [x0]
> >          ldr     x2, [x0]
> >          cmp     x1, x2
> >          bne     .L12
> >          ldr     w0, [x1]
> >          ret
> > p:
> >          .zero   8
> >
> >
> > -- 
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > https://www.efficios.com
>
Boqun Feng Sept. 27, 2024, 4:38 a.m. UTC | #36
On Fri, Sep 27, 2024 at 03:20:40AM +0200, Mathieu Desnoyers wrote:
> On 2024-09-26 18:12, Linus Torvalds wrote:
> > On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser
> > <jonas.oberhauser@huaweicloud.com> wrote:
> > > 
> > > No, the issue introduced by the compiler optimization (or by your
> > > original patch) is that the CPU can speculatively load from the first
> > > pointer as soon as it has completed the load of that pointer:
> > 
> > You mean the compiler can do it. The inline asm has no impact on what
> > the CPU does. The conditional isn't a barrier for the actual hardware.
> > But once the compiler doesn't try to do it, the data dependency on the
> > address does end up being an ordering constraint on the hardware too
> > (I'm happy to say that I haven't heard from the crazies that want
> > value prediction in a long time).
> > 
> > Just use a barrier.  Or make sure to use the proper ordered memory
> > accesses when possible. Don't use an inline asm for the compare - we
> > don't even have anything insane like that as a portable helper, and we
> > shouldn't have it.
> 
> How does the compiler barrier help in any way here ?
> 
> I am concerned about the compiler SSA GVN (Global Value Numbering)
> optimizations, and I don't think a compiler barrier solves anything.
> (or I'm missing something obvious)

I think you're right, a compiler barrier doesn't help here:

	head = READ_ONCE(p);
	smp_mb();
	WRITE_ONCE(*slot, head);

	ptr = READ_ONCE(p);
	if (ptr != head) {
		...
	} else {
		barrier();
		return ptr;
	}

compilers can replace 'ptr' with 'head' because of the equality, and
even putting barrier() here cannot prevent the compiler from rewriting the
else branch into:

	else {
		barrier();
		return head;
	}

because that's just using a different register, unrelated to memory
accesses.

Jonas, am I missing something subtle? Or is this different from what you
proposed?

Regards,
Boqun

> 
> I was concerned about the suggestion from Jonas to use "node2"
> rather than "node" after the equality check as a way to ensure
> the intended register is used to return the pointer, because after
> the SSA GVN optimization pass, AFAIU this won't help in any way.
> I have a set of examples below that show gcc use the result of the
> first load, and clang use the result of the second load (on
> both x86-64 and aarch64). Likewise when a load-acquire is used as
> second load, which I find odd. Hopefully mixing this optimization
> from gcc with speculation still abide by the memory model.
> 
> Only the asm goto approach ensures that gcc uses the result from
> the second load.
> 
[...]
Mathieu Desnoyers Sept. 27, 2024, 10:59 a.m. UTC | #37
On 2024-09-27 06:28, Boqun Feng wrote:
> On Fri, Sep 27, 2024 at 09:37:50AM +0800, Boqun Feng wrote:
>>
>>
>> On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote:
>>> On 2024-09-27 02:01, Boqun Feng wrote:
>>>> 	#define ADDRESS_EQ(var, expr)							\
>>>> 	({										\
>>>> 		bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr);	\
>>>> 											\
>>>> 		OPTIMIZER_HIDE_VAR(var);						\
>>>> 		_____cmp_res;								\
>>>> 	})
>>>
>>> If the goal is to ensure gcc uses the register populated by the
>>> second, I'm afraid it does not work. AFAIU, "hiding" the dependency
>>> chain does not prevent the SSA GVN optimization from combining the
> 
> Note it's not hiding the dependency, rather the equality,
> 
>>> registers as being one and choosing one arbitrary source. "hiding"
> 
> after OPTIMIZER_HIDE_VAR(var), compiler doesn't know whether 'var' is
> equal to 'expr' anymore, because OPTIMIZER_HIDE_VAR(var) uses "=r"(var)
> to indicate the output is overwritten. So when 'var' is referred later,
> compiler cannot use the register for a 'expr' value or any other
> register that has the same value, because 'var' may have a different
> value from the compiler's POV.
> 
>>> the dependency chain before or after the comparison won't help here.
>>>
>>> int fct_hide_var_compare(void)
>>> {
>>>       int *a, *b;
>>>
>>>       do {
>>>           a = READ_ONCE(p);
>>>           asm volatile ("" : : : "memory");
>>>           b = READ_ONCE(p);
>>>       } while (!ADDRESS_EQ(a, b));
>>
>> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a).
>>
> 
> I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile
> result shows it can prevent the issue:

I see, yes. It prevents the issue by making the compiler create
a copy of the value "modified" by the asm before doing the equality
comparison.

This means the compiler cannot derive the value for b from the first
load when b is used after the equality comparison.

The only downside of OPTIMIZER_HIDE_VAR() is that it adds an extra
"mov" instruction to move the content across registers. I don't think
it matters performance wise though, so that solution is appealing
because it is arch-agnostic.

One small improvement over your proposed solution would be to apply
OPTIMIZER_HIDE_VAR() on both inputs. Because this is not a volatile
asm, it is simply optimized away if var1 or var2 is unused following
the equality comparison. It is more convenient to prevent replacement
of both addresses being compared by the other rather than providing
the guarantee only on a single parameter:

#define OPTIMIZER_HIDE_VAR(var)			\
         __asm__ ("" : "+r" (var))

#define ADDRESS_EQ(var1, var2)			\
({						\
	bool _____cmp_res = (var1) == (var2);	\
						\
	OPTIMIZER_HIDE_VAR(var1);		\
	OPTIMIZER_HIDE_VAR(var2);		\
	_____cmp_res;				\
})

Thanks,

Mathieu

> 
> gcc 14.2 x86-64:
> 
> fct_hide_var_compare:
> .L2:
>          mov     rcx, QWORD PTR p[rip]
>          mov     rdx, QWORD PTR p[rip]
>          mov     rax, rdx
>          cmp     rcx, rdx
>          jne     .L2
>          mov     eax, DWORD PTR [rax]
>          ret
> 
> gcc 14.2.0 ARM64:
> 
> fct_hide_var_compare:
>          adrp    x2, p
>          add     x2, x2, :lo12:p
> .L2:
>          ldr     x3, [x2]
>          ldr     x1, [x2]
>          mov     x0, x1
>          cmp     x3, x1
>          bne     .L2
>          ldr     w0, [x0]
>          ret
> 
> Link to godbolt:
> 
> 	https://godbolt.org/z/a7jsfzjxY-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Mathieu Desnoyers Sept. 27, 2024, 2:43 p.m. UTC | #38
On 2024-09-27 12:59, Mathieu Desnoyers wrote:
> On 2024-09-27 06:28, Boqun Feng wrote:
[...]
>> I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile
>> result shows it can prevent the issue:
> 
> I see, yes. It prevents the issue by making the compiler create
> a copy of the value "modified" by the asm before doing the equality
> comparison.
> 
> This means the compiler cannot derive the value for b from the first
> load when b is used after after the equality comparison.
> 
> The only downside of OPTIMIZER_HIDE_VAR() is that it adds an extra
> "mov" instruction to move the content across registers. I don't think
> it matters performance wise though, so that solution is appealing
> because it is arch-agnostic.
> 
> One small improvement over your proposed solution would be to apply
> OPTIMIZER_HIDE_VAR() on both inputs. Because this is not a volatile
> asm, it is simply optimized away if var1 or var2 is unused following
> the equality comparison. It is more convenient to prevent replacement
> of both addresses being compared by the other rather than providing
> the guarantee only on a single parameter:

Actually, your approach is better (only preserving the address
dependency on the first parameter), because it allows the second
parameter to be a constant.

Here is a diff. Please let me know if I need to improve anything wrt
comments or implementation:

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 2df665fa2964..52434eccd715 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -186,6 +186,32 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  	__asm__ ("" : "=r" (var) : "0" (var))
  #endif
  
+/*
+ * Compare an address with an expression while preserving the address
+ * dependencies for later use of the address. It should be used when
+ * comparing an address returned by rcu_dereference() with another
+ * address (either constant or in registers).
+ *
+ * This is needed to prevent the compiler SSA GVN optimization pass from
+ * replacing the register holding @addr by @expr (either a constant or a
+ * register) based on their equality, which does not preserve address
+ * dependencies and allows the following misordering speculations:
+ *
+ * - If @expr is a constant, the compiler can issue the loads which depend
+ *   on @addr before the load of @addr.
+ * - If @expr is a register populated by a prior load, weakly-ordered
+ *   CPUs can speculate loads which depend on @addr before the load of the
+ *   address they depend on.
+ */
+#ifndef ADDRESS_EQ
+#define ADDRESS_EQ(addr, expr)					\
+	({							\
+		bool __res = (addr) == (expr);			\
+		OPTIMIZER_HIDE_VAR(addr);			\
+		__res;						\
+	})
+#endif
+
  #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
  
  /**
Mathieu Desnoyers Sept. 27, 2024, 3:22 p.m. UTC | #39
On 2024-09-27 16:43, Mathieu Desnoyers wrote:
> On 2024-09-27 12:59, Mathieu Desnoyers wrote:
>> On 2024-09-27 06:28, Boqun Feng wrote:
> [...]
>>> I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile
>>> result shows it can prevent the issue:
>>
>> I see, yes. It prevents the issue by making the compiler create
>> a copy of the value "modified" by the asm before doing the equality
>> comparison.
>>
>> This means the compiler cannot derive the value for b from the first
>> load when b is used after after the equality comparison.
>>
>> The only downside of OPTIMIZER_HIDE_VAR() is that it adds an extra
>> "mov" instruction to move the content across registers. I don't think
>> it matters performance wise though, so that solution is appealing
>> because it is arch-agnostic.
>>
>> One small improvement over your proposed solution would be to apply
>> OPTIMIZER_HIDE_VAR() on both inputs. Because this is not a volatile
>> asm, it is simply optimized away if var1 or var2 is unused following
>> the equality comparison. It is more convenient to prevent replacement
>> of both addresses being compared by the other rather than providing
>> the guarantee only on a single parameter:
> 
> Actually, your approach is better (only preserving the address
> dependency on the first parameter), because it allows the second
> parameter to be a constant.
> 
> Here is a diff. Please let me know if I need to improve anything wrt
> comments or implementation:
> 
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 2df665fa2964..52434eccd715 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -186,6 +186,32 @@ void ftrace_likely_update(struct ftrace_likely_data 
> *f, int val,
>       __asm__ ("" : "=r" (var) : "0" (var))
>   #endif
> 
> +/*
> + * Compare an address with an expression while preserving the address
> + * dependencies for later use of the address. It should be used when
> + * comparing an address returned by rcu_dereference() with another
> + * address (either constant or in registers).
> + *
> + * This is needed to prevent the compiler SSA GVN optimization pass from
> + * replacing the register holding @addr by @expr (either a constant or a

Actually, I've done a bit of testing on godbolt, and it appears that
disabling this specific optimization makes the problem disappear:
-fcse-follow-jumps (by using -fno-cse-follow-jumps).

I will update this comment to state that both CSE and SSA GVN
optimizations can cause issues there.

Thanks,

Mathieu

> + * register) based on their equality, which does not preserve address
> + * dependencies and allows the following misordering speculations:
> + *
> + * - If @expr is a constant, the compiler can issue the loads which depend
> + *   on @addr before the load of @addr.
> + * - If @expr is a register populated by a prior load, weakly-ordered
> + *   CPUs can speculate loads which depend on @addr before the load of the
> + *   address they depend on.
> + */
> +#ifndef ADDRESS_EQ
> +#define ADDRESS_EQ(addr, expr)                    \
> +    ({                            \
> +        bool __res = (addr) == (expr);            \
> +        OPTIMIZER_HIDE_VAR(addr);            \
> +        __res;                        \
> +    })
> +#endif
> +
>   #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), 
> __COUNTER__)
> 
>   /**
>
Alan Huang Sept. 27, 2024, 4:06 p.m. UTC | #40
2024年9月27日 12:28,Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> On Fri, Sep 27, 2024 at 09:37:50AM +0800, Boqun Feng wrote:
>> 
>> 
>> On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote:
>>> On 2024-09-27 02:01, Boqun Feng wrote:
>>>> #define ADDRESS_EQ(var, expr) \
>>>> ({ \
>>>> bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr); \
>>>> \
>>>> OPTIMIZER_HIDE_VAR(var); \
>>>> _____cmp_res; \
>>>> })
>>> 
>>> If the goal is to ensure gcc uses the register populated by the
>>> second, I'm afraid it does not work. AFAIU, "hiding" the dependency
>>> chain does not prevent the SSA GVN optimization from combining the
> 
> Note it's not hiding the dependency, rather the equality,
> 
>>> registers as being one and choosing one arbitrary source. "hiding"
> 
> after OPTIMIZER_HIDE_VAR(var), compiler doesn't know whether 'var' is
> equal to 'expr' anymore, because OPTIMIZER_HIDE_VAR(var) uses "=r"(var)
> to indicate the output is overwritten. So when 'var' is referred later,
> compiler cannot use the register for a 'expr' value or any other
> register that has the same value, because 'var' may have a different
> value from the compiler's POV.
> 
>>> the dependency chain before or after the comparison won't help here.
>>> 
>>> int fct_hide_var_compare(void)
>>> {
>>>     int *a, *b;
>>> 
>>>     do {
>>>         a = READ_ONCE(p);
>>>         asm volatile ("" : : : "memory");
>>>         b = READ_ONCE(p);
>>>     } while (!ADDRESS_EQ(a, b));
>> 
>> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a).
>> 
> 
> I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile
> result shows it can prevent the issue:
> 
> gcc 14.2 x86-64:
> 
> fct_hide_var_compare:
> .L2:
>        mov     rcx, QWORD PTR p[rip]
>        mov     rdx, QWORD PTR p[rip]
>        mov     rax, rdx
>        cmp     rcx, rdx
>        jne     .L2
>        mov     eax, DWORD PTR [rax]
>        ret
> 
> gcc 14.2.0 ARM64:
> 
> fct_hide_var_compare:
>        adrp    x2, p
>        add     x2, x2, :lo12:p
> .L2:
>        ldr     x3, [x2]
>        ldr     x1, [x2]
>        mov     x0, x1
>        cmp     x3, x1
>        bne     .L2
>        ldr     w0, [x0]
>        ret
> 
> Link to godbolt:
> 
> https://godbolt.org/z/a7jsfzjxY

Checking the assembly generated by different compilers for the kernel on the local machine will yield more accurate results. Some optimizations are restricted by the kernel. Therefore, if you use Godbolt, ensure that the compiler arguments match those used for the kernel.

> 
> Regards,
> Boqun
> 
>> Regards,
>> Boqun
>> 
>>>     return *b;
>>> }
>>> 
>>> gcc 14.2 x86-64:
>>> 
>>> fct_hide_var_compare:
>>>  mov    rax,QWORD PTR [rip+0x0]        # 67 <fct_hide_var_compare+0x7>
>>>  mov    rdx,QWORD PTR [rip+0x0]        # 6e <fct_hide_var_compare+0xe>
>>>  cmp    rax,rdx
>>>  jne    60 <fct_hide_var_compare>
>>>  mov    eax,DWORD PTR [rax]
>>>  ret
>>> main:
>>>  xor    eax,eax
>>>  ret
>>> 
>>> gcc 14.2.0 ARM64:
>>> 
>>> fct_hide_var_compare:
>>>         adrp    x0, .LANCHOR0
>>>         add     x0, x0, :lo12:.LANCHOR0
>>> .L12:
>>>         ldr     x1, [x0]
>>>         ldr     x2, [x0]
>>>         cmp     x1, x2
>>>         bne     .L12
>>>         ldr     w0, [x1]
>>>         ret
>>> p:
>>>         .zero   8
>>> 
>>> 
>>> -- 
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
Linus Torvalds Sept. 27, 2024, 4:44 p.m. UTC | #41
On Thu, 26 Sept 2024 at 18:38, Boqun Feng <boqun.feng@gmail.com> wrote:
>
> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a).

Yeah, please stop making things unnecessarily complicated.

Just use a barrier(). Please. Stop these stupid games until you can
show why it matters.

And by "why it matters" I mean "major difference in code generation",
not some "it uses one more register and has to spill" kind of small
detail.

At this point, I'm not even convinced the whole hazard pointer
approach makes sense. And you're not helping by making it more
complicated than it needs to be.

          Linus
Mathieu Desnoyers Sept. 27, 2024, 5:15 p.m. UTC | #42
On 2024-09-27 18:44, Linus Torvalds wrote:
> On Thu, 26 Sept 2024 at 18:38, Boqun Feng <boqun.feng@gmail.com> wrote:
>>
>> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a).
> 
> Yeah, please stop making things unnecessarily complicated.
> 
> Just use a barrier(). Please. Stop these stupid games until you can
> show why it matters.

The barrier() is ineffective at fixing the issue.
It does not prevent the compiler CSE from losing the
address dependency:

int fct_2_volatile_barriers(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (a != b);
     asm volatile ("" : : : "memory");  <----- where you would like your barrier
     return *b;
}

With gcc 14.2 (arm64):

fct_2_volatile_barriers:
         adrp    x0, .LANCHOR0
         add     x0, x0, :lo12:.LANCHOR0
.L2:
         ldr     x1, [x0]    <------ x1 populated by first load.
         ldr     x2, [x0]
         cmp     x1, x2
         bne     .L2
         ldr     w0, [x1]    <------ x1 is used for access which should depend on b.
         ret

On weakly-ordered architectures, this lets CPU speculation
use the result from the first load to speculate
"ldr w0, [x1]" before "ldr x2, [x0]".
Based on the RCU documentation, the control dependency
does not prevent the CPU from speculating loads.

There are a few ways to fix this:

- Compile everything with -fno-cse-follow-jumps.
- Make the compiler unaware of the relationship between the
   address equality and address-dependent use of b. This can
   be done either using ADDRESS_EQ() or asm goto.

I prefer ADDRESS_EQ() because it is arch-agnostic. I don't
care that it adds one more register movement instruction.
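
For illustration, the reader-side re-read loop would then look like this
(sketch only, using the ADDRESS_EQ() definition quoted earlier in this
thread, which hides its first argument, and the same global 'p' as in the
example above; per Boqun's note, the pointer that is dereferenced
afterwards must be the hidden one):

int fct_address_eq(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (!ADDRESS_EQ(b, a));   /* hide b, the pointer used below */

     return *b;                     /* cannot be rewritten into *a */
}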

> And by "why it matters" I mean "major difference in code generation",
> not some "it uses one more register and has to spill" kind of small
> detail.

Why it matters is because gcc generates code that does not
preserve address dependency of the second READ_ONCE().

> At this point, I'm not even convinced the whole hazard pointer
> approach makes sense. And you're not helping by making it more
> complicated than it needs to be.

I'm preparing a small series that aims to show how a minimal
hazard pointer implementation can help improve common scenarios:

- Guarantee object existence on pointer dereference to increment
   a reference count:
   - replace locking used for that purpose in drivers (e.g. usb),
   - replace RCU + inc_not_zero pattern (see the sketch after this list),
- rtmutex: I suspect we can improve situations where locks need
   to be taken in reverse dependency chain order by guaranteeing
   existence of first and second locks in traversal order,
   allowing them to be locked in the correct order (which is
   reverse from traversal order) rather than try-lock+retry on
   nested lock. This can be done with hazard pointers without
   requiring object reclaim to be delayed by an RCU grace period.
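
The sketch referred to above, for the RCU + inc_not_zero case (minimal
and hypothetical, written against the hazptr_alloc()/hazptr_tryprotect()/
hazptr_clear() API from this patch; 'struct foo', 'gp' and the context
handling are made up, and this is not code from the upcoming series):

     struct foo {
         struct callback_head head;
         refcount_t ref;
         /* payload */
     };

     struct foo *gp;    /* published pointer */

     /* Take a reference without an RCU read-side critical section. */
     bool get_foo_ref(struct hazptr_context *foo_ctx)
     {
         hazptr_t *hzp = hazptr_alloc(foo_ctx);
         struct foo *p;
         bool ret = false;

         if (!hzp)
             return false;

         /* Existence of *p is guaranteed while the slot points to it. */
         p = hazptr_tryprotect(hzp, gp, head);
         if (p) {
             ret = refcount_inc_not_zero(&p->ref);
             hazptr_clear(hzp);
         }
         hazptr_free(foo_ctx, hzp);
         return ret;
     }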

Thanks,

Mathieu
Linus Torvalds Sept. 27, 2024, 5:23 p.m. UTC | #43
On Fri, 27 Sept 2024 at 10:17, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> The barrier() is ineffective at fixing the issue.
> It does not prevent the compiler CSE from losing the
> address dependency:

Ok. Thanks for actually specifying code.

That needs to be

 (a) in a comment

 (b) the value barrier needs to be on *both* values so that the order
of the equality testing doesn't matter.

> I'm preparing a small series that aims to show how a minimal
> hazard pointer implementation can help improve common scenarios:

I want actual numbers on real loads. Just so you know.  Not "this can
help". But "this actually really _does_ help".

                    Linus
Mathieu Desnoyers Sept. 27, 2024, 5:51 p.m. UTC | #44
On 2024-09-27 19:23, Linus Torvalds wrote:
> On Fri, 27 Sept 2024 at 10:17, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> The barrier() is ineffective at fixing the issue.
>> It does not prevent the compiler CSE from losing the
>> address dependency:
> 
> Ok. Thanks for actually specifying code.
> 
> That needs to be
> 
>   (a) in a comment

OK. I'll add the code/asm examples to the comment above ADDRESS_EQ().

> 
>   (b) the value barrier needs to be on *both* values so that the order
> of the equality testing doesn't matter.

If we use OPTIMIZER_HIDE_VAR() on both parameters, it indeed minimizes
the odds that someone gets the order wrong, but it disallows using
ADDRESS_EQ() with a constant parameter
(e.g. ADDRESS_EQ(ptr, &mystruct)), which would be nice to cover. It
works fine using OPTIMIZER_HIDE_VAR() on the first argument only,
but it opens the door to misuses.

Perhaps there is a trick with compiler builtins we could do to only
use OPTIMIZER_HIDE_VAR() on non-constant arguments, but I can't get
it to work so far.

> 
>> I'm preparing a small series that aims to show how a minimal
>> hazard pointer implementation can help improve common scenarios:
> 
> I want actual numbers on real loads. Just so you know.  Not "this can
> help". But "this actually really _does_ help".

Noted, thanks!

Mathieu
Linus Torvalds Sept. 27, 2024, 6:13 p.m. UTC | #45
On Fri, 27 Sept 2024 at 10:53, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> >   (b) the value barrier needs to be on *both* values so that the order
> > of the equality testing doesn't matter.
>
> If we use OPTIMIZER_HIDE_VAR() on both parameters, it indeed minimizes
> the odds that someone get the order wrong, but it disallows using
> ADDRESS_EQ() with a constant parameter

No it doesn't.

This is trivial - just hide the source of the *comparison*, so that
the compiler doesn't know what you are comparing, and can't use it to
then replace one with the other:

   static __always_inline bool compare_ptr(const volatile void *a,
                                           const volatile void *b)
   {
        OPTIMIZER_HIDE_VAR(a);
        OPTIMIZER_HIDE_VAR(b);
        return a == b;
   }

compares two arbitrary pointer values without allowing the compiler to
then use the comparison result to use either of the original values as
a replacement for the other.
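
In the loop discussed in this thread, that would look like this (sketch):

   do {
        a = READ_ONCE(p);
        b = READ_ONCE(p);
   } while (!compare_ptr(a, b));
   /* the compiler cannot prove a == b above, so *b really loads via b */
   return *b;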

And yes, that "hide both" model will cause a bit more register
pressure, because the compiler will now compare two values that it
really thinks are potentially different from the originals. So you'll
have two "useless" temporaries that contain the same values as the
source pointers, but if that's the cost of having a comparison that
the compiler can't see, that's fine.

Making it a bit less obvious, you can hide just one of the variables -
you don't actually need to hide both (but for clarity, maybe you want
to).

Because even hiding the value one from the compiler will mean that it
can't use the comparison to decide that the originals are equal, even
if one of them is unhidden.

No?

              Linus
Jonas Oberhauser Sept. 27, 2024, 7:12 p.m. UTC | #46
Am 9/27/2024 um 8:13 PM schrieb Linus Torvalds:

> Because even hiding the value one from the compiler will mean that it
> can't use the comparison to decide that the originals are equal, even
> if one of them is unhidden.
> 
> No?
> 
>                Linus

I think it depends on which one you hide.

If you do

  z = b;
  hide(z);
  if (a==z) { *b; }

then it will be fine, because it knows a==z but nothing about the 
relation of b with a or z.


But for

  z = a;
  hide(z);
  if (z==b) { *b; }

then it would still know that b == z, and could replace *b with *z 
(which really is *a).


Best wishes,
   jonas
Jonas Oberhauser Sept. 27, 2024, 7:23 p.m. UTC | #47
Two comments inline,

Am 9/27/2024 um 6:38 AM schrieb Boqun Feng:

> compilers can replace 'ptr' with 'head' because of the equality, and
> even putting barrier() here cannot prevent compiler to rewrite the else
> branch into:
> 
> 	else {
> 		barrier();
> 		return head;
> 	}
> 
> because that's just using a different register, unrelated to memory
> accesses.

Yeah, that was the solution I had in mind. My reasoning was that from 
the C abstract machine's point of view, the read of head is still a memory access, and the barrier()
should make the compiler forget everything it knew about the value of 
head from before the barrier().

So I had a gap in my understanding of the strength of barrier(). Can I 
understand it to mean that the compiler can do escape analysis to reason 
that a volatile asm block with a "memory" clobber can't possibly be manipulating
some variables (like head in this case)?

That idea seems to be confirmed by this (atrocious, not to be copied!) 
example:

int fct_escape_address_of_b(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (a != b);

     // really really hide b
     int **p = &b;
     OPTIMIZER_HIDE_VAR(p);

     asm volatile ("" : : : "memory");
     return *b;
}

This also does not generate any additional instructions, unlike just 
using OPTIMIZER_HIDE_VAR(b).

What is the advantage of defining OPTIMIZER_HIDE_VAR the way it currently
works instead of like above?



 > On Fri, Sep 27, 2024 at 03:20:40AM +0200, Mathieu Desnoyers wrote:

>> I have a set of examples below that show gcc use the result of the
>> first load, and clang use the result of the second load (on
>> both x86-64 and aarch64). Likewise when a load-acquire is used as
>> second load, which I find odd. 

Note that if you use acquire on the second load, the code is correct 
even if the compiler uses the result of the first load.

That is because the acquire load will still synchronize sufficiently 
with the second publishing store after the ABA, and then we can get the 
right data even from the old address.
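
I.e. something like the following would already be correct without hiding
anything (sketch, same global p as in the earlier examples; the cost is
the acquire barrier on the second load):

int fct_acquire_second_load(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         b = smp_load_acquire(&p);
     } while (a != b);

     /*
      * Even if the compiler dereferences a here, the acquire load pairs
      * with the publishing store, so the data read is not stale after
      * an ABA.
      */
     return *b;
}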

Best wishes,

   jonas
Linus Torvalds Sept. 27, 2024, 7:28 p.m. UTC | #48
On Fri, 27 Sept 2024 at 12:12, Jonas Oberhauser
<jonas.oberhauser@huaweicloud.com> wrote:
>
> I think it depends on which one you hide.

No.

Dammit, people, read the code I posted.

> But for
>
>   z = a;
>   hide(z);
>   if (z==b) { *b; }

No.

I *intentionally* made it an inline function, and only hid the
arguments to the equality comparison.

So the "hide(z)" hides the argument to the inline function - NOT THE ORIGINAL.

> then it would still know that b == z, and could replace *b with *z
> (which really is *a).

No.

The hiding is literally *ONLY* for the comparison. It's inside the
helper function. It doesn't affect the originals at all.

Which means that the compiler CANNOT KNOW anything about the original
pointers when it compares for equality (or inequality).

Basically, the comparison is now a black box to the compiler, and the
compiler cannot use the result of the comparison to make ANY judgment
on whether the two original pointers were related or not.

               Linus
Mathieu Desnoyers Sept. 27, 2024, 8:02 p.m. UTC | #49
On 2024-09-27 20:13, Linus Torvalds wrote:
> On Fri, 27 Sept 2024 at 10:53, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>>>    (b) the value barrier needs to be on *both* values so that the order
>>> of the equality testing doesn't matter.
>>
>> If we use OPTIMIZER_HIDE_VAR() on both parameters, it indeed minimizes
>> the odds that someone get the order wrong, but it disallows using
>> ADDRESS_EQ() with a constant parameter
> 
> No it doesn't.
> 
> This is trivial - just hide the source of the *comparison*, so that
> the compiler doesn't know what you are comparing, and can't use it to
> then replace one with the other:
> 
>     static __always_inline bool compare_ptr(const volatile void *a,
> const volatile void *b)
>     {
>          OPTIMIZER_HIDE_VAR(a);
>          OPTIMIZER_HIDE_VAR(b);
>          return a == b;
>     }

Cool! It works! Thanks!

> 
> compares two arbitrary pointer values without allowing the compiler to
> then use the comparison result to use either of the original values as
> a replacement for the other.

Yep. And the static inline is much cleaner as it allows users to pass
constants as well.

> 
> And yes, that "hide both" model will cause a bit more register
> pressure, because the compiler will now compare two values that it
> really thinks are potentially different from the originals. So you'll
> have two "useless" temporaries that contain the same values as the
> source pointers, but if that's the cost of having a comparison that
> the compiler can't see, that's fine.

I've tried it and it seems that the compiler only leaves one "mov"
extra there, since the extra register movement on the input that is
not used afterwards can then be optimized away.

> 
> Making it a bit less obvious, you can hide just one of the variables -
> you don't actually need to hide both m(but for clarity, maybe you want
> to).
 >
 > Because even hiding the value one from the compiler will mean that it
 > can't use the comparison to decide that the originals are equal, even
 > if one of them is unhidden.
 >
 > No?

I would prefer hiding both input variables.

Just hiding one variable might work for CSE (and light
godbolt attempts seem to confirm this), but I'm worried that
it could eventually break when compilers start making SSA GVN
optimizations smarter.

AFAIU, in the SSA GVN model, if we just hide @a before the
comparison and don't hide @b, we'd be in a situation where the
compiler could know that the version of the variable generated
by hiding @a (call it a') is equal to @b, and therefore when using
@b afterward could instead use a', which is derived from @a
rather than @b.
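
Roughly, in source form (a sketch of that scenario, using an explicit
copy to stand in for the hidden version a'):

int fct_hide_first_only(void)
{
     int *a, *b, *a_hidden;

     do {
         a = READ_ONCE(p);
         b = READ_ONCE(p);
         a_hidden = a;
         OPTIMIZER_HIDE_VAR(a_hidden);  /* a' in the description above */
     } while (a_hidden != b);

     /*
      * GVN knows a_hidden == b here and could use a_hidden (derived
      * from @a) for the access below, losing the dependency on @b.
      */
     return *b;
}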

It may not happen in practice, simply because no sane
optimization would prefer using a version that is deeper in
the dependency chain (a') rather than @b, but that would be
based on assumptions about how specific heuristics work, and
would therefore be fragile.

Thanks,

Mathieu
Mathieu Desnoyers Sept. 27, 2024, 8:10 p.m. UTC | #50
On 2024-09-27 21:23, Jonas Oberhauser wrote:
[...]
> That idea seems to be confirmed by this (atrocious, not to be copied!) 
> example:
> 
> int fct_escape_address_of_b(void)
> {
>      int *a, *b;
> 
>      do {
>          a = READ_ONCE(p);
>          asm volatile ("" : : : "memory");
>          b = READ_ONCE(p);
>      } while (a != b);
> 
>      // really really hide b
>      int **p = &b;
>      OPTIMIZER_HIDE_VAR(p);
> 
>      asm volatile ("" : : : "memory");
>      return *b;
> }
> 
> This also does not generate any additional instructions, unlike just 
> using OPTIMIZER_HIDE_VAR(b).
> 
> What is the advantage of defining OPTIMIZE_HIDE_VAR the way it currently 
> works instead of like above?

Did you try it on godbolt.org ? Does it have the intended effect ?
By the looks of it, you're just creating another version of @b called
"p", which is then never used and would be discarded by further
optimization. I'm unsure what you are trying to achieve here.

Thanks,

Mathieu
Linus Torvalds Sept. 27, 2024, 8:24 p.m. UTC | #51
On Fri, 27 Sept 2024 at 12:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Dammit, people, read the code I posted.

Actually, no, apologies. You were right, and I was wrong.

It does need both of the sources of the comparison to be hidden,
because while even hiding just one side makes the comparison result
"meaningless" as far as the compiler is concerned, the compiler will
still have generated a pseudo for the hidden value, and might decide
that it can re-use that pseudo for the non-hidden pointer if the two
then match.

So yeah, the example function I posted should be safe, but my "you can
probably make do with hiding just one side" was actually a bogus and
invalid optimization. You do need to hide both sides. Not just for
clarity, but for correctness.

             Linus
Jonas Oberhauser Sept. 27, 2024, 10:18 p.m. UTC | #52
Am 9/27/2024 um 10:10 PM schrieb Mathieu Desnoyers:
> On 2024-09-27 21:23, Jonas Oberhauser wrote:
> [...]
>> That idea seems to be confirmed by this (atrocious, not to be copied!) 
>> example:
>>
>> int fct_escape_address_of_b(void)
>> {
>>      int *a, *b;
>>
>>      do {
>>          a = READ_ONCE(p);
>>          asm volatile ("" : : : "memory");
>>          b = READ_ONCE(p);
>>      } while (a != b);
>>
>>      // really really hide b
>>      int **p = &b;
>>      OPTIMIZER_HIDE_VAR(p);
>>
>>      asm volatile ("" : : : "memory");
>>      return *b;
>> }
>>
>> This also does not generate any additional instructions, unlike just 
>> using OPTIMIZER_HIDE_VAR(b).
>>
>> What is the advantage of defining OPTIMIZE_HIDE_VAR the way it 
>> currently works instead of like above?
> 
> Did you try it on godbolt.org ? Does it have the intended effect ?

I certainly did try and certainly read it as having the intended effect, 
otherwise I wouldn't have written that it seems confirmed.

However, just because my eyes read it doesn't mean that's what happened, 
and even if it happened doesn't mean that it is guaranteed to happen.

> By the looks of it, you're just creating another version of @b called
> "p", which is then never used and would be discarded by further
> optimization.
> I'm unsure what you are trying to achieve here.

Simply put I'm trying to let the compiler think that I leaked the 
address of b. After that, the memory barrier should let it think that 
the b after the memory barrier might not be the same as the one before 
it (which was equal to a), forcing it to read from b.

However, I suppose on second thought that that might not be enough, 
because the compiler could still simply do b = a right after exiting the 
while loop.

And that is true no matter what we put after the while loop or before
the condition: as long as the condition compares a and b, right after it
the compiler can do b = a. Just took me a while to see :))
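
In other words, as far as the compiler is concerned, it may effectively
rewrite the function as (sketch):

int fct_escape_address_of_b(void)   /* what the compiler may emit */
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (a != b);

     b = a;     /* legal: the loop can only exit with a == b */

     /* ...whatever escape/barrier tricks follow change nothing... */
     asm volatile ("" : : : "memory");
     return *b; /* really *a: the dependency on the second load is gone */
}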

I'm not sure why gcc does the b=a with the normal OPTIMIZER_HIDE_VAR but 
(as far as I read the code) doesn't do it with the above. Maybe just a 
weird corner case...

Have fun,
   jonas
Alan Huang Sept. 28, 2024, 10:10 p.m. UTC | #53
On Sept. 28, 2024, at 06:18, Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> wrote:
> 
> 
> 
> Am 9/27/2024 um 10:10 PM schrieb Mathieu Desnoyers:
>> On 2024-09-27 21:23, Jonas Oberhauser wrote:
>> [...]
>>> That idea seems to be confirmed by this (atrocious, not to be copied!) example:
>>> 
>>> int fct_escape_address_of_b(void)
>>> {
>>>      int *a, *b;
>>> 
>>>      do {
>>>          a = READ_ONCE(p);
>>>          asm volatile ("" : : : "memory");
>>>          b = READ_ONCE(p);
>>>      } while (a != b);
>>> 
>>>      // really really hide b
>>>      int **p = &b;
>>>      OPTIMIZER_HIDE_VAR(p);
>>> 
>>>      asm volatile ("" : : : "memory");
>>>      return *b;
>>> }
>>> 
>>> This also does not generate any additional instructions, unlike just using OPTIMIZER_HIDE_VAR(b).
>>> 
>>> What is the advantage of defining OPTIMIZE_HIDE_VAR the way it currently works instead of like above?
>> Did you try it on godbolt.org ? Does it have the intended effect ?
> 
> I certainly did try and certainly read it as having the intended effect, otherwise I wouldn't have written that it seems confirmed.
> 
> However, just because my eyes read it doesn't mean that's what happened, and even if it happened doesn't mean that it is guaranteed to happen.
> 
>> By the looks of it, you're just creating another version of @b called
>> "p", which is then never used and would be discarded by further
>> optimization. >
>> I'm unsure what you are trying to achieve here.
> 
> Simply put I'm trying to let the compiler think that I leaked the address of b. After that, the memory barrier should let it think that the b after the memory barrier might not be the same as the one before it (which was equal to a), forcing it to read from b.
> 
> However, I suppose on second thought that that might not be enough, because the compiler could still simply do b = a right after exiting the while loop.
> 
> And that is true no matter what we put behind the while loop or before the condition, as long as the condition compares a and b, right after it the compiler can do b = a. Just took me a while to see :))
> 
> I'm not sure why gcc does the b=a with the normal OPTIMIZER_HIDE_VAR but (as far as I read the code) doesn't do it with the above. Maybe just a weird corner case...

Making p a static variable outside the function will make a difference.

Or the following:
	
	int **p = &b;
	barrier_data(p);

also works.

BTW, barrier_data(&b) generates more instructions when building the kernel than it does on godbolt.

> 
> Have fun,
>  jonas
> 
>
Alan Huang Sept. 28, 2024, 11:12 p.m. UTC | #54
On Sept. 29, 2024, at 06:10, Alan Huang <mmpgouride@gmail.com> wrote:
> 
> 2024年9月28日 06:18,Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> wrote:
>> 
>> 
>> 
>> Am 9/27/2024 um 10:10 PM schrieb Mathieu Desnoyers:
>>> On 2024-09-27 21:23, Jonas Oberhauser wrote:
>>> [...]
>>>> That idea seems to be confirmed by this (atrocious, not to be copied!) example:
>>>> 
>>>> int fct_escape_address_of_b(void)
>>>> {
>>>>     int *a, *b;
>>>> 
>>>>     do {
>>>>         a = READ_ONCE(p);
>>>>         asm volatile ("" : : : "memory");
>>>>         b = READ_ONCE(p);
>>>>     } while (a != b);
>>>> 
>>>>     // really really hide b
>>>>     int **p = &b;
>>>>     OPTIMIZER_HIDE_VAR(p);
>>>> 
>>>>     asm volatile ("" : : : "memory");
>>>>     return *b;
>>>> }
>>>> 
>>>> This also does not generate any additional instructions, unlike just using OPTIMIZER_HIDE_VAR(b).
>>>> 
>>>> What is the advantage of defining OPTIMIZE_HIDE_VAR the way it currently works instead of like above?
>>> Did you try it on godbolt.org ? Does it have the intended effect ?
>> 
>> I certainly did try and certainly read it as having the intended effect, otherwise I wouldn't have written that it seems confirmed.
>> 
>> However, just because my eyes read it doesn't mean that's what happened, and even if it happened doesn't mean that it is guaranteed to happen.
>> 
>>> By the looks of it, you're just creating another version of @b called
>>> "p", which is then never used and would be discarded by further
>>> optimization. >
>>> I'm unsure what you are trying to achieve here.
>> 
>> Simply put I'm trying to let the compiler think that I leaked the address of b. After that, the memory barrier should let it think that the b after the memory barrier might not be the same as the one before it (which was equal to a), forcing it to read from b.
>> 
>> However, I suppose on second thought that that might not be enough, because the compiler could still simply do b = a right after exiting the while loop.
>> 
>> And that is true no matter what we put behind the while loop or before the condition, as long as the condition compares a and b, right after it the compiler can do b = a. Just took me a while to see :))
>> 
>> I'm not sure why gcc does the b=a with the normal OPTIMIZER_HIDE_VAR but (as far as I read the code) doesn't do it with the above. Maybe just a weird corner case...
> 
> Let the p to be a static variable out of the function will make a difference.
> 
> Or the following:
> 
> int **p = &b;
> barrier_data(p);

Or the following:

	int **t = &b;
	WRITE_ONCE(t, &b);
	barrier();
 	return *b;

also works.

> 
> also works.
> 
> BTW, barrier_data(&b) generates more instructions than godbolt when build the kernel.
> 
>> 
>> Have fun,
>> jonas
diff mbox series

Patch

diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
new file mode 100644
index 000000000000..4548ca8c75eb
--- /dev/null
+++ b/include/linux/hazptr.h
@@ -0,0 +1,83 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_HAZPTR_H
+#define _LINUX_HAZPTR_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rbtree.h>
+
+/* Hazard pointer slot. */
+typedef void* hazptr_t;
+
+#define HAZPTR_SLOT_PER_CTX 8
+
+struct hazptr_slot_snap {
+	struct rb_node node;
+	hazptr_t slot;
+};
+
+/*
+ * A set of hazard pointer slots for a context.
+ *
+ * The context can be a task, CPU or reader (depending on the use case). It may
+ * be allocated statically or dynamically. It can only be used after calling
+ * init_hazptr_context(), and users are also responsible for calling
+ * cleanup_hazptr_context() when it is no longer used.
+ */
+struct hazptr_context {
+	// The lock of the percpu context lists.
+	spinlock_t *lock;
+
+	struct list_head list;
+	struct hazptr_slot_snap snaps[HAZPTR_SLOT_PER_CTX];
+	____cacheline_aligned hazptr_t slots[HAZPTR_SLOT_PER_CTX];
+};
+
+void init_hazptr_context(struct hazptr_context *hzcp);
+void cleanup_hazptr_context(struct hazptr_context *hzcp);
+hazptr_t *hazptr_alloc(struct hazptr_context *hzcp);
+void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzp);
+
+#define hazptr_tryprotect(hzp, gp, field)	(typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
+#define hazptr_protect(hzp, gp, field) ({				\
+	typeof(gp) ___p;						\
+									\
+	___p = hazptr_tryprotect(hzp, gp, field);			\
+	BUG_ON(!___p);							\
+	___p;								\
+})
+
+static inline void *__hazptr_tryprotect(hazptr_t *hzp,
+					void *const *p,
+					unsigned long head_offset)
+{
+	void *ptr;
+	struct callback_head *head;
+
+	ptr = READ_ONCE(*p);
+
+	if (ptr == NULL)
+		return NULL;
+
+	head = (struct callback_head *)(ptr + head_offset);
+
+	WRITE_ONCE(*hzp, head);
+	smp_mb();
+
+	ptr = READ_ONCE(*p); // read again
+
+	if (ptr + head_offset != head) { // pointer changed
+		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
+		return NULL;
+	} else
+		return ptr;
+}
+
+static inline void hazptr_clear(hazptr_t *hzp)
+{
+	/* Pairs with smp_load_acquire() in reader scan. */
+	smp_store_release(hzp, NULL);
+}
+
+void call_hazptr(struct callback_head *head, rcu_callback_t func);
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..7927264b9870 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@  obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += hazptr.o
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/hazptr.c b/kernel/hazptr.c
new file mode 100644
index 000000000000..f22ccc2a4a62
--- /dev/null
+++ b/kernel/hazptr.c
@@ -0,0 +1,463 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/spinlock.h>
+#include <linux/cleanup.h>
+#include <linux/hazptr.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+
+#define HAZPTR_UNUSED (1ul)
+
+/* Per-CPU data for hazard pointer management. */
+struct hazptr_percpu {
+	spinlock_t ctx_lock;		/* hazptr context list lock */
+	struct list_head ctx_list;	/* hazptr context list */
+	spinlock_t callback_lock;	/* Per-CPU callback queue lock */
+	struct callback_head *queued;	/* Per-CPU callback queue */
+	struct callback_head **tail;
+};
+
+/*
+ * Per-CPU data contains context lists and callbacks, which are maintained in a
+ * per-CPU manner to reduce potential contention. This means a global scan (for
+ * readers or callbacks) has to take each CPU's lock, but it's less problematic
+ * because that's a slowpath.
+ */
+DEFINE_PER_CPU(struct hazptr_percpu, hzpcpu);
+
+/* An rbtree that stores the reader-scan results of all hazptr slots. */
+struct hazptr_reader_tree {
+	spinlock_t lock;
+	struct rb_root root;
+};
+
+static void init_hazptr_reader_tree(struct hazptr_reader_tree *tree)
+{
+	spin_lock_init(&tree->lock);
+	tree->root = RB_ROOT;
+}
+
+static bool is_null_or_unused(hazptr_t slot)
+{
+	return slot == NULL || ((unsigned long)slot) == HAZPTR_UNUSED;
+}
+
+static int hazptr_node_cmp(const void *key, const struct rb_node *curr)
+{
+	unsigned long slot = (unsigned long)key;
+	struct hazptr_slot_snap *curr_snap = container_of(curr, struct hazptr_slot_snap, node);
+	unsigned long curr_slot = (unsigned long)(curr_snap->slot);
+
+	if (slot < curr_slot)
+		return -1;
+	else if (slot > curr_slot)
+		return 1;
+	else
+		return 0;
+}
+
+static bool hazptr_node_less(struct rb_node *new, const struct rb_node *curr)
+{
+	struct hazptr_slot_snap *new_snap = container_of(new, struct hazptr_slot_snap, node);
+
+	return hazptr_node_cmp((void *)new_snap->slot, curr) < 0;
+}
+
+/* Add the snapshot into a search tree. tree->lock must be held. */
+static inline void reader_add_locked(struct hazptr_reader_tree *tree,
+				     struct hazptr_slot_snap *snap)
+{
+	lockdep_assert_held(&tree->lock);
+	BUG_ON(is_null_or_unused(snap->slot));
+
+	rb_add(&snap->node, &tree->root, hazptr_node_less);
+}
+
+static inline void reader_add(struct hazptr_reader_tree *tree,
+			      struct hazptr_slot_snap *snap)
+{
+	guard(spinlock_irqsave)(&tree->lock);
+
+	reader_add_locked(tree, snap);
+}
+
+/* Delete the snapshot from a search tree. tree->lock must be held. */
+static inline void reader_del_locked(struct hazptr_reader_tree *tree,
+				     struct hazptr_slot_snap *snap)
+{
+	lockdep_assert_held(&tree->lock);
+
+	rb_erase(&snap->node, &tree->root);
+}
+
+static inline void reader_del(struct hazptr_reader_tree *tree,
+			      struct hazptr_slot_snap *snap)
+{
+	guard(spinlock_irqsave)(&tree->lock);
+
+	reader_del_locked(tree, snap);
+}
+
+/* Find whether a reader exists. tree->lock must be held. */
+static inline bool reader_exist_locked(struct hazptr_reader_tree *tree,
+				       unsigned long slot)
+{
+	lockdep_assert_held(&tree->lock);
+
+	return !!rb_find((void *)slot, &tree->root, hazptr_node_cmp);
+}
+
+static inline bool reader_exist(struct hazptr_reader_tree *tree,
+				unsigned long slot)
+{
+	guard(spinlock_irqsave)(&tree->lock);
+
+	return reader_exist_locked(tree, slot);
+}
+
+/*
+ * Scan the readers from one hazptr context and update the global readers tree.
+ *
+ * Must be called with hzcp->lock held.
+ */
+static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
+					       struct hazptr_context *hzcp)
+{
+	lockdep_assert_held(hzcp->lock);
+
+	for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+		/*
+		 * Pairs with smp_store_release() in hazptr_{clear,free}().
+		 *
+		 * Ensure
+		 *
+		 * <reader>		<updater>
+		 *
+		 * [access protected pointers]
+		 * hazptr_clear();
+		 *   smp_store_release()
+		 *			// in reader scan.
+		 *			smp_load_acquire(); // is null or unused.
+		 *			[run callbacks] // all accesses from
+		 *					// reader must be
+		 *					// observed.
+		 */
+		hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
+
+		if (!is_null_or_unused(val)) {
+			struct hazptr_slot_snap *snap = &hzcp->snaps[i];
+
+			// Already in the tree, need to remove first.
+			if (!is_null_or_unused(snap->slot)) {
+				reader_del(tree, snap);
+			}
+			snap->slot = val;
+			reader_add(tree, snap);
+		}
+	}
+}
+
+/*
+ * Initialize a hazptr context.
+ *
+ * Must be called before using the context for hazptr allocation.
+ */
+void init_hazptr_context(struct hazptr_context *hzcp)
+{
+	struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
+
+	for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+		hzcp->slots[i] = (hazptr_t)HAZPTR_UNUSED;
+		hzcp->snaps[i].slot = (hazptr_t)HAZPTR_UNUSED;
+	}
+
+	guard(spinlock_irqsave)(&pcpu->ctx_lock);
+	list_add(&hzcp->list, &pcpu->ctx_list);
+	hzcp->lock = &pcpu->ctx_lock;
+}
+
+struct hazptr_struct {
+	struct work_struct work;
+	bool scheduled;
+
+	// List of callbacks; we move per-CPU queued callbacks into this global
+	// queued list in the workqueue function.
+	spinlock_t callback_lock;
+	struct callback_head *queued;
+	struct callback_head **tail;
+
+	struct hazptr_reader_tree readers;
+};
+
+static struct hazptr_struct hazptr_struct;
+
+/*
+ * Clean up hazptr context.
+ *
+ * Must be called before freeing the context. This function also removes any
+ * reference held by the hazard pointer slots in the context, even if
+ * hazptr_clear() or hazptr_free() was not called previously.
+ */
+void cleanup_hazptr_context(struct hazptr_context *hzcp)
+{
+	if (hzcp->lock) {
+		scoped_guard(spinlock_irqsave, hzcp->lock) {
+			list_del(&hzcp->list);
+			hzcp->lock = NULL;
+		}
+
+		for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+			struct hazptr_slot_snap *snap = &hzcp->snaps[i];
+
+			if (!is_null_or_unused(snap->slot))
+				reader_del(&hazptr_struct.readers, snap);
+		}
+	}
+}
+
+/*
+ * Allocate a hazptr slot from a hazptr_context.
+ *
+ * Return: NULL if the allocation fails, otherwise the address of the allocated
+ * slot.
+ */
+hazptr_t *hazptr_alloc(struct hazptr_context *hzcp)
+{
+	unsigned long unused;
+
+	for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+		if (((unsigned long)READ_ONCE(hzcp->slots[i])) == HAZPTR_UNUSED) {
+			unused = HAZPTR_UNUSED;
+
+			/*
+			 * rwm-sequence is relied on here.
+			 *
+			 * This is needed since in case of a previous reader:
+			 *
+			 * <reader 1>		<reader 2>		<updater>
+			 * [access protected pointers]
+			 * hazptr_free():
+			 *   smp_store_release(); // hzptr == UNUSED
+			 *			hazptr_alloc():
+			 *			  try_cmpxchg_relaxed(); // hzptr == NULL
+			 *
+			 *						// in reader scan
+			 *						smp_load_acquire(); // is null
+			 *						[run callbacks]
+			 *
+			 * Because of the rwm-sequence of release operations,
+			 * when callbacks are run, accesses from reader 1 must
+			 * be already observed by the updater.
+			 */
+			if (try_cmpxchg_relaxed(&hzcp->slots[i], &unused, NULL)) {
+				return (hazptr_t *)&hzcp->slots[i];
+			}
+		}
+	}
+
+	return NULL;
+}
+
+/* Free a hazptr slot. */
+void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzptr)
+{
+	WARN_ON(((unsigned long)*hzptr) == HAZPTR_UNUSED);
+
+	/* Pairs with smp_load_acquire() in reader scan */
+	smp_store_release(hzptr, (void *)HAZPTR_UNUSED);
+}
+
+/* Scan all possible readers, and update the global reader tree. */
+static void check_readers(struct hazptr_struct *hzst)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+		struct hazptr_context *ctx;
+
+		guard(spinlock_irqsave)(&pcpu->ctx_lock);
+		list_for_each_entry(ctx, &pcpu->ctx_list, list)
+			hazptr_context_snap_readers_locked(&hzst->readers, ctx);
+	}
+}
+
+/*
+ * Start the background work for hazptr callback handling if not started.
+ *
+ * Must be called with hazptr_struct.callback_lock held.
+ */
+static void kick_hazptr_work(void)
+{
+	if (hazptr_struct.scheduled)
+		return;
+
+	queue_work(system_wq, &hazptr_struct.work);
+	hazptr_struct.scheduled = true;
+}
+
+/*
+ * Check which callbacks are ready to be called.
+ *
+ * Return: a list of callbacks whose corresponding objects are no longer
+ * referenced by any reader.
+ */
+static struct callback_head *do_hazptr(struct hazptr_struct *hzst)
+{
+	struct callback_head *tmp, **curr;
+	struct callback_head *todo = NULL, **todo_tail = &todo;
+
+	// Currently, the lock is unnecessary, but maybe we will have multiple
+	// work_structs sharing the same callback list in the future.
+	guard(spinlock_irqsave)(&hzst->callback_lock);
+
+	curr = &hzst->queued;
+
+	while ((tmp = *curr)) {
+		// No reader, we can free the object.
+		if (!reader_exist(&hzst->readers, (unsigned long)tmp)) {
+			// Add tmp into todo.
+			*todo_tail = tmp;
+			todo_tail = &tmp->next;
+
+			// Delete tmp from ->queued and move to the next entry.
+			*curr = tmp->next;
+		} else
+			curr = &tmp->next;
+	}
+
+	hzst->tail = curr;
+
+	// Keep checking if callback list is not empty.
+	if (hzst->queued)
+		kick_hazptr_work();
+
+	*todo_tail = NULL;
+
+	return todo;
+}
+
+static void hazptr_work_func(struct work_struct *work)
+{
+	struct hazptr_struct *hzst = container_of(work, struct hazptr_struct, work);
+	struct callback_head *todo;
+
+	// Advance callbacks from hzpcpu to hzst
+	scoped_guard(spinlock_irqsave, &hzst->callback_lock) {
+		int cpu;
+
+		hzst->scheduled = false;
+		for_each_possible_cpu(cpu) {
+			struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+
+			guard(spinlock)(&pcpu->callback_lock);
+
+			if (pcpu->queued) {
+				*(hzst->tail) = pcpu->queued;
+				hzst->tail = pcpu->tail;
+				pcpu->queued = NULL;
+				pcpu->tail = &pcpu->queued;
+			}
+		}
+	}
+
+	// Pairs with the smp_mb() on the reader side. This guarantees that if
+	// the hazptr work picked up the callback from an updater and the
+	// updater set the global pointer to NULL before enqueue the callback,
+	// updater set the global pointer to NULL before enqueuing the callback,
+	// the updater.
+	//
+	// <reader>		<updater>		<hazptr work>
+	// ptr = READ_ONCE(gp);
+	// WRITE_ONCE(*hazptr, ptr);
+	// smp_mb(); // => ->strong-fence
+	//			tofree = gp;
+	// ptr = READ_ONCE(gp); // re-read, gp is not NULL
+	//			// => ->fre
+	//			WRITE_ONCE(gp, NULL);
+	//			call_hazptr(gp, ...):
+	//			  lock(->callback_lock);
+	//			  [queued the callback]
+	//			  unlock(->callback_lock);// => ->po-unlock-lock-po
+	//						lock(->callback_lock);
+	//						[move from hzpcpu to hzst]
+	//
+	//						smp_mb(); => ->strong-fence
+	//						ptr = READ_ONCE(*hazptr);
+	//						// ptr == gp, otherwise => ->fre
+	//
+	// If ptr != gp, it means there exists a circle with the following
+	// memory ordering relationships:
+	//	->strong-fence ->fre ->po-unlock-lock-po ->strong-fence ->fre
+	// => (due to the definition of prop)
+	//	->strong-fence ->prop ->strong-fence ->hb ->prop
+	// => (rotate the circle)
+	//	->prop ->strong-fence ->prop ->strong-fence ->hb
+	// => (due to the definition of pb)
+	//	->pb ->pb
+	// but pb is acyclic according to LKMM, so this cannot happen.
+	smp_mb();
+	check_readers(hzst);
+
+	todo = do_hazptr(hzst);
+
+	while (todo) {
+		struct callback_head *next = todo->next;
+		void (*func)(struct callback_head *) = todo->func;
+
+		func(todo);
+
+		todo = next;
+	}
+}
+
+/*
+ * Put @head into the cleanup callback queue.
+ *
+ * @func(@head) will be called after no one is referencing the object. Caller
+ * must ensure the object has been unpublished before calling this.
+ */
+void call_hazptr(struct callback_head *head, rcu_callback_t func)
+{
+	struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
+
+	head->func = func;
+	head->next = NULL;
+
+	scoped_guard(spinlock_irqsave, &pcpu->callback_lock) {
+		*(pcpu->tail) = head;
+		pcpu->tail = &head->next;
+	}
+
+	guard(spinlock_irqsave)(&hazptr_struct.callback_lock);
+	kick_hazptr_work();
+}
+
+static int init_hazptr_struct(void)
+{
+	int cpu;
+
+	INIT_WORK(&hazptr_struct.work, hazptr_work_func);
+
+	spin_lock_init(&hazptr_struct.callback_lock);
+	hazptr_struct.queued = NULL;
+	hazptr_struct.tail = &hazptr_struct.queued;
+
+	for_each_possible_cpu(cpu) {
+		struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+
+		spin_lock_init(&pcpu->ctx_lock);
+		INIT_LIST_HEAD(&pcpu->ctx_list);
+
+		spin_lock_init(&pcpu->callback_lock);
+		pcpu->queued = NULL;
+		pcpu->tail = &pcpu->queued;
+
+	}
+
+	init_hazptr_reader_tree(&hazptr_struct.readers);
+
+	return 0;
+}
+
+early_initcall(init_hazptr_struct);
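
For readers following along, here is a minimal usage sketch of the API
introduced by this patch (not part of the patch itself; 'struct foo',
'gp', 'foo_ctx' and 'foo_free_cb' are hypothetical, and a single shared
context plus an unsynchronized updater are used purely for brevity):

#include <linux/hazptr.h>
#include <linux/slab.h>

struct foo {
	struct callback_head head;	/* used by call_hazptr() */
	int data;
};

static struct foo *gp;			/* published object */
static struct hazptr_context foo_ctx;	/* init_hazptr_context(&foo_ctx) at setup */

static void foo_free_cb(struct callback_head *head)
{
	kfree(container_of(head, struct foo, head));
}

/* Reader: dereference gp without taking a reference count. */
static int foo_read(void)
{
	hazptr_t *hzp = hazptr_alloc(&foo_ctx);
	struct foo *p;
	int val = -1;

	if (!hzp)
		return -1;

	p = hazptr_tryprotect(hzp, gp, head);
	if (p) {
		val = p->data;		/* p cannot be freed here */
		hazptr_clear(hzp);
	}
	hazptr_free(&foo_ctx, hzp);
	return val;
}

/* Updater: unpublish the old object, then hand it to call_hazptr(). */
static void foo_replace(struct foo *newp)
{
	struct foo *old = gp;		/* updaters assumed serialized */

	WRITE_ONCE(gp, newp);
	if (old)
		call_hazptr(&old->head, foo_free_cb);
}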