diff mbox series

[RFC,v2,3/4] hp: Implement Hazard Pointers

Message ID 20241004182734.1761555-4-mathieu.desnoyers@efficios.com (mailing list archive)
State Superseded
Headers show
Series sched+mm: Track lazy active mm existence with hazard pointers | expand

Commit Message

Mathieu Desnoyers Oct. 4, 2024, 6:27 p.m. UTC
This API provides existence guarantees of objects through Hazard
Pointers (HP). This minimalist implementation is specific to use
with preemption disabled, but can be extended further as needed.

Each HP domain defines a fixed number of hazard pointer slots (nr_cpus)
across the entire system.

Its main benefit over RCU is that it allows fast reclaim of
HP-protected pointers without needing to wait for a grace period.

It also allows the hazard pointer scan to call a user-defined callback
to retire a hazard pointer slot immediately if needed. This callback
may, for instance, issue an IPI to the relevant CPU.

There are a few possible use-cases for this in the Linux kernel:

  - Improve performance of mm_count by replacing lazy active mm by HP.
  - Guarantee object existence on pointer dereference to use refcount:
    - replace locking used for that purpose in some drivers,
    - replace RCU + inc_not_zero pattern,
  - rtmutex: Improve situations where locks need to be taken in
    reverse dependency chain order by guaranteeing existence of
    first and second locks in traversal order, allowing them to be
    locked in the correct order (which is reverse from traversal
    order) rather than try-lock+retry on nested lock.

References:

[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
     lock-free objects," in IEEE Transactions on Parallel and
     Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004

Link: https://lore.kernel.org/lkml/j3scdl5iymjlxavomgc6u5ndg3svhab6ga23dr36o4f5mt333w@7xslvq6b6hmv/
Link: https://lpc.events/event/18/contributions/1731/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
---
Changes since v0:
- Remove slot variable from hp_dereference_allocate().
---
 include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/Makefile    |   2 +-
 kernel/hp.c        |  46 +++++++++++++
 3 files changed, 205 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hp.h
 create mode 100644 kernel/hp.c

Comments

Joel Fernandes Oct. 4, 2024, 9:25 p.m. UTC | #1
On Fri, Oct 4, 2024 at 2:29 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This API provides existence guarantees of objects through Hazard
> Pointers (HP). This minimalist implementation is specific to use
> with preemption disabled, but can be extended further as needed.
>
> Each HP domain defines a fixed number of hazard pointer slots (nr_cpus)
> across the entire system.
>
> Its main benefit over RCU is that it allows fast reclaim of
> HP-protected pointers without needing to wait for a grace period.
>
> It also allows the hazard pointer scan to call a user-defined callback
> to retire a hazard pointer slot immediately if needed. This callback
> may, for instance, issue an IPI to the relevant CPU.
>
> There are a few possible use-cases for this in the Linux kernel:
>
>   - Improve performance of mm_count by replacing lazy active mm by HP.
>   - Guarantee object existence on pointer dereference to use refcount:
>     - replace locking used for that purpose in some drivers,
>     - replace RCU + inc_not_zero pattern,
>   - rtmutex: Improve situations where locks need to be taken in
>     reverse dependency chain order by guaranteeing existence of
>     first and second locks in traversal order, allowing them to be
>     locked in the correct order (which is reverse from traversal
>     order) rather than try-lock+retry on nested lock.
>
> References:
>
> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
>      lock-free objects," in IEEE Transactions on Parallel and
>      Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
[ ... ]
> ---
> Changes since v0:
> - Remove slot variable from hp_dereference_allocate().
> ---
>  include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/Makefile    |   2 +-
>  kernel/hp.c        |  46 +++++++++++++

Just a housekeeping comment, ISTR Linus looking down on adding bodies
of C code to header files (like hp_dereference_allocate). I understand
maybe the rationale is that the functions included are inlined. But do
all of them have to be inlined? Such headers also hurt code browsing
capabilities in code browsers like clangd. clangd doesn't understand
header files because it can't independently compile them -- it uses
the compiler to generate and extract the AST for superior code
browsing/completion.

Also have you looked at the benefits of inlining for hp.h?
hp_dereference_allocate() seems large enough that inlining may not
matter much, but I haven't compiled it and looked at the asm myself.

Will continue staring at the code.

thanks,

  - Joel


>  3 files changed, 205 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/hp.h
>  create mode 100644 kernel/hp.c
>
> diff --git a/include/linux/hp.h b/include/linux/hp.h
> new file mode 100644
> index 000000000000..e85fc4365ea2
> --- /dev/null
> +++ b/include/linux/hp.h
> @@ -0,0 +1,158 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +#ifndef _LINUX_HP_H
> +#define _LINUX_HP_H
> +
> +/*
> + * HP: Hazard Pointers
> + *
> + * This API provides existence guarantees of objects through hazard
> + * pointers.
> + *
> + * It uses a fixed number of hazard pointer slots (nr_cpus) across the
> + * entire system for each HP domain.
> + *
> + * Its main benefit over RCU is that it allows fast reclaim of
> + * HP-protected pointers without needing to wait for a grace period.
> + *
> + * It also allows the hazard pointer scan to call a user-defined callback
> + * to retire a hazard pointer slot immediately if needed. This callback
> + * may, for instance, issue an IPI to the relevant CPU.
> + *
> + * References:
> + *
> + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
> + *      lock-free objects," in IEEE Transactions on Parallel and
> + *      Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> + */
> +
> +#include <linux/rcupdate.h>
> +
> +/*
> + * Hazard pointer slot.
> + */
> +struct hp_slot {
> +       void *addr;
> +};
> +
> +/*
> + * Hazard pointer context, returned by hp_use().
> + */
> +struct hp_ctx {
> +       struct hp_slot *slot;
> +       void *addr;
> +};
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is NULL, wait to observe that each slot contains a value
> + * that differs from @addr.
> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
> + * @addr.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> +            void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));
> +
> +/* Get the hazard pointer context address (may be NULL). */
> +static inline
> +void *hp_ctx_addr(struct hp_ctx ctx)
> +{
> +       return ctx.addr;
> +}
> +
> +/*
> + * hp_allocate: Allocate a hazard pointer.
> + *
> + * Allocate a hazard pointer slot for @addr. The object existence should
> + * be guaranteed by the caller. Expects to be called from preempt
> + * disable context.
> + *
> + * Returns a hazard pointer context.
> + */
> +static inline
> +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
> +{
> +       struct hp_slot *slot;
> +       struct hp_ctx ctx;
> +
> +       if (!addr)
> +               goto fail;
> +       slot = this_cpu_ptr(percpu_slots);
> +       /*
> +        * A single hazard pointer slot per CPU is available currently.
> +        * Other hazard pointer domains can eventually have a different
> +        * configuration.
> +        */
> +       if (READ_ONCE(slot->addr))
> +               goto fail;
> +       WRITE_ONCE(slot->addr, addr);   /* Store B */
> +       ctx.slot = slot;
> +       ctx.addr = addr;
> +       return ctx;
> +
> +fail:
> +       ctx.slot = NULL;
> +       ctx.addr = NULL;
> +       return ctx;
> +}
> +
> +/*
> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> + *
> + * Returns a hazard pointer context. Expects to be called from preempt
> + * disable context.
> + */
> +static inline
> +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
> +{
> +       void *addr, *addr2;
> +       struct hp_ctx ctx;
> +
> +       addr = READ_ONCE(*addr_p);
> +retry:
> +       ctx = hp_allocate(percpu_slots, addr);
> +       if (!hp_ctx_addr(ctx))
> +               goto fail;
> +       /* Memory ordering: Store B before Load A. */
> +       smp_mb();
> +       /*
> +        * Use RCU dereference without lockdep checks, because
> +        * lockdep is not aware of HP guarantees.
> +        */
> +       addr2 = rcu_access_pointer(*addr_p);    /* Load A */
> +       /*
> +        * If @addr_p content has changed since the first load,
> +        * clear the hazard pointer and try again.
> +        */
> +       if (!ptr_eq(addr2, addr)) {
> +               WRITE_ONCE(ctx.slot->addr, NULL);
> +               if (!addr2)
> +                       goto fail;
> +               addr = addr2;
> +               goto retry;
> +       }
> +       /*
> +        * Use addr2 loaded from rcu_access_pointer() to preserve
> +        * address dependency ordering.
> +        */
> +       ctx.addr = addr2;
> +       return ctx;
> +
> +fail:
> +       ctx.slot = NULL;
> +       ctx.addr = NULL;
> +       return ctx;
> +}
> +
> +/* Retire the hazard pointer in @ctx. */
> +static inline
> +void hp_retire(const struct hp_ctx ctx)
> +{
> +       smp_store_release(&ctx.slot->addr, NULL);
> +}
> +
> +#endif /* _LINUX_HP_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3c13240dfc9f..ec16de96fa80 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y     = fork.o exec_domain.o panic.o \
>             cpu.o exit.o softirq.o resource.o \
>             sysctl.o capability.o ptrace.o user.o \
>             signal.o sys.o umh.o workqueue.o pid.o task_work.o \
> -           extable.o params.o \
> +           extable.o params.o hp.o \
>             kthread.o sys_ni.o nsproxy.o \
>             notifier.o ksysfs.o cred.o reboot.o \
>             async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
> diff --git a/kernel/hp.c b/kernel/hp.c
> new file mode 100644
> index 000000000000..b2447bf15300
> --- /dev/null
> +++ b/kernel/hp.c
> @@ -0,0 +1,46 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +/*
> + * HP: Hazard Pointers
> + */
> +
> +#include <linux/hp.h>
> +#include <linux/percpu.h>
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
> + * @addr.
> + * Wait to observe that each slot contains a value that differs from
> + * @addr before returning.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> +            void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> +{
> +       int cpu;
> +
> +       /*
> +        * Store A precedes hp_scan(): it unpublishes addr (sets it to
> +        * NULL or to a different value), and thus hides it from hazard
> +        * pointer readers.
> +        */
> +
> +       if (!addr)
> +               return;
> +       /* Memory ordering: Store A before Load B. */
> +       smp_mb();
> +       /* Scan all CPUs slots. */
> +       for_each_possible_cpu(cpu) {
> +               struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> +
> +               if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> +                       retire_cb(cpu, slot, addr);
> +               /* Busy-wait if node is found. */
> +               while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
> +                       cpu_relax();
> +       }
> +}
> --
> 2.39.2
>
Frederic Weisbecker Oct. 5, 2024, 11:19 a.m. UTC | #2
Le Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers a écrit :
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> +{
> +	int cpu;
> +
> +	/*
> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
> +	 * NULL or to a different value), and thus hides it from hazard
> +	 * pointer readers.
> +	 */
> +
> +	if (!addr)
> +		return;
> +	/* Memory ordering: Store A before Load B. */
> +	smp_mb();
> +	/* Scan all CPUs slots. */
> +	for_each_possible_cpu(cpu) {
> +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> +
> +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
> +			retire_cb(cpu, slot, addr);
> +		/* Busy-wait if node is found. */
> +		while ((smp_load_acquire(&slot->addr)) == addr)	/* Load B */
> +			cpu_relax();

You agree that having a single possible per-cpu pointer per context and a busy
waiting update side pointer release can't be a general purpose hazard pointer
implementation, right? :-)

Thanks.

> +	}
> +}
> -- 
> 2.39.2
>
Mathieu Desnoyers Oct. 5, 2024, 11:42 a.m. UTC | #3
On 2024-10-05 13:19, Frederic Weisbecker wrote:
> Le Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers a écrit :
>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>> +{
>> +	int cpu;
>> +
>> +	/*
>> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
>> +	 * NULL or to a different value), and thus hides it from hazard
>> +	 * pointer readers.
>> +	 */
>> +
>> +	if (!addr)
>> +		return;
>> +	/* Memory ordering: Store A before Load B. */
>> +	smp_mb();
>> +	/* Scan all CPUs slots. */
>> +	for_each_possible_cpu(cpu) {
>> +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>> +
>> +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
>> +			retire_cb(cpu, slot, addr);
>> +		/* Busy-wait if node is found. */
>> +		while ((smp_load_acquire(&slot->addr)) == addr)	/* Load B */
>> +			cpu_relax();
> 
> You agree that having a single possible per-cpu pointer per context and a busy
> waiting update side pointer release can't be a general purpose hazard pointer
> implementation, right? :-)

Of course. This is a minimalist implementation, which can be extended in
various ways, some of which I've implemented as POC in userspace already:

- Increase the number of per-cpu slots available,
- Distinguish between current scan depth target and available
   per-cpu slots,
- Fall-back to reference counter when slots are full,
- Allow scanning for a range of addresses (useful for type-safe
   memory),
- Allow scanning for a set of hazard pointers (scan batching)
   using Bloom filters to probabilistically speed up the comparison
   (not implemented yet).
- Implement a queued blocking wait/wakeup when HP scan must wait
   (not implemented yet).
- Implement a HP-to-refcount promotion triggered by the HP scan
   callback to promote hazard pointers which would be blocked on
   to a reference count increment. (not implemented yet)
- Use hazard pointers + refcount to implement smart pointers, which
   could be useful for Rust. (not implemented yet)

But my general approach is to wait until the use-cases justify adding
features.

Although if you are curious about any of the points listed above,
just ask and I'll be happy to discuss them in more depth.

Thanks,

Mathieu

> 
> Thanks.
> 
>> +	}
>> +}
>> -- 
>> 2.39.2
>>
Mathieu Desnoyers Oct. 5, 2024, 12:05 p.m. UTC | #4
On 2024-10-04 23:25, Joel Fernandes wrote:
> On Fri, Oct 4, 2024 at 2:29 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This API provides existence guarantees of objects through Hazard
>> Pointers (HP). This minimalist implementation is specific to use
>> with preemption disabled, but can be extended further as needed.
>>
>> Each HP domain defines a fixed number of hazard pointer slots (nr_cpus)
>> across the entire system.
>>
>> Its main benefit over RCU is that it allows fast reclaim of
>> HP-protected pointers without needing to wait for a grace period.
>>
>> It also allows the hazard pointer scan to call a user-defined callback
>> to retire a hazard pointer slot immediately if needed. This callback
>> may, for instance, issue an IPI to the relevant CPU.
>>
>> There are a few possible use-cases for this in the Linux kernel:
>>
>>    - Improve performance of mm_count by replacing lazy active mm by HP.
>>    - Guarantee object existence on pointer dereference to use refcount:
>>      - replace locking used for that purpose in some drivers,
>>      - replace RCU + inc_not_zero pattern,
>>    - rtmutex: Improve situations where locks need to be taken in
>>      reverse dependency chain order by guaranteeing existence of
>>      first and second locks in traversal order, allowing them to be
>>      locked in the correct order (which is reverse from traversal
>>      order) rather than try-lock+retry on nested lock.
>>
>> References:
>>
>> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
>>       lock-free objects," in IEEE Transactions on Parallel and
>>       Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> [ ... ]
>> ---
>> Changes since v0:
>> - Remove slot variable from hp_dereference_allocate().
>> ---
>>   include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
>>   kernel/Makefile    |   2 +-
>>   kernel/hp.c        |  46 +++++++++++++
> 
> Just a housekeeping comment, ISTR Linus looking down on adding bodies
> of C code to header files (like hp_dereference_allocate). I understand
> maybe the rationale is that the functions included are inlined. But do
> all of them have to be inlined? Such headers also hurt code browsing
> capabilities in code browsers like clangd. clangd doesn't understand
> header files because it can't independently compile them -- it uses
> the compiler to generate and extract the AST for superior code
> browsing/completion.
> 
> Also have you looked at the benefits of inlining for hp.h?
> hp_dereference_allocate() seems large enough that inlining may not
> matter much, but I haven't compiled it and looked at the asm myself.

Here is a comparison in userspace:

* With "hp dereference allocate" inlined:

     test_hpref_benchmark (smp_mb)             nr_reads   1994298193 nr_writes     22293162 nr_ops   2016591355
     test_hpref_benchmark (barrier/membarrier) nr_reads  15208690879 nr_writes      1893785 nr_ops  15210584664

* With "hp dereference allocate" implemented as a function call:

     test_hpref_benchmark (smp_mb)             nr_reads   1558924716 nr_writes     14261028 nr_ops   1573185744
     test_hpref_benchmark (barrier/membarrier) nr_reads   5881131707 nr_writes      2005140 nr_ops   5883136847

So the overhead of the function call when using symmetric memory barriers
between hp allocate/hp scan is a 20% slowdown.

It's worse in the asymmetric barrier/membarrier case, introducing a 61%
slowdown.

Given that the overhead is noticeable, I am tempted to leave the hazard
pointer allocate/retire as inline functions.

About code browsers like clangd, I would recommend improving the tooling
rather than alter the design of the code based on current tooling
limitations.

Thanks,

Mathieu
Peter Zijlstra Oct. 5, 2024, 4:04 p.m. UTC | #5
On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
>  include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/Makefile    |   2 +-
>  kernel/hp.c        |  46 +++++++++++++
>  3 files changed, 205 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/hp.h
>  create mode 100644 kernel/hp.c
> 
> diff --git a/include/linux/hp.h b/include/linux/hp.h
> new file mode 100644
> index 000000000000..e85fc4365ea2
> --- /dev/null
> +++ b/include/linux/hp.h
> @@ -0,0 +1,158 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +#ifndef _LINUX_HP_H
> +#define _LINUX_HP_H
> +
> +/*
> + * HP: Hazard Pointers
> + *
> + * This API provides existence guarantees of objects through hazard
> + * pointers.
> + *
> + * It uses a fixed number of hazard pointer slots (nr_cpus) across the
> + * entire system for each HP domain.
> + *
> + * Its main benefit over RCU is that it allows fast reclaim of
> + * HP-protected pointers without needing to wait for a grace period.
> + *
> + * It also allows the hazard pointer scan to call a user-defined callback
> + * to retire a hazard pointer slot immediately if needed. This callback
> + * may, for instance, issue an IPI to the relevant CPU.
> + *
> + * References:
> + *
> + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
> + *      lock-free objects," in IEEE Transactions on Parallel and
> + *      Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> + */
> +
> +#include <linux/rcupdate.h>
> +
> +/*
> + * Hazard pointer slot.
> + */
> +struct hp_slot {
> +	void *addr;
> +};
> +
> +/*
> + * Hazard pointer context, returned by hp_use().
> + */
> +struct hp_ctx {
> +	struct hp_slot *slot;
> +	void *addr;
> +};
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is NULL, wait to observe that each slot contains a value
> + * that differs from @addr.
> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
> + * @addr.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));

struct hp_domain {
	struct hp_slot __percpu *slots
};

might clarify things a wee little.

> +
> +/* Get the hazard pointer context address (may be NULL). */
> +static inline
> +void *hp_ctx_addr(struct hp_ctx ctx)
> +{
> +	return ctx.addr;
> +}

From where I'm sitting this seems like superfluous fluff, what's wrong
with ctx.addr ?

> +/*
> + * hp_allocate: Allocate a hazard pointer.
> + *
> + * Allocate a hazard pointer slot for @addr. The object existence should
> + * be guaranteed by the caller. Expects to be called from preempt
> + * disable context.
> + *
> + * Returns a hazard pointer context.

So you made the WTF'o'meter crack, this here function does not allocate
nothing. Naming is bad. At best this is something like
try-set-hazard-pointer or somesuch.

> + */
> +static inline
> +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
> +{
> +	struct hp_slot *slot;
> +	struct hp_ctx ctx;
> +
> +	if (!addr)
> +		goto fail;
> +	slot = this_cpu_ptr(percpu_slots);
> +	/*
> +	 * A single hazard pointer slot per CPU is available currently.
> +	 * Other hazard pointer domains can eventually have a different
> +	 * configuration.
> +	 */
> +	if (READ_ONCE(slot->addr))
> +		goto fail;
> +	WRITE_ONCE(slot->addr, addr);	/* Store B */
> +	ctx.slot = slot;
> +	ctx.addr = addr;
> +	return ctx;
> +
> +fail:
> +	ctx.slot = NULL;
> +	ctx.addr = NULL;
> +	return ctx;
> +}
> +
> +/*
> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> + *
> + * Returns a hazard pointer context. Expects to be called from preempt
> + * disable context.
> + */

More terrible naming. Same as above, but additionally, I would expect a
'dereference' to actually dereference the pointer and have a return
value of the dereferenced type.

This function seems to double check and update the hp_ctx thing. I'm not
at all sure yet wtf this is doing -- and the total lack of comments
aren't helping.

> +static inline
> +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
> +{
> +	void *addr, *addr2;
> +	struct hp_ctx ctx;
> +
> +	addr = READ_ONCE(*addr_p);
> +retry:
> +	ctx = hp_allocate(percpu_slots, addr);
> +	if (!hp_ctx_addr(ctx))
> +		goto fail;
> +	/* Memory ordering: Store B before Load A. */
> +	smp_mb();
> +	/*
> +	 * Use RCU dereference without lockdep checks, because
> +	 * lockdep is not aware of HP guarantees.
> +	 */
> +	addr2 = rcu_access_pointer(*addr_p);	/* Load A */
> +	/*
> +	 * If @addr_p content has changed since the first load,
> +	 * clear the hazard pointer and try again.
> +	 */
> +	if (!ptr_eq(addr2, addr)) {
> +		WRITE_ONCE(ctx.slot->addr, NULL);
> +		if (!addr2)
> +			goto fail;
> +		addr = addr2;
> +		goto retry;
> +	}
> +	/*
> +	 * Use addr2 loaded from rcu_access_pointer() to preserve
> +	 * address dependency ordering.
> +	 */
> +	ctx.addr = addr2;
> +	return ctx;
> +
> +fail:
> +	ctx.slot = NULL;
> +	ctx.addr = NULL;
> +	return ctx;
> +}
> +
> +/* Retire the hazard pointer in @ctx. */
> +static inline
> +void hp_retire(const struct hp_ctx ctx)
> +{
> +	smp_store_release(&ctx.slot->addr, NULL);
> +}
> +
> +#endif /* _LINUX_HP_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3c13240dfc9f..ec16de96fa80 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y     = fork.o exec_domain.o panic.o \
>  	    cpu.o exit.o softirq.o resource.o \
>  	    sysctl.o capability.o ptrace.o user.o \
>  	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
> -	    extable.o params.o \
> +	    extable.o params.o hp.o \
>  	    kthread.o sys_ni.o nsproxy.o \
>  	    notifier.o ksysfs.o cred.o reboot.o \
>  	    async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
> diff --git a/kernel/hp.c b/kernel/hp.c
> new file mode 100644
> index 000000000000..b2447bf15300
> --- /dev/null
> +++ b/kernel/hp.c
> @@ -0,0 +1,46 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +/*
> + * HP: Hazard Pointers
> + */
> +
> +#include <linux/hp.h>
> +#include <linux/percpu.h>
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
> + * @addr.
> + * Wait to observe that each slot contains a value that differs from
> + * @addr before returning.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> +{
> +	int cpu;
> +
> +	/*
> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
> +	 * NULL or to a different value), and thus hides it from hazard
> +	 * pointer readers.
> +	 */
> +
> +	if (!addr)
> +		return;
> +	/* Memory ordering: Store A before Load B. */
> +	smp_mb();
> +	/* Scan all CPUs slots. */
> +	for_each_possible_cpu(cpu) {
> +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> +
> +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
> +			retire_cb(cpu, slot, addr);

Is retirce_cb allowed to cmpxchg the thing?

> +		/* Busy-wait if node is found. */
> +		while ((smp_load_acquire(&slot->addr)) == addr)	/* Load B */
> +			cpu_relax();

This really should be using smp_cond_load_acquire()

> +	}
> +}
Peter Zijlstra Oct. 5, 2024, 4:07 p.m. UTC | #6
On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:

> > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > +{
> > +	int cpu;
> > +
> > +	/*
> > +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > +	 * NULL or to a different value), and thus hides it from hazard
> > +	 * pointer readers.
> > +	 */

This should probably assert we're in a preemptible context. Otherwise
people will start using this in non-preemptible context and then we get
to unfuck things later.

> > +
> > +	if (!addr)
> > +		return;
> > +	/* Memory ordering: Store A before Load B. */
> > +	smp_mb();
> > +	/* Scan all CPUs slots. */
> > +	for_each_possible_cpu(cpu) {
> > +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > +
> > +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
> > +			retire_cb(cpu, slot, addr);
> 
> Is retirce_cb allowed to cmpxchg the thing?
> 
> > +		/* Busy-wait if node is found. */
> > +		while ((smp_load_acquire(&slot->addr)) == addr)	/* Load B */
> > +			cpu_relax();
> 
> This really should be using smp_cond_load_acquire()
> 
> > +	}
> > +}
Mathieu Desnoyers Oct. 5, 2024, 6:50 p.m. UTC | #7
On 2024-10-05 18:04, Peter Zijlstra wrote:
> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
>>   include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
>>   kernel/Makefile    |   2 +-
>>   kernel/hp.c        |  46 +++++++++++++
>>   3 files changed, 205 insertions(+), 1 deletion(-)
>>   create mode 100644 include/linux/hp.h
>>   create mode 100644 kernel/hp.c
>>
>> diff --git a/include/linux/hp.h b/include/linux/hp.h
>> new file mode 100644
>> index 000000000000..e85fc4365ea2
>> --- /dev/null
>> +++ b/include/linux/hp.h
>> @@ -0,0 +1,158 @@
>> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> +//
>> +// SPDX-License-Identifier: LGPL-2.1-or-later
>> +
>> +#ifndef _LINUX_HP_H
>> +#define _LINUX_HP_H
>> +
>> +/*
>> + * HP: Hazard Pointers
>> + *
>> + * This API provides existence guarantees of objects through hazard
>> + * pointers.
>> + *
>> + * It uses a fixed number of hazard pointer slots (nr_cpus) across the
>> + * entire system for each HP domain.
>> + *
>> + * Its main benefit over RCU is that it allows fast reclaim of
>> + * HP-protected pointers without needing to wait for a grace period.
>> + *
>> + * It also allows the hazard pointer scan to call a user-defined callback
>> + * to retire a hazard pointer slot immediately if needed. This callback
>> + * may, for instance, issue an IPI to the relevant CPU.
>> + *
>> + * References:
>> + *
>> + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
>> + *      lock-free objects," in IEEE Transactions on Parallel and
>> + *      Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
>> + */
>> +
>> +#include <linux/rcupdate.h>
>> +
>> +/*
>> + * Hazard pointer slot.
>> + */
>> +struct hp_slot {
>> +	void *addr;
>> +};
>> +
>> +/*
>> + * Hazard pointer context, returned by hp_use().
>> + */
>> +struct hp_ctx {
>> +	struct hp_slot *slot;
>> +	void *addr;
>> +};
>> +
>> +/*
>> + * hp_scan: Scan hazard pointer domain for @addr.
>> + *
>> + * Scan hazard pointer domain for @addr.
>> + * If @retire_cb is NULL, wait to observe that each slot contains a value
>> + * that differs from @addr.
>> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
>> + * @addr.
>> + */
>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));
> 
> struct hp_domain {
> 	struct hp_slot __percpu *slots
> };
> 
> might clarify things a wee little.

Good point. This introduces:

#define DECLARE_HP_DOMAIN(domain)                                       \
         extern struct hp_domain domain

#define DEFINE_HP_DOMAIN(domain)                                        \
         static DEFINE_PER_CPU(struct hp_slot, __ ## domain ## _slots);  \
         struct hp_domain domain = {                                     \
                 .percpu_slots = &__## domain ## _slots,                 \
         }

> 
>> +
>> +/* Get the hazard pointer context address (may be NULL). */
>> +static inline
>> +void *hp_ctx_addr(struct hp_ctx ctx)
>> +{
>> +	return ctx.addr;
>> +}
> 
>  From where I'm sitting this seems like superfluous fluff, what's wrong
> with ctx.addr ?

I'm OK removing the accessor and just using ctx.addr.

> 
>> +/*
>> + * hp_allocate: Allocate a hazard pointer.
>> + *
>> + * Allocate a hazard pointer slot for @addr. The object existence should
>> + * be guaranteed by the caller. Expects to be called from preempt
>> + * disable context.
>> + *
>> + * Returns a hazard pointer context.
> 
> So you made the WTF'o'meter crack, this here function does not allocate
> nothing. Naming is bad. At best this is something like
> try-set-hazard-pointer or somesuch.

I went with the naming from the 2004 paper from Maged Michael, but I
agree it could be clearer.

I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
"posting" the intent to use a pointer (as in on a metaphorical billboard),
and removing it when it's done.

> 
>> + */
>> +static inline
>> +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
>> +{
>> +	struct hp_slot *slot;
>> +	struct hp_ctx ctx;
>> +
>> +	if (!addr)
>> +		goto fail;
>> +	slot = this_cpu_ptr(percpu_slots);
>> +	/*
>> +	 * A single hazard pointer slot per CPU is available currently.
>> +	 * Other hazard pointer domains can eventually have a different
>> +	 * configuration.
>> +	 */
>> +	if (READ_ONCE(slot->addr))
>> +		goto fail;
>> +	WRITE_ONCE(slot->addr, addr);	/* Store B */
>> +	ctx.slot = slot;
>> +	ctx.addr = addr;
>> +	return ctx;
>> +
>> +fail:
>> +	ctx.slot = NULL;
>> +	ctx.addr = NULL;
>> +	return ctx;
>> +}
>> +
>> +/*
>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
>> + *
>> + * Returns a hazard pointer context. Expects to be called from preempt
>> + * disable context.
>> + */
> 
> More terrible naming. Same as above, but additionally, I would expect a
> 'dereference' to actually dereference the pointer and have a return
> value of the dereferenced type.

hp_dereference_try_post() ?

> 
> This function seems to double check and update the hp_ctx thing. I'm not
> at all sure yet wtf this is doing -- and the total lack of comments
> aren't helping.

The hp_ctx contains the outputs.

The function loads *addr_p to then try_post it into a HP slot. On success,
it re-reads the *addr_p (with address dependency) and if it still matches,
use that as output address pointer.

I'm planning to remove hp_ctx, and just have:

/*
  * hp_try_post: Try to post a hazard pointer.
  *
  * Post a hazard pointer slot for @addr. The object existence should
  * be guaranteed by the caller. Expects to be called from preempt
  * disable context.
  *
  * Returns true if post succeeds, false otherwise.
  */
static inline
bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
[...]

/*
  * hp_dereference_try_post: Dereference and try to post a hazard pointer.
  *
  * Returns a hazard pointer context. Expects to be called from preempt
  * disable context.
  */
static inline
void *__hp_dereference_try_post(struct hp_domain *hp_domain,
                                 void * const * addr_p, struct hp_slot **_slot)
[...]

#define hp_dereference_try_post(domain, p, slot_p)              \
         ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))

/* Clear the hazard pointer in @slot. */
static inline
void hp_remove(struct hp_slot *slot)
[...]

> 
>> +static inline
>> +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
>> +{
>> +	void *addr, *addr2;
>> +	struct hp_ctx ctx;
>> +
>> +	addr = READ_ONCE(*addr_p);
>> +retry:
>> +	ctx = hp_allocate(percpu_slots, addr);
>> +	if (!hp_ctx_addr(ctx))
>> +		goto fail;
>> +	/* Memory ordering: Store B before Load A. */
>> +	smp_mb();
>> +	/*
>> +	 * Use RCU dereference without lockdep checks, because
>> +	 * lockdep is not aware of HP guarantees.
>> +	 */
>> +	addr2 = rcu_access_pointer(*addr_p);	/* Load A */
>> +	/*
>> +	 * If @addr_p content has changed since the first load,
>> +	 * clear the hazard pointer and try again.
>> +	 */
>> +	if (!ptr_eq(addr2, addr)) {
>> +		WRITE_ONCE(ctx.slot->addr, NULL);
>> +		if (!addr2)
>> +			goto fail;
>> +		addr = addr2;
>> +		goto retry;
>> +	}
>> +	/*
>> +	 * Use addr2 loaded from rcu_access_pointer() to preserve
>> +	 * address dependency ordering.
>> +	 */
>> +	ctx.addr = addr2;
>> +	return ctx;
>> +
>> +fail:
>> +	ctx.slot = NULL;
>> +	ctx.addr = NULL;
>> +	return ctx;
>> +}
>> +
>> +/* Retire the hazard pointer in @ctx. */
>> +static inline
>> +void hp_retire(const struct hp_ctx ctx)
>> +{
>> +	smp_store_release(&ctx.slot->addr, NULL);
>> +}
>> +
>> +#endif /* _LINUX_HP_H */
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 3c13240dfc9f..ec16de96fa80 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -7,7 +7,7 @@ obj-y     = fork.o exec_domain.o panic.o \
>>   	    cpu.o exit.o softirq.o resource.o \
>>   	    sysctl.o capability.o ptrace.o user.o \
>>   	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
>> -	    extable.o params.o \
>> +	    extable.o params.o hp.o \
>>   	    kthread.o sys_ni.o nsproxy.o \
>>   	    notifier.o ksysfs.o cred.o reboot.o \
>>   	    async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
>> diff --git a/kernel/hp.c b/kernel/hp.c
>> new file mode 100644
>> index 000000000000..b2447bf15300
>> --- /dev/null
>> +++ b/kernel/hp.c
>> @@ -0,0 +1,46 @@
>> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> +//
>> +// SPDX-License-Identifier: LGPL-2.1-or-later
>> +
>> +/*
>> + * HP: Hazard Pointers
>> + */
>> +
>> +#include <linux/hp.h>
>> +#include <linux/percpu.h>
>> +
>> +/*
>> + * hp_scan: Scan hazard pointer domain for @addr.
>> + *
>> + * Scan hazard pointer domain for @addr.
>> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
>> + * @addr.
>> + * Wait to observe that each slot contains a value that differs from
>> + * @addr before returning.
>> + */
>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>> +{
>> +	int cpu;
>> +
>> +	/*
>> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
>> +	 * NULL or to a different value), and thus hides it from hazard
>> +	 * pointer readers.
>> +	 */
>> +
>> +	if (!addr)
>> +		return;
>> +	/* Memory ordering: Store A before Load B. */
>> +	smp_mb();
>> +	/* Scan all CPUs slots. */
>> +	for_each_possible_cpu(cpu) {
>> +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>> +
>> +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
>> +			retire_cb(cpu, slot, addr);
> 
> Is retirce_cb allowed to cmpxchg the thing?

It could, but we'd need to make sure the slot is not re-used by another
hp_try_post() before the current user removes its own post. It would
need to synchronize with the current HP user (e.g. with IPI).

I've actually renamed retire_cb to "on_match_cb".

> 
>> +		/* Busy-wait if node is found. */
>> +		while ((smp_load_acquire(&slot->addr)) == addr)	/* Load B */
>> +			cpu_relax();
> 
> This really should be using smp_cond_load_acquire()

Good point,

Thanks,

Mathieu

> 
>> +	}
>> +}
Mathieu Desnoyers Oct. 5, 2024, 6:56 p.m. UTC | #8
On 2024-10-05 18:07, Peter Zijlstra wrote:
> On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
> 
>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>> +{
>>> +	int cpu;
>>> +
>>> +	/*
>>> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>> +	 * NULL or to a different value), and thus hides it from hazard
>>> +	 * pointer readers.
>>> +	 */
> 
> This should probably assert we're in a preemptible context. Otherwise
> people will start using this in non-preemptible context and then we get
> to unfuck things later.

Something like this ?

+       /* Should only be called from preemptible context. */
+       WARN_ON_ONCE(in_atomic());

> 
>>> +
>>> +	if (!addr)
>>> +		return;
>>> +	/* Memory ordering: Store A before Load B. */
>>> +	smp_mb();
>>> +	/* Scan all CPUs slots. */
>>> +	for_each_possible_cpu(cpu) {
>>> +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>>> +
>>> +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
>>> +			retire_cb(cpu, slot, addr);
>>
>> Is retirce_cb allowed to cmpxchg the thing?

Renaming retire_cb to "on_match_cb". Whatever the callback does needs to
be done with knowledge of the slot user (e.g. IPI).


>>
>>> +		/* Busy-wait if node is found. */
>>> +		while ((smp_load_acquire(&slot->addr)) == addr)	/* Load B */
>>> +			cpu_relax();
>>
>> This really should be using smp_cond_load_acquire()

Done,

Thanks,

Mathieu

>>
>>> +	}
>>> +}
Peter Zijlstra Oct. 7, 2024, 10:40 a.m. UTC | #9
On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
> On 2024-10-05 18:04, Peter Zijlstra wrote:


> > > +/*
> > > + * hp_allocate: Allocate a hazard pointer.
> > > + *
> > > + * Allocate a hazard pointer slot for @addr. The object existence should
> > > + * be guaranteed by the caller. Expects to be called from preempt
> > > + * disable context.
> > > + *
> > > + * Returns a hazard pointer context.
> > 
> > So you made the WTF'o'meter crack, this here function does not allocate
> > nothing. Naming is bad. At best this is something like
> > try-set-hazard-pointer or somesuch.
> 
> I went with the naming from the 2004 paper from Maged Michael, but I
> agree it could be clearer.
> 
> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
> "posting" the intent to use a pointer (as in on a metaphorical billboard),
> and removing it when it's done.

For RCU we've taken to using the word: 'publish', no?


> > > +/*
> > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> > > + *
> > > + * Returns a hazard pointer context. Expects to be called from preempt
> > > + * disable context.
> > > + */
> > 
> > More terrible naming. Same as above, but additionally, I would expect a
> > 'dereference' to actually dereference the pointer and have a return
> > value of the dereferenced type.
> 
> hp_dereference_try_post() ?
> 
> > 
> > This function seems to double check and update the hp_ctx thing. I'm not
> > at all sure yet wtf this is doing -- and the total lack of comments
> > aren't helping.
> 
> The hp_ctx contains the outputs.
> 
> The function loads *addr_p to then try_post it into a HP slot. On success,
> it re-reads the *addr_p (with address dependency) and if it still matches,
> use that as output address pointer.
> 
> I'm planning to remove hp_ctx, and just have:
> 
> /*
>  * hp_try_post: Try to post a hazard pointer.
>  *
>  * Post a hazard pointer slot for @addr. The object existence should
>  * be guaranteed by the caller. Expects to be called from preempt
>  * disable context.
>  *
>  * Returns true if post succeeds, false otherwise.
>  */
> static inline
> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
> [...]
> 
> /*
>  * hp_dereference_try_post: Dereference and try to post a hazard pointer.
>  *
>  * Returns a hazard pointer context. Expects to be called from preempt
>  * disable context.
>  */
> static inline
> void *__hp_dereference_try_post(struct hp_domain *hp_domain,
>                                 void * const * addr_p, struct hp_slot **_slot)
> [...]
> 
> #define hp_dereference_try_post(domain, p, slot_p)              \
>         ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))

This will compile, but do the wrong thing when p is a regular pointer, no?

> 
> /* Clear the hazard pointer in @slot. */
> static inline
> void hp_remove(struct hp_slot *slot)
> [...]

Differently weird, but better I suppose :-)


> > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > +{
> > > +	int cpu;
> > > +
> > > +	/*
> > > +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > +	 * NULL or to a different value), and thus hides it from hazard
> > > +	 * pointer readers.
> > > +	 */
> > > +
> > > +	if (!addr)
> > > +		return;
> > > +	/* Memory ordering: Store A before Load B. */
> > > +	smp_mb();
> > > +	/* Scan all CPUs slots. */
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > > +
> > > +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
> > > +			retire_cb(cpu, slot, addr);
> > 
> > Is retirce_cb allowed to cmpxchg the thing?
> 
> It could, but we'd need to make sure the slot is not re-used by another
> hp_try_post() before the current user removes its own post. It would
> need to synchronize with the current HP user (e.g. with IPI).
> 
> I've actually renamed retire_cb to "on_match_cb".

Hmm, I think I see. Would it make sense to pass the expected addr to
hp_remove() and double check we don't NULL out something unexpected? --
maybe just for a DEBUG option.

I'm always seeing the NOHZ_FULL guys hating on this :-)
Peter Zijlstra Oct. 7, 2024, 10:42 a.m. UTC | #10
On Sat, Oct 05, 2024 at 02:56:26PM -0400, Mathieu Desnoyers wrote:
> On 2024-10-05 18:07, Peter Zijlstra wrote:
> > On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
> > 
> > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > > +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > > +{
> > > > +	int cpu;
> > > > +
> > > > +	/*
> > > > +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > > +	 * NULL or to a different value), and thus hides it from hazard
> > > > +	 * pointer readers.
> > > > +	 */
> > 
> > This should probably assert we're in a preemptible context. Otherwise
> > people will start using this in non-preemptible context and then we get
> > to unfuck things later.
> 
> Something like this ?
> 
> +       /* Should only be called from preemptible context. */
> +       WARN_ON_ONCE(in_atomic());

	lockdep_assert_preemption_enabled();

that also checks local IRQ state IIRC.
Mathieu Desnoyers Oct. 7, 2024, 1:22 p.m. UTC | #11
On 2024-10-07 12:42, Peter Zijlstra wrote:
> On Sat, Oct 05, 2024 at 02:56:26PM -0400, Mathieu Desnoyers wrote:
>> On 2024-10-05 18:07, Peter Zijlstra wrote:
>>> On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
>>>> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
>>>
>>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>>>> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>>>> +{
>>>>> +	int cpu;
>>>>> +
>>>>> +	/*
>>>>> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>>>> +	 * NULL or to a different value), and thus hides it from hazard
>>>>> +	 * pointer readers.
>>>>> +	 */
>>>
>>> This should probably assert we're in a preemptible context. Otherwise
>>> people will start using this in non-preemptible context and then we get
>>> to unfuck things later.
>>
>> Something like this ?
>>
>> +       /* Should only be called from preemptible context. */
>> +       WARN_ON_ONCE(in_atomic());
> 
> 	lockdep_assert_preemption_enabled();
> 
> that also checks local IRQ state IIRC.

I'll use this instead, thanks!

Mathieu
Mathieu Desnoyers Oct. 7, 2024, 2:50 p.m. UTC | #12
On 2024-10-07 12:40, Peter Zijlstra wrote:
> On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
>> On 2024-10-05 18:04, Peter Zijlstra wrote:
> 
> 
>>>> +/*
>>>> + * hp_allocate: Allocate a hazard pointer.
>>>> + *
>>>> + * Allocate a hazard pointer slot for @addr. The object existence should
>>>> + * be guaranteed by the caller. Expects to be called from preempt
>>>> + * disable context.
>>>> + *
>>>> + * Returns a hazard pointer context.
>>>
>>> So you made the WTF'o'meter crack, this here function does not allocate
>>> nothing. Naming is bad. At best this is something like
>>> try-set-hazard-pointer or somesuch.
>>
>> I went with the naming from the 2004 paper from Maged Michael, but I
>> agree it could be clearer.
>>
>> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
>> "posting" the intent to use a pointer (as in on a metaphorical billboard),
>> and removing it when it's done.
> 
> For RCU we've taken to using the word: 'publish', no?

I'm so glad you suggest this, because it turns out that from all
the possible words you could choose from, 'publish' is probably the
most actively confusing. I'll explain.

Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers
through a simple example:

[ Note: I've renamed the HP dereference try_post to HP load try_post
   based on further discussion below. ]

*** RCU ***

* Dereference RCU-protected pointer:
     rcu_read_lock();          // [ Begin read transaction ]
     l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
     if (l_p)
       [ use *l_p ...]
     rcu_read_unlock();        // [ End read transaction ]

* Publish @addr:    addr = kmalloc();
                     init(addr);
                     rcu_assign_pointer(p, addr);

* Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
                     synchronize_rcu();           // Wait for all pre-existing
                                                  // read transactions to complete.
                     kfree(addr);


*** Hazard Pointers ***

* Load and post a HP-protected pointer:
     l_p = hp_load_try_post(domain, &p, &slot);
     if (l_p) {
       [ use *l_p ...]
       hp_remove(&slot, l_p);
     }

* Publish @addr:    addr = kmalloc();
                     init(addr);
                     rcu_assign_pointer(p, addr);

* Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
                     hp_scan(domain, addr, NULL);
                     kfree(addr);

Both HP and RCU have publication guarantees, which can in fact be
implemented in the same way (e.g. rcu_assign_pointer paired with
something that respects address dependencies ordering). A stronger
implementation of this would be pairing a store-release with a
load-acquire: it works, but it would add needless overhead on
weakly-ordered CPUs.

How the two mechanisms differ is in how they track when it is
safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
and makes sure that all pre-existing transactions are gone before
synchronize_rcu() is allowed to complete. HP does this by tracking
"posted" pointer slots with a HP domain. As long as hp_scan observes
that HP readers are showing interest in @addr, it will wait.

One notable difference between RCU and HP is that HP knows exactly
which pointer is blocking progress, and from which CPU (at least
with my per-CPU HP domain implementation). Therefore, it is possible
for HP to issue an IPI and make sure the HP user either completes its
use of the pointer quickly, or stops using it right away (e.g. making
the active mm use idle mm instead).

One strength of RCU is that it can track use of a whole set of RCU
pointers just by tracking reader transaction begin/end, but this is
also one of its weaknesses: a long reader transaction can postpone
completion of grace period for a long time and increase the memory
footprint. In comparison, HP can immediately complete as soon as the
pointer it is scanning for is gone. Even better, it can send an IPI
to the belate CPU and abort use of the pointer using a callback.

> 
> 
>>>> +/*
>>>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
>>>> + *
>>>> + * Returns a hazard pointer context. Expects to be called from preempt
>>>> + * disable context.
>>>> + */
>>>
>>> More terrible naming. Same as above, but additionally, I would expect a
>>> 'dereference' to actually dereference the pointer and have a return
>>> value of the dereferenced type.
>>
>> hp_dereference_try_post() ?
>>
>>>
>>> This function seems to double check and update the hp_ctx thing. I'm not
>>> at all sure yet wtf this is doing -- and the total lack of comments
>>> aren't helping.
>>
>> The hp_ctx contains the outputs.
>>
>> The function loads *addr_p to then try_post it into a HP slot. On success,
>> it re-reads the *addr_p (with address dependency) and if it still matches,
>> use that as output address pointer.
>>
>> I'm planning to remove hp_ctx, and just have:
>>
>> /*
>>   * hp_try_post: Try to post a hazard pointer.
>>   *
>>   * Post a hazard pointer slot for @addr. The object existence should
>>   * be guaranteed by the caller. Expects to be called from preempt
>>   * disable context.
>>   *
>>   * Returns true if post succeeds, false otherwise.
>>   */
>> static inline
>> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
>> [...]
>>
>> /*
>>   * hp_dereference_try_post: Dereference and try to post a hazard pointer.
>>   *
>>   * Returns a hazard pointer context. Expects to be called from preempt
>>   * disable context.
>>   */
>> static inline
>> void *__hp_dereference_try_post(struct hp_domain *hp_domain,
>>                                  void * const * addr_p, struct hp_slot **_slot)
>> [...]
>>
>> #define hp_dereference_try_post(domain, p, slot_p)              \
>>          ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
> 
> This will compile, but do the wrong thing when p is a regular pointer, no?

Right, at least in some cases the compiler may not complain, and people used to
rcu_dereference() will expect that "p" is the pointer to load rather than the
address of that pointer. This would be unexpected.

I must admit that passing the address holding the pointer to load rather than
the pointer to load itself makes it much less troublesome in terms of macro
layers. But perhaps this is another example where we should wander away from the
beaten path and use a word different from "dereference" here. E.g.:

/*
  * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
  * to generate a compile error if addr_p is not a pointer to a pointer.
  */
#define hp_load_try_post(domain, addr_p, slot_p)                \
         ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))

> 
>>
>> /* Clear the hazard pointer in @slot. */
>> static inline
>> void hp_remove(struct hp_slot *slot)
>> [...]
> 
> Differently weird, but better I suppose :-)

If you find a better word than "remove" to pair with "post", I'm all in :)

> 
> 
>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>>> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>>> +{
>>>> +	int cpu;
>>>> +
>>>> +	/*
>>>> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>>> +	 * NULL or to a different value), and thus hides it from hazard
>>>> +	 * pointer readers.
>>>> +	 */
>>>> +
>>>> +	if (!addr)
>>>> +		return;
>>>> +	/* Memory ordering: Store A before Load B. */
>>>> +	smp_mb();
>>>> +	/* Scan all CPUs slots. */
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>>>> +
>>>> +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
>>>> +			retire_cb(cpu, slot, addr);
>>>
>>> Is retirce_cb allowed to cmpxchg the thing?
>>
>> It could, but we'd need to make sure the slot is not re-used by another
>> hp_try_post() before the current user removes its own post. It would
>> need to synchronize with the current HP user (e.g. with IPI).
>>
>> I've actually renamed retire_cb to "on_match_cb".
> 
> Hmm, I think I see. Would it make sense to pass the expected addr to
> hp_remove() and double check we don't NULL out something unexpected? --
> maybe just for a DEBUG option.
> 
> I'm always seeing the NOHZ_FULL guys hating on this :-)

That's a fair point. Sure, we can do this as an extra safety net. For now I
will just make the check always present, we can always move it to a debug
option later.

And now I notice that hp_remove is also used for CPU hotplug (grep
matches for cpuhp_remove_state()). I wonder if we should go for something
more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ?

Thanks,

Mathieu
Paul E. McKenney Oct. 7, 2024, 6:18 p.m. UTC | #13
On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote:
> On 2024-10-07 12:40, Peter Zijlstra wrote:
> > On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
> > > On 2024-10-05 18:04, Peter Zijlstra wrote:
> > 
> > 
> > > > > +/*
> > > > > + * hp_allocate: Allocate a hazard pointer.
> > > > > + *
> > > > > + * Allocate a hazard pointer slot for @addr. The object existence should
> > > > > + * be guaranteed by the caller. Expects to be called from preempt
> > > > > + * disable context.
> > > > > + *
> > > > > + * Returns a hazard pointer context.
> > > > 
> > > > So you made the WTF'o'meter crack, this here function does not allocate
> > > > nothing. Naming is bad. At best this is something like
> > > > try-set-hazard-pointer or somesuch.
> > > 
> > > I went with the naming from the 2004 paper from Maged Michael, but I
> > > agree it could be clearer.
> > > 
> > > I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
> > > "posting" the intent to use a pointer (as in on a metaphorical billboard),
> > > and removing it when it's done.
> > 
> > For RCU we've taken to using the word: 'publish', no?
> 
> I'm so glad you suggest this, because it turns out that from all
> the possible words you could choose from, 'publish' is probably the
> most actively confusing. I'll explain.
> 
> Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers
> through a simple example:
> 
> [ Note: I've renamed the HP dereference try_post to HP load try_post
>   based on further discussion below. ]
> 
> *** RCU ***
> 
> * Dereference RCU-protected pointer:
>     rcu_read_lock();          // [ Begin read transaction ]
>     l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
>     if (l_p)
>       [ use *l_p ...]
>     rcu_read_unlock();        // [ End read transaction ]
> 
> * Publish @addr:    addr = kmalloc();
>                     init(addr);
>                     rcu_assign_pointer(p, addr);
> 
> * Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>                     synchronize_rcu();           // Wait for all pre-existing
>                                                  // read transactions to complete.
>                     kfree(addr);
> 
> 
> *** Hazard Pointers ***
> 
> * Load and post a HP-protected pointer:
>     l_p = hp_load_try_post(domain, &p, &slot);
>     if (l_p) {
>       [ use *l_p ...]
>       hp_remove(&slot, l_p);
>     }
> 
> * Publish @addr:    addr = kmalloc();
>                     init(addr);
>                     rcu_assign_pointer(p, addr);
> 
> * Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>                     hp_scan(domain, addr, NULL);
>                     kfree(addr);
> 
> Both HP and RCU have publication guarantees, which can in fact be
> implemented in the same way (e.g. rcu_assign_pointer paired with
> something that respects address dependencies ordering). A stronger
> implementation of this would be pairing a store-release with a
> load-acquire: it works, but it would add needless overhead on
> weakly-ordered CPUs.
> 
> How the two mechanisms differ is in how they track when it is
> safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
> and makes sure that all pre-existing transactions are gone before
> synchronize_rcu() is allowed to complete. HP does this by tracking
> "posted" pointer slots with a HP domain. As long as hp_scan observes
> that HP readers are showing interest in @addr, it will wait.
> 
> One notable difference between RCU and HP is that HP knows exactly
> which pointer is blocking progress, and from which CPU (at least
> with my per-CPU HP domain implementation). Therefore, it is possible
> for HP to issue an IPI and make sure the HP user either completes its
> use of the pointer quickly, or stops using it right away (e.g. making
> the active mm use idle mm instead).
> 
> One strength of RCU is that it can track use of a whole set of RCU
> pointers just by tracking reader transaction begin/end, but this is
> also one of its weaknesses: a long reader transaction can postpone
> completion of grace period for a long time and increase the memory
> footprint. In comparison, HP can immediately complete as soon as the
> pointer it is scanning for is gone. Even better, it can send an IPI
> to the belate CPU and abort use of the pointer using a callback.

Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no".

This all sounds like arguments *for* use of the term "publish" for
hazard pointers rather than against it.  What am I missing here?

							Thanx, Paul

> > > > > +/*
> > > > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> > > > > + *
> > > > > + * Returns a hazard pointer context. Expects to be called from preempt
> > > > > + * disable context.
> > > > > + */
> > > > 
> > > > More terrible naming. Same as above, but additionally, I would expect a
> > > > 'dereference' to actually dereference the pointer and have a return
> > > > value of the dereferenced type.
> > > 
> > > hp_dereference_try_post() ?
> > > 
> > > > 
> > > > This function seems to double check and update the hp_ctx thing. I'm not
> > > > at all sure yet wtf this is doing -- and the total lack of comments
> > > > aren't helping.
> > > 
> > > The hp_ctx contains the outputs.
> > > 
> > > The function loads *addr_p to then try_post it into a HP slot. On success,
> > > it re-reads the *addr_p (with address dependency) and if it still matches,
> > > use that as output address pointer.
> > > 
> > > I'm planning to remove hp_ctx, and just have:
> > > 
> > > /*
> > >   * hp_try_post: Try to post a hazard pointer.
> > >   *
> > >   * Post a hazard pointer slot for @addr. The object existence should
> > >   * be guaranteed by the caller. Expects to be called from preempt
> > >   * disable context.
> > >   *
> > >   * Returns true if post succeeds, false otherwise.
> > >   */
> > > static inline
> > > bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
> > > [...]
> > > 
> > > /*
> > >   * hp_dereference_try_post: Dereference and try to post a hazard pointer.
> > >   *
> > >   * Returns a hazard pointer context. Expects to be called from preempt
> > >   * disable context.
> > >   */
> > > static inline
> > > void *__hp_dereference_try_post(struct hp_domain *hp_domain,
> > >                                  void * const * addr_p, struct hp_slot **_slot)
> > > [...]
> > > 
> > > #define hp_dereference_try_post(domain, p, slot_p)              \
> > >          ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
> > 
> > This will compile, but do the wrong thing when p is a regular pointer, no?
> 
> Right, at least in some cases the compiler may not complain, and people used to
> rcu_dereference() will expect that "p" is the pointer to load rather than the
> address of that pointer. This would be unexpected.
> 
> I must admit that passing the address holding the pointer to load rather than
> the pointer to load itself makes it much less troublesome in terms of macro
> layers. But perhaps this is another example where we should wander away from the
> beaten path and use a word different from "dereference" here. E.g.:
> 
> /*
>  * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
>  * to generate a compile error if addr_p is not a pointer to a pointer.
>  */
> #define hp_load_try_post(domain, addr_p, slot_p)                \
>         ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))
> 
> > 
> > > 
> > > /* Clear the hazard pointer in @slot. */
> > > static inline
> > > void hp_remove(struct hp_slot *slot)
> > > [...]
> > 
> > Differently weird, but better I suppose :-)
> 
> If you find a better word than "remove" to pair with "post", I'm all in :)
> 
> > 
> > 
> > > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > > > +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > > > +{
> > > > > +	int cpu;
> > > > > +
> > > > > +	/*
> > > > > +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > > > +	 * NULL or to a different value), and thus hides it from hazard
> > > > > +	 * pointer readers.
> > > > > +	 */
> > > > > +
> > > > > +	if (!addr)
> > > > > +		return;
> > > > > +	/* Memory ordering: Store A before Load B. */
> > > > > +	smp_mb();
> > > > > +	/* Scan all CPUs slots. */
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > > > > +
> > > > > +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
> > > > > +			retire_cb(cpu, slot, addr);
> > > > 
> > > > Is retirce_cb allowed to cmpxchg the thing?
> > > 
> > > It could, but we'd need to make sure the slot is not re-used by another
> > > hp_try_post() before the current user removes its own post. It would
> > > need to synchronize with the current HP user (e.g. with IPI).
> > > 
> > > I've actually renamed retire_cb to "on_match_cb".
> > 
> > Hmm, I think I see. Would it make sense to pass the expected addr to
> > hp_remove() and double check we don't NULL out something unexpected? --
> > maybe just for a DEBUG option.
> > 
> > I'm always seeing the NOHZ_FULL guys hating on this :-)
> 
> That's a fair point. Sure, we can do this as an extra safety net. For now I
> will just make the check always present, we can always move it to a debug
> option later.
> 
> And now I notice that hp_remove is also used for CPU hotplug (grep
> matches for cpuhp_remove_state()). I wonder if we should go for something
> more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ?
> 
> Thanks,
> 
> Mathieu
> 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Paul E. McKenney Oct. 7, 2024, 7:06 p.m. UTC | #14
On Mon, Oct 07, 2024 at 11:18:14AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote:
> > On 2024-10-07 12:40, Peter Zijlstra wrote:
> > > On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
> > > > On 2024-10-05 18:04, Peter Zijlstra wrote:
> > > 
> > > 
> > > > > > +/*
> > > > > > + * hp_allocate: Allocate a hazard pointer.
> > > > > > + *
> > > > > > + * Allocate a hazard pointer slot for @addr. The object existence should
> > > > > > + * be guaranteed by the caller. Expects to be called from preempt
> > > > > > + * disable context.
> > > > > > + *
> > > > > > + * Returns a hazard pointer context.
> > > > > 
> > > > > So you made the WTF'o'meter crack, this here function does not allocate
> > > > > nothing. Naming is bad. At best this is something like
> > > > > try-set-hazard-pointer or somesuch.
> > > > 
> > > > I went with the naming from the 2004 paper from Maged Michael, but I
> > > > agree it could be clearer.
> > > > 
> > > > I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
> > > > "posting" the intent to use a pointer (as in on a metaphorical billboard),
> > > > and removing it when it's done.
> > > 
> > > For RCU we've taken to using the word: 'publish', no?
> > 
> > I'm so glad you suggest this, because it turns out that from all
> > the possible words you could choose from, 'publish' is probably the
> > most actively confusing. I'll explain.
> > 
> > Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers
> > through a simple example:
> > 
> > [ Note: I've renamed the HP dereference try_post to HP load try_post
> >   based on further discussion below. ]
> > 
> > *** RCU ***
> > 
> > * Dereference RCU-protected pointer:
> >     rcu_read_lock();          // [ Begin read transaction ]
> >     l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
> >     if (l_p)
> >       [ use *l_p ...]
> >     rcu_read_unlock();        // [ End read transaction ]
> > 
> > * Publish @addr:    addr = kmalloc();
> >                     init(addr);
> >                     rcu_assign_pointer(p, addr);
> > 
> > * Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
> >                     synchronize_rcu();           // Wait for all pre-existing
> >                                                  // read transactions to complete.
> >                     kfree(addr);
> > 
> > 
> > *** Hazard Pointers ***
> > 
> > * Load and post a HP-protected pointer:
> >     l_p = hp_load_try_post(domain, &p, &slot);
> >     if (l_p) {
> >       [ use *l_p ...]
> >       hp_remove(&slot, l_p);
> >     }
> > 
> > * Publish @addr:    addr = kmalloc();
> >                     init(addr);
> >                     rcu_assign_pointer(p, addr);
> > 
> > * Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
> >                     hp_scan(domain, addr, NULL);
> >                     kfree(addr);
> > 
> > Both HP and RCU have publication guarantees, which can in fact be
> > implemented in the same way (e.g. rcu_assign_pointer paired with
> > something that respects address dependencies ordering). A stronger
> > implementation of this would be pairing a store-release with a
> > load-acquire: it works, but it would add needless overhead on
> > weakly-ordered CPUs.
> > 
> > How the two mechanisms differ is in how they track when it is
> > safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
> > and makes sure that all pre-existing transactions are gone before
> > synchronize_rcu() is allowed to complete. HP does this by tracking
> > "posted" pointer slots with a HP domain. As long as hp_scan observes
> > that HP readers are showing interest in @addr, it will wait.
> > 
> > One notable difference between RCU and HP is that HP knows exactly
> > which pointer is blocking progress, and from which CPU (at least
> > with my per-CPU HP domain implementation). Therefore, it is possible
> > for HP to issue an IPI and make sure the HP user either completes its
> > use of the pointer quickly, or stops using it right away (e.g. making
> > the active mm use idle mm instead).
> > 
> > One strength of RCU is that it can track use of a whole set of RCU
> > pointers just by tracking reader transaction begin/end, but this is
> > also one of its weaknesses: a long reader transaction can postpone
> > completion of grace period for a long time and increase the memory
> > footprint. In comparison, HP can immediately complete as soon as the
> > pointer it is scanning for is gone. Even better, it can send an IPI
> > to the belate CPU and abort use of the pointer using a callback.
> 
> Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no".
> 
> This all sounds like arguments *for* use of the term "publish" for
> hazard pointers rather than against it.  What am I missing here?

OK, one thing that I was missing is that this was not about the
counterpart to rcu_assign_pointer(), for which I believe "publish" makes
a lot of sense, but rather about the setting of a hazard pointer.  Here,
"protect" is the traditional term of power, which has served users well
for some years.

							Thanx, Paul

> > > > > > +/*
> > > > > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> > > > > > + *
> > > > > > + * Returns a hazard pointer context. Expects to be called from preempt
> > > > > > + * disable context.
> > > > > > + */
> > > > > 
> > > > > More terrible naming. Same as above, but additionally, I would expect a
> > > > > 'dereference' to actually dereference the pointer and have a return
> > > > > value of the dereferenced type.
> > > > 
> > > > hp_dereference_try_post() ?
> > > > 
> > > > > 
> > > > > This function seems to double check and update the hp_ctx thing. I'm not
> > > > > at all sure yet wtf this is doing -- and the total lack of comments
> > > > > aren't helping.
> > > > 
> > > > The hp_ctx contains the outputs.
> > > > 
> > > > The function loads *addr_p to then try_post it into a HP slot. On success,
> > > > it re-reads the *addr_p (with address dependency) and if it still matches,
> > > > use that as output address pointer.
> > > > 
> > > > I'm planning to remove hp_ctx, and just have:
> > > > 
> > > > /*
> > > >   * hp_try_post: Try to post a hazard pointer.
> > > >   *
> > > >   * Post a hazard pointer slot for @addr. The object existence should
> > > >   * be guaranteed by the caller. Expects to be called from preempt
> > > >   * disable context.
> > > >   *
> > > >   * Returns true if post succeeds, false otherwise.
> > > >   */
> > > > static inline
> > > > bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
> > > > [...]
> > > > 
> > > > /*
> > > >   * hp_dereference_try_post: Dereference and try to post a hazard pointer.
> > > >   *
> > > >   * Returns a hazard pointer context. Expects to be called from preempt
> > > >   * disable context.
> > > >   */
> > > > static inline
> > > > void *__hp_dereference_try_post(struct hp_domain *hp_domain,
> > > >                                  void * const * addr_p, struct hp_slot **_slot)
> > > > [...]
> > > > 
> > > > #define hp_dereference_try_post(domain, p, slot_p)              \
> > > >          ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
> > > 
> > > This will compile, but do the wrong thing when p is a regular pointer, no?
> > 
> > Right, at least in some cases the compiler may not complain, and people used to
> > rcu_dereference() will expect that "p" is the pointer to load rather than the
> > address of that pointer. This would be unexpected.
> > 
> > I must admit that passing the address holding the pointer to load rather than
> > the pointer to load itself makes it much less troublesome in terms of macro
> > layers. But perhaps this is another example where we should wander away from the
> > beaten path and use a word different from "dereference" here. E.g.:
> > 
> > /*
> >  * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
> >  * to generate a compile error if addr_p is not a pointer to a pointer.
> >  */
> > #define hp_load_try_post(domain, addr_p, slot_p)                \
> >         ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))
> > 
> > > 
> > > > 
> > > > /* Clear the hazard pointer in @slot. */
> > > > static inline
> > > > void hp_remove(struct hp_slot *slot)
> > > > [...]
> > > 
> > > Differently weird, but better I suppose :-)
> > 
> > If you find a better word than "remove" to pair with "post", I'm all in :)
> > 
> > > 
> > > 
> > > > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > > > > +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > > > > +{
> > > > > > +	int cpu;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > > > > +	 * NULL or to a different value), and thus hides it from hazard
> > > > > > +	 * pointer readers.
> > > > > > +	 */
> > > > > > +
> > > > > > +	if (!addr)
> > > > > > +		return;
> > > > > > +	/* Memory ordering: Store A before Load B. */
> > > > > > +	smp_mb();
> > > > > > +	/* Scan all CPUs slots. */
> > > > > > +	for_each_possible_cpu(cpu) {
> > > > > > +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > > > > > +
> > > > > > +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
> > > > > > +			retire_cb(cpu, slot, addr);
> > > > > 
> > > > > Is retirce_cb allowed to cmpxchg the thing?
> > > > 
> > > > It could, but we'd need to make sure the slot is not re-used by another
> > > > hp_try_post() before the current user removes its own post. It would
> > > > need to synchronize with the current HP user (e.g. with IPI).
> > > > 
> > > > I've actually renamed retire_cb to "on_match_cb".
> > > 
> > > Hmm, I think I see. Would it make sense to pass the expected addr to
> > > hp_remove() and double check we don't NULL out something unexpected? --
> > > maybe just for a DEBUG option.
> > > 
> > > I'm always seeing the NOHZ_FULL guys hating on this :-)
> > 
> > That's a fair point. Sure, we can do this as an extra safety net. For now I
> > will just make the check always present, we can always move it to a debug
> > option later.
> > 
> > And now I notice that hp_remove is also used for CPU hotplug (grep
> > matches for cpuhp_remove_state()). I wonder if we should go for something
> > more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ?
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > 
> > -- 
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > https://www.efficios.com
> >
Mathieu Desnoyers Oct. 7, 2024, 7:08 p.m. UTC | #15
On 2024-10-07 21:06, Paul E. McKenney wrote:
> On Mon, Oct 07, 2024 at 11:18:14AM -0700, Paul E. McKenney wrote:
>> On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote:
>>> On 2024-10-07 12:40, Peter Zijlstra wrote:
>>>> On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
>>>>> On 2024-10-05 18:04, Peter Zijlstra wrote:
>>>>
>>>>
>>>>>>> +/*
>>>>>>> + * hp_allocate: Allocate a hazard pointer.
>>>>>>> + *
>>>>>>> + * Allocate a hazard pointer slot for @addr. The object existence should
>>>>>>> + * be guaranteed by the caller. Expects to be called from preempt
>>>>>>> + * disable context.
>>>>>>> + *
>>>>>>> + * Returns a hazard pointer context.
>>>>>>
>>>>>> So you made the WTF'o'meter crack, this here function does not allocate
>>>>>> nothing. Naming is bad. At best this is something like
>>>>>> try-set-hazard-pointer or somesuch.
>>>>>
>>>>> I went with the naming from the 2004 paper from Maged Michael, but I
>>>>> agree it could be clearer.
>>>>>
>>>>> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
>>>>> "posting" the intent to use a pointer (as in on a metaphorical billboard),
>>>>> and removing it when it's done.
>>>>
>>>> For RCU we've taken to using the word: 'publish', no?
>>>
>>> I'm so glad you suggest this, because it turns out that from all
>>> the possible words you could choose from, 'publish' is probably the
>>> most actively confusing. I'll explain.
>>>
>>> Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers
>>> through a simple example:
>>>
>>> [ Note: I've renamed the HP dereference try_post to HP load try_post
>>>    based on further discussion below. ]
>>>
>>> *** RCU ***
>>>
>>> * Dereference RCU-protected pointer:
>>>      rcu_read_lock();          // [ Begin read transaction ]
>>>      l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
>>>      if (l_p)
>>>        [ use *l_p ...]
>>>      rcu_read_unlock();        // [ End read transaction ]
>>>
>>> * Publish @addr:    addr = kmalloc();
>>>                      init(addr);
>>>                      rcu_assign_pointer(p, addr);
>>>
>>> * Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>>>                      synchronize_rcu();           // Wait for all pre-existing
>>>                                                   // read transactions to complete.
>>>                      kfree(addr);
>>>
>>>
>>> *** Hazard Pointers ***
>>>
>>> * Load and post a HP-protected pointer:
>>>      l_p = hp_load_try_post(domain, &p, &slot);
>>>      if (l_p) {
>>>        [ use *l_p ...]
>>>        hp_remove(&slot, l_p);
>>>      }
>>>
>>> * Publish @addr:    addr = kmalloc();
>>>                      init(addr);
>>>                      rcu_assign_pointer(p, addr);
>>>
>>> * Reclaim @addr:    rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>>>                      hp_scan(domain, addr, NULL);
>>>                      kfree(addr);
>>>
>>> Both HP and RCU have publication guarantees, which can in fact be
>>> implemented in the same way (e.g. rcu_assign_pointer paired with
>>> something that respects address dependencies ordering). A stronger
>>> implementation of this would be pairing a store-release with a
>>> load-acquire: it works, but it would add needless overhead on
>>> weakly-ordered CPUs.
>>>
>>> How the two mechanisms differ is in how they track when it is
>>> safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
>>> and makes sure that all pre-existing transactions are gone before
>>> synchronize_rcu() is allowed to complete. HP does this by tracking
>>> "posted" pointer slots with a HP domain. As long as hp_scan observes
>>> that HP readers are showing interest in @addr, it will wait.
>>>
>>> One notable difference between RCU and HP is that HP knows exactly
>>> which pointer is blocking progress, and from which CPU (at least
>>> with my per-CPU HP domain implementation). Therefore, it is possible
>>> for HP to issue an IPI and make sure the HP user either completes its
>>> use of the pointer quickly, or stops using it right away (e.g. making
>>> the active mm use idle mm instead).
>>>
>>> One strength of RCU is that it can track use of a whole set of RCU
>>> pointers just by tracking reader transaction begin/end, but this is
>>> also one of its weaknesses: a long reader transaction can postpone
>>> completion of grace period for a long time and increase the memory
>>> footprint. In comparison, HP can immediately complete as soon as the
>>> pointer it is scanning for is gone. Even better, it can send an IPI
>>> to the belate CPU and abort use of the pointer using a callback.
>>
>> Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no".
>>
>> This all sounds like arguments *for* use of the term "publish" for
>> hazard pointers rather than against it.  What am I missing here?
> 
> OK, one thing that I was missing is that this was not about the
> counterpart to rcu_assign_pointer(), for which I believe "publish" makes
> a lot of sense, but rather about the setting of a hazard pointer.  Here,
> "protect" is the traditional term of power, which has served users well
> for some years.

After some reading of the C++ specification wording used for hazard
pointers, I am indeed tempted to go for "try_protect()" and
"retire()" to minimize confusion.

Thanks,

Mathieu


> 
> 							Thanx, Paul
> 
>>>>>>> +/*
>>>>>>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
>>>>>>> + *
>>>>>>> + * Returns a hazard pointer context. Expects to be called from preempt
>>>>>>> + * disable context.
>>>>>>> + */
>>>>>>
>>>>>> More terrible naming. Same as above, but additionally, I would expect a
>>>>>> 'dereference' to actually dereference the pointer and have a return
>>>>>> value of the dereferenced type.
>>>>>
>>>>> hp_dereference_try_post() ?
>>>>>
>>>>>>
>>>>>> This function seems to double check and update the hp_ctx thing. I'm not
>>>>>> at all sure yet wtf this is doing -- and the total lack of comments
>>>>>> aren't helping.
>>>>>
>>>>> The hp_ctx contains the outputs.
>>>>>
>>>>> The function loads *addr_p to then try_post it into a HP slot. On success,
>>>>> it re-reads the *addr_p (with address dependency) and if it still matches,
>>>>> use that as output address pointer.
>>>>>
>>>>> I'm planning to remove hp_ctx, and just have:
>>>>>
>>>>> /*
>>>>>    * hp_try_post: Try to post a hazard pointer.
>>>>>    *
>>>>>    * Post a hazard pointer slot for @addr. The object existence should
>>>>>    * be guaranteed by the caller. Expects to be called from preempt
>>>>>    * disable context.
>>>>>    *
>>>>>    * Returns true if post succeeds, false otherwise.
>>>>>    */
>>>>> static inline
>>>>> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
>>>>> [...]
>>>>>
>>>>> /*
>>>>>    * hp_dereference_try_post: Dereference and try to post a hazard pointer.
>>>>>    *
>>>>>    * Returns a hazard pointer context. Expects to be called from preempt
>>>>>    * disable context.
>>>>>    */
>>>>> static inline
>>>>> void *__hp_dereference_try_post(struct hp_domain *hp_domain,
>>>>>                                   void * const * addr_p, struct hp_slot **_slot)
>>>>> [...]
>>>>>
>>>>> #define hp_dereference_try_post(domain, p, slot_p)              \
>>>>>           ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
>>>>
>>>> This will compile, but do the wrong thing when p is a regular pointer, no?
>>>
>>> Right, at least in some cases the compiler may not complain, and people used to
>>> rcu_dereference() will expect that "p" is the pointer to load rather than the
>>> address of that pointer. This would be unexpected.
>>>
>>> I must admit that passing the address holding the pointer to load rather than
>>> the pointer to load itself makes it much less troublesome in terms of macro
>>> layers. But perhaps this is another example where we should wander away from the
>>> beaten path and use a word different from "dereference" here. E.g.:
>>>
>>> /*
>>>   * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
>>>   * to generate a compile error if addr_p is not a pointer to a pointer.
>>>   */
>>> #define hp_load_try_post(domain, addr_p, slot_p)                \
>>>          ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))
>>>
>>>>
>>>>>
>>>>> /* Clear the hazard pointer in @slot. */
>>>>> static inline
>>>>> void hp_remove(struct hp_slot *slot)
>>>>> [...]
>>>>
>>>> Differently weird, but better I suppose :-)
>>>
>>> If you find a better word than "remove" to pair with "post", I'm all in :)
>>>
>>>>
>>>>
>>>>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>>>>>> +	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>>>>>> +{
>>>>>>> +	int cpu;
>>>>>>> +
>>>>>>> +	/*
>>>>>>> +	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>>>>>> +	 * NULL or to a different value), and thus hides it from hazard
>>>>>>> +	 * pointer readers.
>>>>>>> +	 */
>>>>>>> +
>>>>>>> +	if (!addr)
>>>>>>> +		return;
>>>>>>> +	/* Memory ordering: Store A before Load B. */
>>>>>>> +	smp_mb();
>>>>>>> +	/* Scan all CPUs slots. */
>>>>>>> +	for_each_possible_cpu(cpu) {
>>>>>>> +		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>>>>>>> +
>>>>>>> +		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
>>>>>>> +			retire_cb(cpu, slot, addr);
>>>>>>
>>>>>> Is retirce_cb allowed to cmpxchg the thing?
>>>>>
>>>>> It could, but we'd need to make sure the slot is not re-used by another
>>>>> hp_try_post() before the current user removes its own post. It would
>>>>> need to synchronize with the current HP user (e.g. with IPI).
>>>>>
>>>>> I've actually renamed retire_cb to "on_match_cb".
>>>>
>>>> Hmm, I think I see. Would it make sense to pass the expected addr to
>>>> hp_remove() and double check we don't NULL out something unexpected? --
>>>> maybe just for a DEBUG option.
>>>>
>>>> I'm always seeing the NOHZ_FULL guys hating on this :-)
>>>
>>> That's a fair point. Sure, we can do this as an extra safety net. For now I
>>> will just make the check always present, we can always move it to a debug
>>> option later.
>>>
>>> And now I notice that hp_remove is also used for CPU hotplug (grep
>>> matches for cpuhp_remove_state()). I wonder if we should go for something
>>> more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ?
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>
>>> -- 
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
diff mbox series

Patch

diff --git a/include/linux/hp.h b/include/linux/hp.h
new file mode 100644
index 000000000000..e85fc4365ea2
--- /dev/null
+++ b/include/linux/hp.h
@@ -0,0 +1,158 @@ 
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+#ifndef _LINUX_HP_H
+#define _LINUX_HP_H
+
+/*
+ * HP: Hazard Pointers
+ *
+ * This API provides existence guarantees of objects through hazard
+ * pointers.
+ *
+ * It uses a fixed number of hazard pointer slots (nr_cpus) across the
+ * entire system for each HP domain.
+ *
+ * Its main benefit over RCU is that it allows fast reclaim of
+ * HP-protected pointers without needing to wait for a grace period.
+ *
+ * It also allows the hazard pointer scan to call a user-defined callback
+ * to retire a hazard pointer slot immediately if needed. This callback
+ * may, for instance, issue an IPI to the relevant CPU.
+ *
+ * References:
+ *
+ * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
+ *      lock-free objects," in IEEE Transactions on Parallel and
+ *      Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
+ */
+
+#include <linux/rcupdate.h>
+
+/*
+ * Hazard pointer slot.
+ */
+struct hp_slot {
+	void *addr;
+};
+
+/*
+ * Hazard pointer context, returned by hp_use().
+ */
+struct hp_ctx {
+	struct hp_slot *slot;
+	void *addr;
+};
+
+/*
+ * hp_scan: Scan hazard pointer domain for @addr.
+ *
+ * Scan hazard pointer domain for @addr.
+ * If @retire_cb is NULL, wait to observe that each slot contains a value
+ * that differs from @addr.
+ * If @retire_cb is non-NULL, invoke @callback for each slot containing
+ * @addr.
+ */
+void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
+	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));
+
+/* Get the hazard pointer context address (may be NULL). */
+static inline
+void *hp_ctx_addr(struct hp_ctx ctx)
+{
+	return ctx.addr;
+}
+
+/*
+ * hp_allocate: Allocate a hazard pointer.
+ *
+ * Allocate a hazard pointer slot for @addr. The object existence should
+ * be guaranteed by the caller. Expects to be called from preempt
+ * disable context.
+ *
+ * Returns a hazard pointer context.
+ */
+static inline
+struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
+{
+	struct hp_slot *slot;
+	struct hp_ctx ctx;
+
+	if (!addr)
+		goto fail;
+	slot = this_cpu_ptr(percpu_slots);
+	/*
+	 * A single hazard pointer slot per CPU is available currently.
+	 * Other hazard pointer domains can eventually have a different
+	 * configuration.
+	 */
+	if (READ_ONCE(slot->addr))
+		goto fail;
+	WRITE_ONCE(slot->addr, addr);	/* Store B */
+	ctx.slot = slot;
+	ctx.addr = addr;
+	return ctx;
+
+fail:
+	ctx.slot = NULL;
+	ctx.addr = NULL;
+	return ctx;
+}
+
+/*
+ * hp_dereference_allocate: Dereference and allocate a hazard pointer.
+ *
+ * Returns a hazard pointer context. Expects to be called from preempt
+ * disable context.
+ */
+static inline
+struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
+{
+	void *addr, *addr2;
+	struct hp_ctx ctx;
+
+	addr = READ_ONCE(*addr_p);
+retry:
+	ctx = hp_allocate(percpu_slots, addr);
+	if (!hp_ctx_addr(ctx))
+		goto fail;
+	/* Memory ordering: Store B before Load A. */
+	smp_mb();
+	/*
+	 * Use RCU dereference without lockdep checks, because
+	 * lockdep is not aware of HP guarantees.
+	 */
+	addr2 = rcu_access_pointer(*addr_p);	/* Load A */
+	/*
+	 * If @addr_p content has changed since the first load,
+	 * clear the hazard pointer and try again.
+	 */
+	if (!ptr_eq(addr2, addr)) {
+		WRITE_ONCE(ctx.slot->addr, NULL);
+		if (!addr2)
+			goto fail;
+		addr = addr2;
+		goto retry;
+	}
+	/*
+	 * Use addr2 loaded from rcu_access_pointer() to preserve
+	 * address dependency ordering.
+	 */
+	ctx.addr = addr2;
+	return ctx;
+
+fail:
+	ctx.slot = NULL;
+	ctx.addr = NULL;
+	return ctx;
+}
+
+/* Retire the hazard pointer in @ctx. */
+static inline
+void hp_retire(const struct hp_ctx ctx)
+{
+	smp_store_release(&ctx.slot->addr, NULL);
+}
+
+#endif /* _LINUX_HP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..ec16de96fa80 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -7,7 +7,7 @@  obj-y     = fork.o exec_domain.o panic.o \
 	    cpu.o exit.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o user.o \
 	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
-	    extable.o params.o \
+	    extable.o params.o hp.o \
 	    kthread.o sys_ni.o nsproxy.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
 	    async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
diff --git a/kernel/hp.c b/kernel/hp.c
new file mode 100644
index 000000000000..b2447bf15300
--- /dev/null
+++ b/kernel/hp.c
@@ -0,0 +1,46 @@ 
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+/*
+ * HP: Hazard Pointers
+ */
+
+#include <linux/hp.h>
+#include <linux/percpu.h>
+
+/*
+ * hp_scan: Scan hazard pointer domain for @addr.
+ *
+ * Scan hazard pointer domain for @addr.
+ * If @retire_cb is non-NULL, invoke @callback for each slot containing
+ * @addr.
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ */
+void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
+	     void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
+{
+	int cpu;
+
+	/*
+	 * Store A precedes hp_scan(): it unpublishes addr (sets it to
+	 * NULL or to a different value), and thus hides it from hazard
+	 * pointer readers.
+	 */
+
+	if (!addr)
+		return;
+	/* Memory ordering: Store A before Load B. */
+	smp_mb();
+	/* Scan all CPUs slots. */
+	for_each_possible_cpu(cpu) {
+		struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
+
+		if (retire_cb && smp_load_acquire(&slot->addr) == addr)	/* Load B */
+			retire_cb(cpu, slot, addr);
+		/* Busy-wait if node is found. */
+		while ((smp_load_acquire(&slot->addr)) == addr)	/* Load B */
+			cpu_relax();
+	}
+}