Message ID | 20241004182734.1761555-4-mathieu.desnoyers@efficios.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | sched+mm: Track lazy active mm existence with hazard pointers | expand |
On Fri, Oct 4, 2024 at 2:29 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > This API provides existence guarantees of objects through Hazard > Pointers (HP). This minimalist implementation is specific to use > with preemption disabled, but can be extended further as needed. > > Each HP domain defines a fixed number of hazard pointer slots (nr_cpus) > across the entire system. > > Its main benefit over RCU is that it allows fast reclaim of > HP-protected pointers without needing to wait for a grace period. > > It also allows the hazard pointer scan to call a user-defined callback > to retire a hazard pointer slot immediately if needed. This callback > may, for instance, issue an IPI to the relevant CPU. > > There are a few possible use-cases for this in the Linux kernel: > > - Improve performance of mm_count by replacing lazy active mm by HP. > - Guarantee object existence on pointer dereference to use refcount: > - replace locking used for that purpose in some drivers, > - replace RCU + inc_not_zero pattern, > - rtmutex: Improve situations where locks need to be taken in > reverse dependency chain order by guaranteeing existence of > first and second locks in traversal order, allowing them to be > locked in the correct order (which is reverse from traversal > order) rather than try-lock+retry on nested lock. > > References: > > [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for > lock-free objects," in IEEE Transactions on Parallel and > Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 [ ... ] > --- > Changes since v0: > - Remove slot variable from hp_dereference_allocate(). > --- > include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++ > kernel/Makefile | 2 +- > kernel/hp.c | 46 +++++++++++++ Just a housekeeping comment, ISTR Linus looking down on adding bodies of C code to header files (like hp_dereference_allocate). I understand maybe the rationale is that the functions included are inlined. But do all of them have to be inlined? Such headers also hurt code browsing capabilities in code browsers like clangd. clangd doesn't understand header files because it can't independently compile them -- it uses the compiler to generate and extract the AST for superior code browsing/completion. Also have you looked at the benefits of inlining for hp.h? hp_dereference_allocate() seems large enough that inlining may not matter much, but I haven't compiled it and looked at the asm myself. Will continue staring at the code. thanks, - Joel > 3 files changed, 205 insertions(+), 1 deletion(-) > create mode 100644 include/linux/hp.h > create mode 100644 kernel/hp.c > > diff --git a/include/linux/hp.h b/include/linux/hp.h > new file mode 100644 > index 000000000000..e85fc4365ea2 > --- /dev/null > +++ b/include/linux/hp.h > @@ -0,0 +1,158 @@ > +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > +// > +// SPDX-License-Identifier: LGPL-2.1-or-later > + > +#ifndef _LINUX_HP_H > +#define _LINUX_HP_H > + > +/* > + * HP: Hazard Pointers > + * > + * This API provides existence guarantees of objects through hazard > + * pointers. > + * > + * It uses a fixed number of hazard pointer slots (nr_cpus) across the > + * entire system for each HP domain. > + * > + * Its main benefit over RCU is that it allows fast reclaim of > + * HP-protected pointers without needing to wait for a grace period. > + * > + * It also allows the hazard pointer scan to call a user-defined callback > + * to retire a hazard pointer slot immediately if needed. This callback > + * may, for instance, issue an IPI to the relevant CPU. > + * > + * References: > + * > + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for > + * lock-free objects," in IEEE Transactions on Parallel and > + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 > + */ > + > +#include <linux/rcupdate.h> > + > +/* > + * Hazard pointer slot. > + */ > +struct hp_slot { > + void *addr; > +}; > + > +/* > + * Hazard pointer context, returned by hp_use(). > + */ > +struct hp_ctx { > + struct hp_slot *slot; > + void *addr; > +}; > + > +/* > + * hp_scan: Scan hazard pointer domain for @addr. > + * > + * Scan hazard pointer domain for @addr. > + * If @retire_cb is NULL, wait to observe that each slot contains a value > + * that differs from @addr. > + * If @retire_cb is non-NULL, invoke @callback for each slot containing > + * @addr. > + */ > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)); > + > +/* Get the hazard pointer context address (may be NULL). */ > +static inline > +void *hp_ctx_addr(struct hp_ctx ctx) > +{ > + return ctx.addr; > +} > + > +/* > + * hp_allocate: Allocate a hazard pointer. > + * > + * Allocate a hazard pointer slot for @addr. The object existence should > + * be guaranteed by the caller. Expects to be called from preempt > + * disable context. > + * > + * Returns a hazard pointer context. > + */ > +static inline > +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr) > +{ > + struct hp_slot *slot; > + struct hp_ctx ctx; > + > + if (!addr) > + goto fail; > + slot = this_cpu_ptr(percpu_slots); > + /* > + * A single hazard pointer slot per CPU is available currently. > + * Other hazard pointer domains can eventually have a different > + * configuration. > + */ > + if (READ_ONCE(slot->addr)) > + goto fail; > + WRITE_ONCE(slot->addr, addr); /* Store B */ > + ctx.slot = slot; > + ctx.addr = addr; > + return ctx; > + > +fail: > + ctx.slot = NULL; > + ctx.addr = NULL; > + return ctx; > +} > + > +/* > + * hp_dereference_allocate: Dereference and allocate a hazard pointer. > + * > + * Returns a hazard pointer context. Expects to be called from preempt > + * disable context. > + */ > +static inline > +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p) > +{ > + void *addr, *addr2; > + struct hp_ctx ctx; > + > + addr = READ_ONCE(*addr_p); > +retry: > + ctx = hp_allocate(percpu_slots, addr); > + if (!hp_ctx_addr(ctx)) > + goto fail; > + /* Memory ordering: Store B before Load A. */ > + smp_mb(); > + /* > + * Use RCU dereference without lockdep checks, because > + * lockdep is not aware of HP guarantees. > + */ > + addr2 = rcu_access_pointer(*addr_p); /* Load A */ > + /* > + * If @addr_p content has changed since the first load, > + * clear the hazard pointer and try again. > + */ > + if (!ptr_eq(addr2, addr)) { > + WRITE_ONCE(ctx.slot->addr, NULL); > + if (!addr2) > + goto fail; > + addr = addr2; > + goto retry; > + } > + /* > + * Use addr2 loaded from rcu_access_pointer() to preserve > + * address dependency ordering. > + */ > + ctx.addr = addr2; > + return ctx; > + > +fail: > + ctx.slot = NULL; > + ctx.addr = NULL; > + return ctx; > +} > + > +/* Retire the hazard pointer in @ctx. */ > +static inline > +void hp_retire(const struct hp_ctx ctx) > +{ > + smp_store_release(&ctx.slot->addr, NULL); > +} > + > +#endif /* _LINUX_HP_H */ > diff --git a/kernel/Makefile b/kernel/Makefile > index 3c13240dfc9f..ec16de96fa80 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \ > cpu.o exit.o softirq.o resource.o \ > sysctl.o capability.o ptrace.o user.o \ > signal.o sys.o umh.o workqueue.o pid.o task_work.o \ > - extable.o params.o \ > + extable.o params.o hp.o \ > kthread.o sys_ni.o nsproxy.o \ > notifier.o ksysfs.o cred.o reboot.o \ > async.o range.o smpboot.o ucount.o regset.o ksyms_common.o > diff --git a/kernel/hp.c b/kernel/hp.c > new file mode 100644 > index 000000000000..b2447bf15300 > --- /dev/null > +++ b/kernel/hp.c > @@ -0,0 +1,46 @@ > +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > +// > +// SPDX-License-Identifier: LGPL-2.1-or-later > + > +/* > + * HP: Hazard Pointers > + */ > + > +#include <linux/hp.h> > +#include <linux/percpu.h> > + > +/* > + * hp_scan: Scan hazard pointer domain for @addr. > + * > + * Scan hazard pointer domain for @addr. > + * If @retire_cb is non-NULL, invoke @callback for each slot containing > + * @addr. > + * Wait to observe that each slot contains a value that differs from > + * @addr before returning. > + */ > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > +{ > + int cpu; > + > + /* > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > + * NULL or to a different value), and thus hides it from hazard > + * pointer readers. > + */ > + > + if (!addr) > + return; > + /* Memory ordering: Store A before Load B. */ > + smp_mb(); > + /* Scan all CPUs slots. */ > + for_each_possible_cpu(cpu) { > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); > + > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ > + retire_cb(cpu, slot, addr); > + /* Busy-wait if node is found. */ > + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ > + cpu_relax(); > + } > +} > -- > 2.39.2 >
Le Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers a écrit : > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > +{ > + int cpu; > + > + /* > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > + * NULL or to a different value), and thus hides it from hazard > + * pointer readers. > + */ > + > + if (!addr) > + return; > + /* Memory ordering: Store A before Load B. */ > + smp_mb(); > + /* Scan all CPUs slots. */ > + for_each_possible_cpu(cpu) { > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); > + > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ > + retire_cb(cpu, slot, addr); > + /* Busy-wait if node is found. */ > + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ > + cpu_relax(); You agree that having a single possible per-cpu pointer per context and a busy waiting update side pointer release can't be a general purpose hazard pointer implementation, right? :-) Thanks. > + } > +} > -- > 2.39.2 >
On 2024-10-05 13:19, Frederic Weisbecker wrote: > Le Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers a écrit : >> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, >> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) >> +{ >> + int cpu; >> + >> + /* >> + * Store A precedes hp_scan(): it unpublishes addr (sets it to >> + * NULL or to a different value), and thus hides it from hazard >> + * pointer readers. >> + */ >> + >> + if (!addr) >> + return; >> + /* Memory ordering: Store A before Load B. */ >> + smp_mb(); >> + /* Scan all CPUs slots. */ >> + for_each_possible_cpu(cpu) { >> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); >> + >> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ >> + retire_cb(cpu, slot, addr); >> + /* Busy-wait if node is found. */ >> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ >> + cpu_relax(); > > You agree that having a single possible per-cpu pointer per context and a busy > waiting update side pointer release can't be a general purpose hazard pointer > implementation, right? :-) Of course. This is a minimalist implementation, which can be extended in various ways, some of which I've implemented as POC in userspace already: - Increase the number of per-cpu slots available, - Distinguish between current scan depth target and available per-cpu slots, - Fall-back to reference counter when slots are full, - Allow scanning for a range of addresses (useful for type-safe memory), - Allow scanning for a set of hazard pointers (scan batching) using Bloom filters to probabilistically speed up the comparison (not implemented yet). - Implement a queued blocking wait/wakeup when HP scan must wait (not implemented yet). - Implement a HP-to-refcount promotion triggered by the HP scan callback to promote hazard pointers which would be blocked on to a reference count increment. (not implemented yet) - Use hazard pointers + refcount to implement smart pointers, which could be useful for Rust. (not implemented yet) But my general approach is to wait until the use-cases justify adding features. Although if you are curious about any of the points listed above, just ask and I'll be happy to discuss them in more depth. Thanks, Mathieu > > Thanks. > >> + } >> +} >> -- >> 2.39.2 >>
On 2024-10-04 23:25, Joel Fernandes wrote: > On Fri, Oct 4, 2024 at 2:29 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> This API provides existence guarantees of objects through Hazard >> Pointers (HP). This minimalist implementation is specific to use >> with preemption disabled, but can be extended further as needed. >> >> Each HP domain defines a fixed number of hazard pointer slots (nr_cpus) >> across the entire system. >> >> Its main benefit over RCU is that it allows fast reclaim of >> HP-protected pointers without needing to wait for a grace period. >> >> It also allows the hazard pointer scan to call a user-defined callback >> to retire a hazard pointer slot immediately if needed. This callback >> may, for instance, issue an IPI to the relevant CPU. >> >> There are a few possible use-cases for this in the Linux kernel: >> >> - Improve performance of mm_count by replacing lazy active mm by HP. >> - Guarantee object existence on pointer dereference to use refcount: >> - replace locking used for that purpose in some drivers, >> - replace RCU + inc_not_zero pattern, >> - rtmutex: Improve situations where locks need to be taken in >> reverse dependency chain order by guaranteeing existence of >> first and second locks in traversal order, allowing them to be >> locked in the correct order (which is reverse from traversal >> order) rather than try-lock+retry on nested lock. >> >> References: >> >> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for >> lock-free objects," in IEEE Transactions on Parallel and >> Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 > [ ... ] >> --- >> Changes since v0: >> - Remove slot variable from hp_dereference_allocate(). >> --- >> include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++ >> kernel/Makefile | 2 +- >> kernel/hp.c | 46 +++++++++++++ > > Just a housekeeping comment, ISTR Linus looking down on adding bodies > of C code to header files (like hp_dereference_allocate). I understand > maybe the rationale is that the functions included are inlined. But do > all of them have to be inlined? Such headers also hurt code browsing > capabilities in code browsers like clangd. clangd doesn't understand > header files because it can't independently compile them -- it uses > the compiler to generate and extract the AST for superior code > browsing/completion. > > Also have you looked at the benefits of inlining for hp.h? > hp_dereference_allocate() seems large enough that inlining may not > matter much, but I haven't compiled it and looked at the asm myself. Here is a comparison in userspace: * With "hp dereference allocate" inlined: test_hpref_benchmark (smp_mb) nr_reads 1994298193 nr_writes 22293162 nr_ops 2016591355 test_hpref_benchmark (barrier/membarrier) nr_reads 15208690879 nr_writes 1893785 nr_ops 15210584664 * With "hp dereference allocate" implemented as a function call: test_hpref_benchmark (smp_mb) nr_reads 1558924716 nr_writes 14261028 nr_ops 1573185744 test_hpref_benchmark (barrier/membarrier) nr_reads 5881131707 nr_writes 2005140 nr_ops 5883136847 So the overhead of the function call when using symmetric memory barriers between hp allocate/hp scan is a 20% slowdown. It's worse in the asymmetric barrier/membarrier case, introducing a 61% slowdown. Given that the overhead is noticeable, I am tempted to leave the hazard pointer allocate/retire as inline functions. About code browsers like clangd, I would recommend improving the tooling rather than alter the design of the code based on current tooling limitations. Thanks, Mathieu
On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote: > include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++ > kernel/Makefile | 2 +- > kernel/hp.c | 46 +++++++++++++ > 3 files changed, 205 insertions(+), 1 deletion(-) > create mode 100644 include/linux/hp.h > create mode 100644 kernel/hp.c > > diff --git a/include/linux/hp.h b/include/linux/hp.h > new file mode 100644 > index 000000000000..e85fc4365ea2 > --- /dev/null > +++ b/include/linux/hp.h > @@ -0,0 +1,158 @@ > +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > +// > +// SPDX-License-Identifier: LGPL-2.1-or-later > + > +#ifndef _LINUX_HP_H > +#define _LINUX_HP_H > + > +/* > + * HP: Hazard Pointers > + * > + * This API provides existence guarantees of objects through hazard > + * pointers. > + * > + * It uses a fixed number of hazard pointer slots (nr_cpus) across the > + * entire system for each HP domain. > + * > + * Its main benefit over RCU is that it allows fast reclaim of > + * HP-protected pointers without needing to wait for a grace period. > + * > + * It also allows the hazard pointer scan to call a user-defined callback > + * to retire a hazard pointer slot immediately if needed. This callback > + * may, for instance, issue an IPI to the relevant CPU. > + * > + * References: > + * > + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for > + * lock-free objects," in IEEE Transactions on Parallel and > + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 > + */ > + > +#include <linux/rcupdate.h> > + > +/* > + * Hazard pointer slot. > + */ > +struct hp_slot { > + void *addr; > +}; > + > +/* > + * Hazard pointer context, returned by hp_use(). > + */ > +struct hp_ctx { > + struct hp_slot *slot; > + void *addr; > +}; > + > +/* > + * hp_scan: Scan hazard pointer domain for @addr. > + * > + * Scan hazard pointer domain for @addr. > + * If @retire_cb is NULL, wait to observe that each slot contains a value > + * that differs from @addr. > + * If @retire_cb is non-NULL, invoke @callback for each slot containing > + * @addr. > + */ > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)); struct hp_domain { struct hp_slot __percpu *slots }; might clarify things a wee little. > + > +/* Get the hazard pointer context address (may be NULL). */ > +static inline > +void *hp_ctx_addr(struct hp_ctx ctx) > +{ > + return ctx.addr; > +} From where I'm sitting this seems like superfluous fluff, what's wrong with ctx.addr ? > +/* > + * hp_allocate: Allocate a hazard pointer. > + * > + * Allocate a hazard pointer slot for @addr. The object existence should > + * be guaranteed by the caller. Expects to be called from preempt > + * disable context. > + * > + * Returns a hazard pointer context. So you made the WTF'o'meter crack, this here function does not allocate nothing. Naming is bad. At best this is something like try-set-hazard-pointer or somesuch. > + */ > +static inline > +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr) > +{ > + struct hp_slot *slot; > + struct hp_ctx ctx; > + > + if (!addr) > + goto fail; > + slot = this_cpu_ptr(percpu_slots); > + /* > + * A single hazard pointer slot per CPU is available currently. > + * Other hazard pointer domains can eventually have a different > + * configuration. > + */ > + if (READ_ONCE(slot->addr)) > + goto fail; > + WRITE_ONCE(slot->addr, addr); /* Store B */ > + ctx.slot = slot; > + ctx.addr = addr; > + return ctx; > + > +fail: > + ctx.slot = NULL; > + ctx.addr = NULL; > + return ctx; > +} > + > +/* > + * hp_dereference_allocate: Dereference and allocate a hazard pointer. > + * > + * Returns a hazard pointer context. Expects to be called from preempt > + * disable context. > + */ More terrible naming. Same as above, but additionally, I would expect a 'dereference' to actually dereference the pointer and have a return value of the dereferenced type. This function seems to double check and update the hp_ctx thing. I'm not at all sure yet wtf this is doing -- and the total lack of comments aren't helping. > +static inline > +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p) > +{ > + void *addr, *addr2; > + struct hp_ctx ctx; > + > + addr = READ_ONCE(*addr_p); > +retry: > + ctx = hp_allocate(percpu_slots, addr); > + if (!hp_ctx_addr(ctx)) > + goto fail; > + /* Memory ordering: Store B before Load A. */ > + smp_mb(); > + /* > + * Use RCU dereference without lockdep checks, because > + * lockdep is not aware of HP guarantees. > + */ > + addr2 = rcu_access_pointer(*addr_p); /* Load A */ > + /* > + * If @addr_p content has changed since the first load, > + * clear the hazard pointer and try again. > + */ > + if (!ptr_eq(addr2, addr)) { > + WRITE_ONCE(ctx.slot->addr, NULL); > + if (!addr2) > + goto fail; > + addr = addr2; > + goto retry; > + } > + /* > + * Use addr2 loaded from rcu_access_pointer() to preserve > + * address dependency ordering. > + */ > + ctx.addr = addr2; > + return ctx; > + > +fail: > + ctx.slot = NULL; > + ctx.addr = NULL; > + return ctx; > +} > + > +/* Retire the hazard pointer in @ctx. */ > +static inline > +void hp_retire(const struct hp_ctx ctx) > +{ > + smp_store_release(&ctx.slot->addr, NULL); > +} > + > +#endif /* _LINUX_HP_H */ > diff --git a/kernel/Makefile b/kernel/Makefile > index 3c13240dfc9f..ec16de96fa80 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \ > cpu.o exit.o softirq.o resource.o \ > sysctl.o capability.o ptrace.o user.o \ > signal.o sys.o umh.o workqueue.o pid.o task_work.o \ > - extable.o params.o \ > + extable.o params.o hp.o \ > kthread.o sys_ni.o nsproxy.o \ > notifier.o ksysfs.o cred.o reboot.o \ > async.o range.o smpboot.o ucount.o regset.o ksyms_common.o > diff --git a/kernel/hp.c b/kernel/hp.c > new file mode 100644 > index 000000000000..b2447bf15300 > --- /dev/null > +++ b/kernel/hp.c > @@ -0,0 +1,46 @@ > +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > +// > +// SPDX-License-Identifier: LGPL-2.1-or-later > + > +/* > + * HP: Hazard Pointers > + */ > + > +#include <linux/hp.h> > +#include <linux/percpu.h> > + > +/* > + * hp_scan: Scan hazard pointer domain for @addr. > + * > + * Scan hazard pointer domain for @addr. > + * If @retire_cb is non-NULL, invoke @callback for each slot containing > + * @addr. > + * Wait to observe that each slot contains a value that differs from > + * @addr before returning. > + */ > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > +{ > + int cpu; > + > + /* > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > + * NULL or to a different value), and thus hides it from hazard > + * pointer readers. > + */ > + > + if (!addr) > + return; > + /* Memory ordering: Store A before Load B. */ > + smp_mb(); > + /* Scan all CPUs slots. */ > + for_each_possible_cpu(cpu) { > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); > + > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ > + retire_cb(cpu, slot, addr); Is retirce_cb allowed to cmpxchg the thing? > + /* Busy-wait if node is found. */ > + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ > + cpu_relax(); This really should be using smp_cond_load_acquire() > + } > +}
On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote: > On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote: > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > > +{ > > + int cpu; > > + > > + /* > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > > + * NULL or to a different value), and thus hides it from hazard > > + * pointer readers. > > + */ This should probably assert we're in a preemptible context. Otherwise people will start using this in non-preemptible context and then we get to unfuck things later. > > + > > + if (!addr) > > + return; > > + /* Memory ordering: Store A before Load B. */ > > + smp_mb(); > > + /* Scan all CPUs slots. */ > > + for_each_possible_cpu(cpu) { > > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); > > + > > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ > > + retire_cb(cpu, slot, addr); > > Is retirce_cb allowed to cmpxchg the thing? > > > + /* Busy-wait if node is found. */ > > + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ > > + cpu_relax(); > > This really should be using smp_cond_load_acquire() > > > + } > > +}
On 2024-10-05 18:04, Peter Zijlstra wrote: > On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote: >> include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++ >> kernel/Makefile | 2 +- >> kernel/hp.c | 46 +++++++++++++ >> 3 files changed, 205 insertions(+), 1 deletion(-) >> create mode 100644 include/linux/hp.h >> create mode 100644 kernel/hp.c >> >> diff --git a/include/linux/hp.h b/include/linux/hp.h >> new file mode 100644 >> index 000000000000..e85fc4365ea2 >> --- /dev/null >> +++ b/include/linux/hp.h >> @@ -0,0 +1,158 @@ >> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> +// >> +// SPDX-License-Identifier: LGPL-2.1-or-later >> + >> +#ifndef _LINUX_HP_H >> +#define _LINUX_HP_H >> + >> +/* >> + * HP: Hazard Pointers >> + * >> + * This API provides existence guarantees of objects through hazard >> + * pointers. >> + * >> + * It uses a fixed number of hazard pointer slots (nr_cpus) across the >> + * entire system for each HP domain. >> + * >> + * Its main benefit over RCU is that it allows fast reclaim of >> + * HP-protected pointers without needing to wait for a grace period. >> + * >> + * It also allows the hazard pointer scan to call a user-defined callback >> + * to retire a hazard pointer slot immediately if needed. This callback >> + * may, for instance, issue an IPI to the relevant CPU. >> + * >> + * References: >> + * >> + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for >> + * lock-free objects," in IEEE Transactions on Parallel and >> + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 >> + */ >> + >> +#include <linux/rcupdate.h> >> + >> +/* >> + * Hazard pointer slot. >> + */ >> +struct hp_slot { >> + void *addr; >> +}; >> + >> +/* >> + * Hazard pointer context, returned by hp_use(). >> + */ >> +struct hp_ctx { >> + struct hp_slot *slot; >> + void *addr; >> +}; >> + >> +/* >> + * hp_scan: Scan hazard pointer domain for @addr. >> + * >> + * Scan hazard pointer domain for @addr. >> + * If @retire_cb is NULL, wait to observe that each slot contains a value >> + * that differs from @addr. >> + * If @retire_cb is non-NULL, invoke @callback for each slot containing >> + * @addr. >> + */ >> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, >> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)); > > struct hp_domain { > struct hp_slot __percpu *slots > }; > > might clarify things a wee little. Good point. This introduces: #define DECLARE_HP_DOMAIN(domain) \ extern struct hp_domain domain #define DEFINE_HP_DOMAIN(domain) \ static DEFINE_PER_CPU(struct hp_slot, __ ## domain ## _slots); \ struct hp_domain domain = { \ .percpu_slots = &__## domain ## _slots, \ } > >> + >> +/* Get the hazard pointer context address (may be NULL). */ >> +static inline >> +void *hp_ctx_addr(struct hp_ctx ctx) >> +{ >> + return ctx.addr; >> +} > > From where I'm sitting this seems like superfluous fluff, what's wrong > with ctx.addr ? I'm OK removing the accessor and just using ctx.addr. > >> +/* >> + * hp_allocate: Allocate a hazard pointer. >> + * >> + * Allocate a hazard pointer slot for @addr. The object existence should >> + * be guaranteed by the caller. Expects to be called from preempt >> + * disable context. >> + * >> + * Returns a hazard pointer context. > > So you made the WTF'o'meter crack, this here function does not allocate > nothing. Naming is bad. At best this is something like > try-set-hazard-pointer or somesuch. I went with the naming from the 2004 paper from Maged Michael, but I agree it could be clearer. I'm tempted to go for "hp_try_post()" and "hp_remove()", basically "posting" the intent to use a pointer (as in on a metaphorical billboard), and removing it when it's done. > >> + */ >> +static inline >> +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr) >> +{ >> + struct hp_slot *slot; >> + struct hp_ctx ctx; >> + >> + if (!addr) >> + goto fail; >> + slot = this_cpu_ptr(percpu_slots); >> + /* >> + * A single hazard pointer slot per CPU is available currently. >> + * Other hazard pointer domains can eventually have a different >> + * configuration. >> + */ >> + if (READ_ONCE(slot->addr)) >> + goto fail; >> + WRITE_ONCE(slot->addr, addr); /* Store B */ >> + ctx.slot = slot; >> + ctx.addr = addr; >> + return ctx; >> + >> +fail: >> + ctx.slot = NULL; >> + ctx.addr = NULL; >> + return ctx; >> +} >> + >> +/* >> + * hp_dereference_allocate: Dereference and allocate a hazard pointer. >> + * >> + * Returns a hazard pointer context. Expects to be called from preempt >> + * disable context. >> + */ > > More terrible naming. Same as above, but additionally, I would expect a > 'dereference' to actually dereference the pointer and have a return > value of the dereferenced type. hp_dereference_try_post() ? > > This function seems to double check and update the hp_ctx thing. I'm not > at all sure yet wtf this is doing -- and the total lack of comments > aren't helping. The hp_ctx contains the outputs. The function loads *addr_p to then try_post it into a HP slot. On success, it re-reads the *addr_p (with address dependency) and if it still matches, use that as output address pointer. I'm planning to remove hp_ctx, and just have: /* * hp_try_post: Try to post a hazard pointer. * * Post a hazard pointer slot for @addr. The object existence should * be guaranteed by the caller. Expects to be called from preempt * disable context. * * Returns true if post succeeds, false otherwise. */ static inline bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot) [...] /* * hp_dereference_try_post: Dereference and try to post a hazard pointer. * * Returns a hazard pointer context. Expects to be called from preempt * disable context. */ static inline void *__hp_dereference_try_post(struct hp_domain *hp_domain, void * const * addr_p, struct hp_slot **_slot) [...] #define hp_dereference_try_post(domain, p, slot_p) \ ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p)) /* Clear the hazard pointer in @slot. */ static inline void hp_remove(struct hp_slot *slot) [...] > >> +static inline >> +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p) >> +{ >> + void *addr, *addr2; >> + struct hp_ctx ctx; >> + >> + addr = READ_ONCE(*addr_p); >> +retry: >> + ctx = hp_allocate(percpu_slots, addr); >> + if (!hp_ctx_addr(ctx)) >> + goto fail; >> + /* Memory ordering: Store B before Load A. */ >> + smp_mb(); >> + /* >> + * Use RCU dereference without lockdep checks, because >> + * lockdep is not aware of HP guarantees. >> + */ >> + addr2 = rcu_access_pointer(*addr_p); /* Load A */ >> + /* >> + * If @addr_p content has changed since the first load, >> + * clear the hazard pointer and try again. >> + */ >> + if (!ptr_eq(addr2, addr)) { >> + WRITE_ONCE(ctx.slot->addr, NULL); >> + if (!addr2) >> + goto fail; >> + addr = addr2; >> + goto retry; >> + } >> + /* >> + * Use addr2 loaded from rcu_access_pointer() to preserve >> + * address dependency ordering. >> + */ >> + ctx.addr = addr2; >> + return ctx; >> + >> +fail: >> + ctx.slot = NULL; >> + ctx.addr = NULL; >> + return ctx; >> +} >> + >> +/* Retire the hazard pointer in @ctx. */ >> +static inline >> +void hp_retire(const struct hp_ctx ctx) >> +{ >> + smp_store_release(&ctx.slot->addr, NULL); >> +} >> + >> +#endif /* _LINUX_HP_H */ >> diff --git a/kernel/Makefile b/kernel/Makefile >> index 3c13240dfc9f..ec16de96fa80 100644 >> --- a/kernel/Makefile >> +++ b/kernel/Makefile >> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \ >> cpu.o exit.o softirq.o resource.o \ >> sysctl.o capability.o ptrace.o user.o \ >> signal.o sys.o umh.o workqueue.o pid.o task_work.o \ >> - extable.o params.o \ >> + extable.o params.o hp.o \ >> kthread.o sys_ni.o nsproxy.o \ >> notifier.o ksysfs.o cred.o reboot.o \ >> async.o range.o smpboot.o ucount.o regset.o ksyms_common.o >> diff --git a/kernel/hp.c b/kernel/hp.c >> new file mode 100644 >> index 000000000000..b2447bf15300 >> --- /dev/null >> +++ b/kernel/hp.c >> @@ -0,0 +1,46 @@ >> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> +// >> +// SPDX-License-Identifier: LGPL-2.1-or-later >> + >> +/* >> + * HP: Hazard Pointers >> + */ >> + >> +#include <linux/hp.h> >> +#include <linux/percpu.h> >> + >> +/* >> + * hp_scan: Scan hazard pointer domain for @addr. >> + * >> + * Scan hazard pointer domain for @addr. >> + * If @retire_cb is non-NULL, invoke @callback for each slot containing >> + * @addr. >> + * Wait to observe that each slot contains a value that differs from >> + * @addr before returning. >> + */ >> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, >> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) >> +{ >> + int cpu; >> + >> + /* >> + * Store A precedes hp_scan(): it unpublishes addr (sets it to >> + * NULL or to a different value), and thus hides it from hazard >> + * pointer readers. >> + */ >> + >> + if (!addr) >> + return; >> + /* Memory ordering: Store A before Load B. */ >> + smp_mb(); >> + /* Scan all CPUs slots. */ >> + for_each_possible_cpu(cpu) { >> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); >> + >> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ >> + retire_cb(cpu, slot, addr); > > Is retirce_cb allowed to cmpxchg the thing? It could, but we'd need to make sure the slot is not re-used by another hp_try_post() before the current user removes its own post. It would need to synchronize with the current HP user (e.g. with IPI). I've actually renamed retire_cb to "on_match_cb". > >> + /* Busy-wait if node is found. */ >> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ >> + cpu_relax(); > > This really should be using smp_cond_load_acquire() Good point, Thanks, Mathieu > >> + } >> +}
On 2024-10-05 18:07, Peter Zijlstra wrote: > On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote: >> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote: > >>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, >>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) >>> +{ >>> + int cpu; >>> + >>> + /* >>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to >>> + * NULL or to a different value), and thus hides it from hazard >>> + * pointer readers. >>> + */ > > This should probably assert we're in a preemptible context. Otherwise > people will start using this in non-preemptible context and then we get > to unfuck things later. Something like this ? + /* Should only be called from preemptible context. */ + WARN_ON_ONCE(in_atomic()); > >>> + >>> + if (!addr) >>> + return; >>> + /* Memory ordering: Store A before Load B. */ >>> + smp_mb(); >>> + /* Scan all CPUs slots. */ >>> + for_each_possible_cpu(cpu) { >>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); >>> + >>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ >>> + retire_cb(cpu, slot, addr); >> >> Is retirce_cb allowed to cmpxchg the thing? Renaming retire_cb to "on_match_cb". Whatever the callback does needs to be done with knowledge of the slot user (e.g. IPI). >> >>> + /* Busy-wait if node is found. */ >>> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ >>> + cpu_relax(); >> >> This really should be using smp_cond_load_acquire() Done, Thanks, Mathieu >> >>> + } >>> +}
On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote: > On 2024-10-05 18:04, Peter Zijlstra wrote: > > > +/* > > > + * hp_allocate: Allocate a hazard pointer. > > > + * > > > + * Allocate a hazard pointer slot for @addr. The object existence should > > > + * be guaranteed by the caller. Expects to be called from preempt > > > + * disable context. > > > + * > > > + * Returns a hazard pointer context. > > > > So you made the WTF'o'meter crack, this here function does not allocate > > nothing. Naming is bad. At best this is something like > > try-set-hazard-pointer or somesuch. > > I went with the naming from the 2004 paper from Maged Michael, but I > agree it could be clearer. > > I'm tempted to go for "hp_try_post()" and "hp_remove()", basically > "posting" the intent to use a pointer (as in on a metaphorical billboard), > and removing it when it's done. For RCU we've taken to using the word: 'publish', no? > > > +/* > > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer. > > > + * > > > + * Returns a hazard pointer context. Expects to be called from preempt > > > + * disable context. > > > + */ > > > > More terrible naming. Same as above, but additionally, I would expect a > > 'dereference' to actually dereference the pointer and have a return > > value of the dereferenced type. > > hp_dereference_try_post() ? > > > > > This function seems to double check and update the hp_ctx thing. I'm not > > at all sure yet wtf this is doing -- and the total lack of comments > > aren't helping. > > The hp_ctx contains the outputs. > > The function loads *addr_p to then try_post it into a HP slot. On success, > it re-reads the *addr_p (with address dependency) and if it still matches, > use that as output address pointer. > > I'm planning to remove hp_ctx, and just have: > > /* > * hp_try_post: Try to post a hazard pointer. > * > * Post a hazard pointer slot for @addr. The object existence should > * be guaranteed by the caller. Expects to be called from preempt > * disable context. > * > * Returns true if post succeeds, false otherwise. > */ > static inline > bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot) > [...] > > /* > * hp_dereference_try_post: Dereference and try to post a hazard pointer. > * > * Returns a hazard pointer context. Expects to be called from preempt > * disable context. > */ > static inline > void *__hp_dereference_try_post(struct hp_domain *hp_domain, > void * const * addr_p, struct hp_slot **_slot) > [...] > > #define hp_dereference_try_post(domain, p, slot_p) \ > ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p)) This will compile, but do the wrong thing when p is a regular pointer, no? > > /* Clear the hazard pointer in @slot. */ > static inline > void hp_remove(struct hp_slot *slot) > [...] Differently weird, but better I suppose :-) > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > > > +{ > > > + int cpu; > > > + > > > + /* > > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > > > + * NULL or to a different value), and thus hides it from hazard > > > + * pointer readers. > > > + */ > > > + > > > + if (!addr) > > > + return; > > > + /* Memory ordering: Store A before Load B. */ > > > + smp_mb(); > > > + /* Scan all CPUs slots. */ > > > + for_each_possible_cpu(cpu) { > > > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); > > > + > > > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ > > > + retire_cb(cpu, slot, addr); > > > > Is retirce_cb allowed to cmpxchg the thing? > > It could, but we'd need to make sure the slot is not re-used by another > hp_try_post() before the current user removes its own post. It would > need to synchronize with the current HP user (e.g. with IPI). > > I've actually renamed retire_cb to "on_match_cb". Hmm, I think I see. Would it make sense to pass the expected addr to hp_remove() and double check we don't NULL out something unexpected? -- maybe just for a DEBUG option. I'm always seeing the NOHZ_FULL guys hating on this :-)
On Sat, Oct 05, 2024 at 02:56:26PM -0400, Mathieu Desnoyers wrote: > On 2024-10-05 18:07, Peter Zijlstra wrote: > > On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote: > > > On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote: > > > > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > > > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > > > > +{ > > > > + int cpu; > > > > + > > > > + /* > > > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > > > > + * NULL or to a different value), and thus hides it from hazard > > > > + * pointer readers. > > > > + */ > > > > This should probably assert we're in a preemptible context. Otherwise > > people will start using this in non-preemptible context and then we get > > to unfuck things later. > > Something like this ? > > + /* Should only be called from preemptible context. */ > + WARN_ON_ONCE(in_atomic()); lockdep_assert_preemption_enabled(); that also checks local IRQ state IIRC.
On 2024-10-07 12:42, Peter Zijlstra wrote: > On Sat, Oct 05, 2024 at 02:56:26PM -0400, Mathieu Desnoyers wrote: >> On 2024-10-05 18:07, Peter Zijlstra wrote: >>> On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote: >>>> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote: >>> >>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, >>>>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) >>>>> +{ >>>>> + int cpu; >>>>> + >>>>> + /* >>>>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to >>>>> + * NULL or to a different value), and thus hides it from hazard >>>>> + * pointer readers. >>>>> + */ >>> >>> This should probably assert we're in a preemptible context. Otherwise >>> people will start using this in non-preemptible context and then we get >>> to unfuck things later. >> >> Something like this ? >> >> + /* Should only be called from preemptible context. */ >> + WARN_ON_ONCE(in_atomic()); > > lockdep_assert_preemption_enabled(); > > that also checks local IRQ state IIRC. I'll use this instead, thanks! Mathieu
On 2024-10-07 12:40, Peter Zijlstra wrote: > On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote: >> On 2024-10-05 18:04, Peter Zijlstra wrote: > > >>>> +/* >>>> + * hp_allocate: Allocate a hazard pointer. >>>> + * >>>> + * Allocate a hazard pointer slot for @addr. The object existence should >>>> + * be guaranteed by the caller. Expects to be called from preempt >>>> + * disable context. >>>> + * >>>> + * Returns a hazard pointer context. >>> >>> So you made the WTF'o'meter crack, this here function does not allocate >>> nothing. Naming is bad. At best this is something like >>> try-set-hazard-pointer or somesuch. >> >> I went with the naming from the 2004 paper from Maged Michael, but I >> agree it could be clearer. >> >> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically >> "posting" the intent to use a pointer (as in on a metaphorical billboard), >> and removing it when it's done. > > For RCU we've taken to using the word: 'publish', no? I'm so glad you suggest this, because it turns out that from all the possible words you could choose from, 'publish' is probably the most actively confusing. I'll explain. Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers through a simple example: [ Note: I've renamed the HP dereference try_post to HP load try_post based on further discussion below. ] *** RCU *** * Dereference RCU-protected pointer: rcu_read_lock(); // [ Begin read transaction ] l_p = rcu_dereference(p); // [ Load p: @addr or NULL ] if (l_p) [ use *l_p ...] rcu_read_unlock(); // [ End read transaction ] * Publish @addr: addr = kmalloc(); init(addr); rcu_assign_pointer(p, addr); * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] synchronize_rcu(); // Wait for all pre-existing // read transactions to complete. kfree(addr); *** Hazard Pointers *** * Load and post a HP-protected pointer: l_p = hp_load_try_post(domain, &p, &slot); if (l_p) { [ use *l_p ...] hp_remove(&slot, l_p); } * Publish @addr: addr = kmalloc(); init(addr); rcu_assign_pointer(p, addr); * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] hp_scan(domain, addr, NULL); kfree(addr); Both HP and RCU have publication guarantees, which can in fact be implemented in the same way (e.g. rcu_assign_pointer paired with something that respects address dependencies ordering). A stronger implementation of this would be pairing a store-release with a load-acquire: it works, but it would add needless overhead on weakly-ordered CPUs. How the two mechanisms differ is in how they track when it is safe to reclaim @addr. RCU tracks reader "transactions" begin/end, and makes sure that all pre-existing transactions are gone before synchronize_rcu() is allowed to complete. HP does this by tracking "posted" pointer slots with a HP domain. As long as hp_scan observes that HP readers are showing interest in @addr, it will wait. One notable difference between RCU and HP is that HP knows exactly which pointer is blocking progress, and from which CPU (at least with my per-CPU HP domain implementation). Therefore, it is possible for HP to issue an IPI and make sure the HP user either completes its use of the pointer quickly, or stops using it right away (e.g. making the active mm use idle mm instead). One strength of RCU is that it can track use of a whole set of RCU pointers just by tracking reader transaction begin/end, but this is also one of its weaknesses: a long reader transaction can postpone completion of grace period for a long time and increase the memory footprint. In comparison, HP can immediately complete as soon as the pointer it is scanning for is gone. Even better, it can send an IPI to the belate CPU and abort use of the pointer using a callback. > > >>>> +/* >>>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer. >>>> + * >>>> + * Returns a hazard pointer context. Expects to be called from preempt >>>> + * disable context. >>>> + */ >>> >>> More terrible naming. Same as above, but additionally, I would expect a >>> 'dereference' to actually dereference the pointer and have a return >>> value of the dereferenced type. >> >> hp_dereference_try_post() ? >> >>> >>> This function seems to double check and update the hp_ctx thing. I'm not >>> at all sure yet wtf this is doing -- and the total lack of comments >>> aren't helping. >> >> The hp_ctx contains the outputs. >> >> The function loads *addr_p to then try_post it into a HP slot. On success, >> it re-reads the *addr_p (with address dependency) and if it still matches, >> use that as output address pointer. >> >> I'm planning to remove hp_ctx, and just have: >> >> /* >> * hp_try_post: Try to post a hazard pointer. >> * >> * Post a hazard pointer slot for @addr. The object existence should >> * be guaranteed by the caller. Expects to be called from preempt >> * disable context. >> * >> * Returns true if post succeeds, false otherwise. >> */ >> static inline >> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot) >> [...] >> >> /* >> * hp_dereference_try_post: Dereference and try to post a hazard pointer. >> * >> * Returns a hazard pointer context. Expects to be called from preempt >> * disable context. >> */ >> static inline >> void *__hp_dereference_try_post(struct hp_domain *hp_domain, >> void * const * addr_p, struct hp_slot **_slot) >> [...] >> >> #define hp_dereference_try_post(domain, p, slot_p) \ >> ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p)) > > This will compile, but do the wrong thing when p is a regular pointer, no? Right, at least in some cases the compiler may not complain, and people used to rcu_dereference() will expect that "p" is the pointer to load rather than the address of that pointer. This would be unexpected. I must admit that passing the address holding the pointer to load rather than the pointer to load itself makes it much less troublesome in terms of macro layers. But perhaps this is another example where we should wander away from the beaten path and use a word different from "dereference" here. E.g.: /* * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p)) * to generate a compile error if addr_p is not a pointer to a pointer. */ #define hp_load_try_post(domain, addr_p, slot_p) \ ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p)) > >> >> /* Clear the hazard pointer in @slot. */ >> static inline >> void hp_remove(struct hp_slot *slot) >> [...] > > Differently weird, but better I suppose :-) If you find a better word than "remove" to pair with "post", I'm all in :) > > >>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, >>>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) >>>> +{ >>>> + int cpu; >>>> + >>>> + /* >>>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to >>>> + * NULL or to a different value), and thus hides it from hazard >>>> + * pointer readers. >>>> + */ >>>> + >>>> + if (!addr) >>>> + return; >>>> + /* Memory ordering: Store A before Load B. */ >>>> + smp_mb(); >>>> + /* Scan all CPUs slots. */ >>>> + for_each_possible_cpu(cpu) { >>>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); >>>> + >>>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ >>>> + retire_cb(cpu, slot, addr); >>> >>> Is retirce_cb allowed to cmpxchg the thing? >> >> It could, but we'd need to make sure the slot is not re-used by another >> hp_try_post() before the current user removes its own post. It would >> need to synchronize with the current HP user (e.g. with IPI). >> >> I've actually renamed retire_cb to "on_match_cb". > > Hmm, I think I see. Would it make sense to pass the expected addr to > hp_remove() and double check we don't NULL out something unexpected? -- > maybe just for a DEBUG option. > > I'm always seeing the NOHZ_FULL guys hating on this :-) That's a fair point. Sure, we can do this as an extra safety net. For now I will just make the check always present, we can always move it to a debug option later. And now I notice that hp_remove is also used for CPU hotplug (grep matches for cpuhp_remove_state()). I wonder if we should go for something more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ? Thanks, Mathieu
On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote: > On 2024-10-07 12:40, Peter Zijlstra wrote: > > On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote: > > > On 2024-10-05 18:04, Peter Zijlstra wrote: > > > > > > > > > +/* > > > > > + * hp_allocate: Allocate a hazard pointer. > > > > > + * > > > > > + * Allocate a hazard pointer slot for @addr. The object existence should > > > > > + * be guaranteed by the caller. Expects to be called from preempt > > > > > + * disable context. > > > > > + * > > > > > + * Returns a hazard pointer context. > > > > > > > > So you made the WTF'o'meter crack, this here function does not allocate > > > > nothing. Naming is bad. At best this is something like > > > > try-set-hazard-pointer or somesuch. > > > > > > I went with the naming from the 2004 paper from Maged Michael, but I > > > agree it could be clearer. > > > > > > I'm tempted to go for "hp_try_post()" and "hp_remove()", basically > > > "posting" the intent to use a pointer (as in on a metaphorical billboard), > > > and removing it when it's done. > > > > For RCU we've taken to using the word: 'publish', no? > > I'm so glad you suggest this, because it turns out that from all > the possible words you could choose from, 'publish' is probably the > most actively confusing. I'll explain. > > Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers > through a simple example: > > [ Note: I've renamed the HP dereference try_post to HP load try_post > based on further discussion below. ] > > *** RCU *** > > * Dereference RCU-protected pointer: > rcu_read_lock(); // [ Begin read transaction ] > l_p = rcu_dereference(p); // [ Load p: @addr or NULL ] > if (l_p) > [ use *l_p ...] > rcu_read_unlock(); // [ End read transaction ] > > * Publish @addr: addr = kmalloc(); > init(addr); > rcu_assign_pointer(p, addr); > > * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] > synchronize_rcu(); // Wait for all pre-existing > // read transactions to complete. > kfree(addr); > > > *** Hazard Pointers *** > > * Load and post a HP-protected pointer: > l_p = hp_load_try_post(domain, &p, &slot); > if (l_p) { > [ use *l_p ...] > hp_remove(&slot, l_p); > } > > * Publish @addr: addr = kmalloc(); > init(addr); > rcu_assign_pointer(p, addr); > > * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] > hp_scan(domain, addr, NULL); > kfree(addr); > > Both HP and RCU have publication guarantees, which can in fact be > implemented in the same way (e.g. rcu_assign_pointer paired with > something that respects address dependencies ordering). A stronger > implementation of this would be pairing a store-release with a > load-acquire: it works, but it would add needless overhead on > weakly-ordered CPUs. > > How the two mechanisms differ is in how they track when it is > safe to reclaim @addr. RCU tracks reader "transactions" begin/end, > and makes sure that all pre-existing transactions are gone before > synchronize_rcu() is allowed to complete. HP does this by tracking > "posted" pointer slots with a HP domain. As long as hp_scan observes > that HP readers are showing interest in @addr, it will wait. > > One notable difference between RCU and HP is that HP knows exactly > which pointer is blocking progress, and from which CPU (at least > with my per-CPU HP domain implementation). Therefore, it is possible > for HP to issue an IPI and make sure the HP user either completes its > use of the pointer quickly, or stops using it right away (e.g. making > the active mm use idle mm instead). > > One strength of RCU is that it can track use of a whole set of RCU > pointers just by tracking reader transaction begin/end, but this is > also one of its weaknesses: a long reader transaction can postpone > completion of grace period for a long time and increase the memory > footprint. In comparison, HP can immediately complete as soon as the > pointer it is scanning for is gone. Even better, it can send an IPI > to the belate CPU and abort use of the pointer using a callback. Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no". This all sounds like arguments *for* use of the term "publish" for hazard pointers rather than against it. What am I missing here? Thanx, Paul > > > > > +/* > > > > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer. > > > > > + * > > > > > + * Returns a hazard pointer context. Expects to be called from preempt > > > > > + * disable context. > > > > > + */ > > > > > > > > More terrible naming. Same as above, but additionally, I would expect a > > > > 'dereference' to actually dereference the pointer and have a return > > > > value of the dereferenced type. > > > > > > hp_dereference_try_post() ? > > > > > > > > > > > This function seems to double check and update the hp_ctx thing. I'm not > > > > at all sure yet wtf this is doing -- and the total lack of comments > > > > aren't helping. > > > > > > The hp_ctx contains the outputs. > > > > > > The function loads *addr_p to then try_post it into a HP slot. On success, > > > it re-reads the *addr_p (with address dependency) and if it still matches, > > > use that as output address pointer. > > > > > > I'm planning to remove hp_ctx, and just have: > > > > > > /* > > > * hp_try_post: Try to post a hazard pointer. > > > * > > > * Post a hazard pointer slot for @addr. The object existence should > > > * be guaranteed by the caller. Expects to be called from preempt > > > * disable context. > > > * > > > * Returns true if post succeeds, false otherwise. > > > */ > > > static inline > > > bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot) > > > [...] > > > > > > /* > > > * hp_dereference_try_post: Dereference and try to post a hazard pointer. > > > * > > > * Returns a hazard pointer context. Expects to be called from preempt > > > * disable context. > > > */ > > > static inline > > > void *__hp_dereference_try_post(struct hp_domain *hp_domain, > > > void * const * addr_p, struct hp_slot **_slot) > > > [...] > > > > > > #define hp_dereference_try_post(domain, p, slot_p) \ > > > ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p)) > > > > This will compile, but do the wrong thing when p is a regular pointer, no? > > Right, at least in some cases the compiler may not complain, and people used to > rcu_dereference() will expect that "p" is the pointer to load rather than the > address of that pointer. This would be unexpected. > > I must admit that passing the address holding the pointer to load rather than > the pointer to load itself makes it much less troublesome in terms of macro > layers. But perhaps this is another example where we should wander away from the > beaten path and use a word different from "dereference" here. E.g.: > > /* > * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p)) > * to generate a compile error if addr_p is not a pointer to a pointer. > */ > #define hp_load_try_post(domain, addr_p, slot_p) \ > ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p)) > > > > > > > > > /* Clear the hazard pointer in @slot. */ > > > static inline > > > void hp_remove(struct hp_slot *slot) > > > [...] > > > > Differently weird, but better I suppose :-) > > If you find a better word than "remove" to pair with "post", I'm all in :) > > > > > > > > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > > > > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > > > > > +{ > > > > > + int cpu; > > > > > + > > > > > + /* > > > > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > > > > > + * NULL or to a different value), and thus hides it from hazard > > > > > + * pointer readers. > > > > > + */ > > > > > + > > > > > + if (!addr) > > > > > + return; > > > > > + /* Memory ordering: Store A before Load B. */ > > > > > + smp_mb(); > > > > > + /* Scan all CPUs slots. */ > > > > > + for_each_possible_cpu(cpu) { > > > > > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); > > > > > + > > > > > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ > > > > > + retire_cb(cpu, slot, addr); > > > > > > > > Is retirce_cb allowed to cmpxchg the thing? > > > > > > It could, but we'd need to make sure the slot is not re-used by another > > > hp_try_post() before the current user removes its own post. It would > > > need to synchronize with the current HP user (e.g. with IPI). > > > > > > I've actually renamed retire_cb to "on_match_cb". > > > > Hmm, I think I see. Would it make sense to pass the expected addr to > > hp_remove() and double check we don't NULL out something unexpected? -- > > maybe just for a DEBUG option. > > > > I'm always seeing the NOHZ_FULL guys hating on this :-) > > That's a fair point. Sure, we can do this as an extra safety net. For now I > will just make the check always present, we can always move it to a debug > option later. > > And now I notice that hp_remove is also used for CPU hotplug (grep > matches for cpuhp_remove_state()). I wonder if we should go for something > more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ? > > Thanks, > > Mathieu > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Mon, Oct 07, 2024 at 11:18:14AM -0700, Paul E. McKenney wrote: > On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote: > > On 2024-10-07 12:40, Peter Zijlstra wrote: > > > On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote: > > > > On 2024-10-05 18:04, Peter Zijlstra wrote: > > > > > > > > > > > > +/* > > > > > > + * hp_allocate: Allocate a hazard pointer. > > > > > > + * > > > > > > + * Allocate a hazard pointer slot for @addr. The object existence should > > > > > > + * be guaranteed by the caller. Expects to be called from preempt > > > > > > + * disable context. > > > > > > + * > > > > > > + * Returns a hazard pointer context. > > > > > > > > > > So you made the WTF'o'meter crack, this here function does not allocate > > > > > nothing. Naming is bad. At best this is something like > > > > > try-set-hazard-pointer or somesuch. > > > > > > > > I went with the naming from the 2004 paper from Maged Michael, but I > > > > agree it could be clearer. > > > > > > > > I'm tempted to go for "hp_try_post()" and "hp_remove()", basically > > > > "posting" the intent to use a pointer (as in on a metaphorical billboard), > > > > and removing it when it's done. > > > > > > For RCU we've taken to using the word: 'publish', no? > > > > I'm so glad you suggest this, because it turns out that from all > > the possible words you could choose from, 'publish' is probably the > > most actively confusing. I'll explain. > > > > Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers > > through a simple example: > > > > [ Note: I've renamed the HP dereference try_post to HP load try_post > > based on further discussion below. ] > > > > *** RCU *** > > > > * Dereference RCU-protected pointer: > > rcu_read_lock(); // [ Begin read transaction ] > > l_p = rcu_dereference(p); // [ Load p: @addr or NULL ] > > if (l_p) > > [ use *l_p ...] > > rcu_read_unlock(); // [ End read transaction ] > > > > * Publish @addr: addr = kmalloc(); > > init(addr); > > rcu_assign_pointer(p, addr); > > > > * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] > > synchronize_rcu(); // Wait for all pre-existing > > // read transactions to complete. > > kfree(addr); > > > > > > *** Hazard Pointers *** > > > > * Load and post a HP-protected pointer: > > l_p = hp_load_try_post(domain, &p, &slot); > > if (l_p) { > > [ use *l_p ...] > > hp_remove(&slot, l_p); > > } > > > > * Publish @addr: addr = kmalloc(); > > init(addr); > > rcu_assign_pointer(p, addr); > > > > * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] > > hp_scan(domain, addr, NULL); > > kfree(addr); > > > > Both HP and RCU have publication guarantees, which can in fact be > > implemented in the same way (e.g. rcu_assign_pointer paired with > > something that respects address dependencies ordering). A stronger > > implementation of this would be pairing a store-release with a > > load-acquire: it works, but it would add needless overhead on > > weakly-ordered CPUs. > > > > How the two mechanisms differ is in how they track when it is > > safe to reclaim @addr. RCU tracks reader "transactions" begin/end, > > and makes sure that all pre-existing transactions are gone before > > synchronize_rcu() is allowed to complete. HP does this by tracking > > "posted" pointer slots with a HP domain. As long as hp_scan observes > > that HP readers are showing interest in @addr, it will wait. > > > > One notable difference between RCU and HP is that HP knows exactly > > which pointer is blocking progress, and from which CPU (at least > > with my per-CPU HP domain implementation). Therefore, it is possible > > for HP to issue an IPI and make sure the HP user either completes its > > use of the pointer quickly, or stops using it right away (e.g. making > > the active mm use idle mm instead). > > > > One strength of RCU is that it can track use of a whole set of RCU > > pointers just by tracking reader transaction begin/end, but this is > > also one of its weaknesses: a long reader transaction can postpone > > completion of grace period for a long time and increase the memory > > footprint. In comparison, HP can immediately complete as soon as the > > pointer it is scanning for is gone. Even better, it can send an IPI > > to the belate CPU and abort use of the pointer using a callback. > > Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no". > > This all sounds like arguments *for* use of the term "publish" for > hazard pointers rather than against it. What am I missing here? OK, one thing that I was missing is that this was not about the counterpart to rcu_assign_pointer(), for which I believe "publish" makes a lot of sense, but rather about the setting of a hazard pointer. Here, "protect" is the traditional term of power, which has served users well for some years. Thanx, Paul > > > > > > +/* > > > > > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer. > > > > > > + * > > > > > > + * Returns a hazard pointer context. Expects to be called from preempt > > > > > > + * disable context. > > > > > > + */ > > > > > > > > > > More terrible naming. Same as above, but additionally, I would expect a > > > > > 'dereference' to actually dereference the pointer and have a return > > > > > value of the dereferenced type. > > > > > > > > hp_dereference_try_post() ? > > > > > > > > > > > > > > This function seems to double check and update the hp_ctx thing. I'm not > > > > > at all sure yet wtf this is doing -- and the total lack of comments > > > > > aren't helping. > > > > > > > > The hp_ctx contains the outputs. > > > > > > > > The function loads *addr_p to then try_post it into a HP slot. On success, > > > > it re-reads the *addr_p (with address dependency) and if it still matches, > > > > use that as output address pointer. > > > > > > > > I'm planning to remove hp_ctx, and just have: > > > > > > > > /* > > > > * hp_try_post: Try to post a hazard pointer. > > > > * > > > > * Post a hazard pointer slot for @addr. The object existence should > > > > * be guaranteed by the caller. Expects to be called from preempt > > > > * disable context. > > > > * > > > > * Returns true if post succeeds, false otherwise. > > > > */ > > > > static inline > > > > bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot) > > > > [...] > > > > > > > > /* > > > > * hp_dereference_try_post: Dereference and try to post a hazard pointer. > > > > * > > > > * Returns a hazard pointer context. Expects to be called from preempt > > > > * disable context. > > > > */ > > > > static inline > > > > void *__hp_dereference_try_post(struct hp_domain *hp_domain, > > > > void * const * addr_p, struct hp_slot **_slot) > > > > [...] > > > > > > > > #define hp_dereference_try_post(domain, p, slot_p) \ > > > > ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p)) > > > > > > This will compile, but do the wrong thing when p is a regular pointer, no? > > > > Right, at least in some cases the compiler may not complain, and people used to > > rcu_dereference() will expect that "p" is the pointer to load rather than the > > address of that pointer. This would be unexpected. > > > > I must admit that passing the address holding the pointer to load rather than > > the pointer to load itself makes it much less troublesome in terms of macro > > layers. But perhaps this is another example where we should wander away from the > > beaten path and use a word different from "dereference" here. E.g.: > > > > /* > > * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p)) > > * to generate a compile error if addr_p is not a pointer to a pointer. > > */ > > #define hp_load_try_post(domain, addr_p, slot_p) \ > > ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p)) > > > > > > > > > > > > > /* Clear the hazard pointer in @slot. */ > > > > static inline > > > > void hp_remove(struct hp_slot *slot) > > > > [...] > > > > > > Differently weird, but better I suppose :-) > > > > If you find a better word than "remove" to pair with "post", I'm all in :) > > > > > > > > > > > > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, > > > > > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) > > > > > > +{ > > > > > > + int cpu; > > > > > > + > > > > > > + /* > > > > > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to > > > > > > + * NULL or to a different value), and thus hides it from hazard > > > > > > + * pointer readers. > > > > > > + */ > > > > > > + > > > > > > + if (!addr) > > > > > > + return; > > > > > > + /* Memory ordering: Store A before Load B. */ > > > > > > + smp_mb(); > > > > > > + /* Scan all CPUs slots. */ > > > > > > + for_each_possible_cpu(cpu) { > > > > > > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); > > > > > > + > > > > > > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ > > > > > > + retire_cb(cpu, slot, addr); > > > > > > > > > > Is retirce_cb allowed to cmpxchg the thing? > > > > > > > > It could, but we'd need to make sure the slot is not re-used by another > > > > hp_try_post() before the current user removes its own post. It would > > > > need to synchronize with the current HP user (e.g. with IPI). > > > > > > > > I've actually renamed retire_cb to "on_match_cb". > > > > > > Hmm, I think I see. Would it make sense to pass the expected addr to > > > hp_remove() and double check we don't NULL out something unexpected? -- > > > maybe just for a DEBUG option. > > > > > > I'm always seeing the NOHZ_FULL guys hating on this :-) > > > > That's a fair point. Sure, we can do this as an extra safety net. For now I > > will just make the check always present, we can always move it to a debug > > option later. > > > > And now I notice that hp_remove is also used for CPU hotplug (grep > > matches for cpuhp_remove_state()). I wonder if we should go for something > > more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ? > > > > Thanks, > > > > Mathieu > > > > > > -- > > Mathieu Desnoyers > > EfficiOS Inc. > > https://www.efficios.com > >
On 2024-10-07 21:06, Paul E. McKenney wrote: > On Mon, Oct 07, 2024 at 11:18:14AM -0700, Paul E. McKenney wrote: >> On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote: >>> On 2024-10-07 12:40, Peter Zijlstra wrote: >>>> On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote: >>>>> On 2024-10-05 18:04, Peter Zijlstra wrote: >>>> >>>> >>>>>>> +/* >>>>>>> + * hp_allocate: Allocate a hazard pointer. >>>>>>> + * >>>>>>> + * Allocate a hazard pointer slot for @addr. The object existence should >>>>>>> + * be guaranteed by the caller. Expects to be called from preempt >>>>>>> + * disable context. >>>>>>> + * >>>>>>> + * Returns a hazard pointer context. >>>>>> >>>>>> So you made the WTF'o'meter crack, this here function does not allocate >>>>>> nothing. Naming is bad. At best this is something like >>>>>> try-set-hazard-pointer or somesuch. >>>>> >>>>> I went with the naming from the 2004 paper from Maged Michael, but I >>>>> agree it could be clearer. >>>>> >>>>> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically >>>>> "posting" the intent to use a pointer (as in on a metaphorical billboard), >>>>> and removing it when it's done. >>>> >>>> For RCU we've taken to using the word: 'publish', no? >>> >>> I'm so glad you suggest this, because it turns out that from all >>> the possible words you could choose from, 'publish' is probably the >>> most actively confusing. I'll explain. >>> >>> Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers >>> through a simple example: >>> >>> [ Note: I've renamed the HP dereference try_post to HP load try_post >>> based on further discussion below. ] >>> >>> *** RCU *** >>> >>> * Dereference RCU-protected pointer: >>> rcu_read_lock(); // [ Begin read transaction ] >>> l_p = rcu_dereference(p); // [ Load p: @addr or NULL ] >>> if (l_p) >>> [ use *l_p ...] >>> rcu_read_unlock(); // [ End read transaction ] >>> >>> * Publish @addr: addr = kmalloc(); >>> init(addr); >>> rcu_assign_pointer(p, addr); >>> >>> * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] >>> synchronize_rcu(); // Wait for all pre-existing >>> // read transactions to complete. >>> kfree(addr); >>> >>> >>> *** Hazard Pointers *** >>> >>> * Load and post a HP-protected pointer: >>> l_p = hp_load_try_post(domain, &p, &slot); >>> if (l_p) { >>> [ use *l_p ...] >>> hp_remove(&slot, l_p); >>> } >>> >>> * Publish @addr: addr = kmalloc(); >>> init(addr); >>> rcu_assign_pointer(p, addr); >>> >>> * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ] >>> hp_scan(domain, addr, NULL); >>> kfree(addr); >>> >>> Both HP and RCU have publication guarantees, which can in fact be >>> implemented in the same way (e.g. rcu_assign_pointer paired with >>> something that respects address dependencies ordering). A stronger >>> implementation of this would be pairing a store-release with a >>> load-acquire: it works, but it would add needless overhead on >>> weakly-ordered CPUs. >>> >>> How the two mechanisms differ is in how they track when it is >>> safe to reclaim @addr. RCU tracks reader "transactions" begin/end, >>> and makes sure that all pre-existing transactions are gone before >>> synchronize_rcu() is allowed to complete. HP does this by tracking >>> "posted" pointer slots with a HP domain. As long as hp_scan observes >>> that HP readers are showing interest in @addr, it will wait. >>> >>> One notable difference between RCU and HP is that HP knows exactly >>> which pointer is blocking progress, and from which CPU (at least >>> with my per-CPU HP domain implementation). Therefore, it is possible >>> for HP to issue an IPI and make sure the HP user either completes its >>> use of the pointer quickly, or stops using it right away (e.g. making >>> the active mm use idle mm instead). >>> >>> One strength of RCU is that it can track use of a whole set of RCU >>> pointers just by tracking reader transaction begin/end, but this is >>> also one of its weaknesses: a long reader transaction can postpone >>> completion of grace period for a long time and increase the memory >>> footprint. In comparison, HP can immediately complete as soon as the >>> pointer it is scanning for is gone. Even better, it can send an IPI >>> to the belate CPU and abort use of the pointer using a callback. >> >> Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no". >> >> This all sounds like arguments *for* use of the term "publish" for >> hazard pointers rather than against it. What am I missing here? > > OK, one thing that I was missing is that this was not about the > counterpart to rcu_assign_pointer(), for which I believe "publish" makes > a lot of sense, but rather about the setting of a hazard pointer. Here, > "protect" is the traditional term of power, which has served users well > for some years. After some reading of the C++ specification wording used for hazard pointers, I am indeed tempted to go for "try_protect()" and "retire()" to minimize confusion. Thanks, Mathieu > > Thanx, Paul > >>>>>>> +/* >>>>>>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer. >>>>>>> + * >>>>>>> + * Returns a hazard pointer context. Expects to be called from preempt >>>>>>> + * disable context. >>>>>>> + */ >>>>>> >>>>>> More terrible naming. Same as above, but additionally, I would expect a >>>>>> 'dereference' to actually dereference the pointer and have a return >>>>>> value of the dereferenced type. >>>>> >>>>> hp_dereference_try_post() ? >>>>> >>>>>> >>>>>> This function seems to double check and update the hp_ctx thing. I'm not >>>>>> at all sure yet wtf this is doing -- and the total lack of comments >>>>>> aren't helping. >>>>> >>>>> The hp_ctx contains the outputs. >>>>> >>>>> The function loads *addr_p to then try_post it into a HP slot. On success, >>>>> it re-reads the *addr_p (with address dependency) and if it still matches, >>>>> use that as output address pointer. >>>>> >>>>> I'm planning to remove hp_ctx, and just have: >>>>> >>>>> /* >>>>> * hp_try_post: Try to post a hazard pointer. >>>>> * >>>>> * Post a hazard pointer slot for @addr. The object existence should >>>>> * be guaranteed by the caller. Expects to be called from preempt >>>>> * disable context. >>>>> * >>>>> * Returns true if post succeeds, false otherwise. >>>>> */ >>>>> static inline >>>>> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot) >>>>> [...] >>>>> >>>>> /* >>>>> * hp_dereference_try_post: Dereference and try to post a hazard pointer. >>>>> * >>>>> * Returns a hazard pointer context. Expects to be called from preempt >>>>> * disable context. >>>>> */ >>>>> static inline >>>>> void *__hp_dereference_try_post(struct hp_domain *hp_domain, >>>>> void * const * addr_p, struct hp_slot **_slot) >>>>> [...] >>>>> >>>>> #define hp_dereference_try_post(domain, p, slot_p) \ >>>>> ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p)) >>>> >>>> This will compile, but do the wrong thing when p is a regular pointer, no? >>> >>> Right, at least in some cases the compiler may not complain, and people used to >>> rcu_dereference() will expect that "p" is the pointer to load rather than the >>> address of that pointer. This would be unexpected. >>> >>> I must admit that passing the address holding the pointer to load rather than >>> the pointer to load itself makes it much less troublesome in terms of macro >>> layers. But perhaps this is another example where we should wander away from the >>> beaten path and use a word different from "dereference" here. E.g.: >>> >>> /* >>> * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p)) >>> * to generate a compile error if addr_p is not a pointer to a pointer. >>> */ >>> #define hp_load_try_post(domain, addr_p, slot_p) \ >>> ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p)) >>> >>>> >>>>> >>>>> /* Clear the hazard pointer in @slot. */ >>>>> static inline >>>>> void hp_remove(struct hp_slot *slot) >>>>> [...] >>>> >>>> Differently weird, but better I suppose :-) >>> >>> If you find a better word than "remove" to pair with "post", I'm all in :) >>> >>>> >>>> >>>>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, >>>>>>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) >>>>>>> +{ >>>>>>> + int cpu; >>>>>>> + >>>>>>> + /* >>>>>>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to >>>>>>> + * NULL or to a different value), and thus hides it from hazard >>>>>>> + * pointer readers. >>>>>>> + */ >>>>>>> + >>>>>>> + if (!addr) >>>>>>> + return; >>>>>>> + /* Memory ordering: Store A before Load B. */ >>>>>>> + smp_mb(); >>>>>>> + /* Scan all CPUs slots. */ >>>>>>> + for_each_possible_cpu(cpu) { >>>>>>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); >>>>>>> + >>>>>>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ >>>>>>> + retire_cb(cpu, slot, addr); >>>>>> >>>>>> Is retirce_cb allowed to cmpxchg the thing? >>>>> >>>>> It could, but we'd need to make sure the slot is not re-used by another >>>>> hp_try_post() before the current user removes its own post. It would >>>>> need to synchronize with the current HP user (e.g. with IPI). >>>>> >>>>> I've actually renamed retire_cb to "on_match_cb". >>>> >>>> Hmm, I think I see. Would it make sense to pass the expected addr to >>>> hp_remove() and double check we don't NULL out something unexpected? -- >>>> maybe just for a DEBUG option. >>>> >>>> I'm always seeing the NOHZ_FULL guys hating on this :-) >>> >>> That's a fair point. Sure, we can do this as an extra safety net. For now I >>> will just make the check always present, we can always move it to a debug >>> option later. >>> >>> And now I notice that hp_remove is also used for CPU hotplug (grep >>> matches for cpuhp_remove_state()). I wonder if we should go for something >>> more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ? >>> >>> Thanks, >>> >>> Mathieu >>> >>> >>> -- >>> Mathieu Desnoyers >>> EfficiOS Inc. >>> https://www.efficios.com >>>
diff --git a/include/linux/hp.h b/include/linux/hp.h new file mode 100644 index 000000000000..e85fc4365ea2 --- /dev/null +++ b/include/linux/hp.h @@ -0,0 +1,158 @@ +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> +// +// SPDX-License-Identifier: LGPL-2.1-or-later + +#ifndef _LINUX_HP_H +#define _LINUX_HP_H + +/* + * HP: Hazard Pointers + * + * This API provides existence guarantees of objects through hazard + * pointers. + * + * It uses a fixed number of hazard pointer slots (nr_cpus) across the + * entire system for each HP domain. + * + * Its main benefit over RCU is that it allows fast reclaim of + * HP-protected pointers without needing to wait for a grace period. + * + * It also allows the hazard pointer scan to call a user-defined callback + * to retire a hazard pointer slot immediately if needed. This callback + * may, for instance, issue an IPI to the relevant CPU. + * + * References: + * + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for + * lock-free objects," in IEEE Transactions on Parallel and + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 + */ + +#include <linux/rcupdate.h> + +/* + * Hazard pointer slot. + */ +struct hp_slot { + void *addr; +}; + +/* + * Hazard pointer context, returned by hp_use(). + */ +struct hp_ctx { + struct hp_slot *slot; + void *addr; +}; + +/* + * hp_scan: Scan hazard pointer domain for @addr. + * + * Scan hazard pointer domain for @addr. + * If @retire_cb is NULL, wait to observe that each slot contains a value + * that differs from @addr. + * If @retire_cb is non-NULL, invoke @callback for each slot containing + * @addr. + */ +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)); + +/* Get the hazard pointer context address (may be NULL). */ +static inline +void *hp_ctx_addr(struct hp_ctx ctx) +{ + return ctx.addr; +} + +/* + * hp_allocate: Allocate a hazard pointer. + * + * Allocate a hazard pointer slot for @addr. The object existence should + * be guaranteed by the caller. Expects to be called from preempt + * disable context. + * + * Returns a hazard pointer context. + */ +static inline +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr) +{ + struct hp_slot *slot; + struct hp_ctx ctx; + + if (!addr) + goto fail; + slot = this_cpu_ptr(percpu_slots); + /* + * A single hazard pointer slot per CPU is available currently. + * Other hazard pointer domains can eventually have a different + * configuration. + */ + if (READ_ONCE(slot->addr)) + goto fail; + WRITE_ONCE(slot->addr, addr); /* Store B */ + ctx.slot = slot; + ctx.addr = addr; + return ctx; + +fail: + ctx.slot = NULL; + ctx.addr = NULL; + return ctx; +} + +/* + * hp_dereference_allocate: Dereference and allocate a hazard pointer. + * + * Returns a hazard pointer context. Expects to be called from preempt + * disable context. + */ +static inline +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p) +{ + void *addr, *addr2; + struct hp_ctx ctx; + + addr = READ_ONCE(*addr_p); +retry: + ctx = hp_allocate(percpu_slots, addr); + if (!hp_ctx_addr(ctx)) + goto fail; + /* Memory ordering: Store B before Load A. */ + smp_mb(); + /* + * Use RCU dereference without lockdep checks, because + * lockdep is not aware of HP guarantees. + */ + addr2 = rcu_access_pointer(*addr_p); /* Load A */ + /* + * If @addr_p content has changed since the first load, + * clear the hazard pointer and try again. + */ + if (!ptr_eq(addr2, addr)) { + WRITE_ONCE(ctx.slot->addr, NULL); + if (!addr2) + goto fail; + addr = addr2; + goto retry; + } + /* + * Use addr2 loaded from rcu_access_pointer() to preserve + * address dependency ordering. + */ + ctx.addr = addr2; + return ctx; + +fail: + ctx.slot = NULL; + ctx.addr = NULL; + return ctx; +} + +/* Retire the hazard pointer in @ctx. */ +static inline +void hp_retire(const struct hp_ctx ctx) +{ + smp_store_release(&ctx.slot->addr, NULL); +} + +#endif /* _LINUX_HP_H */ diff --git a/kernel/Makefile b/kernel/Makefile index 3c13240dfc9f..ec16de96fa80 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \ cpu.o exit.o softirq.o resource.o \ sysctl.o capability.o ptrace.o user.o \ signal.o sys.o umh.o workqueue.o pid.o task_work.o \ - extable.o params.o \ + extable.o params.o hp.o \ kthread.o sys_ni.o nsproxy.o \ notifier.o ksysfs.o cred.o reboot.o \ async.o range.o smpboot.o ucount.o regset.o ksyms_common.o diff --git a/kernel/hp.c b/kernel/hp.c new file mode 100644 index 000000000000..b2447bf15300 --- /dev/null +++ b/kernel/hp.c @@ -0,0 +1,46 @@ +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> +// +// SPDX-License-Identifier: LGPL-2.1-or-later + +/* + * HP: Hazard Pointers + */ + +#include <linux/hp.h> +#include <linux/percpu.h> + +/* + * hp_scan: Scan hazard pointer domain for @addr. + * + * Scan hazard pointer domain for @addr. + * If @retire_cb is non-NULL, invoke @callback for each slot containing + * @addr. + * Wait to observe that each slot contains a value that differs from + * @addr before returning. + */ +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr, + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr)) +{ + int cpu; + + /* + * Store A precedes hp_scan(): it unpublishes addr (sets it to + * NULL or to a different value), and thus hides it from hazard + * pointer readers. + */ + + if (!addr) + return; + /* Memory ordering: Store A before Load B. */ + smp_mb(); + /* Scan all CPUs slots. */ + for_each_possible_cpu(cpu) { + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu); + + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */ + retire_cb(cpu, slot, addr); + /* Busy-wait if node is found. */ + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */ + cpu_relax(); + } +}
This API provides existence guarantees of objects through Hazard Pointers (HP). This minimalist implementation is specific to use with preemption disabled, but can be extended further as needed. Each HP domain defines a fixed number of hazard pointer slots (nr_cpus) across the entire system. Its main benefit over RCU is that it allows fast reclaim of HP-protected pointers without needing to wait for a grace period. It also allows the hazard pointer scan to call a user-defined callback to retire a hazard pointer slot immediately if needed. This callback may, for instance, issue an IPI to the relevant CPU. There are a few possible use-cases for this in the Linux kernel: - Improve performance of mm_count by replacing lazy active mm by HP. - Guarantee object existence on pointer dereference to use refcount: - replace locking used for that purpose in some drivers, - replace RCU + inc_not_zero pattern, - rtmutex: Improve situations where locks need to be taken in reverse dependency chain order by guaranteeing existence of first and second locks in traversal order, allowing them to be locked in the correct order (which is reverse from traversal order) rather than try-lock+retry on nested lock. References: [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for lock-free objects," in IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 Link: https://lore.kernel.org/lkml/j3scdl5iymjlxavomgc6u5ndg3svhab6ga23dr36o4f5mt333w@7xslvq6b6hmv/ Link: https://lpc.events/event/18/contributions/1731/ Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: John Stultz <jstultz@google.com> Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: maged.michael@gmail.com Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> Cc: rcu@vger.kernel.org Cc: linux-mm@kvack.org Cc: lkmm@lists.linux.dev --- Changes since v0: - Remove slot variable from hp_dereference_allocate(). --- include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++ kernel/Makefile | 2 +- kernel/hp.c | 46 +++++++++++++ 3 files changed, 205 insertions(+), 1 deletion(-) create mode 100644 include/linux/hp.h create mode 100644 kernel/hp.c