[v4,12/22] lib/stackdepot: use read/write lock

Message ID	9f81ffcc4bb422ebb6326a65a770bf1918634cbb.1700502145.git.andreyknvl@google.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: andrey.konovalov@linux.dev To: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Konovalov <andreyknvl@gmail.com>, Marco Elver <elver@google.com>, Alexander Potapenko <glider@google.com>, Dmitry Vyukov <dvyukov@google.com>, Vlastimil Babka <vbabka@suse.cz>, kasan-dev@googlegroups.com, Evgenii Stepanov <eugenis@google.com>, Oscar Salvador <osalvador@suse.de>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrey Konovalov <andreyknvl@google.com> Subject: [PATCH v4 12/22] lib/stackdepot: use read/write lock Date: Mon, 20 Nov 2023 18:47:10 +0100 Message-Id: <9f81ffcc4bb422ebb6326a65a770bf1918634cbb.1700502145.git.andreyknvl@google.com> In-Reply-To: <cover.1700502145.git.andreyknvl@google.com> References: <cover.1700502145.git.andreyknvl@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	stackdepot: allow evicting stack traces \| expand [v4,00/22] stackdepot: allow evicting stack traces [v4,01/22] lib/stackdepot: print disabled message only if truly disabled [v4,02/22] lib/stackdepot: check disabled flag when fetching [v4,03/22] lib/stackdepot: simplify __stack_depot_save [v4,04/22] lib/stackdepot: drop valid bit from handles [v4,05/22] lib/stackdepot: add depot_fetch_stack helper [v4,06/22] lib/stackdepot: use fixed-sized slots for stack records [v4,07/22] lib/stackdepot: fix and clean-up atomic annotations [v4,08/22] lib/stackdepot: rework helpers for depot_alloc_stack [v4,09/22] lib/stackdepot: rename next_pool_required to new_pool_required [v4,10/22] lib/stackdepot: store next pool pointer in new_pool [v4,11/22] lib/stackdepot: store free stack records in a freelist [v4,12/22] lib/stackdepot: use read/write lock [v4,13/22] lib/stackdepot: use list_head for stack record links [v4,14/22] kmsan: use stack_depot_save instead of __stack_depot_save [v4,15/22] lib/stackdepot, kasan: add flags to __stack_depot_save and rename [v4,16/22] lib/stackdepot: add refcount for records [v4,17/22] lib/stackdepot: allow users to evict stack traces [v4,18/22] kasan: remove atomic accesses to stack ring entries [v4,19/22] kasan: check object_size in kasan_complete_mode_report_info [v4,20/22] kasan: use stack_depot_put for tag-based modes [v4,21/22] kasan: use stack_depot_put for Generic mode [v4,22/22] lib/stackdepot: adjust DEPOT_POOLS_CAP for KMSAN

andrey.konovalov@linux.dev Nov. 20, 2023, 5:47 p.m. UTC

From: Andrey Konovalov <andreyknvl@google.com>

Currently, stack depot uses the following locking scheme:

1. Lock-free accesses when looking up a stack record, which allows to
   have multiple users to look up records in parallel;
2. Spinlock for protecting the stack depot pools and the hash table
   when adding a new record.

For implementing the eviction of stack traces from stack depot, the
lock-free approach is not going to work anymore, as we will need to be
able to also remove records from the hash table.

Convert the spinlock into a read/write lock, and drop the atomic accesses,
as they are no longer required.

Looking up stack traces is now protected by the read lock and adding new
records - by the write lock. One of the following patches will add a new
function for evicting stack records, which will be protected by the write
lock as well.

With this change, multiple users can still look up records in parallel.

This is preparatory patch for implementing the eviction of stack records
from the stack depot.

Reviewed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>

---

Changed v2->v3:
- Use lockdep_assert_held_read annotation in depot_fetch_stack.

Changes v1->v2:
- Add lockdep_assert annotations.
---
 lib/stackdepot.c | 87 +++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 41 deletions(-)

Oscar Salvador Jan. 3, 2024, 9:14 a.m. UTC | #1

On Mon, Nov 20, 2023 at 06:47:10PM +0100, andrey.konovalov@linux.dev wrote:
> From: Andrey Konovalov <andreyknvl@google.com>
> 
> Currently, stack depot uses the following locking scheme:
> 
> 1. Lock-free accesses when looking up a stack record, which allows to
>    have multiple users to look up records in parallel;
> 2. Spinlock for protecting the stack depot pools and the hash table
>    when adding a new record.
> 
> For implementing the eviction of stack traces from stack depot, the
> lock-free approach is not going to work anymore, as we will need to be
> able to also remove records from the hash table.
> 
> Convert the spinlock into a read/write lock, and drop the atomic accesses,
> as they are no longer required.
> 
> Looking up stack traces is now protected by the read lock and adding new
> records - by the write lock. One of the following patches will add a new
> function for evicting stack records, which will be protected by the write
> lock as well.
> 
> With this change, multiple users can still look up records in parallel.
> 
> This is preparatory patch for implementing the eviction of stack records
> from the stack depot.
> 
> Reviewed-by: Alexander Potapenko <glider@google.com>
> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
> 
> Changed v2->v3:
> - Use lockdep_assert_held_read annotation in depot_fetch_stack.
> 
> Changes v1->v2:
> - Add lockdep_assert annotations.
> ---
>  lib/stackdepot.c | 87 +++++++++++++++++++++++++-----------------------
>  1 file changed, 46 insertions(+), 41 deletions(-)
> 
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index a5eff165c0d5..8378b32b5310 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -23,6 +23,7 @@
>  #include <linux/percpu.h>
>  #include <linux/printk.h>
>  #include <linux/slab.h>
> +#include <linux/spinlock.h>
>  #include <linux/stacktrace.h>
>  #include <linux/stackdepot.h>
>  #include <linux/string.h>
> @@ -91,15 +92,15 @@ static void *new_pool;
>  static int pools_num;
>  /* Next stack in the freelist of stack records within stack_pools. */
>  static struct stack_record *next_stack;
> -/* Lock that protects the variables above. */
> -static DEFINE_RAW_SPINLOCK(pool_lock);
>  /*
>   * Stack depot tries to keep an extra pool allocated even before it runs out
>   * of space in the currently used pool. This flag marks whether this extra pool
>   * needs to be allocated. It has the value 0 when either an extra pool is not
>   * yet allocated or if the limit on the number of pools is reached.
>   */
> -static int new_pool_required = 1;
> +static bool new_pool_required = true;
> +/* Lock that protects the variables above. */
> +static DEFINE_RWLOCK(pool_rwlock);
>  
>  static int __init disable_stack_depot(char *str)
>  {
> @@ -232,6 +233,8 @@ static void depot_init_pool(void *pool)
>  	const int records_in_pool = DEPOT_POOL_SIZE / DEPOT_STACK_RECORD_SIZE;
>  	int i, offset;
>  
> +	lockdep_assert_held_write(&pool_rwlock);
> +
>  	/* Initialize handles and link stack records to each other. */
>  	for (i = 0, offset = 0;
>  	     offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
> @@ -254,22 +257,17 @@ static void depot_init_pool(void *pool)
>  
>  	/* Save reference to the pool to be used by depot_fetch_stack(). */
>  	stack_pools[pools_num] = pool;
> -
> -	/*
> -	 * WRITE_ONCE() pairs with potential concurrent read in
> -	 * depot_fetch_stack().
> -	 */
> -	WRITE_ONCE(pools_num, pools_num + 1);
> +	pools_num++;
>  }
>  
>  /* Keeps the preallocated memory to be used for a new stack depot pool. */
>  static void depot_keep_new_pool(void **prealloc)
>  {
> +	lockdep_assert_held_write(&pool_rwlock);
> +
>  	/*
>  	 * If a new pool is already saved or the maximum number of
>  	 * pools is reached, do not use the preallocated memory.
> -	 * Access new_pool_required non-atomically, as there are no concurrent
> -	 * write accesses to this variable.
>  	 */
>  	if (!new_pool_required)
>  		return;
> @@ -287,15 +285,15 @@ static void depot_keep_new_pool(void **prealloc)
>  	 * At this point, either a new pool is kept or the maximum
>  	 * number of pools is reached. In either case, take note that
>  	 * keeping another pool is not required.
> -	 * smp_store_release() pairs with smp_load_acquire() in
> -	 * stack_depot_save().
>  	 */
> -	smp_store_release(&new_pool_required, 0);
> +	new_pool_required = false;
>  }
>  
>  /* Updates references to the current and the next stack depot pools. */
>  static bool depot_update_pools(void **prealloc)
>  {
> +	lockdep_assert_held_write(&pool_rwlock);
> +
>  	/* Check if we still have objects in the freelist. */
>  	if (next_stack)
>  		goto out_keep_prealloc;
> @@ -307,7 +305,7 @@ static bool depot_update_pools(void **prealloc)
>  
>  		/* Take note that we might need a new new_pool. */
>  		if (pools_num < DEPOT_MAX_POOLS)
> -			smp_store_release(&new_pool_required, 1);
> +			new_pool_required = true;
>  
>  		/* Try keeping the preallocated memory for new_pool. */
>  		goto out_keep_prealloc;
> @@ -341,6 +339,8 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
>  {
>  	struct stack_record *stack;
>  
> +	lockdep_assert_held_write(&pool_rwlock);
> +
>  	/* Update current and new pools if required and possible. */
>  	if (!depot_update_pools(prealloc))
>  		return NULL;
> @@ -376,18 +376,15 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
>  static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
>  {
>  	union handle_parts parts = { .handle = handle };
> -	/*
> -	 * READ_ONCE() pairs with potential concurrent write in
> -	 * depot_init_pool().
> -	 */
> -	int pools_num_cached = READ_ONCE(pools_num);
>  	void *pool;
>  	size_t offset = parts.offset << DEPOT_STACK_ALIGN;
>  	struct stack_record *stack;
>  
> -	if (parts.pool_index > pools_num_cached) {
> +	lockdep_assert_held_read(&pool_rwlock);
> +
> +	if (parts.pool_index > pools_num) {
>  		WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
> -		     parts.pool_index, pools_num_cached, handle);
> +		     parts.pool_index, pools_num, handle);
>  		return NULL;
>  	}
>  
> @@ -429,6 +426,8 @@ static inline struct stack_record *find_stack(struct stack_record *bucket,
>  {
>  	struct stack_record *found;
>  
> +	lockdep_assert_held(&pool_rwlock);
> +
>  	for (found = bucket; found; found = found->next) {
>  		if (found->hash == hash &&
>  		    found->size == size &&
> @@ -446,6 +445,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
>  	depot_stack_handle_t handle = 0;
>  	struct page *page = NULL;
>  	void *prealloc = NULL;
> +	bool need_alloc = false;
>  	unsigned long flags;
>  	u32 hash;
>  
> @@ -465,22 +465,26 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
>  	hash = hash_stack(entries, nr_entries);
>  	bucket = &stack_table[hash & stack_hash_mask];
>  
> -	/*
> -	 * Fast path: look the stack trace up without locking.
> -	 * smp_load_acquire() pairs with smp_store_release() to |bucket| below.
> -	 */
> -	found = find_stack(smp_load_acquire(bucket), entries, nr_entries, hash);
> -	if (found)
> +	read_lock_irqsave(&pool_rwlock, flags);
> +
> +	/* Fast path: look the stack trace up without full locking. */
> +	found = find_stack(*bucket, entries, nr_entries, hash);
> +	if (found) {
> +		read_unlock_irqrestore(&pool_rwlock, flags);
>  		goto exit;
> +	}
> +
> +	/* Take note if another stack pool needs to be allocated. */
> +	if (new_pool_required)
> +		need_alloc = true;
> +
> +	read_unlock_irqrestore(&pool_rwlock, flags);
>  
>  	/*
> -	 * Check if another stack pool needs to be allocated. If so, allocate
> -	 * the memory now: we won't be able to do that under the lock.
> -	 *
> -	 * smp_load_acquire() pairs with smp_store_release() in
> -	 * depot_update_pools() and depot_keep_new_pool().
> +	 * Allocate memory for a new pool if required now:
> +	 * we won't be able to do that under the lock.
>  	 */
> -	if (unlikely(can_alloc && smp_load_acquire(&new_pool_required))) {
> +	if (unlikely(can_alloc && need_alloc)) {
>  		/*
>  		 * Zero out zone modifiers, as we don't have specific zone
>  		 * requirements. Keep the flags related to allocation in atomic
> @@ -494,7 +498,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
>  			prealloc = page_address(page);
>  	}
>  
> -	raw_spin_lock_irqsave(&pool_lock, flags);
> +	write_lock_irqsave(&pool_rwlock, flags);
>  
>  	found = find_stack(*bucket, entries, nr_entries, hash);
>  	if (!found) {
> @@ -503,11 +507,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
>  
>  		if (new) {
>  			new->next = *bucket;
> -			/*
> -			 * smp_store_release() pairs with smp_load_acquire()
> -			 * from |bucket| above.
> -			 */
> -			smp_store_release(bucket, new);
> +			*bucket = new;
>  			found = new;
>  		}
>  	} else if (prealloc) {
> @@ -518,7 +518,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
>  		depot_keep_new_pool(&prealloc);
>  	}
>  
> -	raw_spin_unlock_irqrestore(&pool_lock, flags);
> +	write_unlock_irqrestore(&pool_rwlock, flags);
>  exit:
>  	if (prealloc) {
>  		/* Stack depot didn't use this memory, free it. */
> @@ -542,6 +542,7 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
>  			       unsigned long **entries)
>  {
>  	struct stack_record *stack;
> +	unsigned long flags;
>  
>  	*entries = NULL;
>  	/*
> @@ -553,8 +554,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
>  	if (!handle || stack_depot_disabled)
>  		return 0;
>  
> +	read_lock_irqsave(&pool_rwlock, flags);
> +
>  	stack = depot_fetch_stack(handle);
>  
> +	read_unlock_irqrestore(&pool_rwlock, flags);
> +
>  	*entries = stack->entries;
>  	return stack->size;
>  }
> -- 
> 2.25.1
>

Andi Kleen Jan. 10, 2024, 11:01 p.m. UTC | #2

Oscar Salvador <osalvador@suse.de> writes:
>> 
>> With this change, multiple users can still look up records in parallel.

That's a severe misunderstanding -- rwlocks always bounce a cache line,
so the parallelism is significantly reduced.

Normally rwlocks are only worth it if your critical region is quite long.

>> 
>> This is preparatory patch for implementing the eviction of stack records
>> from the stack depot.
>> 
>> Reviewed-by: Alexander Potapenko <glider@google.com>
>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>

Has anyone benchmarked this on a high core count machine? It sounds
pretty bad if every lock aquisition starts bouncing a single cache line.

Consider using RCU or similar.

-Andi

Marco Elver Jan. 11, 2024, 9:48 a.m. UTC | #3

On Thu, 11 Jan 2024 at 00:01, Andi Kleen <ak@linux.intel.com> wrote:
>
> Oscar Salvador <osalvador@suse.de> writes:
> >>
> >> With this change, multiple users can still look up records in parallel.
>
> That's a severe misunderstanding -- rwlocks always bounce a cache line,
> so the parallelism is significantly reduced.
>
> Normally rwlocks are only worth it if your critical region is quite long.
>
> >>
> >> This is preparatory patch for implementing the eviction of stack records
> >> from the stack depot.
> >>
> >> Reviewed-by: Alexander Potapenko <glider@google.com>
> >> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
> >
> > Reviewed-by: Oscar Salvador <osalvador@suse.de>
>
>
> Has anyone benchmarked this on a high core count machine? It sounds
> pretty bad if every lock aquisition starts bouncing a single cache line.
>
> Consider using RCU or similar.

stackdepot is severely limited in what kernel facilities it may use
due to being used by such low level facilities as the allocator
itself.

I've been suggesting percpu-rwsem here, but looking at it in more
detail that doesn't work because percpu-rwsem wants to sleep, but
stackdepot must work in non-sleepable contexts. :-/

Andi Kleen Jan. 11, 2024, 12:36 p.m. UTC | #4

> stackdepot is severely limited in what kernel facilities it may use
> due to being used by such low level facilities as the allocator
> itself.

RCU can be done quite low level too (e.g. there is NMI safe RCU)

> 
> I've been suggesting percpu-rwsem here, but looking at it in more
> detail that doesn't work because percpu-rwsem wants to sleep, but
> stackdepot must work in non-sleepable contexts. :-/

Yes something per CPU would work too I suppose. We used to have
big reader spinlocks for this. 

-Andi

Marco Elver Jan. 11, 2024, 7:08 p.m. UTC | #5

On Thu, Jan 11, 2024 at 04:36AM -0800, Andi Kleen wrote:
> > stackdepot is severely limited in what kernel facilities it may use
> > due to being used by such low level facilities as the allocator
> > itself.
> 
> RCU can be done quite low level too (e.g. there is NMI safe RCU)

How about the below? This should get us back the performance of the old
lock-less version. Although it's using rculist, we don't actually need
to synchronize via RCU.

Thanks,
-- Marco

------ >8 ------

From: Marco Elver <elver@google.com>
Date: Tue, 9 Jan 2024 10:21:56 +0100
Subject: [PATCH] stackdepot: make fast paths lock-less again

stack_depot_put() unconditionally takes the pool_rwlock as a writer.
This is unnecessary if the stack record is not going to be freed.
Furthermore, reader-writer locks have inherent cache contention, which
does not scale well on machines with large CPU counts.

Instead, rework the synchronization story of stack depot to again avoid
taking any locks in the fast paths. This is done by relying on RCU
primitives to give us lock-less list traversal. See code comments for
more details.

Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
Signed-off-by: Marco Elver <elver@google.com>
---
 lib/stackdepot.c | 222 ++++++++++++++++++++++++++++-------------------
 1 file changed, 133 insertions(+), 89 deletions(-)

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index a0be5d05c7f0..9eaf46f8abc4 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -19,10 +19,13 @@
 #include <linux/kernel.h>
 #include <linux/kmsan.h>
 #include <linux/list.h>
+#include <linux/llist.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/printk.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
 #include <linux/refcount.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
@@ -67,7 +70,8 @@ union handle_parts {
 };
 
 struct stack_record {
-	struct list_head list;		/* Links in hash table or freelist */
+	struct list_head hash_list;	/* Links in the hash table */
+	struct llist_node free_list;	/* Links in the freelist */
 	u32 hash;			/* Hash in hash table */
 	u32 size;			/* Number of stored frames */
 	union handle_parts handle;
@@ -104,7 +108,7 @@ static void *new_pool;
 /* Number of pools in stack_pools. */
 static int pools_num;
 /* Freelist of stack records within stack_pools. */
-static LIST_HEAD(free_stacks);
+static LLIST_HEAD(free_stacks);
 /*
  * Stack depot tries to keep an extra pool allocated even before it runs out
  * of space in the currently used pool. This flag marks whether this extra pool
@@ -112,8 +116,8 @@ static LIST_HEAD(free_stacks);
  * yet allocated or if the limit on the number of pools is reached.
  */
 static bool new_pool_required = true;
-/* Lock that protects the variables above. */
-static DEFINE_RWLOCK(pool_rwlock);
+/* The lock must be held when performing pool or free list modifications. */
+static DEFINE_RAW_SPINLOCK(pool_lock);
 
 static int __init disable_stack_depot(char *str)
 {
@@ -263,9 +267,7 @@ static void depot_init_pool(void *pool)
 {
 	int offset;
 
-	lockdep_assert_held_write(&pool_rwlock);
-
-	WARN_ON(!list_empty(&free_stacks));
+	lockdep_assert_held(&pool_lock);
 
 	/* Initialize handles and link stack records into the freelist. */
 	for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
@@ -276,18 +278,25 @@ static void depot_init_pool(void *pool)
 		stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
 		stack->handle.extra = 0;
 
-		list_add(&stack->list, &free_stacks);
+		llist_add(&stack->free_list, &free_stacks);
+		INIT_LIST_HEAD(&stack->hash_list);
 	}
 
 	/* Save reference to the pool to be used by depot_fetch_stack(). */
 	stack_pools[pools_num] = pool;
-	pools_num++;
+
+	/*
+	 * Release of pool pointer assignment above. Pairs with the
+	 * smp_load_acquire() in depot_fetch_stack().
+	 */
+	smp_store_release(&pools_num, pools_num + 1);
+	ASSERT_EXCLUSIVE_WRITER(pools_num);
 }
 
 /* Keeps the preallocated memory to be used for a new stack depot pool. */
 static void depot_keep_new_pool(void **prealloc)
 {
-	lockdep_assert_held_write(&pool_rwlock);
+	lockdep_assert_held(&pool_lock);
 
 	/*
 	 * If a new pool is already saved or the maximum number of
@@ -310,16 +319,16 @@ static void depot_keep_new_pool(void **prealloc)
 	 * number of pools is reached. In either case, take note that
 	 * keeping another pool is not required.
 	 */
-	new_pool_required = false;
+	WRITE_ONCE(new_pool_required, false);
 }
 
 /* Updates references to the current and the next stack depot pools. */
 static bool depot_update_pools(void **prealloc)
 {
-	lockdep_assert_held_write(&pool_rwlock);
+	lockdep_assert_held(&pool_lock);
 
 	/* Check if we still have objects in the freelist. */
-	if (!list_empty(&free_stacks))
+	if (!llist_empty(&free_stacks))
 		goto out_keep_prealloc;
 
 	/* Check if we have a new pool saved and use it. */
@@ -329,7 +338,7 @@ static bool depot_update_pools(void **prealloc)
 
 		/* Take note that we might need a new new_pool. */
 		if (pools_num < DEPOT_MAX_POOLS)
-			new_pool_required = true;
+			WRITE_ONCE(new_pool_required, true);
 
 		/* Try keeping the preallocated memory for new_pool. */
 		goto out_keep_prealloc;
@@ -362,20 +371,19 @@ static struct stack_record *
 depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
 {
 	struct stack_record *stack;
+	struct llist_node *free;
 
-	lockdep_assert_held_write(&pool_rwlock);
+	lockdep_assert_held(&pool_lock);
 
 	/* Update current and new pools if required and possible. */
 	if (!depot_update_pools(prealloc))
 		return NULL;
 
 	/* Check if we have a stack record to save the stack trace. */
-	if (list_empty(&free_stacks))
+	free = llist_del_first(&free_stacks);
+	if (!free)
 		return NULL;
-
-	/* Get and unlink the first entry from the freelist. */
-	stack = list_first_entry(&free_stacks, struct stack_record, list);
-	list_del(&stack->list);
+	stack = llist_entry(free, struct stack_record, free_list);
 
 	/* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
 	if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
@@ -385,7 +393,6 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
 	stack->hash = hash;
 	stack->size = size;
 	/* stack->handle is already filled in by depot_init_pool(). */
-	refcount_set(&stack->count, 1);
 	memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
 
 	/*
@@ -394,21 +401,30 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
 	 */
 	kmsan_unpoison_memory(stack, DEPOT_STACK_RECORD_SIZE);
 
+	/*
+	 * Release saving of the stack trace. Pairs with smp_mb() in
+	 * depot_fetch_stack().
+	 */
+	smp_mb__before_atomic();
+	refcount_set(&stack->count, 1);
+
 	return stack;
 }
 
 static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
 {
+	/* Acquire the pool pointer written in depot_init_pool(). */
+	const int pools_num_cached = smp_load_acquire(&pools_num);
 	union handle_parts parts = { .handle = handle };
 	void *pool;
 	size_t offset = parts.offset << DEPOT_STACK_ALIGN;
 	struct stack_record *stack;
 
-	lockdep_assert_held(&pool_rwlock);
+	lockdep_assert_not_held(&pool_lock);
 
-	if (parts.pool_index > pools_num) {
+	if (parts.pool_index > pools_num_cached) {
 		WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
-		     parts.pool_index, pools_num, handle);
+		     parts.pool_index, pools_num_cached, handle);
 		return NULL;
 	}
 
@@ -417,15 +433,35 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
 		return NULL;
 
 	stack = pool + offset;
+
+	/*
+	 * Acquire the stack trace. Pairs with smp_mb() in depot_alloc_stack().
+	 *
+	 * This does not protect against a stack_depot_put() freeing the record
+	 * and having it subsequently being reused. Callers are responsible to
+	 * avoid using stack depot handles after passing to stack_depot_put().
+	 */
+	if (!refcount_read(&stack->count))
+		return NULL;
+	smp_mb__after_atomic();
+
 	return stack;
 }
 
 /* Links stack into the freelist. */
 static void depot_free_stack(struct stack_record *stack)
 {
-	lockdep_assert_held_write(&pool_rwlock);
+	unsigned long flags;
+
+	lockdep_assert_not_held(&pool_lock);
+
+	raw_spin_lock_irqsave(&pool_lock, flags);
+	printk_deferred_enter();
+	list_del_rcu(&stack->hash_list);
+	printk_deferred_exit();
+	raw_spin_unlock_irqrestore(&pool_lock, flags);
 
-	list_add(&stack->list, &free_stacks);
+	llist_add(&stack->free_list, &free_stacks);
 }
 
 /* Calculates the hash for a stack. */
@@ -453,22 +489,55 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
 
 /* Finds a stack in a bucket of the hash table. */
 static inline struct stack_record *find_stack(struct list_head *bucket,
-					     unsigned long *entries, int size,
-					     u32 hash)
+					      unsigned long *entries, int size,
+					      u32 hash, depot_flags_t flags)
 {
-	struct list_head *pos;
-	struct stack_record *found;
+	struct stack_record *stack, *ret = NULL;
 
-	lockdep_assert_held(&pool_rwlock);
+	/*
+	 * Due to being used from low-level code paths such as the allocators,
+	 * NMI, or even RCU itself, stackdepot cannot rely on primitives that
+	 * would sleep (such as synchronize_rcu()) or end up recursively call
+	 * into stack depot again (such as call_rcu()).
+	 *
+	 * Instead, lock-less readers only rely on RCU primitives for correct
+	 * memory ordering, but do not use RCU-based synchronization otherwise.
+	 * Instead, we perform 3-pass validation below to ensure that the stack
+	 * record we accessed is actually valid. If we fail to obtain a valid
+	 * stack record here, the slow-path in stack_depot_save_flags() will
+	 * retry to avoid inserting duplicates.
+	 *
+	 * If STACK_DEPOT_FLAG_GET is not used, it is undefined behaviour to
+	 * call stack_depot_put() later - i.e. in the non-refcounted case, we do
+	 * not have to worry that the entry will be recycled.
+	 */
+
+	list_for_each_entry_rcu(stack, bucket, hash_list) {
+		/* 1. Check if this entry could potentially match. */
+		if (data_race(stack->hash != hash || stack->size != size))
+			continue;
+
+		/*
+		 * 2. Increase refcount if not zero. If this is successful, we
+		 *    know that this stack record is valid and will not be freed by
+		 *    stack_depot_put().
+		 */
+		if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(!refcount_inc_not_zero(&stack->count)))
+			continue;
+
+		/* 3. Do full validation of the record. */
+		if (likely(stack->hash == hash && stack->size == size &&
+			   !stackdepot_memcmp(entries, stack->entries, size))) {
+			ret = stack;
+			break;
+		}
 
-	list_for_each(pos, bucket) {
-		found = list_entry(pos, struct stack_record, list);
-		if (found->hash == hash &&
-		    found->size == size &&
-		    !stackdepot_memcmp(entries, found->entries, size))
-			return found;
+		/* Undo refcount - could have raced with stack_depot_put(). */
+		if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(refcount_dec_and_test(&stack->count)))
+			depot_free_stack(stack);
 	}
-	return NULL;
+
+	return ret;
 }
 
 depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
@@ -482,7 +551,6 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	struct page *page = NULL;
 	void *prealloc = NULL;
 	bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
-	bool need_alloc = false;
 	unsigned long flags;
 	u32 hash;
 
@@ -505,31 +573,16 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	hash = hash_stack(entries, nr_entries);
 	bucket = &stack_table[hash & stack_hash_mask];
 
-	read_lock_irqsave(&pool_rwlock, flags);
-	printk_deferred_enter();
-
-	/* Fast path: look the stack trace up without full locking. */
-	found = find_stack(bucket, entries, nr_entries, hash);
-	if (found) {
-		if (depot_flags & STACK_DEPOT_FLAG_GET)
-			refcount_inc(&found->count);
-		printk_deferred_exit();
-		read_unlock_irqrestore(&pool_rwlock, flags);
+	/* Fast path: look the stack trace up without locking. */
+	found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
+	if (found)
 		goto exit;
-	}
-
-	/* Take note if another stack pool needs to be allocated. */
-	if (new_pool_required)
-		need_alloc = true;
-
-	printk_deferred_exit();
-	read_unlock_irqrestore(&pool_rwlock, flags);
 
 	/*
 	 * Allocate memory for a new pool if required now:
 	 * we won't be able to do that under the lock.
 	 */
-	if (unlikely(can_alloc && need_alloc)) {
+	if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
 		/*
 		 * Zero out zone modifiers, as we don't have specific zone
 		 * requirements. Keep the flags related to allocation in atomic
@@ -543,31 +596,33 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 			prealloc = page_address(page);
 	}
 
-	write_lock_irqsave(&pool_rwlock, flags);
+	raw_spin_lock_irqsave(&pool_lock, flags);
 	printk_deferred_enter();
 
-	found = find_stack(bucket, entries, nr_entries, hash);
+	/* Try to find again, to avoid concurrently inserting duplicates. */
+	found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
 	if (!found) {
 		struct stack_record *new =
 			depot_alloc_stack(entries, nr_entries, hash, &prealloc);
 
 		if (new) {
-			list_add(&new->list, bucket);
+			/*
+			 * This releases the stack record into the bucket and
+			 * makes it visible to readers in find_stack().
+			 */
+			list_add_rcu(&new->hash_list, bucket);
 			found = new;
 		}
-	} else {
-		if (depot_flags & STACK_DEPOT_FLAG_GET)
-			refcount_inc(&found->count);
+	} else if (prealloc) {
 		/*
 		 * Stack depot already contains this stack trace, but let's
 		 * keep the preallocated memory for future.
 		 */
-		if (prealloc)
-			depot_keep_new_pool(&prealloc);
+		depot_keep_new_pool(&prealloc);
 	}
 
 	printk_deferred_exit();
-	write_unlock_irqrestore(&pool_rwlock, flags);
+	raw_spin_unlock_irqrestore(&pool_lock, flags);
 exit:
 	if (prealloc) {
 		/* Stack depot didn't use this memory, free it. */
@@ -592,7 +647,6 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
 			       unsigned long **entries)
 {
 	struct stack_record *stack;
-	unsigned long flags;
 
 	*entries = NULL;
 	/*
@@ -604,13 +658,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
 	if (!handle || stack_depot_disabled)
 		return 0;
 
-	read_lock_irqsave(&pool_rwlock, flags);
-	printk_deferred_enter();
-
 	stack = depot_fetch_stack(handle);
-
-	printk_deferred_exit();
-	read_unlock_irqrestore(&pool_rwlock, flags);
+	/*
+	 * Should never be NULL, otherwise this is a use-after-put.
+	 */
+	if (WARN_ON(!stack))
+		return 0;
 
 	*entries = stack->entries;
 	return stack->size;
@@ -620,29 +673,20 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch);
 void stack_depot_put(depot_stack_handle_t handle)
 {
 	struct stack_record *stack;
-	unsigned long flags;
 
 	if (!handle || stack_depot_disabled)
 		return;
 
-	write_lock_irqsave(&pool_rwlock, flags);
-	printk_deferred_enter();
-
 	stack = depot_fetch_stack(handle);
+	/*
+	 * Should always be able to find the stack record, otherwise this is an
+	 * unbalanced put attempt.
+	 */
 	if (WARN_ON(!stack))
-		goto out;
-
-	if (refcount_dec_and_test(&stack->count)) {
-		/* Unlink stack from the hash table. */
-		list_del(&stack->list);
+		return;
 
-		/* Free stack. */
+	if (refcount_dec_and_test(&stack->count))
 		depot_free_stack(stack);
-	}
-
-out:
-	printk_deferred_exit();
-	write_unlock_irqrestore(&pool_rwlock, flags);
 }
 EXPORT_SYMBOL_GPL(stack_depot_put);

Andrey Konovalov Jan. 12, 2024, 2:38 a.m. UTC | #6

On Thu, Jan 11, 2024 at 8:08 PM Marco Elver <elver@google.com> wrote:
>
> On Thu, Jan 11, 2024 at 04:36AM -0800, Andi Kleen wrote:
> > > stackdepot is severely limited in what kernel facilities it may use
> > > due to being used by such low level facilities as the allocator
> > > itself.
> >
> > RCU can be done quite low level too (e.g. there is NMI safe RCU)
>
> How about the below? This should get us back the performance of the old
> lock-less version. Although it's using rculist, we don't actually need
> to synchronize via RCU.
>
> Thanks,
> -- Marco
>
> ------ >8 ------
>
> From: Marco Elver <elver@google.com>
> Date: Tue, 9 Jan 2024 10:21:56 +0100
> Subject: [PATCH] stackdepot: make fast paths lock-less again
>
> stack_depot_put() unconditionally takes the pool_rwlock as a writer.
> This is unnecessary if the stack record is not going to be freed.
> Furthermore, reader-writer locks have inherent cache contention, which
> does not scale well on machines with large CPU counts.
>
> Instead, rework the synchronization story of stack depot to again avoid
> taking any locks in the fast paths. This is done by relying on RCU
> primitives to give us lock-less list traversal. See code comments for
> more details.
>
> Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
> Signed-off-by: Marco Elver <elver@google.com>
> ---
>  lib/stackdepot.c | 222 ++++++++++++++++++++++++++++-------------------
>  1 file changed, 133 insertions(+), 89 deletions(-)
>
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index a0be5d05c7f0..9eaf46f8abc4 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -19,10 +19,13 @@
>  #include <linux/kernel.h>
>  #include <linux/kmsan.h>
>  #include <linux/list.h>
> +#include <linux/llist.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
>  #include <linux/percpu.h>
>  #include <linux/printk.h>
> +#include <linux/rculist.h>
> +#include <linux/rcupdate.h>
>  #include <linux/refcount.h>
>  #include <linux/slab.h>
>  #include <linux/spinlock.h>
> @@ -67,7 +70,8 @@ union handle_parts {
>  };
>
>  struct stack_record {
> -       struct list_head list;          /* Links in hash table or freelist */
> +       struct list_head hash_list;     /* Links in the hash table */
> +       struct llist_node free_list;    /* Links in the freelist */
>         u32 hash;                       /* Hash in hash table */
>         u32 size;                       /* Number of stored frames */
>         union handle_parts handle;
> @@ -104,7 +108,7 @@ static void *new_pool;
>  /* Number of pools in stack_pools. */
>  static int pools_num;
>  /* Freelist of stack records within stack_pools. */
> -static LIST_HEAD(free_stacks);
> +static LLIST_HEAD(free_stacks);
>  /*
>   * Stack depot tries to keep an extra pool allocated even before it runs out
>   * of space in the currently used pool. This flag marks whether this extra pool
> @@ -112,8 +116,8 @@ static LIST_HEAD(free_stacks);
>   * yet allocated or if the limit on the number of pools is reached.
>   */
>  static bool new_pool_required = true;
> -/* Lock that protects the variables above. */
> -static DEFINE_RWLOCK(pool_rwlock);
> +/* The lock must be held when performing pool or free list modifications. */
> +static DEFINE_RAW_SPINLOCK(pool_lock);
>
>  static int __init disable_stack_depot(char *str)
>  {
> @@ -263,9 +267,7 @@ static void depot_init_pool(void *pool)
>  {
>         int offset;
>
> -       lockdep_assert_held_write(&pool_rwlock);
> -
> -       WARN_ON(!list_empty(&free_stacks));
> +       lockdep_assert_held(&pool_lock);
>
>         /* Initialize handles and link stack records into the freelist. */
>         for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
> @@ -276,18 +278,25 @@ static void depot_init_pool(void *pool)
>                 stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
>                 stack->handle.extra = 0;
>
> -               list_add(&stack->list, &free_stacks);
> +               llist_add(&stack->free_list, &free_stacks);
> +               INIT_LIST_HEAD(&stack->hash_list);
>         }
>
>         /* Save reference to the pool to be used by depot_fetch_stack(). */
>         stack_pools[pools_num] = pool;
> -       pools_num++;
> +
> +       /*
> +        * Release of pool pointer assignment above. Pairs with the
> +        * smp_load_acquire() in depot_fetch_stack().
> +        */
> +       smp_store_release(&pools_num, pools_num + 1);
> +       ASSERT_EXCLUSIVE_WRITER(pools_num);
>  }
>
>  /* Keeps the preallocated memory to be used for a new stack depot pool. */
>  static void depot_keep_new_pool(void **prealloc)
>  {
> -       lockdep_assert_held_write(&pool_rwlock);
> +       lockdep_assert_held(&pool_lock);
>
>         /*
>          * If a new pool is already saved or the maximum number of
> @@ -310,16 +319,16 @@ static void depot_keep_new_pool(void **prealloc)
>          * number of pools is reached. In either case, take note that
>          * keeping another pool is not required.
>          */
> -       new_pool_required = false;
> +       WRITE_ONCE(new_pool_required, false);
>  }
>
>  /* Updates references to the current and the next stack depot pools. */
>  static bool depot_update_pools(void **prealloc)
>  {
> -       lockdep_assert_held_write(&pool_rwlock);
> +       lockdep_assert_held(&pool_lock);
>
>         /* Check if we still have objects in the freelist. */
> -       if (!list_empty(&free_stacks))
> +       if (!llist_empty(&free_stacks))
>                 goto out_keep_prealloc;
>
>         /* Check if we have a new pool saved and use it. */
> @@ -329,7 +338,7 @@ static bool depot_update_pools(void **prealloc)
>
>                 /* Take note that we might need a new new_pool. */
>                 if (pools_num < DEPOT_MAX_POOLS)
> -                       new_pool_required = true;
> +                       WRITE_ONCE(new_pool_required, true);
>
>                 /* Try keeping the preallocated memory for new_pool. */
>                 goto out_keep_prealloc;
> @@ -362,20 +371,19 @@ static struct stack_record *
>  depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
>  {
>         struct stack_record *stack;
> +       struct llist_node *free;
>
> -       lockdep_assert_held_write(&pool_rwlock);
> +       lockdep_assert_held(&pool_lock);
>
>         /* Update current and new pools if required and possible. */
>         if (!depot_update_pools(prealloc))
>                 return NULL;
>
>         /* Check if we have a stack record to save the stack trace. */
> -       if (list_empty(&free_stacks))
> +       free = llist_del_first(&free_stacks);
> +       if (!free)
>                 return NULL;
> -
> -       /* Get and unlink the first entry from the freelist. */
> -       stack = list_first_entry(&free_stacks, struct stack_record, list);
> -       list_del(&stack->list);
> +       stack = llist_entry(free, struct stack_record, free_list);
>
>         /* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
>         if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
> @@ -385,7 +393,6 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
>         stack->hash = hash;
>         stack->size = size;
>         /* stack->handle is already filled in by depot_init_pool(). */
> -       refcount_set(&stack->count, 1);
>         memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
>
>         /*
> @@ -394,21 +401,30 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
>          */
>         kmsan_unpoison_memory(stack, DEPOT_STACK_RECORD_SIZE);
>
> +       /*
> +        * Release saving of the stack trace. Pairs with smp_mb() in
> +        * depot_fetch_stack().
> +        */
> +       smp_mb__before_atomic();
> +       refcount_set(&stack->count, 1);
> +
>         return stack;
>  }
>
>  static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
>  {
> +       /* Acquire the pool pointer written in depot_init_pool(). */
> +       const int pools_num_cached = smp_load_acquire(&pools_num);
>         union handle_parts parts = { .handle = handle };
>         void *pool;
>         size_t offset = parts.offset << DEPOT_STACK_ALIGN;
>         struct stack_record *stack;
>
> -       lockdep_assert_held(&pool_rwlock);
> +       lockdep_assert_not_held(&pool_lock);
>
> -       if (parts.pool_index > pools_num) {
> +       if (parts.pool_index > pools_num_cached) {
>                 WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
> -                    parts.pool_index, pools_num, handle);
> +                    parts.pool_index, pools_num_cached, handle);
>                 return NULL;
>         }
>
> @@ -417,15 +433,35 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
>                 return NULL;
>
>         stack = pool + offset;
> +
> +       /*
> +        * Acquire the stack trace. Pairs with smp_mb() in depot_alloc_stack().
> +        *
> +        * This does not protect against a stack_depot_put() freeing the record
> +        * and having it subsequently being reused. Callers are responsible to
> +        * avoid using stack depot handles after passing to stack_depot_put().
> +        */
> +       if (!refcount_read(&stack->count))
> +               return NULL;

Can this happen? It seems that depot_fetch_stack should only be called
for handles that were returned from stack_depot_save_flags before all
puts and thus the the refcount should > 0. Or is this a safeguard
against improper API usage?

> +       smp_mb__after_atomic();
> +
>         return stack;
>  }
>
>  /* Links stack into the freelist. */
>  static void depot_free_stack(struct stack_record *stack)
>  {
> -       lockdep_assert_held_write(&pool_rwlock);
> +       unsigned long flags;
> +
> +       lockdep_assert_not_held(&pool_lock);
> +
> +       raw_spin_lock_irqsave(&pool_lock, flags);
> +       printk_deferred_enter();
> +       list_del_rcu(&stack->hash_list);
> +       printk_deferred_exit();
> +       raw_spin_unlock_irqrestore(&pool_lock, flags);
>
> -       list_add(&stack->list, &free_stacks);
> +       llist_add(&stack->free_list, &free_stacks);

This llist_add is outside of the lock just because we can (i.e.
llist_add can run concurrently with the other free_stacks operations,
which are all under the lock), right? This slightly contradicts the
comment above the free_stacks definition.

If we put this under the lock and use normal list instead of llist, I
think we can then combine the hash_list with the free_list like before
to save up on some space for stack_record. Would that make sense?

>  }
>
>  /* Calculates the hash for a stack. */
> @@ -453,22 +489,55 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
>
>  /* Finds a stack in a bucket of the hash table. */
>  static inline struct stack_record *find_stack(struct list_head *bucket,
> -                                            unsigned long *entries, int size,
> -                                            u32 hash)
> +                                             unsigned long *entries, int size,
> +                                             u32 hash, depot_flags_t flags)
>  {
> -       struct list_head *pos;
> -       struct stack_record *found;
> +       struct stack_record *stack, *ret = NULL;
>
> -       lockdep_assert_held(&pool_rwlock);
> +       /*
> +        * Due to being used from low-level code paths such as the allocators,
> +        * NMI, or even RCU itself, stackdepot cannot rely on primitives that
> +        * would sleep (such as synchronize_rcu()) or end up recursively call
> +        * into stack depot again (such as call_rcu()).
> +        *
> +        * Instead, lock-less readers only rely on RCU primitives for correct
> +        * memory ordering, but do not use RCU-based synchronization otherwise.
> +        * Instead, we perform 3-pass validation below to ensure that the stack
> +        * record we accessed is actually valid. If we fail to obtain a valid
> +        * stack record here, the slow-path in stack_depot_save_flags() will
> +        * retry to avoid inserting duplicates.
> +        *
> +        * If STACK_DEPOT_FLAG_GET is not used, it is undefined behaviour to
> +        * call stack_depot_put() later - i.e. in the non-refcounted case, we do
> +        * not have to worry that the entry will be recycled.
> +        */
> +
> +       list_for_each_entry_rcu(stack, bucket, hash_list) {

So we don't need rcu_read_lock here, because we don't rely on call_rcu
etc., right?

> +               /* 1. Check if this entry could potentially match. */
> +               if (data_race(stack->hash != hash || stack->size != size))
> +                       continue;
> +
> +               /*
> +                * 2. Increase refcount if not zero. If this is successful, we
> +                *    know that this stack record is valid and will not be freed by
> +                *    stack_depot_put().
> +                */
> +               if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(!refcount_inc_not_zero(&stack->count)))
> +                       continue;
> +
> +               /* 3. Do full validation of the record. */
> +               if (likely(stack->hash == hash && stack->size == size &&
> +                          !stackdepot_memcmp(entries, stack->entries, size))) {
> +                       ret = stack;
> +                       break;
> +               }
>
> -       list_for_each(pos, bucket) {
> -               found = list_entry(pos, struct stack_record, list);
> -               if (found->hash == hash &&
> -                   found->size == size &&
> -                   !stackdepot_memcmp(entries, found->entries, size))
> -                       return found;
> +               /* Undo refcount - could have raced with stack_depot_put(). */
> +               if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(refcount_dec_and_test(&stack->count)))
> +                       depot_free_stack(stack);
>         }
> -       return NULL;
> +
> +       return ret;
>  }
>
>  depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> @@ -482,7 +551,6 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>         struct page *page = NULL;
>         void *prealloc = NULL;
>         bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
> -       bool need_alloc = false;
>         unsigned long flags;
>         u32 hash;
>
> @@ -505,31 +573,16 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>         hash = hash_stack(entries, nr_entries);
>         bucket = &stack_table[hash & stack_hash_mask];
>
> -       read_lock_irqsave(&pool_rwlock, flags);
> -       printk_deferred_enter();
> -
> -       /* Fast path: look the stack trace up without full locking. */
> -       found = find_stack(bucket, entries, nr_entries, hash);
> -       if (found) {
> -               if (depot_flags & STACK_DEPOT_FLAG_GET)
> -                       refcount_inc(&found->count);
> -               printk_deferred_exit();
> -               read_unlock_irqrestore(&pool_rwlock, flags);
> +       /* Fast path: look the stack trace up without locking. */
> +       found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
> +       if (found)
>                 goto exit;
> -       }
> -
> -       /* Take note if another stack pool needs to be allocated. */
> -       if (new_pool_required)
> -               need_alloc = true;
> -
> -       printk_deferred_exit();
> -       read_unlock_irqrestore(&pool_rwlock, flags);
>
>         /*
>          * Allocate memory for a new pool if required now:
>          * we won't be able to do that under the lock.
>          */
> -       if (unlikely(can_alloc && need_alloc)) {
> +       if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
>                 /*
>                  * Zero out zone modifiers, as we don't have specific zone
>                  * requirements. Keep the flags related to allocation in atomic
> @@ -543,31 +596,33 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>                         prealloc = page_address(page);
>         }
>
> -       write_lock_irqsave(&pool_rwlock, flags);
> +       raw_spin_lock_irqsave(&pool_lock, flags);
>         printk_deferred_enter();
>
> -       found = find_stack(bucket, entries, nr_entries, hash);
> +       /* Try to find again, to avoid concurrently inserting duplicates. */
> +       found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
>         if (!found) {
>                 struct stack_record *new =
>                         depot_alloc_stack(entries, nr_entries, hash, &prealloc);
>
>                 if (new) {
> -                       list_add(&new->list, bucket);
> +                       /*
> +                        * This releases the stack record into the bucket and
> +                        * makes it visible to readers in find_stack().
> +                        */
> +                       list_add_rcu(&new->hash_list, bucket);
>                         found = new;
>                 }
> -       } else {
> -               if (depot_flags & STACK_DEPOT_FLAG_GET)
> -                       refcount_inc(&found->count);
> +       } else if (prealloc) {
>                 /*
>                  * Stack depot already contains this stack trace, but let's
>                  * keep the preallocated memory for future.
>                  */
> -               if (prealloc)
> -                       depot_keep_new_pool(&prealloc);
> +               depot_keep_new_pool(&prealloc);
>         }
>
>         printk_deferred_exit();
> -       write_unlock_irqrestore(&pool_rwlock, flags);
> +       raw_spin_unlock_irqrestore(&pool_lock, flags);
>  exit:
>         if (prealloc) {
>                 /* Stack depot didn't use this memory, free it. */
> @@ -592,7 +647,6 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
>                                unsigned long **entries)
>  {
>         struct stack_record *stack;
> -       unsigned long flags;
>
>         *entries = NULL;
>         /*
> @@ -604,13 +658,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
>         if (!handle || stack_depot_disabled)
>                 return 0;
>
> -       read_lock_irqsave(&pool_rwlock, flags);
> -       printk_deferred_enter();
> -
>         stack = depot_fetch_stack(handle);
> -
> -       printk_deferred_exit();
> -       read_unlock_irqrestore(&pool_rwlock, flags);
> +       /*
> +        * Should never be NULL, otherwise this is a use-after-put.
> +        */
> +       if (WARN_ON(!stack))
> +               return 0;
>
>         *entries = stack->entries;
>         return stack->size;
> @@ -620,29 +673,20 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch);
>  void stack_depot_put(depot_stack_handle_t handle)
>  {
>         struct stack_record *stack;
> -       unsigned long flags;
>
>         if (!handle || stack_depot_disabled)
>                 return;
>
> -       write_lock_irqsave(&pool_rwlock, flags);
> -       printk_deferred_enter();
> -
>         stack = depot_fetch_stack(handle);
> +       /*
> +        * Should always be able to find the stack record, otherwise this is an
> +        * unbalanced put attempt.
> +        */
>         if (WARN_ON(!stack))
> -               goto out;
> -
> -       if (refcount_dec_and_test(&stack->count)) {
> -               /* Unlink stack from the hash table. */
> -               list_del(&stack->list);
> +               return;
>
> -               /* Free stack. */
> +       if (refcount_dec_and_test(&stack->count))
>                 depot_free_stack(stack);
> -       }
> -
> -out:
> -       printk_deferred_exit();
> -       write_unlock_irqrestore(&pool_rwlock, flags);
>  }
>  EXPORT_SYMBOL_GPL(stack_depot_put);
>
> --
> 2.43.0.275.g3460e3d667-goog
>

Looks good to me from the functional perspective (modulo the
clarification comments I left above), but it would be great to get a
review from someone with a better understanding of the low-level
synchronization primitives.

Thank you!

Marco Elver Jan. 12, 2024, 8:24 a.m. UTC | #7

On Fri, 12 Jan 2024 at 03:38, Andrey Konovalov <andreyknvl@gmail.com> wrote:
>
> On Thu, Jan 11, 2024 at 8:08 PM Marco Elver <elver@google.com> wrote:
> >
> > On Thu, Jan 11, 2024 at 04:36AM -0800, Andi Kleen wrote:
> > > > stackdepot is severely limited in what kernel facilities it may use
> > > > due to being used by such low level facilities as the allocator
> > > > itself.
> > >
> > > RCU can be done quite low level too (e.g. there is NMI safe RCU)
> >
> > How about the below? This should get us back the performance of the old
> > lock-less version. Although it's using rculist, we don't actually need
> > to synchronize via RCU.
> >
> > Thanks,
> > -- Marco
> >
> > ------ >8 ------
> >
> > From: Marco Elver <elver@google.com>
> > Date: Tue, 9 Jan 2024 10:21:56 +0100
> > Subject: [PATCH] stackdepot: make fast paths lock-less again
> >
> > stack_depot_put() unconditionally takes the pool_rwlock as a writer.
> > This is unnecessary if the stack record is not going to be freed.
> > Furthermore, reader-writer locks have inherent cache contention, which
> > does not scale well on machines with large CPU counts.
> >
> > Instead, rework the synchronization story of stack depot to again avoid
> > taking any locks in the fast paths. This is done by relying on RCU
> > primitives to give us lock-less list traversal. See code comments for
> > more details.
> >
> > Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
> > Signed-off-by: Marco Elver <elver@google.com>
> > ---
> >  lib/stackdepot.c | 222 ++++++++++++++++++++++++++++-------------------
> >  1 file changed, 133 insertions(+), 89 deletions(-)
> >
> > diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> > index a0be5d05c7f0..9eaf46f8abc4 100644
> > --- a/lib/stackdepot.c
> > +++ b/lib/stackdepot.c
> > @@ -19,10 +19,13 @@
> >  #include <linux/kernel.h>
> >  #include <linux/kmsan.h>
> >  #include <linux/list.h>
> > +#include <linux/llist.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> >  #include <linux/percpu.h>
> >  #include <linux/printk.h>
> > +#include <linux/rculist.h>
> > +#include <linux/rcupdate.h>
> >  #include <linux/refcount.h>
> >  #include <linux/slab.h>
> >  #include <linux/spinlock.h>
> > @@ -67,7 +70,8 @@ union handle_parts {
> >  };
> >
> >  struct stack_record {
> > -       struct list_head list;          /* Links in hash table or freelist */
> > +       struct list_head hash_list;     /* Links in the hash table */
> > +       struct llist_node free_list;    /* Links in the freelist */
> >         u32 hash;                       /* Hash in hash table */
> >         u32 size;                       /* Number of stored frames */
> >         union handle_parts handle;
> > @@ -104,7 +108,7 @@ static void *new_pool;
> >  /* Number of pools in stack_pools. */
> >  static int pools_num;
> >  /* Freelist of stack records within stack_pools. */
> > -static LIST_HEAD(free_stacks);
> > +static LLIST_HEAD(free_stacks);
> >  /*
> >   * Stack depot tries to keep an extra pool allocated even before it runs out
> >   * of space in the currently used pool. This flag marks whether this extra pool
> > @@ -112,8 +116,8 @@ static LIST_HEAD(free_stacks);
> >   * yet allocated or if the limit on the number of pools is reached.
> >   */
> >  static bool new_pool_required = true;
> > -/* Lock that protects the variables above. */
> > -static DEFINE_RWLOCK(pool_rwlock);
> > +/* The lock must be held when performing pool or free list modifications. */
> > +static DEFINE_RAW_SPINLOCK(pool_lock);
> >
> >  static int __init disable_stack_depot(char *str)
> >  {
> > @@ -263,9 +267,7 @@ static void depot_init_pool(void *pool)
> >  {
> >         int offset;
> >
> > -       lockdep_assert_held_write(&pool_rwlock);
> > -
> > -       WARN_ON(!list_empty(&free_stacks));
> > +       lockdep_assert_held(&pool_lock);
> >
> >         /* Initialize handles and link stack records into the freelist. */
> >         for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
> > @@ -276,18 +278,25 @@ static void depot_init_pool(void *pool)
> >                 stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
> >                 stack->handle.extra = 0;
> >
> > -               list_add(&stack->list, &free_stacks);
> > +               llist_add(&stack->free_list, &free_stacks);
> > +               INIT_LIST_HEAD(&stack->hash_list);
> >         }
> >
> >         /* Save reference to the pool to be used by depot_fetch_stack(). */
> >         stack_pools[pools_num] = pool;
> > -       pools_num++;
> > +
> > +       /*
> > +        * Release of pool pointer assignment above. Pairs with the
> > +        * smp_load_acquire() in depot_fetch_stack().
> > +        */
> > +       smp_store_release(&pools_num, pools_num + 1);
> > +       ASSERT_EXCLUSIVE_WRITER(pools_num);
> >  }
> >
> >  /* Keeps the preallocated memory to be used for a new stack depot pool. */
> >  static void depot_keep_new_pool(void **prealloc)
> >  {
> > -       lockdep_assert_held_write(&pool_rwlock);
> > +       lockdep_assert_held(&pool_lock);
> >
> >         /*
> >          * If a new pool is already saved or the maximum number of
> > @@ -310,16 +319,16 @@ static void depot_keep_new_pool(void **prealloc)
> >          * number of pools is reached. In either case, take note that
> >          * keeping another pool is not required.
> >          */
> > -       new_pool_required = false;
> > +       WRITE_ONCE(new_pool_required, false);
> >  }
> >
> >  /* Updates references to the current and the next stack depot pools. */
> >  static bool depot_update_pools(void **prealloc)
> >  {
> > -       lockdep_assert_held_write(&pool_rwlock);
> > +       lockdep_assert_held(&pool_lock);
> >
> >         /* Check if we still have objects in the freelist. */
> > -       if (!list_empty(&free_stacks))
> > +       if (!llist_empty(&free_stacks))
> >                 goto out_keep_prealloc;
> >
> >         /* Check if we have a new pool saved and use it. */
> > @@ -329,7 +338,7 @@ static bool depot_update_pools(void **prealloc)
> >
> >                 /* Take note that we might need a new new_pool. */
> >                 if (pools_num < DEPOT_MAX_POOLS)
> > -                       new_pool_required = true;
> > +                       WRITE_ONCE(new_pool_required, true);
> >
> >                 /* Try keeping the preallocated memory for new_pool. */
> >                 goto out_keep_prealloc;
> > @@ -362,20 +371,19 @@ static struct stack_record *
> >  depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> >  {
> >         struct stack_record *stack;
> > +       struct llist_node *free;
> >
> > -       lockdep_assert_held_write(&pool_rwlock);
> > +       lockdep_assert_held(&pool_lock);
> >
> >         /* Update current and new pools if required and possible. */
> >         if (!depot_update_pools(prealloc))
> >                 return NULL;
> >
> >         /* Check if we have a stack record to save the stack trace. */
> > -       if (list_empty(&free_stacks))
> > +       free = llist_del_first(&free_stacks);
> > +       if (!free)
> >                 return NULL;
> > -
> > -       /* Get and unlink the first entry from the freelist. */
> > -       stack = list_first_entry(&free_stacks, struct stack_record, list);
> > -       list_del(&stack->list);
> > +       stack = llist_entry(free, struct stack_record, free_list);
> >
> >         /* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
> >         if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
> > @@ -385,7 +393,6 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> >         stack->hash = hash;
> >         stack->size = size;
> >         /* stack->handle is already filled in by depot_init_pool(). */
> > -       refcount_set(&stack->count, 1);
> >         memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
> >
> >         /*
> > @@ -394,21 +401,30 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> >          */
> >         kmsan_unpoison_memory(stack, DEPOT_STACK_RECORD_SIZE);
> >
> > +       /*
> > +        * Release saving of the stack trace. Pairs with smp_mb() in
> > +        * depot_fetch_stack().
> > +        */
> > +       smp_mb__before_atomic();
> > +       refcount_set(&stack->count, 1);
> > +
> >         return stack;
> >  }
> >
> >  static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
> >  {
> > +       /* Acquire the pool pointer written in depot_init_pool(). */
> > +       const int pools_num_cached = smp_load_acquire(&pools_num);
> >         union handle_parts parts = { .handle = handle };
> >         void *pool;
> >         size_t offset = parts.offset << DEPOT_STACK_ALIGN;
> >         struct stack_record *stack;
> >
> > -       lockdep_assert_held(&pool_rwlock);
> > +       lockdep_assert_not_held(&pool_lock);
> >
> > -       if (parts.pool_index > pools_num) {
> > +       if (parts.pool_index > pools_num_cached) {
> >                 WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
> > -                    parts.pool_index, pools_num, handle);
> > +                    parts.pool_index, pools_num_cached, handle);
> >                 return NULL;
> >         }
> >
> > @@ -417,15 +433,35 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
> >                 return NULL;
> >
> >         stack = pool + offset;
> > +
> > +       /*
> > +        * Acquire the stack trace. Pairs with smp_mb() in depot_alloc_stack().
> > +        *
> > +        * This does not protect against a stack_depot_put() freeing the record
> > +        * and having it subsequently being reused. Callers are responsible to
> > +        * avoid using stack depot handles after passing to stack_depot_put().
> > +        */
> > +       if (!refcount_read(&stack->count))
> > +               return NULL;
>
> Can this happen? It seems that depot_fetch_stack should only be called
> for handles that were returned from stack_depot_save_flags before all
> puts and thus the the refcount should > 0. Or is this a safeguard
> against improper API usage?
>
> > +       smp_mb__after_atomic();
> > +
> >         return stack;
> >  }
> >
> >  /* Links stack into the freelist. */
> >  static void depot_free_stack(struct stack_record *stack)
> >  {
> > -       lockdep_assert_held_write(&pool_rwlock);
> > +       unsigned long flags;
> > +
> > +       lockdep_assert_not_held(&pool_lock);
> > +
> > +       raw_spin_lock_irqsave(&pool_lock, flags);
> > +       printk_deferred_enter();
> > +       list_del_rcu(&stack->hash_list);
> > +       printk_deferred_exit();
> > +       raw_spin_unlock_irqrestore(&pool_lock, flags);
> >
> > -       list_add(&stack->list, &free_stacks);
> > +       llist_add(&stack->free_list, &free_stacks);
>
> This llist_add is outside of the lock just because we can (i.e.
> llist_add can run concurrently with the other free_stacks operations,
> which are all under the lock), right? This slightly contradicts the
> comment above the free_stacks definition.

Yes, llist can be used without locks.

> If we put this under the lock and use normal list instead of llist, I
> think we can then combine the hash_list with the free_list like before
> to save up on some space for stack_record. Would that make sense?

No, the RCU protected list can only be deleted, but not immediately
moved elsewhere. I.e. doing list_del_rcu() and list_add() immediately
will break list_for_each_entry_rcu() list traversal because list_add()
would modify the entry's next pointer which list traversal can still
potentially observe.

This actually made me realize that even doing list_del_rcu() and
list_add_rcu() later under the lock is dubious: it's possible that
find_stack() observes an entry that is being deleted, stalls, and that
entry is re-added so another list and then have a data race on reading
the next pointer of the old/new entry (which list_add_rcu() assigns
with plain C writes). While the documentation says that list_del_rcu()
and list_add_rcu() can be used concurrently with
list_for_each_entry_rcu(), 2 successive list_del_rcu() and
list_add_rcu() have to normally be separated by an RCU grace period.

I was trying to not have to use synchronize_rcu() or call_rcu()
(because we can't from stack depot), but perhaps there is no way
around it. What we can do is use get_state_synchronize_rcu(), but that
requires adding yet another field to stack_record. Another option
would be to have validation to figure out that the entry moved between
lists, but that's also hard to do.

> >  }
> >
> >  /* Calculates the hash for a stack. */
> > @@ -453,22 +489,55 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
> >
> >  /* Finds a stack in a bucket of the hash table. */
> >  static inline struct stack_record *find_stack(struct list_head *bucket,
> > -                                            unsigned long *entries, int size,
> > -                                            u32 hash)
> > +                                             unsigned long *entries, int size,
> > +                                             u32 hash, depot_flags_t flags)
> >  {
> > -       struct list_head *pos;
> > -       struct stack_record *found;
> > +       struct stack_record *stack, *ret = NULL;
> >
> > -       lockdep_assert_held(&pool_rwlock);
> > +       /*
> > +        * Due to being used from low-level code paths such as the allocators,
> > +        * NMI, or even RCU itself, stackdepot cannot rely on primitives that
> > +        * would sleep (such as synchronize_rcu()) or end up recursively call
> > +        * into stack depot again (such as call_rcu()).
> > +        *
> > +        * Instead, lock-less readers only rely on RCU primitives for correct
> > +        * memory ordering, but do not use RCU-based synchronization otherwise.
> > +        * Instead, we perform 3-pass validation below to ensure that the stack
> > +        * record we accessed is actually valid. If we fail to obtain a valid
> > +        * stack record here, the slow-path in stack_depot_save_flags() will
> > +        * retry to avoid inserting duplicates.
> > +        *
> > +        * If STACK_DEPOT_FLAG_GET is not used, it is undefined behaviour to
> > +        * call stack_depot_put() later - i.e. in the non-refcounted case, we do
> > +        * not have to worry that the entry will be recycled.
> > +        */
> > +
> > +       list_for_each_entry_rcu(stack, bucket, hash_list) {
>
> So we don't need rcu_read_lock here, because we don't rely on call_rcu
> etc., right?

That was the idea, but see my answer above. I will have a rethink how
to solve the list_del_rcu() with successive list_add_rcu() problem.

> > +               /* 1. Check if this entry could potentially match. */
> > +               if (data_race(stack->hash != hash || stack->size != size))
> > +                       continue;
> > +
> > +               /*
> > +                * 2. Increase refcount if not zero. If this is successful, we
> > +                *    know that this stack record is valid and will not be freed by
> > +                *    stack_depot_put().
> > +                */
> > +               if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(!refcount_inc_not_zero(&stack->count)))
> > +                       continue;
> > +
> > +               /* 3. Do full validation of the record. */
> > +               if (likely(stack->hash == hash && stack->size == size &&
> > +                          !stackdepot_memcmp(entries, stack->entries, size))) {
> > +                       ret = stack;
> > +                       break;
> > +               }
> >
> > -       list_for_each(pos, bucket) {
> > -               found = list_entry(pos, struct stack_record, list);
> > -               if (found->hash == hash &&
> > -                   found->size == size &&
> > -                   !stackdepot_memcmp(entries, found->entries, size))
> > -                       return found;
> > +               /* Undo refcount - could have raced with stack_depot_put(). */
> > +               if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(refcount_dec_and_test(&stack->count)))
> > +                       depot_free_stack(stack);
> >         }
> > -       return NULL;
> > +
> > +       return ret;
> >  }
> >
> >  depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> > @@ -482,7 +551,6 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> >         struct page *page = NULL;
> >         void *prealloc = NULL;
> >         bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
> > -       bool need_alloc = false;
> >         unsigned long flags;
> >         u32 hash;
> >
> > @@ -505,31 +573,16 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> >         hash = hash_stack(entries, nr_entries);
> >         bucket = &stack_table[hash & stack_hash_mask];
> >
> > -       read_lock_irqsave(&pool_rwlock, flags);
> > -       printk_deferred_enter();
> > -
> > -       /* Fast path: look the stack trace up without full locking. */
> > -       found = find_stack(bucket, entries, nr_entries, hash);
> > -       if (found) {
> > -               if (depot_flags & STACK_DEPOT_FLAG_GET)
> > -                       refcount_inc(&found->count);
> > -               printk_deferred_exit();
> > -               read_unlock_irqrestore(&pool_rwlock, flags);
> > +       /* Fast path: look the stack trace up without locking. */
> > +       found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
> > +       if (found)
> >                 goto exit;
> > -       }
> > -
> > -       /* Take note if another stack pool needs to be allocated. */
> > -       if (new_pool_required)
> > -               need_alloc = true;
> > -
> > -       printk_deferred_exit();
> > -       read_unlock_irqrestore(&pool_rwlock, flags);
> >
> >         /*
> >          * Allocate memory for a new pool if required now:
> >          * we won't be able to do that under the lock.
> >          */
> > -       if (unlikely(can_alloc && need_alloc)) {
> > +       if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
> >                 /*
> >                  * Zero out zone modifiers, as we don't have specific zone
> >                  * requirements. Keep the flags related to allocation in atomic
> > @@ -543,31 +596,33 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> >                         prealloc = page_address(page);
> >         }
> >
> > -       write_lock_irqsave(&pool_rwlock, flags);
> > +       raw_spin_lock_irqsave(&pool_lock, flags);
> >         printk_deferred_enter();
> >
> > -       found = find_stack(bucket, entries, nr_entries, hash);
> > +       /* Try to find again, to avoid concurrently inserting duplicates. */
> > +       found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
> >         if (!found) {
> >                 struct stack_record *new =
> >                         depot_alloc_stack(entries, nr_entries, hash, &prealloc);
> >
> >                 if (new) {
> > -                       list_add(&new->list, bucket);
> > +                       /*
> > +                        * This releases the stack record into the bucket and
> > +                        * makes it visible to readers in find_stack().
> > +                        */
> > +                       list_add_rcu(&new->hash_list, bucket);
> >                         found = new;
> >                 }
> > -       } else {
> > -               if (depot_flags & STACK_DEPOT_FLAG_GET)
> > -                       refcount_inc(&found->count);
> > +       } else if (prealloc) {
> >                 /*
> >                  * Stack depot already contains this stack trace, but let's
> >                  * keep the preallocated memory for future.
> >                  */
> > -               if (prealloc)
> > -                       depot_keep_new_pool(&prealloc);
> > +               depot_keep_new_pool(&prealloc);
> >         }
> >
> >         printk_deferred_exit();
> > -       write_unlock_irqrestore(&pool_rwlock, flags);
> > +       raw_spin_unlock_irqrestore(&pool_lock, flags);
> >  exit:
> >         if (prealloc) {
> >                 /* Stack depot didn't use this memory, free it. */
> > @@ -592,7 +647,6 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> >                                unsigned long **entries)
> >  {
> >         struct stack_record *stack;
> > -       unsigned long flags;
> >
> >         *entries = NULL;
> >         /*
> > @@ -604,13 +658,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> >         if (!handle || stack_depot_disabled)
> >                 return 0;
> >
> > -       read_lock_irqsave(&pool_rwlock, flags);
> > -       printk_deferred_enter();
> > -
> >         stack = depot_fetch_stack(handle);
> > -
> > -       printk_deferred_exit();
> > -       read_unlock_irqrestore(&pool_rwlock, flags);
> > +       /*
> > +        * Should never be NULL, otherwise this is a use-after-put.
> > +        */
> > +       if (WARN_ON(!stack))
> > +               return 0;
> >
> >         *entries = stack->entries;
> >         return stack->size;
> > @@ -620,29 +673,20 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch);
> >  void stack_depot_put(depot_stack_handle_t handle)
> >  {
> >         struct stack_record *stack;
> > -       unsigned long flags;
> >
> >         if (!handle || stack_depot_disabled)
> >                 return;
> >
> > -       write_lock_irqsave(&pool_rwlock, flags);
> > -       printk_deferred_enter();
> > -
> >         stack = depot_fetch_stack(handle);
> > +       /*
> > +        * Should always be able to find the stack record, otherwise this is an
> > +        * unbalanced put attempt.
> > +        */
> >         if (WARN_ON(!stack))
> > -               goto out;
> > -
> > -       if (refcount_dec_and_test(&stack->count)) {
> > -               /* Unlink stack from the hash table. */
> > -               list_del(&stack->list);
> > +               return;
> >
> > -               /* Free stack. */
> > +       if (refcount_dec_and_test(&stack->count))
> >                 depot_free_stack(stack);
> > -       }
> > -
> > -out:
> > -       printk_deferred_exit();
> > -       write_unlock_irqrestore(&pool_rwlock, flags);
> >  }
> >  EXPORT_SYMBOL_GPL(stack_depot_put);
> >
> > --
> > 2.43.0.275.g3460e3d667-goog
> >
>
> Looks good to me from the functional perspective (modulo the
> clarification comments I left above), but it would be great to get a
> review from someone with a better understanding of the low-level
> synchronization primitives.

Yes - and I'll have to rework this to use get_state_synchronize_rcu()
after all. When it's ready for proper review I'll send an RFC patch.

Thanks,
-- Marco

Marco Elver Jan. 12, 2024, 10:15 p.m. UTC | #8

On Fri, Jan 12, 2024 at 09:24AM +0100, Marco Elver wrote:
> On Fri, 12 Jan 2024 at 03:38, Andrey Konovalov <andreyknvl@gmail.com> wrote:
[...]
> > Looks good to me from the functional perspective (modulo the
> > clarification comments I left above), but it would be great to get a
> > review from someone with a better understanding of the low-level
> > synchronization primitives.
> 
> Yes - and I'll have to rework this to use get_state_synchronize_rcu()
> after all. When it's ready for proper review I'll send an RFC patch.

The below should be what we want, this time without weird hacks.
NMI-safe RCU does work for this case.

I'll let the test robot beat on it then send the patch next week
(there's going to be a patch 1/2 to add stats counters as well).

Also note that the rwlock broke RT kernels, which is also fixed by the
below patch.

Thanks,
-- Marco


From 02b11afe1dcaf42e86b7030e32146a8b34d4bd09 Mon Sep 17 00:00:00 2001
From: Marco Elver <elver@google.com>
Date: Tue, 9 Jan 2024 10:21:56 +0100
Subject: [PATCH] stackdepot: make fast paths lock-less again

With the introduction of the pool_rwlock (reader-writer lock), several
fast paths end up taking the pool_rwlock as readers. Furthermore,
stack_depot_put() unconditionally takes the pool_rwlock as a writer.

Despite allowing readers to make forward-progress concurrently,
reader-writer locks have inherent cache contention issues, which does
not scale well on systems with large CPU counts. For cases with short
critical sections, as is the case with stack depot, they can cause more
harm than good, and alternative designs should be preferred.

Rework the synchronization story of stack depot to again avoid taking
any locks in the fast paths. This is done by relying on RCU-protected
list traversal, and the NMI-safe subset of RCU to delay reuse of freed
stack records. See code comments for more details.

Along with the performance issues, this also fixes incorrect nesting of rwlock
within a raw_spinlock (pool->lock is the raw spinlock), given that stack depot
should still be usable from anywhere:

 | [ BUG: Invalid wait context ]
 | -----------------------------
 | swapper/0/1 is trying to lock:
 | ffffffff89869be8 (pool_rwlock){..--}-{3:3}, at: stack_depot_save_flags+0x147/0x660
 | other info that might help us debug this:
 | context-{5:5}
 | 2 locks held by swapper/0/1:
 |  #0: ffffffff89632440 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x153/0xd70
 |  #1: ffff888100092018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0x549/0xd70

Stack depot usage stats are similar to the previous version after a KASAN
kernel boot:

 $ cat /sys/kernel/debug/stackdepot/stats
 pools: 838
 allocations: 29865
 frees: 6604
 in_use: 23261
 freelist_size: 1879

As we can see, the number of pools is the same as previously. The freelist size
is minimally larger, but this may also be due to variance across system boots.
This shows that even though we do not eagerly wait for the next RCU grace
period (such as with synchronize_rcu() or call_rcu()) after freeing a stack
record - requiring depot_pop_free() to "poll" if an entry may be used - new
allocations are very likely to happen only in later RCU grace periods.

Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
Signed-off-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
---
 lib/stackdepot.c | 327 +++++++++++++++++++++++++++++++----------------
 1 file changed, 215 insertions(+), 112 deletions(-)

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 80a8ca24ccc8..456396df1c0e 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -24,6 +24,8 @@
 #include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/printk.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
 #include <linux/refcount.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
@@ -68,12 +70,28 @@ union handle_parts {
 };
 
 struct stack_record {
-	struct list_head list;		/* Links in hash table or freelist */
+	struct list_head hash_list;	/* Links in the hash table */
 	u32 hash;			/* Hash in hash table */
 	u32 size;			/* Number of stored frames */
-	union handle_parts handle;
+	union handle_parts handle;	/* Constant after initialization */
 	refcount_t count;
-	unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES];	/* Frames */
+	union {
+		unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES];	/* Frames */
+		struct {
+			/*
+			 * An important invariant of the implementation is to
+			 * only place a stack record onto the freelist iff its
+			 * refcount is zero. Because stack records with a zero
+			 * refcount are never considered as valid, it is safe to
+			 * union @entries and freelist management state below.
+			 * Conversely, as soon as an entry is off the freelist
+			 * and its refcount becomes non-zero, the below must not
+			 * be accessed until being placed back on the freelist.
+			 */
+			struct list_head free_list;	/* Links in the freelist */
+			unsigned long rcu_state;	/* RCU cookie */
+		};
+	};
 };
 
 #define DEPOT_STACK_RECORD_SIZE \
@@ -113,8 +131,8 @@ static LIST_HEAD(free_stacks);
  * yet allocated or if the limit on the number of pools is reached.
  */
 static bool new_pool_required = true;
-/* Lock that protects the variables above. */
-static DEFINE_RWLOCK(pool_rwlock);
+/* The lock must be held when performing pool or free list modifications. */
+static DEFINE_RAW_SPINLOCK(pool_lock);
 
 /* Statistics counters for debugfs. */
 enum depot_counter_id {
@@ -276,14 +294,15 @@ int stack_depot_init(void)
 }
 EXPORT_SYMBOL_GPL(stack_depot_init);
 
-/* Initializes a stack depol pool. */
+/*
+ * Initializes new stack depot @pool, release all its entries to the freelist,
+ * and update the list of pools.
+ */
 static void depot_init_pool(void *pool)
 {
 	int offset;
 
-	lockdep_assert_held_write(&pool_rwlock);
-
-	WARN_ON(!list_empty(&free_stacks));
+	lockdep_assert_held(&pool_lock);
 
 	/* Initialize handles and link stack records into the freelist. */
 	for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
@@ -294,19 +313,31 @@ static void depot_init_pool(void *pool)
 		stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
 		stack->handle.extra = 0;
 
-		list_add(&stack->list, &free_stacks);
+		/*
+		 * Stack traces of size 0 are never saved, and we can simply use
+		 * the size field as an indicator if this is a new unused stack
+		 * record in the freelist.
+		 */
+		stack->size = 0;
+
+		INIT_LIST_HEAD(&stack->hash_list);
+		/* Add to the freelist front to prioritize never-used entries. */
+		list_add(&stack->free_list, &free_stacks);
 		counters[DEPOT_COUNTER_FREELIST_SIZE]++;
 	}
 
 	/* Save reference to the pool to be used by depot_fetch_stack(). */
 	stack_pools[pools_num] = pool;
-	pools_num++;
+
+	/* Pairs with concurrent READ_ONCE() in depot_fetch_stack(). */
+	WRITE_ONCE(pools_num, pools_num + 1);
+	ASSERT_EXCLUSIVE_WRITER(pools_num);
 }
 
 /* Keeps the preallocated memory to be used for a new stack depot pool. */
 static void depot_keep_new_pool(void **prealloc)
 {
-	lockdep_assert_held_write(&pool_rwlock);
+	lockdep_assert_held(&pool_lock);
 
 	/*
 	 * If a new pool is already saved or the maximum number of
@@ -329,17 +360,16 @@ static void depot_keep_new_pool(void **prealloc)
 	 * number of pools is reached. In either case, take note that
 	 * keeping another pool is not required.
 	 */
-	new_pool_required = false;
+	WRITE_ONCE(new_pool_required, false);
 }
 
-/* Updates references to the current and the next stack depot pools. */
-static bool depot_update_pools(void **prealloc)
+/*
+ * Try to initialize a new stack depot pool from either a previous or the
+ * current pre-allocation, and release all its entries to the freelist.
+ */
+static bool depot_try_init_pool(void **prealloc)
 {
-	lockdep_assert_held_write(&pool_rwlock);
-
-	/* Check if we still have objects in the freelist. */
-	if (!list_empty(&free_stacks))
-		goto out_keep_prealloc;
+	lockdep_assert_held(&pool_lock);
 
 	/* Check if we have a new pool saved and use it. */
 	if (new_pool) {
@@ -348,10 +378,9 @@ static bool depot_update_pools(void **prealloc)
 
 		/* Take note that we might need a new new_pool. */
 		if (pools_num < DEPOT_MAX_POOLS)
-			new_pool_required = true;
+			WRITE_ONCE(new_pool_required, true);
 
-		/* Try keeping the preallocated memory for new_pool. */
-		goto out_keep_prealloc;
+		return true;
 	}
 
 	/* Bail out if we reached the pool limit. */
@@ -368,12 +397,30 @@ static bool depot_update_pools(void **prealloc)
 	}
 
 	return false;
+}
+
+/* Try to find next free usable entry. */
+static struct stack_record *depot_pop_free(void)
+{
+	struct stack_record *stack;
+
+	if (list_empty(&free_stacks))
+		return NULL;
+
+	/*
+	 * We maintain the invariant that the elements in front are least
+	 * recently used, and are therefore more likely to be associated with an
+	 * RCU grace period in the past. Consequently it is sufficient to only
+	 * check the first entry.
+	 */
+	stack = list_first_entry(&free_stacks, struct stack_record, free_list);
+	if (stack->size && !poll_state_synchronize_rcu(stack->rcu_state))
+		return NULL;
 
-out_keep_prealloc:
-	/* Keep the preallocated memory for a new pool if required. */
-	if (*prealloc)
-		depot_keep_new_pool(prealloc);
-	return true;
+	list_del(&stack->free_list);
+	counters[DEPOT_COUNTER_FREELIST_SIZE]--;
+
+	return stack;
 }
 
 /* Allocates a new stack in a stack depot pool. */
@@ -382,20 +429,18 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
 {
 	struct stack_record *stack;
 
-	lockdep_assert_held_write(&pool_rwlock);
-
-	/* Update current and new pools if required and possible. */
-	if (!depot_update_pools(prealloc))
-		return NULL;
+	lockdep_assert_held(&pool_lock);
 
 	/* Check if we have a stack record to save the stack trace. */
-	if (list_empty(&free_stacks))
-		return NULL;
-
-	/* Get and unlink the first entry from the freelist. */
-	stack = list_first_entry(&free_stacks, struct stack_record, list);
-	list_del(&stack->list);
-	counters[DEPOT_COUNTER_FREELIST_SIZE]--;
+	stack = depot_pop_free();
+	if (!stack) {
+		/* No usable entries on the freelist - try to refill the freelist. */
+		if (!depot_try_init_pool(prealloc))
+			return NULL;
+		stack = depot_pop_free();
+		if (WARN_ON(!stack))
+			return NULL;
+	}
 
 	/* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
 	if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
@@ -421,37 +466,73 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
 
 static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
 {
+	const int pools_num_cached = READ_ONCE(pools_num);
 	union handle_parts parts = { .handle = handle };
 	void *pool;
 	size_t offset = parts.offset << DEPOT_STACK_ALIGN;
 	struct stack_record *stack;
 
-	lockdep_assert_held(&pool_rwlock);
+	lockdep_assert_not_held(&pool_lock);
 
-	if (parts.pool_index > pools_num) {
+	if (parts.pool_index > pools_num_cached) {
 		WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
-		     parts.pool_index, pools_num, handle);
+		     parts.pool_index, pools_num_cached, handle);
 		return NULL;
 	}
 
 	pool = stack_pools[parts.pool_index];
-	if (!pool)
+	if (WARN_ON(!pool))
 		return NULL;
 
 	stack = pool + offset;
+	if (WARN_ON(!refcount_read(&stack->count)))
+		return NULL;
+
 	return stack;
 }
 
 /* Links stack into the freelist. */
 static void depot_free_stack(struct stack_record *stack)
 {
-	lockdep_assert_held_write(&pool_rwlock);
+	unsigned long flags;
+
+	lockdep_assert_not_held(&pool_lock);
+
+	raw_spin_lock_irqsave(&pool_lock, flags);
+	printk_deferred_enter();
+
+	/*
+	 * Remove the entry from the hash list. Concurrent list traversal may
+	 * still observe the entry, but since the refcount is zero, this entry
+	 * will no longer be considered as valid.
+	 */
+	list_del_rcu(&stack->hash_list);
+
+	/*
+	 * Due to being used from constrained contexts such as the allocators,
+	 * NMI, or even RCU itself, stack depot cannot rely on primitives that
+	 * would sleep (such as synchronize_rcu()) or recursively call into
+	 * stack depot again (such as call_rcu()).
+	 *
+	 * Instead, get an RCU cookie, so that we can ensure this entry isn't
+	 * moved onto another list until the next grace period, and concurrent
+	 * RCU list traversal remains safe.
+	 */
+	stack->rcu_state = get_state_synchronize_rcu();
 
-	list_add(&stack->list, &free_stacks);
+	/*
+	 * Add the entry to the freelist tail, so that older entries are
+	 * considered first - their RCU cookie is more likely to no longer be
+	 * associated with the current grace period.
+	 */
+	list_add_tail(&stack->free_list, &free_stacks);
 
 	counters[DEPOT_COUNTER_FREELIST_SIZE]++;
 	counters[DEPOT_COUNTER_FREES]++;
 	counters[DEPOT_COUNTER_INUSE]--;
+
+	printk_deferred_exit();
+	raw_spin_unlock_irqrestore(&pool_lock, flags);
 }
 
 /* Calculates the hash for a stack. */
@@ -479,22 +560,65 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
 
 /* Finds a stack in a bucket of the hash table. */
 static inline struct stack_record *find_stack(struct list_head *bucket,
-					     unsigned long *entries, int size,
-					     u32 hash)
+					      unsigned long *entries, int size,
+					      u32 hash, depot_flags_t flags)
 {
-	struct list_head *pos;
-	struct stack_record *found;
+	struct stack_record *stack, *ret = NULL;
+
+	rcu_read_lock();
 
-	lockdep_assert_held(&pool_rwlock);
+	list_for_each_entry_rcu(stack, bucket, hash_list) {
+		if (stack->hash != hash || stack->size != size)
+			continue;
 
-	list_for_each(pos, bucket) {
-		found = list_entry(pos, struct stack_record, list);
-		if (found->hash == hash &&
-		    found->size == size &&
-		    !stackdepot_memcmp(entries, found->entries, size))
-			return found;
+		/*
+		 * This may race with depot_free_stack() accessing the freelist
+		 * management state unioned with @entries. The refcount is zero
+		 * in that case and the below refcount_inc_not_zero() will fail.
+		 */
+		if (data_race(stackdepot_memcmp(entries, stack->entries, size)))
+			continue;
+
+		/*
+		 * Try to increment refcount. If this succeeds, the stack record
+		 * is valid and has not yet been freed.
+		 *
+		 * If STACK_DEPOT_FLAG_GET is not used, it is undefined behavior
+		 * to then call stack_depot_put() later, and we can assume that
+		 * a stack record is never placed back on the freelist.
+		 */
+		if (flags & STACK_DEPOT_FLAG_GET) {
+			if (!refcount_inc_not_zero(&stack->count))
+				continue;
+			smp_mb__after_atomic();
+		} else {
+			/*
+			 * Pairs with the release implied by list_add_rcu() to
+			 * turn the list-pointer access into an acquire; as-is
+			 * it only provides dependency-ordering implied by
+			 * READ_ONCE().
+			 *
+			 * Normally this is not needed, if we were to continue
+			 * using the stack_record pointer only. But, the pointer
+			 * returned here is not actually used to lookup entries.
+			 * Instead, the handle is returned, from which a pointer
+			 * may then be reconstructed in depot_fetch_stack().
+			 *
+			 * Therefore, it is required to upgrade the ordering
+			 * from dependency-ordering only to at least acquire to
+			 * be able to use the handle as another reference to the
+			 * same stack record.
+			 */
+			smp_mb();
+		}
+
+		ret = stack;
+		break;
 	}
-	return NULL;
+
+	rcu_read_unlock();
+
+	return ret;
 }
 
 depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
@@ -508,7 +632,6 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	struct page *page = NULL;
 	void *prealloc = NULL;
 	bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
-	bool need_alloc = false;
 	unsigned long flags;
 	u32 hash;
 
@@ -531,31 +654,16 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	hash = hash_stack(entries, nr_entries);
 	bucket = &stack_table[hash & stack_hash_mask];
 
-	read_lock_irqsave(&pool_rwlock, flags);
-	printk_deferred_enter();
-
-	/* Fast path: look the stack trace up without full locking. */
-	found = find_stack(bucket, entries, nr_entries, hash);
-	if (found) {
-		if (depot_flags & STACK_DEPOT_FLAG_GET)
-			refcount_inc(&found->count);
-		printk_deferred_exit();
-		read_unlock_irqrestore(&pool_rwlock, flags);
+	/* Fast path: look the stack trace up without locking. */
+	found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
+	if (found)
 		goto exit;
-	}
-
-	/* Take note if another stack pool needs to be allocated. */
-	if (new_pool_required)
-		need_alloc = true;
-
-	printk_deferred_exit();
-	read_unlock_irqrestore(&pool_rwlock, flags);
 
 	/*
 	 * Allocate memory for a new pool if required now:
 	 * we won't be able to do that under the lock.
 	 */
-	if (unlikely(can_alloc && need_alloc)) {
+	if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
 		/*
 		 * Zero out zone modifiers, as we don't have specific zone
 		 * requirements. Keep the flags related to allocation in atomic
@@ -569,31 +677,36 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 			prealloc = page_address(page);
 	}
 
-	write_lock_irqsave(&pool_rwlock, flags);
+	raw_spin_lock_irqsave(&pool_lock, flags);
 	printk_deferred_enter();
 
-	found = find_stack(bucket, entries, nr_entries, hash);
+	/* Try to find again, to avoid concurrently inserting duplicates. */
+	found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
 	if (!found) {
 		struct stack_record *new =
 			depot_alloc_stack(entries, nr_entries, hash, &prealloc);
 
 		if (new) {
-			list_add(&new->list, bucket);
+			/*
+			 * This releases the stack record into the bucket and
+			 * makes it visible to readers in find_stack().
+			 */
+			list_add_rcu(&new->hash_list, bucket);
 			found = new;
 		}
-	} else {
-		if (depot_flags & STACK_DEPOT_FLAG_GET)
-			refcount_inc(&found->count);
+	}
+
+	if (prealloc) {
 		/*
-		 * Stack depot already contains this stack trace, but let's
-		 * keep the preallocated memory for future.
+		 * Either stack depot already contains this stack trace, or
+		 * depot_alloc_stack() did not consume the preallocated memory.
+		 * Try to keep the preallocated memory for future.
 		 */
-		if (prealloc)
-			depot_keep_new_pool(&prealloc);
+		depot_keep_new_pool(&prealloc);
 	}
 
 	printk_deferred_exit();
-	write_unlock_irqrestore(&pool_rwlock, flags);
+	raw_spin_unlock_irqrestore(&pool_lock, flags);
 exit:
 	if (prealloc) {
 		/* Stack depot didn't use this memory, free it. */
@@ -618,7 +731,6 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
 			       unsigned long **entries)
 {
 	struct stack_record *stack;
-	unsigned long flags;
 
 	*entries = NULL;
 	/*
@@ -630,13 +742,13 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
 	if (!handle || stack_depot_disabled)
 		return 0;
 
-	read_lock_irqsave(&pool_rwlock, flags);
-	printk_deferred_enter();
-
 	stack = depot_fetch_stack(handle);
-
-	printk_deferred_exit();
-	read_unlock_irqrestore(&pool_rwlock, flags);
+	/*
+	 * Should never be NULL, otherwise this is a use-after-put (or just a
+	 * corrupt handle).
+	 */
+	if (WARN(!stack, "corrupt handle or use after stack_depot_put()"))
+		return 0;
 
 	*entries = stack->entries;
 	return stack->size;
@@ -646,29 +758,20 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch);
 void stack_depot_put(depot_stack_handle_t handle)
 {
 	struct stack_record *stack;
-	unsigned long flags;
 
 	if (!handle || stack_depot_disabled)
 		return;
 
-	write_lock_irqsave(&pool_rwlock, flags);
-	printk_deferred_enter();
-
 	stack = depot_fetch_stack(handle);
-	if (WARN_ON(!stack))
-		goto out;
-
-	if (refcount_dec_and_test(&stack->count)) {
-		/* Unlink stack from the hash table. */
-		list_del(&stack->list);
+	/*
+	 * Should always be able to find the stack record, otherwise this is an
+	 * unbalanced put attempt (or corrupt handle).
+	 */
+	if (WARN(!stack, "corrupt handle or unbalanced stack_depot_put()"))
+		return;
 
-		/* Free stack. */
+	if (refcount_dec_and_test(&stack->count))
 		depot_free_stack(stack);
-	}
-
-out:
-	printk_deferred_exit();
-	write_unlock_irqrestore(&pool_rwlock, flags);
 }
 EXPORT_SYMBOL_GPL(stack_depot_put);

Andi Kleen Jan. 13, 2024, 1:24 a.m. UTC | #9

On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> +		/*
> +		 * Stack traces of size 0 are never saved, and we can simply use
> +		 * the size field as an indicator if this is a new unused stack
> +		 * record in the freelist.
> +		 */
> +		stack->size = 0;

I would use WRITE_ONCE here too, at least for TSan.

> +		return NULL;
> +
> +	/*
> +	 * We maintain the invariant that the elements in front are least
> +	 * recently used, and are therefore more likely to be associated with an
> +	 * RCU grace period in the past. Consequently it is sufficient to only
> +	 * check the first entry.
> +	 */
> +	stack = list_first_entry(&free_stacks, struct stack_record, free_list);
> +	if (stack->size && !poll_state_synchronize_rcu(stack->rcu_state))

READ_ONCE (also for TSan, and might be safer long term in case the
compiler considers some fancy code transformation)

> +		return NULL;
>  
> +		stack = depot_pop_free();
> +		if (WARN_ON(!stack))

Won't you get nesting problems here if this triggers due to the print?
I assume the nmi safe printk won't consider it like an NMI.

>  	counters[DEPOT_COUNTER_FREELIST_SIZE]++;
>  	counters[DEPOT_COUNTER_FREES]++;
>  	counters[DEPOT_COUNTER_INUSE]--;
> +
> +	printk_deferred_exit();

Ah this handles the WARN_ON? Should be ok then.

-Andi

Marco Elver Jan. 13, 2024, 9:12 a.m. UTC | #10

On Sat, 13 Jan 2024 at 02:24, Andi Kleen <ak@linux.intel.com> wrote:
>
> On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> > +             /*
> > +              * Stack traces of size 0 are never saved, and we can simply use
> > +              * the size field as an indicator if this is a new unused stack
> > +              * record in the freelist.
> > +              */
> > +             stack->size = 0;
>
> I would use WRITE_ONCE here too, at least for TSan.

This is written with the pool_lock held.

> > +             return NULL;
> > +
> > +     /*
> > +      * We maintain the invariant that the elements in front are least
> > +      * recently used, and are therefore more likely to be associated with an
> > +      * RCU grace period in the past. Consequently it is sufficient to only
> > +      * check the first entry.
> > +      */
> > +     stack = list_first_entry(&free_stacks, struct stack_record, free_list);
> > +     if (stack->size && !poll_state_synchronize_rcu(stack->rcu_state))
>
> READ_ONCE (also for TSan, and might be safer long term in case the
> compiler considers some fancy code transformation)

And this is also only read with the pool_lock held, so it's impossible
that there'd be a data race due to size. (And if there is a data race,
I'd want KCSAN to tell us because that'd be a bug then.)
depot_pop_free() can't be used w/o the lock because it's manipulating
the freelist.
To be sure, I'm adding a lockdep_assert_held() to depot_pop_free().

> > +             return NULL;
> >
> > +             stack = depot_pop_free();
> > +             if (WARN_ON(!stack))
>
> Won't you get nesting problems here if this triggers due to the print?
> I assume the nmi safe printk won't consider it like an NMI.
>
> >       counters[DEPOT_COUNTER_FREELIST_SIZE]++;
> >       counters[DEPOT_COUNTER_FREES]++;
> >       counters[DEPOT_COUNTER_INUSE]--;
> > +
> > +     printk_deferred_exit();
>
> Ah this handles the WARN_ON? Should be ok then.

Yes, the pool_lock critical sections are surrounded by printk_deferred.

Thanks,
-- Marco

Andi Kleen Jan. 13, 2024, 9:19 a.m. UTC | #11

On Sat, Jan 13, 2024 at 10:12:21AM +0100, Marco Elver wrote:
> On Sat, 13 Jan 2024 at 02:24, Andi Kleen <ak@linux.intel.com> wrote:
> >
> > On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> > > +             /*
> > > +              * Stack traces of size 0 are never saved, and we can simply use
> > > +              * the size field as an indicator if this is a new unused stack
> > > +              * record in the freelist.
> > > +              */
> > > +             stack->size = 0;
> >
> > I would use WRITE_ONCE here too, at least for TSan.
> 
> This is written with the pool_lock held.

...which doesn't help because the readers don't take it?

-Andi

Marco Elver Jan. 13, 2024, 9:23 a.m. UTC | #12

On Sat, 13 Jan 2024 at 10:19, Andi Kleen <ak@linux.intel.com> wrote:
>
> On Sat, Jan 13, 2024 at 10:12:21AM +0100, Marco Elver wrote:
> > On Sat, 13 Jan 2024 at 02:24, Andi Kleen <ak@linux.intel.com> wrote:
> > >
> > > On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> > > > +             /*
> > > > +              * Stack traces of size 0 are never saved, and we can simply use
> > > > +              * the size field as an indicator if this is a new unused stack
> > > > +              * record in the freelist.
> > > > +              */
> > > > +             stack->size = 0;
> > >
> > > I would use WRITE_ONCE here too, at least for TSan.
> >
> > This is written with the pool_lock held.
>
> ...which doesn't help because the readers don't take it?

This function is only refilling the freelist. Readers don't see it yet
because it's in none of the hash table buckets. The freelist is only
ever accessed under the lock.

Once an entry is allocated from the freelist, its size is overwritten
with something non-zero (since it then contains a stack trace). Those
updates are released into the right hash table bucket with
list_add_rcu() (which implies a release).

Am I missing something else?

Marco Elver Jan. 13, 2024, 9:30 a.m. UTC | #13

On Sat, 13 Jan 2024 at 10:23, Marco Elver <elver@google.com> wrote:
>
> On Sat, 13 Jan 2024 at 10:19, Andi Kleen <ak@linux.intel.com> wrote:
> >
> > On Sat, Jan 13, 2024 at 10:12:21AM +0100, Marco Elver wrote:
> > > On Sat, 13 Jan 2024 at 02:24, Andi Kleen <ak@linux.intel.com> wrote:
> > > >
> > > > On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> > > > > +             /*
> > > > > +              * Stack traces of size 0 are never saved, and we can simply use
> > > > > +              * the size field as an indicator if this is a new unused stack
> > > > > +              * record in the freelist.
> > > > > +              */
> > > > > +             stack->size = 0;
> > > >
> > > > I would use WRITE_ONCE here too, at least for TSan.
> > >
> > > This is written with the pool_lock held.
> >
> > ...which doesn't help because the readers don't take it?
>
> This function is only refilling the freelist. Readers don't see it yet
> because it's in none of the hash table buckets. The freelist is only
> ever accessed under the lock.
>
> Once an entry is allocated from the freelist, its size is overwritten
> with something non-zero (since it then contains a stack trace). Those
> updates are released into the right hash table bucket with
> list_add_rcu() (which implies a release).
>
> Am I missing something else?

FWIW, the current version (draft) of this can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/melver/linux.git/log/?h=kasan/dev
I'll send the 2 patches next week - they should apply cleanly on
current mainline.

Thanks,
-- Marco

Andi Kleen Jan. 13, 2024, 9:31 a.m. UTC | #14

> This function is only refilling the freelist. Readers don't see it yet
> because it's in none of the hash table buckets. The freelist is only
> ever accessed under the lock.
> 
> Once an entry is allocated from the freelist, its size is overwritten
> with something non-zero (since it then contains a stack trace). Those
> updates are released into the right hash table bucket with
> list_add_rcu() (which implies a release).
> 
> Am I missing something else?

It's probably ok semantically here, but at least I would be consistent with
using the macro for a specific field.

-Andi

Breno Leitao Jan. 24, 2024, 2:15 p.m. UTC | #15

Hello Andrey,

On Mon, Nov 20, 2023 at 06:47:10PM +0100, andrey.konovalov@linux.dev wrote:
> From: Andrey Konovalov <andreyknvl@google.com>
> 
> Currently, stack depot uses the following locking scheme:
> 
> 1. Lock-free accesses when looking up a stack record, which allows to
>    have multiple users to look up records in parallel;
> 2. Spinlock for protecting the stack depot pools and the hash table
>    when adding a new record.
> 
> For implementing the eviction of stack traces from stack depot, the
> lock-free approach is not going to work anymore, as we will need to be
> able to also remove records from the hash table.
> 
> Convert the spinlock into a read/write lock, and drop the atomic accesses,
> as they are no longer required.
> 
> Looking up stack traces is now protected by the read lock and adding new
> records - by the write lock. One of the following patches will add a new
> function for evicting stack records, which will be protected by the write
> lock as well.
> 
> With this change, multiple users can still look up records in parallel.
> 
> This is preparatory patch for implementing the eviction of stack records
> from the stack depot.

I am testing quite recent "debug" kernel (with KASAN, Lockdep, etc
enabled). This kernel is based on
9f8413c4a66f2fb776d3dc3c9ed20bf435eb305e, and I found the following
lockdep DEADLOCK splat recently. Before investigating further, I would
like to share here to check if this is a known problem:

	 ======================================================
	 WARNING: possible circular locking dependency detected
	 6.7.0-0_fbk700_debug_kbuilder_0_g9f8413c4a66f #1 Tainted: G S          EL   N
	 ------------------------------------------------------
	 btrfs-transacti/566 is trying to acquire lock:
	 ffffffff84563058 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock (kernel/locking/semaphore.c:135) 
 
	 but task is already holding lock:
	 ffffffff850a2a70 (pool_lock#2){-.-.}-{2:2}, at: __stack_depot_save (lib/stackdepot.c:415) 
	 which lock already depends on the new lock.

	 the existing dependency chain (in reverse order) is:

	 -> #3 (pool_lock#2){-.-.}-{2:2}:
	 lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753) 
	 _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:110 kernel/locking/spinlock.c:162) 
	 __stack_depot_save (lib/stackdepot.c:415) 
	 kasan_save_stack (mm/kasan/common.c:46) 
	 __kasan_record_aux_stack (mm/kasan/generic.c:492) 
	 task_work_add (kernel/task_work.c:44) 
	 scheduler_tick (kernel/sched/core.c:5677) 
	 update_process_times (kernel/time/timer.c:2070 kernel/time/timer.c:2086) 
	 tick_nohz_highres_handler (kernel/time/tick-sched.c:? kernel/time/tick-sched.c:1512) 
	 __hrtimer_run_queues (kernel/time/hrtimer.c:484 kernel/time/hrtimer.c:1656 kernel/time/hrtimer.c:1752) 
	 hrtimer_interrupt (kernel/time/hrtimer.c:? kernel/time/hrtimer.c:1796) 
	 __sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052 arch/x86/kernel/apic/apic.c:1082) 
	 sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076) 
	 asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649) 
	 _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) 
	 debug_object_activate (lib/debugobjects.c:? lib/debugobjects.c:732) 
	 call_rcu (kernel/rcu/rcu.h:227 kernel/rcu/tree.c:2666 kernel/rcu/tree.c:2795) 
	 mas_wr_modify (lib/maple_tree.c:3375 lib/maple_tree.c:3482 lib/maple_tree.c:4198 lib/maple_tree.c:4236) 
	 mas_wr_store_entry (lib/maple_tree.c:1389 lib/maple_tree.c:4250) 
	 mas_store_prealloc (lib/maple_tree.c:5364 lib/maple_tree.c:5458) 
	 vma_complete (mm/mmap.c:523) 
	 __split_vma (include/linux/mmap_lock.h:71 include/linux/mm.h:694 include/linux/mm.h:713 mm/mmap.c:2399) 
	 do_vmi_align_munmap (include/linux/maple_tree.h:621 mm/mmap.c:2547) 
	 do_vmi_munmap (mm/mmap.c:2729) 
	 __vm_munmap (mm/mmap.c:3010) 
	 elf_load (include/linux/instrumented.h:68 include/asm-generic/bitops/instrumented-non-atomic.h:141 include/linux/thread_info.h:118 fs/binfmt_elf.c:382 fs/binfmt_elf.c:408) 
	 load_elf_binary (fs/binfmt_elf.c:?) 
	 bprm_execve (fs/exec.c:? fs/exec.c:1854) 
	 kernel_execve (fs/exec.c:2009) 
	 kernel_init (init/main.c:1468) 
	 ret_from_fork (arch/x86/kernel/process.c:153) 
	 ret_from_fork_asm (arch/x86/entry/entry_64.S:250) 

	-> #2 (&rq->__lock){-.-.}-{2:2}:
	 lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753) 
	 _raw_spin_lock_nested (kernel/locking/spinlock.c:378) 
	 raw_spin_rq_lock_nested (kernel/sched/core.c:560) 
	 task_fork_fair (kernel/sched/fair.c:12644) 
	 sched_cgroup_fork (kernel/sched/sched.h:2024 kernel/sched/sched.h:2047 kernel/sched/core.c:4833) 
	 copy_process (include/linux/instrumented.h:82 include/linux/atomic/atomic-instrumented.h:67 include/linux/refcount.h:136 kernel/fork.c:1868 kernel/fork.c:2499) 
	 kernel_clone (kernel/fork.c:2898) 
	 user_mode_thread (kernel/fork.c:2977) 
	 rest_init (init/main.c:695) 
	 arch_call_rest_init+0xf/0x10 
	 start_kernel (init/main.c:1011) 
	 x86_64_start_reservations (??:?) 
	 x86_64_start_kernel (??:?) 
	 secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:461) 

	-> #1 (&p->pi_lock){-.-.}-{2:2}:
	 lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753) 
	 _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:110 kernel/locking/spinlock.c:162) 
	 try_to_wake_up (include/linux/spinlock.h:? kernel/sched/core.c:4252) 
	 up (kernel/locking/semaphore.c:189) 
	 console_unlock+0xcb/0x1e0a <- releases console semaphore 
	 vprintk_emit (arch/x86/include/asm/irqflags.h:? arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 kernel/printk/printk.c:1968 kernel/printk/printk.c:2302) 
	 _printk (kernel/printk/printk.c:?) 
	 _fat_msg (fs/fat/misc.c:?) 
	 fat_fill_super (fs/fat/inode.c:? fs/fat/inode.c:1646) 
	 mount_bdev (fs/super.c:?) 
	 legacy_get_tree (fs/fs_context.c:662) 
	 vfs_get_tree (fs/super.c:1784) 
	 path_mount (fs/namespace.c:? fs/namespace.c:3662) 
	 __se_sys_mount (fs/namespace.c:? fs/namespace.c:3882 fs/namespace.c:3864) 
	 do_syscall_64 (arch/x86/entry/common.c:?) 
	 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) 

	-> #0 ((console_sem).lock){-.-.}-{2:2}:
	 check_prevs_add (kernel/locking/lockdep.c:223 kernel/locking/lockdep.c:3098 kernel/locking/lockdep.c:3253) 
	 __lock_acquire+0x2399/0x24f0 * Trying to get the sem_console 
	 lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753) 
	 _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:110 kernel/locking/spinlock.c:162) 
	 down_trylock (kernel/locking/semaphore.c:135) 
	 __down_trylock_console_sem (kernel/printk/printk.c:322) 
	 vprintk_emit (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/atomic/atomic-instrumented.h:33 kernel/printk/printk.c:2621 kernel/printk/printk.c:2657 kernel/printk/printk.c:1923 kernel/printk/printk.c:2302) 
	 _printk (kernel/printk/printk.c:?) 
	 __warn_printk (include/linux/context_tracking.h:131 include/linux/context_tracking.h:145 kernel/panic.c:718) 
	 __stack_depot_save+0x685/0x690 * <- Got the *pool* lock 
	 kasan_set_track (mm/kasan/common.c:52) 
	 __kasan_slab_alloc (mm/kasan/common.c:331) 
	 kmem_cache_alloc (include/linux/kasan.h:188 mm/slab.h:763 mm/slub.c:3478 mm/slub.c:3486 mm/slub.c:3493 mm/slub.c:3502) 
	 btrfs_add_delayed_tree_ref (fs/btrfs/delayed-ref.c:1027) 
	 btrfs_free_tree_block (fs/btrfs/delayed-ref.h:316 fs/btrfs/extent-tree.c:3452) 
	 btrfs_force_cow_block (include/asm-generic/unaligned.h:52 fs/btrfs/accessors.h:681 fs/btrfs/accessors.h:721 fs/btrfs/ctree.c:574) 
	 btrfs_cow_block (include/asm-generic/unaligned.h:37 fs/btrfs/accessors.h:678 fs/btrfs/ctree.c:678 fs/btrfs/ctree.c:727) 
	 balance_level (fs/btrfs/ctree.c:960) 
	 btrfs_search_slot (fs/btrfs/ctree.c:2097) 
	 lookup_inline_extent_backref (fs/btrfs/accessors.h:371 fs/btrfs/extent-tree.c:814) 
	 __btrfs_free_extent (fs/btrfs/extent-tree.c:3113) 
	 __btrfs_run_delayed_refs (include/linux/instrumented.h:96 include/linux/atomic/atomic-instrumented.h:592 fs/btrfs/extent-tree.c:2045 fs/btrfs/extent-tree.c:2134) 
	 btrfs_run_delayed_refs (include/linux/instrumented.h:68 include/asm-generic/bitops/instrumented-non-atomic.h:141 fs/btrfs/extent-tree.c:2238) 
	 btrfs_commit_transaction (fs/btrfs/transaction.c:2217) 
	 transaction_kthread (fs/btrfs/disk-io.c:1558) 
	 kthread (kernel/kthread.c:373) 
	 ret_from_fork (arch/x86/kernel/process.c:153) 
	 ret_from_fork_asm (arch/x86/entry/entry_64.S:250) 

	other info that might help us debug this:

	 Chain exists of:
	(console_sem).lock --> &rq->__lock --> pool_lock#2

	  Possible unsafe locking scenario:

		CPU0                    CPU1
		----                    ----
	   lock(pool_lock#2);
					lock(&rq->__lock);
					lock(pool_lock#2);
	   lock((console_sem).lock);

	*** DEADLOCK ***

	 11 locks held by btrfs-transacti/566:
	 #0: ffff8883221147d8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread (fs/btrfs/disk-io.c:?) 
	 #1: ffff888322116390 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction (include/linux/spinlock.h:? fs/btrfs/transaction.c:285) 
	 #2: ffff8883221163b8 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction (include/linux/spinlock.h:? fs/btrfs/transaction.c:285) 
	 #3: ffff8883221163e0 (btrfs_trans_commit_prep){.+.+}-{0:0}, at: btrfs_commit_transaction (fs/btrfs/transaction.c:2205) 
	 #4: ffff8889220f2240 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_delayed_ref_lock (include/linux/lockdep.h:288 fs/btrfs/delayed-ref.c:503) 
	 #5: ffff8885b372a980 (btrfs-extent-02){++++}-{3:3}, at: btrfs_lock_root_node (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211 fs/btrfs/locking.c:217 fs/btrfs/locking.c:270) 
	 #6: ffff8888635b5538 (btrfs-extent-01){++++}-{3:3}, at: btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211 fs/btrfs/locking.c:217) 
	 #7: ffff8885df3e2d30 (btrfs-extent-01/2){+.+.}-{3:3}, at: __btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211) 
	 #8: ffff88832bb33490 (btrfs-extent-01/3){+.+.}-{3:3}, at: __btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211) 
	 #9: ffff888708195c98 (btrfs-extent-01/5){+.+.}-{3:3}, at: __btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211) 
	 #10: ffffffff850a2a70 (pool_lock#2){-.-.}-{2:2}, at: __stack_depot_save (lib/stackdepot.c:415)

Marco Elver Jan. 24, 2024, 2:21 p.m. UTC | #16

On Wed, 24 Jan 2024 at 15:16, Breno Leitao <leitao@debian.org> wrote:
>
> Hello Andrey,
>
> On Mon, Nov 20, 2023 at 06:47:10PM +0100, andrey.konovalov@linux.dev wrote:
> > From: Andrey Konovalov <andreyknvl@google.com>
> >
> > Currently, stack depot uses the following locking scheme:
> >
> > 1. Lock-free accesses when looking up a stack record, which allows to
> >    have multiple users to look up records in parallel;
> > 2. Spinlock for protecting the stack depot pools and the hash table
> >    when adding a new record.
> >
> > For implementing the eviction of stack traces from stack depot, the
> > lock-free approach is not going to work anymore, as we will need to be
> > able to also remove records from the hash table.
> >
> > Convert the spinlock into a read/write lock, and drop the atomic accesses,
> > as they are no longer required.
> >
> > Looking up stack traces is now protected by the read lock and adding new
> > records - by the write lock. One of the following patches will add a new
> > function for evicting stack records, which will be protected by the write
> > lock as well.
> >
> > With this change, multiple users can still look up records in parallel.
> >
> > This is preparatory patch for implementing the eviction of stack records
> > from the stack depot.
>
> I am testing quite recent "debug" kernel (with KASAN, Lockdep, etc
> enabled). This kernel is based on
> 9f8413c4a66f2fb776d3dc3c9ed20bf435eb305e, and I found the following

This version predates this series, as far as I can tell. Can you try linux-next?

Thanks,
-- Marco

Breno Leitao Jan. 24, 2024, 4:24 p.m. UTC | #17

On Wed, Jan 24, 2024 at 03:21:26PM +0100, Marco Elver wrote:
> On Wed, 24 Jan 2024 at 15:16, Breno Leitao <leitao@debian.org> wrote:
> >
> > Hello Andrey,
> >
> > On Mon, Nov 20, 2023 at 06:47:10PM +0100, andrey.konovalov@linux.dev wrote:
> > > From: Andrey Konovalov <andreyknvl@google.com>
> > >
> > > Currently, stack depot uses the following locking scheme:
> > >
> > > 1. Lock-free accesses when looking up a stack record, which allows to
> > >    have multiple users to look up records in parallel;
> > > 2. Spinlock for protecting the stack depot pools and the hash table
> > >    when adding a new record.
> > >
> > > For implementing the eviction of stack traces from stack depot, the
> > > lock-free approach is not going to work anymore, as we will need to be
> > > able to also remove records from the hash table.
> > >
> > > Convert the spinlock into a read/write lock, and drop the atomic accesses,
> > > as they are no longer required.
> > >
> > > Looking up stack traces is now protected by the read lock and adding new
> > > records - by the write lock. One of the following patches will add a new
> > > function for evicting stack records, which will be protected by the write
> > > lock as well.
> > >
> > > With this change, multiple users can still look up records in parallel.
> > >
> > > This is preparatory patch for implementing the eviction of stack records
> > > from the stack depot.
> >
> > I am testing quite recent "debug" kernel (with KASAN, Lockdep, etc
> > enabled). This kernel is based on
> > 9f8413c4a66f2fb776d3dc3c9ed20bf435eb305e, and I found the following
> 
> This version predates this series, as far as I can tell. Can you try linux-next?

That is true. I will retest and let you know if it is still
reproducible.

Thanks.

[v4,12/22] lib/stackdepot: use read/write lock

Commit Message

Comments

Patch