diff mbox series

[v4,2/2] mm: slub: call WARN() when the slab detect an error

Message ID 20250226081206.680495-3-hyesoo.yu@samsung.com (mailing list archive)
State New
Headers show
Series mm: slub: Enhanced debugging in slub error | expand

Commit Message

Hyesoo Yu Feb. 26, 2025, 8:12 a.m. UTC
If a slab object is corrupted or an error occurs in its internal
value, continuing after restoration may cause other side effects.
At this point, it is difficult to debug because the problem occurred
in the past. It is useful to use WARN() to catch errors at the point
of issue because WARN() could trigger panic for system debugging when
panic_on_warn is enabled. WARN() is added where to detect the error
on slab_err and object_err.

It makes sense to only do the WARN() after printing the logs. slab_err
is splited to __slab_err that calls the WARN() and it is called after
printing logs.

Changes in v4:
- Remove WARN() in kmem_cache_destroy to remove redundant warning.

Changes in v3:
- move the WARN from slab_fix to slab_err, object_err and check_obj to
use WARN on all error reporting paths.

Changes in v2:
- Replace direct calling with BUG_ON with the use of WARN in slab_fix.

Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
---
 mm/slab_common.c |  3 ---
 mm/slub.c        | 31 +++++++++++++++++++------------
 2 files changed, 19 insertions(+), 15 deletions(-)

Comments

Harry Yoo Feb. 27, 2025, 12:55 p.m. UTC | #1
On Wed, Feb 26, 2025 at 05:12:01PM +0900, Hyesoo Yu wrote:
> If a slab object is corrupted or an error occurs in its internal
> value, continuing after restoration may cause other side effects.
> At this point, it is difficult to debug because the problem occurred
> in the past. It is useful to use WARN() to catch errors at the point
> of issue because WARN() could trigger panic for system debugging when
> panic_on_warn is enabled. WARN() is added where to detect the error
> on slab_err and object_err.
> 
> It makes sense to only do the WARN() after printing the logs. slab_err
> is splited to __slab_err that calls the WARN() and it is called after
> printing logs.
> 
> Changes in v4:
> - Remove WARN() in kmem_cache_destroy to remove redundant warning.
> 
> Changes in v3:
> - move the WARN from slab_fix to slab_err, object_err and check_obj to
> use WARN on all error reporting paths.
> 
> Changes in v2:
> - Replace direct calling with BUG_ON with the use of WARN in slab_fix.

Same here, please rephrase the changelog without "Changes in vN" in the
changelog, and move the patch version changes below "---" line.

> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> ---

Otherwise this change in general looks good to me (with a suggestion below).
Please feel free to add:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

# Suggestion

There's a case where SLUB just calls pr_err() and dump_stack() instead of
slab_err() or object_err() in free_consistency_checks().

I would like to add something like this:

diff --git a/mm/slub.c b/mm/slub.c
index b7660ee85db0..d5609fa63da4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1022,12 +1022,16 @@ static void slab_bug(struct kmem_cache *s, char *fmt, ...)
 {
 	struct va_format vaf;
 	va_list args;
+	const char *name = "<unknown>";
+
+	if (s)
+		name = s->name;
 
 	va_start(args, fmt);
 	vaf.fmt = fmt;
 	vaf.va = &args;
 	pr_err("=============================================================================\n");
-	pr_err("BUG %s (%s): %pV\n", s->name, print_tainted(), &vaf);
+	pr_err("BUG %s (%s): %pV\n", name, print_tainted(), &vaf);
 	pr_err("-----------------------------------------------------------------------------\n\n");
 	va_end(args);
 }
@@ -1628,9 +1632,8 @@ static inline int free_consistency_checks(struct kmem_cache *s,
 			slab_err(s, slab, "Attempt to free object(0x%p) outside of slab",
 				 object);
 		} else if (!slab->slab_cache) {
-			pr_err("SLUB <none>: no slab for object 0x%p.\n",
-			       object);
-			dump_stack();
+			slab_err(NULL, slab, "No slab for object 0x%p",
+				 object);
 		} else
 			object_err(s, slab, object,
 					"page slab pointer corrupt.");
Vlastimil Babka Feb. 27, 2025, 2:38 p.m. UTC | #2
On 2/26/25 09:12, Hyesoo Yu wrote:
> If a slab object is corrupted or an error occurs in its internal
> value, continuing after restoration may cause other side effects.
> At this point, it is difficult to debug because the problem occurred
> in the past. It is useful to use WARN() to catch errors at the point
> of issue because WARN() could trigger panic for system debugging when
> panic_on_warn is enabled. WARN() is added where to detect the error
> on slab_err and object_err.
> 
> It makes sense to only do the WARN() after printing the logs. slab_err
> is splited to __slab_err that calls the WARN() and it is called after
> printing logs.
> 
> Changes in v4:
> - Remove WARN() in kmem_cache_destroy to remove redundant warning.
> 
> Changes in v3:
> - move the WARN from slab_fix to slab_err, object_err and check_obj to
> use WARN on all error reporting paths.
> 
> Changes in v2:
> - Replace direct calling with BUG_ON with the use of WARN in slab_fix.
> 
> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>

As Harry said. I'll remove that locally.

> ---
>  mm/slab_common.c |  3 ---
>  mm/slub.c        | 31 +++++++++++++++++++------------
>  2 files changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 477fa471da18..d13f4ffe252b 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -517,9 +517,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
>  	kasan_cache_shutdown(s);
>  
>  	err = __kmem_cache_shutdown(s);
> -	if (!slab_in_kunit_test())
> -		WARN(err, "%s %s: Slab cache still has objects when called from %pS",
> -		     __func__, s->name, (void *)_RET_IP_);

I think I'll keep this one, because the more detailed warning via
list_slab_objects() is only enabled with CONFIG_SLUB_DEBUG.
If it's not enabled, the kmem_cache_destroy() failure rather should not be
silent.
So slab_in_kunit_test() would also stay.

>  		} else {
> -			list_slab_objects(s, slab,
> -			  "Objects remaining in %s on __kmem_cache_shutdown()");
> +			list_slab_objects(s, slab);

I tried to extract slab_bug() and __slab_err() from list_slab_objects() but
they were also only available with CONFIG_SLUB_DEBUG.

Perhaps we can improve that, but as a follow-up cleanup so we don't hold
this up further.

>  		}
>  	}
>  	spin_unlock_irq(&n->list_lock);
Vlastimil Babka Feb. 27, 2025, 3:18 p.m. UTC | #3
On 2/27/25 13:55, Harry Yoo wrote:
> On Wed, Feb 26, 2025 at 05:12:01PM +0900, Hyesoo Yu wrote:
>> If a slab object is corrupted or an error occurs in its internal
>> value, continuing after restoration may cause other side effects.
>> At this point, it is difficult to debug because the problem occurred
>> in the past. It is useful to use WARN() to catch errors at the point
>> of issue because WARN() could trigger panic for system debugging when
>> panic_on_warn is enabled. WARN() is added where to detect the error
>> on slab_err and object_err.
>> 
>> It makes sense to only do the WARN() after printing the logs. slab_err
>> is splited to __slab_err that calls the WARN() and it is called after
>> printing logs.
>> 
>> Changes in v4:
>> - Remove WARN() in kmem_cache_destroy to remove redundant warning.
>> 
>> Changes in v3:
>> - move the WARN from slab_fix to slab_err, object_err and check_obj to
>> use WARN on all error reporting paths.
>> 
>> Changes in v2:
>> - Replace direct calling with BUG_ON with the use of WARN in slab_fix.
> 
> Same here, please rephrase the changelog without "Changes in vN" in the
> changelog, and move the patch version changes below "---" line.
> 
>> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
>> ---
> 
> Otherwise this change in general looks good to me (with a suggestion below).
> Please feel free to add:
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> # Suggestion
> 
> There's a case where SLUB just calls pr_err() and dump_stack() instead of
> slab_err() or object_err() in free_consistency_checks().
> 
> I would like to add something like this:
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b7660ee85db0..d5609fa63da4 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1022,12 +1022,16 @@ static void slab_bug(struct kmem_cache *s, char *fmt, ...)
>  {
>  	struct va_format vaf;
>  	va_list args;
> +	const char *name = "<unknown>";
> +
> +	if (s)
> +		name = s->name;
>  
>  	va_start(args, fmt);
>  	vaf.fmt = fmt;
>  	vaf.va = &args;
>  	pr_err("=============================================================================\n");
> -	pr_err("BUG %s (%s): %pV\n", s->name, print_tainted(), &vaf);
> +	pr_err("BUG %s (%s): %pV\n", name, print_tainted(), &vaf);

				     s ? s->name : "<unknown>"

more concise

>  	pr_err("-----------------------------------------------------------------------------\n\n");
>  	va_end(args);
>  }
> @@ -1628,9 +1632,8 @@ static inline int free_consistency_checks(struct kmem_cache *s,
>  			slab_err(s, slab, "Attempt to free object(0x%p) outside of slab",
>  				 object);
>  		} else if (!slab->slab_cache) {
> -			pr_err("SLUB <none>: no slab for object 0x%p.\n",
> -			       object);
> -			dump_stack();
> +			slab_err(NULL, slab, "No slab for object 0x%p",
> +				 object);

Good suggestion, added that locally. Probably not likely to trigger as a
use-after-free would mean we're rather hit !folio_test_slab() above than a
folio that has a slab flag but has a NULL pointer (or the pointer might be
garbage and not NULL). But at least the error handling will be consistent.
Changed the error message to "No slab cache" as that's more accurate.
Thanks.

>  		} else
>  			object_err(s, slab, object,
>  					"page slab pointer corrupt.");
>
diff mbox series

Patch

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 477fa471da18..d13f4ffe252b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -517,9 +517,6 @@  void kmem_cache_destroy(struct kmem_cache *s)
 	kasan_cache_shutdown(s);
 
 	err = __kmem_cache_shutdown(s);
-	if (!slab_in_kunit_test())
-		WARN(err, "%s %s: Slab cache still has objects when called from %pS",
-		     __func__, s->name, (void *)_RET_IP_);
 
 	list_del(&s->list);
 
diff --git a/mm/slub.c b/mm/slub.c
index 8c13cd43c0fd..4961eeccf3ad 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1096,8 +1096,6 @@  static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
 		/* Beginning of the filler is the free pointer */
 		print_section(KERN_ERR, "Padding  ", p + off,
 			      size_from_object(s) - off);
-
-	dump_stack();
 }
 
 static void object_err(struct kmem_cache *s, struct slab *slab,
@@ -1109,6 +1107,8 @@  static void object_err(struct kmem_cache *s, struct slab *slab,
 	slab_bug(s, "%s", reason);
 	print_trailer(s, slab, object);
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+
+	WARN_ON(1);
 }
 
 static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
@@ -1125,6 +1125,14 @@  static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
 	return false;
 }
 
+static void __slab_err(struct slab *slab)
+{
+	print_slab_info(slab);
+	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+
+	WARN_ON(1);
+}
+
 static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
 			const char *fmt, ...)
 {
@@ -1138,9 +1146,7 @@  static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 	slab_bug(s, "%s", buf);
-	print_slab_info(slab);
-	dump_stack();
-	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+	__slab_err(slab);
 }
 
 static void init_object(struct kmem_cache *s, void *object, u8 val)
@@ -1313,9 +1319,10 @@  slab_pad_check(struct kmem_cache *s, struct slab *slab)
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
-	slab_err(s, slab, "Padding overwritten. 0x%p-0x%p @offset=%tu",
-			fault, end - 1, fault - start);
+	slab_bug(s, "Padding overwritten. 0x%p-0x%p @offset=%tu",
+		 fault, end - 1, fault - start);
 	print_section(KERN_ERR, "Padding ", pad, remainder);
+	__slab_err(slab);
 
 	restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
 }
@@ -5428,14 +5435,13 @@  static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
 	return !!oo_objects(s->oo);
 }
 
-static void list_slab_objects(struct kmem_cache *s, struct slab *slab,
-			      const char *text)
+static void list_slab_objects(struct kmem_cache *s, struct slab *slab)
 {
 #ifdef CONFIG_SLUB_DEBUG
 	void *addr = slab_address(slab);
 	void *p;
 
-	slab_err(s, slab, text, s->name);
+	slab_bug(s, "Objects remaining on __kmem_cache_shutdown()");
 
 	spin_lock(&object_map_lock);
 	__fill_map(object_map, s, slab);
@@ -5450,6 +5456,8 @@  static void list_slab_objects(struct kmem_cache *s, struct slab *slab,
 		}
 	}
 	spin_unlock(&object_map_lock);
+
+	__slab_err(slab);
 #endif
 }
 
@@ -5470,8 +5478,7 @@  static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 			remove_partial(n, slab);
 			list_add(&slab->slab_list, &discard);
 		} else {
-			list_slab_objects(s, slab,
-			  "Objects remaining in %s on __kmem_cache_shutdown()");
+			list_slab_objects(s, slab);
 		}
 	}
 	spin_unlock_irq(&n->list_lock);