[v1] mm/slub: enable debugging memory wasting of kmalloc

Message ID 20220701135954.45045-1-feng.tang@intel.com (mailing list archive)
State New
Series [v1] mm/slub: enable debugging memory wasting of kmalloc

Commit Message

Feng Tang July 1, 2022, 1:59 p.m. UTC
kmalloc's API family is critical for mm, with one shortcoming: its
object size is fixed to a power of 2. When a user requests memory for
'2^n + 1' bytes, 2^(n+1) bytes will actually be allocated, so in the
worst case around 50% of the memory is wasted.
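
As an illustration of the rounding only (a small userspace sketch, not
kernel code; the real kmalloc also has 96 and 192 byte caches, which
are ignored here):

#include <stdio.h>

/* Round a request up to the next power of two, mimicking how kmalloc()
 * picks a power-of-2 size class for a request.
 */
static unsigned long kmalloc_bucket(unsigned long req)
{
	unsigned long size = 8;

	while (size < req)
		size <<= 1;
	return size;
}

int main(void)
{
	unsigned long req = 1032;	/* the iova_magazine case below */
	unsigned long got = kmalloc_bucket(req);

	printf("request=%lu allocated=%lu wasted=%lu\n",
	       req, got, got - req);	/* prints: 1032 2048 1016 */
	return 0;
}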

We hit a kernel boot OOM panic (v5.10), and the dumped slab info showed:

    [   26.062145] kmalloc-2k            814056KB     814056KB

From debugging we found a huge number of 'struct iova_magazine'
objects, whose size is 1032 bytes (1024 + 8), so each allocation wastes
1016 bytes. Though the issue was solved by providing the right (bigger)
amount of RAM, it is still worth optimizing the size (either use a
kmalloc-friendly size or create a dedicated slab for it).
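
For reference, a dedicated cache for such an odd-sized object would
look roughly like the sketch below (the struct layout is only
illustrative, assuming a 64-bit 'unsigned long'; names are made up):

#include <linux/errno.h>
#include <linux/slab.h>

/* Illustrative stand-in for a 1032-byte object (8 + 128 * 8 on 64-bit) */
struct magazine_like {
	unsigned long size;
	unsigned long pfns[128];
};

static struct kmem_cache *magazine_cachep;

/* Objects then come from a cache sized exactly for them instead of
 * being rounded up to kmalloc-2k.
 */
static int magazine_cache_init(void)
{
	magazine_cachep = kmem_cache_create("magazine_like",
					    sizeof(struct magazine_like),
					    0, 0, NULL);
	return magazine_cachep ? 0 : -ENOMEM;
}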

And from the lkml archive, there was another crash kernel OOM case [1]
back in 2019, which seems to be related to a similar slab waste
situation, as the log looks alike:

    [    4.332648] iommu: Adding device 0000:20:02.0 to group 16
    [    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
    ...
    [    4.857565] kmalloc-2048           59164KB      59164KB

The crash kernel only has 256MB of memory, and 59MB is pretty big here.
(Note: the related code has been changed and optimised in recent
kernels [2]; these logs are quoted just to demonstrate the problem.)

So add a way to track each kmalloc's memory waste, and leverage the
existing SLUB debug framework to show its call stack, so that users can
evaluate the waste situation, identify hot spots and optimize
accordingly, for better memory utilization.

The waste info is integrated into the existing interface
/sys/kernel/debug/slab/kmalloc-xx/alloc_traces; one example for
'kmalloc-4k' after boot is:

126 ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe] waste=233856/1856 age=1493302/1493830/1494358 pid=1284 cpus=32 nodes=1
        __slab_alloc.isra.86+0x52/0x80
        __kmalloc_node+0x143/0x350
        ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe]
        ixgbe_init_interrupt_scheme+0x1a6/0x730 [ixgbe]
        ixgbe_probe+0xc8e/0x10d0 [ixgbe]
        local_pci_probe+0x42/0x80
        work_for_cpu_fn+0x13/0x20
        process_one_work+0x1c5/0x390

which means that in the 'kmalloc-4k' slab there are 126 requests of
2240 bytes which each got a 4KB slot (wasting 1856 bytes each and
233856 bytes in total). And when the system starts some real workload
like multiple docker instances, the waste becomes more severe.

[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Changelog:

  since RFC
    * fix problems in kmem_cache_alloc_bulk() and records sorting,
      improve the print format (Hyeonggon Yoo)
    * fix a compiling issue found by 0Day bot
    * update the commit log based on info from iova developers

 mm/slub.c | 52 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 17 deletions(-)

Comments

Christoph Lameter July 1, 2022, 2:37 p.m. UTC | #1
On Fri, 1 Jul 2022, Feng Tang wrote:

>  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> -			  unsigned long addr, struct kmem_cache_cpu *c)
> +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
>  {

It would be good to avoid expanding the basic slab handling functions for
kmalloc. Can we restrict the mods to the kmalloc related functions?
Feng Tang July 1, 2022, 3:04 p.m. UTC | #2
Hi Christoph,

On Fri, Jul 01, 2022 at 04:37:00PM +0200, Christoph Lameter wrote:
> On Fri, 1 Jul 2022, Feng Tang wrote:
> 
> >  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > -			  unsigned long addr, struct kmem_cache_cpu *c)
> > +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> >  {
> 
> It would be good to avoid expanding the basic slab handling functions for
> kmalloc. Can we restrict the mods to the kmalloc related functions?

Yes, this is the part that concerned me. I tried but haven't figured
out a way.

I started implementing it several months ago, and got stuck hacking
several kmalloc APIs, e.g. calling dump_stack() when the waste is over
1/4 of the object_size of the kmalloc_caches[][].
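
(Roughly like the following in the kmalloc path, 'size' being the
request and 's' the chosen cache; just a reconstruction of that hack,
not the actual code:)

	/* complain when more than 1/4 of the object would be wasted */
	if (unlikely(size < s->object_size - s->object_size / 4)) {
		pr_info("kmalloc waste: %zu bytes requested from a %u bytes cache\n",
			size, s->object_size);
		dump_stack();
	}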

Then I found one central API which has all the needed info (object_size
& orig_size) where we can yell about the waste:

static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
                gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)

which I thought could still be hacky, as it couldn't reuse the existing
'alloc_traces', which already has the count/call-stack info. The
current solution does leverage it, at the cost of adding 'orig_size'
parameters, but I don't know how else to pass the 'waste' info through,
as track/location is at the lowest level.

Thanks,
Feng
Hyeonggon Yoo July 3, 2022, 2:17 p.m. UTC | #3
On Fri, Jul 01, 2022 at 11:04:51PM +0800, Feng Tang wrote:
> Hi Christoph,
> 
> On Fri, Jul 01, 2022 at 04:37:00PM +0200, Christoph Lameter wrote:
> > On Fri, 1 Jul 2022, Feng Tang wrote:
> > 
> > >  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > -			  unsigned long addr, struct kmem_cache_cpu *c)
> > > +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> > >  {
> > 
> > It would be good to avoid expanding the basic slab handling functions for
> > kmalloc. Can we restrict the mods to the kmalloc related functions?
> 
> Yes, this is the part that concerned me. I tried but haven't figured
> a way.
> 
> I started implemting it several month ago, and stuck with several
> kmalloc APIs in a hacky way like dump_stack() when there is a waste
> over 1/4 of the object_size of the kmalloc_caches[][].
> 
> Then I found one central API which has all the needed info (object_size &
> orig_size) that we can yell about the waste :
> 
> static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
>                 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> 
> which I thought could be still hacky, as the existing 'alloc_traces'
> can't be resued which already has the count/call-stack info. Current
> solution leverage it at the cost of adding 'orig_size' parameters, but
> I don't know how to pass the 'waste' info through as track/location is
> in the lowest level.

If the added cost of the orig_size parameter for the non-debugging case
is a concern, what about doing this in a userspace script that makes
use of the kmalloc tracepoints?

	kmalloc: call_site=tty_buffer_alloc+0x43/0x90 ptr=00000000b78761e1
	bytes_req=1056 bytes_alloc=2048 gfp_flags=GFP_ATOMIC|__GFP_NOWARN
	accounted=false

Calculating the sum of (bytes_alloc - bytes_req) for each call_site
may be an alternative solution.
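
A minimal userspace sketch of that aggregation, fed with text from
trace_pipe in the format of the sample line above (error handling and
sorting omitted; the binary name is just an example):

#include <stdio.h>
#include <string.h>

struct site {
	char name[128];
	unsigned long long waste;
	unsigned long count;
};

static struct site sites[1024];
static int nr_sites;

/* find or create the per-call_site accumulator */
static struct site *find_site(const char *name)
{
	int i;

	for (i = 0; i < nr_sites; i++)
		if (!strcmp(sites[i].name, name))
			return &sites[i];
	if (nr_sites == 1024)
		return NULL;
	snprintf(sites[nr_sites].name, sizeof(sites[0].name), "%s", name);
	return &sites[nr_sites++];
}

int main(void)
{
	char line[1024], name[128];
	unsigned long req, alloc;
	struct site *s;
	int i;

	/* e.g.:  cat /sys/kernel/tracing/trace_pipe | ./kmalloc-waste */
	while (fgets(line, sizeof(line), stdin)) {
		char *p = strstr(line, "call_site=");

		if (!p || sscanf(p, "call_site=%127s", name) != 1)
			continue;
		p = strstr(line, "bytes_req=");
		if (!p || sscanf(p, "bytes_req=%lu", &req) != 1)
			continue;
		p = strstr(line, "bytes_alloc=");
		if (!p || sscanf(p, "bytes_alloc=%lu", &alloc) != 1)
			continue;

		s = find_site(name);
		if (s && alloc > req) {
			s->waste += alloc - req;
			s->count++;
		}
	}

	for (i = 0; i < nr_sites; i++)
		printf("%-48s count=%lu waste=%llu\n",
		       sites[i].name, sites[i].count, sites[i].waste);
	return 0;
}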

Thanks,
Hyeonggon

> Thanks,
> Feng
> 
> 
>
Feng Tang July 4, 2022, 5:56 a.m. UTC | #4
On Sun, Jul 03, 2022 at 02:17:37PM +0000, Hyeonggon Yoo wrote:
> On Fri, Jul 01, 2022 at 11:04:51PM +0800, Feng Tang wrote:
> > Hi Christoph,
> > 
> > On Fri, Jul 01, 2022 at 04:37:00PM +0200, Christoph Lameter wrote:
> > > On Fri, 1 Jul 2022, Feng Tang wrote:
> > > 
> > > >  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > > -			  unsigned long addr, struct kmem_cache_cpu *c)
> > > > +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> > > >  {
> > > 
> > > It would be good to avoid expanding the basic slab handling functions for
> > > kmalloc. Can we restrict the mods to the kmalloc related functions?
> > 
> > Yes, this is the part that concerned me. I tried but haven't figured
> > a way.
> > 
> > I started implemting it several month ago, and stuck with several
> > kmalloc APIs in a hacky way like dump_stack() when there is a waste
> > over 1/4 of the object_size of the kmalloc_caches[][].
> > 
> > Then I found one central API which has all the needed info (object_size &
> > orig_size) that we can yell about the waste :
> > 
> > static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
> >                 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> > 
> > which I thought could be still hacky, as the existing 'alloc_traces'
> > can't be resued which already has the count/call-stack info. Current
> > solution leverage it at the cost of adding 'orig_size' parameters, but
> > I don't know how to pass the 'waste' info through as track/location is
> > in the lowest level.
> 
> If adding cost of orig_size parameter for non-debugging case is concern,
> what about doing this in userspace script that makes use of kmalloc
> tracepoints?
> 
> 	kmalloc: call_site=tty_buffer_alloc+0x43/0x90 ptr=00000000b78761e1
> 	bytes_req=1056 bytes_alloc=2048 gfp_flags=GFP_ATOMIC|__GFP_NOWARN
> 	accounted=false
> 
> calculating sum of (bytes_alloc - bytes_req) for each call_site
> may be an alternative solution.

Yes, this is doable, but it will hit some of the problems I met before:
one is that there are currently 2 alloc paths, kmalloc and kmalloc_node;
also we need to consider frees to calculate the real waste, and the free
trace point doesn't have size info (yes, we could match the pointers
against the alloc path, but the user script would need to be more
complex). That's why I love the current 'alloc_traces' interface, which
has the count (solving the free counting problem) and the full call
stack info.

And for the extra parameter cost issue, I rethought it, and we can
leverage 'slab_alloc_node()' to solve it; the patch is much simpler
now without adding a new parameter:

---
diff --git a/mm/slub.c b/mm/slub.c
index b1281b8654bd3..ce4568dbb0f2d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -271,6 +271,7 @@ struct track {
 #endif
 	int cpu;		/* Was running on cpu */
 	int pid;		/* Pid context */
+	unsigned long waste;	/* memory waste for a kmalloc-ed object */
 	unsigned long when;	/* When did the operation occur */
 };
 
@@ -3240,6 +3241,16 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
 	init = slab_want_init_on_alloc(gfpflags, s);
 
 out:
+
+#ifdef CONFIG_SLUB_DEBUG
+	if (object && s->object_size != orig_size) {
+		struct track *track;
+
+		track = get_track(s, object, TRACK_ALLOC);
+		track->waste = s->object_size - orig_size;
+	}
+#endif
+
 	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
 
 	return object;
@@ -5092,6 +5103,7 @@ struct location {
 	depot_stack_handle_t handle;
 	unsigned long count;
 	unsigned long addr;
+	unsigned long waste;
 	long long sum_time;
 	long min_time;
 	long max_time;
@@ -5142,7 +5154,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 {
 	long start, end, pos;
 	struct location *l;
-	unsigned long caddr, chandle;
+	unsigned long caddr, chandle, cwaste;
 	unsigned long age = jiffies - track->when;
 	depot_stack_handle_t handle = 0;
 
@@ -5162,11 +5174,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 		if (pos == end)
 			break;
 
-		caddr = t->loc[pos].addr;
-		chandle = t->loc[pos].handle;
-		if ((track->addr == caddr) && (handle == chandle)) {
+		l = &t->loc[pos];
+		caddr = l->addr;
+		chandle = l->handle;
+		cwaste = l->waste;
+		if ((track->addr == caddr) && (handle == chandle) &&
+			(track->waste == cwaste)) {
 
-			l = &t->loc[pos];
 			l->count++;
 			if (track->when) {
 				l->sum_time += age;
@@ -5191,6 +5205,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 			end = pos;
 		else if (track->addr == caddr && handle < chandle)
 			end = pos;
+		else if (track->addr == caddr && handle == chandle &&
+				track->waste < cwaste)
+			end = pos;
 		else
 			start = pos;
 	}
@@ -5214,6 +5231,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 	l->min_pid = track->pid;
 	l->max_pid = track->pid;
 	l->handle = handle;
+	l->waste = track->waste;
 	cpumask_clear(to_cpumask(l->cpus));
 	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 	nodes_clear(l->nodes);
@@ -6102,6 +6120,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
 		else
 			seq_puts(seq, "<not-available>");
 
+		if (l->waste)
+			seq_printf(seq, " waste=%lu/%lu",
+				l->count * l->waste, l->waste);
+
 		if (l->sum_time != l->min_time) {
 			seq_printf(seq, " age=%ld/%llu/%ld",
 				l->min_time, div_u64(l->sum_time, l->count),

Thanks,
Feng

> Thanks,
> Hyeonggon
> 
> > Thanks,
> > Feng
> > 
> > 
> >
Hyeonggon Yoo July 4, 2022, 10:05 a.m. UTC | #5
On Mon, Jul 04, 2022 at 01:56:00PM +0800, Feng Tang wrote:
> On Sun, Jul 03, 2022 at 02:17:37PM +0000, Hyeonggon Yoo wrote:
> > On Fri, Jul 01, 2022 at 11:04:51PM +0800, Feng Tang wrote:
> > > Hi Christoph,
> > > 
> > > On Fri, Jul 01, 2022 at 04:37:00PM +0200, Christoph Lameter wrote:
> > > > On Fri, 1 Jul 2022, Feng Tang wrote:
> > > > 
> > > > >  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > > > -			  unsigned long addr, struct kmem_cache_cpu *c)
> > > > > +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> > > > >  {
> > > > 
> > > > It would be good to avoid expanding the basic slab handling functions for
> > > > kmalloc. Can we restrict the mods to the kmalloc related functions?
> > > 
> > > Yes, this is the part that concerned me. I tried but haven't figured
> > > a way.
> > > 
> > > I started implemting it several month ago, and stuck with several
> > > kmalloc APIs in a hacky way like dump_stack() when there is a waste
> > > over 1/4 of the object_size of the kmalloc_caches[][].
> > > 
> > > Then I found one central API which has all the needed info (object_size &
> > > orig_size) that we can yell about the waste :
> > > 
> > > static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
> > >                 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> > > 
> > > which I thought could be still hacky, as the existing 'alloc_traces'
> > > can't be resued which already has the count/call-stack info. Current
> > > solution leverage it at the cost of adding 'orig_size' parameters, but
> > > I don't know how to pass the 'waste' info through as track/location is
> > > in the lowest level.
> > 
> > If adding cost of orig_size parameter for non-debugging case is concern,
> > what about doing this in userspace script that makes use of kmalloc
> > tracepoints?
> > 
> > 	kmalloc: call_site=tty_buffer_alloc+0x43/0x90 ptr=00000000b78761e1
> > 	bytes_req=1056 bytes_alloc=2048 gfp_flags=GFP_ATOMIC|__GFP_NOWARN
> > 	accounted=false
> > 
> > calculating sum of (bytes_alloc - bytes_req) for each call_site
> > may be an alternative solution.
> 
> Yes, this is doable, and it will met some of the problems I met before,
> one is there are currently 2 alloc path: kmalloc and kmalloc_node, also
> we need to consider the free problem to calculate the real waste, and
> the free trace point doesn't have size info (Yes, we could compare
> the pointer with alloc path, and the user script may need to be more
> complexer). That's why I love the current 'alloc_traces' interface,
> which has the count (slove the free counting problem) and full call
> stack info.

Understood.

> And for the extra parameter cost issue, I rethink about it, and we
> can leverage the 'slab_alloc_node()' to solve it, and the patch is 
> much simpler now without adding a new parameter:
> 
> ---
> diff --git a/mm/slub.c b/mm/slub.c
> index b1281b8654bd3..ce4568dbb0f2d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -271,6 +271,7 @@ struct track {
>  #endif
>  	int cpu;		/* Was running on cpu */
>  	int pid;		/* Pid context */
> +	unsigned long waste;	/* memory waste for a kmalloc-ed object */
>  	unsigned long when;	/* When did the operation occur */
>  };
>  
> @@ -3240,6 +3241,16 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
>  	init = slab_want_init_on_alloc(gfpflags, s);
>  
>  out:
> +
> +#ifdef CONFIG_SLUB_DEBUG
> +	if (object && s->object_size != orig_size) {
> +		struct track *track;
> +
> +		track = get_track(s, object, TRACK_ALLOC);
> +		track->waste = s->object_size - orig_size;
> +	}
> +#endif
> +

This scares me. It does not check whether the cache has the
SLAB_STORE_USER flag set.

Also, CONFIG_SLUB_DEBUG is enabled by default, which means this still
goes against the goal of not affecting the non-debugging case.

I like v1 more than the modified version.

Thanks,
Hyeonggon

>  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
>  
>  	return object;
> @@ -5092,6 +5103,7 @@ struct location {
>  	depot_stack_handle_t handle;
>  	unsigned long count;
>  	unsigned long addr;
> +	unsigned long waste;
>  	long long sum_time;
>  	long min_time;
>  	long max_time;
> @@ -5142,7 +5154,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  {
>  	long start, end, pos;
>  	struct location *l;
> -	unsigned long caddr, chandle;
> +	unsigned long caddr, chandle, cwaste;
>  	unsigned long age = jiffies - track->when;
>  	depot_stack_handle_t handle = 0;
>  
> @@ -5162,11 +5174,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  		if (pos == end)
>  			break;
>  
> -		caddr = t->loc[pos].addr;
> -		chandle = t->loc[pos].handle;
> -		if ((track->addr == caddr) && (handle == chandle)) {
> +		l = &t->loc[pos];
> +		caddr = l->addr;
> +		chandle = l->handle;
> +		cwaste = l->waste;
> +		if ((track->addr == caddr) && (handle == chandle) &&
> +			(track->waste == cwaste)) {
>  
> -			l = &t->loc[pos];
>  			l->count++;
>  			if (track->when) {
>  				l->sum_time += age;
> @@ -5191,6 +5205,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  			end = pos;
>  		else if (track->addr == caddr && handle < chandle)
>  			end = pos;
> +		else if (track->addr == caddr && handle == chandle &&
> +				track->waste < cwaste)
> +			end = pos;
>  		else
>  			start = pos;
>  	}
> @@ -5214,6 +5231,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  	l->min_pid = track->pid;
>  	l->max_pid = track->pid;
>  	l->handle = handle;
> +	l->waste = track->waste;
>  	cpumask_clear(to_cpumask(l->cpus));
>  	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
>  	nodes_clear(l->nodes);
> @@ -6102,6 +6120,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
>  		else
>  			seq_puts(seq, "<not-available>");
>  
> +		if (l->waste)
> +			seq_printf(seq, " waste=%lu/%lu",
> +				l->count * l->waste, l->waste);
> +
>  		if (l->sum_time != l->min_time) {
>  			seq_printf(seq, " age=%ld/%llu/%ld",
>  				l->min_time, div_u64(l->sum_time, l->count),
> 
> Thanks,
> Feng
> 
> > Thanks,
> > Hyeonggon
> > 
> > > Thanks,
> > > Feng
> > > 
> > > 
> > >
Feng Tang July 5, 2022, 2:34 a.m. UTC | #6
On Mon, Jul 04, 2022 at 10:05:29AM +0000, Hyeonggon Yoo wrote:
> On Mon, Jul 04, 2022 at 01:56:00PM +0800, Feng Tang wrote:
> > On Sun, Jul 03, 2022 at 02:17:37PM +0000, Hyeonggon Yoo wrote:
> > > On Fri, Jul 01, 2022 at 11:04:51PM +0800, Feng Tang wrote:
> > > > Hi Christoph,
> > > > 
> > > > On Fri, Jul 01, 2022 at 04:37:00PM +0200, Christoph Lameter wrote:
> > > > > On Fri, 1 Jul 2022, Feng Tang wrote:
> > > > > 
> > > > > >  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > > > > -			  unsigned long addr, struct kmem_cache_cpu *c)
> > > > > > +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> > > > > >  {
> > > > > 
> > > > > It would be good to avoid expanding the basic slab handling functions for
> > > > > kmalloc. Can we restrict the mods to the kmalloc related functions?
> > > > 
> > > > Yes, this is the part that concerned me. I tried but haven't figured
> > > > a way.
> > > > 
> > > > I started implemting it several month ago, and stuck with several
> > > > kmalloc APIs in a hacky way like dump_stack() when there is a waste
> > > > over 1/4 of the object_size of the kmalloc_caches[][].
> > > > 
> > > > Then I found one central API which has all the needed info (object_size &
> > > > orig_size) that we can yell about the waste :
> > > > 
> > > > static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
> > > >                 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> > > > 
> > > > which I thought could be still hacky, as the existing 'alloc_traces'
> > > > can't be resued which already has the count/call-stack info. Current
> > > > solution leverage it at the cost of adding 'orig_size' parameters, but
> > > > I don't know how to pass the 'waste' info through as track/location is
> > > > in the lowest level.
> > > 
> > > If adding cost of orig_size parameter for non-debugging case is concern,
> > > what about doing this in userspace script that makes use of kmalloc
> > > tracepoints?
> > > 
> > > 	kmalloc: call_site=tty_buffer_alloc+0x43/0x90 ptr=00000000b78761e1
> > > 	bytes_req=1056 bytes_alloc=2048 gfp_flags=GFP_ATOMIC|__GFP_NOWARN
> > > 	accounted=false
> > > 
> > > calculating sum of (bytes_alloc - bytes_req) for each call_site
> > > may be an alternative solution.
> > 
> > Yes, this is doable, and it will met some of the problems I met before,
> > one is there are currently 2 alloc path: kmalloc and kmalloc_node, also
> > we need to consider the free problem to calculate the real waste, and
> > the free trace point doesn't have size info (Yes, we could compare
> > the pointer with alloc path, and the user script may need to be more
> > complexer). That's why I love the current 'alloc_traces' interface,
> > which has the count (slove the free counting problem) and full call
> > stack info.
> 
> Understood.
> 
> > And for the extra parameter cost issue, I rethink about it, and we
> > can leverage the 'slab_alloc_node()' to solve it, and the patch is 
> > much simpler now without adding a new parameter:
> > 
> > ---
> > diff --git a/mm/slub.c b/mm/slub.c
> > index b1281b8654bd3..ce4568dbb0f2d 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -271,6 +271,7 @@ struct track {
> >  #endif
> >  	int cpu;		/* Was running on cpu */
> >  	int pid;		/* Pid context */
> > +	unsigned long waste;	/* memory waste for a kmalloc-ed object */
> >  	unsigned long when;	/* When did the operation occur */
> >  };
> >  
> > @@ -3240,6 +3241,16 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
> >  	init = slab_want_init_on_alloc(gfpflags, s);
> >  
> >  out:
> > +
> > +#ifdef CONFIG_SLUB_DEBUG
> > +	if (object && s->object_size != orig_size) {
> > +		struct track *track;
> > +
> > +		track = get_track(s, object, TRACK_ALLOC);
> > +		track->waste = s->object_size - orig_size;
> > +	}
> > +#endif
> > +
> 
> This scares me. It does not check if the cache has
> SLAB_STORE_USER flag.
 
Yes, I missed that.
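
It should have been something like the below (an untested sketch only,
still inside the CONFIG_SLUB_DEBUG block in slab_alloc_node()):

	/* only touch the track area when user tracking is enabled */
	if (object && (s->flags & SLAB_STORE_USER) &&
	    s->object_size != orig_size) {
		struct track *track = get_track(s, object, TRACK_ALLOC);

		track->waste = s->object_size - orig_size;
	}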

> Also CONFIG_SLUB_DEBUG is enabled by default, which means that
> it is still against not affecting non-debugging case.
 
Yes, logically this debug stuff can be put together in a low-level
function.

> I like v1 more than modified version.

I see, thanks

- Feng

> Thanks,
> Hyeonggon
> 
> >  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> >  
> >  	return object;
> > @@ -5092,6 +5103,7 @@ struct location {
> >  	depot_stack_handle_t handle;
> >  	unsigned long count;
> >  	unsigned long addr;
> > +	unsigned long waste;
> >  	long long sum_time;
> >  	long min_time;
> >  	long max_time;
> > @@ -5142,7 +5154,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  {
> >  	long start, end, pos;
> >  	struct location *l;
> > -	unsigned long caddr, chandle;
> > +	unsigned long caddr, chandle, cwaste;
> >  	unsigned long age = jiffies - track->when;
> >  	depot_stack_handle_t handle = 0;
> >  
> > @@ -5162,11 +5174,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  		if (pos == end)
> >  			break;
> >  
> > -		caddr = t->loc[pos].addr;
> > -		chandle = t->loc[pos].handle;
> > -		if ((track->addr == caddr) && (handle == chandle)) {
> > +		l = &t->loc[pos];
> > +		caddr = l->addr;
> > +		chandle = l->handle;
> > +		cwaste = l->waste;
> > +		if ((track->addr == caddr) && (handle == chandle) &&
> > +			(track->waste == cwaste)) {
> >  
> > -			l = &t->loc[pos];
> >  			l->count++;
> >  			if (track->when) {
> >  				l->sum_time += age;
> > @@ -5191,6 +5205,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  			end = pos;
> >  		else if (track->addr == caddr && handle < chandle)
> >  			end = pos;
> > +		else if (track->addr == caddr && handle == chandle &&
> > +				track->waste < cwaste)
> > +			end = pos;
> >  		else
> >  			start = pos;
> >  	}
> > @@ -5214,6 +5231,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  	l->min_pid = track->pid;
> >  	l->max_pid = track->pid;
> >  	l->handle = handle;
> > +	l->waste = track->waste;
> >  	cpumask_clear(to_cpumask(l->cpus));
> >  	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
> >  	nodes_clear(l->nodes);
> > @@ -6102,6 +6120,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
> >  		else
> >  			seq_puts(seq, "<not-available>");
> >  
> > +		if (l->waste)
> > +			seq_printf(seq, " waste=%lu/%lu",
> > +				l->count * l->waste, l->waste);
> > +
> >  		if (l->sum_time != l->min_time) {
> >  			seq_printf(seq, " age=%ld/%llu/%ld",
> >  				l->min_time, div_u64(l->sum_time, l->count),
> > 
> > Thanks,
> > Feng
> > 
> > > Thanks,
> > > Hyeonggon
> > > 
> > > > Thanks,
> > > > Feng
> > > > 
> > > > 
> > > >
Vlastimil Babka July 11, 2022, 8:15 a.m. UTC | #7
On 7/1/22 15:59, Feng Tang wrote:
> kmalloc's API family is critical for mm, with one shortcoming that
> its object size is fixed to be power of 2. When user requests memory
> for '2^n + 1' bytes, actually 2^(n+1) bytes will be allocated, so
> in worst case, there is around 50% memory space waste.
> 
> We've met a kernel boot OOM panic (v5.10), and from the dumped slab info:
> 
>     [   26.062145] kmalloc-2k            814056KB     814056KB
> 
> From debug we found there are huge number of 'struct iova_magazine',
> whose size is 1032 bytes (1024 + 8), so each allocation will waste
> 1016 bytes. Though the issue was solved by giving the right (bigger)
> size of RAM, it is still nice to optimize the size (either use a
> kmalloc friendly size or create a dedicated slab for it).
> 
> And from lkml archive, there was another crash kernel OOM case [1]
> back in 2019, which seems to be related with the similar slab waste
> situation, as the log is similar:
> 
>     [    4.332648] iommu: Adding device 0000:20:02.0 to group 16
>     [    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
>     ...
>     [    4.857565] kmalloc-2048           59164KB      59164KB
> 
> The crash kernel only has 256M memory, and 59M is pretty big here.
> (Note: the related code has been changed and optimised in recent
> kernel [2], these logs are picked just to demo the problem)
> 
> So add an way to track each kmalloc's memory waste info, and leverage
> the existing SLUB debug framework to show its call stack info, so
> that user can evaluate the waste situation, identify some hot spots
> and optimize accordingly, for a better utilization of memory.
> 
> The waste info is integrated into existing interface:
> /sys/kernel/debug/slab/kmalloc-xx/alloc_traces, one example of
> 'kmalloc-4k' after boot is:
> 
> 126 ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe] waste=233856/1856 age=1493302/1493830/1494358 pid=1284 cpus=32 nodes=1
>         __slab_alloc.isra.86+0x52/0x80
>         __kmalloc_node+0x143/0x350
>         ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe]
>         ixgbe_init_interrupt_scheme+0x1a6/0x730 [ixgbe]
>         ixgbe_probe+0xc8e/0x10d0 [ixgbe]
>         local_pci_probe+0x42/0x80
>         work_for_cpu_fn+0x13/0x20
>         process_one_work+0x1c5/0x390
> 
> which means in 'kmalloc-4k' slab, there are 126 requests of
> 2240 bytes which got a 4KB space (wasting 1856 bytes each
> and 233856 bytes in total). And when system starts some real
> workload like multiple docker instances, there are more
> severe waste.
> 
> [1]. https://lkml.org/lkml/2019/8/12/266
> [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>

Hi and thanks.
I would suggest some improvements to consider:

- don't use the struct track to store orig_size, although it's an obvious
first choice. It's unused waste for the free_track, and also for any
non-kmalloc caches. I'd carve out an extra int next to the struct tracks.
Only for kmalloc caches (probably a new kmem cache flag set on creation will
be needed to easily distinguish them).
Besides the saved space, you can then set the field from ___slab_alloc()
directly and not need to pass the orig_size also to alloc_debug_processing()
etc.

- the knowledge of actual size could be used to improve poisoning checks as
well, detect cases when there's buffer overrun over the orig_size but not
cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
it now, but with orig_size stored we could?

Thanks!
Vlastimil

> ---
> Changelog:
> 
>   since RFC
>     * fix problems in kmem_cache_alloc_bulk() and records sorting,
>       improve the print format (Hyeonggon Yoo)
>     * fix a compiling issue found by 0Day bot
>     * update the commit log based info from iova developers
> 
>  mm/slub.c | 52 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 35 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b1281b8654bd3..97304ea1e6aa5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -271,6 +271,7 @@ struct track {
>  #endif
>  	int cpu;		/* Was running on cpu */
>  	int pid;		/* Pid context */
> +	unsigned long waste;	/* memory waste for a kmalloc-ed object */
>  	unsigned long when;	/* When did the operation occur */
>  };
>  
> @@ -747,6 +748,7 @@ static inline depot_stack_handle_t set_track_prepare(void)
>  
>  static void set_track_update(struct kmem_cache *s, void *object,
>  			     enum track_item alloc, unsigned long addr,
> +			     unsigned long waste,
>  			     depot_stack_handle_t handle)
>  {
>  	struct track *p = get_track(s, object, alloc);
> @@ -758,14 +760,16 @@ static void set_track_update(struct kmem_cache *s, void *object,
>  	p->cpu = smp_processor_id();
>  	p->pid = current->pid;
>  	p->when = jiffies;
> +	p->waste = waste;
>  }
>  
>  static __always_inline void set_track(struct kmem_cache *s, void *object,
> -				      enum track_item alloc, unsigned long addr)
> +				      enum track_item alloc, unsigned long addr,
> +				      unsigned long waste)
>  {
>  	depot_stack_handle_t handle = set_track_prepare();
>  
> -	set_track_update(s, object, alloc, addr, handle);
> +	set_track_update(s, object, alloc, addr, waste, handle);
>  }
>  
>  static void init_tracking(struct kmem_cache *s, void *object)
> @@ -1325,7 +1329,9 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
>  
>  static noinline int alloc_debug_processing(struct kmem_cache *s,
>  					struct slab *slab,
> -					void *object, unsigned long addr)
> +					void *object, unsigned long addr,
> +					unsigned long waste
> +					)
>  {
>  	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
>  		if (!alloc_consistency_checks(s, slab, object))
> @@ -1334,7 +1340,7 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,
>  
>  	/* Success perform special debug activities for allocs */
>  	if (s->flags & SLAB_STORE_USER)
> -		set_track(s, object, TRACK_ALLOC, addr);
> +		set_track(s, object, TRACK_ALLOC, addr, waste);
>  	trace(s, slab, object, 1);
>  	init_object(s, object, SLUB_RED_ACTIVE);
>  	return 1;
> @@ -1418,7 +1424,7 @@ static noinline int free_debug_processing(
>  	}
>  
>  	if (s->flags & SLAB_STORE_USER)
> -		set_track_update(s, object, TRACK_FREE, addr, handle);
> +		set_track_update(s, object, TRACK_FREE, addr, 0, handle);
>  	trace(s, slab, object, 0);
>  	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
>  	init_object(s, object, SLUB_RED_INACTIVE);
> @@ -1661,7 +1667,8 @@ static inline
>  void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
>  
>  static inline int alloc_debug_processing(struct kmem_cache *s,
> -	struct slab *slab, void *object, unsigned long addr) { return 0; }
> +	struct slab *slab, void *object, unsigned long addr,
> +	unsigned long waste) { return 0; }
>  
>  static inline int free_debug_processing(
>  	struct kmem_cache *s, struct slab *slab,
> @@ -2905,7 +2912,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
>   * already disabled (which is the case for bulk allocation).
>   */
>  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> -			  unsigned long addr, struct kmem_cache_cpu *c)
> +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
>  {
>  	void *freelist;
>  	struct slab *slab;
> @@ -3048,7 +3055,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  check_new_slab:
>  
>  	if (kmem_cache_debug(s)) {
> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
> +		if (!alloc_debug_processing(s, slab, freelist, addr, s->object_size - orig_size)) {
>  			/* Slab failed checks. Next slab needed */
>  			goto new_slab;
>  		} else {
> @@ -3102,7 +3109,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>   * pointer.
>   */
>  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> -			  unsigned long addr, struct kmem_cache_cpu *c)
> +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
>  {
>  	void *p;
>  
> @@ -3115,7 +3122,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  #endif
>  
> -	p = ___slab_alloc(s, gfpflags, node, addr, c);
> +	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
>  #ifdef CONFIG_PREEMPT_COUNT
>  	slub_put_cpu_ptr(s->cpu_slab);
>  #endif
> @@ -3206,7 +3213,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
>  	 */
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
>  	    unlikely(!object || !slab || !node_match(slab, node))) {
> -		object = __slab_alloc(s, gfpflags, node, addr, c);
> +		object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
>  	} else {
>  		void *next_object = get_freepointer_safe(s, object);
>  
> @@ -3731,7 +3738,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  			 * of re-populating per CPU c->freelist
>  			 */
>  			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
> -					    _RET_IP_, c);
> +					    _RET_IP_, c, s->object_size);
>  			if (unlikely(!p[i]))
>  				goto error;
>  
> @@ -5092,6 +5099,7 @@ struct location {
>  	depot_stack_handle_t handle;
>  	unsigned long count;
>  	unsigned long addr;
> +	unsigned long waste;
>  	long long sum_time;
>  	long min_time;
>  	long max_time;
> @@ -5142,7 +5150,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  {
>  	long start, end, pos;
>  	struct location *l;
> -	unsigned long caddr, chandle;
> +	unsigned long caddr, chandle, cwaste;
>  	unsigned long age = jiffies - track->when;
>  	depot_stack_handle_t handle = 0;
>  
> @@ -5162,11 +5170,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  		if (pos == end)
>  			break;
>  
> -		caddr = t->loc[pos].addr;
> -		chandle = t->loc[pos].handle;
> -		if ((track->addr == caddr) && (handle == chandle)) {
> +		l = &t->loc[pos];
> +		caddr = l->addr;
> +		chandle = l->handle;
> +		cwaste = l->waste;
> +		if ((track->addr == caddr) && (handle == chandle) &&
> +			(track->waste == cwaste)) {
>  
> -			l = &t->loc[pos];
>  			l->count++;
>  			if (track->when) {
>  				l->sum_time += age;
> @@ -5191,6 +5201,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  			end = pos;
>  		else if (track->addr == caddr && handle < chandle)
>  			end = pos;
> +		else if (track->addr == caddr && handle == chandle &&
> +				track->waste < cwaste)
> +			end = pos;
>  		else
>  			start = pos;
>  	}
> @@ -5214,6 +5227,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
>  	l->min_pid = track->pid;
>  	l->max_pid = track->pid;
>  	l->handle = handle;
> +	l->waste = track->waste;
>  	cpumask_clear(to_cpumask(l->cpus));
>  	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
>  	nodes_clear(l->nodes);
> @@ -6102,6 +6116,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
>  		else
>  			seq_puts(seq, "<not-available>");
>  
> +		if (l->waste)
> +			seq_printf(seq, " waste=%lu/%lu",
> +				l->count * l->waste, l->waste);
> +
>  		if (l->sum_time != l->min_time) {
>  			seq_printf(seq, " age=%ld/%llu/%ld",
>  				l->min_time, div_u64(l->sum_time, l->count),
Feng Tang July 11, 2022, 11:54 a.m. UTC | #8
On Mon, Jul 11, 2022 at 10:15:21AM +0200, Vlastimil Babka wrote:
> On 7/1/22 15:59, Feng Tang wrote:
[...]
> > The waste info is integrated into existing interface:
> > /sys/kernel/debug/slab/kmalloc-xx/alloc_traces, one example of
> > 'kmalloc-4k' after boot is:
> > 
> > 126 ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe] waste=233856/1856 age=1493302/1493830/1494358 pid=1284 cpus=32 nodes=1
> >         __slab_alloc.isra.86+0x52/0x80
> >         __kmalloc_node+0x143/0x350
> >         ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe]
> >         ixgbe_init_interrupt_scheme+0x1a6/0x730 [ixgbe]
> >         ixgbe_probe+0xc8e/0x10d0 [ixgbe]
> >         local_pci_probe+0x42/0x80
> >         work_for_cpu_fn+0x13/0x20
> >         process_one_work+0x1c5/0x390
> > 
> > which means in 'kmalloc-4k' slab, there are 126 requests of
> > 2240 bytes which got a 4KB space (wasting 1856 bytes each
> > and 233856 bytes in total). And when system starts some real
> > workload like multiple docker instances, there are more
> > severe waste.
> > 
> > [1]. https://lkml.org/lkml/2019/8/12/266
> > [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> Hi and thanks.
> I would suggest some improvements to consider:
 
Hi Vlastimil,

Thanks for the review and detailed suggestions!

> - don't use the struct track to store orig_size, although it's an obvious
> first choice. It's unused waste for the free_track, and also for any
> non-kmalloc caches. I'd carve out an extra int next to the struct tracks.
> Only for kmalloc caches (probably a new kmem cache flag set on creation will
> be needed to easily distinguish them).
> Besides the saved space, you can then set the field from ___slab_alloc()
> directly and not need to pass the orig_size also to alloc_debug_processing()
> etc.

Do you mean to decouple 'track' and the 'orig_size' (waste info), and
add 'orig_size' for each kmalloc object (with the help of a flag)?
The current solution depends heavily on the 'track' framework.

Initially when implementing it, I hit several problems:
1. where to save the orig_size (waste)
2. how to calculate the waste info
3. where to show the waste info
4. how to show the full call stack so users can easily act on it

I thought about saving the global waste in kmem_cache and showing it in
'slabinfo', but then we would have to handle it in every free call, or
loop over all in-use objects on 'cat slabinfo', and it would still lack
the detailed info per call stack (some call sites are called with
different sizes). The 'track' framework seems ideal for solving all of
these.

Or, based on your suggestion, we could still add 'waste' to 'track',
but under a kernel config option like 'SLUB_DEBUG_WASTE' to save space.

Also, I checked struct track: it is well packed, and only the 'int cpu'
could be repurposed, so that it spares 16 bits for storing the waste
info:

struct track {
    ...
    unsigned short cpu; /* 0-65535 */
    unsigned short waste;
    ...
}

> - the knowledge of actual size could be used to improve poisoning checks as
> well, detect cases when there's buffer overrun over the orig_size but not
> cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
> it now, but with orig_size stored we could?
 
Yes! This could be improved.

Thanks,
Feng

> Thanks!
> Vlastimil
> 
> > ---
> > Changelog:
> > 
> >   since RFC
> >     * fix problems in kmem_cache_alloc_bulk() and records sorting,
> >       improve the print format (Hyeonggon Yoo)
> >     * fix a compiling issue found by 0Day bot
> >     * update the commit log based info from iova developers
> > 
> >  mm/slub.c | 52 +++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 35 insertions(+), 17 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index b1281b8654bd3..97304ea1e6aa5 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -271,6 +271,7 @@ struct track {
> >  #endif
> >  	int cpu;		/* Was running on cpu */
> >  	int pid;		/* Pid context */
> > +	unsigned long waste;	/* memory waste for a kmalloc-ed object */
> >  	unsigned long when;	/* When did the operation occur */
> >  };
> >  
> > @@ -747,6 +748,7 @@ static inline depot_stack_handle_t set_track_prepare(void)
> >  
> >  static void set_track_update(struct kmem_cache *s, void *object,
> >  			     enum track_item alloc, unsigned long addr,
> > +			     unsigned long waste,
> >  			     depot_stack_handle_t handle)
> >  {
> >  	struct track *p = get_track(s, object, alloc);
> > @@ -758,14 +760,16 @@ static void set_track_update(struct kmem_cache *s, void *object,
> >  	p->cpu = smp_processor_id();
> >  	p->pid = current->pid;
> >  	p->when = jiffies;
> > +	p->waste = waste;
> >  }
> >  
> >  static __always_inline void set_track(struct kmem_cache *s, void *object,
> > -				      enum track_item alloc, unsigned long addr)
> > +				      enum track_item alloc, unsigned long addr,
> > +				      unsigned long waste)
> >  {
> >  	depot_stack_handle_t handle = set_track_prepare();
> >  
> > -	set_track_update(s, object, alloc, addr, handle);
> > +	set_track_update(s, object, alloc, addr, waste, handle);
> >  }
> >  
> >  static void init_tracking(struct kmem_cache *s, void *object)
> > @@ -1325,7 +1329,9 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
> >  
> >  static noinline int alloc_debug_processing(struct kmem_cache *s,
> >  					struct slab *slab,
> > -					void *object, unsigned long addr)
> > +					void *object, unsigned long addr,
> > +					unsigned long waste
> > +					)
> >  {
> >  	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
> >  		if (!alloc_consistency_checks(s, slab, object))
> > @@ -1334,7 +1340,7 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,
> >  
> >  	/* Success perform special debug activities for allocs */
> >  	if (s->flags & SLAB_STORE_USER)
> > -		set_track(s, object, TRACK_ALLOC, addr);
> > +		set_track(s, object, TRACK_ALLOC, addr, waste);
> >  	trace(s, slab, object, 1);
> >  	init_object(s, object, SLUB_RED_ACTIVE);
> >  	return 1;
> > @@ -1418,7 +1424,7 @@ static noinline int free_debug_processing(
> >  	}
> >  
> >  	if (s->flags & SLAB_STORE_USER)
> > -		set_track_update(s, object, TRACK_FREE, addr, handle);
> > +		set_track_update(s, object, TRACK_FREE, addr, 0, handle);
> >  	trace(s, slab, object, 0);
> >  	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
> >  	init_object(s, object, SLUB_RED_INACTIVE);
> > @@ -1661,7 +1667,8 @@ static inline
> >  void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
> >  
> >  static inline int alloc_debug_processing(struct kmem_cache *s,
> > -	struct slab *slab, void *object, unsigned long addr) { return 0; }
> > +	struct slab *slab, void *object, unsigned long addr,
> > +	unsigned long waste) { return 0; }
> >  
> >  static inline int free_debug_processing(
> >  	struct kmem_cache *s, struct slab *slab,
> > @@ -2905,7 +2912,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> >   * already disabled (which is the case for bulk allocation).
> >   */
> >  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > -			  unsigned long addr, struct kmem_cache_cpu *c)
> > +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> >  {
> >  	void *freelist;
> >  	struct slab *slab;
> > @@ -3048,7 +3055,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >  check_new_slab:
> >  
> >  	if (kmem_cache_debug(s)) {
> > -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
> > +		if (!alloc_debug_processing(s, slab, freelist, addr, s->object_size - orig_size)) {
> >  			/* Slab failed checks. Next slab needed */
> >  			goto new_slab;
> >  		} else {
> > @@ -3102,7 +3109,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >   * pointer.
> >   */
> >  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > -			  unsigned long addr, struct kmem_cache_cpu *c)
> > +			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> >  {
> >  	void *p;
> >  
> > @@ -3115,7 +3122,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >  	c = slub_get_cpu_ptr(s->cpu_slab);
> >  #endif
> >  
> > -	p = ___slab_alloc(s, gfpflags, node, addr, c);
> > +	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> >  #ifdef CONFIG_PREEMPT_COUNT
> >  	slub_put_cpu_ptr(s->cpu_slab);
> >  #endif
> > @@ -3206,7 +3213,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
> >  	 */
> >  	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
> >  	    unlikely(!object || !slab || !node_match(slab, node))) {
> > -		object = __slab_alloc(s, gfpflags, node, addr, c);
> > +		object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
> >  	} else {
> >  		void *next_object = get_freepointer_safe(s, object);
> >  
> > @@ -3731,7 +3738,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> >  			 * of re-populating per CPU c->freelist
> >  			 */
> >  			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
> > -					    _RET_IP_, c);
> > +					    _RET_IP_, c, s->object_size);
> >  			if (unlikely(!p[i]))
> >  				goto error;
> >  
> > @@ -5092,6 +5099,7 @@ struct location {
> >  	depot_stack_handle_t handle;
> >  	unsigned long count;
> >  	unsigned long addr;
> > +	unsigned long waste;
> >  	long long sum_time;
> >  	long min_time;
> >  	long max_time;
> > @@ -5142,7 +5150,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  {
> >  	long start, end, pos;
> >  	struct location *l;
> > -	unsigned long caddr, chandle;
> > +	unsigned long caddr, chandle, cwaste;
> >  	unsigned long age = jiffies - track->when;
> >  	depot_stack_handle_t handle = 0;
> >  
> > @@ -5162,11 +5170,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  		if (pos == end)
> >  			break;
> >  
> > -		caddr = t->loc[pos].addr;
> > -		chandle = t->loc[pos].handle;
> > -		if ((track->addr == caddr) && (handle == chandle)) {
> > +		l = &t->loc[pos];
> > +		caddr = l->addr;
> > +		chandle = l->handle;
> > +		cwaste = l->waste;
> > +		if ((track->addr == caddr) && (handle == chandle) &&
> > +			(track->waste == cwaste)) {
> >  
> > -			l = &t->loc[pos];
> >  			l->count++;
> >  			if (track->when) {
> >  				l->sum_time += age;
> > @@ -5191,6 +5201,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  			end = pos;
> >  		else if (track->addr == caddr && handle < chandle)
> >  			end = pos;
> > +		else if (track->addr == caddr && handle == chandle &&
> > +				track->waste < cwaste)
> > +			end = pos;
> >  		else
> >  			start = pos;
> >  	}
> > @@ -5214,6 +5227,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> >  	l->min_pid = track->pid;
> >  	l->max_pid = track->pid;
> >  	l->handle = handle;
> > +	l->waste = track->waste;
> >  	cpumask_clear(to_cpumask(l->cpus));
> >  	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
> >  	nodes_clear(l->nodes);
> > @@ -6102,6 +6116,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
> >  		else
> >  			seq_puts(seq, "<not-available>");
> >  
> > +		if (l->waste)
> > +			seq_printf(seq, " waste=%lu/%lu",
> > +				l->count * l->waste, l->waste);
> > +
> >  		if (l->sum_time != l->min_time) {
> >  			seq_printf(seq, " age=%ld/%llu/%ld",
> >  				l->min_time, div_u64(l->sum_time, l->count),
Feng Tang July 13, 2022, 7:36 a.m. UTC | #9
Hi Vlastimil,

On Mon, Jul 11, 2022 at 10:15:21AM +0200, Vlastimil Babka wrote:
> On 7/1/22 15:59, Feng Tang wrote:
> > kmalloc's API family is critical for mm, with one shortcoming that
> > its object size is fixed to be power of 2. When user requests memory
> > for '2^n + 1' bytes, actually 2^(n+1) bytes will be allocated, so
> > in worst case, there is around 50% memory space waste.
> > 
> > We've met a kernel boot OOM panic (v5.10), and from the dumped slab info:
> > 
> >     [   26.062145] kmalloc-2k            814056KB     814056KB
> > 
> > From debug we found there are huge number of 'struct iova_magazine',
> > whose size is 1032 bytes (1024 + 8), so each allocation will waste
> > 1016 bytes. Though the issue was solved by giving the right (bigger)
> > size of RAM, it is still nice to optimize the size (either use a
> > kmalloc friendly size or create a dedicated slab for it).
[...]
> 
> Hi and thanks.
> I would suggest some improvements to consider:
> 
> - don't use the struct track to store orig_size, although it's an obvious
> first choice. It's unused waste for the free_track, and also for any
> non-kmalloc caches. I'd carve out an extra int next to the struct tracks.
> Only for kmalloc caches (probably a new kmem cache flag set on creation will
> be needed to easily distinguish them).
> Besides the saved space, you can then set the field from ___slab_alloc()
> directly and not need to pass the orig_size also to alloc_debug_processing()
> etc.
 
Here is a draft patch following your suggestion; please check whether I
missed anything. (A quick test showed it achieves a similar effect to
the v1 patch.) Thanks!

---
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0fefdf528e0d..d3dacb0f013f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -29,6 +29,8 @@
 #define SLAB_RED_ZONE		((slab_flags_t __force)0x00000400U)
 /* DEBUG: Poison objects */
 #define SLAB_POISON		((slab_flags_t __force)0x00000800U)
+/* Indicate a slab of kmalloc */
+#define SLAB_KMALLOC		((slab_flags_t __force)0x00001000U)
 /* Align objs on cache lines */
 #define SLAB_HWCACHE_ALIGN	((slab_flags_t __force)0x00002000U)
 /* Use GFP_DMA memory */
diff --git a/mm/slub.c b/mm/slub.c
index 26b00951aad1..3b0f80927817 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1030,6 +1030,9 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
 
+	if (s->flags & SLAB_KMALLOC)
+		off += sizeof(unsigned int);
+
 	off += kasan_metadata_size(s);
 
 	if (size_from_object(s) == off)
@@ -1323,9 +1326,38 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
 	return 1;
 }
 
+
+static inline void set_orig_size(struct kmem_cache *s,
+					void *object, unsigned int orig_size)
+{
+	void *p = kasan_reset_tag(object);
+
+	p += get_info_end(s);
+
+	if (s->flags & SLAB_STORE_USER)
+		p += sizeof(struct track) * 2;
+
+	*(unsigned int *)p = orig_size;
+}
+
+static unsigned int get_orig_size(struct kmem_cache *s, void *object)
+{
+	void *p = kasan_reset_tag(object);
+
+	if (!(s->flags & SLAB_KMALLOC))
+		return s->object_size;
+
+	p += get_info_end(s);
+	if (s->flags & SLAB_STORE_USER)
+		p += sizeof(struct track) * 2;
+
+	return *(unsigned int *)p;
+}
+
 static noinline int alloc_debug_processing(struct kmem_cache *s,
 					struct slab *slab,
-					void *object, unsigned long addr)
+					void *object, unsigned long addr,
+					unsigned int orig_size)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!alloc_consistency_checks(s, slab, object))
@@ -1335,6 +1367,10 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
+
+	if (s->flags & SLAB_KMALLOC)
+		set_orig_size(s, object, orig_size);
+
 	trace(s, slab, object, 1);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;
@@ -1661,7 +1697,8 @@ static inline
 void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct slab *slab, void *object, unsigned long addr) { return 0; }
+	struct slab *slab, void *object, unsigned long addr,
+	unsigned int orig_size) { return 0; }
 
 static inline int free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
@@ -2905,7 +2942,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
  * already disabled (which is the case for bulk allocation).
  */
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *freelist;
 	struct slab *slab;
@@ -3048,7 +3085,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 check_new_slab:
 
 	if (kmem_cache_debug(s)) {
-		if (!alloc_debug_processing(s, slab, freelist, addr)) {
+		if (!alloc_debug_processing(s, slab, freelist, addr, orig_size)) {
 			/* Slab failed checks. Next slab needed */
 			goto new_slab;
 		} else {
@@ -3102,7 +3139,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
  * pointer.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *p;
 
@@ -3115,7 +3152,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif
 
-	p = ___slab_alloc(s, gfpflags, node, addr, c);
+	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
 #ifdef CONFIG_PREEMPT_COUNT
 	slub_put_cpu_ptr(s->cpu_slab);
 #endif
@@ -3206,7 +3243,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
 	 */
 	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
 	    unlikely(!object || !slab || !node_match(slab, node))) {
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
 
@@ -3709,7 +3746,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			 * of re-populating per CPU c->freelist
 			 */
 			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
-					    _RET_IP_, c);
+					    _RET_IP_, c, s->object_size);
 			if (unlikely(!p[i]))
 				goto error;
 
@@ -4118,6 +4155,10 @@ static int calculate_sizes(struct kmem_cache *s)
 		 * the object.
 		 */
 		size += 2 * sizeof(struct track);
+
+	/* Save the original requested kmalloc size */
+	if (flags & SLAB_KMALLOC)
+		size += sizeof(unsigned int);
 #endif
 
 	kasan_cache_create(s, &size, &s->flags);
@@ -4842,7 +4883,7 @@ void __init kmem_cache_init(void)
 
 	/* Now we can use the kmem_cache to allocate kmalloc slabs */
 	setup_kmalloc_cache_index_table();
-	create_kmalloc_caches(0);
+	create_kmalloc_caches(SLAB_KMALLOC);
 
 	/* Setup random freelists for each cache */
 	init_freelist_randomization();
@@ -5068,6 +5109,7 @@ struct location {
 	depot_stack_handle_t handle;
 	unsigned long count;
 	unsigned long addr;
+	unsigned long waste;
 	long long sum_time;
 	long min_time;
 	long max_time;
@@ -5114,13 +5156,15 @@ static int alloc_loc_track(struct loc_track *t, unsigned long max, gfp_t flags)
 }
 
 static int add_location(struct loc_track *t, struct kmem_cache *s,
-				const struct track *track)
+				const struct track *track,
+				unsigned int orig_size)
 {
 	long start, end, pos;
 	struct location *l;
-	unsigned long caddr, chandle;
+	unsigned long caddr, chandle, cwaste;
 	unsigned long age = jiffies - track->when;
 	depot_stack_handle_t handle = 0;
+	unsigned int waste = s->object_size - orig_size;
 
 #ifdef CONFIG_STACKDEPOT
 	handle = READ_ONCE(track->handle);
@@ -5138,11 +5182,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 		if (pos == end)
 			break;
 
-		caddr = t->loc[pos].addr;
-		chandle = t->loc[pos].handle;
-		if ((track->addr == caddr) && (handle == chandle)) {
+		l = &t->loc[pos];
+		caddr = l->addr;
+		chandle = l->handle;
+		cwaste = l->waste;
+		if ((track->addr == caddr) && (handle == chandle) &&
+			(waste == cwaste)) {
 
-			l = &t->loc[pos];
 			l->count++;
 			if (track->when) {
 				l->sum_time += age;
@@ -5167,6 +5213,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 			end = pos;
 		else if (track->addr == caddr && handle < chandle)
 			end = pos;
+		else if (track->addr == caddr && handle == chandle &&
+				waste < cwaste)
+			end = pos;
 		else
 			start = pos;
 	}
@@ -5190,6 +5239,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 	l->min_pid = track->pid;
 	l->max_pid = track->pid;
 	l->handle = handle;
+	l->waste = waste;
 	cpumask_clear(to_cpumask(l->cpus));
 	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 	nodes_clear(l->nodes);
@@ -5208,7 +5258,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,
 
 	for_each_object(p, s, addr, slab->objects)
 		if (!test_bit(__obj_to_index(s, addr, p), obj_map))
-			add_location(t, s, get_track(s, p, alloc));
+			add_location(t, s, get_track(s, p, alloc), get_orig_size(s, p));
 }
 #endif  /* CONFIG_DEBUG_FS   */
 #endif	/* CONFIG_SLUB_DEBUG */
@@ -6078,6 +6128,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
 		else
 			seq_puts(seq, "<not-available>");
 
+		if (l->waste)
+			seq_printf(seq, " waste=%lu/%lu",
+				l->count * l->waste, l->waste);
+
 		if (l->sum_time != l->min_time) {
 			seq_printf(seq, " age=%ld/%llu/%ld",
 				l->min_time, div_u64(l->sum_time, l->count),

> - the knowledge of actual size could be used to improve poisoning checks as
> well, detect cases when there's buffer overrun over the orig_size but not
> cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
> it now, but with orig_size stored we could?

The above patch doesn't touch this yet. And I have a question: for the
[orig_size, object_size) area, shall we fill it with POISON_XXX regardless
of whether the REDZONE flag is set?
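
For reference, the kind of overrun being discussed is something like the
purely illustrative snippet below, which stays inside the kmalloc-64
object and so is invisible to the current redzone/poison checks:

	/*
	 * Illustrative only: kmalloc(48) is served from kmalloc-64, so
	 * bytes [48, 64) are "wasted". The write below lands in that
	 * area and is currently not detected; with orig_size recorded
	 * it could be.
	 */
	char *buf = kmalloc(48, GFP_KERNEL);

	buf[60] = 0xff;		/* past the requested 48 bytes, inside the object */
	kfree(buf);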

Thanks,
Feng

> Thanks!
> Vlastimil
Vlastimil Babka July 14, 2022, 8:11 p.m. UTC | #10
On 7/13/22 09:36, Feng Tang wrote:
> Hi Vlastimil,
> 
> On Mon, Jul 11, 2022 at 10:15:21AM +0200, Vlastimil Babka wrote:
>> On 7/1/22 15:59, Feng Tang wrote:
>> > kmalloc's API family is critical for mm, with one shortcoming that
>> > its object size is fixed to be power of 2. When user requests memory
>> > for '2^n + 1' bytes, actually 2^(n+1) bytes will be allocated, so
>> > in worst case, there is around 50% memory space waste.
>> > 
>> > We've met a kernel boot OOM panic (v5.10), and from the dumped slab info:
>> > 
>> >     [   26.062145] kmalloc-2k            814056KB     814056KB
>> > 
>> > From debug we found there are huge number of 'struct iova_magazine',
>> > whose size is 1032 bytes (1024 + 8), so each allocation will waste
>> > 1016 bytes. Though the issue was solved by giving the right (bigger)
>> > size of RAM, it is still nice to optimize the size (either use a
>> > kmalloc friendly size or create a dedicated slab for it).
> [...]
>> 
>> Hi and thanks.
>> I would suggest some improvements to consider:
>> 
>> - don't use the struct track to store orig_size, although it's an obvious
>> first choice. It's unused waste for the free_track, and also for any
>> non-kmalloc caches. I'd carve out an extra int next to the struct tracks.
>> Only for kmalloc caches (probably a new kmem cache flag set on creation will
>> be needed to easily distinguish them).
>> Besides the saved space, you can then set the field from ___slab_alloc()
>> directly and not need to pass the orig_size also to alloc_debug_processing()
>> etc.
>  
> Here is a draft patch fowlling your suggestion, please check if I missed
> anything? (Quick test showed it achived similar effect as v1 patch). Thanks!

Thanks, overall it looks good at first glance!

> ---
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 0fefdf528e0d..d3dacb0f013f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -29,6 +29,8 @@
>  #define SLAB_RED_ZONE		((slab_flags_t __force)0x00000400U)
>  /* DEBUG: Poison objects */
>  #define SLAB_POISON		((slab_flags_t __force)0x00000800U)
> +/* Indicate a slab of kmalloc */

"Indicate a kmalloc cache" would be more precise.

> +#define SLAB_KMALLOC		((slab_flags_t __force)0x00001000U)
>  /* Align objs on cache lines */
>  #define SLAB_HWCACHE_ALIGN	((slab_flags_t __force)0x00002000U)
>  /* Use GFP_DMA memory */
> diff --git a/mm/slub.c b/mm/slub.c
> index 26b00951aad1..3b0f80927817 100644

<snip>

> 
>> - the knowledge of actual size could be used to improve poisoning checks as
>> well, detect cases when there's buffer overrun over the orig_size but not
>> cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
>> it now, but with orig_size stored we could?
> 
> The above patch doesn't touch this. As I have a question, for the
> [orib_size, object_size) area, shall we fill it with POISON_XXX no matter
> REDZONE flag is set or not?

Ah, looks like we use redzoning, not poisoning, for padding from
s->object_size to word boundary. So it would be more consistent to use the
redzone pattern (RED_ACTIVE) and check with the dynamic orig_size. Probably
no change for RED_INACTIVE handling is needed though.

> Thanks,
> Feng
> 
>> Thanks!
>> Vlastimil
Feng Tang July 15, 2022, 8:29 a.m. UTC | #11
On Thu, Jul 14, 2022 at 10:11:32PM +0200, Vlastimil Babka wrote:
> On 7/13/22 09:36, Feng Tang wrote:
> > Hi Vlastimil,
> > 
> > On Mon, Jul 11, 2022 at 10:15:21AM +0200, Vlastimil Babka wrote:
> >> On 7/1/22 15:59, Feng Tang wrote:
> >> > kmalloc's API family is critical for mm, with one shortcoming that
> >> > its object size is fixed to be power of 2. When user requests memory
> >> > for '2^n + 1' bytes, actually 2^(n+1) bytes will be allocated, so
> >> > in worst case, there is around 50% memory space waste.
> >> > 
> >> > We've met a kernel boot OOM panic (v5.10), and from the dumped slab info:
> >> > 
> >> >     [   26.062145] kmalloc-2k            814056KB     814056KB
> >> > 
> >> > From debug we found there are huge number of 'struct iova_magazine',
> >> > whose size is 1032 bytes (1024 + 8), so each allocation will waste
> >> > 1016 bytes. Though the issue was solved by giving the right (bigger)
> >> > size of RAM, it is still nice to optimize the size (either use a
> >> > kmalloc friendly size or create a dedicated slab for it).
> > [...]
> >> 
> >> Hi and thanks.
> >> I would suggest some improvements to consider:
> >> 
> >> - don't use the struct track to store orig_size, although it's an obvious
> >> first choice. It's unused waste for the free_track, and also for any
> >> non-kmalloc caches. I'd carve out an extra int next to the struct tracks.
> >> Only for kmalloc caches (probably a new kmem cache flag set on creation will
> >> be needed to easily distinguish them).
> >> Besides the saved space, you can then set the field from ___slab_alloc()
> >> directly and not need to pass the orig_size also to alloc_debug_processing()
> >> etc.
> >  
> > Here is a draft patch fowlling your suggestion, please check if I missed
> > anything? (Quick test showed it achived similar effect as v1 patch). Thanks!
> 
> Thanks, overal it looks at first glance!

Thanks!

> > ---
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 0fefdf528e0d..d3dacb0f013f 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -29,6 +29,8 @@
> >  #define SLAB_RED_ZONE		((slab_flags_t __force)0x00000400U)
> >  /* DEBUG: Poison objects */
> >  #define SLAB_POISON		((slab_flags_t __force)0x00000800U)
> > +/* Indicate a slab of kmalloc */
> 
> "Indicate a kmalloc cache" would be more precise.
 
Will use this in next version.

> > +#define SLAB_KMALLOC		((slab_flags_t __force)0x00001000U)
> >  /* Align objs on cache lines */
> >  #define SLAB_HWCACHE_ALIGN	((slab_flags_t __force)0x00002000U)
> >  /* Use GFP_DMA memory */
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 26b00951aad1..3b0f80927817 100644
> 
> <snip>
> 
> > 
> >> - the knowledge of actual size could be used to improve poisoning checks as
> >> well, detect cases when there's buffer overrun over the orig_size but not
> >> cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
> >> it now, but with orig_size stored we could?
> > 
> > The above patch doesn't touch this. As I have a question, for the
> > [orib_size, object_size) area, shall we fill it with POISON_XXX no matter
> > REDZONE flag is set or not?
> 
> Ah, looks like we use redzoning, not poisoning, for padding from
> s->object_size to word boundary. So it would be more consistent to use the
> redzone pattern (RED_ACTIVE) and check with the dynamic orig_size. Probably
> no change for RED_INACTIVE handling is needed though.

Thanks for clarifying, I will go this way and do more testing. Also I'd
make it a separate patch, as it is logically different from the space
wastage tracking.

Thanks,
Feng
Feng Tang July 19, 2022, 1:45 p.m. UTC | #12
Hi Vlastimil,

On Fri, Jul 15, 2022 at 04:29:22PM +0800, Tang, Feng wrote:
[...]
> > >> - the knowledge of actual size could be used to improve poisoning checks as
> > >> well, detect cases when there's buffer overrun over the orig_size but not
> > >> cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
> > >> it now, but with orig_size stored we could?
> > > 
> > > The above patch doesn't touch this. As I have a question, for the
> > > [orib_size, object_size) area, shall we fill it with POISON_XXX no matter
> > > REDZONE flag is set or not?
> > 
> > Ah, looks like we use redzoning, not poisoning, for padding from
> > s->object_size to word boundary. So it would be more consistent to use the
> > redzone pattern (RED_ACTIVE) and check with the dynamic orig_size. Probably
> > no change for RED_INACTIVE handling is needed though.
> 
> Thanks for clarifying, will go this way and do more test. Also I'd 
> make it a separate patch, as it is logically different from the space
> wastage.

I made a draft patch to redzone the wasted space (pasted at the end of
this mail), which basically works and detects the corruption caused by
the test code below:
	
	size = 256;
	/* kmalloc(264) is served from kmalloc-512, so [264, 512) is unused */
	buf = kmalloc(size + 8, GFP_KERNEL);
	/* scribble inside that unused area, past the requested 264 bytes */
	memset(buf + size + size/2, 0xff, size/4);
	print_section(KERN_ERR, "Corrupted-kmalloc-space", buf, size * 2);
	kfree(buf);

However, when it is enabled globally, many places report corruption. I
debugged one case and found that the networking (skb_buff) code already
knows about this "wasted" kmalloc space and makes use of it, which my
patch then flags as corruption.

The allocation stack is:

[    0.933675] BUG kmalloc-2k (Not tainted): kmalloc unused part overwritten
[    0.933675] -----------------------------------------------------------------------------
[    0.933675]
[    0.933675] 0xffff888237d026c0-0xffff888237d026e3 @offset=9920. First byte 0x0 instead of 0xcc
[    0.933675] Allocated in __alloc_skb+0x8e/0x1d0 age=5 cpu=0 pid=1
[    0.933675]  __slab_alloc.constprop.0+0x52/0x90
[    0.933675]  __kmalloc_node_track_caller+0x129/0x380
[    0.933675]  kmalloc_reserve+0x2a/0x70
[    0.933675]  __alloc_skb+0x8e/0x1d0
[    0.933675]  audit_buffer_alloc+0x3a/0xc0
[    0.933675]  audit_log_start.part.0+0xa3/0x300
[    0.933675]  audit_log+0x62/0xc0
[    0.933675]  audit_init+0x15c/0x16f

The networking code that touches the [orig_size, object_size) area is
__build_skb_around(), which puts a 'struct skb_shared_info' at the end
of this area:

	static void __build_skb_around(struct sk_buff *skb, void *data,
				       unsigned int frag_size)
	{
		struct skb_shared_info *shinfo;
		unsigned int size = frag_size ? : ksize(data);

		size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
		-----> XXX carve the space  <-----

		...
		skb_set_end_offset(skb, size);
		...

		shinfo = skb_shinfo(skb);
		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
		atomic_set(&shinfo->dataref, 1);

		-----> upper 2 lines changes the memory <-----
		...
	}

Then we end up seeing the corruption report: 

[    0.933675] Object   ffff888237d026c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    0.933675] Object   ffff888237d026d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    0.933675] Object   ffff888237d026e0: 01 00 00 00 cc cc cc cc cc cc cc cc cc cc cc cc  ................

I haven't had time to chase the other cases yet, so I wanted to report
these findings first.

Following is the draft patch (not yet cleaned up) to redzone the
[orig_size, object_size) space.

Thanks,
Feng

---
diff --git a/mm/slab.c b/mm/slab.c
index 6474c515a664..2f1110b16463 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3229,7 +3229,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_
 	init = slab_want_init_on_alloc(flags, cachep);
 
 out_hooks:
-	slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr, init);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr, init, 0);
 	return ptr;
 }
 
@@ -3291,7 +3291,7 @@ slab_alloc(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
 	init = slab_want_init_on_alloc(flags, cachep);
 
 out:
-	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init, 0);
 	return objp;
 }
 
@@ -3536,13 +3536,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	 * Done outside of the IRQ disabled section.
 	 */
 	slab_post_alloc_hook(s, objcg, flags, size, p,
-				slab_want_init_on_alloc(flags, s));
+				slab_want_init_on_alloc(flags, s), 0);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, objcg, flags, i, p, false);
+	slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index a8d5eb1c323f..938ec6454dbc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -719,12 +719,17 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					struct obj_cgroup *objcg, gfp_t flags,
-					size_t size, void **p, bool init)
+					size_t size, void **p, bool init,
+					unsigned int orig_size)
 {
 	size_t i;
 
 	flags &= gfp_allowed_mask;
 
+	/* If original request size(kmalloc) is not set, use object_size */
+	if (!orig_size)
+		orig_size = s->object_size;
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * kasan_slab_alloc and initialization memset must be
@@ -735,7 +740,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 	for (i = 0; i < size; i++) {
 		p[i] = kasan_slab_alloc(s, p[i], flags, init);
 		if (p[i] && init && !kasan_has_integrated_init())
-			memset(p[i], 0, s->object_size);
+			memset(p[i], 0, orig_size);
 		kmemleak_alloc_recursive(p[i], s->object_size, 1,
 					 s->flags, flags);
 	}
diff --git a/mm/slub.c b/mm/slub.c
index 1a806912b1a3..014513e0658f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -45,6 +45,21 @@
 
 #include "internal.h"
 
+static inline void dump_slub(struct kmem_cache *s)
+{
+	printk("Dump slab[%s] info:\n", s->name);
+	printk("flags=0x%lx, size=%d, obj_size=%d, offset=%d\n"
+		"oo=0x%x, inuse=%d, align=%d, red_left_pad=%d\n",
+		s->flags, s->size, s->object_size, s->offset,
+		s->oo.x, s->inuse, s->align, s->red_left_pad
+		);
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+	printk("cpu_partial=%d, cpu_partial_slabs=%d\n",
+		s->cpu_partial, s->cpu_partial_slabs);
+#endif
+	printk("\n");
+}
+
 /*
  * Lock order:
  *   1. slab_mutex (Global Mutex)
@@ -191,6 +206,12 @@ static inline bool kmem_cache_debug(struct kmem_cache *s)
 	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
 }
 
+static inline bool kmem_cache_debug_orig_size(struct kmem_cache *s)
+{
+	return (s->flags & SLAB_KMALLOC &&
+			s->flags & (SLAB_RED_ZONE | SLAB_STORE_USER));
+}
+
 void *fixup_red_left(struct kmem_cache *s, void *p)
 {
 	if (kmem_cache_debug_flags(s, SLAB_RED_ZONE))
@@ -833,7 +854,7 @@ static unsigned int get_orig_size(struct kmem_cache *s, void *object)
 {
 	void *p = kasan_reset_tag(object);
 
-	if (!(s->flags & SLAB_KMALLOC))
+	if (!kmem_cache_debug_orig_size(s))
 		return s->object_size;
 
 	p = object + get_info_end(s);
@@ -902,6 +923,9 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
 
+	if (kmem_cache_debug_orig_size(s))
+		off += sizeof(unsigned int);
+
 	off += kasan_metadata_size(s);
 
 	if (off != size_from_object(s))
@@ -958,13 +982,21 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
 static void init_object(struct kmem_cache *s, void *object, u8 val)
 {
 	u8 *p = kasan_reset_tag(object);
+	unsigned int orig_size = s->object_size;
 
 	if (s->flags & SLAB_RED_ZONE)
 		memset(p - s->red_left_pad, val, s->red_left_pad);
 
+	if (kmem_cache_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
+		/* Redzone the allocated by kmalloc but unused space */
+		orig_size = get_orig_size(s, object);
+		if (orig_size < s->object_size)
+			memset(p + orig_size, val, s->object_size - orig_size);
+	}
+
 	if (s->flags & __OBJECT_POISON) {
-		memset(p, POISON_FREE, s->object_size - 1);
-		p[s->object_size - 1] = POISON_END;
+		memset(p, POISON_FREE, orig_size - 1);
+		p[orig_size - 1] = POISON_END;
 	}
 
 	if (s->flags & SLAB_RED_ZONE)
@@ -1057,7 +1089,7 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
 
-	if (s->flags & SLAB_KMALLOC)
+	if (kmem_cache_debug_orig_size(s))
 		off += sizeof(unsigned int);
 
 	off += kasan_metadata_size(s);
@@ -1110,6 +1142,7 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
 {
 	u8 *p = object;
 	u8 *endobject = object + s->object_size;
+	unsigned int orig_size;
 
 	if (s->flags & SLAB_RED_ZONE) {
 		if (!check_bytes_and_report(s, slab, object, "Left Redzone",
@@ -1119,6 +1152,8 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
 		if (!check_bytes_and_report(s, slab, object, "Right Redzone",
 			endobject, val, s->inuse - s->object_size))
 			return 0;
+
+
 	} else {
 		if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
 			check_bytes_and_report(s, slab, p, "Alignment padding",
@@ -1127,7 +1162,23 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
 		}
 	}
 
+	#if 1
+	if (kmem_cache_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
+
+		orig_size = get_orig_size(s, object); 
+
+		if (s->object_size != orig_size  &&  
+			!check_bytes_and_report(s, slab, object, "kmalloc unused part",
+				p + orig_size, val, s->object_size - orig_size)) {
+			dump_slub(s);
+//			while (1);
+			return 0;
+		}
+	}
+	#endif
+
 	if (s->flags & SLAB_POISON) {
+
 		if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON) &&
 			(!check_bytes_and_report(s, slab, p, "Poison", p,
 					POISON_FREE, s->object_size - 1) ||
@@ -1367,7 +1418,7 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
 
-	if (s->flags & SLAB_KMALLOC)
+	if (kmem_cache_debug_orig_size(s))
 		set_orig_size(s, object, orig_size);
 
 	trace(s, slab, object, 1);
@@ -3276,7 +3327,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
 	init = slab_want_init_on_alloc(gfpflags, s);
 
 out:
-	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
+	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size);
 
 	return object;
 }
@@ -3769,11 +3820,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	 * Done outside of the IRQ disabled fastpath loop.
 	 */
 	slab_post_alloc_hook(s, objcg, flags, size, p,
-				slab_want_init_on_alloc(flags, s));
+				slab_want_init_on_alloc(flags, s), 0);
 	return i;
 error:
 	slub_put_cpu_ptr(s->cpu_slab);
-	slab_post_alloc_hook(s, objcg, flags, i, p, false);
+	slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
@@ -4155,12 +4206,12 @@ static int calculate_sizes(struct kmem_cache *s)
 		 */
 		size += 2 * sizeof(struct track);
 
-	/* Save the original requsted kmalloc size */
-	if (flags & SLAB_KMALLOC)
+	/* Save the original kmalloc request size */
+	if (kmem_cache_debug_orig_size(s))
 		size += sizeof(unsigned int);
 #endif
-
 	kasan_cache_create(s, &size, &s->flags);
+
 #ifdef CONFIG_SLUB_DEBUG
 	if (flags & SLAB_RED_ZONE) {
 		/*
Vlastimil Babka July 19, 2022, 2:39 p.m. UTC | #13
On 7/19/22 15:45, Feng Tang wrote:
> Hi Vlastimil,
> 
> On Fri, Jul 15, 2022 at 04:29:22PM +0800, Tang, Feng wrote:
> [...]
>> > >> - the knowledge of actual size could be used to improve poisoning checks as
>> > >> well, detect cases when there's buffer overrun over the orig_size but not
>> > >> cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
>> > >> it now, but with orig_size stored we could?
>> > > 
>> > > The above patch doesn't touch this. As I have a question, for the
>> > > [orib_size, object_size) area, shall we fill it with POISON_XXX no matter
>> > > REDZONE flag is set or not?
>> > 
>> > Ah, looks like we use redzoning, not poisoning, for padding from
>> > s->object_size to word boundary. So it would be more consistent to use the
>> > redzone pattern (RED_ACTIVE) and check with the dynamic orig_size. Probably
>> > no change for RED_INACTIVE handling is needed though.
>> 
>> Thanks for clarifying, will go this way and do more test. Also I'd 
>> make it a separate patch, as it is logically different from the space
>> wastage.
> 
> I made a draft to redzone the wasted space, which basically works (patch
> pasted at the end of the mail) as detecting corruption of below test code:
> 	
> 	size = 256;
> 	buf = kmalloc(size + 8, GFP_KERNEL);
> 	memset(buf + size + size/2, 0xff, size/4);
> 	print_section(KERN_ERR, "Corruptted-kmalloc-space", buf, size * 2);
> 	kfree(buf);
> 
> However when it is enabled globally, there are many places reporting
> corruption. I debugged one case, and found that the network(skb_buff)
> code already knows this "wasted" kmalloc space and utilize it which is
> detected by my patch.
> 
> The allocation stack is:
> 
> [    0.933675] BUG kmalloc-2k (Not tainted): kmalloc unused part overwritten
> [    0.933675] -----------------------------------------------------------------------------
> [    0.933675]
> [    0.933675] 0xffff888237d026c0-0xffff888237d026e3 @offset=9920. First byte 0x0 instead of 0xcc
> [    0.933675] Allocated in __alloc_skb+0x8e/0x1d0 age=5 cpu=0 pid=1
> [    0.933675]  __slab_alloc.constprop.0+0x52/0x90
> [    0.933675]  __kmalloc_node_track_caller+0x129/0x380
> [    0.933675]  kmalloc_reserve+0x2a/0x70
> [    0.933675]  __alloc_skb+0x8e/0x1d0
> [    0.933675]  audit_buffer_alloc+0x3a/0xc0
> [    0.933675]  audit_log_start.part.0+0xa3/0x300
> [    0.933675]  audit_log+0x62/0xc0
> [    0.933675]  audit_init+0x15c/0x16f
> 
> And the networking code which touches the [orig_size, object_size) area
> is in __build_skb_around(), which put a 'struct skb_shared_info' at the
> end of this area:
> 
> 	static void __build_skb_around(struct sk_buff *skb, void *data,
> 				       unsigned int frag_size)
> 	{
> 		struct skb_shared_info *shinfo;
> 		unsigned int size = frag_size ? : ksize(data);

Hmm, so it's a ksize() user, which should be a legitimate way to use the
"wasted" space. Hopefully it will then be enough to patch __ksize() to set
the object's tracked waste to 0 (i.e. orig_size to size) - assume that if
somebody called ksize(), they intend to use the space. That would also make
the debugfs report more truthful.
Feng Tang July 19, 2022, 3:03 p.m. UTC | #14
On Tue, Jul 19, 2022 at 04:39:58PM +0200, Vlastimil Babka wrote:
> On 7/19/22 15:45, Feng Tang wrote:
> > Hi Vlastimil,
> > 
> > On Fri, Jul 15, 2022 at 04:29:22PM +0800, Tang, Feng wrote:
> > [...]
> >> > >> - the knowledge of actual size could be used to improve poisoning checks as
> >> > >> well, detect cases when there's buffer overrun over the orig_size but not
> >> > >> cache's size. e.g. if you kmalloc(48) and overrun up to 64 we won't detect
> >> > >> it now, but with orig_size stored we could?
> >> > > 
> >> > > The above patch doesn't touch this. As I have a question, for the
> >> > > [orib_size, object_size) area, shall we fill it with POISON_XXX no matter
> >> > > REDZONE flag is set or not?
> >> > 
> >> > Ah, looks like we use redzoning, not poisoning, for padding from
> >> > s->object_size to word boundary. So it would be more consistent to use the
> >> > redzone pattern (RED_ACTIVE) and check with the dynamic orig_size. Probably
> >> > no change for RED_INACTIVE handling is needed though.
> >> 
> >> Thanks for clarifying, will go this way and do more test. Also I'd 
> >> make it a separate patch, as it is logically different from the space
> >> wastage.
> > 
> > I made a draft to redzone the wasted space, which basically works (patch
> > pasted at the end of the mail) as detecting corruption of below test code:
> > 	
> > 	size = 256;
> > 	buf = kmalloc(size + 8, GFP_KERNEL);
> > 	memset(buf + size + size/2, 0xff, size/4);
> > 	print_section(KERN_ERR, "Corruptted-kmalloc-space", buf, size * 2);
> > 	kfree(buf);
> > 
> > However when it is enabled globally, there are many places reporting
> > corruption. I debugged one case, and found that the network(skb_buff)
> > code already knows this "wasted" kmalloc space and utilize it which is
> > detected by my patch.
> > 
> > The allocation stack is:
> > 
> > [    0.933675] BUG kmalloc-2k (Not tainted): kmalloc unused part overwritten
> > [    0.933675] -----------------------------------------------------------------------------
> > [    0.933675]
> > [    0.933675] 0xffff888237d026c0-0xffff888237d026e3 @offset=9920. First byte 0x0 instead of 0xcc
> > [    0.933675] Allocated in __alloc_skb+0x8e/0x1d0 age=5 cpu=0 pid=1
> > [    0.933675]  __slab_alloc.constprop.0+0x52/0x90
> > [    0.933675]  __kmalloc_node_track_caller+0x129/0x380
> > [    0.933675]  kmalloc_reserve+0x2a/0x70
> > [    0.933675]  __alloc_skb+0x8e/0x1d0
> > [    0.933675]  audit_buffer_alloc+0x3a/0xc0
> > [    0.933675]  audit_log_start.part.0+0xa3/0x300
> > [    0.933675]  audit_log+0x62/0xc0
> > [    0.933675]  audit_init+0x15c/0x16f
> > 
> > And the networking code which touches the [orig_size, object_size) area
> > is in __build_skb_around(), which put a 'struct skb_shared_info' at the
> > end of this area:
> > 
> > 	static void __build_skb_around(struct sk_buff *skb, void *data,
> > 				       unsigned int frag_size)
> > 	{
> > 		struct skb_shared_info *shinfo;
> > 		unsigned int size = frag_size ? : ksize(data);
> 
> Hmm so it's a ksize() user, which should be legitimate way to use the
> "waste" data. Hopefully it should be then enough to patch __ksize() to set
> the object's tracked waste to 0 (orig_size to size) - assume that if
> somebody called ksize() they intend to use the space. That would also make
> the debugfs report more truthful.

Yep, it sounds good to me. I will chase the other corruption reports, and
hopefully they are all legitimate users too :)
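
Something like the following (untested) sketch is what I have in mind for
the ksize() side, reusing the kmem_cache_debug_orig_size()/set_orig_size()
helpers from the draft above (the helper name here is just illustrative):

	/*
	 * To be called from __ksize() once the slab cache is known: widen
	 * the recorded orig_size to the full object size, so that both the
	 * waste accounting and the new redzone check treat the trailing
	 * [orig_size, object_size) area as legitimately used.
	 */
	static inline void skip_orig_size_check(struct kmem_cache *s,
						const void *object)
	{
		if (kmem_cache_debug_orig_size(s))
			set_orig_size(s, (void *)object, s->object_size);
	}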

Thanks,
Feng

Patch

diff --git a/mm/slub.c b/mm/slub.c
index b1281b8654bd3..97304ea1e6aa5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -271,6 +271,7 @@  struct track {
 #endif
 	int cpu;		/* Was running on cpu */
 	int pid;		/* Pid context */
+	unsigned long waste;	/* memory waste for a kmalloc-ed object */
 	unsigned long when;	/* When did the operation occur */
 };
 
@@ -747,6 +748,7 @@  static inline depot_stack_handle_t set_track_prepare(void)
 
 static void set_track_update(struct kmem_cache *s, void *object,
 			     enum track_item alloc, unsigned long addr,
+			     unsigned long waste,
 			     depot_stack_handle_t handle)
 {
 	struct track *p = get_track(s, object, alloc);
@@ -758,14 +760,16 @@  static void set_track_update(struct kmem_cache *s, void *object,
 	p->cpu = smp_processor_id();
 	p->pid = current->pid;
 	p->when = jiffies;
+	p->waste = waste;
 }
 
 static __always_inline void set_track(struct kmem_cache *s, void *object,
-				      enum track_item alloc, unsigned long addr)
+				      enum track_item alloc, unsigned long addr,
+				      unsigned long waste)
 {
 	depot_stack_handle_t handle = set_track_prepare();
 
-	set_track_update(s, object, alloc, addr, handle);
+	set_track_update(s, object, alloc, addr, waste, handle);
 }
 
 static void init_tracking(struct kmem_cache *s, void *object)
@@ -1325,7 +1329,9 @@  static inline int alloc_consistency_checks(struct kmem_cache *s,
 
 static noinline int alloc_debug_processing(struct kmem_cache *s,
 					struct slab *slab,
-					void *object, unsigned long addr)
+					void *object, unsigned long addr,
+					unsigned long waste
+					)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!alloc_consistency_checks(s, slab, object))
@@ -1334,7 +1340,7 @@  static noinline int alloc_debug_processing(struct kmem_cache *s,
 
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
-		set_track(s, object, TRACK_ALLOC, addr);
+		set_track(s, object, TRACK_ALLOC, addr, waste);
 	trace(s, slab, object, 1);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;
@@ -1418,7 +1424,7 @@  static noinline int free_debug_processing(
 	}
 
 	if (s->flags & SLAB_STORE_USER)
-		set_track_update(s, object, TRACK_FREE, addr, handle);
+		set_track_update(s, object, TRACK_FREE, addr, 0, handle);
 	trace(s, slab, object, 0);
 	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
 	init_object(s, object, SLUB_RED_INACTIVE);
@@ -1661,7 +1667,8 @@  static inline
 void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct slab *slab, void *object, unsigned long addr) { return 0; }
+	struct slab *slab, void *object, unsigned long addr,
+	unsigned long waste) { return 0; }
 
 static inline int free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
@@ -2905,7 +2912,7 @@  static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
  * already disabled (which is the case for bulk allocation).
  */
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *freelist;
 	struct slab *slab;
@@ -3048,7 +3055,7 @@  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 check_new_slab:
 
 	if (kmem_cache_debug(s)) {
-		if (!alloc_debug_processing(s, slab, freelist, addr)) {
+		if (!alloc_debug_processing(s, slab, freelist, addr, s->object_size - orig_size)) {
 			/* Slab failed checks. Next slab needed */
 			goto new_slab;
 		} else {
@@ -3102,7 +3109,7 @@  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
  * pointer.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *p;
 
@@ -3115,7 +3122,7 @@  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif
 
-	p = ___slab_alloc(s, gfpflags, node, addr, c);
+	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
 #ifdef CONFIG_PREEMPT_COUNT
 	slub_put_cpu_ptr(s->cpu_slab);
 #endif
@@ -3206,7 +3213,7 @@  static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
 	 */
 	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
 	    unlikely(!object || !slab || !node_match(slab, node))) {
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
 
@@ -3731,7 +3738,7 @@  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			 * of re-populating per CPU c->freelist
 			 */
 			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
-					    _RET_IP_, c);
+					    _RET_IP_, c, s->object_size);
 			if (unlikely(!p[i]))
 				goto error;
 
@@ -5092,6 +5099,7 @@  struct location {
 	depot_stack_handle_t handle;
 	unsigned long count;
 	unsigned long addr;
+	unsigned long waste;
 	long long sum_time;
 	long min_time;
 	long max_time;
@@ -5142,7 +5150,7 @@  static int add_location(struct loc_track *t, struct kmem_cache *s,
 {
 	long start, end, pos;
 	struct location *l;
-	unsigned long caddr, chandle;
+	unsigned long caddr, chandle, cwaste;
 	unsigned long age = jiffies - track->when;
 	depot_stack_handle_t handle = 0;
 
@@ -5162,11 +5170,13 @@  static int add_location(struct loc_track *t, struct kmem_cache *s,
 		if (pos == end)
 			break;
 
-		caddr = t->loc[pos].addr;
-		chandle = t->loc[pos].handle;
-		if ((track->addr == caddr) && (handle == chandle)) {
+		l = &t->loc[pos];
+		caddr = l->addr;
+		chandle = l->handle;
+		cwaste = l->waste;
+		if ((track->addr == caddr) && (handle == chandle) &&
+			(track->waste == cwaste)) {
 
-			l = &t->loc[pos];
 			l->count++;
 			if (track->when) {
 				l->sum_time += age;
@@ -5191,6 +5201,9 @@  static int add_location(struct loc_track *t, struct kmem_cache *s,
 			end = pos;
 		else if (track->addr == caddr && handle < chandle)
 			end = pos;
+		else if (track->addr == caddr && handle == chandle &&
+				track->waste < cwaste)
+			end = pos;
 		else
 			start = pos;
 	}
@@ -5214,6 +5227,7 @@  static int add_location(struct loc_track *t, struct kmem_cache *s,
 	l->min_pid = track->pid;
 	l->max_pid = track->pid;
 	l->handle = handle;
+	l->waste = track->waste;
 	cpumask_clear(to_cpumask(l->cpus));
 	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 	nodes_clear(l->nodes);
@@ -6102,6 +6116,10 @@  static int slab_debugfs_show(struct seq_file *seq, void *v)
 		else
 			seq_puts(seq, "<not-available>");
 
+		if (l->waste)
+			seq_printf(seq, " waste=%lu/%lu",
+				l->count * l->waste, l->waste);
+
 		if (l->sum_time != l->min_time) {
 			seq_printf(seq, " age=%ld/%llu/%ld",
 				l->min_time, div_u64(l->sum_time, l->count),