[v4] kmemleak: survive in a low-memory situation

Message ID 20190327005948.24263-1-cai@lca.pw

Commit Message

Qian Cai March 27, 2019, 12:59 a.m. UTC
Kmemleak can quickly fail to allocate an object structure and then
disable itself in a low-memory situation, for example when running an
mmap() workload that triggers swapping and OOM. This is especially
problematic for test suites like LTP, where a single OOM test case can
disable kmemleak entirely and leave the rest of the test cases running
without any leak detection.

A kmemleak metadata allocation can fail even though the allocation of
the tracked memory itself succeeded. Hence, kmemleak could still try a
direct reclaim if it is not executing in an atomic context (spinlock
held, IRQ handler, etc.), or fall back to a high-priority allocation in
an atomic context as a last-ditch effort. Since kmemleak is a debugging
feature, it is unlikely to be used in production where memory is so
scarce that direct reclaim or high-priority atomic allocations should
not be granted lightly.

Unless some brave soul reimplements kmemleak to embed its metadata into
the tracked memory itself in the foreseeable future, this provides a
good balance between keeping kmemleak enabled in a low-memory situation
and not introducing too much hackiness into the existing code. Another
approach would be to fail the original allocation once kmemleak_alloc()
fails, but there are too many call sites to deal with, which makes it
error-prone.

kmemleak: Cannot allocate a kmemleak_object structure
kmemleak: Kernel memory leak detector disabled
kmemleak: Automatic memory scanning thread ended
RIP: 0010:__alloc_pages_nodemask+0x242a/0x2ab0
Call Trace:
 allocate_slab+0x4d9/0x930
 new_slab+0x46/0x70
 ___slab_alloc+0x5d3/0x9c0
 __slab_alloc+0x12/0x20
 kmem_cache_alloc+0x30a/0x360
 create_object+0x96/0x9a0
 kmemleak_alloc+0x71/0xa0
 kmem_cache_alloc+0x254/0x360
 mempool_alloc_slab+0x3f/0x60
 mempool_alloc+0x120/0x329
 bio_alloc_bioset+0x1a8/0x510
 get_swap_bio+0x107/0x470
 __swap_writepage+0xab4/0x1650
 swap_writepage+0x86/0xe0

Signed-off-by: Qian Cai <cai@lca.pw>
---

v4: Update the commit log.
    Fix a typo in comments per Christ.
    Consolidate the allocation.
v3: Update the commit log.
    Simplify the code inspired by graph_trace_open() from ftrace.
v2: Remove the needless checking for NULL objects in slab_post_alloc_hook()
    per Catalin.

 mm/kmemleak.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Michal Hocko March 27, 2019, 8:44 a.m. UTC | #1
On Tue 26-03-19 20:59:48, Qian Cai wrote:
[...]
> Unless there is a brave soul to reimplement the kmemleak to embed it's
> metadata into the tracked memory itself in a foreseeable future, this
> provides a good balance between enabling kmemleak in a low-memory
> situation and not introducing too much hackiness into the existing
> code for now. Another approach is to fail back the original allocation
> once kmemleak_alloc() failed, but there are too many call sites to
> deal with which makes it error-prone.

As long as there is an implicit __GFP_NOFAIL then kmemleak is simply
broken no matter what other gfp flags you play with. Has anybody looked
at some sort of preallocation where gfpflags_allow_blocking context
allocate objects into a pool that non-sleeping allocations can eat from?

> kmemleak: Cannot allocate a kmemleak_object structure
> kmemleak: Kernel memory leak detector disabled
> kmemleak: Automatic memory scanning thread ended
> RIP: 0010:__alloc_pages_nodemask+0x242a/0x2ab0
> Call Trace:
>  allocate_slab+0x4d9/0x930
>  new_slab+0x46/0x70
>  ___slab_alloc+0x5d3/0x9c0
>  __slab_alloc+0x12/0x20
>  kmem_cache_alloc+0x30a/0x360
>  create_object+0x96/0x9a0
>  kmemleak_alloc+0x71/0xa0
>  kmem_cache_alloc+0x254/0x360
>  mempool_alloc_slab+0x3f/0x60
>  mempool_alloc+0x120/0x329
>  bio_alloc_bioset+0x1a8/0x510
>  get_swap_bio+0x107/0x470
>  __swap_writepage+0xab4/0x1650
>  swap_writepage+0x86/0xe0
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
> 
> v4: Update the commit log.
>     Fix a typo in comments per Christ.
>     Consolidate the allocation.
> v3: Update the commit log.
>     Simplify the code inspired by graph_trace_open() from ftrace.
> v2: Remove the needless checking for NULL objects in slab_post_alloc_hook()
>     per Catalin.
> 
>  mm/kmemleak.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index a2d894d3de07..7f4545ab1f84 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -580,7 +580,16 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
>  	struct rb_node **link, *rb_parent;
>  	unsigned long untagged_ptr;
>  
> -	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
> +	/*
> +	 * The tracked memory was allocated successful, if the kmemleak object
> +	 * failed to allocate for some reasons, it ends up with the whole
> +	 * kmemleak disabled, so try it harder.
> +	 */
> +	gfp = (in_atomic() || irqs_disabled()) ?
> +	       gfp_kmemleak_mask(gfp) | GFP_ATOMIC :
> +	       gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM;


The comment for in_atomic says:
 * Are we running in atomic context?  WARNING: this macro cannot
 * always detect atomic context; in particular, it cannot know about
 * held spinlocks in non-preemptible kernels.  Thus it should not be
 * used in the general case to determine whether sleeping is possible.
 * Do not use in_atomic() in driver code.
Qian Cai March 27, 2019, 11:34 a.m. UTC | #2
On 3/27/19 4:44 AM, Michal Hocko wrote:
>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>> index a2d894d3de07..7f4545ab1f84 100644
>> --- a/mm/kmemleak.c
>> +++ b/mm/kmemleak.c
>> @@ -580,7 +580,16 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
>>  	struct rb_node **link, *rb_parent;
>>  	unsigned long untagged_ptr;
>>  
>> -	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
>> +	/*
>> +	 * The tracked memory was allocated successful, if the kmemleak object
>> +	 * failed to allocate for some reasons, it ends up with the whole
>> +	 * kmemleak disabled, so try it harder.
>> +	 */
>> +	gfp = (in_atomic() || irqs_disabled()) ?
>> +	       gfp_kmemleak_mask(gfp) | GFP_ATOMIC :
>> +	       gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM;
> 
> 
> The comment for in_atomic says:
>  * Are we running in atomic context?  WARNING: this macro cannot
>  * always detect atomic context; in particular, it cannot know about
>  * held spinlocks in non-preemptible kernels.  Thus it should not be
>  * used in the general case to determine whether sleeping is possible.
>  * Do not use in_atomic() in driver code.

That is why it needs both in_atomic() and irqs_disabled(), so that
irqs_disabled() can detect kernel functions that hold spinlocks even on
non-preemptible kernels.

According to [1],

"This [2] is useful if you know that the data in question is only ever
manipulated from a "process context", ie no interrupts involved."

Since kmemleak only deals with kernel context, if a spinlock is held,
local interrupts are always disabled.

ftrace is in the same boat; this commit, merged a while back, has the
same check:

ef99b88b16be
tracing: Handle ftrace_dump() atomic context in graph_trace_open()

[1] https://www.kernel.org/doc/Documentation/locking/spinlocks.txt
[2]
	spin_lock(&lock);
	...
	spin_unlock(&lock);
Michal Hocko March 27, 2019, 11:44 a.m. UTC | #3
On Wed 27-03-19 07:34:32, Qian Cai wrote:
> On 3/27/19 4:44 AM, Michal Hocko wrote:
> >> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> >> index a2d894d3de07..7f4545ab1f84 100644
> >> --- a/mm/kmemleak.c
> >> +++ b/mm/kmemleak.c
> >> @@ -580,7 +580,16 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
> >>  	struct rb_node **link, *rb_parent;
> >>  	unsigned long untagged_ptr;
> >>  
> >> -	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
> >> +	/*
> >> +	 * The tracked memory was allocated successful, if the kmemleak object
> >> +	 * failed to allocate for some reasons, it ends up with the whole
> >> +	 * kmemleak disabled, so try it harder.
> >> +	 */
> >> +	gfp = (in_atomic() || irqs_disabled()) ?
> >> +	       gfp_kmemleak_mask(gfp) | GFP_ATOMIC :
> >> +	       gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM;
> > 
> > 
> > The comment for in_atomic says:
> >  * Are we running in atomic context?  WARNING: this macro cannot
> >  * always detect atomic context; in particular, it cannot know about
> >  * held spinlocks in non-preemptible kernels.  Thus it should not be
> >  * used in the general case to determine whether sleeping is possible.
> >  * Do not use in_atomic() in driver code.
> 
> That is why it needs both in_atomic() and irqs_disabled(), so irqs_disabled()
> can detect kernel functions held spinlocks even in non-preemptible kernels.
> 
> According to [1],
> 
> "This [2] is useful if you know that the data in question is only ever
> manipulated from a "process context", ie no interrupts involved."
> 
> Since kmemleak only deal with kernel context, if a spinlock was held, it always
> has local interrupt disabled.

What? A normal spin lock implementation doesn't disable interrupts, so
either I misunderstand what you are saying or you seem to be confused.
The thing is that in_atomic() relies on preempt_count to work properly,
and if you have CONFIG_PREEMPT_COUNT=n then you simply never know whether
preemption is disabled, so you do not know that a spin_lock is held.
irqs_disabled(), on the other hand, checks whether the arch-specific flag
for IRQ handling is set (or cleared). So you would only catch IRQ-safe
spinlocks with the above check.

> ftrace is in the same boat where this commit was merged a while back that has
> the same check.
> 
> ef99b88b16be
> tracing: Handle ftrace_dump() atomic context in graph_trace_open()
> 
> [1] https://www.kernel.org/doc/Documentation/locking/spinlocks.txt
> [2]
> 	spin_lock(&lock);
> 	...
> 	spin_unlock(&lock);
Qian Cai March 27, 2019, 1:05 p.m. UTC | #4
On 3/27/19 7:44 AM, Michal Hocko wrote:
> What? Normal spin lock implementation doesn't disable interrupts. So
> either I misunderstand what you are saying or you seem to be confused.
> the thing is that in_atomic relies on preempt_count to work properly and
> if you have CONFIG_PREEMPT_COUNT=n then you simply never know whether
> preemption is disabled so you do not know that a spin_lock is held.
> irqs_disabled on the other hand checks whether arch specific flag for
> IRQs handling is set (or cleared). So you would only catch irq safe spin
> locks with the above check.

Exactly. Because kmemleak_alloc() is only called from a few call sites
(slab allocation, neigh_hash_alloc(), alloc_page_ext(), sg_kmalloc(),
early_amd_iommu_init() and blk_mq_alloc_rqs()), my review did not find
any of them holding IRQ-unsafe spinlocks.

Could future code changes suddenly call kmemleak_alloc() with an
IRQ-unsafe spinlock held? Always possible, but unlikely. I could add a
comment to kmemleak_alloc() about this, though.
Michal Hocko March 27, 2019, 1:17 p.m. UTC | #5
On Wed 27-03-19 09:05:31, Qian Cai wrote:
> On 3/27/19 7:44 AM, Michal Hocko wrote> What? Normal spin lock implementation
> doesn't disable interrupts. So
> > either I misunderstand what you are saying or you seem to be confused.
> > the thing is that in_atomic relies on preempt_count to work properly and
> > if you have CONFIG_PREEMPT_COUNT=n then you simply never know whether
> > preemption is disabled so you do not know that a spin_lock is held.
> > irqs_disabled on the other hand checks whether arch specific flag for
> > IRQs handling is set (or cleared). So you would only catch irq safe spin
> > locks with the above check.
> 
> Exactly, because kmemleak_alloc() is only called in a few call sites, slab
> allocation, neigh_hash_alloc(), alloc_page_ext(), sg_kmalloc(),
> early_amd_iommu_init() and blk_mq_alloc_rqs(), my review does not yield any of
> those holding irq unsafe spinlocks.

I do not understand. What about regular kmalloc(GFP_NOWAIT) callers with
a simple spinlock held?
Catalin Marinas March 27, 2019, 5:29 p.m. UTC | #6
On Wed, Mar 27, 2019 at 09:44:32AM +0100, Michal Hocko wrote:
> On Tue 26-03-19 20:59:48, Qian Cai wrote:
> [...]
> > Unless there is a brave soul to reimplement the kmemleak to embed it's
> > metadata into the tracked memory itself in a foreseeable future,

Revisiting the kmemleak memory scanning code, that's not actually
possible without some long periods with kmemleak_lock held. The scanner
relies on the kmemleak_object (refcounted) being around even when the
actual memory block has been freed.

> > this
> > provides a good balance between enabling kmemleak in a low-memory
> > situation and not introducing too much hackiness into the existing
> > code for now. Another approach is to fail back the original allocation
> > once kmemleak_alloc() failed, but there are too many call sites to
> > deal with which makes it error-prone.
> 
> As long as there is an implicit __GFP_NOFAIL then kmemleak is simply
> broken no matter what other gfp flags you play with. Has anybody looked
> at some sort of preallocation where gfpflags_allow_blocking context
> allocate objects into a pool that non-sleeping allocations can eat from?

Quick attempt below; it needs some more testing (the EMERGENCY_POOL_SIZE
value is a pretty random pick). Also, with __GFP_NOFAIL removed, are the
other flags safe or should we trim them further?

---------------8<-------------------------------
From dc4194539f8191bb754901cea74c86e7960886f8 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Wed, 27 Mar 2019 17:20:57 +0000
Subject: [PATCH] mm: kmemleak: Add an emergency allocation pool for kmemleak
 objects

This patch adds an emergency pool for struct kmemleak_object in case the
normal kmem_cache_alloc() fails under the gfp constraints passed by the
slab allocation caller. The patch also removes __GFP_NOFAIL which does
not play well with other gfp flags (introduced by commit d9570ee3bd1d,
"kmemleak: allow to coexist with fault injection").

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 mm/kmemleak.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 6c318f5ac234..366a680cff7c 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -127,7 +127,7 @@
 /* GFP bitmask for kmemleak internal allocations */
 #define gfp_kmemleak_mask(gfp)	(((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
 				 __GFP_NORETRY | __GFP_NOMEMALLOC | \
-				 __GFP_NOWARN | __GFP_NOFAIL)
+				 __GFP_NOWARN)
 
 /* scanning area inside a memory block */
 struct kmemleak_scan_area {
@@ -191,11 +191,16 @@ struct kmemleak_object {
 #define HEX_ASCII		1
 /* max number of lines to be printed */
 #define HEX_MAX_LINES		2
+/* minimum emergency pool size */
+#define EMERGENCY_POOL_SIZE	(NR_CPUS * 4)
 
 /* the list of all allocated objects */
 static LIST_HEAD(object_list);
 /* the list of gray-colored objects (see color_gray comment below) */
 static LIST_HEAD(gray_list);
+/* emergency pool allocation */
+static LIST_HEAD(emergency_list);
+static int emergency_pool_size;
 /* search tree for object boundaries */
 static struct rb_root object_tree_root = RB_ROOT;
 /* rw_lock protecting the access to object_list and object_tree_root */
@@ -467,6 +472,43 @@ static int get_object(struct kmemleak_object *object)
 	return atomic_inc_not_zero(&object->use_count);
 }
 
+/*
+ * Emergency pool allocation and freeing. kmemleak_lock must not be held.
+ */
+static struct kmemleak_object *emergency_alloc(void)
+{
+	unsigned long flags;
+	struct kmemleak_object *object;
+
+	write_lock_irqsave(&kmemleak_lock, flags);
+	object = list_first_entry_or_null(&emergency_list, typeof(*object), object_list);
+	if (object) {
+		list_del(&object->object_list);
+		emergency_pool_size--;
+	}
+	write_unlock_irqrestore(&kmemleak_lock, flags);
+
+	return object;
+}
+
+/*
+ * Return true if object added to the emergency pool, false otherwise.
+ */
+static bool emergency_free(struct kmemleak_object *object)
+{
+	unsigned long flags;
+
+	if (emergency_pool_size >= EMERGENCY_POOL_SIZE)
+		return false;
+
+	write_lock_irqsave(&kmemleak_lock, flags);
+	list_add(&object->object_list, &emergency_list);
+	emergency_pool_size++;
+	write_unlock_irqrestore(&kmemleak_lock, flags);
+
+	return true;
+}
+
 /*
  * RCU callback to free a kmemleak_object.
  */
@@ -485,7 +527,8 @@ static void free_object_rcu(struct rcu_head *rcu)
 		hlist_del(&area->node);
 		kmem_cache_free(scan_area_cache, area);
 	}
-	kmem_cache_free(object_cache, object);
+	if (!emergency_free(object))
+		kmem_cache_free(object_cache, object);
 }
 
 /*
@@ -577,6 +620,8 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 	unsigned long untagged_ptr;
 
 	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
+	if (!object)
+		object = emergency_alloc();
 	if (!object) {
 		pr_warn("Cannot allocate a kmemleak_object structure\n");
 		kmemleak_disable();
@@ -2127,6 +2172,16 @@ void __init kmemleak_init(void)
 			kmemleak_warning = 0;
 		}
 	}
+
+	/* populate the emergency allocation pool */
+	while (emergency_pool_size < EMERGENCY_POOL_SIZE) {
+		struct kmemleak_object *object;
+
+		object = kmem_cache_alloc(object_cache, GFP_KERNEL);
+		if (!object)
+			break;
+		emergency_free(object);
+	}
 }
 
 /*
Qian Cai March 27, 2019, 6:02 p.m. UTC | #7
On 3/27/19 1:29 PM, Catalin Marinas wrote:
> From dc4194539f8191bb754901cea74c86e7960886f8 Mon Sep 17 00:00:00 2001
> From: Catalin Marinas <catalin.marinas@arm.com>
> Date: Wed, 27 Mar 2019 17:20:57 +0000
> Subject: [PATCH] mm: kmemleak: Add an emergency allocation pool for kmemleak
>  objects
> 
> This patch adds an emergency pool for struct kmemleak_object in case the
> normal kmem_cache_alloc() fails under the gfp constraints passed by the
> slab allocation caller. The patch also removes __GFP_NOFAIL which does
> not play well with other gfp flags (introduced by commit d9570ee3bd1d,
> "kmemleak: allow to coexist with fault injection").
> 
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

It takes two runs of the LTP oom01 test to disable kmemleak.
Michal Hocko March 27, 2019, 6:17 p.m. UTC | #8
On Wed 27-03-19 17:29:57, Catalin Marinas wrote:
[...]
> Quick attempt below and it needs some more testing (pretty random pick
> of the EMERGENCY_POOL_SIZE value). Also, with __GFP_NOFAIL removed, are
> the other flags safe or we should trim them further?

I would still be careful about __GFP_NORETRY. I presume the primary
purpose is to prevent the OOM killer from triggering, but this makes
allocation failure much more likely. So if anything, __GFP_RETRY_MAYFAIL
would suit that purpose better. But I am not really sure this is worth
bothering with.

> ---------------8<-------------------------------
> From dc4194539f8191bb754901cea74c86e7960886f8 Mon Sep 17 00:00:00 2001
> From: Catalin Marinas <catalin.marinas@arm.com>
> Date: Wed, 27 Mar 2019 17:20:57 +0000
> Subject: [PATCH] mm: kmemleak: Add an emergency allocation pool for kmemleak
>  objects
> 
> This patch adds an emergency pool for struct kmemleak_object in case the
> normal kmem_cache_alloc() fails under the gfp constraints passed by the
> slab allocation caller. The patch also removes __GFP_NOFAIL which does
> not play well with other gfp flags (introduced by commit d9570ee3bd1d,
> "kmemleak: allow to coexist with fault injection").

Thank you! This is definitely a step in the right direction. Maybe the
pool allocation logic will need some tuning - e.g. does it make sense to
allocate into the pool from sleepable allocations, or is it sufficient
to refill on free? Something for real workloads to tell, I guess.

> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> ---
>  mm/kmemleak.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 6c318f5ac234..366a680cff7c 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -127,7 +127,7 @@
>  /* GFP bitmask for kmemleak internal allocations */
>  #define gfp_kmemleak_mask(gfp)	(((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
>  				 __GFP_NORETRY | __GFP_NOMEMALLOC | \
> -				 __GFP_NOWARN | __GFP_NOFAIL)
> +				 __GFP_NOWARN)
>  
>  /* scanning area inside a memory block */
>  struct kmemleak_scan_area {
> @@ -191,11 +191,16 @@ struct kmemleak_object {
>  #define HEX_ASCII		1
>  /* max number of lines to be printed */
>  #define HEX_MAX_LINES		2
> +/* minimum emergency pool size */
> +#define EMERGENCY_POOL_SIZE	(NR_CPUS * 4)
>  
>  /* the list of all allocated objects */
>  static LIST_HEAD(object_list);
>  /* the list of gray-colored objects (see color_gray comment below) */
>  static LIST_HEAD(gray_list);
> +/* emergency pool allocation */
> +static LIST_HEAD(emergency_list);
> +static int emergency_pool_size;
>  /* search tree for object boundaries */
>  static struct rb_root object_tree_root = RB_ROOT;
>  /* rw_lock protecting the access to object_list and object_tree_root */
> @@ -467,6 +472,43 @@ static int get_object(struct kmemleak_object *object)
>  	return atomic_inc_not_zero(&object->use_count);
>  }
>  
> +/*
> + * Emergency pool allocation and freeing. kmemleak_lock must not be held.
> + */
> +static struct kmemleak_object *emergency_alloc(void)
> +{
> +	unsigned long flags;
> +	struct kmemleak_object *object;
> +
> +	write_lock_irqsave(&kmemleak_lock, flags);
> +	object = list_first_entry_or_null(&emergency_list, typeof(*object), object_list);
> +	if (object) {
> +		list_del(&object->object_list);
> +		emergency_pool_size--;
> +	}
> +	write_unlock_irqrestore(&kmemleak_lock, flags);
> +
> +	return object;
> +}
> +
> +/*
> + * Return true if object added to the emergency pool, false otherwise.
> + */
> +static bool emergency_free(struct kmemleak_object *object)
> +{
> +	unsigned long flags;
> +
> +	if (emergency_pool_size >= EMERGENCY_POOL_SIZE)
> +		return false;
> +
> +	write_lock_irqsave(&kmemleak_lock, flags);
> +	list_add(&object->object_list, &emergency_list);
> +	emergency_pool_size++;
> +	write_unlock_irqrestore(&kmemleak_lock, flags);
> +
> +	return true;
> +}
> +
>  /*
>   * RCU callback to free a kmemleak_object.
>   */
> @@ -485,7 +527,8 @@ static void free_object_rcu(struct rcu_head *rcu)
>  		hlist_del(&area->node);
>  		kmem_cache_free(scan_area_cache, area);
>  	}
> -	kmem_cache_free(object_cache, object);
> +	if (!emergency_free(object))
> +		kmem_cache_free(object_cache, object);
>  }
>  
>  /*
> @@ -577,6 +620,8 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
>  	unsigned long untagged_ptr;
>  
>  	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
> +	if (!object)
> +		object = emergency_alloc();
>  	if (!object) {
>  		pr_warn("Cannot allocate a kmemleak_object structure\n");
>  		kmemleak_disable();
> @@ -2127,6 +2172,16 @@ void __init kmemleak_init(void)
>  			kmemleak_warning = 0;
>  		}
>  	}
> +
> +	/* populate the emergency allocation pool */
> +	while (emergency_pool_size < EMERGENCY_POOL_SIZE) {
> +		struct kmemleak_object *object;
> +
> +		object = kmem_cache_alloc(object_cache, GFP_KERNEL);
> +		if (!object)
> +			break;
> +		emergency_free(object);
> +	}
>  }
>  
>  /*
Matthew Wilcox March 27, 2019, 6:21 p.m. UTC | #9
On Wed, Mar 27, 2019 at 05:29:57PM +0000, Catalin Marinas wrote:
> On Wed, Mar 27, 2019 at 09:44:32AM +0100, Michal Hocko wrote:
> > As long as there is an implicit __GFP_NOFAIL then kmemleak is simply
> > broken no matter what other gfp flags you play with. Has anybody looked
> > at some sort of preallocation where gfpflags_allow_blocking context
> > allocate objects into a pool that non-sleeping allocations can eat from?
> 
> Quick attempt below and it needs some more testing (pretty random pick
> of the EMERGENCY_POOL_SIZE value). Also, with __GFP_NOFAIL removed, are
> the other flags safe or we should trim them further?

Why not use mempool?

>  #define gfp_kmemleak_mask(gfp)	(((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
>  				 __GFP_NORETRY | __GFP_NOMEMALLOC | \
> -				 __GFP_NOWARN | __GFP_NOFAIL)
> +				 __GFP_NOWARN)

Why __GFP_NORETRY?  And if I have specified one of the other retry
policies in my gfp flags, you should presumably clear that off before
setting __GFP_NORETRY.
Pekka Enberg March 28, 2019, 6:05 a.m. UTC | #10
Hi,

On 27/03/2019 2.59, Qian Cai wrote:
> Unless there is a brave soul to reimplement the kmemleak to embed it's
> metadata into the tracked memory itself in a foreseeable future, this
> provides a good balance between enabling kmemleak in a low-memory
> situation and not introducing too much hackiness into the existing
> code for now.

Unfortunately I am not that brave soul, but I'm wondering what the
complication here is? It shouldn't be too hard to teach
calculate_sizes() in SLUB about a new SLAB_KMEMLEAK flag that reserves
space for the metadata.

- Pekka
Catalin Marinas March 28, 2019, 10:30 a.m. UTC | #11
On Thu, Mar 28, 2019 at 08:05:31AM +0200, Pekka Enberg wrote:
> On 27/03/2019 2.59, Qian Cai wrote:
> > Unless there is a brave soul to reimplement the kmemleak to embed it's
> > metadata into the tracked memory itself in a foreseeable future, this
> > provides a good balance between enabling kmemleak in a low-memory
> > situation and not introducing too much hackiness into the existing
> > code for now.
> 
> Unfortunately I am not that brave soul, but I'm wondering what the
> complication here is? It shouldn't be too hard to teach calculate_sizes() in
> SLUB about a new SLAB_KMEMLEAK flag that reserves spaces for the metadata.

I don't think it's the calculate_sizes() that's the hard part. The way
kmemleak is designed assumes that the metadata has a longer lifespan
than the slab object it is tracking (and refcounted via
get_object/put_object()). We'd have to replace some of the
rcu_read_(un)lock() regions with a full kmemleak_lock together with a
few more tweaks to allow the release of kmemleak_lock during memory
scanning (which can take minutes; so it needs to be safe w.r.t. metadata
freeing, currently relying on a deferred RCU freeing).

Anyway, I think it is possible, just not straightforward.
Pekka Enberg March 28, 2019, 11:50 a.m. UTC | #12
Hi Catalin,

On 27/03/2019 2.59, Qian Cai wrote:
>>> Unless there is a brave soul to reimplement the kmemleak to embed it's
>>> metadata into the tracked memory itself in a foreseeable future, this
>>> provides a good balance between enabling kmemleak in a low-memory
>>> situation and not introducing too much hackiness into the existing
>>> code for now.

On Thu, Mar 28, 2019 at 08:05:31AM +0200, Pekka Enberg wrote:
>> Unfortunately I am not that brave soul, but I'm wondering what the
>> complication here is? It shouldn't be too hard to teach calculate_sizes() in
>> SLUB about a new SLAB_KMEMLEAK flag that reserves spaces for the metadata.

On 28/03/2019 12.30, Catalin Marinas wrote:
> I don't think it's the calculate_sizes() that's the hard part. The way
> kmemleak is designed assumes that the metadata has a longer lifespan
> than the slab object it is tracking (and refcounted via
> get_object/put_object()). We'd have to replace some of the
> rcu_read_(un)lock() regions with a full kmemleak_lock together with a
> few more tweaks to allow the release of kmemleak_lock during memory
> scanning (which can take minutes; so it needs to be safe w.r.t. metadata
> freeing, currently relying on a deferred RCU freeing).

Right.

I think SLUB already supports delayed object freeing because of KASAN
(see the slab_free_freelist_hook() function), so the issue of metadata
outliving the object is solvable (although it will consume more memory).

I can't say I remember enough details from kmemleak to comment on the 
locking complications you point out, though.

- Pekka
Catalin Marinas March 28, 2019, 2:59 p.m. UTC | #13
On Wed, Mar 27, 2019 at 11:21:58AM -0700, Matthew Wilcox wrote:
> On Wed, Mar 27, 2019 at 05:29:57PM +0000, Catalin Marinas wrote:
> > On Wed, Mar 27, 2019 at 09:44:32AM +0100, Michal Hocko wrote:
> > > As long as there is an implicit __GFP_NOFAIL then kmemleak is simply
> > > broken no matter what other gfp flags you play with. Has anybody looked
> > > at some sort of preallocation where gfpflags_allow_blocking context
> > > allocate objects into a pool that non-sleeping allocations can eat from?
> > 
> > Quick attempt below and it needs some more testing (pretty random pick
> > of the EMERGENCY_POOL_SIZE value). Also, with __GFP_NOFAIL removed, are
> > the other flags safe or we should trim them further?
> 
> Why not use mempool?

I had the wrong impression that it could sleep, but it only does so if
__GFP_DIRECT_RECLAIM is passed. See below for an updated patch.

> >  #define gfp_kmemleak_mask(gfp)	(((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
> >  				 __GFP_NORETRY | __GFP_NOMEMALLOC | \
> > -				 __GFP_NOWARN | __GFP_NOFAIL)
> > +				 __GFP_NOWARN)
> 
> Why GFP_NORETRY?  And if I have specified one of the other retry policies
> in my gfp flags, you should presumably clear that off before setting
> GFP_NORETRY.

It only preserves GFP_KERNEL|GFP_ATOMIC from the original flags
while setting the NOWARN|NORETRY|NOMEMALLOC (the same flags seem to be
set by mempool_alloc()). Anyway, with the changes below, I'll let
mempool add the relevant flags while kmemleak only passes
GFP_KERNEL|GFP_ATOMIC from the original caller.

-----------------------8<------------------------------------------
From 09eba8f0235eb16409931e6aad77a45a12bedc82 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Thu, 28 Mar 2019 13:26:07 +0000
Subject: [PATCH] mm: kmemleak: Use mempool allocations for kmemleak objects

This patch adds mempool allocations for struct kmemleak_object and
kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()
under memory pressure. The patch also masks out all the gfp flags passed
to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 mm/kmemleak.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 6c318f5ac234..9755678e83b9 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -82,6 +82,7 @@
 #include <linux/kthread.h>
 #include <linux/rbtree.h>
 #include <linux/fs.h>
+#include <linux/mempool.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 #include <linux/cpumask.h>
@@ -125,9 +126,7 @@
 #define BYTES_PER_POINTER	sizeof(void *)
 
 /* GFP bitmask for kmemleak internal allocations */
-#define gfp_kmemleak_mask(gfp)	(((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
-				 __GFP_NORETRY | __GFP_NOMEMALLOC | \
-				 __GFP_NOWARN | __GFP_NOFAIL)
+#define gfp_kmemleak_mask(gfp)	((gfp) & (GFP_KERNEL | GFP_ATOMIC))
 
 /* scanning area inside a memory block */
 struct kmemleak_scan_area {
@@ -191,6 +190,9 @@ struct kmemleak_object {
 #define HEX_ASCII		1
 /* max number of lines to be printed */
 #define HEX_MAX_LINES		2
+/* minimum memory pool sizes */
+#define MIN_OBJECT_POOL		(NR_CPUS * 4)
+#define MIN_SCAN_AREA_POOL	(NR_CPUS * 1)
 
 /* the list of all allocated objects */
 static LIST_HEAD(object_list);
@@ -203,7 +205,9 @@ static DEFINE_RWLOCK(kmemleak_lock);
 
 /* allocation caches for kmemleak internal data */
 static struct kmem_cache *object_cache;
+static mempool_t *object_mempool;
 static struct kmem_cache *scan_area_cache;
+static mempool_t *scan_area_mempool;
 
 /* set if tracing memory operations is enabled */
 static int kmemleak_enabled;
@@ -483,9 +487,9 @@ static void free_object_rcu(struct rcu_head *rcu)
 	 */
 	hlist_for_each_entry_safe(area, tmp, &object->area_list, node) {
 		hlist_del(&area->node);
-		kmem_cache_free(scan_area_cache, area);
+		mempool_free(area, scan_area_mempool);
 	}
-	kmem_cache_free(object_cache, object);
+	mempool_free(object, object_mempool);
 }
 
 /*
@@ -576,7 +580,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 	struct rb_node **link, *rb_parent;
 	unsigned long untagged_ptr;
 
-	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
+	object = mempool_alloc(object_mempool, gfp_kmemleak_mask(gfp));
 	if (!object) {
 		pr_warn("Cannot allocate a kmemleak_object structure\n");
 		kmemleak_disable();
@@ -640,7 +644,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 			 * be freed while the kmemleak_lock is held.
 			 */
 			dump_object_info(parent);
-			kmem_cache_free(object_cache, object);
+			mempool_free(object, object_mempool);
 			object = NULL;
 			goto out;
 		}
@@ -798,7 +802,7 @@ static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp)
 		return;
 	}
 
-	area = kmem_cache_alloc(scan_area_cache, gfp_kmemleak_mask(gfp));
+	area = mempool_alloc(scan_area_mempool, gfp_kmemleak_mask(gfp));
 	if (!area) {
 		pr_warn("Cannot allocate a scan area\n");
 		goto out;
@@ -810,7 +814,7 @@ static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp)
 	} else if (ptr + size > object->pointer + object->size) {
 		kmemleak_warn("Scan area larger than object 0x%08lx\n", ptr);
 		dump_object_info(object);
-		kmem_cache_free(scan_area_cache, area);
+		mempool_free(area, scan_area_mempool);
 		goto out_unlock;
 	}
 
@@ -2049,6 +2053,18 @@ void __init kmemleak_init(void)
 
 	object_cache = KMEM_CACHE(kmemleak_object, SLAB_NOLEAKTRACE);
 	scan_area_cache = KMEM_CACHE(kmemleak_scan_area, SLAB_NOLEAKTRACE);
+	if (!object_cache || !scan_area_cache) {
+		kmemleak_disable();
+		return;
+	}
+	object_mempool = mempool_create_slab_pool(MIN_OBJECT_POOL,
+						  object_cache);
+	scan_area_mempool = mempool_create_slab_pool(MIN_SCAN_AREA_POOL,
+						     scan_area_cache);
+	if (!object_mempool || !scan_area_mempool) {
+		kmemleak_disable();
+		return;
+	}
 
 	if (crt_early_log > ARRAY_SIZE(early_log))
 		pr_warn("Early log buffer exceeded (%d), please increase DEBUG_KMEMLEAK_EARLY_LOG_SIZE\n",
Catalin Marinas March 28, 2019, 3:05 p.m. UTC | #14
On Wed, Mar 27, 2019 at 02:02:27PM -0400, Qian Cai wrote:
> On 3/27/19 1:29 PM, Catalin Marinas wrote:
> > From dc4194539f8191bb754901cea74c86e7960886f8 Mon Sep 17 00:00:00 2001
> > From: Catalin Marinas <catalin.marinas@arm.com>
> > Date: Wed, 27 Mar 2019 17:20:57 +0000
> > Subject: [PATCH] mm: kmemleak: Add an emergency allocation pool for kmemleak
> >  objects
> > 
> > This patch adds an emergency pool for struct kmemleak_object in case the
> > normal kmem_cache_alloc() fails under the gfp constraints passed by the
> > slab allocation caller. The patch also removes __GFP_NOFAIL which does
> > not play well with other gfp flags (introduced by commit d9570ee3bd1d,
> > "kmemleak: allow to coexist with fault injection").
> > 
> > Suggested-by: Michal Hocko <mhocko@kernel.org>
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> It takes 2 runs of LTP oom01 tests to disable kmemleak.

What configuration are you using (number of CPUs, RAM)? I tried this on
an arm64 guest under kvm with 4 CPUs and 512MB of RAM, together with
fault injection on kmemleak_object cache and running oom01 several times
without any failures.
Qian Cai March 28, 2019, 3:24 p.m. UTC | #15
On Thu, 2019-03-28 at 14:59 +0000, Catalin Marinas wrote:
> +/* minimum memory pool sizes */
> +#define MIN_OBJECT_POOL		(NR_CPUS * 4)
> +#define MIN_SCAN_AREA_POOL	(NR_CPUS * 1)

I am thinking about making those tunable, so people could have a bigger pool
depending on their workloads.

Also, I would like to see a way to refill this emergency pool, for example
once the OOM killer has freed up more memory. Maybe the kmemleak_scan kthread
could monitor and refill the pool whenever possible.
Catalin Marinas March 28, 2019, 3:28 p.m. UTC | #16
On Thu, Mar 28, 2019 at 01:50:29PM +0200, Pekka Enberg wrote:
> On 27/03/2019 2.59, Qian Cai wrote:
> > > > Unless there is a brave soul to reimplement the kmemleak to embed its
> > > > metadata into the tracked memory itself in a foreseeable future, this
> > > > provides a good balance between enabling kmemleak in a low-memory
> > > > situation and not introducing too much hackiness into the existing
> > > > code for now.
> 
> On Thu, Mar 28, 2019 at 08:05:31AM +0200, Pekka Enberg wrote:
> > > Unfortunately I am not that brave soul, but I'm wondering what the
> > > complication here is? It shouldn't be too hard to teach calculate_sizes() in
> > > SLUB about a new SLAB_KMEMLEAK flag that reserves spaces for the metadata.
> 
> On 28/03/2019 12.30, Catalin Marinas wrote:
> > I don't think it's the calculate_sizes() that's the hard part. The way
> > kmemleak is designed assumes that the metadata has a longer lifespan
> > than the slab object it is tracking (and refcounted via
> > get_object/put_object()). We'd have to replace some of the
> > rcu_read_(un)lock() regions with a full kmemleak_lock together with a
> > few more tweaks to allow the release of kmemleak_lock during memory
> > scanning (which can take minutes; so it needs to be safe w.r.t. metadata
> > freeing, currently relying on a deferred RCU freeing).
> 
> Right.
> 
> I think SLUB already supports delaying object freeing because of KASAN (see
> the slab_free_freelist_hook() function) so the issue with metadata outliving
> the object is solvable (although it will consume more memory).

Thanks. That's definitely an area to investigate.

> I can't say I remember enough details from kmemleak to comment on the
> locking complications you point out, though.

They are not too bad, I'd just need a quiet couple of days to go
through them (which I don't have at the moment).
Qian Cai March 28, 2019, 3:41 p.m. UTC | #17
On Thu, 2019-03-28 at 15:05 +0000, Catalin Marinas wrote:
> > It takes 2 runs of LTP oom01 tests to disable kmemleak.
> 
> What configuration are you using (number of CPUs, RAM)? I tried this on
> an arm64 guest under kvm with 4 CPUs and 512MB of RAM, together with
> fault injection on kmemleak_object cache and running oom01 several times
> without any failures.

Apparently, the CPUs are so fast and the disk is so slow (for swapping) that
it ends up taking a long time for OOM to kick in.

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6126T CPU @ 2.60GHz
Stepping:            4
CPU MHz:             3300.002
BogoMIPS:            5200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            19712K
NUMA node0 CPU(s):   0-11,24-35
NUMA node1 CPU(s):   12-23,36-47

# free -m
              total        used        free      shared  buff/cache   available
Mem:         166206       31737      134063          33         406      133584
Swap:          4095           0        4095

# lspci | grep -i sata
00:11.5 SATA controller: Intel Corporation C620 Series Chipset Family SSATA
Controller [AHCI mode] (rev 08)
00:17.0 SATA controller: Intel Corporation C620 Series Chipset Family SATA
Controller [AHCI mode] (rev 08)
Michal Hocko March 29, 2019, 12:02 p.m. UTC | #18
On Thu 28-03-19 14:59:17, Catalin Marinas wrote:
[...]
> >From 09eba8f0235eb16409931e6aad77a45a12bedc82 Mon Sep 17 00:00:00 2001
> From: Catalin Marinas <catalin.marinas@arm.com>
> Date: Thu, 28 Mar 2019 13:26:07 +0000
> Subject: [PATCH] mm: kmemleak: Use mempool allocations for kmemleak objects
> 
> This patch adds mempool allocations for struct kmemleak_object and
> kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()
> under memory pressure. The patch also masks out all the gfp flags passed
> to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

Using mempool allocator is better than inventing its own implementation
but there is one thing to be slightly careful/worried about.

This allocator expects that somebody will refill the pool in a finite
time. Most users are OK with that because objects in flight are going
to return in the pool in a relatively short time (think of an IO) but
kmemleak is not guaranteed to comply with that AFAIU. Sure ephemeral
allocations are happening all the time so there should be some churn
in the pool all the time but if we go to an extreme where there is a
serious memory leak then I suspect we might get stuck here without any
way forward. Page/slab allocator would eventually back off even though
small allocations never fail because a user context would get killed
sooner or later but there is no fatal_signal_pending backoff in the
mempool alloc path.

Anyway, I believe this is a step in the right direction and should the
above ever materialize as a relevant problem we can tune the mempool
to back off for _some_ callers or do something similar.

Btw. there is kmemleak_update_trace call in mempool_alloc, is this ok
for the kmemleak allocation path?

> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> [...]
Catalin Marinas March 29, 2019, 4:16 p.m. UTC | #19
On Fri, Mar 29, 2019 at 01:02:37PM +0100, Michal Hocko wrote:
> On Thu 28-03-19 14:59:17, Catalin Marinas wrote:
> [...]
> > >From 09eba8f0235eb16409931e6aad77a45a12bedc82 Mon Sep 17 00:00:00 2001
> > From: Catalin Marinas <catalin.marinas@arm.com>
> > Date: Thu, 28 Mar 2019 13:26:07 +0000
> > Subject: [PATCH] mm: kmemleak: Use mempool allocations for kmemleak objects
> > 
> > This patch adds mempool allocations for struct kmemleak_object and
> > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()
> > under memory pressure. The patch also masks out all the gfp flags passed
> > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.
> 
> Using mempool allocator is better than inventing its own implementation
> but there is one thing to be slightly careful/worried about.
> 
> This allocator expects that somebody will refill the pool in a finite
> time. Most users are OK with that because objects in flight are going
> to return in the pool in a relatively short time (think of an IO) but
> kmemleak is not guaranteed to comply with that AFAIU. Sure ephemeral
> allocations are happening all the time so there should be some churn
> in the pool all the time but if we go to an extreme where there is a
> serious memory leak then I suspect we might get stuck here without any
> way forward. Page/slab allocator would eventually back off even though
> small allocations never fail because a user context would get killed
> sooner or later but there is no fatal_signal_pending backoff in the
> mempool alloc path.

We could improve the mempool code slightly to refill itself (from some
workqueue or during a mempool_alloc() which allows blocking) but it's
really just a best effort for a debug tool under OOM conditions. It may
be sufficient just to make the mempool size tunable (via
/sys/kernel/debug/kmemleak).

> Anyway, I believe this is a step in the right direction and should the
> above ever materializes as a relevant problem we can tune the mempool
> to backoff for _some_ callers or do something similar.
> 
> Btw. there is kmemleak_update_trace call in mempool_alloc, is this ok
> for the kmemleak allocation path?

It's not a problem, maybe only a small overhead from searching an rbtree
in kmemleak, but it cannot find anything since the kmemleak metadata is
not itself tracked. And this only happens if a normal allocation fails and
takes an existing object from the pool.

I thought about passing the mempool back into kmemleak and checking
whether it's one of the two pools it uses but concluded that it's not
worth it.
Michal Hocko April 1, 2019, 8:12 p.m. UTC | #20
On Fri 29-03-19 16:16:38, Catalin Marinas wrote:
> On Fri, Mar 29, 2019 at 01:02:37PM +0100, Michal Hocko wrote:
> > On Thu 28-03-19 14:59:17, Catalin Marinas wrote:
> > [...]
> > > >From 09eba8f0235eb16409931e6aad77a45a12bedc82 Mon Sep 17 00:00:00 2001
> > > From: Catalin Marinas <catalin.marinas@arm.com>
> > > Date: Thu, 28 Mar 2019 13:26:07 +0000
> > > Subject: [PATCH] mm: kmemleak: Use mempool allocations for kmemleak objects
> > > 
> > > This patch adds mempool allocations for struct kmemleak_object and
> > > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()
> > > under memory pressure. The patch also masks out all the gfp flags passed
> > > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.
> > 
> > Using mempool allocator is better than inventing its own implementation
> > but there is one thing to be slightly careful/worried about.
> > 
> > This allocator expects that somebody will refill the pool in a finite
> > time. Most users are OK with that because objects in flight are going
> > to return in the pool in a relatively short time (think of an IO) but
> > kmemleak is not guaranteed to comply with that AFAIU. Sure ephemeral
> > allocations are happening all the time so there should be some churn
> > in the pool all the time but if we go to an extreme where there is a
> > serious memory leak then I suspect we might get stuck here without any
> > way forward. Page/slab allocator would eventually back off even though
> > small allocations never fail because a user context would get killed
> > sooner or later but there is no fatal_signal_pending backoff in the
> > mempool alloc path.
> 
> We could improve the mempool code slightly to refill itself (from some
> workqueue or during a mempool_alloc() which allows blocking) but it's
> really just a best effort for a debug tool under OOM conditions. It may
> be sufficient just to make the mempool size tunable (via
> /sys/kernel/debug/kmemleak).

The point I've tried to make is that you really have to fail at some
point but mempool is fundamentally about non-failing as long as the
allocation is sleepable. And we cannot really break that assumption
because existing users really depend on it. But as I've said I would try
it out and see. This is just a debugging feature and I assume that a
really fatal oom caused by a real memory leak would be detected sooner
than the whole thing just blows up.
Catalin Marinas April 5, 2019, 4:43 p.m. UTC | #21
On Mon, Apr 01, 2019 at 10:12:01PM +0200, Michal Hocko wrote:
> On Fri 29-03-19 16:16:38, Catalin Marinas wrote:
> > On Fri, Mar 29, 2019 at 01:02:37PM +0100, Michal Hocko wrote:
> > > On Thu 28-03-19 14:59:17, Catalin Marinas wrote:
> > > [...]
> > > > >From 09eba8f0235eb16409931e6aad77a45a12bedc82 Mon Sep 17 00:00:00 2001
> > > > From: Catalin Marinas <catalin.marinas@arm.com>
> > > > Date: Thu, 28 Mar 2019 13:26:07 +0000
> > > > Subject: [PATCH] mm: kmemleak: Use mempool allocations for kmemleak objects
> > > > 
> > > > This patch adds mempool allocations for struct kmemleak_object and
> > > > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()
> > > > under memory pressure. The patch also masks out all the gfp flags passed
> > > > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.
> > > 
> > > Using mempool allocator is better than inventing its own implementation
> > > but there is one thing to be slightly careful/worried about.
> > > 
> > > This allocator expects that somebody will refill the pool in a finite
> > > time. Most users are OK with that because objects in flight are going
> > > to return in the pool in a relatively short time (think of an IO) but
> > > kmemleak is not guaranteed to comply with that AFAIU. Sure ephemeral
> > > allocations are happening all the time so there should be some churn
> > > in the pool all the time but if we go to an extreme where there is a
> > > serious memory leak then I suspect we might get stuck here without any
> > > way forward. Page/slab allocator would eventually back off even though
> > > small allocations never fail because a user context would get killed
> > > sooner or later but there is no fatal_signal_pending backoff in the
> > > mempool alloc path.
> > 
> > We could improve the mempool code slightly to refill itself (from some
> > workqueue or during a mempool_alloc() which allows blocking) but it's
> > really just a best effort for a debug tool under OOM conditions. It may
> > be sufficient just to make the mempool size tunable (via
> > /sys/kernel/debug/kmemleak).
> 
> The point I've tried to make is that you really have to fail at some
> point but mempool is fundamentally about non-failing as long as the
> allocation is sleepable. And we cannot really break that assumption
> because existing users really depend on it. But as I've said I would try
> it out and see. This is just a debugging feature and I assume that a
> really fatal oom caused by a real memory leak would be detected sooner
> than the whole thing just blows up.

I'll first push a patch to use mempool as it is, with a tunable size via
/sys/kernel/debug/kmemleak. I think the better solution would be a
rewrite of the metadata handling in kmemleak to embed it into the slab
object (as per Pekka's suggestion). However, I'll be on holiday until
the 15th, so cannot look into this.

Thanks.

Patch
diff mbox series

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index a2d894d3de07..7f4545ab1f84 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -580,7 +580,16 @@  static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 	struct rb_node **link, *rb_parent;
 	unsigned long untagged_ptr;
 
-	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
+	/*
+	 * The tracked memory was allocated successfully; if the kmemleak
+	 * object allocation fails for some reason, the whole kmemleak ends
+	 * up disabled, so try harder.
+	 */
+	gfp = (in_atomic() || irqs_disabled()) ?
+	       gfp_kmemleak_mask(gfp) | GFP_ATOMIC :
+	       gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM;
+
+	object = kmem_cache_alloc(object_cache, gfp);
 	if (!object) {
 		pr_warn("Cannot allocate a kmemleak_object structure\n");
 		kmemleak_disable();