diff mbox series

[RFC] mm/kmemleak: Fix sleeping function called from invalid context at print message

Message ID 20241115145410.114376-1-acarmina@redhat.com (mailing list archive)
State New
Headers show
Series [RFC] mm/kmemleak: Fix sleeping function called from invalid context at print message | expand

Commit Message

Alessandro Carminati Nov. 15, 2024, 2:54 p.m. UTC
Address a bug in the kernel that triggers a "sleeping function called from
invalid context" warning when /sys/kernel/debug/kmemleak is printed under
specific conditions:
- CONFIG_PREEMPT_RT=y
- Set SELinux as the LSM for the system
- Set kptr_restrict to 1
- kmemleak buffer contains at least one item
Ensure the kmemleak buffer contains at least one item
Commit 8c96f1bc6fc49c724c4cdd22d3e99260263b7384 ("mm/kmemleak: turn
kmemleak_lock and object->lock to raw_spinlock_t") introduced a change
where kmemleak_seq_show is executed in atomic context within the RT kernel.
However, the SELinux capability check in this function flow still relies on
regular spinlocks, leading to potential race conditions that trigger an
error when printing the kmemleak backtrace.
Move the backtrace printing out of the critical section. Use a stack
variable to store the backtrace pointers, avoiding spinlocks in the atomic
context.
Implement delta encoding to minimize the stack memory footprint,
addressing the potentially large memory demands for storing these pointers
on 64-bit systems.

Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
---
```
[  159.247069] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[  159.247193] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat
[  159.247241] preempt_count: 1, expected: 0
[  159.247277] RCU nest depth: 2, expected: 2
[  159.247388] 6 locks held by cat/136:
[  159.247438]  #0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30
[  159.248835]  #1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128
[  159.249053]  #3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0
[  159.249127]  #4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0
[  159.249205]  #5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0
[  159.249364] irq event stamp: 136660
[  159.249407] hardirqs last  enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8
[  159.249465] hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0
[  159.249518] softirqs last  enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8
[  159.249571] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  159.249970] Preemption disabled at:
[  159.249988] [<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0
[  159.250609] CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G            E      6.11.0-rt7+ #34
[  159.250797] Tainted: [E]=UNSIGNED_MODULE
[  159.250822] Hardware name: linux,dummy-virt (DT)
[  159.251050] Call trace:
[  159.251079]  dump_backtrace+0xa0/0x128
[  159.251132]  show_stack+0x1c/0x30
[  159.251156]  dump_stack_lvl+0xe8/0x198
[  159.251180]  dump_stack+0x18/0x20
[  159.251227]  rt_spin_lock+0x8c/0x1a8
[  159.251273]  avc_perm_nonode+0xa0/0x150
[  159.251316]  cred_has_capability.isra.0+0x118/0x218
[  159.251340]  selinux_capable+0x50/0x80
[  159.251363]  security_capable+0x7c/0xd0
[  159.251388]  has_ns_capability_noaudit+0x94/0x1b0
[  159.251412]  has_capability_noaudit+0x20/0x30
[  159.251437]  restricted_pointer+0x21c/0x4b0
[  159.251461]  pointer+0x298/0x760
[  159.251482]  vsnprintf+0x330/0xf70
[  159.251504]  seq_printf+0x178/0x218
[  159.251526]  print_unreferenced+0x1a4/0x2d0
[  159.251551]  kmemleak_seq_show+0xd0/0x1e0
[  159.251576]  seq_read_iter+0x354/0xe30
[  159.251599]  seq_read+0x250/0x378
[  159.251622]  full_proxy_read+0xd8/0x148
[  159.251649]  vfs_read+0x190/0x918
[  159.251672]  ksys_read+0xf0/0x1e0
[  159.251693]  __arm64_sys_read+0x70/0xa8
[  159.251716]  invoke_syscall.constprop.0+0xd4/0x1d8
[  159.251767]  el0_svc+0x50/0x158
[  159.251813]  el0t_64_sync+0x17c/0x180
```
I have considered three potential approaches to address this matter:

1. Remove Raw Pointer Printing
The simplest solution is to eliminate raw pointer printing from the report.
This approach involves minimal changes to the kernel code and is 
straightforward to implement.

While I am confident that omitting the raw address would result in
negligible information loss in most scenarios, some may perceive it as a
feature regression. Below is an example of the modification:
```
- warn_or_seq_printf(seq, "    [<%pK>] %pS\n", ptr, ptr);
+ warn_or_seq_printf(seq, "    %pS\n", ptr);
```
This change may be acceptable since the %pS format outputs a hex string
if no kallsyms are available. However, it modifies the original behavior,
and in the kallsyms scenario, the raw pointer would no longer be present.

2. Modify SELinux to Avoid Sleeping Spinlocks
Another option is to alter the SELinux capability check to use
non-sleeping spinlocks.
However, this approach is not advisable. The SELinux capability check is
extensively used across the kernel and is far more critical than the
kmemleak reporting feature.
Adapting it to address this rare issue could unnecessarily introduce
latency across the entire kernel, particularly as kmemleak is rarely used
in production environments.

3. Move Stack Trace Printing Outside the Atomic Section
The third and preferred approach is to move the stack trace printing
outside the atomic section. This would preserve the current functionality
without modifying SELinux.

The primary challenge here lies in making the backtrace pointers available
after exiting the critical section, as they are captured within it.
To address this, the backtrace pointers can be copied to a safe location,
enabling access once the raw_spinlock is released.

Options for Creating a Safe Location for Backtrace Pointers
Several strategies have been considered for storing the backtrace pointers
safely:
* Dynamic Allocation
    * Allocating memory with kmalloc cannot be done within a raw_spinlock
      area. Using GFP_ATOMIC is also infeasible.
    * Since the code that prints the message is inside a loop, executed
      potentially multiple times, it is only within the raw_spinlock 
      section that we can determine whether allocation is needed.
    * Allocating and deallocating memory on every loop iteration would be
      prohibitively expensive.
* Global Data Section
    * In this strategy, the message would be printed after exiting the
      raw_spinlock protected section.
    * However, this approach risks data corruption if another occurrence
      of the issue arises before the first operation completes.
* Per-CPU Data
    * The same concerns as with global data apply here. While data 
      corruption is less likely, it is not impossible.
* Stack
    * Using the stack is the best option since each thread has its own 
      stack, ensuring data isolation. However, the size of the data poses
      a challenge.
    * Exporting a full stack trace pointer list requires considerable space.
      A 32-level stack trace in a 64-bit system would require 256 bytes, 
      which is contrary to best practices for stack size management.

To mitigate this, I propose using delta encoding to store the addresses.
This method reduces the size of each address from 8 bytes to 4 bytes on
64-bit systems. While this introduces some complexity, it significantly
reduces memory usage and allows us to preserve the kmemleak reports in their
current form.
 mm/kmemleak.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 7 deletions(-)
diff mbox series

Patch

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 0400f5e8ac60..fc5869e09280 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -274,6 +274,44 @@  static void kmemleak_disable(void);
 		pr_warn(fmt, ##__VA_ARGS__);		\
 } while (0)
 
+#define PTR_STORAGE_OP_OK	-1
+#define PTR_STORAGE_OP_FAIL	0
+#define PTR_STORAGE_CAPACITY	32
+
+struct ptr_storage {
+	unsigned long	base;
+	u32		data[PTR_STORAGE_CAPACITY];
+	int		nr_entries;
+};
+
+static int ptr_storage_insert(unsigned long p, struct ptr_storage *s)
+{
+	unsigned long diff_data;
+
+	if (s->nr_entries != 0) {
+		diff_data = s->base - p;
+		if (s->nr_entries < PTR_STORAGE_CAPACITY) {
+			s->data[((s->nr_entries - 1))] = diff_data & 0xffffffff;
+			s->nr_entries++;
+			return PTR_STORAGE_OP_OK;
+		}
+		return PTR_STORAGE_OP_FAIL;
+	}
+	s->base = p;
+	s->nr_entries++;
+	return PTR_STORAGE_OP_OK;
+}
+
+static void *ptr_storage_get(struct ptr_storage *s, int item_no)
+{
+	if (item_no < s->nr_entries && item_no > 0)
+		return (void *)s->base - (s32)s->data[(item_no - 1)];
+
+	if (item_no == 0)
+		return (void *)s->base;
+	return NULL;
+}
+
 static void warn_or_seq_hex_dump(struct seq_file *seq, int prefix_type,
 				 int rowsize, int groupsize, const void *buf,
 				 size_t len, bool ascii)
@@ -357,11 +395,13 @@  static bool unreferenced_object(struct kmemleak_object *object)
  * print_unreferenced function must be called with the object->lock held.
  */
 static void print_unreferenced(struct seq_file *seq,
-			       struct kmemleak_object *object)
+			       struct kmemleak_object *object,
+			       struct ptr_storage *s)
 {
 	int i;
 	unsigned long *entries;
 	unsigned int nr_entries;
+	unsigned long tmp;
 
 	nr_entries = stack_depot_fetch(object->trace_handle, &entries);
 	warn_or_seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n",
@@ -372,8 +412,8 @@  static void print_unreferenced(struct seq_file *seq,
 	warn_or_seq_printf(seq, "  backtrace (crc %x):\n", object->checksum);
 
 	for (i = 0; i < nr_entries; i++) {
-		void *ptr = (void *)entries[i];
-		warn_or_seq_printf(seq, "    [<%pK>] %pS\n", ptr, ptr);
+		tmp = (unsigned long)entries[i];
+		ptr_storage_insert(tmp, s);
 	}
 }
 
@@ -1625,6 +1665,10 @@  static void kmemleak_scan(void)
 	struct zone *zone;
 	int __maybe_unused i;
 	int new_leaks = 0;
+	struct ptr_storage s = {0};
+	bool do_print = false;
+	void *tmp;
+	int inx;
 
 	jiffies_last_scan = jiffies;
 
@@ -1783,12 +1827,20 @@  static void kmemleak_scan(void)
 		    !(object->flags & OBJECT_REPORTED)) {
 			object->flags |= OBJECT_REPORTED;
 
-			if (kmemleak_verbose)
-				print_unreferenced(NULL, object);
+			if (kmemleak_verbose) {
+				print_unreferenced(NULL, object, &s);
+				do_print = true;
+			}
 
 			new_leaks++;
 		}
 		raw_spin_unlock_irq(&object->lock);
+		if (kmemleak_verbose && do_print) {
+			for (inx = 0; inx < s.nr_entries; inx++) {
+				tmp = ptr_storage_get(&s, i);
+				warn_or_seq_printf(NULL, "    [<%pK>] %pS\n", tmp, tmp);
+			}
+		}
 	}
 	rcu_read_unlock();
 
@@ -1939,11 +1991,23 @@  static int kmemleak_seq_show(struct seq_file *seq, void *v)
 {
 	struct kmemleak_object *object = v;
 	unsigned long flags;
+	struct ptr_storage s = {0};
+	void *tmp;
+	int i;
+	bool do_print = false;
 
 	raw_spin_lock_irqsave(&object->lock, flags);
-	if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object))
-		print_unreferenced(seq, object);
+	if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object)) {
+		print_unreferenced(seq, object, &s);
+		do_print = true;
+	}
 	raw_spin_unlock_irqrestore(&object->lock, flags);
+	if (do_print) {
+		for (i = 0; i < s.nr_entries; i++) {
+			tmp = ptr_storage_get(&s, i);
+			warn_or_seq_printf(seq, "    [<%pK>] %pS\n", tmp, tmp);
+		}
+	}
 	return 0;
 }