tracing: Do not allow mmap() of persistent ring buffer

Message ID 20250213180737.061871ae@gandalf.local.home
State Superseded
Series tracing: Do not allow mmap() of persistent ring buffer

Commit Message

Steven Rostedt Feb. 13, 2025, 11:07 p.m. UTC
From: Steven Rostedt <rostedt@goodmis.org>

When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:

 BUG: unable to handle page fault for address: ffffe97bd00025c8
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 2862f3067 P4D 2862f3067 PUD 0
 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
 CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:validate_page_before_insert+0x5/0xb0
 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
 FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x1f
  ? __die+0x2e/0x40
  ? page_fault_oops+0x157/0x2b0
  ? search_module_extables+0x53/0x80
  ? validate_page_before_insert+0x5/0xb0
  ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
  ? __bad_area_nosemaphore+0x16e/0x1b0
  ? bad_area_nosemaphore+0x16/0x20
  ? do_kern_addr_fault+0x77/0x90
  ? exc_page_fault+0x22b/0x230
  ? asm_exc_page_fault+0x2b/0x30
  ? validate_page_before_insert+0x5/0xb0
  ? vm_insert_pages+0x151/0x400
  __rb_map_vma+0x21f/0x3f0
  ring_buffer_map+0x21b/0x2f0
  tracing_buffers_mmap+0x70/0xd0
  __mmap_region+0x6f0/0xbd0
  mmap_region+0x7f/0x130
  do_mmap+0x475/0x610
  vm_mmap_pgoff+0xf2/0x1d0
  ksys_mmap_pgoff+0x166/0x200
  __x64_sys_mmap+0x37/0x50
  x64_sys_call+0x1670/0x1d70
  do_syscall_64+0xbb/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

The reason was that the code that maps the ring buffer pages to user space
has:

	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);

And uses that in:

	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

But virt_to_page() does not work with vmap()'d memory, which is what the
persistent ring buffer uses. It is rather trivial to allow this, but for
now just disable mmap() of instances that get their ring buffer from the
reserve_mem option.
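
For reference, a minimal sketch of what the mapping path would need in
order to handle vmap()'d buffers (this is the virt_addr_valid() /
vmalloc_to_page() approach that comes up later in this thread; it is not
part of this patch):

	/*
	 * Sketch only: virt_to_page() is only valid for addresses in the
	 * kernel's linear map. vmap()'d addresses have to be resolved
	 * with vmalloc_to_page() instead.
	 */
	if (virt_addr_valid(cpu_buffer->subbuf_ids[s]))
		page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
	else
		page = vmalloc_to_page((void *)cpu_buffer->subbuf_ids[s]);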

Cc: stable@vger.kernel.org
Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Masami Hiramatsu (Google) Feb. 14, 2025, 2:07 a.m. UTC | #1
On Thu, 13 Feb 2025 18:07:37 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When trying to mmap a trace instance buffer that is attached to
> reserve_mem, it would crash:
> 
>  BUG: unable to handle page fault for address: ffffe97bd00025c8
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 2862f3067 P4D 2862f3067 PUD 0
>  Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
>  CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>  RIP: 0010:validate_page_before_insert+0x5/0xb0
>  Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
>  RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
>  RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
>  RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
>  RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
>  R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
>  R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
>  FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
>  Call Trace:
>   <TASK>
>   ? __die_body.cold+0x19/0x1f
>   ? __die+0x2e/0x40
>   ? page_fault_oops+0x157/0x2b0
>   ? search_module_extables+0x53/0x80
>   ? validate_page_before_insert+0x5/0xb0
>   ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
>   ? __bad_area_nosemaphore+0x16e/0x1b0
>   ? bad_area_nosemaphore+0x16/0x20
>   ? do_kern_addr_fault+0x77/0x90
>   ? exc_page_fault+0x22b/0x230
>   ? asm_exc_page_fault+0x2b/0x30
>   ? validate_page_before_insert+0x5/0xb0
>   ? vm_insert_pages+0x151/0x400
>   __rb_map_vma+0x21f/0x3f0
>   ring_buffer_map+0x21b/0x2f0
>   tracing_buffers_mmap+0x70/0xd0
>   __mmap_region+0x6f0/0xbd0
>   mmap_region+0x7f/0x130
>   do_mmap+0x475/0x610
>   vm_mmap_pgoff+0xf2/0x1d0
>   ksys_mmap_pgoff+0x166/0x200
>   __x64_sys_mmap+0x37/0x50
>   x64_sys_call+0x1670/0x1d70
>   do_syscall_64+0xbb/0x1d0
>   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> The reason was that the code that maps the ring buffer pages to user space
> has:
> 
> 	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
> 
> And uses that in:
> 
> 	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> 
> But virt_to_page() does not work with vmap()'d memory, which is what the
> persistent ring buffer uses. It is rather trivial to allow this, but for
> now just disable mmap() of instances that get their ring buffer from the
> reserve_mem option.

This may be good for fixing the crash in the short term, but it may not
allow us to read the subbufs which were read right before the crash. Can
we also add an option to reset the read pointer from the previous boot
for the persistent ring buffer?

Thank you,

> 
> Cc: stable@vger.kernel.org
> Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance")
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  kernel/trace/trace.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 25ff37aab00f..72895500155d 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -8279,6 +8279,10 @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
>  	struct trace_iterator *iter = &info->iter;
>  	int ret = 0;
>  
> +	/* Currently the boot mapped buffer is not supported for mmap */
> +	if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
> +		return -EINVAL;
> +
>  	ret = get_snapshot_map(iter->tr);
>  	if (ret)
>  		return ret;
> -- 
> 2.47.2
>
Steven Rostedt Feb. 14, 2025, 2:21 a.m. UTC | #2
On Fri, 14 Feb 2025 11:07:22 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> This may be good for fixing the crash in the short term, but it may not
> allow us to read the subbufs which were read right before the crash. Can
> we also add an option to reset the read pointer from the previous boot
> for the persistent ring buffer?

I'm not sure what you mean here.

Note, this is just for mmapping the buffers. We never tried it before
now; if we had, it would have crashed. But this does not affect normal
reads.

-- Steve
Masami Hiramatsu (Google) Feb. 14, 2025, 7:13 a.m. UTC | #3
On Thu, 13 Feb 2025 21:21:47 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 14 Feb 2025 11:07:22 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > This may be good for fixing the crash in the short term, but it may not
> > allow us to read the subbufs which were read right before the crash. Can
> > we also add an option to reset the read pointer from the previous boot
> > for the persistent ring buffer?
> 
> I'm not sure what you mean here.
> 
> Note, this is just for mmapping the buffers. We never tried it before
> now; if we had, it would have crashed. But this does not affect normal
> reads.

But reading the buffer via mmap() is already supported. Can we read all the
pages that had been read via trace_pipe_raw in the previous boot again on
this boot?
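
(For context, a minimal user-space sketch of that supported path; this
assumes the mapping ABI documented in Documentation/trace/ring-buffer-map.rst,
where the meta page is mapped at offset 0 and its layout comes from
<linux/trace_mmap.h>:)

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <linux/trace_mmap.h>

	int main(void)
	{
		struct trace_buffer_meta *meta;
		long psize = sysconf(_SC_PAGESIZE);
		int fd;

		fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
			  O_RDONLY | O_NONBLOCK);
		if (fd < 0)
			return 1;

		/* The first page of the mapping is the meta page */
		meta = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
		if (meta == MAP_FAILED)	/* what a persistent buffer now hits */
			return 1;

		printf("nr_subbufs: %u, subbuf_size: %u\n",
		       meta->nr_subbufs, meta->subbuf_size);
		munmap(meta, psize);
		close(fd);
		return 0;
	}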

Anyway, I made another patch (below) that fixes this by allowing mmap() on
the persistent ring buffer. I thought I could get the phys_addr from the
struct vm_area_struct, but that does not work for a vmap()'d area. So this
stores the phys_addr in the trace_buffer.

Thanks,

commit d1d703d1db64c13146542612f7fd4dd85904f682
Author: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Date:   Fri Feb 14 15:38:12 2025 +0900

    tracing: Fix to allow mmap() on persistent ring buffer
    
    The kernel crashes when mmap()ing the persistent ring buffer (per-cpu
    trace_pipe_raw) as below:
    
    [   68.539135] ------------[ cut here ]------------
    [   68.540126] kernel BUG at arch/x86/mm/physaddr.c:28!
    [   68.541178] Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [   68.542259] CPU: 6 UID: 0 PID: 115 Comm: mmap Not tainted 6.14.0-rc2 #14
    [   68.543473] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
    [   68.545649] RIP: 0010:__phys_addr+0x56/0x60
    [   68.546608] Code: 89 c2 48 d3 ea 48 85 d2 75 1f c3 cc cc cc cc cc 48 81 f9 00 00 00 20 73 13 48 03 0d d4 66 1c 01 48 89 c8 c3 cc cc cc cc cc 90 <0f> 0b 90 0f 0b 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90
    [   68.550479] RSP: 0018:ffffc900018639f0 EFLAGS: 00010202
    [   68.551364] RAX: 0000408000001000 RBX: ffff888004111000 RCX: 0000000000000028
    [   68.552609] RDX: 0000000000000040 RSI: ffff88800737a000 RDI: ffffc90000001000
    [   68.553768] RBP: 000000000000027c R08: 0000000000000dc0 R09: 00000000ffffffff
    [   68.555177] R10: ffffffff816002ea R11: 0000000000000000 R12: ffff88800737a000
    [   68.556376] R13: 0000000000000000 R14: ffffea0000000000 R15: 0000000000000001
    [   68.557550] FS:  000055558036a400(0000) GS:ffff88807db80000(0000) knlGS:0000000000000000
    [   68.558963] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   68.559970] CR2: 00007f73a9e96b74 CR3: 0000000005e2a000 CR4: 00000000000006b0
    [   68.561354] Call Trace:
    [   68.562077]  <TASK>
    [   68.562657]  ? __die_body+0x6a/0xb0
    [   68.563385]  ? die+0xa4/0xd0
    [   68.563956]  ? do_trap+0xa6/0x170
    [   68.564727]  ? __phys_addr+0x56/0x60
    [   68.565547]  ? do_error_trap+0xc7/0x120
    [   68.566385]  ? __phys_addr+0x56/0x60
    [   68.567170]  ? handle_invalid_op+0x2c/0x40
    [   68.568025]  ? __phys_addr+0x56/0x60
    [   68.568801]  ? exc_invalid_op+0x39/0x50
    [   68.569576]  ? asm_exc_invalid_op+0x1a/0x20
    [   68.570329]  ? __create_object+0x3a/0xf0
    [   68.571120]  ? __phys_addr+0x56/0x60
    [   68.571858]  __rb_map_vma+0x280/0x320
    [   68.572544]  ? _raw_spin_unlock_irqrestore+0x40/0x60
    [   68.573356]  ring_buffer_map+0x25a/0x340
    [   68.573989]  tracing_buffers_mmap+0x8a/0xe0
    [   68.574702]  mmap_region+0x898/0xd30
    [   68.575369]  do_mmap+0x4ac/0x5e0
    [   68.575920]  ? down_write_killable+0xb5/0xf0
    [   68.576610]  vm_mmap_pgoff+0xc4/0x1a0
    [   68.577218]  ksys_mmap_pgoff+0x181/0x200
    [   68.577848]  do_syscall_64+0xec/0x1d0
    [   68.578448]  ? exc_page_fault+0x92/0x110
    [   68.579092]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
    
    Since the persistent ring buffer maps reserved memory in the vmap area,
    virt_to_page() can not be used on it. But we can use the offset of each
    subbuf page from the physical start address of the reserved memory to
    identify its PFN, because the reserved memory is mapped linearly.
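
    As a sketch, assuming a subbuf virtual address subbuf_vaddr that lies
    in the vmap()'d range starting at virt_start, whose physical base is
    phys_start (the names here are illustrative):

    	/* Sketch: a linear mapping preserves offsets between the two */
    	pfn  = (phys_start + (subbuf_vaddr - virt_start)) >> PAGE_SHIFT;
    	page = pfn_to_page(pfn);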
    
    Cc: stable@vger.kernel.org
    Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance")
    Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 17fbb7855295..2d0902e3d194 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -92,6 +92,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
 struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags,
 					       int order, unsigned long start,
 					       unsigned long range_size,
+					       phys_addr_t phys_addr,
 					       struct lock_class_key *key);
 
 bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
@@ -113,11 +114,11 @@ bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
  * ring buffer's lock class separate.
  */
-#define ring_buffer_alloc_range(size, flags, order, start, range_size)	\
+#define ring_buffer_alloc_range(size, flags, order, start, range_size, phys_addr)	\
 ({									\
 	static struct lock_class_key __key;				\
 	__ring_buffer_alloc_range((size), (flags), (order), (start),	\
-				  (range_size), &__key);		\
+				  (range_size), (phys_addr), &__key);	\
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b8e0ae15ca5b..cb2380aadbdf 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -549,6 +549,7 @@ struct trace_buffer {
 
 	unsigned long			range_addr_start;
 	unsigned long			range_addr_end;
+	phys_addr_t			range_addr_phys;
 
 	long				last_text_delta;
 	long				last_data_delta;
@@ -2290,6 +2291,7 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 					 int order, unsigned long start,
 					 unsigned long end,
+					 phys_addr_t phys_start,
 					 struct lock_class_key *key)
 {
 	struct trace_buffer *buffer;
@@ -2370,6 +2372,7 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 		nr_pages--;
 		buffer->range_addr_start = start;
 		buffer->range_addr_end = end;
+		buffer->range_addr_phys = phys_start;
 
 		rb_range_meta_init(buffer, nr_pages);
 	} else {
@@ -2424,7 +2427,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
 					struct lock_class_key *key)
 {
 	/* Default buffer page size - one system page */
-	return alloc_buffer(size, flags, 0, 0, 0,key);
+	return alloc_buffer(size, flags, 0, 0, 0, 0, key);
 
 }
 EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
@@ -2446,9 +2449,10 @@ EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
 struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags,
 					       int order, unsigned long start,
 					       unsigned long range_size,
+					       phys_addr_t phys_start,
 					       struct lock_class_key *key)
 {
-	return alloc_buffer(size, flags, order, start, start + range_size, key);
+	return alloc_buffer(size, flags, order, start, start + range_size, phys_start, key);
 }
 
 /**
@@ -6966,10 +6970,11 @@ static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
 static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 			struct vm_area_struct *vma)
 {
-	unsigned long nr_subbufs, nr_pages, nr_vma_pages, pgoff = vma->vm_pgoff;
+	unsigned long nr_subbufs, nr_pages, nr_vma_pages, base, pgoff = vma->vm_pgoff;
 	unsigned int subbuf_pages, subbuf_order;
 	struct page **pages;
 	int p = 0, s = 0;
+	phys_addr_t pa;
 	int err;
 
 	/* Refuse MP_PRIVATE or writable mappings */
@@ -7032,6 +7037,8 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 		s += pgoff / subbuf_pages;
 	}
 
+	pa = cpu_buffer->buffer->range_addr_phys;
+	base = cpu_buffer->buffer->range_addr_start;
 	while (p < nr_pages) {
 		struct page *page;
 		int off = 0;
@@ -7041,7 +7048,10 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 			goto out;
 		}
 
-		page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
+		if (pa)
+			page = pfn_to_page((pa + cpu_buffer->subbuf_ids[s] - base) >> PAGE_SHIFT);
+		else
+			page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
 
 		for (; off < (1 << (subbuf_order)); off++, page++) {
 			if (p >= nr_pages)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1496a5ac33ae..5c87a4fa601a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9210,7 +9210,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size
 	if (tr->range_addr_start && tr->range_addr_size) {
 		buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
 						      tr->range_addr_start,
-						      tr->range_addr_size);
+						      tr->range_addr_size,
+						      tr->range_addr_phys);
 
 		ring_buffer_last_boot_delta(buf->buffer,
 					    &tr->text_delta, &tr->data_delta);
@@ -9364,7 +9365,8 @@ static int trace_array_create_dir(struct trace_array *tr)
 static struct trace_array *
 trace_array_create_systems(const char *name, const char *systems,
 			   unsigned long range_addr_start,
-			   unsigned long range_addr_size)
+			   unsigned long range_addr_size,
+			   phys_addr_t range_addr_phys)
 {
 	struct trace_array *tr;
 	int ret;
@@ -9393,6 +9395,7 @@ trace_array_create_systems(const char *name, const char *systems,
 	/* Only for boot up memory mapped ring buffers */
 	tr->range_addr_start = range_addr_start;
 	tr->range_addr_size = range_addr_size;
+	tr->range_addr_phys = range_addr_phys;
 
 	tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS;
 
@@ -9455,7 +9458,7 @@ trace_array_create_systems(const char *name, const char *systems,
 
 static struct trace_array *trace_array_create(const char *name)
 {
-	return trace_array_create_systems(name, NULL, 0, 0);
+	return trace_array_create_systems(name, NULL, 0, 0, 0);
 }
 
 static int instance_mkdir(const char *name)
@@ -9533,7 +9536,7 @@ struct trace_array *trace_array_get_by_name(const char *name, const char *system
 		}
 	}
 
-	tr = trace_array_create_systems(name, systems, 0, 0);
+	tr = trace_array_create_systems(name, systems, 0, 0, 0);
 
 	if (IS_ERR(tr))
 		tr = NULL;
@@ -10443,7 +10446,7 @@ __init static void enable_instances(void)
 				do_allocate_snapshot(name);
 		}
 
-		tr = trace_array_create_systems(name, NULL, addr, size);
+		tr = trace_array_create_systems(name, NULL, addr, size, start);
 		if (IS_ERR(tr)) {
 			pr_warn("Tracing: Failed to create instance buffer %s\n", curr_str);
 			continue;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9c21ba45b7af..7a7db029915f 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -348,6 +348,7 @@ struct trace_array {
 	unsigned int		mapped;
 	unsigned long		range_addr_start;
 	unsigned long		range_addr_size;
+	phys_addr_t		range_addr_phys;
 	long			text_delta;
 	long			data_delta;
Steven Rostedt Feb. 14, 2025, 12:07 p.m. UTC | #4
On Fri, 14 Feb 2025 16:13:32 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Thu, 13 Feb 2025 21:21:47 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Fri, 14 Feb 2025 11:07:22 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >   
> > > This may be good for fixing the crash in the short term, but it may not
> > > allow us to read the subbufs which were read right before the crash. Can
> > > we also add an option to reset the read pointer from the previous boot
> > > for the persistent ring buffer?
> > 
> > I'm not sure what you mean here.
> > 
> > Note, this is just for mmapping the buffers. We never tried it before
> > now; if we had, it would have crashed. But this does not affect normal
> > reads.
> 
> But reading the buffer via mmap() is already supported. Can we read all the
> pages that had been read via trace_pipe_raw in the previous boot again on
> this boot?

It's not supported. If you try it, it will crash. This prevents reading via
mmap() on a boot buffer. I don't know what you are asking. Once this patch
is applied, mmap() will always fail on the boot buffer before or after you
start it.

> 
> Anyway, I made another patch (below) that fixes this by allowing mmap() on
> the persistent ring buffer. I thought I could get the phys_addr from the
> struct vm_area_struct, but that does not work for a vmap()'d area. So this
> stores the phys_addr in the trace_buffer.

Which is way too complex and intrusive. If you want to allow mapping, all
it needs is this:

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 07b421115692..9339adc88ad5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5927,12 +5941,18 @@ static void rb_clear_buffer_page(struct buffer_page *page)
 static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
+	struct page *page;
 
 	if (!meta)
 		return;
 
 	meta->reader.read = cpu_buffer->reader_page->read;
-	meta->reader.id = cpu_buffer->reader_page->id;
+	/* For boot buffers, the id is the index */
+	if (cpu_buffer->ring_meta)
+		meta->reader.id = rb_meta_subbuf_idx(cpu_buffer->ring_meta,
+						     cpu_buffer->reader_page->page);
+	else
+		meta->reader.id = cpu_buffer->reader_page->id;
 	meta->reader.lost_events = cpu_buffer->lost_events;
 
 	meta->entries = local_read(&cpu_buffer->entries);
@@ -5940,7 +5960,12 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
 	meta->read = cpu_buffer->read;
 
 	/* Some archs do not have data cache coherency between kernel and user-space */
-	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
+	if (virt_addr_valid(cpu_buffer->meta_page))
+		page = virt_to_page(cpu_buffer->meta_page);
+	else
+		page = vmalloc_to_page(cpu_buffer->meta_page);
+
+	flush_dcache_folio(page_folio(page));
 }
 
 static void
@@ -7041,7 +7066,10 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 			goto out;
 		}
 
-		page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
+		if (virt_addr_valid(cpu_buffer->subbuf_ids[s]))
+			page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
+		else
+			page = vmalloc_to_page((void *)cpu_buffer->subbuf_ids[s]);
 
 		for (; off < (1 << (subbuf_order)); off++, page++) {
 			if (p >= nr_pages)
@@ -7187,6 +7215,7 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 	unsigned long missed_events;
 	unsigned long reader_size;
 	unsigned long flags;
+	struct page *page;
 
 	cpu_buffer = rb_get_mapped_buffer(buffer, cpu);
 	if (IS_ERR(cpu_buffer))
@@ -7255,7 +7291,12 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 
 out:
 	/* Some archs do not have data cache coherency between kernel and user-space */
-	flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page));
+	if (virt_addr_valid(cpu_buffer->reader_page->page))
+		page = virt_to_page(cpu_buffer->reader_page->page);
+	else
+		page = vmalloc_to_page(cpu_buffer->reader_page->page);
+
+	flush_dcache_folio(page_folio(page));
 
 	rb_update_meta_page(cpu_buffer);
 
But this still doesn't work, as the index is still off (I'm still fixing it).

The persistent ring buffer uses the page->id for one thing but the user
space mmap() uses it for something else.
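
As a sketch of the mismatch, user space treats meta->reader.id as an index
into the mmap()'d sub-buffer array, roughly as below (data_start here is
assumed to be the first data byte after the meta page; the names are
illustrative):

	/* Sketch only: locate the reader sub-buffer in the user mapping */
	char *reader = data_start + meta->reader.id * meta->subbuf_size;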

-- Steve
Masami Hiramatsu (Google) Feb. 14, 2025, 2:36 p.m. UTC | #5
On Fri, 14 Feb 2025 07:07:12 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 14 Feb 2025 16:13:32 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > On Thu, 13 Feb 2025 21:21:47 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > On Fri, 14 Feb 2025 11:07:22 +0900
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > >   
> > > > This may be good for fixing the crash in the short term, but it may not
> > > > allow us to read the subbufs which were read right before the crash. Can
> > > > we also add an option to reset the read pointer from the previous boot
> > > > for the persistent ring buffer?
> > > 
> > > I'm not sure what you mean here.
> > > 
> > > Note, this is just for mmapping the buffers. We never tried it before
> > > now; if we had, it would have crashed. But this does not affect normal
> > > reads.
> > 
> > But reading the buffer via mmap() is already supported. Can we read all the
> > pages that had been read via trace_pipe_raw in the previous boot again on
> > this boot?
> 
> It's not supported. If you try it, it will crash. This prevents reading via
> mmap() on a boot buffer. I don't know what you are asking. Once this patch
> is applied, mmap() will always fail on the boot buffer before or after you
> start it.

Hmm, I meant it is supported for other, non-persistent ring buffers, isn't it?

> 
> > 
> > Anyway, I made another patch (below) that fixes this by allowing mmap() on
> > the persistent ring buffer. I thought I could get the phys_addr from the
> > struct vm_area_struct, but that does not work for a vmap()'d area. So this
> > stores the phys_addr in the trace_buffer.
> 
> Which is way too complex and intrusive. If you want to allow mapping, all
> it needs is this:

Ah, vmalloc_to_page() works, I see.

> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 07b421115692..9339adc88ad5 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5927,12 +5941,18 @@ static void rb_clear_buffer_page(struct buffer_page *page)
>  static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>  {
>  	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> +	struct page *page;
>  
>  	if (!meta)
>  		return;
>  
>  	meta->reader.read = cpu_buffer->reader_page->read;
> -	meta->reader.id = cpu_buffer->reader_page->id;
> +	/* For boot buffers, the id is the index */
> +	if (cpu_buffer->ring_meta)
> +		meta->reader.id = rb_meta_subbuf_idx(cpu_buffer->ring_meta,
> +						     cpu_buffer->reader_page->page);
> +	else
> +		meta->reader.id = cpu_buffer->reader_page->id;
>  	meta->reader.lost_events = cpu_buffer->lost_events;
>  
>  	meta->entries = local_read(&cpu_buffer->entries);
> @@ -5940,7 +5960,12 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	meta->read = cpu_buffer->read;
>  
>  	/* Some archs do not have data cache coherency between kernel and user-space */
> -	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
> +	if (virt_addr_valid(cpu_buffer->meta_page))
> +		page = virt_to_page(cpu_buffer->meta_page);
> +	else
> +		page = vmalloc_to_page(cpu_buffer->meta_page);
> +
> +	flush_dcache_folio(page_folio(page));
>  }
>  
>  static void
> @@ -7041,7 +7066,10 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
>  			goto out;
>  		}
>  
> -		page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
> +		if (virt_addr_valid(cpu_buffer->subbuf_ids[s]))
> +			page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
> +		else
> +			page = vmalloc_to_page((void *)cpu_buffer->subbuf_ids[s]);
>  
>  		for (; off < (1 << (subbuf_order)); off++, page++) {
>  			if (p >= nr_pages)
> @@ -7187,6 +7215,7 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
>  	unsigned long missed_events;
>  	unsigned long reader_size;
>  	unsigned long flags;
> +	struct page *page;
>  
>  	cpu_buffer = rb_get_mapped_buffer(buffer, cpu);
>  	if (IS_ERR(cpu_buffer))
> @@ -7255,7 +7291,12 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
>  
>  out:
>  	/* Some archs do not have data cache coherency between kernel and user-space */
> -	flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page));
> +	if (virt_addr_valid(cpu_buffer->reader_page->page))
> +		page = virt_to_page(cpu_buffer->reader_page->page);
> +	else
> +		page = vmalloc_to_page(cpu_buffer->reader_page->page);
> +
> +	flush_dcache_folio(page_folio(page));
>  
>  	rb_update_meta_page(cpu_buffer);
>  
> But this still doesn't work, as the index is still off (I'm still fixing it).

Ah, OK. Thanks for fixing it.

> 
> The persistent ring buffer uses the page->id for one thing but the user
> space mmap() uses it for something else.

Thank you,

> 
> -- Steve
Steven Rostedt Feb. 14, 2025, 2:59 p.m. UTC | #6
On Fri, 14 Feb 2025 23:36:13 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > 
> > It's not supported. If you try it, it will crash. This prevents reading via
> > mmap() on a boot buffer. I don't know what you are asking. Once this patch
> > is applied, mmap() will always fail on the boot buffer before or after you
> > start it.  
> 
> Hmm, I meant it is supported for other, non-persistent ring buffers, isn't it?

Correct. It is supported in other buffers, but it just isn't supported in
the persistent one.

This patch only disables mmap if it's trying to mmap a persistent one.

I guess I don't understand your concern.

-- Steve
Masami Hiramatsu (Google) Feb. 15, 2025, 3:37 p.m. UTC | #7
On Fri, 14 Feb 2025 09:59:43 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 14 Feb 2025 23:36:13 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > 
> > > It's not supported. If you try it, it will crash. This prevents reading via
> > > mmap() on a boot buffer. I don't know what you are asking. Once this patch
> > > is applied, mmap() will always fail on the boot buffer before or after you
> > > start it.  
> > 
> > Hmm, I meant it is supported for other, non-persistent ring buffers, isn't it?
> 
> Correct. It is supported in other buffers, but it just isn't supported in
> the persistent one.
> 
> This patch only disables mmap if it's trying to mmap a persistent one.
> 
> I guess I don't understand your concern.

My concern is related to the fixes policy. If this is a "fix", we will
backport the new "disable mmap on the persistent ring buffer" limitation
to the stable kernels (a limitation that was not documented previously).

However, from the user's point of view, mmap() of the ring buffers is
already supported (although it does not work on stable kernels for now).
Thus I think the expected "fix" is "fix mmap() of the persistent ring
buffer".

Thank you,
Steven Rostedt Feb. 15, 2025, 4:21 p.m. UTC | #8
On Sun, 16 Feb 2025 00:37:02 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> My concern is related to the fixes policy. If this is a "fix", we will
> backport the new "disable mmap on the persistent ring buffer" limitation
> to the stable kernels (a limitation that was not documented previously).
> 
> However, from the user's point of view, mmap() of the ring buffers is
> already supported (although it does not work on stable kernels for now).
> Thus I think the expected "fix" is "fix mmap() of the persistent ring
> buffer".

This only disables mmapping of the persistent ring buffer. Other ring
buffers can be mapped. We never supported mmapping the persistent ring
buffer. Even in stable kernels, if you mmap it, it will crash just like
it does now. Thus, this doesn't cause any regressions. It's a fix even
for stable kernels.

Or did virt_to_page() change recently such that this wasn't always the case?

-- Steve
Steven Rostedt Feb. 15, 2025, 4:45 p.m. UTC | #9
On Sat, 15 Feb 2025 11:21:53 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 16 Feb 2025 00:37:02 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > My concern is related to the fixes policy. If this is a "fix", we will
> > backport the new "disable mmap on the persistent ring buffer" limitation
> > to the stable kernels (a limitation that was not documented previously).
> > 
> > However, from the user's point of view, mmap() of the ring buffers is
> > already supported (although it does not work on stable kernels for now).
> > Thus I think the expected "fix" is "fix mmap() of the persistent ring
> > buffer".
> 
> This only disables mmapping of the persistent ring buffer. Other ring
> buffers can be mapped. We never supported mmapping the persistent ring
> buffer. Even in stable kernels, if you mmap it, it will crash just like
> it does now. Thus, this doesn't cause any regressions. It's a fix even
> for stable kernels.
> 
> Or did virt_to_page() change recently such that this wasn't always the case?
> 

Although the Fixes tag is wrong, as the persistent ring buffer didn't
even exist then. It should be:

Fixes: 9b7bdf6f6ece6 ("tracing: Have trace_printk not use binary prints if boot buffer")

As that's what added the BOOT flag and is in the same kernel version
that added the persistent ring buffer.

-- Steve
Masami Hiramatsu (Google) Feb. 18, 2025, 3:14 p.m. UTC | #10
On Sat, 15 Feb 2025 11:45:40 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 15 Feb 2025 11:21:53 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Sun, 16 Feb 2025 00:37:02 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > 
> > > My concern is related to the fixes policy. If this is a "fix", we will
> > > backport the new "disable mmap on the persistent ring buffer" limitation
> > > to the stable kernels (a limitation that was not documented previously).
> > > 
> > > However, from the user's point of view, mmap() of the ring buffers is
> > > already supported (although it does not work on stable kernels for now).
> > > Thus I think the expected "fix" is "fix mmap() of the persistent ring
> > > buffer".
> > 
> > This only disables mmapping of the persistent ring buffer. Other ring
> > buffers can be mapped. We never supported mmapping the persistent ring
> > buffer. Even in stable kernels, if you mmap it, it will crash just like
> > it does now. Thus, this doesn't cause any regressions. It's a fix even
> > for stable kernels.
> > 
> > Or did virt_to_page() change recently such that this wasn't always the case?
> > 
> 
> Although the Fixes tag is wrong, as the persistent ring buffer didn't
> even exist then. It should be:
> 
> Fixes: 9b7bdf6f6ece6 ("tracing: Have trace_printk not use binary prints if boot buffer")
> 
> As that's what added the BOOT flag and is in the same kernel version
> that added the persistent ring buffer.

OK, so this fix is for the limitation of buffers which have
TRACE_ARRAY_FL_BOOT. When this flag was introduced, it should also
have prohibited the mmap.

Thank you,

> 
> -- Steve
Steven Rostedt Feb. 18, 2025, 3:21 p.m. UTC | #11
On Wed, 19 Feb 2025 00:14:24 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > As that's what added the BOOT flag and is in the same kernel version
> > that added the persistent ring buffer.  
> 
> OK, so this fix is for the limitation of buffers which have
> TRACE_ARRAY_FL_BOOT. When this flag was introduced, it should also
> have prohibited the mmap.

Correct.

-- Steve

Patch

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 25ff37aab00f..72895500155d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8279,6 +8279,10 @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
 	struct trace_iterator *iter = &info->iter;
 	int ret = 0;
 
+	/* Currently the boot mapped buffer is not supported for mmap */
+	if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
+		return -EINVAL;
+
 	ret = get_snapshot_map(iter->tr);
 	if (ret)
 		return ret;