diff mbox series

[v2] tracing: Do not allow mmap() of persistent ring buffer

Message ID 20250214115547.0d7287d3@gandalf.local.home (mailing list archive)
State Accepted
Commit 129fe718819cc5e24ea2f489db9ccd4371f0c6f6
Headers show
Series [v2] tracing: Do not allow mmap() of persistent ring buffer | expand

Commit Message

Steven Rostedt Feb. 14, 2025, 4:55 p.m. UTC
From: Steven Rostedt <rostedt@goodmis.org>

When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:

 BUG: unable to handle page fault for address: ffffe97bd00025c8
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 2862f3067 P4D 2862f3067 PUD 0
 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
 CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:validate_page_before_insert+0x5/0xb0
 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
 FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x1f
  ? __die+0x2e/0x40
  ? page_fault_oops+0x157/0x2b0
  ? search_module_extables+0x53/0x80
  ? validate_page_before_insert+0x5/0xb0
  ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
  ? __bad_area_nosemaphore+0x16e/0x1b0
  ? bad_area_nosemaphore+0x16/0x20
  ? do_kern_addr_fault+0x77/0x90
  ? exc_page_fault+0x22b/0x230
  ? asm_exc_page_fault+0x2b/0x30
  ? validate_page_before_insert+0x5/0xb0
  ? vm_insert_pages+0x151/0x400
  __rb_map_vma+0x21f/0x3f0
  ring_buffer_map+0x21b/0x2f0
  tracing_buffers_mmap+0x70/0xd0
  __mmap_region+0x6f0/0xbd0
  mmap_region+0x7f/0x130
  do_mmap+0x475/0x610
  vm_mmap_pgoff+0xf2/0x1d0
  ksys_mmap_pgoff+0x166/0x200
  __x64_sys_mmap+0x37/0x50
  x64_sys_call+0x1670/0x1d70
  do_syscall_64+0xbb/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

The reason was that the code that maps the ring buffer pages to user space
has:

	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);

And uses that in:

	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

But virt_to_page() does not work with vmap()'d memory which is what the
persistent ring buffer has. It is rather trivial to allow this, but for
now just disable mmap() of instances that have their ring buffer from the
reserve_mem option.

If an mmap() is performed on a persistent buffer it will return -ENODEV
just like it would if the .mmap field wasn't defined in the
file_operations structure.

Cc: stable@vger.kernel.org
Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://lore.kernel.org/20250213180737.061871ae@gandalf.local.home

- Return -ENODEV instead of -EINVAL as that is what is returned when the .mmap
  field is missing from the file_operations structure.

 kernel/trace/trace.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

David Laight Feb. 14, 2025, 5:18 p.m. UTC | #1
On Fri, 14 Feb 2025 11:55:47 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
....
> The reason was that the code that maps the ring buffer pages to user space
> has:
> 
> 	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
...
> But virt_to_page() does not work with vmap()'d memory which is what the
> persistent ring buffer has. It is rather trivial to allow this, but for
> now just disable mmap() of instances that have their ring buffer from the
> reserve_mem option.

I've recently fallen foul of the same issue elsewhere [1].
Perhaps there ought to be a variant of virt_to_page() that returns an
error for addresses outside the kernel map.
Or even a fast version that doesn't check for places where the cost
of the additional conditional would matter.

Even a kernel panic for an unchecked NULL pointer would be easier
to diagnose than the current situation.

[1] In my case it was dma_alloc_coherent() using vmalloc() when an iommu
is enabled and then the wrong things happening when I try to mmap()
the buffer into userspace (offset in both the dma buffer and the user file).
I do need to check that the iommu is honouring the buffer alignment.

	David
Masami Hiramatsu (Google) Feb. 15, 2025, 3:14 p.m. UTC | #2
On Fri, 14 Feb 2025 11:55:47 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When trying to mmap a trace instance buffer that is attached to
> reserve_mem, it would crash:
> 
>  BUG: unable to handle page fault for address: ffffe97bd00025c8
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 2862f3067 P4D 2862f3067 PUD 0
>  Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
>  CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>  RIP: 0010:validate_page_before_insert+0x5/0xb0
>  Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
>  RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
>  RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
>  RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
>  RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
>  R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
>  R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
>  FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
>  Call Trace:
>   <TASK>
>   ? __die_body.cold+0x19/0x1f
>   ? __die+0x2e/0x40
>   ? page_fault_oops+0x157/0x2b0
>   ? search_module_extables+0x53/0x80
>   ? validate_page_before_insert+0x5/0xb0
>   ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
>   ? __bad_area_nosemaphore+0x16e/0x1b0
>   ? bad_area_nosemaphore+0x16/0x20
>   ? do_kern_addr_fault+0x77/0x90
>   ? exc_page_fault+0x22b/0x230
>   ? asm_exc_page_fault+0x2b/0x30
>   ? validate_page_before_insert+0x5/0xb0
>   ? vm_insert_pages+0x151/0x400
>   __rb_map_vma+0x21f/0x3f0
>   ring_buffer_map+0x21b/0x2f0
>   tracing_buffers_mmap+0x70/0xd0
>   __mmap_region+0x6f0/0xbd0
>   mmap_region+0x7f/0x130
>   do_mmap+0x475/0x610
>   vm_mmap_pgoff+0xf2/0x1d0
>   ksys_mmap_pgoff+0x166/0x200
>   __x64_sys_mmap+0x37/0x50
>   x64_sys_call+0x1670/0x1d70
>   do_syscall_64+0xbb/0x1d0
>   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> The reason was that the code that maps the ring buffer pages to user space
> has:
> 
> 	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
> 
> And uses that in:
> 
> 	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> 
> But virt_to_page() does not work with vmap()'d memory which is what the
> persistent ring buffer has. It is rather trivial to allow this, but for
> now just disable mmap() of instances that have their ring buffer from the
> reserve_mem option.
> 
> If an mmap() is performed on a persistent buffer it will return -ENODEV
> just like it would if the .mmap field wasn't defined in the
> file_operations structure.
> 

Looks good to me.

Tested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks,

> Cc: stable@vger.kernel.org
> Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance")
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Changes since v1: https://lore.kernel.org/20250213180737.061871ae@gandalf.local.home
> 
> - Return -ENODEV instead of -EINVAL as that is what is returned when the .mmap
>   field is missing from the file_operations structure.
> 
>  kernel/trace/trace.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 25ff37aab00f..0e6d517e74e0 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -8279,6 +8279,10 @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
>  	struct trace_iterator *iter = &info->iter;
>  	int ret = 0;
>  
> +	/* Currently the boot mapped buffer is not supported for mmap */
> +	if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
> +		return -ENODEV;
> +
>  	ret = get_snapshot_map(iter->tr);
>  	if (ret)
>  		return ret;
> -- 
> 2.47.2
>
diff mbox series

Patch

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 25ff37aab00f..0e6d517e74e0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8279,6 +8279,10 @@  static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
 	struct trace_iterator *iter = &info->iter;
 	int ret = 0;
 
+	/* Currently the boot mapped buffer is not supported for mmap */
+	if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
+		return -ENODEV;
+
 	ret = get_snapshot_map(iter->tr);
 	if (ret)
 		return ret;