diff mbox series

[141/192] fs/proc/kcore.c: add mmap interface

Message ID 20210701015441.snfkDnNcO%akpm@linux-foundation.org (mailing list archive)
State New
Headers show
Series [001/192] mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c | expand

Commit Message

Andrew Morton July 1, 2021, 1:54 a.m. UTC
From: ZHOUFENG <zhoufeng.zf@bytedance.com>
Subject: fs/proc/kcore.c: add mmap interface

When we do the kernel monitor, use the DRGN
(https://github.com/osandov/drgn) access to kernel data structures, found
that the system calls a lot.  DRGN is implemented by reading /proc/kcore. 
After looking at the kcore code, it is found that kcore does not implement
mmap, resulting in frequent context switching triggered by read. 
Therefore, we want to add mmap interface to optimize performance.  Since
vmalloc and module areas will change with allocation and release,
consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT
and KCORE_RAM.

The test results:
1. the default version of kcore
real 11.00
user 8.53
sys 3.59

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.64  128.578319          12  11168701           pread64
...
------ ----------- ----------- --------- --------- ----------------
100.00  129.042853              11193748       966 total

2. added kcore for the mmap interface
real 6.44
user 7.32
sys 0.24

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
32.94    0.130120          24      5317       315 futex
11.66    0.046077          21      2231         1 lstat
 9.23    0.036449         177       206           mmap
...
------ ----------- ----------- --------- --------- ----------------
100.00    0.395077                 25435       971 total

The test results show that the number of system calls and time consumption
are significantly reduced.

Thanks to Andrew Morton for your advice.

[akpm@linux-foundation.org: KCORE_REMAP is no more]
Link: https://lkml.kernel.org/r/20210601082241.13378-1-zhoufeng.zf@bytedance.com
Co-developed-by: CHENYING <chenying.kernel@bytedance.com>
Signed-off-by: CHENYING <chenying.kernel@bytedance.com>
Signed-off-by: ZHOUFENG <zhoufeng.zf@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c |   67 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

Comments

Linus Torvalds July 1, 2021, 3:32 a.m. UTC | #1
On Wed, Jun 30, 2021 at 6:54 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> When we do the kernel monitor, use the DRGN
> (https://github.com/osandov/drgn) access to kernel data structures, found
> that the system calls a lot.  DRGN is implemented by reading /proc/kcore.
> After looking at the kcore code, it is found that kcore does not implement
> mmap, resulting in frequent context switching triggered by read.
> Therefore, we want to add mmap interface to optimize performance.

Ok, this is funky, but I'm going to drop this patch because I think
it's buggy as is.

 Since

> +static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
> +{
> +       size_t size = vma->vm_end - vma->vm_start;

Ok.

But then:

> +       start = kc_offset_to_vaddr(((u64)vma->vm_pgoff << PAGE_SHIFT) -
> +               ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT));

Not only is that

        ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT)

a very strange calculation (did you mean "data_offset & PAGE_MASK"?),
but I don't see anything that protects against underflow in that
calculation. pg_off can easily be arbitrarily small (eg zero), so that
subtraction can underflow afaik.

So that needs a test, and return -EINVAL or whatever.

But even if that is fixed, this test is entirely broken:

> +       list_for_each_entry(m, &kclist_head, list) {
> +               if (start >= m->addr && size <= m->size)
> +                       break;
> +       }

No, that's wrong.

You allow 'size' to be as big as 'm->size', but you do that even if
'start' isn't 'm->start'.

The proper check would be something like

       u64 end = start + size;

        if (start >= m->addr && end <= m->addr+m->size) ..

or similar (and that should check that "start+size" hasn't overflowed).

So I see what appears to be multiple problems, and while I hand-waved
some fixes for them, those are very much "maybe something like this",
and I'm going to drop this patch. Not for 5.14.

           Linus
Feng Zhou July 1, 2021, 6:35 a.m. UTC | #2
在 2021/7/1 上午11:32, Linus Torvalds 写道:
> On Wed, Jun 30, 2021 at 6:54 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> When we do the kernel monitor, use the DRGN
>> (https://github.com/osandov/drgn) access to kernel data structures, found
>> that the system calls a lot.  DRGN is implemented by reading /proc/kcore.
>> After looking at the kcore code, it is found that kcore does not implement
>> mmap, resulting in frequent context switching triggered by read.
>> Therefore, we want to add mmap interface to optimize performance.
> 
> Ok, this is funky, but I'm going to drop this patch because I think
> it's buggy as is.
> 
>   Since
> 
>> +static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
>> +{
>> +       size_t size = vma->vm_end - vma->vm_start;
> 
> Ok.
> 
> But then:
> 
>> +       start = kc_offset_to_vaddr(((u64)vma->vm_pgoff << PAGE_SHIFT) -
>> +               ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT));
> 
> Not only is that
> 
>          ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT)
> 
> a very strange calculation (did you mean "data_offset & PAGE_MASK"?),
> but I don't see anything that protects against underflow in that
> calculation. pg_off can easily be arbitrarily small (eg zero), so that
> subtraction can underflow afaik.

Sorry, the calculations here are really confusing. The reason is that 
when DRGN read /proc/kcore for ELF file header:
phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
and DRGN call mmap, use phdr->p_offset passed in, I need to subtract 
"data_offset".

> 
> So that needs a test, and return -EINVAL or whatever.
>
There's a problem with not judging "start". I will fix it in a v3.

> But even if that is fixed, this test is entirely broken:
> 
>> +       list_for_each_entry(m, &kclist_head, list) {
>> +               if (start >= m->addr && size <= m->size)
>> +                       break;
>> +       }
> 
> No, that's wrong.
> 

Yes, this is indeed wrong, I will fix it in a v3.


> You allow 'size' to be as big as 'm->size', but you do that even if
> 'start' isn't 'm->start'.
> 
> The proper check would be something like
> 
>         u64 end = start + size;
> 
>          if (start >= m->addr && end <= m->addr+m->size) ..
> 
> or similar (and that should check that "start+size" hasn't overflowed).
> 
> So I see what appears to be multiple problems, and while I hand-waved
> some fixes for them, those are very much "maybe something like this",
> and I'm going to drop this patch. Not for 5.14.
> 
>             Linus
>
diff mbox series

Patch

--- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface
+++ a/fs/proc/kcore.c
@@ -614,11 +614,78 @@  static int release_kcore(struct inode *i
 	return 0;
 }
 
+static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct kcore_mmap_ops = {
+	.fault = mmap_kcore_fault,
+};
+
+static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
+{
+	size_t size = vma->vm_end - vma->vm_start;
+	u64 start, pfn;
+	int nphdr;
+	size_t data_offset;
+	size_t phdrs_len, notes_len;
+	struct kcore_list *m = NULL;
+	int ret = 0;
+
+	down_read(&kclist_lock);
+
+	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
+
+	start = kc_offset_to_vaddr(((u64)vma->vm_pgoff << PAGE_SHIFT) -
+		((data_offset >> PAGE_SHIFT) << PAGE_SHIFT));
+
+	list_for_each_entry(m, &kclist_head, list) {
+		if (start >= m->addr && size <= m->size)
+			break;
+	}
+
+	if (&m->list == &kclist_head) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &kcore_mmap_ops;
+
+	if (kern_addr_valid(start)) {
+		if (m->type == KCORE_RAM)
+			pfn = __pa(start) >> PAGE_SHIFT;
+		else if (m->type == KCORE_TEXT)
+			pfn = __pa_symbol(start) >> PAGE_SHIFT;
+		else {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, pfn, size,
+				vma->vm_page_prot);
+	} else {
+		ret = -EFAULT;
+	}
+
+out:
+	up_read(&kclist_lock);
+	return ret;
+}
+
 static const struct proc_ops kcore_proc_ops = {
 	.proc_read	= read_kcore,
 	.proc_open	= open_kcore,
 	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
+	.proc_mmap	= mmap_kcore,
 };
 
 /* just remember that we have to update kcore */