diff mbox series

[4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register

Message ID 20191016073731.4076725-5-songliubraving@fb.com (mailing list archive)
State New, archived
Headers show
Series Fixes for THP in page cache | expand

Commit Message

Song Liu Oct. 16, 2019, 7:37 a.m. UTC
Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries. On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe. This is
because perf_event based trace_uprobe calls uprobe_unregister twice on
close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER.
The second call will split the PMD mapped page table entry, which is not
the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Also add a WARN() to confirm uprobe unregister never work on huge pages.

Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/events/uprobes.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Oleg Nesterov Oct. 16, 2019, 12:10 p.m. UTC | #1
On 10/16, Song Liu wrote:
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	int ret, is_register, ref_ctr_updated = 0;
>  	bool orig_page_huge = false;
> +	unsigned int gup_flags = FOLL_FORCE;
>  
>  	is_register = is_swbp_insn(&opcode);
>  	uprobe = container_of(auprobe, struct uprobe, arch);
>  
>  retry:
> +	if (is_register)
> +		gup_flags |= FOLL_SPLIT_PMD;
>  	/* Read the page with vaddr into memory */
> -	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
> -			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
> +				    &old_page, &vma, NULL);
>  	if (ret <= 0)
>  		return ret;
>  
> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  	if (ret <= 0)
>  		goto put_old;
>  
> +	WARN(!is_register && PageCompound(old_page),
> +	     "uprobe unregister should never work on compound page\n");

But this can happen with the change above. You can't know if *vaddr was
previously changed by install_breakpoint() or not.

If not, verify_opcode() should likely save us, but we can't rely on it.
Say, someone can write "int3" into vm_file at uprobe->offset.

And I am not sure it is safe to continue in this case, I'd suggest to
return -EWHATEVER to avoid the possible crash.

Oleg.
Song Liu Oct. 16, 2019, 4:10 p.m. UTC | #2
> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 10/16, Song Liu wrote:
>> 
>> --- a/kernel/events/uprobes.c
>> +++ b/kernel/events/uprobes.c
>> @@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>> 	struct vm_area_struct *vma;
>> 	int ret, is_register, ref_ctr_updated = 0;
>> 	bool orig_page_huge = false;
>> +	unsigned int gup_flags = FOLL_FORCE;
>> 
>> 	is_register = is_swbp_insn(&opcode);
>> 	uprobe = container_of(auprobe, struct uprobe, arch);
>> 
>> retry:
>> +	if (is_register)
>> +		gup_flags |= FOLL_SPLIT_PMD;
>> 	/* Read the page with vaddr into memory */
>> -	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
>> -			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
>> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
>> +				    &old_page, &vma, NULL);
>> 	if (ret <= 0)
>> 		return ret;
>> 
>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>> 	if (ret <= 0)
>> 		goto put_old;
>> 
>> +	WARN(!is_register && PageCompound(old_page),
>> +	     "uprobe unregister should never work on compound page\n");
> 
> But this can happen with the change above. You can't know if *vaddr was
> previously changed by install_breakpoint() or not.

> If not, verify_opcode() should likely save us, but we can't rely on it.
> Say, someone can write "int3" into vm_file at uprobe->offset.

I think this won't really happen. With is_register == false, we already 
know opcode is not "int3", so current call must be from set_orig_insn(). 
Therefore, old_page must be installed by uprobe, and cannot be compound.

The other way is not guaranteed. With is_register == true, it is still
possible current call is from set_orig_insn(). However, we do not rely
on this path. 

Does this make sense? Or did I miss anything?

> 
> And I am not sure it is safe to continue in this case, I'd suggest to
> return -EWHATEVER to avoid the possible crash.

I think we can return -ESOMETHING here to be safe. However, if the 
analysis above makes sense, it is not necessary. 

Thanks,
Song
Oleg Nesterov Oct. 17, 2019, 8:47 a.m. UTC | #3
On 10/16, Song Liu wrote:
>
> > On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> >> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >> 	if (ret <= 0)
> >> 		goto put_old;
> >>
> >> +	WARN(!is_register && PageCompound(old_page),
> >> +	     "uprobe unregister should never work on compound page\n");
> >
> > But this can happen with the change above. You can't know if *vaddr was
> > previously changed by install_breakpoint() or not.
>
> > If not, verify_opcode() should likely save us, but we can't rely on it.
> > Say, someone can write "int3" into vm_file at uprobe->offset.
>
> I think this won't really happen. With is_register == false, we already
> know opcode is not "int3", so current call must be from set_orig_insn().
> Therefore, old_page must be installed by uprobe, and cannot be compound.
>
> The other way is not guaranteed. With is_register == true, it is still
> possible current call is from set_orig_insn(). However, we do not rely
> on this path.

Quite contrary.

When is_register == true we know that a) the caller is install_breakpoint()
and b) the original insn is NOT int3 unless this page was alreadt COW'ed by
userspace, say, by gdb.

If is_register == false we only know that the caller is remove_breakpoint().
We can't know if this page was COW'ed by uprobes or userspace, we can not
know if the insn we are going to replace is int3 or not, thus we can not
assume that verify_opcode() will fail and save us.

Oleg.
Song Liu Oct. 17, 2019, 2:05 p.m. UTC | #4
> On Oct 17, 2019, at 1:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 10/16, Song Liu wrote:
>> 
>>> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> 
>>>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>>> 	if (ret <= 0)
>>>> 		goto put_old;
>>>> 
>>>> +	WARN(!is_register && PageCompound(old_page),
>>>> +	     "uprobe unregister should never work on compound page\n");
>>> 
>>> But this can happen with the change above. You can't know if *vaddr was
>>> previously changed by install_breakpoint() or not.
>> 
>>> If not, verify_opcode() should likely save us, but we can't rely on it.
>>> Say, someone can write "int3" into vm_file at uprobe->offset.
>> 
>> I think this won't really happen. With is_register == false, we already
>> know opcode is not "int3", so current call must be from set_orig_insn().
>> Therefore, old_page must be installed by uprobe, and cannot be compound.
>> 
>> The other way is not guaranteed. With is_register == true, it is still
>> possible current call is from set_orig_insn(). However, we do not rely
>> on this path.
> 
> Quite contrary.
> 
> When is_register == true we know that a) the caller is install_breakpoint()
> and b) the original insn is NOT int3 unless this page was alreadt COW'ed by
> userspace, say, by gdb.
> 
> If is_register == false we only know that the caller is remove_breakpoint().
> We can't know if this page was COW'ed by uprobes or userspace, we can not
> know if the insn we are going to replace is int3 or not, thus we can not
> assume that verify_opcode() will fail and save us.

So the case we worry about is: 
old_page is COW by user space, target insn is int3, and it is a huge page; 
then uprobe calls remove_breakpoint(); 

Yeah, I guess this could happen. 

For the fix, I guess return -Esomething in such case should be sufficient?

Thanks,
Song
Oleg Nesterov Oct. 17, 2019, 2:28 p.m. UTC | #5
On 10/17, Song Liu wrote:
>
>
> > On Oct 17, 2019, at 1:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 10/16, Song Liu wrote:
> >>
> >>> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>>
> >>>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >>>> 	if (ret <= 0)
> >>>> 		goto put_old;
> >>>>
> >>>> +	WARN(!is_register && PageCompound(old_page),
> >>>> +	     "uprobe unregister should never work on compound page\n");
> >>>
> >>> But this can happen with the change above. You can't know if *vaddr was
> >>> previously changed by install_breakpoint() or not.
> >>
> >>> If not, verify_opcode() should likely save us, but we can't rely on it.
> >>> Say, someone can write "int3" into vm_file at uprobe->offset.
> >>
> >> I think this won't really happen. With is_register == false, we already
> >> know opcode is not "int3", so current call must be from set_orig_insn().
> >> Therefore, old_page must be installed by uprobe, and cannot be compound.
> >>
> >> The other way is not guaranteed. With is_register == true, it is still
> >> possible current call is from set_orig_insn(). However, we do not rely
> >> on this path.
> >
> > Quite contrary.
> >
> > When is_register == true we know that a) the caller is install_breakpoint()
> > and b) the original insn is NOT int3 unless this page was alreadt COW'ed by
> > userspace, say, by gdb.
> >
> > If is_register == false we only know that the caller is remove_breakpoint().
> > We can't know if this page was COW'ed by uprobes or userspace, we can not
> > know if the insn we are going to replace is int3 or not, thus we can not
> > assume that verify_opcode() will fail and save us.
>
> So the case we worry about is:
> old_page is COW by user space,

no, in this case the page shouldn't be huge,

> target insn is int3, and it is a huge page;
> then uprobe calls remove_breakpoint();

Yes,

> Yeah, I guess this could happen.

Yes,

> For the fix, I guess return -Esomething in such case should be sufficient?

this is what I tried to suggest from the very beginning.

Oleg.
Song Liu Oct. 17, 2019, 3:34 p.m. UTC | #6
> On Oct 17, 2019, at 7:28 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 10/17, Song Liu wrote:
>> 
>> 
>>> On Oct 17, 2019, at 1:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> 
>>> On 10/16, Song Liu wrote:
>>>> 
>>>>> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>>> 
>>>>>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>>>>> 	if (ret <= 0)
>>>>>> 		goto put_old;
>>>>>> 
>>>>>> +	WARN(!is_register && PageCompound(old_page),
>>>>>> +	     "uprobe unregister should never work on compound page\n");
>>>>> 
>>>>> But this can happen with the change above. You can't know if *vaddr was
>>>>> previously changed by install_breakpoint() or not.
>>>> 
>>>>> If not, verify_opcode() should likely save us, but we can't rely on it.
>>>>> Say, someone can write "int3" into vm_file at uprobe->offset.
>>>> 
>>>> I think this won't really happen. With is_register == false, we already
>>>> know opcode is not "int3", so current call must be from set_orig_insn().
>>>> Therefore, old_page must be installed by uprobe, and cannot be compound.
>>>> 
>>>> The other way is not guaranteed. With is_register == true, it is still
>>>> possible current call is from set_orig_insn(). However, we do not rely
>>>> on this path.
>>> 
>>> Quite contrary.
>>> 
>>> When is_register == true we know that a) the caller is install_breakpoint()
>>> and b) the original insn is NOT int3 unless this page was alreadt COW'ed by
>>> userspace, say, by gdb.
>>> 
>>> If is_register == false we only know that the caller is remove_breakpoint().
>>> We can't know if this page was COW'ed by uprobes or userspace, we can not
>>> know if the insn we are going to replace is int3 or not, thus we can not
>>> assume that verify_opcode() will fail and save us.
>> 
>> So the case we worry about is:
>> old_page is COW by user space,
> 
> no, in this case the page shouldn't be huge,
> 
>> target insn is int3, and it is a huge page;
>> then uprobe calls remove_breakpoint();
> 
> Yes,
> 
>> Yeah, I guess this could happen.
> 
> Yes,
> 
>> For the fix, I guess return -Esomething in such case should be sufficient?
> 
> this is what I tried to suggest from the very beginning.
> 

Thanks Oleg!

Attached is v2 of 4/4. 

Song


============================ 8< ===============================

From 0ab451f31b8fa6315d9de86eebe5d6c44dabac7e Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Tue, 15 Oct 2019 22:38:55 -0700
Subject: [PATCH v2 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register

Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries. On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe. This is
because perf_event based trace_uprobe calls uprobe_unregister twice on
close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER.
The second call will split the PMD mapped page table entry, which is not
the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Add a WARN() to confirm uprobe unregister never work on huge pages, and
abort the operation when this WARN() triggers.

Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

---
Changes v1 -> v2:
Return -EINVAL if the WARN() triggers. (Oleg Nesterov)
---
 kernel/events/uprobes.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 94d38a39d72e..c74761004ee5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
        struct vm_area_struct *vma;
        int ret, is_register, ref_ctr_updated = 0;
        bool orig_page_huge = false;
+       unsigned int gup_flags = FOLL_FORCE;

        is_register = is_swbp_insn(&opcode);
        uprobe = container_of(auprobe, struct uprobe, arch);

 retry:
+       if (is_register)
+               gup_flags |= FOLL_SPLIT_PMD;
        /* Read the page with vaddr into memory */
-       ret = get_user_pages_remote(NULL, mm, vaddr, 1,
-                       FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
+       ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
+                                   &old_page, &vma, NULL);
        if (ret <= 0)
                return ret;

@@ -489,6 +492,12 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
        if (ret <= 0)
                goto put_old;

+       if (WARN(!is_register && PageCompound(old_page),
+                "uprobe unregister should never work on compound page\n")) {
+               ret = -EINVAL;
+               goto put_old;
+       }
+
        /* We are going to replace instruction, update ref_ctr. */
        if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
                ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);
--
2.17.1
diff mbox series

Patch

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 94d38a39d72e..d7a556cc589e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -474,14 +474,17 @@  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	int ret, is_register, ref_ctr_updated = 0;
 	bool orig_page_huge = false;
+	unsigned int gup_flags = FOLL_FORCE;
 
 	is_register = is_swbp_insn(&opcode);
 	uprobe = container_of(auprobe, struct uprobe, arch);
 
 retry:
+	if (is_register)
+		gup_flags |= FOLL_SPLIT_PMD;
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
-			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
+				    &old_page, &vma, NULL);
 	if (ret <= 0)
 		return ret;
 
@@ -489,6 +492,9 @@  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (ret <= 0)
 		goto put_old;
 
+	WARN(!is_register && PageCompound(old_page),
+	     "uprobe unregister should never work on compound page\n");
+
 	/* We are going to replace instruction, update ref_ctr. */
 	if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
 		ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);