[v6,0/6] arm64: Add kernel probes (kprobes) support

Message ID	553EF74D.8020706@redhat.com (mailing list archive)
State	New, archived
Headers
From: William Cohen <wcohen@redhat.com>
Date: Mon, 27 Apr 2015 22:58:21 -0400
To: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>, David Long <dave.long@linaro.org>, Will Deacon <will.deacon@arm.com>
Cc: "Jon Medhurst (Tixy)" <tixy@linaro.org>, Steve Capper <steve.capper@linaro.org>, Ananth N Mavinakayanahalli <ananth@in.ibm.com>, Catalin Marinas <catalin.marinas@arm.com>, linux-kernel@vger.kernel.org, Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>, sandeepa.s.prabhu@gmail.com, Russell King <linux@arm.linux.org.uk>, davem@davemloft.net, linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v6 0/6] arm64: Add kernel probes (kprobes) support
References: <1429561187-3661-1-git-send-email-dave.long@linaro.org> <55363791.4070706@hitachi.com> <553AB222.50609@redhat.com>
In-Reply-To: <553AB222.50609@redhat.com>

William Cohen April 28, 2015, 2:58 a.m. UTC

Hi All,

I have been experimenting with the patches for arm64 kprobes support.
On occasion the kernel gets stuck in a loop printing output:

 Unexpected kernel single-step exception at EL1

This message by itself is not that enlightening.  I added the attached
patch to get some additional information about register state when the
warning is printed out.  Below is an example output:


[14613.263536] Unexpected kernel single-step exception at EL1
[14613.269001] kcb->ss_ctx.ss_status = 1
[14613.272643] kcb->ss_ctx.match_addr = fffffdfffc001250 0xfffffdfffc001250
[14613.279324] instruction_pointer(regs) = fffffe0000093358 el1_da+0x8/0x70
[14613.286003] 
[14613.287487] CPU: 3 PID: 621 Comm: irqbalance Tainted: G           OE   4.0.0u4+ #6
[14613.295019] Hardware name: AppliedMicro Mustang/Mustang, BIOS 1.1.0-rh-0.15 Mar 13 2015
[14613.302982] task: fffffe01d6806780 ti: fffffe01d68ac000 task.ti: fffffe01d68ac000
[14613.310430] PC is at el1_da+0x8/0x70
[14613.313990] LR is at trampoline_probe_handler+0x188/0x1ec
[14613.319363] pc : [<fffffe0000093358>] lr : [<fffffe0000687590>] pstate: 600001c5
[14613.326724] sp : fffffe01d68af640
[14613.330021] x29: fffffe01d68afbf0 x28: fffffe01d68ac000 
[14613.335328] x27: fffffe00000939cc x26: fffffe0000bb09d0 
[14613.340634] x25: fffffe01d68afdb0 x24: 0000000000000025 
[14613.345939] x23: 00000000800003c5 x22: fffffdfffc001284 
[14613.351245] x21: fffffe01d68af760 x20: fffffe01d7c79a00 
[14613.356552] x19: 0000000000000000 x18: 000003ffa4b8e600 
[14613.361858] x17: 000003ffa5480698 x16: fffffe00001f2afc 
[14613.367164] x15: 0000000000000007 x14: 000003ffeffa8690 
[14613.372471] x13: 0000000000000001 x12: 000003ffa4baf200 
[14613.377778] x11: fffffe00006bb328 x10: fffffe00006bb32c 
[14613.383084] x9 : fffffe01d68afd10 x8 : fffffe01d6806d10 
[14613.388390] x7 : fffffe01ffd01298 x6 : fffffe000009192c 
[14613.393696] x5 : fffffe0000c1b398 x4 : 0000000000000000 
[14613.399001] x3 : 0000000000200200 x2 : 0000000000100100 
[14613.404306] x1 : 0000000096000006 x0 : 0000000000000015 
[14613.409610] 
[14613.411094] BUG: failure at arch/arm64/kernel/debug-monitors.c:276/single_step_handler()!


The really odd thing is the address of the PC: it is in el1_da, the code
that handles data aborts.  It looks like it is getting the unexpected
single-step exception right after the enable_dbg in el1_da.  I think
what might be happening is:

-an instruction is instrumented with kprobe
-the instruction is copied to a buffer
-a breakpoint replaces the instruction
-the kprobe fires when the breakpoint is encountered
-the instruction in the buffer is set to single step
-a single step of the instruction is attempted
-a data abort exception is raised
-el1_da is called
-el1_da does an enable_dbg to unmask the debug exceptions
-single_step_handler is called
-single_step_handler doesn't find anything to handle that pc
-single_step_handler prints the warning about unexpected el1 single step
-single_step_handler re-enables single step
-the single step of the instruction is attempted endlessly
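
The sequence above can be sketched as a toy model. This is Python rather
than kernel C so that it runs standalone, and it is not the patch's code:
the hook function, the helper name and the iteration limit are all made up
for illustration. It assumes the kprobes step hook only claims a step that
ends at the recorded match address, which is what the match_addr line in
the debug output suggests. The two addresses are taken from the log above.

```python
# Toy model of the suspected loop. Not kernel code: every name is hypothetical.

MATCH_ADDR = 0xfffffdfffc001250   # kcb->ss_ctx.match_addr from the log
EL1_DA_PC = 0xfffffe0000093358    # instruction_pointer(regs): el1_da+0x8

WARNING = "Unexpected kernel single-step exception at EL1"


def kprobes_step_hook(pc):
    """Assumed behaviour: claim the step only if it ended at match_addr."""
    return pc == MATCH_ADDR


def single_step_exceptions(pc, hooks, limit=5):
    """Return what gets logged while single-step stays armed.

    `limit` exists only so that the model terminates. The PC is held fixed
    for simplicity; the point is that it never equals MATCH_ADDR.
    """
    log = []
    for _ in range(limit):
        if any(hook(pc) for hook in hooks):
            log.append("handled")
            break                      # step claimed, single-step disarmed
        log.append(WARNING)            # nobody claimed it: warn and re-arm
    return log
```

With the PC inside el1_da no hook ever claims the exception, so the model
logs the warning on every iteration until the artificial limit is reached.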

It looks like commit 1059c6bf8534acda249e7e65c81e7696fb074dc1 from Mon
Sep 22   "arm64: debug: don't re-enable debug exceptions on return from el1_dbg"
was trying to address a similar problem for the el1_dbg
function.  Should el1_da and other el1_* functions have the enable_dbg
removed?

If single_step_handler doesn't find a handler, is re-enabling the
single step with set_regs_spsr_ss in single_step_handler the right thing to do?

-Will

Will Deacon April 29, 2015, 10:23 a.m. UTC | #1

On Tue, Apr 28, 2015 at 03:58:21AM +0100, William Cohen wrote:
> Hi All,

Hi Will,

> I have been experimenting with the patches for arm64 kprobes support.
> On occasion the kernel gets stuck in a loop printing output:
> 
>  Unexpected kernel single-step exception at EL1
> 
> This message by itself is not that enlightening.  I added the attached
> patch to get some additional information about register state when the
> warning is printed out.  Below is an example output:

Given that we've got the pt_regs in our hands at that point, I'm happy to
print something more useful if you like (e.g. the PC?).

> [14613.263536] Unexpected kernel single-step exception at EL1
> [14613.269001] kcb->ss_ctx.ss_status = 1
> [14613.272643] kcb->ss_ctx.match_addr = fffffdfffc001250 0xfffffdfffc001250
> [14613.279324] instruction_pointer(regs) = fffffe0000093358 el1_da+0x8/0x70
> [14613.286003] 
> [14613.287487] CPU: 3 PID: 621 Comm: irqbalance Tainted: G           OE   4.0.0u4+ #6
> [14613.295019] Hardware name: AppliedMicro Mustang/Mustang, BIOS 1.1.0-rh-0.15 Mar 13 2015
> [14613.302982] task: fffffe01d6806780 ti: fffffe01d68ac000 task.ti: fffffe01d68ac000
> [14613.310430] PC is at el1_da+0x8/0x70
> [14613.313990] LR is at trampoline_probe_handler+0x188/0x1ec

> The really odd thing is the address of the PC: it is in el1_da, the code
> that handles data aborts.  It looks like it is getting the unexpected
> single-step exception right after the enable_dbg in el1_da.  I think
> what might be happening is:
> 
> -an instruction is instrumented with kprobe
> -the instruction is copied to a buffer
> -a breakpoint replaces the instruction
> -the kprobe fires when the breakpoint is encountered
> -the instruction in the buffer is set to single step
> -a single step of the instruction is attempted
> -a data abort exception is raised
> -el1_da is called

So that's the bit that I find weird. Can you take a look at what we're doing
in trampoline_probe_handler, please? It could be that we're doing something
like get_user and aborting on a faulting userspace address, but I think
kprobes should handle that rather than us trying to get the generic
single-step code to deal with it.

> It looks like commit 1059c6bf8534acda249e7e65c81e7696fb074dc1 from Mon
> Sep 22   "arm64: debug: don't re-enable debug exceptions on return from el1_dbg"
> was trying to address a similar problem for the el1_dbg
> function.  Should el1_da and other el1_* functions have the enable_dbg
> removed?

I don't think so. The current behaviour of the low-level debug handler is to
step into traps, which is more flexible than trying to step over them (which
could lead to us stepping over interrupts, or preemption points). It should
be up to the higher-level debugger (e.g. kprobes, kgdb) to distinguish
between the traps it does and does not care about.

An equivalent userspace example would be GDB stepping into signal handlers,
I suppose.

Will

William Cohen May 2, 2015, 1:44 a.m. UTC | #2

Dave Long and I did some additional experimentation to better
understand what condition causes the kernel to sometimes spew:

Unexpected kernel single-step exception at EL1

The functioncallcount.stp test instruments the entry and return of
every function in the mm files, including kfree.  In most cases the
arm64 trampoline_probe_handler just determines which return probe
instance matches the current conditions, runs the associated handler,
and recycles the return probe instance for another use by placing it
on a hlist.  However, it is possible that a return probe instance has
been set up on function entry and the return probe is unregistered
before the return probe instance fires.  In this case kfree is called
by the trampoline handler to remove the return probe instances related
to the unregistered kretprobe.  This case, where the kprobed kfree
is called within the arm64 trampoline_probe_handler function, triggers
the problem.

The kprobe breakpoint for the kfree call from within the
trampoline_probe_handler is encountered and started, but things go
wrong when attempting the single step on the instruction.
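
As a rough illustration of the two paths described above, here is a toy
Python model (not kernel code). The class and method names are invented
for the sketch; only the behaviour it mimics (recycle the instance in the
common case, kfree it when its kretprobe has been unregistered, with kfree
itself carrying a probe) comes from the description.

```python
# Toy model of kretprobe instance handling. All names are hypothetical.

class Kretprobe:
    def __init__(self, name):
        self.name = name
        self.registered = True
        self.free_instances = []   # recycled instances, ready for reuse


class Instance:
    def __init__(self, rp):
        self.rp = rp               # the kretprobe this instance belongs to


class Tracer:
    def __init__(self):
        self.depth = 0             # > 0 while inside the trampoline handler
        self.events = []

    def kfree(self, obj):
        # kfree is probed too, so calling it hits a kprobe breakpoint first.
        where = "inside trampoline handler" if self.depth else "normal context"
        self.events.append("kprobe on kfree hit: " + where)

    def trampoline_handler(self, inst):
        self.depth += 1
        try:
            if inst.rp.registered:
                inst.rp.free_instances.append(inst)   # common case: recycle
                return "recycled"
            self.kfree(inst)       # orphaned: the probe was unregistered first
            return "freed"
        finally:
            self.depth -= 1
```

Only the orphaned path reaches kfree, which is why the nested probe hit is
comparatively rare when running the testsuite.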

It took a while to trigger this problem with the systemtap testsuite.
Dave Long came up with steps that reproduce this more quickly with a
probed function that is always called within the trampoline handler.
Trying the same on x86_64 doesn't trigger the problem.  It appears
that the x86_64 code can handle a single step from within the
trampoline_handler.

-Will Cohen

David Long May 5, 2015, 5:14 a.m. UTC | #3

On 05/01/15 21:44, William Cohen wrote:
> Dave Long and I did some additional experimentation to better
> understand what condition causes the kernel to sometimes spew:
>
> Unexpected kernel single-step exception at EL1
>
> The functioncallcount.stp test instruments the entry and return of
> every function in the mm files, including kfree.  In most cases the
> arm64 trampoline_probe_handler just determines which return probe
> instance matches the current conditions, runs the associated handler,
> and recycles the return probe instance for another use by placing it
> on a hlist.  However, it is possible that a return probe instance has
> been set up on function entry and the return probe is unregistered
> before the return probe instance fires.  In this case kfree is called
> by the trampoline handler to remove the return probe instances related
> to the unregistered kretprobe.  This case, where the kprobed kfree
> is called within the arm64 trampoline_probe_handler function, triggers
> the problem.
>
> The kprobe breakpoint for the kfree call from within the
> trampoline_probe_handler is encountered and started, but things go
> wrong when attempting the single step on the instruction.
>
> It took a while to trigger this problem with the systemtap testsuite.
> Dave Long came up with steps that reproduce this more quickly with a
> probed function that is always called within the trampoline handler.
> Trying the same on x86_64 doesn't trigger the problem.  It appears
> that the x86_64 code can handle a single step from within the
> trampoline_handler.
>
> -Will Cohen
>
>
>

I'm assuming there are no plans for supporting software breakpoint debug
exceptions during processing of single-step exceptions any time soon on
arm64.  Given that, the only solution that I can come up with for this
is, instead of making this orphaned kretprobe instance list exist only
temporarily (in the scope of the kretprobe trampoline handler), to make
it always exist and kfree any items found on it as part of a periodic
cleanup running outside of the handler context.  I think these changes
would still all be in architecture-specific code.  This doesn't feel to
me like a bad solution.  Does anyone think there is a simpler way out of
this?
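
To make the idea concrete, a minimal sketch in Python (a toy model, not
kernel code). The class and method names are invented, and a real version
would need locking or per-CPU lists and some mechanism to schedule the
cleanup, none of which is shown here.

```python
# Toy model of deferring kfree out of the trampoline handler.
# All names are hypothetical.

class DeferredFree:
    def __init__(self, kfree):
        self.kfree = kfree
        self.orphans = []          # persists between handler invocations

    def handler_release(self, inst):
        """Called from the trampoline handler: queues, never calls kfree."""
        self.orphans.append(inst)

    def periodic_cleanup(self):
        """Called later, outside handler context; frees what was queued."""
        pending, self.orphans = self.orphans, []
        for inst in pending:
            self.kfree(inst)
        return len(pending)
```

The handler then never calls a probed function on the orphan path, at the
cost of orphaned instances staying allocated until the next cleanup runs.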

-Dave Long

Will Deacon May 5, 2015, 3:48 p.m. UTC | #4

On Tue, May 05, 2015 at 06:14:51AM +0100, David Long wrote:
> On 05/01/15 21:44, William Cohen wrote:
> > Dave Long and I did some additional experimentation to better
> > understand what condition causes the kernel to sometimes spew:
> >
> > Unexpected kernel single-step exception at EL1
> >
> > The functioncallcount.stp test instruments the entry and return of
> > every function in the mm files, including kfree.  In most cases the
> > arm64 trampoline_probe_handler just determines which return probe
> > instance matches the current conditions, runs the associated handler,
> > and recycles the return probe instance for another use by placing it
> > on a hlist.  However, it is possible that a return probe instance has
> > been set up on function entry and the return probe is unregistered
> > before the return probe instance fires.  In this case kfree is called
> > by the trampoline handler to remove the return probe instances related
> > to the unregistered kretprobe.  This case, where the kprobed kfree
> > is called within the arm64 trampoline_probe_handler function, triggers
> > the problem.
> >
> > The kprobe breakpoint for the kfree call from within the
> > trampoline_probe_handler is encountered and started, but things go
> > wrong when attempting the single step on the instruction.
> >
> > It took a while to trigger this problem with the systemtap testsuite.
> > Dave Long came up with steps that reproduce this more quickly with a
> > probed function that is always called within the trampoline handler.
> > Trying the same on x86_64 doesn't trigger the problem.  It appears
> > that the x86_64 code can handle a single step from within the
> > trampoline_handler.
> >
> 
> I'm assuming there are no plans for supporting software breakpoint debug
> exceptions during processing of single-step exceptions any time soon on
> arm64.  Given that, the only solution that I can come up with for this
> is, instead of making this orphaned kretprobe instance list exist only
> temporarily (in the scope of the kretprobe trampoline handler), to make
> it always exist and kfree any items found on it as part of a periodic
> cleanup running outside of the handler context.  I think these changes
> would still all be in architecture-specific code.  This doesn't feel to
> me like a bad solution.  Does anyone think there is a simpler way out of
> this?

Just to clarify, is the problem here the software breakpoint exception,
or trying to step the faulting instruction whilst we were already handling
a step?

I think I'd be inclined to keep the code run in debug context to a minimum.
We already can't block there, and the more code we add the more black spots
we end up with in the kernel itself. The alternative would be to make your
kprobes code re-entrant, but that sounds like a nightmare.

You say this works on x86. How do they handle it? Is the nested probe
on kfree ignored or handled?

Will

William Cohen May 5, 2015, 4:18 p.m. UTC | #5

On 05/05/2015 11:48 AM, Will Deacon wrote:
> Just to clarify, is the problem here the software breakpoint exception,
> or trying to step the faulting instruction whilst we were already handling
> a step?
> 
> I think I'd be inclined to keep the code run in debug context to a minimum.
> We already can't block there, and the more code we add the more black spots
> we end up with in the kernel itself. The alternative would be to make your
> kprobes code re-entrant, but that sounds like a nightmare.
> 
> You say this works on x86. How do they handle it? Is the nested probe
> on kfree ignored or handled?
> 
> Will
> 

Hi Will,

I ran the experiment on an x86_64 machine, and x86_64 was definitely able to take the software breakpoint of the probed function and then do a single step from within the trampoline handler.  It looks like the x86_64 code avoids using a breakpoint to implement the trampoline:

http://lxr.linux.no/#linux+v3.19.1/arch/x86/kernel/kprobes/core.c#L645

x86_64 saves the registers and then directly calls the trampoline_handler.  Maybe it would be possible to implement something similar on aarch64 with some carefully crafted assembly code to save the register state and then directly call the trampoline code.  This would move the trampoline handler out of exception state and make it act more like a normal function.
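
The control flow I have in mind looks roughly like the toy Python model below. It is not kernel code and says nothing about the actual assembly: the register dictionary, the Task class and the function names are invented for illustration. It only shows the general kretprobe pattern of swapping the return address on entry and having a plain-code trampoline call the handler directly, with no breakpoint involved.

```python
# Toy model of a direct-call return trampoline. All names are hypothetical.

TRAMPOLINE = "kretprobe_trampoline"   # stand-in for the trampoline's address


class Task:
    def __init__(self):
        self.saved = []               # stack of real return addresses


def on_function_entry(task, regs):
    """Entry probe: remember the real return address, divert the return."""
    task.saved.append(regs["lr"])
    regs["lr"] = TRAMPOLINE


def trampoline_handler(task, regs):
    """Runs as an ordinary call; returns the address to resume at."""
    # User return handlers would run here with the saved register snapshot.
    return task.saved.pop()


def trampoline(task, regs):
    """Plain code at TRAMPOLINE: save state, call the handler, jump back."""
    snapshot = dict(regs)             # "save the register state"
    regs["pc"] = trampoline_handler(task, snapshot)
    return regs
```

Because the handler here is reached by an ordinary call rather than through a debug exception, a probe hit inside it would be taken from normal context.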

-Will Cohen

David Long May 12, 2015, 5:54 a.m. UTC | #6

On 05/05/15 11:48, Will Deacon wrote:
> Just to clarify, is the problem here the software breakpoint exception,
> or trying to step the faulting instruction whilst we were already handling
> a step?
>

Sorry for the delay, I got tripped up with some global optimizations 
that happened when I made more testing changes.  When the kprobes 
software breakpoint handler for kretprobes is reentered it sets up the 
single-step and that ends up hitting inside entry.S, apparently in 
el1_undef.

> I think I'd be inclined to keep the code run in debug context to a minimum.
> We already can't block there, and the more code we add the more black spots
> we end up with in the kernel itself. The alternative would be to make your
> kprobes code re-entrant, but that sounds like a nightmare.
>
> You say this works on x86. How do they handle it? Is the nested probe
> on kfree ignored or handled?
>

Will Cohen's email pointing out that x86 does not use a breakpoint for the
trampoline handler explains a lot.  I'm experimenting, starting with his
proposed new trampoline code.  I can't see a reason this can't be made
to work, so given everything it doesn't seem worthwhile to try to
understand the failure in reentering the kprobe break handler in any
more detail.

-dave long
