diff mbox

Possible race between PTRACE_SETVFPREGS and PTRACE_CONT on ARM?

Message ID 20160530213529.GS19428@n2100.arm.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Russell King (Oracle) May 30, 2016, 9:35 p.m. UTC
On Mon, May 30, 2016 at 01:48:11PM -0400, Simon Marchi wrote:
> Hello knowledgeable ARM people!
> 
> (Background: https://sourceware.org/ml/gdb/2016-05/msg00020.html )
> 
> Debugging a flaky GDB test case on ARM lead me to think there might
> be race between PTRACE_SETVFPREGS and PTRACE_CONT on ARM
> (PTRACE_SETVFPREGS is ARM-specific anyway).  The test case (and the
> reproducer below) changes the value of a VFP register (let's say d0)
> using PTRACE_SETVFPREGS and resumes the thread with PTRACE_CONT.  It
> happens intermittently that the thread resumes execution with the
> old value in d0 instead of the new one.

So, I thought I'd look into this, and what I see here on my systems
(whether it be Marvell Dove or iMX6) is that the program always exits
with a return code of 1.

Investigation on the Marvell Dove platform leads me to conclude that
Ubuntu 14.04 gdb (7.7.1-0ubuntu5~14.04.2) is built without support for
VFP - if I add an "info float" into the gdb script, I get:

Breakpoint 1, break_here () at test.S:8
8             vmrs APSR_nzcv, fpscr
No floating-point info available for this processor.

which is incredibly annoying, because it means that your "p $d0 = 4.0"
line has no effect on the VFP state - hence why its always exiting with
1.

However, if we look closer, we see that gdb has decided to put the
breakpoint _after_ the comparison instruction, as confirmed by the
disassembly after the breakpoint is hit:

   0x00008108 <+0>:     vcmp.f64        d0, d1
=> 0x0000810c <+4>:     vmrs    APSR_nzcv, fpscr
   0x00008110 <+8>:     moveq   r0, #1

On iMX6, where I have Ubuntu 12.04 gdb (7.4-2012.04-0ubuntu2.1), "info
float" works as one expects, but we still end up with the program
exiting with a code of 1 - every time - because again, the breakpoint
is misplaced.

So, the gdb verisons I have here seem to be particularly poor - but with
some modifications, I can test out on iMX6 by forcing gdb to do the right
thing - by inserting a couple of "mov r0, r0" instructions after the
"break_here" label.

With that, on a single CPU, it seems to work correctly every time, but
if I bring up a secondary CPU I start seeing the same problems you've
reported - which seems to need the following patch to solve.  Please can
you check whether this resolves your problem?

Thanks.

 arch/arm/kernel/ptrace.c | 2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Comments

Russell King (Oracle) May 30, 2016, 10:40 p.m. UTC | #1
On Mon, May 30, 2016 at 10:35:29PM +0100, Russell King - ARM Linux wrote:
> With that, on a single CPU, it seems to work correctly every time, but
> if I bring up a secondary CPU I start seeing the same problems you've
> reported - which seems to need the following patch to solve.  Please can
> you check whether this resolves your problem?

More background behind this patch... I think that commit 8130b9d7b9d8
("ARM: 7308/1: vfp: flush thread hwstate before copying ptrace registers")
was wrong.

Let's look at what's happening.  The above commit had the following effect:

	vfp_sync_hwstate();
	new_vfp = thread->vfpstate.hard;
	modify new_vfp (but not new_vfp.cpu)
+	vfp_flush_hwstate(thread);
	thread->vfpstate.hard = new_vfp;
-	vfp_flush_hwstate(thread);

Now, the commit above claims that a context switch mid-copy of new_vfp
into thread->vfpstate.hard may result in corrupted state.

I'm not quite sure how we could get to that point: the only thread which
is allowed to touch the traced child's state is the ptracing parent, and
the ptracing parent can't do anything until after vfp_set() has returned.
Even if the traced child gets a fatal signal, the point at which the
signal is delivered to the child is when returning to userspace, which
means the child must be runnable.  That in turn means that we are not
in the middle of changing the register state.

So, I'm not sure where Will got the idea that the partially copied state
would be visible: it shouldn't ever be visible.  If it is visible, then
we've got problems even with Will's patch applied, because a partial
copy whenever it happens would be visible.

In any case, the above change is wrong: vfp_flush_hwstate() sets
thread->vfpstate.hard.cpu to an invalid CPU number.  Switching the order
as Will has done _undoes_ the effect of vfp_flush_hwstate(), which leads
to the bug you are seeing.

So, I think the correct approach is to revert it.

If the partially copied state _is_ visible somehow, that needs to be
better documented - the original commit fails to indicate how the
partially copied state becomes visible as a result of a context switch.

> 
> Thanks.
> 
>  arch/arm/kernel/ptrace.c | 2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
> index ef9119f7462e..4d9375814b53 100644
> --- a/arch/arm/kernel/ptrace.c
> +++ b/arch/arm/kernel/ptrace.c
> @@ -733,8 +733,8 @@ static int vfp_set(struct task_struct *target,
>  	if (ret)
>  		return ret;
>  
> -	vfp_flush_hwstate(thread);
>  	thread->vfpstate.hard = new_vfp;
> +	vfp_flush_hwstate(thread);
>  
>  	return 0;
>  }
> 
> 
> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Will Deacon May 31, 2016, 1:52 p.m. UTC | #2
On Mon, May 30, 2016 at 11:40:28PM +0100, Russell King - ARM Linux wrote:
> On Mon, May 30, 2016 at 10:35:29PM +0100, Russell King - ARM Linux wrote:
> > With that, on a single CPU, it seems to work correctly every time, but
> > if I bring up a secondary CPU I start seeing the same problems you've
> > reported - which seems to need the following patch to solve.  Please can
> > you check whether this resolves your problem?
> 
> More background behind this patch... I think that commit 8130b9d7b9d8
> ("ARM: 7308/1: vfp: flush thread hwstate before copying ptrace registers")
> was wrong.
> 
> Let's look at what's happening.  The above commit had the following effect:
> 
> 	vfp_sync_hwstate();
> 	new_vfp = thread->vfpstate.hard;
> 	modify new_vfp (but not new_vfp.cpu)
> +	vfp_flush_hwstate(thread);
> 	thread->vfpstate.hard = new_vfp;
> -	vfp_flush_hwstate(thread);
> 
> Now, the commit above claims that a context switch mid-copy of new_vfp
> into thread->vfpstate.hard may result in corrupted state.
> 
> I'm not quite sure how we could get to that point: the only thread which
> is allowed to touch the traced child's state is the ptracing parent, and
> the ptracing parent can't do anything until after vfp_set() has returned.
> Even if the traced child gets a fatal signal, the point at which the
> signal is delivered to the child is when returning to userspace, which
> means the child must be runnable.  That in turn means that we are not
> in the middle of changing the register state.
> 
> So, I'm not sure where Will got the idea that the partially copied state
> would be visible: it shouldn't ever be visible.  If it is visible, then
> we've got problems even with Will's patch applied, because a partial
> copy whenever it happens would be visible.
> 
> In any case, the above change is wrong: vfp_flush_hwstate() sets
> thread->vfpstate.hard.cpu to an invalid CPU number.  Switching the order
> as Will has done _undoes_ the effect of vfp_flush_hwstate(), which leads
> to the bug you are seeing.
> 
> So, I think the correct approach is to revert it.

Yes, it looks like I was over-zealous in fixing all users of
vfp_flush_hwstate after I fixed the signal handling code. Given that the
child isn't runnable until the parent has finished poking at the register
state, the original code looks fine.

The only thing I'm uncertain of is whether or not PTRACE_SEIZE/PTRACE_LISTEN
allow switching to the child (but even then, how is the parent doing
to issue such a request?).

Will
Russell King (Oracle) May 31, 2016, 2:18 p.m. UTC | #3
On Tue, May 31, 2016 at 02:52:52PM +0100, Will Deacon wrote:
> The only thing I'm uncertain of is whether or not PTRACE_SEIZE/PTRACE_LISTEN
> allow switching to the child (but even then, how is the parent doing
> to issue such a request?).

I can't see how that would be possible - the parent needs to finish
the vfp_set() call before it can issue any others - and there can
only be one tracer.  IOW, if we're in vfp_set(), the thread must
have attached to the child, at which point PTRACE_SEIZE will be
rejected.

PTRACE_LISTEN also requires the child to be in STOP state as well, and
again, can only be issued from the current tracer.
Simon Marchi June 1, 2016, 12:54 p.m. UTC | #4
On 16-05-30 05:35 PM, Russell King - ARM Linux wrote:
> So, the gdb verisons I have here seem to be particularly poor - but with
> some modifications, I can test out on iMX6 by forcing gdb to do the right
> thing - by inserting a couple of "mov r0, r0" instructions after the
> "break_here" label.

I see that problem too with older versions, bisecting shows it has been fixed
in commit

  6e22494e5076 Do not skip prologue for asm (.S) files

in gdb, which is included in gdb 7.10 and up.

> With that, on a single CPU, it seems to work correctly every time, but
> if I bring up a secondary CPU I start seeing the same problems you've
> reported - which seems to need the following patch to solve.  Please can
> you check whether this resolves your problem?

Yes that fixes the problem, the test case succeeds every time.  I have stared
at those lines in ptrace.c for some time, but couldn't find the problem.  Thanks
for looking into it!
Russell King (Oracle) June 2, 2016, 1:15 p.m. UTC | #5
On Wed, Jun 01, 2016 at 08:54:05AM -0400, Simon Marchi wrote:
> On 16-05-30 05:35 PM, Russell King - ARM Linux wrote:
> > So, the gdb verisons I have here seem to be particularly poor - but with
> > some modifications, I can test out on iMX6 by forcing gdb to do the right
> > thing - by inserting a couple of "mov r0, r0" instructions after the
> > "break_here" label.
> 
> I see that problem too with older versions, bisecting shows it has been fixed
> in commit
> 
>   6e22494e5076 Do not skip prologue for asm (.S) files
> 
> in gdb, which is included in gdb 7.10 and up.
> 
> > With that, on a single CPU, it seems to work correctly every time, but
> > if I bring up a secondary CPU I start seeing the same problems you've
> > reported - which seems to need the following patch to solve.  Please can
> > you check whether this resolves your problem?
> 
> Yes that fixes the problem, the test case succeeds every time.  I have stared
> at those lines in ptrace.c for some time, but couldn't find the problem.  Thanks
> for looking into it!

Hi, can I add a:

Tested-by: Simon Marchi <simon.marchi@ericsson.com>

tag to the commit please?

Many thanks.
Simon Marchi June 2, 2016, 1:17 p.m. UTC | #6
On 16-06-02 09:15 AM, Russell King - ARM Linux wrote:
> Hi, can I add a:
> 
> Tested-by: Simon Marchi <simon.marchi@ericsson.com>
> 
> tag to the commit please?

Yes, of course.
diff mbox

Patch

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index ef9119f7462e..4d9375814b53 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -733,8 +733,8 @@  static int vfp_set(struct task_struct *target,
 	if (ret)
 		return ret;
 
-	vfp_flush_hwstate(thread);
 	thread->vfpstate.hard = new_vfp;
+	vfp_flush_hwstate(thread);
 
 	return 0;
 }