diff mbox

[Bisected] 3.7-rc1 can't resume (still present in 3.9)

Message ID 1487368.CBCgti1nd5@vostro.rjw.lan (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Rafael Wysocki May 2, 2013, 11:29 p.m. UTC
On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> On 05-02 02:45, Rafael J. Wysocki wrote:
> > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > Well, you could give me instructions on how to debug this (I'll do 
> > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > would worth the effort since this bug is somehow critical.
> > > > 
> > > > Best regards, Jonas
> > > 
> > > I'll put together a debug patch unless I can trick Rafael into doing
> > > it first...
> > 
> > I'm afraid that code has changed quite a bit since I looked at it last time.
> > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> > 
> > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > uses a different register, which shouldn't matter)?  Does it still help then?
> 
> Hello Rafel, first of all, thank you for helping me out :)
> You're right, the patch still solves the suspend bug, after removing the first 
> hunk of the patch and applying it (see attachement:
> suspendfix_first_hunk_dropped.patch).
> 
> > 
> > If so, there are still a few things you can do to it, e.g:
> > (1) drop the
> > 
> > -       btl     $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > -       jnc     1f
> > 
> 
> Still works :) (used suspendfix_1.patch)
> 
> > lines,
> > (2) drop the
> > 
> > -       btl     $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > -       jnc     1f
> > 
> > lines,
> 
> Still works :) (used suspendfix_2.patch)
> 
> > (3) drop the
> > 
> > +       jecxz   1f
> > 
> 
> Still works :) (used suspendfix_3.patch)
> 
> > line,
> > (4) drop the
> > 
> > +       movl    %eax, %ecx
> > +       orl     %edx, %ecx
> > +       jz      1f
> > 
> 
> At this point, the bug reoccurs (used suspendfix_4.patch)! 
> But that doesn't mean these lines are the only critical, because the more
> minimal patch
> 
> @@ -119,6 +119,9 @@
>         jnc     1f
>         movl    pmode_efer, %eax
>         movl    pmode_efer + 4, %edx
> +       movl    %eax, %ecx
> +       orl     %edx, %ecx
> +       jz      1f
>         movl    $MSR_EFER, %ecx
>         wrmsr
>  1:
> 
> 
> with removing this part
> 
> -       movl    pmode_cr4, %eax
> -       movl    %eax, %cr4
> +       movl    pmode_cr4, %ecx
> +       movl    %ecx, %cr4
> 
> also doesn't fix the issue (see suspendfix_5.patch).
> 
> > lines and see what the minimal patch needed for things to work again is.
> > 
> 
> So the most minimal working patch is suspendfix_3.patch.

Thanks for doing that detective work!

The only explanation of why this particular patch can help that seems viable to
us at the moment is that we have a memory corruption in the code region modified
by it and the patch simply changes the alignment of the instructions that don't
get corrupted.

It looks like this may be verified by putting a bunch of nops into the region
in question, so can you please check if the attached patch helps too?

Rafael

Comments

Jonas Heinrich May 3, 2013, 11:07 a.m. UTC | #1
On 05-03 01:29, Rafael J. Wysocki wrote:
> On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > Well, you could give me instructions on how to debug this (I'll do 
> > > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > > would worth the effort since this bug is somehow critical.
> > > > > 
> > > > > Best regards, Jonas
> > > > 
> > > > I'll put together a debug patch unless I can trick Rafael into doing
> > > > it first...
> > > 
> > > I'm afraid that code has changed quite a bit since I looked at it last time.
> > > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> > > 
> > > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > > uses a different register, which shouldn't matter)?  Does it still help then?
> > 
> > Hello Rafel, first of all, thank you for helping me out :)
> > You're right, the patch still solves the suspend bug, after removing the first 
> > hunk of the patch and applying it (see attachement:
> > suspendfix_first_hunk_dropped.patch).
> > 
> > > 
> > > If so, there are still a few things you can do to it, e.g:
> > > (1) drop the
> > > 
> > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > -       jnc     1f
> > > 
> > 
> > Still works :) (used suspendfix_1.patch)
> > 
> > > lines,
> > > (2) drop the
> > > 
> > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > -       jnc     1f
> > > 
> > > lines,
> > 
> > Still works :) (used suspendfix_2.patch)
> > 
> > > (3) drop the
> > > 
> > > +       jecxz   1f
> > > 
> > 
> > Still works :) (used suspendfix_3.patch)
> > 
> > > line,
> > > (4) drop the
> > > 
> > > +       movl    %eax, %ecx
> > > +       orl     %edx, %ecx
> > > +       jz      1f
> > > 
> > 
> > At this point, the bug reoccurs (used suspendfix_4.patch)! 
> > But that doesn't mean these lines are the only critical, because the more
> > minimal patch
> > 
> > @@ -119,6 +119,9 @@
> >         jnc     1f
> >         movl    pmode_efer, %eax
> >         movl    pmode_efer + 4, %edx
> > +       movl    %eax, %ecx
> > +       orl     %edx, %ecx
> > +       jz      1f
> >         movl    $MSR_EFER, %ecx
> >         wrmsr
> >  1:
> > 
> > 
> > with removing this part
> > 
> > -       movl    pmode_cr4, %eax
> > -       movl    %eax, %cr4
> > +       movl    pmode_cr4, %ecx
> > +       movl    %ecx, %cr4
> > 
> > also doesn't fix the issue (see suspendfix_5.patch).
> > 
> > > lines and see what the minimal patch needed for things to work again is.
> > > 
> > 
> > So the most minimal working patch is suspendfix_3.patch.
> 
> Thanks for doing that detective work!
> 
> The only explanation of why this particular patch can help that seems viable to
> us at the moment is that we have a memory corruption in the code region modified
> by it and the patch simply changes the alignment of the instructions that don't
> get corrupted.
> 
> It looks like this may be verified by putting a bunch of nops into the region
> in question, so can you please check if the attached patch helps too?

Unfortunately, your attached patch doesn't seem to fix the bug. 
Hope you still have some ideas to address this issue :)

- Jonas
> 
> Rafael
> 
> 
> -- 
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.
Rafael Wysocki May 3, 2013, 11:37 a.m. UTC | #2
On Friday, May 03, 2013 11:07:05 AM Jonas Heinrich wrote:
> On 05-03 01:29, Rafael J. Wysocki wrote:
> > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > Well, you could give me instructions on how to debug this (I'll do 
> > > > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > > > would worth the effort since this bug is somehow critical.
> > > > > > 
> > > > > > Best regards, Jonas
> > > > > 
> > > > > I'll put together a debug patch unless I can trick Rafael into doing
> > > > > it first...
> > > > 
> > > > I'm afraid that code has changed quite a bit since I looked at it last time.
> > > > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> > > > 
> > > > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > > > uses a different register, which shouldn't matter)?  Does it still help then?
> > > 
> > > Hello Rafel, first of all, thank you for helping me out :)
> > > You're right, the patch still solves the suspend bug, after removing the first 
> > > hunk of the patch and applying it (see attachement:
> > > suspendfix_first_hunk_dropped.patch).
> > > 
> > > > 
> > > > If so, there are still a few things you can do to it, e.g:
> > > > (1) drop the
> > > > 
> > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > -       jnc     1f
> > > > 
> > > 
> > > Still works :) (used suspendfix_1.patch)
> > > 
> > > > lines,
> > > > (2) drop the
> > > > 
> > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > -       jnc     1f
> > > > 
> > > > lines,
> > > 
> > > Still works :) (used suspendfix_2.patch)
> > > 
> > > > (3) drop the
> > > > 
> > > > +       jecxz   1f
> > > > 
> > > 
> > > Still works :) (used suspendfix_3.patch)
> > > 
> > > > line,
> > > > (4) drop the
> > > > 
> > > > +       movl    %eax, %ecx
> > > > +       orl     %edx, %ecx
> > > > +       jz      1f
> > > > 
> > > 
> > > At this point, the bug reoccurs (used suspendfix_4.patch)! 
> > > But that doesn't mean these lines are the only critical, because the more
> > > minimal patch
> > > 
> > > @@ -119,6 +119,9 @@
> > >         jnc     1f
> > >         movl    pmode_efer, %eax
> > >         movl    pmode_efer + 4, %edx
> > > +       movl    %eax, %ecx
> > > +       orl     %edx, %ecx
> > > +       jz      1f
> > >         movl    $MSR_EFER, %ecx
> > >         wrmsr
> > >  1:
> > > 
> > > 
> > > with removing this part
> > > 
> > > -       movl    pmode_cr4, %eax
> > > -       movl    %eax, %cr4
> > > +       movl    pmode_cr4, %ecx
> > > +       movl    %ecx, %cr4
> > > 
> > > also doesn't fix the issue (see suspendfix_5.patch).
> > > 
> > > > lines and see what the minimal patch needed for things to work again is.
> > > > 
> > > 
> > > So the most minimal working patch is suspendfix_3.patch.
> > 
> > Thanks for doing that detective work!
> > 
> > The only explanation of why this particular patch can help that seems viable to
> > us at the moment is that we have a memory corruption in the code region modified
> > by it and the patch simply changes the alignment of the instructions that don't
> > get corrupted.
> > 
> > It looks like this may be verified by putting a bunch of nops into the region
> > in question, so can you please check if the attached patch helps too?
> 
> Unfortunately, your attached patch doesn't seem to fix the bug. 
> Hope you still have some ideas to address this issue :)

Well, not really.

We'll need a dump of the wakeup structure from your system I suppose.
I'll try to write a patch to get that over the weekend.

Thanks,
Rafael
Jarkko Sakkinen May 3, 2013, 12:15 p.m. UTC | #3
Hi

On Friday, May 03, 2013 02:07:05 PM Jonas Heinrich wrote:
> On 05-03 01:29, Rafael J. Wysocki wrote:
> > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > Well, you could give me instructions on how to debug this (I'll
> > > > > > do everything ;)) or I could ship you the Thinkpad T43. I guess
> > > > > > this would worth the effort since this bug is somehow critical.
> > > > > > 
> > > > > > Best regards, Jonas
> > > > > 
> > > > > I'll put together a debug patch unless I can trick Rafael into
> > > > > doing it first...
> > > > 
> > > > I'm afraid that code has changed quite a bit since I looked at it
> > > > last time. [Jarkko Sakkinen seems to have worked on it lately,
> > > > CCed.]
> > > > 
> > > > Jonas, I wonder what happens if you drop the first hunk of the patch
> > > > (it just uses a different register, which shouldn't matter)?  Does
> > > > it still help then?
> > > 
> > > Hello Rafel, first of all, thank you for helping me out :)
> > > You're right, the patch still solves the suspend bug, after removing
> > > the first hunk of the patch and applying it (see attachement:
> > > suspendfix_first_hunk_dropped.patch).
> > > 
> > > > If so, there are still a few things you can do to it, e.g:
> > > > (1) drop the
> > > > 
> > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > -       jnc     1f
> > > 
> > > Still works :) (used suspendfix_1.patch)
> > > 
> > > > lines,
> > > > (2) drop the
> > > > 
> > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > -       jnc     1f
> > > > 
> > > > lines,
> > > 
> > > Still works :) (used suspendfix_2.patch)
> > > 
> > > > (3) drop the
> > > > 
> > > > +       jecxz   1f
> > > 
> > > Still works :) (used suspendfix_3.patch)
> > > 
> > > > line,
> > > > (4) drop the
> > > > 
> > > > +       movl    %eax, %ecx
> > > > +       orl     %edx, %ecx
> > > > +       jz      1f
> > > 
> > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > But that doesn't mean these lines are the only critical, because the
> > > more minimal patch
> > > 
> > > @@ -119,6 +119,9 @@
> > > 
> > >         jnc     1f
> > >         movl    pmode_efer, %eax
> > >         movl    pmode_efer + 4, %edx
> > > 
> > > +       movl    %eax, %ecx
> > > +       orl     %edx, %ecx
> > > +       jz      1f
> > > 
> > >         movl    $MSR_EFER, %ecx
> > >         wrmsr
> > >  
> > >  1:
> > > with removing this part
> > > 
> > > -       movl    pmode_cr4, %eax
> > > -       movl    %eax, %cr4
> > > +       movl    pmode_cr4, %ecx
> > > +       movl    %ecx, %cr4
> > > 
> > > also doesn't fix the issue (see suspendfix_5.patch).
> > > 
> > > > lines and see what the minimal patch needed for things to work again
> > > > is.
> > > 
> > > So the most minimal working patch is suspendfix_3.patch.
> > 
> > Thanks for doing that detective work!
> > 
> > The only explanation of why this particular patch can help that seems
> > viable to us at the moment is that we have a memory corruption in the
> > code region modified by it and the patch simply changes the alignment of
> > the instructions that don't get corrupted.
> > 
> > It looks like this may be verified by putting a bunch of nops into the
> > region in question, so can you please check if the attached patch helps
> > too?
> 
> Unfortunately, your attached patch doesn't seem to fix the bug.
> Hope you still have some ideas to address this issue :)

Kind of had to experiment with this since I don't have access to
T43. Did you already try:

- EFER handling only is reverted as it was before 73201dbe.
- CR4 handling only is reverted as it was before 73201dbe.

Thanks.

/Jarkko

> 
> - Jonas
> 
> > Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki May 28, 2013, 9:36 p.m. UTC | #4
On Friday, May 03, 2013 01:37:45 PM Rafael J. Wysocki wrote:
> On Friday, May 03, 2013 11:07:05 AM Jonas Heinrich wrote:
> > On 05-03 01:29, Rafael J. Wysocki wrote:
> > > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > > Well, you could give me instructions on how to debug this (I'll do 
> > > > > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > > > > would worth the effort since this bug is somehow critical.
> > > > > > > 
> > > > > > > Best regards, Jonas
> > > > > > 
> > > > > > I'll put together a debug patch unless I can trick Rafael into doing
> > > > > > it first...
> > > > > 
> > > > > I'm afraid that code has changed quite a bit since I looked at it last time.
> > > > > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> > > > > 
> > > > > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > > > > uses a different register, which shouldn't matter)?  Does it still help then?
> > > > 
> > > > Hello Rafel, first of all, thank you for helping me out :)
> > > > You're right, the patch still solves the suspend bug, after removing the first 
> > > > hunk of the patch and applying it (see attachement:
> > > > suspendfix_first_hunk_dropped.patch).
> > > > 
> > > > > 
> > > > > If so, there are still a few things you can do to it, e.g:
> > > > > (1) drop the
> > > > > 
> > > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > > -       jnc     1f
> > > > > 
> > > > 
> > > > Still works :) (used suspendfix_1.patch)
> > > > 
> > > > > lines,
> > > > > (2) drop the
> > > > > 
> > > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > > -       jnc     1f
> > > > > 
> > > > > lines,
> > > > 
> > > > Still works :) (used suspendfix_2.patch)
> > > > 
> > > > > (3) drop the
> > > > > 
> > > > > +       jecxz   1f
> > > > > 
> > > > 
> > > > Still works :) (used suspendfix_3.patch)
> > > > 
> > > > > line,
> > > > > (4) drop the
> > > > > 
> > > > > +       movl    %eax, %ecx
> > > > > +       orl     %edx, %ecx
> > > > > +       jz      1f
> > > > > 
> > > > 
> > > > At this point, the bug reoccurs (used suspendfix_4.patch)! 
> > > > But that doesn't mean these lines are the only critical, because the more
> > > > minimal patch
> > > > 
> > > > @@ -119,6 +119,9 @@
> > > >         jnc     1f
> > > >         movl    pmode_efer, %eax
> > > >         movl    pmode_efer + 4, %edx
> > > > +       movl    %eax, %ecx
> > > > +       orl     %edx, %ecx
> > > > +       jz      1f
> > > >         movl    $MSR_EFER, %ecx
> > > >         wrmsr
> > > >  1:
> > > > 
> > > > 
> > > > with removing this part
> > > > 
> > > > -       movl    pmode_cr4, %eax
> > > > -       movl    %eax, %cr4
> > > > +       movl    pmode_cr4, %ecx
> > > > +       movl    %ecx, %cr4
> > > > 
> > > > also doesn't fix the issue (see suspendfix_5.patch).
> > > > 
> > > > > lines and see what the minimal patch needed for things to work again is.
> > > > > 
> > > > 
> > > > So the most minimal working patch is suspendfix_3.patch.
> > > 
> > > Thanks for doing that detective work!
> > > 
> > > The only explanation of why this particular patch can help that seems viable to
> > > us at the moment is that we have a memory corruption in the code region modified
> > > by it and the patch simply changes the alignment of the instructions that don't
> > > get corrupted.
> > > 
> > > It looks like this may be verified by putting a bunch of nops into the region
> > > in question, so can you please check if the attached patch helps too?
> > 
> > Unfortunately, your attached patch doesn't seem to fix the bug. 
> > Hope you still have some ideas to address this issue :)
> 
> Well, not really.
> 
> We'll need a dump of the wakeup structure from your system I suppose.
> I'll try to write a patch to get that over the weekend.

Can you try 3.10-rc3 by the way?  There is a fix in there that *might* make
a difference (although the odds are quite significant).

Thanks,
Rafael
Jonas Heinrich July 8, 2013, 9:50 a.m. UTC | #5
On 05-03 15:15, Jarkko Sakkinen wrote:
> Hi
> 
> On Friday, May 03, 2013 02:07:05 PM Jonas Heinrich wrote:
> > On 05-03 01:29, Rafael J. Wysocki wrote:
> > > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > > Well, you could give me instructions on how to debug this (I'll
> > > > > > > do everything ;)) or I could ship you the Thinkpad T43. I guess
> > > > > > > this would worth the effort since this bug is somehow critical.
> > > > > > > 
> > > > > > > Best regards, Jonas
> > > > > > 
> > > > > > I'll put together a debug patch unless I can trick Rafael into
> > > > > > doing it first...
> > > > > 
> > > > > I'm afraid that code has changed quite a bit since I looked at it
> > > > > last time. [Jarkko Sakkinen seems to have worked on it lately,
> > > > > CCed.]
> > > > > 
> > > > > Jonas, I wonder what happens if you drop the first hunk of the patch
> > > > > (it just uses a different register, which shouldn't matter)?  Does
> > > > > it still help then?
> > > > 
> > > > Hello Rafel, first of all, thank you for helping me out :)
> > > > You're right, the patch still solves the suspend bug, after removing
> > > > the first hunk of the patch and applying it (see attachement:
> > > > suspendfix_first_hunk_dropped.patch).
> > > > 
> > > > > If so, there are still a few things you can do to it, e.g:
> > > > > (1) drop the
> > > > > 
> > > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > > -       jnc     1f
> > > > 
> > > > Still works :) (used suspendfix_1.patch)
> > > > 
> > > > > lines,
> > > > > (2) drop the
> > > > > 
> > > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > > -       jnc     1f
> > > > > 
> > > > > lines,
> > > > 
> > > > Still works :) (used suspendfix_2.patch)
> > > > 
> > > > > (3) drop the
> > > > > 
> > > > > +       jecxz   1f
> > > > 
> > > > Still works :) (used suspendfix_3.patch)
> > > > 
> > > > > line,
> > > > > (4) drop the
> > > > > 
> > > > > +       movl    %eax, %ecx
> > > > > +       orl     %edx, %ecx
> > > > > +       jz      1f
> > > > 
> > > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > > But that doesn't mean these lines are the only critical, because the
> > > > more minimal patch
> > > > 
> > > > @@ -119,6 +119,9 @@
> > > > 
> > > >         jnc     1f
> > > >         movl    pmode_efer, %eax
> > > >         movl    pmode_efer + 4, %edx
> > > > 
> > > > +       movl    %eax, %ecx
> > > > +       orl     %edx, %ecx
> > > > +       jz      1f
> > > > 
> > > >         movl    $MSR_EFER, %ecx
> > > >         wrmsr
> > > >  
> > > >  1:
> > > > with removing this part
> > > > 
> > > > -       movl    pmode_cr4, %eax
> > > > -       movl    %eax, %cr4
> > > > +       movl    pmode_cr4, %ecx
> > > > +       movl    %ecx, %cr4
> > > > 
> > > > also doesn't fix the issue (see suspendfix_5.patch).
> > > > 
> > > > > lines and see what the minimal patch needed for things to work again
> > > > > is.
> > > > 
> > > > So the most minimal working patch is suspendfix_3.patch.
> > > 
> > > Thanks for doing that detective work!
> > > 
> > > The only explanation of why this particular patch can help that seems
> > > viable to us at the moment is that we have a memory corruption in the
> > > code region modified by it and the patch simply changes the alignment of
> > > the instructions that don't get corrupted.
> > > 
> > > It looks like this may be verified by putting a bunch of nops into the
> > > region in question, so can you please check if the attached patch helps
> > > too?
> > 
> > Unfortunately, your attached patch doesn't seem to fix the bug.
> > Hope you still have some ideas to address this issue :)
> 
> Kind of had to experiment with this since I don't have access to
> T43. Did you already try:
> 
> - EFER handling only is reverted as it was before 73201dbe.
> - CR4 handling only is reverted as it was before 73201dbe.
Hi Jarkko,
thank you for your response!
Can you please be more specific about that instruction? I don't really
know what to do, sorry :/

@Rafael: Bug still present in kernel 3.10 final :(

- Jonas

> 
> Thanks.
> 
> /Jarkko
> 
> > 
> > - Jonas
> > 
> > > Rafael
Rafael Wysocki July 8, 2013, 1:05 p.m. UTC | #6
On Monday, July 08, 2013 09:50:43 AM Jonas Heinrich wrote:
> On 05-03 15:15, Jarkko Sakkinen wrote:
> > Hi
> > 
> > On Friday, May 03, 2013 02:07:05 PM Jonas Heinrich wrote:
> > > On 05-03 01:29, Rafael J. Wysocki wrote:
> > > > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > > > Well, you could give me instructions on how to debug this (I'll
> > > > > > > > do everything ;)) or I could ship you the Thinkpad T43. I guess
> > > > > > > > this would worth the effort since this bug is somehow critical.
> > > > > > > > 
> > > > > > > > Best regards, Jonas
> > > > > > > 
> > > > > > > I'll put together a debug patch unless I can trick Rafael into
> > > > > > > doing it first...
> > > > > > 
> > > > > > I'm afraid that code has changed quite a bit since I looked at it
> > > > > > last time. [Jarkko Sakkinen seems to have worked on it lately,
> > > > > > CCed.]
> > > > > > 
> > > > > > Jonas, I wonder what happens if you drop the first hunk of the patch
> > > > > > (it just uses a different register, which shouldn't matter)?  Does
> > > > > > it still help then?
> > > > > 
> > > > > Hello Rafel, first of all, thank you for helping me out :)
> > > > > You're right, the patch still solves the suspend bug, after removing
> > > > > the first hunk of the patch and applying it (see attachement:
> > > > > suspendfix_first_hunk_dropped.patch).
> > > > > 
> > > > > > If so, there are still a few things you can do to it, e.g:
> > > > > > (1) drop the
> > > > > > 
> > > > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > > > -       jnc     1f
> > > > > 
> > > > > Still works :) (used suspendfix_1.patch)
> > > > > 
> > > > > > lines,
> > > > > > (2) drop the
> > > > > > 
> > > > > > -       btl     $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > > > -       jnc     1f
> > > > > > 
> > > > > > lines,
> > > > > 
> > > > > Still works :) (used suspendfix_2.patch)
> > > > > 
> > > > > > (3) drop the
> > > > > > 
> > > > > > +       jecxz   1f
> > > > > 
> > > > > Still works :) (used suspendfix_3.patch)
> > > > > 
> > > > > > line,
> > > > > > (4) drop the
> > > > > > 
> > > > > > +       movl    %eax, %ecx
> > > > > > +       orl     %edx, %ecx
> > > > > > +       jz      1f
> > > > > 
> > > > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > > > But that doesn't mean these lines are the only critical, because the
> > > > > more minimal patch
> > > > > 
> > > > > @@ -119,6 +119,9 @@
> > > > > 
> > > > >         jnc     1f
> > > > >         movl    pmode_efer, %eax
> > > > >         movl    pmode_efer + 4, %edx
> > > > > 
> > > > > +       movl    %eax, %ecx
> > > > > +       orl     %edx, %ecx
> > > > > +       jz      1f
> > > > > 
> > > > >         movl    $MSR_EFER, %ecx
> > > > >         wrmsr
> > > > >  
> > > > >  1:
> > > > > with removing this part
> > > > > 
> > > > > -       movl    pmode_cr4, %eax
> > > > > -       movl    %eax, %cr4
> > > > > +       movl    pmode_cr4, %ecx
> > > > > +       movl    %ecx, %cr4
> > > > > 
> > > > > also doesn't fix the issue (see suspendfix_5.patch).
> > > > > 
> > > > > > lines and see what the minimal patch needed for things to work again
> > > > > > is.
> > > > > 
> > > > > So the most minimal working patch is suspendfix_3.patch.
> > > > 
> > > > Thanks for doing that detective work!
> > > > 
> > > > The only explanation of why this particular patch can help that seems
> > > > viable to us at the moment is that we have a memory corruption in the
> > > > code region modified by it and the patch simply changes the alignment of
> > > > the instructions that don't get corrupted.
> > > > 
> > > > It looks like this may be verified by putting a bunch of nops into the
> > > > region in question, so can you please check if the attached patch helps
> > > > too?
> > > 
> > > Unfortunately, your attached patch doesn't seem to fix the bug.
> > > Hope you still have some ideas to address this issue :)
> > 
> > Kind of had to experiment with this since I don't have access to
> > T43. Did you already try:
> > 
> > - EFER handling only is reverted as it was before 73201dbe.
> > - CR4 handling only is reverted as it was before 73201dbe.
> Hi Jarkko,
> thank you for your response!
> Can you please be more specific about that instruction? I don't really
> know what to do, sorry :/
> 
> @Rafael: Bug still present in kernel 3.10 final :(

That's because no one knows how to fix it.  Actually, we don't even know
what the root cause of it is.  Sorry about that.

Are you still able to make things work again by reverting the commit you
bisected to?

Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
H. Peter Anvin July 15, 2013, 9:11 p.m. UTC | #7
On 07/08/2013 06:05 AM, Rafael J. Wysocki wrote:
> 
> That's because no one knows how to fix it.  Actually, we don't even know
> what the root cause of it is.  Sorry about that.
> 
> Are you still able to make things work again by reverting the commit you
> bisected to?
> 

Well, we know now.   Hopefully we should have the fix in 3.11-rc2 and
then on its way to stable.

	-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

---
 arch/x86/realmode/rm/wakeup_asm.S |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Index: linux-pm/arch/x86/realmode/rm/wakeup_asm.S
===================================================================
--- linux-pm.orig/arch/x86/realmode/rm/wakeup_asm.S
+++ linux-pm/arch/x86/realmode/rm/wakeup_asm.S
@@ -117,6 +117,38 @@  ENTRY(wakeup_start)
 1:
 	btl	$WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
 	jnc	1f
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
 	movl	pmode_efer, %eax
 	movl	pmode_efer + 4, %edx
 	movl	$MSR_EFER, %ecx