
v4.10-rc8 (-rc6) boot regression on Intel desktop, does not boot after cold boots, boots after reboot

Message ID 20170214175956.GA3587@amd (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Pavel Machek Feb. 14, 2017, 5:59 p.m. UTC
Hi!

> > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > but I'll have to double check.
> > > 
> > > But all the kernel versions worked when the keyboard was plugged into
> > > its original USB port?
> > 
> > Aha. So it looks like the difference is probably not in "where is
> > keyboard plugged in" but in "reboot" vs. "cold boot". I did not do a
> > cold boot in quite a while :-(.
> > 
> > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > 
> > It happens with current Linus' tree.
> 
> v4.10-rc6-feb3 : broken
> v4.9 : ok
> (v4.6 : ok)

Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   

With debug patch below, I get

...1d.7: PCI fixup... pass 2
...1d.7: PCI fixup... pass 3
...1d.7: PCI fixup... pass 3 done

...followed by hang. So yes, it looks USB related.

(Sometimes it hangs with some kind of backtrace involving secondary CPU
startup; unfortunately, the useful info is off screen at that point.)

Any ideas?
								Pavel

Comments

Pavel Machek Feb. 14, 2017, 7:27 p.m. UTC | #1
On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> Hi!
> 
> > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > but I'll have to double check.
> > > > 
> > > > But all the kernel versions worked when the keyboard was plugged into
> > > > its original USB port?
> > > 
> > > Aha. So it looks difference is probably in "where is keyboard plugged
> > > in" but in "reboot" vs. "cold boot". I did not do a cold boot in quite
> > > a while :-(.
> > > 
> > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > 
> > > It happens with current Linus' tree.
> > 
> > v4.10-rc6-feb3 : broken
> > v4.9 : ok
> > (v4.6 : ok)
> 
> Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   
> 
> With debug patch below, I get
> 
> ...1d.7: PCI fixup... pass 2
> ...1d.7: PCI fixup... pass 3
> ...1d.7: PCI fixup... pass 3 done
> 
> ...followed by hang. So yes, it looks USB related.
> 
> (Sometimes it hangs with some kind backtrace involving secondary CPU
> startup, unfortunately useful info is off screen at that point).

Forgot to say, 1d.7 is the EHCI controller.

00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
Controller (rev 01)

									Pavel
Alan Stern Feb. 14, 2017, 7:54 p.m. UTC | #2
On Tue, 14 Feb 2017, Pavel Machek wrote:

> On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> > Hi!
> > 
> > > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > > but I'll have to double check.
> > > > > 
> > > > > But all the kernel versions worked when the keyboard was plugged into
> > > > > its original USB port?
> > > > 
> > > > Aha. So it looks difference is probably in "where is keyboard plugged
> > > > in" but in "reboot" vs. "cold boot". I did not do a cold boot in quite
> > > > a while :-(.
> > > > 
> > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > > 
> > > > It happens with current Linus' tree.
> > > 
> > > v4.10-rc6-feb3 : broken
> > > v4.9 : ok
> > > (v4.6 : ok)
> > 
> > Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   
> > 
> > With debug patch below, I get
> > 
> > ...1d.7: PCI fixup... pass 2
> > ...1d.7: PCI fixup... pass 3
> > ...1d.7: PCI fixup... pass 3 done
> > 
> > ...followed by hang. So yes, it looks USB related.
> > 
> > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > startup, unfortunately useful info is off screen at that point).
> 
> Forgot to say, 1d.7 is EHCI controller.
> 
> 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> Controller (rev 01)

So this looks like a problem in the PCI subsystem affecting a USB
controller.

Linus is right; bisection is the best approach now that you know a
reliable trigger.

Alan Stern
Pavel Machek Feb. 15, 2017, 5:23 p.m. UTC | #3
On Tue 2017-02-14 11:12:26, Linus Torvalds wrote:
> On Feb 14, 2017 9:59 AM, "Pavel Machek" <pavel@ucw.cz> wrote:
> 
> Hi!
> 
> > >
> > > Booting to grub, then hitting ctrl-alt-del is enough to make it work.
> Ouch.
> > >
> > > It happens with current Linus' tree.
> >
> > v4.10-rc6-feb3 : broken
> > v4.9 : ok
> 
> I wonder if you could bisect it now that you've figured out the rules for
> when it breaks...

I guess that's what I'll need to do. It is my main machine, so it is a
bit painful.

Anyway, it seems that "nosmp" makes it hang at a similar place, and
makes it hang reliably, after either a reboot or a cold poweroff. So I
guess that's what I'll use for bisection -- should be possible to do
automatically that way.

> I don't think I've seen any similar reports, so we don't have a lot of
> clues to go by otherwise, I think.

:-(.
								Pavel
Pavel Machek Feb. 15, 2017, 11:20 p.m. UTC | #4
On Wed 2017-02-15 18:23:03, Pavel Machek wrote:
> On Tue 2017-02-14 11:12:26, Linus Torvalds wrote:
> > On Feb 14, 2017 9:59 AM, "Pavel Machek" <pavel@ucw.cz> wrote:
> > 
> > Hi!
> > 
> > > >
> > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work.
> > Ouch.
> > > >
> > > > It happens with current Linus' tree.
> > >
> > > v4.10-rc6-feb3 : broken
> > > v4.9 : ok
> > 
> > I wonder if you could bisect it now that you've figured out the rules for
> > when it breaks...
> 
> I guess that's what I'll need to do. It is my main machine, so it is a
> bit painful.
> 
> Anyway, it seems that "nosmp" makes it hang at similar place, but
> makes it hang reliably, reboot or cold poweroff. So I guess that's
> what I'll use for bisection -- should be possible to do automatically
> that way.

I was mistaken. "nosmp" does not seem to make the hang reliable.

my-4.10-r8+ broken
4.10-rc8 broken
4.10-rc4 broken
4.10-rc3 ok
4.10-rc2 ok?

I started bisect, 168 revisions to go.

								Pavel
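
A rough way to see what "168 revisions to go" costs: each build-and-boot
test halves the remaining candidates, so the number of tests grows with the
base-2 logarithm of the revision count. The sketch below is not from the
thread; bisect_steps is a hypothetical helper, and it assumes an ideal
halving at every step, which merge-heavy kernel history only approximates.

```shell
#!/bin/sh
# bisect_steps: hypothetical helper, not part of git or of this thread.
# Prints how many halvings are needed to narrow N candidates down to one,
# assuming every test splits the remaining range evenly.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))    # round up: the larger half may remain
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 168
```

For 168 revisions this prints 8, so on the order of eight cold boots,
provided every good/bad verdict is reliable.
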
Linus Torvalds Feb. 15, 2017, 11:34 p.m. UTC | #5
On Wed, Feb 15, 2017 at 3:20 PM, Pavel Machek <pavel@ucw.cz> wrote:
> 4.10-rc4 broken
> 4.10-rc3 ok

Hmm. If those actually end up being reliable, then there's not a whole
lot in between them wrt PCI or USB.

What looked like the most likely candidate seems to be xhci-specific, though.

But maybe it's something that isn't directly in drivers/{pci,usb}/ and
just interacts badly.

                 Linus
Pavel Machek Feb. 16, 2017, 11:11 a.m. UTC | #6
Hi!

On Wed 2017-02-15 15:34:27, Linus Torvalds wrote:
> On Wed, Feb 15, 2017 at 3:20 PM, Pavel Machek <pavel@ucw.cz> wrote:
> > 4.10-rc4 broken
> > 4.10-rc3 ok
> 
> Hmm. If those actually end up being reliable, then there's not a whole
> lot in between them wrt PCI or USB.
> 
> What looked like the most likely candidate seems to be xhci-specific, though.
> 
> But maybe it's something that isn't directly in drivers/{pci,usb}/ and
> just interacts badly.

Ok. I _hope_ my tests are ok. Bisect log so far is:

pavel@half:/data/l/linux$ git bisect log
# bad: [49def1853334396f948dcb4cedb9347abb318df5] Linux 4.10-rc4
# good: [a121103c922847ba5010819a3f250f1f7fc84ab8] Linux 4.10-rc3
git bisect start 'v4.10-rc4' 'v4.10-rc3'
# good: [557ed56cc75e0a33c15ba438734a280bac23bd32] Merge tag 'sound-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 557ed56cc75e0a33c15ba438734a280bac23bd32
# good: [f4d3935e4f4884ba80561db5549394afb8eef8f7] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good f4d3935e4f4884ba80561db5549394afb8eef8f7
# bad: [83346fbc07d267de777e2597552f785174ad0373] Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 83346fbc07d267de777e2597552f785174ad0373
# good: [18e7a45af91acdde99d3aa1372cc40e1f8142f7b] perf/x86: Reject non sampling events with precise_ip
git bisect good 18e7a45af91acdde99d3aa1372cc40e1f8142f7b
# good: [84936118bdf37bda513d4a361c38181a216427e0] x86/unwind: Disable KASAN checks for non-current tasks
git bisect good 84936118bdf37bda513d4a361c38181a216427e0
# good: [79078c53baabee12dfefb0cfe00ca94cb2c35570] Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 79078c53baabee12dfefb0cfe00ca94cb2c35570
# good: [695085b4bc7603551db0b3da897b8bf9893ca218] x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()
git bisect good 695085b4bc7603551db0b3da897b8bf9893ca218

I should go now, but I should be able to finish it today.

Best regards,
								Pavel
Pavel Machek Feb. 16, 2017, 5:25 p.m. UTC | #7
Hi!

> > > 4.10-rc4 broken
> > > 4.10-rc3 ok
> > 
> > Hmm. If those actually end up being reliable, then there's not a whole
> > lot in between them wrt PCI or USB.
> > 
> > What looked like the most likely candidate seems to be xhci-specific, though.
> > 
> > But maybe it's something that isn't directly in drivers/{pci,usb}/ and
> > just interacts badly.
> 
> Ok. I _hope_ my tests are ok. Bisect log so far is:

And the winner is:

pavel@half:/data/l/linux$ git bisect bad
24b91e360ef521a2808771633d76ebc68bd5604b is the first bad commit
commit 24b91e360ef521a2808771633d76ebc68bd5604b
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date:   Wed Jan 4 15:12:04 2017 +0100

    nohz: Fix collision between tick and other hrtimers
    

									Pavel
Frederic Weisbecker Feb. 16, 2017, 6:13 p.m. UTC | #8
On Thu, Feb 16, 2017 at 06:25:35PM +0100, Pavel Machek wrote:
> Hi!
> 
> > > > 4.10-rc4 broken
> > > > 4.10-rc3 ok
> > > 
> > > Hmm. If those actually end up being reliable, then there's not a whole
> > > lot in between them wrt PCI or USB.
> > > 
> > > What looked like the most likely candidate seems to be xhci-specific, though.
> > > 
> > > But maybe it's something that isn't directly in drivers/{pci,usb}/ and
> > > just interacts badly.
> > 
> > Ok. I _hope_ my tests are ok. Bisect log so far is:
> 
> And the winner is:
> 
> pavel@half:/data/l/linux$ git bisect bad
> 24b91e360ef521a2808771633d76ebc68bd5604b is the first bad commit
> commit 24b91e360ef521a2808771633d76ebc68bd5604b
> Author: Frederic Weisbecker <fweisbec@gmail.com>
> Date:   Wed Jan 4 15:12:04 2017 +0100
> 
>     nohz: Fix collision between tick and other hrtimers

I haven't followed the discussion but this patch has a known issue which is fixed
with:
    7bdb59f1ad474bd7161adc8f923cdef10f2638d1
    "tick/nohz: Fix possible missing clock reprog after tick soft restart"

I hope this fixes your issue.
Linus Torvalds Feb. 16, 2017, 6:20 p.m. UTC | #9
On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
<fweisbec@gmail.com> wrote:
>
> I haven't followed the discussion but this patch has a known issue which is fixed
> with:
>     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
>     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
>
> I hope this fixes your issue.

No, Pavel saw the problem with rc8 too, which already has that fix.

So I think we'll just need to revert that original patch (and that
means that we have to revert the commit you point to as well, since
that ->next_tick field was added by the original commit).

Pavel, can you verify that rc8 with both

  24b91e360ef521a2808771633d76ebc68bd5604b
  7bdb59f1ad474bd7161adc8f923cdef10f2638d1

reverted works reliably for you?

               Linus
Frederic Weisbecker Feb. 16, 2017, 6:34 p.m. UTC | #10
On Thu, Feb 16, 2017 at 10:20:14AM -0800, Linus Torvalds wrote:
> On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
> <fweisbec@gmail.com> wrote:
> >
> > I haven't followed the discussion but this patch has a known issue which is fixed
> > with:
> >     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
> >     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
> >
> > I hope this fixes your issue.
> 
> No, Pavel saw the problem with rc8 too, which already has that fix.
> 
> So I think we'll just need to revert that original patch (and that
> means that we have to revert the commit you point to as well, since
> that ->next_tick field was added by the original commit).

Aw, too bad, but indeed, this late we don't have a choice.
Pavel Machek Feb. 16, 2017, 7:06 p.m. UTC | #11
On Thu 2017-02-16 18:25:35, Pavel Machek wrote:
> Hi!
> 
> > > > 4.10-rc4 broken
> > > > 4.10-rc3 ok
> > > 
> > > Hmm. If those actually end up being reliable, then there's not a whole
> > > lot in between them wrt PCI or USB.
> > > 
> > > What looked like the most likely candidate seems to be xhci-specific, though.
> > > 
> > > But maybe it's something that isn't directly in drivers/{pci,usb}/ and
> > > just interacts badly.
> > 
> > Ok. I _hope_ my tests are ok. Bisect log so far is:
> 
> And the winner is:
> 
> pavel@half:/data/l/linux$ git bisect bad
> 24b91e360ef521a2808771633d76ebc68bd5604b is the first bad commit
> commit 24b91e360ef521a2808771633d76ebc68bd5604b
> Author: Frederic Weisbecker <fweisbec@gmail.com>
> Date:   Wed Jan 4 15:12:04 2017 +0100
> 
>     nohz: Fix collision between tick and other hrtimers
>     

I had to revert 7bdb59f1ad474bd7161adc8f923cdef10f2638d1, too,
otherwise -rc8 does not compile.

With 24b91e360ef521a28087716 and 7bdb59f1ad474 reverted, it seems to
boot ok. (I did a few tries.)

Best regards,
								Pavel
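
The order of the two reverts matters: 7bdb59f1ad47 touches the ->next_tick
field that 24b91e360ef5 introduced, so the later fix has to be reverted
first. A minimal sketch, assuming a v4.10-rc8 checkout; revert_newest_first
and the GIT dry-run variable are hypothetical names, not something used in
the thread.

```shell
#!/bin/sh
# revert_newest_first: hypothetical helper, not from the thread.
# Takes commits listed oldest first and reverts them in reverse order,
# so a follow-up fix is undone before the commit it depends on.
# Set GIT=echo to print the git arguments instead of running git.
revert_newest_first() {
    reversed=""
    for c in "$@"; do
        reversed="$c $reversed"
    done
    for c in $reversed; do
        ${GIT:-git} revert --no-edit "$c" || return 1
    done
}

# Usage inside a v4.10-rc8 checkout (hashes as quoted in the thread):
#   revert_newest_first 24b91e360ef521a2808771633d76ebc68bd5604b \
#                       7bdb59f1ad474bd7161adc8f923cdef10f2638d1
```

Stopping at the first failed revert leaves the tree in a state that can be
inspected before anything else is changed.
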
Thomas Gleixner Feb. 16, 2017, 7:34 p.m. UTC | #12
On Thu, 16 Feb 2017, Frederic Weisbecker wrote:
> On Thu, Feb 16, 2017 at 10:20:14AM -0800, Linus Torvalds wrote:
> > On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
> > <fweisbec@gmail.com> wrote:
> > >
> > > I haven't followed the discussion but this patch has a known issue which is fixed
> > > with:
> > >     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
> > >     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
> > >
> > > I hope this fixes your issue.
> > 
> > No, Pavel saw the problem with rc8 too, which already has that fix.
> > 
> > So I think we'll just need to revert that original patch (and that
> > means that we have to revert the commit you point to as well, since
> > that ->next_tick field was added by the original commit).
> 
> Aw too bad, but indeed that late we don't have the choice.

Hint: Look for CPU hotplug interaction of these patches. I bet something
becomes stale when the CPU goes down and does not get reset when it comes
back online.

Thanks,

	tglx
Pavel Machek Feb. 16, 2017, 8:06 p.m. UTC | #13
On Thu 2017-02-16 20:34:45, Thomas Gleixner wrote:
> On Thu, 16 Feb 2017, Frederic Weisbecker wrote:
> > On Thu, Feb 16, 2017 at 10:20:14AM -0800, Linus Torvalds wrote:
> > > On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
> > > <fweisbec@gmail.com> wrote:
> > > >
> > > > I haven't followed the discussion but this patch has a known issue which is fixed
> > > > with:
> > > >     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
> > > >     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
> > > >
> > > > I hope this fixes your issue.
> > > 
> > > No, Pavel saw the problem with rc8 too, which already has that fix.
> > > 
> > > So I think we'll just need to revert that original patch (and that
> > > means that we have to revert the commit you point to as well, since
> > > that ->next_tick field was added by the original commit).

(I already said that elsewhere, but yes, revert of 7bdb59f1ad474b and
24b91e360ef5 fixes boot problems for me. Hmm, and 24b9 was marked for
stable... I don't know how to contact all the stable maintainers, but
probably it should not go to stable just yet...) 

> > Aw too bad, but indeed that late we don't have the choice.
> 
> Hint: Look for CPU hotplug interaction of these patches. I bet something
> becomes stale when the CPU goes down and does not get reset when it comes
> back online.

Hmm, that would explain problems at boot _and_ problems during
suspend/resume.

Note that this can be used to test the hotplug...

 cd /sys/devices/system/cpu/cpu1
 while true; do echo 0 > online; echo 1 > online; done

									Pavel
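
The loop above runs until it is interrupted. A bounded variant can be
handier when a hang is expected and the iteration count is worth knowing.
This is only a sketch: toggle_cpu is a hypothetical helper, and it assumes
root privileges and a CPU that the kernel allows to be taken offline.

```shell
#!/bin/sh
# toggle_cpu: hypothetical helper, not from the thread.
# Writes 0 then 1 to the given sysfs "online" file COUNT times and
# prints the number of completed offline/online cycles.
toggle_cpu() {
    online_file=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo 0 > "$online_file" || return 1
        echo 1 > "$online_file" || return 1
        i=$((i + 1))
    done
    echo "$i"
}

# Usage (as root):
#   toggle_cpu /sys/devices/system/cpu/cpu1/online 100
```

If the machine hangs partway through, the cycle count is never printed, so
logging each iteration to a serial console may be more useful in practice.
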
Linus Torvalds Feb. 16, 2017, 8:21 p.m. UTC | #14
On Thu, Feb 16, 2017 at 12:06 PM, Pavel Machek <pavel@ucw.cz> wrote:
>
> Hmm, that would explain problems at boot _and_ problems during
> suspend/resume.

I've committed the revert, and I'm just assuming that the revert also
fixed your suspend/resume issues, but I wanted to just double-check
that since it's only implied, not stated explicitly.

     Linus
Pavel Machek Feb. 16, 2017, 8:48 p.m. UTC | #15
On Thu 2017-02-16 12:21:13, Linus Torvalds wrote:
> On Thu, Feb 16, 2017 at 12:06 PM, Pavel Machek <pavel@ucw.cz> wrote:
> >
> > Hmm, that would explain problems at boot _and_ problems during
> > suspend/resume.
> 
> I've committed the revert, and I'm just assuming that the revert also
> fixed your suspend/resume issues, but I wanted to just double-check
> that since it's only implied, not stated explicitly.

Thanks!

I don't yet know if the suspend/resume issues are fixed. Those are somewhat
tricky to reproduce -- fun stuff does not happen on every suspend. I
should know within a week or so...

									Pavel
Greg KH Feb. 17, 2017, 1:11 a.m. UTC | #16
On Thu, Feb 16, 2017 at 09:06:24PM +0100, Pavel Machek wrote:
> On Thu 2017-02-16 20:34:45, Thomas Gleixner wrote:
> > On Thu, 16 Feb 2017, Frederic Weisbecker wrote:
> > > On Thu, Feb 16, 2017 at 10:20:14AM -0800, Linus Torvalds wrote:
> > > > On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
> > > > <fweisbec@gmail.com> wrote:
> > > > >
> > > > > I haven't followed the discussion but this patch has a known issue which is fixed
> > > > > with:
> > > > >     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
> > > > >     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
> > > > >
> > > > > I hope this fixes your issue.
> > > > 
> > > > No, Pavel saw the problem with rc8 too, which already has that fix.
> > > > 
> > > > So I think we'll just need to revert that original patch (and that
> > > > means that we have to revert the commit you point to as well, since
> > > > that ->next_tick field was added by the original commit).
> 
> (I already said that elsewhere, but yes, revert of 7bdb59f1ad474b and
> 24b91e360ef5 fixes boot problems for me. Hmm, and 24b9 was marked for
> stable... I don't know how to contact all the stable maintainers, but
> probably it should not go to stable just yet...) 

It tried to get into the stable trees, but it broke the build, so it was
dropped.  So the stable trees are safe for now.

thanks,

greg k-h
Frederic Weisbecker Feb. 17, 2017, 2:04 p.m. UTC | #17
On Thu, Feb 16, 2017 at 08:34:45PM +0100, Thomas Gleixner wrote:
> On Thu, 16 Feb 2017, Frederic Weisbecker wrote:
> > On Thu, Feb 16, 2017 at 10:20:14AM -0800, Linus Torvalds wrote:
> > > On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
> > > <fweisbec@gmail.com> wrote:
> > > >
> > > > I haven't followed the discussion but this patch has a known issue which is fixed
> > > > with:
> > > >     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
> > > >     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
> > > >
> > > > I hope this fixes your issue.
> > > 
> > > No, Pavel saw the problem with rc8 too, which already has that fix.
> > > 
> > > So I think we'll just need to revert that original patch (and that
> > > means that we have to revert the commit you point to as well, since
> > > that ->next_tick field was added by the original commit).
> > 
> > Aw too bad, but indeed that late we don't have the choice.
> 
> Hint: Look for CPU hotplug interaction of these patches. I bet something
> becomes stale when the CPU goes down and does not get reset when it comes
> back online.

Indeed I should check that. But Pavel is seeing this on boot, where the
only hotplug operations that happen are CPU UP without preceding CPU DOWN
that may have retained stale values. I think the value of ts->next_tick should
be initially 0 for all CPUs. So perhaps that 0 value confuses stuff. But
looking at the code I don't see how. It may be something more subtle.
Frederic Weisbecker Feb. 17, 2017, 2:40 p.m. UTC | #18
On Thu, Feb 16, 2017 at 08:06:04PM +0100, Pavel Machek wrote:
> On Thu 2017-02-16 18:25:35, Pavel Machek wrote:
> > Hi!
> > 
> > > > > 4.10-rc4 broken
> > > > > 4.10-rc3 ok
> > > > 
> > > > Hmm. If those actually end up being reliable, then there's not a whole
> > > > lot in between them wrt PCI or USB.
> > > > 
> > > > What looked like the most likely candidate seems to be xhci-specific, though.
> > > > 
> > > > But maybe it's something that isn't directly in drivers/{pci,usb}/ and
> > > > just interacts badly.
> > > 
> > > Ok. I _hope_ my tests are ok. Bisect log so far is:
> > 
> > And the winner is:
> > 
> > pavel@half:/data/l/linux$ git bisect bad
> > 24b91e360ef521a2808771633d76ebc68bd5604b is the first bad commit
> > commit 24b91e360ef521a2808771633d76ebc68bd5604b
> > Author: Frederic Weisbecker <fweisbec@gmail.com>
> > Date:   Wed Jan 4 15:12:04 2017 +0100
> > 
> >     nohz: Fix collision between tick and other hrtimers
> >     
> 
> I had to revert 7bdb59f1ad474bd7161adc8f923cdef10f2638d1, too,
> otherwise -rc8 does not compile.
> 
> With 24b91e360ef521a28087716 and 7bdb59f1ad474 reverted, it seems to
> boot ok. (I did few tries.)

Do you still have the config that triggered this? I don't have high expectations
of reproducing it (that has almost never worked for me), but at least I could narrow
down the context.

Thanks.
Thomas Gleixner Feb. 17, 2017, 4:37 p.m. UTC | #19
On Fri, 17 Feb 2017, Frederic Weisbecker wrote:
> On Thu, Feb 16, 2017 at 08:34:45PM +0100, Thomas Gleixner wrote:
> > On Thu, 16 Feb 2017, Frederic Weisbecker wrote:
> > > On Thu, Feb 16, 2017 at 10:20:14AM -0800, Linus Torvalds wrote:
> > > > On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
> > > > <fweisbec@gmail.com> wrote:
> > > > >
> > > > > I haven't followed the discussion but this patch has a known issue which is fixed
> > > > > with:
> > > > >     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
> > > > >     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
> > > > >
> > > > > I hope this fixes your issue.
> > > > 
> > > > No, Pavel saw the problem with rc8 too, which already has that fix.
> > > > 
> > > > So I think we'll just need to revert that original patch (and that
> > > > means that we have to revert the commit you point to as well, since
> > > > that ->next_tick field was added by the original commit).
> > > 
> > > Aw too bad, but indeed that late we don't have the choice.
> > 
> > Hint: Look for CPU hotplug interaction of these patches. I bet something
> > becomes stale when the CPU goes down and does not get reset when it comes
> > back online.
> 
> Indeed I should check that. But Pavel is seeing this on boot, where the

I don't think so. He observed it on suspend resume and by doing hotplug
operations in a loop. But I might be wrong as usual.

> only hotplug operations that happen are CPU UP without preceding CPU DOWN
> that may have retained stale values. I think the value of ts->next_tick should
> be initially 0 for all CPUs. So perhaps that 0 value confuses stuff. But
> looking at the code I don't see how. It maybe something more subtle.
>
Pavel Machek Feb. 17, 2017, 5:05 p.m. UTC | #20
On Fri 2017-02-17 17:37:47, Thomas Gleixner wrote:
> On Fri, 17 Feb 2017, Frederic Weisbecker wrote:
> > On Thu, Feb 16, 2017 at 08:34:45PM +0100, Thomas Gleixner wrote:
> > > On Thu, 16 Feb 2017, Frederic Weisbecker wrote:
> > > > On Thu, Feb 16, 2017 at 10:20:14AM -0800, Linus Torvalds wrote:
> > > > > On Thu, Feb 16, 2017 at 10:13 AM, Frederic Weisbecker
> > > > > <fweisbec@gmail.com> wrote:
> > > > > >
> > > > > > I haven't followed the discussion but this patch has a known issue which is fixed
> > > > > > with:
> > > > > >     7bdb59f1ad474bd7161adc8f923cdef10f2638d1
> > > > > >     "tick/nohz: Fix possible missing clock reprog after tick soft restart"
> > > > > >
> > > > > > I hope this fixes your issue.
> > > > > 
> > > > > No, Pavel saw the problem with rc8 too, which already has that fix.
> > > > > 
> > > > > So I think we'll just need to revert that original patch (and that
> > > > > means that we have to revert the commit you point to as well, since
> > > > > that ->next_tick field was added by the original commit).
> > > > 
> > > > Aw too bad, but indeed that late we don't have the choice.
> > > 
> > > Hint: Look for CPU hotplug interaction of these patches. I bet something
> > > becomes stale when the CPU goes down and does not get reset when it comes
> > > back online.
> > 
> > Indeed I should check that. But Pavel is seeing this on boot, where the
> 
> I don't think so. He observed it on suspend resume and by doing hotplug
> operations in a loop. But I might be wrong as usual.

These are different bugs.

On x60, I see failures doing hotplug/unplug in a loop, or a lot of
suspends. Someone has seen it in v4.8-stable etc. Old bug. Rare to hit.

The desktop machine was failing to boot, and had some fun with
suspend/resume too. The boot hang was reproducible with the right
procedure (hard poweroff, cold boot). That one was introduced in the
4.10-rc cycle.


									Pavel
Pavel Machek Feb. 18, 2017, 8:55 a.m. UTC | #21
On Thu 2017-02-16 12:21:13, Linus Torvalds wrote:
> On Thu, Feb 16, 2017 at 12:06 PM, Pavel Machek <pavel@ucw.cz> wrote:
> >
> > Hmm, that would explain problems at boot _and_ problems during
> > suspend/resume.
> 
> I've committed the revert, and I'm just assuming that the revert also
> fixed your suspend/resume issues, but I wanted to just double-check
> that since it's only implied, not stated explicitly.

So the boot issue is fixed, but it hung on resume again. v4.9 worked
ok. The display is restored when it hangs on resume, but the mouse is dead; I
guess that means there should be some chance to get debugging messages
during the resume.

									Pavel
Frederic Weisbecker Feb. 23, 2017, 4:28 p.m. UTC | #22
On Tue, Feb 14, 2017 at 08:27:43PM +0100, Pavel Machek wrote:
> On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> > Hi!
> > 
> > > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > > but I'll have to double check.
> > > > > 
> > > > > But all the kernel versions worked when the keyboard was plugged into
> > > > > its original USB port?
> > > > 
> > > > Aha. So it looks difference is probably in "where is keyboard plugged
> > > > in" but in "reboot" vs. "cold boot". I did not do a cold boot in quite
> > > > a while :-(.
> > > > 
> > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > > 
> > > > It happens with current Linus' tree.
> > > 
> > > v4.10-rc6-feb3 : broken
> > > v4.9 : ok
> > > (v4.6 : ok)
> > 
> > Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   
> > 
> > With debug patch below, I get
> > 
> > ...1d.7: PCI fixup... pass 2
> > ...1d.7: PCI fixup... pass 3
> > ...1d.7: PCI fixup... pass 3 done
> > 
> > ...followed by hang. So yes, it looks USB related.
> > 
> > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > startup, unfortunately useful info is off screen at that point).
> 
> Forgot to say, 1d.7 is EHCI controller.
> 
> 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> Controller (rev 01)

Ok, I should soon have access to an EeePc 1015CX (which seems to have this controller).
I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
burden you again :-)
Pavel Machek Feb. 23, 2017, 6:40 p.m. UTC | #23
On Thu 2017-02-23 17:28:26, Frederic Weisbecker wrote:
> On Tue, Feb 14, 2017 at 08:27:43PM +0100, Pavel Machek wrote:
> > On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> > > Hi!
> > > 
> > > > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > > > but I'll have to double check.
> > > > > > 
> > > > > > But all the kernel versions worked when the keyboard was plugged into
> > > > > > its original USB port?
> > > > > 
> > > > > Aha. So it looks difference is probably in "where is keyboard plugged
> > > > > in" but in "reboot" vs. "cold boot". I did not do a cold boot in quite
> > > > > a while :-(.
> > > > > 
> > > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > > > 
> > > > > It happens with current Linus' tree.
> > > > 
> > > > v4.10-rc6-feb3 : broken
> > > > v4.9 : ok
> > > > (v4.6 : ok)
> > > 
> > > Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   
> > > 
> > > With debug patch below, I get
> > > 
> > > ...1d.7: PCI fixup... pass 2
> > > ...1d.7: PCI fixup... pass 3
> > > ...1d.7: PCI fixup... pass 3 done
> > > 
> > > ...followed by hang. So yes, it looks USB related.
> > > 
> > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > startup, unfortunately useful info is off screen at that point).
> > 
> > Forgot to say, 1d.7 is EHCI controller.
> > 
> > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > Controller (rev 01)
> 
> Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> burden you again :-)

Go through more mails. It is only reproducible after a cold boot... so
I doubt it will be easy to reproduce on another machine.

Now... I do have a serial port, and I might even have a serial cable
somewhere, but... Given how sensitive it is, it is probably going to
go away with the console on ttyS...

									Pavel
Frederic Weisbecker Feb. 25, 2017, 3:28 a.m. UTC | #24
On Thu, Feb 23, 2017 at 07:40:13PM +0100, Pavel Machek wrote:
> On Thu 2017-02-23 17:28:26, Frederic Weisbecker wrote:
> > On Tue, Feb 14, 2017 at 08:27:43PM +0100, Pavel Machek wrote:
> > > On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> > > > Hi!
> > > > 
> > > > > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > > > > but I'll have to double check.
> > > > > > > 
> > > > > > > But all the kernel versions worked when the keyboard was plugged into
> > > > > > > its original USB port?
> > > > > > 
> > > > > > Aha. So it looks difference is probably in "where is keyboard plugged
> > > > > > in" but in "reboot" vs. "cold boot". I did not do a cold boot in quite
> > > > > > a while :-(.
> > > > > > 
> > > > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > > > > 
> > > > > > It happens with current Linus' tree.
> > > > > 
> > > > > v4.10-rc6-feb3 : broken
> > > > > v4.9 : ok
> > > > > (v4.6 : ok)
> > > > 
> > > > Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   
> > > > 
> > > > With debug patch below, I get
> > > > 
> > > > ...1d.7: PCI fixup... pass 2
> > > > ...1d.7: PCI fixup... pass 3
> > > > ...1d.7: PCI fixup... pass 3 done
> > > > 
> > > > ...followed by hang. So yes, it looks USB related.
> > > > 
> > > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > > startup, unfortunately useful info is off screen at that point).
> > > 
> > > Forgot to say, 1d.7 is EHCI controller.
> > > 
> > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > Controller (rev 01)
> > 
> > Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > burden you again :-)
> 
> Go through more mails.

I've read the whole thread several times, but I couldn't find many more clues.

> It is only reproducible after cold boot. .. so I doubt it will be easy to reproduce on another machine.

I have no idea. That's just my only hope for now.

> 
> Now... I do have serial port, and I even might have serial cable
> somewhere, but.... Giving how sensitive it is, it is probably going to
> go away with console on ttyS...

We'll see how it goes. I'll be off next week and then I should get the eeepc.
I'll get back to it then.

What surprises me is that the tick doesn't even fire yet at PCI quirks time,
at least not on my machine, where the clocksource is set up afterward. That said, if
some of the PCI quirks run as async work, that might explain a later interaction with the tick.

Thanks.
Frederic Weisbecker March 18, 2017, 2:42 p.m. UTC | #25
On Thu, Feb 23, 2017 at 07:40:13PM +0100, Pavel Machek wrote:
> On Thu 2017-02-23 17:28:26, Frederic Weisbecker wrote:
> > On Tue, Feb 14, 2017 at 08:27:43PM +0100, Pavel Machek wrote:
> > > On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> > > > Hi!
> > > > 
> > > > > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > > > > but I'll have to double check.
> > > > > > > 
> > > > > > > But all the kernel versions worked when the keyboard was plugged into
> > > > > > > its original USB port?
> > > > > > 
> > > > > > Aha. So it looks difference is probably in "where is keyboard plugged
> > > > > > in" but in "reboot" vs. "cold boot". I did not do a cold boot in quite
> > > > > > a while :-(.
> > > > > > 
> > > > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > > > > 
> > > > > > It happens with current Linus' tree.
> > > > > 
> > > > > v4.10-rc6-feb3 : broken
> > > > > v4.9 : ok
> > > > > (v4.6 : ok)
> > > > 
> > > > Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   
> > > > 
> > > > With debug patch below, I get
> > > > 
> > > > ...1d.7: PCI fixup... pass 2
> > > > ...1d.7: PCI fixup... pass 3
> > > > ...1d.7: PCI fixup... pass 3 done
> > > > 
> > > > ...followed by hang. So yes, it looks USB related.
> > > > 
> > > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > > startup, unfortunately useful info is off screen at that point).
> > > 
> > > Forgot to say, 1d.7 is EHCI controller.
> > > 
> > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > Controller (rev 01)
> > 
> > Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > burden you again :-)
> 
> Go through more mails. It is only reproducible after cold boot. .. so
> I doubt it will be easy to reproduce on another machine.
> 
> Now... I do have serial port, and I even might have serial cable
> somewhere, but.... Giving how sensitive it is, it is probably going to
> go away with console on ttyS...

So I had access to a machine with the NM10/ICH7 chipset and I failed to reproduce the issue.
Which machine are you using?

I fear you're my last resort. I suspect something is programming the clockevent
behind the tick. I thought it could be the clockevents switch code but I can't find
any issue there.

I see you have CONFIG_HIGH_RES_TIMERS=n. Could you try with it enabled?

For a quick rewind:

    git reset --hard v4.10
    git revert 558e8e27e73f53f8a512485be538b07115fe5f3c
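
Before rebuilding, it may help to confirm how CONFIG_HIGH_RES_TIMERS is set in
the .config being tested. The following is only a minimal sketch: the helper
name hrt_status is made up for illustration, and it assumes the usual Kconfig
output format ("CONFIG_FOO=y" or "# CONFIG_FOO is not set").

```shell
#!/bin/sh
# Hypothetical helper (not part of the kernel tree): report how
# CONFIG_HIGH_RES_TIMERS is set in a kernel .config file.
# Usage: hrt_status [path-to-config]   (defaults to ./.config)
hrt_status() {
    config=${1:-.config}
    if grep -q '^CONFIG_HIGH_RES_TIMERS=y$' "$config"; then
        echo enabled
    elif grep -q '^# CONFIG_HIGH_RES_TIMERS is not set$' "$config"; then
        echo disabled
    else
        # Option not mentioned at all, or the file could not be read.
        echo unknown
    fi
}

# In a kernel source tree, one way to flip the option and refresh the config:
#   scripts/config --enable HIGH_RES_TIMERS
#   make olddefconfig
```

Running hrt_status from the top of the build tree prints "enabled", "disabled"
or "unknown" for the .config found there.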

Thanks!
Frederic Weisbecker April 3, 2017, 3:38 p.m. UTC | #26
On Thu, Feb 23, 2017 at 07:40:13PM +0100, Pavel Machek wrote:
> On Thu 2017-02-23 17:28:26, Frederic Weisbecker wrote:
> > On Tue, Feb 14, 2017 at 08:27:43PM +0100, Pavel Machek wrote:
> > > On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> > > > Hi!
> > > > 
> > > > > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > > > > but I'll have to double check.
> > > > > > > 
> > > > > > > But all the kernel versions worked when the keyboard was plugged into
> > > > > > > its original USB port?
> > > > > > 
> > > > > > Aha. So it looks difference is probably in "where is keyboard plugged
> > > > > > in" but in "reboot" vs. "cold boot". I did not do a cold boot in quite
> > > > > > a while :-(.
> > > > > > 
> > > > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > > > > 
> > > > > > It happens with current Linus' tree.
> > > > > 
> > > > > v4.10-rc6-feb3 : broken
> > > > > v4.9 : ok
> > > > > (v4.6 : ok)
> > > > 
> > > > Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.   
> > > > 
> > > > With debug patch below, I get
> > > > 
> > > > ...1d.7: PCI fixup... pass 2
> > > > ...1d.7: PCI fixup... pass 3
> > > > ...1d.7: PCI fixup... pass 3 done
> > > > 
> > > > ...followed by hang. So yes, it looks USB related.
> > > > 
> > > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > > startup, unfortunately useful info is off screen at that point).
> > > 
> > > Forgot to say, 1d.7 is EHCI controller.
> > > 
> > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > Controller (rev 01)
> > 
> > Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > burden you again :-)
> 
> Go through more mails. It is only reproducible after cold boot. .. so
> I doubt it will be easy to reproduce on another machine.
> 
> Now... I do have serial port, and I even might have serial cable
> somewhere, but.... Giving how sensitive it is, it is probably going to
> go away with console on ttyS...

I also tried on an eeepc (which has ICH7/NM10 as well), with your config.
I even plugged in a USB keyboard, but I was still unable to
reproduce it :-(
Pavel Machek April 3, 2017, 6:20 p.m. UTC | #27
> > > > > ...1d.7: PCI fixup... pass 2
> > > > > ...1d.7: PCI fixup... pass 3
> > > > > ...1d.7: PCI fixup... pass 3 done
> > > > > 
> > > > > ...followed by hang. So yes, it looks USB related.
> > > > > 
> > > > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > > > startup, unfortunately useful info is off screen at that point).
> > > > 
> > > > Forgot to say, 1d.7 is EHCI controller.
> > > > 
> > > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > > Controller (rev 01)
> > > 
> > > Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> > > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > > burden you again :-)
> > 
> > Go through more mails. It is only reproducible after cold boot. .. so
> > I doubt it will be easy to reproduce on another machine.
> > 
> > Now... I do have serial port, and I even might have serial cable
> > somewhere, but.... Giving how sensitive it is, it is probably going to
> > go away with console on ttyS...
> 
> I also tried on an eeepc (which has ICH7/NM10 as well), with your config.
> I even plugged a usb keyboard but even then I have been unable to
> reproduce either :-(

Ok, give me some time. I'm no longer using the affected machine, so no
promises.

									Pavel
Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 1800bef..060ad79 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3510,6 +3510,8 @@  void pci_fixup_device(enum pci_fixup_pass pass, struct pci_dev *dev)
 {
 	struct pci_fixup *start, *end;
 
+	dev_info(&dev->dev, "PCI fixup device %p, pass %d\n", dev, pass);
+
 	switch (pass) {
 	case pci_fixup_early:
 		start = __start_pci_fixups_early;
@@ -3558,6 +3560,7 @@  void pci_fixup_device(enum pci_fixup_pass pass, struct pci_dev *dev)
 		return;
 	}
 	pci_do_fixups(dev, start, end);
+	dev_info(&dev->dev, "PCI fixup device %p, pass %d, done\n", dev, pass);
 }
 EXPORT_SYMBOL(pci_fixup_device);
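
With the debug patch applied, a captured boot log can be scanned for fixup
passes that started but never printed "done". This is only a sketch under
stated assumptions: the helper name unfinished_fixups is hypothetical, the log
is assumed to be available as text (for example from a serial console), and
lines are assumed to end exactly as the two dev_info() calls above format them,
optionally preceded by a "[ time ]" stamp.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and print every
# "PCI fixup device ..., pass N" line that has no matching "..., done" line.
unfinished_fixups() {
    awk '
        /PCI fixup device/ {
            key = $0
            # Drop a leading "[   0.123456] " timestamp so that the start
            # and done lines of the same pass compare equal.
            sub(/^\[[ 0-9.]*\] /, "", key)
            if (sub(/, done$/, "", key)) {
                delete pending[key]     # pass completed
            } else {
                pending[key] = NR       # pass started
            }
        }
        END {
            # Output order is unspecified if several passes are pending.
            for (key in pending)
                print key
        }
    '
}
```

If the function prints nothing, every logged pass completed, which would match
the report above where the hang follows "pass 3 done" and so appears to happen
after pci_do_fixups() has returned.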