v4.10-rc8 (-rc6) boot regression on Intel desktop, does not boot after cold boots, boots after reboot

Message ID 20170412150832.GE21309@lerouge (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Frederic Weisbecker April 12, 2017, 3:08 p.m. UTC
On Mon, Apr 03, 2017 at 08:20:50PM +0200, Pavel Machek wrote:
> > > > > > ...1d.7: PCI fixup... pass 2
> > > > > > ...1d.7: PCI fixup... pass 3
> > > > > > ...1d.7: PCI fixup... pass 3 done
> > > > > > 
> > > > > > ...followed by hang. So yes, it looks USB related.
> > > > > > 
> > > > > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > > > > startup, unfortunately useful info is off screen at that point).
> > > > > 
> > > > > Forgot to say, 1d.7 is EHCI controller.
> > > > > 
> > > > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > > > Controller (rev 01)
> > > > 
> > > > Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> > > > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > > > burden you again :-)
> > > 
> > > Go through more mails. It is only reproducible after cold boot. .. so
> > > I doubt it will be easy to reproduce on another machine.
> > > 
> > > Now... I do have serial port, and I even might have serial cable
> > > somewhere, but.... Giving how sensitive it is, it is probably going to
> > > go away with console on ttyS...
> > 
> > I also tried on an eeepc (which has ICH7/NM10 as well), with your config.
> > I even plugged a usb keyboard but even then I have been unable to
> > reproduce either :-(
> 
> Ok, give me some time. I'm no longer using the affected machine, so no
> promises.

Actually, someone recently reported an issue to me that is very similar to yours. It's probably
the same one, and I have a potential fix.

The scenario is a bit tricky again, and still theoretical. If you're interested in the gory details:
a tick scheduled at jiffies = N + 1, in order to expire a timer_list timer, fires a
tiny bit too early (i.e. a few microseconds in advance). So it doesn't update jiffies on IRQ entry
and still sees jiffies = N. The timer_list timer doesn't expire yet, and on IRQ exit we reschedule
the tick for the same time. But we see that ts->next_tick already has that value, so
we don't reprogram it, leaving the clockevent unprogrammed.
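
The scenario above can be sketched as a small user-space C model. This is only an
illustration under stated assumptions: the toy_* names, both structs, and the
deadline value 1000 are hypothetical stand-ins and not the kernel's tick-sched code.
Only the field names next_tick and tick_stopped are taken from the patch. The sketch
shows how a cached deadline can make the reschedule path skip reprogramming a
one-shot device that has already fired, and how resetting the cache in the handler
avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-shot clockevent: once it fires it is no longer armed. */
struct toy_clockevent {
	bool armed;
	uint64_t expires;
};

/* Hypothetical subset of the per-CPU tick state; field names follow the patch. */
struct toy_tick_sched {
	bool tick_stopped;
	uint64_t next_tick;	/* cached deadline; 0 means "nothing cached" */
};

/* Models (re)scheduling the tick on IRQ exit while the tick is stopped. */
static void toy_stop_tick(struct toy_tick_sched *ts,
			  struct toy_clockevent *dev, uint64_t expires)
{
	assert(ts && dev);
	ts->tick_stopped = true;

	/* Shortcut: the cache says this deadline is already programmed. */
	if (ts->next_tick == expires)
		return;

	ts->next_tick = expires;
	dev->armed = true;
	dev->expires = expires;
}

/* Models the tick handler running after the one-shot device has fired. */
static void toy_tick_fires(struct toy_tick_sched *ts,
			   struct toy_clockevent *dev, bool with_fix)
{
	dev->armed = false;		/* the one-shot event is consumed */

	if (ts->tick_stopped) {
		if (with_fix)
			ts->next_tick = 0;	/* invalidate the stale cache */
		return;
	}

	/* Periodic case (not exercised here): rearm for the next period. */
	dev->armed = true;
	dev->expires += 1;
}

/* Replays the early-tick scenario; returns whether the device ends up armed. */
static bool toy_armed_after_early_tick(bool with_fix)
{
	struct toy_tick_sched ts = { 0 };
	struct toy_clockevent dev = { 0 };
	const uint64_t deadline = 1000;	/* arbitrary time standing for jiffies N + 1 */

	toy_stop_tick(&ts, &dev, deadline);	/* tick scheduled for N + 1 */
	toy_tick_fires(&ts, &dev, with_fix);	/* fires slightly early, timer not expired */
	toy_stop_tick(&ts, &dev, deadline);	/* IRQ exit asks for the same deadline */

	return dev.armed;
}
```

In this model, toy_armed_after_early_tick(false) returns false, meaning the device is
left unprogrammed, while toy_armed_after_early_tick(true) returns true because the
cache reset forces the second call to reprogram it.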

So in case you have the time and opportunity to test the fix, you'll need to:

1) Revert back to the offending change:
   git revert 558e8e27e73f53f8a512485be538b07115fe5f3c

2) Apply a delta fix:



Thanks!

Comments

Pavel Machek April 15, 2017, 9:34 p.m. UTC | #1
On Wed 2017-04-12 17:08:35, Frederic Weisbecker wrote:
> On Mon, Apr 03, 2017 at 08:20:50PM +0200, Pavel Machek wrote:
> > > > > > > ...1d.7: PCI fixup... pass 2
> > > > > > > ...1d.7: PCI fixup... pass 3
> > > > > > > ...1d.7: PCI fixup... pass 3 done
> > > > > > > 
> > > > > > > ...followed by hang. So yes, it looks USB related.
> > > > > > > 
> > > > > > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > > > > > startup, unfortunately useful info is off screen at that point).
> > > > > > 
> > > > > > Forgot to say, 1d.7 is EHCI controller.
> > > > > > 
> > > > > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > > > > Controller (rev 01)
> > > > > 
> > > > > Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> > > > > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > > > > burden you again :-)
> > > > 
> > > > Go through more mails. It is only reproducible after cold boot. .. so
> > > > I doubt it will be easy to reproduce on another machine.
> > > > 
> > > > Now... I do have serial port, and I even might have serial cable
> > > > somewhere, but.... Giving how sensitive it is, it is probably going to
> > > > go away with console on ttyS...
> > > 
> > > I also tried on an eeepc (which has ICH7/NM10 as well), with your config.
> > > I even plugged a usb keyboard but even then I have been unable to
> > > reproduce either :-(
> > 
> > Ok, give me some time. I'm no longer using the affected machine, so no
> > promises.
> 
> Actually someone reported me a very similar issue than yours lately. It's probably
> the same. And I have a potential fix.

Got the machine back to work -- I guess it will be useful for distcc.

And yes, you seem to have the right fix :-).

Tested-by: Pavel Machek <pavel@ucw.cz>

									Pavel
Frederic Weisbecker April 20, 2017, 2:52 p.m. UTC | #2
On Sat, Apr 15, 2017 at 11:34:47PM +0200, Pavel Machek wrote:
> On Wed 2017-04-12 17:08:35, Frederic Weisbecker wrote:
> > On Mon, Apr 03, 2017 at 08:20:50PM +0200, Pavel Machek wrote:
> > > > > > > > ...1d.7: PCI fixup... pass 2
> > > > > > > > ...1d.7: PCI fixup... pass 3
> > > > > > > > ...1d.7: PCI fixup... pass 3 done
> > > > > > > > 
> > > > > > > > ...followed by hang. So yes, it looks USB related.
> > > > > > > > 
> > > > > > > > (Sometimes it hangs with some kind backtrace involving secondary CPU
> > > > > > > > startup, unfortunately useful info is off screen at that point).
> > > > > > > 
> > > > > > > Forgot to say, 1d.7 is EHCI controller.
> > > > > > > 
> > > > > > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > > > > > Controller (rev 01)
> > > > > > 
> > > > > > Ok, I should have access soon to a EeePc 1015CX (which seem to have this controller).
> > > > > > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > > > > > burden you again :-)
> > > > > 
> > > > > Go through more mails. It is only reproducible after cold boot. .. so
> > > > > I doubt it will be easy to reproduce on another machine.
> > > > > 
> > > > > Now... I do have serial port, and I even might have serial cable
> > > > > somewhere, but.... Giving how sensitive it is, it is probably going to
> > > > > go away with console on ttyS...
> > > > 
> > > > I also tried on an eeepc (which has ICH7/NM10 as well), with your config.
> > > > I even plugged a usb keyboard but even then I have been unable to
> > > > reproduce either :-(
> > > 
> > > Ok, give me some time. I'm no longer using the affected machine, so no
> > > promises.
> > 
> > Actually someone reported me a very similar issue than yours lately. It's probably
> > the same. And I have a potential fix.
> 
> Got the machine back to work -- I guess it will be useful for distcc.
> 
> And yes, you seem to have right fix :-). 
> 
> Tested-by: Pavel Machek <pavel@ucw.cz>

Thanks a lot! I'm posting the fix.

Patch

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a3b8154..ae66515 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1071,8 +1071,10 @@  static void tick_nohz_handler(struct clock_event_device *dev)
 	tick_sched_handle(ts, regs);
 
 	/* No need to reprogram if we are running tickless  */
-	if (unlikely(ts->tick_stopped))
+	if (unlikely(ts->tick_stopped)) {
+		ts->next_tick = 0;
 		return;
+	}
 
 	hrtimer_forward(&ts->sched_timer, now, tick_period);
 	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
@@ -1172,8 +1174,10 @@  static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 		tick_sched_handle(ts, regs);
 
 	/* No need to reprogram if we are in idle or full dynticks mode */
-	if (unlikely(ts->tick_stopped))
+	if (unlikely(ts->tick_stopped)) {
+		ts->next_tick = 0;
 		return HRTIMER_NORESTART;
+	}
 
 	hrtimer_forward(timer, now, tick_period);