Message ID | 20140925091318.GA4269@suse.de (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Joerg Roedel <jroedel@suse.de> writes: > On Thu, Sep 25, 2014 at 09:20:58AM +0200, Bjørn Mork wrote: >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: >> >> > I've decided to go with a revert for 3.17, as we don't seem to have an immediate >> > fix and the final 3.17 may be as close as this Sunday. So I'm going to send my >> > final pull request for 3.17 to Linus tomorrow or early on Friday. >> >> Sounds safest to me, FWIW. > > Yes, sorry for the delay, I am still fighting with my cold and couldn't > get around to send a fix sooner :/ If it's any comfort, a cold was the reason I found this :-) I didn't plan on testing 3.17 at all. Instead I wanted to beat on Debian's 3.16 packages for a while to make sure it is in perfect condition for the jessie release. But then I catched the cold and felt like just relaxing with a movie. And instantly hit the i915 ring initialization bug. Which has a workaround in the queue for 3.16, so there is nothing to worry about: https://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git/tree/queue-3.16/drm-i915-read-head-register-back-in-init_ring_common-to-enforce-ordering.patch Except that this bug hit me there and then and I wasn't exactly in the mood for building a new kernel. So I just installed Debian's experimental 3.17-rc5 build as a quickfix instead and watched the movie. Then, much later, I noticed the oops. Which I wouldn't have done without the cold. >> For the next round of this, I think the only missing part was some test >> like >> >> if (!forbidden_pages_map || !free_pages_map) >> goto return_without_freeing_anything; > > Right, this is pretty much the fix. Can you please test the attached > patch? > >> And BTW, I believe it would be useful if at least one more person in the >> world tested hibernation between each release ;-) > > Well, I tested these patches on at least 4 or 5 different hardware > configurations. I also know of other people testing hibernation with -rc > kernels, but this is the first report of this issue I have seen. I > wonder what it different in your setup so that you trigger this bug. I do wonder about that as well. AFAIK I don't do anything extraordinary. My hardware is somewhat dated but should still be pretty common. I use "s2disk" from the Debian wheezy "uswsusp" package (version 1.0+20110509-3) to hibernate the system. Don't think I do anything fancy. I just run s2disk and that's that. The fact that the test was there before indicates that the NULL maps are to be expected, but maybe that is really just painting over some hardware related problem? My laptop is a Lenovo Thinkpad X301 from ~2008, having much of the hardware in common with most laptops from that time: bjorn@nemi:~$ lspci -nn 00:00.0 Host bridge [0600]: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub [8086:2a40] (rev 07) 00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller [8086:2a42] (rev 07) 00:02.1 Display controller [0380]: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller [8086:2a43] (rev 07) 00:03.0 Communication controller [0780]: Intel Corporation Mobile 4 Series Chipset MEI Controller [8086:2a44] (rev 07) 00:19.0 Ethernet controller [0200]: Intel Corporation 82567LM Gigabit Network Connection [8086:10f5] (rev 03) 00:1a.0 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 [8086:2937] (rev 03) 00:1a.1 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 [8086:2938] (rev 03) 00:1a.2 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 [8086:2939] (rev 03) 00:1a.7 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 [8086:293c] (rev 03) 00:1b.0 Audio device [0403]: Intel Corporation 82801I (ICH9 Family) HD Audio Controller [8086:293e] (rev 03) 00:1c.0 PCI bridge [0604]: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 [8086:2940] (rev 03) 00:1c.1 PCI bridge [0604]: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 [8086:2942] (rev 03) 00:1d.0 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 [8086:2934] (rev 03) 00:1d.1 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 [8086:2935] (rev 03) 00:1d.2 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 [8086:2936] (rev 03) 00:1d.7 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 [8086:293a] (rev 03) 00:1e.0 PCI bridge [0604]: Intel Corporation 82801 Mobile PCI Bridge [8086:2448] (rev 93) 00:1f.0 ISA bridge [0601]: Intel Corporation ICH9M-E LPC Interface Controller [8086:2917] (rev 03) 00:1f.2 SATA controller [0106]: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] [8086:2929] (rev 03) 00:1f.3 SMBus [0c05]: Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930] (rev 03) 03:00.0 Network controller [0280]: Intel Corporation Wireless 7260 [8086:08b1] (rev 63) bjorn@nemi:~$ lsusb Bus 005 Device 003: ID 1199:a001 Sierra Wireless, Inc. Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 008 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 007 Device 002: ID 8087:07dc Intel Corp. Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 004 Device 003: ID 17ef:4807 Lenovo UVC Camera Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 002 Device 002: ID 0a5c:2145 Broadcom Corp. BCM2045B (BDC-2.1) [Bluetooth Controller] Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub The only odd parts are the newer 7260 wifi card (with the 8087:07dc additional bluetooth device), and the even newer EM7345 LTE modem (1199:a001). There is of course a slight chance that one of these devices cause something odd to happen during hibernate/resume. The newer bluetooth device does trigger a firmware related warning during resume, which I haven't bothered investigating yet. I guess it could theoretically be related to other resume issues: Sep 25 12:19:52 nemi kernel: [ 51.128904] ------------[ cut here ]------------ Sep 25 12:19:52 nemi kernel: [ 51.128915] WARNING: CPU: 0 PID: 5119 at drivers/base/firmware_class.c:1124 _request_firmware+0x277/0x68d() Sep 25 12:19:52 nemi kernel: [ 51.128920] Modules linked in: xt_multiport iptable_filter ip_tables bnep dm_mod xt_hl nf_log_ipv6 nf_log_common xt_LOG binfmt_misc ip6table_filter ip6_tables x_tables nfsd nfs_acl nfs lockd fscache sunrpc 8021q garp stp mrp llc tun loop fuse iTCO_wdt iTCO_vendor_support snd_hda_codec_conexant snd_hda_codec_generic cdc_mbim cdc_wdm cdc_ncm usbnet coretemp kvm_intel mii arc4 cdc_acm kvm evdev psmouse serio_raw uvcvideo ecb videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common btusb videodev bluetooth snd_hda_intel iwlmvm snd_hda_controller mac80211 snd_hda_codec snd_hwdep snd_pcm_oss lpc_ich mfd_core i2c_i801 snd_mixer_oss iwlwifi snd_pcm cfg80211 snd_timer wmi thinkpad_acpi nvram snd soundcore rfkill ac battery i915 i2c_algo_bit drm_kms_helper video drm acpi_cpufreq button processor ext4 crc16 jbd2 mbcache nbd sg sd_mod crc_t10dif crct10dif_generic sr_mod crct10dif_common cdrom microcode ahci libahci libata scsi_mod ehci_pci uhci_hcd eh ci_hcd usbcore usb_common e1000e ptp pps_core thermal thermal_sys Sep 25 12:19:52 nemi kernel: [ 51.129255] CPU: 0 PID: 5119 Comm: kworker/u9:0 Tainted: G W 3.17.0-rc6+ #261 Sep 25 12:19:52 nemi kernel: [ 51.129261] Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011 Sep 25 12:19:52 nemi kernel: [ 51.129305] Workqueue: hci0 hci_power_on [bluetooth] Sep 25 12:19:52 nemi kernel: [ 51.129312] 0000000000000009 ffff88023184fae8 ffffffff813be9c7 0000000000003dea Sep 25 12:19:52 nemi kernel: [ 51.129326] 0000000000000000 ffff88023184fb28 ffffffff8103eb60 ffff88023184faf8 Sep 25 12:19:52 nemi kernel: [ 51.129340] ffffffff812a61a5 00000000fffffff5 ffff880223e6b180 ffff8800b17dd4c0 Sep 25 12:19:52 nemi kernel: [ 51.129355] Call Trace: Sep 25 12:19:52 nemi kernel: [ 51.129367] [<ffffffff813be9c7>] dump_stack+0x4e/0x68 Sep 25 12:19:52 nemi kernel: [ 51.129377] [<ffffffff8103eb60>] warn_slowpath_common+0x7c/0x96 Sep 25 12:19:52 nemi kernel: [ 51.129389] [<ffffffff812a61a5>] ? _request_firmware+0x277/0x68d Sep 25 12:19:52 nemi kernel: [ 51.129400] [<ffffffff8103eb8f>] warn_slowpath_null+0x15/0x17 Sep 25 12:19:52 nemi kernel: [ 51.129411] [<ffffffff812a61a5>] _request_firmware+0x277/0x68d Sep 25 12:19:52 nemi kernel: [ 51.129422] [<ffffffff812a667b>] request_firmware+0x30/0x44 Sep 25 12:19:52 nemi kernel: [ 51.129449] [<ffffffffa058bcd4>] btusb_setup_intel+0x297/0x675 [btusb] Sep 25 12:19:52 nemi kernel: [ 51.129513] [<ffffffffa0071366>] ? usb_autopm_put_interface+0x35/0x39 [usbcore] Sep 25 12:19:52 nemi kernel: [ 51.129584] [<ffffffffa05ddccf>] hci_dev_do_open+0x13b/0x986 [bluetooth] Sep 25 12:19:52 nemi kernel: [ 51.129596] [<ffffffff810533c2>] ? process_one_work+0x136/0x3bf Sep 25 12:19:52 nemi kernel: [ 51.129653] [<ffffffffa05de561>] hci_power_on+0x47/0x16a [bluetooth] Sep 25 12:19:52 nemi kernel: [ 51.129664] [<ffffffff81053487>] process_one_work+0x1fb/0x3bf Sep 25 12:19:52 nemi kernel: [ 51.129675] [<ffffffff810533c2>] ? process_one_work+0x136/0x3bf Sep 25 12:19:52 nemi kernel: [ 51.129710] [<ffffffff81053841>] worker_thread+0x1cc/0x2a3 Sep 25 12:19:52 nemi kernel: [ 51.129721] [<ffffffff81053675>] ? process_scheduled_works+0x2a/0x2a Sep 25 12:19:52 nemi kernel: [ 51.129731] [<ffffffff81057dc0>] kthread+0xb5/0xbd Sep 25 12:19:52 nemi kernel: [ 51.129743] [<ffffffff810759e3>] ? trace_hardirqs_on+0xd/0xf Sep 25 12:19:52 nemi kernel: [ 51.129754] [<ffffffff81057d0b>] ? __kthread_parkme+0x5c/0x5c Sep 25 12:19:52 nemi kernel: [ 51.129765] [<ffffffff813c472c>] ret_from_fork+0x7c/0xb0 Sep 25 12:19:52 nemi kernel: [ 51.129775] [<ffffffff81057d0b>] ? __kthread_parkme+0x5c/0x5c Sep 25 12:19:52 nemi kernel: [ 51.129782] ---[ end trace 059b5e27114304b7 ]--- Sep 25 12:19:52 nemi kernel: [ 51.129791] bluetooth hci0: firmware: intel/ibt-hw-37.7.bseq will not be loaded Sep 25 12:19:52 nemi kernel: [ 51.129802] Bluetooth: hci0 failed to open default Intel fw file: intel/ibt-hw-37.7.bseq And I have always(?) had the strange NMI warning in resume logs: Sep 25 12:19:55 nemi kernel: [ 54.623454] Uhhuh. NMI received for unknown reason 31 on CPU 0. but Google has taught me to simply ignore that. I see it as just an example of why it's pointless to log firmware errors without a clear indication what the user is supposed to do about it. I might be wrong :-) That's all I got. I was hoping someone could explain under what circumstances those maps will be NULL. > Anyway, it would be great if you could test the patch below :) The patch works fine for me. Thanks. I look forward to the next revision of this set. It looks like a really nice cleanup. And I am sorry that I didn't catch the bug earlier in the 3.17 cycle. Bjørn -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, September 25, 2014 12:54:56 PM Bjørn Mork wrote: > Joerg Roedel <jroedel@suse.de> writes: > > On Thu, Sep 25, 2014 at 09:20:58AM +0200, Bjørn Mork wrote: > >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: [cut] > > > Anyway, it would be great if you could test the patch below :) > > The patch works fine for me. Thanks. I look forward to the next > revision of this set. It looks like a really nice cleanup. And I am > sorry that I didn't catch the bug earlier in the 3.17 cycle. I'm only going to revert the commit in question as the rest should just work without this one, only less efficiently. So I'll be waiting for a resend of it with the patch that you've just tested folded in after 3.18-rc is out.
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index c4b8093..791a618 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1343,6 +1343,9 @@ void swsusp_free(void) { unsigned long fb_pfn, fr_pfn; + if (!forbidden_pages_map || !free_pages_map) + goto out; + memory_bm_position_reset(forbidden_pages_map); memory_bm_position_reset(free_pages_map); @@ -1370,6 +1373,7 @@ loop: goto loop; } +out: nr_copy_pages = 0; nr_meta_pages = 0; restore_pblist = NULL;