diff mbox

NULL pointer dereference in swsusp_free with 3.17-rc5

Message ID 20140925091318.GA4269@suse.de (mailing list archive)
State Deferred, archived
Headers show

Commit Message

Joerg Roedel Sept. 25, 2014, 9:13 a.m. UTC
On Thu, Sep 25, 2014 at 09:20:58AM +0200, Bjørn Mork wrote:
> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> 
> > I've decided to go with a revert for 3.17, as we don't seem to have an immediate
> > fix and the final 3.17 may be as close as this Sunday.  So I'm going to send my
> > final pull request for 3.17 to Linus tomorrow or early on Friday.
> 
> Sounds safest to me, FWIW.

Yes, sorry for the delay, I am still fighting with my cold and couldn't
get around to send a fix sooner :/

> For the next round of this, I think the only missing part was some test
> like
> 
>         if (!forbidden_pages_map || !free_pages_map)
>            goto return_without_freeing_anything;

Right, this is pretty much the fix. Can you please test the attached
patch?

> And BTW, I believe it would be useful if at least one more person in the
> world tested hibernation between each release ;-)

Well, I tested these patches on at least 4 or 5 different hardware
configurations. I also know of other people testing hibernation with -rc
kernels, but this is the first report of this issue I have seen. I
wonder what it different in your setup so that you trigger this bug.

Anyway, it would be great if you could test the patch below :)

Thanks,

	Joerg

From fe599eff60cfbfbb1f894dc476ee28f38aef954b Mon Sep 17 00:00:00 2001
From: Joerg Roedel <jroedel@suse.de>
Date: Thu, 25 Sep 2014 11:04:40 +0200
Subject: [PATCH] PM: Hibernate: Fix NULL pointer access in swsusp_free

The optimized version of swsusp_free does not check the
bitmap pointers anymore, which may cause a NULL pointer
dereference and a kernel crash. Fix it by adding the checks
and bail out if one of them is NULL.

Reported-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Bjørn Mork Sept. 25, 2014, 10:54 a.m. UTC | #1
Joerg Roedel <jroedel@suse.de> writes:
> On Thu, Sep 25, 2014 at 09:20:58AM +0200, Bjørn Mork wrote:
>> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>> 
>> > I've decided to go with a revert for 3.17, as we don't seem to have an immediate
>> > fix and the final 3.17 may be as close as this Sunday.  So I'm going to send my
>> > final pull request for 3.17 to Linus tomorrow or early on Friday.
>> 
>> Sounds safest to me, FWIW.
>
> Yes, sorry for the delay, I am still fighting with my cold and couldn't
> get around to send a fix sooner :/

If it's any comfort, a cold was the reason I found this :-)

I didn't plan on testing 3.17 at all. Instead I wanted to beat on
Debian's 3.16 packages for a while to make sure it is in perfect
condition for the jessie release.  But then I catched the cold and felt
like just relaxing with a movie.  And instantly hit the i915 ring
initialization bug. Which has a workaround in the queue for 3.16, so
there is nothing to worry about:
https://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git/tree/queue-3.16/drm-i915-read-head-register-back-in-init_ring_common-to-enforce-ordering.patch

Except that this bug hit me there and then and I wasn't exactly in the
mood for building a new kernel.  So I just installed Debian's
experimental 3.17-rc5 build as a quickfix instead and watched the movie.

Then, much later, I noticed the oops.  Which I wouldn't have done
without the cold.

>> For the next round of this, I think the only missing part was some test
>> like
>> 
>>         if (!forbidden_pages_map || !free_pages_map)
>>            goto return_without_freeing_anything;
>
> Right, this is pretty much the fix. Can you please test the attached
> patch?
>
>> And BTW, I believe it would be useful if at least one more person in the
>> world tested hibernation between each release ;-)
>
> Well, I tested these patches on at least 4 or 5 different hardware
> configurations. I also know of other people testing hibernation with -rc
> kernels, but this is the first report of this issue I have seen. I
> wonder what it different in your setup so that you trigger this bug.

I do wonder about that as well. AFAIK I don't do anything extraordinary.
My hardware is somewhat dated but should still be pretty common.

I use "s2disk" from the Debian wheezy "uswsusp" package (version
1.0+20110509-3) to hibernate the system.  Don't think I do anything
fancy.  I just run s2disk and that's that.

The fact that the test was there before indicates that the NULL maps are
to be expected, but maybe that is really just painting over some
hardware related problem? My laptop is a Lenovo Thinkpad X301 from
~2008, having much of the hardware in common with most laptops from that
time:

bjorn@nemi:~$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub [8086:2a40] (rev 07)
00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller [8086:2a42] (rev 07)
00:02.1 Display controller [0380]: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller [8086:2a43] (rev 07)
00:03.0 Communication controller [0780]: Intel Corporation Mobile 4 Series Chipset MEI Controller [8086:2a44] (rev 07)
00:19.0 Ethernet controller [0200]: Intel Corporation 82567LM Gigabit Network Connection [8086:10f5] (rev 03)
00:1a.0 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 [8086:2937] (rev 03)
00:1a.1 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 [8086:2938] (rev 03)
00:1a.2 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 [8086:2939] (rev 03)
00:1a.7 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 [8086:293c] (rev 03)
00:1b.0 Audio device [0403]: Intel Corporation 82801I (ICH9 Family) HD Audio Controller [8086:293e] (rev 03)
00:1c.0 PCI bridge [0604]: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 [8086:2940] (rev 03)
00:1c.1 PCI bridge [0604]: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 [8086:2942] (rev 03)
00:1d.0 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 [8086:2934] (rev 03)
00:1d.1 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 [8086:2935] (rev 03)
00:1d.2 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 [8086:2936] (rev 03)
00:1d.7 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 [8086:293a] (rev 03)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 Mobile PCI Bridge [8086:2448] (rev 93)
00:1f.0 ISA bridge [0601]: Intel Corporation ICH9M-E LPC Interface Controller [8086:2917] (rev 03)
00:1f.2 SATA controller [0106]: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] [8086:2929] (rev 03)
00:1f.3 SMBus [0c05]: Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930] (rev 03)
03:00.0 Network controller [0280]: Intel Corporation Wireless 7260 [8086:08b1] (rev 63)

bjorn@nemi:~$ lsusb 
Bus 005 Device 003: ID 1199:a001 Sierra Wireless, Inc. 
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 008 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 007 Device 002: ID 8087:07dc Intel Corp. 
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 003: ID 17ef:4807 Lenovo UVC Camera
Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 002: ID 0a5c:2145 Broadcom Corp. BCM2045B (BDC-2.1) [Bluetooth Controller]
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub


The only odd parts are the newer 7260 wifi card (with the 8087:07dc
additional bluetooth device), and the even newer EM7345 LTE modem
(1199:a001).  There is of course a slight chance that one of these
devices cause something odd to happen during hibernate/resume.

The newer bluetooth device does trigger a firmware related warning
during resume, which I haven't bothered investigating yet.  I guess it
could theoretically be related to other resume issues:

Sep 25 12:19:52 nemi kernel: [   51.128904] ------------[ cut here ]------------
Sep 25 12:19:52 nemi kernel: [   51.128915] WARNING: CPU: 0 PID: 5119 at drivers/base/firmware_class.c:1124 _request_firmware+0x277/0x68d()
Sep 25 12:19:52 nemi kernel: [   51.128920] Modules linked in: xt_multiport iptable_filter ip_tables bnep dm_mod xt_hl nf_log_ipv6 nf_log_common xt_LOG binfmt_misc ip6table_filter ip6_tables x_tables nfsd nfs_acl nfs lockd fscache sunrpc 8021q garp stp mrp llc tun loop fuse iTCO_wdt iTCO_vendor_support snd_hda_codec_conexant snd_hda_codec_generic cdc_mbim cdc_wdm cdc_ncm usbnet coretemp kvm_intel mii arc4 cdc_acm kvm evdev psmouse serio_raw uvcvideo ecb videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common btusb videodev bluetooth snd_hda_intel iwlmvm snd_hda_controller mac80211 snd_hda_codec snd_hwdep snd_pcm_oss lpc_ich mfd_core i2c_i801 snd_mixer_oss iwlwifi snd_pcm cfg80211 snd_timer wmi thinkpad_acpi nvram snd soundcore rfkill ac battery i915 i2c_algo_bit drm_kms_helper video drm acpi_cpufreq button processor ext4 crc16 jbd2 mbcache nbd sg sd_mod crc_t10dif crct10dif_generic sr_mod crct10dif_common cdrom microcode ahci libahci libata scsi_mod ehci_pci uhci_hcd eh
 ci_hcd usbcore usb_common e1000e ptp pps_core thermal thermal_sys
Sep 25 12:19:52 nemi kernel: [   51.129255] CPU: 0 PID: 5119 Comm: kworker/u9:0 Tainted: G        W      3.17.0-rc6+ #261
Sep 25 12:19:52 nemi kernel: [   51.129261] Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
Sep 25 12:19:52 nemi kernel: [   51.129305] Workqueue: hci0 hci_power_on [bluetooth]
Sep 25 12:19:52 nemi kernel: [   51.129312]  0000000000000009 ffff88023184fae8 ffffffff813be9c7 0000000000003dea
Sep 25 12:19:52 nemi kernel: [   51.129326]  0000000000000000 ffff88023184fb28 ffffffff8103eb60 ffff88023184faf8
Sep 25 12:19:52 nemi kernel: [   51.129340]  ffffffff812a61a5 00000000fffffff5 ffff880223e6b180 ffff8800b17dd4c0
Sep 25 12:19:52 nemi kernel: [   51.129355] Call Trace:
Sep 25 12:19:52 nemi kernel: [   51.129367]  [<ffffffff813be9c7>] dump_stack+0x4e/0x68
Sep 25 12:19:52 nemi kernel: [   51.129377]  [<ffffffff8103eb60>] warn_slowpath_common+0x7c/0x96
Sep 25 12:19:52 nemi kernel: [   51.129389]  [<ffffffff812a61a5>] ? _request_firmware+0x277/0x68d
Sep 25 12:19:52 nemi kernel: [   51.129400]  [<ffffffff8103eb8f>] warn_slowpath_null+0x15/0x17
Sep 25 12:19:52 nemi kernel: [   51.129411]  [<ffffffff812a61a5>] _request_firmware+0x277/0x68d
Sep 25 12:19:52 nemi kernel: [   51.129422]  [<ffffffff812a667b>] request_firmware+0x30/0x44
Sep 25 12:19:52 nemi kernel: [   51.129449]  [<ffffffffa058bcd4>] btusb_setup_intel+0x297/0x675 [btusb]
Sep 25 12:19:52 nemi kernel: [   51.129513]  [<ffffffffa0071366>] ? usb_autopm_put_interface+0x35/0x39 [usbcore]
Sep 25 12:19:52 nemi kernel: [   51.129584]  [<ffffffffa05ddccf>] hci_dev_do_open+0x13b/0x986 [bluetooth]
Sep 25 12:19:52 nemi kernel: [   51.129596]  [<ffffffff810533c2>] ? process_one_work+0x136/0x3bf
Sep 25 12:19:52 nemi kernel: [   51.129653]  [<ffffffffa05de561>] hci_power_on+0x47/0x16a [bluetooth]
Sep 25 12:19:52 nemi kernel: [   51.129664]  [<ffffffff81053487>] process_one_work+0x1fb/0x3bf
Sep 25 12:19:52 nemi kernel: [   51.129675]  [<ffffffff810533c2>] ? process_one_work+0x136/0x3bf
Sep 25 12:19:52 nemi kernel: [   51.129710]  [<ffffffff81053841>] worker_thread+0x1cc/0x2a3
Sep 25 12:19:52 nemi kernel: [   51.129721]  [<ffffffff81053675>] ? process_scheduled_works+0x2a/0x2a
Sep 25 12:19:52 nemi kernel: [   51.129731]  [<ffffffff81057dc0>] kthread+0xb5/0xbd
Sep 25 12:19:52 nemi kernel: [   51.129743]  [<ffffffff810759e3>] ? trace_hardirqs_on+0xd/0xf
Sep 25 12:19:52 nemi kernel: [   51.129754]  [<ffffffff81057d0b>] ? __kthread_parkme+0x5c/0x5c
Sep 25 12:19:52 nemi kernel: [   51.129765]  [<ffffffff813c472c>] ret_from_fork+0x7c/0xb0
Sep 25 12:19:52 nemi kernel: [   51.129775]  [<ffffffff81057d0b>] ? __kthread_parkme+0x5c/0x5c
Sep 25 12:19:52 nemi kernel: [   51.129782] ---[ end trace 059b5e27114304b7 ]---
Sep 25 12:19:52 nemi kernel: [   51.129791] bluetooth hci0: firmware: intel/ibt-hw-37.7.bseq will not be loaded
Sep 25 12:19:52 nemi kernel: [   51.129802] Bluetooth: hci0 failed to open default Intel fw file: intel/ibt-hw-37.7.bseq


And I have always(?) had the strange NMI warning in resume logs: 

 Sep 25 12:19:55 nemi kernel: [   54.623454] Uhhuh. NMI received for unknown reason 31 on CPU 0.

but Google has taught me to simply ignore that.  I see it as just an
example of why it's pointless to log firmware errors without a clear
indication what the user is supposed to do about it.  I might be wrong
:-)

That's all I got.  I was hoping someone could explain under what
circumstances those maps will be NULL.

> Anyway, it would be great if you could test the patch below :)

The patch works fine for me.  Thanks.  I look forward to the next
revision of this set.  It looks like a really nice cleanup. And I am
sorry that I didn't catch the bug earlier in the 3.17 cycle.


Bjørn
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 25, 2014, 8:26 p.m. UTC | #2
On Thursday, September 25, 2014 12:54:56 PM Bjørn Mork wrote:
> Joerg Roedel <jroedel@suse.de> writes:
> > On Thu, Sep 25, 2014 at 09:20:58AM +0200, Bjørn Mork wrote:
> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

[cut]

> 
> > Anyway, it would be great if you could test the patch below :)
> 
> The patch works fine for me.  Thanks.  I look forward to the next
> revision of this set.  It looks like a really nice cleanup. And I am
> sorry that I didn't catch the bug earlier in the 3.17 cycle.

I'm only going to revert the commit in question as the rest should just
work without this one, only less efficiently.

So I'll be waiting for a resend of it with the patch that you've just tested
folded in after 3.18-rc is out.
diff mbox

Patch

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index c4b8093..791a618 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1343,6 +1343,9 @@  void swsusp_free(void)
 {
 	unsigned long fb_pfn, fr_pfn;
 
+	if (!forbidden_pages_map || !free_pages_map)
+		goto out;
+
 	memory_bm_position_reset(forbidden_pages_map);
 	memory_bm_position_reset(free_pages_map);
 
@@ -1370,6 +1373,7 @@  loop:
 		goto loop;
 	}
 
+out:
 	nr_copy_pages = 0;
 	nr_meta_pages = 0;
 	restore_pblist = NULL;