diff mbox

[v2] Hibernate: check unsafe page should not in e820 reserved region

Message ID 1407165801-24648-1-git-send-email-jlee@suse.com (mailing list archive)
State Accepted, archived
Headers show

Commit Message

Chun-Yi Lee Aug. 4, 2014, 3:23 p.m. UTC
When the machine doesn't well handle the e820 persistent when hibernate
resuming, then it may causes page fault when writing image to snapshot
buffer:

[   17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
[   17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
[   17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
[   17.933469] Oops: 0002 [#1] SMP
...

The ffff880069d4f000 page is in e820 reserved region of resume boot
kernel:

[    0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
...
[    0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]

So snapshot.c mark the pfn to forbidden pages map. But, this
page is also in the memory bitmap in snapshot image because it's an
original page used by image kernel, so it will also mark as an
unsafe(free) page in prepare_image().

That means the page in e820 when resuming mark as "forbidden" and
"free", it causes get_buffer() treat it as an allocated unsafe page.
Then snapshot_write_next() return this page to load_image, load_image
writing content to this address, but this page didn't really allocated
. So, we got page fault.

Although the root cause is from BIOS, I think aggressive check and
significant message in kernel will better then a page fault for
issue tracking, especially when serial console unavailable.

This patch adds code in mark_unsafe_pages() for check does free pages in
nosave region. If so, then it print message and return fault to stop whole
S4 resume process:

[    8.166004] PM: Image loading progress:   0%
[    8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
[    8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
[    8.926633] PM: Error -14 resuming
[    8.933534] PM: Failed to load hibernation image, recovering.

v2:
 + removed empty check of nosave_regions list.
 + fixed the typo of "region" in code for error message and patch comment.

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
---
 kernel/power/snapshot.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Comments

Takashi Iwai Aug. 4, 2014, 4:36 p.m. UTC | #1
At Mon,  4 Aug 2014 23:23:21 +0800,
Lee, Chun-Yi wrote:
> 
> When the machine doesn't well handle the e820 persistent when hibernate
> resuming, then it may causes page fault when writing image to snapshot
> buffer:
> 
> [   17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
> [   17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
> [   17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
> [   17.933469] Oops: 0002 [#1] SMP
> ...
> 
> The ffff880069d4f000 page is in e820 reserved region of resume boot
> kernel:
> 
> [    0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
> ...
> [    0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
> 
> So snapshot.c mark the pfn to forbidden pages map. But, this
> page is also in the memory bitmap in snapshot image because it's an
> original page used by image kernel, so it will also mark as an
> unsafe(free) page in prepare_image().
> 
> That means the page in e820 when resuming mark as "forbidden" and
> "free", it causes get_buffer() treat it as an allocated unsafe page.
> Then snapshot_write_next() return this page to load_image, load_image
> writing content to this address, but this page didn't really allocated
> . So, we got page fault.
> 
> Although the root cause is from BIOS, I think aggressive check and
> significant message in kernel will better then a page fault for
> issue tracking, especially when serial console unavailable.
> 
> This patch adds code in mark_unsafe_pages() for check does free pages in
> nosave region. If so, then it print message and return fault to stop whole
> S4 resume process:
> 
> [    8.166004] PM: Image loading progress:   0%
> [    8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
> [    8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
> [    8.926633] PM: Error -14 resuming
> [    8.933534] PM: Failed to load hibernation image, recovering.
> 
> v2:
>  + removed empty check of nosave_regions list.
>  + fixed the typo of "region" in code for error message and patch comment.
> 
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Takashi Iwai <tiwai@suse.de>

Looks good to me, feel free to take my tag,
  Reviewed-by: Takashi Iwai <tiwai@suse.de>


Takashi
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
joeyli Aug. 7, 2014, 4:05 a.m. UTC | #2
Hi experts, 

On Mon, Aug 04, 2014 at 11:23:21PM +0800, Lee, Chun-Yi wrote:
> When the machine doesn't well handle the e820 persistent when hibernate
> resuming, then it may causes page fault when writing image to snapshot
> buffer:
> 
> [   17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
> [   17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
> [   17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
> [   17.933469] Oops: 0002 [#1] SMP
> ...
> 
> The ffff880069d4f000 page is in e820 reserved region of resume boot
> kernel:
> 
> [    0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
> ...
> [    0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
> 
> So snapshot.c mark the pfn to forbidden pages map. But, this
> page is also in the memory bitmap in snapshot image because it's an
> original page used by image kernel, so it will also mark as an
> unsafe(free) page in prepare_image().
> 
> That means the page in e820 when resuming mark as "forbidden" and
> "free", it causes get_buffer() treat it as an allocated unsafe page.
> Then snapshot_write_next() return this page to load_image, load_image
> writing content to this address, but this page didn't really allocated
> . So, we got page fault.
> 
> Although the root cause is from BIOS, I think aggressive check and
> significant message in kernel will better then a page fault for
> issue tracking, especially when serial console unavailable.
> 
> This patch adds code in mark_unsafe_pages() for check does free pages in
> nosave region. If so, then it print message and return fault to stop whole
> S4 resume process:
> 
> [    8.166004] PM: Image loading progress:   0%
> [    8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
> [    8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
> [    8.926633] PM: Error -14 resuming
> [    8.933534] PM: Failed to load hibernation image, recovering.
> 
> v2:
>  + removed empty check of nosave_regions list.
>  + fixed the typo of "region" in code for error message and patch comment.
> 
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Acked-by: Pavel Machek <pavel@ucw.cz>
> Signed-off-by: Lee, Chun-Yi <jlee@suse.com>

I discussed with Vojtech Pavlik for this patch, he raised a situation is:

Maybe e820 changed but image kernel original pages do not fall into new e820 region.
Then the hibernate will recovery success, but later kernel drivers may got problem
when accessing memory. 

My idea is hashing the start/end pfn of each nosave region sequentially, put this
nosave region digest to hibernate header then compare e820 digest in check_header()
when hibernate resuming.

I am developing patch, then we don't need check unsafe page should not in unsave(e820)
regions.


Thanks a lot!
Joey Lee

> ---
>  kernel/power/snapshot.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 4fc5c32..c4b8093 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -954,6 +954,25 @@ static void mark_nosave_pages(struct memory_bitmap *bm)
>  	}
>  }
>  
> +static bool is_nosave_page(unsigned long pfn)
> +{
> +	struct nosave_region *region;
> +
> +	list_for_each_entry(region, &nosave_regions, list) {
> +		if (pfn >= region->start_pfn && pfn < region->end_pfn) {
> +			pr_err("PM: %#010llx in e820 nosave region: "
> +			       "[mem %#010llx-%#010llx]\n",
> +			       (unsigned long long) pfn << PAGE_SHIFT,
> +			       (unsigned long long) region->start_pfn << PAGE_SHIFT,
> +			       ((unsigned long long) region->end_pfn << PAGE_SHIFT)
> +					- 1);
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
>  /**
>   *	create_basic_memory_bitmaps - create bitmaps needed for marking page
>   *	frames that should not be saved and free page frames.  The pointers
> @@ -2015,7 +2034,7 @@ static int mark_unsafe_pages(struct memory_bitmap *bm)
>  	do {
>  		pfn = memory_bm_next_pfn(bm);
>  		if (likely(pfn != BM_END_OF_MAP)) {
> -			if (likely(pfn_valid(pfn)))
> +			if (likely(pfn_valid(pfn)) && !is_nosave_page(pfn))
>  				swsusp_set_page_free(pfn_to_page(pfn));
>  			else
>  				return -EFAULT;
> -- 
> 1.8.4.5
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Machek Aug. 7, 2014, 8 a.m. UTC | #3
Hi!

> > When the machine doesn't well handle the e820 persistent when hibernate
> > resuming, then it may causes page fault when writing image to snapshot
> > buffer:
> > 
> > [   17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
> > [   17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
> > [   17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
> > [   17.933469] Oops: 0002 [#1] SMP
> > ...
> > 
> > The ffff880069d4f000 page is in e820 reserved region of resume boot
> > kernel:
> > 
> > [    0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
> > ...
> > [    0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
> > 
> > So snapshot.c mark the pfn to forbidden pages map. But, this
> > page is also in the memory bitmap in snapshot image because it's an
> > original page used by image kernel, so it will also mark as an
> > unsafe(free) page in prepare_image().
> > 
> > That means the page in e820 when resuming mark as "forbidden" and
> > "free", it causes get_buffer() treat it as an allocated unsafe page.
> > Then snapshot_write_next() return this page to load_image, load_image
> > writing content to this address, but this page didn't really allocated
> > . So, we got page fault.
> > 
> > Although the root cause is from BIOS, I think aggressive check and
> > significant message in kernel will better then a page fault for
> > issue tracking, especially when serial console unavailable.
> > 
> > This patch adds code in mark_unsafe_pages() for check does free pages in
> > nosave region. If so, then it print message and return fault to stop whole
> > S4 resume process:
> > 
> > [    8.166004] PM: Image loading progress:   0%
> > [    8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
> > [    8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
> > [    8.926633] PM: Error -14 resuming
> > [    8.933534] PM: Failed to load hibernation image, recovering.
> > 
> > v2:
> >  + removed empty check of nosave_regions list.
> >  + fixed the typo of "region" in code for error message and patch comment.
> > 
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Takashi Iwai <tiwai@suse.de>
> > Acked-by: Pavel Machek <pavel@ucw.cz>
> > Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
> 
> I discussed with Vojtech Pavlik for this patch, he raised a situation is:
> 
> Maybe e820 changed but image kernel original pages do not fall into new e820 region.
> Then the hibernate will recovery success, but later kernel drivers may got problem
> when accessing memory. 
> 
> My idea is hashing the start/end pfn of each nosave region sequentially, put this
> nosave region digest to hibernate header then compare e820 digest in check_header()
> when hibernate resuming.
> 
> I am developing patch, then we don't need check unsafe page should not in unsave(e820)
> regions.

Actually, if you are doing such a check... it makes sense to check for
_all_ the regions, nosave or not. If e820 map changed at all, it is
not safe to resume.
									Pavel
joeyli Aug. 7, 2014, 8:52 a.m. UTC | #4
On Thu, Aug 07, 2014 at 10:00:25AM +0200, Pavel Machek wrote:
> Hi!
> 
> > > When the machine doesn't well handle the e820 persistent when hibernate
> > > resuming, then it may causes page fault when writing image to snapshot
> > > buffer:
> > > 
> > > [   17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
> > > [   17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
> > > [   17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
> > > [   17.933469] Oops: 0002 [#1] SMP
[...]
> > > [    8.933534] PM: Failed to load hibernation image, recovering.
> > > 
> > > v2:
> > >  + removed empty check of nosave_regions list.
> > >  + fixed the typo of "region" in code for error message and patch comment.
> > > 
> > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > Cc: Len Brown <len.brown@intel.com>
> > > Cc: Takashi Iwai <tiwai@suse.de>
> > > Acked-by: Pavel Machek <pavel@ucw.cz>
> > > Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
> > 
> > I discussed with Vojtech Pavlik for this patch, he raised a situation is:
> > 
> > Maybe e820 changed but image kernel original pages do not fall into new e820 region.
> > Then the hibernate will recovery success, but later kernel drivers may got problem
> > when accessing memory. 
> > 
> > My idea is hashing the start/end pfn of each nosave region sequentially, put this
> > nosave region digest to hibernate header then compare e820 digest in check_header()
> > when hibernate resuming.
> > 
> > I am developing patch, then we don't need check unsafe page should not in unsave(e820)
> > regions.
> 
> Actually, if you are doing such a check... it makes sense to check for
> _all_ the regions, nosave or not. If e820 map changed at all, it is
> not safe to resume.
> 									Pavel

Currently nosave region only called register by e820 code, so hibernate's nosave region included e820
reserved, ACPI data and ACPI NVS region.

I thought hashing the start/end pfn of above regions is enough.


Regards
Joey Lee
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Machek Aug. 7, 2014, 9:39 a.m. UTC | #5
> > Actually, if you are doing such a check... it makes sense to check for
> > _all_ the regions, nosave or not. If e820 map changed at all, it is
> > not safe to resume.
> > 									Pavel
> 
> Currently nosave region only called register by e820 code, so hibernate's nosave region included e820
> reserved, ACPI data and ACPI NVS region.
> 
> I thought hashing the start/end pfn of above regions is enough.

If ammount of memory changed, for example, it is unsafe to
resume. So if you are doing the check, anyway, please hash
whole e820 table.
						Pavel
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
joeyli Aug. 7, 2014, 10:17 a.m. UTC | #6
On Thu, Aug 07, 2014 at 11:39:57AM +0200, Pavel Machek wrote:
> > > Actually, if you are doing such a check... it makes sense to check for
> > > _all_ the regions, nosave or not. If e820 map changed at all, it is
> > > not safe to resume.
> > > 									Pavel
> > 
> > Currently nosave region only called register by e820 code, so hibernate's nosave region included e820
> > reserved, ACPI data and ACPI NVS region.
> > 
> > I thought hashing the start/end pfn of above regions is enough.
> 
> If ammount of memory changed, for example, it is unsafe to
> resume. So if you are doing the check, anyway, please hash
> whole e820 table.
> 						Pavel

There already have num_physpages in header for check the total physical page number.


Regards
Joey Lee
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Machek Aug. 7, 2014, 9:05 p.m. UTC | #7
On Thu 2014-08-07 18:17:34, joeyli wrote:
> On Thu, Aug 07, 2014 at 11:39:57AM +0200, Pavel Machek wrote:
> > > > Actually, if you are doing such a check... it makes sense to check for
> > > > _all_ the regions, nosave or not. If e820 map changed at all, it is
> > > > not safe to resume.
> > > > 									Pavel
> > > 
> > > Currently nosave region only called register by e820 code, so hibernate's nosave region included e820
> > > reserved, ACPI data and ACPI NVS region.
> > > 
> > > I thought hashing the start/end pfn of above regions is enough.
> > 
> > If ammount of memory changed, for example, it is unsafe to
> > resume. So if you are doing the check, anyway, please hash
> > whole e820 table.
> 
> There already have num_physpages in header for check the total physical page number.

Good, but if ammount of memory stayed the same, but offsets
changed (for example), resume is unsafe, too.

When I wrote that num_physpages check, I should have checked
whole e820  table, instead. (If anything at all changed there,
"new" kernel is running with wrong e820 info).

You seem to be in great position to fix that mistake now...

						Pavel
joeyli Aug. 8, 2014, 4:24 a.m. UTC | #8
On Thu, Aug 07, 2014 at 11:05:33PM +0200, Pavel Machek wrote:
> On Thu 2014-08-07 18:17:34, joeyli wrote:
> > On Thu, Aug 07, 2014 at 11:39:57AM +0200, Pavel Machek wrote:
> > > > > Actually, if you are doing such a check... it makes sense to check for
> > > > > _all_ the regions, nosave or not. If e820 map changed at all, it is
> > > > > not safe to resume.
> > > > > 									Pavel
> > > > 
> > > > Currently nosave region only called register by e820 code, so hibernate's nosave region included e820
> > > > reserved, ACPI data and ACPI NVS region.
> > > > 
> > > > I thought hashing the start/end pfn of above regions is enough.
> > > 
> > > If ammount of memory changed, for example, it is unsafe to
> > > resume. So if you are doing the check, anyway, please hash
> > > whole e820 table.
> > 
> > There already have num_physpages in header for check the total physical page number.
> 
> Good, but if ammount of memory stayed the same, but offsets
> changed (for example), resume is unsafe, too.

I agreed

> 
> When I wrote that num_physpages check, I should have checked
> whole e820  table, instead. (If anything at all changed there,
> "new" kernel is running with wrong e820 info).
> 
> You seem to be in great position to fix that mistake now...
> 
> 						Pavel

Hashing e820 is fine, but it can not provide detail information to user/developer when issue
happened. We only know the e820 table changed but not more information for bug tracking, not
too many shipping machine have serial console.

After checked the space of swsusp_info, I hope can store whole e820 table to snapshot header
for compare the range. The maximum space consumed 20 * E820MAX = 20 * 128 = 2560 bytes.
Currrent swsusp_info is used 430 bytes in one page, it's enough to us for keep whole e820
table.

And, I thought don't need compare the range of E820_RAM and E820_RESERVED_KERN type
because they are using by OS and stored in snapshot image.


Thanks a lot!
Joey Lee
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Machek Aug. 8, 2014, 7:05 a.m. UTC | #9
Hi!

> > When I wrote that num_physpages check, I should have checked
> > whole e820  table, instead. (If anything at all changed there,
> > "new" kernel is running with wrong e820 info).
> > 
> > You seem to be in great position to fix that mistake now...
> 
> Hashing e820 is fine, but it can not provide detail information to user/developer when issue
> happened. We only know the e820 table changed but not more information for bug tracking, not
> too many shipping machine have serial console.

I don't see what it has to do with serial console, during relevant
stages of hibernation, normal console should be available.

Anyway, either hash or full comparison is file, whatever is simpler.

Thanks,

									Pavel
diff mbox

Patch

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 4fc5c32..c4b8093 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -954,6 +954,25 @@  static void mark_nosave_pages(struct memory_bitmap *bm)
 	}
 }
 
+static bool is_nosave_page(unsigned long pfn)
+{
+	struct nosave_region *region;
+
+	list_for_each_entry(region, &nosave_regions, list) {
+		if (pfn >= region->start_pfn && pfn < region->end_pfn) {
+			pr_err("PM: %#010llx in e820 nosave region: "
+			       "[mem %#010llx-%#010llx]\n",
+			       (unsigned long long) pfn << PAGE_SHIFT,
+			       (unsigned long long) region->start_pfn << PAGE_SHIFT,
+			       ((unsigned long long) region->end_pfn << PAGE_SHIFT)
+					- 1);
+			return true;
+		}
+	}
+
+	return false;
+}
+
 /**
  *	create_basic_memory_bitmaps - create bitmaps needed for marking page
  *	frames that should not be saved and free page frames.  The pointers
@@ -2015,7 +2034,7 @@  static int mark_unsafe_pages(struct memory_bitmap *bm)
 	do {
 		pfn = memory_bm_next_pfn(bm);
 		if (likely(pfn != BM_END_OF_MAP)) {
-			if (likely(pfn_valid(pfn)))
+			if (likely(pfn_valid(pfn)) && !is_nosave_page(pfn))
 				swsusp_set_page_free(pfn_to_page(pfn));
 			else
 				return -EFAULT;