Message ID | 1480530091-1092-1-git-send-email-rrichter@cavium.com (mailing list archive)
State      | New, archived
On Wed, Nov 30, 2016 at 07:21:31PM +0100, Robert Richter wrote:
> On ThunderX systems with certain memory configurations we see the
> following BUG_ON():
>
> kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The BUG_ON()
> checks if start and end page of a memmap range belong to the same
> zone.
>
> The BUG_ON() check fails if a memory zone contains NOMAP regions. In
> this case the node information of those pages is not initialized. This
> causes an inconsistency of the page links with wrong zone and node
> information for those pages. NOMAP pages from node 1 still point to the
> mem zone from node 0 and have the wrong nid assigned.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked NOMAP as invalid:
>
> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>
> This causes pages marked as nomap to no longer be reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
>
> Fixing this by restoring the old behavior of pfn_valid() to use
> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
> to use memblock_is_map_memory() where necessary. This only affects
> code in ioremap.c. The code in mmu.c still can use the new version of
> pfn_valid().
>
> As a consequence, pfn_valid() can not be used to check if a physical
> page is RAM. It just checks if there is an underlying memmap with a
> valid struct page. Moreover, for performance reasons the whole memmap
> (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM
> config). The memory range is extended to fit the alignment of the
> memmap. Thus, pfn_valid() may return true for pfns that do not map to
> physical memory. Those pages are simply not reported to the mm, they
> are not marked reserved nor added to the list of free pages. Other
> functions such as page_is_ram() or memblock_is_map_memory() must be
> used to check for memory and if the page can be mapped with the linear
> mapping.
>
> Since NOMAP mem ranges may need to be mapped with different mem
> attributes (e.g. read-only or non-caching) we can not use linear
> mapping here. The use of memblock_is_memory() in pfn_valid() may not
> break this behaviour. Since commit:
>
> e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>
> NOMAP mem resources are no longer marked as system RAM
> (IORESOURCE_SYSTEM_RAM). Now page_is_ram() and region_intersects() (see
> memremap()) do not detect NOMAP mem as system RAM, and NOMAP mem is not
> added to the linear mapping as system RAM is.
>
> v2:
>
>  * Added Ack
>  * updated description to reflect the discussion
>
> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Signed-off-by: Robert Richter <rrichter@cavium.com>
> ---
>  arch/arm64/mm/init.c    | 2 +-
>  arch/arm64/mm/ioremap.c | 5 +++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 212c4d1e2f26..166911f4a2e6 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>  #ifdef CONFIG_HAVE_ARCH_PFN_VALID
>  int pfn_valid(unsigned long pfn)
>  {
> -	return memblock_is_map_memory(pfn << PAGE_SHIFT);
> +	return memblock_is_memory(pfn << PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(pfn_valid);
>  #endif
> diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
> index 01e88c8bcab0..c17c220b0c48 100644
> --- a/arch/arm64/mm/ioremap.c
> +++ b/arch/arm64/mm/ioremap.c
> @@ -21,6 +21,7 @@
>   */
>
>  #include <linux/export.h>
> +#include <linux/memblock.h>
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  #include <linux/io.h>
> @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
>  	/*
>  	 * Don't allow RAM to be mapped.
>  	 */
> -	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> +	if (WARN_ON(memblock_is_map_memory(phys_addr)))
>  		return NULL;
>
>  	area = get_vm_area_caller(size, VM_IOREMAP, caller);
> @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
>  void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
>  {
>  	/* For normal memory we already have a cacheable mapping. */
> -	if (pfn_valid(__phys_to_pfn(phys_addr)))
> +	if (memblock_is_map_memory(phys_addr))
>  		return (void __iomem *)__phys_to_virt(phys_addr);

Thanks for sending out the new patch. Whilst I'm still a bit worried about
changing pfn_valid like this, I guess we'll just have to fix up any callers
which suffer from this change.

Acked-by: Will Deacon <will.deacon@arm.com>

I'd like to see this sit in -next for a bit before we send it further.

Will
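For illustration only, not part of the posted patch: a hypothetical helper showing the split in semantics the commit message describes. The function name and placement are invented; after this change pfn_valid() answers "is there a struct page for this pfn", while memblock_is_map_memory() answers "is this RAM that is in the linear map".

#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Hypothetical caller-side helper, only to illustrate the new semantics. */
static bool pfn_is_mapped_ram(unsigned long pfn)
{
	/* No memmap entry (struct page) at all for this pfn. */
	if (!pfn_valid(pfn))
		return false;

	/* RAM that is actually in the linear mapping, i.e. not NOMAP. */
	return memblock_is_map_memory(__pfn_to_phys(pfn));
}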
Hi Robert, Will,

On 01/12/16 16:45, Will Deacon wrote:
> On Wed, Nov 30, 2016 at 07:21:31PM +0100, Robert Richter wrote:
>> On ThunderX systems with certain memory configurations we see the
>> following BUG_ON():
>>
>> kernel BUG at mm/page_alloc.c:1848!
>>
>> This happens for some configs with 64k page size enabled. The BUG_ON()
>> checks if start and end page of a memmap range belong to the same
>> zone.
>>
>> The BUG_ON() check fails if a memory zone contains NOMAP regions. In
>> this case the node information of those pages is not initialized. This
>> causes an inconsistency of the page links with wrong zone and node
>> information for those pages. NOMAP pages from node 1 still point to the
>> mem zone from node 0 and have the wrong nid assigned.
>>
>> The reason for the mis-configuration is a change in pfn_valid() which
>> reports pages marked NOMAP as invalid:
>>
>> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>>
>> This causes pages marked as nomap to no longer be reassigned to the new
>> zone in memmap_init_zone() by calling __init_single_pfn().
>>
>> Fixing this by restoring the old behavior of pfn_valid() to use
>> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
>> to use memblock_is_map_memory() where necessary. This only affects
>> code in ioremap.c. The code in mmu.c still can use the new version of
>> pfn_valid().
>>
>> As a consequence, pfn_valid() can not be used to check if a physical
>> page is RAM. It just checks if there is an underlying memmap with a
>> valid struct page. Moreover, for performance reasons the whole memmap
>> (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM
>> config). The memory range is extended to fit the alignment of the
>> memmap. Thus, pfn_valid() may return true for pfns that do not map to
>> physical memory. Those pages are simply not reported to the mm, they
>> are not marked reserved nor added to the list of free pages. Other
>> functions such as page_is_ram() or memblock_is_map_memory() must be
>> used to check for memory and if the page can be mapped with the linear
>> mapping.

[...]

> Thanks for sending out the new patch. Whilst I'm still a bit worried about
> changing pfn_valid like this, I guess we'll just have to fix up any callers
> which suffer from this change.

Hibernate's core code falls foul of this. This patch causes a panic when
copying memory to build the 'image'[0].
saveable_page() in kernel/power/snapshot.c broadly assumes that pfn_valid()
pages can be accessed.

Fortunately the core code exposes pfn_is_nosave() which we can extend to
catch 'nomap' pages, but only if they are also marked as PageReserved().

Are there any side-effects of marking all the nomap regions with
mark_page_reserved()? (it doesn't appear to be the case today)

Patches incoming...


Thanks,

James


[0] panic trace
root@juno-r1:~# echo disk > /sys/power/state
[ 56.914184] PM: Syncing filesystems ...
[ 56.918853] done.
[ 56.920826] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 56.930383] PM: Preallocating image memory... done (allocated 97481 pages)
[ 60.566084] PM: Allocated 389924 kbytes in 3.62 seconds (107.71 MB/s)
[ 60.572576] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 60.604877] PM: freeze of devices complete after 23.146 msecs
[ 60.611230] PM: late freeze of devices complete after 0.578 msecs
[ 60.618609] PM: noirq freeze of devices complete after 1.247 msecs
[ 60.624833] Disabling non-boot CPUs ...
[ 60.649112] CPU1: shutdown
[ 60.651823] psci: CPU1 killed.
[ 60.701055] CPU2: shutdown
[ 60.703766] psci: CPU2 killed.
[ 60.745002] IRQ11 no longer affine to CPU3
[ 60.745043] CPU3: shutdown
[ 60.751890] psci: CPU3 killed.
[ 60.784966] CPU4: shutdown
[ 60.787676] psci: CPU4 killed.
[ 60.824916] IRQ8 no longer affine to CPU5
[ 60.824920] IRQ9 no longer affine to CPU5
[ 60.824927] IRQ18 no longer affine to CPU5
[ 60.824931] IRQ20 no longer affine to CPU5
[ 60.824951] CPU5: shutdown
[ 60.843975] psci: CPU5 killed.
[ 60.857989] PM: Creating hibernation image:
[ 60.857989] PM: Need to copy 96285 pages
[ 60.857989] Unable to handle kernel paging request at virtual address ffff8000794a0000
[ 60.857989] pgd = ffff800975190000
[ 60.857989] [ffff8000794a0000] *pgd=0000000000000000
[ 60.857989]
[ 60.857989] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 60.857989] Modules linked in:
[ 60.857989] CPU: 0 PID: 2366 Comm: bash Not tainted 4.9.0-rc7-00001-gecf7c47af54d #6346
[ 60.857989] Hardware name: ARM Juno development board (r1) (DT)
[ 60.857989] task: ffff8009766d3200 task.stack: ffff800975fec000
[ 60.857989] PC is at swsusp_save+0x250/0x2c8
[ 60.857989] LR is at swsusp_save+0x214/0x2c8
[ 60.857989] pc : [<ffff000008100bd0>] lr : [<ffff000008100b94>] pstate: 200003c5
[ 60.857989] sp : ffff800975fefb50
[ 60.857989] x29: ffff800975fefb50 x28: ffff800975fec000
[ 60.857989] x27: ffff0000088c2000 x26: 0000000000000040
[ 60.857989] x25: 00000000000f94a0 x24: ffff000008e437e8
[ 60.857989] x23: 000000000001781d x22: ffff000008bee000
[ 60.857989] x21: ffff000008e437f8 x20: ffff000008e43878
[ 60.857989] x19: ffff7e0000000000 x18: 0000000000000006
[ 60.857989] x17: 0000000000000000 x16: 00000000000005d0
[ 60.857989] x15: ffff000008e43e95 x14: 00000000000001d9
[ 60.857989] x13: 0000000000000001 x12: ffff7e0000000000
[ 60.857989] x11: ffff7e0025ffffc0 x10: 0000000025ffffc0
[ 60.857989] x9 : 000000000000012f x8 : ffff80096cd04ce0
[ 60.857989] x7 : 0000000000978000 x6 : 0000000000000076
[ 60.857989] x5 : fffffffffffffff8 x4 : 0000000000080000
[ 60.857989] x3 : ffff8000794a0000 x2 : ffff800959d83000
[ 60.857989] x1 : 0000000000000000 x0 : ffff7e0001e52800
[ 60.857989] Process bash (pid: 2366, stack limit = 0xffff800975fec020)
[ 60.857989] Stack: (0xffff800975fefb50 to 0xffff800975ff0000)
[ 60.857989] Call trace:
[ 60.857989] [<ffff000008100bd0>] swsusp_save+0x250/0x2c8
[ 60.857989] [<ffff0000080936ec>] swsusp_arch_suspend+0xb4/0x100
[ 60.857989] [<ffff0000080fe670>] hibernation_snapshot+0x278/0x318
[ 60.857989] [<ffff0000080fef10>] hibernate+0x1d0/0x268
[ 60.857989] [<ffff0000080fc954>] state_store+0xdc/0x100
[ 60.857989] [<ffff00000838419c>] kobj_attr_store+0x14/0x28
[ 60.857989] [<ffff00000825be68>] sysfs_kf_write+0x48/0x58
[ 60.857989] [<ffff00000825b1f8>] kernfs_fop_write+0xb0/0x1d8
[ 60.857989] [<ffff0000081e2ddc>] __vfs_write+0x1c/0x110
[ 60.857989] [<ffff0000081e3bd8>] vfs_write+0xa0/0x1b8
[ 60.857989] [<ffff0000081e4f44>] SyS_write+0x44/0xa0
[ 60.857989] [<ffff000008082ef0>] el0_svc_naked+0x24/0x28
[ 60.857989] Code: d37ae442 d37ae463 b2514042 b2514063 (f8636820)
[ 60.857989] ---[ end trace d0265b757c9dd571 ]---
[ 60.857989] ------------[ cut here ]------------
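For context, the core-code path James refers to, abridged from kernel/power/snapshot.c (circa v4.9). Treat this as a simplified sketch rather than the exact source; details are trimmed:

/* Abridged: hibernate only copies a page if saveable_page() returns it. */
static struct page *saveable_page(struct zone *zone, unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid(pfn))
		return NULL;

	page = pfn_to_page(pfn);
	if (page_zone(page) != zone)
		return NULL;

	if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
		return NULL;

	/*
	 * The escape route discussed in this thread: a page is skipped only
	 * if it is PageReserved() *and* either not present in the kernel
	 * page tables or reported by the arch as nosave.
	 */
	if (PageReserved(page)
	    && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
		return NULL;

	return page;
}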
James,

On 01.12.16 17:26:55, James Morse wrote:
> On 01/12/16 16:45, Will Deacon wrote:
>> Thanks for sending out the new patch. Whilst I'm still a bit worried about
>> changing pfn_valid like this, I guess we'll just have to fix up any callers
>> which suffer from this change.
>
> Hibernate's core code falls foul of this. This patch causes a panic when
> copying memory to build the 'image'[0].
> saveable_page() in kernel/power/snapshot.c broadly assumes that pfn_valid()
> pages can be accessed.
>
> Fortunately the core code exposes pfn_is_nosave() which we can extend to
> catch 'nomap' pages, but only if they are also marked as PageReserved().
>
> Are there any side-effects of marking all the nomap regions with
> mark_page_reserved()? (it doesn't appear to be the case today)

Reserving the page adds it to the memory management which is what we
would like to avoid for NOMAP pages. I don't believe we should do
this. Since NOMAP is to some degree now core functionality I would
rather implement pfn_is_nomap() that defaults to pfn_valid() but
calls memblock_is_nomap() for arm64, or does something equivalent.

The question arises what to do with that mem at all. There could be
mappings by the kernel, e.g. of acpi tables. We can't assume the mem
regions still come out the same from the BIOS during resume. Do we
need to save the mem? I can't answer that as I don't know much about
hibernation yet.

Thanks,

-Robert
Hi Robert,

On 02/12/16 07:11, Robert Richter wrote:
> On 01.12.16 17:26:55, James Morse wrote:
>> On 01/12/16 16:45, Will Deacon wrote:
>>> Thanks for sending out the new patch. Whilst I'm still a bit worried about
>>> changing pfn_valid like this, I guess we'll just have to fix up any callers
>>> which suffer from this change.
>>
>> Hibernate's core code falls foul of this. This patch causes a panic when
>> copying memory to build the 'image'[0].
>> saveable_page() in kernel/power/snapshot.c broadly assumes that pfn_valid()
>> pages can be accessed.
>>
>> Fortunately the core code exposes pfn_is_nosave() which we can extend to
>> catch 'nomap' pages, but only if they are also marked as PageReserved().
>>
>> Are there any side-effects of marking all the nomap regions with
>> mark_page_reserved()? (it doesn't appear to be the case today)
>
> Reserving the page adds it to the memory management which is what we
> would like to avoid for NOMAP pages. I don't believe we should do
> this. Since NOMAP is to some degree now core functionality I would
> rather implement pfn_is_nomap() that defaults to pfn_valid() but
> calls memblock_is_nomap() for arm64, or does something equivalent.

I thought the adjust_managed_page_count() code was just fiddling with some
counters. I will change it to call SetPageReserved() directly, which will
just set the bit in struct page's flags. I will post these shortly as the
'fixes' way of solving the hibernate fallout.

I guess any arch that uses memblock nomap needs core code to take account
of it, but at the moment that is just arm/arm64.

If we are adding new pfn_is_ calls, we could try and clean up pfn_valid()
users to use pfn_is_memory(), pfn_is_mapped() or pfn_has_memmap(). Part of
the problem is 'valid' means different things to different people.

> The question arises what to do with that mem at all. There could be
> mappings by the kernel, e.g. of acpi tables. We can't assume the mem
> regions still come out the same from the BIOS during resume.

Unfortunately we have to assume this. If the firmware reserved regions move
around in memory we can't resume from hibernate. Other OSes also require
this not to happen. ([0] 'firmware memory requirements')

Hibernate core code checks the number of pages of kernel memory is the same
before trying to resume. If you just move the allocations around this will
panic during resume as the resume kernel will have surprising holes in its
linear map.

x86 recently grew an MD5sum check of the e820 memory map, I intend to do
the same for memblock. The theory is this would only happen if you change
the hardware in some way, and that otherwise the firmware is entirely
deterministic...

> Do we need to save the mem? I can't answer that as I don't know much
> about hibernation yet.

Hibernate only saves/restores the linear map and CPU state. We expect
firmware to put equivalent data in the same places for its nomap regions.
If the region belongs to a device, it's up to the device driver to tidy up.
(It has freeze/thaw/resume callbacks to do this).

Thanks,

James

[0] https://msdn.microsoft.com/en-gb/windows/hardware/commercialize/manufacture/desktop/uefi-requirements-boot-time-runtime-hibernation-state--s4
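A hedged sketch of what "call SetPageReserved() directly" for the nomap regions could look like in arch/arm64/mm/init.c. The helper name is invented here and this is not James's posted code; it only illustrates walking the NOMAP memblock regions and setting PG_reserved without touching the managed-page counters:

/* Illustrative only: mark every page of a NOMAP region PG_reserved. */
static void __init reserve_nomap_pages(void)
{
	struct memblock_region *reg;
	unsigned long pfn;

	for_each_memblock(memory, reg) {
		if (!memblock_is_nomap(reg))
			continue;

		for (pfn = memblock_region_memory_base_pfn(reg);
		     pfn < memblock_region_memory_end_pfn(reg); pfn++)
			SetPageReserved(pfn_to_page(pfn));
	}
}

This relies on struct pages existing for nomap pages (i.e. the pfn_valid() change under discussion), and deliberately avoids mark_page_reserved(), which would also adjust the managed-page counters.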
Patch "arm64: mm: Fix memmap to be initialized for the entire section" changes pfn_valid() in a way that breaks hibernate. These patches fix hibernate, and provided struct page's are allocated for nomap pages, can be applied before [0]. Hibernate core code belives 'valid' to mean "I can access this". It uses pfn_valid() to test the page if the page is 'valid'. pfn_valid() needs to be changed so that all struct pages in a numa node have the same node-id. Currently 'nomap' pages are skipped, and retain their pre-numa node-ids, which leads to a later BUG_ON(). These patches make hibernate's savable_page() take its escape route via 'if (PageReserved(page) && pfn_is_nosave(pfn))'. [0] https://lkml.org/lkml/2016/11/30/566 James Morse (2): arm64: mm: Mark nomap regions with the PG_reserved flag arm64: hibernate: report nomap regions as being pfn_nosave arch/arm64/kernel/hibernate.c | 6 +++++- arch/arm64/mm/init.c | 14 ++++++++++++++ 2 files changed, 19 insertions(+), 1 deletion(-)
On 2 December 2016 at 14:49, James Morse <james.morse@arm.com> wrote:
> Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> changes pfn_valid() in a way that breaks hibernate. These patches fix
> hibernate, and provided struct pages are allocated for nomap pages,
> can be applied before [0].
>
> Hibernate core code believes 'valid' to mean "I can access this". It
> uses pfn_valid() to test whether a page is 'valid'.
>
> pfn_valid() needs to be changed so that all struct pages in a numa
> node have the same node-id. Currently 'nomap' pages are skipped, and
> retain their pre-numa node-ids, which leads to a later BUG_ON().
>
> These patches make hibernate's saveable_page() take its escape route
> via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
>

This makes me feel slightly uneasy. Robert makes a convincing point,
but I wonder if we can expect more fallout from the ambiguity of
pfn_valid(). Now we are not only forced to assign non-existing (as far
as the OS is concerned) pages to the correct NUMA node, we also need
to set certain page flags.
On Mon, Dec 05, 2016 at 03:42:14PM +0000, Ard Biesheuvel wrote:
> On 2 December 2016 at 14:49, James Morse <james.morse@arm.com> wrote:
> > Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> > changes pfn_valid() in a way that breaks hibernate. These patches fix
> > hibernate, and provided struct pages are allocated for nomap pages,
> > can be applied before [0].
> >
> > Hibernate core code believes 'valid' to mean "I can access this". It
> > uses pfn_valid() to test whether a page is 'valid'.
> >
> > pfn_valid() needs to be changed so that all struct pages in a numa
> > node have the same node-id. Currently 'nomap' pages are skipped, and
> > retain their pre-numa node-ids, which leads to a later BUG_ON().
> >
> > These patches make hibernate's saveable_page() take its escape route
> > via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
> >
>
> This makes me feel slightly uneasy. Robert makes a convincing point,
> but I wonder if we can expect more fallout from the ambiguity of
> pfn_valid(). Now we are not only forced to assign non-existing (as far
> as the OS is concerned) pages to the correct NUMA node, we also need
> to set certain page flags.

Yes, I really don't know how to proceed here. Playing whack-a-mole with
pfn_valid() users doesn't sound like an improvement on the current
situation to me.

Robert -- if we leave pfn_valid() as it is, would a point-hack to
memmap_init_zone help, or do you anticipate other problems?

Will
On 06.12.16 17:38:11, Will Deacon wrote:
> On Mon, Dec 05, 2016 at 03:42:14PM +0000, Ard Biesheuvel wrote:
> > On 2 December 2016 at 14:49, James Morse <james.morse@arm.com> wrote:
> > > Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> > > changes pfn_valid() in a way that breaks hibernate. These patches fix
> > > hibernate, and provided struct pages are allocated for nomap pages,
> > > can be applied before [0].
> > >
> > > Hibernate core code believes 'valid' to mean "I can access this". It
> > > uses pfn_valid() to test whether a page is 'valid'.
> > >
> > > pfn_valid() needs to be changed so that all struct pages in a numa
> > > node have the same node-id. Currently 'nomap' pages are skipped, and
> > > retain their pre-numa node-ids, which leads to a later BUG_ON().
> > >
> > > These patches make hibernate's saveable_page() take its escape route
> > > via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
> > >
> >
> > This makes me feel slightly uneasy. Robert makes a convincing point,
> > but I wonder if we can expect more fallout from the ambiguity of
> > pfn_valid(). Now we are not only forced to assign non-existing (as far
> > as the OS is concerned) pages to the correct NUMA node, we also need
> > to set certain page flags.
>
> Yes, I really don't know how to proceed here. Playing whack-a-mole with
> pfn_valid() users doesn't sound like an improvement on the current
> situation to me.
>
> Robert -- if we leave pfn_valid() as it is, would a point-hack to
> memmap_init_zone help, or do you anticipate other problems?

I would suggest to fix the hibernation code as I commented on before
to use pfn_is_nosave() that defaults to pfn_valid() but uses
memblock_is_nomap() for arm64. Let's just fix it and see if no other
issues arise. I am trying to send a patch for this by tomorrow.

I am also going to see how early_pfn_valid() could be redirected to
use memblock_is_nomap() on arm64. That would "quick fix" the problem,
though I rather prefer to go further with the current solution.

-Robert
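A sketch of the early_pfn_valid() "quick fix" Robert mentions. This is purely illustrative: include/linux/mmzone.h has no arch override hook for early_pfn_valid() today, so a small core change would be needed along these lines:

/*
 * Illustrative only.  include/linux/mmzone.h would need to allow an
 * override, e.g.:
 *
 *   #ifndef early_pfn_valid
 *   #define early_pfn_valid(pfn)	pfn_valid(pfn)
 *   #endif
 *
 * arm64 could then keep pfn_valid() == memblock_is_map_memory() while
 * letting memmap_init_zone() initialise struct pages for NOMAP ranges:
 */
#define early_pfn_valid(pfn)	memblock_is_memory(__pfn_to_phys(pfn))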
On Wed, Dec 07, 2016 at 10:06:38AM +0100, Robert Richter wrote:
> On 06.12.16 17:38:11, Will Deacon wrote:
> > On Mon, Dec 05, 2016 at 03:42:14PM +0000, Ard Biesheuvel wrote:
> > > On 2 December 2016 at 14:49, James Morse <james.morse@arm.com> wrote:
> > > > Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> > > > changes pfn_valid() in a way that breaks hibernate. These patches fix
> > > > hibernate, and provided struct pages are allocated for nomap pages,
> > > > can be applied before [0].
> > > >
> > > > Hibernate core code believes 'valid' to mean "I can access this". It
> > > > uses pfn_valid() to test whether a page is 'valid'.
> > > >
> > > > pfn_valid() needs to be changed so that all struct pages in a numa
> > > > node have the same node-id. Currently 'nomap' pages are skipped, and
> > > > retain their pre-numa node-ids, which leads to a later BUG_ON().
> > > >
> > > > These patches make hibernate's saveable_page() take its escape route
> > > > via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
> > > >
> > >
> > > This makes me feel slightly uneasy. Robert makes a convincing point,
> > > but I wonder if we can expect more fallout from the ambiguity of
> > > pfn_valid(). Now we are not only forced to assign non-existing (as far
> > > as the OS is concerned) pages to the correct NUMA node, we also need
> > > to set certain page flags.
> >
> > Yes, I really don't know how to proceed here. Playing whack-a-mole with
> > pfn_valid() users doesn't sound like an improvement on the current
> > situation to me.
> >
> > Robert -- if we leave pfn_valid() as it is, would a point-hack to
> > memmap_init_zone help, or do you anticipate other problems?
>
> I would suggest to fix the hibernation code as I commented on before
> to use pfn_is_nosave() that defaults to pfn_valid() but uses
> memblock_is_nomap() for arm64. Let's just fix it and see if no other
> issues arise. I am trying to send a patch for this by tomorrow.

I'd rather not use mainline as a guinea pig like this, since I'd be very
surprised if other places don't break given the scope for different
interpretations of pfn_valid.

> I am also going to see how early_pfn_valid() could be redirected to
> use memblock_is_nomap() on arm64. That would "quick fix" the problem,
> though I rather prefer to go further with the current solution.

I don't like either of them, but early_pfn_valid is easier to revert so
let's go with that.

Will
Hi Robert,

We have merged your patch to 4.9.0-rc8, however we still meet a similar
problem on our D05 board:

Thanks,
Yisheng Xie

-----------------
[ 5.081971] ------------[ cut here ]------------
[ 5.086668] kernel BUG at mm/page_alloc.c:1863!
[ 5.091281] Internal error: Oops - BUG: 0 [#1] SMP
[ 5.096159] Modules linked in:
[ 5.099265] CPU: 61 PID: 1 Comm: swapper/0 Not tainted 4.9.0-rc8+ #58
[ 5.105822] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 UEFI 16.08 RC1 12/08/2016
[ 5.114860] task: fffffe13f23d7400 task.stack: fffffe13f66f8000
[ 5.120903] PC is at move_freepages+0x170/0x180
[ 5.125514] LR is at move_freepages_block+0xa8/0xb8
[ 5.130479] pc : [<fffffc00081e9588>] lr : [<fffffc00081e9640>] pstate: 200000c5
[ 5.138008] sp : fffffe13f66fb910
[ 5.141375] x29: fffffe13f66fb910 x28: 00000000ffffff80
[ 5.146784] x27: fffffdff800b0000 x26: fffffe13fbf62b90
[ 5.152193] x25: 000000000000000a x24: fffffdff800b0020
[ 5.157598] x23: 0000000000000000 x22: fffffdff800b0000
[ 5.163006] x21: fffffdff800fffc0 x20: fffffe13fbf62680
[ 5.168412] x19: fffffdff80080000 x18: 000000004c4d6d26
[ 5.173820] x17: 0000000000000000 x16: 0000000000000000
[ 5.179226] x15: 000000005c589429 x14: 0000000000000000
[ 5.184634] x13: 0000000000000000 x12: 00000000fe2ce6e0
[ 5.190039] x11: 00000000bb5b525b x10: 00000000bb48417b
[ 5.195446] x9 : 0000000000000068 x8 : 0000000000000000
[ 5.200854] x7 : fffffe13fbff2680 x6 : 0000000000000001
[ 5.206260] x5 : fffffc0008e12328 x4 : 0000000000000000
[ 5.211667] x3 : fffffe13fbf62680 x2 : 0000000000000000
[ 5.217073] x1 : fffffe13fbff2680 x0 : fffffe13fbf62680
[...]
[ 5.768991] [<fffffc00081e9588>] move_freepages+0x170/0x180
[ 5.774664] [<fffffc00081e9640>] move_freepages_block+0xa8/0xb8
[ 5.780691] [<fffffc00081e9bbc>] __rmqueue+0x494/0x5f0
[ 5.785921] [<fffffc00081eb10c>] get_page_from_freelist+0x5ec/0xb58
[ 5.792302] [<fffffc00081ebc4c>] __alloc_pages_nodemask+0x144/0xd08
[ 5.798687] [<fffffc0008240514>] alloc_page_interleave+0x64/0xc0
[ 5.804801] [<fffffc0008240b20>] alloc_pages_current+0x108/0x168
[ 5.810920] [<fffffc0008c75410>] atomic_pool_init+0x78/0x1cc
[ 5.816680] [<fffffc0008c755a0>] arm64_dma_init+0x3c/0x44
[ 5.822180] [<fffffc0008082d94>] do_one_initcall+0x44/0x138
[ 5.827853] [<fffffc0008c70d54>] kernel_init_freeable+0x1ec/0x28c
[ 5.834060] [<fffffc00088a79f0>] kernel_init+0x18/0x110
[ 5.839378] [<fffffc0008082b30>] ret_from_fork+0x10/0x20
[ 5.844785] Code: 9108a021 9400afc5 d4210000 d503201f (d4210000)
[ 5.851024] ---[ end trace 3382df1ae82057db ]---

On 2016/12/1 2:21, Robert Richter wrote:
> On ThunderX systems with certain memory configurations we see the
> following BUG_ON():
>
> kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The BUG_ON()
> checks if start and end page of a memmap range belong to the same
> zone.
>
> The BUG_ON() check fails if a memory zone contains NOMAP regions. In
> this case the node information of those pages is not initialized. This
> causes an inconsistency of the page links with wrong zone and node
> information for those pages. NOMAP pages from node 1 still point to the
> mem zone from node 0 and have the wrong nid assigned.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked NOMAP as invalid:
>
> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>
> This causes pages marked as nomap to no longer be reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
> > Fixing this by restoring the old behavior of pfn_valid() to use > memblock_is_memory(). Also changing users of pfn_valid() in arm64 code > to use memblock_is_map_memory() where necessary. This only affects > code in ioremap.c. The code in mmu.c still can use the new version of > pfn_valid(). > > As a consequence, pfn_valid() can not be used to check if a physical > page is RAM. It just checks if there is an underlying memmap with a > valid struct page. Moreover, for performance reasons the whole memmap > (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM > config). The memory range is extended to fit the alignment of the > memmap. Thus, pfn_valid() may return true for pfns that do not map to > physical memory. Those pages are simply not reported to the mm, they > are not marked reserved nor added to the list of free pages. Other > functions such a page_is_ram() or memblock_is_map_ memory() must be > used to check for memory and if the page can be mapped with the linear > mapping. > > Since NOMAP mem ranges may need to be mapped with different mem > attributes (e.g. read-only or non-caching) we can not use linear > mapping here. The use of memblock_is_memory() in pfn_valid() may not > break this behaviour. Since commit: > > e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem > > NOMAP mem resources are no longer marked as system RAM (IORESOURCE_ > SYSTEM_RAM). Now page_is_ram() and region_intersects() (see > memremap()) do not detect NOMAP mem as system ram and NOMAP mem is not > added to the linear mapping as system RAM is. > > v2: > > * Added Ack > * updated description to reflect the discussion > > Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> > Signed-off-by: Robert Richter <rrichter@cavium.com> > --- > arch/arm64/mm/init.c | 2 +- > arch/arm64/mm/ioremap.c | 5 +++-- > 2 files changed, 4 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 212c4d1e2f26..166911f4a2e6 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max) > #ifdef CONFIG_HAVE_ARCH_PFN_VALID > int pfn_valid(unsigned long pfn) > { > - return memblock_is_map_memory(pfn << PAGE_SHIFT); > + return memblock_is_memory(pfn << PAGE_SHIFT); > } > EXPORT_SYMBOL(pfn_valid); > #endif > diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c > index 01e88c8bcab0..c17c220b0c48 100644 > --- a/arch/arm64/mm/ioremap.c > +++ b/arch/arm64/mm/ioremap.c > @@ -21,6 +21,7 @@ > */ > > #include <linux/export.h> > +#include <linux/memblock.h> > #include <linux/mm.h> > #include <linux/vmalloc.h> > #include <linux/io.h> > @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size, > /* > * Don't allow RAM to be mapped. > */ > - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr)))) > + if (WARN_ON(memblock_is_map_memory(phys_addr))) > return NULL; > > area = get_vm_area_caller(size, VM_IOREMAP, caller); > @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap); > void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size) > { > /* For normal memory we already have a cacheable mapping. */ > - if (pfn_valid(__phys_to_pfn(phys_addr))) > + if (memblock_is_map_memory(phys_addr)) > return (void __iomem *)__phys_to_virt(phys_addr); > > return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL), >
On 9 December 2016 at 12:14, Yisheng Xie <xieyisheng1@huawei.com> wrote: > Hi Robert, > We have merged your patch to 4.9.0-rc8, however we still meet the similar problem > on our D05 board: > To be clear: does this issue also occur on D05 *without* the patch? > ----------------- > [ 5.081971] ------------[ cut here ]------------ > [ 5.086668] kernel BUG at mm/page_alloc.c:1863! > [ 5.091281] Internal error: Oops - BUG: 0 [#1] SMP > [ 5.096159] Modules linked in: > [ 5.099265] CPU: 61 PID: 1 Comm: swapper/0 Not tainted 4.9.0-rc8+ #58 > [ 5.105822] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 UEFI 16.08 RC1 12/08/2016 > [ 5.114860] task: fffffe13f23d7400 task.stack: fffffe13f66f8000 > [ 5.120903] PC is at move_freepages+0x170/0x180 > [ 5.125514] LR is at move_freepages_block+0xa8/0xb8 > [ 5.130479] pc : [<fffffc00081e9588>] lr : [<fffffc00081e9640>] pstate: 200000c5 > [ 5.138008] sp : fffffe13f66fb910 > [ 5.141375] x29: fffffe13f66fb910 x28: 00000000ffffff80 > [ 5.146784] x27: fffffdff800b0000 x26: fffffe13fbf62b90 > [ 5.152193] x25: 000000000000000a x24: fffffdff800b0020 > [ 5.157598] x23: 0000000000000000 x22: fffffdff800b0000 > [ 5.163006] x21: fffffdff800fffc0 x20: fffffe13fbf62680 > [ 5.168412] x19: fffffdff80080000 x18: 000000004c4d6d26 > [ 5.173820] x17: 0000000000000000 x16: 0000000000000000 > [ 5.179226] x15: 000000005c589429 x14: 0000000000000000 > [ 5.184634] x13: 0000000000000000 x12: 00000000fe2ce6e0 > [ 5.190039] x11: 00000000bb5b525b x10: 00000000bb48417b > [ 5.195446] x9 : 0000000000000068 x8 : 0000000000000000 > [ 5.200854] x7 : fffffe13fbff2680 x6 : 0000000000000001 > [ 5.206260] x5 : fffffc0008e12328 x4 : 0000000000000000 > [ 5.211667] x3 : fffffe13fbf62680 x2 : 0000000000000000 > [ 5.217073] x1 : fffffe13fbff2680 x0 : fffffe13fbf62680 > [...] > [ 5.768991] [<fffffc00081e9588>] move_freepages+0x170/0x180 > [ 5.774664] [<fffffc00081e9640>] move_freepages_block+0xa8/0xb8 > [ 5.780691] [<fffffc00081e9bbc>] __rmqueue+0x494/0x5f0 > [ 5.785921] [<fffffc00081eb10c>] get_page_from_freelist+0x5ec/0xb58 > [ 5.792302] [<fffffc00081ebc4c>] __alloc_pages_nodemask+0x144/0xd08 > [ 5.798687] [<fffffc0008240514>] alloc_page_interleave+0x64/0xc0 > [ 5.804801] [<fffffc0008240b20>] alloc_pages_current+0x108/0x168 > [ 5.810920] [<fffffc0008c75410>] atomic_pool_init+0x78/0x1cc > [ 5.816680] [<fffffc0008c755a0>] arm64_dma_init+0x3c/0x44 > [ 5.822180] [<fffffc0008082d94>] do_one_initcall+0x44/0x138 > [ 5.827853] [<fffffc0008c70d54>] kernel_init_freeable+0x1ec/0x28c > [ 5.834060] [<fffffc00088a79f0>] kernel_init+0x18/0x110 > [ 5.839378] [<fffffc0008082b30>] ret_from_fork+0x10/0x20 > [ 5.844785] Code: 9108a021 9400afc5 d4210000 d503201f (d4210000) > [ 5.851024] ---[ end trace 3382df1ae82057db ]--- > > > On 2016/12/1 2:21, Robert Richter wrote: >> On ThunderX systems with certain memory configurations we see the >> following BUG_ON(): >> >> kernel BUG at mm/page_alloc.c:1848! >> >> This happens for some configs with 64k page size enabled. The BUG_ON() >> checks if start and end page of a memmap range belongs to the same >> zone. >> >> The BUG_ON() check fails if a memory zone contains NOMAP regions. In >> this case the node information of those pages is not initialized. This >> causes an inconsistency of the page links with wrong zone and node >> information for that pages. NOMAP pages from node 1 still point to the >> mem zone from node 0 and have the wrong nid assigned. 
>> >> The reason for the mis-configuration is a change in pfn_valid() which >> reports pages marked NOMAP as invalid: >> >> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping >> >> This causes pages marked as nomap being no long reassigned to the new >> zone in memmap_init_zone() by calling __init_single_pfn(). >> >> Fixing this by restoring the old behavior of pfn_valid() to use >> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code >> to use memblock_is_map_memory() where necessary. This only affects >> code in ioremap.c. The code in mmu.c still can use the new version of >> pfn_valid(). >> >> As a consequence, pfn_valid() can not be used to check if a physical >> page is RAM. It just checks if there is an underlying memmap with a >> valid struct page. Moreover, for performance reasons the whole memmap >> (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM >> config). The memory range is extended to fit the alignment of the >> memmap. Thus, pfn_valid() may return true for pfns that do not map to >> physical memory. Those pages are simply not reported to the mm, they >> are not marked reserved nor added to the list of free pages. Other >> functions such a page_is_ram() or memblock_is_map_ memory() must be >> used to check for memory and if the page can be mapped with the linear >> mapping. >> >> Since NOMAP mem ranges may need to be mapped with different mem >> attributes (e.g. read-only or non-caching) we can not use linear >> mapping here. The use of memblock_is_memory() in pfn_valid() may not >> break this behaviour. Since commit: >> >> e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem >> >> NOMAP mem resources are no longer marked as system RAM (IORESOURCE_ >> SYSTEM_RAM). Now page_is_ram() and region_intersects() (see >> memremap()) do not detect NOMAP mem as system ram and NOMAP mem is not >> added to the linear mapping as system RAM is. >> >> v2: >> >> * Added Ack >> * updated description to reflect the discussion >> >> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> >> Signed-off-by: Robert Richter <rrichter@cavium.com> >> --- >> arch/arm64/mm/init.c | 2 +- >> arch/arm64/mm/ioremap.c | 5 +++-- >> 2 files changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c >> index 212c4d1e2f26..166911f4a2e6 100644 >> --- a/arch/arm64/mm/init.c >> +++ b/arch/arm64/mm/init.c >> @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max) >> #ifdef CONFIG_HAVE_ARCH_PFN_VALID >> int pfn_valid(unsigned long pfn) >> { >> - return memblock_is_map_memory(pfn << PAGE_SHIFT); >> + return memblock_is_memory(pfn << PAGE_SHIFT); >> } >> EXPORT_SYMBOL(pfn_valid); >> #endif >> diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c >> index 01e88c8bcab0..c17c220b0c48 100644 >> --- a/arch/arm64/mm/ioremap.c >> +++ b/arch/arm64/mm/ioremap.c >> @@ -21,6 +21,7 @@ >> */ >> >> #include <linux/export.h> >> +#include <linux/memblock.h> >> #include <linux/mm.h> >> #include <linux/vmalloc.h> >> #include <linux/io.h> >> @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size, >> /* >> * Don't allow RAM to be mapped. 
>> */ >> - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr)))) >> + if (WARN_ON(memblock_is_map_memory(phys_addr))) >> return NULL; >> >> area = get_vm_area_caller(size, VM_IOREMAP, caller); >> @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap); >> void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size) >> { >> /* For normal memory we already have a cacheable mapping. */ >> - if (pfn_valid(__phys_to_pfn(phys_addr))) >> + if (memblock_is_map_memory(phys_addr)) >> return (void __iomem *)__phys_to_virt(phys_addr); >> >> return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL), >> >
On 12/09/2016 08:19 PM, Ard Biesheuvel wrote:
> On 9 December 2016 at 12:14, Yisheng Xie <xieyisheng1@huawei.com> wrote:
>> Hi Robert,
>> We have merged your patch to 4.9.0-rc8, however we still meet a similar
>> problem on our D05 board:
>>
>
> To be clear: does this issue also occur on D05 *without* the patch?

It boots ok on D05 without this patch. But I think the problem Robert
described in the commit message is still there, just not triggered during
boot. We met this problem when running LTP memory stress tests and hit the
same BUG_ON.

Thanks
Hanjun
On 09.12.16 20:14:24, Yisheng Xie wrote:
> We have merged your patch to 4.9.0-rc8, however we still meet a similar
> problem on our D05 board:

I assume you can reliably trigger the bug. Can you add some debug
messages that show the pfn number, node id and memory range?

Also, add to kernel parameters:

 memblock=debug efi=debug loglevel=8 mminit_loglevel=4

We need the mem layout detected by efi and the memblock regions. This
should be in the full boot log then.

Thanks,

-Robert

> -----------------
> [ 5.081971] ------------[ cut here ]------------
> [ 5.086668] kernel BUG at mm/page_alloc.c:1863!
> [ 5.091281] Internal error: Oops - BUG: 0 [#1] SMP
> [ 5.096159] Modules linked in:
> [ 5.099265] CPU: 61 PID: 1 Comm: swapper/0 Not tainted 4.9.0-rc8+ #58
> [ 5.105822] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 UEFI 16.08 RC1 12/08/2016
> [ 5.114860] task: fffffe13f23d7400 task.stack: fffffe13f66f8000
> [ 5.120903] PC is at move_freepages+0x170/0x180
> [ 5.125514] LR is at move_freepages_block+0xa8/0xb8
> [ 5.130479] pc : [<fffffc00081e9588>] lr : [<fffffc00081e9640>] pstate: 200000c5
> [ 5.138008] sp : fffffe13f66fb910
> [ 5.141375] x29: fffffe13f66fb910 x28: 00000000ffffff80
> [ 5.146784] x27: fffffdff800b0000 x26: fffffe13fbf62b90
> [ 5.152193] x25: 000000000000000a x24: fffffdff800b0020
> [ 5.157598] x23: 0000000000000000 x22: fffffdff800b0000
> [ 5.163006] x21: fffffdff800fffc0 x20: fffffe13fbf62680
> [ 5.168412] x19: fffffdff80080000 x18: 000000004c4d6d26
> [ 5.173820] x17: 0000000000000000 x16: 0000000000000000
> [ 5.179226] x15: 000000005c589429 x14: 0000000000000000
> [ 5.184634] x13: 0000000000000000 x12: 00000000fe2ce6e0
> [ 5.190039] x11: 00000000bb5b525b x10: 00000000bb48417b
> [ 5.195446] x9 : 0000000000000068 x8 : 0000000000000000
> [ 5.200854] x7 : fffffe13fbff2680 x6 : 0000000000000001
> [ 5.206260] x5 : fffffc0008e12328 x4 : 0000000000000000
> [ 5.211667] x3 : fffffe13fbf62680 x2 : 0000000000000000
> [ 5.217073] x1 : fffffe13fbff2680 x0 : fffffe13fbf62680
> [...]
> [ 5.768991] [<fffffc00081e9588>] move_freepages+0x170/0x180
> [ 5.774664] [<fffffc00081e9640>] move_freepages_block+0xa8/0xb8
> [ 5.780691] [<fffffc00081e9bbc>] __rmqueue+0x494/0x5f0
> [ 5.785921] [<fffffc00081eb10c>] get_page_from_freelist+0x5ec/0xb58
> [ 5.792302] [<fffffc00081ebc4c>] __alloc_pages_nodemask+0x144/0xd08
> [ 5.798687] [<fffffc0008240514>] alloc_page_interleave+0x64/0xc0
> [ 5.804801] [<fffffc0008240b20>] alloc_pages_current+0x108/0x168
> [ 5.810920] [<fffffc0008c75410>] atomic_pool_init+0x78/0x1cc
> [ 5.816680] [<fffffc0008c755a0>] arm64_dma_init+0x3c/0x44
> [ 5.822180] [<fffffc0008082d94>] do_one_initcall+0x44/0x138
> [ 5.827853] [<fffffc0008c70d54>] kernel_init_freeable+0x1ec/0x28c
> [ 5.834060] [<fffffc00088a79f0>] kernel_init+0x18/0x110
> [ 5.839378] [<fffffc0008082b30>] ret_from_fork+0x10/0x20
> [ 5.844785] Code: 9108a021 9400afc5 d4210000 d503201f (d4210000)
> [ 5.851024] ---[ end trace 3382df1ae82057db ]---
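One possible way to capture the information Robert asks for; this is an illustrative debug hunk, not something posted in this thread. It prints both ends of the range just before the check in move_freepages() (mm/page_alloc.c) that fires:

	/* Illustrative debug output: pfn, nid and zone of both range ends. */
	if (page_zone(start_page) != page_zone(end_page))
		pr_err("move_freepages: start pfn %lx (nid %d, zone %s) vs end pfn %lx (nid %d, zone %s)\n",
		       page_to_pfn(start_page), page_to_nid(start_page),
		       page_zone(start_page)->name,
		       page_to_pfn(end_page), page_to_nid(end_page),
		       page_zone(end_page)->name);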
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 212c4d1e2f26..166911f4a2e6 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 int pfn_valid(unsigned long pfn)
 {
-	return memblock_is_map_memory(pfn << PAGE_SHIFT);
+	return memblock_is_memory(pfn << PAGE_SHIFT);
 }
 EXPORT_SYMBOL(pfn_valid);
 #endif
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index 01e88c8bcab0..c17c220b0c48 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -21,6 +21,7 @@
  */
 
 #include <linux/export.h>
+#include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/io.h>
@@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
 	/*
 	 * Don't allow RAM to be mapped.
 	 */
-	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
+	if (WARN_ON(memblock_is_map_memory(phys_addr)))
 		return NULL;
 
 	area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
 	/* For normal memory we already have a cacheable mapping. */
-	if (pfn_valid(__phys_to_pfn(phys_addr)))
+	if (memblock_is_map_memory(phys_addr))
 		return (void __iomem *)__phys_to_virt(phys_addr);
 
 	return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),