Message ID | 1475747527-32387-1-git-send-email-rrichter@cavium.com (mailing list archive)
---|---
State | New, archived
Hi Robert,

Apologies for only responding now. I did not quite manage to get my
head around your original email yet, but I don't think this patch is
the correct solution.

On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> There is a memory setup problem on ThunderX systems with certain
> memory configurations. The symptom is
>
>   kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The bug
> triggers for page zones with some pages in the zone not assigned to
> this particular zone. In my case some pages that are marked as nomap
> were not reassigned to the new zone of node 1, so those are still
> assigned to node 0.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked nomap as invalid:
>
>  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>

These pages are owned by the firmware, which may map it with
attributes that conflict with the attributes we use for the linear
mapping. This means they should not be covered by the linear mapping.

> This causes pages marked as nomap no longer being reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
>

This sounds like the root cause of your issue. Could we not fix that instead?

> Fixing this by restoring the old behavior of pfn_valid() to use
> memblock_is_memory().

This is incorrect imo. In general, pfn_valid() means ordinary memory
covered by the linear mapping and the struct page array. Returning
reserved ranges that the kernel should not even touch only to please
the NUMA code seems like an inappropriate way to deal with this issue.

> Also changing users of pfn_valid() in arm64 code
> to use memblock_is_map_memory() where necessary. This only affects
> code in ioremap.c. The code in mmu.c still can use the new version of
> pfn_valid().
>
> Should be marked stable v4.5..
>
> Signed-off-by: Robert Richter <rrichter@cavium.com>
> ---
>  arch/arm64/mm/init.c    | 2 +-
>  arch/arm64/mm/ioremap.c | 5 +++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index bbb7ee76e319..25b8659c2a9f 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>  #ifdef CONFIG_HAVE_ARCH_PFN_VALID
>  int pfn_valid(unsigned long pfn)
>  {
> -	return memblock_is_map_memory(pfn << PAGE_SHIFT);
> +	return memblock_is_memory(pfn << PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(pfn_valid);
>  #endif
> diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
> index 01e88c8bcab0..c17c220b0c48 100644
> --- a/arch/arm64/mm/ioremap.c
> +++ b/arch/arm64/mm/ioremap.c
> @@ -21,6 +21,7 @@
>   */
>
>  #include <linux/export.h>
> +#include <linux/memblock.h>
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  #include <linux/io.h>
> @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
>  	/*
>  	 * Don't allow RAM to be mapped.
>  	 */
> -	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> +	if (WARN_ON(memblock_is_map_memory(phys_addr)))
>  		return NULL;
>
>  	area = get_vm_area_caller(size, VM_IOREMAP, caller);
> @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
>  void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
>  {
>  	/* For normal memory we already have a cacheable mapping. */
> -	if (pfn_valid(__phys_to_pfn(phys_addr)))
> +	if (memblock_is_map_memory(phys_addr))
>  		return (void __iomem *)__phys_to_virt(phys_addr);
>
>  	return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
> --
> 2.7.0.rc3
>
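For readers skimming the patch: the distinction it relies on is that
memblock_is_memory() only asks whether an address lies in a
memblock.memory region, while memblock_is_map_memory() additionally
rejects regions marked MEMBLOCK_NOMAP (firmware-owned memory that must
stay out of the linear mapping). A minimal userspace sketch of that
distinction — the region list and values below are made up for
illustration, this is not the kernel's memblock code — could look like:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t phys_addr_t;

/* Simplified stand-in for a memblock.memory region (illustrative only). */
struct region {
	phys_addr_t base;
	phys_addr_t size;
	bool nomap;	/* MEMBLOCK_NOMAP: firmware-owned, keep out of linear map */
};

/* Made-up example layout: one normal RAM region, one NOMAP region. */
static const struct region memory[] = {
	{ 0x01400000, 0xfeba0000, false },
	{ 0xfffe0000, 0x00020000, true  },
};

static const struct region *find_region(phys_addr_t addr)
{
	for (size_t i = 0; i < sizeof(memory) / sizeof(memory[0]); i++)
		if (addr >= memory[i].base &&
		    addr - memory[i].base < memory[i].size)
			return &memory[i];
	return NULL;
}

/* "Is this address backed by RAM at all?" */
static bool model_is_memory(phys_addr_t addr)
{
	return find_region(addr) != NULL;
}

/* "Is this address RAM *and* allowed into the linear mapping?" */
static bool model_is_map_memory(phys_addr_t addr)
{
	const struct region *r = find_region(addr);

	return r && !r->nomap;
}

int main(void)
{
	phys_addr_t nomap_addr = 0xfffe0000;	/* inside the NOMAP region */

	/* With the patch, pfn_valid() answers the first question, while
	 * ioremap()/ioremap_cache() keep asking the second one explicitly. */
	printf("is_memory:     %d\n", model_is_memory(nomap_addr));	/* 1 */
	printf("is_map_memory: %d\n", model_is_map_memory(nomap_addr));	/* 0 */
	return 0;
}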
Ard,

thank you for your answer and your explanation.

On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > There is a memory setup problem on ThunderX systems with certain
> > memory configurations. The symptom is
> >
> >   kernel BUG at mm/page_alloc.c:1848!
> >
> > This happens for some configs with 64k page size enabled. The bug
> > triggers for page zones with some pages in the zone not assigned to
> > this particular zone. In my case some pages that are marked as nomap
> > were not reassigned to the new zone of node 1, so those are still
> > assigned to node 0.
> >
> > The reason for the mis-configuration is a change in pfn_valid() which
> > reports pages marked nomap as invalid:
> >
> >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> >
>
> These pages are owned by the firmware, which may map it with
> attributes that conflict with the attributes we use for the linear
> mapping. This means they should not be covered by the linear mapping.
>
> > This causes pages marked as nomap no longer being reassigned to the new
> > zone in memmap_init_zone() by calling __init_single_pfn().
> >
>
> This sounds like the root cause of your issue. Could we not fix that instead?

Yes, this is proposal b) from my last mail that would work too: I
implemented an arm64 private early_pfn_valid() function that uses
memblock_is_memory() to set up all pages of a zone. Though, I think
this is the wrong way and thus I prefer this patch instead. I see
several reasons for this:

Inconsistent use of struct *page, it is initialized but never used
again.

Other archs only do a basic range check in pfn_valid(), the default
implementation just returns whether the whole section is valid. As I
understand the code, if the mem range is not aligned to the section,
then there will be pfns in the section that don't have physical mem
attached. The page is then just initialized, it's not marked reserved
nor is the refcount non-zero. It is then simply not used. This is how
no-map pages should be handled too.

I think pfn_valid() is just a quick check whether the pfn's struct *page
can be used. There is a good description for this in
include/linux/mmzone.h. So there can be memory holes that have a valid
pfn.

If the no-map memory needs special handling, then additional checks
need to be added to the particular code (as in ioremap.c). It's imo
wrong to (mis-)use pfn_valid() for that.

Variant b) involves generic mm code to fix it for arm64, this patch is
an arm64 change only. This makes it harder to get a fix for it.
(Though maybe only a problem of patch logistics.)

> > Fixing this by restoring the old behavior of pfn_valid() to use
> > memblock_is_memory().
>
> This is incorrect imo. In general, pfn_valid() means ordinary memory
> covered by the linear mapping and the struct page array. Returning
> reserved ranges that the kernel should not even touch only to please
> the NUMA code seems like an inappropriate way to deal with this issue.

As said above, it is not marked as reserved, it is treated like
non-existing memory.

This has been observed for non-numa kernels too and can happen for
each zone that is only partly initialized.

I think the patch addresses your concerns. I can't see where the
kernel uses memory marked as nomap in a wrong way.

Thanks,

-Robert

> > Also changing users of pfn_valid() in arm64 code
> > to use memblock_is_map_memory() where necessary. This only affects
> > code in ioremap.c. The code in mmu.c still can use the new version of
> > pfn_valid().
> >
> > Should be marked stable v4.5..
> >
> > Signed-off-by: Robert Richter <rrichter@cavium.com>
> > ---
> >  arch/arm64/mm/init.c    | 2 +-
> >  arch/arm64/mm/ioremap.c | 5 +++--
> >  2 files changed, 4 insertions(+), 3 deletions(-)
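Robert's point about other architectures refers to the generic SPARSEMEM
behaviour: validity is tracked per memory section, not per page, so a
pfn that falls into a hole of a partially populated section is still
reported valid and is expected to have an initialised struct page. A
simplified, self-contained sketch of that per-section logic — the
constants and the section lookup are illustrative stand-ins, not the
kernel source — is:

#include <stdbool.h>

/* Illustrative values only (64k pages as in the report, 1 GB sections). */
#define PAGE_SHIFT		16
#define SECTION_SIZE_BITS	30
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define NR_MEM_SECTIONS		(1UL << 16)

/* Stand-in for the kernel's mem_section[] lookup. */
static bool section_has_mem_map(unsigned long section_nr)
{
	(void)section_nr;
	return true;		/* pretend the section is present */
}

/*
 * Sketch of the generic per-section check: every pfn inside a present
 * section is reported valid, including pfns in holes of that section.
 * Those holes get an initialised (but otherwise unused) struct page,
 * which is the treatment Robert argues NOMAP pages should get as well.
 */
static int sketch_pfn_valid(unsigned long pfn)
{
	unsigned long nr = pfn >> PFN_SECTION_SHIFT;

	if (nr >= NR_MEM_SECTIONS)
		return 0;
	return section_has_mem_map(nr);
}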
On 10/06/2016 02:52 AM, Robert Richter wrote:
> There is a memory setup problem on ThunderX systems with certain
> memory configurations. The symptom is
>
>   kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The bug
> triggers for page zones with some pages in the zone not assigned to
> this particular zone. In my case some pages that are marked as nomap
> were not reassigned to the new zone of node 1, so those are still
> assigned to node 0.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked nomap as invalid:
>
>  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>
> This causes pages marked as nomap no longer being reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
>
> Fixing this by restoring the old behavior of pfn_valid() to use
> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
> to use memblock_is_map_memory() where necessary. This only affects
> code in ioremap.c. The code in mmu.c still can use the new version of
> pfn_valid().
>
> Should be marked stable v4.5..

In that case you should add:

Cc: <stable@vger.kernel.org> # 4.5.x-

here.

>
> Signed-off-by: Robert Richter <rrichter@cavium.com>

[...]
Mark, Will, any opinion here?

Thanks,

-Robert

On 06.10.16 18:11:14, Robert Richter wrote:
> Ard,
>
> thank you for your answer and your explanation.
>
> On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> > On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > > There is a memory setup problem on ThunderX systems with certain
> > > memory configurations. The symptom is
> > >
> > >   kernel BUG at mm/page_alloc.c:1848!
> > >
> > > This happens for some configs with 64k page size enabled. The bug
> > > triggers for page zones with some pages in the zone not assigned to
> > > this particular zone. In my case some pages that are marked as nomap
> > > were not reassigned to the new zone of node 1, so those are still
> > > assigned to node 0.
> > >
> > > The reason for the mis-configuration is a change in pfn_valid() which
> > > reports pages marked nomap as invalid:
> > >
> > >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> > >
> >
> > These pages are owned by the firmware, which may map it with
> > attributes that conflict with the attributes we use for the linear
> > mapping. This means they should not be covered by the linear mapping.
> >
> > > This causes pages marked as nomap no longer being reassigned to the new
> > > zone in memmap_init_zone() by calling __init_single_pfn().
> > >
> >
> > This sounds like the root cause of your issue. Could we not fix that instead?
>
> Yes, this is proposal b) from my last mail that would work too: I
> implemented an arm64 private early_pfn_valid() function that uses
> memblock_is_memory() to set up all pages of a zone. Though, I think
> this is the wrong way and thus I prefer this patch instead. I see
> several reasons for this:
>
> Inconsistent use of struct *page, it is initialized but never used
> again.
>
> Other archs only do a basic range check in pfn_valid(), the default
> implementation just returns whether the whole section is valid. As I
> understand the code, if the mem range is not aligned to the section,
> then there will be pfns in the section that don't have physical mem
> attached. The page is then just initialized, it's not marked reserved
> nor is the refcount non-zero. It is then simply not used. This is how
> no-map pages should be handled too.
>
> I think pfn_valid() is just a quick check whether the pfn's struct *page
> can be used. There is a good description for this in
> include/linux/mmzone.h. So there can be memory holes that have a valid
> pfn.
>
> If the no-map memory needs special handling, then additional checks
> need to be added to the particular code (as in ioremap.c). It's imo
> wrong to (mis-)use pfn_valid() for that.
>
> Variant b) involves generic mm code to fix it for arm64, this patch is
> an arm64 change only. This makes it harder to get a fix for it.
> (Though maybe only a problem of patch logistics.)
>
> > > Fixing this by restoring the old behavior of pfn_valid() to use
> > > memblock_is_memory().
> >
> > This is incorrect imo. In general, pfn_valid() means ordinary memory
> > covered by the linear mapping and the struct page array. Returning
> > reserved ranges that the kernel should not even touch only to please
> > the NUMA code seems like an inappropriate way to deal with this issue.
>
> As said above, it is not marked as reserved, it is treated like
> non-existing memory.
>
> This has been observed for non-numa kernels too and can happen for
> each zone that is only partly initialized.
>
> I think the patch addresses your concerns. I can't see where the
> kernel uses memory marked as nomap in a wrong way.
>
> Thanks,
>
> -Robert
>
> > > Also changing users of pfn_valid() in arm64 code
> > > to use memblock_is_map_memory() where necessary. This only affects
> > > code in ioremap.c. The code in mmu.c still can use the new version of
> > > pfn_valid().
> > >
> > > Should be marked stable v4.5..
> > >
> > > Signed-off-by: Robert Richter <rrichter@cavium.com>
> > > ---
> > >  arch/arm64/mm/init.c    | 2 +-
> > >  arch/arm64/mm/ioremap.c | 5 +++--
> > >  2 files changed, 4 insertions(+), 3 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Robert, Ard,

Sorry for the delay in getting to this; I've been travelling a lot
lately and in the meantime this managed to get buried in my inbox.

On Thu, Oct 06, 2016 at 06:11:14PM +0200, Robert Richter wrote:
> On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> > On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > > There is a memory setup problem on ThunderX systems with certain
> > > memory configurations. The symptom is
> > >
> > >   kernel BUG at mm/page_alloc.c:1848!
> > >
> > > This happens for some configs with 64k page size enabled. The bug
> > > triggers for page zones with some pages in the zone not assigned to
> > > this particular zone. In my case some pages that are marked as nomap
> > > were not reassigned to the new zone of node 1, so those are still
> > > assigned to node 0.
> > >
> > > The reason for the mis-configuration is a change in pfn_valid() which
> > > reports pages marked nomap as invalid:
> > >
> > >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> >
> > These pages are owned by the firmware, which may map it with
> > attributes that conflict with the attributes we use for the linear
> > mapping. This means they should not be covered by the linear mapping.
> >
> > > This causes pages marked as nomap no longer being reassigned to the new
> > > zone in memmap_init_zone() by calling __init_single_pfn().

Why do we have pages for a nomap region? Given the region shouldn't be
in the linear mapping, and isn't suitable for general allocation, I
don't believe it makes sense to have a struct page for any part of it.

Am I missing some reason that we require a struct page?

e.g. is it just easier to allocate an unused struct page than to carve
it out?

> > This sounds like the root cause of your issue. Could we not fix that instead?
>
> Yes, this is proposal b) from my last mail that would work too: I
> implemented an arm64 private early_pfn_valid() function that uses
> memblock_is_memory() to set up all pages of a zone. Though, I think
> this is the wrong way and thus I prefer this patch instead. I see
> several reasons for this:
>
> Inconsistent use of struct *page, it is initialized but never used
> again.

As above, I don't believe we should have a struct page to initialise in
the first place.

> Other archs only do a basic range check in pfn_valid(), the default
> implementation just returns whether the whole section is valid. As I
> understand the code, if the mem range is not aligned to the section,
> then there will be pfns in the section that don't have physical mem
> attached. The page is then just initialized, it's not marked reserved
> nor is the refcount non-zero. It is then simply not used. This is how
> no-map pages should be handled too.
>
> I think pfn_valid() is just a quick check whether the pfn's struct *page
> can be used. There is a good description for this in
> include/linux/mmzone.h. So there can be memory holes that have a valid
> pfn.

I take it you mean the comment in the CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
ifdef (line 1266 in v4.9-rc1)?

I'm not sufficiently acquainted with the memmap code to follow; I'll
need to dig into that a bit further.

> If the no-map memory needs special handling, then additional checks
> need to be added to the particular code (as in ioremap.c). It's imo
> wrong to (mis-)use pfn_valid() for that.
>
> Variant b) involves generic mm code to fix it for arm64, this patch is
> an arm64 change only. This makes it harder to get a fix for it.
> (Though maybe only a problem of patch logistics.)
>
> > > Fixing this by restoring the old behavior of pfn_valid() to use
> > > memblock_is_memory().
> >
> > This is incorrect imo. In general, pfn_valid() means ordinary memory
> > covered by the linear mapping and the struct page array. Returning
> > reserved ranges that the kernel should not even touch only to please
> > the NUMA code seems like an inappropriate way to deal with this issue.
>
> As said above, it is not marked as reserved, it is treated like
> non-existing memory.

I think Ard was using "reserved" in the more general sense than the
Linux-specific meaning. NOMAP is distinct from the Linux concept of
"reserved" memory, but is "reserved" in some sense.

Memory with NOMAP is meant to be treated as non-existent for the purpose
of the linear mapping (and thus for the purpose of struct page).

> This has been observed for non-numa kernels too and can happen for
> each zone that is only partly initialized.
>
> I think the patch addresses your concerns. I can't see where the
> kernel uses memory marked as nomap in a wrong way.

I'll have to dig into this locally; I'm still not familiar enough with
this code to know what the right thing to do is.

Thanks,
Mark.
Mark,

thanks for your answer. See below. I also attached the full crash dump.

On 18.10.16 11:18:36, Mark Rutland wrote:
> Hi Robert, Ard,
>
> Sorry for the delay in getting to this; I've been travelling a lot
> lately and in the meantime this managed to get buried in my inbox.
>
> On Thu, Oct 06, 2016 at 06:11:14PM +0200, Robert Richter wrote:
> > On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> > > On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > > > There is a memory setup problem on ThunderX systems with certain
> > > > memory configurations. The symptom is
> > > >
> > > >   kernel BUG at mm/page_alloc.c:1848!
> > > >
> > > > This happens for some configs with 64k page size enabled. The bug
> > > > triggers for page zones with some pages in the zone not assigned to
> > > > this particular zone. In my case some pages that are marked as nomap
> > > > were not reassigned to the new zone of node 1, so those are still
> > > > assigned to node 0.
> > > >
> > > > The reason for the mis-configuration is a change in pfn_valid() which
> > > > reports pages marked nomap as invalid:
> > > >
> > > >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> > >
> > > These pages are owned by the firmware, which may map it with
> > > attributes that conflict with the attributes we use for the linear
> > > mapping. This means they should not be covered by the linear mapping.
> > >
> > > > This causes pages marked as nomap no longer being reassigned to the new
> > > > zone in memmap_init_zone() by calling __init_single_pfn().
>
> Why do we have pages for a nomap region? Given the region shouldn't be
> in the linear mapping, and isn't suitable for general allocation, I
> don't believe it makes sense to have a struct page for any part of it.
>
> Am I missing some reason that we require a struct page?
>
> e.g. is it just easier to allocate an unused struct page than to carve
> it out?

Pages are handled in blocks with size MAX_ORDER_NR_PAGES. The start and
end pfn of a memory region are then aligned to fit the mem block size
(see memmap_init_zone()). Therefore a memblock may contain pages without
underlying physical memory. mm code requires the whole memmap to be
initialized; this means that for each page in the whole mem block there
is a valid struct page. See e.g. move_freepages_block() and
move_freepages(): struct page is accessed even before pfn_valid() is
used. I assume there are other occurrences of that too.

My interpretation is that pfn_valid() checks for the existence of a
valid struct page; there need not necessarily be physical memory mapped
to it. This is the reason why I changed pfn_valid() to use
memblock_is_memory(), which is sufficient for generic mm code. Only in
arm64 mm code did I additionally add the memblock_is_map_memory() check
where pfn_valid() was used.

> > > This sounds like the root cause of your issue. Could we not fix that instead?
> >
> > Yes, this is proposal b) from my last mail that would work too: I
> > implemented an arm64 private early_pfn_valid() function that uses
> > memblock_is_memory() to set up all pages of a zone. Though, I think
> > this is the wrong way and thus I prefer this patch instead. I see
> > several reasons for this:
> >
> > Inconsistent use of struct *page, it is initialized but never used
> > again.
>
> As above, I don't believe we should have a struct page to initialise in
> the first place.
>
> > Other archs only do a basic range check in pfn_valid(), the default
> > implementation just returns whether the whole section is valid. As I
> > understand the code, if the mem range is not aligned to the section,
> > then there will be pfns in the section that don't have physical mem
> > attached. The page is then just initialized, it's not marked reserved
> > nor is the refcount non-zero. It is then simply not used. This is how
> > no-map pages should be handled too.
> >
> > I think pfn_valid() is just a quick check whether the pfn's struct *page
> > can be used. There is a good description for this in
> > include/linux/mmzone.h. So there can be memory holes that have a valid
> > pfn.
>
> I take it you mean the comment in the CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
> ifdef (line 1266 in v4.9-rc1)?

Yes.

> I'm not sufficiently acquainted with the memmap code to follow; I'll
> need to dig into that a bit further.
>
> > If the no-map memory needs special handling, then additional checks
> > need to be added to the particular code (as in ioremap.c). It's imo
> > wrong to (mis-)use pfn_valid() for that.
> >
> > Variant b) involves generic mm code to fix it for arm64, this patch is
> > an arm64 change only. This makes it harder to get a fix for it.
> > (Though maybe only a problem of patch logistics.)
> >
> > > > Fixing this by restoring the old behavior of pfn_valid() to use
> > > > memblock_is_memory().
> > >
> > > This is incorrect imo. In general, pfn_valid() means ordinary memory
> > > covered by the linear mapping and the struct page array. Returning
> > > reserved ranges that the kernel should not even touch only to please
> > > the NUMA code seems like an inappropriate way to deal with this issue.
> >
> > As said above, it is not marked as reserved, it is treated like
> > non-existing memory.
>
> I think Ard was using "reserved" in the more general sense than the
> Linux-specific meaning. NOMAP is distinct from the Linux concept of
> "reserved" memory, but is "reserved" in some sense.
>
> Memory with NOMAP is meant to be treated as non-existent for the purpose
> of the linear mapping (and thus for the purpose of struct page).

Yes, it's not marked reserved and can never be freed.

Thanks,

-Robert

> > This has been observed for non-numa kernels too and can happen for
> > each zone that is only partly initialized.
> >
> > I think the patch addresses your concerns. I can't see where the
> > kernel uses memory marked as nomap in a wrong way.
>
> I'll have to dig into this locally; I'm still not familiar enough with
> this code to know what the right thing to do is.
>
> Thanks,
> Mark.

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
Booting Linux on physical CPU 0x0 Linux version 4.8.0-rc4-00267-g664e88c (root@crb2spass2rric.localdomain) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.1) ) #3 SMP Wed Sep 7 06:38:13 UTC 2016 Boot CPU: AArch64 Processor [431f0a10] efi: Getting EFI parameters from FDT: efi: EFI v2.40 by Cavium Thunder cn88xx EFI ThunderX-Firmware-Release-1.22.10-0-g4e85766 Aug 24 2016 15:59:03 efi: ACPI=0xfffff000 ACPI 2.0=0xfffff014 SMBIOS 3.0=0x10ffafcf000 cma: Reserved 512 MiB at 0x00000000c0000000 NUMA: Adding memblock [0x1400000 - 0xfffffffff] on node 0 NUMA: Adding memblock [0x10000400000 - 0x10fffffffff] on node 1 NUMA: parsing numa-distance-map-v1 NUMA: Initmem setup node 0 [mem 0x01400000-0xfffffffff] NUMA: NODE_DATA [mem 0xfffff2580-0xfffffffff] NUMA: Initmem setup node 1 [mem 0x10000400000-0x10fffffffff] NUMA: NODE_DATA [mem 0x10ffffa2500-0x10ffffaff7f] kmemleak: Kernel memory leak detector disabled Zone ranges: DMA [mem 0x0000000001400000-0x00000000ffffffff] Normal [mem 0x0000000100000000-0x0000010fffffffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000001400000-0x00000000fffdffff] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff] node 0: [mem 0x0000000100000000-0x0000000fffffffff] node 1: [mem 0x0000010000400000-0x0000010ff9e8ffff] node 1: [mem 0x0000010ff9e90000-0x0000010ff9f2ffff] node 1: [mem 0x0000010ff9f30000-0x0000010ffaeaffff] node 1: [mem 0x0000010ffaeb0000-0x0000010ffaffffff] node 1: [mem 0x0000010ffb000000-0x0000010ffffaffff] node 1: [mem 0x0000010ffffb0000-0x0000010fffffffff] Initmem setup node 0 [mem 0x0000000001400000-0x0000000fffffffff] Initmem setup node 1 [mem 0x0000010000400000-0x0000010fffffffff] psci: probing for conduit method from DT. psci: PSCIv0.2 detected in firmware. psci: Using standard PSCI v0.2 function IDs psci: Trusted OS resident on physical CPU 0x0 Number of cores (96) exceeds configured maximum of 8 - clipping percpu: Embedded 3 pages/cpu @ffffff0fffce0000 s117704 r8192 d70712 u196608 Detected VIPT I-cache on CPU0 CPU features: enabling workaround for Cavium erratum 27456 Built 2 zonelists in Node order, mobility grouping on. Total pages: 2094720 Policy zone: Normal Kernel command line: BOOT_IMAGE=/vmlinuz root=UUID=9a418ae5-231a-40c7-b7ba-ca70cfadfc5e ro LANG=en_US.UTF-8 PID hash table entries: 4096 (order: -1, 32768 bytes) software IO TLB [mem 0xfbfd0000-0xfffd0000] (64MB) mapped at [fffffe00fbfd0000-fffffe00fffcffff] Memory: 133196160K/134193152K available (9084K kernel code, 1545K rwdata, 3712K rodata, 1536K init, 15826K bss, 472704K reserved, 524288K cma-reserved) Virtual kernel memory layout: modules : 0xfffffc0000000000 - 0xfffffc0008000000 ( 128 MB) vmalloc : 0xfffffc0008000000 - 0xfffffdff5fff0000 ( 2045 GB) .text : 0xfffffc0008080000 - 0xfffffc0008960000 ( 9088 KB) .rodata : 0xfffffc0008960000 - 0xfffffc0008d10000 ( 3776 KB) .init : 0xfffffc0008d10000 - 0xfffffc0008e90000 ( 1536 KB) .data : 0xfffffc0008e90000 - 0xfffffc0009012600 ( 1546 KB) .bss : 0xfffffc0009012600 - 0xfffffc0009f86fc8 ( 15827 KB) fixed : 0xfffffdff7e7d0000 - 0xfffffdff7ec00000 ( 4288 KB) PCI I/O : 0xfffffdff7ee00000 - 0xfffffdff7fe00000 ( 16 MB) vmemmap : 0xfffffdff80000000 - 0xfffffe0000000000 ( 2 GB maximum) 0xfffffdff80005000 - 0xfffffdffc4000000 ( 1087 MB actual) memory : 0xfffffe0001400000 - 0xffffff1000000000 (1114092 MB) SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=8, Nodes=2 Running RCU self tests Hierarchical RCU implementation. RCU lockdep checking is enabled. Build-time adjustment of leaf fanout to 64. 
NR_IRQS:64 nr_irqs:64 0 GICv3: GIC: Using split EOI/Deactivate mode ITS: /interrupt-controller@801000000000/gic-its@801000020000 ITS@0x0000801000020000: allocated 2097152 Devices @ff5000000 (flat, esz 8, psz 64K, shr 1) ITS: /interrupt-controller@801000000000/gic-its@901000020000 ITS@0x0000901000020000: allocated 2097152 Devices @ffa000000 (flat, esz 8, psz 64K, shr 1) GIC: using LPI property table @0x0000000ff4140000 ITS: Allocated 32512 chunks for LPIs GICv3: CPU0: found redistributor 0 region 0:0x0000801080000000 CPU0: using LPI pending table @0x0000000ff4150000 arm_arch_timer: Architected cp15 timer(s) running at 100.00MHz (phys). clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x171024e7e0, max_idle_ns: 440795205315 ns sched_clock: 56 bits at 100MHz, resolution 10ns, wraps every 4398046511100ns Console: colour dummy device 80x25 console [tty0] enabled Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar ... MAX_LOCKDEP_SUBCLASSES: 8 ... MAX_LOCK_DEPTH: 48 ... MAX_LOCKDEP_KEYS: 8191 ... CLASSHASH_SIZE: 4096 ... MAX_LOCKDEP_ENTRIES: 32768 ... MAX_LOCKDEP_CHAINS: 65536 ... CHAINHASH_SIZE: 32768 memory used by lock dependency info: 8159 kB per task-struct memory footprint: 1920 bytes mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl Calibrating delay loop (skipped), value calculated using timer frequency.. 200.00 BogoMIPS (lpj=100000) pid_max: default: 32768 minimum: 301 Security Framework initialized Yama: becoming mindful. SELinux: Initializing. Dentry cache hash table entries: 16777216 (order: 11, 134217728 bytes) Inode-cache hash table entries: 8388608 (order: 10, 67108864 bytes) Mount-cache hash table entries: 262144 (order: 5, 2097152 bytes) Mountpoint-cache hash table entries: 262144 (order: 5, 2097152 bytes) ftrace: allocating 29244 entries in 8 pages Unable to find CPU node for /cpus/cpu@8 /cpus/cpu-map/cluster0/core8: Can't get CPU for leaf core ASID allocator initialised with 65536 entries PCI/MSI: /interrupt-controller@801000000000/gic-its@801000020000 domain created PCI/MSI: /interrupt-controller@801000000000/gic-its@901000020000 domain created Platform MSI: /interrupt-controller@801000000000/gic-its@801000020000 domain created Platform MSI: /interrupt-controller@801000000000/gic-its@901000020000 domain created Remapping and enabling EFI services. UEFI Runtime regions are not aligned to 64 KB -- buggy firmware? 
------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at arch/arm64/kernel/efi.c:34 efi_create_mapping+0x5c/0x11c Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-rc4-00267-g664e88c #3 Hardware name: Cavium ThunderX CN88XX board (DT) task: fffffe0fee55a200 task.stack: ffffff00004a4000 PC is at efi_create_mapping+0x5c/0x11c LR is at efi_create_mapping+0x5c/0x11c pc : [<fffffc0008d14f68>] lr : [<fffffc0008d14f68>] pstate: 60000045 sp : ffffff00004a7d00 x29: ffffff00004a7d00 x28: 0000000000000000 x27: 0000000000000000 x26: 0000000000000000 x25: 0000000000000000 x24: fffffc0008f9c730 x23: fffffc0008f9c730 x22: fffffc0008c60ac0 x21: fffffc0008f9c730 x20: 00c8000000000713 x19: ffffff0ff76b0c08 x18: 0000000000000010 x17: 00000000c9cbbfc7 x16: 0000000000000000 x15: fffffc0089c956f7 x14: 3f657261776d7269 x13: 6620796767756220 x12: 2d2d20424b203436 x11: 206f742064656e67 x10: 0000000000000077 x9 : 65726120736e6f69 x8 : 0000000000000001 x7 : ffffff00004a4000 x6 : fffffc000814a288 x5 : 0000000000000000 x4 : 0000000000000001 x3 : 0000000000000001 x2 : ffffff00004a4000 x1 : fffffe0fee55a200 x0 : 0000000000000040 ---[ end trace fcdd7f8cb96cdb3b ]--- Call trace: Exception stack(0xffffff00004a7b20 to 0xffffff00004a7c50) 7b20: ffffff0ff76b0c08 0000040000000000 ffffff00004a7d00 fffffc0008d14f68 7b40: 0000000060000045 000000000000003d 0000000000000001 fffffc000814a9fc 7b60: ffffff00004a7c00 fffffc000814adcc ffffff00004a7c60 fffffc0008baf420 7b80: fffffc0008f9c730 fffffc0008c60ac0 fffffc0008f9c730 fffffc0008f9c730 7ba0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7bc0: fffffc00080d6d70 0000000000000000 0000000000000040 fffffe0fee55a200 7be0: ffffff00004a4000 0000000000000001 0000000000000001 0000000000000000 7c00: fffffc000814a288 ffffff00004a4000 0000000000000001 65726120736e6f69 7c20: 0000000000000077 206f742064656e67 2d2d20424b203436 6620796767756220 7c40: 3f657261776d7269 fffffc0089c956f7 [<fffffc0008d14f68>] efi_create_mapping+0x5c/0x11c [<fffffc0008d5e1ec>] arm_enable_runtime_services+0x12c/0x210 [<fffffc0008082ff4>] do_one_initcall+0x44/0x138 [<fffffc0008d10d30>] kernel_init_freeable+0x17c/0x2e0 [<fffffc000893c888>] kernel_init+0x20/0xf8 [<fffffc0008082b80>] ret_from_fork+0x10/0x50 EFI remap 0x0000010ff9e98000 => 0000000020008000 EFI remap 0x0000010ffaeb6000 => 00000000200a6000 EFI remap 0x0000010ffafc9000 => 00000000201b9000 EFI remap 0x0000010ffafcd000 => 00000000201bd000 EFI remap 0x0000010ffffb9000 => 00000000201f9000 EFI remap 0x0000010ffffcd000 => 000000002020d000 EFI remap 0x0000804000001000 => 0000000020241000 EFI remap 0x000087e0d0001000 => 0000000020251000 Detected VIPT I-cache on CPU1 GICv3: CPU1: found redistributor 1 region 0:0x0000801080020000 CPU1: using LPI pending table @0x0000010014060000 CPU1: Booted secondary processor [431f0a10] ------------[ cut here ]------------ WARNING: CPU: 1 PID: 0 at ./include/linux/cpumask.h:121 gic_raise_softirq+0x14c/0x1e0 Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.8.0-rc4-00267-g664e88c #3 Hardware name: Cavium ThunderX CN88XX board (DT) task: fffffe0fee57b600 task.stack: fffffe0fe4038000 PC is at gic_raise_softirq+0x14c/0x1e0 LR is at gic_raise_softirq+0xc0/0x1e0 pc : [<fffffc00084c07a4>] lr : [<fffffc00084c0718>] pstate: 600001c5 sp : fffffe0fe403bcd0 x29: fffffe0fe403bcd0 x28: 0000000000000001 x27: fffffc0008fd7b1d x26: 0000000000000000 x25: fffffc0008ec2000 x24: fffffc0008ecebc8 x23: fffffc0008998688 x22: 0000000000000001 x21: fffffc0008ec2e34 x20: 0000000000000000 x19: 
0000000000000008 x18: 0000000000000030 x17: 0000000000000000 x16: 0000000000000000 x15: fffffc0009c97f08 x14: 0000000000000002 x13: 0000000000000332 x12: 0000000000000335 x11: fffffc0009bf3de0 x10: 0000000000000332 x9 : fffffe0fe4038000 x8 : 0000000000000000 x7 : 00000000ffffffff x6 : 0000000000000000 x5 : fffffffffffffffe x4 : 0000000000000040 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000008 x0 : 0000000000000001 ---[ end trace fcdd7f8cb96cdb3c ]--- Call trace: Exception stack(0xfffffe0fe403baf0 to 0xfffffe0fe403bc20) bae0: 0000000000000008 0000040000000000 bb00: fffffe0fe403bcd0 fffffc00084c07a4 00000000600001c5 000000000000003d bb20: fffffe0fee57bec0 ffffffffffffffd8 0000000000000028 fffffe0fee57be70 bb40: fffffc000925fe88 0000000000000002 fffffc00080886c0 fffffe0fee57b600 bb60: fffffe0fe403bb90 fffffc0008088974 fffffc00099b4d60 fffffe0fee57b600 bb80: fffffc000925fe88 fffffe0fe403bd4c fffffe0fe403bbe0 fffffc00080889e8 bba0: 0000000000000001 0000000000000008 0000000000000000 0000000000000000 bbc0: 0000000000000040 fffffffffffffffe 0000000000000000 00000000ffffffff bbe0: 0000000000000000 fffffe0fe4038000 0000000000000332 fffffc0009bf3de0 bc00: 0000000000000335 0000000000000332 0000000000000002 fffffc0009c97f08 [<fffffc00084c07a4>] gic_raise_softirq+0x14c/0x1e0 [<fffffc000808e028>] smp_cross_call+0x80/0x238 [<fffffc000808efa4>] smp_send_reschedule+0x3c/0x48 [<fffffc00081021a8>] resched_curr+0x58/0x90 [<fffffc0008103fd8>] check_preempt_curr+0x70/0xe8 [<fffffc0008104090>] ttwu_do_wakeup+0x40/0x2e8 [<fffffc00081043cc>] ttwu_do_activate+0x94/0xc0 [<fffffc00081058a8>] try_to_wake_up+0x200/0x580 [<fffffc0008105d3c>] default_wake_function+0x34/0x48 [<fffffc0008126afc>] __wake_up_common+0x64/0xa8 [<fffffc0008126bec>] __wake_up_locked+0x3c/0x50 [<fffffc0008127b58>] complete+0x48/0x68 [<fffffc000808e630>] secondary_start_kernel+0x180/0x208 [<0000000001482e08>] 0x1482e08 Detected VIPT I-cache on CPU2 GICv3: CPU2: found redistributor 2 region 0:0x0000801080040000 CPU2: using LPI pending table @0x0000010014070000 CPU2: Booted secondary processor [431f0a10] Detected VIPT I-cache on CPU3 GICv3: CPU3: found redistributor 3 region 0:0x0000801080060000 CPU3: using LPI pending table @0x0000010014080000 CPU3: Booted secondary processor [431f0a10] Detected VIPT I-cache on CPU4 GICv3: CPU4: found redistributor 4 region 0:0x0000801080080000 CPU4: using LPI pending table @0x0000010014090000 CPU4: Booted secondary processor [431f0a10] Detected VIPT I-cache on CPU5 GICv3: CPU5: found redistributor 5 region 0:0x00008010800a0000 CPU5: using LPI pending table @0x00000100140a0000 CPU5: Booted secondary processor [431f0a10] Detected VIPT I-cache on CPU6 GICv3: CPU6: found redistributor 6 region 0:0x00008010800c0000 CPU6: using LPI pending table @0x00000100140b0000 CPU6: Booted secondary processor [431f0a10] Detected VIPT I-cache on CPU7 GICv3: CPU7: found redistributor 7 region 0:0x00008010800e0000 CPU7: using LPI pending table @0x00000100140c0000 CPU7: Booted secondary processor [431f0a10] Brought up 8 CPUs SMP: Total of 8 processors activated. CPU features: detected feature: GIC system register CPU interface CPU features: detected feature: LSE atomic instructions CPU features: detected feature: Software prefetching using PRFM CPU: All CPU(s) started at EL2 alternatives: patching kernel code devtmpfs: initialized SMBIOS 3.0.0 present. 
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns atomic64_test: passed pinctrl core: initialized pinctrl subsystem NET: Registered protocol family 16 cpuidle: using governor menu vdso: 2 pages (1 code @ fffffc0008980000, 1 data @ fffffc0008eb0000) hw-breakpoint: found 6 breakpoint and 4 watchpoint registers. DMA: preallocated 256 KiB pool for atomic allocations Serial: AMBA PL011 UART driver 87e024000000.serial: ttyAMA0 at MMIO 0x87e024000000 (irq = 7, base_baud = 0) is a PL011 rev3 console [ttyAMA0] enabled 87e025000000.serial: ttyAMA1 at MMIO 0x87e025000000 (irq = 8, base_baud = 0) is a PL011 rev3 OF: /soc@0/pci@848000000000: arguments longer than property OF: /soc@0/pci@849000000000: arguments longer than property OF: /soc@0/pci@84a000000000: arguments longer than property OF: /soc@0/pci@84b000000000: arguments longer than property OF: /soc@0/pci@87e0c0000000: arguments longer than property OF: /soc@100000000000/pci@848000000000: arguments longer than property OF: /soc@100000000000/pci@849000000000: arguments longer than property OF: /soc@100000000000/pci@84a000000000: arguments longer than property OF: /soc@100000000000/pci@84b000000000: arguments longer than property HugeTLB registered 2 MB page size, pre-allocated 0 pages HugeTLB registered 512 MB page size, pre-allocated 0 pages ACPI: Interpreter disabled. arm-smmu 830000000000.smmu0: probing hardware configuration... arm-smmu 830000000000.smmu0: SMMUv2 with: arm-smmu 830000000000.smmu0: stage 1 translation arm-smmu 830000000000.smmu0: stage 2 translation arm-smmu 830000000000.smmu0: nested translation arm-smmu 830000000000.smmu0: non-coherent table walk arm-smmu 830000000000.smmu0: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 830000000000.smmu0: stream matching with 128 register groups, mask 0x7fff arm-smmu 830000000000.smmu0: 128 context banks (0 stage-2 only) arm-smmu 830000000000.smmu0: Supported page sizes: 0x62215000 arm-smmu 830000000000.smmu0: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 830000000000.smmu0: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 830000000000.smmu0: registered 1 master devices arm-smmu 831000000000.smmu1: probing hardware configuration... arm-smmu 831000000000.smmu1: SMMUv2 with: arm-smmu 831000000000.smmu1: stage 1 translation arm-smmu 831000000000.smmu1: stage 2 translation arm-smmu 831000000000.smmu1: nested translation arm-smmu 831000000000.smmu1: non-coherent table walk arm-smmu 831000000000.smmu1: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 831000000000.smmu1: stream matching with 128 register groups, mask 0x7fff arm-smmu 831000000000.smmu1: 128 context banks (0 stage-2 only) arm-smmu 831000000000.smmu1: Supported page sizes: 0x62215000 arm-smmu 831000000000.smmu1: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 831000000000.smmu1: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 831000000000.smmu1: registered 2 master devices arm-smmu 832000000000.smmu2: probing hardware configuration... 
arm-smmu 832000000000.smmu2: SMMUv2 with: arm-smmu 832000000000.smmu2: stage 1 translation arm-smmu 832000000000.smmu2: stage 2 translation arm-smmu 832000000000.smmu2: nested translation arm-smmu 832000000000.smmu2: non-coherent table walk arm-smmu 832000000000.smmu2: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 832000000000.smmu2: stream matching with 128 register groups, mask 0x7fff arm-smmu 832000000000.smmu2: 128 context banks (0 stage-2 only) arm-smmu 832000000000.smmu2: Supported page sizes: 0x62215000 arm-smmu 832000000000.smmu2: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 832000000000.smmu2: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 832000000000.smmu2: registered 1 master devices arm-smmu 833000000000.smmu3: probing hardware configuration... arm-smmu 833000000000.smmu3: SMMUv2 with: arm-smmu 833000000000.smmu3: stage 1 translation arm-smmu 833000000000.smmu3: stage 2 translation arm-smmu 833000000000.smmu3: nested translation arm-smmu 833000000000.smmu3: non-coherent table walk arm-smmu 833000000000.smmu3: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 833000000000.smmu3: stream matching with 128 register groups, mask 0x7fff arm-smmu 833000000000.smmu3: 128 context banks (0 stage-2 only) arm-smmu 833000000000.smmu3: Supported page sizes: 0x62215000 arm-smmu 833000000000.smmu3: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 833000000000.smmu3: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 833000000000.smmu3: registered 1 master devices arm-smmu 930000000000.smmu4: probing hardware configuration... arm-smmu 930000000000.smmu4: SMMUv2 with: arm-smmu 930000000000.smmu4: stage 1 translation arm-smmu 930000000000.smmu4: stage 2 translation arm-smmu 930000000000.smmu4: nested translation arm-smmu 930000000000.smmu4: non-coherent table walk arm-smmu 930000000000.smmu4: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 930000000000.smmu4: stream matching with 128 register groups, mask 0x7fff arm-smmu 930000000000.smmu4: 128 context banks (0 stage-2 only) arm-smmu 930000000000.smmu4: Supported page sizes: 0x62215000 arm-smmu 930000000000.smmu4: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 930000000000.smmu4: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 930000000000.smmu4: registered 1 master devices arm-smmu 931000000000.smmu5: probing hardware configuration... arm-smmu 931000000000.smmu5: SMMUv2 with: arm-smmu 931000000000.smmu5: stage 1 translation arm-smmu 931000000000.smmu5: stage 2 translation arm-smmu 931000000000.smmu5: nested translation arm-smmu 931000000000.smmu5: non-coherent table walk arm-smmu 931000000000.smmu5: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 931000000000.smmu5: stream matching with 128 register groups, mask 0x7fff arm-smmu 931000000000.smmu5: 128 context banks (0 stage-2 only) arm-smmu 931000000000.smmu5: Supported page sizes: 0x62215000 arm-smmu 931000000000.smmu5: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 931000000000.smmu5: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 931000000000.smmu5: registered 1 master devices arm-smmu 932000000000.smmu6: probing hardware configuration... 
arm-smmu 932000000000.smmu6: SMMUv2 with: arm-smmu 932000000000.smmu6: stage 1 translation arm-smmu 932000000000.smmu6: stage 2 translation arm-smmu 932000000000.smmu6: nested translation arm-smmu 932000000000.smmu6: non-coherent table walk arm-smmu 932000000000.smmu6: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 932000000000.smmu6: stream matching with 128 register groups, mask 0x7fff arm-smmu 932000000000.smmu6: 128 context banks (0 stage-2 only) arm-smmu 932000000000.smmu6: Supported page sizes: 0x62215000 arm-smmu 932000000000.smmu6: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 932000000000.smmu6: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 932000000000.smmu6: registered 1 master devices arm-smmu 933000000000.smmu7: probing hardware configuration... arm-smmu 933000000000.smmu7: SMMUv2 with: arm-smmu 933000000000.smmu7: stage 1 translation arm-smmu 933000000000.smmu7: stage 2 translation arm-smmu 933000000000.smmu7: nested translation arm-smmu 933000000000.smmu7: non-coherent table walk arm-smmu 933000000000.smmu7: (IDR0.CTTW overridden by dma-coherent property) arm-smmu 933000000000.smmu7: stream matching with 128 register groups, mask 0x7fff arm-smmu 933000000000.smmu7: 128 context banks (0 stage-2 only) arm-smmu 933000000000.smmu7: Supported page sizes: 0x62215000 arm-smmu 933000000000.smmu7: Stage-1: 48-bit VA -> 48-bit IPA arm-smmu 933000000000.smmu7: Stage-2: 48-bit IPA -> 48-bit PA arm-smmu 933000000000.smmu7: registered 1 master devices iommu: Adding device 848000000000.pci to group 0 iommu: Adding device 849000000000.pci to group 1 iommu: Adding device 84a000000000.pci to group 2 iommu: Adding device 84b000000000.pci to group 3 iommu: Adding device 88001f000000.pci to group 4 iommu: Adding device 948000000000.pci to group 5 iommu: Adding device 949000000000.pci to group 6 iommu: Adding device 94a000000000.pci to group 7 iommu: Adding device 94b000000000.pci to group 8 vgaarb: loaded SCSI subsystem initialized usbcore: registered new interface driver usbfs usbcore: registered new interface driver hub usbcore: registered new device driver usb NetLabel: Initializing NetLabel: domain hash size = 128 NetLabel: protocols = UNLABELED CIPSOv4 NetLabel: unlabeled traffic allowed by default DMA-API: preallocated 4096 debug entries DMA-API: debugging enabled by kernel config clocksource: Switched to clocksource arch_sys_counter VFS: Disk quotas dquot_6.6.0 VFS: Dquot-cache hash table entries: 8192 (order 0, 65536 bytes) pnp: PnP ACPI: disabled NET: Registered protocol family 2 TCP established hash table entries: 524288 (order: 6, 4194304 bytes) TCP bind hash table entries: 65536 (order: 6, 4194304 bytes) TCP: Hash tables configured (established 524288 bind 65536) UDP hash table entries: 65536 (order: 7, 10485760 bytes) UDP-Lite hash table entries: 65536 (order: 7, 10485760 bytes) NET: Registered protocol family 1 Unpacking initramfs... ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:1844! 
Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 4 PID: 1 Comm: swapper/0 Tainted: G W 4.8.0-rc4-00267-g664e88c #3 Hardware name: www.cavium.com ThunderX CRB-2S/ThunderX CRB-2S, BIOS 0.3 Aug 24 2016 task: fffffe0fee55a200 task.stack: ffffff00004a4000 PC is at move_freepages+0x158/0x168 LR is at move_freepages_block+0xac/0xc0 pc : [<fffffc0008227618>] lr : [<fffffc00082276d4>] pstate: 800000c5 sp : ffffff00004a7480 x29: ffffff00004a7480 x28: fffffdffc3fc0020 x27: fffffdffc3fc0000 x26: 000000000000000a x25: 000000000000000a x24: ffffff0ffffa3190 x23: ffffff0fffdbf530 x22: 0000000000000000 x21: fffffdffc3ffffc0 x20: ffffff0ffffa2c80 x19: fffffdffc3f80000 x18: fffffe0fdc0c2548 x17: 0000000000000000 x16: 0000000100000000 x15: 00000000000000c1 x14: 04000000001b2000 x13: 00aedf01c7040000 x12: 00001b00000033f5 x11: 01c604000000001b x10: 0000ae7401c50430 x9 : 0000000000000000 x8 : fffffc0008f53000 x7 : fffffc00082295f0 x6 : 0000000000000001 x5 : 0000000000000000 x4 : fffffe0fffff2580 x3 : ffffff0ffffa2500 x2 : 0000000000000001 x1 : fffffe0fffff2580 x0 : ffffff0ffffa2c80 Process swapper/0 (pid: 1, stack limit = 0xffffff00004a4020) Stack: (0xffffff00004a7480 to 0xffffff00004a8000) 7480: ffffff00004a74e0 fffffc00082276d4 0000000000000000 ffffff0ffffa2c80 74a0: fffffdffc3f80000 0000000000000000 ffffff0fffdbf530 ffffff0ffffa3190 74c0: 000000000000000a 000000000000000a fffffdffc3fc0000 fffffdffc3fc0020 74e0: ffffff00004a7510 fffffc0008227f00 ffffff0ffffa2c80 0000000000000000 7500: fffffdffc3fc0000 0000000000000028 ffffff00004a75b0 fffffc0008229654 7520: ffffff0fffdbf530 ffffff0ffffa2c80 fffffc0008e5f520 0000000000000000 7540: ffffff0fffdbf530 ffffff0fffdbf520 0000030ff6f60000 0000000000000001 7560: 0000000000000000 0000000000000000 ffffff0ffffa3200 fffffc00082295f0 7580: fffffc0008e5f520 fffffc0008ec1000 0000000000000000 0000000000000001 75a0: 0000000000000010 01ffff0ffffa2c80 ffffff00004a76a0 fffffc000822a5ec 75c0: 00000000024200c2 0000000000000000 0000000000000000 00000000024200c2 75e0: fffffc0008ec4000 0000000000000001 0000000000000000 fffffc0008ec2000 7600: 00000000024200c2 fffffc00082886c8 ffffffffffffffff 0000000000000040 7620: 00000012004a7640 fffffc0008e587e8 0000000100000000 0000000000000000 7640: ffffff00024200c2 fffffc00081007bc ffffff0ffffa2c80 0000000000000010 7660: fffffc0008e5f520 0000000000000040 0000000000000000 ffffff0ffffa3200 7680: 0000000000000001 ffffff0ffffa3b80 ffffff0ffffa3b80 ffffff00004a77b8 76a0: ffffff00004a77e0 fffffc00082886c8 0000000000000000 fffffc0008ec0c48 76c0: ffffff0ffffa3b80 0000000000000000 0000000000000000 0000000000000001 76e0: ffffff00004a4000 0000000000000000 fffffe0fee55a200 00000000000053d4 7700: fffffe0fee55a200 0000000000000000 0000000000000000 fffffc0008eec160 7720: 0000000000000000 0000000000000002 fffffe0fee55aa98 fffffc0008ec2000 7740: fffffe0fee55a200 fffffc0009c95000 0000000000000000 fffffc0008eec160 7760: 0000000000000000 fffffc0008e587e8 0000000000000000 0000000100400000 7780: 0000000000000000 0000000000000000 0000000000000000 0000000000000002 77a0: 0000000000000000 0000000000000000 fffffc000821d070 ffffff0ffffa3b80 77c0: 0000000000000000 ffffff0ffffa3b80 0000000100000000 0000000000000000 77e0: ffffff00004a7810 fffffc0008288d20 fffffc0008e587e8 fffffe0fec01a8f0 7800: 00000000024200c2 fffffc000821e62c ffffff00004a7870 fffffc000821e62c 7820: 0000000000000000 ffffff00004a4000 00000000024200c2 fffffe0fe0249410 7840: fffffc000821e788 00000000024200c2 fffffc0008f01d28 0000000002493ee0 7860: 0000000000002c2c 0000000000000040 ffffff00004a78d0 
fffffc000821e788 7880: 0000000000000000 000000000000000e 00000000024200c2 fffffe0fe0249410 78a0: 0000000000000001 fffffc0008fd76ac fffffc0008f01d28 0000000002493ee0 78c0: 0000000000002c2c fffffc000821e850 ffffff00004a7930 fffffc000821e9b0 78e0: fffffe0fe0249410 0000000000000001 0000000000000001 0000000000002c2c 7900: ffffff00004a7a20 fffffe0fcc140400 00000000000053d4 fffffc00089b8900 7920: 0000000000002c2c fffffc000821e440 ffffff00004a7960 fffffc00082eb758 7940: fffffe0fe0249410 0000000000000001 0000000000010000 ffffff00004a7b18 7960: ffffff00004a79a0 fffffc000821e3bc 0000000000010000 ffffff00004a7b18 7980: fffffe0fe0249410 0000000000010000 0000000000000000 fffffc000821e434 79a0: ffffff00004a7a30 fffffc00082200e0 0000000000000000 fffffe0fcc140400 79c0: fffffe0fe0249210 ffffff00004a7af0 fffffe0fe0249410 ffffff00004a7b18 79e0: fffffc0008d40e70 0000000000008000 fffffc0009040078 ffffff0f83f26000 7a00: 0000000000000000 fffffc0008bd0300 00002c2c00000001 ffffff00004a4000 7a20: fffffdff83f20840 0000000027e02140 ffffff00004a7a80 fffffc00082201fc 7a40: 0000000000008000 ffffff00004a7af0 fffffe0fe02492e8 ffffff00004a7b18 7a60: ffffff00004a7bb8 0000000000000007 fffffc0008d40e70 0000000000008000 7a80: ffffff00004a7ab0 fffffc00082ba89c fffffe0fcc140400 0000000000008000 7aa0: ffffff00004a7bb8 fffffe0fdd350000 ffffff00004a7b40 fffffc00082bb588 7ac0: 0000000000000000 0000000000008000 fffffe0fcc140400 fffffe0fdd350000 7ae0: fffffe0fdd350000 0000000000008000 fffffe0fcc140400 000000000000ac2c 7b00: 0000000000000000 0000000000000000 0000000000000000 fffffc0000000003 7b20: 00000000000053d4 0000000000002c2c ffffff00004a7ae0 0000000000000001 7b40: ffffff00004a7b80 fffffc00082bc5b4 fffffe0fcc140400 fffffe0fcc140400 7b60: fffffe0fdd350000 0000000000008000 fffffc0008babe88 ffffff0f8ea0e439 7b80: ffffff00004a7bc0 fffffc0008d120f0 0000000000000000 0000000000008000 7ba0: fffffe0fdd350000 0000000000000000 fffffe0fdc3fa280 000000000000ac2c 7bc0: ffffff00004a7bf0 fffffc0008d121d4 fffffc0008d75000 fffffc0008d75b60 7be0: fffffc0008d75bb8 fffffe0fdd350000 ffffff00004a7c10 fffffc0008d11eac 7c00: fffffc0008d75b60 0000000000008000 ffffff00004a7c40 fffffc0008d11f10 7c20: 0000000000008000 0000000000008000 fffffc0008d75b60 fffffc00089e42c0 7c40: ffffff00004a7c80 fffffc0008d410f4 fffffe0fdc3fa280 ffffff0f83f26000 7c60: 0000000000000000 fffffe0fdd350000 fffffc0008d11d74 fffffc0008d11ec8 7c80: ffffff00004a7cf0 fffffc0008d411b4 fffffc0008d75000 ffffff0f83f26000 7ca0: 000000000e7f20bc 0000000000000000 fffffc0008d4119c fffffc0008ff1f38 7cc0: fffffc0008babf90 fffffc0009040000 fffffc0008e3d9a0 0000000000000000 7ce0: 0000000008d75000 0000000000008000 ffffff00004a7d00 fffffc0008d127c4 7d00: ffffff00004a7d60 fffffc0008d1294c fffffc0009040000 fffffc0008d128dc 7d20: fffffc0009040000 fffffc0009040000 fffffc0009040000 fffffc0008d1045c 7d40: fffffc0008cfc468 fffffc0008d74290 0000000000001b57 fffffc0008c04958 7d60: ffffff00004a7d90 fffffc0008082ff4 ffffff00004a4000 fffffc0008d128dc 7d80: 0000000000000000 0000000000000005 ffffff00004a7e00 fffffc0008d10df4 7da0: 0000000000000101 fffffc0009040000 fffffc0008d742a8 0000000000000005 7dc0: fffffc0008e3d700 0000000000000000 fffffc0008ee2370 fffffc0008bb79f8 7de0: 0000000000000000 0000000500000005 0000000000000000 fffffc0008d1045c 7e00: ffffff00004a7ea0 fffffc000893c888 fffffc0009040000 fffffc0009040000 7e20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7e40: 0000000000000000 0000000000000000 0000000000000000 ffffff00004a4000 7e60: 0000000000000003 0000000000000000 0000000000000000 
0000000000000000 7e80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7ea0: 0000000000000000 fffffc0008082b80 fffffc000893c868 0000000000000000 7ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Call trace: Exception stack(0xffffff00004a72a0 to 0xffffff00004a73d0) 72a0: fffffdffc3f80000 0000040000000000 ffffff00004a7480 fffffc0008227618 72c0: 00000000800000c5 000000000000003d fffffc0008ec4000 0000000000000030 72e0: ffffff00004a7300 fffffc00081593b0 fffffc0008f53000 0000000000000000 7300: ffffff00004a7320 fffffc000822abac fffffc0008fd76be ffffff00004a7438 7320: ffffff00004a7460 fffffc00082886c8 ffffff00004a7380 fffffc0008135d08 7340: 0000000000000001 0000000000000007 ffffff0ffffa2c80 fffffe0fffff2580 7360: 0000000000000001 ffffff0ffffa2500 fffffe0fffff2580 0000000000000000 7380: 0000000000000001 fffffc00082295f0 fffffc0008f53000 0000000000000000 73a0: 0000ae7401c50430 01c604000000001b 00001b00000033f5 00aedf01c7040000 73c0: 04000000001b2000 00000000000000c1 [<fffffc0008227618>] move_freepages+0x158/0x168 [<fffffc00082276d4>] move_freepages_block+0xac/0xc0 [<fffffc0008227f00>] __rmqueue+0x740/0x898 [<fffffc0008229654>] get_page_from_freelist+0x3e4/0xca0 [<fffffc000822a5ec>] __alloc_pages_nodemask+0x1ac/0xfd8 [<fffffc00082886c8>] alloc_page_interleave+0x60/0xb8 [<fffffc0008288d20>] alloc_pages_current+0x168/0x1c8 [<fffffc000821e62c>] __page_cache_alloc+0x17c/0x1c0 [<fffffc000821e788>] pagecache_get_page+0x118/0x2f8 [<fffffc000821e9b0>] grab_cache_page_write_begin+0x48/0x68 [<fffffc00082eb758>] simple_write_begin+0x40/0x168 [<fffffc000821e3bc>] generic_perform_write+0xc4/0x1b8 [<fffffc00082200e0>] __generic_file_write_iter+0x178/0x1c8 [<fffffc00082201fc>] generic_file_write_iter+0xcc/0x1c8 [<fffffc00082ba89c>] __vfs_write+0xcc/0x140 [<fffffc00082bb588>] vfs_write+0xa8/0x1c0 [<fffffc00082bc5b4>] SyS_write+0x54/0xb0 [<fffffc0008d120f0>] xwrite+0x34/0x7c [<fffffc0008d121d4>] do_copy+0x9c/0xf4 [<fffffc0008d11eac>] write_buffer+0x34/0x50 [<fffffc0008d11f10>] flush_buffer+0x48/0xb8 [<fffffc0008d410f4>] __gunzip+0x27c/0x324 [<fffffc0008d411b4>] gunzip+0x18/0x20 [<fffffc0008d127c4>] unpack_to_rootfs+0x168/0x280 [<fffffc0008d1294c>] populate_rootfs+0x70/0x138 [<fffffc0008082ff4>] do_one_initcall+0x44/0x138 [<fffffc0008d10df4>] kernel_init_freeable+0x240/0x2e0 [<fffffc000893c888>] kernel_init+0x20/0xf8 [<fffffc0008082b80>] ret_from_fork+0x10/0x50 Code: 913dc021 9400d2db d4210000 d503201f (d4210000) ---[ end trace fcdd7f8cb96cdb3d ]--- note: swapper/0[1] exited with preempt_count 1 Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b SMP: stopping secondary CPUs ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
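The BUG in the dump above fires in move_freepages(), which sanity-checks
that the pages it is about to move all belong to one zone. A small
userspace model of that invariant — the struct page, block size and zone
values here are made up for illustration, this is not the kernel code —
shows why a stale node-0 zone on a NOMAP page inside a node-1 pageblock
trips the assertion:

#include <assert.h>
#include <stdio.h>

/* Minimal stand-in for struct page; only the zone id matters here. */
struct page {
	int zone;
};

#define PAGEBLOCK_PAGES 8	/* made-up block size for illustration */

/*
 * Model of the invariant asserted by move_freepages(): the first and
 * last page of the block being moved must report the same zone.
 */
static void move_freepages_model(const struct page *blk)
{
	assert(blk[0].zone == blk[PAGEBLOCK_PAGES - 1].zone);
	/* the real function would now move the free pages of the block */
}

int main(void)
{
	struct page blk[PAGEBLOCK_PAGES];

	for (int i = 0; i < PAGEBLOCK_PAGES; i++)
		blk[i].zone = 1;		/* node 1 zone, as expected    */
	blk[PAGEBLOCK_PAGES - 1].zone = 0;	/* NOMAP page kept node 0 zone */

	move_freepages_model(blk);		/* aborts, like the BUG above  */
	printf("not reached\n");
	return 0;
}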
Hi Robert,
On Mon, Oct 17, 2016 at 08:58:01PM +0200, Robert Richter wrote:
> Mark, Will, any opinion here?
Having looked at this, I'm inclined to agree with you; pfn_valid() is
all about whether the underlying mem_map (struct page *) entry exists,
not about whether the page is mappable or not.
That said, setting the zone for pages representing NOMAP memory feels
like a slippery slope to losing information about them being NOMAP in
the first place and the whole problem getting out-of-hand. Whilst I'm
happy for pfn_valid() to return true (in the sense that we're within
bounds of mem_map etc), I'm less happy that we're also saying that the
struct page contains useful information, such as the zone and the node
information, which is then subsequently used by the NUMA code.
On top of that, pfn_valid is used in other places as a coarse "is this
memory?" check, and will cause things like ioremap to fail whereas it
wouldn't at the moment. It feels to me like NOMAP memory is a new type
of memory where there *is* a struct page, but it shouldn't be used for
anything. I don't think pfn_valid can describe that, given the way it's
currently used, and flipping the logic is just likely to move the problem
elsewhere.
What options do we have for fixing this in the NUMA code?
Will
On 27.10.16 17:01:36, Will Deacon wrote:
> Hi Robert,
>
> On Mon, Oct 17, 2016 at 08:58:01PM +0200, Robert Richter wrote:
> > Mark, Will, any opinion here?
>
> Having looked at this, I'm inclined to agree with you; pfn_valid() is
> all about whether the underlying mem_map (struct page *) entry exists,
> not about whether the page is mappable or not.
>
> That said, setting the zone for pages representing NOMAP memory feels
> like a slippery slope to losing information about them being NOMAP in
> the first place and the whole problem getting out-of-hand. Whilst I'm
> happy for pfn_valid() to return true (in the sense that we're within
> bounds of mem_map etc), I'm less happy that we're also saying that the
> struct page contains useful information, such as the zone and the node
> information, which is then subsequently used by the NUMA code.

Let's look at it in a different way: pfns and the struct pages assigned to them are about *physical* memory. The system knows all the memory; some is free, some reserved and some marked NOMAP. Regardless of the mapping of a page, the mm code maintains and uses that information. There are assumptions on validity and checks in the code that now cause problems due to partly missing or non-existent data about nomap pages. This inconsistency is dangerous since a problem may occur any time the page area is first accessed, so a system may crash randomly depending on the memory access. Luckily, in my case it triggered reproducibly while loading the initrd during boot.

I also think that this is not only NUMA related. E.g. the following bug report is probably also related:

https://bugzilla.redhat.com/show_bug.cgi?id=1387793

> On top of that, pfn_valid is used in other places as a coarse "is this
> memory?" check, and will cause things like ioremap to fail whereas it
> wouldn't at the moment.

IMO this is a misuse of pfn_valid() that needs to be fixed with additional checks, e.g. traversing memblocks.

> It feels to me like NOMAP memory is a new type
> of memory where there *is* a struct page, but it shouldn't be used for
> anything.

IMO, a NOMAP page should just be handled like a reserved page, except that it is not marked reserved. See free_low_memory_core_early(). Thus, NOMAP pages are neither in the free pages list nor set to reserved. Such a page is simply not available for mapping at all. Isn't that exactly what it should be?

I also do not yet understand the benefit of the differentiation between NOMAP and reserved and the original motivation for its implementation. I looked through the mail threads but could not find any hint. The only difference I see now is that a NOMAP page is not listed as a reserved page, but as long as it is not freed it should behave the same. I vaguely remember a case about handling memory differently (coherency, etc.), but am not sure here. Ard, could you explain this?

> I don't think pfn_valid can describe that, given the way it's
> currently used, and flipping the logic is just likely to move the problem
> elsewhere.
>
> What options do we have for fixing this in the NUMA code?

Off the top of my head:

1) Treat NOMAP pages the same as reserved pages (my patch).
2) Change mm code to allow an arch specific early_pfn_valid().
3) Fix mm code to only access a struct page (of a zone) if pfn_valid() is true.

There can be more alternatives. IMO:

* We shouldn't touch generic mm code.
* We should maintain a valid struct page for all pages in a section.
* We should only traverse memblock where really necessary (arm64 only).
* I don't think this problem is numa specific.

-Robert
On Fri, Oct 28, 2016 at 11:19:05AM +0200, Robert Richter wrote: > On 27.10.16 17:01:36, Will Deacon wrote: > > It feels to me like NOMAP memory is a new type > > of memory where there *is* a struct page, but it shouldn't be used for > > anything. > > IMO, a NOMAP page should just be handled like a reserved page except > that the page is marked reserved. See free_low_memory_core_early(). > Thus, NOMAP pages are not in the free pages list or set to reserved. > It is simply not available for mapping at all. Isn't that exactly what > it should be? > > I also did not yet understand the benefit of the differentiation > between NOMAP and reserved and the original motivation for its > implementation. I looked through the mail threads but could not find > any hint. The only difference I see now is that it is not listed as a > reserved page, but as long as it is not freed it should behave the > same. I remember the case to handle memory different (coherency, > etc.), but are not sure here. Ard, could you explain this? > > > I don't think pfn_valid can describe that, given the way it's > > currently used, and flipping the logic is just likely to move the problem > > elsewhere. > > > > What options do we have for fixing this in the NUMA code? > > Out of my mind: > > 1) Treat NOMAP pages same as reserved pages (my patch). Just to reiterate here, but your patch as it stands will break other parts of the kernel. For example, acpi_os_ioremap relies on being able to ioremap these regions afaict. I think any solution involving pfn_valid is just going to move the crash around. Will
Will,

On 07.11.16 21:05:14, Will Deacon wrote:
> Just to reiterate here, but your patch as it stands will break other parts
> of the kernel. For example, acpi_os_ioremap relies on being able to ioremap
> these regions afaict.
>
> I think any solution involving pfn_valid is just going to move the crash
> around.

Let me describe the crash in more detail. The following ranges are marked nomap (full efi map below):

[ 0.000000] efi: 0x010ffffb9000-0x010ffffccfff [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC]*
[ 0.000000] efi: 0x010ffffcd000-0x010fffffefff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]*

The memory belongs to these nodes:

[ 0.000000] NUMA: Adding memblock [0x1400000 - 0xfffffffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x10000400000 - 0x10fffffffff] on node 1

The following memory ranges are created:

[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000001400000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x0000010fffffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000001400000-0x00000000fffdffff]
[ 0.000000] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x0000000fffffffff]
[ 0.000000] node 1: [mem 0x0000010000400000-0x0000010ff9e7ffff]
[ 0.000000] node 1: [mem 0x0000010ff9e80000-0x0000010ff9f1ffff]
[ 0.000000] node 1: [mem 0x0000010ff9f20000-0x0000010ffaeaffff]
[ 0.000000] node 1: [mem 0x0000010ffaeb0000-0x0000010ffaffffff]
[ 0.000000] node 1: [mem 0x0000010ffb000000-0x0000010ffffaffff]
[ 0.000000] node 1: [mem 0x0000010ffffb0000-0x0000010fffffffff]

The last range 0x0000010ffffb0000-0x0000010fffffffff is correctly marked as a nomap area.

Paging is then initialized in free_area_init_core() (mm/page_alloc.c) which calls memmap_init_zone(). This initializes all pages (struct page) of the zones for each node except for pfns from nomap memory. It uses early_pfn_valid(), which is bound to pfn_valid(). In 4.4 *all* pages of a zone were initialized. Now pages from nomap ranges are skipped, and e.g. pfn 0x10ffffb has an uninitialized struct page. Note that nomap memory ranges are still part of the memblock list and also part of the zone ranges within start_pfn and end_pfn, but those pages don't have a valid struct page now. IMO this is a bug.

Later on, all mappable memory is reserved and all of its pages are freed by adding them to the free-pages list, except for the reserved memblocks.

Now the initrd is loaded. This reserves free memory by calling move_freepages_block(). A block has the size of pageblock_nr_pages, which is 0x2000 in my configuration. The kernel chooses the block 0x0000010fe0000000-0x0000010fffffffff that also contains the nomap region around 10ffb000000. In move_freepages() the pages of that block are then walked. The code accesses pfn #10fffff, but its struct page is uninitialized and thus its node is 0. This differs from pfn #10fe000 and from the zone, which are on node 1. This causes the BUG_ON() to trigger:

[ 4.173521] Unpacking initramfs...
[ 8.420710] ------------[ cut here ]------------
[ 8.425344] kernel BUG at mm/page_alloc.c:1844!
[ 8.429870] Internal error: Oops - BUG: 0 [#1] SMP

I believe this is not the only case where the mm code relies on a valid struct page for all pfns of a zone, thus modifying mm code to make this case work is not an option. (E.g. this could be done by moving the early_pfn_valid() check to the beginning of the loop in memmap_init_zone().) So for a fix we need to change early_pfn_valid(), which is mapped to pfn_valid() for SPARSEMEM.
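For reference, a minimal sketch of the loop in memmap_init_zone() that the description above refers to. This is simplified and not verbatim kernel source; early_pfn_valid() and __init_single_pfn() are the helpers named in the mail, and context/error handling is omitted:

----
/*
 * Simplified sketch (not upstream code): memmap_init_zone() walks all
 * pfns of a zone and initializes their struct page.  With pfn_valid()
 * based on memblock_is_map_memory(), NOMAP pfns fail early_pfn_valid()
 * and are skipped, so their struct page keeps node 0 / zone 0 from the
 * initial clearing of the memmap.
 */
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	if (!early_pfn_valid(pfn))
		continue;	/* NOMAP pfn: struct page stays uninitialized */

	__init_single_pfn(pfn, zone, nid);
}
----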
Using a different function than pfn_valid() for this does not look reasonable. The only option is to revert pfn_valid() to it's original bahaviour. My fix now uses again memblock_is_memory() for pfn_valid() instead of memblock_is_map_memory(). So I needed to change some users of pfn_valid() to use memblock_is_map_memory() there necessary. This is only required for arch specific code, all other uses of pfn_valid() are save to use memblock_is_memory(). Thus, I don't see where my patch breaks code. Even acpi_os_ioremap() keeps the same behaviour as before since it still uses memblock_is_ memory(). Could you more describe your concerns why do you think this patch breaks the kernel and moves the problem somewhere else? I believe it fixes the problem at all. Thanks, -Robert [ 0.000000] efi: Processing EFI memory map: [ 0.000000] MEMBLOCK configuration: [ 0.000000] memory size = 0x1ffe800000 reserved size = 0x10537 [ 0.000000] memory.cnt = 0x2 [ 0.000000] memory[0x0] [0x00000001400000-0x00000fffffffff], 0xffec00000 bytes flags: 0x0 [ 0.000000] memory[0x1] [0x00010000400000-0x00010fffffffff], 0xfffc00000 bytes flags: 0x0 [ 0.000000] reserved.cnt = 0x1 [ 0.000000] reserved[0x0] [0x00000021200000-0x00000021210536], 0x10537 bytes flags: 0x0 [ 0.000000] efi: 0x000001400000-0x00000147ffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000001400000-0x0000000147ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x000001480000-0x0000024affff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000001480000-0x000000024affff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x0000024b0000-0x0000211fffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x000000024b0000-0x000000211fffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x000021200000-0x00002121ffff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000021200000-0x0000002121ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x000021220000-0x0000fffebfff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000021220000-0x000000fffeffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x0000fffec000-0x0000ffff5fff [ACPI Reclaim Memory| | | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x000000fffe0000-0x000000ffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x0000ffff6000-0x0000ffff6fff [ACPI Memory NVS | | | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x000000ffff0000-0x000000ffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x0000ffff7000-0x0000ffffffff [ACPI Reclaim Memory| | | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x000000ffff0000-0x000000ffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x000100000000-0x000ff7ffffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000100000000-0x00000ff7ffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x000ff8000000-0x000ff801ffff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000ff8000000-0x00000ff801ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x000ff8020000-0x000fffa9efff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000ff8020000-0x00000fffa9ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 
0x000fffa9f000-0x000fffffffff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00000fffa90000-0x00000fffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010000400000-0x010f812b3fff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010000400000-0x00010f812bffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f812b4000-0x010f812b6fff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f812b0000-0x00010f812bffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f812b7000-0x010f822e6fff [Loader Code | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f812b0000-0x00010f822effff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f822e7000-0x010f822f6fff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f822e0000-0x00010f822fffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f822f7000-0x010f82342fff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f822f0000-0x00010f8234ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f82343000-0x010f91e73fff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f82340000-0x00010f91e7ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f91e74000-0x010f91e74fff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f91e70000-0x00010f91e7ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f91e75000-0x010f92c98fff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f91e70000-0x00010f92c9ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f92c99000-0x010f93880fff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f92c90000-0x00010f9388ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010f93881000-0x010ff7880fff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010f93880000-0x00010ff788ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ff7881000-0x010ff7886fff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010ff7880000-0x00010ff788ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ff7887000-0x010ff78a3fff [Loader Code | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010ff7880000-0x00010ff78affff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ff78a4000-0x010ff9e8dfff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010ff78a0000-0x00010ff9e8ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ff9e8e000-0x010ff9f16fff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x00010ff9e80000-0x00010ff9f1ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ff9f17000-0x010ffaeb5fff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010ff9f10000-0x00010ffaebffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffaeb6000-0x010ffafc8fff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x00010ffaeb0000-0x00010ffafcffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffafc9000-0x010ffafccfff [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: 
[0x00010ffafc0000-0x00010ffafcffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffafcd000-0x010ffaff4fff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x00010ffafc0000-0x00010ffaffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffaff5000-0x010ffb008fff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010ffaff0000-0x00010ffb00ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffb009000-0x010fffe28fff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010ffb000000-0x00010fffe2ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010fffe29000-0x010fffe3ffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010fffe20000-0x00010fffe3ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010fffe40000-0x010fffe53fff [Loader Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010fffe40000-0x00010fffe5ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010fffe54000-0x010ffffb8fff [Boot Code | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010fffe50000-0x00010ffffbffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffffb9000-0x010ffffccfff [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x00010ffffb0000-0x00010ffffcffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffffcd000-0x010fffffefff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]* [ 0.000000] memblock_add: [0x00010ffffc0000-0x00010fffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x010ffffff000-0x010fffffffff [Boot Data | | | | | | | | |WB|WT|WC|UC] [ 0.000000] memblock_add: [0x00010fffff0000-0x00010fffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c [ 0.000000] efi: 0x804000001000-0x804000001fff [Memory Mapped I/O |RUN| | | | | | | | | | |UC] [ 0.000000] efi: 0x87e0d0001000-0x87e0d0001fff [Memory Mapped I/O |RUN| | | | | | | | | | |UC] [ 0.000000] memblock_reserve: [0x00010f812b0000-0x00010f812bffff] flags 0x0 efi_init+0x208/0x2c8 [ 0.000000] memblock_add: [0x00010f82340000-0x00010f91e7ffff] flags 0x0 arm64_memblock_init+0x16c/0x248 [ 0.000000] memblock_reserve: [0x00010f82340000-0x00010f91e7ffff] flags 0x0 arm64_memblock_init+0x178/0x248 [ 0.000000] memblock_reserve: [0x00000001480000-0x000000024affff] flags 0x0 arm64_memblock_init+0x1b0/0x248 [ 0.000000] memblock_reserve: [0x00010f82343000-0x00010f91e73613] flags 0x0 arm64_memblock_init+0x1cc/0x248 [ 0.000000] memblock_reserve: [0x000000c0000000-0x000000dfffffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] cma: Reserved 512 MiB at 0x00000000c0000000 [ 0.000000] memblock_reserve: [0x00010ffffa0000-0x00010ffffaffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffff90000-0x00010ffff9ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffff80000-0x00010ffff8ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffff70000-0x00010ffff7ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffff60000-0x00010ffff6ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffff50000-0x00010ffff5ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffff40000-0x00010ffff4ffff] flags 0x0 
memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffff30000-0x00010ffff3ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_free: [0x00010ffffa0000-0x00010ffffaffff] paging_init+0x5c0/0x610 [ 0.000000] memblock_free: [0x00000002490000-0x000000024affff] paging_init+0x5f4/0x610 [ 0.000000] memblock_reserve: [0x00010fffefed80-0x00010ffff2fffb] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffffaffd0-0x00010ffffafffe] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffffaffa0-0x00010ffffaffce] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] memblock_reserve: [0x00010ffffa0000-0x00010ffffa000f] flags 0x0 numa_init+0x88/0x3f0 [ 0.000000] NUMA: Adding memblock [0x1400000 - 0xfffffffff] on node 0 [ 0.000000] NUMA: Adding memblock [0x10000400000 - 0x10fffffffff] on node 1 [ 0.000000] NUMA: parsing numa-distance-map-v1 [ 0.000000] NUMA: Initmem setup node 0 [mem 0x01400000-0xfffffffff] [ 0.000000] memblock_reserve: [0x00000fffffe500-0x00000fffffffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] NUMA: NODE_DATA [mem 0xfffffe500-0xfffffffff] [ 0.000000] NUMA: Initmem setup node 1 [mem 0x10000400000-0x10fffffffff] [ 0.000000] memblock_reserve: [0x00010ffffae480-0x00010ffffaff7f] flags 0x0 memblock_alloc_range_nid+0x30/0x58 [ 0.000000] NUMA: NODE_DATA [mem 0x10ffffae480-0x10ffffaff7f] ... [ 8.420710] ------------[ cut here ]------------ [ 8.425344] kernel BUG at mm/page_alloc.c:1844! [ 8.429870] Internal error: Oops - BUG: 0 [#1] SMP [ 8.434654] Modules linked in: [ 8.437712] CPU: 72 PID: 1 Comm: swapper/0 Tainted: G W 4.8.1.4.vanilla10-00007-g9eb3a76b8b88 #111 [ 8.447788] Hardware name: www.cavium.com ThunderX CRB-2S/ThunderX CRB-2S, BIOS 0.3 Sep 13 2016 [ 8.456474] task: ffff800fee626800 task.stack: ffff800fec02c000 [ 8.462404] PC is at move_freepages+0x198/0x1b0 [ 8.466924] LR is at move_freepages_block+0xa8/0xb8 [ 8.471791] pc : [<ffff0000081e5dd0>] lr : [<ffff0000081e5e90>] pstate: 000000c5 [ 8.479174] sp : ffff800fec02f4d0 [ 8.482478] x29: ffff800fec02f4d0 x28: 0000000000000001 [ 8.487786] x27: 0000000000000000 x26: 0000000000000000 [ 8.493093] x25: 000000000000000a x24: ffff7fe043fd0020 [ 8.498401] x23: ffff810ffffaec00 x22: 0000000000000000 [ 8.503708] x21: ffff7fe043ffffc0 x20: 0000000000000000 [ 8.509015] x19: ffff7fe043f80000 x18: ffff8100102f2548 [ 8.514322] x17: 0000000000000000 x16: 0000000100000000 [ 8.519630] x15: 0000000000000007 x14: 0000010200000000 [ 8.524938] x13: 0004a41400000000 x12: 00032c920000001a [ 8.530245] x11: 0000010200000000 x10: 0004a40800000000 [ 8.535553] x9 : 0000000000000040 x8 : 0000000000000000 [ 8.540860] x7 : 0000000000000000 x6 : ffff810ffffaf120 [ 8.546168] x5 : 0000000000000001 x4 : ffff000008d70c48 [ 8.551475] x3 : ffff810ffffae480 x2 : 0000000000000000 [ 8.556782] x1 : 0000000000000001 x0 : ffff810ffffaec00 [ 8.562089] [ 8.563571] Process swapper/0 (pid: 1, stack limit = 0xffff800fec02c020) [ 8.570261] Stack: (0xffff800fec02f4d0 to 0xffff800fec030000) ... 
[ 9.279652] Call trace: [ 9.282091] Exception stack(0xffff800fec02f300 to 0xffff800fec02f430) [ 9.288520] f300: ffff7fe043f80000 0001000000000000 ffff800fec02f4d0 ffff0000081e5dd0 [ 9.296339] f320: 0000000000000001 ffff800fee626800 0000000000000000 00000000026012d0 [ 9.304157] f340: 0000000100000000 ffffffffffffffff 0000000000000130 0000000000000040 [ 9.311976] f360: 0000001200000000 ffff000008cf8198 0000000100000000 0000000000000001 [ 9.319794] f380: 00000000026012d0 ffff810ff9a1e8e8 0000000000000000 0000000000000040 [ 9.327613] f3a0: ffff810ffffaec00 0000000000000001 0000000000000000 ffff810ffffae480 [ 9.335431] f3c0: ffff000008d70c48 0000000000000001 ffff810ffffaf120 0000000000000000 [ 9.343250] f3e0: 0000000000000000 0000000000000040 0004a40800000000 0000010200000000 [ 9.351068] f400: 00032c920000001a 0004a41400000000 0000010200000000 0000000000000007 [ 9.358885] f420: 0000000100000000 0000000000000000 [ 9.363756] [<ffff0000081e5dd0>] move_freepages+0x198/0x1b0 [ 9.369318] [<ffff0000081e5e90>] move_freepages_block+0xa8/0xb8 [ 9.375227] [<ffff0000081e657c>] __rmqueue+0x604/0x650 [ 9.380355] [<ffff0000081e7898>] get_page_from_freelist+0x3f0/0xb88 [ 9.386612] [<ffff0000081e860c>] __alloc_pages_nodemask+0x12c/0xce8 [ 9.392876] [<ffff00000823dc1c>] alloc_page_interleave+0x64/0xc0 [ 9.398873] [<ffff00000823e2f8>] alloc_pages_current+0x108/0x168 [ 9.404870] [<ffff0000081de1cc>] __page_cache_alloc+0x104/0x140 [ 9.410779] [<ffff0000081de31c>] pagecache_get_page+0x114/0x300 [ 9.416688] [<ffff0000081de550>] grab_cache_page_write_begin+0x48/0x68 [ 9.423207] [<ffff00000828e5f8>] simple_write_begin+0x40/0x150 [ 9.429029] [<ffff0000081ddfe0>] generic_perform_write+0xb8/0x1a0 [ 9.435112] [<ffff0000081df8c8>] __generic_file_write_iter+0x170/0x1b0 [ 9.441628] [<ffff0000081df9d4>] generic_file_write_iter+0xcc/0x1c8 [ 9.447893] [<ffff000008262eb4>] __vfs_write+0xcc/0x140 [ 9.453108] [<ffff000008263b84>] vfs_write+0xa4/0x1c0 [ 9.458150] [<ffff000008264bc4>] SyS_write+0x54/0xb0 [ 9.463108] [<ffff000008be1fe4>] xwrite+0x34/0x7c [ 9.467801] [<ffff000008be20c8>] do_copy+0x9c/0xf4 [ 9.472583] [<ffff000008be1da4>] write_buffer+0x34/0x50 [ 9.477796] [<ffff000008be1e08>] flush_buffer+0x48/0xb4 [ 9.483016] [<ffff000008c0fe08>] __gunzip+0x288/0x334 [ 9.488057] [<ffff000008c0fecc>] gunzip+0x18/0x20 [ 9.492750] [<ffff000008be26bc>] unpack_to_rootfs+0x168/0x284 [ 9.498486] [<ffff000008be2848>] populate_rootfs+0x70/0x138 [ 9.504051] [<ffff000008082ff4>] do_one_initcall+0x44/0x138 [ 9.509613] [<ffff000008be0d04>] kernel_init_freeable+0x1ac/0x24c [ 9.515699] [<ffff000008847f20>] kernel_init+0x20/0xf8 [ 9.520826] [<ffff000008082b80>] ret_from_fork+0x10/0x50 [ 9.526129] Code: 910a0021 9400b47d d4210000 d503201f (d4210000) [ 9.532233] ---[ end trace c3040dccdcf12d3a ]--- [ 9.536889] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
On Wed, Nov 09, 2016 at 08:51:32PM +0100, Robert Richter wrote: > Thus, I don't see where my patch breaks code. Even acpi_os_ioremap() > keeps the same behaviour as before since it still uses memblock_is_ > memory(). Could you more describe your concerns why do you think this > patch breaks the kernel and moves the problem somewhere else? I > believe it fixes the problem at all. acpi_os_ioremap always ends up in __ioremap_caller, regardless of memblock_is_memory(). __ioremap_caller then fails if pfn_valid is true. Will
Thanks for your answer.

On 17.11.16 14:25:29, Will Deacon wrote:
> On Wed, Nov 09, 2016 at 08:51:32PM +0100, Robert Richter wrote:
> > Thus, I don't see where my patch breaks code. Even acpi_os_ioremap()
> > keeps the same behaviour as before since it still uses memblock_is_
> > memory(). Could you more describe your concerns why do you think this
> > patch breaks the kernel and moves the problem somewhere else? I
> > believe it fixes the problem at all.
>
> acpi_os_ioremap always ends up in __ioremap_caller, regardless of
> memblock_is_memory(). __ioremap_caller then fails if pfn_valid is true.

But that's the reason my patch changed the code to use memblock_is_map_memory() instead. I looked into the users of pfn_valid(), esp. in arm64 code, and changed them where required.

This week I looked into the kernel again for code that might break with a pfn_valid() change. I found try_ram_remap() in memremap.c whose behaviour has changed now, but this path is explicit for MEMREMAP_WB, so it should be fine.

It might be better to use page_is_ram() in addition to pfn_valid() where necessary. This should work now after commit:

e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem

I still think pfn_valid() is not the correct check to determine the memory attributes for mappings; further checks are required.

The risk of breaking something with my patch is small and limited only to the mapping of efi reserved regions (which is the state of 4.4). If something breaks anyway, it can easily be fixed by adding more checks to pfn_valid() as suggested above.

-Robert
On 17 November 2016 at 15:18, Robert Richter <robert.richter@cavium.com> wrote: > Thanks for your answer. > > On 17.11.16 14:25:29, Will Deacon wrote: >> On Wed, Nov 09, 2016 at 08:51:32PM +0100, Robert Richter wrote: >> > Thus, I don't see where my patch breaks code. Even acpi_os_ioremap() >> > keeps the same behaviour as before since it still uses memblock_is_ >> > memory(). Could you more describe your concerns why do you think this >> > patch breaks the kernel and moves the problem somewhere else? I >> > believe it fixes the problem at all. >> >> acpi_os_ioremap always ends up in __ioremap_caller, regardless of >> memblock_is_memory(). __ioremap_caller then fails if pfn_valid is true. > > But that's the reason my patch changed the code to use memblock_is_ > map_memory() instead. I was looking into the users of pfn_valid() esp. > in arm64 code and changed it where required. > > This week I looked into the kernel again for code that might break by > a pfn_valid() change. I found try_ram_remap() in memremap.c that has > changed behaviour now, but this is explicit for MEMREMAP_WB, so it > should be fine. > > Maybe it might be better to use page_is_ram() in addition to > pfn_valid() where necessary. This should work now after commit: > > e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem > > I still think pfn_valid() is not the correct use to determine the mem > attributes for mappings, there are further checks required. > > The risk of breaking something with my patch is small and limited only > to the mapping of efi reserved regions (which is the state of 4.4). If > something breaks anyway it can easily be fixed by adding more checks > to pfn_valid() as suggested above. > As I noted before, it looks to me like setting CONFIG_HOLES_IN_ZONE is the correct way to address this. However, doing that does uncover a bug in move_freepages() where the VM_BUG_ON_PAGE() dereferences struct page fields before the pfn_valid_within() check, so it seems those need to be switched around. Robert, you mentioned that CONFIG_HOLES_IN_ZONE seems inappropriate for sparsemem. Care to elaborate why?
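A hedged sketch of the reordering Ard describes: the pfn_valid_within() check has to come before anything that dereferences the struct page, including the VM_BUG_ON_PAGE(). The loop below is illustrative only and does not reproduce the exact move_freepages() code:

----
/* Illustrative only, not the upstream mm/page_alloc.c loop. */
for (page = start_page; page <= end_page;) {
	/* With CONFIG_HOLES_IN_ZONE this pfn may have no initialized memmap entry. */
	if (!pfn_valid_within(page_to_pfn(page))) {
		page++;
		continue;
	}

	/* Only now is it safe to look at struct page fields. */
	VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);

	if (!PageBuddy(page)) {
		page++;
		continue;
	}

	/* Move the whole buddy block to the target migratetype list. */
	order = page_order(page);
	list_move(&page->lru, &zone->free_area[order].free_list[migratetype]);
	page += 1 << order;
	pages_moved += 1 << order;
}
----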
On 20.11.16 17:07:44, Ard Biesheuvel wrote:
> On 17 November 2016 at 15:18, Robert Richter <robert.richter@cavium.com> wrote:
> > The risk of breaking something with my patch is small and limited only
> > to the mapping of efi reserved regions (which is the state of 4.4). If
> > something breaks anyway it can easily be fixed by adding more checks
> > to pfn_valid() as suggested above.
>
> As I noted before, it looks to me like setting CONFIG_HOLES_IN_ZONE is
> the correct way to address this. However, doing that does uncover a
> bug in move_freepages() where the VM_BUG_ON_PAGE() dereferences struct
> page fields before the pfn_valid_within() check, so it seems those
> need to be switched around.
>
> Robert, you mentioned that CONFIG_HOLES_IN_ZONE seems inappropriate
> for sparsemem. Care to elaborate why?

HOLES_IN_ZONE is rarely used in the kernel. I think it was introduced to save memory for the memmap, and only a few individual platforms enable it. No architecture enables it unconditionally. For good reasons...

It introduces additional checks. pfn_valid() is usually checked only once for the whole memmap. There are a number of checks enabled, just grep for pfn_valid_within. This will increase the number of pfn_valid() calls by a factor of MAX_ORDER_NR_PAGES, which is 8k in my config. So, this is not the direction to go.

My patch fixes a regression in the kernel that was introduced by the nomap implementation. Some systems cannot boot anymore; besides that, the BUG_ON() may occur at any time depending only on physical page access. We need a fix for 4.9. Here is a reproducer:

https://patchwork.kernel.org/patch/9407677/

My patch also does not break memremap(). With my patch applied, try_ram_remap() would return a linear address for nomap regions. But this is only called if WB is explicitly requested, so it should not happen.

If you think pfn_valid() is wrong here, I am happy to send a patch that fixes this by using page_is_ram(). In any case, the worst case that may happen is behaving the same as v4.4, and we could then fix the wrong uses of pfn_valid() where it is not correctly used to check for ram.

-Robert
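For context, the per-pfn check Robert refers to is, in kernels of this era, roughly the following (from include/linux/mmzone.h, shown slightly simplified): without CONFIG_HOLES_IN_ZONE it collapses to a constant, and with it every pfn_valid_within() caller pays for a real pfn_valid() lookup, which is the per-MAX_ORDER-block overhead being discussed:

----
/* Roughly the include/linux/mmzone.h definition of this era (simplified). */
#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn)	pfn_valid(pfn)
#else
#define pfn_valid_within(pfn)	(1)
#endif
----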
On 23 November 2016 at 21:15, Robert Richter <robert.richter@cavium.com> wrote: > On 20.11.16 17:07:44, Ard Biesheuvel wrote: >> On 17 November 2016 at 15:18, Robert Richter <robert.richter@cavium.com> wrote: > >> > The risk of breaking something with my patch is small and limited only >> > to the mapping of efi reserved regions (which is the state of 4.4). If >> > something breaks anyway it can easily be fixed by adding more checks >> > to pfn_valid() as suggested above. >> > >> >> As I noted before, it looks to me like setting CONFIG_HOLES_IN_ZONE is >> the correct way to address this. However, doing that does uncover a >> bug in move_freepages() where the VM_BUG_ON_PAGE() dereferences struct >> page fields before the pfn_valid_within() check, so it seems those >> need to be switched around. >> >> Robert, you mentioned that CONFIG_HOLES_IN_ZONE seems inappropriate >> for sparsemem. Care to elaborate why? > > HOLES_IN_ZONE is of rare use in the kernel. I think it was introduced > to save memory for the memmap and only some single systems enable it. > There is no architecture that enables it entirely. For good reasons... > > It introduces additional checks. pfn_valid() is usually checked only > once for the whole memmap. There are a number of checks enabled, just > grep for pfn_valid_within. This will increase the number of > pfn_valid() calls by a factor of MAX_ORDER_NR_PAGES, in my config this > is 8k. So, this is not the direction to go. > That does sound like a potential issue. But does it cause any slowdown in practice? The reality is that CONFIG_HOLES_IN_ZONE perfectly describes the situation, and so it is still my preferred option if the performance hit is tolerable. > My patch fixes a regression in the kernel that was introduced by the > nomap implementation. Some systems can not boot anymore, beside of > that the BUG_ON() may occur any time depending only on physical page > access, we need to fix 4.9. Here is a reproducer: > > https://patchwork.kernel.org/patch/9407677/ > > My patch also does not break memremap(). With my patch applied > try_ram_remap() would return a linear addr for nomap regions. But this > is only called if WB is explicitly requested, so it should not happen. Why? MEMREMAP_WB is used often, among other things for mapping firmware tables, which are marked as NOMAP, so in these cases, the linear address is not mapped. > If you think pfn_valid() is wrong here, I am happy to send a patch > that fixes this by using page_is_ram(). In any case, the worst case > that may happen is to behave the same as v4.4, we might fix then the > wrong use of pfn_valid() where it is not correctly used to check for > ram. > page_is_ram() uses string comparisons to look for regions called 'System RAM'. Is that something we can tolerate for each pfn_valid() calll? Perhaps the solution is to reimplement page_is_ram() for arm64 using memblock_is_memory() instead, But that still means we need to modify the generic memremap() code first to switch to it before changing the arm64 implementation of pfn_valid
On 23.11.16 21:25:06, Ard Biesheuvel wrote: > Why? MEMREMAP_WB is used often, among other things for mapping > firmware tables, which are marked as NOMAP, so in these cases, the > linear address is not mapped. If fw tables are mapped wb, that is wrong and needs a separate fix. > > If you think pfn_valid() is wrong here, I am happy to send a patch > > that fixes this by using page_is_ram(). In any case, the worst case > > that may happen is to behave the same as v4.4, we might fix then the > > wrong use of pfn_valid() where it is not correctly used to check for > > ram. > > > > page_is_ram() uses string comparisons to look for regions called > 'System RAM'. Is that something we can tolerate for each pfn_valid() > calll? > > Perhaps the solution is to reimplement page_is_ram() for arm64 using > memblock_is_memory() instead, But that still means we need to modify > the generic memremap() code first to switch to it before changing the > arm64 implementation of pfn_valid No, that's not the solution. pfn_valid() should just check if there is a valid struct page, as other archs do. And if there is a mis-use of pfn_valid() to check for ram, only that calls should be fixed to use page_is_ram(), however this is implemented, or something appropriate. But I don't see any problematic code, and if so, I will fix that. -Robert
On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote: > On 23.11.16 21:25:06, Ard Biesheuvel wrote: >> Why? MEMREMAP_WB is used often, among other things for mapping >> firmware tables, which are marked as NOMAP, so in these cases, the >> linear address is not mapped. > > If fw tables are mapped wb, that is wrong and needs a separate fix. > Why is that wrong? >> > If you think pfn_valid() is wrong here, I am happy to send a patch >> > that fixes this by using page_is_ram(). In any case, the worst case >> > that may happen is to behave the same as v4.4, we might fix then the >> > wrong use of pfn_valid() where it is not correctly used to check for >> > ram. >> > >> >> page_is_ram() uses string comparisons to look for regions called >> 'System RAM'. Is that something we can tolerate for each pfn_valid() >> calll? >> >> Perhaps the solution is to reimplement page_is_ram() for arm64 using >> memblock_is_memory() instead, But that still means we need to modify >> the generic memremap() code first to switch to it before changing the >> arm64 implementation of pfn_valid > > No, that's not the solution. pfn_valid() should just check if there is > a valid struct page, as other archs do. And if there is a mis-use of > pfn_valid() to check for ram, only that calls should be fixed to use > page_is_ram(), however this is implemented, or something appropriate. > But I don't see any problematic code, and if so, I will fix that. > memremap() uses pfn_valid() to decide whether some address is covered by the linear mapping. If we correct pfn_valid() to adhere to your definition, we will need to fix memremap() first in any case.
On 24.11.16 13:44:31, Ard Biesheuvel wrote: > On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote: > > On 23.11.16 21:25:06, Ard Biesheuvel wrote: > >> Why? MEMREMAP_WB is used often, among other things for mapping > >> firmware tables, which are marked as NOMAP, so in these cases, the > >> linear address is not mapped. > > > > If fw tables are mapped wb, that is wrong and needs a separate fix. > > > > Why is that wrong? The whole issue with mapping acpi tables is not marking them cachable, what wb does. Otherwise we could just use linear mapping for those mem ranges. > >> > If you think pfn_valid() is wrong here, I am happy to send a patch > >> > that fixes this by using page_is_ram(). In any case, the worst case > >> > that may happen is to behave the same as v4.4, we might fix then the > >> > wrong use of pfn_valid() where it is not correctly used to check for > >> > ram. > >> > > >> > >> page_is_ram() uses string comparisons to look for regions called > >> 'System RAM'. Is that something we can tolerate for each pfn_valid() > >> calll? > >> > >> Perhaps the solution is to reimplement page_is_ram() for arm64 using > >> memblock_is_memory() instead, But that still means we need to modify > >> the generic memremap() code first to switch to it before changing the > >> arm64 implementation of pfn_valid > > > > No, that's not the solution. pfn_valid() should just check if there is > > a valid struct page, as other archs do. And if there is a mis-use of > > pfn_valid() to check for ram, only that calls should be fixed to use > > page_is_ram(), however this is implemented, or something appropriate. > > But I don't see any problematic code, and if so, I will fix that. > > > > memremap() uses pfn_valid() to decide whether some address is covered > by the linear mapping. If we correct pfn_valid() to adhere to your > definition, we will need to fix memremap() first in any case. As said, will fix that if needed. But I think the caller is wrong then. -Robert
On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote: > On 24.11.16 13:44:31, Ard Biesheuvel wrote: >> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote: >> > On 23.11.16 21:25:06, Ard Biesheuvel wrote: >> >> Why? MEMREMAP_WB is used often, among other things for mapping >> >> firmware tables, which are marked as NOMAP, so in these cases, the >> >> linear address is not mapped. >> > >> > If fw tables are mapped wb, that is wrong and needs a separate fix. >> > >> >> Why is that wrong? > > The whole issue with mapping acpi tables is not marking them cachable, > what wb does. What 'issue'? > Otherwise we could just use linear mapping for those mem > ranges. > Regions containing firmware tables are owned by the firmware, and it is the firmware that tells us which memory attributes we are allowed to use. If those attributes include WB, it is perfectly legal to use a cacheable mapping. That does *not* mean they should be covered by the linear mapping. The linear mapping is read-write-non-exec, for instance, and we may prefer to use a read-only mapping and/or executable mapping. >> >> > If you think pfn_valid() is wrong here, I am happy to send a patch >> >> > that fixes this by using page_is_ram(). In any case, the worst case >> >> > that may happen is to behave the same as v4.4, we might fix then the >> >> > wrong use of pfn_valid() where it is not correctly used to check for >> >> > ram. >> >> > >> >> >> >> page_is_ram() uses string comparisons to look for regions called >> >> 'System RAM'. Is that something we can tolerate for each pfn_valid() >> >> calll? >> >> >> >> Perhaps the solution is to reimplement page_is_ram() for arm64 using >> >> memblock_is_memory() instead, But that still means we need to modify >> >> the generic memremap() code first to switch to it before changing the >> >> arm64 implementation of pfn_valid >> > >> > No, that's not the solution. pfn_valid() should just check if there is >> > a valid struct page, as other archs do. And if there is a mis-use of >> > pfn_valid() to check for ram, only that calls should be fixed to use >> > page_is_ram(), however this is implemented, or something appropriate. >> > But I don't see any problematic code, and if so, I will fix that. >> > >> >> memremap() uses pfn_valid() to decide whether some address is covered >> by the linear mapping. If we correct pfn_valid() to adhere to your >> definition, we will need to fix memremap() first in any case. > > As said, will fix that if needed. But I think the caller is wrong > then. >
On 24.11.16 13:58:30, Ard Biesheuvel wrote: > On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote: > > On 24.11.16 13:44:31, Ard Biesheuvel wrote: > >> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote: > >> > On 23.11.16 21:25:06, Ard Biesheuvel wrote: > >> >> Why? MEMREMAP_WB is used often, among other things for mapping > >> >> firmware tables, which are marked as NOMAP, so in these cases, the > >> >> linear address is not mapped. > >> > > >> > If fw tables are mapped wb, that is wrong and needs a separate fix. > >> > > >> > >> Why is that wrong? > > > > The whole issue with mapping acpi tables is not marking them cachable, > > what wb does. > > What 'issue'? > > > Otherwise we could just use linear mapping for those mem > > ranges. > > > > Regions containing firmware tables are owned by the firmware, and it > is the firmware that tells us which memory attributes we are allowed > to use. If those attributes include WB, it is perfectly legal to use a > cacheable mapping. That does *not* mean they should be covered by the > linear mapping. The linear mapping is read-write-non-exec, for > instance, and we may prefer to use a read-only mapping and/or > executable mapping. Ok, I am going to fix try_ram_remap(). Are there other concerns with this patch? Thanks, -Robert
On 24 November 2016 at 14:11, Robert Richter <robert.richter@cavium.com> wrote: > On 24.11.16 13:58:30, Ard Biesheuvel wrote: >> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote: >> > On 24.11.16 13:44:31, Ard Biesheuvel wrote: >> >> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote: >> >> > On 23.11.16 21:25:06, Ard Biesheuvel wrote: >> >> >> Why? MEMREMAP_WB is used often, among other things for mapping >> >> >> firmware tables, which are marked as NOMAP, so in these cases, the >> >> >> linear address is not mapped. >> >> > >> >> > If fw tables are mapped wb, that is wrong and needs a separate fix. >> >> > >> >> >> >> Why is that wrong? >> > >> > The whole issue with mapping acpi tables is not marking them cachable, >> > what wb does. >> >> What 'issue'? >> >> > Otherwise we could just use linear mapping for those mem >> > ranges. >> > >> >> Regions containing firmware tables are owned by the firmware, and it >> is the firmware that tells us which memory attributes we are allowed >> to use. If those attributes include WB, it is perfectly legal to use a >> cacheable mapping. That does *not* mean they should be covered by the >> linear mapping. The linear mapping is read-write-non-exec, for >> instance, and we may prefer to use a read-only mapping and/or >> executable mapping. > > Ok, I am going to fix try_ram_remap(). > Thanks. Could you also add an arm64 version of page_is_ram() that uses memblock_is_memory() while you're at it? I think using memblock directly in try_ram_remap() may not be the best approach > Are there other concerns with this patch? > I think we all agree that pfn_valid() should return whether a pfn has a struct page associated with it, the debate is about whether it makes sense to allocate struct pages for memory that the kernel does not own. But given that it does not really hurt to do so for small holes, I think your suggestion makes sense. Should we be doing anything more to ensure that those pages are not dereferenced inadvertently? Is there a page flag we should be setting?
On 24.11.16 14:23:16, Ard Biesheuvel wrote: > On 24 November 2016 at 14:11, Robert Richter <robert.richter@cavium.com> wrote: > > On 24.11.16 13:58:30, Ard Biesheuvel wrote: > >> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote: > >> > On 24.11.16 13:44:31, Ard Biesheuvel wrote: > >> >> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote: > >> >> > On 23.11.16 21:25:06, Ard Biesheuvel wrote: > >> >> >> Why? MEMREMAP_WB is used often, among other things for mapping > >> >> >> firmware tables, which are marked as NOMAP, so in these cases, the > >> >> >> linear address is not mapped. > >> >> > > >> >> > If fw tables are mapped wb, that is wrong and needs a separate fix. > >> >> > > >> >> > >> >> Why is that wrong? > >> > > >> > The whole issue with mapping acpi tables is not marking them cachable, > >> > what wb does. > >> > >> What 'issue'? > >> > >> > Otherwise we could just use linear mapping for those mem > >> > ranges. > >> > > >> > >> Regions containing firmware tables are owned by the firmware, and it > >> is the firmware that tells us which memory attributes we are allowed > >> to use. If those attributes include WB, it is perfectly legal to use a > >> cacheable mapping. That does *not* mean they should be covered by the > >> linear mapping. The linear mapping is read-write-non-exec, for > >> instance, and we may prefer to use a read-only mapping and/or > >> executable mapping. > > > > Ok, I am going to fix try_ram_remap(). > > > > Thanks. Could you also add an arm64 version of page_is_ram() that uses > memblock_is_memory() while you're at it? I think using memblock > directly in try_ram_remap() may not be the best approach Sure. I also want to mark the patches as stable. > > Are there other concerns with this patch? > > > > I think we all agree that pfn_valid() should return whether a pfn has > a struct page associated with it, the debate is about whether it makes > sense to allocate struct pages for memory that the kernel does not > own. But given that it does not really hurt to do so for small holes, > I think your suggestion makes sense. Thanks for your comments and the review. > Should we be doing anything more to ensure that those pages are not > dereferenced inadvertently? Is there a page flag we should be setting? I don't think so. Boot mem is initialized in free_low_memory_core_ early(). The PageReserved flag is set for pages from reserved memory ranges, and memory ranges not marked NOMAP is just freed. Since pages are either reservered or in the free_list of pages, pages from other memory ranges (NOMAP) is not visible to mm. -Robert
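A minimal sketch of the arm64 page_is_ram() override suggested above. This is an assumption about how such a (hypothetical, not yet written) patch might look; page_is_ram() is a weak symbol in kernel/resource.c that architectures may override, and a memblock-based check would avoid the "System RAM" string matching of the generic version:

----
/* Hypothetical arm64 override, sketching the suggestion in this thread. */
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/pfn.h>

int page_is_ram(unsigned long pfn)
{
	/* Any pfn backed by memblock memory, including NOMAP ranges, counts as RAM. */
	return memblock_is_memory(PFN_PHYS(pfn));
}
----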
Ard,

> > >> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote:
> > >> > On 24.11.16 13:44:31, Ard Biesheuvel wrote:
> > >> Regions containing firmware tables are owned by the firmware, and it
> > >> is the firmware that tells us which memory attributes we are allowed
> > >> to use. If those attributes include WB, it is perfectly legal to use a
> > >> cacheable mapping. That does *not* mean they should be covered by the
> > >> linear mapping. The linear mapping is read-write-non-exec, for
> > >> instance, and we may prefer to use a read-only mapping and/or
> > >> executable mapping.
> > >
> > > Ok, I am going to fix try_ram_remap().

I revisited the code and it already works correctly since:

e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem

Now, try_ram_remap() is only called if the region to be mapped is entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal memory ranges and not for NOMAP memory. region_intersects() then returns REGION_INTERSECTS and try_ram_remap() is called. For the NOMAP memory case, REGION_DISJOINT is returned and thus arch_memremap_wb() is called directly. Before the e7cd190385d1 change, try_ram_remap() was also called for nomap regions.

So we can leave memremap() as it is and just apply this patch unmodified. What do you think? Please ack.

I am going to prepare the page_is_ram() change in addition to this patch, but that should not be required for this fix to work correctly.

Thanks,
-Robert
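To illustrate the path described above, here is a hedged sketch of the MEMREMAP_WB logic in kernel/memremap.c of that time (simplified, error handling and the other MEMREMAP_* flags omitted; not verbatim source): only when region_intersects() reports the whole range as IORESOURCE_SYSTEM_RAM does try_ram_remap() get a chance to return a linear-map address, and NOMAP ranges, no longer tagged as System RAM since e7cd190385d1, fall through to arch_memremap_wb():

----
/* Simplified sketch of the WB path in memremap(), not verbatim source. */
static void *try_ram_remap(resource_size_t offset, size_t size)
{
	unsigned long pfn = PHYS_PFN(offset);

	/* Reuse the linear mapping if the page is part of the memmap. */
	if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)))
		return __va(offset);

	return NULL;			/* fall back to arch_memremap_wb() */
}

void *memremap_wb_sketch(resource_size_t offset, size_t size)
{
	int is_ram = region_intersects(offset, size,
				       IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
	void *addr = NULL;

	/* NOMAP ranges are not IORESOURCE_SYSTEM_RAM, so they report REGION_DISJOINT. */
	if (is_ram == REGION_INTERSECTS)
		addr = try_ram_remap(offset, size);
	if (!addr)
		addr = arch_memremap_wb(offset, size);

	return addr;
}
----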
On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote: > Ard, > >> > >> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote: >> > >> > On 24.11.16 13:44:31, Ard Biesheuvel wrote: > >> > >> Regions containing firmware tables are owned by the firmware, and it >> > >> is the firmware that tells us which memory attributes we are allowed >> > >> to use. If those attributes include WB, it is perfectly legal to use a >> > >> cacheable mapping. That does *not* mean they should be covered by the >> > >> linear mapping. The linear mapping is read-write-non-exec, for >> > >> instance, and we may prefer to use a read-only mapping and/or >> > >> executable mapping. >> > > >> > > Ok, I am going to fix try_ram_remap(). > > I revisited the code and it is working well already since: > > e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem > > Now, try_ram_remap() is only called if the region to be mapped is > entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem > ranges and not NOMAP mem. region_intersects() then returns > REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case > REGION_DISJOINT would be returned and thus arch_memremap_wb() being > called directly. Before the e7cd190385d1 change try_ram_remap() was > called also for nomap regions. > > So we can leave memremap() as it is and just apply this patch > unmodified. What do you think? I agree. The pfn_valid() check in try_ram_remap() is still appropriate simply because the PageHighmem check requires a valid struct page. But if we don't enter that code path anymore for NOMAP regions, I think we're ok. > Please ack. > I still don't fully understand how it is guaranteed that *all* memory (i.e., all regions for which memblock_is_memory() returns true) is covered by a struct page, but marked as reserved. Are we relying on the fact that NOMAP memory is also memblock_reserve()'d? > I am going to prepare the pfn_is_ram() change in addition to this > patch, but that should not be required for this fix to work correcly. > I don't think you need to bother with page_is_ram() then. The only place we use it is in devmem_is_allowed(), and there it makes sense to allow NOMAP regions to be accessed (provided that you think /dev/mem is a good idea in the first place).
On 24.11.16 19:42:47, Ard Biesheuvel wrote:
> On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote:
> > I revisited the code and it is working well already since:
> >
> > e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
> >
> > Now, try_ram_remap() is only called if the region to be mapped is
> > entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem
> > ranges and not NOMAP mem. region_intersects() then returns
> > REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case
> > REGION_DISJOINT would be returned and thus arch_memremap_wb() being
> > called directly. Before the e7cd190385d1 change try_ram_remap() was
> > called also for nomap regions.
> >
> > So we can leave memremap() as it is and just apply this patch
> > unmodified. What do you think?
>
> I agree. The pfn_valid() check in try_ram_remap() is still appropriate
> simply because the PageHighmem check requires a valid struct page. But
> if we don't enter that code path anymore for NOMAP regions, I think
> we're ok.
>
> > Please ack.
> >
>
> I still don't fully understand how it is guaranteed that *all* memory
> (i.e., all regions for which memblock_is_memory() returns true) is
> covered by a struct page, but marked as reserved. Are we relying on
> the fact that NOMAP memory is also memblock_reserve()'d?

See free_low_memory_core_early():

----
	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
				NULL)
		count += __free_memory_core(start, end);
----

Only mem with the MEMBLOCK_NONE flag is added. And NOMAP pages are
also *not* marked reserved. So nothing at all from NOMAP mem is
reported to mm, it is not present (see below for a mem config, note
flags: 0x4 mem regions).

-Robert

[ 0.000000] efi: Processing EFI memory map:
[ 0.000000] efi: 0x000001400000-0x00000147ffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x000001480000-0x0000024bffff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x0000024c0000-0x0000211fffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x000021200000-0x00002121ffff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x000021220000-0x0000fffebfff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x0000fffec000-0x0000ffff5fff [ACPI Reclaim Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x0000ffff6000-0x0000ffff6fff [ACPI Memory NVS | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x0000ffff7000-0x0000ffffffff [ACPI Reclaim Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x000100000000-0x000ff7ffffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x000ff8000000-0x000ff801ffff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x000ff8020000-0x000fffa9efff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x000fffa9f000-0x000fffffffff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010000400000-0x010f816aefff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f816af000-0x010f816b1fff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f816b2000-0x010f826f1fff [Loader Code | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f826f2000-0x010f82701fff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f82702000-0x010f82787fff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f82788000-0x010f9276bfff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f9276c000-0x010f9276cfff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f9276d000-0x010f935a8fff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f935a9000-0x010f93880fff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010f93881000-0x010ff7880fff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ff7881000-0x010ff7886fff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ff7887000-0x010ff78a3fff [Loader Code | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ff78a4000-0x010ff9e8dfff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ff9e8e000-0x010ff9f16fff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ff9f17000-0x010ffaeb5fff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffaeb6000-0x010ffafc8fff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffafc9000-0x010ffafccfff [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffafcd000-0x010ffaff4fff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffaff5000-0x010ffb008fff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffb009000-0x010fffe28fff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010fffe29000-0x010fffe3ffff [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010fffe40000-0x010fffe53fff [Loader Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010fffe54000-0x010ffffb8fff [Boot Code | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffffb9000-0x010ffffccfff [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffffcd000-0x010fffffefff [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x010ffffff000-0x010fffffffff [Boot Data | | | | | | | | |WB|WT|WC|UC]
[ 0.000000] efi: 0x804000001000-0x804000001fff [Memory Mapped I/O |RUN| | | | | | | | | | |UC]
[ 0.000000] efi: 0x87e0d0001000-0x87e0d0001fff [Memory Mapped I/O |RUN| | | | | | | | | | |UC]

[ 0.000000] MEMBLOCK configuration:
[ 0.000000] memory size = 0x1ffe800000 reserved size = 0x39146a21
[ 0.000000] memory.cnt = 0x9
[ 0.000000] memory[0x0] [0x00000001400000-0x000000fffdffff], 0xfebe0000 bytes on node 0 flags: 0x0
[ 0.000000] memory[0x1] [0x000000fffe0000-0x000000ffffffff], 0x20000 bytes on node 0 flags: 0x4
[ 0.000000] memory[0x2] [0x00000100000000-0x00000fffffffff], 0xf00000000 bytes on node 0 flags: 0x0
[ 0.000000] memory[0x3] [0x00010000400000-0x00010ff9e7ffff], 0xff9a80000 bytes on node 1 flags: 0x0
[ 0.000000] memory[0x4] [0x00010ff9e80000-0x00010ff9f1ffff], 0xa0000 bytes on node 1 flags: 0x4
[ 0.000000] memory[0x5] [0x00010ff9f20000-0x00010ffaeaffff], 0xf90000 bytes on node 1 flags: 0x0
[ 0.000000] memory[0x6] [0x00010ffaeb0000-0x00010ffaffffff], 0x150000 bytes on node 1 flags: 0x4
[ 0.000000] memory[0x7] [0x00010ffb000000-0x00010ffffaffff], 0x4fb0000 bytes on node 1 flags: 0x0
[ 0.000000] memory[0x8] [0x00010ffffb0000-0x00010fffffffff], 0x50000 bytes on node 1 flags: 0x4
[ 0.000000] reserved.cnt = 0xd
[ 0.000000] reserved[0x0] [0x00000001480000-0x0000000249ffff], 0x1020000 bytes flags: 0x0
[ 0.000000] reserved[0x1] [0x00000021200000-0x00000021210536], 0x10537 bytes flags: 0x0
[ 0.000000] reserved[0x2] [0x000000c0000000-0x000000dfffffff], 0x20000000 bytes flags: 0x0
[ 0.000000] reserved[0x3] [0x00000ffbfb8000-0x00000ffffdffff], 0x4028000 bytes flags: 0x0
[ 0.000000] reserved[0x4] [0x00000ffffecb00-0x00000fffffffff], 0x13500 bytes flags: 0x0
[ 0.000000] reserved[0x5] [0x00010f81780000-0x00010f8178ffff], 0x10000 bytes flags: 0x0
[ 0.000000] reserved[0x6] [0x00010f82870000-0x00010f9286ffff], 0x10000000 bytes flags: 0x0
[ 0.000000] reserved[0x7] [0x00010ffbce0000-0x00010fffceffff], 0x4010000 bytes flags: 0x0
[ 0.000000] reserved[0x8] [0x00010fffee6d80-0x00010ffff2fffb], 0x4927c bytes flags: 0x0
[ 0.000000] reserved[0x9] [0x00010ffff30000-0x00010ffffa000f], 0x70010 bytes flags: 0x0
[ 0.000000] reserved[0xa] [0x00010ffffae280-0x00010ffffaff7f], 0x1d00 bytes flags: 0x0
[ 0.000000] reserved[0xb] [0x00010ffffaffa0-0x00010ffffaffce], 0x2f bytes flags: 0x0
[ 0.000000] reserved[0xc] [0x00010ffffaffd0-0x00010ffffafffe], 0x2f bytes flags: 0x0
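For readers following along, the two memblock predicates contrasted throughout this thread differ only in whether they honour the NOMAP flag. The following is a rough sketch of their shape in mm/memblock.c of this era; it is paraphrased rather than copied, and the helpers memblock_search() and memblock_is_nomap() are not quoted anywhere in the thread, so treat the details as an illustration:

----
/* Membership test only: NOMAP regions still count as memory. */
int memblock_is_memory(phys_addr_t addr)
{
	return memblock_search(&memblock.memory, addr) != -1;
}

/* Additionally rejects regions carrying MEMBLOCK_NOMAP (the flags: 0x4
 * regions in the dump above). */
int memblock_is_map_memory(phys_addr_t addr)
{
	int i = memblock_search(&memblock.memory, addr);

	if (i == -1)
		return false;
	return !memblock_is_nomap(&memblock.memory.regions[i]);
}
----

So a pfn_valid() built on memblock_is_memory() also answers true for the flags: 0x4 regions listed above, while one built on memblock_is_map_memory() does not, which is exactly the behavioural difference the patch toggles.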
On 25 November 2016 at 11:29, Robert Richter <robert.richter@cavium.com> wrote:
> On 24.11.16 19:42:47, Ard Biesheuvel wrote:
>> On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote:
>
>> > I revisited the code and it is working well already since:
>> >
>> > e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>> >
>> > Now, try_ram_remap() is only called if the region to be mapped is
>> > entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem
>> > ranges and not NOMAP mem. region_intersects() then returns
>> > REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case
>> > REGION_DISJOINT would be returned and thus arch_memremap_wb() being
>> > called directly. Before the e7cd190385d1 change try_ram_remap() was
>> > called also for nomap regions.
>> >
>> > So we can leave memremap() as it is and just apply this patch
>> > unmodified. What do you think?
>>
>> I agree. The pfn_valid() check in try_ram_remap() is still appropriate
>> simply because the PageHighmem check requires a valid struct page. But
>> if we don't enter that code path anymore for NOMAP regions, I think
>> we're ok.
>>
>> > Please ack.
>> >
>>
>> I still don't fully understand how it is guaranteed that *all* memory
>> (i.e., all regions for which memblock_is_memory() returns true) is
>> covered by a struct page, but marked as reserved. Are we relying on
>> the fact that NOMAP memory is also memblock_reserve()'d?
>
> See free_low_memory_core_early():
>
> ----
> for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> NULL)
> count += __free_memory_core(start, end);
> ----
>
> Only mem with the MEMBLOCK_NONE flag is added. And NOMAP pages are
> also *not* marked reserved. So nothing at all from NOMAP mem is
> reported to mm, it is not present (see below for a mem config, note
> flags: 0x4 mem regions).
>

OK, thanks for clearing that up. But that still does not explain how
we can be certain that NOMAP regions are guaranteed to be covered by a
struct page, does it? Because that is ultimately what pfn_valid()
means, that it is safe to, e.g., look at the page flags.
> [EFI memory map and MEMBLOCK configuration from the previous mail snipped]
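The question above hinges on what pfn_valid() promises under SPARSEMEM: validity is tracked per memory section, not per page. The following self-contained toy is not kernel code; all names (SECTION_SHIFT, NR_SECTIONS, the local memory_present() and pfn_valid()) are made up for illustration. It only shows the granularity argument: once a section is marked present, every pfn inside it reports valid, whether or not individual pages in it are NOMAP.

----
#include <stdbool.h>
#include <stdio.h>

#define SECTION_SHIFT	4		/* toy value: 16 pages per section */
#define NR_SECTIONS	8

static bool section_present[NR_SECTIONS];

/* Mark every section that overlaps [start_pfn, end_pfn) as present. */
static void memory_present(unsigned long start_pfn, unsigned long end_pfn)
{
	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn += 1UL << SECTION_SHIFT)
		section_present[pfn >> SECTION_SHIFT] = true;
}

/* Section-granular validity check, in the spirit of SPARSEMEM. */
static bool pfn_valid(unsigned long pfn)
{
	unsigned long sec = pfn >> SECTION_SHIFT;

	return sec < NR_SECTIONS && section_present[sec];
}

int main(void)
{
	/* Register pfns 0..63 as present, as if they came from memblock.memory. */
	memory_present(0, 64);

	/* A "NOMAP" pfn inside that range is still valid at this granularity. */
	printf("pfn_valid(10)  = %d\n", pfn_valid(10));	/* 1 */
	printf("pfn_valid(100) = %d\n", pfn_valid(100));	/* 0, never registered */
	return 0;
}
----

Whether NOMAP regions end up with struct page backing therefore comes down to whether their sections get marked present at all, which is what the next mail answers.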
On 25 November 2016 at 12:28, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 25 November 2016 at 11:29, Robert Richter <robert.richter@cavium.com> wrote:
>> On 24.11.16 19:42:47, Ard Biesheuvel wrote:
>>> On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote:
>>
>>> > I revisited the code and it is working well already since:
>>> >
>>> > e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>>> >
>>> > Now, try_ram_remap() is only called if the region to be mapped is
>>> > entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem
>>> > ranges and not NOMAP mem. region_intersects() then returns
>>> > REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case
>>> > REGION_DISJOINT would be returned and thus arch_memremap_wb() being
>>> > called directly. Before the e7cd190385d1 change try_ram_remap() was
>>> > called also for nomap regions.
>>> >
>>> > So we can leave memremap() as it is and just apply this patch
>>> > unmodified. What do you think?
>>>
>>> I agree. The pfn_valid() check in try_ram_remap() is still appropriate
>>> simply because the PageHighmem check requires a valid struct page. But
>>> if we don't enter that code path anymore for NOMAP regions, I think
>>> we're ok.
>>>
>>> > Please ack.
>>> >
>>>
>>> I still don't fully understand how it is guaranteed that *all* memory
>>> (i.e., all regions for which memblock_is_memory() returns true) is
>>> covered by a struct page, but marked as reserved. Are we relying on
>>> the fact that NOMAP memory is also memblock_reserve()'d?
>>
>> See free_low_memory_core_early():
>>
>> ----
>> for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
>> NULL)
>> count += __free_memory_core(start, end);
>> ----
>>
>> Only mem with the MEMBLOCK_NONE flag is added. And NOMAP pages are
>> also *not* marked reserved. So nothing at all from NOMAP mem is
>> reported to mm, it is not present (see below for a mem config, note
>> flags: 0x4 mem regions).
>>
>
> OK, thanks for clearing that up. But that still does not explain how
> we can be certain that NOMAP regions are guaranteed to be covered by a
> struct page, does it? Because that is ultimately what pfn_valid()
> means, that it is safe to, e.g., look at the page flags.
>

Answering to self: arm64_memory_present() registers all memory as
present, which means that sparse_init() will allocate struct page
backing for all of memory, including NOMAP regions.

We could ask ourselves whether it makes sense to disregard NOMAP
memory here, but that probably buys us very little (but I do wonder
how it affects the occurrence of the bug).

In any case, it looks to me like your patch is safe, i.e., calling
pfn_valid() on NOMAP pages is safe, although I still find it debatable
whether the kernel should be tracking memory it does not own. However,
for performance reasons, it probably makes sense to apply your patch,
and if anyone ever comes up with a use case where this is problematic
(e.g., gigabytes of NOMAP memory), we can look into it then.

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
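For completeness, the registration Ard refers to looks roughly like the sketch below, modelled on arch/arm64/mm/init.c of this era. Only arm64_memory_present() is named in the mail itself; memory_present(), for_each_memblock() and the memblock_region_memory_*_pfn() helpers are assumptions about the surrounding code, and node handling details are elided. The point is that the walk covers every memblock.memory region without checking MEMBLOCK_NOMAP, so NOMAP ranges get their sections marked present and sparse_init() later backs them with struct pages:

----
#ifdef CONFIG_SPARSEMEM
static void arm64_memory_present(void)
{
	struct memblock_region *reg;

	/* Every memblock.memory region, NOMAP or not, is reported present. */
	for_each_memblock(memory, reg)
		memory_present(memblock_get_region_node(reg),
			       memblock_region_memory_base_pfn(reg),
			       memblock_region_memory_end_pfn(reg));
}
#endif
----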
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index bbb7ee76e319..25b8659c2a9f 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 int pfn_valid(unsigned long pfn)
 {
-	return memblock_is_map_memory(pfn << PAGE_SHIFT);
+	return memblock_is_memory(pfn << PAGE_SHIFT);
 }
 EXPORT_SYMBOL(pfn_valid);
 #endif
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index 01e88c8bcab0..c17c220b0c48 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -21,6 +21,7 @@
  */
 
 #include <linux/export.h>
+#include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/io.h>
@@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
 	/*
 	 * Don't allow RAM to be mapped.
 	 */
-	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
+	if (WARN_ON(memblock_is_map_memory(phys_addr)))
 		return NULL;
 
 	area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
 	/* For normal memory we already have a cacheable mapping. */
-	if (pfn_valid(__phys_to_pfn(phys_addr)))
+	if (memblock_is_map_memory(phys_addr))
 		return (void __iomem *)__phys_to_virt(phys_addr);
 
 	return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
There is a memory setup problem on ThunderX systems with certain
memory configurations. The symptom is

 kernel BUG at mm/page_alloc.c:1848!

This happens for some configs with 64k page size enabled. The bug
triggers for page zones with some pages in the zone not assigned to
this particular zone. In my case some pages that are marked as nomap
were not reassigned to the new zone of node 1, so those are still
assigned to node 0.

The reason for the misconfiguration is a change in pfn_valid() which
reports pages marked nomap as invalid:

 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping

This causes pages marked as nomap to no longer be reassigned to the
new zone in memmap_init_zone() by calling __init_single_pfn().

Fix this by restoring the old behavior of pfn_valid(), which uses
memblock_is_memory(). Also change users of pfn_valid() in arm64 code
to use memblock_is_map_memory() where necessary. This only affects
code in ioremap.c. The code in mmu.c can still use the new version of
pfn_valid().

Should be marked stable v4.5..

Signed-off-by: Robert Richter <rrichter@cavium.com>
---
 arch/arm64/mm/init.c    | 2 +-
 arch/arm64/mm/ioremap.c | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)
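To make the failure mode described above a bit more concrete, the relevant part of memmap_init_zone() looks roughly like the heavily simplified sketch below. Only __init_single_pfn() is named in the description; early_pfn_valid(), early_pfn_in_nid() and the loop bounds are assumptions about the surrounding mm/page_alloc.c code of this era, with deferred init and overlap handling elided. With the stricter pfn_valid(), nomap pfns fail the validity check, their struct pages keep stale node/zone links, and that is what later trips the BUG in the page allocator:

----
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn))	/* effectively pfn_valid(pfn) */
			continue;		/* nomap pfn: struct page left untouched */
		if (!early_pfn_in_nid(pfn, nid))
			continue;
		/* (Re)assigns the page to its zone and node. */
		__init_single_pfn(pfn, zone, nid);
	}
----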