
arm64: mm: Fix memmap to be initialized for the entire section

Message ID 1475747527-32387-1-git-send-email-rrichter@cavium.com (mailing list archive)
State New, archived

Commit Message

Robert Richter Oct. 6, 2016, 9:52 a.m. UTC
There is a memory setup problem on ThunderX systems with certain
memory configurations. The symptom is

 kernel BUG at mm/page_alloc.c:1848!

This happens for some configurations with 64k page size enabled. The
bug triggers for page zones that contain pages which are not assigned
to that particular zone. In my case some pages that are marked as
nomap were not reassigned to the new zone of node 1, so they are
still assigned to node 0.

The reason for the mis-configuration is a change in pfn_valid() which
reports pages marked nomap as invalid:

 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping

This causes pages marked as nomap to no longer be reassigned to the
new zone in memmap_init_zone() by calling __init_single_pfn().

Fix this by restoring the old behavior of pfn_valid(), i.e. using
memblock_is_memory() again. Also change users of pfn_valid() in arm64
code to use memblock_is_map_memory() where necessary. This only
affects code in ioremap.c. The code in mmu.c can still use the new
version of pfn_valid().

Should be marked stable v4.5..

Signed-off-by: Robert Richter <rrichter@cavium.com>
---
 arch/arm64/mm/init.c    | 2 +-
 arch/arm64/mm/ioremap.c | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

Comments

Ard Biesheuvel Oct. 6, 2016, 10 a.m. UTC | #1
Hi Robert,

Apologies for only responding now. I did not quite manage to get my
head around your original email yet, but I don't think this patch is
the correct solution.

On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> There is a memory setup problem on ThunderX systems with certain
> memory configurations. The symptom is
>
>  kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The bug
> triggers for page zones with some pages in the zone not assigned to
> this particular zone. In my case some pages that are marked as nomap
> were not reassigned to the new zone of node 1, so those are still
> assigned to node 0.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked nomap as invalid:
>
>  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>

These pages are owned by the firmware, which may map them with
attributes that conflict with the attributes we use for the linear
mapping. This means they should not be covered by the linear mapping.

> This causes pages marked as nomap being no long reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
>

This sounds like the root cause of your issue. Could we not fix that instead?

> Fixing this by restoring the old behavior of pfn_valid() to use
> memblock_is_memory().

This is incorrect imo. In general, pfn_valid() means ordinary memory
covered by the linear mapping and the struct page array. Returning
reserved ranges that the kernel should not even touch only to please
the NUMA code seems like an inappropriate way to deal with this issue.

> Also changing users of pfn_valid() in arm64 code
> to use memblock_is_map_memory() where necessary. This only affects
> code in ioremap.c. The code in mmu.c still can use the new version of
> pfn_valid().
>
> Should be marked stable v4.5..
>
> Signed-off-by: Robert Richter <rrichter@cavium.com>
> ---
>  arch/arm64/mm/init.c    | 2 +-
>  arch/arm64/mm/ioremap.c | 5 +++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index bbb7ee76e319..25b8659c2a9f 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>  #ifdef CONFIG_HAVE_ARCH_PFN_VALID
>  int pfn_valid(unsigned long pfn)
>  {
> -       return memblock_is_map_memory(pfn << PAGE_SHIFT);
> +       return memblock_is_memory(pfn << PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(pfn_valid);
>  #endif
> diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
> index 01e88c8bcab0..c17c220b0c48 100644
> --- a/arch/arm64/mm/ioremap.c
> +++ b/arch/arm64/mm/ioremap.c
> @@ -21,6 +21,7 @@
>   */
>
>  #include <linux/export.h>
> +#include <linux/memblock.h>
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  #include <linux/io.h>
> @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
>         /*
>          * Don't allow RAM to be mapped.
>          */
> -       if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> +       if (WARN_ON(memblock_is_map_memory(phys_addr)))
>                 return NULL;
>
>         area = get_vm_area_caller(size, VM_IOREMAP, caller);
> @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
>  void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
>  {
>         /* For normal memory we already have a cacheable mapping. */
> -       if (pfn_valid(__phys_to_pfn(phys_addr)))
> +       if (memblock_is_map_memory(phys_addr))
>                 return (void __iomem *)__phys_to_virt(phys_addr);
>
>         return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
> --
> 2.7.0.rc3
>
Richter, Robert Oct. 6, 2016, 4:11 p.m. UTC | #2
Ard,

thank you for your answer and your explanation.

On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > There is a memory setup problem on ThunderX systems with certain
> > memory configurations. The symptom is
> >
> >  kernel BUG at mm/page_alloc.c:1848!
> >
> > This happens for some configs with 64k page size enabled. The bug
> > triggers for page zones with some pages in the zone not assigned to
> > this particular zone. In my case some pages that are marked as nomap
> > were not reassigned to the new zone of node 1, so those are still
> > assigned to node 0.
> >
> > The reason for the mis-configuration is a change in pfn_valid() which
> > reports pages marked nomap as invalid:
> >
> >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> >
> 
> These pages are owned by the firmware, which may map it with
> attributes that conflict with the attributes we use for the linear
> mapping. This means they should not be covered by the linear mapping.
> 
> > This causes pages marked as nomap being no long reassigned to the new
> > zone in memmap_init_zone() by calling __init_single_pfn().
> >
> 
> This sounds like the root cause of your issue. Could we not fix that instead?

Yes, this is proposal b) from my last mail, which would work too: I
implemented an arm64-private early_pfn_valid() function that uses
memblock_is_memory() to set up all pages of a zone. Though I think
this is the wrong way and thus prefer this patch instead. I see
several reasons for this:

Inconsistent use of struct page: it is initialized but never used
again.

Other archs only do a basic range check in pfn_valid(); the default
implementation just returns whether the whole section is valid. As I
understand the code, if the memory range is not aligned to the
section, then there will be pfns in the section that have no physical
memory attached. Such a page is then just initialized; it is not
marked reserved, nor is its refcount non-zero. It is simply not used.
This is how no-map pages should be handled too.
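
For reference, the generic CONFIG_SPARSEMEM implementation in
include/linux/mmzone.h is roughly the following (paraphrased, details
differ between kernel versions); it only checks that the section
containing the pfn exists at all, not that every pfn in it has memory
behind it:

/* include/linux/mmzone.h (paraphrased), used when the arch does not
 * provide CONFIG_HAVE_ARCH_PFN_VALID: "valid" only means the section
 * has a mem_map, nothing more.
 */
static inline int pfn_valid(unsigned long pfn)
{
        if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
                return 0;
        return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
}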

I think pfn_valid() is just a quick check whether the pfn's struct
page can be used. There is a good description of this in
include/linux/mmzone.h. So there can be memory holes that have a
valid pfn.

If the no-map memory needs special handling, then additional checks
need to be added to the particular code (as in ioremap.c). It's imo
wrong to (mis-)use pfn_valid for that.

Variant b) involves generic mm code to fix an arm64 issue, while this
patch is an arm64-only change. That makes it harder to get a fix in.
(Though this may only be a problem of patch logistics.)
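
Roughly, the variant b) override would look like the sketch below. The
generic early_pfn_valid() hook in include/linux/mmzone.h defaults to
pfn_valid(); the header placement and exact plumbing here are only an
illustration, not the actual implementation:

/* Illustrative sketch only (hypothetical placement in an arm64 header
 * pulled in before linux/mmzone.h; needs memblock_is_memory() from
 * linux/memblock.h): override the generic early_pfn_valid() hook so
 * that memmap_init_zone() also initializes struct pages for NOMAP
 * ranges, while pfn_valid() itself keeps excluding them.
 */
static inline int early_pfn_valid(unsigned long pfn)
{
        /* include NOMAP regions, unlike pfn_valid() since 68709f45385a */
        return memblock_is_memory(pfn << PAGE_SHIFT);
}
#define early_pfn_valid early_pfn_valid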

> 
> > Fixing this by restoring the old behavior of pfn_valid() to use
> > memblock_is_memory().
> 
> This is incorrect imo. In general, pfn_valid() means ordinary memory
> covered by the linear mapping and the struct page array. Returning
> reserved ranges that the kernel should not even touch only to please
> the NUMA code seems like an inappropriate way to deal with this issue.

As said above, it is not marked as reserved; it is treated like
non-existing memory.

This has been observed for non-NUMA kernels too and can happen for
any zone that is only partly initialized.

I think the patch addresses your concerns. I can't see where the
kernel would use memory marked as nomap in a wrong way.

Thanks,

-Robert

> 
> > Also changing users of pfn_valid() in arm64 code
> > to use memblock_is_map_memory() where necessary. This only affects
> > code in ioremap.c. The code in mmu.c still can use the new version of
> > pfn_valid().
> >
> > Should be marked stable v4.5..
> >
> > Signed-off-by: Robert Richter <rrichter@cavium.com>
> > ---
> >  arch/arm64/mm/init.c    | 2 +-
> >  arch/arm64/mm/ioremap.c | 5 +++--
> >  2 files changed, 4 insertions(+), 3 deletions(-)
David Daney Oct. 10, 2016, 3:33 p.m. UTC | #3
On 10/06/2016 02:52 AM, Robert Richter wrote:
> There is a memory setup problem on ThunderX systems with certain
> memory configurations. The symptom is
>
>   kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The bug
> triggers for page zones with some pages in the zone not assigned to
> this particular zone. In my case some pages that are marked as nomap
> were not reassigned to the new zone of node 1, so those are still
> assigned to node 0.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked nomap as invalid:
>
>   68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>
> This causes pages marked as nomap being no long reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
>
> Fixing this by restoring the old behavior of pfn_valid() to use
> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
> to use memblock_is_map_memory() where necessary. This only affects
> code in ioremap.c. The code in mmu.c still can use the new version of
> pfn_valid().
>
> Should be marked stable v4.5..

In that case you should add:

Cc: <stable@vger.kernel.org> # 4.5.x-

here.


>
> Signed-off-by: Robert Richter <rrichter@cavium.com>
[...]
Robert Richter Oct. 17, 2016, 6:58 p.m. UTC | #4
Mark, Will, any opinion here?

Thanks,

-Robert

On 06.10.16 18:11:14, Robert Richter wrote:
> Ard,
> 
> thank you for your answer and you explanation.
> 
> On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> > On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > > There is a memory setup problem on ThunderX systems with certain
> > > memory configurations. The symptom is
> > >
> > >  kernel BUG at mm/page_alloc.c:1848!
> > >
> > > This happens for some configs with 64k page size enabled. The bug
> > > triggers for page zones with some pages in the zone not assigned to
> > > this particular zone. In my case some pages that are marked as nomap
> > > were not reassigned to the new zone of node 1, so those are still
> > > assigned to node 0.
> > >
> > > The reason for the mis-configuration is a change in pfn_valid() which
> > > reports pages marked nomap as invalid:
> > >
> > >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> > >
> > 
> > These pages are owned by the firmware, which may map it with
> > attributes that conflict with the attributes we use for the linear
> > mapping. This means they should not be covered by the linear mapping.
> > 
> > > This causes pages marked as nomap being no long reassigned to the new
> > > zone in memmap_init_zone() by calling __init_single_pfn().
> > >
> > 
> > This sounds like the root cause of your issue. Could we not fix that instead?
> 
> Yes, this is proposal b) from my last mail that would work too: I
> implemented an arm64 private early_pfn_valid() function that uses
> memblock_is_memory() to setup all pages of a zone. Though, I think
> this is the wrong way and thus I prefer this patch instead. I see
> serveral reasons for this:
> 
> Inconsistent use of struct *page, it is initialized but never used
> again.
> 
> Other archs only do a basic range check in pfn_valid(), the default
> implementation just returns if the whole section is valid. As I
> understand the code, if the mem range is not aligned to the section,
> then there will be pfn's in the section that don't have physical mem
> attached. The page is then just initialized, it's not marked reserved
> nor the refcount is non-zero. It is then simply not used. This is how
> no-map pages should be handled too.
> 
> I think pfn_valid() is just a quick check if the pfn's struct *page
> can be used. There is a good description for this in include/linux/
> mmzone.h. So there can be memory holes that have a valid pfn.
> 
> If the no-map memory needs special handling, then additional checks
> need to be added to the particular code (as in ioremap.c). It's imo
> wrong to (mis-)use pfn_valid for that.
> 
> Variant b) involves generic mm code to fix it for arm64, this patch is
> an arm64 change only. This makes it harder to get a fix for it.
> (Though maybe only a problem of patch logistics.)
> 
> > 
> > > Fixing this by restoring the old behavior of pfn_valid() to use
> > > memblock_is_memory().
> > 
> > This is incorrect imo. In general, pfn_valid() means ordinary memory
> > covered by the linear mapping and the struct page array. Returning
> > reserved ranges that the kernel should not even touch only to please
> > the NUMA code seems like an inappropriate way to deal with this issue.
> 
> As said above, it is not marked as reserved, it is treated like
> non-existing memory.
> 
> This has been observed for non-numa kernels too and can happen for
> each zone that is only partly initialized.
> 
> I think the patch addresses your concerns. I can't see there the
> kernel uses memory marked as nomap in a wrong way.
> 
> Thanks,
> 
> -Robert
> 
> > 
> > > Also changing users of pfn_valid() in arm64 code
> > > to use memblock_is_map_memory() where necessary. This only affects
> > > code in ioremap.c. The code in mmu.c still can use the new version of
> > > pfn_valid().
> > >
> > > Should be marked stable v4.5..
> > >
> > > Signed-off-by: Robert Richter <rrichter@cavium.com>
> > > ---
> > >  arch/arm64/mm/init.c    | 2 +-
> > >  arch/arm64/mm/ioremap.c | 5 +++--
> > >  2 files changed, 4 insertions(+), 3 deletions(-)
Mark Rutland Oct. 18, 2016, 10:18 a.m. UTC | #5
Hi Robert, Ard,

Sorry for the delay in getting to this; I've been travelling a lot
lately and in the meantime this managed to get buried in my inbox.

On Thu, Oct 06, 2016 at 06:11:14PM +0200, Robert Richter wrote:
> On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> > On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > > There is a memory setup problem on ThunderX systems with certain
> > > memory configurations. The symptom is
> > >
> > >  kernel BUG at mm/page_alloc.c:1848!
> > >
> > > This happens for some configs with 64k page size enabled. The bug
> > > triggers for page zones with some pages in the zone not assigned to
> > > this particular zone. In my case some pages that are marked as nomap
> > > were not reassigned to the new zone of node 1, so those are still
> > > assigned to node 0.
> > >
> > > The reason for the mis-configuration is a change in pfn_valid() which
> > > reports pages marked nomap as invalid:
> > >
> > >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> > 
> > These pages are owned by the firmware, which may map it with
> > attributes that conflict with the attributes we use for the linear
> > mapping. This means they should not be covered by the linear mapping.
> > 
> > > This causes pages marked as nomap being no long reassigned to the new
> > > zone in memmap_init_zone() by calling __init_single_pfn().

Why do we have pages for a nomap region? Given the region shouldn't be
in the linear mapping, and isn't suitable for general allocation, I
don't believe it makes sense to have a struct page for any part of it.

Am I missing some reason that we require a struct page?

e.g. is it just easier to allocate an unused struct page than to carve
it out?

> > This sounds like the root cause of your issue. Could we not fix that instead?
> 
> Yes, this is proposal b) from my last mail that would work too: I
> implemented an arm64 private early_pfn_valid() function that uses
> memblock_is_memory() to setup all pages of a zone. Though, I think
> this is the wrong way and thus I prefer this patch instead. I see
> serveral reasons for this:
> 
> Inconsistent use of struct *page, it is initialized but never used
> again.

As above, I don't believe we should have a struct page to initialise in
the first place.

> Other archs only do a basic range check in pfn_valid(), the default
> implementation just returns if the whole section is valid. As I
> understand the code, if the mem range is not aligned to the section,
> then there will be pfn's in the section that don't have physical mem
> attached. The page is then just initialized, it's not marked reserved
> nor the refcount is non-zero. It is then simply not used. This is how
> no-map pages should be handled too.
> 
> I think pfn_valid() is just a quick check if the pfn's struct *page
> can be used. There is a good description for this in include/linux/
> mmzone.h. So there can be memory holes that have a valid pfn.

I take it you mean the comment in the CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
ifdef (line 1266 in v4.9-rc1)?

I'm not sufficiently acquainted with the memmap code to follow; I'll
need to dig into that a bit further.

> If the no-map memory needs special handling, then additional checks
> need to be added to the particular code (as in ioremap.c). It's imo
> wrong to (mis-)use pfn_valid for that.
> 
> Variant b) involves generic mm code to fix it for arm64, this patch is
> an arm64 change only. This makes it harder to get a fix for it.
> (Though maybe only a problem of patch logistics.)
> 
> > > Fixing this by restoring the old behavior of pfn_valid() to use
> > > memblock_is_memory().
> > 
> > This is incorrect imo. In general, pfn_valid() means ordinary memory
> > covered by the linear mapping and the struct page array. Returning
> > reserved ranges that the kernel should not even touch only to please
> > the NUMA code seems like an inappropriate way to deal with this issue.
> 
> As said above, it is not marked as reserved, it is treated like
> non-existing memory.

I think Ard was using "reserved" in the more general sense than the
Linux-specific meaning. NOMAP is distinct from the Linux concept of
"reserved" memory, but is "reserved" in some sense.

Memory with NOMAP is meant to be treated as non-existent for the purpose
of the linear mapping (and thus for the purpose of struct page).

> This has been observed for non-numa kernels too and can happen for
> each zone that is only partly initialized.
> 
> I think the patch addresses your concerns. I can't see there the
> kernel uses memory marked as nomap in a wrong way.

I'll have to dig into this locally; I'm still not familiar enough with
this code to know what the right thing to do is.

Thanks,
Mark.
Richter, Robert Oct. 18, 2016, 3:02 p.m. UTC | #6
Mark,

thanks for your answer. See below. I also attached the full crash
dump.

On 18.10.16 11:18:36, Mark Rutland wrote:
> Hi Robert, Ard,
> 
> Sorry for the delay in getting to this; I've been travelling a lot
> lately and in the meantime this managed to get buried in my inbox.
> 
> On Thu, Oct 06, 2016 at 06:11:14PM +0200, Robert Richter wrote:
> > On 06.10.16 11:00:33, Ard Biesheuvel wrote:
> > > On 6 October 2016 at 10:52, Robert Richter <rrichter@cavium.com> wrote:
> > > > There is a memory setup problem on ThunderX systems with certain
> > > > memory configurations. The symptom is
> > > >
> > > >  kernel BUG at mm/page_alloc.c:1848!
> > > >
> > > > This happens for some configs with 64k page size enabled. The bug
> > > > triggers for page zones with some pages in the zone not assigned to
> > > > this particular zone. In my case some pages that are marked as nomap
> > > > were not reassigned to the new zone of node 1, so those are still
> > > > assigned to node 0.
> > > >
> > > > The reason for the mis-configuration is a change in pfn_valid() which
> > > > reports pages marked nomap as invalid:
> > > >
> > > >  68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
> > > 
> > > These pages are owned by the firmware, which may map it with
> > > attributes that conflict with the attributes we use for the linear
> > > mapping. This means they should not be covered by the linear mapping.
> > > 
> > > > This causes pages marked as nomap being no long reassigned to the new
> > > > zone in memmap_init_zone() by calling __init_single_pfn().
> 
> Why do we have pages for a nomap region? Given the region shouldn't be
> in the linear mapping, and isn't suitable for general allocation, I
> don't believe it makes sense to have a struct page for any part of it.
> 
> Am I missing some reason that we require a struct page?
> 
> e.g. is it just easier to allocate an unused struct page than to carve
> it out?

Pages are handled in blocks of size MAX_ORDER_NR_PAGES. The start and
end pfn of a memory region are then aligned to fit the mem block size
(see memmap_init_zone()). Therefore a mem block may contain pages
without underlying physical memory.

mm code requires the whole memmap to be initialized; this means that
for each page in the whole mem block there is a valid struct page.
See e.g. move_freepages_block() and move_freepages(): struct page is
accessed even before pfn_valid() is used. I assume there are other
occurrences of that too.
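
This is roughly the code path where the BUG above fires (paraphrased
fragment from mm/page_alloc.c around v4.8; exact line numbers differ):
the node stored in the struct page is read for every page of the block
before pfn_valid_within()/pfn_valid() is consulted, so an uninitialized
struct page with stale node/zone information trips the check.

/* move_freepages(), paraphrased fragment from mm/page_alloc.c (~v4.8) */
for (page = start_page; page <= end_page;) {
        /* Make sure we are not inadvertently changing nodes */
        VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);

        if (!pfn_valid_within(page_to_pfn(page))) {
                page++;
                continue;
        }
        /* ... rest of loop: move the page and advance by 1 << order ... */
}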

My interpretation is that pfn_valid() checks for the existence of a
valid struct page; there need not necessarily be physical memory
mapped to it. This is the reason why I changed pfn_valid() to use
memblock_is_memory(), which is sufficient for generic mm code. Only
in arm64 mm code did I additionally add the memblock_is_map_memory()
check where pfn_valid() was used.
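
The difference between the two memblock checks boils down to the NOMAP
flag (paraphrased from mm/memblock.c of that timeframe): both look up
the address in memblock.memory, and only the _map_ variant additionally
rejects regions flagged MEMBLOCK_NOMAP.

/* mm/memblock.c (paraphrased) */
int memblock_is_memory(phys_addr_t addr)
{
        return memblock_search(&memblock.memory, addr) != -1;
}

int memblock_is_map_memory(phys_addr_t addr)
{
        int i = memblock_search(&memblock.memory, addr);

        if (i == -1)
                return false;
        return !memblock_is_nomap(&memblock.memory.regions[i]);
}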

> 
> > > This sounds like the root cause of your issue. Could we not fix that instead?
> > 
> > Yes, this is proposal b) from my last mail that would work too: I
> > implemented an arm64 private early_pfn_valid() function that uses
> > memblock_is_memory() to setup all pages of a zone. Though, I think
> > this is the wrong way and thus I prefer this patch instead. I see
> > serveral reasons for this:
> > 
> > Inconsistent use of struct *page, it is initialized but never used
> > again.
> 
> As above, I don't believe we should have a struct page to initialise in
> the first place.
> 
> > Other archs only do a basic range check in pfn_valid(), the default
> > implementation just returns if the whole section is valid. As I
> > understand the code, if the mem range is not aligned to the section,
> > then there will be pfn's in the section that don't have physical mem
> > attached. The page is then just initialized, it's not marked reserved
> > nor the refcount is non-zero. It is then simply not used. This is how
> > no-map pages should be handled too.
> > 
> > I think pfn_valid() is just a quick check if the pfn's struct *page
> > can be used. There is a good description for this in include/linux/
> > mmzone.h. So there can be memory holes that have a valid pfn.
> 
> I take it you mean the comment in the CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
> ifdef (line 1266 in v4.9-rc1)?

Yes.

> 
> I'm not sufficiently acquainted with the memmap code to follow; I'll
> need to dig into that a bit further.
> 
> > If the no-map memory needs special handling, then additional checks
> > need to be added to the particular code (as in ioremap.c). It's imo
> > wrong to (mis-)use pfn_valid for that.
> > 
> > Variant b) involves generic mm code to fix it for arm64, this patch is
> > an arm64 change only. This makes it harder to get a fix for it.
> > (Though maybe only a problem of patch logistics.)
> > 
> > > > Fixing this by restoring the old behavior of pfn_valid() to use
> > > > memblock_is_memory().
> > > 
> > > This is incorrect imo. In general, pfn_valid() means ordinary memory
> > > covered by the linear mapping and the struct page array. Returning
> > > reserved ranges that the kernel should not even touch only to please
> > > the NUMA code seems like an inappropriate way to deal with this issue.
> > 
> > As said above, it is not marked as reserved, it is treated like
> > non-existing memory.
> 
> I think Ard was using "reserved" in the more general sense than the
> Linux-specific meaning. NOMAP is distinct from the Linux concept of
> "reserved" memory, but is "reserved" in some sense.
> 
> Memory with NOMAP is meant to be treated as non-existent for the purpose
> of the linear mapping (and thus for the purpose of struct page).

Yes, it's not marked reserved and can never be freed.

Thanks,

-Robert

> 
> > This has been observed for non-numa kernels too and can happen for
> > each zone that is only partly initialized.
> > 
> > I think the patch addresses your concerns. I can't see there the
> > kernel uses memory marked as nomap in a wrong way.
> 
> I'll have to dig into this locally; I'm still not familiar enough with
> this code to know what the right thing to do is.
> 
> Thanks,
> Mark.

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
Booting Linux on physical CPU 0x0
Linux version 4.8.0-rc4-00267-g664e88c (root@crb2spass2rric.localdomain) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.1) ) #3 SMP Wed Sep 7 06:38:13 UTC 2016
Boot CPU: AArch64 Processor [431f0a10]
efi: Getting EFI parameters from FDT:
efi: EFI v2.40 by Cavium Thunder cn88xx EFI ThunderX-Firmware-Release-1.22.10-0-g4e85766 Aug 24 2016 15:59:03
efi:  ACPI=0xfffff000  ACPI 2.0=0xfffff014  SMBIOS 3.0=0x10ffafcf000 
cma: Reserved 512 MiB at 0x00000000c0000000
NUMA: Adding memblock [0x1400000 - 0xfffffffff] on node 0
NUMA: Adding memblock [0x10000400000 - 0x10fffffffff] on node 1
NUMA: parsing numa-distance-map-v1
NUMA: Initmem setup node 0 [mem 0x01400000-0xfffffffff]
NUMA: NODE_DATA [mem 0xfffff2580-0xfffffffff]
NUMA: Initmem setup node 1 [mem 0x10000400000-0x10fffffffff]
NUMA: NODE_DATA [mem 0x10ffffa2500-0x10ffffaff7f]
kmemleak: Kernel memory leak detector disabled
Zone ranges:
  DMA      [mem 0x0000000001400000-0x00000000ffffffff]
  Normal   [mem 0x0000000100000000-0x0000010fffffffff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000001400000-0x00000000fffdffff]
  node   0: [mem 0x00000000fffe0000-0x00000000ffffffff]
  node   0: [mem 0x0000000100000000-0x0000000fffffffff]
  node   1: [mem 0x0000010000400000-0x0000010ff9e8ffff]
  node   1: [mem 0x0000010ff9e90000-0x0000010ff9f2ffff]
  node   1: [mem 0x0000010ff9f30000-0x0000010ffaeaffff]
  node   1: [mem 0x0000010ffaeb0000-0x0000010ffaffffff]
  node   1: [mem 0x0000010ffb000000-0x0000010ffffaffff]
  node   1: [mem 0x0000010ffffb0000-0x0000010fffffffff]
Initmem setup node 0 [mem 0x0000000001400000-0x0000000fffffffff]
Initmem setup node 1 [mem 0x0000010000400000-0x0000010fffffffff]
psci: probing for conduit method from DT.
psci: PSCIv0.2 detected in firmware.
psci: Using standard PSCI v0.2 function IDs
psci: Trusted OS resident on physical CPU 0x0
Number of cores (96) exceeds configured maximum of 8 - clipping
percpu: Embedded 3 pages/cpu @ffffff0fffce0000 s117704 r8192 d70712 u196608
Detected VIPT I-cache on CPU0
CPU features: enabling workaround for Cavium erratum 27456
Built 2 zonelists in Node order, mobility grouping on.  Total pages: 2094720
Policy zone: Normal
Kernel command line: BOOT_IMAGE=/vmlinuz root=UUID=9a418ae5-231a-40c7-b7ba-ca70cfadfc5e ro LANG=en_US.UTF-8
PID hash table entries: 4096 (order: -1, 32768 bytes)
software IO TLB [mem 0xfbfd0000-0xfffd0000] (64MB) mapped at [fffffe00fbfd0000-fffffe00fffcffff]
Memory: 133196160K/134193152K available (9084K kernel code, 1545K rwdata, 3712K rodata, 1536K init, 15826K bss, 472704K reserved, 524288K cma-reserved)
Virtual kernel memory layout:
    modules : 0xfffffc0000000000 - 0xfffffc0008000000   (   128 MB)
    vmalloc : 0xfffffc0008000000 - 0xfffffdff5fff0000   (  2045 GB)
      .text : 0xfffffc0008080000 - 0xfffffc0008960000   (  9088 KB)
    .rodata : 0xfffffc0008960000 - 0xfffffc0008d10000   (  3776 KB)
      .init : 0xfffffc0008d10000 - 0xfffffc0008e90000   (  1536 KB)
      .data : 0xfffffc0008e90000 - 0xfffffc0009012600   (  1546 KB)
       .bss : 0xfffffc0009012600 - 0xfffffc0009f86fc8   ( 15827 KB)
    fixed   : 0xfffffdff7e7d0000 - 0xfffffdff7ec00000   (  4288 KB)
    PCI I/O : 0xfffffdff7ee00000 - 0xfffffdff7fe00000   (    16 MB)
    vmemmap : 0xfffffdff80000000 - 0xfffffe0000000000   (     2 GB maximum)
              0xfffffdff80005000 - 0xfffffdffc4000000   (  1087 MB actual)
    memory  : 0xfffffe0001400000 - 0xffffff1000000000   (1114092 MB)
SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=8, Nodes=2
Running RCU self tests
Hierarchical RCU implementation.
	RCU lockdep checking is enabled.
	Build-time adjustment of leaf fanout to 64.
NR_IRQS:64 nr_irqs:64 0
GICv3: GIC: Using split EOI/Deactivate mode
ITS: /interrupt-controller@801000000000/gic-its@801000020000
ITS@0x0000801000020000: allocated 2097152 Devices @ff5000000 (flat, esz 8, psz 64K, shr 1)
ITS: /interrupt-controller@801000000000/gic-its@901000020000
ITS@0x0000901000020000: allocated 2097152 Devices @ffa000000 (flat, esz 8, psz 64K, shr 1)
GIC: using LPI property table @0x0000000ff4140000
ITS: Allocated 32512 chunks for LPIs
GICv3: CPU0: found redistributor 0 region 0:0x0000801080000000
CPU0: using LPI pending table @0x0000000ff4150000
arm_arch_timer: Architected cp15 timer(s) running at 100.00MHz (phys).
clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x171024e7e0, max_idle_ns: 440795205315 ns
sched_clock: 56 bits at 100MHz, resolution 10ns, wraps every 4398046511100ns
Console: colour dummy device 80x25
console [tty0] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:  8
... MAX_LOCK_DEPTH:          48
... MAX_LOCKDEP_KEYS:        8191
... CLASSHASH_SIZE:          4096
... MAX_LOCKDEP_ENTRIES:     32768
... MAX_LOCKDEP_CHAINS:      65536
... CHAINHASH_SIZE:          32768
 memory used by lock dependency info: 8159 kB
 per task-struct memory footprint: 1920 bytes
mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
Calibrating delay loop (skipped), value calculated using timer frequency.. 200.00 BogoMIPS (lpj=100000)
pid_max: default: 32768 minimum: 301
Security Framework initialized
Yama: becoming mindful.
SELinux:  Initializing.
Dentry cache hash table entries: 16777216 (order: 11, 134217728 bytes)
Inode-cache hash table entries: 8388608 (order: 10, 67108864 bytes)
Mount-cache hash table entries: 262144 (order: 5, 2097152 bytes)
Mountpoint-cache hash table entries: 262144 (order: 5, 2097152 bytes)
ftrace: allocating 29244 entries in 8 pages
Unable to find CPU node for /cpus/cpu@8
/cpus/cpu-map/cluster0/core8: Can't get CPU for leaf core
ASID allocator initialised with 65536 entries
PCI/MSI: /interrupt-controller@801000000000/gic-its@801000020000 domain created
PCI/MSI: /interrupt-controller@801000000000/gic-its@901000020000 domain created
Platform MSI: /interrupt-controller@801000000000/gic-its@801000020000 domain created
Platform MSI: /interrupt-controller@801000000000/gic-its@901000020000 domain created
Remapping and enabling EFI services.
UEFI Runtime regions are not aligned to 64 KB -- buggy firmware?
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at arch/arm64/kernel/efi.c:34 efi_create_mapping+0x5c/0x11c
Modules linked in:

CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-rc4-00267-g664e88c #3
Hardware name: Cavium ThunderX CN88XX board (DT)
task: fffffe0fee55a200 task.stack: ffffff00004a4000
PC is at efi_create_mapping+0x5c/0x11c
LR is at efi_create_mapping+0x5c/0x11c
pc : [<fffffc0008d14f68>] lr : [<fffffc0008d14f68>] pstate: 60000045
sp : ffffff00004a7d00
x29: ffffff00004a7d00 x28: 0000000000000000 
x27: 0000000000000000 x26: 0000000000000000 
x25: 0000000000000000 x24: fffffc0008f9c730 
x23: fffffc0008f9c730 x22: fffffc0008c60ac0 
x21: fffffc0008f9c730 x20: 00c8000000000713 
x19: ffffff0ff76b0c08 x18: 0000000000000010 
x17: 00000000c9cbbfc7 x16: 0000000000000000 
x15: fffffc0089c956f7 x14: 3f657261776d7269 
x13: 6620796767756220 x12: 2d2d20424b203436 
x11: 206f742064656e67 x10: 0000000000000077 
x9 : 65726120736e6f69 x8 : 0000000000000001 
x7 : ffffff00004a4000 x6 : fffffc000814a288 
x5 : 0000000000000000 x4 : 0000000000000001 
x3 : 0000000000000001 x2 : ffffff00004a4000 
x1 : fffffe0fee55a200 x0 : 0000000000000040 

---[ end trace fcdd7f8cb96cdb3b ]---
Call trace:
Exception stack(0xffffff00004a7b20 to 0xffffff00004a7c50)
7b20: ffffff0ff76b0c08 0000040000000000 ffffff00004a7d00 fffffc0008d14f68
7b40: 0000000060000045 000000000000003d 0000000000000001 fffffc000814a9fc
7b60: ffffff00004a7c00 fffffc000814adcc ffffff00004a7c60 fffffc0008baf420
7b80: fffffc0008f9c730 fffffc0008c60ac0 fffffc0008f9c730 fffffc0008f9c730
7ba0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7bc0: fffffc00080d6d70 0000000000000000 0000000000000040 fffffe0fee55a200
7be0: ffffff00004a4000 0000000000000001 0000000000000001 0000000000000000
7c00: fffffc000814a288 ffffff00004a4000 0000000000000001 65726120736e6f69
7c20: 0000000000000077 206f742064656e67 2d2d20424b203436 6620796767756220
7c40: 3f657261776d7269 fffffc0089c956f7
[<fffffc0008d14f68>] efi_create_mapping+0x5c/0x11c
[<fffffc0008d5e1ec>] arm_enable_runtime_services+0x12c/0x210
[<fffffc0008082ff4>] do_one_initcall+0x44/0x138
[<fffffc0008d10d30>] kernel_init_freeable+0x17c/0x2e0
[<fffffc000893c888>] kernel_init+0x20/0xf8
[<fffffc0008082b80>] ret_from_fork+0x10/0x50
  EFI remap 0x0000010ff9e98000 => 0000000020008000
  EFI remap 0x0000010ffaeb6000 => 00000000200a6000
  EFI remap 0x0000010ffafc9000 => 00000000201b9000
  EFI remap 0x0000010ffafcd000 => 00000000201bd000
  EFI remap 0x0000010ffffb9000 => 00000000201f9000
  EFI remap 0x0000010ffffcd000 => 000000002020d000
  EFI remap 0x0000804000001000 => 0000000020241000
  EFI remap 0x000087e0d0001000 => 0000000020251000
Detected VIPT I-cache on CPU1
GICv3: CPU1: found redistributor 1 region 0:0x0000801080020000
CPU1: using LPI pending table @0x0000010014060000
CPU1: Booted secondary processor [431f0a10]
------------[ cut here ]------------
WARNING: CPU: 1 PID: 0 at ./include/linux/cpumask.h:121 gic_raise_softirq+0x14c/0x1e0
Modules linked in:

CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.8.0-rc4-00267-g664e88c #3
Hardware name: Cavium ThunderX CN88XX board (DT)
task: fffffe0fee57b600 task.stack: fffffe0fe4038000
PC is at gic_raise_softirq+0x14c/0x1e0
LR is at gic_raise_softirq+0xc0/0x1e0
pc : [<fffffc00084c07a4>] lr : [<fffffc00084c0718>] pstate: 600001c5
sp : fffffe0fe403bcd0
x29: fffffe0fe403bcd0 x28: 0000000000000001 
x27: fffffc0008fd7b1d x26: 0000000000000000 
x25: fffffc0008ec2000 x24: fffffc0008ecebc8 
x23: fffffc0008998688 x22: 0000000000000001 
x21: fffffc0008ec2e34 x20: 0000000000000000 
x19: 0000000000000008 x18: 0000000000000030 
x17: 0000000000000000 x16: 0000000000000000 
x15: fffffc0009c97f08 x14: 0000000000000002 
x13: 0000000000000332 x12: 0000000000000335 
x11: fffffc0009bf3de0 x10: 0000000000000332 
x9 : fffffe0fe4038000 x8 : 0000000000000000 
x7 : 00000000ffffffff x6 : 0000000000000000 
x5 : fffffffffffffffe x4 : 0000000000000040 
x3 : 0000000000000000 x2 : 0000000000000000 
x1 : 0000000000000008 x0 : 0000000000000001 

---[ end trace fcdd7f8cb96cdb3c ]---
Call trace:
Exception stack(0xfffffe0fe403baf0 to 0xfffffe0fe403bc20)
bae0:                                   0000000000000008 0000040000000000
bb00: fffffe0fe403bcd0 fffffc00084c07a4 00000000600001c5 000000000000003d
bb20: fffffe0fee57bec0 ffffffffffffffd8 0000000000000028 fffffe0fee57be70
bb40: fffffc000925fe88 0000000000000002 fffffc00080886c0 fffffe0fee57b600
bb60: fffffe0fe403bb90 fffffc0008088974 fffffc00099b4d60 fffffe0fee57b600
bb80: fffffc000925fe88 fffffe0fe403bd4c fffffe0fe403bbe0 fffffc00080889e8
bba0: 0000000000000001 0000000000000008 0000000000000000 0000000000000000
bbc0: 0000000000000040 fffffffffffffffe 0000000000000000 00000000ffffffff
bbe0: 0000000000000000 fffffe0fe4038000 0000000000000332 fffffc0009bf3de0
bc00: 0000000000000335 0000000000000332 0000000000000002 fffffc0009c97f08
[<fffffc00084c07a4>] gic_raise_softirq+0x14c/0x1e0
[<fffffc000808e028>] smp_cross_call+0x80/0x238
[<fffffc000808efa4>] smp_send_reschedule+0x3c/0x48
[<fffffc00081021a8>] resched_curr+0x58/0x90
[<fffffc0008103fd8>] check_preempt_curr+0x70/0xe8
[<fffffc0008104090>] ttwu_do_wakeup+0x40/0x2e8
[<fffffc00081043cc>] ttwu_do_activate+0x94/0xc0
[<fffffc00081058a8>] try_to_wake_up+0x200/0x580
[<fffffc0008105d3c>] default_wake_function+0x34/0x48
[<fffffc0008126afc>] __wake_up_common+0x64/0xa8
[<fffffc0008126bec>] __wake_up_locked+0x3c/0x50
[<fffffc0008127b58>] complete+0x48/0x68
[<fffffc000808e630>] secondary_start_kernel+0x180/0x208
[<0000000001482e08>] 0x1482e08
Detected VIPT I-cache on CPU2
GICv3: CPU2: found redistributor 2 region 0:0x0000801080040000
CPU2: using LPI pending table @0x0000010014070000
CPU2: Booted secondary processor [431f0a10]
Detected VIPT I-cache on CPU3
GICv3: CPU3: found redistributor 3 region 0:0x0000801080060000
CPU3: using LPI pending table @0x0000010014080000
CPU3: Booted secondary processor [431f0a10]
Detected VIPT I-cache on CPU4
GICv3: CPU4: found redistributor 4 region 0:0x0000801080080000
CPU4: using LPI pending table @0x0000010014090000
CPU4: Booted secondary processor [431f0a10]
Detected VIPT I-cache on CPU5
GICv3: CPU5: found redistributor 5 region 0:0x00008010800a0000
CPU5: using LPI pending table @0x00000100140a0000
CPU5: Booted secondary processor [431f0a10]
Detected VIPT I-cache on CPU6
GICv3: CPU6: found redistributor 6 region 0:0x00008010800c0000
CPU6: using LPI pending table @0x00000100140b0000
CPU6: Booted secondary processor [431f0a10]
Detected VIPT I-cache on CPU7
GICv3: CPU7: found redistributor 7 region 0:0x00008010800e0000
CPU7: using LPI pending table @0x00000100140c0000
CPU7: Booted secondary processor [431f0a10]
Brought up 8 CPUs
SMP: Total of 8 processors activated.
CPU features: detected feature: GIC system register CPU interface
CPU features: detected feature: LSE atomic instructions
CPU features: detected feature: Software prefetching using PRFM
CPU: All CPU(s) started at EL2
alternatives: patching kernel code
devtmpfs: initialized
SMBIOS 3.0.0 present.
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
atomic64_test: passed
pinctrl core: initialized pinctrl subsystem
NET: Registered protocol family 16
cpuidle: using governor menu
vdso: 2 pages (1 code @ fffffc0008980000, 1 data @ fffffc0008eb0000)
hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
DMA: preallocated 256 KiB pool for atomic allocations
Serial: AMBA PL011 UART driver
87e024000000.serial: ttyAMA0 at MMIO 0x87e024000000 (irq = 7, base_baud = 0) is a PL011 rev3
console [ttyAMA0] enabled
87e025000000.serial: ttyAMA1 at MMIO 0x87e025000000 (irq = 8, base_baud = 0) is a PL011 rev3
OF: /soc@0/pci@848000000000: arguments longer than property
OF: /soc@0/pci@849000000000: arguments longer than property
OF: /soc@0/pci@84a000000000: arguments longer than property
OF: /soc@0/pci@84b000000000: arguments longer than property
OF: /soc@0/pci@87e0c0000000: arguments longer than property
OF: /soc@100000000000/pci@848000000000: arguments longer than property
OF: /soc@100000000000/pci@849000000000: arguments longer than property
OF: /soc@100000000000/pci@84a000000000: arguments longer than property
OF: /soc@100000000000/pci@84b000000000: arguments longer than property
HugeTLB registered 2 MB page size, pre-allocated 0 pages
HugeTLB registered 512 MB page size, pre-allocated 0 pages
ACPI: Interpreter disabled.
arm-smmu 830000000000.smmu0: probing hardware configuration...
arm-smmu 830000000000.smmu0: SMMUv2 with:
arm-smmu 830000000000.smmu0: 	stage 1 translation
arm-smmu 830000000000.smmu0: 	stage 2 translation
arm-smmu 830000000000.smmu0: 	nested translation
arm-smmu 830000000000.smmu0: 	non-coherent table walk
arm-smmu 830000000000.smmu0: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 830000000000.smmu0: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 830000000000.smmu0: 	128 context banks (0 stage-2 only)
arm-smmu 830000000000.smmu0: 	Supported page sizes: 0x62215000
arm-smmu 830000000000.smmu0: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 830000000000.smmu0: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 830000000000.smmu0: registered 1 master devices
arm-smmu 831000000000.smmu1: probing hardware configuration...
arm-smmu 831000000000.smmu1: SMMUv2 with:
arm-smmu 831000000000.smmu1: 	stage 1 translation
arm-smmu 831000000000.smmu1: 	stage 2 translation
arm-smmu 831000000000.smmu1: 	nested translation
arm-smmu 831000000000.smmu1: 	non-coherent table walk
arm-smmu 831000000000.smmu1: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 831000000000.smmu1: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 831000000000.smmu1: 	128 context banks (0 stage-2 only)
arm-smmu 831000000000.smmu1: 	Supported page sizes: 0x62215000
arm-smmu 831000000000.smmu1: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 831000000000.smmu1: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 831000000000.smmu1: registered 2 master devices
arm-smmu 832000000000.smmu2: probing hardware configuration...
arm-smmu 832000000000.smmu2: SMMUv2 with:
arm-smmu 832000000000.smmu2: 	stage 1 translation
arm-smmu 832000000000.smmu2: 	stage 2 translation
arm-smmu 832000000000.smmu2: 	nested translation
arm-smmu 832000000000.smmu2: 	non-coherent table walk
arm-smmu 832000000000.smmu2: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 832000000000.smmu2: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 832000000000.smmu2: 	128 context banks (0 stage-2 only)
arm-smmu 832000000000.smmu2: 	Supported page sizes: 0x62215000
arm-smmu 832000000000.smmu2: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 832000000000.smmu2: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 832000000000.smmu2: registered 1 master devices
arm-smmu 833000000000.smmu3: probing hardware configuration...
arm-smmu 833000000000.smmu3: SMMUv2 with:
arm-smmu 833000000000.smmu3: 	stage 1 translation
arm-smmu 833000000000.smmu3: 	stage 2 translation
arm-smmu 833000000000.smmu3: 	nested translation
arm-smmu 833000000000.smmu3: 	non-coherent table walk
arm-smmu 833000000000.smmu3: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 833000000000.smmu3: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 833000000000.smmu3: 	128 context banks (0 stage-2 only)
arm-smmu 833000000000.smmu3: 	Supported page sizes: 0x62215000
arm-smmu 833000000000.smmu3: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 833000000000.smmu3: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 833000000000.smmu3: registered 1 master devices
arm-smmu 930000000000.smmu4: probing hardware configuration...
arm-smmu 930000000000.smmu4: SMMUv2 with:
arm-smmu 930000000000.smmu4: 	stage 1 translation
arm-smmu 930000000000.smmu4: 	stage 2 translation
arm-smmu 930000000000.smmu4: 	nested translation
arm-smmu 930000000000.smmu4: 	non-coherent table walk
arm-smmu 930000000000.smmu4: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 930000000000.smmu4: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 930000000000.smmu4: 	128 context banks (0 stage-2 only)
arm-smmu 930000000000.smmu4: 	Supported page sizes: 0x62215000
arm-smmu 930000000000.smmu4: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 930000000000.smmu4: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 930000000000.smmu4: registered 1 master devices
arm-smmu 931000000000.smmu5: probing hardware configuration...
arm-smmu 931000000000.smmu5: SMMUv2 with:
arm-smmu 931000000000.smmu5: 	stage 1 translation
arm-smmu 931000000000.smmu5: 	stage 2 translation
arm-smmu 931000000000.smmu5: 	nested translation
arm-smmu 931000000000.smmu5: 	non-coherent table walk
arm-smmu 931000000000.smmu5: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 931000000000.smmu5: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 931000000000.smmu5: 	128 context banks (0 stage-2 only)
arm-smmu 931000000000.smmu5: 	Supported page sizes: 0x62215000
arm-smmu 931000000000.smmu5: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 931000000000.smmu5: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 931000000000.smmu5: registered 1 master devices
arm-smmu 932000000000.smmu6: probing hardware configuration...
arm-smmu 932000000000.smmu6: SMMUv2 with:
arm-smmu 932000000000.smmu6: 	stage 1 translation
arm-smmu 932000000000.smmu6: 	stage 2 translation
arm-smmu 932000000000.smmu6: 	nested translation
arm-smmu 932000000000.smmu6: 	non-coherent table walk
arm-smmu 932000000000.smmu6: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 932000000000.smmu6: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 932000000000.smmu6: 	128 context banks (0 stage-2 only)
arm-smmu 932000000000.smmu6: 	Supported page sizes: 0x62215000
arm-smmu 932000000000.smmu6: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 932000000000.smmu6: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 932000000000.smmu6: registered 1 master devices
arm-smmu 933000000000.smmu7: probing hardware configuration...
arm-smmu 933000000000.smmu7: SMMUv2 with:
arm-smmu 933000000000.smmu7: 	stage 1 translation
arm-smmu 933000000000.smmu7: 	stage 2 translation
arm-smmu 933000000000.smmu7: 	nested translation
arm-smmu 933000000000.smmu7: 	non-coherent table walk
arm-smmu 933000000000.smmu7: 	(IDR0.CTTW overridden by dma-coherent property)
arm-smmu 933000000000.smmu7: 	stream matching with 128 register groups, mask 0x7fff
arm-smmu 933000000000.smmu7: 	128 context banks (0 stage-2 only)
arm-smmu 933000000000.smmu7: 	Supported page sizes: 0x62215000
arm-smmu 933000000000.smmu7: 	Stage-1: 48-bit VA -> 48-bit IPA
arm-smmu 933000000000.smmu7: 	Stage-2: 48-bit IPA -> 48-bit PA
arm-smmu 933000000000.smmu7: registered 1 master devices
iommu: Adding device 848000000000.pci to group 0
iommu: Adding device 849000000000.pci to group 1
iommu: Adding device 84a000000000.pci to group 2
iommu: Adding device 84b000000000.pci to group 3
iommu: Adding device 88001f000000.pci to group 4
iommu: Adding device 948000000000.pci to group 5
iommu: Adding device 949000000000.pci to group 6
iommu: Adding device 94a000000000.pci to group 7
iommu: Adding device 94b000000000.pci to group 8
vgaarb: loaded
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
DMA-API: preallocated 4096 debug entries
DMA-API: debugging enabled by kernel config
clocksource: Switched to clocksource arch_sys_counter
VFS: Disk quotas dquot_6.6.0
VFS: Dquot-cache hash table entries: 8192 (order 0, 65536 bytes)
pnp: PnP ACPI: disabled
NET: Registered protocol family 2
TCP established hash table entries: 524288 (order: 6, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 6, 4194304 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
UDP hash table entries: 65536 (order: 7, 10485760 bytes)
UDP-Lite hash table entries: 65536 (order: 7, 10485760 bytes)
NET: Registered protocol family 1
Unpacking initramfs...
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:1844!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
CPU: 4 PID: 1 Comm: swapper/0 Tainted: G        W       4.8.0-rc4-00267-g664e88c #3
Hardware name: www.cavium.com ThunderX CRB-2S/ThunderX CRB-2S, BIOS 0.3 Aug 24 2016
task: fffffe0fee55a200 task.stack: ffffff00004a4000
PC is at move_freepages+0x158/0x168
LR is at move_freepages_block+0xac/0xc0
pc : [<fffffc0008227618>] lr : [<fffffc00082276d4>] pstate: 800000c5
sp : ffffff00004a7480
x29: ffffff00004a7480 x28: fffffdffc3fc0020 
x27: fffffdffc3fc0000 x26: 000000000000000a 
x25: 000000000000000a x24: ffffff0ffffa3190 
x23: ffffff0fffdbf530 x22: 0000000000000000 
x21: fffffdffc3ffffc0 x20: ffffff0ffffa2c80 
x19: fffffdffc3f80000 x18: fffffe0fdc0c2548 
x17: 0000000000000000 x16: 0000000100000000 
x15: 00000000000000c1 x14: 04000000001b2000 
x13: 00aedf01c7040000 x12: 00001b00000033f5 
x11: 01c604000000001b x10: 0000ae7401c50430 
x9 : 0000000000000000 x8 : fffffc0008f53000 
x7 : fffffc00082295f0 x6 : 0000000000000001 
x5 : 0000000000000000 x4 : fffffe0fffff2580 
x3 : ffffff0ffffa2500 x2 : 0000000000000001 
x1 : fffffe0fffff2580 x0 : ffffff0ffffa2c80 

Process swapper/0 (pid: 1, stack limit = 0xffffff00004a4020)
Stack: (0xffffff00004a7480 to 0xffffff00004a8000)
7480: ffffff00004a74e0 fffffc00082276d4 0000000000000000 ffffff0ffffa2c80
74a0: fffffdffc3f80000 0000000000000000 ffffff0fffdbf530 ffffff0ffffa3190
74c0: 000000000000000a 000000000000000a fffffdffc3fc0000 fffffdffc3fc0020
74e0: ffffff00004a7510 fffffc0008227f00 ffffff0ffffa2c80 0000000000000000
7500: fffffdffc3fc0000 0000000000000028 ffffff00004a75b0 fffffc0008229654
7520: ffffff0fffdbf530 ffffff0ffffa2c80 fffffc0008e5f520 0000000000000000
7540: ffffff0fffdbf530 ffffff0fffdbf520 0000030ff6f60000 0000000000000001
7560: 0000000000000000 0000000000000000 ffffff0ffffa3200 fffffc00082295f0
7580: fffffc0008e5f520 fffffc0008ec1000 0000000000000000 0000000000000001
75a0: 0000000000000010 01ffff0ffffa2c80 ffffff00004a76a0 fffffc000822a5ec
75c0: 00000000024200c2 0000000000000000 0000000000000000 00000000024200c2
75e0: fffffc0008ec4000 0000000000000001 0000000000000000 fffffc0008ec2000
7600: 00000000024200c2 fffffc00082886c8 ffffffffffffffff 0000000000000040
7620: 00000012004a7640 fffffc0008e587e8 0000000100000000 0000000000000000
7640: ffffff00024200c2 fffffc00081007bc ffffff0ffffa2c80 0000000000000010
7660: fffffc0008e5f520 0000000000000040 0000000000000000 ffffff0ffffa3200
7680: 0000000000000001 ffffff0ffffa3b80 ffffff0ffffa3b80 ffffff00004a77b8
76a0: ffffff00004a77e0 fffffc00082886c8 0000000000000000 fffffc0008ec0c48
76c0: ffffff0ffffa3b80 0000000000000000 0000000000000000 0000000000000001
76e0: ffffff00004a4000 0000000000000000 fffffe0fee55a200 00000000000053d4
7700: fffffe0fee55a200 0000000000000000 0000000000000000 fffffc0008eec160
7720: 0000000000000000 0000000000000002 fffffe0fee55aa98 fffffc0008ec2000
7740: fffffe0fee55a200 fffffc0009c95000 0000000000000000 fffffc0008eec160
7760: 0000000000000000 fffffc0008e587e8 0000000000000000 0000000100400000
7780: 0000000000000000 0000000000000000 0000000000000000 0000000000000002
77a0: 0000000000000000 0000000000000000 fffffc000821d070 ffffff0ffffa3b80
77c0: 0000000000000000 ffffff0ffffa3b80 0000000100000000 0000000000000000
77e0: ffffff00004a7810 fffffc0008288d20 fffffc0008e587e8 fffffe0fec01a8f0
7800: 00000000024200c2 fffffc000821e62c ffffff00004a7870 fffffc000821e62c
7820: 0000000000000000 ffffff00004a4000 00000000024200c2 fffffe0fe0249410
7840: fffffc000821e788 00000000024200c2 fffffc0008f01d28 0000000002493ee0
7860: 0000000000002c2c 0000000000000040 ffffff00004a78d0 fffffc000821e788
7880: 0000000000000000 000000000000000e 00000000024200c2 fffffe0fe0249410
78a0: 0000000000000001 fffffc0008fd76ac fffffc0008f01d28 0000000002493ee0
78c0: 0000000000002c2c fffffc000821e850 ffffff00004a7930 fffffc000821e9b0
78e0: fffffe0fe0249410 0000000000000001 0000000000000001 0000000000002c2c
7900: ffffff00004a7a20 fffffe0fcc140400 00000000000053d4 fffffc00089b8900
7920: 0000000000002c2c fffffc000821e440 ffffff00004a7960 fffffc00082eb758
7940: fffffe0fe0249410 0000000000000001 0000000000010000 ffffff00004a7b18
7960: ffffff00004a79a0 fffffc000821e3bc 0000000000010000 ffffff00004a7b18
7980: fffffe0fe0249410 0000000000010000 0000000000000000 fffffc000821e434
79a0: ffffff00004a7a30 fffffc00082200e0 0000000000000000 fffffe0fcc140400
79c0: fffffe0fe0249210 ffffff00004a7af0 fffffe0fe0249410 ffffff00004a7b18
79e0: fffffc0008d40e70 0000000000008000 fffffc0009040078 ffffff0f83f26000
7a00: 0000000000000000 fffffc0008bd0300 00002c2c00000001 ffffff00004a4000
7a20: fffffdff83f20840 0000000027e02140 ffffff00004a7a80 fffffc00082201fc
7a40: 0000000000008000 ffffff00004a7af0 fffffe0fe02492e8 ffffff00004a7b18
7a60: ffffff00004a7bb8 0000000000000007 fffffc0008d40e70 0000000000008000
7a80: ffffff00004a7ab0 fffffc00082ba89c fffffe0fcc140400 0000000000008000
7aa0: ffffff00004a7bb8 fffffe0fdd350000 ffffff00004a7b40 fffffc00082bb588
7ac0: 0000000000000000 0000000000008000 fffffe0fcc140400 fffffe0fdd350000
7ae0: fffffe0fdd350000 0000000000008000 fffffe0fcc140400 000000000000ac2c
7b00: 0000000000000000 0000000000000000 0000000000000000 fffffc0000000003
7b20: 00000000000053d4 0000000000002c2c ffffff00004a7ae0 0000000000000001
7b40: ffffff00004a7b80 fffffc00082bc5b4 fffffe0fcc140400 fffffe0fcc140400
7b60: fffffe0fdd350000 0000000000008000 fffffc0008babe88 ffffff0f8ea0e439
7b80: ffffff00004a7bc0 fffffc0008d120f0 0000000000000000 0000000000008000
7ba0: fffffe0fdd350000 0000000000000000 fffffe0fdc3fa280 000000000000ac2c
7bc0: ffffff00004a7bf0 fffffc0008d121d4 fffffc0008d75000 fffffc0008d75b60
7be0: fffffc0008d75bb8 fffffe0fdd350000 ffffff00004a7c10 fffffc0008d11eac
7c00: fffffc0008d75b60 0000000000008000 ffffff00004a7c40 fffffc0008d11f10
7c20: 0000000000008000 0000000000008000 fffffc0008d75b60 fffffc00089e42c0
7c40: ffffff00004a7c80 fffffc0008d410f4 fffffe0fdc3fa280 ffffff0f83f26000
7c60: 0000000000000000 fffffe0fdd350000 fffffc0008d11d74 fffffc0008d11ec8
7c80: ffffff00004a7cf0 fffffc0008d411b4 fffffc0008d75000 ffffff0f83f26000
7ca0: 000000000e7f20bc 0000000000000000 fffffc0008d4119c fffffc0008ff1f38
7cc0: fffffc0008babf90 fffffc0009040000 fffffc0008e3d9a0 0000000000000000
7ce0: 0000000008d75000 0000000000008000 ffffff00004a7d00 fffffc0008d127c4
7d00: ffffff00004a7d60 fffffc0008d1294c fffffc0009040000 fffffc0008d128dc
7d20: fffffc0009040000 fffffc0009040000 fffffc0009040000 fffffc0008d1045c
7d40: fffffc0008cfc468 fffffc0008d74290 0000000000001b57 fffffc0008c04958
7d60: ffffff00004a7d90 fffffc0008082ff4 ffffff00004a4000 fffffc0008d128dc
7d80: 0000000000000000 0000000000000005 ffffff00004a7e00 fffffc0008d10df4
7da0: 0000000000000101 fffffc0009040000 fffffc0008d742a8 0000000000000005
7dc0: fffffc0008e3d700 0000000000000000 fffffc0008ee2370 fffffc0008bb79f8
7de0: 0000000000000000 0000000500000005 0000000000000000 fffffc0008d1045c
7e00: ffffff00004a7ea0 fffffc000893c888 fffffc0009040000 fffffc0009040000
7e20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7e40: 0000000000000000 0000000000000000 0000000000000000 ffffff00004a4000
7e60: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
7e80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7ea0: 0000000000000000 fffffc0008082b80 fffffc000893c868 0000000000000000
7ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call trace:
Exception stack(0xffffff00004a72a0 to 0xffffff00004a73d0)
72a0: fffffdffc3f80000 0000040000000000 ffffff00004a7480 fffffc0008227618
72c0: 00000000800000c5 000000000000003d fffffc0008ec4000 0000000000000030
72e0: ffffff00004a7300 fffffc00081593b0 fffffc0008f53000 0000000000000000
7300: ffffff00004a7320 fffffc000822abac fffffc0008fd76be ffffff00004a7438
7320: ffffff00004a7460 fffffc00082886c8 ffffff00004a7380 fffffc0008135d08
7340: 0000000000000001 0000000000000007 ffffff0ffffa2c80 fffffe0fffff2580
7360: 0000000000000001 ffffff0ffffa2500 fffffe0fffff2580 0000000000000000
7380: 0000000000000001 fffffc00082295f0 fffffc0008f53000 0000000000000000
73a0: 0000ae7401c50430 01c604000000001b 00001b00000033f5 00aedf01c7040000
73c0: 04000000001b2000 00000000000000c1
[<fffffc0008227618>] move_freepages+0x158/0x168
[<fffffc00082276d4>] move_freepages_block+0xac/0xc0
[<fffffc0008227f00>] __rmqueue+0x740/0x898
[<fffffc0008229654>] get_page_from_freelist+0x3e4/0xca0
[<fffffc000822a5ec>] __alloc_pages_nodemask+0x1ac/0xfd8
[<fffffc00082886c8>] alloc_page_interleave+0x60/0xb8
[<fffffc0008288d20>] alloc_pages_current+0x168/0x1c8
[<fffffc000821e62c>] __page_cache_alloc+0x17c/0x1c0
[<fffffc000821e788>] pagecache_get_page+0x118/0x2f8
[<fffffc000821e9b0>] grab_cache_page_write_begin+0x48/0x68
[<fffffc00082eb758>] simple_write_begin+0x40/0x168
[<fffffc000821e3bc>] generic_perform_write+0xc4/0x1b8
[<fffffc00082200e0>] __generic_file_write_iter+0x178/0x1c8
[<fffffc00082201fc>] generic_file_write_iter+0xcc/0x1c8
[<fffffc00082ba89c>] __vfs_write+0xcc/0x140
[<fffffc00082bb588>] vfs_write+0xa8/0x1c0
[<fffffc00082bc5b4>] SyS_write+0x54/0xb0
[<fffffc0008d120f0>] xwrite+0x34/0x7c
[<fffffc0008d121d4>] do_copy+0x9c/0xf4
[<fffffc0008d11eac>] write_buffer+0x34/0x50
[<fffffc0008d11f10>] flush_buffer+0x48/0xb8
[<fffffc0008d410f4>] __gunzip+0x27c/0x324
[<fffffc0008d411b4>] gunzip+0x18/0x20
[<fffffc0008d127c4>] unpack_to_rootfs+0x168/0x280
[<fffffc0008d1294c>] populate_rootfs+0x70/0x138
[<fffffc0008082ff4>] do_one_initcall+0x44/0x138
[<fffffc0008d10df4>] kernel_init_freeable+0x240/0x2e0
[<fffffc000893c888>] kernel_init+0x20/0xf8
[<fffffc0008082b80>] ret_from_fork+0x10/0x50
Code: 913dc021 9400d2db d4210000 d503201f (d4210000) 
---[ end trace fcdd7f8cb96cdb3d ]---
note: swapper/0[1] exited with preempt_count 1
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

SMP: stopping secondary CPUs
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Will Deacon Oct. 27, 2016, 4:01 p.m. UTC | #7
Hi Robert,

On Mon, Oct 17, 2016 at 08:58:01PM +0200, Robert Richter wrote:
> Mark, Will, any opinion here?

Having looked at this, I'm inclined to agree with you; pfn_valid() is
all about whether the underlying mem_map (struct page *) entry exists,
not about whether the page is mappable or not.

That said, setting the zone for pages representing NOMAP memory feels
like a slippery slope to losing information about them being NOMAP in
the first place and the whole problem getting out-of-hand. Whilst I'm
happy for pfn_valid() to return true (in the sense that we're within
bounds of mem_map etc), I'm less happy that we're also saying that the
struct page contains useful information, such as the zone and the node
information, which is then subsequently used by the NUMA code.

On top of that, pfn_valid is used in other places as a coarse "is this
memory?" check, and will cause things like ioremap to fail whereas it
wouldn't at the moment. It feels to me like NOMAP memory is a new type
of memory where there *is* a struct page, but it shouldn't be used for
anything. I don't think pfn_valid can describe that, given the way it's
currently used, and flipping the logic is just likely to move the problem
elsewhere.

What options do we have for fixing this in the NUMA code?

Will
Richter, Robert Oct. 28, 2016, 9:19 a.m. UTC | #8
On 27.10.16 17:01:36, Will Deacon wrote:
> Hi Robert,
> 
> On Mon, Oct 17, 2016 at 08:58:01PM +0200, Robert Richter wrote:
> > Mark, Will, any opinion here?
> 
> Having looking at this, I'm inclined to agree with you; pfn_valid() is
> all about whether the underlying mem_map (struct page *) entry exists,
> not about whether the page is mappable or not.
> 
> That said, setting the zone for pages representing NOMAP memory feels
> like a slippery slope to losing information about them being NOMAP in
> the first place and the whole problem getting out-of-hand. Whilst I'm
> happy for pfn_valid() to return true (in the sense that we're within
> bounds of mem_map etc), I'm less happy that we're also saying that the
> struct page contains useful information, such as the zone and the node
> information, which is then subsequently used by the NUMA code.

Let's see it in a different way: pfns and the struct page assigned to
each of them are about *physical* memory. The system knows all the
memory; some is free, some reserved and some marked NOMAP. Regardless
of the mapping of the page, the mm code maintains and uses that
information.

There are validity assumptions and checks in the code that now cause
problems due to partially or completely missing data about nomap
pages. This inconsistency is dangerous since a problem may occur any
time the page area is first accessed, so a system may crash randomly
depending on the memory access. Luckily, in my case it triggered
reproducibly while loading the initrd during boot.

I also think that this is not only NUMA related. E.g. the following
bug report is probably also related:

 https://bugzilla.redhat.com/show_bug.cgi?id=1387793

> On top of that, pfn_valid is used in other places as a coarse "is this
> memory?" check, and will cause things like ioremap to fail whereas it
> wouldn't at the moment.

IMO this is a misuse of pfn_valid() that needs to be fixed with
additional checks, e.g. traversing memblocks.

> It feels to me like NOMAP memory is a new type
> of memory where there *is* a struct page, but it shouldn't be used for
> anything.

IMO, a NOMAP page should just be handled like a reserved page except
that the page is not marked reserved. See free_low_memory_core_early().
Thus, NOMAP pages are neither in the free pages list nor set to
reserved. They are simply not available for mapping at all. Isn't that
exactly what it should be?

I also do not yet understand the benefit of the differentiation
between NOMAP and reserved and the original motivation for its
implementation. I looked through the mail threads but could not find
any hint. The only difference I see now is that a NOMAP page is not
listed as a reserved page, but as long as it is not freed it should
behave the same. I remember the case of handling such memory
differently (coherency, etc.), but am not sure here. Ard, could you
explain this?
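
For reference, a rough sketch of the two memblock helpers involved
(simplified from mm/memblock.c of that time, details may differ). The
only mechanical difference is the check of the nomap flag:

----
	int memblock_is_memory(phys_addr_t addr)
	{
		return memblock_search(&memblock.memory, addr) != -1;
	}

	int memblock_is_map_memory(phys_addr_t addr)
	{
		int i = memblock_search(&memblock.memory, addr);

		if (i == -1)
			return false;
		/* nomap regions stay in memblock.memory, only flagged */
		return !memblock_is_nomap(&memblock.memory.regions[i]);
	}
----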

> I don't think pfn_valid can describe that, given the way it's
> currently used, and flipping the logic is just likely to move the problem
> elsewhere.
> 
> What options do we have for fixing this in the NUMA code?

Out of my mind:

1) Treat NOMAP pages same as reserved pages (my patch).

2) Change mm code to allow an arch specific early_pfn_valid() (see the
   sketch below).

3) Fix mm code to only access struct page (of a zone) if pfn_valid() is
   true.

There can be more alternatives. IMO:

 * We shouldn't touch generic mm code.

 * We should maintain a valid struct page for all pages in a section.

 * We should only traverse memblock where really necessary (arm64
   only).

 * I don't think this problem is numa specific.
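
A rough illustration of what option 2) could look like (hypothetical
sketch only, such an arch override does not exist; the generic
SPARSEMEM definition lives in include/linux/mmzone.h):

----
	/* generic definition (SPARSEMEM): */
	#define early_pfn_valid(pfn)	pfn_valid(pfn)

	/* hypothetical arm64 override for option 2), not existing code: */
	#define early_pfn_valid(pfn)	memblock_is_memory(PFN_PHYS(pfn))
----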

-Robert
Will Deacon Nov. 7, 2016, 9:05 p.m. UTC | #9
On Fri, Oct 28, 2016 at 11:19:05AM +0200, Robert Richter wrote:
> On 27.10.16 17:01:36, Will Deacon wrote:
> > It feels to me like NOMAP memory is a new type
> > of memory where there *is* a struct page, but it shouldn't be used for
> > anything.
> 
> IMO, a NOMAP page should just be handled like a reserved page except
> that the page is marked reserved. See free_low_memory_core_early().
> Thus, NOMAP pages are not in the free pages list or set to reserved.
> It is simply not available for mapping at all. Isn't that exactly what
> it should be?
> 
> I also did not yet understand the benefit of the differentiation
> between NOMAP and reserved and the original motivation for its
> implementation. I looked through the mail threads but could not find
> any hint. The only difference I see now is that it is not listed as a
> reserved page, but as long as it is not freed it should behave the
> same. I remember the case to handle memory different (coherency,
> etc.), but are not sure here. Ard, could you explain this?
> 
> > I don't think pfn_valid can describe that, given the way it's
> > currently used, and flipping the logic is just likely to move the problem
> > elsewhere.
> > 
> > What options do we have for fixing this in the NUMA code?
> 
> Out of my mind:
> 
> 1) Treat NOMAP pages same as reserved pages (my patch).

Just to reiterate here, but your patch as it stands will break other parts
of the kernel. For example, acpi_os_ioremap relies on being able to ioremap
these regions afaict.

I think any solution involving pfn_valid is just going to move the crash
around.

Will
Richter, Robert Nov. 9, 2016, 7:51 p.m. UTC | #10
Will,

On 07.11.16 21:05:14, Will Deacon wrote:
> Just to reiterate here, but your patch as it stands will break other parts
> of the kernel. For example, acpi_os_ioremap relies on being able to ioremap
> these regions afaict.
> 
> I think any solution involving pfn_valid is just going to move the crash
> around.

Let me describe the crash in more detail.

Following range is marked nomap (full efi map below):

[    0.000000] efi:   0x010ffffb9000-0x010ffffccfff [Runtime Code       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] efi:   0x010ffffcd000-0x010fffffefff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*

The mem belongs to this nodes:

[    0.000000] NUMA: Adding memblock [0x1400000 - 0xfffffffff] on node 0
[    0.000000] NUMA: Adding memblock [0x10000400000 - 0x10fffffffff] on node 1

The following mem ranges are created:

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000001400000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x0000010fffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000001400000-0x00000000fffdffff]
[    0.000000]   node   0: [mem 0x00000000fffe0000-0x00000000ffffffff]
[    0.000000]   node   0: [mem 0x0000000100000000-0x0000000fffffffff]
[    0.000000]   node   1: [mem 0x0000010000400000-0x0000010ff9e7ffff]
[    0.000000]   node   1: [mem 0x0000010ff9e80000-0x0000010ff9f1ffff]
[    0.000000]   node   1: [mem 0x0000010ff9f20000-0x0000010ffaeaffff]
[    0.000000]   node   1: [mem 0x0000010ffaeb0000-0x0000010ffaffffff]
[    0.000000]   node   1: [mem 0x0000010ffb000000-0x0000010ffffaffff]
[    0.000000]   node   1: [mem 0x0000010ffffb0000-0x0000010fffffffff]

The last range 0x0000010ffffb0000-0x0000010fffffffff is correctly
marked as a nomap area.

Paging is then initialized in free_area_init_core() (mm/page_alloc.c)
which calls memmap_init_zone(). This initializes all pages (struct
page) of the zones for each node except for pfns from nomap memory. It
uses early_pfn_valid(), which is bound to pfn_valid().

In 4.4 *all* pages of a zone were initialized. Now pages from nomap
ranges are skipped, and e.g. pfn 0x10ffffb has an uninitialized struct
page. Note that nomap mem ranges are still part of the memblock list
and also part of the zone ranges within start_pfn and end_pfn, but
those don't have a valid struct page now. IMO this is a bug.
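
For reference, a simplified sketch of the loop in memmap_init_zone()
(mm/page_alloc.c) as of that time, not the exact source:

----
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn))	/* pfn_valid() for SPARSEMEM */
			continue;		/* nomap pfns are skipped */
		if (!early_pfn_in_nid(pfn, nid))
			continue;
		__init_single_pfn(pfn, zone, nid); /* sets zone/node links */
	}
----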

Later on all mappable memory is reserved and all pages of it are freed
by adding them to the free-pages list except for the reserved
memblocks.

Now the initrd is loaded. This reserves free memory by calling
move_freepages_block(). A block has the size of pageblock_nr_pages
which in my configuration is 0x2000. The kernel chooses the block

 0x0000010fe0000000-0x0000010fffffffff

that also contains the nomap region around 10ffb000000.

In move_freepages() the free pages of the zone are moved to another
free list. The code accesses pfn #10fffff, but the struct page is
uninitialized and thus its node is 0. This differs from pfn #10fe000
and the zone, which are on node 1. This causes the BUG_ON() to
trigger:

[    4.173521] Unpacking initramfs...
[    8.420710] ------------[ cut here ]------------
[    8.425344] kernel BUG at mm/page_alloc.c:1844!
[    8.429870] Internal error: Oops - BUG: 0 [#1] SMP

I believe this is not the only case where the mm code relies on a
valid struct page for all pfns of a zone, thus modifying mm code to
make this case work is not an option. (E.g. this could be done by
moving the early_pfn_valid() check at the beginning of the loop in
memmap_init_zone().)

So for a fix we need to change early_pfn_valid(), which is mapped to
pfn_valid() for SPARSEMEM. Using a different function than pfn_valid()
for this does not look reasonable. The only option is to revert
pfn_valid() to its original behaviour.

My fix now uses memblock_is_memory() again for pfn_valid() instead of
memblock_is_map_memory(). So I needed to change some users of
pfn_valid() to use memblock_is_map_memory() where necessary. This is
only required for arch specific code; all other uses of pfn_valid()
are safe with memblock_is_memory().
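
In essence, the arch/arm64/mm/init.c part of the patch boils down to
this one-line change (sketch):

----
 int pfn_valid(unsigned long pfn)
 {
-	return memblock_is_map_memory(pfn << PAGE_SHIFT);
+	return memblock_is_memory(pfn << PAGE_SHIFT);
 }
----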

Thus, I don't see where my patch breaks code. Even acpi_os_ioremap()
keeps the same behaviour as before since it still uses
memblock_is_memory(). Could you describe in more detail why you think
this patch breaks the kernel and moves the problem somewhere else? I
believe it fixes the problem for good.

Thanks,

-Robert



[    0.000000] efi: Processing EFI memory map:
[    0.000000] MEMBLOCK configuration:
[    0.000000]  memory size = 0x1ffe800000 reserved size = 0x10537
[    0.000000]  memory.cnt  = 0x2
[    0.000000]  memory[0x0]     [0x00000001400000-0x00000fffffffff], 0xffec00000 bytes flags: 0x0
[    0.000000]  memory[0x1]     [0x00010000400000-0x00010fffffffff], 0xfffc00000 bytes flags: 0x0
[    0.000000]  reserved.cnt  = 0x1
[    0.000000]  reserved[0x0]   [0x00000021200000-0x00000021210536], 0x10537 bytes flags: 0x0
[    0.000000] efi:   0x000001400000-0x00000147ffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000001400000-0x0000000147ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x000001480000-0x0000024affff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000001480000-0x000000024affff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x0000024b0000-0x0000211fffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x000000024b0000-0x000000211fffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x000021200000-0x00002121ffff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000021200000-0x0000002121ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x000021220000-0x0000fffebfff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000021220000-0x000000fffeffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x0000fffec000-0x0000ffff5fff [ACPI Reclaim Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x000000fffe0000-0x000000ffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x0000ffff6000-0x0000ffff6fff [ACPI Memory NVS    |   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x000000ffff0000-0x000000ffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x0000ffff7000-0x0000ffffffff [ACPI Reclaim Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x000000ffff0000-0x000000ffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x000100000000-0x000ff7ffffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000100000000-0x00000ff7ffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x000ff8000000-0x000ff801ffff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000ff8000000-0x00000ff801ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x000ff8020000-0x000fffa9efff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000ff8020000-0x00000fffa9ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x000fffa9f000-0x000fffffffff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00000fffa90000-0x00000fffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010000400000-0x010f812b3fff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010000400000-0x00010f812bffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f812b4000-0x010f812b6fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f812b0000-0x00010f812bffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f812b7000-0x010f822e6fff [Loader Code        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f812b0000-0x00010f822effff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f822e7000-0x010f822f6fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f822e0000-0x00010f822fffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f822f7000-0x010f82342fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f822f0000-0x00010f8234ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f82343000-0x010f91e73fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f82340000-0x00010f91e7ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f91e74000-0x010f91e74fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f91e70000-0x00010f91e7ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f91e75000-0x010f92c98fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f91e70000-0x00010f92c9ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f92c99000-0x010f93880fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f92c90000-0x00010f9388ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010f93881000-0x010ff7880fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010f93880000-0x00010ff788ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ff7881000-0x010ff7886fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010ff7880000-0x00010ff788ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ff7887000-0x010ff78a3fff [Loader Code        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010ff7880000-0x00010ff78affff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ff78a4000-0x010ff9e8dfff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010ff78a0000-0x00010ff9e8ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ff9e8e000-0x010ff9f16fff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x00010ff9e80000-0x00010ff9f1ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ff9f17000-0x010ffaeb5fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010ff9f10000-0x00010ffaebffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffaeb6000-0x010ffafc8fff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x00010ffaeb0000-0x00010ffafcffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffafc9000-0x010ffafccfff [Runtime Code       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x00010ffafc0000-0x00010ffafcffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffafcd000-0x010ffaff4fff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x00010ffafc0000-0x00010ffaffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffaff5000-0x010ffb008fff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010ffaff0000-0x00010ffb00ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffb009000-0x010fffe28fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010ffb000000-0x00010fffe2ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010fffe29000-0x010fffe3ffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010fffe20000-0x00010fffe3ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010fffe40000-0x010fffe53fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010fffe40000-0x00010fffe5ffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010fffe54000-0x010ffffb8fff [Boot Code          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010fffe50000-0x00010ffffbffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffffb9000-0x010ffffccfff [Runtime Code       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x00010ffffb0000-0x00010ffffcffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffffcd000-0x010fffffefff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]*
[    0.000000] memblock_add: [0x00010ffffc0000-0x00010fffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x010ffffff000-0x010fffffffff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] memblock_add: [0x00010fffff0000-0x00010fffffffff] flags 0x0 early_init_dt_add_memory_arch+0x54/0x5c
[    0.000000] efi:   0x804000001000-0x804000001fff [Memory Mapped I/O  |RUN|  |  |  |  |  |  |   |  |  |  |UC]
[    0.000000] efi:   0x87e0d0001000-0x87e0d0001fff [Memory Mapped I/O  |RUN|  |  |  |  |  |  |   |  |  |  |UC]
[    0.000000] memblock_reserve: [0x00010f812b0000-0x00010f812bffff] flags 0x0 efi_init+0x208/0x2c8
[    0.000000] memblock_add: [0x00010f82340000-0x00010f91e7ffff] flags 0x0 arm64_memblock_init+0x16c/0x248
[    0.000000] memblock_reserve: [0x00010f82340000-0x00010f91e7ffff] flags 0x0 arm64_memblock_init+0x178/0x248
[    0.000000] memblock_reserve: [0x00000001480000-0x000000024affff] flags 0x0 arm64_memblock_init+0x1b0/0x248
[    0.000000] memblock_reserve: [0x00010f82343000-0x00010f91e73613] flags 0x0 arm64_memblock_init+0x1cc/0x248
[    0.000000] memblock_reserve: [0x000000c0000000-0x000000dfffffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] cma: Reserved 512 MiB at 0x00000000c0000000
[    0.000000] memblock_reserve: [0x00010ffffa0000-0x00010ffffaffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffff90000-0x00010ffff9ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffff80000-0x00010ffff8ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffff70000-0x00010ffff7ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffff60000-0x00010ffff6ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffff50000-0x00010ffff5ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffff40000-0x00010ffff4ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffff30000-0x00010ffff3ffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000]    memblock_free: [0x00010ffffa0000-0x00010ffffaffff] paging_init+0x5c0/0x610
[    0.000000]    memblock_free: [0x00000002490000-0x000000024affff] paging_init+0x5f4/0x610
[    0.000000] memblock_reserve: [0x00010fffefed80-0x00010ffff2fffb] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffffaffd0-0x00010ffffafffe] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffffaffa0-0x00010ffffaffce] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] memblock_reserve: [0x00010ffffa0000-0x00010ffffa000f] flags 0x0 numa_init+0x88/0x3f0
[    0.000000] NUMA: Adding memblock [0x1400000 - 0xfffffffff] on node 0
[    0.000000] NUMA: Adding memblock [0x10000400000 - 0x10fffffffff] on node 1
[    0.000000] NUMA: parsing numa-distance-map-v1
[    0.000000] NUMA: Initmem setup node 0 [mem 0x01400000-0xfffffffff]
[    0.000000] memblock_reserve: [0x00000fffffe500-0x00000fffffffff] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] NUMA: NODE_DATA [mem 0xfffffe500-0xfffffffff]
[    0.000000] NUMA: Initmem setup node 1 [mem 0x10000400000-0x10fffffffff]
[    0.000000] memblock_reserve: [0x00010ffffae480-0x00010ffffaff7f] flags 0x0 memblock_alloc_range_nid+0x30/0x58
[    0.000000] NUMA: NODE_DATA [mem 0x10ffffae480-0x10ffffaff7f]
...
[    8.420710] ------------[ cut here ]------------
[    8.425344] kernel BUG at mm/page_alloc.c:1844!
[    8.429870] Internal error: Oops - BUG: 0 [#1] SMP
[    8.434654] Modules linked in:
[    8.437712] CPU: 72 PID: 1 Comm: swapper/0 Tainted: G        W       4.8.1.4.vanilla10-00007-g9eb3a76b8b88 #111
[    8.447788] Hardware name: www.cavium.com ThunderX CRB-2S/ThunderX CRB-2S, BIOS 0.3 Sep 13 2016
[    8.456474] task: ffff800fee626800 task.stack: ffff800fec02c000
[    8.462404] PC is at move_freepages+0x198/0x1b0
[    8.466924] LR is at move_freepages_block+0xa8/0xb8
[    8.471791] pc : [<ffff0000081e5dd0>] lr : [<ffff0000081e5e90>] pstate: 000000c5
[    8.479174] sp : ffff800fec02f4d0
[    8.482478] x29: ffff800fec02f4d0 x28: 0000000000000001 
[    8.487786] x27: 0000000000000000 x26: 0000000000000000 
[    8.493093] x25: 000000000000000a x24: ffff7fe043fd0020 
[    8.498401] x23: ffff810ffffaec00 x22: 0000000000000000 
[    8.503708] x21: ffff7fe043ffffc0 x20: 0000000000000000 
[    8.509015] x19: ffff7fe043f80000 x18: ffff8100102f2548 
[    8.514322] x17: 0000000000000000 x16: 0000000100000000 
[    8.519630] x15: 0000000000000007 x14: 0000010200000000 
[    8.524938] x13: 0004a41400000000 x12: 00032c920000001a 
[    8.530245] x11: 0000010200000000 x10: 0004a40800000000 
[    8.535553] x9 : 0000000000000040 x8 : 0000000000000000 
[    8.540860] x7 : 0000000000000000 x6 : ffff810ffffaf120 
[    8.546168] x5 : 0000000000000001 x4 : ffff000008d70c48 
[    8.551475] x3 : ffff810ffffae480 x2 : 0000000000000000 
[    8.556782] x1 : 0000000000000001 x0 : ffff810ffffaec00 
[    8.562089] 
[    8.563571] Process swapper/0 (pid: 1, stack limit = 0xffff800fec02c020)
[    8.570261] Stack: (0xffff800fec02f4d0 to 0xffff800fec030000)
...
[    9.279652] Call trace:
[    9.282091] Exception stack(0xffff800fec02f300 to 0xffff800fec02f430)
[    9.288520] f300: ffff7fe043f80000 0001000000000000 ffff800fec02f4d0 ffff0000081e5dd0
[    9.296339] f320: 0000000000000001 ffff800fee626800 0000000000000000 00000000026012d0
[    9.304157] f340: 0000000100000000 ffffffffffffffff 0000000000000130 0000000000000040
[    9.311976] f360: 0000001200000000 ffff000008cf8198 0000000100000000 0000000000000001
[    9.319794] f380: 00000000026012d0 ffff810ff9a1e8e8 0000000000000000 0000000000000040
[    9.327613] f3a0: ffff810ffffaec00 0000000000000001 0000000000000000 ffff810ffffae480
[    9.335431] f3c0: ffff000008d70c48 0000000000000001 ffff810ffffaf120 0000000000000000
[    9.343250] f3e0: 0000000000000000 0000000000000040 0004a40800000000 0000010200000000
[    9.351068] f400: 00032c920000001a 0004a41400000000 0000010200000000 0000000000000007
[    9.358885] f420: 0000000100000000 0000000000000000
[    9.363756] [<ffff0000081e5dd0>] move_freepages+0x198/0x1b0
[    9.369318] [<ffff0000081e5e90>] move_freepages_block+0xa8/0xb8
[    9.375227] [<ffff0000081e657c>] __rmqueue+0x604/0x650
[    9.380355] [<ffff0000081e7898>] get_page_from_freelist+0x3f0/0xb88
[    9.386612] [<ffff0000081e860c>] __alloc_pages_nodemask+0x12c/0xce8
[    9.392876] [<ffff00000823dc1c>] alloc_page_interleave+0x64/0xc0
[    9.398873] [<ffff00000823e2f8>] alloc_pages_current+0x108/0x168
[    9.404870] [<ffff0000081de1cc>] __page_cache_alloc+0x104/0x140
[    9.410779] [<ffff0000081de31c>] pagecache_get_page+0x114/0x300
[    9.416688] [<ffff0000081de550>] grab_cache_page_write_begin+0x48/0x68
[    9.423207] [<ffff00000828e5f8>] simple_write_begin+0x40/0x150
[    9.429029] [<ffff0000081ddfe0>] generic_perform_write+0xb8/0x1a0
[    9.435112] [<ffff0000081df8c8>] __generic_file_write_iter+0x170/0x1b0
[    9.441628] [<ffff0000081df9d4>] generic_file_write_iter+0xcc/0x1c8
[    9.447893] [<ffff000008262eb4>] __vfs_write+0xcc/0x140
[    9.453108] [<ffff000008263b84>] vfs_write+0xa4/0x1c0
[    9.458150] [<ffff000008264bc4>] SyS_write+0x54/0xb0
[    9.463108] [<ffff000008be1fe4>] xwrite+0x34/0x7c
[    9.467801] [<ffff000008be20c8>] do_copy+0x9c/0xf4
[    9.472583] [<ffff000008be1da4>] write_buffer+0x34/0x50
[    9.477796] [<ffff000008be1e08>] flush_buffer+0x48/0xb4
[    9.483016] [<ffff000008c0fe08>] __gunzip+0x288/0x334
[    9.488057] [<ffff000008c0fecc>] gunzip+0x18/0x20
[    9.492750] [<ffff000008be26bc>] unpack_to_rootfs+0x168/0x284
[    9.498486] [<ffff000008be2848>] populate_rootfs+0x70/0x138
[    9.504051] [<ffff000008082ff4>] do_one_initcall+0x44/0x138
[    9.509613] [<ffff000008be0d04>] kernel_init_freeable+0x1ac/0x24c
[    9.515699] [<ffff000008847f20>] kernel_init+0x20/0xf8
[    9.520826] [<ffff000008082b80>] ret_from_fork+0x10/0x50
[    9.526129] Code: 910a0021 9400b47d d4210000 d503201f (d4210000) 
[    9.532233] ---[ end trace c3040dccdcf12d3a ]---
[    9.536889] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Will Deacon Nov. 17, 2016, 2:25 p.m. UTC | #11
On Wed, Nov 09, 2016 at 08:51:32PM +0100, Robert Richter wrote:
> Thus, I don't see where my patch breaks code. Even acpi_os_ioremap()
> keeps the same behaviour as before since it still uses memblock_is_
> memory(). Could you more describe your concerns why do you think this
> patch breaks the kernel and moves the problem somewhere else? I
> believe it fixes the problem at all.

acpi_os_ioremap always ends up in __ioremap_caller, regardless of
memblock_is_memory(). __ioremap_caller then fails if pfn_valid is true.

Will
Richter, Robert Nov. 17, 2016, 3:18 p.m. UTC | #12
Thanks for your answer.

On 17.11.16 14:25:29, Will Deacon wrote:
> On Wed, Nov 09, 2016 at 08:51:32PM +0100, Robert Richter wrote:
> > Thus, I don't see where my patch breaks code. Even acpi_os_ioremap()
> > keeps the same behaviour as before since it still uses memblock_is_
> > memory(). Could you more describe your concerns why do you think this
> > patch breaks the kernel and moves the problem somewhere else? I
> > believe it fixes the problem at all.
> 
> acpi_os_ioremap always ends up in __ioremap_caller, regardless of
> memblock_is_memory(). __ioremap_caller then fails if pfn_valid is true.

But that's the reason my patch changed the code to use
memblock_is_map_memory() instead. I was looking into the users of
pfn_valid(), esp. in arm64 code, and changed them where required.

This week I looked into the kernel again for code that might break
due to a pfn_valid() change. I found try_ram_remap() in memremap.c,
whose behaviour has now changed, but this is explicitly for
MEMREMAP_WB, so it should be fine.

It might be better to use page_is_ram() in addition to
pfn_valid() where necessary. This should work now after commit:

 e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem

I still think pfn_valid() is not the right check to determine the mem
attributes for mappings; further checks are required.

The risk of breaking something with my patch is small and limited only
to the mapping of efi reserved regions (which is the state of 4.4). If
something breaks anyway it can easily be fixed by adding more checks
to pfn_valid() as suggested above.

-Robert
Ard Biesheuvel Nov. 20, 2016, 5:07 p.m. UTC | #13
On 17 November 2016 at 15:18, Robert Richter <robert.richter@cavium.com> wrote:
> Thanks for your answer.
>
> On 17.11.16 14:25:29, Will Deacon wrote:
>> On Wed, Nov 09, 2016 at 08:51:32PM +0100, Robert Richter wrote:
>> > Thus, I don't see where my patch breaks code. Even acpi_os_ioremap()
>> > keeps the same behaviour as before since it still uses memblock_is_
>> > memory(). Could you more describe your concerns why do you think this
>> > patch breaks the kernel and moves the problem somewhere else? I
>> > believe it fixes the problem at all.
>>
>> acpi_os_ioremap always ends up in __ioremap_caller, regardless of
>> memblock_is_memory(). __ioremap_caller then fails if pfn_valid is true.
>
> But that's the reason my patch changed the code to use memblock_is_
> map_memory() instead. I was looking into the users of pfn_valid() esp.
> in arm64 code and changed it where required.
>
> This week I looked into the kernel again for code that might break by
> a pfn_valid() change. I found try_ram_remap() in memremap.c that has
> changed behaviour now, but this is explicit for MEMREMAP_WB, so it
> should be fine.
>
> Maybe it might be better to use page_is_ram() in addition to
> pfn_valid() where necessary. This should work now after commit:
>
>  e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>
> I still think pfn_valid() is not the correct use to determine the mem
> attributes for mappings, there are further checks required.
>
> The risk of breaking something with my patch is small and limited only
> to the mapping of efi reserved regions (which is the state of 4.4). If
> something breaks anyway it can easily be fixed by adding more checks
> to pfn_valid() as suggested above.
>

As I noted before, it looks to me like setting CONFIG_HOLES_IN_ZONE is
the correct way to address this. However, doing that does uncover a
bug in move_freepages() where the VM_BUG_ON_PAGE() dereferences struct
page fields before the pfn_valid_within() check, so it seems those
need to be switched around.
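
From memory, the problematic ordering looks roughly like this
(simplified, not the exact source):

----
	for (page = start_page; page <= end_page;) {
		/* dereferences struct page before the pfn is validated */
		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);

		if (!pfn_valid_within(page_to_pfn(page))) {
			page++;
			continue;
		}
		...
	}
----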

Robert, you mentioned that CONFIG_HOLES_IN_ZONE seems inappropriate
for sparsemem. Care to elaborate why?
Richter, Robert Nov. 23, 2016, 9:15 p.m. UTC | #14
On 20.11.16 17:07:44, Ard Biesheuvel wrote:
> On 17 November 2016 at 15:18, Robert Richter <robert.richter@cavium.com> wrote:

> > The risk of breaking something with my patch is small and limited only
> > to the mapping of efi reserved regions (which is the state of 4.4). If
> > something breaks anyway it can easily be fixed by adding more checks
> > to pfn_valid() as suggested above.
> >
> 
> As I noted before, it looks to me like setting CONFIG_HOLES_IN_ZONE is
> the correct way to address this. However, doing that does uncover a
> bug in move_freepages() where the VM_BUG_ON_PAGE() dereferences struct
> page fields before the pfn_valid_within() check, so it seems those
> need to be switched around.
> 
> Robert, you mentioned that CONFIG_HOLES_IN_ZONE seems inappropriate
> for sparsemem. Care to elaborate why?

HOLES_IN_ZONE is rarely used in the kernel. I think it was introduced
to save memory for the memmap, and only a few individual systems
enable it. There is no architecture that enables it unconditionally.
For good reasons...

It introduces additional checks. pfn_valid() is usually checked only
once for the whole memmap. There are a number of checks enabled; just
grep for pfn_valid_within. This will increase the number of
pfn_valid() calls by a factor of MAX_ORDER_NR_PAGES, which in my
config is 8k. So, this is not the direction to go.
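
For reference, the relevant definition in include/linux/mmzone.h is
roughly:

----
	#ifdef CONFIG_HOLES_IN_ZONE
	#define pfn_valid_within(pfn)	pfn_valid(pfn)
	#else
	#define pfn_valid_within(pfn)	(1)
	#endif
----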

My patch fixes a regression in the kernel that was introduced by the
nomap implementation. Some systems cannot boot anymore; besides that,
the BUG_ON() may occur at any time depending only on physical page
access, so we need to fix 4.9. Here is a reproducer:

 https://patchwork.kernel.org/patch/9407677/

My patch also does not break memremap(). With my patch applied,
try_ram_remap() would return a linear address for nomap regions. But
it is only called if WB is explicitly requested, so this should not
happen. If you think pfn_valid() is wrong here, I am happy to send a
patch that fixes this by using page_is_ram(). In any case, the worst
that may happen is to behave the same as v4.4; we could then fix the
wrong uses of pfn_valid() where it is incorrectly used to check for
ram.

-Robert
Ard Biesheuvel Nov. 23, 2016, 9:25 p.m. UTC | #15
On 23 November 2016 at 21:15, Robert Richter <robert.richter@cavium.com> wrote:
> On 20.11.16 17:07:44, Ard Biesheuvel wrote:
>> On 17 November 2016 at 15:18, Robert Richter <robert.richter@cavium.com> wrote:
>
>> > The risk of breaking something with my patch is small and limited only
>> > to the mapping of efi reserved regions (which is the state of 4.4). If
>> > something breaks anyway it can easily be fixed by adding more checks
>> > to pfn_valid() as suggested above.
>> >
>>
>> As I noted before, it looks to me like setting CONFIG_HOLES_IN_ZONE is
>> the correct way to address this. However, doing that does uncover a
>> bug in move_freepages() where the VM_BUG_ON_PAGE() dereferences struct
>> page fields before the pfn_valid_within() check, so it seems those
>> need to be switched around.
>>
>> Robert, you mentioned that CONFIG_HOLES_IN_ZONE seems inappropriate
>> for sparsemem. Care to elaborate why?
>
> HOLES_IN_ZONE is of rare use in the kernel. I think it was introduced
> to save memory for the memmap and only some single systems enable it.
> There is no architecture that enables it entirely. For good reasons...
>
> It introduces additional checks. pfn_valid() is usually checked only
> once for the whole memmap. There are a number of checks enabled, just
> grep for pfn_valid_within. This will increase the number of
> pfn_valid() calls by a factor of MAX_ORDER_NR_PAGES, in my config this
> is 8k. So, this is not the direction to go.
>

That does sound like a potential issue. But does it cause any slowdown
in practice?

The reality is that CONFIG_HOLES_IN_ZONE perfectly describes the
situation, and so it is still my preferred option if the performance
hit is tolerable.

> My patch fixes a regression in the kernel that was introduced by the
> nomap implementation. Some systems can not boot anymore, beside of
> that the BUG_ON() may occur any time depending only on physical page
> access, we need to fix 4.9. Here is a reproducer:
>
>  https://patchwork.kernel.org/patch/9407677/
>
> My patch also does not break memremap(). With my patch applied
> try_ram_remap() would return a linear addr for nomap regions. But this
> is only called if WB is explicitly requested, so it should not happen.

Why? MEMREMAP_WB is used often, among other things for mapping
firmware tables, which are marked as NOMAP, so in these cases, the
linear address is not mapped.

> If you think pfn_valid() is wrong here, I am happy to send a patch
> that fixes this by using page_is_ram(). In any case, the worst case
> that may happen is to behave the same as v4.4, we might fix then the
> wrong use of pfn_valid() where it is not correctly used to check for
> ram.
>

page_is_ram() uses string comparisons to look for regions called
'System RAM'. Is that something we can tolerate for each pfn_valid()
call?

Perhaps the solution is to reimplement page_is_ram() for arm64 using
memblock_is_memory() instead. But that still means we need to modify
the generic memremap() code first to switch to it before changing the
arm64 implementation of pfn_valid().
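
Something along these lines is what I have in mind (hypothetical
override of the __weak generic implementation, illustration only):

----
	/* hypothetical arch/arm64 override, not existing code */
	int page_is_ram(unsigned long pfn)
	{
		return memblock_is_memory(PFN_PHYS(pfn));
	}
----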
Richter, Robert Nov. 24, 2016, 1:42 p.m. UTC | #16
On 23.11.16 21:25:06, Ard Biesheuvel wrote:
> Why? MEMREMAP_WB is used often, among other things for mapping
> firmware tables, which are marked as NOMAP, so in these cases, the
> linear address is not mapped.

If fw tables are mapped wb, that is wrong and needs a separate fix.

> > If you think pfn_valid() is wrong here, I am happy to send a patch
> > that fixes this by using page_is_ram(). In any case, the worst case
> > that may happen is to behave the same as v4.4, we might fix then the
> > wrong use of pfn_valid() where it is not correctly used to check for
> > ram.
> >
> 
> page_is_ram() uses string comparisons to look for regions called
> 'System RAM'. Is that something we can tolerate for each pfn_valid()
> calll?
> 
> Perhaps the solution is to reimplement page_is_ram() for arm64 using
> memblock_is_memory() instead, But that still means we need to modify
> the generic memremap() code first to switch to it before changing the
> arm64 implementation of pfn_valid

No, that's not the solution. pfn_valid() should just check if there is
a valid struct page, as other archs do. And if there is a misuse of
pfn_valid() to check for ram, only those calls should be fixed to use
page_is_ram(), however that is implemented, or something appropriate.
But I don't see any problematic code; if there is any, I will fix it.

-Robert
Ard Biesheuvel Nov. 24, 2016, 1:44 p.m. UTC | #17
On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote:
> On 23.11.16 21:25:06, Ard Biesheuvel wrote:
>> Why? MEMREMAP_WB is used often, among other things for mapping
>> firmware tables, which are marked as NOMAP, so in these cases, the
>> linear address is not mapped.
>
> If fw tables are mapped wb, that is wrong and needs a separate fix.
>

Why is that wrong?

>> > If you think pfn_valid() is wrong here, I am happy to send a patch
>> > that fixes this by using page_is_ram(). In any case, the worst case
>> > that may happen is to behave the same as v4.4, we might fix then the
>> > wrong use of pfn_valid() where it is not correctly used to check for
>> > ram.
>> >
>>
>> page_is_ram() uses string comparisons to look for regions called
>> 'System RAM'. Is that something we can tolerate for each pfn_valid()
>> calll?
>>
>> Perhaps the solution is to reimplement page_is_ram() for arm64 using
>> memblock_is_memory() instead, But that still means we need to modify
>> the generic memremap() code first to switch to it before changing the
>> arm64 implementation of pfn_valid
>
> No, that's not the solution. pfn_valid() should just check if there is
> a valid struct page, as other archs do. And if there is a mis-use of
> pfn_valid() to check for ram, only that calls should be fixed to use
> page_is_ram(), however this is implemented, or something appropriate.
> But I don't see any problematic code, and if so, I will fix that.
>

memremap() uses pfn_valid() to decide whether some address is covered
by the linear mapping. If we correct pfn_valid() to adhere to your
definition, we will need to fix memremap() first in any case.
Richter, Robert Nov. 24, 2016, 1:51 p.m. UTC | #18
On 24.11.16 13:44:31, Ard Biesheuvel wrote:
> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote:
> > On 23.11.16 21:25:06, Ard Biesheuvel wrote:
> >> Why? MEMREMAP_WB is used often, among other things for mapping
> >> firmware tables, which are marked as NOMAP, so in these cases, the
> >> linear address is not mapped.
> >
> > If fw tables are mapped wb, that is wrong and needs a separate fix.
> >
> 
> Why is that wrong?

The whole issue with mapping acpi tables is that they should not be
marked cacheable, which is what wb does. Otherwise we could just use
the linear mapping for those mem ranges.

> >> > If you think pfn_valid() is wrong here, I am happy to send a patch
> >> > that fixes this by using page_is_ram(). In any case, the worst case
> >> > that may happen is to behave the same as v4.4, we might fix then the
> >> > wrong use of pfn_valid() where it is not correctly used to check for
> >> > ram.
> >> >
> >>
> >> page_is_ram() uses string comparisons to look for regions called
> >> 'System RAM'. Is that something we can tolerate for each pfn_valid()
> >> calll?
> >>
> >> Perhaps the solution is to reimplement page_is_ram() for arm64 using
> >> memblock_is_memory() instead, But that still means we need to modify
> >> the generic memremap() code first to switch to it before changing the
> >> arm64 implementation of pfn_valid
> >
> > No, that's not the solution. pfn_valid() should just check if there is
> > a valid struct page, as other archs do. And if there is a mis-use of
> > pfn_valid() to check for ram, only that calls should be fixed to use
> > page_is_ram(), however this is implemented, or something appropriate.
> > But I don't see any problematic code, and if so, I will fix that.
> >
> 
> memremap() uses pfn_valid() to decide whether some address is covered
> by the linear mapping. If we correct pfn_valid() to adhere to your
> definition, we will need to fix memremap() first in any case.

As said, will fix that if needed. But I think the caller is wrong
then.

-Robert
Ard Biesheuvel Nov. 24, 2016, 1:58 p.m. UTC | #19
On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote:
> On 24.11.16 13:44:31, Ard Biesheuvel wrote:
>> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote:
>> > On 23.11.16 21:25:06, Ard Biesheuvel wrote:
>> >> Why? MEMREMAP_WB is used often, among other things for mapping
>> >> firmware tables, which are marked as NOMAP, so in these cases, the
>> >> linear address is not mapped.
>> >
>> > If fw tables are mapped wb, that is wrong and needs a separate fix.
>> >
>>
>> Why is that wrong?
>
> The whole issue with mapping acpi tables is not marking them cachable,
> what wb does.

What 'issue'?

> Otherwise we could just use linear mapping for those mem
> ranges.
>

Regions containing firmware tables are owned by the firmware, and it
is the firmware that tells us which memory attributes we are allowed
to use. If those attributes include WB, it is perfectly legal to use a
cacheable mapping. That does *not* mean they should be covered by the
linear mapping. The linear mapping is read-write-non-exec, for
instance, and we may prefer to use a read-only mapping and/or
executable mapping.

>> >> > If you think pfn_valid() is wrong here, I am happy to send a patch
>> >> > that fixes this by using page_is_ram(). In any case, the worst case
>> >> > that may happen is to behave the same as v4.4, we might fix then the
>> >> > wrong use of pfn_valid() where it is not correctly used to check for
>> >> > ram.
>> >> >
>> >>
>> >> page_is_ram() uses string comparisons to look for regions called
>> >> 'System RAM'. Is that something we can tolerate for each pfn_valid()
>> >> calll?
>> >>
>> >> Perhaps the solution is to reimplement page_is_ram() for arm64 using
>> >> memblock_is_memory() instead, But that still means we need to modify
>> >> the generic memremap() code first to switch to it before changing the
>> >> arm64 implementation of pfn_valid
>> >
>> > No, that's not the solution. pfn_valid() should just check if there is
>> > a valid struct page, as other archs do. And if there is a mis-use of
>> > pfn_valid() to check for ram, only that calls should be fixed to use
>> > page_is_ram(), however this is implemented, or something appropriate.
>> > But I don't see any problematic code, and if so, I will fix that.
>> >
>>
>> memremap() uses pfn_valid() to decide whether some address is covered
>> by the linear mapping. If we correct pfn_valid() to adhere to your
>> definition, we will need to fix memremap() first in any case.
>
> As said, will fix that if needed. But I think the caller is wrong
> then.
>
Richter, Robert Nov. 24, 2016, 2:11 p.m. UTC | #20
On 24.11.16 13:58:30, Ard Biesheuvel wrote:
> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote:
> > On 24.11.16 13:44:31, Ard Biesheuvel wrote:
> >> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote:
> >> > On 23.11.16 21:25:06, Ard Biesheuvel wrote:
> >> >> Why? MEMREMAP_WB is used often, among other things for mapping
> >> >> firmware tables, which are marked as NOMAP, so in these cases, the
> >> >> linear address is not mapped.
> >> >
> >> > If fw tables are mapped wb, that is wrong and needs a separate fix.
> >> >
> >>
> >> Why is that wrong?
> >
> > The whole issue with mapping acpi tables is not marking them cachable,
> > what wb does.
> 
> What 'issue'?
> 
> > Otherwise we could just use linear mapping for those mem
> > ranges.
> >
> 
> Regions containing firmware tables are owned by the firmware, and it
> is the firmware that tells us which memory attributes we are allowed
> to use. If those attributes include WB, it is perfectly legal to use a
> cacheable mapping. That does *not* mean they should be covered by the
> linear mapping. The linear mapping is read-write-non-exec, for
> instance, and we may prefer to use a read-only mapping and/or
> executable mapping.

Ok, I am going to fix try_ram_remap().

Are there other concerns with this patch?

Thanks,

-Robert
Ard Biesheuvel Nov. 24, 2016, 2:23 p.m. UTC | #21
On 24 November 2016 at 14:11, Robert Richter <robert.richter@cavium.com> wrote:
> On 24.11.16 13:58:30, Ard Biesheuvel wrote:
>> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote:
>> > On 24.11.16 13:44:31, Ard Biesheuvel wrote:
>> >> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote:
>> >> > On 23.11.16 21:25:06, Ard Biesheuvel wrote:
>> >> >> Why? MEMREMAP_WB is used often, among other things for mapping
>> >> >> firmware tables, which are marked as NOMAP, so in these cases, the
>> >> >> linear address is not mapped.
>> >> >
>> >> > If fw tables are mapped wb, that is wrong and needs a separate fix.
>> >> >
>> >>
>> >> Why is that wrong?
>> >
>> > The whole issue with mapping acpi tables is not marking them cachable,
>> > what wb does.
>>
>> What 'issue'?
>>
>> > Otherwise we could just use linear mapping for those mem
>> > ranges.
>> >
>>
>> Regions containing firmware tables are owned by the firmware, and it
>> is the firmware that tells us which memory attributes we are allowed
>> to use. If those attributes include WB, it is perfectly legal to use a
>> cacheable mapping. That does *not* mean they should be covered by the
>> linear mapping. The linear mapping is read-write-non-exec, for
>> instance, and we may prefer to use a read-only mapping and/or
>> executable mapping.
>
> Ok, I am going to fix try_ram_remap().
>

Thanks. Could you also add an arm64 version of page_is_ram() that uses
memblock_is_memory() while you're at it? I think using memblock
directly in try_ram_remap() may not be the best approach.

> Are there other concerns with this patch?
>

I think we all agree that pfn_valid() should return whether a pfn has
a struct page associated with it; the debate is about whether it makes
sense to allocate struct pages for memory that the kernel does not
own. But given that it does not really hurt to do so for small holes,
I think your suggestion makes sense.

Should we be doing anything more to ensure that those pages are not
dereferenced inadvertently? Is there a page flag we should be setting?
Richter, Robert Nov. 24, 2016, 3:09 p.m. UTC | #22
On 24.11.16 14:23:16, Ard Biesheuvel wrote:
> On 24 November 2016 at 14:11, Robert Richter <robert.richter@cavium.com> wrote:
> > On 24.11.16 13:58:30, Ard Biesheuvel wrote:
> >> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote:
> >> > On 24.11.16 13:44:31, Ard Biesheuvel wrote:
> >> >> On 24 November 2016 at 13:42, Robert Richter <robert.richter@cavium.com> wrote:
> >> >> > On 23.11.16 21:25:06, Ard Biesheuvel wrote:
> >> >> >> Why? MEMREMAP_WB is used often, among other things for mapping
> >> >> >> firmware tables, which are marked as NOMAP, so in these cases, the
> >> >> >> linear address is not mapped.
> >> >> >
> >> >> > If fw tables are mapped wb, that is wrong and needs a separate fix.
> >> >> >
> >> >>
> >> >> Why is that wrong?
> >> >
> >> > The whole issue with mapping acpi tables is not marking them cachable,
> >> > what wb does.
> >>
> >> What 'issue'?
> >>
> >> > Otherwise we could just use linear mapping for those mem
> >> > ranges.
> >> >
> >>
> >> Regions containing firmware tables are owned by the firmware, and it
> >> is the firmware that tells us which memory attributes we are allowed
> >> to use. If those attributes include WB, it is perfectly legal to use a
> >> cacheable mapping. That does *not* mean they should be covered by the
> >> linear mapping. The linear mapping is read-write-non-exec, for
> >> instance, and we may prefer to use a read-only mapping and/or
> >> executable mapping.
> >
> > Ok, I am going to fix try_ram_remap().
> >
> 
> Thanks. Could you also add an arm64 version of page_is_ram() that uses
> memblock_is_memory() while you're at it? I think using memblock
> directly in try_ram_remap() may not be the best approach

Sure. I also want to mark the patches as stable.

> > Are there other concerns with this patch?
> >
> 
> I think we all agree that pfn_valid() should return whether a pfn has
> a struct page associated with it, the debate is about whether it makes
> sense to allocate struct pages for memory that the kernel does not
> own. But given that it does not really hurt to do so for small holes,
> I think your suggestion makes sense.

Thanks for your comments and the review.

> Should we be doing anything more to ensure that those pages are not
> dereferenced inadvertently? Is there a page flag we should be setting?

I don't think so. Boot mem is initialized in
free_low_memory_core_early(). The PageReserved flag is set for pages
from reserved memory ranges, and memory ranges not marked NOMAP are
just freed. Since pages are either reserved or in the free list of
pages, pages from other memory ranges (NOMAP) are not visible to mm.

-Robert
Richter, Robert Nov. 24, 2016, 7:26 p.m. UTC | #23
Ard,

> > >> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote:
> > >> > On 24.11.16 13:44:31, Ard Biesheuvel wrote:

> > >> Regions containing firmware tables are owned by the firmware, and it
> > >> is the firmware that tells us which memory attributes we are allowed
> > >> to use. If those attributes include WB, it is perfectly legal to use a
> > >> cacheable mapping. That does *not* mean they should be covered by the
> > >> linear mapping. The linear mapping is read-write-non-exec, for
> > >> instance, and we may prefer to use a read-only mapping and/or
> > >> executable mapping.
> > >
> > > Ok, I am going to fix try_ram_remap().

I revisited the code and it is working well already since:

 e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem

Now, try_ram_remap() is only called if the region to be mapped is
entirely within IORESOURCE_SYSTEM_RAM. This is only true for normal
mem ranges and not for NOMAP mem. region_intersects() then returns
REGION_INTERSECTS and try_ram_remap() is called. For the NOMAP memory
case REGION_DISJOINT would be returned and thus arch_memremap_wb() is
called directly. Before the e7cd190385d1 change, try_ram_remap() was
also called for nomap regions.
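
Roughly, the MEMREMAP_WB path in memremap() (kernel/memremap.c) looks
like this (simplified sketch, declarations omitted):

----
	int is_ram = region_intersects(offset, size,
				       IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);

	if (flags & MEMREMAP_WB) {
		/* only ranges entirely in "System RAM" reuse the linear map */
		if (is_ram == REGION_INTERSECTS)
			addr = try_ram_remap(offset, size);
		if (!addr)
			addr = arch_memremap_wb(offset, size);
	}
----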

So we can leave memremap() as it is and just apply this patch
unmodified. What do you think? Please ack.

I am going to prepare the page_is_ram() change in addition to this
patch, but that should not be required for this fix to work correctly.

Thanks,

-Robert
Ard Biesheuvel Nov. 24, 2016, 7:42 p.m. UTC | #24
On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote:
> Ard,
>
>> > >> On 24 November 2016 at 13:51, Robert Richter <robert.richter@cavium.com> wrote:
>> > >> > On 24.11.16 13:44:31, Ard Biesheuvel wrote:
>
>> > >> Regions containing firmware tables are owned by the firmware, and it
>> > >> is the firmware that tells us which memory attributes we are allowed
>> > >> to use. If those attributes include WB, it is perfectly legal to use a
>> > >> cacheable mapping. That does *not* mean they should be covered by the
>> > >> linear mapping. The linear mapping is read-write-non-exec, for
>> > >> instance, and we may prefer to use a read-only mapping and/or
>> > >> executable mapping.
>> > >
>> > > Ok, I am going to fix try_ram_remap().
>
> I revisited the code and it is working well already since:
>
>  e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>
> Now, try_ram_remap() is only called if the region to be mapped is
> entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem
> ranges and not NOMAP mem. region_intersects() then returns
> REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case
> REGION_DISJOINT would be returned and thus arch_memremap_wb() being
> called directly. Before the e7cd190385d1 change try_ram_remap() was
> called also for nomap regions.
>
> So we can leave memremap() as it is and just apply this patch
> unmodified. What do you think?

I agree. The pfn_valid() check in try_ram_remap() is still appropriate
simply because the PageHighmem check requires a valid struct page. But
if we don't enter that code path anymore for NOMAP regions, I think
we're ok.
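
For reference, that helper is just (an abridged sketch of
try_ram_remap(), not the exact source):

----
static void *try_ram_remap(resource_size_t offset, size_t size)
{
	unsigned long pfn = PHYS_PFN(offset);

	/* In the simple case just return the existing linear address */
	if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)))
		return __va(offset);

	return NULL;	/* fall back to arch_memremap_wb() */
}
----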

> Please ack.
>

I still don't fully understand how it is guaranteed that *all* memory
(i.e., all regions for which memblock_is_memory() returns true) is
covered by a struct page, but marked as reserved. Are we relying on
the fact that NOMAP memory is also memblock_reserve()'d?

> I am going to prepare the pfn_is_ram() change in addition to this
> patch, but that should not be required for this fix to work correcly.
>

I don't think you need to bother with page_is_ram() then. The only
place we use it is in devmem_is_allowed(), and there it makes sense to
allow NOMAP regions to be accessed (provided that you think /dev/mem
is a good idea in the first place).
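
For context, a sketch of what that check looks like with STRICT_DEVMEM
enabled (abridged, assuming the usual implementation of
devmem_is_allowed(); not a verbatim copy of the arm64 source):

----
int devmem_is_allowed(unsigned long pfn)
{
	if (iomem_is_exclusive(pfn << PAGE_SHIFT))
		return 0;
	if (!page_is_ram(pfn))
		return 1;	/* non-RAM, e.g. NOMAP, stays accessible */
	return 0;
}
----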
Richter, Robert Nov. 25, 2016, 11:29 a.m. UTC | #25
On 24.11.16 19:42:47, Ard Biesheuvel wrote:
> On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote:

> > I revisited the code and it is working well already since:
> >
> >  e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
> >
> > Now, try_ram_remap() is only called if the region to be mapped is
> > entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem
> > ranges and not NOMAP mem. region_intersects() then returns
> > REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case
> > REGION_DISJOINT would be returned and thus arch_memremap_wb() being
> > called directly. Before the e7cd190385d1 change try_ram_remap() was
> > called also for nomap regions.
> >
> > So we can leave memremap() as it is and just apply this patch
> > unmodified. What do you think?
> 
> I agree. The pfn_valid() check in try_ram_remap() is still appropriate
> simply because the PageHighmem check requires a valid struct page. But
> if we don't enter that code path anymore for NOMAP regions, I think
> we're ok.
> 
> > Please ack.
> >
> 
> I still don't fully understand how it is guaranteed that *all* memory
> (i.e., all regions for which memblock_is_memory() returns true) is
> covered by a struct page, but marked as reserved. Are we relying on
> the fact that NOMAP memory is also memblock_reserve()'d?

See free_low_memory_core_early():

----
	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
        			NULL)
		count += __free_memory_core(start, end);
----

Only memory with the MEMBLOCK_NONE flag is added, and NOMAP pages are
also *not* marked reserved. So nothing at all from NOMAP memory is
reported to mm; it is not present (see below for a mem config, note
the flags: 0x4 mem regions).
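
(For reference, abridged: the memblock region flags, where 0x4 in the
dump below is MEMBLOCK_NOMAP, and the check in the __next_mem_range()
iterator that skips such regions when the caller passes MEMBLOCK_NONE.)

----
enum {
	MEMBLOCK_NONE		= 0x0,	/* no special request */
	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
};

	/* skip nomap memory unless we were asked for it explicitly */
	if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
		continue;
----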

-Robert


[    0.000000] efi: Processing EFI memory map:
[    0.000000] efi:   0x000001400000-0x00000147ffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x000001480000-0x0000024bffff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x0000024c0000-0x0000211fffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x000021200000-0x00002121ffff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x000021220000-0x0000fffebfff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x0000fffec000-0x0000ffff5fff [ACPI Reclaim Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x0000ffff6000-0x0000ffff6fff [ACPI Memory NVS    |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x0000ffff7000-0x0000ffffffff [ACPI Reclaim Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x000100000000-0x000ff7ffffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x000ff8000000-0x000ff801ffff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x000ff8020000-0x000fffa9efff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x000fffa9f000-0x000fffffffff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010000400000-0x010f816aefff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f816af000-0x010f816b1fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f816b2000-0x010f826f1fff [Loader Code        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f826f2000-0x010f82701fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f82702000-0x010f82787fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f82788000-0x010f9276bfff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f9276c000-0x010f9276cfff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f9276d000-0x010f935a8fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f935a9000-0x010f93880fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010f93881000-0x010ff7880fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ff7881000-0x010ff7886fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ff7887000-0x010ff78a3fff [Loader Code        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ff78a4000-0x010ff9e8dfff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ff9e8e000-0x010ff9f16fff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ff9f17000-0x010ffaeb5fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffaeb6000-0x010ffafc8fff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffafc9000-0x010ffafccfff [Runtime Code       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffafcd000-0x010ffaff4fff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffaff5000-0x010ffb008fff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffb009000-0x010fffe28fff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010fffe29000-0x010fffe3ffff [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010fffe40000-0x010fffe53fff [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010fffe54000-0x010ffffb8fff [Boot Code          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffffb9000-0x010ffffccfff [Runtime Code       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffffcd000-0x010fffffefff [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x010ffffff000-0x010fffffffff [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC]
[    0.000000] efi:   0x804000001000-0x804000001fff [Memory Mapped I/O  |RUN|  |  |  |  |  |  |   |  |  |  |UC]
[    0.000000] efi:   0x87e0d0001000-0x87e0d0001fff [Memory Mapped I/O  |RUN|  |  |  |  |  |  |   |  |  |  |UC]

[    0.000000] MEMBLOCK configuration:
[    0.000000]  memory size = 0x1ffe800000 reserved size = 0x39146a21
[    0.000000]  memory.cnt  = 0x9
[    0.000000]  memory[0x0]     [0x00000001400000-0x000000fffdffff], 0xfebe0000 bytes on node 0 flags: 0x0
[    0.000000]  memory[0x1]     [0x000000fffe0000-0x000000ffffffff], 0x20000 bytes on node 0 flags: 0x4
[    0.000000]  memory[0x2]     [0x00000100000000-0x00000fffffffff], 0xf00000000 bytes on node 0 flags: 0x0
[    0.000000]  memory[0x3]     [0x00010000400000-0x00010ff9e7ffff], 0xff9a80000 bytes on node 1 flags: 0x0
[    0.000000]  memory[0x4]     [0x00010ff9e80000-0x00010ff9f1ffff], 0xa0000 bytes on node 1 flags: 0x4
[    0.000000]  memory[0x5]     [0x00010ff9f20000-0x00010ffaeaffff], 0xf90000 bytes on node 1 flags: 0x0
[    0.000000]  memory[0x6]     [0x00010ffaeb0000-0x00010ffaffffff], 0x150000 bytes on node 1 flags: 0x4
[    0.000000]  memory[0x7]     [0x00010ffb000000-0x00010ffffaffff], 0x4fb0000 bytes on node 1 flags: 0x0
[    0.000000]  memory[0x8]     [0x00010ffffb0000-0x00010fffffffff], 0x50000 bytes on node 1 flags: 0x4
[    0.000000]  reserved.cnt  = 0xd
[    0.000000]  reserved[0x0]   [0x00000001480000-0x0000000249ffff], 0x1020000 bytes flags: 0x0
[    0.000000]  reserved[0x1]   [0x00000021200000-0x00000021210536], 0x10537 bytes flags: 0x0
[    0.000000]  reserved[0x2]   [0x000000c0000000-0x000000dfffffff], 0x20000000 bytes flags: 0x0
[    0.000000]  reserved[0x3]   [0x00000ffbfb8000-0x00000ffffdffff], 0x4028000 bytes flags: 0x0
[    0.000000]  reserved[0x4]   [0x00000ffffecb00-0x00000fffffffff], 0x13500 bytes flags: 0x0
[    0.000000]  reserved[0x5]   [0x00010f81780000-0x00010f8178ffff], 0x10000 bytes flags: 0x0
[    0.000000]  reserved[0x6]   [0x00010f82870000-0x00010f9286ffff], 0x10000000 bytes flags: 0x0
[    0.000000]  reserved[0x7]   [0x00010ffbce0000-0x00010fffceffff], 0x4010000 bytes flags: 0x0
[    0.000000]  reserved[0x8]   [0x00010fffee6d80-0x00010ffff2fffb], 0x4927c bytes flags: 0x0
[    0.000000]  reserved[0x9]   [0x00010ffff30000-0x00010ffffa000f], 0x70010 bytes flags: 0x0
[    0.000000]  reserved[0xa]   [0x00010ffffae280-0x00010ffffaff7f], 0x1d00 bytes flags: 0x0
[    0.000000]  reserved[0xb]   [0x00010ffffaffa0-0x00010ffffaffce], 0x2f bytes flags: 0x0
[    0.000000]  reserved[0xc]   [0x00010ffffaffd0-0x00010ffffafffe], 0x2f bytes flags: 0x0
Ard Biesheuvel Nov. 25, 2016, 12:28 p.m. UTC | #26
On 25 November 2016 at 11:29, Robert Richter <robert.richter@cavium.com> wrote:
> On 24.11.16 19:42:47, Ard Biesheuvel wrote:
>> On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote:
>
>> > I revisited the code and it is working well already since:
>> >
>> >  e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>> >
>> > Now, try_ram_remap() is only called if the region to be mapped is
>> > entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem
>> > ranges and not NOMAP mem. region_intersects() then returns
>> > REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case
>> > REGION_DISJOINT would be returned and thus arch_memremap_wb() being
>> > called directly. Before the e7cd190385d1 change try_ram_remap() was
>> > called also for nomap regions.
>> >
>> > So we can leave memremap() as it is and just apply this patch
>> > unmodified. What do you think?
>>
>> I agree. The pfn_valid() check in try_ram_remap() is still appropriate
>> simply because the PageHighmem check requires a valid struct page. But
>> if we don't enter that code path anymore for NOMAP regions, I think
>> we're ok.
>>
>> > Please ack.
>> >
>>
>> I still don't fully understand how it is guaranteed that *all* memory
>> (i.e., all regions for which memblock_is_memory() returns true) is
>> covered by a struct page, but marked as reserved. Are we relying on
>> the fact that NOMAP memory is also memblock_reserve()'d?
>
> See free_low_memory_core_early():
>
> ----
>         for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
>                                 NULL)
>                 count += __free_memory_core(start, end);
> ----
>
> Only mem with the MEMBLOCK_NONE flag is added. And NOMAP pages are
> also *not* marked reserved. So nothing at all from NOMAP mem is
> reported to mm, it is not present (see below for a mem config, note
> flags: 0x4 mem regions).
>

OK, thanks for clearing that up. But that still does not explain how
we can be certain that NOMAP regions are guaranteed to be covered by a
struct page, does it? Because that is ultimately what pfn_valid()
means, that it is safe to, e.g., look at the page flags.


> [EFI memory map and MEMBLOCK configuration dump trimmed; see previous message]
Ard Biesheuvel Nov. 25, 2016, 5:01 p.m. UTC | #27
On 25 November 2016 at 12:28, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 25 November 2016 at 11:29, Robert Richter <robert.richter@cavium.com> wrote:
>> On 24.11.16 19:42:47, Ard Biesheuvel wrote:
>>> On 24 November 2016 at 19:26, Robert Richter <robert.richter@cavium.com> wrote:
>>
>>> > I revisited the code and it is working well already since:
>>> >
>>> >  e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>>> >
>>> > Now, try_ram_remap() is only called if the region to be mapped is
>>> > entirely in IORESOURCE_SYSTEM_RAM. This is only true for normal mem
>>> > ranges and not NOMAP mem. region_intersects() then returns
>>> > REGION_INTERSECTS and calls try_ram_remap(). For the NOMAP memory case
>>> > REGION_DISJOINT would be returned and thus arch_memremap_wb() being
>>> > called directly. Before the e7cd190385d1 change try_ram_remap() was
>>> > called also for nomap regions.
>>> >
>>> > So we can leave memremap() as it is and just apply this patch
>>> > unmodified. What do you think?
>>>
>>> I agree. The pfn_valid() check in try_ram_remap() is still appropriate
>>> simply because the PageHighmem check requires a valid struct page. But
>>> if we don't enter that code path anymore for NOMAP regions, I think
>>> we're ok.
>>>
>>> > Please ack.
>>> >
>>>
>>> I still don't fully understand how it is guaranteed that *all* memory
>>> (i.e., all regions for which memblock_is_memory() returns true) is
>>> covered by a struct page, but marked as reserved. Are we relying on
>>> the fact that NOMAP memory is also memblock_reserve()'d?
>>
>> See free_low_memory_core_early():
>>
>> ----
>>         for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
>>                                 NULL)
>>                 count += __free_memory_core(start, end);
>> ----
>>
>> Only mem with the MEMBLOCK_NONE flag is added. And NOMAP pages are
>> also *not* marked reserved. So nothing at all from NOMAP mem is
>> reported to mm, it is not present (see below for a mem config, note
>> flags: 0x4 mem regions).
>>
>
> OK, thanks for clearing that up. But that still does not explain how
> we can be certain that NOMAP regions are guaranteed to be covered by a
> struct page, does it? Because that is ultimately what pfn_valid()
> means, that it is safe to, e.g., look at the page flags.
>

Answering my own question: arm64_memory_present() registers all memory
as present, which means that sparse_init() will allocate struct page
backing for all of memory, including NOMAP regions.
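
For reference, roughly (an abridged sketch of the SPARSEMEM variant in
arch/arm64/mm/init.c):

----
static void __init arm64_memory_present(void)
{
	struct memblock_region *reg;

	/* Every memblock.memory region, NOMAP included, is marked
	 * present, so sparse_init() allocates struct pages for it. */
	for_each_memblock(memory, reg)
		memory_present(memblock_get_region_node(reg),
			       memblock_region_memory_base_pfn(reg),
			       memblock_region_memory_end_pfn(reg));
}
----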

We could ask ourselves whether it makes sense to disregard NOMAP
memory here, but that probably buys us very little (but I do wonder
how it affects the occurrence of the bug).

In any case, it looks to me like your patch is safe, i.e., calling
pfn_valid() on NOMAP pages is safe, although I still find it debatable
whether the kernel should be tracking memory it does not own. However,
for performance reasons, it probably makes sense to apply your patch,
and if anyone ever comes up with a use case where this is problematic
(e.g., gigabytes of NOMAP memory), we can look into it then.

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Patch

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index bbb7ee76e319..25b8659c2a9f 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -147,7 +147,7 @@  static void __init zone_sizes_init(unsigned long min, unsigned long max)
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 int pfn_valid(unsigned long pfn)
 {
-	return memblock_is_map_memory(pfn << PAGE_SHIFT);
+	return memblock_is_memory(pfn << PAGE_SHIFT);
 }
 EXPORT_SYMBOL(pfn_valid);
 #endif
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index 01e88c8bcab0..c17c220b0c48 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -21,6 +21,7 @@ 
  */
 
 #include <linux/export.h>
+#include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/io.h>
@@ -55,7 +56,7 @@  static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
 	/*
 	 * Don't allow RAM to be mapped.
 	 */
-	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
+	if (WARN_ON(memblock_is_map_memory(phys_addr)))
 		return NULL;
 
 	area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -96,7 +97,7 @@  EXPORT_SYMBOL(__iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
 	/* For normal memory we already have a cacheable mapping. */
-	if (pfn_valid(__phys_to_pfn(phys_addr)))
+	if (memblock_is_map_memory(phys_addr))
 		return (void __iomem *)__phys_to_virt(phys_addr);
 
 	return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),