
[v3] arm64: mm: Fix NOMAP page initialization

Message ID 20161216165437.21612-1-rrichter@cavium.com (mailing list archive)
State New, archived

Commit Message

Robert Richter Dec. 16, 2016, 4:54 p.m. UTC
On ThunderX systems with certain memory configurations we see the
following BUG_ON():

 kernel BUG at mm/page_alloc.c:1848!

This happens for some configs with 64k page size enabled. The BUG_ON()
checks whether the start and end pages of a memmap range belong to the
same zone.

The BUG_ON() check fails if a memory zone contains NOMAP regions. In
this case the node information of those pages is not initialized, which
leaves the page links inconsistent, with wrong zone and node
information for those pages. NOMAP pages from node 1 still point to the
mem zone from node 0 and have the wrong nid assigned.

The reason for the mis-configuration is a change in pfn_valid() which
reports pages marked NOMAP as invalid:

 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping

As a result, pages marked as NOMAP are no longer reassigned to the
new zone in memmap_init_zone() by calling __init_single_pfn().

Fix this by implementing an arm64 specific early_pfn_valid(). This
causes all pages of sections with memory, including NOMAP ranges, to be
initialized by __init_single_page() and ensures consistency of page
links to zone, node and section.

The HAVE_ARCH_PFN_VALID config option now requires an explicit
definition of early_pfn_valid() in the same way as pfn_valid(). This
allows a customized implementation of early_pfn_valid() which
redirects to valid_section() for arm64. This is the same as for the
generic pfn_valid() implementation.

v3:

 * Use valid_section(), which matches the default pfn_valid()
   implementation, for initialization
 * Added Ack for arm/ changes.

v2:

 * Use pfn_present() instead of memblock_is_memory() to also support
   non-memory NOMAP holes

Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Robert Richter <rrichter@cavium.com>
---
 arch/arm/include/asm/page.h   |  1 +
 arch/arm64/include/asm/page.h |  2 ++
 arch/arm64/mm/init.c          | 15 +++++++++++++++
 include/linux/mmzone.h        |  5 ++++-
 4 files changed, 22 insertions(+), 1 deletion(-)

Comments

Ard Biesheuvel Jan. 4, 2017, 1:56 p.m. UTC | #1
On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote:
> [...]
>

I like this solution a lot better than the first one, but I am still
somewhat uneasy about having the kernel reason about attributes of
pages it should not touch in the first place. But the fact that
early_pfn_valid() is only used a single time in the whole kernel does
give some confidence that we are not simply moving the problem
elsewhere.
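
For reference, that single user is the init loop in memmap_init_zone()
(mm/page_alloc.c). Simplified from the 4.9-era code, with details
elided, it looks roughly like this:

```c
/* rough sketch of memmap_init_zone(), not the literal source */
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	if (context == MEMMAP_EARLY) {
		if (!early_pfn_valid(pfn))
			continue;	/* struct page left uninitialized */
		if (!early_pfn_in_nid(pfn, nid))
			continue;
	}
	__init_single_pfn(pfn, zone, nid);
}
```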

Given that you are touching arch/arm/ as well as arch/arm64, could you
explain why only arm64 needs this treatment? Is it simply because we
don't have NUMA support there?

Considering that Hisilicon D05 suffered from the same issue, I would
like to get some coverage there as well. Hanjun, is this something you
can arrange? Thanks


> [...]
> ---
>  arch/arm/include/asm/page.h   |  1 +
>  arch/arm64/include/asm/page.h |  2 ++
>  arch/arm64/mm/init.c          | 15 +++++++++++++++
>  include/linux/mmzone.h        |  5 ++++-
>  4 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
> index 4355f0ec44d6..79761bd55f94 100644
> --- a/arch/arm/include/asm/page.h
> +++ b/arch/arm/include/asm/page.h
> @@ -158,6 +158,7 @@ typedef struct page *pgtable_t;
>
>  #ifdef CONFIG_HAVE_ARCH_PFN_VALID
>  extern int pfn_valid(unsigned long);
> +#define early_pfn_valid(pfn)   pfn_valid(pfn)
>  #endif
>
>  #include <asm/memory.h>
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index 8472c6def5ef..17ceb7435ded 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -49,6 +49,8 @@ typedef struct page *pgtable_t;
>
>  #ifdef CONFIG_HAVE_ARCH_PFN_VALID
>  extern int pfn_valid(unsigned long);
> +extern int early_pfn_valid(unsigned long);
> +#define early_pfn_valid early_pfn_valid
>  #endif
>
>  #include <asm/memory.h>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 212c4d1e2f26..8ff62a7ff634 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -145,11 +145,26 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>  #endif /* CONFIG_NUMA */
>
>  #ifdef CONFIG_HAVE_ARCH_PFN_VALID
> +
>  int pfn_valid(unsigned long pfn)
>  {
>         return memblock_is_map_memory(pfn << PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(pfn_valid);
> +
> +/*
> + * This is the same as the generic pfn_valid() implementation. We use
> + * valid_section() here to make sure all pages of a section including
> + * NOMAP pages are initialized with __init_single_page().
> + */
> +int early_pfn_valid(unsigned long pfn)
> +{
> +       if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
> +               return 0;
> +       return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
> +}
> +EXPORT_SYMBOL(early_pfn_valid);
> +
>  #endif
>
>  #ifndef CONFIG_SPARSEMEM
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0f088f3a2fed..bedcf8a95881 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1170,12 +1170,16 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn)
>  }
>
>  #ifndef CONFIG_HAVE_ARCH_PFN_VALID
> +
>  static inline int pfn_valid(unsigned long pfn)
>  {
>         if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>                 return 0;
>         return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
>  }
> +
> +#define early_pfn_valid(pfn)   pfn_valid(pfn)
> +
>  #endif
>
>  static inline int pfn_present(unsigned long pfn)
> @@ -1200,7 +1204,6 @@ static inline int pfn_present(unsigned long pfn)
>  #define pfn_to_nid(pfn)                (0)
>  #endif
>
> -#define early_pfn_valid(pfn)   pfn_valid(pfn)
>  void sparse_init(void);
>  #else
>  #define sparse_init()  do {} while (0)
> --
> 2.11.0
>
Hanjun Guo Jan. 5, 2017, 2:03 a.m. UTC | #2
On 2017/1/4 21:56, Ard Biesheuvel wrote:
> On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote:
>> [...]
>>
>
> I like this solution a lot better than the first one, but I am still
> somewhat uneasy about having the kernel reason about attributes of
> pages it should not touch in the first place. But the fact that
> early_pfn_valid() is only used a single time in the whole kernel does
> give some confidence that we are not simply moving the problem
> elsewhere.
>
> Given that you are touching arch/arm/ as well as arch/arm64, could you
> explain why only arm64 needs this treatment? Is it simply because we
> don't have NUMA support there?
>
> Considering that Hisilicon D05 suffered from the same issue, I would
> like to get some coverage there as well. Hanjun, is this something you
> can arrange? Thanks

Sure, we will test this patch with LTP MM stress test (which triggers
the bug on D05), and give the feedback.

Thanks
Hanjun
Richter, Robert Jan. 5, 2017, 11:03 a.m. UTC | #3
On 04.01.17 13:56:39, Ard Biesheuvel wrote:
> Given that you are touching arch/arm/ as well as arch/arm64, could you
> explain why only arm64 needs this treatment? Is it simply because we
> don't have NUMA support there?

I haven't considered a solution for arch/arm yet; the fixes are
independent. But if this fix turns out to be an accepted solution, it
could be implemented for arm too. As you said, though, since probably
only NUMA is affected, we might not need it there.

-Robert
Hanjun Guo Jan. 6, 2017, 1:07 a.m. UTC | #4
On 2017/1/5 10:03, Hanjun Guo wrote:
> On 2017/1/4 21:56, Ard Biesheuvel wrote:
>> On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote:
>>> [...]
>>>
>>
>> I like this solution a lot better than the first one, but I am still
>> somewhat uneasy about having the kernel reason about attributes of
>> pages it should not touch in the first place. But the fact that
>> early_pfn_valid() is only used a single time in the whole kernel does
>> give some confidence that we are not simply moving the problem
>> elsewhere.
>>
>> Given that you are touching arch/arm/ as well as arch/arm64, could you
>> explain why only arm64 needs this treatment? Is it simply because we
>> don't have NUMA support there?
>>
>> Considering that Hisilicon D05 suffered from the same issue, I would
>> like to get some coverage there as well. Hanjun, is this something you
>> can arrange? Thanks
>
> Sure, we will test this patch with LTP MM stress test (which triggers
> the bug on D05), and give the feedback.

An update here, tested on 4.9:

  - Applied Ard's two patches only
  - Applied Robert's patch only

Both of them work fine on D05 with NUMA enabled: the system boots
OK and the LTP MM stress test passes.

I'm not familiar with memory management; it's up to you guys to make
a decision :)

Thanks
Hanjun
Prakash B Jan. 6, 2017, 5:22 a.m. UTC | #5
Hi Hanjun,


> a update here, tested on 4.9,
>
>  - Applied Ard's two patches only
>  - Applied Robert's patch only
>
> Both of them can work fine on D05 with NUMA enabled, which means
> boot ok and LTP MM stress test is passed.

It is not related to this patch set. The LTP "cpuset01" test crashes
with the latest 4.9, 4.10-rc1 and 4.10-rc2 kernels on ThunderX 2S.
Do you see any such behaviour on D05?

Any idea what might be causing this issue?


[  227.627546] cpuset01: page allocation stalls for 10096ms, order:0,
mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
[  227.627586] CPU: 53 PID: 11017 Comm: cpuset01 Not tainted 4.9.04kNUMA+ #2
[  227.627591] Hardware name: www.cavium.com ThunderX Unknown/ThunderX
Unknown, BIOS 0.3 Aug 24 2016
[  227.627599] Call trace:
[  227.627623] [<ffff000008089f10>] dump_backtrace+0x0/0x238
[  227.627640] [<ffff00000808a16c>] show_stack+0x24/0x30
[  227.627656] [<ffff00000846fb50>] dump_stack+0x94/0xb4
[  227.627679] [<ffff0000081eb4f8>] warn_alloc+0x138/0x150
[  227.627686] [<ffff0000081ec0a4>] __alloc_pages_nodemask+0xb04/0xcf0
[  227.627697] [<ffff000008245988>] alloc_pages_vma+0xc8/0x270
[  227.627715] [<ffff00000821f604>] handle_mm_fault+0xc8c/0xfd8
[  227.627732] [<ffff00000809a488>] do_page_fault+0x2c0/0x368
[  227.627744] [<ffff0000080812ec>] do_mem_abort+0x6c/0xe0
[  227.627752] Exception stack(0xffff801f55823e00 to 0xffff801f55823f30)
[  227.627763] 3e00: 0000000000000000 0000ffff92682000
ffffffffffffffff 0000ffff9252b3e8
[  227.627774] 3e20: 0000000020000000 0000000000000000
000000000000a000 0000000000000003
[  227.627785] 3e40: 0000000000000022 ffffffffffffffff
0000000000000123 00000000000000de
[  227.627793] 3e60: ffff000008972000 0000000000000015
ffff801f55823e90 0000000000040900
[  227.627800] 3e80: 0000000000000000 ffff0000080836f0
0000000000000000 0000ffff92682000
[  227.627809] 3ea0: ffffffffffffffff 0000ffff92575d8c
0000000000000000 0000000000040900
[  227.627819] 3ec0: 0000ffff92682000 00000000000000f7
0000000000004fc0 0000000000000022
[  227.627828] 3ee0: 0000000000000000 0000000000000000
0000ffff925f5508 f7f7f7f7f7f7f7f7
[  227.627838] 3f00: 0000ffff92686ff0 0000000000002ab8
0101010101010101 0000000000000020
[  227.627847] 3f20: 0000000000000000 0000000000000000
[  227.627858] [<ffff000008083324>] el0_da+0x18/0x1c
[  227.627865] Mem-Info:
[  227.627899] active_anon:38613 inactive_anon:8174 isolated_anon:0
                active_file:25148 inactive_file:64173 isolated_file:0
                unevictable:742 dirty:0 writeback:0 unstable:0
                slab_reclaimable:29066 slab_unreclaimable:67304
                mapped:22876 shmem:2597 pagetables:1240 bounce:0
                free:65582521 free_pcp:1834 free_cma:0
Ard Biesheuvel Jan. 6, 2017, 8:37 a.m. UTC | #6
On 6 January 2017 at 01:07, Hanjun Guo <hanjun.guo@linaro.org> wrote:
> On 2017/1/5 10:03, Hanjun Guo wrote:
>>
>> On 2017/1/4 21:56, Ard Biesheuvel wrote:
>>>
>>> On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote:
>>>>
>>>> [...]
>>>>
>>>
>>> I like this solution a lot better than the first one, but I am still
>>> somewhat uneasy about having the kernel reason about attributes of
>>> pages it should not touch in the first place. But the fact that
>>> early_pfn_valid() is only used a single time in the whole kernel does
>>> give some confidence that we are not simply moving the problem
>>> elsewhere.
>>>
>>> Given that you are touching arch/arm/ as well as arch/arm64, could you
>>> explain why only arm64 needs this treatment? Is it simply because we
>>> don't have NUMA support there?
>>>
>>> Considering that Hisilicon D05 suffered from the same issue, I would
>>> like to get some coverage there as well. Hanjun, is this something you
>>> can arrange? Thanks
>>
>>
>> Sure, we will test this patch with LTP MM stress test (which triggers
>> the bug on D05), and give the feedback.
>
>
> a update here, tested on 4.9,
>
>  - Applied Ard's two patches only
>  - Applied Robert's patch only
>
> Both of them can work fine on D05 with NUMA enabled, which means
> boot ok and LTP MM stress test is passed.
>

Thanks a lot Hanjun.

Any comments on the performance impact (including boot time) ?
Hanjun Guo Jan. 9, 2017, 5:09 a.m. UTC | #7
Hi Prakash,

On 2017/1/6 13:22, Prakash B wrote:
> Hi Hanjun,
>
>
>> a update here, tested on 4.9,
>>
>>  - Applied Ard's two patches only
>>  - Applied Robert's patch only
>>
>> Both of them can work fine on D05 with NUMA enabled, which means
>> boot ok and LTP MM stress test is passed.
>
> It is  not related to this patch set.
> LTP "cpuset01" test  crashes with latest 4.9,  4.10-rc1 and 4.10-rc2 kernels on
> Thunderx 2S .  Do you see any such behaviour on D05.

I didn't test "cpuset01" on D05, but according to testing in
Linaro, the full LTP test passes on D05 with Ard's 2 patches.

>
> Any idea what might be causing this issue.

Since it's not happening on D05, maybe it's related to
the firmware? (just a wild guess...)

Thanks
Hanjun

Hanjun Guo Jan. 9, 2017, 5:14 a.m. UTC | #8
On 2017/1/6 16:37, Ard Biesheuvel wrote:
> On 6 January 2017 at 01:07, Hanjun Guo <hanjun.guo@linaro.org> wrote:
>> [...]
>>
>> a update here, tested on 4.9,
>>
>>  - Applied Ard's two patches only
>>  - Applied Robert's patch only
>>
>> Both of them can work fine on D05 with NUMA enabled, which means
>> boot ok and LTP MM stress test is passed.
>>
>
> Thanks a lot Hanjun.
>
> Any comments on the performance impact (including boot time) ?

We didn't collect the performance data yet; any recommended test
suite? Is sysbench OK? We can test it and collect the data.

Thanks
Hanjun
Prakash B Jan. 9, 2017, 6:15 a.m. UTC | #9
Thanks Hanjun ,


On Mon, Jan 9, 2017 at 10:39 AM, Hanjun Guo <hanjun.guo@linaro.org> wrote:
> Hi Prakash,
> I didn't test "cpuset01" on D05 but according to the test in
> Linaro, LTP full test is passed on D05 with Ard's 2 patches.
>
>>
>> Any idea what might be causing this issue.
>
>
> Since it's not happening on D05, maybe it's related to
> the firmware? (just a wild guess...)
>

We used the same firmware between the 4.4 and 4.9 (and above)
kernels. The test passed with the 4.4 kernel and didn't generate any
crashes or dumps.

If there are more observations I will send a mail or start a
separate mail thread.

Thanks,
Prakash B
Richter, Robert Jan. 9, 2017, 11:53 a.m. UTC | #10
On 06.01.17 08:37:25, Ard Biesheuvel wrote:
> Any comments on the performance impact (including boot time) ?

I did a kernel compile test and kernel-mode time increases by about
2.2%. Though this is already significant, we need a more suitable
memory benchmark here for further testing.

For boot time I don't see significant changes.

-Robert


Boot times:

pfn_valid_within():
[   25.929134]
[   25.548830]
[   25.503225]

early_pfn_valid() v3:
[   25.773814]
[   25.548428]
[   25.765290]


Kernel compile times (3 runs each):

pfn_valid_within():

real    6m4.088s
user    372m57.607s
sys     16m55.158s

real    6m1.532s
user    372m48.453s
sys     16m50.370s

real    6m4.061s
user    373m18.753s
sys     16m57.027s


early_pfn_valid() v3:

real    6m3.261s
user    373m15.816s
sys     16m30.019s

real    6m2.980s
user    373m9.019s
sys     16m32.992s

real    6m2.574s
user    372m45.146s
sys     16m33.218s
Will Deacon Jan. 12, 2017, 4:05 p.m. UTC | #11
Hi Robert,

On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote:
> On 06.01.17 08:37:25, Ard Biesheuvel wrote:
> > Any comments on the performance impact (including boot time) ?
> 
> I did a kernel compile test and kernel mode time increases by about
> 2.2%. Though this is already significant, we need a more suitable mem
> benchmark here for further testing.

Thanks for doing this.

> For boot time I dont see significant changes.
> 
> -Robert
> 
> 
> Boot times:
> 
> pfn_valid_within():
> [   25.929134]
> [   25.548830]
> [   25.503225]
> 
> early_pfn_valid() v3:
> [   25.773814]
> [   25.548428]
> [   25.765290]
> 
> 
> Kernel compile times (3 runs each):
> 
> pfn_valid_within():
> 
> real    6m4.088s
> user    372m57.607s
> sys     16m55.158s
> 
> real    6m1.532s
> user    372m48.453s
> sys     16m50.370s
> 
> real    6m4.061s
> user    373m18.753s
> sys     16m57.027s

Did you reboot the machine between each build here, or only when changing
kernel? If the latter, do you see variations in kernel build time by simply
rebooting the same Image?

Will
Richter, Robert Jan. 12, 2017, 6:58 p.m. UTC | #12
On 12.01.17 16:05:36, Will Deacon wrote:
> On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote:

> > [...]
> 
> Did you reboot the machine between each build here, or only when changing
> kernel? If the latter, do you see variations in kernel build time by simply
> rebooting the same Image?

I built it in a loop on the shell, so no reboots between builds. Note
that I was building the kernel in /dev/shm to not access harddisks. I
think build times should be comparable then since there is no fs
caching.

-Robert
Will Deacon Jan. 13, 2017, 9:19 a.m. UTC | #13
On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote:
> On 12.01.17 16:05:36, Will Deacon wrote:
> > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote:
> 
> > > [...]
> > 
> > Did you reboot the machine between each build here, or only when changing
> > kernel? If the latter, do you see variations in kernel build time by simply
> > rebooting the same Image?
> 
> I built it in a loop on the shell, so no reboots between builds. Note
> that I was building the kernel in /dev/shm to not access harddisks. I
> think build times should be comparable then since there is no fs
> caching.

I guess I'm really asking what the standard deviation is if you *do* reboot
between builds, using the same kernel. It's hard to tell whether the numbers
are due to the patches, or just because of noise incurred by the way things
happen to initialise.

Will
Richter, Robert Jan. 13, 2017, 1:15 p.m. UTC | #14
On 13.01.17 09:19:04, Will Deacon wrote:
> On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote:
> > On 12.01.17 16:05:36, Will Deacon wrote:
> > > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote:
> > 
> > > > [...]
> > > 
> > > Did you reboot the machine between each build here, or only when changing
> > > kernel? If the latter, do you see variations in kernel build time by simply
> > > rebooting the same Image?
> > 
> > I built it in a loop on the shell, so no reboots between builds. Note
> > that I was building the kernel in /dev/shm to not access harddisks. I
> > think build times should be comparable then since there is no fs
> > caching.
> 
> I guess I'm really asking what the standard deviation is if you *do* reboot
> between builds, using the same kernel. It's hard to tell whether the numbers
> are due to the patches, or just because of noise incurred by the way things
> happen to initialise.

Ok, I am going to test this.

-Robert
Richter, Robert Jan. 17, 2017, 10 a.m. UTC | #15
On 13.01.17 14:15:00, Robert Richter wrote:
> On 13.01.17 09:19:04, Will Deacon wrote:
> > On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote:
> > > On 12.01.17 16:05:36, Will Deacon wrote:
> > > > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote:
> > > 
> > > > > [...]
> > > > 
> > > > Did you reboot the machine between each build here, or only when changing
> > > > kernel? If the latter, do you see variations in kernel build time by simply
> > > > rebooting the same Image?
> > > 
> > > I built it in a loop on the shell, so no reboots between builds. Note
> > > that I was building the kernel in /dev/shm to not access harddisks. I
> > > think build times should be comparable then since there is no fs
> > > caching.
> > 
> > I guess I'm really asking what the standard deviation is if you *do* reboot
> > between builds, using the same kernel. It's hard to tell whether the numbers
> > are due to the patches, or just because of noise incurred by the way things
> > happen to initialise.
> 
> Ok, I am going to test this.

See below for the data from a test with reboots after every 3 builds
(9 builds per kernel). Although some deviation can be seen between
reboots, there is a trend.

-Robert



pfn_valid_within(), Boot #1:

real	6m0.007s
user	372m55.709s
sys	16m45.962s

real	5m58.718s
user	372m58.852s
sys	16m47.675s

real	5m58.481s
user	372m56.172s
sys	16m46.953s

pfn_valid_within(), Boot #2:

real	6m1.163s
user	372m57.282s
sys	16m52.025s

real	6m0.562s
user	373m4.957s
sys	16m52.847s

real	6m0.030s
user	372m54.710s
sys	16m54.516s

pfn_valid_within(), Boot #3:

real	6m1.784s
user	373m13.379s
sys	16m48.388s

real	5m58.579s
user	373m10.403s
sys	16m47.628s

real	5m59.151s
user	373m0.084s
sys	16m50.634s

early_pfn_valid(), Boot #1:

real	5m59.902s
user	372m57.201s
sys	16m42.157s

real	5m59.510s
user	372m59.762s
sys	16m47.331s

real	5m58.559s
user	372m46.530s
sys	16m49.010s

early_pfn_valid(), Boot #2:

real	6m0.652s
user	373m10.785s
sys	16m25.138s

real	5m58.663s
user	373m4.498s
sys	16m28.262s

real	5m57.675s
user	373m6.174s
sys	16m28.653s

early_pfn_valid(), Boot #3:

real	5m59.680s
user	373m4.007s
sys	16m26.781s

real	5m58.234s
user	372m58.895s
sys	16m26.957s

real	5m58.707s
user	372m40.546s
sys	16m29.345s
Will Deacon Jan. 17, 2017, 7:16 p.m. UTC | #16
On Tue, Jan 17, 2017 at 11:00:15AM +0100, Robert Richter wrote:
> On 13.01.17 14:15:00, Robert Richter wrote:
> > On 13.01.17 09:19:04, Will Deacon wrote:
> > > On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote:
> > > > On 12.01.17 16:05:36, Will Deacon wrote:
> > > > > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote:
> > > > 
> > > > > > Kernel compile times (3 runs each):
> > > > > > 
> > > > > > pfn_valid_within():
> > > > > > 
> > > > > > real    6m4.088s
> > > > > > user    372m57.607s
> > > > > > sys     16m55.158s
> > > > > > 
> > > > > > real    6m1.532s
> > > > > > user    372m48.453s
> > > > > > sys     16m50.370s
> > > > > > 
> > > > > > real    6m4.061s
> > > > > > user    373m18.753s
> > > > > > sys     16m57.027s
> > > > > 
> > > > > Did you reboot the machine between each build here, or only when changing
> > > > > kernel? If the latter, do you see variations in kernel build time by simply
> > > > > rebooting the same Image?
> > > > 
> > > > I built it in a loop on the shell, so no reboots between builds. Note
> > > > that I was building the kernel in /dev/shm to not access harddisks. I
> > > > think build times should be comparable then since there is no fs
> > > > caching.
> > > 
> > > I guess I'm really asking what the standard deviation is if you *do* reboot
> > > between builds, using the same kernel. It's hard to tell whether the numbers
> > > are due to the patches, or just because of noise incurred by the way things
> > > happen to initialise.
> > 
> > Ok, I am going to test this.
> 
> See below the data for a test with reboots between every 3 builds (9
> builds per kernel). Though some deviation can be seen between reboots
> there is a trend.

I can't really see the trend given that, for system time, your
pfn_valid_within results have a variance of ~9 and the early_pfn_valid
results have a variance of ~92. Given that the variance seems to come
about due to the reboots, I think we need more numbers to establish whether
the data sets end up largely overlapping or if they really are disjoint.

Will
Richter, Robert Feb. 3, 2017, 3:14 p.m. UTC | #17
On 17.01.17 19:16:56, Will Deacon wrote:
> I can't really see the trend given that, for system time, your
> pfn_valid_within results have a variance of ~9 and the early_pfn_valid
> results have a variance of ~92. Given that the variance seems to come
> about due to the reboots, I think we need more numbers to establish whether
> the data sets end up largely overlapping or if they really are disjoint.

Assuming the numbers for both versions are not significantly different, please
apply one or the other.

Thanks,

-Robert
Will Deacon Feb. 3, 2017, 6:16 p.m. UTC | #18
On Fri, Feb 03, 2017 at 04:14:14PM +0100, Robert Richter wrote:
> On 17.01.17 19:16:56, Will Deacon wrote:
> > I can't really see the trend given that, for system time, your
> > pfn_valid_within results have a variance of ~9 and the early_pfn_valid
> > results have a variance of ~92. Given that the variance seems to come
> > about due to the reboots, I think we need more numbers to establish whether
> > the data sets end up largely overlapping or if they really are disjoint.
> 
> Assuming the numbers for both versions are not significantly different, please
> apply one or the other.

I'd rather apply the pfn_valid_within version, but please can you Ack the
patch first, since there was some confusion when it was posted about a
translation fault that was never reproduced.

Thanks,

Will

Patch

diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 4355f0ec44d6..79761bd55f94 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -158,6 +158,7 @@  typedef struct page *pgtable_t;
 
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 extern int pfn_valid(unsigned long);
+#define early_pfn_valid(pfn)	pfn_valid(pfn)
 #endif
 
 #include <asm/memory.h>
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 8472c6def5ef..17ceb7435ded 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -49,6 +49,8 @@  typedef struct page *pgtable_t;
 
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 extern int pfn_valid(unsigned long);
+extern int early_pfn_valid(unsigned long);
+#define early_pfn_valid early_pfn_valid
 #endif
 
 #include <asm/memory.h>
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 212c4d1e2f26..8ff62a7ff634 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -145,11 +145,26 @@  static void __init zone_sizes_init(unsigned long min, unsigned long max)
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
+
 int pfn_valid(unsigned long pfn)
 {
 	return memblock_is_map_memory(pfn << PAGE_SHIFT);
 }
 EXPORT_SYMBOL(pfn_valid);
+
+/*
+ * This is the same as the generic pfn_valid() implementation. We use
+ * valid_section() here to make sure all pages of a section including
+ * NOMAP pages are initialized with __init_single_page().
+ */
+int early_pfn_valid(unsigned long pfn)
+{
+	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
+		return 0;
+	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+}
+EXPORT_SYMBOL(early_pfn_valid);
+
 #endif
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..bedcf8a95881 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1170,12 +1170,16 @@  static inline struct mem_section *__pfn_to_section(unsigned long pfn)
 }
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
+
 static inline int pfn_valid(unsigned long pfn)
 {
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
 	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
 }
+
+#define early_pfn_valid(pfn)	pfn_valid(pfn)
+
 #endif
 
 static inline int pfn_present(unsigned long pfn)
@@ -1200,7 +1204,6 @@  static inline int pfn_present(unsigned long pfn)
 #define pfn_to_nid(pfn)		(0)
 #endif
 
-#define early_pfn_valid(pfn)	pfn_valid(pfn)
 void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)