Message ID | 20161216165437.21612-1-rrichter@cavium.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote: > On ThunderX systems with certain memory configurations we see the > following BUG_ON(): > > kernel BUG at mm/page_alloc.c:1848! > > This happens for some configs with 64k page size enabled. The BUG_ON() > checks if start and end page of a memmap range belongs to the same > zone. > > The BUG_ON() check fails if a memory zone contains NOMAP regions. In > this case the node information of those pages is not initialized. This > causes an inconsistency of the page links with wrong zone and node > information for that pages. NOMAP pages from node 1 still point to the > mem zone from node 0 and have the wrong nid assigned. > > The reason for the mis-configuration is a change in pfn_valid() which > reports pages marked NOMAP as invalid: > > 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping > > This causes pages marked as nomap being no longer reassigned to the > new zone in memmap_init_zone() by calling __init_single_pfn(). > > Fixing this by implementing an arm64 specific early_pfn_valid(). This > causes all pages of sections with memory including NOMAP ranges to be > initialized by __init_single_page() and ensures consistency of page > links to zone, node and section. > I like this solution a lot better than the first one, but I am still somewhat uneasy about having the kernel reason about attributes of pages it should not touch in the first place. But the fact that early_pfn_valid() is only used a single time in the whole kernel does give some confidence that we are not simply moving the problem elsewhere. Given that you are touching arch/arm/ as well as arch/arm64, could you explain why only arm64 needs this treatment? Is it simply because we don't have NUMA support there? Considering that Hisilicon D05 suffered from the same issue, I would like to get some coverage there as well. Hanjun, is this something you can arrange? Thanks > The HAVE_ARCH_PFN_VALID config option now requires an explicit > definiton of early_pfn_valid() in the same way as pfn_valid(). This > allows a customized implementation of early_pfn_valid() which > redirects to valid_section() for arm64. This is the same as for the > generic pfn_valid() implementation. > > v3: > > * Use valid_section() which is the same as the default pfn_valid() > implementation to initialize > * Added Ack for arm/ changes. > > v2: > > * Use pfn_present() instead of memblock_is_memory() to support also > non-memory NOMAP holes > > Acked-by: Russell King <rmk+kernel@armlinux.org.uk> > Signed-off-by: Robert Richter <rrichter@cavium.com> > --- > arch/arm/include/asm/page.h | 1 + > arch/arm64/include/asm/page.h | 2 ++ > arch/arm64/mm/init.c | 15 +++++++++++++++ > include/linux/mmzone.h | 5 ++++- > 4 files changed, 22 insertions(+), 1 deletion(-) > > diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h > index 4355f0ec44d6..79761bd55f94 100644 > --- a/arch/arm/include/asm/page.h > +++ b/arch/arm/include/asm/page.h > @@ -158,6 +158,7 @@ typedef struct page *pgtable_t; > > #ifdef CONFIG_HAVE_ARCH_PFN_VALID > extern int pfn_valid(unsigned long); > +#define early_pfn_valid(pfn) pfn_valid(pfn) > #endif > > #include <asm/memory.h> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h > index 8472c6def5ef..17ceb7435ded 100644 > --- a/arch/arm64/include/asm/page.h > +++ b/arch/arm64/include/asm/page.h > @@ -49,6 +49,8 @@ typedef struct page *pgtable_t; > > #ifdef CONFIG_HAVE_ARCH_PFN_VALID > extern int pfn_valid(unsigned long); > +extern int early_pfn_valid(unsigned long); > +#define early_pfn_valid early_pfn_valid > #endif > > #include <asm/memory.h> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 212c4d1e2f26..8ff62a7ff634 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -145,11 +145,26 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max) > #endif /* CONFIG_NUMA */ > > #ifdef CONFIG_HAVE_ARCH_PFN_VALID > + > int pfn_valid(unsigned long pfn) > { > return memblock_is_map_memory(pfn << PAGE_SHIFT); > } > EXPORT_SYMBOL(pfn_valid); > + > +/* > + * This is the same as the generic pfn_valid() implementation. We use > + * valid_section() here to make sure all pages of a section including > + * NOMAP pages are initialized with __init_single_page(). > + */ > +int early_pfn_valid(unsigned long pfn) > +{ > + if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) > + return 0; > + return valid_section(__nr_to_section(pfn_to_section_nr(pfn))); > +} > +EXPORT_SYMBOL(early_pfn_valid); > + > #endif > > #ifndef CONFIG_SPARSEMEM > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 0f088f3a2fed..bedcf8a95881 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1170,12 +1170,16 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn) > } > > #ifndef CONFIG_HAVE_ARCH_PFN_VALID > + > static inline int pfn_valid(unsigned long pfn) > { > if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) > return 0; > return valid_section(__nr_to_section(pfn_to_section_nr(pfn))); > } > + > +#define early_pfn_valid(pfn) pfn_valid(pfn) > + > #endif > > static inline int pfn_present(unsigned long pfn) > @@ -1200,7 +1204,6 @@ static inline int pfn_present(unsigned long pfn) > #define pfn_to_nid(pfn) (0) > #endif > > -#define early_pfn_valid(pfn) pfn_valid(pfn) > void sparse_init(void); > #else > #define sparse_init() do {} while (0) > -- > 2.11.0 >
On 2017/1/4 21:56, Ard Biesheuvel wrote: > On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote: >> On ThunderX systems with certain memory configurations we see the >> following BUG_ON(): >> >> kernel BUG at mm/page_alloc.c:1848! >> >> This happens for some configs with 64k page size enabled. The BUG_ON() >> checks if start and end page of a memmap range belongs to the same >> zone. >> >> The BUG_ON() check fails if a memory zone contains NOMAP regions. In >> this case the node information of those pages is not initialized. This >> causes an inconsistency of the page links with wrong zone and node >> information for that pages. NOMAP pages from node 1 still point to the >> mem zone from node 0 and have the wrong nid assigned. >> >> The reason for the mis-configuration is a change in pfn_valid() which >> reports pages marked NOMAP as invalid: >> >> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping >> >> This causes pages marked as nomap being no longer reassigned to the >> new zone in memmap_init_zone() by calling __init_single_pfn(). >> >> Fixing this by implementing an arm64 specific early_pfn_valid(). This >> causes all pages of sections with memory including NOMAP ranges to be >> initialized by __init_single_page() and ensures consistency of page >> links to zone, node and section. >> > > I like this solution a lot better than the first one, but I am still > somewhat uneasy about having the kernel reason about attributes of > pages it should not touch in the first place. But the fact that > early_pfn_valid() is only used a single time in the whole kernel does > give some confidence that we are not simply moving the problem > elsewhere. > > Given that you are touching arch/arm/ as well as arch/arm64, could you > explain why only arm64 needs this treatment? Is it simply because we > don't have NUMA support there? > > Considering that Hisilicon D05 suffered from the same issue, I would > like to get some coverage there as well. Hanjun, is this something you > can arrange? Thanks Sure, we will test this patch with LTP MM stress test (which triggers the bug on D05), and give the feedback. Thanks Hanjun
On 04.01.17 13:56:39, Ard Biesheuvel wrote: > Given that you are touching arch/arm/ as well as arch/arm64, could you > explain why only arm64 needs this treatment? Is it simply because we > don't have NUMA support there? I haven't considered a solution for arch/arm yet. The fixes are independent. But if that fix would be an excepted solution, it could be implemented for arm then too. But as you said, since probably only NUMA is affected, we might not need it there. -Robert
On 2017/1/5 10:03, Hanjun Guo wrote: > On 2017/1/4 21:56, Ard Biesheuvel wrote: >> On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote: >>> On ThunderX systems with certain memory configurations we see the >>> following BUG_ON(): >>> >>> kernel BUG at mm/page_alloc.c:1848! >>> >>> This happens for some configs with 64k page size enabled. The BUG_ON() >>> checks if start and end page of a memmap range belongs to the same >>> zone. >>> >>> The BUG_ON() check fails if a memory zone contains NOMAP regions. In >>> this case the node information of those pages is not initialized. This >>> causes an inconsistency of the page links with wrong zone and node >>> information for that pages. NOMAP pages from node 1 still point to the >>> mem zone from node 0 and have the wrong nid assigned. >>> >>> The reason for the mis-configuration is a change in pfn_valid() which >>> reports pages marked NOMAP as invalid: >>> >>> 68709f45385a arm64: only consider memblocks with NOMAP cleared for >>> linear mapping >>> >>> This causes pages marked as nomap being no longer reassigned to the >>> new zone in memmap_init_zone() by calling __init_single_pfn(). >>> >>> Fixing this by implementing an arm64 specific early_pfn_valid(). This >>> causes all pages of sections with memory including NOMAP ranges to be >>> initialized by __init_single_page() and ensures consistency of page >>> links to zone, node and section. >>> >> >> I like this solution a lot better than the first one, but I am still >> somewhat uneasy about having the kernel reason about attributes of >> pages it should not touch in the first place. But the fact that >> early_pfn_valid() is only used a single time in the whole kernel does >> give some confidence that we are not simply moving the problem >> elsewhere. >> >> Given that you are touching arch/arm/ as well as arch/arm64, could you >> explain why only arm64 needs this treatment? Is it simply because we >> don't have NUMA support there? >> >> Considering that Hisilicon D05 suffered from the same issue, I would >> like to get some coverage there as well. Hanjun, is this something you >> can arrange? Thanks > > Sure, we will test this patch with LTP MM stress test (which triggers > the bug on D05), and give the feedback. a update here, tested on 4.9, - Applied Ard's two patches only - Applied Robert's patch only Both of them can work fine on D05 with NUMA enabled, which means boot ok and LTP MM stress test is passed. I'm not familiar with memory management, it's up to you guys to make a decision :) Thanks Hanjun
Hi Hanjun, > a update here, tested on 4.9, > > - Applied Ard's two patches only > - Applied Robert's patch only > > Both of them can work fine on D05 with NUMA enabled, which means > boot ok and LTP MM stress test is passed. It is not related to this patch set. LTP "cpuset01" test crashes with latest 4.9, 4.10-rc1 and 4.10-rc2 kernels on Thunderx 2S . Do you see any such behaviour on D05. Any idea what might be causing this issue. 227.627546] cpuset01: page allocation stalls for 10096ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE) [ 227.627586] CPU: 53 PID: 11017 Comm: cpuset01 Not tainted 4.9.04kNUMA+ #2 [ 227.627591] Hardware name: www.cavium.com ThunderX Unknown/ThunderX Unknown, BIOS 0.3 Aug 24 2016 [ 227.627599] Call trace: [ 227.627623] [<ffff000008089f10>] dump_backtrace+0x0/0x238 [ 227.627640] [<ffff00000808a16c>] show_stack+0x24/0x30 [ 227.627656] [<ffff00000846fb50>] dump_stack+0x94/0xb4 [ 227.627679] [<ffff0000081eb4f8>] warn_alloc+0x138/0x150 [ 227.627686] [<ffff0000081ec0a4>] __alloc_pages_nodemask+0xb04/0xcf0 [ 227.627697] [<ffff000008245988>] alloc_pages_vma+0xc8/0x270 [ 227.627715] [<ffff00000821f604>] handle_mm_fault+0xc8c/0xfd8 [ 227.627732] [<ffff00000809a488>] do_page_fault+0x2c0/0x368 [ 227.627744] [<ffff0000080812ec>] do_mem_abort+0x6c/0xe0 [ 227.627752] Exception stack(0xffff801f55823e00 to 0xffff801f55823f30) [ 227.627763] 3e00: 0000000000000000 0000ffff92682000 ffffffffffffffff 0000ffff9252b3e8 [ 227.627774] 3e20: 0000000020000000 0000000000000000 000000000000a000 0000000000000003 [ 227.627785] 3e40: 0000000000000022 ffffffffffffffff 0000000000000123 00000000000000de [ 227.627793] 3e60: ffff000008972000 0000000000000015 ffff801f55823e90 0000000000040900 [ 227.627800] 3e80: 0000000000000000 ffff0000080836f0 0000000000000000 0000ffff92682000 [ 227.627809] 3ea0: ffffffffffffffff 0000ffff92575d8c 0000000000000000 0000000000040900 [ 227.627819] 3ec0: 0000ffff92682000 00000000000000f7 0000000000004fc0 0000000000000022 [ 227.627828] 3ee0: 0000000000000000 0000000000000000 0000ffff925f5508 f7f7f7f7f7f7f7f7 [ 227.627838] 3f00: 0000ffff92686ff0 0000000000002ab8 0101010101010101 0000000000000020 [ 227.627847] 3f20: 0000000000000000 0000000000000000 [ 227.627858] [<ffff000008083324>] el0_da+0x18/0x1c [ 227.627865] Mem-Info: [ 227.627899] active_anon:38613 inactive_anon:8174 isolated_anon:0 active_file:25148 inactive_file:64173 isolated_file:0 unevictable:742 dirty:0 writeback:0 unstable:0 slab_reclaimable:29066 slab_unreclaimable:67304 mapped:22876 shmem:2597 pagetables:1240 bounce:0 free:65582521 free_pcp:1834 free_cma:0
On 6 January 2017 at 01:07, Hanjun Guo <hanjun.guo@linaro.org> wrote: > On 2017/1/5 10:03, Hanjun Guo wrote: >> >> On 2017/1/4 21:56, Ard Biesheuvel wrote: >>> >>> On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote: >>>> >>>> On ThunderX systems with certain memory configurations we see the >>>> following BUG_ON(): >>>> >>>> kernel BUG at mm/page_alloc.c:1848! >>>> >>>> This happens for some configs with 64k page size enabled. The BUG_ON() >>>> checks if start and end page of a memmap range belongs to the same >>>> zone. >>>> >>>> The BUG_ON() check fails if a memory zone contains NOMAP regions. In >>>> this case the node information of those pages is not initialized. This >>>> causes an inconsistency of the page links with wrong zone and node >>>> information for that pages. NOMAP pages from node 1 still point to the >>>> mem zone from node 0 and have the wrong nid assigned. >>>> >>>> The reason for the mis-configuration is a change in pfn_valid() which >>>> reports pages marked NOMAP as invalid: >>>> >>>> 68709f45385a arm64: only consider memblocks with NOMAP cleared for >>>> linear mapping >>>> >>>> This causes pages marked as nomap being no longer reassigned to the >>>> new zone in memmap_init_zone() by calling __init_single_pfn(). >>>> >>>> Fixing this by implementing an arm64 specific early_pfn_valid(). This >>>> causes all pages of sections with memory including NOMAP ranges to be >>>> initialized by __init_single_page() and ensures consistency of page >>>> links to zone, node and section. >>>> >>> >>> I like this solution a lot better than the first one, but I am still >>> somewhat uneasy about having the kernel reason about attributes of >>> pages it should not touch in the first place. But the fact that >>> early_pfn_valid() is only used a single time in the whole kernel does >>> give some confidence that we are not simply moving the problem >>> elsewhere. >>> >>> Given that you are touching arch/arm/ as well as arch/arm64, could you >>> explain why only arm64 needs this treatment? Is it simply because we >>> don't have NUMA support there? >>> >>> Considering that Hisilicon D05 suffered from the same issue, I would >>> like to get some coverage there as well. Hanjun, is this something you >>> can arrange? Thanks >> >> >> Sure, we will test this patch with LTP MM stress test (which triggers >> the bug on D05), and give the feedback. > > > a update here, tested on 4.9, > > - Applied Ard's two patches only > - Applied Robert's patch only > > Both of them can work fine on D05 with NUMA enabled, which means > boot ok and LTP MM stress test is passed. > Thanks a lot Hanjun. Any comments on the performance impact (including boot time) ?
Hi Prakash, On 2017/1/6 13:22, Prakash B wrote: > Hi Hanjun, > > >> a update here, tested on 4.9, >> >> - Applied Ard's two patches only >> - Applied Robert's patch only >> >> Both of them can work fine on D05 with NUMA enabled, which means >> boot ok and LTP MM stress test is passed. > > It is not related to this patch set. > LTP "cpuset01" test crashes with latest 4.9, 4.10-rc1 and 4.10-rc2 kernels on > Thunderx 2S . Do you see any such behaviour on D05. I didn't test "cpuset01" on D05 but according to the test in Linaro, LTP full test is passed on D05 with Ard's 2 patches. > > Any idea what might be causing this issue. Since it's not happening on D05, maybe it's related to the firmware? (just a wild guess...) Thanks Hanjun > > > 227.627546] cpuset01: page allocation stalls for 10096ms, order:0, > mode:0x24200ca(GFP_HIGHUSER_MOVABLE) > [ 227.627586] CPU: 53 PID: 11017 Comm: cpuset01 Not tainted 4.9.04kNUMA+ #2 > [ 227.627591] Hardware name: www.cavium.com ThunderX Unknown/ThunderX > Unknown, BIOS 0.3 Aug 24 2016 > [ 227.627599] Call trace: > [ 227.627623] [<ffff000008089f10>] dump_backtrace+0x0/0x238 > [ 227.627640] [<ffff00000808a16c>] show_stack+0x24/0x30 > [ 227.627656] [<ffff00000846fb50>] dump_stack+0x94/0xb4 > [ 227.627679] [<ffff0000081eb4f8>] warn_alloc+0x138/0x150 > [ 227.627686] [<ffff0000081ec0a4>] __alloc_pages_nodemask+0xb04/0xcf0 > [ 227.627697] [<ffff000008245988>] alloc_pages_vma+0xc8/0x270 > [ 227.627715] [<ffff00000821f604>] handle_mm_fault+0xc8c/0xfd8 > [ 227.627732] [<ffff00000809a488>] do_page_fault+0x2c0/0x368 > [ 227.627744] [<ffff0000080812ec>] do_mem_abort+0x6c/0xe0 > [ 227.627752] Exception stack(0xffff801f55823e00 to 0xffff801f55823f30) > [ 227.627763] 3e00: 0000000000000000 0000ffff92682000 > ffffffffffffffff 0000ffff9252b3e8 > [ 227.627774] 3e20: 0000000020000000 0000000000000000 > 000000000000a000 0000000000000003 > [ 227.627785] 3e40: 0000000000000022 ffffffffffffffff > 0000000000000123 00000000000000de > [ 227.627793] 3e60: ffff000008972000 0000000000000015 > ffff801f55823e90 0000000000040900 > [ 227.627800] 3e80: 0000000000000000 ffff0000080836f0 > 0000000000000000 0000ffff92682000 > [ 227.627809] 3ea0: ffffffffffffffff 0000ffff92575d8c > 0000000000000000 0000000000040900 > [ 227.627819] 3ec0: 0000ffff92682000 00000000000000f7 > 0000000000004fc0 0000000000000022 > [ 227.627828] 3ee0: 0000000000000000 0000000000000000 > 0000ffff925f5508 f7f7f7f7f7f7f7f7 > [ 227.627838] 3f00: 0000ffff92686ff0 0000000000002ab8 > 0101010101010101 0000000000000020 > [ 227.627847] 3f20: 0000000000000000 0000000000000000 > [ 227.627858] [<ffff000008083324>] el0_da+0x18/0x1c > [ 227.627865] Mem-Info: > [ 227.627899] active_anon:38613 inactive_anon:8174 isolated_anon:0 > active_file:25148 inactive_file:64173 isolated_file:0 > unevictable:742 dirty:0 writeback:0 unstable:0 > slab_reclaimable:29066 slab_unreclaimable:67304 > mapped:22876 shmem:2597 pagetables:1240 bounce:0 > free:65582521 free_pcp:1834 free_cma:0 >
On 2017/1/6 16:37, Ard Biesheuvel wrote: > On 6 January 2017 at 01:07, Hanjun Guo <hanjun.guo@linaro.org> wrote: >> On 2017/1/5 10:03, Hanjun Guo wrote: >>> >>> On 2017/1/4 21:56, Ard Biesheuvel wrote: >>>> >>>> On 16 December 2016 at 16:54, Robert Richter <rrichter@cavium.com> wrote: >>>>> >>>>> On ThunderX systems with certain memory configurations we see the >>>>> following BUG_ON(): >>>>> >>>>> kernel BUG at mm/page_alloc.c:1848! >>>>> >>>>> This happens for some configs with 64k page size enabled. The BUG_ON() >>>>> checks if start and end page of a memmap range belongs to the same >>>>> zone. >>>>> >>>>> The BUG_ON() check fails if a memory zone contains NOMAP regions. In >>>>> this case the node information of those pages is not initialized. This >>>>> causes an inconsistency of the page links with wrong zone and node >>>>> information for that pages. NOMAP pages from node 1 still point to the >>>>> mem zone from node 0 and have the wrong nid assigned. >>>>> >>>>> The reason for the mis-configuration is a change in pfn_valid() which >>>>> reports pages marked NOMAP as invalid: >>>>> >>>>> 68709f45385a arm64: only consider memblocks with NOMAP cleared for >>>>> linear mapping >>>>> >>>>> This causes pages marked as nomap being no longer reassigned to the >>>>> new zone in memmap_init_zone() by calling __init_single_pfn(). >>>>> >>>>> Fixing this by implementing an arm64 specific early_pfn_valid(). This >>>>> causes all pages of sections with memory including NOMAP ranges to be >>>>> initialized by __init_single_page() and ensures consistency of page >>>>> links to zone, node and section. >>>>> >>>> >>>> I like this solution a lot better than the first one, but I am still >>>> somewhat uneasy about having the kernel reason about attributes of >>>> pages it should not touch in the first place. But the fact that >>>> early_pfn_valid() is only used a single time in the whole kernel does >>>> give some confidence that we are not simply moving the problem >>>> elsewhere. >>>> >>>> Given that you are touching arch/arm/ as well as arch/arm64, could you >>>> explain why only arm64 needs this treatment? Is it simply because we >>>> don't have NUMA support there? >>>> >>>> Considering that Hisilicon D05 suffered from the same issue, I would >>>> like to get some coverage there as well. Hanjun, is this something you >>>> can arrange? Thanks >>> >>> >>> Sure, we will test this patch with LTP MM stress test (which triggers >>> the bug on D05), and give the feedback. >> >> >> a update here, tested on 4.9, >> >> - Applied Ard's two patches only >> - Applied Robert's patch only >> >> Both of them can work fine on D05 with NUMA enabled, which means >> boot ok and LTP MM stress test is passed. >> > > Thanks a lot Hanjun. > > Any comments on the performance impact (including boot time) ? Didn't collect the performance data yet, any recommended test suite? Is it sysbench ok? we can test it and collect the data. Thanks Hanjun
Thanks Hanjun , On Mon, Jan 9, 2017 at 10:39 AM, Hanjun Guo <hanjun.guo@linaro.org> wrote: > Hi Prakash, > I didn't test "cpuset01" on D05 but according to the test in > Linaro, LTP full test is passed on D05 with Ard's 2 patches. > >> >> Any idea what might be causing this issue. > > > Since it's not happening on D05, maybe it's related to > the firmware? (just a wild guess...) > Used same firmware b/w 4.4 kernel and 4.9 (and above kernels) . Test passed wtih 4.4 kernel and didn't generated any crashes or dumps. If there is more observation I will send a mail or I will start a separate mail thread. Thanks, Prakash B
On 06.01.17 08:37:25, Ard Biesheuvel wrote:
> Any comments on the performance impact (including boot time) ?
I did a kernel compile test and kernel mode time increases by about
2.2%. Though this is already significant, we need a more suitable mem
benchmark here for further testing.
For boot time I dont see significant changes.
-Robert
Boot times:
pfn_valid_within():
[ 25.929134]
[ 25.548830]
[ 25.503225]
early_pfn_valid() v3:
[ 25.773814]
[ 25.548428]
[ 25.765290]
Kernel compile times (3 runs each):
pfn_valid_within():
real 6m4.088s
user 372m57.607s
sys 16m55.158s
real 6m1.532s
user 372m48.453s
sys 16m50.370s
real 6m4.061s
user 373m18.753s
sys 16m57.027s
early_pfn_valid() v3:
real 6m3.261s
user 373m15.816s
sys 16m30.019s
real 6m2.980s
user 373m9.019s
sys 16m32.992s
real 6m2.574s
user 372m45.146s
sys 16m33.218s
Hi Robert, On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote: > On 06.01.17 08:37:25, Ard Biesheuvel wrote: > > Any comments on the performance impact (including boot time) ? > > I did a kernel compile test and kernel mode time increases by about > 2.2%. Though this is already significant, we need a more suitable mem > benchmark here for further testing. Thanks for doing this. > For boot time I dont see significant changes. > > -Robert > > > Boot times: > > pfn_valid_within(): > [ 25.929134] > [ 25.548830] > [ 25.503225] > > early_pfn_valid() v3: > [ 25.773814] > [ 25.548428] > [ 25.765290] > > > Kernel compile times (3 runs each): > > pfn_valid_within(): > > real 6m4.088s > user 372m57.607s > sys 16m55.158s > > real 6m1.532s > user 372m48.453s > sys 16m50.370s > > real 6m4.061s > user 373m18.753s > sys 16m57.027s Did you reboot the machine between each build here, or only when changing kernel? If the latter, do you see variations in kernel build time by simply rebooting the same Image? Will
On 12.01.17 16:05:36, Will Deacon wrote: > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote: > > Kernel compile times (3 runs each): > > > > pfn_valid_within(): > > > > real 6m4.088s > > user 372m57.607s > > sys 16m55.158s > > > > real 6m1.532s > > user 372m48.453s > > sys 16m50.370s > > > > real 6m4.061s > > user 373m18.753s > > sys 16m57.027s > > Did you reboot the machine between each build here, or only when changing > kernel? If the latter, do you see variations in kernel build time by simply > rebooting the same Image? I built it in a loop on the shell, so no reboots between builds. Note that I was building the kernel in /dev/shm to not access harddisks. I think build times should be comparable then since there is no fs caching. -Robert
On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote: > On 12.01.17 16:05:36, Will Deacon wrote: > > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote: > > > > Kernel compile times (3 runs each): > > > > > > pfn_valid_within(): > > > > > > real 6m4.088s > > > user 372m57.607s > > > sys 16m55.158s > > > > > > real 6m1.532s > > > user 372m48.453s > > > sys 16m50.370s > > > > > > real 6m4.061s > > > user 373m18.753s > > > sys 16m57.027s > > > > Did you reboot the machine between each build here, or only when changing > > kernel? If the latter, do you see variations in kernel build time by simply > > rebooting the same Image? > > I built it in a loop on the shell, so no reboots between builds. Note > that I was building the kernel in /dev/shm to not access harddisks. I > think build times should be comparable then since there is no fs > caching. I guess I'm really asking what the standard deviation is if you *do* reboot between builds, using the same kernel. It's hard to tell whether the numbers are due to the patches, or just because of noise incurred by the way things happen to initialise. Will
On 13.01.17 09:19:04, Will Deacon wrote: > On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote: > > On 12.01.17 16:05:36, Will Deacon wrote: > > > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote: > > > > > > Kernel compile times (3 runs each): > > > > > > > > pfn_valid_within(): > > > > > > > > real 6m4.088s > > > > user 372m57.607s > > > > sys 16m55.158s > > > > > > > > real 6m1.532s > > > > user 372m48.453s > > > > sys 16m50.370s > > > > > > > > real 6m4.061s > > > > user 373m18.753s > > > > sys 16m57.027s > > > > > > Did you reboot the machine between each build here, or only when changing > > > kernel? If the latter, do you see variations in kernel build time by simply > > > rebooting the same Image? > > > > I built it in a loop on the shell, so no reboots between builds. Note > > that I was building the kernel in /dev/shm to not access harddisks. I > > think build times should be comparable then since there is no fs > > caching. > > I guess I'm really asking what the standard deviation is if you *do* reboot > between builds, using the same kernel. It's hard to tell whether the numbers > are due to the patches, or just because of noise incurred by the way things > happen to initialise. Ok, I am going to test this. -Robert
On 13.01.17 14:15:00, Robert Richter wrote: > On 13.01.17 09:19:04, Will Deacon wrote: > > On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote: > > > On 12.01.17 16:05:36, Will Deacon wrote: > > > > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote: > > > > > > > > Kernel compile times (3 runs each): > > > > > > > > > > pfn_valid_within(): > > > > > > > > > > real 6m4.088s > > > > > user 372m57.607s > > > > > sys 16m55.158s > > > > > > > > > > real 6m1.532s > > > > > user 372m48.453s > > > > > sys 16m50.370s > > > > > > > > > > real 6m4.061s > > > > > user 373m18.753s > > > > > sys 16m57.027s > > > > > > > > Did you reboot the machine between each build here, or only when changing > > > > kernel? If the latter, do you see variations in kernel build time by simply > > > > rebooting the same Image? > > > > > > I built it in a loop on the shell, so no reboots between builds. Note > > > that I was building the kernel in /dev/shm to not access harddisks. I > > > think build times should be comparable then since there is no fs > > > caching. > > > > I guess I'm really asking what the standard deviation is if you *do* reboot > > between builds, using the same kernel. It's hard to tell whether the numbers > > are due to the patches, or just because of noise incurred by the way things > > happen to initialise. > > Ok, I am going to test this. See below the data for a test with reboots between every 3 builds (9 builds per kernel). Though some deviation can be seen between reboots there is a trend. -Robert pfn_valid_within(), boot #1: real 6m0.007s user 372m55.709s sys 16m45.962s real 5m58.718s user 372m58.852s sys 16m47.675s real 5m58.481s user 372m56.172s sys 16m46.953s pfn_valid_within(), Boot #2: real 6m1.163s user 372m57.282s sys 16m52.025s real 6m0.562s user 373m4.957s sys 16m52.847s real 6m0.030s user 372m54.710s sys 16m54.516s pfn_valid_within(), Boot #3: real 6m1.784s user 373m13.379s sys 16m48.388s real 5m58.579s user 373m10.403s sys 16m47.628s real 5m59.151s user 373m0.084s sys 16m50.634s early_pfn_valid(), Boot #1: real 5m59.902s user 372m57.201s sys 16m42.157s real 5m59.510s user 372m59.762s sys 16m47.331s real 5m58.559s user 372m46.530s sys 16m49.010s early_pfn_valid(), Boot #2: real 6m0.652s user 373m10.785s sys 16m25.138s real 5m58.663s user 373m4.498s sys 16m28.262s real 5m57.675s user 373m6.174s sys 16m28.653s early_pfn_valid(), Boot #3: real 5m59.680s user 373m4.007s sys 16m26.781s real 5m58.234s user 372m58.895s sys 16m26.957s real 5m58.707s user 372m40.546s sys 16m29.345s
On Tue, Jan 17, 2017 at 11:00:15AM +0100, Robert Richter wrote: > On 13.01.17 14:15:00, Robert Richter wrote: > > On 13.01.17 09:19:04, Will Deacon wrote: > > > On Thu, Jan 12, 2017 at 07:58:25PM +0100, Robert Richter wrote: > > > > On 12.01.17 16:05:36, Will Deacon wrote: > > > > > On Mon, Jan 09, 2017 at 12:53:20PM +0100, Robert Richter wrote: > > > > > > > > > > Kernel compile times (3 runs each): > > > > > > > > > > > > pfn_valid_within(): > > > > > > > > > > > > real 6m4.088s > > > > > > user 372m57.607s > > > > > > sys 16m55.158s > > > > > > > > > > > > real 6m1.532s > > > > > > user 372m48.453s > > > > > > sys 16m50.370s > > > > > > > > > > > > real 6m4.061s > > > > > > user 373m18.753s > > > > > > sys 16m57.027s > > > > > > > > > > Did you reboot the machine between each build here, or only when changing > > > > > kernel? If the latter, do you see variations in kernel build time by simply > > > > > rebooting the same Image? > > > > > > > > I built it in a loop on the shell, so no reboots between builds. Note > > > > that I was building the kernel in /dev/shm to not access harddisks. I > > > > think build times should be comparable then since there is no fs > > > > caching. > > > > > > I guess I'm really asking what the standard deviation is if you *do* reboot > > > between builds, using the same kernel. It's hard to tell whether the numbers > > > are due to the patches, or just because of noise incurred by the way things > > > happen to initialise. > > > > Ok, I am going to test this. > > See below the data for a test with reboots between every 3 builds (9 > builds per kernel). Though some deviation can be seen between reboots > there is a trend. I can't really see the trend given that, for system time, your pfn_valid_within results have a variance of ~9 and the early_pfn_valid results have a variance of ~92. Given that the variance seems to come about due to the reboots, I think we need more numbers to establish whether the data sets end up largely overlapping or if they really are disjoint. Will
On 17.01.17 19:16:56, Will Deacon wrote: > I can't really see the trend given that, for system time, your > pfn_valid_within results have a variance of ~9 and the early_pfn_valid > results have a variance of ~92. Given that the variance seems to come > about due to the reboots, I think we need more numbers to establish whether > the data sets end up largely overlapping or if they really are disjoint. Assuming the numbers of both versions are not significant, please apply one or the other. Thanks, -Robert
On Fri, Feb 03, 2017 at 04:14:14PM +0100, Robert Richter wrote: > On 17.01.17 19:16:56, Will Deacon wrote: > > I can't really see the trend given that, for system time, your > > pfn_valid_within results have a variance of ~9 and the early_pfn_valid > > results have a variance of ~92. Given that the variance seems to come > > about due to the reboots, I think we need more numbers to establish whether > > the data sets end up largely overlapping or if they really are disjoint. > > Assuming the numbers of both versions are not significant, please > apply one or the other. I'd rather apply the pfn_valid_within version, but please can you Ack the patch first, since there was some confusion when it was posted about a translation fault that was never reproduced. Thanks, Will
diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h index 4355f0ec44d6..79761bd55f94 100644 --- a/arch/arm/include/asm/page.h +++ b/arch/arm/include/asm/page.h @@ -158,6 +158,7 @@ typedef struct page *pgtable_t; #ifdef CONFIG_HAVE_ARCH_PFN_VALID extern int pfn_valid(unsigned long); +#define early_pfn_valid(pfn) pfn_valid(pfn) #endif #include <asm/memory.h> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h index 8472c6def5ef..17ceb7435ded 100644 --- a/arch/arm64/include/asm/page.h +++ b/arch/arm64/include/asm/page.h @@ -49,6 +49,8 @@ typedef struct page *pgtable_t; #ifdef CONFIG_HAVE_ARCH_PFN_VALID extern int pfn_valid(unsigned long); +extern int early_pfn_valid(unsigned long); +#define early_pfn_valid early_pfn_valid #endif #include <asm/memory.h> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 212c4d1e2f26..8ff62a7ff634 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -145,11 +145,26 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max) #endif /* CONFIG_NUMA */ #ifdef CONFIG_HAVE_ARCH_PFN_VALID + int pfn_valid(unsigned long pfn) { return memblock_is_map_memory(pfn << PAGE_SHIFT); } EXPORT_SYMBOL(pfn_valid); + +/* + * This is the same as the generic pfn_valid() implementation. We use + * valid_section() here to make sure all pages of a section including + * NOMAP pages are initialized with __init_single_page(). + */ +int early_pfn_valid(unsigned long pfn) +{ + if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) + return 0; + return valid_section(__nr_to_section(pfn_to_section_nr(pfn))); +} +EXPORT_SYMBOL(early_pfn_valid); + #endif #ifndef CONFIG_SPARSEMEM diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 0f088f3a2fed..bedcf8a95881 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1170,12 +1170,16 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn) } #ifndef CONFIG_HAVE_ARCH_PFN_VALID + static inline int pfn_valid(unsigned long pfn) { if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) return 0; return valid_section(__nr_to_section(pfn_to_section_nr(pfn))); } + +#define early_pfn_valid(pfn) pfn_valid(pfn) + #endif static inline int pfn_present(unsigned long pfn) @@ -1200,7 +1204,6 @@ static inline int pfn_present(unsigned long pfn) #define pfn_to_nid(pfn) (0) #endif -#define early_pfn_valid(pfn) pfn_valid(pfn) void sparse_init(void); #else #define sparse_init() do {} while (0)