
[PATCHv2,0/7] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

Message ID 1547183577-20309-1-git-send-email-kernelfans@gmail.com (mailing list archive)

Message

Pingfan Liu Jan. 11, 2019, 5:12 a.m. UTC
Background
After [1], the KASLR kernel image can be guaranteed to sit inside an
unmovable node. But if the kernel image is located near the end of that
unmovable node, the bottom-up allocator may create page tables which cross
the boundary between the unmovable node and a movable node.  It is a
probability issue, depending on two factors:
  1. how big the gap is between the kernel end and the unmovable node's end.
  2. how much memory the system owns.
An alternative way to fix this issue is to increase the gap in
boot/compressed/kaslr*.  But in the scenario of PB-level memory, the page
tables will take several MB even when using 1GB pages, and different page
attributes and fragmentation will make things worse.  So it is hard to
decide by how much the gap should increase.
The following figure shows the defect of the current bottom-up style:
  [startA, endA][startB, "kaslr kernel very close to" endB][startC, endC]

If nodes A and B are unmovable while node C is movable, then init_mem_mapping()
can generate page tables on node C, which pollutes the movable node.

This series makes the outcome a certainty instead of a probability. It achieves
this by pushing the parsing of the memory hotplug info forward, ahead of
init_mem_mapping().
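
To make the defect concrete, here is a toy user-space illustration (not part
of the series; the node layout, sizes and numbers are made up) of how a
bottom-up allocator that starts right after the kernel image spills into the
movable node when KASLR places the image close to the unmovable node's end:

#include <stdio.h>

/* Illustration only: 1 unit = 1 MiB.
 * Unmovable memory: [0, 8192), movable (hotpluggable) node: [8192, 16384).
 */
#define UNMOVABLE_END   8192UL

int main(void)
{
        unsigned long kernel_end = 8100; /* KASLR placed the image near the node end */
        unsigned long pgt_size   = 200;  /* page-table pages for the direct mapping  */

        /* Bottom-up style: allocate upward, starting from the kernel end. */
        unsigned long pgt_start = kernel_end;
        unsigned long pgt_end   = pgt_start + pgt_size;

        printf("page tables occupy [%lu, %lu) MiB\n", pgt_start, pgt_end);
        if (pgt_end > UNMOVABLE_END)
                printf("-> %lu MiB of unmovable page tables land in the movable node\n",
                       pgt_end - UNMOVABLE_END);
        return 0;
}

With the hotplug info parsed first, the memblock allocator can be restricted
to the unmovable range before init_mem_mapping() runs, so the spill can no
longer happen regardless of where KASLR puts the image.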

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: x86@kernel.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-mm@kvack.org
Pingfan Liu (7):
  x86/mm: concentrate the code to memblock allocator enabled
  acpi: change the topo of acpi_table_upgrade()
  mm/memblock: introduce allocation boundary for tracing purpose
  x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  x86/mm: set allowed range for memblock allocator
  x86/mm: remove bottom-up allocation style for x86_64
  x86/mm: isolate the bottom-up style to init_32.c

 arch/arm/mm/init.c              |   3 +-
 arch/arm/mm/mmu.c               |   4 +-
 arch/arm/mm/nommu.c             |   2 +-
 arch/arm64/kernel/setup.c       |   2 +-
 arch/csky/kernel/setup.c        |   2 +-
 arch/microblaze/mm/init.c       |   2 +-
 arch/mips/kernel/setup.c        |   2 +-
 arch/powerpc/mm/40x_mmu.c       |   6 +-
 arch/powerpc/mm/44x_mmu.c       |   2 +-
 arch/powerpc/mm/8xx_mmu.c       |   2 +-
 arch/powerpc/mm/fsl_booke_mmu.c |   5 +-
 arch/powerpc/mm/hash_utils_64.c |   4 +-
 arch/powerpc/mm/init_32.c       |   2 +-
 arch/powerpc/mm/pgtable-radix.c |   2 +-
 arch/powerpc/mm/ppc_mmu_32.c    |   8 +-
 arch/powerpc/mm/tlb_nohash.c    |   6 +-
 arch/unicore32/mm/mmu.c         |   2 +-
 arch/x86/kernel/setup.c         |  93 ++++++++++++++---------
 arch/x86/mm/init.c              | 163 +++++-----------------------------------
 arch/x86/mm/init_32.c           | 147 ++++++++++++++++++++++++++++++++++++
 arch/x86/mm/mm_internal.h       |   8 +-
 arch/xtensa/mm/init.c           |   2 +-
 drivers/acpi/tables.c           |   4 +-
 include/linux/acpi.h            |   5 +-
 include/linux/memblock.h        |  10 ++-
 mm/memblock.c                   |  23 ++++--
 26 files changed, 290 insertions(+), 221 deletions(-)

Comments

Dave Hansen Jan. 14, 2019, 11:02 p.m. UTC | #1
On 1/10/19 9:12 PM, Pingfan Liu wrote:
> Background
> When kaslr kernel can be guaranteed to sit inside unmovable node
> after [1].

What does this "[1]" refer to?

Also, can you clarify your terminology here a bit.  By "kaslr kernel",
do you mean the base address?

> But if kaslr kernel is located near the end of the movable node,
> then bottom-up allocator may create pagetable which crosses the boundary
> between unmovable node and movable node.

Again, I'm confused.  Do you literally mean a single page table page?  I
think you mean the page tables, but it would be nice to clarify this,
and also explicitly state which page tables these are.

>  It is a probability issue,
> two factors include -1. how big the gap between kernel end and
> unmovable node's end.  -2. how many memory does the system own.
> Alternative way to fix this issue is by increasing the gap by
> boot/compressed/kaslr*.

Oh, you mean the KASLR code in arch/x86/boot/compressed/kaslr*.[ch]?

It took me a minute to figure out you were talking about filenames.

> But taking the scenario of PB level memory, the pagetable will take
> server MB even if using 1GB page, different page attr and fragment
> will make things worse. So it is hard to decide how much should the
> gap increase.
I'm not following this.  If we move the image around, we leave holes.
Why do we need page table pages allocated to cover these holes?

> The following figure show the defection of current bottom-up style:
>   [startA, endA][startB, "kaslr kernel verly close to" endB][startC, endC]

"defection"?

> If nodeA,B is unmovable, while nodeC is movable, then init_mem_mapping()
> can generate pgtable on nodeC, which stain movable node.

Let me see if I can summarize this:
1. The kernel ASLR decompression code picks a spot to place the kernel
   image in physical memory.
2. Some page tables are dynamically allocated near (after) this spot.
3. Sometimes, based on the random ASLR location, these page tables fall
   over into the "movable node" area.  Being unmovable allocations, this
   is not cool.
4. To fix this (on 64-bit at least), we stop allocating page tables
   based on the location of the kernel image.  Instead, we allocate
   using the memblock allocator itself, which knows how to avoid the
   movable node.

> This patch makes it certainty instead of a probablity problem. It achieves
> this by pushing forward the parsing of mem hotplug info ahead of init_mem_mapping().

What does memory hotplug have to do with this?  I thought this was all
about early boot.
Pingfan Liu Jan. 15, 2019, 6:06 a.m. UTC | #2
On Tue, Jan 15, 2019 at 7:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/10/19 9:12 PM, Pingfan Liu wrote:
> > Background
> > When kaslr kernel can be guaranteed to sit inside unmovable node
> > after [1].
>
> What does this "[1]" refer to?
>
https://lore.kernel.org/patchwork/patch/1029376/

> Also, can you clarify your terminology here a bit.  By "kaslr kernel",
> do you mean the base address?
>
It should be the randomization of the load address. I googled and found out
that the term is "base address".

> > But if kaslr kernel is located near the end of the movable node,
> > then bottom-up allocator may create pagetable which crosses the boundary
> > between unmovable node and movable node.
>
> Again, I'm confused.  Do you literally mean a single page table page?  I
> think you mean the page tables, but it would be nice to clarify this,
> and also explicitly state which page tables these are.
>
It should be page table pages. The page tables are built by init_mem_mapping().

> >  It is a probability issue,
> > two factors include -1. how big the gap between kernel end and
> > unmovable node's end.  -2. how many memory does the system own.
> > Alternative way to fix this issue is by increasing the gap by
> > boot/compressed/kaslr*.
>
> Oh, you mean the KASLR code in arch/x86/boot/compressed/kaslr*.[ch]?
>
Sorry, and yes, code in arch/x86/boot/compressed/kaslr_64.c and kaslr.c

> It took me a minute to figure out you were talking about filenames.
>
> > But taking the scenario of PB level memory, the pagetable will take
> > server MB even if using 1GB page, different page attr and fragment
> > will make things worse. So it is hard to decide how much should the
> > gap increase.
> I'm not following this.  If we move the image around, we leave holes.
> Why do we need page table pages allocated to cover these holes?
>
I mean that in arch/x86/boot/compressed/kaslr.c, store_slot_info() computes:
  slot_area.num = (region->size - image_size) / CONFIG_PHYSICAL_ALIGN + 1
Let us denote the size of the page tables as "X"; then the formula would have
to become:
  slot_area.num = (region->size - image_size - X) / CONFIG_PHYSICAL_ALIGN + 1
And it is hard to decide X due to the factors above.
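
Just to illustrate the arithmetic (the region size, image size and X below
are made up; CONFIG_PHYSICAL_ALIGN defaults to 2 MiB on x86_64), a toy
snippet showing how reserving X shrinks the slot count:

#include <stdio.h>

int main(void)
{
        unsigned long align      =  2UL << 20; /* CONFIG_PHYSICAL_ALIGN           */
        unsigned long region     =  1UL << 30; /* a 1 GiB free region             */
        unsigned long image_size = 64UL << 20; /* decompressed kernel image       */
        unsigned long x          = 16UL << 20; /* hypothetical page-table reserve */

        printf("slots without X: %lu\n", (region - image_size) / align + 1);
        printf("slots with X:    %lu\n", (region - image_size - x) / align + 1);
        return 0;
}

The hard part is choosing X up front: page attributes and fragmentation can
grow the page-table footprint, so any fixed reservation is either wasteful
or too small.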

> > The following figure show the defection of current bottom-up style:
> >   [startA, endA][startB, "kaslr kernel verly close to" endB][startC, endC]
>
> "defection"?
>
Oh, defect.

> > If nodeA,B is unmovable, while nodeC is movable, then init_mem_mapping()
> > can generate pgtable on nodeC, which stain movable node.
>
> Let me see if I can summarize this:
> 1. The kernel ASLR decompression code picks a spot to place the kernel
>    image in physical memory.
> 2. Some page tables are dynamically allocated near (after) this spot.
> 3. Sometimes, based on the random ASLR location, these page tables fall
>    over into the "movable node" area.  Being unmovable allocations, this
>    is not cool.
> 4. To fix this (on 64-bit at least), we stop allocating page tables
>    based on the location of the kernel image.  Instead, we allocate
>    using the memblock allocator itself, which knows how to avoid the
>    movable node.
>
Yes, you got my idea exactly. Thanks for your help summarizing it; it is hard
for me to express it clearly in English.

> > This patch makes it certainty instead of a probablity problem. It achieves
> > this by pushing forward the parsing of mem hotplug info ahead of init_mem_mapping().
>
> What does memory hotplug have to do with this?  I thought this was all
> about early boot.

The info about which memory is hot-pluggable is handed to the memblock
allocator in initmem_init() -> ... -> acpi_numa_memory_affinity_init(), where
memblock_mark_hotplug() records it. Later, when the memblock allocator works,
__next_mem_range() will check this info via memblock_is_hotpluggable().
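
A rough user-space model of that check (illustration only, not the kernel
code; it merely mimics what memblock_mark_hotplug(), memblock_is_hotpluggable()
and __next_mem_range() do, with a made-up two-node layout):

#include <stdio.h>
#include <stdbool.h>

#define MEMBLOCK_HOTPLUG 0x1  /* flag recorded by memblock_mark_hotplug() */

struct region {
        unsigned long base, size, flags;
};

/* Made-up layout: one unmovable node followed by one hot-pluggable node. */
static struct region regions[] = {
        { .base = 0,         .size = 8UL << 30, .flags = 0 },
        { .base = 8UL << 30, .size = 8UL << 30, .flags = MEMBLOCK_HOTPLUG },
};

static bool is_hotpluggable(const struct region *r)
{
        return r->flags & MEMBLOCK_HOTPLUG;
}

/* Roughly what __next_mem_range() does once the SRAT info is in memblock:
 * skip hot-pluggable regions when movable-node handling is enabled.
 */
static const struct region *find_region(unsigned long size, bool movable_node)
{
        for (unsigned int i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
                if (movable_node && is_hotpluggable(&regions[i]))
                        continue;
                if (regions[i].size >= size)
                        return &regions[i];
        }
        return NULL;
}

int main(void)
{
        const struct region *r = find_region(4UL << 20, true);

        if (r)
                printf("page tables placed in region at base %lu (not hot-pluggable)\n",
                       r->base);
        return 0;
}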

Thanks and regards,
Pingfan