diff mbox

[4/6] x86/mem-hotplug: Support initialize page tables bottom up

Message ID 5241649B.3090302@cn.fujitsu.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Yanfei Zhang Sept. 24, 2013, 10:08 a.m. UTC
From: Tang Chen <tangchen@cn.fujitsu.com>

The Linux kernel cannot migrate pages used by the kernel. As a
result, kernel pages cannot be hot-removed. So we cannot allocate
hotpluggable memory for the kernel.

In a memory hotplug system, any numa node the kernel resides in
should be unhotpluggable. And for a modern server, each node could
have at least 16GB memory. So memory around the kernel image is
highly likely unhotpluggable.

ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before SRAT is parsed, memblock has already
started to allocate memory for the kernel. So we need to prevent
memblock from doing this.

So direct memory mapping page tables setup is the case. init_mem_mapping()
is called before SRAT is parsed. To prevent page tables being allocated
within hotpluggable memory, we will use bottom-up direction to allocate
page tables from the end of kernel image to the higher memory.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/init.c |   65 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 60 insertions(+), 5 deletions(-)

Comments

Tejun Heo Sept. 24, 2013, 12:33 p.m. UTC | #1
Hello,

On Tue, Sep 24, 2013 at 06:08:27PM +0800, Zhang Yanfei wrote:
> +#ifdef CONFIG_MOVABLE_NODE

You don't need the above ifdef.  The compiler will be able to cull the
code as long as memblock_bottom_up() is defined as constant
expression when !MOVABLE_NODE.

> +/**
> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
> + * @map_start: start address of the target memory range
> + * @map_end: end address of the target memory range
> + *
> + * This function will setup direct mapping for memory range [map_start, map_end)
> + * in a heuristic way. In the beginning, step_size is small. The more memory we
> + * map memory in the next loop.

The same comment as before.  Now we have two function with the
identical comment but behaving differently, which isn't nice.

...
> +	 * If the allocation is in bottom-up direction, we start from the
> +	 * bottom and go to the top: first [kernel_end, end) and then
> +	 * [ISA_END_ADDRESS, kernel_end). Otherwise, we start from the top
> +	 * (end of memory) and go to the bottom.
> +	 *
> +	 * The memblock_find_in_range() gets us a block of RAM in
> +	 * [min_pfn_mapped, max_pfn_mapped) used as new pages for page table.
>  	 */
> -	memory_map_top_down(ISA_END_ADDRESS, end);
> +	if (memblock_bottom_up()) {
> +		unsigned long kernel_end;
> +
> +		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
> +		memory_map_bottom_up(kernel_end, end);
> +		memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);

Hmm... so, this is kinda weird.  We're doing it in two chunks and
mapping memory between ISA_END_ADDRESS and kernel_end right on top of
ISA_END_ADDRESS?  Can't you give enough information to the mapping
function so that it can map everything on top of kernel_end in single
go?

Thanks.
Zhang Yanfei Sept. 24, 2013, 1:23 p.m. UTC | #2
On 09/24/2013 08:33 PM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 24, 2013 at 06:08:27PM +0800, Zhang Yanfei wrote:
>> +#ifdef CONFIG_MOVABLE_NODE
> 
> You don't need the above ifdef.  The compiler will be able to cull the
> code as long as memblock_bottom_up() is defined as constant
> expression when !MOVABLE_NODE.

yeah, will remove the #if. thanks.

> 
>> +/**
>> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
>> + * @map_start: start address of the target memory range
>> + * @map_end: end address of the target memory range
>> + *
>> + * This function will setup direct mapping for memory range [map_start, map_end)
>> + * in a heuristic way. In the beginning, step_size is small. The more memory we
>> + * map memory in the next loop.
> 
> The same comment as before.  Now we have two function with the
> identical comment but behaving differently, which isn't nice.

OK, will change them.

> 
> ...
>> +	 * If the allocation is in bottom-up direction, we start from the
>> +	 * bottom and go to the top: first [kernel_end, end) and then
>> +	 * [ISA_END_ADDRESS, kernel_end). Otherwise, we start from the top
>> +	 * (end of memory) and go to the bottom.
>> +	 *
>> +	 * The memblock_find_in_range() gets us a block of RAM in
>> +	 * [min_pfn_mapped, max_pfn_mapped) used as new pages for page table.
>>  	 */
>> -	memory_map_top_down(ISA_END_ADDRESS, end);
>> +	if (memblock_bottom_up()) {
>> +		unsigned long kernel_end;
>> +
>> +		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
>> +		memory_map_bottom_up(kernel_end, end);
>> +		memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
> 
> Hmm... so, this is kinda weird.  We're doing it in two chunks and
> mapping memory between ISA_END_ADDRESS and kernel_end right on top of
> ISA_END_ADDRESS?  Can't you give enough information to the mapping
> function so that it can map everything on top of kernel_end in single
> go?

You mean we should call memory_map_bottom_up like this:

memory_map_bottom_up(ISA_END_ADDRESS, end)

right?

> 
> Thanks.
>
Tejun Heo Sept. 24, 2013, 1:27 p.m. UTC | #3
On Tue, Sep 24, 2013 at 09:23:48PM +0800, Zhang Yanfei wrote:
> > Hmm... so, this is kinda weird.  We're doing it in two chunks and
> > mapping memory between ISA_END_ADDRESS and kernel_end right on top of
> > ISA_END_ADDRESS?  Can't you give enough information to the mapping
> > function so that it can map everything on top of kernel_end in single
> > go?
> 
> You mean we should call memory_map_bottom_up like this:
> 
> memory_map_bottom_up(ISA_END_ADDRESS, end)
> 
> right?

But that wouldn't be ideal as we want the page tables above kernel
image and the above would allocate it above ISA_END_ADDRESS, right?
Maybe memory_map_bottom_up() should take extra parameters for where to
allocate page tables at separately from the mapping range and treat it
specially?  Would that make the function a lot more complex?

Thanks.
Zhang Yanfei Sept. 24, 2013, 1:34 p.m. UTC | #4
On 09/24/2013 09:27 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 09:23:48PM +0800, Zhang Yanfei wrote:
>>> Hmm... so, this is kinda weird.  We're doing it in two chunks and
>>> mapping memory between ISA_END_ADDRESS and kernel_end right on top of
>>> ISA_END_ADDRESS?  Can't you give enough information to the mapping
>>> function so that it can map everything on top of kernel_end in single
>>> go?
>>
>> You mean we should call memory_map_bottom_up like this:
>>
>> memory_map_bottom_up(ISA_END_ADDRESS, end)
>>
>> right?
> 
> But that wouldn't be ideal as we want the page tables above kernel
> image and the above would allocate it above ISA_END_ADDRESS, right?

The original idea is we will allocate everything above the kernel. So
the pagetables for [ISA_END_ADDRESS, kernel_end) will be also located
above the kernel.

> Maybe memory_map_bottom_up() should take extra parameters for where to
> allocate page tables at separately from the mapping range and treat it
> specially?  Would that make the function a lot more complex?

Hmmmm...I will try to see if it is complex.

Thanks.

> 
> Thanks.
>
Tejun Heo Sept. 24, 2013, 1:39 p.m. UTC | #5
Hello,

On Tue, Sep 24, 2013 at 09:34:46PM +0800, Zhang Yanfei wrote:
> > But that wouldn't be ideal as we want the page tables above kernel
> > image and the above would allocate it above ISA_END_ADDRESS, right?
> 
> The original idea is we will allocate everything above the kernel. So
> the pagetables for [ISA_END_ADDRESS, kernel_end) will be also located
> above the kernel.

I'm a bit confused why we need two separate calls then.  What's the
difference from calling with the whole range?

Thanks.
Zhang Yanfei Sept. 24, 2013, 1:53 p.m. UTC | #6
Hi tejun,

On 09/24/2013 09:39 PM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 24, 2013 at 09:34:46PM +0800, Zhang Yanfei wrote:
>>> But that wouldn't be ideal as we want the page tables above kernel
>>> image and the above would allocate it above ISA_END_ADDRESS, right?
>>
>> The original idea is we will allocate everything above the kernel. So
>> the pagetables for [ISA_END_ADDRESS, kernel_end) will be also located
>> above the kernel.
> 
> I'm a bit confused why we need two separate calls then.  What's the
> difference from calling with the whole range?

OK, this is just because we allocate pagtables just above the kernel.
And if we use up the BRK space that is reserved for inital pagetables,
we will use memblock to allocate memory for pagetables, and the memory
allocated here should be mapped already. So we first map [kernel_end, end)
to make memory above the kernel be mapped as soon as possible. And then
use pagetables allocated above the kernel to map [ISA_END_ADDRESS, kernel_end).

Thanks.

> 
> Thanks.
>
Tejun Heo Sept. 24, 2013, 2:05 p.m. UTC | #7
On Tue, Sep 24, 2013 at 09:53:47PM +0800, Zhang Yanfei wrote:
> OK, this is just because we allocate pagtables just above the kernel.
> And if we use up the BRK space that is reserved for inital pagetables,
> we will use memblock to allocate memory for pagetables, and the memory
> allocated here should be mapped already. So we first map [kernel_end, end)
> to make memory above the kernel be mapped as soon as possible. And then
> use pagetables allocated above the kernel to map [ISA_END_ADDRESS, kernel_end).

I see.  The code seems fine to me then.  Can you please add comment
explaining why the split calls are necessary?

Thanks.
Zhang Yanfei Sept. 24, 2013, 2:07 p.m. UTC | #8
On 09/24/2013 10:05 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 09:53:47PM +0800, Zhang Yanfei wrote:
>> OK, this is just because we allocate pagtables just above the kernel.
>> And if we use up the BRK space that is reserved for inital pagetables,
>> we will use memblock to allocate memory for pagetables, and the memory
>> allocated here should be mapped already. So we first map [kernel_end, end)
>> to make memory above the kernel be mapped as soon as possible. And then
>> use pagetables allocated above the kernel to map [ISA_END_ADDRESS, kernel_end).
> 
> I see.  The code seems fine to me then.  Can you please add comment
> explaining why the split calls are necessary?

Okay, will add the comment.

Thanks.
diff mbox

Patch

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 73e79e6..7441865 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -451,6 +451,50 @@  static void __init memory_map_top_down(unsigned long map_start,
 		init_range_memory_mapping(real_end, map_end);
 }
 
+
+#ifdef CONFIG_MOVABLE_NODE
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range [map_start, map_end)
+ * in a heuristic way. In the beginning, step_size is small. The more memory we
+ * map memory in the next loop.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+					unsigned long map_end)
+{
+	unsigned long next, new_mapped_ram_size, start;
+	unsigned long mapped_ram_size = 0;
+	/* step_size need to be small so pgt_buf from BRK could cover it */
+	unsigned long step_size = PMD_SIZE;
+	start = map_start;
+
+	while (start < map_end) {
+		if (map_end - start > step_size) {
+			next = round_up(start + 1, step_size);
+			if (next > map_end)
+				next = map_end;
+		} else
+			next = map_end;
+
+		new_mapped_ram_size = init_range_memory_mapping(start, next);
+		min_pfn_mapped = start >> PAGE_SHIFT;
+		start = next;
+
+		if (new_mapped_ram_size > mapped_ram_size)
+			step_size <<= STEP_SIZE_SHIFT;
+		mapped_ram_size += new_mapped_ram_size;
+	}
+}
+#else
+static void __init memory_map_bottom_up(unsigned long map_start,
+					unsigned long map_end)
+{
+}
+#endif /* CONFIG_MOVABLE_NODE */
+
 void __init init_mem_mapping(void)
 {
 	unsigned long end;
@@ -467,12 +511,23 @@  void __init init_mem_mapping(void)
 	init_memory_mapping(0, ISA_END_ADDRESS);
 
 	/*
-	 * We start from the top (end of memory) and go to the bottom.
-	 * The memblock_find_in_range() gets us a block of RAM from the
-	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
-	 * for page table.
+	 * If the allocation is in bottom-up direction, we start from the
+	 * bottom and go to the top: first [kernel_end, end) and then
+	 * [ISA_END_ADDRESS, kernel_end). Otherwise, we start from the top
+	 * (end of memory) and go to the bottom.
+	 *
+	 * The memblock_find_in_range() gets us a block of RAM in
+	 * [min_pfn_mapped, max_pfn_mapped) used as new pages for page table.
 	 */
-	memory_map_top_down(ISA_END_ADDRESS, end);
+	if (memblock_bottom_up()) {
+		unsigned long kernel_end;
+
+		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
+		memory_map_bottom_up(kernel_end, end);
+		memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
+	} else {
+		memory_map_top_down(ISA_END_ADDRESS, end);
+	}
 
 #ifdef CONFIG_X86_64
 	if (max_pfn > max_low_pfn) {