
[10/11] x86, mem-hotplug: Support initialize page tables from low to high.

Message ID 1377596268-31552-11-git-send-email-tangchen@cn.fujitsu.com (mailing list archive)
State Not Applicable, archived

Commit Message

tangchen Aug. 27, 2013, 9:37 a.m. UTC
init_mem_mapping() is called before SRAT is parsed, and memblock will allocate
memory for page tables. To prevent page tables from being allocated within
hotpluggable memory, we allocate them from the end of the kernel image upwards
to higher memory.

The order of page table allocation is controlled by the movablenode boot
option. Since the default behavior of the page table initialization procedure
is to allocate page tables from the top of memory downwards, the kernel will
behave as before if users don't specify the movablenode boot option.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/init.c |  119 +++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 91 insertions(+), 28 deletions(-)
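
For context, the intended sequence around init_mem_mapping() looks roughly
like the sketch below. The direct assignments and the
MEMBLOCK_ORDER_HIGH_TO_LOW name are illustrative assumptions only; the actual
interface is introduced in patches 9/11 and 11/11 of this series.

#ifdef CONFIG_MOVABLE_NODE
	/* low-to-high is the initial order when MOVABLE_NODE is configured */
	memblock.current_order = MEMBLOCK_ORDER_LOW_TO_HIGH;
#endif

	init_mem_mapping();	/* takes the bottom-up path only if the
				 * movablenode boot option was specified */

	/* ... SRAT is parsed later, during NUMA initialization ... */

	/* reset to the default top-down order (name assumed here) once
	 * the hotpluggable ranges are known */
	memblock.current_order = MEMBLOCK_ORDER_HIGH_TO_LOW;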

Comments

tangchen Sept. 6, 2013, 1:34 a.m. UTC | #1
Hi Wanpeng,

Thank you for reviewing. See below, please.

On 09/05/2013 09:30 PM, Wanpeng Li wrote:
......
>> +#ifdef CONFIG_MOVABLE_NODE
>> +	unsigned long kernel_end;
>> +
>> +	if (movablenode_enable_srat &&
>> +	    memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH) {
>
> I think memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is always true
> if MOVABLE_NODE is configured, and movablenode_enable_srat == true once
> PATCH 11/11 is applied.

memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is true here if
MOVABLE_NODE is configured, and it will be reset after SRAT is parsed. But
movablenode_enable_srat can only be true when users specify the movablenode
boot option on the kernel command line.

Please refer to patch 9/11.

>
>> +		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
>> +
>> +		memory_map_from_low(kernel_end, end);
>> +		memory_map_from_low(ISA_END_ADDRESS, kernel_end);
>
> Why split ISA_END_ADDRESS ~ end?

The first 5 pages for the page tables come from brk; please refer to
alloc_low_pages(). They are able to map about 2MB of memory, and that 2MB
will be used to store the page tables for the pages mapped next.

Here, we split [ISA_END_ADDRESS, end) into [ISA_END_ADDRESS, _end) and
[_end, end), and map [_end, end) first. This is because memory in
[ISA_END_ADDRESS, _end) may already be in use, in which case we would not
have enough memory for the page tables that come next. We map [_end, end)
first because that memory is highly likely to be unused.
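
As a side note, here is a minimal userspace sketch, separate from the patch,
of how step_size grows in memory_map_from_low(): it starts at PMD_SIZE (2MB,
roughly what the five brk pages can map), and every round that maps more
memory than everything mapped so far shifts it left by STEP_SIZE_SHIFT = 5,
i.e. multiplies it by 32:

#include <stdio.h>

#define PMD_SIZE	(1UL << 21)	/* 2MB */
#define STEP_SIZE_SHIFT	5		/* (PUD_SHIFT - PMD_SHIFT)/2, rounded up */

int main(void)
{
	unsigned long step_size = PMD_SIZE;
	int round;

	/* progression of the mapping window: 2MB, 64MB, 2GB, 64GB */
	for (round = 0; round < 4; round++) {
		printf("round %d: step_size = %lu MB\n",
		       round, step_size >> 20);
		step_size <<= STEP_SIZE_SHIFT;
	}
	return 0;
}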

>
......
>
> I think the variables sorted by address is:
> ISA_END_ADDRESS ->  _end ->  real_end ->  end

Yes.
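
In sketch form (addresses increase to the right; this comment block is added
for illustration and is not part of the patch):

/*
 *   0 .. ISA_END_ADDRESS .. _end .. real_end .. end
 *        |                  |       |           |
 *        |                  |       |           `-- top of memory to map
 *        |                  |       `-- end of the highest free PMD-sized
 *        |                  |           block (skips Xen's reserved range
 *        |                  |           near the end of RAM)
 *        |                  `-- end of the kernel image
 *        `-- end of the ISA range (always mapped unconditionally)
 */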

>
>> +	memory_map_from_high(ISA_END_ADDRESS, real_end);
>
> Does this overlap with the work done between #ifdef CONFIG_MOVABLE_NODE and
> #endif?
>

I don't think so. As you can see from the code, if the work between
#ifdef CONFIG_MOVABLE_NODE and #endif is done, it will goto out, right?
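
Condensed from the patch below, the control flow in question:

	if (movablenode_enable_srat &&
	    memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH) {
		/* bottom-up path */
		memory_map_from_low(kernel_end, end);
		memory_map_from_low(ISA_END_ADDRESS, kernel_end);
		goto out;		/* skips the top-down path below */
	}

	/* top-down path, reached only when the branch above is not taken */
	memory_map_from_high(ISA_END_ADDRESS, real_end);
	if (real_end < end)
		init_range_memory_mapping(real_end, end);

out:
	/* the ISA range is always mapped regardless of memory holes */
	init_memory_mapping(0, ISA_END_ADDRESS);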

Thanks.

tangchen Sept. 6, 2013, 3:09 a.m. UTC | #2
Hi Wanpeng,

On 09/06/2013 10:16 AM, Wanpeng Li wrote:
......
>>>> +#ifdef CONFIG_MOVABLE_NODE
>>>> +	unsigned long kernel_end;
>>>> +
>>>> +	if (movablenode_enable_srat &&
>>>> +	    memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH) {
>>>
>>> I think memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is always
>>> true if MOVABLE_NODE is configured, and movablenode_enable_srat == true
>>> once PATCH 11/11 is applied.
>>
>> memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is true here if
>> MOVABLE_NODE is configured, and it will be reset after SRAT is parsed. But
>> movablenode_enable_srat can only be true when users specify the movablenode
>> boot option on the kernel command line.
>
> You are right.
>
> I mean the change should be:
>
> +#ifdef CONFIG_MOVABLE_NODE
> +       unsigned long kernel_end;
> +
> +       if (movablenode_enable_srat) {
>
> It is unnecessary to check memblock.current_order since it is always true
> if MOVABLE_NODE is configured and movablenode_enable_srat is true.
>

But memblock.current_order is set outside init_mem_mapping(), and the path in
the if statement should only be run when the current order is from low to
high. So I think it is safe to check it here.

I prefer to keep the check, at least in the next version of the patch-set. If
others also think it is unnecessary, I'm OK with removing it. :)

Thanks. :)

Patch

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 793204b..f004d8e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -407,13 +407,77 @@  static unsigned long __init init_range_memory_mapping(
 
 /* (PUD_SHIFT-PMD_SHIFT)/2 */
 #define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+#ifdef CONFIG_MOVABLE_NODE
+/**
+ * memory_map_from_low - Map [start, end) from low to high
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function will set up the direct mapping for memory range [start, end)
+ * in a heuristic way. In the beginning, step_size is small; the more memory
+ * we map in one loop, the more memory we will map in the next loop.
+ */
+static void __init memory_map_from_low(unsigned long start, unsigned long end)
+{
+	unsigned long next, new_mapped_ram_size;
+	unsigned long mapped_ram_size = 0;
+	/* step_size needs to be small so pgt_buf from BRK can cover it */
+	unsigned long step_size = PMD_SIZE;
+
+	while (start < end) {
+		if (end - start > step_size) {
+			next = round_up(start + 1, step_size);
+			if (next > end)
+				next = end;
+		} else
+			next = end;
+
+		new_mapped_ram_size = init_range_memory_mapping(start, next);
+		start = next;
+
+		if (new_mapped_ram_size > mapped_ram_size)
+			step_size <<= STEP_SIZE_SHIFT;
+		mapped_ram_size += new_mapped_ram_size;
+	}
+}
+#endif /* CONFIG_MOVABLE_NODE */
+
+/**
+ * memory_map_from_high - Map [start, end) from high to low
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function is similar to memory_map_from_low() except it maps memory
+ * from high to low.
+ */
+static void __init memory_map_from_high(unsigned long start, unsigned long end)
 {
-	unsigned long end, real_end, start, last_start;
-	unsigned long step_size;
-	unsigned long addr;
+	unsigned long prev, new_mapped_ram_size;
 	unsigned long mapped_ram_size = 0;
-	unsigned long new_mapped_ram_size;
+	/* step_size needs to be small so pgt_buf from BRK can cover it */
+	unsigned long step_size = PMD_SIZE;
+
+	while (start < end) {
+		if (end > step_size) {
+			prev = round_down(end - 1, step_size);
+			if (prev < start)
+				prev = start;
+		} else
+			prev = start;
+
+		new_mapped_ram_size = init_range_memory_mapping(prev, end);
+		end = prev;
+
+		if (new_mapped_ram_size > mapped_ram_size)
+			step_size <<= STEP_SIZE_SHIFT;
+		mapped_ram_size += new_mapped_ram_size;
+	}
+}
+
+void __init init_mem_mapping(void)
+{
+	unsigned long end;
 
 	probe_page_size_mask();
 
@@ -423,44 +487,43 @@  void __init init_mem_mapping(void)
 	end = max_low_pfn << PAGE_SHIFT;
 #endif
 
-	/* the ISA range is always mapped regardless of memory holes */
-	init_memory_mapping(0, ISA_END_ADDRESS);
+	max_pfn_mapped = 0; /* will get exact value next */
+	min_pfn_mapped = end >> PAGE_SHIFT;
+
+#ifdef CONFIG_MOVABLE_NODE
+	unsigned long kernel_end;
+
+	if (movablenode_enable_srat &&
+	    memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH) {
+		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
+
+		memory_map_from_low(kernel_end, end);
+		memory_map_from_low(ISA_END_ADDRESS, kernel_end);
+		goto out;
+	}
+#endif /* CONFIG_MOVABLE_NODE */
+
+	unsigned long addr, real_end;
 
 	/* xen has big range in reserved near end of ram, skip it at first.*/
 	addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
 	real_end = addr + PMD_SIZE;
 
-	/* step_size need to be small so pgt_buf from BRK could cover it */
-	step_size = PMD_SIZE;
-	max_pfn_mapped = 0; /* will get exact value next */
-	min_pfn_mapped = real_end >> PAGE_SHIFT;
-	last_start = start = real_end;
-
 	/*
 	 * We start from the top (end of memory) and go to the bottom.
 	 * The memblock_find_in_range() gets us a block of RAM from the
 	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
 	 * for page table.
 	 */
-	while (last_start > ISA_END_ADDRESS) {
-		if (last_start > step_size) {
-			start = round_down(last_start - 1, step_size);
-			if (start < ISA_END_ADDRESS)
-				start = ISA_END_ADDRESS;
-		} else
-			start = ISA_END_ADDRESS;
-		new_mapped_ram_size = init_range_memory_mapping(start,
-							last_start);
-		last_start = start;
-		/* only increase step_size after big range get mapped */
-		if (new_mapped_ram_size > mapped_ram_size)
-			step_size <<= STEP_SIZE_SHIFT;
-		mapped_ram_size += new_mapped_ram_size;
-	}
+	memory_map_from_high(ISA_END_ADDRESS, real_end);
 
 	if (real_end < end)
 		init_range_memory_mapping(real_end, end);
 
+out:
+	/* the ISA range is always mapped regardless of memory holes */
+	init_memory_mapping(0, ISA_END_ADDRESS);
+
 #ifdef CONFIG_X86_64
 	if (max_pfn > max_low_pfn) {
 		/* can we preseve max_low_pfn ?*/