mm, page_alloc: clear zone_movable_pfn if the node doesn't have ZONE_MOVABLE

Message ID 20181216125624.3416-1-richard.weiyang@gmail.com (mailing list archive)
State New, archived

Commit Message

Wei Yang Dec. 16, 2018, 12:56 p.m. UTC
A non-zero zone_movable_pfn entry indicates that the node has
ZONE_MOVABLE, but the current implementation doesn't comply with this
rule when the kernel parameter "kernelcore=" is used.

The current implementation doesn't harm the system, since the stale
value in zone_movable_pfn is out of the range of the node's zones.
However, the user would see this message during boot even though the
node doesn't have ZONE_MOVABLE:

    Movable zone start for each node
      Node 0: 0x0000000080000000

This fix takes advantage of the highest bit of a pfn to mark entries
that hold intermediate calculation state rather than a final result,
and clears every entry whose highest bit is still set once the
calculation is complete.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c    | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

Comments

Michal Hocko Dec. 17, 2018, 10:25 a.m. UTC | #1
On Sun 16-12-18 20:56:24, Wei Yang wrote:
> A non-zero zone_movable_pfn entry indicates that the node has
> ZONE_MOVABLE, but the current implementation doesn't comply with this
> rule when the kernel parameter "kernelcore=" is used.
> 
> The current implementation doesn't harm the system, since the stale
> value in zone_movable_pfn is out of the range of the node's zones.
> However, the user would see this message during boot even though the
> node doesn't have ZONE_MOVABLE:
> 
>     Movable zone start for each node
>       Node 0: 0x0000000080000000

I am sorry but the above description confuses me more than it helps.
Could you start over again and describe the user-visible problem, then
follow up with the underlying bug and finally continue with a proposed
fix?
Wei Yang Dec. 17, 2018, 2:18 p.m. UTC | #2
On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
>On Sun 16-12-18 20:56:24, Wei Yang wrote:
>> A non-zero zone_movable_pfn entry indicates that the node has
>> ZONE_MOVABLE, but the current implementation doesn't comply with this
>> rule when the kernel parameter "kernelcore=" is used.
>> 
>> The current implementation doesn't harm the system, since the stale
>> value in zone_movable_pfn is out of the range of the node's zones.
>> However, the user would see this message during boot even though the
>> node doesn't have ZONE_MOVABLE:
>> 
>>     Movable zone start for each node
>>       Node 0: 0x0000000080000000
>
>I am sorry but the above description confuses me more than it helps.
>Could you start over again and describe the user-visible problem, then
>follow up with the underlying bug and finally continue with a proposed
>fix?

Yep, how about this one:

For example, a machine with 8G RAM, 2 nodes with 4G on each, if we pass
"kernelcore=2G" as kernel parameter, the dmesg looks like:

     Movable zone start for each node
       Node 0: 0x0000000080000000
       Node 1: 0x0000000100000000

This looks as if both Node 0 and Node 1 have ZONE_MOVABLE, while the
following dmesg shows that only Node 1 has ZONE_MOVABLE.

     On node 0 totalpages: 524190
       DMA zone: 64 pages used for memmap
       DMA zone: 21 pages reserved
       DMA zone: 3998 pages, LIFO batch:0
       DMA32 zone: 8128 pages used for memmap
       DMA32 zone: 520192 pages, LIFO batch:63
     
     On node 1 totalpages: 524255
       DMA32 zone: 4096 pages used for memmap
       DMA32 zone: 262111 pages, LIFO batch:63
       Movable zone: 4096 pages used for memmap
       Movable zone: 262144 pages, LIFO batch:63

The good news is that the current result doesn't harm the ZONE_MOVABLE
calculation, but it confuses users and may lead to code inconsistency.
For example, in adjust_zone_range_for_zone_movable() the comment says
"Only adjust if ZONE_MOVABLE is on this node" and this is decided by
checking zone_movable_pfn, but as the boot log above shows, this
assumption doesn't hold in all cases.
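
The check in question is roughly (a simplified excerpt, not the
verbatim kernel code):

    /* Only adjust if ZONE_MOVABLE is on this node */
    if (zone_movable_pfn[nid]) {
            /* size ZONE_MOVABLE and trim the kernel zones around it */
            ...
    }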

The cause of this problem is that zone_movable_pfn is also used during
the iteration to record how far each node has already been consumed and
to avoid double accounting, but this temporary data is not cleared
afterwards.
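
Concretely, find_zone_movable_pfns_for_nodes() uses the array both as a
resume point and as a progress marker; roughly (these are the two spots
touched by the patch below):

    /* resume where the previous pass over this node stopped */
    start_pfn = max(start_pfn, zone_movable_pfn[nid]);
    ...
    /*
     * The whole range went to kernelcore; remember the progress so a
     * restart of the loop does not double account it.
     */
    zone_movable_pfn[nid] = end_pfn;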

There are several ways to fix this issue. This patch proposes the one
with the minimal change to the current code, by taking advantage of the
highest bit of zone_movable_pfn: while an entry only holds temporary
calculation data, its highest bit is set; after the entire calculation
is complete, every entry that still has the highest bit set is cleared.
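
Just to illustrate the bit manipulation in isolation (a minimal
userspace sketch with made-up values, not the patch itself):

    #include <stdio.h>
    #include <limits.h>

    /* mirrors the proposed zone_movable_pfn_highestbit marker */
    #define PFN_TEMP_BIT (1UL << (sizeof(unsigned long) * CHAR_BIT - 1))

    int main(void)
    {
            unsigned long zone_movable_pfn = 0x80000;  /* intermediate value */

            zone_movable_pfn |= PFN_TEMP_BIT;          /* mark as temporary */

            /* ... iteration keeps using the low bits as before ... */

            /* final pass: still marked, so this node gets no ZONE_MOVABLE */
            if (zone_movable_pfn & PFN_TEMP_BIT)
                    zone_movable_pfn = 0;

            printf("zone_movable_pfn = %#lx\n", zone_movable_pfn);
            return 0;
    }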

>-- 
>Michal Hocko
>SUSE Labs
Michal Hocko Dec. 18, 2018, 12:14 p.m. UTC | #3
On Mon 17-12-18 14:18:02, Wei Yang wrote:
> On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
> >On Sun 16-12-18 20:56:24, Wei Yang wrote:
> >> A non-zero zone_movable_pfn entry indicates that the node has
> >> ZONE_MOVABLE, but the current implementation doesn't comply with this
> >> rule when the kernel parameter "kernelcore=" is used.
> >> 
> >> The current implementation doesn't harm the system, since the stale
> >> value in zone_movable_pfn is out of the range of the node's zones.
> >> However, the user would see this message during boot even though the
> >> node doesn't have ZONE_MOVABLE:
> >> 
> >>     Movable zone start for each node
> >>       Node 0: 0x0000000080000000
> >
> >I am sorry but the above description confuses me more than it helps.
> >Could you start over again and describe the user-visible problem, then
> >follow up with the underlying bug and finally continue with a proposed
> >fix?
> 
> Yep, how about this one:
> 
> For example, a machine with 8G RAM, 2 nodes with 4G on each, if we pass

Did you mean 2G on each? Because your nodes do have 2GB each.

> "kernelcore=2G" as kernel parameter, the dmesg looks like:
> 
>      Movable zone start for each node
>        Node 0: 0x0000000080000000
>        Node 1: 0x0000000100000000
> 
> This looks as if both Node 0 and Node 1 have ZONE_MOVABLE, while the
> following dmesg shows that only Node 1 has ZONE_MOVABLE.

Well, the documentation says
	kernelcore=	[KNL,X86,IA-64,PPC]
			Format: nn[KMGTPE] | nn% | "mirror"
			This parameter specifies the amount of memory usable by
			the kernel for non-movable allocations.  The requested
			amount is spread evenly throughout all nodes in the
			system as ZONE_NORMAL.  The remaining memory is used for
			movable memory in its own zone, ZONE_MOVABLE.  In the
			event, a node is too small to have both ZONE_NORMAL and
			ZONE_MOVABLE, kernelcore memory will take priority and
			other nodes will have a larger ZONE_MOVABLE.

>      On node 0 totalpages: 524190
>        DMA zone: 64 pages used for memmap
>        DMA zone: 21 pages reserved
>        DMA zone: 3998 pages, LIFO batch:0
>        DMA32 zone: 8128 pages used for memmap
>        DMA32 zone: 520192 pages, LIFO batch:63
>      
>      On node 1 totalpages: 524255
>        DMA32 zone: 4096 pages used for memmap
>        DMA32 zone: 262111 pages, LIFO batch:63
>        Movable zone: 4096 pages used for memmap
>        Movable zone: 262144 pages, LIFO batch:63

so assuming you really have 4GB in total and 2GB should be in kernel
zones, then each node should put half of its memory into kernel zones
and the remaining 2G should be evenly distributed to movable zones. So
something seems broken here.
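
Spelling that expectation out for the layout in the dmesg above (a
back-of-the-envelope calculation, assuming 4kB pages and that node 1
spans roughly 3GB-5GB as its zone sizes suggest):

    kernelcore=2G, 2 usable nodes  ->  ~1GB of kernelcore per node

    Node 0 (0 - 2GB):    kernel zones 0 - 1GB,
                         ZONE_MOVABLE 1GB - 2GB (start pfn 0x40000)
    Node 1 (3 - 5GB):    kernel zones 3GB - 4GB,
                         ZONE_MOVABLE 4GB - 5GB (start pfn 0x100000)

rather than the empty ZONE_MOVABLE that node 0 gets in your log.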
Wei Yang Dec. 18, 2018, 2:39 p.m. UTC | #4
On Tue, Dec 18, 2018 at 01:14:51PM +0100, Michal Hocko wrote:
>On Mon 17-12-18 14:18:02, Wei Yang wrote:
>> On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
>> >On Sun 16-12-18 20:56:24, Wei Yang wrote:
>> >> A non-zero zone_movable_pfn entry indicates that the node has
>> >> ZONE_MOVABLE, but the current implementation doesn't comply with this
>> >> rule when the kernel parameter "kernelcore=" is used.
>> >> 
>> >> The current implementation doesn't harm the system, since the stale
>> >> value in zone_movable_pfn is out of the range of the node's zones.
>> >> However, the user would see this message during boot even though the
>> >> node doesn't have ZONE_MOVABLE:
>> >> 
>> >>     Movable zone start for each node
>> >>       Node 0: 0x0000000080000000
>> >
>> >I am sorry but the above description confuses me more than it helps.
>> >Could you start over again and describe the user-visible problem, then
>> >follow up with the underlying bug and finally continue with a proposed
>> >fix?
>> 
>> Yep, how about this one:
>> 
>> For example, a machine with 8G RAM, 2 nodes with 4G on each, if we pass
>
>Did you mean 2G on each? Because your nodes do have 2GB each.
>
>> "kernelcore=2G" as kernel parameter, the dmesg looks like:
>> 
>>      Movable zone start for each node
>>        Node 0: 0x0000000080000000
>>        Node 1: 0x0000000100000000
>> 
>> This looks as if both Node 0 and Node 1 have ZONE_MOVABLE, while the
>> following dmesg shows that only Node 1 has ZONE_MOVABLE.
>
>Well, the documentation says
>	kernelcore=	[KNL,X86,IA-64,PPC]
>			Format: nn[KMGTPE] | nn% | "mirror"
>			This parameter specifies the amount of memory usable by
>			the kernel for non-movable allocations.  The requested
>			amount is spread evenly throughout all nodes in the
>			system as ZONE_NORMAL.  The remaining memory is used for
>			movable memory in its own zone, ZONE_MOVABLE.  In the
>			event, a node is too small to have both ZONE_NORMAL and
>			ZONE_MOVABLE, kernelcore memory will take priority and
>			other nodes will have a larger ZONE_MOVABLE.

Yes, current behavior is a little bit different.

When you look at find_usable_zone_for_movable(), ZONE_MOVABLE is carved
out of the highest zone, which means that if a node doesn't span the
highest zone, all of its memory belongs to kernelcore.

Looks like a design decision?
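
For reference, find_usable_zone_for_movable() does roughly the
following (paraphrased, not guaranteed to be the verbatim source): it
walks the zones top-down, picks the highest zone whose arch pfn range
is non-empty, and ZONE_MOVABLE is later carved only out of pfns that
fall into that zone's range:

    static void __init find_usable_zone_for_movable(void)
    {
            int zone_index;

            /* highest zone (other than MOVABLE) with a non-empty range */
            for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
                    if (zone_index == ZONE_MOVABLE)
                            continue;
                    if (arch_zone_highest_possible_pfn[zone_index] >
                        arch_zone_lowest_possible_pfn[zone_index])
                            break;
            }

            movable_zone = zone_index;
    }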

>
>>      On node 0 totalpages: 524190
>>        DMA zone: 64 pages used for memmap
>>        DMA zone: 21 pages reserved
>>        DMA zone: 3998 pages, LIFO batch:0
>>        DMA32 zone: 8128 pages used for memmap
>>        DMA32 zone: 520192 pages, LIFO batch:63
>>      
>>      On node 1 totalpages: 524255
>>        DMA32 zone: 4096 pages used for memmap
>>        DMA32 zone: 262111 pages, LIFO batch:63
>>        Movable zone: 4096 pages used for memmap
>>        Movable zone: 262144 pages, LIFO batch:63
>
>so assuming your really have 4GB in total and 2GB should be in kernel
>zones then each node should get half of it to kernel zones and the
>remaining 2G evenly distributed to movable zones. So something seems
>broken here.

If this were really how it is implemented, we would have the following
memory layout:


    +---------+------+---------+--------+------------+
    |DMA      |DMA32 |Movable  |DMA32   |Movable     |
    +---------+------+---------+--------+------------+
    |<        Node 0          >|<      Node 1       >|

This means we would have non-monotonically increasing zones.

Is this what we expect now? If it is, we really have something broken.

>-- 
>Michal Hocko
>SUSE Labs
Michal Hocko Dec. 18, 2018, 2:47 p.m. UTC | #5
On Tue 18-12-18 14:39:43, Wei Yang wrote:
> On Tue, Dec 18, 2018 at 01:14:51PM +0100, Michal Hocko wrote:
> >On Mon 17-12-18 14:18:02, Wei Yang wrote:
> >> On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
> >> >On Sun 16-12-18 20:56:24, Wei Yang wrote:
> >> >> A non-zero zone_movable_pfn entry indicates that the node has
> >> >> ZONE_MOVABLE, but the current implementation doesn't comply with this
> >> >> rule when the kernel parameter "kernelcore=" is used.
> >> >> 
> >> >> The current implementation doesn't harm the system, since the stale
> >> >> value in zone_movable_pfn is out of the range of the node's zones.
> >> >> However, the user would see this message during boot even though the
> >> >> node doesn't have ZONE_MOVABLE:
> >> >> 
> >> >>     Movable zone start for each node
> >> >>       Node 0: 0x0000000080000000
> >> >
> >> >I am sorry but the above description confuses me more than it helps.
> >> >Could you start over again and describe the user-visible problem, then
> >> >follow up with the underlying bug and finally continue with a proposed
> >> >fix?
> >> 
> >> Yep, how about this one:
> >> 
> >> For example, a machine with 8G RAM, 2 nodes with 4G on each, if we pass
> >
> >Did you mean 2G on each? Because your nodes do have 2GB each.
> >
> >> "kernelcore=2G" as kernel parameter, the dmesg looks like:
> >> 
> >>      Movable zone start for each node
> >>        Node 0: 0x0000000080000000
> >>        Node 1: 0x0000000100000000
> >> 
> >> This looks as if both Node 0 and Node 1 have ZONE_MOVABLE, while the
> >> following dmesg shows that only Node 1 has ZONE_MOVABLE.
> >
> >Well, the documentation says
> >	kernelcore=	[KNL,X86,IA-64,PPC]
> >			Format: nn[KMGTPE] | nn% | "mirror"
> >			This parameter specifies the amount of memory usable by
> >			the kernel for non-movable allocations.  The requested
> >			amount is spread evenly throughout all nodes in the
> >			system as ZONE_NORMAL.  The remaining memory is used for
> >			movable memory in its own zone, ZONE_MOVABLE.  In the
> >			event, a node is too small to have both ZONE_NORMAL and
> >			ZONE_MOVABLE, kernelcore memory will take priority and
> >			other nodes will have a larger ZONE_MOVABLE.
> 
> Yes, current behavior is a little bit different.

Then it is either a bug in implementation or documentation.

> 
> When you look at find_usable_zone_for_movable(), ZONE_MOVABLE is carved
> out of the highest zone, which means that if a node doesn't span the
> highest zone, all of its memory belongs to kernelcore.

Each node can have all zones. DMA and DMA32 are address-range specific,
but there is always the NORMAL zone to hold kernel memory irrespective
of the pfn range.

> 
> Looks like a design decision?
> 
> >
> >>      On node 0 totalpages: 524190
> >>        DMA zone: 64 pages used for memmap
> >>        DMA zone: 21 pages reserved
> >>        DMA zone: 3998 pages, LIFO batch:0
> >>        DMA32 zone: 8128 pages used for memmap
> >>        DMA32 zone: 520192 pages, LIFO batch:63
> >>      
> >>      On node 1 totalpages: 524255
> >>        DMA32 zone: 4096 pages used for memmap
> >>        DMA32 zone: 262111 pages, LIFO batch:63
> >>        Movable zone: 4096 pages used for memmap
> >>        Movable zone: 262144 pages, LIFO batch:63
> >
> >so assuming you really have 4GB in total and 2GB should be in kernel
> >zones, then each node should put half of its memory into kernel zones
> >and the remaining 2G should be evenly distributed to movable zones. So
> >something seems broken here.
> 
> If this were really how it is implemented, we would have the following
> memory layout:
> 
> 
>     +---------+------+---------+--------+------------+
>     |DMA      |DMA32 |Movable  |DMA32   |Movable     |
>     +---------+------+---------+--------+------------+
>     |<        Node 0          >|<      Node 1       >|
> 
> This means we would have non-monotonically increasing zones.
> 
> Is this what we expect now? If it is, we really have something broken.

Absolutely. Each node can have all zones as mentioned above.
Wei Yang Dec. 18, 2018, 8:27 p.m. UTC | #6
On Tue, Dec 18, 2018 at 03:47:24PM +0100, Michal Hocko wrote:
>On Tue 18-12-18 14:39:43, Wei Yang wrote:
>> On Tue, Dec 18, 2018 at 01:14:51PM +0100, Michal Hocko wrote:
>> >On Mon 17-12-18 14:18:02, Wei Yang wrote:
>> >> On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
>> >> >On Sun 16-12-18 20:56:24, Wei Yang wrote:
>> >> >> A non-zero zone_movable_pfn entry indicates that the node has
>> >> >> ZONE_MOVABLE, but the current implementation doesn't comply with this
>> >> >> rule when the kernel parameter "kernelcore=" is used.
>> >> >> 
>> >> >> The current implementation doesn't harm the system, since the stale
>> >> >> value in zone_movable_pfn is out of the range of the node's zones.
>> >> >> However, the user would see this message during boot even though the
>> >> >> node doesn't have ZONE_MOVABLE:
>> >> >> 
>> >> >>     Movable zone start for each node
>> >> >>       Node 0: 0x0000000080000000
>> >> >
>> >> >I am sorry but the above description confuses me more than it helps.
>> >> >Could you start over again and describe the user-visible problem, then
>> >> >follow up with the underlying bug and finally continue with a proposed
>> >> >fix?
>> >> 
>> >> Yep, how about this one:
>> >> 
>> >> For example, a machine with 8G RAM, 2 nodes with 4G on each, if we pass
>> >
>> >Did you mean 2G on each? Because your nodes do have 2GB each.
>> >
>> >> "kernelcore=2G" as kernel parameter, the dmesg looks like:
>> >> 
>> >>      Movable zone start for each node
>> >>        Node 0: 0x0000000080000000
>> >>        Node 1: 0x0000000100000000
>> >> 
>> >> This looks as if both Node 0 and Node 1 have ZONE_MOVABLE, while the
>> >> following dmesg shows that only Node 1 has ZONE_MOVABLE.
>> >
>> >Well, the documentation says
>> >	kernelcore=	[KNL,X86,IA-64,PPC]
>> >			Format: nn[KMGTPE] | nn% | "mirror"
>> >			This parameter specifies the amount of memory usable by
>> >			the kernel for non-movable allocations.  The requested
>> >			amount is spread evenly throughout all nodes in the
>> >			system as ZONE_NORMAL.  The remaining memory is used for
>> >			movable memory in its own zone, ZONE_MOVABLE.  In the
>> >			event, a node is too small to have both ZONE_NORMAL and
>> >			ZONE_MOVABLE, kernelcore memory will take priority and
>> >			other nodes will have a larger ZONE_MOVABLE.
>> 
>> Yes, current behavior is a little bit different.
>
>Then it is either a bug in implementation or documentation.
>
>> 
>> When you look at find_usable_zone_for_movable(), ZONE_MOVABLE is carved
>> out of the highest zone, which means that if a node doesn't span the
>> highest zone, all of its memory belongs to kernelcore.
>
>Each node can have all zones. DMA and DMA32 are address-range specific,
>but there is always the NORMAL zone to hold kernel memory irrespective
>of the pfn range.
>
>> 
>> Looks like a design decision?
>> 
>> >
>> >>      On node 0 totalpages: 524190
>> >>        DMA zone: 64 pages used for memmap
>> >>        DMA zone: 21 pages reserved
>> >>        DMA zone: 3998 pages, LIFO batch:0
>> >>        DMA32 zone: 8128 pages used for memmap
>> >>        DMA32 zone: 520192 pages, LIFO batch:63
>> >>      
>> >>      On node 1 totalpages: 524255
>> >>        DMA32 zone: 4096 pages used for memmap
>> >>        DMA32 zone: 262111 pages, LIFO batch:63
>> >>        Movable zone: 4096 pages used for memmap
>> >>        Movable zone: 262144 pages, LIFO batch:63
>> >
>> >so assuming you really have 4GB in total and 2GB should be in kernel
>> >zones, then each node should put half of its memory into kernel zones
>> >and the remaining 2G should be evenly distributed to movable zones. So
>> >something seems broken here.
>> 
>> If this were really how it is implemented, we would have the following
>> memory layout:
>> 
>> 
>>     +---------+------+---------+--------+------------+
>>     |DMA      |DMA32 |Movable  |DMA32   |Movable     |
>>     +---------+------+---------+--------+------------+
>>     |<        Node 0          >|<      Node 1       >|
>> 
>> This means we would have non-monotonically increasing zones.
>> 
>> Is this what we expect now? If it is, we really have something broken.
>
>Absolutely. Each node can have all zones as mentioned above.
>

Ok, so it seems the implementation is not correct now.

BTW, would this eat the lower zone's memory? For example, less DMA32?

>-- 
>Michal Hocko
>SUSE Labs
Michal Hocko Dec. 19, 2018, 6:56 a.m. UTC | #7
On Tue 18-12-18 20:27:43, Wei Yang wrote:
[...]
> BTW, would this eat the lower zone's memory? For example, less DMA32?

Yes I think so. If the distribution should be even and some node(s) span
only the lower 32-bit address range, then there is no other option than
to shrink the DMA32 zone. There is a note
			In the
			event, a node is too small to have both ZONE_NORMAL and
			ZONE_MOVABLE, kernelcore memory will take priority and
			other nodes will have a larger ZONE_MOVABLE.
which explains that this might not be the case though.

Btw. I have to say I quite dislike this interface, not to mention the
implementation. There are users who rely on it though, so we cannot
remove it. There is a lot of room for cleanups there.
Wei Yang Dec. 19, 2018, 12:56 p.m. UTC | #8
On Wed, Dec 19, 2018 at 07:56:25AM +0100, Michal Hocko wrote:
>On Tue 18-12-18 20:27:43, Wei Yang wrote:
>[...]
>> BTW, would this eat the lower zone's memory? For example, less DMA32?
>
>Yes I think so. If the distribution should be even and some node(s) span
>only the lower 32-bit address range, then there is no other option than
>to shrink the DMA32 zone. There is a note
>			In the
>			event, a node is too small to have both ZONE_NORMAL and
>			ZONE_MOVABLE, kernelcore memory will take priority and
>			other nodes will have a larger ZONE_MOVABLE.
>which explains that this might not be the case though.
>
>Btw. I have to say I quite dislike this interface, not to mention the
>implementation. There are users who rely on it though, so we cannot
>remove it. There is a lot of room for cleanups there.

Hmm... ok, thanks for your explanation.

>-- 
>Michal Hocko
>SUSE Labs

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..c3d8a3346dd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2141,6 +2141,7 @@  static inline unsigned long get_num_physpages(void)
  * See mm/page_alloc.c for more information on each function exposed by
  * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
  */
+#define zone_movable_pfn_highestbit (1UL << (BITS_PER_LONG - 1))
 extern void free_area_init_nodes(unsigned long *max_zone_pfn);
 unsigned long node_map_pfn_alignment(void);
 unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb4df3f63f5e..cd3a77b9cb95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6841,6 +6841,7 @@  static void __init find_zone_movable_pfns_for_nodes(void)
 		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 			unsigned long size_pages;
 
+			zone_movable_pfn[nid] &= ~zone_movable_pfn_highestbit;
 			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
 			if (start_pfn >= end_pfn)
 				continue;
@@ -6866,6 +6867,13 @@  static void __init find_zone_movable_pfns_for_nodes(void)
 					 * not double account here
 					 */
 					zone_movable_pfn[nid] = end_pfn;
+
+					/*
+					 * Set highest bit to indicate it is
+					 * used for calculation.
+					 */
+					zone_movable_pfn[nid] |=
+						zone_movable_pfn_highestbit;
 					continue;
 				}
 				start_pfn = usable_startpfn;
@@ -6904,6 +6912,13 @@  static void __init find_zone_movable_pfns_for_nodes(void)
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
 
+	/*
+	 * clear zone_movable_pfn if its highest bit is set
+	 */
+	for_each_node_state(nid, N_MEMORY)
+		if (zone_movable_pfn[nid] & zone_movable_pfn_highestbit)
+			zone_movable_pfn[nid] = 0;
+
 out2:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)