Message ID | 20181216125624.3416-1-richard.weiyang@gmail.com (mailing list archive)
---|---
State      | New, archived
Series     | mm, page_alloc: clear zone_movable_pfn if the node doesn't have ZONE_MOVABLE
On Sun 16-12-18 20:56:24, Wei Yang wrote:
> A non-zero zone_movable_pfn indicates this node has ZONE_MOVABLE, while
> the current implementation doesn't comply with this rule when the kernel
> parameter "kernelcore=" is used.
>
> The current implementation doesn't harm the system, since the value in
> zone_movable_pfn is out of the range of the current zone. But the user
> would see this message during bootup, even though that node doesn't have
> ZONE_MOVABLE:
>
>   Movable zone start for each node
>     Node 0: 0x0000000080000000

I am sorry, but the above description confuses me more than it helps.
Could you start over again and describe the user-visible problem, then
follow up with the underlying bug, and finally continue with a proposed
fix?
On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
>On Sun 16-12-18 20:56:24, Wei Yang wrote:
[...]
>I am sorry, but the above description confuses me more than it helps.
>Could you start over again and describe the user-visible problem, then
>follow up with the underlying bug, and finally continue with a proposed
>fix?

Yep, how about this one:

For example, on a machine with 8G RAM and 2 nodes with 4G each, if we
pass "kernelcore=2G" as a kernel parameter, the dmesg looks like:

    Movable zone start for each node
      Node 0: 0x0000000080000000
      Node 1: 0x0000000100000000

This looks like both Node 0 and Node 1 have ZONE_MOVABLE, while the
following dmesg shows only Node 1 has ZONE_MOVABLE:

    On node 0 totalpages: 524190
      DMA zone: 64 pages used for memmap
      DMA zone: 21 pages reserved
      DMA zone: 3998 pages, LIFO batch:0
      DMA32 zone: 8128 pages used for memmap
      DMA32 zone: 520192 pages, LIFO batch:63

    On node 1 totalpages: 524255
      DMA32 zone: 4096 pages used for memmap
      DMA32 zone: 262111 pages, LIFO batch:63
      Movable zone: 4096 pages used for memmap
      Movable zone: 262144 pages, LIFO batch:63

The good news is that the current result doesn't harm the ZONE_MOVABLE
calculation, but it confuses the user and may lead to code
inconsistency. For example, in adjust_zone_range_for_zone_movable(),
the comment says "Only adjust if ZONE_MOVABLE is on this node" by
checking zone_movable_pfn. But we can see this doesn't hold in all
cases.

The cause of this problem is that we leverage zone_movable_pfn during
the iteration to record where we have touched and to avoid double
accounting. But afterwards, this temporary data is not cleared.

To fix this issue, we have several options. In this patch I propose the
one with minimal change to the current code, by taking advantage of the
highest bit of zone_movable_pfn. While zone_movable_pfn holds a
temporary calculation value, its highest bit is set. After the entire
calculation is complete, any zone_movable_pfn with the highest bit set
is cleared.

>--
>Michal Hocko
>SUSE Labs
On Mon 17-12-18 14:18:02, Wei Yang wrote:
> On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
[...]
> Yep, how about this one:
>
> For example, on a machine with 8G RAM and 2 nodes with 4G each, if we

Did you mean 2G on each? Because your nodes do have 2GB each.

> pass "kernelcore=2G" as a kernel parameter, the dmesg looks like:
>
>     Movable zone start for each node
>       Node 0: 0x0000000080000000
>       Node 1: 0x0000000100000000
>
> This looks like both Node 0 and Node 1 have ZONE_MOVABLE, while the
> following dmesg shows only Node 1 has ZONE_MOVABLE.

Well, the documentation says

    kernelcore=     [KNL,X86,IA-64,PPC]
                    Format: nn[KMGTPE] | nn% | "mirror"
                    This parameter specifies the amount of memory usable by
                    the kernel for non-movable allocations. The requested
                    amount is spread evenly throughout all nodes in the
                    system as ZONE_NORMAL. The remaining memory is used for
                    movable memory in its own zone, ZONE_MOVABLE. In the
                    event, a node is too small to have both ZONE_NORMAL and
                    ZONE_MOVABLE, kernelcore memory will take priority and
                    other nodes will have a larger ZONE_MOVABLE.

>     On node 0 totalpages: 524190
>       DMA zone: 64 pages used for memmap
>       DMA zone: 21 pages reserved
>       DMA zone: 3998 pages, LIFO batch:0
>       DMA32 zone: 8128 pages used for memmap
>       DMA32 zone: 520192 pages, LIFO batch:63
>
>     On node 1 totalpages: 524255
>       DMA32 zone: 4096 pages used for memmap
>       DMA32 zone: 262111 pages, LIFO batch:63
>       Movable zone: 4096 pages used for memmap
>       Movable zone: 262144 pages, LIFO batch:63

so assuming you really have 4GB in total and 2GB should be in kernel
zones, then each node should get half of it in kernel zones and the
remaining 2G evenly distributed to movable zones. So something seems
broken here.
On Tue, Dec 18, 2018 at 01:14:51PM +0100, Michal Hocko wrote:
>On Mon 17-12-18 14:18:02, Wei Yang wrote:
[...]
>Well, the documentation says
>    kernelcore=     [KNL,X86,IA-64,PPC]
[...]
>                    event, a node is too small to have both ZONE_NORMAL and
>                    ZONE_MOVABLE, kernelcore memory will take priority and
>                    other nodes will have a larger ZONE_MOVABLE.

Yes, the current behavior is a little bit different.

When you look at find_usable_zone_for_movable(), ZONE_MOVABLE is carved
out of the highest zone. This means that if a node doesn't have the
highest zone, all its memory belongs to kernelcore.

Looks like a design decision?

>so assuming you really have 4GB in total and 2GB should be in kernel
>zones, then each node should get half of it in kernel zones and the
>remaining 2G evenly distributed to movable zones. So something seems
>broken here.

If we really had this implemented, we would have the following memory
layout:

    +---------+------+---------+--------+------------+
    |DMA      |DMA32 |Movable  |DMA32   |Movable     |
    +---------+------+---------+--------+------------+
    |<        Node 0         >|<       Node 1       >|

This means we would have non-monotonically increasing zones.

Is this what we expect now? If it is, we really have something broken.

>--
>Michal Hocko
>SUSE Labs
On Tue 18-12-18 14:39:43, Wei Yang wrote:
> On Tue, Dec 18, 2018 at 01:14:51PM +0100, Michal Hocko wrote:
[...]
> Yes, the current behavior is a little bit different.

Then it is either a bug in the implementation or the documentation.

> When you look at find_usable_zone_for_movable(), ZONE_MOVABLE is carved
> out of the highest zone. This means that if a node doesn't have the
> highest zone, all its memory belongs to kernelcore.

Each node can have all zones. DMA and DMA32 are address-range specific,
but there is always a NORMAL zone to hold kernel memory irrespective of
the pfn range.

> Looks like a design decision?
>
[...]
> If we really had this implemented, we would have the following memory
> layout:
>
>     +---------+------+---------+--------+------------+
>     |DMA      |DMA32 |Movable  |DMA32   |Movable     |
>     +---------+------+---------+--------+------------+
>     |<        Node 0         >|<       Node 1       >|
>
> This means we would have non-monotonically increasing zones.
>
> Is this what we expect now? If it is, we really have something broken.

Absolutely. Each node can have all zones, as mentioned above.
On Tue, Dec 18, 2018 at 03:47:24PM +0100, Michal Hocko wrote:
>On Tue 18-12-18 14:39:43, Wei Yang wrote:
[...]
>> Is this what we expect now? If it is, we really have something broken.
>
>Absolutely. Each node can have all zones, as mentioned above.

Ok, so it seems the implementation is not correct now.

BTW, would this eat the lower zone's memory? For example, leave less
DMA32?

>--
>Michal Hocko
>SUSE Labs
On Tue 18-12-18 20:27:43, Wei Yang wrote:
[...]
> BTW, would this eat the lower zone's memory? For example, leave less DMA32?
Yes, I think so. If the distribution should be even and some node(s) span
only the lower 32-bit address range, then there is no other option than to
shrink the DMA32 zone. There is a note
In the
event, a node is too small to have both ZONE_NORMAL and
ZONE_MOVABLE, kernelcore memory will take priority and
other nodes will have a larger ZONE_MOVABLE.
which explains that this might not be the case though.
Btw. I have to say I quite dislike this interface, not to mention the
implementation. There are users that rely on it though, so we cannot
remove it. There is a lot of room for cleanups there.
On Wed, Dec 19, 2018 at 07:56:25AM +0100, Michal Hocko wrote:
>On Tue 18-12-18 20:27:43, Wei Yang wrote:
>[...]
>> BTW, would this eat the lower zone's memory? For example, leave less
>> DMA32?
>
>Yes, I think so. If the distribution should be even and some node(s) span
>only the lower 32-bit address range, then there is no other option than to
>shrink the DMA32 zone. There is a note
[...]
>which explains that this might not be the case though.
>
>Btw. I have to say I quite dislike this interface, not to mention the
>implementation. There are users that rely on it though, so we cannot
>remove it. There is a lot of room for cleanups there.

Hmm... ok, thanks for your explanation.

>--
>Michal Hocko
>SUSE Labs
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..c3d8a3346dd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2141,6 +2141,7 @@ static inline unsigned long get_num_physpages(void)
  * See mm/page_alloc.c for more information on each function exposed by
  * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
  */
+#define zone_movable_pfn_highestbit	(1UL << (BITS_PER_LONG - 1))
 extern void free_area_init_nodes(unsigned long *max_zone_pfn);
 unsigned long node_map_pfn_alignment(void);
 unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb4df3f63f5e..cd3a77b9cb95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6841,6 +6841,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 			unsigned long size_pages;
 
+			zone_movable_pfn[nid] &= ~zone_movable_pfn_highestbit;
 			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
 			if (start_pfn >= end_pfn)
 				continue;
@@ -6866,6 +6867,13 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 				 * not double account here
 				 */
 				zone_movable_pfn[nid] = end_pfn;
+
+				/*
+				 * Set highest bit to indicate it is
+				 * used for calculation.
+				 */
+				zone_movable_pfn[nid] |=
+					zone_movable_pfn_highestbit;
 				continue;
 			}
 			start_pfn = usable_startpfn;
@@ -6904,6 +6912,13 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
 
+	/*
+	 * Clear zone_movable_pfn if its highest bit is set.
+	 */
+	for_each_node_state(nid, N_MEMORY)
+		if (zone_movable_pfn[nid] & zone_movable_pfn_highestbit)
+			zone_movable_pfn[nid] = 0;
+
 out2:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
A non-zero zone_movable_pfn indicates this node has ZONE_MOVABLE, while
the current implementation doesn't comply with this rule when the kernel
parameter "kernelcore=" is used.

The current implementation doesn't harm the system, since the value in
zone_movable_pfn is out of the range of the current zone. But the user
would see this message during bootup, even though that node doesn't have
ZONE_MOVABLE:

    Movable zone start for each node
      Node 0: 0x0000000080000000

This fix takes advantage of the highest bit of a pfn to indicate that it
is used during the calculation rather than being a final result, and
clears those pfns whose highest bit is set after the entire calculation.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c    | 15 +++++++++++++++
 2 files changed, 16 insertions(+)