Message ID | 20181102155528.20358-1-mhocko@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm, memory_hotplug: teach has_unmovable_pages about of LRU migrateable pages | expand |
Hi Michal, On 11/02/18 at 04:55pm, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > Baoquan He has noticed that 15c30bc09085 ("mm, memory_hotplug: make > has_unmovable_pages more robust") is causing memory offlining failures > on a movable node. After a further debugging it turned out that > has_unmovable_pages fails prematurely because it stumbles over off-LRU > pages. Nevertheless those pages are not on LRU because they are waiting > on the pcp LRU caches (an example of __dump_page added by a debugging > patch) > [ 560.923297] page:ffffea043f39fa80 count:1 mapcount:0 mapping:ffff880e5dce1b59 index:0x7f6eec459 > [ 560.931967] flags: 0x5fffffc0080024(uptodate|active|swapbacked) > [ 560.937867] raw: 005fffffc0080024 dead000000000100 dead000000000200 ffff880e5dce1b59 > [ 560.945606] raw: 00000007f6eec459 0000000000000000 00000001ffffffff ffff880e43ae8000 > [ 560.953323] page dumped because: hotplug > [ 560.957238] page->mem_cgroup:ffff880e43ae8000 > [ 560.961620] has_unmovable_pages: pfn:0x10fd030d, found:0x1, count:0x0 > [ 560.968127] page:ffffea043f40c340 count:2 mapcount:0 mapping:ffff880e2f2d8628 index:0x0 > [ 560.976104] flags: 0x5fffffc0000006(referenced|uptodate) > [ 560.981401] raw: 005fffffc0000006 dead000000000100 dead000000000200 ffff880e2f2d8628 > [ 560.989119] raw: 0000000000000000 0000000000000000 00000002ffffffff ffff88010a8f5000 > [ 560.996833] page dumped because: hotplug Sorry, last week I didn't test this patch with memory pressure adding. Today use "stress -m 200 -t 2h" to add pressure, hot removing failed. Will send you output log. W/o memory pressure, it sometimes succeed. I saw one failure last night, it still show un-removable as 0 in hotpluggable node one time, I worried it might be caused by my compiling mistake, so compile and try again this morning. Thanks Baoquan
On Mon 05-11-18 08:20:09, Baoquan He wrote: > Hi Michal, > > On 11/02/18 at 04:55pm, Michal Hocko wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > Baoquan He has noticed that 15c30bc09085 ("mm, memory_hotplug: make > > has_unmovable_pages more robust") is causing memory offlining failures > > on a movable node. After a further debugging it turned out that > > has_unmovable_pages fails prematurely because it stumbles over off-LRU > > pages. Nevertheless those pages are not on LRU because they are waiting > > on the pcp LRU caches (an example of __dump_page added by a debugging > > patch) > > [ 560.923297] page:ffffea043f39fa80 count:1 mapcount:0 mapping:ffff880e5dce1b59 index:0x7f6eec459 > > [ 560.931967] flags: 0x5fffffc0080024(uptodate|active|swapbacked) > > [ 560.937867] raw: 005fffffc0080024 dead000000000100 dead000000000200 ffff880e5dce1b59 > > [ 560.945606] raw: 00000007f6eec459 0000000000000000 00000001ffffffff ffff880e43ae8000 > > [ 560.953323] page dumped because: hotplug > > [ 560.957238] page->mem_cgroup:ffff880e43ae8000 > > [ 560.961620] has_unmovable_pages: pfn:0x10fd030d, found:0x1, count:0x0 > > [ 560.968127] page:ffffea043f40c340 count:2 mapcount:0 mapping:ffff880e2f2d8628 index:0x0 > > [ 560.976104] flags: 0x5fffffc0000006(referenced|uptodate) > > [ 560.981401] raw: 005fffffc0000006 dead000000000100 dead000000000200 ffff880e2f2d8628 > > [ 560.989119] raw: 0000000000000000 0000000000000000 00000002ffffffff ffff88010a8f5000 > > [ 560.996833] page dumped because: hotplug > > Sorry, last week I didn't test this patch with memory pressure adding. > Today use "stress -m 200 -t 2h" to add pressure, hot removing failed. > Will send you output log. W/o memory pressure, it sometimes succeed. I > saw one failure last night, it still show un-removable as 0 in > hotpluggable node one time, I worried it might be caused by my compiling > mistake, so compile and try again this morning. In a private email you have sent this (let's assume this is correctly testing the patch I have posted): : [43283.914082] has_unmovable_pages: pfn:0x10e62600, found:0x1, count:0x0 : [43283.920669] page:ffffea0439898000 count:1 mapcount:1 mapping:ffff880e5639d3c9 index:0x7f2430400 compound_mapcount: 1 : [43283.931219] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) : [43283.937954] raw: 005fffffc0090034 ffffea043ffcb888 ffffea043f728008 ffff880e5639d3c9 : [43283.945722] raw: 00000007f2430400 0000000000000000 00000001ffffffff ffff880e2baad000 : [43283.955381] page dumped because: hotplug The page is both LRU and SwapBacked which should hit the some of the checks in has_unmovable_pages. The fact it hasn't means we have clearly raced with the page being allocated and marked SwapBacked/LRU. This is surely possible and there is no universal way to prevent from that for all types of potentially migratedable pages. The race window should be relatively small. Maybe we can add a retry for movable zone pages. How reproducible this is? But, as I've said memory isolation resp. has_unmovable_pages begs for a complete redesign.
On 11/05/18 at 10:14am, Michal Hocko wrote: > On Mon 05-11-18 08:20:09, Baoquan He wrote: > > Hi Michal, > > > > On 11/02/18 at 04:55pm, Michal Hocko wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > > > Baoquan He has noticed that 15c30bc09085 ("mm, memory_hotplug: make > > > has_unmovable_pages more robust") is causing memory offlining failures > > > on a movable node. After a further debugging it turned out that > > > has_unmovable_pages fails prematurely because it stumbles over off-LRU > > > pages. Nevertheless those pages are not on LRU because they are waiting > > > on the pcp LRU caches (an example of __dump_page added by a debugging > > > patch) > > > [ 560.923297] page:ffffea043f39fa80 count:1 mapcount:0 mapping:ffff880e5dce1b59 index:0x7f6eec459 > > > [ 560.931967] flags: 0x5fffffc0080024(uptodate|active|swapbacked) > > > [ 560.937867] raw: 005fffffc0080024 dead000000000100 dead000000000200 ffff880e5dce1b59 > > > [ 560.945606] raw: 00000007f6eec459 0000000000000000 00000001ffffffff ffff880e43ae8000 > > > [ 560.953323] page dumped because: hotplug > > > [ 560.957238] page->mem_cgroup:ffff880e43ae8000 > > > [ 560.961620] has_unmovable_pages: pfn:0x10fd030d, found:0x1, count:0x0 > > > [ 560.968127] page:ffffea043f40c340 count:2 mapcount:0 mapping:ffff880e2f2d8628 index:0x0 > > > [ 560.976104] flags: 0x5fffffc0000006(referenced|uptodate) > > > [ 560.981401] raw: 005fffffc0000006 dead000000000100 dead000000000200 ffff880e2f2d8628 > > > [ 560.989119] raw: 0000000000000000 0000000000000000 00000002ffffffff ffff88010a8f5000 > > > [ 560.996833] page dumped because: hotplug > > > > Sorry, last week I didn't test this patch with memory pressure adding. > > Today use "stress -m 200 -t 2h" to add pressure, hot removing failed. > > Will send you output log. W/o memory pressure, it sometimes succeed. I > > saw one failure last night, it still show un-removable as 0 in > > hotpluggable node one time, I worried it might be caused by my compiling > > mistake, so compile and try again this morning. > > In a private email you have sent this (let's assume this is correctly > testing the patch I have posted): Yeah, I recompiled and copy bzImage to /boot to ensure the patch is compiled in. > > : [43283.914082] has_unmovable_pages: pfn:0x10e62600, found:0x1, count:0x0 > : [43283.920669] page:ffffea0439898000 count:1 mapcount:1 mapping:ffff880e5639d3c9 index:0x7f2430400 compound_mapcount: 1 > : [43283.931219] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) > : [43283.937954] raw: 005fffffc0090034 ffffea043ffcb888 ffffea043f728008 ffff880e5639d3c9 > : [43283.945722] raw: 00000007f2430400 0000000000000000 00000001ffffffff ffff880e2baad000 > : [43283.955381] page dumped because: hotplug > > The page is both LRU and SwapBacked which should hit the some of the > checks in has_unmovable_pages. The fact it hasn't means we have clearly > raced with the page being allocated and marked SwapBacked/LRU. This is > surely possible and there is no universal way to prevent from that for > all types of potentially migratedable pages. The race window should be > relatively small. Maybe we can add a retry for movable zone pages. > > How reproducible this is? On the bare metal with 8 nodes and each node has 64 GB memory, node1~7 are hotpluggable and movable_node is added. After reboot, execute "stress -m 200 -t 2h", by default the 200 processes will malloc/free 256MB continuously. Then hot remove one memory board on node1~7, it always happened. The progress is that all memory blocks on node1~7 are removable now, accessing and reading them won't trigger the old trace now. > > But, as I've said memory isolation resp. has_unmovable_pages begs for a > complete redesign. So how about using the patch I pasted before? It has an explanation and test result is good. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a919ba5..021e39d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7824,7 +7824,8 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, if (__PageMovable(page)) continue; - if (!PageLRU(page)) + if (!PageLRU(page) && + (get_pageblock_migratetype(page) != MIGRATE_MOVABLE)) found++; /* * If there are RECLAIMABLE pages, we need to check Thanks Baoquan
On Mon 05-11-18 10:14:07, Michal Hocko wrote:
> Maybe we can add a retry for movable zone pages.
Or something like this. Ugly as hell, no question about that. I also
have to think about this some more to convince myself this will not
result in an endless loop under some situations.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48ceda313332..342d66eca0f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7779,12 +7779,16 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
pfn = page_to_pfn(page);
for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
unsigned long check = pfn + iter;
+ unsigned long saved_flags;
if (!pfn_valid_within(check))
continue;
page = pfn_to_page(check);
+retry:
+ saved_flags = READ_ONCE(page->flags);
+
if (PageReserved(page))
goto unmovable;
@@ -7840,6 +7844,13 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
page->mapping->a_ops->migratepage)
continue;
+ /*
+ * We might race with the allocation of the page so retry
+ * if flags have changed.
+ */
+ if (saved_flags != READ_ONCE(page->flags))
+ goto retry;
+
/*
* If there are RECLAIMABLE pages, we need to check
* it. But now, memory offline itself doesn't call
On Mon 05-11-18 17:26:18, Baoquan He wrote: [...] > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a919ba5..021e39d 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7824,7 +7824,8 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, > if (__PageMovable(page)) > continue; > > - if (!PageLRU(page)) > + if (!PageLRU(page) && > + (get_pageblock_migratetype(page) != MIGRATE_MOVABLE)) > found++; > /* > * If there are RECLAIMABLE pages, we need to check As explained during the private conversion I am not really thrilled by this check. AFAIU this will be the case for basically all pages in the zone_movable. As we have seen already some unexpected ones can lurk in easily.
On 11/05/18 at 10:28am, Michal Hocko wrote: > On Mon 05-11-18 10:14:07, Michal Hocko wrote: > > Maybe we can add a retry for movable zone pages. > > Or something like this. Ugly as hell, no question about that. I also > have to think about this some more to convince myself this will not > result in an endless loop under some situations. Testing, will tell the result. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 48ceda313332..342d66eca0f3 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7779,12 +7779,16 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, > pfn = page_to_pfn(page); > for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) { > unsigned long check = pfn + iter; > + unsigned long saved_flags; > > if (!pfn_valid_within(check)) > continue; > > page = pfn_to_page(check); > > +retry: > + saved_flags = READ_ONCE(page->flags); > + > if (PageReserved(page)) > goto unmovable; > > @@ -7840,6 +7844,13 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, > page->mapping->a_ops->migratepage) > continue; > > + /* > + * We might race with the allocation of the page so retry > + * if flags have changed. > + */ > + if (saved_flags != READ_ONCE(page->flags)) > + goto retry; > + > /* > * If there are RECLAIMABLE pages, we need to check > * it. But now, memory offline itself doesn't call > -- > Michal Hocko > SUSE Labs
Hi Michal, On 11/05/18 at 10:28am, Michal Hocko wrote: > > Or something like this. Ugly as hell, no question about that. I also > have to think about this some more to convince myself this will not > result in an endless loop under some situations. It failed. Paste the log and patch diff here, please help check if I made any mistake on manual code change. The log is at bottom. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a919ba5cb3c8..cdcd923ec337 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7779,14 +7779,22 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, pfn = page_to_pfn(page); for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) { unsigned long check = pfn + iter; + unsigned long saved_flags; if (!pfn_valid_within(check)) continue; page = pfn_to_page(check); - if (PageReserved(page)) +retry: + saved_flags = READ_ONCE(page->flags); + + + if (PageReserved(page)) { + pr_info("has_unmovable_pages 000: pfn:0x%x\n", pfn+iter); + __dump_page(page, "hotplug"); goto unmovable; + } /* * Hugepages are not in LRU lists, but they're movable. @@ -7795,8 +7803,11 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, */ if (PageHuge(page)) { - if (!hugepage_migration_supported(page_hstate(page))) + if (!hugepage_migration_supported(page_hstate(page))) { + pr_info("has_unmovable_pages 111: pfn:0x%x\n", pfn+iter); + __dump_page(page, "hotplug"); goto unmovable; + } iter = round_up(iter + 1, 1<<compound_order(page)) - 1; continue; @@ -7824,8 +7835,29 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, if (__PageMovable(page)) continue; - if (!PageLRU(page)) +#if 0 + if (!PageLRU(page) && (get_pageblock_migratetype(page)!=MIGRATE_MOVABLE) ) found++; +#endif + if (PageLRU(page)) + continue; + + if (PageSwapBacked(page)) + continue; + + + if (page->mapping && !page->mapping->a_ops) + pr_info("page->mapping:%ps \n", page->mapping); + + if (page->mapping && page->mapping->a_ops && page->mapping->a_ops->migratepage) + continue; + + /* + * We might race with the allocation of the page so retry + * if flags have changed. + */ + if (saved_flags != READ_ONCE(page->flags)) + goto retry; /* * If there are RECLAIMABLE pages, we need to check * it. But now, memory offline itself doesn't call @@ -7839,8 +7871,11 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, * is set to both of a memory hole page and a _used_ kernel * page at boot. */ - if (found > count) + if (++found > count) { + pr_info("has_unmovable_pages: pfn:0x%x, found:0x%x, count:0x%x \n", pfn+iter, found, count); + __dump_page(page, "hotplug"); goto unmovable; + } } return false; unmovable: ***********console log******************* [ 458.584711] Offlined Pages 524288 [ 458.943655] Offlined Pages 524288 [ 459.390757] Offlined Pages 524288 [ 460.086409] Offlined Pages 524288 [ 460.931868] Offlined Pages 524288 [ 461.741327] Offlined Pages 524288 [ 462.576653] Offlined Pages 524288 [ 463.291947] Offlined Pages 524288 [ 464.121980] Offlined Pages 524288 [ 464.869983] Offlined Pages 524288 [ 465.550254] Offlined Pages 524288 [ 466.337934] Offlined Pages 524288 [ 467.143416] Offlined Pages 524288 [ 467.925108] Offlined Pages 524288 [ 468.665318] Offlined Pages 524288 [ 469.473999] Offlined Pages 524288 [ 470.390116] Offlined Pages 524288 [ 471.069104] Offlined Pages 524288 [ 471.704154] Offlined Pages 524288 [ 472.322466] Offlined Pages 524288 [ 472.964513] Offlined Pages 524288 [ 473.629328] Offlined Pages 524288 [ 474.265908] Offlined Pages 524288 [ 474.883829] Offlined Pages 524288 [ 475.538700] Offlined Pages 524288 [ 476.247451] Offlined Pages 524288 [ 476.575516] has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0 [ 476.582103] page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1 [ 476.592645] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) [ 476.599386] raw: 005fffffc0090034 ffffea043bd58008 ffffea0437fb8008 ffff880e05239841 [ 476.607154] raw: 00000007f26e5000 0000000000000000 00000001ffffffff ffff880e74f5c000 [ 476.616725] page dumped because: hotplug [ 476.620682] page->mem_cgroup:ffff880e74f5c000 [ 476.625190] WARNING: CPU: 245 PID: 8 at mm/page_alloc.c:7882 has_unmovable_pages.cold.123+0x44/0xb6 [ 476.634230] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul iTCO_vendor_support ghash_clmulni_intel intel_cstate joydev ses ipmi_si enclosure ipmi_devintf scsi_transport_sas intel_uncore ipmi_msghandler pcspkr intel_rapl_perf sg mei_me i2c_i801 mei lpc_ich wmi xfs libcrc32c sd_mod ahci igb crc32c_intel libahci i2c_algo_bit dca libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [ 476.678239] CPU: 245 PID: 8 Comm: kworker/u576:0 Not tainted 4.19.0+ #9 [ 476.684871] Hardware name: 9008/IT91SMUB, BIOS BLXSV512 03/22/2018 [ 476.691199] Workqueue: kacpi_hotplug acpi_hotplug_work_fn [ 476.696678] RIP: 0010:has_unmovable_pages.cold.123+0x44/0xb6 [ 476.702369] Code: fe 0f 0e 82 4c 89 ff e8 0f a0 02 00 48 8b 44 24 10 48 2b 40 50 48 89 c2 b8 01 00 00 00 48 81 fa 40 11 00 00 0f 85 ec eb ff ff <0f> 0b e9 e5 eb ff ff 48 89 de 48 c7 c7 08 4d 0a 82 e8 79 1e f0 ff [ 476.721100] RSP: 0018:ffffc900000e3c70 EFLAGS: 00010046 [ 476.726361] RAX: 0000000000000001 RBX: 0000000010dfec00 RCX: 0000000000000006 [ 476.733543] RDX: 0000000000001140 RSI: 0000000000000096 RDI: ffff880e7cb55ad0 [ 476.742768] RBP: 005fffffc0010000 R08: 0000000000000bbf R09: 0000000000000007 [ 476.749926] R10: 0000000000000000 R11: ffffffff829f162d R12: 0000000010dfec00 [ 476.757082] R13: 0000000000000001 R14: 0000000000000000 R15: ffffea0437fb0000 [ 476.764241] FS: 0000000000000000(0000) GS:ffff880e7cb40000(0000) knlGS:0000000000000000 [ 476.772338] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 476.778102] CR2: 00007fc3670f3000 CR3: 0000000e716c8003 CR4: 00000000003606e0 [ 476.785249] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 476.792405] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 476.799562] Call Trace: [ 476.804039] start_isolate_page_range+0x258/0x2f0 [ 476.808823] __offline_pages+0xcc/0x8e0 [ 476.812753] ? klist_next+0xf2/0x100 [ 476.816402] ? device_is_dependent+0x90/0x90 [ 476.820759] memory_subsys_offline+0x40/0x60 [ 476.825127] device_offline+0x81/0xb0 [ 476.828920] acpi_bus_offline+0xdb/0x140 [ 476.832937] acpi_device_hotplug+0x21c/0x460 [ 476.837281] acpi_hotplug_work_fn+0x1a/0x30 [ 476.841562] process_one_work+0x1a1/0x3a0 [ 476.845647] worker_thread+0x30/0x380 [ 476.849381] ? drain_workqueue+0x120/0x120 [ 476.853549] kthread+0x112/0x130 [ 476.856866] ? kthread_park+0x80/0x80 [ 476.860588] ret_from_fork+0x35/0x40 [ 476.864204] ---[ end trace 08fb4fe25cf760b3 ]--- [ 476.955547] has_unmovable_pages: pfn:0x10e07a00, found:0x1, count:0x0 [ 476.962126] page:ffffea04381e8000 count:1 mapcount:1 mapping:ffff880e0913d899 index:0x7f26ec600 compound_mapcount: 1 [ 476.972673] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) [ 476.979413] raw: 005fffffc0090034 ffffea043c338008 ffffea043f5b0008 ffff880e0913d899 [ 476.987192] raw: 00000007f26ec600 0000000000000000 00000001ffffffff ffff880e74f5c000 [ 476.996921] page dumped because: hotplug [ 477.000880] page->mem_cgroup:ffff880e74f5c000 [ 477.110154] has_unmovable_pages: pfn:0x10e9ee00, found:0x1, count:0x0 [ 477.118626] page:ffffea043a7b8000 count:1 mapcount:1 mapping:ffff880e0c89c2c1 index:0x7f26e5000 compound_mapcount: 1 [ 477.129176] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) [ 477.135911] raw: 005fffffc0090034 ffffea043b0e0008 ffffea04383e8008 ffff880e0c89c2c1 [ 477.143690] raw: 00000007f26e5000 0000000000000000 00000001ffffffff ffff880e74f5c000 [ 477.151448] page dumped because: hotplug [ 477.155404] page->mem_cgroup:ffff880e74f5c000 [ 477.224784] has_unmovable_pages: pfn:0x10f13600, found:0x1, count:0x0 [ 477.231368] page:ffffea043c4d8000 count:1 mapcount:1 mapping:ffff880e57b7adc1 index:0x7f26e8600 compound_mapcount: 1 [ 477.241922] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) [ 477.250324] raw: 005fffffc0090034 ffffea043af88008 ffffea043cf20508 ffff880e57b7adc1 [ 477.258089] raw: 00000007f26e8600 0000000000000000 00000001ffffffff ffff880e74f5c000 [ 477.265857] page dumped because: hotplug [ 477.269811] page->mem_cgroup:ffff880e74f5c000 [ 477.307236] has_unmovable_pages: pfn:0x10f8da00, found:0x1, count:0x0 [ 477.313807] page:ffffea043e368000 count:1 mapcount:1 mapping:ffff880e75132529 index:0x7f26e1600 compound_mapcount: 1 [ 477.324361] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) [ 477.331096] raw: 005fffffc0090034 ffffea043d2c0008 ffffea043ba40008 ffff880e75132529 [ 477.338875] raw: 00000007f26e1600 0000000000000000 00000001ffffffff ffff880e74f5c000 [ 477.346635] page dumped because: hotplug [ 477.350590] page->mem_cgroup:ffff880e74f5c000 [ 477.380478] has_unmovable_pages: pfn:0x10d87200, found:0x1, count:0x0 [ 477.387060] page:ffffea04361c8000 count:1 mapcount:1 mapping:ffff880e0913d899 index:0x7f26e2400 compound_mapcount: 1 [ 477.397610] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) [ 477.404340] raw: 005fffffc0090034 ffffea0437fb8008 ffffea043cd20008 ffff880e0913d899 [ 477.412113] raw: 00000007f26e2400 0000000000000000 00000001ffffffff ffff880e74f5c000 [ 477.419870] page dumped because: hotplug [ 477.423842] page->mem_cgroup:ffff880e74f5c000 [ 477.435557] memory memory539: Offline failed. [ 489.171077] perf: interrupt took too long (2745 > 2500), lowering kernel.perf_event_max_sample_rate to 72000 [ 501.332276] INFO: NMI handler (ghes_notify_nmi) took too long to run: 2.179 msecs [ 511.073564] perf: interrupt took too long (3593 > 3431), lowering kernel.perf_event_max_sample_rate to 55000 [ 521.050208] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.836 msecs [ 521.058339] perf: interrupt took too long (16324 > 4491), lowering kernel.perf_event_max_sample_rate to 12000
On Mon 05-11-18 18:25:20, Baoquan He wrote: > Hi Michal, > > On 11/05/18 at 10:28am, Michal Hocko wrote: > > > > Or something like this. Ugly as hell, no question about that. I also > > have to think about this some more to convince myself this will not > > result in an endless loop under some situations. > > It failed. Paste the log and patch diff here, please help check if I made > any mistake on manual code change. The log is at bottom. The retry patch is obviously still racy, it just makes the race window slightly smaller and I hoped it would catch most of those races but this is obviously not the case. I was thinking about your MIGRATE_MOVABLE check some more and I still do not like it much, we just change migrate type at many places and I have hard time to actually see this is always safe wrt. to what we need here. We should be able to restore the zone type check though. The primary problem fixed by 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust") was that early allocations made it to the zone_movable range. If we add the check _after_ the PageReserved() check then we should be able to rule all bootmem allocation out. So what about the following (on top of the previous patch which makes sense on its own I believe). From d7ffd1342529c892f1de8999c3a5609211599c9d Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Mon, 5 Nov 2018 13:28:51 +0100 Subject: [PATCH] mm, memory_hotplug: check zone_movable in has_unmovable_pages Page state checks are racy. Under a heavy memory workload (e.g. stress -m 200 -t 2h) it is quite easy to hit a race window when the page is allocated but its state is not fully populated yet. A debugging patch to dump the struct page state shows : [ 476.575516] has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0 : [ 476.582103] page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1 : [ 476.592645] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) Note that the state has been checked for both PageLRU and PageSwapBacked already. Closing this race completely would require some sort of retry logic. This can be tricky and error prone (think of potential endless or long taking loops). Workaround this problem for movable zones at least. Such a zone should only contain movable pages. 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust") has told us that this is not strictly true though. Bootmem pages should be marked reserved though so we can move the original check after the PageReserved check. Pages from other zones are still prone to races but we even do not pretend that memory hotremove works for those so pre-mature failure doesn't hurt that much. Reported-by: Baoquan He <bhe@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 48ceda313332..5b64c5bc6ea0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7788,6 +7788,14 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, if (PageReserved(page)) goto unmovable; + /* + * If the zone is movable and we have ruled out all reserved + * pages then it should be reasonably safe to assume the rest + * is movable. + */ + if (zone_idx(zone) == ZONE_MOVABLE) + continue; + /* * Hugepages are not in LRU lists, but they're movable. * We need not scan over tail pages bacause we don't
On 11/05/18 at 01:38pm, Michal Hocko wrote: > On Mon 05-11-18 18:25:20, Baoquan He wrote: > > Hi Michal, > > > > On 11/05/18 at 10:28am, Michal Hocko wrote: > > > > > > Or something like this. Ugly as hell, no question about that. I also > > > have to think about this some more to convince myself this will not > > > result in an endless loop under some situations. > > > > It failed. Paste the log and patch diff here, please help check if I made > > any mistake on manual code change. The log is at bottom. > > The retry patch is obviously still racy, it just makes the race window > slightly smaller and I hoped it would catch most of those races but this > is obviously not the case. > > I was thinking about your MIGRATE_MOVABLE check some more and I still do > not like it much, we just change migrate type at many places and I have > hard time to actually see this is always safe wrt. to what we need here. > > We should be able to restore the zone type check though. The > primary problem fixed by 15c30bc09085 ("mm, memory_hotplug: make > has_unmovable_pages more robust") was that early allocations made it to > the zone_movable range. If we add the check _after_ the PageReserved() > check then we should be able to rule all bootmem allocation out. > > So what about the following (on top of the previous patch which makes > sense on its own I believe). Yes, I think this looks very reasonable and should be robust. Have tested it, hot removing 4 hotpluggable nodes continusously succeeds, and then hot adding them back, still works well. So please feel free to add my Tested-by or Acked-by. Tested-by: Baoquan He <bhe@redhat.com> or Acked-by: Baoquan He <bhe@redhat.com> Thanks, Michal. > > > From d7ffd1342529c892f1de8999c3a5609211599c9d Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Mon, 5 Nov 2018 13:28:51 +0100 > Subject: [PATCH] mm, memory_hotplug: check zone_movable in has_unmovable_pages > > Page state checks are racy. Under a heavy memory workload (e.g. stress > -m 200 -t 2h) it is quite easy to hit a race window when the page is > allocated but its state is not fully populated yet. A debugging patch to > dump the struct page state shows > : [ 476.575516] has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0 > : [ 476.582103] page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1 > : [ 476.592645] flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) > > Note that the state has been checked for both PageLRU and PageSwapBacked > already. Closing this race completely would require some sort of retry > logic. This can be tricky and error prone (think of potential endless > or long taking loops). > > Workaround this problem for movable zones at least. Such a zone should > only contain movable pages. 15c30bc09085 ("mm, memory_hotplug: make > has_unmovable_pages more robust") has told us that this is not strictly > true though. Bootmem pages should be marked reserved though so we can > move the original check after the PageReserved check. Pages from other > zones are still prone to races but we even do not pretend that memory > hotremove works for those so pre-mature failure doesn't hurt that much. > > Reported-by: Baoquan He <bhe@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > mm/page_alloc.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 48ceda313332..5b64c5bc6ea0 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7788,6 +7788,14 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, > if (PageReserved(page)) > goto unmovable; > > + /* > + * If the zone is movable and we have ruled out all reserved > + * pages then it should be reasonably safe to assume the rest > + * is movable. > + */ > + if (zone_idx(zone) == ZONE_MOVABLE) > + continue; > + > /* > * Hugepages are not in LRU lists, but they're movable. > * We need not scan over tail pages bacause we don't > -- > 2.19.1 > > -- > Michal Hocko > SUSE Labs
On Mon 05-11-18 22:23:08, Baoquan He wrote: > On 11/05/18 at 01:38pm, Michal Hocko wrote: > > On Mon 05-11-18 18:25:20, Baoquan He wrote: > > > Hi Michal, > > > > > > On 11/05/18 at 10:28am, Michal Hocko wrote: > > > > > > > > Or something like this. Ugly as hell, no question about that. I also > > > > have to think about this some more to convince myself this will not > > > > result in an endless loop under some situations. > > > > > > It failed. Paste the log and patch diff here, please help check if I made > > > any mistake on manual code change. The log is at bottom. > > > > The retry patch is obviously still racy, it just makes the race window > > slightly smaller and I hoped it would catch most of those races but this > > is obviously not the case. > > > > I was thinking about your MIGRATE_MOVABLE check some more and I still do > > not like it much, we just change migrate type at many places and I have > > hard time to actually see this is always safe wrt. to what we need here. > > > > We should be able to restore the zone type check though. The > > primary problem fixed by 15c30bc09085 ("mm, memory_hotplug: make > > has_unmovable_pages more robust") was that early allocations made it to > > the zone_movable range. If we add the check _after_ the PageReserved() > > check then we should be able to rule all bootmem allocation out. > > > > So what about the following (on top of the previous patch which makes > > sense on its own I believe). > > Yes, I think this looks very reasonable and should be robust. > > Have tested it, hot removing 4 hotpluggable nodes continusously > succeeds, and then hot adding them back, still works well. > > So please feel free to add my Tested-by or Acked-by. > > Tested-by: Baoquan He <bhe@redhat.com> > or > Acked-by: Baoquan He <bhe@redhat.com> Thanks for retesting! Does this apply to both patches?
On 11/05/18 at 06:10pm, Michal Hocko wrote: > On Mon 05-11-18 22:23:08, Baoquan He wrote: > > On 11/05/18 at 01:38pm, Michal Hocko wrote: > > > On Mon 05-11-18 18:25:20, Baoquan He wrote: > > > > Hi Michal, > > > > > > > > On 11/05/18 at 10:28am, Michal Hocko wrote: > > > > > > > > > > Or something like this. Ugly as hell, no question about that. I also > > > > > have to think about this some more to convince myself this will not > > > > > result in an endless loop under some situations. > > > > > > > > It failed. Paste the log and patch diff here, please help check if I made > > > > any mistake on manual code change. The log is at bottom. > > > > > > The retry patch is obviously still racy, it just makes the race window > > > slightly smaller and I hoped it would catch most of those races but this > > > is obviously not the case. > > > > > > I was thinking about your MIGRATE_MOVABLE check some more and I still do > > > not like it much, we just change migrate type at many places and I have > > > hard time to actually see this is always safe wrt. to what we need here. > > > > > > We should be able to restore the zone type check though. The > > > primary problem fixed by 15c30bc09085 ("mm, memory_hotplug: make > > > has_unmovable_pages more robust") was that early allocations made it to > > > the zone_movable range. If we add the check _after_ the PageReserved() > > > check then we should be able to rule all bootmem allocation out. > > > > > > So what about the following (on top of the previous patch which makes > > > sense on its own I believe). > > > > Yes, I think this looks very reasonable and should be robust. > > > > Have tested it, hot removing 4 hotpluggable nodes continusously > > succeeds, and then hot adding them back, still works well. > > > > So please feel free to add my Tested-by or Acked-by. > > > > Tested-by: Baoquan He <bhe@redhat.com> > > or > > Acked-by: Baoquan He <bhe@redhat.com> > > Thanks for retesting! Does this apply to both patches? Sorry, don't get it. I just applied this on top of linus's tree and tested. Do you mean applying it on top of previous code change?
On Tue 06-11-18 08:22:16, Baoquan He wrote: > On 11/05/18 at 06:10pm, Michal Hocko wrote: > > On Mon 05-11-18 22:23:08, Baoquan He wrote: > > > On 11/05/18 at 01:38pm, Michal Hocko wrote: > > > > On Mon 05-11-18 18:25:20, Baoquan He wrote: > > > > > Hi Michal, > > > > > > > > > > On 11/05/18 at 10:28am, Michal Hocko wrote: > > > > > > > > > > > > Or something like this. Ugly as hell, no question about that. I also > > > > > > have to think about this some more to convince myself this will not > > > > > > result in an endless loop under some situations. > > > > > > > > > > It failed. Paste the log and patch diff here, please help check if I made > > > > > any mistake on manual code change. The log is at bottom. > > > > > > > > The retry patch is obviously still racy, it just makes the race window > > > > slightly smaller and I hoped it would catch most of those races but this > > > > is obviously not the case. > > > > > > > > I was thinking about your MIGRATE_MOVABLE check some more and I still do > > > > not like it much, we just change migrate type at many places and I have > > > > hard time to actually see this is always safe wrt. to what we need here. > > > > > > > > We should be able to restore the zone type check though. The > > > > primary problem fixed by 15c30bc09085 ("mm, memory_hotplug: make > > > > has_unmovable_pages more robust") was that early allocations made it to > > > > the zone_movable range. If we add the check _after_ the PageReserved() > > > > check then we should be able to rule all bootmem allocation out. > > > > > > > > So what about the following (on top of the previous patch which makes > > > > sense on its own I believe). > > > > > > Yes, I think this looks very reasonable and should be robust. > > > > > > Have tested it, hot removing 4 hotpluggable nodes continusously > > > succeeds, and then hot adding them back, still works well. > > > > > > So please feel free to add my Tested-by or Acked-by. > > > > > > Tested-by: Baoquan He <bhe@redhat.com> > > > or > > > Acked-by: Baoquan He <bhe@redhat.com> > > > > Thanks for retesting! Does this apply to both patches? > > Sorry, don't get it. I just applied this on top of linus's tree and > tested. Do you mean applying it on top of previous code change? Yes. While the first patch will obviously not help for movable zone because the movable check will override any later check it seems still useful to reduce false positives on normal zones. Or do you think this is not worth it?
On 11/06/18 at 09:28am, Michal Hocko wrote: > > > > > > It failed. Paste the log and patch diff here, please help check if I made > > > > > > any mistake on manual code change. The log is at bottom. > > > > > > > > > > The retry patch is obviously still racy, it just makes the race window > > > > > slightly smaller and I hoped it would catch most of those races but this > > > > > is obviously not the case. > > > > > > > > > > I was thinking about your MIGRATE_MOVABLE check some more and I still do > > > > > not like it much, we just change migrate type at many places and I have > > > > > hard time to actually see this is always safe wrt. to what we need here. > > > > > > > > > > We should be able to restore the zone type check though. The > > > > > primary problem fixed by 15c30bc09085 ("mm, memory_hotplug: make > > > > > has_unmovable_pages more robust") was that early allocations made it to > > > > > the zone_movable range. If we add the check _after_ the PageReserved() > > > > > check then we should be able to rule all bootmem allocation out. > > > > > > > > > > So what about the following (on top of the previous patch which makes > > > > > sense on its own I believe). > > > > > > > > Yes, I think this looks very reasonable and should be robust. > > > > > > > > Have tested it, hot removing 4 hotpluggable nodes continusously > > > > succeeds, and then hot adding them back, still works well. > > > > > > > > So please feel free to add my Tested-by or Acked-by. > > > > > > > > Tested-by: Baoquan He <bhe@redhat.com> > > > > or > > > > Acked-by: Baoquan He <bhe@redhat.com> > > > > > > Thanks for retesting! Does this apply to both patches? > > > > Sorry, don't get it. I just applied this on top of linus's tree and > > tested. Do you mean applying it on top of previous code change? > > Yes. While the first patch will obviously not help for movable zone > because the movable check will override any later check it > seems still useful to reduce false positives on normal zones. Hmm, I don't know if it will bring a little bit confusion on code understanding. Since we only recognize the movable zone issue, and I can only reproduce and verify it on the movable zone issue with the movable zone check adding. Not sure if there are any scenario or use cases to cover those newly added checking other movable zone checking. Surely, I have no objection to adding them. But the two patches are separate issues, they have no dependency on each other. I just tested the movable zone checking yesterday, will add your previous check back, then test again. I believe the result will be positive. Will udpate once done. Thanks Baoquan
On 11/06/18 at 05:16pm, Baoquan He wrote: > On 11/06/18 at 09:28am, Michal Hocko wrote: > > > > > > > It failed. Paste the log and patch diff here, please help check if I made > > > > > > > any mistake on manual code change. The log is at bottom. > > > > > > > > > > > > The retry patch is obviously still racy, it just makes the race window > > > > > > slightly smaller and I hoped it would catch most of those races but this > > > > > > is obviously not the case. > > > > > > > > > > > > I was thinking about your MIGRATE_MOVABLE check some more and I still do > > > > > > not like it much, we just change migrate type at many places and I have > > > > > > hard time to actually see this is always safe wrt. to what we need here. > > > > > > > > > > > > We should be able to restore the zone type check though. The > > > > > > primary problem fixed by 15c30bc09085 ("mm, memory_hotplug: make > > > > > > has_unmovable_pages more robust") was that early allocations made it to > > > > > > the zone_movable range. If we add the check _after_ the PageReserved() > > > > > > check then we should be able to rule all bootmem allocation out. > > > > > > > > > > > > So what about the following (on top of the previous patch which makes > > > > > > sense on its own I believe). > > > > > > > > > > Yes, I think this looks very reasonable and should be robust. > > > > > > > > > > Have tested it, hot removing 4 hotpluggable nodes continusously > > > > > succeeds, and then hot adding them back, still works well. > > > > > > > > > > So please feel free to add my Tested-by or Acked-by. > > > > > > > > > > Tested-by: Baoquan He <bhe@redhat.com> > > > > > or > > > > > Acked-by: Baoquan He <bhe@redhat.com> > > > > > > > > Thanks for retesting! Does this apply to both patches? > > > > > > Sorry, don't get it. I just applied this on top of linus's tree and > > > tested. Do you mean applying it on top of previous code change? > > > > Yes. While the first patch will obviously not help for movable zone > > because the movable check will override any later check it > > seems still useful to reduce false positives on normal zones. > > Hmm, I don't know if it will bring a little bit confusion on code > understanding. Since we only recognize the movable zone issue, and I can > only reproduce and verify it on the movable zone issue with the movable > zone check adding. > > Not sure if there are any scenario or use cases to cover those newly added > checking other movable zone checking. Surely, I have no objection to ^ than > adding them. But the two patches are separate issues, they have no > dependency on each other. > > I just tested the movable zone checking yesterday, will add your > previous check back, then test again. I believe the result will be > positive. Will udpate once done. > > Thanks > Baoquan
On Tue 06-11-18 17:16:24, Baoquan He wrote: [...] > Not sure if there are any scenario or use cases to cover those newly added > checking other movable zone checking. Surely, I have no objection to > adding them. But the two patches are separate issues, they have no > dependency on each other. Yes that is correct. I will drop those additional checks for now. Let's see if we need them later. > I just tested the movable zone checking yesterday, will add your > previous check back, then test again. I believe the result will be > positive. Will udpate once done. THere is no need to retest with that patch for your movable node setup.
On 11/06/18 at 10:51am, Michal Hocko wrote: > > I just tested the movable zone checking yesterday, will add your > > previous check back, then test again. I believe the result will be > > positive. Will udpate once done. > > THere is no need to retest with that patch for your movable node setup. OK, thanks.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 863d46da6586..48ceda313332 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7824,8 +7824,22 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, if (__PageMovable(page)) continue; - if (!PageLRU(page)) - found++; + if (PageLRU(page)) + continue; + + /* + * Some LRU pages might be temporarily off-LRU for all + * sort of different reasons - reclaim, migration, + * per-cpu LRU caches etc. + * Make sure we do not consider those pages to be unmovable. + */ + if (PageSwapBacked(page)) + continue; + + if (page->mapping && page->mapping->a_ops && + page->mapping->a_ops->migratepage) + continue; + /* * If there are RECLAIMABLE pages, we need to check * it. But now, memory offline itself doesn't call @@ -7839,7 +7853,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, * is set to both of a memory hole page and a _used_ kernel * page at boot. */ - if (found > count) + if (++found > count) goto unmovable; } return false;