
[v3,5/9] mm: vmscan: demote anon DRAM pages to migration target node

Message ID 1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series Migrate mode for node reclaim with heterogeneous memory hierarchy

Commit Message

Yang Shi June 13, 2019, 11:29 p.m. UTC
Since a migration target node (i.e. PMEM) typically provides larger
capacity than DRAM and has much lower access latency than disk, it is
a good choice to use as a middle tier between DRAM and disk in the
page reclaim path.

With migration target nodes, the demotion path of anonymous pages could be:

DRAM -> PMEM -> swap device

This patch demotes anonymous pages only for the time being, and demotes
a THP to the migration target node as a whole.  To avoid expensive page
reclaim and/or compaction on the target node when it is under memory
pressure, the most conservative gfp flags are used: the allocation fails
quickly if there is memory pressure and just wakes up kswapd on failure.
migrate_pages() will then split the THP and migrate the base pages one
by one upon THP allocation failure.

Pages are demoted to the closest migration target node even when the
system is swapless.  The current page reclaim logic only scans the anon
LRU when swap is on and swappiness is set properly.  Demoting to the
migration target does not need to care whether swap is available.  But
reclaim on the migration target node itself still skips the anon LRU if
swap is not available.

Demotion only happens from a DRAM node to its closest migration target
node.  Demoting to a remote migration target node, or migrating from the
target node back to DRAM in the reclaim path, is not allowed.

Also, define a new migration reason for demotion, called MR_DEMOTE.
Pages are demoted via async migration to avoid blocking.
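
As an illustration only (not part of this patch), demoted pages should be
observable through the existing mm_migrate_pages tracepoint once the new
reason string is wired up; the trace output below is abbreviated:

    # echo 1 > /sys/kernel/debug/tracing/events/migrate/mm_migrate_pages/enable
    # cat /sys/kernel/debug/tracing/trace_pipe
      ... mm_migrate_pages: ... mode=MIGRATE_ASYNC reason=demote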

Demotion is only allowed via node reclaim.  Introduce a new node reclaim
mode: migrate mode.  The migrate mode ignores cpuset and mempolicy
settings.
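
For example, since vm.zone_reclaim_mode is a bitmask (see the documentation
update below), migrate mode could be enabled together with plain node
reclaim like this; which of the existing bits to combine with the new
migrate bit depends on the workload:

    # echo 9 > /proc/sys/vm/zone_reclaim_mode    # 1 (reclaim on) | 8 (migrate)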

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/sysctl/vm.txt    |   6 ++
 include/linux/gfp.h            |  12 ++++
 include/linux/migrate.h        |   1 +
 include/trace/events/migrate.h |   3 +-
 mm/debug.c                     |   1 +
 mm/internal.h                  |  12 ++++
 mm/migrate.c                   |  15 +++-
 mm/vmscan.c                    | 157 +++++++++++++++++++++++++++++++++--------
 8 files changed, 175 insertions(+), 32 deletions(-)

Patch

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7493220..4b76a55 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -919,6 +919,7 @@  This is value ORed together of
 1	= Zone reclaim on
 2	= Zone reclaim writes dirty pages out
 4	= Zone reclaim swaps pages
+8	= Zone reclaim migrate pages
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
 that benefit from having their data cached, zone_reclaim_mode should be
@@ -943,4 +944,9 @@  Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
 
+Allowing zone reclaim to migrate pages moves pages to the migration target
+nodes (e.g. NVDIMM nodes) if such nodes are present in the system.  These
+nodes are typically cheaper and slower than DRAM but have larger capacity.
+The migrate mode ignores cpuset and mempolicy settings.
+
 ============ End of Document =================================
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b50..b294455 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -285,6 +285,14 @@ 
  * available and will not wake kswapd/kcompactd on failure. The _LIGHT
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
+ *
+ * %GFP_DEMOTE is for migration-on-memory-reclaim (a.k.a. demotion) allocations.
+ * The allocation might happen in kswapd or direct reclaim, so it is safer to
+ * assume __GFP_IO and __GFP_FS are not allowed.  Demotion happens for user
+ * pages (on LRU) only and on a specific node.  Generally it fails quickly if
+ * memory is not available, but may wake up kswapd on failure.
+ *
+ * %GFP_TRANSHUGE_DEMOTE is used for THP demotion allocation.
  */
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -300,6 +308,10 @@ 
 #define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#define GFP_DEMOTE	(__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY | \
+			__GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE | \
+			GFP_NOWAIT)
+#define GFP_TRANSHUGE_DEMOTE	(GFP_DEMOTE | __GFP_COMP)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 837fdd1..cfb1f57 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -25,6 +25,7 @@  enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTE,
 	MR_TYPES
 };
 
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 705b33d..c1d5b36 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@ 
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_DEMOTE,		"demote")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/debug.c b/mm/debug.c
index 8345bb6..0bcced8 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,6 +25,7 @@ 
 	"mempolicy_mbind",
 	"numa_misplaced",
 	"cma",
+	"demote",
 };
 
 const struct trace_print_flags pageflag_names[] = {
diff --git a/mm/internal.h b/mm/internal.h
index a3181e2..3d756f2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -303,6 +303,18 @@  static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
 }
 #endif
 
+static inline bool has_migration_target_node_online(void)
+{
+	int nid;
+
+	for_each_online_node(nid) {
+		if (node_state(nid, N_MIGRATE_TARGET))
+			return true;
+	}
+
+	return false;
+}
+
 /* mm/util.c */
 void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct rb_node *rb_parent);
diff --git a/mm/migrate.c b/mm/migrate.c
index bc4242a..9fb76a6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1006,7 +1006,8 @@  static int move_to_new_page(struct page *newpage, struct page *page,
 }
 
 static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, enum migrate_mode mode)
+				int force, enum migrate_mode mode,
+				enum migrate_reason reason)
 {
 	int rc = -EAGAIN;
 	int page_was_mapped = 0;
@@ -1143,8 +1144,16 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (rc == MIGRATEPAGE_SUCCESS) {
 		if (unlikely(!is_lru))
 			put_page(newpage);
-		else
+		else {
+			/*
+			 * Put demoted pages on the target node's
+			 * active LRU.
+			 */
+			if (!PageUnevictable(newpage) &&
+			    reason == MR_DEMOTE)
+				SetPageActive(newpage);
 			putback_lru_page(newpage);
+		}
 	}
 
 	return rc;
@@ -1198,7 +1207,7 @@  static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	rc = __unmap_and_move(page, newpage, force, mode);
+	rc = __unmap_and_move(page, newpage, force, mode, reason);
 	if (rc == MIGRATEPAGE_SUCCESS)
 		set_page_owner_migrate_reason(newpage, reason);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7acd0af..428a83b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1094,6 +1094,55 @@  static void page_check_dirty_writeback(struct page *page,
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+#ifdef CONFIG_NUMA
+#define RECLAIM_OFF 0
+#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE (1<<3)	/* Migrate pages to migration target
+				 * node during reclaim */
+static struct page *alloc_demote_page(struct page *page, unsigned long node)
+{
+	if (unlikely(PageHuge(page)))
+		/* HugeTLB demotion is not supported for now */
+		BUG();
+	else if (PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(node, GFP_TRANSHUGE_DEMOTE,
+				       HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
+		return __alloc_pages_node(node, GFP_DEMOTE, 0);
+}
+#else
+static inline struct page *alloc_demote_page(struct page *page,
+					     unsigned long node)
+{
+	return NULL;
+}
+#endif
+
+static inline bool is_demote_ok(int nid)
+{
+	/* Only do demotion with the migrate mode of node reclaim */
+	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
+		return false;
+
+	/* The current node is a cpuless node */
+	if (!node_state(nid, N_CPU_MEM))
+		return false;
+
+	/* No online migration target node */
+	if (!has_migration_target_node_online())
+		return false;
+
+	return true;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1106,6 +1155,7 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(demote_pages);
 	unsigned nr_reclaimed = 0;
 	unsigned pgactivate = 0;
 
@@ -1269,6 +1319,18 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		if (PageAnon(page) && PageSwapBacked(page)) {
 			if (!PageSwapCache(page)) {
+				/*
+				 * Demote anonymous pages only for now and
+				 * skip MADV_FREE pages.
+				 *
+				 * Demotion only happens from primary nodes
+				 * to cpuless nodes.
+				 */
+				if (is_demote_ok(page_to_nid(page))) {
+					list_add(&page->lru, &demote_pages);
+					unlock_page(page);
+					continue;
+				}
 				if (!(sc->gfp_mask & __GFP_IO))
 					goto keep_locked;
 				if (PageTransHuge(page)) {
@@ -1480,6 +1542,30 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
+	/* Demote pages to migration target */
+	if (!list_empty(&demote_pages)) {
+		int err, target_nid;
+		unsigned int nr_succeeded = 0;
+		nodemask_t used_mask;
+
+		nodes_clear(used_mask);
+		target_nid = find_next_best_node(pgdat->node_id, &used_mask,
+						 true);
+
+		/* Demotion would ignore all cpuset and mempolicy settings */
+		err = migrate_pages(&demote_pages, alloc_demote_page, NULL,
+				    target_nid, MIGRATE_ASYNC, MR_DEMOTE,
+				    &nr_succeeded);
+
+		nr_reclaimed += nr_succeeded;
+
+		if (err) {
+			putback_movable_pages(&demote_pages);
+
+			list_splice(&ret_pages, &demote_pages);
+		}
+	}
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_unref_page_list(&free_pages);
@@ -2136,10 +2222,11 @@  static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	unsigned long gb;
 
 	/*
-	 * If we don't have swap space, anonymous page deactivation
-	 * is pointless.
+	 * If we don't have swap space or an online migration target node,
+	 * anonymous page deactivation is pointless.
 	 */
-	if (!file && !total_swap_pages)
+	if (!file && !total_swap_pages &&
+	    !is_demote_ok(pgdat->node_id))
 		return false;
 
 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2213,22 +2300,34 @@  static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	unsigned long ap, fp;
 	enum lru_list lru;
 
-	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
-		scan_balance = SCAN_FILE;
-		goto out;
-	}
-
 	/*
-	 * Global reclaim will swap to prevent OOM even with no
-	 * swappiness, but memcg users want to use this knob to
-	 * disable swapping for individual groups completely when
-	 * using the memory controller's swap limit feature would be
-	 * too expensive.
+	 * Anon pages can be demoted to PMEM.  If there is a PMEM node online,
+	 * still scan the anonymous LRU even if the system is swapless or
+	 * swapping is disabled by memcg.
+	 *
+	 * If the current node is already a PMEM node, demotion is not applicable.
 	 */
-	if (!global_reclaim(sc) && !swappiness) {
-		scan_balance = SCAN_FILE;
-		goto out;
+	if (!is_demote_ok(pgdat->node_id)) {
+		/*
+		 * If we have no swap space, do not bother scanning
+		 * anon pages.
+		 */
+		if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+			scan_balance = SCAN_FILE;
+			goto out;
+		}
+
+		/*
+		 * Global reclaim will swap to prevent OOM even with no
+		 * swappiness, but memcg users want to use this knob to
+		 * disable swapping for individual groups completely when
+		 * using the memory controller's swap limit feature would be
+		 * too expensive.
+		 */
+		if (!global_reclaim(sc) && !swappiness) {
+			scan_balance = SCAN_FILE;
+			goto out;
+		}
 	}
 
 	/*
@@ -2577,7 +2676,7 @@  static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (get_nr_swap_pages() > 0 || is_demote_ok(pgdat->node_id))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
@@ -3262,7 +3361,8 @@  static void age_active_anon(struct pglist_data *pgdat,
 {
 	struct mem_cgroup *memcg;
 
-	if (!total_swap_pages)
+	/* Age anon pages as long as demotion is possible */
+	if (!total_swap_pages && !is_demote_ok(pgdat->node_id))
 		return;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -4003,11 +4103,6 @@  static int __init kswapd_init(void)
  */
 int node_reclaim_mode __read_mostly;
 
-#define RECLAIM_OFF 0
-#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
-
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
@@ -4084,8 +4179,10 @@  static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_writepage = !!((node_reclaim_mode & RECLAIM_WRITE) ||
+				    (node_reclaim_mode & RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode & RECLAIM_UNMAP) ||
+				(node_reclaim_mode & RECLAIM_MIGRATE)),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4105,7 +4202,8 @@  static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
+	    (node_reclaim_mode & RECLAIM_MIGRATE)) {
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4138,9 +4236,12 @@  int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * thrown out if the node is overallocated. So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
+	 *
+	 * Migrate mode doesn't care about the above restrictions.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages &&
+	    !(node_reclaim_mode & RECLAIM_MIGRATE))
 		return NODE_RECLAIM_FULL;
 
 	/*