Message ID | 20210401183235.BCC49E8B@viggo.jf.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Migrate Pages in lieu of discard |
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> now really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it it workload harm is detected (just
> like traditional autonuma).
>
> Once this is enabled page demotion may move data to a NUMA node
> that does not fall into the cpuset of the allocating process.
> This could be construed to violate the guarantees of cpusets.
> However, since this is an opt-in mechanism, the assumption is
> that anyone enabling it is content to relax the guarantees.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> Changes since 20200122:
>  * Changelog material about relaxing cpuset constraints
>
> Changes since 20210304:
>  * Add Documentation/ material about relaxing cpuset constraints

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |   12 ++++++++++++
>  b/include/linux/swap.h                    |    3 ++-
>  b/include/uapi/linux/mempolicy.h          |    1 +
>  b/mm/vmscan.c                             |    6 ++++--
>  4 files changed, 19 insertions(+), 3 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE 2021-03-31 15:17:40.324000190 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst 2021-03-31 15:17:40.349000190 -0700
> @@ -976,6 +976,7 @@ This is value OR'ed together of
>  1	Zone reclaim on
>  2	Zone reclaim writes dirty pages out
>  4	Zone reclaim swaps pages
> +8	Zone reclaim migrates pages
>  =	===================================
>
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -1000,3 +1001,14 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.  It may move data to a NUMA node that does not
> +fall into the cpuset of the allocating process which might be construed
> +to violate the guarantees of cpusets.  This should not be enabled on
> +systems which need strict cpuset location guarantees.
> diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
> --- a/include/linux/swap.h~RECLAIM_MIGRATE 2021-03-31 15:17:40.331000190 -0700
> +++ b/include/linux/swap.h 2021-03-31 15:17:40.351000190 -0700
> @@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
>  static inline bool node_reclaim_enabled(void)
>  {
>  	/* Is any node_reclaim_mode bit set? */
> -	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +	return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
> +				    RECLAIM_UNMAP|RECLAIM_MIGRATE);
>  }
>
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
> --- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE 2021-03-31 15:17:40.337000190 -0700
> +++ b/include/uapi/linux/mempolicy.h 2021-03-31 15:17:40.352000190 -0700
> @@ -71,5 +71,6 @@ enum {
>  #define RECLAIM_ZONE	(1<<0)	/* Run shrink_inactive_list on the zone */
>  #define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>  #define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate to other nodes during reclaim */
>
>  #endif /* _UAPI_LINUX_MEMPOLICY_H */
> diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
> --- a/mm/vmscan.c~RECLAIM_MIGRATE 2021-03-31 15:17:40.339000190 -0700
> +++ b/mm/vmscan.c 2021-03-31 15:17:40.357000190 -0700
> @@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
>  	VM_BUG_ON_PAGE(PageHuge(page), page);
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>
> +	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
> +		return false;
> +
>  	/* It is pointless to do demotion in memcg reclaim */
>  	if (cgroup_reclaim(sc))
>  		return false;
> @@ -1083,8 +1086,7 @@ static bool migrate_demote_page_ok(struc
>  	if (PageTransHuge(page) && !thp_migration_supported())
>  		return false;
>
> -	// FIXME: actually enable this later in the series
> -	return false;
> +	return true;
>  }
>
>  /* Check if a page is dirty or under writeback */
> _
>
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> This proposes extending the existing "zone_reclaim_mode" (now
> now really node_reclaim_mode) as a method to enable it.

Nit: now now -> now

> We are open to any alternative that allows end users to enable
> this mechanism or disable it it workload harm is detected (just
> like traditional autonuma).

Nit: it it -> it if

> diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
> --- a/mm/vmscan.c~RECLAIM_MIGRATE 2021-03-31 15:17:40.339000190 -0700
> +++ b/mm/vmscan.c 2021-03-31 15:17:40.357000190 -0700
> @@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
>  	VM_BUG_ON_PAGE(PageHuge(page), page);
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>
> +	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
> +		return false;
> +

As I commented on an earlier patch in this series, the RECLAIM_MIGRATE
check needs to be added to other new callers of next_demotion_node()
as well to avoid unnecessarily splitting THP pages when neither swap
nor RECLAIM_MIGRATE is enabled.  It can be too late to check
RECLAIM_MIGRATE only in migrate_demote_page_ok().
diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE 2021-03-31 15:17:40.324000190 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst 2021-03-31 15:17:40.349000190 -0700
@@ -976,6 +976,7 @@ This is value OR'ed together of
 1	Zone reclaim on
 2	Zone reclaim writes dirty pages out
 4	Zone reclaim swaps pages
+8	Zone reclaim migrates pages
 =	===================================

 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -1000,3 +1001,14 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.  It may move data to a NUMA node that does not
+fall into the cpuset of the allocating process which might be construed
+to violate the guarantees of cpusets.  This should not be enabled on
+systems which need strict cpuset location guarantees.
diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
--- a/include/linux/swap.h~RECLAIM_MIGRATE 2021-03-31 15:17:40.331000190 -0700
+++ b/include/linux/swap.h 2021-03-31 15:17:40.351000190 -0700
@@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
 static inline bool node_reclaim_enabled(void)
 {
 	/* Is any node_reclaim_mode bit set? */
-	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
+	return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
+				    RECLAIM_UNMAP|RECLAIM_MIGRATE);
 }

 extern void check_move_unevictable_pages(struct pagevec *pvec);
diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE 2021-03-31 15:17:40.337000190 -0700
+++ b/include/uapi/linux/mempolicy.h 2021-03-31 15:17:40.352000190 -0700
@@ -71,5 +71,6 @@ enum {
 #define RECLAIM_ZONE	(1<<0)	/* Run shrink_inactive_list on the zone */
 #define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
 #define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE	(1<<3)	/* Migrate to other nodes during reclaim */

 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
--- a/mm/vmscan.c~RECLAIM_MIGRATE 2021-03-31 15:17:40.339000190 -0700
+++ b/mm/vmscan.c 2021-03-31 15:17:40.357000190 -0700
@@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);

+	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
+		return false;
+
 	/* It is pointless to do demotion in memcg reclaim */
 	if (cgroup_reclaim(sc))
 		return false;
@@ -1083,8 +1086,7 @@ static bool migrate_demote_page_ok(struc
 	if (PageTransHuge(page) && !thp_migration_supported())
 		return false;

-	// FIXME: actually enable this later in the series
-	return false;
+	return true;
 }

 /* Check if a page is dirty or under writeback */
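
For readers who want to see how the new bit composes with the existing ones, here is a minimal userspace sketch of the mask arithmetic. It is not kernel code: the RECLAIM_* values are copied from the mempolicy.h hunk above, but mode_reclaim_enabled() and mode_allows_demotion() are hypothetical names, and they take the mode as a parameter (an assumption made for illustration) instead of reading the kernel's node_reclaim_mode global.

```c
#include <stdbool.h>

/* Bit values as they appear in the include/uapi/linux/mempolicy.h hunk. */
#define RECLAIM_ZONE	(1<<0)	/* sysctl value 1 */
#define RECLAIM_WRITE	(1<<1)	/* sysctl value 2 */
#define RECLAIM_UNMAP	(1<<2)	/* sysctl value 4 */
#define RECLAIM_MIGRATE	(1<<3)	/* sysctl value 8, added by this patch */

/*
 * Hypothetical userspace mirror of the patched node_reclaim_enabled():
 * true if any of the four known mode bits is set.
 */
static bool mode_reclaim_enabled(unsigned int mode)
{
	return (mode & (RECLAIM_ZONE | RECLAIM_WRITE |
			RECLAIM_UNMAP | RECLAIM_MIGRATE)) != 0;
}

/*
 * Hypothetical helper modelling only the flag test that the patch adds
 * at the top of migrate_demote_page_ok(); the other checks in that
 * function (memcg reclaim, THP migration support) are not modelled.
 */
static bool mode_allows_demotion(unsigned int mode)
{
	return (mode & RECLAIM_MIGRATE) != 0;
}
```

Under these assumptions, a mode of 9 (1 | 8) passes both checks, while a mode of 7 (1 | 2 | 4) enables reclaim but leaves demotion off, which matches the "OR'ed together" table in the vm.rst hunk.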