Message ID | 20200629234503.749E5340@viggo.jf.intel.com (mailing list archive)
---|---
Series | Migrate Pages in lieu of discard
On Mon, Jun 29, 2020 at 4:48 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> I've been sitting on these for too long.  The main purpose of this
> post is to have a public discussion with the other folks who are
> interested in this functionality and converge on a single
> implementation.
>
> This set directly incorporates a statistics patch from Yang Shi and
> also includes one to ensure good behavior with cgroup reclaim which
> was very closely derived from this series:
>
> https://lore.kernel.org/linux-mm/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com/
>
> Since the last post, the major changes are:
>  - Added patch to skip migration when doing cgroup reclaim
>  - Added stats patch from Yang Shi
>
> The full series is also available here:
>
> https://github.com/hansendc/linux/tree/automigrate-20200629
>
> --
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>

I have a high level question. Given a reclaim request for a set of
nodes, if there is no demotion path out of that set, should the kernel
still consider the migrations within the set of nodes? Basically,
should the decision to allow migrations within a reclaim request be
taken at the node level or at the reclaim-request (or allocation)
level?
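The mechanism described in the cover letter, migrating a reclaimed page to a slower node instead of dropping it, can be modeled in a few lines. The sketch below is a self-contained userspace model, not the actual shrink_page_list() change from the series; the node IDs, the demotion table, and the migrate/discard helpers are stand-ins.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in type: a page only knows which NUMA node it lives on. */
struct page { int nid; };

#define NO_TARGET (-1)

/* Illustrative per-node demotion targets, e.g. DRAM node 0 -> PMEM node 2. */
static const int demotion_target[] = { 2, 3, NO_TARGET, NO_TARGET };

static bool migrate_to_node(struct page *page, int target)
{
	/* Model of the migration step: pretend the copy always succeeds. */
	printf("demoting page from node %d to node %d\n", page->nid, target);
	page->nid = target;
	return true;
}

static void discard_page(struct page *page)
{
	printf("discarding page from node %d\n", page->nid);
}

/*
 * Model of the end of reclaim: instead of unconditionally dropping a
 * cold page, try to demote it to the node's slower target first.
 */
static void reclaim_page(struct page *page)
{
	int target = demotion_target[page->nid];

	if (target != NO_TARGET && migrate_to_node(page, target))
		return;		/* page now lives in slower memory */

	discard_page(page);	/* no target (or migration failed): old behavior */
}

int main(void)
{
	struct page on_dram = { .nid = 0 };
	struct page on_pmem = { .nid = 2 };

	reclaim_page(&on_dram);	/* demoted to node 2 */
	reclaim_page(&on_pmem);	/* no further target: discarded */
	return 0;
}
```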
On 6/30/20 11:36 AM, Shakeel Butt wrote:
>> This is part of a larger patch set.  If you want to apply these or
>> play with them, I'd suggest using the tree from here.  It includes
>> autonuma-based hot page promotion back to DRAM:
>>
>> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>>
>> This is also all based on an upstream mechanism that allows
>> persistent memory to be onlined and used as if it were volatile:
>>
>> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>>
> I have a high level question. Given a reclaim request for a set of
> nodes, if there is no demotion path out of that set, should the kernel
> still consider the migrations within the set of nodes?

OK, to be specific, we're talking about a case where we've arrived at
try_to_free_pages() and, say, all of the nodes on the system are set in
sc->nodemask?  Isn't the common case that all nodes are set in
sc->nodemask?  Since there is never a demotion path out of the set of
all nodes, the common case would be that there is no demotion path out
of a reclaim node set.

If that's true, I'd say that the kernel still needs to consider
migrations even within the set.
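The question can be restated as a check against the reclaim nodemask: does any node allowed by sc->nodemask demote to a node outside it? The self-contained sketch below models that check with an illustrative demotion table and a plain bitmask standing in for sc->nodemask; when the mask covers every node, as in the common case above, no demotion path ever leaves the set.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8
#define NO_TARGET (-1)

/* Illustrative demotion table: node -> demotion target. */
static const int demotion_target[MAX_NODES] = {
	2, 3, NO_TARGET, NO_TARGET,
	NO_TARGET, NO_TARGET, NO_TARGET, NO_TARGET
};

/* Model of sc->nodemask as a simple bitmask of allowed nodes. */
static bool node_in_mask(unsigned int mask, int nid)
{
	return mask & (1u << nid);
}

/*
 * Does any node in the reclaim mask demote to a node *outside* the mask?
 * If not (e.g. the mask covers all nodes), demotion is necessarily
 * "within the set", which is why reclaim still has to consider it.
 */
static bool demotion_leaves_mask(unsigned int mask)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		int target;

		if (!node_in_mask(mask, nid))
			continue;
		target = demotion_target[nid];
		if (target != NO_TARGET && !node_in_mask(mask, target))
			return true;
	}
	return false;
}

int main(void)
{
	unsigned int all_nodes = 0xf;	/* nodes 0-3 allowed */
	unsigned int dram_only = 0x3;	/* nodes 0-1 allowed */

	printf("all nodes: demotion leaves mask? %d\n",
	       demotion_leaves_mask(all_nodes));
	printf("DRAM only: demotion leaves mask? %d\n",
	       demotion_leaves_mask(dram_only));
	return 0;
}
```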
On Tue, Jun 30, 2020 at 11:51 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/30/20 11:36 AM, Shakeel Butt wrote:
> >> This is part of a larger patch set.  If you want to apply these or
> >> play with them, I'd suggest using the tree from here.  It includes
> >> autonuma-based hot page promotion back to DRAM:
> >>
> >> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
> >>
> >> This is also all based on an upstream mechanism that allows
> >> persistent memory to be onlined and used as if it were volatile:
> >>
> >> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
> >>
> > I have a high level question. Given a reclaim request for a set of
> > nodes, if there is no demotion path out of that set, should the kernel
> > still consider the migrations within the set of nodes?
>
> OK, to be specific, we're talking about a case where we've arrived at
> try_to_free_pages()

Yes.

> and, say, all of the nodes on the system are set in
> sc->nodemask?  Isn't the common case that all nodes are set in
> sc->nodemask?

Depends on the workload, but for normal users, yes.

> Since there is never a demotion path out of the set of
> all nodes, the common case would be that there is no demotion path out
> of a reclaim node set.
>
> If that's true, I'd say that the kernel still needs to consider
> migrations even within the set.

In my opinion it should be a user-defined policy, but I think that
discussion is orthogonal to this patch series. As I understand it, this
patch series aims to add the migration-within-reclaim infrastructure;
IMO the policies, optimizations, and heuristics can come later.

BTW is this proposal only for systems having multiple tiers of memory?
Can a multi-node DRAM-only system take advantage of this proposal? For
example, I have a system with two DRAM nodes running two jobs
hardwalled to each node. For each job the other node is kind of
low-tier memory. If I can describe the per-job demotion paths, then
these jobs can take advantage of this proposal during occasional
peaks.
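A per-job demotion path, as described in the DRAM-only example above, would essentially attach a demotion table to a job rather than to the system. No such per-job interface exists in the series; the sketch below only models what such a policy would have to express for the two-node, two-job case, with made-up job names and node assignments.

```c
#include <stdio.h>

#define MAX_NODES 2
#define NO_TARGET (-1)

/*
 * Hypothetical per-job demotion policy: two jobs hardwalled to one DRAM
 * node each, each treating the other node as its "low tier".
 */
struct job_policy {
	const char *name;
	int home_node;
	int demotion_target[MAX_NODES];
};

static const struct job_policy jobs[] = {
	{ "job-A", 0, { 1, NO_TARGET } },	/* demote node 0 -> node 1 */
	{ "job-B", 1, { NO_TARGET, 0 } },	/* demote node 1 -> node 0 */
};

int main(void)
{
	for (unsigned int i = 0; i < sizeof(jobs) / sizeof(jobs[0]); i++)
		printf("%s: reclaim on node %d demotes to node %d\n",
		       jobs[i].name, jobs[i].home_node,
		       jobs[i].demotion_target[jobs[i].home_node]);
	return 0;
}
```

Note that the two per-job paths together form a system-wide cycle (0 -> 1 and 1 -> 0), which is exactly the situation the next two replies discuss.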
On 6/30/20 12:25 PM, Shakeel Butt wrote:
>> Since there is never a demotion path out of the set of
>> all nodes, the common case would be that there is no demotion path out
>> of a reclaim node set.
>>
>> If that's true, I'd say that the kernel still needs to consider
>> migrations even within the set.
> In my opinion it should be a user-defined policy, but I think that
> discussion is orthogonal to this patch series. As I understand it, this
> patch series aims to add the migration-within-reclaim infrastructure;
> IMO the policies, optimizations, and heuristics can come later.

Yes, this should be considered as adding the infrastructure and one
_simple_ policy implementation which sets up migration away from nodes
with CPUs to more distant nodes without CPUs.  This simple policy will
be useful for (but not limited to) volatile-use persistent memory like
Intel's Optane DIMMs.

> BTW is this proposal only for systems having multiple tiers of memory?
> Can a multi-node DRAM-only system take advantage of this proposal? For
> example, I have a system with two DRAM nodes running two jobs
> hardwalled to each node. For each job the other node is kind of
> low-tier memory. If I can describe the per-job demotion paths, then
> these jobs can take advantage of this proposal during occasional
> peaks.

I don't see any reason it could not work there.  There would just need
to be a way to set up a different demotion path policy than what was
done here.
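The simple policy described here, demoting from nodes with CPUs to the nearest nodes without CPUs, can be sketched as building a per-node target table from the topology. The node count, CPU layout, and distance matrix below are made up for illustration and are not taken from the series.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4
#define NO_TARGET (-1)

/* Illustrative topology: which nodes have CPUs, and a distance matrix. */
static const bool node_has_cpus[MAX_NODES] = { true, true, false, false };
static const int node_distance[MAX_NODES][MAX_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 21 },
	{ 28, 17, 21, 10 },
};

static int demotion_target[MAX_NODES];

/*
 * Model of the "simple policy": every node with CPUs demotes to the
 * nearest node without CPUs; CPU-less nodes get no demotion target.
 */
static void build_demotion_targets(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		int best = NO_TARGET, best_dist = INT_MAX;

		demotion_target[nid] = NO_TARGET;
		if (!node_has_cpus[nid])
			continue;

		for (int t = 0; t < MAX_NODES; t++) {
			if (node_has_cpus[t])
				continue;
			if (node_distance[nid][t] < best_dist) {
				best_dist = node_distance[nid][t];
				best = t;
			}
		}
		demotion_target[nid] = best;
	}
}

int main(void)
{
	build_demotion_targets();
	for (int nid = 0; nid < MAX_NODES; nid++)
		printf("node %d demotes to node %d\n", nid, demotion_target[nid]);
	return 0;
}
```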
On 30 Jun 2020, at 15:31, Dave Hansen wrote:

>> BTW is this proposal only for systems having multiple tiers of memory?
>> Can a multi-node DRAM-only system take advantage of this proposal? For
>> example, I have a system with two DRAM nodes running two jobs
>> hardwalled to each node. For each job the other node is kind of
>> low-tier memory. If I can describe the per-job demotion paths, then
>> these jobs can take advantage of this proposal during occasional
>> peaks.
>
> I don't see any reason it could not work there.  There would just need
> to be a way to set up a different demotion path policy than what was
> done here.

We might need a different threshold (or GFP flag) for allocating new
pages in a remote node for demotion.  Otherwise, we could see scenarios
like: two nodes in a system are almost full, and Node A is trying to
demote some pages to Node B, which triggers page demotion from Node B
to Node A.

We might be able to avoid a demotion cycle by not allowing Node A to
demote pages again and instead swapping them to disk while Node B is
demoting its pages to Node A, but this still leads to a longer reclaim
path compared to having Node A swap to disk directly.  In such cases,
Node A should just swap pages to disk without bothering Node B at all.

Maybe something like a GFP_DEMOTION flag for allocating pages for
demotion, where the flag requires more free pages to be available in
the destination node, to avoid the situation above?

—
Best Regards,
Yan Zi
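One way to picture the threshold suggested here is a demotion-specific free-page check on the destination node that is stricter than the normal allocation watermark, so a nearly-full node is never chosen as a target. The sketch below is purely illustrative: no GFP_DEMOTION flag exists in the series, and the 2x margin and the per-node counters are arbitrary assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4

/* Illustrative per-node counters (in pages). */
static const long nr_free[MAX_NODES]       = { 1000, 800, 50000, 200 };
static const long low_watermark[MAX_NODES] = {  512, 512,  2048, 512 };

/*
 * Model of a demotion-specific threshold: only allow demotion to a node
 * that sits comfortably above its normal low watermark, so reclaim on a
 * nearly-full target is never triggered and A->B->A ping-pong can't
 * start.  The 2x margin is an arbitrary illustrative choice.
 */
static bool can_demote_to(int target)
{
	return nr_free[target] > 2 * low_watermark[target];
}

int main(void)
{
	printf("demote to node 2 (plenty free)? %d\n", can_demote_to(2));
	printf("demote to node 3 (nearly full)? %d\n", can_demote_to(3));
	return 0;
}
```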
On 7/1/20 7:24 AM, Zi Yan wrote:
> On 30 Jun 2020, at 15:31, Dave Hansen wrote:
>>> BTW is this proposal only for systems having multiple tiers of
>>> memory?  Can a multi-node DRAM-only system take advantage of
>>> this proposal?  For example, I have a system with two DRAM nodes
>>> running two jobs hardwalled to each node.  For each job the
>>> other node is kind of low-tier memory.  If I can describe the
>>> per-job demotion paths, then these jobs can take advantage of
>>> this proposal during occasional peaks.
>> I don't see any reason it could not work there.  There would just
>> need to be a way to set up a different demotion path policy than
>> what was done here.
> We might need a different threshold (or GFP flag) for allocating
> new pages in a remote node for demotion.  Otherwise, we could see
> scenarios like: two nodes in a system are almost full, and Node A
> is trying to demote some pages to Node B, which triggers page
> demotion from Node B to Node A.

I've always assumed that migration cycles would be illegal since it's
so hard to guarantee forward reclaim progress with them in place.
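Treating migration cycles as illegal implies the demotion map has to be acyclic so reclaim always makes forward progress. A minimal sketch of such a validity check follows, using an illustrative per-node target table rather than the series' actual data structure: walk the chain from each node and reject the map if any chain revisits a node.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4
#define NO_TARGET (-1)

/*
 * Follow the demotion chain from every node; if a chain ever revisits a
 * node, the map contains a cycle and must be rejected.
 */
static bool demotion_map_is_acyclic(const int target[MAX_NODES])
{
	for (int start = 0; start < MAX_NODES; start++) {
		bool seen[MAX_NODES] = { false };
		int nid = start;

		while (nid != NO_TARGET) {
			if (seen[nid])
				return false;	/* came back around: cycle */
			seen[nid] = true;
			nid = target[nid];
		}
	}
	return true;
}

int main(void)
{
	const int good[MAX_NODES] = { 2, 3, NO_TARGET, NO_TARGET };
	const int bad[MAX_NODES]  = { 1, 0, NO_TARGET, NO_TARGET }; /* 0 <-> 1 */

	printf("good map acyclic? %d\n", demotion_map_is_acyclic(good));
	printf("bad map acyclic?  %d\n", demotion_map_is_acyclic(bad));
	return 0;
}
```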