Message ID | 20200629234503.749E5340@viggo.jf.intel.com (mailing list archive)
---|---
Series | Migrate Pages in lieu of discard
On Mon, Jun 29, 2020 at 4:48 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> I've been sitting on these for too long.  The main purpose of this
> post is to have a public discussion with the other folks who are
> interested in this functionality and converge on a single
> implementation.
>
> This set directly incorporates a statistics patch from Yang Shi and
> also includes one to ensure good behavior with cgroup reclaim which
> was very closely derived from this series:
>
> https://lore.kernel.org/linux-mm/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com/
>
> Since the last post, the major changes are:
>  - Added patch to skip migration when doing cgroup reclaim
>  - Added stats patch from Yang Shi
>
> The full series is also available here:
>
> https://github.com/hansendc/linux/tree/automigrate-20200629
>
> --
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>

I have a high level question. Given a reclaim request for a set of
nodes, if there is no demotion path out of that set, should the kernel
still consider the migrations within the set of nodes? Basically,
should the decision to allow migrations within a reclaim request be
taken at the node level or at the reclaim-request (or allocation)
level?
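The mechanism described in the cover letter, migrating a reclaimed page to a slower node instead of dropping it, can be modeled in a few lines. The sketch below is a self-contained userspace model, not the actual shrink_page_list() change from the series; the node IDs, the demotion table, and the migrate/discard helpers are stand-ins.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in type: a page only knows which NUMA node it lives on. */
struct page { int nid; };

#define NO_TARGET (-1)

/* Illustrative per-node demotion targets, e.g. DRAM node 0 -> PMEM node 2. */
static const int demotion_target[] = { 2, 3, NO_TARGET, NO_TARGET };

static bool migrate_to_node(struct page *page, int target)
{
	/* Model of the migration step: pretend the copy always succeeds. */
	printf("demoting page from node %d to node %d\n", page->nid, target);
	page->nid = target;
	return true;
}

static void discard_page(struct page *page)
{
	printf("discarding page from node %d\n", page->nid);
}

/*
 * Model of the end of reclaim: instead of unconditionally dropping a
 * cold page, try to demote it to the node's slower target first.
 */
static void reclaim_page(struct page *page)
{
	int target = demotion_target[page->nid];

	if (target != NO_TARGET && migrate_to_node(page, target))
		return;		/* page now lives in slower memory */

	discard_page(page);	/* no target (or migration failed): old behavior */
}

int main(void)
{
	struct page on_dram = { .nid = 0 };
	struct page on_pmem = { .nid = 2 };

	reclaim_page(&on_dram);	/* demoted to node 2 */
	reclaim_page(&on_pmem);	/* no further target: discarded */
	return 0;
}
```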
On 6/30/20 11:36 AM, Shakeel Butt wrote:
>> This is part of a larger patch set.  If you want to apply these or
>> play with them, I'd suggest using the tree from here.  It includes
>> autonuma-based hot page promotion back to DRAM:
>>
>> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>>
>> This is also all based on an upstream mechanism that allows
>> persistent memory to be onlined and used as if it were volatile:
>>
>> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>>
> I have a high level question. Given a reclaim request for a set of
> nodes, if there is no demotion path out of that set, should the kernel
> still consider the migrations within the set of nodes?

OK, to be specific, we're talking about a case where we've arrived at
try_to_free_pages() and, say, all of the nodes on the system are set in
sc->nodemask?  Isn't the common case that all nodes are set in
sc->nodemask?  Since there is never a demotion path out of the set of
all nodes, the common case would be that there is no demotion path out
of a reclaim node set.

If that's true, I'd say that the kernel still needs to consider
migrations even within the set.
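The question can be restated as a check against the reclaim nodemask: does any node allowed by sc->nodemask demote to a node outside it? The self-contained sketch below models that check with an illustrative demotion table and a plain bitmask standing in for sc->nodemask; when the mask covers every node, as in the common case above, no demotion path ever leaves the set.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8
#define NO_TARGET (-1)

/* Illustrative demotion table: node -> demotion target. */
static const int demotion_target[MAX_NODES] = {
	2, 3, NO_TARGET, NO_TARGET,
	NO_TARGET, NO_TARGET, NO_TARGET, NO_TARGET
};

/* Model of sc->nodemask as a simple bitmask of allowed nodes. */
static bool node_in_mask(unsigned int mask, int nid)
{
	return mask & (1u << nid);
}

/*
 * Does any node in the reclaim mask demote to a node *outside* the mask?
 * If not (e.g. the mask covers all nodes), demotion is necessarily
 * "within the set", which is why reclaim still has to consider it.
 */
static bool demotion_leaves_mask(unsigned int mask)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		int target;

		if (!node_in_mask(mask, nid))
			continue;
		target = demotion_target[nid];
		if (target != NO_TARGET && !node_in_mask(mask, target))
			return true;
	}
	return false;
}

int main(void)
{
	unsigned int all_nodes = 0xf;	/* nodes 0-3 allowed */
	unsigned int dram_only = 0x3;	/* nodes 0-1 allowed */

	printf("all nodes: demotion leaves mask? %d\n",
	       demotion_leaves_mask(all_nodes));
	printf("DRAM only: demotion leaves mask? %d\n",
	       demotion_leaves_mask(dram_only));
	return 0;
}
```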
On Tue, Jun 30, 2020 at 11:51 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/30/20 11:36 AM, Shakeel Butt wrote:
> >> This is part of a larger patch set.  If you want to apply these or
> >> play with them, I'd suggest using the tree from here.  It includes
> >> autonuma-based hot page promotion back to DRAM:
> >>
> >> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
> >>
> >> This is also all based on an upstream mechanism that allows
> >> persistent memory to be onlined and used as if it were volatile:
> >>
> >> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
> >>
> > I have a high level question. Given a reclaim request for a set of
> > nodes, if there is no demotion path out of that set, should the kernel
> > still consider the migrations within the set of nodes?
>
> OK, to be specific, we're talking about a case where we've arrived at
> try_to_free_pages()

Yes.

> and, say, all of the nodes on the system are set in
> sc->nodemask?  Isn't the common case that all nodes are set in
> sc->nodemask?

Depends on the workload, but for normal users, yes.

> Since there is never a demotion path out of the set of
> all nodes, the common case would be that there is no demotion path out
> of a reclaim node set.
>
> If that's true, I'd say that the kernel still needs to consider
> migrations even within the set.

In my opinion it should be a user-defined policy, but I think that
discussion is orthogonal to this patch series. As I understand it, this
patch series aims to add the migration-within-reclaim infrastructure;
IMO the policies, optimizations, and heuristics can come later.

BTW is this proposal only for systems having multiple tiers of memory?
Can a multi-node DRAM-only system take advantage of this proposal? For
example, I have a system with two DRAM nodes running two jobs
hardwalled to each node. For each job the other node is kind of
low-tier memory. If I can describe the per-job demotion paths, then
these jobs can take advantage of this proposal during occasional
peaks.
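A per-job demotion path, as described in the DRAM-only example above, would essentially attach a demotion table to a job rather than to the system. No such per-job interface exists in the series; the sketch below only models what such a policy would have to express for the two-node, two-job case, with made-up job names and node assignments.

```c
#include <stdio.h>

#define MAX_NODES 2
#define NO_TARGET (-1)

/*
 * Hypothetical per-job demotion policy: two jobs hardwalled to one DRAM
 * node each, each treating the other node as its "low tier".
 */
struct job_policy {
	const char *name;
	int home_node;
	int demotion_target[MAX_NODES];
};

static const struct job_policy jobs[] = {
	{ "job-A", 0, { 1, NO_TARGET } },	/* demote node 0 -> node 1 */
	{ "job-B", 1, { NO_TARGET, 0 } },	/* demote node 1 -> node 0 */
};

int main(void)
{
	for (unsigned int i = 0; i < sizeof(jobs) / sizeof(jobs[0]); i++)
		printf("%s: reclaim on node %d demotes to node %d\n",
		       jobs[i].name, jobs[i].home_node,
		       jobs[i].demotion_target[jobs[i].home_node]);
	return 0;
}
```

Note that the two per-job paths together form a system-wide cycle (0 -> 1 and 1 -> 0), which is exactly the situation the next two replies discuss.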
On 6/30/20 12:25 PM, Shakeel Butt wrote:
>> Since there is never a demotion path out of the set of
>> all nodes, the common case would be that there is no demotion path out
>> of a reclaim node set.
>>
>> If that's true, I'd say that the kernel still needs to consider
>> migrations even within the set.
> In my opinion it should be a user-defined policy, but I think that
> discussion is orthogonal to this patch series. As I understand it, this
> patch series aims to add the migration-within-reclaim infrastructure;
> IMO the policies, optimizations, and heuristics can come later.

Yes, this should be considered as adding the infrastructure and one
_simple_ policy implementation which sets up migration away from nodes
with CPUs to more distant nodes without CPUs.  This simple policy will
be useful for (but not limited to) volatile-use persistent memory like
Intel's Optane DIMMs.

> BTW is this proposal only for systems having multiple tiers of memory?
> Can a multi-node DRAM-only system take advantage of this proposal? For
> example, I have a system with two DRAM nodes running two jobs
> hardwalled to each node. For each job the other node is kind of
> low-tier memory. If I can describe the per-job demotion paths, then
> these jobs can take advantage of this proposal during occasional
> peaks.

I don't see any reason it could not work there.  There would just need
to be a way to set up a different demotion path policy than what was
done here.
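The simple policy described here, demoting from nodes with CPUs to the nearest nodes without CPUs, can be sketched as building a per-node target table from the topology. The node count, CPU layout, and distance matrix below are made up for illustration and are not taken from the series.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4
#define NO_TARGET (-1)

/* Illustrative topology: which nodes have CPUs, and a distance matrix. */
static const bool node_has_cpus[MAX_NODES] = { true, true, false, false };
static const int node_distance[MAX_NODES][MAX_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 21 },
	{ 28, 17, 21, 10 },
};

static int demotion_target[MAX_NODES];

/*
 * Model of the "simple policy": every node with CPUs demotes to the
 * nearest node without CPUs; CPU-less nodes get no demotion target.
 */
static void build_demotion_targets(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		int best = NO_TARGET, best_dist = INT_MAX;

		demotion_target[nid] = NO_TARGET;
		if (!node_has_cpus[nid])
			continue;

		for (int t = 0; t < MAX_NODES; t++) {
			if (node_has_cpus[t])
				continue;
			if (node_distance[nid][t] < best_dist) {
				best_dist = node_distance[nid][t];
				best = t;
			}
		}
		demotion_target[nid] = best;
	}
}

int main(void)
{
	build_demotion_targets();
	for (int nid = 0; nid < MAX_NODES; nid++)
		printf("node %d demotes to node %d\n", nid, demotion_target[nid]);
	return 0;
}
```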
On 30 Jun 2020, at 15:31, Dave Hansen wrote:

>> BTW is this proposal only for systems having multiple tiers of memory?
>> Can a multi-node DRAM-only system take advantage of this proposal? For
>> example, I have a system with two DRAM nodes running two jobs
>> hardwalled to each node. For each job the other node is kind of
>> low-tier memory. If I can describe the per-job demotion paths, then
>> these jobs can take advantage of this proposal during occasional
>> peaks.
>
> I don't see any reason it could not work there.  There would just need
> to be a way to set up a different demotion path policy than what was
> done here.

We might need a different threshold (or GFP flag) for allocating new
pages in a remote node for demotion.  Otherwise, we could see scenarios
like: two nodes in a system are almost full, and Node A is trying to
demote some pages to Node B, which triggers page demotion from Node B
to Node A.

We might be able to avoid a demotion cycle by not allowing Node A to
demote pages again and instead swapping them to disk while Node B is
demoting its pages to Node A, but this still leads to a longer reclaim
path compared to having Node A swap to disk directly.  In such cases,
Node A should just swap pages to disk without bothering Node B at all.

Maybe something like a GFP_DEMOTION flag for allocating pages for
demotion, where the flag requires more free pages to be available in
the destination node, to avoid the situation above?

—
Best Regards,
Yan Zi
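One way to picture the threshold suggested here is a demotion-specific free-page check on the destination node that is stricter than the normal allocation watermark, so a nearly-full node is never chosen as a target. The sketch below is purely illustrative: no GFP_DEMOTION flag exists in the series, and the 2x margin and the per-node counters are arbitrary assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4

/* Illustrative per-node counters (in pages). */
static const long nr_free[MAX_NODES]       = { 1000, 800, 50000, 200 };
static const long low_watermark[MAX_NODES] = {  512, 512,  2048, 512 };

/*
 * Model of a demotion-specific threshold: only allow demotion to a node
 * that sits comfortably above its normal low watermark, so reclaim on a
 * nearly-full target is never triggered and A->B->A ping-pong can't
 * start.  The 2x margin is an arbitrary illustrative choice.
 */
static bool can_demote_to(int target)
{
	return nr_free[target] > 2 * low_watermark[target];
}

int main(void)
{
	printf("demote to node 2 (plenty free)? %d\n", can_demote_to(2));
	printf("demote to node 3 (nearly full)? %d\n", can_demote_to(3));
	return 0;
}
```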
On 7/1/20 7:24 AM, Zi Yan wrote:
> On 30 Jun 2020, at 15:31, Dave Hansen wrote:
>>> BTW is this proposal only for systems having multiple tiers of
>>> memory?  Can a multi-node DRAM-only system take advantage of
>>> this proposal?  For example, I have a system with two DRAM nodes
>>> running two jobs hardwalled to each node.  For each job the
>>> other node is kind of low-tier memory.  If I can describe the
>>> per-job demotion paths, then these jobs can take advantage of
>>> this proposal during occasional peaks.
>> I don't see any reason it could not work there.  There would just
>> need to be a way to set up a different demotion path policy than
>> what was done here.
> We might need a different threshold (or GFP flag) for allocating
> new pages in a remote node for demotion.  Otherwise, we could see
> scenarios like: two nodes in a system are almost full, and Node A
> is trying to demote some pages to Node B, which triggers page
> demotion from Node B to Node A.

I've always assumed that migration cycles would be illegal since it's
so hard to guarantee forward reclaim progress with them in place.
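Treating migration cycles as illegal implies the demotion map has to be acyclic so reclaim always makes forward progress. A minimal sketch of such a validity check follows, using an illustrative per-node target table rather than the series' actual data structure: walk the chain from each node and reject the map if any chain revisits a node.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4
#define NO_TARGET (-1)

/*
 * Follow the demotion chain from every node; if a chain ever revisits a
 * node, the map contains a cycle and must be rejected.
 */
static bool demotion_map_is_acyclic(const int target[MAX_NODES])
{
	for (int start = 0; start < MAX_NODES; start++) {
		bool seen[MAX_NODES] = { false };
		int nid = start;

		while (nid != NO_TARGET) {
			if (seen[nid])
				return false;	/* came back around: cycle */
			seen[nid] = true;
			nid = target[nid];
		}
	}
	return true;
}

int main(void)
{
	const int good[MAX_NODES] = { 2, 3, NO_TARGET, NO_TARGET };
	const int bad[MAX_NODES]  = { 1, 0, NO_TARGET, NO_TARGET }; /* 0 <-> 1 */

	printf("good map acyclic? %d\n", demotion_map_is_acyclic(good));
	printf("bad map acyclic?  %d\n", demotion_map_is_acyclic(bad));
	return 0;
}
```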