
[0/4] Allow persistent memory to be used like normal RAM

Message ID 20190116181859.D1504459@viggo.jf.intel.com

Message

Dave Hansen Jan. 16, 2019, 6:18 p.m. UTC
I would like to get this queued up to get merged.  Since most of the
churn is in the nvdimm code, and it also depends on some refactoring
that only exists in the nvdimm tree, it seems like putting it in *via*
the nvdimm tree is the best path.

But, this series makes non-trivial changes to the "resource" code and
memory hotplug.  I'd really like to get some acks from folks on the
first three patches which affect those areas.

Borislav and Bjorn, you seem to be the most active in the resource code.

Michal, I'd really appreciate a look at all of this from a mem hotplug
perspective.

Note: these are based on commit d2f33c19644 in:

	git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git libnvdimm-pending

Changes since v1:
 * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
 * Use binding/unbinding from "dax bus" code
 * Move over to a "dax bus" driver from being an nvdimm driver

--

Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

This series adds a new "driver" to which pmem devices can be
attached.  Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory.  On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA
node.

Here's how I set up a system to test this thing:

1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
   physical seems to work: memmap=512M!0x0000000080000000
   This will end up looking like a pmem device at boot.  (See
   the example qemu command line after step 5.)
3. When booted, convert the fsdax device to "device dax":
	ndctl create-namespace -fe namespace0.0 -m dax
4. See patch 4 for instructions on binding the kmem driver
   to a device.
5. Now, online the new memory sections.  Perhaps:

grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
	echo $f: `cat $f`
	echo online_movable > $f
	grep ^MemTotal /proc/meminfo
done
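
For reference, steps 1 and 2 might look roughly like the following when
booting the guest kernel directly with qemu.  The kernel image, disk
image, and root= argument are placeholders for whatever your test setup
uses, and the memmap= option can equally well go in the guest's
bootloader config instead of -append:

	qemu-system-x86_64 -enable-kvm -m 4096 \
		-kernel /path/to/bzImage \
		-append "root=/dev/sda1 memmap=512M!0x0000000080000000" \
		-drive file=/path/to/disk.img,format=raw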

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>

Comments

Jeff Moyer Jan. 17, 2019, 4:29 p.m. UTC | #1
Dave Hansen <dave.hansen@linux.intel.com> writes:

> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!

So, isn't that what memory mode is for?
  https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/

Why do we need this code in the kernel?

-Jeff
Keith Busch Jan. 17, 2019, 4:47 p.m. UTC | #2
On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote:
> Dave Hansen <dave.hansen@linux.intel.com> writes:
> > Persistent memory is cool.  But, currently, you have to rewrite
> > your applications to use it.  Wouldn't it be cool if you could
> > just have it show up in your system like normal RAM and get to
> > it like a slow blob of memory?  Well... have I got the patch
> > series for you!
> 
> So, isn't that what memory mode is for?
>   https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> 
> Why do we need this code in the kernel?

I don't think those are the same thing. The "memory mode" in the link
refers to platforms that sequester DRAM as a side cache in front of
memory accesses; this series has no such platform dependency and does
not hide the faster DRAM.
Dan Williams Jan. 17, 2019, 4:50 p.m. UTC | #3
On Thu, Jan 17, 2019 at 8:29 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dave Hansen <dave.hansen@linux.intel.com> writes:
>
> > Persistent memory is cool.  But, currently, you have to rewrite
> > your applications to use it.  Wouldn't it be cool if you could
> > just have it show up in your system like normal RAM and get to
> > it like a slow blob of memory?  Well... have I got the patch
> > series for you!
>
> So, isn't that what memory mode is for?
>   https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/

That's a hardware cache that privately manages DRAM in front of PMEM.
It benefits from some help from software [1].

> Why do we need this code in the kernel?

This goes further and enables software managed allocation decisions
with the full DRAM + PMEM address space.

[1]: https://lore.kernel.org/lkml/154767945660.1983228.12167020940431682725.stgit@dwillia2-desk3.amr.corp.intel.com/
Jeff Moyer Jan. 17, 2019, 5:20 p.m. UTC | #4
Keith Busch <keith.busch@intel.com> writes:

> On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote:
>> Dave Hansen <dave.hansen@linux.intel.com> writes:
>> > Persistent memory is cool.  But, currently, you have to rewrite
>> > your applications to use it.  Wouldn't it be cool if you could
>> > just have it show up in your system like normal RAM and get to
>> > it like a slow blob of memory?  Well... have I got the patch
>> > series for you!
>> 
>> So, isn't that what memory mode is for?
>>   https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
>> 
>> Why do we need this code in the kernel?
>
> I don't think those are the same thing. The "memory mode" in the link
> refers to platforms that sequester DRAM as a side cache in front of
> memory accesses; this series has no such platform dependency and does
> not hide the faster DRAM.

OK, so you are making two arguments here.  1) platforms may not support
memory mode, and 2) this series allows for performance-differentiated
memory (even though applications may not be modified to make use of
that...).

With this patch set, an unmodified application would either use:

1) whatever memory it happened to get
2) only the faster dram (via numactl --membind=)
3) only the slower pmem (again, via numactl --membind=)
4) preferentially one or the other (numactl --preferred=)
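
Concretely, assuming DRAM ends up as node 0 and the pmem as node 2
(the node numbers here are purely illustrative), those look roughly
like:

	numactl --membind=0 ./app       # DRAM only
	numactl --membind=2 ./app       # pmem only
	numactl --preferred=2 ./app     # prefer pmem, fall back to DRAM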

The other options are:
- as mentioned above, memory mode, which uses DRAM as a cache for the
  slower persistent memory.  Note that it isn't all or nothing--you can
  configure your system with both memory mode and appdirect.  The
  limitation, of course, is that your platform has to support this.

  This seems like the obvious solution if you want to make use of the
  larger pmem capacity as regular volatile memory (and your platform
  supports it).  But maybe there is some other limitation that motivated
  this work?

- libmemkind or pmdk.  These options typically* require application
  modifications, but allow those applications to actively decide which
  data lives in fast versus slow media.

  This seems like the obvious answer for applications that care about
  access latency.

* you could override the system malloc, but some libraries/application
  stacks already do that, so it isn't a universal solution.

Listing something like this in the headers of these patch series would
considerably reduce the head-scratching for reviewers.

Keith, you seem to be implying that there are platforms that won't
support memory mode.  Do you also have some insight into how customers
want to use this, beyond my speculation?  It's really frustrating to see
patch sets like this go by without any real use cases provided.

Cheers,
Jeff
Keith Busch Jan. 17, 2019, 7:34 p.m. UTC | #5
On Thu, Jan 17, 2019 at 12:20:06PM -0500, Jeff Moyer wrote:
> Keith Busch <keith.busch@intel.com> writes:
> > On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote:
> >> Dave Hansen <dave.hansen@linux.intel.com> writes:
> >> > Persistent memory is cool.  But, currently, you have to rewrite
> >> > your applications to use it.  Wouldn't it be cool if you could
> >> > just have it show up in your system like normal RAM and get to
> >> > it like a slow blob of memory?  Well... have I got the patch
> >> > series for you!
> >> 
> >> So, isn't that what memory mode is for?
> >>   https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> >> 
> >> Why do we need this code in the kernel?
> >
> > I don't think those are the same thing. The "memory mode" in the link
> > refers to platforms that sequester DRAM as a side cache in front of
> > memory accesses; this series has no such platform dependency and does
> > not hide the faster DRAM.
> 
> OK, so you are making two arguments here.  1) platforms may not support
> memory mode, and 2) this series allows for performance-differentiated
> memory (even though applications may not be modified to make use of
> that...).
> 
> With this patch set, an unmodified application would either use:
> 
> 1) whatever memory it happened to get
> 2) only the faster dram (via numactl --membind=)
> 3) only the slower pmem (again, via numactl --membind=)
> 4) preferentially one or the other (numactl --preferred=)

Yes, numactl and mbind are good ways for unmodified applications
to use these different memory types when they're available.

Tangentially related, I have another series[1] that provides supplementary
information that can be used to help make these decisions for platforms
that provide HMAT (heterogeneous memory attribute tables).

> The other options are:
> - as mentioned above, memory mode, which uses DRAM as a cache for the
>   slower persistent memory.  Note that it isn't all or nothing--you can
>   configure your system with both memory mode and appdirect.  The
>   limitation, of course, is that your platform has to support this.
>
>   This seems like the obvious solution if you want to make use of the
>   larger pmem capacity as regular volatile memory (and your platform
>   supports it).  But maybe there is some other limitation that motivated
>   this work?

The hardware-supported implementation is one way it may be used, and its
upside is that accessing the cached memory is transparent to the OS and
applications. They can use memory unaware that this is happening, so it
has a low barrier for applications to make use of the large available
address space.

There are some minimal things software may do that improve this mode,
as Dan mentioned in his reply [2], but it is still usable even without
such optimizations.

On the downside, a reboot would be required if you want to change the
memory configuration at a later time, for example if you decide more or
less DRAM is needed as cache. This series provides runtime hot-plug
capability.

It's also possible the customer knows better than the hardware which of
their data is hot vs. cold, but memory mode caching doesn't give them as
much control since the faster memory is hidden.

> - libmemkind or pmdk.  These options typically* require application
>   modifications, but allow those applications to actively decide which
>   data lives in fast versus slow media.
> 
>   This seems like the obvious answer for applications that care about
>   access latency.
> 
> * you could override the system malloc, but some libraries/application
>   stacks already do that, so it isn't a universal solution.
> 
> Listing something like this in the headers of these patch series would
> considerably reduce the head-scratching for reviewers.
> 
> Keith, you seem to be implying that there are platforms that won't
> support memory mode.  Do you also have some insight into how customers
> want to use this, beyond my speculation?  It's really frustrating to see
> patch sets like this go by without any real use cases provided.

Right, most NFIT reporting platforms today don't have memory mode, and
the kernel currently only supports the persistent DAX mode with these.
This series adds another option for those platforms.

I think numactl as you mentioned is the first consideration for how
customers may make use of this. Dave or Dan might have other use cases
in mind.
Just thinking out loud, if we wanted an in-kernel use case, it may be
interesting to make slower memory a swap tier so the host can manage
the cache rather than the hardware.

[1] https://lore.kernel.org/patchwork/cover/1032688/

[2] https://lore.kernel.org/lkml/154767945660.1983228.12167020940431682725.stgit@dwillia2-desk3.amr.corp.intel.com/
Jeff Moyer Jan. 17, 2019, 9:57 p.m. UTC | #6
Keith Busch <keith.busch@intel.com> writes:

>> Keith, you seem to be implying that there are platforms that won't
>> support memory mode.  Do you also have some insight into how customers
>> want to use this, beyond my speculation?  It's really frustrating to see
>> patch sets like this go by without any real use cases provided.
>
> Right, most NFIT reporting platforms today don't have memory mode, and
> the kernel currently only supports the persistent DAX mode with these.
> This series adds another option for those platforms.

All NFIT reporting platforms today are shipping NVDIMM-Ns, where it
makes absolutely no sense to use them as regular DRAM.  I don't think
that's a good argument to make.

> I think numactl as you mentioned is the first consideration for how
> customers may make use. Dave or Dan might have other use cases in mind.

Well, it sure looks like this took a lot of work, so I thought there
were known use cases or users asking for this functionality.

Cheers,
Jeff
Dave Hansen Jan. 17, 2019, 10:43 p.m. UTC | #7
On 1/17/19 8:29 AM, Jeff Moyer wrote:
>> Persistent memory is cool.  But, currently, you have to rewrite
>> your applications to use it.  Wouldn't it be cool if you could
>> just have it show up in your system like normal RAM and get to
>> it like a slow blob of memory?  Well... have I got the patch
>> series for you!
> So, isn't that what memory mode is for?
>   https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> 
> Why do we need this code in the kernel?

So, my bad for not mentioning memory mode.  This patch set existed
before we could talk about it publicly, so it simply ignores its
existence.  It's a pretty glaring omission at this point, sorry.

I'll add this to the patches, but here are a few reasons you might want
this instead of memory mode:
1. Memory mode is all-or-nothing.  Either 100% of your persistent memory
   is used for memory mode, or nothing is.  With this set, you can
   (theoretically) have very granular (128MB) assignment of PMEM to
   either volatile or persistent uses.  We have a few practical matters
   to fix to get us down to that 128MB value, but we can get there.
2. The capacity of memory mode is the size of your persistent memory.
   DRAM capacity is "lost" because it is used for cache.  With this,
   you get PMEM+DRAM capacity for memory.
3. DRAM acts as a cache with memory mode, and caches can lead to
   unpredictable latencies.  Since memory mode is all-or-nothing, your
   entire memory space is exposed to these unpredictable latencies.
   This solution lets you guarantee DRAM latencies if you need them.
4. The new "tier" of memory is exposed to software.  That means that you
   can build tiered applications or infrastructure.  A cloud provider
   could sell cheaper VMs that use more PMEM and more expensive ones
   that use DRAM.  That's impossible with memory mode.

Don't take this as criticism of memory mode.  Memory mode is awesome,
and doesn't strictly require *any* software changes (we have software
changes proposed for optimizing it though).  It has tons of other
advantages over *this* approach.  Basically, they are complementary
enough that we think both can live side-by-side.
Fengguang Wu Jan. 18, 2019, 11:48 a.m. UTC | #8
>With this patch set, an unmodified application would either use:
>
>1) whatever memory it happened to get
>2) only the faster dram (via numactl --membind=)
>3) only the slower pmem (again, via numactl --membind=)
>4) preferentially one or the other (numactl --preferred=)

Yet another option:

MemoryOptimizer -- hot page accounting and migration daemon
https://github.com/intel/memory-optimizer

Once PMEM NUMA nodes are available, we may run a user space daemon to
walk page tables of virtual machines (EPT) or processes, collect the
"accessed" bits to find out hot pages, and finally migrate hot pages
to DRAM and cold pages to PMEM.

In that scenario, only the kernel and the migration daemon need to be
aware of the PMEM nodes. Unmodified virtual machines and processes can
enjoy the added memory space w/o knowing whether they're using DRAM or
PMEM.
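
(As a crude illustration of that last step: assuming DRAM is node 0 and
the PMEM node is node 2, the migratepages(8) tool from the numactl
package can already move an existing process's pages between nodes.  A
daemon would make finer-grained, per-page decisions, but the effect is
similar:

	migratepages $PID 0 2    # demote this process's pages to PMEM
	migratepages $PID 2 0    # promote them back to DRAM
)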

Thanks,
Fengguang