[0/9] Allow persistent memory to be used like normal RAM

Message ID 20181022201317.8558C1D8@viggo.jf.intel.com (mailing list archive)

Message

Dave Hansen Oct. 22, 2018, 8:13 p.m. UTC
Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

This series adds a new "driver" to which pmem devices can be
attached.  Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory.  On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA
node.
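
For example, once the new node is online, ordinary NUMA tooling can
target it.  A quick sketch (the node number is illustrative and
depends on the platform):

	numactl --hardware              # list nodes; the pmem node shows up with its capacity
	numactl --membind=2 ./my_app    # force this app's allocations onto the pmem node
	numactl --preferred=2 ./my_app  # prefer pmem, fall back to DRAM when it runs out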

This is highly RFC, and I really want the feedback from the
nvdimm/pmem folks about whether this is a viable long-term
perversion of their code and device mode.  It's insufficiently
documented and probably not bisectable either.

Todo:
1. The device re-binding hacks are ham-fisted at best.  We
   need a better way of doing this, especially so the kmem
   driver does not get in the way of normal pmem devices.
2. When the device has no proper node, we default it to
   NUMA node 0.  Is that OK?
3. We muck with the 'struct resource' code quite a bit. It
   definitely needs a once-over from folks more familiar
   with it than I.
4. Is there a better way to do this than starting with a
   copy of pmem.c?

Here's how I set up a system to test this thing:

1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
   physical seems to work: memmap=512M!0x0000000080000000
   This will end up looking like a pmem device at boot.
3. When booted, convert fsdax device to "device dax":
	ndctl create-namespace -fe namespace0.0 -m dax
4. In the background, the kmem driver will probably bind to the
   new device.
5. Now, online the new memory sections.  Perhaps:

grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
	echo $f: `cat $f`
	echo online > $f
	grep ^MemTotal /proc/meminfo
done
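
To sanity-check that the capacity landed on its own node after
onlining (the node number will vary), something like this should do:

	cat /sys/devices/system/node/online   # a new node ID should appear
	numactl --hardware                    # its size should roughly match the pmem capacity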

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>

Comments

Dan Williams Oct. 23, 2018, 1:05 a.m. UTC | #1
On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!
>
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On
> systems with an HMAT (a new ACPI table), each socket (roughly)
> will have a separate NUMA node for its persistent memory so
> this newly-added memory can be selected by its unique NUMA
> node.
>
> This is highly RFC, and I really want the feedback from the
> nvdimm/pmem folks about whether this is a viable long-term
> perversion of their code and device mode.  It's insufficiently
> documented and probably not bisectable either.
>
> Todo:
> 1. The device re-binding hacks are ham-fisted at best.  We
>    need a better way of doing this, especially so the kmem
>    driver does not get in the way of normal pmem devices.
> 2. When the device has no proper node, we default it to
>    NUMA node 0.  Is that OK?
> 3. We muck with the 'struct resource' code quite a bit. It
>    definitely needs a once-over from folks more familiar
>    with it than I.
> 4. Is there a better way to do this than starting with a
>    copy of pmem.c?

So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7
and remove all the devm_memremap_pages() infrastructure and dax_region
infrastructure.

The driver should be a dead simple turn around to call add_memory()
for the passed in range. The hard part is, as you say, arranging for
the kmem driver to not stand in the way of typical range / device
claims by the dax_pmem device.

To me this looks like teaching the nvdimm-bus and this dax_kmem driver
to require explicit matching based on 'id'. The attachment scheme
would look like this:

modprobe dax_kmem
echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind

At step 1 the dax_kmem driver will match no devices and stay out of
the way of dax_pmem. It learns about the devices it cares about by
being explicitly told about them. Then unbind from the typical
dax_pmem driver and attach to dax_kmem to perform the one-way
hotplug.

I expect udev can automate this by setting up a rule to watch for
device-dax instances by UUID and call a script to do the detach /
reattach dance.
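
A rough sketch of the helper such a rule could RUN (the rule shown in
the comment, the script name, and the device name are illustrative; a
real rule would key off the namespace UUID):

#!/bin/sh
# dax-to-kmem.sh: one-way conversion of a device-dax instance to system RAM,
# invoked from a udev rule along the lines of:
#   SUBSYSTEM=="dax", ACTION=="add", KERNEL=="dax0.0", RUN+="/usr/local/sbin/dax-to-kmem.sh %k"
dev="$1"                                            # e.g. "dax0.0"
modprobe dax_kmem
echo "$dev" > /sys/bus/nd/drivers/dax_kmem/new_id   # let dax_kmem claim this id
echo "$dev" > /sys/bus/nd/drivers/dax_pmem/unbind   # detach from dax_pmem
echo "$dev" > /sys/bus/nd/drivers/dax_kmem/bind     # hot-add the range as RAM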
Dan Williams Oct. 23, 2018, 1:11 a.m. UTC | #2
On Mon, Oct 22, 2018 at 6:05 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >
> > Persistent memory is cool.  But, currently, you have to rewrite
> > your applications to use it.  Wouldn't it be cool if you could
> > just have it show up in your system like normal RAM and get to
> > it like a slow blob of memory?  Well... have I got the patch
> > series for you!
> >
> > This series adds a new "driver" to which pmem devices can be
> > attached.  Once attached, the memory "owned" by the device is
> > hot-added to the kernel and managed like any other memory.  On
> > systems with an HMAT (a new ACPI table), each socket (roughly)
> > will have a separate NUMA node for its persistent memory so
> > this newly-added memory can be selected by its unique NUMA
> > node.
> >
> > This is highly RFC, and I really want the feedback from the
> > nvdimm/pmem folks about whether this is a viable long-term
> > perversion of their code and device mode.  It's insufficiently
> > documented and probably not bisectable either.
> >
> > Todo:
> > 1. The device re-binding hacks are ham-fisted at best.  We
> >    need a better way of doing this, especially so the kmem
> >    driver does not get in the way of normal pmem devices.
> > 2. When the device has no proper node, we default it to
> >    NUMA node 0.  Is that OK?
> > 3. We muck with the 'struct resource' code quite a bit. It
> >    definitely needs a once-over from folks more familiar
> >    with it than I.
> > 4. Is there a better way to do this than starting with a
> >    copy of pmem.c?
>
> So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7
> and remove all the devm_memremap_pages() infrastructure and dax_region
> infrastructure.
>
> The driver should be a dead simple turn around to call add_memory()
> for the passed in range. The hard part is, as you say, arranging for
> the kmem driver to not stand in the way of typical range / device
> claims by the dax_pmem device.
>
> To me this looks like teaching the nvdimm-bus and this dax_kmem driver
> to require explicit matching based on 'id'. The attachment scheme
> would look like this:
>
> modprobe dax_kmem
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
>
> At step1 the dax_kmem drivers will match no devices and stays out of
> the way of dax_pmem. It learns about devices it cares about by being
> explicitly told about them. Then unbind from the typical dax_pmem
> driver and attach to dax_kmem to perform the one way hotplug.
>
> I expect udev can automate this by setting up a rule to watch for
> device-dax instances by UUID and call a script to do the detach /
> reattach dance.

The next question is how to support this for ranges that don't
originate from the pmem sub-system. I expect we want dax_kmem to
register a generic platform device representing the range and have a
generic platform driver that turns around and does the add_memory().
Elliott, Robert (Servers) Oct. 23, 2018, 6:12 p.m. UTC | #3
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of Dan Williams
> Sent: Monday, October 22, 2018 8:05 PM
> Subject: Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM
> 
> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
...
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On

Would this memory be considered volatile (with the driver initializing
it to zeros), or persistent (contents are presented unchanged,
applications may guarantee persistence by using cache flush
instructions, fence instructions, and writing to flush hint addresses
per the persistent memory programming model)?

> > 1. The device re-binding hacks are ham-fisted at best.  We
> >    need a better way of doing this, especially so the kmem
> >    driver does not get in the way of normal pmem devices.
...
> To me this looks like teaching the nvdimm-bus and this dax_kmem driver
> to require explicit matching based on 'id'. The attachment scheme
> would look like this:
> 
> modprobe dax_kmem
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
> 
> At step1 the dax_kmem drivers will match no devices and stays out of
> the way of dax_pmem. It learns about devices it cares about by being
> explicitly told about them. Then unbind from the typical dax_pmem
> driver and attach to dax_kmem to perform the one way hotplug.
> 
> I expect udev can automate this by setting up a rule to watch for
> device-dax instances by UUID and call a script to do the detach /
> reattach dance.

Where would that rule be stored? Storing it on another device
is problematic. If that rule is lost, it could confuse other
drivers trying to grab device DAX devices for use as persistent
memory.

A new namespace mode would record the intended usage in the
device itself, eliminating dependencies. It could join the
other modes like:

	ndctl create-namespace -m raw
		create /dev/pmem4 block device
	ndctl create-namespace -m sector
		create /dev/pmem4s block device
	ndctl create-namespace -m fsdax
		create /dev/pmem4 block device
	ndctl create-namespace -m devdax
		create /dev/dax4.3 character device
		for use as persistent memory
	ndctl create-namespace -m mem
		create /dev/mem4.3 character device
		for use as volatile memory

---
Robert Elliott, HPE Persistent Memory
Dave Hansen Oct. 23, 2018, 6:16 p.m. UTC | #4
>> This series adds a new "driver" to which pmem devices can be
>> attached.  Once attached, the memory "owned" by the device is
>> hot-added to the kernel and managed like any other memory.  On
> 
> Would this memory be considered volatile (with the driver initializing
> it to zeros), or persistent (contents are presented unchanged,
> applications may guarantee persistence by using cache flush
> instructions, fence instructions, and writing to flush hint addresses
> per the persistent memory programming model)?

Volatile.

>> I expect udev can automate this by setting up a rule to watch for
>> device-dax instances by UUID and call a script to do the detach /
>> reattach dance.
> 
> Where would that rule be stored? Storing it on another device
> is problematic. If that rule is lost, it could confuse other
> drivers trying to grab device DAX devices for use as persistent
> memory.

Well, we do lots of things like stable device naming from udev scripts.
 We depend on them not being lost.  At least this "fails safe" so we'll
default to persistence instead of defaulting to "eat your data".
Dan Williams Oct. 23, 2018, 6:58 p.m. UTC | #5
On Tue, Oct 23, 2018 at 11:17 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> >> This series adds a new "driver" to which pmem devices can be
> >> attached.  Once attached, the memory "owned" by the device is
> >> hot-added to the kernel and managed like any other memory.  On
> >
> > Would this memory be considered volatile (with the driver initializing
> > it to zeros), or persistent (contents are presented unchanged,
> > applications may guarantee persistence by using cache flush
> > instructions, fence instructions, and writing to flush hint addresses
> > per the persistent memory programming model)?
>
> Volatile.
>
> >> I expect udev can automate this by setting up a rule to watch for
> >> device-dax instances by UUID and call a script to do the detach /
> >> reattach dance.
> >
> > Where would that rule be stored? Storing it on another device
> > is problematic. If that rule is lost, it could confuse other
> > drivers trying to grab device DAX devices for use as persistent
> > memory.
>
> Well, we do lots of things like stable device naming from udev scripts.
>  We depend on them not being lost.  At least this "fails safe" so we'll
> default to persistence instead of defaulting to "eat your data".
>

Right, and at least for the persistent memory to volatile conversion
case we will have the UUID to positively identify the DAX device. So
it will indeed "fail safe" and just become a dax_pmem device again if
the configuration is lost. We'll likely need to create/use a "by-path"
scheme for non-pmem use cases.
Xishi Qiu Oct. 26, 2018, 5:42 a.m. UTC | #6
Hi Dave,

This patchset hot-adds a pmem device and uses it like normal DRAM. I
have some questions here, and I think they also matter for my
production systems.

1) How do we set the AEP (Apache Pass) usage percentage for one
process (or a VMA)?
e.g. there are two VMs from two customers who pay different amounts
for their VMs. If we allocate and convert AEP/DRAM globally, the
high-load VM may get 100% DRAM and the low-load VM may get 100% AEP,
which is unfair. "Low load" here is only relative to the other VM;
in absolute terms that VM's load may still be high.

2) I find that page idle tracking only checks the accessed bit,
_PAGE_BIT_ACCESSED. Since AEP read performance is much higher than
write performance, I think we should also check the dirty bit,
_PAGE_BIT_DIRTY. Testing and clearing the dirty bit is safe for anon
pages but unsafe for file pages, e.g. we should call
clear_page_dirty_for_io() first, right?

3) I think we should manage the AEP memory separately instead of
together with the DRAM. Managing them together may require fewer
code changes, but it causes problems for high-priority DRAM
allocations: if no DRAM is left, we have to convert (steal) DRAM
from someone else, which takes a long time.
How about creating a new zone, e.g. ZONE_AEP, and using madvise to
set a new flag VM_AEP, which would let the VMA allocate AEP memory
at page-fault time? Then something like vma_rss_stat (analogous to
mm_rss_stat) could control the AEP usage percentage for a VMA.

4) I am interested in the conversion mechanism between AEP and DRAM.
I think NUMA balancing will cause page faults, which is unacceptable
for some apps because of the performance jitter, and kswapd is not
precise enough. So a kernel daemon thread (like khugepaged) may be a
good solution: add the processes using AEP to a list, then scan the
VM_AEP-marked VMAs, collect the access state, and do the conversion.

Thanks,
Xishi Qiu
On 2018/10/23 04:13, Dave Hansen wrote:
> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!
> 
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On
> systems with an HMAT (a new ACPI table), each socket (roughly)
> will have a separate NUMA node for its persistent memory so
> this newly-added memory can be selected by its unique NUMA
> node.
> 
> This is highly RFC, and I really want the feedback from the
> nvdimm/pmem folks about whether this is a viable long-term
> perversion of their code and device mode.  It's insufficiently
> documented and probably not bisectable either.
> 
> Todo:
> 1. The device re-binding hacks are ham-fisted at best.  We
>    need a better way of doing this, especially so the kmem
>    driver does not get in the way of normal pmem devices.
> 2. When the device has no proper node, we default it to
>    NUMA node 0.  Is that OK?
> 3. We muck with the 'struct resource' code quite a bit. It
>    definitely needs a once-over from folks more familiar
>    with it than I.
> 4. Is there a better way to do this than starting with a
>    copy of pmem.c?
> 
> Here's how I set up a system to test this thing:
> 
> 1. Boot qemu with lots of memory: "-m 4096", for instance
> 2. Reserve 512MB of physical memory.  Reserving a spot a 2GB
>    physical seems to work: memmap=512M!0x0000000080000000
>    This will end up looking like a pmem device at boot.
> 3. When booted, convert fsdax device to "device dax":
> 	ndctl create-namespace -fe namespace0.0 -m dax
> 4. In the background, the kmem driver will probably bind to the
>    new device.
> 5. Now, online the new memory sections.  Perhaps:
> 
> grep ^MemTotal /proc/meminfo
> for f in `grep -vl online /sys/devices/system/memory/*/state`; do
> 	echo $f: `cat $f`
> 	echo online > $f
> 	grep ^MemTotal /proc/meminfo
> done
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
>
Xishi Qiu Oct. 26, 2018, 8:03 a.m. UTC | #7
Hi Dan,

How about letting the BIOS report a new type for kmem in the e820 table?
e.g.
#define E820_PMEM	7
#define E820_KMEM	8

Then pmem and kmem would be separate, and we could easily hot-add
kmem to the memory subsystem without disturbing the existing code
(e.g. pmem, nvdimm, dax...).

I don't know whether Intel will change some hardware features in the
future for pmem that is used as volatile memory. Perhaps it could be
faster than pmem and cheaper, but volatile, with no need to care
about atomicity, consistency, or the L2/L3 cache...

Another question: why call it kmem? What does the "k" mean?

Thanks,
Xishi Qiu
On 2018/10/23 09:11, Dan Williams wrote:
> On Mon, Oct 22, 2018 at 6:05 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>>
>>> Persistent memory is cool.  But, currently, you have to rewrite
>>> your applications to use it.  Wouldn't it be cool if you could
>>> just have it show up in your system like normal RAM and get to
>>> it like a slow blob of memory?  Well... have I got the patch
>>> series for you!
>>>
>>> This series adds a new "driver" to which pmem devices can be
>>> attached.  Once attached, the memory "owned" by the device is
>>> hot-added to the kernel and managed like any other memory.  On
>>> systems with an HMAT (a new ACPI table), each socket (roughly)
>>> will have a separate NUMA node for its persistent memory so
>>> this newly-added memory can be selected by its unique NUMA
>>> node.
>>>
>>> This is highly RFC, and I really want the feedback from the
>>> nvdimm/pmem folks about whether this is a viable long-term
>>> perversion of their code and device mode.  It's insufficiently
>>> documented and probably not bisectable either.
>>>
>>> Todo:
>>> 1. The device re-binding hacks are ham-fisted at best.  We
>>>    need a better way of doing this, especially so the kmem
>>>    driver does not get in the way of normal pmem devices.
>>> 2. When the device has no proper node, we default it to
>>>    NUMA node 0.  Is that OK?
>>> 3. We muck with the 'struct resource' code quite a bit. It
>>>    definitely needs a once-over from folks more familiar
>>>    with it than I.
>>> 4. Is there a better way to do this than starting with a
>>>    copy of pmem.c?
>>
>> So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7
>> and remove all the devm_memremap_pages() infrastructure and dax_region
>> infrastructure.
>>
>> The driver should be a dead simple turn around to call add_memory()
>> for the passed in range. The hard part is, as you say, arranging for
>> the kmem driver to not stand in the way of typical range / device
>> claims by the dax_pmem device.
>>
>> To me this looks like teaching the nvdimm-bus and this dax_kmem driver
>> to require explicit matching based on 'id'. The attachment scheme
>> would look like this:
>>
>> modprobe dax_kmem
>> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
>> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
>> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
>>
>> At step1 the dax_kmem drivers will match no devices and stays out of
>> the way of dax_pmem. It learns about devices it cares about by being
>> explicitly told about them. Then unbind from the typical dax_pmem
>> driver and attach to dax_kmem to perform the one way hotplug.
>>
>> I expect udev can automate this by setting up a rule to watch for
>> device-dax instances by UUID and call a script to do the detach /
>> reattach dance.
> 
> The next question is how to support this for ranges that don't
> originate from the pmem sub-system. I expect we want dax_kmem to
> register a generic platform device representing the range and have a
> generic platofrm driver that turns around and does the add_memory().
>
Fengguang Wu Oct. 26, 2018, 9:03 a.m. UTC | #8
Hi Xishi,

I can help answer the migration and policy related questions.

On Fri, Oct 26, 2018 at 01:42:43PM +0800, Xishi Qiu wrote:
>Hi Dave,
>
>This patchset hotadd a pmem and use it like a normal DRAM, I
>have some questions here, and I think my production line may
>also concerned.
>
>1) How to set the AEP (Apache Pass) usage percentage for one
>process (or a vma)?
>e.g. there are two vms from two customers, they pay different
>money for the vm. So if we alloc and convert AEP/DRAM by global,
>the high load vm may get 100% DRAM, and the low load vm may get
>100% AEP, this is unfair. The low load is compared to another
>one, for himself, the actual low load maybe is high load.

Per-VM, per-process, and per-VMA policies are possible. They can be
implemented in a user-space migration daemon. We can dig into the
details when the user-space code is released.

>2) I find page idle only check the access bit, _PAGE_BIT_ACCESSED,
>as we know AEP read performance is much higher than write, so I
>think we should also check the dirty bit, _PAGE_BIT_DIRTY. Test

Yeah, the dirty bit could be considered later. The initial version
will only check the accessed bit.

>and clear dirty bit is safe for anon page, but unsafe for file
>page, e.g. should call clear_page_dirty_for_io first, right?

We'll only migrate anonymous pages in the initial version.

>3) I think we should manage the AEP memory separately instead
>of together with the DRAM.

I guess the intention of this patchset is to use different
NUMA nodes for AEP and DRAM.

>Manage them together maybe change less
>code, but it will cause some problems at high priority DRAM
>allocation if there is no DRAM, then should convert (steal DRAM)
>from another one, it takes much time.
>How about create a new zone, e.g. ZONE_AEP, and use madvise
>to set a new flag VM_AEP, which will enable the vma to alloc AEP
>memory in page fault later, then use vma_rss_stat(like mm_rss_stat)
>to control the AEP usage percentage for a vma.
>
>4) I am interesting about the conversion mechanism betweent AEP
>and DRAM. I think numa balancing will cause page fault, this is
>unacceptable for some apps, it cause performance jitter. And the

NUMA balancing can be taught to be enabled per task. I'm not sure
there is already such a knob, but it looks easy to implement such a
policy.
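
For reference, only a global knob exists today; a per-task policy
would presumably sit on top of it:

	cat /proc/sys/kernel/numa_balancing        # 1 = automatic NUMA balancing enabled
	echo 0 > /proc/sys/kernel/numa_balancing   # disable it system-wide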

>kswapd is not precise enough. So use a daemon kernel thread
>(like khugepaged) maybe a good solution, add the AEP used processes
>to a list, then scan the VM_AEP marked vmas, get the access state,
>and do the conversion.

If that's a desirable policy, our user space migration daemon could
possibly do that, too.

Thanks,
Fengguang

>On 2018/10/23 04:13, Dave Hansen wrote:
>> Persistent memory is cool.  But, currently, you have to rewrite
>> your applications to use it.  Wouldn't it be cool if you could
>> just have it show up in your system like normal RAM and get to
>> it like a slow blob of memory?  Well... have I got the patch
>> series for you!
>>
>> This series adds a new "driver" to which pmem devices can be
>> attached.  Once attached, the memory "owned" by the device is
>> hot-added to the kernel and managed like any other memory.  On
>> systems with an HMAT (a new ACPI table), each socket (roughly)
>> will have a separate NUMA node for its persistent memory so
>> this newly-added memory can be selected by its unique NUMA
>> node.
>>
>> This is highly RFC, and I really want the feedback from the
>> nvdimm/pmem folks about whether this is a viable long-term
>> perversion of their code and device mode.  It's insufficiently
>> documented and probably not bisectable either.
>>
>> Todo:
>> 1. The device re-binding hacks are ham-fisted at best.  We
>>    need a better way of doing this, especially so the kmem
>>    driver does not get in the way of normal pmem devices.
>> 2. When the device has no proper node, we default it to
>>    NUMA node 0.  Is that OK?
>> 3. We muck with the 'struct resource' code quite a bit. It
>>    definitely needs a once-over from folks more familiar
>>    with it than I.
>> 4. Is there a better way to do this than starting with a
>>    copy of pmem.c?
>>
>> Here's how I set up a system to test this thing:
>>
>> 1. Boot qemu with lots of memory: "-m 4096", for instance
>> 2. Reserve 512MB of physical memory.  Reserving a spot a 2GB
>>    physical seems to work: memmap=512M!0x0000000080000000
>>    This will end up looking like a pmem device at boot.
>> 3. When booted, convert fsdax device to "device dax":
>> 	ndctl create-namespace -fe namespace0.0 -m dax
>> 4. In the background, the kmem driver will probably bind to the
>>    new device.
>> 5. Now, online the new memory sections.  Perhaps:
>>
>> grep ^MemTotal /proc/meminfo
>> for f in `grep -vl online /sys/devices/system/memory/*/state`; do
>> 	echo $f: `cat $f`
>> 	echo online > $f
>> 	grep ^MemTotal /proc/meminfo
>> done
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Ross Zwisler <zwisler@kernel.org>
>> Cc: Vishal Verma <vishal.l.verma@intel.com>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: linux-nvdimm@lists.01.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>>
>
Dave Hansen Oct. 26, 2018, 1:58 p.m. UTC | #9
On 10/26/18 1:03 AM, Xishi Qiu wrote:
> How about let the BIOS report a new type for kmem in e820 table?
> e.g.
> #define E820_PMEM	7
> #define E820_KMEM	8

It would be best if the BIOS just did this all for us.  But, what you're
describing would take years to get from concept to showing up in
someone's hands.  I'd rather not wait.

Plus, doing it the way I suggested gives the OS the most control.  The
BIOS isn't in the critical path to do the right thing.
Dan Williams Oct. 27, 2018, 4:45 a.m. UTC | #10
On Mon, Oct 22, 2018 at 6:11 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Oct 22, 2018 at 6:05 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > >
> > > Persistent memory is cool.  But, currently, you have to rewrite
> > > your applications to use it.  Wouldn't it be cool if you could
> > > just have it show up in your system like normal RAM and get to
> > > it like a slow blob of memory?  Well... have I got the patch
> > > series for you!
> > >
> > > This series adds a new "driver" to which pmem devices can be
> > > attached.  Once attached, the memory "owned" by the device is
> > > hot-added to the kernel and managed like any other memory.  On
> > > systems with an HMAT (a new ACPI table), each socket (roughly)
> > > will have a separate NUMA node for its persistent memory so
> > > this newly-added memory can be selected by its unique NUMA
> > > node.
> > >
> > > This is highly RFC, and I really want the feedback from the
> > > nvdimm/pmem folks about whether this is a viable long-term
> > > perversion of their code and device mode.  It's insufficiently
> > > documented and probably not bisectable either.
> > >
> > > Todo:
> > > 1. The device re-binding hacks are ham-fisted at best.  We
> > >    need a better way of doing this, especially so the kmem
> > >    driver does not get in the way of normal pmem devices.
> > > 2. When the device has no proper node, we default it to
> > >    NUMA node 0.  Is that OK?
> > > 3. We muck with the 'struct resource' code quite a bit. It
> > >    definitely needs a once-over from folks more familiar
> > >    with it than I.
> > > 4. Is there a better way to do this than starting with a
> > >    copy of pmem.c?
> >
> > So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7
> > and remove all the devm_memremap_pages() infrastructure and dax_region
> > infrastructure.
> >
> > The driver should be a dead simple turn around to call add_memory()
> > for the passed in range. The hard part is, as you say, arranging for
> > the kmem driver to not stand in the way of typical range / device
> > claims by the dax_pmem device.
> >
> > To me this looks like teaching the nvdimm-bus and this dax_kmem driver
> > to require explicit matching based on 'id'. The attachment scheme
> > would look like this:
> >
> > modprobe dax_kmem
> > echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> > echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> > echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
> >
> > At step1 the dax_kmem drivers will match no devices and stays out of
> > the way of dax_pmem. It learns about devices it cares about by being
> > explicitly told about them. Then unbind from the typical dax_pmem
> > driver and attach to dax_kmem to perform the one way hotplug.
> >
> > I expect udev can automate this by setting up a rule to watch for
> > device-dax instances by UUID and call a script to do the detach /
> > reattach dance.
>
> The next question is how to support this for ranges that don't
> originate from the pmem sub-system. I expect we want dax_kmem to
> register a generic platform device representing the range and have a
> generic platofrm driver that turns around and does the add_memory().

I forgot I have some old patches that do something along these lines
and make device-dax its own bus. I'll dust those off so we can
discern what's left.
Fengguang Wu Oct. 27, 2018, 11 a.m. UTC | #11
Hi Dave,

What's the base tree for this patchset? I tried 4.19, linux-next and
Dan's libnvdimm-for-next branch, but none applies cleanly.

Thanks,
Fengguang
Yang Shi Oct. 31, 2018, 5:11 a.m. UTC | #12
On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!
>
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On
> systems with an HMAT (a new ACPI table), each socket (roughly)
> will have a separate NUMA node for its persistent memory so
> this newly-added memory can be selected by its unique NUMA
> node.

Could you please elaborate on this? I suppose you mean the pmem will
be a separate NUMA node, right?

I would like to try the patches on real hardware; are there any
prerequisites?

Thanks,
Yang

>
> This is highly RFC, and I really want the feedback from the
> nvdimm/pmem folks about whether this is a viable long-term
> perversion of their code and device mode.  It's insufficiently
> documented and probably not bisectable either.
>
> Todo:
> 1. The device re-binding hacks are ham-fisted at best.  We
>    need a better way of doing this, especially so the kmem
>    driver does not get in the way of normal pmem devices.
> 2. When the device has no proper node, we default it to
>    NUMA node 0.  Is that OK?
> 3. We muck with the 'struct resource' code quite a bit. It
>    definitely needs a once-over from folks more familiar
>    with it than I.
> 4. Is there a better way to do this than starting with a
>    copy of pmem.c?
>
> Here's how I set up a system to test this thing:
>
> 1. Boot qemu with lots of memory: "-m 4096", for instance
> 2. Reserve 512MB of physical memory.  Reserving a spot a 2GB
>    physical seems to work: memmap=512M!0x0000000080000000
>    This will end up looking like a pmem device at boot.
> 3. When booted, convert fsdax device to "device dax":
>         ndctl create-namespace -fe namespace0.0 -m dax
> 4. In the background, the kmem driver will probably bind to the
>    new device.
> 5. Now, online the new memory sections.  Perhaps:
>
> grep ^MemTotal /proc/meminfo
> for f in `grep -vl online /sys/devices/system/memory/*/state`; do
>         echo $f: `cat $f`
>         echo online > $f
>         grep ^MemTotal /proc/meminfo
> done
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
>
Brice Goglin Dec. 3, 2018, 9:22 a.m. UTC | #13
On 22/10/2018 at 22:13, Dave Hansen wrote:
> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!
>
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On
> systems with an HMAT (a new ACPI table), each socket (roughly)
> will have a separate NUMA node for its persistent memory so
> this newly-added memory can be selected by its unique NUMA
> node.


Hello Dave

What happens on systems without an HMAT? Does this new memory get merged
into existing NUMA nodes?

Also, do you plan to have a way for applications to find out which NUMA
nodes are "real DRAM" while others are "pmem-backed"? (something like a
new attribute in /sys/devices/system/node/nodeX/) Or should we use HMAT
performance attributes for this?

Brice
Dave Hansen Dec. 3, 2018, 4:56 p.m. UTC | #14
On 12/3/18 1:22 AM, Brice Goglin wrote:
> On 22/10/2018 at 22:13, Dave Hansen wrote:
> What happens on systems without an HMAT? Does this new memory get merged
> into existing NUMA nodes?

It gets merged into the persistent memory device's node, as told by the
firmware.  Intel's persistent memory should always be in its own node,
separate from DRAM.

> Also, do you plan to have a way for applications to find out which NUMA
> nodes are "real DRAM" while others are "pmem-backed"? (something like a
> new attribute in /sys/devices/system/node/nodeX/) Or should we use HMAT
> performance attributes for this?

The best way is to use the sysfs-generic interfaces to the HMAT that
Keith Busch is pushing.  In the end, we really think folks will only
care about the memory's performance properties rather than whether it's
*actually* persistent memory or not.
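
As a sketch of how that could look from user space (these attribute
paths follow the proposal under review at the time and may change
before anything lands):

	# per-node performance as seen from the local initiators
	cat /sys/devices/system/node/node2/access0/initiators/read_bandwidth
	cat /sys/devices/system/node/node2/access0/initiators/read_latency
	cat /sys/devices/system/node/node2/access0/initiators/write_bandwidth
	cat /sys/devices/system/node/node2/access0/initiators/write_latency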
Dan Williams Dec. 3, 2018, 5:16 p.m. UTC | #15
On Mon, Dec 3, 2018 at 8:56 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 12/3/18 1:22 AM, Brice Goglin wrote:
> > On 22/10/2018 at 22:13, Dave Hansen wrote:
> > What happens on systems without an HMAT? Does this new memory get merged
> > into existing NUMA nodes?
>
> It gets merged into the persistent memory device's node, as told by the
> firmware.  Intel's persistent memory should always be in its own node,
> separate from DRAM.
>
> > Also, do you plan to have a way for applications to find out which NUMA
> > nodes are "real DRAM" while others are "pmem-backed"? (something like a
> > new attribute in /sys/devices/system/node/nodeX/) Or should we use HMAT
> > performance attributes for this?
>
> The best way is to use the sysfs-generic interfaces to the HMAT that
> Keith Busch is pushing.  In the end, we really think folks will only
> care about the memory's performance properties rather than whether it's
> *actually* persistent memory or not.

It's also important to point out that "persistent memory" by itself is
an ambiguous memory type. It's anything from new media with distinct
performance characteristics to battery-backed DRAM. I.e. the
performance of "persistent memory" may be indistinguishable from "real
DRAM".