Message ID: 20181022201317.8558C1D8@viggo.jf.intel.com (mailing list archive)
Series: Allow persistent memory to be used like normal RAM
On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
>
> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.
>
> This is highly RFC, and I really want the feedback from the nvdimm/pmem folks about whether this is a viable long-term perversion of their code and device mode. It's insufficiently documented and probably not bisectable either.
>
> Todo:
> 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
> 2. When the device has no proper node, we default it to NUMA node 0. Is that OK?
> 3. We muck with the 'struct resource' code quite a bit. It definitely needs a once-over from folks more familiar with it than I.
> 4. Is there a better way to do this than starting with a copy of pmem.c?

So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7 and remove all the devm_memremap_pages() infrastructure and dax_region infrastructure.

The driver should be a dead simple turnaround to call add_memory() for the passed-in range. The hard part is, as you say, arranging for the kmem driver to not stand in the way of typical range / device claims by the dax_pmem device.

To me this looks like teaching the nvdimm-bus and this dax_kmem driver to require explicit matching based on 'id'. The attachment scheme would look like this:

modprobe dax_kmem
echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind

At step 1 the dax_kmem driver will match no devices and stay out of the way of dax_pmem. It learns about devices it cares about by being explicitly told about them. Then unbind from the typical dax_pmem driver and attach to dax_kmem to perform the one-way hotplug.

I expect udev can automate this by setting up a rule to watch for device-dax instances by UUID and call a script to do the detach / reattach dance.
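The detach / reattach dance lends itself to a small helper plus a udev rule. A minimal sketch under stated assumptions: the rule below matches every device-dax instance rather than a specific UUID (a real setup would additionally filter on the namespace UUID, for example by having the helper query ndctl), and the subsystem name, file locations, and script name are illustrative only; the sysfs paths are the ones proposed above.

cat > /etc/udev/rules.d/99-dax-kmem.rules <<'EOF'
# Hand new device-dax instances (e.g. dax0.0) to a helper script
ACTION=="add", SUBSYSTEM=="dax", KERNEL=="dax*", RUN+="/usr/local/sbin/dax-to-kmem %k"
EOF

cat > /usr/local/sbin/dax-to-kmem <<'EOF'
#!/bin/sh
# One-way handoff of $1 (a daxX.Y instance) from dax_pmem to dax_kmem,
# following the new_id/unbind/bind sequence proposed in this thread.
dev="$1"
modprobe dax_kmem
echo "$dev" > /sys/bus/nd/drivers/dax_kmem/new_id
echo "$dev" > /sys/bus/nd/drivers/dax_pmem/unbind
echo "$dev" > /sys/bus/nd/drivers/dax_kmem/bind
EOF
chmod +x /usr/local/sbin/dax-to-kmem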
On Mon, Oct 22, 2018 at 6:05 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >
> > Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
> >
> > This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.
> >
> > This is highly RFC, and I really want the feedback from the nvdimm/pmem folks about whether this is a viable long-term perversion of their code and device mode. It's insufficiently documented and probably not bisectable either.
> >
> > Todo:
> > 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
> > 2. When the device has no proper node, we default it to NUMA node 0. Is that OK?
> > 3. We muck with the 'struct resource' code quite a bit. It definitely needs a once-over from folks more familiar with it than I.
> > 4. Is there a better way to do this than starting with a copy of pmem.c?
>
> So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7 and remove all the devm_memremap_pages() infrastructure and dax_region infrastructure.
>
> The driver should be a dead simple turnaround to call add_memory() for the passed-in range. The hard part is, as you say, arranging for the kmem driver to not stand in the way of typical range / device claims by the dax_pmem device.
>
> To me this looks like teaching the nvdimm-bus and this dax_kmem driver to require explicit matching based on 'id'. The attachment scheme would look like this:
>
> modprobe dax_kmem
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
>
> At step 1 the dax_kmem driver will match no devices and stay out of the way of dax_pmem. It learns about devices it cares about by being explicitly told about them. Then unbind from the typical dax_pmem driver and attach to dax_kmem to perform the one-way hotplug.
>
> I expect udev can automate this by setting up a rule to watch for device-dax instances by UUID and call a script to do the detach / reattach dance.

The next question is how to support this for ranges that don't originate from the pmem sub-system. I expect we want dax_kmem to register a generic platform device representing the range and have a generic platform driver that turns around and does the add_memory().
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of Dan Williams
> Sent: Monday, October 22, 2018 8:05 PM
> Subject: Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM
>
> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
...
> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory.

Would this memory be considered volatile (with the driver initializing it to zeros), or persistent (contents are presented unchanged; applications may guarantee persistence by using cache flush instructions, fence instructions, and writing to flush hint addresses per the persistent memory programming model)?

> > 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
...
> To me this looks like teaching the nvdimm-bus and this dax_kmem driver to require explicit matching based on 'id'. The attachment scheme would look like this:
>
> modprobe dax_kmem
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
>
> At step 1 the dax_kmem driver will match no devices and stay out of the way of dax_pmem. It learns about devices it cares about by being explicitly told about them. Then unbind from the typical dax_pmem driver and attach to dax_kmem to perform the one-way hotplug.
>
> I expect udev can automate this by setting up a rule to watch for device-dax instances by UUID and call a script to do the detach / reattach dance.

Where would that rule be stored? Storing it on another device is problematic. If that rule is lost, it could confuse other drivers trying to grab device DAX devices for use as persistent memory.

A new namespace mode would record the intended usage in the device itself, eliminating dependencies. It could join the other modes like:

ndctl create-namespace -m raw      creates a /dev/pmem4 block device
ndctl create-namespace -m sector   creates a /dev/pmem4s block device
ndctl create-namespace -m fsdax    creates a /dev/pmem4 block device
ndctl create-namespace -m devdax   creates a /dev/dax4.3 character device for use as persistent memory
ndctl create-namespace -m mem      creates a /dev/mem4.3 character device for use as volatile memory

---
Robert Elliott, HPE Persistent Memory
>> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory.
>
> Would this memory be considered volatile (with the driver initializing it to zeros), or persistent (contents are presented unchanged; applications may guarantee persistence by using cache flush instructions, fence instructions, and writing to flush hint addresses per the persistent memory programming model)?

Volatile.

>> I expect udev can automate this by setting up a rule to watch for device-dax instances by UUID and call a script to do the detach / reattach dance.
>
> Where would that rule be stored? Storing it on another device is problematic. If that rule is lost, it could confuse other drivers trying to grab device DAX devices for use as persistent memory.

Well, we do lots of things like stable device naming from udev scripts. We depend on them not being lost. At least this "fails safe", so we'll default to persistence instead of defaulting to "eat your data".
On Tue, Oct 23, 2018 at 11:17 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> >> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory.
> >
> > Would this memory be considered volatile (with the driver initializing it to zeros), or persistent (contents are presented unchanged; applications may guarantee persistence by using cache flush instructions, fence instructions, and writing to flush hint addresses per the persistent memory programming model)?
>
> Volatile.
>
> >> I expect udev can automate this by setting up a rule to watch for device-dax instances by UUID and call a script to do the detach / reattach dance.
> >
> > Where would that rule be stored? Storing it on another device is problematic. If that rule is lost, it could confuse other drivers trying to grab device DAX devices for use as persistent memory.
>
> Well, we do lots of things like stable device naming from udev scripts. We depend on them not being lost. At least this "fails safe", so we'll default to persistence instead of defaulting to "eat your data".

Right, and at least for the persistent memory to volatile conversion case we will have the UUID to positively identify the DAX device. So it will indeed "fail safe" and just become a dax_pmem device again if the configuration is lost.

We'll likely need to create/use a "by-path" scheme for non-pmem use cases.
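For the non-pmem case, one hypothetical shape for such a "by-path" scheme is a udev rule that publishes a stable symlink; whether udev's path_id builtin yields a usable ID_PATH for device-dax instances is an assumption, not something established in this thread.

cat > /etc/udev/rules.d/98-dax-by-path.rules <<'EOF'
# Create stable /dev/dax/by-path/... symlinks for device-dax character devices
ACTION=="add", SUBSYSTEM=="dax", KERNEL=="dax*", IMPORT{builtin}="path_id", \
        ENV{ID_PATH}=="?*", SYMLINK+="dax/by-path/$env{ID_PATH}"
EOF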
Hi Dave,

This patchset hot-adds a pmem device and uses it like normal DRAM. I have some questions here, and I think our production environment may be concerned with them as well.

1) How do we set the AEP (Apache Pass) usage percentage for one process (or a VMA)? e.g. there are two VMs from two customers who pay different amounts for them. If we allocate and convert AEP/DRAM globally, the high-load VM may get 100% DRAM and the low-load VM may get 100% AEP, which is unfair. The "low load" is only low relative to the other VM; by its own standard it may actually be a high load.

2) I find that idle page tracking only checks the accessed bit, _PAGE_BIT_ACCESSED. Since AEP read performance is much higher than write performance, I think we should also check the dirty bit, _PAGE_BIT_DIRTY. Testing and clearing the dirty bit is safe for anonymous pages but unsafe for file pages, e.g. clear_page_dirty_for_io() should be called first, right?

3) I think we should manage the AEP memory separately instead of together with the DRAM. Managing them together may require fewer code changes, but it causes problems for high-priority DRAM allocations: if there is no DRAM left, we have to convert (steal DRAM) from someone else, which takes much time. How about creating a new zone, e.g. ZONE_AEP, and using madvise() to set a new flag, VM_AEP, which would let the VMA allocate AEP memory on later page faults, and then using a vma_rss_stat (like mm_rss_stat) to control the AEP usage percentage for a VMA?

4) I am interested in the conversion mechanism between AEP and DRAM. NUMA balancing causes page faults, which is unacceptable for some applications because of the performance jitter, and kswapd is not precise enough. So a kernel daemon thread (like khugepaged) may be a good solution: add the AEP-using processes to a list, then scan the VM_AEP-marked VMAs, get the access state, and do the conversion.

Thanks,
Xishi Qiu

On 2018/10/23 04:13, Dave Hansen wrote:
> Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
>
> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.
>
> This is highly RFC, and I really want the feedback from the nvdimm/pmem folks about whether this is a viable long-term perversion of their code and device mode. It's insufficiently documented and probably not bisectable either.
>
> Todo:
> 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
> 2. When the device has no proper node, we default it to NUMA node 0. Is that OK?
> 3. We muck with the 'struct resource' code quite a bit. It definitely needs a once-over from folks more familiar with it than I.
> 4. Is there a better way to do this than starting with a copy of pmem.c?
>
> Here's how I set up a system to test this thing:
>
> 1. Boot qemu with lots of memory: "-m 4096", for instance
> 2. Reserve 512MB of physical memory. Reserving a spot at 2GB physical seems to work: memmap=512M!0x0000000080000000
>    This will end up looking like a pmem device at boot.
> 3. When booted, convert the fsdax device to "device dax":
>    ndctl create-namespace -fe namespace0.0 -m dax
> 4. In the background, the kmem driver will probably bind to the new device.
> 5. Now, online the new memory sections. Perhaps:
>
> grep ^MemTotal /proc/meminfo
> for f in `grep -vl online /sys/devices/system/memory/*/state`; do
>         echo $f: `cat $f`
>         echo online > $f
>         grep ^MemTotal /proc/meminfo
> done
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
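The test recipe quoted above, collected into one runnable sketch; the memory size, memmap reservation, and namespace name come from the cover letter, and "-m dax" is the older ndctl spelling of what later became "-m devdax".

# Kernel command line: reserve 512MB at 2GB physical as a legacy pmem range
#   memmap=512M!0x0000000080000000

# Convert the resulting fsdax namespace to device-dax
ndctl create-namespace -fe namespace0.0 -m dax

# Once the kmem driver has bound, online the newly added memory sections
grep ^MemTotal /proc/meminfo
for f in $(grep -vl online /sys/devices/system/memory/*/state); do
        echo "$f: $(cat $f)"
        echo online > "$f"
done
grep ^MemTotal /proc/meminfo

# Per-node view, to confirm where the new memory landed
grep MemTotal /sys/devices/system/node/node*/meminfo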
Hi Dan,

How about letting the BIOS report a new type for kmem in the e820 table? e.g.

#define E820_PMEM 7
#define E820_KMEM 8

Then pmem and kmem would be separate, and we could easily hot-add the kmem to the memory subsystem without disturbing the existing code (e.g. pmem, nvdimm, dax...).

I don't know whether Intel will change some hardware features in the future for pmem that is used as volatile memory: perhaps faster than pmem and cheaper, but volatile, with no need to care about atomicity, consistency, or the L2/L3 cache...

Another question: why call it kmem? What does the "k" mean?

Thanks,
Xishi Qiu

On 2018/10/23 09:11, Dan Williams wrote:
> On Mon, Oct 22, 2018 at 6:05 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>>
>>> Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
>>>
>>> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.
>>>
>>> This is highly RFC, and I really want the feedback from the nvdimm/pmem folks about whether this is a viable long-term perversion of their code and device mode. It's insufficiently documented and probably not bisectable either.
>>>
>>> Todo:
>>> 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
>>> 2. When the device has no proper node, we default it to NUMA node 0. Is that OK?
>>> 3. We muck with the 'struct resource' code quite a bit. It definitely needs a once-over from folks more familiar with it than I.
>>> 4. Is there a better way to do this than starting with a copy of pmem.c?
>>
>> So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7 and remove all the devm_memremap_pages() infrastructure and dax_region infrastructure.
>>
>> The driver should be a dead simple turnaround to call add_memory() for the passed-in range. The hard part is, as you say, arranging for the kmem driver to not stand in the way of typical range / device claims by the dax_pmem device.
>>
>> To me this looks like teaching the nvdimm-bus and this dax_kmem driver to require explicit matching based on 'id'. The attachment scheme would look like this:
>>
>> modprobe dax_kmem
>> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
>> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
>> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
>>
>> At step 1 the dax_kmem driver will match no devices and stay out of the way of dax_pmem. It learns about devices it cares about by being explicitly told about them. Then unbind from the typical dax_pmem driver and attach to dax_kmem to perform the one-way hotplug.
>>
>> I expect udev can automate this by setting up a rule to watch for device-dax instances by UUID and call a script to do the detach / reattach dance.
>
> The next question is how to support this for ranges that don't originate from the pmem sub-system. I expect we want dax_kmem to register a generic platform device representing the range and have a generic platform driver that turns around and does the add_memory().
Hi Xishi,

I can help answer the migration and policy related questions.

On Fri, Oct 26, 2018 at 01:42:43PM +0800, Xishi Qiu wrote:
> Hi Dave,
>
> This patchset hot-adds a pmem device and uses it like normal DRAM. I have some questions here, and I think our production environment may be concerned with them as well.
>
> 1) How do we set the AEP (Apache Pass) usage percentage for one process (or a VMA)? e.g. there are two VMs from two customers who pay different amounts for them. If we allocate and convert AEP/DRAM globally, the high-load VM may get 100% DRAM and the low-load VM may get 100% AEP, which is unfair. The "low load" is only low relative to the other VM; by its own standard it may actually be a high load.

Per-VM, per-process, and per-VMA policies are possible. They can be implemented in the user-space migration daemon. We can dig into the details when the user-space code is released.

> 2) I find that idle page tracking only checks the accessed bit, _PAGE_BIT_ACCESSED. Since AEP read performance is much higher than write performance, I think we should also check the dirty bit, _PAGE_BIT_DIRTY.

Yeah, the dirty bit could be considered later. The initial version will only check the accessed bit.

> Testing and clearing the dirty bit is safe for anonymous pages but unsafe for file pages, e.g. clear_page_dirty_for_io() should be called first, right?

We'll only migrate anonymous pages in the initial version.

> 3) I think we should manage the AEP memory separately instead of together with the DRAM.

I guess the intention of this patchset is to use different NUMA nodes for AEP and DRAM.

> Managing them together may require fewer code changes, but it causes problems for high-priority DRAM allocations: if there is no DRAM left, we have to convert (steal DRAM) from someone else, which takes much time. How about creating a new zone, e.g. ZONE_AEP, and using madvise() to set a new flag, VM_AEP, which would let the VMA allocate AEP memory on later page faults, and then using a vma_rss_stat (like mm_rss_stat) to control the AEP usage percentage for a VMA?
>
> 4) I am interested in the conversion mechanism between AEP and DRAM. NUMA balancing causes page faults, which is unacceptable for some applications because of the performance jitter.

NUMA balancing can be taught to be enabled per task. I'm not sure there is already such a knob, but it looks easy to implement such a policy.

> And kswapd is not precise enough. So a kernel daemon thread (like khugepaged) may be a good solution: add the AEP-using processes to a list, then scan the VM_AEP-marked VMAs, get the access state, and do the conversion.

If that's a desirable policy, our user-space migration daemon could possibly do that, too.

Thanks,
Fengguang

> On 2018/10/23 04:13, Dave Hansen wrote:
>> Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
>>
>> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.
>>
>> This is highly RFC, and I really want the feedback from the nvdimm/pmem folks about whether this is a viable long-term perversion of their code and device mode. It's insufficiently documented and probably not bisectable either.
>>
>> Todo:
>> 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
>> 2. When the device has no proper node, we default it to NUMA node 0. Is that OK?
>> 3. We muck with the 'struct resource' code quite a bit. It definitely needs a once-over from folks more familiar with it than I.
>> 4. Is there a better way to do this than starting with a copy of pmem.c?
>>
>> Here's how I set up a system to test this thing:
>>
>> 1. Boot qemu with lots of memory: "-m 4096", for instance
>> 2. Reserve 512MB of physical memory. Reserving a spot at 2GB physical seems to work: memmap=512M!0x0000000080000000
>>    This will end up looking like a pmem device at boot.
>> 3. When booted, convert the fsdax device to "device dax":
>>    ndctl create-namespace -fe namespace0.0 -m dax
>> 4. In the background, the kmem driver will probably bind to the new device.
>> 5. Now, online the new memory sections. Perhaps:
>>
>> grep ^MemTotal /proc/meminfo
>> for f in `grep -vl online /sys/devices/system/memory/*/state`; do
>>         echo $f: `cat $f`
>>         echo online > $f
>>         grep ^MemTotal /proc/meminfo
>> done
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Ross Zwisler <zwisler@kernel.org>
>> Cc: Vishal Verma <vishal.l.verma@intel.com>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: linux-nvdimm@lists.01.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Fengguang Wu <fengguang.wu@intel.com>
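Two existing kernel interfaces that the policy discussion above revolves around can be poked from the shell. A rough sketch, where the PFN range is purely an assumption and must be replaced with the range actually backing the pages of interest:

# Global automatic-NUMA-balancing knob (a per-task control was still an
# open question in this thread):
cat /proc/sys/kernel/numa_balancing
echo 1 > /proc/sys/kernel/numa_balancing

# Idle-page tracking (CONFIG_IDLE_PAGE_TRACKING): one bit per PFN, accessed
# in 8-byte chunks.  Mark a PFN range idle, wait, then re-read; chunks that
# are still all-ones belong to pages not referenced in between.
pfn_start=$((0x80000)); pfn_count=$((0x1000))   # assumed PFN range
blk=$(( pfn_start / 64 )); cnt=$(( pfn_count / 64 ))
# set the idle bits for the range
printf '\377%.0s' $(seq $(( cnt * 8 ))) | \
        dd of=/sys/kernel/mm/page_idle/bitmap bs=8 seek=$blk count=$cnt conv=notrunc
sleep 60
# read them back
dd if=/sys/kernel/mm/page_idle/bitmap bs=8 skip=$blk count=$cnt 2>/dev/null | od -A x -t x8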
On 10/26/18 1:03 AM, Xishi Qiu wrote:
> How about letting the BIOS report a new type for kmem in the e820 table? e.g.
> #define E820_PMEM 7
> #define E820_KMEM 8

It would be best if the BIOS just did this all for us. But what you're describing would take years to get from concept to showing up in someone's hands. I'd rather not wait.

Plus, doing it the way I suggested gives the OS the most control. The BIOS isn't in the critical path to do the right thing.
On Mon, Oct 22, 2018 at 6:11 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Oct 22, 2018 at 6:05 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > >
> > > Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
> > >
> > > This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.
> > >
> > > This is highly RFC, and I really want the feedback from the nvdimm/pmem folks about whether this is a viable long-term perversion of their code and device mode. It's insufficiently documented and probably not bisectable either.
> > >
> > > Todo:
> > > 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
> > > 2. When the device has no proper node, we default it to NUMA node 0. Is that OK?
> > > 3. We muck with the 'struct resource' code quite a bit. It definitely needs a once-over from folks more familiar with it than I.
> > > 4. Is there a better way to do this than starting with a copy of pmem.c?
> >
> > So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7 and remove all the devm_memremap_pages() infrastructure and dax_region infrastructure.
> >
> > The driver should be a dead simple turnaround to call add_memory() for the passed-in range. The hard part is, as you say, arranging for the kmem driver to not stand in the way of typical range / device claims by the dax_pmem device.
> >
> > To me this looks like teaching the nvdimm-bus and this dax_kmem driver to require explicit matching based on 'id'. The attachment scheme would look like this:
> >
> > modprobe dax_kmem
> > echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> > echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> > echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
> >
> > At step 1 the dax_kmem driver will match no devices and stay out of the way of dax_pmem. It learns about devices it cares about by being explicitly told about them. Then unbind from the typical dax_pmem driver and attach to dax_kmem to perform the one-way hotplug.
> >
> > I expect udev can automate this by setting up a rule to watch for device-dax instances by UUID and call a script to do the detach / reattach dance.
>
> The next question is how to support this for ranges that don't originate from the pmem sub-system. I expect we want dax_kmem to register a generic platform device representing the range and have a generic platform driver that turns around and does the add_memory().

I forgot I have some old patches that do something along these lines and make device-dax its own bus. I'll dust those off so we can discern what's left.
Hi Dave,

What's the base tree for this patchset? I tried 4.19, linux-next, and Dan's libnvdimm-for-next branch, but none applies cleanly.

Thanks,
Fengguang
On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
>
> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.

Could you please elaborate on this? I suppose you mean the pmem will be a separate NUMA node, right?

I would like to try the patches on real hardware; are any prerequisites needed?

Thanks,
Yang

> This is highly RFC, and I really want the feedback from the nvdimm/pmem folks about whether this is a viable long-term perversion of their code and device mode. It's insufficiently documented and probably not bisectable either.
>
> Todo:
> 1. The device re-binding hacks are ham-fisted at best. We need a better way of doing this, especially so the kmem driver does not get in the way of normal pmem devices.
> 2. When the device has no proper node, we default it to NUMA node 0. Is that OK?
> 3. We muck with the 'struct resource' code quite a bit. It definitely needs a once-over from folks more familiar with it than I.
> 4. Is there a better way to do this than starting with a copy of pmem.c?
>
> Here's how I set up a system to test this thing:
>
> 1. Boot qemu with lots of memory: "-m 4096", for instance
> 2. Reserve 512MB of physical memory. Reserving a spot at 2GB physical seems to work: memmap=512M!0x0000000080000000
>    This will end up looking like a pmem device at boot.
> 3. When booted, convert the fsdax device to "device dax":
>    ndctl create-namespace -fe namespace0.0 -m dax
> 4. In the background, the kmem driver will probably bind to the new device.
> 5. Now, online the new memory sections. Perhaps:
>
> grep ^MemTotal /proc/meminfo
> for f in `grep -vl online /sys/devices/system/memory/*/state`; do
>         echo $f: `cat $f`
>         echo online > $f
>         grep ^MemTotal /proc/meminfo
> done
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
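Assuming the series is applied and the pmem range comes up as its own node, stock numactl is enough to see it and to steer a test workload at it; the node number and ./my_app below are placeholders.

# Show all nodes, their CPUs (the pmem node typically has none) and sizes
numactl --hardware

# Bind a workload's allocations to the pmem node, or merely prefer it
numactl --membind=2 ./my_app
numactl --preferred=2 ./my_app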
On 22/10/2018 at 22:13, Dave Hansen wrote:
> Persistent memory is cool. But, currently, you have to rewrite your applications to use it. Wouldn't it be cool if you could just have it show up in your system like normal RAM and get to it like a slow blob of memory? Well... have I got the patch series for you!
>
> This series adds a new "driver" to which pmem devices can be attached. Once attached, the memory "owned" by the device is hot-added to the kernel and managed like any other memory. On systems with an HMAT (a new ACPI table), each socket (roughly) will have a separate NUMA node for its persistent memory so this newly-added memory can be selected by its unique NUMA node.

Hello Dave,

What happens on systems without an HMAT? Does this new memory get merged into existing NUMA nodes?

Also, do you plan to have a way for applications to find out which NUMA nodes are "real DRAM" while others are "pmem-backed"? (Something like a new attribute in /sys/devices/system/node/nodeX/.) Or should we use the HMAT performance attributes for this?

Brice
On 12/3/18 1:22 AM, Brice Goglin wrote:
> On 22/10/2018 at 22:13, Dave Hansen wrote:
> What happens on systems without an HMAT? Does this new memory get merged into existing NUMA nodes?

It gets merged into the persistent memory device's node, as told by the firmware. Intel's persistent memory should always be in its own node, separate from DRAM.

> Also, do you plan to have a way for applications to find out which NUMA nodes are "real DRAM" while others are "pmem-backed"? (Something like a new attribute in /sys/devices/system/node/nodeX/.) Or should we use the HMAT performance attributes for this?

The best way is to use the sysfs-generic interfaces to the HMAT that Keith Busch is pushing. In the end, we really think folks will only care about the memory's performance properties rather than whether it's *actually* persistent memory or not.
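An illustrative way to read those properties from the shell; the access0/initiators layout reflects the node-level sysfs interface Keith Busch was proposing at the time and should be treated as an assumption here.

node=2    # assumed pmem node
grep . /sys/devices/system/node/node$node/access0/initiators/read_bandwidth \
       /sys/devices/system/node/node$node/access0/initiators/write_bandwidth \
       /sys/devices/system/node/node$node/access0/initiators/read_latency \
       /sys/devices/system/node/node$node/access0/initiators/write_latency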
On Mon, Dec 3, 2018 at 8:56 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 12/3/18 1:22 AM, Brice Goglin wrote:
> > On 22/10/2018 at 22:13, Dave Hansen wrote:
> > What happens on systems without an HMAT? Does this new memory get merged into existing NUMA nodes?
>
> It gets merged into the persistent memory device's node, as told by the firmware. Intel's persistent memory should always be in its own node, separate from DRAM.
>
> > Also, do you plan to have a way for applications to find out which NUMA nodes are "real DRAM" while others are "pmem-backed"? (Something like a new attribute in /sys/devices/system/node/nodeX/.) Or should we use the HMAT performance attributes for this?
>
> The best way is to use the sysfs-generic interfaces to the HMAT that Keith Busch is pushing. In the end, we really think folks will only care about the memory's performance properties rather than whether it's *actually* persistent memory or not.

It's also important to point out that "persistent memory" by itself is an ambiguous memory type. It's anything from new media with distinct performance characteristics to battery-backed DRAM. I.e., the performance of "persistent memory" may be indistinguishable from "real DRAM".