
[5/5] dax: "Hotplug" persistent memory for use like normal RAM

Message ID 20190124231448.E102D18E@viggo.jf.intel.com (mailing list archive)
State New, archived
Series Allow persistent memory to be used like normal RAM

Commit Message

Dave Hansen Jan. 24, 2019, 11:14 p.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement.  Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Device DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then bind it to this new driver:

	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.
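
Where no udev rule onlines the new sections, a small helper can do it by
hand. The following is only a sketch: it assumes the standard memory-block
sysfs directory (/sys/devices/system/memory/memoryN/state), and the function
name and the default "online_movable" mode are illustrative choices, not
part of this patch. Note that it onlines every offline block it finds, not
only those added by kmem.

```sh
# Online every memory block that is currently offline.
# $1: memory sysfs directory (defaults to the real one; override for dry runs)
# $2: value written to "state": online, online_kernel or online_movable
online_offline_blocks() {
	mem_root="${1:-/sys/devices/system/memory}"
	mode="${2:-online_movable}"
	for state in "$mem_root"/memory*/state; do
		[ -e "$state" ] || continue	# glob matched nothing
		if [ "$(cat "$state")" = "offline" ]; then
			echo "$mode" > "$state" || echo "failed: $state" >&2
		fi
	done
}
```

Run as root, "online_offline_blocks" with no arguments acts on the live
system; passing a scratch directory as the first argument allows a dry run.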

This rebinding procedure is currently a one-way trip.  Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.
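
For management software, the four writes above can be wrapped in one
function that stops at the first failure. This is a minimal sketch that
uses only the sysfs paths already shown; the function name and the optional
second argument (an alternate bus directory, handy for testing) are
hypothetical.

```sh
# Move a device-dax instance from the device_dax driver to kmem.
# $1: device name, e.g. dax0.0
# $2: dax bus directory (defaults to /sys/bus/dax)
rebind_to_kmem() {
	dev="$1"
	dax_bus="${2:-/sys/bus/dax}"
	if [ -z "$dev" ]; then
		echo "usage: rebind_to_kmem <daxX.Y> [bus-dir]" >&2
		return 2
	fi

	# Detach from device_dax (printf '%s' writes no trailing newline,
	# like "echo -n") ...
	printf '%s' "$dev" > "$dax_bus/drivers/device_dax/remove_id" || return 1
	printf '%s' "$dev" > "$dax_bus/drivers/device_dax/unbind" || return 1
	# ... then attach to kmem.  Per the description this is one-way.
	printf '%s' "$dev" > "$dax_bus/drivers/kmem/new_id" || return 1
	printf '%s' "$dev" > "$dax_bus/drivers/kmem/bind" || return 1
}
```

Because the bind cannot be reverted, a caller would normally invoke this
only for devices an administrator has explicitly listed.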

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver.  There are two reasons for
this: One, since it is a one-way trip, it cannot be undone if
bound incorrectly.  Two, the kmem driver destroys data on the
device.  Imagine you had good data on a pmem device.  It
would be catastrophic if you compiled in "kmem" but left out
the "device_dax" driver: kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because the pmem sits in its own NUMA nodes, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.
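
To build such a policy, one first needs to know which nodes are
memory-only. A rough sketch, assuming the usual node sysfs layout
(/sys/devices/system/node/nodeN/cpulist) and a hypothetical function name:
a node whose cpulist is empty has no CPUs attached.

```sh
# Print the IDs of NUMA nodes that have no CPUs attached.
# $1: node sysfs directory (defaults to the real one)
memory_only_nodes() {
	node_root="${1:-/sys/devices/system/node}"
	for dir in "$node_root"/node[0-9]*; do
		[ -r "$dir/cpulist" ] || continue
		# An empty cpulist means the node carries memory only.
		if [ -z "$(tr -d '[:space:]' < "$dir/cpulist")" ]; then
			echo "${dir##*/node}"
		fi
	done
}
```

With a node ID in hand (say 2, purely as an example), "numactl --membind=2
./app" confines an application's allocations to that node, while binding to
the DRAM nodes instead keeps it off the slower memory.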

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().
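
The capacity loss can be estimated ahead of time with the same arithmetic
the driver uses: round the start up to a block boundary, then round the
remaining size down to whole blocks. The sketch below is illustrative only;
the function name is made up, and the 128 MiB block size in the example is
an assumption chosen to match the 256MB (two blocks) worst case quoted
above.

```sh
# Report the hot-pluggable part of a region.
# $1: region start, $2: region size, $3: memory block size (all in bytes)
# Prints: "<kmem_start> <kmem_size> <bytes lost to alignment>"
kmem_usable_range() {
	start=$1 size=$2 block=$3
	# Round the start up to the next block boundary.
	kmem_start=$(( (start + block - 1) / block * block ))
	# Shrink the size by the amount the start moved up ...
	kmem_size=$(( size - (kmem_start - start) ))
	# ... and round it down to cover only complete blocks.
	kmem_size=$(( kmem_size / block * block ))
	echo "$kmem_start $kmem_size $(( size - kmem_size ))"
}

# A region starting 2 MiB past the 4 GiB mark and ending at 20 GiB:
kmem_usable_range 4297064448 17177772032 134217728
# prints "4429185024 17045651456 132120576" (126 MiB lost)
```

On a running system the actual block size can be read (in hex) from
/sys/devices/system/memory/block_size_bytes.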

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
---

 b/drivers/base/memory.c |    1 
 b/drivers/dax/Kconfig   |   16 +++++++
 b/drivers/dax/Makefile  |    1 
 b/drivers/dax/kmem.c    |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 126 insertions(+)

Comments

Jane Chu Jan. 25, 2019, 6:13 a.m. UTC | #1
Hi, Dave,

While chatting with my colleague Erwin about the patchset, it occurred
to us that we're not clear about the error handling part. Specifically,

1. If an uncorrectable error is detected during a 'load' in the
hot-plugged pmem region, how will the error be handled?  Will it be
handled like PMEM or DRAM?

2. If poison is set, and is persistent, which entity should clear
the poison, and the badblock (if applicable)? If it's the user's
responsibility, does ndctl support the clearing in this mode?

thanks!
-jane


On 1/24/2019 3:14 PM, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This is intended for use with NVDIMMs that are physically persistent
> (physically like flash) so that they can be used as a cost-effective
> RAM replacement.  Intel Optane DC persistent memory is one
> implementation of this kind of NVDIMM.
> 
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Device DAX" or "Filesystem DAX" drivers.  These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries. (DIMM-based
> persistent memory hardware/software is described in great detail
> here: Documentation/nvdimm/nvdimm.txt).
> 
> However, this limits persistent memory use to applications which
> *have* been modified.  To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed and used just like
> normal RAM would be.
> 
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
> 
> 	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> 	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> 
> and then bind it to this new driver:
> 
> 	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> 	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
> 
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
> 
> This rebinding procedure is currently a one-way trip.  Once memory
> is bound to "kmem", it's there permanently and can not be
> unbound and assigned back to device_dax.
> 
> The kmem driver will never bind to a dax device unless the device
> is *explicitly* bound to the driver.  There are two reasons for
> this: One, since it is a one-way trip, it can not be undone if
> bound incorrectly.  Two, the kmem driver destroys data on the
> device.  Think of if you had good data on a pmem device.  It
> would be catastrophic if you compile-in "kmem", but leave out
> the "device_dax" driver.  kmem would take over the device and
> write volatile data all over your good data.
> 
> This inherits any existing NUMA information for the newly-added
> memory from the persistent memory device that came from the
> firmware.  On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node.  That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
> 
> Because NUMA nodes are created, the existing NUMA APIs and tools
> are sufficient to create policies for applications or memory areas
> to have affinity for or an aversion to using this memory.
> 
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches.  But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
> This calculation is also the reason we export
> memory_block_size_bytes().
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Cc: Jerome Glisse <jglisse@redhat.com>
> ---
> 
>   b/drivers/base/memory.c |    1
>   b/drivers/dax/Kconfig   |   16 +++++++
>   b/drivers/dax/Makefile  |    1
>   b/drivers/dax/kmem.c    |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
>   4 files changed, 126 insertions(+)
> 
> diff -puN drivers/base/memory.c~dax-kmem-try-4 drivers/base/memory.c
> --- a/drivers/base/memory.c~dax-kmem-try-4	2019-01-24 15:13:15.987199535 -0800
> +++ b/drivers/base/memory.c	2019-01-24 15:13:15.994199535 -0800
> @@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_b
>   {
>   	return MIN_MEMORY_BLOCK_SIZE;
>   }
> +EXPORT_SYMBOL_GPL(memory_block_size_bytes);
>   
>   static unsigned long get_memory_block_size(void)
>   {
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4	2019-01-24 15:13:15.988199535 -0800
> +++ b/drivers/dax/Kconfig	2019-01-24 15:13:15.994199535 -0800
> @@ -32,6 +32,22 @@ config DEV_DAX_PMEM
>   
>   	  Say M if unsure
>   
> +config DEV_DAX_KMEM
> +	tristate "KMEM DAX: volatile-use of persistent memory"
> +	default DEV_DAX
> +	depends on DEV_DAX
> +	depends on MEMORY_HOTPLUG # for add_memory() and friends
> +	help
> +	  Support access to persistent memory as if it were RAM.  This
> +	  allows easier use of persistent memory by unmodified
> +	  applications.
> +
> +	  To use this feature, a DAX device must be unbound from the
> +	  device_dax driver (PMEM DAX) and bound to this kmem driver
> +	  on each boot.
> +
> +	  Say N if unsure.
> +
>   config DEV_DAX_PMEM_COMPAT
>   	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
>   	depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null	2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c	2019-01-24 15:13:15.994199535 -0800
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> +	struct dev_dax *dev_dax = to_dev_dax(dev);
> +	struct resource *res = &dev_dax->region->res;
> +	resource_size_t kmem_start;
> +	resource_size_t kmem_size;
> +	resource_size_t kmem_end;
> +	struct resource *new_res;
> +	int numa_node;
> +	int rc;
> +
> +	/*
> +	 * Ensure good NUMA information for the persistent memory.
> +	 * Without this check, there is a risk that slow memory
> +	 * could be mixed in a node with faster memory, causing
> +	 * unavoidable performance issues.
> +	 */
> +	numa_node = dev_dax->target_node;
> +	if (numa_node < 0) {
> +		dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
> +			 res, numa_node);
> +		return -EINVAL;
> +	}
> +
> +	/* Hotplug starting at the beginning of the next block: */
> +	kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> +	kmem_size = resource_size(res);
> +	/* Adjust the size down to compensate for moving up kmem_start: */
> +	kmem_size -= kmem_start - res->start;
> +	/* Align the size down to cover only complete blocks: */
> +	kmem_size &= ~(memory_block_size_bytes() - 1);
> +	kmem_end = kmem_start + kmem_size;
> +
> +	/* Region is permanently reserved.  Hot-remove not yet implemented. */
> +	new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev));
> +	if (!new_res) {
> +		dev_warn(dev, "could not reserve region [%pa-%pa]\n",
> +			 &kmem_start, &kmem_end);
> +		return -EBUSY;
> +	}
> +
> +	/*
> +	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
> +	 * so that add_memory() can add a child resource.  Do not
> +	 * inherit flags from the parent since it may set new flags
> +	 * unknown to us that will break add_memory() below.
> +	 */
> +	new_res->flags = IORESOURCE_SYSTEM_RAM;
> +	new_res->name = dev_name(dev);
> +
> +	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> +	if (rc)
> +		return rc;
> +
> +	return 0;
> +}
> +
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> +	/*
> +	 * Purposely leak the request_mem_region() for the device-dax
> +	 * range and return '0' to ->remove() attempts. The removal of
> +	 * the device from the driver always succeeds, but the region
> +	 * is permanently pinned as reserved by the unreleased
> +	 * request_mem_region().
> +	 */
> +	return 0;
> +}
> +
> +static struct dax_device_driver device_dax_kmem_driver = {
> +	.drv = {
> +		.probe = dev_dax_kmem_probe,
> +		.remove = dev_dax_kmem_remove,
> +	},
> +};
> +
> +static int __init dax_kmem_init(void)
> +{
> +	return dax_driver_register(&device_dax_kmem_driver);
> +}
> +
> +static void __exit dax_kmem_exit(void)
> +{
> +	dax_driver_unregister(&device_dax_kmem_driver);
> +}
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_LICENSE("GPL v2");
> +module_init(dax_kmem_init);
> +module_exit(dax_kmem_exit);
> +MODULE_ALIAS_DAX_DEVICE(0);
> diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
> --- a/drivers/dax/Makefile~dax-kmem-try-4	2019-01-24 15:13:15.990199535 -0800
> +++ b/drivers/dax/Makefile	2019-01-24 15:13:15.994199535 -0800
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_DAX) += dax.o
>   obj-$(CONFIG_DEV_DAX) += device_dax.o
> +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>   
>   dax-y := super.o
>   dax-y += bus.o
> _
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
Dan Williams Jan. 25, 2019, 6:27 a.m. UTC | #2
On Thu, Jan 24, 2019 at 10:13 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> Hi, Dave,
>
> While chatting with my colleague Erwin about the patchset, it occurred
> that we're not clear about the error handling part. Specifically,
>
> 1. If an uncorrectable error is detected during a 'load' in the hot
> plugged pmem region, how will the error be handled?  will it be
> handled like PMEM or DRAM?

DRAM.

> 2. If a poison is set, and is persistent, which entity should clear
> the poison, and badblock(if applicable)? If it's user's responsibility,
> does ndctl support the clearing in this mode?

With persistent memory advertised via a static logical-to-physical
storage/dax device mapping, once an error develops it destroys a
physical *and* logical part of a device address space. That loss of
logical address space makes error clearing a necessity. However, with
the DRAM / "System RAM" error handling model, the OS can just offline
the page and map a different one to repair the logical address space.
So, no, ndctl will not have explicit enabling to clear volatile
errors, the OS will just dynamically offline problematic pages.
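
Under the "System RAM" model, pages taken out of service after a memory
failure are accounted in the HardwareCorrupted line of /proc/meminfo (on
kernels built with memory-failure handling). A small sketch for monitoring
that counter; the function name is hypothetical and the optional argument
exists only so the parser can be pointed at a sample file.

```sh
# Print the HardwareCorrupted value (in kB) from a meminfo-style file.
# Returns non-zero if the field is absent.
# $1: file to read (defaults to /proc/meminfo)
hw_corrupted_kb() {
	meminfo="${1:-/proc/meminfo}"
	awk '$1 == "HardwareCorrupted:" { print $2; found = 1 }
	     END { exit !found }' "$meminfo"
}
```

A value that grows over time would suggest pages are being offlined
dynamically, as described above.
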
Du, Fan Jan. 25, 2019, 8:20 a.m. UTC | #3
Dan

Thanks for the insights!

Is it fair to say that a UCE is delivered from h/w to the OS in the same way
whenever a machine check fires, and that only the PMEM/DAX code filters out
the UC address and manages it in its own way via badblocks? If PMEM/DAX
doesn't do so, then the common RAS workflow will kick in, right?

And what about the case where ARS is involved but no machine check has fired,
in the use case this patchset enables?

>-----Original Message-----
>From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
>Of Dan Williams
>Sent: Friday, January 25, 2019 2:28 PM
>To: Jane Chu <jane.chu@oracle.com>
>Cc: Tom Lendacky <thomas.lendacky@amd.com>; Michal Hocko
><mhocko@suse.com>; linux-nvdimm <linux-nvdimm@lists.01.org>; Takashi
>Iwai <tiwai@suse.de>; Dave Hansen <dave.hansen@linux.intel.com>; Huang,
>Ying <ying.huang@intel.com>; Linux Kernel Mailing List
><linux-kernel@vger.kernel.org>; Linux MM <linux-mm@kvack.org>; Jérôme
>Glisse <jglisse@redhat.com>; Borislav Petkov <bp@suse.de>; Yaowei Bai
><baiyaowei@cmss.chinamobile.com>; Ross Zwisler <zwisler@kernel.org>;
>Bjorn Helgaas <bhelgaas@google.com>; Andrew Morton
><akpm@linux-foundation.org>; Wu, Fengguang <fengguang.wu@intel.com>
>Subject: Re: [PATCH 5/5] dax: "Hotplug" persistent memory for use like
>normal RAM
>
>On Thu, Jan 24, 2019 at 10:13 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> Hi, Dave,
>>
>> While chatting with my colleague Erwin about the patchset, it occurred
>> that we're not clear about the error handling part. Specifically,
>>
>> 1. If an uncorrectable error is detected during a 'load' in the hot
>> plugged pmem region, how will the error be handled?  will it be
>> handled like PMEM or DRAM?
>
>DRAM.
>
>> 2. If a poison is set, and is persistent, which entity should clear
>> the poison, and badblock(if applicable)? If it's user's responsibility,
>> does ndctl support the clearing in this mode?
>
>With persistent memory advertised via a static logical-to-physical
>storage/dax device mapping, once an error develops it destroys a
>physical *and* logical part of a device address space. That loss of
>logical address space makes error clearing a necessity. However, with
>the DRAM / "System RAM" error handling model, the OS can just offline
>the page and map a different one to repair the logical address space.
>So, no, ndctl will not have explicit enabling to clear volatile
>errors, the OS will just dynamically offline problematic pages.
Dan Williams Jan. 25, 2019, 5:18 p.m. UTC | #4
On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <fan.du@intel.com> wrote:
>
> Dan
>
> Thanks for the insights!
>
> Can I say, the UCE is delivered from h/w to OS in a single way in case of machine
> check, only PMEM/DAX stuff filter out UC address and managed in its own way by
> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will kick in,
> right?

The common RAS workflow always kicks in, it's just that the page state
presented by a DAX mapping needs distinct handling. Once it is
hot-plugged it no longer needs to be treated differently than "System
RAM".

> And how about when ARS is involved but no machine check fired for the function
> of this patchset?

The hotplug effectively disconnects this address range from the ARS
results. They will still be reported in the libnvdimm "region" level
badblocks instance, but there's no safe / coordinated way to go clear
those errors without additional kernel enabling. There is no "clear
error" semantic for "System RAM".
Verma, Vishal L Jan. 25, 2019, 6:20 p.m. UTC | #5
On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <fan.du@intel.com> wrote:
> > Dan
> > 
> > Thanks for the insights!
> > 
> > Can I say, the UCE is delivered from h/w to OS in a single way in
> > case of machine
> > check, only PMEM/DAX stuff filter out UC address and managed in its
> > own way by
> > badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
> > kick in,
> > right?
> 
> The common RAS workflow always kicks in, it's just the page state
> presented by a DAX mapping needs distinct handling. Once it is
> hot-plugged it no longer needs to be treated differently than "System
> RAM".
> 
> > And how about when ARS is involved but no machine check fired for
> > the function
> > of this patchset?
> 
> The hotplug effectively disconnects this address range from the ARS
> results. They will still be reported in the libnvdimm "region" level
> badblocks instance, but there's no safe / coordinated way to go clear
> those errors without additional kernel enabling. There is no "clear
> error" semantic for "System RAM".
> 
Perhaps as future enabling, the kernel can go perform "clear error" for
offlined pages, and make them usable again. But I'm not sure how
prepared mm is to re-accept pages previously offlined.
Jane Chu Jan. 25, 2019, 7:10 p.m. UTC | #6
On 1/25/2019 10:20 AM, Verma, Vishal L wrote:
> 
> On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
>> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <fan.du@intel.com> wrote:
>>> Dan
>>>
>>> Thanks for the insights!
>>>
>>> Can I say, the UCE is delivered from h/w to OS in a single way in
>>> case of machine
>>> check, only PMEM/DAX stuff filter out UC address and managed in its
>>> own way by
>>> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
>>> kick in,
>>> right?
>>
>> The common RAS workflow always kicks in, it's just the page state
>> presented by a DAX mapping needs distinct handling. Once it is
>> hot-plugged it no longer needs to be treated differently than "System
>> RAM".
>>
>>> And how about when ARS is involved but no machine check fired for
>>> the function
>>> of this patchset?
>>
>> The hotplug effectively disconnects this address range from the ARS
>> results. They will still be reported in the libnvdimm "region" level
>> badblocks instance, but there's no safe / coordinated way to go clear
>> those errors without additional kernel enabling. There is no "clear
>> error" semantic for "System RAM".
>>
> Perhaps as future enabling, the kernel can go perform "clear error" for
> offlined pages, and make them usable again. But I'm not sure how
> prepared mm is to re-accept pages previously offlined.
> 

Offlining a DRAM backed page due to a UC makes sense because
  a. the physical DRAM cell might still have an error
  b. power cycle, scrubbing could potentially 'repair' the DRAM cell,
making the page usable again.

But for a PMEM backed page, neither is true. If a poison bit is set in
a page, that indicates the underlying hardware has completed the repair
work; all that's left is for software to recover.  Secondly, because
poison is persistent, unless software explicitly clears the bit,
the page is permanently unusable.

thanks,
-jane
Dan Williams Jan. 25, 2019, 7:15 p.m. UTC | #7
On Fri, Jan 25, 2019 at 11:10 AM Jane Chu <jane.chu@oracle.com> wrote:
>
>
> On 1/25/2019 10:20 AM, Verma, Vishal L wrote:
> >
> > On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
> >> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <fan.du@intel.com> wrote:
> >>> Dan
> >>>
> >>> Thanks for the insights!
> >>>
> >>> Can I say, the UCE is delivered from h/w to OS in a single way in
> >>> case of machine
> >>> check, only PMEM/DAX stuff filter out UC address and managed in its
> >>> own way by
> >>> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
> >>> kick in,
> >>> right?
> >>
> >> The common RAS workflow always kicks in, it's just the page state
> >> presented by a DAX mapping needs distinct handling. Once it is
> >> hot-plugged it no longer needs to be treated differently than "System
> >> RAM".
> >>
> >>> And how about when ARS is involved but no machine check fired for
> >>> the function
> >>> of this patchset?
> >>
> >> The hotplug effectively disconnects this address range from the ARS
> >> results. They will still be reported in the libnvdimm "region" level
> >> badblocks instance, but there's no safe / coordinated way to go clear
> >> those errors without additional kernel enabling. There is no "clear
> >> error" semantic for "System RAM".
> >>
> > Perhaps as future enabling, the kernel can go perform "clear error" for
> > offlined pages, and make them usable again. But I'm not sure how
> > prepared mm is to re-accept pages previously offlined.
> >
>
> Offlining a DRAM backed page due to an UC makes sense because
>   a. the physical DRAM cell might still have an error
>   b. power cycle, scrubbing could potentially 'repair' the DRAM cell,
> making the page usable again.
>
> But for a PMEM backed page, neither is true. If a poison bit is set in
> a page, that indicates the underlying hardware has completed the repair
> work, all that's left is for software to recover.  Secondly, because
> poison is persistent, unless software explicitly clear the bit,
> the page is permanently unusable.

Not permanently... system-owner always has the option to use the
device-DAX and ARS mechanisms to clear errors at the next boot.
There's just no kernel enabling to do that automatically as a part of
this patch set.

However, we should consider this along with the userspace enabling to
control which device-dax instances are set aside for hotplug. It would
make sense to have a "clear errors before hotplug" configuration
option.
Jane Chu Jan. 25, 2019, 11:30 p.m. UTC | #8
On 1/25/2019 11:15 AM, Dan Williams wrote:
> On Fri, Jan 25, 2019 at 11:10 AM Jane Chu <jane.chu@oracle.com> wrote:
>>
>>
>> On 1/25/2019 10:20 AM, Verma, Vishal L wrote:
>>>
>>> On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
>>>> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <fan.du@intel.com> wrote:
>>>>> Dan
>>>>>
>>>>> Thanks for the insights!
>>>>>
>>>>> Can I say, the UCE is delivered from h/w to OS in a single way in
>>>>> case of machine
>>>>> check, only PMEM/DAX stuff filter out UC address and managed in its
>>>>> own way by
>>>>> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
>>>>> kick in,
>>>>> right?
>>>>
>>>> The common RAS workflow always kicks in, it's just the page state
>>>> presented by a DAX mapping needs distinct handling. Once it is
>>>> hot-plugged it no longer needs to be treated differently than "System
>>>> RAM".
>>>>
>>>>> And how about when ARS is involved but no machine check fired for
>>>>> the function
>>>>> of this patchset?
>>>>
>>>> The hotplug effectively disconnects this address range from the ARS
>>>> results. They will still be reported in the libnvdimm "region" level
>>>> badblocks instance, but there's no safe / coordinated way to go clear
>>>> those errors without additional kernel enabling. There is no "clear
>>>> error" semantic for "System RAM".
>>>>
>>> Perhaps as future enabling, the kernel can go perform "clear error" for
>>> offlined pages, and make them usable again. But I'm not sure how
>>> prepared mm is to re-accept pages previously offlined.
>>>
>>
>> Offlining a DRAM backed page due to an UC makes sense because
>>    a. the physical DRAM cell might still have an error
>>    b. power cycle, scrubbing could potentially 'repair' the DRAM cell,
>> making the page usable again.
>>
>> But for a PMEM backed page, neither is true. If a poison bit is set in
>> a page, that indicates the underlying hardware has completed the repair
>> work, all that's left is for software to recover.  Secondly, because
>> poison is persistent, unless software explicitly clear the bit,
>> the page is permanently unusable.
> 
> Not permanently... system-owner always has the option to use the
> device-DAX and ARS mechanisms to clear errors at the next boot.
> There's just no kernel enabling to do that automatically as a part of
> this patch set.
> 
> However, we should consider this along with the userspace enabling to
> control which device-dax instances are set aside for hotplug. It would
> make sense to have a "clear errors before hotplug" configuration
> option.
> 

Agreed, it would be nice to clear errors prior to the hotplug operation,
and better still if that can be handled by the kernel.

thanks,
-jane
Michal Hocko Jan. 28, 2019, 9:25 a.m. UTC | #9
On Fri 25-01-19 11:15:08, Dan Williams wrote:
[...]
> However, we should consider this along with the userspace enabling to
> control which device-dax instances are set aside for hotplug. It would
> make sense to have a "clear errors before hotplug" configuration
> option.

I am not sure I understand. Do you mean to clear HWPoison when the
memory is hotadded (add_pages) or onlined (resp. move_pfn_range_to_zone)?
Dan Williams Jan. 28, 2019, 4:34 p.m. UTC | #10
On Mon, Jan 28, 2019 at 1:26 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 25-01-19 11:15:08, Dan Williams wrote:
> [...]
> > However, we should consider this along with the userspace enabling to
> > control which device-dax instances are set aside for hotplug. It would
> > make sense to have a "clear errors before hotplug" configuration
> > option.
>
> I am not sure I understand. Do you mean to clear HWPoison when the
> memory is hotadded (add_pages) or onlined (resp. move_pfn_range_to_zone)?

Before the memory is hot-added via the kmem driver it shows up as an
independent persistent memory namespace. A namespace can be configured
as a block device and errors cleared by writing to the given "bad
block". Once all media errors are cleared the namespace can be
assigned as volatile memory to the core-kernel mm. The memory range
starts as a namespace each boot and must be hotplugged via startup
scripts; those scripts can be made to handle the bad block pruning.
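
A startup script following this order could gate the hotplug on the
badblocks list being empty. The sketch below is an assumption-laden
illustration: the function name is made up, and the caller is expected to
pass whichever badblocks file applies (for example a region-level file such
as /sys/bus/nd/devices/region0/badblocks, a path assumed here rather than
taken from this thread). It only checks; it does not clear anything.

```sh
# Succeed only if the given badblocks file lists no media errors.
# Returns 2 if the file cannot be read, so "unknown" is never taken as clean.
# $1: path to a badblocks file (one "<sector> <count>" entry per line)
region_is_clean() {
	badblocks_file="$1"
	[ -r "$badblocks_file" ] || return 2
	# sysfs files report a fixed size, so inspect the content, not the size.
	[ -z "$(cat "$badblocks_file")" ]
}
```

The script would run the bind steps from the commit message only when this
check succeeds, and otherwise clear the reported blocks first.
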
Brice Goglin Feb. 9, 2019, 11 a.m. UTC | #11
On 25/01/2019 at 00:14, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> This is intended for use with NVDIMMs that are physically persistent
> (physically like flash) so that they can be used as a cost-effective
> RAM replacement.  Intel Optane DC persistent memory is one
> implementation of this kind of NVDIMM.
>
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Device DAX" or "Filesystem DAX" drivers.  These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries. (DIMM-based
> persistent memory hardware/software is described in great detail
> here: Documentation/nvdimm/nvdimm.txt).
>
> However, this limits persistent memory use to applications which
> *have* been modified.  To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed and used just like
> normal RAM would be.
>
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
>
> 	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> 	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind


Hello Dave

I am trying to use these patches (on top of Dan's nvdimm-pending branch
with Keith's HMAT patches). Writing to remove_id just hangs. echo never
returns, it uses 100% CPU and I can't kill it.

[ 5468.744898] bash            R  running task        0 21419  21416 0x00000080
[ 5468.744899] Call Trace:
[ 5468.744902]  ? vsnprintf+0x372/0x4e0
[ 5468.744904]  ? klist_next+0x79/0xe0
[ 5468.744905]  ? sprintf+0x56/0x80
[ 5468.744907]  ? bus_for_each_dev+0x8a/0xc0
[ 5468.744911]  ? do_id_store+0xe8/0x1e0
[ 5468.744914]  ? _cond_resched+0x15/0x30
[ 5468.744915]  ? __kmalloc+0x17f/0x200
[ 5468.744918]  ? kernfs_fop_write+0x83/0x190
[ 5468.744918]  ? __vfs_write+0x36/0x1b0
[ 5468.744919]  ? selinux_file_permission+0xe1/0x130
[ 5468.744921]  ? security_file_permission+0x36/0x100
[ 5468.744922]  ? vfs_write+0xad/0x1b0
[ 5468.744922]  ? ksys_write+0x52/0xc0
[ 5468.744924]  ? do_syscall_64+0x5b/0x180
[ 5468.744927]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

CONFIG_NVDIMM_DAX=y
CONFIG_DAX_DRIVER=y
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_KMEM=m
CONFIG_DEV_DAX_PMEM_COMPAT=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y

  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":1598128390144,
    "uuid":"7046a749-477f-4690-9b3c-a640a1aa44f1",
    "chardev":"dax0.0"
  }


I've used your patches on fake hardware (memmap=xx!yy) with an older
nvdimm-pending branch (without Keith's patches), and it worked fine. This
time I am running on real Intel hardware. Any idea where to look?

Thanks

Brice
Dave Hansen Feb. 11, 2019, 4:22 p.m. UTC | #12
On 2/9/19 3:00 AM, Brice Goglin wrote:
> I've used your patches on fake hardware (memmap=xx!yy) with an older
> nvdimm-pending branch (without Keith's patches). It worked fine. This
> time I am running on real Intel hardware. Any idea where to look ?

I've run them on real Intel hardware too.

Could you share the exact sequence of commands you're issuing to
reproduce the hang?  My guess would be that there's some odd interaction
between Dan's latest branch and my now (slightly) stale patches.

I'll refresh them this week and see if I can reproduce what you're seeing.
Brice Goglin Feb. 12, 2019, 7:59 p.m. UTC | #13
On 11/02/2019 at 17:22, Dave Hansen wrote:

> On 2/9/19 3:00 AM, Brice Goglin wrote:
>> I've used your patches on fake hardware (memmap=xx!yy) with an older
>> nvdimm-pending branch (without Keith's patches). It worked fine. This
>> time I am running on real Intel hardware. Any idea where to look ?
> I've run them on real Intel hardware too.
>
> Could you share the exact sequence of commands you're issuing to
> reproduce the hang?  My guess would be that there's some odd interaction
> between Dan's latest branch and my now (slightly) stale patches.
>
> I'll refresh them this week and see if I can reproduce what you're seeing.

# ndctl disable-region all
# ndctl zero-labels all
# ndctl enable-region region0
# ndctl create-namespace -r region0 -t pmem -m devdax
{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"dev",
  "size":"1488.37 GiB (1598.13 GB)",
  "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
  "daxregion":{
    "id":0,
    "size":"1488.37 GiB (1598.13 GB)",
    "align":2097152,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"1488.37 GiB (1598.13 GB)"
      }
    ]
  },
  "align":2097152
}
# ndctl enable-namespace namespace0.0
# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
<hang>

I tried with and without dax_pmem_compat loaded, but it doesn't help.

Brice
Dan Williams Feb. 13, 2019, 12:30 a.m. UTC | #14
On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> On 11/02/2019 at 17:22, Dave Hansen wrote:
>
> > On 2/9/19 3:00 AM, Brice Goglin wrote:
> >> I've used your patches on fake hardware (memmap=xx!yy) with an older
> >> nvdimm-pending branch (without Keith's patches). It worked fine. This
> >> time I am running on real Intel hardware. Any idea where to look ?
> > I've run them on real Intel hardware too.
> >
> > Could you share the exact sequence of commands you're issuing to
> > reproduce the hang?  My guess would be that there's some odd interaction
> > between Dan's latest branch and my now (slightly) stale patches.
> >
> > I'll refresh them this week and see if I can reproduce what you're seeing.
>
> # ndctl disable-region all
> # ndctl zero-labels all
> # ndctl enable-region region0
> # ndctl create-namespace -r region0 -t pmem -m devdax
> {
>   "dev":"namespace0.0",
>   "mode":"devdax",
>   "map":"dev",
>   "size":"1488.37 GiB (1598.13 GB)",
>   "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
>   "daxregion":{
>     "id":0,
>     "size":"1488.37 GiB (1598.13 GB)",
>     "align":2097152,
>     "devices":[
>       {
>         "chardev":"dax0.0",
>         "size":"1488.37 GiB (1598.13 GB)"
>       }
>     ]
>   },
>   "align":2097152
> }
> # ndctl enable-namespace namespace0.0
> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> <hang>
>
> I tried with and without dax_pmem_compat loaded, but it doesn't help.

I think this is due to:

  a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id

I missed that this path is also called in the remove_id path. Thanks
for the bug report! I'll get this fixed up.
Brice Goglin Feb. 13, 2019, 8:12 a.m. UTC | #15
On 13/02/2019 at 01:30, Dan Williams wrote:
> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>> # ndctl disable-region all
>> # ndctl zero-labels all
>> # ndctl enable-region region0
>> # ndctl create-namespace -r region0 -t pmem -m devdax
>> {
>>   "dev":"namespace0.0",
>>   "mode":"devdax",
>>   "map":"dev",
>>   "size":"1488.37 GiB (1598.13 GB)",
>>   "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
>>   "daxregion":{
>>     "id":0,
>>     "size":"1488.37 GiB (1598.13 GB)",
>>     "align":2097152,
>>     "devices":[
>>       {
>>         "chardev":"dax0.0",
>>         "size":"1488.37 GiB (1598.13 GB)"
>>       }
>>     ]
>>   },
>>   "align":2097152
>> }
>> # ndctl enable-namespace namespace0.0
>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>> <hang>
>>
>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
> I think this is due to:
>
>   a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
>
> I missed that this path is also called in the remove_id path. Thanks
> for the bug report! I'll get this fixed up.


Now that remove_id is fixed, things fail later in Dave's procedure:

# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
# echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
# echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
-bash: echo: write error: No such device

(And nothing seems to have changed in /sys/devices/system/memory/*/state)

Brice
Dan Williams Feb. 13, 2019, 8:24 a.m. UTC | #16
On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> On 13/02/2019 at 01:30, Dan Williams wrote:
> > On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >> # ndctl disable-region all
> >> # ndctl zero-labels all
> >> # ndctl enable-region region0
> >> # ndctl create-namespace -r region0 -t pmem -m devdax
> >> {
> >>   "dev":"namespace0.0",
> >>   "mode":"devdax",
> >>   "map":"dev",
> >>   "size":"1488.37 GiB (1598.13 GB)",
> >>   "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
> >>   "daxregion":{
> >>     "id":0,
> >>     "size":"1488.37 GiB (1598.13 GB)",
> >>     "align":2097152,
> >>     "devices":[
> >>       {
> >>         "chardev":"dax0.0",
> >>         "size":"1488.37 GiB (1598.13 GB)"
> >>       }
> >>     ]
> >>   },
> >>   "align":2097152
> >> }
> >> # ndctl enable-namespace namespace0.0
> >> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >> <hang>
> >>
> >> I tried with and without dax_pmem_compat loaded, but it doesn't help.
> > I think this is due to:
> >
> >   a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
> >
> > I missed that this path is also called in the remove_id path. Thanks
> > for the bug report! I'll get this fixed up.
>
>
> Now that remove_id is fixed, things fails later in Dave's procedure:
>
> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id

In the current version of the code the bind is not necessary, so the
lack of error messages here means the bind succeeded.

> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
> -bash: echo: write error: No such device

This also happens when the device is already bound.

>
> (And nothing seems to have changed in /sys/devices/system/memory/*/state)

What does "cat /proc/iomem" say?
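
One way to tell "already bound" apart from a real failure is to look at
which driver, if any, currently holds the device. The helper below is only
a sketch: it assumes the standard driver-model layout in which
/sys/bus/dax/devices/<dev>/driver is a symlink to the bound driver, and
the second argument (an alternate sysfs root) is a hypothetical addition
for dry runs, not something from this thread.

```shell
#!/bin/sh
# Print the name of the driver currently bound to a DAX device,
# or "unbound" if no driver is attached.
#
# $1: device name, e.g. dax0.0
# $2: sysfs root (hypothetical parameter for dry runs; defaults to /sys)
dax_bound_driver() {
    dev=$1
    root=${2:-/sys}
    link="$root/bus/dax/devices/$dev/driver"

    if [ -L "$link" ]; then
        # The "driver" symlink points at the bound driver's directory,
        # so its last path component is the driver name.
        basename "$(readlink "$link")"
    else
        echo unbound
    fi
}
```

Calling "dax_bound_driver dax0.0" after writing to kmem/new_id would be
expected to print "kmem" if the automatic bind went through.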
Brice Goglin Feb. 13, 2019, 8:43 a.m. UTC | #17
On 13/02/2019 at 09:24, Dan Williams wrote:
> On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>> On 13/02/2019 at 01:30, Dan Williams wrote:
>>> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>>>> # ndctl disable-region all
>>>> # ndctl zero-labels all
>>>> # ndctl enable-region region0
>>>> # ndctl create-namespace -r region0 -t pmem -m devdax
>>>> {
>>>>   "dev":"namespace0.0",
>>>>   "mode":"devdax",
>>>>   "map":"dev",
>>>>   "size":"1488.37 GiB (1598.13 GB)",
>>>>   "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
>>>>   "daxregion":{
>>>>     "id":0,
>>>>     "size":"1488.37 GiB (1598.13 GB)",
>>>>     "align":2097152,
>>>>     "devices":[
>>>>       {
>>>>         "chardev":"dax0.0",
>>>>         "size":"1488.37 GiB (1598.13 GB)"
>>>>       }
>>>>     ]
>>>>   },
>>>>   "align":2097152
>>>> }
>>>> # ndctl enable-namespace namespace0.0
>>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>>>> <hang>
>>>>
>>>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
>>> I think this is due to:
>>>
>>>   a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
>>>
>>> I missed that this path is also called in the remove_id path. Thanks
>>> for the bug report! I'll get this fixed up.
>>
>> Now that remove_id is fixed, things fails later in Dave's procedure:
>>
>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> In the current version of the code the bind is not necessary, so the
> lack of error messages here means the bind succeeded.
>
>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>> -bash: echo: write error: No such device
> This also happens when the device is already bound.
>
>> (And nothing seems to have changed in /sys/devices/system/memory/*/state)
> What does "cat /proc/iomem" say?


3060000000-1aa5fffffff : Persistent Memory
  3060000000-36481fffff : namespace0.0
  3680000000-1a9ffffffff : dax0.0
    3680000000-1a9ffffffff : System RAM
(the last line wasn't here before attaching to kmem)

I said nothing changed in memory/*/state, but what I actually meant is
that nothing was offline. Things are actually working!

First, node4 appeared, and all memory is already attached to it without
having to write to memory/*/state:

Node 4 MemTotal:       1558183936 kB
Node 4 MemFree:        1558068564 kB
Node 4 MemUsed:          115372 kB

I wasn't expecting node4 to appear because the machine has no
/sys/firmware/acpi/tables/HMAT when running in 1LM (there's one in 2LM).
I thought you said in the past that no HMAT would mean memory would be
added to the existing DDR node?

Thanks!

Brice
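
As a sanity check on the numbers above, the size of the new "System RAM"
range in /proc/iomem can be compared with the Node 4 MemTotal value. The
function below is a small sketch (its name is made up for illustration);
it treats the range as inclusive, which is how /proc/iomem prints ranges.

```shell
#!/bin/sh
# Size in kB of an inclusive /proc/iomem range.
# $1: range as printed by /proc/iomem, "start-end" in hex without 0x
iomem_range_kb() {
    range=$1
    start=${range%-*}   # text before the dash
    end=${range#*-}     # text after the dash
    # The end address is inclusive, hence the +1.
    echo $(( (0x$end - 0x$start + 1) / 1024 ))
}

iomem_range_kb 3680000000-1a9ffffffff   # prints 1558183936
```

The result, 1558183936 kB, is exactly the reported Node 4 MemTotal, so the
whole hot-added range ended up online on that node.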
Brice Goglin Feb. 13, 2019, 1:06 p.m. UTC | #18
On 13/02/2019 at 09:43, Brice Goglin wrote:
> On 13/02/2019 at 09:24, Dan Williams wrote:
>> On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>>> On 13/02/2019 at 01:30, Dan Williams wrote:
>>>> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>>>>> # ndctl disable-region all
>>>>> # ndctl zero-labels all
>>>>> # ndctl enable-region region0
>>>>> # ndctl create-namespace -r region0 -t pmem -m devdax
>>>>> {
>>>>>   "dev":"namespace0.0",
>>>>>   "mode":"devdax",
>>>>>   "map":"dev",
>>>>>   "size":"1488.37 GiB (1598.13 GB)",
>>>>>   "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
>>>>>   "daxregion":{
>>>>>     "id":0,
>>>>>     "size":"1488.37 GiB (1598.13 GB)",
>>>>>     "align":2097152,
>>>>>     "devices":[
>>>>>       {
>>>>>         "chardev":"dax0.0",
>>>>>         "size":"1488.37 GiB (1598.13 GB)"
>>>>>       }
>>>>>     ]
>>>>>   },
>>>>>   "align":2097152
>>>>> }
>>>>> # ndctl enable-namespace namespace0.0
>>>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>>>>> <hang>
>>>>>
>>>>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
>>>> I think this is due to:
>>>>
>>>>   a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
>>>>
>>>> I missed that this path is also called in the remove_id path. Thanks
>>>> for the bug report! I'll get this fixed up.
>>> Now that remove_id is fixed, things fails later in Dave's procedure:
>>>
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>> In the current version of the code the bind is not necessary, so the
>> lack of error messages here means the bind succeeded.


It looks like "unbind" is required to make the PMEM appear as a new
node. If I remove_id from devdax and new_id to kmem without "unbind" in
the middle, nothing appears.

Writing to "kmem/bind" didn't seem necessary.

Brice



>>
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>>> -bash: echo: write error: No such device
>> This also happens when the device is already bound.
>>
>>> (And nothing seems to have changed in /sys/devices/system/memory/*/state)
>> What does "cat /proc/iomem" say?
>
> 3060000000-1aa5fffffff : Persistent Memory
>   3060000000-36481fffff : namespace0.0
>   3680000000-1a9ffffffff : dax0.0
>     3680000000-1a9ffffffff : System RAM
> (the last line wasn't here before attaching to kmem)
>
> I said nothing changed in memory/*/state, I actually meant that nothing
> was offline. But things are actually working!
>
> First, node4 appeared, all memory is already attached to it without
> having to write to memory/*/state
>
> Node 4 MemTotal:       1558183936 kB
> Node 4 MemFree:        1558068564 kB
> Node 4 MemUsed:          115372 kB
>
> I wasn't expecting node4 to appear because the machine has no
> /sys/firmware/acpi/tables/HMAT when running in 1LM (there's one in 2LM).
> I thought you said in the past that no HMAT would mean memory would be
> added to the existing DDR node?
>
> Thanks!
>
> Brice
>
>
Dan Williams Feb. 13, 2019, 4:19 p.m. UTC | #19
On Wed, Feb 13, 2019 at 5:07 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
>
> On 13/02/2019 at 09:43, Brice Goglin wrote:
> > On 13/02/2019 at 09:24, Dan Williams wrote:
> >> On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >>> On 13/02/2019 at 01:30, Dan Williams wrote:
> >>>> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >>>>> # ndctl disable-region all
> >>>>> # ndctl zero-labels all
> >>>>> # ndctl enable-region region0
> >>>>> # ndctl create-namespace -r region0 -t pmem -m devdax
> >>>>> {
> >>>>>   "dev":"namespace0.0",
> >>>>>   "mode":"devdax",
> >>>>>   "map":"dev",
> >>>>>   "size":"1488.37 GiB (1598.13 GB)",
> >>>>>   "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
> >>>>>   "daxregion":{
> >>>>>     "id":0,
> >>>>>     "size":"1488.37 GiB (1598.13 GB)",
> >>>>>     "align":2097152,
> >>>>>     "devices":[
> >>>>>       {
> >>>>>         "chardev":"dax0.0",
> >>>>>         "size":"1488.37 GiB (1598.13 GB)"
> >>>>>       }
> >>>>>     ]
> >>>>>   },
> >>>>>   "align":2097152
> >>>>> }
> >>>>> # ndctl enable-namespace namespace0.0
> >>>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >>>>> <hang>
> >>>>>
> >>>>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
> >>>> I think this is due to:
> >>>>
> >>>>   a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
> >>>>
> >>>> I missed that this path is also called in the remove_id path. Thanks
> >>>> for the bug report! I'll get this fixed up.
> >>> Now that remove_id is fixed, things fails later in Dave's procedure:
> >>>
> >>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >> In the current version of the code the bind is not necessary, so the
> >> lack of error messages here means the bind succeeded.
>
>
> It looks like "unbind" is required to make the PMEM appear as a new
> node. If I remove_id from devdax and new_id to kmem without "unbind" in
> the middle, nothing appears.
>
> Writing to "kmem/bind" didn't seem necessary.

Yes, in short:

device_dax/remove_id: not required; this driver attaches to any and
    all device-dax devices by default.
device_dax/unbind: required; nothing else will free the device for
    kmem to attach.
kmem/new_id: required; kmem attaches if the device is currently
    unbound, otherwise the device must be unbound before proceeding.
kmem/bind: only required if the device was busy (attached to
    device_dax) when new_id was written.
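
Put together, the minimal sequence from this summary could be scripted
roughly as follows. This is a sketch, not a tested tool: the function name
and the optional sysfs-root argument are hypothetical (the latter only
exists so the script can be tried against a scratch directory), and per
the commit message the switch to kmem is currently a one-way trip.

```shell
#!/bin/sh
# Hand a device-dax device over to the kmem driver.
#
# $1: device name, e.g. dax0.0
# $2: sysfs root (hypothetical parameter for dry runs; defaults to /sys)
dax_to_kmem() {
    dev=$1
    drivers=${2:-/sys}/bus/dax/drivers

    # remove_id is skipped: it is not required.

    # unbind: required, frees the device so that kmem can attach.
    printf '%s' "$dev" > "$drivers/device_dax/unbind" || return 1

    # new_id: required, kmem attaches because the device is now unbound.
    printf '%s' "$dev" > "$drivers/kmem/new_id" || return 1

    # kmem/bind is not written: it is only needed when the device was
    # still attached to device_dax at new_id time, and unbind ran first.
}
```

On a real system this would be run as root, for example
"dax_to_kmem dax0.0", followed by a look at /proc/iomem for the new
"System RAM" entry.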

Patch

diff -puN drivers/base/memory.c~dax-kmem-try-4 drivers/base/memory.c
--- a/drivers/base/memory.c~dax-kmem-try-4	2019-01-24 15:13:15.987199535 -0800
+++ b/drivers/base/memory.c	2019-01-24 15:13:15.994199535 -0800
@@ -88,6 +88,7 @@  unsigned long __weak memory_block_size_b
 {
 	return MIN_MEMORY_BLOCK_SIZE;
 }
+EXPORT_SYMBOL_GPL(memory_block_size_bytes);
 
 static unsigned long get_memory_block_size(void)
 {
diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-4	2019-01-24 15:13:15.988199535 -0800
+++ b/drivers/dax/Kconfig	2019-01-24 15:13:15.994199535 -0800
@@ -32,6 +32,22 @@  config DEV_DAX_PMEM
 
 	  Say M if unsure
 
+config DEV_DAX_KMEM
+	tristate "KMEM DAX: volatile-use of persistent memory"
+	default DEV_DAX
+	depends on DEV_DAX
+	depends on MEMORY_HOTPLUG # for add_memory() and friends
+	help
+	  Support access to persistent memory as if it were RAM.  This
+	  allows easier use of persistent memory by unmodified
+	  applications.
+
+	  To use this feature, a DAX device must be unbound from the
+	  device_dax driver (PMEM DAX) and bound to this kmem driver
+	  on each boot.
+
+	  Say N if unsure.
+
 config DEV_DAX_PMEM_COMPAT
 	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
 	depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null	2018-12-03 08:41:47.355756491 -0800
+++ b/drivers/dax/kmem.c	2019-01-24 15:13:15.994199535 -0800
@@ -0,0 +1,108 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pfn_t.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct resource *res = &dev_dax->region->res;
+	resource_size_t kmem_start;
+	resource_size_t kmem_size;
+	resource_size_t kmem_end;
+	struct resource *new_res;
+	int numa_node;
+	int rc;
+
+	/*
+	 * Ensure good NUMA information for the persistent memory.
+	 * Without this check, there is a risk that slow memory
+	 * could be mixed in a node with faster memory, causing
+	 * unavoidable performance issues.
+	 */
+	numa_node = dev_dax->target_node;
+	if (numa_node < 0) {
+		dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
+			 res, numa_node);
+		return -EINVAL;
+	}
+
+	/* Hotplug starting at the beginning of the next block: */
+	kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+	kmem_size = resource_size(res);
+	/* Adjust the size down to compensate for moving up kmem_start: */
+	kmem_size -= kmem_start - res->start;
+	/* Align the size down to cover only complete blocks: */
+	kmem_size &= ~(memory_block_size_bytes() - 1);
+	kmem_end = kmem_start + kmem_size;
+
+	/* Region is permanently reserved.  Hot-remove not yet implemented. */
+	new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev));
+	if (!new_res) {
+		dev_warn(dev, "could not reserve region [%pa-%pa]\n",
+			 &kmem_start, &kmem_end);
+		return -EBUSY;
+	}
+
+	/*
+	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
+	 * so that add_memory() can add a child resource.  Do not
+	 * inherit flags from the parent since it may set new flags
+	 * unknown to us that will break add_memory() below.
+	 */
+	new_res->flags = IORESOURCE_SYSTEM_RAM;
+	new_res->name = dev_name(dev);
+
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
+static int dev_dax_kmem_remove(struct device *dev)
+{
+	/*
+	 * Purposely leak the request_mem_region() for the device-dax
+	 * range and return '0' to ->remove() attempts. The removal of
+	 * the device from the driver always succeeds, but the region
+	 * is permanently pinned as reserved by the unreleased
+	 * request_mem_region().
+	 */
+	return 0;
+}
+
+static struct dax_device_driver device_dax_kmem_driver = {
+	.drv = {
+		.probe = dev_dax_kmem_probe,
+		.remove = dev_dax_kmem_remove,
+	},
+};
+
+static int __init dax_kmem_init(void)
+{
+	return dax_driver_register(&device_dax_kmem_driver);
+}
+
+static void __exit dax_kmem_exit(void)
+{
+	dax_driver_unregister(&device_dax_kmem_driver);
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+module_init(dax_kmem_init);
+module_exit(dax_kmem_exit);
+MODULE_ALIAS_DAX_DEVICE(0);
diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
--- a/drivers/dax/Makefile~dax-kmem-try-4	2019-01-24 15:13:15.990199535 -0800
+++ b/drivers/dax/Makefile	2019-01-24 15:13:15.994199535 -0800
@@ -1,6 +1,7 @@ 
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
 
 dax-y := super.o
 dax-y += bus.o
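
The range arithmetic in dev_dax_kmem_probe() can be replayed in userspace
to see which part of a DAX region would actually be hot-added. The sketch
below mirrors the three steps (align the start up, shrink the size by the
same amount, align the size down). The inputs in the example are
assumptions inferred from the /proc/iomem and ndctl output earlier in the
thread, not values stated by anyone: a region starting at 0x3648200000
(right after the namespace0.0 range), a size of 1598128390144 bytes (the
namespace "size" field), and a 2 GiB memory block size.

```shell
#!/bin/sh
# Mirror the range arithmetic in dev_dax_kmem_probe().
# $1: region start, $2: region size, $3: memory block size (bytes;
#     the block size must be a power of two)
# Prints the start and size of the range that would be hot-added, in hex.
kmem_hotplug_range() {
    res_start=$(( $1 ))
    res_size=$(( $2 ))
    block=$(( $3 ))

    # ALIGN(res->start, block): round the start up to a block boundary.
    start=$(( (res_start + block - 1) & ~(block - 1) ))
    # Shrink the size by the amount the start moved up ...
    size=$(( res_size - (start - res_start) ))
    # ... then round it down so that only complete blocks are covered.
    size=$(( size & ~(block - 1) ))

    printf '0x%x 0x%x\n' "$start" "$size"
}

kmem_hotplug_range 0x3648200000 1598128390144 0x80000000
# prints: 0x3680000000 0x17380000000
```

Under those assumptions the result is the range 3680000000-1a9ffffffff,
which is the "System RAM" entry Brice reported under dax0.0.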