
[RFC,2/4] libnvdimm: Add a device-tree interface

Message ID 20170627102851.15484-2-oohall@gmail.com (mailing list archive)
State New, archived

Commit Message

Oliver O'Halloran June 27, 2017, 10:28 a.m. UTC
A fairly bare-bones set of device-tree bindings so libnvdimm can be used
on powerpc and other, less cool, device-tree based platforms.

Cc: devicetree@vger.kernel.org
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
The current bindings are essentially this:

nonvolatile-memory {
	compatible = "nonvolatile-memory", "special-memory";
	ranges;

	region@0 {
		compatible = "nvdimm,byte-addressable";
		reg = <0x0 0x1000>;
	};

	region@1000 {
		compatible = "nvdimm,byte-addressable";
		reg = <0x1000 0x1000>;
	};
};

To handle interleave sets, etc. the plan was to add an extra property with the
interleave stride and a "mapping" property with <&DIMM, dimm-start-offset>
tuples for each dimm in the interleave set. Block MMIO regions can be added
with a different compatible type, but I'm not too concerned with them for
now.
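
For illustration only, an interleaved region might then look something
like the below. The "interleave-stride" property name and the &dimm
labels are placeholders, not something this patch defines:

nonvolatile-memory {
	compatible = "nonvolatile-memory", "special-memory";
	ranges;

	/* sketch: one region striped across two DIMMs */
	region@0 {
		compatible = "nvdimm,byte-addressable";
		reg = <0x0 0x2000>;
		interleave-stride = <0x80>;
		mapping = <&dimm0 0x0>, <&dimm1 0x0>;
	};
};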

Does this sound reasonable? Is there anything this scheme would make difficult?
---
 drivers/nvdimm/Kconfig     |  10 +++
 drivers/nvdimm/Makefile    |   1 +
 drivers/nvdimm/of_nvdimm.c | 209 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 220 insertions(+)
 create mode 100644 drivers/nvdimm/of_nvdimm.c

Comments

Mark Rutland June 27, 2017, 10:43 a.m. UTC | #1
Hi,

On Tue, Jun 27, 2017 at 08:28:49PM +1000, Oliver O'Halloran wrote:
> A fairly bare-bones set of device-tree bindings so libnvdimm can be used
> on powerpc and other, less cool, device-tree based platforms.

;)

> Cc: devicetree@vger.kernel.org
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> The current bindings are essentially this:
> 
> nonvolatile-memory {
> 	compatible = "nonvolatile-memory", "special-memory";
> 	ranges;
> 
> 	region@0 {
> 		compatible = "nvdimm,byte-addressable";
> 		reg = <0x0 0x1000>;
> 	};
> 
> 	region@1000 {
> 		compatible = "nvdimm,byte-addressable";
> 		reg = <0x1000 0x1000>;
> 	};
> };

This needs to have a proper binding document under
Documentation/devicetree/bindings/. Something like the reserved-memory
bindings would be a good template.

If we want the "nvdimm" vendor-prefix, that'll have to be reserved,
too (see Documentation/devicetree/bindings/vendor-prefixes.txt).

What is "special-memory"? What other memory types would be described
here?

What exactly does "nvdimm,byte-addressable" imply? I suspect that you
also expect such memory to be compatible with mappings using (some)
cacheable attributes?

Perhaps the byte-addressable property should be a boolean property on
the region, rather than part of the compatible string.
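
i.e. something like the below, with "byte-addressable" expressed as a
boolean property rather than encoded in the compatible string (names
purely illustrative):

	region@0 {
		compatible = "nvdimm,region";
		reg = <0x0 0x1000>;
		byte-addressable;
	};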

> To handle interleave sets, etc. the plan was to add an extra property with the
> interleave stride and a "mapping" property with <&DIMM, dimm-start-offset>
> tuples for each dimm in the interleave set. Block MMIO regions can be added
> with a different compatible type, but I'm not too concerned with them for
> now.

Sorry, I'm not too familiar with nonvolatile memory. What are interleave
sets?

What are block MMIO regions?

Is there any documentation one can refer to for any of this?

[...]

> +static const struct of_device_id of_nvdimm_bus_match[] = {
> +	{ .compatible = "nonvolatile-memory" },
> +	{ .compatible = "special-memory" },
> +	{ },
> +};

Why both? Is the driver handling other "special-memory"?

Thanks,
Mark.
Oliver O'Halloran June 27, 2017, 2:05 p.m. UTC | #2
Hi Mark,

Thanks for the review and sorry, I really should have added more
context. I was originally just going to send this to the linux-nvdimm
list, but I figured the wider device-tree community might be
interested too.

Preamble:

Non-volatile DIMMs (nvdimms) are otherwise normal DDR DIMMs that are
based on some kind of non-volatile memory with DRAM-like performance
(i.e. not flash). The best known example would probably be Intel's 3D
XPoint technology, but there are a few others around. The non-volatile
aspect makes them useful as storage devices, and being part of the
memory space allows the backing storage to be exposed to userspace via
mmap(), provided the kernel supports it. The mmap() trick is enabled by
the kernel's "direct access" (DAX) support.

With that out of the way...

On Tue, Jun 27, 2017 at 8:43 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi,
>
> On Tue, Jun 27, 2017 at 08:28:49PM +1000, Oliver O'Halloran wrote:
>> A fairly bare-bones set of device-tree bindings so libnvdimm can be used
>> on powerpc and other, less cool, device-tree based platforms.
>
> ;)
>
>> Cc: devicetree@vger.kernel.org
>> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
>> ---
>> The current bindings are essentially this:
>>
>> nonvolatile-memory {
>>       compatible = "nonvolatile-memory", "special-memory";
>>       ranges;
>>
>>       region@0 {
>>               compatible = "nvdimm,byte-addressable";
>>               reg = <0x0 0x1000>;
>>       };
>>
>>       region@1000 {
>>               compatible = "nvdimm,byte-addressable";
>>               reg = <0x1000 0x1000>;
>>       };
>> };
>
> This needs to have a proper binding document under
> Documentation/devicetree/bindings/. Something like the reserved-memory
> bindings would be a good template.
>
> If we want the "nvdimm" vendor-prefix, that'll have to be reserved,
> too (see Documentation/devicetree/bindings/vendor-prefixes.txt).

It's on my TODO list; I just wanted to get some comments on the
overall approach before doing the rest of the grunt work.

>
> What is "special-memory"? What other memory types would be described
> here?
>
> What exactly does "nvdimm,byte-addressable" imply? I suspect that you
> also expect such memory to be compatible with mappings using (some)
> cacheable attributes?

I think it's always been assumed that nvdimm memory can be treated as
cacheable system memory for all intents and purposes. It might be
useful to be able to override that assumption on a per-bus or
per-region basis, though.
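
e.g. a per-region override could be expressed as a boolean property
along these lines (hypothetical, not part of this series):

	region@0 {
		compatible = "nvdimm,byte-addressable";
		reg = <0x0 0x1000>;
		non-cacheable;
	};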

>
> Perhaps the byte-addressable property should be a boolean property on
> the region, rather than part of the compatible string.
See below.

>> To handle interleave sets, etc. the plan was to add an extra property with the
>> interleave stride and a "mapping" property with <&DIMM, dimm-start-offset>
>> tuples for each dimm in the interleave set. Block MMIO regions can be added
>> with a different compatible type, but I'm not too concerned with them for
>> now.
>
> Sorry, I'm not too familiar with nonvolatile memory. What are interleave
> sets?

An interleave set refers to a group of DIMMs which share a physical
address range. The addresses in the range are assigned to different
backing DIMMs in a repeating pattern to improve performance, e.g. with
a 128 byte stride:

Addr 0 to 127 are on DIMM0, Addr 128 to 255 are on DIMM1, Addr 256 to
383 are back on DIMM0, etc, etc.

Software needs to be aware of the interleave pattern so it can
localise memory errors to a specific DIMM.
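
In binding terms that example might be encoded roughly as below, using
the tentative "mapping" and stride properties sketched in the commit
message (names still up for discussion):

	region@0 {
		compatible = "nvdimm,byte-addressable";
		reg = <0x0 0x2000>;
		interleave-stride = <0x80>;	/* 128 bytes */
		mapping = <&dimm0 0x0>, <&dimm1 0x0>;
	};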

>
> What are block MMIO regions?

NVDIMMs come in two flavours: byte addressable and block aperture. The
byte addressable type can be treated as conventional memory, while
block aperture regions are essentially MMIO block devices. Their
contents are accessed via an MMIO window rather than being presented
to the system as RAM, so they don't have any of the features that make
NVDIMMs interesting. It would be nice if we could punt them into a
different driver; unfortunately, ACPI allows storage on one DIMM to be
partitioned into byte addressable and block regions, and libnvdimm
provides the management interface for both. Dan Williams, who
maintains libnvdimm and the ACPI interface to it, would be a better
person to ask about the finer details.

>
> Is there any documentation one can refer to for any of this?

Documentation/nvdimm/nvdimm.txt has a fairly detailed overview of how
libnvdimm operates. The short version is that libnvdimm provides an
"nvdimm_bus" container for "regions" and "dimms." Regions are chunks
of memory and come in the block or byte types mentioned above, while
DIMMs refer to the physical devices. A firmware-specific driver
converts the firmware's hardware description into a set of DIMMs, a
set of regions, and a set of relationships between the two.

On top of that, regions are partitioned into "namespaces", which are
then exported to userspace either as a block device (with PAGE_SIZE
blocks) or as a "DAX device." In the block device case a filesystem is
used to manage the storage, and provided the filesystem supports FS_DAX
and is mounted with -o dax, mmap() calls will map the backing memory
directly rather than buffering IO in the page cache. DAX devices can
be mmap()ed to access the backing storage directly, so all the
management issues can be punted to userspace.

>
> [...]
>
>> +static const struct of_device_id of_nvdimm_bus_match[] = {
>> +     { .compatible = "nonvolatile-memory" },
>> +     { .compatible = "special-memory" },
>> +     { },
>> +};
>
> Why both? Is the driver handling other "special-memory"?

This is one of the things I was hoping the community could help
decide. "nonvolatile-memory" is probably a more accurate description
of the current usage, but the functionality does have other uses. The
interface might be useful for exposing any kind of memory with special
characteristics, like high-bandwidth memory or memory on a coherent
accelerator.
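
e.g. some other flavour of special memory could reuse the same bus
binding with a different region compatible (all names invented for
illustration):

special-memory {
	compatible = "special-memory";
	ranges;

	region@0 {
		compatible = "example,high-bandwidth-memory";
		reg = <0x0 0x1000>;
	};
};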

Thanks,
Oliver

Patch

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5bdd499b5f4f..72d147b55596 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -102,4 +102,14 @@  config NVDIMM_DAX
 
 	  Select Y if unsure
 
+config OF_NVDIMM
+	tristate "Device-tree support for NVDIMMs"
+	depends on OF
+	default LIBNVDIMM
+	help
+	  Allows byte addressable persistent memory regions to be described in the
+	  device-tree.
+
+	  Select Y if unsure.
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 909554c3f955..622961f4849d 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -3,6 +3,7 @@  obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
+obj-$(CONFIG_OF_NVDIMM) += of_nvdimm.o
 
 nd_pmem-y := pmem.o
 
diff --git a/drivers/nvdimm/of_nvdimm.c b/drivers/nvdimm/of_nvdimm.c
new file mode 100644
index 000000000000..359808200feb
--- /dev/null
+++ b/drivers/nvdimm/of_nvdimm.c
@@ -0,0 +1,209 @@ 
+/*
+ * Copyright 2017, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ */
+
+#define pr_fmt(fmt) "of_nvdimm: " fmt
+
+#include <linux/of_platform.h>
+#include <linux/of_address.h>
+#include <linux/libnvdimm.h>
+#include <linux/module.h>
+#include <linux/ioport.h>
+#include <linux/slab.h>
+
+static const struct attribute_group *region_attr_groups[] = {
+	&nd_region_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static int of_nvdimm_add_byte(struct nvdimm_bus *bus, struct device_node *np)
+{
+	struct nd_region_desc ndr_desc;
+	struct resource temp_res;
+	struct nd_region *region;
+
+	/*
+	 * byte regions should only have one address range
+	 */
+	if (of_address_to_resource(np, 0, &temp_res)) {
+		pr_warn("Unable to parse reg[0] for %s\n", np->full_name);
+		return -ENXIO;
+	}
+
+	pr_debug("Found %pR for %s\n", &temp_res, np->full_name);
+
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	ndr_desc.res = &temp_res;
+	ndr_desc.attr_groups = region_attr_groups;
+#ifdef CONFIG_NUMA
+	ndr_desc.numa_node = of_node_to_nid(np);
+#endif
+	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+
+	region = nvdimm_pmem_region_create(bus, &ndr_desc);
+	if (!region)
+		return -ENXIO;
+
+	/*
+	 * Bind the region to the OF node we spawned it from. We
+	 * already bumped the node's refcount while walking the
+	 * bus.
+	 */
+	to_nd_region_dev(region)->of_node = np;
+
+	return 0;
+}
+
+/*
+ * 'data' is a pointer to the function that handles registering the device
+ * on the nvdimm bus.
+ */
+static struct of_device_id of_nvdimm_dev_types[] = {
+	{ .compatible = "nvdimm,byte-addressable", .data = of_nvdimm_add_byte },
+	{ },
+};
+
+static void of_nvdimm_parse_one(struct nvdimm_bus *bus,
+		struct device_node *node)
+{
+	int (*parse_node)(struct nvdimm_bus *, struct device_node *);
+	const struct of_device_id *match;
+	int rc;
+
+	if (of_node_test_and_set_flag(node, OF_POPULATED)) {
+		pr_debug("%s already parsed, skipping\n",
+			node->full_name);
+		return;
+	}
+
+	match = of_match_node(of_nvdimm_dev_types, node);
+	if (!match) {
+		pr_info("No compatible match for '%s'\n",
+			node->full_name);
+		of_node_clear_flag(node, OF_POPULATED);
+		return;
+	}
+
+	of_node_get(node);
+	parse_node = match->data;
+	rc = parse_node(bus, node);
+
+	if (rc) {
+		of_node_clear_flag(node, OF_POPULATED);
+		of_node_put(node);
+	}
+
+	pr_debug("Parsed %s, rc = %d\n", node->full_name, rc);
+
+	return;
+}
+
+/*
+ * The nvdimm core refers to the bus descriptor structure at runtime
+ * so we need to keep it around. Note that this is different to region
+ * descriptors which can be stack allocated.
+ */
+struct of_nd_bus {
+	struct nvdimm_bus_descriptor desc;
+	struct nvdimm_bus *bus;
+};
+
+static const struct attribute_group *bus_attr_groups[] = {
+	&nvdimm_bus_attribute_group,
+	NULL,
+};
+
+static int of_nvdimm_probe(struct platform_device *pdev)
+{
+	struct device_node *node, *child;
+	struct of_nd_bus *of_nd_bus;
+
+	node = dev_of_node(&pdev->dev);
+	if (!node)
+		return -ENXIO;
+
+	of_nd_bus = kzalloc(sizeof(*of_nd_bus), GFP_KERNEL);
+	if (!of_nd_bus)
+		return -ENOMEM;
+
+	of_nd_bus->desc.attr_groups = bus_attr_groups;
+	of_nd_bus->desc.provider_name = "of_nvdimm";
+	of_nd_bus->desc.module = THIS_MODULE;
+	of_nd_bus->bus = nvdimm_bus_register(&pdev->dev, &of_nd_bus->desc);
+	if (!of_nd_bus->bus)
+		goto err;
+
+	to_nvdimm_bus_dev(of_nd_bus->bus)->of_node = node;
+
+	/* now walk the node bus and setup regions, etc */
+	for_each_available_child_of_node(node, child)
+		of_nvdimm_parse_one(of_nd_bus->bus, child);
+
+	platform_set_drvdata(pdev, of_nd_bus);
+
+	return 0;
+
+err:
+	nvdimm_bus_unregister(of_nd_bus->bus);
+	kfree(of_nd_bus);
+	return -ENXIO;
+}
+
+static int of_nvdimm_remove(struct platform_device *pdev)
+{
+	struct of_nd_bus *bus = platform_get_drvdata(pdev);
+	struct device_node *node;
+
+	if (!bus)
+		return 0; /* possible? */
+
+	for_each_available_child_of_node(pdev->dev.of_node, node) {
+		if (!of_node_check_flag(node, OF_POPULATED))
+			continue;
+
+		of_node_clear_flag(node, OF_POPULATED);
+		of_node_put(node);
+		pr_debug("de-populating %s\n", node->full_name);
+	}
+
+	nvdimm_bus_unregister(bus->bus);
+	kfree(bus);
+
+	return 0;
+}
+
+static const struct of_device_id of_nvdimm_bus_match[] = {
+	{ .compatible = "nonvolatile-memory" },
+	{ .compatible = "special-memory" },
+	{ },
+};
+
+static struct platform_driver of_nvdimm_driver = {
+	.probe = of_nvdimm_probe,
+	.remove = of_nvdimm_remove,
+	.driver = {
+		.name = "of_nvdimm",
+		.owner = THIS_MODULE,
+		.of_match_table = of_nvdimm_bus_match,
+	},
+};
+
+module_platform_driver(of_nvdimm_driver);
+MODULE_DEVICE_TABLE(of, of_nvdimm_bus_match);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("IBM Corporation");