
[1/8] pmem: Initial version of persistent memory driver

Message ID 54F84420.40209@plexistor.com (mailing list archive)
State New, archived

Commit Message

Boaz Harrosh March 5, 2015, 11:55 a.m. UTC
From: Ross Zwisler <ross.zwisler@linux.intel.com>

PMEM is a new driver that exposes any physically contiguous iomem range
as a single block device. It supports as many iomem ranges as needed,
each presented as its own device.

The driver is not only useful for NVDIMMs; it works with any flat
memory-mapped device. We've used it with NVDIMMs, kernel-reserved DRAM
(memmap= on the command line), PCIe battery-backed memory cards, VM shared
memory, and so on.

The API to the pmem module is a single string parameter named "map",
of the form:
		 map=mapS[,mapS...]

		 where mapS=nn[KMG]$ss[KMG],
		 or    mapS=nn[KMG]@ss[KMG],

		 nn=size, ss=offset

The syntax is the same as the kernel command-line memmap= parameter,
so anything you specified at grub can simply be copied and pasted here.

The "@" form is exactly the same as the "$" form only that
at bash prompt we need to escape the "$" with \$ so also
support the '@' char for convenience.

A device is created for each specified mapS; an illustration follows.
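
For illustration only (the addresses below are made up; use whatever
range you actually reserved, e.g. via memmap= at grub), loading two
devices could look like:

		 modprobe pmem map=4G\$8G,2G@16G

which would create /dev/pmem0 (4G at offset 8G) and /dev/pmem1
(2G at offset 16G).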

[This is the accumulated version of the driver developed by
 multiple programmers. To see the real history of these
 patches see:
	git://git.open-osd.org/pmem.git
	https://github.com/01org/prd
 This patch is based on (git://git.open-osd.org/pmem.git):
	[5ccf703] SQUASHME: Don't clobber the map module param

<list-of-changes>
[boaz]
SQUASHME: pmem: Remove unused #include headers
SQUASHME: pmem: Request from fdisk 4k alignment
SQUASHME: pmem: Let each device manage private memory region
SQUASHME: pmem: Support of multiple memory regions
SQUASHME: pmem: Micro optimization the hotpath 001
SQUASHME: pmem: no need to copy a page at a time
SQUASHME: pmem that 4k sector thing
SQUASHME: pmem: Cleanliness is neat
SQUASHME: Don't clobber the map module param
SQUASHME: pmem: Few changes to Initial version of pmem
SQUASHME: Changes to copyright text (trivial)
</list-of-changes>

TODO: Add Documentation/blockdev/pmem.txt

Need-signed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 MAINTAINERS            |   7 ++
 drivers/block/Kconfig  |  18 +++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 334 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 360 insertions(+)
 create mode 100644 drivers/block/pmem.c

Comments

Paul Bolle March 5, 2015, 8:35 p.m. UTC | #1
For what it's worth, all I've found so far is a license nit.

On Thu, 2015-03-05 at 13:55 +0200, Boaz Harrosh wrote:
> --- /dev/null
> +++ b/drivers/block/pmem.c
> @@ -0,0 +1,334 @@
> +/*
> + * Persistent Memory Driver
> + * Copyright (c) 2014, Intel Corporation.
> + * Copyright (c) 2014, Boaz Harrosh <boaz@plexistor.com>.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

This states the license is GPL v2.

> +MODULE_LICENSE("GPL");

So you probably want
    MODULE_LICENSE("GPL v2");

here.


Paul Bolle
Andy Lutomirski March 5, 2015, 11:03 p.m. UTC | #2
On Mar 5, 2015 3:55 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
>
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>
> PMEM is a new driver That supports any physical contiguous iomem range
> as a single block device. The driver has support for as many as needed
> iomem ranges each as its own device.
>
> The driver is not only good for NvDIMMs, It is good for any flat memory
> mapped device. We've used it with NvDIMMs, Kernel reserved DRAM
> (memmap= on command line), PCIE Battery backed memory cards, VM shared
> memory, and so on.
>
> The API to pmem module a single string parameter named "map"
> of the form:
>                  map=mapS[,mapS...]
>
>                  where mapS=nn[KMG]$ss[KMG],
>                  or    mapS=nn[KMG]@ss[KMG],
>
>                  nn=size, ss=offset
>
> Just like the Kernel command line map && memmap parameters,
> so anything you did at grub just copy/paste to here.
>
> The "@" form is exactly the same as the "$" form only that
> at bash prompt we need to escape the "$" with \$ so also
> support the '@' char for convenience.
>
> For each specified mapS there will be a device created.

[...]

> +       pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);

I think it would be nice to have control over the caching mode.
Depending on the application, WT or UC could make more sense.

--Andy
Boaz Harrosh March 9, 2015, 12:20 p.m. UTC | #3
On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
<>
> 
> I think it would be nice to have control over the caching mode.
> Depending on the application, WT or UC could make more sense.
> 

Patches are welcome; say something like:
	map=sss@aaa:WT,sss@aaa:CA, ...

But for us, with direct_access(), all benchmarks show a slight advantage
for the cached mode.

Thanks
Boaz
Ross Zwisler March 18, 2015, 5:43 p.m. UTC | #4
On Thu, 2015-03-05 at 13:55 +0200, Boaz Harrosh wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> PMEM is a new driver That supports any physical contiguous iomem range
> as a single block device. The driver has support for as many as needed
> iomem ranges each as its own device.
> 
> The driver is not only good for NvDIMMs, It is good for any flat memory
> mapped device. We've used it with NvDIMMs, Kernel reserved DRAM
> (memmap= on command line), PCIE Battery backed memory cards, VM shared
> memory, and so on.
> 
> The API to pmem module a single string parameter named "map"
> of the form:
> 		 map=mapS[,mapS...]
> 
> 		 where mapS=nn[KMG]$ss[KMG],
> 		 or    mapS=nn[KMG]@ss[KMG],
> 
> 		 nn=size, ss=offset
> 
> Just like the Kernel command line map && memmap parameters,
> so anything you did at grub just copy/paste to here.
> 
> The "@" form is exactly the same as the "$" form only that
> at bash prompt we need to escape the "$" with \$ so also
> support the '@' char for convenience.
> 
> For each specified mapS there will be a device created.
> 
> [This is the accumulated version of the driver developed by
>  multiple programmers. To see the real history of these
>  patches see:
> 	git://git.open-osd.org/pmem.git
> 	https://github.com/01org/prd
>  This patch is based on (git://git.open-osd.org/pmem.git):
> 	[5ccf703] SQUASHME: Don't clobber the map module param
> 
> <list-of-changes>
> [boaz]
> SQUASHME: pmem: Remove unused #include headers
> SQUASHME: pmem: Request from fdisk 4k alignment
> SQUASHME: pmem: Let each device manage private memory region
> SQUASHME: pmem: Support of multiple memory regions
> SQUASHME: pmem: Micro optimization the hotpath 001
> SQUASHME: pmem: no need to copy a page at a time
> SQUASHME: pmem that 4k sector thing
> SQUASHME: pmem: Cleanliness is neat
> SQUASHME: Don't clobber the map module param
> SQUASHME: pmem: Few changes to Initial version of pmem
> SQUASHME: Changes to copyright text (trivial)
> </list-of-changes>
> 
> TODO: Add Documentation/blockdev/pmem.txt
> 
> Need-signed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>

I wrote the initial version of the PMEM driver (then called PRD for Persistent
RAM Driver) in late 2013/early 2014, and posted it on GitHub.  Here's a link
to my first version:

https://github.com/01org/prd/tree/prd_3.13

Matthew Wilcox pointed Boaz to it in June of 2014, and he cloned my tree and
went off and made a bunch of changes.  A few of those changes he sent back to
me, like the one I included in the patch series I recently sent for upstream
inclusion:

https://lkml.org/lkml/2015/3/16/1102

Many of the changes he did not submit back to me for review or inclusion in my
tree.

With the first patch in this series Boaz is squashing all of our changes
together, adding his copyright and trying to install himself as maintainer.  I
believe this to be unacceptable.  

Boaz, if you have contributions that you would like to make to PMEM, please
submit them to our mailing list (linux-nvdimm@lists.01.org) and we will be
happy to review them.  But please don't try and steal control of my driver.

- Ross
Andy Lutomirski March 18, 2015, 6:06 p.m. UTC | #5
On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
>
> On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
> <>
> >
> > I think it would be nice to have control over the caching mode.
> > Depending on the application, WT or UC could make more sense.
> >
>
> Patches are welcome. say
>         map=sss@aaa:WT,sss@aaa:CA, ...
>
> But for us, with direct_access(), all benchmarks show a slight advantage
> for the cached mode.

I'm sure cached is faster.  The question is: who flushes the cache?

--Andy
Boaz Harrosh March 19, 2015, 9:24 a.m. UTC | #6
On 03/18/2015 07:43 PM, Ross Zwisler wrote:
> On Thu, 2015-03-05 at 13:55 +0200, Boaz Harrosh wrote:
>> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>>
>> PMEM is a new driver That supports any physical contiguous iomem range
>> as a single block device. The driver has support for as many as needed
>> iomem ranges each as its own device.
>>
>> The driver is not only good for NvDIMMs, It is good for any flat memory
>> mapped device. We've used it with NvDIMMs, Kernel reserved DRAM
>> (memmap= on command line), PCIE Battery backed memory cards, VM shared
>> memory, and so on.
>>
>> The API to pmem module a single string parameter named "map"
>> of the form:
>> 		 map=mapS[,mapS...]
>>
>> 		 where mapS=nn[KMG]$ss[KMG],
>> 		 or    mapS=nn[KMG]@ss[KMG],
>>
>> 		 nn=size, ss=offset
>>
>> Just like the Kernel command line map && memmap parameters,
>> so anything you did at grub just copy/paste to here.
>>
>> The "@" form is exactly the same as the "$" form only that
>> at bash prompt we need to escape the "$" with \$ so also
>> support the '@' char for convenience.
>>
>> For each specified mapS there will be a device created.
>>
>> [This is the accumulated version of the driver developed by
>>  multiple programmers. To see the real history of these
>>  patches see:
>> 	git://git.open-osd.org/pmem.git
>> 	https://github.com/01org/prd
>>  This patch is based on (git://git.open-osd.org/pmem.git):
>> 	[5ccf703] SQUASHME: Don't clobber the map module param
>>
>> <list-of-changes>
>> [boaz]
>> SQUASHME: pmem: Remove unused #include headers
>> SQUASHME: pmem: Request from fdisk 4k alignment
>> SQUASHME: pmem: Let each device manage private memory region
>> SQUASHME: pmem: Support of multiple memory regions
>> SQUASHME: pmem: Micro optimization the hotpath 001
>> SQUASHME: pmem: no need to copy a page at a time
>> SQUASHME: pmem that 4k sector thing
>> SQUASHME: pmem: Cleanliness is neat
>> SQUASHME: Don't clobber the map module param
>> SQUASHME: pmem: Few changes to Initial version of pmem
>> SQUASHME: Changes to copyright text (trivial)
>> </list-of-changes>
>>
>> TODO: Add Documentation/blockdev/pmem.txt
>>
>> Need-signed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> 
> I wrote the initial version of the PMEM driver (then called PRD for Persistent
> RAM Driver) in late 2013/early 2014, and posted it on GitHub.  Here's a link
> to my first version:
> 
> https://github.com/01org/prd/tree/prd_3.13
> 
> Matthew Wilcox pointed Boaz to it in June of 2014, and he cloned my tree and
> went off and made a bunch of changes.  A few of those changes he sent back to
> me, like the one I included in the patch series I recently sent for upstream
> inclusion:
> 
> https://lkml.org/lkml/2015/3/16/1102
> 
> Many of the changes he did not submit back to me for review or inclusion in my
> tree.
> 
> With the first patch in this series Boaz is squashing all of our changes
> together, adding his copyright and trying to install himself as maintainer.  I
> believe this to be unacceptable.  
> 
> Boaz, if you have contributions that you would like to make to PMEM, please
> submit them to our mailing list (linux-nvdimm@lists.01.org) and we will be
> happy to review them.  But please don't try and steal control of my driver.
> 

I apologize. It is not my intention to hijack your project. All but the last
two changes I have posted again and again; for those changes I have also said
that I maintain them in a public tree, and I made them available publicly as
soon as possible. I stopped sending the last two patches because it felt like
I was spamming the list, since none of my patches got any comments or were
accepted into your tree.

It was my impression that you did not want to bother with further development:
your tree was stuck on 3.17, while I was rebasing on every major Linux release,
adding my changes as they accumulated over time.

For example, the patch you mention accepting into your tree was just a staging
patch for the more important change that throws away the toy Kconfig and the
three module params and puts in a real-world, actually usable, long-term API.
Why did you not take that patch?

So I was under the impression that you did not want to maintain this driver,
and I was forced to fork the project and move on. What other choice did I have?

About the added copyright: diffing your original driver (without any of my
changes, and including all the partition bugs) against the current code with
the changed API and the IO-path cleanup, it comes out at less than 30%
similarity. As a courtesy to my employer, I think he is entitled to an added
copyright.

But let us not fight. If you want to maintain this thing, start by squashing
all my changes plus all the other added patches and publishing them in your
tree. I need this driver to be usable.

> - Ross
> 
> 

Thanks
Boaz
Dan Williams March 20, 2015, 12:11 a.m. UTC | #7
On Thu, Mar 19, 2015 at 2:24 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> I apologize. It is not my intention to hijack your project. All but the last
> 2 changes I have posted again and again, even those changes I have said that
> I maintain them in a public tree, and made them available publicly ASAP. I
> stopped sending the (last 2) patches because it felt like I was spamming the
> list, since none of my patches got any comments or have been accepted to your
> tree.

That's not true.  We talked about your "map=" proposal at length back
in September.  You concluded "That the discovery should be elsewhere
in an ARCH/driver LLD and pmem stays generic." [1].  A generic
approach is being specified by the ACPI Working Group and will be
released "Real Soon Now (TM)" (on the order of weeks not months).  My
first choice would be to finish waiting for that specification before
we upstream a pmem driver.  Outside of that, if we need a pmem driver
"now", Ross's version has the nice property of having an easier to
revert resource discovery mechanism.  The kernel command line is
arguably an ABI and the need for "map=" is obviated by a generic
resource discovery mechanism.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2014-September/000043.html
Elliott, Robert (Server Storage) March 26, 2015, 4 a.m. UTC | #8
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Andy Lutomirski
> Sent: Wednesday, March 18, 2015 1:07 PM
> To: Boaz Harrosh
> Cc: Matthew Wilcox; Ross Zwisler; X86 ML; Thomas Gleixner; Dan Williams;
> Ingo Molnar; Roger C. Pao; linux-nvdimm; linux-kernel; H. Peter Anvin;
> Christoph Hellwig
> Subject: Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
> 
> On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
> >
> > On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
> > <>
> > >
> > > I think it would be nice to have control over the caching mode.
> > > Depending on the application, WT or UC could make more sense.
> > >
> >
> > Patches are welcome. say
> >         map=sss@aaa:WT,sss@aaa:CA, ...
> >
> > But for us, with direct_access(), all benchmarks show a slight advantage
> > for the cached mode.
> 
> I'm sure cached is faster.  The question is: who flushes the cache?
> 
> --Andy

Nobody.

Therefore, pmem as currently proposed (mapping the memory with
ioremap_cache, which uses _PAGE_CACHE_MODE_WB) is unsafe unless the
system is doing something special to ensure L1, L2, and L3 caches are
flushed on power loss.

I think pmem needs to map the memory as UC or WT by default, providing
WB and WC only as an option for users confident that those attributes
are safe to use in their system.

Even using UC or WT presumes that ADR is in place.
Boaz Harrosh March 26, 2015, 7:51 a.m. UTC | #9
On 03/26/2015 06:00 AM, Elliott, Robert (Server Storage) wrote:
> 
> 
>> -----Original Message-----
>> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
>> owner@vger.kernel.org] On Behalf Of Andy Lutomirski
>> Sent: Wednesday, March 18, 2015 1:07 PM
>> To: Boaz Harrosh
>> Cc: Matthew Wilcox; Ross Zwisler; X86 ML; Thomas Gleixner; Dan Williams;
>> Ingo Molnar; Roger C. Pao; linux-nvdimm; linux-kernel; H. Peter Anvin;
>> Christoph Hellwig
>> Subject: Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
>>
>> On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
>>>
>>> On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
>>> <>
>>>>
>>>> I think it would be nice to have control over the caching mode.
>>>> Depending on the application, WT or UC could make more sense.
>>>>
>>>
>>> Patches are welcome. say
>>>         map=sss@aaa:WT,sss@aaa:CA, ...
>>>
>>> But for us, with direct_access(), all benchmarks show a slight advantage
>>> for the cached mode.
>>
>> I'm sure cached is faster.  The question is: who flushes the cache?
>>
>> --Andy
> 
> Nobody.
> 
> Therefore, pmem as currently proposed (mapping the memory with
> ioremap_cache, which uses _PAGE_CACHE_MODE_WB) is unsafe unless the
> system is doing something special to ensure L1, L2, and L3 caches are
> flushed on power loss.
> 
> I think pmem needs to map the memory as UC or WT by default, providing
> WB and WC only as an option for users confident that those attributes
> are safe to use in their system.
> 
> Even using UC or WT presumes that ADR is in place.
> 

I will add command line options for these modes per range. (Unless you
care to send a patch before me)

Thanks, this is a good idea
Boaz
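
A minimal sketch of what such a per-range mode could translate to on the
ioremap side, assuming the mode has already been parsed from an optional
suffix of the map= string (the enum, the helper name and the availability
of ioremap_wt() are assumptions here, not part of the posted patch):

	/* needs <linux/io.h>; ioremap_wt() may not exist on all arches/kernels */
	enum pmem_cache_mode { PMEM_MODE_WB, PMEM_MODE_WT, PMEM_MODE_UC };

	/* Hypothetical helper: pick the ioremap flavour for one parsed range. */
	static void __iomem *pmem_map_range(phys_addr_t phys, size_t size,
					    enum pmem_cache_mode mode)
	{
		switch (mode) {
		case PMEM_MODE_UC:
			return ioremap_nocache(phys, size);	/* uncached */
		case PMEM_MODE_WT:
			return ioremap_wt(phys, size);		/* write-through */
		case PMEM_MODE_WB:
		default:
			return ioremap_cache(phys, size);	/* current behaviour */
		}
	}
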
Dave Chinner March 26, 2015, 9:31 p.m. UTC | #10
On Thu, Mar 26, 2015 at 04:00:57AM +0000, Elliott, Robert (Server Storage) wrote:
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > owner@vger.kernel.org] On Behalf Of Andy Lutomirski
> > Sent: Wednesday, March 18, 2015 1:07 PM
> > To: Boaz Harrosh
> > Cc: Matthew Wilcox; Ross Zwisler; X86 ML; Thomas Gleixner; Dan Williams;
> > Ingo Molnar; Roger C. Pao; linux-nvdimm; linux-kernel; H. Peter Anvin;
> > Christoph Hellwig
> > Subject: Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
> > 
> > On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
> > >
> > > On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
> > > <>
> > > >
> > > > I think it would be nice to have control over the caching mode.
> > > > Depending on the application, WT or UC could make more sense.
> > > >
> > >
> > > Patches are welcome. say
> > >         map=sss@aaa:WT,sss@aaa:CA, ...
> > >
> > > But for us, with direct_access(), all benchmarks show a slight advantage
> > > for the cached mode.
> > 
> > I'm sure cached is faster.  The question is: who flushes the cache?
> > 
> > --Andy
> 
> Nobody.

There is another discussion going on about ensuring we have
mechanisms to flush the cpu caches correctly when DAX is enabled and
data integrity operations are run.  i.e. fsync and sync will provide
cache flush triggers for DAX enabled devices once we get everything
in place.

Cheers,

Dave.
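
To make the flushing question concrete: on x86 the write branch of
pmem_do_bvec() would need something along the lines of the sketch below
before the driver could claim durability. This is an illustration only,
not part of the posted patch; clflush_cache_range() comes from the
x86-only <asm/cacheflush.h>, and even then durability still presumes ADR,
as Robert notes above.

	/* Hypothetical write-durability helper (x86-only sketch).
	 * clflush_cache_range() writes back the dirty cache lines covering
	 * the range; wmb() orders the flushes before completion is reported
	 * to the block layer.
	 */
	static void pmem_flush_range(void *addr, size_t len)
	{
		clflush_cache_range(addr, len);
		wmb();
	}

	/* ...called from pmem_do_bvec() after the memcpy() in the write path:
	 *	pmem_flush_range(pmem->virt_addr + pmem_off, len);
	 */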

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index ddc5a8c..21c5384 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8053,6 +8053,13 @@  S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+M:	Boaz Harrosh <boaz@plexistor.com>
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1b8094d..1530c2a 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -404,6 +404,24 @@  config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "pmem: Persistent memory block device support"
+	help
+	  If you have Persistent memory in your system say Y/m
+	  here. The driver can support real Persistent memory chips
+	  such as NVDIMMs , as well as volatile memory that was set
+	  aside from Kernel use by the "memmap" kernel parameter.
+	  And/or any contiguous physical memory ranges that you want
+	  to represent as a block device. (Even PCIE flat memory mapped
+	  devices)
+	  See Documentation/block/pmem.txt for how to use
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called pmem. Created Devices will be named: /dev/pmemX
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..9cc6c18 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@  obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 0000000..02cd118
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,334 @@ 
+/*
+ * Persistent Memory Driver
+ * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2014, Boaz Harrosh <boaz@plexistor.com>.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver's skeleton is based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+	struct list_head	pmem_list;
+
+	/* One contiguous memory region per device */
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page, uint len,
+			 uint off, int rw, sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+	size_t pmem_off = sector << 9;
+
+	BUG_ON(pmem_off >= pmem->size);
+
+	if (rw == READ) {
+		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+		flush_dcache_page(page);
+	} else {
+		/*
+		 * FIXME: Need more involved flushing to ensure that writes to
+		 * NVDIMMs are actually durable before returning.
+		 */
+		flush_dcache_page(page);
+		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	if (unlikely(bio_end_sector(bio) > get_capacity(bdev->bd_disk))) {
+		err = -EIO;
+		goto out;
+	}
+
+	if (WARN_ON(bio->bi_rw & REQ_DISCARD)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	sector = bio->bi_iter.bi_sector;
+	bio_for_each_segment(bvec, bio, iter) {
+		/* NOTE: There is a legend saying that bv_len might be
+		 * bigger than PAGE_SIZE in the case that bv_page points to
+		 * a physical contiguous PFN set. But for us it is fine because
+		 * it means the Kernel virtual mapping is also contiguous. And
+		 * on the pmem side we are always contiguous both virtual and
+		 * physical
+		 */
+		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
+			     rw, sector);
+		sector += bvec.bv_len >> 9;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+};
+
+/* Kernel module stuff */
+static char *map;
+module_param(map, charp, S_IRUGO);
+MODULE_PARM_DESC(map,
+	"pmem device mapping: map=mapS[,mapS...] where:\n"
+	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
+
+static LIST_HEAD(pmem_devices);
+static int pmem_major;
+
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set virt_addr if successful.
+ */
+int pmem_mapmem(struct pmem_device *pmem)
+{
+	struct resource *res_mem;
+	int err;
+
+	res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
+					       "pmem");
+	if (unlikely(!res_mem)) {
+		pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			   pmem->phys_addr, pmem->size);
+		return -EINVAL;
+	}
+
+	pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
+	if (unlikely(!pmem->virt_addr)) {
+		err = -ENXIO;
+		goto out_release;
+	}
+	return 0;
+
+out_release:
+	release_mem_region(pmem->phys_addr, pmem->size);
+	return err;
+}
+
+void pmem_unmapmem(struct pmem_device *pmem)
+{
+	if (unlikely(!pmem->virt_addr))
+		return;
+
+	iounmap(pmem->virt_addr);
+	release_mem_region(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = NULL;
+}
+
+#define PMEM_ALIGNMEM PAGE_SIZE
+
+static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
+				      int i)
+{
+	struct pmem_device *pmem;
+	struct gendisk *disk;
+	int err;
+
+	if (unlikely((phys_addr & (PMEM_ALIGNMEM - 1)) ||
+		     (disk_size & (PMEM_ALIGNMEM - 1)))) {
+		pr_err("phys_addr=0x%llx disk_size=0x%zx must be 0x%lx aligned\n",
+		       phys_addr, disk_size, PMEM_ALIGNMEM);
+		err = -EINVAL;
+		goto out;
+	}
+
+	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+	if (unlikely(!pmem)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	pmem->phys_addr = phys_addr;
+	pmem->size = disk_size;
+
+	err = pmem_mapmem(pmem);
+	if (unlikely(err))
+		goto out_free_dev;
+
+	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	if (unlikely(!pmem->pmem_queue)) {
+		err = -ENOMEM;
+		goto out_unmap;
+	}
+
+	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+	/* This is so fdisk will align partitions on 4k, because of
+	 * direct_access API needing 4k alignment, returning a PFN
+	 */
+	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
+
+	disk = alloc_disk(0);
+	if (unlikely(!disk)) {
+		err = -ENOMEM;
+		goto out_free_queue;
+	}
+
+	disk->major		= pmem_major;
+	disk->first_minor	= 0;
+	disk->fops		= &pmem_fops;
+	disk->private_data	= pmem;
+	disk->queue		= pmem->pmem_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", i);
+	set_capacity(disk, disk_size >> 9);
+	pmem->pmem_disk = disk;
+
+	return pmem;
+
+out_free_queue:
+	blk_cleanup_queue(pmem->pmem_queue);
+out_unmap:
+	pmem_unmapmem(pmem);
+out_free_dev:
+	kfree(pmem);
+out:
+	return ERR_PTR(err);
+}
+
+static void pmem_free(struct pmem_device *pmem)
+{
+	put_disk(pmem->pmem_disk);
+	blk_cleanup_queue(pmem->pmem_queue);
+	pmem_unmapmem(pmem);
+	kfree(pmem);
+}
+
+static void pmem_del_one(struct pmem_device *pmem)
+{
+	list_del(&pmem->pmem_list);
+	del_gendisk(pmem->pmem_disk);
+	pmem_free(pmem);
+}
+
+static int pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size)
+{
+	char *p = map;
+
+	*size = (size_t)memparse(p, &p);
+	if ((p == map) || ((*p != '$') && (*p != '@')))
+		return -EINVAL;
+
+	if (!*(++p))
+		return -EINVAL;
+
+	*start = (phys_addr_t)memparse(p, &p);
+
+	return *p == '\0' ? 0 : -EINVAL;
+}
+
+static int __init pmem_init(void)
+{
+	int result, i;
+	struct pmem_device *pmem, *next;
+	char *p, *pmem_map, *map_dup;
+
+	if (unlikely(!map || !*map)) {
+		pr_err("pmem: must specify map=nn@ss parameter.\n");
+		return -EINVAL;
+	}
+
+	result = register_blkdev(0, "pmem");
+	if (unlikely(result < 0))
+		return -EIO;
+
+	pmem_major = result;
+
+	map_dup = pmem_map = kstrdup(map, GFP_KERNEL);
+	if (unlikely(!pmem_map)) {
+		pr_debug("pmem_init strdup(%s) failed\n", map);
+		return -ENOMEM;
+	}
+
+	i = 0;
+	while ((p = strsep(&pmem_map, ",")) != NULL) {
+		phys_addr_t phys_addr;
+		size_t disk_size;
+
+		if (!*p)
+			continue;
+		result = pmem_parse_map_one(p, &phys_addr, &disk_size);
+		if (result)
+			goto out_free;
+		pmem = pmem_alloc(phys_addr, disk_size, i);
+		if (IS_ERR(pmem)) {
+			result = PTR_ERR(pmem);
+			goto out_free;
+		}
+		list_add_tail(&pmem->pmem_list, &pmem_devices);
+		++i;
+	}
+
+	list_for_each_entry(pmem, &pmem_devices, pmem_list)
+		add_disk(pmem->pmem_disk);
+
+	pr_info("pmem: module loaded map=%s\n", map);
+	kfree(map_dup);
+	return 0;
+
+out_free:
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) {
+		list_del(&pmem->pmem_list);
+		pmem_free(pmem);
+	}
+	kfree(map_dup);
+	unregister_blkdev(pmem_major, "pmem");
+
+	return result;
+}
+
+static void __exit pmem_exit(void)
+{
+	struct pmem_device *pmem, *next;
+
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
+		pmem_del_one(pmem);
+
+	unregister_blkdev(pmem_major, "pmem");
+	pr_info("pmem: module unloaded\n");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL");
+module_init(pmem_init);
+module_exit(pmem_exit);