
[GIT,PULL] PMEM driver for v4.1

Message ID 20150413093309.GA30219@gmail.com (mailing list archive)
State New, archived

Commit Message

Ingo Molnar April 13, 2015, 9:33 a.m. UTC
Linus,

Please pull the latest x86-pmem-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pmem-for-linus

   # HEAD: 4c1eaa2344fb26bb5e936fb4d8ee307343ea0089 drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()

This is the initial support for the pmem block device driver: 
persistent non-volatile memory space mapped into the system's physical 
memory space as large physical memory regions.

The driver is based on Intel code, written by Ross Zwisler, with fixes 
by Boaz Harrosh, integrated with x86 e820 memory resource management 
and tidied up by Christoph Hellwig.

Note that there were two other separate pmem driver submissions to 
lkml, but apparently all parties (Ross Zwisler, Boaz Harrosh) are 
reasonably happy with this initial version.

This version provides minimal support that enables persistent memory 
devices out in the wild to work as block devices, identified through a 
magic (non-standard) e820 flag and auto-discovered if 
CONFIG_X86_PMEM_LEGACY=y, or added explicitly by manipulating the 
memory map via the "memmap=..." boot option with the new, special '!' 
modifier character.
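The '!' modifier can be sketched as follows. The addresses and sizes here are made-up examples; the syntax is memmap=nn[KMG]!ss[KMG] as documented in Documentation/kernel-parameters.txt by this series, and the device name is what the new driver is expected to register:

```shell
# Illustrative kernel command-line fragment (values are examples only):
# mark 4G of address space, starting at physical offset 12G, as a
# persistent-memory (E820_PRAM) region instead of regular RAM:
#
#     memmap=4G!12G
#
# With the pmem driver enabled (CONFIG_BLK_DEV_PMEM), the region should
# then show up as a block device, e.g.:
#
#     /dev/pmem0
```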

Limitations: this is a regular block device, and since the pmem areas 
are not struct page backed, they are invisible to the rest of the 
system (other than the block IO device), so direct IO to/from pmem 
areas, direct mmap() or XIP is not possible yet. The page cache will 
also shadow and double buffer pmem contents, etc.

Initial support is for x86.

 Thanks,

	Ingo

------------------>
Christoph Hellwig (1):
      x86/mm: Add support for the non-standard protected e820 type

Ingo Molnar (1):
      drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()

Ross Zwisler (1):
      drivers/block/pmem: Add a driver for persistent memory


 Documentation/kernel-parameters.txt |   6 +
 MAINTAINERS                         |   6 +
 arch/x86/Kconfig                    |  10 ++
 arch/x86/include/uapi/asm/e820.h    |  10 ++
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/e820.c              |  26 +++-
 arch/x86/kernel/pmem.c              |  53 ++++++++
 drivers/block/Kconfig               |  11 ++
 drivers/block/Makefile              |   1 +
 drivers/block/pmem.c                | 262 ++++++++++++++++++++++++++++++++++++
 10 files changed, 380 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/kernel/pmem.c
 create mode 100644 drivers/block/pmem.c

Comments

Christoph Hellwig April 13, 2015, 9:35 a.m. UTC | #1
On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
> Limitations: this is a regular block device, and since the pmem areas 
> are not struct page backed, they are invisible to the rest of the 
> system (other than the block IO device), so direct IO to/from pmem 
> areas, direct mmap() or XIP is not possible yet. The page cache will 
> also shadow and double buffer pmem contents, etc.

Unless you use the DAX support in ext2/4 and soon XFS, in which case
we avoid that double buffering when doing read/write and mmap
Ingo Molnar April 13, 2015, 10:45 a.m. UTC | #2
* Christoph Hellwig <hch@lst.de> wrote:

> On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
> > Limitations: this is a regular block device, and since the pmem areas 
> > are not struct page backed, they are invisible to the rest of the 
> > system (other than the block IO device), so direct IO to/from pmem 
> > areas, direct mmap() or XIP is not possible yet. The page cache will 
> > also shadow and double buffer pmem contents, etc.
> 
> Unless you use the DAX support in ext2/4 and soon XFS, in which case
> we avoid that double buffering when doing read/write and mmap

Indeed, I missed that DAX support just went upstream in v4.0 - nice!

DAX may have some other limitations though that come from not having 
struct page * backing and using VM_MIXEDMAP; the following APIs might 
not work on DAX files:

   - splice
   - zero copy O_DIRECT into DAX areas.
   - futexes

   - ( AFAICS hugetlbs won't work on DAX mmap()s yet - although with
       the current nocache mapping that's probably the least of the 
       performance issues for now. )

Btw., what's the future design plan here? Enable struct page backing, 
or provide special codepaths for all DAX uses like the special pte 
based approach for mmap()s?

Thanks,

	Ingo
Yigal Korman April 13, 2015, 11:11 a.m. UTC | #3
On Mon, Apr 13, 2015 at 1:45 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Christoph Hellwig <hch@lst.de> wrote:
>
>> On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
>> > Limitations: this is a regular block device, and since the pmem areas
>> > are not struct page backed, they are invisible to the rest of the
>> > system (other than the block IO device), so direct IO to/from pmem
>> > areas, direct mmap() or XIP is not possible yet. The page cache will
>> > also shadow and double buffer pmem contents, etc.
>>
>> Unless you use the DAX support in ext2/4 and soon XFS, in which case
>> we avoid that double buffering when doing read/write and mmap
>
> Indeed, I missed that DAX support just went upstream in v4.0 - nice!
>
> DAX may have some other limitations though that comes from not having
> struct page * backing and using VM_MIXEDMAP, the following APIs might
> not work on DAX files:
>
>    - splice
>    - zero copy O_DIRECT into DAX areas.
>    - futexes
>
>    - ( AFAICS hugetlbs won't work on DAX mmap()s yet - although with
>        the current nocache mapping that's probable the least of the
>        performance issues for now. )

mlock() and MAP_POPULATE don't work with DAX files as well.

>
> Btw., what's the future design plan here? Enable struct page backing,
> or provide special codepaths for all DAX uses like the special pte
> based approach for mmap()s?
>
> Thanks,
>
>         Ingo
Boaz Harrosh April 13, 2015, 12:21 p.m. UTC | #4
On 04/13/2015 01:45 PM, Ingo Molnar wrote:
> 
> * Christoph Hellwig <hch@lst.de> wrote:
> 
>> On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
>>> Limitations: this is a regular block device, and since the pmem areas 
>>> are not struct page backed, they are invisible to the rest of the 
>>> system (other than the block IO device), so direct IO to/from pmem 
>>> areas, direct mmap() or XIP is not possible yet. The page cache will 
>>> also shadow and double buffer pmem contents, etc.
>>
>> Unless you use the DAX support in ext2/4 and soon XFS, in which case
>> we avoid that double buffering when doing read/write and mmap
> 
> Indeed, I missed that DAX support just went upstream in v4.0 - nice!
> 
> DAX may have some other limitations though that comes from not having 
> struct page * backing and using VM_MIXEDMAP, the following APIs might 
> not work on DAX files:
> 
>    - splice

Splice works fine. Also, I sent a cleanup in this area to Andrew; it will
be in 4.1.

>    - zero copy O_DIRECT into DAX areas.

DAX is always O_DIRECT.

What does not work is mmap()ing a DAX file and then using that pointer
in an O_DIRECT operation on another device (unless it is a DAX device).

Also, mmap of a DAX file plus RDMA or direct networking will need
a copy.

All this is fixable by applying my page-struct patch for pmem.

>    - futexes
> 
>    - ( AFAICS hugetlbs won't work on DAX mmap()s yet - although with
>        the current nocache mapping that's probable the least of the 
>        performance issues for now. )
> 
> Btw., what's the future design plan here? Enable struct page backing, 
> or provide special codepaths for all DAX uses like the special pte 
> based approach for mmap()s?
> 

I'm hoping for struct page: 4k pages at first and 2M pages later on,
which needs more work in the IO stacks, where I need this most.

> Thanks,
> 	Ingo
> 

Thanks
Boaz
Ingo Molnar April 13, 2015, 12:35 p.m. UTC | #5
* Boaz Harrosh <boaz@plexistor.com> wrote:

> On 04/13/2015 01:45 PM, Ingo Molnar wrote:
> > 
> > * Christoph Hellwig <hch@lst.de> wrote:
> > 
> >> On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
> >>> Limitations: this is a regular block device, and since the pmem areas 
> >>> are not struct page backed, they are invisible to the rest of the 
> >>> system (other than the block IO device), so direct IO to/from pmem 
> >>> areas, direct mmap() or XIP is not possible yet. The page cache will 
> >>> also shadow and double buffer pmem contents, etc.
> >>
> >> Unless you use the DAX support in ext2/4 and soon XFS, in which case
> >> we avoid that double buffering when doing read/write and mmap
> > 
> > Indeed, I missed that DAX support just went upstream in v4.0 - nice!
> > 
> > DAX may have some other limitations though that comes from not having 
> > struct page * backing and using VM_MIXEDMAP, the following APIs might 
> > not work on DAX files:
> > 
> >    - splice
> 
> splice works fine. Also I sent a cleanup in this area to Andrew it will
> be in for 4.1

How does splice work with DAX files? AFAICS vmsplice() won't work, as 
it uses get_user_pages(), which needs struct page backing. Also, how 
will f_op->sendpage() work? That too needs page backing.

> >    - zero copy O_DIRECT into DAX areas.
> 
> DAX is always O_DIRECT.
> 
> What does not work is mmap of DAX file and use that pointer in an
> O_DIRECT operation of another device. (unless it is a DAX device)

That is what I meant: O_DIRECT into (or from) DAX mmap()-ed areas, 
from a different device.

I'm not promoting the use of these APIs, some of them are quirky, just 
wanted to list the known limitations. The pmem driver is already 
useful as-is.

Thanks,

	Ingo
Boaz Harrosh April 13, 2015, 1:36 p.m. UTC | #6
On 04/13/2015 03:35 PM, Ingo Molnar wrote:
> 
<>
> 
> How does splice work with DAX files? AFAICS vmsplice() won't work, as 
> it uses get_user_pages(), which needs struct page backing. Also, how 
> will f_op->sendpage() work? That too needs page backing.
> 

I'm not sure about f_op->sendpage(). I do not see it in ext4, so I assumed
it works through f_op->splice_read/write.

The way f_op->splice_XX works is through the read/write_iter system,
and that path is fully supported with DAX. It will do an extra copy.

[You can see this patch in linux-next:
	8b22559 dax: unify ext2/4_{dax,}_file_operations
 that explains a bit about this issue
]

>>>    - zero copy O_DIRECT into DAX areas.
>>
>> DAX is always O_DIRECT.
>>
>> What does not work is mmap of DAX file and use that pointer in an
>> O_DIRECT operation of another device. (unless it is a DAX device)
> 
> That is what I meant: O_DIRECT into (or from) DAX mmap()-ed areas, 
> from a different device.
> 

Yes. For me the one missed most is copy-less RDMA to/from pmem.

That is why I'm pushing for my page-struct patches. But so far people
have rejected them. I hope to resend them soon, once the dust settles,
and see if people might change their minds.

> I'm not promoting the use of these APIs, some of them are quirky, just 
> wanted to list the known limitations. The pmem driver is already 
> useful as-is.
> 
> Thanks,
> 	Ingo
> 

Thanks
Boaz
Christoph Hellwig April 13, 2015, 5:18 p.m. UTC | #7
On Mon, Apr 13, 2015 at 12:45:32PM +0200, Ingo Molnar wrote:
> Btw., what's the future design plan here? Enable struct page backing, 
> or provide special codepaths for all DAX uses like the special pte 
> based approach for mmap()s?

There are a couple approaches proposed, but we don't have consensus which
way to go yet (to put it mildly).

 - the old Intel patches just allocate pages for E820_PMEM regions.
   I think this is a good way forward for the "old-school" small
   pmem regions which usually are battery/capacitor + flash backed
   DRAM anyway.  This could easily be resurrected for the current code,
   but it couldn't be used for PCI backed pmem regions, and while it
   would work for the gigantic pmem devices some Intel people talk
   about (400TB+ IIRC), it would waste a lot of resources.

 - Intel has proposed changes that allow block I/O on regions that aren't
   page backed, by supporting PFN-based scatterlists which would have to be
   supported all over the I/O path. Reception of that code has been rather
   mediocre in general, although I wouldn't rule it out.

 - Boaz has shown code that creates pages dynamically for pmem regions.
   Unlike the old Intel e820 code that would also work for PCI backed
   pmem regions.  Boaz says he has such a card, but until someone actually
   publishes specs and/or the trivial pci_driver for them I'm inclined to
   just ignore that option.

 - There have been proposals for temporary struct page mappings, or
   variable sized pages, but as far as I can tell no code to actually
   implement these schemes.
Christoph Hellwig April 13, 2015, 5:19 p.m. UTC | #8
On Mon, Apr 13, 2015 at 02:11:56PM +0300, Yigal Korman wrote:
> mlock()

DAX files are always in-memory, so this just sounds like an oversight.
Christoph Hellwig April 13, 2015, 5:22 p.m. UTC | #9
On Mon, Apr 13, 2015 at 02:35:35PM +0200, Ingo Molnar wrote:
> How does splice work with DAX files?

By falling back to default_file_splice_read/default_file_splice_write,
which don't use the iter ops but instead require a copy in the
splice code.  But given that the actual underlying reads and writes
bypass the pagecache, it's not any less efficient than the normal
pagecache-based splice.

> AFAICS vmsplice() won't work, as 
> it uses get_user_pages(), which needs struct page backing.

Exactly.

> Also, how 
> will f_op->sendpage() work? That too needs page backing.

default_file_splice_read allocates its own kernel pages, which are
then passed to ->sendpage.
Boaz Harrosh April 14, 2015, 6:41 a.m. UTC | #10
On 04/13/2015 08:19 PM, Christoph Hellwig wrote:
> On Mon, Apr 13, 2015 at 02:11:56PM +0300, Yigal Korman wrote:
>> mlock()
> 
> DAX files always are in-memory so this just sounds like an oversight.

Yes, mlock on DAX can just return success, but mlock implies MAP_POPULATE.

Which means "I would like to page-fault the whole mmap range at mmap time
so that at access time I'm guaranteed not to sleep". This is usually done
for latency-sensitive applications.

But the current code fails on MAP_POPULATE for DAX because it is only
implemented for pages, and therefore mlock fails as well.

One thing I do not understand: does mlock also protect against
truncate?

Thanks
Boaz
Ingo Molnar April 14, 2015, 12:41 p.m. UTC | #11
* Christoph Hellwig <hch@lst.de> wrote:

> On Mon, Apr 13, 2015 at 12:45:32PM +0200, Ingo Molnar wrote:
> > Btw., what's the future design plan here? Enable struct page backing, 
> > or provide special codepaths for all DAX uses like the special pte 
> > based approach for mmap()s?
> 
> There are a couple approaches proposed, but we don't have consensus which
> way to go yet (to put it mildly).
> 
>  - the old Intel patches just allocate pages for E820_PMEM regions.
>    I think this is a good way forward for the "old-school" small
>    pmem regions which usually are battery/capacitor + flash backed
>    DRAM anyway.  This could easily be resurrected for the current code,
>    but it couldn't be used for PCI backed pmem regions, and would work
>    although waste a lot of resources for the gigantic pmem devices some
>    Intel people talk about (400TB+ IIRC).

So here's how I calculate the various scenarios:

There are two main usecases visible currently for pmem devices: 'pmem 
as storage' and 'pmem as memory', and they have rather distinct 
characteristics.

1) pmem devices as 'storage':

So the size of 'struct page' is 64 bytes currently.

So even if a pmem device is just 1 TB (small storage), for example to 
replace storage on a system, we'd have to allocate 64 bytes per 4096 
bytes of storage, which would use up 16 GB of main memory just for the 
page struct arrays...

So in this case 'struct page' allocation is not acceptable even for 
relatively small pmem device sizes.
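The arithmetic behind that 16 GB figure can be checked quickly. A sketch only: the 64-byte struct page size is the value quoted above and is config-dependent:

```shell
#!/bin/sh
# struct page metadata overhead for a 1 TB pmem device, using the
# 64-byte struct page and 4096-byte page size quoted above.
pmem_bytes=$(( 1 << 40 ))   # 1 TB of pmem
page_size=4096              # bytes per page
page_struct=64              # bytes per struct page (config-dependent)

overhead=$(( pmem_bytes / page_size * page_struct ))
echo "struct page arrays: $(( overhead >> 30 )) GB of main memory"
```

That is 2^28 pages times 64 bytes, i.e. 16 GB of RAM consumed before a single byte of user data is stored.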

For this usecase I think the current pmem driver is perfectly 
acceptable and in fact ideal:

  - it double buffers, giving higher performance and also protecting
    storage from wear (that is likely flash based)

  - the double buffering makes most struct page based APIs work just 
    fine.

  - it offers the DAX APIs for those weird special cases that really 
    want to do their own cache and memory management (databases and 
    other crazy usecases)

2) pmem devices as 'memory':

Battery backed and similar solutions of nv-dram, these are probably a 
lot smaller (for cost reasons) and are also a lot more RAM-alike, so 
the 'struct page' allocation in main RAM makes sense and possibly 
people would want to avoid the double buffering as well.

Furthermore, in this case we could also do another trick:

>  - Intel has proposed changes that allow block I/O on regions that aren't
>    page backed, by supporting PFN-based scatterlists which would have to be
>    supported all over the I/O path. Reception of that code has been rather
>    mediocre in general, although I wouldn't rule it out.
>
>  - Boaz has shown code that creates pages dynamically for pmem regions.
>    Unlike the old Intel e820 code that would also work for PCI backed
>    pmem regions.  Boaz says he has such a card, but until someone actually
>    publishes specs and/or the trivial pci_driver for them I'm inclined to
>    just ignore that option.
> 
>  - There have been proposals for temporary struct page mappings, or
>    variable sized pages, but as far as I can tell no code to actually
>    implement these schemes.

None of this gives me warm fuzzy feelings...

... has anyone explored the possibility of putting 'struct page' into 
the pmem device itself, essentially using it as metadata?

Since it's directly mapped it should just work for most things if it's 
at least write-through cached (UC would be a horror), and it would 
also solve all the size problems. With write-through caching it should 
also be pretty OK performance-wise. The 64 bytes size is ideal as 
well.

( This would create a bit of a dependency on the kernel version, and 
  would complicate the question of how to acquire a fresh page array 
  after bootup, but that could be solved relatively easily IMHO. )

This would eliminate all the negative effects dynamic allocation of 
page structs or sg-lists brings.

Anyway:

Since both the 'pmem as storage' and 'pmem as memory' usecases will 
likely be utilized in practice, IMHO we should hedge our bets by 
supporting both equally well: we should start with the simpler one 
(the current driver) and then support both in the end, with as much 
end user flexibility as possible.

Thanks,

	Ingo
Boaz Harrosh April 14, 2015, 1:45 p.m. UTC | #12
On 04/14/2015 03:41 PM, Ingo Molnar wrote:
> 
> * Christoph Hellwig <hch@lst.de> wrote:
> 
>> On Mon, Apr 13, 2015 at 12:45:32PM +0200, Ingo Molnar wrote:
>>> Btw., what's the future design plan here? Enable struct page backing, 
>>> or provide special codepaths for all DAX uses like the special pte 
>>> based approach for mmap()s?
>>
>> There are a couple approaches proposed, but we don't have consensus which
>> way to go yet (to put it mildly).
>>
>>  - the old Intel patches just allocate pages for E820_PMEM regions.
>>    I think this is a good way forward for the "old-school" small
>>    pmem regions which usually are battery/capacitor + flash backed
>>    DRAM anyway.  This could easily be resurrected for the current code,
>>    but it couldn't be used for PCI backed pmem regions, and would work
>>    although waste a lot of resources for the gigantic pmem devices some
>>    Intel people talk about (400TB+ IIRC).
> 
> So here's how I calculate the various scenarios:
> 
> There are two main usecases visible currently for pmem devices: 'pmem 
> as storage' and 'pmem as memory', and they have rather distinct 
> characteristics.
> 
> 1) pmem devices as 'storage':
> 
> So the size of 'struct page' is 64 bytes currently.
> 
> So even if a pmem device is just 1 TB (small storage), for example to 
> replace storage on a system, we'd have to allocate 64 bytes per 4096 
> bytes of storage, which would use up 16 GB of main memory just for the 
> page struct arrays...
> 
> So in this case 'struct page' allocation is not acceptable even for 
> relatively small pmem device sizes.
> 
> For this usecase I think the current pmem driver is perfectly 
> acceptable and in fact ideal:
> 
>   - it double buffers, giving higher performance and also protecting
>     storage from wear (that is likely flash based)
> 
>   - the double buffering makes most struct page based APIs work just 
>     fine.
> 
>   - it offers the DAX APIs for those weird special cases that really 
>     want to do their own cache and memory management (databases and 
>     other crazy usecases)
> 
> 2) pmem devices as 'memory':
> 
> Battery backed and similar solutions of nv-dram, these are probably a 
> lot smaller (for cost reasons) and are also a lot more RAM-alike, so 
> the 'struct page' allocation in main RAM makes sense and possibly 
> people would want to avoid the double buffering as well.
> 
> Furthermore, in this case we could also do another trick:
> 
>>  - Intel has proposed changes that allow block I/O on regions that aren't
>>    page backed, by supporting PFN-based scatterlists which would have to be
>>    supported all over the I/O path. Reception of that code has been rather
>>    mediocre in general, although I wouldn't rule it out.
>>
>>  - Boaz has shown code that creates pages dynamically for pmem regions.
>>    Unlike the old Intel e820 code that would also work for PCI backed
>>    pmem regions.  Boaz says he has such a card, but until someone actually
>>    publishes specs and/or the trivial pci_driver for them I'm inclined to
>>    just ignore that option.
>>
>>  - There have been proposals for temporary struct page mappings, or
>>    variable sized pages, but as far as I can tell no code to actually
>>    implement these schemes.
> 
> None of this gives me warm fuzzy feelings...
> 
> ... has anyone explored the possibility of putting 'struct page' into 
> the pmem device itself, essentially using it as metadata?
> 

That is what I've been saying since the first time it was asked.

Yes, this works today.
With my patchset it can be done.

The way it works is through memory_hotplug. You can hotplug in a pmem range
as additional memory. This adds to the total amount of memory in the system.

You then load the pmem device with page-struct support, which will allocate
from the regular memory pools, which are now bigger.

What I thought is that one "partition" on the device can be used
as RAM, and its contents are re-initialized on boot.
(I'm saying "partition" because it is more a pre-defined range than
 a real block-layer partition.)

That said, it should be very easy to build on my patchset and instruct
add_section to use a pre-allocated area for the page-structs.
(I will add an RFC patch that will do this)
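The hotplug flow described above can be sketched through the sysfs memory-hotplug interface. A sketch only: the probe file requires CONFIG_ARCH_MEMORY_PROBE, the physical address is a made-up example, and the section number assumes the default 128 MB x86 section size:

```shell
# Sketch: add a pmem physical range to the system as regular RAM via
# memory hotplug (needs CONFIG_ARCH_MEMORY_PROBE on x86).
# 0x300000000 (12G) is an illustrative physical address, not a real layout.
echo 0x300000000 > /sys/devices/system/memory/probe

# Probed sections appear as memoryN directories; with 128 MB sections,
# 0x300000000 / 0x8000000 = section 96. Online it:
echo online > /sys/devices/system/memory/memory96/state

# MemTotal should now include the hotplugged range:
grep MemTotal /proc/meminfo
```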

> Since it's directly mapped it should just work for most things if it's 
> at least write-through cached (UC would be a horror), and it would 
> also solve all the size problems. With write-through caching it should 
> also be pretty OK performance-wise. The 64 bytes size is ideal as 
> well.
> 
> ( This would create a bit of a dependency with the kernel version, and 
>   would complicate things of how to acquire a fresh page array after 
>   bootup, but that could be solved relatively easily IMHO. )
> 
> This would eliminate all the negative effects dynamic allocation of 
> page structs or sg-lists brings.
> 
> Anyway:
> 
> Since both the 'pmem as storage' and 'pmem as memory' usecases will 
> likely be utilized in practice, IMHO we should hedge our bets by 
> supporting both equally well: we should start with the simpler one 
> (the current driver) and then support both in the end, with as much 
> end user flexibility as possible.
> 

I have a pending patch to let the user specify the mapping used per pmem
device (and the default set at Kconfig), including a per-device use
of pages, pages just being one of the mapping modes.

So all this can live under the same driver, and each admin can
configure it for his own needs.

I will send a new patchset once the merge window is over. And we
can then see how it all looks.

> Thanks,
> 	Ingo
> 

Thanks Ingo very much
Boaz
Elliott, Robert (Server Storage) April 14, 2015, 2:08 p.m. UTC | #13
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
> Of Ingo Molnar
> Sent: Tuesday, April 14, 2015 7:42 AM
> To: Christoph Hellwig
> Subject: Re: [Linux-nvdimm] [GIT PULL] PMEM driver for v4.1
> 
> 
...
> Since it's directly mapped it should just work for most things if it's
> at least write-through cached (UC would be a horror), and it would
> also solve all the size problems. With write-through caching it should
> also be pretty OK performance-wise. The 64 bytes size is ideal as

Are the WT support patches going to make it into 4.1?
Dan Williams April 14, 2015, 4:04 p.m. UTC | #14
On Tue, Apr 14, 2015 at 5:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
> 2) pmem devices as 'memory':
>
> Battery backed and similar solutions of nv-dram, these are probably a
> lot smaller (for cost reasons) and are also a lot more RAM-alike, so
> the 'struct page' allocation in main RAM makes sense and possibly
> people would want to avoid the double buffering as well.
>
> Furthermore, in this case we could also do another trick:
>
>>  - Intel has proposed changes that allow block I/O on regions that aren't
>>    page backed, by supporting PFN-based scatterlists which would have to be
>>    supported all over the I/O path. Reception of that code has been rather
>>    mediocre in general, although I wouldn't rule it out.
>>
>>  - Boaz has shown code that creates pages dynamically for pmem regions.
>>    Unlike the old Intel e820 code that would also work for PCI backed
>>    pmem regions.  Boaz says he has such a card, but until someone actually
>>    publishes specs and/or the trivial pci_driver for them I'm inclined to
>>    just ignore that option.
>>
>>  - There have been proposals for temporary struct page mappings, or
>>    variable sized pages, but as far as I can tell no code to actually
>>    implement these schemes.
>
> None of this gives me warm fuzzy feelings...
>
> ... has anyone explored the possibility of putting 'struct page' into
> the pmem device itself, essentially using it as metadata?

Yes, the impetus for proposing the pfn conversion of the block layer
was the consideration that persistent memory may have less write
endurance than DRAM.  The kernel preserving write endurance
exclusively for user data and the elimination of struct page overhead
motivated the patchset [1].

[1]: https://lwn.net/Articles/636968/
Dan Williams April 14, 2015, 4:34 p.m. UTC | #15
On Tue, Apr 14, 2015 at 7:08 AM, Elliott, Robert (Server Storage)
<Elliott@hp.com> wrote:
>
>
>> -----Original Message-----
>> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
>> Of Ingo Molnar
>> Sent: Tuesday, April 14, 2015 7:42 AM
>> To: Christoph Hellwig
>> Subject: Re: [Linux-nvdimm] [GIT PULL] PMEM driver for v4.1
>>
>>
> ...
>> Since it's directly mapped it should just work for most things if it's
>> at least write-through cached (UC would be a horror), and it would
>> also solve all the size problems. With write-through caching it should
>> also be pretty OK performance-wise. The 64 bytes size is ideal as
>
> Are the WT support patches going to make it into 4.1?

Which patches are these?  Maybe I missed them, but I don't see
anything in the archives.
Elliott, Robert (Server Storage) April 14, 2015, 9:46 p.m. UTC | #16
> -----Original Message-----
> From: Dan Williams [mailto:dan.j.williams@intel.com]
> Sent: Tuesday, April 14, 2015 11:34 AM
> To: Elliott, Robert (Server Storage)
> > ...
> >> Since it's directly mapped it should just work for most things if it's
> >> at least write-through cached (UC would be a horror), and it would
> >> also solve all the size problems. With write-through caching it should
> >> also be pretty OK performance-wise. The 64 bytes size is ideal as
> >
> > Are the WT support patches going to make it into 4.1?
> 
> Which patches are these?  Maybe I missed them, but I don't see
> anything in the archives.

These have been baking in linux-mm and linux-next:
* [PATCH v3 0/6] Kernel huge I/O mapping support
  https://lkml.org/lkml/2015/3/3/589
* [PATCH v4 0/7] mtrr, mm, x86: Enhance MTRR checks for huge I/O mapping
  https://lkml.org/lkml/2015/3/24/1056

I don't think this made it into a subsystem tree yet:
* [PATCH v8 0/7] Support Write-Through mapping on x86 
  https://lkml.org/lkml/2015/2/24/773

I guess we could target 4.2 for both the WT series and
pmem patches that support the new ioremap_wt() function.

---
Robert Elliott, HP Server Storage
Ingo Molnar April 15, 2015, 8:03 a.m. UTC | #17
* Elliott, Robert (Server Storage) <Elliott@hp.com> wrote:

> > -----Original Message-----
> > From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
> > Of Ingo Molnar
> > Sent: Tuesday, April 14, 2015 7:42 AM
> > To: Christoph Hellwig
> > Subject: Re: [Linux-nvdimm] [GIT PULL] PMEM driver for v4.1
> > 
> > 
> ...
> > Since it's directly mapped it should just work for most things if it's
> > at least write-through cached (UC would be a horror), and it would
> > also solve all the size problems. With write-through caching it should
> > also be pretty OK performance-wise. The 64 bytes size is ideal as
> 
> Are the WT support patches going to make it into 4.1?

Not that I know of, they are still under review.

Thanks,

	Ingo
Ingo Molnar April 15, 2015, 8:45 a.m. UTC | #18
* Dan Williams <dan.j.williams@intel.com> wrote:

> > None of this gives me warm fuzzy feelings...
> >
> > ... has anyone explored the possibility of putting 'struct page' 
> > into the pmem device itself, essentially using it as metadata?
> 
> Yes, the impetus for proposing the pfn conversion of the block layer 
> was the consideration that persistent memory may have less write 
> endurance than DRAM.  The kernel preserving write endurance 
> exclusively for user data and the elimination of struct page 
> overhead motivated the patchset [1].
> 
> [1]: https://lwn.net/Articles/636968/

(Is there a Git URL where I could take a look at these patches?)

But, I think the usage of pfn's in the block layer is relatively 
independent of the question whether a pmem region should be 
permanently struct page backed or not.

I think the main confusion comes from the fact that 'pfn' can have two 
roles with sufficiently advanced MMIO interfaces: describing a main RAM 
page (struct page), but also describing essentially sectors on a 
large, MMIO-accessible storage device, directly visible to the CPU but 
otherwise not RAM.

So for that reason I think pmem devices should be both struct page 
backed and not struct page backed, depending on their physical 
characteristics:

------------

1)

If a pmem device is in any way expected to be write-unreliable (i.e. 
it's not DRAM but flash) then it's going to be potentially large and 
we simply cannot use struct page backing for it, full stop.

Users very likely want a filesystem on it, with double buffering that 
both reduces wear and makes better use of main RAM and CPU caches.

In this case the pmem device is a simple storage device that has a 
refreshingly clean hardware ABI that exposes all of its contents in a 
large, directly mapped MMIO region in essence.

We don't back mass storage with struct page, we never did with any of 
the other storage devices either.

I'd expect this to be the 90% dominant 'pmem usecase' in the future.

In this case any 'direct mapping' system calls, DIO or 
non-double-buffering mmap()s and DAX will, on the other hand, stay 
'weird' secondary usecases for user-space operating systems like 
databases that want to take caching out of the hands of the kernel.
 
The majority of users will use it as storage, with a filesystem on it 
and regular RAM caching it for everyone's gain. All the struct page 
based APIs and system calls will work just fine, and the rare usecases 
will be served by DAX.

------------

2)

But if a pmem device is RAM, with no write unreliability, then we 
obviously want it to have struct page backing, and we probably want to 
think about it more in terms of hot-pluggable memory, than a storage 
device.

This scenario will be less common than the mass-storage scenario.

Note that this is similar to how GPU memory is categorized: it's 
essentially RAM-alike, which naturally results in struct page backing.

------------

Note that scenarios 1) and 2) are not under our control, they are 
essentially a physical property, with some user policy influencing it 
as well. So we have to support both and we have no 'opinion' about 
which one is right, as it's simply physical reality as-is.

In that sense I think this driver does the right thing as a first 
step: it exposes pmem regions in the more conservative fashion, as a 
block storage device, assuming write unreliability.

Patches that would turn the pmem driver into unconditionally struct 
page backed would be misguided for this usecase. Allocating and 
freeing struct page arrays on the fly would be similarly misguided.

But patches that allow pmem regions that declare themselves true RAM 
to be inserted as hotplug memory would be the right approach IMHO - 
while still preserving the pmem block device and the non-struct-page 
backed approach for other pmem devices.

Note how in this picture the question of how IO scatter-gather lists 
are constructed is an implementation detail that does not impact the 
main design: they are essentially DMA abstractions for storage 
devices, implemented efficiently via memcpy() in the pmem case, and 
both pfn lists and struct page lists are pretty equivalent approaches 
for most usages.

The only exceptions are the 'weird' usecases like DAX, DIO and RDMA: 
these have to be pfn driven, due to the lack of struct page 
descriptors for storage devices in general. In that case the 'pfn' 
isn't really memory, but a sector_t equivalent, for this new type of 
storage DMA that is implemented via a memcpy().

In that sense the special DAX page fault handler looks like a natural 
approach as well: the pfns in the page table aren't really describing 
memory pages, but 'sectors' on an IO device - with special rules, 
limited APIs and ongoing complications to be expected.

At least that's how I see it.

Thanks,

	Ingo
Dan Williams April 16, 2015, 4:31 a.m. UTC | #19
On Wed, Apr 15, 2015 at 1:45 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> > None of this gives me warm fuzzy feelings...
>> >
>> > ... has anyone explored the possibility of putting 'struct page'
>> > into the pmem device itself, essentially using it as metadata?
>>
>> Yes, the impetus for proposing the pfn conversion of the block layer
>> was the consideration that persistent memory may have less write
>> endurance than DRAM.  The kernel preserving write endurance
>> exclusively for user data and the elimination of struct page
>> overhead motivated the patchset [1].
>>
>> [1]: https://lwn.net/Articles/636968/
>
> (Is there a Git URL where I could take a look at these patches?)

  git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm
evacuate-struct-page-v1

Note that the bulk of the change is automated via Coccinelle.

For v2, I'm looking at enabling a kmap primitive to operate on a
pfn_t, kmap_pfn().
Christoph Hellwig April 17, 2015, 6:38 a.m. UTC | #20
So we've been busy discussing future steps in this thread for a few days,
but it seems like Linus doesn't like the pull request.  Linus, can you
say what you don't like here?
Linus Torvalds April 18, 2015, 3:42 p.m. UTC | #21
On Fri, Apr 17, 2015 at 2:38 AM, Christoph Hellwig <hch@lst.de> wrote:
> So we've been busy discussing future steps in this thread for a few days,
> but it seems like Linus doesn't like the pull request.  Linus, can you
> say what you don't like here?

I basically don't do pulls that generate a lot of discussion before
the discussion is over.

Of course, most of the time that discussion is because the pull
request itself is contentious. This time it seems to be more about
secondary issues related to the area rather than the pull request
itself being contentious. I assume nobody is actually objecting to the
pull itself, and will be going ahead with it.

         Linus
Matthew Wilcox May 25, 2015, 6:16 p.m. UTC | #22
On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
> Please pull the latest x86-pmem-for-linus git tree from:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pmem-for-linus
> 
>    # HEAD: 4c1eaa2344fb26bb5e936fb4d8ee307343ea0089 drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()
> 
> This is the initial support for the pmem block device driver: 
> persistent non-volatile memory space mapped into the system's physical 
> memory space as large physical memory regions.

Ingo, this sucks.  You collapsed all of the separate patches into a
single "add new driver" patch, which makes it impossible to bisect which
of the recent changes broke xfstests.  Please don't do this again.

> ------------------>
> Christoph Hellwig (1):
>       x86/mm: Add support for the non-standard protected e820 type
> 
> Ingo Molnar (1):
>       drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()
> 
> Ross Zwisler (1):
>       drivers/block/pmem: Add a driver for persistent memory
Ingo Molnar May 25, 2015, 6:30 p.m. UTC | #23
* Matthew Wilcox <willy@linux.intel.com> wrote:

> On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
> > Please pull the latest x86-pmem-for-linus git tree from:
> > 
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pmem-for-linus
> > 
> >    # HEAD: 4c1eaa2344fb26bb5e936fb4d8ee307343ea0089 drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()
> > 
> > This is the initial support for the pmem block device driver: 
> > persistent non-volatile memory space mapped into the system's physical 
> > memory space as large physical memory regions.
> 
> Ingo, this sucks.  You collapsed all of the separate patches into a 
> single "add new driver" patch, which makes it impossible to bisect 
> which of the recent changes broke xfstests.  Please don't do this 
> again.

I didn't do any collapsing.

Thanks,

	Ingo
Boaz Harrosh May 26, 2015, 8:41 a.m. UTC | #24
On 05/25/2015 09:16 PM, Matthew Wilcox wrote:
<>
> 
> Ingo, this sucks.  You collapsed all of the separate patches into a
> single "add new driver" patch, which makes it impossible to bisect which
> of the recent changes broke xfstests.  Please don't do this again.
> 

Matthew hi

Below is a splitout of the patches I sent to Christoph.
Christoph in turn has added more changes.

I would like to help, please. What is the breakage you
see with DAX?

I routinely test with DAX, so this is a surprise,
though I'm testing with my version, with pages,
__copy_from_user_nocache and so on,
so I might have missed it. Which test is failing for you?


Find patches at:
[web: http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/tags/PMEM-SPLITOUT-4.0-rc5]
[git tag PMEM-SPLITOUT-4.0-rc5 on git://git.open-osd.org / pmem.git]

The interesting list for you:
bc465aa Linux 4.0-rc5
e489c02 x86: add support for the non-standard protected e820 type
8b06a64 SQUASHME: Don't let e820_PMEM section merge emulated segments.

    There were a few changes in what got upstream on top of these two

a069890 pmem: Initial version of persistent memory driver

    This one already contains changes made by Christoph

d4f6df5 SQUASHME: pmem: Remove getgeo
2f3f1dd SQUASHME: pmem: Streamline pmem driver
c921d23 SQUSHME: pmem: Micro cleaning
bb373a7 SQUASHME: pmem: Remove SECTOR_SHIFT
ad0070e SQUASHME: pmem: Remove "heavily based on brd.c" + Copyright

   This is not the final pmem version that got merged, but these are the changes
   I made which got incorporated.

Thanks
Boaz
Matthew Wilcox May 26, 2015, 7:31 p.m. UTC | #25
On Tue, May 26, 2015 at 11:41:41AM +0300, Boaz Harrosh wrote:
> I would please like to help. What is the breakage you
> see with DAX.
> 
> I'm routinely testing with DAX so it is a surprise,
> Though I'm testing with my version with pages and
> __copy_from_user_nocache, and so on.
> Or I might have missed it. What test are you failing?

generic/019 fails in several fun ways.

The first way, which I fixed yesterday, is that the test was using
the wrong way to find the 'make-it-fail' switch for the block device.
That's now in xfstests.  The messages from xfstests were unnecessarily
worrying; they were complaining about an inconsistent filesystem, which
might be expected as the test had failed to abort cleanly and left a
couple of tasks actively writing to the filesystem.

(I hadn't seen the problem before because I was using two devices pmem0
and pmem1; with the new pmem driver, I got one device and partitioned
it instead.  The problem only occurs when using partitions, not when
using entire devices).

The second way is that we hit two BUG/WARN messages.  The first (which
we hit simultaneously on three CPUs in this run!) is:
WARNING: CPU: 7 PID: 2922 at fs/buffer.c:1143 mark_buffer_dirty+0x19e/0x270()

The stack trace probably isn't useful, and anyway it's horribly corrupted
due to triggering the stack trace simultaneously on three CPUs.

The second one we hit was this one:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 2930 at fs/block_dev.c:56 __blkdev_put+0xc5/0x210()
 Modules linked in: ext4 crc16 jbd2 pmem binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support evdev x86_pkg_temp_thermal coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse serio_raw pcspkr i2c_i801 snd_hda_codec_realtek snd_hda_codec_generic lpc_ich mfd_core mei_me mei i915 snd_hda_intel i2c_algo_bit snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_hda_core loop video drm_kms_helper fuse snd_timer snd drm soundcore button processor parport_pc ppdev lp parport sg sd_mod ehci_pci ehci_hcd ahci libahci crc32c_intel libata fan scsi_mod xhci_pci nvme xhci_hcd e1000e ptp pps_core usbcore usb_common thermal thermal_sys
 CPU: 0 PID: 2930 Comm: umount Tainted: G        W       4.1.0-rc4+ #10
 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013
  ffffffff81a04063 ffff8800a58e3d98 ffffffff81653644 0000000000000000
  0000000000000000 ffff8800a58e3dd8 ffffffff81081fea 0000000000000000
  ffff880236580880 ffff880236580ae8 ffff880236580a60 ffff880236580898
 Call Trace:
  [<ffffffff81653644>] dump_stack+0x4c/0x65
  [<ffffffff81081fea>] warn_slowpath_common+0x8a/0xc0
  [<ffffffff810820da>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81260475>] __blkdev_put+0xc5/0x210
  [<ffffffff81260f72>] blkdev_put+0x52/0x180
  [<ffffffff8121e631>] kill_block_super+0x41/0x80
  [<ffffffff8121ea94>] deactivate_locked_super+0x44/0x80
  [<ffffffff8121ef0c>] deactivate_super+0x6c/0x80
  [<ffffffff81242133>] cleanup_mnt+0x43/0xa0
  [<ffffffff812421e2>] __cleanup_mnt+0x12/0x20
  [<ffffffff810a7104>] task_work_run+0xc4/0xf0
  [<ffffffff8101bdd9>] do_notify_resume+0x59/0x80
  [<ffffffff8165cd66>] int_signal+0x12/0x17
 ---[ end trace 73da47765ccceacf ]---

I suspect these are generic ext4 problems that will occur without DAX.
DAX just makes them more likely to occur since only metadata I/O now
goes through the 'likely to fail' path.

Are you skipping generic/019 or just not seeing these failures?
Ingo Molnar May 27, 2015, 7:50 a.m. UTC | #26
* Matthew Wilcox <willy@linux.intel.com> wrote:

> On Mon, Apr 13, 2015 at 11:33:09AM +0200, Ingo Molnar wrote:
> > Please pull the latest x86-pmem-for-linus git tree from:
> > 
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pmem-for-linus
> > 
> >    # HEAD: 4c1eaa2344fb26bb5e936fb4d8ee307343ea0089 drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()
> > 
> > This is the initial support for the pmem block device driver: 
> > persistent non-volatile memory space mapped into the system's physical 
> > memory space as large physical memory regions.
> 
> Ingo, this sucks.  You collapsed all of the separate patches into a single "add 
> new driver" patch, which makes it impossible to bisect which of the recent 
> changes broke xfstests.  Please don't do this again.

You ignored my previous reply, so let me ask this again more forcefully: what the 
hell are you talking about??

As you can see it from the fine commit 9e853f2313e5:

  Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
  Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Acked-by: Dan Williams <dan.j.williams@intel.com>
  ...
  Link: http://lkml.kernel.org/r/1427872339-6688-3-git-send-email-hch@lst.de
  [ Minor cleanups. ]
  Signed-off-by: Ingo Molnar <mingo@kernel.org>

I didn't do any 'collapsing' of patches; I applied the patches as sent by hch:

  9e853f2313e5 ("drivers/block/pmem: Add a driver for persistent memory")
  ec776ef6bbe1 ("x86/mm: Add support for the non-standard protected e820 type")

Thanks,

	Ingo
Boaz Harrosh May 27, 2015, 8:10 a.m. UTC | #27
On 05/26/2015 10:31 PM, Matthew Wilcox wrote:
> On Tue, May 26, 2015 at 11:41:41AM +0300, Boaz Harrosh wrote:
>> I would please like to help. What is the breakage you
>> see with DAX.
>>
>> I'm routinely testing with DAX so it is a surprise,
>> Though I'm testing with my version with pages and
>> __copy_from_user_nocache, and so on.
>> Or I might have missed it. What test are you failing?
> 
> generic/019 fails in several fun ways.
> 
> The first way, which I fixed yesterday, is that the test was using
> the wrong way to find the 'make-it-fail' switch for the block device.
> That's now in xfstests.  The messages from xfstests were unnecessarily
> worrying; they were complaining about an inconsistent filesystem, which
> might be expected as the test had failed to abort cleanly and left a
> couple of tasks actively writing to the filesystem.
> 

OK Apparently I never ran generic/019 see below

> (I hadn't seen the problem before because I was using two devices pmem0
> and pmem1; with the new pmem driver, I got one device and partitioned
> it instead.  The problem only occurs when using partitions, not when
> using entire devices).
> 

My version of pmem+pages brings back the option of slicing pmem into
arbitrarily sized pieces, because each piece can be loaded with one of
the EPMEM_UNCACHED, EPMEM_WRITE_TROUGH, EPMEM_WRITE_COMBINED,
EPMEM_CACHED or EPMEM_PAGES memory mapping modes.
(I have almost all the proper code for EPMEM_CACHED / EPMEM_PAGES
 support in the surrounding code, including m/fsync, mmap, memcpy_nt
 and so on; only sync/freeze with regard to mmap is left out.)

But yes, I will tell the guys to add partition testing as well; it
is important that it works.

So would you expect the same problem with
the original driver?

> The second way is that we hit two BUG/WARN messages.  The first (which
> we hit simultaneously on three CPUs in this run!) is:
> WARNING: CPU: 7 PID: 2922 at fs/buffer.c:1143 mark_buffer_dirty+0x19e/0x270()
> 

Hum, so that must be from some directory handling code, no? Regular DAX IO
does not do write_begin. But why would it make any difference whether it's
a partition or not? That's weird, since this is a real page, not a pmem
buffer at all.

> The stack trace probably isn't useful, and anyway it's horribly corrupted
> due to triggering the stack trace simultaneously on three CPUs.
> 
> The second one we hit was this one:
> 
>  ------------[ cut here ]------------
>  WARNING: CPU: 0 PID: 2930 at fs/block_dev.c:56 __blkdev_put+0xc5/0x210()
>  Modules linked in: ext4 crc16 jbd2 pmem binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support evdev x86_pkg_temp_thermal coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse serio_raw pcspkr i2c_i801 snd_hda_codec_realtek snd_hda_codec_generic lpc_ich mfd_core mei_me mei i915 snd_hda_intel i2c_algo_bit snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_hda_core loop video drm_kms_helper fuse snd_timer snd drm soundcore button processor parport_pc ppdev lp parport sg sd_mod ehci_pci ehci_hcd ahci libahci crc32c_intel libata fan scsi_mod xhci_pci nvme xhci_hcd e1000e ptp pps_core usbcore usb_common thermal thermal_sys
>  CPU: 0 PID: 2930 Comm: umount Tainted: G        W       4.1.0-rc4+ #10
>  Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013
>   ffffffff81a04063 ffff8800a58e3d98 ffffffff81653644 0000000000000000
>   0000000000000000 ffff8800a58e3dd8 ffffffff81081fea 0000000000000000
>   ffff880236580880 ffff880236580ae8 ffff880236580a60 ffff880236580898
>  Call Trace:
>   [<ffffffff81653644>] dump_stack+0x4c/0x65
>   [<ffffffff81081fea>] warn_slowpath_common+0x8a/0xc0
>   [<ffffffff810820da>] warn_slowpath_null+0x1a/0x20
>   [<ffffffff81260475>] __blkdev_put+0xc5/0x210
>   [<ffffffff81260f72>] blkdev_put+0x52/0x180
>   [<ffffffff8121e631>] kill_block_super+0x41/0x80
>   [<ffffffff8121ea94>] deactivate_locked_super+0x44/0x80
>   [<ffffffff8121ef0c>] deactivate_super+0x6c/0x80
>   [<ffffffff81242133>] cleanup_mnt+0x43/0xa0
>   [<ffffffff812421e2>] __cleanup_mnt+0x12/0x20
>   [<ffffffff810a7104>] task_work_run+0xc4/0xf0
>   [<ffffffff8101bdd9>] do_notify_resume+0x59/0x80
>   [<ffffffff8165cd66>] int_signal+0x12/0x17
>  ---[ end trace 73da47765ccceacf ]---
> 
> I suspect these are generic ext4 problems that will occur without DAX.
> DAX just makes them more likely to occur since only metadata I/O now
> goes through the 'likely to fail' path.
> 

But why would WARN_ON_ONCE(write_inode_now(inode, true)) fail?
None of the pmem.c IO paths ever return any error. It must be failing in
block core even before reaching pmem. It looks like we are stuck with
dirty mappings after the devices are already being torn down, or something ...

> Are you skipping generic/019 or just not seeing these failures?
> 

Huh, funny. I just looked, and with ./check auto I see:
generic/018 1s ... [not run] defragmentation not supported for fstype "m1fs"
generic/020 0s ... 0s

019 is not even printing a skip. But if I run it directly I get:
generic/019      [not run] /sys/kernel/debug/fail_make_request  not found. \
	Seems that CONFIG_FAIL_MAKE_REQUEST kernel config option not enabled

So my bad, I will try to properly configure and recreate this failure here
as well.
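(For reference, the switch generic/019 relies on is configured roughly like this. This is a sketch based on the kernel's fault-injection documentation; the numeric values are illustrative, not the exact ones the test sets:)

```shell
# Requires a kernel built with CONFIG_FAULT_INJECTION=y and
# CONFIG_FAIL_MAKE_REQUEST=y, otherwise the debugfs directory
# below does not exist and generic/019 reports [not run].
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true

# Global fault-injection tunables (illustrative values):
echo 10 > /sys/kernel/debug/fail_make_request/probability  # fail ~10%
echo -1 > /sys/kernel/debug/fail_make_request/times        # no limit
echo 0  > /sys/kernel/debug/fail_make_request/verbose

# Per-device opt-in: this is the switch xfstests has to locate
# correctly when the filesystem sits on a partition.
echo 1 > /sys/block/pmem0/make-it-fail
```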

Thanks
Boaz
Christoph Hellwig May 27, 2015, 8:11 a.m. UTC | #28
On Wed, May 27, 2015 at 11:10:21AM +0300, Boaz Harrosh wrote:
> Hu funny I just looked and I see with ./check auto I get
> generic/018 1s ... [not run] defragmentation not supported for fstype "m1fs"
> generic/020 0s ... 0s
> 
> 019 is not even printing a skip. But if I run it directly I get:
> generic/019      [not run] /sys/kernel/debug/fail_make_request  not found. \
> 	Seems that CONFIG_FAIL_MAKE_REQUEST kernel config option not enabled
> 
> So my bad, I will try to properly configure and recreate this failure here
> as well.

It fails I/O above the driver.  Any failure in generic/019 is very unlikely
to be driver related.
Boaz Harrosh May 27, 2015, 8:26 a.m. UTC | #29
On 05/27/2015 11:11 AM, Christoph Hellwig wrote:
> On Wed, May 27, 2015 at 11:10:21AM +0300, Boaz Harrosh wrote:
>> Hu funny I just looked and I see with ./check auto I get
>> generic/018 1s ... [not run] defragmentation not supported for fstype "m1fs"
>> generic/020 0s ... 0s
>>
>> 019 is not even printing a skip. But if I run it directly I get:
>> generic/019      [not run] /sys/kernel/debug/fail_make_request  not found. \
>> 	Seems that CONFIG_FAIL_MAKE_REQUEST kernel config option not enabled
>>
>> So my bad, I will try to properly configure and recreate this failure here
>> as well.
> 
> It fails I/O above the driver.  Any failure in generic/019 is very unlikely
> to be driver related.
> 

Hm, so then that would be expected, right?
__blkdev_put hits WARN_ON_ONCE(write_inode_now(inode, true))
because the test tells the kernel to fail IO requests.

__blkdev_put then complains because it has no way to report
the error back to the caller.

I would then say that the 019 test should expect that this
might happen, or it should make sure to remove the
request-failing switch before the umount.

Thanks
Boaz
diff mbox

Patch

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a62a7b4..c87122dd790f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1965,6 +1965,12 @@  bytes respectively. Such letter suffixes can also be entirely omitted.
 			         or
 			         memmap=0x10000$0x18690000
 
+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as e820 type 12 (0xc)
+			and is NVDIMM or ADR memory.
+
 	memory_corruption_check=0/1 [X86]
 			Some BIOSes seem to corrupt the first 64k of
 			memory when doing things like suspend/resume.
diff --git a/MAINTAINERS b/MAINTAINERS
index 1de6afa8ee51..4517613dc638 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8071,6 +8071,12 @@  S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..9e3bcd6f4a48 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1430,6 +1430,16 @@  config ILLEGAL_POINTER_VALUE
 
 source "mm/Kconfig"
 
+config X86_PMEM_LEGACY
+	bool "Support non-standard NVDIMMs and ADR protected memory"
+	help
+	  Treat memory marked using the non-standard e820 type of 12 as used
+	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
+	  The kernel will offer these regions to the 'pmem' driver so
+	  they can be used for persistent storage.
+
+	  Say Y if unsure.
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index d993e33f5236..960a8a9dc4ab 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -33,6 +33,16 @@ 
 #define E820_NVS	4
 #define E820_UNUSABLE	5
 
+/*
+ * This is a non-standardized way to represent ADR or NVDIMM regions that
+ * persist over a reboot.  The kernel will ignore their special capabilities
+ * unless the CONFIG_X86_PMEM_LEGACY=y option is set.
+ *
+ * ( Note that older platforms also used 6 for the same type of memory,
+ *   but newer versions switched to 12 as 6 was assigned differently.  Some
+ *   time they will learn... )
+ */
+#define E820_PRAM	12
 
 /*
  * reserved RAM used by kernel itself
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cdb1b70ddad0..971f18cd9ca0 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -94,6 +94,7 @@  obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
+obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
 
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 46201deee923..11cc7d54ec3f 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -149,6 +149,9 @@  static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PRAM:
+		printk(KERN_CONT "persistent (type %u)", type);
+		break;
 	default:
 		printk(KERN_CONT "type %u", type);
 		break;
@@ -343,7 +346,7 @@  int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map,
 		 * continue building up new bios map based on this
 		 * information
 		 */
-		if (current_type != last_type)	{
+		if (current_type != last_type || current_type == E820_PRAM) {
 			if (last_type != 0)	 {
 				new_bios[new_bios_entry].size =
 					change_point[chgidx]->addr - last_addr;
@@ -688,6 +691,7 @@  void __init e820_mark_nosave_regions(unsigned long limit_pfn)
 			register_nosave_region(pfn, PFN_UP(ei->addr));
 
 		pfn = PFN_DOWN(ei->addr + ei->size);
+
 		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
 			register_nosave_region(PFN_UP(ei->addr), pfn);
 
@@ -748,7 +752,7 @@  u64 __init early_reserve_e820(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
+static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -759,7 +763,11 @@  static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (ei->type != type)
+		/*
+		 * Persistent memory is accounted as ram for purposes of
+		 * establishing max_pfn and mem_map.
+		 */
+		if (ei->type != E820_RAM && ei->type != E820_PRAM)
 			continue;
 
 		start_pfn = ei->addr >> PAGE_SHIFT;
@@ -784,12 +792,12 @@  static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 }
 unsigned long __init e820_end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
+	return e820_end_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820_end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
+	return e820_end_pfn(1UL << (32-PAGE_SHIFT));
 }
 
 static void early_panic(char *msg)
@@ -866,6 +874,9 @@  static int __init parse_memmap_one(char *p)
 	} else if (*p == '$') {
 		start_at = memparse(p+1, &p);
 		e820_add_region(start_at, mem_size, E820_RESERVED);
+	} else if (*p == '!') {
+		start_at = memparse(p+1, &p);
+		e820_add_region(start_at, mem_size, E820_PRAM);
 	} else
 		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
 
@@ -907,6 +918,7 @@  static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
+	case E820_PRAM: return "Persistent RAM";
 	default:	return "reserved";
 	}
 }
@@ -940,7 +952,9 @@  void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
+		if (((e820.map[i].type != E820_RESERVED) &&
+		     (e820.map[i].type != E820_PRAM)) ||
+		     res->start < (1ULL<<20)) {
 			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
new file mode 100644
index 000000000000..3420c874ddc5
--- /dev/null
+++ b/arch/x86/kernel/pmem.c
@@ -0,0 +1,53 @@ 
+/*
+ * Copyright (c) 2015, Christoph Hellwig.
+ */
+#include <linux/memblock.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <asm/e820.h>
+#include <asm/page_types.h>
+#include <asm/setup.h>
+
+static __init void register_pmem_device(struct resource *res)
+{
+	struct platform_device *pdev;
+	int error;
+
+	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+	if (!pdev)
+		return;
+
+	error = platform_device_add_resources(pdev, res, 1);
+	if (error)
+		goto out_put_pdev;
+
+	error = platform_device_add(pdev);
+	if (error)
+		goto out_put_pdev;
+	return;
+
+out_put_pdev:
+	dev_warn(&pdev->dev, "failed to add 'pmem' (persistent memory) device!\n");
+	platform_device_put(pdev);
+}
+
+static __init int register_pmem_devices(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type == E820_PRAM) {
+			struct resource res = {
+				.flags	= IORESOURCE_MEM,
+				.start	= ei->addr,
+				.end	= ei->addr + ei->size - 1,
+			};
+			register_pmem_device(&res);
+		}
+	}
+
+	return 0;
+}
+device_initcall(register_pmem_devices);
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1b8094d4d7af..eb1fed5bd516 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -404,6 +404,17 @@  config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "Persistent memory block device support"
+	help
+	  Saying Y here will allow you to use a contiguous range of reserved
+	  memory as one or more persistent block devices.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called 'pmem'.
+
+	  If unsure, say N.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d1438d..9cc6c18a1c7e 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@  obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 000000000000..eabf4a8d0085
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,262 @@ 
+/*
+ * Persistent Memory Driver
+ *
+ * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2015, Christoph Hellwig <hch@lst.de>.
+ * Copyright (c) 2015, Boaz Harrosh <boaz@plexistor.com>.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <asm/cacheflush.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/platform_device.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+
+#define PMEM_MINORS		16
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+
+	/* One contiguous memory region per device */
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+static int pmem_major;
+static atomic_t pmem_index;
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+	size_t pmem_off = sector << 9;
+
+	if (rw == READ) {
+		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+		flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_data_dir(bio);
+	sector = bio->bi_iter.bi_sector;
+	bio_for_each_segment(bvec, bio, iter) {
+		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
+			     rw, sector);
+		sector += bvec.bv_len >> 9;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static int pmem_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, int rw)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+
+	return 0;
+}
+
+static long pmem_direct_access(struct block_device *bdev, sector_t sector,
+			      void **kaddr, unsigned long *pfn, long size)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	size_t offset = sector << 9;
+
+	if (!pmem)
+		return -ENODEV;
+
+	*kaddr = pmem->virt_addr + offset;
+	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+
+	return pmem->size - offset;
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		pmem_rw_page,
+	.direct_access =	pmem_direct_access,
+};
+
+static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
+{
+	struct pmem_device *pmem;
+	struct gendisk *disk;
+	int idx, err;
+
+	err = -ENOMEM;
+	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+	if (!pmem)
+		goto out;
+
+	pmem->phys_addr = res->start;
+	pmem->size = resource_size(res);
+
+	err = -EINVAL;
+	if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
+		dev_warn(dev, "could not reserve region [%pa:0x%zx]\n", &pmem->phys_addr, pmem->size);
+		goto out_free_dev;
+	}
+
+	/*
+	 * Map the memory as non-cacheable, as we can't write back the contents
+	 * of the CPU caches in case of a crash.
+	 */
+	err = -ENOMEM;
+	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	if (!pmem->virt_addr)
+		goto out_release_region;
+
+	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!pmem->pmem_queue)
+		goto out_unmap;
+
+	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+	disk = alloc_disk(PMEM_MINORS);
+	if (!disk)
+		goto out_free_queue;
+
+	idx = atomic_inc_return(&pmem_index) - 1;
+
+	disk->major		= pmem_major;
+	disk->first_minor	= PMEM_MINORS * idx;
+	disk->fops		= &pmem_fops;
+	disk->private_data	= pmem;
+	disk->queue		= pmem->pmem_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", idx);
+	disk->driverfs_dev = dev;
+	set_capacity(disk, pmem->size >> 9);
+	pmem->pmem_disk = disk;
+
+	add_disk(disk);
+
+	return pmem;
+
+out_free_queue:
+	blk_cleanup_queue(pmem->pmem_queue);
+out_unmap:
+	iounmap(pmem->virt_addr);
+out_release_region:
+	release_mem_region(pmem->phys_addr, pmem->size);
+out_free_dev:
+	kfree(pmem);
+out:
+	return ERR_PTR(err);
+}
+
+static void pmem_free(struct pmem_device *pmem)
+{
+	del_gendisk(pmem->pmem_disk);
+	put_disk(pmem->pmem_disk);
+	blk_cleanup_queue(pmem->pmem_queue);
+	iounmap(pmem->virt_addr);
+	release_mem_region(pmem->phys_addr, pmem->size);
+	kfree(pmem);
+}
+
+static int pmem_probe(struct platform_device *pdev)
+{
+	struct pmem_device *pmem;
+	struct resource *res;
+
+	if (WARN_ON(pdev->num_resources > 1))
+		return -ENXIO;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENXIO;
+
+	pmem = pmem_alloc(&pdev->dev, res);
+	if (IS_ERR(pmem))
+		return PTR_ERR(pmem);
+
+	platform_set_drvdata(pdev, pmem);
+
+	return 0;
+}
+
+static int pmem_remove(struct platform_device *pdev)
+{
+	struct pmem_device *pmem = platform_get_drvdata(pdev);
+
+	pmem_free(pmem);
+	return 0;
+}
+
+static struct platform_driver pmem_driver = {
+	.probe		= pmem_probe,
+	.remove		= pmem_remove,
+	.driver		= {
+		.owner	= THIS_MODULE,
+		.name	= "pmem",
+	},
+};
+
+static int __init pmem_init(void)
+{
+	int error;
+
+	pmem_major = register_blkdev(0, "pmem");
+	if (pmem_major < 0)
+		return pmem_major;
+
+	error = platform_driver_register(&pmem_driver);
+	if (error)
+		unregister_blkdev(pmem_major, "pmem");
+	return error;
+}
+module_init(pmem_init);
+
+static void pmem_exit(void)
+{
+	platform_driver_unregister(&pmem_driver);
+	unregister_blkdev(pmem_major, "pmem");
+}
+module_exit(pmem_exit);
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
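
For anyone wanting to try the driver, the cover letter's "memmap=..." boot option with the '!' modifier can be exercised roughly as follows. This is an illustrative sketch, not part of the patch: the reserved range (4G at 12G), the mount point and the filesystem choice are examples, and assume a machine with enough RAM above 12 GiB and the pmem driver built in:

```shell
# Carve 4 GiB of RAM starting at offset 12 GiB out of the e820 map and
# mark it as (simulated) persistent memory, using the new '!' modifier.
# Add to the kernel command line:
#     memmap=4G!12G

# After boot, pmem_probe() registers the region and the disk shows up
# under the "pmem%d" naming used in pmem_alloc():
ls -l /dev/pmem0

# Per the limitations noted in the pull request, this is a plain block
# device (no struct page backing, no direct I/O/mmap/XIP, and the page
# cache will shadow its contents), so it is used like any other disk:
mkfs.ext4 /dev/pmem0
mount /dev/pmem0 /mnt/pmem
```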