
vmalloc: introduce vmap_pfn for persistent memory

Message ID alpine.LRH.2.02.1711071645240.1339@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State Rejected, archived
Delegated to: Mike Snitzer

Commit Message

Mikulas Patocka Nov. 7, 2017, 10:03 p.m. UTC
Hi

I am developing a driver that uses persistent memory for caching. A 
persistent memory device can be mapped in several discontiguous ranges.

The kernel has a function vmap that takes an array of pointers to pages 
and maps these pages into a contiguous linear address space. However, it can't 
be used on persistent memory because persistent memory may not be backed 
by page structures.

This patch introduces a new function, vmap_pfn; it works like vmap, but 
takes an array of pfn_t - so it can be used on persistent memory.

This is an example of how vmap_pfn is used: 
https://www.redhat.com/archives/dm-devel/2017-November/msg00026.html (see 
the function persistent_memory_claim)
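
Roughly, the intended usage looks like this (a simplified sketch, not the
actual persistent_memory_claim code; the helper name and the error handling
are illustrative, and the per-extent pfn increment mirrors what the driver
does for contiguous frames):

#include <linux/dax.h>
#include <linux/pfn_t.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

/* Illustrative sketch only - not the real driver code. */
static void *map_dax_range_linearly(struct dax_device *dax_dev,
				    pgoff_t pgoff, long nr_pages)
{
	pfn_t *pfns;
	long i = 0;
	void *vaddr = NULL;
	int id;

	pfns = kvmalloc_array(nr_pages, sizeof(pfn_t), GFP_KERNEL);
	if (!pfns)
		return NULL;

	id = dax_read_lock();
	while (i < nr_pages) {
		void *kaddr;
		pfn_t pfn;
		long n;

		/* Each call may cover only a part of the requested range. */
		n = dax_direct_access(dax_dev, pgoff + i, nr_pages - i,
				      &kaddr, &pfn);
		if (n <= 0)
			goto out;
		while (n-- && i < nr_pages) {
			pfns[i++] = pfn;
			pfn.val++;	/* frames within one extent are contiguous */
		}
	}
	/* Map all collected page frames into one linear virtual range. */
	vaddr = vmap_pfn(pfns, nr_pages, VM_MAP, PAGE_KERNEL);
out:
	dax_read_unlock(id);
	kvfree(pfns);
	return vaddr;
}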

Mikulas



From: Mikulas Patocka <mpatocka@redhat.com>

There's a function vmap that can take discontiguous pages and map them 
linearly to the vmalloc space. However, persistent memory may not be 
backed by pages, so we can't use vmap on it.

This patch introduces a function vmap_pfn that works like vmap, but it 
takes an array of page frame numbers (pfn_t). It can be used to remap 
discontiguous chunks of persistent memory into a linear range.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/vmalloc.h |    2 +
 mm/vmalloc.c            |   88 +++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 71 insertions(+), 19 deletions(-)



Comments

Christoph Hellwig Nov. 8, 2017, 9:59 a.m. UTC | #1
On Tue, Nov 07, 2017 at 05:03:11PM -0500, Mikulas Patocka wrote:
> Hi
> 
> I am developing a driver that uses persistent memory for caching. A 
> persistent memory device can be mapped in several discontiguous ranges.
> 
> The kernel has a function vmap that takes an array of pointers to pages 
> and maps these pages to contiguous linear address space. However, it can't 
> be used on persistent memory because persistent memory may not be backed 
> by page structures.
> 
> This patch introduces a new function vmap_pfn, it works like vmap, but 
> takes an array of pfn_t - so it can be used on persistent memory.

How is cache flushing going to work for this interface, assuming
that you write to / read from the virtual address and expect it to be
persisted on pmem?

Mikulas Patocka Nov. 8, 2017, 12:33 p.m. UTC | #2
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> On Tue, Nov 07, 2017 at 05:03:11PM -0500, Mikulas Patocka wrote:
> > Hi
> > 
> > I am developing a driver that uses persistent memory for caching. A 
> > persistent memory device can be mapped in several discontiguous ranges.
> > 
> > The kernel has a function vmap that takes an array of pointers to pages 
> > and maps these pages to contiguous linear address space. However, it can't 
> > be used on persistent memory because persistent memory may not be backed 
> > by page structures.
> > 
> > This patch introduces a new function vmap_pfn, it works like vmap, but 
> > takes an array of pfn_t - so it can be used on persistent memory.
> 
> How is cache flushing going to work for this interface assuming
> that your write to/from the virtual address and expect it to be
> persisted on pmem?

We could use the function clwb() (or the arch-independent wrapper 
dax_flush()) - that uses the clflushopt instruction on Broadwell or clwb 
on Skylake - but it is very slow; write performance on Broadwell is only 
350 MB/s.

So in practice I use the movnti instruction, which bypasses the cache. The 
write-combining buffer is flushed with sfence.
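
To make the two options concrete, a rough sketch of the two write paths
being compared (assuming the dax_flush() and memcpy_flushcache() signatures
of this kernel era; this is not the driver code itself):

#include <linux/dax.h>
#include <linux/string.h>
#include <asm/barrier.h>

/* 1. Write through the CPU cache, then write the cache lines back. */
static void write_cached(struct dax_device *dax_dev, void *dst,
			 const void *src, size_t len)
{
	memcpy(dst, src, len);
	dax_flush(dax_dev, dst, len);	/* clflushopt / clwb under the hood */
}

/* 2. Bypass the cache with non-temporal stores, then fence. */
static void write_nontemporal(void *dst, const void *src, size_t len)
{
	memcpy_flushcache(dst, src, len); /* non-temporal (movnti) on x86-64 */
	wmb();				  /* sfence: drain write-combining buffers */
}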

Mikulas

Christoph Hellwig Nov. 8, 2017, 3:04 p.m. UTC | #3
On Wed, Nov 08, 2017 at 07:33:09AM -0500, Mikulas Patocka wrote:
> We could use the function clwb() (or arch-independent wrapper dax_flush()) 
> - that uses the clflushopt instruction on Broadwell or clwb on Skylake - 
> but it is very slow, write performance on Broadwell is only 350MB/s.
> 
> So in practice I use the movnti instruction that bypasses cache. The 
> write-combining buffer is flushed with sfence.

And what do you do for an architecture with virtually indexed caches?

Mikulas Patocka Nov. 8, 2017, 3:21 p.m. UTC | #4
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> On Wed, Nov 08, 2017 at 07:33:09AM -0500, Mikulas Patocka wrote:
> > We could use the function clwb() (or arch-independent wrapper dax_flush()) 
> > - that uses the clflushopt instruction on Broadwell or clwb on Skylake - 
> > but it is very slow, write performance on Broadwell is only 350MB/s.
> > 
> > So in practice I use the movnti instruction that bypasses cache. The 
> > write-combining buffer is flushed with sfence.
> 
> And what do you do for an architecture with virtuall indexed caches?

Persistent memory is not supported on such architectures - it is only 
supported on x86-64 and arm64.

Mikulas

Christoph Hellwig Nov. 8, 2017, 3:35 p.m. UTC | #5
On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > And what do you do for an architecture with virtuall indexed caches?
> 
> Persistent memory is not supported on such architectures - it is only 
> supported on x86-64 and arm64.

For now.  But once support is added your driver will just corrupt data
unless you have the right API in place.

Dan Williams Nov. 8, 2017, 3:41 p.m. UTC | #6
On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
>> > And what do you do for an architecture with virtuall indexed caches?
>>
>> Persistent memory is not supported on such architectures - it is only
>> supported on x86-64 and arm64.
>
> For now.  But once support is added your driver will just corrupt data
> unless you have the right API in place.

I'm also in the process of ripping out page-less dax support. With
pages we can potentially leverage the VIVT-cache support in some
architectures, likely with more supporting infrastructure for
dax_flush().

Mikulas Patocka Nov. 8, 2017, 5:42 p.m. UTC | #7
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > And what do you do for an architecture with virtuall indexed caches?
> > 
> > Persistent memory is not supported on such architectures - it is only 
> > supported on x86-64 and arm64.
> 
> For now.  But once support is added your driver will just corrupt data
> unless you have the right API in place.

If dax_flush() were able to flush a vmapped area, I don't see a problem 
with it.

You obviously can't access the same device simultaneously through the 
vmapped area and the direct mapping. But when the persistent memory driver 
is using the device, no one is expected to touch it anyway.

Mikulas

Christoph Hellwig Nov. 8, 2017, 5:47 p.m. UTC | #8
Can you start by explaining what you actually need the vmap for?

Going through a vmap for every I/O is certainly not going to be nice
on NVDIMM-N or similar modules :)

Mikulas Patocka Nov. 8, 2017, 8:15 p.m. UTC | #9
On Wed, 8 Nov 2017, Dan Williams wrote:

> On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> >> > And what do you do for an architecture with virtuall indexed caches?
> >>
> >> Persistent memory is not supported on such architectures - it is only
> >> supported on x86-64 and arm64.
> >
> > For now.  But once support is added your driver will just corrupt data
> > unless you have the right API in place.
> 
> I'm also in the process of ripping out page-less dax support. With
> pages we can potentially leverage the VIVT-cache support in some
> architectures, likely with more supporting infrastructure for
> dax_flush().

Should I remove all the code for page-less persistent memory from my 
driver?

Mikulas

Dan Williams Nov. 8, 2017, 8:25 p.m. UTC | #10
On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 8 Nov 2017, Dan Williams wrote:
>
>> On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
>> >> > And what do you do for an architecture with virtuall indexed caches?
>> >>
>> >> Persistent memory is not supported on such architectures - it is only
>> >> supported on x86-64 and arm64.
>> >
>> > For now.  But once support is added your driver will just corrupt data
>> > unless you have the right API in place.
>>
>> I'm also in the process of ripping out page-less dax support. With
>> pages we can potentially leverage the VIVT-cache support in some
>> architectures, likely with more supporting infrastructure for
>> dax_flush().
>
> Should I remove all the code for page-less persistent memory from my
> driver?
>

Yes, that would be my recommendation. You can see that filesystem-dax
is on its way to dropping page-less support in this series:

   https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html

Mikulas Patocka Nov. 8, 2017, 8:26 p.m. UTC | #11
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> Can you start by explaining what you actually need the vmap for?

It is possible to use LVM on persistent memory. You can create linear or 
striped logical volumes on persistent memory, and these volumes still have 
the direct_access method, so they can be mapped with the function 
dax_direct_access().

If we create logical volumes on persistent memory, the method 
dax_direct_access() won't return the whole device; it will return only a 
part. When dax_direct_access() returns the whole device, my driver just 
uses it without vmap. When dax_direct_access() returns only a part of the 
device, my driver calls it repeatedly to get all the parts and then 
assembles the parts into a linear address space with vmap.

See the function persistent_memory_claim() here: 
https://www.redhat.com/archives/dm-devel/2017-November/msg00026.html

> Going through a vmap for every I/O is certainly not going to be nice
> on NVDIMM-N or similar modules :)

It's just a call to vmalloc_to_page.

Though, if persistent memory is not page-backed, I have to copy the data 
before writing it.
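
To illustrate the vmalloc_to_page remark: for page-backed persistent
memory, the per-I/O cost of "going through the vmap" is roughly the lookup
below (illustrative sketch; the helper name is made up, and it assumes the
length does not cross a page boundary):

#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Translate a vmapped cache address back to a struct page for a bio. */
static int add_cache_block_to_bio(struct bio *bio, void *vmapped_addr,
				  unsigned int len)
{
	struct page *page = vmalloc_to_page(vmapped_addr);
	unsigned int off = offset_in_page(vmapped_addr);

	return bio_add_page(bio, page, len, off);
}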

Mikulas

Dan Williams Nov. 8, 2017, 9:26 p.m. UTC | #12
On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> On Wed, 8 Nov 2017, Christoph Hellwig wrote:
>
>> Can you start by explaining what you actually need the vmap for?
>
> It is possible to use lvm on persistent memory. You can create linear or
> striped logical volumes on persistent memory and these volumes still have
> the direct_access method, so they can be mapped with the function
> dax_direct_access().
>
> If we create logical volumes on persistent memory, the method
> dax_direct_access() won't return the whole device, it will return only a
> part. When dax_direct_access() returns the whole device, my driver just
> uses it without vmap. When dax_direct_access() return only a part of the
> device, my driver calls it repeatedly to get all the parts and then
> assembles the parts into a linear address space with vmap.

I know I proposed "call dax_direct_access() once" as a strawman for an
in-kernel driver user, but it's better to call it per access so you
can better stay in sync with base driver events like new media errors
and unplug / driver-unload. Either that, or at least have a plan for how
to handle those events.

Mikulas Patocka Nov. 9, 2017, 4:37 p.m. UTC | #13
On Wed, 8 Nov 2017, Dan Williams wrote:

> On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> >
> >> Can you start by explaining what you actually need the vmap for?
> >
> > It is possible to use lvm on persistent memory. You can create linear or
> > striped logical volumes on persistent memory and these volumes still have
> > the direct_access method, so they can be mapped with the function
> > dax_direct_access().
> >
> > If we create logical volumes on persistent memory, the method
> > dax_direct_access() won't return the whole device, it will return only a
> > part. When dax_direct_access() returns the whole device, my driver just
> > uses it without vmap. When dax_direct_access() return only a part of the
> > device, my driver calls it repeatedly to get all the parts and then
> > assembles the parts into a linear address space with vmap.
> 
> I know I proposed "call dax_direct_access() once" as a strawman for an
> in-kernel driver user, but it's better to call it per access so you
> can better stay in sync with base driver events like new media errors
> and unplug / driver-unload. Either that, or at least have a plan how
> to handle those events.

Calling it on every access would be an unacceptable performance overhead. 
How is it supposed to work anyway? If something intends to move data on 
persistent memory while some driver accesses it, then we need two functions 
- dax_direct_access() and dax_relinquish_direct_access(). The current 
kernel lacks a function dax_relinquish_direct_access() that would mark a 
region of data as movable, so we can't move the data anyway.


BTW, what happens if we create a write bio that has its pages pointing to 
persistent memory and there is an error when the storage controller 
attempts to do DMA from persistent memory? Will the storage controller 
react to the error in a sensible way, and will the block layer report the 
error?

Mikulas

Mikulas Patocka Nov. 9, 2017, 4:40 p.m. UTC | #14
On Wed, 8 Nov 2017, Dan Williams wrote:

> On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Wed, 8 Nov 2017, Dan Williams wrote:
> >
> >> On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >> > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> >> >> > And what do you do for an architecture with virtuall indexed caches?
> >> >>
> >> >> Persistent memory is not supported on such architectures - it is only
> >> >> supported on x86-64 and arm64.
> >> >
> >> > For now.  But once support is added your driver will just corrupt data
> >> > unless you have the right API in place.
> >>
> >> I'm also in the process of ripping out page-less dax support. With
> >> pages we can potentially leverage the VIVT-cache support in some
> >> architectures, likely with more supporting infrastructure for
> >> dax_flush().
> >
> > Should I remove all the code for page-less persistent memory from my
> > driver?
> >
> 
> Yes, that would be my recommendation. You can see that filesystem-dax
> is on its way to dropping page-less support in this series:
> 
>    https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html

Why do you intend to drop dax for ramdisk? It's perfect for testing.

On x86, persistent memory can be tested with the memmap kernel parameter, 
but on other architectures, ramdisk is the only option for tests.
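
(For example, something like memmap=4G!12G on the x86 kernel command line 
reserves 4 GiB of RAM starting at the 12 GiB physical address and exposes 
it as emulated persistent memory; the size and offset here are arbitrary.)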

Mikulas

Dan Williams Nov. 9, 2017, 4:45 p.m. UTC | #15
On Thu, Nov 9, 2017 at 8:40 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 8 Nov 2017, Dan Williams wrote:
>
>> On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Wed, 8 Nov 2017, Dan Williams wrote:
>> >
>> >> On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> >> > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
>> >> >> > And what do you do for an architecture with virtuall indexed caches?
>> >> >>
>> >> >> Persistent memory is not supported on such architectures - it is only
>> >> >> supported on x86-64 and arm64.
>> >> >
>> >> > For now.  But once support is added your driver will just corrupt data
>> >> > unless you have the right API in place.
>> >>
>> >> I'm also in the process of ripping out page-less dax support. With
>> >> pages we can potentially leverage the VIVT-cache support in some
>> >> architectures, likely with more supporting infrastructure for
>> >> dax_flush().
>> >
>> > Should I remove all the code for page-less persistent memory from my
>> > driver?
>> >
>>
>> Yes, that would be my recommendation. You can see that filesystem-dax
>> is on its way to dropping page-less support in this series:
>>
>>    https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html
>
> Why do you indend to drop dax for ramdisk? It's perfect for testing.
>
> On x86, persistent memory can be tested with the memmap kernel parameters,
> but on other architectures, ramdisk is the only option for tests.
>

Because it's not "perfect for testing": it does not support the
get_user_pages() model that we need to safely handle DAX DMA. ARM64
and PowerPC PMEM support is in the works, so I expect the architecture
support landscape for major architectures to improve such that the
pmem driver can always be used for DAX testing.

Dan Williams Nov. 9, 2017, 4:49 p.m. UTC | #16
On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 8 Nov 2017, Dan Williams wrote:
>
>> On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
>> >
>> >> Can you start by explaining what you actually need the vmap for?
>> >
>> > It is possible to use lvm on persistent memory. You can create linear or
>> > striped logical volumes on persistent memory and these volumes still have
>> > the direct_access method, so they can be mapped with the function
>> > dax_direct_access().
>> >
>> > If we create logical volumes on persistent memory, the method
>> > dax_direct_access() won't return the whole device, it will return only a
>> > part. When dax_direct_access() returns the whole device, my driver just
>> > uses it without vmap. When dax_direct_access() return only a part of the
>> > device, my driver calls it repeatedly to get all the parts and then
>> > assembles the parts into a linear address space with vmap.
>>
>> I know I proposed "call dax_direct_access() once" as a strawman for an
>> in-kernel driver user, but it's better to call it per access so you
>> can better stay in sync with base driver events like new media errors
>> and unplug / driver-unload. Either that, or at least have a plan how
>> to handle those events.
>
> Calling it on every access would be inacceptable performance overkill. How
> is it supposed to work anyway? - if something intends to move data on
> persistent memory while some driver accesse it, then we need two functions
> - dax_direct_access() and dax_relinquish_direct_access(). The current
> kernel lacks a function dax_relinquish_direct_access() that would mark a
> region of data as moveable, so we can't move the data anyway.

We take a global reference on the hosting device while pages are
registered (see the percpu_ref usage in kernel/memremap.c), and we hold
dax_read_lock() over calls to dax_direct_access() to keep the device
alive for the duration of the call.
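
For reference, the pattern being described is roughly this (a sketch using
the 4.14-era API; the wrapper name is made up):

#include <linux/dax.h>
#include <linux/pfn_t.h>

/* The read lock only keeps the dax_device alive across the call itself. */
static long peek_dax(struct dax_device *dax_dev, pgoff_t pgoff,
		     long nr_pages, void **kaddr, pfn_t *pfn)
{
	int id;
	long ret;

	id = dax_read_lock();
	ret = dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
	dax_read_unlock(id);
	/* After dax_read_unlock(), *kaddr may be invalidated at any time. */
	return ret;
}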

> BTW. what happens if we create a write bio that has its pages pointing to
> persistent memory and there is error when the storage controller attempts
> to do DMA from persistent memory? Will the storage controller react to the
> error in a sensible way and will the block layer report the error?

While pages are pinned for DMA the devm_memremap_pages() mapping is
pinned. Otherwise, an error reading persistent memory is identical to
an error reading DRAM.

Mikulas Patocka Nov. 9, 2017, 5:30 p.m. UTC | #17
On Thu, 9 Nov 2017, Dan Williams wrote:

> On Thu, Nov 9, 2017 at 8:40 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Wed, 8 Nov 2017, Dan Williams wrote:
> >
> >> On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >
> >> >
> >> > On Wed, 8 Nov 2017, Dan Williams wrote:
> >> >
> >> >> On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >> >> > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> >> >> >> > And what do you do for an architecture with virtuall indexed caches?
> >> >> >>
> >> >> >> Persistent memory is not supported on such architectures - it is only
> >> >> >> supported on x86-64 and arm64.
> >> >> >
> >> >> > For now.  But once support is added your driver will just corrupt data
> >> >> > unless you have the right API in place.
> >> >>
> >> >> I'm also in the process of ripping out page-less dax support. With
> >> >> pages we can potentially leverage the VIVT-cache support in some
> >> >> architectures, likely with more supporting infrastructure for
> >> >> dax_flush().
> >> >
> >> > Should I remove all the code for page-less persistent memory from my
> >> > driver?
> >> >
> >>
> >> Yes, that would be my recommendation. You can see that filesystem-dax
> >> is on its way to dropping page-less support in this series:
> >>
> >>    https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html
> >
> > Why do you indend to drop dax for ramdisk? It's perfect for testing.
> >
> > On x86, persistent memory can be tested with the memmap kernel parameters,
> > but on other architectures, ramdisk is the only option for tests.
> >
> 
> Because it's not "perfect for testing", it does not support the
> get_user_pages() model that we need to safely handle DAX dma. ARM64
> and PowerPC PMEM support is in the works, so I expect the architecture
> support landscape for major architectures to improve such that the
> pmem driver can always be used for DAX testing.

Yes - but if I want to test the persistent memory driver on a big-endian 
machine, I could use an old SPARC or PA-RISC machine with ramdisk.

New PowerPC machines may support native persistent memory, but they are 
extremely expensive.

Mikulas

Dan Williams Nov. 9, 2017, 5:35 p.m. UTC | #18
On Thu, Nov 9, 2017 at 9:30 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 9 Nov 2017, Dan Williams wrote:
>
>> On Thu, Nov 9, 2017 at 8:40 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Wed, 8 Nov 2017, Dan Williams wrote:
>> >
>> >> On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >
>> >> >
>> >> > On Wed, 8 Nov 2017, Dan Williams wrote:
>> >> >
>> >> >> On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> >> >> > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
>> >> >> >> > And what do you do for an architecture with virtuall indexed caches?
>> >> >> >>
>> >> >> >> Persistent memory is not supported on such architectures - it is only
>> >> >> >> supported on x86-64 and arm64.
>> >> >> >
>> >> >> > For now.  But once support is added your driver will just corrupt data
>> >> >> > unless you have the right API in place.
>> >> >>
>> >> >> I'm also in the process of ripping out page-less dax support. With
>> >> >> pages we can potentially leverage the VIVT-cache support in some
>> >> >> architectures, likely with more supporting infrastructure for
>> >> >> dax_flush().
>> >> >
>> >> > Should I remove all the code for page-less persistent memory from my
>> >> > driver?
>> >> >
>> >>
>> >> Yes, that would be my recommendation. You can see that filesystem-dax
>> >> is on its way to dropping page-less support in this series:
>> >>
>> >>    https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html
>> >
>> > Why do you indend to drop dax for ramdisk? It's perfect for testing.
>> >
>> > On x86, persistent memory can be tested with the memmap kernel parameters,
>> > but on other architectures, ramdisk is the only option for tests.
>> >
>>
>> Because it's not "perfect for testing", it does not support the
>> get_user_pages() model that we need to safely handle DAX dma. ARM64
>> and PowerPC PMEM support is in the works, so I expect the architecture
>> support landscape for major architectures to improve such that the
>> pmem driver can always be used for DAX testing.
>
> Yes - but if I want to test the persistent memory driver on big-endian
> machine, I could use an old Sparc or PA-RISC machine with ramdisk.
>
> New PowerPC machines may support native persistent memory, but they are
> extremely expensive.
>

OK, but for testing purposes we can just add a "compile-test" /
fake "ARCH_HAS_PMEM_API" configuration to enable pmem usage on other
archs, which would give a better test case than DAX with brd.

Mikulas Patocka Nov. 9, 2017, 6:13 p.m. UTC | #19
On Thu, 9 Nov 2017, Dan Williams wrote:

> On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Wed, 8 Nov 2017, Dan Williams wrote:
> >
> >> On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> >> >
> >> >> Can you start by explaining what you actually need the vmap for?
> >> >
> >> > It is possible to use lvm on persistent memory. You can create linear or
> >> > striped logical volumes on persistent memory and these volumes still have
> >> > the direct_access method, so they can be mapped with the function
> >> > dax_direct_access().
> >> >
> >> > If we create logical volumes on persistent memory, the method
> >> > dax_direct_access() won't return the whole device, it will return only a
> >> > part. When dax_direct_access() returns the whole device, my driver just
> >> > uses it without vmap. When dax_direct_access() return only a part of the
> >> > device, my driver calls it repeatedly to get all the parts and then
> >> > assembles the parts into a linear address space with vmap.
> >>
> >> I know I proposed "call dax_direct_access() once" as a strawman for an
> >> in-kernel driver user, but it's better to call it per access so you
> >> can better stay in sync with base driver events like new media errors
> >> and unplug / driver-unload. Either that, or at least have a plan how
> >> to handle those events.
> >
> > Calling it on every access would be inacceptable performance overkill. How
> > is it supposed to work anyway? - if something intends to move data on
> > persistent memory while some driver accesse it, then we need two functions
> > - dax_direct_access() and dax_relinquish_direct_access(). The current
> > kernel lacks a function dax_relinquish_direct_access() that would mark a
> > region of data as moveable, so we can't move the data anyway.
> 
> We take a global reference on the hosting device while pages are
> registered, see the percpu_ref usage in kernel/memremap.c, and we hold
> the dax_read_lock() over calls to dax_direct_access() to temporarily
> hold the device alive for the duration of the call.

It would be good if you provided some function that locks down persistent 
memory in the long term. Locking it on every access just kills performance 
unacceptably.

For changing the mapping, you could provide a callback. When the callback 
is called, the driver that uses persistent memory could quiesce itself, 
release the long-term lock and let the system change the mapping.
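
Something along these lines (a purely hypothetical interface sketch of the
proposal; none of these names exist in the current kernel):

#include <linux/dax.h>

/* Hypothetical only - sketch of a long-term claim with a revoke callback. */
struct dax_claim_ops {
	/*
	 * Called when the kernel needs the range back (e.g. unplug or data
	 * movement); the driver quiesces and releases its claim.
	 */
	void (*quiesce)(void *context);
};

/* Pin a range long-term and register a callback for revocation. */
int dax_claim_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
			    long nr_pages, const struct dax_claim_ops *ops,
			    void *context);

/* Drop the long-term pin, e.g. from inside the quiesce callback. */
void dax_relinquish_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
				  long nr_pages);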

> While pages are pinned for DMA the devm_memremap_pages() mapping is
> pinned. Otherwise, an error reading persistent memory is identical to
> an error reading DRAM.

The question is whether storage controllers and their drivers can react to 
this in a sensible way. Has anyone tested it?

Mikulas

Dan Williams Nov. 9, 2017, 6:38 p.m. UTC | #20
On Thu, Nov 9, 2017 at 10:13 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 9 Nov 2017, Dan Williams wrote:
>
>> On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Wed, 8 Nov 2017, Dan Williams wrote:
>> >
>> >> On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
>> >> >
>> >> >> Can you start by explaining what you actually need the vmap for?
>> >> >
>> >> > It is possible to use lvm on persistent memory. You can create linear or
>> >> > striped logical volumes on persistent memory and these volumes still have
>> >> > the direct_access method, so they can be mapped with the function
>> >> > dax_direct_access().
>> >> >
>> >> > If we create logical volumes on persistent memory, the method
>> >> > dax_direct_access() won't return the whole device, it will return only a
>> >> > part. When dax_direct_access() returns the whole device, my driver just
>> >> > uses it without vmap. When dax_direct_access() return only a part of the
>> >> > device, my driver calls it repeatedly to get all the parts and then
>> >> > assembles the parts into a linear address space with vmap.
>> >>
>> >> I know I proposed "call dax_direct_access() once" as a strawman for an
>> >> in-kernel driver user, but it's better to call it per access so you
>> >> can better stay in sync with base driver events like new media errors
>> >> and unplug / driver-unload. Either that, or at least have a plan how
>> >> to handle those events.
>> >
>> > Calling it on every access would be inacceptable performance overkill. How
>> > is it supposed to work anyway? - if something intends to move data on
>> > persistent memory while some driver accesse it, then we need two functions
>> > - dax_direct_access() and dax_relinquish_direct_access(). The current
>> > kernel lacks a function dax_relinquish_direct_access() that would mark a
>> > region of data as moveable, so we can't move the data anyway.
>>
>> We take a global reference on the hosting device while pages are
>> registered, see the percpu_ref usage in kernel/memremap.c, and we hold
>> the dax_read_lock() over calls to dax_direct_access() to temporarily
>> hold the device alive for the duration of the call.
>
> If would be good if you provided some function that locks down persistent
> memory in the long-term. Locking it on every access just kills performance
> unacceptably.
>
> For changing mapping, you could provide a callback. When the callback is
> called, the driver that uses persistent memory could quiesce itself,
> release the long-term lock and let the system change the mapping.

I'll take a look at this. It dovetails with some of the discussions we
are having about how to support RDMA to persistent memory and
notification/callback to tear down memory registrations.

>> While pages are pinned for DMA the devm_memremap_pages() mapping is
>> pinned. Otherwise, an error reading persistent memory is identical to
>> an error reading DRAM.
>
> The question is if storage controllers and their drivers can react to this
> in a sensible way. Did someone test it?

The drivers don't need to react; once the pages are pinned for DMA, the
hot-unplug will not progress until all those page references are
dropped.

Mikulas Patocka Nov. 9, 2017, 6:51 p.m. UTC | #21
On Thu, 9 Nov 2017, Dan Williams wrote:

> On Thu, Nov 9, 2017 at 10:13 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Thu, 9 Nov 2017, Dan Williams wrote:
> >
> >> On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >
> >> >
> >> > On Wed, 8 Nov 2017, Dan Williams wrote:
> >> >
> >> >> On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >> > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> >> >> >
> >> >> >> Can you start by explaining what you actually need the vmap for?
> >> >> >
> >> >> > It is possible to use lvm on persistent memory. You can create linear or
> >> >> > striped logical volumes on persistent memory and these volumes still have
> >> >> > the direct_access method, so they can be mapped with the function
> >> >> > dax_direct_access().
> >> >> >
> >> >> > If we create logical volumes on persistent memory, the method
> >> >> > dax_direct_access() won't return the whole device, it will return only a
> >> >> > part. When dax_direct_access() returns the whole device, my driver just
> >> >> > uses it without vmap. When dax_direct_access() return only a part of the
> >> >> > device, my driver calls it repeatedly to get all the parts and then
> >> >> > assembles the parts into a linear address space with vmap.
> >> >>
> >> >> I know I proposed "call dax_direct_access() once" as a strawman for an
> >> >> in-kernel driver user, but it's better to call it per access so you
> >> >> can better stay in sync with base driver events like new media errors
> >> >> and unplug / driver-unload. Either that, or at least have a plan how
> >> >> to handle those events.
> >> >
> >> > Calling it on every access would be inacceptable performance overkill. How
> >> > is it supposed to work anyway? - if something intends to move data on
> >> > persistent memory while some driver accesse it, then we need two functions
> >> > - dax_direct_access() and dax_relinquish_direct_access(). The current
> >> > kernel lacks a function dax_relinquish_direct_access() that would mark a
> >> > region of data as moveable, so we can't move the data anyway.
> >>
> >> We take a global reference on the hosting device while pages are
> >> registered, see the percpu_ref usage in kernel/memremap.c, and we hold
> >> the dax_read_lock() over calls to dax_direct_access() to temporarily
> >> hold the device alive for the duration of the call.
> >
> > If would be good if you provided some function that locks down persistent
> > memory in the long-term. Locking it on every access just kills performance
> > unacceptably.
> >
> > For changing mapping, you could provide a callback. When the callback is
> > called, the driver that uses persistent memory could quiesce itself,
> > release the long-term lock and let the system change the mapping.
> 
> I'll take a look at this. It dovetails with some of the discussions we
> are having about how to support RDMA to persistent memory and
> notification/callback to tear down memory registrations.

If you come up with some design, please post it also to dm-devel@redhat.com 
and to me. Device mapper can also change the mapping of devices, so it 
will need to participate in long-term locking too.

> >> While pages are pinned for DMA the devm_memremap_pages() mapping is
> >> pinned. Otherwise, an error reading persistent memory is identical to
> >> an error reading DRAM.
> >
> > The question is if storage controllers and their drivers can react to this
> > in a sensible way. Did someone test it?
> 
> The drivers don't need to react, once the pages are pinned for dma the
> hot-unplug will not progress until all those page references are
> dropped.

I am not talking about moving pages here; I'm talking about possible 
hardware errors in persistent memory. In this situation, the storage 
controller receives an error on the bus - and the question is how it will 
react. Ideally, it should abort just this transfer and return an error 
that the driver will propagate up. But I'm skeptical that anyone is 
really testing the controllers and drivers for this possibility.

Mikulas

Dan Williams Nov. 9, 2017, 6:58 p.m. UTC | #22
On Thu, Nov 9, 2017 at 10:51 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
[..]
>> The drivers don't need to react, once the pages are pinned for dma the
>> hot-unplug will not progress until all those page references are
>> dropped.
>
> I am not talking about moving pages here, I'm talking about possible
> hardware errors in persistent memory. In this situation, the storage
> controller receives an error on the bus - and the question is, how will it
> react. Ideally, it should abort just this transfer and return an error
> that the driver will propagate up. But I'm skeptical that someone is
> really testing the controllers and drivers for this possiblity.

This is something that drive controllers already need to deal with
today on DRAM, but I suspect you are right, because in general
error-path testing in drivers is rare to non-existent in Linux. We can
endeavor to do better with persistent memory, where we have explicit
error injection facilities defined in ACPI that might enjoy wider
support than the existing EINJ facility.


Patch

Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h
+++ linux-2.6/include/linux/vmalloc.h
@@ -98,6 +98,8 @@  extern void vfree_atomic(const void *add
 
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
+extern void *vmap_pfn(pfn_t *pfns, unsigned int count,
+			unsigned long flags, pgprot_t prot);
 extern void vunmap(const void *addr);
 
 extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -31,6 +31,7 @@ 
 #include <linux/compiler.h>
 #include <linux/llist.h>
 #include <linux/bitops.h>
+#include <linux/pfn_t.h>
 
 #include <linux/uaccess.h>
 #include <asm/tlbflush.h>
@@ -132,7 +133,7 @@  static void vunmap_page_range(unsigned l
 }
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
 	pte_t *pte;
 
@@ -145,20 +146,25 @@  static int vmap_pte_range(pmd_t *pmd, un
 	if (!pte)
 		return -ENOMEM;
 	do {
-		struct page *page = pages[*nr];
-
+		unsigned long pf;
+		if (pages) {
+			struct page *page = pages[*nr];
+			if (WARN_ON(!page))
+				return -ENOMEM;
+			pf = page_to_pfn(page);
+		} else {
+			pf = pfn_t_to_pfn(pfns[*nr]);
+		}
 		if (WARN_ON(!pte_none(*pte)))
 			return -EBUSY;
-		if (WARN_ON(!page))
-			return -ENOMEM;
-		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
+		set_pte_at(&init_mm, addr, pte, pfn_pte(pf, prot));
 		(*nr)++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	return 0;
 }
 
 static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -168,14 +174,14 @@  static int vmap_pmd_range(pud_t *pud, un
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+		if (vmap_pte_range(pmd, addr, next, prot, pages, pfns, nr))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -185,14 +191,14 @@  static int vmap_pud_range(p4d_t *p4d, un
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+		if (vmap_pmd_range(pud, addr, next, prot, pages, pfns, nr))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -202,7 +208,7 @@  static int vmap_p4d_range(pgd_t *pgd, un
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+		if (vmap_pud_range(p4d, addr, next, prot, pages, pfns, nr))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
@@ -215,7 +221,7 @@  static int vmap_p4d_range(pgd_t *pgd, un
  * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
  */
 static int vmap_page_range_noflush(unsigned long start, unsigned long end,
-				   pgprot_t prot, struct page **pages)
+				   pgprot_t prot, struct page **pages, pfn_t *pfns)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -227,7 +233,7 @@  static int vmap_page_range_noflush(unsig
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+		err = vmap_p4d_range(pgd, addr, next, prot, pages, pfns, &nr);
 		if (err)
 			return err;
 	} while (pgd++, addr = next, addr != end);
@@ -236,11 +242,11 @@  static int vmap_page_range_noflush(unsig
 }
 
 static int vmap_page_range(unsigned long start, unsigned long end,
-			   pgprot_t prot, struct page **pages)
+			   pgprot_t prot, struct page **pages, pfn_t *pfns)
 {
 	int ret;
 
-	ret = vmap_page_range_noflush(start, end, prot, pages);
+	ret = vmap_page_range_noflush(start, end, prot, pages, pfns);
 	flush_cache_vmap(start, end);
 	return ret;
 }
@@ -1191,7 +1197,7 @@  void *vm_map_ram(struct page **pages, un
 		addr = va->va_start;
 		mem = (void *)addr;
 	}
-	if (vmap_page_range(addr, addr + size, prot, pages) < 0) {
+	if (vmap_page_range(addr, addr + size, prot, pages, NULL) < 0) {
 		vm_unmap_ram(mem, count);
 		return NULL;
 	}
@@ -1306,7 +1312,7 @@  void __init vmalloc_init(void)
 int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 			     pgprot_t prot, struct page **pages)
 {
-	return vmap_page_range_noflush(addr, addr + size, prot, pages);
+	return vmap_page_range_noflush(addr, addr + size, prot, pages, NULL);
 }
 
 /**
@@ -1347,13 +1353,24 @@  void unmap_kernel_range(unsigned long ad
 }
 EXPORT_SYMBOL_GPL(unmap_kernel_range);
 
+static int map_vm_area_pfn(struct vm_struct *area, pgprot_t prot, pfn_t *pfns)
+{
+	unsigned long addr = (unsigned long)area->addr;
+	unsigned long end = addr + get_vm_area_size(area);
+	int err;
+
+	err = vmap_page_range(addr, end, prot, NULL, pfns);
+
+	return err > 0 ? 0 : err;
+}
+
 int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
 {
 	unsigned long addr = (unsigned long)area->addr;
 	unsigned long end = addr + get_vm_area_size(area);
 	int err;
 
-	err = vmap_page_range(addr, end, prot, pages);
+	err = vmap_page_range(addr, end, prot, pages, NULL);
 
 	return err > 0 ? 0 : err;
 }
@@ -1660,6 +1677,39 @@  void *vmap(struct page **pages, unsigned
 }
 EXPORT_SYMBOL(vmap);
 
+/**
+ *	vmap_pfn  -  map an array of page frames into virtually contiguous space
+ *	@pfns:		array of page frame numbers
+ *	@count:		number of page frames to map
+ *	@flags:		vm_area->flags
+ *	@prot:		page protection for the mapping
+ *
+ *	Maps @count page frames from @pfns into contiguous kernel virtual
+ *	space.
+ */
+void *vmap_pfn(pfn_t *pfns, unsigned int count, unsigned long flags, pgprot_t prot)
+{
+	struct vm_struct *area;
+	unsigned long size;		/* In bytes */
+
+	might_sleep();
+
+	size = (unsigned long)count << PAGE_SHIFT;
+	if (unlikely((size >> PAGE_SHIFT) != count))
+		return NULL;
+	area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+	if (!area)
+		return NULL;
+
+	if (map_vm_area_pfn(area, prot, pfns)) {
+		vunmap(area->addr);
+		return NULL;
+	}
+
+	return area->addr;
+}
+EXPORT_SYMBOL(vmap_pfn);
+
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
 			    int node, const void *caller);