Message ID | alpine.LRH.2.02.1711071645240.1339@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive) |
---|---|
State | New, archived |
On Tue, Nov 07, 2017 at 05:03:11PM -0500, Mikulas Patocka wrote:

> Hi
> I am developing a driver that uses persistent memory for caching. A persistent memory device can be mapped in several discontiguous ranges.
> The kernel has a function vmap that takes an array of pointers to pages and maps these pages to contiguous linear address space. However, it can't be used on persistent memory because persistent memory may not be backed by page structures.
> This patch introduces a new function vmap_pfn, it works like vmap, but takes an array of pfn_t - so it can be used on persistent memory.

How is cache flushing going to work for this interface, assuming that you write to/from the virtual address and expect it to be persisted on pmem?
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> On Tue, Nov 07, 2017 at 05:03:11PM -0500, Mikulas Patocka wrote:
> > Hi
> > I am developing a driver that uses persistent memory for caching. A persistent memory device can be mapped in several discontiguous ranges.
> > The kernel has a function vmap that takes an array of pointers to pages and maps these pages to contiguous linear address space. However, it can't be used on persistent memory because persistent memory may not be backed by page structures.
> > This patch introduces a new function vmap_pfn, it works like vmap, but takes an array of pfn_t - so it can be used on persistent memory.
>
> How is cache flushing going to work for this interface, assuming that you write to/from the virtual address and expect it to be persisted on pmem?

We could use the function clwb() (or the arch-independent wrapper dax_flush()) - it uses the clflushopt instruction on Broadwell or clwb on Skylake - but it is very slow; write performance on Broadwell is only 350 MB/s.

So in practice I use the movnti instruction, which bypasses the cache. The write-combining buffer is flushed with sfence.

Mikulas
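For reference, the movnti-based path described above corresponds to the kernel's memcpy_flushcache() helper on x86-64. The following is a minimal sketch of such a write path, assuming the destination is already mapped (directly or through the proposed vmap_pfn()); the helper choice and the bare wmb() are illustrative, not necessarily what the driver ends up doing:

/*
 * Minimal sketch: copy to persistent memory with non-temporal stores and
 * make it durable. memcpy_flushcache() compiles to movnti-style stores on
 * x86-64, and wmb() issues the sfence that drains the write-combining
 * buffers, so the data does not linger in the CPU cache.
 */
#include <linux/string.h>       /* memcpy_flushcache() */
#include <asm/barrier.h>        /* wmb() */

static void pmem_commit_write(void *pmem_dst, const void *src, size_t len)
{
        memcpy_flushcache(pmem_dst, src, len);  /* cache-bypassing copy */
        wmb();                                  /* order and drain WC buffers */
}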
On Wed, Nov 08, 2017 at 07:33:09AM -0500, Mikulas Patocka wrote:

> We could use the function clwb() (or the arch-independent wrapper dax_flush()) - it uses the clflushopt instruction on Broadwell or clwb on Skylake - but it is very slow; write performance on Broadwell is only 350 MB/s.
>
> So in practice I use the movnti instruction, which bypasses the cache. The write-combining buffer is flushed with sfence.

And what do you do for an architecture with virtually indexed caches?
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> On Wed, Nov 08, 2017 at 07:33:09AM -0500, Mikulas Patocka wrote:
> > We could use the function clwb() (or the arch-independent wrapper dax_flush()) - it uses the clflushopt instruction on Broadwell or clwb on Skylake - but it is very slow; write performance on Broadwell is only 350 MB/s.
> >
> > So in practice I use the movnti instruction, which bypasses the cache. The write-combining buffer is flushed with sfence.
>
> And what do you do for an architecture with virtually indexed caches?

Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.

Mikulas
On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:

> > And what do you do for an architecture with virtually indexed caches?
>
> Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.

For now. But once support is added your driver will just corrupt data unless you have the right API in place.
On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > And what do you do for an architecture with virtually indexed caches?
> >
> > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
>
> For now. But once support is added your driver will just corrupt data unless you have the right API in place.

I'm also in the process of ripping out page-less dax support. With pages we can potentially leverage the VIVT-cache support in some architectures, likely with more supporting infrastructure for dax_flush().
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > And what do you do for an architecture with virtually indexed caches?
> >
> > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
>
> For now. But once support is added your driver will just corrupt data unless you have the right API in place.

If dax_flush() were able to flush a vmapped area, I don't see a problem with it. You obviously can't access the same device simultaneously through the vmapped area and the direct mapping, but when the persistent memory driver is using the device, no one is expected to touch it anyway.

Mikulas
Can you start by explaining what you actually need the vmap for? Going through a vmap for every I/O is certainly not going to be nice on NVDIMM-N or similar modules :)
On Wed, 8 Nov 2017, Dan Williams wrote:

> On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > > And what do you do for an architecture with virtually indexed caches?
> > >
> > > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
> >
> > For now. But once support is added your driver will just corrupt data unless you have the right API in place.
>
> I'm also in the process of ripping out page-less dax support. With pages we can potentially leverage the VIVT-cache support in some architectures, likely with more supporting infrastructure for dax_flush().

Should I remove all the code for page-less persistent memory from my driver?

Mikulas
On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Wed, 8 Nov 2017, Dan Williams wrote:
> > On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > > > And what do you do for an architecture with virtually indexed caches?
> > > >
> > > > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
> > >
> > > For now. But once support is added your driver will just corrupt data unless you have the right API in place.
> >
> > I'm also in the process of ripping out page-less dax support. With pages we can potentially leverage the VIVT-cache support in some architectures, likely with more supporting infrastructure for dax_flush().
>
> Should I remove all the code for page-less persistent memory from my driver?

Yes, that would be my recommendation. You can see that filesystem-dax is on its way to dropping page-less support in this series:

https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html
On Wed, 8 Nov 2017, Christoph Hellwig wrote:

> Can you start by explaining what you actually need the vmap for?

It is possible to use lvm on persistent memory. You can create linear or striped logical volumes on persistent memory and these volumes still have the direct_access method, so they can be mapped with the function dax_direct_access().

If we create logical volumes on persistent memory, the method dax_direct_access() won't return the whole device, it will return only a part. When dax_direct_access() returns the whole device, my driver just uses it without vmap. When dax_direct_access() returns only a part of the device, my driver calls it repeatedly to get all the parts and then assembles the parts into a linear address space with vmap.

See the function persistent_memory_claim() here: https://www.redhat.com/archives/dm-devel/2017-November/msg00026.html

> Going through a vmap for every I/O is certainly not going to be nice on NVDIMM-N or similar modules :)

It's just a call to vmalloc_to_page. Though, if persistent memory is not page-backed, I have to copy the data before writing it.

Mikulas
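For readers without the dm-devel archive handy, here is a condensed sketch of that assembly step under the proposed interface. It is not the actual persistent_memory_claim(); the helper name, the VM_MAP flag and the PAGE_KERNEL protection are illustrative choices, and error handling is reduced to the bare minimum:

/*
 * Sketch: walk a (possibly fragmented) DAX device with repeated
 * dax_direct_access() calls, expand each contiguous fragment into per-page
 * pfn_t entries (flag bits in the pfn_t are dropped for brevity), and glue
 * everything into one linear kernel mapping with the proposed vmap_pfn().
 */
#include <linux/dax.h>
#include <linux/pfn_t.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

static void *claim_dax_linear(struct dax_device *dax_dev, pgoff_t nr_pages)
{
        pfn_t *pfns;
        pgoff_t off = 0;
        void *vaddr = NULL;
        int id;

        pfns = kvmalloc_array(nr_pages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return NULL;

        id = dax_read_lock();
        while (off < nr_pages) {
                void *kaddr;
                pfn_t pfn;
                long i, avail;

                /* How many pages are contiguously accessible at 'off'? */
                avail = dax_direct_access(dax_dev, off, nr_pages - off,
                                          &kaddr, &pfn);
                if (avail <= 0)
                        goto out_unlock;
                for (i = 0; i < avail && off < nr_pages; i++, off++)
                        pfns[off] = pfn_to_pfn_t(pfn_t_to_pfn(pfn) + i);
        }
        vaddr = vmap_pfn(pfns, nr_pages, VM_MAP, PAGE_KERNEL);
out_unlock:
        dax_read_unlock(id);
        kvfree(pfns);
        return vaddr;
}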
On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> > Can you start by explaining what you actually need the vmap for?
>
> It is possible to use lvm on persistent memory. You can create linear or striped logical volumes on persistent memory and these volumes still have the direct_access method, so they can be mapped with the function dax_direct_access().
>
> If we create logical volumes on persistent memory, the method dax_direct_access() won't return the whole device, it will return only a part. When dax_direct_access() returns the whole device, my driver just uses it without vmap. When dax_direct_access() returns only a part of the device, my driver calls it repeatedly to get all the parts and then assembles the parts into a linear address space with vmap.

I know I proposed "call dax_direct_access() once" as a strawman for an in-kernel driver user, but it's better to call it per access so you can better stay in sync with base driver events like new media errors and unplug / driver-unload. Either that, or at least have a plan how to handle those events.
On Wed, 8 Nov 2017, Dan Williams wrote:

> On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> > > Can you start by explaining what you actually need the vmap for?
> >
> > It is possible to use lvm on persistent memory. You can create linear or striped logical volumes on persistent memory and these volumes still have the direct_access method, so they can be mapped with the function dax_direct_access().
> >
> > If we create logical volumes on persistent memory, the method dax_direct_access() won't return the whole device, it will return only a part. When dax_direct_access() returns the whole device, my driver just uses it without vmap. When dax_direct_access() returns only a part of the device, my driver calls it repeatedly to get all the parts and then assembles the parts into a linear address space with vmap.
>
> I know I proposed "call dax_direct_access() once" as a strawman for an in-kernel driver user, but it's better to call it per access so you can better stay in sync with base driver events like new media errors and unplug / driver-unload. Either that, or at least have a plan how to handle those events.

Calling it on every access would be unacceptable performance overhead. How is it supposed to work anyway? If something intends to move data on persistent memory while some driver accesses it, then we need two functions - dax_direct_access() and dax_relinquish_direct_access(). The current kernel lacks a function dax_relinquish_direct_access() that would mark a region of data as movable, so we can't move the data anyway.

BTW. what happens if we create a write bio that has its pages pointing to persistent memory and there is an error when the storage controller attempts to do DMA from persistent memory? Will the storage controller react to the error in a sensible way and will the block layer report the error?

Mikulas
On Wed, 8 Nov 2017, Dan Williams wrote:

> On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > > > > And what do you do for an architecture with virtually indexed caches?
> > > > >
> > > > > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
> > > >
> > > > For now. But once support is added your driver will just corrupt data unless you have the right API in place.
> > >
> > > I'm also in the process of ripping out page-less dax support. With pages we can potentially leverage the VIVT-cache support in some architectures, likely with more supporting infrastructure for dax_flush().
> >
> > Should I remove all the code for page-less persistent memory from my driver?
>
> Yes, that would be my recommendation. You can see that filesystem-dax is on its way to dropping page-less support in this series:
>
> https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html

Why do you intend to drop dax for ramdisk? It's perfect for testing.

On x86, persistent memory can be tested with the memmap kernel parameter, but on other architectures, ramdisk is the only option for tests.

Mikulas
On Thu, Nov 9, 2017 at 8:40 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Wed, 8 Nov 2017, Dan Williams wrote:
> > On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > > On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > > > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > > > > > And what do you do for an architecture with virtually indexed caches?
> > > > > >
> > > > > > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
> > > > >
> > > > > For now. But once support is added your driver will just corrupt data unless you have the right API in place.
> > > >
> > > > I'm also in the process of ripping out page-less dax support. With pages we can potentially leverage the VIVT-cache support in some architectures, likely with more supporting infrastructure for dax_flush().
> > >
> > > Should I remove all the code for page-less persistent memory from my driver?
> >
> > Yes, that would be my recommendation. You can see that filesystem-dax is on its way to dropping page-less support in this series:
> >
> > https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html
>
> Why do you intend to drop dax for ramdisk? It's perfect for testing.
>
> On x86, persistent memory can be tested with the memmap kernel parameter, but on other architectures, ramdisk is the only option for tests.

Because it's not "perfect for testing": it does not support the get_user_pages() model that we need to safely handle DAX dma. ARM64 and PowerPC PMEM support is in the works, so I expect the architecture support landscape for major architectures to improve such that the pmem driver can always be used for DAX testing.
On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Wed, 8 Nov 2017, Dan Williams wrote:
> > On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> > > > Can you start by explaining what you actually need the vmap for?
> > >
> > > It is possible to use lvm on persistent memory. You can create linear or striped logical volumes on persistent memory and these volumes still have the direct_access method, so they can be mapped with the function dax_direct_access().
> > >
> > > If we create logical volumes on persistent memory, the method dax_direct_access() won't return the whole device, it will return only a part. When dax_direct_access() returns the whole device, my driver just uses it without vmap. When dax_direct_access() returns only a part of the device, my driver calls it repeatedly to get all the parts and then assembles the parts into a linear address space with vmap.
> >
> > I know I proposed "call dax_direct_access() once" as a strawman for an in-kernel driver user, but it's better to call it per access so you can better stay in sync with base driver events like new media errors and unplug / driver-unload. Either that, or at least have a plan how to handle those events.
>
> Calling it on every access would be unacceptable performance overhead. How is it supposed to work anyway? If something intends to move data on persistent memory while some driver accesses it, then we need two functions - dax_direct_access() and dax_relinquish_direct_access(). The current kernel lacks a function dax_relinquish_direct_access() that would mark a region of data as movable, so we can't move the data anyway.

We take a global reference on the hosting device while pages are registered, see the percpu_ref usage in kernel/memremap.c, and we hold the dax_read_lock() over calls to dax_direct_access() to temporarily hold the device alive for the duration of the call.

> BTW. what happens if we create a write bio that has its pages pointing to persistent memory and there is an error when the storage controller attempts to do DMA from persistent memory? Will the storage controller react to the error in a sensible way and will the block layer report the error?

While pages are pinned for DMA the devm_memremap_pages() mapping is pinned. Otherwise, an error reading persistent memory is identical to an error reading DRAM.
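In code, the temporary pin Dan describes looks roughly like the pattern below; dax_read_lock()/dax_read_unlock() only keep the dax_device from going away across the call, they are not a long-term claim on the mapping (the wrapper name is illustrative):

#include <linux/dax.h>

/*
 * Look up one page of a DAX device; the read lock holds the device alive
 * only for the duration of the dax_direct_access() call itself.
 */
static long dax_peek_page(struct dax_device *dax_dev, pgoff_t pgoff,
                          void **kaddr, pfn_t *pfn)
{
        int id = dax_read_lock();
        long avail = dax_direct_access(dax_dev, pgoff, 1, kaddr, pfn);

        dax_read_unlock(id);
        return avail;   /* pages valid at *kaddr, or a negative errno */
}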
On Thu, 9 Nov 2017, Dan Williams wrote:

> On Thu, Nov 9, 2017 at 8:40 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > > > On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > > > > > > And what do you do for an architecture with virtually indexed caches?
> > > > > > >
> > > > > > > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
> > > > > >
> > > > > > For now. But once support is added your driver will just corrupt data unless you have the right API in place.
> > > > >
> > > > > I'm also in the process of ripping out page-less dax support. With pages we can potentially leverage the VIVT-cache support in some architectures, likely with more supporting infrastructure for dax_flush().
> > > >
> > > > Should I remove all the code for page-less persistent memory from my driver?
> > >
> > > Yes, that would be my recommendation. You can see that filesystem-dax is on its way to dropping page-less support in this series:
> > >
> > > https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html
> >
> > Why do you intend to drop dax for ramdisk? It's perfect for testing.
> >
> > On x86, persistent memory can be tested with the memmap kernel parameter, but on other architectures, ramdisk is the only option for tests.
>
> Because it's not "perfect for testing": it does not support the get_user_pages() model that we need to safely handle DAX dma. ARM64 and PowerPC PMEM support is in the works, so I expect the architecture support landscape for major architectures to improve such that the pmem driver can always be used for DAX testing.

Yes - but if I want to test the persistent memory driver on a big-endian machine, I could use an old Sparc or PA-RISC machine with ramdisk.

New PowerPC machines may support native persistent memory, but they are extremely expensive.

Mikulas
On Thu, Nov 9, 2017 at 9:30 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Thu, 9 Nov 2017, Dan Williams wrote:
> > On Thu, Nov 9, 2017 at 8:40 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > > On Wed, Nov 8, 2017 at 12:15 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > > > > On Wed, Nov 8, 2017 at 7:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > > On Wed, Nov 08, 2017 at 10:21:38AM -0500, Mikulas Patocka wrote:
> > > > > > > > > And what do you do for an architecture with virtually indexed caches?
> > > > > > > >
> > > > > > > > Persistent memory is not supported on such architectures - it is only supported on x86-64 and arm64.
> > > > > > >
> > > > > > > For now. But once support is added your driver will just corrupt data unless you have the right API in place.
> > > > > >
> > > > > > I'm also in the process of ripping out page-less dax support. With pages we can potentially leverage the VIVT-cache support in some architectures, likely with more supporting infrastructure for dax_flush().
> > > > >
> > > > > Should I remove all the code for page-less persistent memory from my driver?
> > > >
> > > > Yes, that would be my recommendation. You can see that filesystem-dax is on its way to dropping page-less support in this series:
> > > >
> > > > https://lists.01.org/pipermail/linux-nvdimm/2017-October/013125.html
> > >
> > > Why do you intend to drop dax for ramdisk? It's perfect for testing.
> > >
> > > On x86, persistent memory can be tested with the memmap kernel parameter, but on other architectures, ramdisk is the only option for tests.
> >
> > Because it's not "perfect for testing": it does not support the get_user_pages() model that we need to safely handle DAX dma. ARM64 and PowerPC PMEM support is in the works, so I expect the architecture support landscape for major architectures to improve such that the pmem driver can always be used for DAX testing.
>
> Yes - but if I want to test the persistent memory driver on a big-endian machine, I could use an old Sparc or PA-RISC machine with ramdisk.
>
> New PowerPC machines may support native persistent memory, but they are extremely expensive.

Ok, but for testing purposes we can just enable a "compile-test" / fake "ARCH_HAS_PMEM_API" configuration to enable pmem usage on other archs, which would give a better test case than DAX with brd.
On Thu, 9 Nov 2017, Dan Williams wrote:

> On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> > > > > Can you start by explaining what you actually need the vmap for?
> > > >
> > > > It is possible to use lvm on persistent memory. You can create linear or striped logical volumes on persistent memory and these volumes still have the direct_access method, so they can be mapped with the function dax_direct_access().
> > > >
> > > > If we create logical volumes on persistent memory, the method dax_direct_access() won't return the whole device, it will return only a part. When dax_direct_access() returns the whole device, my driver just uses it without vmap. When dax_direct_access() returns only a part of the device, my driver calls it repeatedly to get all the parts and then assembles the parts into a linear address space with vmap.
> > >
> > > I know I proposed "call dax_direct_access() once" as a strawman for an in-kernel driver user, but it's better to call it per access so you can better stay in sync with base driver events like new media errors and unplug / driver-unload. Either that, or at least have a plan how to handle those events.
> >
> > Calling it on every access would be unacceptable performance overhead. How is it supposed to work anyway? If something intends to move data on persistent memory while some driver accesses it, then we need two functions - dax_direct_access() and dax_relinquish_direct_access(). The current kernel lacks a function dax_relinquish_direct_access() that would mark a region of data as movable, so we can't move the data anyway.
>
> We take a global reference on the hosting device while pages are registered, see the percpu_ref usage in kernel/memremap.c, and we hold the dax_read_lock() over calls to dax_direct_access() to temporarily hold the device alive for the duration of the call.

It would be good if you provided some function that locks down persistent memory in the long term. Locking it on every access just kills performance unacceptably.

For changing the mapping, you could provide a callback. When the callback is called, the driver that uses persistent memory could quiesce itself, release the long-term lock and let the system change the mapping.

> While pages are pinned for DMA the devm_memremap_pages() mapping is pinned. Otherwise, an error reading persistent memory is identical to an error reading DRAM.

The question is whether storage controllers and their drivers can react to this in a sensible way. Did someone test it?

Mikulas
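None of this exists in the kernel; purely to make the proposal concrete, an interface along the lines Mikulas is asking for might look like the declarations below (all names hypothetical):

/*
 * Hypothetical sketch only - these symbols do not exist. The idea: a
 * long-term claim on a DAX range, plus callbacks that ask the owner to
 * quiesce and release the claim before the backing mapping is changed.
 */
struct dax_claim_ops {
        void (*quiesce)(void *owner);   /* stop touching the mapping, drop the claim */
        void (*resume)(void *owner);    /* new mapping established, re-claim */
};

int dax_claim_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
                            long nr_pages, const struct dax_claim_ops *ops,
                            void *owner);
void dax_relinquish_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
                                  long nr_pages);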
On Thu, Nov 9, 2017 at 10:13 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Thu, 9 Nov 2017, Dan Williams wrote:
> > On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > > On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> > > > > > Can you start by explaining what you actually need the vmap for?
> > > > >
> > > > > It is possible to use lvm on persistent memory. You can create linear or striped logical volumes on persistent memory and these volumes still have the direct_access method, so they can be mapped with the function dax_direct_access().
> > > > >
> > > > > If we create logical volumes on persistent memory, the method dax_direct_access() won't return the whole device, it will return only a part. When dax_direct_access() returns the whole device, my driver just uses it without vmap. When dax_direct_access() returns only a part of the device, my driver calls it repeatedly to get all the parts and then assembles the parts into a linear address space with vmap.
> > > >
> > > > I know I proposed "call dax_direct_access() once" as a strawman for an in-kernel driver user, but it's better to call it per access so you can better stay in sync with base driver events like new media errors and unplug / driver-unload. Either that, or at least have a plan how to handle those events.
> > >
> > > Calling it on every access would be unacceptable performance overhead. How is it supposed to work anyway? If something intends to move data on persistent memory while some driver accesses it, then we need two functions - dax_direct_access() and dax_relinquish_direct_access(). The current kernel lacks a function dax_relinquish_direct_access() that would mark a region of data as movable, so we can't move the data anyway.
> >
> > We take a global reference on the hosting device while pages are registered, see the percpu_ref usage in kernel/memremap.c, and we hold the dax_read_lock() over calls to dax_direct_access() to temporarily hold the device alive for the duration of the call.
>
> It would be good if you provided some function that locks down persistent memory in the long term. Locking it on every access just kills performance unacceptably.
>
> For changing the mapping, you could provide a callback. When the callback is called, the driver that uses persistent memory could quiesce itself, release the long-term lock and let the system change the mapping.

I'll take a look at this. It dovetails with some of the discussions we are having about how to support RDMA to persistent memory and notification/callback to tear down memory registrations.

> > While pages are pinned for DMA the devm_memremap_pages() mapping is pinned. Otherwise, an error reading persistent memory is identical to an error reading DRAM.
>
> The question is whether storage controllers and their drivers can react to this in a sensible way. Did someone test it?

The drivers don't need to react; once the pages are pinned for dma the hot-unplug will not progress until all those page references are dropped.
On Thu, 9 Nov 2017, Dan Williams wrote:

> On Thu, Nov 9, 2017 at 10:13 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Thu, 9 Nov 2017, Dan Williams wrote:
> > > On Thu, Nov 9, 2017 at 8:37 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > On Wed, 8 Nov 2017, Dan Williams wrote:
> > > > > On Wed, Nov 8, 2017 at 12:26 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > > > On Wed, 8 Nov 2017, Christoph Hellwig wrote:
> > > > > > > Can you start by explaining what you actually need the vmap for?
> > > > > >
> > > > > > It is possible to use lvm on persistent memory. You can create linear or striped logical volumes on persistent memory and these volumes still have the direct_access method, so they can be mapped with the function dax_direct_access().
> > > > > >
> > > > > > If we create logical volumes on persistent memory, the method dax_direct_access() won't return the whole device, it will return only a part. When dax_direct_access() returns the whole device, my driver just uses it without vmap. When dax_direct_access() returns only a part of the device, my driver calls it repeatedly to get all the parts and then assembles the parts into a linear address space with vmap.
> > > > >
> > > > > I know I proposed "call dax_direct_access() once" as a strawman for an in-kernel driver user, but it's better to call it per access so you can better stay in sync with base driver events like new media errors and unplug / driver-unload. Either that, or at least have a plan how to handle those events.
> > > >
> > > > Calling it on every access would be unacceptable performance overhead. How is it supposed to work anyway? If something intends to move data on persistent memory while some driver accesses it, then we need two functions - dax_direct_access() and dax_relinquish_direct_access(). The current kernel lacks a function dax_relinquish_direct_access() that would mark a region of data as movable, so we can't move the data anyway.
> > >
> > > We take a global reference on the hosting device while pages are registered, see the percpu_ref usage in kernel/memremap.c, and we hold the dax_read_lock() over calls to dax_direct_access() to temporarily hold the device alive for the duration of the call.
> >
> > It would be good if you provided some function that locks down persistent memory in the long term. Locking it on every access just kills performance unacceptably.
> >
> > For changing the mapping, you could provide a callback. When the callback is called, the driver that uses persistent memory could quiesce itself, release the long-term lock and let the system change the mapping.
>
> I'll take a look at this. It dovetails with some of the discussions we are having about how to support RDMA to persistent memory and notification/callback to tear down memory registrations.

If you come up with some design, post it also to dm-devel@redhat.com and to me. Device mapper can also change the mapping of devices, so it will need to participate in long-term locking too.

> > > While pages are pinned for DMA the devm_memremap_pages() mapping is pinned. Otherwise, an error reading persistent memory is identical to an error reading DRAM.
> >
> > The question is whether storage controllers and their drivers can react to this in a sensible way. Did someone test it?
>
> The drivers don't need to react; once the pages are pinned for dma the hot-unplug will not progress until all those page references are dropped.

I am not talking about moving pages here; I'm talking about possible hardware errors in persistent memory. In this situation, the storage controller receives an error on the bus - and the question is, how will it react. Ideally, it should abort just this transfer and return an error that the driver will propagate up. But I'm skeptical that anyone is really testing the controllers and drivers for this possibility.

Mikulas
On Thu, Nov 9, 2017 at 10:51 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
[..]
> > The drivers don't need to react; once the pages are pinned for dma the hot-unplug will not progress until all those page references are dropped.
>
> I am not talking about moving pages here; I'm talking about possible hardware errors in persistent memory. In this situation, the storage controller receives an error on the bus - and the question is, how will it react. Ideally, it should abort just this transfer and return an error that the driver will propagate up. But I'm skeptical that anyone is really testing the controllers and drivers for this possibility.

This is something that drive controllers already need to deal with today on DRAM, but I suspect you are right because, in general, error-path testing in drivers is rare to non-existent in Linux. We can endeavor to do better with persistent memory where we have some explicit error injection facilities defined in ACPI that might enjoy wider support than the existing EINJ facility.
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h
+++ linux-2.6/include/linux/vmalloc.h
@@ -98,6 +98,8 @@ extern void vfree_atomic(const void *add
 
 extern void *vmap(struct page **pages, unsigned int count,
                        unsigned long flags, pgprot_t prot);
+extern void *vmap_pfn(pfn_t *pfns, unsigned int count,
+                       unsigned long flags, pgprot_t prot);
 extern void vunmap(const void *addr);
 
 extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -31,6 +31,7 @@
 #include <linux/compiler.h>
 #include <linux/llist.h>
 #include <linux/bitops.h>
+#include <linux/pfn_t.h>
 
 #include <linux/uaccess.h>
 #include <asm/tlbflush.h>
@@ -132,7 +133,7 @@ static void vunmap_page_range(unsigned l
 }
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-               unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+               unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
         pte_t *pte;
 
@@ -145,20 +146,25 @@ static int vmap_pte_range(pmd_t *pmd, un
         if (!pte)
                 return -ENOMEM;
         do {
-                struct page *page = pages[*nr];
-
+                unsigned long pf;
+                if (pages) {
+                        struct page *page = pages[*nr];
+                        if (WARN_ON(!page))
+                                return -ENOMEM;
+                        pf = page_to_pfn(page);
+                } else {
+                        pf = pfn_t_to_pfn(pfns[*nr]);
+                }
                 if (WARN_ON(!pte_none(*pte)))
                         return -EBUSY;
-                if (WARN_ON(!page))
-                        return -ENOMEM;
-                set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
+                set_pte_at(&init_mm, addr, pte, pfn_pte(pf, prot));
                 (*nr)++;
         } while (pte++, addr += PAGE_SIZE, addr != end);
         return 0;
 }
 
 static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-               unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+               unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
         pmd_t *pmd;
         unsigned long next;
@@ -168,14 +174,14 @@ static int vmap_pmd_range(pud_t *pud, un
                 return -ENOMEM;
         do {
                 next = pmd_addr_end(addr, end);
-                if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+                if (vmap_pte_range(pmd, addr, next, prot, pages, pfns, nr))
                         return -ENOMEM;
         } while (pmd++, addr = next, addr != end);
         return 0;
 }
 
 static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-               unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+               unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
         pud_t *pud;
         unsigned long next;
@@ -185,14 +191,14 @@ static int vmap_pud_range(p4d_t *p4d, un
                 return -ENOMEM;
         do {
                 next = pud_addr_end(addr, end);
-                if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+                if (vmap_pmd_range(pud, addr, next, prot, pages, pfns, nr))
                         return -ENOMEM;
         } while (pud++, addr = next, addr != end);
         return 0;
 }
 
 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-               unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+               unsigned long end, pgprot_t prot, struct page **pages, pfn_t *pfns, int *nr)
 {
         p4d_t *p4d;
         unsigned long next;
@@ -202,7 +208,7 @@ static int vmap_p4d_range(pgd_t *pgd, un
                 return -ENOMEM;
         do {
                 next = p4d_addr_end(addr, end);
-                if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+                if (vmap_pud_range(p4d, addr, next, prot, pages, pfns, nr))
                         return -ENOMEM;
         } while (p4d++, addr = next, addr != end);
         return 0;
@@ -215,7 +221,7 @@ static int vmap_p4d_range(pgd_t *pgd, un
  * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
  */
 static int vmap_page_range_noflush(unsigned long start, unsigned long end,
-                                  pgprot_t prot, struct page **pages)
+                                  pgprot_t prot, struct page **pages, pfn_t *pfns)
 {
         pgd_t *pgd;
         unsigned long next;
@@ -227,7 +233,7 @@ static int vmap_page_range_noflush(unsig
         pgd = pgd_offset_k(addr);
         do {
                 next = pgd_addr_end(addr, end);
-                err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+                err = vmap_p4d_range(pgd, addr, next, prot, pages, pfns, &nr);
                 if (err)
                         return err;
         } while (pgd++, addr = next, addr != end);
@@ -236,11 +242,11 @@ static int vmap_page_range_noflush(unsig
 }
 
 static int vmap_page_range(unsigned long start, unsigned long end,
-                          pgprot_t prot, struct page **pages)
+                          pgprot_t prot, struct page **pages, pfn_t *pfns)
 {
         int ret;
 
-        ret = vmap_page_range_noflush(start, end, prot, pages);
+        ret = vmap_page_range_noflush(start, end, prot, pages, pfns);
         flush_cache_vmap(start, end);
         return ret;
 }
@@ -1191,7 +1197,7 @@ void *vm_map_ram(struct page **pages, un
                 addr = va->va_start;
                 mem = (void *)addr;
         }
-        if (vmap_page_range(addr, addr + size, prot, pages) < 0) {
+        if (vmap_page_range(addr, addr + size, prot, pages, NULL) < 0) {
                 vm_unmap_ram(mem, count);
                 return NULL;
         }
@@ -1306,7 +1312,7 @@ void __init vmalloc_init(void)
 int map_kernel_range_noflush(unsigned long addr, unsigned long size,
                              pgprot_t prot, struct page **pages)
 {
-        return vmap_page_range_noflush(addr, addr + size, prot, pages);
+        return vmap_page_range_noflush(addr, addr + size, prot, pages, NULL);
 }
 
 /**
@@ -1347,13 +1353,24 @@ void unmap_kernel_range(unsigned long ad
 }
 EXPORT_SYMBOL_GPL(unmap_kernel_range);
 
+static int map_vm_area_pfn(struct vm_struct *area, pgprot_t prot, pfn_t *pfns)
+{
+        unsigned long addr = (unsigned long)area->addr;
+        unsigned long end = addr + get_vm_area_size(area);
+        int err;
+
+        err = vmap_page_range(addr, end, prot, NULL, pfns);
+
+        return err > 0 ? 0 : err;
+}
+
 int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
 {
         unsigned long addr = (unsigned long)area->addr;
         unsigned long end = addr + get_vm_area_size(area);
         int err;
 
-        err = vmap_page_range(addr, end, prot, pages);
+        err = vmap_page_range(addr, end, prot, pages, NULL);
 
         return err > 0 ? 0 : err;
 }
@@ -1660,6 +1677,39 @@ void *vmap(struct page **pages, unsigned
 }
 EXPORT_SYMBOL(vmap);
 
+/**
+ * vmap_pfn - map an array of pages into virtually contiguous space
+ * @pfns: array of page frame numbers
+ * @count: number of pages to map
+ * @flags: vm_area->flags
+ * @prot: page protection for the mapping
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space.
+ */
+void *vmap_pfn(pfn_t *pfns, unsigned int count, unsigned long flags, pgprot_t prot)
+{
+        struct vm_struct *area;
+        unsigned long size;             /* In bytes */
+
+        might_sleep();
+
+        size = (unsigned long)count << PAGE_SHIFT;
+        if (unlikely((size >> PAGE_SHIFT) != count))
+                return NULL;
+        area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+        if (!area)
+                return NULL;
+
+        if (map_vm_area_pfn(area, prot, pfns)) {
+                vunmap(area->addr);
+                return NULL;
+        }
+
+        return area->addr;
+}
+EXPORT_SYMBOL(vmap_pfn);
+
 static void *__vmalloc_node(unsigned long size, unsigned long align,
                             gfp_t gfp_mask, pgprot_t prot,
                             int node, const void *caller);
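To make the proposed interface concrete, a sketch of how a caller might use it; the flag and protection values are just plausible choices, and the function is purely illustrative:

/*
 * Illustrative use of the proposed vmap_pfn(): map two physically
 * discontiguous persistent-memory pages into one contiguous kernel
 * virtual range. VM_MAP and PAGE_KERNEL are example choices here.
 */
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>
#include <linux/vmalloc.h>

static void *map_two_pmem_pages(unsigned long pfn_a, unsigned long pfn_b)
{
        pfn_t pfns[2] = {
                pfn_to_pfn_t(pfn_a),
                pfn_to_pfn_t(pfn_b),    /* need not be adjacent to pfn_a */
        };
        void *vaddr = vmap_pfn(pfns, ARRAY_SIZE(pfns), VM_MAP, PAGE_KERNEL);

        /* On success vaddr spans 2 * PAGE_SIZE of contiguous kernel VA;
         * tear it down with vunmap(vaddr) when finished. */
        return vaddr;
}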