Message ID | 20200513034705.172983-3-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2,1/5] powerpc/pmem: Add new instructions for persistent storage and sync | expand |
On Tue, May 12, 2020 at 8:47 PM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > Architectures like ppc64 provide persistent memory specific barriers > that will ensure that all stores for which the modifications are > written to persistent storage by preceding dcbfps and dcbstps > instructions have updated persistent storage before any data > access or data transfer caused by subsequent instructions is initiated. > This is in addition to the ordering done by wmb() > > Update nvdimm core such that architecture can use barriers other than > wmb to ensure all previous writes are architecturally visible for > the platform buffer flush. This seems like an exceedingly bad idea, maybe I'm missing something. This implies that the deployed base of DAX applications using the old instruction sequence are going to regress on new hardware that requires the new instructions to be deployed. I'm thinking the kernel should go as far as to disable DAX operation by default on new hardware until userspace asserts that it is prepared to switch to the new implementation. Is there any other way to ensure the forward compatibility of deployed ppc64 DAX applications? > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > --- > drivers/nvdimm/region_devs.c | 8 ++++---- > include/linux/libnvdimm.h | 4 ++++ > 2 files changed, 8 insertions(+), 4 deletions(-) > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > index ccbb5b43b8b2..88ea34a9c7fd 100644 > --- a/drivers/nvdimm/region_devs.c > +++ b/drivers/nvdimm/region_devs.c > @@ -1216,13 +1216,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region) > idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8)); > > /* > - * The first wmb() is needed to 'sfence' all previous writes > - * such that they are architecturally visible for the platform > - * buffer flush. Note that we've already arranged for pmem > + * The first arch_pmem_flush_barrier() is needed to 'sfence' all > + * previous writes such that they are architecturally visible for > + * the platform buffer flush. Note that we've already arranged for pmem > * writes to avoid the cache via memcpy_flushcache(). The final > * wmb() ensures ordering for the NVDIMM flush write. > */ > - wmb(); > + arch_pmem_flush_barrier(); > for (i = 0; i < nd_region->ndr_mappings; i++) > if (ndrd_get_flush_wpq(ndrd, i, 0)) > writeq(1, ndrd_get_flush_wpq(ndrd, i, idx)); > diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h > index 18da4059be09..66f6c65bd789 100644 > --- a/include/linux/libnvdimm.h > +++ b/include/linux/libnvdimm.h > @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, size_t size) > } > #endif > > +#ifndef arch_pmem_flush_barrier > +#define arch_pmem_flush_barrier() wmb() > +#endif > + > #endif /* __LIBNVDIMM_H__ */ > -- > 2.26.2 >
Hi Dan, Apologies for the delay in response. I was waiting for feedback from hardware team before responding to this email. Dan Williams <dan.j.williams@intel.com> writes: > On Tue, May 12, 2020 at 8:47 PM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> Architectures like ppc64 provide persistent memory specific barriers >> that will ensure that all stores for which the modifications are >> written to persistent storage by preceding dcbfps and dcbstps >> instructions have updated persistent storage before any data >> access or data transfer caused by subsequent instructions is initiated. >> This is in addition to the ordering done by wmb() >> >> Update nvdimm core such that architecture can use barriers other than >> wmb to ensure all previous writes are architecturally visible for >> the platform buffer flush. > > This seems like an exceedingly bad idea, maybe I'm missing something. > This implies that the deployed base of DAX applications using the old > instruction sequence are going to regress on new hardware that > requires the new instructions to be deployed. pmdk support for ppc64 is still work in progress and there is pull request to switch pmdk to use new instruction. https://github.com/tuliom/pmdk/commit/fix-flush All userspace applications will be switched to use the new instructions. The new instructions are designed such that when running on P8 and P9 they behave as 'dcbf' and 'hwsync'. Applications using new instructions will behave as expected when running on P8 and P9. Only future hardware will differentiate between 'dcbf' and 'dcbfps' > I'm thinking the kernel > should go as far as to disable DAX operation by default on new > hardware until userspace asserts that it is prepared to switch to the > new implementation. Is there any other way to ensure the forward > compatibility of deployed ppc64 DAX applications? AFAIU there is no released persistent memory hardware on ppc64 platform and we need to make sure before applications get enabled to use these persistent memory devices, they should switch to use the new instruction? > >> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >> --- >> drivers/nvdimm/region_devs.c | 8 ++++---- >> include/linux/libnvdimm.h | 4 ++++ >> 2 files changed, 8 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c >> index ccbb5b43b8b2..88ea34a9c7fd 100644 >> --- a/drivers/nvdimm/region_devs.c >> +++ b/drivers/nvdimm/region_devs.c >> @@ -1216,13 +1216,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region) >> idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8)); >> >> /* >> - * The first wmb() is needed to 'sfence' all previous writes >> - * such that they are architecturally visible for the platform >> - * buffer flush. Note that we've already arranged for pmem >> + * The first arch_pmem_flush_barrier() is needed to 'sfence' all >> + * previous writes such that they are architecturally visible for >> + * the platform buffer flush. Note that we've already arranged for pmem >> * writes to avoid the cache via memcpy_flushcache(). The final >> * wmb() ensures ordering for the NVDIMM flush write. >> */ >> - wmb(); >> + arch_pmem_flush_barrier(); >> for (i = 0; i < nd_region->ndr_mappings; i++) >> if (ndrd_get_flush_wpq(ndrd, i, 0)) >> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx)); >> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h >> index 18da4059be09..66f6c65bd789 100644 >> --- a/include/linux/libnvdimm.h >> +++ b/include/linux/libnvdimm.h >> @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, size_t size) >> } >> #endif >> >> +#ifndef arch_pmem_flush_barrier >> +#define arch_pmem_flush_barrier() wmb() >> +#endif >> + >> #endif /* __LIBNVDIMM_H__ */ >> -- >> 2.26.2 >> > _______________________________________________ > Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org > To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > > Hi Dan, > > Apologies for the delay in response. I was waiting for feedback from > hardware team before responding to this email. > > > Dan Williams <dan.j.williams@intel.com> writes: > > > On Tue, May 12, 2020 at 8:47 PM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > >> > >> Architectures like ppc64 provide persistent memory specific barriers > >> that will ensure that all stores for which the modifications are > >> written to persistent storage by preceding dcbfps and dcbstps > >> instructions have updated persistent storage before any data > >> access or data transfer caused by subsequent instructions is initiated. > >> This is in addition to the ordering done by wmb() > >> > >> Update nvdimm core such that architecture can use barriers other than > >> wmb to ensure all previous writes are architecturally visible for > >> the platform buffer flush. > > > > This seems like an exceedingly bad idea, maybe I'm missing something. > > This implies that the deployed base of DAX applications using the old > > instruction sequence are going to regress on new hardware that > > requires the new instructions to be deployed. > > > pmdk support for ppc64 is still work in progress and there is pull > request to switch pmdk to use new instruction. Ok. > > https://github.com/tuliom/pmdk/commit/fix-flush > > All userspace applications will be switched to use the new > instructions. The new instructions are designed such that when running on P8 > and P9 they behave as 'dcbf' and 'hwsync'. Sure, makes sense. > Applications using new instructions will behave as expected when running > on P8 and P9. Only future hardware will differentiate between 'dcbf' and > 'dcbfps' Right, this is the problem. Applications using new instructions behave as expected, the kernel has been shipping of_pmem and papr_scm for several cycles now, you're saying that the DAX applications written against those platforms are going to be broken on P8 and P9? > > I'm thinking the kernel > > should go as far as to disable DAX operation by default on new > > hardware until userspace asserts that it is prepared to switch to the > > new implementation. Is there any other way to ensure the forward > > compatibility of deployed ppc64 DAX applications? > > AFAIU there is no released persistent memory hardware on ppc64 platform > and we need to make sure before applications get enabled to use these > persistent memory devices, they should switch to use the new > instruction? Right, I want the kernel to offer some level of safety here because everything you are describing sounds like a flag day conversion. Am I misreading? Is there some other gate that prevents existing users of of_pmem and papr_scm from having their expectations violated when running on P8 / P9 hardware? Maybe there's tighter ecosystem control that I'm just not familiar with, I'm only going off the fact that the kernel has shipped a non-zero number of NVDIMM drivers that build with ARCH=ppc64 for several cycles.
Dan Williams <dan.j.williams@intel.com> writes: > On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: ... >> Applications using new instructions will behave as expected when running >> on P8 and P9. Only future hardware will differentiate between 'dcbf' and >> 'dcbfps' > > Right, this is the problem. Applications using new instructions behave > as expected, the kernel has been shipping of_pmem and papr_scm for > several cycles now, you're saying that the DAX applications written > against those platforms are going to be broken on P8 and P9? The expecation is that both kernel and userspace would get upgraded to use the new instruction before actual persistent memory devices are made available. > >> > I'm thinking the kernel >> > should go as far as to disable DAX operation by default on new >> > hardware until userspace asserts that it is prepared to switch to the >> > new implementation. Is there any other way to ensure the forward >> > compatibility of deployed ppc64 DAX applications? >> >> AFAIU there is no released persistent memory hardware on ppc64 platform >> and we need to make sure before applications get enabled to use these >> persistent memory devices, they should switch to use the new >> instruction? > > Right, I want the kernel to offer some level of safety here because > everything you are describing sounds like a flag day conversion. Am I > misreading? Is there some other gate that prevents existing users of > of_pmem and papr_scm from having their expectations violated when > running on P8 / P9 hardware? Maybe there's tighter ecosystem control > that I'm just not familiar with, I'm only going off the fact that the > kernel has shipped a non-zero number of NVDIMM drivers that build with > ARCH=ppc64 for several cycles. If we are looking at adding changes to kernel that will prevent a kernel from running on newer hardware in a specific case, we could as well take the changes to get the kernel use the newer instructions right? But I agree with your concern that if we have older kernel/applications that continue to use `dcbf` on future hardware we will end up having issues w.r.t powerfail consistency. The plan is what you outlined above as tighter ecosystem control. Considering we don't have a pmem device generally available, we get both kernel and userspace upgraded to use these new instructions before such a device is made available. -aneesh
On Tue, May 19, 2020 at 6:53 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > Dan Williams <dan.j.williams@intel.com> writes: > > > On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > > ... > > >> Applications using new instructions will behave as expected when running > >> on P8 and P9. Only future hardware will differentiate between 'dcbf' and > >> 'dcbfps' > > > > Right, this is the problem. Applications using new instructions behave > > as expected, the kernel has been shipping of_pmem and papr_scm for > > several cycles now, you're saying that the DAX applications written > > against those platforms are going to be broken on P8 and P9? > > The expecation is that both kernel and userspace would get upgraded to > use the new instruction before actual persistent memory devices are > made available. > > > > >> > I'm thinking the kernel > >> > should go as far as to disable DAX operation by default on new > >> > hardware until userspace asserts that it is prepared to switch to the > >> > new implementation. Is there any other way to ensure the forward > >> > compatibility of deployed ppc64 DAX applications? > >> > >> AFAIU there is no released persistent memory hardware on ppc64 platform > >> and we need to make sure before applications get enabled to use these > >> persistent memory devices, they should switch to use the new > >> instruction? > > > > Right, I want the kernel to offer some level of safety here because > > everything you are describing sounds like a flag day conversion. Am I > > misreading? Is there some other gate that prevents existing users of > > of_pmem and papr_scm from having their expectations violated when > > running on P8 / P9 hardware? Maybe there's tighter ecosystem control > > that I'm just not familiar with, I'm only going off the fact that the > > kernel has shipped a non-zero number of NVDIMM drivers that build with > > ARCH=ppc64 for several cycles. > > If we are looking at adding changes to kernel that will prevent a kernel > from running on newer hardware in a specific case, we could as well take > the changes to get the kernel use the newer instructions right? Oh, no, I'm not talking about stopping the kernel from running. I'm simply recommending that support for MAP_SYNC mappings (userspace managed flushing) be disabled by default on PPC with either a compile-time or run-time default to assert that userspace has been audited for legacy applications or that the platform owner is otherwise willing to take the risk. > But I agree with your concern that if we have older kernel/applications > that continue to use `dcbf` on future hardware we will end up > having issues w.r.t powerfail consistency. The plan is what you outlined > above as tighter ecosystem control. Considering we don't have a pmem > device generally available, we get both kernel and userspace upgraded > to use these new instructions before such a device is made available. Ok, I think a compile time kernel option with a runtime override satisfies my concern. Does that work for you?
Dan Williams <dan.j.williams@intel.com> writes: > On Tue, May 19, 2020 at 6:53 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> Dan Williams <dan.j.williams@intel.com> writes: >> >> > On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V >> > <aneesh.kumar@linux.ibm.com> wrote: >> >> ... >> >> >> Applications using new instructions will behave as expected when running >> >> on P8 and P9. Only future hardware will differentiate between 'dcbf' and >> >> 'dcbfps' >> > >> > Right, this is the problem. Applications using new instructions behave >> > as expected, the kernel has been shipping of_pmem and papr_scm for >> > several cycles now, you're saying that the DAX applications written >> > against those platforms are going to be broken on P8 and P9? >> >> The expecation is that both kernel and userspace would get upgraded to >> use the new instruction before actual persistent memory devices are >> made available. >> >> > >> >> > I'm thinking the kernel >> >> > should go as far as to disable DAX operation by default on new >> >> > hardware until userspace asserts that it is prepared to switch to the >> >> > new implementation. Is there any other way to ensure the forward >> >> > compatibility of deployed ppc64 DAX applications? >> >> >> >> AFAIU there is no released persistent memory hardware on ppc64 platform >> >> and we need to make sure before applications get enabled to use these >> >> persistent memory devices, they should switch to use the new >> >> instruction? >> > >> > Right, I want the kernel to offer some level of safety here because >> > everything you are describing sounds like a flag day conversion. Am I >> > misreading? Is there some other gate that prevents existing users of >> > of_pmem and papr_scm from having their expectations violated when >> > running on P8 / P9 hardware? Maybe there's tighter ecosystem control >> > that I'm just not familiar with, I'm only going off the fact that the >> > kernel has shipped a non-zero number of NVDIMM drivers that build with >> > ARCH=ppc64 for several cycles. >> >> If we are looking at adding changes to kernel that will prevent a kernel >> from running on newer hardware in a specific case, we could as well take >> the changes to get the kernel use the newer instructions right? > > Oh, no, I'm not talking about stopping the kernel from running. I'm > simply recommending that support for MAP_SYNC mappings (userspace > managed flushing) be disabled by default on PPC with either a > compile-time or run-time default to assert that userspace has been > audited for legacy applications or that the platform owner is > otherwise willing to take the risk. > >> But I agree with your concern that if we have older kernel/applications >> that continue to use `dcbf` on future hardware we will end up >> having issues w.r.t powerfail consistency. The plan is what you outlined >> above as tighter ecosystem control. Considering we don't have a pmem >> device generally available, we get both kernel and userspace upgraded >> to use these new instructions before such a device is made available. > > Ok, I think a compile time kernel option with a runtime override > satisfies my concern. Does that work for you? something like below? But this still won't handle devdax mmap right? diff --git a/arch/arm64/include/asm/libnvdimm.h b/arch/arm64/include/asm/libnvdimm.h new file mode 100644 index 000000000000..aee697a72537 --- /dev/null +++ b/arch/arm64/include/asm/libnvdimm.h @@ -0,0 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __ARCH_LIBNVDIMM_H__ +#define __ARCH_LIBNVDIMM_H__ + +#endif /* __ARCH_LIBNVDIMM_H__ */ diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h new file mode 100644 index 000000000000..da479200bfb8 --- /dev/null +++ b/arch/powerpc/include/asm/libnvdimm.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __ARCH_LIBNVDIMM_H__ +#define __ARCH_LIBNVDIMM_H__ + +#define arch_disable_sync_nvdimm arch_disable_sync_nvdimm +extern bool arch_disable_sync_nvdimm(void); + +#endif /* __ARCH_LIBNVDIMM_H__ */ diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c index 0666a8d29596..3ce4fb4f167b 100644 --- a/arch/powerpc/lib/pmem.c +++ b/arch/powerpc/lib/pmem.c @@ -9,6 +9,8 @@ #include <asm/cacheflush.h> + +static bool sync_fault = IS_ENABLED(CONFIG_PPC_NVDIMM_SYNC_FAULT); /* * CONFIG_ARCH_HAS_PMEM_API symbols */ @@ -57,3 +59,16 @@ void memcpy_page_flushcache(char *to, struct page *page, size_t offset, memcpy_flushcache(to, page_to_virt(page) + offset, len); } EXPORT_SYMBOL(memcpy_page_flushcache); + +bool arch_disable_sync_nvdimm(void) +{ + return !sync_fault; +} + +static int __init parse_sync_fault(char *p) +{ + sync_fault = true; + return 0; +} +early_param("enable_sync_fault", parse_sync_fault); + diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 27a81c291be8..dde11d75a746 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -383,6 +383,15 @@ config PPC_KUEP If you're unsure, say Y. +config PPC_NVDIMM_SYNC_FAULT + bool "Synchronous fault support (MAP_SYNC)" + default n + help + Enable support for synchronous fault with nvdimm namespaces. + + If you're unsure, say N. + + config PPC_HAVE_KUAP bool diff --git a/arch/x86/include/asm/libnvdimm.h b/arch/x86/include/asm/libnvdimm.h new file mode 100644 index 000000000000..aee697a72537 --- /dev/null +++ b/arch/x86/include/asm/libnvdimm.h @@ -0,0 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __ARCH_LIBNVDIMM_H__ +#define __ARCH_LIBNVDIMM_H__ + +#endif /* __ARCH_LIBNVDIMM_H__ */ diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index ccbb5b43b8b2..74a0809491af 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -1278,6 +1278,13 @@ bool is_nvdimm_sync(struct nd_region *nd_region) if (is_nd_volatile(&nd_region->dev)) return true; + /* + * If arch is forcing a synchronous fault + * disable. + */ + if (arch_disable_sync_nvdimm()) + return false; + return is_nd_pmem(&nd_region->dev) && !test_bit(ND_REGION_ASYNC, &nd_region->flags); } diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index 18da4059be09..891449aebe91 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -13,6 +13,8 @@ #include <linux/spinlock.h> #include <linux/bio.h> +#include <asm/libnvdimm.h> + struct badrange_entry { u64 start; u64 length; @@ -286,4 +288,12 @@ static inline void arch_invalidate_pmem(void *addr, size_t size) } #endif +#ifndef arch_disable_sync_nvdimm +#define arch_disable_sync_nvdimm arch_disable_sync_nvdimm +static inline bool arch_disable_sync_nvdimm() +{ + return false; +} +#endif + #endif /* __LIBNVDIMM_H__ */
Dan Williams <dan.j.williams@intel.com> writes: >> But I agree with your concern that if we have older kernel/applications >> that continue to use `dcbf` on future hardware we will end up >> having issues w.r.t powerfail consistency. The plan is what you outlined >> above as tighter ecosystem control. Considering we don't have a pmem >> device generally available, we get both kernel and userspace upgraded >> to use these new instructions before such a device is made available. I thought power already supported NVDIMM-N, no? So are you saying that those devices will continue to work with the existing flushing and fencing mechanisms? > Ok, I think a compile time kernel option with a runtime override > satisfies my concern. Does that work for you? The compile time option only helps when running newer kernels. I'm not sure how you would even begin to audit userspace applications (keep in mind, not every application is open source, and not every application uses pmdk). I also question the merits of forcing the administrator to make the determination of whether all applications on the system will work properly. Really, you have to rely on the vendor to tell you the platform is supported, and at that point, why put further hurdles in the way? The decision to require different instructions on ppc is unfortunate, but one I'm sure we have no control over. I don't see any merit in the kernel disallowing MAP_SYNC access on these platforms. Ideally, we'd have some way of ensuring older kernels don't work with these new platforms, but I don't think that's possible. Moving on to the patch itself--Aneesh, have you audited other persistent memory users in the kernel? For example, drivers/md/dm-writecache.c does this: static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios) { if (WC_MODE_PMEM(wc)) wmb(); <========== else ssd_commit_flushed(wc, wait_for_ios); } I believe you'll need to make modifications there. Cheers, Jeff
On 5/21/20 8:08 PM, Jeff Moyer wrote: > Dan Williams <dan.j.williams@intel.com> writes: > >>> But I agree with your concern that if we have older kernel/applications >>> that continue to use `dcbf` on future hardware we will end up >>> having issues w.r.t powerfail consistency. The plan is what you outlined >>> above as tighter ecosystem control. Considering we don't have a pmem >>> device generally available, we get both kernel and userspace upgraded >>> to use these new instructions before such a device is made available. > > I thought power already supported NVDIMM-N, no? So are you saying that > those devices will continue to work with the existing flushing and > fencing mechanisms? > yes. these devices can continue to use 'dcbf + hwsync' as long as we are running them on P9. >> Ok, I think a compile time kernel option with a runtime override >> satisfies my concern. Does that work for you? > > The compile time option only helps when running newer kernels. I'm not > sure how you would even begin to audit userspace applications (keep in > mind, not every application is open source, and not every application > uses pmdk). I also question the merits of forcing the administrator to > make the determination of whether all applications on the system will > work properly. Really, you have to rely on the vendor to tell you the > platform is supported, and at that point, why put further hurdles in the > way? > > The decision to require different instructions on ppc is unfortunate, > but one I'm sure we have no control over. I don't see any merit in the > kernel disallowing MAP_SYNC access on these platforms. Ideally, we'd > have some way of ensuring older kernels don't work with these new > platforms, but I don't think that's possible. > I am currently looking at the possibility of firmware present these devices with different device-tree compat values. So that older /existing kernel won't initialize the device on newer systems. Is that a good compromise? We still can end up with older userspace and newer kernel. One of the option suggested by Jan Kara is to use a prctl flag to control that? (intead of kernel parameter option I posted before) > Moving on to the patch itself--Aneesh, have you audited other persistent > memory users in the kernel? For example, drivers/md/dm-writecache.c does > this: > > static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios) > { > if (WC_MODE_PMEM(wc)) > wmb(); <========== > else > ssd_commit_flushed(wc, wait_for_ios); > } > > I believe you'll need to make modifications there. > Correct. Thanks for catching that. I don't understand dm much, wondering how this will work with non-synchronous DAX device? -aneesh
On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > On 5/21/20 8:08 PM, Jeff Moyer wrote: > > Dan Williams <dan.j.williams@intel.com> writes: > > > >>> But I agree with your concern that if we have older kernel/applications > >>> that continue to use `dcbf` on future hardware we will end up > >>> having issues w.r.t powerfail consistency. The plan is what you outlined > >>> above as tighter ecosystem control. Considering we don't have a pmem > >>> device generally available, we get both kernel and userspace upgraded > >>> to use these new instructions before such a device is made available. > > > > I thought power already supported NVDIMM-N, no? So are you saying that > > those devices will continue to work with the existing flushing and > > fencing mechanisms? > > > > yes. these devices can continue to use 'dcbf + hwsync' as long as we are > running them on P9. > > > >> Ok, I think a compile time kernel option with a runtime override > >> satisfies my concern. Does that work for you? > > > > The compile time option only helps when running newer kernels. I'm not > > sure how you would even begin to audit userspace applications (keep in > > mind, not every application is open source, and not every application > > uses pmdk). I also question the merits of forcing the administrator to > > make the determination of whether all applications on the system will > > work properly. Really, you have to rely on the vendor to tell you the > > platform is supported, and at that point, why put further hurdles in the > > way? > > > > The decision to require different instructions on ppc is unfortunate, > > but one I'm sure we have no control over. I don't see any merit in the > > kernel disallowing MAP_SYNC access on these platforms. Ideally, we'd > > have some way of ensuring older kernels don't work with these new > > platforms, but I don't think that's possible. > > > > > I am currently looking at the possibility of firmware present these > devices with different device-tree compat values. So that older > /existing kernel won't initialize the device on newer systems. Is that a > good compromise? We still can end up with older userspace and newer > kernel. One of the option suggested by Jan Kara is to use a prctl flag > to control that? (intead of kernel parameter option I posted before) > > > > Moving on to the patch itself--Aneesh, have you audited other persistent > > memory users in the kernel? For example, drivers/md/dm-writecache.c does > > this: > > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios) > > { > > if (WC_MODE_PMEM(wc)) > > wmb(); <========== > > else > > ssd_commit_flushed(wc, wait_for_ios); > > } > > > > I believe you'll need to make modifications there. > > > > Correct. Thanks for catching that. > > > I don't understand dm much, wondering how this will work with > non-synchronous DAX device? That's a good point. DM-writecache needs to be cognizant of things like virtio-pmem that violate the rule that persisent memory writes can be flushed by CPU functions rather than calling back into the driver. It seems we need to always make the flush case a dax_operation callback to account for this.
On Thu, May 21, 2020 at 7:39 AM Jeff Moyer <jmoyer@redhat.com> wrote: > > Dan Williams <dan.j.williams@intel.com> writes: > > >> But I agree with your concern that if we have older kernel/applications > >> that continue to use `dcbf` on future hardware we will end up > >> having issues w.r.t powerfail consistency. The plan is what you outlined > >> above as tighter ecosystem control. Considering we don't have a pmem > >> device generally available, we get both kernel and userspace upgraded > >> to use these new instructions before such a device is made available. > > I thought power already supported NVDIMM-N, no? So are you saying that > those devices will continue to work with the existing flushing and > fencing mechanisms? > > > Ok, I think a compile time kernel option with a runtime override > > satisfies my concern. Does that work for you? > > The compile time option only helps when running newer kernels. I'm not > sure how you would even begin to audit userspace applications (keep in > mind, not every application is open source, and not every application > uses pmdk). I also question the merits of forcing the administrator to > make the determination of whether all applications on the system will > work properly. Really, you have to rely on the vendor to tell you the > platform is supported, and at that point, why put further hurdles in the > way? I'm thoroughly confused by this. I thought this was exactly the role of a Linux distribution vendor. ISVs qualify their application on a hardware-platform + distribution combination and the distribution owns picking ABI defaults like CONFIG_SYSFS_DEPRECATED regardless of whether they can guarantee that all apps are updated to the new semantics. The administrator is not forced, the administrator if afforded an override in the extreme case that they find an exception to what was qualified and need to override the distribution's compile-time choice. > > The decision to require different instructions on ppc is unfortunate, > but one I'm sure we have no control over. I don't see any merit in the > kernel disallowing MAP_SYNC access on these platforms. Ideally, we'd > have some way of ensuring older kernels don't work with these new > platforms, but I don't think that's possible. I see disabling MAP_SYNC as the more targeted form of "ensursing older kernels don't work. So I guess we agree that something should break when baseline assumptions change, we just don't yet agree on where that break should happen?
On Thu, 21 May 2020, Dan Williams wrote: > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: > > > > > Moving on to the patch itself--Aneesh, have you audited other persistent > > > memory users in the kernel? For example, drivers/md/dm-writecache.c does > > > this: > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios) > > > { > > > if (WC_MODE_PMEM(wc)) > > > wmb(); <========== > > > else > > > ssd_commit_flushed(wc, wait_for_ios); > > > } > > > > > > I believe you'll need to make modifications there. > > > > > > > Correct. Thanks for catching that. > > > > > > I don't understand dm much, wondering how this will work with > > non-synchronous DAX device? > > That's a good point. DM-writecache needs to be cognizant of things > like virtio-pmem that violate the rule that persisent memory writes > can be flushed by CPU functions rather than calling back into the > driver. It seems we need to always make the flush case a dax_operation > callback to account for this. dm-writecache is normally sitting on the top of dm-linear, so it would need to pass the wmb() call through the dm core and dm-linear target ... that would slow it down ... I remember that you already did it this way some times ago and then removed it. What's the exact problem with POWER? Could the POWER system have two types of persistent memory that need two different ways of flushing? Mikulas
On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote: > > > On Thu, 21 May 2020, Dan Williams wrote: > > > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > Moving on to the patch itself--Aneesh, have you audited other persistent > > > > memory users in the kernel? For example, drivers/md/dm-writecache.c does > > > > this: > > > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios) > > > > { > > > > if (WC_MODE_PMEM(wc)) > > > > wmb(); <========== > > > > else > > > > ssd_commit_flushed(wc, wait_for_ios); > > > > } > > > > > > > > I believe you'll need to make modifications there. > > > > > > > > > > Correct. Thanks for catching that. > > > > > > > > > I don't understand dm much, wondering how this will work with > > > non-synchronous DAX device? > > > > That's a good point. DM-writecache needs to be cognizant of things > > like virtio-pmem that violate the rule that persisent memory writes > > can be flushed by CPU functions rather than calling back into the > > driver. It seems we need to always make the flush case a dax_operation > > callback to account for this. > > dm-writecache is normally sitting on the top of dm-linear, so it would > need to pass the wmb() call through the dm core and dm-linear target ... > that would slow it down ... I remember that you already did it this way > some times ago and then removed it. > > What's the exact problem with POWER? Could the POWER system have two types > of persistent memory that need two different ways of flushing? As far as I understand the discussion so far - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency - on POWER $newhardware uses $newinstruction to ensure pmem consistency (compatible with $oldinstruction on $oldhardware) - on some platforms instead of barrier instruction a callback into the driver is issued to ensure consistency None of this is reflected by the dm driver. Thanks Michal
On 5/22/20 3:01 PM, Michal Suchánek wrote: > On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote: >> >> >> On Thu, 21 May 2020, Dan Williams wrote: >> >>> On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V >>> <aneesh.kumar@linux.ibm.com> wrote: >>>> >>>>> Moving on to the patch itself--Aneesh, have you audited other persistent >>>>> memory users in the kernel? For example, drivers/md/dm-writecache.c does >>>>> this: >>>>> >>>>> static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios) >>>>> { >>>>> if (WC_MODE_PMEM(wc)) >>>>> wmb(); <========== >>>>> else >>>>> ssd_commit_flushed(wc, wait_for_ios); >>>>> } >>>>> >>>>> I believe you'll need to make modifications there. >>>>> >>>> >>>> Correct. Thanks for catching that. >>>> >>>> >>>> I don't understand dm much, wondering how this will work with >>>> non-synchronous DAX device? >>> >>> That's a good point. DM-writecache needs to be cognizant of things >>> like virtio-pmem that violate the rule that persisent memory writes >>> can be flushed by CPU functions rather than calling back into the >>> driver. It seems we need to always make the flush case a dax_operation >>> callback to account for this. >> >> dm-writecache is normally sitting on the top of dm-linear, so it would >> need to pass the wmb() call through the dm core and dm-linear target ... >> that would slow it down ... I remember that you already did it this way >> some times ago and then removed it. >> >> What's the exact problem with POWER? Could the POWER system have two types >> of persistent memory that need two different ways of flushing? > > As far as I understand the discussion so far > > - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency > - on POWER $newhardware uses $newinstruction to ensure pmem consistency > (compatible with $oldinstruction on $oldhardware) Correct. > - on some platforms instead of barrier instruction a callback into the > driver is issued to ensure consistency This is virtio-pmem only at this point IIUC. > > None of this is reflected by the dm driver. > -aneesh
On Fri, 22 May 2020, Aneesh Kumar K.V wrote: > On 5/22/20 3:01 PM, Michal Suchánek wrote: > > On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote: > > > > > > > > > On Thu, 21 May 2020, Dan Williams wrote: > > > > > > > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > Moving on to the patch itself--Aneesh, have you audited other > > > > > > persistent > > > > > > memory users in the kernel? For example, drivers/md/dm-writecache.c > > > > > > does > > > > > > this: > > > > > > > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool > > > > > > wait_for_ios) > > > > > > { > > > > > > if (WC_MODE_PMEM(wc)) > > > > > > wmb(); <========== > > > > > > else > > > > > > ssd_commit_flushed(wc, wait_for_ios); > > > > > > } > > > > > > > > > > > > I believe you'll need to make modifications there. > > > > > > > > > > > > > > > > Correct. Thanks for catching that. > > > > > > > > > > > > > > > I don't understand dm much, wondering how this will work with > > > > > non-synchronous DAX device? > > > > > > > > That's a good point. DM-writecache needs to be cognizant of things > > > > like virtio-pmem that violate the rule that persisent memory writes > > > > can be flushed by CPU functions rather than calling back into the > > > > driver. It seems we need to always make the flush case a dax_operation > > > > callback to account for this. > > > > > > dm-writecache is normally sitting on the top of dm-linear, so it would > > > need to pass the wmb() call through the dm core and dm-linear target ... > > > that would slow it down ... I remember that you already did it this way > > > some times ago and then removed it. > > > > > > What's the exact problem with POWER? Could the POWER system have two types > > > of persistent memory that need two different ways of flushing? > > > > As far as I understand the discussion so far > > > > - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency > > - on POWER $newhardware uses $newinstruction to ensure pmem consistency > > (compatible with $oldinstruction on $oldhardware) > > Correct. > > > - on some platforms instead of barrier instruction a callback into the > > driver is issued to ensure consistency > > This is virtio-pmem only at this point IIUC. > > -aneesh And does the virtio-pmem driver track which pages are dirty? Or does it need to specify the range of pages to flush in the flush function? > > None of this is reflected by the dm driver. We could make a new dax method: void *(dax_get_flush_function)(void); This would return a pointer to "wmb()" on x86 and something else on Power. The method "dax_get_flush_function" would be called only once when initializing the writecache driver (because the call would be slow because it would have to go through the DM stack) and then, the returned function would be called each time we need write ordering. The returned function would do just "sfence; ret". Mikulas
On Fri, May 22, 2020 at 09:01:17AM -0400, Mikulas Patocka wrote: > > > On Fri, 22 May 2020, Aneesh Kumar K.V wrote: > > > On 5/22/20 3:01 PM, Michal Suchánek wrote: > > > On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Thu, 21 May 2020, Dan Williams wrote: > > > > > > > > > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V > > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > > > Moving on to the patch itself--Aneesh, have you audited other > > > > > > > persistent > > > > > > > memory users in the kernel? For example, drivers/md/dm-writecache.c > > > > > > > does > > > > > > > this: > > > > > > > > > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool > > > > > > > wait_for_ios) > > > > > > > { > > > > > > > if (WC_MODE_PMEM(wc)) > > > > > > > wmb(); <========== > > > > > > > else > > > > > > > ssd_commit_flushed(wc, wait_for_ios); > > > > > > > } > > > > > > > > > > > > > > I believe you'll need to make modifications there. > > > > > > > > > > > > > > > > > > > Correct. Thanks for catching that. > > > > > > > > > > > > > > > > > > I don't understand dm much, wondering how this will work with > > > > > > non-synchronous DAX device? > > > > > > > > > > That's a good point. DM-writecache needs to be cognizant of things > > > > > like virtio-pmem that violate the rule that persisent memory writes > > > > > can be flushed by CPU functions rather than calling back into the > > > > > driver. It seems we need to always make the flush case a dax_operation > > > > > callback to account for this. > > > > > > > > dm-writecache is normally sitting on the top of dm-linear, so it would > > > > need to pass the wmb() call through the dm core and dm-linear target ... > > > > that would slow it down ... I remember that you already did it this way > > > > some times ago and then removed it. > > > > > > > > What's the exact problem with POWER? Could the POWER system have two types > > > > of persistent memory that need two different ways of flushing? > > > > > > As far as I understand the discussion so far > > > > > > - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency > > > - on POWER $newhardware uses $newinstruction to ensure pmem consistency > > > (compatible with $oldinstruction on $oldhardware) > > > > Correct. > > > > > - on some platforms instead of barrier instruction a callback into the > > > driver is issued to ensure consistency > > > > This is virtio-pmem only at this point IIUC. > > > > -aneesh > > And does the virtio-pmem driver track which pages are dirty? Or does it > need to specify the range of pages to flush in the flush function? > > > > None of this is reflected by the dm driver. > > We could make a new dax method: > void *(dax_get_flush_function)(void); > > This would return a pointer to "wmb()" on x86 and something else on Power. > > The method "dax_get_flush_function" would be called only once when > initializing the writecache driver (because the call would be slow because > it would have to go through the DM stack) and then, the returned function > would be called each time we need write ordering. The returned function > would do just "sfence; ret". Hello, as far as I understand the code virtio_pmem has a fush function defined which indeed can make use of the region properties, such as memory range. If such function exists you need quivalent of sync() - call into the device in question. If it does not calling arch_pmem_flush_barrier() instead of wmb() should suffice. I am not aware of an interface to determine if the flush function exists for a particular region. Thanks Michal
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index ccbb5b43b8b2..88ea34a9c7fd 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -1216,13 +1216,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region) idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8)); /* - * The first wmb() is needed to 'sfence' all previous writes - * such that they are architecturally visible for the platform - * buffer flush. Note that we've already arranged for pmem + * The first arch_pmem_flush_barrier() is needed to 'sfence' all + * previous writes such that they are architecturally visible for + * the platform buffer flush. Note that we've already arranged for pmem * writes to avoid the cache via memcpy_flushcache(). The final * wmb() ensures ordering for the NVDIMM flush write. */ - wmb(); + arch_pmem_flush_barrier(); for (i = 0; i < nd_region->ndr_mappings; i++) if (ndrd_get_flush_wpq(ndrd, i, 0)) writeq(1, ndrd_get_flush_wpq(ndrd, i, idx)); diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index 18da4059be09..66f6c65bd789 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, size_t size) } #endif +#ifndef arch_pmem_flush_barrier +#define arch_pmem_flush_barrier() wmb() +#endif + #endif /* __LIBNVDIMM_H__ */
Architectures like ppc64 provide persistent memory specific barriers that will ensure that all stores for which the modifications are written to persistent storage by preceding dcbfps and dcbstps instructions have updated persistent storage before any data access or data transfer caused by subsequent instructions is initiated. This is in addition to the ordering done by wmb() Update nvdimm core such that architecture can use barriers other than wmb to ensure all previous writes are architecturally visible for the platform buffer flush. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> --- drivers/nvdimm/region_devs.c | 8 ++++---- include/linux/libnvdimm.h | 4 ++++ 2 files changed, 8 insertions(+), 4 deletions(-)