Message ID: 20180519052635.567438191@debian.vm (mailing list archive)
State: Superseded, archived
Delegated to: Mike Snitzer
On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote: > Use new API for flushing persistent memory. The sentence doesn't make much sense. 'A new API', 'A better abstraction' maybe? > > The problem is this: > * on X86-64, non-temporal stores have the best performance > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > should flush cache as late as possible, because it performs better this > way. > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > data persistently, all three functions must be called. > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > pmem_commit is wmb. > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > pmem_commit is empty. All these should be provided by the pmem layer, and be properly documented. And be sorted before adding your new target that uses them. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
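The per-architecture mapping quoted above has roughly this shape (a hedged userspace sketch, not the driver's actual code; the kernel primitives memcpy_flushcache, arch_wb_cache_pmem and wmb are stubbed as no-ops here so the sketch is self-contained and compilable):

```c
#include <string.h>
#include <stddef.h>

/* Stubs standing in for the kernel primitives named in the thread:
 * memcpy_flushcache(), arch_wb_cache_pmem() and wmb(). They do no
 * real flushing here; only the call structure is illustrated. */
static void memcpy_flushcache_stub(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
static void arch_wb_cache_pmem_stub(void *addr, size_t n) { (void)addr; (void)n; }
static void wmb_stub(void) { }

#if defined(__x86_64__)
/* X86-64: non-temporal stores flush as part of the copy,
 * so the separate flush step is empty and commit is the fence. */
static void pmem_memcpy(void *dst, const void *src, size_t n)
{
    memcpy_flushcache_stub(dst, src, n);
}
static void pmem_flush(void *addr, size_t n) { (void)addr; (void)n; /* empty */ }
static void pmem_commit(void) { wmb_stub(); }
#else
/* ARM64: cached copy now, cache writeback as late as possible,
 * and no separate commit step. */
static void pmem_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
static void pmem_flush(void *addr, size_t n) { arch_wb_cache_pmem_stub(addr, n); }
static void pmem_commit(void) { /* empty */ }
#endif

/* To commit data persistently, all three must be called, in order. */
static void commit_example(char *pmem_buf, const char *data, size_t n)
{
    pmem_memcpy(pmem_buf, data, n);
    pmem_flush(pmem_buf, n);
    pmem_commit();
}
```

Either branch compiles to the same three-call sequence at the call site, which is the point of the wrapper: the caller never needs the `#if`.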
On Tue, May 22 2018 at 2:39am -0400, Christoph Hellwig <hch@infradead.org> wrote: > On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote: > > Use new API for flushing persistent memory. > > The sentence doesnt make much sense. 'A new API', 'A better > abstraction' maybe? > > > > > The problem is this: > > * on X86-64, non-temporal stores have the best performance > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > > should flush cache as late as possible, because it performs better this > > way. > > > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > > data persistently, all three functions must be called. > > > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > > > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > > pmem_commit is wmb. > > > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > > pmem_commit is empty. > > All these should be provided by the pmem layer, and be properly > documented. And be sorted before adding your new target that uses > them. I don't see that as a hard requirement. Mikulas did the work to figure out what is more optimal on x86_64 vs arm64. It makes a difference for his target and that is sufficient to carry it locally until/when it is either elevated to pmem. We cannot even get x86 and swait maintainers to reply to repeat requests for review. Stacking up further deps on pmem isn't high on my list. Mike
On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote: > On Tue, May 22 2018 at 2:39am -0400, > Christoph Hellwig <hch@infradead.org> wrote: > >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote: >> > Use new API for flushing persistent memory. >> >> The sentence doesnt make much sense. 'A new API', 'A better >> abstraction' maybe? >> >> > >> > The problem is this: >> > * on X86-64, non-temporal stores have the best performance >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We >> > should flush cache as late as possible, because it performs better this >> > way. >> > >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit >> > data persistently, all three functions must be called. >> > >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. >> > >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and >> > pmem_commit is wmb. >> > >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and >> > pmem_commit is empty. >> >> All these should be provided by the pmem layer, and be properly >> documented. And be sorted before adding your new target that uses >> them. > > I don't see that as a hard requirement. Mikulas did the work to figure > out what is more optimal on x86_64 vs amd64. It makes a difference for > his target and that is sufficient to carry it locally until/when it is > either elevated to pmem. > > We cannot even get x86 and swait maintainers to reply to repeat requests > for review. Stacking up further deps on pmem isn't high on my list. > Except I'm being responsive. I agree with Christoph that we should build pmem helpers at an architecture level and not per-driver. Let's make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch up to x86 in this space. 
We already have PowerPC enabling PMEM API, so I don't see an unreasonable barrier to ask the same of ARM. This patch is not even cc'd to linux-arm-kernel. Has the subject been broached with them?
On Tue, May 22 2018 at 3:00pm -0400, Dan Williams <dan.j.williams@intel.com> wrote: > On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote: > > On Tue, May 22 2018 at 2:39am -0400, > > Christoph Hellwig <hch@infradead.org> wrote: > > > >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote: > >> > Use new API for flushing persistent memory. > >> > >> The sentence doesnt make much sense. 'A new API', 'A better > >> abstraction' maybe? > >> > >> > > >> > The problem is this: > >> > * on X86-64, non-temporal stores have the best performance > >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > >> > should flush cache as late as possible, because it performs better this > >> > way. > >> > > >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > >> > data persistently, all three functions must be called. > >> > > >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > >> > > >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > >> > pmem_commit is wmb. > >> > > >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > >> > pmem_commit is empty. > >> > >> All these should be provided by the pmem layer, and be properly > >> documented. And be sorted before adding your new target that uses > >> them. > > > > I don't see that as a hard requirement. Mikulas did the work to figure > > out what is more optimal on x86_64 vs amd64. It makes a difference for > > his target and that is sufficient to carry it locally until/when it is > > either elevated to pmem. > > > > We cannot even get x86 and swait maintainers to reply to repeat requests > > for review. Stacking up further deps on pmem isn't high on my list. > > > > Except I'm being responsive. 
Except you're looking to immediately punt to linux-arm-kernel ;) > I agree with Christoph that we should > build pmem helpers at an architecture level and not per-driver. Let's > make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch > up to x86 in this space. We already have PowerPC enabling PMEM API, so > I don't see an unreasonable barrier to ask the same of ARM. This patch > is not even cc'd to linux-arm-kernel. Has the subject been broached > with them? No idea. Not by me. The thing is, I'm no expert in pmem. You are. Coordinating the change with ARM et al feels unnecessarily limiting and quickly moves outside my control. Serious question: Why can't this code land in this dm-writecache target and then be lifted (or obsoleted)? But if you think it worthwhile to force ARM to step up then fine. That does limit the availability of using writecache on ARM while they get the PMEM API together. I'll do whatever you want.. just put the smack down and tell me how it is ;) Thanks, Mike
On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer <snitzer@redhat.com> wrote: > On Tue, May 22 2018 at 3:00pm -0400, > Dan Williams <dan.j.williams@intel.com> wrote: > >> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote: >> > On Tue, May 22 2018 at 2:39am -0400, >> > Christoph Hellwig <hch@infradead.org> wrote: >> > >> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote: >> >> > Use new API for flushing persistent memory. >> >> >> >> The sentence doesnt make much sense. 'A new API', 'A better >> >> abstraction' maybe? >> >> >> >> > >> >> > The problem is this: >> >> > * on X86-64, non-temporal stores have the best performance >> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We >> >> > should flush cache as late as possible, because it performs better this >> >> > way. >> >> > >> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit >> >> > data persistently, all three functions must be called. >> >> > >> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign >> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. >> >> > >> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and >> >> > pmem_commit is wmb. >> >> > >> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and >> >> > pmem_commit is empty. >> >> >> >> All these should be provided by the pmem layer, and be properly >> >> documented. And be sorted before adding your new target that uses >> >> them. >> > >> > I don't see that as a hard requirement. Mikulas did the work to figure >> > out what is more optimal on x86_64 vs amd64. It makes a difference for >> > his target and that is sufficient to carry it locally until/when it is >> > either elevated to pmem. >> > >> > We cannot even get x86 and swait maintainers to reply to repeat requests >> > for review. Stacking up further deps on pmem isn't high on my list. 
>> > >> >> Except I'm being responsive. > > Except you're looking to immediately punt to linux-arm-kernel ;) Well, I'm not, not really. I'm saying drop ARM support, it's not ready. > >> I agree with Christoph that we should >> build pmem helpers at an architecture level and not per-driver. Let's >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch >> up to x86 in this space. We already have PowerPC enabling PMEM API, so >> I don't see an unreasonable barrier to ask the same of ARM. This patch >> is not even cc'd to linux-arm-kernel. Has the subject been broached >> with them? > > No idea. Not by me. > > The thing is, I'm no expert in pmem. You are. Coordinating the change > with ARM et al feels unnecessarily limiting and quicky moves outside my > control. > > Serious question: Why can't this code land in this dm-writecache target > and then be lifted (or obsoleted)? Because we already have an API, and we don't want to promote local solutions to global problems, or carry unnecessary technical debt. > > But if you think it worthwhile to force ARM to step up then fine. That > does limit the availability of using writecache on ARM while they get > the PMEM API together. > > I'll do whatever you want.. just put the smack down and tell me how it > is ;) I'd say just control the variables you can control. Drop the ARM support if you want to move forward and propose extensions / updates to the pmem api for x86 and I'll help push those since I was involved in pushing the x86 pmem api in the first instance. That way you don't need to touch this driver as new archs add their pmem api enabling. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, May 22 2018 at 3:27pm -0400, Dan Williams <dan.j.williams@intel.com> wrote: > On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer <snitzer@redhat.com> wrote: > > On Tue, May 22 2018 at 3:00pm -0400, > > Dan Williams <dan.j.williams@intel.com> wrote: > > > >> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote: > >> > On Tue, May 22 2018 at 2:39am -0400, > >> > Christoph Hellwig <hch@infradead.org> wrote: > >> > > >> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote: > >> >> > Use new API for flushing persistent memory. > >> >> > >> >> The sentence doesnt make much sense. 'A new API', 'A better > >> >> abstraction' maybe? > >> >> > >> >> > > >> >> > The problem is this: > >> >> > * on X86-64, non-temporal stores have the best performance > >> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > >> >> > should flush cache as late as possible, because it performs better this > >> >> > way. > >> >> > > >> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > >> >> > data persistently, all three functions must be called. > >> >> > > >> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > >> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > >> >> > > >> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > >> >> > pmem_commit is wmb. > >> >> > > >> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > >> >> > pmem_commit is empty. > >> >> > >> >> All these should be provided by the pmem layer, and be properly > >> >> documented. And be sorted before adding your new target that uses > >> >> them. > >> > > >> > I don't see that as a hard requirement. Mikulas did the work to figure > >> > out what is more optimal on x86_64 vs amd64. It makes a difference for > >> > his target and that is sufficient to carry it locally until/when it is > >> > either elevated to pmem. 
> >> > > >> > We cannot even get x86 and swait maintainers to reply to repeat requests > >> > for review. Stacking up further deps on pmem isn't high on my list. > >> > > >> > >> Except I'm being responsive. > > > > Except you're looking to immediately punt to linux-arm-kernel ;) > > Well, I'm not, not really. I'm saying drop ARM support, it's not ready. > > > > >> I agree with Christoph that we should > >> build pmem helpers at an architecture level and not per-driver. Let's > >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch > >> up to x86 in this space. We already have PowerPC enabling PMEM API, so > >> I don't see an unreasonable barrier to ask the same of ARM. This patch > >> is not even cc'd to linux-arm-kernel. Has the subject been broached > >> with them? > > > > No idea. Not by me. > > > > The thing is, I'm no expert in pmem. You are. Coordinating the change > > with ARM et al feels unnecessarily limiting and quicky moves outside my > > control. > > > > Serious question: Why can't this code land in this dm-writecache target > > and then be lifted (or obsoleted)? > > Because we already have an API, and we don't want to promote local > solutions to global problems, or carry unnecessary technical debt. > > > > > But if you think it worthwhile to force ARM to step up then fine. That > > does limit the availability of using writecache on ARM while they get > > the PMEM API together. > > > > I'll do whatever you want.. just put the smack down and tell me how it > > is ;) > > I'd say just control the variables you can control. Drop the ARM > support if you want to move forward and propose extensions / updates > to the pmem api for x86 and I'll help push those since I was involved > in pushing the x86 pmem api in the first instance. That way you don't > need to touch this driver as new archs add their pmem api enabling. 
Looking at Mikulas' wrapper API that you and hch are calling into question: For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem(). (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.) Whereas x86_64 is using memcpy_flushcache() as provided by CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE. (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache) Just seems this isn't purely about ARM lacking on an API level (given on x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API). Seems this is more to do with x86_64 having efficient Non-temporal stores? Anyway, I'm still trying to appreciate the details here before I can make any forward progress. Mike
Hi, Mike, Mike Snitzer <snitzer@redhat.com> writes: > Looking at Mikulas' wrapper API that you and hch are calling into > question: > > For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem(). > (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.) > > Whereas x86_64 is using memcpy_flushcache() as provided by > CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE. > (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache) > > Just seems this isn't purely about ARM lacking on an API level (given on > x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API). > > Seems this is more to do with x86_64 having efficient Non-temporal > stores? Yeah, I think you've got that all right. > Anyway, I'm still trying to appreciate the details here before I can > make any forward progress. Making data persistent on x64 requires 3 steps: 1) copy the data into pmem (store instructions) 2) flush the cache lines associated with the data (clflush, clflushopt, clwb) 3) wait on the flush to complete (sfence) I'm not sure if other architectures require step 3. Mikulas' implementation seems to imply that arm64 doesn't require the fence. The current pmem api provides: memcpy* -- step 1 memcpy_flushcache -- this combines steps 1 and 2 dax_flush -- step 2 wmb* -- step 3 * not strictly part of the pmem api So, if you didn't care about performance, you could write generic code that only used memcpy, dax_flush, and wmb (assuming other arches actually need the wmb). What Mikulas did was to abstract out an API that could be called by generic code that would work optimally on all architectures. This looks like a worth-while addition to the PMEM API, to me. Mikulas, what do you think about refactoring the code as Christoph suggested? Cheers, Jeff
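Jeff's three steps and his API mapping can be lined up as two equivalent commit sequences (a userspace sketch; dax_flush, wmb and memcpy_flushcache are kernel calls, stubbed out here so the shape is compilable):

```c
#include <string.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel calls named in the message. */
static void dax_flush_stub(void *addr, size_t n) { (void)addr; (void)n; } /* step 2 */
static void wmb_stub(void) { }                                           /* step 3 */
static void memcpy_flushcache_stub(void *d, const void *s, size_t n)     /* steps 1+2 */
{
    memcpy(d, s, n);
}

/* Generic path: works on any arch providing the pmem api,
 * with no regard for performance. */
static void persist_generic(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);      /* 1: store the data            */
    dax_flush_stub(dst, n);   /* 2: flush the cache lines     */
    wmb_stub();               /* 3: wait for flush completion */
}

/* x86-64-style path: copy and flush are fused into one call. */
static void persist_flushcache(void *dst, const void *src, size_t n)
{
    memcpy_flushcache_stub(dst, src, n); /* steps 1+2 */
    wmb_stub();                          /* step 3    */
}
```

Both sequences leave identical bytes persistent; they differ only in where the flush happens, which is exactly the x86-64 vs ARM64 performance question in this thread.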
On Tue, 22 May 2018, Jeff Moyer wrote: > Hi, Mike, > > Mike Snitzer <snitzer@redhat.com> writes: > > > Looking at Mikulas' wrapper API that you and hch are calling into > > question: > > > > For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem(). > > (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.) > > > > Whereas x86_64 is using memcpy_flushcache() as provided by > > CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE. > > (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache) > > > > Just seems this isn't purely about ARM lacking on an API level (given on > > x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API). > > > > Seems this is more to do with x86_64 having efficient Non-temporal > > stores? > > Yeah, I think you've got that all right. > > > Anyway, I'm still trying to appreciate the details here before I can > > make any forward progress. > > Making data persistent on x64 requires 3 steps: > 1) copy the data into pmem (store instructions) > 2) flush the cache lines associated with the data (clflush, clflush_opt, clwb) > 3) wait on the flush to complete (sfence) In theory it works this way. In practice, this sequence is useless because the cache flushing instructions are horribly slow. So, the dm-writecache driver uses non-temporal stores instead of cache flushing. Now, the problem with arm64 is that it doesn't have non-temporal stores. So, memcpy_flushcache on arm64 does cached stores and flushes the cache afterwards. And this eager flushing is slower than late flushing. On ARM64, you want to do cached stores, then do something else, and flush the cache as late as possible. > I'm not sure if other architectures require step 3. Mikulas' > implementation seems to imply that arm64 doesn't require the fence. I suppose that arch_wb_cache_pmem() does whatever it needs to do to flush the cache. If not, add something like arch_wb_cache_pmem_commit().
> The current pmem api provides: > > memcpy* -- step 1 > memcpy_flushcache -- this combines steps 1 and 2 > dax_flush -- step 2 > wmb* -- step 3 > > * not strictly part of the pmem api > > So, if you didn't care about performance, you could write generic code > that only used memcpy, dax_flush, and wmb (assuming other arches > actually need the wmb). What Mikulas did was to abstract out an API > that could be called by generic code that would work optimally on all > architectures. > > This looks like a worth-while addition to the PMEM API, to me. Mikulas, > what do you think about refactoring the code as Christoph suggested? I sent this patch https://www.redhat.com/archives/dm-devel/2018-May/msg00054.html so that you can take the functions pmem_memcpy, pmem_assign, pmem_flush and pmem_commit and move them to the generic linux headers. If you want to do it, do it. > Cheers, > Jeff Mikulas
On Tue, 22 May 2018, Dan Williams wrote: > On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote: > > On Tue, May 22 2018 at 2:39am -0400, > > Christoph Hellwig <hch@infradead.org> wrote: > > > >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote: > >> > Use new API for flushing persistent memory. > >> > >> The sentence doesnt make much sense. 'A new API', 'A better > >> abstraction' maybe? > >> > >> > > >> > The problem is this: > >> > * on X86-64, non-temporal stores have the best performance > >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > >> > should flush cache as late as possible, because it performs better this > >> > way. > >> > > >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > >> > data persistently, all three functions must be called. > >> > > >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > >> > > >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > >> > pmem_commit is wmb. > >> > > >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > >> > pmem_commit is empty. > >> > >> All these should be provided by the pmem layer, and be properly > >> documented. And be sorted before adding your new target that uses > >> them. > > > > I don't see that as a hard requirement. Mikulas did the work to figure > > out what is more optimal on x86_64 vs amd64. It makes a difference for > > his target and that is sufficient to carry it locally until/when it is > > either elevated to pmem. > > > > We cannot even get x86 and swait maintainers to reply to repeat requests > > for review. Stacking up further deps on pmem isn't high on my list. > > > > Except I'm being responsive. I agree with Christoph that we should > build pmem helpers at an architecture level and not per-driver. 
Let's > make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch > up to x86 in this space. We already have PowerPC enabling PMEM API, so > I don't see an unreasonable barrier to ask the same of ARM. This patch > is not even cc'd to linux-arm-kernel. Has the subject been broached > with them? The ARM code can't "catch-up" with X86. On X86 - non-temporal stores (i.e. memcpy_flushcache) are faster than cached writes and cache flushing. The ARM architecture doesn't have non-temporal stores. So, memcpy_flushcache on ARM does memcpy (that writes data to the cache) and then flushes the cache. But this eager cache flushing is slower than late cache flushing. The optimal code sequence on ARM to write to persistent memory is to call memcpy, then do something else, and then call arch_wb_cache_pmem as late as possible. And this ARM-optimized code sequence is just horribly slow on X86. This issue can't be "fixed" in ARM-specific source code. The ARM processor has such characteristics that eager cache flushing is slower than late cache flushing - and that's it - you can't change processor behavior. If you don't want '#if defined(CONFIG_X86_64)' in the dm-writecache driver, then just take the functions that are in this conditional block and move them to some generic linux header. Mikulas
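The late-flush pattern Mikulas argues for on ARM can be illustrated like this (a sketch only: struct wc_entry and both helper functions are hypothetical names, and arch_wb_cache_pmem is stubbed as a no-op; the batching of the writeback is the point, not the stub bodies):

```c
#include <string.h>
#include <stddef.h>

/* No-op stand-in for the kernel's arch_wb_cache_pmem(). */
static void arch_wb_cache_pmem_stub(void *addr, size_t n) { (void)addr; (void)n; }

/* Hypothetical metadata record, just to have something to copy. */
struct wc_entry {
    unsigned long block;
    char data[64];
};

/* Eager (memcpy_flushcache-style): one cache writeback per entry,
 * which is what makes memcpy_flushcache slow on ARM64. */
static void write_entries_eager(struct wc_entry *dst, const struct wc_entry *src, int n)
{
    for (int i = 0; i < n; i++) {
        memcpy(&dst[i], &src[i], sizeof(*dst));
        arch_wb_cache_pmem_stub(&dst[i], sizeof(*dst)); /* flushed immediately */
    }
}

/* Late (the ARM-optimal sequence described above): cached stores for
 * everything, other work in between, then one writeback of the range. */
static void write_entries_late(struct wc_entry *dst, const struct wc_entry *src, int n)
{
    for (int i = 0; i < n; i++)
        memcpy(&dst[i], &src[i], sizeof(*dst));
    /* ... do something else here ... */
    arch_wb_cache_pmem_stub(dst, (size_t)n * sizeof(*dst)); /* one late flush */
}
```

Both orderings produce the same bytes in pmem; they differ only in how many flush operations hit the cache, which is why neither sequence is optimal on both architectures.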
On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote: > Use new API for flushing persistent memory. > > The problem is this: > * on X86-64, non-temporal stores have the best performance > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > should flush cache as late as possible, because it performs better this > way. > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > data persistently, all three functions must be called. > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > pmem_commit is wmb. > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > pmem_commit is empty. I don't want to grow driver-local wrappers for pmem. You should use memcpy_flushcache() directly and if an architecture does not define memcpy_flushcache() then don't allow building dm-writecache, i.e. this driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't see a need to add a standalone flush operation if all relevant archs provide memcpy_flushcache(). As for commit, I'd say just use wmb() directly since all archs define it. Alternatively we could introduce memcpy_flushcache_relaxed() to be the un-ordered version of the copy routine and memcpy_flushcache() would imply a wmb().
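Dan's alternative, an un-ordered copy plus an ordering rule, would look roughly like this (a sketch of the proposal only; memcpy_flushcache_relaxed is the name being proposed in this thread, not an existing kernel function, and the flushing copy and wmb are stubbed so the sketch compiles in userspace):

```c
#include <string.h>
#include <stddef.h>

/* Stand-in for the kernel wmb() ordering barrier. */
static void wmb_stub(void) { }

/* Proposed un-ordered variant: flushes cache lines as it copies
 * (stubbed as plain memcpy here) but gives no ordering guarantee. */
static void memcpy_flushcache_relaxed(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Under the proposal, memcpy_flushcache() itself would imply the
 * barrier, i.e. be the relaxed copy followed by wmb(). */
static void memcpy_flushcache_ordered(void *dst, const void *src, size_t n)
{
    memcpy_flushcache_relaxed(dst, src, n);
    wmb_stub();
}

/* The win: callers batching many copies pay for one barrier, not n. */
static void copy_batch(char *dst, const char *src, size_t chunk, int n)
{
    for (int i = 0; i < n; i++)
        memcpy_flushcache_relaxed(dst + (size_t)i * chunk, src + (size_t)i * chunk, chunk);
    wmb_stub();
}
```

This mirrors the relaxed/ordered split used elsewhere in kernel APIs: the common single-copy case stays safe by default, while batch users opt into deferring the barrier.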
On Thu, 24 May 2018, Dan Williams wrote: > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > Use new API for flushing persistent memory. > > > > The problem is this: > > * on X86-64, non-temporal stores have the best performance > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > > should flush cache as late as possible, because it performs better this > > way. > > > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > > data persistently, all three functions must be called. > > > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > > > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > > pmem_commit is wmb. > > > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > > pmem_commit is empty. > > I don't want to grow driver-local wrappers for pmem. You should use > memcpy_flushcache directly() and if an architecture does not define > memcpy_flushcache() then don't allow building dm-writecache, i.e. this > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't > see a need to add a standalone flush operation if all relevant archs > provide memcpy_flushcache(). As for commit, I'd say just use wmb() > directly since all archs define it. Alternatively we could introduce > memcpy_flushcache_relaxed() to be the un-ordered version of the copy > routine and memcpy_flushcache() would imply a wmb(). But memcpy_flushcache() on ARM64 is slow. Mikulas
On Fri, May 25 2018 at 2:17am -0400, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > On Thu, 24 May 2018, Dan Williams wrote: > > > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > Use new API for flushing persistent memory. > > > > > > The problem is this: > > > * on X86-64, non-temporal stores have the best performance > > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We > > > should flush cache as late as possible, because it performs better this > > > way. > > > > > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit > > > data persistently, all three functions must be called. > > > > > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. > > > > > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > > > pmem_commit is wmb. > > > > > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > > > pmem_commit is empty. > > > > I don't want to grow driver-local wrappers for pmem. You should use > > memcpy_flushcache directly() and if an architecture does not define > > memcpy_flushcache() then don't allow building dm-writecache, i.e. this > > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't > > see a need to add a standalone flush operation if all relevant archs > > provide memcpy_flushcache(). As for commit, I'd say just use wmb() > > directly since all archs define it. Alternatively we could introduce > > memcpy_flushcache_relaxed() to be the un-ordered version of the copy > > routine and memcpy_flushcache() would imply a wmb(). > > But memcpy_flushcache() on ARM64 is slow. Yes, Dan can you please take some time to fully appreciate what this small wrapper API is providing (maybe you've done that, but your recent reply is mixed-message). Seems you're keeping the innovation it provides at arms-length. 
Basically the PMEM APIs you've helped construct are lacking, forcing DM developers to own fixing them is an inversion that only serves to stonewall. Please, less time on stonewalling and more time lifting this wrapper API; otherwise the dm-writecache local wrapper API is a near-term means to an end. I revised the dm-writecache patch header yesterday, please review: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.18&id=2105231db61b08752bc4247d2fe7838657700b0d (the last paragraph in particular, I intend to move forward with this wrapper unless someone on the PMEM side of the house steps up and lifts it up between now and when the 4.18 merge window opens) dm: add writecache target The writecache target caches writes on persistent memory or SSD. It is intended for databases or other programs that need extremely low commit latency. The writecache target doesn't cache reads because reads are supposed to be cached in page cache in normal RAM. The following describes the approach used to provide the most efficient flushing of persistent memory on X86_64 vs ARM64: * On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster than cached writes and cache flushing. * The ARM64 architecture doesn't have non-temporal stores. So, memcpy_flushcache on ARM does memcpy (that writes data to the cache) and then flushes the cache. But this eager cache flushing is slower than late cache flushing. The optimal code sequence on ARM to write to persistent memory is to call memcpy, then do something else, and then call arch_wb_cache_pmem as late as possible. And this ARM-optimized code sequence is just horribly slow on X86. This issue can't be "fixed" in ARM-specific source code. The ARM processor has characteristics such that eager cache flushing is slower than late cache flushing - and that's it - you can't change processor behavior.
We introduce a wrapper API for flushing persistent memory with functions pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently, all three functions must be called. The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and pmem_commit is wmb. On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and pmem_commit is empty. It is clear that this wrapper API for flushing persistent memory needs to be elevated out of this dm-writecache driver. But that can happen later without requiring DM developers to blaze new trails on pmem specific implementation details/quirks (pmem developers need to clean up their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take into account the duality of the different programming models needed to achieve optimal cross-architecture use of persistent memory). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote: > On Fri, May 25 2018 at 2:17am -0400, > Mikulas Patocka <mpatocka@redhat.com> wrote: > >> >> >> On Thu, 24 May 2018, Dan Williams wrote: >> >> > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote: >> > > Use new API for flushing persistent memory. >> > > >> > > The problem is this: >> > > * on X86-64, non-temporal stores have the best performance >> > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We >> > > should flush cache as late as possible, because it performs better this >> > > way. >> > > >> > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit >> > > data persistently, all three functions must be called. >> > > >> > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign >> > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically. >> > > >> > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and >> > > pmem_commit is wmb. >> > > >> > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and >> > > pmem_commit is empty. >> > >> > I don't want to grow driver-local wrappers for pmem. You should use >> > memcpy_flushcache directly() and if an architecture does not define >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't >> > see a need to add a standalone flush operation if all relevant archs >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb() >> > directly since all archs define it. Alternatively we could introduce >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy >> > routine and memcpy_flushcache() would imply a wmb(). >> >> But memcpy_flushcache() on ARM64 is slow. 
> > Yes, Dan can you please take some time to fully appreciate what this > small wrapper API is providing (maybe you've done that, but your recent > reply is mixed-message). Seems you're keeping the innovation it > provides at arms-length. Basically the PMEM APIs you've helped > construct are lacking, forcing DM developers to own fixing them is an > inversion that only serves to stonewall. > > Please, less time on stonewalling and more time lifting this wrapper > API; otherwise the dm-writecache local wrapper API is a near-term means > to an end. > > I revised the dm-writecache patch header yesterday, please review: > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.18&id=2105231db61b08752bc4247d2fe7838657700b0d > (the last paragraph in particular, I intend to move forward with this > wrapper unless someone on the PMEM side of the house steps up and lifts > it up between now and when the 4.18 merge window opens) > > dm: add writecache target > The writecache target caches writes on persistent memory or SSD. > It is intended for databases or other programs that need extremely low > commit latency. > > The writecache target doesn't cache reads because reads are supposed to > be cached in page cache in normal RAM. > > The following describes the approach used to provide the most efficient > flushing of persistent memory on X86_64 vs ARM64: > > * On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster > than cached writes and cache flushing. > > * The ARM64 architecture doesn't have non-temporal stores. So, > memcpy_flushcache on ARM does memcpy (that writes data to the cache) > and then flushes the cache. But this eager cache flushig is slower > than late cache flushing. > > The optimal code sequence on ARM to write to persistent memory is to > call memcpy, then do something else, and then call arch_wb_cache_pmem as > late as possible. And this ARM-optimized code sequence is just horribly > slow on X86. 
> > This issue can't be "fixed" in ARM-specific source code. The ARM > processor have characteristics such that eager cache flushing is slower > than late cache flushing - and that's it - you can't change processor > behavior. > > We introduce a wrapper API for flushing persistent memory with functions > pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently, > all three functions must be called. > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign > (unlike pmem_memcpy) guarantees that 8-byte values are written > atomically. > > On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and > pmem_commit is wmb. > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and > pmem_commit is empty. > > It is clear that this wrapper API for flushing persistent memory needs > to be elevated out of this dm-writecache driver. But that can happen > later without requiring DM developers to blaze new trails on pmem > specific implementation details/quirks (pmem developers need to clean up > their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API > and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take > into account the duality of the different programming models needed to > achieve optimal cross-architecture use of persistent memory). Right, so again, what is wrong with memcpy_flushcache_relaxed() + wmb() or otherwise making memcpy_flushcache() ordered. I do not see that as a trailblazing requirement, I see that as typical review and a reduction of the operation space that you are proposing. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, 25 May 2018, Dan Williams wrote:

> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Fri, May 25 2018 at 2:17am -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >> On Thu, 24 May 2018, Dan Williams wrote:
> >>
> >> > I don't want to grow driver-local wrappers for pmem. You should use
> >> > memcpy_flushcache directly() and if an architecture does not define
> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> >> > see a need to add a standalone flush operation if all relevant archs
> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> >> > directly since all archs define it. Alternatively we could introduce
> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> >> > routine and memcpy_flushcache() would imply a wmb().
> >>
> >> But memcpy_flushcache() on ARM64 is slow.
>
> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
> that as a trailblazing requirement, I see that as typical review and a
> reduction of the operation space that you are proposing.

memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is slower than memcpy plus an explicit cache flush.

Suppose that you want to write data to a block device and make it persistent. So you send a WRITE bio and then a FLUSH bio.

Now - how to implement these two bios on persistent memory:

On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does wmb() - this is the optimal implementation.

But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal implementation is that the WRITE bio does just memcpy() and the FLUSH bio does arch_wb_cache_pmem() on the affected range.

Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture doesn't have non-temporal stores. So, memcpy_flushcache() is implemented as memcpy() followed by a cache flush.

Now - if you flush the cache immediately after memcpy, the cache is full of dirty lines and the cache-flushing code has to write these lines back, and that is slow.

If you flush the cache some time after memcpy (i.e. when the FLUSH bio is received), the processor has already flushed some part of the cache on its own, so the cache-flushing function has less work to do and it is faster.

So the conclusion is - don't use memcpy_flushcache on ARM. This problem cannot be fixed by a better implementation of memcpy_flushcache.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > On Fri, 25 May 2018, Dan Williams wrote: > >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote: >> > On Fri, May 25 2018 at 2:17am -0400, >> > Mikulas Patocka <mpatocka@redhat.com> wrote: >> > >> >> On Thu, 24 May 2018, Dan Williams wrote: >> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use >> >> > memcpy_flushcache directly() and if an architecture does not define >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't >> >> > see a need to add a standalone flush operation if all relevant archs >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb() >> >> > directly since all archs define it. Alternatively we could introduce >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy >> >> > routine and memcpy_flushcache() would imply a wmb(). >> >> >> >> But memcpy_flushcache() on ARM64 is slow. >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() + >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see >> that as a trailblazing requirement, I see that as typical review and a >> reduction of the operation space that you are proposing. > > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is > slower than memcpy and explicit cache flush. > > Suppose that you want to write data to a block device and make it > persistent. So you send a WRITE bio and then a FLUSH bio. > > Now - how to implement these two bios on persistent memory: > > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does > wmb() - this is the optimal implementation. > > But on ARM64, memcpy_flushcache() is suboptimal. 
On ARM64, the optimal > implementation is that the WRITE bio does just memcpy() and the FLUSH bio > does arch_wb_cache_pmem() on the affected range. > > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented > as memcpy() followed by a cache flush. > > Now - if you flush the cache immediatelly after memcpy, the cache is full > of dirty lines and the cache-flushing code has to write these lines back > and that is slow. > > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is > received), the processor already flushed some part of the cache on its > own, so the cache-flushing function has less work to do and it is faster. > > So the conclusion is - don't use memcpy_flushcache on ARM. This problem > cannot be fixed by a better implementation of memcpy_flushcache. It sounds like ARM might be better off with mapping its pmem as write-through rather than write-back, and skip the explicit cache management altogether. You speak of "optimal" and "sub-optimal", but what would be more clear is fio measurements of the relative IOPs and latency profiles of the different approaches. The reason I am continuing to push here is that reducing the operation space from 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the maintenance long term. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Sat, 26 May 2018, Dan Williams wrote: > On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > On Fri, 25 May 2018, Dan Williams wrote: > > > >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote: > >> > On Fri, May 25 2018 at 2:17am -0400, > >> > Mikulas Patocka <mpatocka@redhat.com> wrote: > >> > > >> >> On Thu, 24 May 2018, Dan Williams wrote: > >> >> > >> >> > I don't want to grow driver-local wrappers for pmem. You should use > >> >> > memcpy_flushcache directly() and if an architecture does not define > >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this > >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't > >> >> > see a need to add a standalone flush operation if all relevant archs > >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb() > >> >> > directly since all archs define it. Alternatively we could introduce > >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy > >> >> > routine and memcpy_flushcache() would imply a wmb(). > >> >> > >> >> But memcpy_flushcache() on ARM64 is slow. > >> > >> Right, so again, what is wrong with memcpy_flushcache_relaxed() + > >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see > >> that as a trailblazing requirement, I see that as typical review and a > >> reduction of the operation space that you are proposing. > > > > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is > > slower than memcpy and explicit cache flush. > > > > Suppose that you want to write data to a block device and make it > > persistent. So you send a WRITE bio and then a FLUSH bio. > > > > Now - how to implement these two bios on persistent memory: > > > > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does > > wmb() - this is the optimal implementation. > > > > But on ARM64, memcpy_flushcache() is suboptimal. 
> > On ARM64, the optimal
> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
> > does arch_wb_cache_pmem() on the affected range.
> >
> > Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture
> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
> > as memcpy() followed by a cache flush.
> >
> > Now - if you flush the cache immediately after memcpy, the cache is full
> > of dirty lines and the cache-flushing code has to write these lines back
> > and that is slow.
> >
> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
> > received), the processor already flushed some part of the cache on its
> > own, so the cache-flushing function has less work to do and it is faster.
> >
> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
> > cannot be fixed by a better implementation of memcpy_flushcache.
>
> It sounds like ARM might be better off with mapping its pmem as
> write-through rather than write-back, and skip the explicit cache

I doubt it would perform well - write combining combines the writes into larger segments - and write-through doesn't.

> management altogether. You speak of "optimal" and "sub-optimal", but
> what would be more clear is fio measurements of the relative IOPs and
> latency profiles of the different approaches. The reason I am
> continuing to push here is that reducing the operation space from
> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
> maintenance long term.

I measured it (with nvme backing store) and late cache flushing has 12% better performance than eager flushing with memcpy_flushcache(): 131836 4k iops vs 117016.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, 22 May 2018, Dan Williams wrote:

> >> Except I'm being responsive.
> >
> > Except you're looking to immediately punt to linux-arm-kernel ;)
>
> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.

This is the worst thing to do - because once late cache flushing is dropped from the dm-writecache target, it could hardly be reintroduced again.

> >> I agree with Christoph that we should
> >> build pmem helpers at an architecture level and not per-driver. Let's
> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
> >> with them?
> >
> > No idea. Not by me.
> >
> > The thing is, I'm no expert in pmem. You are. Coordinating the change
> > with ARM et al feels unnecessarily limiting and quickly moves outside my
> > control.
> >
> > Serious question: Why can't this code land in this dm-writecache target
> > and then be lifted (or obsoleted)?
>
> Because we already have an API, and we don't want to promote local
> solutions to global problems, or carry unnecessary technical debt.
>
> > But if you think it worthwhile to force ARM to step up then fine. That
> > does limit the availability of using writecache on ARM while they get
> > the PMEM API together.
> >
> > I'll do whatever you want.. just put the smack down and tell me how it
> > is ;)
>
> I'd say just control the variables you can control. Drop the ARM
> support if you want to move forward and propose extensions / updates

What do we gain by dropping it?

> to the pmem api for x86 and I'll help push those since I was involved
> in pushing the x86 pmem api in the first instance. That way you don't
> need to touch this driver as new archs add their pmem api enabling.

The pmem API is x86-centric - that's the problem.
Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, May 28, 2018 at 6:52 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > On Tue, 22 May 2018, Dan Williams wrote: > >> >> Except I'm being responsive. >> > >> > Except you're looking to immediately punt to linux-arm-kernel ;) >> >> Well, I'm not, not really. I'm saying drop ARM support, it's not ready. > > This is the worst thing to do - because once late cache flushing is > dropped from the dm-writecache target, it could hardly be reintroduced > again. > >> >> I agree with Christoph that we should >> >> build pmem helpers at an architecture level and not per-driver. Let's >> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch >> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so >> >> I don't see an unreasonable barrier to ask the same of ARM. This patch >> >> is not even cc'd to linux-arm-kernel. Has the subject been broached >> >> with them? >> > >> > No idea. Not by me. >> > >> > The thing is, I'm no expert in pmem. You are. Coordinating the change >> > with ARM et al feels unnecessarily limiting and quicky moves outside my >> > control. >> > >> > Serious question: Why can't this code land in this dm-writecache target >> > and then be lifted (or obsoleted)? >> >> Because we already have an API, and we don't want to promote local >> solutions to global problems, or carry unnecessary technical debt. >> >> > >> > But if you think it worthwhile to force ARM to step up then fine. That >> > does limit the availability of using writecache on ARM while they get >> > the PMEM API together. >> > >> > I'll do whatever you want.. just put the smack down and tell me how it >> > is ;) >> >> I'd say just control the variables you can control. Drop the ARM >> support if you want to move forward and propose extensions / updates > > What do we gain by dropping it? > >> to the pmem api for x86 and I'll help push those since I was involved >> in pushing the x86 pmem api in the first instance. 
>> That way you don't
>> need to touch this driver as new archs add their pmem api enabling.
>
> The pmem API is x86-centric - that's the problem.

When I read your patch I came away with the impression that ARM had not added memcpy_flushcache() yet and you were working around that fact. Now that I look, ARM *does* define memcpy_flushcache() and you're avoiding it. You use memcpy + arch_wb_cache_pmem, where arch_wb_cache_pmem on ARM64 is defined as __clean_dcache_area_pop(dst, cnt).

The ARM memcpy_flushcache() implementation is:

    memcpy(dst, src, cnt);
    __clean_dcache_area_pop(dst, cnt);

So, I do not see how what you're doing is any less work, unless you are flushing less than you copy?

If memcpy_flushcache() is slower than memcpy + arch_wb_cache_pmem, then the ARM implementation is broken and that needs to be addressed, not worked around in a driver.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > On Sat, 26 May 2018, Dan Williams wrote: > >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: >> > >> > >> > On Fri, 25 May 2018, Dan Williams wrote: >> > >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote: >> >> > On Fri, May 25 2018 at 2:17am -0400, >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote: >> >> > >> >> >> On Thu, 24 May 2018, Dan Williams wrote: >> >> >> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use >> >> >> > memcpy_flushcache directly() and if an architecture does not define >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't >> >> >> > see a need to add a standalone flush operation if all relevant archs >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb() >> >> >> > directly since all archs define it. Alternatively we could introduce >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy >> >> >> > routine and memcpy_flushcache() would imply a wmb(). >> >> >> >> >> >> But memcpy_flushcache() on ARM64 is slow. >> >> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() + >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see >> >> that as a trailblazing requirement, I see that as typical review and a >> >> reduction of the operation space that you are proposing. >> > >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is >> > slower than memcpy and explicit cache flush. >> > >> > Suppose that you want to write data to a block device and make it >> > persistent. So you send a WRITE bio and then a FLUSH bio. 
>> > >> > Now - how to implement these two bios on persistent memory: >> > >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does >> > wmb() - this is the optimal implementation. >> > >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio >> > does arch_wb_cache_pmem() on the affected range. >> > >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented >> > as memcpy() followed by a cache flush. >> > >> > Now - if you flush the cache immediatelly after memcpy, the cache is full >> > of dirty lines and the cache-flushing code has to write these lines back >> > and that is slow. >> > >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is >> > received), the processor already flushed some part of the cache on its >> > own, so the cache-flushing function has less work to do and it is faster. >> > >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem >> > cannot be fixed by a better implementation of memcpy_flushcache. >> >> It sounds like ARM might be better off with mapping its pmem as >> write-through rather than write-back, and skip the explicit cache > > I doubt it would perform well - write combining combines the writes into a > larger segments - and write-through doesn't. > Last I checked write-through caching does not disable write combining >> management altogether. You speak of "optimal" and "sub-optimal", but >> what would be more clear is fio measurements of the relative IOPs and >> latency profiles of the different approaches. The reason I am >> continuing to push here is that reducing the operation space from >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the >> maintenance long term. 
>
> I measured it (with nvme backing store) and late cache flushing has 12%
> better performance than eager flushing with memcpy_flushcache().

I assume what you're seeing is ARM64 over-flushing the amount of dirty data, so it becomes more efficient to do an amortized flush at the end? However, that effectively makes memcpy_flushcache() unusable in the way it can be used on x86.

You claimed that ARM does not support non-temporal stores, but it does - see the STNP instruction. I do not want to see arch-specific optimizations in drivers, so either write-through mappings are a potential answer to remove the need to explicitly manage flushing, or just implement STNP hacks in memcpy_flushcache() like you did with MOVNT on x86.

> 131836 4k iops - vs - 117016.

To be clear, this is memcpy_flushcache() vs memcpy + flush?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 28 May 2018, Dan Williams wrote: > On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > On Sat, 26 May 2018, Dan Williams wrote: > > > >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: > >> > > >> > > >> > On Fri, 25 May 2018, Dan Williams wrote: > >> > > >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote: > >> >> > On Fri, May 25 2018 at 2:17am -0400, > >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote: > >> >> > > >> >> >> On Thu, 24 May 2018, Dan Williams wrote: > >> >> >> > >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use > >> >> >> > memcpy_flushcache directly() and if an architecture does not define > >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this > >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't > >> >> >> > see a need to add a standalone flush operation if all relevant archs > >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb() > >> >> >> > directly since all archs define it. Alternatively we could introduce > >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy > >> >> >> > routine and memcpy_flushcache() would imply a wmb(). > >> >> >> > >> >> >> But memcpy_flushcache() on ARM64 is slow. > >> >> > >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() + > >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see > >> >> that as a trailblazing requirement, I see that as typical review and a > >> >> reduction of the operation space that you are proposing. > >> > > >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is > >> > slower than memcpy and explicit cache flush. > >> > > >> > Suppose that you want to write data to a block device and make it > >> > persistent. So you send a WRITE bio and then a FLUSH bio. 
> >> > > >> > Now - how to implement these two bios on persistent memory: > >> > > >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does > >> > wmb() - this is the optimal implementation. > >> > > >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal > >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio > >> > does arch_wb_cache_pmem() on the affected range. > >> > > >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture > >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented > >> > as memcpy() followed by a cache flush. > >> > > >> > Now - if you flush the cache immediatelly after memcpy, the cache is full > >> > of dirty lines and the cache-flushing code has to write these lines back > >> > and that is slow. > >> > > >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is > >> > received), the processor already flushed some part of the cache on its > >> > own, so the cache-flushing function has less work to do and it is faster. > >> > > >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem > >> > cannot be fixed by a better implementation of memcpy_flushcache. > >> > >> It sounds like ARM might be better off with mapping its pmem as > >> write-through rather than write-back, and skip the explicit cache > > > > I doubt it would perform well - write combining combines the writes into a > > larger segments - and write-through doesn't. > > > > Last I checked write-through caching does not disable write combining > > >> management altogether. You speak of "optimal" and "sub-optimal", but > >> what would be more clear is fio measurements of the relative IOPs and > >> latency profiles of the different approaches. The reason I am > >> continuing to push here is that reducing the operation space from > >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the > >> maintenance long term. 
> >
> > I measured it (with nvme backing store) and late cache flushing has 12%
> > better performance than eager flushing with memcpy_flushcache().
>
> I assume what you're seeing is ARM64 over-flushing the amount of dirty
> data so it becomes more efficient to do an amortized flush at the end?
> However, that effectively makes memcpy_flushcache() unusable in the
> way it can be used on x86. You claimed that ARM does not support
> non-temporal stores, but it does, see the STNP instruction. I do not
> want to see arch specific optimizations in drivers, so either
> write-through mappings is a potential answer to remove the need to
> explicitly manage flushing, or just implement STNP hacks in
> memcpy_flushcache() like you did with MOVNT on x86.
>
> > 131836 4k iops - vs - 117016.
>
> To be clear this is memcpy_flushcache() vs memcpy + flush?

I found out what caused the difference. I used dax_flush on the version of dm-writecache that I had on the ARM machine (with the kernel 4.14, because it is the last version where dax on ramdisk works) - and I thought that dax_flush flushes the cache, but it doesn't.

When I replaced dax_flush with arch_wb_cache_pmem, the performance difference between early flushing and late flushing disappeared.

So I think we can remove this per-architecture switch from dm-writecache.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, May 30 2018 at 9:07am -0400, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > On Mon, 28 May 2018, Dan Williams wrote: > > > On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > I measured it (with nvme backing store) and late cache flushing has 12% > > > better performance than eager flushing with memcpy_flushcache(). > > > > I assume what you're seeing is ARM64 over-flushing the amount of dirty > > data so it becomes more efficient to do an amortized flush at the end? > > However, that effectively makes memcpy_flushcache() unusable in the > > way it can be used on x86. You claimed that ARM does not support > > non-temporal stores, but it does, see the STNP instruction. I do not > > want to see arch specific optimizations in drivers, so either > > write-through mappings is a potential answer to remove the need to > > explicitly manage flushing, or just implement STNP hacks in > > memcpy_flushcache() like you did with MOVNT on x86. > > > > > 131836 4k iops - vs - 117016. > > > > To be clear this is memcpy_flushcache() vs memcpy + flush? > > I found out what caused the difference. I used dax_flush on the version of > dm-writecache that I had on the ARM machine (with the kernel 4.14, because > it is the last version where dax on ramdisk works) - and I thought that > dax_flush flushes the cache, but it doesn't. > > When I replaced dax_flush with arch_wb_cache_pmem, the performance > difference between early flushing and late flushing disappeared. > > So I think we can remove this per-architecture switch from dm-writecache. That is really great news, can you submit an incremental patch that layers ontop of the linux-dm.git 'dm-4.18' branch? Thanks, Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 30 May 2018, Mike Snitzer wrote:

> That is really great news, can you submit an incremental patch that
> layers ontop of the linux-dm.git 'dm-4.18' branch?
>
> Thanks,
> Mike

I've sent the current version that I have. I fixed the bugs that were
reported here (missing DAX, dm_bufio_client_create, __branch_check__
long->int truncation).

Mikulas
On Wed, May 30 2018 at 9:21am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

>
>
> On Wed, 30 May 2018, Mike Snitzer wrote:
>
> > That is really great news, can you submit an incremental patch that
> > layers ontop of the linux-dm.git 'dm-4.18' branch?
> >
> > Thanks,
> > Mike
>
> I've sent the current version that I have. I fixed the bugs that were
> reported here (missing DAX, dm_bufio_client_create, __branch_check__
> long->int truncation).

OK, but a monolithic dm-writecache.c is no longer useful to me. I can
drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
your __branch_check__ patch). Not sure what the dm_bufio_client_create
fix is... must've missed a report about that.

Anyway, the point is we're on to a different phase of dm-writecache.c's
development. I've picked it up and am trying to get it ready for the
4.18 merge window (likely opening Sunday). Therefore it needs to be in
a git tree, and incremental changes overlayed. I cannot be rebasing at
this late stage in the 4.18 development window.

Thanks,
Mike
On Wed, 30 May 2018, Mike Snitzer wrote:

> On Wed, May 30 2018 at 9:21am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> >
> >
> > On Wed, 30 May 2018, Mike Snitzer wrote:
> >
> > > That is really great news, can you submit an incremental patch that
> > > layers ontop of the linux-dm.git 'dm-4.18' branch?
> > >
> > > Thanks,
> > > Mike
> >
> > I've sent the current version that I have. I fixed the bugs that were
> > reported here (missing DAX, dm_bufio_client_create, __branch_check__
> > long->int truncation).
>
> OK, but a monolithic dm-writecache.c is no longer useful to me. I can
> drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
> your __branch_check__ patch). Not sure what the dm_bufio_client_create
> fix is... must've missed a report about that.
>
> Anyway, the point is we're on to a different phase of dm-writecache.c's
> development. I've picked it up and am trying to get it ready for the
> 4.18 merge window (likely opening Sunday). Therefore it needs to be in
> a git tree, and incremental changes overlayed. I cannot be rebasing at
> this late stage in the 4.18 development window.
>
> Thanks,
> Mike

I downloaded dm-writecache from your git repository some time ago - but
you changed a lot of cosmetic things (e.g. reordering the fields in the
structure) since that time - so you'll have to merge the changes.

You dropped the latency measuring code - why? We still would like to
benchmark the driver.

Mikulas
Dan Williams <dan.j.williams@intel.com> writes:

> When I read your patch I came away with the impression that ARM had
> not added memcpy_flushcache() yet and you were working around that
> fact. Now that I look, ARM *does* define memcpy_flushcache() and
> you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
> ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
> memcpy_flushcache() implementation is:
>
>     memcpy(dst, src, cnt);
>     __clean_dcache_area_pop(dst, cnt);
>
> So, I do not see how what you're doing is any less work unless you are
> flushing less than you copy?
>
> If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
> ARM implementation is broken and that needs to be addressed not worked
> around in a driver.

I think Mikulas wanted to batch up multiple copies and flush at the
end. According to his commit message, that batching gained him 2%
performance.

-Jeff
On Wed, 30 May 2018, Jeff Moyer wrote:

> Dan Williams <dan.j.williams@intel.com> writes:
>
> > When I read your patch I came away with the impression that ARM had
> > not added memcpy_flushcache() yet and you were working around that
> > fact. Now that I look, ARM *does* define memcpy_flushcache() and
> > you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
> > ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
> > memcpy_flushcache() implementation is:
> >
> >     memcpy(dst, src, cnt);
> >     __clean_dcache_area_pop(dst, cnt);
> >
> > So, I do not see how what you're doing is any less work unless you are
> > flushing less than you copy?
> >
> > If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
> > ARM implementation is broken and that needs to be addressed not worked
> > around in a driver.
>
> I think Mikulas wanted to batch up multiple copies and flush at the
> end. According to his commit message, that batching gained him 2%
> performance.
>
> -Jeff

No - this 2% difference is inlined memcpy_flushcache() vs out-of-line
memcpy_flushcache(). I thought that dax_flush() performed 12% better than
memcpy_flushcache() - but the reason why it performed better was that it
was not flushing the cache at all.

Mikulas
Jeff Moyer <jmoyer@redhat.com> writes:

> Dan Williams <dan.j.williams@intel.com> writes:
>
>> When I read your patch I came away with the impression that ARM had
>> not added memcpy_flushcache() yet and you were working around that
>> fact. Now that I look, ARM *does* define memcpy_flushcache() and
>> you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
>> ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
>> memcpy_flushcache() implementation is:
>>
>>     memcpy(dst, src, cnt);
>>     __clean_dcache_area_pop(dst, cnt);
>>
>> So, I do not see how what you're doing is any less work unless you are
>> flushing less than you copy?
>>
>> If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
>> ARM implementation is broken and that needs to be addressed not worked
>> around in a driver.
>
> I think Mikulas wanted to batch up multiple copies and flush at the
> end. According to his commit message, that batching gained him 2%
> performance.

Nevermind me, I just caught up with the rest of the thread. :)

-Jeff
On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Mon, 28 May 2018, Dan Williams wrote:
>
>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Sat, 26 May 2018, Dan Williams wrote:
>> >
>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >
>> >> >
>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>> >> >
>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >> >> > On Fri, May 25 2018 at 2:17am -0400,
>> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >> >
>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>> >> >> >>
>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>> >> >> >>
>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>> >> >>
>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>> >> >> that as a trailblazing requirement, I see that as typical review and a
>> >> >> reduction of the operation space that you are proposing.
>> >> >
>> >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
>> >> > slower than memcpy and explicit cache flush.
>> >> >
>> >> > Suppose that you want to write data to a block device and make it
>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>> >> >
>> >> > Now - how to implement these two bios on persistent memory:
>> >> >
>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>> >> > wmb() - this is the optimal implementation.
>> >> >
>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>> >> > does arch_wb_cache_pmem() on the affected range.
>> >> >
>> >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>> >> > as memcpy() followed by a cache flush.
>> >> >
>> >> > Now - if you flush the cache immediatelly after memcpy, the cache is full
>> >> > of dirty lines and the cache-flushing code has to write these lines back
>> >> > and that is slow.
>> >> >
>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>> >> > received), the processor already flushed some part of the cache on its
>> >> > own, so the cache-flushing function has less work to do and it is faster.
>> >> >
>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>> >>
>> >> It sounds like ARM might be better off with mapping its pmem as
>> >> write-through rather than write-back, and skip the explicit cache
>> >
>> > I doubt it would perform well - write combining combines the writes into a
>> > larger segments - and write-through doesn't.
>> >
>>
>> Last I checked write-through caching does not disable write combining
>>
>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>> >> what would be more clear is fio measurements of the relative IOPs and
>> >> latency profiles of the different approaches. The reason I am
>> >> continuing to push here is that reducing the operation space from
>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>> >> maintenance long term.
>> >
>> > I measured it (with nvme backing store) and late cache flushing has 12%
>> > better performance than eager flushing with memcpy_flushcache().
>>
>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>> data so it becomes more efficient to do an amortized flush at the end?
>> However, that effectively makes memcpy_flushcache() unusable in the
>> way it can be used on x86. You claimed that ARM does not support
>> non-temporal stores, but it does, see the STNP instruction. I do not
>> want to see arch specific optimizations in drivers, so either
>> write-through mappings is a potential answer to remove the need to
>> explicitly manage flushing, or just implement STNP hacks in
>> memcpy_flushcache() like you did with MOVNT on x86.
>>
>> > 131836 4k iops - vs - 117016.
>>
>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>
> I found out what caused the difference. I used dax_flush on the version of
> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
> it is the last version where dax on ramdisk works) - and I thought that
> dax_flush flushes the cache, but it doesn't.
>
> When I replaced dax_flush with arch_wb_cache_pmem, the performance
> difference between early flushing and late flushing disappeared.
>
> So I think we can remove this per-architecture switch from dm-writecache.

Great find! Thanks for the due diligence. Feel free to add:

Acked-by: Dan Williams <dan.j.williams@intel.com>

...on the reworks to unify ARM and x86.
On Wed, May 30, 2018 at 8:58 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>
>> On Mon, 28 May 2018, Dan Williams wrote:
>>
>>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >
>>> >
>>> > On Sat, 26 May 2018, Dan Williams wrote:
>>> >
>>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >
>>> >> >
>>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>>> >> >
>>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>>> >> >> > On Fri, May 25 2018 at 2:17am -0400,
>>> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >> >
>>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>>> >> >> >>
>>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>>> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
>>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>>> >> >> >>
>>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>>> >> >>
>>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>>> >> >> that as a trailblazing requirement, I see that as typical review and a
>>> >> >> reduction of the operation space that you are proposing.
>>> >> >
>>> >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
>>> >> > slower than memcpy and explicit cache flush.
>>> >> >
>>> >> > Suppose that you want to write data to a block device and make it
>>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>>> >> >
>>> >> > Now - how to implement these two bios on persistent memory:
>>> >> >
>>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>>> >> > wmb() - this is the optimal implementation.
>>> >> >
>>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>>> >> > does arch_wb_cache_pmem() on the affected range.
>>> >> >
>>> >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
>>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>>> >> > as memcpy() followed by a cache flush.
>>> >> >
>>> >> > Now - if you flush the cache immediatelly after memcpy, the cache is full
>>> >> > of dirty lines and the cache-flushing code has to write these lines back
>>> >> > and that is slow.
>>> >> >
>>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>>> >> > received), the processor already flushed some part of the cache on its
>>> >> > own, so the cache-flushing function has less work to do and it is faster.
>>> >> >
>>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>>> >>
>>> >> It sounds like ARM might be better off with mapping its pmem as
>>> >> write-through rather than write-back, and skip the explicit cache
>>> >
>>> > I doubt it would perform well - write combining combines the writes into a
>>> > larger segments - and write-through doesn't.
>>> >
>>>
>>> Last I checked write-through caching does not disable write combining
>>>
>>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>>> >> what would be more clear is fio measurements of the relative IOPs and
>>> >> latency profiles of the different approaches. The reason I am
>>> >> continuing to push here is that reducing the operation space from
>>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>>> >> maintenance long term.
>>> >
>>> > I measured it (with nvme backing store) and late cache flushing has 12%
>>> > better performance than eager flushing with memcpy_flushcache().
>>>
>>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>>> data so it becomes more efficient to do an amortized flush at the end?
>>> However, that effectively makes memcpy_flushcache() unusable in the
>>> way it can be used on x86. You claimed that ARM does not support
>>> non-temporal stores, but it does, see the STNP instruction. I do not
>>> want to see arch specific optimizations in drivers, so either
>>> write-through mappings is a potential answer to remove the need to
>>> explicitly manage flushing, or just implement STNP hacks in
>>> memcpy_flushcache() like you did with MOVNT on x86.
>>>
>>> > 131836 4k iops - vs - 117016.
>>>
>>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>>
>> I found out what caused the difference. I used dax_flush on the version of
>> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
>> it is the last version where dax on ramdisk works) - and I thought that
>> dax_flush flushes the cache, but it doesn't.
>>
>> When I replaced dax_flush with arch_wb_cache_pmem, the performance
>> difference between early flushing and late flushing disappeared.
>>
>> So I think we can remove this per-architecture switch from dm-writecache.
>
> Great find! Thanks for the due diligence. Feel free to add:
>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
>
> ...on the reworks to unify ARM and x86.

One more note. The side effect of not using dax_flush() is that you
may end up flushing caches on systems where the platform has asserted
it will take responsibility for flushing caches at power loss. If /
when those systems become more prevalent we may want to think of a way
to combine the non-temporal optimization and the cache-flush-bypass
optimizations. However that is something that can wait for a later
change beyond 4.18.
On Wed, 30 May 2018, Dan Williams wrote:

> > Great find! Thanks for the due diligence. Feel free to add:
> >
> > Acked-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ...on the reworks to unify ARM and x86.
>
> One more note. The side effect of not using dax_flush() is that you
> may end up flushing caches on systems where the platform has asserted
> it will take responsibility for flushing caches at power loss. If /
> when those systems become more prevalent we may want to think of a way
> to combine the non-temporal optimization and the cache-flush-bypass
> optimizations. However that is something that can wait for a later
> change beyond 4.18.

We could define memcpy_flushpmem, which falls back to memcpy or
memcpy_flushcache, depending on whether the platform flushes the caches
at power loss or not.

Mikulas
On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 30 May 2018, Dan Williams wrote:
>
>> > Great find! Thanks for the due diligence. Feel free to add:
>> >
>> > Acked-by: Dan Williams <dan.j.williams@intel.com>
>> >
>> > ...on the reworks to unify ARM and x86.
>>
>> One more note. The side effect of not using dax_flush() is that you
>> may end up flushing caches on systems where the platform has asserted
>> it will take responsibility for flushing caches at power loss. If /
>> when those systems become more prevalent we may want to think of a way
>> to combine the non-temporal optimization and the cache-flush-bypass
>> optimizations. However that is something that can wait for a later
>> change beyond 4.18.
>
> We could define memcpy_flushpmem, that falls back to memcpy or
> memcpy_flushcache, depending on whether the platform flushes the caches at
> power loss or not.

The problem is that some platforms only power fail protect a subset of
the physical address range, but yes, if the platform makes a global
assertion we can globally replace memcpy_flushpmem() with plain
memcpy.
On Thu, 31 May 2018, Dan Williams wrote:

> On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Wed, 30 May 2018, Dan Williams wrote:
> >
> >> > Great find! Thanks for the due diligence. Feel free to add:
> >> >
> >> > Acked-by: Dan Williams <dan.j.williams@intel.com>
> >> >
> >> > ...on the reworks to unify ARM and x86.
> >>
> >> One more note. The side effect of not using dax_flush() is that you
> >> may end up flushing caches on systems where the platform has asserted
> >> it will take responsibility for flushing caches at power loss. If /
> >> when those systems become more prevalent we may want to think of a way
> >> to combine the non-temporal optimization and the cache-flush-bypass
> >> optimizations. However that is something that can wait for a later
> >> change beyond 4.18.
> >
> > We could define memcpy_flushpmem, that falls back to memcpy or
> > memcpy_flushcache, depending on whether the platform flushes the caches at
> > power loss or not.
>
> The problem is that some platforms only power fail protect a subset of
> the physical address range,

How can this be? A physical address may be cached on any CPU, so either
there is enough power to flush all the CPUs' caches or there isn't.

What does a CPU design that protects only a part of the physical address
space look like?

> but yes, if the platform makes a global
> assertion we can globally replace memcpy_flushpmem() with plain
> memcpy.

Mikulas
On Thu, May 31, 2018 at 8:31 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 31 May 2018, Dan Williams wrote:
>
>> On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Wed, 30 May 2018, Dan Williams wrote:
>> >
>> >> > Great find! Thanks for the due diligence. Feel free to add:
>> >> >
>> >> > Acked-by: Dan Williams <dan.j.williams@intel.com>
>> >> >
>> >> > ...on the reworks to unify ARM and x86.
>> >>
>> >> One more note. The side effect of not using dax_flush() is that you
>> >> may end up flushing caches on systems where the platform has asserted
>> >> it will take responsibility for flushing caches at power loss. If /
>> >> when those systems become more prevalent we may want to think of a way
>> >> to combine the non-temporal optimization and the cache-flush-bypass
>> >> optimizations. However that is something that can wait for a later
>> >> change beyond 4.18.
>> >
>> > We could define memcpy_flushpmem, that falls back to memcpy or
>> > memcpy_flushcache, depending on whether the platform flushes the caches at
>> > power loss or not.
>>
>> The problem is that some platforms only power fail protect a subset of
>> the physical address range,
>
> How can this be? A physical address may be cached on any CPU, so either
> there is enough power to flush all the CPUs' caches or there isn't.
>
> What does a CPU design that protects only a part of the physical address
> space look like?

It's not necessarily a CPU problem, it may be a problem of having enough
stored energy to run software loops to flush caches. There's also the
consideration that a general purpose platform may mix persistent memory
technologies from different vendors where some might be flash-backed DRAM
and some might be persistent media directly. For now I don't think we need
to worry about it, but I don't want to make the assumption that this
property is platform global given the history of how persistent memory has
been deployed to date.
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c	2018-05-19 06:20:28.000000000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c	2018-05-19 07:10:26.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/dm-kcopyd.h>
 #include <linux/dax.h>
 #include <linux/pfn_t.h>
+#include <linux/libnvdimm.h>
 
 #define DM_MSG_PREFIX "writecache"
 
@@ -47,14 +48,48 @@
  * On ARM64, cache flushing is more efficient.
  */
 #if defined(CONFIG_X86_64)
-#define EAGER_DATA_FLUSH
-#define NT_STORE(dest, src)				\
-do {							\
-	typeof(src) val = (src);			\
-	memcpy_flushcache(&(dest), &val, sizeof(src));	\
+
+static void pmem_memcpy(void *dest, void *src, size_t len)
+{
+	memcpy_flushcache(dest, src, len);
+}
+
+#define __pmem_assign(dest, src, uniq)			\
+do {							\
+	typeof(dest) uniq = (src);			\
+	memcpy_flushcache(&(dest), &uniq, sizeof(dest));\
 } while (0)
+
+#define pmem_assign(dest, src)				\
+	__pmem_assign(dest, src, __UNIQUE_ID(pmem_assign))
+
+static void pmem_flush(void *dest, size_t len)
+{
+}
+
+static void pmem_commit(void)
+{
+	wmb();
+}
+
 #else
-#define NT_STORE(dest, src)	WRITE_ONCE(dest, src)
+
+static void pmem_memcpy(void *dest, void *src, size_t len)
+{
+	memcpy(dest, src, len);
+}
+
+#define pmem_assign(dest, src)	WRITE_ONCE(dest, src)
+
+static void pmem_flush(void *dest, size_t len)
+{
+	arch_wb_cache_pmem(dest, len);
+}
+
+static void pmem_commit(void)
+{
+}
+
 #endif
 
 #if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && !defined(DM_WRITECACHE_ONLY_SSD)
@@ -105,7 +140,7 @@ struct wc_entry {
 };
 
 #ifndef DM_WRITECACHE_ONLY_SSD
-#define WC_MODE_PMEM(wc)	((wc)->pmem_mode)
+#define WC_MODE_PMEM(wc)	(likely((wc)->pmem_mode))
 #define WC_MODE_FUA(wc)		((wc)->writeback_fua)
 #else
 #define WC_MODE_PMEM(wc)	false
@@ -400,21 +435,6 @@ static void persistent_memory_invalidate
 	invalidate_kernel_vmap_range(ptr, size);
 }
 
-static void persistent_memory_flush(struct dm_writecache *wc, void *ptr, size_t size)
-{
-#ifndef EAGER_DATA_FLUSH
-	dax_flush(wc->ssd_dev->dax_dev, ptr, size);
-#endif
-}
-
-static void persistent_memory_commit_flushed(void)
-{
-#ifdef EAGER_DATA_FLUSH
-	/* needed since memcpy_flushcache is used instead of dax_flush */
-	wmb();
-#endif
-}
-
 static struct wc_memory_superblock *sb(struct dm_writecache *wc)
 {
 	return wc->memory_map;
@@ -462,21 +482,20 @@ static void clear_seq_count(struct dm_wr
 #ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
 	e->seq_count = -1;
 #endif
-	NT_STORE(memory_entry(wc, e)->seq_count, cpu_to_le64(-1));
+	pmem_assign(memory_entry(wc, e)->seq_count, cpu_to_le64(-1));
 }
 
 static void write_original_sector_seq_count(struct dm_writecache *wc, struct wc_entry *e,
 					    uint64_t original_sector, uint64_t seq_count)
 {
-	struct wc_memory_entry *me_p, me;
+	struct wc_memory_entry me;
 #ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
 	e->original_sector = original_sector;
 	e->seq_count = seq_count;
 #endif
-	me_p = memory_entry(wc, e);
 	me.original_sector = cpu_to_le64(original_sector);
 	me.seq_count = cpu_to_le64(seq_count);
-	NT_STORE(*me_p, me);
+	pmem_assign(*memory_entry(wc, e), me);
 }
 
 #define writecache_error(wc, err, msg, arg...)		\
@@ -491,8 +510,7 @@ do {							\
 static void writecache_flush_all_metadata(struct dm_writecache *wc)
 {
 	if (WC_MODE_PMEM(wc)) {
-		persistent_memory_flush(wc,
-			sb(wc), offsetof(struct wc_memory_superblock, entries[wc->n_blocks]));
+		pmem_flush(sb(wc), offsetof(struct wc_memory_superblock, entries[wc->n_blocks]));
 	} else {
 		memset(wc->dirty_bitmap, -1, wc->dirty_bitmap_size);
 	}
@@ -501,7 +519,7 @@ static void writecache_flush_all_metadat
 static void writecache_flush_region(struct dm_writecache *wc, void *ptr, size_t size)
 {
 	if (WC_MODE_PMEM(wc))
-		persistent_memory_flush(wc, ptr, size);
+		pmem_flush(ptr, size);
 	else
 		__set_bit(((char *)ptr - (char *)wc->memory_map) / BITMAP_GRANULARITY,
 			  wc->dirty_bitmap);
@@ -579,7 +597,7 @@ static void ssd_commit_flushed(struct dm
 static void writecache_commit_flushed(struct dm_writecache *wc)
 {
 	if (WC_MODE_PMEM(wc))
-		persistent_memory_commit_flushed();
+		pmem_commit();
 	else
 		ssd_commit_flushed(wc);
 }
@@ -788,10 +806,8 @@ static void writecache_poison_lists(stru
 static void writecache_flush_entry(struct dm_writecache *wc, struct wc_entry *e)
 {
 	writecache_flush_region(wc, memory_entry(wc, e), sizeof(struct wc_memory_entry));
-#ifndef EAGER_DATA_FLUSH
 	if (WC_MODE_PMEM(wc))
 		writecache_flush_region(wc, memory_data(wc, e), wc->block_size);
-#endif
 }
 
 static bool writecache_entry_is_committed(struct dm_writecache *wc, struct wc_entry *e)
@@ -834,7 +850,7 @@ static void writecache_flush(struct dm_w
 	writecache_wait_for_ios(wc, WRITE);
 
 	wc->seq_count++;
-	NT_STORE(sb(wc)->seq_count, cpu_to_le64(wc->seq_count));
+	pmem_assign(sb(wc)->seq_count, cpu_to_le64(wc->seq_count));
 	writecache_flush_region(wc, &sb(wc)->seq_count, sizeof sb(wc)->seq_count);
 	writecache_commit_flushed(wc);
 
@@ -1152,11 +1168,7 @@ static void bio_copy_block(struct dm_wri
 			}
 		} else {
 			flush_dcache_page(bio_page(bio));
-#ifdef EAGER_DATA_FLUSH
-			memcpy_flushcache(data, buf, size);
-#else
-			memcpy(data, buf, size);
-#endif
+			pmem_memcpy(data, buf, size);
 		}
 
 		bvec_kunmap_irq(buf, &flags);
@@ -1850,18 +1862,18 @@ static int init_memory(struct dm_writeca
 		return r;
 
 	for (b = 0; b < ARRAY_SIZE(sb(wc)->padding); b++)
-		NT_STORE(sb(wc)->padding[b], cpu_to_le64(0));
-	NT_STORE(sb(wc)->version, cpu_to_le32(MEMORY_SUPERBLOCK_VERSION));
-	NT_STORE(sb(wc)->block_size, cpu_to_le32(wc->block_size));
-	NT_STORE(sb(wc)->n_blocks, cpu_to_le64(wc->n_blocks));
-	NT_STORE(sb(wc)->seq_count, cpu_to_le64(0));
+		pmem_assign(sb(wc)->padding[b], cpu_to_le64(0));
+	pmem_assign(sb(wc)->version, cpu_to_le32(MEMORY_SUPERBLOCK_VERSION));
+	pmem_assign(sb(wc)->block_size, cpu_to_le32(wc->block_size));
+	pmem_assign(sb(wc)->n_blocks, cpu_to_le64(wc->n_blocks));
+	pmem_assign(sb(wc)->seq_count, cpu_to_le64(0));
 	for (b = 0; b < wc->n_blocks; b++)
 		write_original_sector_seq_count(wc, &wc->entries[b], -1, -1);
 
 	writecache_flush_all_metadata(wc);
 	writecache_commit_flushed(wc);
 
-	NT_STORE(sb(wc)->magic, cpu_to_le32(MEMORY_SUPERBLOCK_MAGIC));
+	pmem_assign(sb(wc)->magic, cpu_to_le32(MEMORY_SUPERBLOCK_MAGIC));
 	writecache_flush_region(wc, &sb(wc)->magic, sizeof sb(wc)->magic);
 	writecache_commit_flushed(wc);