[4/4] dm-writecache: use new API for flushing

Message ID	20180519052635.567438191@debian.vm (mailing list archive)
State	Superseded, archived
Delegated to:	Mike Snitzer
Headers	show Return-Path: <dm-devel-bounces@redhat.com> Message-Id: <20180519052635.567438191@debian.vm> User-Agent: quilt/0.63-1 Date: Sat, 19 May 2018 07:25:07 +0200 From: Mikulas Patocka <mpatocka@redhat.com> To: Mikulas Patocka <mpatocka@redhat.com>, Mike Snitzer <msnitzer@redhat.com>, Dan Williams <dan.j.williams@intel.com> References: <20180519052503.325953342@debian.vm> MIME-Version: 1.0 Content-Disposition: inline; filename=dm-writecache-interface.patch Cc: dm-devel@redhat.com Subject: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing Precedence: junk Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com

Mikulas Patocka May 19, 2018, 5:25 a.m. UTC

Use new API for flushing persistent memory.

The problem is this:
* on X86-64, non-temporal stores have the best performance
* ARM64 doesn't have non-temporal stores, so we must flush cache. We
  should flush cache as late as possible, because it performs better this
  way.

We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
data persistently, all three functions must be called.

The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
(unlike pmem_memcpy) guarantees that 8-byte values are written atomically.

On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
pmem_commit is wmb.

On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
pmem_commit is empty.

Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-writecache.c |  100 +++++++++++++++++++++++++--------------------
 1 file changed, 56 insertions(+), 44 deletions(-)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Christoph Hellwig May 22, 2018, 6:39 a.m. UTC | #1

On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> Use new API for flushing persistent memory.

The sentence doesnt make much sense.  'A new API', 'A better
abstraction' maybe?

> 
> The problem is this:
> * on X86-64, non-temporal stores have the best performance
> * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>   should flush cache as late as possible, because it performs better this
>   way.
> 
> We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> data persistently, all three functions must be called.
> 
> The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> 
> On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
> 
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.

All these should be provided by the pmem layer, and be properly
documented.  And be sorted before adding your new target that uses
them.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer May 22, 2018, 6:41 p.m. UTC | #2

On Tue, May 22 2018 at  2:39am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> > Use new API for flushing persistent memory.
> 
> The sentence doesnt make much sense.  'A new API', 'A better
> abstraction' maybe?
> 
> > 
> > The problem is this:
> > * on X86-64, non-temporal stores have the best performance
> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >   should flush cache as late as possible, because it performs better this
> >   way.
> > 
> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > data persistently, all three functions must be called.
> > 
> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> > 
> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > pmem_commit is wmb.
> > 
> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > pmem_commit is empty.
> 
> All these should be provided by the pmem layer, and be properly
> documented.  And be sorted before adding your new target that uses
> them.

I don't see that as a hard requirement.  Mikulas did the work to figure
out what is more optimal on x86_64 vs amd64.  It makes a difference for
his target and that is sufficient to carry it locally until/when it is
either elevated to pmem.

We cannot even get x86 and swait maintainers to reply to repeat requests
for review.  Stacking up further deps on pmem isn't high on my list.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 22, 2018, 7 p.m. UTC | #3

On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, May 22 2018 at  2:39am -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
>> > Use new API for flushing persistent memory.
>>
>> The sentence doesnt make much sense.  'A new API', 'A better
>> abstraction' maybe?
>>
>> >
>> > The problem is this:
>> > * on X86-64, non-temporal stores have the best performance
>> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> >   should flush cache as late as possible, because it performs better this
>> >   way.
>> >
>> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
>> > data persistently, all three functions must be called.
>> >
>> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>> >
>> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> > pmem_commit is wmb.
>> >
>> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> > pmem_commit is empty.
>>
>> All these should be provided by the pmem layer, and be properly
>> documented.  And be sorted before adding your new target that uses
>> them.
>
> I don't see that as a hard requirement.  Mikulas did the work to figure
> out what is more optimal on x86_64 vs amd64.  It makes a difference for
> his target and that is sufficient to carry it locally until/when it is
> either elevated to pmem.
>
> We cannot even get x86 and swait maintainers to reply to repeat requests
> for review.  Stacking up further deps on pmem isn't high on my list.
>

Except I'm being responsive. I agree with Christoph that we should
build pmem helpers at an architecture level and not per-driver. Let's
make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
up to x86 in this space. We already have PowerPC enabling PMEM API, so
I don't see an unreasonable barrier to ask the same of ARM. This patch
is not even cc'd to linux-arm-kernel. Has the subject been broached
with them?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer May 22, 2018, 7:19 p.m. UTC | #4

On Tue, May 22 2018 at  3:00pm -0400,
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Tue, May 22 2018 at  2:39am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> >
> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> > Use new API for flushing persistent memory.
> >>
> >> The sentence doesnt make much sense.  'A new API', 'A better
> >> abstraction' maybe?
> >>
> >> >
> >> > The problem is this:
> >> > * on X86-64, non-temporal stores have the best performance
> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >   should flush cache as late as possible, because it performs better this
> >> >   way.
> >> >
> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> > data persistently, all three functions must be called.
> >> >
> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >> >
> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> > pmem_commit is wmb.
> >> >
> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> > pmem_commit is empty.
> >>
> >> All these should be provided by the pmem layer, and be properly
> >> documented.  And be sorted before adding your new target that uses
> >> them.
> >
> > I don't see that as a hard requirement.  Mikulas did the work to figure
> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
> > his target and that is sufficient to carry it locally until/when it is
> > either elevated to pmem.
> >
> > We cannot even get x86 and swait maintainers to reply to repeat requests
> > for review.  Stacking up further deps on pmem isn't high on my list.
> >
> 
> Except I'm being responsive.

Except you're looking to immediately punt to linux-arm-kernel ;)

> I agree with Christoph that we should
> build pmem helpers at an architecture level and not per-driver. Let's
> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> I don't see an unreasonable barrier to ask the same of ARM. This patch
> is not even cc'd to linux-arm-kernel. Has the subject been broached
> with them?

No idea.  Not by me.

The thing is, I'm no expert in pmem.  You are.  Coordinating the change
with ARM et al feels unnecessarily limiting and quicky moves outside my
control.

Serious question: Why can't this code land in this dm-writecache target
and then be lifted (or obsoleted)?

But if you think it worthwhile to force ARM to step up then fine.  That
does limit the availability of using writecache on ARM while they get
the PMEM API together.

I'll do whatever you want.. just put the smack down and tell me how it
is ;)

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 22, 2018, 7:27 p.m. UTC | #5

On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, May 22 2018 at  3:00pm -0400,
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > On Tue, May 22 2018 at  2:39am -0400,
>> > Christoph Hellwig <hch@infradead.org> wrote:
>> >
>> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
>> >> > Use new API for flushing persistent memory.
>> >>
>> >> The sentence doesnt make much sense.  'A new API', 'A better
>> >> abstraction' maybe?
>> >>
>> >> >
>> >> > The problem is this:
>> >> > * on X86-64, non-temporal stores have the best performance
>> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> >> >   should flush cache as late as possible, because it performs better this
>> >> >   way.
>> >> >
>> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
>> >> > data persistently, all three functions must be called.
>> >> >
>> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>> >> >
>> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> >> > pmem_commit is wmb.
>> >> >
>> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> >> > pmem_commit is empty.
>> >>
>> >> All these should be provided by the pmem layer, and be properly
>> >> documented.  And be sorted before adding your new target that uses
>> >> them.
>> >
>> > I don't see that as a hard requirement.  Mikulas did the work to figure
>> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
>> > his target and that is sufficient to carry it locally until/when it is
>> > either elevated to pmem.
>> >
>> > We cannot even get x86 and swait maintainers to reply to repeat requests
>> > for review.  Stacking up further deps on pmem isn't high on my list.
>> >
>>
>> Except I'm being responsive.
>
> Except you're looking to immediately punt to linux-arm-kernel ;)

Well, I'm not, not really. I'm saying drop ARM support, it's not ready.

>
>> I agree with Christoph that we should
>> build pmem helpers at an architecture level and not per-driver. Let's
>> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
>> up to x86 in this space. We already have PowerPC enabling PMEM API, so
>> I don't see an unreasonable barrier to ask the same of ARM. This patch
>> is not even cc'd to linux-arm-kernel. Has the subject been broached
>> with them?
>
> No idea.  Not by me.
>
> The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> with ARM et al feels unnecessarily limiting and quicky moves outside my
> control.
>
> Serious question: Why can't this code land in this dm-writecache target
> and then be lifted (or obsoleted)?

Because we already have an API, and we don't want to promote local
solutions to global problems, or carry  unnecessary technical debt.

>
> But if you think it worthwhile to force ARM to step up then fine.  That
> does limit the availability of using writecache on ARM while they get
> the PMEM API together.
>
> I'll do whatever you want.. just put the smack down and tell me how it
> is ;)

I'd say just control the variables you can control. Drop the ARM
support if you want to move forward and propose extensions / updates
to the pmem api for x86 and I'll help push those since I was involved
in pushing the x86 pmem api in the first instance. That way you don't
need to touch this driver as new archs add their pmem api enabling.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer May 22, 2018, 8:52 p.m. UTC | #6

On Tue, May 22 2018 at  3:27pm -0400,
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Tue, May 22 2018 at  3:00pm -0400,
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> > On Tue, May 22 2018 at  2:39am -0400,
> >> > Christoph Hellwig <hch@infradead.org> wrote:
> >> >
> >> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> >> > Use new API for flushing persistent memory.
> >> >>
> >> >> The sentence doesnt make much sense.  'A new API', 'A better
> >> >> abstraction' maybe?
> >> >>
> >> >> >
> >> >> > The problem is this:
> >> >> > * on X86-64, non-temporal stores have the best performance
> >> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >> >   should flush cache as late as possible, because it performs better this
> >> >> >   way.
> >> >> >
> >> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> >> > data persistently, all three functions must be called.
> >> >> >
> >> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >> >> >
> >> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> >> > pmem_commit is wmb.
> >> >> >
> >> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> >> > pmem_commit is empty.
> >> >>
> >> >> All these should be provided by the pmem layer, and be properly
> >> >> documented.  And be sorted before adding your new target that uses
> >> >> them.
> >> >
> >> > I don't see that as a hard requirement.  Mikulas did the work to figure
> >> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
> >> > his target and that is sufficient to carry it locally until/when it is
> >> > either elevated to pmem.
> >> >
> >> > We cannot even get x86 and swait maintainers to reply to repeat requests
> >> > for review.  Stacking up further deps on pmem isn't high on my list.
> >> >
> >>
> >> Except I'm being responsive.
> >
> > Except you're looking to immediately punt to linux-arm-kernel ;)
> 
> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.
> 
> >
> >> I agree with Christoph that we should
> >> build pmem helpers at an architecture level and not per-driver. Let's
> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
> >> with them?
> >
> > No idea.  Not by me.
> >
> > The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> > with ARM et al feels unnecessarily limiting and quicky moves outside my
> > control.
> >
> > Serious question: Why can't this code land in this dm-writecache target
> > and then be lifted (or obsoleted)?
> 
> Because we already have an API, and we don't want to promote local
> solutions to global problems, or carry  unnecessary technical debt.
> 
> >
> > But if you think it worthwhile to force ARM to step up then fine.  That
> > does limit the availability of using writecache on ARM while they get
> > the PMEM API together.
> >
> > I'll do whatever you want.. just put the smack down and tell me how it
> > is ;)
> 
> I'd say just control the variables you can control. Drop the ARM
> support if you want to move forward and propose extensions / updates
> to the pmem api for x86 and I'll help push those since I was involved
> in pushing the x86 pmem api in the first instance. That way you don't
> need to touch this driver as new archs add their pmem api enabling.

Looking at Mikulas' wrapper API that you and hch are calling into
question:

For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
(And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)

Whereas x86_64 is using memcpy_flushcache() as provided by
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
(Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)

Just seems this isn't purely about ARM lacking on an API level (given on
x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).

Seems this is more to do with x86_64 having efficient Non-temporal
stores?

Anyway, I'm still trying to appreciate the details here before I can
make any forward progress.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Jeff Moyer May 22, 2018, 10:53 p.m. UTC | #7

Hi, Mike,

Mike Snitzer <snitzer@redhat.com> writes:

> Looking at Mikulas' wrapper API that you and hch are calling into
> question:
>
> For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
> (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)
>
> Whereas x86_64 is using memcpy_flushcache() as provided by
> CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
> (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)
>
> Just seems this isn't purely about ARM lacking on an API level (given on
> x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).
>
> Seems this is more to do with x86_64 having efficient Non-temporal
> stores?

Yeah, I think you've got that all right.

> Anyway, I'm still trying to appreciate the details here before I can
> make any forward progress.

Making data persistent on x64 requires 3 steps:
1) copy the data into pmem   (store instructions)
2) flush the cache lines associated with the data (clflush, clflush_opt, clwb)
3) wait on the flush to complete (sfence)

I'm not sure if other architectures require step 3.  Mikulas'
implementation seems to imply that arm64 doesn't require the fence.

The current pmem api provides:

memcpy*           -- step 1
memcpy_flushcache -- this combines steps 1 and 2
dax_flush         -- step 2
wmb*              -- step 3

* not strictly part of the pmem api

So, if you didn't care about performance, you could write generic code
that only used memcpy, dax_flush, and wmb (assuming other arches
actually need the wmb).  What Mikulas did was to abstract out an API
that could be called by generic code that would work optimally on all
architectures.

This looks like a worth-while addition to the PMEM API, to me.  Mikulas,
what do you think about refactoring the code as Christoph suggested?

Cheers,
Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 23, 2018, 8:57 p.m. UTC | #8

On Tue, 22 May 2018, Jeff Moyer wrote:

> Hi, Mike,
> 
> Mike Snitzer <snitzer@redhat.com> writes:
> 
> > Looking at Mikulas' wrapper API that you and hch are calling into
> > question:
> >
> > For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
> > (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)
> >
> > Whereas x86_64 is using memcpy_flushcache() as provided by
> > CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
> > (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)
> >
> > Just seems this isn't purely about ARM lacking on an API level (given on
> > x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).
> >
> > Seems this is more to do with x86_64 having efficient Non-temporal
> > stores?
> 
> Yeah, I think you've got that all right.
> 
> > Anyway, I'm still trying to appreciate the details here before I can
> > make any forward progress.
> 
> Making data persistent on x64 requires 3 steps:
> 1) copy the data into pmem   (store instructions)
> 2) flush the cache lines associated with the data (clflush, clflush_opt, clwb)
> 3) wait on the flush to complete (sfence)

In theory it works this way. In practice, this sequence is useless because 
the cache flusing instructions are horribly slow.

So, the dm-writecache driver uses non-temporal stores instead of cache 
flushing.

Now, the problem with arm64 is that it doesn't have non-temporal stores. 
So, memcpy_flushcache on arm64 does cached stores and flushes the cache 
afterwards. And this eager flushing is slower than late flushing. On arm4, 
you want to do cached stores, then do something else, and flush the cache 
as late as possible.

> I'm not sure if other architectures require step 3.  Mikulas'
> implementation seems to imply that arm64 doesn't require the fence.

I suppose that arch_wb_cache_pmem() does whatever it needs to do to flush 
the cache. If not, add something like arch_wb_cache_pmem_commit().

> The current pmem api provides:
> 
> memcpy*           -- step 1
> memcpy_flushcache -- this combines steps 1 and 2
> dax_flush         -- step 2
> wmb*              -- step 3
> 
> * not strictly part of the pmem api
> 
> So, if you didn't care about performance, you could write generic code
> that only used memcpy, dax_flush, and wmb (assuming other arches
> actually need the wmb).  What Mikulas did was to abstract out an API
> that could be called by generic code that would work optimally on all
> architectures.
> 
> This looks like a worth-while addition to the PMEM API, to me.  Mikulas,
> what do you think about refactoring the code as Christoph suggested?

I sent this patch 
https://www.redhat.com/archives/dm-devel/2018-May/msg00054.html so that 
you can take the functions pmem_memcpy, pmem_assign, pmem_flush and 
pmem_commit and move them to the generic linux headers. If you want to do 
it, do it.

> Cheers,
> Jeff

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 24, 2018, 8:15 a.m. UTC | #9

On Tue, 22 May 2018, Dan Williams wrote:

> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Tue, May 22 2018 at  2:39am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> >
> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> > Use new API for flushing persistent memory.
> >>
> >> The sentence doesnt make much sense.  'A new API', 'A better
> >> abstraction' maybe?
> >>
> >> >
> >> > The problem is this:
> >> > * on X86-64, non-temporal stores have the best performance
> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >   should flush cache as late as possible, because it performs better this
> >> >   way.
> >> >
> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> > data persistently, all three functions must be called.
> >> >
> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >> >
> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> > pmem_commit is wmb.
> >> >
> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> > pmem_commit is empty.
> >>
> >> All these should be provided by the pmem layer, and be properly
> >> documented.  And be sorted before adding your new target that uses
> >> them.
> >
> > I don't see that as a hard requirement.  Mikulas did the work to figure
> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
> > his target and that is sufficient to carry it locally until/when it is
> > either elevated to pmem.
> >
> > We cannot even get x86 and swait maintainers to reply to repeat requests
> > for review.  Stacking up further deps on pmem isn't high on my list.
> >
> 
> Except I'm being responsive. I agree with Christoph that we should
> build pmem helpers at an architecture level and not per-driver. Let's
> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> I don't see an unreasonable barrier to ask the same of ARM. This patch
> is not even cc'd to linux-arm-kernel. Has the subject been broached
> with them?

The ARM code can't "catch-up" with X86.

On X86 - non-temporal stores (i.e. memcpy_flushcache) are faster than 
cached write and cache flushing.

The ARM architecture doesn't have non-temporal stores. So, 
memcpy_flushcache on ARM does memcpy (that writes data to the cache) and 
then flushes the cache. But this eager cache flushig is slower than late 
cache flushing.

The optimal code sequence on ARM to write to persistent memory is to call 
memcpy, then do something else, and then call arch_wb_cache_pmem as late 
as possible. And this ARM-optimized code sequence is just horribly slow on 
X86.

This issue can't be "fixed" in ARM-specific source code. The ARM processor 
have such a characteristics that eager cache flushing is slower than late 
cache flushing - and that's it - you can't change processor behavior.

If you don't want '#if defined(CONFIG_X86_64)' in the dm-writecache 
driver, then just take the functions that are in this conditional block 
and move them to some generic linux header.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 25, 2018, 3:12 a.m. UTC | #10

On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> Use new API for flushing persistent memory.
>
> The problem is this:
> * on X86-64, non-temporal stores have the best performance
> * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>   should flush cache as late as possible, because it performs better this
>   way.
>
> We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> data persistently, all three functions must be called.
>
> The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>
> On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
>
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.

I don't want to grow driver-local wrappers for pmem. You should use
memcpy_flushcache directly() and if an architecture does not define
memcpy_flushcache() then don't allow building dm-writecache, i.e. this
driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
see a need to add a standalone flush operation if all relevant archs
provide memcpy_flushcache(). As for commit, I'd say just use wmb()
directly since all archs define it. Alternatively we could introduce
memcpy_flushcache_relaxed() to be the un-ordered version of the copy
routine and memcpy_flushcache() would imply a wmb().

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 25, 2018, 6:17 a.m. UTC | #11

On Thu, 24 May 2018, Dan Williams wrote:

> On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > Use new API for flushing persistent memory.
> >
> > The problem is this:
> > * on X86-64, non-temporal stores have the best performance
> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >   should flush cache as late as possible, because it performs better this
> >   way.
> >
> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > data persistently, all three functions must be called.
> >
> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >
> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > pmem_commit is wmb.
> >
> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > pmem_commit is empty.
> 
> I don't want to grow driver-local wrappers for pmem. You should use
> memcpy_flushcache directly() and if an architecture does not define
> memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> see a need to add a standalone flush operation if all relevant archs
> provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> directly since all archs define it. Alternatively we could introduce
> memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> routine and memcpy_flushcache() would imply a wmb().

But memcpy_flushcache() on ARM64 is slow.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer May 25, 2018, 12:51 p.m. UTC | #12

On Fri, May 25 2018 at  2:17am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Thu, 24 May 2018, Dan Williams wrote:
> 
> > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > Use new API for flushing persistent memory.
> > >
> > > The problem is this:
> > > * on X86-64, non-temporal stores have the best performance
> > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> > >   should flush cache as late as possible, because it performs better this
> > >   way.
> > >
> > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > > data persistently, all three functions must be called.
> > >
> > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> > >
> > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > > pmem_commit is wmb.
> > >
> > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > > pmem_commit is empty.
> > 
> > I don't want to grow driver-local wrappers for pmem. You should use
> > memcpy_flushcache directly() and if an architecture does not define
> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> > see a need to add a standalone flush operation if all relevant archs
> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> > directly since all archs define it. Alternatively we could introduce
> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> > routine and memcpy_flushcache() would imply a wmb().
> 
> But memcpy_flushcache() on ARM64 is slow.

Yes, Dan can you please take some time to fully appreciate what this
small wrapper API is providing (maybe you've done that, but your recent
reply is mixed-message).  Seems you're keeping the innovation it
provides at arms-length.  Basically the PMEM APIs you've helped
construct are lacking, forcing DM developers to own fixing them is an
inversion that only serves to stonewall.

Please, less time on stonewalling and more time lifting this wrapper
API; otherwise the dm-writecache local wrapper API is a near-term means
to an end.

I revised the dm-writecache patch header yesterday, please review:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.18&id=2105231db61b08752bc4247d2fe7838657700b0d
(the last paragraph in particular, I intend to move forward with this
wrapper unless someone on the PMEM side of the house steps up and lifts
it up between now and when the 4.18 merge window opens)

dm: add writecache target
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.

The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.

The following describes the approach used to provide the most efficient
flushing of persistent memory on X86_64 vs ARM64:

* On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster
  than cached writes and cache flushing.

* The ARM64 architecture doesn't have non-temporal stores. So,
  memcpy_flushcache on ARM does memcpy (that writes data to the cache)
  and then flushes the cache.  But this eager cache flushig is slower
  than late cache flushing.

The optimal code sequence on ARM to write to persistent memory is to
call memcpy, then do something else, and then call arch_wb_cache_pmem as
late as possible. And this ARM-optimized code sequence is just horribly
slow on X86.

This issue can't be "fixed" in ARM-specific source code. The ARM
processor have characteristics such that eager cache flushing is slower
than late cache flushing - and that's it - you can't change processor
behavior.

We introduce a wrapper API for flushing persistent memory with functions
pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently,
all three functions must be called.

The macro pmem_assign may be used instead of pmem_memcpy.  pmem_assign
(unlike pmem_memcpy) guarantees that 8-byte values are written
atomically.

On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
pmem_commit is wmb.

On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
pmem_commit is empty.

It is clear that this wrapper API for flushing persistent memory needs
to be elevated out of this dm-writecache driver.  But that can happen
later without requiring DM developers to blaze new trails on pmem
specific implementation details/quirks (pmem developers need to clean up
their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API
and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take
into account the duality of the different programming models needed to
achieve optimal cross-architecture use of persistent memory).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <msnitzer@redhat.com>

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 25, 2018, 3:57 p.m. UTC | #13

On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Fri, May 25 2018 at  2:17am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>>
>>
>> On Thu, 24 May 2018, Dan Williams wrote:
>>
>> > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > > Use new API for flushing persistent memory.
>> > >
>> > > The problem is this:
>> > > * on X86-64, non-temporal stores have the best performance
>> > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> > >   should flush cache as late as possible, because it performs better this
>> > >   way.
>> > >
>> > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
>> > > data persistently, all three functions must be called.
>> > >
>> > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>> > >
>> > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> > > pmem_commit is wmb.
>> > >
>> > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> > > pmem_commit is empty.
>> >
>> > I don't want to grow driver-local wrappers for pmem. You should use
>> > memcpy_flushcache directly() and if an architecture does not define
>> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> > see a need to add a standalone flush operation if all relevant archs
>> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> > directly since all archs define it. Alternatively we could introduce
>> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> > routine and memcpy_flushcache() would imply a wmb().
>>
>> But memcpy_flushcache() on ARM64 is slow.
>
> Yes, Dan can you please take some time to fully appreciate what this
> small wrapper API is providing (maybe you've done that, but your recent
> reply is mixed-message).  Seems you're keeping the innovation it
> provides at arms-length.  Basically the PMEM APIs you've helped
> construct are lacking, forcing DM developers to own fixing them is an
> inversion that only serves to stonewall.
>
> Please, less time on stonewalling and more time lifting this wrapper
> API; otherwise the dm-writecache local wrapper API is a near-term means
> to an end.
>
> I revised the dm-writecache patch header yesterday, please review:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.18&id=2105231db61b08752bc4247d2fe7838657700b0d
> (the last paragraph in particular, I intend to move forward with this
> wrapper unless someone on the PMEM side of the house steps up and lifts
> it up between now and when the 4.18 merge window opens)
>
> dm: add writecache target
> The writecache target caches writes on persistent memory or SSD.
> It is intended for databases or other programs that need extremely low
> commit latency.
>
> The writecache target doesn't cache reads because reads are supposed to
> be cached in page cache in normal RAM.
>
> The following describes the approach used to provide the most efficient
> flushing of persistent memory on X86_64 vs ARM64:
>
> * On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster
>   than cached writes and cache flushing.
>
> * The ARM64 architecture doesn't have non-temporal stores. So,
>   memcpy_flushcache on ARM does memcpy (that writes data to the cache)
>   and then flushes the cache.  But this eager cache flushig is slower
>   than late cache flushing.
>
> The optimal code sequence on ARM to write to persistent memory is to
> call memcpy, then do something else, and then call arch_wb_cache_pmem as
> late as possible. And this ARM-optimized code sequence is just horribly
> slow on X86.
>
> This issue can't be "fixed" in ARM-specific source code. The ARM
> processor have characteristics such that eager cache flushing is slower
> than late cache flushing - and that's it - you can't change processor
> behavior.
>
> We introduce a wrapper API for flushing persistent memory with functions
> pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently,
> all three functions must be called.
>
> The macro pmem_assign may be used instead of pmem_memcpy.  pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written
> atomically.
>
> On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
>
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.
>
> It is clear that this wrapper API for flushing persistent memory needs
> to be elevated out of this dm-writecache driver.  But that can happen
> later without requiring DM developers to blaze new trails on pmem
> specific implementation details/quirks (pmem developers need to clean up
> their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API
> and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take
> into account the duality of the different programming models needed to
> achieve optimal cross-architecture use of persistent memory).

Right, so again, what is wrong with memcpy_flushcache_relaxed() +
wmb() or otherwise making memcpy_flushcache() ordered. I do not see
that as a trailblazing requirement, I see that as typical review and a
reduction of the operation space that you are proposing.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 26, 2018, 7:02 a.m. UTC | #14

On Fri, 25 May 2018, Dan Williams wrote:

> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Fri, May 25 2018 at  2:17am -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >> On Thu, 24 May 2018, Dan Williams wrote:
> >>
> >> > I don't want to grow driver-local wrappers for pmem. You should use
> >> > memcpy_flushcache directly() and if an architecture does not define
> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> >> > see a need to add a standalone flush operation if all relevant archs
> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> >> > directly since all archs define it. Alternatively we could introduce
> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> >> > routine and memcpy_flushcache() would imply a wmb().
> >>
> >> But memcpy_flushcache() on ARM64 is slow.
> 
> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
> that as a trailblazing requirement, I see that as typical review and a
> reduction of the operation space that you are proposing.

memcpy_flushcache on ARM64 is generally wrong thing to do, because it is 
slower than memcpy and explicit cache flush.

Suppose that you want to write data to a block device and make it 
persistent. So you send a WRITE bio and then a FLUSH bio.

Now - how to implement these two bios on persistent memory:

On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does 
wmb() - this is the optimal implementation.

But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal 
implementation is that the WRITE bio does just memcpy() and the FLUSH bio 
does arch_wb_cache_pmem() on the affected range.

Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture 
doesn't have non-temporal stores. So, memcpy_flushcache() is implemented 
as memcpy() followed by a cache flush.

Now - if you flush the cache immediatelly after memcpy, the cache is full 
of dirty lines and the cache-flushing code has to write these lines back 
and that is slow.

If you flush the cache some time after memcpy (i.e. when the FLUSH bio is 
received), the processor already flushed some part of the cache on its 
own, so the cache-flushing function has less work to do and it is faster.

So the conclusion is - don't use memcpy_flushcache on ARM. This problem 
cannot be fixed by a better implementation of memcpy_flushcache.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 26, 2018, 3:26 p.m. UTC | #15

On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Fri, 25 May 2018, Dan Williams wrote:
>
>> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > On Fri, May 25 2018 at  2:17am -0400,
>> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >> On Thu, 24 May 2018, Dan Williams wrote:
>> >>
>> >> > I don't want to grow driver-local wrappers for pmem. You should use
>> >> > memcpy_flushcache directly() and if an architecture does not define
>> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> >> > see a need to add a standalone flush operation if all relevant archs
>> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> >> > directly since all archs define it. Alternatively we could introduce
>> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> >> > routine and memcpy_flushcache() would imply a wmb().
>> >>
>> >> But memcpy_flushcache() on ARM64 is slow.
>>
>> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>> that as a trailblazing requirement, I see that as typical review and a
>> reduction of the operation space that you are proposing.
>
> memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
> slower than memcpy and explicit cache flush.
>
> Suppose that you want to write data to a block device and make it
> persistent. So you send a WRITE bio and then a FLUSH bio.
>
> Now - how to implement these two bios on persistent memory:
>
> On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
> wmb() - this is the optimal implementation.
>
> But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
> implementation is that the WRITE bio does just memcpy() and the FLUSH bio
> does arch_wb_cache_pmem() on the affected range.
>
> Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
> doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
> as memcpy() followed by a cache flush.
>
> Now - if you flush the cache immediatelly after memcpy, the cache is full
> of dirty lines and the cache-flushing code has to write these lines back
> and that is slow.
>
> If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
> received), the processor already flushed some part of the cache on its
> own, so the cache-flushing function has less work to do and it is faster.
>
> So the conclusion is - don't use memcpy_flushcache on ARM. This problem
> cannot be fixed by a better implementation of memcpy_flushcache.

It sounds like ARM might be better off with mapping its pmem as
write-through rather than write-back, and skip the explicit cache
management altogether. You speak of "optimal" and "sub-optimal", but
what would be more clear is fio measurements of the relative IOPs and
latency profiles of the different approaches. The reason I am
continuing to push here is that reducing the operation space from
'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
maintenance long term.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 28, 2018, 1:32 p.m. UTC | #16

On Sat, 26 May 2018, Dan Williams wrote:

> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Fri, 25 May 2018, Dan Williams wrote:
> >
> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> > On Fri, May 25 2018 at  2:17am -0400,
> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >
> >> >> On Thu, 24 May 2018, Dan Williams wrote:
> >> >>
> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
> >> >> > memcpy_flushcache directly() and if an architecture does not define
> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> >> >> > see a need to add a standalone flush operation if all relevant archs
> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> >> >> > directly since all archs define it. Alternatively we could introduce
> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> >> >> > routine and memcpy_flushcache() would imply a wmb().
> >> >>
> >> >> But memcpy_flushcache() on ARM64 is slow.
> >>
> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
> >> that as a trailblazing requirement, I see that as typical review and a
> >> reduction of the operation space that you are proposing.
> >
> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
> > slower than memcpy and explicit cache flush.
> >
> > Suppose that you want to write data to a block device and make it
> > persistent. So you send a WRITE bio and then a FLUSH bio.
> >
> > Now - how to implement these two bios on persistent memory:
> >
> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
> > wmb() - this is the optimal implementation.
> >
> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
> > does arch_wb_cache_pmem() on the affected range.
> >
> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
> > as memcpy() followed by a cache flush.
> >
> > Now - if you flush the cache immediatelly after memcpy, the cache is full
> > of dirty lines and the cache-flushing code has to write these lines back
> > and that is slow.
> >
> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
> > received), the processor already flushed some part of the cache on its
> > own, so the cache-flushing function has less work to do and it is faster.
> >
> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
> > cannot be fixed by a better implementation of memcpy_flushcache.
> 
> It sounds like ARM might be better off with mapping its pmem as
> write-through rather than write-back, and skip the explicit cache

I doubt it would perform well - write combining combines the writes into a 
larger segments - and write-through doesn't.

> management altogether. You speak of "optimal" and "sub-optimal", but
> what would be more clear is fio measurements of the relative IOPs and
> latency profiles of the different approaches. The reason I am
> continuing to push here is that reducing the operation space from
> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
> maintenance long term.

I measured it (with nvme backing store) and late cache flushing has 12% 
better performance than eager flushing with memcpy_flushcache().

131836 4k iops - vs - 117016.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 28, 2018, 1:52 p.m. UTC | #17

On Tue, 22 May 2018, Dan Williams wrote:

> >> Except I'm being responsive.
> >
> > Except you're looking to immediately punt to linux-arm-kernel ;)
> 
> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.

This is the worst thing to do - because once late cache flushing is 
dropped from the dm-writecache target, it could hardly be reintroduced 
again.

> >> I agree with Christoph that we should
> >> build pmem helpers at an architecture level and not per-driver. Let's
> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
> >> with them?
> >
> > No idea.  Not by me.
> >
> > The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> > with ARM et al feels unnecessarily limiting and quicky moves outside my
> > control.
> >
> > Serious question: Why can't this code land in this dm-writecache target
> > and then be lifted (or obsoleted)?
> 
> Because we already have an API, and we don't want to promote local
> solutions to global problems, or carry  unnecessary technical debt.
> 
> >
> > But if you think it worthwhile to force ARM to step up then fine.  That
> > does limit the availability of using writecache on ARM while they get
> > the PMEM API together.
> >
> > I'll do whatever you want.. just put the smack down and tell me how it
> > is ;)
> 
> I'd say just control the variables you can control. Drop the ARM
> support if you want to move forward and propose extensions / updates

What do we gain by dropping it?

> to the pmem api for x86 and I'll help push those since I was involved
> in pushing the x86 pmem api in the first instance. That way you don't
> need to touch this driver as new archs add their pmem api enabling.

The pmem API is x86-centric - that the problem.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 28, 2018, 5:41 p.m. UTC | #18

On Mon, May 28, 2018 at 6:52 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Tue, 22 May 2018, Dan Williams wrote:
>
>> >> Except I'm being responsive.
>> >
>> > Except you're looking to immediately punt to linux-arm-kernel ;)
>>
>> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.
>
> This is the worst thing to do - because once late cache flushing is
> dropped from the dm-writecache target, it could hardly be reintroduced
> again.
>
>> >> I agree with Christoph that we should
>> >> build pmem helpers at an architecture level and not per-driver. Let's
>> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
>> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
>> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
>> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
>> >> with them?
>> >
>> > No idea.  Not by me.
>> >
>> > The thing is, I'm no expert in pmem.  You are.  Coordinating the change
>> > with ARM et al feels unnecessarily limiting and quicky moves outside my
>> > control.
>> >
>> > Serious question: Why can't this code land in this dm-writecache target
>> > and then be lifted (or obsoleted)?
>>
>> Because we already have an API, and we don't want to promote local
>> solutions to global problems, or carry  unnecessary technical debt.
>>
>> >
>> > But if you think it worthwhile to force ARM to step up then fine.  That
>> > does limit the availability of using writecache on ARM while they get
>> > the PMEM API together.
>> >
>> > I'll do whatever you want.. just put the smack down and tell me how it
>> > is ;)
>>
>> I'd say just control the variables you can control. Drop the ARM
>> support if you want to move forward and propose extensions / updates
>
> What do we gain by dropping it?
>
>> to the pmem api for x86 and I'll help push those since I was involved
>> in pushing the x86 pmem api in the first instance. That way you don't
>> need to touch this driver as new archs add their pmem api enabling.
>
> The pmem API is x86-centric - that the problem.

When I read your patch I came away with the impression that ARM had
not added memcpy_flushcache() yet and you were working around that
fact. Now that I look, ARM *does* define memcpy_flushcache() and
you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
memcpy_flushcache() implementation is:

    memcpy(dst, src, cnt);
    __clean_dcache_area_pop(dst, cnt);

So, I do not see how what you're doing is any less work unless you are
flushing less than you copy?

If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
ARM implementation is broken and that needs to be addressed not worked
around in a driver.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 28, 2018, 6:14 p.m. UTC | #19

On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Sat, 26 May 2018, Dan Williams wrote:
>
>> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Fri, 25 May 2018, Dan Williams wrote:
>> >
>> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >> > On Fri, May 25 2018 at  2:17am -0400,
>> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >
>> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>> >> >>
>> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>> >> >> > memcpy_flushcache directly() and if an architecture does not define
>> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> >> >> > see a need to add a standalone flush operation if all relevant archs
>> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> >> >> > directly since all archs define it. Alternatively we could introduce
>> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> >> >> > routine and memcpy_flushcache() would imply a wmb().
>> >> >>
>> >> >> But memcpy_flushcache() on ARM64 is slow.
>> >>
>> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>> >> that as a trailblazing requirement, I see that as typical review and a
>> >> reduction of the operation space that you are proposing.
>> >
>> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
>> > slower than memcpy and explicit cache flush.
>> >
>> > Suppose that you want to write data to a block device and make it
>> > persistent. So you send a WRITE bio and then a FLUSH bio.
>> >
>> > Now - how to implement these two bios on persistent memory:
>> >
>> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>> > wmb() - this is the optimal implementation.
>> >
>> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>> > does arch_wb_cache_pmem() on the affected range.
>> >
>> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
>> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>> > as memcpy() followed by a cache flush.
>> >
>> > Now - if you flush the cache immediatelly after memcpy, the cache is full
>> > of dirty lines and the cache-flushing code has to write these lines back
>> > and that is slow.
>> >
>> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>> > received), the processor already flushed some part of the cache on its
>> > own, so the cache-flushing function has less work to do and it is faster.
>> >
>> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>> > cannot be fixed by a better implementation of memcpy_flushcache.
>>
>> It sounds like ARM might be better off with mapping its pmem as
>> write-through rather than write-back, and skip the explicit cache
>
> I doubt it would perform well - write combining combines the writes into a
> larger segments - and write-through doesn't.
>

Last I checked write-through caching does not disable write combining

>> management altogether. You speak of "optimal" and "sub-optimal", but
>> what would be more clear is fio measurements of the relative IOPs and
>> latency profiles of the different approaches. The reason I am
>> continuing to push here is that reducing the operation space from
>> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>> maintenance long term.
>
> I measured it (with nvme backing store) and late cache flushing has 12%
> better performance than eager flushing with memcpy_flushcache().

I assume what you're seeing is ARM64 over-flushing the amount of dirty
data so it becomes more efficient to do an amortized flush at the end?
However, that effectively makes memcpy_flushcache() unusable in the
way it can be used on x86. You claimed that ARM does not support
non-temporal stores, but it does, see the STNP instruction. I do not
want to see arch specific optimizations in drivers, so either
write-through mappings is a potential answer to remove the need to
explicitly manage flushing, or just implement STNP hacks in
memcpy_flushcache() like you did with MOVNT on x86.

> 131836 4k iops - vs - 117016.

To be clear this is memcpy_flushcache() vs memcpy + flush?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 30, 2018, 1:07 p.m. UTC | #20

On Mon, 28 May 2018, Dan Williams wrote:

> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Sat, 26 May 2018, Dan Williams wrote:
> >
> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >
> >> >
> >> > On Fri, 25 May 2018, Dan Williams wrote:
> >> >
> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> >> > On Fri, May 25 2018 at  2:17am -0400,
> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >> >
> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
> >> >> >>
> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> >> >> >> > see a need to add a standalone flush operation if all relevant archs
> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> >> >> >> > directly since all archs define it. Alternatively we could introduce
> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
> >> >> >>
> >> >> >> But memcpy_flushcache() on ARM64 is slow.
> >> >>
> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
> >> >> that as a trailblazing requirement, I see that as typical review and a
> >> >> reduction of the operation space that you are proposing.
> >> >
> >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
> >> > slower than memcpy and explicit cache flush.
> >> >
> >> > Suppose that you want to write data to a block device and make it
> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
> >> >
> >> > Now - how to implement these two bios on persistent memory:
> >> >
> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
> >> > wmb() - this is the optimal implementation.
> >> >
> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
> >> > does arch_wb_cache_pmem() on the affected range.
> >> >
> >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
> >> > as memcpy() followed by a cache flush.
> >> >
> >> > Now - if you flush the cache immediatelly after memcpy, the cache is full
> >> > of dirty lines and the cache-flushing code has to write these lines back
> >> > and that is slow.
> >> >
> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
> >> > received), the processor already flushed some part of the cache on its
> >> > own, so the cache-flushing function has less work to do and it is faster.
> >> >
> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
> >> > cannot be fixed by a better implementation of memcpy_flushcache.
> >>
> >> It sounds like ARM might be better off with mapping its pmem as
> >> write-through rather than write-back, and skip the explicit cache
> >
> > I doubt it would perform well - write combining combines the writes into a
> > larger segments - and write-through doesn't.
> >
> 
> Last I checked write-through caching does not disable write combining
> 
> >> management altogether. You speak of "optimal" and "sub-optimal", but
> >> what would be more clear is fio measurements of the relative IOPs and
> >> latency profiles of the different approaches. The reason I am
> >> continuing to push here is that reducing the operation space from
> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
> >> maintenance long term.
> >
> > I measured it (with nvme backing store) and late cache flushing has 12%
> > better performance than eager flushing with memcpy_flushcache().
> 
> I assume what you're seeing is ARM64 over-flushing the amount of dirty
> data so it becomes more efficient to do an amortized flush at the end?
> However, that effectively makes memcpy_flushcache() unusable in the
> way it can be used on x86. You claimed that ARM does not support
> non-temporal stores, but it does, see the STNP instruction. I do not
> want to see arch specific optimizations in drivers, so either
> write-through mappings is a potential answer to remove the need to
> explicitly manage flushing, or just implement STNP hacks in
> memcpy_flushcache() like you did with MOVNT on x86.
> 
> > 131836 4k iops - vs - 117016.
> 
> To be clear this is memcpy_flushcache() vs memcpy + flush?

I found out what caused the difference. I used dax_flush on the version of 
dm-writecache that I had on the ARM machine (with the kernel 4.14, because 
it is the last version where dax on ramdisk works) - and I thought that 
dax_flush flushes the cache, but it doesn't.

When I replaced dax_flush with arch_wb_cache_pmem, the performance 
difference between early flushing and late flushing disappeared.

So I think we can remove this per-architecture switch from dm-writecache.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer May 30, 2018, 1:16 p.m. UTC | #21

On Wed, May 30 2018 at  9:07am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Mon, 28 May 2018, Dan Williams wrote:
> 
> > On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > > I measured it (with nvme backing store) and late cache flushing has 12%
> > > better performance than eager flushing with memcpy_flushcache().
> > 
> > I assume what you're seeing is ARM64 over-flushing the amount of dirty
> > data so it becomes more efficient to do an amortized flush at the end?
> > However, that effectively makes memcpy_flushcache() unusable in the
> > way it can be used on x86. You claimed that ARM does not support
> > non-temporal stores, but it does, see the STNP instruction. I do not
> > want to see arch specific optimizations in drivers, so either
> > write-through mappings is a potential answer to remove the need to
> > explicitly manage flushing, or just implement STNP hacks in
> > memcpy_flushcache() like you did with MOVNT on x86.
> > 
> > > 131836 4k iops - vs - 117016.
> > 
> > To be clear this is memcpy_flushcache() vs memcpy + flush?
> 
> I found out what caused the difference. I used dax_flush on the version of 
> dm-writecache that I had on the ARM machine (with the kernel 4.14, because 
> it is the last version where dax on ramdisk works) - and I thought that 
> dax_flush flushes the cache, but it doesn't.
> 
> When I replaced dax_flush with arch_wb_cache_pmem, the performance 
> difference between early flushing and late flushing disappeared.
> 
> So I think we can remove this per-architecture switch from dm-writecache.

That is really great news, can you submit an incremental patch that
layers ontop of the linux-dm.git 'dm-4.18' branch?

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 30, 2018, 1:21 p.m. UTC | #22

On Wed, 30 May 2018, Mike Snitzer wrote:

> That is really great news, can you submit an incremental patch that
> layers ontop of the linux-dm.git 'dm-4.18' branch?
> 
> Thanks,
> Mike

I've sent the current version that I have. I fixed the bugs that were 
reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
long->int truncation).

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer May 30, 2018, 1:26 p.m. UTC | #23

On Wed, May 30 2018 at  9:21am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 30 May 2018, Mike Snitzer wrote:
> 
> > That is really great news, can you submit an incremental patch that
> > layers ontop of the linux-dm.git 'dm-4.18' branch?
> > 
> > Thanks,
> > Mike
> 
> I've sent the current version that I have. I fixed the bugs that were 
> reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> long->int truncation).

OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
your __branch_check__ patch).  Not sure what the dm_bufio_client_create
fix is... must've missed a report about that.

ANyway, point is we're on too a different phase of dm-writecache.c's
development.  I've picked it up and am trying to get it ready for the
4.18 merge window (likely opening Sunday).  Therefore it needs to be in
a git tree, and incremental changes overlayed.  I cannot be rebasing at
this late stage in the 4.18 development window.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 30, 2018, 1:33 p.m. UTC | #24

On Wed, 30 May 2018, Mike Snitzer wrote:

> On Wed, May 30 2018 at  9:21am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 30 May 2018, Mike Snitzer wrote:
> > 
> > > That is really great news, can you submit an incremental patch that
> > > layers ontop of the linux-dm.git 'dm-4.18' branch?
> > > 
> > > Thanks,
> > > Mike
> > 
> > I've sent the current version that I have. I fixed the bugs that were 
> > reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> > long->int truncation).
> 
> OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
> drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
> your __branch_check__ patch).  Not sure what the dm_bufio_client_create
> fix is... must've missed a report about that.
> 
> ANyway, point is we're on too a different phase of dm-writecache.c's
> development.  I've picked it up and am trying to get it ready for the
> 4.18 merge window (likely opening Sunday).  Therefore it needs to be in
> a git tree, and incremental changes overlayed.  I cannot be rebasing at
> this late stage in the 4.18 development window.
> 
> Thanks,
> Mike

I downloaded dm-writecache from your git repository some times ago - but 
you changed a lot of useless things (i.e. reordering the fields in the 
structure) since that time - so, you'll have to merge the changes.

You dropped the latency measuring code - why? We still would like to 
benchmark the driver.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Jeff Moyer May 30, 2018, 1:42 p.m. UTC | #25

Dan Williams <dan.j.williams@intel.com> writes:

> When I read your patch I came away with the impression that ARM had
> not added memcpy_flushcache() yet and you were working around that
> fact. Now that I look, ARM *does* define memcpy_flushcache() and
> you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
> ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
> memcpy_flushcache() implementation is:
>
>     memcpy(dst, src, cnt);
>     __clean_dcache_area_pop(dst, cnt);
>
> So, I do not see how what you're doing is any less work unless you are
> flushing less than you copy?
>
> If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
> ARM implementation is broken and that needs to be addressed not worked
> around in a driver.

I think Mikulas wanted to batch up multiple copies and flush at the
end.  According to his commit message, that batching gained him 2%
performance.

-Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 30, 2018, 1:51 p.m. UTC | #26

On Wed, 30 May 2018, Jeff Moyer wrote:

> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > When I read your patch I came away with the impression that ARM had
> > not added memcpy_flushcache() yet and you were working around that
> > fact. Now that I look, ARM *does* define memcpy_flushcache() and
> > you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
> > ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
> > memcpy_flushcache() implementation is:
> >
> >     memcpy(dst, src, cnt);
> >     __clean_dcache_area_pop(dst, cnt);
> >
> > So, I do not see how what you're doing is any less work unless you are
> > flushing less than you copy?
> >
> > If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
> > ARM implementation is broken and that needs to be addressed not worked
> > around in a driver.
> 
> I think Mikulas wanted to batch up multiple copies and flush at the
> end.  According to his commit message, that batching gained him 2%
> performance.
> 
> -Jeff

No - this 2% difference is inlined memcpy_flushcache() vs out-of-line 
memcpy_flushcache().

I thought that dax_flush() performed 12% better memcpy_flushcache() - but 
the reason why it performed better was - that it was not flushing the 
cache at all.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Jeff Moyer May 30, 2018, 1:52 p.m. UTC | #27

Jeff Moyer <jmoyer@redhat.com> writes:

> Dan Williams <dan.j.williams@intel.com> writes:
>
>> When I read your patch I came away with the impression that ARM had
>> not added memcpy_flushcache() yet and you were working around that
>> fact. Now that I look, ARM *does* define memcpy_flushcache() and
>> you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
>> ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
>> memcpy_flushcache() implementation is:
>>
>>     memcpy(dst, src, cnt);
>>     __clean_dcache_area_pop(dst, cnt);
>>
>> So, I do not see how what you're doing is any less work unless you are
>> flushing less than you copy?
>>
>> If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
>> ARM implementation is broken and that needs to be addressed not worked
>> around in a driver.
>
> I think Mikulas wanted to batch up multiple copies and flush at the
> end.  According to his commit message, that batching gained him 2%
> performance.

Nevermind me, I just caught up with the rest of the thread.  :)

-Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 30, 2018, 3:58 p.m. UTC | #28

On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Mon, 28 May 2018, Dan Williams wrote:
>
>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Sat, 26 May 2018, Dan Williams wrote:
>> >
>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >
>> >> >
>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>> >> >
>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >> >> > On Fri, May 25 2018 at  2:17am -0400,
>> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >> >
>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>> >> >> >>
>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>> >> >> >>
>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>> >> >>
>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>> >> >> that as a trailblazing requirement, I see that as typical review and a
>> >> >> reduction of the operation space that you are proposing.
>> >> >
>> >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
>> >> > slower than memcpy and explicit cache flush.
>> >> >
>> >> > Suppose that you want to write data to a block device and make it
>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>> >> >
>> >> > Now - how to implement these two bios on persistent memory:
>> >> >
>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>> >> > wmb() - this is the optimal implementation.
>> >> >
>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>> >> > does arch_wb_cache_pmem() on the affected range.
>> >> >
>> >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>> >> > as memcpy() followed by a cache flush.
>> >> >
>> >> > Now - if you flush the cache immediatelly after memcpy, the cache is full
>> >> > of dirty lines and the cache-flushing code has to write these lines back
>> >> > and that is slow.
>> >> >
>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>> >> > received), the processor already flushed some part of the cache on its
>> >> > own, so the cache-flushing function has less work to do and it is faster.
>> >> >
>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>> >>
>> >> It sounds like ARM might be better off with mapping its pmem as
>> >> write-through rather than write-back, and skip the explicit cache
>> >
>> > I doubt it would perform well - write combining combines the writes into a
>> > larger segments - and write-through doesn't.
>> >
>>
>> Last I checked write-through caching does not disable write combining
>>
>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>> >> what would be more clear is fio measurements of the relative IOPs and
>> >> latency profiles of the different approaches. The reason I am
>> >> continuing to push here is that reducing the operation space from
>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>> >> maintenance long term.
>> >
>> > I measured it (with nvme backing store) and late cache flushing has 12%
>> > better performance than eager flushing with memcpy_flushcache().
>>
>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>> data so it becomes more efficient to do an amortized flush at the end?
>> However, that effectively makes memcpy_flushcache() unusable in the
>> way it can be used on x86. You claimed that ARM does not support
>> non-temporal stores, but it does, see the STNP instruction. I do not
>> want to see arch specific optimizations in drivers, so either
>> write-through mappings is a potential answer to remove the need to
>> explicitly manage flushing, or just implement STNP hacks in
>> memcpy_flushcache() like you did with MOVNT on x86.
>>
>> > 131836 4k iops - vs - 117016.
>>
>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>
> I found out what caused the difference. I used dax_flush on the version of
> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
> it is the last version where dax on ramdisk works) - and I thought that
> dax_flush flushes the cache, but it doesn't.
>
> When I replaced dax_flush with arch_wb_cache_pmem, the performance
> difference between early flushing and late flushing disappeared.
>
> So I think we can remove this per-architecture switch from dm-writecache.

Great find! Thanks for the due diligence. Feel free to add:

    Acked-by: Dan Williams <dan.j.williams@intel.com>

...on the reworks to unify ARM and x86.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 30, 2018, 10:39 p.m. UTC | #29

On Wed, May 30, 2018 at 8:58 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>
>> On Mon, 28 May 2018, Dan Williams wrote:
>>
>>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >
>>> >
>>> > On Sat, 26 May 2018, Dan Williams wrote:
>>> >
>>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >
>>> >> >
>>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>>> >> >
>>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>>> >> >> > On Fri, May 25 2018 at  2:17am -0400,
>>> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >> >
>>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>>> >> >> >>
>>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>>> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
>>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>>> >> >> >>
>>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>>> >> >>
>>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>>> >> >> that as a trailblazing requirement, I see that as typical review and a
>>> >> >> reduction of the operation space that you are proposing.
>>> >> >
>>> >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
>>> >> > slower than memcpy and explicit cache flush.
>>> >> >
>>> >> > Suppose that you want to write data to a block device and make it
>>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>>> >> >
>>> >> > Now - how to implement these two bios on persistent memory:
>>> >> >
>>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>>> >> > wmb() - this is the optimal implementation.
>>> >> >
>>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>>> >> > does arch_wb_cache_pmem() on the affected range.
>>> >> >
>>> >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
>>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>>> >> > as memcpy() followed by a cache flush.
>>> >> >
>>> >> > Now - if you flush the cache immediatelly after memcpy, the cache is full
>>> >> > of dirty lines and the cache-flushing code has to write these lines back
>>> >> > and that is slow.
>>> >> >
>>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>>> >> > received), the processor already flushed some part of the cache on its
>>> >> > own, so the cache-flushing function has less work to do and it is faster.
>>> >> >
>>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>>> >>
>>> >> It sounds like ARM might be better off with mapping its pmem as
>>> >> write-through rather than write-back, and skip the explicit cache
>>> >
>>> > I doubt it would perform well - write combining combines the writes into a
>>> > larger segments - and write-through doesn't.
>>> >
>>>
>>> Last I checked write-through caching does not disable write combining
>>>
>>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>>> >> what would be more clear is fio measurements of the relative IOPs and
>>> >> latency profiles of the different approaches. The reason I am
>>> >> continuing to push here is that reducing the operation space from
>>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>>> >> maintenance long term.
>>> >
>>> > I measured it (with nvme backing store) and late cache flushing has 12%
>>> > better performance than eager flushing with memcpy_flushcache().
>>>
>>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>>> data so it becomes more efficient to do an amortized flush at the end?
>>> However, that effectively makes memcpy_flushcache() unusable in the
>>> way it can be used on x86. You claimed that ARM does not support
>>> non-temporal stores, but it does, see the STNP instruction. I do not
>>> want to see arch specific optimizations in drivers, so either
>>> write-through mappings is a potential answer to remove the need to
>>> explicitly manage flushing, or just implement STNP hacks in
>>> memcpy_flushcache() like you did with MOVNT on x86.
>>>
>>> > 131836 4k iops - vs - 117016.
>>>
>>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>>
>> I found out what caused the difference. I used dax_flush on the version of
>> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
>> it is the last version where dax on ramdisk works) - and I thought that
>> dax_flush flushes the cache, but it doesn't.
>>
>> When I replaced dax_flush with arch_wb_cache_pmem, the performance
>> difference between early flushing and late flushing disappeared.
>>
>> So I think we can remove this per-architecture switch from dm-writecache.
>
> Great find! Thanks for the due diligence. Feel free to add:
>
>     Acked-by: Dan Williams <dan.j.williams@intel.com>
>
> ...on the reworks to unify ARM and x86.

One more note. The side effect of not using dax_flush() is that you
may end up flushing caches on systems where the platform has asserted
it will take responsibility for flushing caches at power loss. If /
when those systems become more prevalent we may want to think of a way
to combine the non-temporal optimization and the cache-flush-bypass
optimizations. However that is something that can wait for a later
change beyond 4.18.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 31, 2018, 8:19 a.m. UTC | #30

On Wed, 30 May 2018, Dan Williams wrote:

> > Great find! Thanks for the due diligence. Feel free to add:
> >
> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ...on the reworks to unify ARM and x86.
> 
> One more note. The side effect of not using dax_flush() is that you
> may end up flushing caches on systems where the platform has asserted
> it will take responsibility for flushing caches at power loss. If /
> when those systems become more prevalent we may want to think of a way
> to combine the non-temporal optimization and the cache-flush-bypass
> optimizations. However that is something that can wait for a later
> change beyond 4.18.

We could define memcpy_flushpmem, that falls back to memcpy or 
memcpy_flushcache, depending on whether the platform flushes the caches at 
power loss or not.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 31, 2018, 2:51 p.m. UTC | #31

On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 30 May 2018, Dan Williams wrote:
>
>> > Great find! Thanks for the due diligence. Feel free to add:
>> >
>> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
>> >
>> > ...on the reworks to unify ARM and x86.
>>
>> One more note. The side effect of not using dax_flush() is that you
>> may end up flushing caches on systems where the platform has asserted
>> it will take responsibility for flushing caches at power loss. If /
>> when those systems become more prevalent we may want to think of a way
>> to combine the non-temporal optimization and the cache-flush-bypass
>> optimizations. However that is something that can wait for a later
>> change beyond 4.18.
>
> We could define memcpy_flushpmem, that falls back to memcpy or
> memcpy_flushcache, depending on whether the platform flushes the caches at
> power loss or not.

The problem is that some platforms only power fail protect a subset of
the physical address range, but yes, if the platform makes a global
assertion we can globally replace memcpy_flushpmem() with plain
memcpy.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka May 31, 2018, 3:31 p.m. UTC | #32

On Thu, 31 May 2018, Dan Williams wrote:

> On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Wed, 30 May 2018, Dan Williams wrote:
> >
> >> > Great find! Thanks for the due diligence. Feel free to add:
> >> >
> >> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
> >> >
> >> > ...on the reworks to unify ARM and x86.
> >>
> >> One more note. The side effect of not using dax_flush() is that you
> >> may end up flushing caches on systems where the platform has asserted
> >> it will take responsibility for flushing caches at power loss. If /
> >> when those systems become more prevalent we may want to think of a way
> >> to combine the non-temporal optimization and the cache-flush-bypass
> >> optimizations. However that is something that can wait for a later
> >> change beyond 4.18.
> >
> > We could define memcpy_flushpmem, that falls back to memcpy or
> > memcpy_flushcache, depending on whether the platform flushes the caches at
> > power loss or not.
> 
> The problem is that some platforms only power fail protect a subset of
> the physical address range,

How can this be? A psysical address may be cached on any CPU, so either 
there is enough power to flush all the CPUs' caches or there isn't.

How does the CPU design that protects only a part of physical addresses 
look like?

> but yes, if the platform makes a global
> assertion we can globally replace memcpy_flushpmem() with plain
> memcpy.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Dan Williams May 31, 2018, 4:39 p.m. UTC | #33

On Thu, May 31, 2018 at 8:31 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 31 May 2018, Dan Williams wrote:
>
>> On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Wed, 30 May 2018, Dan Williams wrote:
>> >
>> >> > Great find! Thanks for the due diligence. Feel free to add:
>> >> >
>> >> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
>> >> >
>> >> > ...on the reworks to unify ARM and x86.
>> >>
>> >> One more note. The side effect of not using dax_flush() is that you
>> >> may end up flushing caches on systems where the platform has asserted
>> >> it will take responsibility for flushing caches at power loss. If /
>> >> when those systems become more prevalent we may want to think of a way
>> >> to combine the non-temporal optimization and the cache-flush-bypass
>> >> optimizations. However that is something that can wait for a later
>> >> change beyond 4.18.
>> >
>> > We could define memcpy_flushpmem, that falls back to memcpy or
>> > memcpy_flushcache, depending on whether the platform flushes the caches at
>> > power loss or not.
>>
>> The problem is that some platforms only power fail protect a subset of
>> the physical address range,
>
> How can this be? A psysical address may be cached on any CPU, so either
> there is enough power to flush all the CPUs' caches or there isn't.
>
> How does the CPU design that protects only a part of physical addresses
> look like?

It's not necessarily a CPU problem, it may be a problem of having
enough stored energy to potentially software loops to flush caches.
There's also the consideration that a general purpose platform may mix
persistent memory technologies from different vendors where some might
be flash-backed DRAM and some might be persistent media directly.

For now I don't think we need to worry about it, but I don't want to
make the assumption that this property is platform global given the
history of how persistent memory has been deployed to date.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

[4/4] dm-writecache: use new API for flushing

Commit Message

Comments

Patch