[v10,12/12] drivers/block/pmem: Map NVDIMM with ioremap_wt()

Message ID 1432739944-22633-13-git-send-email-toshi.kani@hp.com (mailing list archive)
State Superseded

Commit Message

Toshi Kani May 27, 2015, 3:19 p.m. UTC
The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
write back the contents of the CPU caches in case of a crash.

This patch changes to use ioremap_wt(), which provides uncached
writes but cached reads, for improving read performance.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 drivers/block/pmem.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Borislav Petkov May 29, 2015, 9:11 a.m. UTC | #1
On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
> write back the contents of the CPU caches in case of a crash.
> 
> This patch changes to use ioremap_wt(), which provides uncached
> writes but cached reads, for improving read performance.
> 
> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> ---
>  drivers/block/pmem.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> index eabf4a8..095dfaa 100644
> --- a/drivers/block/pmem.c
> +++ b/drivers/block/pmem.c
> @@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
>  	}
>  
>  	/*
> -	 * Map the memory as non-cachable, as we can't write back the contents
> +	 * Map the memory as write-through, as we can't write back the contents
>  	 * of the CPU caches in case of a crash.
>  	 */
>  	err = -ENOMEM;
> -	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
> +	pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
>  	if (!pmem->virt_addr)
>  		goto out_release_region;

Dan, Ross, what about this one?

ACK to pick it up as a temporary solution?
Dan Williams May 29, 2015, 2:43 p.m. UTC | #2
On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
>> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
>> write back the contents of the CPU caches in case of a crash.
>>
>> This patch changes to use ioremap_wt(), which provides uncached
>> writes but cached reads, for improving read performance.
>>
>> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
>> ---
>>  drivers/block/pmem.c |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
>> index eabf4a8..095dfaa 100644
>> --- a/drivers/block/pmem.c
>> +++ b/drivers/block/pmem.c
>> @@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
>>       }
>>
>>       /*
>> -      * Map the memory as non-cachable, as we can't write back the contents
>> +      * Map the memory as write-through, as we can't write back the contents
>>        * of the CPU caches in case of a crash.
>>        */
>>       err = -ENOMEM;
>> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
>> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
>>       if (!pmem->virt_addr)
>>               goto out_release_region;
>
> Dan, Ross, what about this one?
>
> ACK to pick it up as a temporary solution?

I see that is_new_memtype_allowed() is updated to disallow some
combinations, but the manual seems to imply any mixing of memory types
is unsupported.  Which worries me even in the current code where we
have uncached mappings in the driver, and potentially cached DAX
mappings handed out to userspace.

A general quibble separate from this patch is that we don't have a way
of knowing if ioremap() will reject or change our requested memory
type.  Shouldn't the driver be explicitly requesting a known valid
type in advance?

Lastly we now have the PMEM API patches from Ross out for review where
he is assuming cached mappings with non-temporal writes:
https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
This gives us WC semantics on writes which I believe has the nice
property of reducing the number of write transactions to memory.
Also, the numbers in the paper seem to be assuming DAX operation, but
this ioremap_wt() is in the driver and typically behind a file system.
Are the numbers relevant to that usage mode?
Toshi Kani May 29, 2015, 3:03 p.m. UTC | #3
On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
> >> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
> >> write back the contents of the CPU caches in case of a crash.
> >>
> >> This patch changes to use ioremap_wt(), which provides uncached
> >> writes but cached reads, for improving read performance.
> >>
> >> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> >> ---
> >>  drivers/block/pmem.c |    4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> >> index eabf4a8..095dfaa 100644
> >> --- a/drivers/block/pmem.c
> >> +++ b/drivers/block/pmem.c
> >> @@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
> >>       }
> >>
> >>       /*
> >> -      * Map the memory as non-cachable, as we can't write back the contents
> >> +      * Map the memory as write-through, as we can't write back the contents
> >>        * of the CPU caches in case of a crash.
> >>        */
> >>       err = -ENOMEM;
> >> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
> >> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
> >>       if (!pmem->virt_addr)
> >>               goto out_release_region;
> >
> > Dan, Ross, what about this one?
> >
> > ACK to pick it up as a temporary solution?
> 
> I see that is_new_memtype_allowed() is updated to disallow some
> combinations, but the manual seems to imply any mixing of memory types
> is unsupported.  Which worries me even in the current code where we
> have uncached mappings in the driver, and potentially cached DAX
> mappings handed out to userspace.

is_new_memtype_allowed() is not to allow some combinations of mixing of
memory types.  When it is allowed, the requested type of ioremap_xxx()
is changed to match with the existing map type, so that mixing of memory
types does not happen.
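
For reference, the flow being described is roughly the following -- a
simplified paraphrase of the x86 __ioremap_caller() path of this era,
not verbatim kernel code:

	/* reserve_memtype() returns the type actually granted in new_pcm */
	retval = reserve_memtype(phys_addr, phys_addr + size, pcm, &new_pcm);
	if (retval)
		return NULL;			/* range conflict */

	if (pcm != new_pcm) {
		if (!is_new_memtype_allowed(phys_addr, size, pcm, new_pcm))
			goto err_free_memtype;	/* incompatible: reject */
		pcm = new_pcm;			/* compatible: adopt existing type */
	}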

DAX uses vm_insert_mixed(), which does not even check the existing map
type to the physical address.

> A general quibble separate from this patch is that we don't have a way
> of knowing if ioremap() will reject or change our requested memory
> type.  Shouldn't the driver be explicitly requesting a known valid
> type in advance?

I agree we need a solution here.

> Lastly we now have the PMEM API patches from Ross out for review where
> he is assuming cached mappings with non-temporal writes:
> https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
> This gives us WC semantics on writes which I believe has the nice
> property of reducing the number of write transactions to memory.
> Also, the numbers in the paper seem to be assuming DAX operation, but
> this ioremap_wt() is in the driver and typically behind a file system.
> Are the numbers relevant to that usage mode?

I have not looked into Ross's changes yet, but they do not seem to
replace the use of ioremap_nocache().  If his changes can use WB type
reliably, yes, we do not need a temporary solution of using ioremap_wt()
in this driver.

Thanks,
-Toshi
Dan Williams May 29, 2015, 6:19 p.m. UTC | #4
On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
>> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
>> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
>> >> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
>> >> write back the contents of the CPU caches in case of a crash.
>> >>
>> >> This patch changes to use ioremap_wt(), which provides uncached
>> >> writes but cached reads, for improving read performance.
>> >>
>> >> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
>> >> ---
>> >>  drivers/block/pmem.c |    4 ++--
>> >>  1 file changed, 2 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
>> >> index eabf4a8..095dfaa 100644
>> >> --- a/drivers/block/pmem.c
>> >> +++ b/drivers/block/pmem.c
>> >> @@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
>> >>       }
>> >>
>> >>       /*
>> >> -      * Map the memory as non-cachable, as we can't write back the contents
>> >> +      * Map the memory as write-through, as we can't write back the contents
>> >>        * of the CPU caches in case of a crash.
>> >>        */
>> >>       err = -ENOMEM;
>> >> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
>> >> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
>> >>       if (!pmem->virt_addr)
>> >>               goto out_release_region;
>> >
>> > Dan, Ross, what about this one?
>> >
>> > ACK to pick it up as a temporary solution?
>>
>> I see that is_new_memtype_allowed() is updated to disallow some
>> combinations, but the manual seems to imply any mixing of memory types
>> is unsupported.  Which worries me even in the current code where we
>> have uncached mappings in the driver, and potentially cached DAX
>> mappings handed out to userspace.
>
> is_new_memtype_allowed() is not to allow some combinations of mixing of
> memory types.  When it is allowed, the requested type of ioremap_xxx()
> is changed to match with the existing map type, so that mixing of memory
> types does not happen.

Yes, but now if the caller was expecting one memory type and gets
another one that is something I think the driver would want to know.
At a minimum I don't think we want to get emails about pmem driver
performance problems when someone's platform is silently degrading WB
to UC for example.
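
A minimal sketch of the kind of feedback that is missing; the
ioremap_effective_type() helper below is hypothetical (no such
interface existed), it only illustrates what a driver would want
to ask:

	pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
	if (pmem->virt_addr &&
	    ioremap_effective_type(pmem->virt_addr) != _PAGE_CACHE_MODE_WT)
		/* hypothetical: warn instead of silently running degraded */
		dev_warn(dev, "pmem: WT mapping request was degraded\n");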

> DAX uses vm_insert_mixed(), which does not even check the existing map
> type to the physical address.

Right, I think that's a problem...

>> A general quibble separate from this patch is that we don't have a way
>> of knowing if ioremap() will reject or change our requested memory
>> type.  Shouldn't the driver be explicitly requesting a known valid
>> type in advance?
>
> I agree we need a solution here.
>
>> Lastly we now have the PMEM API patches from Ross out for review where
>> he is assuming cached mappings with non-temporal writes:
>> https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
>> This gives us WC semantics on writes which I believe has the nice
>> property of reducing the number of write transactions to memory.
>> Also, the numbers in the paper seem to be assuming DAX operation, but
>> this ioremap_wt() is in the driver and typically behind a file system.
>> Are the numbers relevant to that usage mode?
>
> I have not looked into Ross's changes yet, but they do not seem to
> replace the use of ioremap_nocache().  If his changes can use WB type
> reliably, yes, we do not need a temporary solution of using ioremap_wt()
> in this driver.

Hmm, yes you're right, it seems those patches did not change the
implementation to use ioremap_cache()... which happens to not be
implemented on all architectures.  I'll take a look.
Toshi Kani May 29, 2015, 6:32 p.m. UTC | #5
On Fri, 2015-05-29 at 11:19 -0700, Dan Williams wrote:
> On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
> >> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
> >> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
> >> >> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
 :
> >> >> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
> >> >> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
> >> >>       if (!pmem->virt_addr)
> >> >>               goto out_release_region;
> >> >
> >> > Dan, Ross, what about this one?
> >> >
> >> > ACK to pick it up as a temporary solution?
> >>
> >> I see that is_new_memtype_allowed() is updated to disallow some
> >> combinations, but the manual seems to imply any mixing of memory types
> >> is unsupported.  Which worries me even in the current code where we
> >> have uncached mappings in the driver, and potentially cached DAX
> >> mappings handed out to userspace.
> >
> > is_new_memtype_allowed() is not to allow some combinations of mixing of
> > memory types.  When it is allowed, the requested type of ioremap_xxx()
> > is changed to match with the existing map type, so that mixing of memory
> > types does not happen.
> 
> Yes, but now if the caller was expecting one memory type and gets
> another one that is something I think the driver would want to know.
> At a minimum I don't think we want to get emails about pmem driver
> performance problems when someone's platform is silently degrading WB
> to UC for example.

The pmem driver creates an ioremap map to an NVDIMM range first.  So,
there will be no conflict at this point, unless there is a conflicting
driver claiming the same NVDIMM range.

DAX then uses the pmem driver (or other byte-addressable driver) to
mount a file system and creates a separate user-space mapping for
mmap().  So, a (silent) map-type conflict will happen at this point,
which may not be protected by the ioremap itself.

> > DAX uses vm_insert_mixed(), which does not even check the existing map
> > type to the physical address.
> 
> Right, I think that's a problem...
> 
> >> A general quibble separate from this patch is that we don't have a way
> >> of knowing if ioremap() will reject or change our requested memory
> >> type.  Shouldn't the driver be explicitly requesting a known valid
> >> type in advance?
> >
> > I agree we need a solution here.
> >
> >> Lastly we now have the PMEM API patches from Ross out for review where
> >> he is assuming cached mappings with non-temporal writes:
> >> https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
> >> This gives us WC semantics on writes which I believe has the nice
> >> property of reducing the number of write transactions to memory.
> >> Also, the numbers in the paper seem to be assuming DAX operation, but
> >> this ioremap_wt() is in the driver and typically behind a file system.
> >> Are the numbers relevant to that usage mode?
> >
> > I have not looked into Ross's changes yet, but they do not seem to
> > replace the use of ioremap_nocache().  If his changes can use WB type
> > reliably, yes, we do not need a temporary solution of using ioremap_wt()
> > in this driver.
> 
> Hmm, yes you're right, it seems those patches did not change the
> implementation to use ioremap_cache()... which happens to not be
> implemented on all architectures.  I'll take a look.

Thanks,
-Toshi
Andy Lutomirski May 29, 2015, 6:34 p.m. UTC | #6
On Fri, May 29, 2015 at 11:19 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
>> On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
>>> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
>>> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
>>> >> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
>>> >> write back the contents of the CPU caches in case of a crash.
>>> >>
>>> >> This patch changes to use ioremap_wt(), which provides uncached
>>> >> writes but cached reads, for improving read performance.
>>> >>
>>> >> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
>>> >> ---
>>> >>  drivers/block/pmem.c |    4 ++--
>>> >>  1 file changed, 2 insertions(+), 2 deletions(-)
>>> >>
>>> >> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
>>> >> index eabf4a8..095dfaa 100644
>>> >> --- a/drivers/block/pmem.c
>>> >> +++ b/drivers/block/pmem.c
>>> >> @@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
>>> >>       }
>>> >>
>>> >>       /*
>>> >> -      * Map the memory as non-cachable, as we can't write back the contents
>>> >> +      * Map the memory as write-through, as we can't write back the contents
>>> >>        * of the CPU caches in case of a crash.
>>> >>        */
>>> >>       err = -ENOMEM;
>>> >> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
>>> >> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
>>> >>       if (!pmem->virt_addr)
>>> >>               goto out_release_region;
>>> >
>>> > Dan, Ross, what about this one?
>>> >
>>> > ACK to pick it up as a temporary solution?
>>>
>>> I see that is_new_memtype_allowed() is updated to disallow some
>>> combinations, but the manual seems to imply any mixing of memory types
>>> is unsupported.  Which worries me even in the current code where we
>>> have uncached mappings in the driver, and potentially cached DAX
>>> mappings handed out to userspace.
>>
>> is_new_memtype_allowed() is not to allow some combinations of mixing of
>> memory types.  When it is allowed, the requested type of ioremap_xxx()
>> is changed to match with the existing map type, so that mixing of memory
>> types does not happen.
>
> Yes, but now if the caller was expecting one memory type and gets
> another one that is something I think the driver would want to know.
> At a minimum I don't think we want to get emails about pmem driver
> performance problems when someone's platform is silently degrading WB
> to UC for example.
>
>> DAX uses vm_insert_mixed(), which does not even check the existing map
>> type to the physical address.
>
> Right, I think that's a problem...
>
>>> A general quibble separate from this patch is that we don't have a way
>>> of knowing if ioremap() will reject or change our requested memory
>>> type.  Shouldn't the driver be explicitly requesting a known valid
>>> type in advance?
>>
>> I agree we need a solution here.
>>
>>> Lastly we now have the PMEM API patches from Ross out for review where
>>> he is assuming cached mappings with non-temporal writes:
>>> https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
>>> This gives us WC semantics on writes which I believe has the nice
>>> property of reducing the number of write transactions to memory.
>>> Also, the numbers in the paper seem to be assuming DAX operation, but
>>> this ioremap_wt() is in the driver and typically behind a file system.
>>> Are the numbers relevant to that usage mode?
>>
>> I have not looked into Ross's changes yet, but they do not seem to
>> replace the use of ioremap_nocache().  If his changes can use WB type
>> reliably, yes, we do not need a temporary solution of using ioremap_wt()
>> in this driver.
>
> Hmm, yes you're right, it seems those patches did not change the
> implementation to use ioremap_cache()... which happens to not be
> implemented on all architectures.  I'll take a look.

Whoa, there!  Why would we use non-temporal stores to WB memory to
access persistent memory?  I can see two reasons not to:

1. As far as I understand it, non-temporal stores to WT should have
almost identical performance.

2. Is there any actual architectural guarantee that it's safe to have
a WB mapping that we use like that?  By my reading of the manual,
MOVNTDQA (as a write to pmem); SFENCE; PCOMMIT; SFENCE on uncached
memory should be guaranteed to do a durable write.  On the other hand,
it's considerably less clear to me that the same sequence to WB memory
is safe -- aren't we supposed to stick a CLWB or CLFLUSHOPT in there,
too, on WB memory?  In other words, is there any case in which
MOVNTDQA or similar acting on a WB mapping could result in a dirty
cache line?
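
Concretely, the sequence in question would look something like this (a
sketch only; PCOMMIT had no compiler intrinsic, so it is emitted as raw
opcode bytes, the same trick the kernel's own PCOMMIT support used):

	#include <emmintrin.h>	/* _mm_stream_si128(), _mm_sfence() */

	/* PCOMMIT is 66 0F AE F8; emit the bytes directly */
	static inline void pcommit(void)
	{
		asm volatile(".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory");
	}

	/* the durable-write sequence under discussion */
	static void durable_store(__m128i *pmem_dst, __m128i val)
	{
		_mm_stream_si128(pmem_dst, val);	/* non-temporal store */
		_mm_sfence();	/* order the NT store before PCOMMIT */
		pcommit();	/* commit accepted stores to persistence */
		_mm_sfence();	/* wait for PCOMMIT to complete */
	}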

--Andy
Dan Williams May 29, 2015, 7:32 p.m. UTC | #7
On Fri, May 29, 2015 at 11:34 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, May 29, 2015 at 11:19 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
>>> On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
>>>> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
>>>> >> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
>>>> >> write back the contents of the CPU caches in case of a crash.
>>>> >>
>>>> >> This patch changes to use ioremap_wt(), which provides uncached
>>>> >> writes but cached reads, for improving read performance.
>>>> >>
>>>> >> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
>>>> >> ---
>>>> >>  drivers/block/pmem.c |    4 ++--
>>>> >>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>> >>
>>>> >> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
>>>> >> index eabf4a8..095dfaa 100644
>>>> >> --- a/drivers/block/pmem.c
>>>> >> +++ b/drivers/block/pmem.c
>>>> >> @@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
>>>> >>       }
>>>> >>
>>>> >>       /*
>>>> >> -      * Map the memory as non-cachable, as we can't write back the contents
>>>> >> +      * Map the memory as write-through, as we can't write back the contents
>>>> >>        * of the CPU caches in case of a crash.
>>>> >>        */
>>>> >>       err = -ENOMEM;
>>>> >> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
>>>> >> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
>>>> >>       if (!pmem->virt_addr)
>>>> >>               goto out_release_region;
>>>> >
>>>> > Dan, Ross, what about this one?
>>>> >
>>>> > ACK to pick it up as a temporary solution?
>>>>
>>>> I see that is_new_memtype_allowed() is updated to disallow some
>>>> combinations, but the manual seems to imply any mixing of memory types
>>>> is unsupported.  Which worries me even in the current code where we
>>>> have uncached mappings in the driver, and potentially cached DAX
>>>> mappings handed out to userspace.
>>>
>>> is_new_memtype_allowed() is not to allow some combinations of mixing of
>>> memory types.  When it is allowed, the requested type of ioremap_xxx()
>>> is changed to match with the existing map type, so that mixing of memory
>>> types does not happen.
>>
>> Yes, but now if the caller was expecting one memory type and gets
>> another one that is something I think the driver would want to know.
>> At a minimum I don't think we want to get emails about pmem driver
>> performance problems when someone's platform is silently degrading WB
>> to UC for example.
>>
>>> DAX uses vm_insert_mixed(), which does not even check the existing map
>>> type to the physical address.
>>
>> Right, I think that's a problem...
>>
>>>> A general quibble separate from this patch is that we don't have a way
>>>> of knowing if ioremap() will reject or change our requested memory
>>>> type.  Shouldn't the driver be explicitly requesting a known valid
>>>> type in advance?
>>>
>>> I agree we need a solution here.
>>>
>>>> Lastly we now have the PMEM API patches from Ross out for review where
>>>> he is assuming cached mappings with non-temporal writes:
>>>> https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
>>>> This gives us WC semantics on writes which I believe has the nice
>>>> property of reducing the number of write transactions to memory.
>>>> Also, the numbers in the paper seem to be assuming DAX operation, but
>>>> this ioremap_wt() is in the driver and typically behind a file system.
>>>> Are the numbers relevant to that usage mode?
>>>
>>> I have not looked into Ross's changes yet, but they do not seem to
>>> replace the use of ioremap_nocache().  If his changes can use WB type
>>> reliably, yes, we do not need a temporary solution of using ioremap_wt()
>>> in this driver.
>>
>> Hmm, yes you're right, it seems those patches did not change the
>> implementation to use ioremap_cache()... which happens to not be
>> implemented on all architectures.  I'll take a look.
>
> Whoa, there!  Why would we use non-temporal stores to WB memory to
> access persistent memory?  I can see two reasons not to:
>
> 1. As far as I understand it, non-temporal stores to WT should have
> almost identical performance.
>
> 2. Is there any actual architectural guarantee that it's safe to have
> a WB mapping that we use like that?  By my reading of the manual,
> MOVNTDQA (as a write to pmem); SFENCE; PCOMMIT; SFENCE on uncached
> memory should be guaranteed to do a durable write.  On the other hand,
> it's considerably less clear to me that the same sequence to WB memory
> is safe -- aren't we supposed to stick a CLWB or CLFLUSHOPT in there,
> too, on WB memory?  In other words, is there any case in which
> MOVNTDQA or similar acting on a WB mapping could result in a dirty
> cache line?

Depends, see the note in 10.4.6.2, "Some older CPU implementations
(e.g., Pentium M) allowed addresses being written with a non-temporal
store instruction to be updated in-place if the memory type was not WC
and line was already in the cache."  The expectation is that
boot_cpu_has(X86_FEATURE_PCOMMIT) is false on such a CPU, so we'll
fall back to not using non-temporal stores.
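
In code terms, that expectation amounts to something like the
following; boot_cpu_has() and the feature flag are real, the two copy
helpers are hypothetical stand-ins:

	if (boot_cpu_has(X86_FEATURE_PCOMMIT)) {
		/* NT stores + SFENCE + PCOMMIT are safe on this CPU */
		memcpy_to_pmem_nt(dst, src, len);
	} else {
		/* hypothetical fallback path: no non-temporal stores */
		memcpy_to_pmem_cached(dst, src, len);
	}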
Dan Williams May 29, 2015, 7:34 p.m. UTC | #8
On Fri, May 29, 2015 at 11:32 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> On Fri, 2015-05-29 at 11:19 -0700, Dan Williams wrote:
>> On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
>> > On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
>> >> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
>> >> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
>> >> >> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
>  :
>> >> >> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
>> >> >> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
>> >> >>       if (!pmem->virt_addr)
>> >> >>               goto out_release_region;
>> >> >
>> >> > Dan, Ross, what about this one?
>> >> >
>> >> > ACK to pick it up as a temporary solution?
>> >>
>> >> I see that is_new_memtype_allowed() is updated to disallow some
>> >> combinations, but the manual seems to imply any mixing of memory types
>> >> is unsupported.  Which worries me even in the current code where we
>> >> have uncached mappings in the driver, and potentially cached DAX
>> >> mappings handed out to userspace.
>> >
>> > is_new_memtype_allowed() is not to allow some combinations of mixing of
>> > memory types.  When it is allowed, the requested type of ioremap_xxx()
>> > is changed to match with the existing map type, so that mixing of memory
>> > types does not happen.
>>
>> Yes, but now if the caller was expecting one memory type and gets
>> another one that is something I think the driver would want to know.
>> At a minimum I don't think we want to get emails about pmem driver
>> performance problems when someone's platform is silently degrading WB
>> to UC for example.
>
> The pmem driver creates an ioremap map to an NVDIMM range first.  So,
> there will be no conflict at this point, unless there is a conflicting
> driver claiming the same NVDIMM range.

Hmm, I thought it would be WB due to this comment in is_new_memtype_allowed()

        /*
         * PAT type is always WB for untracked ranges, so no need to check.
         */
Toshi Kani May 29, 2015, 8:10 p.m. UTC | #9
On Fri, 2015-05-29 at 12:34 -0700, Dan Williams wrote:
> On Fri, May 29, 2015 at 11:32 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > On Fri, 2015-05-29 at 11:19 -0700, Dan Williams wrote:
> >> On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> >> > On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
> >> >> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
> >> >> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
> >> >> >> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
> >  :
> >> >> >> -     pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
> >> >> >> +     pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
> >> >> >>       if (!pmem->virt_addr)
> >> >> >>               goto out_release_region;
> >> >> >
> >> >> > Dan, Ross, what about this one?
> >> >> >
> >> >> > ACK to pick it up as a temporary solution?
> >> >>
> >> >> I see that is_new_memtype_allowed() is updated to disallow some
> >> >> combinations, but the manual seems to imply any mixing of memory types
> >> >> is unsupported.  Which worries me even in the current code where we
> >> >> have uncached mappings in the driver, and potentially cached DAX
> >> >> mappings handed out to userspace.
> >> >
> >> > is_new_memtype_allowed() is not to allow some combinations of mixing of
> >> > memory types.  When it is allowed, the requested type of ioremap_xxx()
> >> > is changed to match with the existing map type, so that mixing of memory
> >> > types does not happen.
> >>
> >> Yes, but now if the caller was expecting one memory type and gets
> >> another one that is something I think the driver would want to know.
> >> At a minimum I don't think we want to get emails about pmem driver
> >> performance problems when someone's platform is silently degrading WB
> >> to UC for example.
> >
> > The pmem driver creates an ioremap map to an NVDIMM range first.  So,
> > there will be no conflict at this point, unless there is a conflicting
> > driver claiming the same NVDIMM range.
> 
> Hmm, I thought it would be WB due to this comment in is_new_memtype_allowed()
> 
>         /*
>          * PAT type is always WB for untracked ranges, so no need to check.
>          */

This comment applies to the ISA range, for which ioremap() does not
create any tracking entries, i.e. it is untracked.  You can ignore this
comment for NVDIMM.

Thanks,
-Toshi
Elliott, Robert (Server Storage) May 29, 2015, 9:29 p.m. UTC | #10
> -----Original Message-----
> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Friday, May 29, 2015 1:35 PM
...
> Whoa, there!  Why would we use non-temporal stores to WB memory to
> access persistent memory?  I can see two reasons not to:

Data written to a block storage device (here, the NVDIMM) is unlikely
to be read or written again any time soon.  It's not like the code
and data that a program has in memory, where there might be a loop
accessing the location every CPU clock; it's storage I/O to
historically very slow (relative to the CPU clock speed) devices.  
The source buffer for that data might be frequently accessed, 
but not the NVDIMM storage itself.  

Non-temporal stores avoid wasting cache space on these "one-time" 
accesses.  The same applies for reads and non-temporal loads.
Keep the CPU data cache lines free for the application.

DAX and mmap() do change that; the application is now free to
store frequently accessed data structures directly in persistent 
memory.  But, that's not available if btt is used, and 
application loads and stores won't go through the memcpy()
calls inside pmem anyway.  The non-temporal instructions are
cache coherent, so data integrity won't get confused by them
if I/O going through pmem's block storage APIs happens
to overlap with the application's mmap() regions.
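
As a sketch of what such a copy looks like (illustrative only, not the
pmem driver's actual routine; assumes a 16-byte-aligned destination and
a multiple-of-16 length):

	#include <stddef.h>
	#include <emmintrin.h>

	static void copy_to_storage_nt(void *dst, const void *src, size_t len)
	{
		__m128i *d = dst;
		const __m128i *s = src;

		while (len >= 16) {
			/* MOVNTDQ: store past the cache, keep lines for the app */
			_mm_stream_si128(d++, _mm_loadu_si128(s++));
			len -= 16;
		}
		_mm_sfence();	/* make the non-temporal stores ordered/visible */
	}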

---
Robert Elliott, HP Server Storage
Andy Lutomirski May 29, 2015, 9:46 p.m. UTC | #11
On Fri, May 29, 2015 at 2:29 PM, Elliott, Robert (Server Storage)
<Elliott@hp.com> wrote:
>> -----Original Message-----
>> From: Andy Lutomirski [mailto:luto@amacapital.net]
>> Sent: Friday, May 29, 2015 1:35 PM
> ...
>> Whoa, there!  Why would we use non-temporal stores to WB memory to
>> access persistent memory?  I can see two reasons not to:
>
> Data written to a block storage device (here, the NVDIMM) is unlikely
> to be read or written again any time soon.  It's not like the code
> and data that a program has in memory, where there might be a loop
> accessing the location every CPU clock; it's storage I/O to
> historically very slow (relative to the CPU clock speed) devices.
> The source buffer for that data might be frequently accessed,
> but not the NVDIMM storage itself.
>
> Non-temporal stores avoid wasting cache space on these "one-time"
> accesses.  The same applies for reads and non-temporal loads.
> Keep the CPU data cache lines free for the application.
>
> DAX and mmap() do change that; the application is now free to
> store frequently accessed data structures directly in persistent
> memory.  But, that's not available if btt is used, and
> application loads and stores won't go through the memcpy()
> calls inside pmem anyway.  The non-temporal instructions are
> cache coherent, so data integrity won't get confused by them
> if I/O going through pmem's block storage APIs happens
> to overlap with the application's mmap() regions.
>

You answered the wrong question. :)  I understand the point of the
non-temporal stores -- I don't understand the point of using
non-temporal stores to *WB memory*.  I think we should be okay with
having the kernel mapping use WT instead.

--Andy
Elliott, Robert (Server Storage) May 29, 2015, 10:24 p.m. UTC | #12
---
Robert Elliott, HP Server Storage

> -----Original Message-----
> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Friday, May 29, 2015 4:46 PM
> To: Elliott, Robert (Server Storage)
> Cc: Dan Williams; Kani, Toshimitsu; Borislav Petkov; Ross Zwisler;
> H. Peter Anvin; Thomas Gleixner; Ingo Molnar; Andrew Morton; Arnd
> Bergmann; linux-mm@kvack.org; linux-kernel@vger.kernel.org; X86 ML;
> linux-nvdimm@lists.01.org; Juergen Gross; Stefan Bader; Henrique de
> Moraes Holschuh; Yigal Korman; Konrad Rzeszutek Wilk; Luis
> Rodriguez; Christoph Hellwig; Matthew Wilcox
> Subject: Re: [PATCH v10 12/12] drivers/block/pmem: Map NVDIMM with
> ioremap_wt()
> 
> On Fri, May 29, 2015 at 2:29 PM, Elliott, Robert (Server Storage)
> <Elliott@hp.com> wrote:
> >> -----Original Message-----
> >> From: Andy Lutomirski [mailto:luto@amacapital.net]
> >> Sent: Friday, May 29, 2015 1:35 PM
> > ...
> >> Whoa, there!  Why would we use non-temporal stores to WB memory
> to
> >> access persistent memory?  I can see two reasons not to:
> >
> > Data written to a block storage device (here, the NVDIMM) is
> unlikely
> > to be read or written again any time soon.  It's not like the code
> > and data that a program has in memory, where there might be a loop
> > accessing the location every CPU clock; it's storage I/O to
> > historically very slow (relative to the CPU clock speed) devices.
> > The source buffer for that data might be frequently accessed,
> > but not the NVDIMM storage itself.
> >
> > Non-temporal stores avoid wasting cache space on these "one-time"
> > accesses.  The same applies for reads and non-temporal loads.
> > Keep the CPU data cache lines free for the application.
> >
> > DAX and mmap() do change that; the application is now free to
> > store frequently accessed data structures directly in persistent
> > memory.  But, that's not available if btt is used, and
> > application loads and stores won't go through the memcpy()
> > calls inside pmem anyway.  The non-temporal instructions are
> > cache coherent, so data integrity won't get confused by them
> > if I/O going through pmem's block storage APIs happens
> > to overlap with the application's mmap() regions.
> >
> 
> You answered the wrong question. :)  I understand the point of the
> non-temporal stores -- I don't understand the point of using
> non-temporal stores to *WB memory*.  I think we should be okay with
> having the kernel mapping use WT instead.

The cache type that the application chooses for its mmap()
view has to be compatible with that already selected by the 
kernel, or we run into:

Intel SDM 11.12.4 Programming the PAT
...
"The PAT allows any memory type to be specified in the page tables,
and therefore it is possible to have a single physical page mapped
to two or more different linear addresses, each with different
memory types. Intel does not support this practice because it may
lead to undefined operations that can result in a system failure. 
In particular, a WC page must never be aliased to a cacheable page
because WC writes may not check the processor caches."

Right now, application memory is always WB, so WB is the only safe
choice from this perspective (the system must have ADR for safety from
other perspectives).  That might not be the best choice for all
applications, though; some applications might not want the CPU caching
all the data they run through here and would prefer WC.  On a non-ADR
system, WT might be the only safe choice.

Should there be a way for the application to specify a cache
type in its mmap() call? The type already selected by the
kernel driver could (carefully) be changed on the fly if 
it's different.
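
For illustration, such an interface might look like this; MAP_CACHE_WC
is a hypothetical flag, named here only to make the idea concrete:

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_CACHE_WC, pmem_fd, 0);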

Non-temporal store performance is excellent under WB, WC, and WT;
if anything, I think WC edges ahead because it need not snoop
the cache. It's still poor under UC.
H. Peter Anvin May 29, 2015, 10:32 p.m. UTC | #13
Non-temporal stores to WB memory are fine in the sense that they don't pollute the cache.  This can be done by demoting to WC or by forcing cache allocation out of only a subset of the cache.

On May 29, 2015 2:46:19 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>On Fri, May 29, 2015 at 2:29 PM, Elliott, Robert (Server Storage)
><Elliott@hp.com> wrote:
>>> -----Original Message-----
>>> From: Andy Lutomirski [mailto:luto@amacapital.net]
>>> Sent: Friday, May 29, 2015 1:35 PM
>> ...
>>> Whoa, there!  Why would we use non-temporal stores to WB memory to
>>> access persistent memory?  I can see two reasons not to:
>>
>> Data written to a block storage device (here, the NVDIMM) is unlikely
>> to be read or written again any time soon.  It's not like the code
>> and data that a program has in memory, where there might be a loop
>> accessing the location every CPU clock; it's storage I/O to
>> historically very slow (relative to the CPU clock speed) devices.
>> The source buffer for that data might be frequently accessed,
>> but not the NVDIMM storage itself.
>>
>> Non-temporal stores avoid wasting cache space on these "one-time"
>> accesses.  The same applies for reads and non-temporal loads.
>> Keep the CPU data cache lines free for the application.
>>
>> DAX and mmap() do change that; the application is now free to
>> store frequently accessed data structures directly in persistent
>> memory.  But, that's not available if btt is used, and
>> application loads and stores won't go through the memcpy()
>> calls inside pmem anyway.  The non-temporal instructions are
>> cache coherent, so data integrity won't get confused by them
>> if I/O going through pmem's block storage APIs happens
>> to overlap with the application's mmap() regions.
>>
>
>You answered the wrong question. :)  I understand the point of the
>non-temporal stores -- I don't understand the point of using
>non-temporal stores to *WB memory*.  I think we should be okay with
>having the kernel mapping use WT instead.
>
>--Andy
Ingo Molnar June 1, 2015, 8:58 a.m. UTC | #14
* Andy Lutomirski <luto@amacapital.net> wrote:

> You answered the wrong question. :) I understand the point of the non-temporal 
> stores -- I don't understand the point of using non-temporal stores to *WB 
> memory*.  I think we should be okay with having the kernel mapping use WT 
> instead.

WT memory is write-through, but it is still fully cached for reads.

So non-temporal instructions influence how the CPU will allocate (or not allocate) 
WT cache lines.

Thanks,

	Ingo
Andy Lutomirski June 1, 2015, 5:10 p.m. UTC | #15
On Mon, Jun 1, 2015 at 1:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> You answered the wrong question. :) I understand the point of the non-temporal
>> stores -- I don't understand the point of using non-temporal stores to *WB
>> memory*.  I think we should be okay with having the kernel mapping use WT
>> instead.
>
> WT memory is write-through, but it is still fully cached for reads.
>
> So non-temporal instructions influence how the CPU will allocate (or not allocate)
> WT cache lines.
>

I'm doing a terrible job of saying what I mean.

Given that we're using non-temporal writes, the kernel code should
work correctly and with similar performance regardless of whether the
mapping is WB or WT.  It would still be correct, if slower, with WC or
UC, and, if we used explicit streaming reads, even that would matter
less.

I think this means that we are free to switch the kernel mapping
between WB and WT as needed to improve DAX behavior.  We could even
plausibly do it at runtime.

--Andy

> Thanks,
>
>         Ingo

Patch

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index eabf4a8..095dfaa 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -139,11 +139,11 @@  static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 	}
 
 	/*
-	 * Map the memory as non-cachable, as we can't write back the contents
+	 * Map the memory as write-through, as we can't write back the contents
 	 * of the CPU caches in case of a crash.
 	 */
 	err = -ENOMEM;
-	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
 	if (!pmem->virt_addr)
 		goto out_release_region;