
libnvdimm, region: sysfs trigger for nvdimm_flush()

Message ID 149281853758.22910.2919981036906495309.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dan Williams April 21, 2017, 11:48 p.m. UTC
The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
request. However, when a namespace is in device-dax mode, or namespaces
are disabled, userspace needs another path.

The new 'flush' attribute is visible when it can be determined whether
the interleave set does or does not have DIMMs that expose WPQ-flush
addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
"1" and flushes the DIMMs, or returns "0" if the flush operation is a
platform nop.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/region_devs.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)
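
As a usage sketch, hypothetical userspace code that reads the attribute
(the region name "region0" is an assumption; per the description above,
a read both reports the capability and triggers the flush):

#include <stdio.h>

int main(void)
{
	char buf[8];
	/* hypothetical region name; the actual region number varies */
	FILE *f = fopen("/sys/bus/nd/devices/region0/flush", "r");

	if (!f) {
		/* the attribute is hidden when WPQ support is indeterminate */
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf(buf[0] == '1' ? "flushed\n" : "platform nop\n");
	fclose(f);
	return 0;
}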

Comments

Dan Williams April 24, 2017, 7:04 a.m. UTC | #1
On Sun, Apr 23, 2017 at 10:31 PM, Masayoshi Mizuma
<m.mizuma@jp.fujitsu.com> wrote:
> On Fri, 21 Apr 2017 16:48:57 -0700 Dan Williams wrote:
>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>> platform WPQ (write-pending-queue) buffers when power is removed. The
>> nvdimm_flush() mechanism performs that same function on-demand.
>>
>> When a pmem namespace is associated with a block device, an
>> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
>> request. However, when a namespace is in device-dax mode, or namespaces
>> are disabled, userspace needs another path.
>>
>> The new 'flush' attribute is visible when it can be determined whether
>> the interleave set does or does not have DIMMs that expose WPQ-flush
>> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
>> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
>> platform nop.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/nvdimm/region_devs.c |   17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
>> index 8de5a04644a1..3495b4c23941 100644
>> --- a/drivers/nvdimm/region_devs.c
>> +++ b/drivers/nvdimm/region_devs.c
>> @@ -255,6 +255,19 @@ static ssize_t size_show(struct device *dev,
>>  }
>>  static DEVICE_ATTR_RO(size);
>>
>> +static ssize_t flush_show(struct device *dev,
>> +             struct device_attribute *attr, char *buf)
>> +{
>> +     struct nd_region *nd_region = to_nd_region(dev);
>> +
>> +     if (nvdimm_has_flush(nd_region)) {
>
> nvdimm_has_flush() can also return -ENXIO, so
>
> if (nvdimm_has_flush(nd_region) == 1)

If it returns -ENXIO then region_visible() will hide the attribute.

>
>> +             nvdimm_flush(nd_region);
>> +             return sprintf(buf, "1\n");
>> +     }
>> +     return sprintf(buf, "0\n");
>> +}
>> +static DEVICE_ATTR_RO(flush);
>> +
>
> I think separating show and store is better because
> users may want to only check whether the device has the flush capability.

Makes sense, I'll separate. Thanks for the review.
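
For illustration, a minimal sketch of the separated interface agreed to
above (hypothetical; it assumes the kernel's strtobool() helper, and the
actual follow-up patch may differ in names and details):

static ssize_t flush_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct nd_region *nd_region = to_nd_region(dev);

	/* report the WPQ-flush capability; reads have no side effects */
	return sprintf(buf, "%d\n", nvdimm_has_flush(nd_region));
}

static ssize_t flush_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t len)
{
	struct nd_region *nd_region = to_nd_region(dev);
	bool flush;
	int rc = strtobool(buf, &flush);

	if (rc)
		return rc;
	if (!flush)
		return -EINVAL;
	/* an explicit write of "1" (or "y") triggers the flush */
	nvdimm_flush(nd_region);
	return len;
}
static DEVICE_ATTR_RW(flush);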
Jeff Moyer April 24, 2017, 4:26 p.m. UTC | #2
Dan Williams <dan.j.williams@intel.com> writes:

> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
> platform WPQ (write-pending-queue) buffers when power is removed. The
> nvdimm_flush() mechanism performs that same function on-demand.
>
> When a pmem namespace is associated with a block device, an
> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
> request. However, when a namespace is in device-dax mode, or namespaces
> are disabled, userspace needs another path.
>
> The new 'flush' attribute is visible when it can be determined whether
> the interleave set does or does not have DIMMs that expose WPQ-flush
> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
> platform nop.
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

NACK.  This should function the same way it does for a pmem device.
Wire up sync.

-Jeff

> ---
>  drivers/nvdimm/region_devs.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 8de5a04644a1..3495b4c23941 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -255,6 +255,19 @@ static ssize_t size_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(size);
>  
> +static ssize_t flush_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	struct nd_region *nd_region = to_nd_region(dev);
> +
> +	if (nvdimm_has_flush(nd_region)) {
> +		nvdimm_flush(nd_region);
> +		return sprintf(buf, "1\n");
> +	}
> +	return sprintf(buf, "0\n");
> +}
> +static DEVICE_ATTR_RO(flush);
> +
>  static ssize_t mappings_show(struct device *dev,
>  		struct device_attribute *attr, char *buf)
>  {
> @@ -474,6 +487,7 @@ static DEVICE_ATTR_RO(resource);
>  
>  static struct attribute *nd_region_attributes[] = {
>  	&dev_attr_size.attr,
> +	&dev_attr_flush.attr,
>  	&dev_attr_nstype.attr,
>  	&dev_attr_mappings.attr,
>  	&dev_attr_btt_seed.attr,
> @@ -508,6 +522,9 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
>  	if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
>  		return 0;
>  
> +	if (a == &dev_attr_flush.attr && nvdimm_has_flush(nd_region) < 0)
> +		return 0;
> +
>  	if (a != &dev_attr_set_cookie.attr
>  			&& a != &dev_attr_available_size.attr)
>  		return a->mode;
>
Dan Williams April 24, 2017, 4:36 p.m. UTC | #3
On Mon, Apr 24, 2017 at 9:26 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>> platform WPQ (write-pending-queue) buffers when power is removed. The
>> nvdimm_flush() mechanism performs that same function on-demand.
>>
>> When a pmem namespace is associated with a block device, an
>> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
>> request. However, when a namespace is in device-dax mode, or namespaces
>> are disabled, userspace needs another path.
>>
>> The new 'flush' attribute is visible when it can be determined whether
>> the interleave set does or does not have DIMMs that expose WPQ-flush
>> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
>> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
>> platform nop.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> NACK.  This should function the same way it does for a pmem device.
> Wire up sync.

We don't have dirty page tracking for device-dax; without that I don't
think we should wire up the current sync calls. I do think we need a
more sophisticated sync syscall interface eventually that can select
which level of flushing is being performed (page cache vs cpu cache vs
platform-write-buffers). Until then I think this sideband interface
makes sense and sysfs is more usable than an ioctl.
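
To make those levels concrete, a hypothetical userspace sketch (assumed:
a 64-byte cache line, CLWB support and building with -mclwb, and the
write-triggered flush attribute discussed above rather than the
read-triggered one in this patch):

#include <immintrin.h>
#include <stdio.h>
#include <sys/mman.h>

#define CACHE_LINE 64

static void flush_all_levels(char *addr, size_t len, FILE *region_flush)
{
	char *p;

	/* level 1: page cache -> media; only meaningful for fsdax mappings */
	msync(addr, len, MS_SYNC);

	/* level 2: cpu cache -> memory controller (device-dax flushing) */
	for (p = addr; p < addr + len; p += CACHE_LINE)
		_mm_clwb(p);
	_mm_sfence();

	/*
	 * level 3: platform write-pending queues -> media ("deep flush");
	 * region_flush is open on /sys/bus/nd/devices/regionX/flush
	 */
	fputs("1", region_flush);
	fflush(region_flush);
}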
Jeff Moyer April 24, 2017, 4:43 p.m. UTC | #4
Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Apr 24, 2017 at 9:26 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>>> platform WPQ (write-pending-queue) buffers when power is removed. The
>>> nvdimm_flush() mechanism performs that same function on-demand.
>>>
>>> When a pmem namespace is associated with a block device, an
>>> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
>>> request. However, when a namespace is in device-dax mode, or namespaces
>>> are disabled, userspace needs another path.
>>>
>>> The new 'flush' attribute is visible when it can be determined whether
>>> the interleave set does or does not have DIMMs that expose WPQ-flush
>>> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
>>> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
>>> platform nop.
>>>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>
>> NACK.  This should function the same way it does for a pmem device.
>> Wire up sync.
>
> We don't have dirty page tracking for device-dax; without that I don't
> think we should wire up the current sync calls.

Why not?  Device dax is meant for the "flush from userspace" paradigm.
There's enough special casing around device dax that I think you can get
away with implementing *sync as a call to nvdimm_flush().

> I do think we need a more sophisticated sync syscall interface
> eventually that can select which level of flushing is being performed
> (page cache vs cpu cache vs platform-write-buffers).

I don't.  I think this whole notion of flush, and flush harder is
brain-dead.  How do you explain to applications when they should use
each one?

> Until then I think this sideband interface makes sense and sysfs is
> more usable than an ioctl.

Well, if you're totally against wiring up sync, then I say we forget
about the deep flush completely.  What's your use case?

Cheers,
Jeff
Linda Knippers April 24, 2017, 5:03 p.m. UTC | #5
On 04/21/2017 07:48 PM, Dan Williams wrote:
> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
> platform WPQ (write-pending-queue) buffers when power is removed. The
> nvdimm_flush() mechanism performs that same function on-demand.
> 
> When a pmem namespace is associated with a block device, an
> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
> request. However, when a namespace is in device-dax mode, or namespaces
> are disabled, userspace needs another path.

Why would a user need to flush a disabled namespace?

> The new 'flush' attribute is visible when it can be determined whether
> the interleave set does or does not have DIMMs that expose WPQ-flush
> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
> platform nop.

It seems a little odd to me that reading a read-only attribute both
tells you that the device has flush hints and also triggers a flush.
This means that anyone at any time can cause a flush.  Do we want that?

-- ljk

> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/region_devs.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 8de5a04644a1..3495b4c23941 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -255,6 +255,19 @@ static ssize_t size_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(size);
>  
> +static ssize_t flush_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	struct nd_region *nd_region = to_nd_region(dev);
> +
> +	if (nvdimm_has_flush(nd_region)) {
> +		nvdimm_flush(nd_region);
> +		return sprintf(buf, "1\n");
> +	}
> +	return sprintf(buf, "0\n");
> +}
> +static DEVICE_ATTR_RO(flush);
> +
>  static ssize_t mappings_show(struct device *dev,
>  		struct device_attribute *attr, char *buf)
>  {
> @@ -474,6 +487,7 @@ static DEVICE_ATTR_RO(resource);
>  
>  static struct attribute *nd_region_attributes[] = {
>  	&dev_attr_size.attr,
> +	&dev_attr_flush.attr,
>  	&dev_attr_nstype.attr,
>  	&dev_attr_mappings.attr,
>  	&dev_attr_btt_seed.attr,
> @@ -508,6 +522,9 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
>  	if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
>  		return 0;
>  
> +	if (a == &dev_attr_flush.attr && nvdimm_has_flush(nd_region) < 0)
> +		return 0;
> +
>  	if (a != &dev_attr_set_cookie.attr
>  			&& a != &dev_attr_available_size.attr)
>  		return a->mode;
> 
Dan Williams April 24, 2017, 5:07 p.m. UTC | #6
On Mon, Apr 24, 2017 at 10:03 AM, Linda Knippers <linda.knippers@hpe.com> wrote:
> On 04/21/2017 07:48 PM, Dan Williams wrote:
>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>> platform WPQ (write-pending-queue) buffers when power is removed. The
>> nvdimm_flush() mechanism performs that same function on-demand.
>>
>> When a pmem namespace is associated with a block device, an
>> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
>> request. However, when a namespace is in device-dax mode, or namespaces
>> are disabled, userspace needs another path.
>
> Why would a user need to flush a disabled namespace?

For an application that wants to shut down and sync. Basically I wanted
to make it clear that with this interface the buffers can be synced
regardless of any downstream namespace configuration or state.

>
>> The new 'flush' attribute is visible when it can be determined whether
>> the interleave set does or does not have DIMMs that expose WPQ-flush
>> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
>> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
>> platform nop.
>
> It seems a little odd to me that reading a read-only attribute both
> tells you that the device has flush hints and also triggers a flush.
> This means that anyone at any time can cause a flush.  Do we want that?

No, I'm making the change that Masayoshi-san suggested to move the
flush to a write operation... assuming we move forward given Jeff's
concern.
Dan Williams April 24, 2017, 5:43 p.m. UTC | #7
[ adding Christoph ]

On Mon, Apr 24, 2017 at 9:43 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Mon, Apr 24, 2017 at 9:26 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>
>>>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>>>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>>>> platform WPQ (write-pending-queue) buffers when power is removed. The
>>>> nvdimm_flush() mechanism performs that same function on-demand.
>>>>
>>>> When a pmem namespace is associated with a block device, an
>>>> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
>>>> request. However, when a namespace is in device-dax mode, or namespaces
>>>> are disabled, userspace needs another path.
>>>>
>>>> The new 'flush' attribute is visible when it can be determined whether
>>>> the interleave set does or does not have DIMMs that expose WPQ-flush
>>>> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
>>>> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
>>>> platform nop.
>>>>
>>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>>
>>> NACK.  This should function the same way it does for a pmem device.
>>> Wire up sync.
>>
>> We don't have dirty page tracking for device-dax; without that I don't
>> think we should wire up the current sync calls.
>
> Why not?  Device dax is meant for the "flush from userspace" paradigm.
> There's enough special casing around device dax that I think you can get
> away with implementing *sync as a call to nvdimm_flush().

I think it's an abuse of fsync() and gets in the way of where we might
take userspace-pmem-flushing with new sync primitives as proposed here
[1].

I'm also conscious of the shade that hch threw the last time I tried
to abuse an existing syscall for device-dax [2].

>> I do think we need a more sophisticated sync syscall interface
>> eventually that can select which level of flushing is being performed
>> (page cache vs cpu cache vs platform-write-buffers).
>
> I don't.  I think this whole notion of flush, and flush harder is
> brain-dead.  How do you explain to applications when they should use
> each one?

You never need to use this mechanism to guarantee persistence, which
is counter to what fsync() is defined to provide. This mechanism is
only there to backstop against potential ADR failures.

>> Until then I think this sideband interface makes sense and sysfs is
>> more usable than an ioctl.
>
> Well, if you're totally against wiring up sync, then I say we forget
> about the deep flush completely.  What's your use case?

The use case is device-dax users that want to reduce the impact of an
ADR failure, which also assumes that the platform has mechanisms to
communicate ADR failure. This is not an interface I expect to be used
by general-purpose applications. All of those should depend solely on
ADR semantics.

[1]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg444842.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2016-December/008299.html
Jeff Moyer April 24, 2017, 5:58 p.m. UTC | #8
Dan Williams <dan.j.williams@intel.com> writes:

> [ adding Christoph ]
>
> On Mon, Apr 24, 2017 at 9:43 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> On Mon, Apr 24, 2017 at 9:26 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>>
>>>>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>>>>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>>>>> platform WPQ (write-pending-queue) buffers when power is removed. The
>>>>> nvdimm_flush() mechanism performs that same function on-demand.
>>>>>
>>>>> When a pmem namespace is associated with a block device, an
>>>>> nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
>>>>> request. However, when a namespace is in device-dax mode, or namespaces
>>>>> are disabled, userspace needs another path.
>>>>>
>>>>> The new 'flush' attribute is visible when it can be determined whether
>>>>> the interleave set does or does not have DIMMs that expose WPQ-flush
>>>>> addresses, "flush-hints" in ACPI NFIT terminology. Reading it returns
>>>>> "1" and flushes the DIMMs, or returns "0" if the flush operation is a
>>>>> platform nop.
>>>>>
>>>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>>>
>>>> NACK.  This should function the same way it does for a pmem device.
>>>> Wire up sync.
>>>
>>> We don't have dirty page tracking for device-dax; without that I don't
>>> think we should wire up the current sync calls.
>>
>> Why not?  Device dax is meant for the "flush from userspace" paradigm.
>> There's enough special casing around device dax that I think you can get
>> away with implementing *sync as a call to nvdimm_flush().
>
> I think it's an abuse of fsync() and gets in the way of where we might
> take userspace-pmem-flushing with new sync primitives as proposed here
> [1].

I agree that it's an abuse, and I'm happy to not go that route.  I am
still against using a sysfs file to do this WPQ flush, however.

> I'm also conscious of the shade that hch threw the last time I tried
> to abuse an existing syscall for device-dax [2].
>
>>> I do think we need a more sophisticated sync syscall interface
>>> eventually that can select which level of flushing is being performed
>>> (page cache vs cpu cache vs platform-write-buffers).
>>
>> I don't.  I think this whole notion of flush, and flush harder is
>> brain-dead.  How do you explain to applications when they should use
>> each one?
>
> You never need to use this mechanism to guarantee persistence, which
> is counter to what fsync() is defined to provide. This mechanism is
> only there to backstop against potential ADR failures.

You haven't answered my question.  Why should applications even need to
consider this?  Do you expect ADR to have a high failure rate?  If so,
shouldn't an application call this deep flush any time it wants to make
its state persistent?

>>> Until then I think this sideband interface makes sense and sysfs is
>>> more usable than an ioctl.
>>
>> Well, if you're totally against wiring up sync, then I say we forget
>> about the deep flush completely.  What's your use case?
>
>> The use case is device-dax users that want to reduce the impact of an
>> ADR failure, which also assumes that the platform has mechanisms to
>> communicate ADR failure. This is not an interface I expect to be used
>> by general-purpose applications. All of those should depend solely on
>> ADR semantics.

What applications?

I remain unconvinced of the utility of the WPQ flush separate from
msync/fsync.  Either you always do the WPQ flush, or you never do it.  I
don't see the use case for doing it sometimes, and no one I've asked has
managed to come up with a concrete use case.

Cheers,
Jeff

Patch

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 8de5a04644a1..3495b4c23941 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -255,6 +255,19 @@ static ssize_t size_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(size);
 
+static ssize_t flush_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	if (nvdimm_has_flush(nd_region)) {
+		nvdimm_flush(nd_region);
+		return sprintf(buf, "1\n");
+	}
+	return sprintf(buf, "0\n");
+}
+static DEVICE_ATTR_RO(flush);
+
 static ssize_t mappings_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -474,6 +487,7 @@ static DEVICE_ATTR_RO(resource);
 
 static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
+	&dev_attr_flush.attr,
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
 	&dev_attr_btt_seed.attr,
@@ -508,6 +522,9 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
 	if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
 		return 0;
 
+	if (a == &dev_attr_flush.attr && nvdimm_has_flush(nd_region) < 0)
+		return 0;
+
 	if (a != &dev_attr_set_cookie.attr
 			&& a != &dev_attr_available_size.attr)
 		return a->mode;