[v3] libnvdimm, region: sysfs trigger for nvdimm_flush()

Message ID	149315140303.23340.14688142799059150805.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> Subject: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush() From: Dan Williams <dan.j.williams@intel.com> To: linux-nvdimm@lists.01.org Date: Tue, 25 Apr 2017 13:17:37 -0700 Message-ID: <149315140303.23340.14688142799059150805.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <149307780135.7155.11108531648914675756.stgit@dwillia2-desk3.amr.corp.intel.com> References: <149307780135.7155.11108531648914675756.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.17.1-9-g687f MIME-Version: 1.0 Precedence: list Cc: linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Dan Williams April 25, 2017, 8:17 p.m. UTC

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.

The new 'flush' attribute is visible when the individual DIMMs backing a
given interleave-set are described by platform firmware. In ACPI terms
this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
Address Structures". Reads return "1" if the region supports triggering
WPQ flushes on all DIMMs. Reads return "0" the flush operation is a
platform nop, and in that case the attribute is read-only.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v3:
* Fixed up the permissions in the read-only case to 0444 instead of
  0400 as is typical for read-only sysfs attributes.

 drivers/nvdimm/region_devs.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

Jeff Moyer April 26, 2017, 8:38 p.m. UTC | #1

Dan Williams <dan.j.williams@intel.com> writes:

> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
> platform WPQ (write-pending-queue) buffers when power is removed. The
> nvdimm_flush() mechanism performs that same function on-demand.
>
> When a pmem namespace is associated with a block device, an
> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
> request. These requests are typically associated with filesystem
> metadata updates. However, when a namespace is in device-dax mode,
> userspace (think database metadata) needs another path to perform the
> same flushing. In other words this is not required to make data
> persistent, but in the case of metadata it allows for a smaller failure
> domain in the unlikely event of an ADR failure.
>
> The new 'flush' attribute is visible when the individual DIMMs backing a
> given interleave-set are described by platform firmware. In ACPI terms
> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
> Address Structures". Reads return "1" if the region supports triggering
> WPQ flushes on all DIMMs. Reads return "0" the flush operation is a
> platform nop, and in that case the attribute is read-only.

I can make peace with exposing this to userspace, though I am mostly
against its use.  However, sysfs feels like the wrong interface.
Believe it or not, I'd rather see this implemented as an ioctl.

This isn't a NACK, it's me giving my opinion.  Do with it what you will.

Cheers,
Jeff

Dan Williams April 26, 2017, 11 p.m. UTC | #2

On Wed, Apr 26, 2017 at 1:38 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>> platform WPQ (write-pending-queue) buffers when power is removed. The
>> nvdimm_flush() mechanism performs that same function on-demand.
>>
>> When a pmem namespace is associated with a block device, an
>> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
>> request. These requests are typically associated with filesystem
>> metadata updates. However, when a namespace is in device-dax mode,
>> userspace (think database metadata) needs another path to perform the
>> same flushing. In other words this is not required to make data
>> persistent, but in the case of metadata it allows for a smaller failure
>> domain in the unlikely event of an ADR failure.
>>
>> The new 'flush' attribute is visible when the individual DIMMs backing a
>> given interleave-set are described by platform firmware. In ACPI terms
>> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
>> Address Structures". Reads return "1" if the region supports triggering
>> WPQ flushes on all DIMMs. Reads return "0" the flush operation is a
>> platform nop, and in that case the attribute is read-only.
>
> I can make peace with exposing this to userspace, though I am mostly
> against its use.  However, sysfs feels like the wrong interface.
> Believe it or not, I'd rather see this implemented as an ioctl.
>
> This isn't a NACK, it's me giving my opinion.  Do with it what you will.

I hate ioctls with a burning passion so I can't get on board with that
change, but perhaps the sentiment behind it is that this is too
visible and too attractive being called "flush" in sysfs? Would a name
more specific to the mechanism make it more palatable? Like
"flush_hint_trigger" or "wpq_drain"?

Jeff Moyer April 27, 2017, 1:45 p.m. UTC | #3

Dan Williams <dan.j.williams@intel.com> writes:

> On Wed, Apr 26, 2017 at 1:38 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>>> platform WPQ (write-pending-queue) buffers when power is removed. The
>>> nvdimm_flush() mechanism performs that same function on-demand.
>>>
>>> When a pmem namespace is associated with a block device, an
>>> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
>>> request. These requests are typically associated with filesystem
>>> metadata updates. However, when a namespace is in device-dax mode,
>>> userspace (think database metadata) needs another path to perform the
>>> same flushing. In other words this is not required to make data
>>> persistent, but in the case of metadata it allows for a smaller failure
>>> domain in the unlikely event of an ADR failure.
>>>
>>> The new 'flush' attribute is visible when the individual DIMMs backing a
>>> given interleave-set are described by platform firmware. In ACPI terms
>>> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
>>> Address Structures". Reads return "1" if the region supports triggering
>>> WPQ flushes on all DIMMs. Reads return "0" the flush operation is a
>>> platform nop, and in that case the attribute is read-only.
>>
>> I can make peace with exposing this to userspace, though I am mostly
>> against its use.  However, sysfs feels like the wrong interface.
>> Believe it or not, I'd rather see this implemented as an ioctl.
>>
>> This isn't a NACK, it's me giving my opinion.  Do with it what you will.
>
> I hate ioctls with a burning passion so I can't get on board with that
> change, but perhaps the sentiment behind it is that this is too
> visible and too attractive being called "flush" in sysfs? Would a name
> more specific to the mechanism make it more palatable? Like
> "flush_hint_trigger" or "wpq_drain"?

The sentiment is that programs shouldn't have to grovel around in sysfs
to do stuff related to an open file descriptor or mapping.  I don't take
issue with the name.  I do worry that something like 'wpq_drain' may be
too platform specific, though.  The NVM Programming Model specification
is going to call this "deep flush", so maybe that will give you
some inspiration if you do want to change the name.

Cheers,
Jeff

Dan Williams April 27, 2017, 4:56 p.m. UTC | #4

On Thu, Apr 27, 2017 at 6:45 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Wed, Apr 26, 2017 at 1:38 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>
>>>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>>>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>>>> platform WPQ (write-pending-queue) buffers when power is removed. The
>>>> nvdimm_flush() mechanism performs that same function on-demand.
>>>>
>>>> When a pmem namespace is associated with a block device, an
>>>> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
>>>> request. These requests are typically associated with filesystem
>>>> metadata updates. However, when a namespace is in device-dax mode,
>>>> userspace (think database metadata) needs another path to perform the
>>>> same flushing. In other words this is not required to make data
>>>> persistent, but in the case of metadata it allows for a smaller failure
>>>> domain in the unlikely event of an ADR failure.
>>>>
>>>> The new 'flush' attribute is visible when the individual DIMMs backing a
>>>> given interleave-set are described by platform firmware. In ACPI terms
>>>> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
>>>> Address Structures". Reads return "1" if the region supports triggering
>>>> WPQ flushes on all DIMMs. Reads return "0" the flush operation is a
>>>> platform nop, and in that case the attribute is read-only.
>>>
>>> I can make peace with exposing this to userspace, though I am mostly
>>> against its use.  However, sysfs feels like the wrong interface.
>>> Believe it or not, I'd rather see this implemented as an ioctl.
>>>
>>> This isn't a NACK, it's me giving my opinion.  Do with it what you will.
>>
>> I hate ioctls with a burning passion so I can't get on board with that
>> change, but perhaps the sentiment behind it is that this is too
>> visible and too attractive being called "flush" in sysfs? Would a name
>> more specific to the mechanism make it more palatable? Like
>> "flush_hint_trigger" or "wpq_drain"?
>
> The sentiment is that programs shouldn't have to grovel around in sysfs
> to do stuff related to an open file descriptor or mapping.  I don't take
> issue with the name.  I do worry that something like 'wpq_drain' may be
> too platform specific, though.  The NVM Programming Model specification
> is going to call this "deep flush", so maybe that will give you
> some inspiration if you do want to change the name.

I'll change to "deep_flush", and I quibble that this is related to a
single open file descriptor or mapping. It really is a "region flush"
for giving extra protection for global metadata, but the persistence
of individual fds or mappings is handled by ADR. I think an ioctl
might give the false impression that every time you flush a cacheline
to persistence you need to call the ioctl.

Jeff Moyer April 27, 2017, 6:41 p.m. UTC | #5

Dan Williams <dan.j.williams@intel.com> writes:

>> The sentiment is that programs shouldn't have to grovel around in sysfs
>> to do stuff related to an open file descriptor or mapping.  I don't take
>> issue with the name.  I do worry that something like 'wpq_drain' may be
>> too platform specific, though.  The NVM Programming Model specification
>> is going to call this "deep flush", so maybe that will give you
>> some inspiration if you do want to change the name.
>
> I'll change to "deep_flush", and I quibble that this is related to a
> single open file descriptor or mapping. It really is a "region flush"
> for giving extra protection for global metadata, but the persistence
> of individual fds or mappings is handled by ADR. I think an ioctl
> might give the false impression that every time you flush a cacheline
> to persistence you need to call the ioctl.

fsync, for example, may affect more than one fd--all data in the drive
write cache will be flushed.  I don't see how this is so different.  I
think a sysfs file is awkward because it requires an application to
chase down the correct file in the sysfs hierarchy.  If the application
already has an open fd or a mapping, it should be able to operate on
that.

As for confusion on when to use the interface, I think that's inevitable
no matter how it's implemented.  We're introducing a flush type that has
never been exposed before, and we're not giving any information on how
likely an ADR failure is, or how expensive this flush will be.

Cheers,
Jeff

Dan Williams April 27, 2017, 7:17 p.m. UTC | #6

On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>> too platform specific, though.  The NVM Programming Model specification
>>> is going to call this "deep flush", so maybe that will give you
>>> some inspiration if you do want to change the name.
>>
>> I'll change to "deep_flush", and I quibble that this is related to a
>> single open file descriptor or mapping. It really is a "region flush"
>> for giving extra protection for global metadata, but the persistence
>> of individual fds or mappings is handled by ADR. I think an ioctl
>> might give the false impression that every time you flush a cacheline
>> to persistence you need to call the ioctl.
>
> fsync, for example, may affect more than one fd--all data in the drive
> write cache will be flushed.  I don't see how this is so different.  I
> think a sysfs file is awkward because it requires an application to
> chase down the correct file in the sysfs hierarchy.  If the application
> already has an open fd or a mapping, it should be able to operate on
> that.

I'm teetering, but still leaning towards sysfs. The use case that
needs this is device-dax because we otherwise silently do this behind
the application's back on filesystem-dax for fsync / msync. A
device-dax ioctl would be straightforward, but 'deep flush' assumes
that the device-dax instance is fronting persistent memory. There's
nothing persistent memory specific about device-dax except that today
only the nvdimm sub-system knows how to create them, but there's
nothing that prevents other memory regions from being mapped this way.
So I'd rather this persistent memory specific mechanism stay with the
persistent memory specific portion of the interface rather than plumb
persistent memory details out through the generic device-dax interface
since we have no other intercept point like we do in the
filesystem-dax case to hide this flush.

Dan Williams April 27, 2017, 7:21 p.m. UTC | #7

On Thu, Apr 27, 2017 at 12:17 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>> too platform specific, though.  The NVM Programming Model specification
>>>> is going to call this "deep flush", so maybe that will give you
>>>> some inspiration if you do want to change the name.
>>>
>>> I'll change to "deep_flush", and I quibble that this is related to a
>>> single open file descriptor or mapping. It really is a "region flush"
>>> for giving extra protection for global metadata, but the persistence
>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>> might give the false impression that every time you flush a cacheline
>>> to persistence you need to call the ioctl.
>>
>> fsync, for example, may affect more than one fd--all data in the drive
>> write cache will be flushed.  I don't see how this is so different.  I
>> think a sysfs file is awkward because it requires an application to
>> chase down the correct file in the sysfs hierarchy.  If the application
>> already has an open fd or a mapping, it should be able to operate on
>> that.
>
> I'm teetering, but still leaning towards sysfs. The use case that
> needs this is device-dax because we otherwise silently do this behind
> the application's back on filesystem-dax for fsync / msync. A
> device-dax ioctl would be straightforward, but 'deep flush' assumes
> that the device-dax instance is fronting persistent memory. There's
> nothing persistent memory specific about device-dax except that today
> only the nvdimm sub-system knows how to create them, but there's
> nothing that prevents other memory regions from being mapped this way.
> So I'd rather this persistent memory specific mechanism stay with the
> persistent memory specific portion of the interface rather than plumb
> persistent memory details out through the generic device-dax interface
> since we have no other intercept point like we do in the
> filesystem-dax case to hide this flush.

We also still seem to need a discovery mechanism as I've had questions
about "how do I tell if my system supports deep flush?". That's where
sysfs is much better than an ioctl. The need for deep flush discovery
tips the scales, at least for me, to also do deep flush triggering
through the same interface.

Jeff Moyer April 27, 2017, 7:40 p.m. UTC | #8

Dan Williams <dan.j.williams@intel.com> writes:

> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>> too platform specific, though.  The NVM Programming Model specification
>>>> is going to call this "deep flush", so maybe that will give you
>>>> some inspiration if you do want to change the name.
>>>
>>> I'll change to "deep_flush", and I quibble that this is related to a
>>> single open file descriptor or mapping. It really is a "region flush"
>>> for giving extra protection for global metadata, but the persistence
>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>> might give the false impression that every time you flush a cacheline
>>> to persistence you need to call the ioctl.
>>
>> fsync, for example, may affect more than one fd--all data in the drive
>> write cache will be flushed.  I don't see how this is so different.  I
>> think a sysfs file is awkward because it requires an application to
>> chase down the correct file in the sysfs hierarchy.  If the application
>> already has an open fd or a mapping, it should be able to operate on
>> that.
>
> I'm teetering, but still leaning towards sysfs. The use case that
> needs this is device-dax because we otherwise silently do this behind
> the application's back on filesystem-dax for fsync / msync.

We may yet get file system support for flush from userspace (NOVA, for
example).  So I don't think we should restrict ourselves to only
thinking about the device dax use case.

> A device-dax ioctl would be straightforward, but 'deep flush' assumes
> that the device-dax instance is fronting persistent memory.  There's
> nothing persistent memory specific about device-dax except that today
> only the nvdimm sub-system knows how to create them, but there's
> nothing that prevents other memory regions from being mapped this way.

You're concerned that applications operating on device dax instances
that are not backed by pmem will try to issue a deep flush?  Why would
they do that, and why can't you just return failure from the ioctl?

> So I'd rather this persistent memory specific mechanism stay with the
> persistent memory specific portion of the interface rather than plumb
> persistent memory details out through the generic device-dax interface
> since we have no other intercept point like we do in the
> filesystem-dax case to hide this flush.

Look at the block layer.  You can issue an ioctl on a block device, and
if the generic block layer can handle it, it does.  If not, it gets
passed down to lower layers until either it gets handled, or it bubbles
back up because nobody knew what to do with it.  I think you can do the
same thing here, and that solves your layering violation.

Cheers,
Jeff

Jeff Moyer April 27, 2017, 7:43 p.m. UTC | #9

Dan Williams <dan.j.williams@intel.com> writes:

> We also still seem to need a discovery mechanism as I've had questions
> about "how do I tell if my system supports deep flush?". That's where
> sysfs is much better than an ioctl. The need for deep flush discovery
> tips the scales, at least for me, to also do deep flush triggering
> through the same interface.

Return ENXIO or EOPNOTSUPP from the ioctl when it isn't supported?

Cheers,
Jeff

Dan Williams April 27, 2017, 8:02 p.m. UTC | #10

On Thu, Apr 27, 2017 at 12:40 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>
>>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>>> too platform specific, though.  The NVM Programming Model specification
>>>>> is going to call this "deep flush", so maybe that will give you
>>>>> some inspiration if you do want to change the name.
>>>>
>>>> I'll change to "deep_flush", and I quibble that this is related to a
>>>> single open file descriptor or mapping. It really is a "region flush"
>>>> for giving extra protection for global metadata, but the persistence
>>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>>> might give the false impression that every time you flush a cacheline
>>>> to persistence you need to call the ioctl.
>>>
>>> fsync, for example, may affect more than one fd--all data in the drive
>>> write cache will be flushed.  I don't see how this is so different.  I
>>> think a sysfs file is awkward because it requires an application to
>>> chase down the correct file in the sysfs hierarchy.  If the application
>>> already has an open fd or a mapping, it should be able to operate on
>>> that.
>>
>> I'm teetering, but still leaning towards sysfs. The use case that
>> needs this is device-dax because we otherwise silently do this behind
>> the application's back on filesystem-dax for fsync / msync.
>
> We may yet get file system support for flush from userspace (NOVA, for
> example).  So I don't think we should restrict ourselves to only
> thinking about the device dax use case.
>
>> A device-dax ioctl would be straightforward, but 'deep flush' assumes
>> that the device-dax instance is fronting persistent memory.  There's
>> nothing persistent memory specific about device-dax except that today
>> only the nvdimm sub-system knows how to create them, but there's
>> nothing that prevents other memory regions from being mapped this way.
>
> You're concerned that applications operating on device dax instances
> that are not backed by pmem will try to issue a deep flush?  Why would
> they do that, and why can't you just return failure from the ioctl?
>
>> So I'd rather this persistent memory specific mechanism stay with the
>> persistent memory specific portion of the interface rather than plumb
>> persistent memory details out through the generic device-dax interface
>> since we have no other intercept point like we do in the
>> filesystem-dax case to hide this flush.
>
> Look at the block layer.  You can issue an ioctl on a block device, and
> if the generic block layer can handle it, it does.  If not, it gets
> passed down to lower layers until either it gets handled, or it bubbles
> back up because nobody knew what to do with it.  I think you can do the
> same thing here, and that solves your layering violation.

So this is where I started. I was going to follow the block layer.
Except recently the block layer has been leaning away from ioctls and
implementing support for syscalls directly. The same approach for
device-dax fallocate() support got NAKd, so I opted for sysfs out of
the gate.

However, since there really is no analog for "deep flush" in the
syscall namespace lets (*gag*) implement an ioctl for this.

Dan Williams April 27, 2017, 9:36 p.m. UTC | #11

On Thu, Apr 27, 2017 at 1:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Apr 27, 2017 at 12:40 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>>
>>>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>>>> too platform specific, though.  The NVM Programming Model specification
>>>>>> is going to call this "deep flush", so maybe that will give you
>>>>>> some inspiration if you do want to change the name.
>>>>>
>>>>> I'll change to "deep_flush", and I quibble that this is related to a
>>>>> single open file descriptor or mapping. It really is a "region flush"
>>>>> for giving extra protection for global metadata, but the persistence
>>>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>>>> might give the false impression that every time you flush a cacheline
>>>>> to persistence you need to call the ioctl.
>>>>
>>>> fsync, for example, may affect more than one fd--all data in the drive
>>>> write cache will be flushed.  I don't see how this is so different.  I
>>>> think a sysfs file is awkward because it requires an application to
>>>> chase down the correct file in the sysfs hierarchy.  If the application
>>>> already has an open fd or a mapping, it should be able to operate on
>>>> that.
>>>
>>> I'm teetering, but still leaning towards sysfs. The use case that
>>> needs this is device-dax because we otherwise silently do this behind
>>> the application's back on filesystem-dax for fsync / msync.
>>
>> We may yet get file system support for flush from userspace (NOVA, for
>> example).  So I don't think we should restrict ourselves to only
>> thinking about the device dax use case.
>>
>>> A device-dax ioctl would be straightforward, but 'deep flush' assumes
>>> that the device-dax instance is fronting persistent memory.  There's
>>> nothing persistent memory specific about device-dax except that today
>>> only the nvdimm sub-system knows how to create them, but there's
>>> nothing that prevents other memory regions from being mapped this way.
>>
>> You're concerned that applications operating on device dax instances
>> that are not backed by pmem will try to issue a deep flush?  Why would
>> they do that, and why can't you just return failure from the ioctl?
>>
>>> So I'd rather this persistent memory specific mechanism stay with the
>>> persistent memory specific portion of the interface rather than plumb
>>> persistent memory details out through the generic device-dax interface
>>> since we have no other intercept point like we do in the
>>> filesystem-dax case to hide this flush.
>>
>> Look at the block layer.  You can issue an ioctl on a block device, and
>> if the generic block layer can handle it, it does.  If not, it gets
>> passed down to lower layers until either it gets handled, or it bubbles
>> back up because nobody knew what to do with it.  I think you can do the
>> same thing here, and that solves your layering violation.
>
> So this is where I started. I was going to follow the block layer.
> Except recently the block layer has been leaning away from ioctls and
> implementing support for syscalls directly. The same approach for
> device-dax fallocate() support got NAKd, so I opted for sysfs out of
> the gate.
>
> However, since there really is no analog for "deep flush" in the
> syscall namespace lets (*gag*) implement an ioctl for this.

So I think about it for 2 seconds and now I'm veering back to sysfs.
We don't want applications calling this ioctl on filesystem-dax fds. I
simply can't bring myself to do the work to pick a unique ioctl number
that is known to be unique for the full filsystem and block-layer
paths when we can put this interface right where it belongs in nvdimm
specific sysfs.

[v3] libnvdimm, region: sysfs trigger for nvdimm_flush()

Commit Message

Comments

Patch