[14/15] libnvdimm: support read-only btt backing devices
diff mbox

Message ID 20150617235602.12943.24958.stgit@dwillia2-desk3.amr.corp.intel.com
State New
Headers show

Commit Message

Dan Williams June 17, 2015, 11:56 p.m. UTC
Upon detection of a read-only backing device arrange for the btt to
device to be read only.  Implement a catch for the BLKROSET ioctl and
only allow a btt-instance to become read-write when the backing-device
becomes read-write.  Conversely, if a backing-device becomes read-only
arrange for its parent btt to be marked read-only.  Synchronize these
changes under the bus lock.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/blk.c      |    4 +++
 drivers/nvdimm/btt.c      |   34 ++++++++++++++++++++++++++--
 drivers/nvdimm/btt_devs.c |   42 ++++++++++++++++++++++++++++++++++
 drivers/nvdimm/bus.c      |   55 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/nd-core.h  |   14 +++++++++++
 drivers/nvdimm/nd.h       |    4 +++
 drivers/nvdimm/pmem.c     |    4 +++
 7 files changed, 154 insertions(+), 3 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Vishal Verma June 18, 2015, 10:55 p.m. UTC | #1
On Wed, 2015-06-17 at 19:56 -0400, Dan Williams wrote:
> Upon detection of a read-only backing device arrange for the btt to
> device to be read only.  Implement a catch for the BLKROSET ioctl and
> only allow a btt-instance to become read-write when the backing-device
> becomes read-write.  Conversely, if a backing-device becomes read-only
> arrange for its parent btt to be marked read-only.  Synchronize these
> changes under the bus lock.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/blk.c      |    4 +++
>  drivers/nvdimm/btt.c      |   34 ++++++++++++++++++++++++++--
>  drivers/nvdimm/btt_devs.c |   42 ++++++++++++++++++++++++++++++++++
>  drivers/nvdimm/bus.c      |   55 +++++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvdimm/nd-core.h  |   14 +++++++++++
>  drivers/nvdimm/nd.h       |    4 +++
>  drivers/nvdimm/pmem.c     |    4 +++
>  7 files changed, 154 insertions(+), 3 deletions(-)

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig June 21, 2015, 10:13 a.m. UTC | #2
On Wed, Jun 17, 2015 at 07:56:02PM -0400, Dan Williams wrote:
> Upon detection of a read-only backing device arrange for the btt to
> device to be read only.  Implement a catch for the BLKROSET ioctl and
> only allow a btt-instance to become read-write when the backing-device
> becomes read-write.  Conversely, if a backing-device becomes read-only
> arrange for its parent btt to be marked read-only.  Synchronize these
> changes under the bus lock.

Eww.  I have to say the deeper I look into this code the more I hate
the stacking nature of btt.  It seems more and more we should never
attach pmem if we want to use a device with btt. 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 21, 2015, 1:21 p.m. UTC | #3
On Sun, Jun 21, 2015 at 3:13 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Jun 17, 2015 at 07:56:02PM -0400, Dan Williams wrote:
>> Upon detection of a read-only backing device arrange for the btt to
>> device to be read only.  Implement a catch for the BLKROSET ioctl and
>> only allow a btt-instance to become read-write when the backing-device
>> becomes read-write.  Conversely, if a backing-device becomes read-only
>> arrange for its parent btt to be marked read-only.  Synchronize these
>> changes under the bus lock.
>
> Eww.  I have to say the deeper I look into this code the more I hate
> the stacking nature of btt.  It seems more and more we should never
> attach pmem if we want to use a device with btt.

This question has come up before.  Making btt an internal property of
a device makes some things cleaner and others more messy.  We lose the
ability to place a btt instance on top of a partition, rather than a
whole disk.  If we ever need to access the raw device we no longer
have a direct block device to reference.  Linux has been doing stacked
configurations to change the personality of block devices since
forever (md, dm, bcache...), why invent something new to handle the
btt-personality of ->rw_bytes() devices?

BTT precludes DAX, if you want both modes on one pmem disk placing BTT
on a partition of the disk for fs metadata and DAX-capable data on the
rest is our proposed solution.  We chose this architecture after a
conversation with Dave Chinner about XFS's need to have atomic sector
guarantees for its metadata and wanting to simultaneously enable
XFS-DAX.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 21, 2015, 1:54 p.m. UTC | #4
On Sun, Jun 21, 2015 at 06:21:50AM -0700, Dan Williams wrote:
> This question has come up before.  Making btt an internal property of
> a device makes some things cleaner and others more messy.  We lose the
> ability to place a btt instance on top of a partition, rather than a
> whole disk.

I thought the addition of nfit labels avoids the need for a partition
table now?

> If we ever need to access the raw device we no longer
> have a direct block device to reference.  Linux has been doing stacked
> configurations to change the personality of block devices since
> forever (md, dm, bcache...), why invent something new to handle the
> btt-personality of ->rw_bytes() devices?

Because the underlying abstraction really isn't a block device
anymore, it's a byte addressable device.  This is more similar to
for example how the mtd subsystem is structured.

> BTT precludes DAX, if you want both modes on one pmem disk placing BTT
> on a partition of the disk for fs metadata and DAX-capable data on the
> rest is our proposed solution.  We chose this architecture after a
> conversation with Dave Chinner about XFS's need to have atomic sector
> guarantees for its metadata and wanting to simultaneously enable
> XFS-DAX.

I can't see why a v5 XFS filesystem with CRCs on all metadata would need
sector atomic updates any more.  But even in a case where it would it
seem like whatever label you use for partioning should sit above the
block layer.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 21, 2015, 3:11 p.m. UTC | #5
On Sun, Jun 21, 2015 at 6:54 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jun 21, 2015 at 06:21:50AM -0700, Dan Williams wrote:
>> This question has come up before.  Making btt an internal property of
>> a device makes some things cleaner and others more messy.  We lose the
>> ability to place a btt instance on top of a partition, rather than a
>> whole disk.
>
> I thought the addition of nfit labels avoids the need for a partition
> table now?
>

The labels only allow allocation of persistent media between pmem and
blk.  For a given dimm you may access in either mode and the label
records the decision.  We can have a btt on either the pmem or
blk-mode disk type, or partition thereof.

>> If we ever need to access the raw device we no longer
>> have a direct block device to reference.  Linux has been doing stacked
>> configurations to change the personality of block devices since
>> forever (md, dm, bcache...), why invent something new to handle the
>> btt-personality of ->rw_bytes() devices?
>
> Because the underlying abstraction really isn't a block device
> anymore, it's a byte addressable device.  This is more similar to
> for example how the mtd subsystem is structured.

Yes, it's this hybrid thing that mostly fits into the existing block
device model save for two new block_device_operations
->direct_access() and ->rw_bytes().  We then use property of a
block_device that allows it to be claimed for exclusive ownership by a
filesystem or another block_device to layer storage semantics on top
be it files+directories, raid, caching, or atomic sectors.  NVDIMM
devices don't present the same complexity as MTD devices.  The only
complexity they present is byte-address-ability, not erase-block-size,
wear-leveling, etc...

>> BTT precludes DAX, if you want both modes on one pmem disk placing BTT
>> on a partition of the disk for fs metadata and DAX-capable data on the
>> rest is our proposed solution.  We chose this architecture after a
>> conversation with Dave Chinner about XFS's need to have atomic sector
>> guarantees for its metadata and wanting to simultaneously enable
>> XFS-DAX.
>
> I can't see why a v5 XFS filesystem with CRCs on all metadata would need
> sector atomic updates any more.  But even in a case where it would it
> seem like whatever label you use for partioning should sit above the
> block layer.

Yes, we use standard partition labels to sub-divide a namespace.  A
namespace boundary is either set by a label internal to the dimm or
the NFIT directly (for dimms that do not support internal labeling).
Good to hear that we don't need BTT for XFS v5, can we make the
guarantee for all filesystems that may want to support DAX?  I still
think stacking is a natural fit for this problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 6:30 a.m. UTC | #6
On Sun, Jun 21, 2015 at 08:11:25AM -0700, Dan Williams wrote:
> The labels only allow allocation of persistent media between pmem and
> blk.  For a given dimm you may access in either mode and the label
> records the decision.  We can have a btt on either the pmem or
> blk-mode disk type, or partition thereof.

Sounds like the spec should allow a btt type as well insteaad of
requiring the OS to work around it, as that seems to be one of the few
useful things to do with a run-time label.

Either way, partitions are trivial things and we could add them to the
nvdimm layer.

> Yes, it's this hybrid thing that mostly fits into the existing block
> device model save for two new block_device_operations
> ->direct_access() and ->rw_bytes().  We then use property of a
> block_device that allows it to be claimed for exclusive ownership by a
> filesystem or another block_device to layer storage semantics on top
> be it files+directories, raid, caching, or atomic sectors.  NVDIMM
> devices don't present the same complexity as MTD devices.  The only
> complexity they present is byte-address-ability, not erase-block-size,
> wear-leveling, etc...

I didn't say they show the same complexities, but the same layering.

> Good to hear that we don't need BTT for XFS v5, can we make the
> guarantee for all filesystems that may want to support DAX?  I still
> think stacking is a natural fit for this problem.

I can't make any guarantees, especially not without verification.  But
if correctly implemented any filesystems that does out of place metadata
writes (and that includes a traditional log) and uses checksum to ensure
the integrity of these updates it should be fine.  You'd still have
the issue of sector atomicy of file I/O though.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 22, 2015, 7:17 a.m. UTC | #7
On Sun, Jun 21, 2015 at 11:30 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jun 21, 2015 at 08:11:25AM -0700, Dan Williams wrote:
>> The labels only allow allocation of persistent media between pmem and
>> blk.  For a given dimm you may access in either mode and the label
>> records the decision.  We can have a btt on either the pmem or
>> blk-mode disk type, or partition thereof.
>
> Sounds like the spec should allow a btt type as well insteaad of
> requiring the OS to work around it, as that seems to be one of the few
> useful things to do with a run-time label.

To be fair the namespace was initially envisioned to be btt enabled or
not, and hide the raw media device.  It was only when we added the
"XFS needs BTT so we need BTT support on partitions" constraint did I
push stacked BTT as the most flexible way to handle all these
configurations.  It also simplified the namespace to only be a
partition of access modes and leave sub-dividing pmem to standard
partitions.

> Either way, partitions are trivial things and we could add them to the
> nvdimm layer.
>
>> Yes, it's this hybrid thing that mostly fits into the existing block
>> device model save for two new block_device_operations
>> ->direct_access() and ->rw_bytes().  We then use property of a
>> block_device that allows it to be claimed for exclusive ownership by a
>> filesystem or another block_device to layer storage semantics on top
>> be it files+directories, raid, caching, or atomic sectors.  NVDIMM
>> devices don't present the same complexity as MTD devices.  The only
>> complexity they present is byte-address-ability, not erase-block-size,
>> wear-leveling, etc...
>
> I didn't say they show the same complexities, but the same layering.
>
>> Good to hear that we don't need BTT for XFS v5, can we make the
>> guarantee for all filesystems that may want to support DAX?  I still
>> think stacking is a natural fit for this problem.
>
> I can't make any guarantees, especially not without verification.  But
> if correctly implemented any filesystems that does out of place metadata
> writes (and that includes a traditional log) and uses checksum to ensure
> the integrity of these updates it should be fine.  You'd still have
> the issue of sector atomicy of file I/O though.

If someone needs sector atomicity of file I/O then by definition they
can't have DAX enabled.

There's no guarantee that these drivers are only ever paired with
XFSv5.  Drivers tend to be backported more freely than filesystems.  I
don't think the need for BTT on partitions will go away, but if you're
not convinced we could try the wait and see approach and move BTT to
only be enabled at namespace boundaries.  That's a fairly invasive
change to the configuration model, I'd hate to come back in a few
months to re-add BTT on partition support alongside the namespace only
mode.  Not trying to throw FUD, I'm willing to admit there are
downsides to the stacking model, they're just not clear to me
presently.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 7:28 a.m. UTC | #8
On Mon, Jun 22, 2015 at 12:17:29AM -0700, Dan Williams wrote:
> To be fair the namespace was initially envisioned to be btt enabled or
> not, and hide the raw media device.

What's the fascination with hiding one access mode just because
another one is available?

> There's no guarantee that these drivers are only ever paired with
> XFSv5.

There's not guarantee for anything.   Note that anything not following
my criteria earlier would need some form of atomic sector updates,
which is a lot more.  But then again for most of those setups you
wouldn't take advantage of pmem anyway.

Sounds like we simply shouldn't merge btt at all for now and wait for
a real use case, which would simplify the whole issue a lot.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 22, 2015, 7:39 a.m. UTC | #9
On Mon, Jun 22, 2015 at 12:28 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 12:17:29AM -0700, Dan Williams wrote:
>> To be fair the namespace was initially envisioned to be btt enabled or
>> not, and hide the raw media device.
>
> What's the fascination with hiding one access mode just because
> another one is available?
>

Now I'm confused, you *don't* want the raw device to be hidden *and*
you want to kill the stacking?  Something got crossed.  The current
implementation hides nothing, you get to see the entire stacked
composition.  I'd much prefer to avoid hiding anything.

>> There's no guarantee that these drivers are only ever paired with
>> XFSv5.
>
> There's not guarantee for anything.   Note that anything not following
> my criteria earlier would need some form of atomic sector updates,
> which is a lot more.  But then again for most of those setups you
> wouldn't take advantage of pmem anyway.
>
> Sounds like we simply shouldn't merge btt at all for now and wait for
> a real use case, which would simplify the whole issue a lot.

The sinister aspect of sector tearing is that most applications don't
know they have this dependency.  At least today's disk's rarely ever
tear sectors and if they do you almost certainly get a CRC error on
access.  NVDIMMs will always tear and always silently.  I think not
merging BTT at all to see what happens is simply wrong.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Jeff Moyer June 22, 2015, 3:02 p.m. UTC | #10
Dan Williams <dan.j.williams@intel.com> writes:

>> Sounds like we simply shouldn't merge btt at all for now and wait for
>> a real use case, which would simplify the whole issue a lot.
>
> The sinister aspect of sector tearing is that most applications don't
> know they have this dependency.  At least today's disk's rarely ever
> tear sectors and if they do you almost certainly get a CRC error on
> access.  NVDIMMs will always tear and always silently.  I think not
> merging BTT at all to see what happens is simply wrong.

Agreed, we can't audit all code, and springing this potential data
corruptor on people seems irresponsible.

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 3:40 p.m. UTC | #11
On Mon, Jun 22, 2015 at 12:39:34AM -0700, Dan Williams wrote:
> Now I'm confused, you *don't* want the raw device to be hidden *and*
> you want to kill the stacking?  Something got crossed.  The current
> implementation hides nothing, you get to see the entire stacked
> composition.  I'd much prefer to avoid hiding anything.

You see it, but you can't actually use it.  The proper way to expose it
would be to the devices visible in the low-level bus sysfs enumeration
logic but only one ULD attach to it.

> The sinister aspect of sector tearing is that most applications don't
> know they have this dependency.  At least today's disk's rarely ever
> tear sectors and if they do you almost certainly get a CRC error on
> access.  NVDIMMs will always tear and always silently.  I think not
> merging BTT at all to see what happens is simply wrong.

So now you leave your users with a choice between a rock and a hard
place, that is using BTT to introduce non-significant overhead and not
supporting DAX or just use it as-is.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 3:41 p.m. UTC | #12
On Mon, Jun 22, 2015 at 11:02:24AM -0400, Jeff Moyer wrote:
> Agreed, we can't audit all code, and springing this potential data
> corruptor on people seems irresponsible.

How do "the people" know they'd have to use btt in the current setup
without auditing their stack first?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Jeff Moyer June 22, 2015, 4 p.m. UTC | #13
Christoph Hellwig <hch@lst.de> writes:

> On Mon, Jun 22, 2015 at 11:02:24AM -0400, Jeff Moyer wrote:
>> Agreed, we can't audit all code, and springing this potential data
>> corruptor on people seems irresponsible.
>
> How do "the people" know they'd have to use btt in the current setup
> without auditing their stack first?

Right now, the guidance should be to always use btt since there are no
applications that are directly taking advantage of persistent memory
(that I know).  I expect documentation would take care of that.  I also
expect that, as applications add support, they would note the
requirement for dax mountpoints in their documentation.

So, "the people" find out the same way they always have.  :)

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 4:32 p.m. UTC | #14
On Mon, Jun 22, 2015 at 12:00:54PM -0400, Jeff Moyer wrote:
> Right now, the guidance should be to always use btt since there are no
> applications that are directly taking advantage of persistent memory
> (that I know).  I expect documentation would take care of that.  I also
> expect that, as applications add support, they would note the
> requirement for dax mountpoints in their documentation.

It's not just DAX.  Avoiding the overhead for anything else is
another good reason.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 22, 2015, 4:36 p.m. UTC | #15
On Mon, Jun 22, 2015 at 8:40 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 12:39:34AM -0700, Dan Williams wrote:
>> Now I'm confused, you *don't* want the raw device to be hidden *and*
>> you want to kill the stacking?  Something got crossed.  The current
>> implementation hides nothing, you get to see the entire stacked
>> composition.  I'd much prefer to avoid hiding anything.
>
> You see it, but you can't actually use it.  The proper way to expose it
> would be to the devices visible in the low-level bus sysfs enumeration
> logic but only one ULD attach to it.
>

In that case "don't stack" is too coarse of a hammer.  I see this as a
request to hide the subordinate ULD which is a new capability that DM
and MD might benefit from as well.  We already have the case in MD
where it internally holds a reference to bdev that has been hot
removed, it seems not much of a stretch to have stacking drivers be
able to hide device nodes for bdevs that they are holding.

>> The sinister aspect of sector tearing is that most applications don't
>> know they have this dependency.  At least today's disk's rarely ever
>> tear sectors and if they do you almost certainly get a CRC error on
>> access.  NVDIMMs will always tear and always silently.  I think not
>> merging BTT at all to see what happens is simply wrong.
>
> So now you leave your users with a choice between a rock and a hard
> place, that is using BTT to introduce non-significant overhead and not
> supporting DAX or just use it as-is.

Yes, if they want to use DAX they should do it consciously and audit
their application to be sure it is safe to abandon atomic sector
guarantees.  With the current flexibility to do BTT on a partition
they can do this conversion piecemeal and, for example, keep metadata
on BTT and data on DAX.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Jeff Moyer June 22, 2015, 4:42 p.m. UTC | #16
Christoph Hellwig <hch@lst.de> writes:

> On Mon, Jun 22, 2015 at 12:00:54PM -0400, Jeff Moyer wrote:
>> Right now, the guidance should be to always use btt since there are no
>> applications that are directly taking advantage of persistent memory
>> (that I know).  I expect documentation would take care of that.  I also
>> expect that, as applications add support, they would note the
>> requirement for dax mountpoints in their documentation.
>
> It's not just DAX.  Avoiding the overhead for anything else is
> another good reason.

OK, add torn sector detection/recovery to that statement, then.  More
importantly, do you agree with the sentiment or not?

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 4:45 p.m. UTC | #17
On Mon, Jun 22, 2015 at 09:36:50AM -0700, Dan Williams wrote:
> In that case "don't stack" is too coarse of a hammer.  I see this as a
> request to hide the subordinate ULD which is a new capability that DM
> and MD might benefit from as well.  We already have the case in MD
> where it internally holds a reference to bdev that has been hot
> removed, it seems not much of a stretch to have stacking drivers be
> able to hide device nodes for bdevs that they are holding.

I don't see why you're comparing with MD and DM here.  MD and DM
sit cleanly ontop of any block device.  If btt was independent of
libnvdimm and just used ->rw_bytes we could see it as this.

But it's all a giant entangled mess, where btt for example is probed
by libnvdimm.  At the same time pmem.c isn't really a true block
driver, it's really just a trivial shim between the block API
and pmem-style memcpy.  Especially with the proper pmem API btt
would become cleaner just calling that directly.  

> Yes, if they want to use DAX they should do it consciously and audit
> their application to be sure it is safe to abandon atomic sector
> guarantees.  With the current flexibility to do BTT on a partition
> they can do this conversion piecemeal and, for example, keep metadata
> on BTT and data on DAX.

By that logic you'd want to attach BTT by default and allow opt-out
at some level.  This could be a libnvmdimm-level partitioning scheme,
which would also allow storing the bit if BTT is used or not persistently.
Or it could be on fine grained boundaries which might be more useful.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 4:48 p.m. UTC | #18
On Mon, Jun 22, 2015 at 12:42:44PM -0400, Jeff Moyer wrote:
> OK, add torn sector detection/recovery to that statement, then.  More
> importantly, do you agree with the sentiment or not?

I think we're getting on a very slipper slope if we think about
application here.  Buffered I/O application must deal with torn
writes at any granulairty anyway, e.g. fsync + rename is the
only thing they can rely on right now (I actually have software O_ATOMIC
code to avoid this, but that's another story).

Direct I/O using application can make assumption if they know the sector
size, and we must have a way for them to be able to see our new
"subsector sector size".  And thos application are few inbetween but
also important so needing special cases for them is fine.  Although those
are the most likely ones to take advantage of byte addressing anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 22, 2015, 4:54 p.m. UTC | #19
On Mon, Jun 22, 2015 at 9:45 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 09:36:50AM -0700, Dan Williams wrote:
>> In that case "don't stack" is too coarse of a hammer.  I see this as a
>> request to hide the subordinate ULD which is a new capability that DM
>> and MD might benefit from as well.  We already have the case in MD
>> where it internally holds a reference to bdev that has been hot
>> removed, it seems not much of a stretch to have stacking drivers be
>> able to hide device nodes for bdevs that they are holding.
>
> I don't see why you're comparing with MD and DM here.  MD and DM
> sit cleanly ontop of any block device.  If btt was independent of
> libnvdimm and just used ->rw_bytes we could see it as this.
>
> But it's all a giant entangled mess, where btt for example is probed
> by libnvdimm.  At the same time pmem.c isn't really a true block
> driver, it's really just a trivial shim between the block API
> and pmem-style memcpy.  Especially with the proper pmem API btt
> would become cleaner just calling that directly.

The pmem api does nothing to fix torn sectors, there's no extra
atomicity guarantees that come from those instructions.

>> Yes, if they want to use DAX they should do it consciously and audit
>> their application to be sure it is safe to abandon atomic sector
>> guarantees.  With the current flexibility to do BTT on a partition
>> they can do this conversion piecemeal and, for example, keep metadata
>> on BTT and data on DAX.
>
> By that logic you'd want to attach BTT by default and allow opt-out
> at some level.  This could be a libnvmdimm-level partitioning scheme,
> which would also allow storing the bit if BTT is used or not persistently.
> Or it could be on fine grained boundaries which might be more useful.

Well, let's start with per-disk btt and see where that gets us, we can
always ramp up complexity later.  I'd just as soon make the default
opt-in/out a Kconfig toggle with a sysfs override.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 22, 2015, 4:57 p.m. UTC | #20
On Mon, Jun 22, 2015 at 09:54:51AM -0700, Dan Williams wrote:
> > I don't see why you're comparing with MD and DM here.  MD and DM
> > sit cleanly ontop of any block device.  If btt was independent of
> > libnvdimm and just used ->rw_bytes we could see it as this.
> >
> > But it's all a giant entangled mess, where btt for example is probed
> > by libnvdimm.  At the same time pmem.c isn't really a true block
> > driver, it's really just a trivial shim between the block API
> > and pmem-style memcpy.  Especially with the proper pmem API btt
> > would become cleaner just calling that directly.
> 
> The pmem api does nothing to fix torn sectors, there's no extra
> atomicity guarantees that come from those instructions.

Of course not.  And neither does pmem.c help with you in any way.

That's the point:  btt should be a peer to pmem.c, not on top of it
as there's no value add in pmem.c for it, and they are logically peers.

> Well, let's start with per-disk btt and see where that gets us, we can
> always ramp up complexity later.  I'd just as soon make the default
> opt-in/out a Kconfig toggle with a sysfs override.

Kconfig or sysfs are both utterly horrible choices.  It's a disk format
choice so it needs to be persisted.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 22, 2015, 4:59 p.m. UTC | #21
On Mon, Jun 22, 2015 at 9:57 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 09:54:51AM -0700, Dan Williams wrote:
>> > I don't see why you're comparing with MD and DM here.  MD and DM
>> > sit cleanly ontop of any block device.  If btt was independent of
>> > libnvdimm and just used ->rw_bytes we could see it as this.
>> >
>> > But it's all a giant entangled mess, where btt for example is probed
>> > by libnvdimm.  At the same time pmem.c isn't really a true block
>> > driver, it's really just a trivial shim between the block API
>> > and pmem-style memcpy.  Especially with the proper pmem API btt
>> > would become cleaner just calling that directly.
>>
>> The pmem api does nothing to fix torn sectors, there's no extra
>> atomicity guarantees that come from those instructions.
>
> Of course not.  And neither does pmem.c help with you in any way.
>
> That's the point:  btt should be a peer to pmem.c, not on top of it
> as there's no value add in pmem.c for it, and they are logically peers.
>
>> Well, let's start with per-disk btt and see where that gets us, we can
>> always ramp up complexity later.  I'd just as soon make the default
>> opt-in/out a Kconfig toggle with a sysfs override.
>
> Kconfig or sysfs are both utterly horrible choices.  It's a disk format
> choice so it needs to be persisted.

Of course it will be persisted with an on disk BTT superblock.
Establishing that superblock by default and deleting at on-demand are
via Kconfig and sysfs.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Jeff Moyer June 22, 2015, 6:48 p.m. UTC | #22
Christoph Hellwig <hch@lst.de> writes:

> On Mon, Jun 22, 2015 at 12:42:44PM -0400, Jeff Moyer wrote:
>> OK, add torn sector detection/recovery to that statement, then.  More
>> importantly, do you agree with the sentiment or not?
>
> I think we're getting on a very slipper slope if we think about
> application here.  Buffered I/O application must deal with torn
> writes at any granulairty anyway, e.g. fsync + rename is the
> only thing they can rely on right now (I actually have software O_ATOMIC
> code to avoid this, but that's another story).

OK, so you think applications using buffered I/O will Just Work(TM)?  My
guess is that things will start to break that hadn't broken in the
past.  Sure, the application isn't designed properly, and that should be
fixed, but we shouldn't foist this on users as the default.

> Direct I/O using application can make assumption if they know the sector
> size, and we must have a way for them to be able to see our new
> "subsector sector size".

You need to let them determine that when NOT using the btt, yes.  Right
now, I don't think there's a way to determine what the underlying atomic
write unit is.  That's something the NFIT spec probably should have
defined.

> And thos application are few inbetween but also important so needing
> special cases for them is fine.  Although those are the most likely
> ones to take advantage of byte addressing anyway.

Agreed.

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Dan Williams June 22, 2015, 7:04 p.m. UTC | #23
On Mon, Jun 22, 2015 at 11:48 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Christoph Hellwig <hch@lst.de> writes:
>
>> On Mon, Jun 22, 2015 at 12:42:44PM -0400, Jeff Moyer wrote:
>>> OK, add torn sector detection/recovery to that statement, then.  More
>>> importantly, do you agree with the sentiment or not?
>>
>> I think we're getting on a very slipper slope if we think about
>> application here.  Buffered I/O application must deal with torn
>> writes at any granulairty anyway, e.g. fsync + rename is the
>> only thing they can rely on right now (I actually have software O_ATOMIC
>> code to avoid this, but that's another story).
>
> OK, so you think applications using buffered I/O will Just Work(TM)?  My
> guess is that things will start to break that hadn't broken in the
> past.  Sure, the application isn't designed properly, and that should be
> fixed, but we shouldn't foist this on users as the default.
>
>> Direct I/O using application can make assumption if they know the sector
>> size, and we must have a way for them to be able to see our new
>> "subsector sector size".
>
> You need to let them determine that when NOT using the btt, yes.  Right
> now, I don't think there's a way to determine what the underlying atomic
> write unit is.  That's something the NFIT spec probably should have
> defined.

There are no atomic write units for NFIT to advertise beyond cpu register width.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Jeff Moyer June 22, 2015, 7:11 p.m. UTC | #24
Dan Williams <dan.j.williams@intel.com> writes:

>>> Direct I/O using application can make assumption if they know the sector
>>> size, and we must have a way for them to be able to see our new
>>> "subsector sector size".
>>
>> You need to let them determine that when NOT using the btt, yes.  Right
>> now, I don't think there's a way to determine what the underlying atomic
>> write unit is.  That's something the NFIT spec probably should have
>> defined.
>
> There are no atomic write units for NFIT to advertise beyond cpu register width.

That would be useful information for the platform to provide, instead of
requiring the o/s or applications to infer it.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
Christoph Hellwig June 23, 2015, 10:10 a.m. UTC | #25
On Mon, Jun 22, 2015 at 02:48:48PM -0400, Jeff Moyer wrote:
> OK, so you think applications using buffered I/O will Just Work(TM)?  My
> guess is that things will start to break that hadn't broken in the
> past.  Sure, the application isn't designed properly, and that should be
> fixed, but we shouldn't foist this on users as the default.

They will work or break the same way as before.  We have all kinds of
cases of non-sector updates for buffered I/O anyway: inl?ne data, tail
merging, etc.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Wilcox June 23, 2015, 7:30 p.m. UTC | #26
On Mon, Jun 22, 2015 at 08:30:28AM +0200, Christoph Hellwig wrote:
> > Good to hear that we don't need BTT for XFS v5, can we make the
> > guarantee for all filesystems that may want to support DAX?  I still
> > think stacking is a natural fit for this problem.
> 
> I can't make any guarantees, especially not without verification.  But
> if correctly implemented any filesystems that does out of place metadata
> writes (and that includes a traditional log) and uses checksum to ensure
> the integrity of these updates it should be fine.  You'd still have
> the issue of sector atomicy of file I/O though.

Is ext4 one of the filesystems that copes with torn updates to the log?
I see there's a checksum in the tail of at least some blocks, but I'd
like someone who understands ext4 to reassure me that it also doesn't
need the ability to put its log on a BTT.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig June 24, 2015, 12:11 p.m. UTC | #27
On Tue, Jun 23, 2015 at 03:30:43PM -0400, Matthew Wilcox wrote:
> > I can't make any guarantees, especially not without verification.  But
> > if correctly implemented any filesystems that does out of place metadata
> > writes (and that includes a traditional log) and uses checksum to ensure
> > the integrity of these updates it should be fine.  You'd still have
> > the issue of sector atomicy of file I/O though.
> 
> Is ext4 one of the filesystems that copes with torn updates to the log?
> I see there's a checksum in the tail of at least some blocks, but I'd
> like someone who understands ext4 to reassure me that it also doesn't
> need the ability to put its log on a BTT.

In theory it should if the log checksums are enabled, but I wouldn't rely
on it without without confirmation and validation from the ext4 folks.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 8a65e5a500d8..adacc27f04f1 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -239,6 +239,10 @@  static int nd_blk_rw_bytes(struct gendisk *disk, resource_size_t offset,
 static const struct block_device_operations nd_blk_fops = {
 	.owner = THIS_MODULE,
 	.rw_bytes = nd_blk_rw_bytes,
+	.ioctl = nvdimm_bdev_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = nvdimm_bdev_compat_ioctl,
+#endif
 };
 
 static int nd_blk_probe(struct device *dev)
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 67484633c322..57d3b271e451 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1248,10 +1248,29 @@  static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
 	return 0;
 }
 
+static int btt_revalidate_disk(struct gendisk *disk)
+{
+	struct btt *btt = disk->private_data;
+	struct nd_btt *nd_btt = btt->nd_btt;
+	struct block_device *bdev = nd_btt->backing_dev;
+	char name[BDEVNAME_SIZE];
+
+	dev_dbg(&nd_btt->dev, "backing dev: %s read-%s", bdevname(bdev, name),
+			bdev_read_only(bdev) ? "only" : "write");
+	if (bdev_read_only(bdev))
+		set_disk_ro(disk, 1);
+	return 0;
+}
+
 static const struct block_device_operations btt_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		btt_rw_page,
 	.getgeo =		btt_getgeo,
+	.ioctl =		nvdimm_bdev_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl =		nvdimm_bdev_compat_ioctl,
+#endif
+	.revalidate_disk =	btt_revalidate_disk,
 };
 
 static int btt_blk_init(struct btt *btt)
@@ -1296,6 +1315,7 @@  static int btt_blk_init(struct btt *btt)
 	}
 
 	set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
+	revalidate_disk(btt->btt_disk);
 
 	return 0;
 
@@ -1335,6 +1355,7 @@  static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
 	int ret;
 	struct btt *btt;
 	struct device *dev = &nd_btt->dev;
+	struct block_device *bdev = nd_btt->backing_dev;
 
 	btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
 	if (!btt)
@@ -1354,7 +1375,13 @@  static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
 		goto out_free;
 	}
 
-	if (btt->init_state != INIT_READY) {
+	if (btt->init_state != INIT_READY && bdev_read_only(bdev)) {
+		char name[BDEVNAME_SIZE];
+
+		dev_info(dev, "%s is read-only, unable to init btt metadata\n",
+				bdevname(bdev, name));
+		goto out_free;
+	} else if (btt->init_state != INIT_READY) {
 		btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
 			((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
 		dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
@@ -1369,7 +1396,7 @@  static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
 		ret = btt_meta_init(btt);
 		if (ret) {
 			dev_err(dev, "init: error in meta_init: %d\n", ret);
-			return NULL;
+			goto out_free;
 		}
 	}
 
@@ -1481,7 +1508,10 @@  static int nd_btt_remove(struct device *dev)
 	struct nd_btt *nd_btt = to_nd_btt(dev);
 	struct btt *btt = dev_get_drvdata(dev);
 
+	nvdimm_bus_lock(dev);
 	btt_fini(btt);
+	nvdimm_bus_unlock(dev);
+
 	unlink_btt(nd_btt);
 
 	return 0;
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 02e125b91e77..bcf77dca1532 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -122,7 +122,7 @@  static ssize_t backing_dev_show(struct device *dev,
 		return sprintf(buf, "\n");
 }
 
-static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_EXCL;
 
 static void nd_btt_remove_bdev(struct nd_btt *nd_btt, const char *caller)
 {
@@ -363,6 +363,46 @@  u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
 }
 EXPORT_SYMBOL(nd_btt_sb_checksum);
 
+int set_btt_ro(struct block_device *bdev, struct device *dev, int ro)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (!dev->driver)
+		return 0;
+
+	/* we can only mark a btt device rw if its backing device is rw */
+	if (bdev_read_only(nd_btt->backing_dev) && !ro)
+		return -EBUSY;
+
+	set_device_ro(bdev, ro);
+	return 0;
+}
+
+int set_btt_disk_ro(struct device *dev, void *data)
+{
+	struct block_device *bdev = data;
+	struct nd_btt *nd_btt;
+	struct btt *btt;
+
+	if (!is_nd_btt(dev))
+		return 0;
+
+	nd_btt = to_nd_btt(dev);
+	if (nd_btt->backing_dev != bdev)
+		return 0;
+
+	/*
+	 * We have the lock at this point and have flushed probing.  We
+	 * are guaranteed that the btt driver is unbound, or has
+	 * completed setup operations and is blocked from initiating
+	 * disk teardown until we are done walking these pointers.
+	 */
+	btt = dev_get_drvdata(dev);
+	if (btt && btt->btt_disk)
+		set_disk_ro(btt->btt_disk, 1);
+	return 0;
+}
+
 static struct nd_btt *__nd_btt_autodetect(struct nvdimm_bus *nvdimm_bus,
 		struct block_device *bdev, struct btt_sb *btt_sb)
 {
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index d4fbc48f5643..47260ca573e0 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -309,6 +309,61 @@  void nvdimm_bus_remove_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(nvdimm_bus_remove_disk);
 
+static int set_namespace_ro(struct block_device *bdev,
+		struct nvdimm_bus *nvdimm_bus, int ro)
+{
+	set_device_ro(bdev, ro);
+
+	/*
+	 * It's possible to mark the backing device rw while leaving the
+	 * btt device read-only.  However, marking a backing device
+	 * read-only always marks the parent btt read-only.
+	 */
+	if (!ro)
+		return 0;
+	return device_for_each_child(&nvdimm_bus->dev, bdev, set_btt_disk_ro);
+}
+
+int nvdimm_bdev_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	int rc, ro;
+	struct gendisk *disk = bdev->bd_disk;
+	struct device *dev = disk->driverfs_dev;
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+
+	if (cmd != BLKROSET)
+		return -ENOTTY;
+
+	if (get_user(ro, (int __user *)(arg)))
+		return -EFAULT;
+
+	if (ro == 0 || ro == 1)
+		/* pass */;
+	else
+		return -EINVAL;
+
+	nvdimm_bus_lock(&nvdimm_bus->dev);
+	wait_nvdimm_bus_probe_idle(&nvdimm_bus->dev);
+	if (bdev_read_only(bdev) == ro)
+		rc = 0;
+	else if (is_nd_btt(dev))
+		rc = set_btt_ro(bdev, dev, ro);
+	else
+		rc = set_namespace_ro(bdev, nvdimm_bus, ro);
+	nvdimm_bus_unlock(&nvdimm_bus->dev);
+
+	return rc;
+}
+EXPORT_SYMBOL(nvdimm_bdev_ioctl);
+
+int nvdimm_bdev_compat_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	return nvdimm_bdev_ioctl(bdev, mode, cmd, arg);
+}
+EXPORT_SYMBOL(nvdimm_bdev_compat_ioctl);
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 9a90915e6fd2..ba548d248b4e 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -49,11 +49,14 @@  bool is_nvdimm(struct device *dev);
 bool is_nd_pmem(struct device *dev);
 bool is_nd_blk(struct device *dev);
 struct gendisk;
+struct block_device;
 #if IS_ENABLED(CONFIG_ND_BTT_DEVS)
 bool is_nd_btt(struct device *dev);
 struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus);
 void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
 void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
+int set_btt_ro(struct block_device *bdev, struct device *dev, int ro);
+int set_btt_disk_ro(struct device *dev, void *data);
 #else
 static inline bool is_nd_btt(struct device *dev)
 {
@@ -74,6 +77,17 @@  static inline void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus,
 		struct gendisk *disk)
 {
 }
+
+static inline int set_btt_ro(struct block_device *bdev, struct device *dev,
+		int ro)
+{
+	return 0;
+}
+
+static inline int set_btt_disk_ro(struct device *dev, void *data)
+{
+	return 0;
+}
 #endif
 struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
 int __init nvdimm_bus_init(void);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 3c4c8b6c64ec..2786eb8456ec 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -164,6 +164,10 @@  int nvdimm_bus_add_disk(struct gendisk *disk);
 int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
 		sector_t size);
 void nvdimm_bus_remove_disk(struct gendisk *disk);
+int nvdimm_bdev_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg);
+int nvdimm_bdev_compat_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);
 int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3fd854a78f09..96964419b72d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -131,6 +131,10 @@  static const struct block_device_operations pmem_fops = {
 	.rw_page =		pmem_rw_page,
 	.rw_bytes =		pmem_rw_bytes,
 	.direct_access =	pmem_direct_access,
+	.ioctl =		nvdimm_bdev_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl =		nvdimm_bdev_ioctl,
+#endif
 };
 
 static struct pmem_device *pmem_alloc(struct device *dev,