diff mbox

[5/5] dax: handle media errors in dax_do_io

Message ID 1458861450-17705-6-git-send-email-vishal.l.verma@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Verma, Vishal L March 24, 2016, 11:17 p.m. UTC
dax_do_io (called for read() or write() for a dax file system) may fail
in the presence of bad blocks or media errors. Since we expect that a
write should clear media errors on nvdimms, make dax_do_io fall back to
the direct_IO path, which will send down a bio to the driver, which can
then attempt to clear the error.

Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 fs/block_dev.c      |  5 +++--
 fs/dax.c            | 34 ++++++++++++++++++++++++++++++++--
 fs/ext2/inode.c     |  5 +++--
 fs/ext4/indirect.c  | 11 +++++++----
 fs/ext4/inode.c     |  5 +++--
 fs/xfs/xfs_aops.c   |  7 ++++---
 include/linux/dax.h |  6 +++++-
 7 files changed, 57 insertions(+), 16 deletions(-)

Comments

Christoph Hellwig March 25, 2016, 10:45 a.m. UTC | #1
On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote:
> dax_do_io (called for read() or write() for a dax file system) may fail
> in the presence of bad blocks or media errors. Since we expect that a
> write should clear media errors on nvdimms, make dax_do_io fall back to
> the direct_IO path, which will send down a bio to the driver, which can
> then attempt to clear the error.

Leave the fallback on -EIO to the callers please.  They generally call
__blockdev_direct_IO anyway, so it should actually become simpler that
way.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Verma, Vishal L March 25, 2016, 8:59 p.m. UTC | #2
On Fri, 2016-03-25 at 03:45 -0700, Christoph Hellwig wrote:
> On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote:

> > 

> > dax_do_io (called for read() or write() for a dax file system) may

> > fail

> > in the presence of bad blocks or media errors. Since we expect that

> > a

> > write should clear media errors on nvdimms, make dax_do_io fall

> > back to

> > the direct_IO path, which will send down a bio to the driver, which

> > can

> > then attempt to clear the error.

> Leave the fallback on -EIO to the callers please.  They generally

> call

> __blockdev_direct_IO anyway, so it should actually become simpler

> that

> way.


I thought of this, but made the retrying happen in the wrapper so that
it can be centralized. If the callers were to become responsible for
the retry, then any new callers of dax_do_io might not realize they are
responsible for retrying, and hit problems. Another tricky point might
be: in the wrapper, if __dax_do_io failed with -EIO, and subsequently
__blockdev_direct_IO also fails with a *different* error, I chose to
return -EIO because that was the 'first' error we hit and caused us to
fallback.. (Does this even seem reasonable?) And if so, do we want to
push back this decision too, to the callers?
Dan Williams March 25, 2016, 9:42 p.m. UTC | #3
On Fri, Mar 25, 2016 at 1:59 PM, Verma, Vishal L
<vishal.l.verma@intel.com> wrote:
> On Fri, 2016-03-25 at 03:45 -0700, Christoph Hellwig wrote:
>> On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote:
>> >
>> > dax_do_io (called for read() or write() for a dax file system) may
>> > fail
>> > in the presence of bad blocks or media errors. Since we expect that
>> > a
>> > write should clear media errors on nvdimms, make dax_do_io fall
>> > back to
>> > the direct_IO path, which will send down a bio to the driver, which
>> > can
>> > then attempt to clear the error.
>> Leave the fallback on -EIO to the callers please.  They generally
>> call
>> __blockdev_direct_IO anyway, so it should actually become simpler
>> that
>> way.
>
> I thought of this, but made the retrying happen in the wrapper so that
> it can be centralized. If the callers were to become responsible for
> the retry, then any new callers of dax_do_io might not realize they are
> responsible for retrying, and hit problems.

That's their prerogative otherwise you are precluding an alternate
handling of a dax_do_io() failure.  Maybe a fs or upper layer can
recover in a different manner than re-submit the I/O to the
__blockdev_direct_IO path.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Verma, Vishal L March 25, 2016, 10:36 p.m. UTC | #4
T24gRnJpLCAyMDE2LTAzLTI1IGF0IDE0OjQyIC0wNzAwLCBEYW4gV2lsbGlhbXMgd3JvdGU6DQo+
IE9uIEZyaSwgTWFyIDI1LCAyMDE2IGF0IDE6NTkgUE0sIFZlcm1hLCBWaXNoYWwgTA0KPiA8dmlz
aGFsLmwudmVybWFAaW50ZWwuY29tPiB3cm90ZToNCj4gPiANCj4gPiBPbiBGcmksIDIwMTYtMDMt
MjUgYXQgMDM6NDUgLTA3MDAsIENocmlzdG9waCBIZWxsd2lnIHdyb3RlOg0KPiA+ID4gDQo+ID4g
PiBPbiBUaHUsIE1hciAyNCwgMjAxNiBhdCAwNToxNzozMFBNIC0wNjAwLCBWaXNoYWwgVmVybWEg
d3JvdGU6DQo+ID4gPiA+IA0KPiA+ID4gPiANCj4gPiA+ID4gZGF4X2RvX2lvIChjYWxsZWQgZm9y
IHJlYWQoKSBvciB3cml0ZSgpIGZvciBhIGRheCBmaWxlIHN5c3RlbSkNCj4gPiA+ID4gbWF5DQo+
ID4gPiA+IGZhaWwNCj4gPiA+ID4gaW4gdGhlIHByZXNlbmNlIG9mIGJhZCBibG9ja3Mgb3IgbWVk
aWEgZXJyb3JzLiBTaW5jZSB3ZSBleHBlY3QNCj4gPiA+ID4gdGhhdA0KPiA+ID4gPiBhDQo+ID4g
PiA+IHdyaXRlIHNob3VsZCBjbGVhciBtZWRpYSBlcnJvcnMgb24gbnZkaW1tcywgbWFrZSBkYXhf
ZG9faW8gZmFsbA0KPiA+ID4gPiBiYWNrIHRvDQo+ID4gPiA+IHRoZSBkaXJlY3RfSU8gcGF0aCwg
d2hpY2ggd2lsbCBzZW5kIGRvd24gYSBiaW8gdG8gdGhlIGRyaXZlciwNCj4gPiA+ID4gd2hpY2gN
Cj4gPiA+ID4gY2FuDQo+ID4gPiA+IHRoZW4gYXR0ZW1wdCB0byBjbGVhciB0aGUgZXJyb3IuDQo+
ID4gPiBMZWF2ZSB0aGUgZmFsbGJhY2sgb24gLUVJTyB0byB0aGUgY2FsbGVycyBwbGVhc2UuwqDC
oFRoZXkgZ2VuZXJhbGx5DQo+ID4gPiBjYWxsDQo+ID4gPiBfX2Jsb2NrZGV2X2RpcmVjdF9JTyBh
bnl3YXksIHNvIGl0IHNob3VsZCBhY3R1YWxseSBiZWNvbWUgc2ltcGxlcg0KPiA+ID4gdGhhdA0K
PiA+ID4gd2F5Lg0KPiA+IEkgdGhvdWdodCBvZiB0aGlzLCBidXQgbWFkZSB0aGUgcmV0cnlpbmcg
aGFwcGVuIGluIHRoZSB3cmFwcGVyIHNvDQo+ID4gdGhhdA0KPiA+IGl0IGNhbiBiZSBjZW50cmFs
aXplZC4gSWYgdGhlIGNhbGxlcnMgd2VyZSB0byBiZWNvbWUgcmVzcG9uc2libGUNCj4gPiBmb3IN
Cj4gPiB0aGUgcmV0cnksIHRoZW4gYW55IG5ldyBjYWxsZXJzIG9mIGRheF9kb19pbyBtaWdodCBu
b3QgcmVhbGl6ZSB0aGV5DQo+ID4gYXJlDQo+ID4gcmVzcG9uc2libGUgZm9yIHJldHJ5aW5nLCBh
bmQgaGl0IHByb2JsZW1zLg0KPiBUaGF0J3MgdGhlaXIgcHJlcm9nYXRpdmUgb3RoZXJ3aXNlIHlv
dSBhcmUgcHJlY2x1ZGluZyBhbiBhbHRlcm5hdGUNCj4gaGFuZGxpbmcgb2YgYSBkYXhfZG9faW8o
KSBmYWlsdXJlLsKgwqBNYXliZSBhIGZzIG9yIHVwcGVyIGxheWVyIGNhbg0KPiByZWNvdmVyIGlu
IGEgZGlmZmVyZW50IG1hbm5lciB0aGFuIHJlLXN1Ym1pdCB0aGUgSS9PIHRvIHRoZQ0KPiBfX2Js
b2NrZGV2X2RpcmVjdF9JTyBwYXRoLg0KDQpJJ20gaGFwcHkgdG8gbWFrZSB0aGUgY2hhbmdlLCBi
dXQgd2UgZG9uJ3QgcHJlY2x1ZGUgdGhhdCAtLSBfX2RheF9kb19pbw0KaXMgc3RpbGwgZXhwb3J0
ZWQgYW5kIGF2YWlsYWJsZS4u
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig March 26, 2016, 4:53 p.m. UTC | #5
On Fri, Mar 25, 2016 at 02:42:37PM -0700, Dan Williams wrote:
> That's their prerogative otherwise you are precluding an alternate
> handling of a dax_do_io() failure.  Maybe a fs or upper layer can
> recover in a different manner than re-submit the I/O to the
> __blockdev_direct_IO path.

Let's keep the interface separate because they are, well separate.
There is a reason direct I/O falls back to buffered I/O by returning
and error if it can't handle it instead of handling all the magic.

I also really want to get rid of get_block as soon as possible for
DAX and direct I/O.  For DAX that should actually be possible
really quickly, while direct I/O might take some time and will
be have to be gradual.  So tighter integration of the two interface is
not just bad design, but actively harmful at this point in time.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9c0765b..f3873ab 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -168,8 +168,9 @@  blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 	struct inode *inode = bdev_file_inode(file);
 
 	if (IS_DAX(inode))
-		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
-				NULL, DIO_SKIP_DIO_COUNT);
+		return dax_do_io(iocb, inode, I_BDEV(inode), iter, offset,
+				blkdev_get_block, blkdev_get_block,
+				NULL, NULL, DIO_SKIP_DIO_COUNT);
 	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
 				    blkdev_get_block, NULL, NULL,
 				    DIO_SKIP_DIO_COUNT);
diff --git a/fs/dax.c b/fs/dax.c
index a30481e..b90c8e9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -208,7 +208,7 @@  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 }
 
 /**
- * dax_do_io - Perform I/O to a DAX file
+ * __dax_do_io - Perform I/O to a DAX file
  * @iocb: The control block for this I/O
  * @inode: The file which the I/O is directed at
  * @iter: The addresses to do I/O from or to
@@ -224,7 +224,7 @@  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
  * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
  * is in progress.
  */
-ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
+ssize_t __dax_do_io(struct kiocb *iocb, struct inode *inode,
 		  struct iov_iter *iter, loff_t pos, get_block_t get_block,
 		  dio_iodone_t end_io, int flags)
 {
@@ -262,8 +262,38 @@  ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
  out:
 	return retval;
 }
+EXPORT_SYMBOL_GPL(__dax_do_io);
+
+/*
+ * This is a library function for use by file systems. It will perform a
+ * fallback to direct_io semantics if the dax_io fails due to a media error.
+ */
+ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
+		  struct block_device *bdev, struct iov_iter *iter, loff_t pos,
+		  get_block_t dax_get_block, get_block_t dio_get_block,
+		  dio_iodone_t end_io, dio_submit_t submit_io, int flags)
+{
+	ssize_t retval;
+
+	retval = __dax_do_io(iocb, inode, iter, pos, dax_get_block, end_io,
+				flags);
+	if (iov_iter_rw(iter) == WRITE && retval == -EIO) {
+		/*
+		 * __dax_do_io may have failed a write due to a bad block.
+		 * Retry with direct_io, and if the direct_IO also fails,
+		 * return -EIO as that was the original error that led us
+		 * down the direct_IO path.
+		 */
+		retval = __blockdev_direct_IO(iocb, inode, bdev, iter, pos,
+				dio_get_block, end_io, submit_io, flags);
+		if (retval < 0)
+			return -EIO;
+	}
+	return retval;
+}
 EXPORT_SYMBOL_GPL(dax_do_io);
 
+
 /*
  * The user has performed a load from a hole in the file.  Allocating
  * a new page in the file would cause excessive storage usage for
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 824f249..8a307cf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -862,8 +862,9 @@  ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 	ssize_t ret;
 
 	if (IS_DAX(inode))
-		ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
-				DIO_LOCKING);
+		ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter,
+				offset, ext2_get_block, ext2_get_block,
+				NULL, NULL, DIO_LOCKING | DIO_SKIP_HOLES);
 	else
 		ret = blockdev_direct_IO(iocb, inode, iter, offset,
 					 ext2_get_block);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 355ef9c..4b087b7 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -692,8 +692,9 @@  retry:
 			goto locked;
 		}
 		if (IS_DAX(inode))
-			ret = dax_do_io(iocb, inode, iter, offset,
-					ext4_get_block, NULL, 0);
+			ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter,
+					offset, ext4_get_block, ext4_get_block,
+					NULL, NULL, 0);
 		else
 			ret = __blockdev_direct_IO(iocb, inode,
 						   inode->i_sb->s_bdev, iter,
@@ -703,8 +704,10 @@  retry:
 	} else {
 locked:
 		if (IS_DAX(inode))
-			ret = dax_do_io(iocb, inode, iter, offset,
-					ext4_get_block, NULL, DIO_LOCKING);
+			ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter,
+					offset, ext4_get_block, ext4_get_block,
+					NULL, NULL, DIO_LOCKING |
+					DIO_SKIP_HOLES);
 		else
 			ret = blockdev_direct_IO(iocb, inode, iter, offset,
 						 ext4_get_block);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index aee960b..4220dac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3315,8 +3315,9 @@  static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
 #endif
 	if (IS_DAX(inode))
-		ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
-				ext4_end_io_dio, dio_flags);
+		ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter, offset,
+				get_block_func, get_block_func,
+				ext4_end_io_dio, NULL, dio_flags);
 	else
 		ret = __blockdev_direct_IO(iocb, inode,
 					   inode->i_sb->s_bdev, iter, offset,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a9ebabfe..dc4e088 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1682,11 +1682,12 @@  xfs_vm_do_dio(
 					 void		*private),
 	int			flags)
 {
-	struct block_device	*bdev;
+	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
 
 	if (IS_DAX(inode))
-		return dax_do_io(iocb, inode, iter, offset,
-				 xfs_get_blocks_direct, endio, 0);
+		return dax_do_io(iocb, inode, bdev, iter, offset,
+				 xfs_get_blocks_direct, xfs_get_blocks_direct,
+				 endio, NULL, flags);
 
 	bdev = xfs_find_bdev_for_inode(inode);
 	return  __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 933198a..6981076 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,8 +5,12 @@ 
 #include <linux/mm.h>
 #include <asm/pgtable.h>
 
-ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
+ssize_t __dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 		  get_block_t, dio_iodone_t, int flags);
+ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
+		  struct block_device *bdev, struct iov_iter *iter, loff_t pos,
+		  get_block_t dax_get_block, get_block_t dio_get_block,
+		  dio_iodone_t end_io, dio_submit_t submit_io, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,