Message ID | 1458861450-17705-6-git-send-email-vishal.l.verma@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote: > dax_do_io (called for read() or write() for a dax file system) may fail > in the presence of bad blocks or media errors. Since we expect that a > write should clear media errors on nvdimms, make dax_do_io fall back to > the direct_IO path, which will send down a bio to the driver, which can > then attempt to clear the error. Leave the fallback on -EIO to the callers please. They generally call __blockdev_direct_IO anyway, so it should actually become simpler that way. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2016-03-25 at 03:45 -0700, Christoph Hellwig wrote: > On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote: > > > > dax_do_io (called for read() or write() for a dax file system) may > > fail > > in the presence of bad blocks or media errors. Since we expect that > > a > > write should clear media errors on nvdimms, make dax_do_io fall > > back to > > the direct_IO path, which will send down a bio to the driver, which > > can > > then attempt to clear the error. > Leave the fallback on -EIO to the callers please. They generally > call > __blockdev_direct_IO anyway, so it should actually become simpler > that > way. I thought of this, but made the retrying happen in the wrapper so that it can be centralized. If the callers were to become responsible for the retry, then any new callers of dax_do_io might not realize they are responsible for retrying, and hit problems. Another tricky point might be: in the wrapper, if __dax_do_io failed with -EIO, and subsequently __blockdev_direct_IO also fails with a *different* error, I chose to return -EIO because that was the 'first' error we hit and caused us to fallback.. (Does this even seem reasonable?) And if so, do we want to push back this decision too, to the callers?
On Fri, Mar 25, 2016 at 1:59 PM, Verma, Vishal L <vishal.l.verma@intel.com> wrote: > On Fri, 2016-03-25 at 03:45 -0700, Christoph Hellwig wrote: >> On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote: >> > >> > dax_do_io (called for read() or write() for a dax file system) may >> > fail >> > in the presence of bad blocks or media errors. Since we expect that >> > a >> > write should clear media errors on nvdimms, make dax_do_io fall >> > back to >> > the direct_IO path, which will send down a bio to the driver, which >> > can >> > then attempt to clear the error. >> Leave the fallback on -EIO to the callers please. They generally >> call >> __blockdev_direct_IO anyway, so it should actually become simpler >> that >> way. > > I thought of this, but made the retrying happen in the wrapper so that > it can be centralized. If the callers were to become responsible for > the retry, then any new callers of dax_do_io might not realize they are > responsible for retrying, and hit problems. That's their prerogative otherwise you are precluding an alternate handling of a dax_do_io() failure. Maybe a fs or upper layer can recover in a different manner than re-submit the I/O to the __blockdev_direct_IO path. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
T24gRnJpLCAyMDE2LTAzLTI1IGF0IDE0OjQyIC0wNzAwLCBEYW4gV2lsbGlhbXMgd3JvdGU6DQo+ IE9uIEZyaSwgTWFyIDI1LCAyMDE2IGF0IDE6NTkgUE0sIFZlcm1hLCBWaXNoYWwgTA0KPiA8dmlz aGFsLmwudmVybWFAaW50ZWwuY29tPiB3cm90ZToNCj4gPiANCj4gPiBPbiBGcmksIDIwMTYtMDMt MjUgYXQgMDM6NDUgLTA3MDAsIENocmlzdG9waCBIZWxsd2lnIHdyb3RlOg0KPiA+ID4gDQo+ID4g PiBPbiBUaHUsIE1hciAyNCwgMjAxNiBhdCAwNToxNzozMFBNIC0wNjAwLCBWaXNoYWwgVmVybWEg d3JvdGU6DQo+ID4gPiA+IA0KPiA+ID4gPiANCj4gPiA+ID4gZGF4X2RvX2lvIChjYWxsZWQgZm9y IHJlYWQoKSBvciB3cml0ZSgpIGZvciBhIGRheCBmaWxlIHN5c3RlbSkNCj4gPiA+ID4gbWF5DQo+ ID4gPiA+IGZhaWwNCj4gPiA+ID4gaW4gdGhlIHByZXNlbmNlIG9mIGJhZCBibG9ja3Mgb3IgbWVk aWEgZXJyb3JzLiBTaW5jZSB3ZSBleHBlY3QNCj4gPiA+ID4gdGhhdA0KPiA+ID4gPiBhDQo+ID4g PiA+IHdyaXRlIHNob3VsZCBjbGVhciBtZWRpYSBlcnJvcnMgb24gbnZkaW1tcywgbWFrZSBkYXhf ZG9faW8gZmFsbA0KPiA+ID4gPiBiYWNrIHRvDQo+ID4gPiA+IHRoZSBkaXJlY3RfSU8gcGF0aCwg d2hpY2ggd2lsbCBzZW5kIGRvd24gYSBiaW8gdG8gdGhlIGRyaXZlciwNCj4gPiA+ID4gd2hpY2gN Cj4gPiA+ID4gY2FuDQo+ID4gPiA+IHRoZW4gYXR0ZW1wdCB0byBjbGVhciB0aGUgZXJyb3IuDQo+ ID4gPiBMZWF2ZSB0aGUgZmFsbGJhY2sgb24gLUVJTyB0byB0aGUgY2FsbGVycyBwbGVhc2UuwqDC oFRoZXkgZ2VuZXJhbGx5DQo+ID4gPiBjYWxsDQo+ID4gPiBfX2Jsb2NrZGV2X2RpcmVjdF9JTyBh bnl3YXksIHNvIGl0IHNob3VsZCBhY3R1YWxseSBiZWNvbWUgc2ltcGxlcg0KPiA+ID4gdGhhdA0K PiA+ID4gd2F5Lg0KPiA+IEkgdGhvdWdodCBvZiB0aGlzLCBidXQgbWFkZSB0aGUgcmV0cnlpbmcg aGFwcGVuIGluIHRoZSB3cmFwcGVyIHNvDQo+ID4gdGhhdA0KPiA+IGl0IGNhbiBiZSBjZW50cmFs aXplZC4gSWYgdGhlIGNhbGxlcnMgd2VyZSB0byBiZWNvbWUgcmVzcG9uc2libGUNCj4gPiBmb3IN Cj4gPiB0aGUgcmV0cnksIHRoZW4gYW55IG5ldyBjYWxsZXJzIG9mIGRheF9kb19pbyBtaWdodCBu b3QgcmVhbGl6ZSB0aGV5DQo+ID4gYXJlDQo+ID4gcmVzcG9uc2libGUgZm9yIHJldHJ5aW5nLCBh bmQgaGl0IHByb2JsZW1zLg0KPiBUaGF0J3MgdGhlaXIgcHJlcm9nYXRpdmUgb3RoZXJ3aXNlIHlv dSBhcmUgcHJlY2x1ZGluZyBhbiBhbHRlcm5hdGUNCj4gaGFuZGxpbmcgb2YgYSBkYXhfZG9faW8o KSBmYWlsdXJlLsKgwqBNYXliZSBhIGZzIG9yIHVwcGVyIGxheWVyIGNhbg0KPiByZWNvdmVyIGlu IGEgZGlmZmVyZW50IG1hbm5lciB0aGFuIHJlLXN1Ym1pdCB0aGUgSS9PIHRvIHRoZQ0KPiBfX2Js b2NrZGV2X2RpcmVjdF9JTyBwYXRoLg0KDQpJJ20gaGFwcHkgdG8gbWFrZSB0aGUgY2hhbmdlLCBi dXQgd2UgZG9uJ3QgcHJlY2x1ZGUgdGhhdCAtLSBfX2RheF9kb19pbw0KaXMgc3RpbGwgZXhwb3J0 ZWQgYW5kIGF2YWlsYWJsZS4u -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 25, 2016 at 02:42:37PM -0700, Dan Williams wrote: > That's their prerogative otherwise you are precluding an alternate > handling of a dax_do_io() failure. Maybe a fs or upper layer can > recover in a different manner than re-submit the I/O to the > __blockdev_direct_IO path. Let's keep the interface separate because they are, well separate. There is a reason direct I/O falls back to buffered I/O by returning and error if it can't handle it instead of handling all the magic. I also really want to get rid of get_block as soon as possible for DAX and direct I/O. For DAX that should actually be possible really quickly, while direct I/O might take some time and will be have to be gradual. So tighter integration of the two interface is not just bad design, but actively harmful at this point in time. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/block_dev.c b/fs/block_dev.c index 9c0765b..f3873ab 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -168,8 +168,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset) struct inode *inode = bdev_file_inode(file); if (IS_DAX(inode)) - return dax_do_io(iocb, inode, iter, offset, blkdev_get_block, - NULL, DIO_SKIP_DIO_COUNT); + return dax_do_io(iocb, inode, I_BDEV(inode), iter, offset, + blkdev_get_block, blkdev_get_block, + NULL, NULL, DIO_SKIP_DIO_COUNT); return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset, blkdev_get_block, NULL, NULL, DIO_SKIP_DIO_COUNT); diff --git a/fs/dax.c b/fs/dax.c index a30481e..b90c8e9 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -208,7 +208,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, } /** - * dax_do_io - Perform I/O to a DAX file + * __dax_do_io - Perform I/O to a DAX file * @iocb: The control block for this I/O * @inode: The file which the I/O is directed at * @iter: The addresses to do I/O from or to @@ -224,7 +224,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O * is in progress. */ -ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode, +ssize_t __dax_do_io(struct kiocb *iocb, struct inode *inode, struct iov_iter *iter, loff_t pos, get_block_t get_block, dio_iodone_t end_io, int flags) { @@ -262,8 +262,38 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode, out: return retval; } +EXPORT_SYMBOL_GPL(__dax_do_io); + +/* + * This is a library function for use by file systems. It will perform a + * fallback to direct_io semantics if the dax_io fails due to a media error. + */ +ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode, + struct block_device *bdev, struct iov_iter *iter, loff_t pos, + get_block_t dax_get_block, get_block_t dio_get_block, + dio_iodone_t end_io, dio_submit_t submit_io, int flags) +{ + ssize_t retval; + + retval = __dax_do_io(iocb, inode, iter, pos, dax_get_block, end_io, + flags); + if (iov_iter_rw(iter) == WRITE && retval == -EIO) { + /* + * __dax_do_io may have failed a write due to a bad block. + * Retry with direct_io, and if the direct_IO also fails, + * return -EIO as that was the original error that led us + * down the direct_IO path. + */ + retval = __blockdev_direct_IO(iocb, inode, bdev, iter, pos, + dio_get_block, end_io, submit_io, flags); + if (retval < 0) + return -EIO; + } + return retval; +} EXPORT_SYMBOL_GPL(dax_do_io); + /* * The user has performed a load from a hole in the file. Allocating * a new page in the file would cause excessive storage usage for diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 824f249..8a307cf 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -862,8 +862,9 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset) ssize_t ret; if (IS_DAX(inode)) - ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL, - DIO_LOCKING); + ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter, + offset, ext2_get_block, ext2_get_block, + NULL, NULL, DIO_LOCKING | DIO_SKIP_HOLES); else ret = blockdev_direct_IO(iocb, inode, iter, offset, ext2_get_block); diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 355ef9c..4b087b7 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -692,8 +692,9 @@ retry: goto locked; } if (IS_DAX(inode)) - ret = dax_do_io(iocb, inode, iter, offset, - ext4_get_block, NULL, 0); + ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter, + offset, ext4_get_block, ext4_get_block, + NULL, NULL, 0); else ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter, @@ -703,8 +704,10 @@ retry: } else { locked: if (IS_DAX(inode)) - ret = dax_do_io(iocb, inode, iter, offset, - ext4_get_block, NULL, DIO_LOCKING); + ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter, + offset, ext4_get_block, ext4_get_block, + NULL, NULL, DIO_LOCKING | + DIO_SKIP_HOLES); else ret = blockdev_direct_IO(iocb, inode, iter, offset, ext4_get_block); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index aee960b..4220dac 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3315,8 +3315,9 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter, BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode)); #endif if (IS_DAX(inode)) - ret = dax_do_io(iocb, inode, iter, offset, get_block_func, - ext4_end_io_dio, dio_flags); + ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter, offset, + get_block_func, get_block_func, + ext4_end_io_dio, NULL, dio_flags); else ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter, offset, diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index a9ebabfe..dc4e088 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1682,11 +1682,12 @@ xfs_vm_do_dio( void *private), int flags) { - struct block_device *bdev; + struct block_device *bdev = xfs_find_bdev_for_inode(inode); if (IS_DAX(inode)) - return dax_do_io(iocb, inode, iter, offset, - xfs_get_blocks_direct, endio, 0); + return dax_do_io(iocb, inode, bdev, iter, offset, + xfs_get_blocks_direct, xfs_get_blocks_direct, + endio, NULL, flags); bdev = xfs_find_bdev_for_inode(inode); return __blockdev_direct_IO(iocb, inode, bdev, iter, offset, diff --git a/include/linux/dax.h b/include/linux/dax.h index 933198a..6981076 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,8 +5,12 @@ #include <linux/mm.h> #include <asm/pgtable.h> -ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, +ssize_t __dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); +ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode, + struct block_device *bdev, struct iov_iter *iter, loff_t pos, + get_block_t dax_get_block, get_block_t dio_get_block, + dio_iodone_t end_io, dio_submit_t submit_io, int flags); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
dax_do_io (called for read() or write() for a dax file system) may fail in the presence of bad blocks or media errors. Since we expect that a write should clear media errors on nvdimms, make dax_do_io fall back to the direct_IO path, which will send down a bio to the driver, which can then attempt to clear the error. Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> --- fs/block_dev.c | 5 +++-- fs/dax.c | 34 ++++++++++++++++++++++++++++++++-- fs/ext2/inode.c | 5 +++-- fs/ext4/indirect.c | 11 +++++++---- fs/ext4/inode.c | 5 +++-- fs/xfs/xfs_aops.c | 7 ++++--- include/linux/dax.h | 6 +++++- 7 files changed, 57 insertions(+), 16 deletions(-)