[3/5] block: Add support for DAX on block devices

Message ID 1435608152-6982-4-git-send-email-matthew.r.wilcox@intel.com
State New

Commit Message

Wilcox, Matthew R June 29, 2015, 8:02 p.m. UTC
From: Matthew Wilcox <willy@linux.intel.com>

Without this patch, accesses to a file on a filesystem on a block device
could be done without the page cache, but accessing the block device
itself would always go through the page cache.

Now reads and writes to a block device that is capable of DAX will always
bypass the page cache.  Loads and stores to an mmapped block device will
bypass the page cache if the user specified O_DIRECT.  This opt-in from
the user is necessary because DAX mappings are currently incompatible
with RDMA and O_DIRECT I/Os with non-DAX files.
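
For illustration, the opt-in described above might look like this from
userspace. This is only a sketch: map_direct() is a hypothetical helper, not
part of this patch, and whether loads and stores actually bypass the page
cache depends on running a kernel with this patch on a device whose driver
provides direct_access.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * map_direct() is a hypothetical helper used only for illustration.
 *
 * Opens `path` read-write with O_DIRECT and maps the first `len` bytes
 * MAP_SHARED.  Returns the mapping, or NULL if open() or mmap() fails.
 * On success, *fd_out holds the descriptor; the caller must munmap()
 * the mapping and close() the descriptor.
 */
static void *map_direct(const char *path, size_t len, int *fd_out)
{
	assert(path != NULL && fd_out != NULL);

	/* O_DIRECT is the opt-in that the patch checks at mmap time. */
	int fd = open(path, O_RDWR | O_DIRECT);
	if (fd < 0)
		return NULL;

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return NULL;
	}

	*fd_out = fd;
	return p;
}
```

A caller would pass the path of a DAX-capable block device and a length that
is a multiple of the page size. Without O_DIRECT in the open flags, the same
mmap() falls back to generic_file_mmap and goes through the page cache.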

Include support for the DIO_SKIP_DIO_COUNT flag in DAX, which is only
used by the block device driver.
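
The effect of the flag on dax_do_io can be modelled in userspace as below.
This is a simplified sketch, not kernel code: struct model_inode, the
model_* functions and the MODEL_SKIP_DIO_COUNT value are all hypothetical
stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for DIO_SKIP_DIO_COUNT; the value is illustrative. */
#define MODEL_SKIP_DIO_COUNT 0x08

/* Hypothetical stand-in for the part of struct inode that matters here. */
struct model_inode {
	int dio_count;    /* direct I/Os currently in flight */
	int begin_calls;  /* how many times model_dio_begin() ran */
};

static void model_dio_begin(struct model_inode *inode)
{
	inode->dio_count++;
	inode->begin_calls++;
}

static void model_dio_end(struct model_inode *inode)
{
	assert(inode->dio_count > 0);
	inode->dio_count--;
}

/*
 * Mirrors the shape of the patched dax_do_io(): the in-flight count is
 * only taken and released when the caller did not ask to skip it.
 */
static long model_do_io(struct model_inode *inode, size_t len, int flags)
{
	if (!(flags & MODEL_SKIP_DIO_COUNT))
		model_dio_begin(inode);

	long retval = (long)len;  /* pretend the whole transfer succeeded */

	if (!(flags & MODEL_SKIP_DIO_COUNT))
		model_dio_end(inode);

	return retval;
}
```

The begin and end calls are guarded by the same test, so the count stays
balanced whether or not the flag is set.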

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/block_dev.c | 38 ++++++++++++++++++++++++++++++++++++--
 fs/dax.c       |  6 ++++--
 2 files changed, 40 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig June 30, 2015, 11:19 a.m. UTC | #1
On Mon, Jun 29, 2015 at 04:02:30PM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> Without this patch, accesses to a file on a filesystem on a block device
> could be done without the page cache, but accessing the block device
> itself would always go through the page cache.
> 
> Now reads and writes to a block device that is capable of DAX will always
> bypass the page cache.  Loads and stores to an mmapped block device will
> bypass the page cache if the user specified O_DIRECT.  This opt-in from
> the user is necessary because DAX mappings are currently incompatible
> with RDMA and O_DIRECT I/Os with non-DAX files.

Using O_DIRECT for this seems like a pretty horrible hack, so I'd like
to see a really good justification of using this over other interfaces.

Also it needs a Cc to linux-api and an entry in the open man page, and
an even better explanation of why we only support this interface on
block devices but not file systems.

Last but not least, I suspect we'll need a runtime option for direct_access
support in the brd devices, as we're now going to use the regular
block device path less and less.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Wilcox June 30, 2015, 7:56 p.m. UTC | #2
On Tue, Jun 30, 2015 at 04:19:49AM -0700, Christoph Hellwig wrote:
> On Mon, Jun 29, 2015 at 04:02:30PM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> > 
> > Without this patch, accesses to a file on a filesystem on a block device
> > could be done without the page cache, but accessing the block device
> > itself would always go through the page cache.
> > 
> > Now reads and writes to a block device that is capable of DAX will always
> > bypass the page cache.  Loads and stores to an mmapped block device will
> > bypass the page cache if the user specified O_DIRECT.  This opt-in from
> > the user is necessary because DAX mappings are currently incompatible
> > with RDMA and O_DIRECT I/Os with non-DAX files.
> 
> Using O_DIRECT for this seems like a pretty horrible hack, so I'd like
> to see a really good justification of using this over other interfaces.

O_DIRECT means "bypass the page cache", which is what this does (now it's
able to apply to mmap too).

> Also it needs a Cc to linux-api and an entry in the open man page, and
> an even better explanation of why we only support this interface on
> block devices but not file systems.

Um, we do support this for filesystems with DAX.  The inconsistency we
have is that if you have a direct-access-capable block device, currently
files in a filesystem on it get the bypass-page-cache treatment, but if
you use the raw block device directly, that mapping doesn't.

> Last but not least, I suspect we'll need a runtime option for direct_access
> support in the brd devices, as we're now going to use the regular
> block device path less and less.

I'm getting there; I was working on getting DAX to dynamically map the
pages that it used (rather than relying on them being permanently part
of the direct mapping), but I had to set that work aside temporarily.
That lets us just delete the compile option, and have direct_access
always work on brd.
Christoph Hellwig July 1, 2015, 7:19 a.m. UTC | #3
On Tue, Jun 30, 2015 at 03:56:15PM -0400, Matthew Wilcox wrote:
> > Using O_DIRECT for this seems like a pretty horrible hack, so I'd like
> > to see a really good justification of using this over other interfaces.
> 
> O_DIRECT means "bypass the page cache", which is what this does (now it's
> able to apply to mmap too).

It never had a meaning for mmap.

> > Also it needs a Cc to linux-api and an entry in the open man page, and
> > an even better explanation of why we only support this interface on
> > block devices but not file systems.
> 
> Um, we do support this for filesystems with DAX.  The inconsistency we
> have is that if you have a direct-access-capable block device, currently
> files in a filesystem on it get the bypass-page-cache treatment, but if
> you use the raw block device directly, that mapping doesn't.

I don't see this O_DIRECT check done anywhere in filesystems.
Filesystems seem to get your O_DIRECT treatment when mounted with the
dax option, as far as I can tell, without the need for additional options.

The block device equivalent would be a sysfs flag, which seems like the
better implementation choice here.  

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f04c873..e3fab8c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -152,6 +152,9 @@  blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 
+	if (IS_DAX(inode))
+		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
+				 NULL, DIO_SKIP_DIO_COUNT);
 	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
 				    blkdev_get_block, NULL, NULL,
 				    DIO_SKIP_DIO_COUNT);
@@ -333,7 +336,37 @@  static loff_t block_llseek(struct file *file, loff_t offset, int whence)
 	mutex_unlock(&bd_inode->i_mutex);
 	return retval;
 }
-	
+
+#ifdef CONFIG_FS_DAX
+static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, blkdev_get_block);
+}
+
+static int blkdev_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, blkdev_get_block);
+}
+
+static const struct vm_operations_struct blkdev_dax_vm_ops = {
+	.fault		= blkdev_dax_fault,
+	.page_mkwrite	= blkdev_dax_mkwrite,
+};
+
+static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if ((IS_DAX(file->f_mapping->host)) && (file->f_flags & O_DIRECT)) {
+		file_accessed(file);
+		vma->vm_ops = &blkdev_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+		return 0;
+	}
+	return generic_file_mmap(file, vma);
+}
+#else
+#define blkdev_mmap	generic_file_mmap
+#endif
+
 int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
 	struct inode *bd_inode = filp->f_mapping->host;
@@ -1170,6 +1203,7 @@  static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
+		bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
@@ -1670,7 +1704,7 @@  const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/dax.c b/fs/dax.c
index 159f796..37a0c48 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -209,7 +209,8 @@  ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	}
 
 	/* Protects against truncate */
-	inode_dio_begin(inode);
+	if (!(flags & DIO_SKIP_DIO_COUNT))
+		inode_dio_begin(inode);
 
 	retval = dax_io(inode, iter, pos, end, get_block, &bh);
 
@@ -219,7 +220,8 @@  ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	if ((retval > 0) && end_io)
 		end_io(iocb, pos, retval, bh.b_private);
 
-	inode_dio_end(inode);
+	if (!(flags & DIO_SKIP_DIO_COUNT))
+		inode_dio_end(inode);
  out:
 	return retval;
 }