Message ID | 1435608152-6982-4-git-send-email-matthew.r.wilcox@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Jun 29, 2015 at 04:02:30PM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> Without this patch, accesses to a file on a filesystem on a block device
> could be done without the page cache, but accessing the block device
> itself would always go through the page cache.
>
> Now reads and writes to a block device that is capable of DAX will always
> bypass the page cache.  Loads and stores to an mmapped block device will
> bypass the page cache if the user specified O_DIRECT.  This opt-in from
> the user is necessary because DAX mappings are currently incompatible
> with RDMA and O_DIRECT I/Os with non-DAX files.

Using O_DIRECT for this seems like a pretty horrible hack, so I'd like
to see a really good justification of using this over other interfaces.

Also it needs a Cc to linux-api and an entry in the open man page, and
an even better explanation of why we only support this interface on
block devices but not file systems.

Last but not least, I suspect we'll need a runtime option for direct_access
support in the brd devices, as we're now going to use the regular
block device path less and less.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 30, 2015 at 04:19:49AM -0700, Christoph Hellwig wrote:
> On Mon, Jun 29, 2015 at 04:02:30PM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> >
> > Without this patch, accesses to a file on a filesystem on a block device
> > could be done without the page cache, but accessing the block device
> > itself would always go through the page cache.
> >
> > Now reads and writes to a block device that is capable of DAX will always
> > bypass the page cache.  Loads and stores to an mmapped block device will
> > bypass the page cache if the user specified O_DIRECT.  This opt-in from
> > the user is necessary because DAX mappings are currently incompatible
> > with RDMA and O_DIRECT I/Os with non-DAX files.
>
> Using O_DIRECT for this seems like a pretty horrible hack, so I'd like
> to see a really good justification of using this over other interfaces.

O_DIRECT means "bypass the page cache", which is what this does (now it's
able to apply to mmap too).

> Also it needs a Cc to linux-api and an entry in the open man page, and
> an even better explanation of why we only support this interface on
> block devices but not file systems.

Um, we do support this for filesystems with DAX.  The inconsistency we
have is that if you have a direct-access-capable block device, currently
files in a filesystem on it get the bypass-page-cache treatment, but if
you use the raw block device directly, that mapping doesn't.

> Last but not least, I suspect we'll need a runtime option for direct_access
> support in the brd devices, as we're now going to use the regular
> block device path less and less.

I'm getting there; I was working on getting DAX to dynamically map the
pages that it used (rather than relying on them being permanently part
of the direct mapping), but I had to set that work aside temporarily.
That lets us just delete the compile option, and have direct_access
always work on brd.
On Tue, Jun 30, 2015 at 03:56:15PM -0400, Matthew Wilcox wrote:
> > Using O_DIRECT for this seems like a pretty horrible hack, so I'd like
> > to see a really good justification of using this over other interfaces.
>
> O_DIRECT means "bypass the page cache", which is what this does (now it's
> able to apply to mmap too).

It never had a meaning for mmap.

> > Also it needs a Cc to linux-api and an entry in the open man page, and
> > an even better explanation of why we only support this interface on
> > block devices but not file systems.
>
> Um, we do support this for filesystems with DAX.  The inconsistency we
> have is that if you have a direct-access-capable block device, currently
> files in a filesystem on it get the bypass-page-cache treatment, but if
> you use the raw block device directly, that mapping doesn't.

I don't see this O_DIRECT check done anywhere in filesystems.

Filesystems seem to get your O_DIRECT treatment when mounted with the
dax option, as far as I can tell, without the need for additional options.
The block device equivalent would be a sysfs flag, which seems like the
better implementation choice here.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index f04c873..e3fab8c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -152,6 +152,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 
+	if (IS_DAX(inode))
+		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
+				NULL, DIO_SKIP_DIO_COUNT);
 	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
 				    blkdev_get_block, NULL, NULL,
 				    DIO_SKIP_DIO_COUNT);
@@ -333,7 +336,37 @@ static loff_t block_llseek(struct file *file, loff_t offset, int whence)
 	mutex_unlock(&bd_inode->i_mutex);
 	return retval;
 }
-	
+
+#ifdef CONFIG_FS_DAX
+static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, blkdev_get_block);
+}
+
+static int blkdev_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, blkdev_get_block);
+}
+
+static const struct vm_operations_struct blkdev_dax_vm_ops = {
+	.fault		= blkdev_dax_fault,
+	.page_mkwrite	= blkdev_dax_mkwrite,
+};
+
+static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if ((IS_DAX(file->f_mapping->host)) && (file->f_flags & O_DIRECT)) {
+		file_accessed(file);
+		vma->vm_ops = &blkdev_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+		return 0;
+	}
+	return generic_file_mmap(file, vma);
+}
+#else
+#define blkdev_mmap generic_file_mmap
+#endif
+
 int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
 	struct inode *bd_inode = filp->f_mapping->host;
@@ -1170,6 +1203,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
+		bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
@@ -1670,7 +1704,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/dax.c b/fs/dax.c
index 159f796..37a0c48 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -209,7 +209,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	}
 
 	/* Protects against truncate */
-	inode_dio_begin(inode);
+	if (!(flags & DIO_SKIP_DIO_COUNT))
+		inode_dio_begin(inode);
 
 	retval = dax_io(inode, iter, pos, end, get_block, &bh);
 
@@ -219,7 +220,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	if ((retval > 0) && end_io)
 		end_io(iocb, pos, retval, bh.b_private);
 
-	inode_dio_end(inode);
+	if (!(flags & DIO_SKIP_DIO_COUNT))
+		inode_dio_end(inode);
  out:
 	return retval;
 }
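
For readers following the interface debate, here is a rough userspace sketch of
the opt-in the patch description proposes: open the device with O_DIRECT, then
mmap it MAP_SHARED.  It is not part of the patch.  The names map_direct(),
store_direct(), unmap_direct() and struct dax_map are hypothetical helpers made
up for illustration, and the fallback to a plain open() when O_DIRECT is
rejected with EINVAL is an assumption of this sketch, not something the thread
specifies.  Whether the resulting mapping really bypasses the page cache
depends on the kernel carrying this change and on the device supporting
direct_access; nothing in the code can verify that.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0		/* assumption: flag unavailable, plain open() */
#endif

/* Hypothetical helper type, not from the patch. */
struct dax_map {
	void	*addr;		/* MAP_FAILED on error */
	size_t	 len;
	int	 fd;		/* -1 on error */
	int	 direct;	/* 1 if the open with O_DIRECT succeeded */
};

/* Open path with O_DIRECT if possible and map len bytes shared. */
static struct dax_map map_direct(const char *path, size_t len)
{
	struct dax_map m = { MAP_FAILED, len, -1, 0 };

	assert(len > 0);

	m.fd = open(path, O_RDWR | O_DIRECT);
	if (m.fd >= 0) {
		m.direct = (O_DIRECT != 0);
	} else if (errno == EINVAL) {
		/* O_DIRECT rejected here: fall back to a page-cache open. */
		m.fd = open(path, O_RDWR);
	}
	if (m.fd < 0)
		return m;

	m.addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, m.fd, 0);
	if (m.addr == MAP_FAILED) {
		close(m.fd);
		m.fd = -1;
	}
	return m;
}

/* Store n bytes at offset off through the mapping, then flush it. */
static int store_direct(struct dax_map *m, size_t off,
			const void *src, size_t n)
{
	if (m->addr == MAP_FAILED || off > m->len || n > m->len - off)
		return -1;
	memcpy((char *)m->addr + off, src, n);
	return msync(m->addr, m->len, MS_SYNC);
}

/* Tear down the mapping and close the descriptor. */
static void unmap_direct(struct dax_map *m)
{
	if (m->addr != MAP_FAILED)
		munmap(m->addr, m->len);
	if (m->fd >= 0)
		close(m->fd);
	m->addr = MAP_FAILED;
	m->fd = -1;
}
```

A caller would pass a DAX-capable block device node (for example a brd or
pmem device, path chosen by the caller) and a length no larger than the
device.  The sketch also makes the objection raised in the thread concrete:
the only thing distinguishing this mapping from an ordinary one is a flag
given at open() time, which previously had no defined meaning for mmap.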