[3/3] block: implement (some of) fallocate for block devices

Message ID 20160305005617.29738.85316.stgit@birch.djwong.org (mailing list archive)
State New, archived

Commit Message

Darrick J. Wong March 5, 2016, 12:56 a.m. UTC
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
whitelisted for zeroing SCSI UNMAP.  Both flags require that
FALLOC_FL_KEEP_SIZE is set, both return EINVAL if one tries
to write past the end of the device, and both require that the
offset and length be aligned at least to 512-byte offsets.

Since the semantics of fallocate are fairly well established already,
wire up the two pieces.  The other fallocate variants (collapse range,
insert range, and allocate blocks) are not supported.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/block_dev.c |   67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/open.c      |    3 ++-
 2 files changed, 69 insertions(+), 1 deletion(-)
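
For readers who want to see what the two supported modes look like from
userspace, here is a minimal sketch. The helper name bdev_fallocate() is
hypothetical; the only real interfaces used are fallocate(2) and the flag
constants from <linux/falloc.h>. It assumes fd refers to a block device
opened for writing on a kernel that carries this patch, and that off and
len are multiples of the device's logical block size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Hypothetical helper: run one of the two operations this patch wires up
 * on the byte range [off, off + len) of an open block device.
 *
 * op is FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE. FALLOC_FL_KEEP_SIZE
 * is always added, because the patch rejects any mode without it.
 *
 * Returns 0 on success or a negative errno value.
 */
static int bdev_fallocate(int fd, int op, off_t off, off_t len)
{
	/* Passing anything else is a programming error in the caller. */
	assert(op == FALLOC_FL_ZERO_RANGE || op == FALLOC_FL_PUNCH_HOLE);

	if (fallocate(fd, op | FALLOC_FL_KEEP_SIZE, off, len) < 0)
		return -errno;
	return 0;
}
```

With the patch as posted, a caller could expect -EOPNOTSUPP from the punch
variant on devices without zeroing discard, and -EINVAL for unaligned or
out-of-range requests.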



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Linus Torvalds March 5, 2016, 3:06 a.m. UTC | #1
On Fri, Mar 4, 2016 at 4:56 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> +       /* Only punch if the device can do zeroing discard. */
> +       if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> +           (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> +               return -EOPNOTSUPP;

I'm ok with this, but suspect that some users would prefer to just
turn this into ZERO_RANGE silently.

Comments from people who would be expected to use this?

            Linus
--
Linus Torvalds March 5, 2016, 3:13 a.m. UTC | #2
On Fri, Mar 4, 2016 at 4:56 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> +
> +       /* We can't change the bdev size from here */
> +       if (!(mode & FALLOC_FL_KEEP_SIZE))
> +               return -EOPNOTSUPP;

Oh, and this I think is wrong.

The thing is, FALLOC_FL_KEEP_SIZE is only supposed to matter if the
region is outside the existing length.

So if you punch a hole in the middle of a file, you don't need
FALLOC_FL_KEEP_SIZE.

I would suggest removing this check entirely, since you already check
that people don't try to punch holes past the end of the device. So
FALLOC_FL_KEEP_SIZE is simply a non-issue, and shouldn't even be
checked.

              Linus
--
Christoph Hellwig March 5, 2016, 8:57 p.m. UTC | #3
On Fri, Mar 04, 2016 at 07:06:38PM -0800, Linus Torvalds wrote:
> > +       if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> > +           (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> > +               return -EOPNOTSUPP;
> 
> I'm ok with this, but suspect that some users would prefer to just
> turn this into ZERO_RANGE silently.
> 
> Comments from people who would be expected to use this?

A hole punch should be a hole punch, and not silently allocate blocks
instead of deallocating them.  It's not even a fallback, it's pretty
much the opposite for some workloads.
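
A caller that does want the behaviour discussed above can make the
fallback explicit in userspace, leaving the kernel's hole-punch semantics
unchanged. A minimal sketch under that assumption follows. The function
punch_or_zero() and its allow_zero_fallback parameter are hypothetical
names for illustration only; just fallocate(2) and the <linux/falloc.h>
flags are real interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: try to deallocate [off, off + len). If the device
 * reports EOPNOTSUPP and the caller opted in, write zeroes to the range
 * instead. Any other error is returned unchanged, with no fallback.
 *
 * Returns 0 on success or a negative errno value.
 */
static int punch_or_zero(int fd, off_t off, off_t len,
			 bool allow_zero_fallback)
{
	/* fallocate(2) rejects these with EINVAL; catch caller bugs early. */
	assert(off >= 0 && len > 0);

	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      off, len) == 0)
		return 0;

	if (errno != EOPNOTSUPP || !allow_zero_fallback)
		return -errno;

	/* Explicit, opt-in fallback: this may allocate rather than free. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
		      off, len) < 0)
		return -errno;
	return 0;
}
```

Keeping the decision in the caller means a workload that needs real
deallocation can pass false and still see -EOPNOTSUPP.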
--
Christoph Hellwig March 5, 2016, 8:58 p.m. UTC | #4
On Fri, Mar 04, 2016 at 07:13:25PM -0800, Linus Torvalds wrote:
> > +       /* We can't change the bdev size from here */
> > +       if (!(mode & FALLOC_FL_KEEP_SIZE))
> > +               return -EOPNOTSUPP;
> 
> Oh, and this I think is wrong.
> 
> The thing is, FALLOC_FL_KEEP_SIZE is only supposed to matter if the
> region is outside the existing length.

For allocations...

> So if you punch a hole in the middle of a file, you don't need
> FALLOC_FL_KEEP_SIZE.

For FALLOC_FL_PUNCH_HOLE we always require FALLOC_FL_KEEP_SIZE so far,
and I'd rather not change things for block devices just because we can.

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 826b164..c9c9421 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -30,6 +30,7 @@ 
 #include <linux/cleancache.h>
 #include <linux/dax.h>
 #include <asm/uaccess.h>
+#include <linux/falloc.h>
 #include "internal.h"
 
 struct bdev_inode {
@@ -1786,6 +1787,71 @@  static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
 #define blkdev_mmap generic_file_mmap
 #endif
 
+#define	BLKDEV_FALLOC_FL_SUPPORTED					\
+		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
+		 FALLOC_FL_ZERO_RANGE)
+
+long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
+{
+	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+	struct request_queue *q = bdev_get_queue(bdev);
+	struct address_space *mapping;
+	loff_t end = start + len - 1;
+	loff_t bs_mask;
+	int error;
+
+	/* We only support zero range and punch hole. */
+	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+		return -EOPNOTSUPP;
+
+	/* We can't change the bdev size from here */
+	if (!(mode & FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
+	/* We haven't a primitive for "ensure space exists" right now. */
+	if (mode == FALLOC_FL_KEEP_SIZE)
+		return -EOPNOTSUPP;
+
+	/* Only punch if the device can do zeroing discard. */
+	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
+	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
+		return -EOPNOTSUPP;
+
+	/* Don't allow IO that isn't aligned to logical block size */
+	bs_mask = bdev_logical_block_size(bdev) - 1;
+	if ((start & bs_mask) || ((start + len) & bs_mask))
+		return -EINVAL;
+
+	/* Don't go off the end of the device */
+	if (end >= i_size_read(bdev->bd_inode))
+		return -EINVAL;
+	if (end < start)
+		return -EINVAL;
+
+	/* Invalidate the page cache, including dirty pages. */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	error = -EINVAL;
+	if (mode & FALLOC_FL_ZERO_RANGE)
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+					    GFP_KERNEL, false);
+	else if (mode & FALLOC_FL_PUNCH_HOLE)
+		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+					     GFP_KERNEL, 0);
+	if (error)
+		return error;
+
+	/*
+	 * Invalidate again; if someone wandered in and dirtied a page,
+	 * the caller will be given -EBUSY;
+	 */
+	return invalidate_inode_pages2_range(mapping,
+					     start >> PAGE_CACHE_SHIFT,
+					     end >> PAGE_CACHE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(blkdev_fallocate);
+
 const struct file_operations def_blk_fops = {
 	.open		= blkdev_open,
 	.release	= blkdev_close,
@@ -1800,6 +1866,7 @@  const struct file_operations def_blk_fops = {
 #endif
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.fallocate	= blkdev_fallocate,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/fs/open.c b/fs/open.c
index 55bdc75..4f99adc 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -289,7 +289,8 @@  int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	 * Let individual file system decide if it supports preallocation
 	 * for directories or not.
 	 */
-	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode) &&
+	    !S_ISBLK(inode->i_mode))
 		return -ENODEV;
 
 	/* Check for wrap through zero too */
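
The order of the argument checks in blkdev_fallocate() can be modelled
outside the kernel, which makes it easier to see which errno each kind of
request produces. The following is a sketch only: struct bdev_model,
check_falloc() and the FL_* macros are hypothetical stand-ins, and the
model performs no I/O. The bounds test rejects a range whose last byte
lies at or beyond the device size; for block-aligned ranges on a device
whose size is a multiple of the block size, this gives the same result as
a strict "greater than" comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Same numeric values as the FALLOC_FL_* flags in <linux/falloc.h>. */
#define FL_KEEP_SIZE	0x01
#define FL_PUNCH_HOLE	0x02
#define FL_ZERO_RANGE	0x10
#define FL_SUPPORTED	(FL_KEEP_SIZE | FL_PUNCH_HOLE | FL_ZERO_RANGE)

/* Hypothetical stand-in for the device properties the patch consults. */
struct bdev_model {
	int64_t size;		/* device size in bytes */
	int64_t lbs;		/* logical block size, a power of two */
	bool zeroing_discard;	/* discard supported and reads back zeroes */
};

/* Returns 0 if the request would pass the checks, else a negative errno. */
static int check_falloc(const struct bdev_model *d, int mode,
			int64_t start, int64_t len)
{
	int64_t end = start + len - 1;	/* last byte of the range */
	int64_t mask = d->lbs - 1;

	assert(d->lbs > 0 && (d->lbs & mask) == 0);

	if (mode & ~FL_SUPPORTED)		/* unknown or unsupported flag */
		return -EOPNOTSUPP;
	if (!(mode & FL_KEEP_SIZE))		/* device size cannot change */
		return -EOPNOTSUPP;
	if (mode == FL_KEEP_SIZE)		/* plain preallocation */
		return -EOPNOTSUPP;
	if ((mode & FL_PUNCH_HOLE) && !d->zeroing_discard)
		return -EOPNOTSUPP;
	if ((start & mask) || ((start + len) & mask))
		return -EINVAL;			/* not block aligned */
	if (end >= d->size)			/* off the end of the device */
		return -EINVAL;
	if (end < start)
		return -EINVAL;
	return 0;
}
```

For example, on a 1 MiB model device with 512-byte blocks and no zeroing
discard, a zero-range request with FL_KEEP_SIZE passes, while the same
request without FL_KEEP_SIZE, or any punch-hole request, yields
-EOPNOTSUPP.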