[3/3] block: implement (some of) fallocate for block devices

Message ID 147510959149.8940.2897845352082568677.stgit@birch.djwong.org (mailing list archive)
State New, archived

Commit Message

Darrick J. Wong Sept. 29, 2016, 12:39 a.m. UTC
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
whitelisted for zeroing SCSI UNMAP.  Punch still requires that
FALLOC_FL_KEEP_SIZE is set.  A length that goes past the end of the
device will be clamped to the device size if KEEP_SIZE is set; or will
return -EINVAL if not.  Both start and length must be aligned to the
device's logical block size.

Since the semantics of fallocate are fairly well established already,
wire up the two pieces.  The other fallocate variants (collapse range,
insert range, and allocate blocks) are not supported.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Incorporate feedback from Christoph & Linus.  Tentatively add
a requirement that the fallocate arguments be aligned to logical block
size, and put in a few XXX comments ahead of LSF discussion.
v3: Forward port to 4.7.
v4: Forward port to 4.8.
---
 fs/block_dev.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/open.c      |    3 +-
 2 files changed, 80 insertions(+), 1 deletion(-)
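
The alignment rule in the commit message (both start and length must be
multiples of the logical block size) can be checked with a single mask when
the block size is a power of two. The following is a minimal user-space
sketch of that test, not the kernel code itself; the function name
range_is_aligned() is hypothetical, and the block size is passed in directly
rather than read from a device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * User-space model of the alignment test (hypothetical helper name).
 * OR-ing start and len merges their low bits, so one mask against
 * (lbs - 1) rejects the request if either value is misaligned.
 * Precondition: lbs is a nonzero power of two.
 */
static bool range_is_aligned(uint64_t start, uint64_t len, uint32_t lbs)
{
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);
	return ((start | len) & ((uint64_t)lbs - 1)) == 0;
}
```

With a 512-byte logical block size, start=512 and len=1024 pass, while
start=256 or len=100 would be rejected, which corresponds to the -EINVAL
path described above.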



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bart Van Assche Sept. 29, 2016, 1:42 a.m. UTC | #1
On 09/28/16 17:39, Darrick J. Wong wrote:
> +	if (end > isize) {
> +		if (mode & FALLOC_FL_KEEP_SIZE) {
> +			len = isize - start;
> +			end = start + len - 1;
> +		} else
> +			return -EINVAL;
> +	}

If FALLOC_FL_KEEP_SIZE has been set and end == isize the above code 
won't reduce end to isize - 1. Shouldn't "end > isize" be changed into 
"end >= isize" ?

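The boundary case raised above can be worked through in a small user-space
model. This is only a sketch under stated assumptions: clamp_range() and its
"strict" parameter are hypothetical names used to compare the two
comparisons side by side, and "end" is the inclusive last byte of the
request, as in the quoted hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * User-space model of the range check (hypothetical helper).
 * strict == false models "end > isize" from the posted patch;
 * strict == true models the suggested "end >= isize".
 * On success, *end_out holds the inclusive last byte of the range.
 */
static int clamp_range(int64_t start, int64_t *len, int64_t isize,
		       bool keep_size, bool strict, int64_t *end_out)
{
	int64_t end;

	assert(*len > 0);
	end = start + *len - 1;

	if (start >= isize)
		return -EINVAL;
	if (strict ? end >= isize : end > isize) {
		if (!keep_size)
			return -EINVAL;
		/* Clamp the request to the end of the device. */
		*len = isize - start;
		end = start + *len - 1;
	}
	*end_out = end;
	return 0;
}
```

For isize = 4096, start = 0 and len = 4097, the inclusive end is 4096, one
byte past the last valid offset 4095. With "end > isize" the request passes
through unclamped; with "end >= isize" it is clamped to len = 4096 when
KEEP_SIZE is set and rejected otherwise.
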
> +	switch (mode) {
> +	case FALLOC_FL_ZERO_RANGE:
> +	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> +		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
> +					    GFP_KERNEL, false);
> +		if (error)
> +			return error;
> +		break;
> +	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
> +		/* Only punch if the device can do zeroing discard. */
> +		if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
> +			return -EOPNOTSUPP;
> +		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +					     GFP_KERNEL, 0);
> +		if (error)
> +			return error;
> +		break;
> +	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
> +		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +					     GFP_KERNEL, 0);
> +		if (error)
> +			return error;
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}

Have you considered to move "if (error) return error" out of the switch 
statement?
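
One possible shape of that refactoring, as a user-space sketch rather than
the actual follow-up patch: each case only assigns error, and a single check
after the switch returns it. Everything prefixed with model_ or MODEL_ is a
hypothetical stand-in; the stubs simply return the result they are handed
instead of calling blkdev_issue_zeroout() or blkdev_issue_discard().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the mode combinations in the switch. */
enum model_op {
	MODEL_ZERO,
	MODEL_PUNCH,
	MODEL_PUNCH_NO_HIDE_STALE,
	MODEL_OTHER,
};

/* Stub for the zeroout/discard helpers: returns 0 or a negative errno. */
static int model_issue(int result)
{
	assert(result <= 0);
	return result;
}

static int model_fallocate(enum model_op op, int zeroing_discard, int result)
{
	int error;

	switch (op) {
	case MODEL_ZERO:
		error = model_issue(result);
		break;
	case MODEL_PUNCH:
		/* Only punch if the device can do zeroing discard. */
		if (!zeroing_discard)
			return -EOPNOTSUPP;
		error = model_issue(result);
		break;
	case MODEL_PUNCH_NO_HIDE_STALE:
		error = model_issue(result);
		break;
	default:
		return -EOPNOTSUPP;
	}
	/* Single error check, hoisted out of the individual cases. */
	if (error)
		return error;

	/* The real function would go on to invalidate the page cache. */
	return 0;
}
```

The early -EOPNOTSUPP returns stay inside the switch because they happen
before any I/O is issued; only the repeated post-I/O check is hoisted.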

> +	/*
> +	 * Invalidate again; if someone wandered in and dirtied a page,
> +	 * the caller will be given -EBUSY;
> +	 */
> +	return invalidate_inode_pages2_range(mapping,
> +					     start >> PAGE_SHIFT,
> +					     end >> PAGE_SHIFT);

A comment might be appropriate here that since end is inclusive and 
since the third argument of invalidate_inode_pages2_range() is inclusive 
that rounding down will yield the correct result.
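
The arithmetic behind that remark can be illustrated in user space. This is
a sketch only: MODEL_PAGE_SHIFT assumes 4 KiB pages, and page_index() and
pages_covered() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Index of the page that contains byte offset pos (rounds down). */
static uint64_t page_index(uint64_t pos)
{
	return pos >> MODEL_PAGE_SHIFT;
}

/* Number of pages touched by the inclusive byte range [start, end]. */
static uint64_t pages_covered(uint64_t start, uint64_t end)
{
	assert(start <= end);
	return page_index(end) - page_index(start) + 1;
}
```

For a 4096-byte range starting at 0, the inclusive end is 4095, which shifts
down to page 0, so exactly one page is named. Shifting an exclusive end of
4096 would give page 1 and, passed as an inclusive last index, would cover
one page too many.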

Bart.

Darrick J. Wong Sept. 29, 2016, 2:09 a.m. UTC | #2
On Wed, Sep 28, 2016 at 06:42:14PM -0700, Bart Van Assche wrote:
> On 09/28/16 17:39, Darrick J. Wong wrote:
> >+	if (end > isize) {
> >+		if (mode & FALLOC_FL_KEEP_SIZE) {
> >+			len = isize - start;
> >+			end = start + len - 1;
> >+		} else
> >+			return -EINVAL;
> >+	}
> 
> If FALLOC_FL_KEEP_SIZE has been set and end == isize the above code won't
> reduce end to isize - 1. Shouldn't "end > isize" be changed into "end >=
> isize" ?

Oops.  Will fix and send out a v2.

> >+	switch (mode) {
> >+	case FALLOC_FL_ZERO_RANGE:
> >+	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> >+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
> >+					    GFP_KERNEL, false);
> >+		if (error)
> >+			return error;
> >+		break;
> >+	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
> >+		/* Only punch if the device can do zeroing discard. */
> >+		if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
> >+			return -EOPNOTSUPP;
> >+		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> >+					     GFP_KERNEL, 0);
> >+		if (error)
> >+			return error;
> >+		break;
> >+	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
> >+		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> >+					     GFP_KERNEL, 0);
> >+		if (error)
> >+			return error;
> >+		break;
> >+	default:
> >+		return -EOPNOTSUPP;
> >+	}
> 
> Have you considered to move "if (error) return error" out of the switch
> statement?

Sure, I could do that.

> >+	/*
> >+	 * Invalidate again; if someone wandered in and dirtied a page,
> >+	 * the caller will be given -EBUSY;
> >+	 */
> >+	return invalidate_inode_pages2_range(mapping,
> >+					     start >> PAGE_SHIFT,
> >+					     end >> PAGE_SHIFT);
> 
> A comment might be appropriate here that since end is inclusive and since
> the third argument of invalidate_inode_pages2_range() is inclusive that
> rounding down will yield the correct result.

/me thought the documentation of invalidate_inode_pages2_range was clear
enough on that point, but I could throw that into the comment too.

--D
> 
> Bart.
> 
Hannes Reinecke Sept. 29, 2016, 5:57 a.m. UTC | #3
On 09/29/2016 02:39 AM, Darrick J. Wong wrote:
> After much discussion, it seems that the fallocate feature flag
> FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
> FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
> whitelisted for zeroing SCSI UNMAP.  Punch still requires that
> FALLOC_FL_KEEP_SIZE is set.  A length that goes past the end of the
> device will be clamped to the device size if KEEP_SIZE is set; or will
> return -EINVAL if not.  Both start and length must be aligned to the
> device's logical block size.
> 
> Since the semantics of fallocate are fairly well established already,
> wire up the two pieces.  The other fallocate variants (collapse range,
> insert range, and allocate blocks) are not supported.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: Incorporate feedback from Christoph & Linus.  Tentatively add
> a requirement that the fallocate arguments be aligned to logical block
> size, and put in a few XXX comments ahead of LSF discussion.
> v3: Forward port to 4.7.
> v4: Forward port to 4.8.
> ---
>  fs/block_dev.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/open.c      |    3 +-
>  2 files changed, 80 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 08ae993..0c808fc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -30,6 +30,7 @@ 
 #include <linux/cleancache.h>
 #include <linux/dax.h>
 #include <linux/badblocks.h>
+#include <linux/falloc.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -1787,6 +1788,82 @@  static const struct address_space_operations def_blk_aops = {
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
+#define	BLKDEV_FALLOC_FL_SUPPORTED					\
+		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
+		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+
+static long blkdev_fallocate(struct file *file, int mode, loff_t start,
+			     loff_t len)
+{
+	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+	struct request_queue *q = bdev_get_queue(bdev);
+	struct address_space *mapping;
+	loff_t end = start + len - 1;
+	loff_t isize;
+	int error;
+
+	/* Fail if we don't recognize the flags. */
+	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+		return -EOPNOTSUPP;
+
+	/* Don't go off the end of the device. */
+	isize = i_size_read(bdev->bd_inode);
+	if (start >= isize)
+		return -EINVAL;
+	if (end > isize) {
+		if (mode & FALLOC_FL_KEEP_SIZE) {
+			len = isize - start;
+			end = start + len - 1;
+		} else
+			return -EINVAL;
+	}
+
+	/*
+	 * Don't allow IO that isn't aligned to logical block size.
+	 */
+	if ((start | len) & (bdev_logical_block_size(bdev) - 1))
+		return -EINVAL;
+
+	/* Invalidate the page cache, including dirty pages. */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	switch (mode) {
+	case FALLOC_FL_ZERO_RANGE:
+	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+					    GFP_KERNEL, false);
+		if (error)
+			return error;
+		break;
+	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
+		/* Only punch if the device can do zeroing discard. */
+		if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
+			return -EOPNOTSUPP;
+		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+					     GFP_KERNEL, 0);
+		if (error)
+			return error;
+		break;
+	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
+		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+					     GFP_KERNEL, 0);
+		if (error)
+			return error;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	/*
+	 * Invalidate again; if someone wandered in and dirtied a page,
+	 * the caller will be given -EBUSY;
+	 */
+	return invalidate_inode_pages2_range(mapping,
+					     start >> PAGE_SHIFT,
+					     end >> PAGE_SHIFT);
+}
+
 const struct file_operations def_blk_fops = {
 	.open		= blkdev_open,
 	.release	= blkdev_close,
@@ -1801,6 +1878,7 @@  const struct file_operations def_blk_fops = {
 #endif
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.fallocate	= blkdev_fallocate,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/fs/open.c b/fs/open.c
index 4fd6e25..01b6092 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -289,7 +289,8 @@  int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	 * Let individual file system decide if it supports preallocation
 	 * for directories or not.
 	 */
-	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode) &&
+	    !S_ISBLK(inode->i_mode))
 		return -ENODEV;
 
 	/* Check for wrap through zero too */