
[v20,03/12] block: add copy offload support

Message ID 20240520102033.9361-4-nj.shetty@samsung.com (mailing list archive)
State New
Series [v20,01/12] block: Introduce queue limits and sysfs for copy-offload support

Commit Message

Nitesh Shetty May 20, 2024, 10:20 a.m. UTC
Introduce blkdev_copy_offload to perform copy offload.
We take a plug and issue a REQ_OP_COPY_DST bio carrying the destination
info. This bio flows down to the request layer and waits for the src bio
to arrive. We then issue a REQ_OP_COPY_SRC bio carrying the source info;
when it reaches the request layer, it is merged with the dst request.
If, for any reason, a request reaches the driver with only one of the
src/dst bios, we fail the copy offload.

Larger copies are split based on the max_copy_sectors limit.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-lib.c        | 204 +++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |   4 +
 2 files changed, 208 insertions(+)

Comments

Christoph Hellwig June 1, 2024, 6:16 a.m. UTC | #1
> +/* Keeps track of all outstanding copy IO */
> +struct blkdev_copy_io {
> +	atomic_t refcount;
> +	ssize_t copied;
> +	int status;
> +	struct task_struct *waiter;
> +	void (*endio)(void *private, int status, ssize_t copied);
> +	void *private;
> +};
> +
> +/* Keeps track of single outstanding copy offload IO */
> +struct blkdev_copy_offload_io {
> +	struct blkdev_copy_io *cio;
> +	loff_t offset;
> +};

The structure names confuse me, and the comments make things even worse.

AFAICT:

blkdev_copy_io is a per-call structure; I'd name it blkdev_copy_ctx.
blkdev_copy_offload_io is per bio pair, and something like blkdev_copy_chunk
might be a better idea.  Or we could just try to kill it entirely and add
a field to struct bio in the union currently holding the integrity
information.
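Something like this, as a rough sketch (the field name is made up):

	union {
		struct bio_integrity_payload *bi_integrity;
		size_t bi_copy_offset;	/* chunk offset for copy offload */
	};

with the blkdev_copy_io pointer stashed in ->bi_private, which would let
us drop the per-chunk allocation entirely.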

I'm also quite confused what kind of offset this offset field is.  The
type and name suggest it is an offset in a file, which for a block device
based helper is pretty odd to start with.  blkdev_copy_offload
initializes it to len - rem, so it kind of is an offset, but relative
to the operation and not to a file. blkdev_copy_offload_src_endio then
uses it to set the ->copied field, but based on a min, which means
->copied can only be decreased.  Something is really off there.

Talking about types and units: blkdev_copy_offload obviously can only
work in terms of LBAs.  Any reason not to make it work in terms of
512-byte block layer sectors instead of in bytes?

> +	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
> +	    len >= BLK_COPY_MAX_BYTES)
> +		return -EINVAL;

This can be cleaned up and optimized a bit:

	if (!len || len >= BLK_COPY_MAX_BYTES)
		return -EINVAL;
	if ((pos_in | pos_out | len) & align)
		return -EINVAL;
	
> + *
> + * For synchronous operation, returns the number of bytes copied or an error
> + * For asynchronous operation, returns -EIOCBQUEUED or an error
> + *
> + * Description:
> + *	Copy source offset to destination offset within a block device, using
> + *	the device's native copy offload feature.
> + *	We perform the copy operation using 2 bios.
> + *	1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
> + *	sector and length. Once this bio reaches the request layer, we form a
> + *	request and wait for the src bio to arrive.
> + *	2. We issue a REQ_OP_COPY_SRC bio along with the source sector and length.
> + *	Once this bio reaches the request layer and finds a request with the
> + *	previously sent destination info, we merge the source bio and return.
> + *	3. Release the plug and the request is sent to the driver
> + *	This design works only for drivers with a request queue.

The wording with all the We here is a bit odd.  Much of this also seems
superfluous, or at least misplaced in the kernel doc comment, as it doesn't
document the API, but just what is done in the code below.

> +	cio = kzalloc(sizeof(*cio), gfp);
> +	if (!cio)
> +		return -ENOMEM;
> +	atomic_set(&cio->refcount, 1);
> +	cio->waiter = current;
> +	cio->endio = endio;
> +	cio->private = private;

For the main use this could be allocated on-stack.  Is there any good
reason to not let callers that really want an async version to implement
the async behavior themselves using suitable helpers?
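For the sync case that could be as simple as this (untested sketch; the
wait helper would then read ->copied off the stack instead of freeing):

	struct blkdev_copy_io cio = {
		.refcount = ATOMIC_INIT(1),
		.copied = len,
		.waiter = current,
	};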

> +		src_bio = blk_next_bio(dst_bio, bdev, 0, REQ_OP_COPY_SRC, gfp);

Please switch to bio_chain_and_submit, which is an easier-to-understand
API.  I'm trying to phase out blk_next_bio in favour of
bio_chain_and_submit over the next few merge windows.
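I.e. something like this (untested):

	src_bio = bio_alloc(bdev, 0, REQ_OP_COPY_SRC, gfp);
	if (!src_bio)
		goto err_free_dst_bio;
	/* chains dst_bio to src_bio and submits dst_bio */
	bio_chain_and_submit(dst_bio, src_bio);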

> +		if (!src_bio)
> +			goto err_free_dst_bio;
> +		src_bio->bi_iter.bi_size = chunk;
> +		src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
> +		src_bio->bi_end_io = blkdev_copy_offload_src_endio;
> +		src_bio->bi_private = offload_io;
> +
> +		atomic_inc(&cio->refcount);
> +		submit_bio(src_bio);
> +		blk_finish_plug(&plug);

Plugs should be held over all I/Os submitted from the same caller;
that is the point of them.
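I.e. take the plug once around the whole loop (sketch):

	blk_start_plug(&plug);
	for (rem = len; rem > 0; rem -= chunk) {
		...
		atomic_inc(&cio->refcount);
		submit_bio(src_bio);
		pos_in += chunk;
		pos_out += chunk;
	}
	blk_finish_plug(&plug);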
Nitesh Shetty June 4, 2024, 10:50 a.m. UTC | #2
On 01/06/24 08:16AM, Christoph Hellwig wrote:
>> +/* Keeps track of all outstanding copy IO */
>> +struct blkdev_copy_io {
>> +	atomic_t refcount;
>> +	ssize_t copied;
>> +	int status;
>> +	struct task_struct *waiter;
>> +	void (*endio)(void *private, int status, ssize_t copied);
>> +	void *private;
>> +};
>> +
>> +/* Keeps track of single outstanding copy offload IO */
>> +struct blkdev_copy_offload_io {
>> +	struct blkdev_copy_io *cio;
>> +	loff_t offset;
>> +};
>
>The structure names confuse me, and the comments make things even worse.
>
>AFAICT:
>
>blkdev_copy_io is a per-call structure; I'd name it blkdev_copy_ctx.
>blkdev_copy_offload_io is per bio pair, and something like blkdev_copy_chunk
Acked, your suggested structure names look better.

>might be a better idea.  Or we could just try to kill it entirely and add
>a field to struct bio in the union currently holding the integrity
>information.
We will explore this.

>I'm also quite confused what kind of offset this offset field is.  The
>type and name suggest it is an offset in a file, which for a block device
>based helper is pretty odd to start with.  blkdev_copy_offload
>initializes it to len - rem, so it kind of is an offset, but relative
>to the operation and not to a file. blkdev_copy_offload_src_endio then
>uses it to set the ->copied field, but based on a min, which means
>->copied can only be decreased.  Something is really off there.
>
The offset in this context is relative to the operation.
The overall idea was to handle partial copies, where some of the split
copy IOs fail. In that case we want to return the minimum number of
bytes copied: e.g. if a copy split into three chunks fails only in the
second chunk, ->copied is truncated to the first chunk's worth of bytes.
We can try to store the offset in a temporary variable similar to
pos_out/pos_in, instead of the current (len - rem), to avoid the
confusion. We will update this in the next version.

>Talking about types and units: blkdev_copy_offload obviously can only
>work in terms of LBAs.  Any reason not to make it work in terms of
>512-byte block layer sectors instead of in bytes?
>
Only because the number of places where we would need to sector-shift
was comparatively higher. We will update this to work in 512-byte
sectors in the next version.
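I.e. the prototype would become something like this (sketch, the final
form may differ):

	ssize_t blkdev_copy_offload(struct block_device *bdev,
				    sector_t sector_in, sector_t sector_out,
				    sector_t nr_sects,
				    void (*endio)(void *, int, ssize_t),
				    void *private, gfp_t gfp);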

>> +	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
>> +	    len >= BLK_COPY_MAX_BYTES)
>> +		return -EINVAL;
>
>This can be cleaned up and optimized a bit:
>
>	if (!len || len >= BLK_COPY_MAX_BYTES)
>		return -EINVAL;
>	if ((pos_in | pos_out | len) & align)
>		return -EINVAL;
>	
Acked.

>> + *
>> + * For synchronous operation, returns the number of bytes copied or an error
>> + * For asynchronous operation, returns -EIOCBQUEUED or an error
>> + *
>> + * Description:
>> + *	Copy source offset to destination offset within a block device, using
>> + *	the device's native copy offload feature.
>> + *	We perform the copy operation using 2 bios.
>> + *	1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
>> + *	sector and length. Once this bio reaches the request layer, we form a
>> + *	request and wait for the src bio to arrive.
>> + *	2. We issue a REQ_OP_COPY_SRC bio along with the source sector and length.
>> + *	Once this bio reaches the request layer and finds a request with the
>> + *	previously sent destination info, we merge the source bio and return.
>> + *	3. Release the plug and the request is sent to the driver
>> + *	This design works only for drivers with a request queue.
>
>The wording with all the We here is a bit odd.  Much of this also seems
>superfluous, or at least misplaced in the kernel doc comment, as it doesn't
>document the API, but just what is done in the code below.
>
Since we are doing IO in an unconventional way, we felt it would be
better to document this for easy follow-up.
We will remove this in the next version and document just the API,
along the lines of the sketch below.
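Something along these lines (draft wording):

	/**
	 * blkdev_copy_offload - copy a byte range within a block device
	 *	using the device's native copy offload
	 * @bdev:	block device
	 * @pos_in:	source offset
	 * @pos_out:	destination offset
	 * @len:	length in bytes to be copied
	 * @endio:	endio function called on completion of the copy
	 *		operation, NULL for synchronous operation
	 * @private:	private data passed to @endio, NULL for synchronous
	 *		operation
	 * @gfp:	memory allocation flags
	 *
	 * Returns the number of bytes copied or a negative error for
	 * synchronous operation, -EIOCBQUEUED or a negative error for
	 * asynchronous operation.
	 */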

>> +	cio = kzalloc(sizeof(*cio), gfp);
>> +	if (!cio)
>> +		return -ENOMEM;
>> +	atomic_set(&cio->refcount, 1);
>> +	cio->waiter = current;
>> +	cio->endio = endio;
>> +	cio->private = private;
>
>For the main use this could be allocated on-stack.  Is there any good
>reason to not let callers that really want an async version to implement
>the async behavior themselves using suitable helpers?
>
We cannot allocate cio on the stack, as we use it in the endio handler.
cio is also used to track partial IO completion.
Callers requiring an async implementation would need to manage all this
bookkeeping themselves, leading to code duplication. We felt it is
better to do it here once.
Do you see it differently?
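For reference, an async caller of the current API would look roughly
like this (the callback and context names are made up):

	struct my_copy_ctx {
		struct completion done;
		ssize_t copied;
	};

	static void my_copy_endio(void *private, int status, ssize_t copied)
	{
		struct my_copy_ctx *ctx = private;

		/* status is 0 or a negative errno, copied is valid bytes */
		ctx->copied = copied;
		complete(&ctx->done);
	}

	...
	ret = blkdev_copy_offload(bdev, pos_in, pos_out, len,
				  my_copy_endio, ctx, GFP_KERNEL);
	if (ret != -EIOCBQUEUED)
		return ret;	/* error before the copy was queued */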

>> +		src_bio = blk_next_bio(dst_bio, bdev, 0, REQ_OP_COPY_SRC, gfp);
>
>Please switch to bio_chain_and_submit, which is an easier-to-understand
>API.  I'm trying to phase out blk_next_bio in favour of
>bio_chain_and_submit over the next few merge windows.
>
Acked.

>> +		if (!src_bio)
>> +			goto err_free_dst_bio;
>> +		src_bio->bi_iter.bi_size = chunk;
>> +		src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
>> +		src_bio->bi_end_io = blkdev_copy_offload_src_endio;
>> +		src_bio->bi_private = offload_io;
>> +
>> +		atomic_inc(&cio->refcount);
>> +		submit_bio(src_bio);
>> +		blk_finish_plug(&plug);
>
>Plugs should be held over all I/Os submitted from the same caller;
>that is the point of them.
>
Acked.

Thank You,
Nitesh Shetty

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 442da9dad042..e83461abb581 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -10,6 +10,22 @@ 
 
 #include "blk.h"
 
+/* Keeps track of all outstanding copy IO */
+struct blkdev_copy_io {
+	atomic_t refcount;
+	ssize_t copied;
+	int status;
+	struct task_struct *waiter;
+	void (*endio)(void *private, int status, ssize_t copied);
+	void *private;
+};
+
+/* Keeps track of single outstanding copy offload IO */
+struct blkdev_copy_offload_io {
+	struct blkdev_copy_io *cio;
+	loff_t offset;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
 	unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -103,6 +119,194 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+static inline ssize_t blkdev_copy_sanity_check(struct block_device *bdev_in,
+					       loff_t pos_in,
+					       struct block_device *bdev_out,
+					       loff_t pos_out, size_t len)
+{
+	unsigned int align = max(bdev_logical_block_size(bdev_out),
+				 bdev_logical_block_size(bdev_in)) - 1;
+
+	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
+	    len >= BLK_COPY_MAX_BYTES)
+		return -EINVAL;
+
+	return 0;
+}
+
+static inline void blkdev_copy_endio(struct blkdev_copy_io *cio)
+{
+	if (cio->endio) {
+		cio->endio(cio->private, cio->status, cio->copied);
+		kfree(cio);
+	} else {
+		struct task_struct *waiter = cio->waiter;
+
+		WRITE_ONCE(cio->waiter, NULL);
+		blk_wake_io_task(waiter);
+	}
+}
+
+/*
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to complete.
+ * Returns the length of bytes copied or error
+ */
+static ssize_t blkdev_copy_wait_for_completion_io(struct blkdev_copy_io *cio)
+{
+	ssize_t ret;
+
+	for (;;) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!READ_ONCE(cio->waiter))
+			break;
+		blk_io_schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	ret = cio->copied;
+	kfree(cio);
+
+	return ret;
+}
+
+static void blkdev_copy_offload_src_endio(struct bio *bio)
+{
+	struct blkdev_copy_offload_io *offload_io = bio->bi_private;
+	struct blkdev_copy_io *cio = offload_io->cio;
+
+	if (bio->bi_status) {
+		cio->copied = min_t(ssize_t, offload_io->offset, cio->copied);
+		if (!cio->status)
+			cio->status = blk_status_to_errno(bio->bi_status);
+	}
+	bio_put(bio);
+	kfree(offload_io);
+
+	if (atomic_dec_and_test(&cio->refcount))
+		blkdev_copy_endio(cio);
+}
+
+/*
+ * @bdev:	block device
+ * @pos_in:	source offset
+ * @pos_out:	destination offset
+ * @len:	length in bytes to be copied
+ * @endio:	endio function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	endio function will be called with this private data,
+ *		for synchronous operation this should be NULL
+ * @gfp:	memory allocation flags (for bio_alloc)
+ *
+ * For synchronous operation, returns the number of bytes copied or an error
+ * For asynchronous operation, returns -EIOCBQUEUED or an error
+ *
+ * Description:
+ *	Copy source offset to destination offset within a block device, using
+ *	the device's native copy offload feature.
+ *	We perform the copy operation using 2 bios.
+ *	1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
+ *	sector and length. Once this bio reaches the request layer, we form a
+ *	request and wait for the src bio to arrive.
+ *	2. We issue a REQ_OP_COPY_SRC bio along with the source sector and length.
+ *	Once this bio reaches the request layer and finds a request with the
+ *	previously sent destination info, we merge the source bio and return.
+ *	3. Release the plug and the request is sent to the driver
+ *	This design works only for drivers with a request queue.
+ */
+ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
+			    loff_t pos_out, size_t len,
+			    void (*endio)(void *, int, ssize_t),
+			    void *private, gfp_t gfp)
+{
+	struct blkdev_copy_io *cio;
+	struct blkdev_copy_offload_io *offload_io;
+	struct bio *src_bio, *dst_bio;
+	size_t rem, chunk;
+	size_t max_copy_bytes = bdev_max_copy_sectors(bdev) << SECTOR_SHIFT;
+	ssize_t ret;
+	struct blk_plug plug;
+
+	if (!max_copy_bytes)
+		return -EOPNOTSUPP;
+
+	ret = blkdev_copy_sanity_check(bdev, pos_in, bdev, pos_out, len);
+	if (ret)
+		return ret;
+
+	cio = kzalloc(sizeof(*cio), gfp);
+	if (!cio)
+		return -ENOMEM;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = endio;
+	cio->private = private;
+
+	/*
+	 * If there is an error, copied will be set to the least successfully
+	 * completed copy length
+	 */
+	cio->copied = len;
+	for (rem = len; rem > 0; rem -= chunk) {
+		chunk = min(rem, max_copy_bytes);
+
+		offload_io = kzalloc(sizeof(*offload_io), gfp);
+		if (!offload_io)
+			goto err_free_cio;
+		offload_io->cio = cio;
+		/*
+		 * For partial completion, we use offload_io->offset to truncate
+		 * successful copy length
+		 */
+		offload_io->offset = len - rem;
+
+		dst_bio = bio_alloc(bdev, 0, REQ_OP_COPY_DST, gfp);
+		if (!dst_bio)
+			goto err_free_offload_io;
+		dst_bio->bi_iter.bi_size = chunk;
+		dst_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+
+		blk_start_plug(&plug);
+		src_bio = blk_next_bio(dst_bio, bdev, 0, REQ_OP_COPY_SRC, gfp);
+		if (!src_bio)
+			goto err_free_dst_bio;
+		src_bio->bi_iter.bi_size = chunk;
+		src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+		src_bio->bi_end_io = blkdev_copy_offload_src_endio;
+		src_bio->bi_private = offload_io;
+
+		atomic_inc(&cio->refcount);
+		submit_bio(src_bio);
+		blk_finish_plug(&plug);
+		pos_in += chunk;
+		pos_out += chunk;
+	}
+
+	if (atomic_dec_and_test(&cio->refcount))
+		blkdev_copy_endio(cio);
+	if (endio)
+		return -EIOCBQUEUED;
+
+	return blkdev_copy_wait_for_completion_io(cio);
+
+err_free_dst_bio:
+	bio_put(dst_bio);
+err_free_offload_io:
+	kfree(offload_io);
+err_free_cio:
+	cio->copied = min_t(ssize_t, cio->copied, (len - rem));
+	cio->status = -ENOMEM;
+	if (rem == len) {
+		ret = cio->status;
+		kfree(cio);
+		return ret;
+	}
+	if (cio->endio)
+		return cio->status;
+
+	return blkdev_copy_wait_for_completion_io(cio);
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 109d9f905c3c..3b88dcd5c433 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1079,6 +1079,10 @@  int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
+ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
+			    loff_t pos_out, size_t len,
+			    void (*endio)(void *, int, ssize_t),
+			    void *private, gfp_t gfp_mask);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */