[v15,04/12] block: add emulation for copy

Message ID 20230906163844.18754-5-nj.shetty@samsung.com (mailing list archive)
State New, archived
Series [v15,01/12] block: Introduce queue limits and sysfs for copy-offload support

Commit Message

Nitesh Shetty Sept. 6, 2023, 4:38 p.m. UTC
For devices that do not support copy offload, copy emulation is added.
It is required for in-kernel users such as fabrics, where no file descriptor
is available and hence copy_file_range cannot be used.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination.
Emulation can also be used if copy offload fails or completes only partially.
At present, the in-kernel user of emulation is NVMe fabrics.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c        | 223 +++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |   4 +
 2 files changed, 227 insertions(+)
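
As a rough userspace analogue of the read-then-write loop described in the
commit message, the sketch below copies between two file descriptors through
a bounce buffer. It is not the kernel implementation: copy_emulate_fd and its
parameters are hypothetical names, and pread()/pwrite() stand in for the read
and write bios that the patch submits.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative userspace analogue of copy emulation (hypothetical helper,
 * not part of the patch): read a chunk from the source into a bounce
 * buffer, write it to the destination, repeat until len bytes are done.
 * Returns the number of bytes copied, or a negative errno value.
 */
static ssize_t copy_emulate_fd(int fd_in, off_t pos_in, int fd_out,
                               off_t pos_out, size_t len, size_t buf_len)
{
    size_t rem = len;
    char *buf;

    if (buf_len == 0)
        return -EINVAL;
    buf = malloc(buf_len);
    if (!buf)
        return -ENOMEM;

    while (rem > 0) {
        /* Never move more than the bounce buffer can hold. */
        size_t chunk = rem < buf_len ? rem : buf_len;
        ssize_t got = pread(fd_in, buf, chunk, pos_in);
        ssize_t done = 0;

        if (got < 0) {
            int err = errno;
            free(buf);
            return -err;
        }
        if (got == 0)
            break; /* source ended early: report a short copy */

        /* pwrite() may write less than requested, so loop until done. */
        while (done < got) {
            ssize_t put = pwrite(fd_out, buf + done, (size_t)(got - done),
                                 pos_out + done);
            if (put <= 0) {
                int err = put < 0 ? errno : EIO;
                free(buf);
                return -err;
            }
            done += put;
        }

        pos_in += got;
        pos_out += got;
        rem -= (size_t)got;
    }

    free(buf);
    return (ssize_t)(len - rem);
}
```

A caller would pass two open descriptors, for example
copy_emulate_fd(fd_in, 0, fd_out, 0, len, 1 << 20), and compare the return
value with len to detect a short copy.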

Comments

Hannes Reinecke Sept. 8, 2023, 6:06 a.m. UTC | #1
On 9/6/23 18:38, Nitesh Shetty wrote:
> For the devices which does not support copy, copy emulation is added.
> It is required for in-kernel users like fabrics, where file descriptor is
> not available and hence they can't use copy_file_range.
> Copy-emulation is implemented by reading from source into memory and
> writing to the corresponding destination.
> Also emulation can be used, if copy offload fails or partially completes.
> At present in kernel user of emulation is NVMe fabrics.
> 
Leave out the last sentence; I really would like to see it enabled for 
SCSI, too (we do have copy offload commands for SCSI ...).

And it raises all the questions which have bogged us down right from the
start: what is the point of calling copy offload if copy offload is not
implemented, or is slower than copying by hand?
And how can the caller tell whether copy offload brings a benefit
to him?

IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not 
available?

Cheers,

Hannes
Nitesh Shetty Sept. 11, 2023, 7:09 a.m. UTC | #2
On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:
> On 9/6/23 18:38, Nitesh Shetty wrote:
> > For the devices which does not support copy, copy emulation is added.
> > It is required for in-kernel users like fabrics, where file descriptor is
> > not available and hence they can't use copy_file_range.
> > Copy-emulation is implemented by reading from source into memory and
> > writing to the corresponding destination.
> > Also emulation can be used, if copy offload fails or partially completes.
> > At present in kernel user of emulation is NVMe fabrics.
> > 
> Leave out the last sentence; I really would like to see it enabled for SCSI,
> too (we do have copy offload commands for SCSI ...).
> 
Sure, will do that

> And it raises all the questions which have bogged us down right from the
> start: where is the point in calling copy offload if copy offload is not
> implemented or slower than copying it by hand?
> And how can the caller differentiate whether copy offload bring a benefit to
> him?
> 
> IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
> available?

The present approach treats copy as a background operation; the idea is to
maximize the chances of completing the copy by falling back to emulation.
Having said that, it should be possible to return -EOPNOTSUPP
if the offload I/O fails or the device does not support offload.
We will update this in the next version.

Thank you,
Nitesh Shetty
Hannes Reinecke Sept. 11, 2023, 7:39 a.m. UTC | #3
On 9/11/23 09:09, Nitesh Shetty wrote:
> On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:
>> On 9/6/23 18:38, Nitesh Shetty wrote:
>>> For the devices which does not support copy, copy emulation is added.
>>> It is required for in-kernel users like fabrics, where file descriptor is
>>> not available and hence they can't use copy_file_range.
>>> Copy-emulation is implemented by reading from source into memory and
>>> writing to the corresponding destination.
>>> Also emulation can be used, if copy offload fails or partially completes.
>>> At present in kernel user of emulation is NVMe fabrics.
>>>
>> Leave out the last sentence; I really would like to see it enabled for SCSI,
>> too (we do have copy offload commands for SCSI ...).
>>
> Sure, will do that
> 
>> And it raises all the questions which have bogged us down right from the
>> start: where is the point in calling copy offload if copy offload is not
>> implemented or slower than copying it by hand?
>> And how can the caller differentiate whether copy offload bring a benefit to
>> him?
>>
>> IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
>> available?
> 
> Present approach treats copy as a background operation and the idea is to
> maximize the chances of achieving copy by falling back to emulation.
> Having said that, it should be possible to return -EOPNOTSUPP,
> in case of offload IO failure or device not supporting offload.
> We will update this in next version.
> 
That is also what I meant with my comments to patch 09/12: I don't see 
it as a benefit to _always_ fall back to a generic copy-offload 
emulation. After all, that hardly brings any benefit.
Where I do see a benefit is to tie in the generic copy-offload 
_infrastructure_ to existing mechanisms (like dm-kcopyd).
But if there is no copy-offload infrastructure available then we really 
should return -EOPNOTSUPP as it really is not supported.

In the end, copy offload is not a command which 'always works'.
It's a command which _might_ deliver benefits (ie better performance) if 
dedicated implementations are available and certain parameters are met. 
If not then copy offload is not the best choice, and applications will 
need to be made aware of that.

Cheers,

Hannes
Nitesh Shetty Sept. 11, 2023, 10:20 a.m. UTC | #4
On 11/09/23 09:39AM, Hannes Reinecke wrote:
>On 9/11/23 09:09, Nitesh Shetty wrote:
>>On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:
>>>On 9/6/23 18:38, Nitesh Shetty wrote:
>>>>For the devices which does not support copy, copy emulation is added.
>>>>It is required for in-kernel users like fabrics, where file descriptor is
>>>>not available and hence they can't use copy_file_range.
>>>>Copy-emulation is implemented by reading from source into memory and
>>>>writing to the corresponding destination.
>>>>Also emulation can be used, if copy offload fails or partially completes.
>>>>At present in kernel user of emulation is NVMe fabrics.
>>>>
>>>Leave out the last sentence; I really would like to see it enabled for SCSI,
>>>too (we do have copy offload commands for SCSI ...).
>>>
>>Sure, will do that
>>
>>>And it raises all the questions which have bogged us down right from the
>>>start: where is the point in calling copy offload if copy offload is not
>>>implemented or slower than copying it by hand?
>>>And how can the caller differentiate whether copy offload bring a benefit to
>>>him?
>>>
>>>IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
>>>available?
>>
>>Present approach treats copy as a background operation and the idea is to
>>maximize the chances of achieving copy by falling back to emulation.
>>Having said that, it should be possible to return -EOPNOTSUPP,
>>in case of offload IO failure or device not supporting offload.
>>We will update this in next version.
>>
>That is also what I meant with my comments to patch 09/12: I don't see 
>it as a benefit to _always_ fall back to a generic copy-offload 
>emulation. After all, that hardly brings any benefit.

Agreed, we will correct this by returning an error to the user if copy
offload fails, instead of falling back to block layer emulation.

We do need block layer emulation for fabrics, where we call emulation
if the target doesn't support offload. In fabrics scenarios, sending an
offload command from the host and performing the copy with block layer
emulation on the target is better than sending read+write from the host.

>Where I do see a benefit is to tie in the generic copy-offload 
>_infrastructure_ to existing mechanisms (like dm-kcopyd).
>But if there is no copy-offload infrastructure available then we 
>really should return -EOPNOTSUPP as it really is not supported.
>
Agreed, we will add this in the next phase, once the present series is merged.

>In the end, copy offload is not a command which 'always works'.
>It's a command which _might_ deliver benefits (ie better performance) 
>if dedicated implementations are available and certain parameters are 
>met. If not then copy offload is not the best choice, and applications 
>will need to be made aware of that.

Agreed. We will leave the choice to the user: either block layer offload
or emulation.


Thank you,
Nitesh Shetty
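
The outcome of this exchange (offload reports -EOPNOTSUPP, and the caller
decides whether to emulate) can be sketched in plain C as below. This is only
an illustration under stated assumptions: copy_fn, enum copy_policy,
copy_with_policy and the stand-in backends are hypothetical names and do not
appear in the series. Falling back only on -EOPNOTSUPP, while passing every
other error through, is likewise an assumption about how a caller might read
the discussion.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical copy backend: returns bytes copied or a negative errno. */
typedef ssize_t (*copy_fn)(size_t len);

/* Hypothetical caller policy, chosen per user as discussed in the thread. */
enum copy_policy {
    COPY_OFFLOAD_ONLY,    /* surface -EOPNOTSUPP to the caller */
    COPY_ALLOW_EMULATION, /* e.g. a fabrics target that emulates locally */
};

static ssize_t copy_with_policy(copy_fn offload, copy_fn emulate,
                                size_t len, enum copy_policy policy)
{
    ssize_t ret = offload ? offload(len) : -EOPNOTSUPP;

    /* Success and real I/O errors are passed through untouched. */
    if (ret != -EOPNOTSUPP)
        return ret;

    /* Emulate only when the caller explicitly opted in. */
    if (policy == COPY_ALLOW_EMULATION && emulate)
        return emulate(len);

    return -EOPNOTSUPP;
}

/* Stand-in backends used purely for demonstration. */
static ssize_t offload_unsupported(size_t len)
{
    (void)len;
    return -EOPNOTSUPP;
}

static ssize_t offload_ok(size_t len)
{
    return (ssize_t)len;
}

static ssize_t emulate_ok(size_t len)
{
    return (ssize_t)len;
}
```

With these stand-ins, a caller using COPY_OFFLOAD_ONLY sees -EOPNOTSUPP from
a device without offload, while COPY_ALLOW_EMULATION routes the same request
to the emulation backend.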
Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index d22e1e7417ca..b18871ea7281 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,20 @@  struct blkdev_copy_offload_io {
 	loff_t offset;
 };
 
+/* Keeps track of single outstanding copy emulation IO */
+struct blkdev_copy_emulation_io {
+	struct blkdev_copy_io *cio;
+	struct work_struct emulation_work;
+	void *buf;
+	ssize_t buf_len;
+	loff_t pos_in;
+	loff_t pos_out;
+	ssize_t len;
+	struct block_device *bdev_in;
+	struct block_device *bdev_out;
+	gfp_t gfp;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
 	unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -317,6 +331,215 @@  ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+static void *blkdev_copy_alloc_buf(ssize_t req_size, ssize_t *alloc_size,
+				   gfp_t gfp)
+{
+	int min_size = PAGE_SIZE;
+	char *buf;
+
+	while (req_size >= min_size) {
+		buf = kvmalloc(req_size, gfp);
+		if (buf) {
+			*alloc_size = req_size;
+			return buf;
+		}
+		req_size >>= 1;
+	}
+
+	return NULL;
+}
+
+static struct bio *bio_map_buf(void *data, unsigned int len, gfp_t gfp)
+{
+	unsigned long kaddr = (unsigned long)data;
+	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	unsigned long start = kaddr >> PAGE_SHIFT;
+	const int nr_pages = end - start;
+	bool is_vmalloc = is_vmalloc_addr(data);
+	struct page *page;
+	int offset, i;
+	struct bio *bio;
+
+	bio = bio_kmalloc(nr_pages, gfp);
+	if (!bio)
+		return ERR_PTR(-ENOMEM);
+	bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, 0);
+
+	if (is_vmalloc) {
+		flush_kernel_vmap_range(data, len);
+		bio->bi_private = data;
+	}
+
+	offset = offset_in_page(kaddr);
+	for (i = 0; i < nr_pages; i++) {
+		unsigned int bytes = PAGE_SIZE - offset;
+
+		if (len <= 0)
+			break;
+
+		if (bytes > len)
+			bytes = len;
+
+		if (!is_vmalloc)
+			page = virt_to_page(data);
+		else
+			page = vmalloc_to_page(data);
+		if (bio_add_page(bio, page, bytes, offset) < bytes) {
+			/* we don't support partial mappings */
+			bio_uninit(bio);
+			kfree(bio);
+			return ERR_PTR(-EINVAL);
+		}
+
+		data += bytes;
+		len -= bytes;
+		offset = 0;
+	}
+
+	return bio;
+}
+
+static void blkdev_copy_emulation_work(struct work_struct *work)
+{
+	struct blkdev_copy_emulation_io *emulation_io = container_of(work,
+			struct blkdev_copy_emulation_io, emulation_work);
+	struct blkdev_copy_io *cio = emulation_io->cio;
+	struct bio *read_bio, *write_bio;
+	loff_t pos_in = emulation_io->pos_in, pos_out = emulation_io->pos_out;
+	ssize_t rem, chunk;
+	int ret = 0;
+
+	for (rem = emulation_io->len; rem > 0; rem -= chunk) {
+		chunk = min_t(ssize_t, emulation_io->buf_len, rem);
+
+		read_bio = bio_map_buf(emulation_io->buf,
+				       emulation_io->buf_len,
+				       emulation_io->gfp);
+		if (IS_ERR(read_bio)) {
+			ret = PTR_ERR(read_bio);
+			break;
+		}
+		read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+		bio_set_dev(read_bio, emulation_io->bdev_in);
+		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+		read_bio->bi_iter.bi_size = chunk;
+		ret = submit_bio_wait(read_bio);
+		kfree(read_bio);
+		if (ret)
+			break;
+
+		write_bio = bio_map_buf(emulation_io->buf,
+					emulation_io->buf_len,
+					emulation_io->gfp);
+		if (IS_ERR(write_bio)) {
+			ret = PTR_ERR(write_bio);
+			break;
+		}
+		write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+		bio_set_dev(write_bio, emulation_io->bdev_out);
+		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+		write_bio->bi_iter.bi_size = chunk;
+		ret = submit_bio_wait(write_bio);
+		kfree(write_bio);
+		if (ret)
+			break;
+
+		pos_in += chunk;
+		pos_out += chunk;
+	}
+	cio->status = ret;
+	kvfree(emulation_io->buf);
+	kfree(emulation_io);
+	blkdev_copy_endio(cio);
+}
+
+static inline ssize_t queue_max_hw_bytes(struct request_queue *q)
+{
+	return min_t(ssize_t, (ssize_t)queue_max_hw_sectors(q) << SECTOR_SHIFT,
+		     (ssize_t)queue_max_segments(q) << PAGE_SHIFT);
+}
+/*
+ * @bdev_in:	source block device
+ * @pos_in:	source offset
+ * @bdev_out:	destination block device
+ * @pos_out:	destination offset
+ * @len:	length in bytes to be copied
+ * @endio:	endio function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	endio function will be called with this private data,
+ *		for synchronous operation this should be NULL
+ * @gfp:	memory allocation flags (used for the bounce buffer and bios)
+ *
+ * For synchronous operation returns the length of bytes copied or error
+ * For asynchronous operation returns -EIOCBQUEUED or error
+ *
+ * Description:
+ *	If native copy offload feature is absent, caller can use this function
+ *	as fallback to perform copy.
+ *	We store information required to perform the copy along with temporary
+ *	buffer allocation. We async punt copy emulation to a worker. And worker
+ *	performs copy in 2 steps.
+ *	1. Read data from source to temporary buffer
+ *	2. Write data to destination from temporary buffer
+ */
+ssize_t blkdev_copy_emulation(struct block_device *bdev_in, loff_t pos_in,
+			      struct block_device *bdev_out, loff_t pos_out,
+			      size_t len, void (*endio)(void *, int, ssize_t),
+			      void *private, gfp_t gfp)
+{
+	struct request_queue *in = bdev_get_queue(bdev_in);
+	struct request_queue *out = bdev_get_queue(bdev_out);
+	struct blkdev_copy_emulation_io *emulation_io;
+	struct blkdev_copy_io *cio;
+	ssize_t ret;
+	size_t max_hw_bytes = min(queue_max_hw_bytes(in),
+				  queue_max_hw_bytes(out));
+
+	ret = blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len);
+	if (ret)
+		return ret;
+
+	cio = kzalloc(sizeof(*cio), gfp);
+	if (!cio)
+		return -ENOMEM;
+
+	cio->waiter = current;
+	cio->copied = len;
+	cio->endio = endio;
+	cio->private = private;
+
+	emulation_io = kzalloc(sizeof(*emulation_io), gfp);
+	if (!emulation_io)
+		goto err_free_cio;
+	emulation_io->cio = cio;
+	INIT_WORK(&emulation_io->emulation_work, blkdev_copy_emulation_work);
+	emulation_io->pos_in = pos_in;
+	emulation_io->pos_out = pos_out;
+	emulation_io->len = len;
+	emulation_io->bdev_in = bdev_in;
+	emulation_io->bdev_out = bdev_out;
+	emulation_io->gfp = gfp;
+
+	emulation_io->buf = blkdev_copy_alloc_buf(min(max_hw_bytes, len),
+						  &emulation_io->buf_len, gfp);
+	if (!emulation_io->buf)
+		goto err_free_emulation_io;
+
+	schedule_work(&emulation_io->emulation_work);
+
+	if (cio->endio)
+		return -EIOCBQUEUED;
+
+	return blkdev_copy_wait_io_completion(cio);
+
+err_free_emulation_io:
+	kfree(emulation_io);
+err_free_cio:
+	kfree(cio);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_emulation);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5405499bcf22..e0a832a1c3a7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1046,6 +1046,10 @@  ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
 			    loff_t pos_out, size_t len,
 			    void (*endio)(void *, int, ssize_t),
 			    void *private, gfp_t gfp_mask);
+ssize_t blkdev_copy_emulation(struct block_device *bdev_in, loff_t pos_in,
+			      struct block_device *bdev_out, loff_t pos_out,
+			      size_t len, void (*endio)(void *, int, ssize_t),
+			      void *private, gfp_t gfp);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
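
The buffer handling in the patch above follows two small calculations: the
halving retry in blkdev_copy_alloc_buf() and the page-span arithmetic at the
top of bio_map_buf(). The userspace sketch below restates both under loud
assumptions: SKETCH_PAGE_SIZE is fixed at 4096, an injectable allocator
replaces kvmalloc() so that failures can be simulated, and
alloc_with_backoff, pages_spanned and alloc_max_64k are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this sketch */

/*
 * Try req_size first and halve it after each failed attempt. As in the
 * patch's loop, a request below one page is never attempted and yields
 * NULL.
 */
static void *alloc_with_backoff(size_t req_size, size_t *alloc_size,
                                void *(*alloc)(size_t))
{
    while (req_size >= SKETCH_PAGE_SIZE) {
        void *buf = alloc(req_size);

        if (buf) {
            *alloc_size = req_size;
            return buf;
        }
        req_size >>= 1;
    }
    return NULL;
}

/* Number of pages touched by the byte range [addr, addr + len). */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    uintptr_t start = addr / SKETCH_PAGE_SIZE;
    uintptr_t end = (addr + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

    return (size_t)(end - start);
}

/* Demo allocator that refuses anything larger than 64 KiB. */
static void *alloc_max_64k(size_t size)
{
    return size <= 65536 ? malloc(size) : NULL;
}
```

With the demo allocator, a 1 MiB request is halved four times before the
64 KiB attempt succeeds, and a two-byte range that straddles a page boundary
spans two pages.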