Message ID | 20230906163844.18754-5-nj.shetty@samsung.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v15,01/12] block: Introduce queue limits and sysfs for copy-offload support |
On 9/6/23 18:38, Nitesh Shetty wrote:
> For the devices which does not support copy, copy emulation is added.
> It is required for in-kernel users like fabrics, where file descriptor is
> not available and hence they can't use copy_file_range.
> Copy-emulation is implemented by reading from source into memory and
> writing to the corresponding destination.
> Also emulation can be used, if copy offload fails or partially completes.
> At present in kernel user of emulation is NVMe fabrics.
>
Leave out the last sentence; I really would like to see it enabled for SCSI,
too (we do have copy offload commands for SCSI ...).

And it raises all the questions which have bogged us down right from the
start: where is the point in calling copy offload if copy offload is not
implemented or slower than copying it by hand?
And how can the caller differentiate whether copy offload bring a benefit to
him?

IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
available?

Cheers,

Hannes

On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:
> On 9/6/23 18:38, Nitesh Shetty wrote:
> > For the devices which does not support copy, copy emulation is added.
> > It is required for in-kernel users like fabrics, where file descriptor is
> > not available and hence they can't use copy_file_range.
> > Copy-emulation is implemented by reading from source into memory and
> > writing to the corresponding destination.
> > Also emulation can be used, if copy offload fails or partially completes.
> > At present in kernel user of emulation is NVMe fabrics.
> >
> Leave out the last sentence; I really would like to see it enabled for SCSI,
> too (we do have copy offload commands for SCSI ...).
>
Sure, will do that

> And it raises all the questions which have bogged us down right from the
> start: where is the point in calling copy offload if copy offload is not
> implemented or slower than copying it by hand?
> And how can the caller differentiate whether copy offload bring a benefit to
> him?
>
> IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
> available?

Present approach treats copy as a background operation and the idea is to
maximize the chances of achieving copy by falling back to emulation.
Having said that, it should be possible to return -EOPNOTSUPP,
in case of offload IO failure or device not supporting offload.
We will update this in next version.

Thank you,
Nitesh Shetty

On 9/11/23 09:09, Nitesh Shetty wrote:
> On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:
>> On 9/6/23 18:38, Nitesh Shetty wrote:
>>> For the devices which does not support copy, copy emulation is added.
>>> It is required for in-kernel users like fabrics, where file descriptor is
>>> not available and hence they can't use copy_file_range.
>>> Copy-emulation is implemented by reading from source into memory and
>>> writing to the corresponding destination.
>>> Also emulation can be used, if copy offload fails or partially completes.
>>> At present in kernel user of emulation is NVMe fabrics.
>>>
>> Leave out the last sentence; I really would like to see it enabled for SCSI,
>> too (we do have copy offload commands for SCSI ...).
>>
> Sure, will do that
>
>> And it raises all the questions which have bogged us down right from the
>> start: where is the point in calling copy offload if copy offload is not
>> implemented or slower than copying it by hand?
>> And how can the caller differentiate whether copy offload bring a benefit to
>> him?
>>
>> IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
>> available?
>
> Present approach treats copy as a background operation and the idea is to
> maximize the chances of achieving copy by falling back to emulation.
> Having said that, it should be possible to return -EOPNOTSUPP,
> in case of offload IO failure or device not supporting offload.
> We will update this in next version.
>
That is also what I meant with my comments to patch 09/12: I don't see
it as a benefit to _always_ fall back to a generic copy-offload
emulation. After all, that hardly brings any benefit.
Where I do see a benefit is to tie in the generic copy-offload
_infrastructure_ to existing mechanisms (like dm-kcopyd).
But if there is no copy-offload infrastructure available then we
really should return -EOPNOTSUPP as it really is not supported.

In the end, copy offload is not a command which 'always works'.
It's a command which _might_ deliver benefits (ie better performance)
if dedicated implementations are available and certain parameters are
met. If not then copy offload is not the best choice, and applications
will need to be made aware of that.

Cheers,

Hannes
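
The policy argued for above (report -EOPNOTSUPP and let the caller decide, rather than silently emulating) can be sketched in plain userspace C. This is only an illustrative model, not code from the series: `struct demo_dev`, `demo_copy_offload()`, `demo_copy_by_hand()` and `demo_copy()` are hypothetical names, and the "device" is reduced to a single capability flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical device model: only the capability bit matters here. */
struct demo_dev {
	bool has_copy_offload;
};

/* Stand-in for an offload call that refuses instead of emulating. */
static ssize_t demo_copy_offload(const struct demo_dev *dev, size_t len)
{
	if (!dev->has_copy_offload)
		return -EOPNOTSUPP;
	return (ssize_t)len;
}

/* Stand-in for a copy the caller performs itself (read, then write). */
static ssize_t demo_copy_by_hand(size_t len)
{
	return (ssize_t)len;
}

/*
 * Try offload first. If it is not supported, the caller's own policy
 * (allow_fallback) decides whether to copy by hand or to propagate
 * -EOPNOTSUPP. *used_offload reports which path actually ran.
 */
static ssize_t demo_copy(const struct demo_dev *dev, size_t len,
			 bool allow_fallback, bool *used_offload)
{
	ssize_t ret;

	assert(dev && used_offload);

	ret = demo_copy_offload(dev, len);
	*used_offload = ret >= 0;
	if (ret != -EOPNOTSUPP || !allow_fallback)
		return ret;

	return demo_copy_by_hand(len);
}
```

With this shape, a caller that only wants the copy when it is accelerated passes `allow_fallback = false` and sees the error, while a caller that treats copy as a background operation opts in to the slower path explicitly.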

On 11/09/23 09:39AM, Hannes Reinecke wrote:
>On 9/11/23 09:09, Nitesh Shetty wrote:
>>On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:
>>>On 9/6/23 18:38, Nitesh Shetty wrote:
>>>>For the devices which does not support copy, copy emulation is added.
>>>>It is required for in-kernel users like fabrics, where file descriptor is
>>>>not available and hence they can't use copy_file_range.
>>>>Copy-emulation is implemented by reading from source into memory and
>>>>writing to the corresponding destination.
>>>>Also emulation can be used, if copy offload fails or partially completes.
>>>>At present in kernel user of emulation is NVMe fabrics.
>>>>
>>>Leave out the last sentence; I really would like to see it enabled for SCSI,
>>>too (we do have copy offload commands for SCSI ...).
>>>
>>Sure, will do that
>>
>>>And it raises all the questions which have bogged us down right from the
>>>start: where is the point in calling copy offload if copy offload is not
>>>implemented or slower than copying it by hand?
>>>And how can the caller differentiate whether copy offload bring a benefit to
>>>him?
>>>
>>>IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
>>>available?
>>
>>Present approach treats copy as a background operation and the idea is to
>>maximize the chances of achieving copy by falling back to emulation.
>>Having said that, it should be possible to return -EOPNOTSUPP,
>>in case of offload IO failure or device not supporting offload.
>>We will update this in next version.
>>
>That is also what I meant with my comments to patch 09/12: I don't see
>it as a benefit to _always_ fall back to a generic copy-offload
>emulation. After all, that hardly brings any benefit.

Agreed, we will correct this by returning error to user in case copy offload
fails, instead of falling back to block layer emulation.

We do need block layer emulation for fabrics, where we call emulation
if target doesn't support offload. In fabrics scenarios sending
offload command from host and achieve copy using block layer emulation
on target is better than sending read+write from host.

>Where I do see a benefit is to tie in the generic copy-offload
>_infrastructure_ to existing mechanisms (like dm-kcopyd).
>But if there is no copy-offload infrastructure available then we
>really should return -EOPNOTSUPP as it really is not supported.
>
Agreed, we will add this in next phase, once present series gets merged.

>In the end, copy offload is not a command which 'always works'.
>It's a command which _might_ deliver benefits (ie better performance)
>if dedicated implementations are available and certain parameters are
>met. If not then copy offload is not the best choice, and applications
>will need to be made aware of that.

Agreed. We will leave the choice to user, to use either block layer offload
or emulation.

Thank you,
Nitesh Shetty
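
The fabrics argument in the reply above comes down to how much payload crosses the wire. A rough back-of-the-envelope model, assuming command and completion overhead is negligible next to the data and using hypothetical names (`enum demo_copy_path`, `demo_fabric_payload_bytes()`), might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Two ways a host can get len bytes copied on a remote target. */
enum demo_copy_path {
	DEMO_HOST_READ_WRITE,	/* host reads the data, then writes it back */
	DEMO_TARGET_SIDE_COPY,	/* host sends one copy command; the target
				 * offloads or emulates it locally */
};

/*
 * Payload bytes that cross the fabric for a copy of len bytes.
 * Assumption: only data payload is counted, not command capsules.
 */
static uint64_t demo_fabric_payload_bytes(enum demo_copy_path path,
					  uint64_t len)
{
	switch (path) {
	case DEMO_HOST_READ_WRITE:
		return 2 * len;	/* once target->host, once host->target */
	case DEMO_TARGET_SIDE_COPY:
		return 0;	/* data never leaves the target */
	}
	assert(0 && "unknown copy path");
	return 0;
}
```

Under this simplified model, even a target that has to emulate the copy with its own read and write keeps the data off the fabric, which is the case the reply singles out as still needing block layer emulation.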

diff --git a/block/blk-lib.c b/block/blk-lib.c
index d22e1e7417ca..b18871ea7281 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,20 @@ struct blkdev_copy_offload_io {
 	loff_t offset;
 };
 
+/* Keeps track of single outstanding copy emulation IO */
+struct blkdev_copy_emulation_io {
+	struct blkdev_copy_io *cio;
+	struct work_struct emulation_work;
+	void *buf;
+	ssize_t buf_len;
+	loff_t pos_in;
+	loff_t pos_out;
+	ssize_t len;
+	struct block_device *bdev_in;
+	struct block_device *bdev_out;
+	gfp_t gfp;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
 	unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -317,6 +331,215 @@ ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+static void *blkdev_copy_alloc_buf(ssize_t req_size, ssize_t *alloc_size,
+				   gfp_t gfp)
+{
+	int min_size = PAGE_SIZE;
+	char *buf;
+
+	while (req_size >= min_size) {
+		buf = kvmalloc(req_size, gfp);
+		if (buf) {
+			*alloc_size = req_size;
+			return buf;
+		}
+		req_size >>= 1;
+	}
+
+	return NULL;
+}
+
+static struct bio *bio_map_buf(void *data, unsigned int len, gfp_t gfp)
+{
+	unsigned long kaddr = (unsigned long)data;
+	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	unsigned long start = kaddr >> PAGE_SHIFT;
+	const int nr_pages = end - start;
+	bool is_vmalloc = is_vmalloc_addr(data);
+	struct page *page;
+	int offset, i;
+	struct bio *bio;
+
+	bio = bio_kmalloc(nr_pages, gfp);
+	if (!bio)
+		return ERR_PTR(-ENOMEM);
+	bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, 0);
+
+	if (is_vmalloc) {
+		flush_kernel_vmap_range(data, len);
+		bio->bi_private = data;
+	}
+
+	offset = offset_in_page(kaddr);
+	for (i = 0; i < nr_pages; i++) {
+		unsigned int bytes = PAGE_SIZE - offset;
+
+		if (len <= 0)
+			break;
+
+		if (bytes > len)
+			bytes = len;
+
+		if (!is_vmalloc)
+			page = virt_to_page(data);
+		else
+			page = vmalloc_to_page(data);
+		if (bio_add_page(bio, page, bytes, offset) < bytes) {
+			/* we don't support partial mappings */
+			bio_uninit(bio);
+			kfree(bio);
+			return ERR_PTR(-EINVAL);
+		}
+
+		data += bytes;
+		len -= bytes;
+		offset = 0;
+	}
+
+	return bio;
+}
+
+static void blkdev_copy_emulation_work(struct work_struct *work)
+{
+	struct blkdev_copy_emulation_io *emulation_io = container_of(work,
+			struct blkdev_copy_emulation_io, emulation_work);
+	struct blkdev_copy_io *cio = emulation_io->cio;
+	struct bio *read_bio, *write_bio;
+	loff_t pos_in = emulation_io->pos_in, pos_out = emulation_io->pos_out;
+	ssize_t rem, chunk;
+	int ret = 0;
+
+	for (rem = emulation_io->len; rem > 0; rem -= chunk) {
+		chunk = min_t(int, emulation_io->buf_len, rem);
+
+		read_bio = bio_map_buf(emulation_io->buf,
+				       emulation_io->buf_len,
+				       emulation_io->gfp);
+		if (IS_ERR(read_bio)) {
+			ret = PTR_ERR(read_bio);
+			break;
+		}
+		read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+		bio_set_dev(read_bio, emulation_io->bdev_in);
+		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+		read_bio->bi_iter.bi_size = chunk;
+		ret = submit_bio_wait(read_bio);
+		kfree(read_bio);
+		if (ret)
+			break;
+
+		write_bio = bio_map_buf(emulation_io->buf,
+					emulation_io->buf_len,
+					emulation_io->gfp);
+		if (IS_ERR(write_bio)) {
+			ret = PTR_ERR(write_bio);
+			break;
+		}
+		write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+		bio_set_dev(write_bio, emulation_io->bdev_out);
+		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+		write_bio->bi_iter.bi_size = chunk;
+		ret = submit_bio_wait(write_bio);
+		kfree(write_bio);
+		if (ret)
+			break;
+
+		pos_in += chunk;
+		pos_out += chunk;
+	}
+	cio->status = ret;
+	kvfree(emulation_io->buf);
+	kfree(emulation_io);
+	blkdev_copy_endio(cio);
+}
+
+static inline ssize_t queue_max_hw_bytes(struct request_queue *q)
+{
+	return min_t(ssize_t, queue_max_hw_sectors(q) << SECTOR_SHIFT,
+		     queue_max_segments(q) << PAGE_SHIFT);
+}
+/*
+ * @bdev_in:	source block device
+ * @pos_in:	source offset
+ * @bdev_out:	destination block device
+ * @pos_out:	destination offset
+ * @len:	length in bytes to be copied
+ * @endio:	endio function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	endio function will be called with this private data,
+ *		for synchronous operation this should be NULL
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * For synchronous operation returns the length of bytes copied or error
+ * For asynchronous operation returns -EIOCBQUEUED or error
+ *
+ * Description:
+ *	If native copy offload feature is absent, caller can use this function
+ *	as fallback to perform copy.
+ *	We store information required to perform the copy along with temporary
+ *	buffer allocation. We async punt copy emulation to a worker. And worker
+ *	performs copy in 2 steps.
+ *	1. Read data from source to temporary buffer
+ *	2. Write data to destination from temporary buffer
+ */
+ssize_t blkdev_copy_emulation(struct block_device *bdev_in, loff_t pos_in,
+			      struct block_device *bdev_out, loff_t pos_out,
+			      size_t len, void (*endio)(void *, int, ssize_t),
+			      void *private, gfp_t gfp)
+{
+	struct request_queue *in = bdev_get_queue(bdev_in);
+	struct request_queue *out = bdev_get_queue(bdev_out);
+	struct blkdev_copy_emulation_io *emulation_io;
+	struct blkdev_copy_io *cio;
+	ssize_t ret;
+	size_t max_hw_bytes = min(queue_max_hw_bytes(in),
+				  queue_max_hw_bytes(out));
+
+	ret = blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len);
+	if (ret)
+		return ret;
+
+	cio = kzalloc(sizeof(*cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+
+	cio->waiter = current;
+	cio->copied = len;
+	cio->endio = endio;
+	cio->private = private;
+
+	emulation_io = kzalloc(sizeof(*emulation_io), gfp);
+	if (!emulation_io)
+		goto err_free_cio;
+	emulation_io->cio = cio;
+	INIT_WORK(&emulation_io->emulation_work, blkdev_copy_emulation_work);
+	emulation_io->pos_in = pos_in;
+	emulation_io->pos_out = pos_out;
+	emulation_io->len = len;
+	emulation_io->bdev_in = bdev_in;
+	emulation_io->bdev_out = bdev_out;
+	emulation_io->gfp = gfp;
+
+	emulation_io->buf = blkdev_copy_alloc_buf(min(max_hw_bytes, len),
+						  &emulation_io->buf_len, gfp);
+	if (!emulation_io->buf)
+		goto err_free_emulation_io;
+
+	schedule_work(&emulation_io->emulation_work);
+
+	if (cio->endio)
+		return -EIOCBQUEUED;
+
+	return blkdev_copy_wait_io_completion(cio);
+
+err_free_emulation_io:
+	kfree(emulation_io);
+err_free_cio:
+	kfree(cio);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_emulation);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5405499bcf22..e0a832a1c3a7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1046,6 +1046,10 @@ ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
 			    loff_t pos_out, size_t len,
 			    void (*endio)(void *, int, ssize_t),
 			    void *private, gfp_t gfp_mask);
+ssize_t blkdev_copy_emulation(struct block_device *bdev_in, loff_t pos_in,
+			      struct block_device *bdev_out, loff_t pos_out,
+			      size_t len, void (*endio)(void *, int, ssize_t),
+			      void *private, gfp_t gfp);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
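
For readers who want to experiment with the two-step emulation outside the kernel, the same idea (allocate a bounce buffer, shrinking the request on failure, then read a chunk and write it out until the length is consumed) can be modelled in userspace with file descriptors. This is a simplified sketch, not the patch's code: `demo_alloc_buf()`, `demo_copy_emulation()` and `DEMO_MIN_BUF` are hypothetical names, `pread()`/`pwrite()` stand in for the bios, short reads and writes are treated as -EIO, and unlike the patch the request is rounded up to the minimum buffer size so that small copies still get a buffer.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_MIN_BUF 4096	/* stand-in for PAGE_SIZE */

/* Try req_size first and halve it on failure, down to DEMO_MIN_BUF. */
static void *demo_alloc_buf(size_t req_size, size_t *alloc_size)
{
	assert(alloc_size);

	while (req_size >= DEMO_MIN_BUF) {
		void *buf = malloc(req_size);

		if (buf) {
			*alloc_size = req_size;
			return buf;
		}
		req_size >>= 1;
	}
	return NULL;
}

/*
 * Copy len bytes from fd_in@pos_in to fd_out@pos_out through a bounce
 * buffer of at most max_chunk bytes (the analogue of the queue limits).
 * Returns len on success or a negative errno value.
 */
static ssize_t demo_copy_emulation(int fd_in, off_t pos_in,
				   int fd_out, off_t pos_out,
				   size_t len, size_t max_chunk)
{
	size_t want = len < max_chunk ? len : max_chunk;
	size_t buf_len, rem, chunk;
	char *buf;
	int err = 0;

	if (want < DEMO_MIN_BUF)
		want = DEMO_MIN_BUF;
	buf = demo_alloc_buf(want, &buf_len);
	if (!buf)
		return -ENOMEM;

	for (rem = len; rem > 0; rem -= chunk) {
		ssize_t n;

		chunk = rem < buf_len ? rem : buf_len;

		/* Step 1: read from the source into the bounce buffer. */
		n = pread(fd_in, buf, chunk, pos_in);
		if (n != (ssize_t)chunk) {
			err = n < 0 ? -errno : -EIO;
			break;
		}

		/* Step 2: write the bounce buffer to the destination. */
		n = pwrite(fd_out, buf, chunk, pos_out);
		if (n != (ssize_t)chunk) {
			err = n < 0 ? -errno : -EIO;
			break;
		}

		pos_in += chunk;
		pos_out += chunk;
	}

	free(buf);
	return err ? err : (ssize_t)len;
}
```

Note that the per-iteration transfer size here is the chunk actually being moved, so the final, shorter chunk never reads or writes past the requested length.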