
[3/7] block: copy offload support infrastructure

Message ID 20210817101423.12367-4-selvakuma.s1@samsung.com (mailing list archive)
State New, archived
Series [1/7] block: make bio_map_kern() non static

Commit Message

SelvaKumar S Aug. 17, 2021, 10:14 a.m. UTC
From: Nitesh Shetty <nj.shetty@samsung.com>

Introduce REQ_OP_COPY, a no-merge copy offload operation. Create a bio
with the copy control information as its payload and submit it to the
device. A larger copy operation may be split, if necessary, based on the
device limits. REQ_OP_COPY (19) is a write op and takes the
zone_write_lock when submitted to a zoned device.
Native copy offload is not supported for stacked devices.
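
For illustration, a minimal, hedged sketch of how an in-kernel caller
might use this interface (blkdev_issue_copy() and struct range_entry are
taken from the patch below; the helper name and the sector numbers are
made up for this example):

	#include <linux/blkdev.h>
	#include <linux/fs.h>

	/* Copy two discontiguous source ranges, back to back, to sector 4096
	 * of the same device. All values are in 512-byte sectors. */
	static int copy_two_ranges(struct block_device *bdev)
	{
		struct range_entry ranges[] = {
			{ .src = 0,    .len = 8  },	/* sectors 0..7 */
			{ .src = 1024, .len = 16 },	/* sectors 1024..1039 */
		};

		/* Offload is attempted only when source and destination share
		 * a request queue that advertises SCC support. */
		return blkdev_issue_copy(bdev, ARRAY_SIZE(ranges), ranges,
					 bdev, 4096, GFP_KERNEL, 0);
	}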

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
---
 block/blk-core.c          |  84 ++++++++++++-
 block/blk-lib.c           | 252 ++++++++++++++++++++++++++++++++++++++
 block/blk-zoned.c         |   1 +
 block/bounce.c            |   1 +
 include/linux/bio.h       |   1 +
 include/linux/blk_types.h |  20 +++
 include/linux/blkdev.h    |  13 ++
 include/uapi/linux/fs.h   |  12 ++
 8 files changed, 378 insertions(+), 6 deletions(-)

Comments

Bart Van Assche Aug. 17, 2021, 5:14 p.m. UTC | #1
On 8/17/21 3:14 AM, SelvaKumar S wrote:
> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> bio with control information as payload and submit to the device.
> Larger copy operation may be divided if necessary by looking at device
> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> submitted to zoned device.
> Native copy offload is not supported for stacked devices.

Using a single operation for copy-offloading instead of separate 
operations for reading and writing is fundamentally incompatible with 
the device mapper. I think we need a copy-offloading implementation that 
is compatible with the device mapper.

Storing the parameters of the copy operation in the bio payload is 
incompatible with the current implementation of bio_split().
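
To make that concrete, here is a minimal sketch (assumed stacking-driver
code, not taken from this series; the helper name is made up) of the
usual pattern for splitting a bio at a target boundary:

	#include <linux/bio.h>
	#include <linux/blkdev.h>

	/* Split off the front of a bio by *data* sectors; the remainder is
	 * chained to the front part and resubmitted, and the front part is
	 * returned so the caller can remap and issue it. */
	static struct bio *split_front(struct bio *bio,
				       unsigned int boundary_sectors,
				       struct bio_set *bs)
	{
		struct bio *split = bio_split(bio, boundary_sectors, GFP_NOIO, bs);

		bio_chain(split, bio);
		submit_bio_noacct(bio);
		return split;
	}

For the proposed REQ_OP_COPY, bi_iter describes the control payload
(struct blk_copy_payload) rather than the data being copied, so a split
point computed from data sectors would cut the payload mid-structure.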

In other words, I think there are fundamental problems with this patch 
series.

Bart.

kernel test robot Aug. 17, 2021, 8:35 p.m. UTC | #2
Hi SelvaKumar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on block/for-next]
[also build test WARNING on dm/for-next linus/master v5.14-rc6 next-20210817]
[cannot apply to linux-nvme/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/SelvaKumar-S/block-make-bio_map_kern-non-static/20210817-193111
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: hexagon-randconfig-r013-20210816 (attached as .config)
compiler: clang version 12.0.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/35fc502a7f20a7cd42432cee2777a621c40a3bd3
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review SelvaKumar-S/block-make-bio_map_kern-non-static/20210817-193111
        git checkout 35fc502a7f20a7cd42432cee2777a621c40a3bd3
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=hexagon 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> block/blk-lib.c:197:5: warning: no previous prototype for function 'blk_copy_offload_submit_bio' [-Wmissing-prototypes]
   int blk_copy_offload_submit_bio(struct block_device *bdev,
       ^
   block/blk-lib.c:197:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int blk_copy_offload_submit_bio(struct block_device *bdev,
   ^
   static 
>> block/blk-lib.c:250:5: warning: no previous prototype for function 'blk_copy_offload_scc' [-Wmissing-prototypes]
   int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
       ^
   block/blk-lib.c:250:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
   ^
   static 
   2 warnings generated.


vim +/blk_copy_offload_submit_bio +197 block/blk-lib.c

   196	
 > 197	int blk_copy_offload_submit_bio(struct block_device *bdev,
   198			struct blk_copy_payload *payload, int payload_size,
   199			struct cio *cio, gfp_t gfp_mask)
   200	{
   201		struct request_queue *q = bdev_get_queue(bdev);
   202		struct bio *bio;
   203	
   204		bio = bio_map_kern(q, payload, payload_size, gfp_mask);
   205		if (IS_ERR(bio))
   206			return PTR_ERR(bio);
   207	
   208		bio_set_dev(bio, bdev);
   209		bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
   210		bio->bi_iter.bi_sector = payload->dest;
   211		bio->bi_end_io = cio_bio_end_io;
   212		bio->bi_private = cio;
   213		atomic_inc(&cio->refcount);
   214		submit_bio(bio);
   215	
   216		return 0;
   217	}
   218	
   219	/* Go through all the enrties inside user provided payload, and determine the
   220	 * maximum number of entries in a payload, based on device's scc-limits.
   221	 */
   222	static inline int blk_max_payload_entries(int nr_srcs, struct range_entry *rlist,
   223			int max_nr_srcs, sector_t max_copy_range_sectors, sector_t max_copy_len)
   224	{
   225		sector_t range_len, copy_len = 0, remaining = 0;
   226		int ri = 0, pi = 1, max_pi = 0;
   227	
   228		for (ri = 0; ri < nr_srcs; ri++) {
   229			for (remaining = rlist[ri].len; remaining > 0; remaining -= range_len) {
   230				range_len = min3(remaining, max_copy_range_sectors,
   231									max_copy_len - copy_len);
   232				pi++;
   233				copy_len += range_len;
   234	
   235				if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
   236					max_pi = max(max_pi, pi);
   237					pi = 1;
   238					copy_len = 0;
   239				}
   240			}
   241		}
   242	
   243		return max(max_pi, pi);
   244	}
   245	
   246	/*
   247	 * blk_copy_offload_scc	- Use device's native copy offload feature
   248	 * Go through user provide payload, prepare new payload based on device's copy offload limits.
   249	 */
 > 250	int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
   251			struct range_entry *rlist, struct block_device *dest_bdev,
   252			sector_t dest, gfp_t gfp_mask)
   253	{
   254		struct request_queue *q = bdev_get_queue(dest_bdev);
   255		struct cio *cio = NULL;
   256		struct blk_copy_payload *payload;
   257		sector_t range_len, copy_len = 0, remaining = 0;
   258		sector_t src_blk, cdest = dest;
   259		sector_t max_copy_range_sectors, max_copy_len;
   260		int ri = 0, pi = 0, ret = 0, payload_size, max_pi, max_nr_srcs;
   261	
   262		cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
   263		if (!cio)
   264			return -ENOMEM;
   265		atomic_set(&cio->refcount, 0);
   266	
   267		max_nr_srcs = q->limits.max_copy_nr_ranges;
   268		max_copy_range_sectors = q->limits.max_copy_range_sectors;
   269		max_copy_len = q->limits.max_copy_sectors;
   270	
   271		max_pi = blk_max_payload_entries(nr_srcs, rlist, max_nr_srcs,
   272						max_copy_range_sectors, max_copy_len);
   273		payload_size = struct_size(payload, range, max_pi);
   274	
   275		payload = kvmalloc(payload_size, gfp_mask);
   276		if (!payload) {
   277			ret = -ENOMEM;
   278			goto free_cio;
   279		}
   280		payload->src_bdev = src_bdev;
   281	
   282		for (ri = 0; ri < nr_srcs; ri++) {
   283			for (remaining = rlist[ri].len, src_blk = rlist[ri].src; remaining > 0;
   284							remaining -= range_len, src_blk += range_len) {
   285	
   286				range_len = min3(remaining, max_copy_range_sectors,
   287									max_copy_len - copy_len);
   288				payload->range[pi].len = range_len;
   289				payload->range[pi].src = src_blk;
   290				pi++;
   291				copy_len += range_len;
   292	
   293				/* Submit current payload, if crossing device copy limits */
   294				if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
   295					payload->dest = cdest;
   296					payload->copy_nr_ranges = pi;
   297					ret = blk_copy_offload_submit_bio(dest_bdev, payload,
   298									payload_size, cio, gfp_mask);
   299					if (ret)
   300						goto free_payload;
   301	
   302					/* reset index, length and allocate new payload */
   303					pi = 0;
   304					cdest += copy_len;
   305					copy_len = 0;
   306					payload = kvmalloc(payload_size, gfp_mask);
   307					if (!payload) {
   308						ret = -ENOMEM;
   309						goto free_cio;
   310					}
   311					payload->src_bdev = src_bdev;
   312				}
   313			}
   314		}
   315	
   316		if (pi) {
   317			payload->dest = cdest;
   318			payload->copy_nr_ranges = pi;
   319			ret = blk_copy_offload_submit_bio(dest_bdev, payload, payload_size, cio, gfp_mask);
   320			if (ret)
   321				goto free_payload;
   322		}
   323	
   324		/* Wait for completion of all IO's*/
   325		ret = cio_await_completion(cio);
   326	
   327		return ret;
   328	
   329	free_payload:
   330		kvfree(payload);
   331	free_cio:
   332		cio_await_completion(cio);
   333		return ret;
   334	}
   335	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Mikulas Patocka Aug. 17, 2021, 8:41 p.m. UTC | #3
On Tue, 17 Aug 2021, Bart Van Assche wrote:

> On 8/17/21 3:14 AM, SelvaKumar S wrote:
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > Larger copy operation may be divided if necessary by looking at device
> > limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> > submitted to zoned device.
> > Native copy offload is not supported for stacked devices.
> 
> Using a single operation for copy-offloading instead of separate operations
> for reading and writing is fundamentally incompatible with the device mapper.
> I think we need a copy-offloading implementation that is compatible with the
> device mapper.

I once wrote a copy offload implementation that is compatible with device 
mapper. The copy operation creates two bios (one for reading and one for 
writing), passes them independently through device mapper and pairs them 
at the physical device driver.

It's here: http://people.redhat.com/~mpatocka/patches/kernel/xcopy/current
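
Very roughly, and with hypothetical names (the actual design is in the
patches linked above), the pairing idea looks like this:

	#include <linux/blkdev.h>

	/* Shared token carried by both halves of one copy operation. */
	struct copy_token {
		struct block_device *src_bdev;	/* filled in by the read side */
		sector_t src_sector;		/* source start after remapping */
		sector_t nr_sects;
	};

	/*
	 * One bio is sent down the source stack as a read-like copy op and one
	 * down the destination stack as a write-like copy op; both point at the
	 * same token. Device mapper remaps each bio independently, and only the
	 * physical driver, which sees both halves, turns the pair into a single
	 * offloaded command (e.g. SCSI EXTENDED COPY).
	 */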

I verified that it works with iSCSI. Would you be interested in continuing 
this work?

Mikulas

> Storing the parameters of the copy operation in the bio payload is
> incompatible with the current implementation of bio_split().
> 
> In other words, I think there are fundamental problems with this patch series.
> 
> Bart.
> 

Douglas Gilbert Aug. 17, 2021, 9:53 p.m. UTC | #4
On 2021-08-17 4:41 p.m., Mikulas Patocka wrote:
> 
> 
> On Tue, 17 Aug 2021, Bart Van Assche wrote:
> 
>> On 8/17/21 3:14 AM, SelvaKumar S wrote:
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>> bio with control information as payload and submit to the device.
>>> Larger copy operation may be divided if necessary by looking at device
>>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
>>> submitted to zoned device.
>>> Native copy offload is not supported for stacked devices.
>>
>> Using a single operation for copy-offloading instead of separate operations
>> for reading and writing is fundamentally incompatible with the device mapper.
>> I think we need a copy-offloading implementation that is compatible with the
>> device mapper.
> 
> I once wrote a copy offload implementation that is compatible with device
> mapper. The copy operation creates two bios (one for reading and one for
> writing), passes them independently through device mapper and pairs them
> at the physical device driver.
> 
> It's here: http://people.redhat.com/~mpatocka/patches/kernel/xcopy/current

In my copy solution the read-side and write-side bio pairs share the same
storage (i.e. RAM). This gets around the need to copy data between the bios.
See:
    https://sg.danny.cz/sg/sg_v40.html
in Section 8 on Request sharing. This technique can be efficiently extended to
source --> destination1,destination2,... copies.

Doug Gilbert

> I verified that it works with iSCSI. Would you be interested in continuing
> this work?
> 
> Mikulas
> 
>> Storing the parameters of the copy operation in the bio payload is
>> incompatible with the current implementation of bio_split().
>>
>> In other words, I think there are fundamental problems with this patch series.
>>
>> Bart.
>>
> 

Bart Van Assche Aug. 17, 2021, 10:06 p.m. UTC | #5
On 8/17/21 2:53 PM, Douglas Gilbert wrote:
> On 2021-08-17 4:41 p.m., Mikulas Patocka wrote:
>> On Tue, 17 Aug 2021, Bart Van Assche wrote:
>>> On 8/17/21 3:14 AM, SelvaKumar S wrote:
>>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>>> bio with control information as payload and submit to the device.
>>>> Larger copy operation may be divided if necessary by looking at device
>>>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
>>>> submitted to zoned device.
>>>> Native copy offload is not supported for stacked devices.
>>>
>>> Using a single operation for copy-offloading instead of separate 
>>> operations
>>> for reading and writing is fundamentally incompatible with the device 
>>> mapper.
>>> I think we need a copy-offloading implementation that is compatible 
>>> with the
>>> device mapper.
>>
>> I once wrote a copy offload implementation that is compatible with device
>> mapper. The copy operation creates two bios (one for reading and one for
>> writing), passes them independently through device mapper and pairs them
>> at the physical device driver.
>>
>> It's here: 
>> http://people.redhat.com/~mpatocka/patches/kernel/xcopy/current
> 
> In my copy solution the read-side and write-side bio pairs share the 
> same storage (i.e. ram) This gets around the need to copy data between 
> the bio_s.
> See:
>     https://sg.danny.cz/sg/sg_v40.html
> in Section 8 on Request sharing. This technique can be efficiently 
> extend to
> source --> destination1,destination2,...      copies.
> 
> Doug Gilbert
> 
>> I verified that it works with iSCSI. Would you be interested in 
>> continuing
>> this work?

Hi Mikulas and Doug,

Yes, I'm interested in continuing Mikulas' work on copy offloading. I 
will take a look at Doug's approach too for sharing buffers between 
read-side and write-side bios. It may take a few months however before I 
can find the time to work on this.

Thanks,

Bart.

Martin K. Petersen Aug. 18, 2021, 6:35 p.m. UTC | #6
> Native copy offload is not supported for stacked devices.

One of the main reasons that the historic attempts at supporting copy
offload did not get merged was that the ubiquitous deployment scenario,
stacked block devices, was not handled well.

Pitfalls surrounding stacking have been brought up several times in
response to your series. It is critically important that both kernel
plumbing and user-facing interfaces are defined in a way that works for
the most common use cases. This includes copying between block devices
and handling block device stacking. Stacking is one of the most
fundamental operating principles of the Linux block layer!

Proposing a brand new interface that out of the gate is incompatible
with both stacking and the copy offload capability widely implemented in
shipping hardware makes little sense. While NVMe currently only supports
copy operations inside a single namespace, it is surely only a matter of
time before that restriction is lifted.

Changing existing interfaces is painful, especially when these are
exposed to userland. We obviously can't predict every field or feature
that may be needed in the future. But we should at the very least build
the infrastructure around what already exists. And that's where the
proposed design falls short...
Kanchan Joshi Aug. 20, 2021, 10:39 a.m. UTC | #7
Bart, Mikulas

On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 8/17/21 3:14 AM, SelvaKumar S wrote:
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > Larger copy operation may be divided if necessary by looking at device
> > limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> > submitted to zoned device.
> > Native copy offload is not supported for stacked devices.
>
> Using a single operation for copy-offloading instead of separate
> operations for reading and writing is fundamentally incompatible with
> the device mapper. I think we need a copy-offloading implementation that
> is compatible with the device mapper.
>

While each read/write command covers a single contiguous range of the
device, with simple copy we get to operate on multiple discontiguous
ranges with a single command.
That seemed like a good opportunity to reduce control-plane traffic
(compared to read/write operations) as well.

With a separate read-and-write bio approach, each source range will
spawn at least one read, one write, and eventually one SCC command. And
it only gets worse, as there could be many such discontiguous ranges (for
the GC use case at least) coming from user space in a single payload.
The overall sequence will be:
- Receive a payload from user space
- Disassemble it into many read/write bio pairs at the block layer
- Reassemble those (somehow) in NVMe to reduce the number of simple-copy commands
- Send the commands to the device

We thought the payload could be a good way to reduce that
disassembly/reassembly work and the traffic between the block layer and NVMe.
How do you see this tradeoff? What seems necessary for the device-mapper
use case appears to be a cost when the device mapper isn't used.
Especially for SCC (since the copy is within a single namespace), device
mappers may not be too compelling anyway.

Must device-mapper support be a requirement for the initial support atop SCC?
Or do you think it would still be progress if we finalize the
user-space interface to cover all that is foreseeable, and add a
device-mapper-compatible transport between the block layer and NVMe
at a later stage, when NVMe too comes up with better copy
capabilities?
Kanchan Joshi Aug. 20, 2021, 11:11 a.m. UTC | #8
On Thu, Aug 19, 2021 at 12:05 AM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> > Native copy offload is not supported for stacked devices.
>
> One of the main reasons that the historic attempts at supporting copy
> offload did not get merged was that the ubiquitous deployment scenario,
> stacked block devices, was not handled well.
>
> Pitfalls surrounding stacking has been brought up several times in
> response to your series. It is critically important that both kernel
> plumbing and user-facing interfaces are defined in a way that works for
> the most common use cases. This includes copying between block devices
> and handling block device stacking. Stacking being one of the most
> fundamental operating principles of the Linux block layer!
>
> Proposing a brand new interface that out of the gate is incompatible
> with both stacking and the copy offload capability widely implemented in
> shipping hardware makes little sense. While NVMe currently only supports
> copy operations inside a single namespace, it is surely only a matter of
> time before that restriction is lifted.
>
> Changing existing interfaces is painful, especially when these are
> exposed to userland. We obviously can't predict every field or feature
> that may be needed in the future. But we should at the very least build
> the infrastructure around what already exists. And that's where the
> proposed design falls short...
>
Certainly, on the user-space interface. We've got a few cracks to fill
there with respect to future viability.
But on stacking, can that be additive? Could you please take a look at
the other response (the comment from Bart) for the trade-offs.
Bart Van Assche Aug. 20, 2021, 9:18 p.m. UTC | #9
On 8/20/21 3:39 AM, Kanchan Joshi wrote:
> Bart, Mikulas
> 
> On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 8/17/21 3:14 AM, SelvaKumar S wrote:
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>> bio with control information as payload and submit to the device.
>>> Larger copy operation may be divided if necessary by looking at device
>>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
>>> submitted to zoned device.
>>> Native copy offload is not supported for stacked devices.
>>
>> Using a single operation for copy-offloading instead of separate
>> operations for reading and writing is fundamentally incompatible with
>> the device mapper. I think we need a copy-offloading implementation that
>> is compatible with the device mapper.
>>
> 
> While each read/write command is for a single contiguous range of
> device, with simple-copy we get to operate on multiple discontiguous
> ranges, with a single command.
> That seemed like a good opportunity to reduce control-plane traffic
> (compared to read/write operations) as well.
> 
> With a separate read-and-write bio approach, each source-range will
> spawn at least one read, one write and eventually one SCC command. And
> it only gets worse as there could be many such discontiguous ranges (for
> GC use-case at least) coming from user-space in a single payload.
> Overall sequence will be
> - Receive a payload from user-space
> - Disassemble into many read-write pair bios at block-layer
> - Assemble those (somehow) in NVMe to reduce simple-copy commands
> - Send commands to device
> 
> We thought payload could be a good way to reduce the
> disassembly/assembly work and traffic between block-layer to nvme.
> How do you see this tradeoff?  What seems necessary for device-mapper
> usecase, appears to be a cost when device-mapper isn't used.
> Especially for SCC (since copy is within single ns), device-mappers
> may not be too compelling anyway.
> 
> Must device-mapper support be a requirement for the initial support atop SCC?
> Or do you think it will still be a progress if we finalize the
> user-space interface to cover all that is foreseeable.And for
> device-mapper compatible transport between block-layer and NVMe - we
> do it in the later stage when NVMe too comes up with better copy
> capabilities?

Hi Kanchan,

These days there might be more systems that run the device mapper on top 
of the NVMe driver or a SCSI driver than systems that do not use the device 
mapper. It is common practice these days to use dm-crypt on personal 
workstations and laptops. LVM (dm-linear) is popular because it is more 
flexible than a traditional partition table. Android phones use 
dm-verity on top of hardware encryption. In other words, not supporting 
the device mapper means that a very large number of use cases is 
excluded. So I think supporting the device mapper from the start is 
important, even if that means combining individual bios at the bottom of 
the storage stack into simple copy commands.

Thanks,

Bart.

Nitesh Shetty Aug. 26, 2021, 7:46 a.m. UTC | #10
Hi Bart, Mikulas, Martin, Douglas,

We will go through your previous work and use this thread as a medium for
further discussion, if we come across issues to be sorted out.

Thank you,
Nitesh Shetty

On Sat, Aug 21, 2021 at 2:48 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 8/20/21 3:39 AM, Kanchan Joshi wrote:
> > Bart, Mikulas
> >
> > On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>
> >> On 8/17/21 3:14 AM, SelvaKumar S wrote:
> >>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> >>> bio with control information as payload and submit to the device.
> >>> Larger copy operation may be divided if necessary by looking at device
> >>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> >>> submitted to zoned device.
> >>> Native copy offload is not supported for stacked devices.
> >>
> >> Using a single operation for copy-offloading instead of separate
> >> operations for reading and writing is fundamentally incompatible with
> >> the device mapper. I think we need a copy-offloading implementation that
> >> is compatible with the device mapper.
> >>
> >
> > While each read/write command is for a single contiguous range of
> > device, with simple-copy we get to operate on multiple discontiguous
> > ranges, with a single command.
> > That seemed like a good opportunity to reduce control-plane traffic
> > (compared to read/write operations) as well.
> >
> > With a separate read-and-write bio approach, each source-range will
> > spawn at least one read, one write and eventually one SCC command. And
> > it only gets worse as there could be many such discontiguous ranges (for
> > GC use-case at least) coming from user-space in a single payload.
> > Overall sequence will be
> > - Receive a payload from user-space
> > - Disassemble into many read-write pair bios at block-layer
> > - Assemble those (somehow) in NVMe to reduce simple-copy commands
> > - Send commands to device
> >
> > We thought payload could be a good way to reduce the
> > disassembly/assembly work and traffic between block-layer to nvme.
> > How do you see this tradeoff?  What seems necessary for device-mapper
> > usecase, appears to be a cost when device-mapper isn't used.
> > Especially for SCC (since copy is within single ns), device-mappers
> > may not be too compelling anyway.
> >
> > Must device-mapper support be a requirement for the initial support atop SCC?
> > Or do you think it will still be a progress if we finalize the
> > user-space interface to cover all that is foreseeable.And for
> > device-mapper compatible transport between block-layer and NVMe - we
> > do it in the later stage when NVMe too comes up with better copy
> > capabilities?
>
> Hi Kanchan,
>
> These days there might be more systems that run the device mapper on top
> of the NVMe driver or a SCSI driver than systems that do use the device
> mapper. It is common practice these days to use dm-crypt on personal
> workstations and laptops. LVM (dm-linear) is popular because it is more
> flexible than a traditional partition table. Android phones use
> dm-verity on top of hardware encryption. In other words, not supporting
> the device mapper means that a very large number of use cases is
> excluded. So I think supporting the device mapper from the start is
> important, even if that means combining individual bios at the bottom of
> the storage stack into simple copy commands.
>
> Thanks,
>
> Bart.
>


Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index d2722ecd4d9b..541b1561b4af 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -704,6 +704,17 @@  static noinline int should_fail_bio(struct bio *bio)
 }
 ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
 
+static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
+		sector_t nr_sectors, sector_t max_sect)
+{
+	if (nr_sectors && max_sect &&
+	    (nr_sectors > max_sect || start > max_sect - nr_sectors)) {
+		handle_bad_sector(bio, max_sect);
+		return -EIO;
+	}
+	return 0;
+}
+
 /*
  * Check whether this bio extends beyond the end of the device or partition.
  * This may well happen - the kernel calls bread() without checking the size of
@@ -723,6 +734,61 @@  static inline int bio_check_eod(struct bio *bio)
 	return 0;
 }
 
+/*
+ * check for eod limits and remap ranges if needed
+ */
+static int blk_check_copy(struct bio *bio)
+{
+	struct blk_copy_payload *payload = bio_data(bio);
+	sector_t dst_max_sect, dst_start_sect, copy_size = 0;
+	sector_t src_max_sect, src_start_sect;
+	struct block_device *bd_part;
+	int i, ret = -EIO;
+
+	rcu_read_lock();
+
+	bd_part = bio->bi_bdev;
+	if (unlikely(!bd_part))
+		goto err;
+
+	dst_max_sect =  bdev_nr_sectors(bd_part);
+	dst_start_sect = bd_part->bd_start_sect;
+
+	src_max_sect = bdev_nr_sectors(payload->src_bdev);
+	src_start_sect = payload->src_bdev->bd_start_sect;
+
+	if (unlikely(should_fail_request(bd_part, bio->bi_iter.bi_size)))
+		goto err;
+
+	if (unlikely(bio_check_ro(bio)))
+		goto err;
+
+	rcu_read_unlock();
+
+	for (i = 0; i < payload->copy_nr_ranges; i++) {
+		ret = bio_check_copy_eod(bio, payload->range[i].src,
+				payload->range[i].len, src_max_sect);
+		if (unlikely(ret))
+			goto out;
+
+		payload->range[i].src += src_start_sect;
+		copy_size += payload->range[i].len;
+	}
+
+	/* check if copy length crosses eod */
+	ret = bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
+				copy_size, dst_max_sect);
+	if (unlikely(ret))
+		goto out;
+
+	bio->bi_iter.bi_sector += dst_start_sect;
+	return 0;
+err:
+	rcu_read_unlock();
+out:
+	return ret;
+}
+
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
@@ -799,13 +865,15 @@  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 
 	if (should_fail_bio(bio))
 		goto end_io;
-	if (unlikely(bio_check_ro(bio)))
-		goto end_io;
-	if (!bio_flagged(bio, BIO_REMAPPED)) {
-		if (unlikely(bio_check_eod(bio)))
-			goto end_io;
-		if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
+	if (likely(!op_is_copy(bio->bi_opf))) {
+		if (unlikely(bio_check_ro(bio)))
 			goto end_io;
+		if (!bio_flagged(bio, BIO_REMAPPED)) {
+			if (unlikely(bio_check_eod(bio)))
+				goto end_io;
+			if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
+				goto end_io;
+		}
 	}
 
 	/*
@@ -829,6 +897,10 @@  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		if (!blk_queue_discard(q))
 			goto not_supported;
 		break;
+	case REQ_OP_COPY:
+		if (unlikely(blk_check_copy(bio)))
+			goto end_io;
+		break;
 	case REQ_OP_SECURE_ERASE:
 		if (!blk_queue_secure_erase(q))
 			goto not_supported;
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9f09beadcbe3..7fee0ae95c44 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -151,6 +151,258 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * Wait on and process all in-flight BIOs.  This must only be called once
+ * all bios have been issued so that the refcount can only decrease.
+ * This just waits for all bios to make it through cio_bio_end_io.  I/O
+ * errors are propagated through cio->io_err.
+ */
+static int cio_await_completion(struct cio *cio)
+{
+	int ret = 0;
+
+	while (atomic_read(&cio->refcount)) {
+		cio->waiter = current;
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		blk_io_schedule();
+		/* wake up sets us TASK_RUNNING */
+		cio->waiter = NULL;
+		ret = cio->io_err;
+	}
+	kvfree(cio);
+
+	return ret;
+}
+
+/*
+ * The BIO completion handler simply decrements the refcount.
+ * It also wakes up the waiting process if this is the last bio to complete.
+ *
+ * During I/O bi_private points at the cio.
+ */
+static void cio_bio_end_io(struct bio *bio)
+{
+	struct cio *cio = bio->bi_private;
+
+	if (bio->bi_status)
+		cio->io_err = bio->bi_status;
+	kvfree(page_address(bio_first_bvec_all(bio)->bv_page) +
+			bio_first_bvec_all(bio)->bv_offset);
+	bio_put(bio);
+
+	if (atomic_dec_and_test(&cio->refcount) && cio->waiter)
+		wake_up_process(cio->waiter);
+}
+
+int blk_copy_offload_submit_bio(struct block_device *bdev,
+		struct blk_copy_payload *payload, int payload_size,
+		struct cio *cio, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	struct bio *bio;
+
+	bio = bio_map_kern(q, payload, payload_size, gfp_mask);
+	if (IS_ERR(bio))
+		return PTR_ERR(bio);
+
+	bio_set_dev(bio, bdev);
+	bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
+	bio->bi_iter.bi_sector = payload->dest;
+	bio->bi_end_io = cio_bio_end_io;
+	bio->bi_private = cio;
+	atomic_inc(&cio->refcount);
+	submit_bio(bio);
+
+	return 0;
+}
+
+/* Go through all the entries inside the user-provided payload, and determine the
+ * maximum number of entries in a payload, based on the device's SCC limits.
+ */
+static inline int blk_max_payload_entries(int nr_srcs, struct range_entry *rlist,
+		int max_nr_srcs, sector_t max_copy_range_sectors, sector_t max_copy_len)
+{
+	sector_t range_len, copy_len = 0, remaining = 0;
+	int ri = 0, pi = 1, max_pi = 0;
+
+	for (ri = 0; ri < nr_srcs; ri++) {
+		for (remaining = rlist[ri].len; remaining > 0; remaining -= range_len) {
+			range_len = min3(remaining, max_copy_range_sectors,
+								max_copy_len - copy_len);
+			pi++;
+			copy_len += range_len;
+
+			if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
+				max_pi = max(max_pi, pi);
+				pi = 1;
+				copy_len = 0;
+			}
+		}
+	}
+
+	return max(max_pi, pi);
+}
+
+/*
+ * blk_copy_offload_scc	- Use device's native copy offload feature
+ * Go through the user-provided payload and prepare new payloads based on the device's copy offload limits.
+ */
+int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(dest_bdev);
+	struct cio *cio = NULL;
+	struct blk_copy_payload *payload;
+	sector_t range_len, copy_len = 0, remaining = 0;
+	sector_t src_blk, cdest = dest;
+	sector_t max_copy_range_sectors, max_copy_len;
+	int ri = 0, pi = 0, ret = 0, payload_size, max_pi, max_nr_srcs;
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	atomic_set(&cio->refcount, 0);
+
+	max_nr_srcs = q->limits.max_copy_nr_ranges;
+	max_copy_range_sectors = q->limits.max_copy_range_sectors;
+	max_copy_len = q->limits.max_copy_sectors;
+
+	max_pi = blk_max_payload_entries(nr_srcs, rlist, max_nr_srcs,
+					max_copy_range_sectors, max_copy_len);
+	payload_size = struct_size(payload, range, max_pi);
+
+	payload = kvmalloc(payload_size, gfp_mask);
+	if (!payload) {
+		ret = -ENOMEM;
+		goto free_cio;
+	}
+	payload->src_bdev = src_bdev;
+
+	for (ri = 0; ri < nr_srcs; ri++) {
+		for (remaining = rlist[ri].len, src_blk = rlist[ri].src; remaining > 0;
+						remaining -= range_len, src_blk += range_len) {
+
+			range_len = min3(remaining, max_copy_range_sectors,
+								max_copy_len - copy_len);
+			payload->range[pi].len = range_len;
+			payload->range[pi].src = src_blk;
+			pi++;
+			copy_len += range_len;
+
+			/* Submit current payload, if crossing device copy limits */
+			if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
+				payload->dest = cdest;
+				payload->copy_nr_ranges = pi;
+				ret = blk_copy_offload_submit_bio(dest_bdev, payload,
+								payload_size, cio, gfp_mask);
+				if (ret)
+					goto free_payload;
+
+				/* reset index, length and allocate new payload */
+				pi = 0;
+				cdest += copy_len;
+				copy_len = 0;
+				payload = kvmalloc(payload_size, gfp_mask);
+				if (!payload) {
+					ret = -ENOMEM;
+					goto free_cio;
+				}
+				payload->src_bdev = src_bdev;
+			}
+		}
+	}
+
+	if (pi) {
+		payload->dest = cdest;
+		payload->copy_nr_ranges = pi;
+		ret = blk_copy_offload_submit_bio(dest_bdev, payload, payload_size, cio, gfp_mask);
+		if (ret)
+			goto free_payload;
+	}
+
+	/* Wait for completion of all I/Os */
+	ret = cio_await_completion(cio);
+
+	return ret;
+
+free_payload:
+	kvfree(payload);
+free_cio:
+	cio_await_completion(cio);
+	return ret;
+}
+
+static inline sector_t blk_copy_len(struct range_entry *rlist, int nr_srcs)
+{
+	int i;
+	sector_t len = 0;
+
+	for (i = 0; i < nr_srcs; i++) {
+		if (rlist[i].len)
+			len += rlist[i].len;
+		else
+			return 0;
+	}
+
+	return len;
+}
+
+static inline bool blk_check_offload_scc(struct request_queue *src_q,
+		struct request_queue *dest_q)
+{
+	if (src_q == dest_q && src_q->limits.copy_offload == BLK_COPY_OFFLOAD_SCC)
+		return true;
+
+	return false;
+}
+
+/*
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev:	source block device
+ * @nr_srcs:	number of source ranges to copy
+ * @src_rlist:	array of source ranges
+ * @dest_bdev:	destination block device
+ * @dest:	destination in sector
+ * @gfp_mask:   memory allocation flags (for bio_alloc)
+ * @flags:	BLKDEV_COPY_* flags to control behaviour
+ *
+ * Description:
+ *	Copy source ranges from source block device to destination block device.
+ *	The length of a source range cannot be zero.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *src_rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask, int flags)
+{
+	struct request_queue *src_q = bdev_get_queue(src_bdev);
+	struct request_queue *dest_q = bdev_get_queue(dest_bdev);
+	sector_t copy_len;
+	int ret = -EINVAL;
+
+	if (!src_q || !dest_q)
+		return -ENXIO;
+
+	if (!nr_srcs)
+		return -EINVAL;
+
+	if (nr_srcs >= MAX_COPY_NR_RANGE)
+		return -EINVAL;
+
+	copy_len = blk_copy_len(src_rlist, nr_srcs);
+	if (!copy_len || copy_len >= MAX_COPY_TOTAL_LENGTH)
+		return -EINVAL;
+
+	if (bdev_read_only(dest_bdev))
+		return -EPERM;
+
+	if (blk_check_offload_scc(src_q, dest_q))
+		ret = blk_copy_offload_scc(src_bdev, nr_srcs, src_rlist, dest_bdev, dest, gfp_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
 /**
  * __blkdev_issue_write_same - generate number of bios with same page
  * @bdev:	target blockdev
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 86fce751bb17..7643fc868521 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -67,6 +67,7 @@  bool blk_req_needs_zone_write_lock(struct request *rq)
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_WRITE:
+	case REQ_OP_COPY:
 		return blk_rq_zone_is_seq(rq);
 	default:
 		return false;
diff --git a/block/bounce.c b/block/bounce.c
index 05fc7148489d..d9b05aaf6e56 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -176,6 +176,7 @@  static struct bio *bounce_clone_bio(struct bio *bio_src)
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
 
 	switch (bio_op(bio)) {
+	case REQ_OP_COPY:
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3d67d0fbc868..068fa2e8896a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -73,6 +73,7 @@  static inline bool bio_has_data(struct bio *bio)
 static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_COPY ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
 	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9e392daa1d7f..1ab77176cb46 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -347,6 +347,8 @@  enum req_opf {
 	REQ_OP_ZONE_RESET	= 15,
 	/* reset all the zone present on the device */
 	REQ_OP_ZONE_RESET_ALL	= 17,
+	/* copy ranges within device */
+	REQ_OP_COPY		= 19,
 
 	/* Driver private requests */
 	REQ_OP_DRV_IN		= 34,
@@ -470,6 +472,11 @@  static inline bool op_is_discard(unsigned int op)
 	return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
 }
 
+static inline bool op_is_copy(unsigned int op)
+{
+	return (op & REQ_OP_MASK) == REQ_OP_COPY;
+}
+
 /*
  * Check if a bio or request operation is a zone management operation, with
  * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
@@ -529,4 +536,17 @@  struct blk_rq_stat {
 	u64 batch;
 };
 
+struct cio {
+	atomic_t refcount;
+	blk_status_t io_err;
+	struct task_struct *waiter;     /* waiting task (NULL if none) */
+};
+
+struct blk_copy_payload {
+	struct block_device	*src_bdev;
+	sector_t		dest;
+	int			copy_nr_ranges;
+	struct range_entry	range[];
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fd4cfaadda5b..38369dff6a36 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -52,6 +52,12 @@  struct blk_keyslot_manager;
 /* Doing classic polling */
 #define BLK_MQ_POLL_CLASSIC -1
 
+/* Define copy offload options */
+enum blk_copy {
+	BLK_COPY_OFFLOAD_EMULATE = 0,
+	BLK_COPY_OFFLOAD_SCC,
+};
+
 /*
  * Maximum number of blkcg policies allowed to be registered concurrently.
  * Defined here to simplify include dependency.
@@ -1051,6 +1057,9 @@  static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
 
+	if (unlikely(op == REQ_OP_COPY))
+		return q->limits.max_copy_sectors;
+
 	if (unlikely(op == REQ_OP_WRITE_SAME))
 		return q->limits.max_write_same_sectors;
 
@@ -1326,6 +1335,10 @@  extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, int flags,
 		struct bio **biop);
 
+int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *src_rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask, int flags);
+
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..7a97b588d892 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,18 @@  struct fstrim_range {
 	__u64 minlen;
 };
 
+/* Maximum number of entries supported */
+#define MAX_COPY_NR_RANGE	(1 << 12)
+
+/* Maximum total copy length */
+#define MAX_COPY_TOTAL_LENGTH	(1 << 21)
+
+/* Source range entry for copy */
+struct range_entry {
+	__u64 src;
+	__u64 len;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1