diff mbox series

[v5,02/10] block: Add copy offload support infrastructure

Message ID 20221123055827.26996-3-nj.shetty@samsung.com (mailing list archive)
State New, archived
Headers show
Series [v5,01/10] block: Introduce queue limits for copy-offload support | expand

Commit Message

Nitesh Shetty Nov. 23, 2022, 5:58 a.m. UTC
Introduce blkdev_issue_copy which supports source and destination bdevs,
and an array of (source, destination and copy length) tuples.
Introduce REQ_COPY copy offload operation flag. Create a read-write
bio pair with a token as payload and submitted to the device in order.
Read request populates token with source specific information which
is then passed with write request.
This design is courtesy Mikulas Patocka's token based copy

Larger copy will be divided, based on max_copy_sectors limit.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c           | 358 ++++++++++++++++++++++++++++++++++++++
 block/blk.h               |   2 +
 include/linux/blk_types.h |  44 +++++
 include/linux/blkdev.h    |   3 +
 include/uapi/linux/fs.h   |  15 ++
 5 files changed, 422 insertions(+)

Comments

Ming Lei Nov. 23, 2022, 8:04 a.m. UTC | #1
On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> Introduce blkdev_issue_copy which supports source and destination bdevs,
> and an array of (source, destination and copy length) tuples.
> Introduce REQ_COPY copy offload operation flag. Create a read-write
> bio pair with a token as payload and submitted to the device in order.
> Read request populates token with source specific information which
> is then passed with write request.
> This design is courtesy Mikulas Patocka's token based copy

I thought this patchset is just for enabling copy command which is
supported by hardware. But turns out it isn't, because blk_copy_offload()
still submits read/write bios for doing the copy.

I am just wondering why not let copy_file_range() cover this kind of copy,
and the framework has been there.

When I was researching pipe/splice code for supporting ublk zero copy[1], I
have got idea for async copy_file_range(), such as: io uring based
direct splice, user backed intermediate buffer, still zero copy, if these
ideas are finally implemented, we could get super-fast generic offload copy,
and bdev copy is really covered too.

[1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/

thanks,
Ming
Nitesh Shetty Nov. 23, 2022, 10:07 a.m. UTC | #2
On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > and an array of (source, destination and copy length) tuples.
> > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > bio pair with a token as payload and submitted to the device in order.
> > Read request populates token with source specific information which
> > is then passed with write request.
> > This design is courtesy Mikulas Patocka's token based copy
> 
> I thought this patchset is just for enabling copy command which is
> supported by hardware. But turns out it isn't, because blk_copy_offload()
> still submits read/write bios for doing the copy.
> 
> I am just wondering why not let copy_file_range() cover this kind of copy,
> and the framework has been there.
> 

Main goal was to enable copy command, but community suggested to add
copy emulation as well.

blk_copy_offload - actually issues copy command in driver layer.
The way read/write BIOs are percieved is different for copy offload.
In copy offload we check REQ_COPY flag in NVMe driver layer to issue
copy command. But we did missed it to add in other driver's, where they
might be treated as normal READ/WRITE.

blk_copy_emulate - is used if we fail or if device doesn't support native
copy offload command. Here we do READ/WRITE. Using copy_file_range for
emulation might be possible, but we see 2 issues here.
1. We explored possibility of pulling dm-kcopyd to block layer so that we 
can readily use it. But we found it had many dependecies from dm-layer.
So later dropped that idea.
2. copy_file_range, for block device atleast we saw few check's which fail
it for raw block device. At this point I dont know much about the history of
why such check is present.

> When I was researching pipe/splice code for supporting ublk zero copy[1], I
> have got idea for async copy_file_range(), such as: io uring based
> direct splice, user backed intermediate buffer, still zero copy, if these
> ideas are finally implemented, we could get super-fast generic offload copy,
> and bdev copy is really covered too.
> 
> [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> 

Seems interesting, We will take a look into this.

> 
> 

Thanks,
Nitesh
Ming Lei Nov. 24, 2022, 12:03 a.m. UTC | #3
On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > and an array of (source, destination and copy length) tuples.
> > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > bio pair with a token as payload and submitted to the device in order.
> > > Read request populates token with source specific information which
> > > is then passed with write request.
> > > This design is courtesy Mikulas Patocka's token based copy
> > 
> > I thought this patchset is just for enabling copy command which is
> > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > still submits read/write bios for doing the copy.
> > 
> > I am just wondering why not let copy_file_range() cover this kind of copy,
> > and the framework has been there.
> > 
> 
> Main goal was to enable copy command, but community suggested to add
> copy emulation as well.
> 
> blk_copy_offload - actually issues copy command in driver layer.
> The way read/write BIOs are percieved is different for copy offload.
> In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> copy command. But we did missed it to add in other driver's, where they
> might be treated as normal READ/WRITE.
> 
> blk_copy_emulate - is used if we fail or if device doesn't support native
> copy offload command. Here we do READ/WRITE. Using copy_file_range for
> emulation might be possible, but we see 2 issues here.
> 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> can readily use it. But we found it had many dependecies from dm-layer.
> So later dropped that idea.

Is it just because dm-kcopyd supports async copy? If yes, I believe we
can reply on io_uring for implementing async copy_file_range, which will
be generic interface for async copy, and could get better perf.

> 2. copy_file_range, for block device atleast we saw few check's which fail
> it for raw block device. At this point I dont know much about the history of
> why such check is present.

Got it, but IMO the check in generic_copy_file_checks() can be
relaxed to cover blkdev cause splice does support blkdev.

Then your bdev offload copy work can be simplified into:

1) implement .copy_file_range for def_blk_fops, suppose it is
blkdev_copy_file_range()

2) inside blkdev_copy_file_range()

- if the bdev supports offload copy, just submit one bio to the device,
and this will be converted to one pt req to device

- otherwise, fallback to generic_copy_file_range()

> 
> > When I was researching pipe/splice code for supporting ublk zero copy[1], I
> > have got idea for async copy_file_range(), such as: io uring based
> > direct splice, user backed intermediate buffer, still zero copy, if these
> > ideas are finally implemented, we could get super-fast generic offload copy,
> > and bdev copy is really covered too.
> > 
> > [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> > 
> 
> Seems interesting, We will take a look into this.

BTW, that is probably one direction of ublk's async zero copy IO too.


Thanks, 
Ming
Nitesh Shetty Nov. 29, 2022, 11:44 a.m. UTC | #4
On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > and an array of (source, destination and copy length) tuples.
> > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > bio pair with a token as payload and submitted to the device in order.
> > > > Read request populates token with source specific information which
> > > > is then passed with write request.
> > > > This design is courtesy Mikulas Patocka's token based copy
> > > 
> > > I thought this patchset is just for enabling copy command which is
> > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > still submits read/write bios for doing the copy.
> > > 
> > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > and the framework has been there.
> > > 
> > 
> > Main goal was to enable copy command, but community suggested to add
> > copy emulation as well.
> > 
> > blk_copy_offload - actually issues copy command in driver layer.
> > The way read/write BIOs are percieved is different for copy offload.
> > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > copy command. But we did missed it to add in other driver's, where they
> > might be treated as normal READ/WRITE.
> > 
> > blk_copy_emulate - is used if we fail or if device doesn't support native
> > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > emulation might be possible, but we see 2 issues here.
> > 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> > can readily use it. But we found it had many dependecies from dm-layer.
> > So later dropped that idea.
> 
> Is it just because dm-kcopyd supports async copy? If yes, I believe we
> can reply on io_uring for implementing async copy_file_range, which will
> be generic interface for async copy, and could get better perf.
>

It supports both sync and async. But used only inside dm-layer.
Async version of copy_file_range can help, using io-uring can be helpful
for user , but in-kernel users can't use uring.

> > 2. copy_file_range, for block device atleast we saw few check's which fail
> > it for raw block device. At this point I dont know much about the history of
> > why such check is present.
> 
> Got it, but IMO the check in generic_copy_file_checks() can be
> relaxed to cover blkdev cause splice does support blkdev.
> 
> Then your bdev offload copy work can be simplified into:
> 
> 1) implement .copy_file_range for def_blk_fops, suppose it is
> blkdev_copy_file_range()
> 
> 2) inside blkdev_copy_file_range()
> 
> - if the bdev supports offload copy, just submit one bio to the device,
> and this will be converted to one pt req to device
> 
> - otherwise, fallback to generic_copy_file_range()
>

We will check the feasibilty and try to implement the scheme in next versions.
It would be helpful, if someone in community know's why such checks were
present ? We see copy_file_range accepts only regular file. Was it
designed only for regular files or can we extend it to regular block
device.

> > 
> > > When I was researching pipe/splice code for supporting ublk zero copy[1], I
> > > have got idea for async copy_file_range(), such as: io uring based
> > > direct splice, user backed intermediate buffer, still zero copy, if these
> > > ideas are finally implemented, we could get super-fast generic offload copy,
> > > and bdev copy is really covered too.
> > > 
> > > [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> > > 
> > 
> > Seems interesting, We will take a look into this.
> 
> BTW, that is probably one direction of ublk's async zero copy IO too.
> 
> 
> Thanks, 
> Ming
> 
> 


Thanks,
Nitesh
Nitesh Shetty Dec. 7, 2022, 5:54 a.m. UTC | #5
On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > > and an array of (source, destination and copy length) tuples.
> > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > bio pair with a token as payload and submitted to the device in order.
> > > > > Read request populates token with source specific information which
> > > > > is then passed with write request.
> > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > 
> > > > I thought this patchset is just for enabling copy command which is
> > > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > > still submits read/write bios for doing the copy.
> > > > 
> > > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > > and the framework has been there.
> > > > 
> > > 
> > > Main goal was to enable copy command, but community suggested to add
> > > copy emulation as well.
> > > 
> > > blk_copy_offload - actually issues copy command in driver layer.
> > > The way read/write BIOs are percieved is different for copy offload.
> > > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > > copy command. But we did missed it to add in other driver's, where they
> > > might be treated as normal READ/WRITE.
> > > 
> > > blk_copy_emulate - is used if we fail or if device doesn't support native
> > > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > > emulation might be possible, but we see 2 issues here.
> > > 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> > > can readily use it. But we found it had many dependecies from dm-layer.
> > > So later dropped that idea.
> > 
> > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > can reply on io_uring for implementing async copy_file_range, which will
> > be generic interface for async copy, and could get better perf.
> >
> 
> It supports both sync and async. But used only inside dm-layer.
> Async version of copy_file_range can help, using io-uring can be helpful
> for user , but in-kernel users can't use uring.
> 
> > > 2. copy_file_range, for block device atleast we saw few check's which fail
> > > it for raw block device. At this point I dont know much about the history of
> > > why such check is present.
> > 
> > Got it, but IMO the check in generic_copy_file_checks() can be
> > relaxed to cover blkdev cause splice does support blkdev.
> > 
> > Then your bdev offload copy work can be simplified into:
> > 
> > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > blkdev_copy_file_range()
> > 
> > 2) inside blkdev_copy_file_range()
> > 
> > - if the bdev supports offload copy, just submit one bio to the device,
> > and this will be converted to one pt req to device
> > 
> > - otherwise, fallback to generic_copy_file_range()
> >
> 

Actually we sent initial version with single bio, but later community
suggested two bio's is must for offload, main reasoning being
dm-layer,Xcopy,copy across namespace compatibilty.

> We will check the feasibilty and try to implement the scheme in next versions.
> It would be helpful, if someone in community know's why such checks were
> present ? We see copy_file_range accepts only regular file. Was it
> designed only for regular files or can we extend it to regular block
> device.
>

As you suggested we were able to integrate def_blk_ops and
run with user application, but we see one main issue with this approach.
Using blkdev_copy_file_range requires having 2 file descriptors, which
is not possible for in kernel users such as fabrics/dm-kcopyd which has
only bdev descriptors.
Do you have any plumbing suggestions here ?

> > > 
> > > > When I was researching pipe/splice code for supporting ublk zero copy[1], I
> > > > have got idea for async copy_file_range(), such as: io uring based
> > > > direct splice, user backed intermediate buffer, still zero copy, if these
> > > > ideas are finally implemented, we could get super-fast generic offload copy,
> > > > and bdev copy is really covered too.
> > > > 
> > > > [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> > > > 
> > > 
> > > Seems interesting, We will take a look into this.
> > 
> > BTW, that is probably one direction of ublk's async zero copy IO too.
> > 
> > 
> > Thanks, 
> > Ming
> > 
> > 
> 
> 
> Thanks,
> Nitesh

Thanks,
Nitesh Shetty
Ming Lei Dec. 7, 2022, 11:19 a.m. UTC | #6
On Wed, Dec 07, 2022 at 11:24:00AM +0530, Nitesh Shetty wrote:
> On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> > On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > > > and an array of (source, destination and copy length) tuples.
> > > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > > bio pair with a token as payload and submitted to the device in order.
> > > > > > Read request populates token with source specific information which
> > > > > > is then passed with write request.
> > > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > > 
> > > > > I thought this patchset is just for enabling copy command which is
> > > > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > > > still submits read/write bios for doing the copy.
> > > > > 
> > > > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > > > and the framework has been there.
> > > > > 
> > > > 
> > > > Main goal was to enable copy command, but community suggested to add
> > > > copy emulation as well.
> > > > 
> > > > blk_copy_offload - actually issues copy command in driver layer.
> > > > The way read/write BIOs are percieved is different for copy offload.
> > > > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > > > copy command. But we did missed it to add in other driver's, where they
> > > > might be treated as normal READ/WRITE.
> > > > 
> > > > blk_copy_emulate - is used if we fail or if device doesn't support native
> > > > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > > > emulation might be possible, but we see 2 issues here.
> > > > 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> > > > can readily use it. But we found it had many dependecies from dm-layer.
> > > > So later dropped that idea.
> > > 
> > > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > > can reply on io_uring for implementing async copy_file_range, which will
> > > be generic interface for async copy, and could get better perf.
> > >
> > 
> > It supports both sync and async. But used only inside dm-layer.
> > Async version of copy_file_range can help, using io-uring can be helpful
> > for user , but in-kernel users can't use uring.
> > 
> > > > 2. copy_file_range, for block device atleast we saw few check's which fail
> > > > it for raw block device. At this point I dont know much about the history of
> > > > why such check is present.
> > > 
> > > Got it, but IMO the check in generic_copy_file_checks() can be
> > > relaxed to cover blkdev cause splice does support blkdev.
> > > 
> > > Then your bdev offload copy work can be simplified into:
> > > 
> > > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > > blkdev_copy_file_range()
> > > 
> > > 2) inside blkdev_copy_file_range()
> > > 
> > > - if the bdev supports offload copy, just submit one bio to the device,
> > > and this will be converted to one pt req to device
> > > 
> > > - otherwise, fallback to generic_copy_file_range()
> > >
> > 
> 
> Actually we sent initial version with single bio, but later community
> suggested two bio's is must for offload, main reasoning being

Is there any link which holds the discussion?

> dm-layer,Xcopy,copy across namespace compatibilty.

But dm kcopy has supported bdev copy already, so once your patch is
ready, dm kcopy can just sends one bio with REQ_COPY if the device
supports offload command, otherwise the current dm kcopy code can work
as before.

> 
> > We will check the feasibilty and try to implement the scheme in next versions.
> > It would be helpful, if someone in community know's why such checks were
> > present ? We see copy_file_range accepts only regular file. Was it
> > designed only for regular files or can we extend it to regular block
> > device.
> >
> 
> As you suggested we were able to integrate def_blk_ops and
> run with user application, but we see one main issue with this approach.
> Using blkdev_copy_file_range requires having 2 file descriptors, which
> is not possible for in kernel users such as fabrics/dm-kcopyd which has
> only bdev descriptors.
> Do you have any plumbing suggestions here ?

What is the fabrics kernel user? Any kernel target code(such as nvme target)
has file/bdev path available, vfs_copy_file_range() should be fine.

Also IMO, kernel copy user shouldn't be important long term, especially if
io_uring copy_file_range() can be supported, forwarding to userspace not
only gets better performance, but also cleanup kernel related copy code
much.


thanks, 
Ming
Nitesh Shetty Dec. 9, 2022, 8:16 a.m. UTC | #7
On Wed, Dec 07, 2022 at 07:19:00PM +0800, Ming Lei wrote:
> On Wed, Dec 07, 2022 at 11:24:00AM +0530, Nitesh Shetty wrote:
> > On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> > > On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > > > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > > > > and an array of (source, destination and copy length) tuples.
> > > > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > > > bio pair with a token as payload and submitted to the device in order.
> > > > > > > Read request populates token with source specific information which
> > > > > > > is then passed with write request.
> > > > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > > > 
> > > > > > I thought this patchset is just for enabling copy command which is
> > > > > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > > > > still submits read/write bios for doing the copy.
> > > > > > 
> > > > > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > > > > and the framework has been there.
> > > > > > 
> > > > > 
> > > > > Main goal was to enable copy command, but community suggested to add
> > > > > copy emulation as well.
> > > > > 
> > > > > blk_copy_offload - actually issues copy command in driver layer.
> > > > > The way read/write BIOs are percieved is different for copy offload.
> > > > > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > > > > copy command. But we did missed it to add in other driver's, where they
> > > > > might be treated as normal READ/WRITE.
> > > > > 
> > > > > blk_copy_emulate - is used if we fail or if device doesn't support native
> > > > > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > > > > emulation might be possible, but we see 2 issues here.
> > > > > 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> > > > > can readily use it. But we found it had many dependecies from dm-layer.
> > > > > So later dropped that idea.
> > > > 
> > > > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > > > can reply on io_uring for implementing async copy_file_range, which will
> > > > be generic interface for async copy, and could get better perf.
> > > >
> > > 
> > > It supports both sync and async. But used only inside dm-layer.
> > > Async version of copy_file_range can help, using io-uring can be helpful
> > > for user , but in-kernel users can't use uring.
> > > 
> > > > > 2. copy_file_range, for block device atleast we saw few check's which fail
> > > > > it for raw block device. At this point I dont know much about the history of
> > > > > why such check is present.
> > > > 
> > > > Got it, but IMO the check in generic_copy_file_checks() can be
> > > > relaxed to cover blkdev cause splice does support blkdev.
> > > > 
> > > > Then your bdev offload copy work can be simplified into:
> > > > 
> > > > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > > > blkdev_copy_file_range()
> > > > 
> > > > 2) inside blkdev_copy_file_range()
> > > > 
> > > > - if the bdev supports offload copy, just submit one bio to the device,
> > > > and this will be converted to one pt req to device
> > > > 
> > > > - otherwise, fallback to generic_copy_file_range()
> > > >
> > > 
> > 
> > Actually we sent initial version with single bio, but later community
> > suggested two bio's is must for offload, main reasoning being
> 
> Is there any link which holds the discussion?
> 

This[1] is the starting thread for LSF/MM which Chaitanya organized and it
was a conference call. Also in 2022 LSM/MM there was a discussion on copy
topic as well. One of the main conclusion was using 2 bio's as must have.

Bart has summarized previous copy efforts[2] and Martin too [3],
which might be of help in understanding why 2 bio's are must.

> > dm-layer,Xcopy,copy across namespace compatibilty.
> 
> But dm kcopy has supported bdev copy already, so once your patch is
> ready, dm kcopy can just sends one bio with REQ_COPY if the device
> supports offload command, otherwise the current dm kcopy code can work
> as before.
> 
> > 
> > > We will check the feasibilty and try to implement the scheme in next versions.
> > > It would be helpful, if someone in community know's why such checks were
> > > present ? We see copy_file_range accepts only regular file. Was it
> > > designed only for regular files or can we extend it to regular block
> > > device.
> > >
> > 
> > As you suggested we were able to integrate def_blk_ops and
> > run with user application, but we see one main issue with this approach.
> > Using blkdev_copy_file_range requires having 2 file descriptors, which
> > is not possible for in kernel users such as fabrics/dm-kcopyd which has
> > only bdev descriptors.
> > Do you have any plumbing suggestions here ?
> 
> What is the fabrics kernel user? 

In fabrics setup we are exposing target side NVMe device, to host as
copy offload capable device, irrespective of target device's capability.
This will enable sending copy offload command from host.
From target side, if device doesn't support offload then we emulate.
This way we will avoid data copy over the network.

> Any kernel target code(such as nvme target)
> has file/bdev path available, vfs_copy_file_range() should be fine.
> 

From target side, fabrics device can be created in 2 ways,
1. file backed device: In this setup we get file descriptor. In fact in our
patches[4] are using vfs_copy_file_range to offload copy.
2. block backed device: Here we have only blockdev descriptor.
This setup we can't use vfs_copy_file_range since we are missing file
descriptor.
Your input on how we can plumb blkdev_copy_file_range with bdev would help.

> Also IMO, kernel copy user shouldn't be important long term, especially if
> io_uring copy_file_range() can be supported, forwarding to userspace not
> only gets better performance, but also cleanup kernel related copy code
> much.
>

Using uring is valid if consumer of copy_file_range is userspace.
But there are many possible in-kernel user of copy, such as dm-kcopyd,
garbage collection case such as F2FS GC, also fabrics as explained above.
We can't use uring from these layers.

Present implementation of vfs_copy_file_range(CFR) lacks
a. async implementation, which helps in nvme over fabrics
b. multi range(src,dst pair) support
So here we see going to blkdev_copy_file_range actually regress our present
result. But we do see a goodness of using mature vfs_copy_file_range for
emulation.

[1] https://lore.kernel.org/linux-nvme/BYAPR04MB49652C4B75E38F3716F3C06386539@BYAPR04MB4965.namprd04.prod.outlook.com/
[2] https://github.com/bvanassche/linux-kernel-copy-offload
[3] https://oss.oracle.com/~mkp/docs/xcopy.pdf
[4] https://lore.kernel.org/lkml/20221130041450.GA17533@test-zns/T/#Z2e.:..:20221123055827.26996-7-nj.shetty::40samsung.com:1drivers:nvme:target:io-cmd-file.c

Thanks,
Nitesh Shetty
diff mbox series

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..2ce3c872ca49 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,364 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * bio_copy_*_write_end_io. IO errors are propagated through cio->io_error.
+ */
+static int cio_await_completion(struct cio *cio)
+{
+	int ret = 0;
+
+	atomic_dec(&cio->refcount);
+
+	if (cio->endio)
+		return 0;
+
+	if (atomic_read(&cio->refcount)) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		blk_io_schedule();
+	}
+
+	ret = cio->io_err;
+	kfree(cio);
+
+	return ret;
+}
+
+static void blk_copy_offload_write_end_io(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+
+	if (bio->bi_status) {
+		cio->io_err = blk_status_to_errno(bio->bi_status);
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+			cio->ranges[ri].dst;
+		cio->ranges[ri].comp_len = min_t(sector_t, clen,
+				cio->ranges[ri].comp_len);
+	}
+	__free_page(bio->bi_io_vec[0].bv_page);
+	bio_put(bio);
+
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_offload_read_end_io(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+	unsigned long flags;
+
+	if (read_bio->bi_status) {
+		cio->io_err = blk_status_to_errno(read_bio->bi_status);
+		goto err_rw_bio;
+	}
+
+	/* For zoned device, we check if completed bio is first entry in linked
+	 * list,
+	 * if yes, we start the worker to submit write bios.
+	 * if not, then we just update status of bio in ctx,
+	 * once the worker gets scheduled, it will submit writes for all
+	 * the consecutive REQ_COPY_READ_COMPLETE bios.
+	 */
+	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+		spin_lock_irqsave(&cio->list_lock, flags);
+		ctx->status = REQ_COPY_READ_COMPLETE;
+		if (ctx == list_first_entry(&cio->list,
+					struct copy_ctx, list)) {
+			spin_unlock_irqrestore(&cio->list_lock, flags);
+			schedule_work(&ctx->dispatch_work);
+			goto free_read_bio;
+		}
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+	} else
+		schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+	bio_put(read_bio);
+
+	return;
+
+err_rw_bio:
+	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+					cio->ranges[ri].src;
+	cio->ranges[ri].comp_len = min_t(sector_t, clen,
+					cio->ranges[ri].comp_len);
+	__free_page(read_bio->bi_io_vec[0].bv_page);
+	bio_put(ctx->write_bio);
+	bio_put(read_bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+
+	submit_bio(ctx->write_bio);
+}
+
+static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+	struct cio *cio = ctx->cio;
+	unsigned long flags = 0;
+
+	atomic_inc(&cio->refcount);
+	spin_lock_irqsave(&cio->list_lock, flags);
+
+	while (!list_empty(&cio->list)) {
+		ctx = list_first_entry(&cio->list, struct copy_ctx, list);
+
+		if (ctx->status == REQ_COPY_READ_PROGRESS)
+			break;
+
+		atomic_inc(&ctx->refcount);
+		ctx->status = REQ_COPY_WRITE_PROGRESS;
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+		submit_bio(ctx->write_bio);
+		spin_lock_irqsave(&cio->list_lock, flags);
+
+		list_del(&ctx->list);
+		if (atomic_dec_and_test(&ctx->refcount))
+			kfree(ctx);
+	}
+
+	spin_unlock_irqrestore(&cio->list_lock, flags);
+	if (atomic_dec_and_test(&cio->refcount))
+		blk_wake_io_task(cio->waiter);
+}
+
+/*
+ * blk_copy_offload	- Use device's native copy offload feature.
+ * we perform copy operation by sending 2 bio.
+ * 1. First we send a read bio with REQ_COPY flag along with a token and source
+ * and length. Once read bio reaches driver layer, device driver adds all the
+ * source info to token and does a fake completion.
+ * 2. Once read opration completes, we issue write with REQ_COPY flag with same
+ * token. In driver layer, token info is used to form a copy offload command.
+ *
+ * For conventional devices we submit write bio independentenly once read
+ * completes. For zoned devices , reads can complete out of order, so we
+ * maintain a linked list and submit writes in the order, reads are submitted.
+ */
+static int blk_copy_offload(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct cio *cio;
+	struct copy_ctx *ctx;
+	struct bio *read_bio, *write_bio;
+	struct page *token;
+	sector_t src_blk, copy_len, dst_blk;
+	sector_t rem, max_copy_len;
+	int ri = 0, ret = 0;
+	unsigned long flags;
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	cio->ranges = ranges;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = end_io;
+	cio->private = private;
+	if (bdev_is_zoned(dst_bdev)) {
+		INIT_LIST_HEAD(&cio->list);
+		spin_lock_init(&cio->list_lock);
+	}
+
+	max_copy_len = min(bdev_max_copy_sectors(src_bdev),
+			bdev_max_copy_sectors(dst_bdev)) << SECTOR_SHIFT;
+
+	for (ri = 0; ri < nr; ri++) {
+		cio->ranges[ri].comp_len = ranges[ri].len;
+		src_blk = ranges[ri].src;
+		dst_blk = ranges[ri].dst;
+		for (rem = ranges[ri].len; rem > 0; rem -= copy_len) {
+			copy_len = min(rem, max_copy_len);
+
+			token = alloc_page(gfp_mask);
+			if (unlikely(!token)) {
+				ret = -ENOMEM;
+				goto err_token;
+			}
+
+			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+			if (!ctx) {
+				ret = -ENOMEM;
+				goto err_ctx;
+			}
+			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY
+					| REQ_SYNC | REQ_NOMERGE, gfp_mask);
+			if (!read_bio) {
+				ret = -ENOMEM;
+				goto err_read_bio;
+			}
+			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE
+					| REQ_COPY | REQ_SYNC | REQ_NOMERGE,
+					gfp_mask);
+			if (!write_bio) {
+				cio->io_err = -ENOMEM;
+				goto err_write_bio;
+			}
+
+			ctx->cio = cio;
+			ctx->range_idx = ri;
+			ctx->write_bio = write_bio;
+			atomic_set(&ctx->refcount, 1);
+
+			if (bdev_is_zoned(dst_bdev)) {
+				INIT_WORK(&ctx->dispatch_work,
+					blk_zoned_copy_dispatch_work_fn);
+				INIT_LIST_HEAD(&ctx->list);
+				spin_lock_irqsave(&cio->list_lock, flags);
+				ctx->status = REQ_COPY_READ_PROGRESS;
+				list_add_tail(&ctx->list, &cio->list);
+				spin_unlock_irqrestore(&cio->list_lock, flags);
+			} else
+				INIT_WORK(&ctx->dispatch_work,
+					blk_copy_dispatch_work_fn);
+
+			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
+			read_bio->bi_iter.bi_size = copy_len;
+			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
+			read_bio->bi_end_io = blk_copy_offload_read_end_io;
+			read_bio->bi_private = ctx;
+
+			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
+			write_bio->bi_iter.bi_size = copy_len;
+			write_bio->bi_end_io = blk_copy_offload_write_end_io;
+			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
+			write_bio->bi_private = ctx;
+
+			atomic_inc(&cio->refcount);
+			submit_bio(read_bio);
+			src_blk += copy_len;
+			dst_blk += copy_len;
+		}
+	}
+
+	/* Wait for completion of all IO's*/
+	return cio_await_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	__free_page(token);
+err_token:
+	ranges[ri].comp_len = min_t(sector_t,
+			ranges[ri].comp_len, (ranges[ri].len - rem));
+
+	cio->io_err = ret;
+	return cio_await_completion(cio);
+}
+
+static inline int blk_copy_sanity_check(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr)
+{
+	unsigned int align_mask = max(bdev_logical_block_size(dst_bdev),
+					bdev_logical_block_size(src_bdev)) - 1;
+	sector_t len = 0;
+	int i;
+
+	if (!nr)
+		return -EINVAL;
+
+	if (nr >= MAX_COPY_NR_RANGE)
+		return -EINVAL;
+
+	if (bdev_read_only(dst_bdev))
+		return -EPERM;
+
+	for (i = 0; i < nr; i++) {
+		if (!ranges[i].len)
+			return -EINVAL;
+
+		len += ranges[i].len;
+		if ((ranges[i].dst & align_mask) ||
+				(ranges[i].src & align_mask) ||
+				(ranges[i].len & align_mask))
+			return -EINVAL;
+		ranges[i].comp_len = 0;
+	}
+
+	if (len && len >= MAX_COPY_TOTAL_LENGTH)
+		return -EINVAL;
+
+	return 0;
+}
+
+static inline bool blk_check_copy_offload(struct request_queue *src_q,
+		struct request_queue *dst_q)
+{
+	return blk_queue_copy(dst_q) && blk_queue_copy(src_q);
+}
+
+/*
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev:	source block device
+ * @dst_bdev:	destination block device
+ * @ranges:	array of source/dest/len,
+ *		ranges are expected to be allocated/freed by caller
+ * @nr:		number of source ranges to copy
+ * @end_io:	end_io function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	end_io function will be called with this private data, should be
+ *		NULL, if operation is synchronous in nature
+ * @gfp_mask:   memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *	Copy source ranges from source block device to destination block
+ *	device. length of a source range cannot be zero. Max total length of
+ *	copy is limited to MAX_COPY_TOTAL_LENGTH and also maximum number of
+ *	entries is limited to MAX_COPY_NR_RANGE
+ */
+int blkdev_issue_copy(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr,
+	cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *src_q = bdev_get_queue(src_bdev);
+	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
+	int ret = -EINVAL;
+
+	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
+	if (ret)
+		return ret;
+
+	if (blk_check_copy_offload(src_q, dst_q))
+		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
+				end_io, private, gfp_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_issue_copy);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/block/blk.h b/block/blk.h
index 5929559acd71..6d534047f20d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -308,6 +308,8 @@  static inline bool bio_may_exceed_limits(struct bio *bio,
 		break;
 	}
 
+	if (unlikely(op_is_copy(bio->bi_opf)))
+		return false;
 	/*
 	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
 	 * This is a quick and dirty check that relies on the fact that
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index e0b098089ef2..71278c862bba 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -422,6 +422,7 @@  enum req_flag_bits {
 	 */
 	/* for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
+	__REQ_COPY,		/* copy request */
 
 	__REQ_NR_BITS,		/* stops here */
 };
@@ -451,6 +452,7 @@  enum req_flag_bits {
 
 #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
 #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
+#define REQ_COPY	((__force blk_opf_t)(1ULL << __REQ_COPY))
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
@@ -484,6 +486,11 @@  static inline bool op_is_write(blk_opf_t op)
 	return !!(op & (__force blk_opf_t)1);
 }
 
+static inline bool op_is_copy(blk_opf_t op)
+{
+	return (op & REQ_COPY);
+}
+
 /*
  * Check if the bio or request is one that needs special treatment in the
  * flush state machine.
@@ -543,4 +550,41 @@  struct blk_rq_stat {
 	u64 batch;
 };
 
+typedef void (cio_iodone_t)(void *private, int status);
+
+struct cio {
+	struct range_entry *ranges;
+	struct task_struct *waiter;     /* waiting task (NULL if none) */
+	atomic_t refcount;
+	int io_err;
+	cio_iodone_t *endio;		/* applicable for async operation */
+	void *private;			/* applicable for async operation */
+
+	/* For zoned device we maintain a linked list of IO submissions.
+	 * This is to make sure we maintain the order of submissions.
+	 * Otherwise some reads completing out of order, will submit writes not
+	 * aligned with zone write pointer.
+	 */
+	struct list_head list;
+	spinlock_t list_lock;
+};
+
+enum copy_io_status {
+	REQ_COPY_READ_PROGRESS,
+	REQ_COPY_READ_COMPLETE,
+	REQ_COPY_WRITE_PROGRESS,
+};
+
+struct copy_ctx {
+	struct cio *cio;
+	struct work_struct dispatch_work;
+	struct bio *write_bio;
+	atomic_t refcount;
+	int range_idx;			/* used in error/partial completion */
+
+	/* For zoned device linked list is maintained. Along with state of IO */
+	struct list_head list;
+	enum copy_io_status status;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3ac324208f2f..a3b12ad42ed7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1065,6 +1065,9 @@  int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
+int blkdev_issue_copy(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b3ad173f619c..9248b6d259de 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -67,6 +67,21 @@  struct fstrim_range {
 /* maximum total copy length */
 #define MAX_COPY_TOTAL_LENGTH	(1 << 27)
 
+/* Maximum no of entries supported */
+#define MAX_COPY_NR_RANGE	(1 << 12)
+
+/* range entry for copy offload, all fields should be byte addressed */
+struct range_entry {
+	__u64 src;		/* source to be copied */
+	__u64 dst;		/* destination */
+	__u64 len;		/* length in bytes to be copied */
+
+	/* length of data copy actually completed. This will be filled by
+	 * kernel, once copy completes
+	 */
+	__u64 comp_len;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1