[v5,00/10] Implement copy offload support

Message ID 20221123055827.26996-1-nj.shetty@samsung.com

Message

Nitesh Shetty Nov. 23, 2022, 5:58 a.m. UTC
The patch series covers the points discussed in the November 2021
virtual call [LSF/MM/BPF TOPIC] Storage: Copy Offload [0].
We have covered the initially agreed-upon requirements in this
patchset, plus further features suggested by the community.
The patchset borrows Mikulas's token-based approach for the
two-bdev implementation.

This is on top of our previous patchset, v4 [1].

Overall series supports:
========================
	1. Driver
		- NVMe Copy command (single NS, TP 4065), including support
		in nvme-target (for block and file backends).

	2. Block layer
		- Block-generic copy (REQ_COPY flag), with an interface
		accommodating two block devices and multiple
		source/destination ranges
		- Emulation, when native offload is absent
		- dm-linear support (for cases not requiring a split)

	3. User-interface
		- new ioctl
		- copy_file_range for zonefs

	4. In-kernel user
		- dm-kcopyd
		- copy_file_range in zonefs

Testing
=======
	Copy offload can be tested on:
	a. QEMU: NVMe simple copy (TP 4065), by setting the nvme-ns
		parameters mssrl, mcl and msrc; for more info see [2]
		(an example invocation follows this list).
	b. Fabrics loopback.
	c. zonefs copy_file_range
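
	For illustration, an invocation along these lines sets up a
	copy-capable emulated namespace (a fragment only; image paths,
	IDs and the limit values are placeholders, and the remaining
	machine options are elided):

		qemu-system-x86_64 ... \
			-drive file=nvm.img,if=none,id=nvm \
			-device nvme,id=nvme0,serial=deadbeef \
			-device nvme-ns,drive=nvm,bus=nvme0,mssrl=128,mcl=128,msrc=127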

	Emulation can be tested on any device.

	A sample application using the ioctl is present in the patch
	description (a rough sketch also appears below). fio with copy
	offload support is available at [3].
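
	For convenience, here is a sketch of such an application. Caveat:
	the ioctl name (BLKCOPY), the ioctl number and the structure
	layouts below are assumptions modelled on this series' changelog;
	the definitions added to include/uapi/linux/fs.h in patch 4 are
	authoritative.

		/*
		 * Hypothetical BLKCOPY user. BLKCOPY and the struct
		 * layouts are assumed, not verified against patch 4.
		 */
		#include <fcntl.h>
		#include <stdio.h>
		#include <stdint.h>
		#include <sys/ioctl.h>
		#include <linux/types.h>

		struct range_entry {		/* assumed layout */
			__u64 src;		/* source offset, bytes */
			__u64 dst;		/* destination offset, bytes */
			__u64 len;		/* length to copy, bytes */
			__u64 comp_len;		/* bytes copied, set by kernel */
		};

		struct copy_range {		/* assumed layout */
			__u64 nr_range;		/* number of entries */
			__u64 ranges;		/* pointer to range_entry array */
		};

		#define BLKCOPY _IOWR(0x12, 142, struct copy_range) /* assumed */

		int main(void)
		{
			/* copy 1 MiB from offset 0 to offset 8 MiB */
			struct range_entry range = {
				.src = 0,
				.dst = 8ULL << 20,
				.len = 1ULL << 20,
			};
			struct copy_range cr = {
				.nr_range = 1,
				.ranges = (uintptr_t)&range,
			};
			int fd = open("/dev/nvme0n1", O_RDWR);

			if (fd < 0 || ioctl(fd, BLKCOPY, &cr) < 0)
				perror("copy");
			return 0;
		}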

Performance
===========
	With the async design of copy emulation/offload, using fio [3]
	we were able to see the following improvements compared to
	userspace read and write on an NVMe-oF TCP setup:
	Setup1: Network Speed: 1000Mb/s
		Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
		Target PC: AMD Ryzen 9 5900X 12-Core Processor
		block-size 8k, range 1:
			635% improvement in IO BW (107 MiB/s to 787 MiB/s).
			Network utilisation drops from 97% to 14%.
		block-size 2M, range 16:
			2555% improvement in IO BW (100 MiB/s to 2655 MiB/s).
			Network utilisation drops from 89% to 0.62%.
	Setup2: Network Speed: 100Gb/s
		Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
			(host and target have the same configuration)
		block-size 8k, range 1:
			6.5% improvement in IO BW (791 MiB/s to 843 MiB/s).
			Network utilisation drops from 6.75% to 0.14%.
		block-size 2M, range 16:
			15% improvement in IO BW (1027 MiB/s to 1183 MiB/s).
			Network utilisation drops from 8.42% to ~0%.
		block-size 8k, 8 ranges:
			18% drop in IO BW (798 MiB/s to 647 MiB/s).
			Network utilisation drops from 6.66% to 0.13%.

		At present we see a drop in performance for bs 8k, 16k and
		higher range counts (8, 16), so there is more to check there.
	Overall, in these tests, kernel copy emulation performs better than
	userspace read+write.

Zonefs copy_file_range
======================
	Sample tests are available for zonefs-tools [4]; tests 0118 and
	0119 exercise basic CFR (a minimal usage sketch follows). We will
	raise a PR once this series is finalized.
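
	For reference, the user-visible interface is the regular
	copy_file_range(2) system call; a minimal sketch is below. The
	mount point and zone-file names are hypothetical, and zonefs's
	usual sequential-write constraints still apply to the
	destination file.

		#define _GNU_SOURCE
		#include <fcntl.h>
		#include <stdio.h>
		#include <unistd.h>

		int main(void)
		{
			/* hypothetical zonefs mount; names depend on the device */
			int src = open("/mnt/zonefs/seq/0", O_RDONLY);
			int dst = open("/mnt/zonefs/seq/1", O_WRONLY);
			off64_t off_in = 0, off_out = 0;
			ssize_t ret;

			if (src < 0 || dst < 0) {
				perror("open");
				return 1;
			}

			/* copy 1 MiB; with this series it may be offloaded */
			ret = copy_file_range(src, &off_in, dst, &off_out,
					      1 << 20, 0);
			if (ret < 0)
				perror("copy_file_range");
			else
				printf("copied %zd bytes\n", ret);
			return 0;
		}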

Future Work
===========
	- nullblk: copy-offload emulation
	- generic copy_file_range (CFR):
		I did go through this, but couldn't find a straightforward
		way to plug in copy offload for all the cases. We are doing
		a detailed study and will address this in future versions.
	- loopback device copy offload support
	- upstream fio to use copy offload

	These are to be taken up after we reach consensus on the
	plumbing of current elements that are part of this series.


Additional links:
=================
	[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
	[1] https://lore.kernel.org/lkml/20220426101241.30100-1-nj.shetty@samsung.com/
	[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
	[3] https://github.com/vincentkfu/fio/tree/copyoffload
	[4] https://github.com/nitesh-shetty/zonefs-tools/tree/cfr

Changes since v4:
=================
	- make the offload and emulation design asynchronous (Hannes
	  Reinecke)
	- fabrics loopback support
	- sysfs naming improvements (Damien Le Moal)
	- use kfree() instead of kvfree() in cio_await_completion
	  (Damien Le Moal)
	- use ranges instead of rlist to represent range_entry (Damien
	  Le Moal)
	- change argument ordering in blk_copy_offload (Damien Le Moal)
	- merged the multiple copy limits into a single limit
	  (Damien Le Moal)
	- wrap overly long lines (Damien Le Moal)
	- other naming improvements and cleanups (Damien Le Moal)
	- correctly format the code example in description (Damien Le
	  Moal)
	- mark blk_copy_offload as static (kernel test robot)
	
Changes since v3:
=================
	- added copy_file_range support for zonefs
	- added documentation about new sysfs entries
	- incorporated review comments on v3
	- minor fixes

Changes since v2:
=================
	- fixed possible race condition reported by Damien Le Moal
	- new sysfs controls as suggested by Damien Le Moal
	- fixed possible memory leak reported by Dan Carpenter, lkp
	- minor fixes

Nitesh Shetty (10):
  block: Introduce queue limits for copy-offload support
  block: Add copy offload support infrastructure
  block: add emulation for copy
  block: Introduce a new ioctl for copy
  nvme: add copy offload support
  nvmet: add copy command support for bdev and file ns
  dm: Add support for copy offload.
  dm: Enable copy offload for dm-linear target
  dm kcopyd: use copy offload support
  fs: add support for copy file range in zonefs

 Documentation/ABI/stable/sysfs-block |  36 ++
 block/blk-lib.c                      | 597 +++++++++++++++++++++++++++
 block/blk-map.c                      |   4 +-
 block/blk-settings.c                 |  24 ++
 block/blk-sysfs.c                    |  64 +++
 block/blk.h                          |   2 +
 block/ioctl.c                        |  36 ++
 drivers/md/dm-kcopyd.c               |  56 ++-
 drivers/md/dm-linear.c               |   1 +
 drivers/md/dm-table.c                |  42 ++
 drivers/md/dm.c                      |   7 +
 drivers/nvme/host/core.c             | 106 ++++-
 drivers/nvme/host/fc.c               |   5 +
 drivers/nvme/host/nvme.h             |   7 +
 drivers/nvme/host/pci.c              |  28 +-
 drivers/nvme/host/rdma.c             |   7 +
 drivers/nvme/host/tcp.c              |  16 +
 drivers/nvme/host/trace.c            |  19 +
 drivers/nvme/target/admin-cmd.c      |   9 +-
 drivers/nvme/target/io-cmd-bdev.c    |  79 ++++
 drivers/nvme/target/io-cmd-file.c    |  51 +++
 drivers/nvme/target/loop.c           |   6 +
 drivers/nvme/target/nvmet.h          |   2 +
 fs/zonefs/super.c                    | 179 ++++++++
 include/linux/blk_types.h            |  44 ++
 include/linux/blkdev.h               |  18 +
 include/linux/device-mapper.h        |   5 +
 include/linux/nvme.h                 |  43 +-
 include/uapi/linux/fs.h              |  27 ++
 29 files changed, 1502 insertions(+), 18 deletions(-)


base-commit: e4cd8d3ff7f9efeb97330e5e9b99eeb2a68f5cf9

Comments

Chaitanya Kulkarni Nov. 23, 2022, 10:56 p.m. UTC | #1
(+ Shinichiro)

On 11/22/22 21:58, Nitesh Shetty wrote:
> The patch series covers the points discussed in the November 2021
> virtual call [LSF/MM/BPF TOPIC] Storage: Copy Offload [0].
> We have covered the initially agreed-upon requirements in this
> patchset, plus further features suggested by the community.
> The patchset borrows Mikulas's token-based approach for the
> two-bdev implementation.
> 
> This is on top of our previous patchset, v4 [1].

Now that the series is converging, and since it touches drivers and
key components in the block layer, you need to accompany the series
with blktests covering the corner cases in the drivers that support
these operations, as I mentioned in the call last year....

If you need any help with that, feel free to send an email to
linux-block and CC me or Shinichiro (added in CC)...

-ck
Nitesh Shetty Nov. 29, 2022, 12:16 p.m. UTC | #2
On Wed, Nov 23, 2022 at 10:56:23PM +0000, Chaitanya Kulkarni wrote:
> (+ Shinichiro)
> 
> On 11/22/22 21:58, Nitesh Shetty wrote:
> > [...]
> 
> Now that the series is converging, and since it touches drivers and
> key components in the block layer, you need to accompany the series
> with blktests covering the corner cases in the drivers that support
> these operations, as I mentioned in the call last year....
> 
> If you need any help with that, feel free to send an email to
> linux-block and CC me or Shinichiro (added in CC)...
> 
> -ck
>

Yes, any help would be appreciated. I am not familiar with the
blktests development/testing cycle. Do we need to add blktests along
with the patch series, or add them after the patch series gets merged
(or is about to be merged)?

Thanks
Nitesh
Chaitanya Kulkarni Nov. 30, 2022, 12:05 a.m. UTC | #3
On 11/29/22 04:16, Nitesh Shetty wrote:
> On Wed, Nov 23, 2022 at 10:56:23PM +0000, Chaitanya Kulkarni wrote:
>> [...]
> 
> Yes, any help would be appreciated. I am not familiar with the
> blktests development/testing cycle. Do we need to add blktests along
> with the patch series, or add them after the patch series gets merged
> (or is about to be merged)?
> 
> Thanks
> Nitesh
> 
> 

We have many test cases you can refer to as examples. Your
cover letter mentions that you have tested this code; just move
all those test cases to blktests.

More importantly, for a feature like this you should provide
thorough test cases in your GitHub tree when you post the
series; they should cover the critical parts of the block layer
and the drivers in question.

The objective here is to have blktests updated when the code
is upstream, so all the distros can test the code from the
upstream blktests repo. You can refer to what we did for
NVMe-oF in-band authentication (thanks to Hannes and Sagi)
in the linux-nvme email archives.

-ck
Nitesh Shetty Nov. 30, 2022, 4:14 a.m. UTC | #4
On Wed, Nov 30, 2022 at 12:05:00AM +0000, Chaitanya Kulkarni wrote:
> On 11/29/22 04:16, Nitesh Shetty wrote:
> > [...]
> 
> We have many test cases you can refer to as examples. Your
> cover letter mentions that you have tested this code; just move
> all those test cases to blktests.
> 
> More importantly, for a feature like this you should provide
> thorough test cases in your GitHub tree when you post the
> series; they should cover the critical parts of the block layer
> and the drivers in question.
> 
> The objective here is to have blktests updated when the code
> is upstream, so all the distros can test the code from the
> upstream blktests repo. You can refer to what we did for
> NVMe-oF in-band authentication (thanks to Hannes and Sagi)
> in the linux-nvme email archives.
> 
> -ck
>

Sure, the next version will update blktests.

Thank you,
Nitesh