mbox series

[mdadm,v7,0/7] Write Zeroes option for Creating Arrays

Message ID 20230301204135.39230-1-logang@deltatee.com (mailing list archive)
Headers show
Series Write Zeroes option for Creating Arrays | expand

Message

Logan Gunthorpe March 1, 2023, 8:41 p.m. UTC
Hi,

This is the next iteration of the patchset to add a zeroing option
which bypasses the inital sync for arrays. This version of the patch
has some minor cleanup and collected a number of review and ack tags.

This patch set adds the --write-zeroes option which will imply
--assume-clean and write zeros to the data region in each disk before
starting the array. This can take some time so each disk is done in
parallel in its own fork. To make the forking code easier to
understand this patch set also starts with some cleanup of the
existing Create code.

We tested write-zeroes requests on a number of modern nvme drives of
various manufacturers and found most are not as optimized as the
discard path. A couple drives that were tested did not support
write-zeroes at all but still performed similarly with the kernel
falling back to writing zero pages. Typically we see it take on the
order of one minute per 100GB of data zeroed.

One reason write-zeroes is slower than discard is that today's NVMe
devices only allow about 2MB to be zeroed in one command where as
the entire drive can typically be discarded in one command. Partly,
this is a limitation of the spec as there are only 16 bits avalaible
in the write-zeros command size but drives still don't max this out.
Hopefully, in the future this will all be optimized a bit more
and this work will be able to take advantage of that.

Logan

--

Changes since v6:
   * Collected review and ack tags from Xiao, Chaitanya and Coly
   * Adjust the error reporting to us strerror() instead of the
     glibc %m extension. (per Coly)
   * Fix a typo in the man page ("despit" should have been "despite")
     (as noticed by Coly)

Changes since v5:
   * Ensure 'interrupted' is initialized in wait_for_zero_forks().
     (as noticed by Xiao)
   * Print a message indicating that the zeroing was interrupted.

Changes since v4:
   * Handle SIGINT better. Previous versions would leave the zeroing
     processes behind after the main thread exitted which would
     continue zeroing in the background (possibly for some time).
     This version splits the zero fallocate commands up so they can be
     interrupted quicker, and intercepts SIGINT in the main thread
     to print an appropriate message and wait for the threads
     to finish up. (as noticed by Xiao)

Changes since v3:
   * Store the pid in a local variable instead of the mdinfo struct
    (per Mariusz and Xiao)

Changes since v2:

   * Use write-zeroes instead of discard to zero the disks (per
     Martin)
   * Due to the time required to zero the disks, each disk is
     now done in parallel with separate forks of the process.
   * In order to add the forking some refactoring was done on the
     Create() function to make it easier to understand
   * Added a pr_info() call so that some prints can be done
     to stdout instead of stdour (per Mariusz)
   * Added KIB_TO_BYTES and SEC_TO_BYTES helpers (per Mariusz)
   * Added a test to the mdadm test suite to test the option
     works.
   * Fixed up how the size and offset are calculated with some
     great information from Xiao.

Changes since v1:

   * Discard the data in the devices later in the create process
     while they are already open. This requires treating the
     s.discard option the same as the s.assume_clean option.
     Per Mariusz.
   * A couple other minor cleanup changes from Mariusz.

--

Logan Gunthorpe (7):
  Create: goto abort_locked instead of return 1 in error path
  Create: remove safe_mode_delay local variable
  Create: Factor out add_disks() helpers
  mdadm: Introduce pr_info()
  mdadm: Add --write-zeros option for Create
  tests/00raid5-zero: Introduce test to exercise --write-zeros.
  manpage: Add --write-zeroes option to manpage

 Create.c           | 565 +++++++++++++++++++++++++++++++--------------
 ReadMe.c           |   2 +
 mdadm.8.in         |  18 +-
 mdadm.c            |   9 +
 mdadm.h            |   7 +
 tests/00raid5-zero |  12 +
 6 files changed, 437 insertions(+), 176 deletions(-)
 create mode 100644 tests/00raid5-zero


base-commit: f1f3ef7d2de5e3a726c27b9f9bb20e270a100dab
--
2.30.2

Comments

Chaitanya Kulkarni March 2, 2023, 2:53 a.m. UTC | #1
On 3/1/2023 4:38 PM, Jonmichael Hands wrote:
> Hi Logan, Martin,
> just to be clear, the Linux implementation of write zeros is correctly 
> tagging DEAC to 1 during the NVMe write zeros command? This should be 
> very fast on drives that support write zeros with TRIM (most controllers 
> will have the same internal path as a TRIM updating the FTL to 
> deallocate the LBAs). Size of the command really shouldn't affect it 
> much, it is much more correlated with the firmware implementation of above.
> Thanks!
> JM
> Screenshot 2023-03-01 at 4.35.43 PM.png


Yes, it does please see [1].

-ck

1. NVMe DEC bit tagging in NVMe write-zeroes command setup.

static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
		struct request *req, struct nvme_command *cmnd)
{
	memset(cmnd, 0, sizeof(*cmnd));

	if (ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
		return nvme_setup_discard(ns, req, cmnd);

	cmnd->write_zeroes.opcode = nvme_cmd_write_zeroes;
	cmnd->write_zeroes.nsid = cpu_to_le32(ns->head->ns_id);
	cmnd->write_zeroes.slba =
		cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
	cmnd->write_zeroes.length =
		cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);

	if (!(req->cmd_flags & REQ_NOUNMAP) && (ns->features & NVME_NS_DEAC))
		*cmnd->write_zeroes.control |= cpu_to_le16(NVME_WZ_DEAC);*

	if (nvme_ns_has_pi(ns)) {
		cmnd->write_zeroes.control |= cpu_to_le16(NVME_RW_PRINFO_PRACT);

		switch (ns->pi_type) {
		case NVME_NS_DPS_PI_TYPE1:
		case NVME_NS_DPS_PI_TYPE2:
			nvme_set_ref_tag(ns, cmnd, req);
			break;
		}
	}

	return BLK_STS_OK;
}
Chaitanya Kulkarni March 2, 2023, 2:55 a.m. UTC | #2
On 3/1/2023 12:41 PM, Logan Gunthorpe wrote:
> Hi,
> 
> This is the next iteration of the patchset to add a zeroing option
> which bypasses the inital sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.
> 
> This patch set adds the --write-zeroes option which will imply
> --assume-clean and write zeros to the data region in each disk before
> starting the array. This can take some time so each disk is done in
> parallel in its own fork. To make the forking code easier to
> understand this patch set also starts with some cleanup of the
> existing Create code.
> 
> We tested write-zeroes requests on a number of modern nvme drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
> 
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command where as
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits avalaible

's/avalaible/available/g'

> in the write-zeros command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
> 
> Logan
> 
> --
> 
>

Based on earlier discussion and feedback from Martin this
whole series looks good to me.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck
Martin K. Petersen March 2, 2023, 4:56 p.m. UTC | #3
Logan,

> This is the next iteration of the patchset to add a zeroing option
> which bypasses the inital sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.

This looks good to me!

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Jes Sorensen March 13, 2023, 2:08 p.m. UTC | #4
On 3/1/23 15:41, Logan Gunthorpe wrote:
> Hi,
> 
> This is the next iteration of the patchset to add a zeroing option
> which bypasses the inital sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.
> 
> This patch set adds the --write-zeroes option which will imply
> --assume-clean and write zeros to the data region in each disk before
> starting the array. This can take some time so each disk is done in
> parallel in its own fork. To make the forking code easier to
> understand this patch set also starts with some cleanup of the
> existing Create code.
> 
> We tested write-zeroes requests on a number of modern nvme drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
> 
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command where as
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits avalaible
> in the write-zeros command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
> 
> Logan
> 
> --
> 
> Changes since v6:
>    * Collected review and ack tags from Xiao, Chaitanya and Coly
>    * Adjust the error reporting to us strerror() instead of the
>      glibc %m extension. (per Coly)
>    * Fix a typo in the man page ("despit" should have been "despite")
>      (as noticed by Coly)
> 
> Changes since v5:
>    * Ensure 'interrupted' is initialized in wait_for_zero_forks().
>      (as noticed by Xiao)
>    * Print a message indicating that the zeroing was interrupted.
> 
> Changes since v4:
>    * Handle SIGINT better. Previous versions would leave the zeroing
>      processes behind after the main thread exitted which would
>      continue zeroing in the background (possibly for some time).
>      This version splits the zero fallocate commands up so they can be
>      interrupted quicker, and intercepts SIGINT in the main thread
>      to print an appropriate message and wait for the threads
>      to finish up. (as noticed by Xiao)
> 
> Changes since v3:
>    * Store the pid in a local variable instead of the mdinfo struct
>     (per Mariusz and Xiao)
> 
> Changes since v2:
> 
>    * Use write-zeroes instead of discard to zero the disks (per
>      Martin)
>    * Due to the time required to zero the disks, each disk is
>      now done in parallel with separate forks of the process.
>    * In order to add the forking some refactoring was done on the
>      Create() function to make it easier to understand
>    * Added a pr_info() call so that some prints can be done
>      to stdout instead of stdour (per Mariusz)
>    * Added KIB_TO_BYTES and SEC_TO_BYTES helpers (per Mariusz)
>    * Added a test to the mdadm test suite to test the option
>      works.
>    * Fixed up how the size and offset are calculated with some
>      great information from Xiao.
> 
> Changes since v1:
> 
>    * Discard the data in the devices later in the create process
>      while they are already open. This requires treating the
>      s.discard option the same as the s.assume_clean option.
>      Per Mariusz.
>    * A couple other minor cleanup changes from Mariusz.
> 
> --
> 
> Logan Gunthorpe (7):
>   Create: goto abort_locked instead of return 1 in error path
>   Create: remove safe_mode_delay local variable
>   Create: Factor out add_disks() helpers
>   mdadm: Introduce pr_info()
>   mdadm: Add --write-zeros option for Create
>   tests/00raid5-zero: Introduce test to exercise --write-zeros.
>   manpage: Add --write-zeroes option to manpage
> 
>  Create.c           | 565 +++++++++++++++++++++++++++++++--------------
>  ReadMe.c           |   2 +
>  mdadm.8.in         |  18 +-
>  mdadm.c            |   9 +
>  mdadm.h            |   7 +
>  tests/00raid5-zero |  12 +
>  6 files changed, 437 insertions(+), 176 deletions(-)
>  create mode 100644 tests/00raid5-zero
> 
> 
> base-commit: f1f3ef7d2de5e3a726c27b9f9bb20e270a100dab

All applied!

Thanks,
Jes