Message ID | 20230301204135.39230-1-logang@deltatee.com (mailing list archive)
---|---
Series | Write Zeroes option for Creating Arrays
On 3/1/2023 4:38 PM, Jonmichael Hands wrote:
> Hi Logan, Martin,
>
> Just to be clear, is the Linux implementation of write zeroes correctly
> setting DEAC to 1 in the NVMe Write Zeroes command? This should be
> very fast on drives that support write zeroes with TRIM (most controllers
> will have the same internal path as a TRIM, updating the FTL to
> deallocate the LBAs). The size of the command really shouldn't affect it
> much; it is much more correlated with the firmware implementation of the
> above.
>
> Thanks!
> JM

Yes, it does, please see [1].

-ck

1. NVMe DEAC bit handling in the NVMe write-zeroes command setup:

static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
		struct request *req, struct nvme_command *cmnd)
{
	memset(cmnd, 0, sizeof(*cmnd));

	if (ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
		return nvme_setup_discard(ns, req, cmnd);

	cmnd->write_zeroes.opcode = nvme_cmd_write_zeroes;
	cmnd->write_zeroes.nsid = cpu_to_le32(ns->head->ns_id);
	cmnd->write_zeroes.slba =
		cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
	cmnd->write_zeroes.length =
		cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);

	if (!(req->cmd_flags & REQ_NOUNMAP) && (ns->features & NVME_NS_DEAC))
		cmnd->write_zeroes.control |= cpu_to_le16(NVME_WZ_DEAC);

	if (nvme_ns_has_pi(ns)) {
		cmnd->write_zeroes.control |= cpu_to_le16(NVME_RW_PRINFO_PRACT);

		switch (ns->pi_type) {
		case NVME_NS_DPS_PI_TYPE1:
		case NVME_NS_DPS_PI_TYPE2:
			nvme_set_ref_tag(ns, cmnd, req);
			break;
		}
	}

	return BLK_STS_OK;
}
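[Editor's note: for reference, a minimal userspace sketch of how a block-device range can be zeroed so that the kernel path above is exercised. It is not taken from the mdadm patches; the device path and range are illustrative, and on recent kernels fallocate(FALLOC_FL_ZERO_RANGE) on a block device is serviced by blkdev_issue_zeroout(), which issues Write Zeroes when the device supports it and falls back to writing zero pages otherwise.]

/* Sketch only: ask the kernel to zero a range of a block device.
 * Device path, offset and length below are examples, not mdadm's values.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const char *dev = "/dev/nvme0n1";	/* example device */
	off_t offset = 0;
	off_t length = (off_t)1 << 30;		/* zero the first 1 GiB */

	int fd = open(dev, O_RDWR);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/* On recent kernels this is translated by the block layer into
	 * Write Zeroes commands where the device supports them. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
		      offset, length)) {
		perror("fallocate");
		close(fd);
		return EXIT_FAILURE;
	}

	close(fd);
	return EXIT_SUCCESS;
}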
On 3/1/2023 12:41 PM, Logan Gunthorpe wrote:
> Hi,
>
> This is the next iteration of the patchset to add a zeroing option
> which bypasses the initial sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.
>
> This patch set adds the --write-zeroes option which will imply
> --assume-clean and write zeros to the data region in each disk before
> starting the array. This can take some time so each disk is done in
> parallel in its own fork. To make the forking code easier to
> understand this patch set also starts with some cleanup of the
> existing Create code.
>
> We tested write-zeroes requests on a number of modern nvme drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
>
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command whereas
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits avalaible

's/avalaible/available/g'

> in the write-zeros command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
>
> Logan
>
> --
>

Based on earlier discussion and feedback from Martin, this whole series
looks good to me.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck
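[Editor's note: the per-command limit mentioned in the cover letter can be checked from userspace, since the block layer exports each device's advertised Write Zeroes limit in sysfs. A small sketch follows; "nvme0n1" is only an example device name. Note that the 16-bit, zero-based block count in the Write Zeroes command would allow up to 65536 logical blocks per command (32 MiB at 512-byte blocks), so the ~2MB figure above reflects what drives advertise, not the spec maximum.]

/* Sketch: read a device's advertised Write Zeroes limit from sysfs.
 * A value of 0 means the device does not support Write Zeroes.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/nvme0n1/queue/write_zeroes_max_bytes";
	unsigned long long max_bytes = 0;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &max_bytes) != 1) {
		fprintf(stderr, "could not parse %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);

	if (max_bytes == 0)
		printf("device does not support Write Zeroes\n");
	else
		printf("max %llu bytes per write-zeroes command\n", max_bytes);

	return 0;
}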
Logan,

> This is the next iteration of the patchset to add a zeroing option
> which bypasses the initial sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.

This looks good to me!

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
On 3/1/23 15:41, Logan Gunthorpe wrote:
> Hi,
>
> This is the next iteration of the patchset to add a zeroing option
> which bypasses the initial sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.
>
> This patch set adds the --write-zeroes option which will imply
> --assume-clean and write zeros to the data region in each disk before
> starting the array. This can take some time so each disk is done in
> parallel in its own fork. To make the forking code easier to
> understand this patch set also starts with some cleanup of the
> existing Create code.
>
> We tested write-zeroes requests on a number of modern nvme drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
>
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command whereas
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits available
> in the write-zeros command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
>
> Logan
>
> --
>
> Changes since v6:
>   * Collected review and ack tags from Xiao, Chaitanya and Coly
>   * Adjust the error reporting to use strerror() instead of the
>     glibc %m extension. (per Coly)
>   * Fix a typo in the man page ("despit" should have been "despite")
>     (as noticed by Coly)
>
> Changes since v5:
>   * Ensure 'interrupted' is initialized in wait_for_zero_forks().
>     (as noticed by Xiao)
>   * Print a message indicating that the zeroing was interrupted.
>
> Changes since v4:
>   * Handle SIGINT better. Previous versions would leave the zeroing
>     processes behind after the main thread exited, which would
>     continue zeroing in the background (possibly for some time).
>     This version splits the zero fallocate commands up so they can be
>     interrupted quicker, and intercepts SIGINT in the main thread
>     to print an appropriate message and wait for the threads
>     to finish up. (as noticed by Xiao)
>
> Changes since v3:
>   * Store the pid in a local variable instead of the mdinfo struct
>     (per Mariusz and Xiao)
>
> Changes since v2:
>   * Use write-zeroes instead of discard to zero the disks (per
>     Martin)
>   * Due to the time required to zero the disks, each disk is
>     now done in parallel with separate forks of the process.
>   * In order to add the forking some refactoring was done on the
>     Create() function to make it easier to understand
>   * Added a pr_info() call so that some prints can be done
>     to stdout instead of stderr (per Mariusz)
>   * Added KIB_TO_BYTES and SEC_TO_BYTES helpers (per Mariusz)
>   * Added a test to the mdadm test suite to test the option
>     works.
>   * Fixed up how the size and offset are calculated with some
>     great information from Xiao.
>
> Changes since v1:
>   * Discard the data in the devices later in the create process
>     while they are already open. This requires treating the
>     s.discard option the same as the s.assume_clean option.
>     Per Mariusz.
>   * A couple other minor cleanup changes from Mariusz.
>
> --
>
> Logan Gunthorpe (7):
>   Create: goto abort_locked instead of return 1 in error path
>   Create: remove safe_mode_delay local variable
>   Create: Factor out add_disks() helpers
>   mdadm: Introduce pr_info()
>   mdadm: Add --write-zeros option for Create
>   tests/00raid5-zero: Introduce test to exercise --write-zeros.
>   manpage: Add --write-zeroes option to manpage
>
>  Create.c           | 565 +++++++++++++++++++++++++++++++--------------
>  ReadMe.c           |   2 +
>  mdadm.8.in         |  18 +-
>  mdadm.c            |   9 +
>  mdadm.h            |   7 +
>  tests/00raid5-zero |  12 +
>  6 files changed, 437 insertions(+), 176 deletions(-)
>  create mode 100644 tests/00raid5-zero
>
> base-commit: f1f3ef7d2de5e3a726c27b9f9bb20e270a100dab

All applied!

Thanks,
Jes
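[Editor's note: to illustrate the design described in the changelog quoted above (one zeroing child per member device, fallocate calls split into bounded chunks, SIGINT caught so the interruption can be reported and the children waited for), here is a simplified, hypothetical sketch. It is not mdadm's implementation; the chunk size, the fixed 8 GiB region, and the helper names are illustrative only.]

/* Simplified sketch: fork one child per device given on the command line,
 * each child zeroes its region in bounded chunks so that an interrupt is
 * noticed at the next chunk boundary; the parent reports the interruption.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define ZERO_CHUNK ((off_t)1 << 30)	/* zero 1 GiB per fallocate call */

static volatile sig_atomic_t interrupted;

static void sigint_handler(int sig)
{
	(void)sig;
	interrupted = 1;
}

/* Child: zero [offset, offset + size) of one device in chunks. */
static int zero_device(const char *path, off_t offset, off_t size)
{
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		perror(path);
		return 1;
	}

	/* Bounded chunks keep each fallocate() call short, so a SIGINT is
	 * acted on at the next loop iteration instead of after the whole
	 * device has been zeroed. */
	while (size > 0 && !interrupted) {
		off_t len = size < ZERO_CHUNK ? size : ZERO_CHUNK;

		if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
			      offset, len)) {
			perror("fallocate");
			close(fd);
			return 1;
		}
		offset += len;
		size -= len;
	}

	close(fd);
	return interrupted ? 1 : 0;
}

int main(int argc, char **argv)
{
	pid_t pids[argc];
	int i, status, ret = 0;

	signal(SIGINT, sigint_handler);

	/* One fork per device; children inherit the SIGINT handler and each
	 * keeps its own copy of the 'interrupted' flag. */
	for (i = 1; i < argc; i++) {
		pids[i] = fork();
		if (pids[i] < 0) {
			perror("fork");
			ret = 1;
		} else if (pids[i] == 0) {
			/* 8 GiB at offset 0 is illustrative only. */
			_exit(zero_device(argv[i], 0, (off_t)8 << 30));
		}
	}

	/* Wait for all children before deciding whether to proceed. */
	for (i = 1; i < argc; i++) {
		if (pids[i] < 0)
			continue;
		if (waitpid(pids[i], &status, 0) < 0 || status != 0)
			ret = 1;
	}

	if (interrupted)
		fprintf(stderr, "zeroing interrupted, array not started\n");

	return ret;
}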