Message ID: 20191121214445.282160-2-preichl@redhat.com
State: Superseded
Series: mkfs: inform during block discarding
On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> Signed-off-by: Pavel Reichl <preichl@redhat.com>
> ---
>  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 18338a61..a02d6f66 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -1242,15 +1242,33 @@ done:
>  static void
>  discard_blocks(dev_t dev, uint64_t nsectors)
>  {
> -	int		fd;
> +	int		fd;
> +	uint64_t	offset = 0;
> +	/* Maximal chunk of bytes to discard is 2GB */
> +	const uint64_t	step = (uint64_t)2<<30;

You don't need the tabs after the variable name, e.g.

	/* Maximal chunk of bytes to discard is 2GB */
	const uint64_t step = 2ULL << 30;

> +	/* Sector size is 512 bytes */
> +	const uint64_t	count = nsectors << 9;

count = BBTOB(nsectors)?

>
> -	/*
> -	 * We intentionally ignore errors from the discard ioctl.  It is
> -	 * not necessary for the mkfs functionality but just an optimization.
> -	 */
>  	fd = libxfs_device_to_fd(dev);
> -	if (fd > 0)
> -		platform_discard_blocks(fd, 0, nsectors << 9);
> +	if (fd <= 0)
> +		return;
> +
> +	while (offset < count) {
> +		uint64_t	tmp_step = step;

tmp_step = min(step, count - offset); ?

Otherwise seems reasonable to me, if nothing else to avoid the problem
where you ask mkfs to discard and can't cancel it....

--D

> +
> +		if ((offset + step) > count)
> +			tmp_step = count - offset;
> +
> +		/*
> +		 * We intentionally ignore errors from the discard ioctl.  It is
> +		 * not necessary for the mkfs functionality but just an
> +		 * optimization. However we should stop on error.
> +		 */
> +		if (platform_discard_blocks(fd, offset, tmp_step))
> +			return;
> +
> +		offset += tmp_step;
> +	}
>  }
>
>  static __attribute__((noreturn)) void
> --
> 2.23.0
>
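[Editor's note: the review suggestions above (a min()-style clamp and converting 512-byte sectors to bytes) can be sketched in isolation like this. `mock_discard` and `discard_in_steps` are hypothetical stand-ins for `platform_discard_blocks()` and the patched loop, not xfsprogs code.]

```c
#include <stdint.h>

/* Hypothetical stand-in for platform_discard_blocks(): record each chunk. */
static uint64_t chunks[16];
static int nchunks;

static int mock_discard(int fd, uint64_t offset, uint64_t len)
{
	(void)fd;
	(void)offset;
	chunks[nchunks++] = len;
	return 0;	/* 0 == success; nonzero would stop the loop */
}

/* Sketch of the chunked discard loop, with the suggested min() clamp. */
static void discard_in_steps(int fd, uint64_t nsectors)
{
	const uint64_t step = 2ULL << 30;	/* 2 GiB per discard request */
	const uint64_t count = nsectors << 9;	/* 512-byte sectors to bytes */
	uint64_t offset = 0;

	while (offset < count) {
		/* tmp_step = min(step, count - offset) */
		uint64_t tmp_step =
			(count - offset < step) ? count - offset : step;

		if (mock_discard(fd, offset, tmp_step))
			return;		/* stop on the first error */
		offset += tmp_step;
	}
}
```

A 9 GiB device would thus be discarded as four 2 GiB requests plus a final 1 GiB remainder, with a chance to catch ^C between each one.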
On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> Signed-off-by: Pavel Reichl <preichl@redhat.com>
> ---

This is missing an explanation about why the change is being made
and what was considered when making decisions about the change.

e.g. my first questions on looking at the patch were:

- why do we need to break up the discards into 2GB chunks?
- why 2GB?
- why not use libblkid to query the maximum discard size
  and use that as the step size instead?
- is there any performance impact from breaking up large
  discards that might be optimised by the kernel into many
  overlapping async operations into small, synchronous
  discards?

i.e. the reviewer can read what the patch does, but that doesn't
explain why the patch does this. Hence it's a good idea to explain
the problem being solved or the feature requirements that have led
to the changes in the patch....

Cheers,

Dave.
Thanks Darrick for the comments. It makes sense to me, the next
iteration of the patchset will address that.

On Thu, Nov 21, 2019 at 10:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > ---
> >  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
> >  1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> > index 18338a61..a02d6f66 100644
> > --- a/mkfs/xfs_mkfs.c
> > +++ b/mkfs/xfs_mkfs.c
> > @@ -1242,15 +1242,33 @@ done:
> >  static void
> >  discard_blocks(dev_t dev, uint64_t nsectors)
> >  {
> > -	int		fd;
> > +	int		fd;
> > +	uint64_t	offset = 0;
> > +	/* Maximal chunk of bytes to discard is 2GB */
> > +	const uint64_t	step = (uint64_t)2<<30;
>
> You don't need the tabs after the variable name, e.g.
>
> 	/* Maximal chunk of bytes to discard is 2GB */
> 	const uint64_t step = 2ULL << 30;
>
> > +	/* Sector size is 512 bytes */
> > +	const uint64_t	count = nsectors << 9;
>
> count = BBTOB(nsectors)?
>
> >
> > -	/*
> > -	 * We intentionally ignore errors from the discard ioctl.  It is
> > -	 * not necessary for the mkfs functionality but just an optimization.
> > -	 */
> >  	fd = libxfs_device_to_fd(dev);
> > -	if (fd > 0)
> > -		platform_discard_blocks(fd, 0, nsectors << 9);
> > +	if (fd <= 0)
> > +		return;
> > +
> > +	while (offset < count) {
> > +		uint64_t	tmp_step = step;
>
> tmp_step = min(step, count - offset); ?
>
> Otherwise seems reasonable to me, if nothing else to avoid the problem
> where you ask mkfs to discard and can't cancel it....
>
> --D
>
> > +
> > +		if ((offset + step) > count)
> > +			tmp_step = count - offset;
> > +
> > +		/*
> > +		 * We intentionally ignore errors from the discard ioctl.  It is
> > +		 * not necessary for the mkfs functionality but just an
> > +		 * optimization. However we should stop on error.
> > +		 */
> > +		if (platform_discard_blocks(fd, offset, tmp_step))
> > +			return;
> > +
> > +		offset += tmp_step;
> > +	}
> >  }
> >
> >  static __attribute__((noreturn)) void
> > --
> > 2.23.0
> >
On Fri, Nov 22, 2019 at 10:18:38AM +1100, Dave Chinner wrote:
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > ---
>
> This is missing an explanation about why the change is being made
> and what was considered when making decisions about the change.
>
> e.g. my first questions on looking at the patch were:
>
> - why do we need to break up the discards into 2GB chunks?
> - why 2GB?

Yeah, I'm wondering that too.

> - why not use libblkid to query the maximum discard size
>   and use that as the step size instead?

FWIW, on my SATA SSDs the discard-max is 2G, whereas on the NVME it's
2T. I guess firmwares have gotten 1000x better in the past few years,
possibly because of the hundred or so 10x programmers that they've all
been hiring.

> - is there any performance impact from breaking up large
>   discards that might be optimised by the kernel into many
>   overlapping async operations into small, synchronous
>   discards?

Also:
What is the end goal that you have in mind? Is the progress reporting
the ultimate goal? Or is it to break up the BLKDISCARD calls so that
someone can ^C a mkfs operation and not have it just sit there
continuing to run?

--D

> i.e. the reviewer can read what the patch does, but that doesn't
> explain why the patch does this. Hence it's a good idea to explain
> the problem being solved or the feature requirements that have led
> to the changes in the patch....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Fri, Nov 22, 2019 at 4:38 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Fri, Nov 22, 2019 at 10:18:38AM +1100, Dave Chinner wrote:
> > On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > > ---
> >
> > This is missing an explanation about why the change is being made
> > and what was considered when making decisions about the change.
> >
> > e.g. my first questions on looking at the patch were:
> >
> > - why do we need to break up the discards into 2GB chunks?
> > - why 2GB?
>
> Yeah, I'm wondering that too.

OK, thank you both for the question - the simple answer is that I took
what is used in e2fsprogs as the default, and I expected a discussion
about the proper value during the review process :-)

> > - why not use libblkid to query the maximum discard size
> >   and use that as the step size instead?
>
> FWIW, on my SATA SSDs the discard-max is 2G, whereas on the NVME it's
> 2T. I guess firmwares have gotten 1000x better in the past few years,
> possibly because of the hundred or so 10x programmers that they've all
> been hiring.
>
> > - is there any performance impact from breaking up large
> >   discards that might be optimised by the kernel into many
> >   overlapping async operations into small, synchronous
> >   discards?
>
> Also:
> What is the end goal that you have in mind? Is the progress reporting
> the ultimate goal? Or is it to break up the BLKDISCARD calls so that
> someone can ^C a mkfs operation and not have it just sit there
> continuing to run?

The goal is mainly the progress reporting but the possibility to do ^C
is also convenient. It seems that some users are not happy about
BLKDISCARD taking too long and at the same time not being informed
about it - so they think that the command actually hung.

> --D
>
> > i.e. the reviewer can read what the patch does, but that doesn't
> > explain why the patch does this. Hence it's a good idea to explain
> > the problem being solved or the feature requirements that have led
> > to the changes in the patch....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
On Fri, Nov 22, 2019 at 12:18 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > ---
>
> This is missing an explanation about why the change is being made
> and what was considered when making decisions about the change.

Thanks, I'll try to improve that.

> e.g. my first questions on looking at the patch were:
>
> - why do we need to break up the discards into 2GB chunks?
> - why 2GB?
> - why not use libblkid to query the maximum discard size
>   and use that as the step size instead?

This is new to me, please let me learn more about that.

> - is there any performance impact from breaking up large
>   discards that might be optimised by the kernel into many
>   overlapping async operations into small, synchronous
>   discards?

Honestly, I don't have an answer for that ATM - it's quite possible. It
certainly needs more investigation. On the other hand, the current lack
of feedback causes user discomfort. So I'd like to know your opinion -
should the change proposed by this patch be the default behaviour (as it
may be more user friendly), and should we add an option that would
'revert' to the current behaviour (for the informed user)?

> i.e. the reviewer can read what the patch does, but that doesn't
> explain why the patch does this. Hence it's a good idea to explain
> the problem being solved or the feature requirements that have led
> to the changes in the patch....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
On Fri, Nov 22, 2019 at 04:59:21PM +0100, Pavel Reichl wrote:
> On Fri, Nov 22, 2019 at 4:38 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > Also:
> > What is the end goal that you have in mind? Is the progress reporting
> > the ultimate goal? Or is it to break up the BLKDISCARD calls so that
> > someone can ^C a mkfs operation and not have it just sit there
> > continuing to run?
>
> The goal is mainly the progress reporting but the possibility to do ^C
> is also convenient. It seems that some users are not happy about
> BLKDISCARD taking too long and at the same time not being informed
> about it - so they think that the command actually hung.

Ok, that's a good summary to put in the commit description - it tells
the reviewer exactly what you are trying to achieve, and gives them
context to evaluate it against.

Cheers,

Dave.
On 11/21/19 3:55 PM, Darrick J. Wong wrote:
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:

concur w/ others that a reason for the change (and a reason for the size
selection) would be appropriate to have in the changelog.

>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>> ---
>>  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
>>  1 file changed, 25 insertions(+), 7 deletions(-)
>>
>> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
>> index 18338a61..a02d6f66 100644
>> --- a/mkfs/xfs_mkfs.c
>> +++ b/mkfs/xfs_mkfs.c
>> @@ -1242,15 +1242,33 @@ done:
>>  static void
>>  discard_blocks(dev_t dev, uint64_t nsectors)
>>  {
>> -	int		fd;
>> +	int		fd;
>> +	uint64_t	offset = 0;
>> +	/* Maximal chunk of bytes to discard is 2GB */
>> +	const uint64_t	step = (uint64_t)2<<30;
>
> You don't need the tabs after the variable name, e.g.
>
> 	/* Maximal chunk of bytes to discard is 2GB */
> 	const uint64_t step = 2ULL << 30;
>
>> +	/* Sector size is 512 bytes */
>> +	const uint64_t	count = nsectors << 9;
>
> count = BBTOB(nsectors)?

FYI this is a macro that xfs developers have learned about. ;) It stands
for "Basic Block TO Byte" where "basic block" pretty much means
"512-byte sector."

-Eric
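[Editor's note: for anyone unfamiliar with the macro, its effect can be sketched like this - a simplified stand-in for the definition in the xfsprogs headers, shown here only to illustrate the sectors-to-bytes conversion.]

```c
#include <stdint.h>

/*
 * Simplified sketch of BBTOB ("Basic Block TO Bytes"): a basic block is
 * a 512-byte sector, so converting sectors to bytes is a shift by 9.
 */
#define BBSHIFT 9
#define BBTOB(bbs) ((uint64_t)(bbs) << BBSHIFT)
```

Using the macro instead of a bare `<< 9` makes the "sector size is 512 bytes" comment in the patch unnecessary - the name carries the meaning.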
On 11/21/19 5:18 PM, Dave Chinner wrote:
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>> ---
>
> This is missing an explanation about why the change is being made
> and what was considered when making decisions about the change.
>
> e.g. my first questions on looking at the patch were:
>
> - why do we need to break up the discards into 2GB chunks?
> - why 2GB?
> - why not use libblkid to query the maximum discard size
>   and use that as the step size instead?

Just wondering, can we trust that to be reasonably performant?
(the whole motivation here is for hardware that takes inordinately
long to do discard, I wonder if we can count on such hardware to
properly fill out this info....)

> - is there any performance impact from breaking up large
>   discards that might be optimised by the kernel into many
>   overlapping async operations into small, synchronous
>   discards?

FWIW, I had simply suggested to Pavel that he follow e2fsprogs' lead
here - afaik they haven't had issues/complaints with their 2g iteration,
and at one point Lukas did some investigation into the size selection...

Thanks,
-Eric

> i.e. the reviewer can read what the patch does, but that doesn't
> explain why the patch does this. Hence it's a good idea to explain
> the problem being solved or the feature requirements that have led
> to the changes in the patch....
>
> Cheers,
>
> Dave.
>
On 11/22/19 3:10 PM, Eric Sandeen wrote:
> On 11/21/19 5:18 PM, Dave Chinner wrote:
>> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
>>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>>> ---
>>
>> This is missing an explanation about why the change is being made
>> and what was considered when making decisions about the change.
>>
>> e.g. my first questions on looking at the patch were:
>>
>> - why do we need to break up the discards into 2GB chunks?
>> - why 2GB?
>> - why not use libblkid to query the maximum discard size
>>   and use that as the step size instead?
>
> Just wondering, can we trust that to be reasonably performant?
> (the whole motivation here is for hardware that takes inordinately
> long to do discard, I wonder if we can count on such hardware to
> properly fill out this info....)

Looking at the docs in kernel/Documentation/block/queue-sysfs.rst:

discard_max_hw_bytes (RO)
-------------------------
Devices that support discard functionality may have internal limits on
the number of bytes that can be trimmed or unmapped in a single operation.
The discard_max_bytes parameter is set by the device driver to the maximum
number of bytes that can be discarded in a single operation. Discard
requests issued to the device must not exceed this limit. A
discard_max_bytes value of 0 means that the device does not support
discard functionality.

discard_max_bytes (RW)
----------------------
While discard_max_hw_bytes is the hardware limit for the device, this
setting is the software limit. Some devices exhibit large latencies when
large discards are issued, setting this value lower will make Linux issue
smaller discards and potentially help reduce latencies induced by large
discard operations.

it seems like a strong suggestion that the discard_max_hw_bytes value may
still be problematic, and discard_max_bytes can be hand-tuned to something
smaller if it's a problem. To me that indicates that discard_max_hw_bytes
probably can't be trusted to be performant, and presumably
discard_max_bytes won't be either in that case unless it's been hand-tuned
by the admin?

-Eric
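[Editor's note: if mkfs ever did want to consult the queue limits described above, one possible shape - a sketch only, not proposed xfsprogs code, with `choose_step()` as a hypothetical helper - would be to clamp the fixed 2 GiB step to the advertised software limit:]

```c
#include <stdint.h>

/*
 * Hypothetical helper: clamp the fixed 2 GiB discard step to the value
 * a caller has read from /sys/block/<dev>/queue/discard_max_bytes.
 * Per the sysfs docs, a limit of 0 means discard is unsupported.
 */
static uint64_t choose_step(uint64_t discard_max_bytes)
{
	const uint64_t default_step = 2ULL << 30;	/* 2 GiB */

	if (discard_max_bytes == 0)
		return 0;				/* no discard support */
	return discard_max_bytes < default_step ?
			discard_max_bytes : default_step;
}
```

Even then, as noted above, the advertised limit says nothing about latency, so the 2 GiB cap would still be doing the real work on devices that report huge values.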
On 11/22/19 3:30 PM, Eric Sandeen wrote:
> On 11/22/19 3:10 PM, Eric Sandeen wrote:
>> On 11/21/19 5:18 PM, Dave Chinner wrote:
>>> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
>>>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>>>> ---
>>>
>>> This is missing an explanation about why the change is being made
>>> and what was considered when making decisions about the change.
>>>
>>> e.g. my first questions on looking at the patch were:
>>>
>>> - why do we need to break up the discards into 2GB chunks?
>>> - why 2GB?
>>> - why not use libblkid to query the maximum discard size
>>>   and use that as the step size instead?
>>
>> Just wondering, can we trust that to be reasonably performant?
>> (the whole motivation here is for hardware that takes inordinately
>> long to do discard, I wonder if we can count on such hardware to
>> properly fill out this info....)
>
> Looking at the docs in kernel/Documentation/block/queue-sysfs.rst:
>
> discard_max_hw_bytes (RO)
> -------------------------
> Devices that support discard functionality may have internal limits on
> the number of bytes that can be trimmed or unmapped in a single operation.
> The discard_max_bytes parameter is set by the device driver to the maximum
> number of bytes that can be discarded in a single operation. Discard
> requests issued to the device must not exceed this limit. A
> discard_max_bytes value of 0 means that the device does not support
> discard functionality.
>
> discard_max_bytes (RW)
> ----------------------
> While discard_max_hw_bytes is the hardware limit for the device, this
> setting is the software limit. Some devices exhibit large latencies when
> large discards are issued, setting this value lower will make Linux issue
> smaller discards and potentially help reduce latencies induced by large
> discard operations.
>
> it seems like a strong suggestion that the discard_max_hw_bytes value may
> still be problematic, and discard_max_bytes can be hand-tuned to something
> smaller if it's a problem. To me that indicates that discard_max_hw_bytes
> probably can't be trusted to be performant, and presumably
> discard_max_bytes won't be either in that case unless it's been hand-tuned
> by the admin?

Lukas, Jeff Moyer reminded me that you did a lot of investigation into
this behavior a while back. Can you shed light on this, particularly how
you chose 2G as the discard granularity for mke2fs?

Thanks,
-Eric
On 11/21/19 3:44 PM, Pavel Reichl wrote:
> Signed-off-by: Pavel Reichl <preichl@redhat.com>
> ---
>  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 18338a61..a02d6f66 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -1242,15 +1242,33 @@ done:
>  static void
>  discard_blocks(dev_t dev, uint64_t nsectors)
>  {
> -	int		fd;
> +	int		fd;
> +	uint64_t	offset = 0;
> +	/* Maximal chunk of bytes to discard is 2GB */
> +	const uint64_t	step = (uint64_t)2<<30;

Regarding the discard step size, I would like to just see us keep 2G -
I see problems with the alternate suggestions proposed in the threads on
this patch review:

1) query block device for maximal discard size

-> block device folks I've talked to (Jeff Moyer in particular) stated
that many devices are known for putting a huge value in here, and then
taking far, far too long to process that size request. In short,
maximum size != fast.

2) discard one AG size at a time

-> this can be up to 1T, which puts us right back at our problem of
large, slow discards. And in particular, AG size has no relation at all
to a device's discard behavior. (further complicating this, we don't
have this geometry available anywhere in the current chain of calls to
the discard ioctl.)

Lukas did an investigation of discard behaviors (though it was some time
ago, https://sourceforge.net/projects/test-discard/) and arrived at 2G
as a reasonable size after testing many different devices - I've not
seen any complaints from mke2fs users about problems doing discards in
2G chunks.

So I think just picking a fixed 2G size is the best plan for now.

(one nitpick, I'd fix the comment above to not say "Maximal" because
that sounds like some hard limit imposed by something other than the
code; I'd just say "Discard the device 2G at a time" or something like
that.)

A comment above the loop explaining in more detail that we iterate in
step sizes so that the utility can be interrupted would probably be
helpful.

Thanks,
-Eric
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 18338a61..a02d6f66 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1242,15 +1242,33 @@ done:
 static void
 discard_blocks(dev_t dev, uint64_t nsectors)
 {
-	int		fd;
+	int		fd;
+	uint64_t	offset = 0;
+	/* Maximal chunk of bytes to discard is 2GB */
+	const uint64_t	step = (uint64_t)2<<30;
+	/* Sector size is 512 bytes */
+	const uint64_t	count = nsectors << 9;
 
-	/*
-	 * We intentionally ignore errors from the discard ioctl.  It is
-	 * not necessary for the mkfs functionality but just an optimization.
-	 */
 	fd = libxfs_device_to_fd(dev);
-	if (fd > 0)
-		platform_discard_blocks(fd, 0, nsectors << 9);
+	if (fd <= 0)
+		return;
+
+	while (offset < count) {
+		uint64_t	tmp_step = step;
+
+		if ((offset + step) > count)
+			tmp_step = count - offset;
+
+		/*
+		 * We intentionally ignore errors from the discard ioctl.  It is
+		 * not necessary for the mkfs functionality but just an
+		 * optimization. However we should stop on error.
+		 */
+		if (platform_discard_blocks(fd, offset, tmp_step))
+			return;
+
+		offset += tmp_step;
+	}
 }
 
 static __attribute__((noreturn)) void
Signed-off-by: Pavel Reichl <preichl@redhat.com>
---
 mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)