[1/2] mkfs: Break block discard into chunks of 2 GB

Message ID	20191121214445.282160-2-preichl@redhat.com (mailing list archive)
State	Superseded
Headers	Return-Path: <SRS0=udUw=ZN=vger.kernel.org=linux-xfs-owner@kernel.org> From: Pavel Reichl <preichl@redhat.com> To: linux-xfs@vger.kernel.org Cc: Pavel Reichl <preichl@redhat.com> Subject: [PATCH 1/2] mkfs: Break block discard into chunks of 2 GB Date: Thu, 21 Nov 2019 22:44:44 +0100 Message-Id: <20191121214445.282160-2-preichl@redhat.com> In-Reply-To: <20191121214445.282160-1-preichl@redhat.com> References: <20191121214445.282160-1-preichl@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk
Series	mkfs: inform during block discarding ([0/2] mkfs: inform during block discarding; [1/2] mkfs: Break block discard into chunks of 2 GB; [2/2] mkfs: Show progress during block discard)

Pavel Reichl Nov. 21, 2019, 9:44 p.m. UTC

Signed-off-by: Pavel Reichl <preichl@redhat.com>
---
 mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

Darrick J. Wong Nov. 21, 2019, 9:55 p.m. UTC | #1

On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> Signed-off-by: Pavel Reichl <preichl@redhat.com>
> ---
>  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 18338a61..a02d6f66 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -1242,15 +1242,33 @@ done:
>  static void
>  discard_blocks(dev_t dev, uint64_t nsectors)
>  {
> -	int fd;
> +	int		fd;
> +	uint64_t	offset		= 0;
> +	/* Maximal chunk of bytes to discard is 2GB */
> +	const uint64_t	step		= (uint64_t)2<<30;

You don't need the tabs after the variable name, e.g.

	/* Maximal chunk of bytes to discard is 2GB */
	const uint64_t	step = 2ULL << 30;

> +	/* Sector size is 512 bytes */
> +	const uint64_t	count		= nsectors << 9;

count = BBTOB(nsectors)?

>  
> -	/*
> -	 * We intentionally ignore errors from the discard ioctl.  It is
> -	 * not necessary for the mkfs functionality but just an optimization.
> -	 */
>  	fd = libxfs_device_to_fd(dev);
> -	if (fd > 0)
> -		platform_discard_blocks(fd, 0, nsectors << 9);
> +	if (fd <= 0)
> +		return;
> +
> +	while (offset < count) {
> +		uint64_t	tmp_step = step;

tmp_step = min(step, count - offset); ?

Otherwise seems reasonable to me, if nothing else to avoid the problem
where you ask mkfs to discard and can't cancel it....

--D

> +
> +		if ((offset + step) > count)
> +			tmp_step = count - offset;
> +
> +		/*
> +		 * We intentionally ignore errors from the discard ioctl. It is
> +		 * not necessary for the mkfs functionality but just an
> +		 * optimization. However we should stop on error.
> +		 */
> +		if (platform_discard_blocks(fd, offset, tmp_step))
> +			return;
> +
> +		offset += tmp_step;
> +	}
>  }
>  
>  static __attribute__((noreturn)) void
> -- 
> 2.23.0
>
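The min() clamp Darrick suggests for the last chunk can be sketched in isolation. This is a minimal illustration of the arithmetic only, not the next revision of the patch: `DISCARD_STEP`, `chunk_len` and `count_chunks` are hypothetical names, and the discard ioctl itself is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Chunk size used in the patch: 2 GiB, written as a 64-bit constant. */
#define DISCARD_STEP	(2ULL << 30)

/* Length of the next chunk: a full step, or whatever is left. */
static uint64_t
chunk_len(uint64_t offset, uint64_t count, uint64_t step)
{
	uint64_t	left = count - offset;

	return left < step ? left : step;
}

/* Walk [0, count) in chunks and return how many chunks were needed. */
static uint64_t
count_chunks(uint64_t count, uint64_t step)
{
	uint64_t	offset = 0;
	uint64_t	chunks = 0;

	assert(step > 0);	/* a zero step would never terminate */
	while (offset < count) {
		offset += chunk_len(offset, count, step);
		chunks++;
	}
	return chunks;
}
```

Clamping against `count - offset` also avoids the `offset + step` addition used in the posted condition, which could in principle wrap for byte counts near the top of the 64-bit range.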

Dave Chinner Nov. 21, 2019, 11:18 p.m. UTC | #2

On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> Signed-off-by: Pavel Reichl <preichl@redhat.com>
> ---

This is missing an explanation about why the change is being made
and what was considered when making decisions about the change.

e.g. my first questions on looking at the patch were:

	- why do we need to break up the discards into 2GB chunks?
	- why 2GB?
	- why not use libblkid to query the maximum discard size
	  and use that as the step size instead?
	- is there any performance impact from breaking up large
	  discards that might be optimised by the kernel into many
	  overlapping async operations into small, synchronous
	  discards?

i.e. the reviewer can read what the patch does, but that doesn't
explain why the patch does this. Hence it's a good idea to explain
the problem being solved or the feature requirements that have led
to the changes in the patch....

Cheers,

Dave.

Pavel Reichl Nov. 22, 2019, 2:46 p.m. UTC | #3

Thanks, Darrick, for the comments. They make sense to me; the next
iteration of the patchset will address them.

On Thu, Nov 21, 2019 at 10:57 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > ---
> >  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
> >  1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> > index 18338a61..a02d6f66 100644
> > --- a/mkfs/xfs_mkfs.c
> > +++ b/mkfs/xfs_mkfs.c
> > @@ -1242,15 +1242,33 @@ done:
> >  static void
> >  discard_blocks(dev_t dev, uint64_t nsectors)
> >  {
> > -     int fd;
> > +     int             fd;
> > +     uint64_t        offset          = 0;
> > +     /* Maximal chunk of bytes to discard is 2GB */
> > +     const uint64_t  step            = (uint64_t)2<<30;
>
> You don't need the tabs after the variable name, e.g.
>
>         /* Maximal chunk of bytes to discard is 2GB */
>         const uint64_t  step = 2ULL << 30;
>
> > +     /* Sector size is 512 bytes */
> > +     const uint64_t  count           = nsectors << 9;
>
> count = BBTOB(nsectors)?
>
> >
> > -     /*
> > -      * We intentionally ignore errors from the discard ioctl.  It is
> > -      * not necessary for the mkfs functionality but just an optimization.
> > -      */
> >       fd = libxfs_device_to_fd(dev);
> > -     if (fd > 0)
> > -             platform_discard_blocks(fd, 0, nsectors << 9);
> > +     if (fd <= 0)
> > +             return;
> > +
> > +     while (offset < count) {
> > +             uint64_t        tmp_step = step;
>
> tmp_step = min(step, count - offset); ?
>
> Otherwise seems reasonable to me, if nothing else to avoid the problem
> where you ask mkfs to discard and can't cancel it....
>
> --D
>
> > +
> > +             if ((offset + step) > count)
> > +                     tmp_step = count - offset;
> > +
> > +             /*
> > +              * We intentionally ignore errors from the discard ioctl. It is
> > +              * not necessary for the mkfs functionality but just an
> > +              * optimization. However we should stop on error.
> > +              */
> > +             if (platform_discard_blocks(fd, offset, tmp_step))
> > +                     return;
> > +
> > +             offset += tmp_step;
> > +     }
> >  }
> >
> >  static __attribute__((noreturn)) void
> > --
> > 2.23.0
> >
>

Darrick J. Wong Nov. 22, 2019, 3:38 p.m. UTC | #4

On Fri, Nov 22, 2019 at 10:18:38AM +1100, Dave Chinner wrote:
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > ---
> 
> This is missing an explanation about why the change is being made
> and what was considered when making decisions about the change.
> 
> e.g. my first questions on looking at the patch were:
> 
> 	- why do we need to break up the discards into 2GB chunks?
> 	- why 2GB?

Yeah, I'm wondering that too.

> 	- why not use libblkid to query the maximum discard size
> 	  and use that as the step size instead?

FWIW on my SATA SSDs the discard-max is 2G whereas on the NVMe it's 2T.  I
guess firmwares have gotten 1000x better in the past few years, possibly
because of the hundred or so 10x programmers that they've all been hiring.

> 	- is there any performance impact from breaking up large
> 	  discards that might be optimised by the kernel into many
> 	  overlapping async operations into small, synchronous
> 	  discards?

Also:
What is the end goal that you have in mind?  Is the progress reporting
the ultimate goal?  Or is it to break up the BLKDISCARD calls so that
someone can ^C a mkfs operation and not have it just sit there
continuing to run?

--D

> i.e. the reviewer can read what the patch does, but that doesn't
> explain why the patch does this. Hence it's a good idea to explain
> the problem being solved or the feature requirements that have led
> to the changes in the patch....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Pavel Reichl Nov. 22, 2019, 3:59 p.m. UTC | #5

On Fri, Nov 22, 2019 at 4:38 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Fri, Nov 22, 2019 at 10:18:38AM +1100, Dave Chinner wrote:
> > On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > > ---
> >
> > This is missing an explanation about why the change is being made
> > and what was considered when making decisions about the change.
> >
> > e.g. my first questions on looking at the patch were:
> >
> >       - why do we need to break up the discards into 2GB chunks?
> >       - why 2GB?
>
> Yeah, I'm wondering that too.

OK, thank you both for the question - simple answer is that I took
what is used in e2fsprogs as default and I expected a discussion about
proper value during review process :-)
>
> >       - why not use libblkid to query the maximum discard size
> >         and use that as the step size instead?
>
> FWIW on my SATA SSDs the discard-max is 2G whereas on the NVMe it's 2T.  I
> guess firmwares have gotten 1000x better in the past few years, possibly
> because of the hundred or so 10x programmers that they've all been hiring.
>
> >       - is there any performance impact from breaking up large
> >         discards that might be optimised by the kernel into many
> >         overlapping async operations into small, synchronous
> >         discards?
>
> Also:
> What is the end goal that you have in mind?  Is the progress reporting
> the ultimate goal?  Or is it to break up the BLKDISCARD calls so that
> someone can ^C a mkfs operation and not have it just sit there
> continuing to run?

The goal is mainly the progress reporting but the possibility to do ^C
is also convenient. It seems that some users are not happy about the
BLKDISCARD taking too long and at the same time not being informed
about that - so they think that the command actually hung.

>
> --D
>
> > i.e. the reviewer can read what the patch does, but that doesn't
> > explain why the patch does this. Hence it's a good idea to explain
> > the problem being solved or the feature requirements that have led
> > to the changes in the patch....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>

Pavel Reichl Nov. 22, 2019, 4:09 p.m. UTC | #6

On Fri, Nov 22, 2019 at 12:18 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
> > Signed-off-by: Pavel Reichl <preichl@redhat.com>
> > ---
>
> This is missing an explanation about why the change is being made
> and what was considered when making decisions about the change.

Thanks, I'll try to improve that.
>
> e.g. my first questions on looking at the patch were:
>
>         - why do we need to break up the discards into 2GB chunks?
>         - why 2GB?
>         - why not use libblkid to query the maximum discard size
>           and use that as the step size instead?

This is new for me, please let me learn more about that.


>         - is there any performance impact from breaking up large
>           discards that might be optimised by the kernel into many
>           overlapping async operations into small, synchronous
>           discards?

Honestly, I don't have an answer for that ATM - it's quite possible.
It certainly needs more investigation. On the other hand, the current
lack of feedback causes user discomfort. So I'd like to know your
opinion: should the change proposed by this patch be the default
behaviour (as it may be more user friendly), and should we add an
option that would 'revert' to the current behaviour (for informed
users)?

>
> i.e. the reviewer can read what the patch does, but that doesn't
> explain why the patch does this. Hence it's a good idea to explain
> the problem being solved or the feature requirements that have led
> to the changes in the patch....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

Dave Chinner Nov. 22, 2019, 9 p.m. UTC | #7

On Fri, Nov 22, 2019 at 04:59:21PM +0100, Pavel Reichl wrote:
> On Fri, Nov 22, 2019 at 4:38 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > Also:
> > What is the end goal that you have in mind?  Is the progress reporting
> > the ultimate goal?  Or is it to break up the BLKDISCARD calls so that
> > someone can ^C a mkfs operation and not have it just sit there
> > continuing to run?
> 
> The goal is mainly the progress reporting but the possibility to do ^C
> is also convenient. It seems that some users are not happy about the
> BLKDISCARD taking too long and at the same time not being informed
> about that - so they think that the command actually hung.

Ok, that's a good summary to put in the commit description - it
tells the reviewer exactly what you are trying to achieve, and gives
them context to evaluate it against.

Cheers,

Dave.

Eric Sandeen Nov. 22, 2019, 9:07 p.m. UTC | #8

On 11/21/19 3:55 PM, Darrick J. Wong wrote:
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:

concur w/ others that a reason for the change (and a reason for the
size selection) would be appropriate to have in the changelog.

>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>> ---
>>  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
>>  1 file changed, 25 insertions(+), 7 deletions(-)
>>
>> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
>> index 18338a61..a02d6f66 100644
>> --- a/mkfs/xfs_mkfs.c
>> +++ b/mkfs/xfs_mkfs.c
>> @@ -1242,15 +1242,33 @@ done:
>>  static void
>>  discard_blocks(dev_t dev, uint64_t nsectors)
>>  {
>> -	int fd;
>> +	int		fd;
>> +	uint64_t	offset		= 0;
>> +	/* Maximal chunk of bytes to discard is 2GB */
>> +	const uint64_t	step		= (uint64_t)2<<30;
> 
> You don't need the tabs after the variable name, e.g.
> 
> 	/* Maximal chunk of bytes to discard is 2GB */
> 	const uint64_t	step = 2ULL << 30;
> 
>> +	/* Sector size is 512 bytes */
>> +	const uint64_t	count		= nsectors << 9;
> 
> count = BBTOB(nsectors)?

FYI this is a macro that xfs developers have learned about. ;)  It stands for
"Basic Block TO Byte" where "basic block" pretty much means "512-byte sector."

-Eric
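For readers unfamiliar with the macro, the conversion amounts to a left shift by 9. The definitions below are a sketch of how I understand the XFS headers to spell it and should be treated as an assumption; `sectors_to_bytes` is a hypothetical wrapper added only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the XFS definitions: a basic block is 512 bytes. */
#define BBSHIFT		9
#define BBSIZE		(1 << BBSHIFT)
#define BBTOB(bbs)	((bbs) << BBSHIFT)

static_assert(BBSIZE == 512, "basic block should be 512 bytes");

/* Convert a 512-byte sector count to bytes, as discard_blocks() needs. */
static uint64_t
sectors_to_bytes(uint64_t nsectors)
{
	return BBTOB(nsectors);
}
```

Using the macro instead of an open-coded `nsectors << 9` also makes the "Sector size is 512 bytes" comment in the patch unnecessary.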

Eric Sandeen Nov. 22, 2019, 9:10 p.m. UTC | #9

On 11/21/19 5:18 PM, Dave Chinner wrote:
> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>> ---
> 
> This is missing an explanation about why the change is being made
> and what was considered when making decisions about the change.
> 
> e.g. my first questions on looking at the patch were:
> 
> 	- why do we need to break up the discards into 2GB chunks?
> 	- why 2GB?
> 	- why not use libblkid to query the maximum discard size
> 	  and use that as the step size instead?

Just wondering, can we trust that to be reasonably performant?
(the whole motivation here is for hardware that takes inordinately
long to do discard, I wonder if we can count on such hardware to
properly fill out this info....)

> 	- is there any performance impact from breaking up large
> 	  discards that might be optimised by the kernel into many
> 	  overlapping async operations into small, synchronous
> 	  discards?

FWIW, I had simply suggested to Pavel that he follow e2fsprogs' lead
here - afaik they haven't had issues/complaints with their 2g iteration,
and at one point Lukas did some investigation into the size selection...

Thanks,
-Eric

> i.e. the reviewer can read what the patch does, but that doesn't
> explain why the patch does this. Hence it's a good idea to explain
> the problem being solved or the feature requirements that have led
> to the changes in the patch....
> 
> Cheers,
> 
> Dave.
>

Eric Sandeen Nov. 22, 2019, 9:30 p.m. UTC | #10

On 11/22/19 3:10 PM, Eric Sandeen wrote:
> On 11/21/19 5:18 PM, Dave Chinner wrote:
>> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
>>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>>> ---
>>
>> This is missing an explanation about why the change is being made
>> and what was considered when making decisions about the change.
>>
>> e.g. my first questions on looking at the patch were:
>>
>> 	- why do we need to break up the discards into 2GB chunks?
>> 	- why 2GB?
>> 	- why not use libblkid to query the maximum discard size
>> 	  and use that as the step size instead?
> 
> Just wondering, can we trust that to be reasonably performant?
> (the whole motivation here is for hardware that takes inordinately
> long to do discard, I wonder if we can count on such hardware to
> properly fill out this info....)

Looking at the docs in kernel/Documentation/block/queue-sysfs.rst:

discard_max_hw_bytes (RO)
-------------------------
Devices that support discard functionality may have internal limits on
the number of bytes that can be trimmed or unmapped in a single operation.
The discard_max_bytes parameter is set by the device driver to the maximum
number of bytes that can be discarded in a single operation. Discard
requests issued to the device must not exceed this limit. A discard_max_bytes
value of 0 means that the device does not support discard functionality.

discard_max_bytes (RW)
----------------------
While discard_max_hw_bytes is the hardware limit for the device, this
setting is the software limit. Some devices exhibit large latencies when
large discards are issued, setting this value lower will make Linux issue
smaller discards and potentially help reduce latencies induced by large
discard operations.

it seems like a strong suggestion that the discard_max_hw_bytes value may
still be problematic, and discard_max_bytes can be hand-tuned to something
smaller if it's a problem.  To me that indicates that discard_max_hw_bytes
probably can't be trusted to be performant, and presumably discard_max_bytes
won't be either in that case unless it's been hand-tuned by the admin?

-Eric
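For anyone who wants to check what a given device advertises, the value can be read straight from sysfs. The sketch below assumes the usual /sys/block/<dev>/queue/discard_max_bytes location described in the documentation quoted above; `read_u64_file` is a hypothetical helper, not something that exists in xfsprogs or libblkid.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a sysfs-style file, e.g.
 * /sys/block/<dev>/queue/discard_max_bytes.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
static int
read_u64_file(const char *path, uint64_t *out)
{
	FILE	*f;
	int	ret = -1;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%" SCNu64, out) == 1)
		ret = 0;
	fclose(f);
	return ret;
}
```

Per the quoted documentation, a value of 0 means the device does not support discard. A non-zero value only bounds the request size; it says nothing about how long a request of that size will take.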

Eric Sandeen Nov. 26, 2019, 7:40 p.m. UTC | #11

On 11/22/19 3:30 PM, Eric Sandeen wrote:
> On 11/22/19 3:10 PM, Eric Sandeen wrote:
>> On 11/21/19 5:18 PM, Dave Chinner wrote:
>>> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
>>>> Signed-off-by: Pavel Reichl <preichl@redhat.com>
>>>> ---
>>>
>>> This is missing an explanation about why the change is being made
>>> and what was considered when making decisions about the change.
>>>
>>> e.g. my first questions on looking at the patch were:
>>>
>>> 	- why do we need to break up the discards into 2GB chunks?
>>> 	- why 2GB?
>>> 	- why not use libblkid to query the maximum discard size
>>> 	  and use that as the step size instead?
>>
>> Just wondering, can we trust that to be reasonably performant?
>> (the whole motivation here is for hardware that takes inordinately
>> long to do discard, I wonder if we can count on such hardware to
>> properly fill out this info....)
> 
> Looking at the docs in kernel/Documentation/block/queue-sysfs.rst:
> 
> discard_max_hw_bytes (RO)
> -------------------------
> Devices that support discard functionality may have internal limits on
> the number of bytes that can be trimmed or unmapped in a single operation.
> The discard_max_bytes parameter is set by the device driver to the maximum
> number of bytes that can be discarded in a single operation. Discard
> requests issued to the device must not exceed this limit. A discard_max_bytes
> value of 0 means that the device does not support discard functionality.
> 
> discard_max_bytes (RW)
> ----------------------
> While discard_max_hw_bytes is the hardware limit for the device, this
> setting is the software limit. Some devices exhibit large latencies when
> large discards are issued, setting this value lower will make Linux issue
> smaller discards and potentially help reduce latencies induced by large
> discard operations.
> 
> it seems like a strong suggestion that the discard_max_hw_bytes value may
> still be problematic, and discard_max_bytes can be hand-tuned to something
> smaller if it's a problem.  To me that indicates that discard_max_hw_bytes
> probably can't be trusted to be performant, and presumably discard_max_bytes
> won't be either in that case unless it's been hand-tuned by the admin?

Lukas, Jeff Moyer reminded me that you did a lot of investigation into this
behavior a while back.  Can you shed light on this, particularly how you
chose 2G as the discard granularity for mke2fs?

Thanks,
-Eric

Eric Sandeen Nov. 26, 2019, 8:53 p.m. UTC | #12

On 11/21/19 3:44 PM, Pavel Reichl wrote:
> Signed-off-by: Pavel Reichl <preichl@redhat.com>
> ---
>  mkfs/xfs_mkfs.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 18338a61..a02d6f66 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -1242,15 +1242,33 @@ done:
>  static void
>  discard_blocks(dev_t dev, uint64_t nsectors)
>  {
> -	int fd;
> +	int		fd;
> +	uint64_t	offset		= 0;
> +	/* Maximal chunk of bytes to discard is 2GB */
> +	const uint64_t	step		= (uint64_t)2<<30;

Regarding the discard step size, I would like to just see us keep 2G -
I see problems with the alternate suggestions proposed in the
threads on this patch review:

1) query block device for maximal discard size
-> block device folks I've talked to (Jeff Moyer in particular) stated
   that many devices are known for putting a huge value in here, and then
   taking far, far too long to process that size request.  In short,
   maximum size != fast.

2) discard one AG size at a time
-> this can be up to 1T, which puts us right back at our problem of large,
   slow discards.  And in particular, AG size has no relation at all to a
   device's discard behavior.  (further complicating this, we don't have
   this geometry available anywhere in the current chain of calls to the
   discard ioctl.)

Lukas did an investigation of discard behaviors (though it was some time
ago https://sourceforge.net/projects/test-discard/) and arrived at 2G as
a reasonable size after testing many different devices - I've not seen any
complaints from mke2fs users about problems doing discards in 2G chunks.

So I think just picking a fixed 2G size is the best plan for now.

(one nitpick, I'd fix the comment above to not say "Maximal" because that
sounds like some hard limit imposed by something other than the code; I'd
just say "Discard the device 2G at a time" or something like that.)

A comment above the loop explaining in more detail that we iterate in
step sizes so that the utility can be interrupted would probably be
helpful.

Thanks,
-Eric
