
xfsprogs: Issue smaller discards at mkfs

Message ID 20171026144131.26885-1-keith.busch@intel.com (mailing list archive)
State Rejected

Commit Message

Keith Busch Oct. 26, 2017, 2:41 p.m. UTC
Running mkfs.xfs was discarding the entire capacity in a single range. The
block layer would split these into potentially many smaller requests
and dispatch all of them to the device at roughly the same time.

SSD capacities are getting so large that full capacity discards will
take some time to complete. When discards are deeply queued, the block
layer may trigger timeout handling and IO failure, though the device is
operating normally.

This patch uses smaller discard ranges in a loop for mkfs to avoid
risking such timeouts. The max discard range is arbitrarily set to
128GB in this patch.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 include/linux.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
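
For illustration, a minimal standalone sketch of the same chunked-discard
approach outside of mkfs: it assumes a Linux block device node and the
BLKDISCARD/BLKGETSIZE64 ioctls, the 128 GiB chunk size simply mirrors the
arbitrary cap chosen in the patch, and running it discards the entire
device.

/* chunked_discard.c - split one large discard into bounded-size
 * BLKDISCARD calls, as platform_discard_blocks() does in the patch.
 * The chunk size is arbitrary; all data on the device is destroyed. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKDISCARD, BLKGETSIZE64 */

#define CHUNK	(128ULL << 30)	/* 128 GiB per BLKDISCARD call */

static int discard_range(int fd, uint64_t start, uint64_t len)
{
	while (len > 0) {
		uint64_t range[2] = { start, len < CHUNK ? len : CHUNK };

		if (ioctl(fd, BLKDISCARD, &range) < 0)
			return errno;
		start += range[1];
		len -= range[1];
	}
	return 0;
}

int main(int argc, char **argv)
{
	uint64_t size;
	int fd, err;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0 || ioctl(fd, BLKGETSIZE64, &size) < 0) {
		perror(argv[1]);
		return 1;
	}
	err = discard_range(fd, 0, size);
	if (err)
		fprintf(stderr, "BLKDISCARD: %s\n", strerror(err));
	close(fd);
	return err ? 1 : 0;
}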

Comments

Darrick J. Wong Oct. 26, 2017, 4:25 p.m. UTC | #1
On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
> Running mkfs.xfs was discarding the entire capacity in a single range. The
> block layer would split these into potentially many smaller requests
> and dispatch all of them to the device at roughly the same time.
> 
> SSD capacities are getting so large that full capacity discards will
> take some time to complete. When discards are deeply queued, the block
> layer may trigger timeout handling and IO failure, though the device is
> operating normally.
> 
> This patch uses smaller discard ranges in a loop for mkfs to avoid
> risking such timeouts. The max discard range is arbitrarily set to
> 128GB in this patch.

I'd have thought devices would set sane blk_queue_max_discard_sectors
so that the block layer doesn't send such a huge command that the kernel
times out...

...but then I actually went and grepped that in the kernel and
discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
which is 2T.  Frighteningly xen-blkfront passes in get_capacity() (which
overflows the unsigned int parameter on big virtual disks, I guess?).

(I still think this is the kernel's problem, not userspace's, but now
with an extra layer of OMGWTF sprayed on.)

I dunno.  What kind of device produces these timeouts, and does it go
away if max_discards is lowered?

--D

> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  include/linux.h | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux.h b/include/linux.h
> index 6ce344c5..702aee0c 100644
> --- a/include/linux.h
> +++ b/include/linux.h
> @@ -132,10 +132,17 @@ static __inline__ void platform_uuid_copy(uuid_t *dst, uuid_t *src)
>  static __inline__ int
>  platform_discard_blocks(int fd, uint64_t start, uint64_t len)
>  {
> -	uint64_t range[2] = { start, len };
> +	uint64_t end = start + len;
> +	uint64_t size = 128ULL * 1024ULL * 1024ULL * 1024ULL;
>  
> -	if (ioctl(fd, BLKDISCARD, &range) < 0)
> -		return errno;
> +	for (; start < end; start += size) {
> +		uint64_t range[2] = { start, MIN(len, size) };
> +
> +		len -= range[1];
> +		if (ioctl(fd, BLKDISCARD, &range) < 0)
> +			return errno;
> +
> +	}
>  	return 0;
>  }
>  
> -- 
> 2.13.6
> 
Eric Sandeen Oct. 26, 2017, 5:49 p.m. UTC | #2
On 10/26/17 11:25 AM, Darrick J. Wong wrote:
> On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
>> Running mkfs.xfs was discarding the entire capacity in a single range. The
>> block layer would split these into potentially many smaller requests
>> and dispatch all of them to the device at roughly the same time.
>>
>> SSD capacities are getting so large that full capacity discards will
>> take some time to complete. When discards are deeply queued, the block
>> layer may trigger timeout handling and IO failure, though the device is
>> operating normally.
>>
>> This patch uses smaller discard ranges in a loop for mkfs to avoid
>> risking such timeouts. The max discard range is arbitrarily set to
>> 128GB in this patch.
> 
> I'd have thought devices would set sane blk_queue_max_discard_sectors
> so that the block layer doesn't send such a huge command that the kernel
> times out...
> 
> ...but then I actually went and grepped that in the kernel and
> discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
> which is 2T.  Frighteningly xen-blkfront passes in get_capacity() (which
> overflows the unsigned int parameter on big virtual disks, I guess?).
> 
> (I still think this is the kernel's problem, not userspace's, but now
> with an extra layer of OMGWTF sprayed on.)
> 
> I dunno.  What kind of device produces these timeouts, and does it go
> away if max_discards is lowered?

Yeah, lots of devices are unhappy with large discards.  And yeah, in the
end I think this papers over a kernel and/or hardware problem.

But sometimes we do that, if only to keep things working reasonably
well with older kernels or hardware that'll never get fixed...

(TBH sometimes I regret putting mkfs-time discard in by default in the
first place.)

-Eric



> --D
> 
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  include/linux.h | 13 ++++++++++---
>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux.h b/include/linux.h
>> index 6ce344c5..702aee0c 100644
>> --- a/include/linux.h
>> +++ b/include/linux.h
>> @@ -132,10 +132,17 @@ static __inline__ void platform_uuid_copy(uuid_t *dst, uuid_t *src)
>>  static __inline__ int
>>  platform_discard_blocks(int fd, uint64_t start, uint64_t len)
>>  {
>> -	uint64_t range[2] = { start, len };
>> +	uint64_t end = start + len;
>> +	uint64_t size = 128ULL * 1024ULL * 1024ULL * 1024ULL;
>>  
>> -	if (ioctl(fd, BLKDISCARD, &range) < 0)
>> -		return errno;
>> +	for (; start < end; start += size) {
>> +		uint64_t range[2] = { start, MIN(len, size) };
>> +
>> +		len -= range[1];
>> +		if (ioctl(fd, BLKDISCARD, &range) < 0)
>> +			return errno;
>> +
>> +	}
>>  	return 0;
>>  }
>>  
>> -- 
>> 2.13.6
>>
Keith Busch Oct. 26, 2017, 6 p.m. UTC | #3
On Thu, Oct 26, 2017 at 09:25:18AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
> > Running mkfs.xfs was discarding the entire capacity in a single range. The
> > block layer would split these into potentially many smaller requests
> > and dispatch all of them to the device at roughly the same time.
> > 
> > SSD capacities are getting so large that full capacity discards will
> > take some time to complete. When discards are deeply queued, the block
> > layer may trigger timeout handling and IO failure, though the device is
> > operating normally.
> > 
> > This patch uses smaller discard ranges in a loop for mkfs to avoid
> > risking such timeouts. The max discard range is arbitrarily set to
> > 128GB in this patch.
> 
> I'd have thought devices would set sane blk_queue_max_discard_sectors
> so that the block layer doesn't send such a huge command that the kernel
> times out...

The block limit only specifies the maximum size of a *single* discard
request as seen by the end device. This single request is not a problem
for timeouts, as far as I know.

The timeouts occur when queueing many of them at the same time: the last
one in the queue will have very high latency compared to ones ahead of
it in the queue if the device processes discards serially (many do).

There's no corresponding limit on the number of outstanding discard
requests that can be dispatched at the same time; the max number of
dispatched commands is shared with reads and writes.
 
> ...but then I actually went and grepped that in the kernel and
> discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
> which is 2T.  Frighteningly xen-blkfront passes in get_capacity() (which
> overflows the unsigned int parameter on big virtual disks, I guess?).

The block layer will limit a single discard range to 4GB. If the IOCTL
specifies a larger range, the block layer will split it into multiple
requests.
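
(For scale, assuming that 4GB cap: a 512GB BLKDISCARD becomes 128
back-to-back requests, and a full 4TB device roughly 1024, which is the
case described below.)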

> (I still think this is the kernel's problem, not userspace's, but now
> with an extra layer of OMGWTF sprayed on.)
> 
> I dunno.  What kind of device produces these timeouts, and does it go
> away if max_discards is lowered?

We observe timeouts on capacities above 4TB: a single BLKDISCARD for that
capacity has the block layer send 1024 discard requests at 4GB each.

Could the drivers do something else to mitigate this? We could set queue
depths lower to throttle dispatched commands and the problem would go
away. The depth, though, is set to optimize read and write, and we don't
want to harm that path to mitigate discard latency spikes.
Eric Sandeen Oct. 26, 2017, 6:01 p.m. UTC | #4
On 10/26/17 12:49 PM, Eric Sandeen wrote:
> On 10/26/17 11:25 AM, Darrick J. Wong wrote:
>> On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
>>> Running mkfs.xfs was discarding the entire capacity in a single range. The
>>> block layer would split these into potentially many smaller requests
>>> and dispatch all of them to the device at roughly the same time.
>>>
>>> SSD capacities are getting so large that full capacity discards will
>>> take some time to complete. When discards are deeply queued, the block
>>> layer may trigger timeout handling and IO failure, though the device is
>>> operating normally.
>>>
>>> This patch uses smaller discard ranges in a loop for mkfs to avoid
>>> risking such timeouts. The max discard range is arbitrarily set to
>>> 128GB in this patch.
>>
>> I'd have thought devices would set sane blk_queue_max_discard_sectors
>> so that the block layer doesn't send such a huge command that the kernel
>> times out...
>>
>> ...but then I actually went and grepped that in the kernel and
>> discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
>> which is 2T.  Frighteningly xen-blkfront passes in get_capacity() (which
>> overflows the unsigned int parameter on big virtual disks, I guess?).
>>
>> (I still think this is the kernel's problem, not userspace's, but now
>> with an extra layer of OMGWTF sprayed on.)
>>
>> I dunno.  What kind of device produces these timeouts, and does it go
>> away if max_discards is lowered?
> 
> Yeah, lots of devices are unhappy with large discards.  And yeah, in the
> end I think this papers over a kernel and/or hardware problem.
> 
> But sometimes we do that, if only to keep things working reasonably
> well with older kernels or hardware that'll never get fixed...
> 
> (TBH sometimes I regret putting mkfs-time discard in by default in the
> first place.)

I think I left this on a too-positive note.  It seems pretty clear that there
is no way to fix all of userspace to not issue "too big" discards, when
"too big" isn't even well-defined, or specified by anything at all.

I'm not wise in the ways of queueing and throttling, but from my naive
perspective, it seems like something to be fixed in the kernel, or if it
can't, export some new "maximum discard request size" which can be trusted?

-Eric
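
A side note on exporting a trusted size: assuming a kernel that exposes
/sys/block/<dev>/queue/discard_max_bytes (most recent ones do), userspace
can already read the advertised per-request cap, though as Keith notes
below, that bounds a single request rather than how many get queued at
once. A minimal sketch of reading it:

/* discard_max.c - print the per-request discard cap the block layer
 * advertises for a device name such as "sda" or "nvme0n1". */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static int discard_max_bytes(const char *dev, uint64_t *out)
{
	char path[128];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/discard_max_bytes", dev);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%" SCNu64, out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

int main(int argc, char **argv)
{
	uint64_t max;

	if (argc != 2 || discard_max_bytes(argv[1], &max) != 0) {
		fprintf(stderr, "usage: %s <device-name>\n", argv[0]);
		return 1;
	}
	printf("%s: discard_max_bytes = %" PRIu64 "\n", argv[1], max);
	return 0;
}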
Keith Busch Oct. 26, 2017, 6:32 p.m. UTC | #5
On Thu, Oct 26, 2017 at 01:01:29PM -0500, Eric Sandeen wrote:
> On 10/26/17 12:49 PM, Eric Sandeen wrote:
> > Yeah, lots of devices are unhappy with large discards.  And yeah, in the
> > end I think this papers over a kernel and/or hardware problem.
> > 
> > But sometimes we do that, if only to keep things working reasonably
> > well with older kernels or hardware that'll never get fixed...
> > 
> > (TBH sometimes I regret putting mkfs-time discard in by default in the
> > first place.)
> 
> I think I left this on a too-positive note.  It seems pretty clear that there
> is no way to fix all of userspace to not issue "too big" discards, when
> "too big" isn't even well-defined, or specified by anything at all.

Yeah, I totally get this proposal is just a bandaid, and other user
space programs may suffer when used with devices behaving this way. XFS
is just very popular, so it's frequently reported as problematic against
large capacity devices.
 
> I'm not wise in the ways of queueing and throttling, but from my naive
> perspective, it seems like something to be fixed in the kernel, or if it
> can't, export some new "maximum discard request size" which can be trusted?

The problem isn't really that a discard sent to the device was "too
big". It's that "too many" are issued at the same time, and there isn't
a way for a driver to limit the number of outstanding discards without
affecting read/write.
Darrick J. Wong Oct. 26, 2017, 7:59 p.m. UTC | #6
On Thu, Oct 26, 2017 at 12:32:17PM -0600, Keith Busch wrote:
> On Thu, Oct 26, 2017 at 01:01:29PM -0500, Eric Sandeen wrote:
> > On 10/26/17 12:49 PM, Eric Sandeen wrote:
> > > Yeah, lots of devices are unhappy with large discards.  And yeah, in the
> > > end I think this papers over a kernel and/or hardware problem.
> > > 
> > > But sometimes we do that, if only to keep things working reasonably
> > > well with older kernels or hardware that'll never get fixed...
> > > 
> > > (TBH sometimes I regret putting mkfs-time discard in by default in the
> > > first place.)
> > 
> > I think I left this on a too-positive note.  It seems pretty clear that there
> > is no way to fix all of userspace to not issue "too big" discards, when
> > "too big" isn't even well-defined, or specified by anything at all.
> 
> Yeah, I totally get this proposal is just a bandaid, and other user
> space programs may suffer when used with devices behaving this way. XFS
> is just very popular, so it's frequently reported as problematic against
> large capacity devices.

Sure, but now you have to go fix mke2fs and everything /else/ that
issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
until you fix every program to work around this weird thing in the
kernel there'll still be someone somewhere with this timeout problem...

...so I started digging into what the kernel does with a BLKDISCARD
request, which is to say that I looked at blkdev_issue_discard.  That
function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
splits the request into a chain of UINT_MAX-sized struct bios.

128G's worth of 4G ios == 32 chained bios.

2T worth of 4G ios == 512 chained bios.

So now I'm wondering, is the problem more that the first bio in the
chain times out because the last one hasn't finished yet, so the whole
thing gets aborted because we chained too much work together?

Would it make sense to fix __blkdev_issue_discard to chain fewer bios
together?  Or just issue the bios independently and track the
completions individually?

> > I'm not wise in the ways of queueing and throttling, but from my naive
> > perspective, it seems like something to be fixed in the kernel, or if it
> > can't, export some new "maximum discard request size" which can be trusted?

I would've thought that's what max_discard_sectors was for, but... eh.

How big is the device you were trying to mkfs, anyway?

--D

> The problem isn't really that a discard sent to the device was "too
> big". It's that "too many" are issued at the same time, and there isn't
> a way for a driver to limit the number of outstanding discards without
> affecting read/write.
Keith Busch Oct. 26, 2017, 9:24 p.m. UTC | #7
On Thu, Oct 26, 2017 at 12:59:23PM -0700, Darrick J. Wong wrote:
> 
> Sure, but now you have to go fix mke2fs and everything /else/ that
> issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
> until you fix every program to work around this weird thing in the
> kernel there'll still be someone somewhere with this timeout problem...

e2fsprogs already splits large discards in a loop. ;)

> ...so I started digging into what the kernel does with a BLKDISCARD
> request, which is to say that I looked at blkdev_issue_discard.  That
> function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
> splits the request into a chain of UINT_MAX-sized struct bios.
> 
> 128G's worth of 4G ios == 32 chained bios.
> 
> 2T worth of 4G ios == 512 chained bios.
> 
> So now I'm wondering, is the problem more that the first bio in the
> chain times out because the last one hasn't finished yet, so the whole
> thing gets aborted because we chained too much work together?

You're sort of on the right track. The timeouts are set on an individual
request in the chain rather than one timeout for the entire chain.

All the bios in the chain get turned into 'struct request' and sent
to the low-level driver. The driver calls blk_mq_start_request before
sending to hardware. That starts the timer on _that_ request,
independent of the other requests in the chain.

NVMe supports very large queues. A 4TB discard becomes 1024 individual
requests started at nearly the same time. The last ones in the queue are
the ones that risk timeout.

When we're doing read/write, latencies at the same depth are well within
tolerance, and high queue depths are good for throughput. When doing
discard, though, tail latencies fall outside the timeout tolerance at
the same queue depth.
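
To put rough, assumed numbers on that: if the device deallocates a 4GB
range in about 100ms and works through queued discards serially, the last
of the 1024 requests from a 4TB BLKDISCARD completes only after roughly
1024 x 0.1s = ~100s, far beyond a typical 30-second request timeout, even
though each individual command is serviced normally.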
 
> Would it make sense to fix __blkdev_issue_discard to chain fewer bios
> together?  Or just issue the bios independently and track the
> completions individually?
>
> > > I'm not wise in the ways of queueing and throttling, but from my naive
> > > perspective, it seems like something to be fixed in the kernel, or if it
> > > can't, export some new "maximum discard request size" which can be trusted?
> 
> I would've thought that's what max_discard_sectors was for, but... eh.
> 
> How big is the device you were trying to mkfs, anyway?

The largest single SSD I have is 6.4TB. Other folks I know have larger.
Dave Chinner Oct. 26, 2017, 10:24 p.m. UTC | #8
On Thu, Oct 26, 2017 at 03:24:15PM -0600, Keith Busch wrote:
> On Thu, Oct 26, 2017 at 12:59:23PM -0700, Darrick J. Wong wrote:
> > 
> > Sure, but now you have to go fix mke2fs and everything /else/ that
> > issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
> > until you fix every program to work around this weird thing in the
> > kernel there'll still be someone somewhere with this timeout problem...
> 
> e2fsprogs already splits large discards in a loop. ;)
> 
> > ...so I started digging into what the kernel does with a BLKDISCARD
> > request, which is to say that I looked at blkdev_issue_discard.  That
> > function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
> > splits the request into a chain of UINT_MAX-sized struct bios.
> > 
> > 128G's worth of 4G ios == 32 chained bios.
> > 
> > 2T worth of 4G ios == 512 chained bios.
> > 
> > So now I'm wondering, is the problem more that the first bio in the
> > chain times out because the last one hasn't finished yet, so the whole
> > thing gets aborted because we chained too much work together?
> 
> You're sort of on the right track. The timeouts are set on an individual
> request in the chain rather than one timeout for the entire chain.
> 
> All the bios in the chain get turned into 'struct request' and sent
> to the low-level driver. The driver calls blk_mq_start_request before
> sending to hardware. That starts the timer on _that_ request,
> independent of the other requests in the chain.
> 
> NVMe supports very large queues. A 4TB discard becomes 1024 individual
> requests started at nearly the same time. The last ones in the queue are
> the ones that risk timeout.

And that's just broken when it comes to requests that might take
several seconds to run.  This is a problem the kernel needs to fix -
it's not something we should be working around in userspace.

I can't wait to see how badly running fstrim on one of those devices
screws them up....

> When we're doing read/write, latencies at the same depth are well within
> tolerance, and high queue depths are good for throughput. When doing
> discard, though, tail latencies fall outside the timeout tolerance at
> the same queue depth.

Yup, because most SSDs have really shit discard implementations -
nobody who "reviews" SSDs looks at the performance aspect of discard
and so it doesn't get publicly compared against other drives like
read/write IO performance does. IOWs, discard doesn't sell devices,
so it never gets fixed or optimised.

Hardware quirks should be dealt with by the kernel, not userspace.

Cheers,

Dave.
Keith Busch Oct. 26, 2017, 11:09 p.m. UTC | #9
On Fri, Oct 27, 2017 at 09:24:50AM +1100, Dave Chinner wrote:
> And that's just broken when it comes to requests that might take
> several seconds to run.  This is a problem the kernel needs to fix -
> it's not something we should be working around in userspace.

Yeah, I agree that'd be a better thing to do, and it would achieve the
desired impact for every user. We currently don't have an indicator the
kernel can react to for this scenario, but I'll reinvestigate from the
driver and kernel side for more options.

Patch

diff --git a/include/linux.h b/include/linux.h
index 6ce344c5..702aee0c 100644
--- a/include/linux.h
+++ b/include/linux.h
@@ -132,10 +132,17 @@  static __inline__ void platform_uuid_copy(uuid_t *dst, uuid_t *src)
 static __inline__ int
 platform_discard_blocks(int fd, uint64_t start, uint64_t len)
 {
-	uint64_t range[2] = { start, len };
+	uint64_t end = start + len;
+	uint64_t size = 128ULL * 1024ULL * 1024ULL * 1024ULL;
 
-	if (ioctl(fd, BLKDISCARD, &range) < 0)
-		return errno;
+	for (; start < end; start += size) {
+		uint64_t range[2] = { start, MIN(len, size) };
+
+		len -= range[1];
+		if (ioctl(fd, BLKDISCARD, &range) < 0)
+			return errno;
+
+	}
 	return 0;
 }