Message ID | 20171026144131.26885-1-keith.busch@intel.com (mailing list archive) |
---|---|
State | Rejected, archived |
On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
> Running mkfs.xfs was discarding the entire capacity in a single range. The
> block layer would split these into potentially many smaller requests
> and dispatch all of them to the device at roughly the same time.
>
> SSD capacities are getting so large that full capacity discards will
> take some time to complete. When discards are deeply queued, the block
> layer may trigger timeout handling and IO failure, though the device is
> operating normally.
>
> This patch uses smaller discard ranges in a loop for mkfs to avoid
> risking such timeouts. The max discard range is arbitrarily set to
> 128GB in this patch.

I'd have thought devices would set sane blk_queue_max_discard_sectors
so that the block layer doesn't send such a huge command that the kernel
times out...

...but then I actually went and grepped that in the kernel and
discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
which is 2T. Frighteningly xen-blkfront passes in get_capacity() (which
overflows the unsigned int parameter on big virtual disks, I guess?).

(I still think this is the kernel's problem, not userspace's, but now
with an extra layer of OMGWTF sprayed on.)

I dunno. What kind of device produces these timeouts, and does it go
away if max_discards is lowered?

--D

> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  include/linux.h | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux.h b/include/linux.h
> index 6ce344c5..702aee0c 100644
> --- a/include/linux.h
> +++ b/include/linux.h
> @@ -132,10 +132,17 @@ static __inline__ void platform_uuid_copy(uuid_t *dst, uuid_t *src)
>  static __inline__ int
>  platform_discard_blocks(int fd, uint64_t start, uint64_t len)
>  {
> -	uint64_t range[2] = { start, len };
> +	uint64_t end = start + len;
> +	uint64_t size = 128ULL * 1024ULL * 1024ULL * 1024ULL;
>
> -	if (ioctl(fd, BLKDISCARD, &range) < 0)
> -		return errno;
> +	for (; start < end; start += size) {
> +		uint64_t range[2] = { start, MIN(len, size) };
> +
> +		len -= range[1];
> +		if (ioctl(fd, BLKDISCARD, &range) < 0)
> +			return errno;
> +	}
>  	return 0;
>  }
>
> --
> 2.13.6

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/26/17 11:25 AM, Darrick J. Wong wrote:
> On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
>> Running mkfs.xfs was discarding the entire capacity in a single range. The
>> block layer would split these into potentially many smaller requests
>> and dispatch all of them to the device at roughly the same time.
>>
>> SSD capacities are getting so large that full capacity discards will
>> take some time to complete. When discards are deeply queued, the block
>> layer may trigger timeout handling and IO failure, though the device is
>> operating normally.
>>
>> This patch uses smaller discard ranges in a loop for mkfs to avoid
>> risking such timeouts. The max discard range is arbitrarily set to
>> 128GB in this patch.
>
> I'd have thought devices would set sane blk_queue_max_discard_sectors
> so that the block layer doesn't send such a huge command that the kernel
> times out...
>
> ...but then I actually went and grepped that in the kernel and
> discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
> which is 2T. Frighteningly xen-blkfront passes in get_capacity() (which
> overflows the unsigned int parameter on big virtual disks, I guess?).
>
> (I still think this is the kernel's problem, not userspace's, but now
> with an extra layer of OMGWTF sprayed on.)
>
> I dunno. What kind of device produces these timeouts, and does it go
> away if max_discards is lowered?

Yeah, lots of devices are unhappy with large discards. And yeah, in the
end I think this papers over a kernel and/or hardware problem.

But sometimes we do that, if only to keep things working reasonably
well with older kernels or hardware that'll never get fixed...

(TBH sometimes I regret putting mkfs-time discard in by default in the
first place.)

-Eric

> --D
>
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  include/linux.h | 13 ++++++++++---
>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux.h b/include/linux.h
>> index 6ce344c5..702aee0c 100644
>> --- a/include/linux.h
>> +++ b/include/linux.h
>> @@ -132,10 +132,17 @@ static __inline__ void platform_uuid_copy(uuid_t *dst, uuid_t *src)
>>  static __inline__ int
>>  platform_discard_blocks(int fd, uint64_t start, uint64_t len)
>>  {
>> -	uint64_t range[2] = { start, len };
>> +	uint64_t end = start + len;
>> +	uint64_t size = 128ULL * 1024ULL * 1024ULL * 1024ULL;
>>
>> -	if (ioctl(fd, BLKDISCARD, &range) < 0)
>> -		return errno;
>> +	for (; start < end; start += size) {
>> +		uint64_t range[2] = { start, MIN(len, size) };
>> +
>> +		len -= range[1];
>> +		if (ioctl(fd, BLKDISCARD, &range) < 0)
>> +			return errno;
>> +	}
>>  	return 0;
>>  }
>>
>> --
>> 2.13.6
On Thu, Oct 26, 2017 at 09:25:18AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
> > Running mkfs.xfs was discarding the entire capacity in a single range. The
> > block layer would split these into potentially many smaller requests
> > and dispatch all of them to the device at roughly the same time.
> >
> > SSD capacities are getting so large that full capacity discards will
> > take some time to complete. When discards are deeply queued, the block
> > layer may trigger timeout handling and IO failure, though the device is
> > operating normally.
> >
> > This patch uses smaller discard ranges in a loop for mkfs to avoid
> > risking such timeouts. The max discard range is arbitrarily set to
> > 128GB in this patch.
>
> I'd have thought devices would set sane blk_queue_max_discard_sectors
> so that the block layer doesn't send such a huge command that the kernel
> times out...

The block limit only specifies the maximum size of a *single* discard
request as seen by the end device. A single request is not a problem
for timeouts, as far as I know. The timeouts occur when queueing many
of them at the same time: the last one in the queue will have very high
latency compared to the ones ahead of it if the device processes
discards serially (many do). There's no limit on the maximum number of
outstanding discard requests that can be dispatched at the same time;
the maximum number of dispatched commands is shared with read and write.

> ...but then I actually went and grepped that in the kernel and
> discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
> which is 2T. Frighteningly xen-blkfront passes in get_capacity() (which
> overflows the unsigned int parameter on big virtual disks, I guess?).

The block layer will limit a single discard range to 4GB. If the IOCTL
specifies a larger range, the block layer will split it into multiple
requests.

> (I still think this is the kernel's problem, not userspace's, but now
> with an extra layer of OMGWTF sprayed on.)
>
> I dunno. What kind of device produces these timeouts, and does it go
> away if max_discards is lowered?

We observe timeouts on capacities above 4TB: a single BLKDISCARD for
that capacity has the block layer send 1024 discard requests at 4GB each.

Could the drivers do something else to mitigate this? We could set queue
depths lower to throttle dispatched commands and the problem would go
away. The depth, though, is set to optimize read and write, and we don't
want to harm that path to mitigate discard latency spikes.
On 10/26/17 12:49 PM, Eric Sandeen wrote:
> On 10/26/17 11:25 AM, Darrick J. Wong wrote:
>> On Thu, Oct 26, 2017 at 08:41:31AM -0600, Keith Busch wrote:
>>> Running mkfs.xfs was discarding the entire capacity in a single range. The
>>> block layer would split these into potentially many smaller requests
>>> and dispatch all of them to the device at roughly the same time.
>>>
>>> SSD capacities are getting so large that full capacity discards will
>>> take some time to complete. When discards are deeply queued, the block
>>> layer may trigger timeout handling and IO failure, though the device is
>>> operating normally.
>>>
>>> This patch uses smaller discard ranges in a loop for mkfs to avoid
>>> risking such timeouts. The max discard range is arbitrarily set to
>>> 128GB in this patch.
>>
>> I'd have thought devices would set sane blk_queue_max_discard_sectors
>> so that the block layer doesn't send such a huge command that the kernel
>> times out...
>>
>> ...but then I actually went and grepped that in the kernel and
>> discovered that nbd, zram, raid0, mtd, and nvme all pass in UINT_MAX,
>> which is 2T. Frighteningly xen-blkfront passes in get_capacity() (which
>> overflows the unsigned int parameter on big virtual disks, I guess?).
>>
>> (I still think this is the kernel's problem, not userspace's, but now
>> with an extra layer of OMGWTF sprayed on.)
>>
>> I dunno. What kind of device produces these timeouts, and does it go
>> away if max_discards is lowered?
>
> Yeah, lots of devices are unhappy with large discards. And yeah, in the
> end I think this papers over a kernel and/or hardware problem.
>
> But sometimes we do that, if only to keep things working reasonably
> well with older kernels or hardware that'll never get fixed...
>
> (TBH sometimes I regret putting mkfs-time discard in by default in the
> first place.)

I think I left this on a too-positive note.

It seems pretty clear that there is no way to fix all of userspace to
not issue "too big" discards, when "too big" isn't even well-defined,
or specified by anything at all.

I'm not wise in the ways of queueing and throttling, but from my naive
perspective, it seems like something to be fixed in the kernel, or if
it can't, export some new "maximum discard request size" which can be
trusted?

-Eric
On Thu, Oct 26, 2017 at 01:01:29PM -0500, Eric Sandeen wrote:
> On 10/26/17 12:49 PM, Eric Sandeen wrote:
> > Yeah, lots of devices are unhappy with large discards. And yeah, in the
> > end I think this papers over a kernel and/or hardware problem.
> >
> > But sometimes we do that, if only to keep things working reasonably
> > well with older kernels or hardware that'll never get fixed...
> >
> > (TBH sometimes I regret putting mkfs-time discard in by default in the
> > first place.)
>
> I think I left this on a too-positive note. It seems pretty clear that there
> is no way to fix all of userspace to not issue "too big" discards, when
> "too big" isn't even well-defined, or specified by anything at all.

Yeah, I totally get this proposal is just a bandaid, and other user
space programs may suffer when used with devices behaving this way. XFS
is just very popular, so it's frequently reported as problematic against
large capacity devices.

> I'm not wise in the ways of queueing and throttling, but from my naive
> perspective, it seems like something to be fixed in the kernel, or if it
> can't, export some new "maximum discard request size" which can be trusted?

The problem isn't really that a discard sent to the device was "too
big". It's that "too many" are issued at the same time, and there isn't
a way for a driver to limit the number of outstanding discards without
affecting read/write.
On Thu, Oct 26, 2017 at 12:32:17PM -0600, Keith Busch wrote:
> On Thu, Oct 26, 2017 at 01:01:29PM -0500, Eric Sandeen wrote:
> > On 10/26/17 12:49 PM, Eric Sandeen wrote:
> > > Yeah, lots of devices are unhappy with large discards. And yeah, in the
> > > end I think this papers over a kernel and/or hardware problem.
> > >
> > > But sometimes we do that, if only to keep things working reasonably
> > > well with older kernels or hardware that'll never get fixed...
> > >
> > > (TBH sometimes I regret putting mkfs-time discard in by default in the
> > > first place.)
> >
> > I think I left this on a too-positive note. It seems pretty clear that there
> > is no way to fix all of userspace to not issue "too big" discards, when
> > "too big" isn't even well-defined, or specified by anything at all.
>
> Yeah, I totally get this proposal is just a bandaid, and other user
> space programs may suffer when used with devices behaving this way. XFS
> is just very popular, so it's frequently reported as problematic against
> large capacity devices.

Sure, but now you have to go fix mke2fs and everything /else/ that
issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
until you fix every program to work around this weird thing in the
kernel there'll still be someone somewhere with this timeout problem...

...so I started digging into what the kernel does with a BLKDISCARD
request, which is to say that I looked at blkdev_issue_discard. That
function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
splits the request into a chain of UINT_MAX-sized struct bios.

128G's worth of 4G ios == 32 chained bios.

2T worth of 4G ios == 512 chained bios.

So now I'm wondering, is the problem more that the first bio in the
chain times out because the last one hasn't finished yet, so the whole
thing gets aborted because we chained too much work together? Would it
make sense to fix __blkdev_issue_discard to chain fewer bios together?
Or just issue the bios independently and track the completions
individually?

> > I'm not wise in the ways of queueing and throttling, but from my naive
> > perspective, it seems like something to be fixed in the kernel, or if it
> > can't, export some new "maximum discard request size" which can be trusted?

I would've thought that's what max_discard_sectors was for, but... eh.

How big is the device you were trying to mkfs, anyway?

--D

> The problem isn't really that a discard sent to the device was "too
> big". It's that "too many" are issued at the same time, and there isn't
> a way for a driver to limit the number of outstanding discards without
> affecting read/write.
On Thu, Oct 26, 2017 at 12:59:23PM -0700, Darrick J. Wong wrote:
>
> Sure, but now you have to go fix mke2fs and everything /else/ that
> issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
> until you fix every program to work around this weird thing in the
> kernel there'll still be someone somewhere with this timeout problem...

e2fsprogs already splits large discards in a loop. ;)

> ...so I started digging into what the kernel does with a BLKDISCARD
> request, which is to say that I looked at blkdev_issue_discard. That
> function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
> splits the request into a chain of UINT_MAX-sized struct bios.
>
> 128G's worth of 4G ios == 32 chained bios.
>
> 2T worth of 4G ios == 512 chained bios.
>
> So now I'm wondering, is the problem more that the first bio in the
> chain times out because the last one hasn't finished yet, so the whole
> thing gets aborted because we chained too much work together?

You're sort of on the right track. The timeouts are set on an individual
request in the chain rather than one timeout for the entire chain.

All the bios in the chain get turned into 'struct request' and sent
to the low-level driver. The driver calls blk_mq_start_request before
sending to hardware. That starts the timer on _that_ request,
independent of the other requests in the chain.

NVMe supports very large queues. A 4TB discard becomes 1024 individual
requests started at nearly the same time. The last ones in the queue are
the ones that risk timeout.

When we're doing read/write, latencies at the same depth are well within
tolerance, and high queue depths are good for throughput. When doing
discard, though, tail latencies fall outside the timeout tolerance at
the same queue depth.

> Would it make sense to fix __blkdev_issue_discard to chain fewer bios
> together? Or just issue the bios independently and track the
> completions individually?

> > > I'm not wise in the ways of queueing and throttling, but from my naive
> > > perspective, it seems like something to be fixed in the kernel, or if it
> > > can't, export some new "maximum discard request size" which can be trusted?
>
> I would've thought that's what max_discard_sectors was for, but... eh.
>
> How big is the device you were trying to mkfs, anyway?

The largest single SSD I have is 6.4TB. Other folks I know have larger.
On Thu, Oct 26, 2017 at 03:24:15PM -0600, Keith Busch wrote:
> On Thu, Oct 26, 2017 at 12:59:23PM -0700, Darrick J. Wong wrote:
> >
> > Sure, but now you have to go fix mke2fs and everything /else/ that
> > issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
> > until you fix every program to work around this weird thing in the
> > kernel there'll still be someone somewhere with this timeout problem...
>
> e2fsprogs already splits large discards in a loop. ;)
>
> > ...so I started digging into what the kernel does with a BLKDISCARD
> > request, which is to say that I looked at blkdev_issue_discard. That
> > function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
> > splits the request into a chain of UINT_MAX-sized struct bios.
> >
> > 128G's worth of 4G ios == 32 chained bios.
> >
> > 2T worth of 4G ios == 512 chained bios.
> >
> > So now I'm wondering, is the problem more that the first bio in the
> > chain times out because the last one hasn't finished yet, so the whole
> > thing gets aborted because we chained too much work together?
>
> You're sort of on the right track. The timeouts are set on an individual
> request in the chain rather than one timeout for the entire chain.
>
> All the bios in the chain get turned into 'struct request' and sent
> to the low-level driver. The driver calls blk_mq_start_request before
> sending to hardware. That starts the timer on _that_ request,
> independent of the other requests in the chain.
>
> NVMe supports very large queues. A 4TB discard becomes 1024 individual
> requests started at nearly the same time. The last ones in the queue are
> the ones that risk timeout.

And that's just broken when it comes to requests that might take
several seconds to run. This is a problem the kernel needs to fix -
it's not something we should be working around in userspace.

I can't wait to see how badly running fstrim on one of those devices
screws them up....

> When we're doing read/write, latencies at the same depth are well within
> tolerance, and high queue depths are good for throughput. When doing
> discard, though, tail latencies fall outside the timeout tolerance at
> the same queue depth.

Yup, because most SSDs have really shit discard implementations -
nobody who "reviews" SSDs looks at the performance aspect of discard,
so it doesn't get publicly compared against other drives like
read/write IO performance does. IOWs, discard doesn't sell devices, so
it never gets fixed or optimised.

Hardware quirks should be dealt with by the kernel, not userspace.

Cheers,

Dave.
On Fri, Oct 27, 2017 at 09:24:50AM +1100, Dave Chinner wrote:
> And that's just broken when it comes to requests that might take
> several seconds to run. This is a problem the kernel needs to fix -
> it's not something we should be working around in userspace.

Yeah, I agree that'd be a better thing to do, and achieve the desired
impact for every user. We currently don't have an indicator the kernel
can react to for this scenario, but I'll reinvestigate from the driver
and kernel side for more options.
diff --git a/include/linux.h b/include/linux.h
index 6ce344c5..702aee0c 100644
--- a/include/linux.h
+++ b/include/linux.h
@@ -132,10 +132,17 @@ static __inline__ void platform_uuid_copy(uuid_t *dst, uuid_t *src)
 static __inline__ int
 platform_discard_blocks(int fd, uint64_t start, uint64_t len)
 {
-	uint64_t range[2] = { start, len };
+	uint64_t end = start + len;
+	uint64_t size = 128ULL * 1024ULL * 1024ULL * 1024ULL;
 
-	if (ioctl(fd, BLKDISCARD, &range) < 0)
-		return errno;
+	for (; start < end; start += size) {
+		uint64_t range[2] = { start, MIN(len, size) };
+
+		len -= range[1];
+		if (ioctl(fd, BLKDISCARD, &range) < 0)
+			return errno;
+	}
 	return 0;
 }
Running mkfs.xfs was discarding the entire capacity in a single range.
The block layer would split these into potentially many smaller requests
and dispatch all of them to the device at roughly the same time.

SSD capacities are getting so large that full capacity discards will
take some time to complete. When discards are deeply queued, the block
layer may trigger timeout handling and IO failure, though the device is
operating normally.

This patch uses smaller discard ranges in a loop for mkfs to avoid
risking such timeouts. The max discard range is arbitrarily set to
128GB in this patch.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 include/linux.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)