Message ID: 20181105181021.8174-1-bfoster@redhat.com (mailing list archive)
State: Deferred, archived
Series: xfs: defer online discard submission to a workqueue
On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> When online discard is enabled, discards of busy extents are
> submitted asynchronously as a bio chain. bio completion and
> resulting busy extent cleanup is deferred to a workqueue. Async
> discard submission is intended to avoid blocking log forces on a
> full discard sequence which can take a noticeable amount of time in
> some cases.
>
> We've had reports of this still producing log force stalls with XFS
> on VDO,

Please fix this in VDO instead. We should not work around out of
tree code making stupid decisions.
On 11/5/18 3:51 PM, Christoph Hellwig wrote:
> On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
>> When online discard is enabled, discards of busy extents are
>> submitted asynchronously as a bio chain. [...]
>>
>> We've had reports of this still producing log force stalls with XFS
>> on VDO,
>
> Please fix this in VDO instead. We should not work around out of
> tree code making stupid decisions.

Is there some downside to Brian's proposal, in principle? It seems like
it would be an improvement for a device that might cause a discard
bottleneck like this.

Thanks,
-Eric
On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> > When online discard is enabled, discards of busy extents are
> > submitted asynchronously as a bio chain. [...]
> >
> > We've had reports of this still producing log force stalls with XFS
> > on VDO,
>
> Please fix this in VDO instead. We should not work around out of
> tree code making stupid decisions.

I assume the "stupid decision" refers to sync discard execution. I'm not
familiar with the internals of VDO, this is just what I was told. My
understanding is that these discards can stack up and take enough time
that a limit on outstanding discards is required, which now that I think
of it makes me somewhat skeptical of the whole serial execution thing.
Hitting that outstanding discard request limit is what bubbles up the
stack and affects XFS by holding up log forces, since new discard
submissions are presumably blocked on completion of the oldest
outstanding request.

I'm not quite sure what happens in the block layer if that limit were
lifted. Perhaps it assumes throttling responsibility directly via
queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
somewhere (via memory allocation pressure?) anyways, so ISTM that some
kind of throttling is inevitable in this situation. What am I missing?

Brian
On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> > Please fix this in VDO instead. We should not work around out of
> > tree code making stupid decisions.
>
> I assume the "stupid decision" refers to sync discard execution. I'm not
> familiar with the internals of VDO, this is just what I was told. My
> understanding is that these discards can stack up and take enough time
> that a limit on outstanding discards is required, which now that I think
> of it makes me somewhat skeptical of the whole serial execution thing.
> Hitting that outstanding discard request limit is what bubbles up the
> stack and affects XFS by holding up log forces, since new discard
> submissions are presumably blocked on completion of the oldest
> outstanding request.

Exactly.

> I'm not quite sure what happens in the block layer if that limit were
> lifted. Perhaps it assumes throttling responsibility directly via
> queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> somewhere (via memory allocation pressure?) anyways, so ISTM that some
> kind of throttling is inevitable in this situation. What am I missing?

We still need to throttle discards - they have to play nice with all
the other IO we need to dispatch concurrently.

I have two issues with the proposed patch:

1. it puts both discard dispatch and completion processing on the
one work queue, so if the queue is filled with dispatch requests,
IO completion queuing gets blocked. That's not the best thing to be
doing.

2. log forces no longer wait for discards to be dispatched - they
just queue them. This means the mechanism xfs_extent_busy_flush()
uses to dispatch pending discards (synchronous log force) can return
before discards have even been dispatched to disk. Hence we can
expect to see longer wait and tail latencies when busy extents are
encountered by the allocator. Whether this is a problem or not needs
further investigation.

Cheers,

Dave.
On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> > I'm not quite sure what happens in the block layer if that limit were
> > lifted. Perhaps it assumes throttling responsibility directly via
> > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> > somewhere (via memory allocation pressure?) anyways, so ISTM that some
> > kind of throttling is inevitable in this situation. What am I missing?
>
> We still need to throttle discards - they have to play nice with all
> the other IO we need to dispatch concurrently.
>
> I have two issues with the proposed patch:
>
> 1. it puts both discard dispatch and completion processing on the
> one work queue, so if the queue is filled with dispatch requests,
> IO completion queuing gets blocked. That's not the best thing to be
> doing.

This is an unbound workqueue with max_active == 0. AIUI, that means we
can have something like 256 execution contexts (worker threads?) per
cpu. Given that, plus the batching that occurs in XFS due to delayed
logging and discard bio chaining, that seems rather unlikely. Unless I'm
misunderstanding the mechanism, I think that means filling the queue as
such and blocking discard submission basically consumes one of those
contexts.

Of course, the CIL context structure appears to be technically unbound
as well and it's trivial to just add a separate submission workqueue,
but I'd like to at least make sure we're on the same page as to the need
(so it can be documented clearly as well).

> 2. log forces no longer wait for discards to be dispatched - they
> just queue them. This means the mechanism xfs_extent_busy_flush()
> uses to dispatch pending discards (synchronous log force) can return
> before discards have even been dispatched to disk. Hence we can
> expect to see longer wait and tail latencies when busy extents are
> encountered by the allocator. Whether this is a problem or not needs
> further investigation.

Firstly, I think latency is kind of moot in the problematic case. The
queue is already drowning in requests that presumably are going to take
minutes to complete. In that case, the overhead of kicking a workqueue
to do the submission is probably negligible.

That leaves the normal/active case. I suspect that the overhead here may
still be hard to notice given the broader overhead of online discard in
the first place. The aforementioned batching means we could be sending a
good number of discards all at once that ultimately need to be processed
(with no priority to the particular one an allocator could be waiting
on). The larger the chain, the longer the whole sequence is going to
take to clear a particular busy extent relative to the additional
latency of a workqueue. We could also just be sending one discard,
though waiting on that one particular extent could mean that we're
either not being as smart as we could be in extent selection or that
we're simply running low on free space.

I suppose if this approach is ultimately problematic for whatever
reason, we could look into some kind of heuristic to track outstanding
discards/chains and continue to do direct submission until we cross some
magic threshold. I'm not sure it's worth that complexity given the
limited use cases for online discard unless there's some serious
unforeseen problem, though. IIRC, this was all doing synchronous,
serialized submission not too long ago after all.

Anyways, I'll run some file creation/deletion tests against an SSD
and/or a thin vol and see what falls out..

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote:
> On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> > I have two issues with the proposed patch:
> >
> > 1. it puts both discard dispatch and completion processing on the
> > one work queue, so if the queue is filled with dispatch requests,
> > IO completion queuing gets blocked. That's not the best thing to be
> > doing.
>
> This is an unbound workqueue with max_active == 0. AIUI, that means we
> can have something like 256 execution contexts (worker threads?) per
> cpu.

.....
	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
};

/* unbound wq's aren't per-cpu, scale max_active according to #cpus */
#define WQ_UNBOUND_MAX_ACTIVE	\
	max_t(int, WQ_MAX_ACTIVE, num_possible_cpus() * WQ_MAX_UNBOUND_PER_CPU)

IOWs, unbound queues are not per-cpu and they are execution limited
to max(512, NR_CPUS * 4) kworker threads. The default (for
max_active = 0), however, is still WQ_DFL_ACTIVE, so the total number
of active workers for the unbound xfs_discard_wq is 256.

> Given that, plus the batching that occurs in XFS due to delayed
> logging and discard bio chaining, that seems rather unlikely. Unless I'm
> misunderstanding the mechanism, I think that means filling the queue as
> such and blocking discard submission basically consumes one of those
> contexts.

Yes, but consider the situation where we've got a slow discard
device and we're removing a file with millions of extents. We've got
to issue millions of discard ops in this case. Because dispatch
queueing is not bound, we're easily going to overflow the discard
workqueue because the freeing transactions will run far ahead of the
discard operations. Sooner or later we consume all 256 discard_wq
worker threads with blocked discard submission. Then both log IO
completion and discard IO completion will block on workqueue
submission and we deadlock because discard completion can't run....

> Of course, the CIL context structure appears to be technically unbound

I'm missing something here - what bit of that structure is unbound?

> as well and it's trivial to just add a separate submission workqueue,
> but I'd like to at least make sure we're on the same page as to the need
> (so it can be documented clearly as well).

A separate submission queue doesn't really solve the log IO completion
blocking problem. Yes, it solves the discard completion deadlock,
but we eventually end up in the same place on sustained discard
workloads with submission queuing blocking on a full work queue.

Workqueues are not the way to solve unbound queue depth problems.
That's what kernel threads are for. e.g. this is the reason the
xfsaild is a kernel thread, not a work queue. The amount of
writeback work queued on the AIL can be hundreds of thousands of
objects, way more than a workqueue can handle. This discard problem
is no different - concurrent dispatch through kworker threads buys
us nothing - we just fill the request queue from hundreds of threads
instead of filling it from just one.

The workqueue approach has other problems, too, like dispatch across
worker threads means discard is not FIFO scheduled - it's completely
random as to the order in which discards get fed to the device
request queue. Hence discards can be starved because whenever the
worker thread runs to process its queue it finds the device request
queue already full and blocks again.
Having a single kernel thread that walks the discard queue on each
context and then each context in sequence order gives us FIFO
dispatch of discard requests. It would block only on full request
queues, giving us a much more predictable log-force-to-completion
latency. It allows for the possibility of merging discards across
multiple CIL contexts, to directly control the rate of discard, and
to skip small discards or even -turn off discard- when the backlog
gets too great.

The workqueue approach just doesn't allow anything like this to be
done because every discard context is kept separate from every other
context and there is no coordination at all between them.

> > 2. log forces no longer wait for discards to be dispatched - they
> > just queue them. This means the mechanism xfs_extent_busy_flush()
> > uses to dispatch pending discards (synchronous log force) can return
> > before discards have even been dispatched to disk. Hence we can
> > expect to see longer wait and tail latencies when busy extents are
> > encountered by the allocator. Whether this is a problem or not needs
> > further investigation.
>
> Firstly, I think latency is kind of moot in the problematic case. The
> queue is already drowning in requests that presumably are going to take
> minutes to complete. In that case, the overhead of kicking a workqueue
> to do the submission is probably negligible.

Yes, the overhead of kicking the queue is negligible. That's not the
problem though. By queuing discards rather than submitting them we
go from a single FIFO dispatch model (by in-order iclog IO
completion) to a concurrent, uncoordinated dispatch model. It's the
loss of FIFO behaviour because the synchronous log force no longer
controls dispatch order that leads to unpredictable and long tail
latencies in dispatch completion, hence causing the problems for the
processes now waiting on specific extent discard completion rather
than just the log force.

In some cases they'll get woken faster (don't have to wait for
discards to be dispatched), but it is equally likely they'll have to
wait for much, much longer. In essence, the async dispatch by
workqueue model removes all assumptions we've made about the
predictability of discard completion latency. FIFO is predictable;
concurrent async dispatch by workqueue is completely unpredictable.

If we really want to do fully async dispatch of discards, I think we
need to use a controllable "single dispatch by kernel thread" model
like the AIL, not use workqueues and spray the dispatch in an
uncoordinated, uncontrollable manner across hundreds of kernel
threads....

Cheers,

Dave.
On Thu, Nov 08, 2018 at 11:38:06AM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote:
> > This is an unbound workqueue with max_active == 0. AIUI, that means we
> > can have something like 256 execution contexts (worker threads?) per
> > cpu.
>
> .....
> 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
> 	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
> 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
> };
>
> /* unbound wq's aren't per-cpu, scale max_active according to #cpus */
> #define WQ_UNBOUND_MAX_ACTIVE	\
> 	max_t(int, WQ_MAX_ACTIVE, num_possible_cpus() * WQ_MAX_UNBOUND_PER_CPU)
>
> IOWs, unbound queues are not per-cpu and they are execution limited
> to max(512, NR_CPUS * 4) kworker threads.
>
> The default (for max_active = 0), however, is still WQ_DFL_ACTIVE so
> the total number of active workers for the unbound xfs_discard_wq
> is 256.

Sounds about right..

> > Given that, plus the batching that occurs in XFS due to delayed
> > logging and discard bio chaining, that seems rather unlikely. Unless I'm
> > misunderstanding the mechanism, I think that means filling the queue as
> > such and blocking discard submission basically consumes one of those
> > contexts.
>
> Yes, but consider the situation where we've got a slow discard
> device and we're removing a file with millions of extents. We've got
> to issue millions of discard ops in this case. [...] Sooner or later
> we consume all 256 discard_wq worker threads with blocked discard
> submission. Then both log IO completion and discard IO completion
> will block on workqueue submission and we deadlock because discard
> completion can't run....
>
> > Of course, the CIL context structure appears to be technically unbound
>
> I'm missing something here - what bit of that structure is unbound?

I mean to say that the ctx is not a fixed/limited resource in the
context of a discard submission workqueue. There's no throttling, so we
can essentially discard as many busy extent lists as the user can
generate (this is essentially the same point as your million extent file
example above).

> > as well and it's trivial to just add a separate submission workqueue,
> > but I'd like to at least make sure we're on the same page as to the need
> > (so it can be documented clearly as well).
>
> A separate submission queue doesn't really solve the log IO completion
> blocking problem. Yes, it solves the discard completion deadlock,
> but we eventually end up in the same place on sustained discard
> workloads with submission queuing blocking on a full work queue.

Well, for the purposes of this patch it's fine that we still block on
discard submission. The point is just that we do so in a separate
context from log I/O completion and thus avoid associated log force
stalls.

> Workqueues are not the way to solve unbound queue depth problems.
> That's what kernel threads are for. e.g. this is the reason the
> xfsaild is a kernel thread, not a work queue. The amount of
> writeback work queued on the AIL can be hundreds of thousands of
> objects, way more than a workqueue can handle. This discard problem
> is no different - concurrent dispatch through kworker threads buys
> us nothing - we just fill the request queue from hundreds of threads
> instead of filling it from just one.

Fair, but the point wasn't really to buy any kind of high level discard
submission improvement over the current scheme. The goal was to avoid
log force stalls on a time consuming operation that has no explicit
serialization requirement with respect to log force completion...

> The workqueue approach has other problems, too, like dispatch across
> worker threads means discard is not FIFO scheduled - it's completely
> random as to the order in which discards get fed to the device
> request queue. Hence discards can be starved because whenever the
> worker thread runs to process its queue it finds the device request
> queue already full and blocks again.

... but Ok. I can see that the worst case of creating and blocking
hundreds of worker threads on discard submission kind of creates a mess
even without considering the resulting effect on the discard operations
themselves. There's no real point or need for all of those threads.

> Having a single kernel thread that walks the discard queue on each
> context and then each context in sequence order gives us FIFO
> dispatch of discard requests. [...]
>
> The workqueue approach just doesn't allow anything like this to be
> done because every discard context is kept separate from every other
> context and there is no coordination at all between them.

The high level kthread approach sounds like a nice idea to me. I'm not
sure that it's technically necessary to address this specific problem
though. I.e., it would be nice if we could still separate enhancement
from bug fix.

> > > 2. log forces no longer wait for discards to be dispatched - they
> > > just queue them. [...] Hence we can expect to see longer wait and
> > > tail latencies when busy extents are encountered by the allocator.
> > > Whether this is a problem or not needs further investigation.
> >
> > Firstly, I think latency is kind of moot in the problematic case. The
> > queue is already drowning in requests that presumably are going to take
> > minutes to complete. In that case, the overhead of kicking a workqueue
> > to do the submission is probably negligible.
>
> Yes, the overhead of kicking the queue is negligible. That's not the
> problem though. By queuing discards rather than submitting them we
> go from a single FIFO dispatch model (by in-order iclog IO
> completion) to a concurrent, uncoordinated dispatch model. It's the
> loss of FIFO behaviour because the synchronous log force no longer
> controls dispatch order that leads to unpredictable and long tail
> latencies in dispatch completion, hence causing the problems for the
> processes now waiting on specific extent discard completion rather
> than just the log force.

I think I see what you're getting at here: discard submission via log
completion means that submission is serialized (with respect to other
iclogs) and in log buf order, since that is enforced by the log callback
invoking code. Since the callbacks execute one at a time, blocking in
submission likely means the same, single log I/O completion execution
context ends up iterating over and processing the other iclogs with
pending callbacks once the current submit blockage clears. Yes?

If I'm following that correctly, couldn't we provide very similar
behavior by using a separate "ordered" workqueue for submission? The
ordered wq sets max_active to 1, which means we still have a single
submitter, yet we still create a queue that disconnects submit from log
force completion. I suppose we could also allocate a separate structure
from xfs_cil_ctx to track pending discard jobs and slim down memory
usage closer to something that might be required for a thread-based
implementation.

From there, we can still explore additional enhancements incrementally,
such as pushing some of the busy extent processing/sorting into the
workqueue, eventually converting the wq-based mechanism into a
thread-based one, etc. Thoughts?

Brian

> In some cases they'll get woken faster (don't have to wait for
> discards to be dispatched), but it is equally likely they'll have to
> wait for much, much longer. [...]
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Thu, Nov 08, 2018 at 08:50:02AM -0500, Brian Foster wrote: > On Thu, Nov 08, 2018 at 11:38:06AM +1100, Dave Chinner wrote: > > On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote: > > > On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote: > > Yes, but consider the situation where we've got a slow discard > > device and we're removing a file with millions of extents. We've got > > to issue millions of discard ops in this case. because dispatch > > queueing is not bound, we're easily going to overflow the discard > > workqueue because the freeing transactions will run far ahead of the > > discard operations. Sooner or later we consume all 256 discard_wq > > worker threads with blocked discard submission. Then both log IO > > completion and discard I ocompletion will block on workqueue > > submission and we deadlock because discard completion can't > > run.... > > > > > > > > Of course, the CIL context structure appears to be technically unbound > > > > I'm missing something here - What bit of that structure is unbound? > > > > I mean to say that the ctx is not a fixed/limited resource in the > context of a discard submission workqueue. There's no throttling, so we > can essentially discard as many busy extent lists as the user can > generate (this is essentially the same point as your million extent file > example above). It is bound. It's bound by log size and the number of checkpoints we can have uncompleted at any point in time. That can be a high bound on large logs, but it is limited and it is throttled - you can't create new contexts if there is no log space available, and hence new contexts are throttled on tail pushing. > > > as well and it's trivial to just add a separate submission workqueue, > > > but I'd like to at least make sure we're on the same page as to the need > > > (so it can be documented clearly as well). > > > > A separate submission queue doesn't really solve log Io completion > > blocking problem. 
Yes, it solves the discard completion deadlock, > > but we eventually end up in the same place on sustained discard > > workloads with submission queuing blocking on a full work queue. > > > > Well, for the purposes of this patch it's fine that we still block on > discard submission. The point is just that we do so in a separate > context from log I/O completion and thus avoid associated log force > stalls. And, in doing so, remove any bound we have on the number of outstanding uncompleted discards. You're essentially trading "slow things progressively as discard load goes up" with "nothing blocks until we hit a breakdown point and the system stops until the discard queue drains". We have a limited queue depth of discard operations right now, and even that is deep enough to cause problems. Removing that queue depth bound won't improve things - it will just allow the pending discard queue to run deeper and get further and further behind the extent free operations that are being committed. This will only make the stalls and problems completely unpredictable and uncontrollable, and poses a true breakdown possibility where allocation in every AG is stalled holding AGF locks waiting for a huge queue of discards to be worked through. We have to throttle at some point to prevent us from getting to these severe breakdown conditions. Moving discards out from under the log force removes the only feedback loop we have to throttle discards back to a manageable level. What we have is not perfect, but we can't just remove that feedback loop without providing something else to replace it. > > The workqueue approach just doesn't allow anything like this to be > > done because every discard context is kept separate from every other > > context and there is no coordination at all between them. > > > > The high level kthread approach sounds like a nice idea to me. I'm not > sure that it's technically necessary to address this specific problem > though. 
I.e., it would be nice if we could still separate enhancement > from bug fix. Just removing the discards from the log force context is just hiding a source of latency without addressing the cause of the latency. i.e. there's a huge amount of discard to be managed, and the only management we have is to constrain them to the rate at which we retire the extent free transactions. > > > > 2. log forces no longer wait for discards to be dispatched - they > > > > just queue them. This means the mechanism xfs_extent_busy_flush() > > > > uses to dispatch pending discards (synchrnous log force) can return > > > > before discards have even been dispatched to disk. Hence we can > > > > expect to see longer wait and tail latencies when busy extents are > > > > encountered by the allocator. Whether this is a problem or not needs > > > > further investigation. > > > > > > > > > > Firstly, I think latency is kind of moot in the problematic case. The > > > > queue is already drowning in requests that presumably are going to take > > > > minutes to complete. In that case, the overhead of kicking a workqueue > > > > to do the submission is probably negligible. > > > > > > Yes, the overhead of kicking the queue is negliable. That's not the > > > problem though. By queuing discards rather than submitting them we > > > go from a single FIFO dispatch model (by in-order iclog IO > > > completion) to a concurrent, uncoordinated dispatch model. It's the > > > loss of FIFO behaviour because the synchrnous log force no longer > > > controls dispatch order that leads to unpredictable and long tail > > > latencies in dispatch completion, hence causing the problems for the > > > proceeses now waiting on specific extent discard completion rather > > > than just the log force. > > > > > > > I think I see what you're getting at here: discard submission via log > > completion means that submission is serialized (with respect to other > > iclogs) and in log buf order, since that is enforced by the log callback > > invoking code.
Since the callbacks execute one at a time, blocking in > submission likely means the same, single log I/O completion execution > context ends up iterating over and processing the other iclogs with > pending callbacks once the current submit blockage clears. Yes? Essentially. > If I'm following that correctly, couldn't we provide very similar > behavior by using a separate "ordered" workqueue for submission? Workqueues are not bound depth queues - they are "concurrency managed" queues. An ordered workqueue is simply a FIFO with a single processing context. You can queue as much as you like to it, but it will only process it with a single thread. > The > ordered wq sets max_active to 1, which means we still have a single > submitter, We have a single /worker/ - we can submit as many independent works as we want to it and they just stack up on the queue. There is no throttling or feedback here, it's just a black hole. > From there, we can still explore additional enhancements incrementally, > such as pushing some of the busy extent processing/sorting into the > workqueue, eventually converting the wq-based mechanism into a > thread-based one, etc. Thoughts? As I always say: do it once, do it properly, or don't touch it at all. Cheers, Dave.
On Mon, Nov 05, 2018 at 04:20:24PM -0600, Eric Sandeen wrote: > Is there some downside to Brian's proposal, in principle? > It seems like it would be an improvement for device that might cause > a discard bottleneck like this. We create another bottleneck, we lose throttling, we don't flush the workqueue when needed, and probably a few more issues. Never mind that I don't see any point in optimizing for out of tree code that no one even bothers to upstream. All this VDO here and there crap from Red Hat in the last year is really more than enough.
On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote: > My > understanding is that these discards can stack up and take enough time > that a limit on outstanding discards is required, which now that I think > of it makes me somewhat skeptical of the whole serial execution thing. > Hitting that outstanding discard request limit is what bubbles up the > stack and affects XFS by holding up log forces, since new discard > submissions are presumably blocked on completion of the oldest > outstanding request. We don't do strict ordering of requests, but eventually requests waiting for completion will block others from being submitted. > I'm not quite sure what happens in the block layer if that limit were > lifted. Perhaps it assumes throttling responsibility directly via > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly > somewhere (via memory allocation pressure?) anyways, so ISTM that some > kind of throttling is inevitable in this situation. What am I missing? We'll still block new allocations waiting for these blocks and other bits. Or to put it another way - if your discard implementation is slow (independent of synchronous or not) you are going to be in a world of pain with online discard. That is why it's not the default to start with.
On Fri, Nov 09, 2018 at 11:20:52AM +1100, Dave Chinner wrote: > On Thu, Nov 08, 2018 at 08:50:02AM -0500, Brian Foster wrote: > > On Thu, Nov 08, 2018 at 11:38:06AM +1100, Dave Chinner wrote: > > > On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote: > > > > On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote: > > > Yes, but consider the situation where we've got a slow discard > > > device and we're removing a file with millions of extents. We've got > > > to issue millions of discard ops in this case. because dispatch > > > queueing is not bound, we're easily going to overflow the discard > > > workqueue because the freeing transactions will run far ahead of the > > > discard operations. Sooner or later we consume all 256 discard_wq > > > worker threads with blocked discard submission. Then both log IO > > > completion and discard I ocompletion will block on workqueue > > > submission and we deadlock because discard completion can't > > > run.... > > > > > > > > > > > Of course, the CIL context structure appears to be technically unbound > > > > > > I'm missing something here - What bit of that structure is unbound? > > > > > > > I mean to say that the ctx is not a fixed/limited resource in the > > context of a discard submission workqueue. There's no throttling, so we > > can essentially discard as many busy extent lists as the user can > > generate (this is essentially the same point as your million extent file > > example above). > > It is bound. It's bound by log size and the number of checkpoints we > can have uncompleted at any point in time. That can be a high bound > on large logs, but it is limited and it is throttled - you can't > create new contexts if there is no log space available, and hence > new contexts are throttled on tail pushing. > "... in the context of a discard submission workqueue." We're in violent agreement. 
You're describing the code as it is today (where it is bound) and I'm referring to prospective behavior with a separate submit queue (where it is made unbound by virtue of the wq). You explain the same problem below that I'm referring to above. > > > > as well and it's trivial to just add a separate submission workqueue, > > > > but I'd like to at least make sure we're on the same page as to the need > > > > (so it can be documented clearly as well). > > > > > > A separate submission queue doesn't really solve log Io completion > > > blocking problem. Yes, it solves the discard completion deadlock, > > > but we eventually end up in the same place on sustained discard > > > workloads with submission queuing blocking on a full work queue. > > > > > > > Well, for the purposes of this patch it's fine that we still block on > > discard submission. The point is just that we do so in a separate > > context from log I/O completion and thus avoid associated log force > > stalls. > > And, in doing so, remove any bound we have on the number of > outstanding uncompleted discards. You're essentially trading > "slow things progressively as discard load goes up" with "nothing > blocks until we hit a breakdown point and the system stops until the > discard queue drains". > > We have a limited queue depth of discard operations right now, and > even that is deep enough to cause problems. Removing that queue > depth bound won't improve things - it will just allow the pending > discard queue to run deeper and get further and further behind the > extent free operations that are being committed. This will only make > the stalls and problems completely unpredictable and uncontrollable, > and poses a true breakdown possibility where allocation in every AG > is stalled holding AGF locks waiting for a huge queue of discards to > be worked through. > > We have to throttle at some point to prevent us from getting to > these severe breakdown conditions. 
Moving discards out from under > the log force removes the only feedback loop we have to throttle > discards back to a manageable level. What we have is not perfect, > but we can't just remove that feedback loop without providing > something else to replace it. > So the fundamental question here is how to manage discard submission throttling if we disconnect submission from the implicit throttling provided by ordered log I/O completion. We may be able to loosen this up significantly from the current approach, but we still need to explore how much and provide _something_ that addresses severe breakdown conditions. Makes sense. > > > The workqueue approach just doesn't allow anything like this to be > > > done because every discard context is kept separate from every other > > > context and there is no coordination at all between them. > > > > > > > The high level kthread approach sounds like a nice idea to me. I'm not > > sure that it's technically necessary to address this specific problem > > though. I.e., it would be nice if we could still separate enhancement > > from bug fix. > > Just removing the discards from the log force context is just hiding > a source of latency without addressing the cause of the latency. > i.e. there's a huge amount of discard to be managed, and the only > management we have to constrain them to the rate at which we retire > the extent free transactions. > > > > > > 2. log forces no longer wait for discards to be dispatched - they > > > > > just queue them. This means the mechanism xfs_extent_busy_flush() > > > > > uses to dispatch pending discards (synchrnous log force) can return > > > > > before discards have even been dispatched to disk. Hence we can > > > > > expect to see longer wait and tail latencies when busy extents are > > > > > encountered by the allocator. Whether this is a problem or not needs > > > > > further investigation. > > > > > > > > > > > > > Firstly, I think latency is kind of moot in the problematic case. 
The > > > > queue is already drowning in requests that presumably are going to take > > > > minutes to complete. In that case, the overhead of kicking a workqueue > > > > to do the submission is probably negligible. > > > > > > Yes, the overhead of kicking the queue is negliable. That's not the > > > problem though. By queuing discards rather than submitting them we > > > go from a single FIFO dispatch model (by in-order iclog IO > > > completion) to a concurrent, uncoordinated dispatch model. It's the > > > loss of FIFO behaviour because the synchrnous log force no longer > > > controls dispatch order that leads to unpredictable and long tail > > > latencies in dispatch completion, hence causing the problems for the > > > proceeses now waiting on specific extent discard completion rather > > > than just the log force. > > > > > > > I think I see what you're getting at here: discard submission via log > > completion means that submission is serialized (with respect to other > > iclogs) and in log buf order, since that is enforced by the log callback > > invoking code. Since the callbacks execute one at a time, blocking in > > submission likely means the same, single log I/O completion execution > > context ends up iterating over and processing the other iclogs with > > pending callbacks once the current submit blockage clears. Yes? > > Essentially. > > > If I'm following that correctly, couldn't we provide very similar > > behavior by using a separate "ordered" workqueue for submission? > > Workqueues are not bound depth queues - they are "concurrency > managed" queues. And ordered workqueue is simply a FIFO with a > single processing context. You can queue as much as you like to it, > but it will only process it with a single thread. > > > The > > ordered wq sets max_active to 1, which means we still have a single > > submitter, > > We have a single /worker/ - we can submit as many independent works > as we want to it and they just stack up on the queue. 
There is no > throttling or feedback here, it's just a black hole. > Sure, but just using a kthread in its place by itself doesn't indicate anything to me about throttling. We can just as easily create an unthrottled thread-based model with an unbound set of busy extent lists, for example, or in the other extreme, use a workqueue to provide the exact same throttling model we have today (which wouldn't address the original problem). Throttling is a distinct element of design from whether we use a workqueue or kthread or whatever else. I think it's kind of fruitless to continue on about such implementation details without working out a higher level means of throttling... > > From there, we can still explore additional enhancements incrementally, > > such as pushing some of the busy extent processing/sorting into the > > workqueue, eventually converting the wq-based mechanism into a > > thread-based one, etc. Thoughts? > > As I always say: do it once, do it properly, or don't touch it at > all. > TBH, I'm having a hard time grokking what you consider a proper fix because we seem to be harping over less consequential implementation details like what low level mechanisms we can or can't use. I can think of ways to throttle either with different benefit and limitation tradeoffs, but I still have no real clarity on what kind of throttling you envision that demands a particular implementation. If you have a particular technique or approach of throttling in mind, could you please just describe it at a high level? IOW, I don't particularly care what mechanism we ultimately use. I'm just trying to simplify a prospective implementation if at all possible based on my limited understanding of the problem space. Exploring the problem space directly is far more helpful to me than doing so implicitly via discussions over why a particular mechanism may or may not be suitable. So to try to put this on a more productive path... 
my understanding is that deferring discard submissions outside of log completion introduces the following considerations: - unbound queue depth (i.e., outstanding busy extent tracking and memory consumption thereof) - long tail allocation latency caused by allocation requests for blocks that might be deep in the unbound queue (i.e., minutes away from availability) The first strikes me as of more of a memory consumption matter. It's not immediately clear to me whether this would be handled naturally via the preexisting busy extent struct allocations or we need to take further precautions. I think those structures are currently limited indirectly via being cleared in the context of a limited resource (i.e., the log). IOW, aggressive busy extent creators are eventually going to be throttled against log completion because transactions require log space and log buffers, etc. The second is the issue you've described where driving a hugely deep discard submission queue means allocators can get stuck on busy blocks deep in the queue with unpredictable submission+completion latency. Given the above, I'm wondering if we could do something like reuse the existing busy extent tree as a queue. Rather than submit busy extent lists directly, set a new "discard needed" flag on the busy extents in the tree on log I/O completion and carry on. Whenever discard needed extents are present, we'll have some kind of rotorized background submitter sending sorted discards in our preferred batch size. This background submitter is still free to block. In conjunction with that, we already have the ability for allocators to look up busy extents in the tree and trim, wait or reuse them. So how about allowing allocators to inject priority into the equation by either submitting isolated discards themselves or allowing them to manipulate the queue by reusing extents that are only busy due to pending discard submissions (i.e., the discard has not yet been submitted)? 
I think direct submission provides some level of predictable worst case latency on the allocation side because if the bg submitter is blocked, the allocator submission requests should have next priority over the remaining set of pending discard submissions instead of being blocked on not only the discard queue, but wherever the XFS submission queue happens to hold that particular extent. IOW, allocators are bound by the underlying discard processing engine as they are today, just not indirectly via the log subsystem. Beyond that, we may be able to actually provide additional benefits by allowing allocators to optimize away more discards for extents that are only still sitting in the busy tree for discard submission (i.e., the associated extent free transaction has committed to the log). IOW, there's potential for a larger window for extent reuse optimization, but I still need to study the busy extent code a bit more to wrap my head around that. Finally, I think that still leaves us open to creating a huge set of outstanding submissions depending on how long discards take to complete, but as you mentioned earlier, there are probably a variety of ways we can manage that problem. E.g., start dropping discards of a particular size when/if we're over a particular threshold, perhaps start throttling busy extent creators by requiring some form of ticket/reservation into the busy extent tree on transaction reservation, etc. This may be a secondary problem depending on how effective the latency mitigation throttling is by itself. Thoughts on any of that? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Nov 09, 2018 at 07:06:10AM -0800, Christoph Hellwig wrote: > On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote: > > My > > understanding is that these discards can stack up and take enough time > > that a limit on outstanding discards is required, which now that I think > > of it makes me somewhat skeptical of the whole serial execution thing. > > Hitting that outstanding discard request limit is what bubbles up the > > stack and affects XFS by holding up log forces, since new discard > > submissions are presumably blocked on completion of the oldest > > outstanding request. > > We don't do strict ordering or request, but eventually requests > waiting for completion will block others from being submitted. > Ok, that's kind of what I expected. > > I'm not quite sure what happens in the block layer if that limit were > > lifted. Perhaps it assumes throttling responsibility directly via > > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly > > somewhere (via memory allocation pressure?) anyways, so ISTM that some > > kind of throttling is inevitable in this situation. What am I missing? > > We'll still block new allocations waiting for these blocks and > other bits. Or to put it another way - if your discard implementation > is slow (independent of synchronous or not) your are going to be in > a world of pain with online discard. That is what it's not default > to start with. Sure, it's not really the XFS bits I was asking about here. This is certainly not a high priority and not a common use case. We're working through some of the other issues in the other sub-thread. In particular, I'm wondering if we can provide broader improvements to the overall mechanism to reduce some of that pain. Brian
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index d3884e08b43c..087e99e80edd 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -529,9 +529,11 @@ xlog_discard_endio( static void xlog_discard_busy_extents( - struct xfs_mount *mp, - struct xfs_cil_ctx *ctx) + struct work_struct *work) { + struct xfs_cil_ctx *ctx = container_of(work, struct xfs_cil_ctx, + discard_work); + struct xfs_mount *mp = ctx->cil->xc_log->l_mp; struct list_head *list = &ctx->busy_extents; struct xfs_extent_busy *busyp; struct bio *bio = NULL; @@ -603,9 +605,10 @@ xlog_cil_committed( xlog_cil_free_logvec(ctx->lv_chain); - if (!list_empty(&ctx->busy_extents)) - xlog_discard_busy_extents(mp, ctx); - else + if (!list_empty(&ctx->busy_extents)) { + INIT_WORK(&ctx->discard_work, xlog_discard_busy_extents); + queue_work(xfs_discard_wq, &ctx->discard_work); + } else kmem_free(ctx); } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index b5f82cb36202..085bebd51135 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -245,6 +245,7 @@ struct xfs_cil_ctx { struct xfs_log_vec *lv_chain; /* logvecs being pushed */ struct xfs_log_callback log_cb; /* completion callback hook. */ struct list_head committing; /* ctx committing list */ + struct work_struct discard_work; struct work_struct discard_endio_work; };
When online discard is enabled, discards of busy extents are submitted asynchronously as a bio chain. bio completion and resulting busy extent cleanup is deferred to a workqueue. Async discard submission is intended to avoid blocking log forces on a full discard sequence which can take a noticeable amount of time in some cases. We've had reports of this still producing log force stalls with XFS on VDO, which executes discards synchronously and relies on online discard in XFS. In particular, there appears to be a limit on how many discards can execute at one time. When this limit is hit, discard submission blocks and affects XFS up through log completion. There is no need for XFS to ever block in this path as busy extents are cleared on discard completion. Since we already have a workqueue for discard bio completion, reuse that workqueue for discard submission and completely separate online discard processing from iclog completion processing. With this change, the only path that should ever block on discard completion is the allocation path if it determines a need to wait on busy extents. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/xfs_log_cil.c | 13 ++++++++----- fs/xfs/xfs_log_priv.h | 1 + 2 files changed, 9 insertions(+), 5 deletions(-)