xfs: defer online discard submission to a workqueue
diff mbox series

Message ID 20181105181021.8174-1-bfoster@redhat.com
State New
Headers show
Series
  • xfs: defer online discard submission to a workqueue
Related show

Commit Message

Brian Foster Nov. 5, 2018, 6:10 p.m. UTC
When online discard is enabled, discards of busy extents are
submitted asynchronously as a bio chain. bio completion and
resulting busy extent cleanup is deferred to a workqueue. Async
discard submission is intended to avoid blocking log forces on a
full discard sequence which can take a noticeable amount of time in
some cases.

We've had reports of this still producing log force stalls with XFS
on VDO, which executes discards synchronously and relies on online
discard in XFS. In particular, there appears to be a limit on how
many discards can execute at one time. When this limit is hit,
discard submission blocks and affects XFS up through log completion.

There is no need for XFS to ever block in this path as busy extents
are cleared on discard completion. Since we already have a workqueue
for discard bio completion, reuse that workqueue for discard
submission and completely separate online discard processing from
iclog completion processing. With this change, the only path that
should ever block on discard completion is the allocation path if it
determines a need to wait on busy extents.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 13 ++++++++-----
 fs/xfs/xfs_log_priv.h |  1 +
 2 files changed, 9 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig Nov. 5, 2018, 9:51 p.m. UTC | #1
On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> When online discard is enabled, discards of busy extents are
> submitted asynchronously as a bio chain. bio completion and
> resulting busy extent cleanup is deferred to a workqueue. Async
> discard submission is intended to avoid blocking log forces on a
> full discard sequence which can take a noticeable amount of time in
> some cases.
> 
> We've had reports of this still producing log force stalls with XFS
> on VDO,

Please fix this in VDO instead.  We should not work around out of
tree code making stupid decisions.
Eric Sandeen Nov. 5, 2018, 10:20 p.m. UTC | #2
On 11/5/18 3:51 PM, Christoph Hellwig wrote:
> On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
>> When online discard is enabled, discards of busy extents are
>> submitted asynchronously as a bio chain. bio completion and
>> resulting busy extent cleanup is deferred to a workqueue. Async
>> discard submission is intended to avoid blocking log forces on a
>> full discard sequence which can take a noticeable amount of time in
>> some cases.
>>
>> We've had reports of this still producing log force stalls with XFS
>> on VDO,
> 
> Please fix this in VDO instead.  We should not work around out of
> tree code making stupid decisions.

Is there some downside to Brian's proposal, in principle?
It seems like it would be an improvement for device that might cause
a discard bottleneck like this.

Thanks,
-Eric
Brian Foster Nov. 6, 2018, 2:23 p.m. UTC | #3
On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> > When online discard is enabled, discards of busy extents are
> > submitted asynchronously as a bio chain. bio completion and
> > resulting busy extent cleanup is deferred to a workqueue. Async
> > discard submission is intended to avoid blocking log forces on a
> > full discard sequence which can take a noticeable amount of time in
> > some cases.
> > 
> > We've had reports of this still producing log force stalls with XFS
> > on VDO,
> 
> Please fix this in VDO instead.  We should not work around out of
> tree code making stupid decisions.

I assume the "stupid decision" refers to sync discard execution. I'm not
familiar with the internals of VDO, this is just what I was told. My
understanding is that these discards can stack up and take enough time
that a limit on outstanding discards is required, which now that I think
of it makes me somewhat skeptical of the whole serial execution thing.
Hitting that outstanding discard request limit is what bubbles up the
stack and affects XFS by holding up log forces, since new discard
submissions are presumably blocked on completion of the oldest
outstanding request.

I'm not quite sure what happens in the block layer if that limit were
lifted. Perhaps it assumes throttling responsibility directly via
queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
somewhere (via memory allocation pressure?) anyways, so ISTM that some
kind of throttling is inevitable in this situation. What am I missing?

Brian
Dave Chinner Nov. 6, 2018, 9:18 p.m. UTC | #4
On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> > On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> > > When online discard is enabled, discards of busy extents are
> > > submitted asynchronously as a bio chain. bio completion and
> > > resulting busy extent cleanup is deferred to a workqueue. Async
> > > discard submission is intended to avoid blocking log forces on a
> > > full discard sequence which can take a noticeable amount of time in
> > > some cases.
> > > 
> > > We've had reports of this still producing log force stalls with XFS
> > > on VDO,
> > 
> > Please fix this in VDO instead.  We should not work around out of
> > tree code making stupid decisions.
> 
> I assume the "stupid decision" refers to sync discard execution. I'm not
> familiar with the internals of VDO, this is just what I was told.

IMO, what VDO does is irrelevant - any call to submit_bio() can
block if the request queue is full. Hence if we've drowned the queue
in discards and the device is slow at discards, then we are going to
block submitting discards.

> My
> understanding is that these discards can stack up and take enough time
> that a limit on outstanding discards is required, which now that I think
> of it makes me somewhat skeptical of the whole serial execution thing.
> Hitting that outstanding discard request limit is what bubbles up the
> stack and affects XFS by holding up log forces, since new discard
> submissions are presumably blocked on completion of the oldest
> outstanding request.

Exactly.

> I'm not quite sure what happens in the block layer if that limit were
> lifted. Perhaps it assumes throttling responsibility directly via
> queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> somewhere (via memory allocation pressure?) anyways, so ISTM that some
> kind of throttling is inevitable in this situation. What am I missing?

We still need to throttle discards - they have to play nice with all
the other IO we need to dispatch concurrently.

I have two issues with the proposed patch:

1. it puts both discard dispatch and completion processing on the
one work qeueue, so if the queue is filled with dispatch requests,
IO completion queuing gets blocked. That's not the best thing to be
doing.

2. log forces no longer wait for discards to be dispatched - they
just queue them. This means the mechanism xfs_extent_busy_flush()
uses to dispatch pending discards (synchrnous log force) can return
before discards have even been dispatched to disk. Hence we can
expect to see longer wait and tail latencies when busy extents are
encountered by the allocator. Whether this is a problem or not needs
further investigation.

Cheers,

Dave.
Brian Foster Nov. 7, 2018, 1:42 p.m. UTC | #5
On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> > On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> > > On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> > > > When online discard is enabled, discards of busy extents are
> > > > submitted asynchronously as a bio chain. bio completion and
> > > > resulting busy extent cleanup is deferred to a workqueue. Async
> > > > discard submission is intended to avoid blocking log forces on a
> > > > full discard sequence which can take a noticeable amount of time in
> > > > some cases.
> > > > 
> > > > We've had reports of this still producing log force stalls with XFS
> > > > on VDO,
> > > 
> > > Please fix this in VDO instead.  We should not work around out of
> > > tree code making stupid decisions.
> > 
> > I assume the "stupid decision" refers to sync discard execution. I'm not
> > familiar with the internals of VDO, this is just what I was told.
> 
> IMO, what VDO does is irrelevant - any call to submit_bio() can
> block if the request queue is full. Hence if we've drowned the queue
> in discards and the device is slow at discards, then we are going to
> block submitting discards.
> 
> > My
> > understanding is that these discards can stack up and take enough time
> > that a limit on outstanding discards is required, which now that I think
> > of it makes me somewhat skeptical of the whole serial execution thing.
> > Hitting that outstanding discard request limit is what bubbles up the
> > stack and affects XFS by holding up log forces, since new discard
> > submissions are presumably blocked on completion of the oldest
> > outstanding request.
> 
> Exactly.
> 
> > I'm not quite sure what happens in the block layer if that limit were
> > lifted. Perhaps it assumes throttling responsibility directly via
> > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> > somewhere (via memory allocation pressure?) anyways, so ISTM that some
> > kind of throttling is inevitable in this situation. What am I missing?
> 
> We still need to throttle discards - they have to play nice with all
> the other IO we need to dispatch concurrently.
> 
> I have two issues with the proposed patch:
> 
> 1. it puts both discard dispatch and completion processing on the
> one work qeueue, so if the queue is filled with dispatch requests,
> IO completion queuing gets blocked. That's not the best thing to be
> doing.
> 

This is an unbound workqueue with max_active == 0. AIUI, that means we
can have something like 256 execution contexts (worker threads?) per
cpu. Given that, plus the batching that occurs in XFS due to delayed
logging and discard bio chaining, that seems rather unlikely. Unless I'm
misunderstanding the mechanism, I think that means filling the queue as
such and blocking discard submission basically consumes one of those
contexts.

Of course, the CIL context structure appears to be technically unbound
as well and it's trivial to just add a separate submission workqueue,
but I'd like to at least make sure we're on the same page as to the need
(so it can be documented clearly as well).

> 2. log forces no longer wait for discards to be dispatched - they
> just queue them. This means the mechanism xfs_extent_busy_flush()
> uses to dispatch pending discards (synchrnous log force) can return
> before discards have even been dispatched to disk. Hence we can
> expect to see longer wait and tail latencies when busy extents are
> encountered by the allocator. Whether this is a problem or not needs
> further investigation.
> 

Firstly, I think latency is kind of moot in the problematic case. The
queue is already drowning in requests that presumably are going to take
minutes to complete. In that case, the overhead of kicking a workqueue
to do the submission is probably negligible.

That leaves the normal/active case. I suspect that the overhead here may
still be hard to notice given the broader overhead of online discard in
the first place. The aforementioned batching means we could be sending a
good number of discards all at once that ultimately need to be processed
(with no priority to the particular one an allocator could be waiting
on). The larger the chain, the longer the whole sequence is going to
take to clear a particular busy extent relative to the additional
latency of a workqueue. We could also just be sending one discard,
though waiting on that one particular extent could mean that we're
either not being as smart as we could be in extent selection or that
we're simply running low on free space.

I suppose if this approach is ultimately problemic for whatever reason,
we could look into some kind of heuristic to track outstanding
discards/chains and continue to do direct submission until we cross some
magic threshold. I'm not sure it's worth that complexity given the
limited use cases for online discard unless there's some serious
unforeseen problem, though. IIRC, this was all doing synchronous,
serialized submission not too long ago after all.

Anyways, I'll run some file creation/deletion tests against an SSD
and/or a thin vol and see what falls out..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 8, 2018, 12:38 a.m. UTC | #6
On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote:
> On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> > On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> > > On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> > > > On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> > > > > When online discard is enabled, discards of busy extents are
> > > > > submitted asynchronously as a bio chain. bio completion and
> > > > > resulting busy extent cleanup is deferred to a workqueue. Async
> > > > > discard submission is intended to avoid blocking log forces on a
> > > > > full discard sequence which can take a noticeable amount of time in
> > > > > some cases.
> > > > > 
> > > > > We've had reports of this still producing log force stalls with XFS
> > > > > on VDO,
> > > > 
> > > > Please fix this in VDO instead.  We should not work around out of
> > > > tree code making stupid decisions.
> > > 
> > > I assume the "stupid decision" refers to sync discard execution. I'm not
> > > familiar with the internals of VDO, this is just what I was told.
> > 
> > IMO, what VDO does is irrelevant - any call to submit_bio() can
> > block if the request queue is full. Hence if we've drowned the queue
> > in discards and the device is slow at discards, then we are going to
> > block submitting discards.
> > 
> > > My
> > > understanding is that these discards can stack up and take enough time
> > > that a limit on outstanding discards is required, which now that I think
> > > of it makes me somewhat skeptical of the whole serial execution thing.
> > > Hitting that outstanding discard request limit is what bubbles up the
> > > stack and affects XFS by holding up log forces, since new discard
> > > submissions are presumably blocked on completion of the oldest
> > > outstanding request.
> > 
> > Exactly.
> > 
> > > I'm not quite sure what happens in the block layer if that limit were
> > > lifted. Perhaps it assumes throttling responsibility directly via
> > > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> > > somewhere (via memory allocation pressure?) anyways, so ISTM that some
> > > kind of throttling is inevitable in this situation. What am I missing?
> > 
> > We still need to throttle discards - they have to play nice with all
> > the other IO we need to dispatch concurrently.
> > 
> > I have two issues with the proposed patch:
> > 
> > 1. it puts both discard dispatch and completion processing on the
> > one work qeueue, so if the queue is filled with dispatch requests,
> > IO completion queuing gets blocked. That's not the best thing to be
> > doing.
> > 
> 
> This is an unbound workqueue with max_active == 0. AIUI, that means we
> can have something like 256 execution contexts (worker threads?) per
> cpu.

.....
        WQ_MAX_ACTIVE           = 512,    /* I like 512, better ideas? */
        WQ_MAX_UNBOUND_PER_CPU  = 4,      /* 4 * #cpus for unbound wq */
        WQ_DFL_ACTIVE           = WQ_MAX_ACTIVE / 2,
};

/* unbound wq's aren't per-cpu, scale max_active according to #cpus */
#define WQ_UNBOUND_MAX_ACTIVE   \
        max_t(int, WQ_MAX_ACTIVE, num_possible_cpus() * WQ_MAX_UNBOUND_PER_CPU)


IOWs, unbound queues are not per-cpu and they are execution limited
to max(512, NR_CPUS * 4) kworker threads.

The default (for max_active = 0), however, is still WQ_DFL_ACTIVE so
the total number of active workers for the unbound xfs_discard_wq
is 256.


> Given that, plus the batching that occurs in XFS due to delayed
> logging and discard bio chaining, that seems rather unlikely. Unless I'm
> misunderstanding the mechanism, I think that means filling the queue as
> such and blocking discard submission basically consumes one of those
> contexts.

Yes, but consider the situation where we've got a slow discard
device and we're removing a file with millions of extents. We've got
to issue millions of discard ops in this case. because dispatch
queueing is not bound, we're easily going to overflow the discard
workqueue because the freeing transactions will run far ahead of the
discard operations. Sooner or later we consume all 256 discard_wq
worker threads with blocked discard submission.  Then both log IO
completion and discard I ocompletion will block on workqueue
submission and we deadlock because discard completion can't
run....

> 
> Of course, the CIL context structure appears to be technically unbound

I'm missing something here - What bit of that structure is unbound?

> as well and it's trivial to just add a separate submission workqueue,
> but I'd like to at least make sure we're on the same page as to the need
> (so it can be documented clearly as well).

A separate submission queue doesn't really solve log Io completion
blocking problem. Yes, it solves the discard completion deadlock,
but we eventually end up in the same place on sustained discard
workloads with submission queuing blocking on a full work queue.

Workqueues are no the way to solve unbound queue depth problems.
That's what Kernel threads are for. e.g. this is the reason the
xfsaild is a kernel thread, not a work queue. The amount of
writeback work queued on the AIL can be hundreds of thousands of
objects, way more than a workqueue can handle. This discard problem
is no different - concurrent dispatch through kworker threads buys
us nothing - we just fill the request queue from hundreds of threads
instead of filling it from just one.

The workqueue approach has other problems, too, like dispatch across
worker threads means discard is not FIFO scheduled - it's completely
random as to the order in which discards get fed to the device
request queue. Hence discards can be starved because whenever the
worker thread runs to process it's queue it finds the device request
queue already full and blocks again.

Having a single kernel thread that walks the discard queue on each
context and then each context in sequence order gives us FIFO
dispatch of discard requests. It would block only on full request
queues giving us a much more predictable log-force-to-completion
latency. It allows for the possiblity of merging discards across
multiple CIL contexts, to directly control the rate of discard, and
to skip small discards or even -turn off discard- when the backlog
gets too great.

The workqueue approach just doesn't allow anything like this to be
done because every discard context is kept separate from every other
context and there is no coordination at all between them.

> > 2. log forces no longer wait for discards to be dispatched - they
> > just queue them. This means the mechanism xfs_extent_busy_flush()
> > uses to dispatch pending discards (synchrnous log force) can return
> > before discards have even been dispatched to disk. Hence we can
> > expect to see longer wait and tail latencies when busy extents are
> > encountered by the allocator. Whether this is a problem or not needs
> > further investigation.
> > 
> 
> Firstly, I think latency is kind of moot in the problematic case. The
> queue is already drowning in requests that presumably are going to take
> minutes to complete. In that case, the overhead of kicking a workqueue
> to do the submission is probably negligible.

Yes, the overhead of kicking the queue is negliable. That's not the
problem though. By queuing discards rather than submitting them we
go from a single FIFO dispatch model (by in-order iclog IO
completion) to a concurrent, uncoordinated dispatch model.  It's the
loss of FIFO behaviour because the synchrnous log force no longer
controls dispatch order that leads to unpredictable and long tail
latencies in dispatch completion, hence causing the problems for the
proceeses now waiting on specific extent discard completion rather
than just the log force.

In some cases they'll get woken faster (don't ahve to wait for
discards to be dispatched), but it is equally likely they'll have to
wait for much, much longer. In essence, the async dispatch by
workqueue model removes all assumptions we've made about the
predictablility of discard completion latency. FIFO is predictable,
concurrent async dispatch by workqueue is completely unpredictable.

If we really want to do fully async dispatch of discards, I think we
need to use a controllable "single dispatch by kernel thread" model
like the AIL, not use workqueues and spray the dispatch in an
uncoordinated, uncontrollable manner across hundreds of kernel
threads....

Cheers,

Dave.
Brian Foster Nov. 8, 2018, 1:50 p.m. UTC | #7
On Thu, Nov 08, 2018 at 11:38:06AM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote:
> > On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> > > On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> > > > On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> > > > > On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> > > > > > When online discard is enabled, discards of busy extents are
> > > > > > submitted asynchronously as a bio chain. bio completion and
> > > > > > resulting busy extent cleanup is deferred to a workqueue. Async
> > > > > > discard submission is intended to avoid blocking log forces on a
> > > > > > full discard sequence which can take a noticeable amount of time in
> > > > > > some cases.
> > > > > > 
> > > > > > We've had reports of this still producing log force stalls with XFS
> > > > > > on VDO,
> > > > > 
> > > > > Please fix this in VDO instead.  We should not work around out of
> > > > > tree code making stupid decisions.
> > > > 
> > > > I assume the "stupid decision" refers to sync discard execution. I'm not
> > > > familiar with the internals of VDO, this is just what I was told.
> > > 
> > > IMO, what VDO does is irrelevant - any call to submit_bio() can
> > > block if the request queue is full. Hence if we've drowned the queue
> > > in discards and the device is slow at discards, then we are going to
> > > block submitting discards.
> > > 
> > > > My
> > > > understanding is that these discards can stack up and take enough time
> > > > that a limit on outstanding discards is required, which now that I think
> > > > of it makes me somewhat skeptical of the whole serial execution thing.
> > > > Hitting that outstanding discard request limit is what bubbles up the
> > > > stack and affects XFS by holding up log forces, since new discard
> > > > submissions are presumably blocked on completion of the oldest
> > > > outstanding request.
> > > 
> > > Exactly.
> > > 
> > > > I'm not quite sure what happens in the block layer if that limit were
> > > > lifted. Perhaps it assumes throttling responsibility directly via
> > > > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> > > > somewhere (via memory allocation pressure?) anyways, so ISTM that some
> > > > kind of throttling is inevitable in this situation. What am I missing?
> > > 
> > > We still need to throttle discards - they have to play nice with all
> > > the other IO we need to dispatch concurrently.
> > > 
> > > I have two issues with the proposed patch:
> > > 
> > > 1. it puts both discard dispatch and completion processing on the
> > > one work qeueue, so if the queue is filled with dispatch requests,
> > > IO completion queuing gets blocked. That's not the best thing to be
> > > doing.
> > > 
> > 
> > This is an unbound workqueue with max_active == 0. AIUI, that means we
> > can have something like 256 execution contexts (worker threads?) per
> > cpu.
> 
> .....
>         WQ_MAX_ACTIVE           = 512,    /* I like 512, better ideas? */
>         WQ_MAX_UNBOUND_PER_CPU  = 4,      /* 4 * #cpus for unbound wq */
>         WQ_DFL_ACTIVE           = WQ_MAX_ACTIVE / 2,
> };
> 
> /* unbound wq's aren't per-cpu, scale max_active according to #cpus */
> #define WQ_UNBOUND_MAX_ACTIVE   \
>         max_t(int, WQ_MAX_ACTIVE, num_possible_cpus() * WQ_MAX_UNBOUND_PER_CPU)
> 
> 
> IOWs, unbound queues are not per-cpu and they are execution limited
> to max(512, NR_CPUS * 4) kworker threads.
> 
> The default (for max_active = 0), however, is still WQ_DFL_ACTIVE so
> the total number of active workers for the unbound xfs_discard_wq
> is 256.
> 

Sounds about right..

> 
> > Given that, plus the batching that occurs in XFS due to delayed
> > logging and discard bio chaining, that seems rather unlikely. Unless I'm
> > misunderstanding the mechanism, I think that means filling the queue as
> > such and blocking discard submission basically consumes one of those
> > contexts.
> 
> Yes, but consider the situation where we've got a slow discard
> device and we're removing a file with millions of extents. We've got
> to issue millions of discard ops in this case. because dispatch
> queueing is not bound, we're easily going to overflow the discard
> workqueue because the freeing transactions will run far ahead of the
> discard operations. Sooner or later we consume all 256 discard_wq
> worker threads with blocked discard submission.  Then both log IO
> completion and discard I ocompletion will block on workqueue
> submission and we deadlock because discard completion can't
> run....
> 
> > 
> > Of course, the CIL context structure appears to be technically unbound
> 
> I'm missing something here - What bit of that structure is unbound?
> 

I mean to say that the ctx is not a fixed/limited resource in the
context of a discard submission workqueue. There's no throttling, so we
can essentially discard as many busy extent lists as the user can
generate (this is essentially the same point as your million extent file
example above).

> > as well and it's trivial to just add a separate submission workqueue,
> > but I'd like to at least make sure we're on the same page as to the need
> > (so it can be documented clearly as well).
> 
> A separate submission queue doesn't really solve log Io completion
> blocking problem. Yes, it solves the discard completion deadlock,
> but we eventually end up in the same place on sustained discard
> workloads with submission queuing blocking on a full work queue.
> 

Well, for the purposes of this patch it's fine that we still block on
discard submission. The point is just that we do so in a separate
context from log I/O completion and thus avoid associated log force
stalls.

> Workqueues are no the way to solve unbound queue depth problems.
> That's what Kernel threads are for. e.g. this is the reason the
> xfsaild is a kernel thread, not a work queue. The amount of
> writeback work queued on the AIL can be hundreds of thousands of
> objects, way more than a workqueue can handle. This discard problem
> is no different - concurrent dispatch through kworker threads buys
> us nothing - we just fill the request queue from hundreds of threads
> instead of filling it from just one.
> 

Fair, but the point wasn't really to buy any kind of high level discard
submission improvement over the current scheme. The goal was to avoid
log force stalls on a time consuming operation that has no explicit
serialization requirement with respect to log force completion...

> The workqueue approach has other problems, too, like dispatch across
> worker threads means discard is not FIFO scheduled - it's completely
> random as to the order in which discards get fed to the device
> request queue. Hence discards can be starved because whenever the
> worker thread runs to process it's queue it finds the device request
> queue already full and blocks again.
> 

... but Ok. I can see that the worst case of creating and blocking
hundreds of worker threads on discard submission kind of creates a mess
even without considering the resulting effect on the discard operations
themselves. There's no real point or need for all of those threads.

> Having a single kernel thread that walks the discard queue on each
> context and then each context in sequence order gives us FIFO
> dispatch of discard requests. It would block only on full request
> queues giving us a much more predictable log-force-to-completion
> latency. It allows for the possiblity of merging discards across
> multiple CIL contexts, to directly control the rate of discard, and
> to skip small discards or even -turn off discard- when the backlog
> gets too great.
> 
> The workqueue approach just doesn't allow anything like this to be
> done because every discard context is kept separate from every other
> context and there is no coordination at all between them.
> 

The high level kthread approach sounds like a nice idea to me. I'm not
sure that it's technically necessary to address this specific problem
though. I.e., it would be nice if we could still separate enhancement
from bug fix.

> > > 2. log forces no longer wait for discards to be dispatched - they
> > > just queue them. This means the mechanism xfs_extent_busy_flush()
> > > uses to dispatch pending discards (synchrnous log force) can return
> > > before discards have even been dispatched to disk. Hence we can
> > > expect to see longer wait and tail latencies when busy extents are
> > > encountered by the allocator. Whether this is a problem or not needs
> > > further investigation.
> > > 
> > 
> > Firstly, I think latency is kind of moot in the problematic case. The
> > queue is already drowning in requests that presumably are going to take
> > minutes to complete. In that case, the overhead of kicking a workqueue
> > to do the submission is probably negligible.
> 
> Yes, the overhead of kicking the queue is negliable. That's not the
> problem though. By queuing discards rather than submitting them we
> go from a single FIFO dispatch model (by in-order iclog IO
> completion) to a concurrent, uncoordinated dispatch model.  It's the
> loss of FIFO behaviour because the synchrnous log force no longer
> controls dispatch order that leads to unpredictable and long tail
> latencies in dispatch completion, hence causing the problems for the
> proceeses now waiting on specific extent discard completion rather
> than just the log force.
> 

I think I see what you're getting at here: discard submission via log
completion means that submission is serialized (with respect to other
iclogs) and in log buf order, since that is enforced by the log callback
invoking code. Since the callbacks execute one at a time, blocking in
submission likely means the same, single log I/O completion execution
context ends up iterating over and processing the other iclogs with
pending callbacks once the current submit blockage clears. Yes?

If I'm following that correctly, couldn't we provide very similar
behavior by using a separate "ordered" workqueue for submission? The
ordered wq sets max_active to 1, which means we still have a single
submitter, yet we still create a queue that disconnects submit from log
force completion. I suppose we could also allocate a separate structure
from xfs_cil_ctx to track pending discard jobs and slim down memory
usage closer to something that might be required for a thread-based
implementation.

From there, we can still explore additional enhancements incrementally,
such as pushing some of the busy extent processing/sorting into the
workqueue, eventually converting the wq-based mechanism into a
thread-based one, etc. Thoughts?

Brian

> In some cases they'll get woken faster (don't ahve to wait for
> discards to be dispatched), but it is equally likely they'll have to
> wait for much, much longer. In essence, the async dispatch by
> workqueue model removes all assumptions we've made about the
> predictablility of discard completion latency. FIFO is predictable,
> concurrent async dispatch by workqueue is completely unpredictable.
> 
> If we really want to do fully async dispatch of discards, I think we
> need to use a controllable "single dispatch by kernel thread" model
> like the AIL, not use workqueues and spray the dispatch in an
> uncoordinated, uncontrollable manner across hundreds of kernel
> threads....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 9, 2018, 12:20 a.m. UTC | #8
On Thu, Nov 08, 2018 at 08:50:02AM -0500, Brian Foster wrote:
> On Thu, Nov 08, 2018 at 11:38:06AM +1100, Dave Chinner wrote:
> > On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote:
> > > On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> > Yes, but consider the situation where we've got a slow discard
> > device and we're removing a file with millions of extents. We've got
> > to issue millions of discard ops in this case. because dispatch
> > queueing is not bound, we're easily going to overflow the discard
> > workqueue because the freeing transactions will run far ahead of the
> > discard operations. Sooner or later we consume all 256 discard_wq
> > worker threads with blocked discard submission.  Then both log IO
> > completion and discard I ocompletion will block on workqueue
> > submission and we deadlock because discard completion can't
> > run....
> > 
> > > 
> > > Of course, the CIL context structure appears to be technically unbound
> > 
> > I'm missing something here - What bit of that structure is unbound?
> > 
> 
> I mean to say that the ctx is not a fixed/limited resource in the
> context of a discard submission workqueue. There's no throttling, so we
> can essentially discard as many busy extent lists as the user can
> generate (this is essentially the same point as your million extent file
> example above).

It is bound. It's bound by log size and the number of checkpoints we
can have uncompleted at any point in time. That can be a high bound
on large logs, but it is limited and it is throttled - you can't
create new contexts if there is no log space available, and hence
new contexts are throttled on tail pushing.

> > > as well and it's trivial to just add a separate submission workqueue,
> > > but I'd like to at least make sure we're on the same page as to the need
> > > (so it can be documented clearly as well).
> > 
> > A separate submission queue doesn't really solve log Io completion
> > blocking problem. Yes, it solves the discard completion deadlock,
> > but we eventually end up in the same place on sustained discard
> > workloads with submission queuing blocking on a full work queue.
> > 
> 
> Well, for the purposes of this patch it's fine that we still block on
> discard submission. The point is just that we do so in a separate
> context from log I/O completion and thus avoid associated log force
> stalls.

And, in doing so, remove any bound we have on the number of
outstanding uncompleted discards. You're essentially trading
"slow things progressively as discard load goes up" with "nothing
blocks until we hit a breakdown point and the system stops until the
discard queue drains".

We have a limited queue depth of discard operations right now, and
even that is deep enough to cause problems.  Removing that queue
depth bound won't improve things - it will just allow the pending
discard queue to run deeper and get further and further behind the
extent free operations that are being committed. This will only make
the stalls and problems completely unpredictable and uncontrollable,
and poses a true breakdown possibility where allocation in every AG
is stalled holding AGF locks waiting for a huge queue of discards to
be worked through.

We have to throttle at some point to prevent us from getting to
these severe breakdown conditions. Moving discards out from under
the log force removes the only feedback loop we have to throttle
discards back to a manageable level. What we have is not perfect,
but we can't just remove that feedback loop without providing
something else to replace it.

> > The workqueue approach just doesn't allow anything like this to be
> > done because every discard context is kept separate from every other
> > context and there is no coordination at all between them.
> > 
> 
> The high level kthread approach sounds like a nice idea to me. I'm not
> sure that it's technically necessary to address this specific problem
> though. I.e., it would be nice if we could still separate enhancement
> from bug fix.

Just removing the discards from the log force context is just hiding
a source of latency without addressing the cause of the latency.
i.e. there's a huge amount of discard to be managed, and the only
management we have to constrain them to the rate at which we retire
the extent free transactions.

> > > > 2. log forces no longer wait for discards to be dispatched - they
> > > > just queue them. This means the mechanism xfs_extent_busy_flush()
> > > > uses to dispatch pending discards (synchrnous log force) can return
> > > > before discards have even been dispatched to disk. Hence we can
> > > > expect to see longer wait and tail latencies when busy extents are
> > > > encountered by the allocator. Whether this is a problem or not needs
> > > > further investigation.
> > > > 
> > > 
> > > Firstly, I think latency is kind of moot in the problematic case. The
> > > queue is already drowning in requests that presumably are going to take
> > > minutes to complete. In that case, the overhead of kicking a workqueue
> > > to do the submission is probably negligible.
> > 
> > Yes, the overhead of kicking the queue is negliable. That's not the
> > problem though. By queuing discards rather than submitting them we
> > go from a single FIFO dispatch model (by in-order iclog IO
> > completion) to a concurrent, uncoordinated dispatch model.  It's the
> > loss of FIFO behaviour because the synchrnous log force no longer
> > controls dispatch order that leads to unpredictable and long tail
> > latencies in dispatch completion, hence causing the problems for the
> > proceeses now waiting on specific extent discard completion rather
> > than just the log force.
> > 
> 
> I think I see what you're getting at here: discard submission via log
> completion means that submission is serialized (with respect to other
> iclogs) and in log buf order, since that is enforced by the log callback
> invoking code. Since the callbacks execute one at a time, blocking in
> submission likely means the same, single log I/O completion execution
> context ends up iterating over and processing the other iclogs with
> pending callbacks once the current submit blockage clears. Yes?

Essentially.

> If I'm following that correctly, couldn't we provide very similar
> behavior by using a separate "ordered" workqueue for submission?

Workqueues are not bound depth queues - they are "concurrency
managed" queues. And ordered workqueue is simply a FIFO with a
single processing context. You can queue as much as you like to it,
but it will only process it with a single thread.

> The
> ordered wq sets max_active to 1, which means we still have a single
> submitter,

We have a single /worker/ - we can submit as many independent works
as we want to it and they just stack up on the queue. There is no
throttling or feedback here, it's just a black hole.

> From there, we can still explore additional enhancements incrementally,
> such as pushing some of the busy extent processing/sorting into the
> workqueue, eventually converting the wq-based mechanism into a
> thread-based one, etc. Thoughts?

As I always say: do it once, do it properly, or don't touch it at
all.

Cheers,

Dave.
Christoph Hellwig Nov. 9, 2018, 3:04 p.m. UTC | #9
On Mon, Nov 05, 2018 at 04:20:24PM -0600, Eric Sandeen wrote:
> Is there some downside to Brian's proposal, in principle?
> It seems like it would be an improvement for device that might cause
> a discard bottleneck like this.

We create another bottleneck, we lost thottling, we don't flush
the workqueue when needed and probably a few more.

Nevermind that I don't see any point in optimizing for out of tree
code that no one even bothers to upstream.  All this VDO here and there
crap from Red Hat in the last year is really more than enough.
Christoph Hellwig Nov. 9, 2018, 3:06 p.m. UTC | #10
On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> My
> understanding is that these discards can stack up and take enough time
> that a limit on outstanding discards is required, which now that I think
> of it makes me somewhat skeptical of the whole serial execution thing.
> Hitting that outstanding discard request limit is what bubbles up the
> stack and affects XFS by holding up log forces, since new discard
> submissions are presumably blocked on completion of the oldest
> outstanding request.

We don't do strict ordering or request, but eventually requests
waiting for completion will block others from being submitted.

> I'm not quite sure what happens in the block layer if that limit were
> lifted. Perhaps it assumes throttling responsibility directly via
> queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> somewhere (via memory allocation pressure?) anyways, so ISTM that some
> kind of throttling is inevitable in this situation. What am I missing?

We'll still block new allocations waiting for these blocks and
other bits.  Or to put it another way - if your discard implementation
is slow (independent of synchronous or not) your are going to be in
a world of pain with online discard.  That is what it's not default
to start with.
Brian Foster Nov. 9, 2018, 3:23 p.m. UTC | #11
On Fri, Nov 09, 2018 at 11:20:52AM +1100, Dave Chinner wrote:
> On Thu, Nov 08, 2018 at 08:50:02AM -0500, Brian Foster wrote:
> > On Thu, Nov 08, 2018 at 11:38:06AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote:
> > > > On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> > > Yes, but consider the situation where we've got a slow discard
> > > device and we're removing a file with millions of extents. We've got
> > > to issue millions of discard ops in this case. because dispatch
> > > queueing is not bound, we're easily going to overflow the discard
> > > workqueue because the freeing transactions will run far ahead of the
> > > discard operations. Sooner or later we consume all 256 discard_wq
> > > worker threads with blocked discard submission.  Then both log IO
> > > completion and discard I ocompletion will block on workqueue
> > > submission and we deadlock because discard completion can't
> > > run....
> > > 
> > > > 
> > > > Of course, the CIL context structure appears to be technically unbound
> > > 
> > > I'm missing something here - What bit of that structure is unbound?
> > > 
> > 
> > I mean to say that the ctx is not a fixed/limited resource in the
> > context of a discard submission workqueue. There's no throttling, so we
> > can essentially discard as many busy extent lists as the user can
> > generate (this is essentially the same point as your million extent file
> > example above).
> 
> It is bound. It's bound by log size and the number of checkpoints we
> can have uncompleted at any point in time. That can be a high bound
> on large logs, but it is limited and it is throttled - you can't
> create new contexts if there is no log space available, and hence
> new contexts are throttled on tail pushing.
> 

"... in the context of a discard submission workqueue."

We're in violent agreement. You're describing the code as it is today
(where it is bound) and I'm referring to prospective behavior with a
separate submit queue (where it is made unbound by virtue of the wq).
You explain the same problem below that I'm referring to above.

> > > > as well and it's trivial to just add a separate submission workqueue,
> > > > but I'd like to at least make sure we're on the same page as to the need
> > > > (so it can be documented clearly as well).
> > > 
> > > A separate submission queue doesn't really solve log Io completion
> > > blocking problem. Yes, it solves the discard completion deadlock,
> > > but we eventually end up in the same place on sustained discard
> > > workloads with submission queuing blocking on a full work queue.
> > > 
> > 
> > Well, for the purposes of this patch it's fine that we still block on
> > discard submission. The point is just that we do so in a separate
> > context from log I/O completion and thus avoid associated log force
> > stalls.
> 
> And, in doing so, remove any bound we have on the number of
> outstanding uncompleted discards. You're essentially trading
> "slow things progressively as discard load goes up" with "nothing
> blocks until we hit a breakdown point and the system stops until the
> discard queue drains".
> 
> We have a limited queue depth of discard operations right now, and
> even that is deep enough to cause problems.  Removing that queue
> depth bound won't improve things - it will just allow the pending
> discard queue to run deeper and get further and further behind the
> extent free operations that are being committed. This will only make
> the stalls and problems completely unpredictable and uncontrollable,
> and poses a true breakdown possibility where allocation in every AG
> is stalled holding AGF locks waiting for a huge queue of discards to
> be worked through.
> 
> We have to throttle at some point to prevent us from getting to
> these severe breakdown conditions. Moving discards out from under
> the log force removes the only feedback loop we have to throttle
> discards back to a manageable level. What we have is not perfect,
> but we can't just remove that feedback loop without providing
> something else to replace it.
> 

So the fundamental question here is how to manage discard submission
throttling if we disconnect submission from the implicit throttling
provided by ordered log I/O completion. We may be able to loosen this up
significantly from the current approach, but we still need to explore
how much and provide _something_ that addresses severe breakdown
conditions. Makes sense.

> > > The workqueue approach just doesn't allow anything like this to be
> > > done because every discard context is kept separate from every other
> > > context and there is no coordination at all between them.
> > > 
> > 
> > The high level kthread approach sounds like a nice idea to me. I'm not
> > sure that it's technically necessary to address this specific problem
> > though. I.e., it would be nice if we could still separate enhancement
> > from bug fix.
> 
> Just removing the discards from the log force context is just hiding
> a source of latency without addressing the cause of the latency.
> i.e. there's a huge amount of discard to be managed, and the only
> management we have to constrain them to the rate at which we retire
> the extent free transactions.
> 
> > > > > 2. log forces no longer wait for discards to be dispatched - they
> > > > > just queue them. This means the mechanism xfs_extent_busy_flush()
> > > > > uses to dispatch pending discards (synchrnous log force) can return
> > > > > before discards have even been dispatched to disk. Hence we can
> > > > > expect to see longer wait and tail latencies when busy extents are
> > > > > encountered by the allocator. Whether this is a problem or not needs
> > > > > further investigation.
> > > > > 
> > > > 
> > > > Firstly, I think latency is kind of moot in the problematic case. The
> > > > queue is already drowning in requests that presumably are going to take
> > > > minutes to complete. In that case, the overhead of kicking a workqueue
> > > > to do the submission is probably negligible.
> > > 
> > > Yes, the overhead of kicking the queue is negliable. That's not the
> > > problem though. By queuing discards rather than submitting them we
> > > go from a single FIFO dispatch model (by in-order iclog IO
> > > completion) to a concurrent, uncoordinated dispatch model.  It's the
> > > loss of FIFO behaviour because the synchrnous log force no longer
> > > controls dispatch order that leads to unpredictable and long tail
> > > latencies in dispatch completion, hence causing the problems for the
> > > proceeses now waiting on specific extent discard completion rather
> > > than just the log force.
> > > 
> > 
> > I think I see what you're getting at here: discard submission via log
> > completion means that submission is serialized (with respect to other
> > iclogs) and in log buf order, since that is enforced by the log callback
> > invoking code. Since the callbacks execute one at a time, blocking in
> > submission likely means the same, single log I/O completion execution
> > context ends up iterating over and processing the other iclogs with
> > pending callbacks once the current submit blockage clears. Yes?
> 
> Essentially.
> 
> > If I'm following that correctly, couldn't we provide very similar
> > behavior by using a separate "ordered" workqueue for submission?
> 
> Workqueues are not bound depth queues - they are "concurrency
> managed" queues. And ordered workqueue is simply a FIFO with a
> single processing context. You can queue as much as you like to it,
> but it will only process it with a single thread.
> 
> > The
> > ordered wq sets max_active to 1, which means we still have a single
> > submitter,
> 
> We have a single /worker/ - we can submit as many independent works
> as we want to it and they just stack up on the queue. There is no
> throttling or feedback here, it's just a black hole.
> 

Sure, but just using a kthread in its place by itself doesn't indicate
anything to me about throttling. We can just as easily create an
unthrottled thread-based model with an unbound set of busy extent lists,
for example, or in the other extreme, use a workqueue to provide the
exact same throttling model we have today (which wouldn't address the
original problem). Throttling is a distinct element of design from
whether we use a workqueue or kthread or whatever else.

I think it's kind of fruitless to continue on about such implementation
details without working out a higher level means of throttling...

> > From there, we can still explore additional enhancements incrementally,
> > such as pushing some of the busy extent processing/sorting into the
> > workqueue, eventually converting the wq-based mechanism into a
> > thread-based one, etc. Thoughts?
> 
> As I always say: do it once, do it properly, or don't touch it at
> all.
> 

TBH, I'm having a hard time grokking what you consider a proper fix
because we seem to be harping over less consequential implementation
details like what low level mechanisms we can or can't use. I can think
of ways to throttle either with different benefit and limitation
tradeoffs, but I still have no real clarity on what kind of throttling
you envision that demands a particular implementation. If you have a
particular technique or approach of throttling in mind, could you please
just describe it at a high level?

IOW, I don't particularly care what mechanism we ultimately use. I'm
just trying to simplify a prospective implementation if at all possible
based on my limited understanding of the problem space. Exploring the
problem space directly is far more helpful to me than doing so
implicitly via discussions over why a particular mechanism may or may
not be suitable.

So to try to put this on a more productive path... my understanding is
that deferring discard submissions outside of log completion introduces
the following considerations:

- unbound queue depth (i.e., outstanding busy extent tracking and memory
  consumption thereof)
- long tail allocation latency caused by allocation requests for blocks
  that might be deep in the unbound queue (i.e., minutes away from
  availability)

The first strikes me as of more of a memory consumption matter. It's not
immediately clear to me whether this would be handled naturally via the
preexisting busy extent struct allocations or we need to take further
precautions. I think those structures are currently limited indirectly
via being cleared in the context of a limited resource (i.e., the log).
IOW, aggressive busy extent creators are eventually going to be
throttled against log completion because transactions require log space
and log buffers, etc.

The second is the issue you've described where driving a hugely deep
discard submission queue means allocators can get stuck on busy blocks
deep in the queue with unpredictable submission+completion latency.

Given the above, I'm wondering if we could do something like reuse the
existing busy extent tree as a queue. Rather than submit busy extent
lists directly, set a new "discard needed" flag on the busy extents in
the tree on log I/O completion and carry on. Whenever discard needed
extents are present, we'll have some kind of rotorized background
submitter sending sorted discards in our preferred batch size. This
background submitter is still free to block.

In conjunction with that, we already have the ability for allocators to
look up busy extents in the tree and trim, wait or reuse them. So how
about allowing allocators to inject priority into the equation by either
submitting isolated discards themselves or allowing them to manipulate
the queue by reusing extents that are only busy due to pending discard
submissions (i.e., the discard has not yet been submitted)?

I think direct submission provides some level of predictable worst case
latency on the allocation side because if the bg submitter is blocked,
the allocator submission requests should have next priority over the
remaining set of pending discard submissions instead of being blocked on
not only the discard queue, but wherever the XFS submission queue
happens to hold that particular extent. IOW, allocators are bound by the
underlying discard processing engine as they are today, just not
indirectly via the log subsystem.

Beyond that, we may be able to actually provide additional benefits by
allowing allocators to optimize away more discards for extents that are
only still sitting in the busy tree for discard submission (i.e., the
associated extent free transaction has committed to the log). IOW,
there's potential for a larger window for extent reuse optimization, but
I still need to study the busy extent code a bit more to wrap my head
around that.

Finally, I think that still leaves us open to creating a huge set of
outstanding submissions depending on how long discards take to complete,
but as you mentioned earlier, there are probably a variety of ways we
can manage that problem. E.g., start dropping discards of a particular
size when/if we're over a particular threshold, perhaps start throttling
busy extent creators by requiring some form of ticket/reservation into
the busy extent tree on transaction reservation, etc. This may be
secondary problem depending on how effective the latency mitigation
throttling is by itself.

Thoughts on any of that?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Nov. 9, 2018, 3:46 p.m. UTC | #12
On Fri, Nov 09, 2018 at 07:06:10AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> > My
> > understanding is that these discards can stack up and take enough time
> > that a limit on outstanding discards is required, which now that I think
> > of it makes me somewhat skeptical of the whole serial execution thing.
> > Hitting that outstanding discard request limit is what bubbles up the
> > stack and affects XFS by holding up log forces, since new discard
> > submissions are presumably blocked on completion of the oldest
> > outstanding request.
> 
> We don't do strict ordering or request, but eventually requests
> waiting for completion will block others from being submitted.
> 

Ok, that's kind of what I expected.

> > I'm not quite sure what happens in the block layer if that limit were
> > lifted. Perhaps it assumes throttling responsibility directly via
> > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> > somewhere (via memory allocation pressure?) anyways, so ISTM that some
> > kind of throttling is inevitable in this situation. What am I missing?
> 
> We'll still block new allocations waiting for these blocks and
> other bits.  Or to put it another way - if your discard implementation
> is slow (independent of synchronous or not) your are going to be in
> a world of pain with online discard.  That is what it's not default
> to start with.

Sure, it's not really the XFS bits I was asking about here. This is
certainly not a high priority and not a common use case. We're working
through some of the other issues in the other sub-thread. In particular,
I'm wondering if we can provide broader improvements to the overall
mechanism to reduce some of that pain.

Brian

Patch
diff mbox series

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index d3884e08b43c..087e99e80edd 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -529,9 +529,11 @@  xlog_discard_endio(
 
 static void
 xlog_discard_busy_extents(
-	struct xfs_mount	*mp,
-	struct xfs_cil_ctx	*ctx)
+	struct work_struct	*work)
 {
+	struct xfs_cil_ctx	*ctx = container_of(work, struct xfs_cil_ctx,
+						    discard_work);
+	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
 	struct list_head	*list = &ctx->busy_extents;
 	struct xfs_extent_busy	*busyp;
 	struct bio		*bio = NULL;
@@ -603,9 +605,10 @@  xlog_cil_committed(
 
 	xlog_cil_free_logvec(ctx->lv_chain);
 
-	if (!list_empty(&ctx->busy_extents))
-		xlog_discard_busy_extents(mp, ctx);
-	else
+	if (!list_empty(&ctx->busy_extents)) {
+		INIT_WORK(&ctx->discard_work, xlog_discard_busy_extents);
+		queue_work(xfs_discard_wq, &ctx->discard_work);
+	} else
 		kmem_free(ctx);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b5f82cb36202..085bebd51135 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -245,6 +245,7 @@  struct xfs_cil_ctx {
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct xfs_log_callback	log_cb;		/* completion callback hook. */
 	struct list_head	committing;	/* ctx committing list */
+	struct work_struct	discard_work;
 	struct work_struct	discard_endio_work;
 };