diff mbox

[v7,9/9] block: Avoid deadlocks with bio allocation by stacking drivers

Message ID 20120830220745.GI27257@redhat.com (mailing list archive)
State Deferred, archived
Headers show

Commit Message

Vivek Goyal Aug. 30, 2012, 10:07 p.m. UTC
On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:

[..]
> > Performance aside, punting submission to per device worker in case of deep
> > stack usage sounds cleaner solution to me.
> 
> Agreed, but performance tends to matter in the real world. And either
> way the tricky bits are going to be confined to a few functions, so I
> don't think it matters that much.
> 
> If someone wants to code up the workqueue version and test it, they're
> more than welcome...

Here is one quick and dirty proof of concept patch. It checks for stack
depth and if remaining space is less than 20% of stack size, then it
defers the bio submission to per queue worker.
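
As a rough illustration, the 20% heuristic can be modeled in userspace
like this (a sketch only -- should_defer_bio() and its parameters are
hypothetical names; the actual patch would derive the remaining space
from the current stack pointer and the thread's stack base):

```c
#include <stdbool.h>

/*
 * Userspace model of the deferral check -- a sketch, not the patch's
 * code.  should_defer_bio(), stack_size and stack_used are made-up
 * names for illustration.
 */
static bool should_defer_bio(unsigned long stack_size,
			     unsigned long stack_used)
{
	unsigned long remaining = stack_size - stack_used;

	/* Punt to the per-queue worker when less than 20% of the
	 * stack is left. */
	return remaining < stack_size / 5;
}
```

With an 8K stack this defers once less than ~1638 bytes remain, i.e.
only under unusually deep call chains.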

Thanks
Vivek


---
 block/blk-core.c          |  171 ++++++++++++++++++++++++++++++++++------------
 block/blk-sysfs.c         |    1 
 include/linux/blk_types.h |    1 
 include/linux/blkdev.h    |    8 ++
 4 files changed, 138 insertions(+), 43 deletions(-)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Kent Overstreet Aug. 31, 2012, 1:43 a.m. UTC | #1
On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> 
> [..]
> > > Performance aside, punting submission to per device worker in case of deep
> > > stack usage sounds cleaner solution to me.
> > 
> > Agreed, but performance tends to matter in the real world. And either
> > way the tricky bits are going to be confined to a few functions, so I
> > don't think it matters that much.
> > 
> > If someone wants to code up the workqueue version and test it, they're
> > more than welcome...
> 
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

I can't think of any correctness issues. I see some stuff that could be
simplified (blk_drain_deferred_bios() is redundant, just make it a
wrapper around blk_deffered_bio_work()).

Still skeptical about the performance impact, though - frankly, on some
of the hardware I've been running bcache on this would be a visible
performance regression - probably double digit percentages but I'd have
to benchmark it.  That kind of hardware/usage is not normal today,
but I've put a lot of work into performance and I don't want to make
things worse without good reason.

Have you tested/benchmarked it?

There's scheduling behaviour, too. We really want the workqueue thread's
cpu time to be charged to the process that submitted the bio. (We could
use a mechanism like that in other places, too... not like this is a new
issue).

This is going to be a real issue for users that need strong isolation -
for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
breaking that (not that it wasn't broken already, but this makes it
worse).

I could be convinced, but right now I prefer my solution.

Kent Overstreet Aug. 31, 2012, 1:55 a.m. UTC | #2
On Thu, Aug 30, 2012 at 6:43 PM, Kent Overstreet <koverstreet@google.com> wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
>> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
>>
>> [..]
>> > > Performance aside, punting submission to per device worker in case of deep
>> > > stack usage sounds cleaner solution to me.
>> >
>> > Agreed, but performance tends to matter in the real world. And either
>> > way the tricky bits are going to be confined to a few functions, so I
>> > don't think it matters that much.
>> >
>> > If someone wants to code up the workqueue version and test it, they're
>> > more than welcome...
>>
>> Here is one quick and dirty proof of concept patch. It checks for stack
>> depth and if remaining space is less than 20% of stack size, then it
>> defers the bio submission to per queue worker.
>
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deffered_bio_work()).
>
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it.  That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.

Here's another crazy idea - we don't really need another thread, just
more stack space.

We could check if we're running out of stack space, then if we are
just allocate another two pages and memcpy the struct thread_info
over.

I think the main obstacle is that we'd need some per-arch code for
mucking with the stack pointer. And it'd break backtraces, but that's
fixable.

Vivek Goyal Aug. 31, 2012, 3:01 p.m. UTC | #3
On Thu, Aug 30, 2012 at 06:43:59PM -0700, Kent Overstreet wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > 
> > [..]
> > > > Performance aside, punting submission to per device worker in case of deep
> > > > stack usage sounds cleaner solution to me.
> > > 
> > > Agreed, but performance tends to matter in the real world. And either
> > > way the tricky bits are going to be confined to a few functions, so I
> > > don't think it matters that much.
> > > 
> > > If someone wants to code up the workqueue version and test it, they're
> > > more than welcome...
> > 
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
> 
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deffered_bio_work()).
> 
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it.  That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.

Would you like to give this patch a quick try and see how much performance
impact you see with bcache on your hardware?

Given that submission through the worker happens only when stack usage
is high, that should reduce the impact of the patch, and common use
cases should remain unaffected.

> 
> Have you tested/benchmarked it?

No, I have not. I will run some simple workloads on SSD.

> 
> There's scheduling behaviour, too. We really want the workqueue thread's
> cpu time to be charged to the process that submitted the bio. (We could
> use a mechanism like that in other places, too... not like this is a new
> issue).
> 
> This is going to be a real issue for users that need strong isolation -
> for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
> breaking that (not that it wasn't broken already, but this makes it
> worse).

There are so many places in the kernel where worker threads do work on
behalf of other processes. I think this is really a minor concern and I
would not be too worried about it.

What is concerning though really is the greater stack usage due to
recursive nature of make_request() and performance impact of deferral
to a worker thread.

Thanks
Vivek

Tejun Heo Sept. 1, 2012, 2:13 a.m. UTC | #4
Hello, Vivek.

On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

So, it removes breadth-first walking of bio construction, and instead
ensures stack overflow never happens by bouncing to a workqueue when
stack usage looks too high.

I do like removal of breadth-first walking.  It makes failure
scenarios a lot less mind-bending.  That said, Kent is right that this
can incur significant overhead for certain configurations, and looking
at stack usage in block layer is rather nasty both in design and
implementation.

If we're gonna need rescuer anyway and can get it right and the
mechanism can be contained in block proper relatively well, I think it
would be better to make breadth-first walking safe.  Both are nasty in
their own ways after all.

Thanks.
Dave Chinner Sept. 3, 2012, 12:49 a.m. UTC | #5
On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> 
> [..]
> > > Performance aside, punting submission to per device worker in case of deep
> > > stack usage sounds cleaner solution to me.
> > 
> > Agreed, but performance tends to matter in the real world. And either
> > way the tricky bits are going to be confined to a few functions, so I
> > don't think it matters that much.
> > 
> > If someone wants to code up the workqueue version and test it, they're
> > more than welcome...
> 
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

Given that we are working around stack depth issues in the
filesystems already in several places, and now it seems like there's
a reason to work around it in the block layers as well, shouldn't we
simply increase the default stack size rather than introduce
complexity and performance regressions to try and work around not
having enough stack?

I mean, we can deal with it like the ia32 4k stack issue was dealt
with (i.e. ignore those stupid XFS people, that's an XFS bug), or
we can face the reality that storage stacks have become so complex
that 8k is no longer a big enough stack for a modern system....

Cheers,

Dave.
Kent Overstreet Sept. 3, 2012, 1:17 a.m. UTC | #6
On Mon, Sep 03, 2012 at 10:49:27AM +1000, Dave Chinner wrote:
> Given that we are working around stack depth issues in the
> filesystems already in several places, and now it seems like there's
> a reason to work around it in the block layers as well, shouldn't we
> simply increase the default stack size rather than introduce
> complexity and performance regressions to try and work around not
> having enough stack?
> 
> I mean, we can deal with it like the ia32 4k stack issue was dealt
> with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> we can face the reality that storage stacks have become so complex
> that 8k is no longer a big enough stack for a modern system....

I'm not arguing against increasing the default stack size (I really
don't have an opinion there) - but it's not a solution for the block
layer, as stacking block devices can require an unbounded amount of
stack without the generic_make_request() convert recursion-to-iteration
thing.
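
That recursion-to-iteration trick can be illustrated with a toy
userspace model (struct bio, generic_make_request() and
driver_make_request() here are stand-ins, not the kernel definitions):
a nested submission from a stacked driver is queued on a per-task list
and drained by the outermost call, so C-stack depth stays constant no
matter how many devices are stacked.

```c
#include <stddef.h>

struct bio {
	int levels;		/* stacking levels still to traverse */
	struct bio *next;
};

static struct bio *bio_list;	/* models current->bio_list */
static int in_make_request;	/* in the kernel: bio_list != NULL */
static int cur_frames, max_frames;

static void driver_make_request(struct bio *bio);

static void generic_make_request(struct bio *bio)
{
	if (in_make_request) {
		/* Nested call from a stacked driver: queue, don't recurse. */
		bio->next = bio_list;
		bio_list = bio;
		return;
	}
	in_make_request = 1;
	driver_make_request(bio);
	while (bio_list) {	/* drain resubmissions iteratively */
		struct bio *b = bio_list;

		bio_list = b->next;
		driver_make_request(b);
	}
	in_make_request = 0;
}

static void driver_make_request(struct bio *bio)
{
	cur_frames++;		/* one driver frame on the C stack */
	if (cur_frames > max_frames)
		max_frames = cur_frames;
	if (bio->levels > 0) {
		bio->levels--;
		generic_make_request(bio); /* resubmit to the lower device */
	}
	cur_frames--;
}
```

Submitting a bio through 16 stacked levels never exceeds one driver
frame at a time, which is exactly why the workqueue/stack-depth
question is about drivers' own usage, not the recursion itself.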

Kent Overstreet Sept. 3, 2012, 1:26 a.m. UTC | #7
On Fri, Aug 31, 2012 at 11:01:59AM -0400, Vivek Goyal wrote:
> On Thu, Aug 30, 2012 at 06:43:59PM -0700, Kent Overstreet wrote:
> > On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > > 
> > > [..]
> > > > > Performance aside, punting submission to per device worker in case of deep
> > > > > stack usage sounds cleaner solution to me.
> > > > 
> > > > Agreed, but performance tends to matter in the real world. And either
> > > > way the tricky bits are going to be confined to a few functions, so I
> > > > don't think it matters that much.
> > > > 
> > > > If someone wants to code up the workqueue version and test it, they're
> > > > more than welcome...
> > > 
> > > Here is one quick and dirty proof of concept patch. It checks for stack
> > > depth and if remaining space is less than 20% of stack size, then it
> > > defers the bio submission to per queue worker.
> > 
> > I can't think of any correctness issues. I see some stuff that could be
> > simplified (blk_drain_deferred_bios() is redundant, just make it a
> > wrapper around blk_deffered_bio_work()).
> > 
> > Still skeptical about the performance impact, though - frankly, on some
> > of the hardware I've been running bcache on this would be a visible
> > performance regression - probably double digit percentages but I'd have
> > to benchmark it.  That kind of hardware/usage is not normal today,
> > but I've put a lot of work into performance and I don't want to make
> > things worse without good reason.
> 
> Would you like to give this patch a quick try and see how much performance
> impact you see with bcache on your hardware?

If I can get a test system set up with a modern kernel that I can
publish numbers on, I will. Will take a bit, though.

> Given that submission through the worker happens only when stack usage
> is high, that should reduce the impact of the patch, and common use
> cases should remain unaffected.

Except depending on how users have their systems configured, it'll
either never happen or it'll happen for most every bio. That makes the
performance overhead unpredictable, too.

> > 
> > Have you tested/benchmarked it?
> 
> No, I have not. I will run some simple workloads on SSD.

Normal SATA SSDs are not going to show the overhead - AHCI is a pig
and it'll be lost in the noise.

> There are so many places in the kernel where worker threads do work on
> behalf of other processes. I think this is really a minor concern and I
> would not be too worried about it.

Yeah, but this is somewhat unprecedented in the amount of cpu time
you're potentially moving to worker threads.

It is a concern.

> What is concerning though really is the greater stack usage due to
> recursive nature of make_request() and performance impact of deferral
> to a worker thread.

Your patch shouldn't increase stack usage (at least if your threshold is
safe - it's too high as is).

Mikulas Patocka Sept. 3, 2012, 8:41 p.m. UTC | #8
On Thu, 30 Aug 2012, Kent Overstreet wrote:

> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > 
> > [..]
> > > > Performance aside, punting submission to per device worker in case of deep
> > > > stack usage sounds cleaner solution to me.
> > > 
> > > Agreed, but performance tends to matter in the real world. And either
> > > way the tricky bits are going to be confined to a few functions, so I
> > > don't think it matters that much.
> > > 
> > > If someone wants to code up the workqueue version and test it, they're
> > > more than welcome...
> > 
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
> 
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deffered_bio_work()).
> 
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it.  That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.
> 
> Have you tested/benchmarked it?
> 
> There's scheduling behaviour, too. We really want the workqueue thread's
> cpu time to be charged to the process that submitted the bio. (We could
> use a mechanism like that in other places, too... not like this is a new
> issue).
> 
> This is going to be a real issue for users that need strong isolation -
> for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
> breaking that (not that it wasn't broken already, but this makes it
> worse).

... or another possibility - start a timer when something is put to 
current->bio_list and use that timer to pop entries off current->bio_list 
and submit them to a workqueue. The timer can be cpu-local so only 
interrupt masking is required to synchronize against the timer.

This would normally run just like the current kernel and in case of 
deadlock, the timer would kick in and resolve the deadlock.

> I could be convinced, but right now I prefer my solution.

It fixes bio allocation problem, but not other similar mempool problems in 
dm and md.

Mikulas

Kent Overstreet Sept. 4, 2012, 3:41 a.m. UTC | #9
On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> ... or another possibility - start a timer when something is put to 
> current->bio_list and use that timer to pop entries off current->bio_list 
> and submit them to a workqueue. The timer can be cpu-local so only 
> interrupt masking is required to synchronize against the timer.
> 
> This would normally run just like the current kernel and in case of 
> deadlock, the timer would kick in and resolve the deadlock.

Ugh. That's a _terrible_ idea.

Remember the old plugging code? You ever have to debug performance
issues caused by it?

> 
> > I could be convinced, but right now I prefer my solution.
> 
> It fixes bio allocation problem, but not other similar mempool problems in 
> dm and md.

I looked a bit more, and actually I think the rest of the problem is
pretty limited in scope - most of those mempool allocations are per
request, not per split.

I'm willing to put some time into converting dm/md over to bioset's
front_pad. I'm having to learn the code for the immutable biovec work,
anyways.

Vivek Goyal Sept. 4, 2012, 1:54 p.m. UTC | #10
On Mon, Sep 03, 2012 at 10:49:27AM +1000, Dave Chinner wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > 
> > [..]
> > > > Performance aside, punting submission to per device worker in case of deep
> > > > stack usage sounds cleaner solution to me.
> > > 
> > > Agreed, but performance tends to matter in the real world. And either
> > > way the tricky bits are going to be confined to a few functions, so I
> > > don't think it matters that much.
> > > 
> > > If someone wants to code up the workqueue version and test it, they're
> > > more than welcome...
> > 
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
> 
> Given that we are working around stack depth issues in the
> filesystems already in several places, and now it seems like there's
> a reason to work around it in the block layers as well, shouldn't we
> simply increase the default stack size rather than introduce
> complexity and performance regressions to try and work around not
> having enough stack?

Dave,

In this particular instance, we really don't have any bug reports of
stack overflowing. Just discussing what will happen if we make 
generic_make_request() recursive again.

> 
> I mean, we can deal with it like the ia32 4k stack issue was dealt
> with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> we can face the reality that storage stacks have become so complex
> that 8k is no longer a big enough stack for a modern system....

So first question will be, what's the right stack size? If we make
generic_make_request() recursive, then at some storage stack depth we will
overflow stack anyway (if we have created too deep a stack). Hence
keeping current logic kind of makes sense as in theory we can support
arbitrary depth of storage stack.

Yes, if higher layers are consuming more stack, then it does raise the
question whether to offload work to worker and take performance hit or
increase stack depth. I don't know what's the answer to that question.
I have only tried going through the archives, where some people seem to have
pushed for an even smaller stack size (4K).

Thanks
Vivek

Vivek Goyal Sept. 4, 2012, 3 p.m. UTC | #11
On Fri, Aug 31, 2012 at 07:13:48PM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
> 
> So, it removes breadth-first walking of bio construction, and instead
> ensures stack overflow never happens by bouncing to a workqueue when
> stack usage looks too high.
> 
> I do like removal of breadth-first walking.  It makes failure
> scenarios a lot less mind-bending.  That said, Kent is right that this
> can incur significant overhead for certain configurations, and looking
> at stack usage in block layer is rather nasty both in design and
> implementation.
> 
> If we're gonna need rescuer anyway and can get it right and the
> mechanism can be contained in block proper relatively well, I think it
> would be better to make breadth-first walking safe.  Both are nasty in
> their own ways after all.

Hi Tejun,

That's fine. Breadth first walking does make sense given the fact that
storage stack can be arbitrarily deep. And Kent's bioset rescuer patch
is not too bad. So I am fine with pursuing that patch.

Thanks
Vivek

Tejun Heo Sept. 4, 2012, 6:26 p.m. UTC | #12
Hello,

On Tue, Sep 04, 2012 at 09:54:23AM -0400, Vivek Goyal wrote:
> > Given that we are working around stack depth issues in the
> > filesystems already in several places, and now it seems like there's
> > a reason to work around it in the block layers as well, shouldn't we
> > simply increase the default stack size rather than introduce
> > complexity and performance regressions to try and work around not
> > having enough stack?
> 
> Dave,
> 
> In this particular instance, we really don't have any bug reports of
> stack overflowing. Just discussing what will happen if we make 
> generic_make_request() recursive again.

I think there was one and that's why we added the bio_list thing.

> > I mean, we can deal with it like the ia32 4k stack issue was dealt
> > with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> > we can face the reality that storage stacks have become so complex
> > that 8k is no longer a big enough stack for a modern system....
> 
> So first question will be, what's the right stack size? If we make
> generic_make_request() recursive, then at some storage stack depth we will
> overflow stack anyway (if we have created too deep a stack). Hence
> keeping current logic kind of makes sense as in theory we can support
> arbitrary depth of storage stack.

But, yeah, this can't be solved by enlarging the stack size.  The
upper limit is unbounded.

Thanks.
Tejun Heo Sept. 4, 2012, 6:55 p.m. UTC | #13
Hello, Mikulas, Kent.

On Mon, Sep 03, 2012 at 08:41:00PM -0700, Kent Overstreet wrote:
> On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> > ... or another possibility - start a timer when something is put to 
> > current->bio_list and use that timer to pop entries off current->bio_list 
> > and submit them to a workqueue. The timer can be cpu-local so only 
> > interrupt masking is required to synchronize against the timer.
> > 
> > This would normally run just like the current kernel and in case of 
> > deadlock, the timer would kick in and resolve the deadlock.
> 
> Ugh. That's a _terrible_ idea.

That's exactly how workqueue rescuers work - rescuers kick in if new
worker creation doesn't succeed in given amount of time.  The
suggested mechanism already makes use of workqueue, so it's already
doing it.  If you can think of a better way to detect the generic
stall condition, please be my guest.

> Remember the old plugging code? You ever have to debug performance
> issues caused by it?

That is not equivalent.  Plugging was kicking in all the time and it
wasn't entirely well-defined what the plugging / unplugging conditions
were.  This type of rescuing for forward-progress guarantee only kicks
in under severe memory pressure and people expect finite latency and
throughput hits under such conditions.  The usual bio / request /
scsi_cmd allocations could be failing under these circumstances and
things could be progressing only thanks to the finite preallocated
pools.  I don't think involving a rescue timer would be noticeably
detrimental.

Actually, if the timer approach can reduce the frequency of rescuer
involvement, I think it could actually be better.

Thanks.
Tejun Heo Sept. 4, 2012, 7:01 p.m. UTC | #14
On Tue, Sep 04, 2012 at 11:55:40AM -0700, Tejun Heo wrote:
> Actually, if the timer approach can reduce the frequency of rescuer
> involvement, I think it could actually be better.

Ooh, it wouldn't.  It's kicking in only after alloc failure.  I don't
know.  I think conditioning it on alloc failure is cleaner and
converting all per-bio allocations to front-pad makes sense.  Using a
timer wouldn't make the mechanism any simpler, right?

Thanks.
Mikulas Patocka Sept. 4, 2012, 7:26 p.m. UTC | #15
On Mon, 3 Sep 2012, Kent Overstreet wrote:

> On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> > ... or another possibility - start a timer when something is put to 
> > current->bio_list and use that timer to pop entries off current->bio_list 
> > and submit them to a workqueue. The timer can be cpu-local so only 
> > interrupt masking is required to synchronize against the timer.
> > 
> > This would normally run just like the current kernel and in case of 
> > deadlock, the timer would kick in and resolve the deadlock.
> 
> Ugh. That's a _terrible_ idea.
> 
> Remember the old plugging code? You ever have to debug performance
> issues caused by it?

Yes, I do remember it (and I fixed one bug that resulted in missed unplug 
and degraded performance).

But currently, deadlocks due to exhausted mempools and bios being stalled
in current->bio_list don't happen (or happen so rarely that they
aren't reported).

If we add a timer, it will turn a deadlock into an i/o delay, but it can't 
make things any worse.

BTW. can these new-style timerless plugs introduce deadlocks too? What 
happens when some bios are indefinitely delayed because their requests are 
held in a plug and a mempool runs out?

> > > I could be convinced, but right now I prefer my solution.
> > 
> > It fixes bio allocation problem, but not other similar mempool problems in 
> > dm and md.
> 
> I looked a bit more, and actually I think the rest of the problem is
> pretty limited in scope - most of those mempool allocations are per
> request, not per split.
> 
> I'm willing to put some time into converting dm/md over to bioset's
> front_pad. I'm having to learn the code for the immutable biovec work,
> anyways.

Currently, dm targets allocate request-specific data from target-specific
mempools.  Mempools are used in dm-crypt, dm-delay, dm-mirror, dm-snapshot,
dm-thin, dm-verity.  You can change it to allocate request-specific data
with the bio.
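
The front_pad idea can be sketched like this (struct bio, struct
per_bio_data and both helpers are hypothetical stand-ins, not the
kernel's bioset API): one allocation holds the driver's per-request
data in front of the bio, and the data is recovered from the bio
pointer by offset, the way dm would use container_of() on a bioset
created with front_pad.

```c
#include <stdlib.h>
#include <stddef.h>

struct bio {
	long sector;
};

struct per_bio_data {
	void *target_private;	/* what dm-crypt et al. kept in a mempool */
	struct bio bio;		/* the bio lives after the "front pad" */
};

/* One allocation covers both the pad and the bio; the caller only
 * ever sees the bio pointer. */
static struct bio *alloc_bio_with_front_pad(void)
{
	struct per_bio_data *pbd = calloc(1, sizeof(*pbd));

	return pbd ? &pbd->bio : NULL;
}

/* Recover the driver data from the bio, container_of()-style. */
static struct per_bio_data *bio_to_pbd(struct bio *bio)
{
	return (struct per_bio_data *)
		((char *)bio - offsetof(struct per_bio_data, bio));
}
```

The point is that a split then costs exactly one mempool/bioset
allocation instead of two, so only the bio allocator needs a
forward-progress guarantee.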

Mikulas

Vivek Goyal Sept. 4, 2012, 7:39 p.m. UTC | #16
On Tue, Sep 04, 2012 at 03:26:19PM -0400, Mikulas Patocka wrote:

[..]
> BTW. can these new-style timerless plugs introduce deadlocks too? What 
> happens when some bios are indefinitely delayed because their requests are 
> held in a plug and a mempool runs out?

I think they will not deadlock because these on-stack bios/requests are
flushed/dispatched when the process schedules out. So if a submitter blocks
on a mempool, it will be scheduled out and the requests on the plug will be
dispatched.

Thanks
Vivek

Kent Overstreet Sept. 4, 2012, 7:42 p.m. UTC | #17
On Tue, Sep 04, 2012 at 11:55:40AM -0700, Tejun Heo wrote:
> Hello, Mikulas, Kent.
> 
> On Mon, Sep 03, 2012 at 08:41:00PM -0700, Kent Overstreet wrote:
> > On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> > > ... or another possibility - start a timer when something is put to 
> > > current->bio_list and use that timer to pop entries off current->bio_list 
> > > and submit them to a workqueue. The timer can be cpu-local so only 
> > > interrupt masking is required to synchronize against the timer.
> > > 
> > > This would normally run just like the current kernel and in case of 
> > > deadlock, the timer would kick in and resolve the deadlock.
> > 
> > Ugh. That's a _terrible_ idea.
> 
> That's exactly how workqueue rescuers work - rescuers kick in if new
> worker creation doesn't succeed in given amount of time.  The
> suggested mechanism already makes use of workqueue, so it's already
> doing it.  If you can think of a better way to detect the generic
> stall condition, please be my guest.
> 
> > Remember the old plugging code? You ever have to debug performance
> > issues caused by it?
> 
> That is not equivalent.  Plugging was kicking in all the time and it
> wasn't entirely well-defined what the plugging / unplugging conditions
> were.  This type of rescuing for forward-progress guarantee only kicks
> in under severe memory pressure and people expect finite latency and
> throughput hits under such conditions.  The usual bio / request /
> scsi_cmd allocations could be failing under these circumstances and
> things could be progressing only thanks to the finite preallocated
> pools.  I don't think involving a rescue timer would be noticeably
> detrimental.
> 
> Actually, if the timer approach can reduce the frequency of rescuer
> involvement, I think it could actually be better.

Ok, that was an overly harsh, emotional response. But I still hate the
idea.

You want to point me at the relevant workqueue code? I'd really like to
see what you did there, it's entirely possible you're aware of some
issue I'm not but if not I'd like to take a stab at it.

Kent Overstreet Sept. 4, 2012, 7:43 p.m. UTC | #18
On Tue, Sep 04, 2012 at 12:01:19PM -0700, Tejun Heo wrote:
> On Tue, Sep 04, 2012 at 11:55:40AM -0700, Tejun Heo wrote:
> > Actually, if the timer approach can reduce the frequency of rescuer
> > involvement, I think it could actually be better.
> 
> Ooh, it wouldn't.  It's kicking in only after alloc failure.  I don't
> know.  I think conditioning it on alloc failure is cleaner and
> converting all per-bio allocations to front-pad makes sense.  Using a
> timer wouldn't make the mechanism any simpler, right?

Exactly

Tejun Heo Sept. 4, 2012, 9:03 p.m. UTC | #19
Hello,

On Tue, Sep 04, 2012 at 12:42:37PM -0700, Kent Overstreet wrote:
> You want to point me at the relevant workqueue code? I'd really like to
> see what you did there, it's entirely possible you're aware of some
> issue I'm not but if not I'd like to take a stab at it.

I was mistaken.  The issue was that adding @gfp_flags to
kthread_create() wasn't trivial, as it involved updates to arch callbacks,
so the timer was added to side-step the issue.  So, yeah, if it can be
made to work w/o timer, I think that would be better.

Thanks.
Dave Chinner Sept. 5, 2012, 3:57 a.m. UTC | #20
On Tue, Sep 04, 2012 at 11:26:33AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 04, 2012 at 09:54:23AM -0400, Vivek Goyal wrote:
> > > Given that we are working around stack depth issues in the
> > > filesystems already in several places, and now it seems like there's
> > > a reason to work around it in the block layers as well, shouldn't we
> > > simply increase the default stack size rather than introduce
> > > complexity and performance regressions to try and work around not
> > > having enough stack?
> > 
> > Dave,
> > 
> > In this particular instance, we really don't have any bug reports of
> > stack overflowing. Just discussing what will happen if we make 
> > generic_make_request() recursive again.
> 
> I think there was one and that's why we added the bio_list thing.

There was more than one - it was regular enough to be considered a
feature... :/

> > > I mean, we can deal with it like the ia32 4k stack issue was dealt
> > > with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> > > we can face the reality that storage stacks have become so complex
> > > that 8k is no longer a big enough stack for a modern system....
> > 
> > So first question will be, what's the right stack size? If we make
> > generic_make_request() recursive, then at some storage stack depth we will
> > overflow stack anyway (if we have created too deep a stack). Hence
> > keeping current logic kind of makes sense as in theory we can support
> > arbitrary depth of storage stack.
> 
> But, yeah, this can't be solved by enlarging the stack size.  The
> upper limit is unbound.

Sure, but the recursion issue is isolated to the block layer.

If we can still submit IO directly through the block layer without
pushing it off to a work queue, then the overall stack usage problem
still exists. But if the block layer always pushes the IO off into
another workqueue to avoid stack overflows, then the context
switches are going to cause significant performance regressions for
high IOPS workloads.  I don't really like either situation.

So while you are discussing stack issues, think a little about the
bigger picture outside of the immediate issue at hand - a better
solution for everyone might pop up....

Cheers,

Dave.
Tejun Heo Sept. 5, 2012, 4:37 a.m. UTC | #21
Hello, Dave.

On Wed, Sep 05, 2012 at 01:57:59PM +1000, Dave Chinner wrote:
> > But, yeah, this can't be solved by enlarging the stack size.  The
> > upper limit is unbound.
> 
> Sure, but the recursion issue is isolated to the block layer.
> 
> If we can still submit IO directly through the block layer without
> pushing it off to a work queue, then the overall stack usage problem
> still exists. But if the block layer always pushes the IO off into
> another workqueue to avoid stack overflows, then the context
> switches are going to cause significant performance regressions for
> high IOPS workloads.  I don't really like either situation.

Kent's proposed solution doesn't do that.  The rescuer work item is
used iff mempool allocation fails w/o GFP_WAIT.  IOW, we're already
under severe memory pressure and stalls are expected all around the
kernel (somehow this sounds festive...)  It doesn't alter the
breadth-first walk of bio decomposition and shouldn't degrade
performance in any noticeable way.

> So while you are discussing stack issues, think a little about the
> bigger picture outside of the immediate issue at hand - a better
> solution for everyone might pop up....

It's probably because I haven't been bitten much from stack overflow
but I'd like to keep thinking that stack overflows are extremely
unusual and the ones causing them are the bad ones.  Thank you very
much. :p
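For reference, the heuristic in the patch below ("defer if less than 20% of the stack remains") can be approximated in userspace C. This is only a sketch of the idea, not kernel code: it assumes a downward-growing stack, uses a pretend THREAD_SIZE, and counts a deferral instead of calling queue_work():

```c
#include <stdbool.h>
#include <stddef.h>

#define THREAD_SIZE (8UL * 1024)  /* pretend kernel stack size (x86_64 default) */

static char *stack_base;          /* captured in the outermost frame */
static int submissions_deferred;  /* stands in for queue_work() */

/* Mirrors the patch: defer when less than 20% of THREAD_SIZE remains. */
static bool stack_low(void)
{
	char probe;
	size_t used = (size_t)(stack_base - &probe); /* stack grows down */

	if (used >= THREAD_SIZE)
		return true;  /* already past our pretend stack */
	return (THREAD_SIZE - used) < (THREAD_SIZE * 2) / 10;
}

/* Each call burns ~1KB of stack, like a stacked make_request_fn would. */
static void submit(int depth)
{
	volatile char filler[1024];

	filler[0] = (char)depth;
	if (stack_low()) {
		submissions_deferred++;  /* the patch would queue_work() here */
		return;
	}
	if (depth > 0)
		submit(depth - 1);
	(void)filler[0];  /* keep the frame live across the recursive call */
}
```

The interesting property is that the fast path pays only an address comparison; the worker (and its context switch) is involved only once the stack is actually close to exhaustion.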

Patch

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/include/linux/blkdev.h	2012-09-01 18:09:58.805577658 -0400
@@ -430,6 +430,14 @@  struct request_queue {
 	/* Throttle data */
 	struct throtl_data *td;
 #endif
+
+	/*
+	 * Bio submission to the queue can be deferred to a workqueue if
+	 * the submitter's stack usage is high.
+	 */
+	struct bio_list         deferred_bios;
+	struct work_struct	deferred_bio_work;
+	struct workqueue_struct *deferred_bio_workqueue;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-core.c	2012-09-02 00:34:55.204091269 -0400
@@ -211,6 +211,23 @@  static void blk_delay_work(struct work_s
 	spin_unlock_irq(q->queue_lock);
 }
 
+static void blk_deferred_bio_work(struct work_struct *work)
+{
+	struct request_queue *q;
+	struct bio *bio = NULL;
+
+	q = container_of(work, struct request_queue, deferred_bio_work);
+
+	do {
+		spin_lock_irq(q->queue_lock);
+		bio = bio_list_pop(&q->deferred_bios);
+		spin_unlock_irq(q->queue_lock);
+		if (!bio)
+			break;
+		generic_make_request(bio);
+	} while (1);
+}
+
 /**
  * blk_delay_queue - restart queueing after defined interval
  * @q:		The &struct request_queue in question
@@ -289,6 +306,7 @@  void blk_sync_queue(struct request_queue
 {
 	del_timer_sync(&q->timeout);
 	cancel_delayed_work_sync(&q->delay_work);
+	cancel_work_sync(&q->deferred_bio_work);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
@@ -351,6 +369,29 @@  void blk_put_queue(struct request_queue 
 EXPORT_SYMBOL(blk_put_queue);
 
 /**
+ * blk_drain_deferred_bios - drain deferred bios
+ * @q: request_queue to drain deferred bios for
+ *
+ * Dispatch all currently deferred bios on @q through ->make_request_fn().
+ */
+static void blk_drain_deferred_bios(struct request_queue *q)
+{
+	struct bio_list bl;
+	struct bio *bio;
+	unsigned long flags;
+
+	bio_list_init(&bl);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	bio_list_merge(&bl, &q->deferred_bios);
+	bio_list_init(&q->deferred_bios);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	while ((bio = bio_list_pop(&bl)))
+		generic_make_request(bio);
+}
+
+/**
  * blk_drain_queue - drain requests from request_queue
  * @q: queue to drain
  * @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
@@ -358,6 +399,10 @@  EXPORT_SYMBOL(blk_put_queue);
  * Drain requests from @q.  If @drain_all is set, all requests are drained.
  * If not, only ELVPRIV requests are drained.  The caller is responsible
  * for ensuring that no new requests which need to be drained are queued.
+ *
+ * Note: it does not drain bios on the q->deferred_bios list;
+ * call blk_drain_deferred_bios() if need be.
+ *
  */
 void blk_drain_queue(struct request_queue *q, bool drain_all)
 {
@@ -505,6 +550,9 @@  void blk_cleanup_queue(struct request_qu
 	spin_unlock_irq(lock);
 	mutex_unlock(&q->sysfs_lock);
 
+	/* First drain all deferred bios. */
+	blk_drain_deferred_bios(q);
+
 	/* drain all requests queued before DEAD marking */
 	blk_drain_queue(q, true);
 
@@ -614,11 +662,19 @@  struct request_queue *blk_alloc_queue_no
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	if (blkcg_init_queue(q))
+	bio_list_init(&q->deferred_bios);
+	INIT_WORK(&q->deferred_bio_work, blk_deferred_bio_work);
+	q->deferred_bio_workqueue = alloc_workqueue("kdeferbiod", WQ_MEM_RECLAIM, 0);
+	if (!q->deferred_bio_workqueue)
 		goto fail_id;
 
+	if (blkcg_init_queue(q))
+		goto fail_deferred_bio_wq;
+
 	return q;
 
+fail_deferred_bio_wq:
+	destroy_workqueue(q->deferred_bio_workqueue);
 fail_id:
 	ida_simple_remove(&blk_queue_ida, q->id);
 fail_q:
@@ -1635,8 +1691,10 @@  static inline int bio_check_eod(struct b
 	return 0;
 }
 
+
+
 static noinline_for_stack bool
-generic_make_request_checks(struct bio *bio)
+generic_make_request_checks_early(struct bio *bio)
 {
 	struct request_queue *q;
 	int nr_sectors = bio_sectors(bio);
@@ -1715,9 +1773,6 @@  generic_make_request_checks(struct bio *
 	 */
 	create_io_context(GFP_ATOMIC, q->node);
 
-	if (blk_throtl_bio(q, bio))
-		return false;	/* throttled, will be resubmitted later */
-
 	trace_block_bio_queue(q, bio);
 	return true;
 
@@ -1726,6 +1781,56 @@  end_io:
 	return false;
 }
 
+static noinline_for_stack bool
+generic_make_request_checks_late(struct bio *bio)
+{
+	struct request_queue *q;
+
+	q = bdev_get_queue(bio->bi_bdev);
+
+	/*
+	 * Various block parts want %current->io_context and lazy ioc
+	 * allocation ends up trading a lot of pain for a small amount of
+	 * memory.  Just allocate it upfront.  This may fail and block
+	 * layer knows how to live with it.
+	 */
+	create_io_context(GFP_ATOMIC, q->node);
+
+	if (blk_throtl_bio(q, bio))
+		return false;	/* throttled, will be resubmitted later */
+
+	return true;
+}
+
+static void __generic_make_request(struct bio *bio)
+{
+	struct request_queue *q;
+
+	if (!generic_make_request_checks_late(bio))
+		return;
+	q = bdev_get_queue(bio->bi_bdev);
+	q->make_request_fn(q, bio);
+}
+
+static void generic_make_request_defer_bio(struct bio *bio)
+{
+	struct request_queue *q;
+	unsigned long flags;
+
+	q = bdev_get_queue(bio->bi_bdev);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (unlikely(blk_queue_dead(q))) {
+		spin_unlock_irqrestore(q->queue_lock, flags);
+		bio_endio(bio, -ENODEV);
+		return;
+	}
+	set_bit(BIO_DEFERRED, &bio->bi_flags);
+	bio_list_add(&q->deferred_bios, bio);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	queue_work(q->deferred_bio_workqueue, &q->deferred_bio_work);
+}
+
 /**
  * generic_make_request - hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -1752,51 +1857,31 @@  end_io:
  */
 void generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	unsigned long sp = 0;
+	unsigned int threshold = (THREAD_SIZE * 2)/10;
 
-	if (!generic_make_request_checks(bio))
-		return;
+	BUG_ON(bio->bi_next);
 
-	/*
-	 * We only want one ->make_request_fn to be active at a time, else
-	 * stack usage with stacked devices could be a problem.  So use
-	 * current->bio_list to keep a list of requests submited by a
-	 * make_request_fn function.  current->bio_list is also used as a
-	 * flag to say if generic_make_request is currently active in this
-	 * task or not.  If it is NULL, then no make_request is active.  If
-	 * it is non-NULL, then a make_request is active, and new requests
-	 * should be added at the tail
-	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	/* Submitting a deferred bio from worker context. */
+	if (bio_flagged(bio, BIO_DEFERRED)) {
+		clear_bit(BIO_DEFERRED, &bio->bi_flags);
+		__generic_make_request(bio);
 		return;
 	}
 
-	/* following loop may be a bit non-obvious, and so deserves some
-	 * explanation.
-	 * Before entering the loop, bio->bi_next is NULL (as all callers
-	 * ensure that) so we have a list with a single bio.
-	 * We pretend that we have just taken it off a longer list, so
-	 * we assign bio_list to a pointer to the bio_list_on_stack,
-	 * thus initialising the bio_list of new bios to be
-	 * added.  ->make_request() may indeed add some more bios
-	 * through a recursive call to generic_make_request.  If it
-	 * did, we find a non-NULL value in bio_list and re-enter the loop
-	 * from the top.  In this case we really did just take the bio
-	 * of the top of the list (no pretending) and so remove it from
-	 * bio_list, and call into ->make_request() again.
-	 */
-	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
-	do {
-		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	if (!generic_make_request_checks_early(bio))
+		return;
 
-		q->make_request_fn(q, bio);
+	/*
+	 * FIXME. Provide an arch dependent function to return left stack
+	 * space for current task. This is hack for x86_64.
+	 */
+	asm volatile("movq %%rsp,%0" : "=m"(sp));
 
-		bio = bio_list_pop(current->bio_list);
-	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	if ((sp - (unsigned long)end_of_stack(current)) < threshold)
+		generic_make_request_defer_bio(bio);
+	else
+		__generic_make_request(bio);
 }
 EXPORT_SYMBOL(generic_make_request);
 
Index: linux-2.6/block/blk-sysfs.c
===================================================================
--- linux-2.6.orig/block/blk-sysfs.c	2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-sysfs.c	2012-09-01 18:09:58.808577661 -0400
@@ -505,6 +505,7 @@  static void blk_release_queue(struct kob
 
 	ida_simple_remove(&blk_queue_ida, q->id);
 	kmem_cache_free(blk_requestq_cachep, q);
+	destroy_workqueue(q->deferred_bio_workqueue);
 }
 
 static const struct sysfs_ops queue_sysfs_ops = {
Index: linux-2.6/include/linux/blk_types.h
===================================================================
--- linux-2.6.orig/include/linux/blk_types.h	2012-09-02 00:34:17.607086696 -0400
+++ linux-2.6/include/linux/blk_types.h	2012-09-02 00:34:21.997087104 -0400
@@ -105,6 +105,7 @@  struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_DEFERRED	12	/* Bio was deferred for submission by worker */
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
 /*
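As an aside, blk_drain_deferred_bios() in the patch above uses a classic drain idiom: splice the shared list into a private one while holding the lock, then process the private copy with the lock dropped, so per-item work never runs under the queue lock. A userspace sketch of the same pattern (hypothetical list types; the kernel's struct bio_list merge/init dance works the same way):

```c
#include <pthread.h>
#include <stddef.h>

/* Minimal singly linked list standing in for struct bio_list. */
struct node { struct node *next; };
struct list { struct node *head, *tail; };

static struct list deferred = { NULL, NULL };
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold the lock, as blk_queue_bio-side code would. */
static void list_add_tail(struct list *l, struct node *n)
{
	n->next = NULL;
	if (l->tail)
		l->tail->next = n;
	else
		l->head = n;
	l->tail = n;
}

/*
 * Splice the shared list into a local one under the lock, then walk the
 * local copy unlocked; per-item work (generic_make_request() in the
 * patch) therefore never runs with the queue lock held.  Returns the
 * number of items drained.
 */
static int drain(void (*process)(struct node *))
{
	struct list local;
	struct node *n;
	int count = 0;

	pthread_mutex_lock(&lock);
	local = deferred;  /* steal head and tail in one shot */
	deferred.head = deferred.tail = NULL;
	pthread_mutex_unlock(&lock);

	while ((n = local.head) != NULL) {
		local.head = n->next;
		if (process)
			process(n);
		count++;
	}
	return count;
}
```

The splice keeps the lock hold time constant regardless of how many bios were deferred, which matters when the worker contends with submitters on q->queue_lock.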