
[RFC] dm: fix excessive dm-mq context switching

Message ID 20160205151334.GA82754@redhat.com (mailing list archive)
State New, archived

Commit Message

Mike Snitzer Feb. 5, 2016, 3:13 p.m. UTC
On Thu, Feb 04 2016 at  8:54pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Feb 04 2016 at  1:54am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
> > On 02/03/2016 07:24 PM, Mike Snitzer wrote:
> > > On Wed, Feb 03 2016 at  1:04pm -0500,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > >  
> > >> I'm still not clear on where the considerable performance loss is coming
> > >> from (on null_blk device I see ~1900K read IOPs but I'm still only
> > >> seeing ~1000K read IOPs when blk-mq DM-multipath is layered ontop).
> > >> What is very much apparent is: layering dm-mq multipath ontop of null_blk
> > >> results in a HUGE amount of additional context switches.  I can only
> > >> infer that the request completion for this stacked device (blk-mq queue
> > >> ontop of blk-mq queue, with 2 completions: 1 for clone completing on
> > >> underlying device and 1 for original request completing) is the reason
> > >> for all the extra context switches.
> > > 
> > > Starts to explain, certainly not the "reason"; that is still very much
> > > TBD...
> > > 
> > >> Here are pictures of 'perf report' for perf data collected using
> > >> 'perf record -ag -e cs'.
> > >>
> > >> Against null_blk:
> > >> http://people.redhat.com/msnitzer/perf-report-cs-null_blk.png
> > > 
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
> > >   cpu          : usr=25.53%, sys=74.40%, ctx=1970, majf=0, minf=474
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
> > >   cpu          : usr=26.79%, sys=73.15%, ctx=2067, majf=0, minf=479
> > > 
> > >> Against dm-mpath ontop of the same null_blk:
> > >> http://people.redhat.com/msnitzer/perf-report-cs-dm_mq.png
> > > 
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
> > >   cpu          : usr=11.07%, sys=33.90%, ctx=667784, majf=0, minf=466
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
> > >   cpu          : usr=15.22%, sys=48.44%, ctx=2314901, majf=0, minf=466
> > > 
> > > So yeah, the percentages reflected in these respective images didn't do
> > > the huge increase in context switches justice... we _must_ figure out
> > > why we're seeing so many context switches with dm-mq.
> > > 
> > Well, the most obvious one being that you're using 1 dm-mq queue vs
> > 4 null_blk queues.
> > So you will have to do an additional context switch for 75% of
> > the total I/Os submitted.
> 
> Right, that case is certainly prone to more context switches.  But I'm
> initially most concerned about the case where both only have 1 queue.
> 
> > Have you tested with 4 dm-mq hw queues?
> 
> Yes, it makes performance worse.  This is likely rooted in dm-mpath IO
> path not being lockless.  But I also have concern about whether the
> clone, sent to the underlying path, is completing on a different cpu
> than dm-mq's original request.
> 
> I'll be using ftrace to try to dig into the various aspects of this
> (perf, as I know how to use it, isn't giving me enough precision in its
> reporting).
> 
> > To avoid context switches we would have to align the dm-mq queues to
> > the underlying blk-mq layout for the paths.
> 
> Right, we need to take more care (how remains TBD).  But for now I'm
> just going to focus on the case where both dm-mq and null_blk have 1 for
> nr_hw_queues.  As you can see even in that config the number of context
> switches goes from 1970 to 667784 (and there is a huge loss of system
> cpu utilization) once dm-mq w/ 1 hw_queue is stacked ontop of the
> null_blk device.
> 
> Once we understand the source of all the additional context switching
> for this more simplistic stacked configuration we can look closer at
> scaling as we add more underlying paths.

Following is RFC because it really speaks to dm-mq _needing_ a variant
of blk_mq_complete_request() that supports partial completions.  Not
supporting partial completions really isn't an option for DM multipath.

From: Mike Snitzer <snitzer@redhat.com>
Date: Fri, 5 Feb 2016 08:49:01 -0500
Subject: [RFC PATCH] dm: fix excessive dm-mq context switching

Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  The biggest
reason for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the queues which
ushered in ping-ponging between process context (fio in this case) and
kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_mq_insert_request() to _not_ use kblockd to submit the cloned
requests isn't enough to eliminate the observed context switches.
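
For reference, the blk-core side of this is a one-line change in
blk_insert_cloned_request() -- a sketch against the 4.5-era
blk_mq_insert_request(rq, at_head, run_queue, async) signature, with
approximate hunk context:

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_insert_request(rq, false, true, true);
+		/* @async=false: run the hw queue from the submitting context
+		 * instead of punting every cloned request to kblockd */
+		blk_mq_insert_request(rq, false, true, false);
 		return 0;
 	}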

In addition to this dm-mq specific blk-core fix, there were 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance but it doesn't handle
    partial completions -- which is a pretty big problem given the
    potential for partial completions with DM multipath due to path
    failure(s).  As such, this entire patch is only RFC-worthy.
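
For reference, fix #1 boils down to not kicking the hw queues from
dm.c:rq_completed() -- a sketch against 4.5-era dm.c helpers (the exact
committed hunk isn't reproduced in this thread):

static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
{
	atomic_dec(&md->pending[rw]);

	/* nudge anyone waiting on suspend queue */
	if (!md_in_flight(md))
		wake_up(&md->wait);

	/*
	 * Fix #1: for blk-mq, don't call blk_mq_run_hw_queues() here; it
	 * punts to kblockd and is a big source of the extra context
	 * switches.  Only the old .request_fn path still needs a queue
	 * run on completion.
	 */
	if (!md->queue->mq_ops && run_queue)
		blk_run_queue_async(md->queue);

	/* dm_put() must be at the end of this function */
	dm_put(md);
}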

dm-mq "fix" #2 is _much_ more important than #1 for eliminating the
excessive context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes the multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c |  2 +-
 drivers/md/dm.c  | 13 ++++++-------
 2 files changed, 7 insertions(+), 8 deletions(-)

Comments

Mike Snitzer Feb. 5, 2016, 6:05 p.m. UTC | #1
On Fri, Feb 05 2016 at 10:13am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> Following is RFC because it really speaks to dm-mq _needing_ a variant
> of blk_mq_complete_request() that supports partial completions.  Not
> supporting partial completions really isn't an option for DM multipath.
> 
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Fri, 5 Feb 2016 08:49:01 -0500
> Subject: [RFC PATCH] dm: fix excessive dm-mq context switching
> 
> Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
> than if an underlying null_blk device were used directly.  The biggest
> reason for this drop in performance is that blk_insert_clone_request()
> was calling blk_mq_insert_request() with @async=true.  This forced the
> use of kblockd_schedule_delayed_work_on() to run the queues which
> ushered in ping-ponging between process context (fio in this case) and
> kblockd's kworker to submit the cloned request.  The ftrace
> function_graph tracer showed:
> 
>   kworker-2013  =>   fio-12190
>   fio-12190    =>  kworker-2013
>   ...
>   kworker-2013  =>   fio-12190
>   fio-12190    =>  kworker-2013
>   ...
> 
> Fixing blk_mq_insert_request() to _not_ use kblockd to submit the cloned
> requests isn't enough to eliminate the observed context switches.
> 
> In addition to this dm-mq specific blk-core fix, there were 2 DM core
> fixes to dm-mq that (when paired with the blk-core fix) completely
> eliminate the observed context switching:
> 
> 1)  don't blk_mq_run_hw_queues in blk-mq request completion
> 
>     Motivated by desire to reduce overhead of dm-mq, punting to kblockd
>     just increases context switches.
> 
>     In my testing against a really fast null_blk device there was no benefit
>     to running blk_mq_run_hw_queues() on completion (and no other blk-mq
>     driver does this).  So hopefully this change doesn't induce the need for
>     yet another revert like commit 621739b00e16ca2d !
> 
> 2)  use blk_mq_complete_request() in dm_complete_request()
> 
>     blk_complete_request() doesn't offer the traditional q->mq_ops vs
>     .request_fn branching pattern that other historic block interfaces
>     do (e.g. blk_get_request).  Using blk_mq_complete_request() for
>     blk-mq requests is important for performance but it doesn't handle
>     partial completions -- which is a pretty big problem given the
>     potential for partial completions with DM multipath due to path
>     failure(s).  As such, this entire patch is only RFC-worthy.

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index c683f6d..a618477 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
>  	struct dm_rq_target_io *tio = tio_from_request(rq);
>  
>  	tio->error = error;
> -	blk_complete_request(rq);
> +	if (!rq->q->mq_ops)
> +		blk_complete_request(rq);
> +	else
> +		blk_mq_complete_request(rq, rq->errors);
>  }
>  
>  /*

Looking closer, DM is very likely OK just using blk_mq_complete_request.

blk_complete_request() also doesn't provide native partial completion
support (it relies on the driver to do it, which DM core does):

/**
 * blk_complete_request - end I/O on a request
 * @req:      the request being processed
 *
 * Description:
 *     Ends all I/O on a request. It does not handle partial completions,
 *     unless the driver actually implements this in its completion callback
 *     through requeueing. The actual completion happens out-of-order,
 *     through a softirq handler. The user must have registered a completion
 *     callback through blk_queue_softirq_done().
 **/

blk_mq_complete_request() is effectively implemented in a comparable
fashion to blk_complete_request().  And DM core already provides partial
completion support: dm.c:end_clone_bio() triggers requeueing of the
request via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
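
For context, the partial completion plumbing referenced above looks
roughly like this in 4.5-era dm.c (a simplified sketch of end_clone_bio):

static void end_clone_bio(struct bio *clone, int error)
{
	struct dm_rq_clone_bio_info *info =
		container_of(clone, struct dm_rq_clone_bio_info, clone);
	struct dm_rq_target_io *tio = info->tio;
	unsigned int nr_bytes = info->orig->bi_iter.bi_size;

	bio_put(clone);

	if (tio->error)
		/* an earlier clone bio already failed; leave the rest to
		 * the request-level completion */
		return;
	else if (error) {
		/* record the error: dm_done() hands it to
		 * multipath_end_io(), which can return DM_ENDIO_REQUEUE */
		tio->error = error;
		return;
	}

	/*
	 * Partial completion: credit the completed bytes against the
	 * *original* request without completing it, so a later path
	 * failure only requeues what is left.
	 */
	blk_update_request(tio->orig, 0, nr_bytes);
}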

So I'm thinking I can drop the "RFC" for this patch and run with
it.. once I get Jens' feedback (hopefully) confirming my understanding.

Jens, please advise.  If you're comfortable providing your Acked-by I
can get this fix in for 4.5-rc4 or so...

Thanks!

Mike
Mike Snitzer Feb. 5, 2016, 7:19 p.m. UTC | #2
On Fri, Feb 05 2016 at  1:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Fri, Feb 05 2016 at 10:13am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>  
> > Following is RFC because it really speaks to dm-mq _needing_ a variant
> > of blk_mq_complete_request() that supports partial completions.  Not
> > supporting partial completions really isn't an option for DM multipath.
> > 
> > From: Mike Snitzer <snitzer@redhat.com>
> > Date: Fri, 5 Feb 2016 08:49:01 -0500
> > Subject: [RFC PATCH] dm: fix excessive dm-mq context switching
> > 
> > Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
> > than if an underlying null_blk device were used directly.  The biggest
> > reason for this drop in performance is that blk_insert_clone_request()
> > was calling blk_mq_insert_request() with @async=true.  This forced the
> > use of kblockd_schedule_delayed_work_on() to run the queues which
> > ushered in ping-ponging between process context (fio in this case) and
> > kblockd's kworker to submit the cloned request.  The ftrace
> > function_graph tracer showed:
> > 
> >   kworker-2013  =>   fio-12190
> >   fio-12190    =>  kworker-2013
> >   ...
> >   kworker-2013  =>   fio-12190
> >   fio-12190    =>  kworker-2013
> >   ...
> > 
> > Fixing blk_mq_insert_request() to _not_ use kblockd to submit the cloned
> > requests isn't enough to eliminate the observed context switches.
> > 
> > In addition to this dm-mq specific blk-core fix, there were 2 DM core
> > fixes to dm-mq that (when paired with the blk-core fix) completely
> > eliminate the observed context switching:
> > 
> > 1)  don't blk_mq_run_hw_queues in blk-mq request completion
> > 
> >     Motivated by desire to reduce overhead of dm-mq, punting to kblockd
> >     just increases context switches.
> > 
> >     In my testing against a really fast null_blk device there was no benefit
> >     to running blk_mq_run_hw_queues() on completion (and no other blk-mq
> >     driver does this).  So hopefully this change doesn't induce the need for
> >     yet another revert like commit 621739b00e16ca2d !
> > 
> > 2)  use blk_mq_complete_request() in dm_complete_request()
> > 
> >     blk_complete_request() doesn't offer the traditional q->mq_ops vs
> >     .request_fn branching pattern that other historic block interfaces
> >     do (e.g. blk_get_request).  Using blk_mq_complete_request() for
> >     blk-mq requests is important for performance but it doesn't handle
> >     partial completions -- which is a pretty big problem given the
> >     potential for partial completions with DM multipath due to path
> >     failure(s).  As such, this entire patch is only RFC-worthy.
> 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index c683f6d..a618477 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
> >  	struct dm_rq_target_io *tio = tio_from_request(rq);
> >  
> >  	tio->error = error;
> > -	blk_complete_request(rq);
> > +	if (!rq->q->mq_ops)
> > +		blk_complete_request(rq);
> > +	else
> > +		blk_mq_complete_request(rq, rq->errors);
> >  }
> >  
> >  /*
> 
> Looking closer, DM is very likely OK just using blk_mq_complete_request.
> 
> blk_complete_request() also doesn't provide native partial completion
> support (it relies on the driver to do it, which DM core does):
> 
> /**
>  * blk_complete_request - end I/O on a request
>  * @req:      the request being processed
>  *
>  * Description:
>  *     Ends all I/O on a request. It does not handle partial completions,
>  *     unless the driver actually implements this in its completion callback
>  *     through requeueing. The actual completion happens out-of-order,
>  *     through a softirq handler. The user must have registered a completion
>  *     callback through blk_queue_softirq_done().
>  **/
> 
> blk_mq_complete_request() is effectively implemented in a comparable
> fashion to blk_complete_request().  And DM core already provides partial
> completion support: dm.c:end_clone_bio() triggers requeueing of the
> request via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
> 
> So I'm thinking I can drop the "RFC" for this patch and run with
> it.. once I get Jens' feedback (hopefully) confirming my understanding.
> 
> Jens, please advise.  If you're comfortable providing your Acked-by I
> can get this fix in for 4.5-rc4 or so...

FYI, here is the latest revised patch:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2

(revised patch header and fixed a thinko in the dm.c:rq_completed()
change from the RFC patch I posted earlier)
Sagi Grimberg Feb. 7, 2016, 3:41 p.m. UTC | #3
> FYI, here is the latest revised patch:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
>
> (revised patch header and fixed a thinko in the dm.c:rq_completed()
> change from the RFC patch I posted earlier)

Hi Mike,

So I gave your patches a go (dm-4.6) but I still don't see the
improvement you reported (while I do see a minor improvement).

null_blk queue_mode=2 submit_queues=24
dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y

I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.

Is there something I'm missing?
Mike Snitzer Feb. 7, 2016, 4:07 p.m. UTC | #4
On Sun, Feb 07 2016 at 10:41am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >FYI, here is the latest revised patch:
> >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
> >
> >(revised patch header and fixed a thinko in the dm.c:rq_completed()
> >change from the RFC patch I posted earlier)
> 
> Hi Mike,
> 
> So I gave your patches a go (dm-4.6) but I still don't see the
> improvement you reported (while I do see a minor improvement).
> 
> null_blk queue_mode=2 submit_queues=24
> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
> 
> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.

blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
the first being the most important, it shouldn't hurt either provided
you have 24 cpus).

Could be you have multiple NUMA nodes and are seeing problems from that?

I have 12 cpus (in the same physical cpu) and only a single NUMA node.
I get the same results as blk_mq_nr_hw_queues=12 with
blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
I've seen my IOPs go from ~950K to ~1400K.  The peak null_blk can get on
my setup is ~1950K.  So I'm still seeing a ~25% drop with dm-mq (but
that is much better than the over 50% drop I was seeing).

> Is there something I'm missing?

Not sure, I just emailed out all my patches (and cc'd you).  Please
verify you're using the latest here (same as 'dm-4.6' branch):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

I rebased a couple times... so please diff what you have tested against
this latest 'dm-4.6' branch.
Bart Van Assche Feb. 7, 2016, 4:37 p.m. UTC | #5
On 02/07/16 07:41, Sagi Grimberg wrote:
>
>> FYI, here is the latest revised patch:
>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
>>
>> (revised patch header and fixed a thinko in the dm.c:rq_completed()
>> change from the RFC patch I posted earlier)
>
> Hi Mike,
>
> So I gave your patches a go (dm-4.6) but I still don't see the
> improvement you reported (while I do see a minor improvement).
>
> null_blk queue_mode=2 submit_queues=24
> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>
> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.
>
> Is there something I'm missing?

Hello Sagi,

Did you run your test on a NUMA system? If so, can you check with e.g.
perf record -ags -e LLC-load-misses sleep 10 && perf report whether this 
workload perhaps triggers lock contention? What you need to look for in
the perf output is whether any functions occupy more than 10% CPU time.

Bart.


Sagi Grimberg Feb. 7, 2016, 4:42 p.m. UTC | #6
>> Hi Mike,
>>
>> So I gave your patches a go (dm-4.6) but I still don't see the
>> improvement you reported (while I do see a minor improvement).
>>
>> null_blk queue_mode=2 submit_queues=24
>> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>>
>> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.
>
> blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
> the first being the most important, it shouldn't hurt either provided
> you have 24 cpus).

I tried with less but as you said, it didn't have an impact...

> Could be you have multiple NUMA nodes and are seeing problems from that?

I am running on a dual-socket server; this can most likely be the
culprit...

> I have 12 cpus (in the same physical cpu) and only a single NUMA node.
> I get the same results as blk_mq_nr_hw_queues=12 with
> blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
> I've seen my IOPs go from ~950K to ~1400K.  The peak null_blk can get on
> my setup is ~1950K.  So I'm still seeing a ~25% drop with dm-mq (but
> that is much better than the over 50% drop I saw seeing).

That's what I was planning on :(

>> Is there something I'm missing?
>
> Not sure, I just emailed out all my patches (and cc'd you).  Please
> verify you're using the latest here (same as 'dm-4.6' branch):
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
>
> I rebased a couple times... so please diff what you have tested against
> this latest 'dm-4.6' branch.

I am. I'll try to instrument what's going on...
Sagi Grimberg Feb. 7, 2016, 4:43 p.m. UTC | #7
> Hello Sagi,

Hey Bart,

> Did you run your test on a NUMA system?

I did.

> If so, can you check with e.g.
> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> workload perhaps triggers lock contention? What you need to look for in
> the perf output is whether any functions occupy more than 10% CPU time.

I will, thanks for the tip!
Mike Snitzer Feb. 7, 2016, 4:53 p.m. UTC | #8
On Sun, Feb 07 2016 at 11:43am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >Hello Sagi,
> 
> Hey Bart,
> 
> >Did you run your test on a NUMA system?
> 
> I did.
> 
> >If so, can you check with e.g.
> >perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >workload perhaps triggers lock contention? What you need to look for in
> >the perf output is whether any functions occupy more than 10% CPU time.
> 
> I will, thanks for the tip!

Also, I found ftrace's function_graph tracer very helpful (it is how I
found the various issues fixed by the first context switch patch).  Here
is my latest script:

#!/bin/sh

set -xv

NULL_BLK_HW_QUEUES=4
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=4
DM_MQ_QUEUE_DEPTH=2048

FIO=/root/snitm/git/fio/fio
FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

PERF=perf
#PERF=/root/snitm/git/linux/tools/perf/perf

run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    PERF_RECORD=$2
    RUN_CMD="${FIO} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"

    if [ ! -z "${PERF_RECORD}" ]; then
        ${PERF_RECORD} ${RUN_CMD}
        mv perf.data perf.data.${TASK_NAME}
    else
        ${RUN_CMD}
    fi
}

run_fio_with_ftrace() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})

    echo > /sys/kernel/debug/tracing/trace
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    echo function_graph > /sys/kernel/debug/tracing/current_tracer
    echo 1 > /sys/kernel/debug/tracing/tracing_on
    run_fio $DEVICE
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    cat /sys/kernel/debug/tracing/trace > trace.${TASK_NAME}
    echo nop > /sys/kernel/debug/tracing/current_tracer
}

dmsetup remove dm_mq
modprobe -r null_blk

# null_blk: queue_mode=2 selects blk-mq, irqmode=1 completes IOs via softirq
modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}
#run_fio /dev/nullb0 "${PERF} record -ag -e cs"
#run_fio /dev/nullb0 "${PERF} stat"

echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/blk_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/blk_mq_nr_hw_queues
echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1000 1" | dmsetup create dm_mq
#echo "0 8388608 linear /dev/nullb0 0" | dmsetup create dm_mq

run_fio_with_ftrace /dev/mapper/dm_mq

#run_fio /dev/mapper/dm_mq
#run_fio /dev/mapper/dm_mq "${PERF} record -ag -e cs"
#run_fio /dev/mapper/dm_mq "${PERF} record -ag"
#run_fio /dev/mapper/dm_mq "${PERF} stat"

#run_fio /dev/mapper/dm_mq "trace-cmd record -e all"
Sagi Grimberg Feb. 7, 2016, 4:54 p.m. UTC | #9
>> If so, can you check with e.g.
>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>> workload perhaps triggers lock contention? What you need to look for in
>> the perf output is whether any functions occupy more than 10% CPU time.
>
> I will, thanks for the tip!

The perf report is very similar to the one that started this effort..

I'm afraid we'll need to resolve the per-target m->lock in order
to scale with NUMA...

-  17.33%              fio  [kernel.kallsyms]        [k] queued_spin_lock_slowpath
    - queued_spin_lock_slowpath
       - 52.09% _raw_spin_lock_irq
            __multipath_map
            multipath_clone_and_map
            map_request
            dm_mq_queue_rq
            __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       - 46.87% _raw_spin_lock_irqsave
          - 99.97% multipath_busy
               dm_mq_queue_rq
               __blk_mq_run_hw_queue
               blk_mq_run_hw_queue
               blk_mq_insert_requests
               blk_mq_flush_plug_list
               blk_flush_plug_list
               blk_finish_plug
               do_io_submit
               SyS_io_submit
               entry_SYSCALL_64_fastpath
             + io_submit
+   4.99%              fio  [kernel.kallsyms]        [k] blk_account_io_start
+   3.93%              fio  [dm_multipath]           [k] __multipath_map
+   2.64%              fio  [dm_multipath]           [k] multipath_busy
+   2.38%              fio  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
+   2.31%              fio  [dm_mod]                 [k] dm_mq_queue_rq
+   2.25%              fio  [kernel.kallsyms]        [k] blk_mq_hctx_mark_pending
+   1.81%              fio  [kernel.kallsyms]        [k] blk_queue_enter
+   1.61%             perf  [kernel.kallsyms]        [k] copy_user_generic_string
+   1.40%              fio  [kernel.kallsyms]        [k] __blk_mq_run_hw_queue
+   1.26%              fio  [kernel.kallsyms]        [k] part_round_stats
+   1.14%              fio  [kernel.kallsyms]        [k] _raw_spin_lock_irq
+   0.96%              fio  [kernel.kallsyms]        [k] __bt_get
+   0.73%              fio  [kernel.kallsyms]        [k] enqueue_task_fair
+   0.71%              fio  [kernel.kallsyms]        [k] enqueue_entity
+   0.69%              fio  [dm_mod]                 [k] dm_start_request
+   0.60%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.59%     ksoftirqd/10  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.59%              fio  [kernel.kallsyms]        [k] _raw_spin_unlock_irqrestore
+   0.58%     ksoftirqd/19  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.58%     ksoftirqd/18  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.58%     ksoftirqd/23  [kernel.kallsyms]        [k] blk_mq_run_hw_queues

Mike Snitzer Feb. 7, 2016, 5:20 p.m. UTC | #10
On Sun, Feb 07 2016 at 11:54am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>If so, can you check with e.g.
> >>perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>workload perhaps triggers lock contention? What you need to look for in
> >>the perf output is whether any functions occupy more than 10% CPU time.
> >
> >I will, thanks for the tip!
> 
> The perf report is very similar to the one that started this effort..
> 
> I'm afraid we'll need to resolve the per-target m->lock in order
> to scale with NUMA...

Could be.  Just for testing, you can try the 2 topmost commits I've put
here (once applied both __multipath_map and multipath_busy won't have
_any_ locking.. again, very much test-only):

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
Sagi Grimberg Feb. 8, 2016, 12:21 p.m. UTC | #11
>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
>
> Could be.  Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
>
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2

Hi Mike,

So I still don't see the IOPs scale like I expected. With these two
patches applied I see ~670K IOPs while the perf output is different
and does not indicate a clear lock contention.

--
-   4.67%              fio  [kernel.kallsyms]        [k] blk_account_io_start
    - blk_account_io_start
       - 56.05% blk_insert_cloned_request
            map_request
            dm_mq_queue_rq
            __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       - 43.94% blk_mq_bio_to_request
            blk_mq_make_request
            generic_make_request
            submit_bio
            do_blockdev_direct_IO
            __blockdev_direct_IO
            blkdev_direct_IO
            generic_file_read_iter
            blkdev_read_iter
            aio_run_iocb
            io_submit_one
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.52%              fio  [dm_mod]                 [k] dm_mq_queue_rq
    - dm_mq_queue_rq
       - 99.16% __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       + 0.84% blk_mq_run_hw_queue
-   2.46%              fio  [kernel.kallsyms]        [k] blk_mq_hctx_mark_pending
    - blk_mq_hctx_mark_pending
       - 99.79% blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.07%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
    - blk_mq_run_hw_queues
       - 99.70% rq_completed
            dm_done
            dm_softirq_done
            blk_done_softirq
          + __do_softirq
+   2.06%      ksoftirqd/0  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.02%      ksoftirqd/9  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.00%     ksoftirqd/20  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.00%     ksoftirqd/12  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.99%     ksoftirqd/11  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.97%     ksoftirqd/18  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.96%      ksoftirqd/1  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.95%     ksoftirqd/14  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.95%     ksoftirqd/13  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.94%      ksoftirqd/5  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.94%      ksoftirqd/8  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.93%      ksoftirqd/2  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%     ksoftirqd/21  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%     ksoftirqd/17  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%      ksoftirqd/7  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.91%     ksoftirqd/23  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.84%      ksoftirqd/4  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.81%     ksoftirqd/19  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.76%      ksoftirqd/3  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.76%     ksoftirqd/16  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.75%     ksoftirqd/15  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.74%     ksoftirqd/22  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.72%     ksoftirqd/10  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.38%             perf  [kernel.kallsyms]        [k] copy_user_generic_string
+   1.20%              fio  [kernel.kallsyms]        [k] enqueue_task_fair
+   1.18%              fio  [kernel.kallsyms]        [k] part_round_stats
+   1.08%              fio  [kernel.kallsyms]        [k] enqueue_entity
+   1.07%              fio  [kernel.kallsyms]        [k] _raw_spin_lock
+   1.02%              fio  [kernel.kallsyms]        [k] __blk_mq_run_hw_queue
+   0.79%              fio  [dm_multipath]           [k] multipath_busy
+   0.57%              fio  [kernel.kallsyms]        [k] insert_work
+   0.54%              fio  [kernel.kallsyms]        [k] blk_flush_plug_list
--
Mike Snitzer Feb. 8, 2016, 2:34 p.m. UTC | #12
On Mon, Feb 08 2016 at  7:21am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>The perf report is very similar to the one that started this effort..
> >>
> >>I'm afraid we'll need to resolve the per-target m->lock in order
> >>to scale with NUMA...
> >
> >Could be.  Just for testing, you can try the 2 topmost commits I've put
> >here (once applied both __multipath_map and multipath_busy won't have
> >_any_ locking.. again, very much test-only):
> >
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> 
> Hi Mike,
> 
> So I still don't see the IOPs scale like I expected. With these two
> patches applied I see ~670K IOPs while the perf output is different
> and does not indicate a clear lock contention.

Right, perf (with default events) isn't the right tool to track this down.

But I'm seeing something that speaks to you not running with the first
context switching fix (which seems odd):

> -   2.07%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
>    - blk_mq_run_hw_queues
>       - 99.70% rq_completed
>            dm_done
>            dm_softirq_done
>            blk_done_softirq
>          + __do_softirq

As you can see here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=a5b835282422ec41991c1dbdb88daa4af7d166d2

rq_completed() shouldn't be calling blk_mq_run_hw_queues() with the
latest code.

Please triple check you have the latest code, e.g.:
git diff snitzer/devel2
Hannes Reinecke Feb. 9, 2016, 7:50 a.m. UTC | #13
On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> On Sun, Feb 07 2016 at 11:54am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
>>
>>>> If so, can you check with e.g.
>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>> workload perhaps triggers lock contention? What you need to look for in
>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>
>>> I will, thanks for the tip!
>>
>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
> 
> Could be.  Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
> 
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> 
So, I gave those patches a spin.
Sad to say, they do _not_ resolve the issue fully.

My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
those patches.
Using a single path (without those patches, but still running
multipath on top of that path) the same testbed yields 550k IOPs.
Which very much smells like a lock contention ...
We do get a slight improvement, though; without those patches I
could only get about 350k IOPs. But still, I would somehow expect 2
paths to be faster than just one ..

Cheers,

Hannes
Mike Snitzer Feb. 9, 2016, 2:55 p.m. UTC | #14
On Tue, Feb 09 2016 at  2:50am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> > On Sun, Feb 07 2016 at 11:54am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> > 
> >>
> >>>> If so, can you check with e.g.
> >>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>>> workload perhaps triggers lock contention? What you need to look for in
> >>>> the perf output is whether any functions occupy more than 10% CPU time.
> >>>
> >>> I will, thanks for the tip!
> >>
> >> The perf report is very similar to the one that started this effort..
> >>
> >> I'm afraid we'll need to resolve the per-target m->lock in order
> >> to scale with NUMA...
> > 
> > Could be.  Just for testing, you can try the 2 topmost commits I've put
> > here (once applied both __multipath_map and multipath_busy won't have
> > _any_ locking.. again, very much test-only):
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> > 
> So, I gave those patches a spin.
> Sad to say, they do _not_ resolve the issue fully.
>
> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> those patches.

That isn't a surprise.  We knew the m->lock spinlock contention to be a
problem.  And NUMA makes it even worse.

> Using a single path (without those patches, but still running
> multipath on top of that path) the same testbed yields 550k IOPs.
> Which very much smells like a lock contention ...
> We do get a slight improvement, though; without those patches I
> could only get about 350k IOPs. But still, I would somehow expect 2
> paths to be faster than just one ..

https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html

hint hint...
Hannes Reinecke Feb. 9, 2016, 3:32 p.m. UTC | #15
On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> On Tue, Feb 09 2016 at  2:50am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
>>> On Sun, Feb 07 2016 at 11:54am -0500,
>>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>
>>>>
>>>>>> If so, can you check with e.g.
>>>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>>>> workload perhaps triggers lock contention? What you need to look for in
>>>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>>>
>>>>> I will, thanks for the tip!
>>>>
>>>> The perf report is very similar to the one that started this effort..
>>>>
>>>> I'm afraid we'll need to resolve the per-target m->lock in order
>>>> to scale with NUMA...
>>>
>>> Could be.  Just for testing, you can try the 2 topmost commits I've put
>>> here (once applied both __multipath_map and multipath_busy won't have
>>> _any_ locking.. again, very much test-only):
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
>>>
>> So, I gave those patches a spin.
>> Sad to say, they do _not_ resolve the issue fully.
>>
>> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
>> those patches.
> 
> That isn't a surprise.  We knew the m->lock spinlock contention to be a
> problem.  And NUMA makes it even worse.
> 
>> Using a single path (without those patches, but still running
>> multipath on top of that path) the same testbed yields 550k IOPs.
>> Which very much smells like a lock contention ...
>> We do get a slight improvement, though; without those patches I
>> could only get about 350k IOPs. But still, I would somehow expect 2
>> paths to be faster than just one ..
> 
> https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
> 
> hint hint...
> 
I hoped they wouldn't be needed with your patches.
Plus perf revealed that I first need to address spinlock
contention in the lpfc driver before that would even make sense.

So more debugging to follow.

Cheers,

Hannes
Mike Snitzer Feb. 10, 2016, 12:45 a.m. UTC | #16
On Tue, Feb 09 2016 at 10:32am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> > On Tue, Feb 09 2016 at  2:50am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> > 
> >> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> >>> On Sun, Feb 07 2016 at 11:54am -0500,
> >>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >>>
> >>>>
> >>>>>> If so, can you check with e.g.
> >>>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>>>>> workload perhaps triggers lock contention? What you need to look for in
> >>>>>> the perf output is whether any functions occupy more than 10% CPU time.
> >>>>>
> >>>>> I will, thanks for the tip!
> >>>>
> >>>> The perf report is very similar to the one that started this effort..
> >>>>
> >>>> I'm afraid we'll need to resolve the per-target m->lock in order
> >>>> to scale with NUMA...
> >>>
> >>> Could be.  Just for testing, you can try the 2 topmost commits I've put
> >>> here (once applied both __multipath_map and multipath_busy won't have
> >>> _any_ locking.. again, very much test-only):
> >>>
> >>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> >>>
> >> So, I gave those patches a spin.
> >> Sad to say, they do _not_ resolve the issue fully.
> >>
> >> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> >> those patches.
> > 
> > That isn't a surprise.  We knew the m->lock spinlock contention to be a
> > problem.  And NUMA makes it even worse.
> > 
> >> Using a single path (without those patches, but still running
> >> multipath on top of that path) the same testbed yields 550k IOPs.
> >> Which very much smells like a lock contention ...
> >> We do get a slight improvement, though; without those patches I
> >> could only get about 350k IOPs. But still, I would somehow expect 2
> >> paths to be faster than just one ..
> > 
> > https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
> > 
> > hint hint...
> > 
> I hoped they wouldn't be needed with your patches.
> Plus perf revealed that I first need to address spinlock
> contention in the lpfc driver before that would even make sense.
> 
> So more debugging to follow.

OK, I took a crack at embracing RCU.  Only slightly better performance
on my single NUMA node testbed.  (But I'll have to track down a system
with multiple NUMA nodes to do any justice to the next wave of this
optimization effort)

This RCU work is very heavy-handed and way too fiddly (there could
easily be bugs).  Anyway, please see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745

But this might give you something to build on to arrive at something
more scalable?
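
For illustration, the read side of such a conversion looks roughly like
the following -- a hypothetical sketch with assumed names, not the
actual code in that branch:

/*
 * Resolve the current path under rcu_read_lock() instead of m->lock.
 * Assumes the (rare) update side publishes with
 * rcu_assign_pointer(m->current_pgpath, ...) and that pgpath lifetime
 * is tied to the live table, as in dm-mpath.
 */
static struct block_device *multipath_get_bdev(struct multipath *m)
{
	struct pgpath *pgpath;
	struct block_device *bdev = NULL;

	rcu_read_lock();
	pgpath = rcu_dereference(m->current_pgpath);
	if (pgpath)
		bdev = pgpath->path.dev->bdev;
	rcu_read_unlock();

	return bdev;
}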

Mike
Mike Snitzer Feb. 11, 2016, 1:50 a.m. UTC | #17
On Tue, Feb 09 2016 at  7:45pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> 
> OK, I took a crack at embracing RCU.  Only slightly better performance
> on my single NUMA node testbed.  (But I'll have to track down a system
> with multiple NUMA nodes to do any justice to the next wave of this
> optimization effort)
> 
> This RCU work is very heavy-handed and way too fiddly (there could
> easily be bugs).  Anyway, please see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> 
> But this might give you something to build on to arrive at something
> more scalable?

I've a bit more polished version of this work (broken up into multiple
commits, with some fixes, etc) here:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3

Hannes and/or Sagi, if you get a chance to try this on your NUMA system
please let me know how it goes.

Mike
Mike Snitzer Feb. 11, 2016, 3:35 a.m. UTC | #18
On Wed, Feb 10 2016 at  8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 09 2016 at  7:45pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > OK, I took a crack at embracing RCU.  Only slightly better performance
> > on my single NUMA node testbed.  (But I'll have to track down a system
> > with multiple NUMA nodes to do any justice to the next wave of this
> > optimization effort)
> > 
> > This RCU work is very heavy-handed and way too fiddly (there could
> > easily be bugs).  Anyway, please see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> > 
> > But this might give you something to build on to arrive at something
> > more scalable?
> 
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> 
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

FYI, with these changes my single NUMA node testbed's read IOPs went
from:

 ~1310K to ~1410K w/ nr_hw_queues dm-mq=4 and null_blk=4
 ~1330K to ~1415K w/ nr_hw_queues dm-mq=4 and null_blk=12
 ~1365K to ~1425K w/ nr_hw_queues dm-mq=12 and null_blk=12
Mike Snitzer Feb. 11, 2016, 3:34 p.m. UTC | #19
On Wed, Feb 10 2016 at  8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 09 2016 at  7:45pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > OK, I took a crack at embracing RCU.  Only slightly better performance
> > on my single NUMA node testbed.  (But I'll have to track down a system
> > with multiple NUMA nodes to do any justice to the next wave of this
> > optimization effort)
> > 
> > This RCU work is very heavy-handed and way too fiddly (there could
> > easily be bugs).  Anyway, please see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> > 
> > But this might give you something to build on to arrive at something
> > more scalable?
> 
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> 
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

Initial review has uncovered some locking problems with the current code
(nothing that caused crashes or hangs in my testing but...) so please
hold off on testing until you hear from me (hopefully tomorrow).

Thanks,
Mike
Hannes Reinecke Feb. 12, 2016, 3:18 p.m. UTC | #20
On 02/11/2016 04:34 PM, Mike Snitzer wrote:
> On Wed, Feb 10 2016 at  8:50pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Tue, Feb 09 2016 at  7:45pm -0500,
>> Mike Snitzer <snitzer@redhat.com> wrote:
>>
>>>
>>> OK, I took a crack at embracing RCU.  Only slightly better performance
>>> on my single NUMA node testbed.  (But I'll have to track down a system
>>> with multiple NUMA nodes to do any justice to the next wave of this
>>> optimization effort)
>>>
>>> This RCU work is very heavy-handed and way too fiddly (there could
>>> easily be bugs).  Anyway, please see:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
>>>
>>> But this might give you something to build on to arrive at something
>>> more scalable?
>>
>> I've a bit more polished version of this work (broken up into multiple
>> commits, with some fixes, etc) here:
>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>
>> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
>> please let me know how it goes.
> 
> Initial review has uncovered some locking problems with the current code
> (nothing that caused crashes or hangs in my testing but...) so please
> hold off on testing until you hear from me (hopefully tomorrow).
> 
Good news is that I've managed to hit the roof for my array with the
devel2 version of those patches. (And a _heavily_ patched-up lpfc
driver :-)
So from that perspective everything's fine now; we've reached the
hardware limit for my setup.
Which in itself is quite impressive; beating Intel P3700 with 16FC
is not bad methinks :-)

So thanks for all your work here.

Cheers,

Hannes
Mike Snitzer Feb. 12, 2016, 3:26 p.m. UTC | #21
On Fri, Feb 12 2016 at 10:18am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/11/2016 04:34 PM, Mike Snitzer wrote:
> > On Wed, Feb 10 2016 at  8:50pm -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> >> On Tue, Feb 09 2016 at  7:45pm -0500,
> >> Mike Snitzer <snitzer@redhat.com> wrote:
> >>
> >>>
> >>> OK, I took a crack at embracing RCU.  Only slightly better performance
> >>> on my single NUMA node testbed.  (But I'll have to track down a system
> >>> with multiple NUMA nodes to do any justice to the next wave of this
> >>> optimization effort)
> >>>
> >>> This RCU work is very heavy-handed and way too fiddly (there could
> >>> easily be bugs).  Anyway, please see:
> >>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> >>>
> >>> But this might give you something to build on to arrive at something
> >>> more scalable?
> >>
> >> I've a bit more polished version of this work (broken up into multiple
> >> commits, with some fixes, etc) here:
> >> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> >>
> >> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> >> please let me know how it goes.
> > 
> > Initial review has uncovered some locking problems with the current code
> > (nothing that caused crashes or hangs in my testing but...) so please
> > hold off on testing until you hear from me (hopefully tomorrow).
> > 
> Good news is that I've managed to hit the roof for my array with the
> devel2 version of those patches. (And a _heavily_ patched-up lpfc
> driver :-)
> So from that perspective everything's fine now; we've reached the
> hardware limit for my setup.
> Which in itself is quite impressive; beating Intel P3700 with 16FC
> is not bad methinks :-)
> 
> So thanks for all your work here.

Ah, that's really good news!  But devel2 is definitely _not_ destined
for upstream.  'devel3' is much closer to "ready".  But your testing and
review would really push it forward.

Please see/test:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3

Also, please read this header:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a

Even with devel2 I hacked it such that repeat_count > 1 is effectively
broken.  I'm now _seriously_ considering deprecating repeat_count
completely (adding a DMWARN that will inform the user, e.g.:
"repeat_count > 1 is no longer supported").  I see no point going to
great lengths to maintain a dm-mpath feature that was only a hack for
when dm-mpath was bio-based.  What do you think?
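
The warning itself could be as simple as clamping in each path
selector's .add_path argument parsing -- a sketch, not necessarily the
eventual patch:

	if (repeat_count > 1) {
		DMWARN_LIMIT("repeat_count > 1 is deprecated, using 1 instead");
		repeat_count = 1;
	}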

Thanks,
Mike
Hannes Reinecke Feb. 12, 2016, 4:04 p.m. UTC | #22
On 02/12/2016 04:26 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 10:18am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 02/11/2016 04:34 PM, Mike Snitzer wrote:
>>> On Wed, Feb 10 2016 at  8:50pm -0500,
>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>
>>>> On Tue, Feb 09 2016 at  7:45pm -0500,
>>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>>
>>>>>
>>>>> OK, I took a crack at embracing RCU.  Only slightly better performance
>>>>> on my single NUMA node testbed.  (But I'll have to track down a system
>>>>> with multiple NUMA nodes to do any justice to the next wave of this
>>>>> optimization effort)
>>>>>
>>>>> This RCU work is very heavy-handed and way too fiddly (there could
>>>>> easily be bugs).  Anyway, please see:
>>>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
>>>>>
>>>>> But this might give you something to build on to arrive at something
>>>>> more scalable?
>>>>
>>>> I've a bit more polished version of this work (broken up into multiple
>>>> commits, with some fixes, etc) here:
>>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>>>
>>>> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
>>>> please let me know how it goes.
>>>
>>> Initial review has uncovered some locking problems with the current code
>>> (nothing that caused crashes or hangs in my testing but...) so please
>>> hold off on testing until you hear from me (hopefully tomorrow).
>>>
>> Good news is that I've managed to hit the roof for my array with the
>> devel2 version of those patches. (And a _heavily_ patched-up lpfc
>> driver :-)
>> So from that perspective everything's fine now; we've reached the
>> hardware limit for my setup.
>> Which in itself is quite impressive; beating Intel P3700 with 16FC
>> is not bad methinks :-)
>>
>> So thanks for all your work here.
>
> Ah, that's really good news!  But devel2 is definitely _not_ destined
> for upstream.  'devel3' is much closer to "ready".  But your testing and
> review would really push it forward.
>
> Please see/test:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>
> Also, please read this header:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
>
> Even with devel2 I hacked it such that repeat_count > 1 is effectively
> broken.  I'm now _seriously_ considering deprecating repeat_count
> completely (adding a DMWARN to inform the user, e.g.
> "repeat_count > 1 is no longer supported").  I see no point going to
> great lengths to maintain a dm-mpath feature that was only a hack for
> when dm-mpath was bio-based.  What do you think?
>
Drop it, and make setting it a no-op.
Never liked it anyway, and these decisions should really be delegated to 
the path selector.
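
To make that concrete: a selector could keep any batching entirely in
its own state. Rough sketch only; the struct fields and the
pick_next_path() helper are made up, but the .select_path signature
(with its repeat_count out-parameter) matches what the selectors
implement today:

struct sticky_ps {
	struct dm_path *current_path;
	unsigned remaining;
	unsigned batch;
};

static struct dm_path *sticky_select_path(struct path_selector *ps,
					  unsigned *repeat_count,
					  size_t nr_bytes)
{
	struct sticky_ps *s = ps->context;

	*repeat_count = 1;	/* never ask the mpath core to batch for us */

	/* Stay on the current path until this selector's own budget runs out. */
	if (s->current_path && --s->remaining)
		return s->current_path;

	s->current_path = pick_next_path(s);	/* made-up helper */
	s->remaining = s->batch;
	return s->current_path;
}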

Cheers,

Hannes
Mike Snitzer Feb. 12, 2016, 6 p.m. UTC | #23
On Fri, Feb 12 2016 at 11:04am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/12/2016 04:26 PM, Mike Snitzer wrote:
> >On Fri, Feb 12 2016 at 10:18am -0500,
> >Hannes Reinecke <hare@suse.de> wrote:
> >>>
> >>Good news is that I've managed to hit the roof for my array with the
> >>devel2 version of those patches. (And a _heavily_ patched-up lpfc
> >>driver :-)
> >>So from that perspective everything's fine now; we've reached the
> >>hardware limit for my setup.
> >>Which in itself is quite impressive; beating Intel P3700 with 16FC
> >>is not bad methinks :-)
> >>
> >>So thanks for all your work here.
> >
> >Ah, that's really good news!  But devel2 is definitely _not_ destined
> >for upstream.  'devel3' is much closer to "ready".  But your testing and
> >review would really push it forward.
> >
> >Please see/test:
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> >
> >Also, please read this header:
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
> >
> >Even with devel2 I hacked it such that repeat_count > 1 is effectively
> >broken.  I'm now _seriously_ considering deprecating repeat_count
> >completely (adding a DMWARN to inform the user, e.g.
> >"repeat_count > 1 is no longer supported").  I see no point going to
> >great lengths to maintain a dm-mpath feature that was only a hack for
> >when dm-mpath was bio-based.  What do you think?
>
> Drop it, and make setting it a no-op.
> Never liked it anyway, and these decisions should really be
> delegated to the path selector.

Sure, but my point is DM-mpath will no longer be able to provide the
ability to properly handle repeat_count > 1 (because updating the
->current_pgpath crushes the write-side of the RCU).

As such I've rebased 'devel3' to impose repeat_count = 1 (both in
dm-mpath.c and the defaults in each path selector).
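
The tension, in a simplified sketch (lock and field names are
illustrative, not the devel3 code): reads of the current path can stay
lockless under RCU, but every reassignment has to take the serialized
write side, so any per-I/O repeat_count bookkeeping would put that cost
right back on the hot path:

/* Read side: hot path; caller holds rcu_read_lock(), no m->lock. */
static struct pgpath *current_path_fast(struct multipath *m)
{
	return rcu_dereference(m->current_pgpath);
}

/*
 * Write side: serialized.  Cheap when paths change rarely; a
 * scalability killer if repeat_count accounting forced it per request.
 */
static void publish_current_path(struct multipath *m, struct pgpath *pgpath)
{
	unsigned long flags;

	spin_lock_irqsave(&m->lock, flags);
	rcu_assign_pointer(m->current_pgpath, pgpath);
	spin_unlock_irqrestore(&m->lock, flags);
}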
Hannes Reinecke Feb. 15, 2016, 6:47 a.m. UTC | #24
On 02/12/2016 07:00 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 11:04am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 02/12/2016 04:26 PM, Mike Snitzer wrote:
>>> On Fri, Feb 12 2016 at 10:18am -0500,
>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>> Good news is that I've managed to hit the roof for my array with the
>>>> devel2 version of those patches. (And a _heavily_ patched-up lpfc
>>>> driver :-)
>>>> So from that perspective everything's fine now; we've reached the
>>>> hardware limit for my setup.
>>>> Which in itself is quite impressive; beating Intel P3700 with 16FC
>>>> is not bad methinks :-)
>>>>
>>>> So thanks for all your work here.
>>>
>>> Ah, that's really good news!  But devel2 is definitely _not_ destined
>>> for upstream.  'devel3' is much closer to "ready".  But your testing and
>>> review would really push it forward.
>>>
>>> Please see/test:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>>
>>> Also, please read this header:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
>>>
>>> Even with devel2 I hacked it such that repeat_count > 1 is effectively
>>> broken.  I'm now _seriously_ considering deprecating repeat_count
>>> completely (adding a DMWARN to inform the user, e.g.
>>> "repeat_count > 1 is no longer supported").  I see no point going to
>>> great lengths to maintain a dm-mpath feature that was only a hack for
>>> when dm-mpath was bio-based.  What do you think?
>>
>> Drop it, and make setting it a no-op.
>> Never liked it anyway, and these decisions should really be
>> delegated to the path selector.
> 
> Sure, but my point is DM-mpath will no longer be able to provide the
> ability to properly handle repeat_count > 1 (because updating the
> ->current_pgpath crushes the write-side of the RCU).
> 
Fully understood. But as I said, the _need_ for repeat_count has
essentially vanished with the move to request-based multipath.

> As such I've rebased 'devel3' to impose repeat_count = 1 (both in
> dm-mpath.c and the defaults in each path selector).

Kewl.
I'll give it a go.

Cheers,

Hannes
diff mbox

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index ab51685..c60e233 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2198,7 +2198,7 @@  int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_insert_request(rq, false, true, true);
+		blk_mq_insert_request(rq, false, true, false);
 		return 0;
 	}
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c683f6d..a618477 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1119,12 +1119,8 @@  static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
 	 * back into ->request_fn() could deadlock attempting to grab the
 	 * queue lock again.
 	 */
-	if (run_queue) {
-		if (md->queue->mq_ops)
-			blk_mq_run_hw_queues(md->queue, true);
-		else
-			blk_run_queue_async(md->queue);
-	}
+	if (!md->queue->mq_ops && run_queue)
+		blk_run_queue_async(md->queue);
 
 	/*
 	 * dm_put() must be at the end of this function. See the comment above
@@ -1344,7 +1340,10 @@  static void dm_complete_request(struct request *rq, int error)
 	struct dm_rq_target_io *tio = tio_from_request(rq);
 
 	tio->error = error;
-	blk_complete_request(rq);
+	if (!rq->q->mq_ops)
+		blk_complete_request(rq);
+	else
+		blk_mq_complete_request(rq, rq->errors);
 }
 
 /*