
krbd blk-mq support ?

Message ID 20141024105501.GA14699@infradead.org (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Oct. 24, 2014, 10:55 a.m. UTC
If you're willing to experiment, give the patches below a try.  Note that
I don't have a ceph test cluster available, so the conversion is
untested.
From 00668f00afc6f0cfbce05d1186116469c1f3f9b3 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 24 Oct 2014 11:53:36 +0200
Subject: blk-mq: handle single queue case in blk_mq_hctx_next_cpu

Don't duplicate the code that handles the case where work is not bounced
to a specific CPU in the callers; handle it inside blk_mq_hctx_next_cpu instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

Comments

Alexandre DERUMIER Oct. 24, 2014, 12:27 p.m. UTC | #1
>>If you're willing to experiment give the patches below a try, not that 
>>I don't have a ceph test cluster available, so the conversion is 
>>untestested. 

OK, thanks! I'll try them and see if I can improve qemu performance on a single drive with multiple queues.

----- Mail original ----- 

De: "Christoph Hellwig" <hch@infradead.org> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Vendredi 24 Octobre 2014 12:55:01 
Objet: Re: krbd blk-mq support ? 

If you're willing to experiment give the patches below a try, not that 
I don't have a ceph test cluster available, so the conversion is 
untestested. 
Alexandre DERUMIER Oct. 26, 2014, 1:46 p.m. UTC | #2
Hi,

some news:

I have applied the patches successfully on top of the 3.18-rc1 kernel.

But they don't seem to help in my case.
(I think blk-mq is working, because I don't see any I/O schedulers on the rbd devices; blk-mq doesn't support them currently.)
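
For example, a blk-mq queue shows no elevator in sysfs (the device name here is just an example):

    # on a blk-mq request queue this prints "none"; a legacy queue lists e.g. [deadline]
    cat /sys/block/rbd0/queue/scheduler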

My main problem is that I can't reach more than around 50000 iops on one machine,

and the bottleneck seems to be a kworker process stuck at 100% of one core.

I have tried multiple fio processes on different rbd devices at the same time, and I'm always limited to 50000 iops.

I'm sure the ceph cluster is not the bottleneck, because if I launch another fio on another node at the same time,

I can reach 50000 iops on each node, and both are limited by the kworker process.


That's why I thought blk-mq could help, but it doesn't seem to be the case.


Is this kworker cpu limitation a known bug ?

Regards,

Alexandre

----- Mail original ----- 

De: "Alexandre DERUMIER" <aderumier@odiso.com> 
À: "Christoph Hellwig" <hch@infradead.org> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Vendredi 24 Octobre 2014 14:27:47 
Objet: Re: krbd blk-mq support ? 

>>If you're willing to experiment give the patches below a try, not that 
>>I don't have a ceph test cluster available, so the conversion is 
>>untestested. 

Ok, Thanks ! I'll try them and see If I can improve qemu performance on a single drive with multiqueues. 

----- Mail original ----- 

De: "Christoph Hellwig" <hch@infradead.org> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Vendredi 24 Octobre 2014 12:55:01 
Objet: Re: krbd blk-mq support ? 

If you're willing to experiment give the patches below a try, not that 
I don't have a ceph test cluster available, so the conversion is 
untestested.
Somnath Roy Oct. 26, 2014, 7:08 p.m. UTC | #3
Alexandre,
Have you tried mapping different images on the same machine with the 'noshare' map option ?
If not, it will not scale with an increasing number of images (and thus mapped rbds) on a single machine, as they will all share the same connection to the cluster.
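
For reference, mapping two images that way looks roughly like this (pool and image names are placeholders):

    rbd map -o noshare image1 -p rbdpool
    rbd map -o noshare image2 -p rbdpool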

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER

Sent: Sunday, October 26, 2014 6:46 AM
To: Christoph Hellwig
Cc: Ceph Devel
Subject: Re: krbd blk-mq support ?

Hi,

some news:

I have applied patches succefully on top of 3.18-rc1 kernel.

But don't seem to help is my case.
(I think that blk-mq is working because I don't see any io schedulers on rbd devices, as blk-mq don't support them actually).

My main problem is that I can't reach more than around 50000iops on 1 machine,

and the problem seem to be the kworker process stuck at 100% of 1core.

I had tried multiple fio process, on differents rbd devices at the same time, and I'm always limited à 50000iops.

I'm sure that the ceph cluster is not the bottleneck, because if I launch another fio on another node at the same time,

I can reach 50000iops on each node, and both are limited by the kworker process.


That's why I thinked that blk-mq could help, but it don't seem to be the case.


Is this kworker cpu limitation a known bug ?

Regards,

Alexandre

----- Mail original -----

De: "Alexandre DERUMIER" <aderumier@odiso.com>
À: "Christoph Hellwig" <hch@infradead.org>
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org>
Envoyé: Vendredi 24 Octobre 2014 14:27:47
Objet: Re: krbd blk-mq support ?

>>If you're willing to experiment give the patches below a try, not that

>>I don't have a ceph test cluster available, so the conversion is

>>untestested.


Ok, Thanks ! I'll try them and see If I can improve qemu performance on a single drive with multiqueues.

----- Mail original -----

De: "Christoph Hellwig" <hch@infradead.org>
À: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org>
Envoyé: Vendredi 24 Octobre 2014 12:55:01
Objet: Re: krbd blk-mq support ?

If you're willing to experiment give the patches below a try, not that I don't have a ceph test cluster available, so the conversion is untestested.
Alexandre DERUMIER Oct. 27, 2014, 7:53 a.m. UTC | #4
>>Have you tried mapping different images on the same m/c with 'noshare' map option ?

Oh, I didn't know about this option.

I found 1 reference here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/034213.html

"With noshare each mapped image will appear as a separate client instance, 
which means it will have it's own session with teh monitors and own TCP 
connections to the OSDs.  It may be a viable workaround for now but in 
general I would not recommend it."

So it should help with multiple rbds.
Do you know why Sage doesn't recommend it in that mail ?





----- Mail original ----- 

De: "Somnath Roy" <Somnath.Roy@sandisk.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com>, "Christoph Hellwig" <hch@infradead.org> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Dimanche 26 Octobre 2014 20:08:42 
Objet: RE: krbd blk-mq support ? 

Alexandre, 
Have you tried mapping different images on the same m/c with 'noshare' map option ? 
If not, it will not scale with increasing number of images (and thus mapped rbds) on a single m/c as they will share the same connection to cluster. 

Thanks & Regards 
Somnath 

-----Original Message----- 
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER 
Sent: Sunday, October 26, 2014 6:46 AM 
To: Christoph Hellwig 
Cc: Ceph Devel 
Subject: Re: krbd blk-mq support ? 

Hi, 

some news: 

I have applied patches succefully on top of 3.18-rc1 kernel. 

But don't seem to help is my case. 
(I think that blk-mq is working because I don't see any io schedulers on rbd devices, as blk-mq don't support them actually). 

My main problem is that I can't reach more than around 50000iops on 1 machine, 

and the problem seem to be the kworker process stuck at 100% of 1core. 

I had tried multiple fio process, on differents rbd devices at the same time, and I'm always limited à 50000iops. 

I'm sure that the ceph cluster is not the bottleneck, because if I launch another fio on another node at the same time, 

I can reach 50000iops on each node, and both are limited by the kworker process. 


That's why I thinked that blk-mq could help, but it don't seem to be the case. 


Is this kworker cpu limitation a known bug ? 

Regards, 

Alexandre 

----- Mail original ----- 

De: "Alexandre DERUMIER" <aderumier@odiso.com> 
À: "Christoph Hellwig" <hch@infradead.org> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Vendredi 24 Octobre 2014 14:27:47 
Objet: Re: krbd blk-mq support ? 

>>If you're willing to experiment give the patches below a try, not that 
>>I don't have a ceph test cluster available, so the conversion is 
>>untestested. 

Ok, Thanks ! I'll try them and see If I can improve qemu performance on a single drive with multiqueues. 

----- Mail original ----- 

De: "Christoph Hellwig" <hch@infradead.org> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Vendredi 24 Octobre 2014 12:55:01 
Objet: Re: krbd blk-mq support ? 

If you're willing to experiment give the patches below a try, not that I don't have a ceph test cluster available, so the conversion is untestested.
Christoph Hellwig Oct. 27, 2014, 9:45 a.m. UTC | #5
On Sun, Oct 26, 2014 at 02:46:03PM +0100, Alexandre DERUMIER wrote:
> Hi,
> 
> some news:
> 
> I have applied patches succefully on top of 3.18-rc1 kernel.
> 
> But don't seem to help is my case.
> (I think that blk-mq is working because I don't see any io schedulers on rbd devices, as blk-mq don't support them actually).
> 
> My main problem is that I can't reach more than around 50000iops on 1 machine,
> 
> and the problem seem to be the kworker process stuck at 100% of 1core.

Can you do a perf record -ag and then a perf report to see where these
cycles are spent?
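
Something along these lines should do it (a sketch; the capture length is arbitrary):

    perf record -a -g -- sleep 30   # system-wide capture with call graphs while fio is running
    perf report                     # then look at where the kworker cycles go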

Alexandre DERUMIER Oct. 27, 2014, 10 a.m. UTC | #6
>>Can you do a perf report -ag and then a perf report to see where these
>>cycles are spent?

Yes, sure.

I have attached the perf report to this mail.
(This is with kernel 3.14; I don't have access to my 3.18 host right now.)

----- Mail original ----- 

De: "Christoph Hellwig" <hch@infradead.org> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Lundi 27 Octobre 2014 10:45:56 
Objet: Re: krbd blk-mq support ? 

On Sun, Oct 26, 2014 at 02:46:03PM +0100, Alexandre DERUMIER wrote: 
> Hi, 
> 
> some news: 
> 
> I have applied patches succefully on top of 3.18-rc1 kernel. 
> 
> But don't seem to help is my case. 
> (I think that blk-mq is working because I don't see any io schedulers on rbd devices, as blk-mq don't support them actually). 
> 
> My main problem is that I can't reach more than around 50000iops on 1 machine, 
> 
> and the problem seem to be the kworker process stuck at 100% of 1core. 

Can you do a perf report -ag and then a perf report to see where these 
cycles are spent?
Alexandre DERUMIER Oct. 27, 2014, 10:26 a.m. UTC | #7
Hi Somnath,

I have just tried with 2 rbd volumes, each mapped with (rbd map -o noshare rbdvolume -p pool) (kernel 3.14),
then ran a fio benchmark on both volumes at the same time,
but it doesn't seem to help.

The kworker process is still at 100%, and iops are 25000 on each rbd volume.

----- Mail original ----- 

De: "Somnath Roy" <Somnath.Roy@sandisk.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com>, "Christoph Hellwig" <hch@infradead.org> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Dimanche 26 Octobre 2014 20:08:42 
Objet: RE: krbd blk-mq support ? 

Alexandre, 
Have you tried mapping different images on the same m/c with 'noshare' map option ? 
If not, it will not scale with increasing number of images (and thus mapped rbds) on a single m/c as they will share the same connection to cluster. 

Thanks & Regards 
Somnath 

-----Original Message----- 
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER 
Sent: Sunday, October 26, 2014 6:46 AM 
To: Christoph Hellwig 
Cc: Ceph Devel 
Subject: Re: krbd blk-mq support ? 

Hi, 

some news: 

I have applied patches succefully on top of 3.18-rc1 kernel. 

But don't seem to help is my case. 
(I think that blk-mq is working because I don't see any io schedulers on rbd devices, as blk-mq don't support them actually). 

My main problem is that I can't reach more than around 50000iops on 1 machine, 

and the problem seem to be the kworker process stuck at 100% of 1core. 

I had tried multiple fio process, on differents rbd devices at the same time, and I'm always limited à 50000iops. 

I'm sure that the ceph cluster is not the bottleneck, because if I launch another fio on another node at the same time, 

I can reach 50000iops on each node, and both are limited by the kworker process. 


That's why I thinked that blk-mq could help, but it don't seem to be the case. 


Is this kworker cpu limitation a known bug ? 

Regards, 

Alexandre 

----- Mail original ----- 

De: "Alexandre DERUMIER" <aderumier@odiso.com> 
À: "Christoph Hellwig" <hch@infradead.org> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Vendredi 24 Octobre 2014 14:27:47 
Objet: Re: krbd blk-mq support ? 

>>If you're willing to experiment give the patches below a try, not that 
>>I don't have a ceph test cluster available, so the conversion is 
>>untestested. 

Ok, Thanks ! I'll try them and see If I can improve qemu performance on a single drive with multiqueues. 

----- Mail original ----- 

De: "Christoph Hellwig" <hch@infradead.org> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Vendredi 24 Octobre 2014 12:55:01 
Objet: Re: krbd blk-mq support ? 

If you're willing to experiment give the patches below a try, not that I don't have a ceph test cluster available, so the conversion is untestested.
Christoph Hellwig Oct. 28, 2014, 6:07 p.m. UTC | #8
On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
> >>Can you do a perf report -ag and then a perf report to see where these
> >>cycles are spent?
> 
> Yes, sure.
> 
> I have attached the perf report to this mail.
> (This is with kernel 3.14, don't have access to my 3.18  host for now)

Oh, that's without the blk-mq patch?

Either way the profile doesn't really add up to a fully used-up
cpu.  Sage, Alex - are there any ordering constraints in the rbd client?
If not we could probably aim for per-cpu queues using blk-mq and a
socket per cpu or similar.
Alex Elder Oct. 28, 2014, 10:31 p.m. UTC | #9
On 10/28/2014 01:07 PM, Christoph Hellwig wrote:
> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
>>>> Can you do a perf report -ag and then a perf report to see where these
>>>> cycles are spent?
>>
>> Yes, sure.
>>
>> I have attached the perf report to this mail.
>> (This is with kernel 3.14, don't have access to my 3.18  host for now)
>
> Oh, that's without the blk-mq patch?
>
> Either way the profile doesn't really sum up to a fully used up
> cpu.  Sage, Alex - are there any ordring constraints in the rbd client?

I don't remember off hand.

In libceph I recall going to great lengths to retain the original
order of requests when they got re-sent after a connection reset.

I'll go look at the code a bit and see if I can refresh my memory
(though Sage may answer before I do).

					-Alex

> If not we could probably aim for per-cpu queues using blk-mq and a
> socket per cpu or similar.
Alex Elder Oct. 28, 2014, 11:11 p.m. UTC | #10
On 10/28/2014 01:07 PM, Christoph Hellwig wrote:
> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
>>>> Can you do a perf report -ag and then a perf report to see where these
>>>> cycles are spent?
>>
>> Yes, sure.
>>
>> I have attached the perf report to this mail.
>> (This is with kernel 3.14, don't have access to my 3.18  host for now)
>
> Oh, that's without the blk-mq patch?
>
> Either way the profile doesn't really sum up to a fully used up
> cpu.  Sage, Alex - are there any ordring constraints in the rbd client?
> If not we could probably aim for per-cpu queues using blk-mq and a
> socket per cpu or similar.

First, a disclaimer--I haven't really been following this discussion
very closely.

For an rbd image request (which is what gets created from requests
from the block queue), the order of completion doesn't matter, and
although the object requests are submitted in order, that shouldn't
be required either.

The image request is broken into one or more object requests (usually
just one) and they are treated as a unit.  When the last object request
of a set for an image request has completed, the image request is
treated as completed.

I hope that helps.  If not, ask again a different way...

					-Alex

Alexandre DERUMIER Oct. 29, 2014, 9:09 a.m. UTC | #11
>>Oh, that's without the blk-mq patch?

Yes, sorry, I don't know how to use perf with a custom-compiled kernel.
(Usually I'm using perf from Debian, with the linux-tools package provided with the Debian kernel package.)

>>Either way the profile doesn't really sum up to a fully used up cpu.

But I see mostly the same behaviour with or without the blk-mq patch: there is always one kworker at around 97-100% cpu (1 core) for 50000 iops.

I also tried mapping the rbd volume with nocrc; that gets to 60000 iops, with the same kworker at around 97-100% cpu.
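
(That mapping is simply the following, with placeholder pool/image names:)

    rbd map -o nocrc myimage -p mypool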



----- Mail original ----- 

De: "Christoph Hellwig" <hch@infradead.org> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mardi 28 Octobre 2014 19:07:25 
Objet: Re: krbd blk-mq support ? 

On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote: 
> >>Can you do a perf report -ag and then a perf report to see where these 
> >>cycles are spent? 
> 
> Yes, sure. 
> 
> I have attached the perf report to this mail. 
> (This is with kernel 3.14, don't have access to my 3.18 host for now) 

Oh, that's without the blk-mq patch? 

Either way the profile doesn't really sum up to a fully used up 
cpu. Sage, Alex - are there any ordring constraints in the rbd client? 
If not we could probably aim for per-cpu queues using blk-mq and a 
socket per cpu or similar. 
Sage Weil Oct. 29, 2014, 3 p.m. UTC | #12
On Wed, 29 Oct 2014, Alexandre DERUMIER wrote:
> >>Oh, that's without the blk-mq patch?
> 
> Yes, sorry, I don't how to use perf with a custom compiled kernel.
> (Usualy I'm using perf from debian, with linux-tools package provided with the debian kernel package)
> 
> >>Either way the profile doesn't really sum up to a fully used up cpu.
> 
> But I see mostly same behaviour with or without blk-mq patch, I have always 1 kworker at around 97-100%cpu (1core) for 50000iops.
> 
> I had also tried to map the rbd volume with nocrc, it's going to 60000iops with same kworker at around 97-100%cpu

Hmm, this is probably the messenger.c worker then that is feeding messages 
to the network.  How many OSDs do you have?  It should be able to scale 
with the number of OSDs.

sage


> 
> 
> 
> ----- Mail original ----- 
> 
> De: "Christoph Hellwig" <hch@infradead.org> 
> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
> Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mardi 28 Octobre 2014 19:07:25 
> Objet: Re: krbd blk-mq support ? 
> 
> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote: 
> > >>Can you do a perf report -ag and then a perf report to see where these 
> > >>cycles are spent? 
> > 
> > Yes, sure. 
> > 
> > I have attached the perf report to this mail. 
> > (This is with kernel 3.14, don't have access to my 3.18 host for now) 
> 
> Oh, that's without the blk-mq patch? 
> 
> Either way the profile doesn't really sum up to a fully used up 
> cpu. Sage, Alex - are there any ordring constraints in the rbd client? 
> If not we could probably aim for per-cpu queues using blk-mq and a 
> socket per cpu or similar. 
Alexandre DERUMIER Oct. 30, 2014, 8:11 a.m. UTC | #13
>>Hmm, this is probably the messenger.c worker then that is feeding messages 
>>to the network. How many OSDs do you have? It should be able to scale 
>>with the number of OSDs. 

Thanks Sage for your reply.

Currently there are 6 OSDs (SSD) on the test platform.

But I can reach 2x 50000 iops on the same rbd volume with 2 clients on 2 different hosts.
Do you think the messenger.c worker can be the bottleneck in this case ?


I'll try to add more OSDs next week; if it scales, that's very good news!







----- Mail original ----- 

De: "Sage Weil" <sage@newdream.net> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 29 Octobre 2014 16:00:56 
Objet: Re: krbd blk-mq support ? 

On Wed, 29 Oct 2014, Alexandre DERUMIER wrote: 
> >>Oh, that's without the blk-mq patch? 
> 
> Yes, sorry, I don't how to use perf with a custom compiled kernel. 
> (Usualy I'm using perf from debian, with linux-tools package provided with the debian kernel package) 
> 
> >>Either way the profile doesn't really sum up to a fully used up cpu. 
> 
> But I see mostly same behaviour with or without blk-mq patch, I have always 1 kworker at around 97-100%cpu (1core) for 50000iops. 
> 
> I had also tried to map the rbd volume with nocrc, it's going to 60000iops with same kworker at around 97-100%cpu 

Hmm, this is probably the messenger.c worker then that is feeding messages 
to the network. How many OSDs do you have? It should be able to scale 
with the number of OSDs. 

sage 


> 
> 
> 
> ----- Mail original ----- 
> 
> De: "Christoph Hellwig" <hch@infradead.org> 
> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
> Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mardi 28 Octobre 2014 19:07:25 
> Objet: Re: krbd blk-mq support ? 
> 
> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote: 
> > >>Can you do a perf report -ag and then a perf report to see where these 
> > >>cycles are spent? 
> > 
> > Yes, sure. 
> > 
> > I have attached the perf report to this mail. 
> > (This is with kernel 3.14, don't have access to my 3.18 host for now) 
> 
> Oh, that's without the blk-mq patch? 
> 
> Either way the profile doesn't really sum up to a fully used up 
> cpu. Sage, Alex - are there any ordring constraints in the rbd client? 
> If not we could probably aim for per-cpu queues using blk-mq and a 
> socket per cpu or similar. 
Alexandre DERUMIER Oct. 30, 2014, 4:01 p.m. UTC | #14
>>I'll try to add more OSD next week, if it's scale it's a very good news !

I just tried adding 2 more OSDs.

I can now reach 2x 70000 iops on the 2 client nodes (vs 2x 50000 previously),

and kworker cpu usage is also lower (84% vs 97%).
(I don't understand exactly why.)

So, thanks for the help, everybody!





----- Mail original ----- 

De: "Alexandre DERUMIER" <aderumier@odiso.com> 
À: "Sage Weil" <sage@newdream.net> 
Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Jeudi 30 Octobre 2014 09:11:11 
Objet: Re: krbd blk-mq support ? 

>>Hmm, this is probably the messenger.c worker then that is feeding messages 
>>to the network. How many OSDs do you have? It should be able to scale 
>>with the number of OSDs. 

Thanks Sage for your reply. 

Currently 6 OSD (ssd) on the test platform. 

But I can reach 2x 50000iops on same rbd volume with 2 clients on 2 differents host. 
Do you think messenger.c worker can be the bottleneck in this case ? 


I'll try to add more OSD next week, if it's scale it's a very good news ! 







----- Mail original ----- 

De: "Sage Weil" <sage@newdream.net> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 29 Octobre 2014 16:00:56 
Objet: Re: krbd blk-mq support ? 

On Wed, 29 Oct 2014, Alexandre DERUMIER wrote: 
> >>Oh, that's without the blk-mq patch? 
> 
> Yes, sorry, I don't how to use perf with a custom compiled kernel. 
> (Usualy I'm using perf from debian, with linux-tools package provided with the debian kernel package) 
> 
> >>Either way the profile doesn't really sum up to a fully used up cpu. 
> 
> But I see mostly same behaviour with or without blk-mq patch, I have always 1 kworker at around 97-100%cpu (1core) for 50000iops. 
> 
> I had also tried to map the rbd volume with nocrc, it's going to 60000iops with same kworker at around 97-100%cpu 

Hmm, this is probably the messenger.c worker then that is feeding messages 
to the network. How many OSDs do you have? It should be able to scale 
with the number of OSDs. 

sage 


> 
> 
> 
> ----- Mail original ----- 
> 
> De: "Christoph Hellwig" <hch@infradead.org> 
> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
> Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mardi 28 Octobre 2014 19:07:25 
> Objet: Re: krbd blk-mq support ? 
> 
> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote: 
> > >>Can you do a perf report -ag and then a perf report to see where these 
> > >>cycles are spent? 
> > 
> > Yes, sure. 
> > 
> > I have attached the perf report to this mail. 
> > (This is with kernel 3.14, don't have access to my 3.18 host for now) 
> 
> Oh, that's without the blk-mq patch? 
> 
> Either way the profile doesn't really sum up to a fully used up 
> cpu. Sage, Alex - are there any ordring constraints in the rbd client? 
> If not we could probably aim for per-cpu queues using blk-mq and a 
> socket per cpu or similar. 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
>
Haomai Wang Oct. 30, 2014, 5:05 p.m. UTC | #15
Could you describe the 2x 70000 iops in more detail?
So you mean 8 OSDs, each backed by an SSD, can achieve 140000 iops?
Is it read or write? Could you give the fio options?

On Fri, Oct 31, 2014 at 12:01 AM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
>>>I'll try to add more OSD next week, if it's scale it's a very good news !
>
> I just tried to add 2 more osds,
>
> I can now reach 2x 70000 iops on 2 client nodes (vs 2 x 50000 previously).
>
> and kworker cpu usage is also lower (84% vs 97%).
> (don't understand why exactly)
>
> So, Thanks for help everybody !
>
>
>
>
>
> ----- Mail original -----
>
> De: "Alexandre DERUMIER" <aderumier@odiso.com>
> À: "Sage Weil" <sage@newdream.net>
> Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org>
> Envoyé: Jeudi 30 Octobre 2014 09:11:11
> Objet: Re: krbd blk-mq support ?
>
>>>Hmm, this is probably the messenger.c worker then that is feeding messages
>>>to the network. How many OSDs do you have? It should be able to scale
>>>with the number of OSDs.
>
> Thanks Sage for your reply.
>
> Currently 6 OSD (ssd) on the test platform.
>
> But I can reach 2x 50000iops on same rbd volume with 2 clients on 2 differents host.
> Do you think messenger.c worker can be the bottleneck in this case ?
>
>
> I'll try to add more OSD next week, if it's scale it's a very good news !
>
>
>
>
>
>
>
> ----- Mail original -----
>
> De: "Sage Weil" <sage@newdream.net>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org>
> Envoyé: Mercredi 29 Octobre 2014 16:00:56
> Objet: Re: krbd blk-mq support ?
>
> On Wed, 29 Oct 2014, Alexandre DERUMIER wrote:
>> >>Oh, that's without the blk-mq patch?
>>
>> Yes, sorry, I don't how to use perf with a custom compiled kernel.
>> (Usualy I'm using perf from debian, with linux-tools package provided with the debian kernel package)
>>
>> >>Either way the profile doesn't really sum up to a fully used up cpu.
>>
>> But I see mostly same behaviour with or without blk-mq patch, I have always 1 kworker at around 97-100%cpu (1core) for 50000iops.
>>
>> I had also tried to map the rbd volume with nocrc, it's going to 60000iops with same kworker at around 97-100%cpu
>
> Hmm, this is probably the messenger.c worker then that is feeding messages
> to the network. How many OSDs do you have? It should be able to scale
> with the number of OSDs.
>
> sage
>
>
>>
>>
>>
>> ----- Mail original -----
>>
>> De: "Christoph Hellwig" <hch@infradead.org>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Ceph Devel" <ceph-devel@vger.kernel.org>
>> Envoyé: Mardi 28 Octobre 2014 19:07:25
>> Objet: Re: krbd blk-mq support ?
>>
>> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
>> > >>Can you do a perf report -ag and then a perf report to see where these
>> > >>cycles are spent?
>> >
>> > Yes, sure.
>> >
>> > I have attached the perf report to this mail.
>> > (This is with kernel 3.14, don't have access to my 3.18 host for now)
>>
>> Oh, that's without the blk-mq patch?
>>
>> Either way the profile doesn't really sum up to a fully used up
>> cpu. Sage, Alex - are there any ordring constraints in the rbd client?
>> If not we could probably aim for per-cpu queues using blk-mq and a
>> socket per cpu or similar.
Alexandre DERUMIER Oct. 31, 2014, 5:04 a.m. UTC | #16
>>Could you describe more about 2x70000 iops?
>>So you mean 8 OSD each backend with SSD can achieve with 14w iops?

It's a small rbd (10G), so most reads hit the buffer cache.
But yes, it's able to deliver 140000 iops with 8 OSDs. (I also checked the stats in the ceph cluster to be sure.)
(And I'm not cpu-bound on the osd nodes.)
>> 2014-10-31 05:58:34.231037 mon.0 [INF] pgmap v7109: 1264 pgs: 1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; 560 MB/s rd, 140 kop/s


Here is the ceph.conf of the osd nodes:

[global]
fsid = c29f4643-9577-4671-ae25-59ad14550aba
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true

        debug lockdep = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug buffer = 0/0
        debug timer = 0/0
        debug journaler = 0/0
        debug osd = 0/0
        debug optracker = 0/0
        debug objclass = 0/0
        debug filestore = 0/0
        debug journal = 0/0
        debug ms = 0/0
        debug monc = 0/0
        debug tp = 0/0
        debug auth = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug perfcounter = 0/0
        debug asok = 0/0
        debug throttle = 0/0

        osd_op_threads = 5
        filestore_op_threads = 4


        osd_op_num_threads_per_shard = 1
        osd_op_num_shards = 25
        filestore_fd_cache_size = 64
        filestore_fd_cache_shards = 32
        osd_enable_op_tracker = false

 

>>is it read or write? could you give fio options?
random read 4K

Here is the fio config:

[global]
ioengine=aio
invalidate=1    
rw=randread
bs=4K
direct=1
numjobs=1
group_reporting=1
size=10G

[test1]
iodepth=64
filename=/dev/rbd/test/test
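
(I just run it with the job file saved under an arbitrary name:)

    fio rbd-randread.fio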


On 1 client node, I can't reach more than 50000 iops with 6 OSDs or 70000 iops with 8 OSDs.
(I tried increasing numjobs to have more fio processes, and also running on 2 different rbd volumes at the same time,
 but performance is the same.)

>> 2014-10-31 05:57:30.078348 mon.0 [INF] pgmap v7070: 1264 pgs: 1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; 290 MB/s rd, 74572 op/s


But if I launch the same fio test on another client node, I can reach the same 70000 iops there at the same time.


>> 2014-10-31 05:58:34.231037 mon.0 [INF] pgmap v7109: 1264 pgs: 1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; 560 MB/s rd, 140 kop/s


----- Mail original ----- 

De: "Haomai Wang" <haomaiwang@gmail.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Sage Weil" <sage@newdream.net>, "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org> 
Envoyé: Jeudi 30 Octobre 2014 18:05:26 
Objet: Re: krbd blk-mq support ? 

Could you describe more about 2x70000 iops? 
So you mean 8 OSD each backend with SSD can achieve with 14w iops? 
is it read or write? could you give fio options? 

On Fri, Oct 31, 2014 at 12:01 AM, Alexandre DERUMIER 
<aderumier@odiso.com> wrote: 
>>>I'll try to add more OSD next week, if it's scale it's a very good news ! 
> 
> I just tried to add 2 more osds, 
> 
> I can now reach 2x 70000 iops on 2 client nodes (vs 2 x 50000 previously). 
> 
> and kworker cpu usage is also lower (84% vs 97%). 
> (don't understand why exactly) 
> 
> So, Thanks for help everybody ! 
> 
> 
> 
> 
> 
> ----- Mail original ----- 
> 
> De: "Alexandre DERUMIER" <aderumier@odiso.com> 
> À: "Sage Weil" <sage@newdream.net> 
> Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Jeudi 30 Octobre 2014 09:11:11 
> Objet: Re: krbd blk-mq support ? 
> 
>>>Hmm, this is probably the messenger.c worker then that is feeding messages 
>>>to the network. How many OSDs do you have? It should be able to scale 
>>>with the number of OSDs. 
> 
> Thanks Sage for your reply. 
> 
> Currently 6 OSD (ssd) on the test platform. 
> 
> But I can reach 2x 50000iops on same rbd volume with 2 clients on 2 differents host. 
> Do you think messenger.c worker can be the bottleneck in this case ? 
> 
> 
> I'll try to add more OSD next week, if it's scale it's a very good news ! 
> 
> 
> 
> 
> 
> 
> 
> ----- Mail original ----- 
> 
> De: "Sage Weil" <sage@newdream.net> 
> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
> Cc: "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mercredi 29 Octobre 2014 16:00:56 
> Objet: Re: krbd blk-mq support ? 
> 
> On Wed, 29 Oct 2014, Alexandre DERUMIER wrote: 
>> >>Oh, that's without the blk-mq patch? 
>> 
>> Yes, sorry, I don't how to use perf with a custom compiled kernel. 
>> (Usualy I'm using perf from debian, with linux-tools package provided with the debian kernel package) 
>> 
>> >>Either way the profile doesn't really sum up to a fully used up cpu. 
>> 
>> But I see mostly same behaviour with or without blk-mq patch, I have always 1 kworker at around 97-100%cpu (1core) for 50000iops. 
>> 
>> I had also tried to map the rbd volume with nocrc, it's going to 60000iops with same kworker at around 97-100%cpu 
> 
> Hmm, this is probably the messenger.c worker then that is feeding messages 
> to the network. How many OSDs do you have? It should be able to scale 
> with the number of OSDs. 
> 
> sage 
> 
> 
>> 
>> 
>> 
>> ----- Mail original ----- 
>> 
>> De: "Christoph Hellwig" <hch@infradead.org> 
>> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
>> Cc: "Ceph Devel" <ceph-devel@vger.kernel.org> 
>> Envoyé: Mardi 28 Octobre 2014 19:07:25 
>> Objet: Re: krbd blk-mq support ? 
>> 
>> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote: 
>> > >>Can you do a perf report -ag and then a perf report to see where these 
>> > >>cycles are spent? 
>> > 
>> > Yes, sure. 
>> > 
>> > I have attached the perf report to this mail. 
>> > (This is with kernel 3.14, don't have access to my 3.18 host for now) 
>> 
>> Oh, that's without the blk-mq patch? 
>> 
>> Either way the profile doesn't really sum up to a fully used up 
>> cpu. Sage, Alex - are there any ordring constraints in the rbd client? 
>> If not we could probably aim for per-cpu queues using blk-mq and a 
>> socket per cpu or similar. 

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 68929ba..eaaedea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -760,10 +760,11 @@  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
-	int cpu = hctx->next_cpu;
+	if (hctx->queue->nr_hw_queues == 1)
+		return WORK_CPU_UNBOUND;
 
 	if (--hctx->next_cpu_batch <= 0) {
-		int next_cpu;
+		int cpu = hctx->next_cpu, next_cpu;
 
 		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
 		if (next_cpu >= nr_cpu_ids)
@@ -771,9 +772,11 @@  static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 
 		hctx->next_cpu = next_cpu;
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
+	
+		return cpu;
 	}
 
-	return cpu;
+	return hctx->next_cpu;
 }
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
@@ -781,16 +784,13 @@  void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
-	if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask))
+	if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask)) {
 		__blk_mq_run_hw_queue(hctx);
-	else if (hctx->queue->nr_hw_queues == 1)
-		kblockd_schedule_delayed_work(&hctx->run_work, 0);
-	else {
-		unsigned int cpu;
-
-		cpu = blk_mq_hctx_next_cpu(hctx);
-		kblockd_schedule_delayed_work_on(cpu, &hctx->run_work, 0);
+		return;
 	}
+
+	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+			&hctx->run_work, 0);
 }
 
 void blk_mq_run_queues(struct request_queue *q, bool async)
@@ -888,16 +888,8 @@  static void blk_mq_delay_work_fn(struct work_struct *work)
 
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
 {
-	unsigned long tmo = msecs_to_jiffies(msecs);
-
-	if (hctx->queue->nr_hw_queues == 1)
-		kblockd_schedule_delayed_work(&hctx->delay_work, tmo);
-	else {
-		unsigned int cpu;
-
-		cpu = blk_mq_hctx_next_cpu(hctx);
-		kblockd_schedule_delayed_work_on(cpu, &hctx->delay_work, tmo);
-	}
+	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+			&hctx->delay_work, msecs_to_jiffies(msecs));
 }
 EXPORT_SYMBOL(blk_mq_delay_queue);