[RFC,v6,08/10] megaraid_sas: switch fusion adapters to MQ

Message ID	1583409280-158604-9-git-send-email-john.garry@huawei.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=t5UG=4W=vger.kernel.org=linux-block-owner@kernel.org> From: John Garry <john.garry@huawei.com> To: <axboe@kernel.dk>, <jejb@linux.ibm.com>, <martin.petersen@oracle.com>, <ming.lei@redhat.com>, <bvanassche@acm.org>, <hare@suse.de>, <don.brace@microsemi.com>, <sumit.saxena@broadcom.com>, <hch@infradead.org>, <kashyap.desai@broadcom.com>, <shivasharan.srikanteshwara@broadcom.com> CC: <chenxiang66@hisilicon.com>, <linux-block@vger.kernel.org>, <linux-scsi@vger.kernel.org>, <esc.storagedev@microsemi.com>, Hannes Reinecke <hare@suse.com>, John Garry <john.garry@huawei.com> Subject: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ Date: Thu, 5 Mar 2020 19:54:38 +0800 Message-ID: <1583409280-158604-9-git-send-email-john.garry@huawei.com> In-Reply-To: <1583409280-158604-1-git-send-email-john.garry@huawei.com> References: <1583409280-158604-1-git-send-email-john.garry@huawei.com> MIME-Version: 1.0 Content-Type: text/plain Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs \| expand [RFC,v6,00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs [RFC,v6,01/10] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED [RFC,v6,02/10] blk-mq: rename blk_mq_update_tag_set_depth() [RFC,v6,03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags [RFC,v6,04/10] blk-mq: Facilitate a shared sbitmap per tagset [RFC,v6,05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap [RFC,v6,06/10] scsi: Add template flag 'host_tagset' [RFC,v6,07/10] scsi: hisi_sas: Switch v3 hw to MQ [RFC,v6,08/10] megaraid_sas: switch fusion adapters to MQ [RFC,v6,09/10] smartpqi: enable host tagset [RFC,v6,10/10] hpsa: enable host_tagset and switch to MQ

John Garry March 5, 2020, 11:54 a.m. UTC

From: Hannes Reinecke <hare@suse.com>

Fusion adapters can steer completions to individual queues, and
we now have support for shared host-wide tags.
So we can enable multiqueue support for fusion adapters and
drop the hand-crafted interrupt affinity settings.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/megaraid/megaraid_sas.h        |  1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   | 59 +++++++--------------
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 24 +++++----
 3 files changed, 32 insertions(+), 52 deletions(-)

Kashyap Desai April 7, 2020, 11:14 a.m. UTC | #1

> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance
> *instance,  {
>  	int sdev_busy;
>
> -	/* nr_hw_queue = 1 for MegaRAID */
> -	struct blk_mq_hw_ctx *hctx =
> -		scmd->device->request_queue->queue_hw_ctx[0];
> +	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;

Hi John,

There is one outstanding patch which will eventually remove device_busy
from sdev. To fix this interface, we may have to track per scsi device
outstanding within a driver.
For my testing I used below since we still have below interface available.

        sdev_busy = atomic_read(&scmd->device->device_busy);

We have done some level of testing to know performance impact on SAS SSDs
and HDD setup. Here is my finding -
My testing used - Two socket Intel Skylake/Lewisburg/Purley
Output of numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41
42 43 44 45 46 47 48 49 50 51 52 53
node 0 size: 31820 MB
node 0 free: 21958 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 32247 MB
node 1 free: 21068 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10


64 HDD setup -

With higher QD and io schedulder = mq-deadline, shared host tag is not
scaling well. If I use ioscheduler = none, I can see consistent 2.0M IOPs.
This issue is seen only with RFC. Without RFC mq-deadline scales up to
2.0M IOPS.

Perf Top result of RFC - (IOPS = 1.4M IOPS)

   78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
     1.46%  [kernel]        [k] sbitmap_any_bit_set
     1.14%  [kernel]        [k] blk_mq_run_hw_queue
     0.90%  [kernel]        [k] _mix_pool_bytes
     0.63%  [kernel]        [k] _raw_spin_lock
     0.57%  [kernel]        [k] blk_mq_run_hw_queues
     0.56%  [megaraid_sas]  [k] complete_cmd_fusion
     0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     0.50%  [kernel]        [k] dd_has_work
     0.38%  [kernel]        [k] _raw_spin_lock_irqsave
     0.36%  [kernel]        [k] gup_pgd_range
     0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
     0.31%  [kernel]        [k] io_submit_one
     0.29%  [kernel]        [k] hctx_lock
     0.26%  [kernel]        [k] try_to_grab_pending
     0.24%  [kernel]        [k] scsi_queue_rq
     0.22%  fio             [.] __fio_gettime
     0.22%  [kernel]        [k] insert_work
     0.20%  [kernel]        [k] native_irq_return_iret

Perf top without RFC driver - (IOPS = 2.0 M IOPS)

    58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
     2.06%  [kernel]          [k] _mix_pool_bytes
     1.38%  [kernel]          [k] _raw_spin_lock_irqsave
     0.97%  [kernel]          [k] _raw_spin_lock
     0.91%  [kernel]          [k] scsi_queue_rq
     0.82%  [kernel]          [k] __sbq_wake_up
     0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     0.74%  [kernel]          [k] scsi_mq_get_budget
     0.61%  [kernel]          [k] gup_pgd_range
     0.58%  [kernel]          [k] aio_complete_rw
     0.52%  [kernel]          [k] elv_rb_add
     0.50%  [kernel]          [k] llist_add_batch
     0.50%  [kernel]          [k] native_irq_return_iret
     0.48%  [kernel]          [k] blk_rq_map_sg
     0.48%  fio               [.] __fio_gettime
     0.47%  [kernel]          [k] blk_mq_get_tag
     0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
     0.40%  fio               [.] io_u_queued_complete
     0.39%  fio               [.] get_io_u


If you want me to test any top up patch, please let me know.  BTW, we also
wants to provide module parameter for user to switch back to older
nr_hw_queue = 1 mode. I will work on that part.

24 SSD setup -

I am able to see performance using RFC and without RFC is almost same.
There is one specific drop, but that is generic kernel issue. Not related
to RFC.
We can discuss this issue separately. -

5.6 kernel is not able to scale very well if there is heavy outstanding
from application.
Example -
24 SSD setup and BS = 8K QD = 128 gives 1.73M IOPs which is h/w max, but
at QD = 256 it gives 1.4M IOPs. It looks like there are some overhead  of
finding free tags at sdev or shost level which leads drops in IOPs.

Kashyap

John Garry April 8, 2020, 9:33 a.m. UTC | #2

On 07/04/2020 12:14, Kashyap Desai wrote:
>> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>> @@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance
>> *instance,  {
>>   	int sdev_busy;
>>
>> -	/* nr_hw_queue = 1 for MegaRAID */
>> -	struct blk_mq_hw_ctx *hctx =
>> -		scmd->device->request_queue->queue_hw_ctx[0];
>> +	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;
> 

Hi Kashyap,

> 
> There is one outstanding patch which will eventually remove device_busy
> from sdev. To fix this interface, we may have to track per scsi device
> outstanding within a driver.
> For my testing I used below since we still have below interface available.
> 
>          sdev_busy = atomic_read(&scmd->device->device_busy);

So please confirm that this is your change in megasas_get_msix_index():

- sdev_busy = atomic_read(&hctx->nr_active);
+ sdev_busy = atomic_read(&scmd->device->device_busy);

> 
> We have done some level of testing to know performance impact on SAS SSDs
> and HDD setup. Here is my finding -
> My testing used - Two socket Intel Skylake/Lewisburg/Purley
> Output of numactl --hardware
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41
> 42 43 44 45 46 47 48 49 50 51 52 53
> node 0 size: 31820 MB
> node 0 free: 21958 MB
> node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55
> 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 1 size: 32247 MB
> node 1 free: 21068 MB
> node distances:
> node   0   1
>    0:  10  21
>    1:  21  10
> 
> 
> 64 HDD setup -
> 
> With higher QD 

what's OD?

> and io schedulder = mq-deadline, shared host tag is not
> scaling well. If I use ioscheduler = none, I can see consistent 2.0M IOPs.
> This issue is seen only with RFC. Without RFC mq-deadline scales up to
> 2.0M IOPS.

I didn't try any scheduler. I can have a look at that.

> 
> Perf Top result of RFC - (IOPS = 1.4M IOPS)
> 
>     78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
>       1.46%  [kernel]        [k] sbitmap_any_bit_set
>       1.14%  [kernel]        [k] blk_mq_run_hw_queue
>       0.90%  [kernel]        [k] _mix_pool_bytes
>       0.63%  [kernel]        [k] _raw_spin_lock
>       0.57%  [kernel]        [k] blk_mq_run_hw_queues
>       0.56%  [megaraid_sas]  [k] complete_cmd_fusion
>       0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
>       0.50%  [kernel]        [k] dd_has_work
>       0.38%  [kernel]        [k] _raw_spin_lock_irqsave
>       0.36%  [kernel]        [k] gup_pgd_range
>       0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
>       0.31%  [kernel]        [k] io_submit_one
>       0.29%  [kernel]        [k] hctx_lock
>       0.26%  [kernel]        [k] try_to_grab_pending
>       0.24%  [kernel]        [k] scsi_queue_rq
>       0.22%  fio             [.] __fio_gettime
>       0.22%  [kernel]        [k] insert_work
>       0.20%  [kernel]        [k] native_irq_return_iret
> 
> Perf top without RFC driver - (IOPS = 2.0 M IOPS)
> 
>      58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
>       2.06%  [kernel]          [k] _mix_pool_bytes
>       1.38%  [kernel]          [k] _raw_spin_lock_irqsave
>       0.97%  [kernel]          [k] _raw_spin_lock
>       0.91%  [kernel]          [k] scsi_queue_rq
>       0.82%  [kernel]          [k] __sbq_wake_up
>       0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
>       0.74%  [kernel]          [k] scsi_mq_get_budget
>       0.61%  [kernel]          [k] gup_pgd_range
>       0.58%  [kernel]          [k] aio_complete_rw
>       0.52%  [kernel]          [k] elv_rb_add
>       0.50%  [kernel]          [k] llist_add_batch
>       0.50%  [kernel]          [k] native_irq_return_iret
>       0.48%  [kernel]          [k] blk_rq_map_sg
>       0.48%  fio               [.] __fio_gettime
>       0.47%  [kernel]          [k] blk_mq_get_tag
>       0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
>       0.40%  fio               [.] io_u_queued_complete
>       0.39%  fio               [.] get_io_u
> 
> 
> If you want me to test any top up patch, please let me know.  BTW, we also
> wants to provide module parameter for user to switch back to older
> nr_hw_queue = 1 mode. I will work on that part.

ok, but I would just like to reiterate the point that you will not see 
the full benefit of blk-mq draining hw queues for cpu hotplug since you 
hide hw queues from blk-mq.

> 
> 24 SSD setup -
> 
> I am able to see performance using RFC and without RFC is almost same.
> There is one specific drop, but that is generic kernel issue. Not related
> to RFC.
> We can discuss this issue separately. -
> 
> 5.6 kernel is not able to scale very well if there is heavy outstanding
> from application.
> Example -
> 24 SSD setup and BS = 8K QD = 128 gives 1.73M IOPs which is h/w max, but
> at QD = 256 it gives 1.4M IOPs. It looks like there are some overhead  of
> finding free tags at sdev or shost level which leads drops in IOPs.
> 

Thanks for testing,
John

>

Kashyap Desai April 8, 2020, 9:59 a.m. UTC | #3

> Hi Kashyap,
>
> >
> > There is one outstanding patch which will eventually remove
> > device_busy from sdev. To fix this interface, we may have to track per
> > scsi device outstanding within a driver.
> > For my testing I used below since we still have below interface
> > available.
> >
> >          sdev_busy = atomic_read(&scmd->device->device_busy);
>
> So please confirm that this is your change in megasas_get_msix_index():
>
> - sdev_busy = atomic_read(&hctx->nr_active);
> + sdev_busy = atomic_read(&scmd->device->device_busy);

That  is correct.

>
> >
> > We have done some level of testing to know performance impact on SAS
> > SSDs and HDD setup. Here is my finding - My testing used - Two socket
> > Intel Skylake/Lewisburg/Purley Output of numactl --hardware
> >
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39
> > 40 41
> > 42 43 44 45 46 47 48 49 50 51 52 53
> > node 0 size: 31820 MB
> > node 0 free: 21958 MB
> > node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54
> > 55
> > 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 32247 MB
> > node 1 free: 21068 MB node distances:
> > node   0   1
> >    0:  10  21
> >    1:  21  10
> >
> >
> > 64 HDD setup -
> >
> > With higher QD
>
> what's OD?
>
> > and io schedulder = mq-deadline, shared host tag is not scaling well.
> > If I use ioscheduler = none, I can see consistent 2.0M IOPs.
> > This issue is seen only with RFC. Without RFC mq-deadline scales up to
> > 2.0M IOPS.
>
> I didn't try any scheduler. I can have a look at that.
>
> >
> > Perf Top result of RFC - (IOPS = 1.4M IOPS)
> >
> >     78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
> >       1.46%  [kernel]        [k] sbitmap_any_bit_set
> >       1.14%  [kernel]        [k] blk_mq_run_hw_queue
> >       0.90%  [kernel]        [k] _mix_pool_bytes
> >       0.63%  [kernel]        [k] _raw_spin_lock
> >       0.57%  [kernel]        [k] blk_mq_run_hw_queues
> >       0.56%  [megaraid_sas]  [k] complete_cmd_fusion
> >       0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >       0.50%  [kernel]        [k] dd_has_work
> >       0.38%  [kernel]        [k] _raw_spin_lock_irqsave
> >       0.36%  [kernel]        [k] gup_pgd_range
> >       0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
> >       0.31%  [kernel]        [k] io_submit_one
> >       0.29%  [kernel]        [k] hctx_lock
> >       0.26%  [kernel]        [k] try_to_grab_pending
> >       0.24%  [kernel]        [k] scsi_queue_rq
> >       0.22%  fio             [.] __fio_gettime
> >       0.22%  [kernel]        [k] insert_work
> >       0.20%  [kernel]        [k] native_irq_return_iret
> >
> > Perf top without RFC driver - (IOPS = 2.0 M IOPS)
> >
> >      58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
> >       2.06%  [kernel]          [k] _mix_pool_bytes
> >       1.38%  [kernel]          [k] _raw_spin_lock_irqsave
> >       0.97%  [kernel]          [k] _raw_spin_lock
> >       0.91%  [kernel]          [k] scsi_queue_rq
> >       0.82%  [kernel]          [k] __sbq_wake_up
> >       0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
> >       0.74%  [kernel]          [k] scsi_mq_get_budget
> >       0.61%  [kernel]          [k] gup_pgd_range
> >       0.58%  [kernel]          [k] aio_complete_rw
> >       0.52%  [kernel]          [k] elv_rb_add
> >       0.50%  [kernel]          [k] llist_add_batch
> >       0.50%  [kernel]          [k] native_irq_return_iret
> >       0.48%  [kernel]          [k] blk_rq_map_sg
> >       0.48%  fio               [.] __fio_gettime
> >       0.47%  [kernel]          [k] blk_mq_get_tag
> >       0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
> >       0.40%  fio               [.] io_u_queued_complete
> >       0.39%  fio               [.] get_io_u
> >
> >
> > If you want me to test any top up patch, please let me know.  BTW, we
> > also wants to provide module parameter for user to switch back to
> > older nr_hw_queue = 1 mode. I will work on that part.
>
> ok, but I would just like to reiterate the point that you will not see the
> full
> benefit of blk-mq draining hw queues for cpu hotplug since you hide hw
> queues from blk-mq.

Agree. We have done  minimal testing using this RFC. We want to ACK this RFC
as long as primary performance goal is achieved.

We have done full testing on nr_hw_queue =1 (and that is what customer is
using) so we at least want to give that interface available for customer for
some time (assuming they may find some performance gap between two interface
which we may not have encountered during smoke testing.).
Over a period of time, if nr_hw_queue = N works for (Broadcom will conduct
full performance once RFC is committed in upstream) all the IO profiles, we
will share the information with customer about benefit of using nr_hw_queues
=  N.

Kashyap

>
> >
> > 24 SSD setup -
> >
> > I am able to see performance using RFC and without RFC is almost same.
> > There is one specific drop, but that is generic kernel issue. Not
> > related to RFC.
> > We can discuss this issue separately. -
> >
> > 5.6 kernel is not able to scale very well if there is heavy
> > outstanding from application.
> > Example -
> > 24 SSD setup and BS = 8K QD = 128 gives 1.73M IOPs which is h/w max,
> > but at QD = 256 it gives 1.4M IOPs. It looks like there are some
> > overhead  of finding free tags at sdev or shost level which leads drops
> > in
> IOPs.
> >
>
> Thanks for testing,
> John
>
> >

John Garry April 17, 2020, 4:46 p.m. UTC | #4

On 08/04/2020 10:59, Kashyap Desai wrote:

Hi Kashyap,

> 
>>> We have done some level of testing to know performance impact on SAS
>>> SSDs and HDD setup. Here is my finding - My testing used - Two socket
>>> Intel Skylake/Lewisburg/Purley Output of numactl --hardware
>>>
>>> available: 2 nodes (0-1)
>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39
>>> 40 41
>>> 42 43 44 45 46 47 48 49 50 51 52 53
>>> node 0 size: 31820 MB
>>> node 0 free: 21958 MB
>>> node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54
>>> 55
>>> 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 32247 MB
>>> node 1 free: 21068 MB node distances:
>>> node   0   1
>>>     0:  10  21
>>>     1:  21  10

Do you have other info, like IRQ-CPU affinity dump and controller PCI 
vendor+device ID? Also /proc/interrupts info would be good after a run, 
like supplied by Sumit here:

https://lore.kernel.org/linux-scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-L=g@mail.gmail.com/

Are you enabling some special driver perf mode?

>>>
>>>
>>> 64 HDD setup -
>>>
>>> With higher QD
>> what's OD?
>>
>>> and io schedulder = mq-deadline, shared host tag is not scaling well. >>> If I use ioscheduler = none, I can see consistent 2.0M IOPs.
>>> This issue is seen only with RFC. Without RFC mq-deadline scales up to
>>> 2.0M IOPS.

In theory, from this driver perspective, we should not be making a 
difference. That's after your change to use sdev-> device busy count, 
rather than the hctx nr_active count. As I understand, that's the only 
difference you made.

But I will try an IO scheduler on hisi sas for ssd to see if any difference.

>> I didn't try any scheduler. I can have a look at that.
>>
>>> Perf Top result of RFC - (IOPS = 1.4M IOPS)
>>>
>>>      78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
>>>        1.46%  [kernel]        [k] sbitmap_any_bit_set
>>>        1.14%  [kernel]        [k] blk_mq_run_hw_queue
>>>        0.90%  [kernel]        [k] _mix_pool_bytes
>>>        0.63%  [kernel]        [k] _raw_spin_lock
>>>        0.57%  [kernel]        [k] blk_mq_run_hw_queues
>>>        0.56%  [megaraid_sas]  [k] complete_cmd_fusion
>>>        0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
>>>        0.50%  [kernel]        [k] dd_has_work
>>>        0.38%  [kernel]        [k] _raw_spin_lock_irqsave
>>>        0.36%  [kernel]        [k] gup_pgd_range
>>>        0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
>>>        0.31%  [kernel]        [k] io_submit_one
>>>        0.29%  [kernel]        [k] hctx_lock
>>>        0.26%  [kernel]        [k] try_to_grab_pending
>>>        0.24%  [kernel]        [k] scsi_queue_rq
>>>        0.22%  fio             [.] __fio_gettime
>>>        0.22%  [kernel]        [k] insert_work
>>>        0.20%  [kernel]        [k] native_irq_return_iret
>>>
>>> Perf top without RFC driver - (IOPS = 2.0 M IOPS)
>>>
>>>       58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
>>>        2.06%  [kernel]          [k] _mix_pool_bytes
>>>        1.38%  [kernel]          [k] _raw_spin_lock_irqsave
>>>        0.97%  [kernel]          [k] _raw_spin_lock
>>>        0.91%  [kernel]          [k] scsi_queue_rq
>>>        0.82%  [kernel]          [k] __sbq_wake_up
>>>        0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
>>>        0.74%  [kernel]          [k] scsi_mq_get_budget
>>>        0.61%  [kernel]          [k] gup_pgd_range
>>>        0.58%  [kernel]          [k] aio_complete_rw
>>>        0.52%  [kernel]          [k] elv_rb_add
>>>        0.50%  [kernel]          [k] llist_add_batch
>>>        0.50%  [kernel]          [k] native_irq_return_iret
>>>        0.48%  [kernel]          [k] blk_rq_map_sg
>>>        0.48%  fio               [.] __fio_gettime
>>>        0.47%  [kernel]          [k] blk_mq_get_tag
>>>        0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
>>>        0.40%  fio               [.] io_u_queued_complete
>>>        0.39%  fio               [.] get_io_u
>>>
>>>
>>> If you want me to test any top up patch, please let me know.  BTW, we
>>> also wants to provide module parameter for user to switch back to
>>> older nr_hw_queue = 1 mode. I will work on that part.
>> ok, but I would just like to reiterate the point that you will not see the
>> full
>> benefit of blk-mq draining hw queues for cpu hotplug since you hide hw
>> queues from blk-mq.
> Agree. We have done  minimal testing using this RFC. We want to ACK this RFC
> as long as primary performance goal is achieved.
> 
> We have done full testing on nr_hw_queue =1 (and that is what customer is
> using) so we at least want to give that interface available for customer for
> some time (assuming they may find some performance gap between two interface
> which we may not have encountered during smoke testing.).
> Over a period of time, if nr_hw_queue = N works for (Broadcom will conduct
> full performance once RFC is committed in upstream) all the IO profiles, we
> will share the information with customer about benefit of using nr_hw_queues
> =  N.

Hopefully you can use nr_hw_queues = N always.

> 

Thanks,
john

Kashyap Desai April 20, 2020, 5:47 p.m. UTC | #5

> On 08/04/2020 10:59, Kashyap Desai wrote:
>
> Hi Kashyap,
>
> >
> >>> We have done some level of testing to know performance impact on SAS
> >>> SSDs and HDD setup. Here is my finding - My testing used - Two
> >>> socket Intel Skylake/Lewisburg/Purley Output of numactl --hardware
> >>>
> >>> available: 2 nodes (0-1)
> >>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39
> >>> 40 41
> >>> 42 43 44 45 46 47 48 49 50 51 52 53
> >>> node 0 size: 31820 MB
> >>> node 0 free: 21958 MB
> >>> node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
> >>> 54
> >>> 55
> >>> 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 32247
> >>> MB node 1 free: 21068 MB node distances:
> >>> node   0   1
> >>>     0:  10  21
> >>>     1:  21  10
>
> Do you have other info, like IRQ-CPU affinity dump and controller PCI
> vendor+device ID? Also /proc/interrupts info would be good after a run,
> like supplied by Sumit here:
>
> https://lore.kernel.org/linux-
> scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-
> L=g@mail.gmail.com/

Controller performance mode is = IOPs which will create 8 extra reply
queues.
In this case it is 72 online CPU + 8 = 80 reply queue (MSIx) driver will
create.

First 8 vectors are non-managed and they are mapped to local numa node -1.

Here is IRQ-CPU affinity -

irq 256, cpu list 18-35,54-71
irq 257, cpu list 18-35,54-71
irq 258, cpu list 18-35,54-71
irq 259, cpu list 18-35,54-71
irq 260, cpu list 18-35,54-71
irq 261, cpu list 18-35,54-71
irq 262, cpu list 18-35,54-71
irq 263, cpu list 18-35,54-71
irq 264, cpu list 18
irq 265, cpu list 19
irq 266, cpu list 20
irq 267, cpu list 21
irq 268, cpu list 22
irq 269, cpu list 23
irq 270, cpu list 24
irq 271, cpu list 25
irq 272, cpu list 26
irq 273, cpu list 27
irq 274, cpu list 28
irq 275, cpu list 29
irq 276, cpu list 30
irq 277, cpu list 31
irq 278, cpu list 32
irq 279, cpu list 33
irq 280, cpu list 34
irq 281, cpu list 35
irq 282, cpu list 54
irq 283, cpu list 55
irq 284, cpu list 56
irq 285, cpu list 57
irq 286, cpu list 58
irq 287, cpu list 59
irq 288, cpu list 60
irq 289, cpu list 61
irq 290, cpu list 62
irq 291, cpu list 63
irq 292, cpu list 64
irq 293, cpu list 65
irq 294, cpu list 66
irq 295, cpu list 67
irq 296, cpu list 68
irq 297, cpu list 69
irq 298, cpu list 70
irq 299, cpu list 71
irq 300, cpu list 0
irq 301, cpu list 1
irq 302, cpu list 2
irq 303, cpu list 3
irq 304, cpu list 4
irq 305, cpu list 5
irq 306, cpu list 6
irq 307, cpu list 7
irq 308, cpu list 8
irq 309, cpu list 9
irq 310, cpu list 10
irq 311, cpu list 11
irq 312, cpu list 12
irq 313, cpu list 13
irq 314, cpu list 14
irq 315, cpu list 15
irq 316, cpu list 16
irq 317, cpu list 17
irq 318, cpu list 36
irq 319, cpu list 37
irq 320, cpu list 38
irq 321, cpu list 39
irq 322, cpu list 40
irq 323, cpu list 41
irq 324, cpu list 42
irq 325, cpu list 43
irq 326, cpu list 44
irq 327, cpu list 45
irq 328, cpu list 46
irq 329, cpu list 47
irq 330, cpu list 48
irq 331, cpu list 49
irq 332, cpu list 50
irq 333, cpu list 51
irq 334, cpu list 52
irq 335, cpu list 53

>
> Are you enabling some special driver perf mode?
>
> >>>
> >>>
> >>> 64 HDD setup -
> >>>
> >>> With higher QD
> >> what's OD?

I mean higher Queue Depth (QD). Higher Queue Depth is required because
congestion will happen at Sdev->queue_depth and shost->can_queue level.
If outstanding is not hitting per sdev queue depth OR shost can queue depth,
we will not see performance issue.

> >>
> >>> and io schedulder = mq-deadline, shared host tag is not scaling well.
> >>>  >>> If
> I use ioscheduler = none, I can see consistent 2.0M IOPs.
> >>> This issue is seen only with RFC. Without RFC mq-deadline scales up to
> >>> 2.0M IOPS.
>
> In theory, from this driver perspective, we should not be making a
> difference. That's after your change to use sdev-> device busy count,
> rather than the hctx nr_active count. As I understand, that's the only
> difference you made.
>
> But I will try an IO scheduler on hisi sas for ssd to see if any
> difference.
>
> >> I didn't try any scheduler. I can have a look at that.
> >>
> >>> Perf Top result of RFC - (IOPS = 1.4M IOPS)
> >>>
> >>>      78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
> >>>        1.46%  [kernel]        [k] sbitmap_any_bit_set
> >>>        1.14%  [kernel]        [k] blk_mq_run_hw_queue
> >>>        0.90%  [kernel]        [k] _mix_pool_bytes
> >>>        0.63%  [kernel]        [k] _raw_spin_lock
> >>>        0.57%  [kernel]        [k] blk_mq_run_hw_queues
> >>>        0.56%  [megaraid_sas]  [k] complete_cmd_fusion
> >>>        0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >>>        0.50%  [kernel]        [k] dd_has_work
> >>>        0.38%  [kernel]        [k] _raw_spin_lock_irqsave
> >>>        0.36%  [kernel]        [k] gup_pgd_range
> >>>        0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
> >>>        0.31%  [kernel]        [k] io_submit_one
> >>>        0.29%  [kernel]        [k] hctx_lock
> >>>        0.26%  [kernel]        [k] try_to_grab_pending
> >>>        0.24%  [kernel]        [k] scsi_queue_rq
> >>>        0.22%  fio             [.] __fio_gettime
> >>>        0.22%  [kernel]        [k] insert_work
> >>>        0.20%  [kernel]        [k] native_irq_return_iret
> >>>
> >>> Perf top without RFC driver - (IOPS = 2.0 M IOPS)
> >>>
> >>>       58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
> >>>        2.06%  [kernel]          [k] _mix_pool_bytes
> >>>        1.38%  [kernel]          [k] _raw_spin_lock_irqsave
> >>>        0.97%  [kernel]          [k] _raw_spin_lock
> >>>        0.91%  [kernel]          [k] scsi_queue_rq
> >>>        0.82%  [kernel]          [k] __sbq_wake_up
> >>>        0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
> >>>        0.74%  [kernel]          [k] scsi_mq_get_budget
> >>>        0.61%  [kernel]          [k] gup_pgd_range
> >>>        0.58%  [kernel]          [k] aio_complete_rw
> >>>        0.52%  [kernel]          [k] elv_rb_add
> >>>        0.50%  [kernel]          [k] llist_add_batch
> >>>        0.50%  [kernel]          [k] native_irq_return_iret
> >>>        0.48%  [kernel]          [k] blk_rq_map_sg
> >>>        0.48%  fio               [.] __fio_gettime
> >>>        0.47%  [kernel]          [k] blk_mq_get_tag
> >>>        0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
> >>>        0.40%  fio               [.] io_u_queued_complete
> >>>        0.39%  fio               [.] get_io_u
> >>>
> >>>
> >>> If you want me to test any top up patch, please let me know.  BTW, we
> >>> also wants to provide module parameter for user to switch back to
> >>> older nr_hw_queue = 1 mode. I will work on that part.
> >> ok, but I would just like to reiterate the point that you will not see
> >> the
> >> full
> >> benefit of blk-mq draining hw queues for cpu hotplug since you hide hw
> >> queues from blk-mq.
> > Agree. We have done  minimal testing using this RFC. We want to ACK this
> RFC
> > as long as primary performance goal is achieved.
> >
> > We have done full testing on nr_hw_queue =1 (and that is what customer
> > is
> > using) so we at least want to give that interface available for customer
> > for
> > some time (assuming they may find some performance gap between two
> interface
> > which we may not have encountered during smoke testing.).
> > Over a period of time, if nr_hw_queue = N works for (Broadcom will
> > conduct
> > full performance once RFC is committed in upstream) all the IO profiles,
> > we
> > will share the information with customer about benefit of using
> nr_hw_queues
> > =  N.
>
> Hopefully you can use nr_hw_queues = N always.
>
> >
>
> Thanks,
> john

John Garry April 21, 2020, 12:35 p.m. UTC | #6

On 20/04/2020 18:47, Kashyap Desai wrote:
>> ther info, like IRQ-CPU affinity dump and controller PCI
>> vendor+device ID? Also /proc/interrupts info would be good after a run,
>> like supplied by Sumit here:
>>
>> https://lore.kernel.org/linux-
>> scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-
>> L=g@mail.gmail.com/
> Controller performance mode is = IOPs which will create 8 extra reply
> queues.
> In this case it is 72 online CPU + 8 = 80 reply queue (MSIx) driver will
> create.
> 
> First 8 vectors are non-managed and they are mapped to local numa node -1.
> 
> Here is IRQ-CPU affinity -
> 
> irq 256, cpu list 18-35,54-71
> irq 257, cpu list 18-35,54-71
> irq 258, cpu list 18-35,54-71
> irq 259, cpu list 18-35,54-71
> irq 260, cpu list 18-35,54-71
> irq 261, cpu list 18-35,54-71
> irq 262, cpu list 18-35,54-71
> irq 263, cpu list 18-35,54-71
> irq 264, cpu list 18
> irq 265, cpu list 19
> irq 266, cpu list 20
> irq 267, cpu list 21
> irq 268, cpu list 22
> irq 269, cpu list 23
> irq 270, cpu list 24
> irq 271, cpu list 25
> irq 272, cpu list 26
> irq 273, cpu list 27
> irq 274, cpu list 28
> irq 275, cpu list 29
> irq 276, cpu list 30
> irq 277, cpu list 31
> irq 278, cpu list 32
> irq 279, cpu list 33
> irq 280, cpu list 34
> irq 281, cpu list 35
> irq 282, cpu list 54
> irq 283, cpu list 55
> irq 284, cpu list 56
> irq 285, cpu list 57
> irq 286, cpu list 58
> irq 287, cpu list 59
> irq 288, cpu list 60
> irq 289, cpu list 61
> irq 290, cpu list 62
> irq 291, cpu list 63
> irq 292, cpu list 64
> irq 293, cpu list 65
> irq 294, cpu list 66
> irq 295, cpu list 67
> irq 296, cpu list 68
> irq 297, cpu list 69
> irq 298, cpu list 70
> irq 299, cpu list 71
> irq 300, cpu list 0
> irq 301, cpu list 1
> irq 302, cpu list 2
> irq 303, cpu list 3
> irq 304, cpu list 4
> irq 305, cpu list 5
> irq 306, cpu list 6
> irq 307, cpu list 7
> irq 308, cpu list 8
> irq 309, cpu list 9
> irq 310, cpu list 10
> irq 311, cpu list 11
> irq 312, cpu list 12
> irq 313, cpu list 13
> irq 314, cpu list 14
> irq 315, cpu list 15
> irq 316, cpu list 16
> irq 317, cpu list 17
> irq 318, cpu list 36
> irq 319, cpu list 37
> irq 320, cpu list 38
> irq 321, cpu list 39
> irq 322, cpu list 40
> irq 323, cpu list 41
> irq 324, cpu list 42
> irq 325, cpu list 43
> irq 326, cpu list 44
> irq 327, cpu list 45
> irq 328, cpu list 46
> irq 329, cpu list 47
> irq 330, cpu list 48
> irq 331, cpu list 49
> irq 332, cpu list 50
> irq 333, cpu list 51
> irq 334, cpu list 52
> irq 335, cpu list 53
> 
>> Are you enabling some special driver perf mode?
>>

ok, thanks.

So I tested this on hisi_sas with x12 SAS SSDs, and performance with 
"mq-deadline" is comparable with "none" @ ~ 2M IOPs. But after a while 
performance drops alot, to maybe 700K IOPS. Do you have a similar 
experience?

Anyway, I'll have a look.

Thanks,
John

Kashyap Desai April 22, 2020, 6:59 p.m. UTC | #7

>
> So I tested this on hisi_sas with x12 SAS SSDs, and performance with "mq-
> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
> performance drops alot, to maybe 700K IOPS. Do you have a similar
> experience?

I am using mq-deadline only for HDD. I have not tried on SSD since it is not
useful scheduler for SSDs.

I noticed that when I used mq-deadline, performance drop starts if I have
more number of drives.
I am running <fio> script which has 64 Drives, 64 thread and all treads are
bound to local numa node which has 36 logical cores.
I noticed that lock contention is in " dd_dispatch_request". I am not sure
why there is a no penalty of same lock in nr_hw_queue  = 1 mode.

static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
        struct deadline_data *dd = hctx->queue->elevator->elevator_data;
        struct request *rq;

        spin_lock(&dd->lock);
        rq = __dd_dispatch_request(dd);
        spin_unlock(&dd->lock);

        return rq;
}

Here is perf report -

-    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath
     0.99% ret_from_fork
    -   kthread
      - worker_thread
         - 0.98% process_one_work
            - 0.98% __blk_mq_run_hw_queue
               - blk_mq_sched_dispatch_requests
                  - 0.98% blk_mq_do_dispatch_sched
                     - 0.97% dd_dispatch_request
                        + 0.97% queued_spin_lock_slowpath
+    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
queued_spin_lock_slowpath
+    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath
+    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
queued_spin_lock_slowpath
+    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath
+    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
queued_spin_lock_slowpath
+    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath


>
> Anyway, I'll have a look.
>
> Thanks,
> John

John Garry April 22, 2020, 9:28 p.m. UTC | #8

On 22/04/2020 19:59, Kashyap Desai wrote:
>>

Hi Kashyap,

>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with "mq-
>> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
>> performance drops alot, to maybe 700K IOPS. Do you have a similar
>> experience?
> 
> I am using mq-deadline only for HDD. I have not tried on SSD since it is not
> useful scheduler for SSDs.
> 

I ask as I only have SAS SSDs to test.

> I noticed that when I used mq-deadline, performance drop starts if I have
> more number of drives.
> I am running <fio> script which has 64 Drives, 64 thread and all treads are
> bound to local numa node which has 36 logical cores.
> I noticed that lock contention is in " dd_dispatch_request". I am not sure
> why there is a no penalty of same lock in nr_hw_queue  = 1 mode.

So this could be just pre-existing issue of exposing multiple queues for 
SCSI HBAs combined with mq-deadline iosched. I mean, that's really the 
only significant change in this series, apart from the shared sbitmap, 
and, at this point, I don't think that is the issue.

> 
> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> {
>          struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>          struct request *rq;
> 
>          spin_lock(&dd->lock);

So if multiple hctx's are accessing this lock, then much contention 
possible.

>          rq = __dd_dispatch_request(dd);
>          spin_unlock(&dd->lock);
> 
>          return rq;
> }
> 
> Here is perf report -
> 
> -    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
>       0.99% ret_from_fork
>      -   kthread
>        - worker_thread
>           - 0.98% process_one_work
>              - 0.98% __blk_mq_run_hw_queue
>                 - blk_mq_sched_dispatch_requests
>                    - 0.98% blk_mq_do_dispatch_sched
>                       - 0.97% dd_dispatch_request
>                          + 0.97% queued_spin_lock_slowpath
> +    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> queued_spin_lock_slowpath
> +    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
> +    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> queued_spin_lock_slowpath
> +    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
> +    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> queued_spin_lock_slowpath
> +    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
> 

I'll try to capture a perf report and compare to mine.

Thanks very much,
john

John Garry April 23, 2020, 4:31 p.m. UTC | #9

> 
>>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with 
>>> "mq-
>>> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
>>> performance drops alot, to maybe 700K IOPS. Do you have a similar
>>> experience?
>>
>> I am using mq-deadline only for HDD. I have not tried on SSD since it 
>> is not
>> useful scheduler for SSDs.
>>
> 
> I ask as I only have SAS SSDs to test.
> 
>> I noticed that when I used mq-deadline, performance drop starts if I have
>> more number of drives.
>> I am running <fio> script which has 64 Drives, 64 thread and all 
>> treads are
>> bound to local numa node which has 36 logical cores.
>> I noticed that lock contention is in " dd_dispatch_request". I am not 
>> sure
>> why there is a no penalty of same lock in nr_hw_queue  = 1 mode.
> 
> So this could be just pre-existing issue of exposing multiple queues for 
> SCSI HBAs combined with mq-deadline iosched. I mean, that's really the 
> only significant change in this series, apart from the shared sbitmap, 
> and, at this point, I don't think that is the issue.

As an experiment, I modified hisi_sas mainline driver to expose hw 
queues and manage tags itself, and I see the same issue I mentioned:

Jobs: 12 (f=12): [R(12)] [14.8% done] [7592MB/0KB/0KB /s] [1943K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [16.4% done] [7949MB/0KB/0KB /s] [2035K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [18.0% done] [7940MB/0KB/0KB /s] [2033K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [19.7% done] [7984MB/0KB/0KB /s] [2044K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [21.3% done] [7984MB/0KB/0KB /s] [2044K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [23.0% done] [2964MB/0KB/0KB /s] [759K/0/0 
iops] [eta 0
Jobs: 12 (f=12): [R(12)] [24.6% done] [2417MB/0KB/0KB /s] [619K/0/0 
iops] [eta 0
Jobs: 12 (f=12): [R(12)] [26.2% done] [2909MB/0KB/0KB /s] [745K/0/0 
iops] [eta 0
Jobs: 12 (f=12): [R(12)] [27.9% done] [2366MB/0KB/0KB /s] [606K/0/0 
iops] [eta 0

The odd time I see "sched: RT throttling activated" around the time the 
throughput falls. I think issue is the per-queue threaded irq threaded 
handlers consuming too many cycles. With "none" io scheduler, IOPS is 
flat at around 2M.

> 
>>
>> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> {
>>          struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>>          struct request *rq;
>>
>>          spin_lock(&dd->lock);
> 
> So if multiple hctx's are accessing this lock, then much contention 
> possible.
> 
>>          rq = __dd_dispatch_request(dd);
>>          spin_unlock(&dd->lock);
>>
>>          return rq;
>> }
>>
>> Here is perf report -
>>
>> -    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>>       0.99% ret_from_fork
>>      -   kthread
>>        - worker_thread
>>           - 0.98% process_one_work
>>              - 0.98% __blk_mq_run_hw_queue
>>                 - blk_mq_sched_dispatch_requests
>>                    - 0.98% blk_mq_do_dispatch_sched
>>                       - 0.97% dd_dispatch_request
>>                          + 0.97% queued_spin_lock_slowpath
>> +    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
>> queued_spin_lock_slowpath
>> +    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>> +    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
>> queued_spin_lock_slowpath
>> +    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>> +    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
>> queued_spin_lock_slowpath
>> +    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>>
> 
> I'll try to capture a perf report and compare to mine.

Mine is spending a huge amount of time (circa 33% on a cpu servicing 
completion irqs) in mod_delayed_work_on():

--79.89%--sas_scsi_task_done |
    |--76.72%--scsi_mq_done
    |    |
    |     --76.53%--blk_mq_complete_request
    |    |
    |    |--74.81%--scsi_softirq_done
    |    |    |
    |    |     --73.91%--scsi_finish_command
    |    |    |
    |    |    |--72.11%--scsi_io_completion
    |    |    |    |
    |    |    |     --71.89%--scsi_end_request
    |    |    |    |
    |    |    |    |--40.82%--blk_mq_run_hw_queues
    |    |    |    |    |
    |    |    |    |    |--35.86%--blk_mq_run_hw_queue
    |    |    |    |    |    |
    |    |    |    |    |     --33.59%--__blk_mq_delay_run_hw_queue
    |    |    |    |    |    |
    |    |    |    |    |     --33.38%--kblockd_mod_delayed_work_on
    |    |    |    |    |          |
    |    |    |    |    |                --33.31%--mod_delayed_work_on

hmmmm...

Thanks,
John

Kashyap Desai April 24, 2020, 4:31 p.m. UTC | #10

> >
> >>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with
> >>> "mq-
> >>> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
> >>> performance drops alot, to maybe 700K IOPS. Do you have a similar
> >>> experience?
> >>
> >> I am using mq-deadline only for HDD. I have not tried on SSD since it
> >> is not useful scheduler for SSDs.
> >>
> >
> > I ask as I only have SAS SSDs to test.
> >
> >> I noticed that when I used mq-deadline, performance drop starts if I
> >> have
> >> more number of drives.
> >> I am running <fio> script which has 64 Drives, 64 thread and all
> >> treads are
> >> bound to local numa node which has 36 logical cores.
> >> I noticed that lock contention is in " dd_dispatch_request". I am not
> >> sure
> >> why there is a no penalty of same lock in nr_hw_queue  = 1 mode.
> >
> > So this could be just pre-existing issue of exposing multiple queues for
> > SCSI HBAs combined with mq-deadline iosched. I mean, that's really the
> > only significant change in this series, apart from the shared sbitmap,
> > and, at this point, I don't think that is the issue.
>
> As an experiment, I modified hisi_sas mainline driver to expose hw
> queues and manage tags itself, and I see the same issue I mentioned:
>
> Jobs: 12 (f=12): [R(12)] [14.8% done] [7592MB/0KB/0KB /s] [1943K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [16.4% done] [7949MB/0KB/0KB /s] [2035K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [18.0% done] [7940MB/0KB/0KB /s] [2033K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [19.7% done] [7984MB/0KB/0KB /s] [2044K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [21.3% done] [7984MB/0KB/0KB /s] [2044K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [23.0% done] [2964MB/0KB/0KB /s] [759K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [24.6% done] [2417MB/0KB/0KB /s] [619K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [26.2% done] [2909MB/0KB/0KB /s] [745K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [27.9% done] [2366MB/0KB/0KB /s] [606K/0/0
> iops] [eta 0
>
> The odd time I see "sched: RT throttling activated" around the time the
> throughput falls. I think issue is the per-queue threaded irq threaded
> handlers consuming too many cycles. With "none" io scheduler, IOPS is
> flat at around 2M.
>
> >
> >>
> >> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> >> {
> >>          struct deadline_data *dd =
> >> hctx->queue->elevator->elevator_data;
> >>          struct request *rq;
> >>
> >>          spin_lock(&dd->lock);
> >
> > So if multiple hctx's are accessing this lock, then much contention
> > possible.
> >
> >>          rq = __dd_dispatch_request(dd);
> >>          spin_unlock(&dd->lock);
> >>
> >>          return rq;
> >> }
> >>
> >> Here is perf report -
> >>
> >> -    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >>       0.99% ret_from_fork
> >>      -   kthread
> >>        - worker_thread
> >>           - 0.98% process_one_work
> >>              - 0.98% __blk_mq_run_hw_queue
> >>                 - blk_mq_sched_dispatch_requests
> >>                    - 0.98% blk_mq_do_dispatch_sched
> >>                       - 0.97% dd_dispatch_request
> >>                          + 0.97% queued_spin_lock_slowpath
> >> +    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >> +    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >> +    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >>
> >
> > I'll try to capture a perf report and compare to mine.
>
> Mine is spending a huge amount of time (circa 33% on a cpu servicing
> completion irqs) in mod_delayed_work_on():
>
> --79.89%--sas_scsi_task_done |
>     |--76.72%--scsi_mq_done
>     |    |
>     |     --76.53%--blk_mq_complete_request
>     |    |
>     |    |--74.81%--scsi_softirq_done
>     |    |    |
>     |    |     --73.91%--scsi_finish_command
>     |    |    |
>     |    |    |--72.11%--scsi_io_completion
>     |    |    |    |
>     |    |    |     --71.89%--scsi_end_request
>     |    |    |    |
>     |    |    |    |--40.82%--blk_mq_run_hw_queues
>     |    |    |    |    |
>     |    |    |    |    |--35.86%--blk_mq_run_hw_queue
>     |    |    |    |    |    |
>     |    |    |    |    |     --33.59%--__blk_mq_delay_run_hw_queue
>     |    |    |    |    |    |
>     |    |    |    |    |     --33.38%--kblockd_mod_delayed_work_on
>     |    |    |    |    |          |
>     |    |    |    |    |                --33.31%--mod_delayed_work_on
>
> hmmmm...

I did some more experiments. It looks like issue is with both <none> and
<mq-deadline> scheduler.  Let me simplify what happens with ioscheduler =
<none>.

Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue depth
= 128. We get 3.1M IOPS in this config. This eventually exhaust host
can_queue.
Note - Very low contention in sbitmap_get()

-   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
blk_mq_make_request
   - 23.33% blk_mq_make_request
      - 21.68% blk_mq_get_request
         - 20.19% blk_mq_get_tag
            + 10.08% prepare_to_wait_exclusive
            + 4.51% io_schedule
            - 3.59% __sbitmap_queue_get
               - 2.82% sbitmap_get
                    0.86% __sbitmap_get_word
                    0.75% _raw_spin_lock_irqsave
                    0.55% _raw_spin_unlock_irqrestore

Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>  queue
depth = 128. We get 2.3 M IOPS in this config. This eventually exhaust host
can_queue.
Note - Very high contention in sbitmap_get()

-   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
generic_make_request
   - 42.27% generic_make_request
      - 41.00% blk_mq_make_request
         - 38.28% blk_mq_get_request
            - 33.76% blk_mq_get_tag
               - 30.25% __sbitmap_queue_get
                  - 29.90% sbitmap_get
                     + 9.06% _raw_spin_lock_irqsave
                     + 7.94% _raw_spin_unlock_irqrestore
                     + 3.86% __sbitmap_get_word
                     + 1.78% call_function_single_interrupt
                     + 0.67% ret_from_intr
               + 1.69% io_schedule
                 0.59% prepare_to_wait_exclusive
                 0.55% __blk_mq_get_tag

In this particular case, I observed alloc_hint = zeros which means,
sbitmap_get is not able to find free tags from hint. That may lead to
contention.
This condition is not happening with nr_hw_queue=1 (without RFC) driver.

alloc_hint=
{663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212, 836,
1892, 1669, 2420,
3415, 1904, 512, 3027, 4810, 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915,
3897, 1349, 547, 4, 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097,
418, 4501, 3717,
2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}

Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>  queue
depth = 32. We get 3.1M IOPS in this config. This workload does *not*
exhaust host can_queue.

-    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
generic_make_request
   - 4.93% generic_make_request
      - 3.61% blk_mq_make_request
         - 2.04% blk_mq_get_request
            - 1.08% blk_mq_get_tag
               - 0.70% __sbitmap_queue_get
                    0.67% sbitmap_get

In summary, RFC has some performance bottleneck in sbitmap_get () if
outstanding per shost is about to exhaust.  Without this RFC also driver
works in nr_hw_queue = 1, but that case is managed very well.
I am not sure why it happens only with shared host tag ? Theoretically all
the hctx is sharing the same bitmaptag which is same as nr_hw_queue=1, so
why contention is only visible in shared host tag case.

If you want to reproduce this issue, may be you have to reduce the can_queue
in hisi_sas driver.

Kashyap

>
> Thanks,
> John

John Garry April 27, 2020, 5:06 p.m. UTC | #11

Hi Kashyap,

>> hmmmm...
> 
> I did some more experiments. It looks like issue is with both <none> and
> <mq-deadline> scheduler.  Let me simplify what happens with ioscheduler =
> <none>.

I know it's good to compare like-for-like, but, as I understand, "none" 
is more suited for MQ host, while deadline is more suited for SQ host.

> 
> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue depth
> = 128. We get 3.1M IOPS in this config. This eventually exhaust host
> can_queue.

So I think I need to find a point where we start to get throttled.

> Note - Very low contention in sbitmap_get()
> 
> -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
> blk_mq_make_request
>     - 23.33% blk_mq_make_request
>        - 21.68% blk_mq_get_request
>           - 20.19% blk_mq_get_tag
>              + 10.08% prepare_to_wait_exclusive
>              + 4.51% io_schedule
>              - 3.59% __sbitmap_queue_get
>                 - 2.82% sbitmap_get
>                      0.86% __sbitmap_get_word
>                      0.75% _raw_spin_lock_irqsave
>                      0.55% _raw_spin_unlock_irqrestore
> 
> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>  queue
> depth = 128. We get 2.3 M IOPS in this config. This eventually exhaust host
> can_queue.
> Note - Very high contention in sbitmap_get()
> 
> -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
> generic_make_request
>     - 42.27% generic_make_request
>        - 41.00% blk_mq_make_request
>           - 38.28% blk_mq_get_request
>              - 33.76% blk_mq_get_tag
>                 - 30.25% __sbitmap_queue_get
>                    - 29.90% sbitmap_get
>                       + 9.06% _raw_spin_lock_irqsave
>                       + 7.94% _raw_spin_unlock_irqrestore
>                       + 3.86% __sbitmap_get_word
>                       + 1.78% call_function_single_interrupt
>                       + 0.67% ret_from_intr
>                 + 1.69% io_schedule
>                   0.59% prepare_to_wait_exclusive
>                   0.55% __blk_mq_get_tag
> 
> In this particular case, I observed alloc_hint = zeros which means,
> sbitmap_get is not able to find free tags from hint. That may lead to
> contention.
> This condition is not happening with nr_hw_queue=1 (without RFC) driver.
> 
> alloc_hint=
> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
> ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212, 836,
> 1892, 1669, 2420,
> 3415, 1904, 512, 3027, 4810, 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915,
> 3897, 1349, 547, 4, 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097,
> 418, 4501, 3717,
> 2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}
> 
> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>  queue
> depth = 32. We get 3.1M IOPS in this config. This workload does *not*
> exhaust host can_queue.

Please ensure .host_tagset is set for whenever nr_hw_queue = N. This is 
as per RFC, and I don't think you modified from the RFC for your test. 
But I just wanted to mention that to be crystal clear.

> 
> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
> generic_make_request
>     - 4.93% generic_make_request
>        - 3.61% blk_mq_make_request
>           - 2.04% blk_mq_get_request
>              - 1.08% blk_mq_get_tag
>                 - 0.70% __sbitmap_queue_get
>                      0.67% sbitmap_get
> 
> In summary, RFC has some performance bottleneck in sbitmap_get () if
> outstanding per shost is about to exhaust.  Without this RFC also driver
> works in nr_hw_queue = 1, but that case is managed very well.
> I am not sure why it happens only with shared host tag ? Theoretically all
> the hctx is sharing the same bitmaptag which is same as nr_hw_queue=1, so
> why contention is only visible in shared host tag case.

Let me check this.

> 
> If you want to reproduce this issue, may be you have to reduce the can_queue
> in hisi_sas driver.
> 

Thanks,
John

Kashyap Desai April 27, 2020, 6:58 p.m. UTC | #12

> Hi Kashyap,
>
> >> hmmmm...
> >
> > I did some more experiments. It looks like issue is with both <none>
> > and <mq-deadline> scheduler.  Let me simplify what happens with
> > ioscheduler = <none>.
>
> I know it's good to compare like-for-like, but, as I understand, "none"
> is more suited for MQ host, while deadline is more suited for SQ host.

I am using <mq-deadline> which is MQ version of SQ deadline.
For now we can just consider case without scheduler.

I have some more findings. May be someone from upstream community can
connect the dots.

#1. hctx_may_queue() throttle the IO at hctx level. This is eventually per
sdev throttling for megaraid_sas driver because It creates only one
context - hctx0 for each scsi device.

If driver is using only one h/w queue,  active_queues will be always steady.
In my test it was 64 thread, so active_queues=64.
Even though <fio> thread is shared among allowed cpumask
(cpus_allowed_policy=shared option in fio),  active_queues will be always 64
because we have only one h/w queue.
All the logical cpus are mapped to one h/w queue. It means, thread moving
from one cpu to another cpu will not change active_queues per hctx.

In case of this RFC, active_queues are now per hctx and there are multiple
hctx, but tags are shared. This can create unwanted throttling and
eventually more lock contention in sbitmap.
I added below patch and things improved a bit, but not a full proof.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6..c708fbc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
  * For shared tag users, we track the number of currently active users
  * and attempt to provide a fair share of the tag depth for each of them.
  */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+static inline bool hctx_may_queue(struct request_queue *q,
+                                 struct blk_mq_hw_ctx *hctx,
                                  struct sbitmap_queue *bt)
 {
-       unsigned int depth, users;
+       unsigned int depth, users, i, outstanding = 0;
+       struct blk_mq_hw_ctx *hctx_iter;

        if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
                return true;
@@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx
*hctx,
         * Allow at least some tags
         */
        depth = max((bt->sb.depth + users - 1) / users, 4U);
-       return atomic_read(&hctx->nr_active) < depth;
+
+       queue_for_each_hw_ctx(q, hctx_iter, i)
+               outstanding += atomic_read(&hctx_iter->nr_active);
+
+       return outstanding < depth;
 }

 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
                            struct sbitmap_queue *bt)
 {
        if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
-           !hctx_may_queue(data->hctx, bt))


#2 - In completion path, scsi module call blk_mq_run_hw_queues() upon IO
completion.  I am not sure if it is good to run all the h/w queue or just
h/w queue of current reference is good enough.
Below patch helped to reduce contention in hcxt_lock().

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41..f72de2a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
blk_status_t error,
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
        struct scsi_device *sdev = cmd->device;
        struct request_queue *q = sdev->request_queue;
+       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;

        if (blk_update_request(req, error, bytes))
                return true;
@@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
blk_status_t error,
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
        else
-               blk_mq_run_hw_queues(q, true);
+               blk_mq_run_hw_queue(mq_hctx, true);

        percpu_ref_put(&q->q_usage_counter);
        return false;

#3 -  __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all which may not be
optimal for shared queue.
There is a penalty if we are calling __sbq_wake_up() frequently. In case of
nr_hw_queue = 1, things are better because one hctx and hctx->state will
avoid multiple calls.
If blk_mq_tag_wakeup_all is called from hctx0 context, there is no need to
call from hctx1, hctx2 etc.

I have added below patch in my testing.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6..5b331e5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)

        atomic_dec(&tags->active_queues);

-       blk_mq_tag_wakeup_all(tags, false);
+       /* TBD - Do this only for hctx->flags & BLK_MQ_F_TAG_HCTX_SHARED */
+       if (hctx->queue_num == 0)
+               blk_mq_tag_wakeup_all(tags, false);
 }

 /*


With all above mentioned changes, I see performance improved from 2.2M IOPS
to 2.7M on same workload and profile.


Kashyap

>
> >
> > Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue
> > depth = 128. We get 3.1M IOPS in this config. This eventually exhaust
> > host can_queue.
>
> So I think I need to find a point where we start to get throttled.
>
> > Note - Very low contention in sbitmap_get()
> >
> > -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
> > blk_mq_make_request
> >     - 23.33% blk_mq_make_request
> >        - 21.68% blk_mq_get_request
> >           - 20.19% blk_mq_get_tag
> >              + 10.08% prepare_to_wait_exclusive
> >              + 4.51% io_schedule
> >              - 3.59% __sbitmap_queue_get
> >                 - 2.82% sbitmap_get
> >                      0.86% __sbitmap_get_word
> >                      0.75% _raw_spin_lock_irqsave
> >                      0.55% _raw_spin_unlock_irqrestore
> >
> > Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
> > queue depth = 128. We get 2.3 M IOPS in this config. This eventually
> > exhaust host can_queue.
> > Note - Very high contention in sbitmap_get()
> >
> > -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
> > generic_make_request
> >     - 42.27% generic_make_request
> >        - 41.00% blk_mq_make_request
> >           - 38.28% blk_mq_get_request
> >              - 33.76% blk_mq_get_tag
> >                 - 30.25% __sbitmap_queue_get
> >                    - 29.90% sbitmap_get
> >                       + 9.06% _raw_spin_lock_irqsave
> >                       + 7.94% _raw_spin_unlock_irqrestore
> >                       + 3.86% __sbitmap_get_word
> >                       + 1.78% call_function_single_interrupt
> >                       + 0.67% ret_from_intr
> >                 + 1.69% io_schedule
> >                   0.59% prepare_to_wait_exclusive
> >                   0.55% __blk_mq_get_tag
> >
> > In this particular case, I observed alloc_hint = zeros which means,
> > sbitmap_get is not able to find free tags from hint. That may lead to
> > contention.
> > This condition is not happening with nr_hw_queue=1 (without RFC) driver.
> >
> > alloc_hint=
> > {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
> > ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212,
> > 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810, 2845, 4690, 712,
> > 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4, 733, 1765, 2068, 979,
> > 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501, 3717, 2893, 604, 508,
> > 759, 3329, 4038, 4829, 715, 842, 1443, 556}
> >
> > Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
> > queue depth = 32. We get 3.1M IOPS in this config. This workload does
> > *not* exhaust host can_queue.
>
> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This is as
> per
> RFC, and I don't think you modified from the RFC for your test.
> But I just wanted to mention that to be crystal clear.

Yes I have two separate driver copy. One with RFC change and another without
RFC.
>
> >
> > -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
> > generic_make_request
> >     - 4.93% generic_make_request
> >        - 3.61% blk_mq_make_request
> >           - 2.04% blk_mq_get_request
> >              - 1.08% blk_mq_get_tag
> >                 - 0.70% __sbitmap_queue_get
> >                      0.67% sbitmap_get
> >
> > In summary, RFC has some performance bottleneck in sbitmap_get () if
> > outstanding per shost is about to exhaust.  Without this RFC also
> > driver works in nr_hw_queue = 1, but that case is managed very well.
> > I am not sure why it happens only with shared host tag ? Theoretically
> > all the hctx is sharing the same bitmaptag which is same as
> > nr_hw_queue=1, so why contention is only visible in shared host tag
> > case.
>
> Let me check this.
>
> >
> > If you want to reproduce this issue, may be you have to reduce the
> > can_queue in hisi_sas driver.
> >
>
> Thanks,
> John

John Garry April 28, 2020, 3:55 p.m. UTC | #13

Hi Kashyap,

> I am using <mq-deadline> which is MQ version of SQ deadline.
> For now we can just consider case without scheduler.
> 
> I have some more findings. May be someone from upstream community can
> connect the dots.
> 
> #1. hctx_may_queue() throttle the IO at hctx level. This is eventually per
> sdev throttling for megaraid_sas driver because It creates only one
> context - hctx0 for each scsi device.
> 
> If driver is using only one h/w queue,  active_queues will be always steady.
> In my test it was 64 thread, so active_queues=64.

So I figure that 64 threads comes from 64 having disks.

> Even though <fio> thread is shared among allowed cpumask
> (cpus_allowed_policy=shared option in fio),  active_queues will be always 64
> because we have only one h/w queue.
> All the logical cpus are mapped to one h/w queue. It means, thread moving
> from one cpu to another cpu will not change active_queues per hctx.
> 
> In case of this RFC, active_queues are now per hctx and there are multiple
> hctx, but tags are shared. 

Right, so we need a policy to divide up the shared tags across request 
queues, based on principle of fairness.

This can create unwanted throttling and
> eventually more lock contention in sbitmap.
> I added below patch and things improved a bit, but not a full proof.
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6..c708fbc 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>    * For shared tag users, we track the number of currently active users
>    * and attempt to provide a fair share of the tag depth for each of them.
>    */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> +static inline bool hctx_may_queue(struct request_queue *q,
> +                                 struct blk_mq_hw_ctx *hctx,
>                                    struct sbitmap_queue *bt)
>   {
> -       unsigned int depth, users;
> +       unsigned int depth, users, i, outstanding = 0;
> +       struct blk_mq_hw_ctx *hctx_iter;
> 
>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
>                  return true;
> @@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx
> *hctx,
>           * Allow at least some tags
>           */
>          depth = max((bt->sb.depth + users - 1) / users, 4U);
> -       return atomic_read(&hctx->nr_active) < depth;
> +
> +       queue_for_each_hw_ctx(q, hctx_iter, i)
> +               outstanding += atomic_read(&hctx_iter->nr_active);
> +

OK,  I think that we need to find a cleaner way to do this.

> +       return outstanding < depth;
>   }
> 
>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>                              struct sbitmap_queue *bt)
>   {
>          if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> -           !hctx_may_queue(data->hctx, bt))
> 
> 
> #2 - In completion path, scsi module call blk_mq_run_hw_queues() upon IO
> completion.  I am not sure if it is good to run all the h/w queue or just
> h/w queue of current reference is good enough.
> Below patch helped to reduce contention in hcxt_lock().
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 610ee41..f72de2a 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>          struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>          struct scsi_device *sdev = cmd->device;
>          struct request_queue *q = sdev->request_queue;
> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> 
>          if (blk_update_request(req, error, bytes))
>                  return true;
> @@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>              !list_empty(&sdev->host->starved_list))
>                  kblockd_schedule_work(&sdev->requeue_work);
>          else
> -               blk_mq_run_hw_queues(q, true);
> +               blk_mq_run_hw_queue(mq_hctx, true);

Not sure on this. But, indeed, I found running all queues did add lots 
of extra load for when enabling the deadline scheduler.

> 
>          percpu_ref_put(&q->q_usage_counter);
>          return false;
> 
> #3 -  __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all which may not be
> optimal for shared queue.
> There is a penalty if we are calling __sbq_wake_up() frequently. In case of
> nr_hw_queue = 1, things are better because one hctx and hctx->state will
> avoid multiple calls.
> If blk_mq_tag_wakeup_all is called from hctx0 context, there is no need to
> call from hctx1, hctx2 etc.
> 
> I have added below patch in my testing.
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6..5b331e5 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
> 
>          atomic_dec(&tags->active_queues);
> 
> -       blk_mq_tag_wakeup_all(tags, false);
> +       /* TBD - Do this only for hctx->flags & BLK_MQ_F_TAG_HCTX_SHARED */
> +       if (hctx->queue_num == 0)
> +               blk_mq_tag_wakeup_all(tags, false);

ok, I see. But, again, I think we need to find a cleaner way to do this.

>   }
> 
>   /*
> 
> 
> With all above mentioned changes, I see performance improved from 2.2M IOPS
> to 2.7M on same workload and profile.

But still short of nr_hw_queue = 1, which got 3.1M IOPS, as below, right?

Thanks,
John

> 
>>
>>>
>>> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue
>>> depth = 128. We get 3.1M IOPS in this config. This eventually exhaust
>>> host can_queue.
>>
>> So I think I need to find a point where we start to get throttled.
>>
>>> Note - Very low contention in sbitmap_get()
>>>
>>> -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
>>> blk_mq_make_request
>>>      - 23.33% blk_mq_make_request
>>>         - 21.68% blk_mq_get_request
>>>            - 20.19% blk_mq_get_tag
>>>               + 10.08% prepare_to_wait_exclusive
>>>               + 4.51% io_schedule
>>>               - 3.59% __sbitmap_queue_get
>>>                  - 2.82% sbitmap_get
>>>                       0.86% __sbitmap_get_word
>>>                       0.75% _raw_spin_lock_irqsave
>>>                       0.55% _raw_spin_unlock_irqrestore
>>>
>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>> queue depth = 128. We get 2.3 M IOPS in this config. This eventually
>>> exhaust host can_queue.
>>> Note - Very high contention in sbitmap_get()
>>>
>>> -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
>>> generic_make_request
>>>      - 42.27% generic_make_request
>>>         - 41.00% blk_mq_make_request
>>>            - 38.28% blk_mq_get_request
>>>               - 33.76% blk_mq_get_tag
>>>                  - 30.25% __sbitmap_queue_get
>>>                     - 29.90% sbitmap_get
>>>                        + 9.06% _raw_spin_lock_irqsave
>>>                        + 7.94% _raw_spin_unlock_irqrestore
>>>                        + 3.86% __sbitmap_get_word
>>>                        + 1.78% call_function_single_interrupt
>>>                        + 0.67% ret_from_intr
>>>                  + 1.69% io_schedule
>>>                    0.59% prepare_to_wait_exclusive
>>>                    0.55% __blk_mq_get_tag
>>>
>>> In this particular case, I observed alloc_hint = zeros which means,
>>> sbitmap_get is not able to find free tags from hint. That may lead to
>>> contention.
>>> This condition is not happening with nr_hw_queue=1 (without RFC) driver.
>>>
>>> alloc_hint=
>>> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
>>> ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212,
>>> 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810, 2845, 4690, 712,
>>> 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4, 733, 1765, 2068, 979,
>>> 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501, 3717, 2893, 604, 508,
>>> 759, 3329, 4038, 4829, 715, 842, 1443, 556}
>>>
>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>> queue depth = 32. We get 3.1M IOPS in this config. This workload does
>>> *not* exhaust host can_queue.
>>
>> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This is as
>> per
>> RFC, and I don't think you modified from the RFC for your test.
>> But I just wanted to mention that to be crystal clear.
> 
> Yes I have two separate driver copy. One with RFC change and another without
> RFC.
>>
>>>
>>> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
>>> generic_make_request
>>>      - 4.93% generic_make_request
>>>         - 3.61% blk_mq_make_request
>>>            - 2.04% blk_mq_get_request
>>>               - 1.08% blk_mq_get_tag
>>>                  - 0.70% __sbitmap_queue_get
>>>                       0.67% sbitmap_get
>>>
>>> In summary, RFC has some performance bottleneck in sbitmap_get () if
>>> outstanding per shost is about to exhaust.  Without this RFC also
>>> driver works in nr_hw_queue = 1, but that case is managed very well.
>>> I am not sure why it happens only with shared host tag ? Theoretically
>>> all the hctx is sharing the same bitmaptag which is same as
>>> nr_hw_queue=1, so why contention is only visible in shared host tag
>>> case.
>>
>> Let me check this.
>>
>>>
>>> If you want to reproduce this issue, may be you have to reduce the
>>> can_queue in hisi_sas driver.
>>>
>>
>> Thanks,
>> John

John Garry April 29, 2020, 11:29 a.m. UTC | #14

On 28/04/2020 16:55, John Garry wrote:
> Hi Kashyap,
> 
>> I am using <mq-deadline> which is MQ version of SQ deadline.
>> For now we can just consider case without scheduler.
>>
>> I have some more findings. May be someone from upstream community can
>> connect the dots.
>>
>> #1. hctx_may_queue() throttle the IO at hctx level. This is eventually 
>> per
>> sdev throttling for megaraid_sas driver because It creates only one
>> context - hctx0 for each scsi device.
>>
>> If driver is using only one h/w queue,  active_queues will be always 
>> steady.
>> In my test it was 64 thread, so active_queues=64.
> 
> So I figure that 64 threads comes from 64 having disks.
> 
>> Even though <fio> thread is shared among allowed cpumask
>> (cpus_allowed_policy=shared option in fio),  active_queues will be 
>> always 64
>> because we have only one h/w queue.
>> All the logical cpus are mapped to one h/w queue. It means, thread moving
>> from one cpu to another cpu will not change active_queues per hctx.
>>
>> In case of this RFC, active_queues are now per hctx and there are 
>> multiple
>> hctx, but tags are shared. 
> 
> Right, so we need a policy to divide up the shared tags across request 
> queues, based on principle of fairness.
> 

Hi Kashyap,

How about this, which counts requests per request_queue for shared 
sbitmap instead of per hctx (and, as such, does not need to aggreate 
them in hctx_may_queue()):

---->8

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index ba68d934e3ca..0c8adecba219 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
   * For shared tag users, we track the number of currently active users
   * and attempt to provide a fair share of the tag depth for each of them.
   */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
  				  struct sbitmap_queue *bt)
  {
+	struct blk_mq_hw_ctx *hctx = data->hctx;
+	struct request_queue *q = data->q;
  	unsigned int depth, users;

  	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
@@ -84,14 +86,14 @@ static inline bool hctx_may_queue(struct 
blk_mq_hw_ctx *hctx,
  	 * Allow at least some tags
  	 */
  	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return atomic_read(&hctx->nr_active) < depth;
+	return __blk_mq_active_requests(hctx, q) < depth;
  }

  static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
  			    struct sbitmap_queue *bt)
  {
  	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
-	    !hctx_may_queue(data->hctx, bt))
+	    !hctx_may_queue(data, bt))
  		return -1;
  	if (data->shallow_depth)
  		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7e446f946e73..5bbd95e01f08 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -282,7 +282,7 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
  	} else {
  		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
  			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			__blk_mq_inc_active_requests(data->hctx, data->q);
  		}
  		rq->tag = tag;
  		rq->internal_tag = -1;
@@ -502,7 +502,7 @@ void blk_mq_free_request(struct request *rq)

  	ctx->rq_completed[rq_is_sync(rq)]++;
  	if (rq->rq_flags & RQF_MQ_INFLIGHT)
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, q);

  	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
  		laptop_io_completion(q->backing_dev_info);
@@ -1048,7 +1048,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
  	if (rq->tag >= 0) {
  		if (shared) {
  			rq->rq_flags |= RQF_MQ_INFLIGHT;
-			atomic_inc(&data.hctx->nr_active);
+			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
  		}
  		data.hctx->tags->rqs[rq->tag] = rq;
  	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index dde2d29f0ce5..5ab1566c4b7d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -202,6 +202,29 @@ static inline bool 
blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
  	return true;
  }

+static inline  void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx 
*hctx, struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_inc(&q->nr_active_requests);
+	else
+		atomic_inc(&hctx->nr_active);
+}
+
+static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx 
*hctx, struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_dec(&q->nr_active_requests);
+	else
+		atomic_dec(&hctx->nr_active);
+}
+
+static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx, 
struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		return atomic_read(&q->nr_active_requests);
+	return atomic_read(&hctx->nr_active);
+}
+
  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
  					   struct request *rq)
  {
@@ -210,7 +233,7 @@ static inline void __blk_mq_put_driver_tag(struct 
blk_mq_hw_ctx *hctx,

  	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
  		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, rq->q);
  	}
  }

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32868fbedc9e..5f2955872234 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -579,6 +579,8 @@ struct request_queue {

  	size_t			cmd_size;

+	atomic_t nr_active_requests;
+
  	struct work_struct	release_work;

  #define BLK_MAX_WRITE_HINTS	5

> This can create unwanted throttling and
>> eventually more lock contention in sbitmap.
>> I added below patch and things improved a bit, but not a full proof.
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 586c9d6..c708fbc 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>    * For shared tag users, we track the number of currently active users
>>    * and attempt to provide a fair share of the tag depth for each of 
>> them.
>>    */
>> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>> +static inline bool hctx_may_queue(struct request_queue *q,
>> +                                 struct blk_mq_hw_ctx *hctx,
>>                                    struct sbitmap_queue *bt)
>>   {
>> -       unsigned int depth, users;
>> +       unsigned int depth, users, i, outstanding = 0;
>> +       struct blk_mq_hw_ctx *hctx_iter;
>>
>>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
>>                  return true;
>> @@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct 
>> blk_mq_hw_ctx
>> *hctx,
>>           * Allow at least some tags
>>           */
>>          depth = max((bt->sb.depth + users - 1) / users, 4U);
>> -       return atomic_read(&hctx->nr_active) < depth;
>> +
>> +       queue_for_each_hw_ctx(q, hctx_iter, i)
>> +               outstanding += atomic_read(&hctx_iter->nr_active);
>> +
> 
> OK,  I think that we need to find a cleaner way to do this.
> 
>> +       return outstanding < depth;
>>   }
>>
>>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>>                              struct sbitmap_queue *bt)
>>   {
>>          if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
>> -           !hctx_may_queue(data->hctx, bt))
>>
>>
>> #2 - In completion path, scsi module call blk_mq_run_hw_queues() upon IO
>> completion.  I am not sure if it is good to run all the h/w queue or just
>> h/w queue of current reference is good enough.
>> Below patch helped to reduce contention in hcxt_lock().
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index 610ee41..f72de2a 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
>> blk_status_t error,
>>          struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>>          struct scsi_device *sdev = cmd->device;
>>          struct request_queue *q = sdev->request_queue;
>> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
>>
>>          if (blk_update_request(req, error, bytes))
>>                  return true;
>> @@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
>> blk_status_t error,
>>              !list_empty(&sdev->host->starved_list))
>>                  kblockd_schedule_work(&sdev->requeue_work);
>>          else
>> -               blk_mq_run_hw_queues(q, true);
>> +               blk_mq_run_hw_queue(mq_hctx, true);
> 
> Not sure on this. But, indeed, I found running all queues did add lots 
> of extra load for when enabling the deadline scheduler.
> 
>>
>>          percpu_ref_put(&q->q_usage_counter);
>>          return false;
>>
>> #3 -  __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all which may not be
>> optimal for shared queue.
>> There is a penalty if we are calling __sbq_wake_up() frequently. In 
>> case of
>> nr_hw_queue = 1, things are better because one hctx and hctx->state will
>> avoid multiple calls.
>> If blk_mq_tag_wakeup_all is called from hctx0 context, there is no 
>> need to
>> call from hctx1, hctx2 etc.
>>
>> I have added below patch in my testing.
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 586c9d6..5b331e5 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>
>>          atomic_dec(&tags->active_queues);
>>
>> -       blk_mq_tag_wakeup_all(tags, false);
>> +       /* TBD - Do this only for hctx->flags & 
>> BLK_MQ_F_TAG_HCTX_SHARED */
>> +       if (hctx->queue_num == 0)
>> +               blk_mq_tag_wakeup_all(tags, false);
> 
> ok, I see. But, again, I think we need to find a cleaner way to do this.
> 
>>   }
>>
>>   /*
>>
>>
>> With all above mentioned changes, I see performance improved from 2.2M 
>> IOPS
>> to 2.7M on same workload and profile.
> 
> But still short of nr_hw_queue = 1, which got 3.1M IOPS, as below, right?
> 
> Thanks,
> John
> 
>>
>>>
>>>>
>>>> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue
>>>> depth = 128. We get 3.1M IOPS in this config. This eventually exhaust
>>>> host can_queue.
>>>
>>> So I think I need to find a point where we start to get throttled.
>>>
>>>> Note - Very low contention in sbitmap_get()
>>>>
>>>> -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
>>>> blk_mq_make_request
>>>>      - 23.33% blk_mq_make_request
>>>>         - 21.68% blk_mq_get_request
>>>>            - 20.19% blk_mq_get_tag
>>>>               + 10.08% prepare_to_wait_exclusive
>>>>               + 4.51% io_schedule
>>>>               - 3.59% __sbitmap_queue_get
>>>>                  - 2.82% sbitmap_get
>>>>                       0.86% __sbitmap_get_word
>>>>                       0.75% _raw_spin_lock_irqsave
>>>>                       0.55% _raw_spin_unlock_irqrestore
>>>>
>>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>>> queue depth = 128. We get 2.3 M IOPS in this config. This eventually
>>>> exhaust host can_queue.
>>>> Note - Very high contention in sbitmap_get()
>>>>
>>>> -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
>>>> generic_make_request
>>>>      - 42.27% generic_make_request
>>>>         - 41.00% blk_mq_make_request
>>>>            - 38.28% blk_mq_get_request
>>>>               - 33.76% blk_mq_get_tag
>>>>                  - 30.25% __sbitmap_queue_get
>>>>                     - 29.90% sbitmap_get
>>>>                        + 9.06% _raw_spin_lock_irqsave
>>>>                        + 7.94% _raw_spin_unlock_irqrestore
>>>>                        + 3.86% __sbitmap_get_word
>>>>                        + 1.78% call_function_single_interrupt
>>>>                        + 0.67% ret_from_intr
>>>>                  + 1.69% io_schedule
>>>>                    0.59% prepare_to_wait_exclusive
>>>>                    0.55% __blk_mq_get_tag
>>>>
>>>> In this particular case, I observed alloc_hint = zeros which means,
>>>> sbitmap_get is not able to find free tags from hint. That may lead to
>>>> contention.
>>>> This condition is not happening with nr_hw_queue=1 (without RFC) 
>>>> driver.
>>>>
>>>> alloc_hint=
>>>> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
>>>> ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212,
>>>> 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810, 2845, 4690, 712,
>>>> 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4, 733, 1765, 2068, 979,
>>>> 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501, 3717, 2893, 604, 508,
>>>> 759, 3329, 4038, 4829, 715, 842, 1443, 556}
>>>>
>>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>>> queue depth = 32. We get 3.1M IOPS in this config. This workload does
>>>> *not* exhaust host can_queue.
>>>
>>> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This 
>>> is as
>>> per
>>> RFC, and I don't think you modified from the RFC for your test.
>>> But I just wanted to mention that to be crystal clear.
>>
>> Yes I have two separate driver copy. One with RFC change and another 
>> without
>> RFC.
>>>
>>>>
>>>> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
>>>> generic_make_request
>>>>      - 4.93% generic_make_request
>>>>         - 3.61% blk_mq_make_request
>>>>            - 2.04% blk_mq_get_request
>>>>               - 1.08% blk_mq_get_tag
>>>>                  - 0.70% __sbitmap_queue_get
>>>>                       0.67% sbitmap_get
>>>>
>>>> In summary, RFC has some performance bottleneck in sbitmap_get () if
>>>> outstanding per shost is about to exhaust.  Without this RFC also
>>>> driver works in nr_hw_queue = 1, but that case is managed very well.
>>>> I am not sure why it happens only with shared host tag ? Theoretically
>>>> all the hctx is sharing the same bitmaptag which is same as
>>>> nr_hw_queue=1, so why contention is only visible in shared host tag
>>>> case.
>>>
>>> Let me check this.
>>>
>>>>
>>>> If you want to reproduce this issue, may be you have to reduce the
>>>> can_queue in hisi_sas driver.
>>>>
>>>
>>> Thanks,
>>> John
>

Kashyap Desai April 29, 2020, 3:50 p.m. UTC | #15

> -----Original Message-----
> From: John Garry [mailto:john.garry@huawei.com]
> Sent: Wednesday, April 29, 2020 5:00 PM
> To: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: axboe@kernel.dk; jejb@linux.ibm.com; martin.petersen@oracle.com;
> ming.lei@redhat.com; bvanassche@acm.org; hare@suse.de;
> don.brace@microsemi.com; Sumit Saxena <sumit.saxena@broadcom.com>;
> hch@infradead.org; Shivasharan Srikanteshwara
> <shivasharan.srikanteshwara@broadcom.com>; chenxiang (M)
> <chenxiang66@hisilicon.com>; linux-block@vger.kernel.org; linux-
> scsi@vger.kernel.org; esc.storagedev@microsemi.com; Hannes Reinecke
> <hare@suse.com>
> Subject: Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to
> MQ
>
> On 28/04/2020 16:55, John Garry wrote:
> > Hi Kashyap,
> >
> >> I am using <mq-deadline> which is MQ version of SQ deadline.
> >> For now we can just consider case without scheduler.
> >>
> >> I have some more findings. May be someone from upstream community
> can
> >> connect the dots.
> >>
> >> #1. hctx_may_queue() throttle the IO at hctx level. This is
> >> eventually per sdev throttling for megaraid_sas driver because It
> >> creates only one context - hctx0 for each scsi device.
> >>
> >> If driver is using only one h/w queue,  active_queues will be always
> >> steady.
> >> In my test it was 64 thread, so active_queues=64.
> >
> > So I figure that 64 threads comes from 64 having disks.
> >
> >> Even though <fio> thread is shared among allowed cpumask
> >> (cpus_allowed_policy=shared option in fio),  active_queues will be
> >> always 64 because we have only one h/w queue.
> >> All the logical cpus are mapped to one h/w queue. It means, thread
> >> moving from one cpu to another cpu will not change active_queues per
> hctx.
> >>
> >> In case of this RFC, active_queues are now per hctx and there are
> >> multiple hctx, but tags are shared.
> >
> > Right, so we need a policy to divide up the shared tags across request
> > queues, based on principle of fairness.
> >
>
> Hi Kashyap,
>
> How about this, which counts requests per request_queue for shared sbitmap
> instead of per hctx (and, as such, does not need to aggreate them in
> hctx_may_queue()):

Hi John,

I have to add additional changes on top of your below patch -
active_queues also should be managed differently for shared tag map case. I
added extra flags in queue_flags interface which is per request.

Combination of your patch and below resolves fairness issues. I see some
better results using " --cpus_allowed_policy=split", but
--cpus_allowed_policy=shared is still having some issue and most likely it
is to do with fairness. If fairness is not managed properly, there is high
contention in wait/wakeup handling of sbitmap.


=====Additional patch ===

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 3719e1a..95bcb47 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -23,10 +23,20 @@
  */
 bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
-           !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-               atomic_inc(&hctx->tags->active_queues);
+       struct blk_mq_tags *tags = hctx->tags;
+       struct request_queue *q = hctx->queue;
+       struct sbitmap_queue *bt;

+       if (blk_mq_is_sbitmap_shared(q->tag_set)){
+               bt = tags->bitmap_tags;
+               if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
+                       !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE,
&q->queue_flags))
+                       atomic_inc(&bt->active_queues_per_host);
+       } else {
+               if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
+                   !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+                       atomic_inc(&hctx->tags->active_queues);
+       }
        return true;
 }

@@ -47,12 +57,20 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags,
bool include_reserve)
 void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
        struct blk_mq_tags *tags = hctx->tags;
+       struct sbitmap_queue *bt;
+       struct request_queue *q = hctx->queue;

-       if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-               return;
-
-       atomic_dec(&tags->active_queues);
+       if (blk_mq_is_sbitmap_shared(q->tag_set)){
+               bt = tags->bitmap_tags;
+               if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
&q->queue_flags))
+                       return;
+               atomic_dec(&bt->active_queues_per_host);
+       } else {
+               if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+                       return;

+               atomic_dec(&tags->active_queues);
+       }
        blk_mq_tag_wakeup_all(tags, false);
 }

@@ -65,12 +83,13 @@ static inline bool hctx_may_queue(struct
blk_mq_alloc_data *data,
 {
        struct blk_mq_hw_ctx *hctx = data->hctx;
        struct request_queue *q = data->q;
+       struct blk_mq_tags *tags = hctx->tags;
        unsigned int depth, users;

        if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
                return true;
-       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-               return true;
+       //if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+       //      return true;

        /*
         * Don't try dividing an ant
@@ -78,7 +97,11 @@ static inline bool hctx_may_queue(struct
blk_mq_alloc_data *data,
        if (bt->sb.depth == 1)
                return true;

-       users = atomic_read(&hctx->tags->active_queues);
+       if (blk_mq_is_sbitmap_shared(q->tag_set)) {
+               bt = tags->bitmap_tags;
+               users = atomic_read(&bt->active_queues_per_host);
+       } else
+               users = atomic_read(&hctx->tags->active_queues);
        if (!users)
                return true;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2bc9998..7049800 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -613,6 +613,7 @@ struct request_queue {
 #define QUEUE_FLAG_PCI_P2PDMA  25      /* device supports PCI p2p requests
*/
 #define QUEUE_FLAG_ZONE_RESETALL 26    /* supports Zone Reset All */
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27    /* record rq->alloc_time_ns */
+#define QUEUE_FLAG_HCTX_ACTIVE 28      /* atleast one hctx is active*/

 #define QUEUE_FLAG_MQ_DEFAULT  ((1 << QUEUE_FLAG_IO_STAT) |            \
                                 (1 << QUEUE_FLAG_SAME_COMP))
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index e40d019..fb44925 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -139,6 +139,8 @@ struct sbitmap_queue {
         * sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
         */
        unsigned int min_shallow_depth;
+
+       atomic_t active_queues_per_host;
 };

 /**

>
> ---->8
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> ba68d934e3ca..0c8adecba219 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>    * For shared tag users, we track the number of currently active users
>    * and attempt to provide a fair share of the tag depth for each of
> them.
>    */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> +static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>   				  struct sbitmap_queue *bt)
>   {
> +	struct blk_mq_hw_ctx *hctx = data->hctx;
> +	struct request_queue *q = data->q;
>   	unsigned int depth, users;
>
>   	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) @@ -
> 84,14 +86,14 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx
> *hctx,
>   	 * Allow at least some tags
>   	 */
>   	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return atomic_read(&hctx->nr_active) < depth;
> +	return __blk_mq_active_requests(hctx, q) < depth;
>   }
>
>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>   			    struct sbitmap_queue *bt)
>   {
>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> -	    !hctx_may_queue(data->hctx, bt))
> +	    !hctx_may_queue(data, bt))
>   		return -1;
>   	if (data->shallow_depth)
>   		return __sbitmap_queue_get_shallow(bt, data-
> >shallow_depth); diff --git a/block/blk-mq.c b/block/blk-mq.c index
> 7e446f946e73..5bbd95e01f08 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -282,7 +282,7 @@ static struct request *blk_mq_rq_ctx_init(struct
> blk_mq_alloc_data *data,
>   	} else {
>   		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
>   			rq_flags = RQF_MQ_INFLIGHT;
> -			atomic_inc(&data->hctx->nr_active);
> +			__blk_mq_inc_active_requests(data->hctx, data->q);
>   		}
>   		rq->tag = tag;
>   		rq->internal_tag = -1;
> @@ -502,7 +502,7 @@ void blk_mq_free_request(struct request *rq)
>
>   	ctx->rq_completed[rq_is_sync(rq)]++;
>   	if (rq->rq_flags & RQF_MQ_INFLIGHT)
> -		atomic_dec(&hctx->nr_active);
> +		__blk_mq_dec_active_requests(hctx, q);
>
>   	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
>   		laptop_io_completion(q->backing_dev_info);
> @@ -1048,7 +1048,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>   	if (rq->tag >= 0) {
>   		if (shared) {
>   			rq->rq_flags |= RQF_MQ_INFLIGHT;
> -			atomic_inc(&data.hctx->nr_active);
> +			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
>   		}
>   		data.hctx->tags->rqs[rq->tag] = rq;
>   	}
> diff --git a/block/blk-mq.h b/block/blk-mq.h index
> dde2d29f0ce5..5ab1566c4b7d 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -202,6 +202,29 @@ static inline bool
> blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>   	return true;
>   }
>
> +static inline  void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx
> *hctx, struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_inc(&q->nr_active_requests);
> +	else
> +		atomic_inc(&hctx->nr_active);
> +}
> +
> +static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx
> *hctx, struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_dec(&q->nr_active_requests);
> +	else
> +		atomic_dec(&hctx->nr_active);
> +}
> +
> +static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
> struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		return atomic_read(&q->nr_active_requests);
> +	return atomic_read(&hctx->nr_active);
> +}
> +
>   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>   					   struct request *rq)
>   {
> @@ -210,7 +233,7 @@ static inline void __blk_mq_put_driver_tag(struct
> blk_mq_hw_ctx *hctx,
>
>   	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
>   		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
> -		atomic_dec(&hctx->nr_active);
> +		__blk_mq_dec_active_requests(hctx, rq->q);
>   	}
>   }
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index
> 32868fbedc9e..5f2955872234 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -579,6 +579,8 @@ struct request_queue {
>
>   	size_t			cmd_size;
>
> +	atomic_t nr_active_requests;
> +
>   	struct work_struct	release_work;
>
>   #define BLK_MAX_WRITE_HINTS	5
>
> > This can create unwanted throttling and
> >> eventually more lock contention in sbitmap.
> >> I added below patch and things improved a bit, but not a full proof.
> >>
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> >> 586c9d6..c708fbc 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx
> >> *hctx)
> >>    * For shared tag users, we track the number of currently active
> >> users
> >>    * and attempt to provide a fair share of the tag depth for each of
> >> them.
> >>    */
> >> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> >> +static inline bool hctx_may_queue(struct request_queue *q,
> >> +                                 struct blk_mq_hw_ctx *hctx,
> >>                                    struct sbitmap_queue *bt)
> >>   {
> >> -       unsigned int depth, users;
> >> +       unsigned int depth, users, i, outstanding = 0;
> >> +       struct blk_mq_hw_ctx *hctx_iter;
> >>
> >>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
> >>                  return true;
> >> @@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct
> >> blk_mq_hw_ctx *hctx,
> >>           * Allow at least some tags
> >>           */
> >>          depth = max((bt->sb.depth + users - 1) / users, 4U);
> >> -       return atomic_read(&hctx->nr_active) < depth;
> >> +
> >> +       queue_for_each_hw_ctx(q, hctx_iter, i)
> >> +               outstanding += atomic_read(&hctx_iter->nr_active);
> >> +
> >
> > OK,  I think that we need to find a cleaner way to do this.
> >
> >> +       return outstanding < depth;
> >>   }
> >>
> >>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
> >>                              struct sbitmap_queue *bt)
> >>   {
> >>          if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> >> -           !hctx_may_queue(data->hctx, bt))
> >>
> >>
> >> #2 - In completion path, scsi module call blk_mq_run_hw_queues() upon
> >> IO completion.  I am not sure if it is good to run all the h/w queue
> >> or just h/w queue of current reference is good enough.
> >> Below patch helped to reduce contention in hcxt_lock().
> >>
> >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> >> 610ee41..f72de2a 100644
> >> --- a/drivers/scsi/scsi_lib.c
> >> +++ b/drivers/scsi/scsi_lib.c
> >> @@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
> >> blk_status_t error,
> >>          struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> >>          struct scsi_device *sdev = cmd->device;
> >>          struct request_queue *q = sdev->request_queue;
> >> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> >>
> >>          if (blk_update_request(req, error, bytes))
> >>                  return true;
> >> @@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
> >> blk_status_t error,
> >>              !list_empty(&sdev->host->starved_list))
> >>                  kblockd_schedule_work(&sdev->requeue_work);
> >>          else
> >> -               blk_mq_run_hw_queues(q, true);
> >> +               blk_mq_run_hw_queue(mq_hctx, true);
> >
> > Not sure on this. But, indeed, I found running all queues did add lots
> > of extra load for when enabling the deadline scheduler.
> >
> >>
> >>          percpu_ref_put(&q->q_usage_counter);
> >>          return false;
> >>
> >> #3 -  __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all which may not
> >> be optimal for shared queue.
> >> There is a penalty if we are calling __sbq_wake_up() frequently. In
> >> case of nr_hw_queue = 1, things are better because one hctx and
> >> hctx->state will avoid multiple calls.
> >> If blk_mq_tag_wakeup_all is called from hctx0 context, there is no
> >> need to call from hctx1, hctx2 etc.
> >>
> >> I have added below patch in my testing.
> >>
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> >> 586c9d6..5b331e5 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
> >>
> >>          atomic_dec(&tags->active_queues);
> >>
> >> -       blk_mq_tag_wakeup_all(tags, false);
> >> +       /* TBD - Do this only for hctx->flags &
> >> BLK_MQ_F_TAG_HCTX_SHARED */
> >> +       if (hctx->queue_num == 0)
> >> +               blk_mq_tag_wakeup_all(tags, false);
> >
> > ok, I see. But, again, I think we need to find a cleaner way to do this.
> >
> >>   }
> >>
> >>   /*
> >>
> >>
> >> With all above mentioned changes, I see performance improved from
> >> 2.2M IOPS to 2.7M on same workload and profile.
> >
> > But still short of nr_hw_queue = 1, which got 3.1M IOPS, as below,
> > right?
> >
> > Thanks,
> > John
> >
> >>
> >>>
> >>>>
> >>>> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>
> >>>> queue depth = 128. We get 3.1M IOPS in this config. This eventually
> >>>> exhaust host can_queue.
> >>>
> >>> So I think I need to find a point where we start to get throttled.
> >>>
> >>>> Note - Very low contention in sbitmap_get()
> >>>>
> >>>> -   23.58%     0.25%  fio              [kernel.vmlinux]
> >>>> [k] blk_mq_make_request
> >>>>      - 23.33% blk_mq_make_request
> >>>>         - 21.68% blk_mq_get_request
> >>>>            - 20.19% blk_mq_get_tag
> >>>>               + 10.08% prepare_to_wait_exclusive
> >>>>               + 4.51% io_schedule
> >>>>               - 3.59% __sbitmap_queue_get
> >>>>                  - 2.82% sbitmap_get
> >>>>                       0.86% __sbitmap_get_word
> >>>>                       0.75% _raw_spin_lock_irqsave
> >>>>                       0.55% _raw_spin_unlock_irqrestore
> >>>>
> >>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from
> >>>> <fio> queue depth = 128. We get 2.3 M IOPS in this config. This
> >>>> eventually exhaust host can_queue.
> >>>> Note - Very high contention in sbitmap_get()
> >>>>
> >>>> -   42.39%     0.12%  fio              [kernel.vmlinux]
> >>>> [k] generic_make_request
> >>>>      - 42.27% generic_make_request
> >>>>         - 41.00% blk_mq_make_request
> >>>>            - 38.28% blk_mq_get_request
> >>>>               - 33.76% blk_mq_get_tag
> >>>>                  - 30.25% __sbitmap_queue_get
> >>>>                     - 29.90% sbitmap_get
> >>>>                        + 9.06% _raw_spin_lock_irqsave
> >>>>                        + 7.94% _raw_spin_unlock_irqrestore
> >>>>                        + 3.86% __sbitmap_get_word
> >>>>                        + 1.78% call_function_single_interrupt
> >>>>                        + 0.67% ret_from_intr
> >>>>                  + 1.69% io_schedule
> >>>>                    0.59% prepare_to_wait_exclusive
> >>>>                    0.55% __blk_mq_get_tag
> >>>>
> >>>> In this particular case, I observed alloc_hint = zeros which means,
> >>>> sbitmap_get is not able to find free tags from hint. That may lead
> >>>> to contention.
> >>>> This condition is not happening with nr_hw_queue=1 (without RFC)
> >>>> driver.
> >>>>
> >>>> alloc_hint=
> >>>> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779,
> >>>> 377, ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902,
> >>>> 2224, 3212, 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810,
> >>>> 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4,
> >>>> 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501,
> >>>> 3717, 2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}
> >>>>
> >>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from
> >>>> <fio> queue depth = 32. We get 3.1M IOPS in this config. This
> >>>> workload does
> >>>> *not* exhaust host can_queue.
> >>>
> >>> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This
> >>> is as per RFC, and I don't think you modified from the RFC for your
> >>> test.
> >>> But I just wanted to mention that to be crystal clear.
> >>
> >> Yes I have two separate driver copy. One with RFC change and another
> >> without RFC.
> >>>
> >>>>
> >>>> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
> >>>> generic_make_request
> >>>>      - 4.93% generic_make_request
> >>>>         - 3.61% blk_mq_make_request
> >>>>            - 2.04% blk_mq_get_request
> >>>>               - 1.08% blk_mq_get_tag
> >>>>                  - 0.70% __sbitmap_queue_get
> >>>>                       0.67% sbitmap_get
> >>>>
> >>>> In summary, RFC has some performance bottleneck in sbitmap_get ()
> >>>> if outstanding per shost is about to exhaust.  Without this RFC
> >>>> also driver works in nr_hw_queue = 1, but that case is managed very
> well.
> >>>> I am not sure why it happens only with shared host tag ?
> >>>> Theoretically all the hctx is sharing the same bitmaptag which is
> >>>> same as nr_hw_queue=1, so why contention is only visible in shared
> >>>> host tag case.
> >>>
> >>> Let me check this.
> >>>
> >>>>
> >>>> If you want to reproduce this issue, may be you have to reduce the
> >>>> can_queue in hisi_sas driver.
> >>>>
> >>>
> >>> Thanks,
> >>> John
> >

John Garry April 29, 2020, 5:55 p.m. UTC | #16

Hi Kashyap,

> 
> I have to add additional changes on top of your below patch -
> active_queues also should be managed differently for shared tag map case. I
> added extra flags in queue_flags interface which is per request.

ok, so it's not proper to use active hctx per request queue as "users", 
but rather that the active request_queue's per host (which is what we 
effectively have for nr_hw_queues = 1). Just a small comment at the 
bottom on your change.

So I already experimented with reducing shost.can_queue significantly on 
hisi_sas (by a factor of 8, from 4096->512, and I use 12x SAS SSD), and saw:

RFC + shost.can_queue reduced: ~1.2M IOPS
With RFC + shost.can_queue unmodified: ~2M IOPS
Without RFC + shost.can_queue unmodified: ~2M IOPS

And with the change, below, I still get around 1.2M IOPS. But there may 
be some sweet spot/zone where this makes a difference, which I'm not 
visiting.

> 
> Combination of your patch and below resolves fairness issues. I see some
> better results using " --cpus_allowed_policy=split", but
> --cpus_allowed_policy=shared 

I did not try changing the cpus_allowed_policy policy, and so I would be 
using default, which I believe is shared.

is still having some issue and most likely it
> is to do with fairness. If fairness is not managed properly, there is high
> contention in wait/wakeup handling of sbitmap.

ok, I can investigate.

> 
> 
> =====Additional patch ===
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 3719e1a..95bcb47 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -23,10 +23,20 @@
>    */
>   bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>   {
> -       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> -           !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -               atomic_inc(&hctx->tags->active_queues);
> +       struct blk_mq_tags *tags = hctx->tags;
> +       struct request_queue *q = hctx->queue;
> +       struct sbitmap_queue *bt;
> 
> +       if (blk_mq_is_sbitmap_shared(q->tag_set)){
> +               bt = tags->bitmap_tags;
> +               if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
> +                       !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE,
> &q->queue_flags))
> +                       atomic_inc(&bt->active_queues_per_host);
> +       } else {
> +               if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> +                   !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +                       atomic_inc(&hctx->tags->active_queues);
> +       }
>          return true;
>   }
> 
> @@ -47,12 +57,20 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags,
> bool include_reserve)
>   void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   {
>          struct blk_mq_tags *tags = hctx->tags;
> +       struct sbitmap_queue *bt;
> +       struct request_queue *q = hctx->queue;
> 
> -       if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -               return;
> -
> -       atomic_dec(&tags->active_queues);
> +       if (blk_mq_is_sbitmap_shared(q->tag_set)){
> +               bt = tags->bitmap_tags;
> +               if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
> &q->queue_flags))
> +                       return;
> +               atomic_dec(&bt->active_queues_per_host);
> +       } else {
> +               if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +                       return;
> 
> +               atomic_dec(&tags->active_queues);
> +       }
>          blk_mq_tag_wakeup_all(tags, false);
>   }
> 
> @@ -65,12 +83,13 @@ static inline bool hctx_may_queue(struct
> blk_mq_alloc_data *data,
>   {
>          struct blk_mq_hw_ctx *hctx = data->hctx;
>          struct request_queue *q = data->q;
> +       struct blk_mq_tags *tags = hctx->tags;
>          unsigned int depth, users;
> 
>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>                  return true;
> -       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -               return true;
> +       //if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +       //      return true;
> 
>          /*
>           * Don't try dividing an ant
> @@ -78,7 +97,11 @@ static inline bool hctx_may_queue(struct
> blk_mq_alloc_data *data,
>          if (bt->sb.depth == 1)
>                  return true;
> 
> -       users = atomic_read(&hctx->tags->active_queues);
> +       if (blk_mq_is_sbitmap_shared(q->tag_set)) {
> +               bt = tags->bitmap_tags;
> +               users = atomic_read(&bt->active_queues_per_host);
> +       } else
> +               users = atomic_read(&hctx->tags->active_queues);
>          if (!users)
>                  return true;
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2bc9998..7049800 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -613,6 +613,7 @@ struct request_queue {
>   #define QUEUE_FLAG_PCI_P2PDMA  25      /* device supports PCI p2p requests
> */
>   #define QUEUE_FLAG_ZONE_RESETALL 26    /* supports Zone Reset All */
>   #define QUEUE_FLAG_RQ_ALLOC_TIME 27    /* record rq->alloc_time_ns */
> +#define QUEUE_FLAG_HCTX_ACTIVE 28      /* atleast one hctx is active*/
> 
>   #define QUEUE_FLAG_MQ_DEFAULT  ((1 << QUEUE_FLAG_IO_STAT) |            \
>                                   (1 << QUEUE_FLAG_SAME_COMP))
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index e40d019..fb44925 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -139,6 +139,8 @@ struct sbitmap_queue {
>           * sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
>           */
>          unsigned int min_shallow_depth;
> +
> +       atomic_t active_queues_per_host;

It's prob better to put this in blk_mq_tag_set structure.

>   };


Thanks very much,
John

John Garry April 30, 2020, 5:40 p.m. UTC | #17

On 29/04/2020 18:55, John Garry wrote:
> 

Hi Kashyap,

> ok, so it's not proper to use active hctx per request queue as "users", 
> but rather that the active request_queue's per host (which is what we 
> effectively have for nr_hw_queues = 1). Just a small comment at the 
> bottom on your change.
> 
> So I already experimented with reducing shost.can_queue significantly on 
> hisi_sas (by a factor of 8, from 4096->512, and I use 12x SAS SSD), and 
> saw:
> 
> RFC + shost.can_queue reduced: ~1.2M IOPS
> With RFC + shost.can_queue unmodified: ~2M IOPS
> Without RFC + shost.can_queue unmodified: ~2M IOPS
> 
> And with the change, below, I still get around 1.2M IOPS. But there may 
> be some sweet spot/zone where this makes a difference, which I'm not 
> visiting.
> 
>>
>> Combination of your patch and below resolves fairness issues. I see some
>> better results using " --cpus_allowed_policy=split", but
>> --cpus_allowed_policy=shared 
> 
> I did not try changing the cpus_allowed_policy policy, and so I would be 
> using default, which I believe is shared.
> 
> is still having some issue and most likely it
>> is to do with fairness. If fairness is not managed properly, there is 
>> high
>> contention in wait/wakeup handling of sbitmap.
> 
> ok, I can investigate.

Could you also try this change:

diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 0a57e4f041a9..e959971b1cee 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -46,11 +46,25 @@ extern void blk_mq_tag_wakeup_all(struct blk_mq_tags 
*tags, bool);
  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
  		void *priv);

+static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
+{
+	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
  static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
  						 struct blk_mq_hw_ctx *hctx)
  {
+	struct request_queue *queue;
+	struct blk_mq_tag_set *set;
+
  	if (!hctx)
  		return &bt->ws[0];
+	queue = hctx->queue;
+	set = queue->tag_set;
+
+	if (blk_mq_is_sbitmap_shared(set))
+		return sbq_wait_ptr(bt, &set->wait_index);
+	
  	return sbq_wait_ptr(bt, &hctx->wait_index);
  }

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 4f12363d56bf..241607806f77 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -158,11 +158,6 @@ struct blk_mq_alloc_data {
  	struct blk_mq_hw_ctx *hctx;
  };

-static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
-{
-	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
-}
-
  static inline struct blk_mq_tags *blk_mq_tags_from_data(struct 
blk_mq_alloc_data *data)
  {
  	if (data->flags & BLK_MQ_REQ_INTERNAL)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 957cb43c5de7..427c7934c29d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -259,6 +259,8 @@ struct blk_mq_tag_set {

  	struct mutex		tag_list_lock;
  	struct list_head	tag_list;
+
+	atomic_t		wait_index;
  };

  /**


Thanks,
John

Kashyap Desai April 30, 2020, 7:18 p.m. UTC | #18

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of John Garry
> Sent: Thursday, April 30, 2020 11:11 PM
> To: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: axboe@kernel.dk; jejb@linux.ibm.com; martin.petersen@oracle.com;
> ming.lei@redhat.com; bvanassche@acm.org; hare@suse.de;
> don.brace@microsemi.com; Sumit Saxena <sumit.saxena@broadcom.com>;
> hch@infradead.org; Shivasharan Srikanteshwara
> <shivasharan.srikanteshwara@broadcom.com>; chenxiang (M)
> <chenxiang66@hisilicon.com>; linux-block@vger.kernel.org; linux-
> scsi@vger.kernel.org; esc.storagedev@microsemi.com; Hannes Reinecke
> <hare@suse.com>
> Subject: Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to
> MQ
>
> On 29/04/2020 18:55, John Garry wrote:
> >
>
> Hi Kashyap,
>
> > ok, so it's not proper to use active hctx per request queue as
> > "users", but rather that the active request_queue's per host (which is
> > what we effectively have for nr_hw_queues = 1). Just a small comment
> > at the bottom on your change.
> >
> > So I already experimented with reducing shost.can_queue significantly
> > on hisi_sas (by a factor of 8, from 4096->512, and I use 12x SAS SSD),
> > and
> > saw:
> >
> > RFC + shost.can_queue reduced: ~1.2M IOPS With RFC + shost.can_queue
> > unmodified: ~2M IOPS Without RFC + shost.can_queue unmodified: ~2M
> > IOPS
> >
> > And with the change, below, I still get around 1.2M IOPS. But there
> > may be some sweet spot/zone where this makes a difference, which I'm
> > not visiting.
> >
> >>
> >> Combination of your patch and below resolves fairness issues. I see
> >> some better results using " --cpus_allowed_policy=split", but
> >> --cpus_allowed_policy=shared
> >
> > I did not try changing the cpus_allowed_policy policy, and so I would
> > be using default, which I believe is shared.
> >
> > is still having some issue and most likely it
> >> is to do with fairness. If fairness is not managed properly, there is
> >> high contention in wait/wakeup handling of sbitmap.
> >
> > ok, I can investigate.
>
> Could you also try this change:
>
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index
> 0a57e4f041a9..e959971b1cee 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -46,11 +46,25 @@ extern void blk_mq_tag_wakeup_all(struct
> blk_mq_tags *tags, bool);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn
> *fn,
>   		void *priv);
>
> +static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set
> +*tag_set) {
> +	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED; }
> +
>   static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue
> *bt,
>   						 struct blk_mq_hw_ctx *hctx)
>   {
> +	struct request_queue *queue;
> +	struct blk_mq_tag_set *set;
> +
>   	if (!hctx)
>   		return &bt->ws[0];
> +	queue = hctx->queue;
> +	set = queue->tag_set;
> +
> +	if (blk_mq_is_sbitmap_shared(set))
> +		return sbq_wait_ptr(bt, &set->wait_index);
> +
>   	return sbq_wait_ptr(bt, &hctx->wait_index);
>   }
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h index
> 4f12363d56bf..241607806f77 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -158,11 +158,6 @@ struct blk_mq_alloc_data {
>   	struct blk_mq_hw_ctx *hctx;
>   };
>
> -static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set
> *tag_set)
> -{
> -	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
> -}
> -
>   static inline struct blk_mq_tags *blk_mq_tags_from_data(struct
> blk_mq_alloc_data *data)
>   {
>   	if (data->flags & BLK_MQ_REQ_INTERNAL)
>
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index
> 957cb43c5de7..427c7934c29d 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -259,6 +259,8 @@ struct blk_mq_tag_set {
>
>   	struct mutex		tag_list_lock;
>   	struct list_head	tag_list;
> +
> +	atomic_t		wait_index;
>   };

John -

I have this patch in my local repo as I was doing similar change. Your above
exact patch is not tested, but similar attempt was done. It is good to
include in next revision.
There is no significant performance up or down using this change.
Currently system is not available for me to test. I will resume testing once
system is available.


>
>   /**
>
>
> Thanks,
> John

[RFC,v6,08/10] megaraid_sas: switch fusion adapters to MQ

Commit Message

Comments

Patch