
[v2,12/12] IB/srp: Add multichannel support

Message ID 5433E585.607@acm.org (mailing list archive)
State Superseded, archived

Commit Message

Bart Van Assche Oct. 7, 2014, 1:07 p.m. UTC
Improve performance by using multiple RDMA/RC channels per SCSI
host for communication with an SRP target. About the
implementation:
- Introduce a loop over all channels in the code that uses
  target->ch.
- Set the SRP_MULTICHAN_MULTI flag during login for the creation
  of the second and subsequent channels.
- RDMA completion vectors are chosen such that RDMA completion
  interrupts are handled by the CPU socket that submitted the I/O
  request. As one can see in this patch, it has been assumed
  that, if a system contains n CPU sockets and m RDMA completion
  vectors have been assigned to an RDMA HCA, IRQ affinity has
  been configured such that completion vectors [i*m/n..(i+1)*m/n)
  are bound to CPU socket i with 0 <= i < n.
- Modify srp_free_ch_ib() and srp_free_req_data() such that it
  becomes safe to invoke these functions after the corresponding
  allocation function failed.
- Add a ch_count sysfs attribute per target port.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
 Documentation/ABI/stable/sysfs-driver-ib_srp |  25 ++-
 drivers/infiniband/ulp/srp/ib_srp.c          | 291 ++++++++++++++++++++-------
 drivers/infiniband/ulp/srp/ib_srp.h          |   3 +-
 3 files changed, 238 insertions(+), 81 deletions(-)

Comments

Christoph Hellwig Oct. 17, 2014, 11:06 a.m. UTC | #1
>  	} else {
> +		if (blk_mq_unique_tag_to_hwq(rsp->tag) != ch - target->ch)
> +			pr_err("Channel idx mismatch: tag %#llx <> ch %#lx\n",
> +			       rsp->tag, ch - target->ch);
>  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);

Shouldn't we do this validity check inside scsi_host_find_tag, so that
all callers get it? That means adding an argument to it, but there are
very few callers at the moment.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Oct. 19, 2014, 5:36 p.m. UTC | #2
On 10/7/2014 4:07 PM, Bart Van Assche wrote:
> Improve performance by using multiple RDMA/RC channels per SCSI
> host for communication with an SRP target. About the
> implementation:
> - Introduce a loop over all channels in the code that uses
>    target->ch.
> - Set the SRP_MULTICHAN_MULTI flag during login for the creation
>    of the second and subsequent channels.
> - RDMA completion vectors are chosen such that RDMA completion
>    interrupts are handled by the CPU socket that submitted the I/O
>    request. As one can see in this patch it has been assumed if a
>    system contains n CPU sockets and m RDMA completion vectors
>    have been assigned to an RDMA HCA that IRQ affinity has been
>    configured such that completion vectors [i*m/n..(i+1)*m/n) are
>    bound to CPU socket i with 0 <= i < n.
> - Modify srp_free_ch_ib() and srp_free_req_data() such that it
>    becomes safe to invoke these functions after the corresponding
>    allocation function failed.
> - Add a ch_count sysfs attribute per target port.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> Cc: Sagi Grimberg <sagig@mellanox.com>
> Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
> ---
>   Documentation/ABI/stable/sysfs-driver-ib_srp |  25 ++-
>   drivers/infiniband/ulp/srp/ib_srp.c          | 291 ++++++++++++++++++++-------
>   drivers/infiniband/ulp/srp/ib_srp.h          |   3 +-
>   3 files changed, 238 insertions(+), 81 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-driver-ib_srp b/Documentation/ABI/stable/sysfs-driver-ib_srp
> index b9688de..d5a459e 100644
> --- a/Documentation/ABI/stable/sysfs-driver-ib_srp
> +++ b/Documentation/ABI/stable/sysfs-driver-ib_srp
> @@ -55,12 +55,12 @@ Description:	Interface for making ib_srp connect to a new target.
>   		  only safe with partial memory descriptor list support enabled
>   		  (allow_ext_sg=1).
>   		* comp_vector, a number in the range 0..n-1 specifying the
> -		  MSI-X completion vector. Some HCA's allocate multiple (n)
> -		  MSI-X vectors per HCA port. If the IRQ affinity masks of
> -		  these interrupts have been configured such that each MSI-X
> -		  interrupt is handled by a different CPU then the comp_vector
> -		  parameter can be used to spread the SRP completion workload
> -		  over multiple CPU's.
> +		  MSI-X completion vector of the first RDMA channel. Some
> +		  HCA's allocate multiple (n) MSI-X vectors per HCA port. If
> +		  the IRQ affinity masks of these interrupts have been
> +		  configured such that each MSI-X interrupt is handled by a
> +		  different CPU then the comp_vector parameter can be used to
> +		  spread the SRP completion workload over multiple CPU's.

This is fairly non-trivial for the user...

Aren't we requesting a bit too much awareness here?
Can't we just "make it work"? The user hands out ch_count - why can't
you do some least-used logic here?

Maybe we can even go with per-cpu QPs and discard the comp_vector argument?
This would probably bring the best performance, wouldn't it?
(falling back to least-used logic in case the HW supports fewer vectors)

>   		* tl_retry_count, a number in the range 2..7 specifying the
>   		  IB RC retry count.
>   		* queue_size, the maximum number of commands that the
> @@ -88,6 +88,13 @@ Description:	Whether ib_srp is allowed to include a partial memory
>   		descriptor list in an SRP_CMD when communicating with an SRP
>   		target.
>
> +What:		/sys/class/scsi_host/host<n>/ch_count
> +Date:		November 1, 2014
> +KernelVersion:	3.18
> +Contact:	linux-rdma@vger.kernel.org
> +Description:	Number of RDMA channels used for communication with the SRP
> +		target.
> +
>   What:		/sys/class/scsi_host/host<n>/cmd_sg_entries
>   Date:		May 19, 2011
>   KernelVersion:	2.6.39
> @@ -95,6 +102,12 @@ Contact:	linux-rdma@vger.kernel.org
>   Description:	Maximum number of data buffer descriptors that may be sent to
>   		the target in a single SRP_CMD request.
>
> +What:		/sys/class/scsi_host/host<n>/comp_vector
> +Date:		September 2, 2013
> +KernelVersion:	3.11
> +Contact:	linux-rdma@vger.kernel.org
> +Description:	Completion vector used for the first RDMA channel.
> +
>   What:		/sys/class/scsi_host/host<n>/dgid
>   Date:		June 17, 2006
>   KernelVersion:	2.6.17
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index eccaf65..80699a9 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -123,6 +123,11 @@ MODULE_PARM_DESC(dev_loss_tmo,
>   		 " if fast_io_fail_tmo has not been set. \"off\" means that"
>   		 " this functionality is disabled.");
>
> +static unsigned ch_count;
> +module_param(ch_count, uint, 0444);
> +MODULE_PARM_DESC(ch_count,
> +		 "Number of RDMA channels to use for communication with an SRP target. Using more than one channel improves performance if the HCA supports multiple completion vectors. The default value is the minimum of four times the number of online CPU sockets and the number of completion vectors supported by the HCA.");
> +

Why? How did you get to this magic equation?

>   static void srp_add_one(struct ib_device *device);

...
So it's pretty late for today; this is as far as I got...
Will continue later this week.

Sagi.
Bart Van Assche Oct. 20, 2014, 11:57 a.m. UTC | #3
On 10/17/14 13:06, Christoph Hellwig wrote:
>>   	} else {
>> +		if (blk_mq_unique_tag_to_hwq(rsp->tag) != ch - target->ch)
>> +			pr_err("Channel idx mismatch: tag %#llx <> ch %#lx\n",
>> +			       rsp->tag, ch - target->ch);
>>   		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
>
> Shouldn't we do this validity check inside scsi_host_find_tag, so that
> all callers get it? That means adding an argument to it,  but there are
> very few callers at the moment.

Hello Christoph,

That pr_err() statement was convenient while debugging the multiqueue 
code in the SRP initiator driver but can be left out. Would you agree 
with leaving the above three lines of debug code out instead of adding 
an additional argument to scsi_host_find_tag()?

Bart.
Bart Van Assche Oct. 20, 2014, 12:56 p.m. UTC | #4
On 10/19/14 19:36, Sagi Grimberg wrote:
> On 10/7/2014 4:07 PM, Bart Van Assche wrote:
>>           * comp_vector, a number in the range 0..n-1 specifying the
>> -          MSI-X completion vector. Some HCA's allocate multiple (n)
>> -          MSI-X vectors per HCA port. If the IRQ affinity masks of
>> -          these interrupts have been configured such that each MSI-X
>> -          interrupt is handled by a different CPU then the comp_vector
>> -          parameter can be used to spread the SRP completion workload
>> -          over multiple CPU's.
>> +          MSI-X completion vector of the first RDMA channel. Some
>> +          HCA's allocate multiple (n) MSI-X vectors per HCA port. If
>> +          the IRQ affinity masks of these interrupts have been
>> +          configured such that each MSI-X interrupt is handled by a
>> +          different CPU then the comp_vector parameter can be used to
>> +          spread the SRP completion workload over multiple CPU's.
>
> This is fairly not trivial for the user...
>
> Aren't we requesting a bit too much awareness here?
> Can't we just "make it work"? The user hands out ch_count - why can't
> you do some least-used logic here?
>
> Maybe we can even go with per-cpu QPs and discard comp_vector argument?
> this would probably bring the best performance, wouldn't it?
> (fallback to least-used logic in case HW support less vectors)

Hello Sagi,

The only reason the comp_vector parameter is still supported is because 
of backwards compatibility. What I expect is that users will set the 
ch_count parameter but not the comp_vector parameter.

Using one QP per CPU thread does not necessarily result in the best 
performance. In the tests I ran performance was about 4% better when 
using one QP for each pair of CPU threads (with hyperthreading enabled).

>> +static unsigned ch_count;
>> +module_param(ch_count, uint, 0444);
>> +MODULE_PARM_DESC(ch_count,
>> +         "Number of RDMA channels to use for communication with an
>> SRP target. Using more than one channel improves performance if the
>> HCA supports multiple completion vectors. The default value is the
>> minimum of four times the number of online CPU sockets and the number
>> of completion vectors supported by the HCA.");
>
> Why? how did you get to this magic equation?

On the systems I have access to, measurements have shown that this choice 
for the ch_count parameter results in a significant performance 
improvement without consuming too many system resources. The performance 
difference when using more than four channels was small. This means that 
the exact value of this parameter is not that important. What matters to 
me is that users can benefit from improved performance even if the 
ch_count kernel module parameter has been left to its default value.

Bart.
Christoph Hellwig Oct. 21, 2014, 8:49 a.m. UTC | #5
On Mon, Oct 20, 2014 at 01:57:21PM +0200, Bart Van Assche wrote:
> That pr_err() statement was convenient while debugging the multiqueue code
> in the SRP initiator driver but can be left out. Would you agree with
> leaving the above three lines of debug code out instead of adding an
> additional argument to scsi_host_find_tag() ?

Feel free to remove the check.

Sagi Grimberg Oct. 21, 2014, 9:10 a.m. UTC | #6
On 10/20/2014 3:56 PM, Bart Van Assche wrote:
> On 10/19/14 19:36, Sagi Grimberg wrote:
>> On 10/7/2014 4:07 PM, Bart Van Assche wrote:
>>>           * comp_vector, a number in the range 0..n-1 specifying the
>>> -          MSI-X completion vector. Some HCA's allocate multiple (n)
>>> -          MSI-X vectors per HCA port. If the IRQ affinity masks of
>>> -          these interrupts have been configured such that each MSI-X
>>> -          interrupt is handled by a different CPU then the comp_vector
>>> -          parameter can be used to spread the SRP completion workload
>>> -          over multiple CPU's.
>>> +          MSI-X completion vector of the first RDMA channel. Some
>>> +          HCA's allocate multiple (n) MSI-X vectors per HCA port. If
>>> +          the IRQ affinity masks of these interrupts have been
>>> +          configured such that each MSI-X interrupt is handled by a
>>> +          different CPU then the comp_vector parameter can be used to
>>> +          spread the SRP completion workload over multiple CPU's.
>>
>> This is fairly not trivial for the user...
>>
>> Aren't we requesting a bit too much awareness here?
>> Can't we just "make it work"? The user hands out ch_count - why can't
>> you do some least-used logic here?
>>
>> Maybe we can even go with per-cpu QPs and discard comp_vector argument?
>> this would probably bring the best performance, wouldn't it?
>> (fallback to least-used logic in case HW support less vectors)
>
> Hello Sagi,
>
> The only reason the comp_vector parameter is still supported is because
> of backwards compatibility. What I expect is that users will set the
> ch_count parameter but not the comp_vector parameter.

Agreed...

>
> Using one QP per CPU thread does not necessarily result in the best
> performance. In the tests I ran performance was about 4% better when
> using one QP for each pair of CPU threads (with hyperthreading enabled).

I usually don't like basing defaults on empirical experiments with
specific workloads. IMO, either go full-blown MQ (per-cpu) or go SQ
by default.

But that is just my opinion...
you call it.

>
>>> +static unsigned ch_count;
>>> +module_param(ch_count, uint, 0444);
>>> +MODULE_PARM_DESC(ch_count,
>>> +         "Number of RDMA channels to use for communication with an
>>> SRP target. Using more than one channel improves performance if the
>>> HCA supports multiple completion vectors. The default value is the
>>> minimum of four times the number of online CPU sockets and the number
>>> of completion vectors supported by the HCA.");
>>
>> Why? how did you get to this magic equation?
>
> On the systems I have access to measurements have shown that this choice
> for the ch_count parameter results in a significant performance
> improvement without consuming too many system resources. The performance
> difference when using more than four channels was small. This means that
> the exact value of this parameter is not that important. What matters to
> me is that users can benefit from improved performance even if the
> ch_count kernel module parameter has been left to its default value.

I do like the idea of giving users high performance out of the box. But
as I wrote above, I'm less fond of the idea of basing your choice on
experiments.

Sagi.

>
> Bart.

Sagi Grimberg Oct. 21, 2014, 9:14 a.m. UTC | #7
On 10/7/2014 4:07 PM, Bart Van Assche wrote:
> Improve performance by using multiple RDMA/RC channels per SCSI
> host for communication with an SRP target. About the
> implementation:
> - Introduce a loop over all channels in the code that uses
>    target->ch.
> - Set the SRP_MULTICHAN_MULTI flag during login for the creation
>    of the second and subsequent channels.
> - RDMA completion vectors are chosen such that RDMA completion
>    interrupts are handled by the CPU socket that submitted the I/O
>    request. As one can see in this patch it has been assumed if a
>    system contains n CPU sockets and m RDMA completion vectors
>    have been assigned to an RDMA HCA that IRQ affinity has been
>    configured such that completion vectors [i*m/n..(i+1)*m/n) are
>    bound to CPU socket i with 0 <= i < n.
> - Modify srp_free_ch_ib() and srp_free_req_data() such that it
>    becomes safe to invoke these functions after the corresponding
>    allocation function failed.
> - Add a ch_count sysfs attribute per target port.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> Cc: Sagi Grimberg <sagig@mellanox.com>
> Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>

<SNIP>

>   			spin_lock_irqsave(&ch->lock, flags);
>   			ch->req_lim += be32_to_cpu(rsp->req_lim_delta);
> @@ -1906,7 +1970,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
>   		goto err;

Bart,

Any chance you can share some perf output on this code?
I'm interested in knowing the contention on target->lock, which is
still taken in the I/O path across channels.

Can we think about how to avoid it?

I would also like to understand where the bottleneck moved to.

Sagi.
Sagi Grimberg Oct. 28, 2014, 6:32 p.m. UTC | #8
On 10/21/2014 12:10 PM, Sagi Grimberg wrote:
> On 10/20/2014 3:56 PM, Bart Van Assche wrote:
>> On 10/19/14 19:36, Sagi Grimberg wrote:
>>> On 10/7/2014 4:07 PM, Bart Van Assche wrote:
>>>>           * comp_vector, a number in the range 0..n-1 specifying the
>>>> -          MSI-X completion vector. Some HCA's allocate multiple (n)
>>>> -          MSI-X vectors per HCA port. If the IRQ affinity masks of
>>>> -          these interrupts have been configured such that each MSI-X
>>>> -          interrupt is handled by a different CPU then the comp_vector
>>>> -          parameter can be used to spread the SRP completion workload
>>>> -          over multiple CPU's.
>>>> +          MSI-X completion vector of the first RDMA channel. Some
>>>> +          HCA's allocate multiple (n) MSI-X vectors per HCA port. If
>>>> +          the IRQ affinity masks of these interrupts have been
>>>> +          configured such that each MSI-X interrupt is handled by a
>>>> +          different CPU then the comp_vector parameter can be used to
>>>> +          spread the SRP completion workload over multiple CPU's.
>>>
>>> This is fairly not trivial for the user...
>>>
>>> Aren't we requesting a bit too much awareness here?
>>> Can't we just "make it work"? The user hands out ch_count - why can't
>>> you do some least-used logic here?
>>>
>>> Maybe we can even go with per-cpu QPs and discard comp_vector argument?
>>> this would probably bring the best performance, wouldn't it?
>>> (fallback to least-used logic in case HW support less vectors)
>>
>> Hello Sagi,
>>
>> The only reason the comp_vector parameter is still supported is because
>> of backwards compatibility. What I expect is that users will set the
>> ch_count parameter but not the comp_vector parameter.
>

Hey Bart,

Another thing I wonder about here: say you have 8 cores on a single NUMA
node. The first connection will attach to vectors 0-3 (ch_count=4), and so
will all subsequent connections. Don't we want to spread that a little?

If we are not going per-CPU, why aren't we trying to spread the vectors
around to reduce the interference?

Sagi.
Bart Van Assche Oct. 29, 2014, 10:52 a.m. UTC | #9
On 10/28/14 19:32, Sagi Grimberg wrote:
> On 10/21/2014 12:10 PM, Sagi Grimberg wrote:
>> On 10/20/2014 3:56 PM, Bart Van Assche wrote:
>>> On 10/19/14 19:36, Sagi Grimberg wrote:
>>>> On 10/7/2014 4:07 PM, Bart Van Assche wrote:
>>>>>           * comp_vector, a number in the range 0..n-1 specifying the
>>>>> -          MSI-X completion vector. Some HCA's allocate multiple (n)
>>>>> -          MSI-X vectors per HCA port. If the IRQ affinity masks of
>>>>> -          these interrupts have been configured such that each MSI-X
>>>>> -          interrupt is handled by a different CPU then the
>>>>> comp_vector
>>>>> -          parameter can be used to spread the SRP completion workload
>>>>> -          over multiple CPU's.
>>>>> +          MSI-X completion vector of the first RDMA channel. Some
>>>>> +          HCA's allocate multiple (n) MSI-X vectors per HCA port. If
>>>>> +          the IRQ affinity masks of these interrupts have been
>>>>> +          configured such that each MSI-X interrupt is handled by a
>>>>> +          different CPU then the comp_vector parameter can be used to
>>>>> +          spread the SRP completion workload over multiple CPU's.
>>>>
>>>> This is fairly not trivial for the user...
>>>>
>>>> Aren't we requesting a bit too much awareness here?
>>>> Can't we just "make it work"? The user hands out ch_count - why can't
>>>> you do some least-used logic here?
>>>>
>>>> Maybe we can even go with per-cpu QPs and discard comp_vector argument?
>>>> this would probably bring the best performance, wouldn't it?
>>>> (fallback to least-used logic in case HW support less vectors)
>>>
>>> The only reason the comp_vector parameter is still supported is because
>>> of backwards compatibility. What I expect is that users will set the
>>> ch_count parameter but not the comp_vector parameter.
>
> Another wander I have with this. Say you have 8 cores on a single numa
> node. First connection will attach to vectors 0-3 (ch_count=4) and so
> are all the connections. Don't we want to spread that a little?
>
> If we are not going per-cpu, why aren't we trying to spread vectors
> around to try and reduce the interference?

Hello Sagi,

Sorry but your question is not entirely clear to me. Are you referring 
to spreading the workload over CPU's or over completion vectors? If a 
user wants to spread the completion workload maximally by using all 
completion vectors that can be achieved by setting ch_count to a value 
that is equal to or larger than the number of completion vectors.

As mentioned in the commit message, spreading the completion workload 
over CPU's is not entirely under control of the SRP initiator driver. It 
is assumed that a user assigns IRQ affinity such that the interrupts 
associated with different completion vectors are processed by different 
CPU threads. If there are more RDMA channels than completion vectors, 
the SRP initiator driver associates RDMA channels with completion 
vectors such that the workload is spread evenly over the completion vectors.

Bart.
Bart Van Assche Oct. 29, 2014, 12:36 p.m. UTC | #10
On 10/21/14 11:14, Sagi Grimberg wrote:
> On 10/7/2014 4:07 PM, Bart Van Assche wrote:
>>               spin_lock_irqsave(&ch->lock, flags);
>>               ch->req_lim += be32_to_cpu(rsp->req_lim_delta);
>> @@ -1906,7 +1970,7 @@ static int srp_queuecommand(struct Scsi_Host
>> *shost, struct scsi_cmnd *scmnd)
>>           goto err;
>
> Bart,
>
> Any chance you can share some perf output on this code?
> I'm interested of knowing the contention on target->lock that is
> still taken on the IO path across channels.
>
> Can we think on how to avoid it?
>
> Also would like to understand the where did the bottleneck transition.

Hello Sagi,

Are you referring to target->lock ? That lock isn't taken anywhere in 
the hot path. More in general, I haven't seen any lock contention in the 
perf output that was caused by the block layer, SCSI core, SRP initiator 
or HCA (mlx4) drivers. The code that showed up highest in the perf 
output was the direct I/O code, the code that is triggered by fio to 
submit I/O requests.

Bart.

Sagi Grimberg Oct. 30, 2014, 2:19 p.m. UTC | #11
On 10/29/2014 12:52 PM, Bart Van Assche wrote:
> On 10/28/14 19:32, Sagi Grimberg wrote:
>> On 10/21/2014 12:10 PM, Sagi Grimberg wrote:
>>> On 10/20/2014 3:56 PM, Bart Van Assche wrote:
>>>> On 10/19/14 19:36, Sagi Grimberg wrote:
>>>>> On 10/7/2014 4:07 PM, Bart Van Assche wrote:
>>>>>>           * comp_vector, a number in the range 0..n-1 specifying the
>>>>>> -          MSI-X completion vector. Some HCA's allocate multiple (n)
>>>>>> -          MSI-X vectors per HCA port. If the IRQ affinity masks of
>>>>>> -          these interrupts have been configured such that each MSI-X
>>>>>> -          interrupt is handled by a different CPU then the
>>>>>> comp_vector
>>>>>> -          parameter can be used to spread the SRP completion
>>>>>> workload
>>>>>> -          over multiple CPU's.
>>>>>> +          MSI-X completion vector of the first RDMA channel. Some
>>>>>> +          HCA's allocate multiple (n) MSI-X vectors per HCA port. If
>>>>>> +          the IRQ affinity masks of these interrupts have been
>>>>>> +          configured such that each MSI-X interrupt is handled by a
>>>>>> +          different CPU then the comp_vector parameter can be
>>>>>> used to
>>>>>> +          spread the SRP completion workload over multiple CPU's.
>>>>>
>>>>> This is fairly not trivial for the user...
>>>>>
>>>>> Aren't we requesting a bit too much awareness here?
>>>>> Can't we just "make it work"? The user hands out ch_count - why can't
>>>>> you do some least-used logic here?
>>>>>
>>>>> Maybe we can even go with per-cpu QPs and discard comp_vector
>>>>> argument?
>>>>> this would probably bring the best performance, wouldn't it?
>>>>> (fallback to least-used logic in case HW support less vectors)
>>>>
>>>> The only reason the comp_vector parameter is still supported is because
>>>> of backwards compatibility. What I expect is that users will set the
>>>> ch_count parameter but not the comp_vector parameter.
>>
>> Another wander I have with this. Say you have 8 cores on a single numa
>> node. First connection will attach to vectors 0-3 (ch_count=4) and so
>> are all the connections. Don't we want to spread that a little?
>>
>> If we are not going per-cpu, why aren't we trying to spread vectors
>> around to try and reduce the interference?
>
> Hello Sagi,
>
> Sorry but your question is not entirely clear to me. Are you referring
> to spreading the workload over CPU's or over completion vectors ?

I'm talking about completion vectors, but I assume both as I consider
spreading interrupt vectors across CPU cores a common practice.

> If a
> user wants to spread the completion workload maximally by using all
> completion vectors that can be achieved by setting ch_count to a value
> that is equal to or larger than the number of completion vectors.
>

I'm talking about the default.
My impression is that with the default settings, on a single NUMA node
with 8 cores, 2 different SRP connections (using 4 channels each) will
both be associated with comp vectors 0-3, while they could potentially
use vectors 4-7 and reduce possible mutual interference. Right?
(You said yourself that the user is not expected to use comp_vector
and that it is only kept for backward compatibility.)

Now, given that each connection uses fewer than per-CPU channels, don't
you think this logic would be helpful?

> As mentioned in the commit message, spreading the completion workload
> over CPU's is not entirely under control of the SRP initiator driver.

I was referring to comp vectors - but I consider a 1x1 mapping common
usage when it comes to RDMA (and not only RDMA, btw).

Feel free to correct me if I misunderstand the implementation.

Sagi.
Sagi Grimberg Oct. 30, 2014, 2:22 p.m. UTC | #12
On 10/29/2014 2:36 PM, Bart Van Assche wrote:
> On 10/21/14 11:14, Sagi Grimberg wrote:
>> On 10/7/2014 4:07 PM, Bart Van Assche wrote:
>>>               spin_lock_irqsave(&ch->lock, flags);
>>>               ch->req_lim += be32_to_cpu(rsp->req_lim_delta);
>>> @@ -1906,7 +1970,7 @@ static int srp_queuecommand(struct Scsi_Host
>>> *shost, struct scsi_cmnd *scmnd)
>>>           goto err;
>>
>> Bart,
>>
>> Any chance you can share some perf output on this code?
>> I'm interested of knowing the contention on target->lock that is
>> still taken on the IO path across channels.
>>
>> Can we think on how to avoid it?
>>
>> Also would like to understand the where did the bottleneck transition.
>
> Hello Sagi,
>
> Are you referring to target->lock ? That lock isn't taken anywhere in
> the hot path.

Right, my recollection was that we used to acquire the target-lock in
srp_chkready(). I see that's not the case anymore.

> More in general, I haven't seen any lock contention in the
> perf output that was caused by the block layer, SCSI core, SRP initiator
> or HCA (mlx4) drivers. The code that showed up highest in the perf
> output was the direct I/O code, the code that is triggered by fio to
> submit I/O requests.

Interesting, you don't see more contention on SQ/RQ/CQ locks? I find
that surprising.

Sagi.
Bart Van Assche Oct. 30, 2014, 2:36 p.m. UTC | #13
On 10/30/14 15:19, Sagi Grimberg wrote:
> My impression here that in the default settings, on a 1 NUMA node with
> 8 cores, 2 different srp connections (using 4 channels each) will be
> associated with comp vectors 0-3. while it could potentially use
> vectors 4-7 and reduce possible mutual interference. right?

Hello Sagi,

That's correct. For this example, if use of all completion vectors is
desired, additional configuration is required, e.g. by setting ch_count
to 8 in /etc/modprobe.d/ib_srp.conf. By the way, I'm not sure it is
possible to avoid manual configuration and tuning entirely. As an
example, with a six-core CPU at the initiator side and with
hyperthreading enabled (12 CPU threads in total), I see higher IOPS
results with ch_count=6 compared to ch_count=8 or ch_count=12. I have
not tried to determine why, but maybe this is because ch_count values
below the number of CPU threads cause some interrupt coalescing.

Bart.
Sagi Grimberg Oct. 30, 2014, 3:06 p.m. UTC | #14
On 10/30/2014 4:36 PM, Bart Van Assche wrote:
> On 10/30/14 15:19, Sagi Grimberg wrote:
>> My impression here that in the default settings, on a 1 NUMA node with
>> 8 cores, 2 different srp connections (using 4 channels each) will be
>> associated with comp vectors 0-3. while it could potentially use
>> vectors 4-7 and reduce possible mutual interference. right?
>
> Hello Sagi,
>
> That's correct. For this example if use of all completion vectors is
> desired additional configuration is required, e.g. by setting ch_count
> to 8 in /etc/modprobe.d/ib_srp.conf.

That is why I think the user is still expected to be aware of the
configuration in order to get maximum performance. I would like to
see best performance "just work". For example, I don't see any sort
of sw queue count to configure; it "just works".

Now, I also agree that this may mean more (or sometimes way more)
resources, but I suggest that if we go with a default of 4 per NUMA
node we should take care of such situations.

I'm not strict about this with respect to this patch set. But I think
we should consider this bit.

> By the way, I'm not sure it is
> possible to avoid manual configuration and tuning entirely. As an
> example, with a six core CPU at the initiator side and with
> hyperthreading enabled (12 CPU threads in total) I see higher IOPS
> results with ch_count=6 compared to ch_count=8 or ch_count=12.
> I have
> not tried to determine why but maybe this is because ch_count values
> that are below the number of CPU threads cause some interrupt coalescing.

I'm not aware of any implicit interrupt coalescing effect...
Bart Van Assche Oct. 30, 2014, 3:19 p.m. UTC | #15
On 10/30/14 16:06, Sagi Grimberg wrote:
> I'm not aware of any implicit interrupt coalescing effect...

In case it was not clear what I was referring to: if multiple completion 
queue handling routines run on the same CPU, then the average number of 
work completions processed by each completion handling routine 
increases, due to the increased time between the generation of an 
interrupt and the start of the completion handler routine. As you know, 
this helps overall system throughput.

Bart.

Sagi Grimberg Oct. 30, 2014, 5:33 p.m. UTC | #16
On 10/30/2014 5:19 PM, Bart Van Assche wrote:
> On 10/30/14 16:06, Sagi Grimberg wrote:
>> I'm not aware of any implicit interrupt coalescing effect...
>
> In case it was not clear what I was referring to: if multiple completion
> queue handling routines run on the same CPU then the average number of
> work completions processed by each completion handling routine increases
> due to the increased time between generation of an interrupt and the
> start of the completion handler routine. As you know this helps overall
> system throughput.
>

Now I realize that we can hit serious problems here, since we never
solved the issue of the SRP polling routine that might poll forever
within an interrupt (or at least until a hard lockup). It's interesting
that you weren't able to hit that with a high workload. Did you try
running this code on a virtual function? (I witnessed this issue in
iSER on a VM.)

Moreover, the fairness issue is even more likely to be encountered in 
multichannel. Did you try to hit that? I really think this patchset
*needs* to deal with the 2 issues I mentioned, as the probability of
hitting them increases with a faster IO stack.

I remember this was discussed lately with consideration for using
blk-iopoll or not. But I think that the initial approach of bailing
out of the poll loop once we hit a budget is fine for now.

Sagi.
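
For illustration, the budget-based bail-out mentioned above can be
sketched in plain C. Note that fake_poll_cq(), pending and the budget
value are invented stand-ins for the real ib_poll_cq() loop in
ib_srp.c; this is a sketch of the idea, not the driver code:

```c
#include <stddef.h>

/* Simulated CQ depth; a stand-in for completions queued on the HCA. */
static int pending;

/* Hypothetical stand-in for ib_poll_cq(): reap at most one completion. */
static int fake_poll_cq(void)
{
	if (pending <= 0)
		return 0;	/* CQ empty */
	pending--;
	return 1;		/* one work completion reaped */
}

/*
 * Budgeted variant of a completion handler: process at most 'budget'
 * completions per invocation, then return the number handled so the
 * caller can reschedule (e.g. via blk-iopoll) instead of spinning
 * forever in hard-irq context while another CPU keeps submitting.
 */
static int poll_with_budget(int budget)
{
	int done = 0;

	while (done < budget && fake_poll_cq())
		done++;
	return done;
}
```

With a budget of 32, a backlog of 100 completions is drained in four
bounded handler invocations instead of one unbounded loop, which gives
other CQs a chance to be serviced.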
Bart Van Assche Oct. 31, 2014, 9:19 a.m. UTC | #17
On 10/30/14 18:33, Sagi Grimberg wrote:
> Now I realize that we can hit serious problems here since we never
> solved the issue of srp polling routine that might poll forever within
> an interrupt (or at least until a hard lockup). Its interesting that
> you weren't able to hit that with a high workload. Did you try running
> this code on a virtual function (I witnessed this issue in iser on a VM).
>
> Moreover, the fairness issue is even more likely to be encountered in
> multichannel. Did you try to hit that? I really think this patchset
> *needs* to deal with the 2 issues I mentioned as the probability of
> hitting them increases with a faster IO stack.
>
> I remember this was discussed lately with consideration for using
> blk-iopoll or not. But I think that for now the initial approach of
> bailing out of the once we hit a budget is fine for now.

Hello Sagi,

As you mentioned, so far this fairness issue has only caused trouble 
with iSER in a virtual machine guest. I have not yet seen anyone report 
a QP servicing fairness problem for the SRP initiator. Although 
analyzing, and if needed limiting, the maximum number of iterations in 
the SRP polling routine is on my to-do list, addressing that issue is 
outside the scope of this patch series.

Regarding the impact of this patch series on QP handling fairness: the 
time spent in the SRP RDMA completion handler depends on the number of 
completions processed at once. This number depends on:
(a) The number of CPU cores in the initiator system that submit I/O and
     that are associated with a single RDMA channel.
(b) The target system processing speed per RDMA channel.

This patch series reduces (a) by a factor ch_count. (b) is either 
unaffected (linear scaling) or slightly reduced (less than linear 
scaling). My conclusion is that if this patch series has an impact on 
QP handling fairness, it will improve fairness, since the number of 
completions processed at once either remains unchanged or is reduced.

Bart.
Sagi Grimberg Nov. 2, 2014, 1:03 p.m. UTC | #18
On 10/31/2014 11:19 AM, Bart Van Assche wrote:
> On 10/30/14 18:33, Sagi Grimberg wrote:
>> Now I realize that we can hit serious problems here since we never
>> solved the issue of srp polling routine that might poll forever within
>> an interrupt (or at least until a hard lockup). Its interesting that
>> you weren't able to hit that with a high workload. Did you try running
>> this code on a virtual function (I witnessed this issue in iser on a VM).
>>
>> Moreover, the fairness issue is even more likely to be encountered in
>> multichannel. Did you try to hit that? I really think this patchset
>> *needs* to deal with the 2 issues I mentioned as the probability of
>> hitting them increases with a faster IO stack.
>>
>> I remember this was discussed lately with consideration for using
>> blk-iopoll or not. But I think that for now the initial approach of
>> bailing out of the once we hit a budget is fine for now.
>
> Hello Sagi,
>
> As you mentioned so far this fairness issue has only caused trouble with
> iSER in a virtual machine guest. I have not yet seen anyone reporting a
> QP servicing fairness problem for the SRP initiator.

IMHO, this is not an iSER-specific issue; it is easy to see from the
code that under a specific workload SRP will poll the recv completion
queue forever in an interrupt context.

I encountered this issue on a virtual guest under a high workload (80+
sessions with heavy traffic on all of them) because the qemu
smp_affinity setting was broken (might still be, I haven't checked
that for a while). This caused all completion vectors to fire
interrupts to core 0, causing high event contention on a single event
queue (causing lockup situations and starvation of other CQs). Using
more completion queues will exacerbate this situation.

I think running multichannel code when all MSI-X vector affinities are
directed to a single CPU can invoke what I'm talking about.

> Although analyzing
> and if needed limiting the maximum number of iterations in the SRP
> polling routine is on my to-do list, addressing that issue is outside of
> the scope of this patch series.

Although neither of us has yet heard such complaints from SRP users,
I disagree, because this might make the problems worse. But if you want
to take it later I guess that's fine too.

>
> Regarding the impact of this patch series on QP handling fairness: the
> time spent in the SRP RDMA completion handler depends on the number of
> completions processed at once. This number depends on:
> (a) The number of CPU cores in the initiator system that submit I/O and
>      that are associated with a single RDMA channel.
> (b) The target system processing speed per RDMA channel.
>
> This patch series reduces (a) by a factor ch_count.

This is under the assumption that IRQ affinity is spread across several
CPUs, and that's fine, but we should *not* hit a hard lockup in case it
is not (and I suspect we can).

> (b) is either
> unaffected (linear scaling) or slightly reduced (less than linear
> scaling). My conclusion is that if this patch series has an impact on QP
> handling fairness that it will improve fairness since the number of
> completions processed at once either remains unchanged or that it is
> reduced.
>

I think that in the single-CPU completion queue processing case this
can exacerbate the problem as well.

Sagi.
Elliott, Robert (Server Storage) Nov. 3, 2014, 1:46 a.m. UTC | #19
> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> Sent: Sunday, November 02, 2014 7:03 AM
> To: Bart Van Assche; Christoph Hellwig
> Cc: Jens Axboe; Sagi Grimberg; Sebastian Parschauer; Elliott, Robert
> (Server Storage); Ming Lei; linux-scsi@vger.kernel.org; linux-rdma
> Subject: Re: [PATCH v2 12/12] IB/srp: Add multichannel support
> 
...
> IMHO, this is not an iSER-specific issue; it is easy to see from the
> code that under a specific workload SRP will poll the recv completion
> queue forever in an interrupt context.
> 
> I encountered this issue on a virtual guest under a high workload (80+
> sessions with heavy traffic on all of them) because the qemu
> smp_affinity setting was broken (might still be, I haven't checked
> that for a while). This caused all completion vectors to fire
> interrupts to core 0, causing high event contention on a single event
> queue (causing lockup situations and starvation of other CQs). Using
> more completion queues will exacerbate this situation.
> 
> I think running multichannel code when all MSI-X vector affinities are
> directed to a single CPU can invoke what I'm talking about.

That's not an SRP specific problem either.  If you ask just one CPU to
service interrupts and block layer completions for submissions from lots
of other CPUs, it's bound to become overloaded.

Setting rq_affinity=2 helps quite a bit for the block layer completion
work.  This patch proposed making that the default for blk-mq:
	https://lkml.org/lkml/2014/9/9/931

For SRP interrupt processing, irqbalance recently changed its default 
to ignore the affinity_hint; you now need to pass an option to honor
the hint, or provide a policy script to do so for selected irqs.  For
multi-million IOPS workloads, irqbalance takes far too long to reroute
them based on activity; you're likely to overload a CPU with 100% 
hardirq processing, creating self-detected stalls for the submitting
processes on that CPU and other problems.  Sending interrupts back 
to the submitting CPU provides self-throttling.
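
The two knobs mentioned above can be set as follows. This is a sketch:
the sysfs path uses a placeholder device name, and the irqbalance
option name is the one provided by recent irqbalance versions:

```
# Complete block-layer requests on the CPU that submitted them.
echo 2 > /sys/block/sdX/queue/rq_affinity

# Make irqbalance honor the driver-provided affinity hints.
irqbalance --hintpolicy=exact
```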

Bart Van Assche Nov. 4, 2014, 11:46 a.m. UTC | #20
On 11/03/14 02:46, Elliott, Robert (Server Storage) wrote:
>> -----Original Message-----
>> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
>> Sent: Sunday, November 02, 2014 7:03 AM
>> To: Bart Van Assche; Christoph Hellwig
>> Cc: Jens Axboe; Sagi Grimberg; Sebastian Parschauer; Elliott, Robert
>> (Server Storage); Ming Lei; linux-scsi@vger.kernel.org; linux-rdma
>> Subject: Re: [PATCH v2 12/12] IB/srp: Add multichannel support
>>
> ...
>> IMHO, this is not an iSER-specific issue; it is easy to see from the
>> code that under a specific workload SRP will poll the recv completion
>> queue forever in an interrupt context.
>>
>> I encountered this issue on a virtual guest under a high workload (80+
>> sessions with heavy traffic on all of them) because the qemu
>> smp_affinity setting was broken (might still be, I haven't checked
>> that for a while). This caused all completion vectors to fire
>> interrupts to core 0, causing high event contention on a single event
>> queue (causing lockup situations and starvation of other CQs). Using
>> more completion queues will exacerbate this situation.
>>
>> I think running multichannel code when all MSI-X vector affinities are
>> directed to a single CPU can invoke what I'm talking about.
>
> That's not an SRP specific problem either.  If you ask just one CPU to
> service interrupts and block layer completions for submissions from lots
> of other CPUs, it's bound to become overloaded.
>
> Setting rq_affinity=2 helps quite a bit for the block layer completion
> work.  This patch proposed making that the default for blk-mq:
> 	https://lkml.org/lkml/2014/9/9/931
>
> For SRP interrupt processing, irqbalance recently changed its default
> to ignore the affinity_hint; you now need to pass an option to honor
> the hint, or provide a policy script to do so for selected irqs.  For
> multi-million IOPS workloads, irqbalance takes far too long to reroute
> them based on activity; you're likely to overload a CPU with 100%
> hardirq processing, creating self-detected stalls for the submitting
> processes on that CPU and other problems.  Sending interrupts back
> to the submitting CPU provides self-throttling.

Hello Sagi,

To me it seems like with Rob's reply all questions about this patch 
series have been answered. But I think Christoph is still waiting for a 
Reviewed-by tag from you for patch 12/12.

Bart.


Sagi Grimberg Nov. 4, 2014, 12:15 p.m. UTC | #21
On 11/4/2014 1:46 PM, Bart Van Assche wrote:
> On 11/03/14 02:46, Elliott, Robert (Server Storage) wrote:
>>> -----Original Message-----
>>> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
>>> Sent: Sunday, November 02, 2014 7:03 AM
>>> To: Bart Van Assche; Christoph Hellwig
>>> Cc: Jens Axboe; Sagi Grimberg; Sebastian Parschauer; Elliott, Robert
>>> (Server Storage); Ming Lei; linux-scsi@vger.kernel.org; linux-rdma
>>> Subject: Re: [PATCH v2 12/12] IB/srp: Add multichannel support
>>>
>> ...
>>> IMHO, this is not an iSER-specific issue; it is easy to see from the
>>> code that under a specific workload SRP will poll the recv completion
>>> queue forever in an interrupt context.
>>>
>>> I encountered this issue on a virtual guest under a high workload (80+
>>> sessions with heavy traffic on all of them) because the qemu
>>> smp_affinity setting was broken (might still be, I haven't checked
>>> that for a while). This caused all completion vectors to fire
>>> interrupts to core 0, causing high event contention on a single event
>>> queue (causing lockup situations and starvation of other CQs). Using
>>> more completion queues will exacerbate this situation.
>>>
>>> I think running multichannel code when all MSI-X vector affinities are
>>> directed to a single CPU can invoke what I'm talking about.
>>
>> That's not an SRP specific problem either.  If you ask just one CPU to
>> service interrupts and block layer completions for submissions from lots
>> of other CPUs, it's bound to become overloaded.
>>
>> Setting rq_affinity=2 helps quite a bit for the block layer completion
>> work.  This patch proposed making that the default for blk-mq:
>>     https://lkml.org/lkml/2014/9/9/931
>>
>> For SRP interrupt processing, irqbalance recently changed its default
>> to ignore the affinity_hint; you now need to pass an option to honor
>> the hint, or provide a policy script to do so for selected irqs.  For
>> multi-million IOPS workloads, irqbalance takes far too long to reroute
>> them based on activity; you're likely to overload a CPU with 100%
>> hardirq processing, creating self-detected stalls for the submitting
>> processes on that CPU and other problems.  Sending interrupts back
>> to the submitting CPU provides self-throttling.
>
> Hello Sagi,
>
> To me it seems like with Rob's reply all questions about this patch
> series have been answered. But I think Christoph is still waiting for a
> Reviewed-by tag from you for patch 12/12.
>

Hey Bart & Rob,

I'm sorry, but I didn't get to reply to Rob's email yesterday.

I think that Rob and I are not talking about the same issue. In
case only a single core is servicing interrupts it is indeed expected
that it will spend 100% in hard-irq, that's acceptable since it is
pounded with completions all the time.

However, I'm referring to a condition where SRP will spend infinite
time servicing a single interrupt (while loop on ib_poll_cq that never
drains) which will lead to a hard lockup.

This *can* happen, and I do believe that with an optimized IO path
it is even more likely to.

Anyway, since I am sure you ran sufficient testing on this code (and
didn't see the issue), and since I don't want my concerns to block this
code from 3.18, and I didn't find other gating issues, you can add:

Reviewed-by: Sagi Grimberg <sagig@mellanox.com>

Sagi.
Elliott, Robert (Server Storage) Nov. 5, 2014, 4:57 a.m. UTC | #22
> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> Sent: Tuesday, November 04, 2014 6:15 AM
> To: Bart Van Assche; Elliott, Robert (Server Storage); Christoph Hellwig
> Cc: Jens Axboe; Sagi Grimberg; Sebastian Parschauer; Ming Lei; linux-
> scsi@vger.kernel.org; linux-rdma
> Subject: Re: [PATCH v2 12/12] IB/srp: Add multichannel support
> 
...
> I think that Rob and I are not talking about the same issue. In
> case only a single core is servicing interrupts it is indeed expected
> that it will spend 100% in hard-irq, that's acceptable since it is
> pounded with completions all the time.
> 
> However, I'm referring to a condition where SRP will spend infinite
> time servicing a single interrupt (while loop on ib_poll_cq that never
> drains) which will lead to a hard lockup.
> 
> This *can* happen, and I do believe that with an optimized IO path
> it is even more likely to.

If the IB completions/interrupts are only for IOs submitted on this
CPU, then the CQ will eventually drain, because this CPU is not 
submitting anything new while stuck in the loop.

This can become bursty, though - submit a lot of IOs, then be busy
completing all of them and not submitting more, resulting in the 
queue depth bouncing from 0 to high to 0 to high.  I've seen
that with both hpsa and mpt3sas drivers.  The fio options
iodepth_batch, iodepth_batch_complete, and iodepth_low
can amplify and reduce that effect (using libaio).

I haven't found a good way for the LLD ISRs and the block
layer completion code to decide to yield the CPU based on how
much time they are taking - that would almost qualify as
a realtime kernel feature.  If you compile with
CONFIG_IRQ_TIME_ACCOUNTING, the kernel does keep track
of that information; perhaps that could be exported so
modules can use it?

---
Rob Elliott, HP Server Storage

Sagi Grimberg Nov. 5, 2014, 11:22 a.m. UTC | #23
On 11/5/2014 6:57 AM, Elliott, Robert (Server Storage) wrote:
>
>
>> -----Original Message-----
>> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
>> Sent: Tuesday, November 04, 2014 6:15 AM
>> To: Bart Van Assche; Elliott, Robert (Server Storage); Christoph Hellwig
>> Cc: Jens Axboe; Sagi Grimberg; Sebastian Parschauer; Ming Lei; linux-
>> scsi@vger.kernel.org; linux-rdma
>> Subject: Re: [PATCH v2 12/12] IB/srp: Add multichannel support
>>
> ...
>> I think that Rob and I are not talking about the same issue. In
>> case only a single core is servicing interrupts it is indeed expected
>> that it will spend 100% in hard-irq, that's acceptable since it is
>> pounded with completions all the time.
>>
>> However, I'm referring to a condition where SRP will spend infinite
>> time servicing a single interrupt (while loop on ib_poll_cq that never
>> drains) which will lead to a hard lockup.
>>
>> This *can* happen, and I do believe that with an optimized IO path
>> it is even more likely to.
>
> If the IB completions/interrupts are only for IOs submitted on this
> CPU, then the CQ will eventually drain, because this CPU is not
> submitting anything new while stuck in the loop.

They're not (or not necessarily). I'm talking about the case where the
IOs are submitted from another CPU. This creates a cycle where the
submitter is generating completions on CPU X and the completer is
freeing up room for more submissions on CPU Y. This process can
never end while the completer is in hard-irq context.

>
> This can become bursty, though - submit a lot of IOs, then be busy
> completing all of them and not submitting more, resulting in the
> queue depth bouncing from 0 to high to 0 to high.  I've seen
> that with both hpsa and mpt3sas drivers.  The fio options
> iodepth_batch, iodepth_batch_complete, and iodepth_low
> can amplify and reduce that effect (using libaio).
>

blk-iopoll (or some other form of budgeting completions) should take
care of that.

Sagi.

Patch

diff --git a/Documentation/ABI/stable/sysfs-driver-ib_srp b/Documentation/ABI/stable/sysfs-driver-ib_srp
index b9688de..d5a459e 100644
--- a/Documentation/ABI/stable/sysfs-driver-ib_srp
+++ b/Documentation/ABI/stable/sysfs-driver-ib_srp
@@ -55,12 +55,12 @@  Description:	Interface for making ib_srp connect to a new target.
 		  only safe with partial memory descriptor list support enabled
 		  (allow_ext_sg=1).
 		* comp_vector, a number in the range 0..n-1 specifying the
-		  MSI-X completion vector. Some HCA's allocate multiple (n)
-		  MSI-X vectors per HCA port. If the IRQ affinity masks of
-		  these interrupts have been configured such that each MSI-X
-		  interrupt is handled by a different CPU then the comp_vector
-		  parameter can be used to spread the SRP completion workload
-		  over multiple CPU's.
+		  MSI-X completion vector of the first RDMA channel. Some
+		  HCA's allocate multiple (n) MSI-X vectors per HCA port. If
+		  the IRQ affinity masks of these interrupts have been
+		  configured such that each MSI-X interrupt is handled by a
+		  different CPU then the comp_vector parameter can be used to
+		  spread the SRP completion workload over multiple CPU's.
 		* tl_retry_count, a number in the range 2..7 specifying the
 		  IB RC retry count.
 		* queue_size, the maximum number of commands that the
@@ -88,6 +88,13 @@  Description:	Whether ib_srp is allowed to include a partial memory
 		descriptor list in an SRP_CMD when communicating with an SRP
 		target.
 
+What:		/sys/class/scsi_host/host<n>/ch_count
+Date:		November 1, 2014
+KernelVersion:	3.18
+Contact:	linux-rdma@vger.kernel.org
+Description:	Number of RDMA channels used for communication with the SRP
+		target.
+
 What:		/sys/class/scsi_host/host<n>/cmd_sg_entries
 Date:		May 19, 2011
 KernelVersion:	2.6.39
@@ -95,6 +102,12 @@  Contact:	linux-rdma@vger.kernel.org
 Description:	Maximum number of data buffer descriptors that may be sent to
 		the target in a single SRP_CMD request.
 
+What:		/sys/class/scsi_host/host<n>/comp_vector
+Date:		September 2, 2013
+KernelVersion:	3.11
+Contact:	linux-rdma@vger.kernel.org
+Description:	Completion vector used for the first RDMA channel.
+
 What:		/sys/class/scsi_host/host<n>/dgid
 Date:		June 17, 2006
 KernelVersion:	2.6.17
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index eccaf65..80699a9 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -123,6 +123,11 @@  MODULE_PARM_DESC(dev_loss_tmo,
 		 " if fast_io_fail_tmo has not been set. \"off\" means that"
 		 " this functionality is disabled.");
 
+static unsigned ch_count;
+module_param(ch_count, uint, 0444);
+MODULE_PARM_DESC(ch_count,
+		 "Number of RDMA channels to use for communication with an SRP target. Using more than one channel improves performance if the HCA supports multiple completion vectors. The default value is the minimum of four times the number of online CPU sockets and the number of completion vectors supported by the HCA.");
+
 static void srp_add_one(struct ib_device *device);
 static void srp_remove_one(struct ib_device *device);
 static void srp_recv_completion(struct ib_cq *cq, void *ch_ptr);
@@ -562,11 +567,26 @@  static void srp_free_ch_ib(struct srp_target_port *target,
 	struct srp_device *dev = target->srp_host->srp_dev;
 	int i;
 
+	if (!ch->target)
+		return;
+
+	/*
+	 * Avoid that the SCSI error handler tries to use this channel after
+	 * it has been freed. The SCSI error handler can namely continue
+	 * trying to perform recovery actions after scsi_remove_host()
+	 * returned.
+	 */
+	ch->target = NULL;
+
 	if (ch->cm_id) {
 		ib_destroy_cm_id(ch->cm_id);
 		ch->cm_id = NULL;
 	}
 
+	/* If srp_new_cm_id() succeeded but srp_create_ch_ib() not, return. */
+	if (!ch->qp)
+		return;
+
 	if (dev->use_fast_reg) {
 		if (ch->fr_pool)
 			srp_destroy_fr_pool(ch->fr_pool);
@@ -647,7 +667,7 @@  static int srp_lookup_path(struct srp_rdma_ch *ch)
 	return ch->status;
 }
 
-static int srp_send_req(struct srp_rdma_ch *ch)
+static int srp_send_req(struct srp_rdma_ch *ch, bool multich)
 {
 	struct srp_target_port *target = ch->target;
 	struct {
@@ -688,6 +708,8 @@  static int srp_send_req(struct srp_rdma_ch *ch)
 	req->priv.req_it_iu_len = cpu_to_be32(target->max_iu_len);
 	req->priv.req_buf_fmt 	= cpu_to_be16(SRP_BUF_FORMAT_DIRECT |
 					      SRP_BUF_FORMAT_INDIRECT);
+	req->priv.req_flags	= (multich ? SRP_MULTICHAN_MULTI :
+				   SRP_MULTICHAN_SINGLE);
 	/*
 	 * In the published SRP specification (draft rev. 16a), the
 	 * port identifier format is 8 bytes of ID extension followed
@@ -769,14 +791,18 @@  static bool srp_change_conn_state(struct srp_target_port *target,
 
 static void srp_disconnect_target(struct srp_target_port *target)
 {
-	struct srp_rdma_ch *ch = &target->ch;
+	struct srp_rdma_ch *ch;
+	int i;
 
 	if (srp_change_conn_state(target, false)) {
 		/* XXX should send SRP_I_LOGOUT request */
 
-		if (ib_send_cm_dreq(ch->cm_id, NULL, 0)) {
-			shost_printk(KERN_DEBUG, target->scsi_host,
-				     PFX "Sending CM DREQ failed\n");
+		for (i = 0; i < target->ch_count; i++) {
+			ch = &target->ch[i];
+			if (ch->cm_id && ib_send_cm_dreq(ch->cm_id, NULL, 0)) {
+				shost_printk(KERN_DEBUG, target->scsi_host,
+					     PFX "Sending CM DREQ failed\n");
+			}
 		}
 	}
 }
@@ -789,7 +815,7 @@  static void srp_free_req_data(struct srp_target_port *target,
 	struct srp_request *req;
 	int i;
 
-	if (!ch->req_ring)
+	if (!ch->target || !ch->req_ring)
 		return;
 
 	for (i = 0; i < target->req_ring_size; ++i) {
@@ -875,7 +901,8 @@  static void srp_del_scsi_host_attr(struct Scsi_Host *shost)
 
 static void srp_remove_target(struct srp_target_port *target)
 {
-	struct srp_rdma_ch *ch = &target->ch;
+	struct srp_rdma_ch *ch;
+	int i;
 
 	WARN_ON_ONCE(target->state != SRP_TARGET_REMOVED);
 
@@ -885,10 +912,18 @@  static void srp_remove_target(struct srp_target_port *target)
 	scsi_remove_host(target->scsi_host);
 	srp_stop_rport_timers(target->rport);
 	srp_disconnect_target(target);
-	srp_free_ch_ib(target, ch);
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		srp_free_ch_ib(target, ch);
+	}
 	cancel_work_sync(&target->tl_err_work);
 	srp_rport_put(target->rport);
-	srp_free_req_data(target, ch);
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		srp_free_req_data(target, ch);
+	}
+	kfree(target->ch);
+	target->ch = NULL;
 
 	spin_lock(&target->srp_host->target_lock);
 	list_del(&target->list);
@@ -914,12 +949,12 @@  static void srp_rport_delete(struct srp_rport *rport)
 	srp_queue_remove_work(target);
 }
 
-static int srp_connect_ch(struct srp_rdma_ch *ch)
+static int srp_connect_ch(struct srp_rdma_ch *ch, bool multich)
 {
 	struct srp_target_port *target = ch->target;
 	int ret;
 
-	WARN_ON_ONCE(target->connected);
+	WARN_ON_ONCE(!multich && target->connected);
 
 	target->qp_in_error = false;
 
@@ -929,7 +964,7 @@  static int srp_connect_ch(struct srp_rdma_ch *ch)
 
 	while (1) {
 		init_completion(&ch->done);
-		ret = srp_send_req(ch);
+		ret = srp_send_req(ch, multich);
 		if (ret)
 			return ret;
 		ret = wait_for_completion_interruptible(&ch->done);
@@ -1090,10 +1125,10 @@  static void srp_finish_req(struct srp_rdma_ch *ch, struct srp_request *req,
 static void srp_terminate_io(struct srp_rport *rport)
 {
 	struct srp_target_port *target = rport->lld_data;
-	struct srp_rdma_ch *ch = &target->ch;
+	struct srp_rdma_ch *ch;
 	struct Scsi_Host *shost = target->scsi_host;
 	struct scsi_device *sdev;
-	int i;
+	int i, j;
 
 	/*
 	 * Invoking srp_terminate_io() while srp_queuecommand() is running
@@ -1102,10 +1137,15 @@  static void srp_terminate_io(struct srp_rport *rport)
 	shost_for_each_device(sdev, shost)
 		WARN_ON_ONCE(sdev->request_queue->request_fn_active);
 
-	for (i = 0; i < target->req_ring_size; ++i) {
-		struct srp_request *req = &ch->req_ring[i];
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
 
-		srp_finish_req(ch, req, NULL, DID_TRANSPORT_FAILFAST << 16);
+		for (j = 0; j < target->req_ring_size; ++j) {
+			struct srp_request *req = &ch->req_ring[j];
+
+			srp_finish_req(ch, req, NULL,
+				       DID_TRANSPORT_FAILFAST << 16);
+		}
 	}
 }
 
@@ -1121,8 +1161,9 @@  static void srp_terminate_io(struct srp_rport *rport)
 static int srp_rport_reconnect(struct srp_rport *rport)
 {
 	struct srp_target_port *target = rport->lld_data;
-	struct srp_rdma_ch *ch = &target->ch;
-	int i, ret;
+	struct srp_rdma_ch *ch;
+	int i, j, ret = 0;
+	bool multich = false;
 
 	srp_disconnect_target(target);
 
@@ -1134,27 +1175,47 @@  static int srp_rport_reconnect(struct srp_rport *rport)
 	 * case things are really fouled up. Doing so also ensures that all CM
 	 * callbacks will have finished before a new QP is allocated.
 	 */
-	ret = srp_new_cm_id(ch);
-
-	for (i = 0; i < target->req_ring_size; ++i) {
-		struct srp_request *req = &ch->req_ring[i];
-
-		srp_finish_req(ch, req, NULL, DID_RESET << 16);
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		if (!ch->target)
+			break;
+		ret += srp_new_cm_id(ch);
 	}
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		if (!ch->target)
+			break;
+		for (j = 0; j < target->req_ring_size; ++j) {
+			struct srp_request *req = &ch->req_ring[j];
 
-	/*
-	 * Whether or not creating a new CM ID succeeded, create a new
-	 * QP. This guarantees that all callback functions for the old QP have
-	 * finished before any send requests are posted on the new QP.
-	 */
-	ret += srp_create_ch_ib(ch);
-
-	INIT_LIST_HEAD(&ch->free_tx);
-	for (i = 0; i < target->queue_size; ++i)
-		list_add(&ch->tx_ring[i]->list, &ch->free_tx);
+			srp_finish_req(ch, req, NULL, DID_RESET << 16);
+		}
+	}
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		if (!ch->target)
+			break;
+		/*
+		 * Whether or not creating a new CM ID succeeded, create a new
+		 * QP. This guarantees that all completion callback function
+		 * invocations have finished before request resetting starts.
+		 */
+		ret += srp_create_ch_ib(ch);
 
-	if (ret == 0)
-		ret = srp_connect_ch(ch);
+		INIT_LIST_HEAD(&ch->free_tx);
+		for (j = 0; j < target->queue_size; ++j)
+			list_add(&ch->tx_ring[j]->list, &ch->free_tx);
+	}
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		if (ret || !ch->target) {
+			if (i > 1)
+				ret = 0;
+			break;
+		}
+		ret = srp_connect_ch(ch, multich);
+		multich = true;
+	}
 
 	if (ret == 0)
 		shost_printk(KERN_INFO, target->scsi_host,
@@ -1643,6 +1704,9 @@  static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 			ch->tsk_mgmt_status = rsp->data[3];
 		complete(&ch->tsk_mgmt_done);
 	} else {
+		if (blk_mq_unique_tag_to_hwq(rsp->tag) != ch - target->ch)
+			pr_err("Channel idx mismatch: tag %#llx <> ch %#lx\n",
+			       rsp->tag, ch - target->ch);
 		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
 		if (scmnd) {
 			req = (void *)scmnd->host_scribble;
@@ -1650,8 +1714,8 @@  static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 		}
 		if (!scmnd) {
 			shost_printk(KERN_ERR, target->scsi_host,
-				     "Null scmnd for RSP w/tag %016llx\n",
-				     (unsigned long long) rsp->tag);
+				     "Null scmnd for RSP w/tag %#016llx received on ch %ld / QP %#x\n",
+				     rsp->tag, ch - target->ch, ch->qp->qp_num);
 
 			spin_lock_irqsave(&ch->lock, flags);
 			ch->req_lim += be32_to_cpu(rsp->req_lim_delta);
@@ -1906,7 +1970,7 @@  static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 		goto err;
 
 	tag = blk_mq_unique_tag(scmnd->request);
-	ch = &target->ch;
+	ch = &target->ch[blk_mq_unique_tag_to_hwq(tag)];
 	idx = blk_mq_unique_tag_to_tag(tag);
 	WARN_ONCE(idx >= target->req_ring_size, "%s: tag %#x: idx %d >= %d\n",
 		  dev_name(&shost->shost_gendev), tag, idx,
@@ -2408,15 +2472,23 @@  static int srp_abort(struct scsi_cmnd *scmnd)
 	struct srp_target_port *target = host_to_target(scmnd->device->host);
 	struct srp_request *req = (struct srp_request *) scmnd->host_scribble;
 	u32 tag;
+	u16 ch_idx;
 	struct srp_rdma_ch *ch;
 	int ret;
 
 	shost_printk(KERN_ERR, target->scsi_host, "SRP abort called\n");
 
-	ch = &target->ch;
-	if (!req || !srp_claim_req(ch, req, NULL, scmnd))
+	if (!req)
 		return SUCCESS;
 	tag = blk_mq_unique_tag(scmnd->request);
+	ch_idx = blk_mq_unique_tag_to_hwq(tag);
+	if (WARN_ON_ONCE(ch_idx >= target->ch_count))
+		return SUCCESS;
+	ch = &target->ch[ch_idx];
+	if (!srp_claim_req(ch, req, NULL, scmnd))
+		return SUCCESS;
+	shost_printk(KERN_ERR, target->scsi_host,
+		     "Sending SRP abort for tag %#x\n", tag);
 	if (srp_send_tsk_mgmt(ch, tag, scmnd->device->lun,
 			      SRP_TSK_ABORT_TASK) == 0)
 		ret = SUCCESS;
@@ -2434,21 +2506,25 @@  static int srp_abort(struct scsi_cmnd *scmnd)
 static int srp_reset_device(struct scsi_cmnd *scmnd)
 {
 	struct srp_target_port *target = host_to_target(scmnd->device->host);
-	struct srp_rdma_ch *ch = &target->ch;
+	struct srp_rdma_ch *ch;
-	int i;
+	int i, j;
 
 	shost_printk(KERN_ERR, target->scsi_host, "SRP reset_device called\n");
 
+	ch = &target->ch[0];
 	if (srp_send_tsk_mgmt(ch, SRP_TAG_NO_REQ, scmnd->device->lun,
 			      SRP_TSK_LUN_RESET))
 		return FAILED;
 	if (ch->tsk_mgmt_status)
 		return FAILED;
 
-	for (i = 0; i < target->req_ring_size; ++i) {
-		struct srp_request *req = &ch->req_ring[i];
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		for (j = 0; j < target->req_ring_size; ++j) {
+			struct srp_request *req = &ch->req_ring[j];
 
-		srp_finish_req(ch, req, scmnd->device, DID_RESET << 16);
+			srp_finish_req(ch, req, scmnd->device, DID_RESET << 16);
+		}
 	}
 
 	return SUCCESS;
@@ -2534,7 +2610,7 @@  static ssize_t show_dgid(struct device *dev, struct device_attribute *attr,
 			 char *buf)
 {
 	struct srp_target_port *target = host_to_target(class_to_shost(dev));
-	struct srp_rdma_ch *ch = &target->ch;
+	struct srp_rdma_ch *ch = &target->ch[0];
 
 	return sprintf(buf, "%pI6\n", ch->path.dgid.raw);
 }
@@ -2551,8 +2627,14 @@  static ssize_t show_req_lim(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
 	struct srp_target_port *target = host_to_target(class_to_shost(dev));
+	struct srp_rdma_ch *ch;
+	int i, req_lim = INT_MAX;
 
-	return sprintf(buf, "%d\n", target->ch.req_lim);
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		req_lim = min(req_lim, ch->req_lim);
+	}
+	return sprintf(buf, "%d\n", req_lim);
 }
 
 static ssize_t show_zero_req_lim(struct device *dev,
@@ -2579,6 +2661,14 @@  static ssize_t show_local_ib_device(struct device *dev,
 	return sprintf(buf, "%s\n", target->srp_host->srp_dev->dev->name);
 }
 
+static ssize_t show_ch_count(struct device *dev, struct device_attribute *attr,
+			     char *buf)
+{
+	struct srp_target_port *target = host_to_target(class_to_shost(dev));
+
+	return sprintf(buf, "%d\n", target->ch_count);
+}
+
 static ssize_t show_comp_vector(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
@@ -2622,6 +2712,7 @@  static DEVICE_ATTR(req_lim,         S_IRUGO, show_req_lim,         NULL);
 static DEVICE_ATTR(zero_req_lim,    S_IRUGO, show_zero_req_lim,	   NULL);
 static DEVICE_ATTR(local_ib_port,   S_IRUGO, show_local_ib_port,   NULL);
 static DEVICE_ATTR(local_ib_device, S_IRUGO, show_local_ib_device, NULL);
+static DEVICE_ATTR(ch_count,        S_IRUGO, show_ch_count,        NULL);
 static DEVICE_ATTR(comp_vector,     S_IRUGO, show_comp_vector,     NULL);
 static DEVICE_ATTR(tl_retry_count,  S_IRUGO, show_tl_retry_count,  NULL);
 static DEVICE_ATTR(cmd_sg_entries,  S_IRUGO, show_cmd_sg_entries,  NULL);
@@ -2639,6 +2730,7 @@  static struct device_attribute *srp_host_attrs[] = {
 	&dev_attr_zero_req_lim,
 	&dev_attr_local_ib_port,
 	&dev_attr_local_ib_device,
+	&dev_attr_ch_count,
 	&dev_attr_comp_vector,
 	&dev_attr_tl_retry_count,
 	&dev_attr_cmd_sg_entries,
@@ -3048,7 +3140,8 @@  static ssize_t srp_create_target(struct device *dev,
 	struct srp_rdma_ch *ch;
 	struct srp_device *srp_dev = host->srp_dev;
 	struct ib_device *ibdev = srp_dev->dev;
-	int ret;
+	int ret, node_idx, node, cpu, i;
+	bool multich = false;
 
 	target_host = scsi_host_alloc(&srp_template,
 				      sizeof (struct srp_target_port));
@@ -3118,34 +3211,82 @@  static ssize_t srp_create_target(struct device *dev,
 	INIT_WORK(&target->tl_err_work, srp_tl_err_work);
 	INIT_WORK(&target->remove_work, srp_remove_work);
 	spin_lock_init(&target->lock);
-	ch = &target->ch;
-	ch->target = target;
-	ch->comp_vector = target->comp_vector;
-	spin_lock_init(&ch->lock);
-	INIT_LIST_HEAD(&ch->free_tx);
-	ret = srp_alloc_req_data(ch);
-	if (ret)
-		goto err_free_mem;
-
 	ret = ib_query_gid(ibdev, host->port, 0, &target->sgid);
 	if (ret)
-		goto err_free_mem;
+		goto err;
 
-	ret = srp_create_ch_ib(ch);
-	if (ret)
-		goto err_free_mem;
+	ret = -ENOMEM;
+	target->ch_count = max_t(unsigned, num_online_nodes(),
+				 min(ch_count ? :
+				     min(4 * num_online_nodes(),
+					 ibdev->num_comp_vectors),
+				     num_online_cpus()));
+	target->ch = kcalloc(target->ch_count, sizeof(*target->ch),
+			     GFP_KERNEL);
+	if (!target->ch)
+		goto err;
 
-	ret = srp_new_cm_id(ch);
-	if (ret)
-		goto err_free_ib;
+	node_idx = 0;
+	for_each_online_node(node) {
+		const int ch_start = (node_idx * target->ch_count /
+				      num_online_nodes());
+		const int ch_end = ((node_idx + 1) * target->ch_count /
+				    num_online_nodes());
+		const int cv_start = (node_idx * ibdev->num_comp_vectors /
+				      num_online_nodes() + target->comp_vector)
+				     % ibdev->num_comp_vectors;
+		const int cv_end = ((node_idx + 1) * ibdev->num_comp_vectors /
+				    num_online_nodes() + target->comp_vector)
+				   % ibdev->num_comp_vectors;
+		int cpu_idx = 0;
+
+		for_each_online_cpu(cpu) {
+			if (cpu_to_node(cpu) != node)
+				continue;
+			if (ch_start + cpu_idx >= ch_end)
+				continue;
+			ch = &target->ch[ch_start + cpu_idx];
+			ch->target = target;
+			ch->comp_vector = cv_start == cv_end ? cv_start :
+				cv_start + cpu_idx % (cv_end - cv_start);
+			spin_lock_init(&ch->lock);
+			INIT_LIST_HEAD(&ch->free_tx);
+			ret = srp_new_cm_id(ch);
+			if (ret)
+				goto err_disconnect;
 
-	ret = srp_connect_ch(ch);
-	if (ret) {
-		shost_printk(KERN_ERR, target->scsi_host,
-			     PFX "Connection failed\n");
-		goto err_free_ib;
+			ret = srp_create_ch_ib(ch);
+			if (ret)
+				goto err_disconnect;
+
+			ret = srp_alloc_req_data(ch);
+			if (ret)
+				goto err_disconnect;
+
+			ret = srp_connect_ch(ch, multich);
+			if (ret) {
+				shost_printk(KERN_ERR, target->scsi_host,
+					     PFX "Connection %d/%d failed\n",
+					     ch_start + cpu_idx,
+					     target->ch_count);
+				if (node_idx == 0 && cpu_idx == 0) {
+					goto err_disconnect;
+				} else {
+					srp_free_ch_ib(target, ch);
+					srp_free_req_data(target, ch);
+					target->ch_count = ch - target->ch;
+					break;
+				}
+			}
+
+			multich = true;
+			cpu_idx++;
+		}
+		node_idx++;
 	}
 
+	target->scsi_host->nr_hw_queues = target->ch_count;
+
 	ret = srp_add_target(host, target);
 	if (ret)
 		goto err_disconnect;
@@ -3172,11 +3313,13 @@  out:
 err_disconnect:
 	srp_disconnect_target(target);
 
-err_free_ib:
-	srp_free_ch_ib(target, ch);
+	for (i = 0; i < target->ch_count; i++) {
+		ch = &target->ch[i];
+		srp_free_ch_ib(target, ch);
+		srp_free_req_data(target, ch);
+	}
 
-err_free_mem:
-	srp_free_req_data(target, ch);
+	kfree(target->ch);
 
 err:
 	scsi_host_put(target_host);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index bb185d4..5b7dada 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -180,8 +180,9 @@  struct srp_target_port {
 	/* read and written in the hot path */
 	spinlock_t		lock;
 
-	struct srp_rdma_ch	ch;
 	/* read only in the hot path */
+	struct srp_rdma_ch	*ch;
+	u32			ch_count;
 	u32			lkey;
 	u32			rkey;
 	enum srp_target_state	state;