
[v2] RDMA/cma: Make CM response timeout and # CM retries configurable

Message ID 20190226075722.1692315-1-haakon.bugge@oracle.com (mailing list archive)
State Changes Requested
Headers show
Series [v2] RDMA/cma: Make CM response timeout and # CM retries configurable

Commit Message

Haakon Bugge Feb. 26, 2019, 7:57 a.m. UTC
During certain workloads, the default CM response timeout is too
short, leading to excessive retries. Hence, make it configurable
through sysctl. While at it, also make number of CM retries
configurable.

The defaults are not changed.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>

---
v1 -> v2:
   * Added unregister_net_sysctl_table() in cma_cleanup()
---
 drivers/infiniband/core/cma.c | 52 ++++++++++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 7 deletions(-)

Comments

Doug Ledford June 13, 2019, 2:25 p.m. UTC | #1
On Tue, 2019-02-26 at 08:57 +0100, Håkon Bugge wrote:
> During certain workloads, the default CM response timeout is too
> short, leading to excessive retries. Hence, make it configurable
> through sysctl. While at it, also make number of CM retries
> configurable.
> 
> The defaults are not changed.
> 
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
> v1 -> v2:
>    * Added unregister_net_sysctl_table() in cma_cleanup()
> ---
>  drivers/infiniband/core/cma.c | 52 ++++++++++++++++++++++++++++++---
> --
>  1 file changed, 45 insertions(+), 7 deletions(-)

This has been sitting on patchworks since forever.  Presumably because
Jason and I neither one felt like we really wanted it, but also
couldn't justify flat refusing it.  Well, I've made up my mind, so
unless Jason wants to argue the other side, I'm rejecting this patch. 
Here's why.  The whole concept of a timeout is to help recovery in a
situation that overloads one end of the connection.  There is a
relationship between the max queue backlog on the one host and the
timeout on the other host.  Generally, in order for a request to get
dropped and us to need to retransmit, the queue must already have a
full backlog.  So, how long does it take a heavily loaded system to
process a full backlog?  That, plus a fuzz for a margin of error,
should be our timeout.  We shouldn't be asking users to configure it.

However, if users change the default backlog queue on their systems,
*then* it would make sense to have the users also change the timeout
here, but I think guidance would be helpful.

So, to revive this patch, what I'd like to see is some attempt to
actually quantify a reasonable timeout for the default backlog depth,
then the patch should actually change the default to that reasonable
timeout, and then put in the ability to adjust the timeout with some
sort of doc guidance on how to calculate a reasonable timeout based on
configured backlog depth.
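The heuristic described here, backlog drain time plus a margin, rounded up to the CM's 4.096 µs × 2^t timeout encoding, can be put into rough numbers. The backlog depth, per-request service time, and margin below are illustrative assumptions, not measured values:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rough sketch of the heuristic above: the CM response timeout should
 * cover the time a loaded host needs to drain a full backlog, plus a
 * margin for error.  CM encodes the timeout as 4.096 us * 2^t, so we
 * return the smallest exponent t whose interval covers the estimate.
 * Backlog depth and per-request service time are assumed inputs.
 */
static uint8_t cm_timeout_exponent(uint32_t backlog_depth,
				   uint32_t usec_per_request,
				   uint32_t fuzz_percent)
{
	uint64_t need_ns = (uint64_t)backlog_depth * usec_per_request * 1000;

	need_ns += need_ns * fuzz_percent / 100;	/* margin of error */

	/* smallest t with 4096 ns << t >= need_ns; wire field is 5 bits */
	uint8_t t = 0;
	while (t < 31 && (UINT64_C(4096) << t) < need_ns)
		t++;
	return t;
}
```

With a 512-deep backlog (the IB_MAD_QP_RECV_SIZE default), an assumed 1 ms per request, and a 25% margin, this yields t = 18 (≈ 1.1 s), comfortably under the current default of 20 (≈ 4.3 s).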
Bart Van Assche June 13, 2019, 3:28 p.m. UTC | #2
On 6/13/19 7:25 AM, Doug Ledford wrote:
> So, to revive this patch, what I'd like to see is some attempt to
> actually quantify a reasonable timeout for the default backlog depth,
> then the patch should actually change the default to that reasonable
> timeout, and then put in the ability to adjust the timeout with some
> sort of doc guidance on how to calculate a reasonable timeout based on
> configured backlog depth.

How about following the approach of the SRP initiator driver? It derives 
the CM timeout from the subnet manager timeout. The assumption behind 
this is that in large networks the subnet manager timeout has to be set 
higher than its default to make communication work. See also 
srp_get_subnet_timeout().

Bart.
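For reference, a minimal sketch of that derivation. The fallback default of 18 and the "+ 2" headroom are modelled on a reading of srp_get_subnet_timeout() and its caller, so treat the exact constants as assumptions:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of deriving the CM response timeout from the subnet manager
 * timeout, loosely modelled on the SRP initiator.  Both values use the
 * 4.096 us * 2^t encoding, so "+ 2" gives 4x the subnet timeout as
 * headroom.  The fallback of 18 and the "+ 2" are assumptions based on
 * srp_get_subnet_timeout() and its caller, not guaranteed constants.
 */
static uint8_t cm_timeout_from_subnet(uint8_t subnet_timeout)
{
	if (subnet_timeout == 0)	/* SM value not available */
		subnet_timeout = 18;

	uint8_t t = subnet_timeout + 2;
	return t > 31 ? 31 : t;		/* CM field is 5 bits wide */
}
```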
Doug Ledford June 13, 2019, 4:32 p.m. UTC | #3
On Thu, 2019-06-13 at 08:28 -0700, Bart Van Assche wrote:
> On 6/13/19 7:25 AM, Doug Ledford wrote:
> > So, to revive this patch, what I'd like to see is some attempt to
> > actually quantify a reasonable timeout for the default backlog
> > depth,
> > then the patch should actually change the default to that
> > reasonable
> > timeout, and then put in the ability to adjust the timeout with
> > some
> > sort of doc guidance on how to calculate a reasonable timeout based
> > on
> > configured backlog depth.
> 
> How about following the approach of the SRP initiator driver? It
> derives 
> the CM timeout from the subnet manager timeout. The assumption
> behind 
> this is that in large networks the subnet manager timeout has to be
> set 
> higher than its default to make communication work. See also 
> srp_get_subnet_timeout().

Theoretically, the subnet manager needs a longer timeout in a bigger
network because it's handling more data as a single point of lookup for
the entire subnet.  Individual machines, on the other hand, have the
same backlog size (by default) regardless of the size of the network,
and there is no guarantee that if the admin increased the subnet
manager timeout, that they also increased the backlog queue depth size.
So, while I like things that auto-tune like you are suggesting, the
problem is that the one item does not directly correlate with the
other.
Haakon Bugge June 13, 2019, 4:58 p.m. UTC | #4
> On 13 Jun 2019, at 16:25, Doug Ledford <dledford@redhat.com> wrote:
> 
> On Tue, 2019-02-26 at 08:57 +0100, Håkon Bugge wrote:
>> During certain workloads, the default CM response timeout is too
>> short, leading to excessive retries. Hence, make it configurable
>> through sysctl. While at it, also make number of CM retries
>> configurable.
>> 
>> The defaults are not changed.
>> 
>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>> ---
>> v1 -> v2:
>>   * Added unregister_net_sysctl_table() in cma_cleanup()
>> ---
>> drivers/infiniband/core/cma.c | 52 ++++++++++++++++++++++++++++++---
>> --
>> 1 file changed, 45 insertions(+), 7 deletions(-)
> 
> This has been sitting on patchworks since forever.  Presumably because
> Jason and I neither one felt like we really wanted it, but also
> couldn't justify flat refusing it.

I thought the agreement was to use NL and iproute2. But I haven't had the capacity.

>  Well, I've made up my mind, so
> unless Jason wants to argue the other side, I'm rejecting this patch. 
> Here's why.  The whole concept of a timeout is to help recovery in a
> situation that overloads one end of the connection.  There is a
> relationship between the max queue backlog on the one host and the
> timeout on the other host.  

If you refer to the backlog parameter in rdma_listen(), I cannot see it being used at all for IB.

For CX-3, which is paravirtualized wrt. MAD packets, it is the proxy UD receive queue length for the PF driver that can be construed as a backlog. Remember that any MAD packet being sent from a VF or the PF itself, is sent to a proxy UD QP in the PF. Those packets are then multiplexed out on the real QP0/1. Incoming MAD packets are demultiplexed and sent once more to the proxy QP in the VF.

> Generally, in order for a request to get
> dropped and us to need to retransmit, the queue must already have a
> full backlog.  So, how long does it take a heavily loaded system to
> process a full backlog?  That, plus a fuzz for a margin of error,
> should be our timeout.  We shouldn't be asking users to configure it.

Customers configure the number of VMs, and different workloads may lead to wildly different numbers of CM connections. The proxying of MAD packets through the PF driver has a finite packet rate. With 64 VMs, 10,000 QPs on each, all going down due to a switch failing or similar, you have 640,000 DREQs to be sent, and with the finite packet rate of MAD packets through the PF, this takes longer than the current CM timeout. And then you re-transmit and increase the burden of the PF proxying.

So, we can change the default to cope with this. But MAD packets are unreliable; we may have transient loss, and in that case we want a short timeout.
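To put rough numbers on this scenario (the proxy MAD rate below is an assumed figure, purely for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* CM timeout encoding: 4.096 us * 2^t, expressed in microseconds */
static uint64_t cm_timeout_usec(uint8_t t)
{
	return (UINT64_C(4096) << t) / 1000;
}

/* time to push n MADs through a proxy with a finite packet rate */
static uint64_t drain_time_usec(uint64_t n_mads, uint64_t mads_per_sec)
{
	return n_mads * UINT64_C(1000000) / mads_per_sec;
}
```

At an assumed 20,000 proxied MADs/s, draining 64 × 10,000 = 640,000 DREQs takes 32 s, several times the ≈ 4.3 s window the default timeout of 20 encodes, so pending DREQs time out and get retransmitted, adding to the very queue that is already too slow.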

> However, if users change the default backlog queue on their systems,
> *then* it would make sense to have the users also change the timeout
> here, but I think guidance would be helpful.
> 
> So, to revive this patch, what I'd like to see is some attempt to
> actually quantify a reasonable timeout for the default backlog depth,
> then the patch should actually change the default to that reasonable
> timeout, and then put in the ability to adjust the timeout with some
> sort of doc guidance on how to calculate a reasonable timeout based on
> configured backlog depth.

I can agree to this :-)


Thxs, Håkon

> 
> -- 
> Doug Ledford <dledford@redhat.com>
>    GPG KeyID: B826A3330E572FDD
>    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57
> 2FDD
Jason Gunthorpe June 13, 2019, 5:23 p.m. UTC | #5
On Thu, Jun 13, 2019 at 06:58:30PM +0200, Håkon Bugge wrote:

> If you refer to the backlog parameter in rdma_listen(), I cannot see
> it being used at all for IB.
> 
> For CX-3, which is paravirtualized wrt. MAD packets, it is the proxy
> UD receive queue length for the PF driver that can be construed as a
> backlog. 

No, in IB you can drop UD packets if your RQ is full - so the proxy RQ
is really part of the overall RQ on QP1.

The backlog starts once packets are taken off the RQ and begin the
connection accept processing.

> Customer configures #VMs and different workload may lead to way
> different number of CM connections. The proxying of MAD packet
> through the PF driver has a finite packet rate. With 64 VMs, 10.000
> QPs on each, all going down due to a switch failing or similar, you
> have 640.000 DREQs to be sent, and with the finite packet rate of MA
> packets through the PF, this takes more than the current CM
> timeout. And then you re-transmit and increase the burden of the PF
> proxying.

I feel like the performance of all this proxying is too low to support
such a large work load :(

Can it be improved?

Jason
Haakon Bugge June 13, 2019, 5:39 p.m. UTC | #6
> On 13 Jun 2019, at 19:23, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Jun 13, 2019 at 06:58:30PM +0200, Håkon Bugge wrote:
> 
>> If you refer to the backlog parameter in rdma_listen(), I cannot see
>> it being used at all for IB.
>> 
>> For CX-3, which is paravirtualized wrt. MAD packets, it is the proxy
>> UD receive queue length for the PF driver that can be construed as a
>> backlog. 
> 
> No, in IB you can drop UD packets if your RQ is full - so the proxy RQ
> is really part of the overall RQ on QP1.
> 
> The backlog starts once packets are taken off the RQ and begin the
> connection accept processing.

I think we are saying the same thing. If incoming REQ processing is severely delayed, the backlog is the number of entries in the QP1 receive queue in the PF. I can call rdma_listen() with a backlog of a zillion, but it will not help.

>> Customer configures #VMs and different workload may lead to way
>> different number of CM connections. The proxying of MAD packet
>> through the PF driver has a finite packet rate. With 64 VMs, 10.000
>> QPs on each, all going down due to a switch failing or similar, you
>> have 640.000 DREQs to be sent, and with the finite packet rate of MA
>> packets through the PF, this takes more than the current CM
>> timeout. And then you re-transmit and increase the burden of the PF
>> proxying.
> 
> I feel like the performance of all this proxying is too low to support
> such a large work load :(

That is what I am aiming at, for example to spread the completion_vector(s) for said QPs ;-)

-h

> 
> Can it be improved?
> 
> Jason
Doug Ledford June 13, 2019, 8:25 p.m. UTC | #7
On Thu, 2019-06-13 at 18:58 +0200, Håkon Bugge wrote:
> > On 13 Jun 2019, at 16:25, Doug Ledford <dledford@redhat.com> wrote:
> > 
> > On Tue, 2019-02-26 at 08:57 +0100, Håkon Bugge wrote:
> > > During certain workloads, the default CM response timeout is too
> > > short, leading to excessive retries. Hence, make it configurable
> > > through sysctl. While at it, also make number of CM retries
> > > configurable.
> > > 
> > > The defaults are not changed.
> > > 
> > > Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> > > ---
> > > v1 -> v2:
> > >   * Added unregister_net_sysctl_table() in cma_cleanup()
> > > ---
> > > drivers/infiniband/core/cma.c | 52
> > > ++++++++++++++++++++++++++++++---
> > > --
> > > 1 file changed, 45 insertions(+), 7 deletions(-)
> > 
> > This has been sitting on patchworks since forever.  Presumably
> > because
> > Jason and I neither one felt like we really wanted it, but also
> > couldn't justify flat refusing it.
> 
> I thought the agreement was to use NL and iproute2. But I haven't had
> the capacity.

To be fair, the email thread was gone from my linux-rdma folder.  So, I
just had to review the entry in patchworks, and there was no captured
discussion there.  So, if the agreement was made, it must have been
face to face some time and if I was involed, I had certainly forgotten
by now.  But I still needed to clean up patchworks, hence my email ;-).

> >  Well, I've made up my mind, so
> > unless Jason wants to argue the other side, I'm rejecting this
> > patch. 
> > Here's why.  The whole concept of a timeout is to help recovery in
> > a
> > situation that overloads one end of the connection.  There is a
> > relationship between the max queue backlog on the one host and the
> > timeout on the other host.  
> 
> If you refer to the backlog parameter in rdma_listen(), I cannot see
> it being used at all for IB.

No, not exactly.  I was more referring to heavy load causing an
overflow in the mad packet receive processing.  We have
IB_MAD_QP_RECV_SIZE set to 512 by default, but it can be changed at
module load time of the ib_core module and that represents the maximum
number of backlogged mad packets we can have waiting to be processed
before we just drop them on the floor.  There can be other places to
drop them too, but this is the one I was referring to.

> For CX-3, which is paravirtualized wrt. MAD packets, it is the proxy
> UD receive queue length for the PF driver that can be construed as a
> backlog. Remember that any MAD packet being sent from a VF or the PF
> itself, is sent to a proxy UD QP in the PF. Those packets are then
> multiplexed out on the real QP0/1. Incoming MAD packets are
> demultiplexed and sent once more to the proxy QP in the VF.
> 
> > Generally, in order for a request to get
> > dropped and us to need to retransmit, the queue must already have a
> > full backlog.  So, how long does it take a heavily loaded system to
> > process a full backlog?  That, plus a fuzz for a margin of error,
> > should be our timeout.  We shouldn't be asking users to configure
> > it.
> 
> Customer configures #VMs and different workload may lead to way
> different number of CM connections. The proxying of MAD packet
> through the PF driver has a finite packet rate. With 64 VMs, 10.000
> QPs on each, all going down due to a switch failing or similar, you
> have 640.000 DREQs to be sent, and with the finite packet rate of MA
> packets through the PF, this takes more than the current CM timeout.
> And then you re-transmit and increase the burden of the PF proxying.
> 
> So, we can change the default to cope with this. But, a MAD packet is
> unreliable, we may have transient loss. In this case, we want a short
> timeout.
> 
> > However, if users change the default backlog queue on their
> > systems,
> > *then* it would make sense to have the users also change the
> > timeout
> > here, but I think guidance would be helpful.
> > 
> > So, to revive this patch, what I'd like to see is some attempt to
> > actually quantify a reasonable timeout for the default backlog
> > depth,
> > then the patch should actually change the default to that
> > reasonable
> > timeout, and then put in the ability to adjust the timeout with
> > some
> > sort of doc guidance on how to calculate a reasonable timeout based
> > on
> > configured backlog depth.
> 
> I can agree to this :-)
> 
> 
> Thxs, Håkon
> 
> > -- 
> > Doug Ledford <dledford@redhat.com>
> >    GPG KeyID: B826A3330E572FDD
> >    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57
> > 2FDD
Jason Gunthorpe June 13, 2019, 11:46 p.m. UTC | #8
On Thu, Jun 13, 2019 at 07:39:24PM +0200, Håkon Bugge wrote:
> 
> 
> > On 13 Jun 2019, at 19:23, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Thu, Jun 13, 2019 at 06:58:30PM +0200, Håkon Bugge wrote:
> > 
> >> If you refer to the backlog parameter in rdma_listen(), I cannot see
> >> it being used at all for IB.
> >> 
> >> For CX-3, which is paravirtualized wrt. MAD packets, it is the proxy
> >> UD receive queue length for the PF driver that can be construed as a
> >> backlog. 
> > 
> > No, in IB you can drop UD packets if your RQ is full - so the proxy RQ
> > is really part of the overall RQ on QP1.
> > 
> > The backlog starts once packets are taken off the RQ and begin the
> > connection accept processing.
> 
> Do think we say the same thing. If, incoming REQ processing is
> severly delayed, the backlog is #entries in the QP1 receive queue in
> the PF. I can call rdma_listen() with a backlog of a zillion, but it
> will not help.

backlog and queue depth are different things; we shouldn't conflate the
two..

Jason
Haakon Bugge June 14, 2019, 5:44 a.m. UTC | #9
> On 13 Jun 2019, at 22:25, Doug Ledford <dledford@redhat.com> wrote:
> 
> On Thu, 2019-06-13 at 18:58 +0200, Håkon Bugge wrote:
>>> On 13 Jun 2019, at 16:25, Doug Ledford <dledford@redhat.com> wrote:
>>> 
>>> On Tue, 2019-02-26 at 08:57 +0100, Håkon Bugge wrote:
>>>> During certain workloads, the default CM response timeout is too
>>>> short, leading to excessive retries. Hence, make it configurable
>>>> through sysctl. While at it, also make number of CM retries
>>>> configurable.
>>>> 
>>>> The defaults are not changed.
>>>> 
>>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>>>> ---
>>>> v1 -> v2:
>>>>  * Added unregister_net_sysctl_table() in cma_cleanup()
>>>> ---
>>>> drivers/infiniband/core/cma.c | 52
>>>> ++++++++++++++++++++++++++++++---
>>>> --
>>>> 1 file changed, 45 insertions(+), 7 deletions(-)
>>> 
>>> This has been sitting on patchworks since forever.  Presumably
>>> because
>>> Jason and I neither one felt like we really wanted it, but also
>>> couldn't justify flat refusing it.
>> 
>> I thought the agreement was to use NL and iproute2. But I haven't had
>> the capacity.
> 
> To be fair, the email thread was gone from my linux-rdma folder.  So, I
> just had to review the entry in patchworks, and there was no captured
> discussion there.  So, if the agreement was made, it must have been
> face to face some time and if I was involed, I had certainly forgotten
> by now.  But I still needed to clean up patchworks, hence my email ;-).

This is the "agreement" I was referring to:

> On 4 Mar 2019, at 07:27, Parav Pandit <parav@mellanox.com> wrote:
> 
>> []
> 
> I think we should use rdma_nl_register(RDMA_NL_RDMA_CM, cma_cb_table) which was removed as part of ID stats removal.
> Because of below reasons.
> 1. rdma netlink command auto loads the module
> 2. we don't need to write any extra code to do register_net_sysctl () in each netns.
> Caller's skb's netns will read/write value of response_timeout in 'struct cma_pernet'.
> 3. last time sysctl added in ipv6 was in 2017 in net/ipv6/addrconf.c, however ipv4 was done in 2018.
> 
> Currently rdma_cm/rdma_ucma has configfs, sysctl.
> We are adding netlink sys params to ib_core.
> 
> We already have 3 clients and infra built using rdma_nl_register() netlink , so hooking up to netlink will provide unified way to set rdma params.
> Let's just use netlink for any new params unless it is not doable.



> 
>>> Well, I've made up my mind, so
>>> unless Jason wants to argue the other side, I'm rejecting this
>>> patch. 
>>> Here's why.  The whole concept of a timeout is to help recovery in
>>> a
>>> situation that overloads one end of the connection.  There is a
>>> relationship between the max queue backlog on the one host and the
>>> timeout on the other host.  
>> 
>> If you refer to the backlog parameter in rdma_listen(), I cannot see
>> it being used at all for IB.
> 
> No, not exactly.  I was more referring to heavy load causing an
> overflow in the mad packet receive processing.  We have
> IB_MAD_QP_RECV_SIZE set to 512 by default, but it can be changed at
> module load time of the ib_core module and that represents the maximum
> number of backlogged mad packets we can have waiting to be processed
> before we just drop them on the floor.  There can be other places to
> drop them too, but this is the one I was referring to.

That is another scenario than the one I am trying to solve. What I see is that the MAD packets are delayed, not lost. The delay is longer than the CMA timeout. Hence, the MAD packets are retried, adding more burden to the PF proxying and inducing even longer delays. And excessive CM retries are observed. See 2612d723aadc ("IB/mlx4: Increase the timeout for CM cache"), where I have some quantification thereof.
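The amplification can be modelled crudely: when replies are delayed rather than lost, each expired timeout window fires one redundant retransmission until the reply lands or the retry budget is exhausted. This is a simplification for illustration, not the ib_cm state machine:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Crude model of retry amplification when MADs are delayed, not lost:
 * every full timeout window that elapses before the reply arrives
 * triggers one redundant retransmission, bounded by the retry budget.
 * With the defaults, a timeout of 20 gives a 1 << (20 - 8) ms window
 * (the approximation cma.c itself uses) and a budget of 15 retries.
 */
static unsigned int wasted_retransmits(uint64_t reply_delay_ms,
				       uint64_t timeout_ms,
				       unsigned int max_retries)
{
	uint64_t expired = reply_delay_ms / timeout_ms;

	return expired > max_retries ? max_retries : (unsigned int)expired;
}
```

A 10 s proxy delay against the default ≈ 4.1 s window wastes 2 retransmissions per connection; a 100 s delay exhausts all 15, multiplying the PF proxy load accordingly.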

Back to your scenario above, yes indeed, the queue sizes are module params. If the MADs are tossed, we will see rq_num_udsdprd incrementing on a CX-3.

But I do not understand how the dots are connected. Assume one client does rdma_listen(…, backlog = 1000); where are those 1000 REQs stored, assuming an "infinitely slow processor"?


Thxs, Håkon


> 
>> For CX-3, which is paravirtualized wrt. MAD packets, it is the proxy
>> UD receive queue length for the PF driver that can be construed as a
>> backlog. Remember that any MAD packet being sent from a VF or the PF
>> itself, is sent to a proxy UD QP in the PF. Those packets are then
>> multiplexed out on the real QP0/1. Incoming MAD packets are
>> demultiplexed and sent once more to the proxy QP in the VF.
>> 
>>> Generally, in order for a request to get
>>> dropped and us to need to retransmit, the queue must already have a
>>> full backlog.  So, how long does it take a heavily loaded system to
>>> process a full backlog?  That, plus a fuzz for a margin of error,
>>> should be our timeout.  We shouldn't be asking users to configure
>>> it.
>> 
>> Customer configures #VMs and different workload may lead to way
>> different number of CM connections. The proxying of MAD packet
>> through the PF driver has a finite packet rate. With 64 VMs, 10.000
>> QPs on each, all going down due to a switch failing or similar, you
>> have 640.000 DREQs to be sent, and with the finite packet rate of MA
>> packets through the PF, this takes more than the current CM timeout.
>> And then you re-transmit and increase the burden of the PF proxying.
>> 
>> So, we can change the default to cope with this. But, a MAD packet is
>> unreliable, we may have transient loss. In this case, we want a short
>> timeout.
>> 
>>> However, if users change the default backlog queue on their
>>> systems,
>>> *then* it would make sense to have the users also change the
>>> timeout
>>> here, but I think guidance would be helpful.
>>> 
>>> So, to revive this patch, what I'd like to see is some attempt to
>>> actually quantify a reasonable timeout for the default backlog
>>> depth,
>>> then the patch should actually change the default to that
>>> reasonable
>>> timeout, and then put in the ability to adjust the timeout with
>>> some
>>> sort of doc guidance on how to calculate a reasonable timeout based
>>> on
>>> configured backlog depth.
>> 
>> I can agree to this :-)
>> 
>> 
>> Thxs, Håkon
>> 
>>> -- 
>>> Doug Ledford <dledford@redhat.com>
>>>   GPG KeyID: B826A3330E572FDD
>>>   Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57
>>> 2FDD
> 
> -- 
> Doug Ledford <dledford@redhat.com>
>    GPG KeyID: B826A3330E572FDD
>    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57
> 2FDD
Dag Moxnes Oct. 29, 2019, 12:59 p.m. UTC | #10
Hi all,

Sorry for re-opening an old discussion, but I cannot find any final 
decision on how to handle this.

See inline for more comments.

On 6/14/19 7:44 AM, Håkon Bugge wrote:
>
>> On 13 Jun 2019, at 22:25, Doug Ledford <dledford@redhat.com> wrote:
>>
>> On Thu, 2019-06-13 at 18:58 +0200, Håkon Bugge wrote:
>>>> On 13 Jun 2019, at 16:25, Doug Ledford <dledford@redhat.com> wrote:
>>>>
>>>> On Tue, 2019-02-26 at 08:57 +0100, Håkon Bugge wrote:
>>>>> During certain workloads, the default CM response timeout is too
>>>>> short, leading to excessive retries. Hence, make it configurable
>>>>> through sysctl. While at it, also make number of CM retries
>>>>> configurable.
>>>>>
>>>>> The defaults are not changed.
>>>>>
>>>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>>>>> ---
>>>>> v1 -> v2:
>>>>>   * Added unregister_net_sysctl_table() in cma_cleanup()
>>>>> ---
>>>>> drivers/infiniband/core/cma.c | 52
>>>>> ++++++++++++++++++++++++++++++---
>>>>> --
>>>>> 1 file changed, 45 insertions(+), 7 deletions(-)
>>>> This has been sitting on patchworks since forever.  Presumably
>>>> because
>>>> Jason and I neither one felt like we really wanted it, but also
>>>> couldn't justify flat refusing it.
>>> I thought the agreement was to use NL and iproute2. But I haven't had
>>> the capacity.
>> To be fair, the email thread was gone from my linux-rdma folder.  So, I
>> just had to review the entry in patchworks, and there was no captured
>> discussion there.  So, if the agreement was made, it must have been
>> face to face some time and if I was involed, I had certainly forgotten
>> by now.  But I still needed to clean up patchworks, hence my email ;-).
> This is the "agreement" I was referring too:
>
>> On 4 Mar 2019, at 07:27, Parav Pandit <parav@mellanox.com> wrote:
>>
>>> []
>> I think we should use rdma_nl_register(RDMA_NL_RDMA_CM, cma_cb_table) which was removed as part of ID stats removal.
>> Because of below reasons.
>> 1. rdma netlink command auto loads the module
>> 2. we don't need to write any extra code to do register_net_sysctl () in each netns.
>> Caller's skb's netns will read/write value of response_timeout in 'struct cma_pernet'.
>> 3. last time sysctl added in ipv6 was in 2017 in net/ipv6/addrconf.c, however ipv4 was done in 2018.
>>
>> Currently rdma_cm/rdma_ucma has configfs, sysctl.
>> We are adding netlink sys params to ib_core.
>>
>> We already have 3 clients and infra built using rdma_nl_register() netlink , so hooking up to netlink will provide unified way to set rdma params.
>> Let's just use netlink for any new params unless it is not doable.
>
>
>>>> Well, I've made up my mind, so
>>>> unless Jason wants to argue the other side, I'm rejecting this
>>>> patch.
>>>> Here's why.  The whole concept of a timeout is to help recovery in
>>>> a
>>>> situation that overloads one end of the connection.  There is a
>>>> relationship between the max queue backlog on the one host and the
>>>> timeout on the other host.
>>> If you refer to the backlog parameter in rdma_listen(), I cannot see
>>> it being used at all for IB.
>> No, not exactly.  I was more referring to heavy load causing an
>> overflow in the mad packet receive processing.  We have
>> IB_MAD_QP_RECV_SIZE set to 512 by default, but it can be changed at
>> module load time of the ib_core module and that represents the maximum
>> number of backlogged mad packets we can have waiting to be processed
>> before we just drop them on the floor.  There can be other places to
>> drop them too, but this is the one I was referring to.
How can we determine the CM response timeout based on the MAD QP
receive size? As far as I can see, we would need to know the processing
time for the requests. An incoming connection request will be sent to
and handled by the listener before the listener calls rdma_accept(), so
the processing time would need to include this delay.

Maybe the best approach would be to let IB RDMA users modify the CM
response timeout, either by adding it to the rdma_conn_param struct or
by adding a setter function similar to rdma_set_ack_timeout(). What do
you think?

Regards,
-Dag
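A per-ID setter in the spirit of rdma_set_ack_timeout() could look roughly like the sketch below. Note that rdma_set_cm_response_timeout() is hypothetical, it does not exist in the tree, and the struct is a userspace stand-in for the relevant rdma_cm_id state; only the 5-bit range check reflects the wire format:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical setter in the spirit of rdma_set_ack_timeout(), as
 * suggested above.  rdma_set_cm_response_timeout() does not exist in
 * the kernel; this is a userspace sketch of the validation only, and
 * struct fake_cm_id stands in for the relevant rdma_cm_id state.
 */
struct fake_cm_id {
	uint8_t cm_response_timeout;
};

static int rdma_set_cm_response_timeout(struct fake_cm_id *id, uint8_t t)
{
	if (t > 31)			/* CM field is 5 bits on the wire */
		return -1;		/* would be -EINVAL in the kernel */

	id->cm_response_timeout = t;
	return 0;
}
```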
> That is another scenario than what I try to solve. What I see, is that the MAD packets are delayed, not lost. The delay is longer than the CMA timeout. Hence, the MAD packets are retried, adding more burden to the PF proxying and inducing even longer delays. And excessive CM retries are observed. See 2612d723aadc ("IB/mlx4: Increase the timeout for CM cache") where I have some quantification thereof.
>
> Back to your scenario above, yes indeed, the queue sizes are module params. If the MADs are tossed, we will see rq_num_udsdprd incrementing on a CX-3.
>
> But I do not understand how the dots are connected. Assume one client does rdma_listen(, backlog = 1000); Where are those 1000 REQs stored, assuming an "infinite slow processor"?
>
>
> Thxs, Håkon
>
>
>>> For CX-3, which is paravirtualized wrt. MAD packets, it is the proxy
>>> UD receive queue length for the PF driver that can be construed as a
>>> backlog. Remember that any MAD packet being sent from a VF or the PF
>>> itself, is sent to a proxy UD QP in the PF. Those packets are then
>>> multiplexed out on the real QP0/1. Incoming MAD packets are
>>> demultiplexed and sent once more to the proxy QP in the VF.
>>>
>>>> Generally, in order for a request to get
>>>> dropped and us to need to retransmit, the queue must already have a
>>>> full backlog.  So, how long does it take a heavily loaded system to
>>>> process a full backlog?  That, plus a fuzz for a margin of error,
>>>> should be our timeout.  We shouldn't be asking users to configure
>>>> it.
>>> Customer configures #VMs and different workload may lead to way
>>> different number of CM connections. The proxying of MAD packet
>>> through the PF driver has a finite packet rate. With 64 VMs, 10.000
>>> QPs on each, all going down due to a switch failing or similar, you
>>> have 640.000 DREQs to be sent, and with the finite packet rate of MA
>>> packets through the PF, this takes more than the current CM timeout.
>>> And then you re-transmit and increase the burden of the PF proxying.
>>>
>>> So, we can change the default to cope with this. But, a MAD packet is
>>> unreliable, we may have transient loss. In this case, we want a short
>>> timeout.
>>>
>>>> However, if users change the default backlog queue on their
>>>> systems,
>>>> *then* it would make sense to have the users also change the
>>>> timeout
>>>> here, but I think guidance would be helpful.
>>>>
>>>> So, to revive this patch, what I'd like to see is some attempt to
>>>> actually quantify a reasonable timeout for the default backlog
>>>> depth,
>>>> then the patch should actually change the default to that
>>>> reasonable
>>>> timeout, and then put in the ability to adjust the timeout with
>>>> some
>>>> sort of doc guidance on how to calculate a reasonable timeout based
>>>> on
>>>> configured backlog depth.
>>> I can agree to this :-)
>>>
>>>
>>> Thxs, Håkon
>>>
>>>> -- 
>>>> Doug Ledford <dledford@redhat.com>
>>>>    GPG KeyID: B826A3330E572FDD
>>>>    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57
>>>> 2FDD
>> -- 
>> Doug Ledford <dledford@redhat.com>
>>     GPG KeyID: B826A3330E572FDD
>>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57
>> 2FDD

Patch

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 68c997be2429..50abce078ff1 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -43,6 +43,7 @@ 
 #include <linux/inetdevice.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/sysctl.h>
 #include <net/route.h>
 
 #include <net/net_namespace.h>
@@ -68,13 +69,46 @@  MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 
-#define CMA_CM_RESPONSE_TIMEOUT 20
 #define CMA_QUERY_CLASSPORT_INFO_TIMEOUT 3000
-#define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
 #define CMA_IBOE_PACKET_LIFETIME 18
 #define CMA_PREFERRED_ROCE_GID_TYPE IB_GID_TYPE_ROCE_UDP_ENCAP
 
+#define CMA_DFLT_CM_RESPONSE_TIMEOUT 20
+static int cma_cm_response_timeout = CMA_DFLT_CM_RESPONSE_TIMEOUT;
+static int cma_cm_response_timeout_min = 8;
+static int cma_cm_response_timeout_max = 31;
+#undef CMA_DFLT_CM_RESPONSE_TIMEOUT
+
+#define CMA_DFLT_MAX_CM_RETRIES 15
+static int cma_max_cm_retries = CMA_DFLT_MAX_CM_RETRIES;
+static int cma_max_cm_retries_min = 1;
+static int cma_max_cm_retries_max = 100;
+#undef CMA_DFLT_MAX_CM_RETRIES
+
+static struct ctl_table_header *cma_ctl_table_hdr;
+static struct ctl_table cma_ctl_table[] = {
+	{
+		.procname	= "cma_cm_response_timeout",
+		.data		= &cma_cm_response_timeout,
+		.maxlen		= sizeof(cma_cm_response_timeout),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &cma_cm_response_timeout_min,
+		.extra2		= &cma_cm_response_timeout_max,
+	},
+	{
+		.procname	= "cma_max_cm_retries",
+		.data		= &cma_max_cm_retries,
+		.maxlen		= sizeof(cma_max_cm_retries),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &cma_max_cm_retries_min,
+		.extra2		= &cma_max_cm_retries_max,
+	},
+	{ }
+};
+
 static const char * const cma_events[] = {
 	[RDMA_CM_EVENT_ADDR_RESOLVED]	 = "address resolved",
 	[RDMA_CM_EVENT_ADDR_ERROR]	 = "address error",
@@ -3744,8 +3778,8 @@  static int cma_resolve_ib_udp(struct rdma_id_private *id_priv,
 	req.path = id_priv->id.route.path_rec;
 	req.sgid_attr = id_priv->id.route.addr.dev_addr.sgid_attr;
 	req.service_id = rdma_get_service_id(&id_priv->id, cma_dst_addr(id_priv));
-	req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8);
-	req.max_cm_retries = CMA_MAX_CM_RETRIES;
+	req.timeout_ms = 1 << (cma_cm_response_timeout - 8);
+	req.max_cm_retries = cma_max_cm_retries;
 
 	ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req);
 	if (ret) {
@@ -3815,9 +3849,9 @@  static int cma_connect_ib(struct rdma_id_private *id_priv,
 	req.flow_control = conn_param->flow_control;
 	req.retry_count = min_t(u8, 7, conn_param->retry_count);
 	req.rnr_retry_count = min_t(u8, 7, conn_param->rnr_retry_count);
-	req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
-	req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
-	req.max_cm_retries = CMA_MAX_CM_RETRIES;
+	req.remote_cm_response_timeout = cma_cm_response_timeout;
+	req.local_cm_response_timeout = cma_cm_response_timeout;
+	req.max_cm_retries = cma_max_cm_retries;
 	req.srq = id_priv->srq ? 1 : 0;
 
 	ret = ib_send_cm_req(id_priv->cm_id.ib, &req);
@@ -4700,6 +4734,9 @@  static int __init cma_init(void)
 		goto err;
 
 	cma_configfs_init();
+	cma_ctl_table_hdr = register_net_sysctl(&init_net, "net/rdma_cm", cma_ctl_table);
+	if (!cma_ctl_table_hdr)
+		pr_warn("rdma_cm: couldn't register sysctl path, using default values\n");
 
 	return 0;
 
@@ -4713,6 +4750,7 @@  static int __init cma_init(void)
 
 static void __exit cma_cleanup(void)
 {
+	unregister_net_sysctl_table(cma_ctl_table_hdr);
 	cma_configfs_exit();
 	ib_unregister_client(&cma_client);
 	unregister_netdevice_notifier(&cma_nb);