diff mbox series

mlx4_ib: Increase the timeout for CM cache

Message ID 20190131170951.178676-1-haakon.bugge@oracle.com (mailing list archive)
State Superseded
Headers show
Series mlx4_ib: Increase the timeout for CM cache | expand

Commit Message

Haakon Bugge Jan. 31, 2019, 5:09 p.m. UTC
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VMs have separate namespaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
in a cache.

Following the RDMA CM protocol, it is clear when an entry has to be
evicted from the cache. But life is not perfect: remote peers may die
or be rebooted. Hence, there is a timeout that wipes out a cache entry
once the PF driver assumes the remote peer has gone.
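
The cache logic can be sketched as follows (illustrative Python only;
the real cache lives in drivers/infiniband/hw/mlx4/cm.c and uses an
rb-tree plus delayed work, and the refresh-on-lookup behaviour is an
assumption for the sketch):

```python
import time

class IdMapCache:
    """Sketch of the PF driver's TID re-mapping cache with timed eviction."""

    def __init__(self, timeout_s: float = 5.0):
        self.timeout = timeout_s
        self.entries = {}  # (slave, sl_cm_id) -> (pv_cm_id, last_used)

    def insert(self, slave, sl_cm_id, pv_cm_id, now=None):
        now = time.monotonic() if now is None else now
        self.entries[(slave, sl_cm_id)] = (pv_cm_id, now)

    def lookup(self, slave, sl_cm_id, now=None):
        now = time.monotonic() if now is None else now
        hit = self.entries.get((slave, sl_cm_id))
        if hit is None:
            return None  # corresponds to the "id{...} is NULL!" case
        pv_cm_id, _ = hit
        self.entries[(slave, sl_cm_id)] = (pv_cm_id, now)  # refresh on use
        return pv_cm_id

    def reap(self, now=None):
        """Evict entries idle longer than the timeout (the cleanup work)."""
        now = time.monotonic() if now is None else now
        stale = [k for k, (_, t) in self.entries.items()
                 if now - t > self.timeout]
        for k in stale:
            del self.entries[k]
        return len(stale)
```

If a remote peer dies without sending DREQ/DREP, only `reap()` ever
frees its entries, which is why the timeout value matters.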

We have experienced an excessive number of DREQ retries during
fail-over testing when running with eight VMs per database server.

The problem has been reproduced in a bare-metal system using one VM
per physical node. In this environment, running 256 processes in each
VM, each process uses RDMA CM to create an RC QP between itself and
all (256) remote processes. All in all, 16K QPs.

When tearing down these 16K QPs, excessive DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe:

      dreq:       5007
cm_rx_msgs:
      drep:       3838
      dreq:      13018
       rep:       8128
       req:       8256
       rtu:       8256
cm_tx_msgs:
      drep:       8011
      dreq:      68856
       rep:       8256
       req:       8128
       rtu:       8128
cm_tx_retries:
      dreq:      60483

Note that the active/passive side is distributed.
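
The aggregation itself can be scripted; a sketch of the cat/paste/awk
approach (the sysfs layout <base>/<hca>/<port>/<group>/<counter> is an
assumption here):

```shell
sum_cm_counters() {
    # Sum every counter file under the tree into group/name totals.
    # sysfs names contain no whitespace, so word-splitting the find
    # output is safe in this context.
    base="${1:-/sys/class/infiniband_cm}"
    for f in $(find "$base" -type f 2>/dev/null); do
        printf '%s/%s %s\n' "$(basename "$(dirname "$f")")" \
               "$(basename "$f")" "$(cat "$f")"
    done |
    awk '{ sum[$1] += $2 } END { for (k in sum) printf "%s: %d\n", k, sum[k] }' |
    sort
}
```

Run as `sum_cm_counters` on a live system, or point it at any
directory tree with the same shape.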

Enabling pr_debug in cm.c gives tons of:

[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from 113 to 67
seconds. Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
      dreq:       7726
[]
cm_tx_retries:
      drep:          1
      dreq:       7779

Increasing the timeout further didn't help, as these duplicates and
retries stem from a too-short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell to about one hundred for both of them.
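
The conversion from the encoded value to seconds follows the usual
InfiniBand 4.096 us * 2^t formula (applying that encoding to the CMA
timeout is an assumption here, but it reproduces the numbers above):

```python
def cma_timeout_seconds(t: int) -> float:
    """Decode an IB-style timeout exponent t to seconds: 4.096 us * 2**t."""
    return 4.096e-6 * (1 << t)

for t in (20, 22):
    print(f"timeout={t} -> ~{cma_timeout_seconds(t):.1f} s")
# timeout=20 -> ~4.3 s, timeout=22 -> ~17.2 s
```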

Adjustment of the CMA timeout is _not_ part of this commit.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 drivers/infiniband/hw/mlx4/cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Haakon Bugge Feb. 1, 2019, 3:18 p.m. UTC | #1
Sorry,

I posted this as if it were a net patch. It isn't; hence the resend with the correct recipients.



Thxs, Håkon


> On 31 Jan 2019, at 18:09, Håkon Bugge <haakon.bugge@oracle.com> wrote:
> 
> [...]
Jason Gunthorpe Feb. 5, 2019, 10:36 p.m. UTC | #2
On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
> [...]
>  drivers/infiniband/hw/mlx4/cm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Jack? What do you think?

Haakon Bugge Feb. 6, 2019, 8:50 a.m. UTC | #3
> On 5 Feb 2019, at 23:36, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>> [...]
> 
> Jack? What do you think?

I am tempted to send a v2 making this a sysctl tunable, because full-rack testing using 8 servers, each with 8 VMs, showed only a 33% reduction in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit.

But sure, Jack's opinion matters.
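
For reference, a minimal sketch of what such a tunable could look like
as a module parameter (names are hypothetical and this is not the
actual v2; a real v2 might equally use a sysctl):

```c
#include <linux/moduleparam.h>

/* Hypothetical sketch only: replace the compile-time constant with a
 * writable module parameter so the cache timeout can be tuned at
 * runtime via /sys/module/mlx4_ib/parameters/ without rebuilding. */
static unsigned int cm_timeout_secs = 30;
module_param(cm_timeout_secs, uint, 0644);
MODULE_PARM_DESC(cm_timeout_secs,
		 "Seconds before an idle CM cache entry is reclaimed");

#define CM_CLEANUP_CACHE_TIMEOUT (cm_timeout_secs * HZ)
```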


Thxs, Håkon

Haakon Bugge Feb. 6, 2019, 3:40 p.m. UTC | #4
> On 6 Feb 2019, at 09:50, Håkon Bugge <haakon.bugge@oracle.com> wrote:
> 
> 
> 
>> On 5 Feb 2019, at 23:36, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> 
>> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>>> [...]
>> 
>> Jack? What do you think?
> 
> I am tempted to send a v2 making this a sysctl tuneable. This because, full-rack testing using 8 servers, each with 8 VMs, only showed 33% reduction in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit.
> 
> But sure, Jack's opinion matters.

Jack,

A major contributor to the long processing time in the PF driver proxying QP1 packets is:

create_pv_resources
   -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,
                               NULL, ctx, cq_size, 0);

That is, comp_vector is zero.

Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"), the zero comp_vector is intended to let the mlx4_core driver select the least used vector.

But, in mlx4_ib_create_cq(), we have:

	pr_info("eq_table: %p\n", dev->eq_table);
	if (dev->eq_table) {
		vector = dev->eq_table[mlx4_choose_vector(dev->dev, vector,
							  ibdev->num_comp_vectors)];
	}

	cq->vector = vector;

and dev->eq_table is NULL, so all the CQs for the proxy QPs get comp_vector zero.
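
Conceptually, the least-used selection the zero comp_vector is
supposed to trigger looks like this (illustrative Python, not the
driver code):

```python
def choose_vector(load_per_vector: list[int]) -> int:
    """Pick the least-loaded completion vector and account for the new CQ,
    mirroring what mlx4_choose_vector is meant to do when eq_table exists."""
    idx = min(range(len(load_per_vector)), key=load_per_vector.__getitem__)
    load_per_vector[idx] += 1
    return idx

loads = [3, 1, 2, 1]
print(choose_vector(loads))  # -> 1 (first least-loaded vector)
```

With eq_table NULL, none of this runs, so every proxy-QP CQ lands on
vector 0 and completion processing serializes on one EQ.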

I have to make some reservations, as this analysis is based on uek4, but I think the code here is the same upstream; I need to double-check.


Thxs, Håkon
jackm Feb. 6, 2019, 6:02 p.m. UTC | #5
On Wed, 6 Feb 2019 16:40:14 +0100
Håkon Bugge <haakon.bugge@oracle.com> wrote:

> [...]
Hi Hakon and Jason,
I was ill today (bad cold, took antihistamines all day, which knocked
me out).
I'll get to this tomorrow.

-Jack
Haakon Bugge Feb. 6, 2019, 6:16 p.m. UTC | #6
> On 6 Feb 2019, at 19:02, jackm <jackm@dev.mellanox.co.il> wrote:
> 
> On Wed, 6 Feb 2019 16:40:14 +0100
> Håkon Bugge <haakon.bugge@oracle.com> wrote:
> 
>> [...]
> Hi Hakon and Jason,
> I was ill today (bad cold, took antihistamines all day, which knocked
> me out).
> I'll get to this tomorrow.

No problem Jack. I actually see that our uek4 is different in mlx4_ib_alloc_eqs(), and that may well be the root cause here.

Hence, I have moved the MLs to Bcc and will get back to you tomorrow.


Thxs, Håkon

Patch

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index fedaf8260105..8c79a480f2b7 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -39,7 +39,7 @@ 
 
 #include "mlx4_ib.h"
 
-#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
+#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
 
 struct id_map_entry {
 	struct rb_node node;