
[rdma-core,1/1] mlx5: Account for at least 1 medium bfreg in low_lat_uuars check

Message ID 751f8831-a7ff-6360-f437-917c04274843@dev.mellanox.co.il (mailing list archive)
State Deferred

Commit Message

Yishai Hadas April 30, 2018, 9:14 a.m. UTC
On 4/29/2018 8:34 PM, Rohit Zambre wrote:
> On Sun, Apr 29, 2018 at 11:12 AM, Yishai Hadas
> <yishaih@dev.mellanox.co.il> wrote:
>>
>> On 4/23/2018 8:58 PM, Rohit Zambre wrote:
>>>
>>> The calculation of need_uuar_lock rightly accounts for at least 1 medium
>>> bfreg.
>>
>>
>> The code should support even 0 medium bfregs comparing the suggested patch below.
> 
> So when the user sets MLX5_TOTAL_UUARS=16 and
> MLX5_NUM_LOW_LAT_UUARS=15, the first 15 QPs will map to the 15 low
> latency bfregs. To which bfreg will the 16th QP map to?

The 16th QP expects to get bfreg index 0, which the kernel returns once 
there are *no* low-latency high-class or medium bfregs available. In 
this case rdma-core uses the DB instead of the BF, and no lock is 
needed. (See the usage of bfs[bfi].buf_size.)

However, looking at the kernel code, it seems there is a bug in 
'first_med_bfreg()', which should handle this non-default case where 
there are no medium bfregs.

The patch below should fix that. Can you please give it a try and 
confirm the fix? It should go in together with the candidate patch for 
need_uuar_lock() that I sent yesterday to rdma-core.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Rohit Zambre April 30, 2018, 6:23 p.m. UTC | #1
On Mon, Apr 30, 2018 at 4:14 AM, Yishai Hadas
<yishaih@dev.mellanox.co.il> wrote:
> On 4/29/2018 8:34 PM, Rohit Zambre wrote:
>>
>> On Sun, Apr 29, 2018 at 11:12 AM, Yishai Hadas
>> <yishaih@dev.mellanox.co.il> wrote:
>>>
>>>
>>> On 4/23/2018 8:58 PM, Rohit Zambre wrote:
>>>>
>>>>
>>>> The calculation of need_uuar_lock rightly accounts for at least 1 medium
>>>> bfreg.
>>>
>>>
>>>
>>> The code should support even 0 medium bfregs comparing the suggested
>>> patch below.
>>
>>
>> So when the user sets MLX5_TOTAL_UUARS=16 and
>> MLX5_NUM_LOW_LAT_UUARS=15, the first 15 QPs will map to the 15 low
>> latency bfregs. To which bfreg will the 16th QP map to?
>
>
> The 16th QP expects to get bfreg index 0 which is returned from the kernel
> once there are *no* low latency high class nor medium bfregs available. In
> this case DB is used in rdma-core instead of BF and no lock is needed. (see
> usage of bfs[bfi].buf_size)

OK, so all the QPs after the 15th one will map to bfreg_0. I am trying
to understand why a lock would not be needed on bfreg_0. I know that
the DB is a 64-bit atomic. Is there an implicit flush/sfence with the
atomic? When CPU_A posts on the 16th QP and CPU_B posts on the 17th QP
simultaneously, without the lock, what is guaranteeing that the device
will see both the DBs?

> However, looking in the kernel code, it seems that there is some bug in
> 'first_med_bfreg()' which had to handle this non-default case when there are
> no medium bfregs.
>
> Below patch expects to fix that, can you please give it a try and confirm
> the fix ? this should come in parallel with the candidate patch that I have
> sent yesterday in rdma-core in need_uuar_lock().

Both the rdma-core and kernel patch will fix the problem, given that
the device will see all the doorbells written simultaneously to
bfreg_0 without a lock.
In the rdma-core patch, the mapping from the uuarn index to i is quite
specific to the values defined by the MLX5 enums. I would do it the
following way, but the final decision is up to you.

 static int need_uuar_lock(struct mlx5_context *ctx, int uuarn)
 {
+       int i;
+
        if (uuarn == 0 || mlx5_single_threaded)
                return 0;

-       if (uuarn >= (ctx->tot_uuars - ctx->low_lat_uuars) * 2)
+       i = (uuarn / NUM_BFREGS_PER_UAR * MLX5_NUM_NON_FP_BFREGS_PER_UAR) +
+           (uuarn % NUM_BFREGS_PER_UAR);
+       if (i >= ctx->tot_uuars - ctx->low_lat_uuars)
                return 0;

Thanks,
Rohit
Yishai Hadas May 1, 2018, 7:42 a.m. UTC | #2
On 4/30/2018 9:23 PM, Rohit Zambre wrote:
> On Mon, Apr 30, 2018 at 4:14 AM, Yishai Hadas
> <yishaih@dev.mellanox.co.il> wrote:
>> On 4/29/2018 8:34 PM, Rohit Zambre wrote:
>>>
>>> On Sun, Apr 29, 2018 at 11:12 AM, Yishai Hadas
>>> <yishaih@dev.mellanox.co.il> wrote:
>>>>
>>>>
>>>> On 4/23/2018 8:58 PM, Rohit Zambre wrote:
>>>>>
>>>>>
>>>>> The calculation of need_uuar_lock rightly accounts for at least 1 medium
>>>>> bfreg.
>>>>
>>>>
>>>>
>>>> The code should support even 0 medium bfregs comparing the suggested
>>>> patch below.
>>>
>>>
>>> So when the user sets MLX5_TOTAL_UUARS=16 and
>>> MLX5_NUM_LOW_LAT_UUARS=15, the first 15 QPs will map to the 15 low
>>> latency bfregs. To which bfreg will the 16th QP map to?
>>
>>
>> The 16th QP expects to get bfreg index 0 which is returned from the kernel
>> once there are *no* low latency high class nor medium bfregs available. In
>> this case DB is used in rdma-core instead of BF and no lock is needed. (see
>> usage of bfs[bfi].buf_size)
> 
> OK, so all the QPs after the 15th one will map to bfreg_0. I am trying
> to understand why a lock would not be needed on bfreg_0. I know that
> the DB is a 64-bit atomic. Is there an implicit flush/sfence with the
> atomic? When CPU_A posts on the 16th QP and CPU_B posts on the 17th QP
> simultaneously, without the lock, what is guaranteeing that the device
> will see both the DBs?
> 

It's guaranteed: a 64-bit write is atomic on 64-bit systems, and on 
32-bit systems a global lock is taken as part of the mmio_write64_be() 
implementation.

>> However, looking in the kernel code, it seems that there is some bug in
>> 'first_med_bfreg()' which had to handle this non-default case when there are
>> no medium bfregs.
>>
>> Below patch expects to fix that, can you please give it a try and confirm
>> the fix ? this should come in parallel with the candidate patch that I have
>> sent yesterday in rdma-core in need_uuar_lock().
> 
> Both the rdma-core and kernel patch will fix the problem, given that
> the device will see all the doorbells written simultaneously to
> bfreg_0 without a lock.

Thanks for confirming those patches; we will take them into our 
regression system to be formally approved before sending them out.

Yishai
Rohit Zambre May 1, 2018, 1:05 p.m. UTC | #3
On Tue, May 1, 2018 at 2:42 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> wrote:
> On 4/30/2018 9:23 PM, Rohit Zambre wrote:
>>
>> On Mon, Apr 30, 2018 at 4:14 AM, Yishai Hadas
>> <yishaih@dev.mellanox.co.il> wrote:
>>>
>>> On 4/29/2018 8:34 PM, Rohit Zambre wrote:
>>>>
>>>>
>>>> On Sun, Apr 29, 2018 at 11:12 AM, Yishai Hadas
>>>> <yishaih@dev.mellanox.co.il> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 4/23/2018 8:58 PM, Rohit Zambre wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> The calculation of need_uuar_lock rightly accounts for at least 1
>>>>>> medium
>>>>>> bfreg.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The code should support even 0 medium bfregs comparing the suggested
>>>>> patch below.
>>>>
>>>>
>>>>
>>>> So when the user sets MLX5_TOTAL_UUARS=16 and
>>>> MLX5_NUM_LOW_LAT_UUARS=15, the first 15 QPs will map to the 15 low
>>>> latency bfregs. To which bfreg will the 16th QP map to?
>>>
>>>
>>>
>>> The 16th QP expects to get bfreg index 0 which is returned from the
>>> kernel
>>> once there are *no* low latency high class nor medium bfregs available.
>>> In
>>> this case DB is used in rdma-core instead of BF and no lock is needed.
>>> (see
>>> usage of bfs[bfi].buf_size)
>>
>>
>> OK, so all the QPs after the 15th one will map to bfreg_0. I am trying
>> to understand why a lock would not be needed on bfreg_0. I know that
>> the DB is a 64-bit atomic. Is there an implicit flush/sfence with the
>> atomic? When CPU_A posts on the 16th QP and CPU_B posts on the 17th QP
>> simultaneously, without the lock, what is guaranteeing that the device
>> will see both the DBs?
>>
>
> It's guaranteed, 64-bit writing is atomic on 64 bit systems, for 32 bit
> systems there is some global lock that is taken as part of mmio_write64_be()
> implementation.
>
>>> However, looking in the kernel code, it seems that there is some bug in
>>> 'first_med_bfreg()' which had to handle this non-default case when there
>>> are
>>> no medium bfregs.
>>>
>>> Below patch expects to fix that, can you please give it a try and confirm
>>> the fix ? this should come in parallel with the candidate patch that I
>>> have
>>> sent yesterday in rdma-core in need_uuar_lock().
>>
>>
>> Both the rdma-core and kernel patch will fix the problem, given that
>> the device will see all the doorbells written simultaneously to
>> bfreg_0 without a lock.
>
>
> Thanks for confirming those patches, will take them into our regression
> system to be formally approved before sending them.

Thanks for looking into it!
-Rohit

Patch

--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -480,8 +480,12 @@  static int qp_has_rq(struct ib_qp_init_attr *attr)
         return 1;
  }

-static int first_med_bfreg(void)
+static int first_med_bfreg(struct mlx5_bfreg_info *bfregi)
  {
+       if (bfregi->total_num_bfregs -
+           (bfregi->num_low_latency_bfregs + bfregi->num_dyn_bfregs) == 1)
+               return -ENOMEM;
+
         return 1;
  }

@@ -537,10 +541,13 @@  static int alloc_high_class_bfreg(struct mlx5_ib_dev *dev,
  static int alloc_med_class_bfreg(struct mlx5_ib_dev *dev,
                                  struct mlx5_bfreg_info *bfregi)
  {
-       int minidx = first_med_bfreg();
+       int minidx = first_med_bfreg(bfregi);
         int i;

-       for (i = first_med_bfreg(); i < first_hi_bfreg(dev, bfregi); i++) {
+       if (minidx < 0)
+               return minidx;
+
+       for (i = minidx; i < first_hi_bfreg(dev, bfregi); i++) {
                 if (bfregi->count[i] < bfregi->count[minidx])
                         minidx = i;