| Message ID | 751f8831-a7ff-6360-f437-917c04274843@dev.mellanox.co.il (mailing list archive) |
|---|---|
| State | Deferred |
On Mon, Apr 30, 2018 at 4:14 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> wrote:
> On 4/29/2018 8:34 PM, Rohit Zambre wrote:
>> On Sun, Apr 29, 2018 at 11:12 AM, Yishai Hadas
>> <yishaih@dev.mellanox.co.il> wrote:
>>> On 4/23/2018 8:58 PM, Rohit Zambre wrote:
>>>> The calculation of need_uuar_lock rightly accounts for at least 1
>>>> medium bfreg.
>>>
>>> The code should support even 0 medium bfregs, per the suggested
>>> patch below.
>>
>> So when the user sets MLX5_TOTAL_UUARS=16 and
>> MLX5_NUM_LOW_LAT_UUARS=15, the first 15 QPs will map to the 15 low
>> latency bfregs. To which bfreg will the 16th QP map?
>
> The 16th QP expects to get bfreg index 0, which is returned from the
> kernel once there are *no* low latency high class nor medium bfregs
> available. In this case a DB is used in rdma-core instead of a BF and
> no lock is needed (see the usage of bfs[bfi].buf_size).

OK, so all the QPs after the 15th one will map to bfreg_0. I am trying
to understand why a lock would not be needed on bfreg_0. I know that
the DB is a 64-bit atomic. Is there an implicit flush/sfence with the
atomic? When CPU_A posts on the 16th QP and CPU_B posts on the 17th QP
simultaneously, without the lock, what is guaranteeing that the device
will see both the DBs?

> However, looking in the kernel code, it seems that there is some bug
> in 'first_med_bfreg()' which should handle this non-default case when
> there are no medium bfregs.
>
> The patch below expects to fix that; can you please give it a try and
> confirm the fix? This should come in parallel with the candidate patch
> that I sent yesterday in rdma-core in need_uuar_lock().

Both the rdma-core and kernel patch will fix the problem, given that
the device will see all the doorbells written simultaneously to
bfreg_0 without a lock. In the rdma-core patch, mapping the uuarn index
to i is quite specific to the values defined as MLX5 enums.
I would do it the following way, but the final decision is up to you.

 static int need_uuar_lock(struct mlx5_context *ctx, int uuarn)
 {
+	int i;
+
 	if (uuarn == 0 || mlx5_single_threaded)
 		return 0;

-	if (uuarn >= (ctx->tot_uuars - ctx->low_lat_uuars) * 2)
+	i = (uuarn / NUM_BFREGS_PER_UAR * MLX5_NUM_NON_FP_BFREGS_PER_UAR) +
+	    (uuarn % NUM_BFREGS_PER_UAR);
+	if (i >= ctx->tot_uuars - ctx->low_lat_uuars)
 		return 0;

Thanks,
Rohit
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/30/2018 9:23 PM, Rohit Zambre wrote:
> On Mon, Apr 30, 2018 at 4:14 AM, Yishai Hadas
> <yishaih@dev.mellanox.co.il> wrote:
>> On 4/29/2018 8:34 PM, Rohit Zambre wrote:
>>> On Sun, Apr 29, 2018 at 11:12 AM, Yishai Hadas
>>> <yishaih@dev.mellanox.co.il> wrote:
>>>> On 4/23/2018 8:58 PM, Rohit Zambre wrote:
>>>>> The calculation of need_uuar_lock rightly accounts for at least 1
>>>>> medium bfreg.
>>>>
>>>> The code should support even 0 medium bfregs, per the suggested
>>>> patch below.
>>>
>>> So when the user sets MLX5_TOTAL_UUARS=16 and
>>> MLX5_NUM_LOW_LAT_UUARS=15, the first 15 QPs will map to the 15 low
>>> latency bfregs. To which bfreg will the 16th QP map?
>>
>> The 16th QP expects to get bfreg index 0, which is returned from the
>> kernel once there are *no* low latency high class nor medium bfregs
>> available. In this case a DB is used in rdma-core instead of a BF and
>> no lock is needed (see the usage of bfs[bfi].buf_size).
>
> OK, so all the QPs after the 15th one will map to bfreg_0. I am trying
> to understand why a lock would not be needed on bfreg_0. I know that
> the DB is a 64-bit atomic. Is there an implicit flush/sfence with the
> atomic? When CPU_A posts on the 16th QP and CPU_B posts on the 17th QP
> simultaneously, without the lock, what is guaranteeing that the device
> will see both the DBs?

It's guaranteed: a 64-bit write is atomic on 64-bit systems, and on
32-bit systems a global lock is taken as part of the mmio_write64_be()
implementation.

>> However, looking in the kernel code, it seems that there is some bug
>> in 'first_med_bfreg()' which should handle this non-default case when
>> there are no medium bfregs.
>>
>> The patch below expects to fix that; can you please give it a try and
>> confirm the fix? This should come in parallel with the candidate patch
>> that I sent yesterday in rdma-core in need_uuar_lock().
>
> Both the rdma-core and kernel patch will fix the problem, given that
> the device will see all the doorbells written simultaneously to
> bfreg_0 without a lock.

Thanks for confirming those patches; I will take them into our
regression system to be formally approved before sending them.

Yishai
On Tue, May 1, 2018 at 2:42 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> wrote:
> On 4/30/2018 9:23 PM, Rohit Zambre wrote:
>> On Mon, Apr 30, 2018 at 4:14 AM, Yishai Hadas
>> <yishaih@dev.mellanox.co.il> wrote:
>>> On 4/29/2018 8:34 PM, Rohit Zambre wrote:
>>>> On Sun, Apr 29, 2018 at 11:12 AM, Yishai Hadas
>>>> <yishaih@dev.mellanox.co.il> wrote:
>>>>> On 4/23/2018 8:58 PM, Rohit Zambre wrote:
>>>>>> The calculation of need_uuar_lock rightly accounts for at least 1
>>>>>> medium bfreg.
>>>>>
>>>>> The code should support even 0 medium bfregs, per the suggested
>>>>> patch below.
>>>>
>>>> So when the user sets MLX5_TOTAL_UUARS=16 and
>>>> MLX5_NUM_LOW_LAT_UUARS=15, the first 15 QPs will map to the 15 low
>>>> latency bfregs. To which bfreg will the 16th QP map?
>>>
>>> The 16th QP expects to get bfreg index 0, which is returned from the
>>> kernel once there are *no* low latency high class nor medium bfregs
>>> available. In this case a DB is used in rdma-core instead of a BF and
>>> no lock is needed (see the usage of bfs[bfi].buf_size).
>>
>> OK, so all the QPs after the 15th one will map to bfreg_0. I am trying
>> to understand why a lock would not be needed on bfreg_0. I know that
>> the DB is a 64-bit atomic. Is there an implicit flush/sfence with the
>> atomic? When CPU_A posts on the 16th QP and CPU_B posts on the 17th QP
>> simultaneously, without the lock, what is guaranteeing that the device
>> will see both the DBs?
>
> It's guaranteed: a 64-bit write is atomic on 64-bit systems, and on
> 32-bit systems a global lock is taken as part of the mmio_write64_be()
> implementation.
>
>>> However, looking in the kernel code, it seems that there is some bug
>>> in 'first_med_bfreg()' which should handle this non-default case when
>>> there are no medium bfregs.
>>>
>>> The patch below expects to fix that; can you please give it a try and
>>> confirm the fix? This should come in parallel with the candidate patch
>>> that I sent yesterday in rdma-core in need_uuar_lock().
>>
>> Both the rdma-core and kernel patch will fix the problem, given that
>> the device will see all the doorbells written simultaneously to
>> bfreg_0 without a lock.
>
> Thanks for confirming those patches; I will take them into our
> regression system to be formally approved before sending them.

Thanks for looking into it!
-Rohit
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -480,8 +480,12 @@ static int qp_has_rq(struct ib_qp_init_attr *attr)
 	return 1;
 }
 
-static int first_med_bfreg(void)
+static int first_med_bfreg(struct mlx5_bfreg_info *bfregi)
 {
+	if (bfregi->total_num_bfregs -
+	    (bfregi->num_low_latency_bfregs + bfregi->num_dyn_bfregs) == 1)
+		return -ENOMEM;
+
 	return 1;
 }
 
@@ -537,10 +541,13 @@ static int alloc_high_class_bfreg(struct mlx5_ib_dev *dev,
 static int alloc_med_class_bfreg(struct mlx5_ib_dev *dev,
 				 struct mlx5_bfreg_info *bfregi)
 {
-	int minidx = first_med_bfreg();
+	int minidx = first_med_bfreg(bfregi);
 	int i;
 
-	for (i = first_med_bfreg(); i < first_hi_bfreg(dev, bfregi); i++) {
+	if (minidx < 0)
+		return minidx;
+
+	for (i = minidx; i < first_hi_bfreg(dev, bfregi); i++) {
 		if (bfregi->count[i] < bfregi->count[minidx])
 			minidx = i;