Message ID: 1487272989-8215-9-git-send-email-jgunthorpe@obsidianresearch.com (mailing list archive)
State: Accepted
On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
> The mlx5 comments are good so these translate fairly directly.
>
> There is one barrier in mlx5_arm_cq that I could not explain, it became
> mmio_ordered_writes_hack()
>
> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
> ---

>  #ifdef MLX5_DEBUG
>  	{
> @@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
>  	 * Make sure that the doorbell record in host memory is
>  	 * written before ringing the doorbell via PCI MMIO.
>  	 */
> -	wmb();
> +	udma_to_device_barrier();
>
>  	doorbell[0] = htonl(sn << 28 | cmd | ci);
>  	doorbell[1] = htonl(cq->cqn);
>
>  	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
>
> -	wc_wmb();
> +	mmio_ordered_writes_hack();

We expect to use here the "mmio_flush_writes()" new macro, instead of the
above "hack" one. This barrier enforces the data to be flushed immediately
to the device so that the CQ will be armed with no delay.

>  	return 0;
>  }

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Mon, Feb 27, 2017 at 12:56:33PM +0200, Yishai Hadas wrote:
> On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
> >The mlx5 comments are good so these translate fairly directly.
> >
> >There is one barrier in mlx5_arm_cq that I could not explain, it became
> >mmio_ordered_writes_hack()
> >
> >Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
>
> > #ifdef MLX5_DEBUG
> > 	{
> >@@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
> > 	 * Make sure that the doorbell record in host memory is
> > 	 * written before ringing the doorbell via PCI MMIO.
> > 	 */
> >-	wmb();
> >+	udma_to_device_barrier();
> >
> > 	doorbell[0] = htonl(sn << 28 | cmd | ci);
> > 	doorbell[1] = htonl(cq->cqn);
> >
> > 	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
> >
> >-	wc_wmb();
> >+	mmio_ordered_writes_hack();
>
> We expect to use here the "mmio_flush_writes()" new macro, instead of the
> above "hack" one. This barrier enforces the data to be flushed immediately
> to the device so that the CQ will be armed with no delay.

Hmm.....

Is it even possible to 'speed up' writes to UC memory? (uar is
UC, right?)

Be aware of the trade off: a barrier may stall the CPU until the UC
writes progress far enough, but that stall is pointless if the barrier
also doesn't 'speed up' the write.

Also, the usual implementation of mlx5_write64 includes that spinlock
which already has a serializing atomic in it - so it is doubtful that
the wc_wmb() actually ever did anything.

Do you have any hard information one way or another?

IMHO, if there is a way to speed up UC writes then it should have its
own macro, eg mmio_flush_uc_writes(), and it probably should be called
within the mlx5_write64 implementation before releasing the spinlock.

But, AFAIK, there is no way to do that on x86-64...
Jason
On 2/27/2017 8:00 PM, Jason Gunthorpe wrote:
> On Mon, Feb 27, 2017 at 12:56:33PM +0200, Yishai Hadas wrote:
>> On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
>>> The mlx5 comments are good so these translate fairly directly.
>>>
>>> There is one barrier in mlx5_arm_cq that I could not explain, it became
>>> mmio_ordered_writes_hack()
>>>
>>> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
>>
>>> #ifdef MLX5_DEBUG
>>> 	{
>>> @@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
>>> 	 * Make sure that the doorbell record in host memory is
>>> 	 * written before ringing the doorbell via PCI MMIO.
>>> 	 */
>>> -	wmb();
>>> +	udma_to_device_barrier();
>>>
>>> 	doorbell[0] = htonl(sn << 28 | cmd | ci);
>>> 	doorbell[1] = htonl(cq->cqn);
>>>
>>> 	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
>>>
>>> -	wc_wmb();
>>> +	mmio_ordered_writes_hack();
>>
>> We expect to use here the "mmio_flush_writes()" new macro, instead of the
>> above "hack" one. This barrier enforces the data to be flushed immediately
>> to the device so that the CQ will be armed with no delay.
>
> Hmm.....
>
> Is it even possible to 'speed up' writes to UC memory? (uar is
> UC, right?)

No, the UAR is mapped write combining, that's why we need here the
mmio_flush_writes() to make sure that the device will see the data with
no delay.
diff --git a/providers/mlx5/cq.c b/providers/mlx5/cq.c
index 372e40bc2b6589..cc0af920c703d9 100644
--- a/providers/mlx5/cq.c
+++ b/providers/mlx5/cq.c
@@ -489,7 +489,7 @@ static inline int mlx5_get_next_cqe(struct mlx5_cq *cq,
 	 * Make sure we read CQ entry contents after we've checked the
 	 * ownership bit.
 	 */
-	rmb();
+	udma_from_device_barrier();
 
 #ifdef MLX5_DEBUG
 	{
@@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	 * Make sure that the doorbell record in host memory is
 	 * written before ringing the doorbell via PCI MMIO.
 	 */
-	wmb();
+	udma_to_device_barrier();
 
 	doorbell[0] = htonl(sn << 28 | cmd | ci);
 	doorbell[1] = htonl(cq->cqn);
 
 	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
 
-	wc_wmb();
+	mmio_ordered_writes_hack();
 
 	return 0;
 }
@@ -1395,7 +1395,7 @@ void __mlx5_cq_clean(struct mlx5_cq *cq, uint32_t rsn, struct mlx5_srq *srq)
 		 * Make sure update of buffer contents is done before
 		 * updating consumer index.
 		 */
-		wmb();
+		udma_to_device_barrier();
 		update_cons_index(cq);
 	}
 }
diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
index b9ae72c9827c8c..d7087d986ce79f 100644
--- a/providers/mlx5/qp.c
+++ b/providers/mlx5/qp.c
@@ -926,10 +926,13 @@ out:
 	 * Make sure that descriptors are written before
 	 * updating doorbell record and ringing the doorbell
 	 */
-	wmb();
+	udma_to_device_barrier();
 	qp->db[MLX5_SND_DBR] = htonl(qp->sq.cur_post & 0xffff);
 
-	wc_wmb();
+	/* Make sure that the doorbell write happens before the memcpy
+	 * to WC memory below */
+	mmio_wc_start();
+
 	ctx = to_mctx(ibqp->context);
 	if (bf->need_lock)
 		mlx5_spin_lock(&bf->lock);
@@ -944,15 +947,15 @@ out:
 			     &ctx->lock32);
 
 	/*
-	 * use wc_wmb() to ensure write combining buffers are flushed out
+	 * use mmio_flush_writes() to ensure write combining buffers are flushed out
 	 * of the running CPU. This must be carried inside the spinlock.
 	 * Otherwise, there is a potential race. In the race, CPU A
 	 * writes doorbell 1, which is waiting in the WC buffer. CPU B
 	 * writes doorbell 2, and it's write is flushed earlier. Since
-	 * the wc_wmb is CPU local, this will result in the HCA seeing
+	 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
 	 * doorbell 2, followed by doorbell 1.
 	 */
-	wc_wmb();
+	mmio_flush_writes();
 	bf->offset ^= bf->buf_size;
 	if (bf->need_lock)
 		mlx5_spin_unlock(&bf->lock);
@@ -1119,7 +1122,7 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 		*(rwq->recv_db) = htonl(rwq->rq.head & 0xffff);
 	}
 
@@ -1193,7 +1196,8 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
+
 		/*
 		 * For Raw Packet QP, avoid updating the doorbell record
 		 * as long as the QP isn't in RTR state, to avoid receiving
diff --git a/providers/mlx5/srq.c b/providers/mlx5/srq.c
index b6e1eaf26bbd0c..2c71730a40f875 100644
--- a/providers/mlx5/srq.c
+++ b/providers/mlx5/srq.c
@@ -137,7 +137,7 @@ int mlx5_post_srq_recv(struct ibv_srq *ibsrq,
 	 * Make sure that descriptors are written before
 	 * we write doorbell record.
 	 */
-	wmb();
+	udma_to_device_barrier();
 	*srq->db = htonl(srq->counter);
 }
The mlx5 comments are good so these translate fairly directly.

There is one barrier in mlx5_arm_cq that I could not explain, it became
mmio_ordered_writes_hack()

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
---
 providers/mlx5/cq.c  |  8 ++++----
 providers/mlx5/qp.c  | 18 +++++++++++-------
 providers/mlx5/srq.c |  2 +-
 3 files changed, 16 insertions(+), 12 deletions(-)