Message ID | 20220520151926.2616318-1-haris.iqbal@ionos.com (mailing list archive) |
---|---|
State | Rejected |
Headers | show |
Series | RDMA/rxe: For fast memory reg wr set both lkey and rkey | expand |
On Fri, May 20, 2022 at 5:19 PM Md Haris Iqbal <haris.iqbal@ionos.com> wrote: > > Fixes: 001345339f4c ("RDMA/rxe: Separate HW and SW l/rkeys") > Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> > Signed-off-by: Aleksei Marov <aleksei.marov@ionos.com> > --- > Cc: rpearsonhpe@gmail.com > --- > drivers/infiniband/sw/rxe/rxe_mr.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c > index fc3942e04a1f..04eb804efb23 100644 > --- a/drivers/infiniband/sw/rxe/rxe_mr.c > +++ b/drivers/infiniband/sw/rxe/rxe_mr.c > @@ -648,7 +648,7 @@ int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe) > > mr->access = access; > mr->lkey = (mr->lkey & ~0xff) | key; > - mr->rkey = (access & IB_ACCESS_REMOTE) ? mr->lkey : 0; > + mr->rkey = mr->lkey; > mr->state = RXE_MR_STATE_VALID; > > set = mr->cur_map_set; > -- > 2.25.1 This code was changed in the below mentioned commit. commit 001345339f4ca85790a1644a74e33ae77ac116be Author: Bob Pearson <rpearsonhpe@gmail.com> Date: Tue Sep 14 11:42:05 2021 -0500 RDMA/rxe: Separate HW and SW l/rkeys Separate software and simulated hardware lkeys and rkeys for MRs and MWs. This makes struct ib_mr and struct ib_mw isolated from hardware changes triggered by executing work requests. This change fixes a bug seen in blktest. Link: https://lore.kernel.org/r/20210914164206.19768-4-rpearsonhpe@gmail.com Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> After the change, mapping for rnbd/rtrs fails with the following error. Mapping error, Client [Fri May 20 20:27:45 2022] rdma_rxe: qp#17 moved to error state [Fri May 20 20:27:45 2022] rtrs_server L1211: <blya>: remote access error (wr_cqe: 000000002d27244a, type: 0, vendor_err: 0x0, len: 0) Server [Fri May 20 20:27:45 2022] rnbd_client L597: Mapping device /dev/nullb0 on session blya, (access_mode: rw, nr_poll_queues: 0) [Fri May 20 20:27:45 2022] rnbd_client L1208: [session=blya] mapped 4/4 default/read queues. [Fri May 20 20:27:45 2022] rtrs_client L605: <blya>: RDMA failed: remote access error [Fri May 20 20:27:45 2022] rtrs_client L346: <blya>: Failed IB_WR_LOCAL_INV: WR flushed [Fri May 20 20:27:45 2022] rtrs_client L605: <blya>: RDMA failed: remote access error [Fri May 20 20:27:45 2022] rtrs_client L346: <blya>: Failed IB_WR_LOCAL_INV: WR flushed [Fri May 20 20:27:45 2022] rnbd_client L413: </dev/nullb0@blya> read I/O failed with err: -103 [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [Fri May 20 20:27:45 2022] rnbd_client L653: </dev/nullb0@blya> Device disconnected. [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical block 0, async page read [Fri May 20 20:27:45 2022] rnbd_client L1049: </dev/nullb0@blya> RTRS failed to transfer IO, err: -103 [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical block 0, async page read [Fri May 20 20:27:45 2022] ldm_validate_partition_table(): Disk read failed. [Fri May 20 20:27:45 2022] rnbd_client L1049: </dev/nullb0@blya> RTRS failed to transfer IO, err: -103 [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical block 0, async page read [Fri May 20 20:27:45 2022] rnbd_client L1049: </dev/nullb0@blya> RTRS failed to transfer IO, err: -103 [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical block 0, async page read [Fri May 20 20:27:45 2022] rnbd0: unable to read partition table Error on write Client [Fri May 20 20:28:39 2022] rdma_rxe: rxe_invalidate_mr: rkey (0x20416) doesn't match mr->rkey (0x0) [Fri May 20 20:28:39 2022] rtrs_client L346: <blya>: Failed IB_WR_LOCAL_INV: local QP operation error The code for rnbd/rtrs works fine on mlx4 and mlx5. From the spec perspective, is there a need for the wr access flag to be set while doing IB_WR_REG_MR? >
On Fri, May 20, 2022 at 5:20 PM Haris Iqbal <haris.iqbal@ionos.com> wrote: > > On Fri, May 20, 2022 at 5:19 PM Md Haris Iqbal <haris.iqbal@ionos.com> wrote: > > > > Fixes: 001345339f4c ("RDMA/rxe: Separate HW and SW l/rkeys") > > Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> > > Signed-off-by: Aleksei Marov <aleksei.marov@ionos.com> > > --- > > Cc: rpearsonhpe@gmail.com > > --- > > drivers/infiniband/sw/rxe/rxe_mr.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c > > index fc3942e04a1f..04eb804efb23 100644 > > --- a/drivers/infiniband/sw/rxe/rxe_mr.c > > +++ b/drivers/infiniband/sw/rxe/rxe_mr.c > > @@ -648,7 +648,7 @@ int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe) > > > > mr->access = access; > > mr->lkey = (mr->lkey & ~0xff) | key; > > - mr->rkey = (access & IB_ACCESS_REMOTE) ? mr->lkey : 0; > > + mr->rkey = mr->lkey; > > mr->state = RXE_MR_STATE_VALID; > > > > set = mr->cur_map_set; > > -- > > 2.25.1 > > This code was changed in the below mentioned commit. > > commit 001345339f4ca85790a1644a74e33ae77ac116be > Author: Bob Pearson <rpearsonhpe@gmail.com> > Date: Tue Sep 14 11:42:05 2021 -0500 > > RDMA/rxe: Separate HW and SW l/rkeys > > Separate software and simulated hardware lkeys and rkeys for MRs and MWs. > This makes struct ib_mr and struct ib_mw isolated from hardware changes > triggered by executing work requests. > > This change fixes a bug seen in blktest. > > Link: https://lore.kernel.org/r/20210914164206.19768-4-rpearsonhpe@gmail.com > Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > > > After the change, mapping for rnbd/rtrs fails with the following error. > > Mapping error, > Client > > [Fri May 20 20:27:45 2022] rdma_rxe: qp#17 moved to error state > [Fri May 20 20:27:45 2022] rtrs_server L1211: <blya>: remote access > error (wr_cqe: 000000002d27244a, type: 0, vendor_err: 0x0, len: 0) > > Server > [Fri May 20 20:27:45 2022] rnbd_client L597: Mapping device > /dev/nullb0 on session blya, (access_mode: rw, nr_poll_queues: 0) > [Fri May 20 20:27:45 2022] rnbd_client L1208: [session=blya] mapped > 4/4 default/read queues. > [Fri May 20 20:27:45 2022] rtrs_client L605: <blya>: RDMA failed: > remote access error > [Fri May 20 20:27:45 2022] rtrs_client L346: <blya>: Failed > IB_WR_LOCAL_INV: WR flushed > [Fri May 20 20:27:45 2022] rtrs_client L605: <blya>: RDMA failed: > remote access error > [Fri May 20 20:27:45 2022] rtrs_client L346: <blya>: Failed > IB_WR_LOCAL_INV: WR flushed > [Fri May 20 20:27:45 2022] rnbd_client L413: </dev/nullb0@blya> read > I/O failed with err: -103 > [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op > 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 > [Fri May 20 20:27:45 2022] rnbd_client L653: </dev/nullb0@blya> Device > disconnected. > [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical > block 0, async page read > [Fri May 20 20:27:45 2022] rnbd_client L1049: </dev/nullb0@blya> RTRS > failed to transfer IO, err: -103 > [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op > 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 > [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical > block 0, async page read > [Fri May 20 20:27:45 2022] ldm_validate_partition_table(): Disk read failed. > [Fri May 20 20:27:45 2022] rnbd_client L1049: </dev/nullb0@blya> RTRS > failed to transfer IO, err: -103 > [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op > 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 > [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical > block 0, async page read > [Fri May 20 20:27:45 2022] rnbd_client L1049: </dev/nullb0@blya> RTRS > failed to transfer IO, err: -103 > [Fri May 20 20:27:45 2022] I/O error, dev rnbd0, sector 0 op > 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 > [Fri May 20 20:27:45 2022] Buffer I/O error on dev rnbd0, logical > block 0, async page read > [Fri May 20 20:27:45 2022] rnbd0: unable to read partition table > > Error on write > Client > [Fri May 20 20:28:39 2022] rdma_rxe: rxe_invalidate_mr: rkey (0x20416) > doesn't match mr->rkey (0x0) > [Fri May 20 20:28:39 2022] rtrs_client L346: <blya>: Failed > IB_WR_LOCAL_INV: local QP operation error > > The code for rnbd/rtrs works fine on mlx4 and mlx5. > From the spec perspective, is there a need for the wr access flag to > be set while doing IB_WR_REG_MR? We went through the IB spec and realized that what RXE is doing is correct. in IBA spec it is said in sections "10.6.3.7 FAST REGISTRATION" ..."If the free L_Key has an accompanying free R_Key, and the Consumer requests Remote Access Rights, a Fast Registration associ- ates the R_Key with a set of memory locations" Hence my above patch is incorrect and should not be considered. The correct fix would be to change the access flag in RTRS IB_WR_REG_MR wr. We will send a patch soon. But I do have a follow-up question. Since without this access flag, the code works fine in mlx4 and mlx5, they are doing something differently. Is this an issue? > > > > > >
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c index fc3942e04a1f..04eb804efb23 100644 --- a/drivers/infiniband/sw/rxe/rxe_mr.c +++ b/drivers/infiniband/sw/rxe/rxe_mr.c @@ -648,7 +648,7 @@ int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe) mr->access = access; mr->lkey = (mr->lkey & ~0xff) | key; - mr->rkey = (access & IB_ACCESS_REMOTE) ? mr->lkey : 0; + mr->rkey = mr->lkey; mr->state = RXE_MR_STATE_VALID; set = mr->cur_map_set;