Message ID | 20220502053907.6388-1-cgxu519@mykernel.net (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Jason Gunthorpe |
Headers | show |
Series | [RFC] RDMA/rxe: skip adjusting remote addr for write in retry operation | expand |
> -----Original Message----- > From: Chengguang Xu <cgxu519@mykernel.net> > Sent: Monday, May 2, 2022 12:39 AM > To: zyjzyj2000@gmail.com; jgg@ziepe.ca; leon@kernel.org > Cc: linux-rdma@vger.kernel.org; Chengguang Xu <cgxu519@mykernel.net> > Subject: [RFC PATCH] RDMA/rxe: skip adjusting remote addr for write in retry operation > > For write request the remote addr will be sent only with first packet so we don't have to adjust wqe->iova in retry operation. This is problematic for lossy networks. A very large read request, say 8MiB, sends 2048 packets in response without any acknowledgement from the requester. If the packet loss rate was 1% the read request would never finish as the probability of sending 2048 packets without loss is very small. The way the code works today is that the iova is adjusted, and if you are lucky the responder has already finished the previous read operation and starts over with a new read reply starting with a first packet at iova. If you are less fortunate the previous read reply has not finished and the responder will continue to work on it until it is finished before looking at the new read request wqe. the completer will respond to each out of order packet by checking to see if it should start a retry but since it has already done so it drops the packet. It's messy but one can make forward progress ~100 packets at a time. It would be faster if the responder saw that the new request replaced the old one and stopped sending packets on the old read. I have no idea how CX NICs do this but just restarting from scratch seems bad. Bob Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> --- drivers/infiniband/sw/rxe/rxe_req.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c index ae5fbc79dd5c..f08010651ef7 100644 --- a/drivers/infiniband/sw/rxe/rxe_req.c +++ b/drivers/infiniband/sw/rxe/rxe_req.c @@ -33,8 +33,6 @@ static inline void retry_first_write_send(struct rxe_qp *qp, } else { advance_dma_data(&wqe->dma, to_send); } - if (mask & WR_WRITE_MASK) - wqe->iova += qp->mtu; } } -- 2.35.1
在 2022/5/3 1:15, Pearson, Robert B 写道: >> -----Original Message----- >> From: Chengguang Xu <cgxu519@mykernel.net> >> Sent: Monday, May 2, 2022 12:39 AM >> To: zyjzyj2000@gmail.com; jgg@ziepe.ca; leon@kernel.org >> Cc: linux-rdma@vger.kernel.org; Chengguang Xu <cgxu519@mykernel.net> >> Subject: [RFC PATCH] RDMA/rxe: skip adjusting remote addr for write in retry operation >> >> For write request the remote addr will be sent only with first packet so we don't have to adjust wqe->iova in retry operation. > This is problematic for lossy networks. A very large read request, say 8MiB, sends 2048 packets in response without any acknowledgement > from the requester. If the packet loss rate was 1% the read request would never finish as the probability of sending 2048 packets without > loss is very small. The way the code works today is that the iova is adjusted, and if you are lucky the responder has already finished the > previous read operation and starts over with a new read reply starting with a first packet at iova. If you are less fortunate the previous > read reply has not finished and the responder will continue to work on it until it is finished before looking at the new read request wqe. > the completer will respond to each out of order packet by checking to see if it should start a retry but since it has already done so > it drops the packet. It's messy but one can make forward progress ~100 packets at a time. It would be faster if the responder saw that > the new request replaced the old one and stopped sending packets on the old read. I have no idea how CX NICs do this but just restarting > from scratch seems bad. I agree that read request indeed needs to adjust iova during retry and the adjustment(for read) has already done in below logic in req_retry(). if (mask & WR_READ_MASK) { npsn = (wqe->dma.length - wqe->dma.resid) / qp->mtu; wqe->iova += npsn * qp->mtu; } For write request, retry will not send new iova because only first write packet has RXE_RETH_MASK regardless iova adjustment. Am I missing something? Thanks, Chengguang
Sorry I misread your original post. You are correct. The wqe->iova is only used to fill in the RETH header so it is not needed after the first packet. This commit looks OK. On Tue, May 3, 2022 at 2:59 PM Chengguang Xu <cgxu519@mykernel.net> wrote: > > 在 2022/5/3 1:15, Pearson, Robert B 写道: > >> -----Original Message----- > >> From: Chengguang Xu <cgxu519@mykernel.net> > >> Sent: Monday, May 2, 2022 12:39 AM > >> To: zyjzyj2000@gmail.com; jgg@ziepe.ca; leon@kernel.org > >> Cc: linux-rdma@vger.kernel.org; Chengguang Xu <cgxu519@mykernel.net> > >> Subject: [RFC PATCH] RDMA/rxe: skip adjusting remote addr for write in retry operation > >> > >> For write request the remote addr will be sent only with first packet so we don't have to adjust wqe->iova in retry operation. > > This is problematic for lossy networks. A very large read request, say 8MiB, sends 2048 packets in response without any acknowledgement > > from the requester. If the packet loss rate was 1% the read request would never finish as the probability of sending 2048 packets without > > loss is very small. The way the code works today is that the iova is adjusted, and if you are lucky the responder has already finished the > > previous read operation and starts over with a new read reply starting with a first packet at iova. If you are less fortunate the previous > > read reply has not finished and the responder will continue to work on it until it is finished before looking at the new read request wqe. > > the completer will respond to each out of order packet by checking to see if it should start a retry but since it has already done so > > it drops the packet. It's messy but one can make forward progress ~100 packets at a time. It would be faster if the responder saw that > > the new request replaced the old one and stopped sending packets on the old read. I have no idea how CX NICs do this but just restarting > > from scratch seems bad. > > I agree that read request indeed needs to adjust iova during retry and > the adjustment(for read) has already done in below logic in req_retry(). > > > if (mask & WR_READ_MASK) { > npsn = (wqe->dma.length - wqe->dma.resid) / > qp->mtu; > wqe->iova += npsn * qp->mtu; > } > > For write request, retry will not send new iova because only first write > packet has RXE_RETH_MASK regardless iova adjustment. > Am I missing something? > > > Thanks, > Chengguang > > > > >
On Mon, May 02, 2022 at 01:39:07AM -0400, Chengguang Xu wrote: > For write request the remote addr will be sent only with first packet > so we don't have to adjust wqe->iova in retry operation. > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> > --- > drivers/infiniband/sw/rxe/rxe_req.c | 2 -- > 1 file changed, 2 deletions(-) Applied to for-next, thanks Jason
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c index ae5fbc79dd5c..f08010651ef7 100644 --- a/drivers/infiniband/sw/rxe/rxe_req.c +++ b/drivers/infiniband/sw/rxe/rxe_req.c @@ -33,8 +33,6 @@ static inline void retry_first_write_send(struct rxe_qp *qp, } else { advance_dma_data(&wqe->dma, to_send); } - if (mask & WR_WRITE_MASK) - wqe->iova += qp->mtu; } }
For write request the remote addr will be sent only with first packet so we don't have to adjust wqe->iova in retry operation. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> --- drivers/infiniband/sw/rxe/rxe_req.c | 2 -- 1 file changed, 2 deletions(-)