
[RFC] RDMA/rxe: skip adjusting remote addr for write in retry operation

Message ID 20220502053907.6388-1-cgxu519@mykernel.net (mailing list archive)
State Accepted
Delegated to: Jason Gunthorpe

Commit Message

Chengguang Xu May 2, 2022, 5:39 a.m. UTC
For a write request the remote address is sent only with the first packet,
so we don't have to adjust wqe->iova in the retry operation.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 drivers/infiniband/sw/rxe/rxe_req.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Pearson, Robert B May 2, 2022, 5:15 p.m. UTC | #1
> -----Original Message-----
> From: Chengguang Xu <cgxu519@mykernel.net> 
> Sent: Monday, May 2, 2022 12:39 AM
> To: zyjzyj2000@gmail.com; jgg@ziepe.ca; leon@kernel.org
> Cc: linux-rdma@vger.kernel.org; Chengguang Xu <cgxu519@mykernel.net>
> Subject: [RFC PATCH] RDMA/rxe: skip adjusting remote addr for write in retry operation
> 
> For write request the remote addr will be sent only with first packet so we don't have to adjust wqe->iova in retry operation.

This is problematic for lossy networks. A very large read request, say 8 MiB, sends 2048 packets in response without any acknowledgement
from the requester. If the packet loss rate were 1%, the read request would never finish, because the probability of sending 2048 packets without
loss is very small. The way the code works today is that the iova is adjusted, and if you are lucky the responder has already finished the
previous read operation and starts over with a new read reply, beginning with a first packet at iova. If you are less fortunate, the previous
read reply has not finished, and the responder will keep working on it until it is done before looking at the new read request wqe.
The completer responds to each out-of-order packet by checking whether it should start a retry, but since it has already done so,
it drops the packet. It's messy, but one can make forward progress ~100 packets at a time. It would be faster if the responder saw that
the new request replaced the old one and stopped sending packets on the old read. I have no idea how CX NICs do this, but just restarting
from scratch seems bad.

Bob
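
As a rough check of the numbers in the reply above, the sketch below works through the arithmetic. It assumes a 4096-byte MTU (which is what makes 8 MiB come to 2048 packets) and independent per-packet losses; both are assumptions for illustration, and the function names are hypothetical rather than anything from the rxe driver.

```c
#include <assert.h>
#include <stddef.h>

/* Number of MTU-sized packets needed to carry length bytes (rounded up). */
static size_t packets_for(size_t length, size_t mtu)
{
	assert(mtu > 0);
	return (length + mtu - 1) / mtu;
}

/*
 * Probability that npackets consecutive packets all arrive, assuming each
 * one is lost independently with probability loss_rate (real networks
 * often lose packets in bursts, so this is only a rough model).
 */
static double prob_no_loss(double loss_rate, size_t npackets)
{
	double prob = 1.0;

	assert(loss_rate >= 0.0 && loss_rate <= 1.0);
	for (size_t i = 0; i < npackets; i++)
		prob *= 1.0 - loss_rate;
	return prob;
}

/* Mean number of packets delivered before the first loss: (1 - p) / p. */
static double mean_run_before_loss(double loss_rate)
{
	assert(loss_rate > 0.0 && loss_rate <= 1.0);
	return (1.0 - loss_rate) / loss_rate;
}
```

Under these assumptions, prob_no_loss(0.01, packets_for(8 << 20, 4096)) is on the order of 1e-9, and mean_run_before_loss(0.01) is about 99, which matches the "~100 packets at a time" estimate.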


Chengguang Xu May 3, 2022, 3:04 p.m. UTC | #2
On 2022/5/3 1:15, Pearson, Robert B wrote:
>> For write request the remote addr will be sent only with first packet so we don't have to adjust wqe->iova in retry operation.
> This is problematic for lossy networks. A very large read request, say 8 MiB, sends 2048 packets in response without any acknowledgement
> from the requester. If the packet loss rate were 1%, the read request would never finish, because the probability of sending 2048 packets without
> loss is very small. The way the code works today is that the iova is adjusted, and if you are lucky the responder has already finished the
> previous read operation and starts over with a new read reply, beginning with a first packet at iova. If you are less fortunate, the previous
> read reply has not finished, and the responder will keep working on it until it is done before looking at the new read request wqe.
> The completer responds to each out-of-order packet by checking whether it should start a retry, but since it has already done so,
> it drops the packet. It's messy, but one can make forward progress ~100 packets at a time. It would be faster if the responder saw that
> the new request replaced the old one and stopped sending packets on the old read. I have no idea how CX NICs do this, but just restarting
> from scratch seems bad.

I agree that a read request does need to adjust iova during retry, and
that adjustment (for reads) is already done by the logic below in req_retry().


if (mask & WR_READ_MASK) {
	npsn = (wqe->dma.length - wqe->dma.resid) /
		qp->mtu;
	wqe->iova += npsn * qp->mtu;
}

For a write request, the retry will not send a new iova, because only the
first write packet has RXE_RETH_MASK, regardless of the iova adjustment.
Am I missing something?


Thanks,
Chengguang
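
The read-side adjustment quoted from req_retry() in the message above can be tried in isolation with the minimal sketch below. The struct is a hypothetical stand-in holding only the fields the excerpt touches; it is not the driver's real wqe layout, and the function name is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few wqe fields used in the excerpt. */
struct demo_wqe {
	uint64_t iova;   /* remote address that goes into the RETH */
	uint32_t length; /* total bytes requested (wqe->dma.length) */
	uint32_t resid;  /* bytes still outstanding (wqe->dma.resid) */
};

/*
 * Mirrors the quoted read branch: skip the whole MTU-sized chunks that have
 * already arrived, so the retried read starts at the first missing chunk.
 * Returns the number of chunks skipped.
 */
static uint32_t demo_adjust_read_iova(struct demo_wqe *wqe, uint32_t mtu)
{
	uint32_t npsn;

	assert(mtu > 0 && wqe->resid <= wqe->length);
	npsn = (wqe->length - wqe->resid) / mtu;
	wqe->iova += (uint64_t)npsn * mtu;
	return npsn;
}
```

For example, if 100 packets of an 8 MiB read have arrived at a 4096-byte MTU, the retried request starts 100 * 4096 bytes past the original remote address. A retried read is a new request with its own RETH, which is why the address has to move for reads.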
Bob Pearson May 4, 2022, 3:25 p.m. UTC | #3
Sorry, I misread your original post. You are correct. The wqe->iova is
only used to fill in the RETH header, so it is not needed after
the first packet.
This commit looks OK.
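
The point about the RETH can be illustrated with a minimal sketch of how a multi-packet RDMA write is split: only the first (or only) packet of the message carries the RETH, so a retry that resumes partway through the message sends middle and last packets, which never carry the remote address. The enum and helper names are hypothetical and are not the rxe driver's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Position of a packet within one RDMA write message (hypothetical names). */
enum demo_pkt_pos { DEMO_FIRST, DEMO_MIDDLE, DEMO_LAST, DEMO_ONLY };

/* Classify packet number index (0-based) of a write of length bytes. */
static enum demo_pkt_pos demo_write_pkt_pos(uint32_t length, uint32_t mtu,
					    uint32_t index)
{
	uint64_t npkts;

	assert(mtu > 0);
	/* A zero-length write still takes one packet. */
	npkts = length ? ((uint64_t)length + mtu - 1) / mtu : 1;
	assert(index < npkts);

	if (npkts == 1)
		return DEMO_ONLY;
	if (index == 0)
		return DEMO_FIRST;
	if (index == npkts - 1)
		return DEMO_LAST;
	return DEMO_MIDDLE;
}

/* Only the first or only packet of a write carries the RETH. */
static bool demo_write_pkt_has_reth(enum demo_pkt_pos pos)
{
	return pos == DEMO_FIRST || pos == DEMO_ONLY;
}
```

With a three-packet write, only index 0 reports a RETH. A retry that resumes at index 1 or 2 therefore never puts an address on the wire, so advancing wqe->iova for the skipped packets has no visible effect.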

Jason Gunthorpe May 6, 2022, 4:15 p.m. UTC | #4
On Mon, May 02, 2022 at 01:39:07AM -0400, Chengguang Xu wrote:
> For write request the remote addr will be sent only with first packet
> so we don't have to adjust wqe->iova in retry operation.
> 
> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> ---
>  drivers/infiniband/sw/rxe/rxe_req.c | 2 --
>  1 file changed, 2 deletions(-)

Applied to for-next, thanks

Jason

Patch

diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index ae5fbc79dd5c..f08010651ef7 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -33,8 +33,6 @@  static inline void retry_first_write_send(struct rxe_qp *qp,
 		} else {
 			advance_dma_data(&wqe->dma, to_send);
 		}
-		if (mask & WR_WRITE_MASK)
-			wqe->iova += qp->mtu;
 	}
 }