Message ID | 945e2947-f67a-4202-cd27-d4631fe10f68@grimberg.me (mailing list archive) |
---|---|
State | Superseded |
On Tuesday, March 21, 2017 at 19:22:30 +0530, Sagi Grimberg wrote:

Hi Sagi,
The patch works fine!

> Hi Bharat and Nic,
>
> Apologies for the late reply,
>
> > Hi Nicholas,
> > I see them from 2MB onwards.
> >>
> >> > Here is what I see with the 3 patches alone applied:
> >> >
> >> > -> In isert_put_datain() and isert_post_response() a corresponding
> >> >    recv WR is posted before posting a send, and hence for every post
> >> >    failure a recv is already posted into a tightly packed RQ, causing
> >> >    it to overflow.
> >>
> >> Just for me to understand, the intermittent TMR ABORT_TASKs are caused
> >> by the repeated failure to post RDMA_WRITE WRs for a large ISER Data-In
> >> payload, due to mis-sizing of needed WRs from the RDMA R/W API vs.
> >> underlying hardware capabilities.
> >
> > Yes.
> >>
> >> Moving the recv posts after the send post for RDMA_WRITEs helps to
> >> reduce the severity of the issue with iw_cxgb4, but doesn't completely
> >> eliminate the issue under load.
> >
> > Moving recv posts only comes into effect along with your changes.
> > ...
>
> >> So the reason why your patch to swap post_recv -> post_send to
> >> post_send -> post_recv makes a difference is because it allows enough
> >> trickle of RDMA_WRITEs to make it through, where iser-initiator doesn't
> >> attempt to escalate recovery and doesn't attempt session reinstatement.
> >
> > I don't know exactly whether the above comes into play, but the actual
> > reason I swapped posting the RQ and SQ is that, unlike the SQ, the RQ is
> > filled with WRs to the brim during initialisation itself. From then on
> > we post an RQ WR for every RQ completion, which keeps the RQ almost full
> > at any point in time.
> >
> > Now in our scenario the SQ is miscalculated and too small for a few
> > adapters, so it fills up gradually as the IO starts. Once the SQ is
> > full, according to your patches isert queues the command and tries to
> > repost it. In functions like isert_post_response() and
> > isert_put_datain(), the post send is done after the post recv. So for
> > the first post send failure in, say, isert_put_datain(), the
> > corresponding post recv has already been posted; then, when the command
> > is queued and reposted, an extra recv is posted again, which fills up
> > the RQ as well.
> >
> > By swapping the post recv and post send as in my incremental patch, we
> > don't post that extra recv, and we post a recv only on a successful post
> > send. Therefore I think this incremental patch is necessary.
>
> Reversing the order of the recv and send posting will cause problems in
> stress IO workloads (especially for iWARP). The problem with sending a
> reply before reposting the recv buffer is that the initiator can
> immediately send a new request and we don't have a recv buffer waiting
> for it, which will cause RNR-NAK. This *will* cause performance drops
> and jitters for sure.

Totally agree with you.
> How about we just track the rx_desc to know if we already posted it as
> a start (untested as I don't have access to RDMA HW this week):
> --
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
> index 9b33c0c97468..fcbed35e95a8 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.c
> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> @@ -817,6 +817,7 @@ isert_post_recvm(struct isert_conn *isert_conn, u32 count)
>  		rx_wr->sg_list = &rx_desc->rx_sg;
>  		rx_wr->num_sge = 1;
>  		rx_wr->next = rx_wr + 1;
> +		rx_desc->in_use = false;
>  	}
>  	rx_wr--;
>  	rx_wr->next = NULL; /* mark end of work requests list */
> @@ -835,6 +836,15 @@ isert_post_recv(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc)
>  	struct ib_recv_wr *rx_wr_failed, rx_wr;
>  	int ret;
>
> +	if (!rx_desc->in_use) {
> +		/*
> +		 * if the descriptor is not in-use we already reposted it
> +		 * for recv, so just silently return
> +		 */
> +		return 0;
> +	}
> +
> +	rx_desc->in_use = false;
>  	rx_wr.wr_cqe = &rx_desc->rx_cqe;
>  	rx_wr.sg_list = &rx_desc->rx_sg;
>  	rx_wr.num_sge = 1;
> @@ -1397,6 +1407,8 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>  		return;
>  	}
>
> +	rx_desc->in_use = true;
> +
>  	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
>  			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
>
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
> index c02ada57d7f5..87d994de8c91 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.h
> +++ b/drivers/infiniband/ulp/isert/ib_isert.h
> @@ -60,7 +60,7 @@
>
>  #define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
>  		(ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct ib_sge) + \
> -		 sizeof(struct ib_cqe)))
> +		 sizeof(struct ib_cqe) + sizeof(bool)))
>
>  #define ISCSI_ISER_SG_TABLESIZE		256
>
> @@ -85,6 +85,7 @@ struct iser_rx_desc {
>  	u64		dma_addr;
>  	struct ib_sge	rx_sg;
>  	struct ib_cqe	rx_cqe;
> +	bool		in_use;
>  	char		pad[ISER_RX_PAD_SIZE];
>  } __packed;
> --
>
> We have a lot of room for cleanups in isert... I'll need to
> make some time to get it going...
>
> I'll be waiting to hear from you if it makes your issue go away.

Test runs fine and it solved the issue.
Thanks for the patch!

> Cheers,
> Sagi.

Thanks,
Bharat.
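[Aside, not part of the thread: a minimal sketch of the call pattern being discussed above. It is hypothetical and untested; the helper and members it uses (isert_post_recv(), isert_cmd->rx_desc, isert_cmd->tx_desc.send_wr, isert_conn->qp) are assumed to roughly match the ib_isert.c of this era. It only illustrates how the in_use flag makes a requeued send attempt harmless.]

/* Sketch only -- not from Sagi's patch. Member names are assumptions. */
static int isert_send_reply_sketch(struct isert_conn *isert_conn,
				   struct isert_cmd *isert_cmd)
{
	struct ib_send_wr *bad_wr;
	int ret;

	/*
	 * Post (or re-post) the recv descriptor first.  With the in_use
	 * flag, a second call for the same rx_desc on a requeued command
	 * simply returns 0 instead of pushing another WR into an RQ that
	 * was already filled to the brim at initialisation.
	 */
	ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
	if (ret)
		return ret;

	/* If this fails, the command is queued and this path is retried. */
	return ib_post_send(isert_conn->qp, &isert_cmd->tx_desc.send_wr,
			    &bad_wr);
}

With this guard the response paths can keep posting the recv before the send (avoiding the RNR-NAK concern raised above), while a retry of the same command no longer adds a second recv WR to the already-full RQ.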
> Test runs fine and it solved the issue.
> Thanks for the patch!

Cool, can I add your "Tested-by" tag on a formal patch then?
On Tuesday, March 21, 2017 at 22:08:04 +0530, Sagi Grimberg wrote:
> > Test runs fine and it solved the issue.
> > Thanks for the patch!
>
> Cool, can I add your "Tested-by" tag on a formal patch then?

Sure, please add.

Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>

Thanks.
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 9b33c0c97468..fcbed35e95a8 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -817,6 +817,7 @@ isert_post_recvm(struct isert_conn *isert_conn, u32 count)
 		rx_wr->sg_list = &rx_desc->rx_sg;
 		rx_wr->num_sge = 1;
 		rx_wr->next = rx_wr + 1;
+		rx_desc->in_use = false;
 	}
 	rx_wr--;
 	rx_wr->next = NULL; /* mark end of work requests list */
@@ -835,6 +836,15 @@ isert_post_recv(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc)
 	struct ib_recv_wr *rx_wr_failed, rx_wr;
 	int ret;

+	if (!rx_desc->in_use) {
+		/*
+		 * if the descriptor is not in-use we already reposted it
+		 * for recv, so just silently return
+		 */
+		return 0;
+	}
+
+	rx_desc->in_use = false;
 	rx_wr.wr_cqe = &rx_desc->rx_cqe;
 	rx_wr.sg_list = &rx_desc->rx_sg;
 	rx_wr.num_sge = 1;
@@ -1397,6 +1407,8 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}

+	rx_desc->in_use = true;
+
 	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
 			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);

diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index c02ada57d7f5..87d994de8c91 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -60,7 +60,7 @@

 #define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
 		(ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct ib_sge) + \
-		 sizeof(struct ib_cqe)))
+		 sizeof(struct ib_cqe) + sizeof(bool)))

 #define ISCSI_ISER_SG_TABLESIZE		256
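A note on the header hunk: ISER_RX_PAD_SIZE appears to be sized so that struct iser_rx_desc keeps a fixed overall footprint, which is why adding the bool in_use member also subtracts sizeof(bool) inside the macro. Assuming the descriptor's leading members (the headers plus the ISCSI_DEF_MAX_RECV_SEG_LEN data buffer, not shown in this hunk) add up to ISER_RX_PAYLOAD_SIZE, a hypothetical compile-time check along these lines, which is not part of the patch, would catch a future member added without a matching pad adjustment:

#include <linux/bug.h>	/* BUILD_BUG_ON() */

/*
 * Hypothetical, not part of the patch: could sit in any ib_isert.c
 * function (e.g. near the top of isert_alloc_rx_descriptors()).  It
 * documents the assumed invariant that the pad keeps the __packed
 * descriptor at a constant size regardless of the bookkeeping members
 * placed in front of the pad.
 */
static inline void isert_rx_desc_size_check(void)
{
	BUILD_BUG_ON(sizeof(struct iser_rx_desc) !=
		     ISCSI_DEF_MAX_RECV_SEG_LEN + 4096);
}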