Message ID: c4ee3313-9e67-f30a-3679-4c9afab79ac9@grimberg.me (mailing list archive)
State: Not Applicable
On Tuesday, October 18, 2016 at 11:04:00 +0300, Sagi Grimberg wrote:
> Hey Steve and Bharat,
>
> > Hey Sagi, I'm looking at isert_create_qp() and it appears to not be
> > correctly sizing the SQ:
> > ...
> > #define ISERT_QP_MAX_REQ_DTOS	(ISCSI_DEF_XMIT_CMDS_MAX +	\
> > 				ISERT_MAX_TX_MISC_PDUS	+	\
> > 				ISERT_MAX_RX_MISC_PDUS)
> > ...
> > 	attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS + 1;
> > 	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
> > ...
> >
> > I think the above snippet assumes a DTO consumes exactly one WR/WQE in
> > the SQ. But the DTO can be broken into multiple WRs to handle REG_MRs,
> > or multiple WRITE or READ WRs due to limits on local and target sge
> > depths, etc. Yes? Or am I all wet? Or perhaps isert doesn't require the
> > SQ to be the max possible because it flow controls the DTO submissions?
>
> I think you are correct.
>
> On my test devices, I didn't see that multiple WRs had any effect
> because:
> 1. My test devices usually give the next power of 2 (256).
> 2. Workloads that involved multiple rdma operations never stressed the
>    system enough to get the queues full.
>
> Now, in iWARP for non-immediate writes we'll need more than a single
> wr per IO (I think the SQ size is expanded with the new rdma RW API,
> which implicitly increases it with attr.cap.max_rdma_ctxs).

Yes, the rdma RW API factors in attr.cap.max_rdma_ctxs based on the attr
flags.

> But I do agree that we need to take into account that each IO needs
> at least 2 WRs (one for rdma and one for send).
> So a temp bandage would be:
> --
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
> index fc791efe3a10..81afb95aeea9 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.h
> +++ b/drivers/infiniband/ulp/isert/ib_isert.h
> @@ -54,8 +54,14 @@
>
>  #define ISERT_MIN_POSTED_RX	(ISCSI_DEF_XMIT_CMDS_MAX >> 2)
>
> -#define ISERT_QP_MAX_REQ_DTOS	(ISCSI_DEF_XMIT_CMDS_MAX +	\
> -				ISERT_MAX_TX_MISC_PDUS	+	\
> +/*
> + * Max QP send work requests consist of:
> + * - RDMA + SEND for each iscsi IO
> + * - iscsi misc TX pdus
> + * - iscsi misc RX response pdus
> + */
> +#define ISERT_QP_MAX_REQ_DTOS	((ISCSI_DEF_XMIT_CMDS_MAX * 2) +	\
> +				ISERT_MAX_TX_MISC_PDUS	+	\
>  				ISERT_MAX_RX_MISC_PDUS)
>
>  #define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 -	\
> --

I tried out this change and it works fine with iWARP. I don't see SQ
overflow. Apparently we have increased the SQ too much for it to
overflow. I am going to let it run with higher workloads for a longer
time, to see if it holds good.

> But we do need to track the SQ overflow and queue a retransmit work when
> we don't have enough available SQ slots..

Agreed. iscsi-target (LIO in our case) expects a failure to be returned by
the overlying modules. Or it could be that its handling for tcp is
different from that of iser: it queues the WR and schedules it for repost
in case of post failures with ENOMEM/EAGAIN. I shall send the necessary
change if it is needed.

Thanks,
Bharat.

> Thoughts?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
> I tried out this change and it works fine with iwarp. I don't see SQ
> overflow. Apparently we have increased the sq too much for it to
> overflow. I am going to let it run with higher workloads for a longer
> time, to see if it holds good.

Actually, on second thought, this patch is overkill. Effectively we now
set:

MAX_CMD=266
and max_rdma_ctxs=128, so together we take 394, which seems too much.

If we go by the scheme of 1 rdma + 1 send for each IO we need:
- 128 sends
- 128 rdmas
- 10 miscs

so this gives 266.

Perhaps this is due to the fact that iWARP needs to register memory for
rdma reads as well? (and also rdma writes > 128k for chelsio HW, right?)

What is the workload you are running? With immediatedata enabled you
should issue reg+rdma_read+send only for writes > 8k.

Does this happen when you run only reads, for example?

I guess it's time to get the sq accounting into shape...
> > > I tried out this change and it works fine with iwarp. I don't see SQ
> > > overflow. Apparently we have increased the sq too much for it to
> > > overflow. I am going to let it run with higher workloads for a longer
> > > time, to see if it holds good.
> >
> > Actually, on second thought, this patch is overkill. Effectively we
> > now set:
> >
> > MAX_CMD=266
> > and max_rdma_ctxs=128, so together we take 394, which seems too much.
> >
> > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > - 128 sends
> > - 128 rdmas
> > - 10 miscs
> >
> > so this gives 266.
> >
> > Perhaps this is due to the fact that iWARP needs to register memory for
> > rdma reads as well? (and also rdma writes > 128k for chelsio HW, right?)

iWARP definitely needs to register memory for the target of reads, due to
the REMOTE_WRITE requirement of the protocol. The source of a write
doesn't need to register memory, but the SGE depth can cause multiple
WRITE WRs to be required to service the IO. And in theory there should be
some threshold where it might be better performance-wise to do a memory
register + 1 WRITE vs X WRITEs.

As you mentioned, the RW API should account for this, but perhaps it is
still off some. Bharat, have a look into the RDMA-RW API and let us see if
we can figure out whether the additional SQ depth it adds is sufficient.

> > What is the workload you are running? With immediatedata enabled you
> > should issue reg+rdma_read+send only for writes > 8k.
> >
> > Does this happen when you run only reads, for example?
> >
> > I guess it's time to get the sq accounting into shape...

So to sum up, there are 2 issues:

1) We believe the iSER + RW API correctly sizes the SQ, yet we're seeing
SQ overflows. So the SQ sizing needs more investigation.

2) If the SQ is full, then the iSER/target code is supposed to resubmit.
And apparently that isn't working.

Steve.
On Tue, Oct 18, 2016 at 09:34:33AM -0500, Steve Wise wrote:

> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing
> SQ overflows. So the SQ sizing needs more investigation.

NFS had this sort of problem - in that case it was because the code was
assuming that an RQ completion implied SQ space - that is not legal; only
direct completions from SQ WCs can guide available space in the SQ..

Jason
> On Tue, Oct 18, 2016 at 09:34:33AM -0500, Steve Wise wrote:
>
> > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
>
> NFS had this sort of problem - in that case it was because the code
> was assuming that an RQ completion implied SQ space - that is not
> legal; only direct completions from SQ WCs can guide available space
> in the SQ..

Agreed. And this could be the case Bharat is hitting. (I wonder if there
is a clever way we can tell?) I don't see in the code that isert flow
controls SQ WR submission based on SQ completions.

It would be nice to have some core API, maybe in rdma-rw, that could
handle this by queuing and processing when the SQ has resources...

Steve.
Hey Jason,

> > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
>
> NFS had this sort of problem - in that case it was because the code
> was assuming that an RQ completion implied SQ space - that is not
> legal; only direct completions from SQ WCs can guide available space
> in the SQ..

It's not the same problem. iser-target does not tie SQ and RQ spaces.
The origin here is the difference between IB/RoCE and iWARP, and the
chelsio HW that makes it hard to predict the correct SQ size.

iWARP needs extra registration for rdma reads, and the chelsio device
seems to be limited in the number of pages per registration, so this
configuration will need a larger send queue.

Another problem is that we don't have a correct retry flow.

Hopefully we can address that in the RW API, which is designed to hide
these details from the ULP...
Hi Steve, Potnuri, & Co,

On Tue, 2016-10-18 at 09:34 -0500, Steve Wise wrote:
> > > I tried out this change and it works fine with iwarp. I don't see SQ
> > > overflow. Apparently we have increased the sq too much for it to
> > > overflow. I am going to let it run with higher workloads for a longer
> > > time, to see if it holds good.
> >
> > Actually, on second thought, this patch is overkill. Effectively we
> > now set:
> >
> > MAX_CMD=266
> > and max_rdma_ctxs=128, so together we take 394, which seems too much.
> >
> > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > - 128 sends
> > - 128 rdmas
> > - 10 miscs
> >
> > so this gives 266.
> >
> > Perhaps this is due to the fact that iWARP needs to register memory for
> > rdma reads as well? (and also rdma writes > 128k for chelsio HW, right?)
>
> iWARP definitely needs to register memory for the target of reads, due
> to the REMOTE_WRITE requirement of the protocol. The source of a write
> doesn't need to register memory, but the SGE depth can cause multiple
> WRITE WRs to be required to service the IO. And in theory there should
> be some threshold where it might be better performance-wise to do a
> memory register + 1 WRITE vs X WRITEs.
>
> As you mentioned, the RW API should account for this, but perhaps it is
> still off some. Bharat, have a look into the RDMA-RW API and let us see
> if we can figure out whether the additional SQ depth it adds is
> sufficient.
>
> > What is the workload you are running? With immediatedata enabled you
> > should issue reg+rdma_read+send only for writes > 8k.
> >
> > Does this happen when you run only reads, for example?
> >
> > I guess it's time to get the sq accounting into shape...
>
> So to sum up, there are 2 issues:
>
> 1) We believe the iSER + RW API correctly sizes the SQ, yet we're seeing
> SQ overflows. So the SQ sizing needs more investigation.
>
> 2) If the SQ is full, then the iSER/target code is supposed to resubmit.
> And apparently that isn't working.
For #2, target-core expects an -ENOMEM or -EAGAIN return from fabric
driver callbacks to signal its internal queue-full retry logic.
Otherwise, the extra se_cmd->cmd_kref response SCF_ACK_KREF reference is
leaked until session shutdown and/or reinstatement occurs.

AFAICT, Potnuri's earlier hung task with v4.8.y + ABORT_TASK is likely
the earlier v4.1+ regression:

https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e

That said, there is room for improvement in target-core queue-full error
signaling, and in iscsi-target/iser-target callback error propagation.

Sending out a series shortly to address these particular items.
Please have a look.
> > So to sum up, there are 2 issues:
> >
> > 1) We believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
> >
> > 2) If the SQ is full, then the iSER/target code is supposed to
> > resubmit. And apparently that isn't working.
>
> For #2, target-core expects an -ENOMEM or -EAGAIN return from fabric
> driver callbacks to signal its internal queue-full retry logic.
> Otherwise, the extra se_cmd->cmd_kref response SCF_ACK_KREF reference is
> leaked until session shutdown and/or reinstatement occurs.
>
> AFAICT, Potnuri's earlier hung task with v4.8.y + ABORT_TASK is likely
> the earlier v4.1+ regression:
>
> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
>
> That said, there is room for improvement in target-core queue-full error
> signaling, and in iscsi-target/iser-target callback error propagation.
>
> Sending out a series shortly to address these particular items.
> Please have a look.

Hey Nicholas,

Bharat is out until next week. He'll try this all out next week and
report back.

Thanks!

Steve.
On Monday, October 31, 2016 at 09:10:08 +0530, Nicholas A. Bellinger wrote:
> Hi Steve, Potnuri, & Co,
>
> On Tue, 2016-10-18 at 09:34 -0500, Steve Wise wrote:
> > > > I tried out this change and it works fine with iwarp. I don't see
> > > > SQ overflow. Apparently we have increased the sq too much for it to
> > > > overflow. I am going to let it run with higher workloads for a
> > > > longer time, to see if it holds good.
> > >
> > > Actually, on second thought, this patch is overkill. Effectively we
> > > now set:
> > >
> > > MAX_CMD=266
> > > and max_rdma_ctxs=128, so together we take 394, which seems too much.
> > >
> > > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > > - 128 sends
> > > - 128 rdmas
> > > - 10 miscs
> > >
> > > so this gives 266.
> > >
> > > Perhaps this is due to the fact that iWARP needs to register memory
> > > for rdma reads as well? (and also rdma writes > 128k for chelsio HW,
> > > right?)
> >
> > iWARP definitely needs to register memory for the target of reads, due
> > to the REMOTE_WRITE requirement of the protocol. The source of a write
> > doesn't need to register memory, but the SGE depth can cause multiple
> > WRITE WRs to be required to service the IO. And in theory there should
> > be some threshold where it might be better performance-wise to do a
> > memory register + 1 WRITE vs X WRITEs.
> >
> > As you mentioned, the RW API should account for this, but perhaps it
> > is still off some. Bharat, have a look into the RDMA-RW API and let us
> > see if we can figure out whether the additional SQ depth it adds is
> > sufficient.
> >
> > > What is the workload you are running? With immediatedata enabled you
> > > should issue reg+rdma_read+send only for writes > 8k.
> > >
> > > Does this happen when you run only reads, for example?
> > >
> > > I guess it's time to get the sq accounting into shape...
> > So to sum up, there are 2 issues:
> >
> > 1) We believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
> >
> > 2) If the SQ is full, then the iSER/target code is supposed to
> > resubmit. And apparently that isn't working.
>
> For #2, target-core expects an -ENOMEM or -EAGAIN return from fabric
> driver callbacks to signal its internal queue-full retry logic.
> Otherwise, the extra se_cmd->cmd_kref response SCF_ACK_KREF reference is
> leaked until session shutdown and/or reinstatement occurs.
>
> AFAICT, Potnuri's earlier hung task with v4.8.y + ABORT_TASK is likely
> the earlier v4.1+ regression:
>
> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
>
> That said, there is room for improvement in target-core queue-full error
> signaling, and in iscsi-target/iser-target callback error propagation.
>
> Sending out a series shortly to address these particular items.
> Please have a look.

Thanks for the changes, Nicholas. Testing them now.

Thanks,
Bharat.
On Thursday, October 20, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
> Hey Jason,
>
> > > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > > seeing SQ overflows. So the SQ sizing needs more investigation.
> >
> > NFS had this sort of problem - in that case it was because the code
> > was assuming that an RQ completion implied SQ space - that is not
> > legal; only direct completions from SQ WCs can guide available space
> > in the SQ..
>
> It's not the same problem. iser-target does not tie SQ and RQ spaces.
> The origin here is the difference between IB/RoCE and iWARP, and the
> chelsio HW that makes it hard to predict the correct SQ size.
>
> iWARP needs extra registration for rdma reads, and the chelsio device
> seems to be limited in the number of pages per registration, so this
> configuration will need a larger send queue.
>
> Another problem is that we don't have a correct retry flow.
>
> Hopefully we can address that in the RW API, which is designed to hide
> these details from the ULP...

Hi Sagi,

Here is what our further analysis of the SQ dump at the time of overflow
says:

The RDMA read/write API is creating long chains (32 WRs) to handle large
iSCSI READs. For writing the iSCSI default block size of 512KB of data,
iw_cxgb4's advertised max number of sges is 4 pages (~16KB) per WRITE, so
the IO needs a WR chain of 32 WRs (another possible factor is that they
are all unsignalled WRs and are completed only after the next signalled
WR). But apparently rdma_rw_init_qp() assumes that any given IO will take
only 1 WRITE WR to convey the data.

This evidently is incorrect, and rdma_rw_init_qp() needs to factor in and
size the queue based on the device's max_sge for write and read, and on
the sg_tablesize for which rdma read/write is used, like the initiator's
ISCSI_ISER_MAX_SG_TABLESIZE.

If the above analysis is correct, please suggest how this could be fixed.
Further, using MRs for rdma WRITE by setting the rdma_wr_force_mr=1 module
parameter of ib_core avoids the SQ overflow, by registering a single
REG_MR and using that MR for a single WRITE WR. So an rdma-rw IO chain of,
say, 32 WRITE WRs becomes just 3 WRs: REG_MR + WRITE + INV_MR, as
iw_cxgb4's max_fast_reg_page_list_len is 128 pages.

(By default force_mr is not set, and per rdma_rw_io_needs_mr(), iw_cxgb4
can only use an MR for rdma READs if force_mr isn't set.)

From this, is there any possibility that we could use an MR if the WRITE
WR chain exceeds a certain number?

Thanks for your time!

-Bharat.
+Christoph

> On Thursday, October 20, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
> > Hey Jason,
> >
> > > > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > > > seeing SQ overflows. So the SQ sizing needs more investigation.
> > >
> > > NFS had this sort of problem - in that case it was because the code
> > > was assuming that an RQ completion implied SQ space - that is not
> > > legal; only direct completions from SQ WCs can guide available space
> > > in the SQ..
> >
> > It's not the same problem. iser-target does not tie SQ and RQ spaces.
> > The origin here is the difference between IB/RoCE and iWARP, and the
> > chelsio HW that makes it hard to predict the correct SQ size.
> >
> > iWARP needs extra registration for rdma reads, and the chelsio device
> > seems to be limited in the number of pages per registration, so this
> > configuration will need a larger send queue.
> >
> > Another problem is that we don't have a correct retry flow.
> >
> > Hopefully we can address that in the RW API, which is designed to hide
> > these details from the ULP...
>
> Hi Sagi,
>
> Here is what our further analysis of the SQ dump at the time of overflow
> says:
>
> The RDMA read/write API is creating long chains (32 WRs) to handle large
> iSCSI READs. For writing the iSCSI default block size of 512KB of data,
> iw_cxgb4's advertised max number of sges is 4 pages (~16KB) per WRITE,
> so the IO needs a WR chain of 32 WRs (another possible factor is that
> they are all unsignalled WRs and are completed only after the next
> signalled WR). But apparently rdma_rw_init_qp() assumes that any given
> IO will take only 1 WRITE WR to convey the data.
>
> This evidently is incorrect, and rdma_rw_init_qp() needs to factor in
> and size the queue based on the device's max_sge for write and read, and
> on the sg_tablesize for which rdma read/write is used, like the
> initiator's ISCSI_ISER_MAX_SG_TABLESIZE. If the above analysis is
> correct, please suggest how this could be fixed.
> Further, using MRs for rdma WRITE by setting the rdma_wr_force_mr=1
> module parameter of ib_core avoids the SQ overflow, by registering a
> single REG_MR and using that MR for a single WRITE WR. So an rdma-rw IO
> chain of, say, 32 WRITE WRs becomes just 3 WRs: REG_MR + WRITE + INV_MR,
> as iw_cxgb4's max_fast_reg_page_list_len is 128 pages.
>
> (By default force_mr is not set, and per rdma_rw_io_needs_mr(), iw_cxgb4
> can only use an MR for rdma READs if force_mr isn't set.)
>
> From this, is there any possibility that we could use an MR if the WRITE
> WR chain exceeds a certain number?
>
> Thanks for your time!

I think it is time to resolve this XXX comment in rw.c for
rdma_rw_io_needs_mr():

/*
 * Check if the device will use memory registration for this RW operation.
 * We currently always use memory registrations for iWarp RDMA READs, and
 * have a debug option to force usage of MRs.
 * XXX: In the future we can hopefully fine tune this based on HCA driver
 * input.
 */

Regardless of whether the HCA driver provides input, I think 30+ RDMA
WRITE WR chains aren't as efficient as 1 REG_MR + 1 WRITE + 1 INV_MR. Is
it unreasonable to just add some threshold in rw.c?

Also, I think rdma_rw_init_qp() does need some tweaks: it needs to take
into account the max sge depth, the max REG_MR depth, and the max SQ depth
device attributes/capabilities when sizing the SQ. However, if the
computed depth exceeds the device max, then the SQ will not be big enough
to avoid potential overflows, and I believe ULPs should _always_ flow
control their outgoing WRs based on the SQ depth regardless. And perhaps
rdma-rw should even avoid overly deep SQs just because that tends to
inhibit scalability, e.g. allowing lots of shallow QPs vs consuming all
the device resources with very deep QPs...

Steve.
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index fc791efe3a10..81afb95aeea9 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -54,8 +54,14 @@
 
 #define ISERT_MIN_POSTED_RX	(ISCSI_DEF_XMIT_CMDS_MAX >> 2)
 
-#define ISERT_QP_MAX_REQ_DTOS	(ISCSI_DEF_XMIT_CMDS_MAX +	\
-				ISERT_MAX_TX_MISC_PDUS	+	\
+/*
+ * Max QP send work requests consist of:
+ * - RDMA + SEND for each iscsi IO
+ * - iscsi misc TX pdus
+ * - iscsi misc RX response pdus
+ */
+#define ISERT_QP_MAX_REQ_DTOS	((ISCSI_DEF_XMIT_CMDS_MAX * 2) +	\
+				ISERT_MAX_TX_MISC_PDUS	+	\
 				ISERT_MAX_RX_MISC_PDUS)
 
 #define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 -	\