Message ID: c4ee3313-9e67-f30a-3679-4c9afab79ac9@grimberg.me (mailing list archive)
State: Not Applicable
On Tuesday, October 18, 2016 at 11:04:00 +0300, Sagi Grimberg wrote:
> Hey Steve and Bharat,
>
> > Hey Sagi, I'm looking at isert_create_qp() and it appears to not be
> > correctly sizing the SQ:
> > ...
> > #define ISERT_QP_MAX_REQ_DTOS	(ISCSI_DEF_XMIT_CMDS_MAX +	\
> > 				ISERT_MAX_TX_MISC_PDUS	+	\
> > 				ISERT_MAX_RX_MISC_PDUS)
> > ...
> > 	attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS + 1;
> > 	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
> > ...
> >
> > I think the above snippet assumes a DTO consumes exactly one WR/WQE in
> > the SQ. But the DTO can be broken into multiple WRs to handle REG_MRs,
> > or multiple WRITE or READ WRs due to limits on local and target sge
> > depths, etc. Yes? Or am I all wet? Or perhaps isert doesn't require the
> > SQ to be the max possible because it flow controls the DTO submissions?
>
> I think you are correct.
>
> On my test devices, I didn't see that multiple WRs had any effect
> because:
> 1. My test devices usually give the next power of 2 (256).
> 2. Workloads that involved multiple rdma operations never stressed the
>    system enough to get the queues full.
>
> Now, in iWARP for non-immediate writes we'll need more than a single
> wr per IO (I think the SQ size is expanded with the new rdma RW API,
> which implicitly increases it with attr.cap.max_rdma_ctxs).

Yes, the rdma RW API factors in attr.cap.max_rdma_ctxs based on the attr
flags.

> But I do agree that we need to take into account that each IO needs
> at least 2 WRs (one for rdma and one for send).
> So a temp bandage would be:
> --
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
> index fc791efe3a10..81afb95aeea9 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.h
> +++ b/drivers/infiniband/ulp/isert/ib_isert.h
> @@ -54,8 +54,14 @@
>
>  #define ISERT_MIN_POSTED_RX	(ISCSI_DEF_XMIT_CMDS_MAX >> 2)
>
> -#define ISERT_QP_MAX_REQ_DTOS	(ISCSI_DEF_XMIT_CMDS_MAX +	\
> -				ISERT_MAX_TX_MISC_PDUS	+	\
> +/*
> + * Max QP send work requests consist of:
> + * - RDMA + SEND for each iscsi IO
> + * - iscsi misc TX pdus
> + * - iscsi misc RX response pdus
> + */
> +#define ISERT_QP_MAX_REQ_DTOS	((ISCSI_DEF_XMIT_CMDS_MAX * 2) +	\
> +				ISERT_MAX_TX_MISC_PDUS	+	\
>  				ISERT_MAX_RX_MISC_PDUS)
>
>  #define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 -	\
> --

I tried out this change and it works fine with iWARP. I don't see SQ
overflow. Apparently we have increased the SQ too much for it to
overflow. I am going to let it run with higher workloads for a longer
time, to see if it holds good.

> But we do need to track the SQ overflow and queue a retransmit work when
> we don't have enough available SQ slots..

Agreed. iscsi-target (LIO in our case) expects a failure to be returned by
the overlying modules. Or it could be that its handling for tcp is
different from that of iser: it queues the WR and schedules it for repost
in case of post failures with ENOMEM/EAGAIN. I shall send the necessary
change if it is needed.

Thanks,
Bharat.

> Thoughts?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
> I tried out this change and it works fine with iwarp. I don't see SQ
> overflow. Apparently we have increased the sq too much for it to
> overflow. I am going to let it run with higher workloads for a longer
> time, to see if it holds good.

Actually, on second thought, this patch is overkill. Effectively we now
set:

MAX_CMD=266
and max_rdma_ctxs=128, so together we take 394, which seems too much.

If we go by the scheme of 1 rdma + 1 send for each IO we need:
- 128 sends
- 128 rdmas
- 10 miscs

so this gives 266.

Perhaps this is due to the fact that iWARP needs to register memory for
rdma reads as well? (and also rdma writes > 128k for chelsio HW, right?)

What is the workload you are running? With immediatedata enabled you
should issue reg+rdma_read+send only for writes > 8k.

Does this happen when you run only reads, for example?

I guess it's time to get the sq accounting into shape...
> > > I tried out this change and it works fine with iwarp. I don't see SQ
> > > overflow. Apparently we have increased the sq too much for it to
> > > overflow. I am going to let it run with higher workloads for a longer
> > > time, to see if it holds good.
> >
> > Actually, on second thought, this patch is overkill. Effectively we
> > now set:
> >
> > MAX_CMD=266
> > and max_rdma_ctxs=128, so together we take 394, which seems too much.
> >
> > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > - 128 sends
> > - 128 rdmas
> > - 10 miscs
> >
> > so this gives 266.
> >
> > Perhaps this is due to the fact that iWARP needs to register memory for
> > rdma reads as well? (and also rdma writes > 128k for chelsio HW, right?)

iWARP definitely needs to register memory for the target of reads, due to
the REMOTE_WRITE requirement of the protocol. The source of a write
doesn't need to register memory, but the SGE depth can cause multiple
WRITE WRs to be required to service the IO. And in theory there should be
some threshold where it might be better performance-wise to do a memory
register + 1 WRITE vs X WRITEs.

As you mentioned, the RW API should account for this, but perhaps it is
still off some. Bharat, have a look into the RDMA-RW API and let us see if
we can figure out whether the additional SQ depth it adds is sufficient.

> > What is the workload you are running? With immediatedata enabled you
> > should issue reg+rdma_read+send only for writes > 8k.
> >
> > Does this happen when you run only reads, for example?
> >
> > I guess it's time to get the sq accounting into shape...

So to sum up, there are 2 issues:

1) We believe the iSER + RW API correctly sizes the SQ, yet we're seeing
SQ overflows. So the SQ sizing needs more investigation.

2) If the SQ is full, then the iSER/target code is supposed to resubmit.
And apparently that isn't working.

Steve.
On Tue, Oct 18, 2016 at 09:34:33AM -0500, Steve Wise wrote:

> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing
> SQ overflows. So the SQ sizing needs more investigation.

NFS had this sort of problem - in that case it was because the code was
assuming that an RQ completion implied SQ space - that is not legal; only
direct completions from SQ WCs can guide available space in the SQ..

Jason
> On Tue, Oct 18, 2016 at 09:34:33AM -0500, Steve Wise wrote:
>
> > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
>
> NFS had this sort of problem - in that case it was because the code
> was assuming that an RQ completion implied SQ space - that is not
> legal; only direct completions from SQ WCs can guide available space
> in the SQ..

Agreed. And this could be the case Bharat is hitting. (I wonder if there
is a clever way we can tell?) I don't see in the code that isert flow
controls SQ WR submission based on SQ completions.

It would be nice to have some core API, maybe in rdma-rw, that could
handle this by queuing and processing when the SQ has resources...

Steve.
Hey Jason,

> > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
>
> NFS had this sort of problem - in that case it was because the code
> was assuming that an RQ completion implied SQ space - that is not
> legal; only direct completions from SQ WCs can guide available space
> in the SQ..

It's not the same problem. iser-target does not tie SQ and RQ spaces.
The origin here is the difference between IB/RoCE and iWARP, and the
chelsio HW that makes it hard to predict the correct SQ size.

iWARP needs extra registration for rdma reads, and the chelsio device
seems to be limited in the number of pages per registration, so this
configuration will need a larger send queue.

Another problem is that we don't have a correct retry flow.

Hopefully we can address that in the RW API, which is designed to hide
these details from the ULP...
Hi Steve, Potnuri, & Co,

On Tue, 2016-10-18 at 09:34 -0500, Steve Wise wrote:
> > > I tried out this change and it works fine with iwarp. I don't see SQ
> > > overflow. Apparently we have increased the sq too much for it to
> > > overflow. I am going to let it run with higher workloads for a longer
> > > time, to see if it holds good.
> >
> > Actually, on second thought, this patch is overkill. Effectively we
> > now set:
> >
> > MAX_CMD=266
> > and max_rdma_ctxs=128, so together we take 394, which seems too much.
> >
> > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > - 128 sends
> > - 128 rdmas
> > - 10 miscs
> >
> > so this gives 266.
> >
> > Perhaps this is due to the fact that iWARP needs to register memory for
> > rdma reads as well? (and also rdma writes > 128k for chelsio HW, right?)
>
> iWARP definitely needs to register memory for the target of reads, due
> to the REMOTE_WRITE requirement of the protocol. The source of a write
> doesn't need to register memory, but the SGE depth can cause multiple
> WRITE WRs to be required to service the IO. And in theory there should
> be some threshold where it might be better performance-wise to do a
> memory register + 1 WRITE vs X WRITEs.
>
> As you mentioned, the RW API should account for this, but perhaps it is
> still off some. Bharat, have a look into the RDMA-RW API and let us see
> if we can figure out whether the additional SQ depth it adds is
> sufficient.
>
> > What is the workload you are running? With immediatedata enabled you
> > should issue reg+rdma_read+send only for writes > 8k.
> >
> > Does this happen when you run only reads, for example?
> >
> > I guess it's time to get the sq accounting into shape...
>
> So to sum up, there are 2 issues:
>
> 1) We believe the iSER + RW API correctly sizes the SQ, yet we're seeing
> SQ overflows. So the SQ sizing needs more investigation.
>
> 2) If the SQ is full, then the iSER/target code is supposed to resubmit.
> And apparently that isn't working.
For #2, target-core expects an -ENOMEM or -EAGAIN return from fabric
driver callbacks to signal its internal queue-full retry logic.
Otherwise, the extra se_cmd->cmd_kref response SCF_ACK_KREF reference is
leaked until session shutdown and/or reinstatement occurs.

AFAICT, Potnuri's earlier hung task with v4.8.y + ABORT_TASK is likely
the earlier v4.1+ regression:

https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e

That said, there is room for improvement in target-core queue-full error
signaling, and in iscsi-target/iser-target callback error propagation.

Sending out a series shortly to address these particular items.
Please have a look.
> > So to sum up, there are 2 issues:
> >
> > 1) We believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
> >
> > 2) If the SQ is full, then the iSER/target code is supposed to
> > resubmit. And apparently that isn't working.
>
> For #2, target-core expects an -ENOMEM or -EAGAIN return from fabric
> driver callbacks to signal its internal queue-full retry logic.
> Otherwise, the extra se_cmd->cmd_kref response SCF_ACK_KREF reference is
> leaked until session shutdown and/or reinstatement occurs.
>
> AFAICT, Potnuri's earlier hung task with v4.8.y + ABORT_TASK is likely
> the earlier v4.1+ regression:
>
> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
>
> That said, there is room for improvement in target-core queue-full error
> signaling, and in iscsi-target/iser-target callback error propagation.
>
> Sending out a series shortly to address these particular items.
> Please have a look.

Hey Nicholas,

Bharat is out until next week. He'll try this all out next week and
report back.

Thanks!

Steve.
On Monday, October 31, 2016 at 09:10:08 +0530, Nicholas A. Bellinger wrote:
> Hi Steve, Potnuri, & Co,
>
> On Tue, 2016-10-18 at 09:34 -0500, Steve Wise wrote:
> > > > I tried out this change and it works fine with iwarp. I don't see
> > > > SQ overflow. Apparently we have increased the sq too much for it to
> > > > overflow. I am going to let it run with higher workloads for a
> > > > longer time, to see if it holds good.
> > >
> > > Actually, on second thought, this patch is overkill. Effectively we
> > > now set:
> > >
> > > MAX_CMD=266
> > > and max_rdma_ctxs=128, so together we take 394, which seems too much.
> > >
> > > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > > - 128 sends
> > > - 128 rdmas
> > > - 10 miscs
> > >
> > > so this gives 266.
> > >
> > > Perhaps this is due to the fact that iWARP needs to register memory
> > > for rdma reads as well? (and also rdma writes > 128k for chelsio HW,
> > > right?)
> >
> > iWARP definitely needs to register memory for the target of reads, due
> > to the REMOTE_WRITE requirement of the protocol. The source of a write
> > doesn't need to register memory, but the SGE depth can cause multiple
> > WRITE WRs to be required to service the IO. And in theory there should
> > be some threshold where it might be better performance-wise to do a
> > memory register + 1 WRITE vs X WRITEs.
> >
> > As you mentioned, the RW API should account for this, but perhaps it
> > is still off some. Bharat, have a look into the RDMA-RW API and let us
> > see if we can figure out whether the additional SQ depth it adds is
> > sufficient.
> >
> > > What is the workload you are running? With immediatedata enabled you
> > > should issue reg+rdma_read+send only for writes > 8k.
> > >
> > > Does this happen when you run only reads, for example?
> > >
> > > I guess it's time to get the sq accounting into shape...
> > So to sum up, there are 2 issues:
> >
> > 1) We believe the iSER + RW API correctly sizes the SQ, yet we're
> > seeing SQ overflows. So the SQ sizing needs more investigation.
> >
> > 2) If the SQ is full, then the iSER/target code is supposed to
> > resubmit. And apparently that isn't working.
>
> For #2, target-core expects an -ENOMEM or -EAGAIN return from fabric
> driver callbacks to signal its internal queue-full retry logic.
> Otherwise, the extra se_cmd->cmd_kref response SCF_ACK_KREF reference is
> leaked until session shutdown and/or reinstatement occurs.
>
> AFAICT, Potnuri's earlier hung task with v4.8.y + ABORT_TASK is likely
> the earlier v4.1+ regression:
>
> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
>
> That said, there is room for improvement in target-core queue-full error
> signaling, and in iscsi-target/iser-target callback error propagation.
>
> Sending out a series shortly to address these particular items.
> Please have a look.

Thanks for the changes, Nicholas. Testing them now.

Thanks,
Bharat.
On Thursday, October 20, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
> Hey Jason,
>
> > > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > > seeing SQ overflows. So the SQ sizing needs more investigation.
> >
> > NFS had this sort of problem - in that case it was because the code
> > was assuming that an RQ completion implied SQ space - that is not
> > legal; only direct completions from SQ WCs can guide available space
> > in the SQ..
>
> It's not the same problem. iser-target does not tie SQ and RQ spaces.
> The origin here is the difference between IB/RoCE and iWARP, and the
> chelsio HW that makes it hard to predict the correct SQ size.
>
> iWARP needs extra registration for rdma reads, and the chelsio device
> seems to be limited in the number of pages per registration, so this
> configuration will need a larger send queue.
>
> Another problem is that we don't have a correct retry flow.
>
> Hopefully we can address that in the RW API, which is designed to hide
> these details from the ULP...

Hi Sagi,

Here is what our further analysis of the SQ dump at the time of overflow
says:

The RDMA read/write API is creating long chains (32 WRs) to handle large
iSCSI READs. For writing the iSCSI default block size of 512KB of data,
iw_cxgb4's advertised max number of sges is 4 pages (~16KB) per WRITE, so
the IO needs a WR chain of 32 WRs (another possible factor is that they
are all unsignalled WRs and are completed only after the next signalled
WR). But apparently rdma_rw_init_qp() assumes that any given IO will take
only 1 WRITE WR to convey the data.

This evidently is incorrect, and rdma_rw_init_qp() needs to factor in and
size the queue based on the device's max_sge for write and read, and on
the sg_tablesize for which rdma read/write is used, like the initiator's
ISCSI_ISER_MAX_SG_TABLESIZE.

If the above analysis is correct, please suggest how this could be fixed.
Further, using MRs for rdma WRITE by setting the rdma_wr_force_mr=1 module
parameter of ib_core avoids the SQ overflow, by registering a single
REG_MR and using that MR for a single WRITE WR. So an rdma-rw IO chain of,
say, 32 WRITE WRs becomes just 3 WRs: REG_MR + WRITE + INV_MR, as
iw_cxgb4's max_fast_reg_page_list_len is 128 pages.

(By default force_mr is not set, and per rdma_rw_io_needs_mr(), iw_cxgb4
can only use an MR for rdma READs if force_mr isn't set.)

From this, is there any possibility that we could use an MR if the WRITE
WR chain exceeds a certain number?

Thanks for your time!

-Bharat.
+Christoph

> On Thursday, October 20, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
> > Hey Jason,
> >
> > > > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > > > seeing SQ overflows. So the SQ sizing needs more investigation.
> > >
> > > NFS had this sort of problem - in that case it was because the code
> > > was assuming that an RQ completion implied SQ space - that is not
> > > legal; only direct completions from SQ WCs can guide available space
> > > in the SQ..
> >
> > It's not the same problem. iser-target does not tie SQ and RQ spaces.
> > The origin here is the difference between IB/RoCE and iWARP, and the
> > chelsio HW that makes it hard to predict the correct SQ size.
> >
> > iWARP needs extra registration for rdma reads, and the chelsio device
> > seems to be limited in the number of pages per registration, so this
> > configuration will need a larger send queue.
> >
> > Another problem is that we don't have a correct retry flow.
> >
> > Hopefully we can address that in the RW API, which is designed to hide
> > these details from the ULP...
>
> Hi Sagi,
>
> Here is what our further analysis of the SQ dump at the time of overflow
> says:
>
> The RDMA read/write API is creating long chains (32 WRs) to handle large
> iSCSI READs. For writing the iSCSI default block size of 512KB of data,
> iw_cxgb4's advertised max number of sges is 4 pages (~16KB) per WRITE,
> so the IO needs a WR chain of 32 WRs (another possible factor is that
> they are all unsignalled WRs and are completed only after the next
> signalled WR). But apparently rdma_rw_init_qp() assumes that any given
> IO will take only 1 WRITE WR to convey the data.
>
> This evidently is incorrect, and rdma_rw_init_qp() needs to factor in
> and size the queue based on the device's max_sge for write and read, and
> on the sg_tablesize for which rdma read/write is used, like the
> initiator's ISCSI_ISER_MAX_SG_TABLESIZE. If the above analysis is
> correct, please suggest how this could be fixed.
> Further, using MRs for rdma WRITE by setting the rdma_wr_force_mr=1
> module parameter of ib_core avoids the SQ overflow, by registering a
> single REG_MR and using that MR for a single WRITE WR. So an rdma-rw IO
> chain of, say, 32 WRITE WRs becomes just 3 WRs: REG_MR + WRITE + INV_MR,
> as iw_cxgb4's max_fast_reg_page_list_len is 128 pages.
>
> (By default force_mr is not set, and per rdma_rw_io_needs_mr(), iw_cxgb4
> can only use an MR for rdma READs if force_mr isn't set.)
>
> From this, is there any possibility that we could use an MR if the WRITE
> WR chain exceeds a certain number?
>
> Thanks for your time!

I think it is time to resolve this XXX comment in rw.c for
rdma_rw_io_needs_mr():

/*
 * Check if the device will use memory registration for this RW operation.
 * We currently always use memory registrations for iWarp RDMA READs, and
 * have a debug option to force usage of MRs.
 * XXX: In the future we can hopefully fine tune this based on HCA driver
 * input.
 */

Regardless of whether the HCA driver provides input, I think 30+ RDMA
WRITE WR chains aren't as efficient as 1 REG_MR + 1 WRITE + 1 INV_MR. Is
it unreasonable to just add some threshold in rw.c?

Also, I think rdma_rw_init_qp() does need some tweaks: it needs to take
into account the max sge depth, the max REG_MR depth, and the max SQ depth
device attributes/capabilities when sizing the SQ. However, if the
computed depth exceeds the device max, then the SQ will not be big enough
to avoid potential overflows, and I believe ULPs should _always_ flow
control their outgoing WRs based on the SQ depth regardless. And perhaps
rdma-rw should even avoid overly deep SQs just because that tends to
inhibit scalability, e.g. allowing lots of shallow QPs vs consuming all
the device resources with very deep QPs...

Steve.
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index fc791efe3a10..81afb95aeea9 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -54,8 +54,14 @@
 
 #define ISERT_MIN_POSTED_RX	(ISCSI_DEF_XMIT_CMDS_MAX >> 2)
 
-#define ISERT_QP_MAX_REQ_DTOS	(ISCSI_DEF_XMIT_CMDS_MAX +	\
-				ISERT_MAX_TX_MISC_PDUS	+	\
+/*
+ * Max QP send work requests consist of:
+ * - RDMA + SEND for each iscsi IO
+ * - iscsi misc TX pdus
+ * - iscsi misc RX response pdus
+ */
+#define ISERT_QP_MAX_REQ_DTOS	((ISCSI_DEF_XMIT_CMDS_MAX * 2) +	\
+				ISERT_MAX_TX_MISC_PDUS	+	\
 				ISERT_MAX_RX_MISC_PDUS)
 
 #define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 -	\