diff mbox

SQ overflow seen running isert traffic

Message ID 945e2947-f67a-4202-cd27-d4631fe10f68@grimberg.me (mailing list archive)
State Superseded
Headers show

Commit Message

Sagi Grimberg March 21, 2017, 1:52 p.m. UTC
Hi Baharat and Nic,

Apologies for the late reply,

> Hi Nicholas,
> I see them from 2MB onwards.
>>
>>    > Here is what I see with the 3 patches alone applied:
>>    >
>>    > -> In isert_put_datain() and isert_post_response() a corresponding recv
>>    WR is posted before
>>    > posting a send and hence for every post failure a recv is already posted
>>    into a tightly packed
>>    > RQ, causing it to overflow.
>>
>>    Just for me to understand, the intermittent TMR ABORT_TASKs are caused
>>    by the repeated failure to post RDMA_WRITE WRs for a large ISER Data-In
>>    payload, due to mis-sizing of needed WRs from RDMA R/W API vs.
>>    underlying hardware capabilities.
> Yes.
>>
>>    Moving the recv posts after the send post for RDMA_WRITEs helps to
>>    reduce the severity of the issue with iw_cxgb4, but doesn't completely
>>    eliminate the issue under load.
> Moving recv posts only comes in to effect along with your changes.

...

>>    So the reason why your patch to swap post_recv -> post_send to post_send
>>    -> post_recv makes a difference is because it allows enough trickle of
>>    RDMA_WRITEs to make it through, where iser-initiator doesn't attempt to
>>    escalate recovery and doesn't attempt session reinstatement.
> I dont exactly know if above thing comes into play but the actual reason I did
> swap posting RQ and SQ is, unlike SQ, RQ is posted with WRs to the brim during
> the intialisation itself. From thereon we post a RQ WR for every RQ completion
> That makes it almost full at any point of time.
>
> Now in our scenario, SQ is miscalulated and too small for few adapters and so
> filled gradually as the IO starts. Once SQ is full, according to your patches
> isert queues it and tries to repost the command again. Here in iser functions
> like isert_post_response(), isert_put_datain() post send is done after post recv.
> For the first post send failure in say isert_put_datain(), the corresponding
> post recv is already posted, then on queuing the command and trying reposting
> an extra recv is again posted which fills up the RQ also.
>
>  By swapping post recv and send as in my incermental patch, we dont post that
> extra recv, and post recv only on successful post send.
> Therfore I think this incremental patch is necessary.

Reversing the order to recv and send posting will cause problems
in stress IO workloads (especially for iWARP). The problem of sending
a reply before reposting the recv buffer is that the initiator can send
immediately a new request and we don't have a recv buffer waiting for
it, which will cause RNR-NAK. This *will* cause performance drops and
jitters for sure.

How about we just track the rx_desc to know if we already posted it as
a start (untested as I don't have access to RDMA HW this week):
--

@@ -85,6 +85,7 @@ struct iser_rx_desc {
         u64             dma_addr;
         struct ib_sge   rx_sg;
         struct ib_cqe   rx_cqe;
+       bool            in_use;
         char            pad[ISER_RX_PAD_SIZE];
  } __packed;
--

We have a lot of room for cleanups in isert... I'll need to
make some time to get it going...

I'll be waiting to hear from you if it makes your issue go away.

Cheers,
Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Potnuri Bharat Teja March 21, 2017, 3:25 p.m. UTC | #1
On Tuesday, March 03/21/17, 2017 at 19:22:30 +0530, Sagi Grimberg wrote:
Hi Sagi,
The patch works fine!
>    Hi Baharat and Nic,
> 
>    Apologies for the late reply,
> 
>    > Hi Nicholas,
>    > I see them from 2MB onwards.
>    >>
>    >>    > Here is what I see with the 3 patches alone applied:
>    >>    >
>    >>    > -> In isert_put_datain() and isert_post_response() a corresponding
>    recv
>    >>    WR is posted before
>    >>    > posting a send and hence for every post failure a recv is already
>    posted
>    >>    into a tightly packed
>    >>    > RQ, causing it to overflow.
>    >>
>    >>    Just for me to understand, the intermittent TMR ABORT_TASKs are
>    caused
>    >>    by the repeated failure to post RDMA_WRITE WRs for a large ISER
>    Data-In
>    >>    payload, due to mis-sizing of needed WRs from RDMA R/W API vs.
>    >>    underlying hardware capabilities.
>    > Yes.
>    >>
>    >>    Moving the recv posts after the send post for RDMA_WRITEs helps to
>    >>    reduce the severity of the issue with iw_cxgb4, but doesn't
>    completely
>    >>    eliminate the issue under load.
>    > Moving recv posts only comes in to effect along with your changes.
> 
>    ...
> 
>    >>    So the reason why your patch to swap post_recv -> post_send to
>    post_send
>    >>    -> post_recv makes a difference is because it allows enough trickle
>    of
>    >>    RDMA_WRITEs to make it through, where iser-initiator doesn't attempt
>    to
>    >>    escalate recovery and doesn't attempt session reinstatement.
>    > I dont exactly know if above thing comes into play but the actual reason
>    I did
>    > swap posting RQ and SQ is, unlike SQ, RQ is posted with WRs to the brim
>    during
>    > the intialisation itself. From thereon we post a RQ WR for every RQ
>    completion
>    > That makes it almost full at any point of time.
>    >
>    > Now in our scenario, SQ is miscalulated and too small for few adapters
>    and so
>    > filled gradually as the IO starts. Once SQ is full, according to your
>    patches
>    > isert queues it and tries to repost the command again. Here in iser
>    functions
>    > like isert_post_response(), isert_put_datain() post send is done after
>    post recv.
>    > For the first post send failure in say isert_put_datain(), the
>    corresponding
>    > post recv is already posted, then on queuing the command and trying
>    reposting
>    > an extra recv is again posted which fills up the RQ also.
>    >
>    >  By swapping post recv and send as in my incermental patch, we dont post
>    that
>    > extra recv, and post recv only on successful post send.
>    > Therfore I think this incremental patch is necessary.
> 
>    Reversing the order to recv and send posting will cause problems
>    in stress IO workloads (especially for iWARP). The problem of sending
>    a reply before reposting the recv buffer is that the initiator can send
>    immediately a new request and we don't have a recv buffer waiting for
>    it, which will cause RNR-NAK. This *will* cause performance drops and
>    jitters for sure.
Totally agree with you.
> 
>    How about we just track the rx_desc to know if we already posted it as
>    a start (untested as I don't have access to RDMA HW this week):
>    --
>    diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
>    b/drivers/infiniband/ulp/isert/ib_isert.c
>    index 9b33c0c97468..fcbed35e95a8 100644
>    --- a/drivers/infiniband/ulp/isert/ib_isert.c
>    +++ b/drivers/infiniband/ulp/isert/ib_isert.c
>    @@ -817,6 +817,7 @@ isert_post_recvm(struct isert_conn *isert_conn, u32
>    count)
>                     rx_wr->sg_list = &rx_desc->rx_sg;
>                     rx_wr->num_sge = 1;
>                     rx_wr->next = rx_wr + 1;
>    +               rx_desc->in_use = false;
>             }
>             rx_wr--;
>             rx_wr->next = NULL; /* mark end of work requests list */
>    @@ -835,6 +836,15 @@ isert_post_recv(struct isert_conn *isert_conn,
>    struct iser_rx_desc *rx_desc)
>             struct ib_recv_wr *rx_wr_failed, rx_wr;
>             int ret;
> 
>    +       if (!rx_desc->in_use) {
>    +               /*
>    +                * if the descriptor is not in-use we already reposted it
>    +                * for recv, so just silently return
>    +                */
>    +               return 0;
>    +       }
>    +
>    +       rx_desc->in_use = false;
>             rx_wr.wr_cqe = &rx_desc->rx_cqe;
>             rx_wr.sg_list = &rx_desc->rx_sg;
>             rx_wr.num_sge = 1;
>    @@ -1397,6 +1407,8 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>                     return;
>             }
> 
>    +       rx_desc->in_use = true;
>    +
>             ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
>                             ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
> 
>    diff --git a/drivers/infiniband/ulp/isert/ib_isert.h
>    b/drivers/infiniband/ulp/isert/ib_isert.h
>    index c02ada57d7f5..87d994de8c91 100644
>    --- a/drivers/infiniband/ulp/isert/ib_isert.h
>    +++ b/drivers/infiniband/ulp/isert/ib_isert.h
>    @@ -60,7 +60,7 @@
> 
>      #define ISER_RX_PAD_SIZE       (ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
>                     (ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct
>    ib_sge) + \
>    -                sizeof(struct ib_cqe)))
>    +                sizeof(struct ib_cqe) + sizeof(bool)))
> 
>      #define ISCSI_ISER_SG_TABLESIZE                256
> 
>    @@ -85,6 +85,7 @@ struct iser_rx_desc {
>             u64             dma_addr;
>             struct ib_sge   rx_sg;
>             struct ib_cqe   rx_cqe;
>    +       bool            in_use;
>             char            pad[ISER_RX_PAD_SIZE];
>      } __packed;
>    --
> 
>    We have a lot of room for cleanups in isert... I'll need to
>    make some time to get it going...
> 
>    I'll be waiting to hear from you if it makes your issue go away.
Test runs fine and it solved the issue.
Thanks for the patch!
> 
>    Cheers,
>    Sagi.
Thanks,
Bharat.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg March 21, 2017, 4:38 p.m. UTC | #2
> Test runs fine and it solved the issue.
> Thanks for the patch!

Cool, can I add your "Tested-by" tag on a formal patch then?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Potnuri Bharat Teja March 21, 2017, 5:50 p.m. UTC | #3
On Tuesday, March 03/21/17, 2017 at 22:08:04 +0530, Sagi Grimberg wrote:
>    > Test runs fine and it solved the issue.
>    > Thanks for the patch!
> 
>    Cool, can I add your "Tested-by" tag on a formal patch then?
Sure, Please add.
Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>

Thanks.
>    --
>    To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>    the body of a message to majordomo@vger.kernel.org
>    More majordomo info at  [1]http://vger.kernel.org/majordomo-info.html
> 
> References
> 
>    Visible links
>    1. http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
b/drivers/infiniband/ulp/isert/ib_isert.c
index 9b33c0c97468..fcbed35e95a8 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -817,6 +817,7 @@  isert_post_recvm(struct isert_conn *isert_conn, u32 
count)
                 rx_wr->sg_list = &rx_desc->rx_sg;
                 rx_wr->num_sge = 1;
                 rx_wr->next = rx_wr + 1;
+               rx_desc->in_use = false;
         }
         rx_wr--;
         rx_wr->next = NULL; /* mark end of work requests list */
@@ -835,6 +836,15 @@  isert_post_recv(struct isert_conn *isert_conn, 
struct iser_rx_desc *rx_desc)
         struct ib_recv_wr *rx_wr_failed, rx_wr;
         int ret;

+       if (!rx_desc->in_use) {
+               /*
+                * if the descriptor is not in-use we already reposted it
+                * for recv, so just silently return
+                */
+               return 0;
+       }
+
+       rx_desc->in_use = false;
         rx_wr.wr_cqe = &rx_desc->rx_cqe;
         rx_wr.sg_list = &rx_desc->rx_sg;
         rx_wr.num_sge = 1;
@@ -1397,6 +1407,8 @@  isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
                 return;
         }

+       rx_desc->in_use = true;
+
         ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
                         ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);

diff --git a/drivers/infiniband/ulp/isert/ib_isert.h 
b/drivers/infiniband/ulp/isert/ib_isert.h
index c02ada57d7f5..87d994de8c91 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -60,7 +60,7 @@ 

  #define ISER_RX_PAD_SIZE       (ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
                 (ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct 
ib_sge) + \
-                sizeof(struct ib_cqe)))
+                sizeof(struct ib_cqe) + sizeof(bool)))

  #define ISCSI_ISER_SG_TABLESIZE                256