diff mbox series

[RFC] IB/iser: add task reference counter for tx commands

Message ID 20220824124236.812395-1-kashyap.desai@broadcom.com (mailing list archive)
State RFC
Headers show
Series [RFC] IB/iser: add task reference counter for tx commands | expand

Commit Message

Kashyap Desai Aug. 24, 2022, 12:42 p.m. UTC
In [0], Max Gurtovoy already explained a future plan to add reference
count.

Below example has Initiator Task Tag = 0x1e. We are tracking the same ITT.

        [177617.255969]  session508: iscsi_prep_scsi_cmd_pdu iscsi prep [write cid 0 sc 000000009d6ff976 cdb 0x2a itt 0x1e len 8192 cmdsn 90 win 64]
        [177617.255982] iser: iscsi_iser_task_xmit: cmd [itt 1e total 8192 imm 8192 unsol_data 0
(1) >>  [177617.255985] iser: iscsi_iser_task_xmit: ctask xmit [cid 0 itt 0x1e]
        [177617.255992] iser: iser_reg_dma: Single DMA entry: lkey=0xffffffff, rkey=0x0, addr=0x9c5de000, length=0x2000
        [177617.255995] iser: iser_prepare_write_cmd: Cmd itt:30, WRITE, adding imm.data sz: 8192
        [177617.256002] iser: iser_send_command: from iser_send_command iser_task ffff9e39dad96898 tx_count 1
(2) >>  [177617.256016] iser: iser_cmd_comp: from iser_cmd_comp iser_task ffff9e39dad96898 tx_count 0
(3) >>  [177617.256045] iser: iser_task_rsp: op 0x21 itt 0x1e dlen 0
        [177617.256049]  session508: __iscsi_complete_pdu [op 0x21 cid 0 itt 0x1e len 0]
        [177617.256052]  session508: iscsi_scsi_cmd_rsp cmd rsp done [sc 000000009d6ff976 res 0 itt 0x1e]
        [177617.256055]  session508: iscsi_complete_task complete task itt 0x1e state 3 sc 000000009d6ff976
(4) >>  [177617.256057]  session508: iscsi_free_task freeing task itt 0x1e state 1 sc 000000009d6ff976
(5) >>  [177617.256067] bnxt_en 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001e address=0x41137800 flags=0x0000]
        [177617.256068] bnxt_en 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001e address=0xc1595098 flags=0x0000]

(1) Is when iser stack has posted TX DESC type = ISCSI_TX_SCSI_COMMAND having two SGL.
    sgl[0] is contain iscsi_header and sgl[1] is scsi data (of length 8K)
    of scsi command cdb = WRITE_10
(2) In a good case, TX completion is received,
    but in some cases TX completion is not yet received to the iser stack.
(3) Is when iser stack gets completion of RX DESC of ITT = 0x1e.
    This is actual scsi completion of TX_DESC from target.
(4) Is when the iser stack destroys all TX DESC resources associated with ITT = 0x1e.
(5) Is when HCA is still accessing WQE entry of ITT = 0x1e (possible in retransmission path).
    After (4), sgl[0] and sgl[1] dma addresses are unmapped and not allowed to be used by HCA.

Get the iscsi_task reference count if TX DESC is successfully posted.
Put the iscsi_task reference count once TX_DESC is completed.
This method will make sure iscsi_free_task will be called only after TX
and RX DESC received for the task->itt.
 
This patch depends upon [0] as it required signaled completion.

[0] https://patchwork.kernel.org/project/linux-rdma/patch/20211215135721.3662-5-mgurtovoy@nvidia.com/

Cc: "Jason Gunthorpe" <jgg@nvidia.com>
Cc: "Max Gurtovoy" <mgurtovoy@nvidia.com>
Cc: "Sagi Grimberg" <sagi@grimberg.me>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
---
 drivers/infiniband/ulp/iser/iscsi_iser.c     | 5 ++++-
 drivers/infiniband/ulp/iser/iser_initiator.c | 6 ++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Max Gurtovoy Aug. 24, 2022, 11:59 p.m. UTC | #1
Hi Kashyap Desai,

Unfortunately, there is a wider problem in iser that we do the local 
invalidation if any, after we complete the iscsi task.
So the right solution is to do the logic we have in NVMe/RDMA that 
checks if remote invalidation happened and if not, it does local 
invalidation.
And the task is released after getting the send_completion and 
local_invalidation/remote invalidation.

There is some infrastructure work needed to be done there.

Sagi,

Please remind me few things regarding the iSCSI bidir cmds.

In case of bidir IO cmd, if we need to use a reg_mr for it - I see a 
potential problem there.
Is this scenario possible ?

I'm asking because if it is possible, so what happens in remote 
invalidation ? what key is invalidated ? the read_key or the write_key ? 
in ib_isert there is a priority to the read_key (is it by the spec ?).
We don't consider this in the iser recv completion and the remote 
invalidation checker logic.
If it's not possible we should re-design few components in the code and 
fix the issue that we do local_invalidation of cmd N during the send of 
cmd N + 1.

Cheers,
-Max.

On 8/24/2022 3:42 PM, Kashyap Desai wrote:
> In [0], Max Gurtovoy already explained a future plan to add reference
> count.
>
> Below example has Initiator Task Tag = 0x1e. We are tracking the same ITT.
>
>          [177617.255969]  session508: iscsi_prep_scsi_cmd_pdu iscsi prep [write cid 0 sc 000000009d6ff976 cdb 0x2a itt 0x1e len 8192 cmdsn 90 win 64]
>          [177617.255982] iser: iscsi_iser_task_xmit: cmd [itt 1e total 8192 imm 8192 unsol_data 0
> (1) >>  [177617.255985] iser: iscsi_iser_task_xmit: ctask xmit [cid 0 itt 0x1e]
>          [177617.255992] iser: iser_reg_dma: Single DMA entry: lkey=0xffffffff, rkey=0x0, addr=0x9c5de000, length=0x2000
>          [177617.255995] iser: iser_prepare_write_cmd: Cmd itt:30, WRITE, adding imm.data sz: 8192
>          [177617.256002] iser: iser_send_command: from iser_send_command iser_task ffff9e39dad96898 tx_count 1
> (2) >>  [177617.256016] iser: iser_cmd_comp: from iser_cmd_comp iser_task ffff9e39dad96898 tx_count 0
> (3) >>  [177617.256045] iser: iser_task_rsp: op 0x21 itt 0x1e dlen 0
>          [177617.256049]  session508: __iscsi_complete_pdu [op 0x21 cid 0 itt 0x1e len 0]
>          [177617.256052]  session508: iscsi_scsi_cmd_rsp cmd rsp done [sc 000000009d6ff976 res 0 itt 0x1e]
>          [177617.256055]  session508: iscsi_complete_task complete task itt 0x1e state 3 sc 000000009d6ff976
> (4) >>  [177617.256057]  session508: iscsi_free_task freeing task itt 0x1e state 1 sc 000000009d6ff976
> (5) >>  [177617.256067] bnxt_en 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001e address=0x41137800 flags=0x0000]
>          [177617.256068] bnxt_en 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001e address=0xc1595098 flags=0x0000]
>
> (1) Is when iser stack has posted TX DESC type = ISCSI_TX_SCSI_COMMAND having two SGL.
>      sgl[0] is contain iscsi_header and sgl[1] is scsi data (of length 8K)
>      of scsi command cdb = WRITE_10
> (2) In a good case, TX completion is received,
>      but in some cases TX completion is not yet received to the iser stack.
> (3) Is when iser stack gets completion of RX DESC of ITT = 0x1e.
>      This is actual scsi completion of TX_DESC from target.
> (4) Is when the iser stack destroys all TX DESC resources associated with ITT = 0x1e.
> (5) Is when HCA is still accessing WQE entry of ITT = 0x1e (possible in retransmission path).
>      After (4), sgl[0] and sgl[1] dma addresses are unmapped and not allowed to be used by HCA.
>
> Get the iscsi_task reference count if TX DESC is successfully posted.
> Put the iscsi_task reference count once TX_DESC is completed.
> This method will make sure iscsi_free_task will be called only after TX
> and RX DESC received for the task->itt.
>   
> This patch depends upon [0] as it required signaled completion.
>
> [0] https://patchwork.kernel.org/project/linux-rdma/patch/20211215135721.3662-5-mgurtovoy@nvidia.com/
>
> Cc: "Jason Gunthorpe" <jgg@nvidia.com>
> Cc: "Max Gurtovoy" <mgurtovoy@nvidia.com>
> Cc: "Sagi Grimberg" <sagi@grimberg.me>
> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
> ---
>   drivers/infiniband/ulp/iser/iscsi_iser.c     | 5 ++++-
>   drivers/infiniband/ulp/iser/iser_initiator.c | 6 ++++++
>   2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
> index 620ae5b2d80d..f1704fef9dc8 100644
> --- a/drivers/infiniband/ulp/iser/iscsi_iser.c
> +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
> @@ -339,9 +339,12 @@ static int iscsi_iser_task_xmit(struct iscsi_task *task)
>   
>   	/* Send the cmd PDU */
>   	if (!iser_task->command_sent) {
> +		iscsi_get_task(task);
>   		error = iser_send_command(conn, task);
> -		if (error)
> +		if (error) {
> +			iscsi_put_task(task);
>   			goto iscsi_iser_task_xmit_exit;
> +		}
>   		iser_task->command_sent = 1;
>   	}
>   
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 7b83f48f60c5..1f0b1601b3b3 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -669,6 +669,12 @@ void iser_task_rsp(struct ib_cq *cq, struct ib_wc *wc)
>   
>   void iser_cmd_comp(struct ib_cq *cq, struct ib_wc *wc)
>   {
> +	struct iser_tx_desc *desc = iser_tx(wc->wr_cqe);
> +	struct iscsi_task *task;
> +	task = (void *)desc - sizeof(struct iscsi_task);
> +
> +	iscsi_put_task(task);
> +
>   	if (unlikely(wc->status != IB_WC_SUCCESS))
>   		iser_err_comp(wc, "command");
>   }
Kashyap Desai Aug. 25, 2022, 12:07 p.m. UTC | #2
>
> Hi Kashyap Desai,
>
> Unfortunately, there is a wider problem in iser that we do the local
> invalidation if any, after we complete the iscsi task.
> So the right solution is to do the logic we have in NVMe/RDMA that checks
> if
> remote invalidation happened and if not, it does local invalidation.
> And the task is released after getting the send_completion and
> local_invalidation/remote invalidation.

Hi Max ,

I noticed the same. Agree that this RFC is not close to actual solution.
Final solution require major lift in iSER instead of just adding refcount in
couple of places.
In case of NVME every ib_post_send is posted with IB_SEND_SIGNALED and
nvme_rdma_send_done does the better job of handling out of order TX and RX
completion.
nvme_rdma_end_request() can be called either nvme_rdma_process_nvme_rsp,
nvme_rdma_send_done OR nvme_rdma_inv_rkey_done.
Similar concept is what we need in iSER side as well. Right ? Currently iSER
stack is freeing scsi request only from iser_task_rsp context.

>
> There is some infrastructure work needed to be done there.

I would like to contribute if you want me to test, review and/or co develop
the changes.


>
> Sagi,
>
> Please remind me few things regarding the iSCSI bidir cmds.
>
> In case of bidir IO cmd, if we need to use a reg_mr for it - I see a
> potential
> problem there.
> Is this scenario possible ?
>
> I'm asking because if it is possible, so what happens in remote
> invalidation ?
> what key is invalidated ? the read_key or the write_key ?
> in ib_isert there is a priority to the read_key (is it by the spec ?).
> We don't consider this in the iser recv completion and the remote
> invalidation
> checker logic.
> If it's not possible we should re-design few components in the code and
> fix the
> issue that we do local_invalidation of cmd N during the send of cmd N + 1.
>
> Cheers,
> -Max.
>
Sagi Grimberg Aug. 28, 2022, 2:52 p.m. UTC | #3
> Hi Kashyap Desai,
> 
> Unfortunately, there is a wider problem in iser that we do the local 
> invalidation if any, after we complete the iscsi task.
> So the right solution is to do the logic we have in NVMe/RDMA that 
> checks if remote invalidation happened and if not, it does local 
> invalidation.
> And the task is released after getting the send_completion and 
> local_invalidation/remote invalidation.
> 
> There is some infrastructure work needed to be done there.
> 
> Sagi,
> 
> Please remind me few things regarding the iSCSI bidir cmds.
> 
> In case of bidir IO cmd, if we need to use a reg_mr for it - I see a 
> potential problem there.

Not surprising.

> Is this scenario possible ?

There is no longer bidi support in scsi.
See patch: ae3d56d81507 ("scsi: remove bidirectional command support")

> I'm asking because if it is possible, so what happens in remote 
> invalidation ? what key is invalidated ? the read_key or the write_key ? 
> in ib_isert there is a priority to the read_key (is it by the spec ?).

The spec never addressed this. It wasn't well defined what to do with
remote invalidation for bidi commands.

> We don't consider this in the iser recv completion and the remote 
> invalidation checker logic.
> If it's not possible we should re-design few components in the code and 
> fix the issue that we do local_invalidation of cmd N during the send of 
> cmd N + 1.

We shouldn't bother, no consumer of this is left.
diff mbox series

Patch

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 620ae5b2d80d..f1704fef9dc8 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -339,9 +339,12 @@  static int iscsi_iser_task_xmit(struct iscsi_task *task)
 
 	/* Send the cmd PDU */
 	if (!iser_task->command_sent) {
+		iscsi_get_task(task);
 		error = iser_send_command(conn, task);
-		if (error)
+		if (error) {
+			iscsi_put_task(task);
 			goto iscsi_iser_task_xmit_exit;
+		}
 		iser_task->command_sent = 1;
 	}
 
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 7b83f48f60c5..1f0b1601b3b3 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -669,6 +669,12 @@  void iser_task_rsp(struct ib_cq *cq, struct ib_wc *wc)
 
 void iser_cmd_comp(struct ib_cq *cq, struct ib_wc *wc)
 {
+	struct iser_tx_desc *desc = iser_tx(wc->wr_cqe);
+	struct iscsi_task *task;
+	task = (void *)desc - sizeof(struct iscsi_task);
+
+	iscsi_put_task(task);
+
 	if (unlikely(wc->status != IB_WC_SUCCESS))
 		iser_err_comp(wc, "command");
 }