
Unexpected issues with 2 NVME initiators using the same target

Message ID 614481c7-22dd-d93b-e97e-52f868727ec3@grimberg.me (mailing list archive)
State Not Applicable

Commit Message

Sagi Grimberg June 20, 2017, 12:02 p.m. UTC
>>> Can you share the check that correlates to the vendor+hw syndrome?
>>
>> mkey.free == 1
> 
> Hmm, the way I understand it is that the HW is trying to access
> (locally via send) a MR which was already invalidated.
> 
> Thinking of this further, this can happen in a case where the target
> already completed the transaction, sent SEND_WITH_INVALIDATE but the
> original send ack was lost somewhere causing the device to retransmit
> from the MR (which was already invalidated). This is highly unlikely
> though.
> 
> Shouldn't this be protected somehow by the device?
> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
> 
> Say host register MR (a) and send (1) from that MR to a target,
> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
> so it retries, but ehh, its already invalidated.

Well, this entire flow is broken: why should the host send the MR rkey
to the target if it is not using it for remote access? The target
should never have a chance to remotely invalidate something it did not
access.

I think we have a bug in the iSER code, as we should not send the key
for remote invalidation if we do an inline data send...

Robert, can you try the following:
--
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 12ed62ce9ff7..2a07692007bd 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,

        if (unsol_sz < edtl) {
                hdr->flags     |= ISER_WSV;
-               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
-               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+               if (buf_out->data_len > imm_sz) {
+                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
+                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+               }

                iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
                         "VA:%#llX + unsol:%d\n",
--

Although, I still don't think it's enough. We need to delay the
local invalidate until we receive a send completion (which guarantees
that the ack was received)...

If this is indeed the case, _all_ ULP initiator drivers share it because
we never condition on a send completion in order to complete an I/O,
and in the case of a lost ack on a send, it looks like we need to... It
*will* hurt performance.
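
To make the idea concrete, a minimal sketch of conditioning I/O
completion on the send completion might look like the following; all
names (my_request, my_complete_io, my_send_done) are hypothetical and
not taken from any of the drivers discussed here:
--
#include <linux/refcount.h>
#include <rdma/ib_verbs.h>

/* Sketch: hold two references per request, one dropped by the send
 * completion and one by the response, so buffers are only
 * unmapped/invalidated after BOTH have been seen.
 */
struct my_request {
	struct ib_cqe	send_cqe;	/* send completion handler */
	refcount_t	pending;	/* initialized to 2 per I/O */
	/* ... */
};

static void my_req_put(struct my_request *req)
{
	/* Last reference gone: the HCA is done with the send buffer,
	 * so it is now safe to unmap/invalidate and complete the I/O.
	 */
	if (refcount_dec_and_test(&req->pending))
		my_complete_io(req);		/* hypothetical */
}

static void my_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct my_request *req =
		container_of(wc->wr_cqe, struct my_request, send_cqe);

	my_req_put(req);
}
/* The response/receive path calls my_req_put() as well. */
--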

What do other folks think?

CC'ing Bart, Chuck, Christoph.

Guys, for summary, I think we might have broken behavior in the
initiator mode drivers. We never wait for send completions (for
requests) before we complete an I/O. The issue is that the ack for those
sends might get lost, which means that the HCA will retry them (they are
dropped by the peer HCA), but if we happen to complete the I/O before
that, we may have already unmapped the request area or, for inline data,
invalidated it (so the HCA will try to access an MR which was
invalidated).

Signalling all send completions and also finishing I/Os only after we
get them will add latency, and that sucks...

Comments

Chuck Lever June 20, 2017, 5:01 p.m. UTC | #1
> On Jun 20, 2017, at 8:02 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>> 
>>> mkey.free == 1
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
> 
> Well, this entire flow is broken, why should the host send the MR rkey
> to the target if it is not using it for remote access, the target
> should never have a chance to remote invalidate something it did not
> access.
> 
> I think we have a bug in iSER code, as we should not send the key
> for remote invalidation if we do inline data send...
> 
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
> 
>        if (unsol_sz < edtl) {
>                hdr->flags     |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               }
> 
>                iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                         "VA:%#llX + unsol:%d\n",
> --
> 
> Although, I still don't think its enough. We need to delay the
> local invalidate till we received a send completion (guarantees
> that ack was received)...
> 
> If this indeed the case, _all_ ULP initiator drivers share it because we
> never condition on a send completion in order to complete an I/O, and
> in the case of lost ack on send, looks like we need to... It *will* hurt
> performance.
> 
> What do other folks think?
> 
> CC'ing Bart, Chuck, Christoph.
> 
> Guys, for summary, I think we might have a broken behavior in the
> initiator mode drivers. We never condition send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (dropped
> by the peer HCA) but if we happen to complete the I/O before, either we
> can unmap the request area, or for inline data, we invalidate it (so the
> HCA will try to access a MR which was invalidated).

So on occasion there is a Remote Access Error. That would
trigger connection loss, and the retransmitted Send request
is discarded (if there was externally exposed memory involved
with the original transaction that is now invalid).

NFS has a duplicate replay cache. If it sees a repeated RPC
XID it will send a cached reply. I guess the trick there is
to squelch remote invalidation for such retransmits to avoid
spurious Remote Access Errors. Should be rare, though.

RPC-over-RDMA uses persistent registration for its inline
buffers. The problem there is avoiding buffer reuse too soon.
Otherwise a garbled inline message is presented on retransmit.
Those would probably not be caught by the DRC.

But the real problem is preventing retransmitted Sends from
causing a ULP request to be executed multiple times.


> Signalling all send completions and also finishing I/Os only after we
> got them will add latency, and that sucks...

Typically, Sends will complete before the response arrives.
The additional cost will be handling extra interrupts, IMO.

With FRWR, won't subsequent WRs be delayed until the HCA is
done with the Send? I don't think a signal is necessary in
every case. Send Queue accounting currently relies on that.

RPC-over-RDMA relies on the completion of Local Invalidation
to ensure that the initial Send WR is complete. For Remote
Invalidation and pure inline, there is nothing to fence that
Send.

The question I have is: how often do these Send retransmissions
occur? Is it enough to have a robust recovery mechanism, or
do we have to wire in assumptions about retransmission to
every Send operation?


--
Chuck Lever



Robert LeBlanc June 20, 2017, 5:08 p.m. UTC | #2
On Tue, Jun 20, 2017 at 6:02 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>>
>>> mkey.free == 1
>>
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>>
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
>
>
> Well, this entire flow is broken, why should the host send the MR rkey
> to the target if it is not using it for remote access, the target
> should never have a chance to remote invalidate something it did not
> access.
>
> I think we have a bug in iSER code, as we should not send the key
> for remote invalidation if we do inline data send...
>
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
>
>         if (unsol_sz < edtl) {
>                 hdr->flags     |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +               hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               }
>
>                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                          "VA:%#llX + unsol:%d\n",
> --
>
> Although, I still don't think its enough. We need to delay the
> local invalidate till we received a send completion (guarantees
> that ack was received)...
>
> If this indeed the case, _all_ ULP initiator drivers share it because we
> never condition on a send completion in order to complete an I/O, and
> in the case of lost ack on send, looks like we need to... It *will* hurt
> performance.
>
> What do other folks think?
>
> CC'ing Bart, Chuck, Christoph.
>
> Guys, for summary, I think we might have a broken behavior in the
> initiator mode drivers. We never condition send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (dropped
> by the peer HCA) but if we happen to complete the I/O before, either we
> can unmap the request area, or for inline data, we invalidate it (so the
> HCA will try to access a MR which was invalidated).
>
> Signalling all send completions and also finishing I/Os only after we
> got them will add latency, and that sucks...

Testing this patch I didn't see these new messages even when rebooting
the targets multiple times. It also resolved some performance problems
I was seeing (I think our switches are having bugs with IPv6 and
routing) and I was receiving expected performance. At one point in the
test, one target (4.9.33) showed:
[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
[Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
transport retry counter exceeded (12) vend_err 81

After this and a reboot of the target, the initiator would drop the
connection after 1.5-2 minutes, then faster and faster until it was
every 5 seconds. It is almost like it set up the connection then lost
the first ping, or the ping wasn't set up right. I tried rebooting the
target multiple times.

I tried to log out of the "bad" session and got a backtrace.

[Tue Jun 20 10:30:08 2017] ------------[ cut here ]------------
[Tue Jun 20 10:30:08 2017] WARNING: CPU: 20 PID: 783 at
fs/sysfs/group.c:237 sysfs_remove_group+0x82/0x90
[Tue Jun 20 10:30:08 2017] Modules linked in: ib_iser rdma_ucm ib_ucm
ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib mlx4_ib ib_core 8021q
garp mrp sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp
kvm_intel kvm iTCO_wdt iTCO_vendor_support mei_me irqbypass
crct10dif_pclmul mei crc32_pclmul ioatdma ghash_clmulni_intel lpc_ich
i2c_i801 pcbc mfd_core aesni_intel crypto_simd glue_helper cryptd
joydev sg pcspkr shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler
acpi_power_meter ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 acpi_pad
nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter nfsd
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c raid1
dm_service_time sd_mod mlx4_en be2iscsi bnx2i cnic uio qla4xxx
iscsi_boot_sysfs ast drm_kms_helper mlx5_core syscopyarea sysfillrect
sysimgblt fb_sys_fops ttm
[Tue Jun 20 10:30:08 2017]  drm mlx4_core igb ahci libahci ptp libata
pps_core dca i2c_algo_bit dm_multipath dm_mirror dm_region_hash dm_log
dm_mod dax
[Tue Jun 20 10:30:08 2017] CPU: 20 PID: 783 Comm: kworker/u64:2 Not
tainted 4.12.0-rc6+ #5
[Tue Jun 20 10:30:08 2017] Hardware name: Supermicro
SYS-6028TP-HTFR/X10DRT-PIBF, BIOS 1.1 08/03/2015
[Tue Jun 20 10:30:08 2017] Workqueue: scsi_wq_12 __iscsi_unbind_session
[Tue Jun 20 10:30:08 2017] task: ffff887f5c45cb00 task.stack: ffffc90032ef4000
[Tue Jun 20 10:30:08 2017] RIP: 0010:sysfs_remove_group+0x82/0x90
[Tue Jun 20 10:30:08 2017] RSP: 0018:ffffc90032ef7d18 EFLAGS: 00010246
[Tue Jun 20 10:30:08 2017] RAX: 0000000000000038 RBX: 0000000000000000
RCX: 0000000000000000
[Tue Jun 20 10:30:08 2017] RDX: 0000000000000000 RSI: ffff883f7fd0e068
RDI: ffff883f7fd0e068
[Tue Jun 20 10:30:08 2017] RBP: ffffc90032ef7d30 R08: 0000000000000000
R09: 0000000000000676
[Tue Jun 20 10:30:08 2017] R10: 00000000000003ff R11: 0000000000000001
R12: ffffffff81da8a40
[Tue Jun 20 10:30:08 2017] R13: ffff887f52ec0838 R14: ffff887f52ec08d8
R15: ffff883f4c611000
[Tue Jun 20 10:30:08 2017] FS:  0000000000000000(0000)
GS:ffff883f7fd00000(0000) knlGS:0000000000000000
[Tue Jun 20 10:30:08 2017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jun 20 10:30:08 2017] CR2: 000055eef886b398 CR3: 0000000001c09000
CR4: 00000000001406e0
[Tue Jun 20 10:30:08 2017] Call Trace:
[Tue Jun 20 10:30:08 2017]  dpm_sysfs_remove+0x57/0x60
[Tue Jun 20 10:30:08 2017]  device_del+0x107/0x330
[Tue Jun 20 10:30:08 2017]  scsi_target_reap_ref_release+0x2d/0x40
[Tue Jun 20 10:30:08 2017]  scsi_target_reap+0x2e/0x40
[Tue Jun 20 10:30:08 2017]  scsi_remove_target+0x197/0x1b0
[Tue Jun 20 10:30:08 2017]  __iscsi_unbind_session+0xbe/0x170
[Tue Jun 20 10:30:08 2017]  process_one_work+0x149/0x360
[Tue Jun 20 10:30:08 2017]  worker_thread+0x4d/0x3c0
[Tue Jun 20 10:30:08 2017]  kthread+0x109/0x140
[Tue Jun 20 10:30:08 2017]  ? rescuer_thread+0x380/0x380
[Tue Jun 20 10:30:08 2017]  ? kthread_park+0x60/0x60
[Tue Jun 20 10:30:08 2017]  ret_from_fork+0x25/0x30
[Tue Jun 20 10:30:08 2017] Code: d5 c0 ff ff 5b 41 5c 41 5d 5d c3 48
89 df e8 66 bd ff ff eb c6 49 8b 55 00 49 8b 34 24 48 c7 c7 d0 3c a7 81
31 c0 e8 3c 01 ee ff <0f> ff eb d5 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
44 00 00 55 48
[Tue Jun 20 10:30:08 2017] ---[ end trace 6161f21139b6a1ea ]---

Logging back into that target didn't help stabilize the connection. I
rebooted both initiator and targets to clear things up and after the
initiator went down, the target showed the timeout message again. It
seems something got out of whack and never recovered and "poisoned"
the other node in the process.

I'll test Max's patch now and report back.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
Sagi Grimberg June 20, 2017, 5:12 p.m. UTC | #3
> So on occasion there is a Remote Access Error. That would
> trigger connection loss, and the retransmitted Send request
> is discarded (if there was externally exposed memory involved
> with the original transaction that is now invalid).

I'm actually not concerned about the remote invalidation; it's good
that it's discarded or failed. It's the inline sends that are the bug
here.

> But the real problem is preventing retransmitted Sends from
> causing a ULP request to be executed multiple times.

Exactly.

>> Signalling all send completions and also finishing I/Os only after we
>> got them will add latency, and that sucks...
> 
> Typically, Sends will complete before the response arrives.
> The additional cost will be handling extra interrupts, IMO.

Not quite, heavy traffic _can_ result in dropped acks; my gut
feeling is that it can happen more than we suspect.

and yea, extra interrupt, extra cachelines, extra state,
but I do not see any other way around it.

> With FRWR, won't subsequent WRs be delayed until the HCA is
> done with the Send? I don't think a signal is necessary in
> every case. Send Queue accounting currently relies on that.

Not really, the Send after the FRWR might have a fence (not a strongly
ordered one), and CX3/CX4 strongly order FRWR, so for them that is a
non-issue. The problem is that ULPs can't rely on it.

> RPC-over-RDMA relies on the completion of Local Invalidation
> to ensure that the initial Send WR is complete.

Wait, is that guaranteed?

> For Remote
> Invalidation and pure inline, there is nothing to fence that
> Send.
> 
> The question I have is: how often do these Send retransmissions
> occur? Is it enough to have a robust recovery mechanism, or
> do we have to wire in assumptions about retransmission to
> every Send operation?

Even if it's rare, we don't have any way to protect against devices
retrying the send operation.
Sagi Grimberg June 20, 2017, 5:19 p.m. UTC | #4
> Testing this patch I didn't see these new messages even when rebooting
> the targets multiple times. It also resolved some performance problems
> I was seeing (I think our switches are having bugs with IPv6 and
> routing) and I was receiving expected performance. At one point in the
> test, one target (4.9.33) showed:
> [Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
> [Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
> [Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
> transport retry counter exceeded (12) vend_err 81

I don't understand, is this new with the patch applied?

> After this and a reboot of the target, the initiator would drop the
> connection after 1.5-2 minutes then faster and faster until it was
> every 5 seconds. It is almost like it set up the connection then lose
> the first ping, or the ping wasn't set-up right. I tried rebooting the
> target multiple times.

So the initiator could not recover even after the target was available
again?
Robert LeBlanc June 20, 2017, 5:28 p.m. UTC | #5
On Tue, Jun 20, 2017 at 11:19 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Testing this patch I didn't see these new messages even when rebooting
>> the targets multiple times. It also resolved some performance problems
>> I was seeing (I think our switches are having bugs with IPv6 and
>> routing) and I was receiving expected performance. At one point in the
>> test, one target (4.9.33) showed:
>> [Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
>> transport retry counter exceeded (12) vend_err 81
>
>
> I don't understand, is this new with the patch applied?

I applied your patch to 4.12-rc6 on the initiator, but my targets are
still 4.9.33 since it looked like the patch only affected the
initiator. I did not see this before your patch, but I also didn't try
rebooting the targets multiple times before because of the previous
messages.

>> After this and a reboot of the target, the initiator would drop the
>> connection after 1.5-2 minutes then faster and faster until it was
>> every 5 seconds. It is almost like it set up the connection then lose
>> the first ping, or the ping wasn't set-up right. I tried rebooting the
>> target multiple times.
>
>
> So the initiator could not recover even after the target as available
> again?

The initiator recovered the connection when the target came back, but
the connection was not stable. I/O would happen on the connection,
then it would get shaky and then finally disconnect. Then it would
reconnect, pass more I/O, then get shaky and go down again. With the 5
second disconnects, it would pass traffic for 5 seconds, then as soon
as I saw the ping timeout, the I/O would stop until it reconnected. At
that point it seems that the lack of pings would kill the I/O unlike
earlier where there was a stall in I/O and then the connection would
be torn down. I can try to see if I can get it to happen again.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
Jason Gunthorpe June 20, 2017, 5:35 p.m. UTC | #6
On Tue, Jun 20, 2017 at 01:01:39PM -0400, Chuck Lever wrote:

> >> Shouldn't this be protected somehow by the device?
> >> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
> >> Say host register MR (a) and send (1) from that MR to a target,
> >> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
> >> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
> >> so it retries, but ehh, its already invalidated.

I'm not sure I understand the example.. but...

If you pass an MR key to a send, then that MR must remain valid until
the send completion is implied by an observation on the CQ. The HCA is
free to re-execute the SEND against the MR at any time up until the
completion reaches the CQ.

As I've explained before, a ULP must not use 'implied completion', e.g.
a receive that could only have happened if the far side got the
send. In particular this means it cannot use an incoming SEND_INV/etc.
to invalidate an MR associated with a local SEND, as that is a form
of 'implied completion'.

For sanity an MR associated with a local send should not be remotely
accessible at all, and shouldn't even have an 'rkey', just an 'lkey'.

Similarly, you cannot use an MR with SEND and remote access sanely, as
the far end could corrupt or invalidate the MR while the local HCA is
still using it.
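
A sketch of the safe pattern under those rules (my_ctx and my_send_done
are hypothetical names, assuming a kernel verbs consumer):
--
#include <rdma/ib_verbs.h>

struct my_ctx {
	struct ib_cqe	send_cqe;	/* -> my_send_done(), not shown */
	u64		dma_addr;	/* already ib_dma_map'ed */
	u32		len;
};

/* Post a signaled SEND from memory covered by the PD's local DMA lkey
 * (which is never invalidated); any unmap/teardown waits for the CQE.
 */
static int my_post_send(struct ib_qp *qp, struct my_ctx *ctx)
{
	struct ib_send_wr wr = {}, *bad_wr;
	struct ib_sge sge = {
		.addr	= ctx->dma_addr,
		.length	= ctx->len,
		.lkey	= qp->pd->local_dma_lkey,	/* an lkey, no rkey */
	};

	wr.wr_cqe	= &ctx->send_cqe;
	wr.sg_list	= &sge;
	wr.num_sge	= 1;
	wr.opcode	= IB_WR_SEND;
	wr.send_flags	= IB_SEND_SIGNALED;	/* we must see the CQE */

	return ib_post_send(qp, &wr, &bad_wr);
}
--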

> So on occasion there is a Remote Access Error. That would
> trigger connection loss, and the retransmitted Send request
> is discarded (if there was externally exposed memory involved
> with the original transaction that is now invalid).

Once you get a connection loss I would think the state of all the MRs
needs to be resync'd. Running through the CQ should indicate which ones
are invalidated and which ones are still good.

> NFS has a duplicate replay cache. If it sees a repeated RPC
> XID it will send a cached reply. I guess the trick there is
> to squelch remote invalidation for such retransmits to avoid
> spurious Remote Access Errors. Should be rare, though.

.. and because of the above, if an RPC is re-issued it must be re-issued
with corrected, now-valid rkeys, and the sender must somehow detect
that the far side dropped it for replay and tear down the MRs.

> RPC-over-RDMA uses persistent registration for its inline
> buffers. The problem there is avoiding buffer reuse to soon.
> Otherwise a garbled inline message is presented on retransmit.
> Those would probably not be caught by the DRC.

We've had this discussion on the list before. You can *never* re-use a
SEND or RDMA WRITE buffer until you observe that the HCA is done with it
via a CQ poll.

> But the real problem is preventing retransmitted Sends from
> causing a ULP request to be executed multiple times.

IB RC guarantees single delivery for SEND, so that doesn't seem
possible unless the ULP re-transmits the SEND on a new QP.

> > Signalling all send completions and also finishing I/Os only after
> > we got them will add latency, and that sucks...

There is no choice, you *MUST* see the send completion before
reclaiming any resources associated with the send. Only the
completion guarantees that the HCA will not resend the packet or
otherwise continue to use the resources.

> With FRWR, won't subsequent WRs be delayed until the HCA is
> done with the Send? I don't think a signal is necessary in
> every case. Send Queue accounting currently relies on that.

No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
send packets on the wire up to some internal limit.

Only the local state changed by FRWR related op codes happens
sequentially with other SQ work.

Jason
Chuck Lever June 20, 2017, 6:17 p.m. UTC | #7
> On Jun 20, 2017, at 1:35 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Tue, Jun 20, 2017 at 01:01:39PM -0400, Chuck Lever wrote:
> 
>>>> Shouldn't this be protected somehow by the device?
>>>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>>> Say host register MR (a) and send (1) from that MR to a target,
>>>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>>>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>>>> so it retries, but ehh, its already invalidated.
> 
> I'm not sure I understand the example.. but...
> 
> If you pass a MR key to a send, then that MR must remain valid until
> the send completion is implied by an observation on the CQ. The HCA is
> free to re-execute the SEND against the MR at any time up until the
> completion reaches the CQ.
> 
> As I've explained before, a ULP must not use 'implied completion', eg
> a receive that could only have happened if the far side got the
> send. In particular this means it cannot use an incoming SEND_INV/etc
> to invalidate an MR associated with a local SEND, as that is a form
> of 'implied completion'
> 
> For sanity a MR associated with a local send should not be remote
> accessible at all, and shouldn't even have a 'rkey', just a 'lkey'.
> 
> Similarly, you cannot use a MR with SEND and remote access sanely, as
> the far end could corrupt or invalidate the MR while the local HCA is
> still using it.
> 
>> So on occasion there is a Remote Access Error. That would
>> trigger connection loss, and the retransmitted Send request
>> is discarded (if there was externally exposed memory involved
>> with the original transaction that is now invalid).
> 
> Once you get a connection loss I would think the state of all the MRs
> need to be resync'd. Running through the CQ should indicate which ones
> are invalidate and which ones are still good.
> 
>> NFS has a duplicate replay cache. If it sees a repeated RPC
>> XID it will send a cached reply. I guess the trick there is
>> to squelch remote invalidation for such retransmits to avoid
>> spurious Remote Access Errors. Should be rare, though.
> 
> .. and because of the above if a RPC is re-issued it must be re-issued
> with corrected, now-valid rkeys, and the sender must somehow detect
> that the far side dropped it for replay and tear down the MRs.

Yes, if RPC-over-RDMA ULP is involved, any externally accessible
memory will be re-registered before an RPC retransmission.

The concern is whether a retransmitted Send will be exposed
to the receiving ULP. Below you imply that it will not be, so
perhaps this is not a concern after all.


>> RPC-over-RDMA uses persistent registration for its inline
>> buffers. The problem there is avoiding buffer reuse to soon.
>> Otherwise a garbled inline message is presented on retransmit.
>> Those would probably not be caught by the DRC.
> 
> We've had this discussion on the list before. You can *never* re-use a
> SEND, or RDMA WRITE buffer until you observe the HCA is done with it
> via a CQ poll.

RPC-over-RDMA is careful to invalidate buffers that are the
target of RDMA Write before RPC completion, as we have
discussed before.

Sends are assumed to be complete when a LocalInv completes.

When we had this discussion before, you explained the problem
with retransmitted Sends, but it appears that all the ULPs we
have operate without Send completion. Others whom I trust have
suggested that operating without that extra interrupt is
preferred. The client has operated this way since it was added
to the kernel almost 10 years ago.

So I took it as a "in a perfect world" kind of admonition.
You are making a stronger and more normative assertion here.


>> But the real problem is preventing retransmitted Sends from
>> causing a ULP request to be executed multiple times.
> 
> IB RC guarentees single delivery for SEND, so that doesn't seem
> possible unless the ULP re-transmits the SEND on a new QP.
> 
>>> Signalling all send completions and also finishing I/Os only after
>>> we got them will add latency, and that sucks...
> 
> There is no choice, you *MUST* see the send completion before
> reclamining any resources associated with the send. Only the
> completion guarentees that the HCA will not resend the packet or
> otherwise continue to use the resources.

On the NFS server side, I believe every Send is signaled.

On the NFS client side, we assume LocalInv completion is
good enough.


>> With FRWR, won't subsequent WRs be delayed until the HCA is
>> done with the Send? I don't think a signal is necessary in
>> every case. Send Queue accounting currently relies on that.
> 
> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
> send packets on the wire up to some internal limit.

So if my ULP issues FastReg followed by Send followed by
LocalInv (signaled), I can't rely on the LocalInv completion
to imply that the Send is also complete?


> Only the local state changed by FRWR related op codes happens
> sequentially with other SQ work.


--
Chuck Lever



Jason Gunthorpe June 20, 2017, 7:27 p.m. UTC | #8
On Tue, Jun 20, 2017 at 02:17:39PM -0400, Chuck Lever wrote:

> The concern is whether a retransmitted Send will be exposed
> to the receiving ULP. Below you imply that it will not be, so
> perhaps this is not a concern after all.

A retransmitted SEND will never be exposed to the Receiver ULP for
Reliable Connected. That is part of the guarantee.

> > We've had this discussion on the list before. You can *never* re-use a
> > SEND, or RDMA WRITE buffer until you observe the HCA is done with it
> > via a CQ poll.
> 
> RPC-over-RDMA is careful to invalidate buffers that are the
> target of RDMA Write before RPC completion, as we have
> discussed before.
> 
> Sends are assumed to be complete when a LocalInv completes.
> 
> When we had this discussion before, you explained the problem
> with retransmitted Sends, but it appears that all the ULPs we
> have operate without Send completion. Others whom I trust have
> suggested that operating without that extra interrupt is

Operating without the interrupt is of course preferred, but that means
you have to defer the invalidate for MRs referred to by SEND until a
CQ observation as well.

> preferred. The client has operated this way since it was added
> to the kernel almost 10 years ago.

I thought the use of MRs with SEND was a new invention? If you use
the local rdma lkey with send, it is never invalidated, and this is
not an issue, which IIRC was the historical configuration for NFS.

> So I took it as a "in a perfect world" kind of admonition.
> You are making a stronger and more normative assertion here.

All ULPs must have periodic (related to SQ depth) signaled completions
or some of our supported hardware will explode.

All ULPs must flow control additions to the SQ based on CQ feedback,
or they will fail under load with SQ overflows, if this is done, then
the above happens correctly for free.

All ULPs must ensure SEND/RDMA Write resources remain stable until the
CQ indicates that work is completed. 'In a perfect world' this
includes not changing the source memory as that would cause
retransmitted packets to be different.

All ULPs must ensure the lkey remains valid until the CQ confirms
the work is done. This is not important if the lkey is always the
local rdma lkey, which is always valid.
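
A rough sketch of that SQ accounting, with made-up names and an assumed
batch size (not from any spec or driver):
--
#include <linux/atomic.h>
#include <linux/errno.h>
#include <rdma/ib_verbs.h>

#define MY_SIGNAL_BATCH	16	/* assumed batch; must be < SQ depth */

struct my_queue {
	struct ib_qp	*qp;
	atomic_t	sq_budget;	/* initialized to the SQ depth */
	int		unsignaled;	/* sends since the last signal */
};

/* Signal every Nth send so CQ feedback can replenish the send-queue
 * budget; posting is refused when the budget is exhausted.
 */
static int my_queue_send(struct my_queue *q, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	if (atomic_dec_if_positive(&q->sq_budget) < 0)
		return -EAGAIN;		/* SQ full, caller backs off */

	if (++q->unsignaled >= MY_SIGNAL_BATCH) {
		wr->send_flags |= IB_SEND_SIGNALED;
		q->unsignaled = 0;
	}
	return ib_post_send(q->qp, wr, &bad_wr);
}

/* The signaled send's completion handler replenishes the budget:
 *	atomic_add(MY_SIGNAL_BATCH, &q->sq_budget);
 */
--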

> > No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
> > send packets on the wire up to some internal limit.
> 
> So if my ULP issues FastReg followed by Send followed by
> LocalInv (signaled), I can't rely on the LocalInv completion
> to imply that the Send is also complete?

Correct.

This is explicitly defined in Table 79 of the IBA.

It describes the ordering requirements: if you order Send followed by
LocalInv, the ordering is 'L', which means they are not ordered unless
the WR has the Local Invalidate Fence bit set.

LIF is an optional feature, I do not know if any of our hardware
supports it, but it is defined to cause the local invalidate to wait
until all ongoing references to the MR are completed.

No idea on the relative performance of LIF vs doing it manually, but
the need for one or the other is unambiguously clear in the spec.

Why are you invalidating lkeys anyhow? That doesn't seem like something
that needs to happen synchronously.

Jason
Chuck Lever June 20, 2017, 8:56 p.m. UTC | #9
> On Jun 20, 2017, at 3:27 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Tue, Jun 20, 2017 at 02:17:39PM -0400, Chuck Lever wrote:
> 
>> The concern is whether a retransmitted Send will be exposed
>> to the receiving ULP. Below you imply that it will not be, so
>> perhaps this is not a concern after all.
> 
> A retransmitted SEND will never be exposed to the Reciever ULP for
> Reliable Connected. That is part of the guarantee.
> 
>>> We've had this discussion on the list before. You can *never* re-use a
>>> SEND, or RDMA WRITE buffer until you observe the HCA is done with it
>>> via a CQ poll.
>> 
>> RPC-over-RDMA is careful to invalidate buffers that are the
>> target of RDMA Write before RPC completion, as we have
>> discussed before.
>> 
>> Sends are assumed to be complete when a LocalInv completes.
>> 
>> When we had this discussion before, you explained the problem
>> with retransmitted Sends, but it appears that all the ULPs we
>> have operate without Send completion. Others whom I trust have
>> suggested that operating without that extra interrupt is
> 
> Operating without the interrupt is of course preferred, but that means
> you have to defer the invalidate for MR's refered to by SEND until a
> CQ observation as well.
> 
>> preferred. The client has operated this way since it was added
>> to the kernel almost 10 years ago.
> 
> I thought the use of MR's with SEND was a new invention? If you use
> the local rdma lkey with send, it is never invalidated, and this is
> not an issue, which IIRC, was the historical configuration for NFS.

We may be conflating things a bit.

RPC-over-RDMA client uses persistently registered buffers, using
the lkey, for inline data. The use of MRs is reserved for NFS READ
and WRITE payloads. The inline buffers are never explicitly
invalidated by RPC-over-RDMA.


>> So I took it as a "in a perfect world" kind of admonition.
>> You are making a stronger and more normative assertion here.
> 
> All ULPs must have periodic (related to SQ depth) signaled completions
> or some of our supported hardware will explode.

RPC-over-RDMA client does that.


> All ULPs must flow control additions to the SQ based on CQ feedback,
> or they will fail under load with SQ overflows, if this is done, then
> the above happens correctly for free.

RPC-over-RDMA client does that.


> All ULPs must ensure SEND/RDMA Write resources remain stable until the
> CQ indicates that work is completed. 'In a perfect world' this
> includes not changing the source memory as that would cause
> retransmitted packets to be different.

I assume you mean the sending side (the server) for RDMA
Write. I believe rdma_rw uses the local rdma lkey by default
for RDMA Write source buffers.


> All ULPs must ensure the lkey remains valid until the CQ confirms
> the work is done. This is not important if the lkey is always the
> local rdma lkey, which is always valid.

As above, Send buffers use the local rdma lkey.


>>> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
>>> send packets on the wire up to some internal limit.
>> 
>> So if my ULP issues FastReg followed by Send followed by
>> LocalInv (signaled), I can't rely on the LocalInv completion
>> to imply that the Send is also complete?
> 
> Correct.
> 
> This is explicitly defined in Table 79 of the IBA.
> 
> It describes the ordering requirements, if you order Send followed by
> LocalInv the ordering is 'L' which means they are not ordered unless
> the WR has the Local Invalidate Fence bit set.
> 
> LIF is an optional feature, I do not know if any of our hardware
> supports it, but it is defined to cause the local invalidate to wait
> until all ongoing references to the MR are completed.

Now, since there was confusion about using an MR for a
Send operation, let me clarify. If the client does:

FastReg(payload buffer)
Send(inline buffer)
...
Recv
LocalInv(payload buffer)
wait for LI completion

Is setting IB_SEND_FENCE on the LocalInv enough to ensure
that the Send is complete?

cscope seems to suggest all our devices support IB_SEND_FENCE.
Sagi mentioned some devices do this fencing automatically.


> No idea on the relative performance of LIF vs doing it manually, but
> the need for one or the other is unambigously clear in the spec.

It seems to me that the guarantee that the server sees
only one copy of the Send payload is good enough. That
means that by the time Recv completion occurs on the
client, even if the client HCA still thinks it needs to
retransmit the Send containing the RPC Call, the server
ULP has already seen and processed that Send payload,
and the HCA on the server won't deliver that payload a
second time.

The RPC Reply is evidence that the server saw the correct
RPC Call message payload, and the client always preserves
the Send's inline buffer until the reply has been received.

If the only concern about preserving that inline buffer is
guaranteeing that retransmits contain the same content, I
don't think we have a problem. All HCA retransmits of an
RPC Call, until the matching RPC Reply is received on the
client, will contain the same content.

The issue about the HCA not being able to access the inline
buffer during a retransmit is also not an issue for RPC-
over-RDMA because these buffers are always registered with
the local rdma lkey.


> Why are you invaliding lkeys anyhow, that doesn't seem like something
> that needs to happen synchronously.


--
Chuck Lever



Jason Gunthorpe June 20, 2017, 9:19 p.m. UTC | #10
On Tue, Jun 20, 2017 at 04:56:39PM -0400, Chuck Lever wrote:

> > I thought the use of MR's with SEND was a new invention? If you use
> > the local rdma lkey with send, it is never invalidated, and this is
> > not an issue, which IIRC, was the historical configuration for NFS.
> 
> We may be conflating things a bit.
> 
> RPC-over-RDMA client uses persistently registered buffers, using
> the lkey, for inline data. The use of MRs is reserved for NFS READ
> and WRITE payloads. The inline buffers are never explicitly
> invalidated by RPC-over-RDMA.

That makes much more sense, but is that the original question in this
thread? Why are we even talking about invalidate ordering then?

> > All ULPs must ensure SEND/RDMA Write resources remain stable until the
> > CQ indicates that work is completed. 'In a perfect world' this
> > includes not changing the source memory as that would cause
> > retransmitted packets to be different.
> 
> I assume you mean the sending side (the server) for RDMA
> Write. I believe rdma_rw uses the local rdma lkey by default
> for RDMA Write source buffers.

RDMA Write or SEND

> >>> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
> >>> send packets on the wire up to some internal limit.
> >> 
> >> So if my ULP issues FastReg followed by Send followed by
> >> LocalInv (signaled), I can't rely on the LocalInv completion
> >> to imply that the Send is also complete?
> > 
> > Correct.
> > 
> > This is explicitly defined in Table 79 of the IBA.
> > 
> > It describes the ordering requirements, if you order Send followed by
> > LocalInv the ordering is 'L' which means they are not ordered unless
> > the WR has the Local Invalidate Fence bit set.
> > 
> > LIF is an optional feature, I do not know if any of our hardware
> > supports it, but it is defined to cause the local invalidate to wait
> > until all ongoing references to the MR are completed.
> 
> Now, since there was confusion about using an MR for a
> Send operation, let me clarify. If the client does:

> FastReg(payload buffer)
> Send(inline buffer)
> ...
> Recv
> LocalInv(payload buffer)
> wait for LI completion

Not sure what you are describing?

Is Recv landing memory for a SEND? In that case it is using an lkey;
lkeys are not remotely usable, so it does not need synchronous
invalidation. In all cases the LocalInv must only be posted once a CQE
for the Recv is observed.

If Recv is RDMA WRITE target memory, then it is using the rkey and it
does need synchronous invalidation. This must be done once a recv
CQE is observed, or optimized by having the other side send via one of
the _INV operations.

In no case can you pipeline a LocalInv into the SQ that would impact
RQ activity, even with any of the fences.

> Is setting IB_SEND_FENCE on the LocalInv enough to ensure
> that the Send is complete?

No.

There are two fences in the spec; IB_SEND_FENCE is the mandatory one,
and it only interacts with RDMA READ and ATOMIC entries.

Local Invalidate Fence (the optional one) also will not order the two,
because LIF is only defined to order against SQEs that use the
MR. Since Send is using the global dma lkey it does not interact with
the LocalInv, and LIF will not order them.

> > No idea on the relative performance of LIF vs doing it manually, but
> > the need for one or the other is unambigously clear in the spec.
> 
> It seems to me that the guarantee that the server sees
> only one copy of the Send payload is good enough. That
> means that by the time Recv completion occurs on the
> client, even if the client HCA still thinks it needs to
> retransmit the Send containing the RPC Call, the server
> ULP has already seen and processed that Send payload,
> and the HCA on the server won't deliver that payload a
> second time.

Yes, that is OK reasoning.

> If the only concern about preserving that inline buffer is
> guaranteeing that retransmits contain the same content, I
> don't think we have a problem. All HCA retransmits of an
> RPC Call, until the matching RPC Reply is received on the
> client, will contain the same content.

Right.

> The issue about the HCA not being able to access the inline
> buffer during a retransmit is also not an issue for RPC-
> over-RDMA because these buffers are always registered with
> the local rdma lkey.

Exactly.

Jason
Sagi Grimberg June 27, 2017, 7:22 a.m. UTC | #11
>> I don't understand, is this new with the patch applied?
> 
> I applied your patch to 4.12-rc6 on the initiator, but my targets are
> still 4.9.33 since it looked like the patch only affected the
> initiator. I did not see this before your patch, but I also didn't try
> rebooting the targets multiple times before because of the previous
> messages.

That sounds like a separate issue. Should we move forward with the
suggested patch?

>>> After this and a reboot of the target, the initiator would drop the
>>> connection after 1.5-2 minutes then faster and faster until it was
>>> every 5 seconds. It is almost like it set up the connection then lose
>>> the first ping, or the ping wasn't set-up right. I tried rebooting the
>>> target multiple times.
>>
>>
>> So the initiator could not recover even after the target as available
>> again?
> 
> The initiator recovered the connection when the target came back, but
> the connection was not stable. I/O would happen on the connection,
> then it would get shaky and then finally disconnect. Then it would
> reconnect, pass more I/O, then get shaky and go down again. With the 5
> second disconnects, it would pass traffic for 5 seconds, then as soon
> as I saw the ping timeout, the I/O would stop until it reconnected. At
> that point it seems that the lack of pings would kill the I/O unlike
> earlier where there was a stall in I/O and then the connection would
> be torn down. I can try to see if I can get it to happen again.

So looks like the target is not responding to NOOP_OUTs (or traffic
at all for that matter).

The messages:
[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260

Are indicating that something is stuck in the login thread, not sure
where though. Did you see a watchdog popping on a hang?

And the message:
[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
transport retry counter exceeded (12) vend_err 81

Is an indication that the rdma fabric is in some error state.

On which reboot attempt did all this happen? The first one?

Again, CCing target-devel.
Sagi Grimberg June 27, 2017, 7:37 a.m. UTC | #12
Jason,

>> The issue about the HCA not being able to access the inline
>> buffer during a retransmit is also not an issue for RPC-
>> over-RDMA because these buffers are always registered with
>> the local rdma lkey.
> 
> Exactly.

Lost track of the thread...


Indeed you raised this issue lots of times before, and I failed to see
why it's important or why it's error-prone, but now I do...

My apologies for not listening :(

We should fix _all_ initiators for it, nvme-rdma, iser, srp
and xprtrdma (and probably some more ULPs out there)...

It also means that we cannot really suppress any send completions as
that would result in an unpredictable latency (which is not acceptable).

I wish we could somehow tell the HCA that it can ignore an access
failure to a specific address when retransmitting... but maybe it's too
much to ask...
Chuck Lever June 27, 2017, 2:42 p.m. UTC | #13
> On Jun 27, 2017, at 3:37 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> Jason,
> 
>>> The issue about the HCA not being able to access the inline
>>> buffer during a retransmit is also not an issue for RPC-
>>> over-RDMA because these buffers are always registered with
>>> the local rdma lkey.
>> Exactly.
> 
> Lost track of the thread...
> 
> 
> Indeed you raised this issue lots of times before, and I failed to see
> why its important or why its error prone, but now I do...
> 
> My apologies for not listening :(
> 
> We should fix _all_ initiators for it, nvme-rdma, iser, srp
> and xprtrdma (and probably some more ULPs out there)...

Go back and browse the end of the thread: there's no need to change
xprtrdma, and maybe no need to change the others either.


> It also means that we cannot really suppress any send completions as
> that would result in an unpredictable latency (which is not acceptable).
> 
> I wish we could somehow tell the HCA that it can ignore access fail to a
> specific address when retransmitting.. but maybe its too much to ask...

--
Chuck Lever



Sagi Grimberg June 27, 2017, 4:07 p.m. UTC | #14
> Go back and browse the end of the thread: there's no need to change
> xprtrdma, and maybe no need to change the others either.

I think there is. Even with inline, xprtrdma DMA-maps the immediate
buffers, as well as the message head and tail, so unmapping these
buffers without waiting for the send completion would trigger an IOMMU
access error (the HCA (re)tries to access an already unmapped buffer).
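
A sketch of the fix that implies (my_inline_buf and my_req_put are
hypothetical names): the ib_dma_unmap_single() has to move into the
send completion handler:
--
#include <rdma/ib_verbs.h>

struct my_request;
static void my_req_put(struct my_request *req);	/* hypothetical */

struct my_inline_buf {
	struct ib_cqe	cqe;		/* -> my_send_done() */
	u64		dma_addr;
	size_t		len;
	struct my_request *req;
};

/* DMA-unmap the inline buffer only from the send completion handler,
 * never from the response path, so a retransmit still reads mapped
 * memory.
 */
static void my_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct my_inline_buf *buf =
		container_of(wc->wr_cqe, struct my_inline_buf, cqe);

	ib_dma_unmap_single(cq->device, buf->dma_addr, buf->len,
			    DMA_TO_DEVICE);
	my_req_put(buf->req);	/* may now complete the I/O */
}
--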
Jason Gunthorpe June 27, 2017, 4:28 p.m. UTC | #15
On Tue, Jun 27, 2017 at 07:07:08PM +0300, Sagi Grimberg wrote:
> 
> >Go back and browse the end of the thread: there's no need to change
> >xprtrdma, and maybe no need to change the others either.
> 
> I think there is, even with inline, xprtrdma dma maps the immediate
> buffers, also the message head and tail so unmapping these buffers
> without waiting for the send completion would trigger a IOMMU access
> error (the HCA (re)tries to access an already unmapped buffer).

Yes, that is an excellent observation. When using the local rdma lkey
you still need to ensure the linux API DMA map remains until
completion.

send completion mitigation is still possible, if it is OK for the
backing pages to remain, but I think a more sophisticated strategy is
needed - e.g. maybe push some kind of NOP through the send queue after
a timer, or only signal completion when the last available work is
queued, or something.
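
A very rough sketch of the second idea, purely hypothetical and not a
worked-out design: leave sends unsignaled while more work is queued,
but signal the last one so the SQ drains to a CQE on idle (periodic
signaling relative to SQ depth would still be required):
--
#include <rdma/ib_verbs.h>

struct my_queue {
	struct ib_qp	*qp;
};

/* Leave sends unsignaled while more work is queued; the last posted WR
 * is signaled so the SQ always drains to a CQE when it goes idle.
 */
static int my_post(struct my_queue *q, struct ib_send_wr *wr, bool more)
{
	struct ib_send_wr *bad_wr;

	if (!more)	/* nothing queued behind this one */
		wr->send_flags |= IB_SEND_SIGNALED;
	return ib_post_send(q->qp, wr, &bad_wr);
}
--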

Jason
Chuck Lever June 27, 2017, 4:28 p.m. UTC | #16
> On Jun 27, 2017, at 12:07 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> Go back and browse the end of the thread: there's no need to change
>> xprtrdma, and maybe no need to change the others either.
> 
> I think there is, even with inline, xprtrdma dma maps the immediate
> buffers, also the message head and tail so unmapping these buffers
> without waiting for the send completion would trigger a IOMMU access
> error (the HCA (re)tries to access an already unmapped buffer).

Thinking out loud:

IIRC the message head and tail reside in the persistently registered
and DMA mapped buffers with few exceptions.

However, when page cache pages are involved, xprtrdma will do a DMA
unmap as you say.

So:
- we don't have a problem transmitting a garbled request thanks to
  exactly-once receive semantics
- we don't have a problem with the timing of registration and
  invalidation on the initiator because the PD's DMA lkey is used
- we do have a problem with DMA unmap

Using only persistently registered and DMA mapped Send buffers
should avoid the need to signal all Sends. However, when page
cache pages are involved, then the Send needs to be signaled,
and the pages unmapped only after Send completion, to be completely
safe.

Or, the initiator should just use RDMA Read and Write, and stick
with small inline sizes.
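
To sketch the conditional-signaling idea above (the my_req/needs_unmap
names are hypothetical, not from xprtrdma):
--
#include <rdma/ib_verbs.h>

struct my_req {
	bool	needs_unmap;	/* true when page cache pages are mapped */
};

/* Signal only sends whose pages must be DMA-unmapped once the HCA is
 * done; persistently mapped buffers stay unsignaled (subject to the
 * periodic-signaling rule discussed earlier in the thread).
 */
static void my_set_send_flags(struct my_req *req, struct ib_send_wr *wr)
{
	wr->send_flags = req->needs_unmap ? IB_SEND_SIGNALED : 0;
}
--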


--
Chuck Lever



Bart Van Assche June 27, 2017, 6:08 p.m. UTC | #17
On Tue, 2017-06-27 at 10:37 +0300, Sagi Grimberg wrote:
> Jason,
> 
> > > The issue about the HCA not being able to access the inline
> > > buffer during a retransmit is also not an issue for RPC-
> > > over-RDMA because these buffers are always registered with
> > > the local rdma lkey.
> > 
> > Exactly.
> 
> Lost track of the thread...
> 
> 
> Indeed you raised this issue lots of times before, and I failed to see
> why its important or why its error prone, but now I do...
> 
> My apologies for not listening :(
> 
> We should fix _all_ initiators for it, nvme-rdma, iser, srp
> and xprtrdma (and probably some more ULPs out there)...
> 
> It also means that we cannot really suppress any send completions as
> that would result in an unpredictable latency (which is not acceptable).
> 
> I wish we could somehow tell the HCA that it can ignore access fail to a
> specific address when retransmitting.. but maybe its too much to ask...

Hello Sagi,

Can you clarify why you think that the SRP initiator needs to be changed?
The SRP initiator submits the local invalidate work request after the RDMA
write request. According to table 79 "Work Request Operation Ordering" the
order of these work requests must be maintained by the HCA. I think that if
an HCA were to invalidate the MR before the remote HCA has acknowledged
the written data, that would be a firmware bug.

The upstream SRP initiator does not use inline data.

Bart.
Jason Gunthorpe June 27, 2017, 6:14 p.m. UTC | #18
On Tue, Jun 27, 2017 at 06:08:57PM +0000, Bart Van Assche wrote:

> Can you clarify why you think that the SRP initiator needs to be changed?
> The SRP initiator submits the local invalidate work request after the RDMA
> write request. According to table 79 "Work Request Operation
> Ordering" the

That table has an 'L' for invalidate that follows RDMA Write.

'L' means they are not ordered unless the optional Local Invalidate
Fence mode is used.

Jason
Sagi Grimberg June 28, 2017, 7:03 a.m. UTC | #19
> send completion mitigation is still possible, if it is OK for the
> backing pages to remain, but I think a more sophisticated strategy is
> needed - eg maybe push some kind of NOP through the send q after a
> timer or only complete when the last available work is stuffed or
> something.

I'm not smart enough to come up with something like that.
Sagi Grimberg June 28, 2017, 7:08 a.m. UTC | #20
> Thinking out loud:
> 
> IIRC the message head and tail reside in the persistently registered
> and DMA mapped buffers with few exceptions.
> 
> However, when page cache pages are involved, xprtrdma will do a DMA
> unmap as you say.
> 
> So:
> - we don't have a problem transmitting a garbled request thanks to
>    exactly-once receive semantics
> - we don't have a problem with the timing of registration and
>    invalidation on the initiator because the PD's DMA lkey is used
> - we do have a problem with DMA unmap
> 
> Using only persistently registered and DMA mapped Send buffers
> should avoid the need to signal all Sends. However, when page
> cache pages are involved, then the Send needs to be signaled,
> and the pages unmapped only after Send completion, to be completely
> safe.

How do you know when that happens?
Sagi Grimberg June 28, 2017, 7:16 a.m. UTC | #21
> Hello Sagi,

Hi Bart,

> Can you clarify why you think that the SRP initiator needs to be changed?
> The SRP initiator submits the local invalidate work request after the RDMA
> write request. According to table 79 "Work Request Operation Ordering" the
> order of these work requests must be maintained by the HCA. I think that
> if an HCA starts invalidating the MR before the remote HCA has
> acknowledged the written data, that's a firmware bug.

That flow is fine, we were discussing immediate data sends.

SRP only needs fixing by waiting for all local invalidates to
complete before unmapping the user buffers and completing the I/O.

BTW, did the efforts on standardizing remote invalidate in SRP ever
evolve into something? Would it be acceptable to add a non-standard
extension to ib_srp and ib_srpt? We can default it to off and allow
the user to opt in if they know the two sides comply...

We need to fix that in nvmf and iser too btw (luckily we have remote
invalidate so it's not a big issue).

> The upstream SRP initiator does not use inline data.

Yet :)

IIRC you have a branch with immediate-data support against scst so
it might be relevant there...
Bart Van Assche June 28, 2017, 9:43 a.m. UTC | #22
On Wed, 2017-06-28 at 10:16 +0300, Sagi Grimberg wrote:
> BTW, did the efforts on standardizing remote invalidate in SRP ever
> evolve into something? Would it be acceptable to add a non-standard
> extension to ib_srp and ib_srpt? We can default it to off and allow
> the user to opt in if they know the two sides comply...

I'd like to hear Doug's opinion about whether it's OK to add this
feature to the upstream drivers before support for remote invalidate
is standardized.

> We need to fix that in nvmf and iser too btw (luckily we have remote
> invalidate so it's not a big issue).
> 
> > The upstream SRP initiator does not use inline data.
> 
> Yet :)
> 
> IIRC you have a branch with immediate-data support against scst so
> it might be relevant there...

Support for immediate data is being standardized. My plan is to add
support for immediate data to the upstream drivers once the T10 committee
agrees about the implementation. See also
http://www.t10.org/cgi-bin/ac.pl?t=f&f=srp2r02a.pdf

Bart.
Chuck Lever June 28, 2017, 4:11 p.m. UTC | #23
> On Jun 28, 2017, at 3:08 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> Thinking out loud:
>> IIRC the message head and tail reside in the persistently registered
>> and DMA mapped buffers with few exceptions.
>> However, when page cache pages are involved, xprtrdma will do a DMA
>> unmap as you say.
>> So:
>> - we don't have a problem transmitting a garbled request thanks to
>>   exactly-once receive semantics
>> - we don't have a problem with the timing of registration and
>>   invalidation on the initiator because the PD's DMA lkey is used
>> - we do have a problem with DMA unmap
>> Using only persistently registered and DMA mapped Send buffers
>> should avoid the need to signal all Sends. However, when page
>> cache pages are involved, then the Send needs to be signaled,
>> and the pages unmapped only after Send completion, to be completely
>> safe.
> 
> How do you know when that happens?

The RPC Call send path sets up the Send SGE array. If it includes
page cache pages, it can set IB_SEND_SIGNALED.

The SGE array and the ib_cqe for the send are in the same data
structure, so the Send completion handler can find the SGE array
and figure out what needs to be unmapped.
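
Something like this minimal sketch (rpcrdma_sendctx and its fields
are hypothetical names, not current xprtrdma code; it assumes the
SGEs that need unmapping were recorded at the front of the array):

--
struct rpcrdma_sendctx {
	struct ib_cqe		sc_cqe;
	struct ib_device	*sc_device;
	unsigned int		sc_unmap_count;
	struct ib_sge		sc_sges[RPCRDMA_MAX_SEND_SGES];
};

static void rpcrdma_sendctx_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct rpcrdma_sendctx *sc =
		container_of(wc->wr_cqe, struct rpcrdma_sendctx, sc_cqe);
	unsigned int i;

	/* Unmap only the page cache pages; persistently mapped
	 * buffers are left alone. */
	for (i = 0; i < sc->sc_unmap_count; i++)
		ib_dma_unmap_page(sc->sc_device, sc->sc_sges[i].addr,
				  sc->sc_sges[i].length, DMA_TO_DEVICE);
}
--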

The only problem is if a POSIX signal fires. In that case the
data structure can be released before the Send completion fires,
and we get touch-after-free in the completion handler.

I'm thinking that it just isn't going to be practical to handle
unmapping this way, and I should just revert back to using RDMA
Read instead of adding page cache pages to the Send SGE.


--
Chuck Lever



Sagi Grimberg June 29, 2017, 5:35 a.m. UTC | #24
>> How do you know when that happens?
> 
> The RPC Call send path sets up the Send SGE array. If it includes
> page cache pages, it can set IB_SEND_SIGNALED.
> 
> The SGE array and the ib_cqe for the send are in the same data
> structure, so the Send completion handler can find the SGE array
> and figure out what needs to be unmapped.
> 
> The only problem is if a POSIX signal fires. In that case the
> data structure can be released before the Send completion fires,
> and we get touch-after-free in the completion handler.
> 
> I'm thinking that it just isn't going to be practical to handle
> unmapping this way, and I should just revert back to using RDMA
> Read instead of adding page cache pages to the Send SGE.

Or wait for the send completion before completing the I/O?
Chuck Lever June 29, 2017, 2:55 p.m. UTC | #25
> On Jun 29, 2017, at 1:35 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>> How do you know when that happens?
>> The RPC Call send path sets up the Send SGE array. If it includes
>> page cache pages, it can set IB_SEND_SIGNALED.
>> The SGE array and the ib_cqe for the send are in the same data
>> structure, so the Send completion handler can find the SGE array
>> and figure out what needs to be unmapped.
>> The only problem is if a POSIX signal fires. In that case the
>> data structure can be released before the Send completion fires,
>> and we get touch-after-free in the completion handler.
>> I'm thinking that it just isn't going to be practical to handle
>> unmapping this way, and I should just revert back to using RDMA
>> Read instead of adding page cache pages to the Send SGE.
> 
> Or wait for the send completion before completing the I/O?

In the normal case, that works.

If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
and recovers all resources. The Send can still be running at that
point, and it can't be stopped (without transitioning the QP to
error state, I guess).

The alternative is reference-counting the data structure that has
the ib_cqe and the SGE array. That adds one or more atomic_t
operations per I/O that I'd like to avoid.


--
Chuck Lever



Sagi Grimberg July 2, 2017, 9:45 a.m. UTC | #26
>> Or wait for the send completion before completing the I/O?
> 
> In the normal case, that works.
> 
> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
> and recovers all resources. The Send can still be running at that
> point, and it can't be stopped (without transitioning the QP to
> error state, I guess).

In that case we can't complete the I/O either (short of moving the
QP into error state); we need to defer/sleep on the send completion.


> The alternative is reference-counting the data structure that has
> the ib_cqe and the SGE array. That adds one or more atomic_t
> operations per I/O that I'd like to avoid.

Why atomics?
Chuck Lever July 2, 2017, 6:17 p.m. UTC | #27
> On Jul 2, 2017, at 5:45 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>> Or wait for the send completion before completing the I/O?
>> In the normal case, that works.
>> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
>> and recovers all resources. The Send can still be running at that
>> point, and it can't be stopped (without transitioning the QP to
>> error state, I guess).
> 
> In that case we can't complete the I/O either (short of moving the
> QP into error state); we need to defer/sleep on the send completion.

Unfortunately the RPC client finite state machine mustn't
sleep when a POSIX signal fires. xprtrdma has to unblock the
waiting application process but clean up the resources
asynchronously.

The RPC completion doesn't have to wait on DMA unmapping the
send buffer. What would have to wait is cleaning up the
resources -- in particular, allowing the rpcrdma_req
structure, where the send SGEs are kept, to be re-used. In
the current design, both happen at the same time.


>> The alternative is reference-counting the data structure that has
>> the ib_cqe and the SGE array. That adds one or more atomic_t
>> operations per I/O that I'd like to avoid.
> 
> Why atomics?

Either an atomic reference count or a spin lock is necessary
because there are two different ways an RPC can exit:

1. The common way, which is through receipt of an RPC reply,
handled by rpcrdma_reply_handler.

2. POSIX signal, where the RPC reply races with the wake-up
of the application process (in other words, the reply can
still arrive while the RPC is terminating).

In both cases, the RPC client has to invalidate any
registered memory, and it has to be done no more and no less
than once.
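
Something like this minimal sketch, where whichever exit path gets
there first performs the invalidation and the loser skips it (the
RPCRDMA_REQ_F_TX_INVALIDATED flag and rl_flags field are made up):

--
if (!test_and_set_bit(RPCRDMA_REQ_F_TX_INVALIDATED, &req->rl_flags))
	r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
--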

I deal with some of this in my for-13 patches:

http://marc.info/?l=linux-nfs&m=149693711119727&w=2

The first seven patches handle the race condition and the
need for exactly-once invalidation.

But the issue with unmapping the Send buffers has to do
with how the Send SGEs are managed. The data structure
containing the SGEs goes away once the RPC is complete.

So there are two "users": one is the RPC completion, and
one is the Send completion. Once both are done, the data
structure can be released. But RPC completion can't wait
if the Send completion hasn't yet fired.

I could kmalloc the SGE array instead, signal each Send,
and then in the Send completion handler, unmap the SGEs
and then kfree the SGE array. That's a lot of overhead.

Or I could revert all the "map page cache pages" logic and
just use memcpy for small NFS WRITEs, and RDMA the rest of
the time. That keeps everything simple, but means large
inline thresholds can't use send-in-place.

I'm still open to suggestion. for-4.14 will deal with other
problems, unless an obvious and easy fix arises.


--
Chuck Lever



Jason Gunthorpe July 9, 2017, 4:47 p.m. UTC | #28
On Sun, Jul 02, 2017 at 02:17:52PM -0400, Chuck Lever wrote:

> I could kmalloc the SGE array instead, signal each Send,
> and then in the Send completion handler, unmap the SGEs
> and then kfree the SGE array. That's a lot of overhead.

Usually after allocating the send queue you'd pre-allocate all the
tracking memory needed for each SQE - eg enough information to do the
dma unmaps/etc?

> Or I could revert all the "map page cache pages" logic and
> just use memcpy for small NFS WRITEs, and RDMA the rest of
> the time. That keeps everything simple, but means large
> inline thresholds can't use send-in-place.

Don't you have the same problem with RDMA WRITE?

Jason
Chuck Lever July 10, 2017, 7:03 p.m. UTC | #29
> On Jul 9, 2017, at 12:47 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Sun, Jul 02, 2017 at 02:17:52PM -0400, Chuck Lever wrote:
> 
>> I could kmalloc the SGE array instead, signal each Send,
>> and then in the Send completion handler, unmap the SGEs
>> and then kfree the SGE array. That's a lot of overhead.
> 
> Usually after allocating the send queue you'd pre-allocate all the
> tracking memory needed for each SQE - eg enough information to do the
> dma unmaps/etc?

Right. In xprtrdma, the QP resources are allocated in rpcrdma_ep_create.
For every RPC-over-RDMA credit, rpcrdma_buffer_create allocates an
rpcrdma_req structure, which contains an ib_cqe and an array of SGEs for
the Send, and a number of other resources used to maintain registration
state during an RPC-over-RDMA call. Both of these functions are invoked
during transport instance set-up.

The problem is the lifetime for the rpcrdma_req structure. Currently it
is acquired when an RPC is started, and it is released when the RPC
terminates.

Inline send buffers are never unmapped until transport tear-down, but
since:

commit 655fec6987be05964e70c2e2efcbb253710e282f
Author:     Chuck Lever <chuck.lever@oracle.com>
AuthorDate: Thu Sep 15 10:57:24 2016 -0400
Commit:     Anna Schumaker <Anna.Schumaker@Netapp.com>
CommitDate: Mon Sep 19 13:08:38 2016 -0400

    xprtrdma: Use gathered Send for large inline messages

Part of the Send payload can come from page cache pages for NFS WRITE
and NFS SYMLINK operations. Send buffers that are page cache pages are
DMA unmapped when rpcrdma_req is released.

IIUC what Sagi found is that Send WRs can continue running even after an
RPC completes in certain pathological cases. Therefore the Send WR can
complete after the rpcrdma_req is released and page cache-related Send
buffers have been unmapped.

It's not an issue to make the RPC reply handler wait for Send completion.
In most cases, this is not going to add any additional latency because
the Send will complete long before the RPC reply arrives. That is by far
the common case, and there the signaled Send is an extra completion
interrupt for nothing.

The problem arises if the RPC is terminated locally before the reply
arrives. Suppose, for example, user hits ^C, or a timer fires. Then the
rpcrdma_req can be released and re-used before the Send completes.
There's no way to make RPC completion wait for Send completion.

One option is to somehow split the Send-related data structures from
rpcrdma_req, and manage them independently. I've already done that for
MRs: MR state is now located in rpcrdma_mw.

If instead I just never DMA map page cache pages, then all Send buffers
are always left DMA mapped while the transport is active. There's no
problem there with Send retransmits. The cost is that I have to
either copy data into the Send buffers, or force the server to use RDMA
Read, which has a palpable overhead.
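
For reference, the copy path is straightforward. A sketch, with
illustrative names rather than existing xprtrdma code:

--
/* Pull the page cache data into the persistently mapped Send
 * buffer so no per-I/O DMA mapping of page cache pages is needed. */
static void rpcrdma_pullup_pages(unsigned char *dst, struct xdr_buf *xdr)
{
	unsigned int remaining = xdr->page_len;
	unsigned int i = xdr->page_base >> PAGE_SHIFT;
	unsigned int off = offset_in_page(xdr->page_base);

	while (remaining) {
		unsigned int len = min_t(unsigned int,
					 PAGE_SIZE - off, remaining);
		unsigned char *src = kmap_atomic(xdr->pages[i]);

		memcpy(dst, src + off, len);
		kunmap_atomic(src);
		dst += len;
		remaining -= len;
		off = 0;
		i++;
	}
}
--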


>> Or I could revert all the "map page cache pages" logic and
>> just use memcpy for small NFS WRITEs, and RDMA the rest of
>> the time. That keeps everything simple, but means large
>> inline thresholds can't use send-in-place.
> 
> Don't you have the same problem with RDMA WRITE?

The server side initiates RDMA Writes. The final RDMA Write in a WR
chain is signaled, but a subsequent Send completion is used to
determine when the server may release resources used for the Writes.
We're already doing it the slow way there, and there's no ^C hazard
on the server.


--
Chuck Lever



Jason Gunthorpe July 10, 2017, 8:05 p.m. UTC | #30
On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:

> One option is to somehow split the Send-related data structures from
> rpcrdma_req, and manage them independently. I've already done that for
> MRs: MR state is now located in rpcrdma_mw.

Yes, this is what I was implying... Track the SQE-related stuff
separately in memory allocated during SQ setup - MRs, DMA maps, etc.

No need for an atomic/lock then, right? The required memory is bounded
since the inline send depth is bounded.

Jason
Chuck Lever July 10, 2017, 8:51 p.m. UTC | #31
> On Jul 10, 2017, at 4:05 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> 
>> One option is to somehow split the Send-related data structures from
>> rpcrdma_req, and manage them independently. I've already done that for
>> MRs: MR state is now located in rpcrdma_mw.
> 
> Yes, this is what I was implying... Track the SQE-related stuff
> separately in memory allocated during SQ setup - MRs, DMA maps, etc.

> No need for an atomic/lock then, right? The required memory is bounded
> since the inline send depth is bounded.

Perhaps I lack some imagination, but I don't see how I can manage
these small objects without a serialized free list or circular
array that would be accessed in the forward path and also in a
Send completion handler.

And I would still have to signal all Sends, which is extra
interrupts and context switches.

This seems like a lot of overhead to deal with a very uncommon
case. I can reduce this overhead by signaling only Sends that
need to unmap page cache pages, but still.


But I also realized that Send Queue accounting can be broken by a
delayed Send completion.

As we previously discussed, xprtrdma does SQ accounting using RPC
completion as the gate. Basically xprtrdma will send another RPC
as soon as a previous one is terminated. If the Send WR is still
running when the RPC terminates, I can potentially overrun the
Send Queue.
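
One way to keep the accounting precise would be to gate posting on
Send completions instead. A sketch (struct rpcrdma_ep is real, but
the rep_sq_* fields are invented):

--
static bool rpcrdma_sq_reserve(struct rpcrdma_ep *ep)
{
	if (atomic_inc_return(&ep->rep_sq_posted) > ep->rep_sq_depth) {
		atomic_dec(&ep->rep_sq_posted);
		return false;	/* SQ full, caller must wait */
	}
	return true;
}

/* called from the Send completion handler */
static void rpcrdma_sq_release(struct rpcrdma_ep *ep)
{
	atomic_dec(&ep->rep_sq_posted);
}
--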


--
Chuck Lever



Jason Gunthorpe July 10, 2017, 9:14 p.m. UTC | #32
On Mon, Jul 10, 2017 at 04:51:20PM -0400, Chuck Lever wrote:
> 
> > On Jul 10, 2017, at 4:05 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> > 
> > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> > 
> >> One option is to somehow split the Send-related data structures from
> >> rpcrdma_req, and manage them independently. I've already done that for
> >> MRs: MR state is now located in rpcrdma_mw.
> > 
> > Yes, this is what I was implying... Track the SQE-related stuff
> > separately in memory allocated during SQ setup - MRs, DMA maps, etc.
> 
> > No need for an atomic/lock then, right? The required memory is bounded
> > since the inline send depth is bounded.
> 
> Perhaps I lack some imagination, but I don't see how I can manage
> these small objects without a serialized free list or circular
> array that would be accessed in the forward path and also in a
> Send completion handler.

I don't get it; DMA unmap can only ever happen in the send completion
handler, never in the forward path (this is the whole point of this
thread).

Since you are not using send completion today you can just use the
wr_id to point to the pre-allocated memory describing the pages to
unmap. Completely remove DMA unmap from the forward path.

Usually I work things out so that the meta-data array is a ring and
every SQE post consumes a meta-data entry. Then occasionally I signal
completion and provide a wr_id of the latest ring index, and the
completion handler runs through all the accumulated meta-data and acts
on it (e.g. unmaps, etc.). This approach still allows batching
completions.

Since ring entries are of bounded size, we just preallocate the largest
size at QP creation. In this case it is some multiple of the number of
inline send pages times the number of SQE entries.
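
Roughly like this (sketch, all names invented):

--
/* One meta entry per SQE; signal every so often, and the completion
 * handler reaps everything up to the ring index carried in wr_id. */
struct sq_meta {
	unsigned int	npages;
	u64		addr[8];	/* DMA addresses to unmap */
	unsigned int	len[8];
};

struct sq_ring {
	struct sq_meta	*meta;		/* sq_depth entries */
	unsigned int	sq_depth;
	unsigned int	cons;		/* ring index of next reap */
};

static void sq_reap(struct ib_device *dev, struct sq_ring *r,
		    unsigned int last)	/* from wc->wr_id */
{
	while (r->cons != (last + 1) % r->sq_depth) {
		struct sq_meta *m = &r->meta[r->cons];
		unsigned int i;

		for (i = 0; i < m->npages; i++)
			ib_dma_unmap_page(dev, m->addr[i], m->len[i],
					  DMA_TO_DEVICE);
		m->npages = 0;
		r->cons = (r->cons + 1) % r->sq_depth;
	}
}
--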

> This seems like a lot of overhead to deal with a very uncommon
> case. I can reduce this overhead by signaling only Sends that
> need to unmap page cache pages, but still.

Yes, but it is not avoidable..

> As we previously discussed, xprtrdma does SQ accounting using RPC
> completion as the gate. Basically xprtrdma will send another RPC
> as soon as a previous one is terminated. If the Send WR is still
> running when the RPC terminates, I can potentially overrun the
> Send Queue.

Makes sense. The SQ accounting must be precise.

Jason
Jason Gunthorpe July 10, 2017, 9:24 p.m. UTC | #33
On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:

> >> Or I could revert all the "map page cache pages" logic and
> >> just use memcpy for small NFS WRITEs, and RDMA the rest of
> >> the time. That keeps everything simple, but means large
> >> inline thresholds can't use send-in-place.
> > 
> > Don't you have the same problem with RDMA WRITE?
> 
> The server side initiates RDMA Writes. The final RDMA Write in a WR
> chain is signaled, but a subsequent Send completion is used to
> determine when the server may release resources used for the Writes.
> We're already doing it the slow way there, and there's no ^C hazard
> on the server.

Wait, I guess I meant RDMA READ path.

The same constraints apply to RKeys as inline send - you cannot DMA
unmap rkey memory until the rkey is invalidated at the HCA.

So posting an invalidate SQE and then immediately unmapping the DMA
pages is bad too..

No matter how the data is transfered the unmapping must follow the
same HCA synchronous model.. DMA unmap must only be done from the send
completion handler (inline send or invalidate rkey), from the recv
completion handler (send with invalidate), or from QP error state teardown.

Anything that does DMA memory unmap from another thread is very, very
suspect, eg async from a ctrl-c trigger event.

Jason
Chuck Lever July 10, 2017, 9:29 p.m. UTC | #34
> On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> 
>>>> Or I could revert all the "map page cache pages" logic and
>>>> just use memcpy for small NFS WRITEs, and RDMA the rest of
>>>> the time. That keeps everything simple, but means large
>>>> inline thresholds can't use send-in-place.
>>> 
>>> Don't you have the same problem with RDMA WRITE?
>> 
>> The server side initiates RDMA Writes. The final RDMA Write in a WR
>> chain is signaled, but a subsequent Send completion is used to
>> determine when the server may release resources used for the Writes.
>> We're already doing it the slow way there, and there's no ^C hazard
>> on the server.
> 
> Wait, I guess I meant RDMA READ path.
> 
> The same constraints apply to RKeys as inline send - you cannot DMA
> unmap rkey memory until the rkey is invalidated at the HCA.
> 
> So posting an invalidate SQE and then immediately unmapping the DMA
> pages is bad too..
> 
> No matter how the data is transfered the unmapping must follow the
> same HCA synchronous model.. DMA unmap must only be done from the send
> completion handler (inline send or invalidate rkey), from the recv
> completion handler (send with invalidate), or from QP error state teardown.
> 
> Anything that does DMA memory unmap from another thread is very, very
> suspect, eg async from a ctrl-c trigger event.

4.13 server side is converted to use the rdma_rw API for
handling RDMA Read. For non-iWARP cases, it's using the
local DMA key for Read sink buffers. For iWARP it should
be using Read-with-invalidate (IIRC).

--
Chuck Lever



Jason Gunthorpe July 10, 2017, 9:32 p.m. UTC | #35
On Mon, Jul 10, 2017 at 05:29:53PM -0400, Chuck Lever wrote:
> 
> > On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> > 
> > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> > 
> >>>> Or I could revert all the "map page cache pages" logic and
> >>>> just use memcpy for small NFS WRITEs, and RDMA the rest of
> >>>> the time. That keeps everything simple, but means large
> >>>> inline thresholds can't use send-in-place.
> >>> 
> >>> Don't you have the same problem with RDMA WRITE?
> >> 
> >> The server side initiates RDMA Writes. The final RDMA Write in a WR
> >> chain is signaled, but a subsequent Send completion is used to
> >> determine when the server may release resources used for the Writes.
> >> We're already doing it the slow way there, and there's no ^C hazard
> >> on the server.
> > 
> > Wait, I guess I meant RDMA READ path.
> > 
> > The same constraints apply to RKeys as inline send - you cannot DMA
> > unmap rkey memory until the rkey is invalidated at the HCA.
> > 
> > So posting an invalidate SQE and then immediately unmapping the DMA
> > pages is bad too..
> > 
> > No matter how the data is transfered the unmapping must follow the
> > same HCA synchronous model.. DMA unmap must only be done from the send
> > completion handler (inline send or invalidate rkey), from the recv
> > completion handler (send with invalidate), or from QP error state teardown.
> > 
> > Anything that does DMA memory unmap from another thread is very, very
> > suspect, eg async from a ctrl-c trigger event.
> 
> 4.13 server side is converted to use the rdma_rw API for
> handling RDMA Read. For non-iWARP cases, it's using the
> local DMA key for Read sink buffers. For iWARP it should
> be using Read-with-invalidate (IIRC).

The server sounds fine, how does the client work?

Jason
Chuck Lever July 10, 2017, 10:04 p.m. UTC | #36
> On Jul 10, 2017, at 5:32 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Mon, Jul 10, 2017 at 05:29:53PM -0400, Chuck Lever wrote:
>> 
>>> On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>>> 
>>> On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
>>> 
>>>>>> Or I could revert all the "map page cache pages" logic and
>>>>>> just use memcpy for small NFS WRITEs, and RDMA the rest of
>>>>>> the time. That keeps everything simple, but means large
>>>>>> inline thresholds can't use send-in-place.
>>>>> 
>>>>> Don't you have the same problem with RDMA WRITE?
>>>> 
>>>> The server side initiates RDMA Writes. The final RDMA Write in a WR
>>>> chain is signaled, but a subsequent Send completion is used to
>>>> determine when the server may release resources used for the Writes.
>>>> We're already doing it the slow way there, and there's no ^C hazard
>>>> on the server.
>>> 
>>> Wait, I guess I meant RDMA READ path.
>>> 
>>> The same constraints apply to RKeys as inline send - you cannot DMA
>>> unmap rkey memory until the rkey is invalidated at the HCA.
>>> 
>>> So posting an invalidate SQE and then immediately unmapping the DMA
>>> pages is bad too..
>>> 
>>> No matter how the data is transfered the unmapping must follow the
>>> same HCA synchronous model.. DMA unmap must only be done from the send
>>> completion handler (inline send or invalidate rkey), from the recv
>>> completion handler (send with invalidate), or from QP error state teardown.
>>> 
>>> Anything that does DMA memory unmap from another thread is very, very
>>> suspect, eg async from a ctrl-c trigger event.
>> 
>> 4.13 server side is converted to use the rdma_rw API for
>> handling RDMA Read. For non-iWARP cases, it's using the
>> local DMA key for Read sink buffers. For iWARP it should
>> be using Read-with-invalidate (IIRC).
> 
> The server sounds fine, how does the client work?

The client does not initiate RDMA Read or Write today.


--
Chuck Lever



Jason Gunthorpe July 10, 2017, 10:09 p.m. UTC | #37
On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:

> > The server sounds fine, how does the client work?
> 
> The client does not initiate RDMA Read or Write today.

Right, but it provides an rkey that the server uses for READ or WRITE.

The invalidate of that rkey at the client must follow the same rules
as inline send.

Jason
Chuck Lever July 11, 2017, 3:57 a.m. UTC | #38
> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:
> 
>>> The server sounds fine, how does the client work?
>> 
>> The client does not initiate RDMA Read or Write today.
> 
> Right, but it provides an rkey that the server uses for READ or WRITE.
> 
> The invalidate of that rkey at the client must follow the same rules
> as inline send.

Ah, I see.

The RPC reply handler calls frwr_op_unmap_sync to invalidate
any MRs associated with the RPC.

frwr_op_unmap_sync has to sort out which rkeys have been remotely
invalidated and which have not.

The first step is to ensure all the rkeys for an RPC are
invalid. The rkey that was remotely invalidated is skipped
here, and a chain of LocalInv WRs is posted to invalidate
any remaining rkeys. The last WR in the chain is signaled.

If one or more LocalInv WRs are posted, this function waits
for LocalInv completion.

The last step is always DMA unmapping. Note that we can't
get a completion for a remotely invalidated rkey, and we
have to wait for LocalInv to complete anyway. So the DMA
unmapping is always handled here instead of in a
completion handler.
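
In outline, the logic looks something like this (a simplified
sketch, not the actual source; the names only loosely follow
xprtrdma):

--
struct ib_send_wr *first = NULL, *prev = NULL, *bad_wr;
struct rpcrdma_mw *mw;

list_for_each_entry(mw, &req->rl_registered, mw_list) {
	if (mw->mw_handle == remotely_invalidated_rkey)
		continue;	/* the peer already invalidated this MR */
	memset(&mw->frmr.fr_invwr, 0, sizeof(mw->frmr.fr_invwr));
	mw->frmr.fr_invwr.opcode = IB_WR_LOCAL_INV;
	mw->frmr.fr_invwr.ex.invalidate_rkey = mw->mw_handle;
	if (prev)
		prev->next = &mw->frmr.fr_invwr;
	else
		first = &mw->frmr.fr_invwr;
	prev = &mw->frmr.fr_invwr;
}
if (first) {
	prev->send_flags = IB_SEND_SIGNALED;	/* only the last WR */
	prev->wr_cqe = &done_cqe;	/* its handler completes frwr_done */
	if (!ib_post_send(ia->ri_id->qp, first, &bad_wr))
		wait_for_completion(&frwr_done);
}
/* all rkeys are now invalid: DMA unmapping is safe */
--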

When frwr_op_unmap_sync returns to the RPC reply handler,
the handler calls xprt_complete_rqst, and the RPC is
terminated. This guarantees that the MRs are invalid before
control is returned to the RPC consumer.


In the ^C case, frwr_op_unmap_safe is invoked during RPC
termination. The MRs are passed to the background recovery
task, which invokes frwr_op_recover_mr.

frwr_op_recover_mr destroys the fr_mr and DMA unmaps the
memory. (It's also used when registration or invalidation
flushes, which is why it uses a hammer).

So here, we're a little fast/loose: the ordering of
invalidation and unmapping is correct, but the MRs can be
invalidated after the RPC completes. Since RPC termination
can't wait, this is the best I can do for now.


--
Chuck Lever



Tom Talpey July 11, 2017, 1:23 p.m. UTC | #39
On 7/10/2017 11:57 PM, Chuck Lever wrote:
> 
>> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>>
>> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:
>>
>>>> The server sounds fine, how does the client work?
>>>
>>> The client does not initiate RDMA Read or Write today.
>>
>> Right, but it provides an rkey that the server uses for READ or WRITE.
>>
>> The invalidate of that rkey at the client must follow the same rules
>> as inline send.
> 
> Ah, I see.
> 
> The RPC reply handler calls frwr_op_unmap_sync to invalidate
> any MRs associated with the RPC.
> 
> frwr_op_unmap_sync has to sort out which rkeys have been remotely
> invalidated and which have not.

Does the reply handler consider the possibility that the reply is
being signaled before the send WRs? We have seen some really
interesting races in Windows SMB Direct on shared or multiple CQs
when the completion upcalls start to back up under heavy load.

In the end, we had to put explicit reference counts on each and
every object, and added rundown references to everything before
completing an operation and signaling the upper layer (SMB3, in
our case). This found a surprising number of double completions,
and missing completions from drivers as well.

> The first step is to ensure all the rkeys for an RPC are
> invalid. The rkey that was remotely invalidated is skipped
> here, and a chain of LocalInv WRs is posted to invalidate
> any remaining rkeys. The last WR in the chain is signaled.
> 
> If one or more LocalInv WRs are posted, this function waits
> for LocalInv completion.
> 
> The last step is always DMA unmapping. Note that we can't
> get a completion for a remotely invalidated rkey, and we
> have to wait for LocalInv to complete anyway. So the DMA
> unmapping is always handled here instead of in a
> completion handler.
> 
> When frwr_op_unmap_sync returns to the RPC reply handler,
> the handler calls xprt_complete_rqst, and the RPC is
> terminated. This guarantees that the MRs are invalid before
> control is returned to the RPC consumer.
> 
> 
> In the ^C case, frwr_op_unmap_safe is invoked during RPC
> termination. The MRs are passed to the background recovery
> task, which invokes frwr_op_recover_mr.

That worries me. How do you know it's going in sequence, and
that it will result in an invalidated MR?

> frwr_op_recover_mr destroys the fr_mr and DMA unmaps the
> memory. (It's also used when registration or invalidation
> flushes, which is why it uses a hammer).
> 
> So here, we're a little fast/loose: the ordering of
> invalidation and unmapping is correct, but the MRs can be
> invalidated after the RPC completes. Since RPC termination
> can't wait, this is the best I can do for now.

That would worry me even more. "fast/loose" isn't a good
situation when storage is concerned. Shouldn't you just be
closing the connection?

Tom.
Chuck Lever July 11, 2017, 2:55 p.m. UTC | #40
Hi Tom-


> On Jul 11, 2017, at 9:23 AM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 7/10/2017 11:57 PM, Chuck Lever wrote:
>>> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>>> 
>>> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:
>>> 
>>>>> The server sounds fine, how does the client work?
>>>> 
>>>> The client does not initiate RDMA Read or Write today.
>>> 
>>> Right, but it provides an rkey that the server uses for READ or WRITE.
>>> 
>>> The invalidate of that rkey at the client must follow the same rules
>>> as inline send.
>> Ah, I see.
>> The RPC reply handler calls frwr_op_unmap_sync to invalidate
>> any MRs associated with the RPC.
>> frwr_op_unmap_sync has to sort out which rkeys have been remotely
>> invalidated and which have not.
> 
> Does the reply handler consider the possibility that the reply is
> being signaled before the send WRs? There are some really interesting
> races on shared or multiple CQs when the completion upcalls start
> to back up under heavy load that we've seen in Windows SMB Direct.

If I understand you correctly, that's exactly what we're discussing.
The Send WR that transmitted the RPC Call can outlive the RPC
transaction in rare cases.

A partial solution is to move the Send SGE array into a data structure
whose lifetime is independent of rpcrdma_req. That allows the client
to guarantee that appropriate DMA unmaps are done only after the Send
is complete.
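
For example, extending the sendctx sketch from earlier in the thread
(names still hypothetical; and yes, this is the reference count I was
hoping to avoid):

--
/* The Send context gets its own lifetime: kref_init() plus one
 * kref_get() happen when the Send is posted, one reference for the
 * RPC and one for the Send WR. */
struct rpcrdma_sendctx {
	struct kref	sc_ref;
	struct ib_cqe	sc_cqe;
	/* SGE array, device pointer, unmap count, ... */
};

static void rpcrdma_sendctx_release(struct kref *kref)
{
	kfree(container_of(kref, struct rpcrdma_sendctx, sc_ref));
}

/* The RPC termination path and the Send completion handler each
 * drop one reference; the last put frees the context, whichever
 * order they run in. */
kref_put(&sc->sc_ref, rpcrdma_sendctx_release);
--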

The other part of this issue is that delayed Send completion also
needs to prevent initiation of another RPC, otherwise the Send Queue
can overflow or the client might exceed the credit grant on this
connection. This part worries me because it could mean that some
serialization (eg. a completion and context switch) is needed for
every Send operation on the client.


> In the end, we had to put explicit reference counts on each and
> every object, and added rundown references to everything before
> completing an operation and signaling the upper layer (SMB3, in
> our case). This found a surprising number of double completions,
> and missing completions from drivers as well.
> 
>> The first step is to ensure all the rkeys for an RPC are
>> invalid. The rkey that was remotely invalidated is skipped
>> here, and a chain of LocalInv WRs is posted to invalidate
>> any remaining rkeys. The last WR in the chain is signaled.
>> If one or more LocalInv WRs are posted, this function waits
>> for LocalInv completion.
>> The last step is always DMA unmapping. Note that we can't
>> get a completion for a remotely invalidated rkey, and we
>> have to wait for LocalInv to complete anyway. So the DMA
>> unmapping is always handled here instead of in a
>> completion handler.
>> When frwr_op_unmap_sync returns to the RPC reply handler,
>> the handler calls xprt_complete_rqst, and the RPC is
>> terminated. This guarantees that the MRs are invalid before
>> control is returned to the RPC consumer.
>> In the ^C case, frwr_op_unmap_safe is invoked during RPC
>> termination. The MRs are passed to the background recovery
>> task, which invokes frwr_op_recover_mr.
> 
> That worries me. How do you know it's going in sequence,
> and that it will result in an invalidated MR?

The MR is destroyed synchronously. Isn't that enough to guarantee
that the rkey is invalid and that DMA unmapping can safely proceed?

With FRWR, if a LocalInv flushes with "memory management error"
there's no way to invalidate that MR other than:

- destroy the MR, or

- disconnect

The latter is a mighty sledgehammer that affects all the live MRs
and RPCs on that connection. That's why recover_mr destroys the
MR in all cases.


>> frwr_op_recover_mr destroys the fr_mr and DMA unmaps the
>> memory. (It's also used when registration or invalidation
>> flushes, which is why it uses a hammer).
>> So here, we're a little fast/loose: the ordering of
>> invalidation and unmapping is correct, but the MRs can be
>> invalidated after the RPC completes. Since RPC termination
>> can't wait, this is the best I can do for now.
> 
> That would worry me even more. "fast/loose" isn't a good
> situation when storage is concerned.

I agree!


> Shouldn't you just be closing the connection?

We'd have to wait for the connection close to complete in a
function that is not allowed to sleep, so it would have to be
done in the background as well. This is no better than handing
the MR to a background process.

And if possible we'd like to avoid connection loss (see above).

I don't like this situation, but it's the best I can do with
the current architecture of the RPC client. I've been staring
at this for a couple of years now wondering how to fix it.
Suggestions happily accepted!


--
Chuck Lever



diff mbox

Patch

diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 12ed62ce9ff7..2a07692007bd 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
 
         if (unsol_sz < edtl) {
                 hdr->flags     |= ISER_WSV;
-               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
-               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+               if (buf_out->data_len > imm_sz) {
+                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
+                       hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+               }
 
                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "