Message ID | 20201207210649.19194-1-borisp@mellanox.com (mailing list archive)
---|---
Series | nvme-tcp receive offloads
Hey Boris, sorry for some delays on my end...

I saw some long discussions on this set with David, what is the status here?

I'll look some more into the patches, but if you addressed the feedback from the last iteration I don't expect major issues with this patch set (at least from the nvme-tcp side).

> Changes since RFC v1:
> =========================================
> * Split mlx5 driver patches into several commits
> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.

I'm assuming that you tested controller resets and network hiccups during traffic, right?
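As a rough illustration of the changelog item above (moving queue offload init/teardown into the queue start/stop path), here is a minimal standalone sketch. The names (queue_start, queue_stop, the offload fields) are hypothetical, not the actual nvme-tcp or mlx5 symbols; only the ordering is the point.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the ordering only; queue_start/queue_stop and the offload
 * fields are illustrative names, not the real nvme-tcp or mlx5 code.
 */
struct queue {
	bool transport_up;
	bool offload_up;
};

static int queue_start(struct queue *q)
{
	q->transport_up = true;
	/* Offload init is done as part of queue start, so restarting the
	 * queue (e.g. after a controller reset) also re-establishes the
	 * HW offload context. */
	q->offload_up = true;
	printf("queue started, offload context installed\n");
	return 0;
}

static void queue_stop(struct queue *q)
{
	/* Teardown mirrors init: the offload context is released together
	 * with the queue during error recovery or reset. */
	q->offload_up = false;
	q->transport_up = false;
	printf("queue stopped, offload context released\n");
}

int main(void)
{
	struct queue q = { 0 };

	queue_start(&q);	/* initial connect */
	queue_stop(&q);		/* e.g. controller reset tears it down */
	queue_start(&q);	/* ...and re-creates it on restart */
	return 0;
}
```

Tying the offload context to queue start/stop would mean that a reset or error-recovery cycle tears down and re-creates the HW context along with the queue, which is what the recovery-flow fix in the changelog appears to be about.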
On 1/13/21 6:27 PM, Sagi Grimberg wrote:
>> Changes since RFC v1:
>> =========================================
>> * Split mlx5 driver patches into several commits
>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>
> I'm assuming that you tested controller resets and network hiccups during traffic, right?

I had questions on this part as well -- e.g., what happens on a TCP retry? Packets arrive and the SGL is filled for the command ID, but then a packet is dropped in the stack (e.g., the enqueue backlog is full, so the packet gets dropped).
On 14/01/2021 3:27, Sagi Grimberg wrote:
> Hey Boris, sorry for some delays on my end...
>
> I saw some long discussions on this set with David, what is the status here?
>

The main purpose of this series is to address these discussions.

> I'll look some more into the patches, but if you addressed the feedback from the last iteration I don't expect major issues with this patch set (at least from the nvme-tcp side).
>
>> Changes since RFC v1:
>> =========================================
>> * Split mlx5 driver patches into several commits
>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>
> I'm assuming that you tested controller resets and network hiccups during traffic, right?
>

Network hiccups were tested using netem packet drops and reordering. We tested error recovery by taking the controller down and bringing it back up, both while the system is quiescent and during traffic.

If you have another test in mind, please let me know.
On 14/01/2021 6:47, David Ahern wrote:
> On 1/13/21 6:27 PM, Sagi Grimberg wrote:
>>> Changes since RFC v1:
>>> =========================================
>>> * Split mlx5 driver patches into several commits
>>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>>
>> I'm assuming that you tested controller resets and network hiccups during traffic, right?
>
> I had questions on this part as well -- e.g., what happens on a TCP retry? Packets arrive and the SGL is filled for the command ID, but then a packet is dropped in the stack (e.g., the enqueue backlog is full, so the packet gets dropped).
>

On re-transmission, the HW context's expected TCP sequence number doesn't match. As a result, the received packet is un-offloaded, and software does the copy/CRC for its data. As a general rule, if the HW context's expected sequence number doesn't match, there is no offload.
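A minimal sketch of the fallback behavior described above, under the assumption that a mismatch between a segment's sequence number and the HW context's expected sequence number simply routes the data through the regular software copy/CRC path. The names (offload_ctx, recv_segment, sw_copy_and_crc) are illustrative only, not the actual mlx5 or nvme-tcp code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch only: the real logic lives in the mlx5 driver and
 * the nvme-tcp offload glue; these names and this simplified flow are
 * assumptions, not the actual kernel code.
 */
struct offload_ctx {
	uint32_t expected_seq;	/* next TCP sequence number the HW context expects */
	bool     active;	/* HW context currently in sync with the stream */
};

/* Software fallback: copy the payload and compute the data digest (CRC). */
static void sw_copy_and_crc(const uint8_t *data, size_t len)
{
	(void)data;
	printf("software path: copy + CRC over %zu bytes\n", len);
}

static void recv_segment(struct offload_ctx *ctx, uint32_t seq,
			 const uint8_t *data, size_t len)
{
	if (!ctx->active || seq != ctx->expected_seq) {
		/* Retransmitted or out-of-order segment: the HW expectation
		 * no longer matches, so the data is handled entirely in
		 * software (copy + CRC), i.e. it is "un-offloaded". */
		sw_copy_and_crc(data, len);
		return;
	}
	/* In-sequence segment: the NIC already placed the data into the
	 * command's SGL (DDP), so software skips the copy/CRC for it. */
	ctx->expected_seq += (uint32_t)len;
	printf("offloaded path: data already placed, %zu bytes\n", len);
}

int main(void)
{
	struct offload_ctx ctx = { .expected_seq = 1000, .active = true };
	uint8_t payload[512] = { 0 };

	recv_segment(&ctx, 1000, payload, sizeof(payload));	/* offloaded */
	recv_segment(&ctx, 1000, payload, sizeof(payload));	/* retransmit -> software */
	return 0;
}
```

In David's scenario (a segment dropped in the stack, then retransmitted), the sequence check fails and the payload goes through the software path, so correctness would not depend on the offload staying in sync with the stream.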
>> Hey Boris, sorry for some delays on my end...
>>
>> I saw some long discussions on this set with David, what is the status here?
>
> The main purpose of this series is to address these discussions.
>
>> I'll look some more into the patches, but if you addressed the feedback from the last iteration I don't expect major issues with this patch set (at least from the nvme-tcp side).
>>
>>> Changes since RFC v1:
>>> =========================================
>>> * Split mlx5 driver patches into several commits
>>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>>
>> I'm assuming that you tested controller resets and network hiccups during traffic, right?
>
> Network hiccups were tested using netem packet drops and reordering. We tested error recovery by taking the controller down and bringing it back up, both while the system is quiescent and during traffic.
>
> If you have another test in mind, please let me know.

I suggest also performing interface down/up during traffic, both on the host and the targets. Other than that, we should be in decent shape...