diff mbox series

tests/nvme: Add admin-passthru+reset race test

Message ID 20221114203412.383-1-jonathan.derrick@linux.dev (mailing list archive)
State New, archived
Headers show
Series tests/nvme: Add admin-passthru+reset race test | expand

Commit Message

Jonathan Derrick Nov. 14, 2022, 8:34 p.m. UTC
Adds a test which runs many formats and reset_controllers in parallel.
The intent is to expose timing holes in the controller state machine
which will lead to hung task timing and the controller becoming
unavailable.

Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354

Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
 tests/nvme/046     | 85 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/nvme/046.out |  2 ++
 2 files changed, 87 insertions(+)
 create mode 100755 tests/nvme/046
 create mode 100644 tests/nvme/046.out

Comments

kernel test robot Nov. 15, 2022, 12:33 a.m. UTC | #1
Hi Jonathan,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.1-rc5 next-20221114]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jonathan-Derrick/tests-nvme-Add-admin-passthru-reset-race-test/20221115-043752
patch link:    https://lore.kernel.org/r/20221114203412.383-1-jonathan.derrick%40linux.dev
patch subject: [PATCH] tests/nvme: Add admin-passthru+reset race test
reproduce:
        scripts/spdxcheck.py

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

spdxcheck warnings: (new ones prefixed by >>)
   drivers/cpufreq/amd-pstate-ut.c: 1:28 Invalid License ID: GPL-1.0-or-later
>> tests/nvme/046: 2:27 Invalid License ID: GPL-3.0+
Keith Busch Nov. 15, 2022, 11:07 p.m. UTC | #2
On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> +	echo "Running ${TEST_NAME}"
> +
> +	local sysfs
> +	local attr
> +	local m
> +
> +	sysfs="$TEST_DEV_SYSFS/device"

That's not the correct directory when the device is using native
nvme-multipath.

> +	timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
> +
> +	sleep 5
> +
> +	if [[ ! -d "$sysfs" ]]; then
> +		echo "$sysfs doesn't exist"
> +	fi
> +
> +	# do reset controller/format loops
> +	# don't check status now because a timing race is desired
> +	i=0
> +	start=0
> +	timing_out=false
> +	while [[ $i -le 1000 ]]; do
> +		start=$SECONDS
> +		if [[ -f "$sysfs/reset_controller" ]]; then
> +			echo 1 > "$sysfs/reset_controller" 2>/dev/null &
> +			i=$((i+1))
> +		fi
> +		nvme format -l 0 -f $TEST_DEV 2>/dev/null &
> +
> +		#Assume the controller is hung and unrecoverable
> +		if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
> +			echo "nvme controller timing out"
> +			timing_out=true
> +			break
> +		fi
> +	done

If the controller is already undergoing a reset, then writing to the
reset_controller file becomes a no-op. Unless your "reset_controller"
completes near instantaneously, I find that this loop tears through 1000
iterations, forks 1000 formats, and only 1 reset_controller actually
gets through.

If I remove the upper limit, then I can also see the stalled task, but
it is only temporary and gets itself out of it after the admin timeout
(1 minute). Is that also your observation, or is it stuck forever?
Christoph Hellwig Nov. 16, 2022, 7:43 a.m. UTC | #3
On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> Adds a test which runs many formats and reset_controllers in parallel.
> The intent is to expose timing holes in the controller state machine
> which will lead to hung task timing and the controller becoming
> unavailable.

This passes just fine for me both on qemu and a thunderbolt attached
m.2 SSD.  So it seems to be hardware / timing dependent.
diff mbox series

Patch

diff --git a/tests/nvme/046 b/tests/nvme/046
new file mode 100755
index 0000000..4b47783
--- /dev/null
+++ b/tests/nvme/046
@@ -0,0 +1,85 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2022 Jonathan Derrick <jonathan.derrick@linux.dev>
+#
+# Test nvme reset controller during admin passthru
+#
+# Regression for issue reported by
+# https://bugzilla.kernel.org/show_bug.cgi?id=216354
+
+. tests/nvme/rc
+
+#restrict test to nvme-pci only
+nvme_trtype=pci
+
+DESCRIPTION="test nvme reset controller during admin passthru"
+QUICK=1
+CAN_BE_ZONED=1
+
+requires() {
+	_nvme_requires
+}
+
+device_requires() {
+	_require_test_dev_is_nvme
+}
+
+test_device() {
+	echo "Running ${TEST_NAME}"
+
+	local sysfs
+	local attr
+	local m
+
+	sysfs="$TEST_DEV_SYSFS/device"
+	timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
+
+	sleep 5
+
+	if [[ ! -d "$sysfs" ]]; then
+		echo "$sysfs doesn't exist"
+	fi
+
+	# do reset controller/format loops
+	# don't check status now because a timing race is desired
+	i=0
+	start=0
+	timing_out=false
+	while [[ $i -le 1000 ]]; do
+		start=$SECONDS
+		if [[ -f "$sysfs/reset_controller" ]]; then
+			echo 1 > "$sysfs/reset_controller" 2>/dev/null &
+			i=$((i+1))
+		fi
+		nvme format -l 0 -f $TEST_DEV 2>/dev/null &
+
+		#Assume the controller is hung and unrecoverable
+		if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
+			echo "nvme controller timing out"
+			timing_out=true
+			break
+		fi
+	done
+
+	{ kill $!; wait; } &> /dev/null
+
+	# at this point it may have waited hung_task_timeout / 2 already, so
+	# only wait 25% longer for a total of about 75% of allowed timeout
+	m=0
+	while [[ $m -le $((timeout / 2)) ]]; do
+		if [[ $timing_out == true ]]; then
+			break
+		fi
+		if grep -q live "$sysfs/state"; then
+			break
+		fi
+		sleep 1
+		m=$((m+1))
+	done
+	if ! grep -q live "$sysfs/state"; then
+		echo "nvme still not live after $(($SECONDS - $start)) seconds!"
+	fi
+	udevadm settle
+
+	echo "Test complete"
+}
diff --git a/tests/nvme/046.out b/tests/nvme/046.out
new file mode 100644
index 0000000..2b5fa6a
--- /dev/null
+++ b/tests/nvme/046.out
@@ -0,0 +1,2 @@ 
+Running nvme/046
+Test complete