diff mbox series

[blktests,V2,1/1] nvme: add nvme pci timeout testcase

Message ID 20240117042206.11031-2-kch@nvidia.com (mailing list archive)
State New, archived
Headers show
Series nvme: add nvme pci timeout testcase | expand

Commit Message

Chaitanya Kulkarni Jan. 17, 2024, 4:22 a.m. UTC
Trigger and test nvme-pci timeout with concurrent fio jobs.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 tests/nvme/050     | 74 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/nvme/050.out |  2 ++
 2 files changed, 76 insertions(+)
 create mode 100755 tests/nvme/050
 create mode 100644 tests/nvme/050.out

Comments

Shin'ichiro Kawasaki Jan. 19, 2024, 11:11 a.m. UTC | #1
Hi Chaitanya, thanks for this v2 patch.

I ran this new test case with several devices and observed these two:

1) The test case fails with my QEMU NMVE devices. It looks all of the inject
   failures get recovered by error handler, then no I/O error is reported to the
   fio process. I modified two fail_io_timeout parameters as follows to inject
   more failures, and saw the test case fails with the device. I suggest these
   parameters.

@@ -49,8 +50,8 @@ test_device() {
        save_fi_settings
        echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

-       echo 99 > /sys/kernel/debug/fail_io_timeout/probability
-       echo 10 > /sys/kernel/debug/fail_io_timeout/interval
+       echo 100 > /sys/kernel/debug/fail_io_timeout/probability
+       echo 1 > /sys/kernel/debug/fail_io_timeout/interval
        echo -1 > /sys/kernel/debug/fail_io_timeout/times
        echo  0 > /sys/kernel/debug/fail_io_timeout/space
        echo  1 > /sys/kernel/debug/fail_io_timeout/verbose

2) After I ran the test case, I see the test target device left with zero
   size. I think this is consistent as your report [*]. I still think device
   remove and rescan at the test case end is required to avoid impacts on
   following test cases. It will depend on the discussion for your report,
   though.

[*] https://lore.kernel.org/linux-nvme/287b9ed9-6eb3-46d0-a6f0-a9d283245b90@nvidia.com/

Please find my some more comments in line:

On Jan 16, 2024 / 20:22, Chaitanya Kulkarni wrote:
> Trigger and test nvme-pci timeout with concurrent fio jobs.
> 
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  tests/nvme/050     | 74 ++++++++++++++++++++++++++++++++++++++++++++++
>  tests/nvme/050.out |  2 ++
>  2 files changed, 76 insertions(+)
>  create mode 100755 tests/nvme/050
>  create mode 100644 tests/nvme/050.out
> 
> diff --git a/tests/nvme/050 b/tests/nvme/050
> new file mode 100755
> index 0000000..6c44b43
> --- /dev/null
> +++ b/tests/nvme/050
> @@ -0,0 +1,74 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +# Copyright (C) 2024 Chaitanya Kulkarni
> +#
> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function.
> +#
> +
> +. tests/nvme/rc
> +
> +DESCRIPTION="test nvme-pci timeout with fio jobs"

I ran this test case a ZNS drive today, and it looks working. I suggest to add:

CAN_BE_ZONED=1

> +
> +sysfs_path="/sys/kernel/debug/fail_io_timeout/"
> +#restrict test to nvme-pci only
> +nvme_trtype=pci
> +
> +# fault injection config array
> +declare -A fi_array
> +
> +requires() {
> +        _require_nvme_trtype pci

We can remove this line, since we added "nvme_trtype=pci" above.

> +        _have_fio
> +        _nvme_requires
> +        _have_kernel_option FAIL_IO_TIMEOUT

Today, I found that this test case depends on another kernel option. Let's add,

        _have_kernel_option FAULT_INJECTION_DEBUG_FS

> +}
> +
> +device_requires() {
> +	_require_test_dev_is_nvme
> +}

Actually, this check is not required here, since it is already done in
group_device_requires() in tests/nvme/rc. It is a small left work to remove
the same device_requires() in nvme/032.

> +
> +save_fi_settings() {
> +	for fi_attr in probability interval times space verbose
> +	do
> +		fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}")
> +	done
> +}
> +
> +restore_fi_settings() {
> +	for fi_attr in probability interval times space verbose
> +	do
> +		echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}"
> +	done
> +}
> +
> +test_device() {
> +	local nvme_ns
> +	local io_fimeout_fail
> +
> +	echo "Running ${TEST_NAME}"
> +
> +	nvme_ns="$(basename "${TEST_DEV}")"
> +	io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)"
> +	save_fi_settings
> +	echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
> +
> +	echo 99 > /sys/kernel/debug/fail_io_timeout/probability
> +	echo 10 > /sys/kernel/debug/fail_io_timeout/interval
> +	echo -1 > /sys/kernel/debug/fail_io_timeout/times
> +	echo  0 > /sys/kernel/debug/fail_io_timeout/space
> +	echo  1 > /sys/kernel/debug/fail_io_timeout/verbose
> +
> +	fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
> +	    --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
> +	    --time_based --runtime=1m 2>&1 | grep -q "Input/output error"

When I investigated the failure cause of this test case, I found fio output
is useful. So, I suggest to keep the output in $FULL.

            --time_based --runtime=1m >&"$FULL"

> +
> +	# shellcheck disable=SC2181
> +	if [ $? -eq 0 ]; then

With that $FULL file, the if statement can be as follows. And we don't need to
disable SC2181 either.

       if grep -q "Input/output error" "$FULL"; then

> +		echo "Test complete"
> +	else
> +		echo "Test failed"
> +	fi
> +
> +	restore_fi_settings
> +	echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail
> +}
> diff --git a/tests/nvme/050.out b/tests/nvme/050.out
> new file mode 100644
> index 0000000..b78b05f
> --- /dev/null
> +++ b/tests/nvme/050.out
> @@ -0,0 +1,2 @@
> +Running nvme/050
> +Test complete
> -- 
> 2.40.0
>
Chaitanya Kulkarni Jan. 22, 2024, 4:06 a.m. UTC | #2
On 1/19/2024 3:11 AM, Shinichiro Kawasaki wrote:
> Hi Chaitanya, thanks for this v2 patch.
> 
> I ran this new test case with several devices and observed these two:
> 
> 1) The test case fails with my QEMU NMVE devices. It looks all of the inject
>     failures get recovered by error handler, then no I/O error is reported to the
>     fio process. I modified two fail_io_timeout parameters as follows to inject
>     more failures, and saw the test case fails with the device. I suggest these
>     parameters.
> 
> @@ -49,8 +50,8 @@ test_device() {
>          save_fi_settings
>          echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
> 
> -       echo 99 > /sys/kernel/debug/fail_io_timeout/probability
> -       echo 10 > /sys/kernel/debug/fail_io_timeout/interval
> +       echo 100 > /sys/kernel/debug/fail_io_timeout/probability
> +       echo 1 > /sys/kernel/debug/fail_io_timeout/interval
>          echo -1 > /sys/kernel/debug/fail_io_timeout/times
>          echo  0 > /sys/kernel/debug/fail_io_timeout/space
>          echo  1 > /sys/kernel/debug/fail_io_timeout/verbose
> 

Let me test these and add to the V3. But it is really strange, current
code pass testcase 100% of the time without above changes ...

> 2) After I ran the test case, I see the test target device left with zero
>     size. I think this is consistent as your report [*]. I still think device
>     remove and rescan at the test case end is required to avoid impacts on
>     following test cases. It will depend on the discussion for your report,
>     though.
> 

Yes, still waiting on that, hopefully we will get some information to
move this forward. The reason I didn't add remove/rescan deviec coz it
helped me debug the state of the device post timeout handler which is
turned out to be not consistent for sure ..

> [*] https://lore.kernel.org/linux-nvme/287b9ed9-6eb3-46d0-a6f0-a9d283245b90@nvidia.com/
> 
> Please find my some more comments in line:
> 
> On Jan 16, 2024 / 20:22, Chaitanya Kulkarni wrote:
>> Trigger and test nvme-pci timeout with concurrent fio jobs.
>>
>> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
>> ---
>>   tests/nvme/050     | 74 ++++++++++++++++++++++++++++++++++++++++++++++
>>   tests/nvme/050.out |  2 ++
>>   2 files changed, 76 insertions(+)
>>   create mode 100755 tests/nvme/050
>>   create mode 100644 tests/nvme/050.out
>>
>> diff --git a/tests/nvme/050 b/tests/nvme/050
>> new file mode 100755
>> index 0000000..6c44b43
>> --- /dev/null
>> +++ b/tests/nvme/050
>> @@ -0,0 +1,74 @@
>> +#!/bin/bash
>> +# SPDX-License-Identifier: GPL-3.0+
>> +# Copyright (C) 2024 Chaitanya Kulkarni
>> +#
>> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function.
>> +#
>> +
>> +. tests/nvme/rc
>> +
>> +DESCRIPTION="test nvme-pci timeout with fio jobs"
> 
> I ran this test case a ZNS drive today, and it looks working. I suggest to add:
> 
> CAN_BE_ZONED=1
> 

okay ..

>> +
>> +sysfs_path="/sys/kernel/debug/fail_io_timeout/"
>> +#restrict test to nvme-pci only
>> +nvme_trtype=pci
>> +
>> +# fault injection config array
>> +declare -A fi_array
>> +
>> +requires() {
>> +        _require_nvme_trtype pci
> 
> We can remove this line, since we added "nvme_trtype=pci" above.

yeah ..

> 
>> +        _have_fio
>> +        _nvme_requires
>> +        _have_kernel_option FAIL_IO_TIMEOUT
> 
> Today, I found that this test case depends on another kernel option. Let's add,
> 
>          _have_kernel_option FAULT_INJECTION_DEBUG_FS

indeed good catch I'll add this ..

> 
>> +}
>> +
>> +device_requires() {
>> +	_require_test_dev_is_nvme
>> +}
> 
> Actually, this check is not required here, since it is already done in
> group_device_requires() in tests/nvme/rc. It is a small left work to remove
> the same device_requires() in nvme/032.
> 

okay will remove that ..

>> +
>> +save_fi_settings() {
>> +	for fi_attr in probability interval times space verbose
>> +	do
>> +		fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}")
>> +	done
>> +}
>> +
>> +restore_fi_settings() {
>> +	for fi_attr in probability interval times space verbose
>> +	do
>> +		echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}"
>> +	done
>> +}
>> +
>> +test_device() {
>> +	local nvme_ns
>> +	local io_fimeout_fail
>> +
>> +	echo "Running ${TEST_NAME}"
>> +
>> +	nvme_ns="$(basename "${TEST_DEV}")"
>> +	io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)"
>> +	save_fi_settings
>> +	echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
>> +
>> +	echo 99 > /sys/kernel/debug/fail_io_timeout/probability
>> +	echo 10 > /sys/kernel/debug/fail_io_timeout/interval
>> +	echo -1 > /sys/kernel/debug/fail_io_timeout/times
>> +	echo  0 > /sys/kernel/debug/fail_io_timeout/space
>> +	echo  1 > /sys/kernel/debug/fail_io_timeout/verbose
>> +
>> +	fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
>> +	    --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
>> +	    --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
> 
> When I investigated the failure cause of this test case, I found fio output
> is useful. So, I suggest to keep the output in $FULL.
> 
>              --time_based --runtime=1m >&"$FULL"
> 

okay and will modify the testcase for that ..

>> +
>> +	# shellcheck disable=SC2181
>> +	if [ $? -eq 0 ]; then
> 
> With that $FULL file, the if statement can be as follows. And we don't need to
> disable SC2181 either.
> 
>         if grep -q "Input/output error" "$FULL"; then

agree ..

> 
>> +		echo "Test complete"
>> +	else
>> +		echo "Test failed"
>> +	fi
>> +
>> +	restore_fi_settings
>> +	echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail
>> +}
>> diff --git a/tests/nvme/050.out b/tests/nvme/050.out
>> new file mode 100644
>> index 0000000..b78b05f
>> --- /dev/null
>> +++ b/tests/nvme/050.out
>> @@ -0,0 +1,2 @@
>> +Running nvme/050
>> +Test complete
>> -- 
>> 2.40.0

Thanks for the review comments, I'll send out V3 with your
suggestions ..

-ck
diff mbox series

Patch

diff --git a/tests/nvme/050 b/tests/nvme/050
new file mode 100755
index 0000000..6c44b43
--- /dev/null
+++ b/tests/nvme/050
@@ -0,0 +1,74 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2024 Chaitanya Kulkarni
+#
+# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function.
+#
+
+. tests/nvme/rc
+
+DESCRIPTION="test nvme-pci timeout with fio jobs"
+
+sysfs_path="/sys/kernel/debug/fail_io_timeout/"
+#restrict test to nvme-pci only
+nvme_trtype=pci
+
+# fault injection config array
+declare -A fi_array
+
+requires() {
+        _require_nvme_trtype pci
+        _have_fio
+        _nvme_requires
+        _have_kernel_option FAIL_IO_TIMEOUT
+}
+
+device_requires() {
+	_require_test_dev_is_nvme
+}
+
+save_fi_settings() {
+	for fi_attr in probability interval times space verbose
+	do
+		fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}")
+	done
+}
+
+restore_fi_settings() {
+	for fi_attr in probability interval times space verbose
+	do
+		echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}"
+	done
+}
+
+test_device() {
+	local nvme_ns
+	local io_fimeout_fail
+
+	echo "Running ${TEST_NAME}"
+
+	nvme_ns="$(basename "${TEST_DEV}")"
+	io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)"
+	save_fi_settings
+	echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
+
+	echo 99 > /sys/kernel/debug/fail_io_timeout/probability
+	echo 10 > /sys/kernel/debug/fail_io_timeout/interval
+	echo -1 > /sys/kernel/debug/fail_io_timeout/times
+	echo  0 > /sys/kernel/debug/fail_io_timeout/space
+	echo  1 > /sys/kernel/debug/fail_io_timeout/verbose
+
+	fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
+	    --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
+	    --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
+
+	# shellcheck disable=SC2181
+	if [ $? -eq 0 ]; then
+		echo "Test complete"
+	else
+		echo "Test failed"
+	fi
+
+	restore_fi_settings
+	echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail
+}
diff --git a/tests/nvme/050.out b/tests/nvme/050.out
new file mode 100644
index 0000000..b78b05f
--- /dev/null
+++ b/tests/nvme/050.out
@@ -0,0 +1,2 @@ 
+Running nvme/050
+Test complete