Message ID | 20240117042206.11031-2-kch@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | nvme: add nvme pci timeout testcase | expand |
Hi Chaitanya, thanks for this v2 patch. I ran this new test case with several devices and observed these two: 1) The test case fails with my QEMU NMVE devices. It looks all of the inject failures get recovered by error handler, then no I/O error is reported to the fio process. I modified two fail_io_timeout parameters as follows to inject more failures, and saw the test case fails with the device. I suggest these parameters. @@ -49,8 +50,8 @@ test_device() { save_fi_settings echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail - echo 99 > /sys/kernel/debug/fail_io_timeout/probability - echo 10 > /sys/kernel/debug/fail_io_timeout/interval + echo 100 > /sys/kernel/debug/fail_io_timeout/probability + echo 1 > /sys/kernel/debug/fail_io_timeout/interval echo -1 > /sys/kernel/debug/fail_io_timeout/times echo 0 > /sys/kernel/debug/fail_io_timeout/space echo 1 > /sys/kernel/debug/fail_io_timeout/verbose 2) After I ran the test case, I see the test target device left with zero size. I think this is consistent as your report [*]. I still think device remove and rescan at the test case end is required to avoid impacts on following test cases. It will depend on the discussion for your report, though. [*] https://lore.kernel.org/linux-nvme/287b9ed9-6eb3-46d0-a6f0-a9d283245b90@nvidia.com/ Please find my some more comments in line: On Jan 16, 2024 / 20:22, Chaitanya Kulkarni wrote: > Trigger and test nvme-pci timeout with concurrent fio jobs. > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> > --- > tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++ > tests/nvme/050.out | 2 ++ > 2 files changed, 76 insertions(+) > create mode 100755 tests/nvme/050 > create mode 100644 tests/nvme/050.out > > diff --git a/tests/nvme/050 b/tests/nvme/050 > new file mode 100755 > index 0000000..6c44b43 > --- /dev/null > +++ b/tests/nvme/050 > @@ -0,0 +1,74 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-3.0+ > +# Copyright (C) 2024 Chaitanya Kulkarni > +# > +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. > +# > + > +. tests/nvme/rc > + > +DESCRIPTION="test nvme-pci timeout with fio jobs" I ran this test case a ZNS drive today, and it looks working. I suggest to add: CAN_BE_ZONED=1 > + > +sysfs_path="/sys/kernel/debug/fail_io_timeout/" > +#restrict test to nvme-pci only > +nvme_trtype=pci > + > +# fault injection config array > +declare -A fi_array > + > +requires() { > + _require_nvme_trtype pci We can remove this line, since we added "nvme_trtype=pci" above. > + _have_fio > + _nvme_requires > + _have_kernel_option FAIL_IO_TIMEOUT Today, I found that this test case depends on another kernel option. Let's add, _have_kernel_option FAULT_INJECTION_DEBUG_FS > +} > + > +device_requires() { > + _require_test_dev_is_nvme > +} Actually, this check is not required here, since it is already done in group_device_requires() in tests/nvme/rc. It is a small left work to remove the same device_requires() in nvme/032. > + > +save_fi_settings() { > + for fi_attr in probability interval times space verbose > + do > + fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}") > + done > +} > + > +restore_fi_settings() { > + for fi_attr in probability interval times space verbose > + do > + echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}" > + done > +} > + > +test_device() { > + local nvme_ns > + local io_fimeout_fail > + > + echo "Running ${TEST_NAME}" > + > + nvme_ns="$(basename "${TEST_DEV}")" > + io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)" > + save_fi_settings > + echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail > + > + echo 99 > /sys/kernel/debug/fail_io_timeout/probability > + echo 10 > /sys/kernel/debug/fail_io_timeout/interval > + echo -1 > /sys/kernel/debug/fail_io_timeout/times > + echo 0 > /sys/kernel/debug/fail_io_timeout/space > + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose > + > + fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \ > + --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \ > + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" When I investigated the failure cause of this test case, I found fio output is useful. So, I suggest to keep the output in $FULL. --time_based --runtime=1m >&"$FULL" > + > + # shellcheck disable=SC2181 > + if [ $? -eq 0 ]; then With that $FULL file, the if statement can be as follows. And we don't need to disable SC2181 either. if grep -q "Input/output error" "$FULL"; then > + echo "Test complete" > + else > + echo "Test failed" > + fi > + > + restore_fi_settings > + echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail > +} > diff --git a/tests/nvme/050.out b/tests/nvme/050.out > new file mode 100644 > index 0000000..b78b05f > --- /dev/null > +++ b/tests/nvme/050.out > @@ -0,0 +1,2 @@ > +Running nvme/050 > +Test complete > -- > 2.40.0 >
On 1/19/2024 3:11 AM, Shinichiro Kawasaki wrote: > Hi Chaitanya, thanks for this v2 patch. > > I ran this new test case with several devices and observed these two: > > 1) The test case fails with my QEMU NMVE devices. It looks all of the inject > failures get recovered by error handler, then no I/O error is reported to the > fio process. I modified two fail_io_timeout parameters as follows to inject > more failures, and saw the test case fails with the device. I suggest these > parameters. > > @@ -49,8 +50,8 @@ test_device() { > save_fi_settings > echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail > > - echo 99 > /sys/kernel/debug/fail_io_timeout/probability > - echo 10 > /sys/kernel/debug/fail_io_timeout/interval > + echo 100 > /sys/kernel/debug/fail_io_timeout/probability > + echo 1 > /sys/kernel/debug/fail_io_timeout/interval > echo -1 > /sys/kernel/debug/fail_io_timeout/times > echo 0 > /sys/kernel/debug/fail_io_timeout/space > echo 1 > /sys/kernel/debug/fail_io_timeout/verbose > Let me test these and add to the V3. But it is really strange, current code pass testcase 100% of the time without above changes ... > 2) After I ran the test case, I see the test target device left with zero > size. I think this is consistent as your report [*]. I still think device > remove and rescan at the test case end is required to avoid impacts on > following test cases. It will depend on the discussion for your report, > though. > Yes, still waiting on that, hopefully we will get some information to move this forward. The reason I didn't add remove/rescan deviec coz it helped me debug the state of the device post timeout handler which is turned out to be not consistent for sure .. > [*] https://lore.kernel.org/linux-nvme/287b9ed9-6eb3-46d0-a6f0-a9d283245b90@nvidia.com/ > > Please find my some more comments in line: > > On Jan 16, 2024 / 20:22, Chaitanya Kulkarni wrote: >> Trigger and test nvme-pci timeout with concurrent fio jobs. >> >> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> >> --- >> tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++ >> tests/nvme/050.out | 2 ++ >> 2 files changed, 76 insertions(+) >> create mode 100755 tests/nvme/050 >> create mode 100644 tests/nvme/050.out >> >> diff --git a/tests/nvme/050 b/tests/nvme/050 >> new file mode 100755 >> index 0000000..6c44b43 >> --- /dev/null >> +++ b/tests/nvme/050 >> @@ -0,0 +1,74 @@ >> +#!/bin/bash >> +# SPDX-License-Identifier: GPL-3.0+ >> +# Copyright (C) 2024 Chaitanya Kulkarni >> +# >> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. >> +# >> + >> +. tests/nvme/rc >> + >> +DESCRIPTION="test nvme-pci timeout with fio jobs" > > I ran this test case a ZNS drive today, and it looks working. I suggest to add: > > CAN_BE_ZONED=1 > okay .. >> + >> +sysfs_path="/sys/kernel/debug/fail_io_timeout/" >> +#restrict test to nvme-pci only >> +nvme_trtype=pci >> + >> +# fault injection config array >> +declare -A fi_array >> + >> +requires() { >> + _require_nvme_trtype pci > > We can remove this line, since we added "nvme_trtype=pci" above. yeah .. > >> + _have_fio >> + _nvme_requires >> + _have_kernel_option FAIL_IO_TIMEOUT > > Today, I found that this test case depends on another kernel option. Let's add, > > _have_kernel_option FAULT_INJECTION_DEBUG_FS indeed good catch I'll add this .. > >> +} >> + >> +device_requires() { >> + _require_test_dev_is_nvme >> +} > > Actually, this check is not required here, since it is already done in > group_device_requires() in tests/nvme/rc. It is a small left work to remove > the same device_requires() in nvme/032. > okay will remove that .. >> + >> +save_fi_settings() { >> + for fi_attr in probability interval times space verbose >> + do >> + fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}") >> + done >> +} >> + >> +restore_fi_settings() { >> + for fi_attr in probability interval times space verbose >> + do >> + echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}" >> + done >> +} >> + >> +test_device() { >> + local nvme_ns >> + local io_fimeout_fail >> + >> + echo "Running ${TEST_NAME}" >> + >> + nvme_ns="$(basename "${TEST_DEV}")" >> + io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)" >> + save_fi_settings >> + echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail >> + >> + echo 99 > /sys/kernel/debug/fail_io_timeout/probability >> + echo 10 > /sys/kernel/debug/fail_io_timeout/interval >> + echo -1 > /sys/kernel/debug/fail_io_timeout/times >> + echo 0 > /sys/kernel/debug/fail_io_timeout/space >> + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose >> + >> + fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \ >> + --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \ >> + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" > > When I investigated the failure cause of this test case, I found fio output > is useful. So, I suggest to keep the output in $FULL. > > --time_based --runtime=1m >&"$FULL" > okay and will modify the testcase for that .. >> + >> + # shellcheck disable=SC2181 >> + if [ $? -eq 0 ]; then > > With that $FULL file, the if statement can be as follows. And we don't need to > disable SC2181 either. > > if grep -q "Input/output error" "$FULL"; then agree .. > >> + echo "Test complete" >> + else >> + echo "Test failed" >> + fi >> + >> + restore_fi_settings >> + echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail >> +} >> diff --git a/tests/nvme/050.out b/tests/nvme/050.out >> new file mode 100644 >> index 0000000..b78b05f >> --- /dev/null >> +++ b/tests/nvme/050.out >> @@ -0,0 +1,2 @@ >> +Running nvme/050 >> +Test complete >> -- >> 2.40.0 Thanks for the review comments, I'll send out V3 with your suggestions .. -ck
diff --git a/tests/nvme/050 b/tests/nvme/050 new file mode 100755 index 0000000..6c44b43 --- /dev/null +++ b/tests/nvme/050 @@ -0,0 +1,74 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-3.0+ +# Copyright (C) 2024 Chaitanya Kulkarni +# +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. +# + +. tests/nvme/rc + +DESCRIPTION="test nvme-pci timeout with fio jobs" + +sysfs_path="/sys/kernel/debug/fail_io_timeout/" +#restrict test to nvme-pci only +nvme_trtype=pci + +# fault injection config array +declare -A fi_array + +requires() { + _require_nvme_trtype pci + _have_fio + _nvme_requires + _have_kernel_option FAIL_IO_TIMEOUT +} + +device_requires() { + _require_test_dev_is_nvme +} + +save_fi_settings() { + for fi_attr in probability interval times space verbose + do + fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}") + done +} + +restore_fi_settings() { + for fi_attr in probability interval times space verbose + do + echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}" + done +} + +test_device() { + local nvme_ns + local io_fimeout_fail + + echo "Running ${TEST_NAME}" + + nvme_ns="$(basename "${TEST_DEV}")" + io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)" + save_fi_settings + echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail + + echo 99 > /sys/kernel/debug/fail_io_timeout/probability + echo 10 > /sys/kernel/debug/fail_io_timeout/interval + echo -1 > /sys/kernel/debug/fail_io_timeout/times + echo 0 > /sys/kernel/debug/fail_io_timeout/space + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose + + fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \ + --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \ + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" + + # shellcheck disable=SC2181 + if [ $? -eq 0 ]; then + echo "Test complete" + else + echo "Test failed" + fi + + restore_fi_settings + echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail +} diff --git a/tests/nvme/050.out b/tests/nvme/050.out new file mode 100644 index 0000000..b78b05f --- /dev/null +++ b/tests/nvme/050.out @@ -0,0 +1,2 @@ +Running nvme/050 +Test complete
Trigger and test nvme-pci timeout with concurrent fio jobs. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> --- tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++ tests/nvme/050.out | 2 ++ 2 files changed, 76 insertions(+) create mode 100755 tests/nvme/050 create mode 100644 tests/nvme/050.out