Message ID | 20240110035756.9537-1-kch@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | nvme: add nvme pci timeout testcase | expand |
On 09/01/24 07:57PM, Chaitanya Kulkarni wrote: >Trigger and test nvme-pci timeout with concurrent fio jobs. > >Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> >--- > tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ > tests/nvme/050.out | 2 ++ > 2 files changed, 45 insertions(+) > create mode 100755 tests/nvme/050 > create mode 100644 tests/nvme/050.out > >diff --git a/tests/nvme/050 b/tests/nvme/050 >new file mode 100755 >index 0000000..ba54bcd >--- /dev/null >+++ b/tests/nvme/050 >@@ -0,0 +1,43 @@ >+#!/bin/bash >+# SPDX-License-Identifier: GPL-3.0+ >+# Copyright (C) 2024 Chaitanya Kulkarni. >+# >+# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. >+# >+ >+. tests/nvme/rc >+ >+DESCRIPTION="test nvme-pci timeout with fio jobs." >+ >+requires() { >+ _require_nvme_trtype pci >+ _have_fio >+ _nvme_requires >+} >+ >+test() { >+ local dev="/dev/nvme0n1" Should this be TEST_DEV instead ? >+ >+ echo "Running ${TEST_NAME}" >+ >+ echo 1 > /sys/block/nvme0n1/io-timeout-fail nvme0n1, I think this hard coding can be changed >+ >+ echo 99 > /sys/kernel/debug/fail_io_timeout/probability >+ echo 10 > /sys/kernel/debug/fail_io_timeout/interval >+ echo -1 > /sys/kernel/debug/fail_io_timeout/times >+ echo 0 > /sys/kernel/debug/fail_io_timeout/space >+ echo 1 > /sys/kernel/debug/fail_io_timeout/verbose >+ >+ # shellcheck disable=SC2046 >+ fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \ >+ --name=reads --direct=1 --filename=${dev} --group_reporting \ >+ --time_based --runtime=1m 2>&1 | grep -q "Input/output error" >+ >+ # shellcheck disable=SC2181 >+ if [ $? -eq 0 ]; then >+ echo "Test complete" >+ else >+ echo "Test failed" >+ fi >+ modprobe -r nvme >+} >diff --git a/tests/nvme/050.out b/tests/nvme/050.out >new file mode 100644 >index 0000000..b78b05f >--- /dev/null >+++ b/tests/nvme/050.out >@@ -0,0 +1,2 @@ >+Running nvme/050 >+Test complete >-- >2.40.0 > -- Nitesh Shetty
On 1/9/24 22:17, Nitesh Shetty wrote: > On 09/01/24 07:57PM, Chaitanya Kulkarni wrote: >> Trigger and test nvme-pci timeout with concurrent fio jobs. >> >> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> >> --- >> tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ >> tests/nvme/050.out | 2 ++ >> 2 files changed, 45 insertions(+) >> create mode 100755 tests/nvme/050 >> create mode 100644 tests/nvme/050.out >> >> diff --git a/tests/nvme/050 b/tests/nvme/050 >> new file mode 100755 >> index 0000000..ba54bcd >> --- /dev/null >> +++ b/tests/nvme/050 >> @@ -0,0 +1,43 @@ >> +#!/bin/bash >> +# SPDX-License-Identifier: GPL-3.0+ >> +# Copyright (C) 2024 Chaitanya Kulkarni. >> +# >> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout >> function. >> +# >> + >> +. tests/nvme/rc >> + >> +DESCRIPTION="test nvme-pci timeout with fio jobs." >> + >> +requires() { >> + _require_nvme_trtype pci >> + _have_fio >> + _nvme_requires >> +} >> + >> +test() { >> + local dev="/dev/nvme0n1" > > Should this be TEST_DEV instead ? why ? > > >> + >> + echo "Running ${TEST_NAME}" >> + >> + echo 1 > /sys/block/nvme0n1/io-timeout-fail > > nvme0n1, I think this hard coding can be changed > why ? -ck
On 10/01/24 07:40AM, Chaitanya Kulkarni wrote: >On 1/9/24 22:17, Nitesh Shetty wrote: >> On 09/01/24 07:57PM, Chaitanya Kulkarni wrote: >>> Trigger and test nvme-pci timeout with concurrent fio jobs. >>> >>> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> >>> --- >>> tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ >>> tests/nvme/050.out | 2 ++ >>> 2 files changed, 45 insertions(+) >>> create mode 100755 tests/nvme/050 >>> create mode 100644 tests/nvme/050.out >>> >>> diff --git a/tests/nvme/050 b/tests/nvme/050 >>> new file mode 100755 >>> index 0000000..ba54bcd >>> --- /dev/null >>> +++ b/tests/nvme/050 >>> @@ -0,0 +1,43 @@ >>> +#!/bin/bash >>> +# SPDX-License-Identifier: GPL-3.0+ >>> +# Copyright (C) 2024 Chaitanya Kulkarni. >>> +# >>> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout >>> function. >>> +# >>> + >>> +. tests/nvme/rc >>> + >>> +DESCRIPTION="test nvme-pci timeout with fio jobs." >>> + >>> +requires() { >>> + _require_nvme_trtype pci >>> + _have_fio >>> + _nvme_requires >>> +} >>> + >>> +test() { >>> + local dev="/dev/nvme0n1" >> >> Should this be TEST_DEV instead ? > >why ? > My understanding of blktests is, add device which we want to test in config files under TEST_DEV (except null-blk and nvme-fabrics loopback devices, which are usually populated inside the tests). In this case, if someone do not want to disturb nvme0n1 device, this test doesn't allow it. Regards, Nitesh Shetty
>>> >>> Should this be TEST_DEV instead ? >> >> why ? >> > My understanding of blktests is, add device which we want to test in > config files under TEST_DEV (except null-blk and nvme-fabrics loopback > devices, which are usually populated inside the tests). > In this case, if someone do not want to disturb nvme0n1 device, > this test doesn't allow it. > > Regards, > Nitesh Shetty > it is clearly stated in the documentation that blktests are destructive to the entire system and including any devices you have, if your system has sensitive data then _don't run these tests_ simple, when you are running blktests you are bound to disturb system you can't prevent that by using TEST_DEV. -ck
Hi Chaitanya, thanks for this new test case. I will comment on the whole patch later. Here I share my comment on the TEST_DEV topic. On Jan 10, 2024 / 08:12, Chaitanya Kulkarni wrote: > > >>> > >>> Should this be TEST_DEV instead ? > >> > >> why ? > >> > > My understanding of blktests is, add device which we want to test in > > config files under TEST_DEV (except null-blk and nvme-fabrics loopback > > devices, which are usually populated inside the tests). > > In this case, if someone do not want to disturb nvme0n1 device, > > this test doesn't allow it. > > > > Regards, > > Nitesh Shetty > > > > it is clearly stated in the documentation that blktests are destructive > to the entire system and including any devices you have, if your > system has sensitive data then _don't run these tests_ simple, when > you are running blktests you are bound to disturb system you can't > prevent that by using TEST_DEV. It's true that blktests is destructive to the test system and the test devices, but I still think it's good to reduce the level of destructive risk. It is likely that /dev/nvme0n1 is the system disk of the test system and the test case puts some risk on it. By using TEST_DEV, blktests users can specify non-system disk for testing and reduce the risk. We can run the test case (and its following test cases to be added in the future) more reliably. On top of that, TEST_DEV will help blktests users to control which device to take the destructive risk. It also helps to check that the test target device exists and it is a block, nvme device. So I recommend to use TEST_DEV in place of /dev/nvme0n1, and replace test() with test_device(). TEST_DEV_SYSFS will help to access the io-timeout-fail sysfs attribute.
On 1/10/24 09:12, Chaitanya Kulkarni wrote: > >>>> >>>> Should this be TEST_DEV instead ? >>> >>> why ? >>> >> My understanding of blktests is, add device which we want to test in >> config files under TEST_DEV (except null-blk and nvme-fabrics loopback >> devices, which are usually populated inside the tests). >> In this case, if someone do not want to disturb nvme0n1 device, >> this test doesn't allow it. >> >> Regards, >> Nitesh Shetty >> > > it is clearly stated in the documentation that blktests are destructive > to the entire system and including any devices you have, if your > system has sensitive data then _don't run these tests_ simple, when > you are running blktests you are bound to disturb system you can't > prevent that by using TEST_DEV. > I don't think this is the direction we want to go. NVMe for internal drives is becoming more and more prevalent (especially on laptops), making them unusable for running blktests. We have been putting quite some effort into nvme-cli to ensure that blktest _can_ run concurrently with other NVMe drives, so really we should not hard-code any device names. Or we discuss this at LSF/MM; Daniel or me will be happy to give an overview about concurrent NVMe-oF management applications on the same system (nvme-cli, nvme-stas, blktests all running at the same time). Cheers, Hannes
Thanks again for this test case. Please find my review comments. Tomorrow, I will try to run this test case. On Jan 09, 2024 / 19:57, Chaitanya Kulkarni wrote: > Trigger and test nvme-pci timeout with concurrent fio jobs. Is there any related kernel commit which motivated this test case? If so, can we add a kernel commit or e-mail discussion link as a reference? > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> > --- > tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ > tests/nvme/050.out | 2 ++ > 2 files changed, 45 insertions(+) > create mode 100755 tests/nvme/050 > create mode 100644 tests/nvme/050.out > > diff --git a/tests/nvme/050 b/tests/nvme/050 > new file mode 100755 > index 0000000..ba54bcd > --- /dev/null > +++ b/tests/nvme/050 > @@ -0,0 +1,43 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-3.0+ > +# Copyright (C) 2024 Chaitanya Kulkarni. Nit: you may want to remove the last '.' > +# > +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. > +# > + > +. tests/nvme/rc > + > +DESCRIPTION="test nvme-pci timeout with fio jobs." To be consitent with other test cases, let's remove the last '.'. > + > +requires() { > + _require_nvme_trtype pci > + _have_fio > + _nvme_requires This test case depends on the kernel config FAIL_IO_TIMEOUT. Let's add _have_kernel_option FAIL_IO_TIMEOUT > +} > + > +test() { > + local dev="/dev/nvme0n1" > + > + echo "Running ${TEST_NAME}" > + > + echo 1 > /sys/block/nvme0n1/io-timeout-fail > + > + echo 99 > /sys/kernel/debug/fail_io_timeout/probability > + echo 10 > /sys/kernel/debug/fail_io_timeout/interval > + echo -1 > /sys/kernel/debug/fail_io_timeout/times > + echo 0 > /sys/kernel/debug/fail_io_timeout/space > + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose To avoid impact on following test cases, I suggest to restore the original values of the fail_io_timeout/* sysfs attributes above at the end of this test case. _nvme_err_inject_setup() and _nvme_err_inject_cleanup() in test/nvme/rc do similar thing. We can add new helper functions in same manner. > + > + # shellcheck disable=SC2046 > + fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \ Double quotes for "$(nproc)" will allow to remove the shellcheck exception, probably. > + --name=reads --direct=1 --filename=${dev} --group_reporting \ > + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" > + > + # shellcheck disable=SC2181 > + if [ $? -eq 0 ]; then > + echo "Test complete" > + else > + echo "Test failed" > + fi > + modprobe -r nvme If nvme driver is built-in, this unload command does not work. Do we really need to unload nvme driver here? > +} > diff --git a/tests/nvme/050.out b/tests/nvme/050.out > new file mode 100644 > index 0000000..b78b05f > --- /dev/null > +++ b/tests/nvme/050.out > @@ -0,0 +1,2 @@ > +Running nvme/050 > +Test complete > -- > 2.40.0 > >
On 1/10/24 1:12 AM, Chaitanya Kulkarni wrote: > >>>> >>>> Should this be TEST_DEV instead ? >>> >>> why ? >>> >> My understanding of blktests is, add device which we want to test in >> config files under TEST_DEV (except null-blk and nvme-fabrics loopback >> devices, which are usually populated inside the tests). >> In this case, if someone do not want to disturb nvme0n1 device, >> this test doesn't allow it. >> >> Regards, >> Nitesh Shetty >> > > it is clearly stated in the documentation that blktests are destructive > to the entire system and including any devices you have, if your > system has sensitive data then _don't run these tests_ simple, when > you are running blktests you are bound to disturb system you can't > prevent that by using TEST_DEV. It should very much be possible to run blktests on specific devices ONLY, the intent of a test suite is not to potentially mess up everything around you. It should ONLY operate on the devices given, not poke at random devices in the system.
On 1/10/2024 1:24 AM, Hannes Reinecke wrote: > On 1/10/24 09:12, Chaitanya Kulkarni wrote: >> >>>>> >>>>> Should this be TEST_DEV instead ? >>>> >>>> why ? >>>> >>> My understanding of blktests is, add device which we want to test in >>> config files under TEST_DEV (except null-blk and nvme-fabrics loopback >>> devices, which are usually populated inside the tests). >>> In this case, if someone do not want to disturb nvme0n1 device, >>> this test doesn't allow it. >>> >>> Regards, >>> Nitesh Shetty >>> >> >> it is clearly stated in the documentation that blktests are destructive >> to the entire system and including any devices you have, if your >> system has sensitive data then _don't run these tests_ simple, when >> you are running blktests you are bound to disturb system you can't >> prevent that by using TEST_DEV. >> > I don't think this is the direction we want to go. > NVMe for internal drives is becoming more and more prevalent (especially > on laptops), making them unusable for running blktests. > > We have been putting quite some effort into nvme-cli to ensure that > blktest _can_ run concurrently with other NVMe drives, so really we > should not hard-code any device names. > Okay ... > Or we discuss this at LSF/MM; Daniel or me will be happy to give an > overview about concurrent NVMe-oF management applications on the same > system (nvme-cli, nvme-stas, blktests all running at the same time). > would be a great topic for the blktests session ... > Cheers, > > Hannes > I'll make it TEST_DEV change ... -ck
On 1/10/2024 6:59 AM, Jens Axboe wrote: > On 1/10/24 1:12 AM, Chaitanya Kulkarni wrote: >> >>>>> >>>>> Should this be TEST_DEV instead ? >>>> >>>> why ? >>>> >>> My understanding of blktests is, add device which we want to test in >>> config files under TEST_DEV (except null-blk and nvme-fabrics loopback >>> devices, which are usually populated inside the tests). >>> In this case, if someone do not want to disturb nvme0n1 device, >>> this test doesn't allow it. >>> >>> Regards, >>> Nitesh Shetty >>> >> >> it is clearly stated in the documentation that blktests are destructive >> to the entire system and including any devices you have, if your >> system has sensitive data then _don't run these tests_ simple, when >> you are running blktests you are bound to disturb system you can't >> prevent that by using TEST_DEV. > > It should very much be possible to run blktests on specific devices > ONLY, the intent of a test suite is not to potentially mess up > everything around you. It should ONLY operate on the devices given, not > poke at random devices in the system. > I'll make the TEST_DEV change .. -ck
On 1/10/2024 4:23 AM, Shinichiro Kawasaki wrote: > Thanks again for this test case. Please find my review comments. Tomorrow, I > will try to run this test case. > > On Jan 09, 2024 / 19:57, Chaitanya Kulkarni wrote: >> Trigger and test nvme-pci timeout with concurrent fio jobs. > > Is there any related kernel commit which motivated this test case? If so, > can we add a kernel commit or e-mail discussion link as a reference? > no, this part if never tested on the regular basis as it has some complicated logic I believe it is much needed test .. >> >> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> >> --- >> tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ >> tests/nvme/050.out | 2 ++ >> 2 files changed, 45 insertions(+) >> create mode 100755 tests/nvme/050 >> create mode 100644 tests/nvme/050.out >> >> diff --git a/tests/nvme/050 b/tests/nvme/050 >> new file mode 100755 >> index 0000000..ba54bcd >> --- /dev/null >> +++ b/tests/nvme/050 >> @@ -0,0 +1,43 @@ >> +#!/bin/bash >> +# SPDX-License-Identifier: GPL-3.0+ >> +# Copyright (C) 2024 Chaitanya Kulkarni. > > Nit: you may want to remove the last '.' okay ... > >> +# >> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. >> +# >> + >> +. tests/nvme/rc >> + >> +DESCRIPTION="test nvme-pci timeout with fio jobs." > > To be consitent with other test cases, let's remove the last '.'. okay ... > >> + >> +requires() { >> + _require_nvme_trtype pci >> + _have_fio >> + _nvme_requires > > This test case depends on the kernel config FAIL_IO_TIMEOUT. Let's add > > _have_kernel_option FAIL_IO_TIMEOUT indeed, I'll add that ... > >> +} >> + >> +test() { >> + local dev="/dev/nvme0n1" >> + >> + echo "Running ${TEST_NAME}" >> + >> + echo 1 > /sys/block/nvme0n1/io-timeout-fail >> + >> + echo 99 > /sys/kernel/debug/fail_io_timeout/probability >> + echo 10 > /sys/kernel/debug/fail_io_timeout/interval >> + echo -1 > /sys/kernel/debug/fail_io_timeout/times >> + echo 0 > /sys/kernel/debug/fail_io_timeout/space >> + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose > > To avoid impact on following test cases, I suggest to restore the original > values of the fail_io_timeout/* sysfs attributes above at the end of this > test case. _nvme_err_inject_setup() and _nvme_err_inject_cleanup() in > test/nvme/rc do similar thing. We can add new helper functions in same manner. I can add the code for config and restore, but at this point there are no other testcases using this ?(correct me if I'm wrong), so instead of bloating the code in nvme/rc let's keep it for this testcase only ? and add common helpers code later once we have more users for this case ? > >> + >> + # shellcheck disable=SC2046 >> + fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \ > > Double quotes for "$(nproc)" will allow to remove the shellcheck exception, > probably. not sure I understand this comment, can you please elaborate ? > >> + --name=reads --direct=1 --filename=${dev} --group_reporting \ >> + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" >> + >> + # shellcheck disable=SC2181 >> + if [ $? -eq 0 ]; then >> + echo "Test complete" >> + else >> + echo "Test failed" >> + fi >> + modprobe -r nvme > > If nvme driver is built-in, this unload command does not work. > Do we really need to unload nvme driver here? > Yes, post timeout handler execution we need to make sure that device can be removed at the time of module unload, perhaps there is a way to avoid this when nvme is a built-in module so that test will not execute above ? any suggestions ? -ck
On Wed, Jan 10, 2024 at 11:58 AM Chaitanya Kulkarni <kch@nvidia.com> wrote: > > Trigger and test nvme-pci timeout with concurrent fio jobs. > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> > --- > tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ > tests/nvme/050.out | 2 ++ > 2 files changed, 45 insertions(+) > create mode 100755 tests/nvme/050 > create mode 100644 tests/nvme/050.out > > diff --git a/tests/nvme/050 b/tests/nvme/050 > new file mode 100755 > index 0000000..ba54bcd > --- /dev/null > +++ b/tests/nvme/050 > @@ -0,0 +1,43 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-3.0+ > +# Copyright (C) 2024 Chaitanya Kulkarni. > +# > +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. > +# > + > +. tests/nvme/rc How about restricting this test to pci only? just like nvme/032 did. nvme_trtype=pci > + > +DESCRIPTION="test nvme-pci timeout with fio jobs." > + > +requires() { > + _require_nvme_trtype pci > + _have_fio > + _nvme_requires > +} Check the test dev is nvme: device_requires() { _require_test_dev_is_nvme } > + > +test() { Use test_device() here. > + local dev="/dev/nvme0n1" > + > + echo "Running ${TEST_NAME}" > + > + echo 1 > /sys/block/nvme0n1/io-timeout-fail > + > + echo 99 > /sys/kernel/debug/fail_io_timeout/probability > + echo 10 > /sys/kernel/debug/fail_io_timeout/interval > + echo -1 > /sys/kernel/debug/fail_io_timeout/times > + echo 0 > /sys/kernel/debug/fail_io_timeout/space > + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose > + > + # shellcheck disable=SC2046 > + fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \ > + --name=reads --direct=1 --filename=${dev} --group_reporting \ > + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" > + > + # shellcheck disable=SC2181 > + if [ $? -eq 0 ]; then > + echo "Test complete" > + else > + echo "Test failed" > + fi > + modprobe -r nvme > +} > diff --git a/tests/nvme/050.out b/tests/nvme/050.out > new file mode 100644 > index 0000000..b78b05f > --- /dev/null > +++ b/tests/nvme/050.out > @@ -0,0 +1,2 @@ > +Running nvme/050 > +Test complete > -- > 2.40.0 > >
On 1/10/24 17:25, Yi Zhang wrote: > On Wed, Jan 10, 2024 at 11:58 AM Chaitanya Kulkarni <kch@nvidia.com> wrote: >> Trigger and test nvme-pci timeout with concurrent fio jobs. >> >> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> >> --- >> tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ >> tests/nvme/050.out | 2 ++ >> 2 files changed, 45 insertions(+) >> create mode 100755 tests/nvme/050 >> create mode 100644 tests/nvme/050.out >> >> diff --git a/tests/nvme/050 b/tests/nvme/050 >> new file mode 100755 >> index 0000000..ba54bcd >> --- /dev/null >> +++ b/tests/nvme/050 >> @@ -0,0 +1,43 @@ >> +#!/bin/bash >> +# SPDX-License-Identifier: GPL-3.0+ >> +# Copyright (C) 2024 Chaitanya Kulkarni. >> +# >> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. >> +# >> + >> +. tests/nvme/rc > How about restricting this test to pci only? just like nvme/032 did. > nvme_trtype=pci Yes it has to now, since I'm adding TEST_DEV, thanks for mentioning that already added that change to v2. >> + >> +DESCRIPTION="test nvme-pci timeout with fio jobs." >> + >> +requires() { >> + _require_nvme_trtype pci >> + _have_fio >> + _nvme_requires >> +} > Check the test dev is nvme: > device_requires() { > _require_test_dev_is_nvme > } > >> + >> +test() { > Use test_device() here. okay .. -ck
On Jan 11, 2024 / 00:00, Chaitanya Kulkarni wrote: > On 1/10/2024 4:23 AM, Shinichiro Kawasaki wrote: > > Thanks again for this test case. Please find my review comments. Tomorrow, I > > will try to run this test case. > > > > On Jan 09, 2024 / 19:57, Chaitanya Kulkarni wrote: > >> Trigger and test nvme-pci timeout with concurrent fio jobs. > > > > Is there any related kernel commit which motivated this test case? If so, > > can we add a kernel commit or e-mail discussion link as a reference? > > > > no, this part if never tested on the regular basis as it has some > complicated logic I believe it is much needed test .. Okay, it's good to expand the test coverage. ... > >> +} > >> + > >> +test() { > >> + local dev="/dev/nvme0n1" > >> + > >> + echo "Running ${TEST_NAME}" > >> + > >> + echo 1 > /sys/block/nvme0n1/io-timeout-fail > >> + > >> + echo 99 > /sys/kernel/debug/fail_io_timeout/probability > >> + echo 10 > /sys/kernel/debug/fail_io_timeout/interval > >> + echo -1 > /sys/kernel/debug/fail_io_timeout/times > >> + echo 0 > /sys/kernel/debug/fail_io_timeout/space > >> + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose > > > > To avoid impact on following test cases, I suggest to restore the original > > values of the fail_io_timeout/* sysfs attributes above at the end of this > > test case. _nvme_err_inject_setup() and _nvme_err_inject_cleanup() in > > test/nvme/rc do similar thing. We can add new helper functions in same manner. > > I can add the code for config and restore, but at this point there are > no other testcases using this ?(correct me if I'm wrong), This new test case is the only one user of fail_io_timeout, I believe. > so instead of > bloating the code in nvme/rc let's keep it for this testcase only ? > and add common helpers code later once we have more users for this > case ? I think either way is good. > > > > >> + > >> + # shellcheck disable=SC2046 > >> + fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \ > > > > Double quotes for "$(nproc)" will allow to remove the shellcheck exception, > > probably. > > not sure I understand this comment, can you please elaborate ? I suggest this: diff --git a/tests/nvme/050 b/tests/nvme/050 index ba54bcd..8e5acb2 100755 --- a/tests/nvme/050 +++ b/tests/nvme/050 @@ -28,8 +28,7 @@ test() { echo 0 > /sys/kernel/debug/fail_io_timeout/space echo 1 > /sys/kernel/debug/fail_io_timeout/verbose - # shellcheck disable=SC2046 - fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \ + fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \ --name=reads --direct=1 --filename=${dev} --group_reporting \ --time_based --runtime=1m 2>&1 | grep -q "Input/output error" > > > > >> + --name=reads --direct=1 --filename=${dev} --group_reporting \ > >> + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" > >> + > >> + # shellcheck disable=SC2181 > >> + if [ $? -eq 0 ]; then > >> + echo "Test complete" > >> + else > >> + echo "Test failed" > >> + fi > >> + modprobe -r nvme > > > > If nvme driver is built-in, this unload command does not work. > > Do we really need to unload nvme driver here? > > > > Yes, post timeout handler execution we need to make sure that device > can be removed at the time of module unload, perhaps there is a way to > avoid this when nvme is a built-in module so that test will not execute > above ? any suggestions ? I see... One idea I can think of is to remove the device using PCI device sysfs. How about the code below? block/011 does similar thing. To do that for TEST_DEV, we can use _get_pci_from_dev_sysfs() in place of _get_pci_from_dev_sysfs(). Does this meet your intent? diff --git a/tests/nvme/050 b/tests/nvme/050 index 8e5acb2..592f167 100755 --- a/tests/nvme/050 +++ b/tests/nvme/050 @@ -17,6 +17,9 @@ requires() { test() { local dev="/dev/nvme0n1" + local pdev + + pdev="$(_get_pci_from_dev_sysfs /dev/nvme0n1)" echo "Running ${TEST_NAME}" @@ -38,5 +41,11 @@ test() { else echo "Test failed" fi - modprobe -r nvme + + # Remove and rescan the NVME device to ensure that it has come back + echo 1 > "/sys/bus/pci/devices/$pdev/remove" + echo 1 > /sys/bus/pci/rescan + if [[ ! -b $TEST_DEV ]]; then + echo "Failed to regain $TEST_DEV" + fi } BTW, when I ran the test case, I observed the test case failed. I took a quick look, and saw the nvme driver reported I/O timeout in kernel messages. But fio did not report the expected I/O error. Not sure why. I will look closer when I check v2 patch.
> BTW, when I ran the test case, I observed the test case failed. I took a quick > look, and saw the nvme driver reported I/O timeout in kernel messages. But fio > did not report the expected I/O error. Not sure why. I will look closer when I > check v2 patch. Ohh really ? I've consistently seen I/O timeout error in the kernel message and with fio output giving I/O error. Thanks a lot for looking into this, all the suggestions look good to me, meanwhile I've raised a question that ideal scenario should be in order to pass the testcase see [1]. As I'm getting device with 0 capacity at the end of testcase and there are still open questions. As soon as I get reply will modify the testcase passing scenario and send V2. -ck [1] http://lists.infradead.org/pipermail/linux-nvme/2024-January/044263.html
diff --git a/tests/nvme/050 b/tests/nvme/050 new file mode 100755 index 0000000..ba54bcd --- /dev/null +++ b/tests/nvme/050 @@ -0,0 +1,43 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-3.0+ +# Copyright (C) 2024 Chaitanya Kulkarni. +# +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function. +# + +. tests/nvme/rc + +DESCRIPTION="test nvme-pci timeout with fio jobs." + +requires() { + _require_nvme_trtype pci + _have_fio + _nvme_requires +} + +test() { + local dev="/dev/nvme0n1" + + echo "Running ${TEST_NAME}" + + echo 1 > /sys/block/nvme0n1/io-timeout-fail + + echo 99 > /sys/kernel/debug/fail_io_timeout/probability + echo 10 > /sys/kernel/debug/fail_io_timeout/interval + echo -1 > /sys/kernel/debug/fail_io_timeout/times + echo 0 > /sys/kernel/debug/fail_io_timeout/space + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose + + # shellcheck disable=SC2046 + fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \ + --name=reads --direct=1 --filename=${dev} --group_reporting \ + --time_based --runtime=1m 2>&1 | grep -q "Input/output error" + + # shellcheck disable=SC2181 + if [ $? -eq 0 ]; then + echo "Test complete" + else + echo "Test failed" + fi + modprobe -r nvme +} diff --git a/tests/nvme/050.out b/tests/nvme/050.out new file mode 100644 index 0000000..b78b05f --- /dev/null +++ b/tests/nvme/050.out @@ -0,0 +1,2 @@ +Running nvme/050 +Test complete
Trigger and test nvme-pci timeout with concurrent fio jobs. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> --- tests/nvme/050 | 43 +++++++++++++++++++++++++++++++++++++++++++ tests/nvme/050.out | 2 ++ 2 files changed, 45 insertions(+) create mode 100755 tests/nvme/050 create mode 100644 tests/nvme/050.out