Message ID | b22d1b62-a6b1-4dd6-9ccb-827442846f3c@roeck-us.net (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Persistent problem with handshake unit tests | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Guessing tree name failed - patch did not apply |
On Fri, 2 Feb 2024 09:21:22 -0800 Guenter Roeck wrote: > when running handshake kunit tests in qemu, I always get the following > failure. Sorry for sidetracking - how do you run kunit to get all the tests? We run: ./tools/testing/kunit/kunit.py run --alltests but more and more I feel like the --alltests is a cruel joke.
On 2/2/24 11:22, Jakub Kicinski wrote: > On Fri, 2 Feb 2024 09:21:22 -0800 Guenter Roeck wrote: >> when running handshake kunit tests in qemu, I always get the following >> failure. > > Sorry for sidetracking - how do you run kunit to get all the tests? > We run: > > ./tools/testing/kunit/kunit.py run --alltests > > but more and more I feel like the --alltests is a cruel joke. I have CONFIG_NET_HANDSHAKE_KUNIT_TEST=y enabled in my configuration. The tests run during boot, so no additional work is needed. I don't run all tests because many take too long to execute in qemu. Guenter
On 2/2/24 12:49, Guenter Roeck wrote: > On 2/2/24 11:22, Jakub Kicinski wrote: >> On Fri, 2 Feb 2024 09:21:22 -0800 Guenter Roeck wrote: >>> when running handshake kunit tests in qemu, I always get the following >>> failure. >> >> Sorry for sidetracking - how do you run kunit to get all the tests? >> We run: >> >> ./tools/testing/kunit/kunit.py run --alltests >> >> but more and more I feel like the --alltests is a cruel joke. > > I have CONFIG_NET_HANDSHAKE_KUNIT_TEST=y enabled in my configuration. > The tests run during boot, so no additional work is needed. I don't run all > tests because many take too long to execute in qemu. > Follow-up: If this test isn't supposed to run during boot, please let me know and I'll drop it. Thanks, Guenter
On Fri, 2 Feb 2024 12:51:07 -0800 Guenter Roeck wrote: > > I have CONFIG_NET_HANDSHAKE_KUNIT_TEST=y enabled in my configuration. > > The tests run during boot, so no additional work is needed. I don't run all > > tests because many take too long to execute in qemu. I was wondering how do you discover the options. Periodically grep for new possible KUNIT options added to the kernel? Have some script that does it. > Follow-up: If this test isn't supposed to run during boot, please > let me know and I'll drop it. FWIW I don't see why, but Chuck is the best person to comment.
On 2/2/24 16:47, Jakub Kicinski wrote: > On Fri, 2 Feb 2024 12:51:07 -0800 Guenter Roeck wrote: >>> I have CONFIG_NET_HANDSHAKE_KUNIT_TEST=y enabled in my configuration. >>> The tests run during boot, so no additional work is needed. I don't run all >>> tests because many take too long to execute in qemu. > > I was wondering how do you discover the options. Periodically grep for > new possible KUNIT options added to the kernel? Have some script that > does it. > "Periodically grep for > new possible KUNIT options added to the kernel" Exactly, and see what happens if I enable them to determine if they run fast and if they are stable enough to be enabled in a qemu test. I can not enable tests automatically because some just take too long to run, as in # takes too long # enable_config "${fragment}" CONFIG_TEST_RHASHTABLE or # takes too long # enable_config "${fragment}" CONFIG_TORTURE_TEST CONFIG_LOCK_TORTURE_TEST CONFIG_RCU_TORTURE_TEST Sometimes I have to disable tests because a long-running sub-option is added, as in # CONFIG_MEMCPY_KUNIT_TEST sometimes takes more than 45 seconds. # CONFIG_MEMCPY_SLOW_KUNIT_TEST avoids this, but last time I checked # this was not present in all affected kernel branches. # enable_config "${fragment} CONFIG_MEMCPY_KUNIT_TEST or because of problems such as # RTC library unit tests hang in many qemu emulations # enable_config "${fragment}" CONFIG_RTC_LIB_KUNIT_TEST or # clock unit tests seem to introduce noise warning tracebacks # enable_config "${fragment}" CONFIG_CLK_GATE_KUNIT_TEST CONFIG_CLK_KUNIT_TEST or # Results in lots of "ASoC: Parent card not yet available" log messages # enable_config "${fragment}" CONFIG_SND_SOC_TOPOLOGY_KUNIT_TEST SND_SOC_UTILS_KUNIT_TEST or # CONFIG_WW_MUTEX_SELFTEST interferes with CONFIG_PREEMPT=y # enable_config "${fragment}" CONFIG_WW_MUTEX_SELFTEST or # generates warning backtraces on purpose # enable_config "${fragment}" CONFIG_OF_UNITTEST Thanks, Guenter >> Follow-up: If this test isn't supposed to run during boot, please >> let me know and I'll drop it. > > FWIW I don't see why, but Chuck is the best person to comment.
On Fri, 2 Feb 2024 17:45:55 -0800 Guenter Roeck wrote: > > I was wondering how do you discover the options. Periodically grep for > > new possible KUNIT options added to the kernel? Have some script that > > does it. > > "Periodically grep for > new possible KUNIT options added to the kernel" > > Exactly, and see what happens if I enable them to determine if they run > fast and if they are stable enough to be enabled in a qemu test. I can not > enable tests automatically because some just take too long to run... Ugh, we have such a long way to go with kernel testing :( At least I know I'm not missing an obvious trick, thanks for explaining!
> On Feb 2, 2024, at 3:51 PM, Guenter Roeck <linux@roeck-us.net> wrote: > > On 2/2/24 12:49, Guenter Roeck wrote: >> On 2/2/24 11:22, Jakub Kicinski wrote: >>> On Fri, 2 Feb 2024 09:21:22 -0800 Guenter Roeck wrote: >>>> when running handshake kunit tests in qemu, I always get the following >>>> failure. >>> >>> Sorry for sidetracking - how do you run kunit to get all the tests? >>> We run: >>> >>> ./tools/testing/kunit/kunit.py run --alltests >>> >>> but more and more I feel like the --alltests is a cruel joke. >> I have CONFIG_NET_HANDSHAKE_KUNIT_TEST=y enabled in my configuration. >> The tests run during boot, so no additional work is needed. I don't run all >> tests because many take too long to execute in qemu. > > Follow-up: If this test isn't supposed to run during boot, please > let me know and I'll drop it. These are pretty simple tests that should run quickly. I don't see a reason to exclude them. Generally the memory environment in the Kunit test harness is significantly different than the one at boot, so problems like this do crop up for qemu vs. boot time on occasion. Question is, has this test started to fail only recently, or has it been broken since it was merged? -- Chuck Lever
On 2/2/24 19:46, Chuck Lever III wrote: > > >> On Feb 2, 2024, at 3:51 PM, Guenter Roeck <linux@roeck-us.net> wrote: >> >> On 2/2/24 12:49, Guenter Roeck wrote: >>> On 2/2/24 11:22, Jakub Kicinski wrote: >>>> On Fri, 2 Feb 2024 09:21:22 -0800 Guenter Roeck wrote: >>>>> when running handshake kunit tests in qemu, I always get the following >>>>> failure. >>>> >>>> Sorry for sidetracking - how do you run kunit to get all the tests? >>>> We run: >>>> >>>> ./tools/testing/kunit/kunit.py run --alltests >>>> >>>> but more and more I feel like the --alltests is a cruel joke. >>> I have CONFIG_NET_HANDSHAKE_KUNIT_TEST=y enabled in my configuration. >>> The tests run during boot, so no additional work is needed. I don't run all >>> tests because many take too long to execute in qemu. >> >> Follow-up: If this test isn't supposed to run during boot, please >> let me know and I'll drop it. > > These are pretty simple tests that should run quickly. I don't see > a reason to exclude them. Generally the memory environment in the > Kunit test harness is significantly different than the one at boot, > so problems like this do crop up for qemu vs. boot time on occasion. > > Question is, has this test started to fail only recently, or has it > been broken since it was merged? > I am wuite sure that it has been broken since it was merged. I only recently started actually checking test results, so I missed the problem for a while. Also, I am not sure if anyone besides me is actually running those tests ;-) Guenter
On Fri, Feb 02, 2024 at 09:21:22AM -0800, Guenter Roeck wrote: > Hi, > > when running handshake kunit tests in qemu, I always get the following > failure. > > KTAP version 1 > # Subtest: Handshake API tests > 1..11 > KTAP version 1 > # Subtest: req_alloc API fuzzing > ok 1 handshake_req_alloc NULL proto > ok 2 handshake_req_alloc CLASS_NONE > ok 3 handshake_req_alloc CLASS_MAX > ok 4 handshake_req_alloc no callbacks > ok 5 handshake_req_alloc no done callback > ok 6 handshake_req_alloc excessive privsize > ok 7 handshake_req_alloc all good > # req_alloc API fuzzing: pass:7 fail:0 skip:0 total:7 > ok 1 req_alloc API fuzzing > ok 2 req_submit NULL req arg > ok 3 req_submit NULL sock arg > ok 4 req_submit NULL sock->file > ok 5 req_lookup works > ok 6 req_submit max pending > ok 7 req_submit multiple > ok 8 req_cancel before accept > ok 9 req_cancel after accept > ok 10 req_cancel after done > # req_destroy works: EXPECTATION FAILED at net/handshake/handshake-test.c:478 > Expected handshake_req_destroy_test == req, but > handshake_req_destroy_test == 00000000 > req == c5080280 > not ok 11 req_destroy works > # Handshake API tests: pass:10 fail:1 skip:0 total:11 > # Totals: pass:16 fail:1 skip:0 total:17 > not ok 31 Handshake API tests > ############## destroy 0xc5080280 > ... > > The line starting with "#######" is from added debug information. > > diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c > index 16ed7bfd29e4..a2417c56fe15 100644 > --- a/net/handshake/handshake-test.c > +++ b/net/handshake/handshake-test.c > @@ -434,6 +434,7 @@ static struct handshake_req *handshake_req_destroy_test; > > static void test_destroy_func(struct handshake_req *req) > { > + pr_info("############## destroy 0x%px\n", req); > handshake_req_destroy_test = req; > } > > It appears that the destroy function works, but is delayed. Unfortunately, > I don't know enough about the network subsystem and/or the handshake > protocol to suggest a fix. I'd be happy to submit a fix if you let me know > how that should look like. > > Thanks, > Guenter I am able to reproduce the test failure at boot: [ 125.404130] KTAP version 1 [ 125.404690] # Subtest: Handshake API tests [ 125.405540] 1..11 [ 125.405966] KTAP version 1 [ 125.406623] # Subtest: req_alloc API fuzzing [ 125.406971] ok 1 handshake_req_alloc NULL proto [ 125.408275] ok 2 handshake_req_alloc CLASS_NONE [ 125.409599] ok 3 handshake_req_alloc CLASS_MAX [ 125.410879] ok 4 handshake_req_alloc no callbacks [ 125.412200] ok 5 handshake_req_alloc no done callback [ 125.413525] ok 6 handshake_req_alloc excessive privsize [ 125.414896] ok 7 handshake_req_alloc all good [ 125.416036] # req_alloc API fuzzing: pass:7 fail:0 skip:0 total:7 [ 125.416891] ok 1 req_alloc API fuzzing [ 125.418439] ok 2 req_submit NULL req arg [ 125.419399] ok 3 req_submit NULL sock arg [ 125.420925] ok 4 req_submit NULL sock->file [ 125.422305] ok 5 req_lookup works [ 125.423667] ok 6 req_submit max pending [ 125.425061] ok 7 req_submit multiple [ 125.426151] ok 8 req_cancel before accept [ 125.427225] ok 9 req_cancel after accept [ 125.428318] ok 10 req_cancel after done [ 125.429424] # req_destroy works: EXPECTATION FAILED at net/handshake/handshake-test.c:477 [ 125.429424] Expected handshake_req_destroy_test == req, but [ 125.429424] handshake_req_destroy_test == 0000000000000000 [ 125.429424] req == ffff88802c5e6900 [ 125.430479] not ok 11 req_destroy works [ 125.435215] # Handshake API tests: pass:10 fail:1 skip:0 total:11 [ 125.435858] # Totals: pass:16 fail:1 skip:0 total:17 [ 125.437224] not ok 69 Handshake API tests I'll have a look.
On 2/6/24 08:47, Chuck Lever wrote: > On Fri, Feb 02, 2024 at 09:21:22AM -0800, Guenter Roeck wrote: >> Hi, >> >> when running handshake kunit tests in qemu, I always get the following >> failure. >> >> KTAP version 1 >> # Subtest: Handshake API tests >> 1..11 >> KTAP version 1 >> # Subtest: req_alloc API fuzzing >> ok 1 handshake_req_alloc NULL proto >> ok 2 handshake_req_alloc CLASS_NONE >> ok 3 handshake_req_alloc CLASS_MAX >> ok 4 handshake_req_alloc no callbacks >> ok 5 handshake_req_alloc no done callback >> ok 6 handshake_req_alloc excessive privsize >> ok 7 handshake_req_alloc all good >> # req_alloc API fuzzing: pass:7 fail:0 skip:0 total:7 >> ok 1 req_alloc API fuzzing >> ok 2 req_submit NULL req arg >> ok 3 req_submit NULL sock arg >> ok 4 req_submit NULL sock->file >> ok 5 req_lookup works >> ok 6 req_submit max pending >> ok 7 req_submit multiple >> ok 8 req_cancel before accept >> ok 9 req_cancel after accept >> ok 10 req_cancel after done >> # req_destroy works: EXPECTATION FAILED at net/handshake/handshake-test.c:478 >> Expected handshake_req_destroy_test == req, but >> handshake_req_destroy_test == 00000000 >> req == c5080280 >> not ok 11 req_destroy works >> # Handshake API tests: pass:10 fail:1 skip:0 total:11 >> # Totals: pass:16 fail:1 skip:0 total:17 >> not ok 31 Handshake API tests >> ############## destroy 0xc5080280 >> ... >> >> The line starting with "#######" is from added debug information. >> >> diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c >> index 16ed7bfd29e4..a2417c56fe15 100644 >> --- a/net/handshake/handshake-test.c >> +++ b/net/handshake/handshake-test.c >> @@ -434,6 +434,7 @@ static struct handshake_req *handshake_req_destroy_test; >> >> static void test_destroy_func(struct handshake_req *req) >> { >> + pr_info("############## destroy 0x%px\n", req); >> handshake_req_destroy_test = req; >> } >> >> It appears that the destroy function works, but is delayed. Unfortunately, >> I don't know enough about the network subsystem and/or the handshake >> protocol to suggest a fix. I'd be happy to submit a fix if you let me know >> how that should look like. >> >> Thanks, >> Guenter > > I am able to reproduce the test failure at boot: > > [ 125.404130] KTAP version 1 > [ 125.404690] # Subtest: Handshake API tests > [ 125.405540] 1..11 > [ 125.405966] KTAP version 1 > [ 125.406623] # Subtest: req_alloc API fuzzing > [ 125.406971] ok 1 handshake_req_alloc NULL proto > [ 125.408275] ok 2 handshake_req_alloc CLASS_NONE > [ 125.409599] ok 3 handshake_req_alloc CLASS_MAX > [ 125.410879] ok 4 handshake_req_alloc no callbacks > [ 125.412200] ok 5 handshake_req_alloc no done callback > [ 125.413525] ok 6 handshake_req_alloc excessive privsize > [ 125.414896] ok 7 handshake_req_alloc all good > [ 125.416036] # req_alloc API fuzzing: pass:7 fail:0 skip:0 total:7 > [ 125.416891] ok 1 req_alloc API fuzzing > [ 125.418439] ok 2 req_submit NULL req arg > [ 125.419399] ok 3 req_submit NULL sock arg > [ 125.420925] ok 4 req_submit NULL sock->file > [ 125.422305] ok 5 req_lookup works > [ 125.423667] ok 6 req_submit max pending > [ 125.425061] ok 7 req_submit multiple > [ 125.426151] ok 8 req_cancel before accept > [ 125.427225] ok 9 req_cancel after accept > [ 125.428318] ok 10 req_cancel after done > [ 125.429424] # req_destroy works: EXPECTATION FAILED at net/handshake/handshake-test.c:477 > [ 125.429424] Expected handshake_req_destroy_test == req, but > [ 125.429424] handshake_req_destroy_test == 0000000000000000 > [ 125.429424] req == ffff88802c5e6900 > [ 125.430479] not ok 11 req_destroy works > [ 125.435215] # Handshake API tests: pass:10 fail:1 skip:0 total:11 > [ 125.435858] # Totals: pass:16 fail:1 skip:0 total:17 > [ 125.437224] not ok 69 Handshake API tests > > I'll have a look. > Since the destroy function runs asynchronously, the best I could come up with was to use completion handling. The following patch fixes the problem for me. Guenter --- diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c index 16ed7bfd29e4..dc119c1e211b 100644 --- a/net/handshake/handshake-test.c +++ b/net/handshake/handshake-test.c @@ -432,9 +432,12 @@ static void handshake_req_cancel_test3(struct kunit *test) static struct handshake_req *handshake_req_destroy_test; +static DECLARE_COMPLETION(handshake_request_destroyed); + static void test_destroy_func(struct handshake_req *req) { handshake_req_destroy_test = req; + complete(&handshake_request_destroyed); } static struct handshake_proto handshake_req_alloc_proto_destroy = { @@ -473,6 +476,8 @@ static void handshake_req_destroy_test1(struct kunit *test) /* Act */ fput(filp); + wait_for_completion_timeout(&handshake_request_destroyed, msecs_to_jiffies(100)); + /* Assert */ KUNIT_EXPECT_PTR_EQ(test, handshake_req_destroy_test, req); }
On 2/6/24 09:55, Guenter Roeck wrote: [ ...] > > Since the destroy function runs asynchronously, the best I could come up with > was to use completion handling. The following patch fixes the problem for me. > I don't know if it is a proper fix, but I no longer see the problem with the patch below applied on top of v6.8-rc3. Guenter > > --- > diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c > index 16ed7bfd29e4..dc119c1e211b 100644 > --- a/net/handshake/handshake-test.c > +++ b/net/handshake/handshake-test.c > @@ -432,9 +432,12 @@ static void handshake_req_cancel_test3(struct kunit *test) > > static struct handshake_req *handshake_req_destroy_test; > > +static DECLARE_COMPLETION(handshake_request_destroyed); > + > static void test_destroy_func(struct handshake_req *req) > { > handshake_req_destroy_test = req; > + complete(&handshake_request_destroyed); > } > > static struct handshake_proto handshake_req_alloc_proto_destroy = { > @@ -473,6 +476,8 @@ static void handshake_req_destroy_test1(struct kunit *test) > /* Act */ > fput(filp); > > + wait_for_completion_timeout(&handshake_request_destroyed, msecs_to_jiffies(100)); > + > /* Assert */ > KUNIT_EXPECT_PTR_EQ(test, handshake_req_destroy_test, req); > } > > >
On Tue, Feb 06, 2024 at 11:05:36AM -0800, Guenter Roeck wrote: > On 2/6/24 09:55, Guenter Roeck wrote: > [ ...] > > > > > Since the destroy function runs asynchronously, the best I could come up with > > was to use completion handling. The following patch fixes the problem for me. > > > > I don't know if it is a proper fix, but I no longer see the problem with > the patch below applied on top of v6.8-rc3. I've got something similar but simpler that I will post with a full patch description, root cause, and rationale. Stand by. > Guenter > > > > > --- > > diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c > > index 16ed7bfd29e4..dc119c1e211b 100644 > > --- a/net/handshake/handshake-test.c > > +++ b/net/handshake/handshake-test.c > > @@ -432,9 +432,12 @@ static void handshake_req_cancel_test3(struct kunit *test) > > > > static struct handshake_req *handshake_req_destroy_test; > > > > +static DECLARE_COMPLETION(handshake_request_destroyed); > > + > > static void test_destroy_func(struct handshake_req *req) > > { > > handshake_req_destroy_test = req; > > + complete(&handshake_request_destroyed); > > } > > > > static struct handshake_proto handshake_req_alloc_proto_destroy = { > > @@ -473,6 +476,8 @@ static void handshake_req_destroy_test1(struct kunit *test) > > /* Act */ > > fput(filp); > > > > + wait_for_completion_timeout(&handshake_request_destroyed, msecs_to_jiffies(100)); > > + > > /* Assert */ > > KUNIT_EXPECT_PTR_EQ(test, handshake_req_destroy_test, req); > > } > > > > > > >
On 2/6/24 11:07, Chuck Lever wrote: > On Tue, Feb 06, 2024 at 11:05:36AM -0800, Guenter Roeck wrote: >> On 2/6/24 09:55, Guenter Roeck wrote: >> [ ...] >> >>> >>> Since the destroy function runs asynchronously, the best I could come up with >>> was to use completion handling. The following patch fixes the problem for me. >>> >> >> I don't know if it is a proper fix, but I no longer see the problem with >> the patch below applied on top of v6.8-rc3. > > I've got something similar but simpler that I will post with a full > patch description, root cause, and rationale. Stand by. > Excellent. Thanks! Guenter
diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c index 16ed7bfd29e4..a2417c56fe15 100644 --- a/net/handshake/handshake-test.c +++ b/net/handshake/handshake-test.c @@ -434,6 +434,7 @@ static struct handshake_req *handshake_req_destroy_test; static void test_destroy_func(struct handshake_req *req) { + pr_info("############## destroy 0x%px\n", req); handshake_req_destroy_test = req; }