Message ID | 20220615003733.16447-1-zhouzhouyi@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [linux-next] RFC: torture: add pr_flush to flush the results to console | expand |
On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > false positve because of kthread printing. > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > have added functions to prefer direct printing activated for the primary > import messages. > > Meanwhile adding printk_prefer_direct_enter/exit to > torture_cleanup_begin/end still can't eliminate the false positives > for a hundred percent. > > This patch add pr_flush to torture_cleanup_end because the latter > is sleepable. > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Just the one Signed-off-by suffices. ;-) > --- > kernel/torture.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/torture.c b/kernel/torture.c > index 789aeb0e1159..b51e42463a67 100644 > --- a/kernel/torture.c > +++ b/kernel/torture.c > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > mutex_lock(&fullstop_mutex); > torture_type = NULL; > mutex_unlock(&fullstop_mutex); > + pr_flush(1000, true); But in this case, the kernel keeps on running, correct? So won't the output appear over time? All of printk()'s kthreads are still running after all. Or do you have a use case that requires that the output appear sooner than it does? If so, please let me know what that is. Thanx, Paul > } > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > -- > 2.25.1 >
On Wed, Jun 15, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > > false positve because of kthread printing. > > > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > > have added functions to prefer direct printing activated for the primary > > import messages. > > > > Meanwhile adding printk_prefer_direct_enter/exit to > > torture_cleanup_begin/end still can't eliminate the false positives > > for a hundred percent. > > > > This patch add pr_flush to torture_cleanup_end because the latter > > is sleepable. > > > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > Just the one Signed-off-by suffices. ;-) Thank Paul for your guidance ;-) > > > --- > > kernel/torture.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/kernel/torture.c b/kernel/torture.c > > index 789aeb0e1159..b51e42463a67 100644 > > --- a/kernel/torture.c > > +++ b/kernel/torture.c > > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > > mutex_lock(&fullstop_mutex); > > torture_type = NULL; > > mutex_unlock(&fullstop_mutex); > > + pr_flush(1000, true); > > But in this case, the kernel keeps on running, correct? So won't the > output appear over time? All of printk()'s kthreads are still running > after all. I am doing research on linux-next following your guidance on my Dell PowerEdge R720 server: #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/ #cd linux-next #./tools/testing/selftests/rcutorture/bin/torture.sh after about 14 hours, the test result show that there have been about 17 failures, I examine the corresponding console.log, and find the failures are caused by following reason: before the message output by rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: SUCCESS") has reached serial console, torture_shutdown invoked kernel_power_off(), so the message "End of test: SUCCESS" has not chance of reaching console.log of host machine. Then I add printk_prefer_direct_enter/exit to torture_cleanup_begin/end, again I invoke #./tools/testing/selftests/rcutorture/bin/torture.sh after about 14 hours, the test result show that there have been about 5 failures, Then I add a debug statement after rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: SUCCESS") which sends the debug string ("hello world") to uart_console_write directly, again I invoke #./tools/testing/selftests/rcutorture/bin/torture.sh There is "hello world" in console.log but no "End of test" strings. So I guess even with printk_prefer_direct_enter/exit to torture_cleanup_begin/end, our main shutdown thread has failed in competition for console lock (printk.c): 2401 if (console_trylock_spinning()) 2402 console_unlock(); Then I add pr_flush to torture_cleanup_end #./tools/testing/selftests/rcutorture/bin/torture.sh after 14 hours all 49 tests are success > > Or do you have a use case that requires that the output appear sooner > than it does? If so, please let me know what that is. Thank Paul for your patience, ./tools/testing/selftests/rcutorture/bin/torture.sh is the only tool that I use. I am very interested in this topic, and very glad to do further experiments under your guidance ;-) Sincerely yours Thanks Zhouyi > > Thanx, Paul > > > } > > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > > > -- > > 2.25.1 > >
On Wed, Jun 15, 2022 at 08:36:54PM +0800, Zhouyi Zhou wrote: > On Wed, Jun 15, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > > > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > > > false positve because of kthread printing. > > > > > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > > > have added functions to prefer direct printing activated for the primary > > > import messages. > > > > > > Meanwhile adding printk_prefer_direct_enter/exit to > > > torture_cleanup_begin/end still can't eliminate the false positives > > > for a hundred percent. > > > > > > This patch add pr_flush to torture_cleanup_end because the latter > > > is sleepable. > > > > > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > Just the one Signed-off-by suffices. ;-) > Thank Paul for your guidance ;-) > > > > > --- > > > kernel/torture.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/kernel/torture.c b/kernel/torture.c > > > index 789aeb0e1159..b51e42463a67 100644 > > > --- a/kernel/torture.c > > > +++ b/kernel/torture.c > > > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > > > mutex_lock(&fullstop_mutex); > > > torture_type = NULL; > > > mutex_unlock(&fullstop_mutex); > > > + pr_flush(1000, true); > > > > But in this case, the kernel keeps on running, correct? So won't the > > output appear over time? All of printk()'s kthreads are still running > > after all. > I am doing research on linux-next following your guidance on my Dell > PowerEdge R720 server: > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/ > #cd linux-next > #./tools/testing/selftests/rcutorture/bin/torture.sh > after about 14 hours, the test result show that there have been about > 17 failures, I examine the corresponding console.log, and find the > failures are caused by following reason: > before the message output by > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > SUCCESS") has reached serial console, torture_shutdown invoked > kernel_power_off(), Ah, got it! And good show tracking this down, by the way. Does this workaround commit on -rcu do the trick for you? 704ae8dcda84 ("torture: Flush printk() buffers before powering off") The printk() guys are looking at fixes within the printk() system as well: https://lore.kernel.org/all/87tu8pgcj0.fsf@jogness.linutronix.de/ https://lore.kernel.org/all/YqdSw%2FfJvnkRbjvc@alley/ > so the message "End of test: SUCCESS" has not chance of reaching > console.log of host machine. > > Then I add printk_prefer_direct_enter/exit to > torture_cleanup_begin/end, again I invoke > #./tools/testing/selftests/rcutorture/bin/torture.sh > after about 14 hours, the test result show that there have been about > 5 failures, > > Then I add a debug statement after > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > SUCCESS") which sends the debug string ("hello world") to > uart_console_write directly, again I invoke > #./tools/testing/selftests/rcutorture/bin/torture.sh > There is "hello world" in console.log but no "End of test" strings. > > So I guess even with printk_prefer_direct_enter/exit to > torture_cleanup_begin/end, our main shutdown thread has failed in > competition for console lock (printk.c): > 2401 if (console_trylock_spinning()) > 2402 console_unlock(); > > Then I add pr_flush to torture_cleanup_end > #./tools/testing/selftests/rcutorture/bin/torture.sh > after 14 hours > all 49 tests are success > > > > Or do you have a use case that requires that the output appear sooner > > than it does? If so, please let me know what that is. > Thank Paul for your patience, > ./tools/testing/selftests/rcutorture/bin/torture.sh is the only tool > that I use. > > I am very interested in this topic, and very glad to do further > experiments under your guidance ;-) Or are you already running with that commit or one of those patches? They are working reliably for me (so far, anyway), but maybe more is needed in some cases. Thanx, Paul > Sincerely yours > Thanks > Zhouyi > > > > Thanx, Paul > > > > > } > > > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > > > > > -- > > > 2.25.1 > > >
On Wed, Jun 15, 2022 at 06:52:20AM -0700, Paul E. McKenney wrote: > On Wed, Jun 15, 2022 at 08:36:54PM +0800, Zhouyi Zhou wrote: > > On Wed, Jun 15, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > > > > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > > > > false positve because of kthread printing. > > > > > > > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > > > > have added functions to prefer direct printing activated for the primary > > > > import messages. > > > > > > > > Meanwhile adding printk_prefer_direct_enter/exit to > > > > torture_cleanup_begin/end still can't eliminate the false positives > > > > for a hundred percent. > > > > > > > > This patch add pr_flush to torture_cleanup_end because the latter > > > > is sleepable. > > > > > > > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > Just the one Signed-off-by suffices. ;-) > > Thank Paul for your guidance ;-) > > > > > > > --- > > > > kernel/torture.c | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/kernel/torture.c b/kernel/torture.c > > > > index 789aeb0e1159..b51e42463a67 100644 > > > > --- a/kernel/torture.c > > > > +++ b/kernel/torture.c > > > > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > > > > mutex_lock(&fullstop_mutex); > > > > torture_type = NULL; > > > > mutex_unlock(&fullstop_mutex); > > > > + pr_flush(1000, true); > > > > > > But in this case, the kernel keeps on running, correct? So won't the > > > output appear over time? All of printk()'s kthreads are still running > > > after all. > > I am doing research on linux-next following your guidance on my Dell > > PowerEdge R720 server: > > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/ > > #cd linux-next > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > after about 14 hours, the test result show that there have been about > > 17 failures, I examine the corresponding console.log, and find the > > failures are caused by following reason: > > before the message output by > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > SUCCESS") has reached serial console, torture_shutdown invoked > > kernel_power_off(), > > Ah, got it! And good show tracking this down, by the way. > > Does this workaround commit on -rcu do the trick for you? > > 704ae8dcda84 ("torture: Flush printk() buffers before powering off") > > The printk() guys are looking at fixes within the printk() system > as well: > > https://lore.kernel.org/all/87tu8pgcj0.fsf@jogness.linutronix.de/ > https://lore.kernel.org/all/YqdSw%2FfJvnkRbjvc@alley/ And Petr has posted a more official fix, which I have queued on -rcu for testing at branch pmladek.2022.06.15a. Thanx, Paul > > so the message "End of test: SUCCESS" has not chance of reaching > > console.log of host machine. > > > > Then I add printk_prefer_direct_enter/exit to > > torture_cleanup_begin/end, again I invoke > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > after about 14 hours, the test result show that there have been about > > 5 failures, > > > > Then I add a debug statement after > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > SUCCESS") which sends the debug string ("hello world") to > > uart_console_write directly, again I invoke > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > There is "hello world" in console.log but no "End of test" strings. > > > > So I guess even with printk_prefer_direct_enter/exit to > > torture_cleanup_begin/end, our main shutdown thread has failed in > > competition for console lock (printk.c): > > 2401 if (console_trylock_spinning()) > > 2402 console_unlock(); > > > > Then I add pr_flush to torture_cleanup_end > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > after 14 hours > > all 49 tests are success > > > > > > Or do you have a use case that requires that the output appear sooner > > > than it does? If so, please let me know what that is. > > Thank Paul for your patience, > > ./tools/testing/selftests/rcutorture/bin/torture.sh is the only tool > > that I use. > > > > I am very interested in this topic, and very glad to do further > > experiments under your guidance ;-) > > Or are you already running with that commit or one of those patches? > They are working reliably for me (so far, anyway), but maybe more is > needed in some cases. > > Thanx, Paul > > > Sincerely yours > > Thanks > > Zhouyi > > > > > > Thanx, Paul > > > > > > > } > > > > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > > > > > > > -- > > > > 2.25.1 > > > >
Thank Paul and Petr for your efforts! On Thu, Jun 16, 2022 at 3:09 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Wed, Jun 15, 2022 at 06:52:20AM -0700, Paul E. McKenney wrote: > > On Wed, Jun 15, 2022 at 08:36:54PM +0800, Zhouyi Zhou wrote: > > > On Wed, Jun 15, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > > > > > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > > > > > false positve because of kthread printing. > > > > > > > > > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > > > > > have added functions to prefer direct printing activated for the primary > > > > > import messages. > > > > > > > > > > Meanwhile adding printk_prefer_direct_enter/exit to > > > > > torture_cleanup_begin/end still can't eliminate the false positives > > > > > for a hundred percent. > > > > > > > > > > This patch add pr_flush to torture_cleanup_end because the latter > > > > > is sleepable. > > > > > > > > > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > Just the one Signed-off-by suffices. ;-) > > > Thank Paul for your guidance ;-) > > > > > > > > > --- > > > > > kernel/torture.c | 1 + > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > diff --git a/kernel/torture.c b/kernel/torture.c > > > > > index 789aeb0e1159..b51e42463a67 100644 > > > > > --- a/kernel/torture.c > > > > > +++ b/kernel/torture.c > > > > > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > > > > > mutex_lock(&fullstop_mutex); > > > > > torture_type = NULL; > > > > > mutex_unlock(&fullstop_mutex); > > > > > + pr_flush(1000, true); > > > > > > > > But in this case, the kernel keeps on running, correct? So won't the > > > > output appear over time? All of printk()'s kthreads are still running > > > > after all. > > > I am doing research on linux-next following your guidance on my Dell > > > PowerEdge R720 server: > > > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/ > > > #cd linux-next > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > after about 14 hours, the test result show that there have been about > > > 17 failures, I examine the corresponding console.log, and find the > > > failures are caused by following reason: > > > before the message output by > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > SUCCESS") has reached serial console, torture_shutdown invoked > > > kernel_power_off(), > > > > Ah, got it! And good show tracking this down, by the way. > > > > Does this workaround commit on -rcu do the trick for you? > > > > 704ae8dcda84 ("torture: Flush printk() buffers before powering off") > > > > The printk() guys are looking at fixes within the printk() system > > as well: > > > > https://lore.kernel.org/all/87tu8pgcj0.fsf@jogness.linutronix.de/ > > https://lore.kernel.org/all/YqdSw%2FfJvnkRbjvc@alley/ > > And Petr has posted a more official fix, which I have queued on -rcu > for testing at branch pmladek.2022.06.15a. I am very happy to see the official fix, and start test it on my Dell server by invoking ./tools/testing/selftests/rcutorture/bin/torture.sh It will take 12 hours for the above script to finish if the other guy who shares the server with me doesn't use it heavily ;-) Thanks Sincerely yours Zhouyi > > Thanx, Paul > > > > so the message "End of test: SUCCESS" has not chance of reaching > > > console.log of host machine. > > > > > > Then I add printk_prefer_direct_enter/exit to > > > torture_cleanup_begin/end, again I invoke > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > after about 14 hours, the test result show that there have been about > > > 5 failures, > > > > > > Then I add a debug statement after > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > SUCCESS") which sends the debug string ("hello world") to > > > uart_console_write directly, again I invoke > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > There is "hello world" in console.log but no "End of test" strings. > > > > > > So I guess even with printk_prefer_direct_enter/exit to > > > torture_cleanup_begin/end, our main shutdown thread has failed in > > > competition for console lock (printk.c): > > > 2401 if (console_trylock_spinning()) > > > 2402 console_unlock(); > > > > > > Then I add pr_flush to torture_cleanup_end > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > after 14 hours > > > all 49 tests are success > > > > > > > > Or do you have a use case that requires that the output appear sooner > > > > than it does? If so, please let me know what that is. > > > Thank Paul for your patience, > > > ./tools/testing/selftests/rcutorture/bin/torture.sh is the only tool > > > that I use. > > > > > > I am very interested in this topic, and very glad to do further > > > experiments under your guidance ;-) > > > > Or are you already running with that commit or one of those patches? > > They are working reliably for me (so far, anyway), but maybe more is > > needed in some cases. > > > > Thanx, Paul > > > > > Sincerely yours > > > Thanks > > > Zhouyi > > > > > > > > Thanx, Paul > > > > > > > > > } > > > > > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > > > > > > > > > -- > > > > > 2.25.1 > > > > >
[add Masahiro to CC list] 1. Abstract: During the test of 0ba7324b44282870af740a5a121add62c7f5f730 ("printk: Wait for the global console lock when the system is going down"), I found 0ba7324b44282870af740a5a121add62c7f5f730 itself is OK. But I guess 31cb50b5590fe911077b8463ad01144fac8fa4f3 ("kbuild: check static EXPORT_SYMBOL* by script instead of modpost") lead to build systems failing in a random way. 2. Introduction: To test 0ba7324b44282870af740a5a121add62c7f5f730 ("printk: Wait for the global console lock when the system is going down"), I do following both on my Dell PowerEdge R720 server (128G memory, 16 core Intel(R) Xeon(R) CPU E5-2660) and my Thinkpad T14 amd laptop (40G memory, 8 core AMD Ryzen 7 PRO 4750U) #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/paulmck/linux-rcu.git #cd linux-rcu #git checkout remotes/origin/pmladek.2022.06.15a #./tools/testing/selftests/rcutorture/bin/torture.sh 3. The Testing 3.1 On Dell PowerEdge R720 server, I performed torture.sh for 6 rounds (each cost 12 hours to finish) five of them report build failure, for example: root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# find . -name Make.out|xargs grep Error ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[2]: *** [scripts/Makefile.build:249: security/selinux/ss/services.o] Error 255 ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[1]: *** [scripts/Makefile.build:466: security/selinux] Error 2 ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make: *** [Makefile:1839: security] Error 2 one of them passed all the tests: Started at Mon Jun 20 06:09:33 CST 2022, ended at Mon Jun 20 18:34:30 CST 2022, duration 12:24:57. Summary: Successes: 49 Failures: 0. Size before compressing 60 files: 66G Mon Jun 20 18:34:30 CST 2022 Waiting for batch 1 of 32 compressions Mon Jun 20 18:34:30 CST 2022 Waiting for final batch 2 of 28 compressions Mon Jun 20 19:12:03 CST 2022 Size after compressing 60 files: 15G Mon Jun 20 19:46:42 CST 2022 3.2 On Thinkpad T14 amd laptop, I performed torture.sh for 2 rounds (each cost 13 hours to finish) one of them report build failure: zzy@zzy-ThinkPad-T14-Gen-1:~/Program/linux-rcu/tools/testing/selftests/rcutorture/res$ find . -name Make.out|xargs grep Error ./2022.06.17-22.45.37-torture/results-refscale-refcnt-kasan/NOPREEMPT/Make.out:make[1]: *** [scripts/Makefile.build:249: arch/x86/boot/pm.o] Error 255 ./2022.06.17-22.45.37-torture/results-refscale-refcnt-kasan/NOPREEMPT/Make.out:make: *** [arch/x86/Makefile:265: bzImage] Error 2 another passed all the tests: 4. Guess what caused the build error On Dell PowerEdge R720: root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# find . -name Make.out|xargs grep Error ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[2]: *** [scripts/Makefile.build:249: security/selinux/ss/services.o] Error 255 root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# emacs ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out I search services.o in my emacs editor, the only clue is: ./scripts/check-local-export security/selinux/ss/services.o make[2]: *** [scripts/Makefile.build:249: security/selinux/ss/services.o] Error 255 make[2]: *** Deleting file 'security/selinux/ss/services.o' make[2]: *** Waiting for unfinished jobs.... 5. Try to trace I have tried to trace the root cause of the problem, by modifying scripts/check-local-export: diff --git a/scripts/check-local-export b/scripts/check-local-export index da745e2743b7..a66ff333f22b 100755 --- a/scripts/check-local-export +++ b/scripts/check-local-export @@ -12,7 +12,7 @@ declare -A symbol_types declare -a export_symbols exit_code=0 - +cp ${1} /mnt/tmp/ while read value type name do # Skip the line if the number of fields is less than 3. @@ -48,7 +48,7 @@ do # # Then, the following line will be really simple: # done < <(${NM} --quiet ${1}) -done < <(${NM} ${1} 2>/dev/null || { echo "${0}: ${NM} failed" >&2; false; } ) +done < <(${NM} ${1} || { echo "${0}: ${NM} failed" >>/mnt/tmp/log-check ; false; } ) # Catch error in the process substitution wait $! @@ -57,7 +57,7 @@ for name in "${export_symbols[@]}" do # nm(3) says "If lowercase, the symbol is usually local" if [[ ${symbol_types[$name]} =~ [a-z] ]]; then - echo "$@: error: local symbol '${name}' was exported" >&2 + echo "$@: error: local symbol '${name}' was exported" >> /mnt/tmp/log-check exit_code=1 fi done Then I invoke torture.sh again, but after two rounds of test, the bug did not appear, so I can't trace anything. 6. Conclusion Sorry to bring you so much trouble, I am very interested in this. Could you please tell me what to do next? Thanks Zhouyi On Thu, Jun 16, 2022 at 4:28 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > Thank Paul and Petr for your efforts! > > On Thu, Jun 16, 2022 at 3:09 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Wed, Jun 15, 2022 at 06:52:20AM -0700, Paul E. McKenney wrote: > > > On Wed, Jun 15, 2022 at 08:36:54PM +0800, Zhouyi Zhou wrote: > > > > On Wed, Jun 15, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > > > > > > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > > > > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > > > > > > false positve because of kthread printing. > > > > > > > > > > > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > > > > > > have added functions to prefer direct printing activated for the primary > > > > > > import messages. > > > > > > > > > > > > Meanwhile adding printk_prefer_direct_enter/exit to > > > > > > torture_cleanup_begin/end still can't eliminate the false positives > > > > > > for a hundred percent. > > > > > > > > > > > > This patch add pr_flush to torture_cleanup_end because the latter > > > > > > is sleepable. > > > > > > > > > > > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > > > Just the one Signed-off-by suffices. ;-) > > > > Thank Paul for your guidance ;-) > > > > > > > > > > > --- > > > > > > kernel/torture.c | 1 + > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > diff --git a/kernel/torture.c b/kernel/torture.c > > > > > > index 789aeb0e1159..b51e42463a67 100644 > > > > > > --- a/kernel/torture.c > > > > > > +++ b/kernel/torture.c > > > > > > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > > > > > > mutex_lock(&fullstop_mutex); > > > > > > torture_type = NULL; > > > > > > mutex_unlock(&fullstop_mutex); > > > > > > + pr_flush(1000, true); > > > > > > > > > > But in this case, the kernel keeps on running, correct? So won't the > > > > > output appear over time? All of printk()'s kthreads are still running > > > > > after all. > > > > I am doing research on linux-next following your guidance on my Dell > > > > PowerEdge R720 server: > > > > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/ > > > > #cd linux-next > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > after about 14 hours, the test result show that there have been about > > > > 17 failures, I examine the corresponding console.log, and find the > > > > failures are caused by following reason: > > > > before the message output by > > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > > SUCCESS") has reached serial console, torture_shutdown invoked > > > > kernel_power_off(), > > > > > > Ah, got it! And good show tracking this down, by the way. > > > > > > Does this workaround commit on -rcu do the trick for you? > > > > > > 704ae8dcda84 ("torture: Flush printk() buffers before powering off") > > > > > > The printk() guys are looking at fixes within the printk() system > > > as well: > > > > > > https://lore.kernel.org/all/87tu8pgcj0.fsf@jogness.linutronix.de/ > > > https://lore.kernel.org/all/YqdSw%2FfJvnkRbjvc@alley/ > > > > And Petr has posted a more official fix, which I have queued on -rcu > > for testing at branch pmladek.2022.06.15a. > I am very happy to see the official fix, and start test it on my Dell > server by invoking > ./tools/testing/selftests/rcutorture/bin/torture.sh > > It will take 12 hours for the above script to finish if the other guy > who shares the server with me doesn't use it heavily ;-) > > Thanks > Sincerely yours > Zhouyi > > > > Thanx, Paul > > > > > > so the message "End of test: SUCCESS" has not chance of reaching > > > > console.log of host machine. > > > > > > > > Then I add printk_prefer_direct_enter/exit to > > > > torture_cleanup_begin/end, again I invoke > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > after about 14 hours, the test result show that there have been about > > > > 5 failures, > > > > > > > > Then I add a debug statement after > > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > > SUCCESS") which sends the debug string ("hello world") to > > > > uart_console_write directly, again I invoke > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > There is "hello world" in console.log but no "End of test" strings. > > > > > > > > So I guess even with printk_prefer_direct_enter/exit to > > > > torture_cleanup_begin/end, our main shutdown thread has failed in > > > > competition for console lock (printk.c): > > > > 2401 if (console_trylock_spinning()) > > > > 2402 console_unlock(); > > > > > > > > Then I add pr_flush to torture_cleanup_end > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > after 14 hours > > > > all 49 tests are success > > > > > > > > > > Or do you have a use case that requires that the output appear sooner > > > > > than it does? If so, please let me know what that is. > > > > Thank Paul for your patience, > > > > ./tools/testing/selftests/rcutorture/bin/torture.sh is the only tool > > > > that I use. > > > > > > > > I am very interested in this topic, and very glad to do further > > > > experiments under your guidance ;-) > > > > > > Or are you already running with that commit or one of those patches? > > > They are working reliably for me (so far, anyway), but maybe more is > > > needed in some cases. > > > > > > Thanx, Paul > > > > > > > Sincerely yours > > > > Thanks > > > > Zhouyi > > > > > > > > > > Thanx, Paul > > > > > > > > > > > } > > > > > > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > > > > > > > > > > > -- > > > > > > 2.25.1 > > > > > >
On Tue, Jun 21, 2022 at 08:22:15AM +0800, Zhouyi Zhou wrote: > [add Masahiro to CC list] > > 1. Abstract: > During the test of 0ba7324b44282870af740a5a121add62c7f5f730 ("printk: > Wait for the global console lock when the system is going down"), I > found > 0ba7324b44282870af740a5a121add62c7f5f730 itself is OK. But I guess > 31cb50b5590fe911077b8463ad01144fac8fa4f3 ("kbuild: check static > EXPORT_SYMBOL* by script instead of modpost") > lead to build systems failing in a random way. > > 2. Introduction: > To test 0ba7324b44282870af740a5a121add62c7f5f730 ("printk: Wait for > the global console lock when the system is going down"), I do > following both on > my Dell PowerEdge R720 server (128G memory, 16 core Intel(R) Xeon(R) > CPU E5-2660) and my Thinkpad T14 amd laptop (40G memory, 8 core AMD > Ryzen 7 PRO 4750U) > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/paulmck/linux-rcu.git > #cd linux-rcu > #git checkout remotes/origin/pmladek.2022.06.15a > #./tools/testing/selftests/rcutorture/bin/torture.sh > > 3. The Testing > 3.1 > On Dell PowerEdge R720 server, I performed torture.sh for 6 rounds > (each cost 12 hours to finish) > five of them report build failure, for example: > root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# > find . -name Make.out|xargs grep Error > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[2]: > *** [scripts/Makefile.build:249: security/selinux/ss/services.o] Error > 255 > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[1]: > *** [scripts/Makefile.build:466: security/selinux] Error 2 > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make: > *** [Makefile:1839: security] Error 2 > one of them passed all the tests: > Started at Mon Jun 20 06:09:33 CST 2022, ended at Mon Jun 20 18:34:30 > CST 2022, duration 12:24:57. > Summary: Successes: 49 Failures: 0. > Size before compressing 60 files: 66G Mon Jun 20 18:34:30 CST 2022 > Waiting for batch 1 of 32 compressions Mon Jun 20 18:34:30 CST 2022 > Waiting for final batch 2 of 28 compressions Mon Jun 20 19:12:03 CST 2022 > Size after compressing 60 files: 15G Mon Jun 20 19:46:42 CST 2022 > 3.2 > On Thinkpad T14 amd laptop, I performed torture.sh for 2 rounds (each > cost 13 hours to finish) > one of them report build failure: > zzy@zzy-ThinkPad-T14-Gen-1:~/Program/linux-rcu/tools/testing/selftests/rcutorture/res$ > find . -name Make.out|xargs grep Error > ./2022.06.17-22.45.37-torture/results-refscale-refcnt-kasan/NOPREEMPT/Make.out:make[1]: > *** [scripts/Makefile.build:249: arch/x86/boot/pm.o] Error 255 > ./2022.06.17-22.45.37-torture/results-refscale-refcnt-kasan/NOPREEMPT/Make.out:make: > *** [arch/x86/Makefile:265: bzImage] Error 2 > another passed all the tests: > > 4. Guess what caused the build error > On Dell PowerEdge R720: > root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# > find . -name Make.out|xargs grep Error > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[2]: > *** [scripts/Makefile.build:249: security/selinux/ss/services.o] Error > 255 > root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# > emacs ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out > I search services.o in my emacs editor, the only clue is: > ./scripts/check-local-export security/selinux/ss/services.o > make[2]: *** [scripts/Makefile.build:249: > security/selinux/ss/services.o] Error 255 > make[2]: *** Deleting file 'security/selinux/ss/services.o' > make[2]: *** Waiting for unfinished jobs.... > > 5. Try to trace > I have tried to trace the root cause of the problem, by modifying > scripts/check-local-export: > diff --git a/scripts/check-local-export b/scripts/check-local-export > index da745e2743b7..a66ff333f22b 100755 > --- a/scripts/check-local-export > +++ b/scripts/check-local-export > @@ -12,7 +12,7 @@ declare -A symbol_types > declare -a export_symbols > > exit_code=0 > - > +cp ${1} /mnt/tmp/ > while read value type name > do > # Skip the line if the number of fields is less than 3. > @@ -48,7 +48,7 @@ do > # > # Then, the following line will be really simple: > # done < <(${NM} --quiet ${1}) > -done < <(${NM} ${1} 2>/dev/null || { echo "${0}: ${NM} failed" >&2; false; } ) > +done < <(${NM} ${1} || { echo "${0}: ${NM} failed" > >>/mnt/tmp/log-check ; false; } ) > > # Catch error in the process substitution > wait $! > @@ -57,7 +57,7 @@ for name in "${export_symbols[@]}" > do > # nm(3) says "If lowercase, the symbol is usually local" > if [[ ${symbol_types[$name]} =~ [a-z] ]]; then > - echo "$@: error: local symbol '${name}' was exported" >&2 > + echo "$@: error: local symbol '${name}' was exported" > >> /mnt/tmp/log-check > exit_code=1 > fi > done > > Then I invoke torture.sh again, but after two rounds of test, the bug > did not appear, so I can't trace anything. > > 6. Conclusion > Sorry to bring you so much trouble, I am very interested in this. > Could you please tell me what to do next? I recently had a strange build error that was caused by a different extraneous file. I deleted the file and no problem since. I suspect that the file was lying around because I had interrupted a previous build. Though such files are -supposed- to be deleted when the build is interrupted. Next time you get an similar strange build error, try "git status" to see if it maybe doesn't belong. Beyond that point, I don't count myself as an expert on the Linux kernel's build system. Thanx, Paul > Thanks > Zhouyi > > On Thu, Jun 16, 2022 at 4:28 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > Thank Paul and Petr for your efforts! > > > > On Thu, Jun 16, 2022 at 3:09 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Wed, Jun 15, 2022 at 06:52:20AM -0700, Paul E. McKenney wrote: > > > > On Wed, Jun 15, 2022 at 08:36:54PM +0800, Zhouyi Zhou wrote: > > > > > On Wed, Jun 15, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > > > On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > > > > > > > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > > > > > > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > > > > > > > false positve because of kthread printing. > > > > > > > > > > > > > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > > > > > > > have added functions to prefer direct printing activated for the primary > > > > > > > import messages. > > > > > > > > > > > > > > Meanwhile adding printk_prefer_direct_enter/exit to > > > > > > > torture_cleanup_begin/end still can't eliminate the false positives > > > > > > > for a hundred percent. > > > > > > > > > > > > > > This patch add pr_flush to torture_cleanup_end because the latter > > > > > > > is sleepable. > > > > > > > > > > > > > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > > > > > Just the one Signed-off-by suffices. ;-) > > > > > Thank Paul for your guidance ;-) > > > > > > > > > > > > > --- > > > > > > > kernel/torture.c | 1 + > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > diff --git a/kernel/torture.c b/kernel/torture.c > > > > > > > index 789aeb0e1159..b51e42463a67 100644 > > > > > > > --- a/kernel/torture.c > > > > > > > +++ b/kernel/torture.c > > > > > > > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > > > > > > > mutex_lock(&fullstop_mutex); > > > > > > > torture_type = NULL; > > > > > > > mutex_unlock(&fullstop_mutex); > > > > > > > + pr_flush(1000, true); > > > > > > > > > > > > But in this case, the kernel keeps on running, correct? So won't the > > > > > > output appear over time? All of printk()'s kthreads are still running > > > > > > after all. > > > > > I am doing research on linux-next following your guidance on my Dell > > > > > PowerEdge R720 server: > > > > > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/ > > > > > #cd linux-next > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > after about 14 hours, the test result show that there have been about > > > > > 17 failures, I examine the corresponding console.log, and find the > > > > > failures are caused by following reason: > > > > > before the message output by > > > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > > > SUCCESS") has reached serial console, torture_shutdown invoked > > > > > kernel_power_off(), > > > > > > > > Ah, got it! And good show tracking this down, by the way. > > > > > > > > Does this workaround commit on -rcu do the trick for you? > > > > > > > > 704ae8dcda84 ("torture: Flush printk() buffers before powering off") > > > > > > > > The printk() guys are looking at fixes within the printk() system > > > > as well: > > > > > > > > https://lore.kernel.org/all/87tu8pgcj0.fsf@jogness.linutronix.de/ > > > > https://lore.kernel.org/all/YqdSw%2FfJvnkRbjvc@alley/ > > > > > > And Petr has posted a more official fix, which I have queued on -rcu > > > for testing at branch pmladek.2022.06.15a. > > I am very happy to see the official fix, and start test it on my Dell > > server by invoking > > ./tools/testing/selftests/rcutorture/bin/torture.sh > > > > It will take 12 hours for the above script to finish if the other guy > > who shares the server with me doesn't use it heavily ;-) > > > > Thanks > > Sincerely yours > > Zhouyi > > > > > > Thanx, Paul > > > > > > > > so the message "End of test: SUCCESS" has not chance of reaching > > > > > console.log of host machine. > > > > > > > > > > Then I add printk_prefer_direct_enter/exit to > > > > > torture_cleanup_begin/end, again I invoke > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > after about 14 hours, the test result show that there have been about > > > > > 5 failures, > > > > > > > > > > Then I add a debug statement after > > > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > > > SUCCESS") which sends the debug string ("hello world") to > > > > > uart_console_write directly, again I invoke > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > There is "hello world" in console.log but no "End of test" strings. > > > > > > > > > > So I guess even with printk_prefer_direct_enter/exit to > > > > > torture_cleanup_begin/end, our main shutdown thread has failed in > > > > > competition for console lock (printk.c): > > > > > 2401 if (console_trylock_spinning()) > > > > > 2402 console_unlock(); > > > > > > > > > > Then I add pr_flush to torture_cleanup_end > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > after 14 hours > > > > > all 49 tests are success > > > > > > > > > > > > Or do you have a use case that requires that the output appear sooner > > > > > > than it does? If so, please let me know what that is. > > > > > Thank Paul for your patience, > > > > > ./tools/testing/selftests/rcutorture/bin/torture.sh is the only tool > > > > > that I use. > > > > > > > > > > I am very interested in this topic, and very glad to do further > > > > > experiments under your guidance ;-) > > > > > > > > Or are you already running with that commit or one of those patches? > > > > They are working reliably for me (so far, anyway), but maybe more is > > > > needed in some cases. > > > > > > > > Thanx, Paul > > > > > > > > > Sincerely yours > > > > > Thanks > > > > > Zhouyi > > > > > > > > > > > > Thanx, Paul > > > > > > > > > > > > > } > > > > > > > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > > > > > > > > > > > > > -- > > > > > > > 2.25.1 > > > > > > >
On Tue, Jun 21, 2022 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Tue, Jun 21, 2022 at 08:22:15AM +0800, Zhouyi Zhou wrote: > > [add Masahiro to CC list] > > > > 1. Abstract: > > During the test of 0ba7324b44282870af740a5a121add62c7f5f730 ("printk: > > Wait for the global console lock when the system is going down"), I > > found > > 0ba7324b44282870af740a5a121add62c7f5f730 itself is OK. But I guess > > 31cb50b5590fe911077b8463ad01144fac8fa4f3 ("kbuild: check static > > EXPORT_SYMBOL* by script instead of modpost") > > lead to build systems failing in a random way. > > > > 2. Introduction: > > To test 0ba7324b44282870af740a5a121add62c7f5f730 ("printk: Wait for > > the global console lock when the system is going down"), I do > > following both on > > my Dell PowerEdge R720 server (128G memory, 16 core Intel(R) Xeon(R) > > CPU E5-2660) and my Thinkpad T14 amd laptop (40G memory, 8 core AMD > > Ryzen 7 PRO 4750U) > > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/paulmck/linux-rcu.git > > #cd linux-rcu > > #git checkout remotes/origin/pmladek.2022.06.15a > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > 3. The Testing > > 3.1 > > On Dell PowerEdge R720 server, I performed torture.sh for 6 rounds > > (each cost 12 hours to finish) > > five of them report build failure, for example: > > root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# > > find . -name Make.out|xargs grep Error > > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[2]: > > *** [scripts/Makefile.build:249: security/selinux/ss/services.o] Error > > 255 > > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[1]: > > *** [scripts/Makefile.build:466: security/selinux] Error 2 > > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make: > > *** [Makefile:1839: security] Error 2 > > one of them passed all the tests: > > Started at Mon Jun 20 06:09:33 CST 2022, ended at Mon Jun 20 18:34:30 > > CST 2022, duration 12:24:57. > > Summary: Successes: 49 Failures: 0. > > Size before compressing 60 files: 66G Mon Jun 20 18:34:30 CST 2022 > > Waiting for batch 1 of 32 compressions Mon Jun 20 18:34:30 CST 2022 > > Waiting for final batch 2 of 28 compressions Mon Jun 20 19:12:03 CST 2022 > > Size after compressing 60 files: 15G Mon Jun 20 19:46:42 CST 2022 > > 3.2 > > On Thinkpad T14 amd laptop, I performed torture.sh for 2 rounds (each > > cost 13 hours to finish) > > one of them report build failure: > > zzy@zzy-ThinkPad-T14-Gen-1:~/Program/linux-rcu/tools/testing/selftests/rcutorture/res$ > > find . -name Make.out|xargs grep Error > > ./2022.06.17-22.45.37-torture/results-refscale-refcnt-kasan/NOPREEMPT/Make.out:make[1]: > > *** [scripts/Makefile.build:249: arch/x86/boot/pm.o] Error 255 > > ./2022.06.17-22.45.37-torture/results-refscale-refcnt-kasan/NOPREEMPT/Make.out:make: > > *** [arch/x86/Makefile:265: bzImage] Error 2 > > another passed all the tests: > > > > 4. Guess what caused the build error > > On Dell PowerEdge R720: > > root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# > > find . -name Make.out|xargs grep Error > > ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out:make[2]: > > *** [scripts/Makefile.build:249: security/selinux/ss/services.o] Error > > 255 > > root@42d29c5cef4b:/mnt/linux-rcu/tools/testing/selftests/rcutorture/res# > > emacs ./2022.06.19-09.29.42-torture/results-rcutorture/TREE07/Make.out > > I search services.o in my emacs editor, the only clue is: > > ./scripts/check-local-export security/selinux/ss/services.o > > make[2]: *** [scripts/Makefile.build:249: > > security/selinux/ss/services.o] Error 255 > > make[2]: *** Deleting file 'security/selinux/ss/services.o' > > make[2]: *** Waiting for unfinished jobs.... > > > > 5. Try to trace > > I have tried to trace the root cause of the problem, by modifying > > scripts/check-local-export: > > diff --git a/scripts/check-local-export b/scripts/check-local-export > > index da745e2743b7..a66ff333f22b 100755 > > --- a/scripts/check-local-export > > +++ b/scripts/check-local-export > > @@ -12,7 +12,7 @@ declare -A symbol_types > > declare -a export_symbols > > > > exit_code=0 > > - > > +cp ${1} /mnt/tmp/ > > while read value type name > > do > > # Skip the line if the number of fields is less than 3. > > @@ -48,7 +48,7 @@ do > > # > > # Then, the following line will be really simple: > > # done < <(${NM} --quiet ${1}) > > -done < <(${NM} ${1} 2>/dev/null || { echo "${0}: ${NM} failed" >&2; false; } ) > > +done < <(${NM} ${1} || { echo "${0}: ${NM} failed" > > >>/mnt/tmp/log-check ; false; } ) > > > > # Catch error in the process substitution > > wait $! > > @@ -57,7 +57,7 @@ for name in "${export_symbols[@]}" > > do > > # nm(3) says "If lowercase, the symbol is usually local" > > if [[ ${symbol_types[$name]} =~ [a-z] ]]; then > > - echo "$@: error: local symbol '${name}' was exported" >&2 > > + echo "$@: error: local symbol '${name}' was exported" > > >> /mnt/tmp/log-check > > exit_code=1 > > fi > > done > > > > Then I invoke torture.sh again, but after two rounds of test, the bug > > did not appear, so I can't trace anything. > > > > 6. Conclusion > > Sorry to bring you so much trouble, I am very interested in this. > > Could you please tell me what to do next? > > I recently had a strange build error that was caused by a different > extraneous file. I deleted the file and no problem since. I suspect > that the file was lying around because I had interrupted a previous > build. Though such files are -supposed- to be deleted when the build > is interrupted. > > Next time you get an similar strange build error, try "git status" > to see if it maybe doesn't belong. Thank Paul for your guidance and encouragement! I would like to continue the research. > > Beyond that point, I don't count myself as an expert on the Linux > kernel's build system. > > Thanx, Paul > > > Thanks > > Zhouyi I forgot to actually add Masahiro to the cc list in my previous email. I added him in this email. Thanks Zhouyi > > > > On Thu, Jun 16, 2022 at 4:28 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > > > Thank Paul and Petr for your efforts! > > > > > > On Thu, Jun 16, 2022 at 3:09 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Wed, Jun 15, 2022 at 06:52:20AM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jun 15, 2022 at 08:36:54PM +0800, Zhouyi Zhou wrote: > > > > > > On Wed, Jun 15, 2022 at 1:15 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > > > > > On Wed, Jun 15, 2022 at 08:37:33AM +0800, zhouzhouyi@gmail.com wrote: > > > > > > > > From: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > > > > > > > > > currently tools/testing/selftests/rcutorture/bin/torture.sh reports > > > > > > > > false positve because of kthread printing. > > > > > > > > > > > > > > > > Commit "printk: add functions to prefer direct printing" (2bb2b7b57f812) > > > > > > > > have added functions to prefer direct printing activated for the primary > > > > > > > > import messages. > > > > > > > > > > > > > > > > Meanwhile adding printk_prefer_direct_enter/exit to > > > > > > > > torture_cleanup_begin/end still can't eliminate the false positives > > > > > > > > for a hundred percent. > > > > > > > > > > > > > > > > This patch add pr_flush to torture_cleanup_end because the latter > > > > > > > > is sleepable. > > > > > > > > > > > > > > > > Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > > > > > > > > Just the one Signed-off-by suffices. ;-) > > > > > > Thank Paul for your guidance ;-) > > > > > > > > > > > > > > > --- > > > > > > > > kernel/torture.c | 1 + > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > diff --git a/kernel/torture.c b/kernel/torture.c > > > > > > > > index 789aeb0e1159..b51e42463a67 100644 > > > > > > > > --- a/kernel/torture.c > > > > > > > > +++ b/kernel/torture.c > > > > > > > > @@ -878,6 +878,7 @@ void torture_cleanup_end(void) > > > > > > > > mutex_lock(&fullstop_mutex); > > > > > > > > torture_type = NULL; > > > > > > > > mutex_unlock(&fullstop_mutex); > > > > > > > > + pr_flush(1000, true); > > > > > > > > > > > > > > But in this case, the kernel keeps on running, correct? So won't the > > > > > > > output appear over time? All of printk()'s kthreads are still running > > > > > > > after all. > > > > > > I am doing research on linux-next following your guidance on my Dell > > > > > > PowerEdge R720 server: > > > > > > #git clone https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/ > > > > > > #cd linux-next > > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > > after about 14 hours, the test result show that there have been about > > > > > > 17 failures, I examine the corresponding console.log, and find the > > > > > > failures are caused by following reason: > > > > > > before the message output by > > > > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > > > > SUCCESS") has reached serial console, torture_shutdown invoked > > > > > > kernel_power_off(), > > > > > > > > > > Ah, got it! And good show tracking this down, by the way. > > > > > > > > > > Does this workaround commit on -rcu do the trick for you? > > > > > > > > > > 704ae8dcda84 ("torture: Flush printk() buffers before powering off") > > > > > > > > > > The printk() guys are looking at fixes within the printk() system > > > > > as well: > > > > > > > > > > https://lore.kernel.org/all/87tu8pgcj0.fsf@jogness.linutronix.de/ > > > > > https://lore.kernel.org/all/YqdSw%2FfJvnkRbjvc@alley/ > > > > > > > > And Petr has posted a more official fix, which I have queued on -rcu > > > > for testing at branch pmladek.2022.06.15a. > > > I am very happy to see the official fix, and start test it on my Dell > > > server by invoking > > > ./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > > It will take 12 hours for the above script to finish if the other guy > > > who shares the server with me doesn't use it heavily ;-) > > > > > > Thanks > > > Sincerely yours > > > Zhouyi > > > > > > > > Thanx, Paul > > > > > > > > > > so the message "End of test: SUCCESS" has not chance of reaching > > > > > > console.log of host machine. > > > > > > > > > > > > Then I add printk_prefer_direct_enter/exit to > > > > > > torture_cleanup_begin/end, again I invoke > > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > > after about 14 hours, the test result show that there have been about > > > > > > 5 failures, > > > > > > > > > > > > Then I add a debug statement after > > > > > > rcu/lock/scf_torture_print_module_parms(cur_ops, "End of test: > > > > > > SUCCESS") which sends the debug string ("hello world") to > > > > > > uart_console_write directly, again I invoke > > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > > There is "hello world" in console.log but no "End of test" strings. > > > > > > > > > > > > So I guess even with printk_prefer_direct_enter/exit to > > > > > > torture_cleanup_begin/end, our main shutdown thread has failed in > > > > > > competition for console lock (printk.c): > > > > > > 2401 if (console_trylock_spinning()) > > > > > > 2402 console_unlock(); > > > > > > > > > > > > Then I add pr_flush to torture_cleanup_end > > > > > > #./tools/testing/selftests/rcutorture/bin/torture.sh > > > > > > after 14 hours > > > > > > all 49 tests are success > > > > > > > > > > > > > > Or do you have a use case that requires that the output appear sooner > > > > > > > than it does? If so, please let me know what that is. > > > > > > Thank Paul for your patience, > > > > > > ./tools/testing/selftests/rcutorture/bin/torture.sh is the only tool > > > > > > that I use. > > > > > > > > > > > > I am very interested in this topic, and very glad to do further > > > > > > experiments under your guidance ;-) > > > > > > > > > > Or are you already running with that commit or one of those patches? > > > > > They are working reliably for me (so far, anyway), but maybe more is > > > > > needed in some cases. > > > > > > > > > > Thanx, Paul > > > > > > > > > > > Sincerely yours > > > > > > Thanks > > > > > > Zhouyi > > > > > > > > > > > > > > Thanx, Paul > > > > > > > > > > > > > > > } > > > > > > > > EXPORT_SYMBOL_GPL(torture_cleanup_end); > > > > > > > > > > > > > > > > -- > > > > > > > > 2.25.1 > > > > > > > >
diff --git a/kernel/torture.c b/kernel/torture.c index 789aeb0e1159..b51e42463a67 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -878,6 +878,7 @@ void torture_cleanup_end(void) mutex_lock(&fullstop_mutex); torture_type = NULL; mutex_unlock(&fullstop_mutex); + pr_flush(1000, true); } EXPORT_SYMBOL_GPL(torture_cleanup_end);