Message ID | 20221216065121.30181-1-wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fstests: report: always save the dmesg as system-err if KEEP_DMESG is set | expand |
On Fri, Dec 16, 2022 at 02:51:21PM +0800, Qu Wenruo wrote: > When KEEP_DMESG is set to "yes", we will always save the dmesg of any > test case (no matter if it passed or not) into "$seqnum.dmesg". > > But this KEEP_DMESG behavior doesn't affect xunit report. > > This patch will make xunit report to follow KEEP_DMESG setting. > Since error is checked by testcase.failure attribute, this new > <system-err> section should not cause the existing parsers to treat > passed cases as errors. > > KEEP_DMESG is only followed if all the following conditions are met: > > - KEEP_DMESG is set to yes Feel free to push back against the proliferation of config variables, but perhaps this ought to be REPORT_DMESG={always,never,auto} ? > - Using xunit reporting > xunit-quite won't save the dmesg for passed test cases. ^^^^^ "quiet"? > This extra saved dmesg would definitely boost the xml size, but if the > end user wants to save all the dmesg (for later verification), then I'd > say it's a unavoidable cost. > > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > common/report | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/common/report b/common/report > index 64f9c866..4a747f8d 100644 > --- a/common/report > +++ b/common/report > @@ -87,6 +87,19 @@ _xunit_make_testcase_report() > echo -e "\t<testcase classname=\"xfstests.$sect_name\" name=\"$test_name\" time=\"$test_time\">" >> $report > case $test_status in > "pass") > + # If we have KEEP_DMESG and want full output, also save the > + # dmesg into the passed result > + if [ "$KEEP_DMESG" == yes -a "$quiet" != "yes" ]; then > + local dmesg_file="${REPORT_DIR}/${test_name}.dmesg" > + if [ -f "$dmesg_file" ]; then > + echo -e "\t\t<system-err>" >> $report > + printf '<![CDATA[\n' >>$report > + cat "$dmesg_file" | tr -dc '[:print:][:space:]' | \ > + encode_xml >>$report > + printf ']]>\n' >>$report > + echo -e "\t\t</system-err>" >> $report Hmm. Does the junit xml schema[1] (that's what fstests implements, even if we call it xunit, and even though there's a separate xunit[2] format that is not the same!) actually allow us to have multiple <system-err> elements? For that matter, I look at things like this: if [ -n "$quiet" ]; then : elif [ -f "$dmesg_file" ]; then echo -e "\t\t<system-err>" >> $report printf '<![CDATA[\n' >>$report cat "$dmesg_file" | tr -dc '[:print:][:space:]' | encode_xml >>$report printf ']]>\n' >>$report echo -e "\t\t</system-err>" >> $report elif [ -s "$outbad_file" ]; then echo -e "\t\t<system-err>" >> $report printf '<![CDATA[\n' >>$report $diff "$out_src" "$outbad_file" | encode_xml >>$report printf ']]>\n' >>$report echo -e "\t\t</system-err>" >> $report fi and wonder why we prioritize recording dmesg output over .out.bad when we could do both? --D [1] https://raw.githubusercontent.com/windyroad/JUnit-Schema/master/JUnit.xsd [2] https://xunit.net/docs/format-xml-v2 > + fi > + fi > ;; > "notrun") > local notrun_file="${REPORT_DIR}/${test_name}.notrun" > -- > 2.38.0 >
On Mon, Dec 19, 2022 at 09:39:33AM -0800, Darrick J. Wong wrote: > On Fri, Dec 16, 2022 at 02:51:21PM +0800, Qu Wenruo wrote: > > When KEEP_DMESG is set to "yes", we will always save the dmesg of any > > test case (no matter if it passed or not) into "$seqnum.dmesg". > > > > But this KEEP_DMESG behavior doesn't affect xunit report. > > > > This patch will make xunit report to follow KEEP_DMESG setting. This may be dangerous; if the XML file is too large, the XML parser may end up rejecting the whole XML file because otherwlse a too-large XML file can trigger a denial of service attack[1]. (This is why I implemented "xunit-quiet".) [1] https://gitlab.com/gitlab-org/gitlab/-/issues/25357 So if you are running a large number of tests (e.g., "-g auto") it might very well that adding dmesg for all tests might very well end up bloating the XML file to the point where it will be unmangeable. For example, this is the size for my syslog file after running "-g auto" on the "xfs/quota" config: -rw-r----- 1 tytso primarygroup 10316684 Aug 25 10:35 ae/syslog The syslog file for all of the xfs configs are 9-10 megabytes each. If I combined the 12 xfs configs that we run into a single xunit JML file with the dmesg output, this would be *guaranteed* to blow out most XML parsers. Personally, I find that a better solution is to use the syslog daemon to save the dmesg output for all of the tests into a single file. I prefer this for three reasons: * The single file is more compressibls compared to having it broken out into separate $seqnum.dmesg files. * By keeping dmesg and other test artifacts separate from the xml file I can archive the xml file for a much larger period of time, (perhaps indefinitely) while allowing the much more volunumous test artifacts to be archived for a shorter time (say, 3-6 months). * When there are test isolation issues, it's not uncommon for a previous test to fail with some kind of global or cgroup-specific OOM-kill, or when I'm testing on bare metal with real hardware where hardware failures is a Thing, being able to look for unusual kernel messages before the start of a particular test can often be quite revealing. Cheers, - Ted
On 2022/12/20 02:56, Theodore Ts'o wrote: > On Mon, Dec 19, 2022 at 09:39:33AM -0800, Darrick J. Wong wrote: >> On Fri, Dec 16, 2022 at 02:51:21PM +0800, Qu Wenruo wrote: >>> When KEEP_DMESG is set to "yes", we will always save the dmesg of any >>> test case (no matter if it passed or not) into "$seqnum.dmesg". >>> >>> But this KEEP_DMESG behavior doesn't affect xunit report. >>> >>> This patch will make xunit report to follow KEEP_DMESG setting. > > This may be dangerous; if the XML file is too large, the XML parser > may end up rejecting the whole XML file because otherwlse a too-large > XML file can trigger a denial of service attack[1]. (This is why I > implemented "xunit-quiet".) I guess your concern is correct. Although during my auto run, single dmesg are not that large, the largest three are not MiB level yet: 108K btrfs/187.dmesg 80K btrfs/072.dmesg 64K generic/311.dmesg But the total dmesg sizes already go MiB levels, 2.6MiB. > > [1] https://gitlab.com/gitlab-org/gitlab/-/issues/25357 > > So if you are running a large number of tests (e.g., "-g auto") it > might very well that adding dmesg for all tests might very well end up > bloating the XML file to the point where it will be unmangeable. For > example, this is the size for my syslog file after running "-g auto" > on the "xfs/quota" config: > > -rw-r----- 1 tytso primarygroup 10316684 Aug 25 10:35 ae/syslog > > The syslog file for all of the xfs configs are 9-10 megabytes each. > If I combined the 12 xfs configs that we run into a single xunit JML > file with the dmesg output, this would be *guaranteed* to blow out > most XML parsers. > > Personally, I find that a better solution is to use the syslog daemon > to save the dmesg output for all of the tests into a single file. I > prefer this for three reasons: > > * The single file is more compressibls compared to having it broken > out into separate $seqnum.dmesg files. > * By keeping dmesg and other test artifacts separate from the xml > file I can archive the xml file for a much larger period of time, > (perhaps indefinitely) while allowing the much more volunumous > test artifacts to be archived for a shorter time (say, 3-6 months). > * When there are test isolation issues, it's not uncommon for a > previous test to fail with some kind of global or cgroup-specific > OOM-kill, or when I'm testing on bare metal with real hardware > where hardware failures is a Thing, being able to look for unusual > kernel messages before the start of a particular test can often be > quite revealing. On the other hand, separating dmesg from its test cases means extra parsing to bind the log section to certain test runs. In fact, I'm not even sure if possible to do that. E.g. on the same host with different kernel/fstests configs. One reason I want to keep the dmesg bound to each test case is, we had cases in the past that some btrfs specific error messages are not (and should not) caught by fstests, but can still be some clue of bugs. Thus if we have a proper paired dmesg with kernel/fstests configs and test case, we can determine how affected the bug is. Any better idea to relate them? Or is this really too niche? Thanks, Qu > > Cheers, > > - Ted
On 2022/12/20 01:39, Darrick J. Wong wrote: > On Fri, Dec 16, 2022 at 02:51:21PM +0800, Qu Wenruo wrote: >> When KEEP_DMESG is set to "yes", we will always save the dmesg of any >> test case (no matter if it passed or not) into "$seqnum.dmesg". >> >> But this KEEP_DMESG behavior doesn't affect xunit report. >> >> This patch will make xunit report to follow KEEP_DMESG setting. >> Since error is checked by testcase.failure attribute, this new >> <system-err> section should not cause the existing parsers to treat >> passed cases as errors. >> >> KEEP_DMESG is only followed if all the following conditions are met: >> >> - KEEP_DMESG is set to yes > > Feel free to push back against the proliferation of config variables, > but perhaps this ought to be REPORT_DMESG={always,never,auto} ? I guess this would cover a lot of work of xunit-quiet? > >> - Using xunit reporting >> xunit-quite won't save the dmesg for passed test cases. > > ^^^^^ "quiet"? > >> This extra saved dmesg would definitely boost the xml size, but if the >> end user wants to save all the dmesg (for later verification), then I'd >> say it's a unavoidable cost. >> >> Signed-off-by: Qu Wenruo <wqu@suse.com> >> --- >> common/report | 13 +++++++++++++ >> 1 file changed, 13 insertions(+) >> >> diff --git a/common/report b/common/report >> index 64f9c866..4a747f8d 100644 >> --- a/common/report >> +++ b/common/report >> @@ -87,6 +87,19 @@ _xunit_make_testcase_report() >> echo -e "\t<testcase classname=\"xfstests.$sect_name\" name=\"$test_name\" time=\"$test_time\">" >> $report >> case $test_status in >> "pass") >> + # If we have KEEP_DMESG and want full output, also save the >> + # dmesg into the passed result >> + if [ "$KEEP_DMESG" == yes -a "$quiet" != "yes" ]; then >> + local dmesg_file="${REPORT_DIR}/${test_name}.dmesg" >> + if [ -f "$dmesg_file" ]; then >> + echo -e "\t\t<system-err>" >> $report >> + printf '<![CDATA[\n' >>$report >> + cat "$dmesg_file" | tr -dc '[:print:][:space:]' | \ >> + encode_xml >>$report >> + printf ']]>\n' >>$report >> + echo -e "\t\t</system-err>" >> $report > > Hmm. Does the junit xml schema[1] (that's what fstests implements, even > if we call it xunit, and even though there's a separate xunit[2] format > that is not the same!) actually allow us to have multiple <system-err> > elements? > > For that matter, I look at things like this: > > if [ -n "$quiet" ]; then > : > elif [ -f "$dmesg_file" ]; then > echo -e "\t\t<system-err>" >> $report > printf '<![CDATA[\n' >>$report > cat "$dmesg_file" | tr -dc '[:print:][:space:]' | encode_xml >>$report > printf ']]>\n' >>$report > echo -e "\t\t</system-err>" >> $report > elif [ -s "$outbad_file" ]; then > echo -e "\t\t<system-err>" >> $report > printf '<![CDATA[\n' >>$report > $diff "$out_src" "$outbad_file" | encode_xml >>$report > printf ']]>\n' >>$report > echo -e "\t\t</system-err>" >> $report > fi > > and wonder why we prioritize recording dmesg output over .out.bad when > we could do both? Well, before you mentioning this, I didn't even notice we don't save out.bad if we have dmesg... And we need some standard on what should be put into system-out and what should go system-err. Personally speaking, out.bad should go system-out, while dmesg and full should go system-err. Another thing is, if we have multiple system-err elements, should we add some properties for the system-err elements to distinguish .full and .dmesg? Thanks, Qu > > --D > > [1] https://raw.githubusercontent.com/windyroad/JUnit-Schema/master/JUnit.xsd > [2] https://xunit.net/docs/format-xml-v2 > >> + fi >> + fi >> ;; >> "notrun") >> local notrun_file="${REPORT_DIR}/${test_name}.notrun" >> -- >> 2.38.0 >>
On Tue, Dec 20, 2022 at 06:45:27AM +0800, Qu Wenruo wrote: > > > On the other hand, separating dmesg from its test cases means extra parsing > to bind the log section to certain test runs. > > In fact, I'm not even sure if possible to do that. > E.g. on the same host with different kernel/fstests configs. Well, I just have my test runner framework drop a file which contains a single line[2] in the /etc/rsyslog.d directory, and then kill -HUP rsyslogd. When I'm done, I'll reconfigure rsyslog and HUP it again, and then tar up the results directory, and save that as part of my test artifacts for that particular test run. [2] *.*;auth,authpriv.none -/results/syslog (This could probably be "kern.*" instead, but occasionally it's useful to see what other kind of background noise might be running on the test machine that might be disturbing the test run.) The /results directory will contain a copy of the current kernel's build config file, the arguments to the xfstests run, the xfstests results directory with all of the NNN.full and NNN.out.bad files, and of course, the syslog file. I also have things like a copy of /proc/slabinfo before and after the run; and I have support for hook scripts that can enable/disable ftrace with various tracepoints and which will drop the ftrace logs in the /results directory. (I don't have support for drgn or BPF scripts yet, but one of these days...) Anyway, my test framework will compress the /results directory and save it away (in my case, in Google Cloud Store). For a full ext4 test run (twelve "-g auto" runs with different fs configs), the compressed test artifact tarball will be about 23 megabytes. In contrast, the xunut-quiet XML file will be about 80k compressed, and the test run summary which lands in my e-mail inbox is less than 4k. :-) - Ted P.S. This is a recent test run on btrfs using my test environment. I'm curious whether any of these test failures look familiar to you. (Some of them may very well be issues in how I set up the btrfs pool of block devices, which is a relatively new feature in gce-xfstests. What are you seeing when you run -g auto on btrfs using the default file system config?) TESTRUNID: tytso-20221212170803 KERNEL: kernel 6.1.0-xfstests #2 SMP PREEMPT_DYNAMIC Mon Dec 12 16:09:40 EST 2022 x86_64 CMDLINE: -c btrfs/4k -g auto CPUS: 2 MEM: 7680 btrfs/4k: 953 tests, 21 failures, 176 skipped, 12135 seconds Failures: btrfs/006 btrfs/012 btrfs/049 btrfs/100 btrfs/162 btrfs/163 btrfs/184 btrfs/192 btrfs/196 btrfs/197 btrfs/218 btrfs/219 btrfs/235 btrfs/254 btrfs/277 btrfs/291 generic/455 generic/457 Flaky: btrfs/028: 40% (2/5) generic/475: 60% (3/5) shared/298: 20% (1/5) FSTESTIMG: gce-xfstests/xfstests-amd64-202211301035 FSTESTPRJ: gce-xfstests FSTESTVER: blktests 068bd2a (Fri, 18 Nov 2022 08:38:35 +0900) FSTESTVER: fio fio-3.31 (Tue, 9 Aug 2022 14:41:25 -0600) FSTESTVER: fsverity v1.5 (Sun, 6 Feb 2022 10:59:13 -0800) FSTESTVER: ima-evm-utils v1.3.2 (Wed, 28 Oct 2020 13:18:08 -0400) FSTESTVER: nvme-cli v1.16 (Thu, 11 Nov 2021 13:09:06 -0800) FSTESTVER: quota v4.05-52-gf7e24ee (Tue, 1 Nov 2022 11:45:06 +0100) FSTESTVER: util-linux v2.38.1 (Thu, 4 Aug 2022 11:06:21 +0200) FSTESTVER: xfsprogs v6.0.0 (Mon, 14 Nov 2022 12:06:23 +0100) FSTESTVER: xfstests-bld 920c0e32 (Mon, 28 Nov 2022 00:37:48 -0500) FSTESTVER: xfstests v2022.11.27-8-g3c178050c (Wed, 30 Nov 2022 10:25:39 -0500) FSTESTVER: zz_build-distro bullseye FSTESTCFG: btrfs/4k FSTESTSET: -g auto FSTESTOPT: aex GCE ID: 1186623270044702885 - Ted
On 2022/12/20 07:49, Theodore Ts'o wrote: > On Tue, Dec 20, 2022 at 06:45:27AM +0800, Qu Wenruo wrote: >> >> >> On the other hand, separating dmesg from its test cases means extra parsing >> to bind the log section to certain test runs. >> >> In fact, I'm not even sure if possible to do that. >> E.g. on the same host with different kernel/fstests configs. > > Well, I just have my test runner framework drop a file which contains > a single line[2] in the /etc/rsyslog.d directory, and then kill -HUP > rsyslogd. When I'm done, I'll reconfigure rsyslog and HUP it again, > and then tar up the results directory, and save that as part of my > test artifacts for that particular test run. > > [2] *.*;auth,authpriv.none -/results/syslog > > (This could probably be "kern.*" instead, but occasionally it's useful > to see what other kind of background noise might be running on the > test machine that might be disturbing the test run.) > > The /results directory will contain a copy of the current kernel's > build config file, the arguments to the xfstests run, the xfstests > results directory with all of the NNN.full and NNN.out.bad files, and > of course, the syslog file. I also have things like a copy of > /proc/slabinfo before and after the run; and I have support for hook > scripts that can enable/disable ftrace with various tracepoints and > which will drop the ftrace logs in the /results directory. (I don't > have support for drgn or BPF scripts yet, but one of these days...) > > Anyway, my test framework will compress the /results directory and > save it away (in my case, in Google Cloud Store). For a full ext4 > test run (twelve "-g auto" runs with different fs configs), the > compressed test artifact tarball will be about 23 megabytes. In > contrast, the xunut-quiet XML file will be about 80k compressed, and > the test run summary which lands in my e-mail inbox is less than 4k. :-) > > - Ted > > P.S. This is a recent test run on btrfs using my test environment. > I'm curious whether any of these test failures look familiar to you. > (Some of them may very well be issues in how I set up the btrfs pool > of block devices, which is a relatively new feature in gce-xfstests. > What are you seeing when you run -g auto on btrfs using the default > file system config?) > > TESTRUNID: tytso-20221212170803 > KERNEL: kernel 6.1.0-xfstests #2 SMP PREEMPT_DYNAMIC Mon Dec 12 16:09:40 EST 2022 x86_64 > CMDLINE: -c btrfs/4k -g auto > CPUS: 2 > MEM: 7680 > > btrfs/4k: 953 tests, 21 failures, 176 skipped, 12135 seconds That's way more errors than what we see. Currently what we expect is, at most 3 from btrfs, zero from generic: - btrfs/011 If your SSD is too fast, that balance is done before canceling, it can caused output mismatch. Always reproducible here if using cache=unsafe for VM. - btrfs/219 (already in your failure) The upstream change is not merged, it should be removed from auto. - btrfs/249 (btrfs-progs bug) The fix for btrfs-progs is submitted by still needs some polish. > Failures: btrfs/006 btrfs/012 btrfs/049 btrfs/100 btrfs/162 btrfs/163 > btrfs/184 btrfs/192 btrfs/196 btrfs/197 btrfs/218 btrfs/219 btrfs/235 > btrfs/254 btrfs/277 btrfs/291 generic/455 generic/457 > Flaky: btrfs/028: 40% (2/5) generic/475: 60% (3/5) > shared/298: 20% (1/5) btrfs/012 is convert related, thus I'm wondering if some older btrfs-progs is involved in this case. For generic/455 and generic/457, we need the full out.bad and dmesg to determine. Anyway since our expectation for future fstests runs are really no errors, I'm pretty happy to help on any of the failed runs. > > FSTESTIMG: gce-xfstests/xfstests-amd64-202211301035 > FSTESTPRJ: gce-xfstests > FSTESTVER: blktests 068bd2a (Fri, 18 Nov 2022 08:38:35 +0900) > FSTESTVER: fio fio-3.31 (Tue, 9 Aug 2022 14:41:25 -0600) > FSTESTVER: fsverity v1.5 (Sun, 6 Feb 2022 10:59:13 -0800) > FSTESTVER: ima-evm-utils v1.3.2 (Wed, 28 Oct 2020 13:18:08 -0400) > FSTESTVER: nvme-cli v1.16 (Thu, 11 Nov 2021 13:09:06 -0800) > FSTESTVER: quota v4.05-52-gf7e24ee (Tue, 1 Nov 2022 11:45:06 +0100) > FSTESTVER: util-linux v2.38.1 (Thu, 4 Aug 2022 11:06:21 +0200) > FSTESTVER: xfsprogs v6.0.0 (Mon, 14 Nov 2022 12:06:23 +0100) > FSTESTVER: xfstests-bld 920c0e32 (Mon, 28 Nov 2022 00:37:48 -0500) > FSTESTVER: xfstests v2022.11.27-8-g3c178050c (Wed, 30 Nov 2022 10:25:39 -0500) > FSTESTVER: zz_build-distro bullseye Since the distro is Debian 11, I'm pretty sure the btrfs-progs is out-of-date. Mind to add btrfs-progs to the FSTESTVER reporting? And hopefully to also build a newer progs just like xfsprogs for testing? Thanks, Qu > FSTESTCFG: btrfs/4k > FSTESTSET: -g auto > FSTESTOPT: aex > GCE ID: 1186623270044702885 > > > - Ted
diff --git a/common/report b/common/report index 64f9c866..4a747f8d 100644 --- a/common/report +++ b/common/report @@ -87,6 +87,19 @@ _xunit_make_testcase_report() echo -e "\t<testcase classname=\"xfstests.$sect_name\" name=\"$test_name\" time=\"$test_time\">" >> $report case $test_status in "pass") + # If we have KEEP_DMESG and want full output, also save the + # dmesg into the passed result + if [ "$KEEP_DMESG" == yes -a "$quiet" != "yes" ]; then + local dmesg_file="${REPORT_DIR}/${test_name}.dmesg" + if [ -f "$dmesg_file" ]; then + echo -e "\t\t<system-err>" >> $report + printf '<![CDATA[\n' >>$report + cat "$dmesg_file" | tr -dc '[:print:][:space:]' | \ + encode_xml >>$report + printf ']]>\n' >>$report + echo -e "\t\t</system-err>" >> $report + fi + fi ;; "notrun") local notrun_file="${REPORT_DIR}/${test_name}.notrun"
When KEEP_DMESG is set to "yes", we will always save the dmesg of any test case (no matter if it passed or not) into "$seqnum.dmesg". But this KEEP_DMESG behavior doesn't affect xunit report. This patch will make xunit report to follow KEEP_DMESG setting. Since error is checked by testcase.failure attribute, this new <system-err> section should not cause the existing parsers to treat passed cases as errors. KEEP_DMESG is only followed if all the following conditions are met: - KEEP_DMESG is set to yes - Using xunit reporting xunit-quite won't save the dmesg for passed test cases. This extra saved dmesg would definitely boost the xml size, but if the end user wants to save all the dmesg (for later verification), then I'd say it's a unavoidable cost. Signed-off-by: Qu Wenruo <wqu@suse.com> --- common/report | 13 +++++++++++++ 1 file changed, 13 insertions(+)