diff mbox series

[RFC,3/3] trace2: add a schema validator for trace2 events

Message ID 7475c6220895d96cdc7d25d6edea70e2f978526b.1560295286.git.steadmon@google.com (mailing list archive)
State New, archived
Headers show
Series Add a JSON Schema for trace2 events | expand

Commit Message

Josh Steadmon June 11, 2019, 11:31 p.m. UTC
trace_schema_validator can be used to verify that trace2 event output
conforms to the expectations set by the API documentation and codified
in event_schema.json (or strict_schema.json). This allows us to build a
regression test to verify that trace2 output does not change
unexpectedly.

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
 t/trace_schema_validator/.gitignore           |  1 +
 t/trace_schema_validator/Makefile             | 10 +++
 .../trace_schema_validator.go                 | 74 +++++++++++++++++++
 3 files changed, 85 insertions(+)
 create mode 100644 t/trace_schema_validator/.gitignore
 create mode 100644 t/trace_schema_validator/Makefile
 create mode 100644 t/trace_schema_validator/trace_schema_validator.go

Comments

Ævar Arnfjörð Bjarmason June 12, 2019, 1:28 p.m. UTC | #1
On Wed, Jun 12 2019, Josh Steadmon wrote:

> trace_schema_validator can be used to verify that trace2 event output
> conforms to the expectations set by the API documentation and codified
> in event_schema.json (or strict_schema.json). This allows us to build a
> regression test to verify that trace2 output does not change
> unexpectedly.

Does this actually work for you? As seen in my code at
https://public-inbox.org/git/87zhnuwdkp.fsf@evledraar.gmail.com/ our
test suite emits various lines of JSON that aren't even validly encoded,
so I can't imagine we're passing any sort of proper parser validatior,
let alone a schema validator.

In terms of implementation I think it would make sense to have a *.sh
wrapper for this already, then we could test via prereqs if we have some
of the existing validators (seems there's a list at
https://json-schema.org/implementations.html) and e.g. run a dummy test
against some small list of git commands, and then you could also pass it
an env variable with "here's the trace file" so you could do:

    GIT_TRACE2_EVENT=/tmp/git.events prove <all testss> && VALIDATE_THIS=/tmp/git.events ./<that new test>.sh

And it would validate that file, if set.
Josh Steadmon June 12, 2019, 4:23 p.m. UTC | #2
On 2019.06.12 15:28, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Jun 12 2019, Josh Steadmon wrote:
> 
> > trace_schema_validator can be used to verify that trace2 event output
> > conforms to the expectations set by the API documentation and codified
> > in event_schema.json (or strict_schema.json). This allows us to build a
> > regression test to verify that trace2 output does not change
> > unexpectedly.
> 
> Does this actually work for you? As seen in my code at
> https://public-inbox.org/git/87zhnuwdkp.fsf@evledraar.gmail.com/ our
> test suite emits various lines of JSON that aren't even validly encoded,
> so I can't imagine we're passing any sort of proper parser validatior,
> let alone a schema validator.

Yes, it seems that gojsonschema (and its dependencies) are not very strict about
encoding. I also had an alternate Python implementation, and it failed to parse
lines that were not properly encoded. I just had that version print out a
warning with the number of failed decodings. I believe it was ~20 out of 1.7M
events.

> In terms of implementation I think it would make sense to have a *.sh
> wrapper for this already, then we could test via prereqs if we have some
> of the existing validators (seems there's a list at
> https://json-schema.org/implementations.html) and e.g. run a dummy test
> against some small list of git commands, and then you could also pass it
> an env variable with "here's the trace file" so you could do:
> 
>     GIT_TRACE2_EVENT=/tmp/git.events prove <all testss> && VALIDATE_THIS=/tmp/git.events ./<that new test>.sh
> 
> And it would validate that file, if set.

The problem with the existing validators is that they expect each file to be a
complete JSON entity, whereas the trace output is one object per line. You can
of course loop over the lines in a shell script, but in my testing this approach
took multiple hours on the full test suite trace output, vs. 15 minutes for the
implementation in this patch.
Jeff King June 12, 2019, 7:18 p.m. UTC | #3
On Wed, Jun 12, 2019 at 09:23:41AM -0700, Josh Steadmon wrote:

> The problem with the existing validators is that they expect each file to be a
> complete JSON entity, whereas the trace output is one object per line. You can
> of course loop over the lines in a shell script, but in my testing this approach
> took multiple hours on the full test suite trace output, vs. 15 minutes for the
> implementation in this patch.

It seems like it should be easy to turn a sequence of entities into a
single entity, with something like:

  echo '['
  sed 's/$/,/' <one-per-line
  echo ']'

You could even turn a sequence of files into a single entity (which
might be even faster to validate, since it would be one invocation for
the entire test suite) with something like:

  echo '{'
  for fn in $FILES; do
	echo "\"$fn\": "
	cat $fn
	echo ","
  done
  echo '}'

though I suspect the resulting error messages might not be as good.

Obviously neither of those is particularly robust if the individual JSON
is not well-formed. But then, if we are mostly interested in testing
whether it's well-formed and expect it to be in the normal case, that
might be a good optimization.

I also wouldn't be surprised if "jq" could do this in a more robust way.

-Peff
Josh Steadmon June 20, 2019, 6:15 p.m. UTC | #4
On 2019.06.12 15:18, Jeff King wrote:
> On Wed, Jun 12, 2019 at 09:23:41AM -0700, Josh Steadmon wrote:
> 
> > The problem with the existing validators is that they expect each file to be a
> > complete JSON entity, whereas the trace output is one object per line. You can
> > of course loop over the lines in a shell script, but in my testing this approach
> > took multiple hours on the full test suite trace output, vs. 15 minutes for the
> > implementation in this patch.
> 
> It seems like it should be easy to turn a sequence of entities into a
> single entity, with something like:
> 
>   echo '['
>   sed 's/$/,/' <one-per-line
>   echo ']'
> 
> You could even turn a sequence of files into a single entity (which
> might be even faster to validate, since it would be one invocation for
> the entire test suite) with something like:
> 
>   echo '{'
>   for fn in $FILES; do
> 	echo "\"$fn\": "
> 	cat $fn
> 	echo ","
>   done
>   echo '}'
> 
> though I suspect the resulting error messages might not be as good.
> 
> Obviously neither of those is particularly robust if the individual JSON
> is not well-formed. But then, if we are mostly interested in testing
> whether it's well-formed and expect it to be in the normal case, that
> might be a good optimization.

Yeah, as I noted in my reply to Ævar, ~20 of the trace lines generated by the
test suite are not properly encoded. So if we do something like:

  $ GIT_TRACE2_EVENT=$(pwd)/one-per-line make test
  $ (echo '[' ; sed 's/$/,/' < one-per-line ; echo ']') > list
  $ validate list

then most validators will only tell us that the file as a whole is malformed.
If we validate line-by-line, then we can just count how many malformed lines we
have and make sure it's within expectations.

Alternatively, we could just explicitly disable tracing on the tests that
generate the malformed traces.

> 
> I also wouldn't be surprised if "jq" could do this in a more robust way.

I'll go take a look at jq.

> -Peff
Jakub Narębski June 21, 2019, 11:53 a.m. UTC | #5
Josh Steadmon <steadmon@google.com> writes:
> On 2019.06.12 15:28, Ævar Arnfjörð Bjarmason wrote: 
>> On Wed, Jun 12 2019, Josh Steadmon wrote:
>> 
>>> trace_schema_validator can be used to verify that trace2 event output
>>> conforms to the expectations set by the API documentation and codified
>>> in event_schema.json (or strict_schema.json). This allows us to build a
>>> regression test to verify that trace2 output does not change
>>> unexpectedly.
>> 
>> Does this actually work for you? As seen in my code at
>> https://public-inbox.org/git/87zhnuwdkp.fsf@evledraar.gmail.com/ our
>> test suite emits various lines of JSON that aren't even validly encoded,
>> so I can't imagine we're passing any sort of proper parser validatior,
>> let alone a schema validator.
[...]
> The problem with the existing validators is that they expect each file to be a
> complete JSON entity, whereas the trace output is one object per line. You can
> of course loop over the lines in a shell script, but in my testing this approach
> took multiple hours on the full test suite trace output, vs. 15 minutes for the
> implementation in this patch.

Isn't this JSON-Lines (http://jsonlines.org/), also known as
Line-delimited JSON (LDJSON), newline-delimited JSON (NDJSON) format?

Best,
--
Jakub Narębski
Jeff Hostetler June 27, 2019, 1:57 p.m. UTC | #6
On 6/21/2019 7:53 AM, Jakub Narebski wrote:
> Josh Steadmon <steadmon@google.com> writes:
>> On 2019.06.12 15:28, Ævar Arnfjörð Bjarmason wrote:
>>> On Wed, Jun 12 2019, Josh Steadmon wrote:
>>>
>>>> trace_schema_validator can be used to verify that trace2 event output
>>>> conforms to the expectations set by the API documentation and codified
>>>> in event_schema.json (or strict_schema.json). This allows us to build a
>>>> regression test to verify that trace2 output does not change
>>>> unexpectedly.
>>>
>>> Does this actually work for you? As seen in my code at
>>> https://public-inbox.org/git/87zhnuwdkp.fsf@evledraar.gmail.com/ our
>>> test suite emits various lines of JSON that aren't even validly encoded,
>>> so I can't imagine we're passing any sort of proper parser validatior,
>>> let alone a schema validator.
> [...]
>> The problem with the existing validators is that they expect each file to be a
>> complete JSON entity, whereas the trace output is one object per line. You can
>> of course loop over the lines in a shell script, but in my testing this approach
>> took multiple hours on the full test suite trace output, vs. 15 minutes for the
>> implementation in this patch.
> 
> Isn't this JSON-Lines (http://jsonlines.org/), also known as
> Line-delimited JSON (LDJSON), newline-delimited JSON (NDJSON) format?

Yes. My Trace2 event format is a series of JSON objects, one per line.
I didn't know there was an official name for it.

Thanks
Jeff


> 
> Best,
> --
> Jakub Narębski
>
diff mbox series

Patch

diff --git a/t/trace_schema_validator/.gitignore b/t/trace_schema_validator/.gitignore
new file mode 100644
index 0000000000..c3f1e04e9e
--- /dev/null
+++ b/t/trace_schema_validator/.gitignore
@@ -0,0 +1 @@ 
+trace_schema_validator
diff --git a/t/trace_schema_validator/Makefile b/t/trace_schema_validator/Makefile
new file mode 100644
index 0000000000..ed22675e5d
--- /dev/null
+++ b/t/trace_schema_validator/Makefile
@@ -0,0 +1,10 @@ 
+.PHONY: fetch_deps clean
+
+trace_schema_validator: fetch_deps trace_schema_validator.go
+	go build
+
+fetch_deps:
+	go get github.com/xeipuuv/gojsonschema
+
+clean:
+	rm -f trace_schema_validator
diff --git a/t/trace_schema_validator/trace_schema_validator.go b/t/trace_schema_validator/trace_schema_validator.go
new file mode 100644
index 0000000000..51dc9ec608
--- /dev/null
+++ b/t/trace_schema_validator/trace_schema_validator.go
@@ -0,0 +1,74 @@ 
+// trace_schema_validator validates individual lines of an input file against a
+// provided JSON-Schema for git trace2 event output.
+//
+// Traces can be collected by setting the GIT_TRACE2_EVENT environment variable
+// to an absolute path and running any Git command; traces will be appended to
+// the file.
+//
+// Traces can then be verified like so:
+//   trace_schema_validator \
+//     --trace2_event_file /path/to/trace/output \
+//     --schema_file /path/to/schema
+package main
+
+import (
+	"bufio"
+	"flag"
+	"log"
+	"os"
+	"path/filepath"
+
+	"github.com/xeipuuv/gojsonschema"
+)
+
+// Required flags
+var schemaFile = flag.String("schema_file", "", "JSON-Schema filename")
+var trace2EventFile = flag.String("trace2_event_file", "", "trace2 event filename")
+
+func main() {
+	flag.Parse()
+	if *schemaFile == "" || *trace2EventFile == "" {
+		log.Fatal("Both --schema_file and --trace2_event_file are required.")
+	}
+	schemaURI, err := filepath.Abs(*schemaFile)
+	if err != nil {
+		log.Fatal("Can't get absolute path for schema file: ", err)
+	}
+	schemaURI = "file://" + schemaURI
+
+	schemaLoader := gojsonschema.NewReferenceLoader(schemaURI)
+	schema, err := gojsonschema.NewSchema(schemaLoader)
+	if err != nil {
+		log.Fatal("Problem loading schema: ", err)
+	}
+
+	tracesFile, err := os.Open(*trace2EventFile)
+	if err != nil {
+		log.Fatal("Problem opening trace file: ", err)
+	}
+	defer tracesFile.Close()
+
+	scanner := bufio.NewScanner(tracesFile)
+
+	count := 0
+	for ; scanner.Scan(); count++ {
+		event := gojsonschema.NewStringLoader(scanner.Text())
+		result, err := schema.Validate(event)
+		if err != nil {
+			log.Fatal(err)
+		}
+		if !result.Valid() {
+			log.Print("Trace event is invalid: ", scanner.Text())
+			for _, desc := range result.Errors() {
+				log.Print("- ", desc)
+			}
+			os.Exit(1)
+		}
+	}
+
+	if err := scanner.Err(); err != nil {
+		log.Fatal("Scanning error: ", err)
+	}
+
+	log.Print("Validated events: ", count)
+}