Documentation: fix build with Asciidoctor 2
diff mbox series

Message ID 20190907170746.273984-1-sandals@crustytoothpaste.net
State New
Headers show
Series
  • Documentation: fix build with Asciidoctor 2
Related show

Commit Message

brian m. carlson Sept. 7, 2019, 5:07 p.m. UTC
Our documentation toolchain has traditionally been built around DocBook
4.5.  This version of DocBook is the last DTD-based version of DocBook.
In 2009, DocBook 5 was introduced using namespaces and its syntax is
expressed in RELAX-NG, which is more expressive and allows a wider
variety of syntax forms.

Asciidoctor, one of the alternatives for building our documentation,
dropped support for DocBook 4.5 in its recent 2.0 release and now only
supports DocBook 5.  This would not be a problem but for the fact that
we use xmlto, which is still stuck in the DocBook 4.5 era.

xmlto performs DTD validation as part of the build process.  This is not
problematic for DocBook 4.5, which has a valid DTD, but it clearly
cannot work for DocBook 5, since no DTD can adequately express its full
syntax.  In addition, even if xmlto did support RELAX-NG validation,
that wouldn't be sufficient because it uses the libxml2-based xmllint to
do so, which has known problems with validating interleaves in RELAX-NG.

Fortunately, there's an easy way forward: ask Asciidoctor to use its
DocBook 5 backend and tell xmlto to skip validation.  Asciidoctor has
supported DocBook 5 since v0.1.4 in 2013 and xmlto has supported
skipping validation for probably longer than that.

xmlto will still use the non-namespaced DocBook XSL stylesheets (which
are designed for DocBook 4.5) for building the documentation instead of
the namespaced ones (which are designed for DocBook 5).  This isn't
really a problem, since both forms are built from the same source and
can handle both versions, but it does mean that an ugly message about
the stylesheets stripping the namespace will be printed to standard
error.

Overall, however, this message is a relatively minor price to pay and it
means that we can continue to use the aging xmlto to drive our build
process, while still supporting newer tooling like Asciidoctor 2.

The differences in output between AsciiDoc 8.6.10 on master and
Asciidoctor 2.0.10 with this patch are, with one exception, all due to
whitespace, wrapping, or quoting and do not affect substantive content.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jeff King Sept. 8, 2019, 10:48 a.m. UTC | #1
On Sat, Sep 07, 2019 at 05:07:46PM +0000, brian m. carlson wrote:

> Our documentation toolchain has traditionally been built around DocBook
> 4.5.  This version of DocBook is the last DTD-based version of DocBook.
> In 2009, DocBook 5 was introduced using namespaces and its syntax is
> expressed in RELAX-NG, which is more expressive and allows a wider
> variety of syntax forms.

Wow, this patch turned out to be much simpler than I had feared. And the
explanation turned out to be much longer and more interesting than I
expected. :) Thanks for writing it all down.

The patch worked well for me.

> Fortunately, there's an easy way forward: ask Asciidoctor to use its
> DocBook 5 backend and tell xmlto to skip validation.  Asciidoctor has
> supported DocBook 5 since v0.1.4 in 2013 and xmlto has supported
> skipping validation for probably longer than that.
> 
> xmlto will still use the non-namespaced DocBook XSL stylesheets (which
> are designed for DocBook 4.5) for building the documentation instead of
> the namespaced ones (which are designed for DocBook 5).  This isn't
> really a problem, since both forms are built from the same source and
> can handle both versions, but it does mean that an ugly message about
> the stylesheets stripping the namespace will be printed to standard
> error.

So I think this does introduce the ugly namespace message even for
people who are using older versions of asciidoctor that understand
docbook45. But it's probably sensible to keep all of our asciidoctor
builds using the same technique. We have enough trouble tracking the
minor differences between the variants as it is.

> The differences in output between AsciiDoc 8.6.10 on master and
> Asciidoctor 2.0.10 with this patch are, with one exception, all due to
> whitespace, wrapping, or quoting and do not affect substantive content.

We know already there are a lot of differences between asciidoc and
asciidoctor here. What's more interesting to me is how it changes the
output between two versions of asciidoctor. Of course we can't check for
2.0.10, because it doesn't build at all. But I did do a "doc-diff"
with and without this patch using 1.5.5. There are quite a few
whitespace changes. Some of them seem good or at least neutral, like:

  --- a/745f6812895b31c02b29bdfe4ae8e5498f776c26-asciidoctor/home/peff/share/man/man1/git-add.1
  +++ b/303729d86b69657777222bf4b3a6f95932e12648-asciidoctor/home/peff/share/man/man1/git-add.1
  [...]
  @@ -43,6 +43,7 @@ DESCRIPTION

   OPTIONS
          <pathspec>...
  +
              Files to add content from. Fileglobs (e.g.  *.c) can be given to
              add all matching files. Also a leading directory name (e.g.  dir to
              add dir/file1 and dir/file2) can be given to update the index to

Some of them seem bad, though:

  --- a/745f6812895b31c02b29bdfe4ae8e5498f776c26-asciidoctor/home/peff/share/man/man1/git-am.1
  +++ b/303729d86b69657777222bf4b3a6f95932e12648-asciidoctor/home/peff/share/man/man1/git-am.1
  [...]
  @@ -175,10 +201,10 @@ DISCUSSION

          to process. Upon seeing the first patch that does not apply, it aborts
          in the middle. You can recover from this in one of two ways:
   
  -        1. skip the current patch by re-running the command with the --skip
  +        1.  skip the current patch by re-running the command with the --skip
              option.
   
  -        2. hand resolve the conflict in the working directory, and update the
  +        2.  hand resolve the conflict in the working directory, and update the
              index file to bring it into a state that the patch should have
              produced. Then run the command with the --continue option.

I tricked doc-diff into doing a comparison against 1.5.5 without the
patch and 2.0.10 with the patch, and the diff is similar.

-Peff
SZEDER Gábor Sept. 8, 2019, 2:13 p.m. UTC | #2
On Sat, Sep 07, 2019 at 05:07:46PM +0000, brian m. carlson wrote:
> Our documentation toolchain has traditionally been built around DocBook
> 4.5.  This version of DocBook is the last DTD-based version of DocBook.
> In 2009, DocBook 5 was introduced using namespaces and its syntax is
> expressed in RELAX-NG, which is more expressive and allows a wider
> variety of syntax forms.
> 
> Asciidoctor, one of the alternatives for building our documentation,
> dropped support for DocBook 4.5 in its recent 2.0 release and now only
> supports DocBook 5.  This would not be a problem but for the fact that
> we use xmlto, which is still stuck in the DocBook 4.5 era.
> 
> xmlto performs DTD validation as part of the build process.  This is not
> problematic for DocBook 4.5, which has a valid DTD, but it clearly
> cannot work for DocBook 5, since no DTD can adequately express its full
> syntax.  In addition, even if xmlto did support RELAX-NG validation,
> that wouldn't be sufficient because it uses the libxml2-based xmllint to
> do so, which has known problems with validating interleaves in RELAX-NG.
> 
> Fortunately, there's an easy way forward: ask Asciidoctor to use its
> DocBook 5 backend and tell xmlto to skip validation.  Asciidoctor has
> supported DocBook 5 since v0.1.4 in 2013 and xmlto has supported
> skipping validation for probably longer than that.
> 
> xmlto will still use the non-namespaced DocBook XSL stylesheets (which
> are designed for DocBook 4.5) for building the documentation instead of
> the namespaced ones (which are designed for DocBook 5).  This isn't
> really a problem, since both forms are built from the same source and
> can handle both versions, but it does mean that an ugly message about
> the stylesheets stripping the namespace will be printed to standard
> error.

These messages from 'xmlto' look like these, and there are a lot of
them:

  Note: namesp. cut : stripped namespace before processing      Git User Manual
  Note: namesp. cut : stripped namespace before processing      git-sh-setup(1)
  Note: namesp. cut : stripped namespace before processing      git-get-tar-commit-id(1)

Unfortunately, these messages to standard error cause our CI builds to
fail [1].

In the Documentation build job we check that there was nothing printed
to standard error during 'make doc', in order to catch warnings that
are potential signs of a mis-rendered documentation, but do not cause
any asciidoc/asciidoctor/xmlto commands (and thus 'make doc') to fail.

Now, a few recent messages to standard error that indeed were signs of
mis-rendered docs [2] looked like this:

  asciidoctor: WARNING: api-config.txt: line 232: unterminated listing block

i.e. they came from Asciidoctor and were all-caps warnings.

OTOH, these "stripped namespace" messages come from 'xmlto', are not
warnings but have that "Note:" prefix, and, trusting that you did
check the results thoroughly, are apparently not a sign of any
rendering issues.  So I think it's safe to ignore them and this patch
should strip them from 'make doc's output in
'ci/test-documentation.sh'.

Related: after this patch we might want to update the CI builds to use
Asciidoctor 2 instead of sticking with 1.5.8.


[1] With Asciidoctor 1.5.8:

      https://travis-ci.org/szeder/git/jobs/582294090#L2085

    With an additional patch on top to use Asciidoctor 2:

      https://travis-ci.org/szeder/git/jobs/582294243#L2066

[2] See 'git log -3 b373e4d29b'
brian m. carlson Sept. 8, 2019, 5:18 p.m. UTC | #3
On 2019-09-08 at 10:48:33, Jeff King wrote:
> Some of them seem bad, though:
> 
>   --- a/745f6812895b31c02b29bdfe4ae8e5498f776c26-asciidoctor/home/peff/share/man/man1/git-am.1
>   +++ b/303729d86b69657777222bf4b3a6f95932e12648-asciidoctor/home/peff/share/man/man1/git-am.1
>   [...]
>   @@ -175,10 +201,10 @@ DISCUSSION
> 
>           to process. Upon seeing the first patch that does not apply, it aborts
>           in the middle. You can recover from this in one of two ways:
>    
>   -        1. skip the current patch by re-running the command with the --skip
>   +        1.  skip the current patch by re-running the command with the --skip
>               option.
>    
>   -        2. hand resolve the conflict in the working directory, and update the
>   +        2.  hand resolve the conflict in the working directory, and update the
>               index file to bring it into a state that the patch should have
>               produced. Then run the command with the --continue option.
> 
> I tricked doc-diff into doing a comparison against 1.5.5 without the
> patch and 2.0.10 with the patch, and the diff is similar.

I see the same thing in doc-diff, but this issue doesn't appear to
actually render in the manual pages themselves, even with MANWIDTH=80.
I'm not sure why, but it doesn't seem to misrender in the actual man
output.

I also tried with the DocBook 5 (docbook-xsl-ns) stylesheets, but that
doesn't appear to make a difference in doc-diff.
Jeff King Sept. 8, 2019, 9:21 p.m. UTC | #4
On Sun, Sep 08, 2019 at 05:18:07PM +0000, brian m. carlson wrote:

> >   -        2. hand resolve the conflict in the working directory, and update the
> >   +        2.  hand resolve the conflict in the working directory, and update the
> >               index file to bring it into a state that the patch should have
> >               produced. Then run the command with the --continue option.
> > 
> > I tricked doc-diff into doing a comparison against 1.5.5 without the
> > patch and 2.0.10 with the patch, and the diff is similar.
> 
> I see the same thing in doc-diff, but this issue doesn't appear to
> actually render in the manual pages themselves, even with MANWIDTH=80.
> I'm not sure why, but it doesn't seem to misrender in the actual man
> output.

Hrm. But doc-diff _is_ diffing the output of man, so that implies a bug
in that script. However, if I manually go into tmp-doc-diff/installed
and run "man -l git-am.1", I do see the extra whitespace, which is what
I'd expect. Are you sure you did your manual inspection on the right
thing?

I was also curious about the root cause at the roff level, so I did this
hack:

diff --git a/Documentation/doc-diff b/Documentation/doc-diff
index 88a9b20168..fde3ac7154 100755
--- a/Documentation/doc-diff
+++ b/Documentation/doc-diff
@@ -22,6 +22,7 @@ to-asciidoc		use asciidoc with the 'to'-commit
 to-asciidoctor		use asciidoctor with the 'to'-commit
 asciidoctor		use asciidoctor with both commits
 cut-footer		cut away footer
+raw			diff roff instead of rendered manpages
 "
 SUBDIRECTORY_OK=1
 . "$(git --exec-path)/git-sh-setup"
@@ -32,6 +33,7 @@ clean=
 from_program=
 to_program=
 cut_footer=
+raw=
 while test $# -gt 0
 do
 	case "$1" in
@@ -57,6 +59,8 @@ do
 		to_program=-asciidoc ;;
 	--cut-footer)
 		cut_footer=-cut-footer ;;
+	--raw)
+		raw=t ;;
 	--)
 		shift; break ;;
 	*)
@@ -159,6 +163,11 @@ render_tree () {
 		mv "$tmp/installed/$dname+" "$tmp/installed/$dname"
 	fi &&
 
+	if test -n "$raw"
+	then
+		return 0
+	fi &&
+
 	# As with "installed" above, we skip the render if it's already been
 	# done.  So using make here is primarily just about running in
 	# parallel.
@@ -181,6 +190,13 @@ render_tree () {
 	fi
 }
 
+if test -z "$raw"
+then
+	diffdir=$tmp/rendered
+else
+	diffdir=$tmp/installed
+fi
+
 render_tree $from_oid $from_dir $from_makemanflags &&
 render_tree $to_oid $to_dir $to_makemanflags &&
-git -C $tmp/rendered diff --no-index "$@" $from_dir $to_dir
+git -C $diffdir diff --no-index "$@" $from_dir $to_dir


The results are quite noisy compared to the rendered output (i.e.,
stylistic choices in the roff output that have equivalent outputs). But
it did let me easily find the difference in one of these cases:


  @@ -312,6 +458,7 @@ option\&.
   .sp -1
   .IP "  2." 4.2
   .\}
  +
   hand resolve the conflict in the working directory, and update the index file to bring it into a state that the patch should have produced\&. Then run the command with the
   \fB\-\-continue\fR
   option\&.


But it's not clear from that whether asciidoc's docbook5 backend inserts
an extra newline, or if it's part of the xml translation. Looking at the
actual XML, I see:

  <listitem>
  <simpara>hand resolve the conflict in the working directory, and update
  the index file to bring it into a state that the patch should
  have produced.  Then run the command with the <literal>--continue</literal> option.</simpara>
  </listitem>

which looks OK. I wondered if there should not be a break between
<listitem> and <simpara>, but it's there in the asciidoc version, too.
So I'm inclined to blame xmlto/docbook5 here.

-Peff
Jeff King Sept. 8, 2019, 9:32 p.m. UTC | #5
On Sun, Sep 08, 2019 at 04:13:08PM +0200, SZEDER Gábor wrote:

> OTOH, these "stripped namespace" messages come from 'xmlto', are not
> warnings but have that "Note:" prefix, and, trusting that you did
> check the results thoroughly, are apparently not a sign of any
> rendering issues.  So I think it's safe to ignore them and this patch
> should strip them from 'make doc's output in
> 'ci/test-documentation.sh'.

Yeah, I'd agree (though I do think there are some other issues with this
approach, at least in my tests).

If we do go this route, it might even be nice to use a small xmlto
wrapper that filters the stderr for us. Then the ci code doesn't have to
know about this distinction, and it prevents users from seeing the ugly
messages.

-Peff
brian m. carlson Sept. 8, 2019, 10:24 p.m. UTC | #6
On 2019-09-08 at 21:21:22, Jeff King wrote:
> But it's not clear from that whether asciidoc's docbook5 backend inserts
> an extra newline, or if it's part of the xml translation. Looking at the
> actual XML, I see:
> 
>   <listitem>
>   <simpara>hand resolve the conflict in the working directory, and update
>   the index file to bring it into a state that the patch should
>   have produced.  Then run the command with the <literal>--continue</literal> option.</simpara>
>   </listitem>
> 
> which looks OK. I wondered if there should not be a break between
> <listitem> and <simpara>, but it's there in the asciidoc version, too.
> So I'm inclined to blame xmlto/docbook5 here.

Trying again, I'm able to reproduce this.  I found the cause, which is
in the stylesheets.  XSLT stylesheets have the ability to specify
elements from which whitespace should be stripped (using the
xsl:strip-space directive).  In the DocBook stylesheets, listitem is
specified as such an element, so the whitespace there should be
stripped.

However, in DocBook 5, our elements are in a namespace.  Therefore, the
unnamespaced stylesheets specify only "listitem", not "d:listitem", like
the namespaced stylesheets do.  Because this happens right after the
tree has been constructed "but before it is otherwise processed by XSLT"
and isn't affected by the EXSLT extension that allows re-parsing the
modified tree, then we end up with the whitespace that we don't want.

There are a couple ways around this.

1. We can force xmlto to use the DocBook 5 stylesheets with the "-x"
flag, but we have to know where they are.  Debian and Fedora have them
in different places, so we'd need a knob to figure out where they are.

2. We can force xmlto to use a custom stylesheet with "-x" that merely
imports the DocBook 5 stylesheets using a URL.  If the user has the
DocBook 5 stylesheets installed and XML catalogs configured (the default
on Linux distributions), then everything will just work and the system
will resolve it to the local copy.  If, however, things are not properly
configured, this will result in multiple network downloads for each
manual page.

3. We can give up on xmlto and do things ourselves.  This has the same
problem as option 1, since we need to learn how to find the stylesheets.

4. We can send a patch to xmlto to make it use the proper stylesheets
for DocBook 5 and hope upstream does a release and everyone upgrades
shortly.  Since xmlto is not at all active upstream, this seems like it
may be an imprudent choice.

5. We can send a patch to the DocBook stylesheets and have them include
both the namespaced and unnamespaced versions of the element names in
both sets of stylesheets and hope everyone upgrades.

My personal preference is #2; I think that seems like the best choice
forward.  XML catalogs are well understood and well configured on Linux
distributions.  Homebrew supports them adequately, but you have to add
an environment variable to your shell configuration to enable them.  Of
course, if you're doing _anything_ with XML, you'll have them enabled.
Junio C Hamano Sept. 9, 2019, 5:37 p.m. UTC | #7
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> There are a couple ways around this.
>
> 1. We can force xmlto to use the DocBook 5 stylesheets with the "-x"
> flag, but we have to know where they are.  Debian and Fedora have them
> in different places, so we'd need a knob to figure out where they are.
>
> 2. We can force xmlto to use a custom stylesheet with "-x" that merely
> imports the DocBook 5 stylesheets using a URL.  If the user has the
> DocBook 5 stylesheets installed and XML catalogs configured (the default
> on Linux distributions), then everything will just work and the system
> will resolve it to the local copy.  If, however, things are not properly
> configured, this will result in multiple network downloads for each
> manual page.
>
> 3. We can give up on xmlto and do things ourselves.  This has the same
> problem as option 1, since we need to learn how to find the stylesheets.
>
> 4. We can send a patch to xmlto to make it use the proper stylesheets
> for DocBook 5 and hope upstream does a release and everyone upgrades
> shortly.  Since xmlto is not at all active upstream, this seems like it
> may be an imprudent choice.
>
> 5. We can send a patch to the DocBook stylesheets and have them include
> both the namespaced and unnamespaced versions of the element names in
> both sets of stylesheets and hope everyone upgrades.
>
> My personal preference is #2; I think that seems like the best choice
> forward.  XML catalogs are well understood and well configured on Linux
> distributions.  Homebrew supports them adequately, but you have to add
> an environment variable to your shell configuration to enable them.  Of
> course, if you're doing _anything_ with XML, you'll have them enabled.

Sounds sensible and well reasoned.
Jeff King Sept. 10, 2019, 6:44 p.m. UTC | #8
On Sun, Sep 08, 2019 at 10:24:24PM +0000, brian m. carlson wrote:

> Trying again, I'm able to reproduce this.  I found the cause, which is
> in the stylesheets.  XSLT stylesheets have the ability to specify
> elements from which whitespace should be stripped (using the
> xsl:strip-space directive).  In the DocBook stylesheets, listitem is
> specified as such an element, so the whitespace there should be
> stripped.
> 
> However, in DocBook 5, our elements are in a namespace.  Therefore, the
> unnamespaced stylesheets specify only "listitem", not "d:listitem", like
> the namespaced stylesheets do.  Because this happens right after the
> tree has been constructed "but before it is otherwise processed by XSLT"
> and isn't affected by the EXSLT extension that allows re-parsing the
> modified tree, then we end up with the whitespace that we don't want.

First off, thank you again for your explanations. I dread digging into
how anything related to docbook or xml works, so having you serve it up
on a silver platter is a delight. :)

> 2. We can force xmlto to use a custom stylesheet with "-x" that merely
> imports the DocBook 5 stylesheets using a URL.  If the user has the
> DocBook 5 stylesheets installed and XML catalogs configured (the default
> on Linux distributions), then everything will just work and the system
> will resolve it to the local copy.  If, however, things are not properly
> configured, this will result in multiple network downloads for each
> manual page.

Isn't this already the case just with the docbook DTDs? I.e., if you
don't have a catalog entry, it is up to the tool (xmlto in this case) to
either fail or try to fetch it. That seems like the best we can do. And
as you note, this typically just works out of the box on modern
installs. Of course people may want to build on non-modern ones, but
IMHO we should probably be more aggressive about dropping legacy support
in the documentation and pointing people to the pre-formatted pages.

> My personal preference is #2; I think that seems like the best choice
> forward.  XML catalogs are well understood and well configured on Linux
> distributions.  Homebrew supports them adequately, but you have to add
> an environment variable to your shell configuration to enable them.  Of
> course, if you're doing _anything_ with XML, you'll have them enabled.

Yeah, agreed.

-Peff
brian m. carlson Sept. 11, 2019, 11:19 p.m. UTC | #9
On 2019-09-10 at 18:44:22, Jeff King wrote:
> First off, thank you again for your explanations. I dread digging into
> how anything related to docbook or xml works, so having you serve it up
> on a silver platter is a delight. :)

I'm happy to do it.  I was an English major in college and virtually
every assignment I wrote in DocBook, so I'm far more familiar with it
than I'm sure most people ever want to be.

> Isn't this already the case just with the docbook DTDs? I.e., if you
> don't have a catalog entry, it is up to the tool (xmlto in this case) to
> either fail or try to fetch it. That seems like the best we can do. And
> as you note, this typically just works out of the box on modern
> installs. Of course people may want to build on non-modern ones, but
> IMHO we should probably be more aggressive about dropping legacy support
> in the documentation and pointing people to the pre-formatted pages.

Yeah, that's a good point.  Since we know that most people will have
catalogs working (since usually fetching the DTD repeatedly over the
network gets you rate-limited resulting in a failure), I think it's fair
for us to do this.  And anyway, DocBook 5 isn't exactly new: it's been
around since 2009, IIRC, so distros will have it.

I know Homebrew doesn't have this problem, since they fetch the prebuilt
docs, even for HEAD (master) builds.

I'll try to get the patch written up and sent out today or tomorrow.

Patch
diff mbox series

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 76f2ecfc1b..485f365cbf 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -197,11 +197,12 @@  ifdef USE_ASCIIDOCTOR
 ASCIIDOC = asciidoctor
 ASCIIDOC_CONF =
 ASCIIDOC_HTML = xhtml5
-ASCIIDOC_DOCBOOK = docbook45
+ASCIIDOC_DOCBOOK = docbook5
 ASCIIDOC_EXTRA += -acompat-mode -atabsize=8
 ASCIIDOC_EXTRA += -I. -rasciidoctor-extensions
 ASCIIDOC_EXTRA += -alitdd='&\#x2d;&\#x2d;'
 DBLATEX_COMMON =
+XMLTO_EXTRA += --skip-validation
 endif
 
 SHELL_PATH ?= $(SHELL)