mbox series

[00/13] builtin: implement, document and test url-parse

Message ID pull.1715.git.git.1714343461.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series builtin: implement, document and test url-parse | expand

Message

Phillip Wood via GitGitGadget April 28, 2024, 10:30 p.m. UTC
Git commands accept a wide variety of URLs syntaxes, not just standard URLs.
This can make parsing git URLs difficult since standard URL parsers cannot
be used. Even if an external parser were implemented, it would have to track
git's development closely in case support for any new URL schemes are added.

These patches introduce a new url-parse builtin command that exposes git's
native URL parsing algorithms as a plumbing command, allowing other programs
to then call upon git itself to parse the git URLs and their components.

This should be quite useful for scripts. For example, a script might want to
add remotes to repositories, naming them according to the domain name where
the repository is hosted. This new builtin allows it to parse the git URL
and extract its host name which can then be used as input for other
operations. This would be difficult to implement otherwise due to git's
support for scp style URLs.

Signed-off-by: Matheus Afonso Martins Moreira matheus@matheusmoreira.com

Matheus Afonso Martins Moreira (13):
  url: move helper function to URL header and source
  urlmatch: define url_parse function
  builtin: create url-parse command
  url-parse: add URL parsing helper function
  url-parse: enumerate possible URL components
  url-parse: define component extraction helper fn
  url-parse: define string to component converter fn
  url-parse: define usage and options
  url-parse: parse options given on the command line
  url-parse: validate all given git URLs
  url-parse: output URL components selected by user
  Documentation: describe the url-parse builtin
  tests: add tests for the new url-parse builtin

 .gitignore                      |   1 +
 Documentation/git-url-parse.txt |  59 ++++++++++
 Makefile                        |   1 +
 builtin.h                       |   1 +
 builtin/url-parse.c             | 132 ++++++++++++++++++++++
 command-list.txt                |   1 +
 connect.c                       |   8 --
 connect.h                       |   1 -
 git.c                           |   1 +
 remote.c                        |   1 +
 t/t9904-url-parse.sh            | 194 ++++++++++++++++++++++++++++++++
 url.c                           |   8 ++
 url.h                           |   2 +
 urlmatch.c                      |  90 +++++++++++++++
 urlmatch.h                      |   1 +
 15 files changed, 492 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/git-url-parse.txt
 create mode 100644 builtin/url-parse.c
 create mode 100755 t/t9904-url-parse.sh


base-commit: e326e520101dcf43a0499c3adc2df7eca30add2d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1715%2Fmatheusmoreira%2Furl-parse-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1715/matheusmoreira/url-parse-v1
Pull-Request: https://github.com/git/git/pull/1715

Comments

Torsten Bögershausen April 29, 2024, 8:53 p.m. UTC | #1
On Sun, Apr 28, 2024 at 10:30:48PM +0000, Matheus Moreira via GitGitGadget wrote:
> Git commands accept a wide variety of URLs syntaxes, not just standard URLs.
> This can make parsing git URLs difficult since standard URL parsers cannot
> be used. Even if an external parser were implemented, it would have to track
> git's development closely in case support for any new URL schemes are added.
>
> These patches introduce a new url-parse builtin command that exposes git's
> native URL parsing algorithms as a plumbing command, allowing other programs
> to then call upon git itself to parse the git URLs and their components.
>
> This should be quite useful for scripts. For example, a script might want to
> add remotes to repositories, naming them according to the domain name where
> the repository is hosted. This new builtin allows it to parse the git URL
> and extract its host name which can then be used as input for other
> operations. This would be difficult to implement otherwise due to git's
> support for scp style URLs.
>

All in all, having a URL parser as such is a good thing, thanks for working
on that.

There are, however, some notes and questions, up for discussion:

- are there any plans to integrate the parser into connect.c and fetch ?
  Speaking as a person, who manage to break the parsing of URLs once,
  with the good intention to improve things, I need to learn that
  test cases are important.
  Some work can be seen in t5601-clone.sh
  Especially, when dealing with literal IPv6 addresses, the ones with []
  and the simplified ssh syntax 'myhost:src' are interesting to test.
  Git itself strives to be RFC compliant when parsing URLs, but
  we do not fully guarantee to be "fully certified".
  And some features using the [] syntax to embedd a port number
  inside the simplified ssh syntax had not been documented,
  but used in practise, and are now part of the test suite.
  See "[myhost:123]:src" in t5601

- Or is this new tool just a helper, to verify "good" URL's,
  and not accepting our legacy parser quirks ?
  Then we still should see some IPv6 tests ?
  Or may be not, as we prefer hostnames these days ?

- One minor comment:
  in 02/13 we read:
        +enum protocol {
        +       PROTO_UNKNOWN = 0,
        +       PROTO_LOCAL,
        +       PROTO_FILE,
        +       PROTO_SSH,
        +       PROTO_GIT,
  The RFC 1738 uses the term "scheme" here, and using the very generic
  term "protocol" may lead to name clashes later.
  Would something like "git_scheme" or so be better ?

- One minor comment:
   In 13/13 we read:
        +       git url-parse "file:///" &&
        +       git url-parse "file://"

  I think that the "///" version is superflous, it should already
  be covered by the "//" version
Matheus Afonso Martins Moreira April 29, 2024, 10:04 p.m. UTC | #2
Thank you for your feedback.

> are there any plans to integrate the parser into connect.c and fetch ?

Yes.

That was my intention but I was not confident enough to touch connect.c
before getting feedback from the community, since it's critical code
and it is my first contribution.

I do want to merge all URL parsing in git into this one function though,
thereby creating a "single point of truth". This is so that if the algorithm
is modified the changes are visible to the URL parser builtin as well.

> Speaking as a person, who manage to break the parsing of URLs once,
> with the good intention to improve things, I need to learn that
> test cases are important.

Absolutely agree.

When adding test cases, I looked at the possibilities enumerated in urls.txt
and generated test cases based on those. I also looked at the urlmatch.h
test cases. However...

> Some work can be seen in t5601-clone.sh

... I did not think to check those.

> Especially, when dealing with literal IPv6 addresses,
> the ones with [] and the simplified ssh syntax 'myhost:src'
> are interesting to test.

You're right about that. I shall prepare an updated v2 patchset
with more test cases, and also any other changes/improvements
requested by maintainers.

> And some features using the [] syntax to embedd a port number
> inside the simplified ssh syntax had not been documented,
> but used in practise, and are now part of the test suite.
> See "[myhost:123]:src" in t5601

Indeed, I did not read anything of the sort when I checked it.
Would you like me to commit a note to this effect to urls.txt ?

> Or is this new tool just a helper, to verify "good" URL's,
> and not accepting our legacy parser quirks ?

It is my intention that this builtin be able to accept, parse
and decompose all types of URLs that git itself can accept.

> Then we still should see some IPv6 tests ?

I will add them!

> Or may be not, as we prefer hostnames these days ?

I would have to defer that choice to someone more experienced
with the codebase. Please advise on how to proceed.

> The RFC 1738 uses the term "scheme" here, and using the very generic
> term "protocol" may lead to name clashes later.
> Would something like "git_scheme" or so be better ?

Scheme does seem like a better word if it's the terminology used by RFCs.
I can change that in a new version if necessary.
That code is based on the existing connect.c parsing code though.

> I think that the "///" version is superflous, it should already
> be covered by the "//" version

I thought it was a good idea because of existing precedent:
my first approach to creating the test cases was to copy the
ones from t0110-urlmatch-normalization.sh which did have many
cases such as those. Then as I developed the code I came to
believe that it was not necessary: I call url_normalize
in the url_parse function and url_normalize is already being
tested. I think I just forgot to delete those lines.

Reading that file over once again, it does have IPv6 address
test cases. So I should probably go over it again.

Thanks again for the feedback,

  Matheus
Torsten Bögershausen April 30, 2024, 6:51 a.m. UTC | #3
On Mon, Apr 29, 2024 at 07:04:40PM -0300, Matheus Afonso Martins Moreira wrote:

> Thank you for your feedback.
>
> > are there any plans to integrate the parser into connect.c and fetch ?
>
> Yes.
>
> That was my intention but I was not confident enough to touch connect.c
> before getting feedback from the community, since it's critical code
> and it is my first contribution.

Welcome to the Git community.

I wasn't aware of t0110 as a test case...

>
> I do want to merge all URL parsing in git into this one function though,
> thereby creating a "single point of truth". This is so that if the algorithm
> is modified the changes are visible to the URL parser builtin as well.
>

That is a good thing to do. Be prepared for a longer journey, since we have
this legacy stuff to deal with. But I am happy to help with reviews, even
if that may take some days,

[]

> When adding test cases, I looked at the possibilities enumerated in urls.txt
> and generated test cases based on those. I also looked at the urlmatch.h
> test cases. However...
>
> > Some work can be seen in t5601-clone.sh
>
> ... I did not think to check those.
>
> > Especially, when dealing with literal IPv6 addresses,
> > the ones with [] and the simplified ssh syntax 'myhost:src'
> > are interesting to test.
>
> You're right about that. I shall prepare an updated v2 patchset
> with more test cases, and also any other changes/improvements
> requested by maintainers.
>
> > And some features using the [] syntax to embedd a port number
> > inside the simplified ssh syntax had not been documented,
> > but used in practise, and are now part of the test suite.
> > See "[myhost:123]:src" in t5601
>
> Indeed, I did not read anything of the sort when I checked it.
> Would you like me to commit a note to this effect to urls.txt ?

On short: please not.
This kind of syntax was never ment to be used.
The official "ssh://myhost:123/src" is recommended.
When IPv6 parsing was added, people discovered that it could be
used to "protect" the ':' from being a seperator between the hostname
and the path, and can be used to seperate the hostname from the port.
Once that was used in real live, it was too late to change it.
If we now get a better debug tool, it could mention that this is
a legacy feature, and recommend the longer "ssh://" syntax.

>
> > Or is this new tool just a helper, to verify "good" URL's,
> > and not accepting our legacy parser quirks ?
>
> It is my intention that this builtin be able to accept, parse
> and decompose all types of URLs that git itself can accept.
>
> > Then we still should see some IPv6 tests ?
>
> I will add them!
>
> > Or may be not, as we prefer hostnames these days ?
>
> I would have to defer that choice to someone more experienced
> with the codebase. Please advise on how to proceed.

Re-reading this email conversation,
I think that we should support (in the future),
what we support today.
Having a new parser tool means, that there is a chance to reject
those URLs with the note/hint, that they are depracted, and should
be replaced by a proper one.
From my point of view this means that all existing test case should pass
even with the new parser, as a general approach.
Deprecating things is hard, may take years, and may be done in a seperate
task/patch series. Or may be part of this one, in seperate commits.

>
> > The RFC 1738 uses the term "scheme" here, and using the very generic
> > term "protocol" may lead to name clashes later.
> > Would something like "git_scheme" or so be better ?
>
> Scheme does seem like a better word if it's the terminology used by RFCs.
> I can change that in a new version if necessary.
> That code is based on the existing connect.c parsing code though.
>
> > I think that the "///" version is superflous, it should already
> > be covered by the "//" version
>
> I thought it was a good idea because of existing precedent:
> my first approach to creating the test cases was to copy the
> ones from t0110-urlmatch-normalization.sh which did have many
> cases such as those. Then as I developed the code I came to
> believe that it was not necessary: I call url_normalize
> in the url_parse function and url_normalize is already being
> tested. I think I just forgot to delete those lines.
>
> Reading that file over once again, it does have IPv6 address
> test cases. So I should probably go over it again.
>
> Thanks again for the feedback,
>
>   Matheus
>