diff mbox series

[2/2] contrib/git-resurrect.sh: use hash-agnostic OID pattern

Message ID 6fad1fc7fdad98c3dda1ec334a10a6a9e311fef8.1602139448.git.liu.denton@gmail.com (mailing list archive)
State Accepted
Commit e37eae0c1e13cb418947383f7c803d9463bfa3d7
Headers show
Series contrib/git-resurrect.sh: make it hash-agnostic | expand

Commit Message

Denton Liu Oct. 8, 2020, 6:44 a.m. UTC
Since Git now supports hashes other than SHA-1, the hash length isn't
guaranteed to be 40 characters. Replace $_x40 with a hash-agnostic OID
pattern.

Signed-off-by: Denton Liu <liu.denton@gmail.com>
---
 contrib/git-resurrect.sh | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Jeff King Oct. 8, 2020, 4:13 p.m. UTC | #1
On Wed, Oct 07, 2020 at 11:44:40PM -0700, Denton Liu wrote:

> diff --git a/contrib/git-resurrect.sh b/contrib/git-resurrect.sh
> index 57a77c03f9..d843df3afd 100755
> --- a/contrib/git-resurrect.sh
> +++ b/contrib/git-resurrect.sh
> @@ -37,19 +37,18 @@ search_reflog_merges () {
>  	)
>  }
>  
> -_x40="[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]"
> -_x40="$_x40$_x40$_x40$_x40$_x40$_x40$_x40$_x40"
> +oid_pattern=$(git hash-object --stdin </dev/null | sed -e 's/./[0-9a-f]/g')

This looks correct, although...

>  search_merges () {
>  	git rev-list --all --grep="Merge branch '$1'" \
>  		--pretty=tformat:"%P %s" |
> -	sed -ne "/^$_x40 \($_x40\) Merge .*/ {s//\1/p;$early_exit}"
> +	sed -ne "/^$oid_pattern \($oid_pattern\) Merge .*/ {s//\1/p;$early_exit}"
>  }
>  
>  search_merge_targets () {
>  	git rev-list --all --grep="Merge branch '[^']*' into $branch\$" \
>  		--pretty=tformat:"%H %s" --all |
> -	sed -ne "/^\($_x40\) Merge .*/ {s//\1/p;$early_exit} "
> +	sed -ne "/^\($oid_pattern\) Merge .*/ {s//\1/p;$early_exit} "
>  }

in both cases we are matching output we asked for, so we really matching
[0-9a-f]\+ would be correct and sufficient. That's a little simpler. I
don't feel too strongly either way, though.

  Side note: It's a shame that there is no way to convince rev-list not
  to print the "commit ..." header, which is really what we're avoiding
  here. We probably should have suppressed it with user-formats when
  they were introduced, but it's too late to make that change. I wonder
  if it would be worth adding a command-line option, though. I've often
  had to hack around this when parsing rev-list output (and sometimes
  even resort to using git-log if it's a one-off).

-Peff
Jeff King Oct. 8, 2020, 7:53 p.m. UTC | #2
On Thu, Oct 08, 2020 at 11:29:34AM -0700, Junio C Hamano wrote:

> >   Side note: It's a shame that there is no way to convince rev-list not
> >   to print the "commit ..." header, which is really what we're avoiding
> >   here. We probably should have suppressed it with user-formats when
> >   they were introduced, but it's too late to make that change. I wonder
> >   if it would be worth adding a command-line option, though. I've often
> >   had to hack around this when parsing rev-list output (and sometimes
> >   even resort to using git-log if it's a one-off).
> 
> Or make "git log" without frills as fast as rev-list, perhaps?
> 
> What extra things do we do that makes "log" inherently slower than
> "rev-list"?

It's not the speed of log that is a problem, but just that I usually try
to use plumbing when scripting. So I often reach for rev-list first.

I do think for just listing commit hashes that log is slower, though.
One reason is that when there's a commit-graph, it's not as good at
avoiding reading the commit objects. E.g.:

  $ time git rev-list HEAD >/dev/null
  real	0m0.031s
  user	0m0.027s
  sys	0m0.004s

  $ time git -c core.commitgraph=false rev-list HEAD >/dev/null
  real	0m0.362s
  user	0m0.345s
  sys	0m0.016s

  $ time git log --format=%H HEAD >/dev/null
  real	0m0.371s
  user	0m0.355s
  sys	0m0.016s

So running "git log" takes about the same time as rev-list if we disable
the commit-graph. Which makes sense. The pretty-print code is aggressive
about loading the object contents, even if we end up not needing it
(because really, everything _except_ userformat does need it, and even
userformat usually needs it).

I think it would be nice to make the formatting code smarter about
reporting exactly which parts it needs.

> I do not mind a new option (e.g. --no-header) to "rev-list", though.

I took a brief look at this earlier today and it was more awkward than I
expected. The "commit <oid>" header might also have other stuff attached
to it (revision marks, parents, and who even knew we had a "--timestamp"
option?). It's not clear where those things should go if we suppress the
header (for oneline, they just get stuck in front of the oneline; would
that be OK for userformats, too?).

-Peff
brian m. carlson Oct. 8, 2020, 10:11 p.m. UTC | #3
On 2020-10-08 at 16:13:45, Jeff King wrote:
> On Wed, Oct 07, 2020 at 11:44:40PM -0700, Denton Liu wrote:
> 
> > diff --git a/contrib/git-resurrect.sh b/contrib/git-resurrect.sh
> > index 57a77c03f9..d843df3afd 100755
> > --- a/contrib/git-resurrect.sh
> > +++ b/contrib/git-resurrect.sh
> > @@ -37,19 +37,18 @@ search_reflog_merges () {
> >  	)
> >  }
> >  
> > -_x40="[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]"
> > -_x40="$_x40$_x40$_x40$_x40$_x40$_x40$_x40$_x40"
> > +oid_pattern=$(git hash-object --stdin </dev/null | sed -e 's/./[0-9a-f]/g')
> 
> This looks correct, although...

We could write this more simply as this:

  oid_pattern=$(git hash-object /dev/null | sed -e 's/./[0-9a-f]/g')

I'm almost certain that works just fine, even on Windows, and I think
the hashing code may be able to optimize better if we avoid --stdin
(since it knows the size ahead of time).  The performance benefit, if
any, won't be that significant, though, since we're only hashing 7
bytes.

This is, of course, not worth a reroll in itself.

> >  search_merges () {
> >  	git rev-list --all --grep="Merge branch '$1'" \
> >  		--pretty=tformat:"%P %s" |
> > -	sed -ne "/^$_x40 \($_x40\) Merge .*/ {s//\1/p;$early_exit}"
> > +	sed -ne "/^$oid_pattern \($oid_pattern\) Merge .*/ {s//\1/p;$early_exit}"
> >  }
> >  
> >  search_merge_targets () {
> >  	git rev-list --all --grep="Merge branch '[^']*' into $branch\$" \
> >  		--pretty=tformat:"%H %s" --all |
> > -	sed -ne "/^\($_x40\) Merge .*/ {s//\1/p;$early_exit} "
> > +	sed -ne "/^\($oid_pattern\) Merge .*/ {s//\1/p;$early_exit} "
> >  }
> 
> in both cases we are matching output we asked for, so we really matching
> [0-9a-f]\+ would be correct and sufficient. That's a little simpler. I
> don't feel too strongly either way, though.

The problem here is that we'd need to write [0-9a-f][0-9a-f]* because
this is a BRE and a backslashed + here is a GNU extension.

I'm fine with this patch (and patch 1) as it stands or with that change.
I don't think this can be ambiguous, although if we think it can be
(even with silly user behavior), then we should adopt this solution.

>   Side note: It's a shame that there is no way to convince rev-list not
>   to print the "commit ..." header, which is really what we're avoiding
>   here. We probably should have suppressed it with user-formats when
>   they were introduced, but it's too late to make that change. I wonder
>   if it would be worth adding a command-line option, though. I've often
>   had to hack around this when parsing rev-list output (and sometimes
>   even resort to using git-log if it's a one-off).

Yeah, I use git log for scripting more often than I'd like because
rev-list can't remove the header.  That would indeed be a welcome
feature.
Andreas Schwab Oct. 9, 2020, 7:55 a.m. UTC | #4
On Okt 08 2020, brian m. carlson wrote:

> On 2020-10-08 at 16:13:45, Jeff King wrote:
>> On Wed, Oct 07, 2020 at 11:44:40PM -0700, Denton Liu wrote:
>> >  search_merges () {
>> >  	git rev-list --all --grep="Merge branch '$1'" \
>> >  		--pretty=tformat:"%P %s" |
>> > -	sed -ne "/^$_x40 \($_x40\) Merge .*/ {s//\1/p;$early_exit}"
>> > +	sed -ne "/^$oid_pattern \($oid_pattern\) Merge .*/ {s//\1/p;$early_exit}"
>> >  }
>> >  
>> >  search_merge_targets () {
>> >  	git rev-list --all --grep="Merge branch '[^']*' into $branch\$" \
>> >  		--pretty=tformat:"%H %s" --all |
>> > -	sed -ne "/^\($_x40\) Merge .*/ {s//\1/p;$early_exit} "
>> > +	sed -ne "/^\($oid_pattern\) Merge .*/ {s//\1/p;$early_exit} "
>> >  }
>> 
>> in both cases we are matching output we asked for, so we really matching
>> [0-9a-f]\+ would be correct and sufficient. That's a little simpler. I
>> don't feel too strongly either way, though.
>
> The problem here is that we'd need to write [0-9a-f][0-9a-f]* because
> this is a BRE and a backslashed + here is a GNU extension.

I wonder why --pretty uses %s when it is filtered out again anyway?
(There is also a duplicate --all.)

Andreas.
Jeff King Oct. 9, 2020, 11:53 a.m. UTC | #5
On Fri, Oct 09, 2020 at 09:55:53AM +0200, Andreas Schwab wrote:

> >> >  search_merges () {
> >> >  	git rev-list --all --grep="Merge branch '$1'" \
> >> >  		--pretty=tformat:"%P %s" |
> >> > -	sed -ne "/^$_x40 \($_x40\) Merge .*/ {s//\1/p;$early_exit}"
> >> > +	sed -ne "/^$oid_pattern \($oid_pattern\) Merge .*/ {s//\1/p;$early_exit}"
> [...]
> I wonder why --pretty uses %s when it is filtered out again anyway?
> (There is also a duplicate --all.)

It does confirm that the commit in question is a (likely) merge commit,
and not one that happens to have "Merge branch 'foo'" in the body, which
the earlier --grep would have hit. But it doesn't actually check for
'foo' in the sed match, so it would be fooled by a commit message like:

  Merge branch 'bar'

    * bar:
      Merge branch 'foo'

If we wanted to tighten that up, then sed should match the branch name.
If we're willing to accept the looseness, the whole thing could probably
be:

  git rev-list --all --grep="Merge branch '$1'" \
    --min-parents 2 --parents ${early_exit:-"-1"} |
  awk '{print $3}'

I'm happy either way, but I'm not sure anybody overly cares. Let's not
derail Denton's actual fix to make the script work in a sha256 world.

-Peff
diff mbox series

Patch

diff --git a/contrib/git-resurrect.sh b/contrib/git-resurrect.sh
index 57a77c03f9..d843df3afd 100755
--- a/contrib/git-resurrect.sh
+++ b/contrib/git-resurrect.sh
@@ -37,19 +37,18 @@  search_reflog_merges () {
 	)
 }
 
-_x40="[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]"
-_x40="$_x40$_x40$_x40$_x40$_x40$_x40$_x40$_x40"
+oid_pattern=$(git hash-object --stdin </dev/null | sed -e 's/./[0-9a-f]/g')
 
 search_merges () {
 	git rev-list --all --grep="Merge branch '$1'" \
 		--pretty=tformat:"%P %s" |
-	sed -ne "/^$_x40 \($_x40\) Merge .*/ {s//\1/p;$early_exit}"
+	sed -ne "/^$oid_pattern \($oid_pattern\) Merge .*/ {s//\1/p;$early_exit}"
 }
 
 search_merge_targets () {
 	git rev-list --all --grep="Merge branch '[^']*' into $branch\$" \
 		--pretty=tformat:"%H %s" --all |
-	sed -ne "/^\($_x40\) Merge .*/ {s//\1/p;$early_exit} "
+	sed -ne "/^\($oid_pattern\) Merge .*/ {s//\1/p;$early_exit} "
 }
 
 dry_run=