diff mbox series

git-p4: speed up search for branch parent

Message ID pull.1013.git.git.1619640416533.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series git-p4: speed up search for branch parent | expand

Commit Message

Joachim Kuebart April 28, 2021, 8:06 p.m. UTC
From: Joachim Kuebart <joachim.kuebart@gmail.com>

Previously, the code iterated through the parent branch commits and
compared each one to the target tree using diff-tree.

This patch outputs the revision's tree hash along with the commit hash,
thereby saving the diff-tree invocation. This results in a considerable
speed-up, at least on Windows.

Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
---
    git-p4: speed up search for branch parent
    
    Previously, the code iterated through the parent branch commits and
    compared each one to the target tree using diff-tree.
    
    This patch outputs the revision's tree hash along with the commit hash,
    thereby saving the diff-tree invocation. This results in a considerable
    speed-up, at least on Windows.
    
    Signed-off-by: Joachim Kuebart joachim.kuebart@gmail.com

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1013%2Fjkuebart%2Fp4-faster-parent-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1013/jkuebart/p4-faster-parent-v1
Pull-Request: https://github.com/git/git/pull/1013

 git-p4.py | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)


base-commit: 311531c9de557d25ac087c1637818bd2aad6eb3a

Comments

Junio C Hamano April 29, 2021, 2:22 a.m. UTC | #1
"Joachim Kuebart via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Joachim Kuebart <joachim.kuebart@gmail.com>

Thanks.  As git-p4 is not in my area of expertise, I'll make a style
critique first, while pinging Luke as an area expert (you can learn
who they are with "git shortlog --no-merges --since=18.months.ago
git-p4.py").

> Previously, the code iterated through the parent branch commits and
> compared each one to the target tree using diff-tree.

It is customary in this project to describe the problem in the
present tense.  In other words, whoever is writing the log message
still lives in the world without this patch applied to the system.

    The code iterates through the parent commits and compares each of
    them to the target tree using diff-tree.

But before that sentence, please prepare the reader with a bit
larger picture.  A reader may not know what purpose the comparison
serves.  Do we know that the tree of one of the parents of the
commit must match the tree of the target, and trying to see which
parent is the one with the same tree?  What is helped by learning
which parent has the same tree?

Perhaps

    The method searchParent() is used to find a commit in the
    history of the given 'parent' commit whose tree exactly matches
    the tree of the given 'target commit.  The code iterates through
    the commits in the history and compares each of them to the
    target tree by invoking diff-tree.

And then our log message would make observation, pointing out what
is wrong with it (after all comparing with diff-tree is not giving
us a wrong result---the point of this change is that spawning diff-tree
for each commit is wasteful when we only want to see exact matches).

    Because we only are interested in finding a tree that is exactly
    the same, and not interested in how other trees are different,
    having to spawn diff-tree for each and every commit is wasteful. 

> This patch outputs the revision's tree hash along with the commit hash,
> thereby saving the diff-tree invocation. This results in a considerable
> speed-up, at least on Windows.

And then our log message would order the codebase to "become like
so", in order to resolve the issue(s) pointed out in the
observation.  Perhaps

    Use the "--format" option of "rev-list" to find out the tree
    object name of each commit in the history, and find the tree
    whose name is exactly the same as the tree of the target commit
    to optimize this.

When making a claim on performance, it is helpful to our readers to
give some numbers, even in a limited test, e.g.

    In a sample history where ~100 commits needed to be traversed to
    find the fork point on my Windows box, the current code took
    10.4 seconds to complete, while the new code yields the same
    result in 1.8 seconds, which is a significant speed-up.

or something along these lines.

> Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
>  git-p4.py | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac401..dbe94e6fb83b 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -3600,19 +3600,19 @@ def importNewBranch(self, branch, maxChange):
>          return True
>  
>      def searchParent(self, parent, branch, target):
> +        for tree in read_pipe_lines(["git", "rev-parse",
> +                                     "{}^{{tree}}".format(target)]):
> +            targetTree = tree.strip()

It looks very strange to run a commit that you expect a single line
of output, and read the result in a loop.  Doesn't git-p4.py supply
a more suitable helper to read a single line output from a command?

> +        for blob in read_pipe_lines(["git", "rev-list", "--format=%H %T",
>                                       "--no-merges", parent]):

This is not a new problem you introduced, but when we hear word
"blob" in the context of this project, it reminds us of the "blob"
object, while the 'blob' variable used in this loop has nothing to
do with it.  Perhaps rename it to say 'line' or something?

> +            if blob[:7] == "commit ":
> +                continue

Perhaps blob.startswith("commit ") to avoid hardcoded string length?

> +            blob = blob.strip().split(" ")
> +            if blob[1] == targetTree:
>                  if self.verbose:
> +                    print("Found parent of %s in commit %s" % (branch, blob[0]))
> +                return blob[0]
> +        return None
Joachim Kuebart April 29, 2021, 7:48 a.m. UTC | #2
On Thu, 29 Apr 2021 at 04:22, Junio C Hamano <gitster@pobox.com> wrote:
>
> "Joachim Kuebart via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Joachim Kuebart <joachim.kuebart@gmail.com>
>
> Thanks.  As git-p4 is not in my area of expertise, I'll make a style
> critique first, while pinging Luke as an area expert (you can learn
> who they are with "git shortlog --no-merges --since=18.months.ago
> git-p4.py").

Hi Junio, thanks for your timely and thorough review and for putting
up with my greenhorn mistakes ;-)

> > Previously, the code iterated through the parent branch commits and
> > compared each one to the target tree using diff-tree.
>
> It is customary in this project to describe the problem in the
> present tense.  In other words, whoever is writing the log message
> still lives in the world without this patch applied to the system.

I will rephrase the commit message and give better details as you
mentioned. Thanks a lot for your suggestions!

> When making a claim on performance, it is helpful to our readers to
> give some numbers, even in a limited test, e.g.
>
>     In a sample history where ~100 commits needed to be traversed to
>     find the fork point on my Windows box, the current code took
>     10.4 seconds to complete, while the new code yields the same
>     result in 1.8 seconds, which is a significant speed-up.
>
> or something along these lines.

I will add some measurements.

> > Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
> >  git-p4.py | 22 +++++++++++-----------
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/git-p4.py b/git-p4.py
> > index 09c9e93ac401..dbe94e6fb83b 100755
> > --- a/git-p4.py
> > +++ b/git-p4.py
> > @@ -3600,19 +3600,19 @@ def importNewBranch(self, branch, maxChange):
> >          return True
> >
> >      def searchParent(self, parent, branch, target):
> > +        for tree in read_pipe_lines(["git", "rev-parse",
> > +                                     "{}^{{tree}}".format(target)]):
> > +            targetTree = tree.strip()
>
> It looks very strange to run a commit that you expect a single line
> of output, and read the result in a loop.  Doesn't git-p4.py supply
> a more suitable helper to read a single line output from a command?

You're absolutely right that this isn't very readable. I had a quick
look around for a function that reads a single-line response, but I'll
look again and come up with a clearer solution.

> > +        for blob in read_pipe_lines(["git", "rev-list", "--format=%H %T",
> >                                       "--no-merges", parent]):
>
> This is not a new problem you introduced, but when we hear word
> "blob" in the context of this project, it reminds us of the "blob"
> object, while the 'blob' variable used in this loop has nothing to
> do with it.  Perhaps rename it to say 'line' or something?

Will do, thanks!

> > +            if blob[:7] == "commit ":
> > +                continue
>
> Perhaps blob.startswith("commit ") to avoid hardcoded string length?

Yes, that's the name of the function that I can never think of when I need it.

Thanks again for your comments,

Joachim
Luke Diamand April 29, 2021, 8:22 a.m. UTC | #3
On 29/04/2021 07:48, Joachim Kuebart wrote:
> On Thu, 29 Apr 2021 at 04:22, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> "Joachim Kuebart via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Joachim Kuebart <joachim.kuebart@gmail.com>
>>
>> Thanks.  As git-p4 is not in my area of expertise, I'll make a style
>> critique first, while pinging Luke as an area expert (you can learn
>> who they are with "git shortlog --no-merges --since=18.months.ago
>> git-p4.py").
> 
> Hi Junio, thanks for your timely and thorough review and for putting
> up with my greenhorn mistakes ;-)
> 
>>> Previously, the code iterated through the parent branch commits and
>>> compared each one to the target tree using diff-tree.
>>
>> It is customary in this project to describe the problem in the
>> present tense.  In other words, whoever is writing the log message
>> still lives in the world without this patch applied to the system.
> 
> I will rephrase the commit message and give better details as you
> mentioned. Thanks a lot for your suggestions!
> 
>> When making a claim on performance, it is helpful to our readers to
>> give some numbers, even in a limited test, e.g.
>>
>>      In a sample history where ~100 commits needed to be traversed to
>>      find the fork point on my Windows box, the current code took
>>      10.4 seconds to complete, while the new code yields the same
>>      result in 1.8 seconds, which is a significant speed-up.
>>
>> or something along these lines.
> 
> I will add some measurements.
> 
>>> Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
>>>   git-p4.py | 22 +++++++++++-----------
>>>   1 file changed, 11 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/git-p4.py b/git-p4.py
>>> index 09c9e93ac401..dbe94e6fb83b 100755
>>> --- a/git-p4.py
>>> +++ b/git-p4.py
>>> @@ -3600,19 +3600,19 @@ def importNewBranch(self, branch, maxChange):
>>>           return True
>>>
>>>       def searchParent(self, parent, branch, target):
>>> +        for tree in read_pipe_lines(["git", "rev-parse",
>>> +                                     "{}^{{tree}}".format(target)]):
>>> +            targetTree = tree.strip()
>>
>> It looks very strange to run a commit that you expect a single line
>> of output, and read the result in a loop.  Doesn't git-p4.py supply
>> a more suitable helper to read a single line output from a command?
> 
> You're absolutely right that this isn't very readable. I had a quick
> look around for a function that reads a single-line response, but I'll
> look again and come up with a clearer solution.

I don't think there is one - git-p4 has lots of functions for calling 
`p4', but for calling git, it just uses Python's Popen() API.

A good question is whether we can start taking advantage of the newer 
features in Python3 which will obviously break backward compatibility.

> 
>>> +        for blob in read_pipe_lines(["git", "rev-list", "--format=%H %T",
>>>                                        "--no-merges", parent]):
>>
>> This is not a new problem you introduced, but when we hear word
>> "blob" in the context of this project, it reminds us of the "blob"
>> object, while the 'blob' variable used in this loop has nothing to
>> do with it.  Perhaps rename it to say 'line' or something? >
> Will do, thanks!

It confused me as well.

> 
>>> +            if blob[:7] == "commit ":
>>> +                continue
>>
>> Perhaps blob.startswith("commit ") to avoid hardcoded string length?
> 
> Yes, that's the name of the function that I can never think of when I need it.
> 
> Thanks again for your comments,
> 
> Joachim
> 

There are existing tests for importing branches which should cover this. 
I don't know if they need to be extended or not, you might want to check.

Looks good otherwise.
Junio C Hamano April 29, 2021, 8:31 a.m. UTC | #4
Luke Diamand <luke@diamand.org> writes:

>>>>       def searchParent(self, parent, branch, target):
>>>> +        for tree in read_pipe_lines(["git", "rev-parse",
>>>> +                                     "{}^{{tree}}".format(target)]):
>>>> +            targetTree = tree.strip()
>>>
>>> It looks very strange to run a commit that you expect a single line
>>> of output, and read the result in a loop.  Doesn't git-p4.py supply
>>> a more suitable helper to read a single line output from a command?
>> You're absolutely right that this isn't very readable. I had a quick
>> look around for a function that reads a single-line response, but I'll
>> look again and come up with a clearer solution.
>
> I don't think there is one - git-p4 has lots of functions for calling
> `p4', but for calling git, it just uses Python's Popen() API.

OK.  It just felt "strange", not "wrong", so I am OK with the
construct at least for now.

> There are existing tests for importing branches which should cover
> this. I don't know if they need to be extended or not, you might want
> to check.
>
> Looks good otherwise.

Thanks for a prompt review.
Joachim Kuebart April 29, 2021, 11:30 a.m. UTC | #5
On Thu, 29 Apr 2021 at 10:22, Luke Diamand <luke@diamand.org> wrote:
>
> There are existing tests for importing branches which should cover this.
> I don't know if they need to be extended or not, you might want to check.

I enhanced t9801-git-p4-branch to check for this functionality, i.e.
that the branches are branched off at the correct commits from their
parents. As far as I could see, there was no test for this before.

Thank you as well for your quick response!

Cheers,

Joachim
Joachim Kuebart April 29, 2021, 7:31 p.m. UTC | #6
On Thu, 29 Apr 2021 at 10:31, Junio C Hamano <gitster@pobox.com> wrote:
>
> Luke Diamand <luke@diamand.org> writes:
> >
> > Looks good otherwise.
>
> Thanks for a prompt review.

I addressed your comments and updated my PR at
https://github.com/git/git/pull/1013, but CI seems stuck and hasn't
kicked off yet. I'd like to see it pass especially since I modified a
test. Is there anything else I need to do?

Joachim
diff mbox series

Patch

diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac401..dbe94e6fb83b 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -3600,19 +3600,19 @@  def importNewBranch(self, branch, maxChange):
         return True
 
     def searchParent(self, parent, branch, target):
-        parentFound = False
-        for blob in read_pipe_lines(["git", "rev-list", "--reverse",
+        for tree in read_pipe_lines(["git", "rev-parse",
+                                     "{}^{{tree}}".format(target)]):
+            targetTree = tree.strip()
+        for blob in read_pipe_lines(["git", "rev-list", "--format=%H %T",
                                      "--no-merges", parent]):
-            blob = blob.strip()
-            if len(read_pipe(["git", "diff-tree", blob, target])) == 0:
-                parentFound = True
+            if blob[:7] == "commit ":
+                continue
+            blob = blob.strip().split(" ")
+            if blob[1] == targetTree:
                 if self.verbose:
-                    print("Found parent of %s in commit %s" % (branch, blob))
-                break
-        if parentFound:
-            return blob
-        else:
-            return None
+                    print("Found parent of %s in commit %s" % (branch, blob[0]))
+                return blob[0]
+        return None
 
     def importChanges(self, changes, origin_revision=0):
         cnt = 1