mbox series

[v2,0/4] Optimization batch 7: use file basenames to guide rename detection

Message ID pull.843.v2.git.1612870326.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Optimization batch 7: use file basenames to guide rename detection | expand

Message

Philippe Blain via GitGitGadget Feb. 9, 2021, 11:32 a.m. UTC
This series depends on ort-perf-batch-6[1].

This series uses file basenames in a basic fashion to guide rename
detection.

Changes since v1:

 * Significant edits to the commit messages to make them explain the basic
   idea better and (hopefully) prevent misunderstandings.
 * Add a commit at the end that updates the documentation and includes a new
   testcase
 * Modify the code to make it clearer that it can already handle using a
   different score for basename comparison similarity, even if we don't use
   that ability yet (see below)

Changes not yet included -- I need input about what is wanted:

 * Stolee suggested not creating a separate score for basename comparisons,
   at least not yet. Junio suggested it may be prudent to use a higher score
   for that than whatever -M option the user provided for normal
   comparisons...but didn't suggest whether it should be a separate
   user-specified option, or some kind of weighted average of the -M option
   and MAX_SCORE (e.g. use 60% if -M is 50%, or use 80% if -M is 75%). I
   tweaked the code to make it clearer that it already is able to handle
   such a score difference, but I'm not sure what whether we want an
   automatically computed higher value or a user-controlled possibly higher
   value.

[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/ [2]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (4):
  diffcore-rename: compute basenames of all source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection

 Documentation/gitdiffcore.txt |  15 +++
 diffcore-rename.c             | 185 +++++++++++++++++++++++++++++++++-
 t/t4001-diff-rename.sh        |  24 +++++
 3 files changed, 220 insertions(+), 4 deletions(-)


base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v1:

 1:  377a4a39fa86 ! 1:  381a45d239bb diffcore-rename: compute basenames of all source and dest candidates
     @@ Commit message
          diffcore-rename: compute basenames of all source and dest candidates
      
          We want to make use of unique basenames to help inform rename detection,
     -    so that more likely pairings can be checked first.  Add a new function,
     +    so that more likely pairings can be checked first.  (src/moduleA/foo.txt
     +    and source/module/A/foo.txt are likely related if there are no other
     +    'foo.txt' files among the deleted and added files.)  Add a new function,
          not yet used, which creates a map of the unique basenames within
          rename_src and another within rename_dst, together with the indices
          within rename_src/rename_dst where those basenames show up.  Non-unique
          basenames still show up in the map, but have an invalid index (-1).
      
     +    This function was inspired by the fact that in real world repositories,
     +    most renames often do not involve a basename change.  Here are some
     +    sample repositories and the percentage of their historical renames (as of
     +    early 2020) that did not involve a basename change:
     +      * linux: 76%
     +      * gcc: 64%
     +      * gecko: 79%
     +      * webkit: 89%
     +
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 2:  5fb4493247ff ! 2:  dcd0175229aa diffcore-rename: complete find_basename_matches()
     @@ Commit message
          sufficiently similar, we record the rename; if not, we include those
          files in the more exhaustive matrix comparison.
      
     +    This means we are adding a set of preliminary additional comparisons,
     +    but for each file we only compare it with at most one other file.  For
     +    example, if there was a include/media/device.h that was deleted and a
     +    src/module/media/device.h that was added, and there were no other
     +    device.h files added or deleted between the commits being compared,
     +    then these two files would be compared in the preliminary step.
     +
     +    This commit does not yet actually employ this new optimization, it
     +    merely adds a function which can be used for this purpose.  The next
     +    commit will do the necessary plumbing to make use of it.
     +
          Note that this optimization might give us different results than without
          the optimization, because it's possible that despite files with the same
          basename being sufficiently similar to be considered a rename, there's
          an even better match between files without the same basename.  I think
     -    that is okay for four reasons: (1) That seems somewhat unlikely in
     -    practice, (2) it's easy to explain to the users what happened if it does
     -    ever occur (or even for them to intuitively figure out), and (3) as the
     -    next patch will show it provides such a large performance boost that
     -    it's worth the tradeoff.  Reason (4) takes a full paragraph to
     +    that is okay for four reasons: (1) it's easy to explain to the users
     +    what happened if it does ever occur (or even for them to intuitively
     +    figure out), (2) as the next patch will show it provides such a large
     +    performance boost that it's worth the tradeoff, and (3) it's somewhat
     +    unlikely that despite having unique matching basenames that other files
     +    serve as better matches.  Reason (4) takes a full paragraph to
          explain...
      
          If the previous three reasons aren't enough, consider what rename
     @@ Commit message
          basename and are sufficiently similar to be considered a rename, mark
          them as such without comparing the two to all other rename candidates.
      
     -    We do not use this heuristic together with either break or copy
     -    detection.  The point of break detection is to say that filename
     -    similarity does not imply file content similarity, and we only want to
     -    know about file content similarity.  The point of copy detection is to
     -    use more resources to check for additional similarities, while this is
     -    an optimization that uses far less resources but which might also result
     -    in finding slightly fewer similarities.  So the idea behind this
     -    optimization goes against both of those features, and will be turned off
     -    for both.
     -
     -    Note that this optimization is not yet effective in any situation, as
     -    the function is still unused.  The next commit will hook it into the
     -    code so that it is used when rename detection is wanted, but neither
     -    copy nor break detection are.
     -
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 3:  1d941c35076e ! 3:  ce2173aa1fb7 diffcore-rename: guide inexact rename detection based on basenames
     @@ Commit message
      
          Make use of the new find_basename_matches() function added in the last
          two patches, to find renames more rapidly in cases where we can match up
     -    files based on basenames.
     +    files based on basenames.  As a quick reminder (see the last two commit
     +    messages for more details), this means for example that
     +    docs/extensions.txt and docs/config/extensions.txt are considered likely
     +    renames if there are no 'extensions.txt' files elsewhere among the added
     +    and deleted files, and if a similarity check confirms they are similar,
     +    then they are marked as a rename without looking for a better similarity
     +    match among other files.  This is a behavioral change, as covered in
     +    more detail in the previous commit message.
     +
     +    We do not use this heuristic together with either break or copy
     +    detection.  The point of break detection is to say that filename
     +    similarity does not imply file content similarity, and we only want to
     +    know about file content similarity.  The point of copy detection is to
     +    use more resources to check for additional similarities, while this is
     +    an optimization that uses far less resources but which might also result
     +    in finding slightly fewer similarities.  So the idea behind this
     +    optimization goes against both of those features, and will be turned off
     +    for both.
      
          For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
          performance work; instrument with trace2_region_* calls", 2020-10-28),
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		remove_unneeded_paths_from_src(want_copies);
      +		trace2_region_leave("diff", "cull after exact", options->repo);
      +	} else {
     ++		/* Determine minimum score to match basenames */
     ++		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
     ++
      +		/*
      +		 * Cull sources:
      +		 *   - remove ones involved in renames (found via exact match)
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +
      +		/* Utilize file basenames to quickly find renames. */
      +		trace2_region_enter("diff", "basename matches", options->repo);
     -+		rename_count += find_basename_matches(options, minimum_score,
     ++		rename_count += find_basename_matches(options,
     ++						      min_basename_score,
      +						      rename_src_nr);
      +		trace2_region_leave("diff", "basename matches", options->repo);
      +
 -:  ------------ > 4:  a0e75d8cd6bd gitdiffcore doc: mention new preliminary step for rename detection