[00/30,RFC] Path-walk API and applications

Message ID	pull.1786.git.1725935335.gitgitgadget@gmail.com (mailing list archive)
Headers	show Received: from mail-ed1-f49.google.com (mail-ed1-f49.google.com [209.85.208.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BC77D6A332 for <git@vger.kernel.org>; Tue, 10 Sep 2024 02:29:00 +0000 (UTC) Message-Id: <pull.1786.git.1725935335.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Tue, 10 Sep 2024 02:28:25 +0000 Subject: [PATCH 00/30] [RFC] Path-walk API and applications Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im, me@ttaylorr.com, johncai86@gmail.com, newren@gmail.com, Derrick Stolee <stolee@gmail.com>
Series	Path-walk API and applications \| expand [00/30,RFC] Path-walk API and applications [01/30] path-walk: introduce an object walk by path [02/30] backfill: add builtin boilerplate [03/30] backfill: basic functionality and tests [04/30] backfill: add --batch-size=<n> option [05/30] backfill: add --sparse option [06/30] backfill: assume --sparse when sparse-checkout is enabled [07/30] path-walk: allow consumer to specify object types [08/30] path-walk: allow visiting tags [09/30] survey: stub in new experimental `git-survey` command [10/30] survey: add command line opts to select references [11/30] survey: collect the set of requested refs [12/30] survey: start pretty printing data in table form [13/30] survey: add object count summary [14/30] survey: summarize total sizes by object type [15/30] survey: show progress during object walk [16/30] survey: add ability to track prioritized lists [17/30] survey: add report of "largest" paths [18/30] revision: create mark_trees_uninteresting_dense() [19/30] path-walk: add prune_all_uninteresting option [20/30] pack-objects: add --path-walk option [21/30] pack-objects: extract should_attempt_deltas() [22/30] pack-objects: introduce GIT_TEST_PACK_PATH_WALK [23/30] p5313: add size comparison test [24/30] repack: add --path-walk option [25/30] pack-objects: enable --path-walk via config [26/30] scalar: enable path-walk during push via config [27/30] pack-objects: add --full-name-hash option [28/30] test-name-hash: add helper to compute name-hash functions [29/30] p5314: add a size test for name-hash collisions [30/30] pack-objects: output debug info about deltas

Derrick Stolee via GitGitGadget Sept. 10, 2024, 2:28 a.m. UTC

This RFC is ultimately about introducing a new way to walk objects, called
the "path-walk API" in the new path-walk.[ch] files. Before digging into the
details of the API, let's discuss the applications which will hint at the
API's design.


APPLICATIONS OF THE PATH-WALK API
=================================

The applications of this API were discovered in the following order, though
I recommend reversing the order for the priority of their actual
implementation in future patch series:

 * git backfill: a builtin to download missing blobs in a blobless partial
   clone, done in batches and grouped by the path they appear in to maximize
   delta compression in each batch. Allows focusing on the paths of the
   sparse-checkout to only get the blobs necessary for history queries in
   the current focus.

 * git survey: Jeff Hostetler built this feature [1] as a way to get
   functionality similar to git-sizer [2], but using the internals of Git to
   do it faster. It also displays information not available to git-sizer,
   like the on-disk size of objects. This RFC presents a simplified version
   of the builtin focused on listing the paths that contribute the most to
   the on-disk size of trees and blobs.

 * git pack-objects --path-walk: In order to find a way to compute deltas
   among objects of the same path, I applied the path-walk API to 'git
   pack-objects' behind an optional flag. There are overlaps with the
   '--sparse' option [3], [4] that can be used here. This provides perfect
   partitioning by path name, without any possible collisions from the
   name-hash algorithm. It also allows using the name-hash values to find
   cross-path delta chains in a second pass.

 * git repack --full-name-hash: If we are worried about name-hsah
   collisions, an easier thing to implement is a different name-hash
   algorithm that is less likely to have collisions. This feature was
   already sent to the mailing list as a fully-reviewable series [5]. It is
   included here because this series allows testing the --path-walk option
   against the --full-name-hash.

[1] https://github.com/microsoft/git/pull/667 [2]
https://github.com/github/git-sizer [3]
https://github.com/git/git/compare/5d826e972970a784bd7a7bdf587512510097b8c7...99dbbfa8ddbba2b620965d026d4ec199b8837a6f
[4]
https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/
[5]
https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com


TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER
======================================================

Here's the story about how these applications came about: I was tasked with
understanding why certain internal repositories were growing larger than
expected. (Feel free to skip. Otherwise, thank you for indulging me.)

I first prototyped 'git backfill' as a way to download some of the largest
repositories without being blocked on a full clone. This batched download
mechanism allowed me to essentially have a retryable clone, since the client
could restart the process from scratch and skip any objects that were
already on disk. It was natural to batch based on the path of the blobs in
order to theoretically save time and network bandwidth due to better delta
calculations.

While investigating these repositories, I had some data hinting at the total
size of the objects by type. But I was most interested in learning what
exactly was causing this growth. I did have a hint that the "release"
branches were taking up much more space than the default branch, since
cloning with --single-branch resulted in ~55GB of data but then fetching the
release branches led to an additional ~125GB of data. Using "git diff" it
was clear that these branches stored some CHANGELOG.json and CHANGELOG.md
files, but the diffs were relatively small (but multiple such changes were
batched together into a single new commit). I needed to see why exactly
these paths were taking up so much space.

To this, I turned to Jeff Hostetler's new "git survey" command. This told me
information about the total size of trees and blobs and told me the path
name for the individual blobs that were the largest in the repository. These
paths were typically binary files that appeared only once or twice and did
not account for the scale issues. I modified 'git survey' to use the same
path-batching logic as in 'git backfill' to then consider each batch of
objects and their size. Thus, the "path-walk API" was born. (This RFC
contains a version that looks like it was created before 'git backfill'.)

The repository I was looking at had a clear pattern in its top 100 file
paths by on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md
files. The .md files surprised me, since they were always simple appends of
the previous .md file at the same path. The .json files were capped in how
many versions were being listed (at least in recent versions) but the data
it stored was harder to compress. So I went looking into how 'git push' was
calculating these delta bases. Adding some debug information to 'git
pack-objects' demonstrated that the previous file versions were not being
matched as delta bases. Instead, other new blobs in the push were being used
as delta bases.

This meant that what should have been a trivial set of deltas bloated to
20-60 MB. (We will see later that it is possible for these to be 100-500
KB.)

Here is where I went on a little bit of a detour. (Come with me, it's
important.) I knew that 'git pack-objects' used a name-hash to group objects
by path, so I assumed that the reason these delta bases were not found was
because the UNINTERESTING objects were not being added to the packing list.
I have since discovered that this is incorrect, but I might have gotten
stuck if I didn't think this.

This seemed like a natural reason to extend the path-walk API to allow
walking commits and tags as part of 'git pack-objects' behind a new
'--path-walk' option. The idea here is to compute deltas among the objects
that share a common path and then later go through the (type, name-hash,
size) sorting system to find other delta bases across path boundaries. After
a lot of testing, failing, and testing again, the implementation in this RFC
finally works to achieve the goal. It's not pretty (especially with how it
handles tags) but it gets the job done.

In hindsight, I realized that the UNINTERESTING objects were being
considered, but due to collisions in the name-hash algorithm these objects
were being sorted outside of the delta computation window. For this reason,
I thought to create a new name-hash algorithm. Thus, the --full-name-hash
option for 'git pack-objects' and 'git repack' was born. This feature was
split out and sent to the mailing list independently from this RFC.


RFC GOALS
=========

The goals of this RFC are:

 1. To demonstrate potential applications of the path-walk API to motivate
    its generality as these features are sent in full-quality patch series,
    but in a different order.

 2. To communicate the discoveries found during the --path-walk and
    --full-name-hash features in 'git pack-objects' and 'git repack'. This
    includes comparing and contrasting the effectiveness of these features.

 3. To demonstrate the value of the path-based batching in the 'git survey'
    feature, and to inspire others to think about what other statistics
    would be valuable in that feature. (I anticipate that once a base is
    established, multiple contributors will help expand its functionality
    long into the future.)


RFC OUTLINE
===========

The patches are grouped roughly by the application, in order of discovery:


PART I: 'git backfill'
======================

These patches introduce the 'git backfill' builtin including its
'--batch-size' and '--sparse' options. While this is the first and simplest
application, it is also the lowest priority in terms of user need.

 * path-walk: introduce an object walk by path
 * backfill: add builtin boilerplate
 * backfill: basic functionality and tests
 * backfill: add --batch-size= option
 * backfill: add --sparse option
 * backfill: assume --sparse when sparse-checkout is enabled


PART II: 'git survey'
=====================

These patches reimplement a subset of the functionality of 'git survey' as
well as generalize some of the data structures that Jeff's implementation
made. The flexibility hopefully comes through to show the potential for
future extensions. These patches are quite rough and will need more
attention before they can be sent for full review.

 * path-walk: allow consumer to specify object types
 * path-walk: allow visiting tags
 * survey: stub in new experimental git-survey command
 * survey: add command line opts to select references
 * survey: collect the set of requested refs
 * survey: start pretty printing data in table form
 * survey: add object count summary
 * survey: summarize total sizes by object type
 * survey: show progress during object walk
 * survey: add ability to track prioritized lists
 * survey: add report of "largest" paths


PART III: 'git pack-objects --path-walk'
========================================

Here is where I think the meat of the RFC really lies. There are still some
rough edges, but the data will show that 'git pack-objects --path-walk' has
the potential to be an extremely effective way to pack objects. (Caveats
will come later in the analysis section.)

 * revision: create mark_trees_uninteresting_dense()
 * path-walk: add prune_all_uninteresting option
 * pack-objects: add --path-walk option
 * pack-objects: extract should_attempt_deltas()
 * pack-objects: introduce GIT_TEST_PACK_PATH_WALK
 * p5313: add size comparison test
 * repack: add --path-walk option
 * pack-objects: enable --path-walk via config
 * scalar: enable path-walk during push via config

One obvious issue with this current implementation is that it inlines much
of the delta calculation into the "Enumerate objects" phase, and thus makes
it single-threaded. This should be fixed before being considered for full
review, but even without threading this version can out-perform the standard
packing strategy in terms of end-to-end packing time!


PART IV: 'git repack --full-name-hash'
======================================

This is a simplified version of the patch series that was split out by
itself earlier for full review. This was split out on its own partly because
it doesn't actually use the path-walk API. This has benefits and drawbacks,
but it seems like a quick win for many scenarios.

 * pack-objects: add --full-name-hash option
 * test-name-hash: add helper to compute name-hash functions
 * p5314: add a size test for name-hash collisions
 * pack-objects: output debug info about deltas

This last patch is an add-on of the debugging information that I used to
discover issues with delta bases during 'git push'.


DISCUSSION OF NAME HASH
=======================

One thing to talk about before digging into --path-walk and --full-name-hash
features is the existing name-hash algorithm. This hash algorithm creates a
uint32_t based on the final 16 characters of the path name, weighing the
last characters more. There are multiple benefits to this:

 1. Files of common types (.c, .txt, ...) may be grouped together.

 2. Files that are renamed across directories may be grouped together.

(Thanks, Junio, for making this second benefit clear.)

The issue here is that some common patterns arise in repositories that use
common path names across directories, and those files are creating name-hash
collisions and making the sort less effective. One thing that can counteract
these collisions is to increase the --window setting, but this significantly
slows the delta computations.

Thus, the --path-walk and --full-name-hash features both attempt to combat
these name-hash collisions in very different ways. The --path-walk mechanism
uses the path-walk API to consider batches of objects that all share the
same path. This avoids storing every possible path names in memory while
doing the object walk, but still gives nice boundaries for delta compression
possibilities. After the path-walk is complete, the full packing list is
still sorted via name-hash and this allows for cross-path deltas. This is
critical!

The --full-name-hash feature does the simpler choice of replacing the
name-hash method with one that has fewer collisions, but loses the benefits
of "nearby" paths having close hash values.

This naturally leads to these two main differences in the two approaches:

 1. The --full-name-hash feature is not good for 'git push' or similarly
    small pack-files. Since it limits the delta chains to objects with the
    same full path and loses the benefit of "nearby" paths, this feature
    should be used for larger repacks. In my testing, 'git push' simulations
    almost always have poor packing but 'git repack -adf' simulations have
    packing rivaling the --path-walk option.

 2. The --path-walk option changes the object ordering significantly,
    meaning it may not ever be appropriate to combine with advanced
    repacking features such as delta islands or even reachability bitmaps.
    While my testing has shown that the --path-walk repacks are the most
    efficient of all options, this limitation makes me hesitate to recommend
    it wider than client repositories.

One natural question to consider is to think about storing both the
name-hash and the full-name-hash and doing two delta passes, each one
sorting the objects by a different hash function. The first issue is that we
don't want to store two hash values per object, as that will significantly
increase the memory pressure during repacking. This could be side-stepped by
storing the full-name-hash in the packing list and then a second mapping
from full-name-hash to name-hash. However, that still leads to the two
passes taking extra time. The --path-walk approach is faster than even a
single pass. And in the right scenarios, the --full-name-hash option is very
close to the --path-walk results.


ANALYSIS OF PACKING STRATEGIES
==============================

Since I was focused on an internal monorepo that stored a large collection
of Javascript packages, it should be no surprise that I eventually found
other Javascript repositories that used similar tooling and thus had similar
issues with unexplained scale problems. I'll use these four repositories as
examples repeatedly, but one is actually public: microsoft/fluentui [6].
I'll use Repo B, C, and D for the others, in increasing size.

[6] https://github.com/microsoft/fluentui

In each of these repositories, doing a full repack ('git repack -adf') right
after cloning presents a sizeable reduction in space. This is expected for
servers not being optimized for exactly the reachable set I'm cloning. So,
I'll focus on how much the default repacking parameters compare to using the
--full-name-hash or --path-walk options:

| Repo     | Standard Repack | With --full-name-hash | With --path-walk |
|----------|-----------------|-----------------------|------------------|
| fluentui |         438 MB  |               168 MB  |          148 MB  |
| Repo B   |       6,255 MB  |               829 MB  |          778 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |        6,158 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |        4,432 MB  |


Hopefully these reductions show how much these name-hash collisions are
causing issues in these repositories.

For the fluentui repo, I'm also able to share highlights from the 'git
survey' output immediately after cloning and after repacking with the
--path-walk option.

First, we can consider the total reachable objects in each scenario:

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10443669 |      15092790
      Trees | 276503 |  40212070 |     244429615
      Blobs | 294500 | 635365661 |   10791187920


TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10450605 |      15092790
      Trees | 276503 |  31136263 |     244429615
      Blobs | 294500 |  94442401 |   10791187920


Now, we can consider the top 10 file paths before and after repacking with
the --path-walk feature:

TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |  58060531 |     889120488
 ...-experiments/CHANGELOG.json |   505 |  28439452 |     252723999
 ...fabric-react/CHANGELOG.json |  1270 |  25556510 |     902756623
 ...t-components/CHANGELOG.json |   176 |  20366365 |     244936649
 ...act-charting/CHANGELOG.json |   590 |  20106422 |     208224460
 ...e-components/CHANGELOG.json |   559 |  15761271 |     189061764
 ...act-examples/CHANGELOG.json |   577 |  13615569 |     234949961
 ...react-charting/CHANGELOG.md |   564 |  11205840 |     104337986
 .../experiments/CHANGELOG.json |   569 |  10596377 |     123662770
 ...i-fabric-react/CHANGELOG.md |  1263 |   8154248 |     261494258


TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |   9909326 |     889120488
 ...iceBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
 ...out/src/images/download.jpg |     1 |   1845249 |       1846117
 ...fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
          .yarn/releases/cli.js |     1 |   1335657 |       6741614
 ...c/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
 ...ages/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
 ...nents/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
 ...ig/rush/npm-shrinkwrap.json |   138 |   1058531 |      89326567
 ...Accessibility_29Sep2016.pdf |     1 |    856621 |        927268


As we can see from this example, before the repack we are mainly seeing the
disk space be dominated by objects that appear at paths with CHANGELOG.json
or CHANGELOG.md, which are frequently hitting collisions with the default
name-hash. After the repack with the --path-walk feature, we see the largest
paths are what we expect: checked in binaries, frequently-edited yarn files,
and other hard-to- compress data.

The nice thing about this result is that we can point to files that are
taking up space because there is no other way around it, not that the naming
convention for the files is causing confusion during packing.


WHAT TO DO NOW?
===============

Thank you for reading this far. I hope that this has added context on the
patch series for the --full-name-hash option, but also provided the right
context for when I come back in a week or so with a review-ready version of
the --path-walk option.

These patches are rough and I want to make sure everyone knows that.
Reviewing them for style or even basic organization may lead to some wasted
time. I know that at least one of the patches got mangled and its diff does
not match its description (I'm thinking specifically about "pack-objects:
extract should_attempt_deltas()" but this could apply elsewhere).

But I think that interested parties could take my branch, build it, and give
these features a try on their favorite repos. I'd love to hear feedback on
the usability and effectiveness, especially if someone finds a case where
the --path-walk option is less effective at packing data.

There are two things going on since I started working on this series that
could cause issues with applying this series on top of current 'master' or
with topics in 'seen':

 * The most obvious one is the collision with the --full-name-hash feature
   under review. The in-review series takes precedent.

 * The unused parameter warnings are now errors! I'm glad that got in, but I
   wasn't careful about it in these RFC patches.

 * John Cai is proposing to change the builtin interface to have builtin
   methods take a repository pointer. I approve of this effort, but it
   collides with the two new builtins introduced here. As discussed above,
   the 'git survey' and 'git backfill' builtins are the lowest priority
   items for getting merged, in my opinion, so they can wait until that
   change has been made.

I look forward to any and all comments.

Thanks! -Stolee

Derrick Stolee (27):
  path-walk: introduce an object walk by path
  backfill: add builtin boilerplate
  backfill: basic functionality and tests
  backfill: add --batch-size=<n> option
  backfill: add --sparse option
  backfill: assume --sparse when sparse-checkout is enabled
  path-walk: allow consumer to specify object types
  path-walk: allow visiting tags
  survey: start pretty printing data in table form
  survey: add object count summary
  survey: summarize total sizes by object type
  survey: show progress during object walk
  survey: add ability to track prioritized lists
  survey: add report of "largest" paths
  revision: create mark_trees_uninteresting_dense()
  path-walk: add prune_all_uninteresting option
  pack-objects: add --path-walk option
  pack-objects: extract should_attempt_deltas()
  pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  p5313: add size comparison test
  repack: add --path-walk option
  pack-objects: enable --path-walk via config
  scalar: enable path-walk during push via config
  pack-objects: add --full-name-hash option
  test-name-hash: add helper to compute name-hash functions
  p5314: add a size test for name-hash collisions
  pack-objects: output debug info about deltas

Jeff Hostetler (3):
  survey: stub in new experimental `git-survey` command
  survey: add command line opts to select references
  survey: collect the set of requested refs

 .gitignore                     |   2 +
 Documentation/config/pack.txt  |   8 +
 Documentation/git-backfill.txt |  60 ++
 Documentation/git-survey.txt   |  70 +++
 Makefile                       |   4 +
 builtin.h                      |   2 +
 builtin/backfill.c             | 141 +++++
 builtin/pack-objects.c         | 298 ++++++++--
 builtin/repack.c               |  10 +
 builtin/survey.c               | 965 +++++++++++++++++++++++++++++++++
 command-list.txt               |   2 +
 git.c                          |   2 +
 pack-objects.h                 |  20 +
 path-walk.c                    | 401 ++++++++++++++
 path-walk.h                    |  73 +++
 repo-settings.c                |   3 +
 repository.h                   |   1 +
 revision.c                     |  15 +
 revision.h                     |   1 +
 scalar.c                       |   1 +
 t/README                       |   4 +
 t/helper/test-name-hash.c      |  23 +
 t/helper/test-tool.c           |   1 +
 t/helper/test-tool.h           |   1 +
 t/perf/p5313-pack-objects.sh   | 101 ++++
 t/perf/p5314-name-hash.sh      |  41 ++
 t/t5620-backfill.sh            | 181 +++++++
 t/t8100-git-survey.sh          |  76 +++
 28 files changed, 2469 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 Documentation/git-survey.txt
 create mode 100644 builtin/backfill.c
 create mode 100644 builtin/survey.c
 create mode 100644 path-walk.c
 create mode 100644 path-walk.h
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5313-pack-objects.sh
 create mode 100755 t/perf/p5314-name-hash.sh
 create mode 100755 t/t5620-backfill.sh
 create mode 100755 t/t8100-git-survey.sh


base-commit: 17d4b10aea6bda2027047a0e3548a6f8ad667dde
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1786%2Fderrickstolee%2Fpath-walk-rfc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1786/derrickstolee/path-walk-rfc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1786

Junio C Hamano Sept. 11, 2024, 9:32 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> One obvious issue with this current implementation is that it inlines much
> of the delta calculation into the "Enumerate objects" phase, and thus makes
> it single-threaded.

Naïvely, traversal of history partitioned by paths smells
embarrassingly parallelizable; it may need some post processing to
make sure that the same object only appears once, though, and the
devil probably is in the details ;-).

Thanks for an enjoyable cover letter that pulls readers in.

Christian Couder Sept. 17, 2024, 10:41 a.m. UTC | #2

On Tue, Sep 10, 2024 at 4:29 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This RFC is ultimately about introducing a new way to walk objects, called
> the "path-walk API" in the new path-walk.[ch] files. Before digging into the
> details of the API, let's discuss the applications which will hint at the
> API's design.
>
>
> APPLICATIONS OF THE PATH-WALK API
> =================================
>
> The applications of this API were discovered in the following order, though
> I recommend reversing the order for the priority of their actual
> implementation in future patch series:
>
>  * git backfill: a builtin to download missing blobs in a blobless partial
>    clone, done in batches and grouped by the path they appear in to maximize
>    delta compression in each batch. Allows focusing on the paths of the
>    sparse-checkout to only get the blobs necessary for history queries in
>    the current focus.

It's not very clear if this would be useful when doing a `git
sparse-checkout add`, or a `git blame` on a file path not covered by
the current sparse-checkout, or both. I think it would be clearer if
there were a few examples.

>  * git survey: Jeff Hostetler built this feature [1] as a way to get
>    functionality similar to git-sizer [2], but using the internals of Git to
>    do it faster. It also displays information not available to git-sizer,
>    like the on-disk size of objects. This RFC presents a simplified version
>    of the builtin focused on listing the paths that contribute the most to
>    the on-disk size of trees and blobs.

Not sure how `git survey` works, but `git sizer` works on a whole
repo, so, if they work in the same way, I am not sure I see what a new
path oriented way to walk repos would bring to the tool.

>  * git pack-objects --path-walk: In order to find a way to compute deltas
>    among objects of the same path, I applied the path-walk API to 'git
>    pack-objects' behind an optional flag. There are overlaps with the
>    '--sparse' option [3], [4] that can be used here. This provides perfect
>    partitioning by path name, without any possible collisions from the
>    name-hash algorithm. It also allows using the name-hash values to find
>    cross-path delta chains in a second pass.

Do you mean that the new way to walk repos allows pack-object to
perform better in the general case or maybe only in the case where
partial clone without blobs and sparse-checkout are used as described
in the git backfill related point above?

>  * git repack --full-name-hash: If we are worried about name-hsah

s/name-hsah/name-hash/

>    collisions, an easier thing to implement is a different name-hash
>    algorithm that is less likely to have collisions.

Maybe it could help to remind people a bit (perhaps in a note or using
a link) what name-hash collisions are and why we could be worried
about them.

Actually there is a "DISCUSSION OF NAME HASH" section below with
explanations, so maybe a note could just tell about that section.

> This feature was
>    already sent to the mailing list as a fully-reviewable series [5]. It is
>    included here because this series allows testing the --path-walk option
>    against the --full-name-hash.

Do you mean that the new way to walk repos can be used along with `git
repack --full-name-hash` (maybe with `git repack --full-name-hash
--path-walk` or a config option or otherwise?) and that it brings some
performance or other kind (better packs or which ones?) of
improvements?

> [1] https://github.com/microsoft/git/pull/667 [2]
> https://github.com/github/git-sizer [3]
> https://github.com/git/git/compare/5d826e972970a784bd7a7bdf587512510097b8c7...99dbbfa8ddbba2b620965d026d4ec199b8837a6f
> [4]
> https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/
> [5]
> https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com

It looks like adding line breaks could help make the above link list
more readable.

> TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER
> ======================================================
>
> Here's the story about how these applications came about: I was tasked with
> understanding why certain internal repositories were growing larger than
> expected. (Feel free to skip. Otherwise, thank you for indulging me.)
>
> I first prototyped 'git backfill' as a way to download some of the largest
> repositories without being blocked on a full clone. This batched download
> mechanism allowed me to essentially have a retryable clone, since the client
> could restart the process from scratch and skip any objects that were
> already on disk. It was natural to batch based on the path of the blobs in
> order to theoretically save time and network bandwidth due to better delta
> calculations.
>
> While investigating these repositories, I had some data hinting at the total
> size of the objects by type.

By type (blob, tree, commit, tag?) or by path? Why would the size of
the objects by type be interesting? Could trees delta poorly?

> But I was most interested in learning what
> exactly was causing this growth. I did have a hint that the "release"
> branches were taking up much more space than the default branch, since
> cloning with --single-branch resulted in ~55GB of data but then fetching the
> release branches led to an additional ~125GB of data. Using "git diff" it
> was clear that these branches stored some CHANGELOG.json and CHANGELOG.md
> files, but the diffs were relatively small (but multiple such changes were
> batched together into a single new commit). I needed to see why exactly
> these paths were taking up so much space.
>
> To this, I turned to Jeff Hostetler's new "git survey" command. This told me
> information about the total size of trees and blobs and told me the path
> name for the individual blobs that were the largest in the repository.

Nice.

> These
> paths were typically binary files that appeared only once or twice and did
> not account for the scale issues.

You mean they couldn't account for a 55GB to 125GB change in data size?

> I modified 'git survey' to use the same
> path-batching logic as in 'git backfill' to then consider each batch of
> objects and their size. Thus, the "path-walk API" was born. (This RFC
> contains a version that looks like it was created before 'git backfill'.)
>
> The repository I was looking at had a clear pattern in its top 100 file
> paths by on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md
> files. The .md files surprised me, since they were always simple appends of
> the previous .md file at the same path. The .json files were capped in how
> many versions were being listed (at least in recent versions) but the data
> it stored was harder to compress. So I went looking into how 'git push' was
> calculating these delta bases. Adding some debug information to 'git
> pack-objects' demonstrated that the previous file versions were not being
> matched as delta bases. Instead, other new blobs in the push were being used
> as delta bases.

Interesting.

> This meant that what should have been a trivial set of deltas bloated to
> 20-60 MB. (We will see later that it is possible for these to be 100-500
> KB.)
>
> Here is where I went on a little bit of a detour. (Come with me, it's
> important.) I knew that 'git pack-objects' used a name-hash to group objects
> by path, so I assumed that the reason these delta bases were not found was
> because the UNINTERESTING objects were not being added to the packing list.
> I have since discovered that this is incorrect, but I might have gotten
> stuck if I didn't think this.
>
> This seemed like a natural reason to extend the path-walk API to allow
> walking commits and tags as part of 'git pack-objects' behind a new
> '--path-walk' option. The idea here is to compute deltas among the objects
> that share a common path and then later go through the (type, name-hash,
> size) sorting system to find other delta bases across path boundaries. After
> a lot of testing, failing, and testing again, the implementation in this RFC
> finally works to achieve the goal. It's not pretty (especially with how it
> handles tags) but it gets the job done.

Yeah, if it can significantly improve the generated packfiles (without
drawbacks), it looks like a great improvement.

> In hindsight, I realized that the UNINTERESTING objects were being
> considered, but due to collisions in the name-hash algorithm these objects
> were being sorted outside of the delta computation window. For this reason,
> I thought to create a new name-hash algorithm. Thus, the --full-name-hash
> option for 'git pack-objects' and 'git repack' was born. This feature was
> split out and sent to the mailing list independently from this RFC.

So the question is "Is the problem fully solved by the new name-hash
algorithm or are there still benefits to using a new way to walk
repos?"

> RFC GOALS
> =========
>
> The goals of this RFC are:
>
>  1. To demonstrate potential applications of the path-walk API to motivate
>     its generality

Yeah, the "Path-walk API and applications" title summarizes this well.

> as these features are sent in full-quality patch series,
>     but in a different order.

I wonder if the patch series could have been separated and not all
sent in a 30 patch long series.

>  2. To communicate the discoveries found during the --path-walk and
>     --full-name-hash features in 'git pack-objects' and 'git repack'. This
>     includes comparing and contrasting the effectiveness of these features.

Thanks for communicating that.

>  3. To demonstrate the value of the path-based batching in the 'git survey'
>     feature, and to inspire others to think about what other statistics
>     would be valuable in that feature. (I anticipate that once a base is
>     established, multiple contributors will help expand its functionality
>     long into the future.)

Yeah, a better git sizer would be valuable for GitLab too and probably
everyone hosting a significant number of repos.

> RFC OUTLINE
> ===========
>
> The patches are grouped roughly by the application, in order of discovery:
>
>
> PART I: 'git backfill'
> ======================
>
> These patches introduce the 'git backfill' builtin including its
> '--batch-size' and '--sparse' options. While this is the first and simplest
> application, it is also the lowest priority in terms of user need.
>
>  * path-walk: introduce an object walk by path
>  * backfill: add builtin boilerplate
>  * backfill: basic functionality and tests
>  * backfill: add --batch-size= option
>  * backfill: add --sparse option
>  * backfill: assume --sparse when sparse-checkout is enabled

I would prefer a first patch series with all the above and the first
patch creating a technical doc called maybe
Documentation/technical/path-walk.txt which could contain a lot of
information from this RFC and perhaps technical details of how the
path-walk works and how it is different from a regular walk.

> DISCUSSION OF NAME HASH
> =======================
>
> One thing to talk about before digging into --path-walk and --full-name-hash
> features is the existing name-hash algorithm. This hash algorithm creates a
> uint32_t based on the final 16 characters of the path name, weighing the
> last characters more. There are multiple benefits to this:
>
>  1. Files of common types (.c, .txt, ...) may be grouped together.
>
>  2. Files that are renamed across directories may be grouped together.
>
> (Thanks, Junio, for making this second benefit clear.)
>
> The issue here is that some common patterns arise in repositories that use
> common path names across directories, and those files are creating name-hash
> collisions and making the sort less effective. One thing that can counteract
> these collisions is to increase the --window setting, but this significantly
> slows the delta computations.
>
> Thus, the --path-walk and --full-name-hash features both attempt to combat
> these name-hash collisions in very different ways. The --path-walk mechanism
> uses the path-walk API to consider batches of objects that all share the
> same path. This avoids storing every possible path names in memory while
> doing the object walk, but still gives nice boundaries for delta compression
> possibilities. After the path-walk is complete, the full packing list is
> still sorted via name-hash and this allows for cross-path deltas. This is
> critical!
>
> The --full-name-hash feature does the simpler choice of replacing the
> name-hash method with one that has fewer collisions, but loses the benefits
> of "nearby" paths having close hash values.

The benefits of "nearby" paths are only available with --path-walk,
not in the current way object packing works.

> This naturally leads to these two main differences in the two approaches:
>
>  1. The --full-name-hash feature is not good for 'git push' or similarly
>     small pack-files. Since it limits the delta chains to objects with the
>     same full path and loses the benefit of "nearby" paths, this feature
>     should be used for larger repacks. In my testing, 'git push' simulations
>     almost always have poor packing but 'git repack -adf' simulations have
>     packing rivaling the --path-walk option.

Interesting.

>  2. The --path-walk option changes the object ordering significantly,
>     meaning it may not ever be appropriate to combine with advanced
>     repacking features such as delta islands or even reachability bitmaps.
>     While my testing has shown that the --path-walk repacks are the most
>     efficient of all options, this limitation makes me hesitate to recommend
>     it wider than client repositories.

Might still be interesting to have it for client repos.

> One natural question to consider is to think about storing both the
> name-hash and the full-name-hash and doing two delta passes, each one
> sorting the objects by a different hash function. The first issue is that we
> don't want to store two hash values per object, as that will significantly
> increase the memory pressure during repacking.

Is one more uint32_t per object really increasing the memory pressure
so much? Isn't there a way to pack bits together to avoid that?

> This could be side-stepped by
> storing the full-name-hash in the packing list and then a second mapping
> from full-name-hash to name-hash. However, that still leads to the two
> passes taking extra time.

Isn't there a way to sort using both hash functions in a single pass?
(That is to perform a single pass and when considering an object sort
it using both hash functions before considering a different object.)

> The --path-walk approach is faster than even a
> single pass.

Is there a reason for that?

> And in the right scenarios, the --full-name-hash option is very
> close to the --path-walk results.
>
>
> ANALYSIS OF PACKING STRATEGIES
> ==============================
>
> Since I was focused on an internal monorepo that stored a large collection
> of Javascript packages, it should be no surprise that I eventually found
> other Javascript repositories that used similar tooling and thus had similar
> issues with unexplained scale problems. I'll use these four repositories as
> examples repeatedly, but one is actually public: microsoft/fluentui [6].
> I'll use Repo B, C, and D for the others, in increasing size.
>
> [6] https://github.com/microsoft/fluentui
>
> In each of these repositories, doing a full repack ('git repack -adf') right
> after cloning presents a sizeable reduction in space. This is expected for
> servers not being optimized for exactly the reachable set I'm cloning. So,
> I'll focus on how much the default repacking parameters compare to using the
> --full-name-hash or --path-walk options:
>
> | Repo     | Standard Repack | With --full-name-hash | With --path-walk |
> |----------|-----------------|-----------------------|------------------|
> | fluentui |         438 MB  |               168 MB  |          148 MB  |
> | Repo B   |       6,255 MB  |               829 MB  |          778 MB  |
> | Repo C   |      37,737 MB  |             7,125 MB  |        6,158 MB  |
> | Repo D   |     130,049 MB  |             6,190 MB  |        4,432 MB  |
>
>
> Hopefully these reductions show how much these name-hash collisions are
> causing issues in these repositories.

Yeah, right. It might be interesting to see the size reduction in % too.

> For the fluentui repo, I'm also able to share highlights from the 'git
> survey' output immediately after cloning and after repacking with the
> --path-walk option.
>
> First, we can consider the total reachable objects in each scenario:
>
> TOTAL OBJECT SIZES BY TYPE
> ================================================
> Object Type |  Count | Disk Size | Inflated Size
> ------------+--------+-----------+--------------
>     Commits |  20579 |  10443669 |      15092790
>       Trees | 276503 |  40212070 |     244429615
>       Blobs | 294500 | 635365661 |   10791187920
>
> TOTAL OBJECT SIZES BY TYPE
> ================================================
> Object Type |  Count | Disk Size | Inflated Size
> ------------+--------+-----------+--------------
>     Commits |  20579 |  10450605 |      15092790
>       Trees | 276503 |  31136263 |     244429615
>       Blobs | 294500 |  94442401 |   10791187920

Using % of size reduction or increase might help make it a bit clearer here too.

> Now, we can consider the top 10 file paths before and after repacking with
> the --path-walk feature:
>
> TOP FILES BY DISK SIZE
> ===================================================================
>                            Path | Count | Disk Size | Inflated Size
> --------------------------------+-------+-----------+--------------
>                       yarn.lock |   802 |  58060531 |     889120488
>  ...-experiments/CHANGELOG.json |   505 |  28439452 |     252723999
>  ...fabric-react/CHANGELOG.json |  1270 |  25556510 |     902756623
>  ...t-components/CHANGELOG.json |   176 |  20366365 |     244936649
>  ...act-charting/CHANGELOG.json |   590 |  20106422 |     208224460
>  ...e-components/CHANGELOG.json |   559 |  15761271 |     189061764
>  ...act-examples/CHANGELOG.json |   577 |  13615569 |     234949961
>  ...react-charting/CHANGELOG.md |   564 |  11205840 |     104337986
>  .../experiments/CHANGELOG.json |   569 |  10596377 |     123662770
>  ...i-fabric-react/CHANGELOG.md |  1263 |   8154248 |     261494258
>
>
> TOP FILES BY DISK SIZE
> ===================================================================
>                            Path | Count | Disk Size | Inflated Size
> --------------------------------+-------+-----------+--------------
>                       yarn.lock |   802 |   9909326 |     889120488
>  ...iceBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
>  ...out/src/images/download.jpg |     1 |   1845249 |       1846117
>  ...fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
>           .yarn/releases/cli.js |     1 |   1335657 |       6741614
>  ...c/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
>  ...ages/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
>  ...nents/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
>  ...ig/rush/npm-shrinkwrap.json |   138 |   1058531 |      89326567
>  ...Accessibility_29Sep2016.pdf |     1 |    856621 |        927268
>
>
> As we can see from this example, before the repack we are mainly seeing the
> disk space be dominated by objects that appear at paths with CHANGELOG.json
> or CHANGELOG.md, which are frequently hitting collisions with the default
> name-hash. After the repack with the --path-walk feature, we see the largest
> paths are what we expect: checked in binaries, frequently-edited yarn files,
> and other hard-to- compress data.
>
> The nice thing about this result is that we can point to files that are
> taking up space because there is no other way around it, not that the naming
> convention for the files is causing confusion during packing.

Yeah, nice result. I guess the result is the same or very similar when
the improved name-hash algorithm is used.

> WHAT TO DO NOW?
> ===============
>
> Thank you for reading this far. I hope that this has added context on the
> patch series for the --full-name-hash option, but also provided the right
> context for when I come back in a week or so with a review-ready version of
> the --path-walk option.
>
> These patches are rough and I want to make sure everyone knows that.
> Reviewing them for style or even basic organization may lead to some wasted
> time. I know that at least one of the patches got mangled and its diff does
> not match its description (I'm thinking specifically about "pack-objects:
> extract should_attempt_deltas()" but this could apply elsewhere).

Ok, I will wait for the next iteration before reviewing the patches then.

> But I think that interested parties could take my branch, build it, and give
> these features a try on their favorite repos. I'd love to hear feedback on
> the usability and effectiveness, especially if someone finds a case where
> the --path-walk option is less effective at packing data.

I might do that when reviewing the patch series later. I think it can
still be valuable for you to get some first feedback from just reading
this RFC.

Thanks.

Derrick Stolee Sept. 18, 2024, 11:18 p.m. UTC | #3

On 9/17/24 6:41 AM, Christian Couder wrote:
 > On Tue, Sep 10, 2024 at 4:29 AM Derrick Stolee via GitGitGadget
 > <gitgitgadget@gmail.com> wrote:

 >>   * git backfill: a builtin to download missing blobs in a blobless partial
 >>     clone, done in batches and grouped by the path they appear in to maximize
 >>     delta compression in each batch. Allows focusing on the paths of the
 >>     sparse-checkout to only get the blobs necessary for history queries in
 >>     the current focus.
 >
 > It's not very clear if this would be useful when doing a `git
 > sparse-checkout add`, or a `git blame` on a file path not covered by
 > the current sparse-checkout, or both. I think it would be clearer if
 > there were a few examples.

You are correct, this doesn't help with either of those examples, as
they both require blobs outside of the current sparse-checkout. The
idea is for users who are most often in a given sparse-checkout to have
the experience as if they were in a non-partial Git clone, as long as
they restrict their activity within their sparse-checkout.

If they increase their sparse-checkout, then they can re-run 'git
backfill --sparse' to get the still-missing objects in the new paths.

 >>   * git survey: Jeff Hostetler built this feature [1] as a way to get
 >>     functionality similar to git-sizer [2], but using the internals of Git to
 >>     do it faster. It also displays information not available to git-sizer,
 >>     like the on-disk size of objects. This RFC presents a simplified version
 >>     of the builtin focused on listing the paths that contribute the most to
 >>     the on-disk size of trees and blobs.
 >
 > Not sure how `git survey` works, but `git sizer` works on a whole
 > repo, so, if they work in the same way, I am not sure I see what a new
 > path oriented way to walk repos would bring to the tool.

`git survey` also works on a whole repo, but with the path-walk API
implementation can talk about a group of objects that appear at a common
path instead of only talking about single extreme objects, such as one
large binary (that never changes) being singled out over a small file
that chnages frequently and across all versions contributes more to the
repo's disk size.

The main benefit of using `git survey` over `git-sizer` is that it can
report on things internal to Git's storage that are not accessible to
`git-sizer` through the `git cat-file` output it uses.

 >>   * git pack-objects --path-walk: In order to find a way to compute deltas
 >>     among objects of the same path, I applied the path-walk API to 'git
 >>     pack-objects' behind an optional flag. There are overlaps with the
 >>     '--sparse' option [3], [4] that can be used here. This provides perfect
 >>     partitioning by path name, without any possible collisions from the
 >>     name-hash algorithm. It also allows using the name-hash values to find
 >>     cross-path delta chains in a second pass.
 >
 > Do you mean that the new way to walk repos allows pack-object to
 > perform better in the general case or maybe only in the case where
 > partial clone without blobs and sparse-checkout are used as described
 > in the git backfill related point above?

 From what I can see, `git pack-objects --path-walk` outperforms all other
repacking strategies that I have found, but it does have some weaknesses in
how it interacts with things like delta islands. There may be other concerns,
and I _was_ able to find a repo that compressed slightly better with `git
pack-objects --full-name-hash`, but it was also storing computer-generated,
single-line JSON files representing configuration of production systems. I
would like to have a more generic thing to say about this, but it will vary
on data shape.

 >> This feature was
 >>     already sent to the mailing list as a fully-reviewable series [5]. It is
 >>     included here because this series allows testing the --path-walk option
 >>     against the --full-name-hash.
 >
 > Do you mean that the new way to walk repos can be used along with `git
 > repack --full-name-hash` (maybe with `git repack --full-name-hash
 > --path-walk` or a config option or otherwise?) and that it brings some
 > performance or other kind (better packs or which ones?) of
 > improvements?

Combining the two features actually ends up with very similar performance
to what `--full-name-hash` already does. It's actually important that the
`--path-walk` option does a full pass of the objects via the standard
name-hash after its first pass in groups based on the path.

It's more that I include performance tests that compare how effective the
two features are relative to the existing repacking strategy and in a few
different situations. This helps with both time and space performance
measurements.

 >> TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER
 >> ======================================================
 >>
 >> Here's the story about how these applications came about: I was tasked with
 >> understanding why certain internal repositories were growing larger than
 >> expected. (Feel free to skip. Otherwise, thank you for indulging me.)
 >>
 >> I first prototyped 'git backfill' as a way to download some of the largest
 >> repositories without being blocked on a full clone. This batched download
 >> mechanism allowed me to essentially have a retryable clone, since the client
 >> could restart the process from scratch and skip any objects that were
 >> already on disk. It was natural to batch based on the path of the blobs in
 >> order to theoretically save time and network bandwidth due to better delta
 >> calculations.
 >>
 >> While investigating these repositories, I had some data hinting at the total
 >> size of the objects by type.
 >
 > By type (blob, tree, commit, tag?) or by path? Why would the size of
 > the objects by type be interesting? Could trees delta poorly?

The previously-existing data could only group objects by type (commit,
tree, blob) and report on the size of the reachable objects in those
categories. The per-path data was not available as no tool that I knew
about had the capability to expose that information.

 >> These
 >> paths were typically binary files that appeared only once or twice and did
 >> not account for the scale issues.
 >
 > You mean they couldn't account for a 55GB to 125GB change in data size?

The numbers were so small they didn't seem like they could have accounted
for any issues at all, let alone a hundred gigabytes of data in the repo.

 >> RFC GOALS
 >> =========
 >>
 >> The goals of this RFC are:
 >>
 >>   1. To demonstrate potential applications of the path-walk API to motivate
 >>      its generality
 >
 > Yeah, the "Path-walk API and applications" title summarizes this well.
 >
 >> as these features are sent in full-quality patch series,
 >>      but in a different order.
 >
 > I wonder if the patch series could have been separated and not all
 > sent in a 30 patch long series.

I was not clear about this, but the RFC is 30 patches so it's possible to see
the big picture, but I will be breaking it into at least four series in
sequence for actual review. They match the four sections described above, but
will be in the opposite order:

  A. `git repack --full-name-hash`
  B. `git pack-objects --path-walk`
  C. `git survey`
  D. `git backfill`

(It's possible that `git survey` and `git backfill` may be orthogonal enough
that they could be under review at the same time. Alternatively, `git backfill`
may jump the line because it's so simple to implement once the path-walk API
is established.)

 >>   3. To demonstrate the value of the path-based batching in the 'git survey'
 >>      feature, and to inspire others to think about what other statistics
 >>      would be valuable in that feature. (I anticipate that once a base is
 >>      established, multiple contributors will help expand its functionality
 >>      long into the future.)
 >
 > Yeah, a better git sizer would be valuable for GitLab too and probably
 > everyone hosting a significant number of repos.

This was definitely Jeff's intention, and I agree. Hopefully we can establish
a baseline quickly and then make room for extensions as people discover new
ways to investigate repo size or performance issues.

 >> PART I: 'git backfill'
 >> ======================
 >>
 >> These patches introduce the 'git backfill' builtin including its
 >> '--batch-size' and '--sparse' options. While this is the first and simplest
 >> application, it is also the lowest priority in terms of user need.
 >>
 >>   * path-walk: introduce an object walk by path
 >>   * backfill: add builtin boilerplate
 >>   * backfill: basic functionality and tests
 >>   * backfill: add --batch-size= option
 >>   * backfill: add --sparse option
 >>   * backfill: assume --sparse when sparse-checkout is enabled
 >
 > I would prefer a first patch series with all the above and the first
 > patch creating a technical doc called maybe
 > Documentation/technical/path-walk.txt which could contain a lot of
 > information from this RFC and perhaps technical details of how the
 > path-walk works and how it is different from a regular walk.

My rework of the `git pack-objects --path-walk` series has a draft of a
technical document as you suggest [1]. I regret not having time to get
it done before this RFC.

[1] 
https://github.com/derrickstolee/git/pull/28/files#diff-d8a8f04540e5fac6727529dcabf05501ba2447a7de340b540f2601e071b45260

 >> DISCUSSION OF NAME HASH
 >> =======================
...
 >> The --full-name-hash feature does the simpler choice of replacing the
 >> name-hash method with one that has fewer collisions, but loses the benefits
 >> of "nearby" paths having close hash values.
 >
 > The benefits of "nearby" paths are only available with --path-walk,
 > not in the current way object packing works.

I suppose this depends on your definition of "nearby" and I think we are
using it differently.

The --full-name-hash mechanism groups objects by their full path name, with
a small probability of a collision with another path, and thus we can think
about it as grouping objects by their full path. However, this lack of
collision comes at a cost: having unequal but "near" hash value does not
imply any similarity in the full path.

The standard name-hash algorithm has a form of "locality" that provides a
nice property: paths with unequal by "near" hash values will have some
similarities in the end of their path names.

Between these two mechanisms, the collisions of the standard name-hash
can cause objects that should delta together to be separated in the object
order due to too many objects with the same hash value, even though they
have very different full path names. While the full-name-hash mechanism
helps keep objects from the same path close together in the order, it
fails to compute good deltas across objects from different paths, even
if they may end similarly (implying similar file types or even renames
across directories).

The --path-walk feature has the nice benefit that it first attempts to
compute delta bases are grouped to exactly the set of objects that
appear at a given path (no more, no less) but then it _also_ does the
standard name-hash sort to help look for delta bases using the name-hash
locality heuristic.

 >> One natural question to consider is to think about storing both the
 >> name-hash and the full-name-hash and doing two delta passes, each one
 >> sorting the objects by a different hash function. The first issue is that we
 >> don't want to store two hash values per object, as that will significantly
 >> increase the memory pressure during repacking.
 >
 > Is one more uint32_t per object really increasing the memory pressure
 > so much? Isn't there a way to pack bits together to avoid that?

I hesitate to conclude that four bytes per object will not have a meaningful
impact. One thing that I'll be reporting on in my v2 of the --full-name-hash
feature is that I've gone and tried to prototype what would happen with this
kind of approach.

The short version is that I failed to make the addition of more data to this
struct helpful in outperforming the --full-name-hash option. I thought that
a two-pass approach would work, but something about how the two sorts worked
caused problems that I could not overcome. (Maybe someone with a stronger
handle on this could make it work.) My WIP branch is here [2] for anyone
willing to try to succeed where I failed.

[2] 
https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip

 >> This could be side-stepped by
 >> storing the full-name-hash in the packing list and then a second mapping
 >> from full-name-hash to name-hash. However, that still leads to the two
 >> passes taking extra time.
 >
 > Isn't there a way to sort using both hash functions in a single pass?
 > (That is to perform a single pass and when considering an object sort
 > it using both hash functions before considering a different object.)

We could sort the objects using both hash functions (say, name-hash then
full-name-hash to break ties) but then the full-name-hash values keep
the similar-sized objects with the same name-hash from being sorted near
each other (at least, within the short default window, which I use for
all of my testing). So this ends up being ineffective. In [2] above, this
is something I tested along the way and am pretty confident that it does
not work as well as we'd like it to.

 >> The --path-walk approach is faster than even a
 >> single pass.
 >
 > Is there a reason for that?

My guess here is that we are finding the "best" deltas quickly, allowing
us to short-circuit the possibility of other deltas due to file size
differences. (If my object has size X, with a current delta of size D,
then any object smaller than X - D will not be a good base.)

 >> First, we can consider the total reachable objects in each scenario:
 >>
 >> TOTAL OBJECT SIZES BY TYPE
 >> ================================================
 >> Object Type |  Count | Disk Size | Inflated Size
 >> ------------+--------+-----------+--------------
 >>      Commits |  20579 |  10443669 |      15092790
 >>        Trees | 276503 |  40212070 |     244429615
 >>        Blobs | 294500 | 635365661 |   10791187920
 >>
 >> TOTAL OBJECT SIZES BY TYPE
 >> ================================================
 >> Object Type |  Count | Disk Size | Inflated Size
 >> ------------+--------+-----------+--------------
 >>      Commits |  20579 |  10450605 |      15092790
 >>        Trees | 276503 |  31136263 |     244429615
 >>        Blobs | 294500 |  94442401 |   10791187920
 >
 > Using % of size reduction or increase might help make it a bit clearer here too.

True. I could compute that manually, as this data is not available in the
same process. I'll try to do that in the future.

 >> Now, we can consider the top 10 file paths before and after repacking with
 >> the --path-walk feature:
...
 >> The nice thing about this result is that we can point to files that are
 >> taking up space because there is no other way around it, not that the naming
 >> convention for the files is causing confusion during packing.
 >
 > Yeah, nice result. I guess the result is the same or very similar when
 > the improved name-hash algorithm is used.

Yes, similar results happen there. I haven't dug into them very much to see
if it helps point out anything about inefficiencies in --full-name-hash, but
I don't expect any of the big differences between --full-name-hash and
--path-walk to appear in these tables.

 >> WHAT TO DO NOW?
 >> ===============
 >>
 >> Thank you for reading this far. I hope that this has added context on the
 >> patch series for the --full-name-hash option, but also provided the right
 >> context for when I come back in a week or so with a review-ready version of
 >> the --path-walk option.
 >>
 >> These patches are rough and I want to make sure everyone knows that.
 >> Reviewing them for style or even basic organization may lead to some wasted
 >> time. I know that at least one of the patches got mangled and its diff does
 >> not match its description (I'm thinking specifically about "pack-objects:
 >> extract should_attempt_deltas()" but this could apply elsewhere).
 >
 > Ok, I will wait for the next iteration before reviewing the patches then.

Yes, I only meant for them to be available for those who were curious to
explore. Please do review v2 of the --full-name-hash series [3].

[3] https://lore.kernel.org/git/pull.1785.v2.git.1726692381.gitgitgadget@gmail.com/

 >> But I think that interested parties could take my branch, build it, and give
 >> these features a try on their favorite repos. I'd love to hear feedback on
 >> the usability and effectiveness, especially if someone finds a case where
 >> the --path-walk option is less effective at packing data.
 >
 > I might do that when reviewing the patch series later. I think it can
 > still be valuable for you to get some first feedback from just reading
 > this RFC.

Thank you! And thanks for the thoughts on this very long cover letter.

-Stolee

Junio C Hamano Sept. 22, 2024, 6:37 p.m. UTC | #4

Derrick Stolee <stolee@gmail.com> writes:

> Combining the two features actually ends up with very similar performance
> to what `--full-name-hash` already does. It's actually important that the
> `--path-walk` option does a full pass of the objects via the standard
> name-hash after its first pass in groups based on the path.
> ...
> I was not clear about this, but the RFC is 30 patches so it's possible to see
> the big picture, but I will be breaking it into at least four series in
> sequence for actual review. They match the four sections described above, but
> will be in the opposite order:
>
>  A. `git repack --full-name-hash`
>  B. `git pack-objects --path-walk`
>  C. `git survey`
>  D. `git backfill`
>
> (It's possible that `git survey` and `git backfill` may be orthogonal enough
> that they could be under review at the same time. Alternatively, `git backfill`
> may jump the line because it's so simple to implement once the path-walk API
> is established.)

I actually was hoping to hear something like "since it turns out
that --path-walk gives a better performance and it does not regress
small incremental transfer like --full-name-hash does, the real
series drops --full-name hash", i.e. without part (A).  That reduces
things we need to worry about (like having to either keep track of
two "hashes" per object, or making small incremental transfer more
costly) greatly.

Thanks.

Kristoffer Haugsbakk Sept. 22, 2024, 9:08 p.m. UTC | #5

On Tue, Sep 10, 2024, at 04:28, Derrick Stolee via GitGitGadget wrote:
> This RFC is ultimately about introducing a new way to walk objects, called
> the "path-walk API" in the new path-walk.[ch] files. Before digging into the
> details of the API, let's discuss the applications which will hint at the
> API's design.

This series is superbly well-presented. I’m in 
awe here from the peanut gallery.

Derrick Stolee Sept. 23, 2024, 1:22 a.m. UTC | #6

On 9/22/24 2:37 PM, Junio C Hamano wrote:
 > Derrick Stolee <stolee@gmail.com> writes:
 >
 >> Combining the two features actually ends up with very similar performance
 >> to what `--full-name-hash` already does. It's actually important that the
 >> `--path-walk` option does a full pass of the objects via the standard
 >> name-hash after its first pass in groups based on the path.
 >> ...
 >> I was not clear about this, but the RFC is 30 patches so it's possible to see
 >> the big picture, but I will be breaking it into at least four series in
 >> sequence for actual review. They match the four sections described above, but
 >> will be in the opposite order:
 >>
 >>   A. `git repack --full-name-hash`
 >>   B. `git pack-objects --path-walk`
 >>   C. `git survey`
 >>   D. `git backfill`
 >>
 >> (It's possible that `git survey` and `git backfill` may be orthogonal enough
 >> that they could be under review at the same time. Alternatively, `git backfill`
 >> may jump the line because it's so simple to implement once the path-walk API
 >> is established.)
 >
 > I actually was hoping to hear something like "since it turns out
 > that --path-walk gives a better performance and it does not regress
 > small incremental transfer like --full-name-hash does, the real
 > series drops --full-name hash", i.e. without part (A).  That reduces
 > things we need to worry about (like having to either keep track of
 > two "hashes" per object, or making small incremental transfer more
 > costly) greatly.

I believe that the --full-name-hash version still has some benefits, in
that it could better integrate with reachability bitmaps and delta
islands:

  1. The .bitmap file format would need a modification in order to signal
     which hash function is being used for compatibility reasons, but
     this does seem within reach without too much work.

  2. The delta islands feature integrates seamlessly with
     --full-name-hash and seems difficult to integrate with the
     --path-walk feature. Either we would need to have a second object
     walk to get the delta island markers, or somehow put the passing of
     the object markers into the path-walk API itself (similar to how it
     needs to push the UNINTERESTING bit around during the walk).

I'm not recommending any version that requires tracking two hash values
per object, as I have not been able to demonstrate any improvement when
doing so.

But, it would be helpful to know if the --full-name-hash feature should
not be pursued due to the --path-walk feature being prepared shortly
after it. I can see an argument for either direction: having a new hash
algorithm provides a smaller change to get most of the results for the
full repack case, but gets worse performance in many push scenarios.
This is the point of an RFC, to get questions like this worked out based
on the "big picture" view of everything.

Perhaps I should pause the --full-name-hash topic and focus on getting
the --path-walk topic up and running. I am curious to hear from folks
who are currently running Git servers about their thoughts on these
trade-offs and potential uses in their environment. My needs on the
client side are solved by the --path-walk approach.

Thanks,
-Stolee

Junio C Hamano Sept. 23, 2024, 4:56 p.m. UTC | #7

Derrick Stolee <stolee@gmail.com> writes:

> ... I can see an argument for either direction: having a new hash
> algorithm provides a smaller change to get most of the results for the
> full repack case, but gets worse performance in many push scenarios.
> This is the point of an RFC, to get questions like this worked out based
> on the "big picture" view of everything.

Exactly.  We might want to use the series as an example in our
developer docs on how to propose a large-ish effort.

> Perhaps I should pause the --full-name-hash topic and focus on getting
> the --path-walk topic up and running. I am curious to hear from folks
> who are currently running Git servers about their thoughts on these
> trade-offs and potential uses in their environment. My needs on the
> client side are solved by the --path-walk approach.

Yeah, such third-party inputs would be very useful.

Thanks.

[00/30,RFC] Path-walk API and applications

Message

Comments