[00/24] pack-objects: multi-pack verbatim reuse

Message ID: cover.1701198172.git.me@ttaylorr.com
Taylor Blau Nov. 28, 2023, 7:07 p.m. UTC
Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
2013-12-21), we added support for reachability bitmaps, and taught
pack-objects how to reuse verbatim chunks from the bitmapped pack. When
multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
use the MIDX's "preferred" pack as the source for verbatim reuse.

This allows repositories to incrementally repack themselves (e.g., using
a `--geometric` repack), storing the result in a MIDX, and generating a
corresponding bitmap. This keeps our bitmap coverage up-to-date, while
maintaining a relatively small number of packs.
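
In `git repack` terms, that incremental strategy looks something like
the following (a sketch; see git-repack(1) for the details of each
flag):

    # repack incrementally, rolling the results into a MIDX with a
    # corresponding bitmap
    $ git repack -d --geometric=2 --write-midx --write-bitmap-index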

However, it is recommended (and matches what we do in production at
GitHub) that repositories repack themselves all-into-one, and
generate a corresponding single-pack reachability bitmap. This is done
for a couple of reasons, but the most relevant one to this series is
that it enables us to perform verbatim pack-reuse over a complete copy
of the repository, since the entire repository resides in a single pack
(and thus is eligible for verbatim pack-reuse).
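
In the same terms, the all-into-one strategy is roughly:

    # repack everything into a single pack, with a single-pack bitmap
    $ git repack -a -d --write-bitmap-index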

As repositories grow larger, packing their contents into a single pack
becomes less feasible. This series extends the pack-reuse mechanism to
operate over multiple packs which are known ahead of time to be disjoint
with respect to one another's set of objects.

The implementation has a few components (a brief usage sketch follows
this list):

  - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
    to keep track of the bitmap position, number of objects, and
    disjointness of each pack contained in the MIDX.

  - A new mode for `git multi-pack-index write --stdin-packs` that
    allows specifying disjoint packs, as well as a new option
    `--retain-disjoint` which preserves the set of existing disjoint
    packs in the new MIDX.

  - A new pack-objects mode `--ignore-disjoint`, which produces packs
    which are disjoint with respect to the current set of disjoint packs
    (i.e. it discards any objects from the packing list which appear in
    any of the known-disjoint packs).

  - A new repack mode, `--extend-disjoint` which causes any new pack(s)
    which are generated to be disjoint with respect to the set of packs
    currently marked as disjoint, minus any pack(s) which are about to
    be deleted.
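
To illustrate how those pieces fit together, here is a brief sketch
(illustrative only; the patches themselves are authoritative on the
exact syntax, and the `--extend-disjoint` usage is extrapolated from
the description above):

    # write a MIDX and bitmap, marking every pack as disjoint; a '+'
    # prefix on a pack name marks that pack as disjoint
    $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
        git multi-pack-index write --bitmap --stdin-packs

    # later, generate new pack(s) which remain disjoint with respect
    # to the set of packs marked above
    $ git repack -d --extend-disjoint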

With all of that in place, the patch series then rewrites all of the
pack-reuse functions in terms of the new `bitmapped_pack` structure.
Once we have dropped all of the assumptions stemming from only
performing pack-reuse over a single candidate pack, we can then enable
reuse over all of the disjoint packs.

In addition to the many new tests in t5332 added by this series, I tried
to simulate a "real world" test on git.git by breaking the repository
into chunks of 1,000 commits (plus their set of reachable objects not
reachable from earlier chunk(s)) and packing those chunks. This produces
a large number of packs with the objects from git.git which are known to
be disjoint with respect to one another.

    $ git clone git@github.com:git/git.git base

    $ cd base
    $ mv .git/objects/pack/pack-*.idx{,.bak}
    $ git unpack-objects <.git/objects/pack/pack-*.pack

    # pack the objects from each successive block of 1k commits
    $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
      do
        echo $rev |
        git.compile pack-objects --revs --unpacked .git/objects/pack/pack || break
      done
    # and grab any stragglers, pruning the unpacked objects
    $ git repack -d

I then constructed a MIDX and corresponding bitmap:

    $ find_pack () {
        for idx in .git/objects/pack/pack-*.idx
        do
          git show-index <$idx | grep -q "$1" && basename $idx
        done
      }
    $ preferred="$(find_pack $(git rev-parse HEAD))"

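    # the '+' prefix marks each pack as disjoint (the new --stdin-packs
    # input syntax introduced by this series)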
    $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
        git.compile multi-pack-index write --bitmap --stdin-packs \
          --preferred-pack=$preferred
    $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in

With all of that in place, I was able to produce a significant speed-up
by reusing objects from multiple packs:

    $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
    Benchmark 1: single-pack reuse
      Time (mean ± σ):      6.094 s ±  0.023 s    [User: 43.723 s, System: 0.358 s]
      Range (min … max):    6.063 s …  6.126 s    10 runs

    Benchmark 2: multi-pack reuse
      Time (mean ± σ):     906.5 ms ±   3.2 ms    [User: 1081.5 ms, System: 30.9 ms]
      Range (min … max):   903.5 ms … 912.7 ms    10 runs

    Summary
      multi-pack reuse ran
        6.72 ± 0.03 times faster than single-pack reuse

(There are corresponding tests in p5332 that exercise different-sized
chunks and measure both runtime performance and the resulting pack
size.)

Performing verbatim pack reuse naturally trades off between CPU time and
the resulting pack size. In the above example, the single-pack reuse
case produces a clone size of ~194 MB on my machine, while the
multi-pack reuse case produces a clone size closer to ~266 MB, which is
a ~37% increase in clone size.
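
One way to reproduce that kind of size measurement (a sketch, assuming
the same `in` file as above) is to count the bytes of generated pack
output:

    $ git.compile -c pack.allowPackReuse=multi pack-objects --revs \
        --stdout --use-bitmap-index --delta-base-offset <in | wc -c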

I think there is still some opportunity to close this gap, since the
"packing" strategy here is extremely naive. In a production setting, I'm
sure that there are more well thought out repacking strategies that
would produce more similar clone sizes.

I considered breaking this series up into smaller chunks, but was
unsatisfied with the result. Since this series is rather large, if you
have alternate suggestions on better ways to structure this, please let
me know.

Thanks in advance for your review!

Taylor Blau (24):
  pack-objects: free packing_data in more places
  pack-bitmap-write: deep-clear the `bb_commit` slab
  pack-bitmap: plug leak in find_objects()
  midx: factor out `fill_pack_info()`
  midx: implement `DISP` chunk
  midx: implement `midx_locate_pack()`
  midx: implement `--retain-disjoint` mode
  pack-objects: implement `--ignore-disjoint` mode
  repack: implement `--extend-disjoint` mode
  pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  pack-bitmap: return multiple packs via
    `reuse_partial_packfile_from_bitmap()`
  pack-objects: parameterize pack-reuse routines over a single pack
  pack-objects: keep track of `pack_start` for each reuse pack
  pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
  pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
    reuse
  pack-objects: include number of packs reused in output
  pack-bitmap: prepare to mark objects from multiple packs for reuse
  pack-objects: add tracing for various packfile metrics
  t/test-lib-functions.sh: implement `test_trace2_data` helper
  pack-objects: allow setting `pack.allowPackReuse` to "single"
  pack-bitmap: reuse objects from all disjoint packs
  t/perf: add performance tests for multi-pack reuse

 Documentation/config/pack.txt          |   8 +-
 Documentation/git-multi-pack-index.txt |  12 ++
 Documentation/git-pack-objects.txt     |   8 +
 Documentation/git-repack.txt           |  12 ++
 Documentation/gitformat-pack.txt       | 109 ++++++++++
 builtin/multi-pack-index.c             |  13 +-
 builtin/pack-objects.c                 | 200 +++++++++++++++----
 builtin/repack.c                       |  57 +++++-
 midx.c                                 | 218 +++++++++++++++++---
 midx.h                                 |  11 +-
 pack-bitmap-write.c                    |   9 +-
 pack-bitmap.c                          | 265 ++++++++++++++++++++-----
 pack-bitmap.h                          |  18 +-
 pack-objects.c                         |  15 ++
 pack-objects.h                         |   1 +
 t/helper/test-read-midx.c              |  31 ++-
 t/lib-disjoint.sh                      |  49 +++++
 t/perf/p5332-multi-pack-reuse.sh       |  81 ++++++++
 t/t5319-multi-pack-index.sh            | 140 +++++++++++++
 t/t5331-pack-objects-stdin.sh          | 156 +++++++++++++++
 t/t5332-multi-pack-reuse.sh            | 219 ++++++++++++++++++++
 t/t6113-rev-list-bitmap-filters.sh     |   2 +
 t/t7700-repack.sh                      |   4 +-
 t/t7705-repack-extend-disjoint.sh      | 142 +++++++++++++
 t/test-lib-functions.sh                |  14 ++
 25 files changed, 1650 insertions(+), 144 deletions(-)
 create mode 100644 t/lib-disjoint.sh
 create mode 100755 t/perf/p5332-multi-pack-reuse.sh
 create mode 100755 t/t5332-multi-pack-reuse.sh
 create mode 100755 t/t7705-repack-extend-disjoint.sh


base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d

Comments

Patrick Steinhardt Nov. 30, 2023, 10:18 a.m. UTC | #1
On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:
[snip]
> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.

Quite exciting, and a tradeoff that may be worth it for Git hosters. I
expect that this is going to be an extreme example of the benefits
provided by your patch series -- do you by any chance also have "real"
numbers that make it possible to quantify the effect a bit better?

No worry if you don't, I'm just curious.

> I think there is still some opportunity to close this gap, since the
> "packing" strategy here is extremely naive. In a production setting, I'm
> sure that there are more well thought out repacking strategies that
> would produce more similar clone sizes.
> 
> I considered breaking this series up into smaller chunks, but was
> unsatisfied with the result. Since this series is rather large, if you
> have alternate suggestions on better ways to structure this, please let
> me know.

The series is indeed very involved to review. I only made it up to patch
8/24 and already spent quite some time on it. So I'd certainly welcome
it if this was split up into smaller parts, but don't have a suggestion
as to how this should be done (also because I didn't yet read the other
16 patches).

I'll review the remaining patches at a later point in time.

Patrick
Taylor Blau Nov. 30, 2023, 7:39 p.m. UTC | #2
On Thu, Nov 30, 2023 at 11:18:19AM +0100, Patrick Steinhardt wrote:
> > Performing verbatim pack reuse naturally trades off between CPU time and
> > the resulting pack size. In the above example, the single-pack reuse
> > case produces a clone size of ~194 MB on my machine, while the
> > multi-pack reuse case produces a clone size closer to ~266 MB, which is
> > a ~37% increase in clone size.
>
> Quite exciting, and a tradeoff that may be worth it for Git hosters. I
> expect that this is going to be an extreme example of the benefits
> provided by your patch series -- do you by any chance also have "real"
> numbers that make it possible to quantify the effect a bit better?
>
> No worry if you don't, I'm just curious.

I don't have a great sense, no. I haven't run these patches in
production yet, although I would like to do so soon for internal
repositories to get a better sense here.

There are some performance tests at the end which try to give you a
sense of at least the relative speed-up depending on how many disjoint
packs you have (IIRC, we test with 1, 10, and 100 disjoint packs).
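
Those tests live in t/perf/p5332-multi-pack-reuse.sh; one way to run
them from a built tree is:

    $ cd t/perf
    $ ./p5332-multi-pack-reuse.sh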

> > I think there is still some opportunity to close this gap, since the
> > "packing" strategy here is extremely naive. In a production setting, I'm
> > sure that there are more well thought out repacking strategies that
> > would produce more similar clone sizes.
> >
> > I considered breaking this series up into smaller chunks, but was
> > unsatisfied with the result. Since this series is rather large, if you
> > have alternate suggestions on better ways to structure this, please let
> > me know.
>
> The series is indeed very involved to review. I only made it up to patch
> 8/24 and already spent quite some time on it. So I'd certainly welcome
> it if this was split up into smaller parts, but don't have a suggestion
> as to how this should be done (also because I didn't yet read the other
> 16 patches).

I suppose that one way to break it up might be:

    pack-objects: free packing_data in more places
    pack-bitmap-write: deep-clear the `bb_commit` slab
    pack-bitmap: plug leak in find_objects()

    midx: factor out `fill_pack_info()`
    midx: implement `DISP` chunk
    midx: implement `midx_locate_pack()`
    midx: implement `--retain-disjoint` mode

    pack-objects: implement `--ignore-disjoint` mode
    repack: implement `--extend-disjoint` mode

    pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
    pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
    pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
    pack-objects: parameterize pack-reuse routines over a single pack
    pack-objects: keep track of `pack_start` for each reuse pack
    pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
    pack-objects: prepare `write_reused_pack()` for multi-pack reuse
    pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
    pack-objects: include number of packs reused in output
    pack-bitmap: prepare to mark objects from multiple packs for reuse

    pack-objects: add tracing for various packfile metrics
    t/test-lib-functions.sh: implement `test_trace2_data` helper
    pack-objects: allow setting `pack.allowPackReuse` to "single"
    pack-bitmap: reuse objects from all disjoint packs
    t/perf: add performance tests for multi-pack reuse

Then you'd have five patch series, where each series does roughly the
following:

  1. Preparatory clean-up.
  2. Implementing the DISP chunk, as well as --retain-disjoint, without
     a way to generate such packs.
  3. Implement a way to generate such packs, but without actually being
     able to reuse more than one of them.
  4. Implement multi-pack reuse, but without actually reusing any packs.
  5. Enable multi-pack reuse (and implement the required scaffolding to
     do so), and test it.

That's the most sensible split that I could come up with, at least. But
I still find it relatively unsatisfying for a couple of reasons:

  - With the exception of the last group of patches, none of the earlier
    series enable any new, useful behavior on their own. IOW, if we just
    merged the first three series and then forgot about this topic, we
    wouldn't have done anything useful ;-).

  - The fourth series (which actually implements multi-pack reuse) still
    remains the most complicated, and would likely be the most difficult
    to review. So I think you'd still have one difficult series to
    review, plus four other series which are slightly less difficult to
    review ;-).

It's entirely possible that I'm just too close to these patches to see a
better split, so if you (or others) have any suggestions on how to break
this up, please don't hesitate to share them.

> I'll review the remaining patches at a later point in time.

Thanks, I'll look forward to more of your review as usual :-).

Thanks,
Taylor
Patrick Steinhardt Dec. 1, 2023, 8:31 a.m. UTC | #3
On Thu, Nov 30, 2023 at 02:39:41PM -0500, Taylor Blau wrote:
> On Thu, Nov 30, 2023 at 11:18:19AM +0100, Patrick Steinhardt wrote:
[snip]
> Then you'd have five patch series, where each series does roughly the
> following:
> 
>   1. Preparatory clean-up.
>   2. Implementing the DISP chunk, as well as --retain-disjoint, without
>      a way to generate such packs.
>   3. Implement a way to generate such packs, but without actually being
>      able to reuse more than one of them.
>   4. Implement multi-pack reuse, but without actually reusing any packs.
>   5. Enable multi-pack reuse (and implement the required scaffolding to
>      do so), and test it.
> 
> That's the most sensible split that I could come up with, at least.

Looks sensible to me.

> But
> I still find it relatively unsatisfying for a couple of reasons:
> 
>   - With the exception of the last group of patches, none of the earlier
>     series enable any new, useful behavior on their own. IOW, if we just
>     merged the first three series and then forgot about this topic, we
>     wouldn't have done anything useful ;-).

Well, sometimes I wish we'd buy more into the iterative style of working
in the Git project, where it's fine to land patch series that only work
toward a specific topic but don't yet do anything interesting by
themselves. The benefits would be that individual pieces land faster and
that the review load is kept manageable.

But there's of course also downsides to this, especially in an open
source project like Git:

  - Contributors may go away in the middle of their endeavour, leaving
    behind dangling pieces that might have complicated some of our
    architecture without actually reaping the intended benefits.

  - Overall, it may take significantly longer to get all pieces into
    Git.

  - Contributors need to do a much better job of explaining where they
    are headed when the series by itself doesn't do anything interesting
    yet.

So I understand why we typically don't work this way.

I wonder whether a compromise would be to continue sending complete
patch series, but explicitly point out "break points" in your patch
series. These break points could be an indicator to the maintainer that
it's fine to merge everything up to it while still keeping out the
remainder of the patch series.

>   - The fourth series (which actually implements multi-pack reuse) still
>     remains the most complicated, and would likely be the most difficult
>     to review. So I think you'd still have one difficult series to
>     review, plus four other series which are slightly less difficult to
>     review ;-).

That's fine in my opinion, there's no surprise here that a complicated
topic is demanding for both the author and the reviewer.

Patrick
Taylor Blau Dec. 1, 2023, 8:02 p.m. UTC | #4
On Fri, Dec 01, 2023 at 09:31:14AM +0100, Patrick Steinhardt wrote:
> > But
> > I still find it relatively unsatisfying for a couple of reasons:
> >
> >   - With the exception of the last group of patches, none of the earlier
> >     series enable any new, useful behavior on their own. IOW, if we just
> >     merged the first three series and then forgot about this topic, we
> >     wouldn't have done anything useful ;-).
>
> Well, sometimes I wish we'd buy more into the iterative style of working
> in the Git project, where it's fine to land patch series that only work
> into the direction of a specific topic but don't yet do anything
> interesting by themselves. The benefits would be both that individual
> pieces land faster while also ensuring that the review load is kept at
> bay.
>
> But there's of course also downsides to this, especially in an open
> source project like Git:

I tend to agree with the downsides you list. My biggest concern with
this series in particular is that we're trying to break down an
all-or-nothing change into smaller components. So if we landed four out
of the five series and stopped there, we may as well have landed none of
them, since the first four aren't really all that useful on their own.

I suppose if we're relatively confident that the last series will be
merged eventually, then that seems like less of a concern. But I'm not
sure that we're at that point yet.

> I wonder whether a compromise would be to continue sending complete
> patch series, but explicitly point out "break points" in your patch
> series. These break points could be an indicator to the maintainer that
> it's fine to merge everything up to it while still keeping out the
> remainder of the patch series.

I think that's a reasonable alternative. It does create a minor headache
for the maintainer[^1], though, so I'd like to avoid it if possible.

> >   - The fourth series (which actually implements multi-pack reuse) still
> >     remains the most complicated, and would likely be the most difficult
> >     to review. So I think you'd still have one difficult series to
> >     review, plus four other series which are slightly less difficult to
> >     review ;-).
>
> That's fine in my opinion, there's no surprise here that a complicated
> topic is demanding for both the author and the reviewer.

My preference is to avoid splitting the series if we can help it. But if
you feel strongly, or others feel similarly, I'm happy to take another
crack at breaking it up. Thanks for all of your review so far!

Thanks,
Taylor
Patrick Steinhardt Dec. 4, 2023, 8:49 a.m. UTC | #5
On Fri, Dec 01, 2023 at 03:02:22PM -0500, Taylor Blau wrote:
> On Fri, Dec 01, 2023 at 09:31:14AM +0100, Patrick Steinhardt wrote:
[snip]
> I suppose if we're relatively confident that the last series will be
> merged eventually, then that seems like less of a concern. But I'm not
> sure that we're at that point yet.

That's an additional valid concern indeed.

[snip]
> > >   - The fourth series (which actually implements multi-pack reuse) still
> > >     remains the most complicated, and would likely be the most difficult
> > >     to review. So I think you'd still have one difficult series to
> > >     review, plus four other series which are slightly less difficult to
> > >     review ;-).
> >
> > That's fine in my opinion, there's no surprise here that a complicated
> > topic is demanding for both the author and the reviewer.
> 
> My preference is to avoid splitting the series if we can help it. But if
> you feel strongly, or others feel similarly, I'm happy to take another
> crack at breaking it up. Thanks for all of your review so far!

I don't feel strongly about this at all; I've only tried to spell out my
own thoughts in this context as I thought they were kind of relevant
here. I've thought quite a lot about this topic recently due to my work
on the reftable backend, where I'm trying to get as many pieces as
possible landed individually before landing the actual backend itself.
It's working well for the most part, but in other contexts it's a bit
weird that we cater to something that doesn't exist yet. But naturally,
the reftable work is of a different nature than the topic you work on
here, and thus my own takeaways may well not apply here.

To summarize, I think there is merit in splitting up patches into chunks
that make it in individually and thus gradually work toward a topic, but
I also totally understand why you (or Junio as the maintainer) might
think that this is not a good idea. The ultimate decision for how to
handle topics should rest with the patch series' author and maintainer.

Patrick
Jeff King Dec. 12, 2023, 8:12 a.m. UTC | #6
On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:

> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.

Right, it's definitely a tradeoff. So taking a really big step back,
there are a few optimizations all tied up in the verbatim reuse code:

  1. in some cases we get to dump whole swaths of the on-disk packfile
     to the output, covering many objects with a few memcpy() calls.
     (This is still O(n), of course, but it's fewer instructions per
     object).

  2. any other reused objects have only a small-ish amount of work to
     fix up ofs deltas, handle gaps, and so on. We get to skip adding
     them to the packing_list struct (this saves some CPU, but also a
     lot of memory)

  3. we skip the delta search for these reused objects. This is where
     your big CPU / output size tradeoff comes into play, I'd think.

So my question is: how much of what you're seeing is from (1) and (2),
and how much is from (3)? Because there are other ways to trigger (3),
such as lowering the window size. For example, if you try your same
packing example with --window=0, how do the CPU and output size compare
to the results of your series? (I'd also check peak memory usage).

-Peff
Taylor Blau Dec. 15, 2023, 3:37 p.m. UTC | #7
On Tue, Dec 12, 2023 at 03:12:38AM -0500, Jeff King wrote:
> So my question is: how much of what you're seeing is from (1) and (2),
> and how much is from (3)? Because there are other ways to trigger (3),
> such as lowering the window size. For example, if you try your same
> packing example with --window=0, how do the CPU and output size compare
> to the results of your series? (I'd also check peak memory usage).

Interesting question! Here are some preliminary numbers on my machine
(which runs Debian unstable with an Intel Xeon W-2255 CPU @ 3.70GHz and
64GB of RAM).

I ran the following hyperfine command on my testing repository, which
has the Git repository broken up into ~75 packs:

    $ hyperfine -L v single,multi -L window 0,10 \
      --show-output \
      -n '{v}-pack reuse, pack.window={window}' \
      'git.compile \
        -c pack.allowPackReuse={v} \
        -c pack.window={window} \
        pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in 2>/dev/null | wc -c'

Which gave the following results for runtime:

    Benchmark 1: single-pack reuse, pack.window=0
    [...]
      Time (mean ± σ):      1.248 s ±  0.004 s    [User: 1.160 s, System: 0.188 s]
      Range (min … max):    1.244 s …  1.259 s    10 runs

    Benchmark 2: multi-pack reuse, pack.window=0
    [...]
      Time (mean ± σ):      1.075 s ±  0.005 s    [User: 0.990 s, System: 0.188 s]
      Range (min … max):    1.071 s …  1.088 s    10 runs

    Benchmark 3: single-pack reuse, pack.window=10
    [...]
      Time (mean ± σ):      6.281 s ±  0.024 s    [User: 43.727 s, System: 0.492 s]
      Range (min … max):    6.252 s …  6.326 s    10 runs

    Benchmark 4: multi-pack reuse, pack.window=10
    [...]
      Time (mean ± σ):      1.028 s ±  0.002 s    [User: 1.150 s, System: 0.184 s]
      Range (min … max):    1.026 s …  1.032 s    10 runs

Here are the average numbers for the resulting "clone" size in each of
the above configurations:

    Benchmark 1: single-pack reuse, pack.window=0
    264.443 MB
    Benchmark 2: multi-pack reuse, pack.window=0
    268.670 MB
    Benchmark 3: single-pack reuse, pack.window=10
    194.355 MB
    Benchmark 4: multi-pack reuse, pack.window=10
    266.473 MB

So it looks like setting pack.window=0 (with both single and multi-pack
reuse) yields a similarly sized pack output as multi-pack reuse with any
window setting.

Running the same benchmark as above again, but this time sending the
pack output to /dev/null and instead capturing the maximum RSS value
from `/usr/bin/time -v` gives us the following (averages, in MB):

    Benchmark 1: single-pack reuse, pack.window=0
    354.224 MB (max RSS)
    Benchmark 2: multi-pack reuse, pack.window=0
    315.730 MB (max RSS)
    Benchmark 3: single-pack reuse, pack.window=10
    470.651 MB (max RSS)
    Benchmark 4: multi-pack reuse, pack.window=10
    328.786 MB (max RSS)
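
Concretely, each of those measurements looks something like the
following (illustrative; just one configuration shown):

    $ /usr/bin/time -v git.compile -c pack.allowPackReuse=multi \
        -c pack.window=10 pack-objects --revs --stdout \
        --use-bitmap-index --delta-base-offset <in \
        2>&1 >/dev/null | grep 'Maximum resident'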

So memory usage is similar between runs except for the single-pack reuse
case with a window size of 10.

It looks like the sweet spot is probably multi-pack reuse with a
small-ish window size, which achieves the best of both worlds (small
pack size, relative to other configurations that reuse large portions of
the pack, and low memory usage).

It's pretty close between multi-pack reuse with a window size of 0 and
a window size of 10. If you want to optimize for pack size, you could
trade a ~1% reduction in pack size for a ~4% increase in peak memory
usage.

Of course, YMMV depending on the repository, packing strategy, and
pack.window configuration (among others) while packing. But this should
give you a general idea of what to expect.

Thanks,
Taylor
Jeff King Dec. 21, 2023, 11:13 a.m. UTC | #8
On Fri, Dec 15, 2023 at 10:37:57AM -0500, Taylor Blau wrote:

> On Tue, Dec 12, 2023 at 03:12:38AM -0500, Jeff King wrote:
> > So my question is: how much of what you're seeing is from (1) and (2),
> > and how much is from (3)? Because there are other ways to trigger (3),
> > such as lowering the window size. For example, if you try your same
> > packing example with --window=0, how do the CPU and output size compare
> > to the results of your series? (I'd also check peak memory usage).
> 
> Interesting question! Here are some preliminary numbers on my machine
> (which runs Debian unstable with a Intel Xenon W-2255 CPU @ 3.70GHz and
> 64GB of RAM).
> 
> I ran the following hyperfine command on my testing repository, which
> has the Git repository broken up into ~75 packs or so:

Thanks for running these tests. The results are similar to what I
expected, which is: yes, most of your CPU savings are from skipping
deltas, but not all.

Here's what I see (which I think is mostly redundant with what you've
said, but I just want to lay out my line of thinking). I'll reorder your
quoted sections a bit as I go:

>     Benchmark 2: multi-pack reuse, pack.window=0
>     [...]
>       Time (mean ± σ):      1.075 s ±  0.005 s    [User: 0.990 s, System: 0.188 s]
>       Range (min … max):    1.071 s …  1.088 s    10 runs
>
>     Benchmark 4: multi-pack reuse, pack.window=10
>     [...]
>       Time (mean ± σ):      1.028 s ±  0.002 s    [User: 1.150 s, System: 0.184 s]
>       Range (min … max):    1.026 s …  1.032 s    10 runs

OK, so when we're doing more full ("multi") reuse, the pack window
doesn't make a big difference either way. You didn't show the stderr
from each, but presumably most of the objects are hitting the "reuse"
path, and only a few are deltas (and that is backed up by the fact that
doing deltas only gives us a slight improvement in the output size):

>     Benchmark 2: multi-pack reuse, pack.window=0
>     268.670 MB
>     Benchmark 4: multi-pack reuse, pack.window=10
>     266.473 MB

Comparing the runs with less reuse:

>     Benchmark 1: single-pack reuse, pack.window=0
>     [...]
>       Time (mean ± σ):      1.248 s ±  0.004 s    [User: 1.160 s, System: 0.188 s]
>       Range (min … max):    1.244 s …  1.259 s    10 runs
>
>     Benchmark 3: single-pack reuse, pack.window=10
>     [...]
>       Time (mean ± σ):      6.281 s ±  0.024 s    [User: 43.727 s, System: 0.492 s]
>       Range (min … max):    6.252 s …  6.326 s    10 runs

there obviously is a huge amount of time saved by not doing deltas, but
we pay for it with a much bigger pack:

>     Benchmark 1: single-pack reuse, pack.window=0
>     264.443 MB
>     Benchmark 3: single-pack reuse, pack.window=10
>     194.355 MB

But of course that "much bigger" pack is about the same size as the one
we get from doing multi-pack reuse. Which is not surprising, because
both are avoiding looking for new deltas (and the packs after the
preferred one probably have mediocre deltas).

So I do actually think that disabling pack.window gives you a
similar-ish tradeoff to expanding the pack-reuse code (~6s down to ~1s,
and a 36% embiggening of the resulting pack size).

Which implies that one option is to scrap your entire series and just
set pack.window. Basically comparing multi/10 (your patches) to single/0
(a hypothetical config option), which have similar run-times and pack
sizes.

But that's not quite the whole story. There is still a CPU improvement
in your series (1.2s vs 1.0s, a 20% speedup). And as I'd expect, a
memory improvement from avoiding the extra book-keeping (almost 10%):

>     Benchmark 1: single-pack reuse, pack.window=0
>     354.224 MB (max RSS)
>     Benchmark 4: multi-pack reuse, pack.window=10
>     328.786 MB (max RSS)

So while it's a lot less code to just set the window size, I do think
those improvements are worth it. And really, it's the same tradeoff we
make for the single-pack case (i.e., one could argue that we
could/should rip out the verbatim-reuse code entirely in favor of just
tweaking the window size).

> It's pretty close between multi-pack reuse with a window size of 0 and
> a window size of 10. If you want to optimize for pack size, you could
> trade a ~1% reduction in pack size for a ~4% increase in peak memory
> usage.

I think if you want to optimize for pack size, you should consider
repacking all-into-one to get better on-disk deltas. ;) I know that's
easier said than done when the I/O costs are significant. I do wonder if
storing thin packs on disk would let us more cheaply reach a state that
could serve optimal-ish packs without spending CPU computing bespoke
deltas for each client. But that's a much larger topic.

-Peff
Taylor Blau Jan. 4, 2024, 10:22 p.m. UTC | #9
On Thu, Dec 21, 2023 at 06:13:33AM -0500, Jeff King wrote:
> But that's not quite the whole story. There is still a CPU improvement
> in your series (1.2s vs 1.0s, a 20% speedup). And as I'd expect, a
> memory improvement from avoiding the extra book-keeping (almost 10%):
>
> >     Benchmark 1: single-pack reuse, pack.window=0
> >     354.224 MB (max RSS)
> >     Benchmark 4: multi-pack reuse, pack.window=10
> >     328.786 MB (max RSS)

I agree. And I expect that we'd see larger savings on larger, real-world
repositories (the numbers here are generated from a semi out-of-date
copy of git.git).

> So while it's a lot less code to just set the window size, I do think
> those improvements are worth it. And really, it's the same tradeoff we
> make for the single-pack case (i.e., one could argue that we
> could/should rip out the verbatim-reuse code entirely in favor of just
> tweaking the window size).

Definitely an interesting direction. One question that comes to mind is
whether or not we'd want to keep the "verbatim" reuse code around to
avoid adding objects to the packing list directly. That's a
non-negligible memory savings when generating large packs where we can
reuse large swaths of existing packs.

That seems like a topic for another day, though ;-). Not quite
left-over-bits, so maybe... #leftoverboulders?

Thanks,
Taylor