[0/9] Repack objects into separate packfiles based on a filter

Message ID 20230614192541.1599256-1-christian.couder@gmail.com (mailing list archive)

Message

Christian Couder June 14, 2023, 7:25 p.m. UTC
# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` option removed the
filtered out objects altogether, which was considered very dangerous
even though we implemented different safety checks in some of the
later series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

This could be useful for the following purposes:

  - As a way for servers to save storage costs by for example moving
    large blobs, or blobs in inactive repos, to separate storage
    (while still making them accessible using for example the
    alternates mechanism).

  - As a way to use partial clone on a Git server to offload large
    blobs to, for example, an http server, while using multiple
    promisor remotes (to be able to access everything) on the client
    side. (In this case the packfile that contains the filtered out
    objects can be manually removed after checking that all the objects
    it contains are available through the promisor remote.)

  - As a way for clients to reclaim some space when they cloned with a
    filter to save disk space but then fetched a lot of unwanted
    objects (for example when checking out old branches) and now want
    to remove these unwanted objects. (In this case they can first
    move the packfile that contains filtered out objects to a separate
    directory or storage, then check that everything works well, and
    then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

# Commit overview

* 1/9 pack-objects: allow `--filter` without `--stdout`

  This patch is the same as the first patch in the previous series. To
  be able to later repack with a filter we need `git pack-objects` to
  write packfiles when it's filtering instead of just writing the pack
  without the filtered out objects to stdout.

* 2/9 pack-objects: add `--print-filtered` to print omitted objects

  We need a way to know the objects that are filtered out of the
  packfile generated by `git pack-objects --filter=<filter-spec>`. The
  simplest way is to teach pack-objects to print their oids to stdout.

* 3/9 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object.

* - 4/9 repack: refactor piping an oid to a command
  - 5/9 repack: refactor finishing pack-objects command

  These are small refactorings so that `git repack --filter=...` will
  be able to reuse useful existing functions.

* 6/9 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with both the `--filter` and the
  `--print-filtered` options. From this process it reads the oids of
  the filtered out objects and passes them to a separate `git
  pack-objects` process which will pack these objects into a separate
  packfile.

* 7/9 gc: add `gc.repackFilter` config option

  This is a gc config option so that `git gc` can also repack using a
  filter and put the filtered out objects into a separate packfile.

* 8/9 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's useful to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles.

* 9/9 gc: add `gc.repackFilterTo` config option

  This allows specifying the location of the packfile that contains
  the filtered out objects when using `gc.repackFilter`.
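
The two-process flow described for patch 6/9 can be modelled in a few
lines of Python. This is only a rough sketch of the idea, not the code
in builtin/repack.c: the `Obj` class and both function names are made
up for illustration, and the filter shown assumes the documented
meaning of `blob:limit=<n>` (blobs of at least <n> bytes are omitted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str   # hypothetical short object id
    kind: str  # "commit", "tree" or "blob"
    size: int  # size in bytes


def blob_limit_filter(limit):
    """Return a predicate that is True for objects the filter omits."""
    def omitted(obj):
        return obj.kind == "blob" and obj.size >= limit
    return omitted


def repack_with_filter(objects, omitted):
    """Split objects into (main_pack, filtered_pack) lists of oids.

    The loop stands in for the first pack-objects process, which packs
    what passes the filter and reports the oids it left out. The second
    list stands in for the pack written from those reported oids.
    """
    main_pack, filtered_oids = [], []
    for obj in objects:
        if omitted(obj):
            filtered_oids.append(obj.oid)  # what --print-filtered would print
        else:
            main_pack.append(obj.oid)
    filtered_pack = list(filtered_oids)  # second process packs exactly these
    return main_pack, filtered_pack
```

The point of the model is that no object is dropped: every oid ends up
in exactly one of the two packs.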


Christian Couder (9):
  pack-objects: allow `--filter` without `--stdout`
  pack-objects: add `--print-filtered` to print omitted objects
  t/helper: add 'find-pack' test-tool
  repack: refactor piping an oid to a command
  repack: refactor finishing pack-objects command
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  11 ++
 Documentation/git-pack-objects.txt     |  14 ++-
 Documentation/git-repack.txt           |  11 ++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |  55 ++++++--
 builtin/repack.c                       | 166 ++++++++++++++++++-------
 t/helper/test-find-pack.c              |  35 ++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t5317-pack-objects-filter-objects.sh |  27 ++++
 t/t6500-gc.sh                          |  23 ++++
 t/t7700-repack.sh                      |  43 +++++++
 13 files changed, 345 insertions(+), 53 deletions(-)
 create mode 100644 t/helper/test-find-pack.c

Comments

Junio C Hamano June 14, 2023, 9:36 p.m. UTC | #1
Christian Couder <christian.couder@gmail.com> writes:

> In some discussions, it was mentioned that such a feature, or a
> similar feature in `git gc`, or in a new standalone command (perhaps
> called `git prune-filtered`), should put the filtered out objects into
> a new packfile instead of deleting them.
>
> Recently there were internal discussions at GitLab about either moving
> blobs from inactive repos onto cheaper storage, or moving large blobs
> onto cheaper storage. This lead us to rethink at repacking using a
> filter, but moving the filtered out objects into a separate packfile
> instead of deleting them.
>
> So here is a new patch series doing that while implementing the
> `--filter=<filter-spec>` option in `git repack`.

Very interesting idea, indeed, and would be very useful.
Thanks.

Junio C Hamano June 16, 2023, 3:08 a.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Christian Couder <christian.couder@gmail.com> writes:
>
>> In some discussions, it was mentioned that such a feature, or a
>> similar feature in `git gc`, or in a new standalone command (perhaps
>> called `git prune-filtered`), should put the filtered out objects into
>> a new packfile instead of deleting them.
>>
>> Recently there were internal discussions at GitLab about either moving
>> blobs from inactive repos onto cheaper storage, or moving large blobs
>> onto cheaper storage. This lead us to rethink at repacking using a
>> filter, but moving the filtered out objects into a separate packfile
>> instead of deleting them.
>>
>> So here is a new patch series doing that while implementing the
>> `--filter=<filter-spec>` option in `git repack`.
>
> Very interesting idea, indeed, and would be very useful.
> Thanks.

Overall, I have a split feeling on the series.

One side of my brain thinks that the series does a very good job of
addressing the needs of those who want to partition their objects into
two classes, and the problem I saw in the series was mostly the way
it was sold (in other words, if it did not mention unbloating lazily
cloned repositories at all, I would have said "Yes!  It is an
excellent series.", and if it said "this mechanism is not meant to
be used to unbloat a lazily cloned repository, because the mechanism
does not distinguish objects that are only locally available and
objects that are retrievable from the promisor remotes, among those
that match the filter", it would have been even better).

To the other side of my brain, it smells as if the series wanted to
address the unbloating issue, but ended up with an unsatisfactory
solution, and used "partitioning objects in a full repository on the
server side" as an excuse for the resulting mechanism to still
exist, even though it is not usable for the original purpose.

Ideally, it would be great to have a mechanism that can be used for
both.  The "partitioning" can be treated as a degenerate case where
the repository does not have its upstream promisor (hence, any
object that matches the filtering criteria can be excluded from the
primary pack because there are no "not available (yet) in our
promisor" objects), while the "unbloat" case can know who its
promisors are and ask the promisors what objects, among those that
match the filtering criteria, are still available from them to
exclude only those objects from the primary pack.
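
To make the relationship between the two cases concrete, here is a
minimal Python sketch of the selection rule described above. It is an
illustration only: the function name is hypothetical, and how a
repository would actually ask its promisors what they still have is
not specified anywhere in this thread, so a plain set of oids stands
in for their answer.

```python
def objects_to_exclude(matching_oids, promisor_oids=None):
    """Return the oids that may leave the primary pack.

    matching_oids: oids that match the filtering criteria.
    promisor_oids: None when the repository has no upstream promisor
        (the "partitioning" case); otherwise the oids the promisors
        report as still available (the "unbloat" case).
    """
    matching = set(matching_oids)
    if promisor_oids is None:
        # Degenerate case: nothing can be "not available (yet) in our
        # promisor", so every matching object can be excluded.
        return matching
    # Unbloat case: exclude only what can be fetched back later.
    return matching & set(promisor_oids)
```

With this shape, objects that match the filter but exist only locally
stay in the primary pack whenever a promisor is configured.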

In a second-best world, we may not be ready to tackle the
unbloating issue, but "partitioning" alone may still be a useful
feature.  In that case, perhaps the series can be salvaged by
updating how the feature is sold, with some comments indicating the
future direction to extend the mechanism later.

Thanks.