[v5,00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index

Message ID 20190424151428.170316-1-dstolee@microsoft.com (mailing list archive)

Message

Derrick Stolee April 24, 2019, 3:14 p.m. UTC
The multi-pack-index provides a fast way to find an object among a large list
of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.
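
The selection rule above can be illustrated with a small model. This is only a
sketch, not the midx.c implementation: the function name build_index and the
dict-based pack description are hypothetical, and packs are reduced to a
modification time plus a set of object ids.

```python
def build_index(packs):
    """Map each object id to exactly one pack name.

    `packs` is a hypothetical structure: {pack_name: (mtime, set_of_object_ids)}.
    """
    index = {}
    # Visit packs from oldest to newest so that, for a duplicated object,
    # the entry written last comes from the most-recently modified pack.
    for name, (_mtime, oids) in sorted(packs.items(), key=lambda kv: kv[1][0]):
        for oid in oids:
            index[oid] = name
    return index


packs = {
    "old": (1, {"a", "b"}),
    "new": (2, {"b", "c"}),
}
index = build_index(packs)
```

Here object "b" lives in both packs, and the model records only the reference
to "new", the more recently modified of the two.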

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the
  multi-pack-index, but all objects in that pack are duplicated in
  more-recently modified packs, then delete that pack (and any others like it).
  Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
  pack-files covered by the multi-pack-index, find those whose "expected size"
  is below the batch size until we have a collection of packs whose expected
  sizes add up to the batch size. We compute the expected size by multiplying
  the number of referenced objects by the pack-size and dividing by the total
  number of objects in the pack. If the batch-size is zero, then select all
  packs. Create a new pack containing all objects that the multi-pack-index
  references in those packs.
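
The batch selection described for 'repack' can be sketched as follows. This is
a simplified model under stated assumptions, not the code in midx.c: the Pack
class, its field names, and select_batch are hypothetical, sizes are plain
integers, and integer division is assumed for the expected size.

```python
from dataclasses import dataclass


@dataclass
class Pack:
    # Hypothetical description of one pack covered by the multi-pack-index.
    name: str
    mtime: int               # modification time; smaller means older
    pack_size: int           # size of the pack-file in bytes
    num_objects: int         # total objects stored in the pack (assumed > 0)
    referenced_objects: int  # objects the multi-pack-index references here


def expected_size(pack):
    # referenced objects * pack size / total objects, as described above.
    return pack.pack_size * pack.referenced_objects // pack.num_objects


def select_batch(packs, batch_size):
    oldest_first = sorted(packs, key=lambda p: p.mtime)
    if batch_size == 0:
        # A batch size of zero selects every pack.
        return [p.name for p in oldest_first]
    selected, total = [], 0
    for pack in oldest_first:
        if total >= batch_size:
            break  # expected sizes already add up to the batch size
        size = expected_size(pack)
        if size >= batch_size:
            continue  # only packs below the batch size are candidates
        selected.append(pack.name)
        total += size
    return selected


packs = [
    Pack("hourly-1", 1, 1000, 100, 10),   # expected size 100
    Pack("daily", 2, 5000, 100, 100),     # expected size 5000
    Pack("hourly-2", 3, 2000, 100, 50),   # expected size 1000
]
```

With these made-up numbers, select_batch(packs, 1100) picks "hourly-1" and
"hourly-2" and skips "daily", whose expected size exceeds the batch size.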

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the last
'repack' are finished, run 'expire'. This approach has some advantages
over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of
repacking all reachable objects. We can also limit ourselves to the objects
that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old
pack-files) we do not interrupt concurrent Git commands, and do not suffer
performance degradation. By expiring only pack-files that have no referenced
objects, we know that Git commands that are doing normal object lookups* will
not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(),
* then that command could try to read the pack-files and pack-indexes that we
* are deleting during an expire command. Such commands are usually related to
* object maintenance (e.g. fsck, gc, pack-objects) or are related to
* less-often-used features (e.g. fast-import, http-backend, server-info).
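
The 'expire' rule (delete only packs with no referenced objects) can be
modeled in a few lines. This is a sketch with hypothetical names, not the
actual implementation: `index` stands for a mapping from object id to the one
pack the multi-pack-index references for it, and `kept` stands for packs that
have a .keep file, which the series tests that 'expire' respects.

```python
def expirable_packs(pack_names, index, kept=()):
    """Return the packs that no indexed object refers to.

    pack_names: names of all packs covered by the index (hypothetical input)
    index:      {object_id: pack_name}, one reference per object
    kept:       pack names protected by a .keep file
    """
    referenced = set(index.values())
    # A pack qualifies only if it has zero referenced objects and is not kept.
    return sorted(
        name for name in pack_names
        if name not in referenced and name not in kept
    )


index = {"a": "new", "b": "new", "c": "newer"}
candidates = expirable_packs(
    ["old", "old-kept", "new", "newer"], index, kept={"old-kept"}
)
```

In this example only "old" qualifies: every object it held is now referenced
through a newer pack, while "old-kept" is skipped because of its .keep file.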

We are using this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on an
hourly basis to keep up-to-date with the central server. The cache servers
supply packs on an hourly and daily basis, so most of the hourly packs become
useless after a new daily pack is downloaded. The 'expire' command would clear
out most of those packs, but many will still remain, each holding fewer than
100 referenced objects. The 'repack' command (with a batch size of 1-3 GB,
probably) can condense the remaining packs in commands that run for 1-3 min at
a time. Since the daily packs range from 100-250 MB, we will also combine and
condense those packs.

Updates in V5:

* Fixed the error in PATCH 7 due to a missing line that existed in PATCH 8. Thanks, Josh Steadmon!

* The 'repack' subcommand now computes the "expected size" of a pack instead of
  relying on the total size of the pack. This is actually really important to
  the way VFS for Git uses prefetch packs, and some packs are not being
  repacked because the pack size is larger than the batch size, but really
  there are only a few referenced objects.

* The 'repack' subcommand now allows a batch size of zero to mean "create one
  pack containing all objects in the multi-pack-index". A new commit adds a
  test that hits the boundary cases here, but follows the 'expire' subcommand
  so we can show that cycle of repack-then-expire to safely replace the packs.

Junio: It appears that there are some conflicts with the trace2 changes in
master. These are not new to the updates in this version. I saw how you
resolved these conflicts and replaying that resolution should work for you.

Thanks,
-Stolee

Derrick Stolee (11):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files
  t5319-multi-pack-index.sh: test batch size zero

 Documentation/git-multi-pack-index.txt |  32 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 440 +++++++++++++++++++------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 184 +++++++++++
 8 files changed, 602 insertions(+), 119 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595

Comments

Junio C Hamano April 25, 2019, 5:38 a.m. UTC | #1
Derrick Stolee <stolee@gmail.com> writes:

> Updates in V5:
>
> * Fixed the error in PATCH 7 due to a missing line that existed in PATCH 8. Thanks, Josh Steadmon!
>
> * The 'repack' subcommand now computes the "expected size" of a pack instead of
>   relying on the total size of the pack. This is actually really important to
>   the way VFS for Git uses prefetch packs, and some packs are not being
>   repacked because the pack size is larger than the batch size, but really
>   there are only a few referenced objects.
>
> * The 'repack' subcommand now allows a batch size of zero to mean "create one
>   pack containing all objects in the multi-pack-index". A new commit adds a
>   test that hits the boundary cases here, but follows the 'expire' subcommand
>   so we can show that cycle of repack-then-expire to safely replace the packs.

I guess all of them need to tweak the authorship from the gmail
address to the work address on the Signed-off-by: trailer, which I
can do (as I noticed it before applying).

Derrick Stolee April 25, 2019, 11:06 a.m. UTC | #2
On 4/25/2019 1:38 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> Updates in V5:
>>
>> * Fixed the error in PATCH 7 due to a missing line that existed in PATCH 8. Thanks, Josh Steadmon!
>>
>> * The 'repack' subcommand now computes the "expected size" of a pack instead of
>>   relying on the total size of the pack. This is actually really important to
>>   the way VFS for Git uses prefetch packs, and some packs are not being
>>   repacked because the pack size is larger than the batch size, but really
>>   there are only a few referenced objects.
>>
>> * The 'repack' subcommand now allows a batch size of zero to mean "create one
>>   pack containing all objects in the multi-pack-index". A new commit adds a
>>   test that hits the boundary cases here, but follows the 'expire' subcommand
>>   so we can show that cycle of repack-then-expire to safely replace the packs.
> 
> I guess all of them need to tweak the authorship from the gmail
> address to the work address on the Signed-off-by: trailer, which I
> can do (as I noticed it before applying).

Sorry. Due to the conflicts, GitGitGadget prevented me from submitting in
my normal way, so I pulled out format-patch and send-email for the first
time in a very long time. I manually added new "From: " lines in the bodies
of the patch files, but they got suppressed, I guess.

Thanks,
-Stolee