
[v6,00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index

Message ID 20190514184754.3196-1-dstolee@microsoft.com (mailing list archive)

Message

Derrick Stolee May 14, 2019, 6:47 p.m. UTC
The multi-pack-index provides a fast way to find an object among a large list
of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the
  multi-pack-index, but all objects in that pack are duplicated in
  more-recently modified packs, then delete that pack (and any others like it).
  Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
  pack-files covered by the multi-pack-index, find those whose "expected size"
  is below the batch size until we have a collection of packs whose expected
  sizes add up to the batch size. We compute the expected size by multiplying
  the number of referenced objects by the pack-size and dividing by the total
  number of objects in the pack. If the batch-size is zero, then select all
  packs. Create a new pack containing all objects that the multi-pack-index
  references in those packs.
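
To make the 'repack' batch selection concrete, here is a minimal Python
sketch of the rule as worded above. It is an illustrative model only, not
the code in midx.c: the names PackInfo, expected_size and select_batch are
made up for this example, and the use of integer division is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PackInfo:
    # Hypothetical record for one pack covered by the multi-pack-index.
    name: str
    mtime: int               # modification time; smaller means older
    pack_size: int           # size of the pack-file in bytes
    num_objects: int         # total objects stored in the pack
    referenced_objects: int  # objects the multi-pack-index points at here


def expected_size(pack):
    # referenced objects * pack size / total objects, as in the cover letter.
    if pack.num_objects == 0:
        return 0
    return pack.referenced_objects * pack.pack_size // pack.num_objects


def select_batch(packs, batch_size):
    # A batch size of zero selects every pack.
    if batch_size == 0:
        return [p.name for p in packs]
    selected = []
    total = 0
    # Walk from the oldest pack to the newest.
    for pack in sorted(packs, key=lambda p: p.mtime):
        if total >= batch_size:
            break  # expected sizes already add up to the batch size
        size = expected_size(pack)
        if size >= batch_size:
            continue  # only packs below the batch size are candidates
        selected.append(pack.name)
        total += size
    return selected


packs = [
    PackInfo("old", 1, 1000, 100, 10),   # expected size 100
    PackInfo("mid", 2, 5000, 50, 50),    # expected size 5000
    PackInfo("new", 3, 2000, 20, 10),    # expected size 1000
    PackInfo("newest", 4, 400, 4, 4),    # expected size 400
]
batch = select_batch(packs, 1100)
```

With these made-up numbers, "mid" is skipped because its expected size
exceeds the batch size, and the walk stops once "old" and "new" together
reach 1100 bytes.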

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the last
'repack' are finished, run 'expire'. This approach has some advantages
over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of
repacking all reachable objects. We can also limit ourselves to the objects
that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old
pack-files) we do not interrupt concurrent Git commands, and do not suffer
performance degradation. By expiring only pack-files that have no referenced
objects, we know that Git commands that are doing normal object lookups* will
not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(),
* then that command could try to read the pack-files and pack-indexes that we
* are deleting during an expire command. Such commands are usually related to
* object maintenance (e.g. fsck, gc, pack-objects) or are related to
* less-often-used features (e.g. fast-import, http-backend, server-info).
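
The repack-then-expire pattern can be modelled in a few lines of Python.
This is a toy simulation under stated assumptions, not Git's
implementation: packs are plain (mtime, object-id set) tuples, the
function names are hypothetical, ties between equal mtimes keep the first
pack seen, and .keep files are not modelled.

```python
def build_midx(packs):
    # Map each object id to the most-recently modified pack containing it.
    midx = {}
    for name, (mtime, oids) in packs.items():
        for oid in oids:
            best = midx.get(oid)
            if best is None or packs[best][0] < mtime:
                midx[oid] = name
    return midx


def repack(packs, batch, new_name, now):
    # Write a new pack holding every object the index references in the
    # batch. The old packs are left in place for concurrent readers.
    midx = build_midx(packs)
    moved = {oid for oid, name in midx.items() if name in batch}
    packs[new_name] = (now, moved)


def expire(packs):
    # Delete packs that no longer have any referenced objects.
    referenced = set(build_midx(packs).values())
    expired = sorted(name for name in packs if name not in referenced)
    for name in expired:
        del packs[name]
    return expired


packs = {
    "hourly-1": (1, {"a", "b"}),
    "hourly-2": (2, {"b", "c"}),
    "daily": (3, {"d"}),
}
repack(packs, {"hourly-1", "hourly-2"}, "combined", now=4)
after_repack = sorted(packs)   # the old packs are still present
expired = expire(packs)        # run later, once older commands finished
after_expire = sorted(packs)
```

In this model the two hourly packs survive the 'repack' step and are only
removed by the later 'expire', once every object they hold is referenced
through the newer combined pack.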

We are using this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on an
hourly basis to keep up-to-date with the central server. The cache servers
supply packs on an hourly and daily basis, so most of the hourly packs become
useless after a new daily pack is downloaded. The 'expire' command would clear
out most of those packs, but many will remain, each with fewer than 100
objects left. The 'repack' command (with a batch size of 1-3 GB, probably) can
condense the remaining packs in commands that run for 1-3 min at a time. Since
the daily packs range from 100-250 MB, we will also combine and condense those
packs.

Updates in V6:

I rebased onto ds/midx-too-many-packs. Thanks, Junio for taking that
change first. There were several subtle things that needed to change to
put this change on top:

* We need a repository struct everywhere since we add pack-files to the
  packed_git list now.

* A FREE_AND_NULL() was dropped after closing a pack because the pack
  is still in the packed_git list after opening.

* I noticed some whitespace problems.

I also expect GMail to munge my added "From:" tags, so it will look
like the author is "stolee@gmail.com" instead of
"dstolee@microsoft.com". Sorry for the continued inconvenience here.

Thanks,
-Stolee

Derrick Stolee (11):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files
  t5319-multi-pack-index.sh: test batch size zero

 Documentation/git-multi-pack-index.txt |  32 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 440 +++++++++++++++++++------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 184 +++++++++++
 8 files changed, 602 insertions(+), 119 deletions(-)

Comments

Derrick Stolee June 10, 2019, 2:15 p.m. UTC | #1
On 5/14/2019 2:47 PM, Derrick Stolee wrote:
> Updates in V6:
> 
> I rebased onto ds/midx-too-many-packs. Thanks, Junio for taking that
> change first. There were several subtle things that needed to change to
> put this change on top:
> 
> * We need a repository struct everywhere since we add pack-files to the
>   packed_git list now.
> 
> * A FREE_AND_NULL() was dropped after closing a pack because the pack
>   is still in the packed_git list after opening.
> 
> * I noticed some whitespace problems.
> 
> I also expect GMail to munge my added "From:" tags, so it will look
> like the author is "stolee@gmail.com" instead of
> "dstolee@microsoft.com". Sorry for the continued inconvenience here.

Junio: thank you for taking ds/midx-too-many-packs into v2.22.0. That
was a helpful bugfix.

However, this series was dropped from the cooking emails, and never
included this v6. Now that the release is complete, could this be
reconsidered?

Thanks,
-Stolee
Junio C Hamano June 10, 2019, 5:31 p.m. UTC | #2
Derrick Stolee <stolee@gmail.com> writes:

> On 5/14/2019 2:47 PM, Derrick Stolee wrote:
>> Updates in V6:
>> 
>> I rebased onto ds/midx-too-many-packs. Thanks, Junio for taking that
>> change first. There were several subtle things that needed to change to
>> put this change on top:
>> ...
> However, this series was dropped from the cooking emails, and never
> included this v6. Now that the release is complete, could this be
> reconsidered?

"reconsider" is a bit strong word, as (at least as far as I recall)
it was never "rejected" as an unwanted topic, but was merely
postponed to give way to other topics in flight.  Thanks for keeping
an eye on it and finding the right moment to raise it again.

I could go back to the list archive and dig it up, but because it
has been a while since it was posted, it may not be a bad idea to
send it for a review, after making sure it cleanly applies to
'master', to make it one of the early topics to go 'next' during
this cycle, I would think.

Thanks.
Derrick Stolee June 10, 2019, 5:57 p.m. UTC | #3
On 6/10/2019 1:31 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 5/14/2019 2:47 PM, Derrick Stolee wrote:
>>> Updates in V6:
>>>
>>> I rebased onto ds/midx-too-many-packs. Thanks, Junio for taking that
>>> change first. There were several subtle things that needed to change to
>>> put this change on top:
>>> ...
>> However, this series was dropped from the cooking emails, and never
>> included this v6. Now that the release is complete, could this be
>> reconsidered?
> 
> "reconsider" is a bit strong word, as (at least as far as I recall)
> it was never "rejected" as an unwanted topic, but was merely
> postponed to give way to other topics in flight.  Thanks for keeping
> an eye on it and finding the right moment to raise it again.
> 
> I could go back to the list archive and dig it up, but because it
> has been a while since it was posted, it may not be a bad idea to
> send it for a review, after making sure it cleanly applies to
> 'master', to make it one of the early topics to go 'next' during
> this cycle, I would think.

Sure, I'll create a brand new thread and point to this thread for
history.

Thanks,
-Stolee