mbox series

[00/17] cruft packs

Message ID cover.1638224692.git.me@ttaylorr.com (mailing list archive)
Headers show
Series cruft packs | expand

Message

Taylor Blau Nov. 29, 2021, 10:25 p.m. UTC
This series implements "cruft packs", a pack which stores accumulated
unreachable objects, along with a new ".mtimes" file which tracks each
object's last known modification time.

This idea was discussed recently-ish in [1], but the most thorough
discussion I could find is in [2]. The approach settled on in this
series is laid out in detail by the first patch.

For the uninitiated, cruft packs enable repositories to safely run
`git repack -Ad` by storing unreachable objects which have not yet
"aged out" in a separate pack. This prevents repositories from storing
a potentially large number of these such objects as loose.

This series is structured as follows:

  - The first patch describes the technical details of cruft packs.
  - The next five patches implement reading and writing the new
    `.mtimes` format.
  - The next six patches implement `git pack-objects --cruft`. The
    first five implement this mode when no grace period is specified,
    and the six patch adds support for the grace period.
  - The next five patches integrate cruft packs with `git repack`,
    including the new-ish `--geometric` mode.
  - The final patch handles object freshening for objects stored in a
    cruft pack.

Thanks in advance for your review.

[1]: https://lore.kernel.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/
[2]: https://lore.kernel.org/git/E1SdhJ9-0006B1-6p@tytso-glaptop.cam.corp.google.com/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  23 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  95 ++++
 Documentation/technical/pack-format.txt |  22 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 306 ++++++++++-
 builtin/repack.c                        | 189 ++++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 139 +++++
 pack-mtimes.h                           |  16 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  20 +
 pack-write.c                            |  90 +++-
 pack.h                                  |   4 +
 packfile.c                              |  18 +-
 packfile.h                              |   1 +
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  53 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5327-pack-objects-cruft.sh           | 685 ++++++++++++++++++++++++
 33 files changed, 1757 insertions(+), 102 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5327-pack-objects-cruft.sh

Comments

Junio C Hamano Dec. 3, 2021, 7:51 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> This series implements "cruft packs", a pack which stores accumulated
> unreachable objects, along with a new ".mtimes" file which tracks each
> object's last known modification time.

Let me rephrase the above to test my understanding, since I need to
write a summary for the  "What's cooking" report.

 Instead of leaving unreachable objects in loose form when packing,
 or ejecting them into loose form when repacking, gather them in a
 packfile with an auxiliary file that records the last-use time of
 these objects.

That way, we do not have to waste so many inodes for loose objects
that is not likely to be used, which feels like a win.

>   - The final patch handles object freshening for objects stored in a
>     cruft pack.

I am not going to read it today, but I think this is the most
interesting part of the series.  Instead of using mtime of an
individual loose object file, we'd need to record the time of
last use for each object in a pack.

Stepping back a bit, I do not see how we can get away without doing
the same .mtimes file for non-cruft packs.  An object that is in a
non-cruft pack may be referenced immediately after the repack that
created the pack, but the ref that was referencing the object may
have gone away and now the pack is a month old.  If we were to
repack the object, we do not know when was the last time the object
was reachable from any of the refs and index entries (collectively
known as anchor points).  Of course, recording all mtimes for all
packed objects all the time would involve quite a lot of overhead.
I am guessing (I will not spend time today to figure it out myself)
that .mtimes update at runtime will happen in-place (i.e. via
seek(2)+write(2), or pwrite()), and I wonder what the safety concern
would be (which is the primary reason why we tend not to do in-place
updates but recreate-and-rename updates).

Thanks for working on such an interesting topic.
Taylor Blau Dec. 3, 2021, 8:08 p.m. UTC | #2
On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This series implements "cruft packs", a pack which stores accumulated
> > unreachable objects, along with a new ".mtimes" file which tracks each
> > object's last known modification time.
>
> Let me rephrase the above to test my understanding, since I need to
> write a summary for the  "What's cooking" report.
>
>  Instead of leaving unreachable objects in loose form when packing,
>  or ejecting them into loose form when repacking, gather them in a
>  packfile with an auxiliary file that records the last-use time of
>  these objects.

Exactly. Thanks for such a concise and accurate description of the
topic.

> That way, we do not have to waste so many inodes for loose objects
> that is not likely to be used, which feels like a win.

Yes. This had historically been a problem for GitHub. We don't
automatically prune unreachable objects during repacking, but sometimes
customers will ask us to do it on their behalf (if, for example, they
accidentally pushed sensitive information to us, and then force-pushed
over it).

But occasionally we'd get bitten by exploding many years of loose
objects (because we used to freshen packfiles too aggressively when
moving them around).

We've been running this series in production for the past few months,
and it's been a huge relief on the folks who typically run these pruning
GCs.

> >   - The final patch handles object freshening for objects stored in a
> >     cruft pack.
>
> I am not going to read it today, but I think this is the most
> interesting part of the series.  Instead of using mtime of an
> individual loose object file, we'd need to record the time of
> last use for each object in a pack.
>
> Stepping back a bit, I do not see how we can get away without doing
> the same .mtimes file for non-cruft packs.  An object that is in a
> non-cruft pack may be referenced immediately after the repack that
> created the pack, but the ref that was referencing the object may
> have gone away and now the pack is a month old.  If we were to
> repack the object, we do not know when was the last time the object
> was reachable from any of the refs and index entries (collectively
> known as anchor points).

In that situation, we would use the mtime of the pack which contains
that object itself as a proxy (or the mtime of a loose copy of the
object, if it is more recent).

That isn't perfect, as you note, since if the pack isn't otherwise
freshened, we'd consider that object to be a month old, even if the
reference pointing at it was deleted a mere second ago.

I can't recall if Peff and I talked about this off-list, but I have a
vague sense we probably did (and I forgot the details).

> Of course, recording all mtimes for all
> packed objects all the time would involve quite a lot of overhead.
> I am guessing (I will not spend time today to figure it out myself)
> that .mtimes update at runtime will happen in-place (i.e. via
> seek(2)+write(2), or pwrite()), and I wonder what the safety concern
> would be (which is the primary reason why we tend not to do in-place
> updates but recreate-and-rename updates).

Yeah, this series avoids doing an in-place update, and similarly avoids
recreating the entire .mtimes file before moving into place. Instead,
freshening an object stored in a cruft pack takes place by rewriting a
copy of the object loose, since we consider an object's mtime to be the
most recent of (a) what's in the .mtimes file, (b) the mtime of the
containing pack, and (c) the mtime of a loose copy (if one exists).

It can be wasteful, but in practice "resurrecting" an object in a cruft
pack is pretty rare, so on balance it ends up costing less work to do.

> Thanks for working on such an interesting topic.

I'm glad to have piqued your interest.

Taylor
Taylor Blau Dec. 3, 2021, 8:47 p.m. UTC | #3
On Fri, Dec 03, 2021 at 03:08:00PM -0500, Taylor Blau wrote:
> On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote:
> > Stepping back a bit, I do not see how we can get away without doing
> > the same .mtimes file for non-cruft packs.  An object that is in a
> > non-cruft pack may be referenced immediately after the repack that
> > created the pack, but the ref that was referencing the object may
> > have gone away and now the pack is a month old.  If we were to
> > repack the object, we do not know when was the last time the object
> > was reachable from any of the refs and index entries (collectively
> > known as anchor points).
>
> In that situation, we would use the mtime of the pack which contains
> that object itself as a proxy (or the mtime of a loose copy of the
> object, if it is more recent).
>
> That isn't perfect, as you note, since if the pack isn't otherwise
> freshened, we'd consider that object to be a month old, even if the
> reference pointing at it was deleted a mere second ago.
>
> I can't recall if Peff and I talked about this off-list, but I have a
> vague sense we probably did (and I forgot the details).

Maybe I can rephrase the problem as being orthogonal to what we're
addressing here. Modification time can be a useful-ish proxy for "last
referenced time", but they are ultimately different.

Forgetting cruft packs for a moment, our behavior today in that
situation would be to prune the object if our grace period did not cover
the time in which the pack was last modified. So if the pack was a month
old, the grace period was two weeks, but the reference pointing at some
object in that pack was deleted only a second before starting a pruning
GC, we'd prune that object before this series (just as we would do the
same thing with this series).

Aside from pruning, what happens to the value recorded in the .mtimes
file is more interesting. For the case you're talking about, we'll err
on the side of newer mtimes (either the original timestamp is recorded,
or some future time when the containing pack was rewritten). But the
more interesting case is when an object becomes re-referenced. Since the
ref-update doesn't cause the object to be rewritten, we wouldn't change
the timestamp.

Anyway, both of these are still independent from cruft packs, so we're
not changing the status quo there, I don't think.

Thanks,
Taylor