Message ID | f7186147ebb0b2d01d8f1e0f742f367204d7d9c9.1612411123.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | repack: support repacking into a geometric sequence | expand |
On Wed, Feb 03, 2021 at 10:58:50PM -0500, Taylor Blau wrote: > Future callers will want a function to fill a 'struct pack_entry' for a > given object id but _only_ from its position in any kept pack(s). > > In particular, an new 'git repack' mode which ensures the resulting Nit (not worth re-rolling): s/an new/a new/ > There is a gotcha when looking up objects that are duplicated in kept > and non-kept packs, particularly when the MIDX stores the non-kept > version and the caller asked for kept objects only. This could be > resolved by teaching the MIDX to resolve duplicates by always favoring > the kept pack (if one exists), but this breaks an assumption in existing > MIDXs, and so it would require a format change. I don't think this would be possible without a major rethink of how midxs work. The "keep" property of a pack is not set in stone when the midx is created. You could add a ".keep" file to one of its packs later, or even mark one as an in-core keep on the fly. But the duplicate resolution happens at creation. So maybe your "breaks an assumption" is the notion that we do not store duplicate information at all in the midx. If so, then I agree. :) But I'd also call fixing that more than just a format change. (None of which changes your point, which isn't that it isn't worth pursuing that direction). -Peff
On Tue, Feb 16, 2021 at 04:42:38PM -0500, Jeff King wrote: > On Wed, Feb 03, 2021 at 10:58:50PM -0500, Taylor Blau wrote: > > > Future callers will want a function to fill a 'struct pack_entry' for a > > given object id but _only_ from its position in any kept pack(s). > > > > In particular, an new 'git repack' mode which ensures the resulting > > Nit (not worth re-rolling): s/an new/a new/ Oops. Good eyes. > > There is a gotcha when looking up objects that are duplicated in kept > > and non-kept packs, particularly when the MIDX stores the non-kept > > version and the caller asked for kept objects only. This could be > > resolved by teaching the MIDX to resolve duplicates by always favoring > > the kept pack (if one exists), but this breaks an assumption in existing > > MIDXs, and so it would require a format change. > > I don't think this would be possible without a major rethink of how > midxs work. The "keep" property of a pack is not set in stone when the > midx is created. You could add a ".keep" file to one of its packs later, > or even mark one as an in-core keep on the fly. But the duplicate > resolution happens at creation. > > So maybe your "breaks an assumption" is the notion that we do not store > duplicate information at all in the midx. If so, then I agree. :) But > I'd also call fixing that more than just a format change. That's part of it, indeed. The part that I was referring to is that existing MIDX readers expect duplicates to be resolved in a certain way (effectively in favor of the pack with the lowest mtime). So the easy part is indicating a format change which tells new readers how to expect ties to be broken. But (as you note) that's only part of the problem: even if we say "ties are resolved in favor of the lowest mtime pack, or a .keep one, if it exists", then which ones are kept and which aren't? Even *if* we wrote that down (which I'm not suggesting we do), kept-ness isn't an immutable property of the pack, and so I think relying on it is a tricky direction to take. > (None of which changes your point, which isn't that it isn't worth > pursuing that direction). Yeah; my hope in writing some of this down in the above paragraph is that it would make clear to future readers that such a MIDX change would resolve some complexity here, but the complexity it adds in the MIDX code isn't worth the tradeoff. > -Peff Thanks, Taylor
diff --git a/packfile.c b/packfile.c index 4b938b4372..5f35cfe788 100644 --- a/packfile.c +++ b/packfile.c @@ -2031,7 +2031,10 @@ static int fill_pack_entry(const struct object_id *oid, return 1; } -int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +static int find_one_pack_entry(struct repository *r, + const struct object_id *oid, + struct pack_entry *e, + int kept_only) { struct list_head *pos; struct multi_pack_index *m; @@ -2041,26 +2044,77 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa return 0; for (m = r->objects->multi_pack_index; m; m = m->next) { - if (fill_midx_entry(r, oid, e, m)) + if (!fill_midx_entry(r, oid, e, m)) + continue; + + if (!kept_only) + return 1; + + if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) || + ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core)) return 1; } list_for_each(pos, &r->objects->packed_git_mru) { struct packed_git *p = list_entry(pos, struct packed_git, mru); - if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) { - list_move(&p->mru, &r->objects->packed_git_mru); - return 1; + if (p->multi_pack_index && !kept_only) { + /* + * If this pack is covered by the MIDX, we'd have found + * the object already in the loop above if it was here, + * so don't bother looking. + * + * The exception is if we are looking only at kept + * packs. An object can be present in two packs covered + * by the MIDX, one kept and one not-kept. And as the + * MIDX points to only one copy of each object, it might + * have returned only the non-kept version above. We + * have to check again to be thorough. + */ + continue; + } + if (!kept_only || + (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) || + ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) { + if (fill_pack_entry(oid, e, p)) { + list_move(&p->mru, &r->objects->packed_git_mru); + return 1; + } } } return 0; } +int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +{ + return find_one_pack_entry(r, oid, e, 0); +} + +int find_kept_pack_entry(struct repository *r, + const struct object_id *oid, + unsigned flags, + struct pack_entry *e) +{ + /* + * Load all packs, including midx packs, since our "kept" strategy + * relies on that. We're relying on the side effect of it setting up + * r->objects->packed_git, which is a little ugly. + */ + get_all_packs(r); + return find_one_pack_entry(r, oid, e, flags); +} + int has_object_pack(const struct object_id *oid) { struct pack_entry e; return find_pack_entry(the_repository, oid, &e); } +int has_object_kept_pack(const struct object_id *oid, unsigned flags) +{ + struct pack_entry e; + return find_kept_pack_entry(the_repository, oid, flags, &e); +} + int has_pack_index(const unsigned char *sha1) { struct stat st; diff --git a/packfile.h b/packfile.h index a58fc738e0..624327f64d 100644 --- a/packfile.h +++ b/packfile.h @@ -161,13 +161,19 @@ int packed_object_info(struct repository *r, void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1); const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1); +#define ON_DISK_KEEP_PACKS 1 +#define IN_CORE_KEEP_PACKS 2 +#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS) + /* * Iff a pack file in the given repository contains the object named by sha1, * return true and store its location to e. */ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e); +int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e); int has_object_pack(const struct object_id *oid); +int has_object_kept_pack(const struct object_id *oid, unsigned flags); int has_pack_index(const unsigned char *sha1);