Message ID | cover.1615482270.git.me@ttaylorr.com (mailing list archive) |
---|---|
Headers | show |
Series | midx: implement a multi-pack reverse index | expand |
On 3/11/2021 12:04 PM, Taylor Blau wrote: > Here is another reroll of my series to implement a reverse index in > preparation for multi-pack reachability bitmaps. The previous version > was based on 'ds/chunked-file-api', but that topic has since been merged > to 'master'. This series is now built directly on top of 'master'. > > Not much has changed since last time. Jonathan Tan reviewed the previous > version, and I incorporated feedback from his review: > > - The usage macros in builtin/multi-pack-index.c were pulled out and > defined separately. > - Some sloppiness with converting a signed index referring to the > preferred pack into an unsigned value was cleaned up. > - Documentation clean-up, particularly in patches 12 and 13. > > There are a couple of new things that we found while testing this out at > GitHub. > > - We now call finalize_object_file() on the multi-pack reverse index > to set the correct permissions. > - Patch 14 removed a stray hunk that introduced a memory leak. > - Patch 16 (courtesy of Peff) is new. It improves the cache locality > of midx_pack_order_cmp(), which has a substantial impact on > repositories with many objects. > > Thanks in advance for your review. I've reviewed the changes since my last review and this one looks good, including that new patch from Peff. Thanks, -Stolee
On Thu, Mar 11, 2021 at 12:04:31PM -0500, Taylor Blau wrote: > Here is another reroll of my series to implement a reverse index in > preparation for multi-pack reachability bitmaps. The previous version > was based on 'ds/chunked-file-api', but that topic has since been merged > to 'master'. This series is now built directly on top of 'master'. I gave the whole thing another careful read. Most of what I found were small nits, but enough that I think one more re-roll is worth it. The biggest question is what we want to happen next. As you note, the concept of a midx .rev file is useless until we have the matching .bitmap file. So we _could_ let this sit in next while the dependent bitmap topic is reviewed, and then merge them down together. But I'm inclined to treat this as an independent topic that can get merged to master on its own, since the early cleanups are valuable on their own, and the .rev parts at the end, even if dead, won't hurt anything. If we did want to break it up, the useful line would be after "allow marking a pack as preferred" (while it is mostly intended for the bitmap selection, it is theoretically useful on its own to make it more likely to find a copy of an object with a useful delta). -Peff
Jeff King <peff@peff.net> writes: > On Thu, Mar 11, 2021 at 12:04:31PM -0500, Taylor Blau wrote: > >> Here is another reroll of my series to implement a reverse index in >> preparation for multi-pack reachability bitmaps. The previous version >> was based on 'ds/chunked-file-api', but that topic has since been merged >> to 'master'. This series is now built directly on top of 'master'. > > I gave the whole thing another careful read. Most of what I found were > small nits, but enough that I think one more re-roll is worth it. Thanks. > The biggest question is what we want to happen next. As you note, the > concept of a midx .rev file is useless until we have the matching > .bitmap file. So we _could_ let this sit in next while the dependent > bitmap topic is reviewed, and then merge them down together. But I'm > inclined to treat this as an independent topic that can get merged to > master on its own, since the early cleanups are valuable on their own, > and the .rev parts at the end, even if dead, won't hurt anything. It was my impression as well that the early clean-ups are worth on their own.
On Mon, Mar 29, 2021 at 09:05:33AM -0400, Jeff King wrote: > On Thu, Mar 11, 2021 at 12:04:31PM -0500, Taylor Blau wrote: > > > Here is another reroll of my series to implement a reverse index in > > preparation for multi-pack reachability bitmaps. The previous version > > was based on 'ds/chunked-file-api', but that topic has since been merged > > to 'master'. This series is now built directly on top of 'master'. > > I gave the whole thing another careful read. Most of what I found were > small nits, but enough that I think one more re-roll is worth it. Thanks. I agree that another re-roll is worth it. I have one prepared locally, and I just had one outstanding question in: https://lore.kernel.org/git/YGI6ySogGoYZi66A@nand.local/ that I'll wait on your reply to before sending a reroll. > The biggest question is what we want to happen next. As you note, the > concept of a midx .rev file is useless until we have the matching > .bitmap file. So we _could_ let this sit in next while the dependent > bitmap topic is reviewed, and then merge them down together. But I'm > inclined to treat this as an independent topic that can get merged to > master on its own, since the early cleanups are valuable on their own, > and the .rev parts at the end, even if dead, won't hurt anything. That matches what I was hoping for. I think the clean-ups are worth it on their own, but I also think it's a good idea to take the whole series, since it means there's one less long-running branch in flight while we review the MIDX bitmaps topic. (FWIW, I can also see an argument in the other direction along the lines of "we may discover something later on that requires us to change the way multi-pack .rev files work". I think that such an outcome is fairly unlikely, but worth considering anyway). Thanks, Taylor
On Mon, Mar 29, 2021 at 05:37:01PM -0400, Taylor Blau wrote: > > The biggest question is what we want to happen next. As you note, the > > concept of a midx .rev file is useless until we have the matching > > .bitmap file. So we _could_ let this sit in next while the dependent > > bitmap topic is reviewed, and then merge them down together. But I'm > > inclined to treat this as an independent topic that can get merged to > > master on its own, since the early cleanups are valuable on their own, > > and the .rev parts at the end, even if dead, won't hurt anything. > > That matches what I was hoping for. I think the clean-ups are worth it > on their own, but I also think it's a good idea to take the whole > series, since it means there's one less long-running branch in flight > while we review the MIDX bitmaps topic. > > (FWIW, I can also see an argument in the other direction along the lines > of "we may discover something later on that requires us to change the > way multi-pack .rev files work". I think that such an outcome is fairly > unlikely, but worth considering anyway). That would be my general worry, too, but in this case I am not too concerned because I know the code has received substantial exercise already on real-world production servers. So while we may clean up some cosmetic bits or respond to review as it goes upstream, I'm much less worried about seeing some brown-paper-bag bug that would be sufficient to make us want to re-roll these .rev commits. And hopefully the existing rounds have addressed the cosmetic/review bits. -Peff
On Tue, Mar 30, 2021 at 03:15:02AM -0400, Jeff King wrote: > On Mon, Mar 29, 2021 at 05:37:01PM -0400, Taylor Blau wrote: > > (FWIW, I can also see an argument in the other direction along the lines > > of "we may discover something later on that requires us to change the > > way multi-pack .rev files work". I think that such an outcome is fairly > > unlikely, but worth considering anyway). > > That would be my general worry, too, but in this case I am not too > concerned because I know the code has received substantial exercise > already on real-world production servers. So while we may clean up some > cosmetic bits or respond to review as it goes upstream, I'm much less > worried about seeing some brown-paper-bag bug that would be sufficient > to make us want to re-roll these .rev commits. And hopefully the > existing rounds have addressed the cosmetic/review bits. Yes. Another benefit is that it should give us substantial confidence in the correctness not just of this topic, but of the multi-pack bitmaps that are built on top, too. Thanks, Taylor
Here is another reroll of my series to implement a reverse index in preparation for multi-pack reachability bitmaps. The previous version was based on 'ds/chunked-file-api', but that topic has since been merged to 'master'. This series is now built directly on top of 'master'. Not much has changed since last time. Jonathan Tan reviewed the previous version, and I incorporated feedback from his review: - The usage macros in builtin/multi-pack-index.c were pulled out and defined separately. - Some sloppiness with converting a signed index referring to the preferred pack into an unsigned value was cleaned up. - Documentation clean-up, particularly in patches 12 and 13. There are a couple of new things that we found while testing this out at GitHub. - We now call finalize_object_file() on the multi-pack reverse index to set the correct permissions. - Patch 14 removed a stray hunk that introduced a memory leak. - Patch 16 (courtesy of Peff) is new. It improves the cache locality of midx_pack_order_cmp(), which has a substantial impact on repositories with many objects. Thanks in advance for your review. Jeff King (1): midx.c: improve cache locality in midx_pack_order_cmp() Taylor Blau (15): builtin/multi-pack-index.c: inline 'flags' with options builtin/multi-pack-index.c: don't handle 'progress' separately builtin/multi-pack-index.c: define common usage with a macro builtin/multi-pack-index.c: split sub-commands builtin/multi-pack-index.c: don't enter bogus cmd_mode builtin/multi-pack-index.c: display usage on unrecognized command t/helper/test-read-midx.c: add '--show-objects' midx: allow marking a pack as preferred midx: don't free midx_name early midx: keep track of the checksum midx: make some functions non-static Documentation/technical: describe multi-pack reverse indexes pack-revindex: read multi-pack reverse indexes pack-write.c: extract 'write_rev_file_order' pack-revindex: write multi-pack reverse indexes Documentation/git-multi-pack-index.txt | 14 +- Documentation/technical/multi-pack-index.txt | 5 +- Documentation/technical/pack-format.txt | 83 +++++++ builtin/multi-pack-index.c | 182 ++++++++++++--- builtin/repack.c | 2 +- midx.c | 229 +++++++++++++++++-- midx.h | 11 +- pack-revindex.c | 127 ++++++++++ pack-revindex.h | 53 +++++ pack-write.c | 36 ++- pack.h | 1 + packfile.c | 3 + t/helper/test-read-midx.c | 24 +- t/t5319-multi-pack-index.sh | 39 ++++ 14 files changed, 740 insertions(+), 69 deletions(-) Range-diff against v2: 1: 0527fa89a9 = 1: 43fc0ad276 builtin/multi-pack-index.c: inline 'flags' with options 2: a4e107b1f8 = 2: 181f11e4c5 builtin/multi-pack-index.c: don't handle 'progress' separately 3: 8679dfd212 = 3: 94c498f0e2 builtin/multi-pack-index.c: define common usage with a macro 4: bc42b56ea2 ! 4: d084f90466 builtin/multi-pack-index.c: split sub-commands @@ Commit message ## builtin/multi-pack-index.c ## @@ - #include "midx.h" - #include "trace2.h" + #define BUILTIN_MIDX_REPACK_USAGE \ + N_("git multi-pack-index [<options>] repack [--batch-size=<size>]") +static char const * const builtin_multi_pack_index_write_usage[] = { - #define BUILTIN_MIDX_WRITE_USAGE \ - N_("git multi-pack-index [<options>] write") + BUILTIN_MIDX_WRITE_USAGE, + NULL +}; - +static char const * const builtin_multi_pack_index_verify_usage[] = { - #define BUILTIN_MIDX_VERIFY_USAGE \ - N_("git multi-pack-index [<options>] verify") + BUILTIN_MIDX_VERIFY_USAGE, + NULL +}; - +static char const * const builtin_multi_pack_index_expire_usage[] = { - #define BUILTIN_MIDX_EXPIRE_USAGE \ - N_("git multi-pack-index [<options>] expire") + BUILTIN_MIDX_EXPIRE_USAGE, + NULL +}; - +static char const * const builtin_multi_pack_index_repack_usage[] = { - #define BUILTIN_MIDX_REPACK_USAGE \ - N_("git multi-pack-index [<options>] repack [--batch-size=<size>]") + BUILTIN_MIDX_REPACK_USAGE, + NULL +}; - static char const * const builtin_multi_pack_index_usage[] = { BUILTIN_MIDX_WRITE_USAGE, + BUILTIN_MIDX_VERIFY_USAGE, @@ builtin/multi-pack-index.c: static struct opts_multi_pack_index { unsigned flags; } opts; 5: 5daa2946d3 = 5: bc3b6837f2 builtin/multi-pack-index.c: don't enter bogus cmd_mode 6: 98d9ea0770 = 6: f117e442c3 builtin/multi-pack-index.c: display usage on unrecognized command 7: 2fd9f4debf = 7: ae85a68ef2 t/helper/test-read-midx.c: add '--show-objects' 8: 223b899094 ! 8: 30194a6786 midx: allow marking a pack as preferred @@ builtin/multi-pack-index.c #include "trace2.h" +#include "object-store.h" - static char const * const builtin_multi_pack_index_write_usage[] = { #define BUILTIN_MIDX_WRITE_USAGE \ - N_("git multi-pack-index [<options>] write") + N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]") - BUILTIN_MIDX_WRITE_USAGE, - NULL - }; + + #define BUILTIN_MIDX_VERIFY_USAGE \ + N_("git multi-pack-index [<options>] verify") @@ builtin/multi-pack-index.c: static char const * const builtin_multi_pack_index_usage[] = { static struct opts_multi_pack_index { @@ midx.c: static void fill_pack_entry(uint32_t pack_int_id, uint32_t nr_packs, - uint32_t *nr_objects) + uint32_t *nr_objects, -+ uint32_t preferred_pack) ++ int preferred_pack) { uint32_t cur_fanout, cur_pack, cur_object; uint32_t alloc_fanout, alloc_objects, total_objects = 0; @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack goto cleanup; - ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr); ++ ctx.preferred_pack_idx = -1; + if (preferred_pack_name) { + for (i = 0; i < ctx.nr; i++) { + if (!cmp_idx_or_pack_name(preferred_pack_name, @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack + break; + } + } -+ } else -+ ctx.preferred_pack_idx = -1; ++ } + + ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr, + ctx.preferred_pack_idx); @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1; } -+ /* -+ * Recompute the preferred_pack_idx (if applicable) according to the -+ * permuted pack order. -+ */ -+ ctx.preferred_pack_idx = -1; ++ /* Check that the preferred pack wasn't expired (if given). */ + if (preferred_pack_name) { -+ ctx.preferred_pack_idx = lookup_idx_or_pack_name(ctx.info, -+ ctx.nr, -+ preferred_pack_name); -+ if (ctx.preferred_pack_idx < 0) ++ int preferred_idx = lookup_idx_or_pack_name(ctx.info, ++ ctx.nr, ++ preferred_pack_name); ++ if (preferred_idx < 0) + warning(_("unknown preferred pack: '%s'"), + preferred_pack_name); + else { -+ uint32_t orig = ctx.info[ctx.preferred_pack_idx].orig_pack_int_id; ++ uint32_t orig = ctx.info[preferred_idx].orig_pack_int_id; + uint32_t perm = ctx.pack_perm[orig]; + -+ if (perm == PACK_EXPIRED) { ++ if (perm == PACK_EXPIRED) + warning(_("preferred pack '%s' is expired"), + preferred_pack_name); -+ ctx.preferred_pack_idx = -1; -+ } else -+ ctx.preferred_pack_idx = perm; + } + } + 9: 976848bc4b = 9: 5c5aca761a midx: don't free midx_name early 10: 5ed47f7e3a = 10: a22a1463a5 midx: keep track of the checksum 11: 0292508e12 = 11: efa54479b1 midx: make some functions non-static 12: 404d730498 ! 12: 4745bb8590 Documentation/technical: describe multi-pack reverse indexes @@ Documentation/technical/pack-format.txt: CHUNK DATA: + +Instead of mapping between offset, pack-, and index position, this +reverse index maps between an object's position within the MIDX, and -+that object's position within a pseudo-pack that the MIDX describes. ++that object's position within a pseudo-pack that the MIDX describes ++(i.e., the ith entry of the multi-pack reverse index holds the MIDX ++position of ith object in pseudo-pack order). + -+To clarify these three orderings, consider a multi-pack reachability -+bitmap (which does not yet exist, but is what we are building towards -+here). Each bit needs to correspond to an object in the MIDX, and so we -+need an efficient mapping from bit position to MIDX position. ++To clarify the difference between these orderings, consider a multi-pack ++reachability bitmap (which does not yet exist, but is what we are ++building towards here). Each bit needs to correspond to an object in the ++MIDX, and so we need an efficient mapping from bit position to MIDX ++position. + +One solution is to let bits occupy the same position in the oid-sorted +index stored by the MIDX. But because oids are effectively random, there 13: d4e01a44e7 ! 13: a6ebd4be91 pack-revindex: read multi-pack reverse indexes @@ Commit message position in the psuedo-pack order, but that order can only be recovered in the .rev file itself. This mapping can be implemented with a binary search, but note that the thing we're binary searching over isn't an - array, but rather a _permutation_. + array of values, but rather a permuted order of those values. So, when comparing two items, it's helpful to keep in mind the difference. Instead of a traditional binary search, where you are 14: ab7012b283 ! 14: f5314f1822 pack-write.c: extract 'write_rev_file_order' @@ pack-write.c: const char *write_rev_file(const char *rev_name, + pack_order[i] = i; + QSORT_S(pack_order, nr_objects, pack_order_cmp, objects); + -+ if (!(flags & (WRITE_REV | WRITE_REV_VERIFY))) -+ return NULL; -+ + ret = write_rev_file_order(rev_name, pack_order, nr_objects, hash, + flags); + 15: 01bd6a35c6 ! 15: fa3acb5d5a pack-revindex: write multi-pack reverse indexes @@ Commit message for long, since subsequent patches will introduce the multi-pack bitmap, which will begin passing this field. + (In midx.c:write_midx_internal(), the two adjacent if statements share a + conditional, but are written separately since the first one will + eventually also handle the MIDX_WRITE_BITMAP flag, which does not yet + exist.) + Signed-off-by: Taylor Blau <me@ttaylorr.com> ## midx.c ## @@ midx.c: static int write_midx_large_offsets(struct hashfile *f, + struct write_midx_context *ctx) +{ + struct strbuf buf = STRBUF_INIT; ++ const char *tmp_file; + + strbuf_addf(&buf, "%s-%s.rev", midx_name, hash_to_hex(midx_hash)); + -+ write_rev_file_order(buf.buf, ctx->pack_order, ctx->entries_nr, -+ midx_hash, WRITE_REV); ++ tmp_file = write_rev_file_order(NULL, ctx->pack_order, ctx->entries_nr, ++ midx_hash, WRITE_REV); ++ ++ if (finalize_object_file(tmp_file, buf.buf)) ++ die(_("cannot store reverse index file")); + + strbuf_release(&buf); +} -: ---------- > 16: 550e785f10 midx.c: improve cache locality in midx_pack_order_cmp()