From patchwork Thu Feb 18 03:14:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092869 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C15C5C433E0 for ; Thu, 18 Feb 2021 03:15:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7A81B64E3E for ; Thu, 18 Feb 2021 03:15:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229983AbhBRDPC (ORCPT ); Wed, 17 Feb 2021 22:15:02 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43078 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229937AbhBRDO7 (ORCPT ); Wed, 17 Feb 2021 22:14:59 -0500 Received: from mail-qk1-x733.google.com (mail-qk1-x733.google.com [IPv6:2607:f8b0:4864:20::733]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6EBA8C061756 for ; Wed, 17 Feb 2021 19:14:19 -0800 (PST) Received: by mail-qk1-x733.google.com with SMTP id c3so848839qkj.11 for ; Wed, 17 Feb 2021 19:14:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=gnooYuzFE9U5xs3bVfru21QnB8P7E+qaJoxRTCQ7obA=; b=twjmcLDLTbB9RPuMJNcyP/fEVzt6eGfegTiFLXp2BxP8lGBIejZp5U8J3v+C7wp8EG ug3pI2PYV+UA9A53AGPaN5meghaSKHHtZ0VIHdecyVG+7m5650Kl/nEPC5M0XLI1mdrE yZs05BQkkROI/MFlDT5VrO5TtVA85CZ87wA8BvDGrdSBlFWKQiCMWhxJyzRPnI+vR1ts u4xoHKQCHRdTw9EqE75jXUn85ggfBePbqvsc6VHfzlBKB+8loKGqkrJM5JwTRjIFihhW gVTigqjwNjs4bgSRrv9z+SuAm2U0S1EZV8Db/LPr6qqzR0/0tBPaK4hbPRocm/RH2+NO 5syA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=gnooYuzFE9U5xs3bVfru21QnB8P7E+qaJoxRTCQ7obA=; b=h7/9NzPhN5Q9pOzYPJimBgFF1NjdcCqYjHNFK7NpIu69ds2oRLQgKjD5z76lLJ1bQM HHsLoflTyWOBnpVvmAAEKy90WvtwMPyC593wYhndkyuarrpME399LvITINRN58ujLiTo Asnptke2RgFE4Jh5z9Q57G7ptiGoWI8i089JUx2iK9GOBlhU3gESHhIFzPYDft18YcMO CXL9GfKfb0eYBOXd6yJChcH8Oz7shqmlF1fM1xH5siR90KtnpUHq8kAzqDTXoOesrDCT 2MkSpDh+syMFnPmt0Cxdsv3F96TuOdMvF87DqpmLp8UgiVwGonI9j7uFGhZ94csf7NY7 O3/A== X-Gm-Message-State: AOAM531NHTtCY/BDBUT8fnWdQhwqp8TZq9mqoK0piIDHMBgb/HED6OqV /nC3cSrv8/9YbOow9FzPAKFqc1eueRkJDqOM X-Google-Smtp-Source: ABdhPJwPtTW0iEWYcfxSP9sqrT+Cyk7CZGRoNuvzuESkVVuWUe+Z5cXxT5//DMo3NSh0rhFUOBGuAw== X-Received: by 2002:a05:620a:4050:: with SMTP id i16mr2470488qko.473.1613618058334; Wed, 17 Feb 2021 19:14:18 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id f26sm3005036qkh.80.2021.02.17.19.14.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:17 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:16 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()' Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Future callers will want a function to fill a 'struct pack_entry' for a given object id but _only_ from its position in any kept pack(s). In particular, an new 'git repack' mode which ensures the resulting packs form a geometric progress by object count will mark packs that it does not want to repack as "kept in-core", and it will want to halt a reachability traversal as soon as it visits an object in any of the kept packs. But, it does not want to halt the traversal at non-kept, or .keep packs. The obvious alternative is 'find_pack_entry()', but this doesn't quite suffice since it only returns the first pack it finds, which may or may not be kept (and the mru cache makes it unpredictable which one you'll get if there are options). Short of that, you could walk over all packs looking for the object in each one, but it scales with the number of packs, which may be prohibitive. Introduce 'find_kept_pack_entry()', a function which is like 'find_pack_entry()', but only fills in objects in the kept packs. Handle packs which have .keep files, as well as in-core kept packs separately, since certain callers will want to distinguish one from the other. (Though on-disk and in-core kept packs share the adjective "kept", it is best to think of the two sets as independent.) There is a gotcha when looking up objects that are duplicated in kept and non-kept packs, particularly when the MIDX stores the non-kept version and the caller asked for kept objects only. This could be resolved by teaching the MIDX to resolve duplicates by always favoring the kept pack (if one exists), but this breaks an assumption in existing MIDXs, and so it would require a format change. The benefit to changing the MIDX in this way is marginal, so we instead have a more thorough check here which is explained with a comment. Callers will be added in subsequent patches. Co-authored-by: Jeff King Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++----- packfile.h | 5 +++++ 2 files changed, 64 insertions(+), 5 deletions(-) diff --git a/packfile.c b/packfile.c index 1fec12ac5f..7f84f221ce 100644 --- a/packfile.c +++ b/packfile.c @@ -2042,7 +2042,10 @@ static int fill_pack_entry(const struct object_id *oid, return 1; } -int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +static int find_one_pack_entry(struct repository *r, + const struct object_id *oid, + struct pack_entry *e, + int kept_only) { struct list_head *pos; struct multi_pack_index *m; @@ -2052,26 +2055,77 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa return 0; for (m = r->objects->multi_pack_index; m; m = m->next) { - if (fill_midx_entry(r, oid, e, m)) + if (!fill_midx_entry(r, oid, e, m)) + continue; + + if (!kept_only) + return 1; + + if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) || + ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core)) return 1; } list_for_each(pos, &r->objects->packed_git_mru) { struct packed_git *p = list_entry(pos, struct packed_git, mru); - if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) { - list_move(&p->mru, &r->objects->packed_git_mru); - return 1; + if (p->multi_pack_index && !kept_only) { + /* + * If this pack is covered by the MIDX, we'd have found + * the object already in the loop above if it was here, + * so don't bother looking. + * + * The exception is if we are looking only at kept + * packs. An object can be present in two packs covered + * by the MIDX, one kept and one not-kept. And as the + * MIDX points to only one copy of each object, it might + * have returned only the non-kept version above. We + * have to check again to be thorough. + */ + continue; + } + if (!kept_only || + (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) || + ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) { + if (fill_pack_entry(oid, e, p)) { + list_move(&p->mru, &r->objects->packed_git_mru); + return 1; + } } } return 0; } +int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +{ + return find_one_pack_entry(r, oid, e, 0); +} + +int find_kept_pack_entry(struct repository *r, + const struct object_id *oid, + unsigned flags, + struct pack_entry *e) +{ + /* + * Load all packs, including midx packs, since our "kept" strategy + * relies on that. We're relying on the side effect of it setting up + * r->objects->packed_git, which is a little ugly. + */ + get_all_packs(r); + return find_one_pack_entry(r, oid, e, flags); +} + int has_object_pack(const struct object_id *oid) { struct pack_entry e; return find_pack_entry(the_repository, oid, &e); } +int has_object_kept_pack(const struct object_id *oid, unsigned flags) +{ + struct pack_entry e; + return find_kept_pack_entry(the_repository, oid, flags, &e); +} + int has_pack_index(const unsigned char *sha1) { struct stat st; diff --git a/packfile.h b/packfile.h index 4cfec9e8d3..3ae117a8ae 100644 --- a/packfile.h +++ b/packfile.h @@ -162,13 +162,18 @@ int packed_object_info(struct repository *r, void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1); const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1); +#define ON_DISK_KEEP_PACKS 1 +#define IN_CORE_KEEP_PACKS 2 + /* * Iff a pack file in the given repository contains the object named by sha1, * return true and store its location to e. */ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e); +int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e); int has_object_pack(const struct object_id *oid); +int has_object_kept_pack(const struct object_id *oid, unsigned flags); int has_pack_index(const unsigned char *sha1); From patchwork Thu Feb 18 03:14:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092873 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7FB1EC433E6 for ; Thu, 18 Feb 2021 03:15:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3F9D164E3E for ; Thu, 18 Feb 2021 03:15:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229989AbhBRDPU (ORCPT ); Wed, 17 Feb 2021 22:15:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43094 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229937AbhBRDPD (ORCPT ); Wed, 17 Feb 2021 22:15:03 -0500 Received: from mail-qt1-x834.google.com (mail-qt1-x834.google.com [IPv6:2607:f8b0:4864:20::834]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B9CA8C0613D6 for ; Wed, 17 Feb 2021 19:14:22 -0800 (PST) Received: by mail-qt1-x834.google.com with SMTP id b24so443176qtp.13 for ; Wed, 17 Feb 2021 19:14:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=9gFIdURqgyY5neZOZpdb2y6B4ghQ88Obpt5irfLsWY0=; b=miNsWSv6dqCNq2u/JGENio+BN0zWf2rwmz/yIX7R5S5eAn5mVsiYrGdPiIpqC64Dlu XC9SWTEA0kymliHcPtv1RR67zbyhgB8+e0pGEl9QVusU1NXIHFaye9qrPLJWoBaeJP9p sYfrkse3mTThCID9xDdISXhDMZck/iylkZRQw4OlrxWTpOphiN1yuo2UVLGcRx2Pd7xT isF2qPsA8rOXBLFpZzDOjRyhTqWwbXWGA2/1OwS2PBqFKLo7xiveWkcTrx3QSf/x8arm 5/oK5mFCAPXkrA/PPHrC+vu4u6k2qfbngR1NqWFc3rNolgRvjLlzypQBi905Qd1gYIgz rouw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=9gFIdURqgyY5neZOZpdb2y6B4ghQ88Obpt5irfLsWY0=; b=kkU0yI+ijBNzdEyFKO/Qu9lhj6eOq+hhXR8xjPFnJ50u0ZQVzc2J6v69gES0J4X46R p4uTBmDwOSeG31cZlgNnqOWUXTJRwDdelwEXmS/imV1Pdmeoz0iyqYCXOcKK9P38/wrZ zFYUHFb+PA28SVy64gZLD+5+umgyZs8kbuDkGk5cGuTcrbk0gmrBo7k4WDiL2FhDdOQj gfgxUYa3dNeZDlqxerFFzOY2GmuBWt/rt3lb7r/N5uNHlwGyr+2pd7EKJdrkjdJ+x2+B 1569n6XofePaFO5ueBCZvipaPtbYz3fPxFWCo5ibqTFA+excsHeidXo3mc9E7sJIomtI agAg== X-Gm-Message-State: AOAM533+nMzD8T/AykeFCJCCmEk4EDIC7GqPUMPDc9Tw9KmRlVPpa7sx am6ZE3mz1B+TMM+D8AtnuJCr0mW+MACNHdSv X-Google-Smtp-Source: ABdhPJzs/MK5fsHLDBXpGmZt5Y5/f0iEzH+M7oCf2Vpw1ozahOalfO5BRH9QWAcRH2NaZ6GrkC53KQ== X-Received: by 2002:ac8:110e:: with SMTP id c14mr2358418qtj.293.1613618061639; Wed, 17 Feb 2021 19:14:21 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id 197sm3041430qkf.33.2021.02.17.19.14.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:21 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:20 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 2/8] revision: learn '--no-kept-objects' Message-ID: <82f6b45463586a90b95d7457a463eb7c50838648.1613618042.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org A future caller will want to be able to perform a reachability traversal which terminates when visiting an object found in a kept pack. The closest existing option is '--honor-pack-keep', but this isn't quite what we want. Instead of halting the traversal midway through, a full traversal is always performed, and the results are only trimmed afterwords. Besides needing to introduce a new flag (since culling results post-facto can be different than halting the traversal as it's happening), there is an additional wrinkle handling the distinction in-core and on-disk kept packs. That is: what kinds of kept pack should stop the traversal? Introduce '--no-kept-objects[=]' to specify which kinds of kept packs, if any, should stop a traversal. This can be useful for callers that want to perform a reachability analysis, but want to leave certain packs alone (for e.g., when doing a geometric repack that has some "large" packs which are kept in-core that it wants to leave alone). Note that this option is not guaranteed to produce exactly the set of objects that aren't in kept packs, since it's possible the traversal order may end up in a situation where a non-kept ancestor was "cut off" by a kept object (at which point we would stop traversing). But, we don't care about absolute correctness here, since this will eventually be used as a purely additive guide in an upcoming new repack mode. Explicitly avoid documenting this new flag, since it is only used internally. In theory we could avoid even adding it rev-list, but being able to spell this option out on the command-line makes some special cases easier to test without promising to keep it behaving consistently forever. Those tricky cases are exercised in t6114. Signed-off-by: Taylor Blau --- revision.c | 15 ++++++++++ revision.h | 4 +++ t/t6114-keep-packs.sh | 69 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 88 insertions(+) create mode 100755 t/t6114-keep-packs.sh diff --git a/revision.c b/revision.c index 3efd994160..f9311f3a73 100644 --- a/revision.c +++ b/revision.c @@ -2336,6 +2336,16 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg revs->unpacked = 1; } else if (starts_with(arg, "--unpacked=")) { die(_("--unpacked= no longer supported")); + } else if (!strcmp(arg, "--no-kept-objects")) { + revs->no_kept_objects = 1; + revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS; + } else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) { + revs->no_kept_objects = 1; + if (!strcmp(optarg, "in-core")) + revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + if (!strcmp(optarg, "on-disk")) + revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS; } else if (!strcmp(arg, "-r")) { revs->diff = 1; revs->diffopt.flags.recursive = 1; @@ -3792,6 +3802,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi return commit_ignore; if (revs->unpacked && has_object_pack(&commit->object.oid)) return commit_ignore; + if (revs->no_kept_objects) { + if (has_object_kept_pack(&commit->object.oid, + revs->keep_pack_cache_flags)) + return commit_ignore; + } if (commit->object.flags & UNINTERESTING) return commit_ignore; if (revs->line_level_traverse && !want_ancestry(revs)) { diff --git a/revision.h b/revision.h index e6be3c845e..a20a530d52 100644 --- a/revision.h +++ b/revision.h @@ -148,6 +148,7 @@ struct rev_info { edge_hint_aggressive:1, limited:1, unpacked:1, + no_kept_objects:1, boundary:2, count:1, left_right:1, @@ -317,6 +318,9 @@ struct rev_info { * This is loaded from the commit-graph being used. */ struct bloom_filter_settings *bloom_filter_settings; + + /* misc. flags related to '--no-kept-objects' */ + unsigned keep_pack_cache_flags; }; int ref_excluded(struct string_list *, const char *path); diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh new file mode 100755 index 0000000000..9239d8aa46 --- /dev/null +++ b/t/t6114-keep-packs.sh @@ -0,0 +1,69 @@ +#!/bin/sh + +test_description='rev-list with .keep packs' +. ./test-lib.sh + +test_expect_success 'setup' ' + test_commit loose && + test_commit packed && + test_commit kept && + + KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF + refs/tags/kept + ^refs/tags/packed + EOF + ) && + MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF + refs/tags/packed + ^refs/tags/loose + EOF + ) && + + touch .git/objects/pack/pack-$KEPT_PACK.keep +' + +rev_list_objects () { + git rev-list "$@" >out && + sort out +} + +idx_objects () { + git show-index <$1 >expect-idx && + cut -d" " -f2 kept && + rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept && + + idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect && + comm -3 kept no-kept >actual && + + test_cmp expect actual +' + +test_expect_success '--no-kept-objects excludes kept non-MIDX object' ' + test_config core.multiPackIndex true && + + # Create a pack with just the commit object in pack, and do not mark it + # as kept (even though it appears in $KEPT_PACK, which does have a .keep + # file). + MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF + $(git rev-parse kept) + EOF + ) && + + # Write a MIDX containing all packs, but use the version of the commit + # at "kept" in a non-kept pack by touching $MIDX_PACK. + touch .git/objects/pack/pack-$MIDX_PACK.pack && + git multi-pack-index write && + + rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual && + ( + idx_objects .git/objects/pack/pack-$MISC_PACK.idx && + git rev-list --objects --no-object-names refs/tags/loose + ) | sort >expect && + test_cmp expect actual +' + +test_done From patchwork Thu Feb 18 03:14:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092877 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 64152C433E0 for ; Thu, 18 Feb 2021 03:15:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1BF0D64E3E for ; Thu, 18 Feb 2021 03:15:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230026AbhBRDPY (ORCPT ); Wed, 17 Feb 2021 22:15:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43112 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229985AbhBRDPH (ORCPT ); Wed, 17 Feb 2021 22:15:07 -0500 Received: from mail-qv1-xf2c.google.com (mail-qv1-xf2c.google.com [IPv6:2607:f8b0:4864:20::f2c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 22797C061786 for ; Wed, 17 Feb 2021 19:14:27 -0800 (PST) Received: by mail-qv1-xf2c.google.com with SMTP id p12so331128qvv.5 for ; Wed, 17 Feb 2021 19:14:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=JJ8SFqiTG0pLkrvdhGZMXC2lGDB1NNXKXyVf2UYCxYc=; b=u3+I2dhLojt12ISkB+s/d7WG6fDacdpUoGT8RVvOLUQdXB/G23SIGy/wewj5/jCqWS aqK2SoHlYcjU3EdH+8pM10jeFacITOARItEh8WsfkFlp2f1IAx3o09K8v8VORv0s2pGa sOZdoUVIZ4AgMrlBU+/HMM8saCK7vMZrj0BvK1ZY68UABCJe9kFU0Ka5X6zPznGUtb3a qQ/nyiePltyBGjB64tqW8cYOau2BPsYJ5bOVliUULjAaW75MieygHYn/xj6U6OZyD/kJ sRSG6Wp+4CHGfj1g1bZFttWqrT4nDoU4/mjuVkBZAf/T05aIsW4PgnPnXLbBj2x3M3EA 1GmQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=JJ8SFqiTG0pLkrvdhGZMXC2lGDB1NNXKXyVf2UYCxYc=; b=CMCGMrEeb/Q0OrEa/Y6pG7QjiGcwIqZYlhvhPjzMXTgiRSXufW85Ms45Dpzz2vsxqS tnbkYQR9/VYuX7p41z2OX/KWO8GpN430RO99WbjSyVKRq0aqWZx/irSOkdgwCPYrBRnE KKS8PLgd/YA5SZ0NbzBGY1jNsfBGwfUSE5xRpfvr78fwbwTOMTF9xMOcZvmedyBEmpP6 RyDsvbQEaembBU7TJSTNzTmt4jkiYKNtVwl6fd7DLVlfeBQ19xA9f0euTx1eOwVBELOm Al4X3gEzUhNzRcauJ9rI62GI4GJ2f81uVp+MsV9xKZfTWt2mn5lec+jOd/Miz7INV7Yb j3jQ== X-Gm-Message-State: AOAM531FMgyu205+e6LxoPYPgNGyHwosn1qFmewK5ed2xcw4fMasW0ky Q5S6WZDB82O95TRDZ+ECpVVkDBcbNnIUxDlK X-Google-Smtp-Source: ABdhPJyASfmCN3lB8hMcgYIzv3nmZklT2SkAAdhaM48OoEhRv8cAX5VYY6tvLe6GXK11F30lLwnelA== X-Received: by 2002:a05:6214:94a:: with SMTP id dn10mr2200177qvb.28.1613618065023; Wed, 17 Feb 2021 19:14:25 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id d5sm2557770qti.66.2021.02.17.19.14.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:24 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:23 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 3/8] builtin/pack-objects.c: add '--stdin-packs' option Message-ID: <033e4e3f67b96489c3ba1b2ab7977e23fac34189.1613618042.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org In an upcoming commit, 'git repack' will want to create a pack comprised of all of the objects in some packs (the included packs) excluding any objects in some other packs (the excluded packs). This caller could iterate those packs themselves and feed the objects it finds to 'git pack-objects' directly over stdin, but this approach has a few downsides: - It requires every caller that wants to drive 'git pack-objects' in this way to implement pack iteration themselves. This forces the caller to think about details like what order objects are fed to pack-objects, which callers would likely rather not do. - If the set of objects in included packs is large, it requires sending a lot of data over a pipe, which is inefficient. - The caller is forced to keep track of the excluded objects, too, and make sure that it doesn't send any objects that appear in both included and excluded packs. But the biggest downside is the lack of a reachability traversal. Because the caller passes in a list of objects directly, those objects don't get a namehash assigned to them, which can have a negative impact on the delta selection process, causing 'git pack-objects' to fail to find good deltas even when they exist. The caller could formulate a reachability traversal themselves, but the only way to drive 'git pack-objects' in this way is to do a full traversal, and then remove objects in the excluded packs after the traversal is complete. This can be detrimental to callers who care about performance, especially in repositories with many objects. Introduce 'git pack-objects --stdin-packs' which remedies these four concerns. 'git pack-objects --stdin-packs' expects a list of pack names on stdin, where 'pack-xyz.pack' denotes that pack as included, and '^pack-xyz.pack' denotes it as excluded. The resulting pack includes all objects that are present in at least one included pack, and aren't present in any excluded pack. To address the delta selection problem, 'git pack-objects --stdin-packs' works as follows. First, it assembles a list of objects that it is going to pack, as above. Then, a reachability traversal is started, whose tips are any commits mentioned in included packs. Upon visiting an object, we find its corresponding object_entry in the to_pack list, and set its namehash parameter appropriately. To avoid the traversal visiting more objects than it needs to, the traversal is halted upon encountering an object which can be found in an excluded pack (by marking the excluded packs as kept in-core, and passing --no-kept-objects=in-core to the revision machinery). This can cause the traversal to halt early, for example if an object in an included pack is an ancestor of ones in excluded packs. But stopping early is OK, since filling in the namehash fields of objects in the to_pack list is only additive (i.e., having it helps the delta selection process, but leaving it blank doesn't impact the correctness of the resulting pack). Even still, it is unlikely that this hurts us much in practice, since the 'git repack --geometric' caller (which is introduced in a later commit) marks small packs as included, and large ones as excluded. During ordinary use, the small packs usually represent pushes after a large repack, and so are unlikely to be ancestors of objects that already exist in the repository. (I found it convenient while developing this patch to have 'git pack-objects' report the number of objects which were visited and got their namehash fields filled in during traversal. This is also included in the below patch via trace2 data lines). Suggested-by: Jeff King Signed-off-by: Taylor Blau --- Documentation/git-pack-objects.txt | 10 ++ builtin/pack-objects.c | 198 ++++++++++++++++++++++++++++- t/t5300-pack-object.sh | 97 ++++++++++++++ 3 files changed, 303 insertions(+), 2 deletions(-) diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 54d715ead1..df533c3b19 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -85,6 +85,16 @@ base-name:: reference was included in the resulting packfile. This can be useful to send new tags to native Git clients. +--stdin-packs:: + Read the basenames of packfiles (e.g., `pack-1234abcd.pack`) + from the standard input, instead of object names or revision + arguments. The resulting pack contains all objects listed in the + included packs (those not beginning with `^`), excluding any + objects listed in the excluded packs (beginning with `^`). ++ +Incompatible with `--revs`, or options that imply `--revs` (such as +`--all`), with the exception of `--unpacked`, which is compatible. + --window=:: --depth=:: These two options affect how the objects contained in diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 6d62aaf59a..e766a4a43b 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2986,6 +2986,186 @@ static int git_pack_config(const char *k, const char *v, void *cb) return git_default_config(k, v, cb); } +/* Counters for trace2 output when in --stdin-packs mode. */ +static int stdin_packs_found_nr; +static int stdin_packs_hints_nr; + +static int add_object_entry_from_pack(const struct object_id *oid, + struct packed_git *p, + uint32_t pos, + void *_data) +{ + struct rev_info *revs = _data; + struct object_info oi = OBJECT_INFO_INIT; + off_t ofs; + enum object_type type; + + display_progress(progress_state, ++nr_seen); + + if (have_duplicate_entry(oid, 0)) + return 0; + + ofs = nth_packed_object_offset(p, pos); + if (!want_object_in_pack(oid, 0, &p, &ofs)) + return 0; + + oi.typep = &type; + if (packed_object_info(the_repository, p, ofs, &oi) < 0) + die(_("could not get type of object %s in pack %s"), + oid_to_hex(oid), p->pack_name); + else if (type == OBJ_COMMIT) { + /* + * commits in included packs are used as starting points for the + * subsequent revision walk + */ + add_pending_oid(revs, NULL, oid, 0); + } + + stdin_packs_found_nr++; + + create_object_entry(oid, type, 0, 0, 0, p, ofs); + + return 0; +} + +static void show_commit_pack_hint(struct commit *commit, void *_data) +{ + /* nothing to do; commits don't have a namehash */ +} + +static void show_object_pack_hint(struct object *object, const char *name, + void *_data) +{ + struct object_entry *oe = packlist_find(&to_pack, &object->oid); + if (!oe) + return; + + /* + * Our 'to_pack' list was constructed by iterating all objects packed in + * included packs, and so doesn't have a non-zero hash field that you + * would typically pick up during a reachability traversal. + * + * Make a best-effort attempt to fill in the ->hash and ->no_try_delta + * here using a now in order to perhaps improve the delta selection + * process. + */ + oe->hash = pack_name_hash(name); + oe->no_try_delta = name && no_try_delta(name); + + stdin_packs_hints_nr++; +} + +static int pack_mtime_cmp(const void *_a, const void *_b) +{ + struct packed_git *a = ((const struct string_list_item*)_a)->util; + struct packed_git *b = ((const struct string_list_item*)_b)->util; + + if (a->mtime < b->mtime) + return -1; + else if (b->mtime < a->mtime) + return 1; + else + return 0; +} + +static void read_packs_list_from_stdin(void) +{ + struct strbuf buf = STRBUF_INIT; + struct string_list include_packs = STRING_LIST_INIT_DUP; + struct string_list exclude_packs = STRING_LIST_INIT_DUP; + struct string_list_item *item = NULL; + + struct packed_git *p; + struct rev_info revs; + + repo_init_revisions(the_repository, &revs, NULL); + /* + * Use a revision walk to fill in the namehash of objects in the include + * packs. To save time, we'll avoid traversing through objects that are + * in excluded packs. + * + * That may cause us to avoid populating all of the namehash fields of + * all included objects, but our goal is best-effort, since this is only + * an optimization during delta selection. + */ + revs.no_kept_objects = 1; + revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + revs.blob_objects = 1; + revs.tree_objects = 1; + revs.tag_objects = 1; + + while (strbuf_getline(&buf, stdin) != EOF) { + if (!buf.len) + continue; + + if (*buf.buf == '^') + string_list_append(&exclude_packs, buf.buf + 1); + else + string_list_append(&include_packs, buf.buf); + + strbuf_reset(&buf); + } + + string_list_sort(&include_packs); + string_list_sort(&exclude_packs); + + for (p = get_all_packs(the_repository); p; p = p->next) { + const char *pack_name = pack_basename(p); + + item = string_list_lookup(&include_packs, pack_name); + if (!item) + item = string_list_lookup(&exclude_packs, pack_name); + + if (item) + item->util = p; + } + + /* + * First handle all of the excluded packs, marking them as kept in-core + * so that later calls to add_object_entry() discards any objects that + * are also found in excluded packs. + */ + for_each_string_list_item(item, &exclude_packs) { + struct packed_git *p = item->util; + if (!p) + die(_("could not find pack '%s'"), item->string); + p->pack_keep_in_core = 1; + } + + /* + * Order packs by ascending mtime; use QSORT directly to access the + * string_list_item's ->util pointer, which string_list_sort() does not + * provide. + */ + QSORT(include_packs.items, include_packs.nr, pack_mtime_cmp); + + for_each_string_list_item(item, &include_packs) { + struct packed_git *p = item->util; + if (!p) + die(_("could not find pack '%s'"), item->string); + for_each_object_in_pack(p, + add_object_entry_from_pack, + &revs, + FOR_EACH_OBJECT_PACK_ORDER); + } + + if (prepare_revision_walk(&revs)) + die(_("revision walk setup failed")); + traverse_commit_list(&revs, + show_commit_pack_hint, + show_object_pack_hint, + NULL); + + trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found", + stdin_packs_found_nr); + trace2_data_intmax("pack-objects", the_repository, "stdin_packs_hints", + stdin_packs_hints_nr); + + strbuf_release(&buf); + string_list_clear(&include_packs, 0); + string_list_clear(&exclude_packs, 0); +} + static void read_object_list_from_stdin(void) { char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2]; @@ -3489,6 +3669,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) struct strvec rp = STRVEC_INIT; int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0; int rev_list_index = 0; + int stdin_packs = 0; struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; struct option pack_objects_options[] = { OPT_SET_INT('q', "quiet", &progress, @@ -3539,6 +3720,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) OPT_SET_INT_F(0, "indexed-objects", &rev_list_index, N_("include objects referred to by the index"), 1, PARSE_OPT_NONEG), + OPT_BOOL(0, "stdin-packs", &stdin_packs, + N_("read packs from stdin")), OPT_BOOL(0, "stdout", &pack_to_stdout, N_("output pack to stdout")), OPT_BOOL(0, "include-tag", &include_tag, @@ -3645,7 +3828,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) use_internal_rev_list = 1; strvec_push(&rp, "--indexed-objects"); } - if (rev_list_unpacked) { + if (rev_list_unpacked && !stdin_packs) { use_internal_rev_list = 1; strvec_push(&rp, "--unpacked"); } @@ -3690,8 +3873,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (filter_options.choice) { if (!pack_to_stdout) die(_("cannot use --filter without --stdout")); + if (stdin_packs) + die(_("cannot use --filter with --stdin-packs")); } + if (stdin_packs && use_internal_rev_list) + die(_("cannot use internal rev list with --stdin-packs")); + /* * "soft" reasons not to use bitmaps - for on-disk repack by default we want * @@ -3750,7 +3938,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (progress) progress_state = start_progress(_("Enumerating objects"), 0); - if (!use_internal_rev_list) + if (stdin_packs) { + /* avoids adding objects in excluded packs */ + ignore_packed_keep_in_core = 1; + read_packs_list_from_stdin(); + if (rev_list_unpacked) + add_unreachable_loose_objects(); + } else if (!use_internal_rev_list) read_object_list_from_stdin(); else { get_object_list(rp.nr, rp.v); diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh index 392201cabd..7138a54595 100755 --- a/t/t5300-pack-object.sh +++ b/t/t5300-pack-object.sh @@ -532,4 +532,101 @@ test_expect_success 'prefetch objects' ' test_line_count = 1 donelines ' +test_expect_success 'setup for --stdin-packs tests' ' + git init stdin-packs && + ( + cd stdin-packs && + + test_commit A && + test_commit B && + test_commit C && + + for id in A B C + do + git pack-objects .git/objects/pack/pack-$id \ + --incremental --revs <<-EOF + refs/tags/$id + EOF + done && + + ls -la .git/objects/pack + ) +' + +test_expect_success '--stdin-packs with excluded packs' ' + ( + cd stdin-packs && + + PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" && + PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" && + PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" && + + git pack-objects test --stdin-packs <<-EOF && + $PACK_A + ^$PACK_B + $PACK_C + EOF + + ( + git show-index <$(ls .git/objects/pack/pack-A-*.idx) && + git show-index <$(ls .git/objects/pack/pack-C-*.idx) + ) >expect.raw && + git show-index <$(ls test-*.idx) >actual.raw && + + cut -d" " -f2 expect && + cut -d" " -f2 actual && + test_cmp expect actual + ) +' + +test_expect_success '--stdin-packs is incompatible with --filter' ' + ( + cd stdin-packs && + test_must_fail git pack-objects --stdin-packs --stdout \ + --filter=blob:none err && + test_i18ngrep "cannot use --filter with --stdin-packs" err + ) +' + +test_expect_success '--stdin-packs is incompatible with --revs' ' + ( + cd stdin-packs && + test_must_fail git pack-objects --stdin-packs --revs out \ + err && + test_i18ngrep "cannot use internal rev list with --stdin-packs" err + ) +' + +test_expect_success '--stdin-packs with loose objects' ' + ( + cd stdin-packs && + + PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" && + PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" && + PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" && + + test_commit D && # loose + + git pack-objects test2 --stdin-packs --unpacked <<-EOF && + $PACK_A + ^$PACK_B + $PACK_C + EOF + + ( + git show-index <$(ls .git/objects/pack/pack-A-*.idx) && + git show-index <$(ls .git/objects/pack/pack-C-*.idx) && + git rev-list --objects --no-object-names \ + refs/tags/C..refs/tags/D + + ) >expect.raw && + ls -la . && + git show-index <$(ls test2-*.idx) >actual.raw && + + cut -d" " -f2 expect && + cut -d" " -f2 actual && + test_cmp expect actual + ) +' + test_done From patchwork Thu Feb 18 03:14:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092875 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2D744C433DB for ; Thu, 18 Feb 2021 03:15:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E422A64E45 for ; Thu, 18 Feb 2021 03:15:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230017AbhBRDPW (ORCPT ); Wed, 17 Feb 2021 22:15:22 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43122 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229996AbhBRDPJ (ORCPT ); Wed, 17 Feb 2021 22:15:09 -0500 Received: from mail-qv1-xf2f.google.com (mail-qv1-xf2f.google.com [IPv6:2607:f8b0:4864:20::f2f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 92074C061788 for ; Wed, 17 Feb 2021 19:14:29 -0800 (PST) Received: by mail-qv1-xf2f.google.com with SMTP id g3so338941qvl.2 for ; Wed, 17 Feb 2021 19:14:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=D7a2htNh5TmV1PuPKkr4mM1Sv6UdleXGHbTmlSyJ708=; b=QBN5nD0ouzn+QBE9JJlmLMH6mNfuV5jWxSdm8gdI06GkkWoxTep02zhej5fLtjwj49 y65JMMSMa+iK2kVPubMAZjT3WYnhgmnEcb39EZMZ99jIc/qDsiquD293GtL4CIq8ujHp RV7PO0z9ldAG2MF03fACccAFj8ouvIjdmF2/ClR/JF24NJ0Vsw83fnS0dREx0HXlabdy NyrQYn9Ex37MZ9qBkoSOVG+aOzp/I2Paq0e1oLzrVM+iYfDrRp4pTU+MAucexRafGaNp SOGHXrgFhBi5cWblEpbXFVL73ysZzeG+hh1JBIT9pXpzSjf2uv5xhvoLb8cq6CO/8Quy /N0Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=D7a2htNh5TmV1PuPKkr4mM1Sv6UdleXGHbTmlSyJ708=; b=XNBOOXPt4x1sBvyyLWWLTplDMZcKGxJmtSbBwAQHuCx+Gw8Czjrkgv5zO14u4eQe4v 7Ffrgn3pxRel0+HPVQsPTcNABNqgt+O91dQeWp0h/laLYHh0UQEpP8rYzXFtcGTRfeyd a72v6BpXCQN0tmenCi+reiV7uQrN0vTP2NDgo2zEuX6x7KHDRwSxth5opHkLNxv6WUjS 7UEjXMvgVAbaBY6xfRv6pDIfmLwvH+KWLLwMwLubFFm1JhJr8aUQ3KYrK/fZRdyR7i7q l/xUQ9q5yTjHkHbbtpWDTVI9+t9QzqjH5m8NFk0WQ49zArQmStb8vEDzIk5rLNlrh9YH 4bXA== X-Gm-Message-State: AOAM533+Nnnaa1qQyuyMnIg+LQl5Qj7maLTDq/G3hVQ0hUIUo6K3vztX 8qmXx0p1E89DDnta6e5J6HExo3MQLFMDygH9 X-Google-Smtp-Source: ABdhPJxV3uJ2dINg8z4efkU15RYCxpKWbbF6xlIXGsaJBvz79EaRB17NTNvK95nqBHhqQavLbPlCxw== X-Received: by 2002:ad4:4aaf:: with SMTP id i15mr2549053qvx.3.1613618068547; Wed, 17 Feb 2021 19:14:28 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id l127sm3151941qke.3.2021.02.17.19.14.27 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:28 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:26 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 4/8] p5303: add missing &&-chains Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King These are in a helper function, so the usual chain-lint doesn't notice them. This function is still not perfect, as it has some git invocations on the left-hand-side of the pipe, but it's primary purpose is timing, not finding bugs or correctness issues. Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- t/perf/p5303-many-packs.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh index ce0c42cc9f..d90d714923 100755 --- a/t/perf/p5303-many-packs.sh +++ b/t/perf/p5303-many-packs.sh @@ -28,11 +28,11 @@ repack_into_n () { push @commits, $_ if $. % 5 == 1; } print reverse @commits; - ' "$1" >pushes + ' "$1" >pushes && # create base packfile head -n 1 pushes | - git pack-objects --delta-base-offset --revs staging/pack + git pack-objects --delta-base-offset --revs staging/pack && # and then incrementals between each pair of commits last= && From patchwork Thu Feb 18 03:14:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092881 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2089EC433E6 for ; Thu, 18 Feb 2021 03:15:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D2DF164E2F for ; Thu, 18 Feb 2021 03:15:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230079AbhBRDPl (ORCPT ); Wed, 17 Feb 2021 22:15:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43228 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229999AbhBRDPj (ORCPT ); Wed, 17 Feb 2021 22:15:39 -0500 Received: from mail-qt1-x834.google.com (mail-qt1-x834.google.com [IPv6:2607:f8b0:4864:20::834]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1423EC06178A for ; Wed, 17 Feb 2021 19:14:33 -0800 (PST) Received: by mail-qt1-x834.google.com with SMTP id w20so497842qta.0 for ; Wed, 17 Feb 2021 19:14:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=Qijf3IlhB1tp7PkvrvMj/xijeNQlItQ/WjZtP7G/JP8=; b=bCtirxWZmflq3BeNqtuSYr5AqozHTyyfUUeJpC/zKMmnfnzXnlRQ2V9SfyEOfpaNZP eEP80WzVHy+aHe30pyOdZDN+GWJj+F1ZhS3655W0XZc4v4Ll1969DtZTm9KNat5I4iit VwUhRJWI/n5QzsjdYVZKda4OSPkqcAvZKbWQxYNM5mZWPegADDlW6sOHJlilcT96U8Re tBbAE2JJ46OaSYvd0s1zcRlcCTqmJy71y+weUdz6BAQDIJF4RRSIbN20CpLmZmtUibqu o8axU0NUoGdMakYXioyf89YL9DkRSCFT6MoyML6E2LMD2MXBj6kOmyOuqBrhk39ma31Q VbZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=Qijf3IlhB1tp7PkvrvMj/xijeNQlItQ/WjZtP7G/JP8=; b=X1a3ZI0csN4JG8nX9bROglVTVEsYVSn/DBffXA+CnWGz0vih1GtKQFFB6kZEtozyzc y370xXaGBkMFRvzqNlxYDKWAI7wNmUHadDeOv/HDuoSl2tvWHJ2Q0SbqdSXxl6Zu3BGg +PyLY6c65JTr92BZktHdWyo6026Vin6S96rvISz1EuB80Rem7P3ulcL17ycEToE3OTf+ gE2VrjnzQGJv4YgPTRe2QC7Wx0YcHpQEEgaShlI0kDjVoFvxCc5/Niy3m9FZ7vtH9JVV 84TXjfsDKKWBO+iAL6G1y3xT9iCc5oEUkHyAdJCxsJceyyGeZsm1euWcvx24P3ki8lHi S8wg== X-Gm-Message-State: AOAM533MePlHBW2QG359042r3aPgXZ5nrPlrWBVYfpD3V6YoVXSwOxGx 5puflun13ze/owDxzcSmahE+VSlQJ5jwczRn X-Google-Smtp-Source: ABdhPJyxCOvl+Qe/te67/lWvaO7JP13IJrlLQAEBfLKQfZO4Q2QZAMts4i4cxi8CogUjyNV+f47Dmw== X-Received: by 2002:ac8:530f:: with SMTP id t15mr2409638qtn.167.1613618072030; Wed, 17 Feb 2021 19:14:32 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id e23sm2577703qtr.76.2021.02.17.19.14.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:31 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:30 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 5/8] p5303: measure time to repack with keep Message-ID: <181c104a038f1ad8c30e68013bcdbc79cf394ea4.1613618042.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King Add two new tests to measure repack performance. Both test split the repository into synthetic "pushes", and then leave the remaining objects in a big base pack. The first new test marks an empty pack as "kept" and then passes --honor-pack-keep to avoid including objects in it. That doesn't change the resulting pack, but it does let us compare to the normal repack case to see how much overhead we add to check whether objects are kept or not. The other test is of --stdin-packs, which gives us a sense of how that number scales based on the number of packs we provide as input. In each of those tests, the empty pack isn't considered, but the residual pack (objects that were left over and not included in one of the synthetic push packs) is marked as kept. (Note that in the single-pack case of the --stdin-packs test, there is nothing do since there are no non-excluded packs). Here are some timings on a recent clone of the kernel: 5303.5: repack (1) 57.26(54.59+10.84) 5303.6: repack with kept (1) 57.33(54.80+10.51) in the 50-pack case, things start to slow down: 5303.11: repack (50) 71.54(88.57+4.84) 5303.12: repack with kept (50) 85.12(102.05+4.94) and by the time we hit 1,000 packs, things are substantially worse, even though the resulting pack produced is the same: 5303.17: repack (1000) 216.87(490.79+14.57) 5303.18: repack with kept (1000) 665.63(938.87+15.76) Likewise, the scaling is pretty extreme on --stdin-packs: 5303.7: repack with --stdin-packs (1) 0.01(0.01+0.00) 5303.13: repack with --stdin-packs (50) 3.53(12.07+0.24) 5303.19: repack with --stdin-packs (1000) 195.83(371.82+8.10) That's because the code paths around handling .keep files are known to scale badly; they look in every single pack file to find each object. Our solution to that was to notice that most repos don't have keep files, and to make that case a fast path. But as soon as you add a single .keep, that part of pack-objects slows down again (even if we have fewer objects total to look at). Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- t/perf/p5303-many-packs.sh | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh index d90d714923..35c0cbdf49 100755 --- a/t/perf/p5303-many-packs.sh +++ b/t/perf/p5303-many-packs.sh @@ -31,8 +31,15 @@ repack_into_n () { ' "$1" >pushes && # create base packfile - head -n 1 pushes | - git pack-objects --delta-base-offset --revs staging/pack && + base_pack=$( + head -n 1 pushes | + git pack-objects --delta-base-offset --revs staging/pack + ) && + test_export base_pack && + + # create an empty packfile + empty_pack=$(git pack-objects staging/pack stdin.packs + # and install the whole thing rm -f .git/objects/pack/* && mv staging/* .git/objects/pack/ @@ -91,6 +104,23 @@ do --reflog --indexed-objects --delta-base-offset \ --stdout /dev/null ' + + test_perf "repack with kept ($nr_packs)" ' + git pack-objects --keep-true-parents \ + --keep-pack=pack-$empty_pack.pack \ + --honor-pack-keep --non-empty --all \ + --reflog --indexed-objects --delta-base-offset \ + --stdout /dev/null + ' + + test_perf "repack with --stdin-packs ($nr_packs)" ' + git pack-objects \ + --keep-true-parents \ + --stdin-packs \ + --non-empty \ + --delta-base-offset \ + --stdout /dev/null + ' done # Measure pack loading with 10,000 packs. From patchwork Thu Feb 18 03:14:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092879 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 59E4EC433DB for ; Thu, 18 Feb 2021 03:15:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1AE0C6146D for ; Thu, 18 Feb 2021 03:15:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230063AbhBRDPk (ORCPT ); Wed, 17 Feb 2021 22:15:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43230 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230028AbhBRDPj (ORCPT ); Wed, 17 Feb 2021 22:15:39 -0500 Received: from mail-qv1-xf2c.google.com (mail-qv1-xf2c.google.com [IPv6:2607:f8b0:4864:20::f2c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C9BE6C06178B for ; Wed, 17 Feb 2021 19:14:36 -0800 (PST) Received: by mail-qv1-xf2c.google.com with SMTP id 2so345660qvd.0 for ; Wed, 17 Feb 2021 19:14:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=J5a4ml0WAk4pr9Z82hPYk8r78Ylqi8uXBMPpVDXCN4w=; b=Xrd75j6XOgGg+Db1ZaeOtAD9Pks6kNbGgxe3ce6juEogeXbc8jrgITtpnO01t+ASoI RMIdbSYCd7aYRzeXSUmJUEkwan+PX2sXzF/zHxg5tkV5zNoUPQIV3hG2gCev5zkY2apL nY3HkAplRXLTpEHAypX3l3fR+mo961ET1RIsLFrYOmR4oOE4ubJoJtShh1MmfWQqWYf6 K1sFbE8PAEhOGnszjAqrJnlr4OV/O0d39qFa6Z7MT6LmXGwoDfLXz0wyw18kazxOlcpa mxuocUpLH/qmM6S1ki1N262iw8sj7MDPbSIGBmwasMG306+HxGmpFmHEoKjM0Us6Lusy lomQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=J5a4ml0WAk4pr9Z82hPYk8r78Ylqi8uXBMPpVDXCN4w=; b=ntfXSCVy3AxOVnurQMMb9pnRxC1NjznFqZ/1fhXXBbMROG/+zFdOmWT9GhN9lc8WYL OEtka1u57fOP92Sh+WFPXXWJsIpDL8cWVDs+87dzYaIT+2D9ENxhu2ivBXU5Cw0eGAPY OO5ar70yIwgoPmvAT6uwfCvCXwqqpX45mGprx1RLPFKJjGEz49EZd0xpmogFi1Pya1Kc hmQd6D+Q0Ht/Z19n1K4/8Q3bY5Vomnn725mlfZPUeg7f2P3C9YdAK2C6etcDEpuCMLSm /El3RlTdbHm6gPwNT2QCenPyrGQKP4V3AemGMBseflbyI6DHLtgNK7oC60LsSnlYsFqr d1gQ== X-Gm-Message-State: AOAM531hGgHmMFoRZZrw5ajET7x6txt+p37B8Mc/FrIE0X8Hr1wpgOLa EEzg+K9kJ3o8aRsa1f4mD2WOPr3T8yq29fj/ X-Google-Smtp-Source: ABdhPJxw9qA204yl169h2oYBJNR6C5nsWiln2kpOFOjakhVnEwpifdUwPgnWv/YzoI0CuAADSbRfIw== X-Received: by 2002:a0c:a241:: with SMTP id f59mr2380008qva.33.1613618075602; Wed, 17 Feb 2021 19:14:35 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id c7sm2573412qtc.82.2021.02.17.19.14.34 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:35 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:33 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Message-ID: <67af143fd1f3bcece0a8b27894cbdcdc5ae60ae8.1613618042.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King Now that we have find_kept_pack_entry(), we don't have to manually keep hunting through every pack to find a possible "kept" duplicate of the object. This should be faster, assuming only a portion of your total packs are actually kept. Note that we have to re-order the logic a bit here; we can deal with the disqualifying situations first (e.g., finding the object in a non-local pack with --local), then "kept" situation(s), and then just fall back to other "--local" conditions. Here are the results from p5303 (measurements again taken on the kernel): Test HEAD^ HEAD ----------------------------------------------------------------------------------------------- 5303.5: repack (1) 57.26(54.59+10.84) 57.34(54.66+10.88) +0.1% 5303.6: repack with kept (1) 57.33(54.80+10.51) 57.38(54.83+10.49) +0.1% 5303.11: repack (50) 71.54(88.57+4.84) 71.70(88.99+4.74) +0.2% 5303.12: repack with kept (50) 85.12(102.05+4.94) 72.58(89.61+4.78) -14.7% 5303.17: repack (1000) 216.87(490.79+14.57) 217.19(491.72+14.25) +0.1% 5303.18: repack with kept (1000) 665.63(938.87+15.76) 246.12(520.07+14.93) -63.0% and the --stdin-packs timings: 5303.7: repack with --stdin-packs (1) 0.01(0.01+0.00) 0.00(0.00+0.00) -100.0% 5303.13: repack with --stdin-packs (50) 3.53(12.07+0.24) 3.43(11.75+0.24) -2.8% 5303.19: repack with --stdin-packs (1000) 195.83(371.82+8.10) 130.50(307.15+7.66) -33.4% So our repack with an empty .keep pack is roughly as fast as one without a .keep pack up to 50 packs. But the --stdin-packs case scales a little better, too. Notably, it is faster than a repack of the same size and a kept pack. It looks at fewer objects, of course, but the penalty for looking at many packs isn't as costly. Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- builtin/pack-objects.c | 131 ++++++++++++++++++++++++----------------- 1 file changed, 78 insertions(+), 53 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index e766a4a43b..be3ba60bc2 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1188,7 +1188,8 @@ static int have_duplicate_entry(const struct object_id *oid, return 1; } -static int want_found_object(int exclude, struct packed_git *p) +static int want_found_object(const struct object_id *oid, int exclude, + struct packed_git *p) { if (exclude) return 1; @@ -1204,27 +1205,82 @@ static int want_found_object(int exclude, struct packed_git *p) * make sure no copy of this object appears in _any_ pack that makes us * to omit the object, so we need to check all the packs. * - * We can however first check whether these options can possible matter; + * We can however first check whether these options can possibly matter; * if they do not matter we know we want the object in generated pack. * Otherwise, we signal "-1" at the end to tell the caller that we do * not know either way, and it needs to check more packs. */ - if (!ignore_packed_keep_on_disk && - !ignore_packed_keep_in_core && - (!local || !have_non_local_packs)) - return 1; + /* + * Objects in packs borrowed from elsewhere are discarded regardless of + * if they appear in other packs that weren't borrowed. + */ if (local && !p->pack_local) return 0; - if (p->pack_local && - ((ignore_packed_keep_on_disk && p->pack_keep) || - (ignore_packed_keep_in_core && p->pack_keep_in_core))) - return 0; + + /* + * Then handle .keep first, as we have a fast(er) path there. + */ + if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) { + /* + * Set the flags for the kept-pack cache to be the ones we want + * to ignore. + * + * That is, if we are ignoring objects in on-disk keep packs, + * then we want to search through the on-disk keep and ignore + * the in-core ones. + */ + unsigned flags = 0; + if (ignore_packed_keep_on_disk) + flags |= ON_DISK_KEEP_PACKS; + if (ignore_packed_keep_in_core) + flags |= IN_CORE_KEEP_PACKS; + + if (ignore_packed_keep_on_disk && p->pack_keep) + return 0; + if (ignore_packed_keep_in_core && p->pack_keep_in_core) + return 0; + if (has_object_kept_pack(oid, flags)) + return 0; + } + + /* + * At this point we know definitively that either we don't care about + * keep-packs, or the object is not in one. Keep checking other + * conditions... + */ + if (!local || !have_non_local_packs) + return 1; /* we don't know yet; keep looking for more packs */ return -1; } +static int want_object_in_pack_one(struct packed_git *p, + const struct object_id *oid, + int exclude, + struct packed_git **found_pack, + off_t *found_offset) +{ + off_t offset; + + if (p == *found_pack) + offset = *found_offset; + else + offset = find_pack_entry_one(oid->hash, p); + + if (offset) { + if (!*found_pack) { + if (!is_pack_valid(p)) + return -1; + *found_offset = offset; + *found_pack = p; + } + return want_found_object(oid, exclude, p); + } + return -1; +} + /* * Check whether we want the object in the pack (e.g., we do not want * objects found in non-local stores if the "--local" option was used). @@ -1252,7 +1308,7 @@ static int want_object_in_pack(const struct object_id *oid, * are present we will determine the answer right now. */ if (*found_pack) { - want = want_found_object(exclude, *found_pack); + want = want_found_object(oid, exclude, *found_pack); if (want != -1) return want; } @@ -1260,53 +1316,22 @@ static int want_object_in_pack(const struct object_id *oid, for (m = get_multi_pack_index(the_repository); m; m = m->next) { struct pack_entry e; if (fill_midx_entry(the_repository, oid, &e, m)) { - struct packed_git *p = e.p; - off_t offset; - - if (p == *found_pack) - offset = *found_offset; - else - offset = find_pack_entry_one(oid->hash, p); - - if (offset) { - if (!*found_pack) { - if (!is_pack_valid(p)) - continue; - *found_offset = offset; - *found_pack = p; - } - want = want_found_object(exclude, p); - if (want != -1) - return want; - } - } - } - - list_for_each(pos, get_packed_git_mru(the_repository)) { - struct packed_git *p = list_entry(pos, struct packed_git, mru); - off_t offset; - - if (p == *found_pack) - offset = *found_offset; - else - offset = find_pack_entry_one(oid->hash, p); - - if (offset) { - if (!*found_pack) { - if (!is_pack_valid(p)) - continue; - *found_offset = offset; - *found_pack = p; - } - want = want_found_object(exclude, p); - if (!exclude && want > 0) - list_move(&p->mru, - get_packed_git_mru(the_repository)); + want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset); if (want != -1) return want; } } + list_for_each(pos, get_packed_git_mru(the_repository)) { + struct packed_git *p = list_entry(pos, struct packed_git, mru); + want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset); + if (!exclude && want > 0) + list_move(&p->mru, + get_packed_git_mru(the_repository)); + if (want != -1) + return want; + } + if (uri_protocols.nr) { struct configured_exclusion *ex = oidmap_get(&configured_exclusions, oid); From patchwork Thu Feb 18 03:14:37 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092883 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6D355C433E0 for ; Thu, 18 Feb 2021 03:15:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 33B4A64E2F for ; Thu, 18 Feb 2021 03:15:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230087AbhBRDPm (ORCPT ); Wed, 17 Feb 2021 22:15:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43234 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230064AbhBRDPk (ORCPT ); Wed, 17 Feb 2021 22:15:40 -0500 Received: from mail-qk1-x729.google.com (mail-qk1-x729.google.com [IPv6:2607:f8b0:4864:20::729]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 44FD7C06178C for ; Wed, 17 Feb 2021 19:14:40 -0800 (PST) Received: by mail-qk1-x729.google.com with SMTP id v206so904069qkb.3 for ; Wed, 17 Feb 2021 19:14:40 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=W7BAgY6QH8/c+NOhD4ioZpf2YtEHxcDQW5EvyimR7YQ=; b=XwrlUxDwBwxEsLpnjgjTjl+AzNQzExEYu62DP92Mk65w1HGd1U/PG3KjkBHRIdjsgT mF9hx9FN4tpGm0jKGZoP+QXPZqQHkS207TRrkBnX4DwhqWB6F/bjCYDVz11Oon9oqUZU GfTL19E19tjj6nJ0n2WYLFR84KTdoV2qOUGFRnjQp8N8sjcPmC2hd8a/vdPTc9M6bBMO N9FIO5AzaD04UXpcCsca2MfJeVlQDLePx9eZmjNEKohM1GVSILAroPeQrdcTFy95+wOE Do0CadwdVikVDmZSFS7eiWR6qiBZtHHiYVXZdOY0lyUQlL6UJJ6Pz8rg+5pC50kaAAbY lfqA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=W7BAgY6QH8/c+NOhD4ioZpf2YtEHxcDQW5EvyimR7YQ=; b=R0vRUE6Kv+430wB4veijsefpqzAuKLoRPvk5RQyV5eBr79UT6MMoSxKhmjgoZRNrqZ QgO6sp/i7fYyI/QlWRNzO0w4LboYH38excsMm++eXHLww+yTrnc1hxXwMeFsO8AoXpHt /SvIUHNWGjfBcVkID3vdEor0dKp++xr4+xfijpYMbORnv6Z1KRzJ7sYIYDAqJi6AUOIU betdEjLOpELPR3bODMkWpWUbGuND9VsVEWXsIeQQtDwlj1fgPH4QWq/vcEt4T/Ok+npG PRLABizT9xxVEX7j86n6vM1SHVW5L8CYyG9lHk/Djn0bWv/vQ2/D+EUiAIOSjVEgcdNO nO2w== X-Gm-Message-State: AOAM5331Vz8XiU+pohR7laLuZ2ej7cjxmGwy5UiRo1Vg8/Ybv7YZvvob EkV1YQy7/Qc5pOjZzNWj/RoV3heZH7k0fHpn X-Google-Smtp-Source: ABdhPJxSbs2YywQnN796DcFecZ9ceTSW5VWRzlfdFrCVgTQ8ySnJUOrDwFO8llwa+r9jGl7ScZ9cJg== X-Received: by 2002:a05:620a:11a5:: with SMTP id c5mr2454061qkk.163.1613618079152; Wed, 17 Feb 2021 19:14:39 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id v75sm3060794qkb.14.2021.02.17.19.14.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:38 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:37 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King In a recent patch we added a function 'find_kept_pack_entry()' to look for an object only among kept packs. While this function avoids doing any lookup work in non-kept packs, it is still linear in the number of packs, since we have to traverse the linked list of packs once per object. Let's cache a reduced version of that list to save us time. Note that this cache will last the lifetime of the program. We could invalidate it on reprepare_packed_git(), but there's not much point in being rigorous here: - we might already fail to notice new .keep packs showing up after the program starts. We only reprepare_packed_git() when we fail to find an object. But adding a new pack won't cause that to happen. Somebody repacking could add a new pack and delete an old one, but most of the time we'd have a descriptor or mmap open to the old pack anyway, so we might not even notice. - in pack-objects we already cache the .keep state at startup, since 56dfeb6263 (pack-objects: compute local/ignore_pack_keep early, 2016-07-29). So this is just extending that concept further. - we don't have to worry about any packed_git being removed; we always keep the old structs around, even after reprepare_packed_git() We do defensively invalidate the cache in case the set of kept packs being asked for changes (e.g., only in-core kept packs were cached, but suddenly the caller also wants on-disk kept packs, too). In theory we could build all three caches and switch between them, but it's not necessary, since this patch (and series) never changes the set of kept packs that it wants to inspect from the cache. So that "optimization" is more about being defensive in the face of future changes than it is about asking for multiple kinds of kept packs in this patch. Here are p5303 results (as always, measured against the kernel): Test HEAD^ HEAD ----------------------------------------------------------------------------------------------- 5303.5: repack (1) 57.34(54.66+10.88) 56.98(54.36+10.98) -0.6% 5303.6: repack with kept (1) 57.38(54.83+10.49) 57.17(54.97+10.26) -0.4% 5303.11: repack (50) 71.70(88.99+4.74) 71.62(88.48+5.08) -0.1% 5303.12: repack with kept (50) 72.58(89.61+4.78) 71.56(88.80+4.59) -1.4% 5303.17: repack (1000) 217.19(491.72+14.25) 217.31(490.82+14.53) +0.1% 5303.18: repack with kept (1000) 246.12(520.07+14.93) 217.08(490.37+15.10) -11.8% and the --stdin-packs case, which scales a little bit better (although not by that much even at 1,000 packs): 5303.7: repack with --stdin-packs (1) 0.00(0.00+0.00) 0.00(0.00+0.00) = 5303.13: repack with --stdin-packs (50) 3.43(11.75+0.24) 3.43(11.69+0.30) +0.0% 5303.19: repack with --stdin-packs (1000) 130.50(307.15+7.66) 125.13(301.36+8.04) -4.1% Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- object-store.h | 5 +++ packfile.c | 99 ++++++++++++++++++++++++++++---------------------- 2 files changed, 61 insertions(+), 43 deletions(-) diff --git a/object-store.h b/object-store.h index 541dab0858..ec32c23dcb 100644 --- a/object-store.h +++ b/object-store.h @@ -153,6 +153,11 @@ struct raw_object_store { /* A most-recently-used ordered version of the packed_git list. */ struct list_head packed_git_mru; + struct { + struct packed_git **packs; + unsigned flags; + } kept_pack_cache; + /* * A map of packfiles to packed_git structs for tracking which * packs have been loaded already. diff --git a/packfile.c b/packfile.c index 7f84f221ce..57d5b436fb 100644 --- a/packfile.c +++ b/packfile.c @@ -2042,10 +2042,7 @@ static int fill_pack_entry(const struct object_id *oid, return 1; } -static int find_one_pack_entry(struct repository *r, - const struct object_id *oid, - struct pack_entry *e, - int kept_only) +int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) { struct list_head *pos; struct multi_pack_index *m; @@ -2055,49 +2052,63 @@ static int find_one_pack_entry(struct repository *r, return 0; for (m = r->objects->multi_pack_index; m; m = m->next) { - if (!fill_midx_entry(r, oid, e, m)) - continue; - - if (!kept_only) - return 1; - - if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) || - ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core)) + if (fill_midx_entry(r, oid, e, m)) return 1; } list_for_each(pos, &r->objects->packed_git_mru) { struct packed_git *p = list_entry(pos, struct packed_git, mru); - if (p->multi_pack_index && !kept_only) { - /* - * If this pack is covered by the MIDX, we'd have found - * the object already in the loop above if it was here, - * so don't bother looking. - * - * The exception is if we are looking only at kept - * packs. An object can be present in two packs covered - * by the MIDX, one kept and one not-kept. And as the - * MIDX points to only one copy of each object, it might - * have returned only the non-kept version above. We - * have to check again to be thorough. - */ - continue; - } - if (!kept_only || - (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) || - ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) { - if (fill_pack_entry(oid, e, p)) { - list_move(&p->mru, &r->objects->packed_git_mru); - return 1; - } + if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) { + list_move(&p->mru, &r->objects->packed_git_mru); + return 1; } } return 0; } -int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +static void maybe_invalidate_kept_pack_cache(struct repository *r, + unsigned flags) { - return find_one_pack_entry(r, oid, e, 0); + if (!r->objects->kept_pack_cache.packs) + return; + if (r->objects->kept_pack_cache.flags == flags) + return; + FREE_AND_NULL(r->objects->kept_pack_cache.packs); + r->objects->kept_pack_cache.flags = 0; +} + +static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags) +{ + maybe_invalidate_kept_pack_cache(r, flags); + + if (!r->objects->kept_pack_cache.packs) { + struct packed_git **packs = NULL; + size_t nr = 0, alloc = 0; + struct packed_git *p; + + /* + * We want "all" packs here, because we need to cover ones that + * are used by a midx, as well. We need to look in every one of + * them (instead of the midx itself) to cover duplicates. It's + * possible that an object is found in two packs that the midx + * covers, one kept and one not kept, but the midx returns only + * the non-kept version. + */ + for (p = get_all_packs(r); p; p = p->next) { + if ((p->pack_keep && (flags & ON_DISK_KEEP_PACKS)) || + (p->pack_keep_in_core && (flags & IN_CORE_KEEP_PACKS))) { + ALLOC_GROW(packs, nr + 1, alloc); + packs[nr++] = p; + } + } + ALLOC_GROW(packs, nr + 1, alloc); + packs[nr] = NULL; + + r->objects->kept_pack_cache.packs = packs; + r->objects->kept_pack_cache.flags = flags; + } + + return r->objects->kept_pack_cache.packs; } int find_kept_pack_entry(struct repository *r, @@ -2105,13 +2116,15 @@ int find_kept_pack_entry(struct repository *r, unsigned flags, struct pack_entry *e) { - /* - * Load all packs, including midx packs, since our "kept" strategy - * relies on that. We're relying on the side effect of it setting up - * r->objects->packed_git, which is a little ugly. - */ - get_all_packs(r); - return find_one_pack_entry(r, oid, e, flags); + struct packed_git **cache; + + for (cache = kept_pack_cache(r, flags); *cache; cache++) { + struct packed_git *p = *cache; + if (fill_pack_entry(oid, e, p)) + return 1; + } + + return 0; } int has_object_pack(const struct object_id *oid) From patchwork Thu Feb 18 03:14:41 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12092885 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 150C5C433E0 for ; Thu, 18 Feb 2021 03:15:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C34E364E58 for ; Thu, 18 Feb 2021 03:15:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230100AbhBRDPn (ORCPT ); Wed, 17 Feb 2021 22:15:43 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43236 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230071AbhBRDPk (ORCPT ); Wed, 17 Feb 2021 22:15:40 -0500 Received: from mail-qv1-xf2c.google.com (mail-qv1-xf2c.google.com [IPv6:2607:f8b0:4864:20::f2c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5168AC061793 for ; Wed, 17 Feb 2021 19:14:44 -0800 (PST) Received: by mail-qv1-xf2c.google.com with SMTP id y10so329044qvo.6 for ; Wed, 17 Feb 2021 19:14:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=jIKEo2HdKBnmzff+hMCzUOjthefC0qH1JpG0EkVyRzw=; b=t6MvI+93GC/zGxYKHilRCU3udHmEHH9gNR3997RXYNlaosHZsuClMpVUrkroIRjMJS 3uMp24bNIkfZThZ+PXbEj0mhkjuN7RXQy9C9uh1wFVkyS38uf+AyoSlKtaFWM6yXELyo 7Ke8UskW/YpPM2dzV9dsMscFAEMeimOKXf1zg+whB+reCP00x2KjIGur+kTzyuwhV6Pn aAqJ3u/CT5ftS7PCIKOmZfc22s0v0R7Rl4sp7QvGF5IYgKK6mP1GZPSi3zfZZX5JWmU8 xQwRqS1NfV+Uh9a5Ka4zb4xBq2fMNJlK1kWr9/IuATqvFikFaOGJxPpOtnHSlzDqQUQs rpwA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=jIKEo2HdKBnmzff+hMCzUOjthefC0qH1JpG0EkVyRzw=; b=cype8aZnGZOqmZRwAWw/tnJWFUSpkdRSW0Ny+VVPKafT8vqtuxCXibyJCXIMZfsiz0 4Q7FKaRwxqWsqn1G89Xxp7tS3z2Vc07lU4AxPc+7S19zTn2Jq2m64N1NlaS/SUJRZzot SX+q3FwjuW+/O/kH37KjAEos0iAM1RmmKtgPwVtXoLYOYKC+GX5t2Rkzg5ohnp9zs22c d0JdaQB/FS7oZix0b+G7O2QB+/3y4M3zJd4zAr5GcONpbY5P6VdgEAuZaQzKxGNFOE/d ktdHMAArqpsZi+YOygBdn98LFt3CGQgxOSIoJw64KqzrctPRX33XTg+ObSPNr65kL/rO b+aw== X-Gm-Message-State: AOAM532H9yUTZkcmAXsQ8coKkl7z1VftugLdqErA66I/64KwN6wGZhse ASGpHnXEmSRz6vw4EGFdn14YVpvXcvDdY1JN X-Google-Smtp-Source: ABdhPJwUEc1LGY+/4NJ7kPgnJCicpocaDEO0mNS818U3HZdkmQp+XccUO/3LVz5YhrqT79zJ3WPwLw== X-Received: by 2002:ad4:55aa:: with SMTP id f10mr2577230qvx.46.1613618083013; Wed, 17 Feb 2021 19:14:43 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:1f29:6ff9:b466:8c60]) by smtp.gmail.com with ESMTPSA id v12sm1352362qtw.73.2021.02.17.19.14.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Feb 2021 19:14:42 -0800 (PST) Date: Wed, 17 Feb 2021 22:14:41 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v3 8/8] builtin/repack.c: add '--geometric' option Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Often it is useful to both: - have relatively few packfiles in a repository, and - avoid having so few packfiles in a repository that we repack its entire contents regularly This patch implements a '--geometric=' option in 'git repack'. This allows the caller to specify that they would like each pack to be at least a factor times as large as the previous largest pack (by object count). Concretely, say that a repository has 'n' packfiles, labeled P1, P2, ..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'. With a geometric factor of 'r', it should be that: objects(Pi) > r*objects(P(i-1)) for all i in [1, n], where the packs are sorted by objects(P1) <= objects(P2) <= ... <= objects(Pn). Since finding a true optimal repacking is NP-hard, we approximate it along two directions: 1. We assume that there is a cutoff of packs _before starting the repack_ where everything to the right of that cut-off already forms a geometric progression (or no cutoff exists and everything must be repacked). 2. We assume that everything smaller than the cutoff count must be repacked. This forms our base assumption, but it can also cause even the "heavy" packs to get repacked, for e.g., if we have 6 packs containing the following number of objects: 1, 1, 1, 2, 4, 32 then we would place the cutoff between '1, 1' and '1, 2, 4, 32', rolling up the first two packs into a pack with 2 objects. That breaks our progression and leaves us: 2, 1, 2, 4, 32 ^ (where the '^' indicates the position of our split). To restore a progression, we move the split forward (towards larger packs) joining each pack into our new pack until a geometric progression is restored. Here, that looks like: 2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32 ^ ^ ^ ^ This has the advantage of not repacking the heavy-side of packs too often while also only creating one new pack at a time. Another wrinkle is that we assume that loose, indexed, and reflog'd objects are insignificant, and lump them into any new pack that we create. This can lead to non-idempotent results. Suggested-by: Derrick Stolee Signed-off-by: Taylor Blau --- Documentation/git-repack.txt | 22 +++++ builtin/repack.c | 187 ++++++++++++++++++++++++++++++++++- t/t7703-repack-geometric.sh | 137 +++++++++++++++++++++++++ 3 files changed, 342 insertions(+), 4 deletions(-) create mode 100755 t/t7703-repack-geometric.sh diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt index 92f146d27d..21c7068925 100644 --- a/Documentation/git-repack.txt +++ b/Documentation/git-repack.txt @@ -165,6 +165,28 @@ depth is 4095. Pass the `--delta-islands` option to `git-pack-objects`, see linkgit:git-pack-objects[1]. +-g=:: +--geometric=:: + Arrange resulting pack structure so that each successive pack + contains at least `` times the number of objects as the + next-largest pack. ++ +`git repack` ensures this by determining a "cut" of packfiles that need +to be repacked into one in order to ensure a geometric progression. It +picks the smallest set of packfiles such that as many of the larger +packfiles (by count of objects contained in that pack) may be left +intact. ++ +Unlike other repack modes, the set of objects to pack is determined +uniquely by the set of packs being "rolled-up"; in other words, the +packs determined to need to be combined in order to restore a geometric +progression. ++ +Loose objects are implicitly included in this "roll-up", without respect +to their reachability. This is subject to change in the future. This +option (implying a drastically different repack mode) is not guarenteed +to work with all other combinations of option to `git repack`). + Configuration ------------- diff --git a/builtin/repack.c b/builtin/repack.c index 01440de2d5..bcf280b10d 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -297,6 +297,124 @@ static void repack_promisor_objects(const struct pack_objects_args *args, #define ALL_INTO_ONE 1 #define LOOSEN_UNREACHABLE 2 +struct pack_geometry { + struct packed_git **pack; + uint32_t pack_nr, pack_alloc; + uint32_t split; +}; + +static uint32_t geometry_pack_weight(struct packed_git *p) +{ + if (open_pack_index(p)) + die(_("cannot open index for %s"), p->pack_name); + return p->num_objects; +} + +static int geometry_cmp(const void *va, const void *vb) +{ + uint32_t aw = geometry_pack_weight(*(struct packed_git **)va), + bw = geometry_pack_weight(*(struct packed_git **)vb); + + if (aw < bw) + return -1; + if (aw > bw) + return 1; + return 0; +} + +static void init_pack_geometry(struct pack_geometry **geometry_p) +{ + struct packed_git *p; + struct pack_geometry *geometry; + + *geometry_p = xcalloc(1, sizeof(struct pack_geometry)); + geometry = *geometry_p; + + for (p = get_all_packs(the_repository); p; p = p->next) { + if (!pack_kept_objects && p->pack_keep) + continue; + + ALLOC_GROW(geometry->pack, + geometry->pack_nr + 1, + geometry->pack_alloc); + + geometry->pack[geometry->pack_nr] = p; + geometry->pack_nr++; + } + + QSORT(geometry->pack, geometry->pack_nr, geometry_cmp); +} + +static void split_pack_geometry(struct pack_geometry *geometry, int factor) +{ + uint32_t i; + uint32_t split; + off_t total_size = 0; + + if (geometry->pack_nr <= 1) { + geometry->split = geometry->pack_nr; + return; + } + + split = geometry->pack_nr - 1; + + /* + * First, count the number of packs (in descending order of size) which + * already form a geometric progression. + */ + for (i = geometry->pack_nr - 1; i > 0; i--) { + struct packed_git *ours = geometry->pack[i]; + struct packed_git *prev = geometry->pack[i - 1]; + if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev)) + split--; + else + break; + } + + if (split) { + /* + * Move the split one to the right, since the top element in the + * last-compared pair can't be in the progression. Only do this + * when we split in the middle of the array (otherwise if we got + * to the end, then the split is in the right place). + */ + split++; + } + + /* + * Then, anything to the left of 'split' must be in a new pack. But, + * creating that new pack may cause packs in the heavy half to no longer + * form a geometric progression. + * + * Compute an expected size of the new pack, and then determine how many + * packs in the heavy half need to be joined into it (if any) to restore + * the geometric progression. + */ + for (i = 0; i < split; i++) + total_size += geometry_pack_weight(geometry->pack[i]); + for (i = split; i < geometry->pack_nr; i++) { + struct packed_git *ours = geometry->pack[i]; + if (geometry_pack_weight(ours) < factor * total_size) { + split++; + total_size += geometry_pack_weight(ours); + } else + break; + } + + geometry->split = split; +} + +static void clear_pack_geometry(struct pack_geometry *geometry) +{ + if (!geometry) + return; + + free(geometry->pack); + geometry->pack_nr = 0; + geometry->pack_alloc = 0; + geometry->split = 0; +} + int cmd_repack(int argc, const char **argv, const char *prefix) { struct child_process cmd = CHILD_PROCESS_INIT; @@ -304,6 +422,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) struct string_list names = STRING_LIST_INIT_DUP; struct string_list rollback = STRING_LIST_INIT_NODUP; struct string_list existing_packs = STRING_LIST_INIT_DUP; + struct pack_geometry *geometry = NULL; struct strbuf line = STRBUF_INIT; int i, ext, ret; FILE *out; @@ -316,6 +435,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; int no_update_server_info = 0; struct pack_objects_args po_args = {NULL}; + int geometric_factor = 0; struct option builtin_repack_options[] = { OPT_BIT('a', NULL, &pack_everything, @@ -356,6 +476,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix) N_("repack objects in packs marked with .keep")), OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"), N_("do not repack this pack")), + OPT_INTEGER('g', "geometric", &geometric_factor, + N_("find a geometric progression with factor ")), OPT_END() }; @@ -382,6 +504,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (write_bitmaps && !(pack_everything & ALL_INTO_ONE)) die(_(incremental_bitmap_conflict_error)); + if (geometric_factor) { + if (pack_everything) + die(_("--geometric is incompatible with -A, -a")); + init_pack_geometry(&geometry); + split_pack_geometry(geometry, geometric_factor); + } + packdir = mkpathdup("%s/pack", get_object_directory()); packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid()); @@ -396,9 +525,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix) strvec_pushf(&cmd.args, "--keep-pack=%s", keep_pack_list.items[i].string); strvec_push(&cmd.args, "--non-empty"); - strvec_push(&cmd.args, "--all"); - strvec_push(&cmd.args, "--reflog"); - strvec_push(&cmd.args, "--indexed-objects"); + if (!geometry) { + /* + * 'git pack-objects' will up all objects loose or packed + * (either rolling them up or leaving them alone), so don't pass + * these options. + * + * The implementation of 'git pack-objects --stdin-packs' + * makes them redundant (and the two are incompatible). + */ + strvec_push(&cmd.args, "--all"); + strvec_push(&cmd.args, "--reflog"); + strvec_push(&cmd.args, "--indexed-objects"); + } if (has_promisor_remote()) strvec_push(&cmd.args, "--exclude-promisor-objects"); if (write_bitmaps > 0) @@ -429,17 +568,37 @@ int cmd_repack(int argc, const char **argv, const char *prefix) strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1"); } } + } else if (geometry) { + strvec_push(&cmd.args, "--stdin-packs"); + strvec_push(&cmd.args, "--unpacked"); } else { strvec_push(&cmd.args, "--unpacked"); strvec_push(&cmd.args, "--incremental"); } - cmd.no_stdin = 1; + if (geometry) + cmd.in = -1; + else + cmd.no_stdin = 1; ret = start_command(&cmd); if (ret) return ret; + if (geometry) { + FILE *in = xfdopen(cmd.in, "w"); + /* + * The resulting pack should contain all objects in packs that + * are going to be rolled up, but exclude objects in packs which + * are being left alone. + */ + for (i = 0; i < geometry->split; i++) + fprintf(in, "%s\n", pack_basename(geometry->pack[i])); + for (i = geometry->split; i < geometry->pack_nr; i++) + fprintf(in, "^%s\n", pack_basename(geometry->pack[i])); + fclose(in); + } + out = xfdopen(cmd.out, "r"); while (strbuf_getline_lf(&line, out) != EOF) { if (line.len != the_hash_algo->hexsz) @@ -507,6 +666,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (!string_list_has_string(&names, sha1)) remove_redundant_pack(packdir, item->string); } + + if (geometry) { + struct strbuf buf = STRBUF_INIT; + + uint32_t i; + for (i = 0; i < geometry->split; i++) { + struct packed_git *p = geometry->pack[i]; + if (string_list_has_string(&names, + hash_to_hex(p->hash))) + continue; + + strbuf_reset(&buf); + strbuf_addstr(&buf, pack_basename(p)); + strbuf_strip_suffix(&buf, ".pack"); + + remove_redundant_pack(packdir, buf.buf); + } + strbuf_release(&buf); + } if (!po_args.quiet && isatty(2)) opts |= PRUNE_PACKED_VERBOSE; prune_packed_objects(opts); @@ -528,6 +706,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) string_list_clear(&names, 0); string_list_clear(&rollback, 0); string_list_clear(&existing_packs, 0); + clear_pack_geometry(geometry); strbuf_release(&line); return 0; diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh new file mode 100755 index 0000000000..96917fc163 --- /dev/null +++ b/t/t7703-repack-geometric.sh @@ -0,0 +1,137 @@ +#!/bin/sh + +test_description='git repack --geometric works correctly' + +. ./test-lib.sh + +GIT_TEST_MULTI_PACK_INDEX=0 + +objdir=.git/objects +midx=$objdir/pack/multi-pack-index + +test_expect_success '--geometric with no packs' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + git repack --geometric 2 >out && + test_i18ngrep "Nothing new to pack" out + ) +' + +test_expect_success '--geometric with an intact progression' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + # These packs already form a geometric progression. + test_commit_bulk --start=1 1 && # 3 objects + test_commit_bulk --start=2 2 && # 6 objects + test_commit_bulk --start=4 4 && # 12 objects + + find $objdir/pack -name "*.pack" | sort >expect && + git repack --geometric 2 -d && + find $objdir/pack -name "*.pack" | sort >actual && + + test_cmp expect actual + ) +' + +test_expect_success '--geometric with small-pack rollup' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + test_commit_bulk --start=1 1 && # 3 objects + test_commit_bulk --start=2 1 && # 3 objects + find $objdir/pack -name "*.pack" | sort >small && + test_commit_bulk --start=3 4 && # 12 objects + test_commit_bulk --start=7 8 && # 24 objects + find $objdir/pack -name "*.pack" | sort >before && + + git repack --geometric 2 -d && + + # Three packs in total; two of the existing large ones, and one + # new one. + find $objdir/pack -name "*.pack" | sort >after && + test_line_count = 3 after && + comm -3 small before | tr -d "\t" >large && + grep -qFf large after + ) +' + +test_expect_success '--geometric with small- and large-pack rollup' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + # size(small1) + size(small2) > size(medium) / 2 + test_commit_bulk --start=1 1 && # 3 objects + test_commit_bulk --start=2 1 && # 3 objects + test_commit_bulk --start=2 3 && # 7 objects + test_commit_bulk --start=6 9 && # 27 objects && + + find $objdir/pack -name "*.pack" | sort >before && + + git repack --geometric 2 -d && + + find $objdir/pack -name "*.pack" | sort >after && + comm -12 before after >untouched && + + # Two packs in total; the largest pack from before running "git + # repack", and one new one. + test_line_count = 1 untouched && + test_line_count = 2 after + ) +' + +test_expect_success '--geometric ignores kept packs' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + test_commit kept && # 3 objects + test_commit pack && # 3 objects + + KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF + refs/tags/kept + EOF + ) && + PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF + refs/tags/pack + ^refs/tags/kept + EOF + ) && + + # neither pack contains more than twice the number of objects in + # the other, so they should be combined. but, marking one as + # .kept on disk will "freeze" it, so the pack structure should + # remain unchanged. + touch $objdir/pack/pack-$KEPT.keep && + + find $objdir/pack -name "*.pack" | sort >before && + git repack --geometric 2 -d && + find $objdir/pack -name "*.pack" | sort >after && + + # both packs should still exist + test_path_is_file $objdir/pack/pack-$KEPT.pack && + test_path_is_file $objdir/pack/pack-$PACK.pack && + + # and no new packs should be created + test_cmp before after && + + # Passing --pack-kept-objects causes packs with a .keep file to + # be repacked, too. + git repack --geometric 2 -d --pack-kept-objects && + + find $objdir/pack -name "*.pack" >after && + test_line_count = 1 after + ) +' + +test_done