diff mbox series

[1/2] list-objects: drop --unpacked non-commit objects from results

Message ID d3992c98aaa54c3635c249a15d919aa1177324d8.1699311386.git.me@ttaylorr.com (mailing list archive)
State Accepted
Commit 4263f9279e331fabb34fac7ef92a5a1061ae1d01
Headers show
Series revision: exclude all packed objects with `--unpacked` | expand

Commit Message

Taylor Blau Nov. 6, 2023, 10:56 p.m. UTC
In git-rev-list(1), we describe the `--unpacked` option as:

    Only useful with `--objects`; print the object IDs that are not in
    packs.

This is true of commits, which we discard via get_commit_action(), but
not of the objects they reach. So if we ask for an --objects traversal
with --unpacked, we may get arbitrarily many objects which are indeed
packed.

I am nearly certain this behavior dates back to the introduction of
`--unpacked` via 12d2a18780 ("git rev-list --unpacked" shows only
unpacked commits, 2005-07-03), but I couldn't get that revision of Git
to compile for me. At least as early as v2.0.0 this has been subtly
broken:

    $ git.compile --version
    git version 2.0.0

    $ git.compile rev-list --objects --all --unpacked
    72791fe96c93f9ec5c311b8bc966ab349b3b5bbe
    05713d991c18bbeef7e154f99660005311b5004d v1.0
    153ed8b7719c6f5a68ce7ffc43133e95a6ac0fdb
    8e4020bb5a8d8c873b25de15933e75cc0fc275df one
    9200b628cf9dc883a85a7abc8d6e6730baee589c two
    3e6b46e1b7e3b91acce99f6a823104c28aae0b58 unpacked.t

There, only the first, third, and sixth entries are loose, with the
remaining set of objects belonging to at least one pack.

The implications for this are relatively benign: bare 'git repack'
invocations which invoke pack-objects with --unpacked are impacted, and
at worst we'll store a few extra objects that should have been excluded.

Arguably changing this behavior is a backwards-incompatible change,
since it alters the set of objects emitted from rev-list queries with
`--objects` and `--unpacked`. But I argue that this change is still
sensible, since the existing implementation deviates from
clearly-written documentation.

The fix here is straightforward: avoid showing any non-commit objects
which are contained in packs by discarding them within list-objects.c,
before they are shown to the user. Note that similar treatment for
`list-objects.c::show_commit()` is not needed, since that case is
already handled by `revision.c::get_commit_action()`.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 list-objects.c           |  3 +++
 t/t6000-rev-list-misc.sh | 13 +++++++++++++
 2 files changed, 16 insertions(+)

Comments

Junio C Hamano Nov. 7, 2023, 2:21 a.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> Subject: Re: [PATCH 1/2] list-objects: drop --unpacked non-commit objects from results

I thought the purpose of this change is to make sure that we drop
packed objects from results when --unpacked is given?  This makes
it sound as if we are dropping unpacked ones instead?  I dunno.

> In git-rev-list(1), we describe the `--unpacked` option as:
>
>     Only useful with `--objects`; print the object IDs that are not in
>     packs.
>
> This is true of commits, which we discard via get_commit_action(), but
> not of the objects they reach. So if we ask for an --objects traversal
> with --unpacked, we may get arbitrarily many objects which are indeed
> packed.

Strictly speaking, as long as all the objects that are not in packs
are shown, "print the object IDs that are not in packs" is satisfied.
With this fix, perhaps we would want to tighten the explanation a
little bit while we are at it.  Perhaps

	print the object names but exclude those that are in packs

or something along that line?

> diff --git a/list-objects.c b/list-objects.c
> index c25c72b32c..c8a5fb998e 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -39,6 +39,9 @@ static void show_object(struct traversal_context *ctx,
>  {
>  	if (!ctx->show_object)
>  		return;
> +	if (ctx->revs->unpacked && has_object_pack(&object->oid))
> +		return;
> +
>  	ctx->show_object(object, name, ctx->show_data);
>  }

The implementation is as straight-forward as it can get.  Very
pleasing.

> diff --git a/t/t6000-rev-list-misc.sh b/t/t6000-rev-list-misc.sh
> index 12def7bcbf..6289a2e8b0 100755
> --- a/t/t6000-rev-list-misc.sh
> +++ b/t/t6000-rev-list-misc.sh
> @@ -169,4 +169,17 @@ test_expect_success 'rev-list --count --objects' '
>  	test_line_count = $count actual
>  '
>  
> +test_expect_success 'rev-list --unpacked' '
> +	git repack -ad &&
> +	test_commit unpacked &&
> +
> +	git rev-list --objects --no-object-names unpacked^.. >expect.raw &&
> +	sort expect.raw >expect &&
> +
> +	git rev-list --all --objects --unpacked --no-object-names >actual.raw &&
> +	sort actual.raw >actual &&
> +
> +	test_cmp expect actual
> +'
> +
>  test_done
Jeff King Nov. 7, 2023, 4:02 a.m. UTC | #2
On Tue, Nov 07, 2023 at 11:21:29AM +0900, Junio C Hamano wrote:

> > In git-rev-list(1), we describe the `--unpacked` option as:
> >
> >     Only useful with `--objects`; print the object IDs that are not in
> >     packs.
> >
> > This is true of commits, which we discard via get_commit_action(), but
> > not of the objects they reach. So if we ask for an --objects traversal
> > with --unpacked, we may get arbitrarily many objects which are indeed
> > packed.
> 
> Strictly speaking, as long as all the objects that are not in packs
> are shown, "print the object IDs that are not in packs" is satisfied.
> With this fix, perhaps we would want to tighten the explanation a
> little bit while we are at it.  Perhaps
> 
> 	print the object names but exclude those that are in packs
> 
> or something along that line?

I think using the word "exclude" is a good idea, as it makes it clear
that we are omitting objects that otherwise would be traversed (as
opposed to just showing unpacked objects, reachable or not).

But I wanted to point out one other subtlety here. The existing code
(before this patch) checks for already-packed commits, and avoids adding
them to the traversal. The problem this patch is fixing is that we may
see objects they point to via other non-packed commits. But the opposite
problem exists, too: we have unpacked objects that are reachable from
those packed commits.

It's probably reasonably rare, since we _tend_ to make packs by rolling
up reachable chunks of history. But that's not a guarantee. One way I
can think of for it to happen in practice is that somebody pushes (or
fetches) a thin pack with commit C as a delta against an unpacked C'. In
that case "index-pack --fix-thin" will create a duplicate of C' in the
new pack, but its trees and blobs may remain unpacked.

I think with the patch in this series we could actually drop that "do
not traverse commits that are unpacked" line of code, and end up "more
correct". But I suspect performance of an incremental "git repack -d"
would suffer. This is kind of analagous to the "we do not traverse every
UNINTERESTING commit just to mark its trees/blobs as UNINTERESTING"
optimization. We know that it is not a true set difference, but it is OK
in practice and it buys us a lot of performance. And just like that
case, bitmaps do let us cheaply compute the true set difference. ;)

-Peff
diff mbox series

Patch

diff --git a/list-objects.c b/list-objects.c
index c25c72b32c..c8a5fb998e 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -39,6 +39,9 @@  static void show_object(struct traversal_context *ctx,
 {
 	if (!ctx->show_object)
 		return;
+	if (ctx->revs->unpacked && has_object_pack(&object->oid))
+		return;
+
 	ctx->show_object(object, name, ctx->show_data);
 }
 
diff --git a/t/t6000-rev-list-misc.sh b/t/t6000-rev-list-misc.sh
index 12def7bcbf..6289a2e8b0 100755
--- a/t/t6000-rev-list-misc.sh
+++ b/t/t6000-rev-list-misc.sh
@@ -169,4 +169,17 @@  test_expect_success 'rev-list --count --objects' '
 	test_line_count = $count actual
 '
 
+test_expect_success 'rev-list --unpacked' '
+	git repack -ad &&
+	test_commit unpacked &&
+
+	git rev-list --objects --no-object-names unpacked^.. >expect.raw &&
+	sort expect.raw >expect &&
+
+	git rev-list --all --objects --unpacked --no-object-names >actual.raw &&
+	sort actual.raw >actual &&
+
+	test_cmp expect actual
+'
+
 test_done