
[v3,9/9] pack-objects: improve partial packfile reuse

Message ID 20191115141541.11149-10-chriscool@tuxfamily.org (mailing list archive)
State New, archived
Series Rewrite packfile reuse code

Commit Message

Christian Couder Nov. 15, 2019, 2:15 p.m. UTC
From: Jeff King <peff@peff.net>

The old code to reuse deltas from an existing packfile
just tried to dump a whole segment of the pack verbatim.
That's faster than the traditional way of actually adding
objects to the packing list, but it didn't kick in very
often. This new code is really going for a middle ground:
do _some_ per-object work, but way less than we'd
traditionally do.

For instance, packing torvalds/linux on GitHub servers
just now reused 6.5M objects, but only needed ~50k
chunks.

To implement this, we store the chunks of the packfile
that we reuse in a dynamic array of `struct reused_chunk`,
and we use a reuse_packfile_bitmap to track which objects
from the bitmapped packfile we are reusing.

The dynamic array of `struct reused_chunk` is useful
because we need to know the accumulated offset shift due
to omitted objects. Without the array we'd end up having
to walk over the revindex for that set of objects. The
array basically caches those accumulated offsets
(for the parts we _do_ include), so we don't have to
compute them repeatedly.
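
To make that concrete, here is a minimal standalone sketch of
the bookkeeping, using made-up offsets (the real code below
keeps the chunk table sorted and looks entries up with a binary
search):

#include <stdio.h>

/* Each chunk caches how many bytes of omitted objects precede
 * it, so an OFS_DELTA's relative offset can be fixed up without
 * walking the revindex again. All numbers are hypothetical. */
struct chunk { long original, difference; };

static const struct chunk chunks[] = {
	{ 12, 0 },	/* first chunk starts right after the pack header */
	{ 712, 200 },	/* 200 bytes of omitted objects precede this chunk */
};

static long reused_difference(long where)
{
	/* last chunk starting at or before "where" (linear for brevity) */
	return chunks[where >= chunks[1].original ? 1 : 0].difference;
}

int main(void)
{
	long obj = 900, base = 500;	/* old offsets; delta points backwards */
	long fixup = reused_difference(obj) - reused_difference(base);

	/* the on-disk relative offset of 400 becomes 200 in the new pack */
	printf("new relative offset: %ld\n", (obj - base) - fixup);
	return 0;
}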

Additional checks are added in have_duplicate_entry()
and obj_is_packed() to avoid duplicate objects in the
reuse bitmap. It was probably buggy to not have such a
check before.

If a client both asks for a tag by sha1 and specifies
"include-tag", we may end up including the tag in the reuse
bitmap (due to the direct request), and then later adding it to
the packlist (due to include-tag). This results in duplicate
objects in the pack, which git chokes on. We should notice
that we are already including it when doing the include-tag
portion, and avoid adding it to the packlist.

The simplest place to fix this is right in add_ref_tag,
where we could avoid peeling the tag at all if we know that
we are already including it. However, I've pushed the check
instead into have_duplicate_entry(). This fixes not only
this case, but also means that we cannot have any similar
problems lurking in other code.

No tests, because git does not actually exhibit this "ask
for it and also include-tag" behavior. We do one or the
other on clone, depending on whether --single-branch is set.
However, libgit2 does both.

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
 pack-bitmap.c          | 150 ++++++++++++++++++++--------
 pack-bitmap.h          |   3 +-
 3 files changed, 288 insertions(+), 87 deletions(-)

Comments

Jeff King Dec. 9, 2019, 8:11 a.m. UTC | #1
On Fri, Nov 15, 2019 at 03:15:41PM +0100, Christian Couder wrote:

> The old code to reuse deltas from an existing packfile
> just tried to dump a whole segment of the pack verbatim.
> That's faster than the traditional way of actually adding
> objects to the packing list, but it didn't kick in very
> often. This new code is really going for a middle ground:
> do _some_ per-object work, but way less than we'd
> traditionally do.

This is a good intro to the general problem, but...

> For instance, packing torvalds/linux on GitHub servers
> just now reused 6.5M objects, but only needed ~50k
> chunks.

What the heck is a chunk? :)

I know, because I wrote the patches, but I think we need to sketch out
the solution a bit for the reader.

I think the flow here is:

  - we want to do some per-object work (i.e., your intro above)

  - the general strategy is to make a bitmap of objects from the
    packfile we'll include, and then iterate over it, writing out each
    object exactly as it is in our on-disk pack, but _not_ adding it to
    our packlist (which costs memory, and increases the search space for
    deltas; see the bit-walk sketch after this list)

  - one complication is that if we're omitting some objects, we can't
    set a delta against a base that we're not sending. So we have to
    check each object in try_partial_reuse() to make sure we have its
    delta (actually, that third big comment in that function is
    incomplete, I think; it talks about sending the object later, not as
    part of the reused pack, but we might not send it at all!).

  - another complication is that when we omit parts of the packfile,
    that screws up delta base offsets. So to handle that, we have to
    keep track of which chunks we send (which is the bits you included
    below about chunks)

  - and then we can talk about performance; worst case we might have
    interleaved objects that we are sending or not sending, and we'd
    have as many chunks as objects. But in practice we send big chunks,
    so that's where the 6.5M objects vs 50k chunks comes in.
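
To connect the "iterate over it" step to code: the series walks
the reuse bitmap one 64-bit word at a time, skipping zero runs
with a count-trailing-zeros step. Here is a self-contained
sketch of that walk over hypothetical words, with
__builtin_ctzll standing in for the patch's ewah_bit_ctz64:

#include <stdint.h>
#include <stdio.h>

#define BITS_IN_EWORD 64

int main(void)
{
	/* 64 reusable objects, then a word with only bits 0 and 2 set */
	uint64_t words[] = { ~0ULL, 0x5ULL };
	size_t i;

	for (i = 0; i < 2; i++) {
		uint64_t word = words[i];
		size_t base = i * BITS_IN_EWORD;
		unsigned off;

		for (off = 0; off < BITS_IN_EWORD; off++) {
			if ((word >> off) == 0)
				break;	/* no higher bits set in this word */
			off += __builtin_ctzll(word >> off);
			printf("reuse object at position %zu\n", base + off);
		}
	}
	return 0;
}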

> Additional checks are added in have_duplicate_entry()
> and obj_is_packed() to avoid duplicate objects in the
> reuse bitmap. It was probably buggy to not have such a
> check before.
> 
> If a client both asks for a tag by sha1 and specifies
> "include-tag", we may end up including the tag in the reuse
> bitmap (due to the direct request), and then later adding it to
> the packlist (due to include-tag). This results in duplicate
> objects in the pack, which git chokes on. We should notice
> that we are already including it when doing the include-tag
> portion, and avoid adding it to the packlist.
> 
> The simplest place to fix this is right in add_ref_tag,
> where we could avoid peeling the tag at all if we know that
> we are already including it. However, I've pushed the check
> instead into have_duplicate_entry(). This fixes not only
> this case, but also means that we cannot have any similar
> problems lurking in other code.

I think this part could go in its own commit. If we introduce
reuse_packfile_bitmap early, even if it's always an all-or-nothing chunk
at the beginning of the file with the existing code, then we can
introduce the extra checks in have_duplicate_entry() on top of that.

And then it would be safe to do the cleanup in show_objects_from_type()
that triggers the test failure in patch 4.

>  builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
>  pack-bitmap.c          | 150 ++++++++++++++++++++--------
>  pack-bitmap.h          |   3 +-
>  3 files changed, 288 insertions(+), 87 deletions(-)

It might be worth having a perf test here. The case this is helping the
most, I think, is when somebody cloning wants all of the objects from
the beginning of the pack, but then there's a bunch of _other_ stuff.

That could be unreachable objects kept by "repack -k", or maybe objects
reachable outside of heads and tags. Or objects that are part of other
delta islands. This last is the most plausible, really, because we'll
put all of the root-island objects at the beginning of the pack. So
maybe a good perf test would be to take an existing repository, add a
bunch of new commits in refs/foo/, and then repack with delta islands
such that refs/heads and refs/tags are in one (root) island, but
refs/foo is in another.

The old code should fail to do the pack-reuse thing, but we should get
pretty good reuse with the new code (and less CPU and peak RAM,
hopefully, though the perf suite doesn't measure RAM directly).


Answering some questions from Jonathan's response in the last round
(some of the quotes are his, some are yours):

> I still don't understand this part. Going off what's written in the
> commit message here, it seems to me that the issue is that (1) an object
> can be added to the reuse bitmap without going through to_pack, and (2)
> this is done when the client asks for a tag by SHA-1. But all objects
> when specified by SHA-1 go through to_pack, don't they?

No, if it's part of the reused chunk, we'd avoid adding it to to_pack at
all (and I think the commit message should make that more clear, as I
described above).

> >> No tests, because git does not actually exhibit this "ask
> >> for it and also include-tag" behavior. We do one or the
> >> other on clone, depending on whether --single-branch is set.
> >> However, libgit2 does both.
>
> So my wild guess would be that maybe the reason is that some of this
> code was shared (or intended to be shared) with libgit2?

No, the question is how the client behaves, and how we on the server
react to it. Git as a client would never ask for both include-tag and
for the tag itself by sha1. But libgit2 will, so a libgit2 client
cloning from a Git server would trigger the bug.

> > +     if (pos >= bitmap_git->pack->num_objects)
> > +             return; /* not actually in the pack */
>
> Is this case possible? try_partial_reuse() is only called when there is
> a 1 at the bit location.

Yes, it's possible. The result of a bitmap walk is larger than a given
pack, because we have to extend it to match whatever objects the caller
asked for. E.g., imagine we have commit A, repack it into a bitmapped
pack, and then somebody pushes up commit B. Now I want to clone, getting
both A and B.

Our bitmap result represents the whole answer, and so must include both
objects. But since "B" isn't in the pack, it doesn't have an assigned
bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
bit (valid only for that walk result!). But of course for pack reuse, we
only care about the objects that were actually in the bitmapped pack.
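
In code terms, that guard is nothing more than a position check
against the pack's object count; a tiny sketch with made-up
numbers:

#include <stdio.h>

int main(void)
{
	unsigned num_objects = 100;	/* objects in the bitmapped pack */
	size_t pos[] = { 42, 100, 105 };	/* positions from a walk result */
	int i;

	for (i = 0; i < 3; i++)
		printf("bit %zu: %s\n", pos[i],
		       pos[i] < num_objects
		       ? "in the pack, candidate for reuse"
		       : "ext_index only, not reusable");
	return 0;
}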

> > +     /* Don't mark objects not in the packfile */
> > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
>
> Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> i is 42. Shouldn't we allocate 42 words in our reuse bitmap?

This is the same issue as above. Instead of two objects, imagine
we have 42 words' worth. But if only 20 words' worth are in the packfile,
then we have to stop there.
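
The clamp itself is plain integer arithmetic on whole bitmap
words; a worked example with hypothetical sizes:

#include <stdio.h>

#define BITS_IN_EWORD 64

int main(void)
{
	unsigned num_objects = 1280;	/* the pack covers 20 whole words */
	size_t i = 42;			/* fully-set words in the walk result */

	/* cap the verbatim prefix at the words the pack actually covers */
	if (i > num_objects / BITS_IN_EWORD)
		i = num_objects / BITS_IN_EWORD;

	printf("send the first %zu objects verbatim\n", i * BITS_IN_EWORD);
	return 0;
}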

Now we _could_ figure this out for each individual object through the
similar logic in try_partial_reuse(). But we try to take an even faster
path for the initial chunk of objects. We don't even walk over them
looking to see if they're deltas or not. We just send that chunk of the
pack verbatim.

I don't have numbers for how often we hit this path versus the
individual try_partial_reuse() path. Possibly the stats could
differentiate that.

Thinking on it more, though, I wonder if there's a bug hiding here. We
know that we can send the whole initial chunk verbatim for OFS_DELTA
objects, because the relative offsets will remain unchanged (i.e., there
are no "holes" that trigger our chunk code). But if there were a
REF_DELTA in that initial chunk, we have no certainty that the base is
being sent.

In practice, this doesn't happen because the objects in question have to
be in a pack with bitmaps, which means we wrote it ourselves with
git-repack. And we'd never write REF_DELTA objects there.

But I wonder if it's worth being a bit more careful (and what the
performance impact is; it would mean checking the type of every object
in that initial chunk).

-Peff
Christian Couder Dec. 18, 2019, 11:26 a.m. UTC | #2
On Mon, Dec 9, 2019 at 9:11 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Nov 15, 2019 at 03:15:41PM +0100, Christian Couder wrote:
>
> > The old code to reuse deltas from an existing packfile
> > just tried to dump a whole segment of the pack verbatim.
> > That's faster than the traditional way of actually adding
> > objects to the packing list, but it didn't kick in very
> > often. This new code is really going for a middle ground:
> > do _some_ per-object work, but way less than we'd
> > traditionally do.
>
> This is a good intro to the general problem, but...
>
> > For instance, packing torvalds/linux on GitHub servers
> > just now reused 6.5M objects, but only needed ~50k
> > chunks.
>
> What the heck is a chunk? :)
>
> I know, because I wrote the patches, but I think we need to sketch out
> the solution a bit for the reader.

Ok with sketching it out.

> I think the flow here is:
>
>   - we want to do some per-object work (i.e., your intro above)
>
>   - the general strategy is to make a bitmap of objects from the
>     packfile we'll include, and then iterate over it, writing out each
>     object exactly as it is in our on-disk pack, but _not_ adding it to
>     our packlist (which costs memory, and increases the search space for
>     deltas)
>
>   - one complication is that if we're omitting some objects, we can't
>     set a delta against a base that we're not sending. So we have to
>     check each object in try_partial_reuse() to make sure we have its
>     delta (actually, that third big comment in that function is
>     incomplete, I think; it talks about sending the object later, not as
>     part of the reused pack, but we might not send it at all!).

Are you talking about this comment:

        /*
         * And finally, if we're not sending the base as part of our
         * reuse chunk, then don't send this object either. The base
         * would come after us, along with other objects not
         * necessarily in the pack, which means we'd need to convert
         * to REF_DELTA on the fly. Better to just let the normal
         * object_entry code path handle it.
         */

?

I don't see how it's talking about sending the object later, so I have
left it as is. Maybe you can check it again in the v4 series I am
going to send.

>   - another complication is that when we omit parts of the packfile,
>     that screws up delta base offsets. So to handle that, we have to
>     keep track of which chunks we send (which is the bits you included
>     below about chunks)
>
>   - and then we can talk about performance; worst case we might have
>     interleaved objects that we are sending or not sending, and we'd
>     have as many chunks as objects. But in practice we send big chunks,
>     so that's where the 6.5M objects vs 50k chunks comes in.

Ok, I have added your points above to the commit message.

> > Additional checks are added in have_duplicate_entry()
> > and obj_is_packed() to avoid duplicate objects in the
> > reuse bitmap. It was probably buggy to not have such a
> > check before.
> >
> > If a client both asks for a tag by sha1 and specifies
> > "include-tag", we may end up including the tag in the reuse
> > bitmap (due to the direct request), and then later adding it to
> > the packlist (due to include-tag). This results in duplicate
> > objects in the pack, which git chokes on. We should notice
> > that we are already including it when doing the include-tag
> > portion, and avoid adding it to the packlist.
> >
> > The simplest place to fix this is right in add_ref_tag,
> > where we could avoid peeling the tag at all if we know that
> > we are already including it. However, I've pushed the check
> > instead into have_duplicate_entry(). This fixes not only
> > this case, but also means that we cannot have any similar
> > problems lurking in other code.
>
> I think this part could go in its own commit. If we introduce
> reuse_packfile_bitmap early, even if it's always an all-or-nothing chunk
> at the beginning of the file with the existing code, then we can
> introduce the extra checks in have_duplicate_entry() on top of that.

Ok, I have extracted this part into its own commit.

> And then it would be safe to do the cleanup in show_objects_from_type()
> that triggers the test failure in patch 4.

Ok, I have eventually moved patch 4 to the end of the series.

> >  builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
> >  pack-bitmap.c          | 150 ++++++++++++++++++++--------
> >  pack-bitmap.h          |   3 +-
> >  3 files changed, 288 insertions(+), 87 deletions(-)
>
> It might be worth having a perf test here. The case this is helping the
> most, I think, is when somebody cloning wants all of the objects from
> the beginning of the pack, but then there's a bunch of _other_ stuff.
>
> That could be unreachable objects kept by "repack -k", or maybe objects
> reachable outside of heads and tags. Or objects that are part of other
> delta islands. This last is the most plausible, really, because we'll
> put all of the root-island objects at the beginning of the pack. So
> maybe a good perf test would be to take an existing repository, add a
> bunch of new commits in refs/foo/,

Not sure how I could do so. How would you do that?

I think it would be best to add completely new realistic commits that
are changing the main code base.

> and then repack with delta islands
> such that refs/heads and refs/tags are in one (root) island, but
> refs/foo is in another.
>
> The old code should fail to do the pack-reuse thing, but we should get
> pretty good reuse with the new code (and less CPU and peak RAM,
> hopefully, though the perf suite doesn't measure RAM directly).

I haven't added a perf test. I may do it if I get an idea about how to
add new commits in refs/foo/.

> Answering some questions from Jonathan's response in the last round
> (some of the quotes are his, some are yours):
>
> > I still don't understand this part. Going off what's written in the
> > commit message here, it seems to me that the issue is that (1) an object
> > can be added to the reuse bitmap without going through to_pack, and (2)
> > this is done when the client asks for a tag by SHA-1. But all objects
> > when specified by SHA-1 go through to_pack, don't they?
>
> No, if it's part of the reused chunk, we'd avoid adding it to to_pack at
> all (and I think the commit message should make that more clear, as I
> described above).

I used your description above to improve the commit message, so I
guess it now answers his question.

> > >> No tests, because git does not actually exhibit this "ask
> > >> for it and also include-tag" behavior. We do one or the
> > >> other on clone, depending on whether --single-branch is set.
> > >> However, libgit2 does both.
> >
> > So my wild guess would be that maybe the reason is that some of this
> > code was shared (or intended to be shared) with libgit2?
>
> No, the question is how the client behaves, and how we on the server
> react to it. Git as a client would never ask for both include-tag and
> for the tag itself by sha1. But libgit2 will, so a libgit2 client
> cloning from a Git server would trigger the bug.

Ok, I have made it explicit in the commit message that the bug would
be triggered by a libgit2 client but not a Git client.

> > > +     if (pos >= bitmap_git->pack->num_objects)
> > > +             return; /* not actually in the pack */
> >
> > Is this case possible? try_partial_reuse() is only called when there is
> > a 1 at the bit location.
>
> Yes, it's possible. The result of a bitmap walk is larger than a given
> pack, because we have to extend it to match whatever objects the caller
> asked for. E.g., imagine we have commit A, repack it into a bitmapped
> pack, and then somebody pushes up commit B. Now I want to clone, getting
> both A and B.
>
> Our bitmap result represents the whole answer, and so must include both
> objects. But since "B" isn't in the pack, it doesn't have an assigned
> bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
> bit (valid only for that walk result!). But of course for pack reuse, we
> only care about the objects that were actually in the bitmapped pack.

Not sure if these explanations should go into the commit message or a
comment in the code. So I haven't added them for now.

> > > +     /* Don't mark objects not in the packfile */
> > > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
> >
> > Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> > i is 42. Shouldn't we allocate 42 words in our reuse bitmap?
>
> This is the same issue as above. Instead of two objects, imagine
> we have 42 words' worth. But if only 20 words' worth are in the packfile,
> then we have to stop there.
>
> Now we _could_ figure this out for each individual object through the
> similar logic in try_partial_reuse(). But we try to take an even faster
> path for the initial chunk of objects. We don't even walk over them
> looking to see if they're deltas or not. We just send that chunk of the
> pack verbatim.
>
> I don't have numbers for how often we hit this path versus the
> individual try_partial_reuse() path. Possibly the stats could
> differentiate that.

Also not sure if these explanations should go into the commit message
or a comment in the code. So I haven't added them for now.

> Thinking on it more, though, I wonder if there's a bug hiding here. We
> know that we can send the whole initial chunk verbatim for OFS_DELTA
> objects, because the relative offsets will remain unchanged (i.e., there
> are no "holes" that trigger our chunk code). But if there were a
> REF_DELTA in that initial chunk, we have no certainty that the base is
> being sent.
>
> In practice, this doesn't happen because the objects in question have to
> be in a pack with bitmaps, which means we wrote it ourselves with
> git-repack. And we'd never write REF_DELTA objects there.
>
> But I wonder if it's worth being a bit more careful (and what the
> performance impact is; it would mean checking the type of every object
> in that initial chunk).

I haven't done anything related to that. Maybe something can be done
in a follow up patch.

Thanks a lot for the review!

Christian.
Jeff King Dec. 19, 2019, 12:42 a.m. UTC | #3
On Wed, Dec 18, 2019 at 12:26:28PM +0100, Christian Couder wrote:

> >   - one complication is that if we're omitting some objects, we can't
> >     set a delta against a base that we're not sending. So we have to
> >     check each object in try_partial_reuse() to make sure we have its
> >     delta (actually, that third big comment in that function is
> >     incomplete, I think; it talks about sending the object later, not as
> >     part of the reused pack, but we might not send it at all!).
> 
> Are you talking about this comment:
> 
>         /*
>          * And finally, if we're not sending the base as part of our
>          * reuse chunk, then don't send this object either. The base
>          * would come after us, along with other objects not
>          * necessarily in the pack, which means we'd need to convert
>          * to REF_DELTA on the fly. Better to just let the normal
>          * object_entry code path handle it.
>          */
> 
> ?
> 
> I don't see how it's talking about sending the object later, so I have
> left it as is. Maybe you can check it again in the v4 series I am
> going to send.

Yes, that's the comment. It says "the base would come after us". That
could be true, but it also could be that we are not sending the object
at all (in fact, that seems more likely in practice). The outcome is the
same, though: we don't verbatim reuse the delta'd object if its base is
not also being reused.

> > It might be worth having a perf test here. The case this is helping the
> > most, I think, is when somebody cloning wants all of the objects from
> > the beginning of the pack, but then there's a bunch of _other_ stuff.
> >
> > That could be unreachable objects kept by "repack -k", or maybe objects
> > reachable outside of heads and tags. Or objects that are part of other
> > delta islands. This last is the most plausible, really, because we'll
> > put all of the root-island objects at the beginning of the pack. So
> > maybe a good perf test would be to take an existing repository, add a
> > bunch of new commits in refs/foo/,
> 
> Not sure how I could do so. How would you do that?
> 
> I think it would be best to add completely new realistic commits that
> are changing the main code base.

I agree that would be the most realistic for the "forked repository
network" case that we originally wrote this for. But I think a more
mundane (and possibly easier to generate) example might be having a
bunch of refs/notes.

So perhaps something like this:

-- >8 --
#!/bin/sh

test_description='verbatim pack reuse with bitmaps'
. ./perf-lib.sh

test_perf_large_repo

test_expect_success 'generate a bunch of notes' '
	# bleh, is there really no way to bulk-add with git-notes?
	{
		test_tick &&
		echo "commit refs/notes/commits" &&
		printf "author %s <%s> %s\\n" \
			"$GIT_AUTHOR_NAME" \
			"$GIT_AUTHOR_EMAIL" \
			"$GIT_AUTHOR_DATE" &&
		printf "committer %s <%s> %s\\n" \
			"$GIT_COMMITTER_NAME" \
			"$GIT_COMMITTER_EMAIL" \
			"$GIT_COMMITTER_DATE" &&
		echo "data <<EOF" &&
		echo "add many commit notes" &&
		echo "EOF" &&
		git rev-list HEAD |
		sed "s#^\\(..\\)\\(..\\)#\1/\2/#" |
		while read note
		do
			echo "M 644 inline $note"
			echo "data <<EOF"
			echo "a unique note for $note"
			echo "EOF"
		done &&
		echo
	} |
	git fast-import --force
'

test_expect_success 'create bitmaps' '
	git repack -adb
'

test_perf 'clone without notes' '
	git for-each-ref --format="%(objectname)" refs/heads/ refs/tags/ |
	git pack-objects --stdout --revs --delta-base-offset >clone.pack
'

test_size 'clone size' '
	wc -c <clone.pack
'

test_done
-- 8< --

From my limited prodding, it doesn't trigger pack-reuse with the current
code, but would after your series. On linux.git, it produces:

Test                          origin              origin/jk/packfile-reuse-cleanup
----------------------------------------------------------------------------------
5312.3: clone without notes   14.16(14.26+4.84)   10.80(10.40+0.44) -23.7%        
5312.4: clone size                       1.4G                1.4G +0.0%           

though I suspect the more interesting savings is in RAM (since we don't
have to allocate a "struct object_entry" for most of the objects). I
don't know how hard it would be to collect that data in the perf suite.

Running the final pack-objects manually with massif, I get peak heap
usage dropping from ~1500MB to ~380MB.

For git.git it seems less impressive (very little CPU saved, and the
resulting pack is slightly larger, perhaps due to not re-considering
some deltas whose on-disk versions had to be thrown away):

Test                          origin            origin/jk/packfile-reuse-cleanup
--------------------------------------------------------------------------------
5312.3: clone without notes   1.22(3.60+0.16)   1.20(3.56+0.16) -1.6%           
5312.4: clone size                      65.3M             67.0M +2.5%           

> > > Is this case possible? try_partial_reuse() is only called when there is
> > > a 1 at the bit location.
> >
> > Yes, it's possible. The result of a bitmap walk is larger than a given
> > pack, because we have to extend it to match whatever objects the caller
> > asked for. E.g., imagine we have commit A, repack it into a bitmapped
> > pack, and then somebody pushes up commit B. Now I want to clone, getting
> > both A and B.
> >
> > Our bitmap result represents the whole answer, and so must include both
> > objects. But since "B" isn't in the pack, it doesn't have an assigned
> > bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
> > bit (valid only for that walk result!). But of course for pack reuse, we
> > only care about the objects that were actually in the bitmapped pack.
> 
> Not sure if these explanations should go into the commit message or a
> comment in the code. So I haven't added them for now.

I think this is really outside the scope of this series, even, and gets
into how the bitmap traversal works in general. I'm sure the
documentation for that could be a bit better.

> > Thinking on it more, though, I wonder if there's a bug hiding here. We
> > know that we can send the whole initial chunk verbatim for OFS_DELTA
> > objects, because the relative offsets will remain unchanged (i.e., there
> > are no "holes" that trigger our chunk code). But if there were a
> > REF_DELTA in that initial chunk, we have no certainty that the base is
> > being sent.
> >
> > In practice, this doesn't happen because the objects in question have to
> > be in a pack with bitmaps, which means we wrote it ourselves with
> > git-repack. And we'd never write REF_DELTA objects there.
> >
> > But I wonder if it's worth being a bit more careful (and what the
> > performance impact is; it would mean checking the type of every object
> > in that initial chunk).
> 
> I haven't done anything related to that. Maybe something can be done
> in a follow up patch.

Hmm.  I was thinking the problem was introduced in this series, but the
old code should suffer from this as well. I wondered if it might simply
be that the old code did not trigger often enough for anybody to notice,
but we have been running the code in this series for several years at
GitHub. So probably my reasoning above is sound (but it might still be
worth addressing).

-Peff
Christian Couder Jan. 23, 2020, 10:29 p.m. UTC | #4
On Thu, Dec 19, 2019 at 1:42 AM Jeff King <peff@peff.net> wrote:
>
> On Wed, Dec 18, 2019 at 12:26:28PM +0100, Christian Couder wrote:
>
> > >   - one complication is that if we're omitting some objects, we can't
> > >     set a delta against a base that we're not sending. So we have to
> > >     check each object in try_partial_reuse() to make sure we have its
> > >     delta (actually, that third big comment in that function is
> > >     incomplete, I think; it talks about sending the object later, not as
> > >     part of the reused pack, but we might not send it at all!).
> >
> > Are you talking about this comment:
> >
> >         /*
> >          * And finally, if we're not sending the base as part of our
> >          * reuse chunk, then don't send this object either. The base
> >          * would come after us, along with other objects not
> >          * necessarily in the pack, which means we'd need to convert
> >          * to REF_DELTA on the fly. Better to just let the normal
> >          * object_entry code path handle it.
> >          */
> >
> > ?
> >
> > I don't see how it's talking about sending the object later, so I have
> > left it as is. Maybe you can check it again in the v4 series I am
> > going to send.
>
> Yes, that's the comment. It says "the base would come after us". That
> could be true, but it also could be that we are not sending the object
> at all (in fact, that seems more likely in practice). The outcome is the
> same, though: we don't verbatim reuse the delta'd object if its base is
> not also being reused.

In the V4 (https://public-inbox.org/git/20191218112547.4974-11-chriscool@tuxfamily.org/)
there is now the following in the commit message:

One complication is that if we're omitting some objects,
we can't set a delta against a base that we're not
sending. So we have to check each object in
try_partial_reuse() to make sure we have its delta.

Would you also want the comment to be improved to something like:

         /*
          * And finally, if we're not sending the base as part of our
          * reuse chunk, then don't send this object either. Because we
          * should not send the object at all. Or because the base
          * would come after us, along with other objects not
          * necessarily in the pack, which means we'd need to convert
          * to REF_DELTA on the fly. Better to just let the normal
          * object_entry code path handle it.
          */

> > > It might be worth having a perf test here. The case this is helping the
> > > most, I think, is when somebody cloning wants all of the objects from
> > > the beginning of the pack, but then there's a bunch of _other_ stuff.
> > >
> > > That could be unreachable objects kept by "repack -k", or maybe objects
> > > reachable outside of heads and tags. Or objects that are part of other
> > > delta islands. This last is the most plausible, really, because we'll
> > > put all of the root-island objects at the beginning of the pack. So
> > > maybe a good perf test would be to take an existing repository, add a
> > > bunch of new commits in refs/foo/,
> >
> > Not sure how I could do so. How would you do that?
> >
> > I think it would be best to add completely new realistic commits that
> > are changing the main code base.
>
> I agree that would be the most realistic for the "forked repository
> network" case that we originally wrote this for. But I think a more
> mundane (and possibly easier to generate) example might be having a
> bunch of refs/notes.
>
> So perhaps something like this:

[...]

Ok, I will add this test to my patch series if I need to resend a V5.
Otherwise I will send it separately.

> From my limited prodding, it doesn't trigger pack-reuse with the current
> code, but would after your series. On linux.git, it produces:
>
> Test                          origin              origin/jk/packfile-reuse-cleanup
> ----------------------------------------------------------------------------------
> 5312.3: clone without notes   14.16(14.26+4.84)   10.80(10.40+0.44) -23.7%
> 5312.4: clone size                       1.4G                1.4G +0.0%

Yeah, nice improvement.

> though I suspect the more interesting savings is in RAM (since we don't
> have to allocate a "struct object_entry" for most of the objects). I
> don't know how hard it would be to collect that data in the perf suite.

I don't know either.

> Running the final pack-objects manually with massif, I get peak heap
> usage dropping from ~1500MB to ~380MB.

Very nice!

> For git.git it seems less impressive (very little CPU saved, and the
> resulting pack is slightly larger, perhaps due to not re-considering
> some deltas whose on-disk versions had to be thrown away):
>
> Test                          origin            origin/jk/packfile-reuse-cleanup
> --------------------------------------------------------------------------------
> 5312.3: clone without notes   1.22(3.60+0.16)   1.20(3.56+0.16) -1.6%
> 5312.4: clone size                      65.3M             67.0M +2.5%

Yeah, I think this is acceptable.

> > > > Is this case possible? try_partial_reuse() is only called when there is
> > > > a 1 at the bit location.
> > >
> > > Yes, it's possible. The result of a bitmap walk is larger than a given
> > > pack, because we have to extend it to match whatever objects the caller
> > > asked for. E.g., imagine we have commit A, repack it into a bitmapped
> > > pack, and then somebody pushes up commit B. Now I want to clone, getting
> > > both A and B.
> > >
> > > Our bitmap result represents the whole answer, and so must include both
> > > objects. But since "B" isn't in the pack, it doesn't have an assigned
> > > bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
> > > bit (valid only for that walk result!). But of course for pack reuse, we
> > > only care about the objects that were actually in the bitmapped pack.
> >
> > Not sure if these explanations should go into the commit message or a
> > comment in the code. So I haven't added them for now.
>
> I think this is really outside the scope of this series, even, and gets
> into how the bitmap traversal works in general. I'm sure the
> documentation for that could be a bit better.

Ok, I will take a look at improving the documentation, either in this
patch series if I resend it, or in a follow up small series.

> > > Thinking on it more, though, I wonder if there's a bug hiding here. We
> > > know that we can send the whole initial chunk verbatim for OFS_DELTA
> > > objects, because the relative offsets will remain unchanged (i.e., there
> > > are no "holes" that trigger our chunk code). But if there were a
> > > REF_DELTA in that initial chunk, we have no certainty that the base is
> > > being sent.
> > >
> > > In practice, this doesn't happen because the objects in question have to
> > > be in a pack with bitmaps, which means we wrote it ourselves with
> > > git-repack. And we'd never write REF_DELTA objects there.
> > >
> > > But I wonder if it's worth being a bit more careful (and what the
> > > performance impact is; it would mean checking the type of every object
> > > in that initial chunk).
> >
> > I haven't done anything related to that. Maybe something can be done
> > in a follow up patch.
>
> Hmm.  I was thinking the problem was introduced in this series, but the
> old code should suffer from this as well. I wondered if it might simply
> be that the old code did not trigger often enough for anybody to notice,
> but we have been running the code in this series for several years at
> GitHub. So probably my reasoning above is sound (but it might still be
> worth addressing).

Ok, I will take a look.

Thanks,
Christian.

Patch

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 08898331ef..64ab033923 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -92,7 +92,7 @@  static struct progress *progress_state;
 
 static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
-static off_t reuse_packfile_offset;
+static struct bitmap *reuse_packfile_bitmap;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
@@ -785,57 +785,185 @@  static struct object_entry **compute_write_order(void)
 	return wo;
 }
 
-static off_t write_reused_pack(struct hashfile *f)
+
+/*
+ * A reused set of objects. All objects in a chunk have the same
+ * relative position in the original packfile and the generated
+ * packfile.
+ */
+
+static struct reused_chunk {
+	/* The offset of the first object of this chunk in the original
+	 * packfile. */
+	off_t original;
+	/* The offset of the first object of this chunk in the generated
+	 * packfile minus "original". */
+	off_t difference;
+} *reused_chunks;
+static int reused_chunks_nr;
+static int reused_chunks_alloc;
+
+static void record_reused_object(off_t where, off_t offset)
 {
-	unsigned char buffer[8192];
-	off_t to_write, total;
-	int fd;
+	if (reused_chunks_nr && reused_chunks[reused_chunks_nr-1].difference == offset)
+		return;
 
-	if (!is_pack_valid(reuse_packfile))
-		die(_("packfile is invalid: %s"), reuse_packfile->pack_name);
+	ALLOC_GROW(reused_chunks, reused_chunks_nr + 1,
+		   reused_chunks_alloc);
+	reused_chunks[reused_chunks_nr].original = where;
+	reused_chunks[reused_chunks_nr].difference = offset;
+	reused_chunks_nr++;
+}
 
-	fd = git_open(reuse_packfile->pack_name);
-	if (fd < 0)
-		die_errno(_("unable to open packfile for reuse: %s"),
-			  reuse_packfile->pack_name);
+/*
+ * Binary search to find the chunk that "where" is in. Note
+ * that we're not looking for an exact match, just the first
+ * chunk that contains it (which implicitly ends at the start
+ * of the next chunk).
+ */
+static off_t find_reused_offset(off_t where)
+{
+	int lo = 0, hi = reused_chunks_nr;
+	while (lo < hi) {
+		int mi = lo + ((hi - lo) / 2);
+		if (where == reused_chunks[mi].original)
+			return reused_chunks[mi].difference;
+		if (where < reused_chunks[mi].original)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
 
-	if (lseek(fd, sizeof(struct pack_header), SEEK_SET) == -1)
-		die_errno(_("unable to seek in reused packfile"));
+	/*
+	 * The first chunk starts at zero, so we can't have gone below
+	 * there.
+	 */
+	assert(lo);
+	return reused_chunks[lo-1].difference;
+}
+
+static void write_reused_pack_one(size_t pos, struct hashfile *out,
+				  struct pack_window **w_curs)
+{
+	off_t offset, next, cur;
+	enum object_type type;
+	unsigned long size;
 
-	if (reuse_packfile_offset < 0)
-		reuse_packfile_offset = reuse_packfile->pack_size - the_hash_algo->rawsz;
+	offset = reuse_packfile->revindex[pos].offset;
+	next = reuse_packfile->revindex[pos + 1].offset;
 
-	total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
+	record_reused_object(offset, offset - hashfile_total(out));
 
-	while (to_write) {
-		int read_pack = xread(fd, buffer, sizeof(buffer));
+	cur = offset;
+	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
+	assert(type >= 0);
 
-		if (read_pack <= 0)
-			die_errno(_("unable to read from reused packfile"));
+	if (type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		off_t fixup;
+
+		unsigned char header[MAX_PACK_OBJECT_HEADER];
+		unsigned len;
+
+		base_offset = get_delta_base(reuse_packfile, w_curs, &cur, type, offset);
+		assert(base_offset != 0);
+
+		/* Convert to REF_DELTA if we must... */
+		if (!allow_ofs_delta) {
+			int base_pos = find_revindex_position(reuse_packfile, base_offset);
+			const unsigned char *base_sha1 =
+				nth_packed_object_sha1(reuse_packfile,
+						       reuse_packfile->revindex[base_pos].nr);
+
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_REF_DELTA, size);
+			hashwrite(out, header, len);
+			hashwrite(out, base_sha1, 20);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
 
-		if (read_pack > to_write)
-			read_pack = to_write;
+		/* Otherwise see if we need to rewrite the offset... */
+		fixup = find_reused_offset(offset) -
+			find_reused_offset(base_offset);
+		if (fixup) {
+			unsigned char ofs_header[10];
+			unsigned i, ofs_len;
+			off_t ofs = offset - base_offset - fixup;
 
-		hashwrite(f, buffer, read_pack);
-		to_write -= read_pack;
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_OFS_DELTA, size);
+
+			i = sizeof(ofs_header) - 1;
+			ofs_header[i] = ofs & 127;
+			while (ofs >>= 7)
+				ofs_header[--i] = 128 | (--ofs & 127);
+
+			ofs_len = sizeof(ofs_header) - i;
+
+			hashwrite(out, header, len);
+			hashwrite(out, ofs_header + sizeof(ofs_header) - ofs_len, ofs_len);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
+
+		/* ...otherwise we have no fixup, and can write it verbatim */
+	}
+
+	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
+}
+
+static size_t write_reused_pack_verbatim(struct hashfile *out,
+					 struct pack_window **w_curs)
+{
+	size_t pos = 0;
+
+	while (pos < reuse_packfile_bitmap->word_alloc &&
+			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
+		pos++;
+
+	if (pos) {
+		off_t to_write;
+
+		written = (pos * BITS_IN_EWORD);
+		to_write = reuse_packfile->revindex[written].offset
+			- sizeof(struct pack_header);
+
+		/* We're recording one chunk, not one object. */
+		record_reused_object(sizeof(struct pack_header), 0);
+		hashflush(out);
+		copy_pack_data(out, reuse_packfile, w_curs,
+			sizeof(struct pack_header), to_write);
 
-		/*
-		 * We don't know the actual number of objects written,
-		 * only how many bytes written, how many bytes total, and
-		 * how many objects total. So we can fake it by pretending all
-		 * objects we are writing are the same size. This gives us a
-		 * smooth progress meter, and at the end it matches the true
-		 * answer.
-		 */
-		written = reuse_packfile_objects *
-				(((double)(total - to_write)) / total);
 		display_progress(progress_state, written);
 	}
+	return pos;
+}
 
-	close(fd);
-	written = reuse_packfile_objects;
-	display_progress(progress_state, written);
-	return reuse_packfile_offset - sizeof(struct pack_header);
+static void write_reused_pack(struct hashfile *f)
+{
+	size_t i = 0;
+	uint32_t offset;
+	struct pack_window *w_curs = NULL;
+
+	if (allow_ofs_delta)
+		i = write_reused_pack_verbatim(f, &w_curs);
+
+	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
+		eword_t word = reuse_packfile_bitmap->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
+
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
+
+			offset += ewah_bit_ctz64(word >> offset);
+			write_reused_pack_one(pos + offset, f, &w_curs);
+			display_progress(progress_state, ++written);
+		}
+	}
+
+	unuse_pack(&w_curs);
 }
 
 static const char no_split_warning[] = N_(
@@ -868,11 +996,9 @@  static void write_pack_file(void)
 		offset = write_pack_header(f, nr_remaining);
 
 		if (reuse_packfile) {
-			off_t packfile_size;
 			assert(pack_to_stdout);
-
-			packfile_size = write_reused_pack(f);
-			offset += packfile_size;
+			write_reused_pack(f);
+			offset = hashfile_total(f);
 		}
 
 		nr_written = 0;
@@ -1001,6 +1127,10 @@  static int have_duplicate_entry(const struct object_id *oid,
 {
 	struct object_entry *entry;
 
+	if (reuse_packfile_bitmap &&
+	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
+		return 1;
+
 	entry = packlist_find(&to_pack, oid);
 	if (!entry)
 		return 0;
@@ -2555,7 +2685,9 @@  static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 
 static int obj_is_packed(const struct object_id *oid)
 {
-	return !!packlist_find(&to_pack, oid);
+	return packlist_find(&to_pack, oid) ||
+		(reuse_packfile_bitmap &&
+		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
 }
 
 static void add_tag_chain(const struct object_id *oid)
@@ -2661,6 +2793,7 @@  static void prepare_pack(int window, int depth)
 
 	if (nr_deltas && n > 1) {
 		unsigned nr_done = 0;
+
 		if (progress)
 			progress_state = start_progress(_("Compressing objects"),
 							nr_deltas);
@@ -3042,7 +3175,6 @@  static int pack_options_allow_reuse(void)
 {
 	return allow_pack_reuse &&
 	       pack_to_stdout &&
-	       allow_ofs_delta &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
 	       (!local || !have_non_local_packs) &&
@@ -3059,7 +3191,7 @@  static int get_object_list_from_bitmap(struct rev_info *revs)
 			bitmap_git,
 			&reuse_packfile,
 			&reuse_packfile_objects,
-			&reuse_packfile_offset)) {
+			&reuse_packfile_bitmap)) {
 		assert(reuse_packfile_objects);
 		nr_result += reuse_packfile_objects;
 		display_progress(progress_state, nr_result);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8a51302a1a..cbfc544411 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -326,6 +326,13 @@  static int load_pack_bitmap(struct bitmap_index *bitmap_git)
 	munmap(bitmap_git->map, bitmap_git->map_size);
 	bitmap_git->map = NULL;
 	bitmap_git->map_size = 0;
+
+	kh_destroy_oid_map(bitmap_git->bitmaps);
+	bitmap_git->bitmaps = NULL;
+
+	kh_destroy_oid_pos(bitmap_git->ext_index.positions);
+	bitmap_git->ext_index.positions = NULL;
+
 	return -1;
 }
 
@@ -764,65 +771,126 @@  struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 	return NULL;
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile,
-				       uint32_t *entries,
-				       off_t *up_to)
+static void try_partial_reuse(struct bitmap_index *bitmap_git,
+			      size_t pos,
+			      struct bitmap *reuse,
+			      struct pack_window **w_curs)
 {
+	struct revindex_entry *revidx;
+	off_t offset;
+	enum object_type type;
+	unsigned long size;
+
+	if (pos >= bitmap_git->pack->num_objects)
+		return; /* not actually in the pack */
+
+	revidx = &bitmap_git->pack->revindex[pos];
+	offset = revidx->offset;
+	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
+	if (type < 0)
+		return; /* broken packfile, punt */
+
+	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		int base_pos;
+
+		/*
+		 * Find the position of the base object so we can look it up
+		 * in our bitmaps. If we can't come up with an offset, or if
+		 * that offset is not in the revidx, the pack is corrupt.
+		 * There's nothing we can do, so just punt on this object,
+		 * and the normal slow path will complain about it in
+		 * more detail.
+		 */
+		base_offset = get_delta_base(bitmap_git->pack, w_curs,
+					     &offset, type, revidx->offset);
+		if (!base_offset)
+			return;
+		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
+		if (base_pos < 0)
+			return;
+
+		/*
+		 * We assume delta dependencies always point backwards. This
+		 * lets us do a single pass, and is basically always true
+		 * due to the way OFS_DELTAs work. You would not typically
+		 * find REF_DELTA in a bitmapped pack (since we only bitmap
+		 * packs we write fresh, and OFS_DELTA is the default). But
+		 * let's double check to make sure the pack wasn't written with
+		 * odd parameters.
+		 */
+		if (base_pos >= pos)
+			return;
+
+		/*
+		 * And finally, if we're not sending the base as part of our
+		 * reuse chunk, then don't send this object either. The base
+		 * would come after us, along with other objects not
+		 * necessarily in the pack, which means we'd need to convert
+		 * to REF_DELTA on the fly. Better to just let the normal
+		 * object_entry code path handle it.
+		 */
+		if (!bitmap_get(reuse, base_pos))
+			return;
+	}
+
 	/*
-	 * Reuse the packfile content if we need more than
-	 * 90% of its objects
+	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	static const double REUSE_PERCENT = 0.9;
+	bitmap_set(reuse, pos);
+}
 
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+				       struct packed_git **packfile_out,
+				       uint32_t *entries,
+				       struct bitmap **reuse_out)
+{
 	struct bitmap *result = bitmap_git->result;
-	uint32_t reuse_threshold;
-	uint32_t i, reuse_objects = 0;
+	struct bitmap *reuse;
+	struct pack_window *w_curs = NULL;
+	size_t i = 0;
+	uint32_t offset;
 
 	assert(result);
 
-	for (i = 0; i < result->word_alloc; ++i) {
-		if (result->words[i] != (eword_t)~0) {
-			reuse_objects += ewah_bit_ctz64(~result->words[i]);
-			break;
-		}
-
-		reuse_objects += BITS_IN_EWORD;
-	}
+	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
+		i++;
 
-#ifdef GIT_BITMAP_DEBUG
-	{
-		const unsigned char *sha1;
-		struct revindex_entry *entry;
+	/* Don't mark objects not in the packfile */
+	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
+		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
 
-		entry = &bitmap_git->reverse_index->revindex[reuse_objects];
-		sha1 = nth_packed_object_sha1(bitmap_git->pack, entry->nr);
+	reuse = bitmap_word_alloc(i);
+	memset(reuse->words, 0xFF, i * sizeof(eword_t));
 
-		fprintf(stderr, "Failed to reuse at %d (%016llx)\n",
-			reuse_objects, result->words[i]);
-		fprintf(stderr, " %s\n", hash_to_hex(sha1));
-	}
-#endif
+	for (; i < result->word_alloc; ++i) {
+		eword_t word = result->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
 
-	if (!reuse_objects)
-		return -1;
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
 
-	if (reuse_objects >= bitmap_git->pack->num_objects) {
-		bitmap_git->reuse_objects = *entries = bitmap_git->pack->num_objects;
-		*up_to = -1; /* reuse the full pack */
-		*packfile = bitmap_git->pack;
-		return 0;
+			offset += ewah_bit_ctz64(word >> offset);
+			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
+		}
 	}
 
-	reuse_threshold = bitmap_popcount(bitmap_git->result) * REUSE_PERCENT;
+	unuse_pack(&w_curs);
 
-	if (reuse_objects < reuse_threshold)
+	*entries = bitmap_popcount(reuse);
+	if (!*entries) {
+		bitmap_free(reuse);
 		return -1;
+	}
 
-	bitmap_git->reuse_objects = *entries = reuse_objects;
-	*up_to = bitmap_git->pack->revindex[reuse_objects].offset;
-	*packfile = bitmap_git->pack;
-
+	/*
+	 * Drop any reused objects from the result, since they will not
+	 * need to be handled separately.
+	 */
+	bitmap_and_not(result, reuse);
+	*packfile_out = bitmap_git->pack;
+	*reuse_out = reuse;
 	return 0;
 }
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 6ab6033dbe..bcd03b8993 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -50,7 +50,8 @@  void test_bitmap_walk(struct rev_info *revs);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
-				       uint32_t *entries, off_t *up_to);
+				       uint32_t *entries,
+				       struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);