diff mbox series

bundle-uri: copy all bundle references ino the refs/bundle space

Message ID pull.1897.git.git.1740489585344.gitgitgadget@gmail.com (mailing list archive)
State New
Headers show
Series bundle-uri: copy all bundle references ino the refs/bundle space | expand

Commit Message

Scott Chacon Feb. 25, 2025, 1:19 p.m. UTC
From: Scott Chacon <schacon@gmail.com>

When downloading bundles via the bundle-uri functionality, we only copy the
references from refs/heads into the refs/bundle space. I'm not sure why this
refspec is hardcoded to be so limited, but it makes the ref negotiation on
the subsequent fetch suboptimal, since it won't use objects that are
referenced outside of the current heads of the bundled repository.

This change to copy everything in refs/ in the bundle to refs/bundles/
significantly helps the subsequent fetch, since nearly all the references
are now included in the negotiation.

Signed-off-by: Scott Chacon <schacon@gmail.com>
---
    bundle-uri: copy all bundle references ino the refs/bundle space
    
    This patch probably isn't meant for inclusion, but I wanted to see if
    I'm crazy here or missing something.
    
    It appears that the bundle-uri functionality has an issue with ref
    negotiation. I hit this because I assumed all the objects I bundled
    would be seen in the negotiation, but since only references under
    refs/heads are copied to refs/bundles, they are the only ones that are
    seen for negotiation, so it's quite inefficient.
    
    I did several experiments trying to create a bundle where the subsequent
    fetch was almost a no-op and it was frustratingly impossible and it took
    me a while to figure out why it kept trying to get tons of other
    objects.
    
    Furthermore, when I bundled just a tag (thinking it would have most
    reachable objects) it completely failed to work because there were no
    refs/heads/ available for negotiation - so it downloaded a huge file and
    then still started from scratch on the fetch.
    
    However, if I copy all the refs in the bundle, it makes a big
    difference.
    
    Here are some benchmarks from the gitlab oss repo.
    
    A normal clone pulls down 3,005,985 objects:
    
    ❯  time git clone https://gitlab.com/gitlab-org/gitlab-foss.git gl5
    Cloning into 'gl5'...
    remote: Enumerating objects: 3005985, done.
    remote: Counting objects: 100% (314617/314617), done.
    remote: Compressing objects: 100% (64278/64278), done.
    remote: Total 3005985 (delta 244429), reused 311002 (delta 241404), pack-reused 2691368 (from 1)
    Receiving objects: 100% (3005985/3005985), 1.35 GiB | 23.91 MiB/s, done.
    Resolving deltas: 100% (2361484/2361484), done.
    Updating files: 100% (59972/59972), done.
    (*) 162.93s user 37.94s system 128% cpu 2:36.49 total
    
    
    Then, I tried to bundle everything from a fresh clone, including all the
    refs.
    
     ❯  git bundle create gitlab-base.bundle --all
    
    
    This creates a 1.4G bundle, which I uploaded to a CDN and cloned again
    with the bundle-uri:
    
    ❯  time git clone --bundle-uri=https://[cdn]/bundle/gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git gl4
    Cloning into 'gl4'...
    remote: Enumerating objects: 1092703, done.
    remote: Counting objects: 100% (973405/973405), done.
    remote: Compressing objects: 100% (385827/385827), done.
    remote: Total 959773 (delta 710976), reused 766809 (delta 554276), pack-reused 0 (from 0)
    Receiving objects: 100% (959773/959773), 366.94 MiB | 20.87 MiB/s, done.
    Resolving deltas: 100% (710976/710976), completed with 9081 local objects.
    Checking objects: 100% (4194304/4194304), done.
    Checking connectivity: 959668, done.
    Updating files: 100% (59972/59972), done.
    (*) 181.98s user 40.23s system 110% cpu 3:20.89 total
    
    
    Which is better from an "objects from the server" perspective, but still
    has to download 959,773 objects, so 32% of the total. But it also takes
    quite a lot longer, because it's redownloading most of those objects for
    a second time.
    
    If I apply this patch where I change the refspec for the bundle ref copy
    from refs/heads/ to just refs/ and clone with this patched version, it's
    much better:
    
    ❯  time ./git clone --bundle-uri=https://[cdn]/bundle/gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git gl3
    Cloning into 'gl3'...
    remote: Enumerating objects: 65538, done.
    remote: Counting objects: 100% (56054/56054), done.
    remote: Compressing objects: 100% (28950/28950), done.
    remote: Total 43877 (delta 27401), reused 25170 (delta 13546), pack-reused 0 (from 0)
    Receiving objects: 100% (43877/43877), 40.42 MiB | 22.27 MiB/s, done.
    Resolving deltas: 100% (27401/27401), completed with 8564 local objects.
    Updating files: 100% (59972/59972), done.
    (*) 143.45s user 29.33s system 124% cpu 2:19.27 total
    
    
    Now I'm only getting an extra 43k objects, so 1% of the original total,
    and the entire operation is a bit faster as well.
    
    I'm not sure if there is a downside here, it seems clearly how you would
    want the negotiation to go. It ends up with way more refs under
    refs/bundle (now there is refs/bundle/origin/master, etc) but that's
    being polluted by the head refs anyhow, right?
    
    Is this a reasonable change?

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1897%2Fschacon%2Fsc-more-bundle-refs-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1897/schacon/sc-more-bundle-refs-v1
Pull-Request: https://github.com/git/git/pull/1897

 bundle-uri.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


base-commit: 2d2a71ce85026edcc40f469678a1035df0dfcf57

Comments

Junio C Hamano Feb. 25, 2025, 6:14 p.m. UTC | #1
"Scott Chacon via GitGitGadget" <gitgitgadget@gmail.com> writes:

>     Furthermore, when I bundled just a tag (thinking it would have most
>     reachable objects) it completely failed to work because there were no
>     refs/heads/ available for negotiation - so it downloaded a huge file and
>     then still started from scratch on the fetch.

Nice finding.

>     Now I'm only getting an extra 43k objects, so 1% of the original total,
>     and the entire operation is a bit faster as well.

That makes all sense.

>     I'm not sure if there is a downside here, it seems clearly how you would
>     want the negotiation to go. It ends up with way more refs under
>     refs/bundle (now there is refs/bundle/origin/master, etc) but that's
>     being polluted by the head refs anyhow, right?

I am not sure what you mean by "being polluted by the head refs
anyhow", but we should be equipped to deal with a repository with
tons of local branches, so having the comparable number of
remote-tracking branches instead of local branches now exposed in
refs/bundle/remotes/* hierarchy should work equally as well, or we
would have something we need to fix.  So in principle I do not see
a problem with this approach.

The mapping used to be "refs/heads/foo" to "refs/bundles/foo", but
now "refs/heads/foo" is mapped to "refs/bundles/heads/foo", so that
you would presumably be mapping "refs/remotes/origin/master" to
"refs/bundles/remotes/origin/master", right?  I hope existing users
are *not* looking at their resulting refs/bundles/ hierarchy and
somehow assuming the original mapping.

This is not something this "fix" changes, but unbundle_all_bundles()
apparently is prepared to handle more than one bundles.  I wonder
what happens when multiple bundles expose the same branch pointing
at different objects?  The way I read unbundle_from_file() is that
a new one overwrites the previous value, so even though we may have
all unbundled many bundles, the objects from earlier bundles may
lose their anchors, subject to be garbage-collected.

Imagine creating a bundle with two refs, refs/heads and
refs/remotes, and append that bundle as the last bundle of a bunch
of bundles full of local and remote-tracking branches, that have
populated refs/bundles/ hierarchy with tons of refs.  Now the last
bundle is unbundled, and these two phoney refs would nuke everything
that used to be under refs/bundles/heads/* and refs/bundles/remotes/*
left by unpacking previous bundles, right?

>     Is this a reasonable change?

This is mostly Stolee's design, IIRC, so I have CC'ed; the work is
mostly from 53a50892 (bundle-uri: create basic file-copy logic,
2022-08-09) that is more than 2 years ago.

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1897%2Fschacon%2Fsc-more-bundle-refs-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1897/schacon/sc-more-bundle-refs-v1
> Pull-Request: https://github.com/git/git/pull/1897
>
>  bundle-uri.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/bundle-uri.c b/bundle-uri.c
> index 744257c49c1..3371d56f4ce 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -403,7 +403,7 @@ static int unbundle_from_file(struct repository *r, const char *file)
>  		const char *branch_name;
>  		int has_old;
>  
> -		if (!skip_prefix(refname->string, "refs/heads/", &branch_name))
> +		if (!skip_prefix(refname->string, "refs/", &branch_name))
>  			continue;

By skipping only "refs/", "branch_name" is no longer a branch name,
which may be something we may want to fix, but if we were to address
the "names from later bundles seem to overwrite names used by
previous boundles, and unanchor objects obtained from them" problem,
I suspect we'd rather want to create new and unique names there,
without assuming or relying on that the names used by these bundles
are reasonably unique, so this part of the code may have to change
anyway, so we may not care too deeply at this point.
Derrick Stolee Feb. 25, 2025, 11:36 p.m. UTC | #2
On 2/25/25 1:14 PM, Junio C Hamano wrote:
> "Scott Chacon via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>>      Furthermore, when I bundled just a tag (thinking it would have most
>>      reachable objects) it completely failed to work because there were no
>>      refs/heads/ available for negotiation - so it downloaded a huge file and
>>      then still started from scratch on the fetch.
> 
> Nice finding.

This is an interesting case that I had not considered in the original
implementation.

The intention of the design is to avoid having the bundle URI fetch
changing tag refs, especially annotated tags. Those tag updates are
expected to be advertised in the "git fetch" output. It would probably
be best to peel the tag refs to a commit and then create a fake branch
for the bundle.

>>      Now I'm only getting an extra 43k objects, so 1% of the original total,
>>      and the entire operation is a bit faster as well.
> 
> That makes all sense.
> 
>>      I'm not sure if there is a downside here, it seems clearly how you would
>>      want the negotiation to go. It ends up with way more refs under
>>      refs/bundle (now there is refs/bundle/origin/master, etc) but that's
>>      being polluted by the head refs anyhow, right?
> 
> I am not sure what you mean by "being polluted by the head refs
> anyhow", but we should be equipped to deal with a repository with
> tons of local branches, so having the comparable number of
> remote-tracking branches instead of local branches now exposed in
> refs/bundle/remotes/* hierarchy should work equally as well, or we
> would have something we need to fix.  So in principle I do not see
> a problem with this approach.
> 
> The mapping used to be "refs/heads/foo" to "refs/bundles/foo", but
> now "refs/heads/foo" is mapped to "refs/bundles/heads/foo", so that
> you would presumably be mapping "refs/remotes/origin/master" to
> "refs/bundles/remotes/origin/master", right?  I hope existing users
> are *not* looking at their resulting refs/bundles/ hierarchy and
> somehow assuming the original mapping.
> 
> This is not something this "fix" changes, but unbundle_all_bundles()
> apparently is prepared to handle more than one bundles.  I wonder
> what happens when multiple bundles expose the same branch pointing
> at different objects?  The way I read unbundle_from_file() is that
> a new one overwrites the previous value, so even though we may have
> all unbundled many bundles, the objects from earlier bundles may
> lose their anchors, subject to be garbage-collected.
> 
> Imagine creating a bundle with two refs, refs/heads and
> refs/remotes, and append that bundle as the last bundle of a bunch
> of bundles full of local and remote-tracking branches, that have
> populated refs/bundles/ hierarchy with tons of refs.  Now the last
> bundle is unbundled, and these two phoney refs would nuke everything
> that used to be under refs/bundles/heads/* and refs/bundles/remotes/*
> left by unpacking previous bundles, right?
> 
>>      Is this a reasonable change?
> 
> This is mostly Stolee's design, IIRC, so I have CC'ed; the work is
> mostly from 53a50892 (bundle-uri: create basic file-copy logic,
> 2022-08-09) that is more than 2 years ago.

The biggest question I had (and tried to get ahead of on the PR) is
the use of a test to demonstrate what kind of bundle files cause this
issue. It would be important to demosntrate that the repo is still
usable if "refs/bundles/tags/v1.0" exists and points to a tag object.

>> -		if (!skip_prefix(refname->string, "refs/heads/", &branch_name))
>> +		if (!skip_prefix(refname->string, "refs/", &branch_name))
>>   			continue;

So I'm OK with relaxing this to be more flexible, but I'm not sure
why the bundles couldn't be created using "refs/heads/", possibly via
changing the ref names during bundle creation.

> By skipping only "refs/", "branch_name" is no longer a branch name,
> which may be something we may want to fix, but if we were to address
> the "names from later bundles seem to overwrite names used by
> previous boundles, and unanchor objects obtained from them" problem,
> I suspect we'd rather want to create new and unique names there,
> without assuming or relying on that the names used by these bundles
> are reasonably unique, so this part of the code may have to change
> anyway, so we may not care too deeply at this point.

Thanks,
-Stolee
Scott Chacon March 1, 2025, 10:23 a.m. UTC | #3
Hey,

On Wed, Feb 26, 2025 at 12:36 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> The intention of the design is to avoid having the bundle URI fetch
> changing tag refs, especially annotated tags. Those tag updates are
> expected to be advertised in the "git fetch" output. It would probably
> be best to peel the tag refs to a commit and then create a fake branch
> for the bundle.

The issue for this and also for the other suggestion you have later on
is that I'm not sure how this can be easily done with the bundle
command. It seems like everyone would have to write some sort of
script to create a special type of bundle so that all these objects
are referenced in a way that makes the bundle-uri helper actually get
most of the objects that are needed.

Is there some option to rev-list that does this? Or are you saying
it's better to write a script?

> The biggest question I had (and tried to get ahead of on the PR) is
> the use of a test to demonstrate what kind of bundle files cause this
> issue. It would be important to demosntrate that the repo is still
> usable if "refs/bundles/tags/v1.0" exists and points to a tag object.

I have written a test and I'll submit the new series in a minute, but
I'm not sure what you mean by 'usable' in this context. Is there a
situation where Git gets mad if there are annotated tags that aren't
under refs/tags?

I have done these test clones and nothing bad seems to happen having
them in refs/bundle/tags/v1.0 that I notice, but I don't know how to
write a test that specifically verifies that.

> So I'm OK with relaxing this to be more flexible, but I'm not sure
> why the bundles couldn't be created using "refs/heads/", possibly via
> changing the ref names during bundle creation.

So same point here. I think the bundle-uri functionality isn't
particularly effective if the creation of the bundle needs special
scripts to create in a way that is expected.

One other approach would be to add an option to `git bundle` that does
this sanitization (unpeeling things into fake branch heads), like some
`--bundle-for-uri`, but I feel like just using `--all` and having the
clone handle it in the way I proposed might be much simpler and more
usable.

We could also immediately delete everything under `refs/bundle/tags`
after the fetch if we don't like them there, but still having them be
available for the fetch negotiation.

I'll send a new series with the existing tests updated to look for
`refs/bundle/heads/*` instead of `refs/bundle/*` and adding a very
simple test to see that the tags were unpacked as the next step.

Scott
Junio C Hamano March 3, 2025, 5:12 p.m. UTC | #4
Scott Chacon <schacon@gmail.com> writes:

> Hey,
>
> On Wed, Feb 26, 2025 at 12:36 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> The intention of the design is to avoid having the bundle URI fetch
>> changing tag refs, especially annotated tags. Those tag updates are
>> expected to be advertised in the "git fetch" output. It would probably
>> be best to peel the tag refs to a commit and then create a fake branch
>> for the bundle.

I am not sure where that need to avoid including tags comes from.

>> The biggest question I had (and tried to get ahead of on the PR) is
>> the use of a test to demonstrate what kind of bundle files cause this
>> issue. It would be important to demosntrate that the repo is still
>> usable if "refs/bundles/tags/v1.0" exists and points to a tag object.
>
> I have written a test and I'll submit the new series in a minute, but
> I'm not sure what you mean by 'usable' in this context. Is there a
> situation where Git gets mad if there are annotated tags that aren't
> under refs/tags?

I do not know of any at least for a local consumption of these tags.

> I have done these test clones and nothing bad seems to happen having
> them in refs/bundle/tags/v1.0 that I notice, but I don't know how to
> write a test that specifically verifies that.

Can it be some brittleness Derrick is worried about auto-following
of tags during future "git fetch"?  You store a tag that a regular
fetch may want to store at refs/tags/v1.0 in refs/bundles/tags/v1.0
taken from the bundle, and then a later fetch may advance the
history based on you extracted from the bundle---without having to
run an explicit "git fetch --tags" or "git fetch origin v1.0", would
we ever obtain "refs/tags/v1.0" with only the usual auto-following
when we have the same tag elsewhere?

In any case, instead of me speculating, I'd prefer to hear from
Derrick, who is a lot more familiar with the mechanism under
discussion, what the issues are that we want to limit ourselves to
local branches.

Thanks.
Derrick Stolee March 3, 2025, 6:46 p.m. UTC | #5
On 3/3/25 12:12 PM, Junio C Hamano wrote:
> Scott Chacon <schacon@gmail.com> writes:
> 
>> Hey,
>>
>> On Wed, Feb 26, 2025 at 12:36 AM Derrick Stolee <stolee@gmail.com> wrote:
>>>
>>> The intention of the design is to avoid having the bundle URI fetch
>>> changing tag refs, especially annotated tags. Those tag updates are
>>> expected to be advertised in the "git fetch" output. It would probably
>>> be best to peel the tag refs to a commit and then create a fake branch
>>> for the bundle.
> 
> I am not sure where that need to avoid including tags comes from.
> 
>>> The biggest question I had (and tried to get ahead of on the PR) is
>>> the use of a test to demonstrate what kind of bundle files cause this
>>> issue. It would be important to demosntrate that the repo is still
>>> usable if "refs/bundles/tags/v1.0" exists and points to a tag object.
>>
>> I have written a test and I'll submit the new series in a minute, but
>> I'm not sure what you mean by 'usable' in this context. Is there a
>> situation where Git gets mad if there are annotated tags that aren't
>> under refs/tags?
> 
> I do not know of any at least for a local consumption of these tags.

These ideas about avoiding annotated tags outside of refs/tags/ is
likely an invention of my own, and must not be a firm expectation of
the Git tool.

>> I have done these test clones and nothing bad seems to happen having
>> them in refs/bundle/tags/v1.0 that I notice, but I don't know how to
>> write a test that specifically verifies that.
> 
> Can it be some brittleness Derrick is worried about auto-following
> of tags during future "git fetch"?  You store a tag that a regular
> fetch may want to store at refs/tags/v1.0 in refs/bundles/tags/v1.0
> taken from the bundle, and then a later fetch may advance the
> history based on you extracted from the bundle---without having to
> run an explicit "git fetch --tags" or "git fetch origin v1.0", would
> we ever obtain "refs/tags/v1.0" with only the usual auto-following
> when we have the same tag elsewhere?
> 
> In any case, instead of me speculating, I'd prefer to hear from
> Derrick, who is a lot more familiar with the mechanism under
> discussion, what the issues are that we want to limit ourselves to
> local branches.
My thinking here is similar to the prefetch maintenance task: we want
users who run "git fetch [origin]" to see the refs that are being
updated by that action in the normal foreground messages. This includes
new tag messages, such as in my local copy of the Git repository:

$ git fetch origin
remote: Enumerating objects: 186, done.
remote: Counting objects: 100% (168/168), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 186 (delta 120), reused 162 (delta 117), pack-reused 18 (from 2)
Receiving objects: 100% (186/186), 118.04 KiB | 10.73 MiB/s, done.
Resolving deltas: 100% (122/122), completed with 24 local objects.
 From github.com:git/git
    b838bf19389..db91954e186  master      -> origin/master
    2feabab25ac..627208d89de  next        -> origin/next
  + 9b2be6f7989...39dfbbf0521 seen        -> origin/seen  (forced update)
    5fa232d8520..fb8899337a8  todo        -> origin/todo
  * [new tag]                 v2.49.0-rc0 -> v2.49.0-rc0

As long as these messages are still appearing, then I'm fine with
these annotated tags being added to the object database earlier by
the bundle URI mechanism.

Scott: You also asked about the intended design for bundle URIs and
things, and the best places to look are Git's technical docs [1] and
the bundle server reference implementation docs [2]. CC Victoria who
implemented the reference implementation and wrote those docs.

[1] https://github.com/git/git/blob/master/Documentation/technical/bundle-uri.adoc

[2] 
https://github.com/git-ecosystem/git-bundle-server/blob/main/docs/technical/architecture.md

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/bundle-uri.c b/bundle-uri.c
index 744257c49c1..3371d56f4ce 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -403,7 +403,7 @@  static int unbundle_from_file(struct repository *r, const char *file)
 		const char *branch_name;
 		int has_old;
 
-		if (!skip_prefix(refname->string, "refs/heads/", &branch_name))
+		if (!skip_prefix(refname->string, "refs/", &branch_name))
 			continue;
 
 		strbuf_setlen(&bundle_ref, bundle_prefix_len);