Message ID | 20210119143348.27535-2-jacob@gitlab.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | builtin/pack-objects.c: avoid iterating all refs | expand |
On Tue, Jan 19, 2021 at 03:33:48PM +0100, Jacob Vosmaer wrote: > In git-pack-objects, we iterate over all the tags if the --include-tag > option is passed on the command line. For some reason this uses > for_each_ref which is expensive if the repo has many refs. We should > use a prefix iterator instead. OK, makes sense to me. > Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> > --- > builtin/pack-objects.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 2a00358f34..2b32bc93bd 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) > } > cleanup_preferred_base(); > if (include_tag && nr_result) > - for_each_ref(add_ref_tag, NULL); > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); Makes sense. I was curious why it wasn't this way from the beginning in f0a24aa56e (git-pack-objects: Automatically pack annotated tags if object was packed, 2008-03-03). The patch doesn't say, but presumably it was because there wasn't any speed-up to be had iterating only a subset of references back then if they were packed (did packed refs even exist in 2008? unsure). In any case, builtin/pack-objects.c:add_ref_tag() discards anything that doesn't start with "refs/tags/", so I think your change is doing the right thing here. That said, you could shorten this to use 'for_each_tag_ref()' instead (which compiles to the exact same thing). It probably wouldn't be a bad time to drop the extra prefix check in add_ref_tag(). Thanks, Taylor
On Tue, Jan 19, 2021 at 03:33:48PM +0100, Jacob Vosmaer wrote: > In git-pack-objects, we iterate over all the tags if the --include-tag > option is passed on the command line. For some reason this uses > for_each_ref which is expensive if the repo has many refs. We should > use a prefix iterator instead. > [...] > - for_each_ref(add_ref_tag, NULL); > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); Yeah, this is trivially correct, since the first thing we check in add_ref_tag() is whether it starts with refs/tags/. It would be safe to remove that check as part of this patch (and unlike the multi-prefix one, I am comfortable assuming that for_each_fullref_in() will not return anything outside of the prefix). I also wondered if we could just use for_each_ref_in() here, but I think we are better having the full refname to feed to peel_ref(). -Peff
On Tue, Jan 19, 2021 at 06:08:31PM -0500, Taylor Blau wrote: > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > > index 2a00358f34..2b32bc93bd 100644 > > --- a/builtin/pack-objects.c > > +++ b/builtin/pack-objects.c > > @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) > > } > > cleanup_preferred_base(); > > if (include_tag && nr_result) > > - for_each_ref(add_ref_tag, NULL); > > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); > > Makes sense. I was curious why it wasn't this way from the beginning in > f0a24aa56e (git-pack-objects: Automatically pack annotated tags if > object was packed, 2008-03-03). > > The patch doesn't say, but presumably it was because there wasn't any > speed-up to be had iterating only a subset of references back then if > they were packed (did packed refs even exist in 2008? unsure). In any > case, builtin/pack-objects.c:add_ref_tag() discards anything that > doesn't start with "refs/tags/", so I think your change is doing the > right thing here. > > That said, you could shorten this to use 'for_each_tag_ref()' instead > (which compiles to the exact same thing). You'd end up with "v1.2.3" in the refname field of the callback then, rather than "refs/tags/v1.2.3". So we'd definitely need to drop the prefix check in add_ref_tag() then (though I think it is a good idea anyway). But I'm also not sure if it would interfere with peel_ref(), and its magic for using packed-refs peeled information. -Peff
On Tue, Jan 19, 2021 at 06:33:34PM -0500, Jeff King wrote: > On Tue, Jan 19, 2021 at 06:08:31PM -0500, Taylor Blau wrote: > > That said, you could shorten this to use 'for_each_tag_ref()' instead > > (which compiles to the exact same thing). > > You'd end up with "v1.2.3" in the refname field of the callback then, > rather than "refs/tags/v1.2.3". So we'd definitely need to drop the > prefix check in add_ref_tag() then (though I think it is a good idea > anyway). But I'm also not sure if it would interfere with peel_ref(), > and its magic for using packed-refs peeled information. Ack, thanks for correcting me. Looking at ref_peel_ref(), I am almost positive that calling peel_ref() on "v1.2.3" (referring to "refs/tags/v1.2.3") wouldn't work. So I think the current implementation is good. Thanks, Taylor
On Tue, Jan 19 2021, Jacob Vosmaer wrote: > In git-pack-objects, we iterate over all the tags if the --include-tag > option is passed on the command line. For some reason this uses > for_each_ref which is expensive if the repo has many refs. We should > use a prefix iterator instead. > > Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> > --- > builtin/pack-objects.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 2a00358f34..2b32bc93bd 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) > } > cleanup_preferred_base(); > if (include_tag && nr_result) > - for_each_ref(add_ref_tag, NULL); > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); > stop_progress(&progress_state); > trace2_region_leave("pack-objects", "enumerate-objects", > the_repository); Not really a new problem, but it would be nice to also have a test in t5305-include-tag.sh to check what happens with "tags" outside the refs/tags/ namespace. I.e. if this & the existing prefix in add_ref_tag() were dropped. To elaborate on other things that aren't really your problem & Taylor's E-Mail downthread we originally added this because: If new annotated tags have been introduced then we can also include them in the packfile, saving the client from needing to request them through a second connection. I've barked up this whole tag fetch tree before 97716d217c (fetch: add a --prune-tags option and fetch.pruneTags config, 2018-02-09) but really not this specific area. I wonder if longer term simply moving this to be work the client does wouldn't make more sense. I.e. if we just delete this for_each_ref() loop. We're just saving a client from saying they "want" a tag. I.e. with the whole thing removed we need this to make t5503-tagfollow.sh pass (see [1] at the end for the dialog, the tag is 3253df4d...): diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh index 6041a4dd32..1ddc430aef 100755 --- a/t/t5503-tagfollow.sh +++ b/t/t5503-tagfollow.sh @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' ' test_expect_success 'setup expect' ' cat - <<EOF >expect want $B +want $T want $S EOF ' @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' ' cd clone2 && git init && git remote add origin .. && + git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" && GIT_TRACE_PACKET=$UPATH git fetch && test $B = $(git rev-parse --verify origin/master) && test $S = $(git rev-parse --verify tag2) && We're also saving the client the work of having to go through refs/tags/* and figure out whether there are tags there that aren't on our main history. I think that since f0a24aa56e (git-pack-objects: Automatically pack annotated tags if object was packed, 2008-03-03) was written it's become clear that in the wild that's almost nobody's use-case. I.e. people with a set of refs/heads/* branches and a refs/tags/* set that doesn't refer to those branches. Or if they do, I think it's such an obscure use-case that if we were designing this today we could pass that work of figuring out if there's such tags in refs/tags/* off to the client. It seems we might just be able to delete this code on the server-side, per protocol-capabilities.txt: Clients MUST be prepared for the case where a server has ignored include-tag and has not actually sent tags in the pack. In such cases the client SHOULD issue a subsequent fetch to acquire the tags that include-tag would have otherwise given the client. I.e. in the case where the server isn't playing along and I haven't set "+refs/tags/*:refs/tags/*". But as the test shows we don't do that following ourselves unless refs/tags/* is in the refspec (and then it's not really "following", we're just getting all the tags). 1. From t5503-tagfollow.sh: $ grep -C5 3253df4d1cf6fb138b52b1938473bcfec1483223 UPLOAD_LOG packet: upload-pack< ref-prefix refs/tags/ packet: upload-pack< ref-prefix refs/tags/ packet: upload-pack< 0000 packet: upload-pack> 8e10cf4e007ad7e003463c30c34b1050b039db78 refs/heads/master packet: fetch< 8e10cf4e007ad7e003463c30c34b1050b039db78 refs/heads/master packet: upload-pack> 3253df4d1cf6fb138b52b1938473bcfec1483223 refs/tags/tag1 peeled:c06aaaf9c64f9ef1f209bcb6d7bdba68a5e27fab packet: fetch< 3253df4d1cf6fb138b52b1938473bcfec1483223 refs/tags/tag1 peeled:c06aaaf9c64f9ef1f209bcb6d7bdba68a5e27fab packet: upload-pack> ddfa4a33562179aca1ace2bcc662244a17d0b503 refs/tags/tag2 peeled:8e10cf4e007ad7e003463c30c34b1050b039db78 packet: upload-pack> 0000 packet: fetch< ddfa4a33562179aca1ace2bcc662244a17d0b503 refs/tags/tag2 peeled:8e10cf4e007ad7e003463c30c34b1050b039db78 packet: fetch< 0000 packet: fetch> command=fetch -- packet: fetch> 0001 packet: fetch> thin-pack packet: fetch> include-tag packet: fetch> ofs-delta packet: fetch> want 8e10cf4e007ad7e003463c30c34b1050b039db78 packet: fetch> want 3253df4d1cf6fb138b52b1938473bcfec1483223 packet: fetch> want ddfa4a33562179aca1ace2bcc662244a17d0b503 packet: fetch> done packet: fetch> 0000 packet: upload-pack< object-format=sha1 packet: upload-pack< command=fetch -- packet: upload-pack< 0001 packet: upload-pack< thin-pack packet: upload-pack< include-tag packet: upload-pack< ofs-delta packet: upload-pack< want 8e10cf4e007ad7e003463c30c34b1050b039db78 packet: upload-pack< want 3253df4d1cf6fb138b52b1938473bcfec1483223 packet: upload-pack< want ddfa4a33562179aca1ace2bcc662244a17d0b503 packet: upload-pack< done packet: upload-pack< 0000 packet: upload-pack> packfile packet: fetch< packfile
On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: > It seems we might just be able to delete this code on the server-side, > per protocol-capabilities.txt: > > Clients MUST be prepared for the case where a server has ignored > include-tag and has not actually sent tags in the pack. In such > cases the client SHOULD issue a subsequent fetch to acquire the tags > that include-tag would have otherwise given the client. > > I.e. in the case where the server isn't playing along and I haven't set > "+refs/tags/*:refs/tags/*". But as the test shows we don't do that > following ourselves unless refs/tags/* is in the refspec (and then it's > not really "following", we're just getting all the tags). Reading your email, I see no reason not to do it, and that snippet from protocol-capabilities.txt makes me feel even better about doing so. I'd be happy to have Jacob's patch picked up in the meantime, but I think that this is a good direction to pursue. Thanks, Taylor
On Wed, Jan 20 2021, Taylor Blau wrote: > On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: >> It seems we might just be able to delete this code on the server-side, >> per protocol-capabilities.txt: >> >> Clients MUST be prepared for the case where a server has ignored >> include-tag and has not actually sent tags in the pack. In such >> cases the client SHOULD issue a subsequent fetch to acquire the tags >> that include-tag would have otherwise given the client. >> >> I.e. in the case where the server isn't playing along and I haven't set >> "+refs/tags/*:refs/tags/*". But as the test shows we don't do that >> following ourselves unless refs/tags/* is in the refspec (and then it's >> not really "following", we're just getting all the tags). > > Reading your email, I see no reason not to do it, and that snippet from > protocol-capabilities.txt makes me feel even better about doing so. > > I'd be happy to have Jacob's patch picked up in the meantime, but I > think that this is a good direction to pursue. Oh yes, none of this is commentary on not accepting the much smaller and obviously correct change in the meantime. Just musings this general area of fetching & ideas for further optimization.
On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: > To elaborate on other things that aren't really your problem & Taylor's > E-Mail downthread we originally added this because: > > If new annotated tags have been introduced then we can also include > them in the packfile, saving the client from needing to request them > through a second connection. > > I've barked up this whole tag fetch tree before 97716d217c (fetch: add a > --prune-tags option and fetch.pruneTags config, 2018-02-09) but really > not this specific area. > > I wonder if longer term simply moving this to be work the client does > wouldn't make more sense. I.e. if we just delete this for_each_ref() > loop. > > We're just saving a client from saying they "want" a tag. I.e. with the > whole thing removed we need this to make t5503-tagfollow.sh pass (see > [1] at the end for the dialog, the tag is 3253df4d...): Isn't this an ordering problem, though? The client cannot mention auto-followed tags in a "want", because they first need to "want" the main history, receive the pack, and then realize they want the others. So we are not just saving the client from sending a "want", but making a second full connection to do so. That seems to be an optimization worth continuing to have. But here... > diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh > index 6041a4dd32..1ddc430aef 100755 > --- a/t/t5503-tagfollow.sh > +++ b/t/t5503-tagfollow.sh > @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' ' > test_expect_success 'setup expect' ' > cat - <<EOF >expect > want $B > +want $T > want $S > EOF > ' > @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' ' > cd clone2 && > git init && > git remote add origin .. && > + git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" && > GIT_TRACE_PACKET=$UPATH git fetch && > test $B = $(git rev-parse --verify origin/master) && > test $S = $(git rev-parse --verify tag2) && > > We're also saving the client the work of having to go through > refs/tags/* and figure out whether there are tags there that aren't on > our main history. You seem to be against auto-following at all. And certainly I can see an argument that it is not worth the trouble it causes. But it is the default behavior, and I suspect many people are relying on it. Fetching every tag indiscriminately is going to grab a bunch of extra unwanted objects in some repos. An obvious case is any time "clone --single-branch --depth" is used. Maybe I'm not quite understanding what you're proposing. -Peff
On Wed, Jan 20 2021, Jeff King wrote: > On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> To elaborate on other things that aren't really your problem & Taylor's >> E-Mail downthread we originally added this because: >> >> If new annotated tags have been introduced then we can also include >> them in the packfile, saving the client from needing to request them >> through a second connection. >> >> I've barked up this whole tag fetch tree before 97716d217c (fetch: add a >> --prune-tags option and fetch.pruneTags config, 2018-02-09) but really >> not this specific area. >> >> I wonder if longer term simply moving this to be work the client does >> wouldn't make more sense. I.e. if we just delete this for_each_ref() >> loop. >> >> We're just saving a client from saying they "want" a tag. I.e. with the >> whole thing removed we need this to make t5503-tagfollow.sh pass (see >> [1] at the end for the dialog, the tag is 3253df4d...): > > Isn't this an ordering problem, though? The client cannot mention > auto-followed tags in a "want", because they first need to "want" the > main history, receive the pack, and then realize they want the others. > > So we are not just saving the client from sending a "want", but making a > second full connection to do so. That seems to be an optimization worth > continuing to have. Correct. Suppose a history COMMIT(TAG) history like: mainline: A(X) -> B(Y) -> C topic: A(X) -> B(Y) -> D(Z) You only want the "mainline" history and its tags in your fetch. Now a server will give you tags X & Y for free, ignoring Z. The note in protocol-capabilities.txt supposes that you'll need to only get A/B/C in one fetch, then do another one where you see from the OIDs you have and advertised (peeled) OIDs for the tags and know you just need X/Y, not Z. So we could also just fetch Z, then do our own walk on the client to see if it's reachable, and throw it away if not. Although I suppose that'll need a list of "bad" tags on the client so we don't repeat the process the next time around, hrm... >> diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh >> index 6041a4dd32..1ddc430aef 100755 >> --- a/t/t5503-tagfollow.sh >> +++ b/t/t5503-tagfollow.sh >> @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' ' >> test_expect_success 'setup expect' ' >> cat - <<EOF >expect >> want $B >> +want $T >> want $S >> EOF >> ' >> @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' ' >> cd clone2 && >> git init && >> git remote add origin .. && >> + git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" && >> GIT_TRACE_PACKET=$UPATH git fetch && >> test $B = $(git rev-parse --verify origin/master) && >> test $S = $(git rev-parse --verify tag2) && >> >> We're also saving the client the work of having to go through >> refs/tags/* and figure out whether there are tags there that aren't on >> our main history. > > You seem to be against auto-following at all. And certainly I can see an > argument that it is not worth the trouble it causes. But it is the > default behavior, and I suspect many people are relying on it. Fetching > every tag indiscriminately is going to grab a bunch of extra unwanted > objects in some repos. An obvious case is any time "clone > --single-branch --depth" is used. I wonder if the use-cases for --depth in the wild aren't 99.9% --depth=1, in which case the peeled commit on the main branch being advertised by a tag would alredy give you this information. I.e. in the common case you don't get a tag, but if you happen to land on a release you can see that the commit at the tip is tagged. I wonder if doing that shortcut already in the client (so not send include-tag) is a useful micro-optimization. > Maybe I'm not quite understanding what you're proposing. Not much really, just that looking deeper into this area might be a useful exercise. I.e. it's a case of server<->client cooperation where we offlod the work to one or the other based on an old design decision, maybe that trade-off is not optimal in most cases anymore.
On Thu, Jan 21, 2021 at 11:26:13AM +0100, Ævar Arnfjörð Bjarmason wrote: > > So we are not just saving the client from sending a "want", but making a > > second full connection to do so. That seems to be an optimization worth > > continuing to have. > > Correct. Suppose a history COMMIT(TAG) history like: > > mainline: A(X) -> B(Y) -> C > topic: A(X) -> B(Y) -> D(Z) > > You only want the "mainline" history and its tags in your fetch. Now a > server will give you tags X & Y for free, ignoring Z. > > The note in protocol-capabilities.txt supposes that you'll need to only > get A/B/C in one fetch, then do another one where you see from the OIDs > you have and advertised (peeled) OIDs for the tags and know you just > need X/Y, not Z. Right. That note basically indicates that what the server is doing is "best effort". If it sends you B, it will also send you Y. But it is ultimately up to the client to decide if they want or need a tag the server didn't send and do the follow-up. So include-tag is an optimization, but it works often enough that it frequently saves the second fetch from happening at all. > So we could also just fetch Z, then do our own walk on the client to see > if it's reachable, and throw it away if not. Although I suppose that'll > need a list of "bad" tags on the client so we don't repeat the process > the next time around, hrm... Not only that, but now you've fetched "D" that you never wanted. If it's one commit, that's not so bad. But it could be pulling in arbitrary amounts of reachable history that you don't want. > > You seem to be against auto-following at all. And certainly I can see an > > argument that it is not worth the trouble it causes. But it is the > > default behavior, and I suspect many people are relying on it. Fetching > > every tag indiscriminately is going to grab a bunch of extra unwanted > > objects in some repos. An obvious case is any time "clone > > --single-branch --depth" is used. > > I wonder if the use-cases for --depth in the wild aren't 99.9% > --depth=1, in which case the peeled commit on the main branch being > advertised by a tag would alredy give you this information. I'd have thought there was basically no use for anything but --depth=1, but people seem really enamored with the notion of --depth=50. Many CI systems use that for some reason. --depth is just one case where you might not have all of history, though. Even without it, --single-branch means you wouldn't want to get the history of all of the other branches. They may only be a few commits ahead of the main history, in which case fetching extra tags that point to them might not be a big deal. But it really depends on how your repo is shaped, what tags point to, etc. Even without any options, your repo may have a disjoint chunk of history mentioned by a tag (e.g., a tag pointing to a segment of history that is meant to be grafted on, or even an archived version of history from before filter-branch rewrite). Perhaps if we were designing from scratch we might write those off as unusual cases we don't need to care about. But I'd be very hesitant to change the way tag following works now, after so many years. If we were to make any change, I think one plausible one would be to have clone set up refspecs to fetch all tags (and then avoid sending include-tag, since we know we're not auto-following). That would let people still use auto-follow when they wanted to, but make things simpler to reason about. And after all, clone already fetches all of the tags. But I haven't thought too hard about it. > > Maybe I'm not quite understanding what you're proposing. > > Not much really, just that looking deeper into this area might be a > useful exercise. I.e. it's a case of server<->client cooperation where > we offlod the work to one or the other based on an old design decision, > maybe that trade-off is not optimal in most cases anymore. I don't think it's very much work on the server, though. Sure, iterating the tags is O(nr_tags). But we do that iteration lots of other places, too (the advertisement unless the client asks for a very narrow prefix, or upload-pack's ref-marking for reachability). And saving us from sending unwanted objects is a potential win on the server (even a single unlucky delta where we have to reconstruct and re-deflate an object will likely blow a ref iteration out of the water). Likewise, preventing the client from connecting a second time is a win for the server, which doesn't have to spin up a new upload-pack (likewise for the client of course; it might even have to prompt the user for auth again!). -Peff
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 2a00358f34..2b32bc93bd 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) } cleanup_preferred_base(); if (include_tag && nr_result) - for_each_ref(add_ref_tag, NULL); + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); stop_progress(&progress_state); trace2_region_leave("pack-objects", "enumerate-objects", the_repository);
In git-pack-objects, we iterate over all the tags if the --include-tag option is passed on the command line. For some reason this uses for_each_ref which is expensive if the repo has many refs. We should use a prefix iterator instead. Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> --- builtin/pack-objects.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)