Message ID | 77035a0f-c42e-5cb3-f422-03fe81093adb@roku.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mktree: learn about promised objects | expand |
On 6/14/2022 9:36 AM, Richard Oliver wrote: > Do not use oid_object_info() to determine object type in mktree_line() > as this can cause promised objects to be dynamically faulted-in one at a > time which has poor performance. Instead, use a combination of > oid_object_info_extended() (with OBJECT_INFO_SKIP_FETCH_OBJECT option), > and the newly introduced promisor_object_type() to determine object type > before defaulting to fetch from remote. Have you run some performance tests on this? It seems like this will scan all packed objects, which is probably much slower than just asking the remote for the object in most cases. Thanks, -Stolee
On 14/06/2022 15:14, Derrick Stolee wrote: > On 6/14/2022 9:36 AM, Richard Oliver wrote: >> Do not use oid_object_info() to determine object type in mktree_line() >> as this can cause promised objects to be dynamically faulted-in one at a >> time which has poor performance. Instead, use a combination of >> oid_object_info_extended() (with OBJECT_INFO_SKIP_FETCH_OBJECT option), >> and the newly introduced promisor_object_type() to determine object type >> before defaulting to fetch from remote. > > Have you run some performance tests on this? It seems like this will > scan all packed objects, which is probably much slower than just asking > the remote for the object in most cases. > > Thanks, > -Stolee Hi Stolee, I've put together a synthetic experiment below (adding a new blob to anexisting tree) to show you the behaviour that we've been seeing. Our actual use-case (where we first encountered this behaviour) is updating submodules to a known hash. As you can see, the round-trip time of fetching objects one-by-one is very expensive. Before, using system git (git version 2.32.0 (Apple Git-132)): > $ git init > # Fetch a recent tree > $ git fetch --filter=tree:0 --depth 1 https://github.com/git/git cdb48006b0ec7fe19794daf7b5363ab42d9d9371 > remote: Enumerating objects: 1, done. > remote: Counting objects: 100% (1/1), done. > remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 > Receiving objects: 100% (1/1), 13.12 KiB | 13.12 MiB/s, done. > From https://github.com/git/git > * branch cdb48006b0ec7fe19794daf7b5363ab42d9d9371 -> FETCH_HEAD > > $ NEW_BLOB=$(echo zzz | git hash-object --stdin -w) > > $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree > remote: Enumerating objects: 1, done. > remote: Counting objects: 100% (1/1), done. > remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 > Receiving objects: 100% (1/1), 334 bytes | 334.00 KiB/s, done. > remote: Enumerating objects: 1, done. > remote: Counting objects: 100% (1/1), done. > remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 > Receiving objects: 100% (1/1), 2.01 KiB | 2.01 MiB/s, done. > remote: Enumerating objects: 1, done. > remote: Counting objects: 100% (1/1), done. > remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 > Receiving objects: 100% (1/1), 256 bytes | 256.00 KiB/s, done. > # ... > # SOME TIME LATER > # ... > e26c7ce7357b1649da7b4200d4e80d0b668db8d4 > 286.49 real 15.66 user 15.59 sys Repeated experiment, but using modified mktree: > $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree > e26c7ce7357b1649da7b4200d4e80d0b668db8d4 > 0.06 real 0.01 user 0.03 sys Did you have any other sort of performance test in mind? The remotes we typically deal with are geographically far away and deal with a high volume of traffic so we're keen to move behaviour to the client where it makes sense to do so. Thanks, Richard
On 6/14/2022 12:33 PM, Richard Oliver wrote: > On 14/06/2022 15:14, Derrick Stolee wrote: >> On 6/14/2022 9:36 AM, Richard Oliver wrote: >>> Do not use oid_object_info() to determine object type in mktree_line() >>> as this can cause promised objects to be dynamically faulted-in one at a >>> time which has poor performance. Instead, use a combination of >>> oid_object_info_extended() (with OBJECT_INFO_SKIP_FETCH_OBJECT option), >>> and the newly introduced promisor_object_type() to determine object type >>> before defaulting to fetch from remote. >> >> Have you run some performance tests on this? It seems like this will >> scan all packed objects, which is probably much slower than just asking >> the remote for the object in most cases. >> >> Thanks, >> -Stolee > > > Hi Stolee, > > I've put together a synthetic experiment below (adding a new blob to anexisting tree) to show you the behaviour that we've been seeing. Our > actual use-case (where we first encountered this behaviour) is updating > submodules to a known hash. As you can see, the round-trip time of fetching > objects one-by-one is very expensive. > > Before, using system git (git version 2.32.0 (Apple Git-132)): > >> $ git init >> # Fetch a recent tree >> $ git fetch --filter=tree:0 --depth 1 https://github.com/git/git cdb48006b0ec7fe19794daf7b5363ab42d9d9371 >> remote: Enumerating objects: 1, done. >> remote: Counting objects: 100% (1/1), done. >> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 >> Receiving objects: 100% (1/1), 13.12 KiB | 13.12 MiB/s, done. >> From https://github.com/git/git >> * branch cdb48006b0ec7fe19794daf7b5363ab42d9d9371 -> FETCH_HEAD >> >> $ NEW_BLOB=$(echo zzz | git hash-object --stdin -w) >> >> $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree >> remote: Enumerating objects: 1, done. >> remote: Counting objects: 100% (1/1), done. >> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 >> Receiving objects: 100% (1/1), 334 bytes | 334.00 KiB/s, done. >> remote: Enumerating objects: 1, done. >> remote: Counting objects: 100% (1/1), done. >> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 >> Receiving objects: 100% (1/1), 2.01 KiB | 2.01 MiB/s, done. >> remote: Enumerating objects: 1, done. >> remote: Counting objects: 100% (1/1), done. >> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 >> Receiving objects: 100% (1/1), 256 bytes | 256.00 KiB/s, done. >> # ... >> # SOME TIME LATER >> # ... >> e26c7ce7357b1649da7b4200d4e80d0b668db8d4 >> 286.49 real 15.66 user 15.59 sys I see. The problem here is that we are making _many requests_ for the same tree, so maybe it would be better to introduce a batched download for the list of missing objects. This would require iterating over the objects for the tree to check existence (in quick mode) and adding the missing ones in a list, then requesting that set altogether in a single request. That probably won't be as fast as your modified mktree experiment below, but would match my expectations of "probably faster assuming the repo is big enough". > Repeated experiment, but using modified mktree: > >> $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree >> e26c7ce7357b1649da7b4200d4e80d0b668db8d4 >> 0.06 real 0.01 user 0.03 sys > > Did you have any other sort of performance test in mind? The remotes we > typically deal with are geographically far away and deal with a high volume > of traffic so we're keen to move behaviour to the client where it makes sense > to do so. I guess I wonder how large your promisor pack-files are in this test, since your implementation depends on for_each_packed_object(), which should be really inefficient if you're actually dealing with a large partial clone. Thanks, -Stolee
On Tue, Jun 14, 2022 at 01:27:18PM -0400, Derrick Stolee wrote: > > Did you have any other sort of performance test in mind? The remotes we > > typically deal with are geographically far away and deal with a high volume > > of traffic so we're keen to move behaviour to the client where it makes sense > > to do so. > > I guess I wonder how large your promisor pack-files are in this test, > since your implementation depends on for_each_packed_object(), which > should be really inefficient if you're actually dealing with a large > partial clone. I had the same thought. Storing data available in the promisor packs into an oid_map is going to be expensive if there are many such objects. Is there a reason that we can't introduce a variant of find_kept_pack_entry() that deals only with .promisor packs and look these things up as-needed? Thanks, Taylor
On Tue, Jun 14, 2022 at 08:35:16PM -0400, Taylor Blau wrote: > On Tue, Jun 14, 2022 at 01:27:18PM -0400, Derrick Stolee wrote: > > > Did you have any other sort of performance test in mind? The remotes we > > > typically deal with are geographically far away and deal with a high volume > > > of traffic so we're keen to move behaviour to the client where it makes sense > > > to do so. > > > > I guess I wonder how large your promisor pack-files are in this test, > > since your implementation depends on for_each_packed_object(), which > > should be really inefficient if you're actually dealing with a large > > partial clone. > > I had the same thought. Storing data available in the promisor packs > into an oid_map is going to be expensive if there are many such objects. > > Is there a reason that we can't introduce a variant of > find_kept_pack_entry() that deals only with .promisor packs and look > these things up as-needed? It's much worse than that. The promisor mechanism is fundamentally very inefficient in runtime, optimizing instead for size. Imagine I have a partial clone and I retrieve tree X, which points to a blob Y that I don't get. I have X in a promisor pack, and asking about it is efficient. But if I want to know about Y, I have no data structure mentioning Y except the tree X itself. So to enumerate all of the promisor edges, I have to walk all of the trees in the promisor pack. So it is not just lookup, but actual tree walking that is expensive. The flip side is that you don't have to store a complete separate list of the promised objects. Whether that's a win depends on how many local objects you have, versus how many are promised. But it would be possible to cache the promisor list to make the tradeoff separately. E.g., do the walk over the promisor trees once (perhaps at pack creation time), and store a sorted list of fixed-length (oid, type) records that could be binary searched. You could even put it in the .promisor file. :) -Peff
On 15/06/2022 05:00, Jeff King wrote: > On Tue, Jun 14, 2022 at 08:35:16PM -0400, Taylor Blau wrote: > >> On Tue, Jun 14, 2022 at 01:27:18PM -0400, Derrick Stolee wrote: >>>> Did you have any other sort of performance test in mind? The remotes we >>>> typically deal with are geographically far away and deal with a high volume >>>> of traffic so we're keen to move behaviour to the client where it makes sense >>>> to do so. >>> >>> I guess I wonder how large your promisor pack-files are in this test, >>> since your implementation depends on for_each_packed_object(), which >>> should be really inefficient if you're actually dealing with a large >>> partial clone. >> >> I had the same thought. Storing data available in the promisor packs >> into an oid_map is going to be expensive if there are many such objects. >> >> Is there a reason that we can't introduce a variant of >> find_kept_pack_entry() that deals only with .promisor packs and look >> these things up as-needed? > > It's much worse than that. The promisor mechanism is fundamentally very > inefficient in runtime, optimizing instead for size. Imagine I have a > partial clone and I retrieve tree X, which points to a blob Y that I > don't get. I have X in a promisor pack, and asking about it is > efficient. But if I want to know about Y, I have no data structure > mentioning Y except the tree X itself. So to enumerate all of the > promisor edges, I have to walk all of the trees in the promisor pack. > > So it is not just lookup, but actual tree walking that is expensive. The > flip side is that you don't have to store a complete separate list of > the promised objects. Whether that's a win depends on how many local > objects you have, versus how many are promised. > > But it would be possible to cache the promisor list to make the tradeoff > separately. E.g., do the walk over the promisor trees once (perhaps at > pack creation time), and store a sorted list of fixed-length (oid, type) > records that could be binary searched. You could even put it in the > .promisor file. :) > > -Peff I like the idea of caching the promisor list at pack creation time; I'll start work on a patch set that implements this. Meanwhile, is it worth considering a '--promised-as-missing' option (or a config option) for invocations such as 'mktree --missing' that prevents promised objects being faulted-in? Currently, the only reliable way that I've found to prevent 'mktree --missing' faulting-in promised objects is to remove the remote. Such an option could either set the global variable 'fetch_if_missing' to '0' or could ensure 'OBJECT_INFO_SKIP_FETCH_OBJECT' is passed appropriately. Cheers, Richard
On 6/15/2022 1:40 PM, Richard Oliver wrote: > On 15/06/2022 05:00, Jeff King wrote: >> So it is not just lookup, but actual tree walking that is expensive. The >> flip side is that you don't have to store a complete separate list of >> the promised objects. Whether that's a win depends on how many local >> objects you have, versus how many are promised. This is also why blobless (or blob-size filters) are the recommended way to use partial clone. It's just too expensive to have tree misses. >> But it would be possible to cache the promisor list to make the tradeoff >> separately. E.g., do the walk over the promisor trees once (perhaps at >> pack creation time), and store a sorted list of fixed-length (oid, type) >> records that could be binary searched. You could even put it in the >> .promisor file. :) >> >> -Peff > > I like the idea of caching the promisor list at pack creation time; > I'll start work on a patch set that implements this. > > Meanwhile, is it worth considering a '--promised-as-missing' option > (or a config option) for invocations such as 'mktree --missing' that > prevents promised objects being faulted-in? Currently, the only > reliable way that I've found to prevent 'mktree --missing' faulting-in > promised objects is to remove the remote. Such an option could either > set the global variable 'fetch_if_missing' to '0' or could ensure > 'OBJECT_INFO_SKIP_FETCH_OBJECT' is passed appropriately. One issue I've had with the current design of partial clone is that we put all of the filter logic into the remotes, not the repository itself. This means that if I use "git remote add" to add a new remote, then that remote does not inherit the filter (and hence fetches can be too large or even fail because we are "missing" an object pointed to by the resulting pack). If we had a repository-scoped notion of the filter, then we could special-case this mktree logic to assume a blob if the filter only excludes blobs. That would be even faster than looking up the type from a "promised objects" file. Just thinking about smaller steps that could simplify things before adding a new data format. Feel free to continue pursuing that lookup file if that's what you think will work best. Thanks, -Stolee
Richard Oliver <roliver@roku.com> writes: > Meanwhile, is it worth considering a '--promised-as-missing' option > (or a config option) for invocations such as 'mktree --missing' that > prevents promised objects being faulted-in? Currently, the only > reliable way that I've found to prevent 'mktree --missing' faulting-in > promised objects is to remove the remote. Such an option could either > set the global variable 'fetch_if_missing' to '0' or could ensure > 'OBJECT_INFO_SKIP_FETCH_OBJECT' is passed appropriately. I didn't spend too much time on thinking this one through myself, but do we really need a separte option? As far as I remember, I wrote the original behaviour implemented in 1c64e79a (mktree --missing: allow missing objects, 2009-05-10) when the command is run without --missing primarily to be extra paranoid to detect broken repository with every chance we have to avoid spreading existing breakage to new objects we create. With the input "mktree" gets from the end user, we have no need to learn anything from existing objects in order to create the tree object at all. The original 1c64e79a even carefully made sure that with --missing in effect, we do not ask the object store about an object: /* It is perfectly normal if we do not have a commit from a submodule */ if (S_ISGITLINK(mode)) allow_missing = 1; if (!allow_missing) type = sha1_object_info(sha1, NULL); else type = object_type(mode); if (type < 0) die("object %s unavailable", sha1_to_hex(sha1)); We by grave mistake at 31c8221a (mktree: validate entry type in input, 2009-05-14) started insisting on inspecting objects even when allow-mising was given. I do not think it was sensible, given why we had "--missing" as an option to allow users to say "you do not have to be too paranoid". The codebase is so distant but I think we should probably do a moral revert/reconstruct of that commit so that the extra paranoia of the said commit applies only when "--missing" is not in effect, or something like that.
On Wed, Jun 15, 2022 at 02:01:33PM -0700, Junio C Hamano wrote: > Richard Oliver <roliver@roku.com> writes: > > > Meanwhile, is it worth considering a '--promised-as-missing' option > > (or a config option) for invocations such as 'mktree --missing' that > > prevents promised objects being faulted-in? Currently, the only > > reliable way that I've found to prevent 'mktree --missing' faulting-in > > promised objects is to remove the remote. Such an option could either > > set the global variable 'fetch_if_missing' to '0' or could ensure > > 'OBJECT_INFO_SKIP_FETCH_OBJECT' is passed appropriately. > > I didn't spend too much time on thinking this one through myself, > but do we really need a separte option? > [...] > We by grave mistake at 31c8221a (mktree: validate entry type in > input, 2009-05-14) started insisting on inspecting objects even when > allow-mising was given. I do not think it was sensible, given why > we had "--missing" as an option to allow users to say "you do not > have to be too paranoid". > > The codebase is so distant but I think we should probably do a moral > revert/reconstruct of that commit so that the extra paranoia of the > said commit applies only when "--missing" is not in effect, or > something like that. FWIW, I had the same reaction. I think fixing "--missing" should be the first step, and would unstick Richard's use case, as I understand it. There is some value to improving the promisor case, since using "--missing" is a pretty broad stroke (i.e., you'd fail to actual corruption of missing non-promisor objects). That could either be checking the promisor object and type without faulting it in, or just skipping the type-check for objects after confirming that they're promisors. But that can come on top, I think. The use case there is also a bit narrower. The local repository does not know about all promised objects. It can only see the boundaries of objects it doesn't have (so with --filter=tree:0, for example, a partial clone of a repo with path "a/b/c" would know about "b" but not "c"). So in the most general case you'd still have to resort to "--missing", but I suspect in practice you'd always feed things at that boundary to mktree (otherwise, how would you even know the oid of "c"). -Peff
On Wed, Jun 15, 2022 at 02:17:58PM -0400, Derrick Stolee wrote: > On 6/15/2022 1:40 PM, Richard Oliver wrote: > > On 15/06/2022 05:00, Jeff King wrote: > > >> So it is not just lookup, but actual tree walking that is expensive. The > >> flip side is that you don't have to store a complete separate list of > >> the promised objects. Whether that's a win depends on how many local > >> objects you have, versus how many are promised. > > This is also why blobless (or blob-size filters) are the recommended way > to use partial clone. It's just too expensive to have tree misses. I agree that tree misses are awful, but I'm actually talking about something different: traversing the local trees we _do_ have in order to find the set of promised objects. Which is worse for blob:none, because it means you have more trees locally. :) Try this with a big repo like linux.git: git clone --no-local --filter=blob:none linux.git repo cd repo # this is fast; we mark the promisor trees as UNINTERESTING, so we do # not look at them as part of the traversal, and never call # is_promisor_object(). time git rev-list --count --objects --all --exclude-promisor-objects # but imagine we had a fixed mktree[1] that did not fault in the blobs # unnecessarily, and we made a new tree that references a promised # blob. tree=$(git ls-tree HEAD~1000 | grep Makefile | git mktree --missing) commit=$(echo foo | git commit-tree -p HEAD $tree) git update-ref refs/heads/foo $commit # this is now slow; even though we only call is_promisor_object() # once, we have to open every single tree in the pack to find it! time git rev-list --count --objects --all --exclude-promisor-objects Those rev-lists run in 1.7s and 224s respectively. Ouch! Now the setup there is kind of contrived, and it's actually not trivial to convince rev-list to actually call is_promisor_object() these days. That's because promisor-ness (promisosity?) tends to be one-way transitive. A promised blob is usually either only referenced by promised trees (which we have in this case), or is faulted in as part of making a new tree. But it's not guaranteed, and certainly a faulted-in object could later be deleted from the local repo, since it's promised. I suspect there are less convoluted workflows to get to a similar state, but I don't know them offhand. There's more discussion of this issue in this thread from last summer: https://lore.kernel.org/git/20210403090412.GH2271@szeder.dev/ -Peff [1] The mktree I used was the fix discussed elsewhere in the thread: diff --git a/builtin/mktree.c b/builtin/mktree.c index 902edba6d2..42ae82230c 100644 --- a/builtin/mktree.c +++ b/builtin/mktree.c @@ -117,15 +117,11 @@ static void mktree_line(char *buf, int nul_term_line, int allow_missing) } /* Check the type of object identified by sha1 */ - obj_type = oid_object_info(the_repository, &oid, NULL); - if (obj_type < 0) { - if (allow_missing) { - ; /* no problem - missing objects are presumed to be of the right type */ - } else { + if (!allow_missing) { + obj_type = oid_object_info(the_repository, &oid, NULL); + if (obj_type < 0) die("entry '%s' object %s is unavailable", path, oid_to_hex(&oid)); - } - } else { - if (obj_type != mode_type) { + else if (obj_type != mode_type) { /* * The object exists but is of the wrong type. * This is a problem regardless of allow_missing
On 6/16/2022 2:07 AM, Jeff King wrote: > On Wed, Jun 15, 2022 at 02:17:58PM -0400, Derrick Stolee wrote: > >> On 6/15/2022 1:40 PM, Richard Oliver wrote: >>> On 15/06/2022 05:00, Jeff King wrote: >> >>>> So it is not just lookup, but actual tree walking that is expensive. The >>>> flip side is that you don't have to store a complete separate list of >>>> the promised objects. Whether that's a win depends on how many local >>>> objects you have, versus how many are promised. >> >> This is also why blobless (or blob-size filters) are the recommended way >> to use partial clone. It's just too expensive to have tree misses. > > I agree that tree misses are awful, but I'm actually talking about > something different: traversing the local trees we _do_ have in order to > find the set of promised objects. Which is worse for blob:none, because > it means you have more trees locally. :) Ah, I misread your email. I agree that walking trees is far too expensive to do just to find an object type. > Try this with a big repo like linux.git: > > git clone --no-local --filter=blob:none linux.git repo > cd repo > > # this is fast; we mark the promisor trees as UNINTERESTING, so we do > # not look at them as part of the traversal, and never call > # is_promisor_object(). > time git rev-list --count --objects --all --exclude-promisor-objects > > # but imagine we had a fixed mktree[1] that did not fault in the blobs > # unnecessarily, and we made a new tree that references a promised > # blob. > tree=$(git ls-tree HEAD~1000 | grep Makefile | git mktree --missing) > commit=$(echo foo | git commit-tree -p HEAD $tree) > git update-ref refs/heads/foo $commit > > # this is now slow; even though we only call is_promisor_object() > # once, we have to open every single tree in the pack to find it! > time git rev-list --count --objects --all --exclude-promisor-objects > > Those rev-lists run in 1.7s and 224s respectively. Ouch! This is exactly the reason I thought just asking for the objects directly is faster than scanning all the packs. Thanks for giving concrete numbers that support that assumption. Thanks, -Stolee
diff --git a/builtin/mktree.c b/builtin/mktree.c index 902edba6d2..a783767b28 100644 --- a/builtin/mktree.c +++ b/builtin/mktree.c @@ -8,6 +8,7 @@ #include "tree.h" #include "parse-options.h" #include "object-store.h" +#include "packfile.h" static struct treeent { unsigned mode; @@ -116,8 +117,22 @@ static void mktree_line(char *buf, int nul_term_line, int allow_missing) path, ptr, type_name(mode_type)); } - /* Check the type of object identified by sha1 */ - obj_type = oid_object_info(the_repository, &oid, NULL); + /* Check the type of object identified by oid without fetching objects */ + struct object_info oi = OBJECT_INFO_INIT; + oi.typep = &obj_type; + if (oid_object_info_extended(the_repository, &oid, &oi, + OBJECT_INFO_LOOKUP_REPLACE | + OBJECT_INFO_SKIP_FETCH_OBJECT) < 0) + obj_type = -1; + + /* Has the object been promised to us if we can't find it */ + if (obj_type < 0) + obj_type = promisor_object_type(&oid); + + /* Try to fetch the object from the remote */ + if (obj_type < 0) + obj_type = oid_object_info(the_repository, &oid, NULL); + if (obj_type < 0) { if (allow_missing) { ; /* no problem - missing objects are presumed to be of the right type */ diff --git a/packfile.c b/packfile.c index 8e812a84a3..d27cb106c1 100644 --- a/packfile.c +++ b/packfile.c @@ -2223,17 +2223,26 @@ int for_each_packed_object(each_packed_object_fn cb, void *data, return r ? r : pack_errors; } +static void map_put_type(kh_oid_pos_t *map, + const struct object_id *oid, + enum object_type type) +{ + int hash_ret; + khiter_t pos = kh_put_oid_pos(map, *oid, &hash_ret); + kh_value(map, pos) = type; +} + static int add_promisor_object(const struct object_id *oid, struct packed_git *pack, uint32_t pos, - void *set_) + void *map_) { - struct oidset *set = set_; + kh_oid_pos_t *map = map_; struct object *obj = parse_object(the_repository, oid); if (!obj) return 1; - oidset_insert(set, oid); + map_put_type(map, oid, obj->type); /* * If this is a tree, commit, or tag, the objects it refers @@ -2250,34 +2259,46 @@ static int add_promisor_object(const struct object_id *oid, */ return 0; while (tree_entry_gently(&desc, &entry)) - oidset_insert(set, &entry.oid); + map_put_type(map, + &entry.oid, + object_type(entry.mode)); free_tree_buffer(tree); } else if (obj->type == OBJ_COMMIT) { struct commit *commit = (struct commit *) obj; struct commit_list *parents = commit->parents; - oidset_insert(set, get_commit_tree_oid(commit)); + map_put_type(map, get_commit_tree_oid(commit), OBJ_TREE); for (; parents; parents = parents->next) - oidset_insert(set, &parents->item->object.oid); + map_put_type(map, + &parents->item->object.oid, + OBJ_COMMIT); } else if (obj->type == OBJ_TAG) { struct tag *tag = (struct tag *) obj; - oidset_insert(set, get_tagged_oid(tag)); + map_put_type(map, get_tagged_oid(tag), OBJ_COMMIT); } return 0; } -int is_promisor_object(const struct object_id *oid) +enum object_type promisor_object_type(const struct object_id *oid) { - static struct oidset promisor_objects; - static int promisor_objects_prepared; + static kh_oid_pos_t *promisor_objects; - if (!promisor_objects_prepared) { + if (!promisor_objects) { + promisor_objects = kh_init_oid_pos(); if (has_promisor_remote()) { for_each_packed_object(add_promisor_object, - &promisor_objects, + promisor_objects, FOR_EACH_OBJECT_PROMISOR_ONLY); } - promisor_objects_prepared = 1; } - return oidset_contains(&promisor_objects, oid); + khiter_t pos = kh_get_oid_pos(promisor_objects, *oid); + if (pos < kh_end(promisor_objects)) + return kh_value(promisor_objects, pos); + else + return OBJ_BAD; +} + +int is_promisor_object(const struct object_id *oid) +{ + return (promisor_object_type(oid) > OBJ_BAD); } diff --git a/packfile.h b/packfile.h index a3f6723857..35d0b600d2 100644 --- a/packfile.h +++ b/packfile.h @@ -182,6 +182,12 @@ int has_pack_index(const unsigned char *sha1); */ int is_promisor_object(const struct object_id *oid); +/* + * Return the object type if the given object is in, or referred to + * by, a promisor packfile; OBJ_BAD otherwise. + */ +enum object_type promisor_object_type(const struct object_id *oid); + /* * Expose a function for fuzz testing. * diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh index 1e864cf317..12bfb02187 100755 --- a/t/t0410-partial-clone.sh +++ b/t/t0410-partial-clone.sh @@ -660,6 +660,34 @@ test_expect_success 'lazy-fetch when accessing object not in the_repository' ' ! grep "[?]$FILE_HASH" out ' +test_expect_success 'missing blob object promised by tree, passes mktree' ' + rm -rf repo && + test_create_repo repo && + echo foo >repo/foo && + A=$(git -C repo hash-object foo) && + git -C repo update-index --add --info-only foo && + B=$(git -C repo write-tree --missing-ok) && + printf "$B\n" | pack_as_from_promisor && + delete_object repo "$B" && + + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.partialclone "arbitrary string" && + + printf "100644 blob $A\tbar" | git -C repo mktree +' + +test_expect_success 'missing blob object, not promised, faulted-in by mktree' ' + test_create_repo server && + A=$(echo foo | git -C server hash-object --stdin -w) && + test_commit -C server 1 && + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + + rm -rf repo && + git clone --filter=blob:none "file://$(pwd)/server" repo && + printf "100644 blob $A\tbar" | git -C repo mktree +' + . "$TEST_DIRECTORY"/lib-httpd.sh start_httpd
Do not use oid_object_info() to determine object type in mktree_line() as this can cause promised objects to be dynamically faulted-in one at a time which has poor performance. Instead, use a combination of oid_object_info_extended() (with OBJECT_INFO_SKIP_FETCH_OBJECT option), and the newly introduced promisor_object_type() to determine object type before defaulting to fetch from remote. Signed-off-by: Richard Oliver <roliver@roku.com> --- builtin/mktree.c | 19 ++++++++++++++-- packfile.c | 49 ++++++++++++++++++++++++++++------------ packfile.h | 6 +++++ t/t0410-partial-clone.sh | 28 +++++++++++++++++++++++ 4 files changed, 86 insertions(+), 16 deletions(-) base-commit: 1e59178e3f65880188caedb965e70db5ceeb2d64