Message ID | 20250411-pks-split-object-file-v2-8-2bea0c9033ae@pks.im (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Split up "object-file.c" | expand |
Patrick Steinhardt <ps@pks.im> writes: > Cached objects are virtual objects that can be set up without writing > anything into the object store directly. This mechanism for example > allows us to create fake commits in git-blame(1). > > The cached objects are stored in a global variable. Refactor the code so > that we instead store the array as part of the raw object store. This is > another step into the direction of libifying our object database. While we do need some execution context object to hang these virtual objects, once we decide that it cannot be global, I am not sure if epository objects are good home for them. If your application running in a repository needs to give one object name to a virtual object, and then that same application wants to access a submodule of that repository in the same process image, wouldn't you have one in-core repository object for the top-level superproject, and one for each submodule? If a submodule commit bound to a path in the superproject's tree is a viertual "pretend" commit object or if it has a virtual "pretend" tree object, don't you need to expose these to both submodule and superproject repositories, if your application wants to seamlessly cross the module boundary (think "git grep --recurse-submodules" or something)? For now, as long as the_repository is being used as that "execution context object", and not a repository instance passed along the call chain, then the globalness of these virtual objects is maintained, so this change will not cause breakage (e.g., such an application may want to pick up the virtual object from the repository instance for the superproject and it may find it, but when traversing down to a submdoule, the same virtual object may not be found in the repository instance for the submodule it descended into and working in, if you make it per repository and pass repository instance around along the call chain). But eventually somebody will start saying "let's remove USE_THE_REPOSITORY_VARIABLE", at which point I am not sure how subtle such a bug would become.
On Fri, Apr 11, 2025 at 03:58:03PM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > Cached objects are virtual objects that can be set up without writing > > anything into the object store directly. This mechanism for example > > allows us to create fake commits in git-blame(1). > > > > The cached objects are stored in a global variable. Refactor the code so > > that we instead store the array as part of the raw object store. This is > > another step into the direction of libifying our object database. > > While we do need some execution context object to hang these virtual > objects, once we decide that it cannot be global, I am not sure if > epository objects are good home for them. If your application > running in a repository needs to give one object name to a virtual > object, and then that same application wants to access a submodule > of that repository in the same process image, wouldn't you have one > in-core repository object for the top-level superproject, and one > for each submodule? If a submodule commit bound to a path in the > superproject's tree is a viertual "pretend" commit object or if it > has a virtual "pretend" tree object, don't you need to expose these > to both submodule and superproject repositories, if your application > wants to seamlessly cross the module boundary (think "git grep > --recurse-submodules" or something)? > > For now, as long as the_repository is being used as that "execution > context object", and not a repository instance passed along the call > chain, then the globalness of these virtual objects is maintained, > so this change will not cause breakage (e.g., such an application > may want to pick up the virtual object from the repository instance > for the superproject and it may find it, but when traversing down to > a submdoule, the same virtual object may not be found in the > repository instance for the submodule it descended into and working > in, if you make it per repository and pass repository instance > around along the call chain). But eventually somebody will start > saying "let's remove USE_THE_REPOSITORY_VARIABLE", at which point I > am not sure how subtle such a bug would become. I think the answer is very much "it depends". I can think of usecases where it might be the right to pretend objects to exist globally, but there's also usecases where I think it makes sense to treat them as repository-specific. The thing is: we can do the former if the virtual objects are specific to a repository, but we can't do the latter if the virtual objects are global. As far as I can see we only use this mechanism in git-blame(1) right now to create a fake working tree commit. This mechanism does not cross into submodules at all, and if it would I think we would want to create two separate fake working tree commits anyway: one for the parent repository, and one for each submodule. So converting this mechanism to be local to the repository (or rather local to an object store) feels like the right thing to do to me. But I agree with you in principle: we will have to be a lot more mindful going forward as it comes to handling multiple repositories in-memory. We don't do this well right now, but as we convert more and more code so that it doesn't use `the_repository` anymore we'll have to become better at this indeed. From my perspective that isn't only true for these fake working tree commits, but it's a general thing that we'll have to sort out over time. It's inherent to the whole libifcation process. I think for the most part we're fine right now, as we don't make use of any of the new capabilities that libifcation brings with it in theory. But once usecases start to come up that _do_ make use of this we will have to think about those issues a whole lot more carefully. Patrick
diff --git a/blame.c b/blame.c index 703dab43e78..b7c5bd692e6 100644 --- a/blame.c +++ b/blame.c @@ -277,7 +277,7 @@ static struct commit *fake_working_tree_commit(struct repository *r, convert_to_git(r->index, path, buf.buf, buf.len, &buf, 0); origin->file.ptr = buf.buf; origin->file.size = buf.len; - pretend_object_file(buf.buf, buf.len, OBJ_BLOB, &origin->blob_oid); + pretend_object_file(the_repository, buf.buf, buf.len, OBJ_BLOB, &origin->blob_oid); /* * Read the current index, replace the path entry with diff --git a/object-store-ll.h b/object-store-ll.h index 8bb0f33f9a8..bb5e8798a1b 100644 --- a/object-store-ll.h +++ b/object-store-ll.h @@ -151,6 +151,8 @@ static inline int pack_map_entry_cmp(const void *cmp_data UNUSED, return strcmp(pg1->pack_name, key ? key : pg2->pack_name); } +struct cached_object_entry; + struct raw_object_store { /* * Set of all object directories; the main directory is first (and @@ -203,6 +205,15 @@ struct raw_object_store { unsigned flags; } kept_pack_cache; + /* + * This is meant to hold a *small* number of objects that you would + * want repo_read_object_file() to be able to return, but yet you do not want + * to write them into the object store (e.g. a browse-only + * application). + */ + struct cached_object_entry *cached_objects; + size_t cached_object_nr, cached_object_alloc; + /* * A map of packfiles to packed_git structs for tracking which * packs have been loaded already. @@ -272,7 +283,8 @@ void hash_object_file(const struct git_hash_algo *algo, const void *buf, * object in persistent storage before writing any other new objects * that reference it. */ -int pretend_object_file(void *, unsigned long, enum object_type, +int pretend_object_file(struct repository *repo, + void *buf, unsigned long len, enum object_type type, struct object_id *oid); struct object_info { diff --git a/object-store.c b/object-store.c index ea2d86c429b..17fa06a86fa 100644 --- a/object-store.c +++ b/object-store.c @@ -30,31 +30,31 @@ * to write them into the object store (e.g. a browse-only * application). */ -static struct cached_object_entry { +struct cached_object_entry { struct object_id oid; struct cached_object { enum object_type type; const void *buf; unsigned long size; } value; -} *cached_objects; -static int cached_object_nr, cached_object_alloc; +}; -static const struct cached_object *find_cached_object(const struct object_id *oid) +static const struct cached_object *find_cached_object(struct raw_object_store *object_store, + const struct object_id *oid) { static const struct cached_object empty_tree = { .type = OBJ_TREE, .buf = "", }; - int i; - const struct cached_object_entry *co = cached_objects; + const struct cached_object_entry *co = object_store->cached_objects; - for (i = 0; i < cached_object_nr; i++, co++) { + for (size_t i = 0; i < object_store->cached_object_nr; i++, co++) if (oideq(&co->oid, oid)) return &co->value; - } - if (oideq(oid, the_hash_algo->empty_tree)) + + if (oid->algo && oideq(oid, hash_algos[oid->algo].empty_tree)) return &empty_tree; + return NULL; } @@ -650,7 +650,7 @@ static int do_oid_object_info_extended(struct repository *r, if (!oi) oi = &blank_oi; - co = find_cached_object(real); + co = find_cached_object(r->objects, real); if (co) { if (oi->typep) *(oi->typep) = co->type; @@ -853,18 +853,21 @@ int oid_object_info(struct repository *r, return type; } -int pretend_object_file(void *buf, unsigned long len, enum object_type type, +int pretend_object_file(struct repository *repo, + void *buf, unsigned long len, enum object_type type, struct object_id *oid) { struct cached_object_entry *co; char *co_buf; - hash_object_file(the_hash_algo, buf, len, type, oid); - if (repo_has_object_file_with_flags(the_repository, oid, OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT) || - find_cached_object(oid)) + hash_object_file(repo->hash_algo, buf, len, type, oid); + if (repo_has_object_file_with_flags(repo, oid, OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT) || + find_cached_object(repo->objects, oid)) return 0; - ALLOC_GROW(cached_objects, cached_object_nr + 1, cached_object_alloc); - co = &cached_objects[cached_object_nr++]; + + ALLOC_GROW(repo->objects->cached_objects, + repo->objects->cached_object_nr + 1, repo->objects->cached_object_alloc); + co = &repo->objects->cached_objects[repo->objects->cached_object_nr++]; co->value.size = len; co->value.type = type; co_buf = xmalloc(len); @@ -1021,6 +1024,10 @@ void raw_object_store_clear(struct raw_object_store *o) o->odb_tail = NULL; o->loaded_alternates = 0; + for (size_t i = 0; i < o->cached_object_nr; i++) + free((char *) o->cached_objects[i].value.buf); + FREE_AND_NULL(o->cached_objects); + INIT_LIST_HEAD(&o->packed_git_mru); close_object_store(o);
Cached objects are virtual objects that can be set up without writing anything into the object store directly. This mechanism for example allows us to create fake commits in git-blame(1). The cached objects are stored in a global variable. Refactor the code so that we instead store the array as part of the raw object store. This is another step into the direction of libifying our object database. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- blame.c | 2 +- object-store-ll.h | 14 +++++++++++++- object-store.c | 39 +++++++++++++++++++++++---------------- 3 files changed, 37 insertions(+), 18 deletions(-)