diff mbox series

[v2,8/9] object-store: remove global array of cached objects

Message ID 20250411-pks-split-object-file-v2-8-2bea0c9033ae@pks.im (mailing list archive)
State Superseded
Headers show
Series Split up "object-file.c" | expand

Commit Message

Patrick Steinhardt April 11, 2025, 9:29 a.m. UTC
Cached objects are virtual objects that can be set up without writing
anything into the object store directly. This mechanism for example
allows us to create fake commits in git-blame(1).

The cached objects are stored in a global variable. Refactor the code so
that we instead store the array as part of the raw object store. This is
another step into the direction of libifying our object database.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 blame.c           |  2 +-
 object-store-ll.h | 14 +++++++++++++-
 object-store.c    | 39 +++++++++++++++++++++++----------------
 3 files changed, 37 insertions(+), 18 deletions(-)

Comments

Junio C Hamano April 11, 2025, 10:58 p.m. UTC | #1
Patrick Steinhardt <ps@pks.im> writes:

> Cached objects are virtual objects that can be set up without writing
> anything into the object store directly. This mechanism for example
> allows us to create fake commits in git-blame(1).
>
> The cached objects are stored in a global variable. Refactor the code so
> that we instead store the array as part of the raw object store. This is
> another step into the direction of libifying our object database.

While we do need some execution context object to hang these virtual
objects, once we decide that it cannot be global, I am not sure if
epository objects are good home for them.  If your application
running in a repository needs to give one object name to a virtual
object, and then that same application wants to access a submodule
of that repository in the same process image, wouldn't you have one
in-core repository object for the top-level superproject, and one
for each submodule?  If a submodule commit bound to a path in the
superproject's tree is a viertual "pretend" commit object or if it
has a virtual "pretend" tree object, don't you need to expose these
to both submodule and superproject repositories, if your application
wants to seamlessly cross the module boundary (think "git grep
--recurse-submodules" or something)?

For now, as long as the_repository is being used as that "execution
context object", and not a repository instance passed along the call
chain, then the globalness of these virtual objects is maintained,
so this change will not cause breakage (e.g., such an application
may want to pick up the virtual object from the repository instance
for the superproject and it may find it, but when traversing down to
a submdoule, the same virtual object may not be found in the
repository instance for the submodule it descended into and working
in, if you make it per repository and pass repository instance
around along the call chain).  But eventually somebody will start
saying "let's remove USE_THE_REPOSITORY_VARIABLE", at which point I
am not sure how subtle such a bug would become.
Patrick Steinhardt April 15, 2025, 9:19 a.m. UTC | #2
On Fri, Apr 11, 2025 at 03:58:03PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Cached objects are virtual objects that can be set up without writing
> > anything into the object store directly. This mechanism for example
> > allows us to create fake commits in git-blame(1).
> >
> > The cached objects are stored in a global variable. Refactor the code so
> > that we instead store the array as part of the raw object store. This is
> > another step into the direction of libifying our object database.
> 
> While we do need some execution context object to hang these virtual
> objects, once we decide that it cannot be global, I am not sure if
> epository objects are good home for them.  If your application
> running in a repository needs to give one object name to a virtual
> object, and then that same application wants to access a submodule
> of that repository in the same process image, wouldn't you have one
> in-core repository object for the top-level superproject, and one
> for each submodule?  If a submodule commit bound to a path in the
> superproject's tree is a viertual "pretend" commit object or if it
> has a virtual "pretend" tree object, don't you need to expose these
> to both submodule and superproject repositories, if your application
> wants to seamlessly cross the module boundary (think "git grep
> --recurse-submodules" or something)?
> 
> For now, as long as the_repository is being used as that "execution
> context object", and not a repository instance passed along the call
> chain, then the globalness of these virtual objects is maintained,
> so this change will not cause breakage (e.g., such an application
> may want to pick up the virtual object from the repository instance
> for the superproject and it may find it, but when traversing down to
> a submdoule, the same virtual object may not be found in the
> repository instance for the submodule it descended into and working
> in, if you make it per repository and pass repository instance
> around along the call chain).  But eventually somebody will start
> saying "let's remove USE_THE_REPOSITORY_VARIABLE", at which point I
> am not sure how subtle such a bug would become.

I think the answer is very much "it depends". I can think of usecases
where it might be the right to pretend objects to exist globally, but
there's also usecases where I think it makes sense to treat them as
repository-specific. The thing is: we can do the former if the virtual
objects are specific to a repository, but we can't do the latter if the
virtual objects are global.

As far as I can see we only use this mechanism in git-blame(1) right now
to create a fake working tree commit. This mechanism does not cross into
submodules at all, and if it would I think we would want to create two
separate fake working tree commits anyway: one for the parent
repository, and one for each submodule. So converting this mechanism to
be local to the repository (or rather local to an object store) feels
like the right thing to do to me.

But I agree with you in principle: we will have to be a lot more mindful
going forward as it comes to handling multiple repositories in-memory.
We don't do this well right now, but as we convert more and more code so
that it doesn't use `the_repository` anymore we'll have to become better
at this indeed. From my perspective that isn't only true for these fake
working tree commits, but it's a general thing that we'll have to sort
out over time. It's inherent to the whole libifcation process.

I think for the most part we're fine right now, as we don't make use of
any of the new capabilities that libifcation brings with it in theory.
But once usecases start to come up that _do_ make use of this we will
have to think about those issues a whole lot more carefully.

Patrick
diff mbox series

Patch

diff --git a/blame.c b/blame.c
index 703dab43e78..b7c5bd692e6 100644
--- a/blame.c
+++ b/blame.c
@@ -277,7 +277,7 @@  static struct commit *fake_working_tree_commit(struct repository *r,
 	convert_to_git(r->index, path, buf.buf, buf.len, &buf, 0);
 	origin->file.ptr = buf.buf;
 	origin->file.size = buf.len;
-	pretend_object_file(buf.buf, buf.len, OBJ_BLOB, &origin->blob_oid);
+	pretend_object_file(the_repository, buf.buf, buf.len, OBJ_BLOB, &origin->blob_oid);
 
 	/*
 	 * Read the current index, replace the path entry with
diff --git a/object-store-ll.h b/object-store-ll.h
index 8bb0f33f9a8..bb5e8798a1b 100644
--- a/object-store-ll.h
+++ b/object-store-ll.h
@@ -151,6 +151,8 @@  static inline int pack_map_entry_cmp(const void *cmp_data UNUSED,
 	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
 }
 
+struct cached_object_entry;
+
 struct raw_object_store {
 	/*
 	 * Set of all object directories; the main directory is first (and
@@ -203,6 +205,15 @@  struct raw_object_store {
 		unsigned flags;
 	} kept_pack_cache;
 
+	/*
+	 * This is meant to hold a *small* number of objects that you would
+	 * want repo_read_object_file() to be able to return, but yet you do not want
+	 * to write them into the object store (e.g. a browse-only
+	 * application).
+	 */
+	struct cached_object_entry *cached_objects;
+	size_t cached_object_nr, cached_object_alloc;
+
 	/*
 	 * A map of packfiles to packed_git structs for tracking which
 	 * packs have been loaded already.
@@ -272,7 +283,8 @@  void hash_object_file(const struct git_hash_algo *algo, const void *buf,
  * object in persistent storage before writing any other new objects
  * that reference it.
  */
-int pretend_object_file(void *, unsigned long, enum object_type,
+int pretend_object_file(struct repository *repo,
+			void *buf, unsigned long len, enum object_type type,
 			struct object_id *oid);
 
 struct object_info {
diff --git a/object-store.c b/object-store.c
index ea2d86c429b..17fa06a86fa 100644
--- a/object-store.c
+++ b/object-store.c
@@ -30,31 +30,31 @@ 
  * to write them into the object store (e.g. a browse-only
  * application).
  */
-static struct cached_object_entry {
+struct cached_object_entry {
 	struct object_id oid;
 	struct cached_object {
 		enum object_type type;
 		const void *buf;
 		unsigned long size;
 	} value;
-} *cached_objects;
-static int cached_object_nr, cached_object_alloc;
+};
 
-static const struct cached_object *find_cached_object(const struct object_id *oid)
+static const struct cached_object *find_cached_object(struct raw_object_store *object_store,
+						      const struct object_id *oid)
 {
 	static const struct cached_object empty_tree = {
 		.type = OBJ_TREE,
 		.buf = "",
 	};
-	int i;
-	const struct cached_object_entry *co = cached_objects;
+	const struct cached_object_entry *co = object_store->cached_objects;
 
-	for (i = 0; i < cached_object_nr; i++, co++) {
+	for (size_t i = 0; i < object_store->cached_object_nr; i++, co++)
 		if (oideq(&co->oid, oid))
 			return &co->value;
-	}
-	if (oideq(oid, the_hash_algo->empty_tree))
+
+	if (oid->algo && oideq(oid, hash_algos[oid->algo].empty_tree))
 		return &empty_tree;
+
 	return NULL;
 }
 
@@ -650,7 +650,7 @@  static int do_oid_object_info_extended(struct repository *r,
 	if (!oi)
 		oi = &blank_oi;
 
-	co = find_cached_object(real);
+	co = find_cached_object(r->objects, real);
 	if (co) {
 		if (oi->typep)
 			*(oi->typep) = co->type;
@@ -853,18 +853,21 @@  int oid_object_info(struct repository *r,
 	return type;
 }
 
-int pretend_object_file(void *buf, unsigned long len, enum object_type type,
+int pretend_object_file(struct repository *repo,
+			void *buf, unsigned long len, enum object_type type,
 			struct object_id *oid)
 {
 	struct cached_object_entry *co;
 	char *co_buf;
 
-	hash_object_file(the_hash_algo, buf, len, type, oid);
-	if (repo_has_object_file_with_flags(the_repository, oid, OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT) ||
-	    find_cached_object(oid))
+	hash_object_file(repo->hash_algo, buf, len, type, oid);
+	if (repo_has_object_file_with_flags(repo, oid, OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT) ||
+	    find_cached_object(repo->objects, oid))
 		return 0;
-	ALLOC_GROW(cached_objects, cached_object_nr + 1, cached_object_alloc);
-	co = &cached_objects[cached_object_nr++];
+
+	ALLOC_GROW(repo->objects->cached_objects,
+		   repo->objects->cached_object_nr + 1, repo->objects->cached_object_alloc);
+	co = &repo->objects->cached_objects[repo->objects->cached_object_nr++];
 	co->value.size = len;
 	co->value.type = type;
 	co_buf = xmalloc(len);
@@ -1021,6 +1024,10 @@  void raw_object_store_clear(struct raw_object_store *o)
 	o->odb_tail = NULL;
 	o->loaded_alternates = 0;
 
+	for (size_t i = 0; i < o->cached_object_nr; i++)
+		free((char *) o->cached_objects[i].value.buf);
+	FREE_AND_NULL(o->cached_objects);
+
 	INIT_LIST_HEAD(&o->packed_git_mru);
 	close_object_store(o);