diff mbox series

repository.c: always allocate 'index' at repo init time

Message ID 20190519025636.24819-1-pclouds@gmail.com (mailing list archive)
State New, archived
Headers show
Series repository.c: always allocate 'index' at repo init time | expand

Commit Message

Duy Nguyen May 19, 2019, 2:56 a.m. UTC
There are two ways a 'struct repository' could be initialized before
using: via initialize_the_repository() and repo_init().

The first way always initializes 'index' field because that's how it is
before the introduction of 'struct repository'. Back then 'the_index' is
always available (even if not loaded). The second way however leaves
'index' NULL and relies on repo_read_index() to allocate it on demand.

The problem with the second way is that, the majority of our code base
was written with 'the_index' (i.e. the first way) in mind, where
dereferencing 'the_index' (or the 'index' field now) is always
safe.

The second way breaks this assumption. The 'index' field can be NULL
until loading from disk, which could lead to segfaults like
581d2fd9f2 (get_oid: handle NULL repo->index, 2019-05-14).

We have two options to handle this: either we audit the entire code
base, adding 'is index NULL' when needed, or we make sure 'index' is
never NULL to begin with.

This patch goes with the second option, making sure that 'index' is
always allocated after initialization. It's less effort than the first
one, and also safer because you could still miss things during the code
audit. The extra allocation cost is not a real concern.

The 'index' field is still freed and reset to NULL in repo_clear(). But
after that call, a lot more is missing in 'repo' and it can never be
used again without going through reinitialization phase. So it should be
fine.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 repository.c | 3 ++-
 repository.h | 4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)

Comments

Jeff King May 20, 2019, 1:17 p.m. UTC | #1
On Sun, May 19, 2019 at 09:56:36AM +0700, Nguyễn Thái Ngọc Duy wrote:

> This patch goes with the second option, making sure that 'index' is
> always allocated after initialization. It's less effort than the first
> one, and also safer because you could still miss things during the code
> audit. The extra allocation cost is not a real concern.

I think this direction makes sense.

The patch looks good, though I wonder if we could simplify even further
by just embedding an index into the repository object. The purpose of
having it as a pointer, I think, is so that the_repository can point to
the_index. But we could possibly hide the latter behind some macro
trickery like:

  #define the_index (the_repository->index)

I spent a few minutes on a proof of concept patch, but it gets a bit
hairy:

  1. There are some circular dependencies in the header files. We'd need
     repository.h to depend on cache.h to get the definition of
     index_state, but the latter includes repository.h. We'd need to
     break the index bits out of cache.h into index.h, which in turn
     requires breaking out some other parts. I did a sloppy job of it in
     the patch below.

  2. There are hundreds of spots that need to swap out "repo->index" for
     "&repo->index". In the patch below I just did enough to compile
     archive-zip.o, to illustrate. :)

So it's definitely non-trivial to go that way. I'm not sure if it's
worth the effort to switch at this point, but even if it is, your patch
seems like a good thing to do in the meantime.

Either way, I think we could probably revert the non-test portion of my
581d2fd9f2 (get_oid: handle NULL repo->index, 2019-05-14) after this.

-Peff

---
diff --git a/archive-zip.c b/archive-zip.c
index 4d66b5be6e..517e203483 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -353,7 +353,7 @@ static int write_zip_entry(struct archiver_args *args,
 				return error(_("cannot read %s"),
 					     oid_to_hex(oid));
 			crc = crc32(crc, buffer, size);
-			is_binary = entry_is_binary(args->repo->index,
+			is_binary = entry_is_binary(&args->repo->index,
 						    path_without_prefix,
 						    buffer, size);
 			out = buffer;
@@ -430,7 +430,7 @@ static int write_zip_entry(struct archiver_args *args,
 				break;
 			crc = crc32(crc, buf, readlen);
 			if (is_binary == -1)
-				is_binary = entry_is_binary(args->repo->index,
+				is_binary = entry_is_binary(&args->repo->index,
 							    path_without_prefix,
 							    buf, readlen);
 			write_or_die(1, buf, readlen);
@@ -463,7 +463,7 @@ static int write_zip_entry(struct archiver_args *args,
 				break;
 			crc = crc32(crc, buf, readlen);
 			if (is_binary == -1)
-				is_binary = entry_is_binary(args->repo->index,
+				is_binary = entry_is_binary(&args->repo->index,
 							    path_without_prefix,
 							    buf, readlen);
 
diff --git a/cache.h b/cache.h
index b4bb2e2c11..d0450025e1 100644
--- a/cache.h
+++ b/cache.h
@@ -17,6 +17,7 @@
 #include "sha1-array.h"
 #include "repository.h"
 #include "mem-pool.h"
+#include "oid.h"
 
 #include <zlib.h>
 typedef struct git_zstream {
@@ -43,28 +44,6 @@ int git_deflate_end_gently(git_zstream *);
 int git_deflate(git_zstream *, int flush);
 unsigned long git_deflate_bound(git_zstream *, unsigned long);
 
-/* The length in bytes and in hex digits of an object name (SHA-1 value). */
-#define GIT_SHA1_RAWSZ 20
-#define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)
-/* The block size of SHA-1. */
-#define GIT_SHA1_BLKSZ 64
-
-/* The length in bytes and in hex digits of an object name (SHA-256 value). */
-#define GIT_SHA256_RAWSZ 32
-#define GIT_SHA256_HEXSZ (2 * GIT_SHA256_RAWSZ)
-/* The block size of SHA-256. */
-#define GIT_SHA256_BLKSZ 64
-
-/* The length in byte and in hex digits of the largest possible hash value. */
-#define GIT_MAX_RAWSZ GIT_SHA256_RAWSZ
-#define GIT_MAX_HEXSZ GIT_SHA256_HEXSZ
-/* The largest possible block size for any supported hash. */
-#define GIT_MAX_BLKSZ GIT_SHA256_BLKSZ
-
-struct object_id {
-	unsigned char hash[GIT_MAX_RAWSZ];
-};
-
 #define the_hash_algo the_repository->hash_algo
 
 #if defined(DT_UNKNOWN) && !defined(NO_D_TYPE_IN_DIRENT)
@@ -143,16 +122,6 @@ struct cache_header {
 #define INDEX_FORMAT_LB 2
 #define INDEX_FORMAT_UB 4
 
-/*
- * The "cache_time" is just the low 32 bits of the
- * time. It doesn't matter if it overflows - we only
- * check it for equality in the 32 bits we save.
- */
-struct cache_time {
-	uint32_t sec;
-	uint32_t nsec;
-};
-
 struct stat_data {
 	struct cache_time sd_ctime;
 	struct cache_time sd_mtime;
@@ -326,32 +295,6 @@ static inline unsigned int canon_mode(unsigned int mode)
 #define UNTRACKED_CHANGED	(1 << 7)
 #define FSMONITOR_CHANGED	(1 << 8)
 
-struct split_index;
-struct untracked_cache;
-
-struct index_state {
-	struct cache_entry **cache;
-	unsigned int version;
-	unsigned int cache_nr, cache_alloc, cache_changed;
-	struct string_list *resolve_undo;
-	struct cache_tree *cache_tree;
-	struct split_index *split_index;
-	struct cache_time timestamp;
-	unsigned name_hash_initialized : 1,
-		 initialized : 1,
-		 drop_cache_tree : 1,
-		 updated_workdir : 1,
-		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
-	struct hashmap name_hash;
-	struct hashmap dir_hash;
-	struct object_id oid;
-	struct untracked_cache *untracked;
-	uint64_t fsmonitor_last_update;
-	struct ewah_bitmap *fsmonitor_dirty;
-	struct mem_pool *ce_mem_pool;
-};
-
 /* Name hashing */
 int test_lazy_init_name_hash(struct index_state *istate, int try_threaded);
 void add_name_hash(struct index_state *istate, struct cache_entry *ce);
diff --git a/repository.h b/repository.h
index 4fb6a5885f..3371afceaa 100644
--- a/repository.h
+++ b/repository.h
@@ -1,11 +1,11 @@
 #ifndef REPOSITORY_H
 #define REPOSITORY_H
 
+#include "index.h"
 #include "path.h"
 
 struct config_set;
 struct git_hash_algo;
-struct index_state;
 struct lock_file;
 struct pathspec;
 struct raw_object_store;
@@ -87,7 +87,7 @@ struct repository {
 	 * Repository's in-memory index.
 	 * 'repo_read_index()' can be used to populate 'index'.
 	 */
-	struct index_state *index;
+	struct index_state index;
 
 	/* Repository's current hash algorithm, as serialized on disk. */
 	const struct git_hash_algo *hash_algo;
Duy Nguyen May 21, 2019, 10:34 a.m. UTC | #2
On Mon, May 20, 2019 at 8:17 PM Jeff King <peff@peff.net> wrote:
> The patch looks good, though I wonder if we could simplify even further
> by just embedding an index into the repository object. The purpose of
> having it as a pointer, I think, is so that the_repository can point to
> the_index. But we could possibly hide the latter behind some macro
> trickery like:
>
>   #define the_index (the_repository->index)
>
> I spent a few minutes on a proof of concept patch, but it gets a bit
> hairy:
>
>   1. There are some circular dependencies in the header files. We'd need
>      repository.h to depend on cache.h to get the definition of
>      index_state, but the latter includes repository.h. We'd need to
>      break the index bits out of cache.h into index.h, which in turn
>      requires breaking out some other parts. I did a sloppy job of it in
>      the patch below.
>
>   2. There are hundreds of spots that need to swap out "repo->index" for
>      "&repo->index". In the patch below I just did enough to compile
>      archive-zip.o, to illustrate. :)

You are more thorough than me. I saw #2 first and immediately backed
off (partly for a selfish reason: I have plenty of the_repo conversion
patches in queue and anything touching "repo" may delay those patches
even more).

There's also #3 but this one is minor. So far 'struct repo' is more of
a glue of things. Embedding index_state in it while leaving
object_store, ref_store... pointers feels inconsistent and a bit
weird. It's not a strong reason for making index_state a pointer too,
but if we have to deal with pointers anyway...

> So it's definitely non-trivial to go that way. I'm not sure if it's
> worth the effort to switch at this point, but even if it is, your patch
> seems like a good thing to do in the meantime.
>
> Either way, I think we could probably revert the non-test portion of my
> 581d2fd9f2 (get_oid: handle NULL repo->index, 2019-05-14) after this.

Yeah. I'm thinking of doing that after, scanning for similar lines
too. But it looks like it's the only one. Will fix in v2.
Jeff King May 21, 2019, 8:58 p.m. UTC | #3
On Tue, May 21, 2019 at 05:34:02PM +0700, Duy Nguyen wrote:

> >   2. There are hundreds of spots that need to swap out "repo->index" for
> >      "&repo->index". In the patch below I just did enough to compile
> >      archive-zip.o, to illustrate. :)
> 
> You are more thorough than me. I saw #2 first and immediately backed
> off (partly for a selfish reason: I have plenty of the_repo conversion
> patches in queue and anything touching "repo" may delay those patches
> even more).

Yeah, that's true, it would be disruptive.

> There's also #3 but this one is minor. So far 'struct repo' is more of
> a glue of things. Embedding index_state in it while leaving
> object_store, ref_store... pointers feels inconsistent and a bit
> weird. It's not a strong reason for making index_state a pointer too,
> but if we have to deal with pointers anyway...

And yeah, I agree it would nice for it to all be consistent. Let's leave
it at your patch for now, and we can think about refactoring this later.

-Peff
Junio C Hamano May 28, 2019, 4:07 p.m. UTC | #4
Duy Nguyen <pclouds@gmail.com> writes:

> On Mon, May 20, 2019 at 8:17 PM Jeff King <peff@peff.net> wrote:
>> The patch looks good, though I wonder if we could simplify even further
>> by just embedding an index into the repository object. The purpose of
>> having it as a pointer, I think, is so that the_repository can point to
>> the_index. But we could possibly hide the latter behind some macro
>> trickery like:
>>
>>   #define the_index (the_repository->index)
> ...
>> So it's definitely non-trivial to go that way. I'm not sure if it's
>> worth the effort to switch at this point, but even if it is, your patch
>> seems like a good thing to do in the meantime.

Yeah, the fact that the_reopsitory->index is not an embedded
instance has bothered me from the very beginning, and I am happy to
see others share the same feeling ;-)

>> Either way, I think we could probably revert the non-test portion of my
>> 581d2fd9f2 (get_oid: handle NULL repo->index, 2019-05-14) after this.
>
> Yeah. I'm thinking of doing that after, scanning for similar lines
> too. But it looks like it's the only one. Will fix in v2.

Thanks.
diff mbox series

Patch

diff --git a/repository.c b/repository.c
index 682c239fe3..ca58692504 100644
--- a/repository.c
+++ b/repository.c
@@ -160,6 +160,7 @@  int repo_init(struct repository *repo,
 	struct repository_format format = REPOSITORY_FORMAT_INIT;
 	memset(repo, 0, sizeof(*repo));
 
+	repo->index = xcalloc(1, sizeof(*repo->index));
 	repo->objects = raw_object_store_new();
 	repo->parsed_objects = parsed_object_pool_new();
 
@@ -262,7 +263,7 @@  void repo_clear(struct repository *repo)
 int repo_read_index(struct repository *repo)
 {
 	if (!repo->index)
-		repo->index = xcalloc(1, sizeof(*repo->index));
+		BUG("the repo hasn't been setup");
 
 	return read_index_from(repo->index, repo->index_file, repo->gitdir);
 }
diff --git a/repository.h b/repository.h
index 4fb6a5885f..75c4f68b22 100644
--- a/repository.h
+++ b/repository.h
@@ -85,6 +85,7 @@  struct repository {
 
 	/*
 	 * Repository's in-memory index.
+	 * Cannot be NULL after initialization.
 	 * 'repo_read_index()' can be used to populate 'index'.
 	 */
 	struct index_state *index;
@@ -132,6 +133,9 @@  struct submodule;
 int repo_submodule_init(struct repository *subrepo,
 			struct repository *superproject,
 			const struct submodule *sub);
+/*
+ * Release all resources in 'repo'. 'repo' cannot be used again.
+ */
 void repo_clear(struct repository *repo);
 
 /*