diff mbox series

[04/12] packfile: check midx coverage with .idx rather than .pack

Message ID 20190404232546.GD21839@sigill.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series a rabbit hole of update-server-info fixes | expand

Commit Message

Jeff King April 4, 2019, 11:25 p.m. UTC
When we have a .midx that covers many packfiles, we try to avoid opening
the .idx for those packfiles. However, there are a few problems with the
filename comparison we use:

  - we ask midx_contains_pack() about the .pack name, not the .idx name.
    But it compares to the latter.

  - we compute the basename of the pack using strrchr() to find the
    final slash. But that leaves an extra "/" at the start of our
    string; we need to advance past it.

    That also raises the question of what to do when the name does not
    have a slash at all. This should generally not happen (we always
    find files in "pack/"), but it doesn't hurt to be defensive here.

The tests don't notice because there's nothing about opening those .idx
files that would cause us to give incorrect output. It's just a little
slower. The new test checks this case by corrupting the covered .idx,
and then making sure we don't complain about it.

Signed-off-by: Jeff King <peff@peff.net>
---
 packfile.c                  | 17 ++++++++++++++---
 t/t5319-multi-pack-index.sh | 14 ++++++++++++++
 2 files changed, 28 insertions(+), 3 deletions(-)

Comments

René Scharfe April 5, 2019, 8:05 a.m. UTC | #1
Am 05.04.2019 um 01:25 schrieb Jeff King:
> When we have a .midx that covers many packfiles, we try to avoid opening
> the .idx for those packfiles. However, there are a few problems with the
> filename comparison we use:
>
>    - we ask midx_contains_pack() about the .pack name, not the .idx name.
>      But it compares to the latter.
>
>    - we compute the basename of the pack using strrchr() to find the
>      final slash. But that leaves an extra "/" at the start of our
>      string; we need to advance past it.
>
>      That also raises the question of what to do when the name does not
>      have a slash at all. This should generally not happen (we always
>      find files in "pack/"), but it doesn't hurt to be defensive here.
>
> The tests don't notice because there's nothing about opening those .idx
> files that would cause us to give incorrect output. It's just a little
> slower. The new test checks this case by corrupting the covered .idx,
> and then making sure we don't complain about it.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   packfile.c                  | 17 ++++++++++++++---
>   t/t5319-multi-pack-index.sh | 14 ++++++++++++++
>   2 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/packfile.c b/packfile.c
> index 054269ae5d..e7ca135ed5 100644
> --- a/packfile.c
> +++ b/packfile.c

> @@ -486,15 +496,16 @@ static int open_packed_git_1(struct packed_git *p)
>   	ssize_t read_result;
>   	const unsigned hashsz = the_hash_algo->rawsz;
>
> -	if (!p->index_data) {
> +	if (!p->index_data && the_repository->objects->multi_pack_index) {

So if there is no multi_pack_index, we skip this block now...

>   		struct multi_pack_index *m;
> -		const char *pack_name = strrchr(p->pack_name, '/');
> +		char *idx_name = pack_name_to_idx(pack_basename(p));
>
>   		for (m = the_repository->objects->multi_pack_index;
>   		     m; m = m->next) {
> -			if (midx_contains_pack(m, pack_name))
> +			if (midx_contains_pack(m, idx_name))
>   				break;
>   		}
> +		free(idx_name);
>
>   		if (!m && open_pack_index(p))
>   			return error("packfile %s index unavailable", p->pack_name);

... which also means this open_pack_index() call isn't done anymore if
there's no .midx file at all.  You don't mention this change in the
commit message; is it intended?

And I wonder if it would be easier overall to let midx_contains_pack()
accept .pack names in addition to .idx names.  Perhaps with something
like this?

int cmp_idx_or_pack_name(const char *idx_or_pack_name, const char *idx_name)
{
	while (*idx_name && *idx_name == *idx_or_pack_name) {
		idx_name++;
		idx_or_pack_name++;
	}
	if (!strcmp(idx_name, ".idx") && !strcmp(idx_or_pack_name, ".pack"))
		return 0;
	return strcmp(idx_or_pack_name, idx_name);
}

René
SZEDER Gábor April 5, 2019, 12:01 p.m. UTC | #2
On Thu, Apr 04, 2019 at 07:25:46PM -0400, Jeff King wrote:
> When we have a .midx that covers many packfiles, we try to avoid opening
> the .idx for those packfiles. However, there are a few problems with the
> filename comparison we use:
> 
>   - we ask midx_contains_pack() about the .pack name, not the .idx name.
>     But it compares to the latter.
> 
>   - we compute the basename of the pack using strrchr() to find the
>     final slash. But that leaves an extra "/" at the start of our
>     string; we need to advance past it.
> 
>     That also raises the question of what to do when the name does not
>     have a slash at all. This should generally not happen (we always
>     find files in "pack/"), but it doesn't hurt to be defensive here.
> 
> The tests don't notice because there's nothing about opening those .idx
> files that would cause us to give incorrect output. It's just a little
> slower. The new test checks this case by corrupting the covered .idx,
> and then making sure we don't complain about it.

When the test suite is run with GIT_TEST_MULTI_PACK_INDEX=1, the
following tests fail with this patch:

  t1006-cat-file.sh                                (Wstat: 256 Tests: 111 Failed: 1)
    Failed test:  87
    Non-zero exit status: 1
  t5320-delta-islands.sh                           (Wstat: 256 Tests: 15 Failed: 10)
    Failed tests:  2-4, 6-10, 12-13
    Non-zero exit status: 1
  t5310-pack-bitmaps.sh                            (Wstat: 256 Tests: 46 Failed: 1)
    Failed test:  44
    Non-zero exit status: 1
  t5570-git-daemon.sh                              (Wstat: 256 Tests: 21 Failed: 1)
    Failed test:  10
    Non-zero exit status: 1


> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  packfile.c                  | 17 ++++++++++++++---
>  t/t5319-multi-pack-index.sh | 14 ++++++++++++++
>  2 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/packfile.c b/packfile.c
> index 054269ae5d..e7ca135ed5 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -472,6 +472,16 @@ static unsigned int get_max_fd_limit(void)
>  #endif
>  }
>  
> +static const char *pack_basename(struct packed_git *p)
> +{
> +	const char *ret = strrchr(p->pack_name, '/');
> +	if (ret)
> +		ret = ret + 1; /* skip past slash */
> +	else
> +		ret = p->pack_name; /* we only have a base */
> +	return ret;
> +}
> +
>  /*
>   * Do not call this directly as this leaks p->pack_fd on error return;
>   * call open_packed_git() instead.
> @@ -486,15 +496,16 @@ static int open_packed_git_1(struct packed_git *p)
>  	ssize_t read_result;
>  	const unsigned hashsz = the_hash_algo->rawsz;
>  
> -	if (!p->index_data) {
> +	if (!p->index_data && the_repository->objects->multi_pack_index) {
>  		struct multi_pack_index *m;
> -		const char *pack_name = strrchr(p->pack_name, '/');
> +		char *idx_name = pack_name_to_idx(pack_basename(p));
>  
>  		for (m = the_repository->objects->multi_pack_index;
>  		     m; m = m->next) {
> -			if (midx_contains_pack(m, pack_name))
> +			if (midx_contains_pack(m, idx_name))
>  				break;
>  		}
> +		free(idx_name);
>  
>  		if (!m && open_pack_index(p))
>  			return error("packfile %s index unavailable", p->pack_name);
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 8c4d2bd849..1ebf19ec3c 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -117,6 +117,20 @@ test_expect_success 'write midx with one v2 pack' '
>  
>  compare_results_with_midx "one v2 pack"
>  
> +test_expect_success 'corrupt idx not opened' '
> +	idx=$(test-tool read-midx $objdir | grep "\.idx\$") &&
> +	mv $objdir/pack/$idx backup-$idx &&
> +	test_when_finished "mv backup-\$idx \$objdir/pack/\$idx" &&
> +
> +	# This is the minimum size for a sha-1 based .idx; this lets
> +	# us pass perfunctory tests, but anything that actually opens and reads
> +	# the idx file will complain.
> +	test_copy_bytes 1064 <backup-$idx >$objdir/pack/$idx &&
> +
> +	git -c core.multiPackIndex=true rev-list --objects --all 2>err &&
> +	test_must_be_empty err
> +'
> +
>  test_expect_success 'add more objects' '
>  	for i in $(test_seq 6 10)
>  	do
> -- 
> 2.21.0.714.gd1be1d035b
>
Jeff King April 5, 2019, 1:21 p.m. UTC | #3
On Fri, Apr 05, 2019 at 10:05:29AM +0200, René Scharfe wrote:

> > @@ -486,15 +496,16 @@ static int open_packed_git_1(struct packed_git *p)
> >   	ssize_t read_result;
> >   	const unsigned hashsz = the_hash_algo->rawsz;
> >
> > -	if (!p->index_data) {
> > +	if (!p->index_data && the_repository->objects->multi_pack_index) {
> 
> So if there is no multi_pack_index, we skip this block now...
> 
> >   		struct multi_pack_index *m;
> > -		const char *pack_name = strrchr(p->pack_name, '/');
> > +		char *idx_name = pack_name_to_idx(pack_basename(p));
> >
> >   		for (m = the_repository->objects->multi_pack_index;
> >   		     m; m = m->next) {
> > -			if (midx_contains_pack(m, pack_name))
> > +			if (midx_contains_pack(m, idx_name))
> >   				break;
> >   		}
> > +		free(idx_name);
> >
> >   		if (!m && open_pack_index(p))
> >   			return error("packfile %s index unavailable", p->pack_name);
> 
> ... which also means this open_pack_index() call isn't done anymore if
> there's no .midx file at all.  You don't mention this change in the
> commit message; is it intended?

Doh, thank you for catching that. I made that switch at the last minute
because I didn't want to pay the malloc/free cost when we had no list to
compare to. I'm surprised it works at all. :-/

I guess it doesn't, from the other message in the thread.

> And I wonder if it would be easier overall to let midx_contains_pack()
> accept .pack names in addition to .idx names.  Perhaps with something
> like this?
> 
> int cmp_idx_or_pack_name(const char *idx_or_pack_name, const char *idx_name)
> {
> 	while (*idx_name && *idx_name == *idx_or_pack_name) {
> 		idx_name++;
> 		idx_or_pack_name++;
> 	}
> 	if (!strcmp(idx_name, ".idx") && !strcmp(idx_or_pack_name, ".pack"))
> 		return 0;
> 	return strcmp(idx_or_pack_name, idx_name);
> }

Hmm, maybe. It does a binary search, so I'd have to scratch my head for
a minute of whether this loose comparison is correct. I think it is
because of that final strcmp.

-Peff
Jeff King April 5, 2019, 1:30 p.m. UTC | #4
On Fri, Apr 05, 2019 at 09:21:25AM -0400, Jeff King wrote:

> > ... which also means this open_pack_index() call isn't done anymore if
> > there's no .midx file at all.  You don't mention this change in the
> > commit message; is it intended?
> 
> Doh, thank you for catching that. I made that switch at the last minute
> because I didn't want to pay the malloc/free cost when we had no list to
> compare to. I'm surprised it works at all. :-/
> 
> I guess it doesn't, from the other message in the thread.

Hmm, no, that's a different problem. I think we'd generally already have
the pack open unless there _is_ a midx, so this conditional probably
never kicks in unless we have a midx anyway. Meaning my malloc-avoidance
was over-thought. So we could just get rid of it. Or do it like this:

diff --git a/packfile.c b/packfile.c
index e7ca135ed5..ef0f959311 100644
--- a/packfile.c
+++ b/packfile.c
@@ -496,17 +496,22 @@ static int open_packed_git_1(struct packed_git *p)
 	ssize_t read_result;
 	const unsigned hashsz = the_hash_algo->rawsz;
 
-	if (!p->index_data && the_repository->objects->multi_pack_index) {
-		struct multi_pack_index *m;
-		char *idx_name = pack_name_to_idx(pack_basename(p));
+	if (!p->index_data) {
+		struct multi_pack_index *m =
+			the_repository->objects->multi_pack_index;
 
-		for (m = the_repository->objects->multi_pack_index;
-		     m; m = m->next) {
-			if (midx_contains_pack(m, idx_name))
-				break;
+		if (m) {
+			char *idx_name = pack_name_to_idx(pack_basename(p));
+
+			for (; m; m = m->next) {
+				if (midx_contains_pack(m, idx_name))
+					break;
+			}
+			free(idx_name);
 		}
-		free(idx_name);
 
+		if (!m)
+			warning("opening anyway");
 		if (!m && open_pack_index(p))
 			return error("packfile %s index unavailable", p->pack_name);
 	}

I'll have to re-roll to address the other problem, though, so I'll give
it some thought.

-Peff
Jeff King April 5, 2019, 1:40 p.m. UTC | #5
On Fri, Apr 05, 2019 at 02:01:20PM +0200, SZEDER Gábor wrote:

> > The tests don't notice because there's nothing about opening those .idx
> > files that would cause us to give incorrect output. It's just a little
> > slower. The new test checks this case by corrupting the covered .idx,
> > and then making sure we don't complain about it.
> 
> When the test suite is run with GIT_TEST_MULTI_PACK_INDEX=1, the
> following tests fail with this patch:
> 
>   t1006-cat-file.sh                                (Wstat: 256 Tests: 111 Failed: 1)
>     Failed test:  87
>     Non-zero exit status: 1
>   t5320-delta-islands.sh                           (Wstat: 256 Tests: 15 Failed: 10)
>     Failed tests:  2-4, 6-10, 12-13
>     Non-zero exit status: 1
>   t5310-pack-bitmaps.sh                            (Wstat: 256 Tests: 46 Failed: 1)
>     Failed test:  44
>     Non-zero exit status: 1
>   t5570-git-daemon.sh                              (Wstat: 256 Tests: 21 Failed: 1)
>     Failed test:  10
>     Non-zero exit status: 1

Thanks. It looks like this is not a bug in my patch per se, but rather
that the feature of the midx code it is un-breaking is not actually a
good idea.

The problem is with generating the pack revindex. It needs the actual
index for each pack to be loaded, and now open_packed_git() will
sometimes quietly not do that.

Probably we _could_ be generating a big revindex off of the midx, but in
the meantime, the revindex code needs to make sure that the index is
loaded before looking at it.

This patch clears up those test failures (I'll re-roll the whole series
with it in the proper spot).

---
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 4695aaf6b4..3960ad94c8 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -308,7 +308,8 @@ static int load_pack_bitmap(struct bitmap_index *bitmap_git)
 
 	bitmap_git->bitmaps = kh_init_sha1();
 	bitmap_git->ext_index.positions = kh_init_sha1_pos();
-	load_pack_revindex(bitmap_git->pack);
+	if (load_pack_revindex(bitmap_git->pack))
+		goto failed;
 
 	if (!(bitmap_git->commits = read_bitmap_1(bitmap_git)) ||
 		!(bitmap_git->trees = read_bitmap_1(bitmap_git)) ||
diff --git a/pack-revindex.c b/pack-revindex.c
index 50891f77a2..d28a7e43d0 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "pack-revindex.h"
 #include "object-store.h"
+#include "packfile.h"
 
 /*
  * Pack index for existing packs give us easy access to the offsets into
@@ -158,10 +159,14 @@ static void create_pack_revindex(struct packed_git *p)
 	sort_revindex(p->revindex, num_ent, p->pack_size);
 }
 
-void load_pack_revindex(struct packed_git *p)
+int load_pack_revindex(struct packed_git *p)
 {
-	if (!p->revindex)
+	if (!p->revindex) {
+		if (open_pack_index(p))
+			return -1;
 		create_pack_revindex(p);
+	}
+	return 0;
 }
 
 int find_revindex_position(struct packed_git *p, off_t ofs)
@@ -188,7 +193,9 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
 {
 	int pos;
 
-	load_pack_revindex(p);
+	if (load_pack_revindex(p))
+		return NULL;
+
 	pos = find_revindex_position(p, ofs);
 
 	if (pos < 0)
diff --git a/pack-revindex.h b/pack-revindex.h
index e262f3efe8..848331d5d6 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -8,7 +8,7 @@ struct revindex_entry {
 	unsigned int nr;
 };
 
-void load_pack_revindex(struct packed_git *p);
+int load_pack_revindex(struct packed_git *p);
 int find_revindex_position(struct packed_git *p, off_t ofs);
 
 struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
diff --git a/packfile.c b/packfile.c
index e7ca135ed5..ca73a5b12a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2040,8 +2040,10 @@ int for_each_object_in_pack(struct packed_git *p,
 	uint32_t i;
 	int r = 0;
 
-	if (flags & FOR_EACH_OBJECT_PACK_ORDER)
-		load_pack_revindex(p);
+	if (flags & FOR_EACH_OBJECT_PACK_ORDER) {
+		if (load_pack_revindex(p))
+			return -1;
+	}
 
 	for (i = 0; i < p->num_objects; i++) {
 		uint32_t pos;
diff mbox series

Patch

diff --git a/packfile.c b/packfile.c
index 054269ae5d..e7ca135ed5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -472,6 +472,16 @@  static unsigned int get_max_fd_limit(void)
 #endif
 }
 
+static const char *pack_basename(struct packed_git *p)
+{
+	const char *ret = strrchr(p->pack_name, '/');
+	if (ret)
+		ret = ret + 1; /* skip past slash */
+	else
+		ret = p->pack_name; /* we only have a base */
+	return ret;
+}
+
 /*
  * Do not call this directly as this leaks p->pack_fd on error return;
  * call open_packed_git() instead.
@@ -486,15 +496,16 @@  static int open_packed_git_1(struct packed_git *p)
 	ssize_t read_result;
 	const unsigned hashsz = the_hash_algo->rawsz;
 
-	if (!p->index_data) {
+	if (!p->index_data && the_repository->objects->multi_pack_index) {
 		struct multi_pack_index *m;
-		const char *pack_name = strrchr(p->pack_name, '/');
+		char *idx_name = pack_name_to_idx(pack_basename(p));
 
 		for (m = the_repository->objects->multi_pack_index;
 		     m; m = m->next) {
-			if (midx_contains_pack(m, pack_name))
+			if (midx_contains_pack(m, idx_name))
 				break;
 		}
+		free(idx_name);
 
 		if (!m && open_pack_index(p))
 			return error("packfile %s index unavailable", p->pack_name);
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 8c4d2bd849..1ebf19ec3c 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -117,6 +117,20 @@  test_expect_success 'write midx with one v2 pack' '
 
 compare_results_with_midx "one v2 pack"
 
+test_expect_success 'corrupt idx not opened' '
+	idx=$(test-tool read-midx $objdir | grep "\.idx\$") &&
+	mv $objdir/pack/$idx backup-$idx &&
+	test_when_finished "mv backup-\$idx \$objdir/pack/\$idx" &&
+
+	# This is the minimum size for a sha-1 based .idx; this lets
+	# us pass perfunctory tests, but anything that actually opens and reads
+	# the idx file will complain.
+	test_copy_bytes 1064 <backup-$idx >$objdir/pack/$idx &&
+
+	git -c core.multiPackIndex=true rev-list --objects --all 2>err &&
+	test_must_be_empty err
+'
+
 test_expect_success 'add more objects' '
 	for i in $(test_seq 6 10)
 	do