[RFC,02/10] range-diff.c: don't use st_mult() for signed "int"

Message ID RFC-patch-02.10-bd7d014c531-20211209T191653Z-avarab@gmail.com
State Superseded
Series range-diff: fix segfault due to integer overflow

Commit Message

Ævar Arnfjörð Bjarmason Dec. 9, 2021, 7:19 p.m. UTC
As documented in 320d0b493a2 (add helpers for detecting size_t
overflow, 2016-02-19) the arguments to st_mult() and st_add() "must be
unsigned". This code added in d9c66f0b5bf (range-diff: first
rudimentary implementation, 2018-08-13) operates on signed int.

In subsequent commits further overflows resulting in segfaults will be
fixed in this code, but let's start by removing this supposed guard
that does nothing except give us a false sense of
security. E.g. providing an "n" of INT_MAX here will result in "1" on
my system, causing us to write into memory.

There are other such issues left in the codebase, e.g. the code in
"builtin/clean.c" changed in 50a6c8efa2b (use st_add and st_mult for
allocation size computation, 2016-02-22). But let's focus on
range-diff.c for now.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 range-diff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jeff King Dec. 10, 2021, 3:39 a.m. UTC | #1
On Thu, Dec 09, 2021 at 08:19:19PM +0100, Ævar Arnfjörð Bjarmason wrote:

> As documented in 320d0b493a2 (add helpers for detecting size_t
> overflow, 2016-02-19) the arguments to st_mult() and st_add() "must be
> unsigned".

This isn't correct. The comment that says "must be unsigned" is for
unsigned_mult_overflows(), which indeed would not work correctly for a
signed type. But st_mult() is a separate function (and not a macro)
because that means its arguments will always be promoted or converted
into a size_t. And so no matter what you pass it, it will always itself
pass a size_t into unsigned_mult_overflows().
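
For illustration, the mechanism looks roughly like this (a from-memory
sketch with a hypothetical name, not a verbatim copy of
git-compat-util.h):

    #include <stdint.h>
    #include <stdlib.h>

    /* the check itself is only correct for unsigned operands */
    #define unsigned_mult_overflows(a, b) \
            ((b) && (a) > SIZE_MAX / (b))

    static inline size_t st_mult_sketch(size_t a, size_t b)
    {
            /*
             * The size_t parameters mean an "int" argument is
             * converted right here, at the call boundary, before
             * the unsigned check ever sees it.
             */
            if (unsigned_mult_overflows(a, b))
                    exit(128); /* the real st_mult() die()s with a message */
            return a * b;
    }

That call-boundary conversion is exactly what a bare macro would not
give you.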

> In subsequent commits further overflows resulting in segfaults will be
> fixed in this code, but let's start by removing this supposed guard
> that does nothing except give us a false sense of
> security. E.g. providing an "n" of INT_MAX here will result in "1" on
> my system, causing us to write into memory.

This guard doesn't do nothing. Your patch here:

> @@ -312,7 +312,7 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
>  	int *cost, c, *a2b, *b2a;
>  	int i, j;
>  
> -	ALLOC_ARRAY(cost, st_mult(n, n));
> +	ALLOC_ARRAY(cost, n * n);
>  	ALLOC_ARRAY(a2b, n);
>  	ALLOC_ARRAY(b2a, n);

makes things strictly worse, because that "n * n" may overflow and cause
us to under-allocate the array. You can see it in isolation with some
extra code like this:

diff --git a/git.c b/git.c
index 5ff21be21f..63349e4b79 100644
--- a/git.c
+++ b/git.c
@@ -850,11 +850,23 @@ static int run_argv(int *argcp, const char ***argv)
 	return done_alias;
 }
 
+static void foo(void)
+{
+	int n = 2 << 16;
+
+	printf("n = %d\n", n);
+	printf("raw mult = %"PRIuMAX"\n", (uintmax_t)(n * n));
+	printf("st_mult = %"PRIuMAX"\n", (uintmax_t)st_mult(n, n));
+	exit(0);
+}
+
 int cmd_main(int argc, const char **argv)
 {
 	const char *cmd;
 	int done_help = 0;
 
+	foo();
+
 	cmd = argv[0];
 	if (!cmd)
 		cmd = "git-help";

With st_mult, we get the correct answer of 16GB. Without it, we end up
with 0! (n is 2^17, so "n * n" is 2^34, which truncates to 0 when
computed in a 32-bit int.)

Back to the original code, if you generate a test setup like this:

diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
index e30bc48a29..f552d3086e 100755
--- a/t/t3206-range-diff.sh
+++ b/t/t3206-range-diff.sh
@@ -772,4 +772,11 @@ test_expect_success '--left-only/--right-only' '
 	test_cmp expect actual
 '
 
+test_expect_success 'giant case' '
+	test_commit base &&
+	test_commit_bulk --ref=refs/heads/v1 --message="v1 commit %s" 32768 &&
+	test_commit_bulk --ref=refs/heads/v2 --message="v2 commit %s" 32768 &&
+	git range-diff base v1 v2
+'
+
 test_done

Then we'd allocate a 0-length array for "cost" and segfault as soon as
we try to put anything in it. You can confirm this by applying your
patch, running under gdb, and stopping at the xmalloc() call inside
get_correspondences(). With st_mult(), then we come up with the correct
value of 16GB (or complain about overflow on a 32-bit system).

So st_mult() is working as advertised here; its goal is just to make
sure we never under-allocate. You are right, though, that the code after
that in get_correspondences() has trouble because of the signedness. If
the code used an "unsigned int", it would still be _wrong_. But when it
overflowed, it would always come up with a value under 4GB. So it might
produce a wrong answer, but it couldn't possibly point outside the
bounds of the array and cause a security problem.
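
To make that concrete, here is a tiny standalone demo (hypothetical
program; the signed result is formally undefined behavior, and the
value shown is just the usual two's-complement wrap):

    #include <stdio.h>

    int main(void)
    {
            unsigned int u = 50000;
            int s = 50000;

            /* wrong count, but always a value in [0, 4G) */
            printf("unsigned: %u\n", u * u); /* 2500000000 */

            /* typically wraps to a negative array index */
            printf("signed:   %d\n", s * s); /* -1794967296 */

            return 0;
    }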

But because it's a signed int, we overflow to a negative value and try
to look at memory far before the start of the array. So I can see how it
is tempting to argue that this st_mult() isn't really helping since we
segfault anyway. But I'd disagree. By correctly computing the value of
16GB instead of "0", we know that the ALLOC_ARRAY() line is doing the
right thing. And it may choose not to continue if it can't allocate
16GB. That may happen because you don't have the RAM, but also because
of GIT_ALLOC_LIMIT.

So if you set GIT_ALLOC_LIMIT=4g, for example, you are immune to the bug
here. We'd refuse the allocation and die there, which protects
downstream code from trying to fill in the array with bogus arithmetic.
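
Concretely, with the test setup above and the st_mult() call left in
place, a run might look like this (hypothetical output; the die()
wording is quoted from memory):

    $ GIT_ALLOC_LIMIT=4g git range-diff base v1 v2
    fatal: attempting to allocate 17179869184 over limit 4294967296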

Dropping the st_mult() does nothing to fix the actual problem (which is
that this function should use a more appropriate type), but introduces
new failure modes.

-Peff
Ævar Arnfjörð Bjarmason Dec. 10, 2021, 10:22 a.m. UTC | #2
On Thu, Dec 09 2021, Jeff King wrote:

> On Thu, Dec 09, 2021 at 08:19:19PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> As documented in 320d0b493a2 (add helpers for detecting size_t
>> overflow, 2016-02-19) the arguments to st_mult() and st_add() "must be
>> unsigned".
>
> This isn't correct. The comment that says "must be unsigned" is for
> unsigned_mult_overflows(), which indeed would not work correctly for a
> signed type. But st_mult() is a separate function (and not a macro)
> because that means its arguments will always be promoted or converted
> into a size_t. And so no matter what you pass it, it will always itself
> pass a size_t into unsigned_mult_overflows().
>
>> In subsequent commits further overflows resulting in segfaults will be
>> fixed in this code, but let's start by removing this supposed guard
>> that does nothing except give us a false sense of
>> security. E.g. providing an "n" of INT_MAX here will result in "1" on
>> my system, causing us to write into memory.
>
> This guard doesn't do nothing. Your patch here:
>
>> @@ -312,7 +312,7 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
>>  	int *cost, c, *a2b, *b2a;
>>  	int i, j;
>>  
>> -	ALLOC_ARRAY(cost, st_mult(n, n));
>> +	ALLOC_ARRAY(cost, n * n);
>>  	ALLOC_ARRAY(a2b, n);
>>  	ALLOC_ARRAY(b2a, n);
>
> makes things strictly worse, because that "n * n" may overflow and cause
> us to under-allocate the array. You can see it in isolation with some
> extra code like this:
>
> diff --git a/git.c b/git.c
> index 5ff21be21f..63349e4b79 100644
> --- a/git.c
> +++ b/git.c
> @@ -850,11 +850,23 @@ static int run_argv(int *argcp, const char ***argv)
>  	return done_alias;
>  }
>  
> +static void foo(void)
> +{
> +	int n = 2 << 16;
> +
> +	printf("n = %d\n", n);
> +	printf("raw mult = %"PRIuMAX"\n", (uintmax_t)(n * n));
> +	printf("st_mult = %"PRIuMAX"\n", (uintmax_t)st_mult(n, n));
> +	exit(0);
> +}
> +
>  int cmd_main(int argc, const char **argv)
>  {
>  	const char *cmd;
>  	int done_help = 0;
>  
> +	foo();
> +
>  	cmd = argv[0];
>  	if (!cmd)
>  		cmd = "git-help";
>
> With st_mult, we get the correct answer of 16GB. Without it, we end up
> with 0! (n is 2^17, so "n * n" is 2^34, which truncates to 0 when
> computed in a 32-bit int.)
>
> Back to the original code, if you generate a test setup like this:
>
> diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
> index e30bc48a29..f552d3086e 100755
> --- a/t/t3206-range-diff.sh
> +++ b/t/t3206-range-diff.sh
> @@ -772,4 +772,11 @@ test_expect_success '--left-only/--right-only' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'giant case' '
> +	test_commit base &&
> +	test_commit_bulk --ref=refs/heads/v1 --message="v1 commit %s" 32768 &&
> +	test_commit_bulk --ref=refs/heads/v2 --message="v2 commit %s" 32768 &&
> +	git range-diff base v1 v2
> +'
> +
>  test_done
>
> Then we'd allocate a 0-length array for "cost" and segfault as soon as
> we try to put anything in it. You can confirm this by applying your
> patch, running under gdb, and stopping at the xmalloc() call inside
> get_correspondences(). With st_mult(), then we come up with the correct
> value of 16GB (or complain about overflow on a 32-bit system).
>
> So st_mult() is working as advertised here; its goal is just to make
> sure we never under-allocate. You are right, though, that the code after
> that in get_correspondences() has trouble because of the signedness. If
> the code used an "unsigned int", it would still be _wrong_. But when it
> overflowed, it would always come up with a value under 4GB. So it might
> produce a wrong answer, but it couldn't possibly point outside the
> bounds of the array and cause a security problem.
>
> But because it's a signed int, we overflow to a negative value and try
> to look at memory far before the start of the array. So I can see how it
> is tempting to argue that this st_mult() isn't really helping since we
> segfault anyway. But I'd disagree. By correctly computing the value of
> 16GB instead of "0", we know that the ALLOC_ARRAY() line is doing the
> right thing. And it may choose not to continue if it can't allocate
> 16GB. That may happen because you don't have the RAM, but also because
> of GIT_ALLOC_LIMIT.
>
> So if you set GIT_ALLOC_LIMIT=4g, for example, you are immune to the bug
> here. We'd refuse the allocation and die there, which protects
> downstream code from trying to fill in the array with bogus arithmetic.
>
> Dropping the st_mult() does nothing to fix the actual problem (which is
> that this function should use a more appropriate type), but introduces
> new failure modes.

Yes you're entirely right. I had some stupid blinders on while writing
this. FWIW I think I was experimenting with some local macros and
conflated testing the overflow of n*n in gdb with the cast version,
which you rightly point out here won't have the overflow issue
at all. Sorry.
Jeff King Dec. 10, 2021, 11:41 a.m. UTC | #3
On Fri, Dec 10, 2021 at 11:22:59AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > Dropping the st_mult() does nothing to fix the actual problem (which is
> > that this function should use a more appropriate type), but introduces
> > new failure modes.
> 
> Yes you're entirely right. I had some stupid blinders on while writing
> this. FWIW I think I was experimenting with some local macros and
> conflated testing the overflow of n*n in gdb with the cast version,
> which you rightly point out here won't have the overflow issue
> at all. Sorry.

I'm not sure if this is helpful or not, but this is the minimal fix I
came up with that runs the testcase I showed earlier. It's basically
just swapping out "int" for "ssize_t" for any variables we use to index
the arrays (though note a few are themselves held in arrays, and we have
to cross some function boundaries).

I won't be surprised if it doesn't hit all cases, or if it even hits a
few it doesn't need to (e.g., should "phase" be dragged along with "i"
and "j" in the first hunk?). I mostly did guess-and-check on the
test-case, fixing whatever segfaulted and then running again until it
worked. I didn't even really read the code very carefully.

I think you _did_ do more of that careful reading, and broke down the
refactorings into separate patches in your series. Which is good. So I
think what we'd want is to pick out those parts of your series that end
up switching the variable type. My goal in sharing this here is just to
show that the end result of the fix can (and IMHO should) be around this
same order of magnitude.

---
diff --git a/linear-assignment.c b/linear-assignment.c
index ecffc09be6..3efa30c50b 100644
--- a/linear-assignment.c
+++ b/linear-assignment.c
@@ -13,11 +13,11 @@
  * i is `cost[j + column_count * i].
  */
 void compute_assignment(int column_count, int row_count, int *cost,
-			int *column2row, int *row2column)
+			ssize_t *column2row, ssize_t *row2column)
 {
 	int *v, *d;
 	int *free_row, free_count = 0, saved_free_count, *pred, *col;
-	int i, j, phase;
+	ssize_t i, j, phase;
 
 	if (column_count < 2) {
 		memset(column2row, 0, sizeof(int) * column_count);
@@ -31,7 +31,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
 
 	/* column reduction */
 	for (j = column_count - 1; j >= 0; j--) {
-		int i1 = 0;
+		ssize_t i1 = 0;
 
 		for (i = 1; i < row_count; i++)
 			if (COST(j, i1) > COST(j, i))
@@ -51,7 +51,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
 	/* reduction transfer */
 	ALLOC_ARRAY(free_row, row_count);
 	for (i = 0; i < row_count; i++) {
-		int j1 = row2column[i];
+		ssize_t j1 = row2column[i];
 		if (j1 == -1)
 			free_row[free_count++] = i;
 		else if (j1 < -1)
@@ -74,13 +74,13 @@ void compute_assignment(int column_count, int row_count, int *cost,
 
 	/* augmenting row reduction */
 	for (phase = 0; phase < 2; phase++) {
-		int k = 0;
+		ssize_t k = 0;
 
 		saved_free_count = free_count;
 		free_count = 0;
 		while (k < saved_free_count) {
 			int u1, u2;
-			int j1 = 0, j2, i0;
+			ssize_t j1 = 0, j2, i0;
 
 			i = free_row[k++];
 			u1 = COST(j1, i) - v[j1];
@@ -130,7 +130,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
 	ALLOC_ARRAY(pred, column_count);
 	ALLOC_ARRAY(col, column_count);
 	for (free_count = 0; free_count < saved_free_count; free_count++) {
-		int i1 = free_row[free_count], low = 0, up = 0, last, k;
+		ssize_t i1 = free_row[free_count], low = 0, up = 0, last, k;
 		int min, c, u1;
 
 		for (j = 0; j < column_count; j++) {
@@ -192,7 +192,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
 		/* augmentation */
 		do {
 			if (j < 0)
-				BUG("negative j: %d", j);
+				BUG("negative j: %"PRIdMAX, (intmax_t)j);
 			i = pred[j];
 			column2row[j] = i;
 			SWAP(j, row2column[i]);
diff --git a/linear-assignment.h b/linear-assignment.h
index 1dfea76629..7005521d61 100644
--- a/linear-assignment.h
+++ b/linear-assignment.h
@@ -14,7 +14,7 @@
  * row_count).
  */
 void compute_assignment(int column_count, int row_count, int *cost,
-			int *column2row, int *row2column);
+			ssize_t *column2row, ssize_t *row2column);
 
 /* The maximal cost in the cost matrix (to prevent integer overflows). */
 #define COST_MAX (1<<16)
diff --git a/range-diff.c b/range-diff.c
index cac89a2f4f..f1e1e27bf9 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -308,9 +308,10 @@ static int diffsize(const char *a, const char *b)
 static void get_correspondences(struct string_list *a, struct string_list *b,
 				int creation_factor)
 {
-	int n = a->nr + b->nr;
-	int *cost, c, *a2b, *b2a;
-	int i, j;
+	size_t n = a->nr + b->nr;
+	int *cost, c;
+	ssize_t *a2b, *b2a;
+	size_t i, j;
 
 	ALLOC_ARRAY(cost, st_mult(n, n));
 	ALLOC_ARRAY(a2b, n);
Ævar Arnfjörð Bjarmason Dec. 10, 2021, 12:31 p.m. UTC | #4
On Fri, Dec 10 2021, Jeff King wrote:

> On Fri, Dec 10, 2021 at 11:22:59AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> > Dropping the st_mult() does nothing to fix the actual problem (which is
>> > that this function should use a more appropriate type), but introduces
>> > new failure modes.
>> 
>> Yes you're entirely right. I had some stupid blinders on while writing
>> this. FWIW I think I was experimenting with some local macros and
>> conflated testing the overflow of n*n in gdb with the cast version,
>> which you rightly point out here won't have the overflow issue
>> at all. Sorry.
>
> I'm not sure if this is helpful or not, but this is the minimal fix I
> came up with that runs the testcase I showed earlier. It's basically
> just swapping out "int" for "ssize_t" for any variables we use to index
> the arrays (though note a few are themselves held in arrays, and we have
> to cross some function boundaries).
>
> I won't be surprised if it doesn't hit all cases, or if it even hits a
> few it doesn't need to (e.g., should "phase" be dragged along with "i"
> and "j" in the first hunk?). I mostly did guess-and-check on the
> test-case, fixing whatever segfaulted and then running again until it
> worked. I didn't even really read the code very carefully.
>
> I think you _did_ do more of that careful reading, and broke down the
> refactorings into separate patches in your series. Which is good. So I
> think what we'd want is to pick out those parts of your series that end
> up switching the variable type. My goal in sharing this here is just to
> show that the end result of the fix can (and IMHO should) be around this
> same order of magnitude.
>
> [...]
>  void compute_assignment(int column_count, int row_count, int *cost,
> -			int *column2row, int *row2column);
> +			ssize_t *column2row, ssize_t *row2column);
>  
>  /* The maximal cost in the cost matrix (to prevent integer overflows). */
>  #define COST_MAX (1<<16)
> diff --git a/range-diff.c b/range-diff.c
> index cac89a2f4f..f1e1e27bf9 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -308,9 +308,10 @@ static int diffsize(const char *a, const char *b)
>  static void get_correspondences(struct string_list *a, struct string_list *b,
>  				int creation_factor)
>  {
> -	int n = a->nr + b->nr;
> -	int *cost, c, *a2b, *b2a;
> -	int i, j;
> +	size_t n = a->nr + b->nr;
> +	int *cost, c;
> +	ssize_t *a2b, *b2a;
> +	size_t i, j;
>  
>  	ALLOC_ARRAY(cost, st_mult(n, n));
>  	ALLOC_ARRAY(a2b, n);

I think I was just chasing butterflies by making this intmax_t at all. I
just submitted a v2, and explained that case in a bit more detail in
https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com

I *think* it fixes all the cases we plausibly run into, i.e. storing the
"cost" in an "int" was enough; we just needed a size_t as an offset. It
passes the regression test you noted[3].

The first thing I tried when hacking on this some months ago (I picked
these patches up again after running into the segfault again) was this
s/int/ssize_t/ change.

I don't think using ssize_t like that is portable, and that we'd need
something like intmax_t if we needed this in another context.

Firstly it's not standard C, it's just in POSIX; intmax_t is standard C
as of C99, and we have in-tree code that already depends on it (and
uintmax_t).

But more importantly it's not "as big as size_t, just signed" in
POSIX. size_t is "no greater than the width of type long"[1] and
LONG_MAX is at least 2^31-1 [2].

Whereas ssize_t is not a "signed size_t", but a type that stores
-1..SSIZE_MAX, and SSIZE_MAX has a minimum value of 2^15-1. I.e. I think
on that basis some implementations would make it the same as a "short
int" under the hood.

On my linux system it's just mapped to the longest available signed
integer, but that doesn't seem to be a portable assumption.
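
A quick way to probe a given platform is a throwaway program along
these lines (assuming C99 plus the POSIX headers; SSIZE_MAX lives in
<limits.h>):

    #include <stdio.h>
    #include <stdint.h>
    #include <limits.h>     /* SSIZE_MAX (POSIX) */
    #include <sys/types.h>  /* ssize_t */

    int main(void)
    {
            printf("sizeof(size_t)  = %zu\n", sizeof(size_t));
            printf("sizeof(ssize_t) = %zu\n", sizeof(ssize_t));
            printf("SIZE_MAX        = %ju\n", (uintmax_t)SIZE_MAX);
    #ifdef SSIZE_MAX
            printf("SSIZE_MAX       = %jd\n", (intmax_t)SSIZE_MAX);
    #endif
            return 0;
    }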

1. https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_types.h.html
2. https://pubs.opengroup.org/onlinepubs/009696899/basedefs/limits.h.html

3. B.t.w. a thing I ended up ejecting out of this was that I made a
   "test_commit_bulkier" which is N times faster than "test_commit_bulk",
   it just makes the same commit N times with the printf-repeating feature
   and feeds it to fast-import, but the test took so long in any case that
   I couldn't find a plausible way to get it in-tree.
Johannes Schindelin Dec. 10, 2021, 2:27 p.m. UTC | #5
Hi Peff,

On Fri, 10 Dec 2021, Jeff King wrote:

> On Fri, Dec 10, 2021 at 11:22:59AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> > > Dropping the st_mult() does nothing to fix the actual problem (which is
> > > that this function should use a more appropriate type), but introduces
> > > new failure modes.
> >
> > Yes you're entirely right. I had some stupid blinders on while writing
> > this. FWIW I think I was experimenting with some local macros and
> > conflated testing the overflow of n*n in gdb with the cast version,
> > which you rightly point out here won't have the overflow issue
> > at all. Sorry.
>
> I'm not sure if this is helpful or not, but this is the minimal fix I
> came up with that runs the testcase I showed earlier. It's basically
> just swapping out "int" for "ssize_t" for any variables we use to index
> the arrays (though note a few are themselves held in arrays, and we have
> to cross some function boundaries).
>
> I won't be surprised if it doesn't hit all cases, or if it even hits a
> few it doesn't need to (e.g., should "phase" be dragged along with "i"
> and "j" in the first hunk?). I mostly did guess-and-check on the
> test-case, fixing whatever segfaulted and then running again until it
> worked. I didn't even really read the code very carefully.
>
> I think you _did_ do more of that careful reading, and broke down the
> refactorings into separate patches in your series. Which is good. So I
> think what we'd want is to pick out those parts of your series that end
> up switching the variable type. My goal in sharing this here is just to
> show that the end result of the fix can (and IMHO should) be around this
> same order of magnitude.

I am in favor of this patch. Will you have time to submit this with a
commit message?

Thank you,
Dscho

>
> ---
> diff --git a/linear-assignment.c b/linear-assignment.c
> index ecffc09be6..3efa30c50b 100644
> --- a/linear-assignment.c
> +++ b/linear-assignment.c
> @@ -13,11 +13,11 @@
>   * i is `cost[j + column_count * i].
>   */
>  void compute_assignment(int column_count, int row_count, int *cost,
> -			int *column2row, int *row2column)
> +			ssize_t *column2row, ssize_t *row2column)
>  {
>  	int *v, *d;
>  	int *free_row, free_count = 0, saved_free_count, *pred, *col;
> -	int i, j, phase;
> +	ssize_t i, j, phase;
>
>  	if (column_count < 2) {
>  		memset(column2row, 0, sizeof(int) * column_count);
> @@ -31,7 +31,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
>
>  	/* column reduction */
>  	for (j = column_count - 1; j >= 0; j--) {
> -		int i1 = 0;
> +		ssize_t i1 = 0;
>
>  		for (i = 1; i < row_count; i++)
>  			if (COST(j, i1) > COST(j, i))
> @@ -51,7 +51,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
>  	/* reduction transfer */
>  	ALLOC_ARRAY(free_row, row_count);
>  	for (i = 0; i < row_count; i++) {
> -		int j1 = row2column[i];
> +		ssize_t j1 = row2column[i];
>  		if (j1 == -1)
>  			free_row[free_count++] = i;
>  		else if (j1 < -1)
> @@ -74,13 +74,13 @@ void compute_assignment(int column_count, int row_count, int *cost,
>
>  	/* augmenting row reduction */
>  	for (phase = 0; phase < 2; phase++) {
> -		int k = 0;
> +		ssize_t k = 0;
>
>  		saved_free_count = free_count;
>  		free_count = 0;
>  		while (k < saved_free_count) {
>  			int u1, u2;
> -			int j1 = 0, j2, i0;
> +			ssize_t j1 = 0, j2, i0;
>
>  			i = free_row[k++];
>  			u1 = COST(j1, i) - v[j1];
> @@ -130,7 +130,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
>  	ALLOC_ARRAY(pred, column_count);
>  	ALLOC_ARRAY(col, column_count);
>  	for (free_count = 0; free_count < saved_free_count; free_count++) {
> -		int i1 = free_row[free_count], low = 0, up = 0, last, k;
> +		ssize_t i1 = free_row[free_count], low = 0, up = 0, last, k;
>  		int min, c, u1;
>
>  		for (j = 0; j < column_count; j++) {
> @@ -192,7 +192,7 @@ void compute_assignment(int column_count, int row_count, int *cost,
>  		/* augmentation */
>  		do {
>  			if (j < 0)
> -				BUG("negative j: %d", j);
> +				BUG("negative j: %"PRIdMAX, (intmax_t)j);
>  			i = pred[j];
>  			column2row[j] = i;
>  			SWAP(j, row2column[i]);
> diff --git a/linear-assignment.h b/linear-assignment.h
> index 1dfea76629..7005521d61 100644
> --- a/linear-assignment.h
> +++ b/linear-assignment.h
> @@ -14,7 +14,7 @@
>   * row_count).
>   */
>  void compute_assignment(int column_count, int row_count, int *cost,
> -			int *column2row, int *row2column);
> +			ssize_t *column2row, ssize_t *row2column);
>
>  /* The maximal cost in the cost matrix (to prevent integer overflows). */
>  #define COST_MAX (1<<16)
> diff --git a/range-diff.c b/range-diff.c
> index cac89a2f4f..f1e1e27bf9 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -308,9 +308,10 @@ static int diffsize(const char *a, const char *b)
>  static void get_correspondences(struct string_list *a, struct string_list *b,
>  				int creation_factor)
>  {
> -	int n = a->nr + b->nr;
> -	int *cost, c, *a2b, *b2a;
> -	int i, j;
> +	size_t n = a->nr + b->nr;
> +	int *cost, c;
> +	ssize_t *a2b, *b2a;
> +	size_t i, j;
>
>  	ALLOC_ARRAY(cost, st_mult(n, n));
>  	ALLOC_ARRAY(a2b, n);
>
Ævar Arnfjörð Bjarmason Dec. 10, 2021, 2:58 p.m. UTC | #6
On Fri, Dec 10 2021, Johannes Schindelin wrote:

> Hi Peff,
>
> On Fri, 10 Dec 2021, Jeff King wrote:
>
>> On Fri, Dec 10, 2021 at 11:22:59AM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> > > Dropping the st_mult() does nothing to fix the actual problem (which is
>> > > that this function should use a more appropriate type), but introduces
>> > > new failure modes.
>> >
>> > Yes you're entirely right. I had some stupid blinders on while writing
>> > this. FWIW I think I was experimenting with some local macros and
>> > conflated testing the overflow of n*n in gdb with the cast version,
>> > which you rightly point out here won't have the overflow issue
>> > at all. Sorry.
>>
>> I'm not sure if this is helpful or not, but this is the minimal fix I
>> came up with that runs the testcase I showed earlier. It's basically
>> just swapping out "int" for "ssize_t" for any variables we use to index
>> the arrays (though note a few are themselves held in arrays, and we have
>> to cross some function boundaries).
>>
>> I won't be surprised if it doesn't hit all cases, or if it even hits a
>> few it doesn't need to (e.g., should "phase" be dragged along with "i"
>> and "j" in the first hunk?). I mostly did guess-and-check on the
>> test-case, fixing whatever segfaulted and then running again until it
>> worked. I didn't even really read the code very carefully.
>>
>> I think you _did_ do more of that careful reading, and broke down the
>> refactorings into separate patches in your series. Which is good. So I
>> think what we'd want is to pick out those parts of your series that end
>> up switching the variable type. My goal in sharing this here is just to
>> show that the end result of the fix can (and IMHO should) be around this
>> same order of magnitude.
>
> I am in favor of this patch. Will you have time to submit this with a
> commit message?

I'd also be happy to pick it up as a massaging of my s/int/intmax_t/
change. I think, per [1], that intmax_t is more portable here than ssize_t,
but I'm very likely to be missing something. Corrections most welcome.

Per [1] I ejected that out of my v2 because I think the "cost" being
larger than 1<<16 might not be all that useful. I.e. the limiting that's
in get_correspondences().

But I'll happily admit ignorance on how the actual guts of range-diff
work; I just wanted to fix a segfault I kept running into locally at
some point, and figured I'd submit this RFC.

Doesn't an enlargement of the "int" from an assumed 32-bit unsigned to
say a 64-bit unsigned require the 16-bit unsigned COST_MAX to be
correspondingly bumped to 32-bit unsigned? I.e. we'd define it as 1/2 of
whatever "intmax_t" (or "ssize_t" or "long long int" or whatever) is
defined as?

That may be a question under the umbrella of "Ævar doesn't actually
understand range-diff", but think I recall playing with bumping one and
not the other (or bumping COST_MAX too close to the size of the
container type) and running into errors...

1. https://lore.kernel.org/git/211210.86czm4d3zo.gmgdl@evledraar.gmail.com/
Phillip Wood Dec. 10, 2021, 7:24 p.m. UTC | #7
Hi Ævar

On 10/12/2021 12:31, Ævar Arnfjörð Bjarmason wrote:
> 
> [...] 
> I think I was just chasing butterflies by making this intmax_t at all. I
> just submitted a v2, and explained that case in a bit more detail in
> https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@gmail.com
> 
> I *think* it fixes all the cases we plausibly run into, i.e. storing the
> "cost" in an "int" was enough; we just needed a size_t as an offset. It
> passes the regression test you noted[3].
> 
> The first thing I tried when hacking on this some months ago (I picked
> these patches up again after running into the segfault again) was this
> s/int/ssize_t/ change.
> 
> I don't think using ssize_t like that is portable, and that we'd need
> something like intmax_t if we needed this in another context.
 >
> Firstly it's not standard C, it's just in POSIX; intmax_t is standard C
> as of C99, and we have in-tree code that already depends on it (and
> uintmax_t).

I'm not objecting to using intmax_t particularly for code that needs to 
store negative values other than -1, but we're already using ssize_t in a
lot of places, so I don't think we need to worry about it not being
supported.

> But more importantly it's not "as big as size_t, just signed" in
> POSIX. size_t is "no greater than the width of type long"[1]

The full text is

     The implementation shall support one or more programming
     environments in which the widths of blksize_t, pid_t, size_t,
     ssize_t, and suseconds_t are no greater than the width of type
     long. The names of these programming environments can be obtained
     using the confstr() function or the getconf utility.

so "no greater than the width of type long" applies to ssize_t as well 
as size_t.

> and
> LONG_MAX is at least 2^31-1 [2].
> 
> Whereas ssize_t is not a "signed size_t", but a type that stores
> -1..SSIZE_MAX, and SSIZE_MAX has a minimum value of 2^15-1. I.e. I think
> on that basis some implementations would make it the same as a "short
> int" under the hood.

The minimum value of SIZE_MAX is 2^16-1[1]; I'm not sure you can read
much into the width of ssize_t from the minimum value of SSIZE_MAX. If
you think about where ssize_t is used as the return value of read() and
write(), which take a size_t as the number of bytes to read/write, then
it would be very odd if ssize_t were a different width from size_t.
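
For reference, the POSIX prototypes being referred to, which pair a
size_t request with an ssize_t result (POSIX sidesteps the asymmetry
by making the result implementation-defined when nbyte exceeds
SSIZE_MAX):

    #include <unistd.h>

    ssize_t read(int fildes, void *buf, size_t nbyte);
    ssize_t write(int fildes, const void *buf, size_t nbyte);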

Best Wishes

Phillip

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/stdint.h.html

> On my linux system it's just mapped to the longest available signed
> integer, but that doesn't seem to be a portable assumption.
> 
> 1. https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_types.h.html
> 2. https://pubs.opengroup.org/onlinepubs/009696899/basedefs/limits.h.html
> 
> 3. B.t.w. a thing I ended up ejecting out of this was that I made a
>     "test_commit_bulkier" which is N times faster than "test_commit_bulk",
>     it just makes the same commit N times with the printf-repeating feature
>     and feeds it to fast-import, but the test took so long in any case that
>     I couldn't find a plausible way to get it in-tree.
>
Johannes Schindelin Dec. 11, 2021, 2:01 p.m. UTC | #8
Hi Ævar,

On Fri, 10 Dec 2021, Ævar Arnfjörð Bjarmason wrote:

> But I'll happily admit ignorance on how the actual guts of range-diff
> work; I just wanted to fix a segfault I kept running into locally at
> some point, and figured I'd submit this RFC.

I understand that it is super tempting to avoid spending the time to
understand how range-diff works and simply make changes until the
segmentation fault is gone, and then shoot off several iterations of the
patch series in the hopes that it gets merged at some point, and that
maybe reviewers who do spend the time to become familiar with the logic
help avoid introducing new bugs.

However, as a reviewer I am totally unsympathetic to this approach. I do
not want to review patches that even go so far as renaming functions when
they claim to "just fix a segfault" and the author even admits that
they're unfamiliar with what the code is supposed to do, without any
indication that they're inclined to invest the effort to change that.

If all you want to do is to fix the segmentation fault, and want to skip
the due diligence of studying the business logic, then just fix that
segmentation fault (I strongly suspect that using `COST()` after modifying
it to use `st_*()` would accomplish that). No need to inflate that to 5
patches. Unless you're thinking of the commit-per-author count as some
sort of scoreboard where you want to win. In which case I am even less
interested in reviewing the patches.

Ciao,
Johannes
Ævar Arnfjörð Bjarmason Dec. 12, 2021, 5:44 p.m. UTC | #9
On Sat, Dec 11 2021, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Fri, 10 Dec 2021, Ævar Arnfjörð Bjarmason wrote:
>
>> But I'll happily admit ignorance on how the actual guts of range-diff
>> work; I just wanted to fix a segfault I kept running into locally at
>> some point, and figured I'd submit this RFC.
>
> I understand that it is super tempting to avoid spending the time to
> understand how range-diff works and simply make changes until the
> segmentation fault is gone, and then shoot off several iterations of the
> patch series in the hopes that it gets merged at some point, and that
> maybe reviewers who do spend the time to become familiar with the logic
> help avoid introducing new bugs.
>
> However, as a reviewer I am totally unsympathetic to this approach. I do
> not want to review patches that even go so far as renaming functions when
> they claim to "just fix a segfault" and the author even admits that
> they're unfamiliar with what the code is supposed to do, without any
> indication that they're inclined to invest the effort to change that.

What you're eliding here is the context where I say that I must not be
getting something because you're apparently endorsing the WIP
s/int/ssize_t/ patch Jeff King inlined upthread without a
corresponding change to COST_MAX.

Don't those two go hand-in-hand, and changing one without the other
would lead to a subtle bug?

> If all you want to do is to fix the segmentation fault, and want to skip
> the due diligence of studying the business logic, then just fix that
> segmentation fault (I strongly suspect that using `COST()` after modifying
> it to use `st_*()` would accomplish that).

Well, this is an RFC series for a bug that I encountered & which seems
to be fixed by these changes, but in an area where I'll happily admit
that I'm not confident enough to say that this is *the* right fix, and I
think both the "RFC" label and both cover letters make that clear.

> No need to inflate that to 5
> patches. Unless you're thinking of the commit-per-author count as some
> sort of scoreboard where you want to win. In which case I am even less
> interested in reviewing the patches.

Can you mention specific things you'd like to have squashed? I do think
this split-up makes things easier to review.

E.g. if we're using the COST() macro in range-diff.c then splitting 4/5
from 5/5 means you don't need to spend as much time mentally splitting
the meaningful changes from a variable rename (which is required for
using that macro).

I agree that 1-3/5 aren't strictly necessary. I did try to do this
without those, but found e.g. reasoning about changing the
one-giant-function in linear-assignment.c harder when it came to the
segfault fix, and likewise the mechanical change from "int" to "size_t"
is (I think) easier to review & reason about.
Jeff King Dec. 14, 2021, 2:34 p.m. UTC | #10
On Fri, Dec 10, 2021 at 01:31:10PM +0100, Ævar Arnfjörð Bjarmason wrote:

> I don't think using ssize_t like that is portable, and that we'd need
> something like intmax_t if we needed this in another context.
> 
> Firstly it's not standard C, it's just in POSIX; intmax_t is standard C
> as of C99, and we have in-tree code that already depends on it (and
> uintmax_t).
> 
> But more importantly it's not "as big as size_t, just signed" in
> POSIX. size_t is "no greater than the width of type long"[1] and
> LONG_MAX is at least 2^31-1 [2].

Thanks, I didn't know that about ssize_t. I do wonder how often it is
_not_ the case that it is of the same magnitude as size_t. Certainly I
can see how write() could decide to just work in SSIZE_MAX chunks, since
the caller has to be prepared to loop anyway. But it seems like the
obvious implementation is for it to be a signed size_t; I'd be curious
to hear of any platforms that diverge from this (i.e., is this a real
portability concern, or like NULL pointers that aren't all-zeroes, one
that we don't care about in practice).

I do suspect we've already made that assumption elsewhere, though it's
hard to easily see. Grepping for ssize_t turns up lots of reasonable and
legitimate uses. Though some like the one in strbuf_realpath() are
questionable (it's assigning from an int!).

> 3. B.t.w. a thing I ended up ejecting out of this was that I made a
>    "test_commit_bulkier" which is N times faster than "test_commit_bulk",
>    it just makes the same commit N times with the printf-repeating feature
>    and feeds it to fast-import, but the test took so long in any case that
>    I couldn't find a plausible way to get it in-tree.

Yes, I noticed it was rather slow. The main culprit is Git writing out
new blobs and trees for each commit, which is what I assume your
"bulkier" version skipped (the existing "bulk" one is careful not to
use any sub-processes).  You can instruct test_commit_bulk to use
identical content in each commit, which saves a lot of time.

It's also highly non-linear in the number of commits when the tree
changes. That suggests that fast-import's tree-handling could be
improved. Here are the results of the hacky perf script below, showing
both the non-linearity in the "full" case and how much faster the
"quick" (commits-only) case is:

  Test                 this tree        
  --------------------------------------
  1234.2: full 1000    0.35(0.27+0.08)  
  1234.3: full 2000    0.85(0.81+0.04)  
  1234.4: full 4000    3.21(3.09+0.11)  
  1234.5: full 8000    12.13(11.85+0.27)
  1234.6: quick 1000   0.14(0.12+0.02)  
  1234.7: quick 2000   0.20(0.18+0.03)  
  1234.8: quick 4000   0.31(0.28+0.04)  
  1234.9: quick 8000   0.58(0.55+0.03)  

-- >8 --
#!/bin/sh

test_description='foo'
. ./perf-lib.sh

test_expect_success 'empty repo' 'git init'

test_perf 'full 1000' 'test_commit_bulk --id=full 1000'
test_perf 'full 2000' 'test_commit_bulk --id=full 2000'
test_perf 'full 4000' 'test_commit_bulk --id=full 4000'
test_perf 'full 8000' 'test_commit_bulk --id=full 8000'

test_perf 'quick 1000' 'test_commit_bulk --id=quick --filename=foo --contents=bar 1000'
test_perf 'quick 2000' 'test_commit_bulk --id=quick --filename=foo --contents=bar 2000'
test_perf 'quick 4000' 'test_commit_bulk --id=quick --filename=foo --contents=bar 4000'
test_perf 'quick 8000' 'test_commit_bulk --id=quick --filename=foo --contents=bar 8000'

test_done
-- >8 --
Jeff King Dec. 14, 2021, 2:42 p.m. UTC | #11
On Fri, Dec 10, 2021 at 03:27:24PM +0100, Johannes Schindelin wrote:

> > I'm not sure if this is helpful or not, but this is the minimal fix I
> > came up with that runs the testcase I showed earlier. It's basically
> > just swapping out "int" for "ssize_t" for any variables we use to index
> > the arrays (though note a few are themselves held in arrays, and we have
> > to cross some function boundaries).
> >
> > I won't be surprised if it doesn't hit all cases, or if it even hits a
> > few it doesn't need to (e.g., should "phase" be dragged along with "i"
> > and "j" in the first hunk?). I mostly did guess-and-check on the
> > test-case, fixing whatever segfaulted and then running again until it
> > worked. I didn't even really read the code very carefully.
> >
> > I think you _did_ do more of that careful reading, and broke down the
> > refactorings into separate patches in your series. Which is good. So I
> > think what we'd want is to pick out those parts of your series that end
> > up switching the variable type. My goal in sharing this here is just to
> > show that the end result of the fix can (and IMHO should) be around this
> > same order of magnitude.
> 
> I am in favor of this patch. Will you have time to submit this with a
> commit message?

I'm not at all sure that it's sufficient. It avoids overflowing the
cost array accesses, and was tested on a square input of 2^15. But:

  - some of the other ints may need changing, too (e.g., column_count).
    Probably 2^31 commits is out of reach in practice (and probably
    other parts of Git run into problems there anyway). But some of
    those arguments may just be (a->nr * b->nr), whereas I was testing
    overflow at (a->nr + b->nr)^2.

  - I threw around ssize_t willy-nilly. Some of those could be size_t,
    and I think Ævar's patches go in the direction of splitting the two,
    which is good.

  - some light refactoring may be helpful to split those cases ahead of
    time.

So I was hoping Ævar would take this approach and run with it. I just
reviewed his follow-up series, though, and it looks like it is still
putting bounds-checks into the COST macro, which I think is not
sufficient.

-Peff

Patch

diff --git a/range-diff.c b/range-diff.c
index cac89a2f4f2..170e8623313 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -312,7 +312,7 @@  static void get_correspondences(struct string_list *a, struct string_list *b,
 	int *cost, c, *a2b, *b2a;
 	int i, j;
 
-	ALLOC_ARRAY(cost, st_mult(n, n));
+	ALLOC_ARRAY(cost, n * n);
 	ALLOC_ARRAY(a2b, n);
 	ALLOC_ARRAY(b2a, n);