diff mbox series

[v2,15/17] midx: use 64-bit multiplication for chunk sizes

Message ID 83d292532a0fa3f3a0ad343421be4a99a03471d0.1611759716.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Refactor chunk-format into an API | expand

Commit Message

Derrick Stolee Jan. 27, 2021, 3:01 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

When calculating the sizes of certain chunks, we should use 64-bit
multiplication always. This allows us to properly predict the chunk
sizes without risk of overflow.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Junio C Hamano Feb. 5, 2021, midnight UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When calculating the sizes of certain chunks, we should use 64-bit
> multiplication always. This allows us to properly predict the chunk
> sizes without risk of overflow.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

This one I find somewhat questionable for multiple reasons.

 * the fourth parameter of add_chunk() is of size_t, not uint64_t;
   shouldn't the multiplication be done in type size_t instead?

 * these mutiplications were introduced in "midx: use chunk-format
   API in write_midx_internal()"; that step should use the
   arithmetic with cast (if necessary) from the start, no?

 * There is "ctx.entries_nr * MIDX_CHUNKID_OFFSET_WIDTH" passed to
   add_chunk(), in the post-context of the first hunk.  Shouldn't
   that be covered as well?  I didn't grep for all uses of
   add_chunk(), but I wouldn't be surprised if this patch missed
   some of the calls that need the same treatment.

> diff --git a/midx.c b/midx.c
> index e94dcd34b7f..a365dac6bbc 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -913,7 +913,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
>  		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
>  	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
> -		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
> +		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
>  	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
>  		  write_midx_object_offsets,
>  		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
> @@ -921,7 +921,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	if (ctx.large_offsets_needed)
>  		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
>  			write_midx_large_offsets,
> -			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
> +			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
>  
>  	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
>  	write_chunkfile(cf, &ctx);
Chris Torek Feb. 5, 2021, 10:59 a.m. UTC | #2
On Thu, Feb 4, 2021 at 4:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>    shouldn't the multiplication be done in type size_t instead?

There are (still) systems with 32-bit size_t (but 64-bit
off_t / file sizes), so ... probably not.  Is size_t ever more than
64 bits these days?

Chris
Derrick Stolee Feb. 5, 2021, 12:30 p.m. UTC | #3
On 2/4/2021 7:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> When calculating the sizes of certain chunks, we should use 64-bit
>> multiplication always. This allows us to properly predict the chunk
>> sizes without risk of overflow.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  midx.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> This one I find somewhat questionable for multiple reasons.
> 
>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>    shouldn't the multiplication be done in type size_t instead?

This is probably appropriate because we will truncate to size_t if
it is smaller than uint64_t.

>  * these mutiplications were introduced in "midx: use chunk-format
>    API in write_midx_internal()"; that step should use the
>    arithmetic with cast (if necessary) from the start, no?

I wanted to isolate these changes specifically so we could be
careful about the multiplications and not be distracted by them
when converting to the chunk-format API. The multiplications were
"moved" by that patch, not "introduced".

>  * There is "ctx.entries_nr * MIDX_CHUNKID_OFFSET_WIDTH" passed to
>    add_chunk(), in the post-context of the first hunk.  Shouldn't
>    that be covered as well?  I didn't grep for all uses of
>    add_chunk(), but I wouldn't be surprised if this patch missed
>    some of the calls that need the same treatment.

And here is a great example of why it was good to call out these
multiplications in their own patch.

I did a full inspection of all multiplications in midx.c and
found a few more instances of possible overflow. Two are on the
read side, but they require the object lookup chunk to have size
4gb or larger. This is not _that_ far off from possibility! My
multi-pack-index for the Windows repository is currently ~1.6 GB
(in total, including the other chunks).

Thanks,
-Stolee
Junio C Hamano Feb. 5, 2021, 7:42 p.m. UTC | #4
Derrick Stolee <stolee@gmail.com> writes:

> On 2/4/2021 7:00 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> When calculating the sizes of certain chunks, we should use 64-bit
>>> multiplication always. This allows us to properly predict the chunk
>>> sizes without risk of overflow.
>>>
>>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>>> ---
>>>  midx.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> This one I find somewhat questionable for multiple reasons.
>> 
>>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>>    shouldn't the multiplication be done in type size_t instead?
>
> This is probably appropriate because we will truncate to size_t if
> it is smaller than uint64_t.

In other words, if size_t turns out to be too small, doing
multiplication in uint64_t would not help at all and add_chunk() API
needs its parameter types updated [*].

    side note: I really wish that the language and the compiler
    helped us so that we didn't have to do this---after all, our
    function prototype says the result will be passed as a certain
    type, so it would be nice if the arithmetic to compute that
    result were automatically carried out in a way not to cause
    truncation.

>>  * these mutiplications were introduced in "midx: use chunk-format
>>    API in write_midx_internal()"; that step should use the
>>    arithmetic with cast (if necessary) from the start, no?
>
> I wanted to isolate these changes specifically so we could be
> careful about the multiplications and not be distracted by them
> when converting to the chunk-format API. The multiplications were
> "moved" by that patch, not "introduced".

Hmph, I somehow had an impression that they did not have truncation
issue in the original context, but perhaps I was wrong.  OK.

> I did a full inspection of all multiplications in midx.c and
> found a few more instances of possible overflow. Two are on the
> read side, but they require the object lookup chunk to have size
> 4gb or larger. This is not _that_ far off from possibility! My
> multi-pack-index for the Windows repository is currently ~1.6 GB
> (in total, including the other chunks).

Thanks.
Junio C Hamano Feb. 5, 2021, 8:41 p.m. UTC | #5
Chris Torek <chris.torek@gmail.com> writes:

> On Thu, Feb 4, 2021 at 4:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>>    shouldn't the multiplication be done in type size_t instead?
>
> There are (still) systems with 32-bit size_t (but 64-bit
> off_t / file sizes), so ... probably not.  Is size_t ever more than
> 64 bits these days?

Sorry, you lost me.  I do not see how it would help to perform the
multiplication in uint64_t, when you suspect that size_t is too
small, if the final destination of the result of the multiplication
is a function argument of type size_t?
Chris Torek Feb. 6, 2021, 8:35 p.m. UTC | #6
On Fri, Feb 5, 2021 at 12:41 PM Junio C Hamano <gitster@pobox.com> wrote:
> Chris Torek <chris.torek@gmail.com> writes:
> > There are (still) systems with 32-bit size_t (but 64-bit
> > off_t / file sizes), so ... probably not.  Is size_t ever more than
> > 64 bits these days?
>
> Sorry, you lost me.  I do not see how it would help to perform the
> multiplication in uint64_t, when you suspect that size_t is too
> small, if the final destination of the result of the multiplication
> is a function argument of type size_t?

No, you and Derrick Stolee are right, I wasn't looking out far enough
here (to the actual function).

(I was wondering though if there are systems where the valid range
for size_t could exceed that for off_t.  Are there still systems
using 32-bit off_t?  Sometimes I think there are too many abstracted
types running around here -- how do we know which sizes are big
enough?  There is always uintmax_t, though, and for unsigned
types, ((T)-1) gets you the maximum possible value.)

Chris
SZEDER Gábor Feb. 7, 2021, 7:50 p.m. UTC | #7
On Thu, Feb 04, 2021 at 04:00:19PM -0800, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > When calculating the sizes of certain chunks, we should use 64-bit
> > multiplication always. This allows us to properly predict the chunk
> > sizes without risk of overflow.
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > ---
> >  midx.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> This one I find somewhat questionable for multiple reasons.
> 
>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>    shouldn't the multiplication be done in type size_t instead?
> 
>  * these mutiplications were introduced in "midx: use chunk-format
>    API in write_midx_internal()";

No, that patch also removes lines like: 

-       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;

-               chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-                                          ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;

So those potentially problematic multiplications were already there
before this series, and in fact trace all the way back to the initial
midx patch series (commits 0d5b3a5ef7 (midx: write object ids in a 
chunk, 2018-07-12) and 662148c435 (midx: write object offsets,
2018-07-12)).

>    that step should use the
>    arithmetic with cast (if necessary) from the start, no?

As it fixes a long-standing issue, it should rather be a bugfix patch
at the beginning of the series.
Junio C Hamano Feb. 8, 2021, 5:41 a.m. UTC | #8
SZEDER Gábor <szeder.dev@gmail.com> writes:

> No, that patch also removes lines like: 
>
> -       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
>
> -               chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
> -                                          ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;

OK.  In other words, the above was replaced in the same patch with

    add_chunk(...., U32 * U32);

where the called function expects the result of the multiplication
as size_t in its function prototype.  It is a bit sad that U32*U32
to compute the argument that is to be passed as U64 must be casted
as (uint64_t)U32*U32 by the caller X-<.

The original that the above replaced, shown in your quote, is:

    U64 = U64 + U32 * U32;

I also wish that the fact that it is added to U64 is sufficient not
to require the RHS to be written as U64 + (uint64_t) U32 * U32 (in
other words, the original that was removed was OK without cast).

Sad.
diff mbox series

Patch

diff --git a/midx.c b/midx.c
index e94dcd34b7f..a365dac6bbc 100644
--- a/midx.c
+++ b/midx.c
@@ -913,7 +913,7 @@  static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
 		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
 	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
-		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
+		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
 	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
 		  write_midx_object_offsets,
 		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
@@ -921,7 +921,7 @@  static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.large_offsets_needed)
 		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
 			write_midx_large_offsets,
-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
+			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
 	write_chunkfile(cf, &ctx);