[16/20] builtin/gc.c: guess the size of the revindex

Message ID 7c17db7a7df8b524f13969efd1cb5e6e95de5a2d.1610129796.git.me@ttaylorr.com
State Superseded
Series pack-revindex: prepare for on-disk reverse index

Commit Message

Taylor Blau Jan. 8, 2021, 6:17 p.m. UTC

'estimate_repack_memory()' takes into account the amount of memory
required to load the reverse index in memory by multiplying the assumed
number of objects by the size of the 'revindex_entry' struct.

Prepare for hiding the definition of 'struct revindex_entry' by removing
a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
guess that one off_t and one uint32_t are required per object. Strictly
speaking, this is a worse guess than asking for 'sizeof(struct
revindex_entry)' directly, since the true size of this struct is 16
bytes with padding on the end of the struct in order to align the offset
field.

But, this is an approximation anyway, and it does remove a use of the
'struct revindex_entry' from outside of pack-revindex internals.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/gc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
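
To make the size difference described in the commit message concrete, here is a minimal sketch. The struct below is only a stand-in built from the two fields the message mentions (one off_t and one uint32_t); its name and the helper functions are hypothetical and are not part of git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Stand-in for the entry layout described above; not git's definition. */
struct revindex_entry_sketch {
	off_t offset;
	uint32_t nr;
};

/* Per-object cost as the old code computed it (includes tail padding). */
size_t per_object_struct(void)
{
	return sizeof(struct revindex_entry_sketch);
}

/* Per-object cost as the patch guesses it (fields only, no padding). */
size_t per_object_fields(void)
{
	return sizeof(off_t) + sizeof(uint32_t);
}

/* Bytes by which the field-based guess undercounts for nr_objects entries. */
uint64_t undercount(uint64_t nr_objects)
{
	/* A struct is never smaller than the sum of its members. */
	assert(per_object_struct() >= per_object_fields());
	return (uint64_t)(per_object_struct() - per_object_fields()) * nr_objects;
}
```

On a platform where off_t is 8 bytes and the struct is padded to 16, the two helpers differ by 4 bytes per object, so undercount(10000000) would come to 40,000,000 bytes. Where off_t is 4 bytes there may be no padding and no difference at all.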

Comments

Derrick Stolee Jan. 11, 2021, 11:52 a.m. UTC | #1
On 1/8/2021 1:17 PM, Taylor Blau wrote:
> 'estimate_repack_memory()' takes into account the amount of memory
> required to load the reverse index in memory by multiplying the assumed
> number of objects by the size of the 'revindex_entry' struct.
> 
> Prepare for hiding the definition of 'struct revindex_entry' by removing
> a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
> guess that one off_t and one uint32_t are required per object. Strictly
> speaking, this is a worse guess than asking for 'sizeof(struct
> revindex_entry)' directly, since the true size of this struct is 16
> bytes with padding on the end of the struct in order to align the offset
> field.

This is so far the only not-completely-obvious change.
 
> But, this is an approximation anyway, and it does remove a use of the
> 'struct revindex_entry' from outside of pack-revindex internals.

And this might be enough justification for it, but...

> -	heap += sizeof(struct revindex_entry) * nr_objects;
> +	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;

...outside of the estimation change, will this need another change
when the rev-index is mmap'd? Should this instead be an API call,
such as estimate_rev_index_memory(nr_objects)? That would
centralize the estimate to be next to the code that currently
interacts with 'struct revindex_entry' and will later interact with
the mmap region.
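
A rough sketch of what such an API call might look like, assuming it lives next to the struct definition. The function name comes from the suggestion above; it is not an existing git function, and the struct here is a stand-in so that the example compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>

/* Stand-in for the struct that the series hides inside pack-revindex.c. */
struct revindex_entry {
	off_t offset;
	uint32_t nr;
};

/*
 * Hypothetical helper: callers ask for an estimate and never see the
 * struct. If the representation changes later (for example to an
 * mmap'd file), only this function would need updating.
 */
uint64_t estimate_rev_index_memory(uint64_t nr_objects)
{
	/* The estimate must not wrap around. */
	assert(nr_objects <= UINT64_MAX / sizeof(struct revindex_entry));
	return (uint64_t)sizeof(struct revindex_entry) * nr_objects;
}
```

The caller in builtin/gc.c would then read something like `heap += estimate_rev_index_memory(nr_objects);`.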

Thanks,
-Stolee

Taylor Blau Jan. 11, 2021, 4:23 p.m. UTC | #2
On Mon, Jan 11, 2021 at 06:52:24AM -0500, Derrick Stolee wrote:
> This is so far the only not-completely-obvious change.
>
> > But, this is an approximation anyway, and it does remove a use of the
> > 'struct revindex_entry' from outside of pack-revindex internals.
>
> And this might be enough justification for it, but...
>
> > -	heap += sizeof(struct revindex_entry) * nr_objects;
> > +	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
>
> ...outside of the estimation change, will this need another change
> when the rev-index is mmap'd? Should this instead be an API call,
> such as estimate_rev_index_memory(nr_objects)? That would
> centralize the estimate to be next to the code that currently
> interacts with 'struct revindex_entry' and will later interact with
> the mmap region.

I definitely did consider this, and it seems that I made a mistake in
not documenting my consideration (since I assumed that it was so benign
nobody would notice / care ;-)).

The reason I didn't pursue it here was that we haven't yet loaded the
reverse index by this point. So, you'd want a function that at least
stats the '*.rev' file (and either does or doesn't parse it [1]), or
aborts early to indicate otherwise.

One would hope that 'load_pack_revindex()' would do just that, but it
falls back to load a reverse index in memory, which involves exactly the
slow sort that we're trying to avoid. (Of course, we're going to have to
do it later anyway, but allocating many GB of heap just to provide an
estimation seems ill-advised to me ;-).)

So, we'd have to expand the API in some way or another, and to me it
didn't seem worth it. As I mentioned in the commit message, I'm
skeptical of the value of being accurate here, since this is (after all)
an estimation.
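
For illustration only, the kind of API expansion described above could be sketched as follows: stat a candidate '*.rev' path without loading or parsing it, and fall back to the per-object guess when no such file exists. The function name, the choice to pass the path as a parameter, and the use of the file size as the estimate are all assumptions made for this sketch, not anything in git.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of git): estimate reverse index memory
 * without building the in-memory index. If a regular file exists at
 * rev_path, use its size; otherwise use the guess from this patch.
 */
uint64_t estimate_revindex_bytes(const char *rev_path, uint64_t nr_objects)
{
	const uint64_t per_object = sizeof(off_t) + sizeof(uint32_t);
	struct stat st;

	if (rev_path && !stat(rev_path, &st) && S_ISREG(st.st_mode))
		return (uint64_t)st.st_size;

	/* No on-disk file: fall back to the guess, guarding against wrap. */
	assert(nr_objects <= UINT64_MAX / per_object);
	return per_object * nr_objects;
}
```

Even a sketch this small adds a new entry point and a file system call to the estimate, which is the cost being weighed against the limited value of extra accuracy.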

Perhaps a longer response than you were bargaining for, but... :-).

Thanks,
Taylor

[1]: Likely negligible, since all "parsing" really does is verify the
internal checksum, and then assign a pointer into it.

Derrick Stolee Jan. 11, 2021, 5:09 p.m. UTC | #3
On 1/11/2021 11:23 AM, Taylor Blau wrote:
> On Mon, Jan 11, 2021 at 06:52:24AM -0500, Derrick Stolee wrote:
>> This is so far the only not-completely-obvious change.
>>
>>> But, this is an approximation anyway, and it does remove a use of the
>>> 'struct revindex_entry' from outside of pack-revindex internals.
>>
>> And this might be enough justification for it, but...
>>
>>> -	heap += sizeof(struct revindex_entry) * nr_objects;
>>> +	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
>>
>> ...outside of the estimation change, will this need another change
>> when the rev-index is mmap'd? Should this instead be an API call,
>> such as estimate_rev_index_memory(nr_objects)? That would
>> centralize the estimate to be next to the code that currently
>> interacts with 'struct revindex_entry' and will later interact with
>> the mmap region.
> 
> I definitely did consider this, and it seems that I made a mistake in
> not documenting my consideration (since I assumed that it was so benign
> nobody would notice / care ;-)).
> 
> The reason I didn't pursue it here was that we haven't yet loaded the
> reverse index by this point. So, you'd want a function that at least
> stats the '*.rev' file (and either does or doesn't parse it [1]), or
> aborts early to indicate otherwise.

In this patch, I would expect it to use sizeof(struct revindex_entry).
Later, the method would know if a .rev file exists and do the right
thing instead. (Also, should mmap'd data count towards this estimate?)

> One would hope that 'load_pack_revindex()' would do just that, but it
> falls back to load a reverse index in memory, which involves exactly the
> slow sort that we're trying to avoid. (Of course, we're going to have to
> do it later anyway, but allocating many GB of heap just to provide an
> estimation seems ill-advised to me ;-).)
> 
> So, we'd have to expand the API in some way or another, and to me it
> didn't seem worth it. As I mentioned in the commit message, I'm
> skeptical of the value of being accurate here, since this is (after all)
> an estimation.

Yes, I'm probably just poking somewhere it was easy to poke. This is
probably not worth the time I'm spending asking about it.

Feel free to disregard.

-Stolee

Jeff King Jan. 12, 2021, 9:28 a.m. UTC | #4
On Mon, Jan 11, 2021 at 12:09:27PM -0500, Derrick Stolee wrote:

> > The reason I didn't pursue it here was that we haven't yet loaded the
> > reverse index by this point. So, you'd want a function that at least
> > stats the '*.rev' file (and either does or doesn't parse it [1]), or
> > aborts early to indicate otherwise.
> 
> In this patch, I would expect it to use sizeof(struct revindex_entry).
> Later, the method would know if a .rev file exists and do the right
> thing instead. (Also, should mmap'd data count towards this estimate?)

Yeah, I think if we care about memory pressure, then the mmap would
count anyway. I agree that letting the revindex code decide which to use
would be the most accurate thing, but given that this whole chunk of
code is an estimate (that does not even seem to take into account the
memory used for the delta search!), I don't think it's worth trying to
get to accurate.

-Peff

Patch

diff --git a/builtin/gc.c b/builtin/gc.c
index 4c24f41852..c60811f212 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -301,7 +301,7 @@  static uint64_t estimate_repack_memory(struct packed_git *pack)
 	/* and then obj_hash[], underestimated in fact */
 	heap += sizeof(struct object *) * nr_objects;
 	/* revindex is used also */
-	heap += sizeof(struct revindex_entry) * nr_objects;
+	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
 	/*
 	 * read_sha1_file() (either at delta calculation phase, or
 	 * writing phase) also fills up the delta base cache