diff mbox series

pack-format: correct multi-pack-index description

Message ID 20200207221640.46876-1-johannes@sipsolutions.net (mailing list archive)
State New, archived
Headers show
Series pack-format: correct multi-pack-index description | expand

Commit Message

Johannes Berg Feb. 7, 2020, 10:16 p.m. UTC
The description of the multi-pack-index contains a small bug,
if all offsets are < 2^32 then there will be no LOFF chunk,
not only if they're all < 2^31 (since the highest bit is only
needed as the "LOFF-escape" when that's actually needed.)

Correct this, and clarify that in that case only offsets up
to 2^31-1 can be stored in the OOFF chunk.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
---
 Documentation/technical/pack-format.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Derrick Stolee Feb. 10, 2020, 2:18 p.m. UTC | #1
On 2/7/2020 5:16 PM, Johannes Berg wrote:
> The description of the multi-pack-index contains a small bug,
> if all offsets are < 2^32 then there will be no LOFF chunk,
> not only if they're all < 2^31 (since the highest bit is only
> needed as the "LOFF-escape" when that's actually needed.)
> 
> Correct this, and clarify that in that case only offsets up
> to 2^31-1 can be stored in the OOFF chunk.
> 
> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
> ---
>  Documentation/technical/pack-format.txt | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index cab5bdd2ff0f..d3a142c65202 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -315,10 +315,11 @@ CHUNK DATA:
>  	    Stores two 4-byte values for every object.
>  	    1: The pack-int-id for the pack storing this object.
>  	    2: The offset within the pack.
> -		If all offsets are less than 2^31, then the large offset chunk
> +		If all offsets are less than 2^32, then the large offset chunk
>  		will not exist and offsets are stored as in IDX v1.
>  		If there is at least one offset value larger than 2^32-1, then
> -		the large offset chunk must exist. If the large offset chunk
> +		the large offset chunk must exist, and offsets larger than
> +		2^31-1 must be stored in it instead. If the large offset chunk
>  		exists and the 31st bit is on, then removing that bit reveals
>  		the row in the large offsets containing the 8-byte offset of
>  		this object.

Thank you for finding this doc bug. This is a very subtle point,
and you have described it very clearly.

-Stolee
Johannes Berg Feb. 10, 2020, 2:22 p.m. UTC | #2
On Mon, 2020-02-10 at 09:18 -0500, Derrick Stolee wrote:
> 
> Thank you for finding this doc bug. This is a very subtle point,
> and you have described it very clearly.

I was going back and forth on the wording a bit, glad I found something
that you think is a good description :)

Are you familiar with the multi-pack-index and how it's used, by any
chance?

I came here from bup (https://github.com/bup/bup/) and needed a way to
store the offset to find objects in "pure bup", today it only stores
object *presence* and *pack* in its multi-index, but not the offset.

However, it seems to do a bit better in terms of not requiring a single
multi-index, but instead storing it in midx-*.midx files and multiple
can describe the repository state. Why wasn't something like that done
for git as well? It's a bit annoying to have to recreate the full midx
every time a pack file is added, and searching in two or three midx
files wouldn't really be a big deal?

Anyway, that's just an aside, but during all this investigation I
stumbled across this small inconsistency - I'm glad the docs exist at
all! :-)

Thanks,
johannes
Derrick Stolee Feb. 10, 2020, 2:46 p.m. UTC | #3
On 2/10/2020 9:22 AM, Johannes Berg wrote:
> On Mon, 2020-02-10 at 09:18 -0500, Derrick Stolee wrote:
>>
>> Thank you for finding this doc bug. This is a very subtle point,
>> and you have described it very clearly.
> 
> I was going back and forth on the wording a bit, glad I found something
> that you think is a good description :)
> 
> Are you familiar with the multi-pack-index and how it's used, by any
> chance?

Yes. I wrote the first version, and we use it a lot in VFS for Git.

> I came here from bup (https://github.com/bup/bup/) and needed a way to
> store the offset to find objects in "pure bup", today it only stores
> object *presence* and *pack* in its multi-index, but not the offset.
> 
> However, it seems to do a bit better in terms of not requiring a single
> multi-index, but instead storing it in midx-*.midx files and multiple
> can describe the repository state. Why wasn't something like that done
> for git as well? It's a bit annoying to have to recreate the full midx
> every time a pack file is added, and searching in two or three midx
> files wouldn't really be a big deal?

Part of my initial plan was to have this incremental file format.
The commit-graph uses a very similar mechanism. The difference may
be that you likely allow multiple .midx files found by scanning the
pack directory, but I would expect something like the
"commit-graph-chain" file that provides an ordered list of the
incremental files. This can be important for deciding when to merge
layers or delete old files, and would be critical to the possibility
of converting reachability bitmaps to rely on a stable object order
stored in the multi-pack-index instead of pack-order.

The reason the multi-pack-index has not become incremental is that
VFS for Git no longer needs to write it very often. We write the
entire multi-pack-index during a background job that triggers once
per day. If we needed to write it more frequently, then the incremental
format would be more important to us.

That said: if someone wanted to contribute an incremental format,
then I would be happy to review it!

> Anyway, that's just an aside, but during all this investigation I
> stumbled across this small inconsistency - I'm glad the docs exist at
> all! :-)

I'm glad they helped.

Thanks,
-Stolee
Johannes Berg Feb. 10, 2020, 2:50 p.m. UTC | #4
On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote:

> Part of my initial plan was to have this incremental file format.
> The commit-graph uses a very similar mechanism. The difference may
> be that you likely allow multiple .midx files found by scanning the
> pack directory, 

Right, just scan and use any midx that exist, then compare the packs in
there against all the packs found, and then remove any packs that
actually *are* in an midx from the search list. That leaves you with all
information, but optimised by midx where possible.

> but I would expect something like the
> "commit-graph-chain" file that provides an ordered list of the
> incremental files. This can be important for deciding when to merge
> layers or delete old files, and would be critical to the possibility
> of converting reachability bitmaps to rely on a stable object order
> stored in the multi-pack-index instead of pack-order.

Right, if we delete then we have to also remove any midx covering the
deleted pack, that's pretty rare in bup as a backup tool though.

> The reason the multi-pack-index has not become incremental is that
> VFS for Git no longer needs to write it very often. We write the
> entire multi-pack-index during a background job that triggers once
> per day. If we needed to write it more frequently, then the incremental
> format would be more important to us.

So, wait, what if a new pack is created? Does it just get used in
addition to the multi-pack-index, if it's not covered by it, like I
described above?

If so, I guess it wouldn't actually really matter here. I was afraid
(but didn't check yet) that git would always use only the single multi-
pack-index file, and not also search additional packs, so that it always
has to be maintained in "perfect order" ...

> That said: if someone wanted to contribute an incremental format,
> then I would be happy to review it!

I might still get motivated to do so :-)

johannes
Derrick Stolee Feb. 10, 2020, 3:02 p.m. UTC | #5
On 2/10/2020 9:50 AM, Johannes Berg wrote:
> On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote:
> 
>> Part of my initial plan was to have this incremental file format.
>> The commit-graph uses a very similar mechanism. The difference may
>> be that you likely allow multiple .midx files found by scanning the
>> pack directory, 
> 
> Right, just scan and use any midx that exist, then compare the packs in
> there against all the packs found, and then remove any packs that
> actually *are* in an midx from the search list. That leaves you with all
> information, but optimised by midx where possible.
> 
>> but I would expect something like the
>> "commit-graph-chain" file that provides an ordered list of the
>> incremental files. This can be important for deciding when to merge
>> layers or delete old files, and would be critical to the possibility
>> of converting reachability bitmaps to rely on a stable object order
>> stored in the multi-pack-index instead of pack-order.
> 
> Right, if we delete then we have to also remove any midx covering the
> deleted pack, that's pretty rare in bup as a backup tool though.
> 
>> The reason the multi-pack-index has not become incremental is that
>> VFS for Git no longer needs to write it very often. We write the
>> entire multi-pack-index during a background job that triggers once
>> per day. If we needed to write it more frequently, then the incremental
>> format would be more important to us.
> 
> So, wait, what if a new pack is created? Does it just get used in
> addition to the multi-pack-index, if it's not covered by it, like I
> described above?
> 
> If so, I guess it wouldn't actually really matter here. I was afraid
> (but didn't check yet) that git would always use only the single multi-
> pack-index file, and not also search additional packs, so that it always
> has to be maintained in "perfect order" ...

Git loads the multi-pack-index file, which includes a sorted list of
the packs it covers. It then scans the "pack" directory for pack-indexes
and checks if they are covered by the multi-pack-index. If not, then
Git will add them to the packed_git struct and use them as normal.
The hope is that this list of "uncovered" packs is small compared to
the data covered by the multi-pack-index.

This allows Git to continue functioning after an action like "git fetch"
that adds a new pack but may not want to rewrite the multi-pack-index.

Our background maintenance essentially runs these commands:

 1. git multi-pack-index write
 2. git multi-pack-index expire
 3. git multi-pack-index repack

Step 1 ensures all packs are pulled into the multi-pack-index. Step 2
deletes any pack-files whose objects are contained in newer pack-files.
Step 3 creates a new pack-file containing all objects from a set of
small pack-files (using the --batch-size=X option). This process helps
incrementally reduce the size and number of packs. That may be helpful
for your backup took, too.

Perhaps after an incremental multi-pack-index is added, then Git could
(optionally) have a mode that only checks the multi-pack-index to
avoid scanning the packs directory. It would require inserting a
multi-pack-index write into the index-pack logic so Git.

I'm not sure if that mode would be helpful, since the pack directory
scan is typically done once per command and is relatively fast.

>> That said: if someone wanted to contribute an incremental format,
>> then I would be happy to review it!
> 
> I might still get motivated to do so :-)

YOU CAN DO IT! (Did that help?)

-Stolee
Johannes Berg Feb. 10, 2020, 3:06 p.m. UTC | #6
On Mon, 2020-02-10 at 10:02 -0500, Derrick Stolee wrote:
> Git loads the multi-pack-index file, which includes a sorted list of
> the packs it covers. It then scans the "pack" directory for pack-indexes
> and checks if they are covered by the multi-pack-index. If not, then
> Git will add them to the packed_git struct and use them as normal.
> The hope is that this list of "uncovered" packs is small compared to
> the data covered by the multi-pack-index.
> 
> This allows Git to continue functioning after an action like "git fetch"
> that adds a new pack but may not want to rewrite the multi-pack-index.

Ah, ok.

So then perhaps I'll just make bup write the multi-pack-index file as
is. This is fine, there's no real need to have multiple, I just didn't
want to have to make sure the file was always consistent.

Or maybe just call git to do it, and only be able to read the resulting
file :-)

> Our background maintenance essentially runs these commands:
> 
>  1. git multi-pack-index write
>  2. git multi-pack-index expire
>  3. git multi-pack-index repack
> 
> Step 1 ensures all packs are pulled into the multi-pack-index. Step 2
> deletes any pack-files whose objects are contained in newer pack-files.
> Step 3 creates a new pack-file containing all objects from a set of
> small pack-files (using the --batch-size=X option). This process helps
> incrementally reduce the size and number of packs. That may be helpful
> for your backup took, too.

I'll have to look at this in more detail later, and understand exactly
what the steps do here. Evidently that modifies pack files, which I
hadn't expected for a type of "index" command :-)

> Perhaps after an incremental multi-pack-index is added, then Git could
> (optionally) have a mode that only checks the multi-pack-index to
> avoid scanning the packs directory. It would require inserting a
> multi-pack-index write into the index-pack logic so Git.

I guess you'd still want to read non-covered pack files just in case old
git was used or something though.

> I'm not sure if that mode would be helpful, since the pack directory
> scan is typically done once per command and is relatively fast.

Right.

> > > That said: if someone wanted to contribute an incremental format,
> > > then I would be happy to review it!
> > 
> > I might still get motivated to do so :-)
> 
> YOU CAN DO IT! (Did that help?)

:-)

Thanks,
johannes
Junio C Hamano Feb. 10, 2020, 5:02 p.m. UTC | #7
Derrick Stolee <stolee@gmail.com> writes:

> On 2/7/2020 5:16 PM, Johannes Berg wrote:
>> The description of the multi-pack-index contains a small bug,
>> if all offsets are < 2^32 then there will be no LOFF chunk,
>> not only if they're all < 2^31 (since the highest bit is only
>> needed as the "LOFF-escape" when that's actually needed.)
>> 
>> Correct this, and clarify that in that case only offsets up
>> to 2^31-1 can be stored in the OOFF chunk.
>> 
>> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
>> ---
>>  Documentation/technical/pack-format.txt | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
>> index cab5bdd2ff0f..d3a142c65202 100644
>> --- a/Documentation/technical/pack-format.txt
>> +++ b/Documentation/technical/pack-format.txt
>> @@ -315,10 +315,11 @@ CHUNK DATA:
>>  	    Stores two 4-byte values for every object.
>>  	    1: The pack-int-id for the pack storing this object.
>>  	    2: The offset within the pack.
>> -		If all offsets are less than 2^31, then the large offset chunk
>> +		If all offsets are less than 2^32, then the large offset chunk
>>  		will not exist and offsets are stored as in IDX v1.
>>  		If there is at least one offset value larger than 2^32-1, then
>> -		the large offset chunk must exist. If the large offset chunk
>> +		the large offset chunk must exist, and offsets larger than
>> +		2^31-1 must be stored in it instead. If the large offset chunk
>>  		exists and the 31st bit is on, then removing that bit reveals
>>  		the row in the large offsets containing the 8-byte offset of
>>  		this object.
>
> Thank you for finding this doc bug. This is a very subtle point,
> and you have described it very clearly.

Thanks, both; queued.
diff mbox series

Patch

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index cab5bdd2ff0f..d3a142c65202 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -315,10 +315,11 @@  CHUNK DATA:
 	    Stores two 4-byte values for every object.
 	    1: The pack-int-id for the pack storing this object.
 	    2: The offset within the pack.
-		If all offsets are less than 2^31, then the large offset chunk
+		If all offsets are less than 2^32, then the large offset chunk
 		will not exist and offsets are stored as in IDX v1.
 		If there is at least one offset value larger than 2^32-1, then
-		the large offset chunk must exist. If the large offset chunk
+		the large offset chunk must exist, and offsets larger than
+		2^31-1 must be stored in it instead. If the large offset chunk
 		exists and the 31st bit is on, then removing that bit reveals
 		the row in the large offsets containing the 8-byte offset of
 		this object.