Message ID | 20200207221640.46876-1-johannes@sipsolutions.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | pack-format: correct multi-pack-index description | expand |
On 2/7/2020 5:16 PM, Johannes Berg wrote: > The description of the multi-pack-index contains a small bug, > if all offsets are < 2^32 then there will be no LOFF chunk, > not only if they're all < 2^31 (since the highest bit is only > needed as the "LOFF-escape" when that's actually needed.) > > Correct this, and clarify that in that case only offsets up > to 2^31-1 can be stored in the OOFF chunk. > > Signed-off-by: Johannes Berg <johannes@sipsolutions.net> > --- > Documentation/technical/pack-format.txt | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt > index cab5bdd2ff0f..d3a142c65202 100644 > --- a/Documentation/technical/pack-format.txt > +++ b/Documentation/technical/pack-format.txt > @@ -315,10 +315,11 @@ CHUNK DATA: > Stores two 4-byte values for every object. > 1: The pack-int-id for the pack storing this object. > 2: The offset within the pack. > - If all offsets are less than 2^31, then the large offset chunk > + If all offsets are less than 2^32, then the large offset chunk > will not exist and offsets are stored as in IDX v1. > If there is at least one offset value larger than 2^32-1, then > - the large offset chunk must exist. If the large offset chunk > + the large offset chunk must exist, and offsets larger than > + 2^31-1 must be stored in it instead. If the large offset chunk > exists and the 31st bit is on, then removing that bit reveals > the row in the large offsets containing the 8-byte offset of > this object. Thank you for finding this doc bug. This is a very subtle point, and you have described it very clearly. -Stolee
On Mon, 2020-02-10 at 09:18 -0500, Derrick Stolee wrote: > > Thank you for finding this doc bug. This is a very subtle point, > and you have described it very clearly. I was going back and forth on the wording a bit, glad I found something that you think is a good description :) Are you familiar with the multi-pack-index and how it's used, by any chance? I came here from bup (https://github.com/bup/bup/) and needed a way to store the offset to find objects in "pure bup", today it only stores object *presence* and *pack* in its multi-index, but not the offset. However, it seems to do a bit better in terms of not requiring a single multi-index, but instead storing it in midx-*.midx files and multiple can describe the repository state. Why wasn't something like that done for git as well? It's a bit annoying to have to recreate the full midx every time a pack file is added, and searching in two or three midx files wouldn't really be a big deal? Anyway, that's just an aside, but during all this investigation I stumbled across this small inconsistency - I'm glad the docs exist at all! :-) Thanks, johannes
On 2/10/2020 9:22 AM, Johannes Berg wrote: > On Mon, 2020-02-10 at 09:18 -0500, Derrick Stolee wrote: >> >> Thank you for finding this doc bug. This is a very subtle point, >> and you have described it very clearly. > > I was going back and forth on the wording a bit, glad I found something > that you think is a good description :) > > Are you familiar with the multi-pack-index and how it's used, by any > chance? Yes. I wrote the first version, and we use it a lot in VFS for Git. > I came here from bup (https://github.com/bup/bup/) and needed a way to > store the offset to find objects in "pure bup", today it only stores > object *presence* and *pack* in its multi-index, but not the offset. > > However, it seems to do a bit better in terms of not requiring a single > multi-index, but instead storing it in midx-*.midx files and multiple > can describe the repository state. Why wasn't something like that done > for git as well? It's a bit annoying to have to recreate the full midx > every time a pack file is added, and searching in two or three midx > files wouldn't really be a big deal? Part of my initial plan was to have this incremental file format. The commit-graph uses a very similar mechanism. The difference may be that you likely allow multiple .midx files found by scanning the pack directory, but I would expect something like the "commit-graph-chain" file that provides an ordered list of the incremental files. This can be important for deciding when to merge layers or delete old files, and would be critical to the possibility of converting reachability bitmaps to rely on a stable object order stored in the multi-pack-index instead of pack-order. The reason the multi-pack-index has not become incremental is that VFS for Git no longer needs to write it very often. We write the entire multi-pack-index during a background job that triggers once per day. If we needed to write it more frequently, then the incremental format would be more important to us. That said: if someone wanted to contribute an incremental format, then I would be happy to review it! > Anyway, that's just an aside, but during all this investigation I > stumbled across this small inconsistency - I'm glad the docs exist at > all! :-) I'm glad they helped. Thanks, -Stolee
On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote: > Part of my initial plan was to have this incremental file format. > The commit-graph uses a very similar mechanism. The difference may > be that you likely allow multiple .midx files found by scanning the > pack directory, Right, just scan and use any midx that exist, then compare the packs in there against all the packs found, and then remove any packs that actually *are* in an midx from the search list. That leaves you with all information, but optimised by midx where possible. > but I would expect something like the > "commit-graph-chain" file that provides an ordered list of the > incremental files. This can be important for deciding when to merge > layers or delete old files, and would be critical to the possibility > of converting reachability bitmaps to rely on a stable object order > stored in the multi-pack-index instead of pack-order. Right, if we delete then we have to also remove any midx covering the deleted pack, that's pretty rare in bup as a backup tool though. > The reason the multi-pack-index has not become incremental is that > VFS for Git no longer needs to write it very often. We write the > entire multi-pack-index during a background job that triggers once > per day. If we needed to write it more frequently, then the incremental > format would be more important to us. So, wait, what if a new pack is created? Does it just get used in addition to the multi-pack-index, if it's not covered by it, like I described above? If so, I guess it wouldn't actually really matter here. I was afraid (but didn't check yet) that git would always use only the single multi- pack-index file, and not also search additional packs, so that it always has to be maintained in "perfect order" ... > That said: if someone wanted to contribute an incremental format, > then I would be happy to review it! I might still get motivated to do so :-) johannes
On 2/10/2020 9:50 AM, Johannes Berg wrote: > On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote: > >> Part of my initial plan was to have this incremental file format. >> The commit-graph uses a very similar mechanism. The difference may >> be that you likely allow multiple .midx files found by scanning the >> pack directory, > > Right, just scan and use any midx that exist, then compare the packs in > there against all the packs found, and then remove any packs that > actually *are* in an midx from the search list. That leaves you with all > information, but optimised by midx where possible. > >> but I would expect something like the >> "commit-graph-chain" file that provides an ordered list of the >> incremental files. This can be important for deciding when to merge >> layers or delete old files, and would be critical to the possibility >> of converting reachability bitmaps to rely on a stable object order >> stored in the multi-pack-index instead of pack-order. > > Right, if we delete then we have to also remove any midx covering the > deleted pack, that's pretty rare in bup as a backup tool though. > >> The reason the multi-pack-index has not become incremental is that >> VFS for Git no longer needs to write it very often. We write the >> entire multi-pack-index during a background job that triggers once >> per day. If we needed to write it more frequently, then the incremental >> format would be more important to us. > > So, wait, what if a new pack is created? Does it just get used in > addition to the multi-pack-index, if it's not covered by it, like I > described above? > > If so, I guess it wouldn't actually really matter here. I was afraid > (but didn't check yet) that git would always use only the single multi- > pack-index file, and not also search additional packs, so that it always > has to be maintained in "perfect order" ... Git loads the multi-pack-index file, which includes a sorted list of the packs it covers. It then scans the "pack" directory for pack-indexes and checks if they are covered by the multi-pack-index. If not, then Git will add them to the packed_git struct and use them as normal. The hope is that this list of "uncovered" packs is small compared to the data covered by the multi-pack-index. This allows Git to continue functioning after an action like "git fetch" that adds a new pack but may not want to rewrite the multi-pack-index. Our background maintenance essentially runs these commands: 1. git multi-pack-index write 2. git multi-pack-index expire 3. git multi-pack-index repack Step 1 ensures all packs are pulled into the multi-pack-index. Step 2 deletes any pack-files whose objects are contained in newer pack-files. Step 3 creates a new pack-file containing all objects from a set of small pack-files (using the --batch-size=X option). This process helps incrementally reduce the size and number of packs. That may be helpful for your backup took, too. Perhaps after an incremental multi-pack-index is added, then Git could (optionally) have a mode that only checks the multi-pack-index to avoid scanning the packs directory. It would require inserting a multi-pack-index write into the index-pack logic so Git. I'm not sure if that mode would be helpful, since the pack directory scan is typically done once per command and is relatively fast. >> That said: if someone wanted to contribute an incremental format, >> then I would be happy to review it! > > I might still get motivated to do so :-) YOU CAN DO IT! (Did that help?) -Stolee
On Mon, 2020-02-10 at 10:02 -0500, Derrick Stolee wrote: > Git loads the multi-pack-index file, which includes a sorted list of > the packs it covers. It then scans the "pack" directory for pack-indexes > and checks if they are covered by the multi-pack-index. If not, then > Git will add them to the packed_git struct and use them as normal. > The hope is that this list of "uncovered" packs is small compared to > the data covered by the multi-pack-index. > > This allows Git to continue functioning after an action like "git fetch" > that adds a new pack but may not want to rewrite the multi-pack-index. Ah, ok. So then perhaps I'll just make bup write the multi-pack-index file as is. This is fine, there's no real need to have multiple, I just didn't want to have to make sure the file was always consistent. Or maybe just call git to do it, and only be able to read the resulting file :-) > Our background maintenance essentially runs these commands: > > 1. git multi-pack-index write > 2. git multi-pack-index expire > 3. git multi-pack-index repack > > Step 1 ensures all packs are pulled into the multi-pack-index. Step 2 > deletes any pack-files whose objects are contained in newer pack-files. > Step 3 creates a new pack-file containing all objects from a set of > small pack-files (using the --batch-size=X option). This process helps > incrementally reduce the size and number of packs. That may be helpful > for your backup took, too. I'll have to look at this in more detail later, and understand exactly what the steps do here. Evidently that modifies pack files, which I hadn't expected for a type of "index" command :-) > Perhaps after an incremental multi-pack-index is added, then Git could > (optionally) have a mode that only checks the multi-pack-index to > avoid scanning the packs directory. It would require inserting a > multi-pack-index write into the index-pack logic so Git. I guess you'd still want to read non-covered pack files just in case old git was used or something though. > I'm not sure if that mode would be helpful, since the pack directory > scan is typically done once per command and is relatively fast. Right. > > > That said: if someone wanted to contribute an incremental format, > > > then I would be happy to review it! > > > > I might still get motivated to do so :-) > > YOU CAN DO IT! (Did that help?) :-) Thanks, johannes
Derrick Stolee <stolee@gmail.com> writes: > On 2/7/2020 5:16 PM, Johannes Berg wrote: >> The description of the multi-pack-index contains a small bug, >> if all offsets are < 2^32 then there will be no LOFF chunk, >> not only if they're all < 2^31 (since the highest bit is only >> needed as the "LOFF-escape" when that's actually needed.) >> >> Correct this, and clarify that in that case only offsets up >> to 2^31-1 can be stored in the OOFF chunk. >> >> Signed-off-by: Johannes Berg <johannes@sipsolutions.net> >> --- >> Documentation/technical/pack-format.txt | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt >> index cab5bdd2ff0f..d3a142c65202 100644 >> --- a/Documentation/technical/pack-format.txt >> +++ b/Documentation/technical/pack-format.txt >> @@ -315,10 +315,11 @@ CHUNK DATA: >> Stores two 4-byte values for every object. >> 1: The pack-int-id for the pack storing this object. >> 2: The offset within the pack. >> - If all offsets are less than 2^31, then the large offset chunk >> + If all offsets are less than 2^32, then the large offset chunk >> will not exist and offsets are stored as in IDX v1. >> If there is at least one offset value larger than 2^32-1, then >> - the large offset chunk must exist. If the large offset chunk >> + the large offset chunk must exist, and offsets larger than >> + 2^31-1 must be stored in it instead. If the large offset chunk >> exists and the 31st bit is on, then removing that bit reveals >> the row in the large offsets containing the 8-byte offset of >> this object. > > Thank you for finding this doc bug. This is a very subtle point, > and you have described it very clearly. Thanks, both; queued.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index cab5bdd2ff0f..d3a142c65202 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -315,10 +315,11 @@ CHUNK DATA: Stores two 4-byte values for every object. 1: The pack-int-id for the pack storing this object. 2: The offset within the pack. - If all offsets are less than 2^31, then the large offset chunk + If all offsets are less than 2^32, then the large offset chunk will not exist and offsets are stored as in IDX v1. If there is at least one offset value larger than 2^32-1, then - the large offset chunk must exist. If the large offset chunk + the large offset chunk must exist, and offsets larger than + 2^31-1 must be stored in it instead. If the large offset chunk exists and the 31st bit is on, then removing that bit reveals the row in the large offsets containing the 8-byte offset of this object.
The description of the multi-pack-index contains a small bug, if all offsets are < 2^32 then there will be no LOFF chunk, not only if they're all < 2^31 (since the highest bit is only needed as the "LOFF-escape" when that's actually needed.) Correct this, and clarify that in that case only offsets up to 2^31-1 can be stored in the OOFF chunk. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> --- Documentation/technical/pack-format.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)