Message ID | fa6b8309088fd04410ca7276c5cf14db0fb82fb2.1610129796.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | pack-revindex: prepare for on-disk reverse index | expand |
On Fri, Jan 08, 2021 at 01:16:43PM -0500, Taylor Blau wrote: > In the next several patches, we will prepare for loading a reverse index > either in memory, or from a yet-to-be-introduced on-disk format. To do > that, we'll introduce an API that avoids the caller explicitly indexing > the revindex pointer in the packed_git structure. This API looks good to me. Here are a few extra thoughts: > There are four ways to interact with the reverse index. Accordingly, > four functions will be exported from 'pack-revindex.h' by the time that > the existing API is removed. A caller may: This tells us what the new API functions do. That's useful, but should it be in the header file itself, documenting each function? Likewise, I think we'd want to define the concepts in that documentation. Something like: /* * A revindex allows converting efficiently between three properties * of an object within a pack: * * - index position: the numeric position within the list of * sorted object ids found in the .idx file * * - pack position: the numeric position within the list of objects * in their order within the actual .pack file (i.e., 0 is the * first object in the .pack, 1 is the second, and so on) * * - offset: the byte offset within the .pack file at which the * object contents can be found */ And then above each function we can just say that it converts X to Y (like you have in the commit message). It may also be worth indicating the run-time of each (some of them are constant-time once you have a revindex, and some are log(n)). > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) The types here make sense. off_t is clearly needed for a pack offset, and uint32_t is correct for the position fields, because packs have a 4-byte object count. Separating the error return from the out-parameter makes the interface slightly more awkward, but is needed to use the properly-sized types. Makes sense. > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) > +{ > + int ret; > + > + if (load_pack_revindex(p) < 0) > + return -1; This one lazy-loads the revindex for us, which seems handy... > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos) > +{ > + if (!p->revindex) > + BUG("pack_pos_to_index: reverse index not yet loaded"); > + if (pos >= p->num_objects) > + BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos); > + return p->revindex[pos].nr; > +} But these ones don't. I'm glad we at least catch it with a BUG(), but it makes the API a little funny. Returning an error here would require a similarly awkward out-parameter, I guess. -Peff
On Tue, Jan 12, 2021 at 03:41:28AM -0500, Jeff King wrote: > > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) > > +{ > > + int ret; > > + > > + if (load_pack_revindex(p) < 0) > > + return -1; > > This one lazy-loads the revindex for us, which seems handy... > > > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos) > > +{ > > + if (!p->revindex) > > + BUG("pack_pos_to_index: reverse index not yet loaded"); > > + if (pos >= p->num_objects) > > + BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos); > > + return p->revindex[pos].nr; > > +} > > But these ones don't. I'm glad we at least catch it with a BUG(), but it > makes the API a little funny. Returning an error here would require a > similarly awkward out-parameter, I guess. Having now looked at the callers through the series, I think adding an error return to pack_pos_to_index() would be really awkward (since it cannot currently fail). We _could_ insist that callers of offset_to_pack_pos() also make sure the revindex is loaded themselves. But it would be annoying and error-prone to check the existing callers. So I'm OK with leaving this asymmetry in the API. -Peff
On Tue, Jan 12, 2021 at 03:41:28AM -0500, Jeff King wrote: > > There are four ways to interact with the reverse index. Accordingly, > > four functions will be exported from 'pack-revindex.h' by the time that > > the existing API is removed. A caller may: > > This tells us what the new API functions do. That's useful, but should > it be in the header file itself, documenting each function? Mm, that's a good idea. I took your suggestion for a comment at the top of pack-revindex.h directly, and then added some of my own commentary above each function. I avoided documenting the functions we're about to remove for obvious reasons. I think that the commit message is OK as-is, since it provides more of a rationale of what operations need to exist, rather than the specifics of each implementation. We'll have to update these again when the on-disk format exists (i.e., because all of the runtimes become constant with the exception of going to the index), but that's a topic for another series. > > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) > > The types here make sense. off_t is clearly needed for a pack offset, > and uint32_t is correct for the position fields, because packs have a > 4-byte object count. > > Separating the error return from the out-parameter makes the interface > slightly more awkward, but is needed to use the properly-sized types. > Makes sense. Yep, exactly. > > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) > > +{ > > + int ret; > > + > > + if (load_pack_revindex(p) < 0) > > + return -1; > > This one lazy-loads the revindex for us, which seems handy... > > > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos) > > +{ > > + if (!p->revindex) > > + BUG("pack_pos_to_index: reverse index not yet loaded"); > > + if (pos >= p->num_objects) > > + BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos); > > + return p->revindex[pos].nr; > > +} > > But these ones don't. I'm glad we at least catch it with a BUG(), but it > makes the API a little funny. Returning an error here would require a > similarly awkward out-parameter, I guess. It is awkward, but (as you note downthread) the callers of pack_pos_to_index() make it difficult to do so, so I didn't pursue it here. > -Peff Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > In the next several patches, we will prepare for loading a reverse index > either in memory, or from a yet-to-be-introduced on-disk format. To do Does "load revindex in memory" (as opposed to "from on-disk file") mean the good old "read the forward index and make inverse map in-core", or something else? IOW, is "We will prepare a reverse index either by computing in memory from forward index, or loading from on-disk file" what we want to say here? > There are four ways to interact with the reverse index. Accordingly, > four functions will be exported from 'pack-revindex.h' by the time that > the existing API is removed. A caller may: > > 1. Load the pack's reverse index. This involves opening up the index, > generating an array, and then sorting it. Since opening the index > can fail, this function ('load_pack_revindex()') returns an int. > Accordingly, it takes only a single argument: the 'struct > packed_git' the caller wants to build a reverse index for. > > This function is well-suited for both the current and new API. > Callers will have to continue to open the reverse index explicitly, > but this function will eventually learn how to detect and load a > reverse index from the on-disk format, if one exists. Otherwise, it > will fallback to generating one in memory from scratch. OK. > 2. Convert a pack position into an offset. This operation is now > called `pack_pos_to_offset()`. It takes a pack and a position, and > returns the corresponding off_t. > > 3. Convert a pack position into an index position. Same as above; this > takes a pack and a position, and returns a uint32_t. This operation > is known as `pack_pos_to_index()`. > > 4. Find the pack position for a given offset. This operation is now > known as `offset_to_pack_pos()`. It takes a pack, an offset, and a > pointer to a uint32_t where the position is written, if an object > exists at that offset. Otherwise, -1 is returned to indicate > failure. Without knowing what exactly "pack position", "offset" and "index position" refer to, the above three are almost impossible to grok. Can we have one paragraph description for each? Something along the lines of... - Pack position: a packstream consists of series of byte ranges, each of which represents an object, so the objects can be numbered from 0 (the object whose data is stored at the earliest part in the packfile) to N (the object whose data is stored at the tail end of the packfile). The number corresponding to an object in this order in the packfile is called the "pack position" of the object. - Offset: The ofs_t distance between the beginning of a pack stream and the beginning of data that represents an object is called the "offset" of the object in the packfile. - Index position: for a single pack stream, there is a table that maps object name to its offset and the entries in this table are sorted by the object name (this is what pack ".idx" file is). The location (counting from 0) of an object in this table is called the "index position" of the object in the packfile. I am not sure if the above correctly reflects what you meant by "position", though. > Unlike some of the callers that used to access '->offset' and '->nr' > directly, the error checking around this call is somewhat more > robust. This is important since callers can pass an offset which > does not contain an object. Meaning "offset ought to point at the boundary between objects in the pack stream, and the API, unlike the direct access, makes sure that is the case"? That is a good thing. > This will become important in a subsequent patch where a caller > which does not but could check the return value treats the signed > `-1` from `find_revindex_position()` as an index into the 'revindex' > array. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > pack-revindex.c | 32 ++++++++++++++++++++++++++++++++ > pack-revindex.h | 4 ++++ > 2 files changed, 36 insertions(+) > > diff --git a/pack-revindex.c b/pack-revindex.c > index ecdde39cf4..6d86a85208 100644 > --- a/pack-revindex.c > +++ b/pack-revindex.c > @@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs) > > return p->revindex + pos; > } > + > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) > +{ > + int ret; > + > + if (load_pack_revindex(p) < 0) > + return -1; > + > + ret = find_revindex_position(p, ofs); > + if (ret < 0) > + return -1; Why not "return ret"? We know that find_revindex_position() would signal an error by returning -1, but is there a reason why we want to prevent it from returning richer errors in the future? > + *pos = ret; The untold assumption is that uint32_t can fit the maximum returned value from find_revindex_position() and "signed int" can also big enough. I guess it is OK to be limited to up-to 2 billion objects on 32-bit systems. > + return 0; > +} > + > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos) > +{ > + if (!p->revindex) > + BUG("pack_pos_to_index: reverse index not yet loaded"); The previous function lazy loaded the revindex, but this one and the next one refuses to work without revindex. Intended? > + if (pos >= p->num_objects) > + BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos); Personally I find it easier to place items on a single line in an ascending order of magnitude, i.e. if (p->num_objects <= pos) BUG("..."); The assertion requires pos to be strictly lower than p->num_objects, which is in line with how we usually count elements of an array of size p->num_objects, but the next one allows pos == p->num_objects; intended? p->revindex[] is an array of two-member struct, so if an element of the array is invalid for its .nr member here because pos is exactly at p->num_objects, I would imagine it is also invalid for its .offset member, too, no? Ah, perhaps the "offset beyond the end of the pack positions" is a sentinel element to give the in-pack-stream size of the object at the last pack position? If that is the case, it deserves a comment, I would think. > + return p->revindex[pos].nr; > +} > + > +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos) > +{ > + if (!p->revindex) > + BUG("pack_pos_to_index: reverse index not yet loaded"); > + if (pos > p->num_objects) > + BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos); > + return p->revindex[pos].offset; > +} > diff --git a/pack-revindex.h b/pack-revindex.h > index 848331d5d6..256c0a9106 100644 > --- a/pack-revindex.h > +++ b/pack-revindex.h > @@ -13,4 +13,8 @@ int find_revindex_position(struct packed_git *p, off_t ofs); > > struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs); > > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos); > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos); > +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos); > + > #endif
Junio C Hamano <gitster@pobox.com> writes: > Without knowing what exactly "pack position", "offset" and "index > position" refer to, the above three are almost impossible to grok. > Can we have one paragraph description for each? Something along the > lines of... > > - Pack position: a packstream consists of series of byte ranges, > each of which represents an object, so the objects can be > numbered from 0 (the object whose data is stored at the earliest > part in the packfile) to N (the object whose data is stored at > the tail end of the packfile). The number corresponding to an > object in this order in the packfile is called the "pack > position" of the object. > > - Offset: The ofs_t distance between the beginning of a pack stream > and the beginning of data that represents an object is called the > "offset" of the object in the packfile. > > - Index position: for a single pack stream, there is a table that > maps object name to its offset and the entries in this table are > sorted by the object name (this is what pack ".idx" file is). > The location (counting from 0) of an object in this table is > called the "index position" of the object in the packfile. > > I am not sure if the above correctly reflects what you meant by > "position", though. Please scratch the above. I see Peff already pointed out pretty much the same, and his phrasing looked a lot cleaner than my attempt. Thanks.
On Wed, Jan 13, 2021 at 12:06:03AM -0800, Junio C Hamano wrote: > > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) > > +{ > > + int ret; > [...] > > + *pos = ret; > > The untold assumption is that uint32_t can fit the maximum returned > value from find_revindex_position() and "signed int" can also big > enough. I guess it is OK to be limited to up-to 2 billion objects > on 32-bit systems. Thanks for pointing this out. I recalled there being an "int" problem somewhere in the revindex code, but I didn't notice it on my read-through. This bug already exists (the problem is actually in the find_revindex_position() interface), and is fixed when we inline that into offset_to_pack_pos() in patch 18. It might be worth mentioning the fix there. -Peff
On Wed, Jan 13, 2021 at 12:06:03AM -0800, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > In the next several patches, we will prepare for loading a reverse index > > either in memory, or from a yet-to-be-introduced on-disk format. To do > > Does "load revindex in memory" (as opposed to "from on-disk file") > mean the good old "read the forward index and make inverse map > in-core", or something else? Indeed, that's what it means. I've made that clearer in the patch message by saying it explicitly. > IOW, is "We will prepare a reverse index either by computing in > memory from forward index, or loading from on-disk file" what we > want to say here? Yep. > Without knowing what exactly "pack position", "offset" and "index > position" refer to, the above three are almost impossible to grok. > Can we have one paragraph description for each? Something along the > lines of... Yep, and I see later on in the thread that you want to discard this suggestion since Peff has suggested similar changes in pack-revindex.h, which I've applied. > > Unlike some of the callers that used to access '->offset' and '->nr' > > directly, the error checking around this call is somewhat more > > robust. This is important since callers can pass an offset which > > does not contain an object. > > Meaning "offset ought to point at the boundary between objects in > the pack stream, and the API, unlike the direct access, makes sure > that is the case"? That is a good thing. Indeed, and I've called that out directly in the patch message to highlight it. > > diff --git a/pack-revindex.c b/pack-revindex.c > > index ecdde39cf4..6d86a85208 100644 > > --- a/pack-revindex.c > > +++ b/pack-revindex.c > > @@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs) > > > > return p->revindex + pos; > > } > > + > > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) > > +{ > > + int ret; > > + > > + if (load_pack_revindex(p) < 0) > > + return -1; > > + > > + ret = find_revindex_position(p, ofs); > > + if (ret < 0) > > + return -1; > > Why not "return ret"? We know that find_revindex_position() would > signal an error by returning -1, but is there a reason why we want > to prevent it from returning richer errors in the future? No reason, I've changed it to 'return ret' instead. > > + *pos = ret; > > The untold assumption is that uint32_t can fit the maximum returned > value from find_revindex_position() and "signed int" can also big > enough. I guess it is OK to be limited to up-to 2 billion objects > on 32-bit systems. Indeed, and that is fixed in a later on patch. I'll make sure to call it out there. > > + return 0; > > +} > > + > > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos) > > +{ > > + if (!p->revindex) > > + BUG("pack_pos_to_index: reverse index not yet loaded"); > > The previous function lazy loaded the revindex, but this one and the > next one refuses to work without revindex. Intended? Yes, and discussed a little bit more here [1]. Obviously that discussion doesn't do any good to those reading the git log in the future, so I've summarized the important detail (that some callers are equipped to deal with errors but others aren't) in the patch. > > + if (pos >= p->num_objects) > > + BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos); > > Personally I find it easier to place items on a single line in an > ascending order of magnitude, i.e. > > if (p->num_objects <= pos) > BUG("..."); More readable, thanks. > The assertion requires pos to be strictly lower than p->num_objects, > which is in line with how we usually count elements of an array of > size p->num_objects, but the next one allows pos == p->num_objects; > intended? > > p->revindex[] is an array of two-member struct, so if an element of > the array is invalid for its .nr member here because pos is exactly > at p->num_objects, I would imagine it is also invalid for its .offset > member, too, no? > > Ah, perhaps the "offset beyond the end of the pack positions" is a > sentinel element to give the in-pack-stream size of the object at > the last pack position? If that is the case, it deserves a comment, > I would think. Exactly. I added a detail about that in the patch, too. Thanks, Taylor [1]: https://lore.kernel.org/git/X%2F3ODgaa9wr65M09@nand.local/
diff --git a/pack-revindex.c b/pack-revindex.c index ecdde39cf4..6d86a85208 100644 --- a/pack-revindex.c +++ b/pack-revindex.c @@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs) return p->revindex + pos; } + +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) +{ + int ret; + + if (load_pack_revindex(p) < 0) + return -1; + + ret = find_revindex_position(p, ofs); + if (ret < 0) + return -1; + *pos = ret; + return 0; +} + +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos) +{ + if (!p->revindex) + BUG("pack_pos_to_index: reverse index not yet loaded"); + if (pos >= p->num_objects) + BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos); + return p->revindex[pos].nr; +} + +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos) +{ + if (!p->revindex) + BUG("pack_pos_to_index: reverse index not yet loaded"); + if (pos > p->num_objects) + BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos); + return p->revindex[pos].offset; +} diff --git a/pack-revindex.h b/pack-revindex.h index 848331d5d6..256c0a9106 100644 --- a/pack-revindex.h +++ b/pack-revindex.h @@ -13,4 +13,8 @@ int find_revindex_position(struct packed_git *p, off_t ofs); struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs); +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos); +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos); +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos); + #endif
In the next several patches, we will prepare for loading a reverse index either in memory, or from a yet-to-be-introduced on-disk format. To do that, we'll introduce an API that avoids the caller explicitly indexing the revindex pointer in the packed_git structure. There are four ways to interact with the reverse index. Accordingly, four functions will be exported from 'pack-revindex.h' by the time that the existing API is removed. A caller may: 1. Load the pack's reverse index. This involves opening up the index, generating an array, and then sorting it. Since opening the index can fail, this function ('load_pack_revindex()') returns an int. Accordingly, it takes only a single argument: the 'struct packed_git' the caller wants to build a reverse index for. This function is well-suited for both the current and new API. Callers will have to continue to open the reverse index explicitly, but this function will eventually learn how to detect and load a reverse index from the on-disk format, if one exists. Otherwise, it will fallback to generating one in memory from scratch. 2. Convert a pack position into an offset. This operation is now called `pack_pos_to_offset()`. It takes a pack and a position, and returns the corresponding off_t. 3. Convert a pack position into an index position. Same as above; this takes a pack and a position, and returns a uint32_t. This operation is known as `pack_pos_to_index()`. 4. Find the pack position for a given offset. This operation is now known as `offset_to_pack_pos()`. It takes a pack, an offset, and a pointer to a uint32_t where the position is written, if an object exists at that offset. Otherwise, -1 is returned to indicate failure. Unlike some of the callers that used to access '->offset' and '->nr' directly, the error checking around this call is somewhat more robust. This is important since callers can pass an offset which does not contain an object. This will become important in a subsequent patch where a caller which does not but could check the return value treats the signed `-1` from `find_revindex_position()` as an index into the 'revindex' array. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- pack-revindex.c | 32 ++++++++++++++++++++++++++++++++ pack-revindex.h | 4 ++++ 2 files changed, 36 insertions(+)