diff mbox series

[01/20] pack-revindex: introduce a new API

Message ID fa6b8309088fd04410ca7276c5cf14db0fb82fb2.1610129796.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series pack-revindex: prepare for on-disk reverse index | expand

Commit Message

Taylor Blau Jan. 8, 2021, 6:16 p.m. UTC
In the next several patches, we will prepare for loading a reverse index
either in memory, or from a yet-to-be-introduced on-disk format. To do
that, we'll introduce an API that avoids the caller explicitly indexing
the revindex pointer in the packed_git structure.

There are four ways to interact with the reverse index. Accordingly,
four functions will be exported from 'pack-revindex.h' by the time that
the existing API is removed. A caller may:

 1. Load the pack's reverse index. This involves opening up the index,
    generating an array, and then sorting it. Since opening the index
    can fail, this function ('load_pack_revindex()') returns an int.
    Accordingly, it takes only a single argument: the 'struct
    packed_git' the caller wants to build a reverse index for.

    This function is well-suited for both the current and new API.
    Callers will have to continue to open the reverse index explicitly,
    but this function will eventually learn how to detect and load a
    reverse index from the on-disk format, if one exists. Otherwise, it
    will fallback to generating one in memory from scratch.

 2. Convert a pack position into an offset. This operation is now
    called `pack_pos_to_offset()`. It takes a pack and a position, and
    returns the corresponding off_t.

 3. Convert a pack position into an index position. Same as above; this
    takes a pack and a position, and returns a uint32_t. This operation
    is known as `pack_pos_to_index()`.

 4. Find the pack position for a given offset. This operation is now
    known as `offset_to_pack_pos()`. It takes a pack, an offset, and a
    pointer to a uint32_t where the position is written, if an object
    exists at that offset. Otherwise, -1 is returned to indicate
    failure.

    Unlike some of the callers that used to access '->offset' and '->nr'
    directly, the error checking around this call is somewhat more
    robust. This is important since callers can pass an offset which
    does not contain an object.

    This will become important in a subsequent patch where a caller
    which does not but could check the return value treats the signed
    `-1` from `find_revindex_position()` as an index into the 'revindex'
    array.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 32 ++++++++++++++++++++++++++++++++
 pack-revindex.h |  4 ++++
 2 files changed, 36 insertions(+)

Comments

Jeff King Jan. 12, 2021, 8:41 a.m. UTC | #1
On Fri, Jan 08, 2021 at 01:16:43PM -0500, Taylor Blau wrote:

> In the next several patches, we will prepare for loading a reverse index
> either in memory, or from a yet-to-be-introduced on-disk format. To do
> that, we'll introduce an API that avoids the caller explicitly indexing
> the revindex pointer in the packed_git structure.

This API looks good to me. Here are a few extra thoughts:

> There are four ways to interact with the reverse index. Accordingly,
> four functions will be exported from 'pack-revindex.h' by the time that
> the existing API is removed. A caller may:

This tells us what the new API functions do. That's useful, but should
it be in the header file itself, documenting each function?

Likewise, I think we'd want to define the concepts in that
documentation. Something like:


  /*
   * A revindex allows converting efficiently between three properties
   * of an object within a pack:
   *
   *  - index position: the numeric position within the list of
   *    sorted object ids found in the .idx file
   *
   *  - pack position: the numeric position within the list of objects
   *    in their order within the actual .pack file (i.e., 0 is the
   *    first object in the .pack, 1 is the second, and so on)
   *
   *  - offset: the byte offset within the .pack file at which the
   *    object contents can be found
   */

And then above each function we can just say that it converts X to Y
(like you have in the commit message). It may also be worth indicating
the run-time of each (some of them are constant-time once you have a
revindex, and some are log(n)).

> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)

The types here make sense. off_t is clearly needed for a pack offset,
and uint32_t is correct for the position fields, because packs have a
4-byte object count.

Separating the error return from the out-parameter makes the interface
slightly more awkward, but is needed to use the properly-sized types.
Makes sense.

> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> +{
> +	int ret;
> +
> +	if (load_pack_revindex(p) < 0)
> +		return -1;

This one lazy-loads the revindex for us, which seems handy...

> +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->revindex)
> +		BUG("pack_pos_to_index: reverse index not yet loaded");
> +	if (pos >= p->num_objects)
> +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> +	return p->revindex[pos].nr;
> +}

But these ones don't. I'm glad we at least catch it with a BUG(), but it
makes the API a little funny. Returning an error here would require a
similarly awkward out-parameter, I guess.

-Peff
Jeff King Jan. 12, 2021, 9:41 a.m. UTC | #2
On Tue, Jan 12, 2021 at 03:41:28AM -0500, Jeff King wrote:

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> > +
> > +	if (load_pack_revindex(p) < 0)
> > +		return -1;
> 
> This one lazy-loads the revindex for us, which seems handy...
> 
> > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> > +{
> > +	if (!p->revindex)
> > +		BUG("pack_pos_to_index: reverse index not yet loaded");
> > +	if (pos >= p->num_objects)
> > +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> > +	return p->revindex[pos].nr;
> > +}
> 
> But these ones don't. I'm glad we at least catch it with a BUG(), but it
> makes the API a little funny. Returning an error here would require a
> similarly awkward out-parameter, I guess.

Having now looked at the callers through the series, I think adding an
error return to pack_pos_to_index() would be really awkward (since it
cannot currently fail).

We _could_ insist that callers of offset_to_pack_pos() also make sure
the revindex is loaded themselves. But it would be annoying and
error-prone to check the existing callers. So I'm OK with leaving this
asymmetry in the API.

-Peff
Taylor Blau Jan. 12, 2021, 4:27 p.m. UTC | #3
On Tue, Jan 12, 2021 at 03:41:28AM -0500, Jeff King wrote:
> > There are four ways to interact with the reverse index. Accordingly,
> > four functions will be exported from 'pack-revindex.h' by the time that
> > the existing API is removed. A caller may:
>
> This tells us what the new API functions do. That's useful, but should
> it be in the header file itself, documenting each function?

Mm, that's a good idea. I took your suggestion for a comment at the top
of pack-revindex.h directly, and then added some of my own commentary
above each function. I avoided documenting the functions we're about to
remove for obvious reasons.

I think that the commit message is OK as-is, since it provides more of a
rationale of what operations need to exist, rather than the specifics of
each implementation.

We'll have to update these again when the on-disk format exists (i.e.,
because all of the runtimes become constant with the exception of going
to the index), but that's a topic for another series.

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>
> The types here make sense. off_t is clearly needed for a pack offset,
> and uint32_t is correct for the position fields, because packs have a
> 4-byte object count.
>
> Separating the error return from the out-parameter makes the interface
> slightly more awkward, but is needed to use the properly-sized types.
> Makes sense.

Yep, exactly.

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> > +
> > +	if (load_pack_revindex(p) < 0)
> > +		return -1;
>
> This one lazy-loads the revindex for us, which seems handy...
>
> > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> > +{
> > +	if (!p->revindex)
> > +		BUG("pack_pos_to_index: reverse index not yet loaded");
> > +	if (pos >= p->num_objects)
> > +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> > +	return p->revindex[pos].nr;
> > +}
>
> But these ones don't. I'm glad we at least catch it with a BUG(), but it
> makes the API a little funny. Returning an error here would require a
> similarly awkward out-parameter, I guess.

It is awkward, but (as you note downthread) the callers of
pack_pos_to_index() make it difficult to do so, so I didn't pursue it
here.

> -Peff

Thanks,
Taylor
Junio C Hamano Jan. 13, 2021, 8:06 a.m. UTC | #4
Taylor Blau <me@ttaylorr.com> writes:

> In the next several patches, we will prepare for loading a reverse index
> either in memory, or from a yet-to-be-introduced on-disk format. To do

Does "load revindex in memory" (as opposed to "from on-disk file")
mean the good old "read the forward index and make inverse map
in-core", or something else?  

IOW, is "We will prepare a reverse index either by computing in
memory from forward index, or loading from on-disk file" what we
want to say here?

> There are four ways to interact with the reverse index. Accordingly,
> four functions will be exported from 'pack-revindex.h' by the time that
> the existing API is removed. A caller may:
>
>  1. Load the pack's reverse index. This involves opening up the index,
>     generating an array, and then sorting it. Since opening the index
>     can fail, this function ('load_pack_revindex()') returns an int.
>     Accordingly, it takes only a single argument: the 'struct
>     packed_git' the caller wants to build a reverse index for.
>
>     This function is well-suited for both the current and new API.
>     Callers will have to continue to open the reverse index explicitly,
>     but this function will eventually learn how to detect and load a
>     reverse index from the on-disk format, if one exists. Otherwise, it
>     will fallback to generating one in memory from scratch.

OK.

>  2. Convert a pack position into an offset. This operation is now
>     called `pack_pos_to_offset()`. It takes a pack and a position, and
>     returns the corresponding off_t.
>
>  3. Convert a pack position into an index position. Same as above; this
>     takes a pack and a position, and returns a uint32_t. This operation
>     is known as `pack_pos_to_index()`.
>
>  4. Find the pack position for a given offset. This operation is now
>     known as `offset_to_pack_pos()`. It takes a pack, an offset, and a
>     pointer to a uint32_t where the position is written, if an object
>     exists at that offset. Otherwise, -1 is returned to indicate
>     failure.

Without knowing what exactly "pack position", "offset" and "index
position" refer to, the above three are almost impossible to grok.
Can we have one paragraph description for each?  Something along the
lines of...

 - Pack position: a packstream consists of series of byte ranges,
   each of which represents an object, so the objects can be
   numbered from 0 (the object whose data is stored at the earliest
   part in the packfile) to N (the object whose data is stored at
   the tail end of the packfile).  The number corresponding to an
   object in this order in the packfile is called the "pack
   position" of the object.

 - Offset: The ofs_t distance between the beginning of a pack stream
   and the beginning of data that represents an object is called the
   "offset" of the object in the packfile.

 - Index position: for a single pack stream, there is a table that
   maps object name to its offset and the entries in this table are
   sorted by the object name (this is what pack ".idx" file is).
   The location (counting from 0) of an object in this table is
   called the "index position" of the object in the packfile.

I am not sure if the above correctly reflects what you meant by
"position", though.

>     Unlike some of the callers that used to access '->offset' and '->nr'
>     directly, the error checking around this call is somewhat more
>     robust. This is important since callers can pass an offset which
>     does not contain an object.

Meaning "offset ought to point at the boundary between objects in
the pack stream, and the API, unlike the direct access, makes sure
that is the case"?  That is a good thing.

>     This will become important in a subsequent patch where a caller
>     which does not but could check the return value treats the signed
>     `-1` from `find_revindex_position()` as an index into the 'revindex'
>     array.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  pack-revindex.c | 32 ++++++++++++++++++++++++++++++++
>  pack-revindex.h |  4 ++++
>  2 files changed, 36 insertions(+)
>
> diff --git a/pack-revindex.c b/pack-revindex.c
> index ecdde39cf4..6d86a85208 100644
> --- a/pack-revindex.c
> +++ b/pack-revindex.c
> @@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
>  
>  	return p->revindex + pos;
>  }
> +
> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> +{
> +	int ret;
> +
> +	if (load_pack_revindex(p) < 0)
> +		return -1;
> +
> +	ret = find_revindex_position(p, ofs);
> +	if (ret < 0)
> +		return -1;

Why not "return ret"?  We know that find_revindex_position() would
signal an error by returning -1, but is there a reason why we want
to prevent it from returning richer errors in the future?

> +	*pos = ret;

The untold assumption is that uint32_t can fit the maximum returned
value from find_revindex_position() and "signed int" can also big
enough.  I guess it is OK to be limited to up-to 2 billion objects
on 32-bit systems.

> +	return 0;
> +}
> +
> +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->revindex)
> +		BUG("pack_pos_to_index: reverse index not yet loaded");

The previous function lazy loaded the revindex, but this one and the
next one refuses to work without revindex.  Intended?

> +	if (pos >= p->num_objects)
> +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);

Personally I find it easier to place items on a single line in an
ascending order of magnitude, i.e.

	if (p->num_objects <= pos)
		BUG("...");

The assertion requires pos to be strictly lower than p->num_objects,
which is in line with how we usually count elements of an array of
size p->num_objects, but the next one allows pos == p->num_objects;
intended?

p->revindex[] is an array of two-member struct, so if an element of
the array is invalid for its .nr member here because pos is exactly
at p->num_objects, I would imagine it is also invalid for its .offset
member, too, no?

Ah, perhaps the "offset beyond the end of the pack positions" is a
sentinel element to give the in-pack-stream size of the object at
the last pack position?  If that is the case, it deserves a comment,
I would think.

> +	return p->revindex[pos].nr;
> +}
> +
> +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->revindex)
> +		BUG("pack_pos_to_index: reverse index not yet loaded");
> +	if (pos > p->num_objects)
> +		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
> +	return p->revindex[pos].offset;
> +}
> diff --git a/pack-revindex.h b/pack-revindex.h
> index 848331d5d6..256c0a9106 100644
> --- a/pack-revindex.h
> +++ b/pack-revindex.h
> @@ -13,4 +13,8 @@ int find_revindex_position(struct packed_git *p, off_t ofs);
>  
>  struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
>  
> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
> +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
> +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
> +
>  #endif
Junio C Hamano Jan. 13, 2021, 8:54 a.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> Without knowing what exactly "pack position", "offset" and "index
> position" refer to, the above three are almost impossible to grok.
> Can we have one paragraph description for each?  Something along the
> lines of...
>
>  - Pack position: a packstream consists of series of byte ranges,
>    each of which represents an object, so the objects can be
>    numbered from 0 (the object whose data is stored at the earliest
>    part in the packfile) to N (the object whose data is stored at
>    the tail end of the packfile).  The number corresponding to an
>    object in this order in the packfile is called the "pack
>    position" of the object.
>
>  - Offset: The ofs_t distance between the beginning of a pack stream
>    and the beginning of data that represents an object is called the
>    "offset" of the object in the packfile.
>
>  - Index position: for a single pack stream, there is a table that
>    maps object name to its offset and the entries in this table are
>    sorted by the object name (this is what pack ".idx" file is).
>    The location (counting from 0) of an object in this table is
>    called the "index position" of the object in the packfile.
>
> I am not sure if the above correctly reflects what you meant by
> "position", though.

Please scratch the above.  I see Peff already pointed out pretty
much the same, and his phrasing looked a lot cleaner than my
attempt.

Thanks.
Jeff King Jan. 13, 2021, 1:17 p.m. UTC | #6
On Wed, Jan 13, 2021 at 12:06:03AM -0800, Junio C Hamano wrote:

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> [...]
> > +	*pos = ret;
> 
> The untold assumption is that uint32_t can fit the maximum returned
> value from find_revindex_position() and "signed int" can also big
> enough.  I guess it is OK to be limited to up-to 2 billion objects
> on 32-bit systems.

Thanks for pointing this out. I recalled there being an "int" problem
somewhere in the revindex code, but I didn't notice it on my
read-through. This bug already exists (the problem is actually in the
find_revindex_position() interface), and is fixed when we inline that
into offset_to_pack_pos() in patch 18.

It might be worth mentioning the fix there.

-Peff
Taylor Blau Jan. 13, 2021, 4:23 p.m. UTC | #7
On Wed, Jan 13, 2021 at 12:06:03AM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > In the next several patches, we will prepare for loading a reverse index
> > either in memory, or from a yet-to-be-introduced on-disk format. To do
>
> Does "load revindex in memory" (as opposed to "from on-disk file")
> mean the good old "read the forward index and make inverse map
> in-core", or something else?

Indeed, that's what it means. I've made that clearer in the patch
message by saying it explicitly.

> IOW, is "We will prepare a reverse index either by computing in
> memory from forward index, or loading from on-disk file" what we
> want to say here?

Yep.

> Without knowing what exactly "pack position", "offset" and "index
> position" refer to, the above three are almost impossible to grok.
> Can we have one paragraph description for each?  Something along the
> lines of...

Yep, and I see later on in the thread that you want to discard this
suggestion since Peff has suggested similar changes in pack-revindex.h,
which I've applied.

> >     Unlike some of the callers that used to access '->offset' and '->nr'
> >     directly, the error checking around this call is somewhat more
> >     robust. This is important since callers can pass an offset which
> >     does not contain an object.
>
> Meaning "offset ought to point at the boundary between objects in
> the pack stream, and the API, unlike the direct access, makes sure
> that is the case"?  That is a good thing.

Indeed, and I've called that out directly in the patch message to
highlight it.

> > diff --git a/pack-revindex.c b/pack-revindex.c
> > index ecdde39cf4..6d86a85208 100644
> > --- a/pack-revindex.c
> > +++ b/pack-revindex.c
> > @@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
> >
> >  	return p->revindex + pos;
> >  }
> > +
> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> > +
> > +	if (load_pack_revindex(p) < 0)
> > +		return -1;
> > +
> > +	ret = find_revindex_position(p, ofs);
> > +	if (ret < 0)
> > +		return -1;
>
> Why not "return ret"?  We know that find_revindex_position() would
> signal an error by returning -1, but is there a reason why we want
> to prevent it from returning richer errors in the future?

No reason, I've changed it to 'return ret' instead.

> > +	*pos = ret;
>
> The untold assumption is that uint32_t can fit the maximum returned
> value from find_revindex_position() and "signed int" can also big
> enough.  I guess it is OK to be limited to up-to 2 billion objects
> on 32-bit systems.

Indeed, and that is fixed in a later on patch. I'll make sure to call it
out there.

> > +	return 0;
> > +}
> > +
> > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> > +{
> > +	if (!p->revindex)
> > +		BUG("pack_pos_to_index: reverse index not yet loaded");
>
> The previous function lazy loaded the revindex, but this one and the
> next one refuses to work without revindex.  Intended?

Yes, and discussed a little bit more here [1]. Obviously that discussion
doesn't do any good to those reading the git log in the future, so I've
summarized the important detail (that some callers are equipped to deal
with errors but others aren't) in the patch.

> > +	if (pos >= p->num_objects)
> > +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
>
> Personally I find it easier to place items on a single line in an
> ascending order of magnitude, i.e.
>
> 	if (p->num_objects <= pos)
> 		BUG("...");

More readable, thanks.

> The assertion requires pos to be strictly lower than p->num_objects,
> which is in line with how we usually count elements of an array of
> size p->num_objects, but the next one allows pos == p->num_objects;
> intended?
>
> p->revindex[] is an array of two-member struct, so if an element of
> the array is invalid for its .nr member here because pos is exactly
> at p->num_objects, I would imagine it is also invalid for its .offset
> member, too, no?
>
> Ah, perhaps the "offset beyond the end of the pack positions" is a
> sentinel element to give the in-pack-stream size of the object at
> the last pack position?  If that is the case, it deserves a comment,
> I would think.

Exactly. I added a detail about that in the patch, too.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/X%2F3ODgaa9wr65M09@nand.local/
diff mbox series

Patch

diff --git a/pack-revindex.c b/pack-revindex.c
index ecdde39cf4..6d86a85208 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -203,3 +203,35 @@  struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
 
 	return p->revindex + pos;
 }
+
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
+{
+	int ret;
+
+	if (load_pack_revindex(p) < 0)
+		return -1;
+
+	ret = find_revindex_position(p, ofs);
+	if (ret < 0)
+		return -1;
+	*pos = ret;
+	return 0;
+}
+
+uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
+{
+	if (!p->revindex)
+		BUG("pack_pos_to_index: reverse index not yet loaded");
+	if (pos >= p->num_objects)
+		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
+	return p->revindex[pos].nr;
+}
+
+off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
+{
+	if (!p->revindex)
+		BUG("pack_pos_to_index: reverse index not yet loaded");
+	if (pos > p->num_objects)
+		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
+	return p->revindex[pos].offset;
+}
diff --git a/pack-revindex.h b/pack-revindex.h
index 848331d5d6..256c0a9106 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -13,4 +13,8 @@  int find_revindex_position(struct packed_git *p, off_t ofs);
 
 struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
 
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
+uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
+off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
+
 #endif