[1/7] chunk-format: introduce `pair_chunk_expect()` helper

Message ID af5fe3b7237caeba8f970e967933db96c83a230e.1699569246.git.me@ttaylorr.com
State New, archived

Commit Message

Taylor Blau Nov. 9, 2023, 10:34 p.m. UTC
In 570b8b8836 (chunk-format: note that pair_chunk() is unsafe,
2023-10-09), the pair_chunk() interface grew a required "size" pointer,
so that the caller is forced to at least have a handle on the actual
size of the given chunk.

Many callers were converted to the new interface. A handful were instead
converted from the unsafe version of pair_chunk() to read_chunk() so
that they could check their expected size.

This led to a lot of code like:

    static int graph_read_oid_lookup(const unsigned char *chunk_start,
                                     size_t chunk_size, void *data)
    {
      struct commit_graph *g = data;
      g->chunk_oid_lookup = chunk_start;
      if (chunk_size / g->hash_len != g->num_commits)
        return error(_("commit-graph OID lookup chunk is the wrong size"));
      return 0;
    }

, leaving the caller to use read_chunk(), like so:

    read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);

The callback to read_chunk() (in the above, `graph_read_oid_lookup()`)
does nothing more than (a) assign a pointer to the location of the start
of the chunk in the mmap'd file, and (b) assert that it has the correct
size.

For callers that know the expected size of their chunk(s) up-front (most
often because they are made up of a known number of fixed-size records),
we can simplify this by teaching the chunk-format API itself to validate
the expected size for us.

This is wrapped in a new function, called `pair_chunk_expect()`, which
takes a pair of "size_t"s (corresponding to the record size and count)
instead of a "size_t *", and validates that the given chunk matches the
expected size.
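
The new function's declaration (as it appears in the patch below) is:

    int pair_chunk_expect(struct chunkfile *cf,
                          uint32_t chunk_id,
                          const unsigned char **p,
                          size_t record_size,
                          size_t record_nr);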

This will allow us to reduce the number of lines of code it takes to
perform these basic read_chunk() operations, by taking the above and
replacing it with something like:

    if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDLOOKUP,
                          &graph->chunk_oid_lookup,
                          graph->hash_len, graph->num_commits))
      error(_("commit-graph oid lookup chunk is wrong size"));

We will perform those transformations in the following commits.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 29 +++++++++++++++++++++++++++++
 chunk-format.h | 13 ++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Nov. 10, 2023, 4:55 a.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}

I know one of the original places did the "divide the whole by
per-record size and see if it matches the number of records", the
same as we see above, but the check in the above could also be

	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
		return -1;

which would also catch the case where chunk_size is not a multiple
of the record size.  Your conversion of OOFF in midx.c loses this
protection as the original uses the multiplication-and-compare, but
the rewrite to call pair_chunk_expect would call the above and
check with the truncating-divide-and-compare.

Does the distinction matter?  I dunno.  If the record/chunk
alignment is asserted elsewhere, then the distinction should not
matter, but even if it were, seeing a truncating division used in
any validation makes my skin tingle.
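
(To make the distinction concrete: with record_size = 8 and
record_nr = 8, a corrupt chunk_size of 65 would pass the truncating
check, since 65 / 8 == 8, but would fail the multiply-and-compare,
since 8 * 8 == 64, not 65.)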

Other than that, the series was a pleasant read.

Thanks.
Taylor Blau Nov. 10, 2023, 4:27 p.m. UTC | #2
On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
>
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
>
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
>
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> check with the truncating-divide-and-compare.

Hmm. I was thinking of Peff's "commit-graph: handle overflow in
chunk_size checks", but I think that I was overly eager in applying the
same reasoning to the MIDX code.

The important piece of the rationale in that patch is as follows:

    In the current code this is only possible for the CDAT chunk, but
    the reasons are quite subtle. We compute g->num_commits by dividing
    the size of the OIDL chunk by the hash length (since it consists of
    a bunch of hashes). So we know that any size_t multiplication that
    uses a value smaller than the hash length cannot overflow. And the
    CDAT records are the only ones that are larger (the others are just
    4-byte records). So it's worth fixing all of these, to make it clear
    that they're not subject to overflow (without having to reason about
    seemingly unrelated code).

In particular, that g->num_commits is computed by dividing the length of
the OIDL chunk by the hash length, thus any size_t multiplication of
g->num_commits with a value smaller than that hash length cannot
overflow.

But I don't think we enjoy the same benefits in the MIDX scenario. In
this case, the num_objects field is just:

    m->num_objects = ntohl(m->chunk_oid_fanout[255])

so I don't think we can make the same guarantees about what is and isn't
safe to compute under size_t arithmetic.

I'd be curious what Peff has to say, but if he agrees with me, I'd
recommend taking the first five patches, and dropping the two
MIDX-related ones.

Thanks,
Taylor
Jeff King Nov. 10, 2023, 9:57 p.m. UTC | #3
On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
> 
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
> 
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
> 
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> check with the truncating-divide-and-compare.
> 
> Does the distinction matter?  I dunno.  If the record/chunk
> alignment is asserted elsewhere, then the distinction should not
> matter, but even if it were, seeing a truncating division used in
> any validation makes my skin tingle.

Yes, the distinction does matter. If pair_chunk_expect_fn() used
st_mult(), then it would die on overflow, rather than returning an
error. For commit-graph files this is propagated up, and we continue the
operation by falling back to the non-graph code. There's a test in t5318
that will fail in this case. See patches 1 and 6 in my series for more
discussion.

One of those patches calls out the truncating division issue, but to
summarize: IMHO this is OK, as what we really want to know is "is it big
enough that we can always ask for NR records of size ELEM", which
division gives us. If we do want to be more precise, but also avoid
die(), we'd need a variant of st_mult() that returns a boolean. I didn't
think it was worth it for this case (but arguably it is something that
would be useful to have in general).
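
For illustration, such a boolean-returning variant might look like the
sketch below (the name and exact shape are hypothetical, not an
existing Git helper; SIZE_MAX is from <stdint.h>):

    /* Hypothetical sketch of a non-dying st_mult() variant. */
    static inline int size_mult_fits(size_t a, size_t b, size_t *out)
    {
        if (b && a > SIZE_MAX / b)
            return 0; /* a * b would overflow size_t */
        *out = a * b;
        return 1;
    }

with the check in pair_chunk_expect_fn() then becoming something like:

    size_t expect;
    if (!size_mult_fits(pcd->record_size, pcd->record_nr, &expect) ||
        chunk_size != expect)
        return -1;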

-Peff
Jeff King Nov. 10, 2023, 10:01 p.m. UTC | #4
On Fri, Nov 10, 2023 at 11:27:41AM -0500, Taylor Blau wrote:

> Hmm. I was thinking of Peff's "commit-graph: handle overflow in
> chunk_size checks", but I think that I was overly eager in applying the
> same reasoning to the MIDX code.
> 
> The important piece of the rationale in that patch is as follows:
> 
>     In the current code this is only possible for the CDAT chunk, but
>     the reasons are quite subtle. We compute g->num_commits by dividing
>     the size of the OIDL chunk by the hash length (since it consists of
>     a bunch of hashes). So we know that any size_t multiplication that
>     uses a value smaller than the hash length cannot overflow. And the
>     CDAT records are the only ones that are larger (the others are just
>     4-byte records). So it's worth fixing all of these, to make it clear
>     that they're not subject to overflow (without having to reason about
>     seemingly unrelated code).
> 
> In particular, that g->num_commits is computed by dividing the length of
> the OIDL chunk by the hash length, thus any size_t multiplication of
> g->num_commits with a value smaller than that hash length cannot
> overflow.
> 
> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
> 
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
> 
> so I don't think we can make the same guarantees about what is and isn't
> safe to compute under size_t arithmetic.

Yes, the logic does not apply to the midx code (just like the graph code
after we switch to using the fanout value later in my series). But that
paragraph was just explaining why only the CDAT chunk was _currently_
vulnerable.

The more interesting question is the paragraphs after that, about
whether it is OK to die() or not when we see overflow (and IMHO it is
not for commit-graphs).

The situation _is_ different there for midx's, because we immediately
die() if we see a bogus midx anyway. But I think that is wrong, and
something we may want to change in the long run. Both because it's the
wrong thing for lib-ification, but also because we can easily keep going
if the midx is broken; we can still use the individual pack idx files.

> I'd be curious what Peff has to say, but if he agrees with me, I'd
> recommend taking the first five patches, and dropping the two
> MIDX-related ones.

I think dropping those is a bad direction. The point of adding
pair_chunk_expect() is that we could use it consistently.

-Peff
Jeff King Nov. 10, 2023, 10:08 p.m. UTC | #5
On Thu, Nov 09, 2023 at 05:34:11PM -0500, Taylor Blau wrote:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}

I think this is taking us backwards in terms of the output the user sees
(see the message I just sent elsewhere in the thread). The user won't be
able to tell the difference between a missing chunk and one with the
wrong size.

And we miss the opportunity to give more details about the expected and
detected sizes of the chunks.

If the two-line error message really bothers you, then...

> +int pair_chunk_expect(struct chunkfile *cf,
> +		      uint32_t chunk_id,
> +		      const unsigned char **p,
> +		      size_t record_size,
> +		      size_t record_nr)
> +{
> +	struct pair_chunk_data pcd = {
> +		.p = p,
> +		.record_size = record_size,
> +		.record_nr = record_nr,
> +	};
> +	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
> +}

...this is where we could take a CHUNK_REQUIRED flag, and so:

  ret = read_chunk(...);
  /* pair_chunk_expect_fn() wrote an error already for other cases */
  if (ret == CHUNK_MISSING)
	error("chunk not found");
  return ret;

And then the callers become a very nice:

  if (pair_chunk_expect(cf, id, &ptr, size, nr, CHUNK_REQUIRED))
	return -1;

You probably would need to pass some kind of string giving more context
for the error messages, though. We can show the chunk id, but the
context of "commit-graph" vs "midx" is important.

-Peff
Jeff King Nov. 10, 2023, 10:09 p.m. UTC | #6
On Fri, Nov 10, 2023 at 04:57:47PM -0500, Jeff King wrote:

> One of those patches calls out the truncating division issue, but to
> summarize: IMHO this is OK, as what we really want to know is "is it big
> enough that we can always ask for NR records of size ELEM", which
> division gives us. If we do want to be more precise, but also avoid
> die(), we'd need a variant of st_mult() that returns a boolean. I didn't
> think it was worth it for this case (but arguably it is something that
> would be useful to have in general).

Oh, and obviously there is another option here if we want to be more
careful but don't want to introduce an st_mult() variant: we can use "%"
to check for divisibility ourselves.

I don't think it's worth doing that in every individual size-check, but
maybe it would be in a central pair_chunk_expect().

-Peff
Junio C Hamano Nov. 10, 2023, 11:38 p.m. UTC | #7
Taylor Blau <me@ttaylorr.com> writes:

> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
>
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
>
> so I don't think we can make the same guarantees about what is and isn't
> safe to compute under size_t arithmetic.

So ..., from a correctness point of view, if we do not mind
st_mult() dying, the "multiply-and-compare" should give us much more
robust results.  If we do mind st_mult() dying, we could pair the
"truncating-division-and-compare" you wrote with "ensure that chunk
size is a multiple of record size" to achieve the same, I would
think.  I.e.,

	if (chunk_size % pcd->record_size ||
	    chunk_size / pcd->record_size != pcd->record_nr)
		return -1;

or something like that.
Junio C Hamano Nov. 10, 2023, 11:39 p.m. UTC | #8
Jeff King <peff@peff.net> writes:

> I think dropping those is a bad direction. The point of adding
> pair_chunk_expect() is that we could use it consistently.

I think so, too.  If we are adding a helper to avoid common
mistakes, and if it cannot be used in some situations, then the
helper needs to be improved.
Linus Arver Jan. 15, 2024, 10:31 p.m. UTC | #9
Taylor Blau <me@ttaylorr.com> writes:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}
> +

I don't think this function should assume anything about the inputs
(chunk_size, record_size, or record_nr). If we are asking the helper to
be the common tool for multiple locations, it should assume even less
about what these inputs could look like.

For example, if record_size is 0 this will lead to a
divide-by-zero. Likewise, if record_nr is zero it doesn't make
sense to check if chunk_size fits into record_size * record_nr.

And even if there are no (unexpected) zero-values involved, shouldn't we
also check for nonsensical comparisons, such as if chunk_size is smaller
than record_size?

I think there are several possibilities:

CHUNK_MISSING   (chunk_size == 0)
CHUNK_TOO_SMALL (chunk_size < record_size)
CHUNK_OK        (chunk_size == record_nr * record_size)
CHUNK_TOO_BIG   (chunk_size > record_size, record_nr * record_size < chunk_size)

And for the CHUNK_TOO_BIG case there are two possibilities --- the
leftover parts of chunk_size that are not accounted for by "record_nr *
record_size" are (or are not) "aligned" such that increasing record_nr
by some positive increment could exactly match the chunk_size. For
example, chunk_size = 128, record_size = 8, record_nr = 8. In this case
the records account for 64 bytes so we have 128 - 64 = 64 bytes left
over, and simply increasing record_nr to 16 could account for every
byte in chunk_size. If chunk_size in this example were 121 or 130,
there would be bytes left over that cannot be accounted for by
adjusting record_nr. This divisibility-of-leftover-bytes check could be
done with '%' as mentioned already by others.

So in summary there appear to be the following possibilities:

CHUNK_MISSING
CHUNK_TOO_SMALL
CHUNK_OK
CHUNK_TOO_BIG_ALIGNED
CHUNK_TOO_BIG_MISALIGNED

(on top of the cases where record_* inputs are zero).

And it seems prudent to treat each of the not-OK cases differently
(including how we report error messages) once we know which category we
fall into.
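
A sketch of that classification (using the names from the list above;
the helper itself is hypothetical, record_size is assumed non-zero, and
the "too small" arm folds in the under-sized-for-record_nr case from
the follow-up below) could look like:

    enum chunk_fit {
        CHUNK_MISSING,
        CHUNK_TOO_SMALL,
        CHUNK_OK,
        CHUNK_TOO_BIG_ALIGNED,
        CHUNK_TOO_BIG_MISALIGNED,
    };

    static enum chunk_fit classify_chunk(size_t chunk_size,
                                         size_t record_size,
                                         size_t record_nr)
    {
        if (!chunk_size)
            return CHUNK_MISSING;
        if (chunk_size / record_size < record_nr)
            return CHUNK_TOO_SMALL;
        if (chunk_size / record_size == record_nr &&
            !(chunk_size % record_size))
            return CHUNK_OK;
        return (chunk_size % record_size) ? CHUNK_TOO_BIG_MISALIGNED
                                          : CHUNK_TOO_BIG_ALIGNED;
    }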
Linus Arver Jan. 15, 2024, 10:53 p.m. UTC | #10
Linus Arver <linusa@google.com> writes:

> So in summary there appear to be the following possibilities:
>
> CHUNK_MISSING
> CHUNK_TOO_SMALL
> CHUNK_OK
> CHUNK_TOO_BIG_ALIGNED
> CHUNK_TOO_BIG_MISALIGNED

On second thought, it appears that CHUNK_TOO_SMALL has three cases:

(1) chunk_size < record_size
(2) chunk_size >= record_size, but chunk_size < record_size * record_nr
    and decreasing record_nr can make chunk_size "fit"
(3) chunk_size >= record_size, but chunk_size < record_size * record_nr
    and decreasing record_nr cannot make chunk_size "fit"

where (2) and (3) are just like the *_(MIS)ALIGNED cases above when
chunk_size is too big.

My default position is that these additional cases might need to be
treated differently, though maybe that's overkill.

Thanks.
Jeff King Jan. 16, 2024, 3:10 p.m. UTC | #11
On Mon, Jan 15, 2024 at 02:31:12PM -0800, Linus Arver wrote:

> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
> > +
> 
> I don't think this function should assume anything about the inputs
> (chunk_size, record_size, nor record_nr). If we are asking the helper to
> be the common tool for multiple locations, it should assume even less
> about what these inputs could look like.
>
> For example, if record_size is 0 this will lead to a
> divide-by-zero. Likewise, if record_nr is zero it doesn't make
> sense to check if chunk_size fits into record_size * record_nr.

I'm not sure that divide-by-zero is a big deal, because 0-length
fixed-size records do not make any sense to ask about. And even if
somebody accidentally passed 0, even though it won't be caught by the
compiler, it would barf on any input, so even rudimentary testing would
catch it.

A zero record_nr is a perfectly reasonable thing to ask about. If you
have a chunk file with zero entries for some entity, then we are
checking that the chunk is the expected zero length as well.

> And even if there are no (unexpected) zero-values involved, shouldn't we
> also check for nonsensical comparisons, such as if chunk_size is smaller
> than record_size?

Aren't we checking that already? If chunk_size is less than record_size,
then the division will result in "0". If the expected number of records
is not also 0, then that's a bogus file.

What we really care about here is that we won't walk off the end of the
buffer while looking at N fixed-size records (in that sense, the "too
big" test is overly careful, but it's easy to include as a sanity
check).

> So in summary there appear to be the following possibilities:
> 
> CHUNK_MISSING
> CHUNK_TOO_SMALL
> CHUNK_OK
> CHUNK_TOO_BIG_ALIGNED
> CHUNK_TOO_BIG_MISALIGNED

So yes, I agree all these cases exist. We detect them all except the
"misaligned" case (which I think was discussed earlier in the thread as
not worth caring too much about).

But...

> (on top of the cases where record_* inputs are zero).
> 
> And it seems prudent to treat each of the not-OK cases differently
> (including how we report error messages) once we know which category we
> fall into.

I'm not sure it is worth caring too much about the different cases. From
the caller's perspective the end result is always to avoid using the
chunk/file. From the user's perspective, any error of the form "we
expected N bytes and got X" is plenty. Especially since record_nr
typically comes from another part of the file, we cannot tell if our
chunk is too big/small, or if another chunk gave us a bogus record_nr.

-Peff
Linus Arver Jan. 18, 2024, 11:59 p.m. UTC | #12
Jeff King <peff@peff.net> writes:

> On Mon, Jan 15, 2024 at 02:31:12PM -0800, Linus Arver wrote:
>
>> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
>> > +				size_t chunk_size,
>> > +				void *data)
>> > +{
>> > +	struct pair_chunk_data *pcd = data;
>> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
>> > +		return -1;
>> > +	*pcd->p = chunk_start;
>> > +	return 0;
>> > +}
>> > +
>> 
>> I don't think this function should assume anything about the inputs
>> (chunk_size, record_size, nor record_nr). If we are asking the helper to
>> be the common tool for multiple locations, it should assume even less
>> about what these inputs could look like.
>>
>> For example, if record_size is 0 this will lead to a
>> divide-by-zero. Likewise, if record_nr is zero it doesn't make
>> sense to check if chunk_size fits into record_size * record_nr.
>
> I'm not sure that divide-by-zero is a big deal, because 0-length
> fixed-size records do not make any sense to ask about.

So why not make this function check for this? While it may be true that
0-length fixed-size records are impossible currently, nothing
guarantees they will stay that way in the future.

> And even if
> somebody accidentally passed 0, even though it won't be caught by the
> compiler, it would barf on any input, so even rudimentary testing would
> catch it.

If somebody is accidentally passing an invalid value to a function, then
it feels right for that function to be able to handle it instead of
crashing (or doing any other undefined behavior) with division-by-zero.

Taking a step back though, maybe I am being overly defensive about
possible failure modes. I don't know the surrounding area well so I
might be overreacting.

> A zero record_nr is a perfectly reasonable thing to ask about. If you
> have a chunk file with zero entries for some entity, then we are
> checking that the chunk is the expected zero length as well.

Right.

>> And even if there are no (unexpected) zero-values involved, shouldn't we
>> also check for nonsensical comparisons, such as if chunk_size is smaller
>> than record_size?
>
> Aren't we checking that already? If chunk_size is less than record_size,
> then the division will result in "0". If the expected number of records
> is not also 0, then that's a bogus file.

I was thinking of an early return like

    if (chunk_size < record_size)
        return CHUNK_TOO_SMALL

before evaluating (chunk_size / pcd->record_size != pcd->record_nr).
You're correct that the division will result in "0" if chunk_size is
less than record_size. But I didn't like the extra mental load of
reading and verifying the correctness of "if (chunk_size /
pcd->record_size != pcd->record_nr)" for that case. IOW, the more early
returns we have for known-bad cases, the easier it becomes to
understand "if (chunk_size / pcd->record_size != pcd->record_nr)" by
the time we get to it.
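
A sketch of pair_chunk_expect_fn() in that early-return style
(illustrative only; it folds in the '%' divisibility check suggested
earlier in the thread) might read:

    static int pair_chunk_expect_fn(const unsigned char *chunk_start,
                                    size_t chunk_size,
                                    void *data)
    {
        struct pair_chunk_data *pcd = data;

        if (!pcd->record_size)
            return -1; /* zero-length records make no sense */
        if (chunk_size % pcd->record_size)
            return -1; /* not a whole number of records */
        if (chunk_size / pcd->record_size != pcd->record_nr)
            return -1; /* record count does not match expectation */
        *pcd->p = chunk_start;
        return 0;
    }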
 
> What we really care about here is that we won't walk off the end of the
> buffer while looking at N fixed-size records ...

Ah, I see. This sort of insight would be great to have as a comment in
the code.

> ... (in that sense, the "too
> big" test is overly careful, but it's easy to include as a sanity
> check).

OK.

>> So in summary there appear to be the following possibilities:
>> 
>> CHUNK_MISSING
>> CHUNK_TOO_SMALL
>> CHUNK_OK
>> CHUNK_TOO_BIG_ALIGNED
>> CHUNK_TOO_BIG_MISALIGNED
>
> So yes, I agree all these cases exist. We detect them all except the
> "misaligned" case (which I think was discussed earlier in the thread as
> not worth caring too much about).

OK.

> But...
>
>> (on top of the cases where record_* inputs are zero).
>> 
>> And it seems prudent to treat each of the not-OK cases differently
>> (including how we report error messages) once we know which category we
>> fall into.
>
> I'm not sure it is worth caring too much about the different cases. From
> the caller's perspective the end result is always to avoid using the
> chunk/file.

Ah OK. Then yes, it does seem like caring about the different cases is
too much from the callers' perspective.

But I do think that checking the different cases with early returns will
at least help readability (and as a bonus assure future Git developers
that divide-by-zero errors are impossible).

Patch

diff --git a/chunk-format.c b/chunk-format.c
index cdc7f39b70..be078dcca8 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -163,6 +163,10 @@  int read_table_of_contents(struct chunkfile *cf,
 struct pair_chunk_data {
 	const unsigned char **p;
 	size_t *size;
+
+	/* for pair_chunk_expect() only */
+	size_t record_size;
+	size_t record_nr;
 };
 
 static int pair_chunk_fn(const unsigned char *chunk_start,
@@ -175,6 +179,17 @@  static int pair_chunk_fn(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int pair_chunk_expect_fn(const unsigned char *chunk_start,
+				size_t chunk_size,
+				void *data)
+{
+	struct pair_chunk_data *pcd = data;
+	if (chunk_size / pcd->record_size != pcd->record_nr)
+		return -1;
+	*pcd->p = chunk_start;
+	return 0;
+}
+
 int pair_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       const unsigned char **p,
@@ -184,6 +199,20 @@  int pair_chunk(struct chunkfile *cf,
 	return read_chunk(cf, chunk_id, pair_chunk_fn, &pcd);
 }
 
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t record_size,
+		      size_t record_nr)
+{
+	struct pair_chunk_data pcd = {
+		.p = p,
+		.record_size = record_size,
+		.record_nr = record_nr,
+	};
+	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
+}
+
 int read_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       chunk_read_fn fn,
diff --git a/chunk-format.h b/chunk-format.h
index 14b76180ef..10806d7a9a 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -17,7 +17,8 @@  struct chunkfile;
  *
  * If reading a file, use a NULL 'struct hashfile *' and then call
  * read_table_of_contents(). Supply the memory-mapped data to the
- * pair_chunk() or read_chunk() methods, as appropriate.
+ * pair_chunk(), pair_chunk_expect(), or read_chunk() methods, as
+ * appropriate.
  *
  * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
  * for reading and writing.
@@ -54,6 +55,16 @@  int pair_chunk(struct chunkfile *cf,
 	       const unsigned char **p,
 	       size_t *size);
 
+/*
+ * Similar to 'pair_chunk', but used for callers who are reading a chunk
+ * with a known number of fixed-width records.
+ */
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t record_size,
+		      size_t record_nr);
+
 typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
 			     size_t chunk_size, void *data);
 /*