| Message ID | af5fe3b7237caeba8f970e967933db96c83a230e.1699569246.git.me@ttaylorr.com |
|---|---|
| State | New, archived |
| Series | chunk-format: introduce `pair_chunk_expect()` |
Taylor Blau <me@ttaylorr.com> writes:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}

I know one of the original places did the "divide the whole by
per-record size and see if it matches the number of records", the
same as we see above, but the check in the above could also be

	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
		return -1;

which would also catch the case where chunk_size is not a multiple
of the record size.  Your conversion of OOFF in midx.c loses this
protection as the original uses the multiplication-and-compare, but
the rewrite to call pair_chunk_expect would call the above and
checks with the truncating-divide-and-compare.

Does the distinction matter?  I dunno.  If the record/chunk
alignment is asserted elsewhere, then the distinction should not
matter, but even if it were, seeing a truncating division used in
any validation makes my skin tingle.

Other than that, the series was a pleasant read.  Thanks.
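To make the distinction concrete, here is a minimal sketch (the numbers are
invented for illustration; this is not code from the series): when the chunk
size is not a multiple of the record size, the truncating division can still
"match" the record count, while the multiply-and-compare rejects it.

	size_t chunk_size = 100, record_size = 8, record_nr = 12;

	/* truncating divide-and-compare: 100 / 8 == 12, so the bad size slips through */
	if (chunk_size / record_size != record_nr)
		return -1;

	/* multiply-and-compare: st_mult(8, 12) == 96 != 100, so the bad size is caught */
	if (chunk_size != st_mult(record_size, record_nr))
		return -1;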
On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
>
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
>
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
>
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> checks with the truncating-divide-and-compare.

Hmm. I was thinking of Peff's "commit-graph: handle overflow in
chunk_size checks", but I think that I was overly eager in applying the
same reasoning to the MIDX code.

The important piece of the rationale in that patch is as follows:

    In the current code this is only possible for the CDAT chunk, but
    the reasons are quite subtle. We compute g->num_commits by dividing
    the size of the OIDL chunk by the hash length (since it consists of
    a bunch of hashes). So we know that any size_t multiplication that
    uses a value smaller than the hash length cannot overflow. And the
    CDAT records are the only ones that are larger (the others are just
    4-byte records). So it's worth fixing all of these, to make it clear
    that they're not subject to overflow (without having to reason about
    seemingly unrelated code).

In particular, that g->num_commits is computed by dividing the length of
the OIDL chunk by the hash length, thus any size_t multiplication of
g->num_commits with a value smaller than that hash length cannot
overflow.

But I don't think we enjoy the same benefits in the MIDX scenario. In
this case, the num_objects field is just:

    m->num_objects = ntohl(m->chunk_oid_fanout[255])

so I don't think we can make the same guarantees about what is and isn't
safe to compute under size_t arithmetic.

I'd be curious what Peff has to say, but if he agrees with me, I'd
recommend taking the first five patches, and dropping the two
MIDX-related ones.

Thanks,
Taylor
On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
>
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
>
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
>
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> checks with the truncating-divide-and-compare.
>
> Does the distinction matter?  I dunno.  If the record/chunk
> alignment is asserted elsewhere, then the distinction should not
> matter, but even if it were, seeing a truncating division used in
> any validation makes my skin tingle.

Yes, the distinction does matter. If pair_chunk_expect_fn() used
st_mult(), then it would die on overflow, rather than returning an
error. For commit-graph files this is propagated up, and we continue
the operation by falling back to the non-graph code. There's a test in
t5318 that will fail in this case.

See patches 1 and 6 in my series for more discussion.

One of those patches calls out the truncating division issue, but to
summarize: IMHO this is OK, as what we really want to know is "is it big
enough that we can always ask for NR records of size ELEM", which
division gives us. If we do want to be more precise, but also avoid
die(), we'd need a variant of st_mult() that returns a boolean. I didn't
think it was worth it for this case (but arguably it is something that
would be useful to have in general).

-Peff
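One plausible shape for such a boolean-returning variant, sketched here for
illustration only: Git already has the unsigned_mult_overflows() macro in
git-compat-util.h (it is what st_mult() itself checks before dying), so a
non-dying helper could wrap it. The helper name below is invented, not part
of the series.

	/* hypothetical non-dying counterpart to st_mult() */
	static int size_mult_ok(size_t a, size_t b, size_t *out)
	{
		if (unsigned_mult_overflows(a, b))
			return 0;
		*out = a * b;
		return 1;
	}

	/* ...which would let pair_chunk_expect_fn() report overflow as an error: */
	size_t expect;
	if (!size_mult_ok(pcd->record_size, pcd->record_nr, &expect) ||
	    chunk_size != expect)
		return -1;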
On Fri, Nov 10, 2023 at 11:27:41AM -0500, Taylor Blau wrote:
> Hmm. I was thinking of Peff's "commit-graph: handle overflow in
> chunk_size checks", but I think that I was overly eager in applying the
> same reasoning to the MIDX code.
>
> The important piece of the rationale in that patch is as follows:
>
>     In the current code this is only possible for the CDAT chunk, but
>     the reasons are quite subtle. We compute g->num_commits by dividing
>     the size of the OIDL chunk by the hash length (since it consists of
>     a bunch of hashes). So we know that any size_t multiplication that
>     uses a value smaller than the hash length cannot overflow. And the
>     CDAT records are the only ones that are larger (the others are just
>     4-byte records). So it's worth fixing all of these, to make it clear
>     that they're not subject to overflow (without having to reason about
>     seemingly unrelated code).
>
> In particular, that g->num_commits is computed by dividing the length of
> the OIDL chunk by the hash length, thus any size_t multiplication of
> g->num_commits with a value smaller than that hash length cannot
> overflow.
>
> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
>
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
>
> so I don't think we can make the same guarantees about what is and isn't
> safe to compute under size_t arithmetic.

Yes, the logic does not apply to the midx code (just like the graph code
after we switch to using the fanout value later in my series). But that
paragraph was just explaining why only the CDAT chunk was _currently_
vulnerable.

The more interesting question is the paragraphs after that, about
whether it is OK to die() or not when we see overflow (and IMHO it is
not for commit-graphs). The situation _is_ different there for midx's,
because we immediately die() if we see a bogus midx anyway. But I think
that is wrong, and something we may want to change in the long run. Both
because it's the wrong thing for lib-ification, but also because we can
easily keep going if the midx is broken; we can still use the individual
pack idx files.

> I'd be curious what Peff has to say, but if he agrees with me, I'd
> recommend taking the first five patches, and dropping the two
> MIDX-related ones.

I think dropping those is a bad direction. The point of adding
pair_chunk_expect() is that we could use it consistently.

-Peff
On Thu, Nov 09, 2023 at 05:34:11PM -0500, Taylor Blau wrote:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}

I think this is taking us backwards in terms of the output the user
sees (see the message I just sent elsewhere in the thread). The user
won't be able to tell the difference between a missing chunk and one
with the wrong size. And we miss the opportunity to give more details
about the expected and detected sizes of the chunks.

If the two-line error message really bothers you, then...

> +int pair_chunk_expect(struct chunkfile *cf,
> +		      uint32_t chunk_id,
> +		      const unsigned char **p,
> +		      size_t record_size,
> +		      size_t record_nr)
> +{
> +	struct pair_chunk_data pcd = {
> +		.p = p,
> +		.record_size = record_size,
> +		.record_nr = record_nr,
> +	};
> +	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
> +}

...this is where we could take a CHUNK_REQUIRED flag, and so:

	ret = read_chunk(...);
	/* pair_chunk_expect_fn() wrote an error already for other cases */
	if (ret == CHUNK_MISSING)
		error("chunk not found");
	return ret;

And then the callers become a very nice:

	if (pair_chunk_expect(cf, id, &ptr, size, nr, CHUNK_REQUIRED))
		return -1;

You probably would need to pass some kind of string giving more context
for the error messages, though. We can show the chunk id, but the
context of "commit-graph" vs "midx" is important.

-Peff
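Filling in that outline as a self-contained sketch may make the shape clearer.
Everything below that is not in the posted patch --- the CHUNK_MISSING and
CHUNK_REQUIRED names, their values, the flags parameter, and the error wording
--- is an assumption taken from the sketch above, not code from the series.

	#define CHUNK_MISSING  (-2)	/* hypothetical return value from read_chunk() */
	#define CHUNK_REQUIRED (1 << 0)

	int pair_chunk_expect(struct chunkfile *cf,
			      uint32_t chunk_id,
			      const unsigned char **p,
			      size_t record_size,
			      size_t record_nr,
			      unsigned flags)
	{
		struct pair_chunk_data pcd = {
			.p = p,
			.record_size = record_size,
			.record_nr = record_nr,
		};
		int ret = read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);

		/* pair_chunk_expect_fn() wrote an error already for other cases */
		if (ret == CHUNK_MISSING && (flags & CHUNK_REQUIRED))
			error(_("required chunk %"PRIx32" not found"), chunk_id);
		return ret;
	}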
On Fri, Nov 10, 2023 at 04:57:47PM -0500, Jeff King wrote:
> One of those patches calls out the truncating division issue, but to
> summarize: IMHO this is OK, as what we really want to know is "is it big
> enough that we can always ask for NR records of size ELEM", which
> division gives us. If we do want to be more precise, but also avoid
> die(), we'd need a variant of st_mult() that returns a boolean. I didn't
> think it was worth it for this case (but arguably it is something that
> would be useful to have in general).

Oh, and obviously there is another option here if we want to be more
careful but don't want to introduce an st_mult() variant: we can use "%"
to check for divisibility ourselves. I don't think it's worth doing that
in every individual size-check, but maybe it would be in a central
pair_chunk_expect().

-Peff
Taylor Blau <me@ttaylorr.com> writes:

> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
>
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
>
> so I don't think we can make the same guarantees about what is and isn't
> safe to compute under size_t arithmetic.

So ..., from a correctness point of view, if we do not mind st_mult()
dying, the "multiply-and-compare" should give us much more robust
results. If we do mind st_mult() dying, we could pair the
"truncating-division-and-compare" you wrote with "ensure that chunk
size is a multiple of record size" to achieve the same, I would think.
I.e.,

	if (chunk_size % pcd->record_size ||
	    chunk_size / pcd->record_size != pcd->record_nr)
		return -1;

or something like that.
Jeff King <peff@peff.net> writes:

> I think dropping those is a bad direction. The point of adding
> pair_chunk_expect() is that we could use it consistently.

I think so, too. If we are adding a helper to avoid common mistakes, and
if it cannot be used in some situations, then the helper needs to be
improved.
Taylor Blau <me@ttaylorr.com> writes:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}
> +

I don't think this function should assume anything about the inputs
(chunk_size, record_size, nor record_nr). If we are asking the helper to
be the common tool for multiple locations, it should assume even less
about what these inputs could look like.

For example, if record_size is 0 this will lead to a
divide-by-zero. Likewise, if record_nr is zero it doesn't make
sense to check if chunk_size fits into record_size * record_nr.

And even if there are no (unexpected) zero-values involved, shouldn't we
also check for nonsensical comparisons, such as if chunk_size is smaller
than record_size?

I think there are several possibilities:

CHUNK_MISSING (chunk_size == 0)
CHUNK_TOO_SMALL (chunk_size < record_size)
CHUNK_OK (chunk_size == record_nr * record_size)
CHUNK_TOO_BIG (chunk_size > record_size, record_nr * record_size < chunk_size)

And for the CHUNK_TOO_BIG case there are two possibilities --- the
leftover parts of chunk_size that are not accounted for by "record_nr *
record_size" are (or are not) "aligned", such that increasing record_nr
by some positive increment could exactly match the chunk_size.

For example, take chunk_size = 128, record_size = 8, record_nr = 8. In
this case the records account for 64 bytes, so we have 128 - 64 = 64
bytes left over, and simply increasing record_nr to 16 could account for
every byte in chunk_size. If chunk_size in this example was 125 or 130,
there would be bytes left over that cannot be accounted for by adjusting
record_nr. This divisibility-of-leftover-bytes check could be done with
'%' as mentioned already by others.

So in summary there appear to be the following possibilities:

CHUNK_MISSING
CHUNK_TOO_SMALL
CHUNK_OK
CHUNK_TOO_BIG_ALIGNED
CHUNK_TOO_BIG_MISALIGNED

(on top of the cases where record_* inputs are zero).

And it seems prudent to treat each of the not-OK cases differently
(including how we report error messages) once we know which category we
fall into.
Linus Arver <linusa@google.com> writes:

> So in summary there appear to be the following possibilities:
>
> CHUNK_MISSING
> CHUNK_TOO_SMALL
> CHUNK_OK
> CHUNK_TOO_BIG_ALIGNED
> CHUNK_TOO_BIG_MISALIGNED

On second thought, it appears that CHUNK_TOO_SMALL has three cases:

(1) chunk_size < record_size

(2) chunk_size >= record_size, but chunk_size < record_size * record_nr
    and decreasing record_nr can make chunk_size "fit"

(3) chunk_size >= record_size, but chunk_size < record_size * record_nr
    and decreasing record_nr cannot make chunk_size "fit"

where (2) and (3) are just like the *_(MIS)ALIGNED cases above when
chunk_size is too big.

My default position is that these additional cases might need to be
treated differently, although maybe it's overkill also.

Thanks.
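For readers who want the whole classification in one place, here is a minimal
sketch that folds together the two messages above. It is illustrative only ---
the enum, the function, and the choice to classify purely by division and
modulo are assumptions of this sketch, not code from the series.

	enum chunk_fit {
		CHUNK_MISSING,
		CHUNK_TOO_SMALL,
		CHUNK_OK,
		CHUNK_TOO_BIG_ALIGNED,
		CHUNK_TOO_BIG_MISALIGNED,
	};

	static enum chunk_fit classify_chunk(size_t chunk_size,
					     size_t record_size,
					     size_t record_nr)
	{
		if (!record_size)
			BUG("zero-width records make no sense");
		if (!chunk_size)
			return record_nr ? CHUNK_MISSING : CHUNK_OK;
		if (chunk_size / record_size < record_nr)
			return CHUNK_TOO_SMALL;
		if (chunk_size / record_size == record_nr)
			return chunk_size % record_size ?
				CHUNK_TOO_BIG_MISALIGNED : CHUNK_OK;
		return chunk_size % record_size ?
			CHUNK_TOO_BIG_MISALIGNED : CHUNK_TOO_BIG_ALIGNED;
	}

Because only division and modulo are used, no multiplication can overflow,
which keeps the helper from having to die() on bogus on-disk values.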
On Mon, Jan 15, 2024 at 02:31:12PM -0800, Linus Arver wrote:

> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
> > +
>
> I don't think this function should assume anything about the inputs
> (chunk_size, record_size, nor record_nr). If we are asking the helper to
> be the common tool for multiple locations, it should assume even less
> about what these inputs could look like.
>
> For example, if record_size is 0 this will lead to a
> divide-by-zero. Likewise, if record_nr is zero it doesn't make
> sense to check if chunk_size fits into record_size * record_nr.

I'm not sure that divide-by-zero is a big deal, because 0-length
fixed-size records do not make any sense to ask about. And even if
somebody accidentally passed 0, even though it won't be caught by the
compiler, it would barf on any input, so even rudimentary testing would
catch it.

A zero record_nr is a perfectly reasonable thing to ask about. If you
have a chunk file with zero entries for some entity, then we are
checking that the chunk is the expected zero length as well.

> And even if there are no (unexpected) zero-values involved, shouldn't we
> also check for nonsensical comparisons, such as if chunk_size is smaller
> than record_size?

Aren't we checking that already? If chunk_size is less than record_size,
then the division will result in "0". If the expected number of records
is not also 0, then that's a bogus file.

What we really care about here is that we won't walk off the end of the
buffer while looking at N fixed-size records (in that sense, the "too
big" test is overly careful, but it's easy to include as a sanity
check).

> So in summary there appear to be the following possibilities:
>
> CHUNK_MISSING
> CHUNK_TOO_SMALL
> CHUNK_OK
> CHUNK_TOO_BIG_ALIGNED
> CHUNK_TOO_BIG_MISALIGNED

So yes, I agree all these cases exist. We detect them all except the
"misaligned" case (which I think was discussed earlier in the thread as
not worth caring too much about). But...

> (on top of the cases where record_* inputs are zero).
>
> And it seems prudent to treat each of the not-OK cases differently
> (including how we report error messages) once we know which category we
> fall into.

I'm not sure it is worth caring too much about the different cases. From
the caller's perspective the end result is always to avoid using the
chunk/file. From the user's perspective, any error of the form "we
expected N bytes and got X" is plenty. Especially since record_nr
typically comes from another part of the file, we cannot tell if our
chunk is too big/small, or if another chunk gave us a bogus record_nr.

-Peff
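The "won't walk off the end of the buffer" property can be spelled out with a
short illustrative fragment (not from the series): once the division check has
passed, iterating over record_nr fixed-size records stays inside the chunk.

	/*
	 * chunk_size / record_size == record_nr implies
	 * record_nr * record_size <= chunk_size, so every record below
	 * ends at or before chunk_start + chunk_size.
	 */
	for (i = 0; i < record_nr; i++) {
		const unsigned char *rec = chunk_start + i * record_size;
		/* reading record_size bytes starting at rec stays in bounds */
	}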
Jeff King <peff@peff.net> writes:

> On Mon, Jan 15, 2024 at 02:31:12PM -0800, Linus Arver wrote:
>
>> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
>> > +				size_t chunk_size,
>> > +				void *data)
>> > +{
>> > +	struct pair_chunk_data *pcd = data;
>> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
>> > +		return -1;
>> > +	*pcd->p = chunk_start;
>> > +	return 0;
>> > +}
>> > +
>>
>> I don't think this function should assume anything about the inputs
>> (chunk_size, record_size, nor record_nr). If we are asking the helper to
>> be the common tool for multiple locations, it should assume even less
>> about what these inputs could look like.
>>
>> For example, if record_size is 0 this will lead to a
>> divide-by-zero. Likewise, if record_nr is zero it doesn't make
>> sense to check if chunk_size fits into record_size * record_nr.
>
> I'm not sure that divide-by-zero is a big deal, because 0-length
> fixed-size records do not make any sense to ask about.

So why not make this function check for this? While it may be true that
0-length fixed-size records are impossible currently, nothing guarantees
they will always be that way all the time in the future.

> And even if
> somebody accidentally passed 0, even though it won't be caught by the
> compiler, it would barf on any input, so even rudimentary testing would
> catch it.

If somebody is accidentally passing an invalid value to a function, then
it feels right for that function to be able to handle it instead of
crashing (or doing any other undefined behavior) with division-by-zero.

Taking a step back though, maybe I am being overly defensive about
possible failure modes. I don't know the surrounding area well so I
might be overreacting.

> A zero record_nr is a perfectly reasonable thing to ask about. If you
> have a chunk file with zero entries for some entity, then we are
> checking that the chunk is the expected zero length as well.

Right.

>> And even if there are no (unexpected) zero-values involved, shouldn't we
>> also check for nonsensical comparisons, such as if chunk_size is smaller
>> than record_size?
>
> Aren't we checking that already? If chunk_size is less than record_size,
> then the division will result in "0". If the expected number of records
> is not also 0, then that's a bogus file.

I was thinking of an early return like

	if (chunk_size < record_size)
		return CHUNK_TOO_SMALL

before evaluating (chunk_size / pcd->record_size != pcd->record_nr).

You're correct that the division will result in "0" if chunk_size is
less than record_size. But I didn't like having the extra mental load
for reading and understanding the correctness of "if (chunk_size /
pcd->record_size != pcd->record_nr)" for that case. IOW, the more early
returns we have for known-bad cases, by the time we get to "if
(chunk_size / pcd->record_size != pcd->record_nr)" it would be that much
easier to understand that code.

> What we really care about here is that we won't walk off the end of the
> buffer while looking at N fixed-size records ...

Ah, I see. This sort of insight would be great to have as a comment in
the code.

> ... (in that sense, the "too
> big" test is overly careful, but it's easy to include as a sanity
> check).

OK.

>> So in summary there appear to be the following possibilities:
>>
>> CHUNK_MISSING
>> CHUNK_TOO_SMALL
>> CHUNK_OK
>> CHUNK_TOO_BIG_ALIGNED
>> CHUNK_TOO_BIG_MISALIGNED
>
> So yes, I agree all these cases exist. We detect them all except the
> "misaligned" case (which I think was discussed earlier in the thread as
> not worth caring too much about). But...
>
>> (on top of the cases where record_* inputs are zero).
>>
>> And it seems prudent to treat each of the not-OK cases differently
>> (including how we report error messages) once we know which category we
>> fall into.
>
> I'm not sure it is worth caring too much about the different cases. From
> the caller's perspective the end result is always to avoid using the
> chunk/file.

Ah OK. Then yes, it does seem like caring about the different cases is
too much from the callers' perspective. But I do think that checking the
different cases with early returns will at least help readability (and
as a bonus assure future Git developers that divide-by-zero errors are
impossible).
diff --git a/chunk-format.c b/chunk-format.c
index cdc7f39b70..be078dcca8 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -163,6 +163,10 @@ int read_table_of_contents(struct chunkfile *cf,
 struct pair_chunk_data {
 	const unsigned char **p;
 	size_t *size;
+
+	/* for pair_chunk_expect() only */
+	size_t record_size;
+	size_t record_nr;
 };
 
 static int pair_chunk_fn(const unsigned char *chunk_start,
@@ -175,6 +179,17 @@ static int pair_chunk_fn(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int pair_chunk_expect_fn(const unsigned char *chunk_start,
+				size_t chunk_size,
+				void *data)
+{
+	struct pair_chunk_data *pcd = data;
+	if (chunk_size / pcd->record_size != pcd->record_nr)
+		return -1;
+	*pcd->p = chunk_start;
+	return 0;
+}
+
 int pair_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       const unsigned char **p,
@@ -184,6 +199,20 @@ int pair_chunk(struct chunkfile *cf,
 	return read_chunk(cf, chunk_id, pair_chunk_fn, &pcd);
 }
 
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t record_size,
+		      size_t record_nr)
+{
+	struct pair_chunk_data pcd = {
+		.p = p,
+		.record_size = record_size,
+		.record_nr = record_nr,
+	};
+	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
+}
+
 int read_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       chunk_read_fn fn,
diff --git a/chunk-format.h b/chunk-format.h
index 14b76180ef..10806d7a9a 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -17,7 +17,8 @@ struct chunkfile;
  *
  * If reading a file, use a NULL 'struct hashfile *' and then call
  * read_table_of_contents(). Supply the memory-mapped data to the
- * pair_chunk() or read_chunk() methods, as appropriate.
+ * pair_chunk(), pair_chunk_expect(), or read_chunk() methods, as
+ * appropriate.
  *
  * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
  * for reading and writing.
@@ -54,6 +55,16 @@ int pair_chunk(struct chunkfile *cf,
 	       const unsigned char **p,
 	       size_t *size);
 
+/*
+ * Similar to 'pair_chunk', but used for callers who are reading a chunk
+ * with a known number of fixed-width records.
+ */
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t record_size,
+		      size_t record_nr);
+
 typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
 			     size_t chunk_size, void *data);
 /*
In 570b8b8836 (chunk-format: note that pair_chunk() is unsafe,
2023-10-09), the pair_chunk() interface grew a required "size" pointer,
so that the caller is forced to at least have a handle on the actual
size of the given chunk.

Many callers were converted to the new interface. A handful were
instead converted from the unsafe version of pair_chunk() to
read_chunk() so that they could check their expected size. This led to
a lot of code like:

    static int graph_read_oid_lookup(const unsigned char *chunk_start,
                                     size_t chunk_size, void *data)
    {
            struct commit_graph *g = data;
            g->chunk_oid_lookup = chunk_start;
            if (chunk_size / g->hash_len != g->num_commits)
                    return error(_("commit-graph OID lookup chunk is the wrong size"));
            return 0;
    }

, leaving the caller to use read_chunk(), like so:

    read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);

The callback to read_chunk() (in the above, `graph_read_oid_lookup()`)
does nothing more than (a) assign a pointer to the location of the start
of the chunk in the mmap'd file, and (b) assert that it has the correct
size.

For callers that know the expected size of their chunk(s) up-front (most
often because they are made up of a known number of fixed-size records),
we can simplify this by teaching the chunk-format API itself to validate
the expected size for us.

This is wrapped in a new function, called `pair_chunk_expect()`, which
takes a pair of "size_t"s (corresponding to the record size and count)
instead of a "size_t *", and validates that the given chunk matches the
expected size as given.

This will allow us to reduce the number of lines of code it takes to
perform these basic read_chunk() operations, by taking the above and
replacing it with something like:

    if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDLOOKUP,
                          &graph->chunk_oid_lookup, graph->hash_len,
                          graph->num_commits))
            error(_("commit-graph oid lookup chunk is wrong size"));

We will perform those transformations in the following commits.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 29 +++++++++++++++++++++++++++++
 chunk-format.h | 13 ++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)