Message ID | 18d5864f81e89585cc94cd12eca166a9d8b929a5.1597509583.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Implement Corrected Commit Date | expand |
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Abhishek Kumar <abhishekkumar8222@gmail.com> > > Both fill_commit_graph_info() and fill_commit_in_graph() parse > information present in commit data chunk. Let's simplify the > implementation by calling fill_commit_graph_info() within > fill_commit_in_graph(). > > The test 'generate tar with future mtime' creates a commit with commit > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into > generation number (within CDAT chunk) and has undefined behavior. > > The test used to pass Could you please tell us how does this test starts to fail without the change to the test described there? What is the error message, etc.? > as fill_commit_in_graph() guarantees the values of ^^^^^^^^^^^^^^^^^^^^^^ s/fill_commit_in_graph()/fill_commit_graph_info()/ It is fill_commit_graph_info() that changes its behavior in this patch. > graph position and generation number, and did not load timestamp. > However, with corrected commit date we will need load the timestamp as > well to populate the generation number. > > Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds. I think this commit should be split into two commits: - fix to the 'generate tar with future mtime' test - simplify implementation of fill_commit_in_graph() The test 'generate tar with future mtime' in t/t5000-tar-tree.sh creates a commit with commit time of (2 ^ 36 - 1) seconds since EPOCH (68719476737). However, the commit-graph file format version 1 provides only 34-bits for storing committer date (32 + 2 bits), not 64-bits. Therefore maximum far in the future commit time can only be at most (2 ^ 34 - 1) seconds since EPOCH, as Stolee said in commet for v1 of this series. This "limitation" is not a problem in practice, because the maximum timestamp allowed takes place in the year 2514. I hope at that time there would be no Git version in use that still crashes on changing the version field in the commit-graph format -- then we can simply get rid of storing topological levels (generation number v1) in those 30 bits of CDAT chunk and use full 64 bits for committer date. Git does not perform any bounds checking for committer date value in write_graph_chunk_data(): uint32_t packedDate[2]; /* ... */ if (sizeof((*list)->date) > 4) packedDate[0] = htonl(((*list)->date >> 32) & 0x3); else packedDate[0] = 0; packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2); packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); This means that the date is trimmed to 34 bits on save discarding most significant bits, assuming that unsigned overflow simply discards most significant bits truncating the signed (?) value. In this case running the test with GIT_TEST_COMMIT_GRAPH=1 would lead to errors, as the committer date read from the commit graph would be incorrect, and therefore generation number v2 would also be incorrect. I don't quite understand however how second part of this patch in its current iteration, namely simplifing the implementation of fill_commit_in_graph() makes this bug / error shows... Do I understand it correctly that before this change the committer date would always be parsed out of the commit object, instead of reading it from the commit-graph file? However the only user of static fill_commit_in_graph() is the parse_commit_in_graph(), which in turn is used by parse_commit_gently(); but fill_commit_in_graph() read commit date from commit-graph before this change... color me confused. Ah, after the change fill_commit_graph_info() changes its behavior, not fill_commit_in_graph() as said in the commit message. Before this commit it used to only load graph position and generation number, and did not load the timestamp. The function fill_commit_graph_info() is used in turn by public-facing load_commit_graph_info(): /* * It is possible that we loaded commit contents from the commit buffer, * but we also want to ensure the commit-graph content is correctly * checked and filled. Fill the graph_pos and generation members of * the given commit. */ void load_commit_graph_info(struct repository *r, struct commit *item); This function is used in turn by get_bloom_filter(), contains_tag_algo() and parse_commit_buffer(), change in any of which behavior can lead to failing 'generate tar with future mtime' test. > > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> > --- > commit-graph.c | 29 +++++++++++------------------ > t/t5000-tar-tree.sh | 4 ++-- > 2 files changed, 13 insertions(+), 20 deletions(-) > > diff --git a/commit-graph.c b/commit-graph.c > index ace7400a1a..af8d9cc45e 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, > const unsigned char *commit_data; > struct commit_graph_data *graph_data; > uint32_t lex_index; > + uint64_t date_high, date_low; > > while (pos < g->num_commits_in_base) > g = g->base_graph; > > + if (pos >= g->num_commits + g->num_commits_in_base) > + die(_("invalid commit position. commit-graph is likely corrupt")); > + All right, if we want to use fill_commit_graph_info() function to load the graph data (graph position and generation number) in the fill_commit_in_graph() we need to perform this check. I'd think that this check should be here from the beginning, just in case. Sidenote: I wonder if it would be good idea to print more information in the above error message, for example: die(_("invalid commit position %ld. commit-graph '%s' is likely corrupt"), pos, g->filename); But this is unrelated thing, tangential to this change, and it might not add anything useful. > lex_index = pos - g->num_commits_in_base; > commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index; > > graph_data = commit_graph_data_at(item); > graph_data->graph_pos = pos; > + > + date_high = get_be32(commit_data + g->hash_len + 8) & 0x3; > + date_low = get_be32(commit_data + g->hash_len + 12); > + item->date = (timestamp_t)((date_high << 32) | date_low); > + I think this change, moving loading of commit date from commit-graph out of fill_commit_in_graph() and into fill_commit_graph_info(), is in my opinion a bit inadequatly described in the commit message. As I understand it this change prepares fill_commit_graph_info() for generation number v2, that is Corrected Commit Date, where loading commit date from CDAT together with loading offset from GDAT would be necessary to correctly set the 'generation' field of 'struct commit_graph_data' (on the commit_graph_data_slab). I'm not sure if it would be worth it splitting this refactoring change (Move Statements into Function) into a separate patch -- it would split this commit into three, changing 11 part series into 13 part series. Note that we might want to update the description of load_commit_graph_info() in commit-graph.h to include that it incidentally loads commit date from the commit-graph. Butthis might be not worth it -- it is a side effect, not the major goal of this function. > graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > } > > @@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r, > { > uint32_t edge_value; > uint32_t *parent_data_ptr; > - uint64_t date_low, date_high; > struct commit_list **pptr; > - struct commit_graph_data *graph_data; > const unsigned char *commit_data; > uint32_t lex_index; > > while (pos < g->num_commits_in_base) > g = g->base_graph; > > - if (pos >= g->num_commits + g->num_commits_in_base) > - die(_("invalid commit position. commit-graph is likely corrupt")); > + fill_commit_graph_info(item, g, pos); All right, the check got moved into fill_commit_graph_info(). > > - /* > - * Store the "full" position, but then use the > - * "local" position for the rest of the calculation. > - */ > - graph_data = commit_graph_data_at(item); > - graph_data->graph_pos = pos; All right, 'graph_pos' field in the graph data (on commit slab) got filled by just called load_commit_graph_info(). > lex_index = pos - g->num_commits_in_base; > - > - commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index; > + commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index; All right, unrelated cleanup in the neighbourhood. > > item->object.parsed = 1; > > set_commit_tree(item, NULL); > > - date_high = get_be32(commit_data + g->hash_len + 8) & 0x3; > - date_low = get_be32(commit_data + g->hash_len + 12); > - item->date = (timestamp_t)((date_high << 32) | date_low); > - All right, this code got moved down the call chain into just called load_commit_graph_info(). > - graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > - All right, 'generation' field in the graph data (on commit slab) got filled by load_commit_graph_info() called at the beginning of the function. > pptr = &item->parents; > > edge_value = get_be32(commit_data + g->hash_len); > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh > index 37655a237c..1986354fc3 100755 > --- a/t/t5000-tar-tree.sh > +++ b/t/t5000-tar-tree.sh > @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' ' > rm -f .git/index && > echo content >file && > git add file && > - GIT_COMMITTER_DATE="@68719476737 +0000" \ > + GIT_COMMITTER_DATE="@17179869183 +0000" \ > git commit -m "tempori parendum" > ' > > @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' ' > ' > > test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' ' > - echo 4147 >expect && > + echo 2514 >expect && > tar_info future.tar | cut -d" " -f2 >actual && > test_cmp expect actual > ' Looks good to me. Best,
On Wed, Aug 19, 2020 at 07:54:20PM +0200, Jakub Narębski wrote: > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > From: Abhishek Kumar <abhishekkumar8222@gmail.com> > > > > Both fill_commit_graph_info() and fill_commit_in_graph() parse > > information present in commit data chunk. Let's simplify the > > implementation by calling fill_commit_graph_info() within > > fill_commit_in_graph(). > > > > The test 'generate tar with future mtime' creates a commit with commit > > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into > > generation number (within CDAT chunk) and has undefined behavior. > > > > The test used to pass > > Could you please tell us how does this test starts to fail without the > change to the test described there? What is the error message, etc.? > Here's what I revised the commit message to: commit-graph: consolidate fill_commit_graph_info Both fill_commit_graph_info() and fill_commit_in_graph() parse information present in commit data chunk. Let's simplify the implementation by calling fill_commit_graph_info() within fill_commit_in_graph(). The test 'generate tar with future mtime' creates a commit with commit time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides 34-bits for storing commiter date, thus committer time overflows into generation number (within CDAT chunk) and has undefined behavior. The test used to pass as fill_commit_graph_info() would not set struct member `date` of struct commit and loads committer date from the object database, generating a tar file with the expected mtime. However, with corrected commit date, we will load the committer date from CDAT chunk (truncated to lower 34-bits) to populate the generation number. Thus, fill_commit_graph_info() sets date and generates tar file with the truncated mtime and the test fails. Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which will not be truncated. > > as fill_commit_in_graph() guarantees the values of > ^^^^^^^^^^^^^^^^^^^^^^ > > s/fill_commit_in_graph()/fill_commit_graph_info()/ > > ... > > Ah, after the change fill_commit_graph_info() changes its behavior, not > fill_commit_in_graph() as said in the commit message. Before this commit > it used to only load graph position and generation number, and did not > load the timestamp. The function fill_commit_graph_info() is used in > turn by public-facing load_commit_graph_info(): > That's exactly it. I should have elaborated better in the commit message. Thanks for the through investigation. > /* > * It is possible that we loaded commit contents from the commit buffer, > * but we also want to ensure the commit-graph content is correctly > * checked and filled. Fill the graph_pos and generation members of > * the given commit. > */ > void load_commit_graph_info(struct repository *r, struct commit *item); > > ... > > Looks good to me. > > Best, > -- > Jakub Narębski Thanks - Abhishek
Hello, Abhishek Kumar <abhishekkumar8222@gmail.com> writes: > On Wed, Aug 19, 2020 at 07:54:20PM +0200, Jakub Narębski wrote: >> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: >> >>> From: Abhishek Kumar <abhishekkumar8222@gmail.com> >>> >>> Both fill_commit_graph_info() and fill_commit_in_graph() parse >>> information present in commit data chunk. Let's simplify the >>> implementation by calling fill_commit_graph_info() within >>> fill_commit_in_graph(). >>> >>> The test 'generate tar with future mtime' creates a commit with commit >>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into >>> generation number (within CDAT chunk) and has undefined behavior. >>> >>> The test used to pass >> >> Could you please tell us how does this test starts to fail without the >> change to the test described there? What is the error message, etc.? >> > > Here's what I revised the commit message to: > > commit-graph: consolidate fill_commit_graph_info > > Both fill_commit_graph_info() and fill_commit_in_graph() parse > information present in commit data chunk. Let's simplify the > implementation by calling fill_commit_graph_info() within > fill_commit_in_graph(). All right. We might want to add here the information that we also move loading the commit date from the commit-graph file from fill_commit_in_graph() down the [new] call chain into fill_commit_graph_info(). The commit date would be needed in fill_commit_graph_info() in the next commit to compute corrected commit date out of corrected commit date offset, and store it as generation number. NOTE that this means that if we switch to storing 64-bit corrected commit date directly in the commit-graph file, instead of storing 32-bit offsets, neither this Move Statement Into Function Out of Caller refactoring nor change to the 'generate tar with future mtime' test would be necessary. > > The test 'generate tar with future mtime' creates a commit with commit > time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides > 34-bits for storing commiter date, thus committer time overflows into > generation number (within CDAT chunk) and has undefined behavior. > > The test used to pass as fill_commit_graph_info() would not set struct > member `date` of struct commit and loads committer date from the object > database, generating a tar file with the expected mtime. I guess that in the case of generating a tar file we would read the commit out of 'object database', and then only add commit-graph specific info with fill_commit_graph_info(). Possibly because we need more information that commit-graph provides for a commit. > > However, with corrected commit date, we will load the committer date > from CDAT chunk (truncated to lower 34-bits) to populate the generation > number. Thus, fill_commit_graph_info() sets date and generates tar file > with the truncated mtime and the test fails. > > Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which > will not be truncated. Now I got interested why the value of (2 ^ 36 + 1) seconds since EPOCH was used. The commit that introduced the 'generate tar with future mtime' test, namely e51217e15 (t5000: test tar files that overflow ustar headers, 30-06-2016), says: The ustar format only has room for 11 (or 12, depending on some implementations) octal digits for the size and mtime of each file. For values larger than this, we have to add pax extended headers to specify the real data, and git does not yet know how to do so. Before fixing that, let's start off with some test infrastructure [...] The value of 2 ^ 36 equals 2 ^ 3*12 = (2 ^ 3) ^ 12 = 8 ^ 12. So we need the value of (2 ^ 36 + 1) for this test do do its job. Possibly the value of 8 ^ 11 + 1 = 2 ^ 33 + 1 would be enough (if we skip testing "some implementations"). So I think to make this test more clear (for inquisitive minds) we should set a timestamp of (2 ^ 33 + 1), not (2 ^ 34 - 1) seconds since EPOCH. Maybe even add a variant of this test that uses the origial value of (2 ^ 36 + 1) seconds since EPOCH, but turns off use of serialized commit-graph. I'm sorry for not checking this earlier. Best,
On Tue, Aug 25, 2020 at 01:11:11PM +0200, Jakub Narębski wrote: > Hello, > > ... > > All right. > > We might want to add here the information that we also move loading the > commit date from the commit-graph file from fill_commit_in_graph() down > the [new] call chain into fill_commit_graph_info(). The commit date > would be needed in fill_commit_graph_info() in the next commit to > compute corrected commit date out of corrected commit date offset, and > store it as generation number. > > > NOTE that this means that if we switch to storing 64-bit corrected > commit date directly in the commit-graph file, instead of storing 32-bit > offsets, neither this Move Statement Into Function Out of Caller > refactoring nor change to the 'generate tar with future mtime' test > would be necessary. > > > > > The test 'generate tar with future mtime' creates a commit with commit > > time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides > > 34-bits for storing commiter date, thus committer time overflows into > > generation number (within CDAT chunk) and has undefined behavior. > > > > The test used to pass as fill_commit_graph_info() would not set struct > > member `date` of struct commit and loads committer date from the object > > database, generating a tar file with the expected mtime. > > I guess that in the case of generating a tar file we would read the > commit out of 'object database', and then only add commit-graph specific > info with fill_commit_graph_info(). Possibly because we need more > information that commit-graph provides for a commit. > > > > > However, with corrected commit date, we will load the committer date > > from CDAT chunk (truncated to lower 34-bits) to populate the generation > > number. Thus, fill_commit_graph_info() sets date and generates tar file > > with the truncated mtime and the test fails. > > > > Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which > > will not be truncated. > > Now I got interested why the value of (2 ^ 36 + 1) seconds since EPOCH > was used. > > The commit that introduced the 'generate tar with future mtime' test, > namely e51217e15 (t5000: test tar files that overflow ustar headers, > 30-06-2016), says: > > The ustar format only has room for 11 (or 12, depending on > some implementations) octal digits for the size and mtime of > each file. For values larger than this, we have to add pax > extended headers to specify the real data, and git does not > yet know how to do so. > > Before fixing that, let's start off with some test > infrastructure [...] > > The value of 2 ^ 36 equals 2 ^ 3*12 = (2 ^ 3) ^ 12 = 8 ^ 12. > So we need the value of (2 ^ 36 + 1) for this test do do its job. > Possibly the value of 8 ^ 11 + 1 = 2 ^ 33 + 1 would be enough > (if we skip testing "some implementations"). > > So I think to make this test more clear (for inquisitive minds) we > should set a timestamp of (2 ^ 33 + 1), not (2 ^ 34 - 1) seconds > since EPOCH. Maybe even add a variant of this test that uses the > origial value of (2 ^ 36 + 1) seconds since EPOCH, but turns off > use of serialized commit-graph. That's pretty interesting! I didn't look into this either, will modify the existing test and add a new test for it. Thanks for investigating this further. > > I'm sorry for not checking this earlier. > > Best, > -- > Jakub Narębski Thanks - Abhishek
diff --git a/commit-graph.c b/commit-graph.c index ace7400a1a..af8d9cc45e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, const unsigned char *commit_data; struct commit_graph_data *graph_data; uint32_t lex_index; + uint64_t date_high, date_low; while (pos < g->num_commits_in_base) g = g->base_graph; + if (pos >= g->num_commits + g->num_commits_in_base) + die(_("invalid commit position. commit-graph is likely corrupt")); + lex_index = pos - g->num_commits_in_base; commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index; graph_data = commit_graph_data_at(item); graph_data->graph_pos = pos; + + date_high = get_be32(commit_data + g->hash_len + 8) & 0x3; + date_low = get_be32(commit_data + g->hash_len + 12); + item->date = (timestamp_t)((date_high << 32) | date_low); + graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2; } @@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r, { uint32_t edge_value; uint32_t *parent_data_ptr; - uint64_t date_low, date_high; struct commit_list **pptr; - struct commit_graph_data *graph_data; const unsigned char *commit_data; uint32_t lex_index; while (pos < g->num_commits_in_base) g = g->base_graph; - if (pos >= g->num_commits + g->num_commits_in_base) - die(_("invalid commit position. commit-graph is likely corrupt")); + fill_commit_graph_info(item, g, pos); - /* - * Store the "full" position, but then use the - * "local" position for the rest of the calculation. - */ - graph_data = commit_graph_data_at(item); - graph_data->graph_pos = pos; lex_index = pos - g->num_commits_in_base; - - commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index; + commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index; item->object.parsed = 1; set_commit_tree(item, NULL); - date_high = get_be32(commit_data + g->hash_len + 8) & 0x3; - date_low = get_be32(commit_data + g->hash_len + 12); - item->date = (timestamp_t)((date_high << 32) | date_low); - - graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2; - pptr = &item->parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh index 37655a237c..1986354fc3 100755 --- a/t/t5000-tar-tree.sh +++ b/t/t5000-tar-tree.sh @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' ' rm -f .git/index && echo content >file && git add file && - GIT_COMMITTER_DATE="@68719476737 +0000" \ + GIT_COMMITTER_DATE="@17179869183 +0000" \ git commit -m "tempori parendum" ' @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' ' ' test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' ' - echo 4147 >expect && + echo 2514 >expect && tar_info future.tar | cut -d" " -f2 >actual && test_cmp expect actual '