Message ID | 5e9b985e39b0b9edee7af55dd8b0698a20062cf7.1656924376.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | bitmap: integrate a lookup table extension to the bitmap format | expand |
On Mon, Jul 04, 2022 at 08:46:12AM +0000, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > The bitmap lookup table extension was documented by an earlier > change, but Git does not yet know how to write that extension. This and the first patch both look in great shape to me. I haven't had a chance to take a close look through the remaining four patches, but I anticipate that they are in similarly-good shape. I'll have some more time to finish reviewing this tomorrow morning. I want to give it a closer inspection this round to make sure that everything is correct (and that we're assembling the various orderings the right way by stepping through it in a debugger, etc., etc.). Thanks for all of your patience :-). Thanks, Taylor
Hi Abhradeep, I just wanted to make absolutely sure that I understood what the implementation in this patch was doing, since I think generating and converting between all of these different orderings is by far the most confusing component of this series. On Mon, Jul 04, 2022 at 08:46:12AM +0000, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > The bitmap lookup table extension was documented by an earlier > change, but Git does not yet know how to write that extension. > +static int table_cmp(const void *_va, const void *_vb, void *_data) > +{ > + uint32_t *commit_positions = _data; > + uint32_t a = commit_positions[*(uint32_t *)_va]; > + uint32_t b = commit_positions[*(uint32_t *)_vb]; > + > + if (a > b) > + return 1; > + else if (a < b) > + return -1; > + > + return 0; > +} Let's skip the above part for now, and just look at the implementation of writing_lookup_table(): > +static void write_lookup_table(struct hashfile *f, > + struct pack_idx_entry **index, > + uint32_t index_nr, > + off_t *offsets) > +{ > + uint32_t i; > + uint32_t *table, *table_inv, *commit_positions; > + > + ALLOC_ARRAY(table, writer.selected_nr); > + ALLOC_ARRAY(table_inv, writer.selected_nr); > + ALLOC_ARRAY(commit_positions, writer.selected_nr); Makes sense. > + /* store the index positions of the commits */ > + for (i = 0; i < writer.selected_nr; i++) { > + int pos = commit_bitmap_writer_pos(&writer.selected[i].commit->object.oid, > + index, index_nr); > + if (pos < 0) > + BUG(_("trying to write commit not in index")); > + > + commit_positions[i] = pos; > + } By the end of this loop, we have an array `commit_positions` which maps the ith selected commit to its lexical position among all objects in the bitmap. IOW, `commit_positions[i] = j` means the `i`th selected commit can be found at index `j` among all objects in the pack/MIDX in their lexical order. > + for (i = 0; i < writer.selected_nr; i++) > + table[i] = i; At this point, table[i] = i. > + /* > + * At the end of this sort table[j] = i means that the i'th > + * bitmap corresponds to j'th bitmapped commit in lex order of > + * OIDs. > + */ > + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); And then we sort table by treating its values as indexes into `commit_positions`. Here's where I'm not sure that I follow what's going on. You say above that `table[j] = i`, where `i` corresponds to the order of selected commits, and `j` is in lexical order. If that's the case, then I'd expect that printing `index[table[j]]` for increasing `j` would output OIDs in increasing lexical order. But that doesn't quite seem to be the case. From a debugger session that has a breakpoint after computing and sorting table, along with building `table_inv`: (gdb) p oid_to_hex(&index[table[0]]->oid) $17 = 0x555555983ea0 <hexbuffer> "0006763074748d43b539c1c8e8882c08034ab178" (gdb) p oid_to_hex(&index[table[1]]->oid) $18 = 0x555555983ee1 <hexbuffer+65> "001ce83dd43f03dcfc67f29d38922e4a9682aab0" (gdb) p oid_to_hex(&index[table[2]]->oid) $19 = 0x555555983f22 <hexbuffer+130> "002db882ece2ab6a240e495a169c6e06422289c8" (gdb) p oid_to_hex(&index[table[3]]->oid) $20 = 0x555555983f63 <hexbuffer+195> "0007a5feb040e1ff704f3ad636619ddca3e7382b" that doesn't look like the OIDs are increasing in lexical order. I'm not quite sure if I'm even looking at the right thing, or if this is to be expected, or if the comment isn't quite accurate. If you could help clarify what's going on here, that would be great. > + /* table_inv helps us discover that relationship (i'th bitmap > + * to j'th commit by j = table_inv[i]) > + */ > + for (i = 0; i < writer.selected_nr; i++) > + table_inv[table[i]] = i; This part makes sense, as does the rest of the implementation. Thanks, Taylor
On Fri, Jul 15, 2022 at 7:52 AM Taylor Blau <me@ttaylorr.com> wrote: > > By the end of this loop, we have an array `commit_positions` which maps > the ith selected commit to its lexical position among all objects in the > bitmap. IOW, `commit_positions[i] = j` means the `i`th selected commit > can be found at index `j` among all objects in the pack/MIDX in their > lexical order. Right. > > + /* > > + * At the end of this sort table[j] = i means that the i'th > > + * bitmap corresponds to j'th bitmapped commit in lex order of > > + * OIDs. > > + */ > > + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); > > And then we sort table by treating its values as indexes into > `commit_positions`. Here's where I'm not sure that I follow what's going > on. You say above that `table[j] = i`, where `i` corresponds to the > order of selected commits, and `j` is in lexical order. Correct. > If that's the case, then I'd expect that printing `index[table[j]]` for > increasing `j` would output OIDs in increasing lexical order. But that > doesn't quite seem to be the case. From a debugger session that has a > breakpoint after computing and sorting table, along with building > `table_inv`: > > (gdb) p oid_to_hex(&index[table[0]]->oid) > $17 = 0x555555983ea0 <hexbuffer> "0006763074748d43b539c1c8e8882c08034ab178" > (gdb) p oid_to_hex(&index[table[1]]->oid) > $18 = 0x555555983ee1 <hexbuffer+65> "001ce83dd43f03dcfc67f29d38922e4a9682aab0" > (gdb) p oid_to_hex(&index[table[2]]->oid) > $19 = 0x555555983f22 <hexbuffer+130> "002db882ece2ab6a240e495a169c6e06422289c8" > (gdb) p oid_to_hex(&index[table[3]]->oid) > $20 = 0x555555983f63 <hexbuffer+195> "0007a5feb040e1ff704f3ad636619ddca3e7382b" > > that doesn't look like the OIDs are increasing in lexical order. > > I'm not quite sure if I'm even looking at the right thing, or if this is > to be expected, or if the comment isn't quite accurate. If you could > help clarify what's going on here, that would be great. I think you're not looking at the right thing. you should look at `writer.selected[table[i]].commit->object.oid` instead. I think the order of `index[]` is not the same as the pack index (or midx). I am saying this because if we use the `pos` variable (that we get from `commit_bitmap_writer_pos(&writer.selected[table[i]].commit->object.oid, index, index_nr)`) in `fprintf(stderr, "commit hex: %s\n", &index[pos]->oid);`, you'll see that `&index[pos]->oid` and `&writer.selected[table[i]].commit->object.oid` are not same. So, If you do - int spos = commit_bitmap_writer_pos(&index[pos]->oid, index, index_nr); you'll see `spos` is not equal to `pos`. > > + /* table_inv helps us discover that relationship (i'th bitmap > > + * to j'th commit by j = table_inv[i]) > > + */ > > + for (i = 0; i < writer.selected_nr; i++) > > + table_inv[table[i]] = i; > > This part makes sense, as does the rest of the implementation. > > Thanks, > Taylor
On Fri, Jul 15, 2022 at 09:28:25PM +0530, Abhradeep Chakraborty wrote: > > If that's the case, then I'd expect that printing `index[table[j]]` for > > increasing `j` would output OIDs in increasing lexical order. But that > > doesn't quite seem to be the case. From a debugger session that has a > > breakpoint after computing and sorting table, along with building > > `table_inv`: > > > > (gdb) p oid_to_hex(&index[table[0]]->oid) > > $17 = 0x555555983ea0 <hexbuffer> "0006763074748d43b539c1c8e8882c08034ab178" > > (gdb) p oid_to_hex(&index[table[1]]->oid) > > $18 = 0x555555983ee1 <hexbuffer+65> "001ce83dd43f03dcfc67f29d38922e4a9682aab0" > > (gdb) p oid_to_hex(&index[table[2]]->oid) > > $19 = 0x555555983f22 <hexbuffer+130> "002db882ece2ab6a240e495a169c6e06422289c8" > > (gdb) p oid_to_hex(&index[table[3]]->oid) > > $20 = 0x555555983f63 <hexbuffer+195> "0007a5feb040e1ff704f3ad636619ddca3e7382b" > > > > that doesn't look like the OIDs are increasing in lexical order. > > > > I'm not quite sure if I'm even looking at the right thing, or if this is > > to be expected, or if the comment isn't quite accurate. If you could > > help clarify what's going on here, that would be great. > > I think you're not looking at the right thing. you should look at > `writer.selected[table[i]].commit->object.oid` instead. I think the > order of `index[]` > is not the same as the pack index (or midx). > > I am saying this because if we use the `pos` variable (that we get > from `commit_bitmap_writer_pos(&writer.selected[table[i]].commit->object.oid, > index, index_nr)`) in `fprintf(stderr, "commit hex: %s\n", > &index[pos]->oid);`, you'll see that `&index[pos]->oid` and > `&writer.selected[table[i]].commit->object.oid` are not same. So, If > you do - > > int spos = commit_bitmap_writer_pos(&index[pos]->oid, index, index_nr); > > you'll see `spos` is not equal to `pos`. `index` there comes from the list of objects that `pack-objects` or the MIDX told us about, and it's sorted in lexical order (via `write_pack_file()` -> `stage_tmp_packfiles()` -> `write_idx_file()`). So I think this implementation is indexing the commits by the order they appearn in the `writer.selected` array, *not* by the order they appear in the index. For what it's worth, I think the latter ordering makes more sense to use to refer to individual objects. But we should be consistent with our choice here and what's in the documentation. And right now I think we're not, since the documentation change in the first patch says we write the `commit_pos` field in order of the index: * {empty} commit_pos (4 byte integer, network byte order): :: It stores the object position of a commit (in the midx or pack index). Thanks, Taylor
On Sat, Jul 16, 2022 at 3:45 AM Taylor Blau <me@ttaylorr.com> wrote: > > On Fri, Jul 15, 2022 at 09:28:25PM +0530, Abhradeep Chakraborty wrote: > > > If that's the case, then I'd expect that printing `index[table[j]]` for > > > increasing `j` would output OIDs in increasing lexical order. But that > > > doesn't quite seem to be the case. From a debugger session that has a > > > breakpoint after computing and sorting table, along with building > > > `table_inv`: > > > > > > (gdb) p oid_to_hex(&index[table[0]]->oid) > > > $17 = 0x555555983ea0 <hexbuffer> "0006763074748d43b539c1c8e8882c08034ab178" > > > (gdb) p oid_to_hex(&index[table[1]]->oid) > > > $18 = 0x555555983ee1 <hexbuffer+65> "001ce83dd43f03dcfc67f29d38922e4a9682aab0" > > > (gdb) p oid_to_hex(&index[table[2]]->oid) > > > $19 = 0x555555983f22 <hexbuffer+130> "002db882ece2ab6a240e495a169c6e06422289c8" > > > (gdb) p oid_to_hex(&index[table[3]]->oid) > > > $20 = 0x555555983f63 <hexbuffer+195> "0007a5feb040e1ff704f3ad636619ddca3e7382b" > > > > > > that doesn't look like the OIDs are increasing in lexical order. > > > > > > I'm not quite sure if I'm even looking at the right thing, or if this is > > > to be expected, or if the comment isn't quite accurate. If you could > > > help clarify what's going on here, that would be great. > > > > I think you're not looking at the right thing. you should look at > > `writer.selected[table[i]].commit->object.oid` instead. I think the > > order of `index[]` > > is not the same as the pack index (or midx). > > > > I am saying this because if we use the `pos` variable (that we get > > from `commit_bitmap_writer_pos(&writer.selected[table[i]].commit->object.oid, > > index, index_nr)`) in `fprintf(stderr, "commit hex: %s\n", > > &index[pos]->oid);`, you'll see that `&index[pos]->oid` and > > `&writer.selected[table[i]].commit->object.oid` are not same. So, If > > you do - > > > > int spos = commit_bitmap_writer_pos(&index[pos]->oid, index, index_nr); > > > > you'll see `spos` is not equal to `pos`. > > `index` there comes from the list of objects that `pack-objects` or the > MIDX told us about, and it's sorted in lexical order (via > `write_pack_file()` -> `stage_tmp_packfiles()` -> `write_idx_file()`). This was a bit strange for me because all the tests were passing. But now I find the reason why your results were not in lexical order. you were doing `oid_to_hex(&index[table[i]]->oid)` which is not what you intended to do. Let me explain it with a simple workflow - Suppose 12 commits are selected for bitmaps and are sorted by their date. I will now use their index numbers to denote those commits (i.e. `0` denotes the most recent commit, `1` denotes the second commit in this order and so on..). So, before that quick sort, `table` = {0,1, 2, 3, 4, ...,11}. Now suppose, `11`th commit is lexically smallest among all the selected commits, `5`th commit is the second smallest commit and so on. So, after that quick sort, `table` array now contains the following - {11, 5, 9, 4,0, 3, ...}. So, when you do `&index[table[0]]->oid`, it becomes `&index[11]->oid`. Similarly, `&index[table[1]]->oid` becomes `&index[5]->oid` and so on. That's why you're not getting the oids in lexical order - `&index[11]->oid` gives the 11th oid in the pack-index and `&index[5]->oid` gives the 5th oid in the pack-index. So, the right thing would be to do `&index[commit_positions[table[0]]]->oid`, `&index[commit_positions[table[1]]]->oid` ... Here `&index[commit_positions[table[0]]]->oid` becomes `&index[commit_positions[11]]->oid` => `&index[pos_of_11_commit_with_respect_to_pack_index]->oid` which ultimately prints the oid of 11th commit ( among the selected bitmap commits IN THE SELECTED BITMAP COMMIT ORDER) . I think the comment I added is not that good. The following might be better - At the end of this sort table[j] = i means that the i'th bitmap corresponds to j'th bitmapped commit (among the selected commits) in lex order of OIDs. > Thanks, > Taylor
On Mon, 4 Jul 2022 at 10:48, Abhradeep Chakraborty via GitGitGadget <gitgitgadget@gmail.com> wrote: > > +static int table_cmp(const void *_va, const void *_vb, void *_data) > +{ > + uint32_t *commit_positions = _data; > + uint32_t a = commit_positions[*(uint32_t *)_va]; > + uint32_t b = commit_positions[*(uint32_t *)_vb]; This casting and dereferencing are ok because ... > +static void write_lookup_table(struct hashfile *f, > + struct pack_idx_entry **index, > + uint32_t index_nr, > + off_t *offsets) > +{ > + uint32_t i; > + uint32_t *table, *table_inv, *commit_positions; > + > + ALLOC_ARRAY(table, writer.selected_nr); > + ALLOC_ARRAY(table_inv, writer.selected_nr); > + ALLOC_ARRAY(commit_positions, writer.selected_nr); ... `table` is where `_va` and `_vb` will be pointing into. > + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); I started looking at this casting because of something similar in "pack-bitmap: prepare to read lookup table extension". I'm pointing out this instance just to say that it looks ok to me. Martin
On Sat, Jul 16, 2022 at 05:20:57PM +0530, Abhradeep Chakraborty wrote: > I think the comment I added is not that good. The following might be better - > > At the end of this sort table[j] = i means that the i'th > bitmap corresponds to j'th bitmapped commit (among the selected commits) > in lex order of OIDs. Makes sense, I think that version of the comment is more helpful. I appreciate your attention to detail on getting these things right! Thanks, Taylor
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index c43375bd344..4a0edd746bc 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -648,9 +648,17 @@ static const struct object_id *oid_access(size_t pos, const void *table) return &index[pos]->oid; } +static int commit_bitmap_writer_pos(struct object_id *oid, + struct pack_idx_entry **index, + uint32_t index_nr) +{ + return oid_pos(oid, index, index_nr, oid_access); +} + static void write_selected_commits_v1(struct hashfile *f, struct pack_idx_entry **index, - uint32_t index_nr) + uint32_t index_nr, + off_t *offsets) { int i; @@ -658,11 +666,14 @@ static void write_selected_commits_v1(struct hashfile *f, struct bitmapped_commit *stored = &writer.selected[i]; int commit_pos = - oid_pos(&stored->commit->object.oid, index, index_nr, oid_access); + commit_bitmap_writer_pos(&stored->commit->object.oid, index, index_nr); if (commit_pos < 0) BUG("trying to write commit not in index"); + if (offsets) + offsets[i] = hashfile_total(f); + hashwrite_be32(f, commit_pos); hashwrite_u8(f, stored->xor_offset); hashwrite_u8(f, stored->flags); @@ -671,6 +682,92 @@ static void write_selected_commits_v1(struct hashfile *f, } } +static int table_cmp(const void *_va, const void *_vb, void *_data) +{ + uint32_t *commit_positions = _data; + uint32_t a = commit_positions[*(uint32_t *)_va]; + uint32_t b = commit_positions[*(uint32_t *)_vb]; + + if (a > b) + return 1; + else if (a < b) + return -1; + + return 0; +} + +static void write_lookup_table(struct hashfile *f, + struct pack_idx_entry **index, + uint32_t index_nr, + off_t *offsets) +{ + uint32_t i; + uint32_t *table, *table_inv, *commit_positions; + + ALLOC_ARRAY(table, writer.selected_nr); + ALLOC_ARRAY(table_inv, writer.selected_nr); + ALLOC_ARRAY(commit_positions, writer.selected_nr); + + /* store the index positions of the commits */ + for (i = 0; i < writer.selected_nr; i++) { + int pos = commit_bitmap_writer_pos(&writer.selected[i].commit->object.oid, + index, index_nr); + if (pos < 0) + BUG(_("trying to write commit not in index")); + + commit_positions[i] = pos; + } + + for (i = 0; i < writer.selected_nr; i++) + table[i] = i; + + /* + * At the end of this sort table[j] = i means that the i'th + * bitmap corresponds to j'th bitmapped commit in lex order of + * OIDs. + */ + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); + + /* table_inv helps us discover that relationship (i'th bitmap + * to j'th commit by j = table_inv[i]) + */ + for (i = 0; i < writer.selected_nr; i++) + table_inv[table[i]] = i; + + trace2_region_enter("pack-bitmap-write", "writing_lookup_table", the_repository); + for (i = 0; i < writer.selected_nr; i++) { + struct bitmapped_commit *selected = &writer.selected[table[i]]; + uint32_t xor_offset = selected->xor_offset; + uint32_t xor_row; + + if (xor_offset) { + /* + * xor_index stores the index (in the bitmap entries) + * of the corresponding xor bitmap. But we need to convert + * this index into lookup table's index. So, table_inv[xor_index] + * gives us the index position w.r.t. the lookup table. + * + * If "k = table[i] - xor_offset" then the xor base is the k'th + * bitmap. `table_inv[k]` gives us the position of that bitmap + * in the lookup table. + */ + uint32_t xor_index = table[i] - xor_offset; + xor_row = table_inv[xor_index]; + } else { + xor_row = 0xffffffff; + } + + hashwrite_be32(f, commit_positions[table[i]]); + hashwrite_be64(f, (uint64_t)offsets[table[i]]); + hashwrite_be32(f, xor_row); + } + trace2_region_leave("pack-bitmap-write", "writing_lookup_table", the_repository); + + free(table); + free(table_inv); + free(commit_positions); +} + static void write_hash_cache(struct hashfile *f, struct pack_idx_entry **index, uint32_t index_nr) @@ -695,6 +792,7 @@ void bitmap_writer_finish(struct pack_idx_entry **index, { static uint16_t default_version = 1; static uint16_t flags = BITMAP_OPT_FULL_DAG; + off_t *offsets = NULL; struct strbuf tmp_file = STRBUF_INIT; struct hashfile *f; @@ -715,7 +813,14 @@ void bitmap_writer_finish(struct pack_idx_entry **index, dump_bitmap(f, writer.trees); dump_bitmap(f, writer.blobs); dump_bitmap(f, writer.tags); - write_selected_commits_v1(f, index, index_nr); + + if (options & BITMAP_OPT_LOOKUP_TABLE) + CALLOC_ARRAY(offsets, index_nr); + + write_selected_commits_v1(f, index, index_nr, offsets); + + if (options & BITMAP_OPT_LOOKUP_TABLE) + write_lookup_table(f, index, index_nr, offsets); if (options & BITMAP_OPT_HASH_CACHE) write_hash_cache(f, index, index_nr); @@ -730,4 +835,5 @@ void bitmap_writer_finish(struct pack_idx_entry **index, die_errno("unable to rename temporary bitmap file to '%s'", filename); strbuf_release(&tmp_file); + free(offsets); } diff --git a/pack-bitmap.h b/pack-bitmap.h index 3d3ddd77345..67a9d0fc303 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -24,8 +24,9 @@ struct bitmap_disk_header { #define NEEDS_BITMAP (1u<<22) enum pack_bitmap_opts { - BITMAP_OPT_FULL_DAG = 1, - BITMAP_OPT_HASH_CACHE = 4, + BITMAP_OPT_FULL_DAG = 0x1, + BITMAP_OPT_HASH_CACHE = 0x4, + BITMAP_OPT_LOOKUP_TABLE = 0x10, }; enum pack_bitmap_flags {