Message ID | YBHlGPBSJC++CnPy@coredump.intra.peff.net (mailing list archive) |
---|---|
Headers | show |
Series | rev-list --disk-usage | expand |
On Wed, Jan 27, 2021 at 05:11:36PM -0500, Jeff King wrote: > The first patch is just a test-script enhancement to let test_commit > avoid creating tags. During some recent refactoring, we actually broke > the --disk-usage feature but the test script didn't catch it because the > tags were being picked up by "--all". Since this is at least the third > time I've run into that in our test suite, I thought I'd make it a > little more convenient to avoid. :) I appreciate the non-incriminating "we", but the person who caused the regression was most certainly me ;-). This happened while cherry-picking Junio's recent merge of tb/revindex-api, which obviously did not cause a merge conflict with this new caller. The remaining details are boring, but they definitely weren't Peff's fault :-). Thanks, Taylor
Here's a re-roll of my series to add "rev-list --disk-usage", for counting up object storage used for various slices of history. This fixes the minor bits mentioned in review for v1, but the big change is that "--disk-usage" no longer implies "--objects". I think you generally would want to use it with that option, but it really seemed to violate the principle of least surprise for the user. That requires handling each object type independently, but the code for that turned out to be not too bad (and is modeled after the similar logic in traverse_bitmap_commit_list()). I was slightly concerned that it would slow things down to walk over the bitmap multiple times, but it doesn't seem to make much of a difference in practice. There's a range-diff below, but it's not really worth looking at. All of the interesting parts were rewritten completely, so you're better off to just read patch 2 again (and patch 1 did not change at all). [1/2]: t: add --no-tag option to test_commit [2/2]: rev-list: add --disk-usage option for calculating disk usage Documentation/rev-list-options.txt | 9 ++++ builtin/rev-list.c | 46 +++++++++++++++++ pack-bitmap.c | 81 ++++++++++++++++++++++++++++++ pack-bitmap.h | 2 + t/t4208-log-magic-pathspec.sh | 9 +--- t/t6114-rev-list-du.sh | 51 +++++++++++++++++++ t/test-lib-functions.sh | 9 +++- 7 files changed, 199 insertions(+), 8 deletions(-) create mode 100755 t/t6114-rev-list-du.sh 1: 20f8edeff1 = 1: 6365cd94bd t: add --no-tag option to test_commit 2: 64e28cb6c9 ! 2: 8a93583dee rev-list: add --disk-usage option for calculating disk usage @@ Commit message You can find that out by generating a list of objects, getting their sizes from cat-file, and then summing them, like: - git rev-list --objects main..branch - cut -d' ' -f1 | + git rev-list --objects --no-object-names main..branch git cat-file --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' @@ Commit message torvalds/linux: [rev-list piped to cat-file, no bitmaps] - $ time git rev-list --objects --all | - cut -d' ' -f1 | + $ time git rev-list --objects --no-object-names --all | git cat-file --buffer --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' - 1455691059 - real 0m34.336s - user 0m46.533s - sys 0m2.953s + 1459938510 + real 0m29.635s + user 0m38.003s + sys 0m1.093s [internal, no bitmaps] - $ time git rev-list --disk-usage --all - 1455691059 - real 0m32.662s - user 0m32.306s - sys 0m0.353s + $ time git rev-list --disk-usage --objects --all + 1459938510 + real 0m31.262s + user 0m30.885s + sys 0m0.376s - The wall-clock times aren't that different because of parallelism, but - notice the CPU savings between the two. We saved 35% of the CPU just by + Even though the wall-clock time is slightly worse due to parallelism, + notice the CPU savings between the two. We saved 21% of the CPU just by avoiding the pipes. But the real win is with bitmaps. If we use them without the new option: [rev-list piped to cat-file, bitmaps] - $ time git rev-list --objects --all --use-bitmap-index | - cut -d' ' -f1 | + $ time git rev-list --objects --no-object-names --all --use-bitmap-index | git cat-file --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' - real 0m9.954s - user 0m11.234s - sys 0m8.522s + 1459938510 + real 0m6.244s + user 0m8.452s + sys 0m0.311s then we're faster to generate the list of objects, but we still spend a lot of time piping and looking things up. But if we do both together: [internal, bitmaps] - $ time git rev-list --disk-usage --all --use-bitmap-index - 1455691059 - real 0m0.235s - user 0m0.186s + $ time git rev-list --disk-usage --objects --all --use-bitmap-index + 1459938510 + real 0m0.219s + user 0m0.169s sys 0m0.049s then we get the same answer much faster. @@ Commit message of course. But we're actually checking reachability here, so we're still fast when we ask for more interesting things: - $ time git rev-list --disk-usage --all --use-bitmap-index v5.0..v5.10 + $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10 374798628 real 0m0.429s user 0m0.356s @@ Documentation/rev-list-options.txt: ifdef::git-rev-list[] + +--disk-usage:: + Suppress normal output; instead, print the sum of the bytes used -+ for on-disk storage by the selected objects. This is equivalent -+ to piping the output of `rev-list --objects` into -+ `git cat-file --batch-check='%(objectsize:disk)', except that it -+ runs much faster (especially with `--use-bitmap-index`). See the -+ `CAVEATS` section in linkgit:git-cat-file[1] for the limitations -+ of what "on-disk storage" means. ++ for on-disk storage by the selected commits or objects. This is ++ equivalent to piping the output into `git cat-file ++ --batch-check='%(objectsize:disk)'`, except that it runs much ++ faster (especially with `--use-bitmap-index`). See the `CAVEATS` ++ section in linkgit:git-cat-file[1] for the limitations of what ++ "on-disk storage" means. endif::git-rev-list[] --cherry-mark:: @@ builtin/rev-list.c: static int try_bitmap_traversal(struct rev_info *revs, + return -1; + + printf("%"PRIuMAX"\n", -+ (uintmax_t)get_disk_usage_from_bitmap(bitmap_git)); ++ (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs)); + return 0; +} + @@ builtin/rev-list.c: int cmd_rev_list(int argc, const char **argv, const char *pr + if (!strcmp(arg, "--disk-usage")) { + show_disk_usage = 1; -+ revs.tag_objects = 1; -+ revs.tree_objects = 1; -+ revs.blob_objects = 1; + info.flags |= REV_LIST_QUIET; + continue; + } @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid); } + -+off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git) ++static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git, ++ enum object_type object_type) +{ + struct bitmap *result = bitmap_git->result; + struct packed_git *pack = bitmap_git->pack; -+ struct eindex *eindex = &bitmap_git->ext_index; -+ struct object_info oi = OBJECT_INFO_INIT; -+ off_t object_size; + off_t total = 0; ++ struct ewah_iterator it; ++ eword_t filter; + size_t i; + -+ oi.disk_sizep = &object_size; -+ -+ for (i = 0; i < result->word_alloc; i++) { -+ eword_t word = result->words[i]; ++ init_type_iterator(&it, bitmap_git, object_type); ++ for (i = 0; i < result->word_alloc && ++ ewah_iterator_next(&filter, &it); i++) { ++ eword_t word = result->words[i] & filter; + size_t base = (i * BITS_IN_EWORD); + unsigned offset; + ++ if (!word) ++ continue; ++ + for (offset = 0; offset < BITS_IN_EWORD; offset++) { + size_t pos; + @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g + + offset += ewah_bit_ctz64(word >> offset); + pos = base + offset; -+ -+ /* -+ * If it's in the pack, we can use the fast path -+ * and just check the revindex. Otherwise, we -+ * fall back to looking it up. -+ */ -+ if (pos < pack->num_objects) { -+ object_size = -+ pack_pos_to_offset(pack, pos + 1) - -+ pack_pos_to_offset(pack, pos); -+ } else { -+ struct object *obj; -+ obj = eindex->objects[pos - pack->num_objects]; -+ if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0) -+ die(_("unable to get disk usage of %s"), -+ oid_to_hex(&obj->oid)); -+ } -+ -+ total += object_size; ++ total += pack_pos_to_offset(pack, pos + 1) - ++ pack_pos_to_offset(pack, pos); + } + } + + return total; ++} ++ ++static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git) ++{ ++ struct bitmap *result = bitmap_git->result; ++ struct packed_git *pack = bitmap_git->pack; ++ struct eindex *eindex = &bitmap_git->ext_index; ++ off_t total = 0; ++ struct object_info oi = OBJECT_INFO_INIT; ++ off_t object_size; ++ size_t i; ++ ++ oi.disk_sizep = &object_size; ++ ++ for (i = 0; i < eindex->count; i++) { ++ struct object *obj = eindex->objects[i]; ++ ++ if (!bitmap_get(result, pack->num_objects + i)) ++ continue; ++ ++ if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0) ++ die(_("unable to get disk usage of %s"), ++ oid_to_hex(&obj->oid)); ++ ++ total += object_size; ++ } ++ return total; ++} ++ ++off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git, ++ struct rev_info *revs) ++{ ++ off_t total = 0; ++ ++ total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT); ++ if (revs->tree_objects) ++ total += get_disk_usage_for_type(bitmap_git, OBJ_TREE); ++ if (revs->blob_objects) ++ total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB); ++ if (revs->tag_objects) ++ total += get_disk_usage_for_type(bitmap_git, OBJ_TAG); ++ ++ total += get_disk_usage_for_extended(bitmap_git); ++ ++ return total; +} ## pack-bitmap.h ## @@ pack-bitmap.h: int bitmap_walk_contains(struct bitmap_index *, */ int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid); -+off_t get_disk_usage_from_bitmap(struct bitmap_index *); ++off_t get_disk_usage_from_bitmap(struct bitmap_index *, struct rev_info *); + void bitmap_writer_show_progress(int show); void bitmap_writer_set_checksum(unsigned char *sha1); @@ t/t6114-rev-list-du.sh (new) +# packing, zlib, etc. We'll assume that the regular rev-list and cat-file +# machinery works and compare the --disk-usage output to that. +disk_usage_slow () { -+ git rev-list --objects "$@" | -+ cut -d' ' -f1 | ++ git rev-list --no-object-names "$@" | + git cat-file --batch-check="%(objectsize:disk)" | + perl -lne '$total += $_; END { print $total}' +} @@ t/t6114-rev-list-du.sh (new) +} + +check_du HEAD -+check_du HEAD^..HEAD ++check_du --objects HEAD ++check_du --objects HEAD^..HEAD + +test_done
On Tue, Feb 09, 2021 at 05:52:28AM -0500, Jeff King wrote: > This fixes the minor bits mentioned in review for v1, but the big change > is that "--disk-usage" no longer implies "--objects". I think you > generally would want to use it with that option, but it really seemed to > violate the principle of least surprise for the user. > > That requires handling each object type independently, but the code for > that turned out to be not too bad (and is modeled after the similar > logic in traverse_bitmap_commit_list()). I was slightly concerned that > it would slow things down to walk over the bitmap multiple times, but it > doesn't seem to make much of a difference in practice. You might reasonably ask whether we could just directly use traverse_bitmap_commit_list(), since after all it takes a callback. And indeed, doing so reduces the size of the code (see the patch below). But it's shockingly slower! It takes consistently 2-3x longer to produce the same answer on linux.git with bitmaps. The problem is that we give more information to the callback than the disk-usage computation needs. In particular, finding nth_packed_object_id() is a big killer. Which kind of makes sense. We memcpy() the oids out of the .idx file into a "struct object_id" on the stack. And linux.git has ~200MB of oids to copy (and I'm sure doing it 20 bytes at a time isn't quite optimal). That adds several hundred milliseconds. Not a lot in absolute terms, but we're able to do the whole computation in ~200ms to start with, so it's relatively a big change. This could be solved by having a more "bare" callback that just passes the pack position, and not the oid (and then the callback is responsible for looking it up if they care). But it gets pretty awkward when we have to complete the bitmap traversal with non-bitmap objects (for those we _do_ have an oid to pass, but no pack position). I think the implementation in my 2/2 isn't so bad in comparison (and we can always swap it out later; these are all just implementation details). I did find it a bit interesting, though. When we moved to "struct object_id" and started copying bits out with nth_packed_object_id(), rather than just pointing to the mmap'd .idx bytes, we wondered whether there would be any measurable difference. Likewise when we extended it to handle the oid size changing at runtime. At the time, I wasn't able to measure any impact for real operations, but I guess we just needed a case that highlighted it more. I don't know that it's really worth digging into that much, though it's quite possible there may be some easy wins by optimizing those memcpy calls. E.g., I'm not sure if the compiler ends up inlining them or not. If it doesn't realize that the_hash_algo->rawsz is only ever "20" or "32", we could perhaps help it along with specialized versions of hashcpy(). If somebody does want to play with it, this patch may make a good testbed. :) -- >8 -- builtin/pack-objects.c | 3 +- builtin/rev-list.c | 40 +++++++++++------------ pack-bitmap.c | 86 ++------------------------------------------------ pack-bitmap.h | 1 + reachable.c | 1 + 5 files changed, 25 insertions(+), 106 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 13cde5896a..33f7d19eb3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1388,7 +1388,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type, static int add_object_entry_from_bitmap(const struct object_id *oid, enum object_type type, int flags, uint32_t name_hash, - struct packed_git *pack, off_t offset) + struct packed_git *pack, + uint32_t pack_pos, off_t offset) { display_progress(progress_state, ++nr_seen); diff --git a/builtin/rev-list.c b/builtin/rev-list.c index b4d8ea0a35..cc96b4c854 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -83,13 +83,13 @@ static int arg_show_object_names = 1; static int show_disk_usage; static off_t total_disk_usage; -static off_t get_object_disk_usage(struct object *obj) +static off_t get_object_disk_usage(const struct object_id *oid) { off_t size; struct object_info oi = OBJECT_INFO_INIT; oi.disk_sizep = &size; - if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0) - die(_("unable to get disk usage of %s"), oid_to_hex(&obj->oid)); + if (oid_object_info_extended(the_repository, oid, &oi, 0) < 0) + die(_("unable to get disk usage of %s"), oid_to_hex(oid)); return size; } @@ -102,7 +102,7 @@ static void show_commit(struct commit *commit, void *data) display_progress(progress, ++progress_counter); if (show_disk_usage) - total_disk_usage += get_object_disk_usage(&commit->object); + total_disk_usage += get_object_disk_usage(&commit->object.oid); if (info->flags & REV_LIST_QUIET) { finish_commit(commit); @@ -275,7 +275,7 @@ static void show_object(struct object *obj, const char *name, void *cb_data) return; display_progress(progress, ++progress_counter); if (show_disk_usage) - total_disk_usage += get_object_disk_usage(obj); + total_disk_usage += get_object_disk_usage(&obj->oid); if (info->flags & REV_LIST_QUIET) return; @@ -363,8 +363,19 @@ static int show_object_fast( int exclude, uint32_t name_hash, struct packed_git *found_pack, + uint32_t pack_pos, off_t found_offset) { + if (show_disk_usage) { + if (found_pack) { + total_disk_usage += + pack_pos_to_offset(found_pack, pack_pos + 1) - + found_offset; + } else { + total_disk_usage += get_object_disk_usage(oid); + } + return 1; + } fprintf(stdout, "%s\n", oid_to_hex(oid)); return 1; } @@ -467,23 +478,10 @@ static int try_bitmap_traversal(struct rev_info *revs, traverse_bitmap_commit_list(bitmap_git, revs, &show_object_fast); free_bitmap_index(bitmap_git); - return 0; -} - -static int try_bitmap_disk_usage(struct rev_info *revs, - struct list_objects_filter_options *filter) -{ - struct bitmap_index *bitmap_git; - if (!show_disk_usage) - return -1; - - bitmap_git = prepare_bitmap_walk(revs, filter); - if (!bitmap_git) - return -1; + if (show_disk_usage) + printf("%"PRIuMAX"\n", (uintmax_t)total_disk_usage); - printf("%"PRIuMAX"\n", - (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs)); return 0; } @@ -667,8 +665,6 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (use_bitmap_index) { if (!try_bitmap_count(&revs, &filter_options)) return 0; - if (!try_bitmap_disk_usage(&revs, &filter_options)) - return 0; if (!try_bitmap_traversal(&revs, &filter_options)) return 0; } diff --git a/pack-bitmap.c b/pack-bitmap.c index 1f69b5fa85..f118a365e1 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -655,7 +655,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git, (obj->type == OBJ_TAG && !revs->tag_objects)) continue; - show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0); + show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, 0); } } @@ -726,7 +726,8 @@ static void show_objects_for_type( if (bitmap_git->hashes) hash = get_be32(bitmap_git->hashes + index_pos); - show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs); + show_reach(&oid, object_type, 0, hash, + bitmap_git->pack, pos + offset, ofs); } } } @@ -1430,84 +1431,3 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git, return bitmap_git && bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid); } - -static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git, - enum object_type object_type) -{ - struct bitmap *result = bitmap_git->result; - struct packed_git *pack = bitmap_git->pack; - off_t total = 0; - struct ewah_iterator it; - eword_t filter; - size_t i; - - init_type_iterator(&it, bitmap_git, object_type); - for (i = 0; i < result->word_alloc && - ewah_iterator_next(&filter, &it); i++) { - eword_t word = result->words[i] & filter; - size_t base = (i * BITS_IN_EWORD); - unsigned offset; - - if (!word) - continue; - - for (offset = 0; offset < BITS_IN_EWORD; offset++) { - size_t pos; - - if ((word >> offset) == 0) - break; - - offset += ewah_bit_ctz64(word >> offset); - pos = base + offset; - total += pack_pos_to_offset(pack, pos + 1) - - pack_pos_to_offset(pack, pos); - } - } - - return total; -} - -static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git) -{ - struct bitmap *result = bitmap_git->result; - struct packed_git *pack = bitmap_git->pack; - struct eindex *eindex = &bitmap_git->ext_index; - off_t total = 0; - struct object_info oi = OBJECT_INFO_INIT; - off_t object_size; - size_t i; - - oi.disk_sizep = &object_size; - - for (i = 0; i < eindex->count; i++) { - struct object *obj = eindex->objects[i]; - - if (!bitmap_get(result, pack->num_objects + i)) - continue; - - if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0) - die(_("unable to get disk usage of %s"), - oid_to_hex(&obj->oid)); - - total += object_size; - } - return total; -} - -off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git, - struct rev_info *revs) -{ - off_t total = 0; - - total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT); - if (revs->tree_objects) - total += get_disk_usage_for_type(bitmap_git, OBJ_TREE); - if (revs->blob_objects) - total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB); - if (revs->tag_objects) - total += get_disk_usage_for_type(bitmap_git, OBJ_TAG); - - total += get_disk_usage_for_extended(bitmap_git); - - return total; -} diff --git a/pack-bitmap.h b/pack-bitmap.h index 36d99930d8..ba71a9f5c6 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -38,6 +38,7 @@ typedef int (*show_reachable_fn)( int flags, uint32_t hash, struct packed_git *found_pack, + uint32_t pack_pos, off_t found_offset); struct bitmap_index; diff --git a/reachable.c b/reachable.c index 77a60c70a5..79ebe8f940 100644 --- a/reachable.c +++ b/reachable.c @@ -182,6 +182,7 @@ static int mark_object_seen(const struct object_id *oid, int exclude, uint32_t name_hash, struct packed_git *found_pack, + uint32_t pack_pos, off_t found_offset) { struct object *obj = lookup_object_by_type(the_repository, oid, type);
Jeff King <peff@peff.net> writes: > I don't know that it's really worth digging into that much, though it's > quite possible there may be some easy wins by optimizing those memcpy > calls. E.g., I'm not sure if the compiler ends up inlining them or not. > If it doesn't realize that the_hash_algo->rawsz is only ever "20" or > "32", we could perhaps help it along with specialized versions of > hashcpy(). If somebody does want to play with it, this patch may make a > good testbed. :) Yuck. That reminds me of the adventure Shawn he made in the Java land benchmarking which one among int[5], int a,b,c,d,e, char[40] is the most efficient way (both storage-wise and performance-wise) to store SHA-1 hash. I wish we didn't have to go there. It indeed is an interesting, despite a bit sad, observation that even with a good precomputed information, an overly heavy interface can kill potential performance benefit. Thanks.
Jeff King <peff@peff.net> writes: > Here's a re-roll of my series to add "rev-list --disk-usage", for > counting up object storage used for various slices of history. > ... > t/t6114-rev-list-du.sh | 51 +++++++++++++++++++ > t/test-lib-functions.sh | 9 +++- > 7 files changed, 199 insertions(+), 8 deletions(-) > create mode 100755 t/t6114-rev-list-du.sh I relocated 6114 to 6115 to avoid tests sharing the same number. I am getting these numbers from random ranges I am interested in, but do they say what I think they mean? Was the development effort went into the v2.28 release almost half the size of v2.29, and have we already done about the same amont of work for this cycle? : gitster git.git/seen; rungit seen rev-list --disk-usage master..next 83105 : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master 183463 : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0 231640 : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0 334355 : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0 182298
On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Here's a re-roll of my series to add "rev-list --disk-usage", for > > counting up object storage used for various slices of history. > > ... > > t/t6114-rev-list-du.sh | 51 +++++++++++++++++++ > > t/test-lib-functions.sh | 9 +++- > > 7 files changed, 199 insertions(+), 8 deletions(-) > > create mode 100755 t/t6114-rev-list-du.sh > > I relocated 6114 to 6115 to avoid tests sharing the same number. Thanks. > I am getting these numbers from random ranges I am interested in, > but do they say what I think they mean? Was the development effort > went into the v2.28 release almost half the size of v2.29, and have > we already done about the same amont of work for this cycle? > > : gitster git.git/seen; rungit seen rev-list --disk-usage master..next > 83105 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master > 183463 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0 > 231640 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0 > 334355 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0 > 182298 I think you are surprised by these numbers because you're only counting disk usage of commit objects in those ranges. v1 of this series implied --objects by default, but this changed in v2 due to my suggestion. Passing --objects to count the disk-usage of all objects in those ranges gives more reasonable numbers (and match my rough guesses, i.e., that 2.29 was busier than 2.30, and so on): $ for range in origin/master..origin/next v2.30.0..origin/master \ v2.29.0..v2.30.0 v2.28.0..v2.29.0 v2.27.0..v2.28.0 do printf "%s %d vs. %d\n" $range \ "$(git rev-list --objects --no-object-names $range | git cat-file --batch-check='%(objectsize:disk)' | paste -sd+ | bc)" \ "$(git.seen rev-list --objects --disk-usage $range)" done origin/master..origin/next 671380 vs. 671380 v2.30.0..origin/master 1618815 vs. 1618815 v2.29.0..v2.30.0 3308295 vs. 3308295 v2.28.0..v2.29.0 4080789 vs. 4080789 v2.27.0..v2.28.0 2846196 vs. 2846196 Thanks, Taylor
On Tue, Feb 09, 2021 at 01:14:17PM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I don't know that it's really worth digging into that much, though it's > > quite possible there may be some easy wins by optimizing those memcpy > > calls. E.g., I'm not sure if the compiler ends up inlining them or not. > > If it doesn't realize that the_hash_algo->rawsz is only ever "20" or > > "32", we could perhaps help it along with specialized versions of > > hashcpy(). If somebody does want to play with it, this patch may make a > > good testbed. :) > > Yuck. That reminds me of the adventure Shawn he made in the Java > land benchmarking which one among int[5], int a,b,c,d,e, char[40] is > the most efficient way (both storage-wise and performance-wise) to > store SHA-1 hash. I wish we didn't have to go there. > > It indeed is an interesting, despite a bit sad, observation that > even with a good precomputed information, an overly heavy interface > can kill potential performance benefit. Agreed. But I'm hoping we can continue to mostly ignore it. I suspect this finding means we are wasting a few hundred milliseconds copying oids around during a clone of torvalds/linux. But overall that is a pretty heavy-weight operation, and I doubt anybody really notices. And for something as lightweight as --disk-usage, it was easy enough to optimize around it. It probably does have a more measurable impact in something like: git rev-list --use-bitmap-index --objects HEAD >/dev/null where we really do need those oids, and the extra copying might add up. I guess if somebody is interested in micro-optimizing, that is probably a good command to look at. -Peff
On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Here's a re-roll of my series to add "rev-list --disk-usage", for > > counting up object storage used for various slices of history. > > ... > > t/t6114-rev-list-du.sh | 51 +++++++++++++++++++ > > t/test-lib-functions.sh | 9 +++- > > 7 files changed, 199 insertions(+), 8 deletions(-) > > create mode 100755 t/t6114-rev-list-du.sh > > I relocated 6114 to 6115 to avoid tests sharing the same number. Thanks. I wondered why I didn't notice, but it's because the other 6114 also just made it into "seen". :) > I am getting these numbers from random ranges I am interested in, > but do they say what I think they mean? Was the development effort > went into the v2.28 release almost half the size of v2.29, and have > we already done about the same amont of work for this cycle? > > : gitster git.git/seen; rungit seen rev-list --disk-usage master..next > 83105 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master > 183463 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0 > 231640 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0 > 334355 > : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0 > 182298 As Taylor mentioned, this is only hitting the commits. So you might as well just be looking at commit counts as a measure of work, I'd think (and indeed v2.28 has about half as many commits as v2.29!). Adding --objects gets you a rougher estimate of "bytes changed", which helps accounts for commits of different sizes. But there I think you'd do just as well to look at the actual number of lines changed with "git diff --numstat". I'd expect the number of on-disk bytes to _roughly_ correspond to the size of the changes. But you are working against the heuristics of the delta chains there. It may well be that we would store a base object in the v2.28..v2.29 range, and a delta against it in v2.27..v2.28. And that would attribute most of the bytes to v2.29, even though they should be shared roughly with v2.28. I'm sure one could devise a scheme for "sharing" the bytes from a delta family across all of its objects. That might even be worth implementing on top (I don't even think it would be too expensive; you just have to collect the delta chains for any objects you're reporting, and then average the total size among a chain). But in practice, we've found this kind of naive --disk-usage useful for answering questions like: - do I need all of these objects? Comparing "rev-list --disk-usage --objects --all", "rev-list --disk-usage --objects --all --reflog", and "du objects/pack/*.pack" will tell you if a prune/repack might help, and whether expiring reflogs makes a difference. - the size of the shared alternates repo for a set of forks has jumped. Comparing "rev-list --disk-usage --objects --remotes=$base --not --remotes=$fork" will tell you what's reachable from a fork but not from the base (we use "refs/remotes/$id/*" to keep track of fork refs in our alternates repo). This can be junk like somebody forking git/git and then uploading a bunch of pirated video files. :) - likewise, the size of cloning a single repo may jump. Comparing "rev-list --disk-usage --objects HEAD..$branch" for each branch might show that one branch is an outlier (e.g., because somebody accidentally committed a bunch of build artifacts). In those kinds of cases, it's not usually "oh, this version is twice as big as this other one". It's more like "wow, this branch is 100x as big as the other branches", and little decisions like delta direction are just noise. I imagine that in those cases the uncompressed object sizes would probably produce similar patterns and answers. But it's actually faster to produce the on-disk sizes. :) -Peff
Jeff King <peff@peff.net> writes: > But in practice, we've found this kind of naive --disk-usage useful for > answering questions like: > > - do I need all of these objects? Comparing "rev-list --disk-usage > --objects --all", "rev-list --disk-usage --objects --all --reflog", > and "du objects/pack/*.pack" will tell you if a prune/repack might > help, and whether expiring reflogs makes a difference. > > - the size of the shared alternates repo for a set of forks has > jumped. Comparing "rev-list --disk-usage --objects --remotes=$base > --not --remotes=$fork" will tell you what's reachable from a fork > but not from the base (we use "refs/remotes/$id/*" to keep track of > fork refs in our alternates repo). This can be junk like somebody > forking git/git and then uploading a bunch of pirated video files. > :) > > - likewise, the size of cloning a single repo may jump. Comparing > "rev-list --disk-usage --objects HEAD..$branch" for each branch > might show that one branch is an outlier (e.g., because somebody > accidentally committed a bunch of build artifacts). > > In those kinds of cases, it's not usually "oh, this version is twice as > big as this other one". It's more like "wow, this branch is 100x as big > as the other branches", and little decisions like delta direction are > just noise. I imagine that in those cases the uncompressed object sizes > would probably produce similar patterns and answers. But it's actually > faster to produce the on-disk sizes. :) Thanks. I kind of feel sad to have a nice write-up like this only in the list archive. Is there a section in our documentation set to keep collection of such a real-life use cases? Perhaps the examples section of manpages is the closest thing, but it looks a bit too narrowly scoped for the example section of "rev-list" manpage. THanks.
On Wed, Feb 10, 2021 at 08:31:08AM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > But in practice, we've found this kind of naive --disk-usage useful for > > answering questions like: > [...] > > I kind of feel sad to have a nice write-up like this only in the > list archive. Is there a section in our documentation set to keep > collection of such a real-life use cases? Perhaps the examples > section of manpages is the closest thing, but it looks a bit too > narrowly scoped for the example section of "rev-list" manpage. Agreed on both counts. If this gets put into a release, I suspect Taylor would cover it in a release blog post. That is not quite the same thing as having it in the documentation, but it may provide more search engine boost than the list archive. I dunno. -Peff
On Wed, Feb 10, 2021 at 03:38:58PM -0500, Jeff King wrote: > On Wed, Feb 10, 2021 at 08:31:08AM -0800, Junio C Hamano wrote: > > > Jeff King <peff@peff.net> writes: > > > > > But in practice, we've found this kind of naive --disk-usage useful for > > > answering questions like: > > [...] > > > > I kind of feel sad to have a nice write-up like this only in the > > list archive. Is there a section in our documentation set to keep > > collection of such a real-life use cases? Perhaps the examples > > section of manpages is the closest thing, but it looks a bit too > > narrowly scoped for the example section of "rev-list" manpage. > > Agreed on both counts. If this gets put into a release, I suspect Taylor > would cover it in a release blog post. That is not quite the same thing > as having it in the documentation, but it may provide more search engine > boost than the list archive. I dunno. Yeah, this is the perfect sort of thing for those blog posts. But it makes sense to include some of these examples in our own documentation here, too. git-rev-list(1) doesn't have an EXAMPLES section, but maybe it should. > -Peff Thanks, Taylor
On Wed, Feb 10, 2021 at 06:15:16PM -0500, Taylor Blau wrote: > > > I kind of feel sad to have a nice write-up like this only in the > > > list archive. Is there a section in our documentation set to keep > > > collection of such a real-life use cases? Perhaps the examples > > > section of manpages is the closest thing, but it looks a bit too > > > narrowly scoped for the example section of "rev-list" manpage. > > > > Agreed on both counts. If this gets put into a release, I suspect Taylor > > would cover it in a release blog post. That is not quite the same thing > > as having it in the documentation, but it may provide more search engine > > boost than the list archive. I dunno. > > Yeah, this is the perfect sort of thing for those blog posts. > > But it makes sense to include some of these examples in our own > documentation here, too. git-rev-list(1) doesn't have an EXAMPLES > section, but maybe it should. I think this is the "narrowly scoped" bit from Junio's response above. It would be a bit weird to have an examples section for rev-list that only mentions this rather obscure feature. -Peff
On Thu, Feb 11 2021, Jeff King wrote: > On Wed, Feb 10, 2021 at 06:15:16PM -0500, Taylor Blau wrote: > >> > > I kind of feel sad to have a nice write-up like this only in the >> > > list archive. Is there a section in our documentation set to keep >> > > collection of such a real-life use cases? Perhaps the examples >> > > section of manpages is the closest thing, but it looks a bit too >> > > narrowly scoped for the example section of "rev-list" manpage. >> > >> > Agreed on both counts. If this gets put into a release, I suspect Taylor >> > would cover it in a release blog post. That is not quite the same thing >> > as having it in the documentation, but it may provide more search engine >> > boost than the list archive. I dunno. >> >> Yeah, this is the perfect sort of thing for those blog posts. >> >> But it makes sense to include some of these examples in our own >> documentation here, too. git-rev-list(1) doesn't have an EXAMPLES >> section, but maybe it should. > > I think this is the "narrowly scoped" bit from Junio's response above. > It would be a bit weird to have an examples section for rev-list that > only mentions this rather obscure feature. I don't think the lack of an EXAMPLES section or the relative obscurity of the switch should preclude us from adding useful documentation. Yes it would feel a bit out of place, but we can always have a sub-section of EXAMPLES, and we've got to start somewhere. In this case I don't see why it couldn't be added to OPTIONS, we've got some very long discussion there already, and as long as there's a clear separation in prose from an initial brief discussion of the switch and further prose it won't be confusing for readers, they can just page past the details.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> I think this is the "narrowly scoped" bit from Junio's response above. >> It would be a bit weird to have an examples section for rev-list that >> only mentions this rather obscure feature. > > I don't think the lack of an EXAMPLES section or the relative obscurity > of the switch should preclude us from adding useful documentation. > > Yes it would feel a bit out of place, but we can always have a > sub-section of EXAMPLES, and we've got to start somewhere. > > In this case I don't see why it couldn't be added to OPTIONS, we've got > some very long discussion there already, and as long as there's a clear > separation in prose from an initial brief discussion of the switch and > further prose it won't be confusing for readers, they can just page past > the details. OK. In any case, [v2] as we have it (with test number relocation) should be good as-is, so I'd start preparing to merge it down to 'next' soonish. Thanks.