Message ID | YBHmY7vNxu2hqOa/@coredump.intra.peff.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | rev-list --disk-usage | expand |
On Wed, Jan 27, 2021 at 05:17:07PM -0500, Jeff King wrote: > It can sometimes be useful to see which refs are contributing to the > overall repository size (e.g., does some branch have a bunch of objects > not found elsewhere in history, which indicates that deleting it would > shrink the size of a clone). > > You can find that out by generating a list of objects, getting their > sizes from cat-file, and then summing them, like: > > git rev-list --objects main..branch > cut -d' ' -f1 | I suspect that this is from the original commit message that you wrote a half-decade ago. Not that it really means much, but you could shave one process off of this example by passing '--no-object-names' to 'git rev-list'. The whole point is that we can avoid having to do this, so I don't think it really matters, anyway. > [...] > then we're faster to generate the list of objects, but we still spend a > lot of time piping and looking things up. But if we do both together: > > [internal, bitmaps] > $ time git rev-list --disk-usage --all --use-bitmap-index > 1455691059 > real 0m0.235s > user 0m0.186s > sys 0m0.049s > > then we get the same answer much faster. Very nice. > This _could_ be made more flexible, but I didn't think it was worth the > complexity. Some obvious things one might want are: > > - not counting up all reachable objects (i.e., requiring --objects for > this output, and omitting it just counts up commits). This could be > handled in the bitmap case with some extra code (OR-ing with the > type bitmaps). > > But after 5 years of this patch, I've never wanted that once. The > disk usage of just some of the objects isn't really that useful (and > of course you can still get it by piping to cat-file). Yeah. I think it's trivial to support it, but I'm in favor of a simpler interface. That said, I worry about painting ourselves into a corner if the default implies --objects. If we wanted to change that, I'm pretty sure you'd have to write a rule that says "imply objects, unless --tags, --blobs or etc. are specified, and then only do that". Maybe we'll never have to address that, but it's worth thinking about before committing to implying '--objects'. > - an option to output the sizes of specific objects along with their > oids. But if you want to get to this level of flexibility, I think > you're better off just using cat-file (and if we are concerned about > the pipe costs, we should teach rev-list to understand cat-file's > custom formats). This I agree with completely. Any caller who wants that level of flexibility shouldn't mind the piping. I have no comments on the patch itself, which looks fine to me (and I have seen over and over again as it seems to regularly cause conflicts when merging new releases into GitHub's fork :-)). Thanks, Taylor
Jeff King writes: > diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt > index 002379056a..1e5826f26d 100644 > --- a/Documentation/rev-list-options.txt > +++ b/Documentation/rev-list-options.txt > @@ -222,6 +222,15 @@ ifdef::git-rev-list[] > test the exit status to see if a range of objects is fully > connected (or not). It is faster than redirecting stdout > to `/dev/null` as the output does not have to be formatted. > + > +--disk-usage:: > + Suppress normal output; instead, print the sum of the bytes used > + for on-disk storage by the selected objects. This is equivalent > + to piping the output of `rev-list --objects` into > + `git cat-file --batch-check='%(objectsize:disk)', except that it [ Just a drive-by typo comment from a reader not knowledgeable enough to review the code change :) ] The cat-file command is missing its closing quote.
On Wed, Jan 27, 2021 at 5:20 PM Jeff King <peff@peff.net> wrote: > This patch implements a --disk-usage option which produces the same > answer in a fraction of the time. Here are some timings using a clone of > torvalds/linux: > > [rev-list piped to cat-file, no bitmaps] > $ time git rev-list --objects --all | > cut -d' ' -f1 | > git cat-file --buffer --batch-check='%(objectsize:disk)' | > perl -lne '$total += $_; END { print $total }' > 1455691059 > real 0m34.336s > user 0m46.533s > sys 0m2.953s This example shows the computed size (1455691059)... > But the real win is with bitmaps. If we use them without the new option: > > [rev-list piped to cat-file, bitmaps] > $ time git rev-list --objects --all --use-bitmap-index | > cut -d' ' -f1 | > git cat-file --batch-check='%(objectsize:disk)' | > perl -lne '$total += $_; END { print $total }' > real 0m9.954s > user 0m11.234s > sys 0m8.522s ...however, this example does not (but all the others do). Simple copy/paste error? Not worth a re-roll, of course.
On Wed, Jan 27, 2021 at 05:57:21PM -0500, Taylor Blau wrote: > > You can find that out by generating a list of objects, getting their > > sizes from cat-file, and then summing them, like: > > > > git rev-list --objects main..branch > > cut -d' ' -f1 | > > I suspect that this is from the original commit message that you wrote a > half-decade ago. Not that it really means much, but you could shave one > process off of this example by passing '--no-object-names' to 'git > rev-list'. That, plus my muscle memory to do the cut. We should probably model the better form here, and use it in the test, though (not worth a re-roll on its own, but it looks like there are a few other minor bits). > > - not counting up all reachable objects (i.e., requiring --objects for > > this output, and omitting it just counts up commits). This could be > > handled in the bitmap case with some extra code (OR-ing with the > > type bitmaps). > > > > But after 5 years of this patch, I've never wanted that once. The > > disk usage of just some of the objects isn't really that useful (and > > of course you can still get it by piping to cat-file). > > Yeah. I think it's trivial to support it, but I'm in favor of a simpler > interface. > > That said, I worry about painting ourselves into a corner if the default > implies --objects. If we wanted to change that, I'm pretty sure you'd > have to write a rule that says "imply objects, unless --tags, --blobs or > etc. are specified, and then only do that". > > Maybe we'll never have to address that, but it's worth thinking about > before committing to implying '--objects'. Yeah, the one thing that gives me pause is that it would be hard to undo later. I didn't write the code to handle it in the bitmap case, but I don't think it would be _too_ bad. It is slightly annoying for the all-objects case, because the existing code isn't set up well to iterate either a specific type, or all types. > I have no comments on the patch itself, which looks fine to me (and I > have seen over and over again as it seems to regularly cause conflicts > when merging new releases into GitHub's fork :-)). You are exposing my ulterior motive. :) -Peff
On Wed, Jan 27, 2021 at 06:01:51PM -0500, Kyle Meyer wrote: > > +--disk-usage:: > > + Suppress normal output; instead, print the sum of the bytes used > > + for on-disk storage by the selected objects. This is equivalent > > + to piping the output of `rev-list --objects` into > > + `git cat-file --batch-check='%(objectsize:disk)', except that it > > [ Just a drive-by typo comment from a reader not knowledgeable enough to > review the code change :) ] > > The cat-file command is missing its closing quote. Thanks for catching that. I should have looked at the output of doc-diff, which does reveal it. -Peff
On Wed, Jan 27, 2021 at 06:07:57PM -0500, Eric Sunshine wrote: > This example shows the computed size (1455691059)... > [...] > ...however, this example does not (but all the others do). Simple > copy/paste error? Yep, thanks for catching. (Of course I have since repacked my linux.git, so now it produces a different answer! It does match the current value of the other techniques, though). > Not worth a re-roll, of course. Agreed, but it looks like there are a few other minor bits, so I'll definitely fix it up at the same time. I'll give a little more time before re-rolling in case there are any other comments. -Peff
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt index 002379056a..1e5826f26d 100644 --- a/Documentation/rev-list-options.txt +++ b/Documentation/rev-list-options.txt @@ -222,6 +222,15 @@ ifdef::git-rev-list[] test the exit status to see if a range of objects is fully connected (or not). It is faster than redirecting stdout to `/dev/null` as the output does not have to be formatted. + +--disk-usage:: + Suppress normal output; instead, print the sum of the bytes used + for on-disk storage by the selected objects. This is equivalent + to piping the output of `rev-list --objects` into + `git cat-file --batch-check='%(objectsize:disk)', except that it + runs much faster (especially with `--use-bitmap-index`). See the + `CAVEATS` section in linkgit:git-cat-file[1] for the limitations + of what "on-disk storage" means. endif::git-rev-list[] --cherry-mark:: diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 25c6c3b38d..2262b613dd 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -80,6 +80,19 @@ static int arg_show_object_names = 1; #define DEFAULT_OIDSET_SIZE (16*1024) +static int show_disk_usage; +static off_t total_disk_usage; + +static off_t get_object_disk_usage(struct object *obj) +{ + off_t size; + struct object_info oi = OBJECT_INFO_INIT; + oi.disk_sizep = &size; + if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0) + die(_("unable to get disk usage of %s"), oid_to_hex(&obj->oid)); + return size; +} + static void finish_commit(struct commit *commit); static void show_commit(struct commit *commit, void *data) { @@ -88,6 +101,9 @@ static void show_commit(struct commit *commit, void *data) display_progress(progress, ++progress_counter); + if (show_disk_usage) + total_disk_usage += get_object_disk_usage(&commit->object); + if (info->flags & REV_LIST_QUIET) { finish_commit(commit); return; @@ -258,6 +274,8 @@ static void show_object(struct object *obj, const char *name, void *cb_data) if (finish_object(obj, name, cb_data)) return; display_progress(progress, ++progress_counter); + if (show_disk_usage) + total_disk_usage += get_object_disk_usage(obj); if (info->flags & REV_LIST_QUIET) return; @@ -452,6 +470,23 @@ static int try_bitmap_traversal(struct rev_info *revs, return 0; } +static int try_bitmap_disk_usage(struct rev_info *revs, + struct list_objects_filter_options *filter) +{ + struct bitmap_index *bitmap_git; + + if (!show_disk_usage) + return -1; + + bitmap_git = prepare_bitmap_walk(revs, filter); + if (!bitmap_git) + return -1; + + printf("%"PRIuMAX"\n", + (uintmax_t)get_disk_usage_from_bitmap(bitmap_git)); + return 0; +} + int cmd_rev_list(int argc, const char **argv, const char *prefix) { struct rev_info revs; @@ -584,6 +619,15 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) continue; } + if (!strcmp(arg, "--disk-usage")) { + show_disk_usage = 1; + revs.tag_objects = 1; + revs.tree_objects = 1; + revs.blob_objects = 1; + info.flags |= REV_LIST_QUIET; + continue; + } + usage(rev_list_usage); } @@ -626,6 +670,8 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (use_bitmap_index) { if (!try_bitmap_count(&revs, &filter_options)) return 0; + if (!try_bitmap_disk_usage(&revs, &filter_options)) + return 0; if (!try_bitmap_traversal(&revs, &filter_options)) return 0; } @@ -690,5 +736,8 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) printf("%d\n", revs.count_left + revs.count_right); } + if (show_disk_usage) + printf("%"PRIuMAX"\n", (uintmax_t)total_disk_usage); + return 0; } diff --git a/pack-bitmap.c b/pack-bitmap.c index 60fe20fb87..ba36b9c6a0 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -1430,3 +1430,53 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git, return bitmap_git && bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid); } + +off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git) +{ + struct bitmap *result = bitmap_git->result; + struct packed_git *pack = bitmap_git->pack; + struct eindex *eindex = &bitmap_git->ext_index; + struct object_info oi = OBJECT_INFO_INIT; + off_t object_size; + off_t total = 0; + size_t i; + + oi.disk_sizep = &object_size; + + for (i = 0; i < result->word_alloc; i++) { + eword_t word = result->words[i]; + size_t base = (i * BITS_IN_EWORD); + unsigned offset; + + for (offset = 0; offset < BITS_IN_EWORD; offset++) { + size_t pos; + + if ((word >> offset) == 0) + break; + + offset += ewah_bit_ctz64(word >> offset); + pos = base + offset; + + /* + * If it's in the pack, we can use the fast path + * and just check the revindex. Otherwise, we + * fall back to looking it up. + */ + if (pos < pack->num_objects) { + object_size = + pack_pos_to_offset(pack, pos + 1) - + pack_pos_to_offset(pack, pos); + } else { + struct object *obj; + obj = eindex->objects[pos - pack->num_objects]; + if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0) + die(_("unable to get disk usage of %s"), + oid_to_hex(&obj->oid)); + } + + total += object_size; + } + } + + return total; +} diff --git a/pack-bitmap.h b/pack-bitmap.h index 25dfcf5615..c8070606b7 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -68,6 +68,8 @@ int bitmap_walk_contains(struct bitmap_index *, */ int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid); +off_t get_disk_usage_from_bitmap(struct bitmap_index *); + void bitmap_writer_show_progress(int show); void bitmap_writer_set_checksum(unsigned char *sha1); void bitmap_writer_build_type_index(struct packing_data *to_pack, diff --git a/t/t6114-rev-list-du.sh b/t/t6114-rev-list-du.sh new file mode 100755 index 0000000000..1fadbcaded --- /dev/null +++ b/t/t6114-rev-list-du.sh @@ -0,0 +1,51 @@ +#!/bin/sh + +test_description='basic tests of rev-list --disk-usage' +. ./test-lib.sh + +# we want a mix of reachable and unreachable, as well as +# objects in the bitmapped pack and some outside of it +test_expect_success 'set up repository' ' + test_commit --no-tag one && + test_commit --no-tag two && + git repack -adb && + git reset --hard HEAD^ && + test_commit --no-tag three && + test_commit --no-tag four && + git reset --hard HEAD^ +' + +# We don't want to hardcode sizes, because they depend on the exact details of +# packing, zlib, etc. We'll assume that the regular rev-list and cat-file +# machinery works and compare the --disk-usage output to that. +disk_usage_slow () { + git rev-list --objects "$@" | + cut -d' ' -f1 | + git cat-file --batch-check="%(objectsize:disk)" | + perl -lne '$total += $_; END { print $total}' +} + +# check behavior with given rev-list options; note that +# whitespace is not preserved in args +check_du () { + args=$* + + test_expect_success "generate expected size ($args)" " + disk_usage_slow $args >expect + " + + test_expect_success "rev-list --disk-usage without bitmaps ($args)" " + git rev-list --disk-usage $args >actual && + test_cmp expect actual + " + + test_expect_success "rev-list --disk-usage with bitmaps ($args)" " + git rev-list --disk-usage --use-bitmap-index $args >actual && + test_cmp expect actual + " +} + +check_du HEAD +check_du HEAD^..HEAD + +test_done
It can sometimes be useful to see which refs are contributing to the overall repository size (e.g., does some branch have a bunch of objects not found elsewhere in history, which indicates that deleting it would shrink the size of a clone). You can find that out by generating a list of objects, getting their sizes from cat-file, and then summing them, like: git rev-list --objects main..branch cut -d' ' -f1 | git cat-file --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' Though note that the caveats from git-cat-file(1) apply here. We "blame" base objects more than their deltas, even though the relationship could easily be flipped. Still, it can be a useful rough measure. But one problem is that it's slow to run. Teaching rev-list to sum up the sizes can be much faster for two reasons: 1. It skips all of the piping of object names and sizes. 2. If bitmaps are in use, for objects that are in the bitmapped packfile we can skip the oid_object_info() lookup entirely, and just ask the revindex for the on-disk size. This patch implements a --disk-usage option which produces the same answer in a fraction of the time. Here are some timings using a clone of torvalds/linux: [rev-list piped to cat-file, no bitmaps] $ time git rev-list --objects --all | cut -d' ' -f1 | git cat-file --buffer --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' 1455691059 real 0m34.336s user 0m46.533s sys 0m2.953s [internal, no bitmaps] $ time git rev-list --disk-usage --all 1455691059 real 0m32.662s user 0m32.306s sys 0m0.353s The wall-clock times aren't that different because of parallelism, but notice the CPU savings between the two. We saved 35% of the CPU just by avoiding the pipes. But the real win is with bitmaps. If we use them without the new option: [rev-list piped to cat-file, bitmaps] $ time git rev-list --objects --all --use-bitmap-index | cut -d' ' -f1 | git cat-file --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' real 0m9.954s user 0m11.234s sys 0m8.522s then we're faster to generate the list of objects, but we still spend a lot of time piping and looking things up. But if we do both together: [internal, bitmaps] $ time git rev-list --disk-usage --all --use-bitmap-index 1455691059 real 0m0.235s user 0m0.186s sys 0m0.049s then we get the same answer much faster. For "--all", that answer will correspond closely to "du objects/pack", of course. But we're actually checking reachability here, so we're still fast when we ask for more interesting things: $ time git rev-list --disk-usage --all --use-bitmap-index v5.0..v5.10 374798628 real 0m0.429s user 0m0.356s sys 0m0.072s Signed-off-by: Jeff King <peff@peff.net> --- This _could_ be made more flexible, but I didn't think it was worth the complexity. Some obvious things one might want are: - not counting up all reachable objects (i.e., requiring --objects for this output, and omitting it just counts up commits). This could be handled in the bitmap case with some extra code (OR-ing with the type bitmaps). But after 5 years of this patch, I've never wanted that once. The disk usage of just some of the objects isn't really that useful (and of course you can still get it by piping to cat-file). - an option to output the sizes of specific objects along with their oids. But if you want to get to this level of flexibility, I think you're better off just using cat-file (and if we are concerned about the pipe costs, we should teach rev-list to understand cat-file's custom formats). Documentation/rev-list-options.txt | 9 ++++++ builtin/rev-list.c | 49 ++++++++++++++++++++++++++++ pack-bitmap.c | 50 +++++++++++++++++++++++++++++ pack-bitmap.h | 2 ++ t/t6114-rev-list-du.sh | 51 ++++++++++++++++++++++++++++++ 5 files changed, 161 insertions(+) create mode 100755 t/t6114-rev-list-du.sh