mbox series

[0/2] rev-list --disk-usage

Message ID YBHlGPBSJC++CnPy@coredump.intra.peff.net (mailing list archive)
Headers show
Series rev-list --disk-usage | expand

Message

Jeff King Jan. 27, 2021, 10:11 p.m. UTC
This series teaches rev-list to compute the on-disk size used by a set
of objects. You can do the same thing with cat-file, but this is much
faster (see the timings in the second commit).

We've been running it for about 5 years at GitHub. I hesitated sending
it upstream because it's a bit weird and special-purpose. But it does
come in handy for debugging, analyzing repos, etc. So maybe others will
find it useful.

The first patch is just a test-script enhancement to let test_commit
avoid creating tags. During some recent refactoring, we actually broke
the --disk-usage feature but the test script didn't catch it because the
tags were being picked up by "--all". Since this is at least the third
time I've run into that in our test suite, I thought I'd make it a
little more convenient to avoid. :)

  [1/2]: t: add --no-tag option to test_commit
  [2/2]: rev-list: add --disk-usage option for calculating disk usage

 Documentation/rev-list-options.txt |  9 ++++++
 builtin/rev-list.c                 | 49 ++++++++++++++++++++++++++++
 pack-bitmap.c                      | 50 +++++++++++++++++++++++++++++
 pack-bitmap.h                      |  2 ++
 t/t4208-log-magic-pathspec.sh      |  9 ++----
 t/t6114-rev-list-du.sh             | 51 ++++++++++++++++++++++++++++++
 t/test-lib-functions.sh            |  9 +++++-
 7 files changed, 171 insertions(+), 8 deletions(-)
 create mode 100755 t/t6114-rev-list-du.sh

-Peff

Comments

Taylor Blau Jan. 27, 2021, 10:46 p.m. UTC | #1
On Wed, Jan 27, 2021 at 05:11:36PM -0500, Jeff King wrote:
> The first patch is just a test-script enhancement to let test_commit
> avoid creating tags. During some recent refactoring, we actually broke
> the --disk-usage feature but the test script didn't catch it because the
> tags were being picked up by "--all". Since this is at least the third
> time I've run into that in our test suite, I thought I'd make it a
> little more convenient to avoid. :)

I appreciate the non-incriminating "we", but the person who caused the
regression was most certainly me ;-).

This happened while cherry-picking Junio's recent merge of
tb/revindex-api, which obviously did not cause a merge conflict with
this new caller. The remaining details are boring, but they definitely
weren't Peff's fault :-).

Thanks,
Taylor
Jeff King Feb. 9, 2021, 10:52 a.m. UTC | #2
Here's a re-roll of my series to add "rev-list --disk-usage", for
counting up object storage used for various slices of history.

This fixes the minor bits mentioned in review for v1, but the big change
is that "--disk-usage" no longer implies "--objects". I think you
generally would want to use it with that option, but it really seemed to
violate the principle of least surprise for the user.

That requires handling each object type independently, but the code for
that turned out to be not too bad (and is modeled after the similar
logic in traverse_bitmap_commit_list()). I was slightly concerned that
it would slow things down to walk over the bitmap multiple times, but it
doesn't seem to make much of a difference in practice.

There's a range-diff below, but it's not really worth looking at. All of
the interesting parts were rewritten completely, so you're better off to
just read patch 2 again (and patch 1 did not change at all).

  [1/2]: t: add --no-tag option to test_commit
  [2/2]: rev-list: add --disk-usage option for calculating disk usage

 Documentation/rev-list-options.txt |  9 ++++
 builtin/rev-list.c                 | 46 +++++++++++++++++
 pack-bitmap.c                      | 81 ++++++++++++++++++++++++++++++
 pack-bitmap.h                      |  2 +
 t/t4208-log-magic-pathspec.sh      |  9 +---
 t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
 t/test-lib-functions.sh            |  9 +++-
 7 files changed, 199 insertions(+), 8 deletions(-)
 create mode 100755 t/t6114-rev-list-du.sh

1:  20f8edeff1 = 1:  6365cd94bd t: add --no-tag option to test_commit
2:  64e28cb6c9 ! 2:  8a93583dee rev-list: add --disk-usage option for calculating disk usage
    @@ Commit message
         You can find that out by generating a list of objects, getting their
         sizes from cat-file, and then summing them, like:
     
    -        git rev-list --objects main..branch
    -        cut -d' ' -f1 |
    +        git rev-list --objects --no-object-names main..branch
             git cat-file --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
     
    @@ Commit message
         torvalds/linux:
     
           [rev-list piped to cat-file, no bitmaps]
    -      $ time git rev-list --objects --all |
    -        cut -d' ' -f1 |
    +      $ time git rev-list --objects --no-object-names --all |
             git cat-file --buffer --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
    -      1455691059
    -      real  0m34.336s
    -      user  0m46.533s
    -      sys   0m2.953s
    +      1459938510
    +      real  0m29.635s
    +      user  0m38.003s
    +      sys   0m1.093s
     
           [internal, no bitmaps]
    -      $ time git rev-list --disk-usage --all
    -      1455691059
    -      real  0m32.662s
    -      user  0m32.306s
    -      sys   0m0.353s
    +      $ time git rev-list --disk-usage --objects --all
    +      1459938510
    +      real  0m31.262s
    +      user  0m30.885s
    +      sys   0m0.376s
     
    -    The wall-clock times aren't that different because of parallelism, but
    -    notice the CPU savings between the two. We saved 35% of the CPU just by
    +    Even though the wall-clock time is slightly worse due to parallelism,
    +    notice the CPU savings between the two. We saved 21% of the CPU just by
         avoiding the pipes.
     
         But the real win is with bitmaps. If we use them without the new option:
     
           [rev-list piped to cat-file, bitmaps]
    -      $ time git rev-list --objects --all --use-bitmap-index |
    -        cut -d' ' -f1 |
    +      $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
             git cat-file --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
    -      real  0m9.954s
    -      user  0m11.234s
    -      sys   0m8.522s
    +      1459938510
    +      real  0m6.244s
    +      user  0m8.452s
    +      sys   0m0.311s
     
         then we're faster to generate the list of objects, but we still spend a
         lot of time piping and looking things up. But if we do both together:
     
           [internal, bitmaps]
    -      $ time git rev-list --disk-usage --all --use-bitmap-index
    -      1455691059
    -      real  0m0.235s
    -      user  0m0.186s
    +      $ time git rev-list --disk-usage --objects --all --use-bitmap-index
    +      1459938510
    +      real  0m0.219s
    +      user  0m0.169s
           sys   0m0.049s
     
         then we get the same answer much faster.
    @@ Commit message
         of course. But we're actually checking reachability here, so we're still
         fast when we ask for more interesting things:
     
    -      $ time git rev-list --disk-usage --all --use-bitmap-index v5.0..v5.10
    +      $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
           374798628
           real  0m0.429s
           user  0m0.356s
    @@ Documentation/rev-list-options.txt: ifdef::git-rev-list[]
     +
     +--disk-usage::
     +	Suppress normal output; instead, print the sum of the bytes used
    -+	for on-disk storage by the selected objects. This is equivalent
    -+	to piping the output of `rev-list --objects` into
    -+	`git cat-file --batch-check='%(objectsize:disk)', except that it
    -+	runs much faster (especially with `--use-bitmap-index`). See the
    -+	`CAVEATS` section in linkgit:git-cat-file[1] for the limitations
    -+	of what "on-disk storage" means.
    ++	for on-disk storage by the selected commits or objects. This is
    ++	equivalent to piping the output into `git cat-file
    ++	--batch-check='%(objectsize:disk)'`, except that it runs much
    ++	faster (especially with `--use-bitmap-index`). See the `CAVEATS`
    ++	section in linkgit:git-cat-file[1] for the limitations of what
    ++	"on-disk storage" means.
      endif::git-rev-list[]
      
      --cherry-mark::
    @@ builtin/rev-list.c: static int try_bitmap_traversal(struct rev_info *revs,
     +		return -1;
     +
     +	printf("%"PRIuMAX"\n",
    -+	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git));
    ++	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs));
     +	return 0;
     +}
     +
    @@ builtin/rev-list.c: int cmd_rev_list(int argc, const char **argv, const char *pr
      
     +		if (!strcmp(arg, "--disk-usage")) {
     +			show_disk_usage = 1;
    -+			revs.tag_objects = 1;
    -+			revs.tree_objects = 1;
    -+			revs.blob_objects = 1;
     +			info.flags |= REV_LIST_QUIET;
     +			continue;
     +		}
    @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
      		bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
      }
     +
    -+off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git)
    ++static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
    ++				     enum object_type object_type)
     +{
     +	struct bitmap *result = bitmap_git->result;
     +	struct packed_git *pack = bitmap_git->pack;
    -+	struct eindex *eindex = &bitmap_git->ext_index;
    -+	struct object_info oi = OBJECT_INFO_INIT;
    -+	off_t object_size;
     +	off_t total = 0;
    ++	struct ewah_iterator it;
    ++	eword_t filter;
     +	size_t i;
     +
    -+	oi.disk_sizep = &object_size;
    -+
    -+	for (i = 0; i < result->word_alloc; i++) {
    -+		eword_t word = result->words[i];
    ++	init_type_iterator(&it, bitmap_git, object_type);
    ++	for (i = 0; i < result->word_alloc &&
    ++			ewah_iterator_next(&filter, &it); i++) {
    ++		eword_t word = result->words[i] & filter;
     +		size_t base = (i * BITS_IN_EWORD);
     +		unsigned offset;
     +
    ++		if (!word)
    ++			continue;
    ++
     +		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
     +			size_t pos;
     +
    @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
     +
     +			offset += ewah_bit_ctz64(word >> offset);
     +			pos = base + offset;
    -+
    -+			/*
    -+			 * If it's in the pack, we can use the fast path
    -+			 * and just check the revindex. Otherwise, we
    -+			 * fall back to looking it up.
    -+			 */
    -+			if (pos < pack->num_objects) {
    -+				object_size =
    -+					pack_pos_to_offset(pack, pos + 1) -
    -+					pack_pos_to_offset(pack, pos);
    -+			} else {
    -+				struct object *obj;
    -+				obj = eindex->objects[pos - pack->num_objects];
    -+				if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
    -+					die(_("unable to get disk usage of %s"),
    -+					      oid_to_hex(&obj->oid));
    -+			}
    -+
    -+			total += object_size;
    ++			total += pack_pos_to_offset(pack, pos + 1) -
    ++				 pack_pos_to_offset(pack, pos);
     +		}
     +	}
     +
     +	return total;
    ++}
    ++
    ++static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
    ++{
    ++	struct bitmap *result = bitmap_git->result;
    ++	struct packed_git *pack = bitmap_git->pack;
    ++	struct eindex *eindex = &bitmap_git->ext_index;
    ++	off_t total = 0;
    ++	struct object_info oi = OBJECT_INFO_INIT;
    ++	off_t object_size;
    ++	size_t i;
    ++
    ++	oi.disk_sizep = &object_size;
    ++
    ++	for (i = 0; i < eindex->count; i++) {
    ++		struct object *obj = eindex->objects[i];
    ++
    ++		if (!bitmap_get(result, pack->num_objects + i))
    ++			continue;
    ++
    ++		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
    ++			die(_("unable to get disk usage of %s"),
    ++			    oid_to_hex(&obj->oid));
    ++
    ++		total += object_size;
    ++	}
    ++	return total;
    ++}
    ++
    ++off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
    ++				 struct rev_info *revs)
    ++{
    ++	off_t total = 0;
    ++
    ++	total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT);
    ++	if (revs->tree_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_TREE);
    ++	if (revs->blob_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB);
    ++	if (revs->tag_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_TAG);
    ++
    ++	total += get_disk_usage_for_extended(bitmap_git);
    ++
    ++	return total;
     +}
     
      ## pack-bitmap.h ##
     @@ pack-bitmap.h: int bitmap_walk_contains(struct bitmap_index *,
       */
      int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid);
      
    -+off_t get_disk_usage_from_bitmap(struct bitmap_index *);
    ++off_t get_disk_usage_from_bitmap(struct bitmap_index *, struct rev_info *);
     +
      void bitmap_writer_show_progress(int show);
      void bitmap_writer_set_checksum(unsigned char *sha1);
    @@ t/t6114-rev-list-du.sh (new)
     +# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
     +# machinery works and compare the --disk-usage output to that.
     +disk_usage_slow () {
    -+	git rev-list --objects "$@" |
    -+	cut -d' ' -f1 |
    ++	git rev-list --no-object-names "$@" |
     +	git cat-file --batch-check="%(objectsize:disk)" |
     +	perl -lne '$total += $_; END { print $total}'
     +}
    @@ t/t6114-rev-list-du.sh (new)
     +}
     +
     +check_du HEAD
    -+check_du HEAD^..HEAD
    ++check_du --objects HEAD
    ++check_du --objects HEAD^..HEAD
     +
     +test_done
Jeff King Feb. 9, 2021, 11:09 a.m. UTC | #3
On Tue, Feb 09, 2021 at 05:52:28AM -0500, Jeff King wrote:

> This fixes the minor bits mentioned in review for v1, but the big change
> is that "--disk-usage" no longer implies "--objects". I think you
> generally would want to use it with that option, but it really seemed to
> violate the principle of least surprise for the user.
> 
> That requires handling each object type independently, but the code for
> that turned out to be not too bad (and is modeled after the similar
> logic in traverse_bitmap_commit_list()). I was slightly concerned that
> it would slow things down to walk over the bitmap multiple times, but it
> doesn't seem to make much of a difference in practice.

You might reasonably ask whether we could just directly use
traverse_bitmap_commit_list(), since after all it takes a callback. And
indeed, doing so reduces the size of the code (see the patch below).

But it's shockingly slower! It takes consistently 2-3x longer to produce
the same answer on linux.git with bitmaps. The problem is that we give
more information to the callback than the disk-usage computation needs.

In particular, finding nth_packed_object_id() is a big killer. Which
kind of makes sense. We memcpy() the oids out of the .idx file into a
"struct object_id" on the stack. And linux.git has ~200MB of oids to
copy (and I'm sure doing it 20 bytes at a time isn't quite optimal).
That adds several hundred milliseconds. Not a lot in absolute terms, but
we're able to do the whole computation in ~200ms to start with, so it's
relatively a big change.

This could be solved by having a more "bare" callback that just passes
the pack position, and not the oid (and then the callback is responsible
for looking it up if they care). But it gets pretty awkward when we have
to complete the bitmap traversal with non-bitmap objects (for those we
_do_ have an oid to pass, but no pack position). I think the
implementation in my 2/2 isn't so bad in comparison (and we can always
swap it out later; these are all just implementation details).

I did find it a bit interesting, though. When we moved to "struct
object_id" and started copying bits out with nth_packed_object_id(),
rather than just pointing to the mmap'd .idx bytes, we wondered whether
there would be any measurable difference. Likewise when we extended it
to handle the oid size changing at runtime. At the time, I wasn't able
to measure any impact for real operations, but I guess we just needed a
case that highlighted it more.

I don't know that it's really worth digging into that much, though it's
quite possible there may be some easy wins by optimizing those memcpy
calls. E.g., I'm not sure if the compiler ends up inlining them or not.
If it doesn't realize that the_hash_algo->rawsz is only ever "20" or
"32", we could perhaps help it along with specialized versions of
hashcpy(). If somebody does want to play with it, this patch may make a
good testbed. :)

-- >8 --
 builtin/pack-objects.c |  3 +-
 builtin/rev-list.c     | 40 +++++++++++------------
 pack-bitmap.c          | 86 ++------------------------------------------------
 pack-bitmap.h          |  1 +
 reachable.c            |  1 +
 5 files changed, 25 insertions(+), 106 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 13cde5896a..33f7d19eb3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1388,7 +1388,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
 static int add_object_entry_from_bitmap(const struct object_id *oid,
 					enum object_type type,
 					int flags, uint32_t name_hash,
-					struct packed_git *pack, off_t offset)
+					struct packed_git *pack,
+					uint32_t pack_pos, off_t offset)
 {
 	display_progress(progress_state, ++nr_seen);
 
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index b4d8ea0a35..cc96b4c854 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -83,13 +83,13 @@ static int arg_show_object_names = 1;
 static int show_disk_usage;
 static off_t total_disk_usage;
 
-static off_t get_object_disk_usage(struct object *obj)
+static off_t get_object_disk_usage(const struct object_id *oid)
 {
 	off_t size;
 	struct object_info oi = OBJECT_INFO_INIT;
 	oi.disk_sizep = &size;
-	if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-		die(_("unable to get disk usage of %s"), oid_to_hex(&obj->oid));
+	if (oid_object_info_extended(the_repository, oid, &oi, 0) < 0)
+		die(_("unable to get disk usage of %s"), oid_to_hex(oid));
 	return size;
 }
 
@@ -102,7 +102,7 @@ static void show_commit(struct commit *commit, void *data)
 	display_progress(progress, ++progress_counter);
 
 	if (show_disk_usage)
-		total_disk_usage += get_object_disk_usage(&commit->object);
+		total_disk_usage += get_object_disk_usage(&commit->object.oid);
 
 	if (info->flags & REV_LIST_QUIET) {
 		finish_commit(commit);
@@ -275,7 +275,7 @@ static void show_object(struct object *obj, const char *name, void *cb_data)
 		return;
 	display_progress(progress, ++progress_counter);
 	if (show_disk_usage)
-		total_disk_usage += get_object_disk_usage(obj);
+		total_disk_usage += get_object_disk_usage(&obj->oid);
 	if (info->flags & REV_LIST_QUIET)
 		return;
 
@@ -363,8 +363,19 @@ static int show_object_fast(
 	int exclude,
 	uint32_t name_hash,
 	struct packed_git *found_pack,
+	uint32_t pack_pos,
 	off_t found_offset)
 {
+	if (show_disk_usage) {
+		if (found_pack) {
+			total_disk_usage +=
+				pack_pos_to_offset(found_pack, pack_pos + 1) -
+				found_offset;
+		} else {
+			total_disk_usage += get_object_disk_usage(oid);
+		}
+		return 1;
+	}
 	fprintf(stdout, "%s\n", oid_to_hex(oid));
 	return 1;
 }
@@ -467,23 +478,10 @@ static int try_bitmap_traversal(struct rev_info *revs,
 
 	traverse_bitmap_commit_list(bitmap_git, revs, &show_object_fast);
 	free_bitmap_index(bitmap_git);
-	return 0;
-}
-
-static int try_bitmap_disk_usage(struct rev_info *revs,
-				 struct list_objects_filter_options *filter)
-{
-	struct bitmap_index *bitmap_git;
 
-	if (!show_disk_usage)
-		return -1;
-
-	bitmap_git = prepare_bitmap_walk(revs, filter);
-	if (!bitmap_git)
-		return -1;
+	if (show_disk_usage)
+		printf("%"PRIuMAX"\n", (uintmax_t)total_disk_usage);
 
-	printf("%"PRIuMAX"\n",
-	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs));
 	return 0;
 }
 
@@ -667,8 +665,6 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (use_bitmap_index) {
 		if (!try_bitmap_count(&revs, &filter_options))
 			return 0;
-		if (!try_bitmap_disk_usage(&revs, &filter_options))
-			return 0;
 		if (!try_bitmap_traversal(&revs, &filter_options))
 			return 0;
 	}
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1f69b5fa85..f118a365e1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -655,7 +655,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
 		    (obj->type == OBJ_TAG && !revs->tag_objects))
 			continue;
 
-		show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0);
+		show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, 0);
 	}
 }
 
@@ -726,7 +726,8 @@ static void show_objects_for_type(
 			if (bitmap_git->hashes)
 				hash = get_be32(bitmap_git->hashes + index_pos);
 
-			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
+			show_reach(&oid, object_type, 0, hash,
+				   bitmap_git->pack, pos + offset, ofs);
 		}
 	}
 }
@@ -1430,84 +1431,3 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
 	return bitmap_git &&
 		bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
 }
-
-static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
-				     enum object_type object_type)
-{
-	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
-	off_t total = 0;
-	struct ewah_iterator it;
-	eword_t filter;
-	size_t i;
-
-	init_type_iterator(&it, bitmap_git, object_type);
-	for (i = 0; i < result->word_alloc &&
-			ewah_iterator_next(&filter, &it); i++) {
-		eword_t word = result->words[i] & filter;
-		size_t base = (i * BITS_IN_EWORD);
-		unsigned offset;
-
-		if (!word)
-			continue;
-
-		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
-			size_t pos;
-
-			if ((word >> offset) == 0)
-				break;
-
-			offset += ewah_bit_ctz64(word >> offset);
-			pos = base + offset;
-			total += pack_pos_to_offset(pack, pos + 1) -
-				 pack_pos_to_offset(pack, pos);
-		}
-	}
-
-	return total;
-}
-
-static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
-{
-	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
-	struct eindex *eindex = &bitmap_git->ext_index;
-	off_t total = 0;
-	struct object_info oi = OBJECT_INFO_INIT;
-	off_t object_size;
-	size_t i;
-
-	oi.disk_sizep = &object_size;
-
-	for (i = 0; i < eindex->count; i++) {
-		struct object *obj = eindex->objects[i];
-
-		if (!bitmap_get(result, pack->num_objects + i))
-			continue;
-
-		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-			die(_("unable to get disk usage of %s"),
-			    oid_to_hex(&obj->oid));
-
-		total += object_size;
-	}
-	return total;
-}
-
-off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
-				 struct rev_info *revs)
-{
-	off_t total = 0;
-
-	total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT);
-	if (revs->tree_objects)
-		total += get_disk_usage_for_type(bitmap_git, OBJ_TREE);
-	if (revs->blob_objects)
-		total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB);
-	if (revs->tag_objects)
-		total += get_disk_usage_for_type(bitmap_git, OBJ_TAG);
-
-	total += get_disk_usage_for_extended(bitmap_git);
-
-	return total;
-}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 36d99930d8..ba71a9f5c6 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -38,6 +38,7 @@ typedef int (*show_reachable_fn)(
 	int flags,
 	uint32_t hash,
 	struct packed_git *found_pack,
+	uint32_t pack_pos,
 	off_t found_offset);
 
 struct bitmap_index;
diff --git a/reachable.c b/reachable.c
index 77a60c70a5..79ebe8f940 100644
--- a/reachable.c
+++ b/reachable.c
@@ -182,6 +182,7 @@ static int mark_object_seen(const struct object_id *oid,
 			     int exclude,
 			     uint32_t name_hash,
 			     struct packed_git *found_pack,
+			     uint32_t pack_pos,
 			     off_t found_offset)
 {
 	struct object *obj = lookup_object_by_type(the_repository, oid, type);
Junio C Hamano Feb. 9, 2021, 9:14 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> I don't know that it's really worth digging into that much, though it's
> quite possible there may be some easy wins by optimizing those memcpy
> calls. E.g., I'm not sure if the compiler ends up inlining them or not.
> If it doesn't realize that the_hash_algo->rawsz is only ever "20" or
> "32", we could perhaps help it along with specialized versions of
> hashcpy(). If somebody does want to play with it, this patch may make a
> good testbed. :)

Yuck.  That reminds me of the adventure Shawn he made in the Java
land benchmarking which one among int[5], int a,b,c,d,e, char[40] is
the most efficient way (both storage-wise and performance-wise) to
store SHA-1 hash.  I wish we didn't have to go there.

It indeed is an interesting, despite a bit sad, observation that
even with a good precomputed information, an overly heavy interface
can kill potential performance benefit.

Thanks.
Junio C Hamano Feb. 10, 2021, 12:44 a.m. UTC | #5
Jeff King <peff@peff.net> writes:

> Here's a re-roll of my series to add "rev-list --disk-usage", for
> counting up object storage used for various slices of history.
> ...
>  t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
>  t/test-lib-functions.sh            |  9 +++-
>  7 files changed, 199 insertions(+), 8 deletions(-)
>  create mode 100755 t/t6114-rev-list-du.sh

I relocated 6114 to 6115 to avoid tests sharing the same number.

I am getting these numbers from random ranges I am interested in,
but do they say what I think they mean?  Was the development effort
went into the v2.28 release almost half the size of v2.29, and have
we already done about the same amont of work for this cycle?

: gitster git.git/seen; rungit seen rev-list --disk-usage master..next
83105
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
183463
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
231640
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
334355
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
182298
Taylor Blau Feb. 10, 2021, 1:49 a.m. UTC | #6
On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > Here's a re-roll of my series to add "rev-list --disk-usage", for
> > counting up object storage used for various slices of history.
> > ...
> >  t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
> >  t/test-lib-functions.sh            |  9 +++-
> >  7 files changed, 199 insertions(+), 8 deletions(-)
> >  create mode 100755 t/t6114-rev-list-du.sh
>
> I relocated 6114 to 6115 to avoid tests sharing the same number.

Thanks.

> I am getting these numbers from random ranges I am interested in,
> but do they say what I think they mean?  Was the development effort
> went into the v2.28 release almost half the size of v2.29, and have
> we already done about the same amont of work for this cycle?
>
> : gitster git.git/seen; rungit seen rev-list --disk-usage master..next
> 83105
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
> 183463
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
> 231640
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
> 334355
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
> 182298

I think you are surprised by these numbers because you're only counting
disk usage of commit objects in those ranges. v1 of this series implied
--objects by default, but this changed in v2 due to my suggestion.

Passing --objects to count the disk-usage of all objects in those ranges
gives more reasonable numbers (and match my rough guesses, i.e., that
2.29 was busier than 2.30, and so on):

    $ for range in origin/master..origin/next v2.30.0..origin/master \
        v2.29.0..v2.30.0 v2.28.0..v2.29.0 v2.27.0..v2.28.0
    do
      printf "%s %d vs. %d\n" $range \
        "$(git rev-list --objects --no-object-names $range |
           git cat-file --batch-check='%(objectsize:disk)' |
           paste -sd+ | bc)" \
        "$(git.seen rev-list --objects --disk-usage $range)"
    done
    origin/master..origin/next 671380 vs. 671380
    v2.30.0..origin/master 1618815 vs. 1618815
    v2.29.0..v2.30.0 3308295 vs. 3308295
    v2.28.0..v2.29.0 4080789 vs. 4080789
    v2.27.0..v2.28.0 2846196 vs. 2846196

Thanks,
Taylor
Jeff King Feb. 10, 2021, 9:38 a.m. UTC | #7
On Tue, Feb 09, 2021 at 01:14:17PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I don't know that it's really worth digging into that much, though it's
> > quite possible there may be some easy wins by optimizing those memcpy
> > calls. E.g., I'm not sure if the compiler ends up inlining them or not.
> > If it doesn't realize that the_hash_algo->rawsz is only ever "20" or
> > "32", we could perhaps help it along with specialized versions of
> > hashcpy(). If somebody does want to play with it, this patch may make a
> > good testbed. :)
> 
> Yuck.  That reminds me of the adventure Shawn he made in the Java
> land benchmarking which one among int[5], int a,b,c,d,e, char[40] is
> the most efficient way (both storage-wise and performance-wise) to
> store SHA-1 hash.  I wish we didn't have to go there.
> 
> It indeed is an interesting, despite a bit sad, observation that
> even with a good precomputed information, an overly heavy interface
> can kill potential performance benefit.

Agreed. But I'm hoping we can continue to mostly ignore it. I suspect
this finding means we are wasting a few hundred milliseconds copying
oids around during a clone of torvalds/linux. But overall that is a
pretty heavy-weight operation, and I doubt anybody really notices. And
for something as lightweight as --disk-usage, it was easy enough to
optimize around it.

It probably does have a more measurable impact in something like:

  git rev-list --use-bitmap-index --objects HEAD >/dev/null

where we really do need those oids, and the extra copying might add up.
I guess if somebody is interested in micro-optimizing, that is probably
a good command to look at.

-Peff
Jeff King Feb. 10, 2021, 10:01 a.m. UTC | #8
On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Here's a re-roll of my series to add "rev-list --disk-usage", for
> > counting up object storage used for various slices of history.
> > ...
> >  t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
> >  t/test-lib-functions.sh            |  9 +++-
> >  7 files changed, 199 insertions(+), 8 deletions(-)
> >  create mode 100755 t/t6114-rev-list-du.sh
> 
> I relocated 6114 to 6115 to avoid tests sharing the same number.

Thanks. I wondered why I didn't notice, but it's because the other 6114
also just made it into "seen". :)

> I am getting these numbers from random ranges I am interested in,
> but do they say what I think they mean?  Was the development effort
> went into the v2.28 release almost half the size of v2.29, and have
> we already done about the same amont of work for this cycle?
> 
> : gitster git.git/seen; rungit seen rev-list --disk-usage master..next
> 83105
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
> 183463
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
> 231640
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
> 334355
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
> 182298

As Taylor mentioned, this is only hitting the commits. So you might as
well just be looking at commit counts as a measure of work, I'd think
(and indeed v2.28 has about half as many commits as v2.29!).

Adding --objects gets you a rougher estimate of "bytes changed", which
helps accounts for commits of different sizes. But there I think you'd
do just as well to look at the actual number of lines changed with "git
diff --numstat".

I'd expect the number of on-disk bytes to _roughly_ correspond to the
size of the changes. But you are working against the heuristics of the
delta chains there. It may well be that we would store a base object in
the v2.28..v2.29 range, and a delta against it in v2.27..v2.28. And that
would attribute most of the bytes to v2.29, even though they should be
shared roughly with v2.28.

I'm sure one could devise a scheme for "sharing" the bytes from a delta
family across all of its objects. That might even be worth implementing
on top (I don't even think it would be too expensive; you just have to
collect the delta chains for any objects you're reporting, and then
average the total size among a chain).

But in practice, we've found this kind of naive --disk-usage useful for
answering questions like:

  - do I need all of these objects? Comparing "rev-list --disk-usage
    --objects --all", "rev-list --disk-usage --objects --all --reflog",
    and "du objects/pack/*.pack" will tell you if a prune/repack might
    help, and whether expiring reflogs makes a difference.

  - the size of the shared alternates repo for a set of forks has
    jumped. Comparing "rev-list --disk-usage --objects --remotes=$base
    --not --remotes=$fork" will tell you what's reachable from a fork
    but not from the base (we use "refs/remotes/$id/*" to keep track of
    fork refs in our alternates repo). This can be junk like somebody
    forking git/git and then uploading a bunch of pirated video files.
    :)

  - likewise, the size of cloning a single repo may jump. Comparing
    "rev-list --disk-usage --objects HEAD..$branch" for each branch
    might show that one branch is an outlier (e.g., because somebody
    accidentally committed a bunch of build artifacts).

In those kinds of cases, it's not usually "oh, this version is twice as
big as this other one". It's more like "wow, this branch is 100x as big
as the other branches", and little decisions like delta direction are
just noise. I imagine that in those cases the uncompressed object sizes
would probably produce similar patterns and answers. But it's actually
faster to produce the on-disk sizes. :)

-Peff
Junio C Hamano Feb. 10, 2021, 4:31 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

> But in practice, we've found this kind of naive --disk-usage useful for
> answering questions like:
>
>   - do I need all of these objects? Comparing "rev-list --disk-usage
>     --objects --all", "rev-list --disk-usage --objects --all --reflog",
>     and "du objects/pack/*.pack" will tell you if a prune/repack might
>     help, and whether expiring reflogs makes a difference.
>
>   - the size of the shared alternates repo for a set of forks has
>     jumped. Comparing "rev-list --disk-usage --objects --remotes=$base
>     --not --remotes=$fork" will tell you what's reachable from a fork
>     but not from the base (we use "refs/remotes/$id/*" to keep track of
>     fork refs in our alternates repo). This can be junk like somebody
>     forking git/git and then uploading a bunch of pirated video files.
>     :)
>
>   - likewise, the size of cloning a single repo may jump. Comparing
>     "rev-list --disk-usage --objects HEAD..$branch" for each branch
>     might show that one branch is an outlier (e.g., because somebody
>     accidentally committed a bunch of build artifacts).
>
> In those kinds of cases, it's not usually "oh, this version is twice as
> big as this other one". It's more like "wow, this branch is 100x as big
> as the other branches", and little decisions like delta direction are
> just noise. I imagine that in those cases the uncompressed object sizes
> would probably produce similar patterns and answers. But it's actually
> faster to produce the on-disk sizes. :)

Thanks.

I kind of feel sad to have a nice write-up like this only in the
list archive.  Is there a section in our documentation set to keep
collection of such a real-life use cases?  Perhaps the examples
section of manpages is the closest thing, but it looks a bit too
narrowly scoped for the example section of "rev-list" manpage.

THanks.
Jeff King Feb. 10, 2021, 8:38 p.m. UTC | #10
On Wed, Feb 10, 2021 at 08:31:08AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > But in practice, we've found this kind of naive --disk-usage useful for
> > answering questions like:
> [...]
>
> I kind of feel sad to have a nice write-up like this only in the
> list archive.  Is there a section in our documentation set to keep
> collection of such a real-life use cases?  Perhaps the examples
> section of manpages is the closest thing, but it looks a bit too
> narrowly scoped for the example section of "rev-list" manpage.

Agreed on both counts. If this gets put into a release, I suspect Taylor
would cover it in a release blog post. That is not quite the same thing
as having it in the documentation, but it may provide more search engine
boost than the list archive. I dunno.

-Peff
Taylor Blau Feb. 10, 2021, 11:15 p.m. UTC | #11
On Wed, Feb 10, 2021 at 03:38:58PM -0500, Jeff King wrote:
> On Wed, Feb 10, 2021 at 08:31:08AM -0800, Junio C Hamano wrote:
>
> > Jeff King <peff@peff.net> writes:
> >
> > > But in practice, we've found this kind of naive --disk-usage useful for
> > > answering questions like:
> > [...]
> >
> > I kind of feel sad to have a nice write-up like this only in the
> > list archive.  Is there a section in our documentation set to keep
> > collection of such a real-life use cases?  Perhaps the examples
> > section of manpages is the closest thing, but it looks a bit too
> > narrowly scoped for the example section of "rev-list" manpage.
>
> Agreed on both counts. If this gets put into a release, I suspect Taylor
> would cover it in a release blog post. That is not quite the same thing
> as having it in the documentation, but it may provide more search engine
> boost than the list archive. I dunno.

Yeah, this is the perfect sort of thing for those blog posts.

But it makes sense to include some of these examples in our own
documentation here, too. git-rev-list(1) doesn't have an EXAMPLES
section, but maybe it should.

> -Peff

Thanks,
Taylor
Jeff King Feb. 11, 2021, 11 a.m. UTC | #12
On Wed, Feb 10, 2021 at 06:15:16PM -0500, Taylor Blau wrote:

> > > I kind of feel sad to have a nice write-up like this only in the
> > > list archive.  Is there a section in our documentation set to keep
> > > collection of such a real-life use cases?  Perhaps the examples
> > > section of manpages is the closest thing, but it looks a bit too
> > > narrowly scoped for the example section of "rev-list" manpage.
> >
> > Agreed on both counts. If this gets put into a release, I suspect Taylor
> > would cover it in a release blog post. That is not quite the same thing
> > as having it in the documentation, but it may provide more search engine
> > boost than the list archive. I dunno.
> 
> Yeah, this is the perfect sort of thing for those blog posts.
> 
> But it makes sense to include some of these examples in our own
> documentation here, too. git-rev-list(1) doesn't have an EXAMPLES
> section, but maybe it should.

I think this is the "narrowly scoped" bit from Junio's response above.
It would be a bit weird to have an examples section for rev-list that
only mentions this rather obscure feature.

-Peff
Ævar Arnfjörð Bjarmason Feb. 11, 2021, 12:04 p.m. UTC | #13
On Thu, Feb 11 2021, Jeff King wrote:

> On Wed, Feb 10, 2021 at 06:15:16PM -0500, Taylor Blau wrote:
>
>> > > I kind of feel sad to have a nice write-up like this only in the
>> > > list archive.  Is there a section in our documentation set to keep
>> > > collection of such a real-life use cases?  Perhaps the examples
>> > > section of manpages is the closest thing, but it looks a bit too
>> > > narrowly scoped for the example section of "rev-list" manpage.
>> >
>> > Agreed on both counts. If this gets put into a release, I suspect Taylor
>> > would cover it in a release blog post. That is not quite the same thing
>> > as having it in the documentation, but it may provide more search engine
>> > boost than the list archive. I dunno.
>> 
>> Yeah, this is the perfect sort of thing for those blog posts.
>> 
>> But it makes sense to include some of these examples in our own
>> documentation here, too. git-rev-list(1) doesn't have an EXAMPLES
>> section, but maybe it should.
>
> I think this is the "narrowly scoped" bit from Junio's response above.
> It would be a bit weird to have an examples section for rev-list that
> only mentions this rather obscure feature.

I don't think the lack of an EXAMPLES section or the relative obscurity
of the switch should preclude us from adding useful documentation.

Yes it would feel a bit out of place, but we can always have a
sub-section of EXAMPLES, and we've got to start somewhere.

In this case I don't see why it couldn't be added to OPTIONS, we've got
some very long discussion there already, and as long as there's a clear
separation in prose from an initial brief discussion of the switch and
further prose it won't be confusing for readers, they can just page past
the details.
Junio C Hamano Feb. 11, 2021, 5:57 p.m. UTC | #14
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> I think this is the "narrowly scoped" bit from Junio's response above.
>> It would be a bit weird to have an examples section for rev-list that
>> only mentions this rather obscure feature.
>
> I don't think the lack of an EXAMPLES section or the relative obscurity
> of the switch should preclude us from adding useful documentation.
>
> Yes it would feel a bit out of place, but we can always have a
> sub-section of EXAMPLES, and we've got to start somewhere.
>
> In this case I don't see why it couldn't be added to OPTIONS, we've got
> some very long discussion there already, and as long as there's a clear
> separation in prose from an initial brief discussion of the switch and
> further prose it won't be confusing for readers, they can just page past
> the details.

OK.

In any case, [v2] as we have it (with test number relocation) should
be good as-is, so I'd start preparing to merge it down to 'next'
soonish.

Thanks.