
[07/14] tree-diff: drop path_appendnew() alloc optimization

Message ID 20250109084649.GG2748836@coredump.intra.peff.net
State New
Series combine-diff cleanups

Commit Message

Jeff King Jan. 9, 2025, 8:46 a.m. UTC
When we're diffing trees, we create a list of combine_diff_path structs
that represent changed paths. We allocate each struct and add it to the
list with path_appendnew(), which we then feed to opt->pathchange().
That function tells us whether the path is of interest or not; if not,
then we can throw away the struct we allocated.
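
In sketch form, with this patch applied, that flow is just (condensed
from the emit_path() hunk below; pprev tracks the list tail from before
the append):

  int keep = 1;
  pprev = p;
  p = path_appendnew(p, nparent, base, path, pathlen, mode, oid);
  if (opt->pathchange)
  	keep = opt->pathchange(opt, p);
  if (!keep) {
  	free(p);		/* caller doesn't want it; throw it away */
  	pprev->next = NULL;	/* detach it from the list again */
  	p = pprev;
  }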

So there's an optimization to avoid extra allocations: instead of
throwing away the new entry, we try to reuse it. If it was large enough
to store the next path we care about, we can do so. And if not, we fall
back to freeing and re-allocating a new struct.

This comes from 72441af7c4 (tree-diff: rework diff_tree() to generate
diffs for multiparent cases as well, 2014-04-07), where the goal was to
have even the 2-parent diff code use the combine-diff infrastructure,
but without taking a performance hit.

The implementation causes some complexities in the interface (as we
store the allocation length inside the "next" pointer), and prevents us
from using the regular combine_diff_path_new() constructor. The
complexity is mostly contained inside two functions, but it's worth
re-evaluating how much it's helping.
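
Concretely, the removed logic looks like this (condensed from the
path_appendnew() hunk below), with the allocated size smuggled through
the .next pointer between rounds:

  /* last->next, if non-NULL, is a leftover struct from a rejected path */
  p = last->next;
  if (p && alloclen > (intptr_t)p->next)
  	FREE_AND_NULL(p);	/* leftover buffer is too small to reuse */
  if (!p) {
  	p = xmalloc(alloclen);
  	/* until the next round, .next holds the allocated size */
  	p->next = (struct combine_diff_path *)(intptr_t)alloclen;
  }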

That commit claims it helps ~1% on generating two-parent diffs in
linux.git. Here are the timings I get on the same command today ("old"
is the current tip of master, and "new" has this patch applied):

  Benchmark 1: ./git.old log --raw --no-abbrev --no-renames v3.10..v3.11
    Time (mean ± σ):     532.9 ms ±   5.8 ms    [User: 472.7 ms, System: 59.6 ms]
    Range (min … max):   525.9 ms … 543.3 ms    10 runs

  Benchmark 2: ./git.new log --raw --no-abbrev --no-renames v3.10..v3.11
    Time (mean ± σ):     538.3 ms ±   5.7 ms    [User: 478.0 ms, System: 59.7 ms]
    Range (min … max):   528.5 ms … 545.3 ms    10 runs

  Summary
    ./git.old log --raw --no-abbrev --no-renames v3.10..v3.11 ran
    1.01 ± 0.02 times faster than ./git.new log --raw --no-abbrev --no-renames v3.10..v3.11

So keeping the optimization does leave us on average 1% faster, but with
2% of noise. I tried to
focus more on diff performance by running the commit traversal
separately, like:

  git rev-list v3.10..v3.11 >in

and then timing just the diffs:

  Benchmark 1: ./git.old diff-tree --stdin -r <in
    Time (mean ± σ):     415.7 ms ±   5.8 ms    [User: 357.7 ms, System: 58.0 ms]
    Range (min … max):   410.9 ms … 430.3 ms    10 runs

  Benchmark 2: ./git.new diff-tree --stdin -r <in
    Time (mean ± σ):     418.5 ms ±   2.1 ms    [User: 361.7 ms, System: 56.6 ms]
    Range (min … max):   414.9 ms … 421.3 ms    10 runs

  Summary
    ./git.old diff-tree --stdin -r <in ran
      1.01 ± 0.02 times faster than ./git.new diff-tree --stdin -r <in

That gets roughly the same result.

Adding in "-c" to do multi-parent diffs doesn't change much:

  Benchmark 1: ./git.old diff-tree --stdin -r -c <in
    Time (mean ± σ):     525.3 ms ±   6.6 ms    [User: 470.0 ms, System: 55.1 ms]
    Range (min … max):   508.4 ms … 531.0 ms    10 runs

  Benchmark 2: ./git.new diff-tree --stdin -r -c <in
    Time (mean ± σ):     532.3 ms ±   6.2 ms    [User: 469.0 ms, System: 63.1 ms]
    Range (min … max):   520.3 ms … 539.4 ms    10 runs

  Summary
    ./git.old diff-tree --stdin -r -c <in ran
      1.01 ± 0.02 times faster than ./git.new diff-tree --stdin -r -c <in

And of course if you add in a lot more work by doing actual
content-level diffs, any difference is lost entirely (here the newer
version is actually faster, but that's really just noise):

  Benchmark 1: ./git.old diff-tree --stdin -r --cc <in
    Time (mean ± σ):     11.571 s ±  0.064 s    [User: 11.287 s, System: 0.283 s]
    Range (min … max):   11.497 s … 11.615 s    3 runs

  Benchmark 2: ./git.new diff-tree --stdin -r --cc <in
    Time (mean ± σ):     11.466 s ±  0.109 s    [User: 11.108 s, System: 0.357 s]
    Range (min … max):   11.346 s … 11.560 s    3 runs

  Summary
    ./git.new diff-tree --stdin -r --cc <in ran
      1.01 ± 0.01 times faster than ./git.old diff-tree --stdin -r --cc <in

So my conclusion is that it probably does help a little, but it's mostly
lost in the noise. I could see an argument for keeping it, as the
complexity is hidden away in functions that do not often need to be
touched. But it does make them more confusing than necessary (despite
some detailed explanations from the author of that commit; it just took
me a while to wrap my head around what was going on) and prevents
further refactoring of the combine_diff_path struct. So let's drop it.

Signed-off-by: Jeff King <peff@peff.net>
---
 tree-diff.c | 67 +++++------------------------------------------------
 1 file changed, 6 insertions(+), 61 deletions(-)

Comments

Patrick Steinhardt Jan. 13, 2025, 3:40 p.m. UTC | #1
On Thu, Jan 09, 2025 at 03:46:49AM -0500, Jeff King wrote:
> So my conclusion is that it probably does help a little, but it's mostly
> lost in the noise. I could see an argument for keeping it, as the
> complexity is hidden away in functions that do not often need to be
> touched. But it does make them more confusing than necessary (despite
> some detailed explanations from the author of that commit; it just took
> me a while to wrap my head around what was going on) and prevents
> further refactoring of the combine_diff_path struct. So let's drop it.

A 1% performance speedup does not feel like a good argument for keeping
it to me, so
I'm perfectly fine with dropping the code, even if most of it is
actually in the form of comments. But that already shows that it needs
quite a bit of explanation.

I wonder though: did you also use e.g. Valgrind to compare the number of
allocations? glibc tends to be heavily optimized with regard to small
allocations, so you typically don't notice the performance impact caused
by them even when the number of saved allocations is significant. So the
effect might be more pronounced with other libcs that aren't optimized
for such use cases, e.g. musl libc.

Patrick
Jeff King Jan. 14, 2025, 10:30 a.m. UTC | #2
On Mon, Jan 13, 2025 at 04:40:28PM +0100, Patrick Steinhardt wrote:

> On Thu, Jan 09, 2025 at 03:46:49AM -0500, Jeff King wrote:
> > So my conclusion is that it probably does help a little, but it's mostly
> > lost in the noise. I could see an argument for keeping it, as the
> > complexity is hidden away in functions that do not often need to be
> > touched. But it does make them more confusing than necessary (despite
> > some detailed explanations from the author of that commit; it just took
> > me a while to wrap my head around what was going on) and prevents
> > further refactoring of the combine_diff_path struct. So let's drop it.
> 
> A 1% performance speedup does not feel like a good argument for keeping
> it to me, so
> I'm perfectly fine with dropping the code, even if most of it is
> actually in the form of comments. But that already shows that it needs
> quite a bit of explanation.
> 
> I wonder though: did you also use e.g. Valgrind to compare the number of
> allocations? glibc tends to be heavily optimized with regard to small
> allocations, so you typically don't notice the performance impact caused
> by them even when the number of saved allocations is significant. So the
> effect might be more pronounced with other libcs that aren't optimized
> for such use cases, e.g. musl libc.

I didn't use valgrind, but I did confirm via some hacky printf() calls
that the optimization does kick in. Here's a version with counting:

diff --git a/tree-diff.c b/tree-diff.c
index d9237ffd9b..60db2b2f51 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -154,6 +154,11 @@ static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_
  *
  * p->parent[] remains uninitialized.
  */
+static int hit, total;
+void show_counter(void)
+{
+	warning("%d / %d\n", hit, total);
+}
 static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 	int nparent, const struct strbuf *base, const char *path, int pathlen,
 	unsigned mode, const struct object_id *oid)
@@ -168,6 +173,11 @@ static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 		FREE_AND_NULL(p);
 	}
 
+	if (!total++)
+		atexit(show_counter);
+	if (p)
+		hit++;
+
 	if (!p) {
 		p = xmalloc(alloclen);
 
It seems to kick in about half of the time when running "git log --raw"
on git.git and linux.git. The absolute best case for the optimization is
comparing two trees with all entries of the same size and all of them
changed, so that every discarded buffer is exactly big enough for the
next path, like:

  git init
  blob1=$(echo one | git hash-object -w --stdin)
  blob2=$(echo two | git hash-object -w --stdin)

  mktree() {
    perl -e '
      printf "100644 blob %s\tpath%08d\n", $ARGV[0], $_ for (1..1000000)
    ' $1
  }
  git tag tree1 $(mktree $blob1 | git mktree)
  git tag tree2 $(mktree $blob2 | git mktree)

  git diff-tree tree1 tree2

In that optimal case I see ~3% speedup on glibc. If somebody on a
platform with a different allocator can show a bigger change, that would
definitely be interesting.

I suspect it won't make that big a difference even with a slower
allocator, though, because each changed path involves other allocations
(like creating a diff_filepair).

Running under valgrind with that optimal case, the old code does ~3M
allocations (so 3 per entry). Now we do 4 per entry.

So if we really care about micro-optimizing, I suspect a more productive
path would be getting a better allocator. ;) Here are hyperfine results
for the existing code ("old") versus my series ("new") with the glibc
allocator versus jemalloc:

  Benchmark 1: LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2
    Time (mean ± σ):     625.3 ms ±  13.3 ms    [User: 547.9 ms, System: 77.3 ms]
    Range (min … max):   599.8 ms … 649.9 ms    10 runs
  
  Benchmark 2: LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2
    Time (mean ± σ):     650.8 ms ±  14.5 ms    [User: 568.2 ms, System: 82.5 ms]
    Range (min … max):   632.2 ms … 673.6 ms    10 runs
  
  Benchmark 3: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2
    Time (mean ± σ):     563.9 ms ±   9.2 ms    [User: 538.4 ms, System: 25.3 ms]
    Range (min … max):   545.4 ms … 571.0 ms    10 runs
  
  Benchmark 4: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2
    Time (mean ± σ):     582.9 ms ±  10.8 ms    [User: 545.1 ms, System: 37.7 ms]
    Range (min … max):   568.6 ms … 595.5 ms    10 runs
  
  Summary
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2 ran
      1.03 ± 0.03 times faster than LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2
      1.11 ± 0.03 times faster than LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2
      1.15 ± 0.03 times faster than LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2

So rather than saving 2-3%, a better allocator gives you 10-15% (again,
these are pretty synthetic numbers because this is a pathological test
case). It is still faster to do fewer allocations with jemalloc, but
both the relative and the absolute improvements are smaller.

-Peff

Patch

diff --git a/tree-diff.c b/tree-diff.c
index 24f7b5912c..22fc2d8f8c 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -127,30 +127,6 @@ static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_
 /*
  * Make a new combine_diff_path from path/mode/sha1
  * and append it to paths list tail.
- *
- * Memory for created elements could be reused:
- *
- *	- if last->next == NULL, the memory is allocated;
- *
- *	- if last->next != NULL, it is assumed that p=last->next was returned
- *	  earlier by this function, and p->next was *not* modified.
- *	  The memory is then reused from p.
- *
- * so for clients,
- *
- * - if you do need to keep the element
- *
- *	p = path_appendnew(p, ...);
- *	process(p);
- *	p->next = NULL;
- *
- * - if you don't need to keep the element after processing
- *
- *	pprev = p;
- *	p = path_appendnew(p, ...);
- *	process(p);
- *	p = pprev;
- *	; don't forget to free tail->next in the end
  */
 static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 	int nparent, const struct strbuf *base, const char *path, int pathlen,
@@ -160,22 +136,8 @@ static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 	size_t len = st_add(base->len, pathlen);
 	size_t alloclen = combine_diff_path_size(nparent, len);
 
-	/* if last->next is !NULL - it is a pre-allocated memory, we can reuse */
-	p = last->next;
-	if (p && (alloclen > (intptr_t)p->next)) {
-		FREE_AND_NULL(p);
-	}
-
-	if (!p) {
-		p = xmalloc(alloclen);
-
-		/*
-		 * until we go to it next round, .next holds how many bytes we
-		 * allocated (for faster realloc - we don't need copying old data).
-		 */
-		p->next = (struct combine_diff_path *)(intptr_t)alloclen;
-	}
-
+	p = xmalloc(alloclen);
+	p->next = NULL;
 	last->next = p;
 
 	p->path = (char *)&(p->parent[nparent]);
@@ -279,21 +241,11 @@ static struct combine_diff_path *emit_path(struct combine_diff_path *p,
 		if (opt->pathchange)
 			keep = opt->pathchange(opt, p);
 
-		/*
-		 * If a path was filtered or consumed - we don't need to add it
-		 * to the list and can reuse its memory, leaving it as
-		 * pre-allocated element on the tail.
-		 *
-		 * On the other hand, if path needs to be kept, we need to
-		 * correct its .next to NULL, as it was pre-initialized to how
-		 * much memory was allocated.
-		 *
-		 * see path_appendnew() for details.
-		 */
-		if (!keep)
+		if (!keep) {
+			free(p);
+			pprev->next = NULL;
 			p = pprev;
-		else
-			p->next = NULL;
+		}
 	}
 
 	if (recurse) {
@@ -585,13 +537,6 @@ struct combine_diff_path *diff_tree_paths(
 	struct strbuf *base, struct diff_options *opt)
 {
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt, 0);
-
-	/*
-	 * free pre-allocated last element, if any
-	 * (see path_appendnew() for details about why)
-	 */
-	FREE_AND_NULL(p->next);
-
 	return p;
 }