diff: discard blob data from stat-unmatched pairs
diff mbox series

Message ID 20200601202218.GA2763518@coredump.intra.peff.net
State New
Headers show
Series
  • diff: discard blob data from stat-unmatched pairs
Related show

Commit Message

Jeff King June 1, 2020, 8:22 p.m. UTC
On Mon, Jun 01, 2020 at 12:54:08PM -0400, Jeff King wrote:

> > What could be the reason that the problem only occurs with the `--quiet` flag
> > that I use in my prompt command, but not when using `git diff` without the flag.
> 
> That is curious, and I can reproduce it here.

This turned out to be quite an interesting bug to look at. The fix is
simple, but I tried to summarize my explorations below.

I'm still curious why all of your files were stat-dirty. :) As a
workaround until you have a version of Git with the fix, you can
manually refresh the index with:

  git update-index --refresh

before running git-diff.

-- >8 --
Subject: [PATCH] diff: discard blob data from stat-unmatched pairs

When performing a tree-level diff against the working tree, we may find
that our index stat information is dirty, so we queue a filepair to be
examined later. If the actual content hasn't changed, we call this a
stat-unmatch; the stat information was out of date, but there's no
actual diff.  Normally diffcore_std() would detect and remove these
identical filepairs via diffcore_skip_stat_unmatch().  However, when
"--quiet" is used, we want to stop the diff as soon as we see any
changes, so we check for stat-unmatches immediately in diff_change().

That check may require us to actually load the file contents into the
pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
then no big deal; we'd likely load the contents later anyway to generate
a patch, do rename detection, etc, so we want to hold on to it. But if
it a stat-unmatch, then we have no more use for that data; the whole
point is that we're going discard the pair. However, we never free the
allocated diff_filespec data.

In most cases, keeping that data isn't a problem. We don't expect a lot
of stat-unmatch entries, and since we're using --quiet, we'd quit as
soon as we saw such a real change anyway. However, there are extreme
cases where it makes a big difference:

  1. We'd generally mmap() the working tree half of the pair. And since
     the OS may limit the total number of maps, we can run afoul of this
     in large repositories. E.g.:

       $ cd linux
       $ git ls-files | wc -l
       67959
       $ sysctl vm.max_map_count
       vm.max_map_count = 65530
       $ git ls-files | xargs touch ;# everything is stat-dirty!
       $ git diff --quiet
       fatal: mmap failed: Cannot allocate memory

     It should be unusual to have so many files stat-dirty, but it's
     possible if you've just run a script like "sed -i" or similar.

     After this patch, the above correctly exits with code 0.

  2. Even if you don't hit mmap limits, the index half of the pair will
     have been pulled from the object database into heap memory. Again
     in a clone of linux.git, running:

       $ git ls-files | head -n 10000 | xargs touch
       $ git diff --quiet

     peaks at 145MB heap before this patch, and 94MB after.

This patch solves the problem by freeing any diff_filespec data we
picked up during the "--quiet" stat-unmatch check in diff_changes.
Nobody is going to need that data later, so there's no point holding on
to it. There are a few things to note:

  - we could skip queueing the pair entirely, which could in theory save
    a little work. But there's not much to save, as we need a
    diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
    And since we cache the result of the stat-unmatch checks, a later
    call to diffcore_skip_stat_unmatch() call will quickly skip over
    them. The diffcore code also counts up the number of stat-unmatched
    pairs as it removes them. It's doubtful any callers would care about
    that in combination with --quiet, but we'd have to reimplement the
    logic here to be on the safe side. So it's not really worth the
    trouble.

  - I didn't write a test, because we always produce the correct output
    unless we run up against system mmap limits, which are both
    unportable and expensive to test against. Measuring peak heap
    would be interesting, but our perf suite isn't yet capable of that.

  - note that diff without "--quiet" does not suffer from the same
    problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch
    entries and drop them immediately, so we're not carrying their data
    around.

  - you _can_ still trigger the mmap limit problem if you truly have
    that many files with actual changes. But it's rather unlikely. The
    stat-unmatch check avoids loading the file contents if the sizes
    don't match, so you'd need a pretty trivial change in every single
    file. Likewise, inexact rename detection might load the data for
    many files all at once. But you'd need not just 64k changes, but
    that many deletions and additions. The most likely candidate is
    perhaps break-detection, which would load the data for all pairs and
    keep it around for the content-level diff. But again, you'd need 64k
    actually changed files in the first place.

    So it's still possible to trigger this case, but it seems like "I
    accidentally made all my files stat-dirty" is the most likely case
    in the real world.

Reported-by: Jan Christoph Uhde <Jan@UhdeJc.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 diff.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Eric Sunshine June 1, 2020, 8:53 p.m. UTC | #1
On Mon, Jun 1, 2020 at 4:22 PM Jeff King <peff@peff.net> wrote:
> Subject: [PATCH] diff: discard blob data from stat-unmatched pairs
>
> When performing a tree-level diff against the working tree, we may find
> that our index stat information is dirty, so we queue a filepair to be
> examined later. If the actual content hasn't changed, we call this a
> stat-unmatch; the stat information was out of date, but there's no
> actual diff.  Normally diffcore_std() would detect and remove these
> identical filepairs via diffcore_skip_stat_unmatch().  However, when
> "--quiet" is used, we want to stop the diff as soon as we see any
> changes, so we check for stat-unmatches immediately in diff_change().
>
> That check may require us to actually load the file contents into the
> pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
> then no big deal; we'd likely load the contents later anyway to generate
> a patch, do rename detection, etc, so we want to hold on to it. But if
> it a stat-unmatch, then we have no more use for that data; the whole

s/it/& is/

> point is that we're going discard the pair. However, we never free the
> allocated diff_filespec data.
>
> In most cases, keeping that data isn't a problem. We don't expect a lot
> of stat-unmatch entries, and since we're using --quiet, we'd quit as
> soon as we saw such a real change anyway. However, there are extreme
> cases where it makes a big difference:
>
>   1. We'd generally mmap() the working tree half of the pair. And since
>      the OS may limit the total number of maps, we can run afoul of this
>      in large repositories. E.g.:
>
>        $ cd linux
>        $ git ls-files | wc -l
>        67959
>        $ sysctl vm.max_map_count
>        vm.max_map_count = 65530
>        $ git ls-files | xargs touch ;# everything is stat-dirty!
>        $ git diff --quiet
>        fatal: mmap failed: Cannot allocate memory
>
>      It should be unusual to have so many files stat-dirty, but it's
>      possible if you've just run a script like "sed -i" or similar.
>
>      After this patch, the above correctly exits with code 0.
>
>   2. Even if you don't hit mmap limits, the index half of the pair will
>      have been pulled from the object database into heap memory. Again
>      in a clone of linux.git, running:
>
>        $ git ls-files | head -n 10000 | xargs touch
>        $ git diff --quiet
>
>      peaks at 145MB heap before this patch, and 94MB after.
>
> This patch solves the problem by freeing any diff_filespec data we
> picked up during the "--quiet" stat-unmatch check in diff_changes.
> Nobody is going to need that data later, so there's no point holding on
> to it. There are a few things to note:
>
>   - we could skip queueing the pair entirely, which could in theory save
>     a little work. But there's not much to save, as we need a
>     diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
>     And since we cache the result of the stat-unmatch checks, a later
>     call to diffcore_skip_stat_unmatch() call will quickly skip over
>     them. The diffcore code also counts up the number of stat-unmatched
>     pairs as it removes them. It's doubtful any callers would care about
>     that in combination with --quiet, but we'd have to reimplement the
>     logic here to be on the safe side. So it's not really worth the
>     trouble.
>
>   - I didn't write a test, because we always produce the correct output
>     unless we run up against system mmap limits, which are both
>     unportable and expensive to test against. Measuring peak heap
>     would be interesting, but our perf suite isn't yet capable of that.
>
>   - note that diff without "--quiet" does not suffer from the same
>     problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch
>     entries and drop them immediately, so we're not carrying their data
>     around.
>
>   - you _can_ still trigger the mmap limit problem if you truly have
>     that many files with actual changes. But it's rather unlikely. The
>     stat-unmatch check avoids loading the file contents if the sizes
>     don't match, so you'd need a pretty trivial change in every single
>     file. Likewise, inexact rename detection might load the data for
>     many files all at once. But you'd need not just 64k changes, but
>     that many deletions and additions. The most likely candidate is
>     perhaps break-detection, which would load the data for all pairs and
>     keep it around for the content-level diff. But again, you'd need 64k
>     actually changed files in the first place.
>
>     So it's still possible to trigger this case, but it seems like "I
>     accidentally made all my files stat-dirty" is the most likely case
>     in the real world.
>
> Reported-by: Jan Christoph Uhde <Jan@UhdeJc.com>
> Signed-off-by: Jeff King <peff@peff.net>
Junio C Hamano June 2, 2020, 4:49 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

> Subject: [PATCH] diff: discard blob data from stat-unmatched pairs
>
> When performing a tree-level diff against the working tree, we may find
> that our index stat information is dirty, so we queue a filepair to be
> examined later. If the actual content hasn't changed, we call this a
> stat-unmatch; the stat information was out of date, but there's no
> actual diff.  Normally diffcore_std() would detect and remove these
> identical filepairs via diffcore_skip_stat_unmatch().  However, when
> "--quiet" is used, we want to stop the diff as soon as we see any
> changes, so we check for stat-unmatches immediately in diff_change().
>
> That check may require us to actually load the file contents into the
> pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
> then no big deal; we'd likely load the contents later anyway to generate
> a patch, do rename detection, etc, so we want to hold on to it. But if
> it a stat-unmatch, then we have no more use for that data; the whole
> point is that we're going discard the pair. However, we never free the
> allocated diff_filespec data.

Nicely spotted.  So we can discard when quiet is in effect after
this check, which makes sense.

After reading the initial analysis, I wondered if the fix we did in
f34b205f (diff: do not quit early on stat-dirty files, 2014-01-25)
was suboptimal and we should have instead done the "if QUICK, check
if the pair is merely stat-unmatch" in the loop(s) that call
diff_change(), hoping that it may have given us a better control
over the lifetime of the filespecs in each diff_filepair, but I do
not think it would made much difference.

>  	if (options->flags.quick && options->skip_stat_unmatch &&
> -	    !diff_filespec_check_stat_unmatch(options->repo, p))
> +	    !diff_filespec_check_stat_unmatch(options->repo, p)) {
> +		diff_free_filespec_data(p->one);
> +		diff_free_filespec_data(p->two);
>  		return;
> +	}

Thanks.  Will queue (with that s/it/& is/ typofix mentioned
elsewhere).

Patch
diff mbox series

diff --git a/diff.c b/diff.c
index d1ad6a3c4a..4d46fab5d3 100644
--- a/diff.c
+++ b/diff.c
@@ -6758,8 +6758,11 @@  void diff_change(struct diff_options *options,
 		return;
 
 	if (options->flags.quick && options->skip_stat_unmatch &&
-	    !diff_filespec_check_stat_unmatch(options->repo, p))
+	    !diff_filespec_check_stat_unmatch(options->repo, p)) {
+		diff_free_filespec_data(p->one);
+		diff_free_filespec_data(p->two);
 		return;
+	}
 
 	options->flags.has_changes = 1;
 }