[5/6] dir: replace exponential algorithm with a linear one

Message ID 40b378e7adbbff5ecfd95fd888465fd0f99791c8.1580335424.git.gitgitgadget@gmail.com
State New
Series
  • Avoid multiple recursive calls for same path in read_directory_recursive()

Commit Message

Elijah Newren via GitGitGadget Jan. 29, 2020, 10:03 p.m. UTC
From: Elijah Newren <newren@gmail.com>

dir's read_directory_recursive() naturally operates recursively in order
to walk the directory tree.  The treatment of directories is sometimes
weird because there are so many different permutations of how to handle
them.  Some examples:

   * 'git ls-files -o --directory' only needs to know that a directory
     itself is untracked; it doesn't need to recurse into it to see what
     is underneath.

   * 'git status' needs to recurse into an untracked directory, but only
     to determine whether or not it is empty.  If there are no files
     underneath, the directory itself will be omitted from the output.
     If it is not empty, only the directory will be listed.

   * 'git status --ignored' needs to recurse into untracked directories
     and report all the ignored entries and then report the directory as
     untracked -- UNLESS all the entries under the directory are
     ignored, in which case we don't print any of the entries under the
     directory and just report the directory itself as ignored.

   * For 'git clean', we may need to recurse into a directory that
     doesn't match any specified pathspecs, if it's possible that there
     is an entry underneath the directory that can match one of the
     pathspecs.  In such a case, we need to be careful to omit the
     directory itself from the list of paths (see e.g. commit
     404ebceda01c ("dir: also check directories for matching pathspecs",
     2019-09-17))

Part of the tension noted above is that the treatment of a directory can
change based on the files within it, and based on the various settings
in dir->flags.  Trying to keep this in mind while reading over the code,
it is easy to (accidentally?) think in terms of "treat_directory() tells
us what to do with a directory, and read_directory_recursive() is the
thing that recurses".  Since we need to look into a directory to know
how to treat it, though, it was quite easy to decide to recurse into the
directory from treat_directory() by adding a read_directory_recursive()
call.  Adding such a call would actually be fine, IF it didn't also
cause read_directory_recursive() to recurse into the same directory
again.

Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs
for ignored files", 2017-05-18), added exactly such a case to the code,
meaning we'd have two calls to read_directory_recursive() for an
untracked directory.  So, if we had a file named
   one/two/three/four/five/somefile.txt
and nothing in one/ was tracked, then 'git status --ignored' would
call read_directory_recursive() twice on the directory 'one/', and
each of those would call read_directory_recursive() twice on the
directory 'one/two/', and so on until read_directory_recursive() was
called 2^5 times for 'one/two/three/four/five/'.

Avoid calling read_directory_recursive() twice per level by moving a
lot of the special logic into treat_directory().

Since dir.c is somewhat complex, extra cruft built up around this over
time.  While trying to unravel it, I noticed several instances where the
first call to read_directory_recursive() would return e.g.
path_untracked for some directory and a later one would return e.g.
path_none, and the code relied on the side-effect of the first adding
untracked entries to dir->entries in order to get the correct output
despite the supposed override in return value by the later call.

I am somewhat concerned that there are still bugs and maybe even
testcases with the wrong expectation.  I have tried to carefully
document treat_directory() since it becomes more complex after this
change (though much of this complexity came from elsewhere that probably
deserved better comments to begin with).  However, much of my work felt
more like a game of whack-a-mole while attempting to make the code match
the existing regression tests than an attempt to create an
implementation that matched some clear design.  That seems wrong to me,
but the rules of existing behavior had so many special cases that I had
a hard time coming up with some overarching rules about what correct
behavior is for all cases, forcing me to hope that the regression tests
are correct and sufficient.  (I'll note that this turmoil makes working
with dir.c extremely unpleasant for me; I keep hoping it'll get better,
but it never seems to.)

However, on the positive side, it does make the code much faster.  For
the following simple shell loop in an empty repository:

  for depth in $(seq 10 25)
  do
    dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
    rm -rf dir
    mkdir -p $dirs
    >$dirs/untracked-file
    /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
  done

I saw the following timings, in seconds (note that the numbers are a
little noisy from run-to-run, but the trend is very clear with every
run):

    10: 0.03
    11: 0.05
    12: 0.08
    13: 0.19
    14: 0.29
    15: 0.50
    16: 1.05
    17: 2.11
    18: 4.11
    19: 8.60
    20: 17.55
    21: 33.87
    22: 68.71
    23: 140.05
    24: 274.45
    25: 551.15

After this fix, those drop to:

    10: 0.00
    11: 0.00
    12: 0.00
    13: 0.00
    14: 0.00
    15: 0.00
    16: 0.00
    17: 0.00
    18: 0.00
    19: 0.00
    20: 0.00
    21: 0.00
    22: 0.00
    23: 0.00
    24: 0.00
    25: 0.00

In fact, it isn't until a depth of 190 nested directories that it
sometimes starts reporting a time of 0.01 seconds and doesn't
consistently report 0.01 seconds until there are 240 nested directories.
The previous code would have taken
  17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
to have completed the 240 nested directories case.  It's not often
that you get to speed something up by a factor of 3*10^69.

WARNING: This change breaks t7063.  I don't know whether that is to be expected
(I now intentionally visit untracked directories differently so naturally the
untracked cache should change), or if I've broken something.  I'm hoping to get
an untracked cache expert to chime in...

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 dir.c | 151 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 105 insertions(+), 46 deletions(-)

Comments

Derrick Stolee Jan. 30, 2020, 3:55 p.m. UTC | #1
I am very enticed by the subject!

On 1/29/2020 5:03 PM, Elijah Newren via GitGitGadget wrote:
> Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs
> for ignored files", 2017-05-18), added exactly such a case to the code,

I was disappointed that the commit you mention did not add a test for
the new behavior, but then found a test change in the following commit
fb89888849 (dir: hide untracked contents of untracked dirs, 2017-05-18).
This makes me feel better that your changes are less likely to un-do
the intention of df5bcdf83aeb.

> meaning we'd have two calls to read_directory_recursive() for an
> untracked directory.  So, if we had a file named
>    one/two/three/four/five/somefile.txt
> and nothing in one/ was tracked, then 'git status --ignored' would
> call read_directory_recursive() twice on the directory 'one/', and
> each of those would call read_directory_recursive() twice on the
> directory 'one/two/', and so on until read_directory_recursive() was
> called 2^5 times for 'one/two/three/four/five/'.

Wow! Good find. "Accidentally exponential" is a lot worse than
"accidentally quadratic". At least the N here _usually_ does not
grow too quickly, but the constant here (lstat-ing directories and
files) is significant enough that 2^3 or 2^4 is enough to notice
the difference.

> Avoid calling read_directory_recursive() twice per level by moving a
> lot of the special logic into treat_directory().
> 
> Since dir.c is somewhat complex, extra cruft built up around this over
> time.  While trying to unravel it, I noticed several instances where the
> first call to read_directory_recursive() would return e.g.
> path_untracked for a some directory and a later one would return e.g.
> path_none, and the code relied on the side-effect of the first adding
> untracked entries to dir->entries in order to get the correct output
> despite the supposed override in return value by the later call.
>
> I am somewhat concerned that there are still bugs and maybe even
> testcases with the wrong expectation.  I have tried to carefully
> document treat_directory() since it becomes more complex after this
> change (though much of this complexity came from elsewhere that probably
> deserved better comments to begin with).  However, much of my work felt
> more like a game of whackamole while attempting to make the code match
> the existing regression tests than an attempt to create an
> implementation that matched some clear design.  That seems wrong to me,
> but the rules of existing behavior had so many special cases that I had
> a hard time coming up with some overarching rules about what correct
> behavior is for all cases, forcing me to hope that the regression tests
> are correct and sufficient.  (I'll note that this turmoil makes working
> with dir.c extremely unpleasant for me; I keep hoping it'll get better,
> but it never seems to.)

Keep fighting the good fight! It appears that some of our most-important
code has these complicated cases and side-effects because it has grown
so organically over time. It's unlikely that someone _could_ rewrite it
to avoid that pain, as dir.c contains a lot of accumulated knowledge from
the many special-cases Git handles. I suppose the only thing we can do
is try to write as many detailed tests as possible.

> However, on the positive side, it does make the code much faster.  For
> the following simple shell loop in an empty repository:
> 
>   for depth in $(seq 10 25)
>   do
>     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
>     rm -rf dir
>     mkdir -p $dirs
>     >$dirs/untracked-file
>     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
>   done
> 
> I saw the following timings, in seconds (note that the numbers are a
> little noisy from run-to-run, but the trend is very clear with every
> run):
> 
>     10: 0.03
>     11: 0.05
>     12: 0.08
>     13: 0.19
>     14: 0.29
>     15: 0.50
>     16: 1.05
>     17: 2.11
>     18: 4.11
>     19: 8.60
>     20: 17.55
>     21: 33.87
>     22: 68.71
>     23: 140.05
>     24: 274.45
>     25: 551.15

Are these timings on Linux? I imagine that the timings will increase
much more quickly on Windows.

> After this fix, those drop to:
> 
>     10: 0.00
...
>     25: 0.00

Nice. I wonder if presenting these 0.00 values as a table is worth
the space? At least the effect is dramatic.

> In fact, it isn't until a depth of 190 nested directories that it
> sometimes starts reporting a time of 0.01 seconds and doesn't
> consistently report 0.01 seconds until there are 240 nested directories.
> The previous code would have taken
>   17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
> to have completed the 240 nested directories case.  It's not often
> that you get to speed something up by a factor of 3*10^69.

Awesome.

> WARNING: This change breaks t7063.  I don't know whether that is to be expected
> (I now intentionally visit untracked directories differently so naturally the
> untracked cache should change), or if I've broken something.  I'm hoping to get
> an untracked cache expert to chime in...

I suppose that when the untracked cache is enabled, your expectation that we
do not need to recurse into an untracked directory is incorrect: we actually
want to explore that directory. Is there a mode we can check to see if we
are REALLY REALLY collecting _all_ untracked paths? Perhaps we need to create
one?

I'm CC'ing Kevin Willford because he is more familiar with the Git index
than me, and perhaps the untracked cache in particular.

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  dir.c | 151 ++++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 105 insertions(+), 46 deletions(-)
> 
> diff --git a/dir.c b/dir.c
> index ef3307718a..aaf038a9c4 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1659,7 +1659,13 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
>  	const char *dirname, int len, int baselen, int excluded,
>  	const struct pathspec *pathspec)
>  {
> -	int nested_repo;
> +	/*
> +	 * WARNING: From this function, you can return path_recurse or you
> +	 *          can call read_directory_recursive() (or neither), but
> +	 *          you CAN'T DO BOTH.
> +	 */
> +	enum path_treatment state;
> +	int nested_repo, old_ignored_nr, stop_early;
>  
>  	/* The "len-1" is to strip the final '/' */
>  	switch (directory_exists_in_index(istate, dirname, len-1)) {
> @@ -1713,18 +1719,101 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
>  
>  	/* This is the "show_other_directories" case */
>  
> -	if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
> +	/*
> +	 * We only need to recurse into untracked/ignored directories if
> +	 * either of the following bits is set:
> +	 *   - DIR_SHOW_IGNORED_TOO (because then we need to determine if
> +	 *                           there are ignored directories below)
> +	 *   - DIR_HIDE_EMPTY_DIRECTORIES (because we have to determine if
> +	 *                                 the directory is empty)

Perhaps here is where you could also have a DIR_LIST_ALL_UNTRACKED
flag to ensure the untracked cache loads all untracked paths?

> +	 */
> +	if (!(dir->flags & (DIR_SHOW_IGNORED_TOO | DIR_HIDE_EMPTY_DIRECTORIES)))
>  		return excluded ? path_excluded : path_untracked;
>  
> +	/*
> +	 * If we only want to determine if dirname is empty, then we can
> +	 * stop at the first file we find underneath that directory rather
> +	 * than continuing to recurse beyond it.  If DIR_SHOW_IGNORED_TOO
> +	 * is set, then we want MORE than just determining if dirname is
> +	 * empty.
> +	 */
> +	stop_early = ((dir->flags & DIR_HIDE_EMPTY_DIRECTORIES) &&
> +		      !(dir->flags & DIR_SHOW_IGNORED_TOO));
> +
> +	/*
> +	 * If /every/ file within an untracked directory is ignored, then
> +	 * we want to treat the directory as ignored (for e.g. status
> +	 * --porcelain), without listing the individual ignored files
> +	 * underneath.  To do so, we'll save the current ignored_nr, and
> +	 * pop all the ones added after it if it turns out the entire
> +	 * directory is ignored.

Here is a question for an untracked cache expert: Do we store ignored
paths in the untracked cache?

> +	 */
> +	old_ignored_nr = dir->ignored_nr;
> +
> +	/* Actually recurse into dirname now, we'll fixup the state later. */
>  	untracked = lookup_untracked(dir->untracked, untracked,
>  				     dirname + baselen, len - baselen);
> +	state = read_directory_recursive(dir, istate, dirname, len, untracked,
> +					 stop_early, stop_early, pathspec);
> +
> +	/* There are a variety of reasons we may need to fixup the state... */
> +	if (state == path_excluded) {
> +		int i;
> +
> +		/*
> +		 * When stop_early is set, read_directory_recursive() will
> +		 * never return path_untracked regardless of whether
> +		 * underlying paths were untracked or ignored (because
> +		 * returning early means it excluded some paths, or
> +		 * something like that -- see commit 5aaa7fd39aaf ("Improve
> +		 * performance of git status --ignored", 2017-09-18)).
> +		 * However, we're not really concerned with the status of
> +		 * files under the directory, we just wanted to know
> +		 * whether the directory was empty (state == path_none) or
> +		 * not (state == path_excluded), and if not, we'd return
> +		 * our original status based on whether the untracked
> +		 * directory matched an exclusion pattern.
> +		 */
> +		if (stop_early)
> +			state = excluded ? path_excluded : path_untracked;
> +
> +		else {
> +			/*
> +			 * When
> +			 *     !stop_early && state == path_excluded
> +			 * then all paths under dirname were ignored.  For
> +			 * this case, git status --porcelain wants to just
> +			 * list the directory itself as ignored and not
> +			 * list the individual paths underneath.  Remove
> +			 * the individual paths underneath.
> +			 */
> +			for (i = old_ignored_nr + 1; i<dir->ignored_nr; ++i)
> +				free(dir->ignored[i]);
> +			dir->ignored_nr = old_ignored_nr;
> +		}
> +	}
>  
>  	/*
> -	 * If this is an excluded directory, then we only need to check if
> -	 * the directory contains any files.
> +	 * If there is nothing under the current directory and we are not
> +	 * hiding empty directories, then we need to report on the
> +	 * untracked or ignored status of the directory itself.
>  	 */
> -	return read_directory_recursive(dir, istate, dirname, len,
> -					untracked, 1, excluded, pathspec);
> +	if (state == path_none && !(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
> +		state = excluded ? path_excluded : path_untracked;
> +
> +	/*
> +	 * We can recurse into untracked directories that don't match any
> +	 * of the given pathspecs when some file underneath the directory
> +	 * might match one of the pathspecs.  If so, we should make sure
> +	 * to note that the directory itself did not match.
> +	 */
> +	if (pathspec &&
> +	    !match_pathspec(istate, pathspec, dirname, len,
> +			    0 /* prefix */, NULL,
> +			    0 /* do NOT special case dirs */))
> +		state = path_none;
> +
> +	return state;
>  }

This is certainly a substantial change, and I'm not able to read it
carefully right now. I hope to return to it soon, but hopefully I've
pointed out some places that may lead you to resolve your untracked
cache issues.

Thanks,
-Stolee
Elijah Newren Jan. 30, 2020, 5:13 p.m. UTC | #2
On Thu, Jan 30, 2020 at 7:55 AM Derrick Stolee <stolee@gmail.com> wrote:
> > However, on the positive side, it does make the code much faster.  For
> > the following simple shell loop in an empty repository:
> >
> >   for depth in $(seq 10 25)
> >   do
> >     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
> >     rm -rf dir
> >     mkdir -p $dirs
> >     >$dirs/untracked-file
> >     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
> >   done
> >
> > I saw the following timings, in seconds (note that the numbers are a
> > little noisy from run-to-run, but the trend is very clear with every
> > run):
> >
> >     10: 0.03
> >     11: 0.05
> >     12: 0.08
> >     13: 0.19
> >     14: 0.29
> >     15: 0.50
> >     16: 1.05
> >     17: 2.11
> >     18: 4.11
> >     19: 8.60
> >     20: 17.55
> >     21: 33.87
> >     22: 68.71
> >     23: 140.05
> >     24: 274.45
> >     25: 551.15
>
> Are these timings on Linux? I imagine that the timings will increase
> much more quickly on Windows.

Yes, on Linux, with an SSD for the hard drive in this case (though I
suspect OS caching of the directories would probably eliminate any
differences between an SSD and a spinny disk since the same
directories are visited so many times).

> > After this fix, those drop to:
> >
> >     10: 0.00
> ...
> >     25: 0.00
>
> Nice. I wonder if presenting these 0.00 values as a table is worth
> the space? At least the effect is dramatic.

I first considered a table, but then noted it didn't match the code
snippet I provided and was worried I'd have to spend more time
explaining how I post-processed the output from two runs than we'd
gain from compressing the number of lines of the commit message.
Assuming reader time was more valuable, I opted to just keep the two
snippets of output.

> > WARNING: This change breaks t7063.  I don't know whether that is to be expected
> > (I now intentionally visit untracked directories differently so naturally the
> > untracked cache should change), or if I've broken something.  I'm hoping to get
> > an untracked cache expert to chime in...
>
> I suppose that when the untracked cache is enabled, your expectation that we
> do not need to recurse into an untracked directory is incorrect: we actually
> want to explore that directory. Is there a mode we can check to see if we
> are REALLY REALLY collecting _all_ untracked paths? Perhaps we need to create
> one?

I don't think I made any significant changes about using the untracked
cache versus traversing; the primary differences should be that I
traverse each directory once instead of 2^N times.  However, the
previous code would traverse with both check_only=0 and check_only=1,
and to avoid the whole 2^N thing I only traverse once.  That
fundamentally means I can no longer traverse with both settings of
that flag.

The output in t7063 seems to suggest to me that the check_only flag
matters to what the untracked-cache stores ("check_only" literally
appears as part of the expected output), and the output also suggests
that the untracked-cache is recording when entries are visited
multiple times somehow.  Or maybe I'm just totally misunderstanding
the expected output in t7063.  I really have no clue about that stuff.

> I'm CC'ing Kevin Willford because he is more familiar with the Git index
> than me, and perhaps the untracked cache in particular.

Getting another set of eyes, even if they only know enough to provide
hunches or guesses, would be very welcome.

> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  dir.c | 151 ++++++++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 105 insertions(+), 46 deletions(-)
> >
> > diff --git a/dir.c b/dir.c
> > index ef3307718a..aaf038a9c4 100644
> > --- a/dir.c
> > +++ b/dir.c
> > @@ -1659,7 +1659,13 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
> >       const char *dirname, int len, int baselen, int excluded,
> >       const struct pathspec *pathspec)
> >  {
> > -     int nested_repo;
> > +     /*
> > +      * WARNING: From this function, you can return path_recurse or you
> > +      *          can call read_directory_recursive() (or neither), but
> > +      *          you CAN'T DO BOTH.
> > +      */
> > +     enum path_treatment state;
> > +     int nested_repo, old_ignored_nr, stop_early;
> >
> >       /* The "len-1" is to strip the final '/' */
> >       switch (directory_exists_in_index(istate, dirname, len-1)) {
> > @@ -1713,18 +1719,101 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
> >
> >       /* This is the "show_other_directories" case */
> >
> > -     if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
> > +     /*
> > +      * We only need to recurse into untracked/ignored directories if
> > +      * either of the following bits is set:
> > +      *   - DIR_SHOW_IGNORED_TOO (because then we need to determine if
> > +      *                           there are ignored directories below)
> > +      *   - DIR_HIDE_EMPTY_DIRECTORIES (because we have to determine if
> > +      *                                 the directory is empty)
>
> Perhaps here is where you could also have a DIR_LIST_ALL_UNTRACKED
> flag to ensure the untracked cache loads all untracked paths?

Do you mean DIR_KEEP_UNTRACKED_CONTENTS (which is documented in dir.h
as only having meaning when DIR_SHOW_IGNORED_TOO is also set, and thus
caused me to not list it separately)?

Speaking of DIR_KEEP_UNTRACKED_CONTENTS, though, its handling as a
post-processing step in read_directory() is now inconsistent with how
we handle squashing a directory full of ignores into just marking the
containing directory as ignored.  I think I should move the
read_directory() logic for DIR_KEEP_UNTRACKED_CONTENTS to
treat_directory() and use another counter similar to old_ignored_nr.
It should be more efficient that way, too.

>
> > +      */
> > +     if (!(dir->flags & (DIR_SHOW_IGNORED_TOO | DIR_HIDE_EMPTY_DIRECTORIES)))
> >               return excluded ? path_excluded : path_untracked;
> >
> > +     /*
> > +      * If we only want to determine if dirname is empty, then we can
> > +      * stop at the first file we find underneath that directory rather
> > +      * than continuing to recurse beyond it.  If DIR_SHOW_IGNORED_TOO
> > +      * is set, then we want MORE than just determining if dirname is
> > +      * empty.
> > +      */
> > +     stop_early = ((dir->flags & DIR_HIDE_EMPTY_DIRECTORIES) &&
> > +                   !(dir->flags & DIR_SHOW_IGNORED_TOO));
> > +
> > +     /*
> > +      * If /every/ file within an untracked directory is ignored, then
> > +      * we want to treat the directory as ignored (for e.g. status
> > +      * --porcelain), without listing the individual ignored files
> > +      * underneath.  To do so, we'll save the current ignored_nr, and
> > +      * pop all the ones added after it if it turns out the entire
> > +      * directory is ignored.
>
> Here is a question for an untracked cache expert: Do we store ignored
> paths in the untracked cache?

According to 0dcb8d7fe0ec ("untracked cache: record .gitignore
information and dir hierarchy", 2015-03-08), no:

    This cached output is about untracked files only, not ignored files
    because the number of tracked files is usually small, so small cache
    overhead, while the number of ignored files could go really high
    (e.g. *.o files mixing with source code).

...unless, of course, someone came along later and changed the design goals.

[...]

> This is certainly a substantial change, and I'm not able to read it
> carefully right now. I hope to return to it soon, but hopefully I've
> pointed out some places that may lead you to resolve your untracked
> cache issues.

Yeah, it's pretty hard to reason about; personally I needed lots of
dumps of state during traversals just to partially make sense of it.

I had dumps of output from both before and after my changes printing
out return values of treat_directory() and paths and a bunch of other
stuff and was doing lots of comparisons (and repeatedly did this for
many, many different testcases with different toplevel git commands).
It was particularly annoying that the old stuff would traverse
everything 2^N times, half the time with check_only on and half the
time with it off.  It would return different state values for the same
path from different calls, often depending on the side effects of
dir.entries having had more entries added by the first recursion to
get the right output, despite the fact that the "wrong" state was
returned by treat_directory() for later visits to the same path (e.g.
path_untracked returned for the first time it was visited, then
path_none later, and it was a case where path_untracked was correct in
my view).

Despite those difficulties, having an extra set of eyes try to reason
about it and pointing out anything that looks amiss or even that just
looks hard to understand would be very welcome.
Elijah Newren Jan. 30, 2020, 5:45 p.m. UTC | #3
On Thu, Jan 30, 2020 at 9:13 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 7:55 AM Derrick Stolee <stolee@gmail.com> wrote:
[...]
> > > @@ -1713,18 +1719,101 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
> > >
> > >       /* This is the "show_other_directories" case */
> > >
> > > -     if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
> > > +     /*
> > > +      * We only need to recurse into untracked/ignored directories if
> > > +      * either of the following bits is set:
> > > +      *   - DIR_SHOW_IGNORED_TOO (because then we need to determine if
> > > +      *                           there are ignored directories below)
> > > +      *   - DIR_HIDE_EMPTY_DIRECTORIES (because we have to determine if
> > > +      *                                 the directory is empty)
> >
> > Perhaps here is where you could also have a DIR_LIST_ALL_UNTRACKED
> > flag to ensure the untracked cache loads all untracked paths?
>
> Do you mean DIR_KEEP_UNTRACKED_CONTENTS (which is documented in dir.h
> as only having meaning when DIR_SHOW_IGNORED_TOO is also set, and thus
> caused me to not list it separately)?
>
> Speaking of DIR_KEEP_UNTRACKED_CONTENTS, though, its handling as a
> post-processing step in read_directory() is now inconsistent with how
> we handle squashing a directory full of ignores into just marking the
> containing directory as ignored.  I think I should move the
> read_directory() logic for DIR_KEEP_UNTRACKED_CONTENTS to
> treat_directory() and use another counter similar to old_ignored_nr.
> It should be more efficient that way, too.

Oh, actually, I think I understand what you're getting at so let me
clear it up.  With DIR_SHOW_IGNORED_TOO, we always recurse to the
bottom, because it's needed to find any files that might be ignored.
(Maybe we could do something clever with checking .gitignore entries
and seeing if it's impossible for them to match anything below the
current directory, but the code doesn't do anything that clever.)  As
a side effect, we'll get all untracked files whenever that flag is
set.  As such, the only question is whether we want to keep all those
extra untracked files that we found or not, which is the purpose of
DIR_KEEP_UNTRACKED_CONTENTS.  Without DIR_SHOW_IGNORED_TOO, there's no
need or want to visit all untracked files without also learning of all
ignored files (and, in fact, git-clean is currently the only one that
wants to know about all untracked files).

As far as a simple test goes, in a simple repository with a file named
   one/two/three/four/five/untracked-file
and with nothing else under one/:

Before my changes:
    $ strace -e trace=file git status --ignored 2>&1 |
          grep 'open("one/' | grep -v gitignore.*ENOENT | wc -l
    62
Note that 62 == 2^5 + 2^4 + 2^3 + 2^2 + 2^1, showing how many
directories we open and read.

After my changes:
    $ strace -e trace=file git status --ignored 2>&1 |
          grep 'open("one/' | grep -v gitignore.*ENOENT | wc -l
    5
showing that it does open and read each directory, but does so only once.
SZEDER Gábor Jan. 31, 2020, 5:13 p.m. UTC | #4
On Wed, Jan 29, 2020 at 10:03:42PM +0000, Elijah Newren via GitGitGadget wrote:
> Part of the tension noted above is that the treatment of a directory can
> changed based on the files within it, and based on the various settings

s/changed/change/, or perhaps s/changed/be changed/ ?

> Since dir.c is somewhat complex, extra cruft built up around this over
> time.  While trying to unravel it, I noticed several instances where the
> first call to read_directory_recursive() would return e.g.
> path_untracked for a some directory and a later one would return e.g.

s/for a some/for some/

> However, on the positive side, it does make the code much faster.  For
> the following simple shell loop in an empty repository:
> 
>   for depth in $(seq 10 25)
>   do
>     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
>     rm -rf dir
>     mkdir -p $dirs
>     >$dirs/untracked-file
>     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
>   done
> 
> I saw the following timings, in seconds (note that the numbers are a
> little noisy from run-to-run, but the trend is very clear with every
> run):
> 
>     10: 0.03
>     11: 0.05
>     12: 0.08
>     13: 0.19
>     14: 0.29
>     15: 0.50
>     16: 1.05
>     17: 2.11
>     18: 4.11
>     19: 8.60
>     20: 17.55
>     21: 33.87
>     22: 68.71
>     23: 140.05
>     24: 274.45
>     25: 551.15
> 
> After this fix, those drop to:
> 
>     10: 0.00
>     11: 0.00
>     12: 0.00
>     13: 0.00
>     14: 0.00
>     15: 0.00
>     16: 0.00
>     17: 0.00
>     18: 0.00
>     19: 0.00
>     20: 0.00
>     21: 0.00
>     22: 0.00
>     23: 0.00
>     24: 0.00
>     25: 0.00

I agree with Derrick here: if you just said that all these report
0.00, I would have taken your word for it.

Having said that...  I don't know how to get more decimal places out
of /usr/bin/time, but our trace performance facility uses nanosecond
resolution timestamps.  So using this command in the loop above:

  GIT_TRACE_PERFORMANCE=2 git status --ignored 2>&1 >/dev/null |
    sed -n -e "s/.* performance: \(.*\): git command.*/$depth: \1/p"

gave me this:

  1: 0.000574302 s
  2: 0.000584995 s
  3: 0.000608684 s
  4: 0.000951336 s
  5: 0.000762019 s
  6: 0.000816685 s
  7: 0.000672516 s
  8: 0.000912628 s
  9: 0.000661538 s
  10: 0.000687465 s
  11: 0.000708880 s
  12: 0.000693754 s
  13: 0.000726120 s
  14: 0.000737334 s
  15: 0.000787362 s
  16: 0.000856687 s
  17: 0.000780892 s
  18: 0.000790798 s
  19: 0.000834411 s
  20: 0.000859094 s
  21: 0.001230912 s
  22: 0.001048852 s
  23: 0.000891057 s
  24: 0.000934097 s
  25: 0.001051704 s

Not sure it's worth including, though.
Elijah Newren Jan. 31, 2020, 5:47 p.m. UTC | #5
On Fri, Jan 31, 2020 at 9:13 AM SZEDER Gábor <szeder.dev@gmail.com> wrote:
>
> On Wed, Jan 29, 2020 at 10:03:42PM +0000, Elijah Newren via GitGitGadget wrote:
> > Part of the tension noted above is that the treatment of a directory can
> > changed based on the files within it, and based on the various settings
>
> s/changed/change/, or perhaps s/changed/be changed/ ?
>
> > Since dir.c is somewhat complex, extra cruft built up around this over
> > time.  While trying to unravel it, I noticed several instances where the
> > first call to read_directory_recursive() would return e.g.
> > path_untracked for a some directory and a later one would return e.g.
>
> s/for a some/for some/
>
> > However, on the positive side, it does make the code much faster.  For
> > the following simple shell loop in an empty repository:
> >
> >   for depth in $(seq 10 25)
> >   do
> >     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
> >     rm -rf dir
> >     mkdir -p $dirs
> >     >$dirs/untracked-file
> >     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
> >   done
> >
> > I saw the following timings, in seconds (note that the numbers are a
> > little noisy from run-to-run, but the trend is very clear with every
> > run):
> >
> >     10: 0.03
> >     11: 0.05
> >     12: 0.08
> >     13: 0.19
> >     14: 0.29
> >     15: 0.50
> >     16: 1.05
> >     17: 2.11
> >     18: 4.11
> >     19: 8.60
> >     20: 17.55
> >     21: 33.87
> >     22: 68.71
> >     23: 140.05
> >     24: 274.45
> >     25: 551.15
> >
> > After this fix, those drop to:
> >
> >     10: 0.00
> >     11: 0.00
> >     12: 0.00
> >     13: 0.00
> >     14: 0.00
> >     15: 0.00
> >     16: 0.00
> >     17: 0.00
> >     18: 0.00
> >     19: 0.00
> >     20: 0.00
> >     21: 0.00
> >     22: 0.00
> >     23: 0.00
> >     24: 0.00
> >     25: 0.00
>
> I agree with Derrick here: if you just said that all these report
> 0.00, I would have taken your word for it.

Thanks, I'll include all these fixes.  Good timing too, as I was about
to send a re-roll.

> Having said that...  I don't know how to get more decimal places out
> of /usr/bin/time, but our trace performance facility uses nanosecond
> resolution timestamps.  So using this command in the loop above:
>
>   GIT_TRACE_PERFORMANCE=2 git status --ignored 2>&1 >/dev/null |
>     sed -n -e "s/.* performance: \(.*\): git command.*/$depth: \1/p"
>
> gave me this:
>
>   1: 0.000574302 s
>   2: 0.000584995 s
>   3: 0.000608684 s
>   4: 0.000951336 s
>   5: 0.000762019 s
>   6: 0.000816685 s
>   7: 0.000672516 s
>   8: 0.000912628 s
>   9: 0.000661538 s
>   10: 0.000687465 s
>   11: 0.000708880 s
>   12: 0.000693754 s
>   13: 0.000726120 s
>   14: 0.000737334 s
>   15: 0.000787362 s
>   16: 0.000856687 s
>   17: 0.000780892 s
>   18: 0.000790798 s
>   19: 0.000834411 s
>   20: 0.000859094 s
>   21: 0.001230912 s
>   22: 0.001048852 s
>   23: 0.000891057 s
>   24: 0.000934097 s
>   25: 0.001051704 s
>
> Not sure it's worth including, though.

Yeah, I'm afraid people will spend time trying to analyze it, and the
numbers are extremely noisy.  I instead included some words about
counting the number of untracked directories opened according to
strace, which shows that before the fix 2^(1+$depth)-2 untracked
directories were opened, and after it exactly $depth are opened.
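To make those before/after counts concrete, here is a toy recurrence (my
own sketch, not git code): the assumption, per the discussion above, is
that before the fix both treat_directory() and its caller recursed into
each untracked directory, so every nesting level doubles the traversal
work, while after the fix each directory is visited exactly once.

```python
def opens(depth: int, double_recurse: bool) -> int:
    """Directory opens for a chain of `depth` nested untracked dirs.

    double_recurse=True models the old behavior (two recursive
    descents per level); False models the fixed, linear behavior.
    """
    if depth == 0:
        return 0
    per_level = 2 if double_recurse else 1
    # Each visit of this level triggers a full traversal of the rest.
    return per_level * (1 + opens(depth - 1, double_recurse))

# opens(depth, True)  == 2^(1+depth) - 2   (old, exponential)
# opens(depth, False) == depth             (new, linear)
```

This reproduces the strace numbers quoted in the commit message: at
depth 25 the old code opens 2^26 - 2 directories, the new code opens 25.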

Patch
diff mbox series

diff --git a/dir.c b/dir.c
index ef3307718a..aaf038a9c4 100644
--- a/dir.c
+++ b/dir.c
@@ -1659,7 +1659,13 @@  static enum path_treatment treat_directory(struct dir_struct *dir,
 	const char *dirname, int len, int baselen, int excluded,
 	const struct pathspec *pathspec)
 {
-	int nested_repo;
+	/*
+	 * WARNING: From this function, you can return path_recurse or you
+	 *          can call read_directory_recursive() (or neither), but
+	 *          you CAN'T DO BOTH.
+	 */
+	enum path_treatment state;
+	int nested_repo, old_ignored_nr, stop_early;
 
 	/* The "len-1" is to strip the final '/' */
 	switch (directory_exists_in_index(istate, dirname, len-1)) {
@@ -1713,18 +1719,101 @@  static enum path_treatment treat_directory(struct dir_struct *dir,
 
 	/* This is the "show_other_directories" case */
 
-	if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
+	/*
+	 * We only need to recurse into untracked/ignored directories if
+	 * either of the following bits is set:
+	 *   - DIR_SHOW_IGNORED_TOO (because then we need to determine if
+	 *                           there are ignored directories below)
+	 *   - DIR_HIDE_EMPTY_DIRECTORIES (because we have to determine if
+	 *                                 the directory is empty)
+	 */
+	if (!(dir->flags & (DIR_SHOW_IGNORED_TOO | DIR_HIDE_EMPTY_DIRECTORIES)))
 		return excluded ? path_excluded : path_untracked;
 
+	/*
+	 * If we only want to determine if dirname is empty, then we can
+	 * stop at the first file we find underneath that directory rather
+	 * than continuing to recurse beyond it.  If DIR_SHOW_IGNORED_TOO
+	 * is set, then we want MORE than just determining if dirname is
+	 * empty.
+	 */
+	stop_early = ((dir->flags & DIR_HIDE_EMPTY_DIRECTORIES) &&
+		      !(dir->flags & DIR_SHOW_IGNORED_TOO));
+
+	/*
+	 * If /every/ file within an untracked directory is ignored, then
+	 * we want to treat the directory as ignored (for e.g. status
+	 * --porcelain), without listing the individual ignored files
+	 * underneath.  To do so, we'll save the current ignored_nr, and
+	 * pop all the ones added after it if it turns out the entire
+	 * directory is ignored.
+	 */
+	old_ignored_nr = dir->ignored_nr;
+
+	/* Actually recurse into dirname now, we'll fixup the state later. */
 	untracked = lookup_untracked(dir->untracked, untracked,
 				     dirname + baselen, len - baselen);
+	state = read_directory_recursive(dir, istate, dirname, len, untracked,
+					 stop_early, stop_early, pathspec);
+
+	/* There are a variety of reasons we may need to fixup the state... */
+	if (state == path_excluded) {
+		int i;
+
+		/*
+		 * When stop_early is set, read_directory_recursive() will
+		 * never return path_untracked regardless of whether
+		 * underlying paths were untracked or ignored (because
+		 * returning early means it excluded some paths, or
+		 * something like that -- see commit 5aaa7fd39aaf ("Improve
+		 * performance of git status --ignored", 2017-09-18)).
+		 * However, we're not really concerned with the status of
+		 * files under the directory, we just wanted to know
+		 * whether the directory was empty (state == path_none) or
+		 * not (state == path_excluded), and if not, we'd return
+		 * our original status based on whether the untracked
+		 * directory matched an exclusion pattern.
+		 */
+		if (stop_early)
+			state = excluded ? path_excluded : path_untracked;
+
+		else {
+			/*
+			 * When
+			 *     !stop_early && state == path_excluded
+			 * then all paths under dirname were ignored.  For
+			 * this case, git status --porcelain wants to just
+			 * list the directory itself as ignored and not
+			 * list the individual paths underneath.  Remove
+			 * the individual paths underneath.
+			 */
+			for (i = old_ignored_nr + 1; i<dir->ignored_nr; ++i)
+				free(dir->ignored[i]);
+			dir->ignored_nr = old_ignored_nr;
+		}
+	}
 
 	/*
-	 * If this is an excluded directory, then we only need to check if
-	 * the directory contains any files.
+	 * If there is nothing under the current directory and we are not
+	 * hiding empty directories, then we need to report on the
+	 * untracked or ignored status of the directory itself.
 	 */
-	return read_directory_recursive(dir, istate, dirname, len,
-					untracked, 1, excluded, pathspec);
+	if (state == path_none && !(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
+		state = excluded ? path_excluded : path_untracked;
+
+	/*
+	 * We can recurse into untracked directories that don't match any
+	 * of the given pathspecs when some file underneath the directory
+	 * might match one of the pathspecs.  If so, we should make sure
+	 * to note that the directory itself did not match.
+	 */
+	if (pathspec &&
+	    !match_pathspec(istate, pathspec, dirname, len,
+			    0 /* prefix */, NULL,
+			    0 /* do NOT special case dirs */))
+		state = path_none;
+
+	return state;
 }
 
 /*
@@ -1872,6 +1961,11 @@  static enum path_treatment treat_path_fast(struct dir_struct *dir,
 					   int baselen,
 					   const struct pathspec *pathspec)
 {
+	/*
+	 * WARNING: From this function, you can return path_recurse or you
+	 *          can call read_directory_recursive() (or neither), but
+	 *          you CAN'T DO BOTH.
+	 */
 	strbuf_setlen(path, baselen);
 	if (!cdir->ucd) {
 		strbuf_addstr(path, cdir->file);
@@ -2177,14 +2271,10 @@  static enum path_treatment read_directory_recursive(struct dir_struct *dir,
 	int stop_at_first_file, const struct pathspec *pathspec)
 {
 	/*
-	 * WARNING WARNING WARNING:
-	 *
-	 * Any updates to the traversal logic here may need corresponding
-	 * updates in treat_leading_path().  See the commit message for the
-	 * commit adding this warning as well as the commit preceding it
-	 * for details.
+	 * WARNING: Do NOT call recurse unless path_recurse is returned
+	 *          from treat_path().  Recursing on any other return value
+	 *          results in exponential slowdown.
 	 */
-
 	struct cached_dir cdir;
 	enum path_treatment state, subdir_state, dir_state = path_none;
 	struct strbuf path = STRBUF_INIT;
@@ -2206,13 +2296,7 @@  static enum path_treatment read_directory_recursive(struct dir_struct *dir,
 			dir_state = state;
 
 		/* recurse into subdir if instructed by treat_path */
-		if ((state == path_recurse) ||
-			((state == path_untracked) &&
-			 (resolve_dtype(cdir.d_type, istate, path.buf, path.len) == DT_DIR) &&
-			 ((dir->flags & DIR_SHOW_IGNORED_TOO) ||
-			  (pathspec &&
-			   do_match_pathspec(istate, pathspec, path.buf, path.len,
-					     baselen, NULL, DO_MATCH_LEADING_PATHSPEC) == MATCHED_RECURSIVELY_LEADING_PATHSPEC)))) {
+		if (state == path_recurse) {
 			struct untracked_cache_dir *ud;
 			ud = lookup_untracked(dir->untracked, untracked,
 					      path.buf + baselen,
@@ -2296,15 +2380,6 @@  static int treat_leading_path(struct dir_struct *dir,
 			      const char *path, int len,
 			      const struct pathspec *pathspec)
 {
-	/*
-	 * WARNING WARNING WARNING:
-	 *
-	 * Any updates to the traversal logic here may need corresponding
-	 * updates in read_directory_recursive().  See 777b420347 (dir:
-	 * synchronize treat_leading_path() and read_directory_recursive(),
-	 * 2019-12-19) and its parent commit for details.
-	 */
-
 	struct strbuf sb = STRBUF_INIT;
 	struct strbuf subdir = STRBUF_INIT;
 	int prevlen, baselen;
@@ -2355,23 +2430,7 @@  static int treat_leading_path(struct dir_struct *dir,
 		strbuf_reset(&subdir);
 		strbuf_add(&subdir, path+prevlen, baselen-prevlen);
 		cdir.d_name = subdir.buf;
-		state = treat_path(dir, NULL, &cdir, istate, &sb, prevlen,
-				    pathspec);
-		if (state == path_untracked &&
-		    resolve_dtype(cdir.d_type, istate, sb.buf, sb.len) == DT_DIR &&
-		    (dir->flags & DIR_SHOW_IGNORED_TOO ||
-		     do_match_pathspec(istate, pathspec, sb.buf, sb.len,
-				       baselen, NULL, DO_MATCH_LEADING_PATHSPEC) == MATCHED_RECURSIVELY_LEADING_PATHSPEC)) {
-			if (!match_pathspec(istate, pathspec, sb.buf, sb.len,
-					    0 /* prefix */, NULL,
-					    0 /* do NOT special case dirs */))
-				state = path_none;
-			add_path_to_appropriate_result_list(dir, NULL, &cdir,
-							    istate,
-							    &sb, baselen,
-							    pathspec, state);
-			state = path_recurse;
-		}
+		state = treat_path(dir, NULL, &cdir, istate, &sb, prevlen, pathspec);
 
 		if (state != path_recurse)
 			break; /* do not recurse into it */