diff mbox series

[v3,4/8] t7300: add testcase showing unnecessary traversal into ignored directory

Message ID dc3d3f2471410aa55da4dbc8e2747192888bce5f.1620503945.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Directory traversal fixes | expand

Commit Message

Elijah Newren May 8, 2021, 7:59 p.m. UTC
From: Elijah Newren <newren@gmail.com>

The PNPM package manager is apparently creating deeply nested (but
ignored) directory structures; traversing them is costly
performance-wise, unnecessary, and in some cases is even throwing
warnings/errors because the paths are too long to handle on various
platforms.  Add a testcase that checks for such unnecessary directory
traversal.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t7300-clean.sh | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Junio C Hamano May 10, 2021, 5:28 a.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +test_expect_failure 'avoid traversing into ignored directories' '
> +	test_when_finished rm -f output error trace.* &&
> +	test_create_repo avoid-traversing-deep-hierarchy &&
> +	(
> +		cd avoid-traversing-deep-hierarchy &&
> +
> +		mkdir -p untracked/subdir/with/a &&
> +		>untracked/subdir/with/a/random-file.txt &&
> +
> +		GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.output" \
> +		git clean -ffdxn -e untracked
> +	) &&
> +
> +	grep data.*read_directo.*visited trace.output \
> +		| cut -d "|" -f 9 >trace.relevant &&
> +	cat >trace.expect <<-EOF &&
> +	 directories-visited:1
> +	 paths-visited:4

Are the origins of '1' and '4' trivially obvious to those who are
reading the test, or do these deserve comments?

We create an empty test repository, go there and create a untracked/
hierarchy with a junk file, and tell "clean" that 'untracked' is
"also" in the exclude pattern (but since there is no other exclude
pattern, that is the only one), so everything underneath untracked/
we have no reason to inspect.

So, we do not visit 'untracked' directory.  Which ones do we visit?
Is '1' coming from the top-level of the working tree '.'?  What
about the number of visited paths '4' (the trace is stored outside
this new test repository, so that's not it).

Thanks.

> +	EOF
> +	test_cmp trace.expect trace.relevant
> +'
> +
>  test_done
Elijah Newren May 11, 2021, 5:45 p.m. UTC | #2
On Sun, May 9, 2021 at 10:28 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +test_expect_failure 'avoid traversing into ignored directories' '
> > +     test_when_finished rm -f output error trace.* &&
> > +     test_create_repo avoid-traversing-deep-hierarchy &&
> > +     (
> > +             cd avoid-traversing-deep-hierarchy &&
> > +
> > +             mkdir -p untracked/subdir/with/a &&
> > +             >untracked/subdir/with/a/random-file.txt &&
> > +
> > +             GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.output" \
> > +             git clean -ffdxn -e untracked
> > +     ) &&
> > +
> > +     grep data.*read_directo.*visited trace.output \
> > +             | cut -d "|" -f 9 >trace.relevant &&
> > +     cat >trace.expect <<-EOF &&
> > +      directories-visited:1
> > +      paths-visited:4
>
> Are the origins of '1' and '4' trivially obvious to those who are
> reading the test, or do these deserve comments?
>
> We create an empty test repository, go there and create a untracked/
> hierarchy with a junk file, and tell "clean" that 'untracked' is
> "also" in the exclude pattern (but since there is no other exclude
> pattern, that is the only one), so everything underneath untracked/
> we have no reason to inspect.
>
> So, we do not visit 'untracked' directory.  Which ones do we visit?
> Is '1' coming from the top-level of the working tree '.'?  What
> about the number of visited paths '4' (the trace is stored outside
> this new test repository, so that's not it).

Good points.  I'll make a comment that directories-visited:1 is about
ensuring we only went into the toplevel directory, and I'll removed
the paths-visited check.

But to answer your question, the paths we visit are '.', '..', '.git',
and 'untracked', the first three of which we mark as path_none and
don't recurse into because of special rules for those paths, and the
last of which we shouldn't recurse into since it is ignored.  There
weren't any non-directory files in the toplevel directory, or those
would also be included in the paths-visited count.  A later patch in
the series will fix the code to not recurse into the 'untracked'
directory, fixing this test.
Junio C Hamano May 11, 2021, 10:43 p.m. UTC | #3
Elijah Newren <newren@gmail.com> writes:

> But to answer your question, the paths we visit are '.', '..', '.git',
> and 'untracked', the first three of which we mark as path_none and
> don't recurse into because of special rules for those paths, and the
> last of which we shouldn't recurse into since it is ignored.

Not a hard requirement, but I wish if we entirely ignored "." and
".." in our code (not just not counting, but making whoever calls
readdir() skip and call it again when it gets "." or "..").

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html

seems to imply that readdir() may not give "." or ".." (if dot or
dot-dot exists, you are to return them only once, which implies that
it is perfectly OK for dot or dot-dot to be missing).

So dropping the test for number of visited paths would be nicer from
portability's point of view ;-)

Thanks.
Elijah Newren May 12, 2021, 2:07 a.m. UTC | #4
On Tue, May 11, 2021 at 3:43 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > But to answer your question, the paths we visit are '.', '..', '.git',
> > and 'untracked', the first three of which we mark as path_none and
> > don't recurse into because of special rules for those paths, and the
> > last of which we shouldn't recurse into since it is ignored.
>
> Not a hard requirement, but I wish if we entirely ignored "." and
> ".." in our code (not just not counting, but making whoever calls
> readdir() skip and call it again when it gets "." or "..").
>
>   https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
>
> seems to imply that readdir() may not give "." or ".." (if dot or
> dot-dot exists, you are to return them only once, which implies that
> it is perfectly OK for dot or dot-dot to be missing).


Something like this?

diff --git a/dir.c b/dir.c
index 993a12145f..7f470bc701 100644
--- a/dir.c
+++ b/dir.c
@@ -2341,7 +2341,11 @@ static int read_cached_dir(struct cached_dir *cdir)
        struct dirent *de;

        if (cdir->fdir) {
-               de = readdir(cdir->fdir);
+               while ((de = readdir(cdir->fdir))) {
+                       /* Ignore '.' and '..' by re-looping; handle the rest */
+                       if (!de || !is_dot_or_dotdot(de->d_name))
+                               break;
+               }
                if (!de) {
                        cdir->d_name = NULL;
                        cdir->d_type = DT_UNKNOWN;

It appears that the other two callers of readdir() in dir.c, namely in
is_empty_dir() and remove_dir_recurse() already have such special
repeat-if-is_dot_or_dotdot() logic built into them, so this was
partially lifted from those.

If you'd like, I can add another patch in the series with this change
so that all readdir() calls in dir.c have such ignore '.' and '..'
logic.  Or, we could perhaps introduce a new readdir() wrapper that
does nothing other than ignore '.' and '..' and have all three of
these callsites use that new wrapper.

> So dropping the test for number of visited paths would be nicer from
> portability's point of view ;-)

Yep, makes sense.  I already did that in v4, which means it'll
continue to pass with or without the above proposed change to
read_cached_dir().
Junio C Hamano May 12, 2021, 3:17 a.m. UTC | #5
Elijah Newren <newren@gmail.com> writes:

> If you'd like, I can add another patch in the series with this change
> so that all readdir() calls in dir.c have such ignore '.' and '..'
> logic.  Or, we could perhaps introduce a new readdir() wrapper that
> does nothing other than ignore '.' and '..' and have all three of
> these callsites use that new wrapper.

Yeah, it is good to be consistent (either implementation).

>> So dropping the test for number of visited paths would be nicer from
>> portability's point of view ;-)
>
> Yep, makes sense.  I already did that in v4, which means it'll
> continue to pass with or without the above proposed change to
> read_cached_dir().

Yup.
diff mbox series

Patch

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index a74816ca8b46..b7c9898fac5b 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -746,4 +746,26 @@  test_expect_success 'clean untracked paths by pathspec' '
 	test_must_be_empty actual
 '
 
+test_expect_failure 'avoid traversing into ignored directories' '
+	test_when_finished rm -f output error trace.* &&
+	test_create_repo avoid-traversing-deep-hierarchy &&
+	(
+		cd avoid-traversing-deep-hierarchy &&
+
+		mkdir -p untracked/subdir/with/a &&
+		>untracked/subdir/with/a/random-file.txt &&
+
+		GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.output" \
+		git clean -ffdxn -e untracked
+	) &&
+
+	grep data.*read_directo.*visited trace.output \
+		| cut -d "|" -f 9 >trace.relevant &&
+	cat >trace.expect <<-EOF &&
+	 directories-visited:1
+	 paths-visited:4
+	EOF
+	test_cmp trace.expect trace.relevant
+'
+
 test_done