diff mbox series

[v3,6/8] grep: add repository to OID grep sources

Message ID f362fc278cc2d0291c24097a75f8cfdfe88c27d6.1629148153.git.jonathantanmy@google.com (mailing list archive)
State Accepted
Commit 0693806bf82fb76347e226d8fc5e69077c0a3df5
Headers show
Series In grep, no adding submodule ODB as alternates | expand

Commit Message

Jonathan Tan Aug. 16, 2021, 9:09 p.m. UTC
Record the repository whenever an OID grep source is created, and teach
the worker threads to explicitly provide the repository when accessing
objects.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/grep.c | 15 ++++++---------
 grep.c         |  7 +++++--
 grep.h         | 17 ++++++++++++++++-
 3 files changed, 27 insertions(+), 12 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Sept. 27, 2021, 12:08 p.m. UTC | #1
On Mon, Aug 16 2021, Jonathan Tan wrote:

> Record the repository whenever an OID grep source is created, and teach
> the worker threads to explicitly provide the repository when accessing
> objects.
> [...]
> diff --git a/grep.h b/grep.h
> index 480b3f5bba..128007db65 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -120,7 +120,20 @@ struct grep_opt {
>  	struct grep_pat *header_list;
>  	struct grep_pat **header_tail;
>  	struct grep_expr *pattern_expression;
> +
> +	/*
> +	 * NEEDSWORK: See if we can remove this field, because the repository
> +	 * should probably be per-source. That is, grep.c functions using this
> +	 * field should probably start using "repo" in "struct grep_source"
> +	 * instead.
> +	 *
> +	 * This is potentially the cause of at least one bug - "git grep"
> +	 * ignoring the textconv attributes from submodules. See [1] for more
> +	 * information.
> +	 * [1] https://lore.kernel.org/git/CAHd-oW5iEQarYVxEXoTG-ua2zdoybTrSjCBKtO0YT292fm0NQQ@mail.gmail.com/
> +	 */
>  	struct repository *repo;
> +

I ran into this comment and read the linked E-Mail, and then the
downthread
https://lore.kernel.org/git/CAHd-oW6uG1fap-T4UF17bJmjoHAqWCDq9KbY+_8a3cEnnfATxg@mail.gmail.com/;

Given Matheus's "I've somehow missed this guard and the..." there I'm
not quite sure what/if we should be doing here & what this comment is
recommending? I.e. do we still need to adjust the call chains as noted
in the E-Mail the comment links to, or not?
Matheus Tavares Sept. 27, 2021, 4:45 p.m. UTC | #2
On Mon, Sep 27, 2021 at 9:09 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Mon, Aug 16 2021, Jonathan Tan wrote:
>
> > Record the repository whenever an OID grep source is created, and teach
> > the worker threads to explicitly provide the repository when accessing
> > objects.
> > [...]
> > diff --git a/grep.h b/grep.h
> > index 480b3f5bba..128007db65 100644
> > --- a/grep.h
> > +++ b/grep.h
> > @@ -120,7 +120,20 @@ struct grep_opt {
> >       struct grep_pat *header_list;
> >       struct grep_pat **header_tail;
> >       struct grep_expr *pattern_expression;
> > +
> > +     /*
> > +      * NEEDSWORK: See if we can remove this field, because the repository
> > +      * should probably be per-source. That is, grep.c functions using this
> > +      * field should probably start using "repo" in "struct grep_source"
> > +      * instead.
> > +      *
> > +      * This is potentially the cause of at least one bug - "git grep"
> > +      * ignoring the textconv attributes from submodules. See [1] for more
> > +      * information.
> > +      * [1] https://lore.kernel.org/git/CAHd-oW5iEQarYVxEXoTG-ua2zdoybTrSjCBKtO0YT292fm0NQQ@mail.gmail.com/
> > +      */
> >       struct repository *repo;
> > +
>
> I ran into this comment and read the linked E-Mail, and then the
> downthread
> https://lore.kernel.org/git/CAHd-oW6uG1fap-T4UF17bJmjoHAqWCDq9KbY+_8a3cEnnfATxg@mail.gmail.com/;
>
> Given Matheus's "I've somehow missed this guard and the..." there I'm
> not quite sure what/if we should be doing here & what this comment is
> recommending? I.e. do we still need to adjust the call chains as noted
> in the E-Mail the comment links to, or not?

I think we should still adjust the call chains, yes. The downthread
message you mentioned is kind of a tangent about performance, where
Junio helped me understand something I had previously missed in the
code, regarding the persistence of the attributes stack.

But the issue that started the thread was about a correctness problem:
the superproject textconv attributes are being used on submodules'
files when running `git grep` with `--recurse-submodules --textconv`.
The three cases to consider are:

- .gitattributes from the working tree
- .gitattributes from the index
- .git/info/attributes

On all these cases, the superproject attributes are being used on the
submodule. Additionally, if the superproject does not define any
attribute, the submodule attributes are being ignored in all cases
except by the first one (but that is only because the code sees the
.gitattributes file on the submodule as if it were a "regular"
subdirectory of the surperproject. So the submodule's .gitattribures
takes higher precedence when evaluating the attributes for files in
that directory).

Another issue is that the textconv cache is always saved to (and read
from) the superproject gitdir, even for submodules' files.

Here are some test cases that demonstrate these issues:

-- snipsnap --
diff --git a/t/t7814-grep-recurse-submodules.sh b/t/t7814-grep-recurse-submodules.sh
index 3172f5b936..d01a3bc5d8 100755
--- a/t/t7814-grep-recurse-submodules.sh
+++ b/t/t7814-grep-recurse-submodules.sh
@@ -441,4 +441,104 @@ test_expect_success 'grep --recurse-submodules with --cached ignores worktree mo
 	test_must_fail git grep --recurse-submodules --cached "A modified line in submodule" >actual 2>&1 &&
 	test_must_be_empty actual
 '
+
+test_expect_failure 'grep --textconv: superproject .gitattributes does not affect submodules' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >.gitattributes &&
+
+	cat >expect <<-\EOF &&
+	a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv: superproject .gitattributes (from index) does not affect submodules' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >.gitattributes &&
+	git add .gitattributes &&
+	rm .gitattributes &&
+
+	cat >expect <<-\EOF &&
+	a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv: superproject .git/info/attributes does not affect submodules' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	super_attr="$(git rev-parse --path-format=relative --git-path info/attributes)" &&
+	test_when_finished rm -f "$super_attr" &&
+	echo "a diff=d2x" >"$super_attr" &&
+
+	cat >expect <<-\EOF &&
+	a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --textconv corectly reads submodule .gitattributes' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >submodule/.gitattributes &&
+
+	cat >expect <<-\EOF &&
+	submodule/a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv corectly reads submodule .gitattributes (from index)' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >submodule/.gitattributes &&
+	git -C submodule add .gitattributes &&
+	rm submodule/.gitattributes &&
+
+	cat >expect <<-\EOF &&
+	submodule/a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv corectly reads submodule .git/info/attributes' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+
+	# Workaround: we use --path-format=relative because the absolute path
+	# contains whitespaces and that seems to confuse test_when_finished
+	#
+	submodule_attr="submodule/$(git -C submodule rev-parse --path-format=relative --git-path info/attributes)" &&
+	test_when_finished rm -f "$submodule_attr" &&
+	echo "a diff=d2x" >"$submodule_attr" &&
+
+	cat >expect <<-\EOF &&
+	submodule/a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep saves textconv cache in the appropriated repository' '
+	reset_and_clean &&
+	test_config_global diff.d2x_cached.textconv "sed -e \"s/d/x/\"" &&
+	test_config_global diff.d2x_cached.cachetextconv true &&
+	echo "a diff=d2x_cached" >submodule/.gitattributes &&
+
+	# Note: we only read/write to the textconv cache when grepping from an
+	# OID as the working tree file might have modifications. That is why
+	# we use --cached here.
+	#
+	git grep --textconv --cached --recurse-submodules x &&
+	test_path_is_missing "$(git rev-parse --git-path refs/notes/textconv/d2x_cached)" &&
+	test_path_is_file "$(git -C submodule rev-parse --git-path refs/notes/textconv/d2x_cached)"
+'
+
 test_done
--

Junio seemed to agree that the behavior I described above is not the
correct one:

"None of the attributes defined in the superproject
should affect the paths in the submodule, as it is a totally
separate project, oblivious to the existence of enclosing the
superproject."

But he raised an important concern regarding how to fix this without
affecting [too much] performance:

"As there is only one attribute cache IIUC, invalidating
the whole cache for the top-level and replacing it with the one for
a submodule, every time we cross the module boundary, would probably
have a negative effect on the performance, and I am not sure what
would happen if you run more than one threads working in different
repositories (i.e. top-level and submodules)."

Maybe we would need a different attributes stack for each repository?
Ævar Arnfjörð Bjarmason Sept. 27, 2021, 5:30 p.m. UTC | #3
On Mon, Sep 27 2021, Matheus Tavares wrote:

> On Mon, Sep 27, 2021 at 9:09 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>>
>> On Mon, Aug 16 2021, Jonathan Tan wrote:
>>
>> > Record the repository whenever an OID grep source is created, and teach
>> > the worker threads to explicitly provide the repository when accessing
>> > objects.
>> > [...]
>> > diff --git a/grep.h b/grep.h
>> > index 480b3f5bba..128007db65 100644
>> > --- a/grep.h
>> > +++ b/grep.h
>> > @@ -120,7 +120,20 @@ struct grep_opt {
>> >       struct grep_pat *header_list;
>> >       struct grep_pat **header_tail;
>> >       struct grep_expr *pattern_expression;
>> > +
>> > +     /*
>> > +      * NEEDSWORK: See if we can remove this field, because the repository
>> > +      * should probably be per-source. That is, grep.c functions using this
>> > +      * field should probably start using "repo" in "struct grep_source"
>> > +      * instead.
>> > +      *
>> > +      * This is potentially the cause of at least one bug - "git grep"
>> > +      * ignoring the textconv attributes from submodules. See [1] for more
>> > +      * information.
>> > +      * [1] https://lore.kernel.org/git/CAHd-oW5iEQarYVxEXoTG-ua2zdoybTrSjCBKtO0YT292fm0NQQ@mail.gmail.com/
>> > +      */
>> >       struct repository *repo;
>> > +
>>
>> I ran into this comment and read the linked E-Mail, and then the
>> downthread
>> https://lore.kernel.org/git/CAHd-oW6uG1fap-T4UF17bJmjoHAqWCDq9KbY+_8a3cEnnfATxg@mail.gmail.com/;
>>
>> Given Matheus's "I've somehow missed this guard and the..." there I'm
>> not quite sure what/if we should be doing here & what this comment is
>> recommending? I.e. do we still need to adjust the call chains as noted
>> in the E-Mail the comment links to, or not?
>
> I think we should still adjust the call chains, yes. The downthread
> message you mentioned is kind of a tangent about performance, where
> Junio helped me understand something I had previously missed in the
> code, regarding the persistence of the attributes stack.
>
> But the issue that started the thread was about a correctness problem:
> the superproject textconv attributes are being used on submodules'
> files when running `git grep` with `--recurse-submodules --textconv`.
> The three cases to consider are:
>
> - .gitattributes from the working tree
> - .gitattributes from the index
> - .git/info/attributes
>
> On all these cases, the superproject attributes are being used on the
> submodule. Additionally, if the superproject does not define any
> attribute, the submodule attributes are being ignored in all cases
> except by the first one (but that is only because the code sees the
> .gitattributes file on the submodule as if it were a "regular"
> subdirectory of the surperproject. So the submodule's .gitattribures
> takes higher precedence when evaluating the attributes for files in
> that directory).
>
> Another issue is that the textconv cache is always saved to (and read
> from) the superproject gitdir, even for submodules' files.
>
> Here are some test cases that demonstrate these issues:
>
> -- snipsnap --
> diff --git a/t/t7814-grep-recurse-submodules.sh b/t/t7814-grep-recurse-submodules.sh
> index 3172f5b936..d01a3bc5d8 100755
> --- a/t/t7814-grep-recurse-submodules.sh
> +++ b/t/t7814-grep-recurse-submodules.sh
> @@ -441,4 +441,104 @@ test_expect_success 'grep --recurse-submodules with --cached ignores worktree mo
>  	test_must_fail git grep --recurse-submodules --cached "A modified line in submodule" >actual 2>&1 &&
>  	test_must_be_empty actual
>  '
> +
> +test_expect_failure 'grep --textconv: superproject .gitattributes does not affect submodules' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >.gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv: superproject .gitattributes (from index) does not affect submodules' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >.gitattributes &&
> +	git add .gitattributes &&
> +	rm .gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv: superproject .git/info/attributes does not affect submodules' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	super_attr="$(git rev-parse --path-format=relative --git-path info/attributes)" &&
> +	test_when_finished rm -f "$super_attr" &&
> +	echo "a diff=d2x" >"$super_attr" &&
> +
> +	cat >expect <<-\EOF &&
> +	a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --textconv corectly reads submodule .gitattributes' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >submodule/.gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	submodule/a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv corectly reads submodule .gitattributes (from index)' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >submodule/.gitattributes &&
> +	git -C submodule add .gitattributes &&
> +	rm submodule/.gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	submodule/a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv corectly reads submodule .git/info/attributes' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +
> +	# Workaround: we use --path-format=relative because the absolute path
> +	# contains whitespaces and that seems to confuse test_when_finished
> +	#
> +	submodule_attr="submodule/$(git -C submodule rev-parse --path-format=relative --git-path info/attributes)" &&
> +	test_when_finished rm -f "$submodule_attr" &&
> +	echo "a diff=d2x" >"$submodule_attr" &&
> +
> +	cat >expect <<-\EOF &&
> +	submodule/a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep saves textconv cache in the appropriated repository' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x_cached.textconv "sed -e \"s/d/x/\"" &&
> +	test_config_global diff.d2x_cached.cachetextconv true &&
> +	echo "a diff=d2x_cached" >submodule/.gitattributes &&
> +
> +	# Note: we only read/write to the textconv cache when grepping from an
> +	# OID as the working tree file might have modifications. That is why
> +	# we use --cached here.
> +	#
> +	git grep --textconv --cached --recurse-submodules x &&
> +	test_path_is_missing "$(git rev-parse --git-path refs/notes/textconv/d2x_cached)" &&
> +	test_path_is_file "$(git -C submodule rev-parse --git-path refs/notes/textconv/d2x_cached)"
> +'
> +
>  test_done

Thanks! I think it would be very good to have these tests in-tree along
with an updated comment pointing to them.
diff mbox series

Patch

diff --git a/builtin/grep.c b/builtin/grep.c
index fa7fd08150..51278b01fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -349,7 +349,7 @@  static int grep_oid(struct grep_opt *opt, const struct object_id *oid,
 	struct grep_source gs;
 
 	grep_source_name(opt, filename, tree_name_len, &pathbuf);
-	grep_source_init_oid(&gs, pathbuf.buf, path, oid);
+	grep_source_init_oid(&gs, pathbuf.buf, path, oid, opt->repo);
 	strbuf_release(&pathbuf);
 
 	if (num_threads > 1) {
@@ -462,14 +462,11 @@  static int grep_submodule(struct grep_opt *opt,
 	repo_read_gitmodules(subrepo, 0);
 
 	/*
-	 * NEEDSWORK: This adds the submodule's object directory to the list of
-	 * alternates for the single in-memory object store.  This has some bad
-	 * consequences for memory (processed objects will never be freed) and
-	 * performance (this increases the number of pack files git has to pay
-	 * attention to, to the sum of the number of pack files in all the
-	 * repositories processed so far).  This can be removed once the object
-	 * store is no longer global and instead is a member of the repository
-	 * object.
+	 * All code paths tested by test code no longer need submodule ODBs to
+	 * be added as alternates, but add it to the list just in case.
+	 * Submodule ODBs added through add_submodule_odb_by_path() will be
+	 * lazily registered as alternates when needed (and except in an
+	 * unexpected code interaction, it won't be needed).
 	 */
 	add_submodule_odb_by_path(subrepo->objects->odb->path);
 	obj_read_unlock();
diff --git a/grep.c b/grep.c
index 8a8105c2eb..79598f245f 100644
--- a/grep.c
+++ b/grep.c
@@ -1863,7 +1863,8 @@  void grep_source_init_file(struct grep_source *gs, const char *name,
 }
 
 void grep_source_init_oid(struct grep_source *gs, const char *name,
-			  const char *path, const struct object_id *oid)
+			  const char *path, const struct object_id *oid,
+			  struct repository *repo)
 {
 	gs->type = GREP_SOURCE_OID;
 	gs->name = xstrdup_or_null(name);
@@ -1872,6 +1873,7 @@  void grep_source_init_oid(struct grep_source *gs, const char *name,
 	gs->size = 0;
 	gs->driver = NULL;
 	gs->identifier = oiddup(oid);
+	gs->repo = repo;
 }
 
 void grep_source_clear(struct grep_source *gs)
@@ -1900,7 +1902,8 @@  static int grep_source_load_oid(struct grep_source *gs)
 {
 	enum object_type type;
 
-	gs->buf = read_object_file(gs->identifier, &type, &gs->size);
+	gs->buf = repo_read_object_file(gs->repo, gs->identifier, &type,
+					&gs->size);
 	if (!gs->buf)
 		return error(_("'%s': unable to read %s"),
 			     gs->name,
diff --git a/grep.h b/grep.h
index 480b3f5bba..128007db65 100644
--- a/grep.h
+++ b/grep.h
@@ -120,7 +120,20 @@  struct grep_opt {
 	struct grep_pat *header_list;
 	struct grep_pat **header_tail;
 	struct grep_expr *pattern_expression;
+
+	/*
+	 * NEEDSWORK: See if we can remove this field, because the repository
+	 * should probably be per-source. That is, grep.c functions using this
+	 * field should probably start using "repo" in "struct grep_source"
+	 * instead.
+	 *
+	 * This is potentially the cause of at least one bug - "git grep"
+	 * ignoring the textconv attributes from submodules. See [1] for more
+	 * information.
+	 * [1] https://lore.kernel.org/git/CAHd-oW5iEQarYVxEXoTG-ua2zdoybTrSjCBKtO0YT292fm0NQQ@mail.gmail.com/
+	 */
 	struct repository *repo;
+
 	const char *prefix;
 	int prefix_length;
 	regex_t regexp;
@@ -187,6 +200,7 @@  struct grep_source {
 		GREP_SOURCE_BUF,
 	} type;
 	void *identifier;
+	struct repository *repo; /* if GREP_SOURCE_OID */
 
 	char *buf;
 	unsigned long size;
@@ -198,7 +212,8 @@  struct grep_source {
 void grep_source_init_file(struct grep_source *gs, const char *name,
 			   const char *path);
 void grep_source_init_oid(struct grep_source *gs, const char *name,
-			  const char *path, const struct object_id *oid);
+			  const char *path, const struct object_id *oid,
+			  struct repository *repo);
 void grep_source_clear_data(struct grep_source *gs);
 void grep_source_clear(struct grep_source *gs);
 void grep_source_load_driver(struct grep_source *gs,